Biography

My name is Artjom Kurapov and I was born in Tallinn. I spent most of my youth in Ukraine. By fifth grade I had returned, continued my studies and received Estonian citizenship. I am a professional web developer, which means that I do it for a living, not that I am a guru. I specialize in PHP, relational databases (Oracle, MySQL, PostgreSQL) and JavaScript (Ajax, Prototype). I am an amateur in photography, sci-fi and blogging.

Among the memorable dates I would pick:

  • My birth on 11 January 1985 at 04:15, Capricorn.
  • My studies until 2002 at Tallinn Central Russian Gymnasium (that is what we call high school and college in one). I took part in some math olympiads in 2000 and visited Tartu.
  • In the summer of 2002 I started studying at Tallinn Technical University, at the chair of informatics, specializing in computer and system engineering.
  • In the summer of 2004 I received my driver's licence.
  • In December 2004 I decided that working would also be nice; that is how I joined Mikare.net as a web developer.
  • On 13 July 2005, after the medical commission, I started serving in the army. After training in the communications battalion I received a driver's licence for trucks and military personnel transport, and later I was assigned to the military support center.
  • On 13 June 2006 I was demobilized, and the next day I left Mikare Baltic for a new place, Web Expert, in the position of senior web developer.
  • On 18 May 2007 I left Web Expert and went to Elitec.
  • On 11 January 2007 I finished my last exam.
  • On 11 June 2007 I received a "B" for my bachelor's diploma work on the topic "Agile web crawler: Design and implementation".
  • On 17 June 2007 I was baptised in the Russian Orthodox Pühtitsa Convent.
  • On 17 August 2007 I married a wonderful pianist, Dina.
  • As of September 2007 I am studying at Tallinn Technical University for a master's degree in information technology.

Agile web crawler: design and implementation. Artjom Kurapov, BSc work.

Introduction

The Internet has over 90 million domains [1] and over 70 million personal blogs [2], which are viewed by over 1 billion people around the world [3]. In this vast space of information it is important to create order, even on a less than global scale. This may be done by building either a classification catalog or a search engine. Both require a web-crawling tool to ease the burden of manual data processing. How such a tool can be built is the question this work addresses.

Deciding on the scale of the architecture is the first step. To meet robustness requirements, data management functions such as link extraction can be divided among system components: crawler, indexer, ranker and lexicon. For example, Google has over 200 000 servers [4], and dividing the processing and network load is essential for building fast and reliable computation processes.
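As an illustration of one such function, link extraction can be sketched in a few lines. This is only a minimal sketch in Python, not the PHP code of the thesis; the names `LinkExtractor` and `extract_links` are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Skip non-web schemes (mailto:, javascript:) and duplicates.
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique outgoing links of one page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

In a divided architecture a function like this would belong to the crawler component, while the indexer and ranker would work only on the stored results.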

The way a crawler interprets data, and how that data is used afterwards, leads to the following types:

  • Knowledge-focused crawlers classify pages by topic, extract only relevant data, use data mining filters and interact with data more like humans do.
  • Freshness crawlers extract data fast and regularly, analyze what has changed and adjust to the update frequency. This approach is used by news and RSS feed crawlers.
  • Experience crawlers focus on making a full copy of each document and storing it for a long time. Good examples are the Internet Archive [5] and backup services.
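How a freshness crawler might adjust to update frequency can be sketched as follows. This is an assumed policy for illustration only (halve the revisit interval when a page has changed, double it when it has not), not a method taken from this work; the function names and the interval bounds are hypothetical.

```python
import hashlib

MIN_INTERVAL = 60 * 60            # one hour, assumed lower bound (seconds)
MAX_INTERVAL = 30 * 24 * 60 * 60  # thirty days, assumed upper bound (seconds)


def content_hash(body: bytes) -> str:
    """Fingerprint of a page body, used to detect changes between visits."""
    return hashlib.sha256(body).hexdigest()


def next_interval(current_interval, old_hash, new_body):
    """Return (new revisit interval in seconds, new content hash).

    Changed page   -> visit twice as often.
    Unchanged page -> visit half as often.
    First visit (old_hash is None) -> keep the current interval.
    """
    new_hash = content_hash(new_body)
    if old_hash is None:
        interval = current_interval
    elif new_hash != old_hash:
        interval = current_interval // 2
    else:
        interval = current_interval * 2
    # Clamp so that a page is never hammered and never forgotten.
    interval = max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
    return interval, new_hash
```

A byte-level hash treats any change, even a rotating advertisement, as an update. A real crawler would probably compare only the extracted content.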

In the practical part of this thesis a general crawler is designed. It incorporates basic methods from these types, but its main aspect is the agility it offers to developers. This data extraction tool can be used on most modern hosting platforms. A single MySQL database [6] is used for source storage, and the entire crawler application is written in the PHP [7] server-side language, which is run in CLI mode on an Apache server [8].
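The idea of keeping the whole crawl state in a single database can be sketched as below. This is not the schema of the thesis: the sketch uses Python with the standard `sqlite3` module as a stand-in for PHP and MySQL, and the table `pages` and all function names are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) pages table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " body TEXT,"
        " fetched_at TEXT)"
    )
    return db


def enqueue(db, url):
    """Add a URL to the queue; return False if it was already known."""
    cur = db.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (url,))
    db.commit()
    return cur.rowcount == 1


def next_unfetched(db):
    """Return the oldest queued URL that has not been fetched, or None."""
    row = db.execute(
        "SELECT url FROM pages WHERE fetched_at IS NULL ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def store_page(db, url, body):
    """Save the downloaded source and mark the URL as fetched."""
    db.execute(
        "UPDATE pages SET body = ?, fetched_at = datetime('now') WHERE url = ?",
        (body, url),
    )
    db.commit()
```

With one table serving as both queue and source storage, a command-line process can be stopped and restarted without losing its position, which suits shared hosting where long-running processes may be killed.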

Although the market offers many services that index, order and provide data, the author found few solutions that were available as open source and ran under LAMP. Data extraction tools are needed more than ever, not only for indexing a page but also for tracking changes, exporting data to other applications and analyzing an existing web site. That is why the author decided to write his own program, which would be open to changes by the public. Examining different algorithms should clarify its purposes and show which of them is most suitable for this work.


Some loss of fidelity may be visible because the document was converted to the Word 2003 format.