Crawling the Web
Introduction
Web crawlers are programs that exploit the graph structure of the Web to move from page to page. In their infancy such programs were also called wanderers, robots, spiders, fish, and worms, words that are quite evocative of Web imagery. It may be observed that the noun 'crawler' is not indicative of the speed of these programs, as they can be considerably fast. In our own experience, we have been able to crawl up to tens of thousands of pages within a few minutes.
Building a Crawling Infrastructure
A basic sequential crawler follows a simple flow (multi-threaded crawlers are considered in section 2.6). The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs, which may be provided by a user or another program. Each crawling loop involves picking the next URL to crawl from the frontier, fetching the page corresponding to that URL through HTTP, parsing the retrieved page to extract URLs and application-specific information, and finally adding the unvisited URLs to the frontier. Before the URLs are added to the frontier they may be assigned a score that represents the estimated benefit of visiting the corresponding page. The crawling process may be terminated when a certain number of pages have been crawled. If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead-end: the crawler has no new page to fetch and hence it stops.
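As a concrete illustration of this loop, the sketch below shows a minimal sequential crawler in Python. It is our own assumption rather than code from the text: it uses the standard-library urllib, a naive regular-expression link extractor, and a fixed page limit, and it ignores robots.txt, politeness delays, and URL scoring.

    import re
    import urllib.request
    from urllib.parse import urljoin

    def crawl(seeds, max_pages=100):
        frontier = list(seeds)        # to-do list of unvisited URLs
        visited = set()               # URLs whose pages have been fetched
        while frontier and len(visited) < max_pages:
            url = frontier.pop(0)     # pick the next URL from the frontier
            if url in visited:
                continue
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue              # skip pages that cannot be fetched
            visited.add(url)
            # naive link extraction; a real crawler would use an HTML parser
            for href in re.findall(r'href="([^"#]+)"', page):
                link = urljoin(url, href)
                if link.startswith("http") and link not in visited:
                    frontier.append(link)   # add unvisited URLs to the frontier
        return visited

Calling crawl(["http://example.com/"]) would fetch up to 100 pages reachable from that seed; stopping after a fixed number of pages mirrors the termination condition described above.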
Frontier
The frontier is the to-do list of a crawler; it contains the URLs of unvisited pages. In graph search terminology the frontier is an open list of unexpanded (unvisited) nodes. Although it may be necessary to store the frontier on disk for large-scale crawlers, we will represent the frontier as an in-memory data structure for simplicity. Based on the available memory, one can decide the maximum size of the frontier. Given the large amount of memory available on PCs today, a frontier size of 100,000 URLs or more is not exceptional. Given a maximum frontier size, we need a mechanism to decide which URLs to ignore when this limit is reached. Note that the frontier can fill rather quickly as pages are crawled: one can expect around 60,000 URLs in the frontier after a crawl of 10,000 pages, assuming an average of about 7 links per page [30].
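One way to realize such a bounded frontier, assumed here rather than prescribed by the text, is a priority queue keyed on the URL score: when the maximum size is reached, the lowest-scoring entry is evicted. The sketch below uses Python's heapq module; the Frontier class name and max_size value are illustrative.

    import heapq

    class Frontier:
        """Bounded to-do list of unvisited URLs, highest-scored first."""

        def __init__(self, max_size=100_000):
            self.max_size = max_size
            self.heap = []        # entries are (-score, url); negation gives max-first order
            self.members = set()  # fast duplicate check

        def add(self, url, score):
            if url in self.members:
                return
            if len(self.heap) >= self.max_size:
                worst = max(self.heap)          # largest -score == lowest score
                if -worst[0] >= score:
                    return                      # new URL is no better; ignore it
                self.heap.remove(worst)         # evict the lowest-scoring URL
                self.members.discard(worst[1])
                heapq.heapify(self.heap)
            heapq.heappush(self.heap, (-score, url))
            self.members.add(url)

        def next_url(self):
            neg_score, url = heapq.heappop(self.heap)   # IndexError signals an empty frontier
            self.members.discard(url)
            return url

Evicting via max() is linear in the frontier size, which is acceptable at this scale; a disk-backed design would be needed for much larger crawls, but that is beyond this sketch.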
History and Page Repository
The crawl history is a time-stamped list of the URLs that were fetched by the crawler. In effect, it shows the path of the crawler through the Web starting from the seed pages. A URL entry is made in the history only after the corresponding page has been fetched. This history may be used for post-crawl analysis and evaluation. For example, we can associate a value with each page on the crawl path and identify significant events (such as the discovery of an excellent resource). While the history may be written to disk from time to time, it is also maintained as an in-memory data structure, which provides fast lookup to check whether a page has been crawled or not.
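A minimal sketch of such a history, under our own assumptions about names and storage format, pairs an append-only list of (timestamp, URL) entries with a set for constant-time "already crawled?" checks; the occasional write to disk is simplified to a tab-separated text file.

    import time

    class CrawlHistory:
        """Time-stamped record of fetched URLs with fast crawled-or-not lookup."""

        def __init__(self):
            self.entries = []   # (timestamp, url) in fetch order: the crawler's path
            self.seen = set()   # in-memory index for O(1) membership tests

        def record(self, url):
            # add an entry only after the corresponding page has been fetched
            self.entries.append((time.time(), url))
            self.seen.add(url)

        def crawled(self, url):
            return url in self.seen

        def dump(self, path="history.txt"):
            # occasionally persist the history for post-crawl analysis
            with open(path, "w") as f:
                for ts, url in self.entries:
                    f.write(f"{ts:.0f}\t{url}\n")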