Crawling the Web
Introduction
Web crawlers are programs that exploit the graph structure of the Web to move from page to page. In their infancy such programs were also called wanderers, robots, spiders, fish, and worms, words that are quite evocative of Web imagery. It may be observed that the noun 'crawler' is not indicative of the speed of these programs, as they can be considerably fast. In our own experience, we have been able to crawl up to tens of thousands of pages within a few minutes.
Building a Crawling Infrastructure
A basic sequential crawler follows a simple flow (multi-threaded crawlers are considered in section 2.6). The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with seed URLs, which may be provided by a user or another program. Each crawling loop involves picking the next URL to crawl from the frontier, fetching the page corresponding to that URL through HTTP, parsing the retrieved page to extract URLs and application-specific information, and finally adding the unvisited URLs to the frontier. Before the URLs are added to the frontier they may be assigned a score that represents the estimated benefit of visiting the corresponding page. The crawling process may be terminated when a certain number of pages have been crawled. If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead-end: the crawler has no new page to fetch and hence it stops.
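As a concrete illustration of this loop, the sketch below shows a minimal sequential crawler in Python. It is our own assumption rather than code from the text: it uses the standard-library urllib, a naive regular-expression link extractor, and a fixed page limit, and it ignores robots.txt, politeness delays, and URL scoring.

    import re
    import urllib.request
    from urllib.parse import urljoin

    def crawl(seeds, max_pages=100):
        frontier = list(seeds)        # to-do list of unvisited URLs
        visited = set()               # URLs whose pages have been fetched
        while frontier and len(visited) < max_pages:
            url = frontier.pop(0)     # pick the next URL from the frontier
            if url in visited:
                continue
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    page = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue              # skip pages that cannot be fetched
            visited.add(url)
            # naive link extraction; a real crawler would use an HTML parser
            for href in re.findall(r'href="([^"#]+)"', page):
                link = urljoin(url, href)
                if link.startswith("http") and link not in visited:
                    frontier.append(link)   # add unvisited URLs to the frontier
        return visited

Calling crawl(["http://example.com/"]) would fetch up to 100 pages reachable from that seed; stopping after a fixed number of pages mirrors the termination condition described above.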
Frontier
The frontier is the to-do list of a crawler; it contains the URLs of unvisited pages. In graph search terminology the frontier is an open list of unexpanded (unvisited) nodes. Although it may be necessary to store the frontier on disk for large-scale crawlers, we will represent the frontier as an in-memory data structure for simplicity. Based on the available memory, one can decide the maximum size of the frontier. Given the large amount of memory available on PCs today, a frontier size of 100,000 URLs or more is not exceptional. Given a maximum frontier size, we need a mechanism to decide which URLs to ignore when this limit is reached. Note that the frontier can fill rather quickly as pages are crawled: one can expect around 60,000 URLs in the frontier after a crawl of 10,000 pages, assuming an average of about 7 links per page [30].
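One way to realize such a bounded frontier, assumed here rather than prescribed by the text, is a priority queue keyed on the URL score: when the maximum size is reached, the lowest-scoring entry is evicted. The sketch below uses Python's heapq module; the Frontier class name and max_size value are illustrative.

    import heapq

    class Frontier:
        """Bounded to-do list of unvisited URLs, highest-scored first."""

        def __init__(self, max_size=100_000):
            self.max_size = max_size
            self.heap = []        # entries are (-score, url); negation gives max-first order
            self.members = set()  # fast duplicate check

        def add(self, url, score):
            if url in self.members:
                return
            if len(self.heap) >= self.max_size:
                worst = max(self.heap)          # largest -score == lowest score
                if -worst[0] >= score:
                    return                      # new URL is no better; ignore it
                self.heap.remove(worst)         # evict the lowest-scoring URL
                self.members.discard(worst[1])
                heapq.heapify(self.heap)
            heapq.heappush(self.heap, (-score, url))
            self.members.add(url)

        def next_url(self):
            neg_score, url = heapq.heappop(self.heap)   # IndexError signals an empty frontier
            self.members.discard(url)
            return url

Evicting via max() is linear in the frontier size, which is acceptable at this scale; a disk-backed design would be needed for much larger crawls, but that is beyond this sketch.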
History and Page Repository
The crawl history is a time-stamped list of the URLs that were fetched by the crawler. In effect, it shows the path of the crawler through the Web starting from the seed pages. A URL entry is made in the history only after the corresponding page has been fetched. This history may be used for post-crawl analysis and evaluation. For example, we can associate a value with each page on the crawl path and identify significant events (such as the discovery of an excellent resource). While the history may be written to disk from time to time, it is also maintained as an in-memory data structure, which provides fast lookup to check whether a page has been crawled or not.
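A minimal sketch of such a history, under our own assumptions about names and storage format, pairs an append-only list of (timestamp, URL) entries with a set for constant-time "already crawled?" checks; the occasional write to disk is simplified to a tab-separated text file.

    import time

    class CrawlHistory:
        """Time-stamped record of fetched URLs with fast crawled-or-not lookup."""

        def __init__(self):
            self.entries = []   # (timestamp, url) in fetch order: the crawler's path
            self.seen = set()   # in-memory index for O(1) membership tests

        def record(self, url):
            # add an entry only after the corresponding page has been fetched
            self.entries.append((time.time(), url))
            self.seen.add(url)

        def crawled(self, url):
            return url in self.seen

        def dump(self, path="history.txt"):
            # occasionally persist the history for post-crawl analysis
            with open(path, "w") as f:
                for ts, url in self.entries:
                    f.write(f"{ts:.0f}\t{url}\n")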