Web Crawler
Introduction-:
WebCrawler is a Web service that assists users in their Web navigation by automating the task of link traversal, creating a searchable index of the web, and fulfilling searchers’ queries from the index.
Crawling is the means by which WebCrawler collects pages from the Web. The end result of crawling is a collection of Web pages at a central location. Given the continuous expansion of the Web, this crawled collection is guaranteed to be a subset of the Web and, indeed, it may be far smaller than the total size of the Web. By design, WebCrawler aims for a small, manageable collection that is representative of the entire Web.
How it Works-:
The crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs still to be visited, called the crawl frontier.
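A minimal sketch of this loop in Python is shown below. It assumes the requests and BeautifulSoup libraries for fetching and link extraction; these, and the max_pages limit, are illustrative choices rather than part of the description above.

import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)     # URLs still to visit (the crawl frontier)
    visited = set()             # URLs already fetched
    pages = {}                  # url -> HTML, the crawled collection

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue            # skip unreachable pages
        pages[url] = response.text

        # Identify the hyperlinks in the page and add them to the frontier.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urllib.parse.urljoin(url, anchor["href"])
            if link not in visited:
                frontier.append(link)
    return pages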
Crawling Policies-:
• Selection policy
• Re-visit policy
• Politeness policy
• Parallelization policy
• Selection policy
– Pageranks
– Path ascending
– Focused crawling
• Re-visit policy
– Freshness
– Age
• Politeness
– So that crawlers don’t overload web servers
– Set a delay between GET requests (see the sketch after this list)
• Parallelization
– Distributed web crawling
– To maximize download rate
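One simple way to implement the politeness delay mentioned above is to record the time of the last request to each host and wait before issuing the next GET to that host. The Python sketch below assumes the requests library and a fixed per-host delay of 2 seconds; both are illustrative assumptions.

import time
import urllib.parse

import requests

class PoliteFetcher:
    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.last_request = {}      # host -> timestamp of the last GET

    def get(self, url):
        host = urllib.parse.urlparse(url).netloc
        elapsed = time.time() - self.last_request.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)   # wait out the remaining delay
        self.last_request[host] = time.time()
        return requests.get(url, timeout=10)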
Features of Crawler-:
• Robustness: resilience to spider traps, such as infinitely deep directory structures or pages filled with a large number of characters.
• Politeness: which pages can be crawled, and which cannot
robots exclusion protocol: robots.txt (see the sketch after this list)
• A crawler should also provide:
• Distributed
• Extensible
• Freshness
• Quality
• Performance and efficiency
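The robots exclusion protocol can be checked before fetching a page with Python's standard urllib.robotparser module, as sketched below. The user-agent string "MyCrawler" is a hypothetical placeholder, and the sketch assumes the site's robots.txt is reachable.

import urllib.parse
import urllib.robotparser

def allowed_to_crawl(url, user_agent="MyCrawler"):
    # Build the robots.txt URL for the page's host.
    parts = urllib.parse.urlparse(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()                   # fetch and parse robots.txt
    return parser.can_fetch(user_agent, url)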
Examples of Web Crawlers-:
RBSE
World Wide Web Worm
Google Crawler
WebFountain
WebRace