26-07-2012, 03:13 PM
Crawling the Hidden Web
CrawlingtheHiddenWeb.ppt (Size: 644 KB / Downloads: 27)
Web Crawlers
Automatically traverse the Web graph, building a local repository of the portion of the Web that they visit
Traditionally, crawlers have only targeted a portion of the Web called the publicly indexable Web (PIW)
PIW – the set of pages reachable purely by following hypertext links, ignoring search forms and pages that require authentication
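The traversal described above is essentially a breadth-first walk over the link graph. A minimal sketch in Python, assuming a caller-supplied `fetch` function (e.g. wrapping `urllib.request`) and a simple page limit; all names here are illustrative, not part of any standard crawler:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_piw(seed, fetch, max_pages=100):
    """Breadth-first traversal of the publicly indexable Web:
    follow hypertext links only, never submit forms or log in."""
    repository = {}              # local copy of every visited page
    frontier = deque([seed])
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in repository:    # skip pages already stored
            continue
        html = fetch(url)        # e.g. urllib.request.urlopen(url).read()
        repository[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            frontier.append(urljoin(url, href))  # resolve relative links
    return repository
```

Passing `fetch` in as a parameter keeps the traversal logic separate from network I/O, which also makes the sketch easy to test offline.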
The Hidden Web
Recent studies show that a significant fraction of Web content in fact lies outside the PIW
Large portions of the Web are ‘hidden’ behind search forms in searchable databases
HTML pages are dynamically generated in response to queries submitted via the search forms
Also referred to as the ‘Deep’ Web
Deep Web Stats
The Deep Web is estimated to be about 500 times larger than the PIW!
Contains 7,500 terabytes of information (March 2000)
More than 200,000 Deep Web sites exist
Sixty of the largest Deep Web sites collectively contain about 750 terabytes of information
95% of the Deep Web is publicly accessible (no fees)
Google indexes only about 16% of the PIW, so searches reach roughly 0.03% of all pages available today
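The 0.03% figure follows from the earlier stats: if the Deep Web is 500 times the PIW, the total Web is about 501 PIW-units, and 16% of one unit is a tiny slice of that total. A quick check of the arithmetic:

```python
# Take the PIW as one unit of content.
piw = 1.0
deep_web = 500 * piw            # "500 times larger than the PIW"
total_web = piw + deep_web      # 501 units in all

indexed = 0.16 * piw            # Google covers ~16% of the PIW
fraction = indexed / total_web  # share of the whole Web we can search

# 0.16 / 501 ≈ 0.00032, i.e. roughly 0.03%
print(round(fraction * 100, 2))
```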
The Solution
Build a hidden Web crawler
Can crawl and extract content from hidden databases
Enable indexing, analysis, and mining of hidden Web content
The content extracted by such crawlers can be used to categorize and classify the hidden databases
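Where a PIW crawler only follows links, a hidden Web crawler must fill in and submit search forms, then store the dynamically generated result pages. A minimal sketch, assuming a simple GET-based form with a single query field; the endpoint, field name `q`, and keyword list are hypothetical placeholders:

```python
from urllib.parse import urlencode

def build_form_query(action_url, fields):
    """Encode form fields as a GET query string, the way a browser
    submits a simple search form."""
    return action_url + "?" + urlencode(fields)

def crawl_hidden(form_url, keywords, fetch):
    """Query a searchable database once per keyword and collect the
    dynamically generated result pages for indexing or mining.

    `fetch` is a caller-supplied function (e.g. urllib-based) so the
    form-filling logic stays independent of network I/O.
    """
    extracted = {}
    for kw in keywords:
        url = build_form_query(form_url, {"q": kw})  # 'q' is an assumed field name
        extracted[url] = fetch(url)
    return extracted
```

In a real hidden Web crawler, the keyword list would itself be chosen adaptively (e.g. from terms found in earlier result pages) to maximize coverage of the underlying database.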