10-12-2012, 06:15 PM
Introduction to Web Crawling and Regular Expression
web-crawler.ppt (Size: 296.5 KB / Downloads: 147)
Utilities of a crawler
Also known as: Web crawler, spider.
Definition:
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)
Utilities:
Gather pages from the Web.
Support a search engine, perform data mining and so on.
Object:
Text, video, image and so on.
Link structure.
Features of a crawler
Must provide:
Robustness: resist spider traps, such as
Infinitely deep directory structures
Pages filled with a very large number of characters
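A crawler can defend against both kinds of traps by capping crawl depth and page size before processing a page. A minimal sketch, where the specific limits and helper names are illustrative and not from the slides:

```python
# Hypothetical guards against spider traps: cap URL path depth and page size.
MAX_DEPTH = 10             # reject infinitely deep directory structures
MAX_PAGE_BYTES = 1 << 20   # reject pages filled with a huge number of characters

def url_depth(url: str) -> int:
    """Count path segments, e.g. http://a.com/x/y -> 2."""
    parts = url.split("://", 1)[-1].split("/", 1)
    return len(parts[1].split("/")) if len(parts) > 1 and parts[1] else 0

def is_safe(url: str, page_bytes: int) -> bool:
    """Accept a page only if it is within both limits."""
    return url_depth(url) <= MAX_DEPTH and page_bytes <= MAX_PAGE_BYTES

print(is_safe("http://example.com/a/b", 1000))        # True
print(is_safe("http://example.com/" + "a/" * 50, 0))  # False: too deep
```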
Politeness: which pages can be crawled, and which cannot
robots exclusion protocol: robots.txt
User-agent: *
Disallow: /manage/
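Python's standard library can parse robots.txt and answer "which pages can be crawled, and which cannot". A sketch using the example rules above, fed in directly rather than fetched over HTTP:

```python
from urllib.robotparser import RobotFileParser

# Parse the example rules from the slide instead of fetching a live robots.txt.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /manage/",
])

# A polite crawler checks can_fetch() before requesting each URL.
print(rp.can_fetch("*", "http://example.com/manage/users"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))    # True
```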
Architecture of a crawler
URL Frontier: contains the URLs yet to be fetched in the current crawl. Initially, a seed set of URLs is stored in the frontier, and the crawler begins by taking a URL from it.
DNS: Domain Name Service resolution. Looks up the IP address for a domain name.
Fetch: generally uses the HTTP protocol to fetch the page at the URL.
Parse: the page is parsed. Text (images, videos, etc.) and links are extracted.
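The loop connecting these components can be sketched in a few lines. This toy version crawls an in-memory "web" (the two pages and their links are invented) so it runs without network access or DNS, but the frontier/fetch/parse structure matches the architecture above:

```python
import re
from collections import deque

# A toy in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">b</a> hello',
    "http://b.example/": '<a href="http://a.example/">a</a> world',
}

def fetch(url):
    """Fetch: return the page body (an HTTP request in a real crawler)."""
    return PAGES.get(url, "")

def parse(html):
    """Parse: extract the text and the links from a page."""
    links = re.findall(r'href="([^"]+)"', html)
    text = re.sub(r"<[^>]+>", " ", html)
    return text, links

def crawl(seeds):
    frontier = deque(seeds)   # URL Frontier: URLs yet to be fetched
    seen = set(seeds)
    visited = []
    while frontier:
        url = frontier.popleft()
        text, links = parse(fetch(url))
        visited.append(url)
        for link in links:    # feed newly discovered URLs back into the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["http://a.example/"]))  # ['http://a.example/', 'http://b.example/']
```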
Regular Expression
Usage:
Regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words or patterns of characters.
Today’s target:
Introduce the basic principle.
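As a basic illustration of the principle, Python's re module can identify such strings of interest; the pattern and sample text below are invented examples, and the pattern is deliberately simplified rather than a complete e-mail grammar:

```python
import re

# Identify strings of interest: a simplified pattern for e-mail addresses.
text = "Contact: alice@example.com or bob@test.org"
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
print(emails)  # ['alice@example.com', 'bob@test.org']
```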
A tool to test regular expressions: Regex Tester
http://www.dotnet2themaxblogs/fbalena/Pe...859f9.aspx