10-12-2012, 06:15 PM
Introduction to Web Crawling and Regular Expression
web-crawler.ppt (Size: 296.5 KB / Downloads: 147)
Utilities of a crawler
Also known as: Web crawler, spider.
Definition:
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)
Utilities:
Gather pages from the Web.
Support a search engine, perform data mining and so on.
Object:
Text, video, image and so on.
Link structure.
Features of a crawler
Must provide:
Robustness: resist spider traps, such as
Infinitely deep directory structures
Pages filled with a very large number of characters
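A crawler can defend against both kinds of traps by capping crawl depth and page size before processing a page. A minimal sketch, where the specific limits and helper names are illustrative and not from the slides:

```python
# Hypothetical guards against spider traps: cap URL path depth and page size.
MAX_DEPTH = 10             # reject infinitely deep directory structures
MAX_PAGE_BYTES = 1 << 20   # reject pages filled with a huge number of characters

def url_depth(url: str) -> int:
    """Count path segments, e.g. http://a.com/x/y -> 2."""
    parts = url.split("://", 1)[-1].split("/", 1)
    return len(parts[1].split("/")) if len(parts) > 1 and parts[1] else 0

def is_safe(url: str, page_bytes: int) -> bool:
    """Accept a page only if it is within both limits."""
    return url_depth(url) <= MAX_DEPTH and page_bytes <= MAX_PAGE_BYTES

print(is_safe("http://example.com/a/b", 1000))        # True
print(is_safe("http://example.com/" + "a/" * 50, 0))  # False: too deep
```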
Politeness: which pages can be crawled, and which cannot
robots exclusion protocol: robots.txt
User-agent: *
Disallow: /manage/
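Python's standard library can parse robots.txt and answer "which pages can be crawled, and which cannot". A sketch using the example rules above, fed in directly rather than fetched over HTTP:

```python
from urllib.robotparser import RobotFileParser

# Parse the example rules from the slide instead of fetching a live robots.txt.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /manage/",
])

# A polite crawler checks can_fetch() before requesting each URL.
print(rp.can_fetch("*", "http://example.com/manage/users"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))    # True
```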
Architecture of a crawler
URL Frontier: contains the URLs yet to be fetched in the current crawl. Initially, a seed set of URLs is stored in the frontier, and the crawler begins by taking a URL from it.
DNS: Domain Name Service resolution. Looks up the IP address for a domain name.
Fetch: generally uses the HTTP protocol to fetch the page at the URL.
Parse: the page is parsed. Text (images, videos, etc.) and links are extracted.
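The loop connecting these components can be sketched in a few lines. This toy version crawls an in-memory "web" (the two pages and their links are invented) so it runs without network access or DNS, but the frontier/fetch/parse structure matches the architecture above:

```python
import re
from collections import deque

# A toy in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">b</a> hello',
    "http://b.example/": '<a href="http://a.example/">a</a> world',
}

def fetch(url):
    """Fetch: return the page body (an HTTP request in a real crawler)."""
    return PAGES.get(url, "")

def parse(html):
    """Parse: extract the text and the links from a page."""
    links = re.findall(r'href="([^"]+)"', html)
    text = re.sub(r"<[^>]+>", " ", html)
    return text, links

def crawl(seeds):
    frontier = deque(seeds)   # URL Frontier: URLs yet to be fetched
    seen = set(seeds)
    visited = []
    while frontier:
        url = frontier.popleft()
        text, links = parse(fetch(url))
        visited.append(url)
        for link in links:    # feed newly discovered URLs back into the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["http://a.example/"]))  # ['http://a.example/', 'http://b.example/']
```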
Regular Expression
Usage:
Regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words or patterns of characters.
Today’s target:
Introduce the basic principle.
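As a basic illustration of the principle, Python's re module can identify such strings of interest; the pattern and sample text below are invented examples, and the pattern is deliberately simplified rather than a complete e-mail grammar:

```python
import re

# Identify strings of interest: a simplified pattern for e-mail addresses.
text = "Contact: alice@example.com or bob@test.org"
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
print(emails)  # ['alice@example.com', 'bob@test.org']
```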
A tool to test regular expressions: Regex Tester
http://www.dotnet2themaxblogs/fbalena/Pe...859f9.aspx