30-05-2013, 11:50 AM
A Major Seminar on A Novel Architecture for Domain Specific Parallel Crawler
A Novel Architecture.pptx (Size: 844.16 KB / Downloads: 24)
INTRODUCTION - What Is a Crawler?
A program that downloads and stores web pages:
Starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized.
From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue.
This process is repeated until the crawler decides to stop.
Collected pages are later used for other applications, such as a web search engine or a web cache.
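The crawl loop described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: `fetch` stands in for an HTTP download, and the regex-based link extraction is deliberately simplistic.

```python
import re
from collections import deque

def crawl(seed_urls, fetch, max_pages=10):
    """Basic crawl loop: dequeue a URL, download the page, extract links, enqueue new ones."""
    queue = deque(seed_urls)   # S0: the initial set of URLs
    seen = set(seed_urls)      # avoid downloading the same URL twice
    pages = {}                 # collected pages, for later use (search engine, web cache)
    while queue and len(pages) < max_pages:
        url = queue.popleft()  # get a URL (here: FIFO order)
        html = fetch(url)      # download the page
        pages[url] = html
        for link in re.findall(r'href="([^"]+)"', html):  # extract URLs from the page
            if link not in seen:
                seen.add(link)
                queue.append(link)  # put new URLs back in the queue
    return pages
```

The `max_pages` cutoff models the "crawler decides to stop" condition; a real crawler would also respect robots.txt, normalize URLs, and prioritize the queue.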
What Is a Parallel Crawler?
As the size of the web grows, it becomes more difficult to retrieve the whole or a significant portion of the web using a single process.
It becomes imperative to parallelize the crawling process in order to finish downloading pages in a reasonable amount of time.
We refer to this type of crawler as a parallel crawler.
The main goal in designing a parallel crawler is to maximize its performance (download rate) and minimize the overhead from parallelization.
Proposed Architecture
A novel architecture for a parallel crawler, based on intelligent domain-specific crawling, is proposed.
Crawling on this basis makes the task more effective in terms of relevancy and load sharing.
The architecture has a number of domain-specific queues for the various domains, such as .edu, .org, .ac, .com, etc. The URLs in these queues are sent to the respective crawl workers. This shares the load among the parallel crawlers on the basis of specific domains.
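A minimal sketch of the domain-specific queues, assuming classification by the labels of the URL's hostname (names like `domain_of` are illustrative, not from the paper):

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

DOMAINS = ("edu", "org", "ac", "com")  # the domains named above

def domain_of(url):
    """Decide which domain-specific queue a URL belongs to."""
    host = urlparse(url).hostname or ""
    for d in DOMAINS:
        if d in host.split("."):  # matches mit.edu as well as iitd.ac.in
            return "." + d
    return "other"

# One queue per domain; each is consumed by its own crawl worker.
domain_queues = defaultdict(deque)
for url in ["http://www.mit.edu/", "http://example.com/page"]:
    domain_queues[domain_of(url)].append(url)
```

Using a dictionary keyed by domain keeps adding a new domain (and thus a new crawl worker) a one-line change, which matches the scalability claim made later.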
Crawl Worker
This is the most important module of the entire proposed architecture. The working of the crawl module involves the following steps:
(1) Fetch URL
(2) Download Web page
(3) Extract URLs
(4) Analyze URL
(5) Forward URL
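The five steps above can be sketched as one worker iteration. This is a hedged illustration: `fetch`, `belongs_here`, and `forward` are hypothetical callbacks standing in for the downloader, the domain check, and the hand-off to the URL distributor.

```python
import re

def crawl_worker_step(url, fetch, belongs_here, forward):
    """One iteration of a crawl worker, following the five steps above."""
    html = fetch(url)                            # (1) fetch URL, (2) download web page
    links = re.findall(r'href="([^"]+)"', html)  # (3) extract URLs
    local = []
    for link in links:                           # (4) analyze each URL's domain
        if belongs_here(link):
            local.append(link)                   # stays in this worker's own queue
        else:
            forward(link)                        # (5) forward foreign URLs onward
    return html, local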
URL Distributor
The URL distributor uses the URL distribution algorithm.
It maintains the information needed to identify the domain of each URL.
It gets a seed URL from the seed URL queue and distributes it to the queue of the concerned domain-specific crawl worker for further processing.
The number of pages downloaded, and hence the crawling time, depends on the seed URL, so the load on the crawl workers is likely to differ depending on how frequently each domain is in demand.
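The distribution step can be sketched as follows (an assumption-laden illustration: here the routing key is simply the last label of the hostname, and `distribute` is a name chosen for this sketch):

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

def distribute(seed_queue, worker_queues):
    """Route every seed URL to the queue of its domain-specific crawl worker."""
    while seed_queue:
        url = seed_queue.popleft()                     # get a seed URL
        host = urlparse(url).hostname or ""
        tld = "." + host.rsplit(".", 1)[-1] if "." in host else "other"
        worker_queues[tld].append(url)                 # consumed by the matching crawl worker
    return worker_queues
```

If most seeds fall under one domain, that worker's queue grows fastest, which is exactly the uneven load across crawl workers noted above.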
Web Cache
A web cache stores web resources in anticipation of future requests. Web caching works because of popularity: the more popular a resource is, the more likely it is to be requested again in the future. The advantages of a web cache are:
reduces network bandwidth usage, which can save money for both the consumer and the creator.
lessens user-perceived delay, which increases the user-perceived value of the service.
lightens the load on the origin servers, saving hardware and support costs for content providers and giving consumers a shorter response time even for non-cached resources.
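A minimal sketch of such a cache, assuming a dictionary store and an injected origin `fetch` (both names are illustrative); hits avoid the origin entirely, which is where the bandwidth and load savings come from:

```python
class WebCache:
    """Minimal web cache: serve repeated requests locally, go to the origin on a miss."""
    def __init__(self, fetch):
        self._fetch = fetch   # origin download, used only on a cache miss
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self._store:
            self.hits += 1    # popular resources are served from the cache
            return self._store[url]
        self.misses += 1      # miss: costs bandwidth and loads the origin server
        self._store[url] = self._fetch(url)
        return self._store[url]
```

A production cache would also honor expiry and validation headers and bound its storage with an eviction policy; none of that is shown here.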
Conclusion
This novel architecture is proposed for building a parallel crawler along the lines of domain-specific crawling, with the following characteristics:
Full Distribution: The DS Crawler (Domain Specific Crawler) is distributed over multiple crawling machines (one per specific domain) for better performance. The crawling machines download web pages independently, without communication between them.
Scalability: Due to the fully distributed architecture of the DS Crawler, its performance can be scaled by adding extra machines as the number of domains grows, thus managing to handle the rapidly growing Web.
Load Balancing: The URLs to be crawled are distributed by the URL distributor to the particular domain-specific queues, spreading the crawling across different crawlers and thus balancing the crawling load.
Reliability: Multiple, independently working crawlers increase the reliability of the whole system, as the failure of a single crawl worker will not affect the functioning of the remaining crawl workers.