30-05-2013, 11:50 AM
A Major Seminar on A Novel Architecture for Domain Specific Parallel Crawler
A Novel Architecture.pptx (Size: 844.16 KB / Downloads: 24)
INTRODUCTION - What Is a Crawler?
A program that downloads and stores web pages:
Starts off by placing an initial set of URLs, S0, in a queue, where all URLs to be retrieved are kept and prioritized.
From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue.
This process is repeated until the crawler decides to stop.
Collected pages are later used for other applications, such as a web search engine or a web cache.
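The crawl loop described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: `fetch` stands in for an HTTP download, and the regex-based link extraction is deliberately simplistic.

```python
import re
from collections import deque

def crawl(seed_urls, fetch, max_pages=10):
    """Basic crawl loop: dequeue a URL, download the page, extract links, enqueue new ones."""
    queue = deque(seed_urls)   # S0: the initial set of URLs
    seen = set(seed_urls)      # avoid downloading the same URL twice
    pages = {}                 # collected pages, for later use (search engine, web cache)
    while queue and len(pages) < max_pages:
        url = queue.popleft()  # get a URL (here: FIFO order)
        html = fetch(url)      # download the page
        pages[url] = html
        for link in re.findall(r'href="([^"]+)"', html):  # extract URLs from the page
            if link not in seen:
                seen.add(link)
                queue.append(link)  # put new URLs back in the queue
    return pages
```

The `max_pages` cutoff models the "crawler decides to stop" condition; a real crawler would also respect robots.txt, normalize URLs, and prioritize the queue.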
What Is a Parallel Crawler?
As the size of the web grows, it becomes more difficult to retrieve the whole or a significant portion of the web using a single process.
It becomes imperative to parallelize the crawling process in order to finish downloading pages in a reasonable amount of time.
We refer to this type of crawler as a parallel crawler.
The main goal in designing a parallel crawler is to maximize its performance (download rate) and minimize the overhead from parallelization.
Proposed Architecture
A novel architecture for a parallel crawler, based on intelligent domain-specific crawling, is proposed.
Crawling on this basis makes the task more effective in terms of relevancy and load sharing.
The architecture has a number of domain-specific queues for the various domains, such as .edu, .org, .ac, .com, etc. The URLs in these queues are sent to the respective crawl workers. This shares the load among the parallel crawlers on the basis of specific domains.
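A minimal sketch of the domain-specific queues, assuming classification by the labels of the URL's hostname (names like `domain_of` are illustrative, not from the paper):

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

DOMAINS = ("edu", "org", "ac", "com")  # the domains named above

def domain_of(url):
    """Decide which domain-specific queue a URL belongs to."""
    host = urlparse(url).hostname or ""
    for d in DOMAINS:
        if d in host.split("."):  # matches mit.edu as well as iitd.ac.in
            return "." + d
    return "other"

# One queue per domain; each is consumed by its own crawl worker.
domain_queues = defaultdict(deque)
for url in ["http://www.mit.edu/", "http://example.com/page"]:
    domain_queues[domain_of(url)].append(url)
```

Using a dictionary keyed by domain keeps adding a new domain (and thus a new crawl worker) a one-line change, which matches the scalability claim made later.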
Crawl Worker
This is the most important module of the entire proposed architecture. The working of the crawl module involves the following steps:
(1) Fetch URL
(2) Download Web page
(3) Extract URLs
(4) Analyze URL
(5) Forward URL
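The five steps above can be sketched as one worker iteration. This is a hedged illustration: `fetch`, `belongs_here`, and `forward` are hypothetical callbacks standing in for the downloader, the domain check, and the hand-off to the URL distributor.

```python
import re

def crawl_worker_step(url, fetch, belongs_here, forward):
    """One iteration of a crawl worker, following the five steps above."""
    html = fetch(url)                            # (1) fetch URL, (2) download web page
    links = re.findall(r'href="([^"]+)"', html)  # (3) extract URLs
    local = []
    for link in links:                           # (4) analyze each URL's domain
        if belongs_here(link):
            local.append(link)                   # stays in this worker's own queue
        else:
            forward(link)                        # (5) forward foreign URLs onward
    return html, local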
URL Distributor
The URL distributor uses the URL distribution algorithm.
It maintains the information needed to identify the domain of each URL.
It gets a seed URL from the seed URL queue and distributes it to the queue of the concerned domain-specific crawl worker for further processing.
The number of pages downloaded, and hence the crawling time, depends on the seed URL, so the load on the crawl workers is likely to differ depending on how frequently each domain is in demand.
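The distribution step can be sketched as follows (an assumption-laden illustration: here the routing key is simply the last label of the hostname, and `distribute` is a name chosen for this sketch):

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

def distribute(seed_queue, worker_queues):
    """Route every seed URL to the queue of its domain-specific crawl worker."""
    while seed_queue:
        url = seed_queue.popleft()                     # get a seed URL
        host = urlparse(url).hostname or ""
        tld = "." + host.rsplit(".", 1)[-1] if "." in host else "other"
        worker_queues[tld].append(url)                 # consumed by the matching crawl worker
    return worker_queues
```

If most seeds fall under one domain, that worker's queue grows fastest, which is exactly the uneven load across crawl workers noted above.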
Web Cache
A web cache stores web resources in anticipation of future requests. Web caching works because of popularity: the more popular a resource is, the more likely it is to be requested again in the future. The advantages of a web cache are:
reduces network bandwidth usage, which can save money for both the consumer and the creator.
lessens user-perceived delay, which increases the user-perceived value of the service.
lightens the load on the origin servers, saving hardware and support costs for content providers and giving consumers a shorter response time even for non-cached resources.
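A minimal sketch of such a cache, assuming a dictionary store and an injected origin `fetch` (both names are illustrative); hits avoid the origin entirely, which is where the bandwidth and load savings come from:

```python
class WebCache:
    """Minimal web cache: serve repeated requests locally, go to the origin on a miss."""
    def __init__(self, fetch):
        self._fetch = fetch   # origin download, used only on a cache miss
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self._store:
            self.hits += 1    # popular resources are served from the cache
            return self._store[url]
        self.misses += 1      # miss: costs bandwidth and loads the origin server
        self._store[url] = self._fetch(url)
        return self._store[url]
```

A production cache would also honor expiry and validation headers and bound its storage with an eviction policy; none of that is shown here.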
Conclusion
This novel architecture is proposed for building a parallel crawler along the lines of domain-specific crawling, with the following characteristics:
Full Distribution: The DS Crawler (Domain Specific Crawler) is distributed over multiple crawling machines (one per specific domain) for better performance. The crawling machines download web pages independently, without communication between them.
Scalability: Due to the fully distributed architecture of the DS Crawler, its performance can be scaled by adding extra machines as the number of domains grows, thus managing to handle the rapidly growing Web.
Load Balancing: The URLs to be crawled are distributed by the URL distributor to the particular domain-specific queues, spreading the crawling across different crawlers and thus balancing the crawling load.
Reliability: Multiple, independently working crawlers increase the reliability of the whole system, as the failure of a single crawl worker will not affect the functioning of the remaining crawl workers.