15-06-2010, 04:46 PM
Abstract:
The objective was to build a customized, multithreaded, focused crawler that crawls the web guided by the relevance of each page, thereby reducing the crawl space and finding the required pages efficiently. The World Wide Web, with over 350 million pages, continues to grow at an amazing pace of roughly a million pages per day, and about 600 GB of text changes every month. Such tremendous growth and flux pose basic limits of scale for today's generic crawlers and search engines. Even with high-end multiprocessors and exquisitely crafted crawling software, the largest crawls cover only 30-40% of the web, and refreshes take weeks to a month. Given these unprecedented scaling challenges for general-purpose crawlers and search engines, we propose a hypertext resource discovery system called a focused crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics; the topics are specified not with keywords but with exemplary documents.
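The core idea in the abstract, fetching the most relevant link first so the crawl space shrinks, can be sketched in Java (the project's language). This is a hypothetical illustration, not the project's actual design: URLs wait in a priority queue ordered by a toy relevance score, here a simple term-overlap measure against one exemplary document; a real focused crawler would use a trained topic classifier. All class names, URLs, and the scoring rule are assumptions for illustration.

```java
import java.util.*;

public class FocusedFrontier {
    // A candidate URL together with its estimated relevance score.
    static class Candidate {
        final String url;
        final double score;
        Candidate(String url, double score) { this.url = url; this.score = score; }
    }

    // Highest-scoring candidate comes out first.
    private final PriorityQueue<Candidate> queue =
        new PriorityQueue<>((a, b) -> Double.compare(b.score, a.score));
    private final Set<String> exemplarTerms;

    public FocusedFrontier(String exemplaryText) {
        exemplarTerms = new HashSet<>(Arrays.asList(exemplaryText.toLowerCase().split("\\s+")));
    }

    // Fraction of the text's terms that also appear in the exemplary document.
    public double relevance(String text) {
        String[] terms = text.toLowerCase().split("\\s+");
        int hits = 0;
        for (String t : terms) if (exemplarTerms.contains(t)) hits++;
        return terms.length == 0 ? 0.0 : (double) hits / terms.length;
    }

    // Score a discovered link by its anchor text and queue it.
    public void add(String url, String anchorText) {
        queue.add(new Candidate(url, relevance(anchorText)));
    }

    // The most relevant pending URL is crawled next.
    public String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.url;
    }

    public static void main(String[] args) {
        FocusedFrontier f = new FocusedFrontier("java web crawler search engine");
        f.add("http://example.com/crawler", "focused web crawler in java");
        f.add("http://example.com/cooking", "best pasta recipes ever");
        System.out.println(f.next()); // the on-topic crawler page scores higher
    }
}
```

Because off-topic links score near zero, they sink to the bottom of the queue and may never be fetched at all, which is exactly how the crawl space gets reduced.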
Software Used:
The software requirements of the developed application are as follows:
1. The NetBeans IDE (a Java development environment) for building the application.
2. The JDBC-ODBC bridge, to link the Java application with the stored data.
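For context, here is a minimal sketch of the classic JDBC-ODBC bridge pattern as it worked in pre-Java-8 JDKs (Oracle removed the bridge in Java 8). The DSN name `crawlerdb` is a made-up example; the real data source would be whatever name is registered in the operating system's ODBC administrator.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class CrawlerStore {
    // Build the JDBC-ODBC connection URL for a given ODBC data source name.
    static String odbcUrl(String dsn) {
        return "jdbc:odbc:" + dsn;
    }

    // Open a connection through the bridge. The driver class below is the
    // one shipped with pre-Java-8 JDKs; this call fails on newer runtimes.
    static Connection open(String dsn) throws Exception {
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        return DriverManager.getConnection(odbcUrl(dsn));
    }

    public static void main(String[] args) {
        // Only demonstrate the URL format here; opening a real connection
        // needs a configured DSN and a pre-Java-8 JDK.
        System.out.println(odbcUrl("crawlerdb")); // prints jdbc:odbc:crawlerdb
    }
}
```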
Hardware Used:
The organization's present hardware setup should be sufficient for running the applications to be developed. The hardware requirements of the developed software are as follows:
1. Servers used to host and manage the application.
2. High network bandwidth for faster downloading and parsing of pages.
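The abstract describes the crawler as multithreaded, which is what lets it exploit the bandwidth listed above: several worker threads download pages concurrently. As a hedged illustration only (the class, URLs, and stubbed fetch are assumptions, not the project's code), a fixed pool of workers can drain a shared frontier like this:

```java
import java.util.*;
import java.util.concurrent.*;

public class CrawlerPool {
    // Crawl every URL in the frontier using nThreads worker threads.
    // The "fetch" is stubbed to just record the URL; a real worker would
    // download the page, score it, and push new links back to the frontier.
    static Set<String> crawl(Collection<String> urls, int nThreads) throws InterruptedException {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (String url : urls) {
            pool.submit(() -> visited.add(url)); // stand-in for fetch + parse
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return visited;
    }

    public static void main(String[] args) throws InterruptedException {
        Set<String> done = crawl(List.of("http://a.example", "http://b.example"), 2);
        System.out.println(done.size() + " pages crawled");
    }
}
```

A thread-safe set and a bounded pool keep the workers from duplicating or overwhelming each other, which matters once real network fetches replace the stub.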