DESIGN OF A NOVEL INCREMENTAL PARALLEL WEBCRAWLER

The World Wide Web (WWW) is a vast repository of interlinked hypertext documents known as web pages. A hypertext document consists of both the content and the hyperlinks to related documents [1]. Users access these hypertext documents via software known as a web browser, which is used to view web pages that may contain information in the form of text, images, videos and other multimedia. The documents are navigated using hyperlinks, addressed by Uniform Resource Locators (URLs). Though the concept of hypertext is much older, the WWW originated after Tim Berners-Lee, an English physicist, wrote a proposal in 1990 for using hypertext to link and access information [2]. Since then, websites have been created around the world using hypertext markup languages and connected through the Internet.

Using the Internet to access information from the WWW has now become an integral part of human life. The current population of the world is approximately 6.77 billion, out of which approximately 1.67 billion people (24.7%) use the Internet [3]. In fact, the number of Internet users has grown from 0.36 billion in 2000 to 1.67 billion in 2009, i.e., an increase of 362% over that period, and a similar growth rate is expected in the future. In Asia alone, around 0.7 billion people use the Internet, which is approximately 42.2% of Internet users worldwide. India, where approximately 0.08 billion people use the Internet, has the third largest Internet user population in Asia after China and Japan. Thus, the day is not far when life will feel incomplete without the Internet.

Since its inception in 1990, the World Wide Web has grown exponentially in size. As of today, it is estimated to contain approximately 50 billion publicly accessible/indexable web documents [4], distributed over thousands of web servers all over the world. It is very difficult to search for information in such a huge collection, as web pages/documents are not organized like books on shelves in a library, nor are they completely catalogued at one central location. Even users who know where to look, i.e., who know the relevant URLs, are not guaranteed to retrieve the information, as the Web is constantly changing [5-6]. Therefore, there was a need to develop information retrieval tools to search for the required information on the WWW. Information retrieval tools are divided into three categories as follows:
• Web directories
• Meta search engines
• Search engines

In Web directories, web documents are organized in a hierarchical taxonomy tree on the basis of topics and subtopics. To access information on a topic from a Web directory, it is necessary to traverse a path in the taxonomy tree from the root to the desired node. The tree is organized so that general topics are subdivided into more specific topics as one moves down from the root. Arranging web documents in this hierarchical way helps even a non-expert user to access information easily, but the basic problem with a Web directory is that the hierarchical tree is maintained manually, and therefore only a small fraction of the Web is covered. Another problem is that the depth of an item in the hierarchical tree is not based on its access pattern [7]. As a result, a longer path may have to be followed to retrieve highly relevant information than less relevant information if the former lies further down the tree. This problem does not arise in search engines, which use a flat approach to access information, so a user can get the most relevant information in one go in response to an appropriate search query.
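As a minimal sketch of this hierarchical organization (the category names and the lookup helper are hypothetical, not taken from any actual Web directory), the taxonomy tree can be modelled as a nested dictionary that must be traversed from the root down to the desired node:

```python
# Toy Web directory: topics nested from general (root) to specific (leaves).
# All category names and page names are illustrative only.
directory = {
    "Computers": {
        "Internet": {
            "Search Engines": ["crawler-basics.html", "ranking-overview.html"],
        },
        "Hardware": {},
    },
    "Science": {
        "Physics": {},
    },
}

def lookup(tree, path):
    """Walk the taxonomy tree along the given topic path and return that node."""
    node = tree
    for topic in path:
        node = node[topic]          # raises KeyError if the topic is absent
    return node

# The user must traverse Computers -> Internet -> Search Engines to reach the pages.
print(lookup(directory, ["Computers", "Internet", "Search Engines"]))
```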

Meta search engines do not maintain their own indexes/repositories; rather, they are developed to exploit the best features of many search engines. They provide a single interface where user queries are issued and then forwarded to many search engines. The results obtained from these multiple search engines are compiled [8-9], duplicates are eliminated, the documents are ranked, and the final results are displayed to the user.
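A minimal sketch of this compile-deduplicate-rank step is given below. The fetch_results stub and the scoring scheme are assumptions made for illustration; a real meta search engine would issue HTTP requests to the underlying engines instead.

```python
def fetch_results(engine, query):
    """Stub standing in for a call to an underlying search engine.
    Returns (url, score) pairs; replace with real API/HTTP calls."""
    canned = {
        "engine_a": [("http://example.com/a", 0.9), ("http://example.com/b", 0.7)],
        "engine_b": [("http://example.com/b", 0.8), ("http://example.com/c", 0.6)],
    }
    return canned.get(engine, [])

def meta_search(query, engines):
    """Send the query to every engine, merge the results, eliminate duplicates,
    and rank each URL by the best score any engine assigned to it."""
    best = {}
    for engine in engines:
        for url, score in fetch_results(engine, query):
            best[url] = max(score, best.get(url, 0.0))   # duplicate elimination
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

print(meta_search("web crawler", ["engine_a", "engine_b"]))
```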
The Internet would not have become so popular had search engines not been developed. Starting in 1994, a number of search engines were launched, including AltaVista, Excite, Infoseek, Inktomi, Lycos, and of course the evergreen Yahoo and Google. Most of these search engines save a copy of the web pages in their central repository and then build appropriate indexes over them for later search/retrieval of information. The user interface, query engine, indexer, crawlers and repository are the basic components of a search engine [10-12].

To access information from the WWW, users submit search queries through the search engine's interface. For the given search query, the search results are displayed on the screen in the order of their relevance [13, 23]. On behalf of the search engine, the query engine processes the search queries to retrieve the relevant documents stored in the search engine's database/repository.
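The following sketch illustrates, under simplifying assumptions, how a query engine might resolve a query against an inverted index built by the indexer; the toy index and the intersection-based (boolean AND) matching are illustrative only and do not reflect any particular engine's implementation.

```python
# Toy inverted index: term -> set of document identifiers in the repository.
inverted_index = {
    "parallel": {"doc1", "doc3"},
    "crawler": {"doc1", "doc2", "doc3"},
    "ranking": {"doc2"},
}

def answer_query(query, index):
    """Return the documents containing every query term."""
    result = None
    for term in query.lower().split():
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result or set()

print(answer_query("parallel crawler", inverted_index))   # -> {'doc1', 'doc3'}
```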

In fact, the databases/repositories of search engines are maintained with the help of Web crawlers. Web crawlers, also known as spiders, robots, bots, etc., are programs that traverse the Web and download web documents [14]. A Web crawler starts with an initial set of URLs known as seed URLs. It downloads the web documents corresponding to the seed URLs and extracts the new links present in the downloaded documents. The downloaded web documents are stored and properly indexed in the repository so that, with the help of their indexes, they may later be retrieved when required. The URLs extracted from a downloaded web page are checked to determine whether their corresponding documents have already been downloaded. If not, the URLs are assigned to crawlers for further downloading. This process is repeated until no more URLs are left for downloading or the target number of documents has been downloaded. A crawler may download millions of web pages per day to achieve this target.
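The basic crawling loop described above can be sketched as follows. The use of Python's standard urllib, a breadth-first frontier, and a simple regular expression for link extraction are assumptions made for brevity; they are not the design proposed in this thesis.

```python
import re
import urllib.request
from collections import deque

def crawl(seed_urls, target=100):
    """Start from the seed URLs, download pages, extract new links, and skip
    URLs whose documents have already been downloaded."""
    frontier = deque(seed_urls)
    downloaded = {}                      # url -> page contents (the "repository")
    while frontier and len(downloaded) < target:
        url = frontier.popleft()
        if url in downloaded:            # already fetched: avoid a duplicate download
            continue
        try:
            page = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # unreachable or undecodable page: skip it
        downloaded[url] = page
        # Extract absolute hyperlinks and add unseen ones to the frontier.
        for link in re.findall(r'href="(http[^"]+)"', page):
            if link not in downloaded:
                frontier.append(link)
    return downloaded

# Example: crawl(["http://example.com/"], target=10)
```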

From the above discussion, it is clear that a search engine is the most popular information retrieval tool; it has the following objectives:


1. It should explore and download as many web documents from the WWW as possible.
2. It should fetch high quality documents so that the user gets the required relevant information within an acceptable time.
3. The documents must be displayed in the order of their relevance with respect to the user query.
4. As web documents are highly dynamic in nature, the search engine should update its repository as frequently as possible. The ideal case would be to synchronize repository updates with the web documents' actual change frequencies.

To satisfy the first objective, i.e., to cover as much of the Web as possible, search engines nowadays do not depend on a single crawler but on multiple crawlers that execute in parallel to achieve the target. While working in parallel, crawlers still face many challenging problems, such as overlapping, quality and network bandwidth, that need to be addressed.

Search engines employ ranking algorithms to meet the second and third objectives mentioned above. The most popular algorithm is back link count, proposed by Sergey Brin and Lawrence Page, the Google founders, in 1998 [2]. Though back link count helps in efficiently displaying documents in the order of their relevance, it fails to bring in quality documents. The reason is that it requires an image of the entire Web in terms of back link counts, which is not available to the crawler, especially when its database is in its starting or growing stage.
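A minimal sketch of back link counting over a known link graph is shown below. The toy graph is hypothetical; the point of the example is only that the count presupposes knowledge of the whole graph, which a young or growing crawl does not have.

```python
from collections import Counter

# Toy snapshot of a link graph: page -> pages it links to.
link_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def back_link_counts(graph):
    """Rank pages by the number of pages linking to them (their back links)."""
    counts = Counter()
    for source, targets in graph.items():
        for target in targets:
            counts[target] += 1
    return counts.most_common()

print(back_link_counts(link_graph))   # C has the most back links, so it ranks first
```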

The fourth objective of a search engine is to keep its database up to date with respect to the web pages maintained at the Web server end. The optimum case would be for the updating frequency to be synchronized with each web page's change frequency. In practice, it is almost impossible to find the exact change frequencies of web documents, as they change at random, following a Poisson process [15]. Nevertheless, it is equally important to find out whether a document has changed or not.
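Under the Poisson-process assumption cited above, the probability that a page with estimated change rate λ has changed within time t since the last visit is 1 − e^(−λt). The short sketch below uses this to decide whether a revisit is worthwhile; the rate and threshold values are arbitrary illustrations, not figures from the thesis.

```python
import math

def probability_changed(rate_per_day, days_since_last_crawl):
    """P(at least one change in the interval) for a Poisson change process."""
    return 1.0 - math.exp(-rate_per_day * days_since_last_crawl)

def should_recrawl(rate_per_day, days_since_last_crawl, threshold=0.5):
    """Recrawl once the estimated probability of change exceeds the threshold."""
    return probability_changed(rate_per_day, days_since_last_crawl) >= threshold

print(probability_changed(0.1, 7))   # ~0.50 for a page changing about once per 10 days
print(should_recrawl(0.1, 7))        # True with the illustrative 0.5 threshold
```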

A critical look at the available literature indicates the following issues that need to be addressed:

Issue 1: Overlapping of web documents
The overlap problem occurs when multiple crawlers running in parallel download the same web document multiple times, because one web crawler may not be aware that another has already downloaded the page. Moreover, many organizations mirror their documents on multiple servers to guard against arbitrary server corruption [21-22]. In such situations, crawlers may also unnecessarily download many copies of the same document.

Issue 2: Quality of downloaded web documents
The quality of the downloaded documents can be ensured only when web pages of high relevance are downloaded by the crawlers. Therefore, to download such relevant web pages at the earliest, the multiple crawlers running in parallel must have a global image of the collectively downloaded web pages, so that redundancy in the form of duplicate documents can be avoided.

Issue 3: Network bandwidth/traffic problem
In order to maintain quality, the crawling process is carried out using one of the following approaches:
• Crawlers are generously allowed to communicate among themselves, or
• They are not allowed to communicate among themselves at all.
With the first approach, network traffic increases because the crawlers communicate frequently among themselves to reduce the overlap problem, whereas with the second approach the same web documents may be downloaded multiple times, again consuming network bandwidth. Thus, both approaches put an extra burden on network traffic.

Issue 4: Change of web documents
The changing of web documents is a continuous process; of course, the frequency of change varies from document to document. These changes must be reflected in the search engine's repository, failing which a user may be served an obsolete web document.
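One simple way to detect such a change, sketched below, is to keep a fingerprint of each stored page and compare it against the fingerprint of a freshly downloaded copy. The choice of an MD5 checksum here is an illustrative assumption, not the change detection method proposed in this thesis.

```python
import hashlib

def checksum(page_contents: str) -> str:
    """Fingerprint of a page, used to detect whether it has changed."""
    return hashlib.md5(page_contents.encode("utf-8")).hexdigest()

def has_changed(stored_checksum: str, fresh_contents: str) -> bool:
    """Compare the stored fingerprint with that of the freshly crawled copy."""
    return checksum(fresh_contents) != stored_checksum

old = checksum("<html>old version</html>")
print(has_changed(old, "<html>old version</html>"))   # False: repository is current
print(has_changed(old, "<html>new version</html>"))   # True: repository is obsolete
```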

Main Contribution of the Thesis
The focus of this thesis is to investigate issues concerning parallel crawling and the updating of the search engine's database.
1. A survey has been conducted to study users' search trends on the WWW. It has helped to identify the major problems users face while searching for the required information on the Internet.
2. A novel architecture for an incremental parallel web crawler has been designed that helps to reduce the overlap and network bandwidth problems among crawlers working in parallel. The proposed crawler (see Figure 1) has a client-server based architecture consisting of the following main components:

• Multi Threaded server
• Client crawlers
• Change detection module

The Multi Threaded (MT) server is the main coordinating component of the architecture. On its own, it does not download any web document; instead, it manages a connection pool with the client machines, which actually download the web documents. The URL dispatcher, ranking module, URL distributor, URL allocator, indexer and repository are the sub-components of the MT server. The salient features of the MT server are given below (a minimal sketch of the server-client interaction follows the list):
• It establishes communication links between the MT server and the client crawlers.
• If a client crawler loses its connection with the MT server, the whole system is not disrupted, as the other client crawlers remain functional.
• The system is scalable: new client crawlers/machines can be added to an ongoing crawling system without any temporary halt of the overall system. The MT server automatically starts allocating URLs to the newly added client crawler without affecting the other client processes or the system as a whole.
• Once communication is established between the MT server and the clients, the URL allocator sends a seed URL to a client crawler to download the corresponding web document and waits for its response.
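The sketch below gives a highly simplified, single-process view of this interaction: a coordinator hands out URLs from a shared queue to worker threads that download the pages, so no two workers fetch the same URL. All class and variable names are invented for illustration; the actual MT server and client crawlers communicate over a network and include the ranking, distribution and change detection sub-components described above.

```python
import queue
import threading
import urllib.request

class Coordinator:
    """Toy stand-in for the MT server: holds the URL queue and the repository."""
    def __init__(self, seed_urls):
        self.url_queue = queue.Queue()
        self.repository = {}
        self.lock = threading.Lock()
        for url in seed_urls:
            self.url_queue.put(url)

def client_crawler(coordinator):
    """Toy stand-in for a client crawler: fetches URLs allocated by the coordinator."""
    while True:
        try:
            url = coordinator.url_queue.get(timeout=1)   # URL allocation
        except queue.Empty:
            return                                       # nothing left to crawl
        try:
            page = urllib.request.urlopen(url, timeout=5).read()
        except Exception:
            page = b""                                   # unreachable page
        with coordinator.lock:                           # store the result centrally
            coordinator.repository[url] = page
        coordinator.url_queue.task_done()

coordinator = Coordinator(["http://example.com/"])
workers = [threading.Thread(target=client_crawler, args=(coordinator,)) for _ in range(3)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(len(coordinator.repository), "documents downloaded")
```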