04-07-2012, 10:56 AM
Focused Crawler Optimization Using Genetic Algorithm
Abstract
As the size of the Web continues to grow, searching it for useful information has become more
difficult. A focused crawler intends to explore only the parts of the Web that conform to a specific topic. This
paper discusses the problems caused by local searching algorithms: a crawler can be trapped within a limited
Web community and overlook suitable Web pages outside its track. A genetic algorithm, as a global searching
algorithm, is modified to address these problems. The genetic algorithm is used to optimize Web crawling and
to select more suitable Web pages for the crawler to fetch. Several evaluation experiments are conducted to
examine the effectiveness of the approach.
Introduction
Nowadays the Web has become a huge information source that attracts people from
all over the world. For a Web crawler, one of the most important parts of a search engine,
searching through so many documents to select the compatible ones is a tedious task.
Moreover, the Web, which contains more than 11 million pages, still keeps growing and
changing rapidly.
A focused crawler [1] is used to selectively collect smaller collections of Web pages
on a particular topic with high precision. A focused crawler tries to predict whether a
target URL points to a relevant Web page before actually fetching it. Focused crawlers rely
on two kinds of algorithms to keep the crawling process on track. First, a Web analysis
algorithm evaluates the quality and relevance of the Web pages pointed to by target URLs. Second,
a Web searching algorithm determines the optimal order in which the target URLs are visited.
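The interplay of these two algorithms can be sketched as a best-first frontier: a relevance score (the Web analysis step) orders a priority queue of unvisited URLs (the Web searching step). The scoring function below, based on simple keyword overlap in anchor text, is a hypothetical stand-in for the paper's actual analysis algorithm.

```python
import heapq

def relevance(anchor_text, topic_keywords):
    """Toy Web analysis step: fraction of topic keywords found in the anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & topic_keywords) / len(topic_keywords)

class Frontier:
    """Toy Web searching step: a priority queue that yields the best-scored URL first."""
    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker so heapq never has to compare URL strings

    def push(self, url, score):
        # heapq is a min-heap, so negate the score for best-first ordering.
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

topic = {"genetic", "algorithm", "crawler"}
f = Frontier()
f.push("http://example.org/a", relevance("genetic algorithm tutorial", topic))
f.push("http://example.org/b", relevance("cooking recipes", topic))
print(f.pop())  # the more relevant URL is fetched first
```

A best-first crawler built this way is exactly the kind of local search the paper critiques: it greedily follows the highest-scoring links and can stay trapped in one Web community.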
Proposed Genetic Algorithm
In this paper, a GA is used to improve the quality of the search results in focused crawling.
A GA is an adaptive heuristic method for solving optimization and search problems. A GA
exploits several techniques inspired by biological evolution, such as inheritance, selection, crossover,
and mutation. GAs are members of the evolutionary algorithms, which belong to the rapidly
growing area of Artificial Intelligence.
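The classic operators named above can be illustrated on conventional bit-string individuals. This minimal sketch (not the paper's algorithm) runs one generation at a time on the toy "one-max" problem of maximizing the number of ones; the rates and population sizes are illustrative assumptions.

```python
import random

def step(population, fitness, crossover_rate=0.8, mutation_rate=0.05):
    """One GA generation over bit-string individuals (lists of 0/1)."""
    # Selection: fitness-proportionate (roulette-wheel) sampling of parents.
    weights = [fitness(ind) for ind in population]
    parents = random.choices(population, weights=weights, k=len(population))
    next_gen = []
    for i in range(0, len(parents) - 1, 2):
        a, b = parents[i][:], parents[i + 1][:]
        # Crossover: swap the tails of two parents at a random cut point.
        if random.random() < crossover_rate:
            cut = random.randrange(1, len(a))
            a[cut:], b[cut:] = b[cut:], a[cut:]
        next_gen += [a, b]
    # Mutation: flip each bit with a small probability.
    for ind in next_gen:
        for j in range(len(ind)):
            if random.random() < mutation_rate:
                ind[j] = 1 - ind[j]
    return next_gen

# Toy run: evolve toward all-ones strings.
random.seed(0)
pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]
for _ in range(30):
    pop = step(pop, fitness=sum)
print(max(sum(ind) for ind in pop))
```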
Because it is hard to represent Web pages as bit strings, and conventional genetic
operators cannot be applied directly in the Web context, the focused crawler is designed based
on the previous study by Chen et al. [10]. The flowchart of the GA-crawler is shown in Figure 2.
Although the GA-crawler does not add new terms as the Gcrawler [16] and the MultiCrawler Agent
(MCA) [17] do, it is expected to maintain good tracking throughout Web links. In a different field,
a multi-objective GA for generator-contribution-based congestion management was proposed by
Sen et al. [18]; that algorithm optimizes both real and reactive losses using an optimal power flow
model. GAs have been successfully implemented in many applications [19-20], and they are often
modified to solve specific problems.
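One way the operators can be adapted away from bit strings, in the spirit of the approach above, is to let them act on sets of URLs: crossover merges the outlinks of two parent pages, and mutation injects previously unseen URLs (for instance, drawn from a search-engine query on the topic keywords) so the crawler can jump out of its current Web community. This sketch is a hypothetical illustration of that idea, not the paper's exact design.

```python
import random

def crossover(parent_a, parent_b):
    """Adapted crossover: the child inherits outlink URLs from both parent
    pages instead of swapping bit-string segments."""
    return set(parent_a) | set(parent_b)

def mutate(urls, reservoir, rate=0.1):
    """Adapted mutation: occasionally inject unseen URLs from an external
    reservoir, letting the crawler escape a local Web community."""
    mutated = set(urls)
    for url in reservoir:
        if random.random() < rate:
            mutated.add(url)
    return mutated

random.seed(42)
a = {"http://site-a/1", "http://site-a/2"}
b = {"http://site-b/1"}
child = crossover(a, b)
child = mutate(child, reservoir=["http://elsewhere/x", "http://elsewhere/y"])
print(sorted(child))
```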
Results and Analysis
The numbers of collections in different categories were compared. The collections built
by the traditional focused crawler contained about 3,000 nodes (Web pages), divided into
five different categories, while the collections built by the GA-crawler contained about
3,300 nodes, also divided into five categories. The additional Web pages were derived
from the mutation phase of the GA. The GA-crawler was expected to visit more compatible
Web communities than the best-first search crawler and the traditional focused crawler.
Some indicators of a crawler's performance are Web searching precision (the relevance
of the Web pages collected by the crawler), the crawl's scope, speed, and robustness, and
the total amount of resources used in crawling the Web. Several experiments were conducted
using different starting URLs and keywords for each category. They took place on an Intel
Dual Core CPU T4200 running at 2.0 GHz with 1 GB RAM, using an enhanced 3G (3.5G) network
called High Speed Downlink Packet Access (HSDPA), which supports downlink speeds up to
3.6 Mbps, for the Internet connection. The first 100 Web pages within each category were
examined to calculate the precision of the crawling process. The crawler's precision was
measured by checking each Web page's relevance against the starting URL and the category
or keyword that was presented.
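The precision measure described above reduces to a simple ratio: the fraction of the first 100 fetched pages judged relevant. A minimal sketch, with hypothetical relevance judgments standing in for the manual checks used in the experiments:

```python
def precision_at_n(pages, is_relevant, n=100):
    """Fraction of the first n fetched pages judged relevant to the topic."""
    top = pages[:n]
    return sum(1 for p in top if is_relevant(p)) / len(top)

# Hypothetical judgments: 1 marks a page relevant to the starting category.
fetched = [1] * 72 + [0] * 28  # e.g. 72 of the first 100 pages are on-topic
print(precision_at_n(fetched, is_relevant=lambda p: p == 1))  # → 0.72
```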