1457613581-IJCAESCSE2012073.pdf
Abstract— The Web, the world's largest unstructured database, has greatly improved access to documents. As the number of Internet users and of accessible web pages grows, it is becoming increasingly difficult for users to find documents relevant to their particular needs. Users must either browse through a hierarchy of concepts to find the information they need, or submit a query to a search engine and wade through hundreds of results, most of them irrelevant.
Web crawlers are among the most crucial components used by search engines to collect pages from the Web; they are the search engine's intelligent means of browsing. Building a web crawler that downloads only the most relevant pages from such a large web remains a major challenge in the field of Information Retrieval. Most web crawlers use a keyword-based approach for retrieving information from the Web, but they retrieve many irrelevant pages as well.
We propose a novel method of addressing this issue without compromising the relevance of the documents retrieved by the crawler. The proposed technique makes use of semantics, provided by ontologies, to download only relevant pages. An ontology-based web crawler applies ontological engineering concepts to improve its crawling performance. The crawler, guided by an ontology describing the domain of interest, crawls the Web focusing on pages relevant to a given topic ontology. As a result, ontologies found during the crawl will be relevant to the domain and produce a set of candidate mappings with the topic ontology. The crawler prioritizes the URL queue so that more relevant pages are crawled first, based on domain-dependent ontologies. It also explores the possibility of using the semantic nature of a URL, obtained from the ontology tree, to filter the URL queue.
The main advantage of an ontology-based web crawler over other focused crawlers is that it needs no relevance feedback or training procedure in order to act intelligently. Moreover, both the number of extracted documents and the crawling time are reduced, leading to greater search efficiency.
I. INTRODUCTION
A web crawler is a relatively simple, automated program or script that methodically scans, or "crawls", through Internet pages to create an index of the data it is looking for.
For indexing, crawler-based engines usually consider many more factors than those found on the web pages themselves. Before putting a web page into its index, a crawler will examine how many other pages in the index link to that page, the anchor text of those links, the page rank of the linking pages, whether the page appears in directories under related categories, and so on. These "off-the-page" factors carry considerable weight when a page is evaluated by a crawler-based engine. While a page's developer can, in theory, artificially increase its relevance for certain keywords by adjusting the corresponding HTML, the developer has much less control over the other pages on the Internet that link to that page.
Thus off-the-page relevance prevails in the crawler's eyes, leading to the following problems faced by general web crawlers:
1. The Web is growing in size day by day; an estimated 600 GB of text changes every month. Since a general crawler fetches each and every page, it requires a large storage area and consumes a great deal of time.
2. The hardware requirements for the crawler (CPU, disks, etc.) are very high.
3. Crawlers cover only 30-40% of the Web.
4. A search engine that uses such a general crawler often returns result lists in which even the first few links are not relevant to the topic.
Crawlers have been around for a long time and have proven their usefulness and success on the Web. Nonetheless, these general-purpose crawlers are not sufficient to tackle the stated problem: they crawl the Web in a blind, exhaustive manner. Since our goal is to find very specific data on the Web, this exhaustive approach will not find the requested information given the current size of the Web.
Therefore, in this paper we propose a focused crawling process based on a domain specification, so that the crawler is guided to relevant information and no time is wasted on irrelevant resources. Such ontology-based crawlers are also referred to as preferential or heuristic-based crawlers; the heuristic we use in our proposed solution is ontology matching. Since the goal is to find information resources on the Web, we expect most of these resources to be semantically annotated and linked using an ontology hierarchy. The algorithm we use to develop an ontology-based web crawler solves, to an optimal level, the major problem of determining the relevance of pages before crawling them. It presents an intelligent focused crawling algorithm in which an ontology is embedded to evaluate each page's relevance to the topic against a relevancy limit. Raman et al. [1] and Chang et al. [6] have also presented intelligent crawler algorithms.
II. ONTOLOGY
An ontology is a formal, explicit specification of a shared conceptualization. An ontology provides a common vocabulary for an area and defines, with differing levels of formality, the meaning of terms and the relationships between them. Ontologies were developed in Artificial Intelligence to facilitate knowledge sharing and reuse. Since the early 1990s, ontologies have become a popular research topic, studied by several Artificial Intelligence research communities, including knowledge engineering, natural-language processing, and knowledge representation. Ontologies underpin, for example, a Semantic Web-based knowledge management architecture and a suite of innovative tools for semantic information processing [2].
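As a toy illustration (not taken from the paper), an ontology in this sense can be represented as a set of concepts plus typed relations between them; the concept and relation names below are invented for demonstration only:

```python
# A minimal ontology: a common vocabulary (concepts) and the
# relationships between its terms (typed relations).
ontology = {
    "concepts": ["Vehicle", "Car", "Engine"],
    "relations": [
        ("Car", "is-a", "Vehicle"),      # subsumption hierarchy
        ("Car", "has-part", "Engine"),   # part-whole relation
    ],
}

def related(concept, relation):
    """Return the concepts linked to `concept` by `relation`."""
    return [obj for subj, rel, obj in ontology["relations"]
            if subj == concept and rel == relation]
```

In practice such structures are authored in richer formalisms (e.g. OWL, edited in tools like Protégé), but the core idea of terms plus relations is the same.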
III. THE PROPOSED ALGORITHM OVERVIEW
In the ontology-based web crawler, each web page is first checked for validity (i.e., that it is of type html, php, jsp, etc.). If it is valid, it is parsed and the parsed content is matched against the ontology. If the page is relevant, it is indexed; otherwise it is discarded.
Hence the algorithm is as follows:
1. Get the seed URL.
2. If the web page is valid, i.e. of a defined type (html, php, jsp, etc.), add it to the queue.
3. Parse the content.
4. If the server response is OK, read the Protégé ontology file and match the content of the web page against the terms of the ontology.
5. Compute the relevance score of the web page, add the page to the index, and cache its file to a folder. Searching can then be done with the help of the cache and the index.
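The five steps above can be sketched as follows. This is an illustrative outline only, not the authors' implementation: the valid page types, the term weights, the relevance threshold, and the `fetch` callback are all assumptions, and link extraction from fetched pages is omitted for brevity.

```python
from collections import deque
from urllib.parse import urlparse

VALID_TYPES = (".html", ".php", ".jsp")   # step 2: defined page types (assumed)
ONTOLOGY_TERMS = {"crawler": 3, "ontology": 5, "semantic": 4}  # term -> weight (assumed)
RELEVANCE_THRESHOLD = 5                   # minimum score to index (assumed)

def is_valid(url):
    """Step 2: accept only pages of a defined type."""
    path = urlparse(url).path.lower()
    return path.endswith(VALID_TYPES) or path in ("", "/")

def relevance_score(text):
    """Steps 4-5: sum the weights of ontology terms found in the page."""
    words = set(text.lower().split())
    return sum(weight for term, weight in ONTOLOGY_TERMS.items()
               if term in words)

def crawl(seed_url, fetch):
    """Steps 1-5. `fetch(url)` returns the page text, or None if the
    server response is not OK. Returns the index of relevant pages."""
    queue, index = deque([seed_url]), {}
    while queue:
        url = queue.popleft()
        if not is_valid(url):             # step 2
            continue
        text = fetch(url)                 # steps 3-4
        if text is None:
            continue
        score = relevance_score(text)     # step 5
        if score >= RELEVANCE_THRESHOLD:
            index[url] = score            # index (caching omitted here)
    return index
```

A real implementation would also extract links from each indexed page and enqueue them in priority order, as described in the abstract.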
V. FUTURE SCOPE
Though we believe our proposed crawler covers everything an efficient crawler needs, there is still room for improvement. In our relevancy calculation algorithm, the weight of each ontology term must be set manually; a mechanism could be devised that, after reading the ontology and visiting certain web pages, derives the weights automatically. The processing time of the crawler can also be improved. Finally, in our algorithm the ontology remains static; it could instead evolve dynamically, adding new concepts and relations while visiting web pages.
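One possible shape for the automatic weighting mentioned above, sketched purely as an assumption rather than the mechanism the authors envision, is a document-frequency scheme: weight each ontology term by how often it appears across pages already visited.

```python
from collections import Counter

def learn_weights(ontology_terms, visited_pages):
    """Weight each ontology term by the fraction of visited pages
    that contain it (a simple document-frequency heuristic)."""
    df = Counter()
    for text in visited_pages:
        words = set(text.lower().split())
        for term in ontology_terms:
            if term in words:
                df[term] += 1
    n = len(visited_pages)
    return {term: df[term] / n for term in ontology_terms} if n else {}
```

More elaborate schemes (e.g. TF-IDF-style weighting) would follow the same pattern of updating weights as pages are visited.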
VI. CONCLUSION
The main aim of our paper is to retrieve relevant web pages and discard irrelevant ones. We have developed an ontology-based crawler that retrieves web pages according to a relevancy calculation algorithm and discards irrelevant pages. In doing this we have used the concept of an ontology, which provides the meaning of terms and the relationships between them. We believe our proposed crawler will not only help by exploring fewer web pages, so that only relevant pages are retrieved, but will also be an important component of the future 'Semantic Web', which is set to become very popular in the years to come. Hence, the improved crawler suggested in this paper can help in application areas such as social networking portals and online library book-information systems, and can add to their benefits in their respective fields.