Research on Discovering Deep Web Entries
Abstract
Ontology plays an important role in locating Domain-Specific Deep Web
content. This paper therefore presents WFF, a novel framework for
efficiently locating Domain-Specific Deep Web databases based on focused
crawling and ontology, built from a Web Page Classifier (WPC), a Form
Structure Classifier (FSC) and a Form Content Classifier (FCC) arranged
in a hierarchical fashion. First, the WPC discovers potentially
interesting pages using an ontology-assisted focused crawler. Then, the
FSC analyzes these pages and determines, from structural characteristics,
whether they contain searchable forms. Finally, the FCC identifies
searchable forms that belong to a given domain at the semantic level and
stores the URLs of Domain-Specific searchable forms in a database. A
detailed experimental evaluation shows that the WFF framework not only
simplifies the discovery process but also effectively identifies
Domain-Specific databases.
Keywords: Deep Web, ontology, WPC, FSC, FCC.
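The hierarchical three-stage filtering described in the abstract can be sketched as follows. The classifier internals below are stand-in heuristics for a hypothetical book domain, not the paper's actual WPC/FSC/FCC models; only the stage-by-stage filtering structure is taken from the text:

```python
# Minimal sketch of the hierarchical WFF pipeline: WPC -> FSC -> FCC.
# Each stage filters the candidates passed on by the previous one.

def wpc_is_interesting(page):
    """Web Page Classifier: ontology-assisted topical filter (toy stub)."""
    return "book" in page.lower()

def fsc_has_searchable_form(page):
    """Form Structure Classifier: structural check for searchable forms (stub)."""
    return "<form" in page.lower()

def fcc_in_domain(page):
    """Form Content Classifier: semantic, domain-level check (stub)."""
    return "title" in page.lower() and "author" in page.lower()

def discover_entries(pages):
    """Return URLs of Domain-Specific searchable forms, filtered hierarchically."""
    found = []
    for url, page in pages.items():
        if (wpc_is_interesting(page)
                and fsc_has_searchable_form(page)
                and fcc_in_domain(page)):
            found.append(url)  # stored in a database in the real system
    return found
```

Each stage is cheaper to pass through than the next, so the hierarchy discards clearly irrelevant pages before the more expensive semantic check runs.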
1. Introduction
With the rapid development of the web, more and more information has moved
from static web pages (the Surface Web) into web databases (the Deep Web)
managed by web servers[1][2]. As Fig. 1 conceptually illustrates, on this
so-called “Deep Web”, numerous online databases provide dynamic
query-based data access through their query interfaces instead of
static URL links[3]. The data in the Deep Web are of great value, but
difficult to query and search. With new web databases constantly added,
and old ones modified or removed, manual classification is a laborious
and time-consuming task, so it is imperative to accelerate research on
effectively discovering which searchable databases are most likely to
contain the relevant information a user is looking for.
Ying Wang, Huilai Li, Wanli Zuo, Fengling He, Xin Wang, and Kerui Chen
ComSIS Vol. 8, No. 3, June 2011
Fig. 1. The Deep Web provides dynamic query-based data access through
query interfaces
Discovering Deep Web entries is the first significant step in integrating
Deep Web data. To assist users in accessing the Deep Web, recent efforts
have focused on two kinds of approaches for discovering Deep Web entries
automatically: Pre-Query and Post-Query[4].
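As a rough illustration of the Pre-Query idea, a form can be judged searchable from its markup structure alone, without submitting any query. The heuristic below (at least one free-text field plus a submit control) is an assumption chosen for the example, not a criterion taken from any of the cited systems:

```python
# Sketch: decide from structure whether a page holds a searchable form.
from html.parser import HTMLParser

class FormScanner(HTMLParser):
    """Collects simple structural features of <form> elements on a page."""
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.text_inputs = 0
        self.has_submit = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag == "input":
            if a.get("type", "text") == "text":
                self.text_inputs += 1
            elif a.get("type") == "submit":
                self.has_submit = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def looks_searchable(html_page):
    scanner = FormScanner()
    scanner.feed(html_page)
    # Assumed heuristic: one free-text field and a submit control.
    return scanner.text_inputs >= 1 and scanner.has_submit
```

A Post-Query system would instead fill in such a form, submit it, and classify the database from the returned result pages.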
Pre-Query identifies web databases by analyzing the wide variation in the
content and structure of forms. In 2005, Barbosa L. and Freire J.[5]
propose FFC, a crawling framework that automatically locates Deep Web
databases by focusing the search on a given topic, learning to identify
promising links, and using appropriate stop criteria to avoid
unproductive searches within individual sites. However, this method has
limitations: it requires substantial manual tuning, and the form set
retrieved by FFC is very heterogeneous. Two years later, Barbosa L. and
Freire J.[6][7][8] present ACHE, a new framework that addresses these
limitations by automatically and accurately classifying online databases
based on features that can be easily extracted from web forms. Manuel
Alvarez et al.[9] describe the architecture of DeepBot, a prototype
hidden-web focused crawler able to access Deep Web content. Their
approach is based on a set of domain
definitions, each one describing a data-collecting task. From the domain
definition, the system uses several heuristics to automatically identify
relevant query forms. Hui Wang and Wanli Zuo[10] propose a three-step
framework to automatically identify domain-specific hidden Web entries.
The obtained query interfaces can then be integrated into a unified
interface for users to query. Li Yingjun et al.[11] propose a
Domain-Oriented Deep Web data source Discovery method (DO-DWD) and
a novel Domain Identification strategy for Deep Web data sources (DIDW).
In the discovery stage, machine learning algorithms and heuristic rules
are used to find the query interfaces of data sources; in the
identification stage, Deep Web data sources associated with the domain
are identified by calculating the relevance between a query interface and
the domain based on semantic similarity. Pengyi Zhang et al.[12] propose
a novel hybrid approach to
construct a collection of government Deep Web resources. It combines
automatic computation power and human intelligence through social
computing. This approach offers the opportunity to build information
structures on Deep Web portals in a scalable and sustainable manner.
However, most of the above approaches do not exploit background
knowledge, which is important for understanding problems and situations.
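To make the role of background knowledge concrete, the toy sketch below scores a page against domain *concepts* and their associated terms, rather than a flat keyword list; the miniature book-domain ontology is invented for the example and is far simpler than the ontologies such systems actually use:

```python
# Toy illustration of ontology-assisted relevance scoring.
# Each concept maps to a set of terms that signal the concept's presence.
BOOK_ONTOLOGY = {
    "book":  {"isbn", "title", "author", "publisher"},
    "price": {"price", "discount", "currency"},
}

def ontology_relevance(text):
    """Fraction of ontology concepts with at least one term in the text."""
    words = set(text.lower().split())
    hits = sum(1 for terms in BOOK_ONTOLOGY.values() if words & terms)
    return hits / len(BOOK_ONTOLOGY)
```

A focused crawler can use such a score to prioritize links: pages scoring above a threshold are fetched first, steering the crawl toward the domain of interest.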
The Post-Query approach identifies web databases from the results
retrieved by submitting probing queries to forms. In 2003, Luis Gravano
and Panagiotis G. Ipeirotis[13] introduce QProber, a modular system that
automates the classification process using a small number of query
probes generated by document classifiers. However, this approach relies on
a pre-learned set of queries for database classification. Additionally,
if new categories are added to or old categories removed from the
hierarchy, new probes must be learned and each source re-probed. Five
years later, Luis
Gravano and Panagiotis G.Ipeirotis[14] present a novel “focused-probing”
sampling algorithm that detects the topics covered in a database and
adaptively extracts documents that are representative of the topic coverage
of the database. However, if the topic is not self-contained, then it will affect
the database selection. Victor Z. Liu et al.[15] develop a probabilistic
approach that uses dynamic probing (issuing the user query to the databases on
the fly) in a systematic way, so that the correctness of database selection is
significantly improved while the meta-searcher contacts the minimum number
of databases. However, when the user does not care about the answer’s
correctness, the method is not applicable. Lu Jiang et al.[16] propose a
novel Deep Web crawling method based on diverse features. They argue that
the key to Deep Web crawling is to submit promising keywords to the query
form so as to retrieve Deep Web content efficiently. Each keyword is
encoded as a tuple of its linguistic, statistical and HTML features, so
that a harvest-rate evaluation model can be learned from the issued
keywords and applied to un-issued ones. One year later, Lu Jiang et
al.[17] propose a novel Deep Web
crawling framework based on reinforcement learning, in which the crawler is
regarded as an agent and deep web database as the environment. The agent
perceives its current state and selects an action (query) to submit to the
environment according to its Q-value. The framework not only enables the
crawler to learn a promising crawling strategy from its own experience,
but also allows diverse features of query keywords to be utilized.
However, submitting a large number of queries solely for the purpose of
classification wastes network and server resources.
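The agent-environment loop of the reinforcement-learning formulation can be sketched as below. The simulated database, the learning constants, and the reward definition (count of newly harvested records) are assumptions for illustration; a simple periodic exploration schedule also stands in for the policy the paper would actually use:

```python
# Sketch of Q-value-driven query selection for Deep Web crawling.
# `db` simulates a web database: each keyword query returns a record set.

def crawl(db, keywords, steps=50):
    """Harvest records by repeatedly choosing the query with the highest Q-value."""
    q = {k: 0.0 for k in keywords}   # one Q-value per candidate query
    seen = set()                     # records harvested so far
    alpha = 0.5                      # learning rate (assumed)
    for t in range(steps):
        if t % 5 == 0:               # periodic exploration (simplified
            k = keywords[(t // 5) % len(keywords)]   # stand-in for eps-greedy)
        else:
            k = max(keywords, key=q.get)   # exploit: highest Q-value
        new = db.get(k, set()) - seen      # reward: newly retrieved records
        seen |= new
        q[k] += alpha * (len(new) - q[k])  # move Q toward observed reward
    return seen
```

Because a query's reward shrinks as its results are exhausted, the Q-values decay and the crawler naturally shifts to queries that still yield fresh records.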
From the analysis above, the Post-Query approach cannot be adapted to
structured multi-attribute forms[18], so it is difficult for it to obtain
good classification results. In contrast, the Pre-Query method, which
depends on the visual features of searchable forms, namely attribute
labels and other available resources, is able to deal with highly
heterogeneous form sets and is usually used to indicate the database
domain. That is to say, the discovery of Deep Web entries can be
translated into the problem of distinguishing query forms. In this paper,
we apply the Pre-Query approach to automatically classify
Domain-Specific forms by incorporating focused crawling and ontology
techniques. The paper is organized as follows:
Section 2 presents an overview of discovering Deep Web entries, including
the problem formulation and the WFF framework. Section 3 presents the
process of the WFF framework in discovering Deep Web entries. Section 4
presents the experimental results of the WFF framework. Finally,
Section 5 draws conclusions and considers future work.