07-05-2014, 02:09 PM
Annotating Search Results from Web Databases
Abstract
An increasing number of databases have become web accessible through HTML form-based search interfaces. The data
units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the
encoded data units to be machine processable, which is essential for many applications such as deep web data collection and Internet
comparison shopping, they need to be extracted out and assigned meaningful labels. In this paper, we present an automatic
annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the
same semantics. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final
annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result
pages from the same web database. Our experiments indicate that the proposed approach is highly effective.
INTRODUCTION
A large portion of the deep web is database based, i.e., for
many search engines, data encoded in the returned
result pages come from the underlying structured databases.
Such search engines are often referred to as Web databases
(WDBs). A typical result page returned from a WDB
has multiple search result records (SRRs). Each SRR contains
multiple data units each of which describes one aspect of a
real-world entity. Fig. 1 shows three SRRs on a result page
from a book WDB. Each SRR represents one book with
several data units, e.g., the first book record in Fig. 1 has
data units “Talking Back to the Machine: Computers and
Human Aspiration,” “Peter J. Denning,” etc.
In this paper, a data unit is a piece of text that semantically
represents one concept of an entity. It corresponds to the
value of a record under an attribute. It is different from a text
node which refers to a sequence of text surrounded by a pair
of HTML tags. Section 3.1 describes the relationships
between text nodes and data units in detail. In this paper,
we perform data unit level annotation.
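The distinction between text nodes and data units can be illustrated with a minimal sketch. The HTML snippet, the year value, and the "/" separator below are invented for illustration and are not taken from Fig. 1; the point is only that a single text node may hold more than one data unit.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects every non-empty run of text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.text_nodes = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.text_nodes.append(text)


def extract_text_nodes(html):
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.text_nodes


# Hypothetical SRR markup: the second text node packs two data units together.
snippet = (
    '<td><a href="#">Talking Back to the Machine</a>'
    "<br>Peter J. Denning / 1999</td>"
)

nodes = extract_text_nodes(snippet)

# Assumed separator "/" splits the composite text node into data units.
units = [part.strip() for part in nodes[1].split("/")]
```

Here `nodes` holds two text nodes, while the record carries three data units (title, author, year), which is why annotation at the data unit level needs a splitting step of some kind.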
RELATED WORK
Web information extraction and annotation has been an
active research area in recent years. Many systems [18],
[20] rely on human users to mark the desired information
on sample pages and label the marked data at the same
time, and then the system can induce a series of rules
(wrapper) to extract the same set of information on
webpages from the same source. These systems are often
referred to as wrapper induction systems. Because of the
supervised training and learning process, these systems
can usually achieve high extraction accuracy. However,
they suffer from poor scalability and are not suitable for
applications [24], [31] that need to extract information from
a large number of web sources.
Embley et al. [8] utilize ontologies together with several
heuristics to automatically extract data in multirecord
documents and label them. However, ontologies for different
domains must be constructed manually. Mukherjee et al.
[25] exploit the presentation styles and the spatial locality
of semantically related items, but their learning process
for annotation is domain dependent. Moreover, a seed of
instances of semantic concepts in a set of HTML documents
needs to be hand labeled. These methods are not fully
automatic.
Data Unit and Text Node Features
We identify and use five common features shared by the
data units belonging to the same concept across all SRRs,
and all of them can be automatically obtained. It is not
difficult to see that all these features are applicable to text
nodes, including composite text nodes involving the same
set of concepts, and template text nodes.
Alignment Algorithm
Our data alignment algorithm is based on the assumption
that attributes appear in the same order across all SRRs on
the same result page, although the SRRs may contain
different sets of attributes (due to missing values). This is
true in general because the SRRs from the same WDB are
normally generated by the same template program. Thus, we
can conceptually consider the SRRs on a result page in a table
format where each row represents one SRR and each cell
holds a data unit (or empty if the data unit is not available).
Each table column, in our work, is referred to as an alignment
group, containing at most one data unit from each SRR. If an
alignment group contains all the data units of one concept
and no data unit from other concepts, we call this group
well-aligned. The goal of alignment is to move the data units in the
table so that every alignment group is well-aligned, while the
order of the data units within every SRR is preserved.
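The table view and the well-aligned condition described above can be sketched as follows. This is not the paper's alignment algorithm, only a check of its goal; the concept labels attached to each data unit are hypothetical ground truth used for verification, and all names and sample values are assumptions.

```python
from typing import List, Optional, Tuple

# A data unit is (text, concept); None marks an empty cell.
DataUnit = Tuple[str, str]
Row = List[Optional[DataUnit]]


def alignment_groups(table: List[Row]) -> List[List[Optional[DataUnit]]]:
    """Return the table's columns; each column is one alignment group."""
    width = max(len(row) for row in table)
    return [
        [row[i] if i < len(row) else None for row in table]
        for i in range(width)
    ]


def is_well_aligned(table: List[Row], col: int) -> bool:
    """True if the column holds all units of exactly one concept and no others."""
    groups = alignment_groups(table)
    concepts = {unit[1] for unit in groups[col] if unit is not None}
    if len(concepts) != 1:
        return False
    concept = next(iter(concepts))
    # No unit of this concept may sit in any other column.
    for i, group in enumerate(groups):
        if i != col and any(u is not None and u[1] == concept for u in group):
            return False
    return True


def order_preserved(before: List[Row], after: List[Row]) -> bool:
    """True if every SRR keeps its data units in the same left-to-right order."""
    def units(row: Row) -> List[DataUnit]:
        return [u for u in row if u is not None]

    return len(before) == len(after) and all(
        units(b) == units(a) for b, a in zip(before, after)
    )


# Second SRR lacks an author, so its price slips into the author column.
misaligned = [
    [("Title A", "title"), ("Author A", "author"), ("$10", "price")],
    [("Title B", "title"), ("$12", "price"), None],
]
# Moving the price one cell to the right fixes the columns.
aligned = [
    [("Title A", "title"), ("Author A", "author"), ("$10", "price")],
    [("Title B", "title"), None, ("$12", "price")],
]

before_ok = [is_well_aligned(misaligned, c) for c in range(3)]
after_ok = [is_well_aligned(aligned, c) for c in range(3)]
kept_order = order_preserved(misaligned, aligned)
```

In the misaligned table only the title column passes the check; after the shift all three columns pass, and the order of units within each SRR is unchanged.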
CONCLUSION
In this paper, we studied the data annotation problem and
proposed a multiannotator approach to automatically
constructing an annotation wrapper for annotating the search
result records retrieved from any given web database. This
approach consists of six basic annotators and a probabilistic
method to combine the basic annotators. Each of these
annotators exploits one type of feature for annotation, and
our experimental results show that each annotator is useful
and that together they are capable of generating high-quality
annotation. A special feature of our method is that, when
annotating the results retrieved from a web database, it
utilizes both the local interface schema (LIS) of the web
database and the integrated interface schema (IIS) of
multiple web databases in the same domain. We also
explained how the use of the IIS can help alleviate the local
interface schema inadequacy problem and the inconsistent
label problem.
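As a rough illustration of combining several annotators probabilistically, the sketch below assumes that annotators err independently and scores a label by the probability that at least one annotator proposing it is right. This excerpt does not give the exact combination formula, so the rule, the function name, and the confidence values are assumptions rather than the paper's method.

```python
from collections import defaultdict
from math import prod


def combine_annotations(votes):
    """Combine (label, confidence) votes for one alignment group.

    Assumes independent annotators: a label's score is
    1 - product(1 - p) over the confidences p of annotators proposing it.
    Returns the best label and the score of every proposed label.
    """
    by_label = defaultdict(list)
    for label, confidence in votes:
        by_label[label].append(confidence)
    scores = {
        label: 1 - prod(1 - p for p in confidences)
        for label, confidences in by_label.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical votes from three annotators on the same group.
votes = [("Title", 0.8), ("Title", 0.5), ("Author", 0.6)]
best_label, label_scores = combine_annotations(votes)
```

With these made-up confidences, two moderately reliable annotators agreeing on "Title" outweigh a single annotator proposing "Author".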