11-02-2013, 04:29 PM
Combining Tag and Value Similarity for Data Extraction and Alignment
Abstract
Web databases generate query result pages based on a user’s query. Automatically extracting the data from these query
result pages is very important for many applications, such as data integration, which need to cooperate with multiple web databases.
We present a novel data extraction and alignment method called CTVS that combines both tag and value similarity. CTVS
automatically extracts data from query result pages by first identifying and segmenting the query result records (QRRs) in the query
result pages and then aligning the segmented QRRs into a table, in which the data values from the same attribute are put into the same
column. Specifically, we propose new techniques to handle the case when the QRRs are not contiguous, which may be due to the
presence of auxiliary information, such as a comment, recommendation or advertisement, and for handling any nested structure that
may exist in the QRRs. We also design a new record alignment algorithm that aligns the attributes in a record, first pairwise and then
holistically, by combining the tag and data value similarity information. Experimental results show that CTVS achieves high precision
and outperforms existing state-of-the-art data extraction methods.
INTRODUCTION
Online databases, called web databases, comprise the deep
web [4], [7]. Compared with webpages in the
surface web, which can be accessed by a unique URL, pages
in the deep web are dynamically generated in response to a
user query submitted through the query interface of a web
database. Upon receiving a user’s query, a web database
returns the relevant data, either structured or semistructured,
encoded in HTML pages.
Many web applications, such as metaquerying, data
integration and comparison shopping, need the data from
multiple web databases. For these applications to further
utilize the data embedded in HTML pages, automatic data
extraction is necessary. Only when the data are extracted and
organized in a structured manner, such as tables, can they be
compared and aggregated. Hence, accurate data extraction is
vital for these applications to perform correctly.
QRR EXTRACTION
Fig. 2 shows the framework for QRR extraction. Given a
query result page, the Tag Tree Construction module first
constructs a tag tree for the page rooted in the <HTML>
tag. Each node represents a tag in the HTML page and its
children are tags enclosed inside it. Each internal node n
of the tag tree has a tag string ts_n, which includes the tags
of n and all tags of n’s descendants, and a tag path tp_n,
which includes the tags from the root to n. Next, the Data
Region Identification module identifies all possible data
regions, which usually contain dynamically generated
data, top down starting from the root node. The Record
Segmentation module then segments the identified data
regions into data records according to the tag patterns in
the data regions. Given the segmented data records, the
Data Region Merge module merges the data regions
containing similar records. Finally, the Query Result Section
Identification module selects one of the merged data
regions as the one that contains the QRRs. The following
four sections describe each of the last four modules in
more detail.
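The tag tree, tag string, and tag path described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the names Node, TagTreeBuilder, and build_tag_tree are hypothetical, the tag string is assumed to list tags in pre-order, and the input is assumed to be reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; assumed list, not exhaustive.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def tag_string(self):
        # Tag of this node followed by the tags of all its descendants
        # (pre-order is an assumption made for this sketch).
        return [self.tag] + [t for c in self.children for t in c.tag_string()]

    def tag_path(self):
        # Tags on the path from the root down to this node.
        path, node = [], self
        while node is not None:
            path.append(node.tag)
            node = node.parent
        return path[::-1]

class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = None
        self.current = None

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if self.current is None:
            self.root = node
        else:
            self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the newly opened tag

    def handle_endtag(self, tag):
        # Only climb when the closing tag matches the open node.
        if self.current is not None and self.current.tag == tag:
            self.current = self.current.parent

def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

For example, for a page whose body holds a single table row with two cells, the row node would have the tag string tr, td, td and the tag path html, body, table, tr.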
Record Segmentation
To illustrate the record segmentation algorithm, assume
that in Region 1 of the artificial tag tree in Fig. 3, nodes 3, 6,
8, and 10 are similar and nodes 4, 7, and 9 are similar, while
in Region 2, nodes 12 and 13 are similar. Record segmentation
first finds tandem repeats within a data region. For
example, Region 1 in Fig. 3 can be represented as
ABABABA if we use characters A to represent an element
of the similar node set {3, 6, 8, 10} and B to represent an
element of the similar node set {4, 7, 9}. In this case, there
are two tandem repeats, AB and BA. Similarly, Region 2 in
Fig. 3 can be represented as CC, which contains only one
tandem repeat, C.
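The tandem repeat example can be reproduced with a brute-force search. The sketch below is an assumption about how such repeats might be enumerated, not the algorithm used in the paper: it takes a region already encoded as a string of one-character labels and returns every primitive unit u for which uu occurs somewhere in the string. The function names are hypothetical.

```python
def is_primitive(unit):
    # A string is primitive if it is not a repetition of a shorter string.
    return (unit + unit).find(unit, 1) == len(unit)

def tandem_repeats(region):
    """Return the distinct primitive units that occur twice in a row."""
    found = []
    n = len(region)
    for length in range(1, n // 2 + 1):
        for start in range(n - 2 * length + 1):
            unit = region[start:start + length]
            following = region[start + length:start + 2 * length]
            if unit == following and is_primitive(unit) and unit not in found:
                found.append(unit)
    return found

print(tandem_repeats("ABABABA"))  # Region 1 in the example
print(tandem_repeats("CC"))       # Region 2 in the example
```

On the two regions above this yields AB and BA for Region 1 and C for Region 2, matching the text. The quadratic scan is acceptable for the short child sequences of a single data region.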
Data Region Merge
The data region identification step may identify several data
regions in a query result page. Moreover, the actual data
records may span several data regions. In the websites we
examined, 12 percent had QRRs with different parents in
the HTML tag tree. Thus, before we can identify all the
QRRs in a query result page, we need to determine whether
any of the data regions should be merged.
Given any two data regions, we treat them as similar if
the segmented records they contain are similar. The
similarity between any two records from two data regions
is measured by the similarity of their tag strings. The
similarity between two data regions is calculated as the
average record similarity. Two data regions can be merged
into a single data region if the records in the two data
regions have an average similarity greater than or equal to 0.6,
the threshold used in [24] to judge whether two records are
similar.
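The merge test can be sketched as follows. The excerpt does not say how tag string similarity is computed, so this sketch substitutes the ratio from Python's difflib.SequenceMatcher as a stand-in, and it assumes the average is taken over all cross-region record pairs. Each record is represented by its tag string as a list of tag names; the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

MERGE_THRESHOLD = 0.6  # threshold quoted in the text

def record_similarity(ts_a, ts_b):
    # Stand-in similarity between two tag strings (lists of tag names);
    # the paper's actual measure is not given in this excerpt.
    return SequenceMatcher(None, ts_a, ts_b, autojunk=False).ratio()

def region_similarity(region_a, region_b):
    # Average similarity over every pair of records drawn from the two
    # regions (the pairing scheme is an assumption).
    pairs = list(product(region_a, region_b))
    if not pairs:
        return 0.0
    return sum(record_similarity(a, b) for a, b in pairs) / len(pairs)

def can_merge(region_a, region_b, threshold=MERGE_THRESHOLD):
    return region_similarity(region_a, region_b) >= threshold
```

With this stand-in, two regions whose records are all table rows of two or three cells would be merged, while a region of table rows and a region of unrelated div blocks would not.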