13-11-2012, 02:03 PM
Combining Tag and Value Similarity for Data Extraction and Alignment
ABSTRACT
Web databases generate query result pages based on a
user’s query. Automatically extracting the data from
these query result pages is very important for many
applications, such as data integration, which need to
cooperate with multiple web databases. We present a
novel data extraction and alignment method called CTVS
that combines both tag and value similarity. CTVS
automatically extracts data from query result pages by
first identifying and segmenting the query result records
(QRRs) in the query result pages and then aligning the
segmented QRRs into a table, in which the data values
from the same attribute are put into the same column.
Specifically, we propose new techniques to handle the
case when the QRRs are not contiguous, which may be
due to the presence of auxiliary information such as a
comment, recommendation, or advertisement, and to
handle any nested structure that may exist in the QRRs.
We also design a new record alignment algorithm that
aligns the attributes in a record, first pairwise and then
holistically, by combining the tag and data value
similarity information. Experimental results show that
CTVS achieves high precision and outperforms existing
state-of-the-art data extraction methods.
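
The pairwise alignment step described above can be illustrated with a
minimal sketch. This is not the CTVS algorithm itself, which the abstract
does not specify in detail. It only shows the general idea of scoring two
fields by a weighted mix of tag similarity and data value similarity. The
record format (a list of tag path and text pairs), the function names, the
equal weighting, the 0.5 threshold, and the greedy matching strategy are
all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def value_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for a data-type-aware measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tag_similarity(path_a: str, path_b: str) -> float:
    """Fraction of tags that match position by position, over the longer path."""
    tags_a, tags_b = path_a.split("/"), path_b.split("/")
    matches = sum(1 for x, y in zip(tags_a, tags_b) if x == y)
    return matches / max(len(tags_a), len(tags_b))


def combined_similarity(field_a, field_b, tag_weight=0.5):
    """Weighted mix of tag and value similarity (the weighting is an assumption)."""
    (path_a, value_a), (path_b, value_b) = field_a, field_b
    return (tag_weight * tag_similarity(path_a, path_b)
            + (1 - tag_weight) * value_similarity(value_a, value_b))


def align_pair(record_a, record_b, threshold=0.5):
    """Greedily pair each field of record_a with its best unused match in record_b.

    Returns a list of (index_in_a, index_in_b) pairs. Fields whose best
    score falls below the threshold stay unaligned.
    """
    used, pairs = set(), []
    for i, field_a in enumerate(record_a):
        best_j, best_score = None, threshold
        for j, field_b in enumerate(record_b):
            if j in used:
                continue
            score = combined_similarity(field_a, field_b)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


if __name__ == "__main__":
    # Hypothetical records: each field is (tag path, text value).
    record_a = [("td/b", "Data Mining Concepts"), ("td/span", "$45.00")]
    record_b = [("td/b", "Data Mining Techniques"), ("td/i", "Used"),
                ("td/span", "$39.99")]
    print(align_pair(record_a, record_b))
```

In this hypothetical example the second record carries an extra field
("Used") that the first record lacks. Because the score mixes both signals,
the price fields can still be matched through their shared tag path even
though their text differs, while the extra field is left unaligned. A
holistic step across all records, as the abstract describes, would then
reconcile such pairwise results into table columns.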