08-12-2012, 01:47 PM
Tree Wrap-data Extraction Using Tree Matching Algorithm
1Tree Wrap-data Extraction.pdf (Size: 376.15 KB / Downloads: 52)
ABSTRACT:
In this paper, we develop a non-visual automatic wrapper to extract data records from search engine results pages
which contain important information for computer users. Our wrapper consists of a series of data filter to detect and
remove irrelevant data from the web page. In the filtering stages, we incorporate two main algorithms which are able
to check the similarity of data records and to detect and extract the correct data region based on their component sizes.
To evaluate the performance of our algorithm, we carry out experimental and deletion tests. Experimental tests show
that our wrapper outperforms the existing state of the art wrappers such as ViNT and DEPTA. Deletion studies by
replacing our novel techniques with state of the art conventional techniques show that our wrapper design is efficient
and could robustly extract data records from search engine results pages. With the speed advantages, our wrapper
could be beneficial in processing large amount of web sites data, which could be helpful in meta search engine
development.
INTRODUCTION
The extraction of relevant data from a target source
is called Information Extraction. The target source can
be a natural language source or structured records (data
records) which usually contain important information.
Therefore, there is a need to develop wrappers to
extract these structured records. Wrappers developed
recently are mostly fully automated and they could
have significant speed advantages when processing
large volumes of web site data, therefore they could be
helpful in meta search engine development [1], [2], [3]
and in comparing and evaluating shopping lists [4].
Non visual wrappers use tree matching algorithm to
check the similarity of data records by comparing the
position and identity of each node in the trees (data
structure to represent the data records’ structure in a
tree form) to remove irrelevant data records with
dissimilar structure [5], [20], [4]. However, the
implementation and coding of the algorithm are
complicated [4]. This algorithm also runs in a time
complexity of O(n1n2) where n1 is the number of
nodes in the first tree and n2 is the number of nodes in
the second tree. In general, most web pages consist of
complex trees with a large number of nodes. Therefore,
these complexities slow down the current tree matching
algorithm.
RELATED WORK
The key component of a wrapper is the algorithm
that checks the similarity of data records. Data records
are retained and considered valid if they are similar and
discarded if they are dissimilar. Current wrappers such
as Data Extraction based on Partial Tree Alignment
(DEPTA) [4] and Mining Data Region (MDR) [35] use
edit distance techniques to check the similarity of the
structure of data records. Common edit distance
techniques in such area are string edit distance and tree
edit distance [26], [8]. For more information on edit
distance techniques the readers are encouraged to refer
to the surveys of Baeza-Yates [26], Gusfield [8] and
Navarro [14].
The string edit distance algorithm involves
matching two strings and the determination of how the
first string is to be transformed into the second string.
String edit distance algorithms are generally fast in
operation and run in a time complexity of O(m) (m is
the number of tags in a data record). However, these
algorithms are unable to compare two trees having
nearly similar tree structures, with iterative and
disjunctive data. This is because these algorithms
match flat level data, which occur in single level
(strings) rather than tree structures. String edit distance
algorithms are also unable to distinguish HTML Tag as
a single entity (they tend to compare strings by
examining the characters in these strings), therefore this
may result in inaccurate matching.
THE IMPLEMENTATION OF TREEWRAP
Overview of Tree Wrap
In this section we discuss the requirements and the
assumptions made for Tree Wrap. For Tree Wrap to
work successfully the sample pages used for data
extraction should be obtained from a search engine
query and each of these sample pages must contain at
least three data records. Tree Wrap however, does not
require the HTML page to be converted to XHTML
format as the parser can recognize the HTML format.
The first component involves parsing the HTML page
and organizing it into the DOM tree representation. In
the second component, Tree Wrap extracts data records
using dummy tree matching algorithm and scoring
function. Component 1 is described in Section 3.2.1.
Detailed description of Component 2 is presented in
Section 3.2.2 which includes set of filtering rules.
Method of Evaluation
HTML web page parsing is a difficult task.
Therefore, it is very unlikely that a publicly available
parser can achieve 100% parsing rate. For those web
pages that the parser failed to parse, we rule them out
from consideration in our evaluation. The experiment
was conducted with a PC specification of Pentium 4 2.4
Ghz, with 1GB of RAM memory. The measures of
wrapper’s efficiency are based on three factors, the
number of actual data records to be extracted, the
number of extracted data records from the test cases,
and the number of correct data records extracted from
the test cases.
CONCLUSIONS
In this study we propose a non-visual wrapper Tree
Wrap which is able to extract data records from
structured web pages. Our results show that our
wrapper is able to obtain results as well as and in most
cases better than the current state of the art visual
wrappers such as ViNT and DEPTA. Our approach
uses a set of filtering methods based on the DOM Tree
structure of data records and a more accurate algorithm
to calculate the space occupied by the data region.