Mining Data Records in Web Pages
ABSTRACT
A large amount of information on the Web is contained in
regularly structured objects, which we call data records. Such
data records are important because they often present the essential
information of their host pages, e.g., lists of products and services.
It is useful to mine such data records in order to extract
information from them to provide value-added services. Existing
approaches to solving this problem mainly include the manual
approach, supervised learning, and automatic techniques. The
manual method is not scalable to a large number of pages.
Supervised learning needs manually prepared positive and
negative training data and thus also requires substantial human
effort. Current automatic techniques are still unsatisfactory
because of their poor performance. In this paper, we propose a
much more effective automatic technique to perform the task.
This technique is based on two important observations about data
records on the Web and a string matching algorithm. The
proposed technique is able to mine both contiguous and noncontiguous
data records. By non-contiguous data records, we
mean that two or more data records intertwine in terms of their
HTML codes. When they are displayed on a browser, each of
them appears contiguous. Existing techniques are unable to mine
such records. Our experimental results show that the proposed
technique outperforms existing techniques substantially.
Categories and Subject Descriptors
I.5 [Pattern Recognition]: statistical and structural.
H.2.8 [Database Applications]: data mining
Keywords
Web data records, Web mining
1. INTRODUCTION
A large amount of information on the Web is presented in
regularly structured objects. A list of such objects in a Web page
often describes a list of similar items, e.g., a list of products or
services. They can be regarded as database records displayed in
Web pages with regular patterns. In this paper, we also call them
data records. Mining data records in Web pages is useful because
it allows us to extract and integrate information from multiple
sources to provide value-added service, e.g., customizable Web
information gathering, comparative-shopping, meta-search, etc.
Figure 1 gives an example, which is a segment of a Web page that
lists two Apple notebooks. The full description of each notebook
is a data record. The objective of the proposed technique is to
automatically mine all the data records in a given Web page.
Figure 1. An example: two data records
Several approaches have been reported in the literature for mining
data records (or their boundaries) from Web pages. The first
approach is the manual approach. By observing a Web page and
its source code, a programmer can find some patterns and then
write a program to identify each data record. This approach is
not scalable to a large number of pages. Other approaches [2][4]
[6][7][8][9][10][12][14][16][17][18][19][21][22][23] all have
some degree of automation. They rely on some specific HTML
tags and/or machine learning techniques to separate objects.
These methods either require prior syntactic knowledge or human
labeling of specific regions in the Web page to mark them as
interesting. [10] presents an automatic method which uses a set of
heuristics and domain ontology to perform the task. Domain
ontology is costly to build (about two person-weeks for a given Web
site) [10]. [2] extends this approach by designing some additional
heuristics without using any domain knowledge. We will show in
the experiment section that the performance of this approach is
poor. [4] proposes another automatic method, which uses Patricia
tree [11] and approximate sequence alignment to find patterns
(which represent a set of data records) in a Web page. Due to the
inherent limitation of Patricia tree and inexact sequence matching,
it often produces many patterns and most of them are spurious. In
many cases, none of the actual data records is found. Again, this
method performs poorly. [19] proposes a method using clustering
and grammar induction of regular languages. As shown in [19],
the results are not satisfactory. More detailed discussions on these
and other related works will be given in the next section.
Another problem with existing automatic approaches is that they
assume that the relevant information of a data record is contained
in a contiguous segment of the HTML code. This model is
insufficient because in some Web pages, the description of one
object (or a data record) may intertwine with the descriptions of
some other objects in the HTML source code. For example, the
descriptions of two objects in the HTML source may follow this
sequence, part1 of object1, part1 of object2, part2 of object1,
part2 of object2. Thus, the descriptions of both object1 and
object2 are not contiguous. However, when they are displayed on
a Web browser, they appear contiguous to human viewers.
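To make the interleaving concrete, consider a hypothetical two-column table in which each record occupies one column: reading the source row by row yields the sequence part1 of object1, part1 of object2, part2 of object1, part2 of object2, yet a browser renders each column (record) as one contiguous block.

```python
# Hypothetical HTML in which two records interleave in the source:
# each <tr> holds one part of each record, so object1 is the left
# column and object2 is the right column when rendered.
interleaved = """
<table>
  <tr><td>object1, part1</td><td>object2, part1</td></tr>
  <tr><td>object1, part2</td><td>object2, part2</td></tr>
</table>
"""
```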
In this paper, we propose a novel and more effective method to
mine data records in a Web page automatically. The algorithm is
called MDR (Mining Data Records in Web pages). It currently
finds all data records formed by table- and form-related tags, i.e.,
table, form, tr, td, etc. A large majority of Web data records are
formed by them. Note that it is also possible to find nested data
records. Our method is based on two observations:
1. A group of data records that contains descriptions of a set of
similar objects is typically presented in a contiguous region
of a page and formatted using similar HTML tags. Such a
region is called a data record region (or data region in short).
For example, in Figure 1 two notebooks are presented in one
contiguous region. They are also formatted using almost the
same sequence of HTML tags. If we regard the HTML
formatting tags of a page as a long string, we can use string
matching to compare different sub-strings to find those similar
ones, which may represent similar objects or data records.
The problem with this approach is that the computation is
prohibitive because a data record can start from anywhere and
end anywhere. A set of data records typically does not have the
same length in terms of their tag strings because they may not
contain exactly the same pieces of information (see Figure 1).
The next observation helps us to deal with this problem.
2. The nested structure of HTML tags in a Web page naturally
forms a tag tree [3]. Our second observation is that a group of
similar data records being placed in a specific region is
reflected in the tag tree by the fact that they are under one
parent node, although we do not know which parent (our
algorithm will find out). For example, the tag tree for the page
in Figure 1 is given in Figure 2 (some details are omitted).
Each notebook (a data record) in Figure 1 is wrapped in 5 TR
nodes with their sub-trees under the same parent node
TBODY (Figure 2). The two data records are in the two dash-lined
boxes. In other words, a set of similar data records are
formed by some child sub-trees of the same parent node.
A further note is that it is very unlikely that a data record
starts inside of a child sub-tree and ends inside another child
sub-tree. Instead, it starts from the beginning of a child sub-tree
and ends at the same or a later child sub-tree. For
example, it is unlikely that a data record starts from TD* and
ends at TD# (Figure 2). This observation makes it possible to
design a very efficient algorithm to mine data records.
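The paper does not fix the comparison routine at this point; as a minimal sketch, the standard dynamic-programming edit distance over tag-name sequences, normalized by the length of the longer sequence, is one common way to score how similar two tag strings are (0.0 means identical, 1.0 means entirely different):

```python
def edit_distance(s, t):
    # Classic Levenshtein distance via dynamic programming,
    # applied to sequences of tag names rather than characters.
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # delete
                          d[i][j - 1] + 1,      # insert
                          d[i - 1][j - 1] + cost)  # match/substitute
    return d[m][n]

def normalized_distance(s, t):
    # Scale by the longer sequence so records of different lengths
    # (see Figure 1) can still be compared on a common [0, 1] scale.
    if not s and not t:
        return 0.0
    return edit_distance(s, t) / max(len(s), len(t))
```

For example, `normalized_distance(["tr", "td", "td"], ["tr", "td", "td", "td"])` is 0.25: the two tag strings differ by one inserted TD out of four positions.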
Our experiments show that these observations are true. So far, we
have not seen any Web page containing a list of data records that
violates these observations. Note that we do not assume that a
Web page has only one data region that contains data records. In
fact, a Web page may contain a few data regions. Different
regions may have different data records. Our method only
requires that a data region have two or more data records.
Given a Web page, the proposed technique works in three steps:
Step 1: Building an HTML tag tree of the page.
Step 2: Mining data regions in the page using the tag tree and
string comparison. Note that instead of mining data records
directly, which is hard, our method mines data regions first and
then finds the data records within them. For example, in Figure
2, we first find the single data region below node TBODY.
Step 3: Identifying data records from each data region. For
example, in Figure 2, this step finds data record 1 and data
record 2 in the data region below node TBODY.
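The three steps can be sketched as a top-level driver. The step implementations are passed in as callables here because the paper only details them in later sections; the function names are placeholders, not the paper's code:

```python
def mine_data_records(html, build_tag_tree, find_data_regions, find_data_records):
    # Step 1: build the HTML tag tree of the page.
    tag_tree = build_tag_tree(html)
    # Step 2: mine data regions using the tag tree and string comparison.
    regions = find_data_regions(tag_tree)
    # Step 3: identify the data records inside each data region.
    records = []
    for region in regions:
        records.extend(find_data_records(region))
    return records
```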
Figure 2. Tag tree of the page in Figure 1
Our contributions
1. A novel and effective technique is proposed to automatically
mine data records in a Web page. Extensive evaluation using a
large number of Web pages from diverse domains shows that
the proposed technique is able to produce dramatically better
results than the state-of-the-art automatic systems in [2][4]
(even without considering non-contiguous data records).
2. Our new method is able to discover non-contiguous data
records, which cannot be handled by existing methods. Our
technique is able to handle such cases because it explores the
nested structure and presentation features of Web pages.
2. RELATED WORK
Web information extraction from regularly structured data records
is an important problem. Identifying individual data records is
often the first step. So far, several attempts have been made to
deal with the problem. We discuss them below.
Related works to ours are mainly in the area of wrapper
generation. A wrapper is a program that extracts data from a
website and puts them in a database [3][12][13][16]. There are
several approaches to wrapper generation. The first approach is to
manually write a wrapper for each Web page based on observed
format patterns of the page. This approach is labor intensive and
very time consuming. It cannot scale to a large number of pages.
The second approach is wrapper induction, which uses supervised
learning to learn data extraction rules. A wrapper induction
system is typically trained by using manually labeled positive and
negative data. Thus, it still needs substantial manual effort as
labeling of data is also labor intensive and time consuming.
Additionally, for different sites or even pages in the same site, the
manual labeling process may need to be repeated. Example
wrapper induction systems include WIEN [17], Softmealy [14],
Stalker [21], WL [6], etc. Our technique requires no human
involvement. It mines data records in a page automatically. [19]
reports an unsupervised method based on clustering and grammar
induction. However, the results are unsatisfactory due to the
deficiencies of the techniques as noted in [19].
[9] reports a comparative shopping agent, which also tries to
identify each product from search results. A number of heuristic
rules are used to find individual products, e.g., price information
and required attributes from the application domain.
In [10], a more general study is made to automatically identify
data record boundaries. The method is based on 5 heuristic rules.
(1) Highest-count tags: This rule says that those highest-count
tags are more likely to be record boundary tags.
(2) Identifiable “separate” tags: This rule says that tags, such as
hr, td, tr, table, p, h1, etc., are more likely to be boundary tags.
(3) Standard deviation: This rule computes the number of
characters between each occurrence of a candidate separator
tag. Those with smallest standard deviation are assumed to be
more likely boundary tags.
(4) Repeating-tag pattern: This rule says that boundaries often
have consistent patterns of two or more adjacent tags.
(5) Ontology-matching: This rule uses domain knowledge to find
those domain-specific terms that appear once and only once
in data records.
A combination technique (based on certainty factor [10] in AI) is
used to combine the heuristics to make the final decision.
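The paper cites certainty factors [10] without giving the formula. Assuming the classic combination rule for positive evidence from early rule-based expert systems (an assumption on our part, not necessarily [10]'s exact scheme), the heuristics' individual scores for a candidate boundary tag would be merged like this:

```python
from functools import reduce

def combine_cf(cf1, cf2):
    # Classic certainty-factor combination for positive evidence:
    # each additional supporting heuristic closes part of the
    # remaining gap between the current belief and full certainty.
    return cf1 + cf2 * (1.0 - cf1)

# e.g., three heuristics voting for the same boundary tag with
# individual certainties 0.6, 0.5, and 0.4
combined = reduce(combine_cf, [0.6, 0.5, 0.4])  # 0.88
```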
[2] proposes some more heuristics to perform the task, e.g.,
sibling tag heuristic, which counts the pairs of tags that are
immediate siblings in a tag tree, and partial path heuristic, which
lists the paths from a node to all other reachable nodes and counts
the number of occurrences of each path. It is shown in [2] that the
new method (OMINI) performs better than the system in [10],
even without using domain ontology. Our technique is very
different from these tag based heuristic techniques. We will show
that our method outperforms these existing methods dramatically.
[4] proposes a method (IEPAD) to find patterns from the HTML
tag string, and then uses the patterns to extract objects. The method
uses a PAT tree (a Patricia tree [11]) to find patterns. The problem
with the PAT tree is that it is only able to find exact match of
patterns. In the context of the Web, data records are seldom
exactly the same. Thus, [4] also proposes a heuristic method
based on string alignment to find inexact matches. However, this
method results in many patterns, most of which are spurious. For
example, for the same segment of a tag string many patterns may
be found and they intersect one another. In many cases the right
patterns in the page are not found. As we will see in the
experiment section, the result of this method is also poor.
Finally, all the above automatic methods assume that each record
is contiguous in the HTML source. This is not true in some Web
pages as we will see in Sections 3.3 and 4. Our method does not
make this assumption. Thus, it is able to deal with the problem.
3. THE PROPOSED TECHNIQUE
As mentioned earlier, the proposed method has three main steps.
This section presents them in turn.
3.1 Building the HTML Tag Tree
Web pages are hypertext documents written in HTML that
consist of plain text, tags, and links to image, audio and video
files, etc. In this work, we only use tags in string comparison to
find data records. Most HTML tags work in pairs. Each pair
consists of an opening tag and a closing tag (indicated by <> and
</> respectively). Within each corresponding tag-pair, there can
be other pairs of tags, resulting in nested blocks of HTML codes.
Building a tag tree from a Web page using its HTML code is thus
natural. In our tag tree, each pair of tags is considered as one
node, and the nested pairs of tags within it are the children of the
node. This step performs two tasks:
1. Preprocessing of HTML codes: Some tags do not require
closing tags (e.g., <li> and <hr>). Hence, additional closing
tags are inserted to ensure all tags are balanced. Next, useless
or redundant tags are removed, e.g., HTML comments
(<!-- ... -->) and <script> elements.
2. Building a tag tree: It follows the nested blocks of the HTML
tags in the page to build a tag tree. This is fairly easy. We will
not discuss it further. Figure 2 shows an example.
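A minimal version of this step can be written on top of Python's standard `html.parser`. The node layout (dicts with `tag` and `children`) is our own choice for illustration, and where full preprocessing would insert missing closing tags, this sketch merely tolerates them when popping the stack:

```python
from html.parser import HTMLParser

class TagTreeBuilder(HTMLParser):
    # Tags that never take a closing tag and so never open a new level.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "children": []}
        self.stack = [self.root]   # path from the root to the open node

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed
        # tags (e.g., a missing </li>) along the way.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                return
    # Text, comments, and attributes are simply not handled: only the
    # nested tag structure matters for the string comparison to come.

builder = TagTreeBuilder()
builder.feed("<table><tr><td>a</td><td>b</td></tr><tr><td>c</td></tr></table>")
table = builder.root["children"][0]   # the <table> node, with two <tr> children
```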
3.2 Mining Data Regions
This step mines every data region in a Web page that contains
similar data records. Instead of mining data records directly,
which is hard, we first mine generalized nodes (defined below) in
a page. A sequence of adjacent generalized nodes forms a data
region. From each data region, we will identify the actual data
records (discussed in Section 3.3). Below, we define generalized
nodes and data regions using the HTML tag tree:
Definition: A generalized node (or a node combination) of length
r consists of r (r ≥ 1) nodes in the HTML tag tree with the
following two properties:
1) the nodes all have the same parent.
2) the nodes are adjacent.
The reason that we introduce the generalized node is to capture
the situation that an object (or a data record) may be contained in
a few sibling tag nodes rather than one. For example, in Figures 1
and 2, we can see that each notebook is contained in five table
rows (or 5 TR nodes). Note that we call each node in the HTML
tag tree a tag node to distinguish it from a generalized node.
Definition: A data region is a collection of two or more
generalized nodes with the following properties:
1) the generalized nodes all have the same parent.
2) the generalized nodes all have the same length.
3) the generalized nodes are all adjacent.
4) the normalized edit distance (string comparison) between
adjacent generalized nodes is less than a fixed threshold.
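These four properties translate directly into a predicate. In this sketch each generalized node is represented as a pair `(r, tags)`, where `r` is its length in tag-tree nodes and `tags` its flattened tag sequence; `difflib` stands in for the paper's normalized edit distance, the 0.3 threshold is ours for illustration, and properties 1 and 3 (shared parent, adjacency) are assumed to hold by construction of the input:

```python
from difflib import SequenceMatcher

def normalized_distance(a, b):
    # 1 minus difflib's similarity ratio: a stand-in for the
    # normalized edit distance of the definition.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def is_data_region(gnodes, threshold=0.3):
    # gnodes: adjacent sibling generalized nodes, each given as (r, tags).
    if len(gnodes) < 2:                    # two or more generalized nodes
        return False
    if len({r for r, _ in gnodes}) != 1:   # all of the same length
        return False
    return all(                            # adjacent pairs are similar
        normalized_distance(a, b) < threshold
        for (_, a), (_, b) in zip(gnodes, gnodes[1:])
    )
```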