21-05-2012, 04:46 PM
Automatic Template Extraction from Heterogeneous Web Pages
INTRODUCTION
The World Wide Web (WWW) is widely used to publish
and access information on the Internet. In order to
achieve high productivity of publishing, the webpages in
many websites are automatically populated by using
common templates with contents. For human readers,
templates provide easy access to the contents through
consistent structures, even though the templates are not
explicitly announced. However, for machines, unknown
templates are considered harmful because the irrelevant
terms in templates degrade accuracy and
performance. Thus, template detection and extraction
techniques have received a lot of attention recently to
improve the performance of web applications, such as data
integration, search engines, classification of web documents,
and so on [3], [4], [12], [14], [15], [23]. For example,
biogene data are published on the Internet by many
organizations in different formats, and scientists want to
integrate these data into a unified database. For price
comparison services, the price information is gathered from
various Internet marketplaces. Good template extraction
technologies can significantly improve the performance of
these applications.
RELATED WORK
The template extraction problem can be categorized into
two broad areas. The first area is site-level template
detection, where the template is determined from several
pages of the same site. Crescenzi et al. [10] initially
studied the data extraction problem, and Bar-Yossef and
Rajagopalan [4] introduced the template detection problem.
Previously, only tags were considered when finding templates,
but Arasu and Garcia-Molina [3] observed that any word can be
part of the template or the contents.
Essential Paths and Templates
Given a web document collection D = {d_1, d_2, ..., d_n}, we
define a path set P_D as the set of all paths in D. Note that,
since the document node is a virtual node shared by every
document, we do not consider the path of the document
node in P_D. The support of a path is defined as the number
of documents in D that contain the path. For each
document d_i, we provide a minimum support threshold t_{d_i}.
Notice that the thresholds t_{d_i} and t_{d_j} of two distinct
documents d_i and d_j, respectively, may be different. If a
path is contained in a document d_i and the support of the
path is at least the given minimum support threshold t_{d_i},
the path is called an essential path of d_i.
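
As an illustration of the definitions above, the following Python sketch computes the support of each path and the essential paths of each document. It assumes every document has already been reduced to a set of path strings. The function names, the "/"-separated path notation, and the sample documents and thresholds are hypothetical and are not taken from the paper.

```python
from collections import Counter


def path_support(docs):
    """Count, for each path, how many documents contain it."""
    support = Counter()
    for paths in docs.values():
        # A set ensures each document contributes at most once per path.
        support.update(set(paths))
    return support


def essential_paths(docs, thresholds):
    """For each document, keep the paths whose support is at least
    that document's own minimum support threshold."""
    support = path_support(docs)
    return {
        name: {p for p in set(paths) if support[p] >= thresholds[name]}
        for name, paths in docs.items()
    }


# Hypothetical collection D = {d1, d2, d3}, each given as a set of paths.
docs = {
    "d1": {"html/body/h1", "html/body/h1/Tech", "html/body/p/World"},
    "d2": {"html/body/h1", "html/body/h1/Tech", "html/body/p/Local"},
    "d3": {"html/body/h1", "html/body/h1/Sports", "html/body/p/Local"},
}
# Thresholds may differ from one document to another.
thresholds = {"d1": 3, "d2": 2, "d3": 2}

result = essential_paths(docs, thresholds)
for name in sorted(result):
    print(name, sorted(result[name]))
```

With these sample values, "html/body/h1" has support 3 and is essential for every document, while "html/body/h1/Tech" has support 2 and is essential for d2 but not for d1, whose threshold is 3.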
HTML Documents and Document Object Model
The Document Object Model (DOM) defines a standard for
accessing documents such as HTML and XML [1]. The DOM
represents an HTML document as a tree structure. The entire
document is a document node, every HTML element is an
element node, the text in HTML elements forms text nodes,
every HTML attribute is an attribute node, and comments are
comment nodes. However, we do not distinguish between node
types, since, as defined in [3], any type of node can be a
part of a template in our problem.
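
To make the tree view concrete, here is a minimal sketch that collects root-to-node paths from an HTML string using Python's standard html.parser module. It treats elements, attributes, text words, and comments alike, in the spirit of the description above. The class name PathCollector, the "/"-separated notation, and the "@name" and "#comment" labels are assumptions made for illustration and are not the paper's representation. The virtual document node is left out of the paths.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects one path string per node, regardless of node type."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, outermost first
        self.paths = set()  # all paths seen so far

    def _add(self, label):
        self.paths.add("/".join(self.stack + [label]))

    def handle_starttag(self, tag, attrs):
        self._add(tag)
        self.stack.append(tag)
        for name, _value in attrs:
            self._add("@" + name)       # attribute node
        if tag in VOID_TAGS:
            self.stack.pop()            # void elements close immediately

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open element.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        for word in data.split():
            self._add(word)             # one path per word of a text node

    def handle_comment(self, data):
        self._add("#comment")           # comment node


def extract_paths(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.paths


sample = "<html><body><h1 class='t'>Tech News</h1><!-- ad --><br></body></html>"
for path in sorted(extract_paths(sample)):
    print(path)
```

A stack-based parser is used here instead of a full DOM library to keep the sketch dependency-free. Malformed markup is handled only loosely, by ignoring end tags that have no matching open element.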