19-02-2013, 09:09 AM
Closing the Loop in Webpage Understanding
Abstract
The two most important tasks in information extraction from the Web are webpage structure understanding and natural
language sentence processing. However, little work has been done toward an integrated statistical model for understanding webpage
structures and processing natural language sentences within the HTML elements. Our recent work on webpage understanding
introduces a joint model of Hierarchical Conditional Random Fields (HCRFs) and extended Semi-Markov Conditional Random Fields
(Semi-CRFs) to leverage the page structure understanding results in free text segmentation and labeling. In this top-down integration
model, the decision of the HCRF model can guide the decision making of the Semi-CRF model. However, the drawback of the top-down
integration strategy is also apparent, i.e., the decision of the Semi-CRF model cannot be used by the HCRF model to guide its
decision making. This paper proposes a novel framework called WebNLP, which enables bidirectional integration of page structure
understanding and text understanding in an iterative manner. We have applied the proposed framework to local business entity
extraction and Chinese person and organization name extraction. Experiments show that the WebNLP framework achieved
significantly better performance than existing methods.
INTRODUCTION
The World Wide Web contains huge amounts of data.
However, we cannot benefit very much from the large
amount of raw webpages unless the information within
them is extracted accurately and organized well. Therefore,
information extraction (IE) [1], [2], [3] plays an important
role in Web knowledge discovery and management.
Among various information extraction tasks, extracting
structured Web information about real-world entities (such
as people, organizations, locations, publications, products)
has received much attention of late [4], [5], [6], [7], [8].
However, little work has been done toward an integrated
statistical model for understanding webpage structures and
processing natural language sentences within the HTML
elements of the webpage. Our recent work on Web object
extraction has introduced a template-independent approach
to understand the visual layout structure of a webpage and
to effectively label the HTML elements with attribute names
of an entity [9], [10].
Motivating Example
We have been working on local entity extraction to increase
the data coverage of the Windows Live Local search service
by automatically extracting structured information about
local businesses from the crawled webpages. In Fig. 1, we
show an example webpage containing local entity information.
As we can see, the address information of the local
business on the webpage is regularly formatted in a visually
structured block: the first line of the block contains the
business name in bold font; the second line contains the
street information; the third line contains the city, state, and
zip code. Such a structured block containing multiple
attribute values of an object is called an object block. We
can use the HCRF algorithm [9] together with the Semi-CRF
algorithm [12] to detect the object block first and then label
the attributes within the block [11].
PROBLEM DEFINITION
This paper aims at introducing a joint framework that can
segment and label both the structure layout and text in the
webpage. In this section, we first introduce the data
representation of the structure layout of the webpage and
the text content within the webpage. Then, we formally
define the webpage understanding problem.
Data Representation
We use the VIPS approach to segment a webpage into
visually coherent blocks [16]. VIPS makes use of page layout
features, such as client region, font, color, and size, to
construct a vision tree representation of the webpage.
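As a rough illustration of what a vision tree might look like as a data structure, the sketch below models each visually coherent block as a node carrying a few layout features. It is not the VIPS algorithm itself: the class name VisionNode, its fields, and the toy page are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VisionNode:
    """One visually coherent block; every field name here is hypothetical."""
    text: str = ""                      # text content (set on leaf blocks only)
    font_size: int = 12                 # example layout feature
    bold: bool = False                  # example layout feature
    children: List["VisionNode"] = field(default_factory=list)
    label: Optional[str] = None         # slot for a label from a structure model

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List["VisionNode"]:
        """Return the leaf blocks in document order."""
        if self.is_leaf():
            return [self]
        result: List["VisionNode"] = []
        for child in self.children:
            result.extend(child.leaves())
        return result


# Toy page (made-up data): a three-line address block plus a footer block.
address_block = VisionNode(children=[
    VisionNode(text="Example Cafe", bold=True),
    VisionNode(text="123 Main Street"),
    VisionNode(text="Springfield, IL 62701"),
])
page = VisionNode(children=[address_block, VisionNode(text="Contact us")])
leaf_texts = [leaf.text for leaf in page.leaves()]
```

Inner nodes group related blocks (here, the address block), while the leaves hold the text that a text model would segment and label.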
The Extended Models
As we introduced previously, the state-of-the-art models for
webpage structure understanding and text understanding
are the HCRF model and the Semi-CRF model, respectively.
However, there is no way to make them interact with each
other in their original forms. Therefore, we extend them by
introducing additional input parameters to the feature
functions. The original forms of the HCRF model and the
Semi-CRF model have been introduced in Section 4. Therefore,
we will only introduce the forms of the extended HCRF
model and the extended Semi-CRF model in this section.
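The idea of adding input parameters to feature functions can be sketched as follows. This is only a toy illustration, not the feature set used in the paper: the function names, the label strings ("NAME", "BUSINESS_NAME", "FOOTER"), and the segment encoding are all assumptions.

```python
from typing import List, Tuple

# Hypothetical segment encoding: (start, end, label), with end exclusive.
Segment = Tuple[int, int, str]


def is_capitalized_name(tokens: List[str], seg: Segment) -> float:
    """Original-style text feature: sees only the tokens and the segment."""
    start, end, label = seg
    words = tokens[start:end]
    fires = label == "NAME" and all(w[:1].isupper() for w in words)
    return 1.0 if fires else 0.0


def name_in_name_block(tokens: List[str], seg: Segment, node_label: str) -> float:
    """Extended text feature: also receives the structure model's block label."""
    fires = is_capitalized_name(tokens, seg) == 1.0 and node_label == "BUSINESS_NAME"
    return 1.0 if fires else 0.0


def block_contains_name(segments: List[Segment], node_label: str) -> float:
    """Extended structure feature: also receives the text model's segmentation."""
    has_name = any(label == "NAME" for _, _, label in segments)
    return 1.0 if (node_label == "BUSINESS_NAME" and has_name) else 0.0


tokens = ["Example", "Cafe"]
seg = (0, 2, "NAME")
score_match = name_in_name_block(tokens, seg, "BUSINESS_NAME")
score_mismatch = name_in_name_block(tokens, seg, "FOOTER")
score_structure = block_contains_name([seg], "BUSINESS_NAME")
```

The extra argument is what lets one model's decision influence the other: the same text segment scores differently depending on the label assigned to the block that contains it.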
CONCLUSIONS
Webpage understanding plays an important role in Web
search and mining. It contains two main tasks, i.e., page
structure understanding and natural language understanding.
However, little work has been done toward an
integrated statistical model for understanding webpage
structures and processing natural language sentences within
the HTML elements.
In this paper, we introduced the WebNLP framework for
webpage understanding. It enables bidirectional integration
of page structure understanding and natural language
understanding. Specifically, the WebNLP framework is
composed of two models, i.e., the extended HCRF model
for structure understanding and the extended Semi-CRF
model for text understanding. The performance of both
models can be boosted in the iterative optimization procedure.
An auxiliary corpus is introduced to train the
statistical language features in the extended Semi-CRF
model for text understanding, and multiple-occurrence
features are also used in the extended Semi-CRF model by
adding the decision of the model in the last iteration. Therefore,
the extended Semi-CRF model is improved by using, as additional
input parameters in some feature functions, both the
labels of the vision nodes assigned by the HCRF model and
the text segmentation and labeling results given by the
extended Semi-CRF model itself in the last iteration.
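The iterative, bidirectional procedure summarized above can be sketched as a simple alternating loop. This is a minimal sketch under stated assumptions, not the paper's optimization procedure: the stopping rule (stop when neither output changes), the function closed_loop, and the two rule-based stand-in models are all hypothetical and replace the real extended HCRF and Semi-CRF models.

```python
from typing import Callable, Dict, List, Tuple

NodeLabels = Dict[int, str]          # node id -> structure label
TextLabels = Dict[int, List[str]]    # node id -> per-token labels


def closed_loop(
    label_structure: Callable[[TextLabels], NodeLabels],
    label_text: Callable[[NodeLabels, TextLabels], TextLabels],
    max_iters: int = 10,
) -> Tuple[NodeLabels, TextLabels, int]:
    """Alternate the two models until neither output changes (assumed rule)."""
    node_labels: NodeLabels = {}
    text_labels: TextLabels = {}
    iterations = 0
    for _ in range(max_iters):
        iterations += 1
        # Structure model sees the text decisions from the last iteration.
        new_nodes = label_structure(text_labels)
        # Text model sees the new block labels and its own previous output.
        new_text = label_text(new_nodes, text_labels)
        converged = new_nodes == node_labels and new_text == text_labels
        node_labels, text_labels = new_nodes, new_text
        if converged:
            break
    return node_labels, text_labels, iterations


# Made-up page: token lists keyed by node id.
TOKENS = {0: ["Example", "Cafe"], 1: ["Contact", "us"]}


def toy_structure(text_labels: TextLabels) -> NodeLabels:
    """Stand-in structure model: a block holding a NAME is a business name."""
    return {
        node: "BUSINESS_NAME" if "NAME" in text_labels.get(node, []) else "OTHER"
        for node in TOKENS
    }


def toy_text(node_labels: NodeLabels, previous: TextLabels) -> TextLabels:
    """Stand-in text model; this toy version ignores its previous output."""
    result: TextLabels = {}
    for node, words in TOKENS.items():
        all_caps = all(w[:1].isupper() for w in words)
        in_name_block = node_labels.get(node) == "BUSINESS_NAME"
        result[node] = [
            "NAME" if w[:1].isupper() and (all_caps or in_name_block) else "O"
            for w in words
        ]
    return result


final_nodes, final_text, n_iters = closed_loop(toy_structure, toy_text)
```

In this toy run the structure model cannot label the first block until the text model has found a name inside it, which is the kind of feedback a purely top-down pipeline cannot provide.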