Automatic Template Extraction from Heterogeneous Web Pages full report

**seminar ideas** · 21-05-2012, 04:46 PM

INTRODUCTION
The World Wide Web (WWW) is widely used to publish
and access information on the Internet. To achieve high
publishing productivity, the web pages of many websites
are automatically populated by filling common templates
with contents. For human beings, the
templates provide readers easy access to the contents
guided by consistent structures even though the templates
are not explicitly announced. However, for machines, the
unknown templates are considered harmful because the
irrelevant terms they contain degrade the accuracy and
performance of web applications. Thus, template detection and extraction
techniques have received a lot of attention recently to
improve the performance of web applications, such as data
integration, search engines, classification of web documents,
and so on [3], [4], [12], [14], [15], [23]. For example,
biogene data are published on the Internet by many
organizations in different formats, and scientists want to
integrate these data into a unified database. For price
comparison services, the price information is gathered from
various Internet marketplaces. Good template extraction
technologies can significantly improve the performance of
these applications.

RELATED WORK
The template extraction problem can be categorized into
two broad areas. The first area is the site-level template
detection where the template is decided based on several
pages from the same site. Crescenzi et al. [10] first studied
the data extraction problem, and Yossef and
Rajagopalan [4] introduced the template detection problem.
Previously, only tags were considered to find templates but
Arasu and Garcia-Molina [3] observed that any word can be
a part of the template or contents.

Essential Paths and Templates
Given a web document collection $D = \{d_1, d_2, \ldots, d_n\}$, we
define a path set $P_D$ as the set of all paths in $D$. Note that,
since the document node is a virtual node shared by every
document, we do not consider the path of the document
node in $P_D$. The support of a path is defined as the number
of documents in $D$ that contain the path. For each
document $d_i$, we provide a minimum support threshold $t_{d_i}$.
Notice that the thresholds $t_{d_i}$ and $t_{d_j}$ of two distinct
documents $d_i$ and $d_j$, respectively, may be different. If a
path is contained by a document $d_i$ and the support of the
path is at least the given minimum support threshold $t_{d_i}$,
the path is called an essential path of $d_i$.
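
The definitions above can be illustrated with a minimal Python sketch. It assumes each document is already given as a set of path strings. The "/" separator, the function names `path_supports` and `essential_paths`, and the sample documents and thresholds are all hypothetical choices made for illustration, not taken from the paper.

```python
from collections import Counter

def path_supports(docs):
    """Count, for every path, how many documents contain it (its support).

    docs: dict mapping a document id to the set of paths in that document.
    """
    support = Counter()
    for paths in docs.values():
        support.update(set(paths))  # each document counts at most once per path
    return support

def essential_paths(docs, thresholds):
    """Return, per document, the paths whose support meets that document's threshold.

    thresholds: dict mapping a document id to its minimum support threshold.
    """
    support = path_supports(docs)
    return {
        doc_id: {p for p in paths if support[p] >= thresholds[doc_id]}
        for doc_id, paths in docs.items()
    }

# Hypothetical toy collection: paths are written as "/"-joined strings.
docs = {
    "d1": {"html", "html/body", "html/body/h1", "html/body/h1/Tech"},
    "d2": {"html", "html/body", "html/body/h1", "html/body/h1/World"},
    "d3": {"html", "html/body", "html/body/p"},
}
# Thresholds may differ between documents, as the definition allows.
thresholds = {"d1": 3, "d2": 2, "d3": 3}

result = essential_paths(docs, thresholds)
for doc_id in sorted(result):
    print(doc_id, sorted(result[doc_id]))
```

In this toy collection, "html/body/h1" has support 2, so it is an essential path of d2 (threshold 2) but not of d1 (threshold 3), even though both documents contain it.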

HTML Documents and Document Object Model
The Document Object Model (DOM) defines a standard for accessing
documents such as HTML and XML [1]. The DOM presents an HTML
document as a tree structure. The entire document is a
document node, every HTML element is an element node,
the texts in the HTML elements are text nodes, every HTML
attribute is an attribute node, and comments are comment
nodes. However, we do not distinguish the type of nodes,
since, as defined in [3], any type of node can be a part of a
template in our problem.
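
As a rough illustration of treating a document as a tree of nodes, the sketch below collects root-to-node paths from an HTML string with Python's standard `html.parser` module. It is a simplification under stated assumptions: only element and text nodes are recorded (attribute and comment nodes are skipped), the virtual document node is left out of every path, and the names `PathCollector`, `extract_paths`, and `VOID_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class PathCollector(HTMLParser):
    """Collect the root-to-node path of every element and text node."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, outermost first
        self.paths = set()  # each path is a tuple of node labels

    def handle_starttag(self, tag, attrs):
        label = f"<{tag}>"
        self.paths.add(tuple(self.stack) + (label,))
        if tag not in VOID_TAGS:
            self.stack.append(label)

    def handle_endtag(self, tag):
        # Pop only when the tag matches the innermost open element.
        if self.stack and self.stack[-1] == f"<{tag}>":
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.paths.add(tuple(self.stack) + (text,))

def extract_paths(html):
    """Return the set of paths (tuples of labels) found in an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths

sample = "<html><body><h1>Tech</h1><br><p>Hi</p></body></html>"
for path in sorted(extract_paths(sample)):
    print("/".join(path))
```

Text nodes and element nodes end up in the same path set without any type tag, which mirrors the statement that node types are not distinguished.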
