Extracting Structure of Web Site Based on Hyperlink Analysis

seminar class · 05-05-2011, 11:02 AM

Abstract
The structure of a Web site usually reflects the implicit logical relationships among its Web pages, and is widely applied in Web mining and Web information retrieval. However, it is difficult for a machine to extract the structure of a Web site automatically because of varied noise hyperlinks. This paper proposes an algorithm to extract the structure of a Web site automatically based on hyperlink analysis. The algorithm identifies and filters noise hyperlinks by the patterns of the Web pages that these hyperlinks connect, instead of the patterns of the hyperlinks themselves. It promises better performance than previous approaches. Preliminary results show that the proposed algorithm greatly improves both precision and recall.
Keywords- Web Site Structure, Web Mining, Hyperlink Analysis.
I. INTRODUCTION
A Web site is a collection of Web pages that are linked to each other and very often to Web pages on other Web sites. In this paper, the structure of a Web site refers to the hyperlink structure that is used to organize its Web pages in a hierarchy. Because this hyperlink structure usually reflects the implicit logical relationships among Web pages, it is directly applied to extracting relationships among the core content that the connected Web pages provide [1].
However, it is difficult for a machine to extract the structure of a Web site automatically, owing to the existence of “noise” hyperlinks. Currently, Web pages and Web sites are designed for human exploration. Besides the “semantical” hyperlinks that form the hierarchical structure of a Web site, “noise” hyperlinks are explicitly employed to give users easy access to information [2]. Although these “noise” hyperlinks can be easily identified by human users, they are difficult for a machine to recognize automatically. The reason is that “noise” hyperlinks and “semantical” hyperlinks exhibit the same presentation format. Moreover, patterns of “noise” hyperlinks often vary page by page and site by site. Therefore, a wrapper program that effectively recognizes the structure of one Web site may work ineffectively on another Web site.
This paper addresses how to filter the “noise” hyperlinks. We propose an algorithm to extract Web site structure based on hyperlink analysis, called the WSE (Web site Structure Extracting) algorithm. It is intended to provide a stepping-stone for extracting Web site structure automatically. Instead of recognizing and removing “noise” hyperlinks based on patterns of hyperlinks, as in the literature, our approach filters “noise” hyperlinks based on patterns of the linked Web pages. The algorithm is based on the assumption that Web pages have similar hyperlink characteristics if they are sibling child nodes of the same node in the hierarchical structure of the Web site.
The rest of the paper is organized as follows. Section 2 gives an overview of previous related work. Section 3 describes the problem and proposes the WSE algorithm. Section 4 shows experimental results of the algorithm on several real-world Web sites. Section 5 presents our conclusions and future work.
II. RELATED WORK
It is natural to map a Web site to a directed graph where nodes correspond to Web pages and arcs to hyperlinks. But this graph is distinctly different from the structure of a Web site, because the occurrence of “noise” hyperlinks distorts the real structure of the Web site. To recognize “noise” hyperlinks and form the Web site structure, many algorithms have been proposed.
InfoDiscoverer identifies “semantically redundant” blocks of a Web page based on the information entropy value of each block [3]. Because InfoDiscoverer partitions the page by the HTML tag “TABLE”,
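
The directed-graph view of a Web site and the sibling assumption from the Introduction can be illustrated with a small sketch. This is not the WSE algorithm, whose details are not given in this excerpt. The function names, the toy URLs, and the use of Jaccard similarity over outlink sets as a measure of "similar hyperlink characteristics" are all assumptions made purely for illustration.

```python
def build_link_graph(links):
    """Map (source, target) hyperlink pairs to a dict of page -> set of outlinks."""
    graph = {}
    for source, target in links:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # pages without outlinks are still nodes
    return graph


def outlink_similarity(graph, page_a, page_b):
    """Jaccard similarity of two pages' outlink sets.

    Assumption: Jaccard similarity is only a stand-in here; the excerpt does
    not say how the paper compares hyperlink characteristics.
    """
    a = graph.get(page_a, set())
    b = graph.get(page_b, set())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical toy site: the two article pages are siblings under "/news",
# and each carries the same navigation links back to "/news" and "/".
links = [
    ("/", "/news"), ("/", "/sports"),
    ("/news", "/news/1"), ("/news", "/news/2"), ("/news", "/"),
    ("/news/1", "/news"), ("/news/1", "/"),
    ("/news/2", "/news"), ("/news/2", "/"),
]
site = build_link_graph(links)

print(outlink_similarity(site, "/news/1", "/news/2"))
print(outlink_similarity(site, "/news/1", "/sports"))
```

In this toy site the two sibling pages share an identical outlink set, while the page from a different branch shares none. That is the kind of signal the sibling assumption relies on; a real site would need a more tolerant comparison.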
