05-05-2011, 11:02 AM
Abstract
The structure of a Web site usually reflects the implicit logical relationship among Web pages, and is widely applied to Web mining and Web information retrieval. However, it is difficult for a machine to extract the structure of a Web site automatically from among varied noise hyperlinks. This paper proposes an algorithm to extract the structure of a Web site automatically based on hyperlink analysis. The algorithm identifies and filters noise hyperlinks by the patterns of the Web pages these hyperlinks connect, instead of the patterns of the hyperlinks themselves. It promises better performance than previous approaches. Preliminary results show that the proposed algorithm greatly improves both precision and recall.
Keywords- Web Site Structure, Web Mining, Hyperlink Analysis.
I. INTRODUCTION
A Web site is a collection of Web pages that are linked to each other and very often to Web pages on other Web sites. In this paper, the structure of a Web site refers to the hyperlink structure that organizes its Web pages into a hierarchy. Because this hyperlink structure usually reflects the implicit logical relationship among Web pages, it can be applied directly to extracting the relationships among the core content that the connected Web pages provide [1].
However, it is difficult for a machine to extract the structure of a Web site automatically, owing to the existence of “noise” hyperlinks. Currently, Web pages and Web sites are designed for human exploration. Besides the “semantical” hyperlinks that form the hierarchical structure of a Web site, “noise” hyperlinks are explicitly employed to give users easy access to information [2]. Although human users can easily identify these “noise” hyperlinks, they are difficult for a machine to recognize automatically, because “noise” hyperlinks and “semantical” hyperlinks exhibit the same presentation format. Moreover, patterns of “noise” hyperlinks often vary page by page and site by site. Therefore, a wrapper program that recognizes the structure of one Web site effectively may work poorly on another Web site.
This paper addresses how to filter the “noise” hyperlinks. We propose an algorithm to extract Web site structure based on hyperlink analysis, called the WSE (Web site Structure Extracting) algorithm. It is intended to provide a stepping-stone for extracting Web site structure automatically. Instead of recognizing and removing “noise” hyperlinks based on patterns of the hyperlinks, as in the literature, our approach filters “noise” hyperlinks based on patterns of the linked Web pages. The algorithm is based on the assumption that Web pages have similar hyperlink characteristics if they are sibling child nodes of the same node in the hierarchical structure of the Web site.
The rest of the paper is organized as follows. Section 2 gives an overview of previous related work. Section 3 describes the problem and proposes the WSE algorithm. Section 4 shows experimental results of the algorithm on several real-world Web sites. Section 5 presents our conclusions and future work.
II. RELATED WORK
It is natural to map a Web site to a directed graph where nodes correspond to Web pages and arcs to hyperlinks. But this graph is distinctly different from the structure of the Web site, because the occurrence of “noise” hyperlinks distorts the real structure of the Web site. Many algorithms have been proposed to recognize “noise” hyperlinks in order to form the Web site structure.
InfoDiscoverer identifies “semantically redundant” blocks of a Web page based on the information entropy value of each block [3]. Because InfoDiscoverer partitions the page by HTML tag “TABLE”,