19-05-2012, 11:10 AM
Data Mining for Web Intelligence
Abstract:
A maximal forward reference of a Web user is a longest consecutive sequence of Web pages visited by the user in a session without revisiting a previously visited page in the sequence. Efficient mining of frequent traversal path patterns, i.e., large reference sequences of maximal forward references, from very large Web logs is a fundamental problem in Web mining. This chapter aims at designing algorithms for this problem with the best possible efficiency. First, two optimal linear-time algorithms are designed for finding maximal forward references from Web logs. Second, two algorithms for mining frequent traversal path patterns are devised with the help of a fast construction of “shallow” generalized suffix trees over a very large alphabet. These two algorithms have provable linear and sublinear time complexity, respectively, and their performance is analyzed in comparison with the Apriori-like algorithms and the Ukkonen algorithm. It is shown that the two new algorithms are substantially more efficient than both the Apriori-like algorithms and the Ukkonen algorithm.
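To make the definition of a maximal forward reference concrete, here is a minimal Python sketch. It assumes one common reading of the definition: a revisit to a page already on the current path is treated as the user backtracking to that page, and the path built up to that point is emitted. The function name and the example session are illustrative assumptions, and this is not the chapter's own algorithm.

```python
from typing import Hashable, List, Sequence


def maximal_forward_references(session: Sequence[Hashable]) -> List[list]:
    """Split one session into maximal forward references (sketch).

    Assumption: revisiting a page on the current path means the user
    backtracked to it; the forward path built so far is emitted once.
    """
    path: list = []          # current forward path from the session start
    position = {}            # page -> index in path, for fast revisit checks
    moving_forward = False   # True if the previous click extended the path
    result: List[list] = []

    for page in session:
        if page in position:
            # Backward reference: emit the path only if we were moving forward.
            if moving_forward:
                result.append(list(path))
            keep = position[page] + 1
            for dropped in path[keep:]:
                del position[dropped]
            del path[keep:]  # truncate the path back to the revisited page
            moving_forward = False
        else:
            position[page] = len(path)
            path.append(page)
            moving_forward = True

    # A session that ends while moving forward yields one last reference.
    if moving_forward:
        result.append(list(path))
    return result


# Hypothetical session in which each letter stands for one page.
example_session = list("ABCDCBEGHGWAOUOV")
example_references = maximal_forward_references(example_session)
```

For the hypothetical session above, `example_references` holds the paths ABCD, ABEGH, ABEGW, AOU, and AOV. Frequent traversal path patterns would then be mined from such paths across many sessions.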
Introduction
Through the billions of Web pages created with HTML and XML, or generated dynamically by underlying Web database service engines, the Web captures almost all aspects of human endeavor and provides a fertile ground for data mining. However, searching, comprehending, and using the semistructured information stored on the Web pose a significant challenge because this data is more sophisticated and dynamic than the information that commercial database systems store. To supplement keyword-based indexing, which forms the cornerstone for Web search engines, researchers have applied data mining to Web-page ranking. In this context, data mining helps Web search engines find high-quality Web pages [1] and enhances Web clickstream analysis [2]. For the Web to reach its full potential, however, we must improve its services, make it more comprehensible, and increase its usability. As researchers continue to develop data mining techniques, we believe this technology will play an increasingly important role in meeting the challenges of developing the intelligent Web.
WHY DATA MINING?
The Web—an immense and dynamic collection of pages that includes countless hyperlinks and huge volumes of access and usage information—provides a rich and unprecedented data mining source. However, the Web also poses several challenges to effective resource and knowledge discovery:
• Web page complexity far exceeds the complexity of any traditional text document collection. Although the Web functions as a huge digital library, the pages themselves lack a uniform structure and contain far more authoring style and content variations than any set of books or traditional text-based documents. Moreover, the tremendous number of documents in this digital library has not been indexed, which makes searching the data it contains extremely difficult.
• The Web constitutes a highly dynamic information source. Not only does the Web continue to grow rapidly, the information it holds also receives constant updates. News, stock market, service center, and corporate sites revise their Web pages regularly. Linkage information and access records also undergo frequent updates.
• The Web serves a broad spectrum of user communities. The Internet’s rapidly expanding user community connects millions of workstations. These users have markedly different backgrounds, interests, and usage purposes. Many lack good knowledge of the information network’s structure, are unaware of a particular search’s heavy cost, frequently get lost within the Web’s ocean of information, and can chafe at the many access hops and lengthy waits required to retrieve search results.
• Only a small portion of the Web’s pages contain truly relevant or useful information. A given user generally focuses on only a tiny portion of the Web, dismissing the rest as uninteresting data that serves only to swamp the desired search results. How can a search identify that portion of the Web that is truly relevant to one user’s interests? How can a search find high-quality Web pages on a specified topic?
Data mining holds the key to uncovering and cataloging the authoritative links, traversal patterns, and semantic structures that will bring intelligence and direction to our Web interactions.
Currently, users can choose from three major approaches when accessing information stored on the Web:
• keyword-based search or topic-directory browsing with search engines such as Google or Yahoo, which use keyword indices or manually built directories to find documents with specified keywords or topics;
• querying deep Web sources—where information, such as amazon.com’s book data and realtor.com’s real-estate data, hides behind searchable database query forms—that, unlike the surface Web, cannot be accessed through static URL links; and
• random surfing that follows Web linkage pointers.
The success of these techniques, especially with the more recent page ranking in Google and other search engines [3], shows the Web’s great promise to become the ultimate information system.
Design challenges
Defining how to design an intelligent Web presents a major research challenge. Achieving our vision of the Web’s potential requires overcoming two fundamental problems. First, at the abstraction level, the traditional schemes for accessing the immense amounts of data that reside on the Web fundamentally assume the text-oriented, keyword-based view of Web pages. We believe a data-oriented abstraction will enable a new range of functionalities. Second, at the service level, we must replace the current primitive access schemes with more sophisticated versions that can exploit the Web fully.
Access limitations
Although keyword-, address-, and topic-based Web search engines already support information searches, data mining will play an important role in Web intelligence because the Web’s current incarnation still cannot provide high-quality, intelligent services. Several factors contribute to this problem and motivate our research.
Other data mining tasks for Web intelligence
Many other promising data mining methods can help achieve effective Web intelligence. Customizing service to a particular individual requires tracing that person’s Web traversal history to build a profile, then providing intelligent, personalized Web services based on that information. To date, some Web-based e-commerce service systems, such as amazon.com and expedia.com, register every user’s past traversal or purchase history and build customer profiles from that data. Based on a user’s profile and preferences, these sites select appropriate sales promotions and recommendations, thereby providing better quality service than sites that do not track and store this information. Using data mining to find a user’s purchase or traversal patterns can further enhance these services.
Although a personalized Web service based on a user’s traversal history could help recommend appropriate services, a system usually cannot collect enough information about a particular individual to warrant a quality recommendation. Either the traversal history has too little historical information about that person, or the possible spectrum of recommendations is too broad to set up a history for any one individual. For example, many people make only a single book purchase, thus providing insufficient data to generate a reliable pattern. In this case, collaborative filtering is effective because it does not rely on a particular individual’s past experience but on the collective recommendations of the people who share patterns similar to the individual being examined. Thus, if people who have preferences similar to those of a given individual buy book A, they are likely to buy books B and C as well. The site could then recommend B and C to that individual. This approach generates quality recommendations by evaluating collective effort rather than basing recommendations on only one person’s past experience. Indeed, collaborative filtering has been used as a data mining method for Web intelligence [12].
Data mining for Web intelligence will be an important research thrust in Web technology, one that makes it possible to fully use the immense information available on the Web. However, we must overcome many research challenges before we can make the Web a richer, friendlier, and more intelligent resource that we can all share and explore.
Computer, November 2002
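As a rough illustration of the collaborative filtering idea discussed in this section (recommending books B and C to someone who has bought only book A), the following Python sketch scores unseen items by the similarity of the users who bought them. The Jaccard similarity measure, the function names, and the purchase data are all assumptions made for the example; the text does not specify how similarity between users is computed.

```python
from collections import defaultdict
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two purchase sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target: str, purchases: Dict[str, Set[str]], top_n: int = 2) -> List[str]:
    """Rank items the target has not bought by similarity-weighted votes."""
    owned = purchases[target]
    scores: Dict[str, float] = defaultdict(float)
    for user, items in purchases.items():
        if user == target:
            continue
        weight = jaccard(owned, items)
        if weight == 0.0:
            continue  # users with nothing in common contribute no votes
        for item in items - owned:
            scores[item] += weight
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical purchase histories; "alice" has bought only book A.
purchases = {
    "alice": {"A"},
    "bob": {"A", "B", "C"},
    "carol": {"A", "B"},
    "dave": {"D"},
}
suggestions = recommend("alice", purchases)
```

With this made-up data, `suggestions` ranks B ahead of C because both similar users bought B and only one bought C. The single purchase by "alice" is enough to borrow from the histories of similar users, which is the point the text makes about sparse individual profiles.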