09-08-2012, 03:06 PM
A DATA WAREHOUSING AND DATA MINING FRAMEWORK FOR
WEB USAGE MANAGEMENT
Abstract.
A new challenge in Web usage analysis is how to manage and discover informative
patterns from various types of Web data stored in structured or unstructured databases for system
monitoring and decision making. In this paper, a novel integrated data warehousing and data mining
framework for Website management and pattern discovery is introduced to analyze Web user
behavior. The merit of the framework is that it combines multidimensional Web databases to support
online analytical processing for improving Web services. Based on the model, we propose some
statistical indexes and practical solutions to intelligently discover interesting user access patterns for
Website optimization, Web personalization, recommendation, and similar applications. We use the Web data from a
sports Website as data sources to evaluate the effectiveness of the model. The results show that this
integrated data warehousing and mining model is effective and efficient when applied to practical Web
applications.
Key words: Data mining, Data warehousing, Web services, Website management
1. Introduction.
Rapid progress in data acquisition and
storage technologies has led to fast growth in the amount of data generated
and stored in databases, data warehouses, and other kinds of data repositories
such as the World-Wide Web. At the same time, many current and emerging data
management applications require support for real-time analysis of large-scale,
continuously changing data streams, e.g., online monitoring of user patterns in Websites.
Hence, there is a great demand for innovative solutions for various
data-intensive data mining and data warehousing applications.
Web usage mining is the application of data mining techniques to discover usage
patterns from Web data, in order to understand and better serve the needs of Web-based
applications. Srivastava et al. (2000) proposed a three-step Web usage
mining process consisting of preprocessing, pattern discovery, and pattern analysis.
Many researchers have proposed different data mining algorithms for mining user
access patterns or trends from user access sessions. For instance, Mobasher et al.
(1996) used mined association rules to realize effective Web personalization. Shen et
al. (1999) suggested a three-step algorithm to mine the most interesting Web access
associations. Zaiane et al. (1998) proposed applying OLAP and data mining techniques
for mining access patterns based on a Web usage mining system.
∗(Eds.) Wai Lam, Rui-song Ye, Haiying Wang, and Jun Zhang. Research supported by HKRGC
7130/02P, 7046/03P, 7035/04P and 7035/05P and FRG/04-05/II-51.
†Department of Statistics & Actuarial Science, The University of Hong Kong, Pokfulam Road,
Hong Kong. E-mail: hcwu[at]hkusua.hku.hk
‡Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong. E-mail:
mng[at]maths.hku.hk
§E-Business Technology Institute, The University of Hong Kong, Pokfulam Road, Hong Kong.
E-mail: jhuang[at]eti.hku.hk
EDMOND H. WU ET AL.
Recently, Web applications such as personalization and recommendation have
attracted attention because they are crucial to improving customer services
from a business point of view, particularly for E-commerce Websites. Understanding
customer preferences and requirements in time is a prerequisite for optimizing these Web
services. The field of adaptive Websites is drawing attention from the community
(Perkowitz, 1999). One of the new trends in Web usage mining is to develop Web
usage mining systems that can effectively discover user access patterns and then intelligently
optimize the Web services. Recent studies (Berendt, 2002), (Nakagawa, 2003),
(Wu, 2003) have suggested that the structural characteristics of Websites, such as the
Website topology, have a great impact on the performance or efficiency of Websites.
Hence, by taking the structure of the Website into account, we can gain more interesting results
from Web usage analysis. Understanding user behavior is the first step toward providing
better Web services.
During the past two decades, database technologies have developed very
rapidly. Traditional databases store sets of relatively static records, such as Web-logs in
Web servers. However, many current and emerging applications require databases
to support online analysis of rapidly changing data streams. The limitations of traditional
database management systems in streaming data applications have attracted the interest
of many researchers. Different data mining algorithms for streaming data have
been proposed with diverse infrastructure and domain applications. Recent research
includes mining stream signatures and representative trends (Cortes, 2000), decision
trees (Hulten, 2001), and regression analysis (Chen, 2002). Therefore, a new-generation
Web usage mining system should also be capable of discovering
changing patterns in a data stream environment with multi-type Web
sources.
The research issue we focus on in this paper is the problem of dynamic user pattern
discovery from large-scale clickstreams in Websites. To solve this problem, our model
focuses on how to handle multi-type Web data and monitor the changing patterns for
analysis by using some novel mathematical models and statistical indexes. Based on
the patterns discovered, we also propose practical solutions for Website optimization
by reorganizing the Website content and its structure. In this paper, we present an
efficient data model for aggregating user access sessions to effectively support different
data mining applications. Based on the data model, we can easily perform various
knowledge discovery tasks, such as association rule mining, sequential pattern mining,
clustering, and Web usage prediction. Using these mining results, we can provide
multiple solutions for various Web applications, for example, online personalized
services, effective recommendation systems, and Website optimization.
This paper is organized as follows. In Section 2, we present the framework of a
multidimensional Web usage mining model. In Section 3, we introduce the implementation
of the model and then give experimental results. In Section 4, we demonstrate
practical Web applications in a real Website for optimizing Website services based
on the model. Finally, we give some conclusions and present our future
work in Section 5.
2. Integrating Infrastructure for Web Usage Analysis. Since Web usage
mining techniques have been widely used in various Web applications, it is necessary
to develop an integrated platform for effective Web usage analysis. For this reason,
we propose a multidimensional model that integrates Web data preprocessing with
Website content and topology information. The framework can also easily combine
different data mining algorithms to support different Web applications. In this section,
we introduce the main components of the model individually.
2.1. Data Preprocessing Module. Yang et al. (2003) introduced a data-cube
model to hold the original access sessions for data mining from Web-logs. Building
on it, we also investigated the practice of dealing with Web-log data streams (Wu,
2004). These works provide feasible preprocessing solutions for turning large volumes of
Web logs into useful session information. As a result, the model can support both
online and offline Web usage analysis.
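As a rough illustration of how access records might be aggregated into a data cube, the following Python sketch counts requests per cell and then rolls up one dimension. The details of the cited data-cube model are not given here, so the record layout, the choice of (page, hour) as dimensions, and the function names build_cube and roll_up_by_page are all assumptions made for this example only.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access records: (user_ip, page, timestamp).
# The field layout and the sample values are assumptions for illustration.
records = [
    ("10.0.0.1", "/news", datetime(2004, 1, 1, 9, 5)),
    ("10.0.0.1", "/scores", datetime(2004, 1, 1, 9, 7)),
    ("10.0.0.2", "/news", datetime(2004, 1, 1, 9, 30)),
    ("10.0.0.2", "/news", datetime(2004, 1, 1, 10, 2)),
]


def build_cube(records):
    """Count page requests per (page, hour-of-day) cell."""
    cube = Counter()
    for _ip, page, ts in records:
        cube[(page, ts.hour)] += 1
    return cube


def roll_up_by_page(cube):
    """Aggregate away the hour dimension (an OLAP-style roll-up)."""
    totals = Counter()
    for (page, _hour), count in cube.items():
        totals[page] += count
    return totals


cube = build_cube(records)
page_totals = roll_up_by_page(cube)
```

Keeping the cube as a mapping from dimension tuples to counts makes it easy to add further dimensions (for example, a user or session identifier) without changing the roll-up logic.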
2.1.1. Data Cleaning. The Web log datasets (such as server logs and cookies), which include the URL requests,
the IP addresses of users, and timestamps, provide much of the potential information
about user access behavior in a Website. Usually, we need to do some preprocessing,
such as invalid data cleaning and user identification, to
transform the original Web logs into user access session datasets for analysis.
The quality of this preprocessing also affects the quality of Web usage mining.
Fig. 1 is a sample of Web-log records (the format of the sample Web-log is IIS 5.0;
some system information is omitted). After preprocessing these original Web-log
data sets, we can use the resulting user access sessions directly for further pattern discovery
and data analysis in Websites.
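The cleaning and session-building steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the procedure used in the paper: the six-field log layout, the list of ignored file suffixes, identifying users by IP address alone, and the 30-minute idle timeout are all hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical log lines: "date time client-ip method uri status".
# This layout is an assumption, not the exact format shown in Fig. 1.
LOG_LINES = [
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2004-01-01 09:00:00 10.0.0.1 GET /index.html 200",
    "2004-01-01 09:00:01 10.0.0.1 GET /logo.gif 200",
    "2004-01-01 09:05:00 10.0.0.1 GET /scores.html 200",
    "2004-01-01 09:06:00 10.0.0.2 GET /missing.html 404",
    "2004-01-01 10:30:00 10.0.0.1 GET /news.html 200",
]

IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed filter
SESSION_TIMEOUT = timedelta(minutes=30)  # assumed idle-gap heuristic


def clean(lines):
    """Yield (ip, timestamp, uri) for successful page requests only."""
    for line in lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        parts = line.split()
        if len(parts) != 6:               # skip malformed records
            continue
        date, time, ip, _method, uri, status = parts
        if not status.startswith("2"):    # drop failed requests
            continue
        if uri.lower().endswith(IGNORED_SUFFIXES):  # drop embedded resources
            continue
        ts = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        yield ip, ts, uri


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group requests by IP; start a new session after an idle gap."""
    sessions = []
    prev_ip, prev_ts = None, None
    for ip, ts, uri in sorted(requests):  # sorts by IP, then by time
        if ip != prev_ip or ts - prev_ts > timeout:
            sessions.append((ip, []))
        sessions[-1][1].append(uri)
        prev_ip, prev_ts = ip, ts
    return sessions


sessions = sessionize(clean(LOG_LINES))
```

With the sample lines above, the image request and the failed request are discarded, and the requests from 10.0.0.1 are split into two sessions because of the long idle gap before the last request. Identifying users by IP address is a simplification; cookies or login data, where available, would give a more reliable user identification.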