04-01-2013, 12:05 PM
PROJECT ON DISCUSS ONLINE ANALYTICAL MINING OF DATA
Abstract
Great efforts have been made in the Intelligent Database Systems Research Lab toward the research and development of efficient data mining methods and the construction of on-line analytical data mining systems. Our work has focused on the integration of data mining and OLAP technologies and the development of scalable, integrated, and multiple data mining functions. A data mining system, DBMiner, has been developed for interactive mining of multiple-level knowledge in large relational databases and data warehouses. The system implements a wide spectrum of data mining functions, including characterization, comparison, association, classification, prediction, and clustering. It also builds up a user-friendly, interactive data mining environment and a set of knowledge visualization tools. In-depth research has been performed on the efficiency and scalability of data mining methods. Moreover, the research has been extended to spatial data mining, multimedia data mining, text mining, and Web mining, with several new data mining system prototypes constructed or under construction, including GeoMiner, MultiMediaMiner, and WebLogMiner. This article summarizes our research and development activities in the last several years and shares our experiences and lessons with the readers.
INTRODUCTION
THE OLAM METHODOLOGY
Our OLAM system for path traversal patterns includes incremental Web usage mining updates. It stores the derived Web user access paths in a data warehouse. The system updates the user access path patterns in the data warehouse through data operation functions that are automated by the webmaster. The result is an OLAM that uses the underlying e-customer behavior (eCB) graph, which is capable of discovering association semantics between click sequences, e-customer profiles for customer segmentation, and a set of preferred and referring Web pages for analysis that will allow the development of effective Internet marketing plans. In addition, the eCB model has a self-learning capability. It allows the webmaster to input confidence and support level values to adjust the knowledge that is generated. These revised factors provide adjusted values to the Web mining algorithm to regulate its mining process according to e-customers' behavioral changes. The association rules that are discovered are constantly reviewed to reflect the actual on-line travel situation. Figure 3 depicts an overall conceptual architecture of our OLAM methodology.
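As a rough illustration of how webmaster-supplied support and confidence levels regulate which association rules are kept, the following minimal sketch mines rules between pairs of pages that occur in the same session. It is not the authors' eCB algorithm: the function name `mine_page_rules`, the sample sessions, and the threshold values are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def mine_page_rules(sessions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) rules for page pairs.

    support    = fraction of sessions that contain both pages
    confidence = sessions containing both pages / sessions containing lhs
    """
    total = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = set(session)  # count each page once per session
        page_counts.update(pages)
        pair_counts.update(combinations(sorted(pages), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / total
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical click sessions, one list of visited pages per session.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]

rules = mine_page_rules(sessions, min_support=0.5, min_confidence=0.6)
strict_rules = mine_page_rules(sessions, min_support=0.5, min_confidence=0.7)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

Raising the confidence threshold from 0.6 to 0.7 on this sample data leaves only the rule from `/cart` to `/products`, which is the kind of adjustment the text describes the webmaster making as e-customer behavior changes.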
To extract transaction activities for the discovery of association rules, we load the desired data fields from the server log file, as a text file, into a relational table for further processing. Log data are updated to generate the association rules, which yield knowledge for marketing decision support in Web site design. We capture the e-customer experience via the log records created by the Web server and present them as continuous data streams. We then extract useful data patterns so that we can discover the stimulus factors that indicate e-customer behavioral changes. The extracted click-stream patterns are analyzed to clarify how users traverse the site from page to page, and to identify the items that they select, the patterns of repeated visits, and the end result of visits. This pattern analysis identifies trends in consumer browsing and purchase behavior, which allows the comprehensive profiling of Web site visitors.
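The loading and path extraction steps described above might be sketched as follows, using Python's built-in `sqlite3` module as a stand-in relational store. The table name `access_log`, its columns, and the sample rows are assumptions for illustration; the source does not specify a schema.

```python
import sqlite3
from collections import Counter

# In-memory relational table holding the fields taken from the log file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (visitor TEXT, ts INTEGER, url TEXT)")

# Hypothetical log fields: visitor id, timestamp, requested page.
log_rows = [
    ("v1", 1, "/home"),
    ("v1", 2, "/products"),
    ("v1", 3, "/cart"),
    ("v2", 1, "/home"),
    ("v2", 2, "/products"),
]
conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", log_rows)


def click_paths(conn):
    """Return each visitor's pages in the order they were requested."""
    paths = {}
    query = "SELECT visitor, url FROM access_log ORDER BY visitor, ts"
    for visitor, url in conn.execute(query):
        paths.setdefault(visitor, []).append(url)
    return paths


def transition_counts(paths):
    """Count page-to-page moves across all visitors' paths."""
    counts = Counter()
    for path in paths.values():
        counts.update(zip(path, path[1:]))
    return counts


paths = click_paths(conn)
counts = transition_counts(paths)
print(paths)
print(counts.most_common())
```

The transition counts give a simple view of how users traverse the site from page to page; a real analysis would also separate repeated visits into distinct sessions.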
THE FOUNDATIONS OF DATA MINING
Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms
Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by the second quarter of 1996. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.
Architecture for on-line analytical mining
An OLAM engine performs analytical mining in data cubes in a similar manner as an OLAP engine performs on-line analytical processing. Therefore, it is suggested to have an integrated OLAM and OLAP architecture as shown in Figure 1, where the OLAM and OLAP engines both accept users' on-line queries (instructions) and work with the data cube in the analysis. Furthermore, an OLAM engine may perform multiple data mining tasks, such as concept description, association, classification, prediction, clustering, time-series analysis, etc. Therefore, an OLAM engine is more sophisticated than an OLAP engine since it usually consists of multiple mining modules which may interact with each other for effective mining. Since some requirements in OLAM, such as the construction of numerical dimensions, may not be readily available in the commercial OLAP products, we have chosen to construct our own data cube and build the mining modules on such data cubes. With many OLAP products available on the market, it is important to develop on-line analytical mining mechanisms directly on top of the constructed data cubes and OLAP engines. Based on our analysis, there is no fundamental difference between the data cube required for OLAP and that for OLAM, although OLAM analysis may often involve the analysis of a larger number of dimensions with finer granularities, and thus require more powerful data cube construction and accessing tools than OLAP analyses. Since OLAM engines are constructed either on customized data cubes which often work with relational database systems, or on top of the data cubes provided by the OLAP products, it is suggested to build on-line analytical mining systems on top of the existing OLAP and relational database systems, rather than from the ground up.
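To make the shared-cube idea concrete, here is a toy sketch in which an OLAP-style roll-up and a very simple mining step both read from the same precomputed data cube. It is not how DBMiner is implemented; `build_cube`, `olap_rollup`, `top_cell`, and the sales records are hypothetical names and data chosen for the example.

```python
from itertools import combinations


def build_cube(records, dimensions, measure):
    """Aggregate the measure for every subset of the dimensions.

    The cube maps a tuple of dimension names (a cuboid) to a dict
    from cell coordinates to the summed measure.
    """
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = {}
            for record in records:
                cell = tuple(record[d] for d in dims)
                cuboid[cell] = cuboid.get(cell, 0) + record[measure]
            cube[dims] = cuboid
    return cube


def olap_rollup(cube, dims):
    """OLAP-style query: read the cuboid for the given dimensions.

    dims must follow the order used when the cube was built.
    """
    return cube[tuple(dims)]


def top_cell(cube, dims):
    """Toy mining step: find the cell with the largest aggregate."""
    cuboid = cube[tuple(dims)]
    return max(cuboid.items(), key=lambda item: item[1])


# Hypothetical sales records with two dimensions and one measure.
records = [
    {"region": "East", "product": "TV", "sales": 100},
    {"region": "East", "product": "Radio", "sales": 40},
    {"region": "West", "product": "TV", "sales": 70},
    {"region": "West", "product": "TV", "sales": 50},
]

cube = build_cube(records, ["region", "product"], "sales")
print(olap_rollup(cube, ["region"]))
print(top_cell(cube, ["region", "product"]))
```

Materializing every cuboid is only practical for a handful of dimensions; the point of the sketch is that the mining function needs no data structure beyond the one the OLAP query already uses.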
The Pre-Processing of Data Sources
Data cleaning is an important step of knowledge discovery in data preprocessing. As not all materials within the log file are relevant to the mining, a data preparation process is performed first. We focus on preprocessing the two server-level Web access log files, namely the common log format access log and the cookies file. The common log format is presented in Figure . The content of the cookie record varies in length and format, and acts as a user identification card. The log entries must be partitioned into logical clusters using one or a series of transaction identification modules, including user and session identifications.
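The preprocessing steps above can be sketched as follows: parse Common Log Format entries, drop irrelevant requests, and partition the remainder into sessions. Several choices here are assumptions rather than the authors' method: the host address stands in for the cookie-based user identification, sessions are split on a 30-minute inactivity timeout, and the list of irrelevant file suffixes and the sample log lines are invented for the example.

```python
import re
from datetime import datetime, timedelta

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

# Assumed list of resource types that are irrelevant to mining.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def parse_line(line):
    """Parse one log line into a dict, or return None if malformed."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


def clean(records):
    """Keep successful page requests; drop malformed lines and images."""
    return [
        r for r in records
        if r is not None
        and r["status"] == 200
        and not r["path"].lower().endswith(IRRELEVANT_SUFFIXES)
    ]


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group requests per host, starting a new session after a long gap."""
    sessions = []
    last_seen = {}  # host -> (time of last request, current session list)
    for record in sorted(records, key=lambda r: r["time"]):
        previous = last_seen.get(record["host"])
        if previous is None or record["time"] - previous[0] > timeout:
            session = [record["path"]]
            sessions.append((record["host"], session))
        else:
            session = previous[1]
            session.append(record["path"])
        last_seen[record["host"]] = (record["time"], session)
    return sessions


log_lines = [
    '10.0.0.1 - - [04/Jan/2013:12:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [04/Jan/2013:12:00:01 +0000] "GET /logo.gif HTTP/1.1" 200 2048',
    '10.0.0.1 - - [04/Jan/2013:12:05:00 +0000] "GET /products HTTP/1.1" 200 1024',
    '10.0.0.2 - - [04/Jan/2013:12:06:00 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.1 - - [04/Jan/2013:13:00:00 +0000] "GET /home HTTP/1.1" 200 512',
    'not a valid log entry',
]

sessions = sessionize(clean(parse_line(line) for line in log_lines))
for host, pages in sessions:
    print(host, pages)
```

Identifying users by host address is unreliable behind proxies and shared networks, which is one reason a cookie record is useful as the user identification card mentioned above.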
APPLICATIONS
A wide range of companies have deployed successful applications of data mining. While early adopters of this technology have tended to be in information-intensive industries such as financial services and direct mail marketing, the technology is applicable to any company looking to leverage a large data warehouse to better manage their customer relationships. Two critical factors for success with data mining are: a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied (such as customer prospecting, retention, campaign management, and so on).
CONCLUSION
We have proposed and developed an OLAM methodology that gives management a means of investigating e-customers' click behavior, so that their scale of preference and browsing habits on a website can be analyzed for Web advertisement planning and design. In our approach, a mechanism for automating the view of the data warehouse has been introduced. The view is produced by joining a dimension table and a fact table, and the fact table keeps a record of user access paths. As the click sequence and path traversal patterns represent the customer's theme, these findings can also be translated into website design and used to refine the website infrastructure. The refined website design can in turn generate quite different patterns of e-customer click sequences, so the process is cyclic. To ensure timeliness, our OLAM method takes a dynamic mining approach that keeps the analysis up to date by providing continuous refinement as the website environment changes. However, a problem remains: how to synchronize updates of the base relations with updates of the view. This paper offers a frame model metadata to facilitate the trigger event, which is invoked whenever an incremental update occurs in the base relation, i.e. the access log. The frame model metadata consists of data operations, which can be used to update the user access path. As a result, with OLAM, we can transform the data warehouse into an active data warehouse that propagates incremental data updates from the base relation into an existing view after updates during a time interval.
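The synchronization idea, in which a trigger fires on each incremental update to the access log and refreshes a derived view, can be illustrated with a plain SQLite trigger. This is only an analogy to the frame model metadata described above, not an implementation of it: the tables `access_log` and `page_hits`, the trigger name, and the choice to maintain per-page hit counts instead of full access paths are assumptions made to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Base relation: the raw access log.
    CREATE TABLE access_log (visitor TEXT, url TEXT);

    -- Derived summary, maintained like a materialized view.
    CREATE TABLE page_hits (
        url  TEXT PRIMARY KEY,
        hits INTEGER NOT NULL DEFAULT 0
    );

    -- Fires on every insert into the base relation and applies
    -- the same change incrementally to the summary.
    CREATE TRIGGER sync_page_hits AFTER INSERT ON access_log
    BEGIN
        INSERT OR IGNORE INTO page_hits (url, hits) VALUES (NEW.url, 0);
        UPDATE page_hits SET hits = hits + 1 WHERE url = NEW.url;
    END;
""")

# Hypothetical incremental updates arriving in the access log.
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?)",
    [("v1", "/home"), ("v2", "/home"), ("v1", "/products")],
)

hits = dict(conn.execute("SELECT url, hits FROM page_hits"))
print(hits)
```

Because the trigger applies each change as it arrives, the summary never needs to be recomputed from the full log, which is the property that makes the warehouse "active" in the sense used above.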