16-06-2012, 11:24 AM
Web Log Cleaning for Mining of Web Usage Patterns
Abstract
Web usage mining (WUM) is a type of web mining,
which exploits data mining techniques to extract valuable
information from the navigation behavior of World Wide Web
users. The data should be preprocessed to improve the
efficiency and ease of the mining process, so it is important to
carry out this step before applying data mining techniques to
discover user access patterns from the web log. The main task of data
preprocessing is to prune noisy and irrelevant data, and to
reduce data volume for the pattern discovery phase.
INTRODUCTION
Web mining refers to the use of data mining techniques
to automatically retrieve, extract and analyze information
for knowledge discovery from Web documents and services.
The expansion of the World Wide Web (WWW) has
resulted in a large amount of data that is now in general
freely available for user access. The different types of data
have to be managed and organized in such a way that
different users can access them efficiently. Several data
mining methods are used to discover the hidden information
in the Web. Therefore, the application of data mining
techniques on the Web is now the focus of an increasing
number of researchers.
RELATED WORK
R. Cooley et al. (1999) clarified the preprocessing tasks
necessary for Web usage mining. The approach here basically
follows their steps to prepare Web log data for mining [1].
Mohammad Ala’a Al-Hamami et al. described an efficient
web usage mining framework. The key ideas were to
preprocess the web log files and then split the log file
into a number of files, each representing a class; this
classification was done by a decision tree classifier. After
web mining was performed on each of the classified files and
the hidden patterns were extracted, there was no need to analyze
the discovered patterns, because they would already be clear and
understandable at the visualization level [2].
WEB USAGE MINING
Web Usage Mining (WUM) is the application of data
mining techniques to discover usage patterns from Web data.
A general WUM process can be divided into three main steps:
data preprocessing, pattern discovery and pattern analysis.
During the preprocessing phase, raw Web logs need to be
cleaned, analyzed and converted before further pattern
mining. The data recorded in server logs, such as the user IP
address, browser, viewing time, etc., can be used to identify
users and sessions. However, because some page views may
be served from the user's browser cache or from a proxy server,
the data collected by server logs are not
entirely reliable.
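As a rough illustration of how the logged fields can identify users and sessions, the sketch below treats each (IP address, browser) pair as one user and starts a new session after a period of inactivity. The function name sessionize, the record layout and the 30-minute timeout are assumptions made for this example; the paper does not specify them.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity threshold; not taken from the paper.
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(records):
    """Group (ip, user_agent, timestamp, url) records into sessions.

    A "user" is approximated by the (ip, user_agent) pair. A new session
    starts when the gap between two consecutive requests of the same user
    exceeds SESSION_TIMEOUT.
    """
    by_user = defaultdict(list)
    for ip, agent, ts, url in records:
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort(key=lambda h: h[0])          # order requests by time
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            if hit[0] - prev[0] > SESSION_TIMEOUT:
                sessions.append((user, current))  # close the old session
                current = []
            current.append(hit)
        sessions.append((user, current))
    return sessions
```

Because of caching and proxies, sessions built this way are only an approximation: requests answered from a cache never reach the log, and several people behind one proxy can share an IP address.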
DATA PREPROCESSING
Preprocessing converts the raw data into the data
abstractions necessary for pattern discovery. The purpose of
data preprocessing is to improve data quality and increase
mining accuracy. Preprocessing consists of field extraction
and data cleaning. This phase is probably the most complex and
thankless step of the overall process.
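Field extraction can be sketched as follows, assuming the server writes its log in the NCSA Common Log Format (the paper does not state the log format, so this is an assumption). The pattern CLF_PATTERN and the function extract_fields are hypothetical names used only for this example.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]+))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def extract_fields(line):
    """Split one log line into named fields; return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the textual fields that later steps need as typed values.
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields
```

Lines that do not match the pattern are returned as None rather than raising an error, so a caller can count or discard malformed entries.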
This paper describes it only briefly: its
main task is to "clean" the raw web log files and insert the
processed data into a relational database, so that
the data mining techniques can be applied in the
second phase of the process.
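Loading the processed records into a relational database might look like the sketch below. It uses SQLite from the Python standard library purely for illustration; the table name access_log, its columns and the function store_records are assumptions, since the paper does not describe its schema or database system.

```python
import sqlite3


def store_records(conn, records):
    """Create the (hypothetical) access_log table and insert cleaned records.

    Each record is a dict with the keys host, time, method, url,
    status and size.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS access_log (
               host   TEXT,
               time   TEXT,
               method TEXT,
               url    TEXT,
               status INTEGER,
               size   INTEGER)"""
    )
    conn.executemany(
        "INSERT INTO access_log "
        "VALUES (:host, :time, :method, :url, :status, :size)",
        records,
    )
    conn.commit()
```

Once the data are in a table, the pattern discovery phase can select its input with ordinary SQL queries instead of re-reading the raw log files.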
CONCLUSION
Data preprocessing is an important task in WUM
applications. Data must therefore be preprocessed before
applying data mining techniques to discover user access
patterns from the web log. The data preparation process is often
the most time-consuming step. This paper presents two
algorithms for field extraction and data cleaning. Not every
access to the content should be taken into consideration, so
this system removes accesses to irrelevant items and failed
requests during data cleaning.
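The cleaning rule described above can be sketched as a simple filter. This is not the paper's algorithm: the list of irrelevant suffixes, the treatment of status codes of 400 and above as failed requests, and the names is_relevant and clean are all assumptions chosen for illustration.

```python
from urllib.parse import urlsplit

# Assumed set of "irrelevant" items (embedded images, styles, scripts);
# the paper does not enumerate them.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def is_relevant(record):
    """Return True for successful requests to pages worth mining."""
    if record["status"] >= 400:          # assumed definition of a failed request
        return False
    # Ignore the query string and letter case when checking the suffix.
    path = urlsplit(record["url"]).path.lower()
    return not path.endswith(IRRELEVANT_SUFFIXES)


def clean(records):
    """Drop accesses to irrelevant items and failed requests."""
    return [r for r in records if is_relevant(r)]
```

Which suffixes count as irrelevant depends on the site: on an image gallery, for example, requests for .jpg files may be exactly the accesses of interest.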