01-02-2013, 10:30 AM
A Survey on Preprocessing Methods for Web Usage Data
Abstract
The World Wide Web is a huge repository of web pages and
links. It provides an abundance of information for Internet
users. The growth of the web is tremendous, as approximately one
million pages are added daily. Users’ accesses are recorded in
web logs. Because of the heavy usage of the web, web log files
are growing rapidly and becoming very large.
Web data mining is the application of data mining techniques to
web data. Web usage mining applies mining techniques to log
data to extract the behavior of users, which is used in various
applications such as personalized services, adaptive web sites,
customer profiling, prefetching, and creating attractive web
sites. Web usage mining consists of three phases: preprocessing,
pattern discovery, and pattern analysis. Web log data is usually
noisy and ambiguous, so preprocessing is an important step
before mining. To discover patterns, sessions must be
constructed efficiently. This paper reviews existing work done
in the preprocessing stage. A brief overview of various data
mining techniques for pattern discovery and pattern analysis
is given. Finally, a glimpse of various applications of web
usage mining is presented.
INTRODUCTION
Data mining is defined as the automatic extraction of
unknown, useful, and understandable patterns from large
databases. The enormous growth of the World Wide Web makes it
harder for users to browse effectively. To improve the
performance of web sites, site design and web server
activities are adapted to users’ interests. The ability to
know the patterns of users’ habits and interests helps the
operational strategies of enterprises. Various applications such
as e-commerce, personalization, web site design, and recommender
systems can be built efficiently by knowing how users navigate
the web. Web mining is the application of data mining
techniques to automatically retrieve, extract, and evaluate
information for knowledge discovery from web documents and
services.
Web Usage Mining
Web usage mining, also known as web log mining, is the
application of data mining techniques to large web log
repositories to discover useful knowledge about users’
behavioral patterns and website usage statistics that can be used
for various website design tasks. The main source of data for
web usage mining consists of textual logs collected by
numerous web servers all around the world. There are four
stages in web usage mining: data collection, preprocessing,
pattern discovery, and pattern analysis.
DATA COLLECTION
Data collection is the first step in the web usage mining
process. It consists of gathering the relevant web data. Data
can be collected at the server side, the client side, or proxy
servers, or obtained from an organization’s database, which
contains business data or consolidated web data [13].
DATA PREPROCESSING
The information available on the web is heterogeneous and
unstructured. Therefore, the preprocessing phase is a
prerequisite for discovering patterns. The goal of
preprocessing is to transform the raw clickstream data into a
set of user profiles [8]. Data preprocessing presents a number
of unique challenges, which have led to a variety of algorithms
and heuristic techniques for preprocessing tasks such as merging
and cleaning, and user and session identification [18]. Various
research works have been carried out in this area on grouping
sessions and transactions, which are used to discover
user behavior patterns.
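One heuristic commonly used for the session identification task
mentioned above is a timeout: requests from the same user are
split into a new session whenever the gap between two consecutive
requests exceeds a threshold. The sketch below is only an
illustration under stated assumptions. The 30-minute threshold,
the use of the host address as the user identifier, and the name
build_sessions are all assumptions, not choices taken from the
survey.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity threshold; the survey text does not fix a value.
SESSION_TIMEOUT = timedelta(minutes=30)


def build_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp, url) tuples into per-user sessions.

    Returns a list of (user_id, [url, ...]) pairs. A new session starts
    when the gap between consecutive requests of one user exceeds `timeout`.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, url in requests:
        by_user[user_id].append((timestamp, url))

    sessions = []
    for user_id, hits in by_user.items():
        hits.sort(key=lambda hit: hit[0])  # order each user's hits by time
        current = []
        previous_time = None
        for timestamp, url in hits:
            if previous_time is not None and timestamp - previous_time > timeout:
                sessions.append((user_id, current))  # close the old session
                current = []
            current.append(url)
            previous_time = timestamp
        if current:
            sessions.append((user_id, current))
    return sessions
```

Identifying users by host address alone is a simplification: it
treats all users behind one proxy as a single user.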
Data Cleaning
Data cleaning is the process of removing irrelevant items,
such as JPEG, GIF, or sound files, and references due to
spider navigation. Better data quality improves the analysis
performed on it. HTTP requires a separate request to the web
server for every file. When a user requests a particular page,
graphics and scripts are downloaded in addition to the HTML
file, and each of them produces its own server log entry. An
exception is a site such as an art gallery, where images are
more important. The status codes in log entries are also checked
for success: entries with a status code less than 200 or greater
than 299 are removed.
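The cleaning rules described above can be sketched as a small
filter. This is a minimal illustration, not the method of any
surveyed paper. It assumes the log is in Common Log Format, that
a host requesting /robots.txt is a spider, and that the suffix
list below is a reasonable default; the function names are
hypothetical.

```python
import re

# Assumed input: Common Log Format, e.g.
# host ident user [time] "METHOD url PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Illustrative default list of file types treated as irrelevant.
IRRELEVANT_SUFFIXES = (".jpg", ".jpeg", ".gif", ".png", ".css", ".js", ".wav", ".mp3")


def parse_line(line):
    """Return a dict for one log line, or None if the line does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    return entry


def url_path(entry):
    """Requested path without the query string, lower-cased."""
    return entry["url"].split("?", 1)[0].lower()


def clean_log(lines, irrelevant_suffixes=IRRELEVANT_SUFFIXES):
    """Drop spider traffic, irrelevant file requests, and non-2xx entries."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    # Heuristic (assumption): any host that fetched /robots.txt is a spider.
    spider_hosts = {e["host"] for e in entries if url_path(e) == "/robots.txt"}
    cleaned = []
    for entry in entries:
        if entry["host"] in spider_hosts:
            continue
        if url_path(entry).endswith(irrelevant_suffixes):
            continue
        if not 200 <= entry["status"] <= 299:
            continue
        cleaned.append(entry)
    return cleaned
```

For a site where images matter, such as the art gallery case,
a caller could pass a shorter suffix tuple, for example
clean_log(lines, irrelevant_suffixes=(".css", ".js")), so that
image requests are kept.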
PATTERN DISCOVERY AND ANALYSIS
Once user transactions have been identified, a variety of
data mining techniques are applied for pattern discovery in
web usage mining. These methods represent approaches that
often appear in the data mining literature, such as discovery
of association rules and sequential patterns, clustering, and
classification [13]. Classification is a supervised learning
process, because learning is driven by the assignment of
instances to the classes in the training data. It maps a data
item into one of several predefined classes. This can be
done using inductive learning algorithms such as decision
tree classifiers, naive Bayesian classifiers, and Support Vector
Machines.
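To make one of the techniques named above concrete, the sketch
below computes the support and confidence of a single association
rule over user sessions. It is only an illustration: sessions are
modeled as sets of page URLs, and the function names are
assumptions rather than anything taken from the survey.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    if not sessions:
        return 0.0
    hits = sum(1 for session in sessions if pages <= set(session))
    return hits / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Estimate P(consequent pages visited | antecedent pages visited)."""
    base = support(sessions, antecedent)
    if base == 0:
        return 0.0
    both = support(sessions, set(antecedent) | set(consequent))
    return both / base
```

A rule such as {/home} -> {/products} would be reported only if
both values exceed thresholds chosen by the analyst.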
CONCLUSION
Web sites are one of the most important tools for
international advertising by universities and other
foundations. The quality of a website can be evaluated by
analyzing user accesses to it through web usage mining. The
results of mining can be used to improve the website design
and increase user satisfaction, which helps in various
applications. Log files are the best source for learning about
user behavior. However, raw log files contain unnecessary
details such as image accesses and failed entries, which affect
the accuracy of pattern discovery and analysis. Preprocessing
is therefore an important stage of mining that makes pattern
analysis efficient. To get accurate mining results, users’
session details must be known. The survey was performed on a
selection of web usage preprocessing methodologies proposed by
the research community.