28-06-2012, 05:29 PM
Web Mining: Information and Pattern Discovery on the World Wide Web
Abstract
Application of data mining techniques to the World
Wide Web, referred to as Web mining, has been the
focus of several recent research projects and papers.
However, there is no established vocabulary, leading to
confusion when comparing research efforts. The term
Web mining has been used in two distinct ways. The
first, called Web content mining in this paper, is the
process of information discovery from sources across
the World Wide Web. The second, called Web usage
mining, is the process of mining for user browsing and
access patterns. In this paper we define Web mining
and present an overview of the various research issues,
techniques, and development efforts.
Introduction
With the explosive growth of information sources
available on the World Wide Web, it has become
increasingly necessary for users to utilize automated
tools to find the desired information resources, and to
track and analyze their usage patterns. These factors
give rise to the necessity of creating server-side and
client-side intelligent systems that can effectively mine
for knowledge. Web mining can be broadly defined as
the discovery and analysis of useful information from
the World Wide Web. This describes the automatic
search of information resources available on-line, i.e.,
Web content mining, and the discovery of user access
patterns from Web servers, i.e., Web usage mining.
A Taxonomy of Web Mining
In this section we present a taxonomy of Web mining,
i.e., Web content mining and Web usage mining.
We also describe and categorize some of the recent
work and the related tools or techniques in each area.
This taxonomy is depicted in Figure 1.
Web Content Mining
The lack of structure that permeates the information
sources on the World Wide Web makes automated
discovery of Web-based information difficult.
Traditional search engines such as Lycos, Alta Vista,
WebCrawler, ALIWEB [29], MetaCrawler, and others
provide some comfort to users, but do not generally
provide structural information nor categorize, filter,
or interpret documents. A recent study provides a
comprehensive and statistically thorough comparative
evaluation of the most popular search engines [32].
Web Usage Mining
Web usage mining is the automatic discovery of user
access patterns from Web servers. Organizations collect
large volumes of data in their daily operations,
generated automatically by Web servers and collected
in server access logs. Other sources of user information
include referrer logs which contain information about
the referring pages for each page reference, and user
registration or survey data gathered via CGI scripts.
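As a rough illustration of what a server access log with referrer information contains, the following minimal sketch parses a single log line. It assumes the NCSA Combined Log Format, which the text above does not specify; the names LOG_PATTERN and parse_log_line and the sample line are hypothetical and chosen only for this example.

```python
import re
from datetime import datetime

# Assumption: NCSA Combined Log Format, i.e.
# host ident user [time] "request" status bytes "referrer" "user agent".
# The referrer and user agent fields are optional in this pattern.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Hypothetical sample line, not taken from any real server.
sample = ('192.0.2.1 - - [10/Oct/1997:13:55:36 -0700] '
          '"GET /products/index.html HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08"')
entry = parse_log_line(sample)
print(entry["host"], entry["url"], entry["referrer"])
```

Each parsed entry records who requested which page, when, and from which referring page, which is the raw material for the usage mining described above.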
Pattern Discovery from Web Transactions
As discussed in section 2.2, analysis of how users
are accessing a site is critical for determining effective
marketing strategies and optimizing the logical
structure of the Web site. Because of many unique
characteristics of the client-server model in the World
Wide Web, including differences between the physical
topology of Web repositories and user access paths,
and the difficulty of identifying unique users as
well as user sessions or transactions, it is necessary to
develop a new framework to enable the mining process.
Specifically, there are a number of issues in preprocessing
data for mining that must be addressed before
the mining algorithms can be run. These include
developing a model of access log data, developing techniques
to clean/filter the raw data to eliminate outliers
and/or irrelevant items, grouping individual page accesses
into semantic units (i.e., transactions), integrating
various data sources such as user registration
information, and specializing generic data mining algorithms
to take advantage of the specific nature of
access log data.
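Two of the preprocessing steps listed above, cleaning the raw data and grouping page accesses into transactions, can be sketched as follows. This is only one possible approach under stated assumptions, not the method of the paper: records are assumed to be (host, timestamp, url) tuples, users are crudely identified by host, requests for images and style files are treated as irrelevant, and a new transaction starts after a 30-minute gap. The suffix list, the timeout, and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: requests for embedded files are irrelevant to page-level analysis.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")
# Assumption: a gap longer than 30 minutes starts a new transaction.
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Drop records whose URL ends in one of the irrelevant suffixes."""
    return [r for r in records if not r[2].lower().endswith(IRRELEVANT_SUFFIXES)]


def group_into_transactions(records, timeout=SESSION_TIMEOUT):
    """Group page accesses per host, splitting wherever the time gap exceeds timeout."""
    by_host = defaultdict(list)
    for host, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        by_host[host].append((ts, url))

    transactions = []
    for host, visits in by_host.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                transactions.append((host, current))
                current = []
            current.append(url)
        transactions.append((host, current))
    return transactions


# Hypothetical sample data.
t0 = datetime(1997, 10, 10, 13, 0, 0)
records = [
    ("192.0.2.1", t0, "/index.html"),
    ("192.0.2.1", t0 + timedelta(seconds=1), "/logo.gif"),
    ("192.0.2.1", t0 + timedelta(minutes=5), "/products.html"),
    ("192.0.2.1", t0 + timedelta(hours=2), "/index.html"),
    ("192.0.2.7", t0 + timedelta(minutes=1), "/about.html"),
]
transactions = group_into_transactions(clean(records))
for host, pages in transactions:
    print(host, pages)
```

Identifying users by host alone is a simplification; as noted above, reliably identifying unique users and sessions is one of the harder parts of the problem.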