23-11-2012, 11:13 AM
Web Mining
WebMining(1).pdf (Size: 261.47 KB / Downloads: 145)
Abstract
This paper takes a closer look at different implementations of web mining and the
importance of filtering out requests made by robots in order to capture the actual
human usage of a website. The goal is to find patterns between different web pages and
create more customized and accessible web pages for users, which in turn brings
more traffic and trade to the website. We address some common methods for detecting
and eliminating web usage generated by robots while keeping browsing data from
human users intact. The paper focuses primarily on the field of web usage
mining, which is a direct consequence of the growth of the World Wide Web.
Introduction
Web mining deals with three main areas: web content mining, web usage mining and web structure
mining. In web usage mining it is desirable to find the habits of a website's users and the
relations between what they are looking for. To isolate the actual users, some filtering has to
be done to remove bots that index the structure of a website. Robots visit all pages and links
on a website to find relevant content. This generates many requests to the web server and
thereby creates a false image of the actual web usage.
The paper we have chosen to start with [Tang et al., 2002] does not discuss web content
and web structure mining in depth, but instead looks closer at web usage mining. This field aims
to describe relations between web pages based on the interests of users, i.e. finding links that
are often clicked in a specific order and are therefore of greater relevance to the user. The
patterns revealed can then be used to create a more visitor-customized website by highlighting
or otherwise exposing web pages to increase commerce. This is often demonstrated as a price cut
on one product that increases sales of another. On the other hand, it is also important not to
misclassify actual users who search websites thoroughly and label them as robots.
Techniques to Address the Problem
Preprocessing technique - Web Robots
When attempting to detect web robots from a request stream it is desirable to monitor both the
web server log and activity on the client side. What we are looking for is to distinguish
individual web sessions from each other. A web session is a series of requests to web pages by
one client, i.e. that client's visits to web pages. Since the navigation patterns of web robots
differ from those of human users, the contribution from web robots has to be eliminated before
proceeding with any further data mining, i.e. when we are looking into the web usage behaviour
of real users.
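The sessionization step can be sketched as follows. This is a minimal illustration, not the paper's own procedure: the log format (client id, timestamp, URL) and the 30-minute inactivity timeout are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

# Hypothetical pre-parsed log entries: (client_id, timestamp, url).
# A real server log would first need parsing (e.g. Apache Common Log Format).
LOG = [
    ("10.0.0.1", datetime(2002, 5, 1, 9, 0), "/index.html"),
    ("10.0.0.1", datetime(2002, 5, 1, 9, 2), "/products.html"),
    ("10.0.0.1", datetime(2002, 5, 1, 11, 0), "/index.html"),  # long gap: new session
    ("10.0.0.2", datetime(2002, 5, 1, 9, 1), "/robots.txt"),
]

def sessionize(log, timeout=timedelta(minutes=30)):
    """Group requests into sessions: same client, inter-request gap below `timeout`."""
    sessions = []
    open_session = {}  # client_id -> (last timestamp, index of that client's open session)
    for client, ts, url in sorted(log, key=lambda entry: entry[1]):
        prev = open_session.get(client)
        if prev is None or ts - prev[0] > timeout:
            sessions.append([(client, ts, url)])           # start a fresh session
            open_session[client] = (ts, len(sessions) - 1)
        else:
            sessions[prev[1]].append((client, ts, url))    # continue the open session
            open_session[client] = (ts, prev[1])
    return sessions

print(len(sessionize(LOG)))  # 3 sessions
```

The timeout-based heuristic is the simplest common choice; client-side monitoring, as mentioned above, can give more accurate session boundaries.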
One problem with identifying web robots is that they might hide their identity behind a facade
that looks much like a conventional web browser. Standard approaches to robot detection will fail
to detect such camouflaged web robots. Web robots have to exist, as they are used for tasks like
website indexing, e.g. by Google, or detection of broken links. There is a special file on every
domain called "robots.txt" which, according to the Robot Exclusion Standard [M. Koster, 1994],
is examined by the robot in order to prevent it from visiting certain pages of no interest.
Malicious web robots, however, are not guaranteed to follow the directives in robots.txt.
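Python's standard library includes a parser for this file format, which illustrates how a well-behaved robot consults the rules. The robots.txt content below is invented for the example; a real crawler would fetch it from the domain.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt; a real robot would fetch it from
# http://<domain>/robots.txt (e.g. via RobotFileParser.set_url() + read()).
rules = """\
User-agent: *
Disallow: /cart/
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A compliant robot checks each URL before requesting it.
print(rp.can_fetch("MyCrawler", "http://example.com/products.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/cart/checkout"))  # False
```

Note that this check is entirely voluntary on the robot's side, which is exactly why server-side detection is still needed.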
Detecting Web Robots
To detect web robots, [Tang et al., 2002] uses a technique involving feature classification. The
feature classes chosen for evaluation are Temporal Features, Page Features, Communication Features
and Path Features. It is desirable to be able to detect the presence of a web robot after as few
requests as possible; this is of course a tradeoff between computational effort and result accuracy.
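To give a feel for such features, here is a toy sketch. The session format, the three features and the threshold rule are all invented for illustration; the paper trains a proper classifier over its four feature classes rather than using a hand-set threshold.

```python
def session_features(session):
    """Extract a few robot-indicative features from a session.

    `session` is a list of (method, url, referrer) tuples -- a much-simplified
    stand-in for the Temporal/Page/Communication/Path feature classes.
    """
    n = len(session)
    return {
        # Robots often probe with HEAD requests and send no referrer.
        "frac_head": sum(m == "HEAD" for m, _, _ in session) / n,
        "frac_no_referrer": sum(r == "-" for _, _, r in session) / n,
        "asked_robots_txt": any(u.endswith("/robots.txt") for _, u, _ in session),
    }

def looks_like_robot(session):
    """Toy threshold rule; a real system would use a trained classifier."""
    f = session_features(session)
    score = f["frac_head"] + f["frac_no_referrer"] + f["asked_robots_txt"]
    return score >= 1.0

crawler = [("GET", "/robots.txt", "-"), ("HEAD", "/a.html", "-")]
human = [("GET", "/index.html", "-"), ("GET", "/a.html", "/index.html")]
print(looks_like_robot(crawler), looks_like_robot(human))  # True False
```

Computing features incrementally per request, rather than per complete session, is what enables the early detection mentioned above.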
Avoiding Mislabeled Sessions
To avoid mislabeling of sessions, an ensemble filtering approach [C. Brodley et al., 1999] is
used. The idea is to build several classification models instead of just one and use them to
find individual mislabeled sessions.
The set of models acquired is used to classify every session. For each session, the number of
false negative and false positive classifications is counted. A large number of false positive
classifications implies that the session is currently labeled as a non-robot despite being
predicted to be a robot by most of the models. A large number of false negative classifications
implies that the session might be a non-robot but carries the robot label.
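The majority-vote core of ensemble filtering can be sketched like this. The threshold models and feature names below are invented stand-ins; in [C. Brodley et al., 1999] the ensemble members are classifiers learned from the data.

```python
from collections import Counter

def ensemble_filter(sessions, labels, models):
    """Flag sessions whose majority-vote prediction disagrees with their label.

    `models` are callables session -> "robot"/"human"; flagged sessions are
    candidates for relabeling or removal before further mining.
    """
    suspects = []
    for i, (session, label) in enumerate(zip(sessions, labels)):
        votes = Counter(model(session) for model in models)
        majority, _ = votes.most_common(1)[0]
        if majority != label:
            suspects.append((i, label, majority))
    return suspects

# Hypothetical hand-written "models" over simple session statistics.
models = [
    lambda s: "robot" if s["pages"] > 100 else "human",
    lambda s: "robot" if s["avg_gap_s"] < 1 else "human",
    lambda s: "robot" if s["head_frac"] > 0.5 else "human",
]
sessions = [
    {"pages": 500, "avg_gap_s": 0.2, "head_frac": 0.9},  # behaves like a robot
    {"pages": 12, "avg_gap_s": 40, "head_frac": 0.0},    # behaves like a human
]
labels = ["human", "human"]  # the first label is wrong

print(ensemble_filter(sessions, labels, models))  # [(0, 'human', 'robot')]
```

Counting disagreeing votes per session, as the text describes, is a straightforward extension of this majority check.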
Indirect Association
Common association methods often employ patterns that connect objects to each other.
Sometimes, however, it can be valuable to consider indirect associations between objects.
Indirect association can be used, for example, to represent the behaviour of distinct user groups.
In general, two objects that are indirectly associated share the same path but are themselves
distinct leaves of that path. That is, if one session is {A, B, C} and another is {A, B, D},
then C and D are indirectly associated because they share the same traversal path {A, B}, also
called the "mediator".
The algorithm used to discover indirect associations first uses Apriori [R. Agrawal et al., 1994]
to find frequent itemsets, i.e. common sessions from single clients. The frequent itemsets are
then matched against each other in order to discover indirect association candidate triplets,
<a, b, M>, where a and b are indirectly associated items and M is their mediator. In the matching
process a triplet is formed whenever an itemset L1 and another itemset L2 match in all but one
position; that differing position is where the indirectly associated items are found. Each pair
of indirectly associated items is recorded in a matrix. After all candidates are considered, the
matrix contains counts for pairs of indirectly associated items. The larger a specific matrix
value is, the stronger the indirect association.
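The matching step can be sketched as follows, assuming Apriori has already produced the frequent itemsets. A dictionary of pair counts stands in for the matrix; the example itemsets are invented.

```python
from itertools import combinations
from collections import Counter

def indirect_candidates(frequent_itemsets):
    """Pair up frequent itemsets of equal size that differ in exactly one item.

    Returns counts over (a, b) pairs of indirectly associated items; the
    shared part of each matched pair of itemsets is the mediator M.
    """
    counts = Counter()  # stands in for the pair matrix described in the text
    for s1, s2 in combinations(frequent_itemsets, 2):
        s1, s2 = frozenset(s1), frozenset(s2)
        if len(s1) == len(s2) and len(s1 & s2) == len(s1) - 1:
            (a,) = s1 - s2  # the single item unique to s1
            (b,) = s2 - s1  # the single item unique to s2
            counts[tuple(sorted((a, b)))] += 1
    return counts

# {A,B,C} and {A,B,D} share mediator {A,B}, making C and D indirectly associated.
itemsets = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "D"}]
print(indirect_candidates(itemsets))
```

In this tiny example each candidate pair appears with one mediator; over many real sessions the counts diverge, and the largest counts indicate the strongest indirect associations.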
Comparing methods
The methods described simplify the work of one another; they are not really competitors addressing
the same problem. Web robot detection is used to isolate human user sessions. Clustering is used
to group similar websites into a more general description. The clustering allows the association
methods, especially the indirect association method described here, to run more efficiently.
Applications
Web mining is an important tool for gathering knowledge about the behaviour of a website's
visitors, and thereby allows for appropriate adjustments and decisions with respect to the site's
actual users and traffic patterns. Along with a description of the processes involved in web
mining, [Srivastava, 1999] states that Website Modification, System Improvement, Web
Personalization and Business Intelligence are four major application areas for web mining. These
are briefly described in the following sections.
Website Modification
The content and structure of a website are important to users' experience and impression of the
site and to the site's usability. The problem is that different types of users have different
preferences, backgrounds, knowledge etc., making it difficult (if not impossible) to find a design
that is optimal for all users. Web usage mining can then be used to detect which types of users
are accessing the website and how they behave, knowledge which can then be used to manually
design or re-design the website, or to automatically change its structure and content based on
the profile of the visiting user. Adaptive websites are described in more detail in
[Perkowitz & Etzioni, 1998].
Summary
Web mining consists of three major parts: collecting the data, preprocessing the data, and
extracting and analyzing patterns in the data. This paper focuses primarily on web usage mining.
As expected, using web mining when designing and maintaining websites is very useful for making
sure that a website conforms to the actual usage of the site. The area of web mining grew out of
the needs of web shops, which wanted to adapt better to their customers.
A set of clustering techniques has been listed which significantly speeds up the process of
mining data on the web. Each technique has a corresponding computation and time cost, which can
determine the technique of choice depending on the size of the data.