26-08-2014, 03:28 PM
Knowledge-Based Systems
Knowledg.pdf (Size: 331.5 KB / Downloads: 19)
abstract
In Web usage mining, fuzzy association rules that have a temporal property can provide useful know edge about when associations occur. However, there is a problem with traditional temporal fuzzy asso ciation rule mining algorithms. Some rules occur at the intersection of fuzzy sets’ boundaries where there
is less support (lower membership), so the rules are lost. A genetic algorithm (GA)-based solution is
described that uses the flexible nature of the 2-tuple linguistic representation to discover rules that occur
at the intersection of fuzzy set boundaries. The GA-based approach is enhanced from previous work by
including a graph representation and an improved fitness function. A comparison of the GA-based
approach with a traditional approach on real-world Web log data discovered rules that were lost with
the traditional approach
1. Introduction
Web usage mining is one type of Web mining [23] that attempts
to discover patterns of user behaviours that are recorded in the logs
of Web servers as users browse Web sites [10]. In this paper, tem poral fuzzy association rules are used for Web usage mining. For
example, ‘‘On a Friday evening, visitors who viewed history.html
for a large amount of time also viewed contact-us.html for a medium amount of time’’
s [1] by incorporating temporal and fuzzy quantitative
features. The temporal feature of the rule is on a Friday evening,
and the fuzzy features are the large and medium descriptions.
Matthews et al. [25] discovered a problem of losing some rules
when using traditional methods on synthetic market basket data.
2. Related work
The application of Web usage mining has been categorised as
either personalised for learning user profiles, or unpersonalised
for user navigation patterns [30]. In this paper, we focus on user
navigation patterns represented with fuzzy association rules.
Web usage mining can be used for the personalisation of web con customer relationship management in e-commerce [13]. Recent
work has also applied similar techniques to those in this paper.
GAs have mined sequence rules in Web log data [31] and have also
performed subgroup discovery [6]. Fuzzy sets have been used to
represent the time spent viewing Web pages for fuzzy association
rules [33] and fuzzy sequence rules [19]. The temporal and fuzzy
features of association rules that are mined in this paper are now
reviewed.
The term temporal is ambiguous, because it can have different
interpretations in temporal data mining [26].
The temporal property of not discovering rare fuzzy itemsets
[32] is different to our research, because we focus on how the fuzzy
sets are defined instead of only the temporal property. Au and
Chan [5] also mine fuzzy association rules in temporal partitions
of the dataset, and they follow the same two-step process, which
can lose rules.
3. Temporal fuzzy association rule mining
Two approaches for mining temporal fuzzy association rules
were run on the United States Environmental Protection Agency
(EPA) dataset. The purpose is to demonstrate how the flexibility
of the 2-tuple linguistic representation approach can help to discover rules on real-world data that a traditional approach cannot.
The two approaches are described here, and enhancements to the
GA-based approach are explained in Section 4.
3.1. FuzzyApriori
FuzzyApriori [18] is an extension to the Apriori algorithm [1]
that uses a breadth-first search. FuzzyApriori uses fuzzy sets to express quantities of items with linguistic terms, but it does not consider any temporal pattern. So, the dataset is partitioned according
to its temporal dimension, such as by hour, and FuzzyApriori is
executed on each dataset partition separately. The systematic
search of the temporal dimension allows for the discovery of temporal features of fuzzy association rules. This is similar to the first
approach for mining cyclic association rules [27] where the dataset
is also partitioned according to the temporal dimension.
3.2. CHC with 2-tuple linguistic representation
The GA-based approach was first described in Matthews et al.
[25], so an overview is given before introducing enhancements in
Section 4. The pseudocode of the algorithm is described in Section
A. The GA-based approach by Matthews et al.
The GA-based approach uses Iterative Rule Learning (IRL) [14]. IRL
represents one rule in a chromosome. One rule is used from the
final population of a GA. More rules are learnt by repeating the
GA and penalising previously learnt rules in the fitness function.
4. Enhanced temporal fuzzy association rule mining
The GA-based approach is extended with an enhanced fitness
function. A weight in the fitness function provides a preference
based multi-objective model to overcome confidence dominating
the fitness [25]. Previous approaches also use Pareto-based muti-objective models [24], however, selecting a single rule from
the Pareto front (for IRL) is a challenging problem. A chromosome
C has mixed types, and is defined as C = (el,eu,i1, s1,a1,a1, ... , ik, sk,
ak,ak) where the lower temporal endpoint is el (start of time
window), the upper temporal endpoint is eu (end of time window),
i is the uniform resource locator (URL), s is the linguistic label
expressing the page view time for that URL (e.g., medium), a is
the lateral displacement of that linguistic label, a determines
the antecedent/consequent part, and k is the number of URLs in a
rule. For example, a chromosome (807127200,807130800,
‘‘/Rules.html’’, ‘‘medium’’,0.42, antecedent, ‘‘/’’, ‘‘medium’’,0.31,
consequent) represents the rule ‘‘IF view time of/Rules.html is
(medium,0.42) THEN view time of/ is (medium, 0.31) during the
period from 807127200 to 807130800’’ (unixtime). A single
rule is represented and extracted from a chromosome, because
the lateral displacements of a fuzzy set are specific to each rule.
5. Evaluation
The dataset and methodology for analysing the enhanced GAbased approach are discussed and results are then presented.
5.1. Data
A Web log dataset has both temporal and quantitative features.
The temporal feature is the timestamp of a request made to the
server, and the quantitative feature is the page view time in
seconds.
The EPA dataset1 is a collection of Hypertext Transfer Protocol
(HTTP) requests to a Web server collected from a 24-h period. The
geographical location of the Web server is Research Triangle Park,
NC, USA. The EPA dataset was recorded from 23:53:25 29th August
1995 EDT to 23:53:07 30th August 1995 EDT. The EPA dataset has
47748 requests: 46014 GET requests, 1622 POST requests, 107 HEAD
requests, and 6 invalid requests. Table 2 shows a sample of records
from the EPA dataset before cleaning and preprocessing.
6. Conclusions
We have demonstrated the problem of losing temporal fuzzy
association rules on real-world Web log data for the first time
and presented a novel solution. Our previous approach of using a
GA and the 2-tuple linguistic representation has been improved
by transforming the dataset to a graph, which ensures valid item sets are discovered, and modifying the fitness function.
The execution time of the GA-based approach is longer, how
It is important to note that lowering minimum support/confdence would overcome the problem of losing rules with traditional
approaches, however, the number of rules increases, which is
undesirable in association rule mining. Further work will explore
different enhancements, and different approaches to tackle the
same problem.