17-11-2012, 02:20 PM
Efficiently Mining Maximal Frequent Itemsets
1Efficiently Mining Maximal.pdf (Size: 165.82 KB / Downloads: 29)
Abstract
We present GenMax, a backtrack search based algorithm
for mining maximal frequent itemsets. GenMax uses a number
of optimizations to prune the search space. It uses
a novel technique called progressive focusing to perform
maximality checking, and diffset propagation to perform
fast frequency computation. Systematic experimental comparison
with previous work indicates that different methods
have varying strengths and weaknesses based on dataset
characteristics. We found GenMax to be a highly efficient
method to mine the exact set of maximal patterns.
Introduction
Mining frequent itemsets is a fundamental and essential
problem in many data mining applications such as the discovery
of association rules, strong rules, correlations, multidimensional
patterns, and many other important discovery
tasks. The problem is formulated as follows: Given a large
data base of set of items transactions, find all frequent itemsets,
where a frequent itemset is one that occurs in at least a
user-specified percentage of the data base.
Many of the proposed itemset mining algorithms are a
variant of Apriori [2], which employs a bottom-up, breadthfirst
search that enumerates every single frequent itemset.
Thus, there has been recent
interest in mining maximal frequent patterns in these ”hard”
dense databases. Another recent promising direction is to
mine only closed sets [9, 11]; a set is closed if it has no
superset with the same frequency. Nevertheless, for some
of the dense datasets we consider in this paper, even the set
of all closed patterns would grow to be too large. The only
recourse is to mine the maximal patterns in such domains.
GenMax for efficient MFI Mining
There are two main ingredients to develop an efficient
MFI algorithm. The first is the set of techniques used to
remove entire branches of the search space, and the second
is the representation used to perform fast frequency computations.
We will describe below how GenMax extends the
basic backtracking routine for FI, and then the progressive
focusing and diffset propagation techniques it uses for fast
maximality and frequency checking.
Optimizing GenMax
Superset Checking Optimization
The main efficiency of GenMax stems from the fact
that it eliminates branches that are subsumed by an already
mined maximal pattern. Were it not for this pruning, Gen-
Max would essentially default to a depth-first exploration of
the search tree.
Frequency Testing Optimization
So far GenMax, as described, is independent of the data
format used. The techniques can be integrated into any of
the existing methods for mining maximal patterns. We now
present some data format specific optimizations for fast frequency
computations.
GenMax uses a vertical database format, where we have
available for each item its tidset, the set of all transaction
tids where it occurs. The vertical representation has the following
major advantages over the horizontal layout: Firstly,
computing the support of itemsets is simpler and faster with
the vertical layout since it involves only the intersections
of tidsets (or compressed bit-vectors if the vertical format
is stored as bitmaps [4]).
Experimental Results
Past work has demonstrated that DepthProject [1] is
faster than MaxMiner [3], and the latest paper shows that
Mafia [4] consistently beats DepthProject. In out experimental
study below, we retain MaxMiner for baseline comparison.
At the same time, MaxMiner shows good performance
on some datasets, which were not used in previous
studies. We use Mafia as the current state-of-the-art method
and show how GenMax compares against it.
All our experiments were performed on a 400MHz Pentium
PC with 256MB of memory, running RedHat Linux
6.0. For comparison we used the original source or object
code for MaxMiner [3] and MAFIA [4], provided to
us by their authors. Timings in the figures are based on total
wall-clock time, and include all preprocessing costs (such
as horizontal-to-vertical conversion in GenMax and Mafia).
The times reported also include the program output. We
believe our setup reflects realistic testing conditions (as opposed
to some previous studies which report only the CPU
time or may not include output cost).
Conclusions
This is one of the first papers to comprehensively compare
recent maximal pattern mining algorithms under realistic
assumptions. Our timings are based on wall-clock time,
we included all pre-processing costs, and also cost of outputting
all the maximal itemsets (written to a file). We were
able to distinguish four different types of MFI distributions
in our benchmark testbed. We believe these distributions
to be fairly representative of what one might see in practice,
since they span both real and synthetic datasets. Type
I is a normal MFI distribution with not too long maximal
patterns, Type II is a left-skewed distributions, with longer
maximal patterns, Type III is an exponential decay distribution,
with extremely short maximal patterns, and finally
Type IV is an extreme left-skewed distribution, with very
large average maximal pattern length.