17-11-2012, 04:02 PM
Discovery of maximum length frequent itemsets
Abstract
The use of frequent itemsets has been limited by the high computational cost as
well as the large number of resulting itemsets. In many real-world scenarios, however,
it is often sufficient to mine a small representative subset of frequent itemsets with
low computational cost. To that end, in this paper, we define a new problem of
finding the frequent itemsets with a maximum length and present a novel algorithm
to solve this problem. Indeed, maximum length frequent itemsets can be efficiently
identified in very large data sets and are useful in many application domains. Our
algorithm generates the maximum length frequent itemsets by adapting a pattern
fragment growth methodology based on the FP-tree structure. Also, a number of
optimization techniques have been exploited to prune the search space. Finally,
extensive experiments on real-world data sets validate the proposed algorithm.
Introduction
In 1993, Agrawal et al. [2] first proposed the problem of finding frequent itemsets
in their association rule mining model. Indeed, finding frequent itemsets
plays an important role in the field of data mining. Frequent itemsets are essential
for many data mining problems, such as the discovery of association
rules [3,12], data correlations [20,21], and sequential patterns [18,23].
The frequent itemset mining problem can be formally stated as follows: Let I
be a set of distinct items. Each transaction T in database D is a subset of I.
We call X ⊆ I an itemset. An itemset with k items is called a k-itemset. The
support of X, denoted by supp(X), is the fraction of transactions containing
X. If supp(X) is no less than a user-specified minimum support ε, X is called
a frequent itemset. Let FI denote the set of all frequent itemsets. X is closed
if it has no proper superset with the same support. Let FCI denote the set
of all frequent closed itemsets [16,25]. X is called a maximal frequent itemset
if it has no proper superset that is frequent. The set of all maximal frequent
itemsets is denoted by MFI [1,14,4,6,9,10]. X is a maximum length frequent
itemset if it contains the maximum number of items among all itemsets in FI.
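The definitions above can be made concrete with a small sketch. The following Python computes supp(X) and the set FI by brute-force subset enumeration; it is for illustration only (exponential in the number of items, unlike the paper's algorithm), and the sample database is hypothetical.

```python
from itertools import combinations

def supp(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t)) / len(transactions)

def frequent_itemsets(transactions, eps):
    """All X with supp(X) >= eps, found by enumerating every subset of I."""
    items = sorted({i for t in transactions for i in t})
    fi = []
    for k in range(1, len(items) + 1):
        for x in combinations(items, k):
            if supp(x, transactions) >= eps:
                fi.append(x)
    return fi

# Hypothetical database D with four transactions over I = {a, b, c}
db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
fi = frequent_itemsets(db, 0.5)
# A maximum length frequent itemset is any member of FI of greatest size
max_len = max(len(x) for x in fi)
```

On this database with ε = 0.5, every 1-itemset and 2-itemset is frequent but {a, b, c} is not (supp = 0.25), so the maximum length is 2.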
Preliminaries and related work
In this section, we first describe the conceptual framework of the item subset
lattice on which many algorithms are based. Then we review some related
research on frequent itemset mining.
Preliminaries
Most of the algorithms for mining frequent itemsets can be described using the
item subset lattice framework [1,4,6,9]. This lattice shows how sets of items are
completely enumerated in a search space. Assume there is a total lexicographic
ordering ≤L of all items in the database. This ordering is used to enumerate
the item subset lattice (search space). A particular ordering affects the item
relationships in the lattice but not its completeness. Fig. 1 shows a sample
of a complete subset lattice for four items. The lattice can be traversed in a
breadth-first way, a depth-first way or some other way according to a heuristic.
The problem of finding the frequent itemsets in the database can be viewed
as finding a cut through this lattice, so that all lattice nodes above the cut
are frequent itemsets, while all those below it are infrequent.
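The lattice enumeration under a fixed ordering can be sketched briefly. This minimal Python generator walks the item subset lattice depth-first; only items ordered after the last item of the current prefix are used as extensions, so each lattice node is visited exactly once. The four item names are hypothetical.

```python
def enumerate_lattice(items, prefix=()):
    """Yield every non-empty subset extending `prefix`, depth-first.
    `items` must be sorted under the total ordering <=_L; restricting
    extensions to later items guarantees complete, duplicate-free
    enumeration of the lattice."""
    for i, item in enumerate(items):
        node = prefix + (item,)
        yield node
        yield from enumerate_lattice(items[i + 1:], node)

# Four items, as in the sample lattice of Fig. 1
nodes = list(enumerate_lattice(('a', 'b', 'c', 'd')))
# 2^4 - 1 = 15 non-empty subsets, each enumerated exactly once
```

Changing the ordering of the input tuple changes which extensions hang under which prefix, but not the set of nodes enumerated, which matches the completeness remark above.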
Related work
The Apriori algorithm [2,3] is the classic algorithm for finding frequent itemsets,
and many later algorithms are variants of it. It uses the frequent itemsets at level k to
explore those at level k + 1, which requires one scan of the database per level. In
addition, it exploits the property that all nonempty subsets of a frequent itemset must
also be frequent, which prunes unpromising candidates to narrow down the
search space.
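The level-wise strategy just described can be sketched compactly. In this hedged Python sketch, level-k frequent itemsets are joined into (k + 1)-candidates, and any candidate with an infrequent k-subset is pruned before its support is counted; the database is hypothetical and support counting is done in memory rather than by scanning.

```python
from itertools import combinations

def apriori(transactions, eps):
    """Minimal Apriori sketch: level-wise candidate generation with
    subset-based pruning, then support counting."""
    n = len(transactions)
    supp = lambda x: sum(1 for t in transactions if x <= t) / n
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if supp(frozenset([i])) >= eps]
    all_frequent = list(level)
    while level:
        # Join step: unite pairs of level-k itemsets into (k+1)-candidates
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        freq_set = set(level)
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq_set
                             for s in combinations(c, len(c) - 1))}
        level = [c for c in candidates if supp(c) >= eps]
        all_frequent.extend(level)
    return all_frequent

# Hypothetical database
db = [frozenset('abc'), frozenset('ab'), frozenset('ac'), frozenset('bc')]
fi = apriori(db, 0.5)
```

Each pass of the while loop corresponds to one database scan in the original algorithm, which is why Apriori's cost grows with the length of the longest frequent itemset.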
Apriori is based on the horizontal format of the database representation, in
which a transaction is represented as an item list. An alternative way is to
represent a database in vertical format, i.e., each item is associated with a set
of transaction identifiers (TIDs) that include the item. As a representative in
this group, VIPER [17] uses a vertical bit-vector with compression to store
intermediate data during execution and performs counting with a vertical
TID-list approach.
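The vertical representation can be illustrated with a small sketch: each item maps to the set of TIDs of the transactions containing it, and the support of an itemset is obtained by intersecting TID-lists. This omits the bit-vector compression that VIPER adds; the database is hypothetical.

```python
def to_vertical(transactions):
    """Convert a horizontal database to vertical format:
    item -> set of transaction identifiers (TIDs)."""
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def supp(itemset, tidlists, n):
    """supp(X) via TID-list intersection: a transaction contains X
    iff its TID appears in every item's list."""
    tids = set.intersection(*(tidlists[i] for i in itemset))
    return len(tids) / n

# Hypothetical database
db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
v = to_vertical(db)
```

The appeal of the vertical format is that support counting becomes set intersection, with no rescanning of the transactions themselves.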
FP-growth [11] is a fundamentally different algorithm from the Apriori-like
algorithms. The efficiency of Apriori-like algorithms suffers from the exponential
enumeration of candidate itemsets and the repeated database scans at each
level for support checking. To address these weaknesses, the FP-growth algorithm
finds frequent itemsets without candidate generation and compresses the
database into a compact FP-tree structure to avoid repeated database scans.
By keeping the compressed database in main memory, the FP-growth
algorithm achieves substantial performance gains over Apriori-like algorithms.
However, it does not scale to very large databases, because
the FP-trees must fit in main memory.
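FP-tree construction, the structure the paper's algorithm also builds on, can be sketched minimally: each transaction's frequent items, sorted by descending frequency, are inserted as a path from the root, and shared prefixes are merged with shared counts. This sketch covers only tree construction, not the pattern-growth mining step, and the database is hypothetical.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, a count, and child links."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_count):
    # First scan: count item frequencies and drop infrequent items
    freq = Counter(i for t in transactions for i in t)
    order = {i: c for i, c in freq.items() if c >= min_count}
    root = Node(None)
    # Second scan: insert each transaction as a sorted path,
    # merging shared prefixes and incrementing their counts
    for t in transactions:
        path = sorted((i for i in t if i in order),
                      key=lambda i: (-order[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

# Hypothetical database; min_count = 2
db = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]
tree = build_fp_tree(db, 2)
```

Sorting items by descending frequency maximizes prefix sharing, which is what makes the FP-tree compact enough to hold the database in main memory.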
Representative breadth-first algorithms for mining MFI include Pincer-Search
[14] and Max-Miner [4]. The former combines both the bottom-up and
top-down searches. The latter uses lookahead to prune branches from the itemset
lattice by quickly identifying long frequent itemsets.