15-12-2012, 01:57 PM
Fast Algorithms for Mining Association Rules
1Fast Algorithms.pdf (Size: 354.18 KB / Downloads: 27)
Abstract
We consider the problem of discovering association rules between items in a large database
of sales transactions. We present two new algorithms for solving this problem that are fundamentally
dierent from the known algorithms. Experiments with synthetic as well as real-life
data show that these algorithms outperform the known algorithms by factors ranging from three
for small problems to more than an order of magnitude for large problems. We also show how
the best features of the two proposed algorithms can be combined into a hybrid algorithm,
called AprioriHybrid. Scale-up experiments show that AprioriHybrid scales linearly with the
number of transactions. AprioriHybrid also has excellent scale-up properties with respect to the
transaction size and the number of items in the database.
Introduction
Database mining is motivated by the decision support problem faced by most large retail organizations
[S+93]. Progress in bar-code technology has made it possible for retail organizations to
collect and store massive amounts of sales data, referred to as the basket data. A record in such
data typically consists of the transaction date and the items bought in the transaction. Successful
organizations view such databases as important pieces of the marketing infrastructure [Ass92].
They are interested in instituting information-driven marketing processes, managed by database
technology, that enable marketers to develop and implement customized marketing programs and
strategies [Ass90].
The problem of mining association rules over basket data was introduced in [AIS93b]. An
example of such a rule might be that 98% of customers that purchase tires and auto accessories also
get automotive services done. Finding all such rules is valuable for cross-marketing and attached
mailing applications. Other applications include catalog design, add-on sales, store layout, and
customer segmentation based on buying patterns. The databases involved in these applications are
very large. It is imperative, therefore, to have fast algorithms for this task.
Discovering Large Itemsets
Algorithms for discovering large itemsets make multiple passes over the data. In the rst pass, we
count the support of individual items and determine which of them are large, i.e. have minimum
support. In each subsequent pass, we start with a seed set of itemsets found to be large in the
previous pass. We use this seed set for generating new potentially large itemsets, called candidate
itemsets, and count the actual support for these candidate itemsets during the pass over the data.
At the end of the pass, we determine which of the candidate itemsets are actually large, and they
become the seed for the next pass. This process continues until no new large itemsets are found.
The Apriori and AprioriTid algorithms we propose dier fundamentally from the AIS [AIS93b]
and SETM [HS93] algorithms in terms of which candidate itemsets are counted in a pass and
in the way that those candidates are generated. In both the AIS and SETM algorithms (see
Sections 4.1 and 4.2 for a review), candidate itemsets are generated on-the-
y during the pass
as data is being read. Specically, after reading a transaction, it is determined which of the
itemsets found large in the previous pass are present in the transaction. New candidate itemsets
are generated by extending these large itemsets with other items in the transaction. However, as
we will see, the disadvantage is that this results in unnecessarily generating and counting too many
candidate itemsets that turn out to be small.
Algorithm Apriori
Figure 1 gives the Apriori algorithm. The rst pass of the algorithm simply counts item occurrences
to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the
large itemsets Lk