01-10-2012, 10:39 AM
An Efficient Association Rule Mining Algorithm In Distributed Databases
An Efficient Association.pdf (Size: 545.4 KB / Downloads: 12)
Abstract
This paper describes the alarm correlation in communication
networks based on data mining. A direct application of
sequential algorithms to distributed databases is not effective,
because it requires a large amount of communication overhead. In
our study, an efficient algorithm, EDMA, is proposed. It
minimizes the number of candidate sets and exchange messages
by local and global pruning. In local sites, it runs the application
based on the improved algorithm-CMatrix, which is used to
calculate local support counts. By numbering the global frequent
itemsets generated at the end of k-th iteration from 1 to m, the
algorithm codes every candidate (k+1)-itemset into a pair of those
number formed as-(x,y) to compress the context transmitted and
query corresponding support counts in CMatrix. Our solution
also reduces the size of average transactions and datasets that
leads to reduction of scan time. The performance study shows that
EDMA has superior running efficiency, lower communication cost
and stronger scalability than direct application of a sequential
algorithm in distributed databases.
INTRODUCTION
Background
Association rule mining is an active data mining research
area and most ARM algorithms cater to a centralized
environment. However, adapting centralized data mining to
discover useful patterns in distributed database isn't always
feasible because merging data sets from different sites incurs
huge network communication costs. Therefore, our research is
to develop a distributed algorithm for geographically
distributed data sets that reduces communication costs.
1.2 Association Rules
Let I = {i1, i2,… , im} be a set of literals, called items. Let D
be a set of transactions where each transaction T is a set of
items such that ∀T∈D, T⊆I. The transaction T supports an
itemset A if it supports every element a∈A. An association
rule is an implication of the form A =>B, where A⊆I, B⊆I
and A∩ B=∅. The rule A=>B has two properties, support
and confidence. When s% of transactions in D contain A ∪ B,
the support of it is s%. If some of the transactions in D contain
A, and c% also contain B, then the confidence in the rule is c%.
The problem of mining association rules is to generate all
association rules that have support and confidence are larger
than the user-specified minimum support and minimum
confidence respectively. We also call them strong rules to
distinguish from the weak ones.
The Count Distribution (CD) algorithm
CD algorithm uses the sequential Apriori algorithm in a
parallel environment and assumes datasets are horizontally
partitioned among different sites. At each iteration, it generates
the candidate sets at every site by applying the Apriori-gen
function on the set of frequent itemsets found at the previous
iteration. Every site then computes the local support counts of
all these candidate sets and broadcasts them to all the other
sites. Subsequently, all the sites can find the globally frequent
itemsets for that iteration, and then proceed to the next iteration.
This algorithm has a simple communication scheme for count
exchange. However, it also has the similar problems of higher
number of candidate sets and larger amount of communication
overhead. It does not use the memory of the system effectively.
Experimental Results
We have extensively studied our algorithm's performance to
confirm its effectiveness. We compare the performance of our
EDMA with the CD algorithm and consider the superiority of
it with FDM. The algorithm was implemented using Java.
We first fix the number of sites. The aim is to perform the
comparison with respect to different support thresholds and
database sizes. We partition the original data set into four
partitions. To reduce the dependency among different
partitions, each one contains only 25 percent of the original
data set's transactions. So, the number of identical transactions
among different partitions is very low.
In Fig.2, the execution time of EDMA, FDM and CD for a
database of size 10K transactions are plotted against the
support thresholds. We experienced, EDMA is about 4% to
67% faster than CD or FDM, and the difference decreases as
the support increases.
CONCLUSION
Distributed ARM algorithms must reduce communication
costs so that generating global association rules costs less than
combining the participating sites' datasets into a centralized
site. We have developed an efficient algorithm for mining
association rules in distributed databases.