An Efficient Association Rule Mining Algorithm In Distributed Databases

**seminar flower** · 01-10-2012, 10:39 AM

An Efficient Association Rule Mining Algorithm In Distributed Databases

.pdf

An Efficient Association.pdf (Size: 545.4 KB / Downloads: 12)

Abstract

This paper describes the alarm correlation in communication
networks based on data mining. A direct application of
sequential algorithms to distributed databases is not effective,
because it requires a large amount of communication overhead. In
our study, an efficient algorithm, EDMA, is proposed. It
minimizes the number of candidate sets and exchange messages
by local and global pruning. In local sites, it runs the application
based on the improved algorithm-CMatrix, which is used to
calculate local support counts. By numbering the global frequent
itemsets generated at the end of k-th iteration from 1 to m, the
algorithm codes every candidate (k+1)-itemset into a pair of those
number formed as-(x,y) to compress the context transmitted and
query corresponding support counts in CMatrix. Our solution
also reduces the size of average transactions and datasets that
leads to reduction of scan time. The performance study shows that
EDMA has superior running efficiency, lower communication cost
and stronger scalability than direct application of a sequential
algorithm in distributed databases.

INTRODUCTION

Background

Association rule mining is an active data mining research
area and most ARM algorithms cater to a centralized
environment. However, adapting centralized data mining to
discover useful patterns in distributed database isn't always
feasible because merging data sets from different sites incurs
huge network communication costs. Therefore, our research is
to develop a distributed algorithm for geographically
distributed data sets that reduces communication costs.
1.2 Association Rules
Let I = {i1, i2,… , im} be a set of literals, called items. Let D
be a set of transactions where each transaction T is a set of
items such that ∀T∈D, T⊆I. The transaction T supports an
itemset A if it supports every element a∈A. An association
rule is an implication of the form A =>B, where A⊆I, B⊆I
and A∩ B=∅. The rule A=>B has two properties, support
and confidence. When s% of transactions in D contain A ∪ B,
the support of it is s%. If some of the transactions in D contain
A, and c% also contain B, then the confidence in the rule is c%.
The problem of mining association rules is to generate all
association rules that have support and confidence are larger
than the user-specified minimum support and minimum
confidence respectively. We also call them strong rules to
distinguish from the weak ones.

The Count Distribution (CD) algorithm

CD algorithm uses the sequential Apriori algorithm in a
parallel environment and assumes datasets are horizontally
partitioned among different sites. At each iteration, it generates
the candidate sets at every site by applying the Apriori-gen
function on the set of frequent itemsets found at the previous
iteration. Every site then computes the local support counts of
all these candidate sets and broadcasts them to all the other
sites. Subsequently, all the sites can find the globally frequent
itemsets for that iteration, and then proceed to the next iteration.
This algorithm has a simple communication scheme for count
exchange. However, it also has the similar problems of higher
number of candidate sets and larger amount of communication
overhead. It does not use the memory of the system effectively.

Experimental Results

We have extensively studied our algorithm's performance to
confirm its effectiveness. We compare the performance of our
EDMA with the CD algorithm and consider the superiority of
it with FDM. The algorithm was implemented using Java.
We first fix the number of sites. The aim is to perform the
comparison with respect to different support thresholds and
database sizes. We partition the original data set into four
partitions. To reduce the dependency among different
partitions, each one contains only 25 percent of the original
data set's transactions. So, the number of identical transactions
among different partitions is very low.
In Fig.2, the execution time of EDMA, FDM and CD for a
database of size 10K transactions are plotted against the
support thresholds. We experienced, EDMA is about 4% to
67% faster than CD or FDM, and the difference decreases as
the support increases.

CONCLUSION

Distributed ARM algorithms must reduce communication
costs so that generating global association rules costs less than
combining the participating sites' datasets into a centralized
site. We have developed an efficient algorithm for mining
association rules in distributed databases.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Ranked, Efficient and Secure Keyword search over encrypted cloud data PPT	seminar post	1	814	21-09-2017, 11:55 AM Last Post: jaseela123
	Data Mining: What is Data Mining? Report	project girl	1	2,262	21-09-2017, 11:47 AM Last Post: jaseela123
	Enabling Secure and Efficient Ranked Keyword Search over Outsourced Cloud Data pdf	study tips	1	2,018	13-09-2017, 12:59 PM Last Post: jaseela123
	INCREMENTAL MINING USING FREQUENT PATTERN TREE	project topics	1	10,061,816	13-09-2017, 09:40 AM Last Post: jaseela123
	Green Computing for Efficient use of Energy and Electronic Waste Minimization Report	project girl	1	1,357	12-09-2017, 12:37 PM Last Post: jaseela123
	Blowfish Encryption Algorithm pdf	project girl	1	1,113	12-09-2017, 12:36 PM Last Post: jaseela123
	A Reduced-Bit Multiplication Algorithm for Digital Arithmetic	seminar flower	1	1,864	31-08-2017, 03:41 PM Last Post: jaseela123
	Report on Data Mining Technique	study tips	1	986	31-08-2017, 12:45 PM Last Post: jaseela123
	COMPARISON OF DIFFERENT TYPES OF SCHEDULING ALGORITHM IN WIRE NETWORK	seminar surveyer	1	4,508,586	30-08-2017, 11:02 AM Last Post: jaseela123
	Fiber Distributed Data Interface	Computer Science Clay	0	9,748,003	25-08-2017, 09:32 PM Last Post: Computer Science Clay

Quick Reply
Message Type your reply to this message here. Disable Smilies	You have selected one or more posts to quote. Quote these posts now or deselect them.