12-11-2012, 06:21 PM
An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams
Introduction
Mining frequent itemsets is an essential step in many data mining problems, such as
mining association rules, sequential patterns, closed patterns, maximal patterns, and many
other important data mining tasks. The problem of mining frequent itemsets in large
databases was first proposed by Agrawal et al. [2] in 1993, and the problem can be defined
as follows. Let Ψ = {i1, i2, …, in} be a set of literals, called items. Let database DB be a set
of transactions, where each transaction T consists of a set of items, such that T ⊆ Ψ. Each
transaction is also associated with a unique transaction identifier, called TID. A set X ⊆ Ψ
is also called an itemset, where items within an itemset are kept in lexicographic order. A
k-itemset is represented by (x1, x2, …, xk), where x1 < x2 < …< xk. The support of an
itemset X, denoted sup(X), is the number of transactions in which that itemset occurs as a
subset. An itemset X is called a frequent itemset if sup(X) ≥ ms*|DB|, where ms ∈ (0, 1) is
a user-specified minimum support threshold and |DB| is the size of the database. Hence,
the problem of mining frequent itemsets is to mine all itemsets whose support is no less
than ms*|DB| in a large database.
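The static definition above can be illustrated with a small sketch. The code below is a brute-force enumeration for a hypothetical toy database, not the paper's algorithm; the names `DB`, `sup`, and `frequent_itemsets` are illustrative only.

```python
from itertools import combinations

# A hypothetical toy database: each transaction T is a set of items, T ⊆ Ψ.
DB = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "d"},
    {"b", "c"},
]

def sup(itemset, db):
    """sup(X): number of transactions in which itemset X occurs as a subset."""
    return sum(1 for t in db if itemset <= t)

def frequent_itemsets(db, ms):
    """Brute-force: all itemsets X with sup(X) >= ms * |DB|."""
    items = sorted(set().union(*db))          # items in lexicographic order
    threshold = ms * len(db)
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):   # sorted k-itemsets (x1 < x2 < ... < xk)
            s = sup(set(cand), db)
            if s >= threshold:
                result[cand] = s
    return result

print(frequent_itemsets(DB, ms=0.5))
```

With ms = 0.5 and |DB| = 4, the support threshold is 2, so for example (a, c) is frequent (it occurs in two transactions) while (a, b) is not. Exhaustive enumeration is exponential in the number of items, which is why efficient algorithms such as the one proposed here are needed.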
Problem Definition
Based on the estimation mechanism of the Lossy Counting algorithm [8], we propose a
new single-pass algorithm for mining all frequent itemsets in data streams based on a
landmark window model, given a user-specified minimum support threshold ms ∈ (0, 1)
and a user-defined maximal estimated support error threshold ε ∈ (0, ms). For
mining all frequent itemsets, a data stream can be defined as follows.
Let Ψ = {i1, i2, …, im} be a set of literals, called items. Let data stream DS = B1, B2, …,
BN, …, be an infinite sequence of blocks, where each block is associated with a block
identifier i, and N is the identifier of the “latest” block BN. Each block Bi consists of a
timestamp tsi, and a set of transactions; that is, Bi =[tsi, T1, T2, …, Tk], where k ≥ 0. Hence,
the current length (CL) of the data stream is defined as CL = |B1|+|B2|+…+|BN|. A
transaction T consists of a set of items, such that T ⊆ Ψ. Moreover, each transaction is
also associated with a unique transaction identifier, called TID. A set of items X is also
called an itemset and an itemset X with k items is denoted as (x1x2…xk), such that X ⊆ Ψ.
Due to the unique characteristics of streaming data, any one-pass algorithm has to
sacrifice the exactness of its analysis results by allowing some error. Hence, the True
support of an itemset X, denoted by Tsup(X), is the number of transactions seen so far in
which that itemset occurs as a subset, and the Estimated support of the itemset X, denoted
by Esup(X), is the estimate of the true support of X stored in the summary data structure
constructed by the one-scan approach, where Esup(X) ≤ Tsup(X). An itemset X is called
a frequent itemset if Tsup(X) ≥ ms*CL. An itemset is called a sub-frequent itemset if
ms*CL > Tsup(X) ≥ ε*CL. Furthermore, an itemset is called an infrequent itemset if ε*CL
> Tsup(X). An itemset is called a maximal frequent itemset if it is not a subset of any other
frequent itemset generated so far.
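The three categories partition the itemsets by where Tsup(X) falls relative to the two thresholds ms*CL and ε*CL. A minimal sketch of this classification, assuming the true support is available (the function name `classify` and the sample numbers are illustrative, not from the paper):

```python
def classify(tsup, ms, eps, cl):
    """Classify an itemset by its true support Tsup over the CL transactions seen so far."""
    if tsup >= ms * cl:
        return "frequent"        # Tsup(X) >= ms * CL
    if tsup >= eps * cl:
        return "sub-frequent"    # ms * CL > Tsup(X) >= eps * CL
    return "infrequent"          # eps * CL > Tsup(X)

# Example: CL = 1000 transactions seen so far, ms = 0.1, eps = 0.01.
assert classify(150, 0.1, 0.01, 1000) == "frequent"      # 150 >= 100
assert classify(50, 0.1, 0.01, 1000) == "sub-frequent"   # 100 > 50 >= 10
assert classify(5, 0.1, 0.01, 1000) == "infrequent"      # 5 < 10
```

Sub-frequent itemsets are the ones a one-pass algorithm must keep in its summary structure: they are not yet frequent, but they could become frequent later, and discarding them would make the maximal error exceed ε*CL.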
Maximal Estimated Support Error Analysis
In this section, we discuss the maximal support error of the frequent itemsets generated by
DSM-FI. Let X.block-id be the block identifier of itemset X stored in the current IsFI-forest.
Moreover, to simplify the discussion, we assume that the average size of each block is a
constant k; that is, each block contains k transactions. Let the current block identifier of the
incoming stream be block-id(N). Now, we have the following theorem of maximal error
guarantee for DSM-FI's outputs.
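Since DSM-FI borrows its estimation mechanism from Lossy Counting [8], the flavor of the error guarantee can be seen in a sketch of Lossy Counting restricted to single items (the itemset case is analogous but maintains a forest of itemsets). This is an illustration of the underlying mechanism, not the DSM-FI algorithm itself; the function name `lossy_count` is ours:

```python
from math import ceil

def lossy_count(stream, eps):
    """Lossy Counting for single items (sketch).

    The stream is split into buckets of width ceil(1/eps).  Each tracked
    item keeps (freq, delta), where delta bounds the count it may have
    accumulated before being inserted.  At every bucket boundary, entries
    with freq + delta <= current bucket id are pruned.  This guarantees
    Tsup(x) - Esup(x) <= eps * CL for every item x, where CL = n is the
    number of elements seen so far.
    """
    width = ceil(1 / eps)
    entries = {}               # item -> [freq, delta]
    n = 0
    for x in stream:
        n += 1
        b = ceil(n / width)    # current bucket id
        if x in entries:
            entries[x][0] += 1
        else:
            entries[x] = [1, b - 1]
        if n % width == 0:     # bucket boundary: prune low-support entries
            entries = {i: fd for i, fd in entries.items()
                       if fd[0] + fd[1] > b}
    return entries, n

# Usage: items whose estimated count clears (ms - eps) * CL are output
# as frequent; this threshold keeps false negatives impossible while the
# false positives have true support at least (ms - eps) * CL.
entries, n = lossy_count(["a", "b", "a", "c", "a"] * 40, eps=0.1)
ms = 0.5
print([i for i, (f, _) in entries.items() if f >= (ms - 0.1) * n])
```

The pruning step is exactly why Esup(X) ≤ Tsup(X) in the problem definition above: an item may be dropped and later re-inserted, losing at most ε*CL occurrences, which delta accounts for.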
Performance Evaluation
All the experiments are performed on a 1 GHz IBM X24 with 384 MB of memory, and the
program is written in Microsoft Visual C++ 6.0. To evaluate the performance of algorithm
DSM-FI, we conduct empirical studies on synthetic datasets. In Section 4.1, we report
the scalability study of algorithm DSM-FI. In Section 4.2, we compare the memory and
execution time required by DSM-FI with those of the Lossy Counting algorithm. The
parameters of the synthetic data generated by the IBM synthetic data generator [2] are
described as follows.