12-11-2012, 06:21 PM
An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams
Introduction
Mining frequent itemsets is an essential step in many data mining problems, such as
mining association rules, sequential patterns, closed patterns, maximal patterns, and many
other important data mining tasks. The problem of mining frequent itemsets in large
databases was first proposed by Agrawal et al. [2] in 1993, and the problem can be defined
as follows. Let Ψ = {i1, i2, …, in} be a set of literals, called items. Let database DB be a set
of transactions, where each transaction T consists of a set of items, such that T ⊆ Ψ. Each
transaction is also associated with a unique transaction identifier, called TID. A set X ⊆ Ψ
is also called an itemset, where items within an itemset are kept in lexicographic order. A
k-itemset is represented by (x1, x2, …, xk), where x1 < x2 < …< xk. The support of an
itemset X, denoted sup(X), is the number of transactions in which that itemset occurs as a
subset. An itemset X is called a frequent itemset if sup(X) ≥ ms*|DB|, where ms ∈ (0, 1) is
a user-specified minimum support threshold and |DB| is the size of the database. Hence,
the problem of mining frequent itemsets is to mine all itemsets whose support is no less
than ms*|DB| in a large database.
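The static definition above can be illustrated with a small sketch. The code below is a brute-force enumeration for a hypothetical toy database, not the paper's algorithm; the names `DB`, `sup`, and `frequent_itemsets` are illustrative only.

```python
from itertools import combinations

# A hypothetical toy database: each transaction T is a set of items, T ⊆ Ψ.
DB = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "d"},
    {"b", "c"},
]

def sup(itemset, db):
    """sup(X): number of transactions in which itemset X occurs as a subset."""
    return sum(1 for t in db if itemset <= t)

def frequent_itemsets(db, ms):
    """Brute-force: all itemsets X with sup(X) >= ms * |DB|."""
    items = sorted(set().union(*db))          # items in lexicographic order
    threshold = ms * len(db)
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):   # sorted k-itemsets (x1 < x2 < ... < xk)
            s = sup(set(cand), db)
            if s >= threshold:
                result[cand] = s
    return result

print(frequent_itemsets(DB, ms=0.5))
```

With ms = 0.5 and |DB| = 4, the support threshold is 2, so for example (a, c) is frequent (it occurs in two transactions) while (a, b) is not. Exhaustive enumeration is exponential in the number of items, which is why efficient algorithms such as the one proposed here are needed.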
Problem Definition
Based on the estimation mechanism of the Lossy Counting algorithm [8], we propose a
new single-pass algorithm for mining all frequent itemsets in data streams based on a
landmark window model, given a user-specified minimum support threshold ms ∈ (0, 1)
and a user-defined maximal estimated support error threshold ε ∈ (0, ms). For
mining all frequent itemsets, a data stream can be defined as follows.
Let Ψ = {i1, i2, …, im} be a set of literals, called items. Let data stream DS = B1, B2, …,
BN, …, be an infinite sequence of blocks, where each block is associated with a block
identifier i, and N is the identifier of the “latest” block BN. Each block Bi consists of a
timestamp tsi, and a set of transactions; that is, Bi =[tsi, T1, T2, …, Tk], where k ≥ 0. Hence,
the current length (CL) of the data stream is defined as CL = |B1|+|B2|+…+|BN|. A
transaction T consists of a set of items, such that T ⊆ Ψ. Moreover, each transaction is
also associated with a unique transaction identifier, called TID. A set of items X is also
called an itemset and an itemset X with k items is denoted as (x1x2…xk), such that X ⊆ Ψ.
Due to the unique characteristics of streaming data, any one-pass algorithm has to
sacrifice the exactness of its analysis results by allowing some error. Hence, the True
support of an itemset X, denoted by Tsup(X), is the number of transactions seen so far in
which that itemset occurs as a subset, and the Estimated support of the itemset X, denoted
by Esup(X), is the estimate of the true support of X stored in the summary data structure
constructed by the one-scan approach, where Esup(X) ≤ Tsup(X). An itemset X is called
a frequent itemset if Tsup(X) ≥ ms*CL. An itemset is called a sub-frequent itemset if
ms*CL > Tsup(X) ≥ ε*CL. Furthermore, an itemset is called an infrequent itemset if ε*CL
> Tsup(X). An itemset is called a maximal frequent itemset if it is not a subset of any other
frequent itemset generated so far.
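The three categories partition the itemsets by where Tsup(X) falls relative to the two thresholds ms*CL and ε*CL. A minimal sketch of this classification, assuming the true support is available (the function name `classify` and the sample numbers are illustrative, not from the paper):

```python
def classify(tsup, ms, eps, cl):
    """Classify an itemset by its true support Tsup over the CL transactions seen so far."""
    if tsup >= ms * cl:
        return "frequent"        # Tsup(X) >= ms * CL
    if tsup >= eps * cl:
        return "sub-frequent"    # ms * CL > Tsup(X) >= eps * CL
    return "infrequent"          # eps * CL > Tsup(X)

# Example: CL = 1000 transactions seen so far, ms = 0.1, eps = 0.01.
assert classify(150, 0.1, 0.01, 1000) == "frequent"      # 150 >= 100
assert classify(50, 0.1, 0.01, 1000) == "sub-frequent"   # 100 > 50 >= 10
assert classify(5, 0.1, 0.01, 1000) == "infrequent"      # 5 < 10
```

Sub-frequent itemsets are the ones a one-pass algorithm must keep in its summary structure: they are not yet frequent, but they could become frequent later, and discarding them would make the maximal error exceed ε*CL.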
Maximal Estimated Support Error Analysis
In this section, we discuss the maximal support error of the frequent itemsets generated by
DSM-FI. Let X.block-id be the block identifier of itemset X stored in the current IsFI-forest.
Moreover, to simplify the discussion, we assume that the average size of each block is a
constant k; that is, each block contains k transactions. Let the current block identifier of the
incoming stream be block-id(N). Now, we have the following theorem of maximal error
guarantee for DSM-FI's outputs.
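Since DSM-FI borrows its estimation mechanism from Lossy Counting [8], the flavor of the error guarantee can be seen in a sketch of Lossy Counting restricted to single items (the itemset case is analogous but maintains a forest of itemsets). This is an illustration of the underlying mechanism, not the DSM-FI algorithm itself; the function name `lossy_count` is ours:

```python
from math import ceil

def lossy_count(stream, eps):
    """Lossy Counting for single items (sketch).

    The stream is split into buckets of width ceil(1/eps).  Each tracked
    item keeps (freq, delta), where delta bounds the count it may have
    accumulated before being inserted.  At every bucket boundary, entries
    with freq + delta <= current bucket id are pruned.  This guarantees
    Tsup(x) - Esup(x) <= eps * CL for every item x, where CL = n is the
    number of elements seen so far.
    """
    width = ceil(1 / eps)
    entries = {}               # item -> [freq, delta]
    n = 0
    for x in stream:
        n += 1
        b = ceil(n / width)    # current bucket id
        if x in entries:
            entries[x][0] += 1
        else:
            entries[x] = [1, b - 1]
        if n % width == 0:     # bucket boundary: prune low-support entries
            entries = {i: fd for i, fd in entries.items()
                       if fd[0] + fd[1] > b}
    return entries, n

# Usage: items whose estimated count clears (ms - eps) * CL are output
# as frequent; this threshold keeps false negatives impossible while the
# false positives have true support at least (ms - eps) * CL.
entries, n = lossy_count(["a", "b", "a", "c", "a"] * 40, eps=0.1)
ms = 0.5
print([i for i, (f, _) in entries.items() if f >= (ms - 0.1) * n])
```

The pruning step is exactly why Esup(X) ≤ Tsup(X) in the problem definition above: an item may be dropped and later re-inserted, losing at most ε*CL occurrences, which delta accounts for.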
Performance Evaluation
All the experiments are performed on a 1 GHz IBM X24 with 384 MB of memory, and the
program is written in Microsoft Visual C++ 6.0. To evaluate the performance of algorithm
DSM-FI, we conduct empirical studies on synthetic datasets. In Section 4.1, we report
the scalability study of algorithm DSM-FI. In Section 4.2, we compare the memory and
execution time required by DSM-FI with those of the Lossy Counting algorithm. The
parameters of the synthetic data generated by the IBM synthetic data generator [2] are
described as follows.