11-05-2013, 04:35 PM
FP-Growth algorithm
Introduction
Apriori: uses a generate-and-test approach – generates candidate itemsets and tests if they are frequent
Generation of candidate itemsets is expensive (in both space and time)
Support counting is expensive
Subset checking (computationally expensive)
Multiple Database scans (I/O)
FP-Growth: allows frequent itemset discovery without candidate itemset generation. Two-step approach:
Step 1: Build a compact data structure called the FP-tree
Built using 2 passes over the data-set.
Step 2: Extracts frequent itemsets directly from the FP-tree
FP-Tree Construction
FP-Tree is constructed using 2 passes over the data-set:
Pass 1:
Scan the data and find the support for each item.
Discard infrequent items.
Sort frequent items in decreasing order based on their support.
Use this order when building the FP-Tree, so common prefixes can be shared.
Pass 2:
Read each transaction, keep only its frequent items, and sort them according to the global order from Pass 1.
Insert the sorted transaction into the tree, incrementing the count of every node along its path; nodes for the same item are linked together so they can be reached from a header table.
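The two passes above can be sketched in Python. This is a minimal illustration, not the canonical implementation; the class name, function name, and toy transactions are made up for the sketch:

```python
from collections import defaultdict

class FPNode:
    """A node in the FP-tree: an item label, a count, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count the support of each item and discard infrequent ones.
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    frequent = {i: s for i, s in support.items() if s >= min_support}

    # Global ordering: decreasing support (ties broken by item label).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Pass 2: insert each transaction's frequent items, sorted by rank,
    # so that transactions with common prefixes share nodes.
    root = FPNode(None)
    header = defaultdict(list)   # item -> all nodes carrying that item
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=rank.get)
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, frequent

# Toy data set (illustrative only).
transactions = [
    ['a', 'b'], ['b', 'c', 'd'], ['a', 'c', 'd', 'e'],
    ['a', 'd', 'e'], ['a', 'b', 'c'],
]
root, header, frequent = build_fp_tree(transactions, min_support=2)
```

Here `header` plays the role of the header table: summing the counts of all nodes listed under an item recovers that item's support from the tree alone.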
FP-Tree size
The FP-Tree usually has a smaller size than the uncompressed data - typically many transactions share items (and hence prefixes).
Best case scenario: all transactions contain the same set of items.
1 path in the FP-tree
Worst case scenario: every transaction has a unique set of items (no items in common)
Size of the FP-tree is at least as large as the original data.
Storage requirements for the FP-tree are then higher than for the original data - each node must also store a counter and pointers (to its children, its parent, and the next node for the same item).
The size of the FP-tree depends on how the items are ordered
Ordering by decreasing support is typically used but it does not always lead to the smallest tree (it's a heuristic).
Conditional FP-Tree
The FP-Tree that would be built if we only considered the transactions containing a particular itemset, with that itemset then removed from every transaction (and items that become infrequent discarded).
Example: FP-Tree conditional on e.
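The conditional data for an item can be read off the FP-tree directly: for each node carrying that item, follow parent pointers up to the root to collect its prefix path together with the node's count (the so-called conditional pattern base). The conditional FP-tree is then built from these weighted paths by the same two-pass construction. A self-contained sketch with made-up names and toy data:

```python
class Node:
    """Minimal FP-tree node: item label, count, parent and child links."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def insert(root, header, path):
    """Insert one transaction (items already in the global order)."""
    node = root
    for item in path:
        if item not in node.children:
            node.children[item] = Node(item, parent=node)
            header.setdefault(item, []).append(node.children[item])
        node = node.children[item]
        node.count += 1

def conditional_pattern_base(item, header):
    """For each occurrence of `item`, walk up to the root and record the
    prefix path together with that occurrence's count."""
    base = []
    for node in header.get(item, []):
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path[::-1], node.count))
    return base

# Toy tree: items within each transaction are already in frequency order.
root, header = Node(None), {}
for t in [['a', 'c', 'd', 'e'], ['a', 'd', 'e'], ['a', 'b']]:
    insert(root, header, t)

base = conditional_pattern_base('e', header)
# base holds the prefix paths of e with their counts, e.g. (['a', 'd'], 1)
```

Running the construction from Pass 1 and Pass 2 on these weighted paths (treating each path as `count` identical transactions) yields the FP-tree conditional on e.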