01-01-2013, 01:01 PM
Sequential Pattern Mining
Sequence Databases
A sequence database consists of ordered elements or events
Transaction databases vs. sequence databases
Applications
Applications of sequential pattern mining
Customer shopping sequences:
First buy computer, then CD-ROM, and then digital camera, within 3 months.
Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures
Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases
A mining algorithm should
find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
be highly efficient, scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints
Studies on Sequential Pattern Mining
Concept introduction and an initial Apriori-like algorithm
Agrawal & Srikant. Mining sequential patterns, [ICDE’95]
Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal [EDBT’96])
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al.KDD’00; Pei, et al. [ICDE’01])
Vertical format-based mining: SPADE (Zaki [Machine Leanining’00])
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim [VLDB’99]; Pei, Han, Wang [CIKM’02])
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar [SDM’03])
GSP—Generalized Sequential Pattern Mining
GSP (Generalized Sequential Pattern) mining algorithm
Outline of the method
Initially, every item in DB is a candidate of length-1
for each level (i.e., sequences of length-k) do
scan database to collect support count for each candidate sequence
generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
repeat until no frequent sequence or no candidate can be found
Major strength: Candidate pruning by Apriori
Pseudo-Projection vs. Physical Projection
Pseudo-projection avoids physically copying postfixes
Efficient in running time and space when database can be held in main memory
However, it is not efficient when database cannot fit in main memory
Disk-based random accessing is very costly
Suggested Approach:
Integration of physical and pseudo-projection
Swapping to pseudo-projection when the data set fits in memory
Summary
Sequential Pattern Mining is useful in many application, e.g. weblog analysis, financial market prediction, BioInformatics, etc.
It is similar to the frequent itemsets mining, but with consideration of ordering.
We have looked at different approaches that are descendants from two popular algorithms in mining frequent itemsets
Candidates Generation: AprioriAll and GSP
Pattern Growth: FreeSpan and PrefixSpan