06-10-2012, 03:02 PM
Suffix trees
Suffix tree.pdf (Size: 456.79 KB / Downloads: 49)
Data structures for string pattern
matching: Suffix trees
• Linear algorithms for exact string matching
– KMP
– Z-value algorithm
• What is suffix tree?
– A tree-like data structure for solving problems
involving strings.
– Related data structures: Trie (retrieval) & PATRICIA
(radix tree)
– Allow the storage of all substrings of a given string in
linear space
– Simple algorithm to solve string pattern matching
problem in linear time
Better than hash tables?
• Hash tables are certainly easier to understand.
And, one can produce a hash table of all length
k strings in O(m) time and look up a k-length
string x in O(k) time, finding all p places where
string x is found in O(p) time. This is the same
as the bound for suffix trees.
• What if you don’t know how long the string x is
going to be?
• And most other string matching tricks don’t work
for it either.
Suffix Tree: definition
• A suffix tree ST for an m-character string S
is a rooted directed tree with exactly m
leaves numbered 1 to m.
• Each internal node, other than the root,
has at least two children and each edge is
labeled with a nonempty substring of S.
• No two edges out of a node can have
edge-labels beginning with the same
character.
• The key feature of the suffix tree is that for
any leaf i, the concatenation of the edgelabels
on the path from the root to the leaf
i exactly spells out the suffix of S that
starts at position i.
Exact string matching problem
• Given a pattern P of length m, find all
occurrences of P in text T
– O(n+m) algorithm
• Solution: Build a suffix tree ST for text T in
O(m) time. Then, match the characters of
P along the unique path in ST until either
P is exhausted or no more matches are
possible.
Comparing to the other algorithms
• KMP and Boyer-Moore both achieve this worst
case bound.
– O(m+n) when the text and pattern are presented
together.
• Suffix trees are much faster when the text is fixed
and known first while the patterns vary.
– O(m) for single time processing the text, then only O(n)
for each new pattern.
• Based on suffix trees, is faster for searching a
number of patterns at one time against a single
text (exact set matching problem)
– Aho-Corasick algorithm: preprocessing P instead of T.
Exact string matching
• Both P (|P|=n) and T (|T|=m) are known:
– Suffix tree method achieves same worst-case bound
O(n+m) as KMP.
• T is fixed and build suffix tree, then P is input, k
is the number of occurrences of P
– Using suffix tree: O(n+k)
– In contrast (KMP, preprocess P): O(n+m) for any
single P
• P is fixed, then T is input
– Selecting KMP rather than suffix tree
– or Aho-Corasick algorithm (exact set matching
problem)
Applications
• Problems
– linear-time longest common substring
– constant-time least common ancestor
– maximally repetitive structures
– all-pairs suffix-prefix matching
– compression
– inexact matching
– conversion to suffix arrays