19-01-2013, 12:33 PM
The Boyer-Moore Algorithm
The Boyer-Moore.pdf (Size: 157.45 KB / Downloads: 180)
Introduction
This chapter develops a number of classical comparison-based matching algorithms
for the exact matching problem. With suitable extensions, all of
these algorithms can be implemented to run in linear worst case time, and
all achieve this performance by preprocessing pattern P. (Methods that
preprocess T will be considered in Part II of the book.) The original preprocessing
methods for these various algorithms are related in spirit, but quite
dierent in conceptual diculty. Some of the original preprocessing methods
are quite dicult1. This chapter does not follow the original preprocessing
methods but instead exploits fundamental preprocessing, developed in the
previous chapter, to implement the needed preprocessing for each specic
matching algorithm.
Also, in contrast to previous expositions, we emphasize the Boyer-Moore
method over the Knuth-Morris-Pratt method, since Boyer-Moore is the practical
method of choice for exact matching. Knuth-Morris-Pratt is nonetheless
completely developed, partly for historical reasons but mostly because
it generalizes to problems such as real-time string matching and matching
against a set of patterns more easily than Boyer-Moore does. These two
topics will be described in this chapter and the next.
The Boyer-Moore Algorithm
As in the naive algorithm, the Boyer-Moore algorithm successively aligns P
with T and then checks whether P matches the opposing characters of T.
Further, after the check is complete, P is shifted right relative to T just as
in the naive algorithm. However, the Boyer-Moore algorithm contains three
clever ideas not contained in the naive algorithm { the right to left scan, the
bad character shift rule and the good sux shift rule. Together, these ideas
lead to a method that typically examines fewer than m + n characters (an
expected sublinear-time method), and that (with a certain extension) runs
in linear worst case time. Our discussion of the Boyer-Moore algorithm, and
extensions of it, concentrates on provable aspects of its behavior. Extensive
experimental and practical studies of Boyer-Moore and variants have been
reported in [?, ?, ?, ?, ?].
Right to left scan
For any alignment of P with T the Boyer-Moore algorithm checks for an
occurrence of P by scanning characters from right to left rather than from
left to right as in the naive algorithm. For example consider the alignment
of P against T shown below.
Bad character rule
To get the idea of the bad character rule, suppose that the last (rightmost)
character of P is y and the character in T it aligns with is x 6= y. When this
initial mismatch occurs, if we know the rightmost position in P of character
x, we can safely shift P to the right so that the rightmost x in P is below
the mismatched x in T. Any shorter shift would only result in an immediate
mismatch. So, the longer shift is correct, i.e., it will not shift past any
occurrence of P in T. Further, if x never occurs in P, then we can shift P
completely past the point of mismatch in T. In these cases, some characters
of T will never be examined and the method will actually run in \sub-linear"
time. This observation is formalized below.
Extended bad character rule
The bad character rule is a useful heuristic for mismatches near the right end
of P, but it has no eect if the mismatching character from T occurs in P to
the right of the mismatch point. This may be common when the alphabet
is small and the text contains many similar, but not exact, substrings. That
situation is typical of DNA which has an alphabet of size four, and even
protein, which has an alphabet of size twenty, often contains dierent regions
of high similarity.