The clever scheme of the KMP-search yields genuine benefits only if a mismatch was preceded by a partial match of some length. Only in this case is the pattern shift increased to more than 1. Unfortunately, this is the exception rather than the rule; matches occur much less often than mismatches. Therefore the gain from using the KMP strategy is marginal in most cases of normal text searching. The method to be discussed here improves performance not only in the worst case, but also in the average case. It was invented by R.S. Boyer and J.S. Moore around 1975, and we shall call it BM-search. We shall here present a simplified version of BM-search before proceeding to the one given by Boyer and Moore.
BM-search is based on the unconventional idea of comparing characters starting at the end of the pattern rather than at the beginning. As in the case of KMP-search, the pattern is precompiled into a table d before the actual search starts. Let, for every character x in the character set, d[x] be the distance of the rightmost occurrence of x in the pattern from its end. The last pattern character itself is not counted, so that d[x] ≥ 1, and d[x] = M if x does not occur among the first M-1 pattern characters. Now assume that a mismatch between string and pattern was discovered, and let c be the text character lying opposite the last pattern character. Then the pattern can immediately be shifted to the right by d[c] positions, an amount that is quite likely to be greater than 1. If c does not occur in the pattern at all, the shift is even greater, namely equal to the entire pattern's length. The following example illustrates this process.
    Hoola-Hoola girls like Hooligans.
    Hooligan
         Hooligan
           Hooligan
                   Hooligan
                           Hooligan
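The alignments of this example can be reproduced mechanically. The following is a minimal Python sketch, given for illustration only; the helper names shift_table and alignments are hypothetical and do not appear in the text. It assumes that the table is built as in the program given later, i.e. the last pattern character is skipped when distances are recorded, and that characters absent from the table take the default shift M.

```python
def shift_table(p):
    """Map each character of p[0..M-2] to its distance from the end of p.

    Characters that are missing from the result take the default shift M.
    """
    m = len(p)
    d = {}
    for j in range(m - 1):
        d[p[j]] = m - j - 1  # a later (more rightward) occurrence overwrites
    return d


def alignments(s, p):
    """Return, in order, the text offsets at which p is laid against s."""
    m, n = len(p), len(s)
    d = shift_table(p)
    offsets = []
    i = m  # s[i-1] lies opposite the last pattern character
    while i <= n:
        offsets.append(i - m)
        if s[i - m:i] == p:  # shortcut for the right-to-left comparison
            break
        i += d.get(s[i - 1], m)
    return offsets


text = "Hoola-Hoola girls like Hooligans."
pattern = "Hooligan"
for off in alignments(text, pattern):
    print(" " * off + pattern)
```

For this text the pattern is placed at offsets 0, 5, 7, 15 and 23. The shifts 5 and 2 are the table entries for 'o' and 'g'; the two shifts of 8 arise because 'r' and the blank do not occur in the pattern.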
Since individual character comparisons now proceed from right to left, the following, slightly modified versions of the predicates P and Q are more convenient.
    P(i, j) = Ak: j ≤ k < M : s[i+k] = p[k]
    Q(i) = Ak: 0 ≤ k < i : ~P(k, 0)
These predicates are used in the following formulation of the BM-algorithm to denote the invariant conditions.

    i := M; j := M;
    WHILE (j > 0) & (i <= N) DO
      (* Q(i-M) *)
      j := M; k := i;
      WHILE (j > 0) & (s[k-1] = p[j-1]) DO
        (* P(k-j, j) & (k-j = i-M) *)
        DEC(k); DEC(j)
      END ;
      i := i + d[s[i-1]]
    END
Inside the loops the indices satisfy 0 ≤ j ≤ M and 0 ≤ k ≤ i ≤ N. Therefore, termination with j = 0, together with P(k-j, j), implies P(k, 0), i.e., a match at position k. Termination with j > 0 demands that i > N; hence Q(i-M) covers every position 0 ... N-M at which the pattern could start, signalling that no match exists. Of course we still have to convince ourselves that Q(i-M) and P(k-j, j) are indeed invariants of the two repetitions. They are trivially satisfied when repetition starts, since Q(0) and P(x, M) are always true.
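The invariants can also be checked at run time. The sketch below is a Python transliteration of the loop above with the two predicates made executable; it is an illustration under stated assumptions, not part of the original program. P(i, j) is read as "p[j..M-1] agrees with the text when the pattern starts at position i", Q(i) as "no match starts at any position below i", and the name bm_checked as well as the use of a dictionary for d are choices made here.

```python
def P(s, p, i, j):
    """p[j..M-1] agrees with the text when p starts at text position i."""
    return all(s[i + k] == p[k] for k in range(j, len(p)))


def Q(s, p, i):
    """No match of p starts at any of the text positions 0 .. i-1."""
    return all(not P(s, p, k, 0) for k in range(i))


def bm_checked(s, p):
    """Simplified BM-search with both invariants asserted while running.

    Returns the position of the first match, or -1 if there is none.
    """
    m, n = len(p), len(s)
    d = {p[t]: m - t - 1 for t in range(m - 1)}  # rightmost occurrence wins
    i, j, k = m, m, m
    while j > 0 and i <= n:
        assert Q(s, p, i - m)
        j, k = m, i
        while j > 0 and s[k - 1] == p[j - 1]:
            assert P(s, p, k - j, j) and k - j == i - m
            k -= 1
            j -= 1
        i += d.get(s[i - 1], m)  # default shift M for characters not in d
    return k if j == 0 else -1
```

The assertions are expensive, since Q re-examines all earlier positions, so this form is useful for experiments on short strings only.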
Let us first consider the effect of the two statements decrementing k and j. Q(i-M) is not affected, and, since s[k-1] = p[j-1] had been established, P(k-j, j-1) holds as precondition, guaranteeing P(k-j, j) as postcondition. If the inner loop terminates with j > 0, the fact that s[k-1] ≠ p[j-1] implies ~P(k-j, 0), since

    ~P(i, 0) = Ek: 0 ≤ k < M : s[i+k] ≠ p[k]
Moreover, because k-j = i-M, Q(i-M) & ~P(k-j, 0) = Q(i+1-M), establishing a non-match at position i-M. Next we must show that the statement i := i + d[s[i-1]] never falsifies the invariant. This is the case, provided that before the assignment Q(i+d[s[i-1]]-M) is guaranteed. Since we know that Q(i+1-M) holds, it suffices to establish ~P(i+h-M, 0) for h = 1, 2, ... , d[s[i-1]]-1. We now recall that d[x] is defined as the distance of the rightmost occurrence of x in the pattern from the end, the last pattern character not being counted. This is formally expressed as

    Ak: M-d[x] ≤ k < M-1 : p[k] ≠ x

Substituting s[i-1] for x, and then M-1-h for k, we obtain

    Ak: M-d[s[i-1]] ≤ k < M-1 : s[i-1] ≠ p[k]
    Ah: 1 ≤ h < d[s[i-1]] : s[i-1] ≠ p[M-1-h]
    Ah: 1 ≤ h < d[s[i-1]] : ~P(i+h-M, 0)

The last step holds because, with the pattern starting at position i+h-M, the text character s[i-1] lies opposite p[M-1-h].
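The property of the table d on which this argument rests can be tested exhaustively for any given pattern. The following Python sketch is an illustration only; the function names and the explicit alphabet parameter are assumptions introduced here. It checks that no shift smaller than d[x] places an occurrence of x opposite the text character x, and that a shift of exactly d[x] does so whenever d[x] < M.

```python
def shift_table(p, alphabet):
    """Table d over the given alphabet; the last character of p is skipped."""
    m = len(p)
    d = {x: m for x in alphabet}
    for j in range(m - 1):
        d[p[j]] = m - j - 1
    return d


def smaller_shifts_fail(p, alphabet):
    """For every x and every h with 1 <= h < d[x]: p[M-1-h] differs from x."""
    m = len(p)
    d = shift_table(p, alphabet)
    return all(p[m - 1 - h] != x for x in d for h in range(1, d[x]))


def full_shift_hits(p, alphabet):
    """Whenever d[x] < M, shifting by d[x] puts an x of p opposite x."""
    m = len(p)
    d = shift_table(p, alphabet)
    return all(p[m - 1 - d[x]] == x for x in d if d[x] < m)
```

Both functions return True for any pattern, for instance for "Hooligan" over the characters of the example text; a failure would indicate an error in the construction of d.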
The following program includes the presented, simplified Boyer-Moore strategy in a setting similar to that of the preceding KMP-search program. Note as a detail that a repeat statement is used in the inner loop, decrementing k and j before comparing s and p. This eliminates the -1 terms in the index expressions.
PROCEDURE Search(VAR s, p: ARRAY OF CHAR; m, n: INTEGER; VAR r: INTEGER);
  (*search for pattern p of length m in text s of length n; assumes 0 < m <= n*)
  (*if p is found, then r indicates the position in s, otherwise r = -1*)
  VAR i, j, k: INTEGER;
    d: ARRAY 128 OF INTEGER;
BEGIN
  FOR i := 0 TO 127 DO d[i] := m END ;  (*default shift: the whole pattern*)
  FOR j := 0 TO m-2 DO d[ORD(p[j])] := m-j-1 END ;  (*rightmost occurrence wins*)
  i := m;
  REPEAT j := m; k := i;
    REPEAT DEC(k); DEC(j)
    UNTIL (j < 0) OR (p[j] # s[k]);
    i := i + d[ORD(s[i-1])]
  UNTIL (j < 0) OR (i > n);
  IF j < 0 THEN r := k+1 (*k stopped one position before the match*) ELSE r := -1 END
END Search
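One way to gain confidence in a port of this procedure is to compare it with a library routine on all short strings over a small alphabet. The sketch below is a hypothetical Python port, not the author's code: it uses a dictionary instead of the 128-entry array, returns the index of the first matched character, and treats an empty pattern as matching at position 0. The reference is Python's built-in str.find, which returns the lowest index of an occurrence or -1.

```python
from itertools import product


def search(s, p):
    """Return the index in s where p first occurs, or -1 if it does not."""
    m, n = len(p), len(s)
    if m == 0:
        return 0  # assumption made here: the empty pattern matches at 0
    d = {}
    for j in range(m - 1):
        d[p[j]] = m - j - 1
    i = m
    while i <= n:
        j, k = m - 1, i - 1
        while j >= 0 and p[j] == s[k]:  # compare from right to left
            j -= 1
            k -= 1
        if j < 0:
            return k + 1  # k stopped one position before the match
        i += d.get(s[i - 1], m)
    return -1


def agrees_with_find(alphabet="ab", max_text=6, max_pat=3):
    """Compare search with str.find on every short text and pattern."""
    for n in range(max_text + 1):
        for text in map("".join, product(alphabet, repeat=n)):
            for m in range(1, max_pat + 1):
                for pat in map("".join, product(alphabet, repeat=m)):
                    if search(text, pat) != text.find(pat):
                        return False
    return True
```

A small alphabet is chosen deliberately: it makes partial matches and repeated characters frequent, which is where an off-by-one error in the shift or in the reported position would show up.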
Analysis of Boyer-Moore Search. The original publication of this algorithm [1-9] contains a detailed analysis of its performance. The remarkable property is that in all except specially constructed cases it requires substantially fewer than N comparisons. In the luckiest case, where the last character of the pattern always meets a text character that does not occur in the pattern at all, the number of comparisons is N/M.
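The two extremes can be observed by counting character comparisons. The following Python sketch is an illustration for the simplified strategy only; the function name and the two test inputs are assumptions chosen here, not taken from the original analysis. The first input is the luckiest case described above, the second is a constructed unfavourable case in which every alignment matches all but the first pattern character and the shift is only 1.

```python
def count_comparisons(s, p):
    """Simplified BM-search returning (position or -1, comparisons made)."""
    m, n = len(p), len(s)
    d = {p[t]: m - t - 1 for t in range(m - 1)}
    comparisons = 0
    i = m
    while i <= n:
        j, k = m, i
        while j > 0:
            comparisons += 1
            if s[k - 1] != p[j - 1]:
                break
            k -= 1
            j -= 1
        if j == 0:
            return k, comparisons
        i += d.get(s[i - 1], m)
    return -1, comparisons


N, M = 1000, 10
lucky = count_comparisons("a" * N, "b" * M)
unlucky = count_comparisons("a" * N, "b" + "a" * (M - 1))
print(lucky, unlucky)
```

With N = 1000 and M = 10 the lucky case needs N/M = 100 comparisons, whereas the constructed case needs M(N-M+1) = 9910, far more than N.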
The authors provide several ideas on possible further improvements. One is to combine the strategy explained above, which provides greater shifting steps when a mismatch is present, with the Knuth-Morris-Pratt strategy, which allows larger shifts after detection of a (partial) match. This method requires two precomputed tables; d1 is the table used above, and d2 is the table corresponding to the one of the KMP-algorithm. The step taken is then the larger of the two, both indicating that no smaller step could possibly lead to a match. We refrain from further elaborating the subject, because the additional complexity of the table generation and the search itself does not seem to yield any appreciable efficiency gain. In fact, the additional overhead is larger, and casts some doubt on whether the sophisticated extension is an improvement or a deterioration.