
We first present a linear-space index based on the Aho-Corasick (AC) automaton [AC75]. Recall the encoding scheme of Crochemore et al. [CIK+13], outlined in Section 6.2.1, which converts a string S to a string order(S) as follows. For i ∈ [1, |S|], let p⁻ (resp. p⁺) be the highest value (resp. lowest value) in S[1, i − 1] that is at most S[i] (resp. at least S[i]). Let j⁻ (resp. j⁺) be the position of the rightmost occurrence of p⁻ (resp. p⁺) in [1, i). If p⁻ (resp. p⁺) does not exist, then assign j⁻ = i (resp. j⁺ = i). Assign order(S)[i] = ⟨i − j⁻, i − j⁺⟩.

Two pairs ⟨xi, yi⟩ and ⟨xj, yj⟩ in this encoding scheme are the same iff xi = xj and yi = yj. Two strings X and Y are order-preserving iff order(X) = order(Y). Also, X is order-preserving with a prefix of Y iff order(X) is a prefix of order(Y). See Fact 6.1.
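The encoding can be illustrated with a minimal Python sketch. It is a naive quadratic-time version written only to make the definition concrete, not the construction used by the index; the function names `order` and `order_preserving` are chosen here for illustration, and positions are 0-based internally (the differences i − j⁻ and i − j⁺ are unaffected by the indexing base).

```python
def order(s):
    """Return the order-preserving encoding of the sequence s as a list of pairs."""
    code = []
    for i, x in enumerate(s):
        j_minus = j_plus = i              # default when p- / p+ does not exist
        p_minus = p_plus = None
        for k in range(i):                # scan the prefix s[0 .. i-1]
            v = s[k]
            # p-: highest value <= x; ">=" lets a later equal value win (rightmost)
            if v <= x and (p_minus is None or v >= p_minus):
                p_minus, j_minus = v, k
            # p+: lowest value >= x; "<=" lets a later equal value win (rightmost)
            if v >= x and (p_plus is None or v <= p_plus):
                p_plus, j_plus = v, k
        code.append((i - j_minus, i - j_plus))
    return code


def order_preserving(x, y):
    """Two sequences are order-preserving iff their encodings coincide."""
    return order(x) == order(y)


print(order([10, 30, 20]))                        # [(0, 0), (1, 0), (2, 1)]
print(order_preserving([10, 30, 20], [1, 5, 3]))  # True
```

For instance, (10, 30, 20) and (1, 5, 3) receive the same encoding, whereas (10, 20, 30) does not, and the encoding of (1, 5, 3, 4) has the encoding of (10, 30, 20) as a prefix.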

7.2.1

The 3 Main Components of the Linear Index

Compute order(Pi) for every Pi in D, and then create a trie T for all the encoded patterns. Let the number of nodes in the trie be m, where m ≤ n + 1. For a node u, denote by order(u) the string formed by concatenating the edge labels from root to u. Mark a node u in the trie as final iff order(u) = order(Pi) for some Pi in D. Clearly, the number of final nodes is d, i.e., the number of patterns in the dictionary. For any node u, define strDepth(u) = |order(u)| and ζ(u, j) = order(Pi[j, strDepth(u)]), where Pi is a pattern whose corresponding final node lies in the subtree rooted at u. Each node u is associated with 3 links as defined below:

• next(u, c) = v iff the edge from u to v is labeled by the character c.

• failure(u) = v iff order(v) = ζ(u, j), where j > 1 is the smallest index for which such a node v exists. If no such j exists, then failure(u) points to the root node. This represents the smallest shift to be performed in T in case of a mismatch.

• report(u) = v iff v is a final node and order(v) = ζ(u, j), where j > 1 is the smallest index for which such a final node v exists. If no such j exists, then report(u) points to the root node. This represents a pattern with an occurrence ending at the current text position.
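The three links can be made concrete with a small Python sketch. It builds the trie over the encoded patterns and then computes failure and report links directly from their definitions by re-encoding every suffix, which is far slower than a proper construction and is meant only as an illustration. The names `build_trie`, `locate`, and `rep` are hypothetical helpers introduced here; node 0 is the root, and `order` is a compact naive encoder.

```python
def order(s):
    """Naive order-preserving encoding (quadratic time, for illustration only)."""
    code = []
    for i, x in enumerate(s):
        lo = [k for k in range(i) if s[k] <= x]
        hi = [k for k in range(i) if s[k] >= x]
        # rightmost position of the largest value <= x (resp. smallest value >= x)
        jm = max(lo, key=lambda k: (s[k], k)) if lo else i
        jp = max(hi, key=lambda k: (-s[k], k)) if hi else i
        code.append((i - jm, i - jp))
    return code


def build_trie(patterns):
    # Parallel arrays indexed by node id; rep[u] is a pattern passing through u.
    nxt, depth, final, rep = [{}], [0], [None], [None]
    for pid, p in enumerate(patterns):
        u = 0
        for c in order(p):
            if c not in nxt[u]:
                nxt[u][c] = len(nxt)          # id of the node appended below
                nxt.append({})
                depth.append(depth[u] + 1)
                final.append(None)
                rep.append(pid)
            u = nxt[u][c]
        final[u] = pid                         # mark u as a final node

    def locate(code):
        """Return the node whose root path spells code, or None."""
        u = 0
        for c in code:
            u = nxt[u].get(c)
            if u is None:
                return None
        return u

    failure, report = [0] * len(nxt), [0] * len(nxt)   # default: root
    for u in range(1, len(nxt)):
        prefix = patterns[rep[u]][:depth[u]]
        have_f = have_r = False
        for j in range(1, depth[u]):           # 0-based j >= 1, i.e. j > 1 in the text
            v = locate(order(prefix[j:]))      # node with order(v) = zeta(u, j), if any
            if v is None:
                continue
            if not have_f:
                failure[u], have_f = v, True
            if not have_r and final[v] is not None:
                report[u], have_r = v, True
            if have_f and have_r:
                break
    return nxt, final, failure, report


nxt, final, failure, report = build_trie([[1, 2, 3], [1, 2], [2, 1]])
print(failure)   # [0, 0, 1, 2, 1]
print(report)    # [0, 0, 0, 2, 0]
```

In this example the node for (1, 2, 3) has both its failure link and its report link pointing to the node for (1, 2), because the suffix (2, 3) is order-preserving with the pattern (1, 2).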

The total space needed is Θ(m log m) bits. We note that the number of states in T is the same as the number of states in the trie of Kim et al. [KEF+14] for the collection D.

7.2.2

The Querying Algorithm

To find the occurrences, we use a balanced binary search tree (BST) that stores the symbols in Σ that appear within a certain sliding window of the text T. To this end, we also maintain an array A[1, σ] such that A[c] equals the position of the latest occurrence of c ∈ Σ.

Now, match T in the trie as follows. Suppose we are considering position j in T (initially, j = 1) and we are at a node u, i.e., we have matched T[j − strDepth(u) + 1, j] in the trie. First, repeatedly follow report-links starting from u until the root node is reached, thereby reporting all patterns with a match ending at j. Next, consider the character T[j + 1]. We have to obtain cj,u = order(T[j − strDepth(u) + 1, |T|])[strDepth(u) + 1]. Using the BST, find the largest (resp. smallest) symbol within the window that is at most (resp. at least) T[j + 1]. Then use the array A to find the rightmost occurrence of these symbols within the window, which gives the desired encoding.

If v = next(u, cj,u) is defined, follow it, and update the BST (by inserting T[j + 1] if it is not already present) and the array A (by setting A[T[j + 1]] = j + 1) to incorporate the symbol T[j + 1]. Repeat by setting u = v and j = j + 1; in this case, the right boundary of the sliding window has moved by one position. Otherwise, if next(u, cj,u) is not defined, follow failure(u) to a node w and repeat by setting u = w; in this case, the left boundary of the sliding window changes. Specifically, we slide over strDepth(u) − strDepth(w) characters, where u denotes the node before the update. For each character c slid over, we check A[c] to determine whether c still occurs in the current sliding window; if it does not, we remove c from the BST. We continue this process until the last character of T is read.

The number of deletion, search, and insertion operations in the BST is at most 3|T|, each requiring O(log σ) time. Hence, each character in T is encoded in O(log σ) amortized time. On following a report link, either we report an occurrence or we reach the root. After that, we either take a next transition or follow a failure link; the number of such operations combined is at most 2|T|. Each transition takes O(1) time. Therefore, the total time required is O(|T| log σ + occ), where occ is the number of reported occurrences.
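The sliding-window bookkeeping described above can be sketched in Python as follows. This is an illustration under simplifying assumptions: a sorted list with `bisect` stands in for the balanced BST (so insertions and deletions cost O(σ) here rather than O(log σ)), a dictionary plays the role of the array A, and positions are 0-based. The class name `Window` and its methods `encode`, `extend`, and `shrink` are hypothetical names introduced for this sketch.

```python
import bisect


class Window:
    """Sliding window over a text; a sorted list stands in for the balanced BST."""

    def __init__(self, text):
        self.text = text
        self.keys = []      # distinct symbols currently in the window, sorted
        self.last = {}      # plays the role of A[c]: latest position of symbol c
        self.left = 0       # left boundary of the window (0-based, inclusive)

    def encode(self, pos):
        """Encoding of text[pos] relative to the window text[left .. pos-1]."""
        c = self.text[pos]
        i = bisect.bisect_right(self.keys, c)     # keys[i-1]: largest symbol <= c
        k = bisect.bisect_left(self.keys, c)      # keys[k]: smallest symbol >= c
        j_minus = self.last[self.keys[i - 1]] if i > 0 else pos
        j_plus = self.last[self.keys[k]] if k < len(self.keys) else pos
        return (pos - j_minus, pos - j_plus)

    def extend(self, pos):
        """Move the right boundary past pos (the case of a next transition)."""
        c = self.text[pos]
        k = bisect.bisect_left(self.keys, c)
        if k == len(self.keys) or self.keys[k] != c:
            self.keys.insert(k, c)                # insert only if not yet present
        self.last[c] = pos

    def shrink(self, new_left):
        """Move the left boundary to new_left (the case of a failure link)."""
        for p in range(self.left, new_left):
            c = self.text[p]
            if self.last[c] < new_left:           # c no longer occurs in the window
                k = bisect.bisect_left(self.keys, c)
                if k < len(self.keys) and self.keys[k] == c:
                    del self.keys[k]
        self.left = new_left


w = Window([3, 1, 4, 1, 5])
codes = []
for pos in range(4):                # grow the window over positions 0..3
    codes.append(w.encode(pos))
    w.extend(pos)
w.shrink(2)                         # drop positions 0 and 1 from the window
after = w.encode(4)                 # encode text[4] relative to text[2..3]
print(codes, after)                 # [(0, 0), (0, 1), (2, 0), (2, 2)] (2, 0)
```

After the shrink, the symbol 3 is removed because its latest occurrence lies left of the window, while 1 is kept because it occurs again at position 3.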

7.3

Representing States Succinctly

Broadly speaking, we use Belazzougui's [Bel10] succinct representation of the AC automaton [AC75]. Let T be the trie in Section 7.2. We observe that any node u ∈ T has a final node in its subtree T(u). Let $\overleftarrow{\mathrm{order}}$(u) = order(Pi[strDepth(u)] ◦ Pi[strDepth(u) − 1] ◦ · · · ◦ Pi[1]), where Pi is a pattern corresponding to a final node in T(u) and pi = |Pi|. Each state u is conceptually labeled by the lexicographic rank of $\overleftarrow{\mathrm{order}}$(u) in the set {$\overleftarrow{\mathrm{order}}$(v) | v is a node in the trie}. Thus, each state is labeled by a number in [1, m], where the root is labeled by 1.

Convention 7.1. Without loss of generality, assume that no two patterns Pi and Pj exist such that order(Pi) = order(Pj). Also, assume that i < j iff $\overleftarrow{\mathrm{order}}$(Pi) precedes $\overleftarrow{\mathrm{order}}$(Pj) in the lexicographic order, where $\overleftarrow{\mathrm{order}}$(P) = order(P[p] ◦ P[p − 1] ◦ · · · ◦ P[1]) and p = |P|.

We explicitly store the labels of the final states using the SID of Fact 2.5. Since there are d final nodes, the space required is d log(m/d) + O(d) bits. Given the label of a final state v, we first find the rank of v among all the final states using Fact 2.5. If the rank is r, then v corresponds to the pattern Pr by Convention 7.1. This leads to the following lemma.

Lemma 7.1. Given the label of a final state, we can find the corresponding pattern in O(1) time by using a (d log(m/d) + O(d))-bit data structure.

Lastly, we maintain a bit-vector leaf[1, m] such that leaf[j] = 1 iff the state with label j is a leaf in T . The total space for representing the states is m + d log(m/d) + O(d) bits.
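The labeling of states can be illustrated with the following Python sketch. It is not succinct: an ordinary sorted list with binary search stands in for the SID of Fact 2.5, and state labels are computed by explicitly sorting the reversed-order encodings. It also assumes that encoded pairs are compared as Python tuples, which may differ from the exact order on pairs intended in the text. The names `label_states` and `pattern_index` are hypothetical.

```python
import bisect


def order(s):
    """Naive order-preserving encoding (quadratic time, for illustration only)."""
    code = []
    for i, x in enumerate(s):
        lo = [k for k in range(i) if s[k] <= x]
        hi = [k for k in range(i) if s[k] >= x]
        jm = max(lo, key=lambda k: (s[k], k)) if lo else i
        jp = max(hi, key=lambda k: (-s[k], k)) if hi else i
        code.append((i - jm, i - jp))
    return code


def label_states(patterns):
    # Convention 7.1: index the patterns by the rank of their reversed-order encoding.
    patterns = sorted(patterns, key=lambda p: order(p[::-1]))
    # One representative prefix per trie node, keyed by order(prefix); () is the root.
    nodes = {}
    for p in patterns:
        for k in range(len(p) + 1):
            nodes.setdefault(tuple(order(p[:k])), p[:k])
    # A state's label is the lexicographic rank of the encoding of its reversed prefix.
    ranked = sorted(nodes, key=lambda code: order(nodes[code][::-1]))
    label = {code: r + 1 for r, code in enumerate(ranked)}
    # leaf[j-1] = 1 iff the state labeled j has no child in the trie.
    internal = {code[:-1] for code in nodes if code}
    leaf = [0 if code in internal else 1 for code in ranked]
    final_labels = sorted(label[tuple(order(p))] for p in patterns)
    return patterns, label, leaf, final_labels


def pattern_index(final_labels, lab):
    """Rank of a final-state label among all final labels (1-based)."""
    return bisect.bisect_left(final_labels, lab) + 1


pats, label, leaf, finals = label_states([[2, 1], [1, 2, 3]])
print(pats)                        # [[1, 2, 3], [2, 1]]
print(leaf, finals)                # [0, 0, 0, 1, 1] [4, 5]
print(pattern_index(finals, 5))    # 2
```

In this example m = 5 and d = 2: the root receives label 1, the two final states receive labels 4 and 5, and the rank of a final label among the final labels gives the index of its pattern under Convention 7.1.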