In Section 2.5 on page 19, we described a DP algorithm that sequentially scans a text of length 𝑛 to search for a pattern 𝑝 with up to 𝑘 errors in 𝒪(𝑘𝑛) time on average. In typical applications, such as searching sequenced reads in a reference genome, the text is much larger than the pattern and even linear search time becomes prohibitive. In the following, we propose a simple recursive search algorithm that descends the suffix tree of the text and solves the problem in 𝒪(2 ⋅ |Σ| ⋅ |𝑝|) time [Navarro and Baeza-Yates, 2000].

Searching for a pattern with errors in a suffix tree requires tolerating mismatches while descending along the path of pattern characters from the root towards the leaves. That is, whenever a pattern character is compared with an edge character, a mismatch does not end the search but only reduces the remaining number of tolerated errors; see Algorithms 4.9 and 4.10 for the corresponding pseudo-code. Branching nodes must be left via the edge beginning with the current pattern character and, if there are errors remaining, also via all other edges. An approximate match has been found if the end of the pattern is reached without exceeding the number of tolerated errors.

The algorithmic idea of Algorithm 4.9 can be combined with the multiple pattern search algorithm of the previous section to approximately search multiple patterns in a text. We again use a radix tree of patterns and a suffix tree of the text and traverse both in parallel, starting at the root nodes. During the recursion, both concatenation strings are compared character-wise while recording the number of mismatches. When the end of one or both strings is reached, the search recurses into all children of the nodes or the Cartesian product of children sets of both nodes. The recursion ends if more mismatches

Algorithm 4.9: R (𝑝𝑎𝑡𝑡𝑒𝑟𝑛, 𝑖𝑡𝑒𝑟, 𝑖, 𝑒)

input : search pattern, suffix tree iterator
input : length 𝑖 of compared prefixes of pattern and concatenation string
input : remaining number of tolerated errors 𝑒
output : all text occurrences within tolerated Hamming distance

 1  if 𝑒 = 0 then
 2      if goDown(𝑖𝑡𝑒𝑟, 𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑖..|𝑝𝑎𝑡𝑡𝑒𝑟𝑛|)) then
 3          print “pattern found at: ” getOccurrences(𝑖𝑡𝑒𝑟)
 4  else
 5      while 𝑖 < |𝑝𝑎𝑡𝑡𝑒𝑟𝑛| and 𝑖 < repLength(𝑖𝑡𝑒𝑟) do
 6          if 𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑖] ≠ repr(𝑖𝑡𝑒𝑟)[𝑖] then    // on mismatch …
 7              if 𝑒 = 0 then                      // … reduce the tolerated errors
 8                  return
 9              else
10                  𝑒 ← 𝑒 − 1
11          𝑖 ← 𝑖 + 1
12      if 𝑖 = |𝑝𝑎𝑡𝑡𝑒𝑟𝑛| then
13          print “pattern found at: ” getOccurrences(𝑖𝑡𝑒𝑟)
14      else
15          if not goDown(𝑖𝑡𝑒𝑟) then return
16          repeat                                 // at branching nodes
17              R (𝑝𝑎𝑡𝑡𝑒𝑟𝑛, 𝑖𝑡𝑒𝑟, 𝑖, 𝑒)             // try all outgoing edges
18          until not goRight(𝑖𝑡𝑒𝑟)

Algorithm 4.10: P S (𝑝𝑎𝑡𝑡𝑒𝑟𝑛, 𝑒𝑟𝑟𝑜𝑟𝑠)

input : pattern and number of tolerated Hamming errors
output : all approximate matches

 1  create iterator 𝑖𝑡𝑒𝑟 of the suffix tree of the text
 2  goRoot(𝑖𝑡𝑒𝑟)
 3  R (𝑝𝑎𝑡𝑡𝑒𝑟𝑛, 𝑖𝑡𝑒𝑟, 0, 𝑒𝑟𝑟𝑜𝑟𝑠)
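The backtracking idea of Algorithms 4.9 and 4.10 can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this work: it assumes a naive, uncompressed suffix trie (quadratic in the text length) instead of a suffix tree, so every edge carries a single character, and the names build_suffix_trie, approx_search, and hamming_search are hypothetical.

```python
def build_suffix_trie(text):
    """Naive uncompressed suffix trie (illustration only, quadratic size).

    Each node stores the start positions of all suffixes passing through it.
    """
    root = {"children": {}, "starts": list(range(len(text)))}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def approx_search(node, pattern, i, errors, hits):
    """Descend from `node` after comparing pattern[:i], with `errors` mismatches left."""
    if i == len(pattern):
        # End of pattern reached within the error budget: report occurrences.
        hits.extend(node["starts"])
        return
    for ch, child in node["children"].items():
        if ch == pattern[i]:
            approx_search(child, pattern, i + 1, errors, hits)
        elif errors > 0:
            # A mismatch only reduces the remaining number of tolerated errors.
            approx_search(child, pattern, i + 1, errors - 1, hits)


def hamming_search(text, pattern, errors):
    """Return sorted start positions of all matches within the Hamming distance."""
    hits = []
    approx_search(build_suffix_trie(text), pattern, 0, errors, hits)
    return sorted(hits)


print(hamming_search("GATTACAGATTTCA", "GATTACA", 1))  # [0, 7]
```

With one tolerated error the second occurrence, GATTTCA at position 7, is reported as well; with zero errors only position 0 remains.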

occurred than tolerated or a leaf in either tree is reached. If it is a leaf in the radix tree, a pattern has been found. Algorithm 4.11 shows the corresponding pseudo-code. The multiple exact pattern search algorithm of the previous section is used in line 2 as an optimization if no more errors are tolerated. The repeat-loops in lines 24 and 26 enumerate the children of the current nodes 𝛼 or 𝛽, depending on whether the end of 𝛼 or 𝛽 has been reached in the comparison.

To evaluate the practical running time of the multiple approximate search algorithm, we searched 100,000 substrings in DNA, protein, and natural language texts of length 100 million characters while varying the allowed number of mismatches and the length of the substrings. The results are shown in Figure 4.4.

[Figure 4.4 plot: panels hs_chr2 (Σ = 5), sprot (Σ = 24), rfc (Σ = 256); x-axis: pattern length (1 to 5000); y-axis: running time (s) (0.01 to 1000); series: exact, 1 error, 2 errors, 3 errors]

Figure 4.4: Running times required to search 100,000 patterns with a varying number of tolerated errors in the first 100 M characters of a DNA, amino acid, and natural language text. We compared the exact (Section 4.3.3) and approximate (Section 4.3.4) recursive algorithms that search the radix tree of the patterns in the suffix tree (enhanced suffix array) of the text. Patterns are random substrings of varying length.

It can be seen that the number of errors has the greatest influence on running time, which increases by an order of magnitude for every additional error. The search time on large alphabets is higher than on small alphabets due to a greater out-degree of suffix tree nodes.
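A rough way to see why each additional error is so costly is to count the strings within a given Hamming distance of a pattern. The sketch below computes this count; the function name hamming_neighbourhood_size and the example parameters are assumptions chosen for illustration. The count is only an upper bound on the number of distinct paths the backtracking could follow, since the search visits only those paths that actually exist in the suffix tree of the text.

```python
from math import comb


def hamming_neighbourhood_size(m, k, sigma):
    """Number of strings of length m over an alphabet of size sigma
    that differ from a fixed pattern in at most k positions."""
    # Choose j mismatch positions, then one of (sigma - 1) other characters for each.
    return sum(comb(m, j) * (sigma - 1) ** j for j in range(k + 1))


# Hypothetical example: pattern length 50 over a 4-letter alphabet.
for k in range(4):
    print(k, hamming_neighbourhood_size(50, k, 4))
```

For these example values the count grows from 1 to 151, 11,176, and 540,376 for 0 to 3 errors, and the factor (sigma - 1) makes the growth steeper for larger alphabets.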

In [Siragusa et al., 2013a,b], we demonstrate the applicability of the above-mentioned exact and approximate multiple backtracking approaches to the read mapping problem. In that work, we search exact or approximate occurrences of non-overlapping seeds of the reads in the reference sequence and extend them up to a given error rate.

The approximate search can also be extended to edit distance. Instead of comparing the edge labels of both trees character-wise, they need to be aligned recursively with a modified DP algorithm [Needleman and Wunsch, 1970] that updates a DP matrix which, for a pair of tree nodes, reflects the pairwise alignment of both concatenation strings. For more details, we refer the reader to [Navarro and Baeza-Yates, 2000].
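To make the idea of carrying DP state along the tree concrete, the following Python sketch handles the simpler case of a single pattern searched in a text. It is not the two-tree algorithm referred to above. It assumes a naive uncompressed suffix trie and the standard unit-cost edit distance recurrence, and the names build_suffix_trie and edit_search are hypothetical. Each trie node receives one DP row that compares the node's concatenation string with every pattern prefix, and a branch is abandoned once all entries of its row exceed the error bound.

```python
def build_suffix_trie(text):
    """Naive uncompressed suffix trie; nodes store the suffix start positions."""
    root = {"children": {}, "starts": list(range(len(text)))}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def edit_search(text, pattern, k):
    """Return sorted (start, length) pairs of text substrings whose
    edit distance to `pattern` is at most k."""
    m = len(pattern)
    hits = set()

    def descend(node, depth, row):
        # row[j] = edit distance between this node's string and pattern[:j]
        if row[m] <= k:
            hits.update((start, depth) for start in node["starts"])
        for ch, child in node["children"].items():
            new = [row[0] + 1]
            for j in range(1, m + 1):
                new.append(min(row[j] + 1,          # text character unmatched
                               new[j - 1] + 1,      # pattern character unmatched
                               row[j - 1] + (ch != pattern[j - 1])))  # (mis)match
            if min(new) <= k:  # otherwise no extension can get back under k
                descend(child, depth + 1, new)

    descend(build_suffix_trie(text), 0, list(range(m + 1)))
    return sorted(hits)


print(edit_search("ACGT", "AGT", 1))  # [(0, 4), (1, 3), (2, 2)]
```

In the example, ACGT, CGT, and GT are each one edit operation away from AGT (an insertion, a substitution, and a deletion, respectively).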

Algorithm 4.11: M R (𝑖𝑡𝑒𝑟𝐴, 𝑖𝑡𝑒𝑟𝐵, 𝑖, 𝑒)

input : iterator 𝑖𝑡𝑒𝑟𝐴 of pattern radix tree
input : iterator 𝑖𝑡𝑒𝑟𝐵 of text suffix tree
input : length 𝑖 of compared prefixes
input : remaining number of tolerated errors 𝑒
output : all text occurrences within tolerated Hamming distance

 1  if 𝑒 = 0 then
 2      M R (𝑖𝑡𝑒𝑟𝐴, 𝑖𝑡𝑒𝑟𝐵, 𝑖)                      // no errors left, use Algorithm 4.8
 3  else
 4      𝛼 ← repr(𝑖𝑡𝑒𝑟𝐴)
 5      𝛽 ← repr(𝑖𝑡𝑒𝑟𝐵)
 6      while 𝑖 < |𝛼| and 𝑖 < |𝛽| do
 7          if 𝛼[𝑖] ≠ 𝛽[𝑖] then                    // on mismatch …
 8              if 𝑒 = 0 then                      // … reduce the tolerated errors
 9                  return
10              else
11                  𝑒 ← 𝑒 − 1
12          𝑖 ← 𝑖 + 1
13      if 𝑖 = |𝛼| then
14          if isLeaf(𝑖𝑡𝑒𝑟𝐴) then
15              print “pattern ” 𝛼 “ found at: ” getOccurrences(𝑖𝑡𝑒𝑟𝐵)
16              return
17          goDown(𝑖𝑡𝑒𝑟𝐴)
18      if 𝑖 = |𝛽| then
19          if not goDown(𝑖𝑡𝑒𝑟𝐵) then return
20      repeat
21          𝑖𝑡𝑒𝑟𝐵′ ← 𝑖𝑡𝑒𝑟𝐵
22          repeat
23              M R (𝑖𝑡𝑒𝑟𝐴, 𝑖𝑡𝑒𝑟𝐵, 𝑖, 𝑒)
24          until 𝑖 ≠ |𝛽| or not goRight(𝑖𝑡𝑒𝑟𝐵)
25          𝑖𝑡𝑒𝑟𝐵 ← 𝑖𝑡𝑒𝑟𝐵′
26      until 𝑖 ≠ |𝛼| or not goRight(𝑖𝑡𝑒𝑟𝐴)

Algorithm 4.12: M P S (𝑝𝑎𝑡𝑡𝑒𝑟𝑛𝑠, 𝑒𝑟𝑟𝑜𝑟𝑠)

input : multiple patterns and number of tolerated Hamming errors
output : all approximate matches

 1  create pattern radix tree and tree iterator 𝑖𝑡𝑒𝑟𝐴
 2  create iterator 𝑖𝑡𝑒𝑟𝐵 of the suffix tree of the text
 3  goRoot(𝑖𝑡𝑒𝑟𝐴), goRoot(𝑖𝑡𝑒𝑟𝐵)
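The parallel traversal of Algorithms 4.11 and 4.12 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the evaluated implementation. Both trees are naive uncompressed tries, so the Cartesian product of children is enumerated at every step. The sketch never switches to the exact algorithm once no errors remain. Patterns are reported at any node where they end rather than only at leaves. The names build_pattern_trie, multi_approx_search, and search_patterns are hypothetical.

```python
def build_suffix_trie(text):
    """Naive uncompressed suffix trie; nodes store the suffix start positions."""
    root = {"children": {}, "starts": list(range(len(text)))}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start)
    return root


def build_pattern_trie(patterns):
    """Uncompressed trie of the patterns; nodes list the patterns ending there."""
    root = {"children": {}, "patterns": []}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node["children"].setdefault(ch, {"children": {}, "patterns": []})
        node["patterns"].append(pattern)
    return root


def multi_approx_search(pnode, tnode, errors, hits):
    """Traverse pattern trie and text trie in parallel with `errors` mismatches left."""
    for pattern in pnode["patterns"]:
        hits.extend((pattern, start) for start in tnode["starts"])
    # Cartesian product of the children of both nodes.
    for pch, pchild in pnode["children"].items():
        for tch, tchild in tnode["children"].items():
            if pch == tch:
                multi_approx_search(pchild, tchild, errors, hits)
            elif errors > 0:
                multi_approx_search(pchild, tchild, errors - 1, hits)


def search_patterns(text, patterns, errors):
    """Map each pattern to the sorted start positions of its approximate matches."""
    hits = []
    multi_approx_search(build_pattern_trie(patterns), build_suffix_trie(text),
                        errors, hits)
    result = {pattern: [] for pattern in patterns}
    for pattern, start in hits:
        result[pattern].append(start)
    return {pattern: sorted(starts) for pattern, starts in result.items()}


print(search_patterns("GATTACAGATTTCA", ["GATTACA", "TTT", "ACA"], 1))
```

Sharing prefixes in the pattern trie means that a text path is compared only once against all patterns that start with the same characters, which is the motivation for traversing both trees together.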