
Given a BWT for level (i+1), B_{i+1}, all that remains is to induce the BWT B_i. First, note that each symbol in B_{i+1} corresponds to an S* substring at level i. Thus, each symbol in B_{i+1} provides information about the ordering of all suffixes from the corresponding S* substring at level i. In our example, B2[0] = 0, indicating that the smallest suffix in σ2 is preceded by a ‘0’. This ‘0’ corresponds to S* substring “0230” in σ1. The algorithm then uses this information to induce the location of three symbols in B1. Namely, the first suffix in B1 starting with “30” is preceded by “2”, the first starting with “230” is preceded by “0”, and the first starting with “0230” is preceded by “3” (derived from B2 and Σ1). In other words, the single symbol from B2 informs multiple symbols in B1. However, the algorithm still needs to know where suffixes starting with “30”, “230”, and “0230” are located as offsets into the BWT (and implicit suffix array). In other words, it needs to know an offset value into B_i for each suffix of each S* substring from σ_i.

To compute the offsets, the algorithm needs to know both the lexicographic order of the S* substring suffixes and the number of times each S* substring suffix occurs in the entire string collection. Given the sorted order of the S* substring suffixes, the offset for a suffix is the sum of the frequencies of all suffixes that lexicographically precede it. For example, the S* substring suffixes of σ1 are shown in sorted order in Table 5.3. In this example, each suffix occurs once, so the offsets start at 0 and increase by 1 for each successive suffix.

S* substring suffix    Frequency    Offset into B1
0230                   1            0
1241                   1            1
230                    1            2
241                    1            3
30                     1            4
41                     1            5

Table 5.3: S* substring suffixes - level 1. This table shows the sorted list of S* substring suffixes for σ1 = {124,023}. Note that all suffixes of length 1 are excluded because they are represented elsewhere in the suffix list. In this particular example, each suffix occurs once. The offsets are calculated by starting with 0 for the first S* substring suffix and adding the frequency to get the next offset. Since each suffix occurs once, the offsets are only increasing by one each time.
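The offset rule described above is an exclusive prefix sum over the frequencies taken in sorted order. The following is a minimal sketch of that rule only: the function name `suffix_offsets` is hypothetical, and Python's built-in string ordering stands in for the Induce-based ordering, which is an assumption that happens to agree with Table 5.3.

```python
from itertools import accumulate


def suffix_offsets(frequencies):
    """Map each S* substring suffix to its starting offset in the BWT.

    frequencies: dict {suffix: number of occurrences in the collection}.
    Assumption: plain string sorting replaces the Induce-based ordering.
    """
    ordered = sorted(frequencies)
    counts = [frequencies[s] for s in ordered]
    # Exclusive prefix sum: each offset is the total frequency of all
    # suffixes that sort strictly before it.
    starts = [0] + list(accumulate(counts))[:-1]
    return dict(zip(ordered, starts))


# Level-1 suffixes from Table 5.3, each occurring once.
level1 = suffix_offsets(
    {"0230": 1, "1241": 1, "230": 1, "241": 1, "30": 1, "41": 1}
)
```

With every frequency equal to one, the offsets are simply the ranks 0 through 5, as in the table.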

While the sorted order of the S* substring suffixes can be explicitly computed with a sort subroutine, we instead use a modified version of the Induce(...) subroutine of Okanohara and Sadakane [2009]. To summarize, the original Induce(...) subroutine computes the sorted order of all S* substring suffixes in the whole input string. In general, it does this by first solving for the sorted order of the L-type suffixes and then by solving for the reverse sorted order of all the S-type suffixes. The two sorts can then be trivially combined for a complete ordering of all S* substring suffixes. This algorithm runs in O(M) steps for a string of length M.

Instead of computing the order of every S* substring suffix in the string collection, our method only computes the order of the S* substring suffixes in the stored alphabet. In other words, duplicated S* substring suffixes are removed prior to performing the sort. Then, it uses the computed order and the S* substring suffix frequencies to calculate offsets into the BWT. In short, MSBWT-IS only sorts k_i suffixes, where k_i is the combined length of all S* substrings, instead of M suffixes as in BWT-IS. This requires less memory than the original Induce(...) method because there are no repeated S* substrings in the computation. Additionally, since the alphabet already allows for the possibility of multiple terminal symbols, this adaptation enables the algorithm to run on string collections instead of just a single string.
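To illustrate why working from the stored alphabet shrinks the sort, the sketch below derives suffix frequencies from the unique S* substrings and their occurrence counts, rather than from every occurrence in the collection. The function name and the input dictionary are hypothetical; the counts are those of the running example collection {$TAGCT, $GAGCG}, where “AGC” occurs once in each string.

```python
from collections import Counter


def suffix_frequencies(substring_counts):
    """Frequency of every S* substring suffix of length >= 2.

    substring_counts: dict {unique S* substring: occurrences in collection}.
    Each unique substring is visited once, however often it occurs.
    """
    counts = Counter()
    for sub, n in substring_counts.items():
        for start in range(len(sub) - 1):  # skip the length-1 suffix
            counts[sub[start:]] += n
    return counts


# Hypothetical alphabet for the running example.
alphabet_counts = {"$GA": 1, "$TA": 1, "AGC": 2, "CG$": 1, "CT$": 1}
freq = suffix_frequencies(alphabet_counts)
unique_suffixes = len(freq)        # suffixes that actually get sorted
total_suffixes = sum(freq.values())  # entries in the level-0 BWT
```

In this small example the saving is modest (10 sorted suffixes for 12 BWT entries); the gap would grow with the amount of repetition in the collection.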

The final step is to linearly traverse the BWT B_{i+1} and fill in B_i using the computed offsets. For each symbol c in B_{i+1}, the algorithm gets the corresponding S* substring in σ_i. Then, the offset for each S* substring suffix in the S* substring is extracted. Every time an offset is used, it is incremented by one so that the next time the S* substring suffix is used, it points to the next entry. Given the offset value, the character stored at the offset is the symbol preceding the S* substring suffix, which can be computed from the stored alphabet. In our example, the computed BWT is B2 = [0,1]. The first symbol is B2[0] = 0, corresponding to the S* substring “0230”. This S* substring has three suffixes [“30”, “230”, “0230”] corresponding to offsets [4,2,0]. Thus, index 4 gets set to ‘2’ (the symbol preceding “30”), index 2 gets set to ‘0’, and index 0 gets set to ‘3’. We show the partial result in row 0 of Table 5.4. We then repeat the process for B2[1] = 1 for S* substring “1241” to retrieve offsets [5, 3, 1] and symbols [‘2’, ‘1’, ‘4’]. The final result is shown in row 1 of Table 5.4.

Index, x    B2[x]    B1
init        –        [?, ?, ?, ?, ?, ?]
0           0        [3, ?, 0, ?, 2, ?]
1           1        [3, 4, 0, 1, 2, 2]

Table 5.4: Induced BWT - level 1. This table shows the state of BWT B1 as the values of BWT B2 are processed and B1 is filled in. During each iteration x, the S* substring corresponding to symbol B2[x] and the offsets for each S* substring suffix are extracted and stored. Then, the corresponding predecessor symbol for each offset is written to output BWT B1. Initially, the entire BWT is unknown. The first symbol encountered in B2 is ‘0’, corresponding to S* substring “0230” in σ1. This S* substring has three suffixes [“30”, “230”, “0230”] corresponding to offsets [4,2,0] in Table 5.3. Thus, index 4 gets set to ‘2’ (the symbol preceding “30”), index 2 gets set to ‘0’, and index 0 gets set to ‘3’. This process is repeated for B2[1] = 1 to finish solving for B1.
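The traversal just described can be sketched as follows. This is an illustration under stated assumptions, not the MSBWT-IS implementation: the function name `induce_level` and its argument layout are hypothetical, and the text says only that the symbol preceding a full S* substring is derived from B_{i+1} and the alphabet. The sketch assumes one way to do that, namely an LF-mapping on B_{i+1} to find the S* substring that comes before each occurrence and then taking its second-to-last character (adjacent S* substrings overlap by one character).

```python
def induce_level(b_next, substrings, offsets):
    """Fill B_i from B_{i+1} (sketch).

    b_next     -- B_{i+1} as a list of integer symbols
    substrings -- substrings[c] is the S* substring that symbol c stands for
    offsets    -- starting offset into B_i for every S* substring suffix
    """
    cursor = dict(offsets)  # next free slot per suffix, bumped after each use
    size = sum(len(substrings[c]) - 1 for c in b_next)
    b_out = [None] * size

    # Assumption: LF-mapping on B_{i+1} locates the symbol preceding each
    # occurrence of c; that symbol's substring supplies the predecessor of
    # the full S* substring.
    order = sorted(range(len(b_next)), key=lambda x: (b_next[x], x))
    lf = [0] * len(b_next)
    for rank, x in enumerate(order):
        lf[x] = rank

    for x, c in enumerate(b_next):
        sub = substrings[c]
        before = substrings[b_next[lf[x]]]
        # Shortest suffix first, as in the worked example.
        for start in range(len(sub) - 2, -1, -1):
            suffix = sub[start:]
            pred = sub[start - 1] if start > 0 else before[-2]
            b_out[cursor[suffix]] = pred
            cursor[suffix] += 1
    return b_out


# Level-1 example: B2 = [0, 1] and the offsets of Table 5.3.
b1 = induce_level(
    [0, 1],
    ["0230", "1241"],
    {"0230": 0, "1241": 1, "230": 2, "241": 3, "30": 4, "41": 5},
)
```

Copying the offsets into `cursor` keeps the caller's table intact, so the same offsets can be reused or inspected after the pass.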

At this point, the BWT for σ1 = {124, 023} has been solved as B1 = [3,4,0,1,2,2]. The final step is to exit the recursion to solve for B0 = BWT(σ0). The algorithm uses the same method as before but with the alphabets from levels 0 and 1. The S* substring suffixes, frequencies, and offsets are shown in Table 5.5. Then each symbol from B1 is processed one at a time; Table 5.6 shows the state of B0 after each symbol.
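As a sanity check on the worked example, the induced result can be compared against a brute-force construction. The sketch below is a naive reference, not part of MSBWT-IS: it treats each string as cyclic, sorts every rotation, and reports the symbol preceding each rotation. The function name is hypothetical, and breaking ties between identical rotations by input order is an assumption (no ties arise in this example).

```python
def naive_collection_bwt(strings):
    """BWT of a string collection by sorting all rotations (quadratic, for checks only).

    Each string is treated as cyclic, so the symbol before position 0
    is the last symbol of the same string.
    """
    rotations = []
    for s in strings:
        for i in range(len(s)):
            # s[i - 1] wraps to s[-1] when i == 0.
            rotations.append((s[i:] + s[:i], s[i - 1]))
    rotations.sort(key=lambda r: r[0])  # stable: ties keep input order
    return [pred for _, pred in rotations]


# Level-1 collection from the text.
b1_check = naive_collection_bwt(["124", "023"])
```

For the level-1 collection, `b1_check` agrees with the B1 derived above when the symbols are read as characters.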

5.3.4 Asymptotic performance

The algorithm can be summarized as a three-step process. The initial step requires two linear passes over a string collection σ_i of total size N_i, which can be done in O(N_i) steps. While performing these linear passes, the S* substrings are stored in a collection to create Σ_{i+1} of size k_{i+1}. These substrings are then sorted in O(k_{i+1} log(k_{i+1})) steps prior to performing the second linear pass. The second step is to recursively compute the BWT for a string collection that is at most half the size of the current collection, N_{i+1} ≤ N_i/2. The third step is to calculate the offsets into the BWT for each S* substring suffix. Using the modified Induce(...) method, this requires O(k_{i+1}) steps (note that k_{i+1} ≤ N_i). Finally, a linear pass using B_{i+1} and the offsets allows the output BWT B_i to be written in O(N_i) steps.

S* substring suffix    Frequency    Offset into B0
$GA                    1            0
$TA                    1            1
AGC                    2            2
CG$                    1            4
CT$                    1            5
G$                     1            6
GA                     1            7
GC                     2            8
T$                     1            10
TA                     1            11

Table 5.5: S* substring suffixes - level 0. This table shows the sorted list of S* substring suffixes for σ0 = {$TAGCT, $GAGCG}. Note that all suffixes of length 1 are excluded because they are represented elsewhere in the suffix list. The offset for a suffix is calculated as the sum of all frequencies of S* substring suffixes that lexicographically precede it. Most suffixes in this example occur once, but the S* substring suffixes “AGC” and “GC” both occur twice, once in each string.

Index, x    B1[x]    B0
init        –        [?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?]
0           3        [?, ?, ?, ?, G, ?, C, ?, ?, ?, ?, ?]
1           4        [?, ?, ?, ?, G, G, C, ?, ?, ?, C, ?]
2           0        [G, ?, ?, ?, G, G, C, $, ?, ?, C, ?]
3           1        [G, T, ?, ?, G, G, C, $, ?, ?, C, $]
4           2        [G, T, G, ?, G, G, C, $, A, ?, C, $]
5           2        [G, T, G, T, G, G, C, $, A, A, C, $]

Table 5.6: Induced BWT - level 0. This table shows how each symbol in B1 is used to induce multiple symbols in B0. During each iteration x, the S* substring corresponding to symbol B1[x] and the offsets for each S* substring suffix are extracted and stored. Then, the corresponding predecessor symbol for each offset is written to output BWT B0. Initially, the entire BWT is unknown. The first symbol encountered in B1 is ‘3’, corresponding to S* substring “CG$” in σ0. This S* substring has two suffixes [“G$”, “CG$”] corresponding to offsets [6,4] in Table 5.5. Thus, index 6 gets set to ‘C’ (the symbol preceding “G$”) and index 4 gets set to ‘G’ (the symbol preceding “CG$”).
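The first step (collecting S* substrings and renaming them to form the next level) can be sketched as below. This section does not restate the suffix-type rules, so the sketch assumes the standard induced-sorting definitions: a position is S-type if its symbol is smaller than the next one (ties inherit the next position's type), the terminal is S-type, and an S* position is an S-type position whose cyclic predecessor is L-type. The function name, and writing each string with its terminal symbol last, are choices made for this illustration.

```python
def s_star_substrings(s):
    """S* substrings of one cyclic string written with its terminal last."""
    n = len(s)
    is_s = [False] * n
    is_s[n - 1] = True  # terminal symbol is S-type by convention
    for i in range(n - 2, -1, -1):
        is_s[i] = s[i] < s[i + 1] or (s[i] == s[i + 1] and is_s[i + 1])
    # S* position: S-type whose cyclic predecessor is L-type
    # (is_s[i - 1] wraps to the terminal when i == 0).
    starts = [i for i in range(n) if is_s[i] and not is_s[i - 1]]
    out = []
    for j, a in enumerate(starts):
        b = starts[(j + 1) % len(starts)]
        length = (b - a) % n or n  # a single S* position wraps fully
        out.append("".join(s[(a + k) % n] for k in range(length + 1)))
    return out


collection = ["TAGCT$", "GAGCG$"]  # running example, terminal written last
per_string = [s_star_substrings(s) for s in collection]
alphabet = sorted({sub for subs in per_string for sub in subs})
rank = {sub: r for r, sub in enumerate(alphabet)}
next_level = [[rank[sub] for sub in subs] for subs in per_string]
```

On the running example this yields the five substrings listed in Table 5.5 as the alphabet, and `next_level` holds cyclic rotations of the level-1 strings 124 and 023.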

In summary, the total cost for level i is T(N_i) = N_i + k_{i+1} log(k_{i+1}) + T(N_i/2). However, the modified Induce(...) function can actually be used to perform an implicit sort of the S* substrings. This subroutine requires O(N_i) steps in the worst case (every substring is unique), leading to a total runtime of T(N_i) = N_i + T(N_i/2), which is O(N_i). Note that in practice, we found that the explicit sort of the S* substrings (using the built-in C++ std::sort) was much faster than the implicit sort performed by the modified Induce(...) function because k_{i+1} tends to be much smaller than N_i for the test datasets.
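The linear bound follows from unrolling the recurrence into a geometric series. The derivation below is a sketch using only the stated facts N_{i+1} ≤ N_i/2 and k_{i+1} ≤ N_i; the second bound, for the explicit-sort variant, is a loose worst case implied by those facts rather than a figure reported in the text.

```latex
\begin{align*}
% Implicit sort: each level costs at most half of the level above it.
T(N_i) &\le N_i + \frac{N_i}{2} + \frac{N_i}{4} + \cdots
        \le N_i \sum_{j \ge 0} 2^{-j} = 2N_i = O(N_i). \\[4pt]
% Explicit sort: level i+j has N_{i+j} \le N_i / 2^j and k_{i+j+1} \le N_{i+j}.
T(N_i) &\le \sum_{j \ge 0} \frac{N_i}{2^j}\left(1 + \log N_i\right)
        \le 2N_i\left(1 + \log N_i\right) = O(N_i \log N_i).
\end{align*}
```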

5.4 Results

5.4.1 BWT construction tools

Our implementation of MSBWT-IS is in C++ and is publicly available¹. It currently does not rely on any external libraries. We compiled the program with g++ -O3 -std=c++11.

We compare MSBWT-IS to the ropebwt2 tool first mentioned in Section 5.1 [Li, 2014]. Recall that this method is a variant of the BCR algorithm that performs “column-wise” insertions. While the algorithm is reported as O(N log(N)) for a string collection with N symbols, the BCR class of algorithms performs worse as the length of reads increases simply because the number of “columns” increases with it. The ropebwt2 implementation tackles this issue by keeping the partial BWT in memory as a B+-tree, allowing for faster modification of the structure and reducing the cost of repeatedly allocating space for the BWT arrays. In general, this method is quite fast, outperforming both earlier BCR implementations and the merge-based construction algorithm of Chapter 4. Ropebwt2 is implemented in C and was compiled using the instructions available online.

For ropebwt2, we actually tested two versions of its output. The first version requires a pre-sort of the strings in the collection, but builds a BWT that is functionally identical to the one produced by MSBWT-IS. In our results, this version is labeled as ropebwt2 -LR + sort to reflect the options and that the cost of running the sort is included. The second version does not require a pre-sort, but it stores both the forward and reverse-complement of all strings in the collection. This allows downstream analyses to perform single queries to retrieve both forward and reverse counts as a single value. However, the forward and reverse-complements are joined together in the implicit suffix array such that it is impossible to tell which string was the original without additional information. Additionally, this BWT can require twice as much space because it is storing twice as many strings. The second version is labeled as ropebwt2 -Lr. In general, both versions of the ropebwt2 algorithm are expected to get slower as the length of the strings increases due to its “column-wise” insertion. Additionally, the added time caused by sorting the strings is expected to increase with both the number and length of the strings in the collection. The ropebwt2 tool also calculates the BWT in parallel. In our experience, it utilizes approximately 2-3 threads on average using this parallelization.

All tests were run on a machine running Ubuntu 14.04 with 32 GB memory and an Intel Xeon E5-2620 6-core 2.00 GHz processor. The machine is connected to a 1 TB HDD for reading and writing any necessary input or output files. For measuring performance, we used the built-in /usr/bin/time utility to extract real time, user time, memory usage, and CPU utilization. For performing the pre-sort required by ropebwt2, we used the built-in /usr/bin/sort utility.

5.4.2 Simulated datasets

For simulated datasets, we tested the effect of read length on each method. To do this, we generated random “genome” strings and then sampled reads from those strings at approximately 100x coverage with a 1% error rate. To maintain a constant number of input bases, the read length was varied through repeated doublings while simultaneously halving the number of reads. Our first, smaller batch of datasets started with 2^20 100-basepair reads and went to 2^11 51200-basepair reads, so each dataset had approximately 104 megabases of simulated data. Our second, larger batch of datasets started with 2^24 100-basepair reads and went to 2^15 51200-basepair reads, so each dataset had approximately 1.7 gigabases of simulated data. For each dataset, we ran MSBWT-IS and both versions of ropebwt2.
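The doubling and halving scheme keeps the base count fixed within a batch, which the short sketch below makes explicit. The helper name `batch` is hypothetical, and the count of ten datasets per batch is inferred from doubling 100 up to 51200 rather than stated in the text.

```python
# Read lengths double from 100 bp up to 51200 bp (ten datasets per batch).
READ_LENGTHS = [100 * 2**j for j in range(10)]


def batch(first_read_count_log2):
    """Return (num_reads, read_length, total_bases) for each dataset in a batch."""
    rows = []
    for j, length in enumerate(READ_LENGTHS):
        reads = 2 ** (first_read_count_log2 - j)  # halve the reads each step
        rows.append((reads, length, reads * length))
    return rows


small = batch(20)  # starts with 2^20 reads of 100 bp
large = batch(24)  # starts with 2^24 reads of 100 bp
```

Every row of `small` totals 104,857,600 bases and every row of `large` totals 1,677,721,600 bases, matching the approximate 104-megabase and 1.7-gigabase figures.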

The CPU and wall clock times for the smaller batch of datasets are shown in Figure 5.3. As expected, the CPU time required by ropebwt2 grows with the read length. Since the ropebwt2 -Lr version encodes both forward and reverse-complement strings, it requires approximately twice as much CPU time as ropebwt2 -LR. In contrast to both ropebwt2 types, the CPU time required by the MSBWT-IS algorithm actually decays slightly as the read length increases because there are fewer end-of-string symbols (‘$’) to encode. As a result, MSBWT-IS requires far less CPU time for long reads than both versions of ropebwt2. The CPU usage of all three methods is reflected in the wall clock time. The wall clock time of MSBWT-IS is roughly the same as the CPU time because it is not parallelized. In contrast, the wall clock times of ropebwt2 tend to be 2-3x lower than its CPU times due to multi-processing.

Figure 5.3: Performance - small simulated datasets. These figures show the CPU time (top) and wall clock time (bottom) as measured by /usr/bin/time for the three different methods of BWT construction for long reads. Datasets were generated by first randomly generating a “genome” string and then sampling reads from that string at approximately 100x coverage with a 1% mismatch rate. Multiple datasets were generated by varying the read length while keeping a constant number of genomic bases across all datasets (approximately 104 megabases per dataset). Each consecutive dataset has half the number of reads but double the read length. In general, the CPU time for MSBWT-IS decays slightly as the read length increases. This is caused by a reduction in total symbols because there are fewer end-of-string characters as the number of reads decreases. In contrast, both versions of ropebwt2 require increasing CPU time as the length of the reads increases. This performance pattern is reflected in the wall clock time. However, the ropebwt2 wall clock times are shifted downward relative to the MSBWT-IS times due to parallelization. Finally, the ropebwt2 -Lr option is almost always slower than ropebwt2 -LR because it is encoding twice as many symbols.

The CPU and wall clock times for the larger batch of datasets are shown in Figure 5.4. In general, the same trends from the small datasets are evident in the large datasets. The CPU time of MSBWT-IS is decaying slightly as read length increases while both versions of ropebwt2 require more CPU time. Additionally, the ropebwt2 wall clock times are roughly 2-3x less than their CPU times due to multi-processing. Given these initial tests, the expectation is for MSBWT-IS to require less CPU time than ropebwt2 when the read length is relatively long.

5.4.3 Long-read datasets

The next tests were performed on several publicly available PacBio datasets. In general, PacBio reads are thousands of basepairs long but have a relatively high error rate. The selected datasets include three E. coli K12 MG1655 datasets²,³, one P. falciparum 3d7 dataset⁴, one N. crassa OR74A dataset⁵, one S. cerevisiae dataset⁶, one C. elegans dataset⁷, and one A. thaliana P5C3 dataset⁸.

Table 5.7 shows the CPU and wall clock time for all tested long-read datasets for both MSBWT-IS and the two methods of ropebwt2. In all tests, MSBWT-IS required less CPU time than both versions of ropebwt2. Additionally, ropebwt2 -Lr used approximately twice the CPU time of ropebwt2 -LR + sort because it encodes both the forward and reverse-complement sequences. In general, ropebwt2 -LR required less wall clock time than MSBWT-IS due to parallelization. However, MSBWT-IS required less wall clock time than ropebwt2 -LR for the two test cases with the longest average read lengths (C. elegans and Arabidopsis P5C3). Additionally, these two datasets had large differences in CPU time, again suggesting that MSBWT-IS is less affected by long reads than ropebwt2.

² https://github.com/PacificBiosciences/DevNet/wiki/Datasets
³ https://github.com/PacificBiosciences/DevNet/wiki/E-coli-K12-MG1655-Resequencing
⁴ https://figshare.com/articles/Plasmodium_3D7_Genome_Assembly_With_PacBio_Data_and_CLEAR_/712587
⁵ https://github.com/PacificBiosciences/DevNet/wiki/Neurospora-Crassa-(Fungus)-Genome,-Epigenome,-and-Transcriptome
⁶ https://github.com/PacificBiosciences/DevNet/wiki/Saccharomyces-cerevisiae-W303-Assembly-Contigs
⁷ https://github.com/PacificBiosciences/DevNet/wiki/C.-elegans-data-set
⁸ https://github.com/PacificBiosciences/DevNet/wiki/Arabidopsis-P5C3
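The CPU-time comparison can be checked directly from the figures in Table 5.7. The sketch below computes the ratio of MSBWT-IS CPU time to ropebwt2 -LR + sort CPU time per dataset; the tuple layout and variable names are illustrative choices, and the numbers are copied from the table.

```python
# (dataset, average read length, MSBWT-IS CPU s, ropebwt2 -LR + sort CPU s)
ROWS = [
    ("E. coli MG1655 CLR", 1934, 71.48, 151.44),
    ("E. coli MG1655 CCS", 940, 63.48, 237.48),
    ("E. coli k12 re-seq.", 5278, 323.51, 392.89),
    ("P. falciparum 3d7", 2595, 437.29, 485.0),
    ("N. crassa OR74A", 5581, 813.35, 922.52),
    ("S. cerevisiae W303", 6030, 1025.7, 1153.83),
    ("C. elegans", 13127, 844.81, 2138.94),
    ("A. thaliana P5C3", 14609, 791.62, 2205.07),
]

# Ratio below 1.0 means MSBWT-IS used less CPU time than ropebwt2 -LR + sort.
cpu_ratio = {name: msbwt / rope for name, _, msbwt, rope in ROWS}

# The two datasets with the longest average read length.
longest_two = [name for name, *_ in sorted(ROWS, key=lambda r: r[1])[-2:]]

for name in longest_two:
    print(f"{name}: {cpu_ratio[name]:.2f}")
```

Sorting by average read length rather than by position in the table keeps the check tied to the property the text discusses.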

Figure 5.4: Performance - large simulated datasets. These figures show the CPU time (top) and wall clock time (bottom) as measured by /usr/bin/time for the three different methods of BWT construction for long reads. Datasets were generated by first randomly generating a “genome” string and then sampling reads from that string at approximately 100x coverage with a 1% mismatch rate. Multiple datasets were generated by varying the read length while keeping a constant number of genomic bases across all datasets (approximately 1677 megabases per dataset). Each consecutive dataset has half the number of reads but double the read length. The same patterns from the small datasets (see Figure 5.3) persist in these larger simulations. Both the CPU and wall clock times required by both versions of ropebwt2 grow with read length, whereas they decay slightly for MSBWT-IS.

Dataset                Bases         Average length    MSBWT-IS             ropebwt2 -LR + sort    ropebwt2 -Lr
E. coli MG1655 CLR     98213822      1934              71.48 (87.54)        151.44 (65.39)         316.97 (119.84)
E. coli MG1655 CCS     217871193     940               63.48 (67.68)        237.48 (93.5)          452.25 (160.5)
E. coli k12 re-seq.    436994771     5278              323.51 (349.63)      392.89 (152.51)        1048.34 (327.48)
P. falciparum 3d7      630045976     2595              437.29 (455.77)      485.0 (197.36)         1422.0 (503.53)
N. crassa OR74A        981884113     5581              813.35 (863.34)      922.52 (345.35)        2008.75 (677.26)
S. cerevisiae W303     1307390784    6030              1025.7 (1091.48)     1153.83 (467.61)       3658.49 (1379.77)
C. elegans             2586974111    13127             844.81 (911.96)      2138.94 (921.28)       4333.18 (1653.7)
A. thaliana P5C3       2692706178    14609             791.62 (823.13)      2205.07 (948.18)       4379.72 (1649.87)

Table 5.7: Long-read dataset performance. This table shows the performance of MSBWT-IS and the two versions of ropebwt2 on long-read sequencing datasets ordered by their total number of bases. For each method, the CPU and wall clock times were measured using /usr/bin/time. The CPU time in seconds is shown for each test with the wall clock time in seconds in parentheses. In all test cases, MSBWT-IS requires less CPU time than ropebwt2 -LR. However, ropebwt2 -LR tends to beat MSBWT-IS in wall clock time because the implementation is parallelized. For the two datasets with the longest reads, though, MSBWT-IS requires less than 40% of the CPU time of ropebwt2 -LR and uses less wall clock time despite ropebwt2’s parallelization. ropebwt2 -Lr uses more CPU
