Chapter II Theoretical Framework
J. Determination of the Sample Size
Improving the performance of BWT construction algorithms for string collections is an area of ongoing research. MSBWT-IS requires less CPU time for long reads than its main competitor, ropebwt2, but tends to be slower by wall-clock time because it is not parallelized. While some components of the MSBWT-IS algorithm can be trivially parallelized (S* substring discovery and replacement), it is not immediately obvious how to parallelize the other parts of the algorithm, which currently require sequential execution. Future work will involve exploring both algorithmic and engineering solutions to this problem in order to make MSBWT-IS more competitive in terms of wall-clock time. Additionally, MSBWT-IS performs well for long reads and ropebwt2 performs well for short reads, but there is still no single algorithm that performs well for every type of string collection. It is therefore possible that an as-yet-undiscovered BWT construction algorithm exists that works well on all types of sequencing datasets.
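To illustrate why S* substring discovery parallelizes easily, the following is a minimal sketch of the standard induced-sorting suffix classification applied to one string at a time. It is not taken from the MSBWT-IS source code; the function names are hypothetical, and each string is assumed to end with a '$' terminator that sorts before every other symbol. Because each read is classified without reference to any other read, the per-read calls could be distributed across workers.

```python
from typing import List


def find_s_star_positions(text: str) -> List[int]:
    """Return the start positions of the S* suffixes of `text`.

    Assumption: `text` ends with a terminator ('$') that is smaller than
    every other symbol in the string.
    """
    n = len(text)
    if n == 0:
        return []
    # is_s[i] is True when suffix i is S-type (lexicographically smaller
    # than suffix i + 1); otherwise suffix i is L-type.
    is_s = [False] * n
    is_s[n - 1] = True  # the terminator suffix is S-type by convention
    for i in range(n - 2, -1, -1):
        if text[i] < text[i + 1]:
            is_s[i] = True
        elif text[i] == text[i + 1]:
            is_s[i] = is_s[i + 1]
    # An S* suffix is an S-type suffix whose left neighbour is L-type.
    return [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]


def s_star_positions_per_read(reads: List[str]) -> List[List[int]]:
    """Classify every read independently (hypothetical helper).

    No state is shared between reads, so this loop is the part that could
    be handed to a pool of workers.
    """
    return [find_s_star_positions(read) for read in reads]


if __name__ == "__main__":
    print(s_star_positions_per_read(["banana$", "$"]))
```

The sequential bottleneck described above lies in the later sorting and inducing stages, which this sketch does not attempt to cover.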
In addition to constructing BWTs more quickly and efficiently, there is a large amount of metadata associated with each read that is usually ignored during construction, such as paired-end information and quality strings. The main issue with quality strings is that they do not compress as well as the BWT because their alphabet is larger, and it is not obvious how to perform random lookups of a particular quality score. Paired-end information can be stored in an auxiliary lookup table that pairs reads based on their unique ‘$’ symbol in the BWT. However, this adds overhead to both the storage and the computation of the pairs because the information is only associated with the ‘$’. Another possibility is to create paired-end strings that concatenate the two paired-end reads with a delimiter character in between. This would affect the compression of the BWT data structure and possibly the size of the FM-index table. While not discussed in this dissertation, FM-index queries can also be sped up using an auxiliary data structure called the longest common prefix array (LCP array). Unfortunately, the LCP array is a large data structure that does not compress very well. Future work may involve researching alternate methods for storing the LCP array such that its memory usage is small and random accesses are quick.
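For readers unfamiliar with the LCP array, the following minimal sketch computes one for a single short string using a naive suffix sort followed by Kasai's linear-time algorithm. It is illustrative only: the function names are hypothetical, the suffix sort is far too slow for real sequencing data, and nothing here reflects how an LCP array would be stored alongside a multi-string BWT. It does show why the array is large: it holds one integer per suffix, so one integer per BWT symbol.

```python
from typing import List


def suffix_array(text: str) -> List[int]:
    """Naive suffix array: sort suffix start positions by suffix content."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: List[int]) -> List[int]:
    """Kasai's algorithm.

    lcp[r] is the length of the longest common prefix of the suffixes at
    ranks r - 1 and r in the suffix array; lcp[0] is defined as 0.
    """
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0  # current match length, reused between consecutive positions
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix immediately before suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


if __name__ == "__main__":
    text = "banana$"
    sa = suffix_array(text)
    print(sa, lcp_array(text, sa))
```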
The implementation of the BWT-based correction method, FMLRC, also has room for improvement. In particular, we showed that the performance of FMLRC changes drastically depending on the implementation of the indexing structure, suggesting that there may be modifications or alternate implementations that are faster and/or more memory efficient than the current implementation. Secondly, there is some evidence that the bridging algorithm is not as thorough as the one in LoRDEC. Additionally, many of the k-mer sizes were chosen based on the results of limited experimentation. Further experimentation may reveal that dynamically choosing k and K based on the k-mer frequencies yields better results than a static, user-defined value.
The web tools presented in the dissertation represent only a small set of the possible tools that can be built on the BWT and FM-index. These web tools are relatively generic and mostly rely on information that is contained solely within the raw sequencing data. The one exception is the reference-based tool, which incorporates other expectations about the organism into the queries for the user. Many other tools could incorporate outside information in a similar manner. Examples include a tool that automatically performs a set of pre-defined probe queries on a BWT and a tool that performs sequence correction on a user-defined input sequence. Many of the tools created thus far were made at the request of collaborators in order to suit a specific need. Thus, future research on BWT-based web tools will likely be driven by the needs of researchers to ask specific questions of the sequencing datasets.
In summary, the BWT and FM-index are efficient data structures for compressing and accessing raw sequencing data. The algorithm implementations that are currently available allow for the efficient construction of BWTs for most types of sequencing datasets. This representation is a lossless, unbiased representation of the sequencing dataset that enables access to the raw reads through k-mer queries. Since the BWT and FM-index can now be constructed efficiently, tools and interfaces have been created to assist in solving biologically relevant problems such as de novo assembly [Simpson and Durbin, 2010], splice junction detection [Cox et al., 2012b], short-read correction [Greenstein et al., 2015], long-read correction, probe searches, and targeted genomic assembly.
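As a concrete illustration of a k-mer query, the sketch below builds a multi-string BWT for a toy read collection by sorting all suffixes directly and then counts k-mer occurrences with FM-index backward search. The function names are hypothetical, the naive construction is unrelated to MSBWT-IS or ropebwt2, and the occurrence counts are recomputed by scanning the BWT rather than read from a sampled FM-index table, so this is suitable only for tiny inputs.

```python
from typing import Dict, List


def build_msbwt(reads: List[str]) -> str:
    """Naive multi-string BWT.

    Each read gets a '$' terminator; identical suffixes from different
    reads are ordered by read index. Assumes reads do not contain '$'.
    """
    suffixes = []
    for read_index, read in enumerate(reads):
        s = read + "$"
        for i in range(len(s)):
            # s[i - 1] is the symbol preceding the suffix; for i == 0 it
            # wraps around to the read's own terminator.
            suffixes.append((s[i:], read_index, s[i - 1]))
    suffixes.sort(key=lambda item: (item[0], item[1]))
    return "".join(prev for _, _, prev in suffixes)


def count_kmer(bwt: str, kmer: str) -> int:
    """Count occurrences of `kmer` in the reads using backward search."""
    # smaller[ch] = number of BWT symbols strictly smaller than ch.
    smaller: Dict[str, int] = {}
    total = 0
    for ch in sorted(set(bwt)):
        smaller[ch] = total
        total += bwt.count(ch)
    lo, hi = 0, len(bwt)  # half-open range of suffixes matching so far
    for ch in reversed(kmer):
        if ch not in smaller:
            return 0
        lo = smaller[ch] + bwt[:lo].count(ch)
        hi = smaller[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    bwt = build_msbwt(["ACGT", "CGTA", "ACG"])
    for query in ("CG", "GT", "TT"):
        print(query, count_kmer(bwt, query))
```

A practical FM-index replaces the `bwt[:lo].count(ch)` scans with precomputed, sampled occurrence counts, which is what makes each query step fast on large collections.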
APPENDIX A
MULTI-STRING BWT UTILITY SUITE
The Multi-String BWT Utility Suite is a publicly available Python/Cython package for interfacing with BWTs of genomic sequencing data. The package includes a command line interface (CLI) for constructing BWTs and a Python/Cython API for developing custom scripts to query the resulting BWTs. Both the CLI and the API are available through PyPI¹ and GitHub². Additionally, wiki pages³ are available describing various use cases of the package along with more detail on the CLI and API. In the following sections, we briefly discuss some of these use cases and the functionality included in the package.