CHAPTER 3: ANALYSIS OF RESULTS
1. Institutional weakness
9.3. The SuperGenome concept as the basis for comparative analyses
Comparative analyses involving multiple species are becoming increasingly important as the number of sequenced genomes continues to rise. However, these analyses pose challenges because the compared genomes often differ in their genomic architecture. Due to insertions, deletions, and genomic rearrangements such as translocations or inversions, the genomes have different coordinate systems, which makes the direct comparison of coordinate-based features difficult or impossible.
In chapter 5 the SuperGenome concept was presented as an approach to solving this problem. The SuperGenome is independent of a fixed reference genome and is computed from a multiple whole-genome alignment. It provides a common coordinate system for the aligned species and a mapping between this common coordinate system and the coordinate systems of the individual genomes. Furthermore, the SuperGenome implementation offers a variety of functions operating on the SuperGenome data structure that allow for coordinate transformations of genome annotations, such as genes or transcription start sites (TSS), or of genomic and transcriptomic data, such as RNA-seq expression graphs in single-nucleotide resolution. This can be used to compare the expression of homologous genes without the need for an ortholog mapping, provided they have been aligned in the multiple whole-genome alignment. Furthermore, it is possible to investigate conserved intergenic regions, for example in order to discover novel coding or non-coding transcripts. All these comparisons can be performed despite any architectural differences between the genomes, as genomic regions that are conserved among the organisms are assigned the same coordinates in the SuperGenome.
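The idea of a common coordinate system derived from alignment blocks can be illustrated with a minimal sketch. It is not the SuperGenome implementation described in chapter 5: the names `Segment` and `build_mapping` are hypothetical, and the sketch assumes gap-free blocks of equal length, whereas a real whole-genome alignment also contains gaps and genome-specific regions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    """Aligned stretch of one genome inside one alignment block (hypothetical type)."""
    start: int        # 1-based start in the genome's own coordinates
    length: int       # number of aligned positions
    strand: int = 1   # +1 forward, -1 inverted relative to the block


# One alignment block: genome name -> aligned segment.
Block = Dict[str, Segment]


def build_mapping(blocks: List[Block]) -> Dict[Tuple[str, int], int]:
    """Assign common coordinates by concatenating the blocks in list order.

    Returns a dict (genome, genome position) -> common position (1-based).
    Assumption of this sketch: every block is gap-free, so all of its
    segments have the same length.
    """
    to_super: Dict[Tuple[str, int], int] = {}
    offset = 0
    for block in blocks:
        lengths = {seg.length for seg in block.values()}
        if len(lengths) != 1:
            raise ValueError("sketch assumes gap-free blocks of equal length")
        (length,) = lengths
        for genome, seg in block.items():
            for i in range(length):
                # Column i of the block corresponds to this genome position.
                if seg.strand == 1:
                    pos = seg.start + i
                else:
                    pos = seg.start + length - 1 - i
                to_super[(genome, pos)] = offset + i + 1
        offset += length
    return to_super


# Two genomes in which the two conserved blocks appear in swapped order.
blocks = [
    {"A": Segment(1, 100), "B": Segment(101, 100)},
    {"A": Segment(101, 100), "B": Segment(1, 100)},
]
to_super = build_mapping(blocks)
```

In this toy input, position 150 of genome A and position 50 of genome B lie in the same column of the second block, so both receive the common coordinate 150 although their original coordinates differ.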
In principle, any software that works on genomic data could also be applied to the SuperGenome and to annotations that have been projected into its coordinate system. Standard genome browsers, for example, can be used to visualize genomic and transcriptomic data of different organisms as tracks in the same browser window. Elements in the different genomes that are related to each other, such as homologous genes or their TSS, share the same position in the SuperGenome and are therefore also visualized at the same position in the genome browser, even though their original genomic coordinates may be completely different. For example, an element might be located in the middle of the genome in one organism, while due to a translocation the homologous element of another organism is located at the end of the genome. The SuperGenome can compensate for these effects as long as the elements have been aligned in the multiple whole-genome alignment.
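The projection of annotations into a shared coordinate system can be sketched as follows. The block table `BLOCKS`, the strain names, and the function `project` are hypothetical and only serve as an illustration; the sketch handles forward-strand blocks only and simply refuses annotations that are not fully contained in one aligned block.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical block table: genome -> list of
# (genome_start, genome_end, common_start), sorted by genome_start,
# inclusive 1-based coordinates, forward strand only.
BLOCKS: Dict[str, List[Tuple[int, int, int]]] = {
    "strainA": [(1, 100, 1), (101, 200, 101)],
    "strainB": [(1, 100, 101), (101, 200, 1)],  # same blocks, swapped order
}


def project(genome: str, start: int, end: int) -> Optional[Tuple[int, int]]:
    """Project an annotation interval into the common coordinate system.

    Returns None if the interval is not fully contained in a single
    aligned block (unaligned region or block boundary).
    """
    blocks = BLOCKS[genome]
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, start) - 1   # last block starting at or before start
    if i < 0:
        return None
    g_start, g_end, c_start = blocks[i]
    if end > g_end:
        return None
    shift = c_start - g_start
    return (start + shift, end + shift)


gene_in_a = project("strainA", 120, 180)
gene_in_b = project("strainB", 20, 80)
```

Here the gene at 120 to 180 in `strainA` and its homolog at 20 to 80 in `strainB` are projected onto the same common interval, so a genome browser working in the common coordinates would draw them at the same position in two tracks.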
Two different applications of the SuperGenome approach have been presented in this thesis. In connection with GenomeRing it was applied to the comparative visualization of genomic architectures (section 5.2). The second application was its integration with an algorithm for TSS detection from RNA-seq data to allow for a cross-genome annotation and comparison of the detected elements (chapter 6).
In the context of alignment visualization GenomeRing employs a different strategy than other visualization tools. The linear viewer integrated in Mauve [37], for example, visualizes conserved regions as colored blocks. For each genome these blocks are shown in the order in which they appear in that genome. As the order of the blocks differs between the genomes, corresponding blocks are connected by lines and share a common color. Because the position of a block varies between the genomes, it can be quite difficult for the user to quickly identify blocks that are missing in specific genomes. Circos [85], in contrast, is a circular viewer in which the block representations of the different genomes are laid out on a circle and connected by lines or ribbons. Both approaches focus on preserving the genomic architectures of the individual genomes in the visualization. GenomeRing, however, focuses on highlighting differences and similarities between the genomes by visualizing each block only once, independently of the number of genomes in which the block is conserved. As a colored path representing an aligned genome either traverses a block or skips it, the user can immediately identify conserved blocks that are found in all genomes and regions that are specific to only a subset of genomes. The architecture of the individual genomes is still shown, as the paths connect the blocks in the order in which they appear in the respective genome.
The application of GenomeRing to the four Campylobacter jejuni strains that were also subject of a comparative TSS analysis (chapter 7) demonstrated how this visualization technique can be used for the quick identification of architectural differences between the genomes. GenomeRing also proved its ability to guide in-depth analyses, as demonstrated by its application to the alignment of three Helicobacter pylori strains [142]. GenomeRing’s connection to Mayday’s linear genome browser and the integration of transcriptomic data allowed for a multi-level inspection of the strains. After obtaining a global overview of architectural differences in GenomeRing, the user can quickly identify loci of interest by incorporating gene annotations and the results of transcriptomic analyses. These loci can then be investigated on the level of gene clusters or single genes through GenomeRing’s linkage to Mayday’s genome browser. An even more detailed analysis is made possible by integrating position-specific expression information in the form of RNA-seq data. This demonstrates how the SuperGenome-based visualization of genome alignments in GenomeRing complements other tools, such as standard genome browsers, for a more comprehensive integrated analysis of genomic and transcriptomic data.
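The block-and-path view described above can be mimicked with a small data model. This is only a sketch under simplifying assumptions, not GenomeRing's actual data structure: block identifiers, genome names, and the function `block_membership` are hypothetical, and block orientation is ignored.

```python
from typing import Dict, List, Set


def block_membership(paths: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each block to the set of genomes whose path traverses it.

    `paths` maps a genome name to its blocks in genomic order, so the
    architecture of each genome is kept while every block exists once.
    """
    membership: Dict[str, Set[str]] = {}
    for genome, path in paths.items():
        for block in path:
            membership.setdefault(block, set()).add(genome)
    return membership


# Hypothetical example: three genomes, three blocks.
paths = {
    "g1": ["b1", "b2", "b3"],
    "g2": ["b1", "b3", "b2"],   # b2 and b3 in swapped order
    "g3": ["b1", "b3"],         # the path skips b2
}
membership = block_membership(paths)

# Blocks traversed by every path versus blocks skipped by at least one path.
core_blocks = sorted(b for b, g in membership.items() if len(g) == len(paths))
partial_blocks = sorted(b for b, g in membership.items() if len(g) < len(paths))
```

In this example `b1` and `b3` are traversed by all three paths, while `b2` is skipped by `g3`, which corresponds to the kind of subset-specific region that becomes visible at a glance in the ring layout.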
In the current implementation the number of genomes that can be visualized in GenomeRing is limited because genomes are distinguished by color in the visualization. This limitation could be overcome in future implementations by developing aggregation techniques that group genomes which are highly similar over large parts of the genome. In this way, differences between groups of genomes would be emphasized even more. The SuperGenome basis of GenomeRing is likely to prove very helpful for this task, as similarities and differences between genomes and groups of genomes are implicitly modeled in the SuperGenome. Algorithms for the clustering and summarization of genomes and genomic regions would therefore operate on the core data structure of the SuperGenome. As a consequence, the results of these summarization techniques would be beneficial not only for the visualization with GenomeRing but potentially for all applications of the SuperGenome approach.
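One conceivable way to group highly similar genomes is sketched below. It is purely illustrative and not a technique proposed or evaluated in this thesis: it assumes that each genome is described only by the set of alignment blocks it contains, uses the Jaccard index as similarity, and merges genomes by single linkage above a hypothetical threshold. Block lengths and block order are ignored.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two block sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_genomes(block_sets: Dict[str, Set[str]],
                  threshold: float = 0.8) -> List[List[str]]:
    """Single-linkage grouping: genomes end up in the same group if they
    are connected by a chain of pairs with similarity >= threshold."""
    parent = {g: g for g in block_sets}

    def find(g: str) -> str:
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(block_sets, 2):
        if jaccard(block_sets[a], block_sets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for g in block_sets:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical block content of three strains.
block_sets = {
    "s1": {"b1", "b2", "b3", "b4", "b5"},
    "s2": {"b1", "b2", "b3", "b4", "b5", "b6"},
    "s3": {"b1", "b7", "b8"},
}
groups = group_genomes(block_sets)
```

With these inputs `s1` and `s2` share five of six blocks and are grouped, while `s3` remains on its own; a visualization could then assign one color per group instead of one per genome.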
For all possible applications, however, it has to be considered that the SuperGenome strongly depends on the alignment that is used as input. If homologous regions are not properly aligned in the input data, they will not share the same SuperGenome coordinates and comparative analyses will not be possible for those loci. Furthermore, in its current implementation the SuperGenome only allows for an injective mapping of coordinates between the SuperGenome and the original genomes. Thus, a SuperGenome position can map to only one position in any of the other genomes, and duplications are therefore not modelled by this approach. Mauve [38], which is used for the generation of the multiple whole-genome alignments, is likewise unable to handle duplications, but there are other tools for genome alignment that can find duplications, such as MUMmer [87]. As gene duplication is an important evolutionary mechanism in prokaryotes and eukaryotes [84], extending the SuperGenome concept in this respect would be beneficial to allow for the comparative analysis of such events.
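The one-to-one restriction can be made concrete with a small check on the alignment input. The sketch assumes that, for one genome, each alignment block contributes one inclusive interval; if two of these intervals overlap, the same genome positions take part in more than one block, which is what a duplication-aware aligner could report and what a one-to-one coordinate mapping cannot represent. The function name `find_overlaps` is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) in one genome


def find_overlaps(segments: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Return all pairs of block intervals that share genome positions."""
    ordered = sorted(segments)
    overlaps = []
    for i, (start, end) in enumerate(ordered):
        for other in ordered[i + 1:]:
            if other[0] > end:
                break  # sorted by start: no later interval can overlap
            overlaps.append(((start, end), other))
    return overlaps


# Blocks that tile the genome without overlap fit a one-to-one mapping.
clean = find_overlaps([(1, 100), (101, 200)])

# Positions 50 to 100 occur in two blocks, as a duplication would cause.
duplicated = find_overlaps([(1, 100), (50, 150), (201, 300)])
```

An empty result means that every position of the genome belongs to at most one block; a non-empty result marks the loci that would need a one-to-many extension of the mapping.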
Another unsolved challenge is the ‘evolution’ of the SuperGenome itself. In the course of a study of a bacterial species, for example, the number of sequenced genomes of different strains might grow over time, or additional sequences have to