6. ANALYSIS OF RESULTS
6.1. Comparison of theoretical and experimental results
6.1.2. Experimental results of energy demand
In recent years, ESTs have become popular for phylogenetic studies (e.g. Dunn et al. (2008), Roeding et al. (2007)) because, on the one hand, plenty of data is already publicly available and, on the other hand, ESTs for taxa not yet present in the sequence databases can be generated at reasonably low cost. Furthermore, since ESTs are derived from mRNAs, they mainly represent protein-coding regions of a genome. This is particularly useful for studies addressing evolutionary events that took place millions of years ago, because the phylogenetic signal fades more slowly over time on the protein level than on the DNA level (Opperdoes (2003)).
However, the advantages of using ESTs in phylogenetic analyses come at the cost of several disadvantages. As pointed out earlier (Section 3.3), the generation of ESTs involves several stages in which the cDNA is altered relative to its mRNA template. Most prominently, ESTs usually do not cover the complete mRNA due to the inefficiencies of the reverse transcriptase and the sequencing process. This reduces the phylogenetic signal (compared to the full-length mRNA sequence), because longer sequences are assumed to contain more phylogenetic information (Philippe et al. (2004)). Also, EST sequences as obtained from the sequencing machine usually contain contaminations, such as parts of the vector, the adapter sequence, and genetic material from the bacterial host cell that was integrated via transposable elements. If not taken care of, these contaminations can cause severe problems during phylogenetic tree reconstruction, because these sequence parts do not share an evolutionary history with the cDNA they are attached to. Furthermore, nucleotides of the cDNA do not necessarily correspond to their mRNA counterparts, as the reverse transcriptase does not operate faultlessly. Consequently, sequences that are compared will show more differences on the nucleotide level, which makes them appear more distantly related than they really are. Finally, the quality of EST sequences is usually poor at the ends, which is caused by the sequencing process. Ignoring this fact and not removing faulty nucleotides leads to an overestimate of the number of substitutions introduced by mutations over time.
Fortunately, methods exist to deal with such sources of error. Vector contaminations can be identified by comparing the EST against the known vector sequence and can subsequently be removed. As explained, modern base-calling software delivers not only the sequences themselves but also an estimate of the correctness of each single sequence position: the base quality values. Consequently, a researcher working with the sequences knows which nucleotides can be trusted and which should be regarded with suspicion. The latter can simply be removed or masked before performing analyses.
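The removal and masking of untrusted positions described above can be illustrated with a minimal Python sketch. It assumes that the base quality values are Phred-style integers (higher means more reliable) and that a threshold of 20 separates trusted from suspicious positions. The function names `trim_low_quality_ends` and `mask_low_quality` as well as the threshold are illustrative assumptions, not the procedure of any particular tool.

```python
def trim_low_quality_ends(seq, quals, min_q=20):
    """Cut positions below min_q off both ends of a read.

    seq   -- nucleotide string
    quals -- one quality value per position (assumed Phred-style)
    min_q -- hypothetical threshold; 20 is an assumption for illustration
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list differ in length")
    start = 0
    while start < len(seq) and quals[start] < min_q:
        start += 1
    end = len(seq)
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], list(quals[start:end])


def mask_low_quality(seq, quals, min_q=20, mask_char="N"):
    """Replace every position below min_q by mask_char, keeping the length."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list differ in length")
    return "".join(b if q >= min_q else mask_char for b, q in zip(seq, quals))


if __name__ == "__main__":
    read = "ACGTACGTAC"
    quality = [5, 10, 30, 35, 40, 12, 38, 30, 8, 3]
    trimmed, trimmed_q = trim_low_quality_ends(read, quality)
    print(trimmed)                               # poor ends removed
    print(mask_low_quality(trimmed, trimmed_q))  # internal poor base masked
```

Trimming handles the typically poor read ends, whereas masking keeps the coordinates of the remaining sequence intact, which can be convenient when positions are later compared across ESTs.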
To counteract the other error sources, one can take advantage of the redundancy of ESTs: A gene can be transcribed in parallel, which results in the presence of multiple
mRNA molecules derived from the same gene. The pace at which a gene is processed to the final protein, the gene expression level, reflects the need of the cell for this gene product. Expression levels therefore differ not only between genes but also, for a single gene, between different tissues, developmental stages or environmental conditions (e.g. Su et al. (2004)). Usually, the individual expression levels of genes are unknown when the mRNA is extracted to construct a cDNA library, but typically there are some genes (10 to 15) that account for up to 20% of the total mRNA mass of a cell. Approximately 1000-2000 genes are represented with intermediate levels of mRNA, and the remaining genes are present with only a few mRNA molecules or are completely absent (Bonaldo et al. (1996)). It is very likely that genes with a higher expression level are represented by several ESTs, because cDNA clones are usually randomly picked and sequenced. A common strategy is to remove these redundancies after the sequencing by grouping ESTs that stem from the same gene (clustering) and then assembling all overlapping ESTs in a cluster to form a longer continuous sequence, called a contig. Although all ESTs in a cluster should represent the same gene, different mRNA molecules were used as templates. Hence, random errors introduced by the reverse transcriptase are unlikely to occur at the same position in different ESTs. Such errors will therefore trigger conflicts during the consensus sequence determination of a contig and can be either corrected or at least marked as suspicious. As a further advantage of the clustering, different ESTs of the same gene can cover different parts of the mRNA. By clustering and assembling them, the final cDNA sequences can be extended, yielding a higher coverage of the gene and increasing the phylogenetic signal compared to single ESTs.
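How conflicts surface during consensus determination can be sketched in a few lines of Python. The sketch assumes that the ESTs of one cluster have already been placed in a common coordinate system, with `-` marking positions an EST does not cover. The function name `consensus`, the majority rule, and the 0.6 threshold are assumptions made for illustration and do not reproduce any specific assembler.

```python
from collections import Counter


def consensus(aligned_reads, gap="-", min_fraction=0.6, ambiguous="N"):
    """Column-wise majority consensus over pre-aligned reads.

    Returns the consensus string and the list of column indices at
    which the covering reads disagree (candidate error positions).
    """
    if not aligned_reads:
        raise ValueError("no reads given")
    length = len(aligned_reads[0])
    if any(len(r) != length for r in aligned_reads):
        raise ValueError("reads must be padded to the same length")

    bases, conflicts = [], []
    for i in range(length):
        column = [r[i] for r in aligned_reads if r[i] != gap]
        if not column:                      # no read covers this column
            bases.append(gap)
            continue
        base, count = Counter(column).most_common(1)[0]
        if len(set(column)) > 1:            # reads disagree here
            conflicts.append(i)
        # accept the majority base only if it is supported well enough
        bases.append(base if count / len(column) >= min_fraction else ambiguous)
    return "".join(bases), conflicts


if __name__ == "__main__":
    cluster = [
        "ATTTGCACC-----",
        "--TTGCTCCGTG--",   # deviates from the other reads at column 6
        "----GCACCGTGGG",
    ]
    print(consensus(cluster))
```

In this toy cluster the contig is longer than any single read, and the single deviating base is outvoted while its column is still reported, which mirrors the two benefits named above: extension of the sequence and detection of random errors.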
Processing ESTs is therefore straightforward and nowadays routinely done, independent of the application. Correspondingly, a broad range of tools for each step has been developed in the last 15 years (Nagaraj et al. (2007)). But as the number and size of EST projects to be processed increase, there is a need for completely automated solutions. Consequently, to process the enormous amounts of EST data for the Deep Metazoan Phylogeny project, we developed a program pipeline that wraps up each individual processing step without user interaction and that also takes care of the data management.
[Figure 3.6. Panel labels: Fragments of cDNA with primer sequence and a terminal ddNTP; Sequencing Gel; Laser/Detector; Chromatogram]
Figure 3.6: EST sequencing. The synthesizing of fragments is continued for a fixed number of cycles, resulting in many fragments of different length. These fragments are then loaded on a sequencing gel for example. By that, the fragments will be separated according to their size, forming a band pattern. A laser aimed on the gel will stimulate the dye attached to the ddNTPs in bands passing the laser to emit light in its specific color. The time at which a band passes the laser corresponds to the length of the fragments in the band. The series of the four colors is detected and processed by the sequencing machine which yields a chromatogram. The chromatogram can then be translated into the actual DNA sequence.
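The principle in the caption, namely that ordering the bands by fragment length and reading the dye of each band yields the sequence, can be illustrated with a small Python sketch. It is a strong simplification: each band is modelled as a pair of fragment length and the base indicated by its dye colour, and peak detection on a real chromatogram is not covered. The function name `call_sequence` and the example primer length are assumptions for illustration only.

```python
def call_sequence(bands):
    """Read a sequence from idealised gel bands.

    bands -- iterable of (fragment_length, terminal_base) pairs, where
             terminal_base stands in for the colour of the dye attached
             to the terminal ddNTP of the fragments in that band.
    """
    ordered = sorted(bands)                 # shortest fragments pass first
    lengths = [length for length, _ in ordered]
    if len(set(lengths)) != len(lengths):
        raise ValueError("two bands with the same fragment length")
    return "".join(base for _, base in ordered)


if __name__ == "__main__":
    primer_length = 18                      # assumed length, for illustration
    # one band per extension step, given here in arbitrary order
    bands = [(primer_length + 3, "A"), (primer_length + 1, "A"),
             (primer_length + 4, "C"), (primer_length + 2, "C")]
    print(call_sequence(bands))
```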
4 EST Processing
4.1 Introduction
As already explained in the previous chapter, the state in which ESTs are obtained from the sequencer is not suitable for direct use in phylogenetic analyses. To obtain high-quality data, a wide choice of tools exists for every necessary step. Depending on the source, the available ESTs are provided at different levels of quality. Some sources only offer unprocessed ESTs as obtained directly from the sequencing machine, which potentially contain vector contaminations and low-quality regions. Other providers remove contaminations and low-quality regions but do not apply any clustering procedure. We call this state preprocessed hereafter. Finally, sequence data based on cleaned and assembled ESTs is available as well. Here, we describe a pipeline in which the complete processing of ESTs is performed automatically with a minimum of user interaction. We also take into account the different stages in which ESTs can be delivered, to prevent unnecessary steps and save computational resources. An overview of the complete workflow is shown in Fig. 4.1. In the following we explain each processing step.
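Before turning to the individual steps, the idea of skipping work that the data provider has already done can be sketched as follows. The three input states follow the description above (unprocessed, preprocessed, assembled); the step names and the `steps_to_run` function are hypothetical placeholders and do not correspond to the actual components shown in Fig. 4.1.

```python
from enum import IntEnum


class Stage(IntEnum):
    """State in which a set of ESTs is delivered."""
    UNPROCESSED = 0    # raw reads from the sequencing machine
    PREPROCESSED = 1   # contaminations and low-quality regions removed
    ASSEMBLED = 2      # cleaned, clustered and assembled


# (hypothetical step name, stage that is reached once the step is done)
STEPS = [
    ("remove_vector_contamination", Stage.PREPROCESSED),
    ("trim_low_quality_regions", Stage.PREPROCESSED),
    ("cluster", Stage.ASSEMBLED),
    ("assemble", Stage.ASSEMBLED),
]


def steps_to_run(input_stage):
    """Return only the steps the input has not yet passed through."""
    return [name for name, reached in STEPS if input_stage < reached]


if __name__ == "__main__":
    for stage in Stage:
        print(stage.name, steps_to_run(stage))
```

Encoding the delivery state as an ordered value keeps the decision in one place, so that adding a further step only requires a new entry in the list.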