3. CHAPTER III
3.6 Triangulation of Interviews
Obtaining a draft metabolic network for an organism is primarily a bioinformatic problem. The typical steps taken to create such a draft network are summarized in Figure 3.4. Given that most metabolic reactions are catalyzed by enzymes, the obvious first step is to obtain an enzyme inventory. For most microorganisms, the primary resource for obtaining this list is the one-dimensional annotation of the genome.
Gene Finding
A one-dimensional annotation of a genome, put very simply, refers to the identification of “interesting” regions within the genome, such as those connected to genes, rRNAs, and tRNAs, and the assignment of (putative) functions to their products whenever possible. In small prokaryotes, the identification of genes (Figure 3.4, transition A), or “gene finding”, is largely a matter of identifying long open reading frames (ORFs). However, some tricky situations do arise, such as when long ORFs overlap on opposite strands and cause ambiguities as to which are true coding regions (Stein, 2001). Gene finding becomes increasingly difficult as genomes get larger. Currently available gene finding tools include GLIMMER (Salzberg et al., 1998), GENSCAN (Burge and Karlin, 1997), Genie (Reese et al., 2000), GeneMark.hmm (Besemer and Borodovsky, 1999), Grail (Uberbacher and Mural, 1991), HEXON (Solovyev et al., 1994), and REGANOR (Linke et al., 2006).
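The idea of scanning for long open reading frames can be illustrated with a minimal sketch. It is not how GLIMMER or any of the tools listed above work; it simply reports stop-free stretches in all six reading frames. The function names, the `min_codons` parameter, and the decision to ignore start codons are assumptions made for illustration only.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100) -> List[Tuple[str, int, int, int]]:
    """Report stop-free stretches of at least `min_codons` codons.

    Each result is (strand, frame, start, end). Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned. Start codons
    are deliberately ignored (an assumption of this sketch).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame  # first base of the current stop-free run
            pos = frame
            while pos + 3 <= len(s):
                if s[pos:pos + 3] in STOP_CODONS:
                    if (pos - run_start) // 3 >= min_codons:
                        orfs.append((strand, frame, run_start, pos))
                    run_start = pos + 3  # next run begins after the stop
                pos += 3
            # A run may also reach the end of the sequence without a stop.
            if (pos - run_start) // 3 >= min_codons:
                orfs.append((strand, frame, run_start, pos))
    return orfs


if __name__ == "__main__":
    toy = "ATG" + "GCC" * 5 + "TAA"  # hypothetical toy sequence
    for orf in find_orfs(toy, min_codons=5):
        print(orf)
```

Even on this toy sequence, several frames on both strands yield candidate ORFs. That is the ambiguity described above for overlapping ORFs on opposite strands.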
Figure 3.4: Summary of steps leading to a draft network. The yellow curved boxes (places) correspond to intermediate results in obtaining a draft metabolic network for an organism. Processes that link these intermediate results are represented by blue rectangles (transitions). Some popular tools and resources relevant to the transitions are provided to the right. Typically, the first two steps are performed by the people who publish the genome. This is true for Halobacterium salinarum, so we only had to use (gene) protein function prediction methods in specific cases. One-letter codes are attached to each transition for easy reference from the main text.
GC-rich genomes are characterized by a low frequency of stop codons (TAA, TGA, TAG), which leads to a severe overprediction of potential genes. In Halobacterium salinarum, the GC content is 68%, and this led to an average of 1.7 additional spurious open reading frames of at least 100 codons for every gene coding for a real protein (Tebbe et al., 2005; Aivaliotis et al., 2007). In addition, problems related to incorrect gene start codon assignment are also aggravated in the archaea because ribosome binding sites (Shine-Dalgarno boxes) are poorly conserved and mainly precede genes within transcription units but are not found in the case of leaderless mRNAs (Falb, 2005; Sartorius-Neef and Pfeifer, 2004; Torarinsson et al., 2005). Because of these difficulties, standard gene prediction tools had to be supplemented with methods that exploit features of halophilic proteins, for example their acidic isoelectric point profiles, and with experimental data such as the proteomic identification of coding regions (Falb, 2005). Fortunately, such processed data are typically already available when a microbial genome is published. In this study, we used the expert-validated gene sets for Halobacterium salinarum strain R1 (Figure 3.5) and Natronomonas pharaonis available in the Halolex database (Pfeiffer et al., 2008b).
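The effect of GC content on stop codon frequency can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a deliberately crude model that is not taken from the cited studies: nucleotides are drawn independently, with P(G) = P(C) and P(A) = P(T). Under that assumption it gives the chance that a random codon is a stop and the chance that a run of codons stays open. It is not meant to reproduce the figure of 1.7 spurious ORFs per gene.

```python
def stop_codon_probability(gc: float) -> float:
    """Probability that a random codon is TAA, TAG, or TGA.

    Assumes independent nucleotides with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2 (a simplification for illustration).
    """
    at = (1.0 - gc) / 2.0  # probability of A, and likewise of T
    g = gc / 2.0           # probability of G (and likewise of C)
    # TAA contributes at**3; TAG and TGA each contribute at**2 * g.
    return at ** 3 + 2 * at ** 2 * g


def open_run_probability(gc: float, n_codons: int) -> float:
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1.0 - stop_codon_probability(gc)) ** n_codons


if __name__ == "__main__":
    for gc in (0.50, 0.68):
        p_stop = stop_codon_probability(gc)
        p_open = open_run_probability(gc, 100)
        print(f"GC={gc:.2f}  P(stop)={p_stop:.4f}  P(100 open codons)={p_open:.4f}")
```

Under this model, a stretch of 100 stop-free codons is more than ten times as likely at 68% GC as at 50% GC. This is consistent with the overprediction described above, although real genomes are far from random sequence.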
Figure 3.5: A representation of the genome of Halobacterium salinarum.
Proteins are marked in the outer circle in a strand-specific manner. The central circle shows stable RNAs. Insertion elements are indicated in the innermost circle. The image was taken from the Halolex website (Pfeiffer et al., 2008b).
Gene Function Assignment - Homology
When the hot-spots in the genome have been adequately identified, the next step in the annotation is to obtain a definitive catalogue of the proteins of the organism, where a putative function is assigned to each element whenever possible (Figure 3.4, transition B). To be sure, a rigorous experimental characterization remains the authoritative basis for assigning gene function. However, the extremely rapid pace at which sequences are generated precludes the possibility of performing experiments in each case. For this reason, computational methods that predict function from sequence alone have been developed, and these have been widely, though not universally, successful.
Two genes are said to be homologous if they share a common ancestry, i.e., they derive from the same gene of a common ancestor. Furthermore, homologs are orthologous to each other if they diverged through a speciation event, and therefore typically reside in distinct species, and paralogous if they arose through a gene duplication. This concept of homology carries powerful implications in that it, with certain assumptions, often allows function to be “transferred” from one gene (protein) to another (Bork et al., 1998). For example, a protein that is homologous to a permease very likely also serves a transport-related function. In bioinformatics, homology among proteins and DNA is often inferred from sequence similarity (Altschul et al., 1990, 1997); if two or more genes have a score that is above a certain threshold under some sequence similarity measure, then they are likely homologous. The approach is not perfect, because sequence similarity can arise for other reasons, such as by chance in short sequences, or when different proteins are selected for binding to a particular structure. Nevertheless, inferring homology from sequence similarity has been used extensively. Indeed, it still serves as the primary means of annotating the genes of newly sequenced organisms with function.
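The thresholding idea can be sketched with a small local alignment score in the style of Smith and Waterman. This is an illustration, not the scoring used by BLAST. The match, mismatch, and gap values, the function names, and the threshold are all assumptions; practical protein comparisons use substitution matrices and statistical significance estimates rather than a raw score cutoff.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty).

    Standard dynamic programming recurrence, keeping only two rows.
    The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def likely_homologous(a: str, b: str, threshold: int) -> bool:
    """Apply a hypothetical score threshold to call two sequences similar."""
    return local_alignment_score(a, b) >= threshold


if __name__ == "__main__":
    print(local_alignment_score("GGACGTGG", "TTACGTTT"))
    print(likely_homologous("GGACGTGG", "TTACGTTT", threshold=6))
```

The short example also shows why chance similarity is a concern: two otherwise unrelated strings that share only a four-letter core already pass a lenient threshold.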
Perhaps the most popular set of tools for comparing sequences at the nucleotide or protein level is the BLAST family of tools (Altschul et al., 1990, 1997; Zhang et al., 2000; Zhang and Madden, 1997; Madden et al., 1996). Typically, new query proteins, such as those obtained from newly sequenced genomes, are each “blasted” against several different databases. At this step, it is important to bear in mind that the accuracy of the results is, to a large extent, determined by the quality of the reference proteins used. In particular, because function is transferred on the basis of sequence similarity, incorrect annotations in the reference set are propagated just as easily as correct ones. In this respect, SWISS-PROT and its supplement TrEMBL (Boeckmann et al., 2003; O’Donovan et al., 2002) have emerged as among the most valuable sequence repositories (Stein, 2001). SWISS-PROT is a curated collection of confirmed protein sequences that have been extensively annotated and cross-referenced with other bioinformatic databases. Annotations include bibliographic references, functional descriptions, biological roles, protein family assignments and, when available, links to structural data.
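The transfer of annotations from reference hits can be sketched as follows. The example assumes that the search results are in the standard 12-column tabular BLAST output (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, E-value, bit score). All identifiers, the reference annotations, and the E-value cutoff are hypothetical, and this is not the procedure that was used to build Halolex.

```python
from typing import Dict

BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular: str, max_evalue: float = 1e-5) -> Dict[str, dict]:
    """Keep, for each query, the hit with the highest bit score below the cutoff."""
    best: Dict[str, dict] = {}
    for line in tabular.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        hit = dict(zip(BLAST_COLUMNS, line.split("\t")))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] > max_evalue:
            continue  # too weak to support a transfer
        query = hit["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best


def transfer_annotations(tabular: str, reference_functions: Dict[str, str],
                         max_evalue: float = 1e-5) -> Dict[str, str]:
    """Assign each query the function of its best reference hit, if annotated."""
    return {query: reference_functions[hit["sseqid"]]
            for query, hit in best_hits(tabular, max_evalue).items()
            if hit["sseqid"] in reference_functions}


# Hypothetical hits and reference annotations, for illustration only.
EXAMPLE_HITS = "\n".join([
    "\t".join(["query1", "refA", "62.5", "200", "75", "0", "1", "200", "5", "204", "1e-50", "180.0"]),
    "\t".join(["query1", "refB", "35.0", "180", "117", "2", "10", "190", "8", "185", "1e-8", "60.0"]),
    "\t".join(["query2", "refC", "28.0", "90", "65", "3", "1", "90", "1", "88", "0.5", "25.0"]),
])
EXAMPLE_FUNCTIONS = {"refA": "putative permease", "refB": "hypothetical protein"}

if __name__ == "__main__":
    print(transfer_annotations(EXAMPLE_HITS, EXAMPLE_FUNCTIONS))
```

Here `query1` inherits the function of `refA`, while `query2` remains unannotated because its only hit is above the E-value cutoff. Any error in the annotation of `refA` would be inherited in exactly the same way.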
Supplementing Homology Searches - Gene Context Analysis
Even with the level of success achieved by homology-based methods, it was recognized that further methods to supplement them had to be developed. Relying exclusively on homology has two well-known problems: a match with known function cannot always be found for a given query protein, and two proteins with the same function need not have similar sequences. It was against this backdrop that methods which exploit gene context emerged as successful complements.
True to their name, context-based approaches use contextual information such as gene fusion, the conservation of local gene neighborhoods, and the co-occurrence of genes across genomes. In the first case, proteins encoded by genes with homologs which are fused in another organism tend to be functionally related (Enright et al., 1999; Snel et al., 2000; Marcotte et al., 1999). Likewise, in the second case, genes which are encountered as neighbors across genomes significantly often, detected by the conservation of either gene order (Dandekar et al., 1998) or genes in a run (Overbeek et al., 1999), also tend to be functionally related. Finally, in the third case, functionally linked proteins are assumed to be inclined to evolve in a correlated fashion, and as such can be found by comparing their phylogenetic profiles (Pellegrini et al., 1999). While these methods typically do not predict specific functions by themselves, they are used to predict higher level functions, such as the participation of a protein in a particular structural complex or metabolic pathway.
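The third approach, comparison of phylogenetic profiles, lends itself to a compact sketch. Each gene is represented by a presence/absence vector over a set of genomes, and genes with matching or nearly matching vectors are proposed as functionally linked. The gene names, the profiles, the Hamming distance criterion, and the `max_distance` parameter are assumptions chosen for illustration. The published method may differ in how profiles are built and compared.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def hamming(p: Sequence[int], q: Sequence[int]) -> int:
    """Number of genomes in which two presence/absence profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(p, q))


def linked_pairs(profiles: Dict[str, Sequence[int]],
                 max_distance: int = 0) -> List[Tuple[str, str]]:
    """Pairs of genes whose profiles differ in at most `max_distance` genomes.

    Genes present in every genome (or in none) are skipped, since a constant
    profile carries no co-occurrence signal.
    """
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        if len(set(p1)) < 2 or len(set(p2)) < 2:
            continue
        if hamming(p1, p2) <= max_distance:
            pairs.append((g1, g2))
    return pairs


# Hypothetical profiles over five genomes (1 = homolog present, 0 = absent).
EXAMPLE_PROFILES = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 0, 1),
    "geneD": (1, 1, 1, 1, 1),
}

if __name__ == "__main__":
    print(linked_pairs(EXAMPLE_PROFILES))
```

In this toy data only `geneA` and `geneB` are linked. As noted above, such a link suggests participation in a common pathway or complex rather than a specific molecular function.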
In the case of Halobacterium salinarum, protein function assignments based on various homology methods and other approaches are already available in the Halolex database (Pfeiffer et al., 2008b). Accordingly, in this study we only had to use protein function assignment methods in specific cases, such as when new relevant sequences were created or characterized.