
There are two approaches to novel gene discovery. The experimental approach was discussed in the previous section; this section focuses on analysis and prediction using computational approaches.

1.2.1 Extrinsic (similarity or evidence-based) gene finding systems

Extrinsic gene finding systems locate target genes by comparing the RNA or protein sequences under study with all other RNA or protein sequences registered in databases to look for similarity. A high degree of similarity to a known RNA or protein product is strong evidence that a target gene is a protein-coding gene. Approximately 20-50% of newly found genes contain an ancient conserved region that is represented in the database (Fickett, 1996). The Basic Local Alignment Search Tool (BLAST; blast.ncbi.nlm.nih.gov) is a widely used system designed for this purpose, with several variants. It provides task-specific tools: blastn (nucleotide BLAST) searches a nucleotide sequence database using a nucleotide query; blastp (protein BLAST) searches a protein sequence database using a protein query; blastx searches a protein sequence database using a translated nucleotide query; tblastn searches a translated nucleotide database using a protein query; and tblastx searches a translated nucleotide database using a translated nucleotide query.
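The pairing of query type and database type with each BLAST variant can be summarised as a lookup table. The following is a minimal sketch; the helper name choose_blast_program and the type labels are hypothetical and are not part of BLAST itself.

```python
# Lookup of BLAST variants by (query type, database type), as listed in the text.
# The string labels and the helper below are illustrative assumptions, not a BLAST API.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("translated nucleotide", "protein"): "blastx",
    ("protein", "translated nucleotide"): "tblastn",
    ("translated nucleotide", "translated nucleotide"): "tblastx",
}


def choose_blast_program(query_type: str, database_type: str) -> str:
    """Return the BLAST variant that matches a query/database pairing."""
    key = (query_type, database_type)
    if key not in BLAST_PROGRAMS:
        raise ValueError(
            f"no BLAST variant for a {query_type!r} query "
            f"against a {database_type!r} database"
        )
    return BLAST_PROGRAMS[key]
```

For example, a genomic DNA fragment searched against a protein database corresponds to choose_blast_program("translated nucleotide", "protein"), which selects blastx.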

Extensive transcript and protein sequence databases have been generated for human and for other important model organisms in biology, such as mouse, yeast, Drosophila and C. elegans, so these extrinsic methods are widely used. However, applying this approach systematically requires extensive sequencing of mRNA and protein products. Although the RefSeq database, the Ensembl system and the NCBI databases contain transcript and protein sequences for many species, these databases are incomplete and contain a number of errors. One specific issue is the limited availability of transcript sequences and protein products for tissue-specific genes and for genes that are expressed only at certain times. These limitations mean that extrinsic evidence is not yet available for many genes.

1.2.2 Intrinsic (ab initio) approaches

Ab initio gene prediction approaches use statistical and computational methods to detect coding regions, splice sites, and start and stop codons in genomic sequences. These features can be broadly categorized as either ‘signals’, specific sequences that indicate the presence of a gene nearby, or ‘content’, statistical properties of the protein-coding sequence itself. The ab initio approach is the predominant gene prediction approach, largely because it does not depend on sequence similarity and is therefore not limited by the availability of sequence data. Instead, understanding gene structure is the key step in predicting genes.

In prokaryotic genomes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box (TATAAT) and transcription factor binding sites, that are easy to identify systematically. Genes that code for proteins comprise open reading frames (ORFs) consisting of a series of codons that specify the amino acid sequence of the protein for which the gene codes. The ORF begins with an initiation codon, usually but not always ATG, and ends with a termination codon, which can be TAA, TAG or TGA. Searching for a DNA sequence that begins with an ATG and ends with a termination triplet is therefore a first step towards gene annotation. Statistically, one would expect a stop codon approximately every 60-75 bp in a random sequence, so a much longer stretch without a stop codon is good evidence for an open reading frame. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems achieve a high level of accuracy.
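The ORF scan described above can be sketched in a few lines. This is an illustrative simplification rather than the method of any particular gene finder: it scans only the forward strand, accepts only ATG as the initiation codon, and the function names and the min_codons threshold are assumptions made for the example. The first function reproduces the stop codon spacing estimate under the assumption of uniform base composition.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative initiation codons are ignored
CODON_LENGTH = 3


def expected_stop_spacing_bp() -> float:
    """Expected spacing of stop codons in one frame of uniform random DNA.

    3 of the 64 possible codons are stops, so one stop is expected every
    64/3 codons, which is 64 bp.
    """
    return CODON_LENGTH * 64 / len(STOP_CODONS)


def find_orfs(sequence: str, min_codons: int = 100):
    """Yield (start, end, frame) for forward-strand ORFs.

    start is the index of the A of the first ATG after the previous stop,
    end is one past the last base of the stop codon, and min_codons counts
    the codons from the ATG up to, but not including, the stop.
    """
    seq = sequence.upper()
    for frame in range(CODON_LENGTH):
        start = None
        for i in range(frame, len(seq) - 2, CODON_LENGTH):
            codon = seq[i:i + CODON_LENGTH]
            if start is None:
                if codon == START_CODON:
                    start = i
            elif codon in STOP_CODONS:
                if (i - start) // CODON_LENGTH >= min_codons:
                    yield (start, i + CODON_LENGTH, frame)
                start = None
```

Because the expected spacing of about 64 bp is far shorter than a typical gene, a threshold such as min_codons=100 removes most chance ORFs; the reverse strand can be handled by running the same scan on the reverse complement.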

Ab initio gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. Firstly, the promoter and other regulatory signals in these genomes are more complex and less well understood than in prokaryotes. Secondly, the main problem for the human genome and those of other higher eukaryotes is that gene sequences are often split by introns and so do not appear as continuous ORFs. An ORF that reads through into an intron is usually terminated by a stop codon within the intron. Because exons are short relative to introns, simple ORF scanning cannot locate gene sequences. For example, many exons are shorter than 100 codons, and some are shorter than 50 codons. Thirdly, there is substantially more space between real genes in the human genome and those of other higher eukaryotes (70% of the human genome is intergenic), increasing the chance of finding spurious ORFs.

Given these issues, three modifications to the basic procedure for ORF scanning have been adopted for eukaryotes. The first takes account of codon bias: not all codons are used equally frequently in a particular organism. The second uses exon-intron boundaries as a signal to identify genes. The third uses upstream control sequences to locate the regions where genes begin. Additional strategies are also possible for specific organisms, such as the identification of CpG islands and of sites for poly(A) tail addition.
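Codon bias, the first of these modifications, can be turned into a simple content score. The sketch below estimates codon frequencies from known coding sequences and scores a candidate window by its mean log-odds against uniform codon usage. The function names, the pseudo-frequency for unseen codons and the uniform background are assumptions made for illustration; they are not taken from any specific gene finder.

```python
import math
from collections import Counter


def codon_frequencies(coding_sequences):
    """Estimate relative codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for seq in coding_sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_bias_score(window, freqs, pseudo=1e-3):
    """Mean log-odds per codon of `window` under `freqs` versus uniform usage.

    A positive score means the window uses codons the way the training
    genes do; `pseudo` is an assumed frequency for codons never observed.
    """
    window = window.upper()
    codons = [window[i:i + 3] for i in range(0, len(window) - 2, 3)]
    if not codons:
        return 0.0
    uniform = 1 / 64
    return sum(math.log(freqs.get(c, pseudo) / uniform) for c in codons) / len(codons)
```

Scoring each of the three reading frames of a window and comparing the results gives a crude indication of whether, and in which frame, the window is likely to be coding.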

GLIMMER and GeneMark are widely used, highly accurate gene finders for prokaryotes (Aggarwal and Ramaswamy, 2002). Eukaryotic ab initio gene finders, such as GENSCAN and Geneid, have by comparison achieved only limited success (Peters et al., 2007). Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as hidden Markov models (HMMs), to combine information from a variety of different signal and content measurements. Rogic et al. evaluated seven ab initio programs on a non-homologous mammalian data set. They reported that, among the evaluated programs, only GENSCAN and HMMgene were able to predict the precise location of 70-80% of coding exons with low false positive rates (Rogic et al., 2001).
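To illustrate how an HMM assigns labels to a sequence, the following toy example decodes a two-state model (coding and noncoding) with the Viterbi algorithm. It is a sketch only: real gene finders use many more states, and all probabilities below are hypothetical values chosen for illustration, not parameters from any published program.

```python
import math

STATES = ("coding", "noncoding")

# Hypothetical parameters, for illustration only.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(sequence):
    """Return the most probable state path for a DNA sequence (A, C, G, T only)."""
    seq = sequence.upper()
    if not seq:
        return []
    # scores[s] is the best log-probability of any path that ends in state s.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these toy parameters a GC-rich stretch is labelled coding and an AT-rich stretch noncoding, and the transition probabilities discourage switching state on the evidence of a single base.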

1.2.3 Combined approaches

Combined approaches bring together extrinsic and ab initio approaches by mapping protein and EST (expressed sequence tag) data to the genome in order to validate ab initio predictions. Ab initio approaches alone have delivered a maximum accuracy of 70-80% (Rogic et al., 2001). Similarity search programs are very effective in improving the accuracy of gene prediction; in particular, combining the two methods can improve overall accuracy by 4-10% (Issac and Raghava, 2004). Usually, ab initio gene prediction and similarity searches are run independently, and the output from the two approaches is integrated manually for gene annotation. Several automated programs have also been developed to combine the two approaches, such as GenomeWise, TwinScan, GenomeScan and EGPred (Issac and Raghava, 2004). The GenomeScan program was developed as an extension of GENSCAN and incorporates similarity to proteins detected by BLASTX. GenomeScan is able to predict coding regions missed by both GENSCAN and BLASTX when each is used alone, leading to a 10% improvement in the accuracy of gene prediction (Mathe et al., 2002).
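The integration step that is usually done by hand can be approximated by checking which predicted exons overlap a similarity hit. The sketch below is a deliberately simple interval comparison with hypothetical function names and coordinates given as half-open (start, end) pairs; it does not reproduce how GenomeScan or the other programs named above combine the two kinds of evidence.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def combine_predictions(ab_initio_exons, similarity_hits):
    """Sort predictions by how well the two lines of evidence agree.

    supported:       ab initio exons that overlap at least one similarity hit
    unsupported:     ab initio exons with no overlapping hit
    similarity_only: hits that no ab initio exon overlaps
    """
    supported, unsupported = [], []
    for exon in ab_initio_exons:
        if any(overlaps(exon, hit) for hit in similarity_hits):
            supported.append(exon)
        else:
            unsupported.append(exon)
    similarity_only = [
        hit for hit in similarity_hits
        if not any(overlaps(hit, exon) for exon in ab_initio_exons)
    ]
    return {
        "supported": supported,
        "unsupported": unsupported,
        "similarity_only": similarity_only,
    }
```

Exons in the supported group are the ones validated by extrinsic evidence, while the similarity_only group points to regions that the ab initio predictor may have missed.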

1.2.4 Comparative genomic approaches

Comparative genomic approaches rely on sequence similarity to predict genes in a new species by comparison with an already sequenced relative. This approach is based on the principle that natural selection causes genes and other functional elements to accumulate mutations at a slower rate than the rest of the genome. This means that the coding regions of genes are more conserved than noncoding regions under evolutionary pressure. Comparison of a few closely related genomes has proved successful for the discovery of protein-coding genes (Kellis et al., 2003). Stark et al. used a comparative analysis of twelve Drosophila genomes to predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes (Stark et al., 2007). Comparative genomic analysis constitutes a powerful approach for the systematic understanding of any