2.2.1 Statement of the Spliced Alignment Problem
Accumulation of the data about expressed genes in the form of protein, mRNA and EST sequences allowed for development of similarity-based approaches to gene recognition. Since formation of mature mRNA in eukaryotes is preceded by splicing of introns, identification of genes by similarity to processed products of gene expression should take into account existence of introns in the genomic sequences: unlike exons, introns cannot be aligned with the processed product. There exist two statements of the spliced alignment problem. The so-called block problem was suggested in [11].
Consider the genomic sequence as a string in the alphabet {A, C, G, T}, and the product of gene expression as a string in the same alphabet (in the case of mRNA or EST) or a different one (in the case of protein). The correspondence between the DNA and product alphabets is defined by a substitution matrix. The given DNA string is supplied with a set of (possibly overlapping) subwords, that is, candidate exons. The goal is to find the set of subwords whose concatenation is most similar to the expression product according to the substitution matrix. In the most general case, candidate exons correspond to any substring between AG and GT dinucleotides. Then the spliced alignment problem can be formulated as the so-called site problem, which is a generalization of the global spliced alignment [26] with introns treated as deletions of a specific type [25].
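The most general set of candidate exons described above can be enumerated directly. The following is only an illustrative sketch: the function name `candidate_exons` and the `min_len` parameter are hypothetical, coordinates are 0-based half-open, and terminal exons (which are not bounded by both dinucleotides) are ignored.

```python
def candidate_exons(dna, min_len=1):
    """Return (start, end) half-open intervals of substrings that are
    immediately preceded by AG and immediately followed by GT."""
    # An exon may start right after an AG dinucleotide (acceptor site).
    starts = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    # An exon may end right before a GT dinucleotide (donor site).
    ends = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if e - s >= min_len]


exons = candidate_exons("CCAGAAAGTCC")
```

The number of candidates grows with the product of the numbers of AG and GT occurrences, which is one reason why practical programs filter candidate sites or exons before alignment.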
The EST_GENOME algorithm [25] allows one to align an EST to a genomic sequence. The algorithm considers two types of introns in the genomic sequence: proper introns bounded by GT–AG dinucleotides (CT–AC if the complementary strand is considered), and splices, that is, deletions in the genomic sequence that do not require fixed dinucleotides at the termini, and have a smaller penalty than introns. Two types of intron-like deletions are needed for additional flexibility that allows the algorithm to recover short exons.
Let $W(m, n)$ be the weight of the alignment of the EST segment $(1, \ldots, m)$ and DNA segment $(1, \ldots, n)$, and let $W_{\mathrm{best}}(m)$ be the weight of the best alignment of the EST segment $(1, \ldots, m)$, with $K(m)$ being the corresponding genomic position. Then
2-4 Handbook of Computational Molecular Biology
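$$
W(m, n) = \max \left\{
\begin{array}{l}
W(m-1, n) - D \\
W(m-1, n-1) + M(m, n) \\
W(m, n-1) - D \\
W_{\mathrm{best}}(m) - \delta
\end{array}
\right.
$$

where $D$ denotes the "standard" deletion penalty in the genomic and EST sequences, $M(m, n)$ is the weight of matching symbols at positions $m$ and $n$, and

$$
\delta = \left\{
\begin{array}{ll}
\delta_{\mathrm{intron}} & \text{if } (K(m), n) \text{ is a pair of splicing sites} \\
\delta_{\mathrm{splice}} & \text{otherwise}
\end{array}
\right.
$$

where $\delta_{\mathrm{intron}}$ is the intron penalty and $\delta_{\mathrm{splice}}$ is the splice penalty; we set

$$
(W_{\mathrm{best}}(m), K(m)) = (W(m, n), n) \quad \text{if } W_{\mathrm{best}}(m) < W(m, n).
$$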
Despite the drawbacks caused by using a fixed intron penalty, the approach of this algorithm became quite popular. The running time of the algorithm is $O(MN)$, where $M$ and $N$ are the lengths of the EST and genomic sequences, respectively.
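The recursion above can be sketched in a few lines. This is a minimal illustration rather than the EST_GENOME implementation: the function names, the boundary conditions (free leading and trailing genomic sequence, standard penalties for unaligned leading EST symbols), the simple match/mismatch scoring, and the requirement that a site pair span at least four nucleotides are all assumptions made here. Penalty values are passed in by the caller and carry no recommendation.

```python
def is_site_pair(dna, k, n):
    """True if the deleted genomic segment dna[k:n] starts with GT and ends with AG."""
    return n - k >= 4 and dna[k:k + 2] == "GT" and dna[n - 2:n] == "AG"


def spliced_align_score(est, dna, match, mismatch, D, intron_pen, splice_pen):
    """Score of the best alignment of the whole EST against the genomic sequence."""
    M, N = len(est), len(dna)
    # W[m][n]: weight of aligning est[:m] with dna[:n]; row 0 is zero
    # (assumption: leading genomic sequence is free).
    W = [[0] * (N + 1) for _ in range(M + 1)]
    for m in range(1, M + 1):
        W[m][0] = -m * D
        w_best, k = W[m][0], 0  # best weight seen so far in row m and its position K(m)
        for n in range(1, N + 1):
            s = match if est[m - 1] == dna[n - 1] else mismatch
            delta = intron_pen if is_site_pair(dna, k, n) else splice_pen
            W[m][n] = max(W[m - 1][n] - D,          # deletion
                          W[m - 1][n - 1] + s,      # match or mismatch
                          W[m][n - 1] - D,          # deletion
                          w_best - delta)           # intron-like deletion from K(m)
            if w_best < W[m][n]:
                w_best, k = W[m][n], n
    return max(W[M])  # assumption: trailing genomic sequence is free


dna = "AAC" + "GTCCCCCCAG" + "TTA"
score = spliced_align_score("AACTTA", dna, match=1, mismatch=-1, D=2,
                            intron_pen=1, splice_pen=3)
```

Because only one pair $(W_{\mathrm{best}}(m), K(m))$ is kept per EST position, each cell is filled in constant time, which gives the $O(MN)$ running time.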
2.2.2 The Use of HMM to Set the Intron Penalty
Probabilistic interpretation of the spliced alignment algorithm created a convenient way to combine statistical gene recognition and sequence alignment. Use of statistical models was necessary when the similarity between the aligned sequences was low and insufficient to define the exon boundaries exactly from alignment alone. Initially the problem was solved by filtering the candidate splicing sites and candidate exons. In the block variant of the spliced alignment problem, implemented in Procrustes [11], a set of candidate exons is filtered at a preliminary statistics-based step [24]. Another possibility is filtering of candidate splicing sites, as in Pro-EST [22] and Pro-Gene [27]. One more algorithm of this family, Pro-Frame, uses the fact that the similarity between the genomic and protein sequences exists at only one side of a true splicing site [23]. Finally, in the probabilistic setting of GeneSeqer [33], the intron probability depends on the probabilities of the corresponding candidate sites determined by a specific statistics-based module, SplicePredictor.
2.2.3 Determination of the Exon-Intron Structure of a Gene by Spliced Alignment with ESTs from Another, Related Gene
GeneSeqer aligns a DNA sequence with ESTs from a related gene, e.g., an orthologous gene from a different species [33]. It is intended for the annotation of plant genomes, where the number of ESTs for a genome might be rather low, as EST and genome sequencing projects do not always cover the same organisms.
The HMM formalism is used for the alignment, and as the similarity between the aligned sequences may be rather low, additional information is needed to precisely map splicing sites. As mentioned above, initially site probabilities were set by a statistics-based model.
In a subsequent study [5] sites were scored using the generalized Bayesian likelihood. Each candidate site is classified into one of the following seven categories: true site in reading frame 0, 1, or 2; false site in reading frame 0, 1, or 2; and false site in a non-coding region.
Spliced Alignment and Similarity-based Gene Recognition 2-5
[Figure 2.2 diagram: states $e_n$, $i_n$, $e_{n+1}$, $i_{n+1}$; arc labels $P_{\Delta G}$, $(1-P_{\Delta G})(1-P_{D_{n+1}})$, $(1-P_{\Delta G})P_{D_{n+1}}$, $P_{A_n}(1-P_{\Delta G})$, $P_{A_n}P_{\Delta G}$, $1-P_{A_n}$]
FIGURE 2.2: State transition diagram for a hidden Markov model implementing spliced alignment of EST and genomic sequence [33]. The transition probabilities $\tau$ are shown above the arrows. Each position $n$ in the genomic sequence is ascribed an exon ($e_n$) or intron ($i_n$) state. Notation: $P_{\Delta G}$ is the deletion probability; $P_{D_n}$ and $P_{A_n}$ are the probabilities that the $n$-th nucleotide in the DNA sequence is, respectively, the first or the last position in an intron.
Consider a genomic sequence $G$ of length $N$ and an EST sequence $C$ of length $M$. Represent the spliced alignment by an HMM with two states, exon $e_n$ and intron $i_n$, where $n = 1, \ldots, N-1$ is a position in the genomic sequence. The transition diagram is shown in Figure 2.2. The transition probabilities at the arcs are denoted by $\tau$; for instance, $\tau_{e_{n-1}, e_n}$ is the probability of remaining in the exon state when moving to the next genomic position. Probabilities of transitions between the exon and intron states are estimated via the candidate site probabilities computed by SplicePredictor. Denote by $n$ and $m$ the current genomic and EST positions, respectively, let $S(m, n)$ be an alignment of the sequences $G_1 G_2 \ldots G_n$ and $C_1 C_2 \ldots C_m$, and let $Z = z_1 z_2 z_3 \ldots z_l$ be the sequence of hidden states, $\max(m, n) \le l \le m + n$. The maximum probability is computed using a standard formula

$$
\begin{aligned}
P &= \max[E_{mn}, I_{mn}] \\
E_{mn} &= \max P(Z = z_1 z_2 \ldots z_l,\ z_l = e_n,\ S(m, n)) \\
I_{mn} &= \max P(Z = z_1 z_2 \ldots z_l,\ z_l = i_n,\ S(m, n)) \\
E_{0n} &= I_{0n} = 1 \\
E_{m0} &= 1, \quad I_{m0} = 0 \\
n &= 0, 1, \ldots, N, \quad m = 0, 1, \ldots, M
\end{aligned}
$$

where $E_{mn}$ and $I_{mn}$ are the probabilities that the sequence of states ends at position $n$ in an exon or in an intron, respectively; the recursions for computing them are given in [33].
Since the transition probabilities depend on the site probabilities, the probability of the intron-type deletion implicitly depends on its sites.
2.2.4 Determination of the Exon-Intron Structure of a Gene by Spliced Alignment with EST Clusters
Spliced alignment with multiple ESTs is useful for determining the complete gene structure, finding alternatively spliced isoforms, and mapping gene termini. In early programs, e.g., Pro-EST [22], this was done by spliced alignment of genomic sequences with pre-computed EST contigs. However, this approach is limited by the assumptions used in the construction of these contigs.
A more robust way to use the EST information is spliced alignment with individual ESTs and simultaneous construction of complete, alternative exon-intron structures. GeneSeqer [5] uses a decision tree to merge fragments of the exon-intron structure from individual spliced alignments. Another program, TAP, aligns ESTs to the genome using the fast empirical algorithm sim4 [10], described in more detail below, and uses the following procedure for construction of alternative exon-intron structures.
ESTs aligned with identity exceeding 92% are assigned to a DNA strand using database annotation and additional verification by analysis of the invariant dinucleotides at the intron termini. 3′-ESTs are used to find the polyadenylation sites: such a site is defined either by a cluster of at least three ESTs or by a polyadenylated EST if the alignment contains a canonical site AATAAA or ATTAAA while the genomic sequence does not contain a polyA run. Pairs of splicing sites corresponding to introns can be in one of three possible relationships: continuous (belonging to one alignment), transitive (belonging to alignments overlapping in an exon), or conflicting. The algorithm constructs a matrix of such relationships between site pairs; an element of this matrix is the number of ESTs confirming the given relationship. This matrix is used to construct the path of the highest total weight, then the next one, and so on. The highest-scoring path corresponds to the most represented isoform.
2.2.5 Clustering of cDNA (mRNA)
Two main approaches for clustering of full-length cDNAs are pairwise comparison of cDNA sequences and comparison with the genomic sequences. The former approach is applied when the genomic sequence is not available. In both cases the dynamic programming algorithm is too slow for mass comparisons: in the former case, too many pairwise comparisons are needed, whereas in the latter case, the genome sequence is too large.
The following filtering procedure was used in [31]. The local filter identifies an exactly coinciding fragment in two cDNAs whose length exceeds a given threshold. The global filter finds an ordered set of coinciding fragments whose total length exceeds a threshold.
The program uses the EST_GENOME algorithm [25] modified so as to allow for pairwise comparison of cDNAs. This is achieved by using zero weights of matched nucleotides, penalties for external and internal deletions, and fixed penalties for deletions longer than 40 nucleotides, the latter corresponding to retained introns and other differences caused by alternative splicing.
The score of the optimal alignment of a cDNA pair assumed to be generated by alternative splicing of one pre-mRNA transcript should be below some fixed threshold. The thresholds for the local and global filters are determined depending on the alignment parameters. The local filter is implemented using a hash table. Construction of the hash table requires time proportional to the database size, whereas the search time for all cDNAs having a common word is proportional to the word length. The global filter uses a modification of the algorithm for construction of the maximal chain of common words in two sequences, whose complexity is $O(MN)$, where $M$ and $N$ are the lengths of the compared sequences [16], [15]. The authors suggested a modification whose run time is $O(N + KM)$, where $M - K$ is the minimum allowed word length.
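The idea of a hash-based local filter can be illustrated as follows. This is a sketch under simplifying assumptions, not the procedure of [31]: it indexes every word of a fixed length `w` (a hypothetical parameter) and reports the pairs of sequences that share at least one exact word, without extending the match to test it against a length threshold.

```python
from collections import defaultdict
from itertools import combinations


def build_word_index(seqs, w):
    """Map every word of length w to the set of sequence ids containing it."""
    index = defaultdict(set)
    for sid, seq in enumerate(seqs):
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].add(sid)
    return index


def candidate_pairs(seqs, w):
    """Pairs of sequence ids that share at least one exact word of length w."""
    pairs = set()
    for ids in build_word_index(seqs, w).values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


cdnas = ["ACGTACGTAA", "TTACGTACGG", "GGGGGGGGGG"]
pairs = candidate_pairs(cdnas, 6)
```

Only the pairs that pass such a filter need to be submitted to the much slower dynamic programming alignment.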
2.2.6 Heuristic Algorithms of EST-DNA Spliced Alignment
The complexity of the spliced alignment algorithms is proportional to the product of the sequence lengths. Such algorithms guarantee finding the optimal alignment, but they are too slow for database search. A family of BLAST-like algorithms was developed for the latter purpose: sim4 [10], Spidey [35], BLAT [17], Squall [28]. Such algorithms start with a sensitive database similarity search aimed at decreasing the number of sequences requiring exact alignment. The database can be a genomic sequence and the query an EST or a protein, or, vice versa, the database can be a set of ESTs and the query a fragment of the genomic sequence. Current fast algorithms can align all human ESTs ($3.73 \times 10^6$ fragments of total length $1.75 \times 10^9$) against the human genome ($2.88 \times 10^9$ nucleotides). Heuristic spliced alignment algorithms do not guarantee finding the optimal alignment, but they are sufficiently specific and sensitive. The reason is that very similar sequences are aligned (more than 90% identity). If the similarity level is lower, the quality of predictions drops dramatically.
sim4 [10] aligns an EST to DNA using the following strategy. Pairs of segments with maximum similarity are determined using a BLAST-like procedure: coinciding words of length 12 are found and then extended to form local similarity segments. A set of aligned segment pairs that could represent a gene is formed. The start and end positions of these segments should form increasing sequences in the EST and genomic DNA, and the offsets of the diagonals representing the segments in the alignment matrix should either almost coincide or differ sufficiently to accommodate an intron. To determine the exon boundaries, pairs from almost coinciding diagonals are merged and their projections to the genomic sequence form exon cores. If projections of the exon cores to the EST sequence overlap, the common part of the cores is cut so as to form an intron with canonical GT–AG dinucleotides at the intron boundaries (or CT–AC if the EST is complementary to the gene strand). If the exon cores do not overlap, they are extended until the EST projections of the corresponding diagonals overlap. The intersection point is adjusted so as to define an intron with the canonical dinucleotides. If this procedure fails, a search for shorter matching segments is performed in the area between the cores.
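The step of cutting the overlap of two exon cores at a canonical intron can be sketched as follows. This is a simplified illustration, not the sim4 code: the function name and its parameters are hypothetical, each core is assumed to lie on a single diagonal (genomic position = EST position + offset), coordinates are 0-based, and only the GT–AG case is handled.

```python
def choose_splice_point(dna, d1, d2, lo, hi):
    """Find an EST cut position c in [lo, hi] such that EST symbols before c
    come from the upstream core (diagonal offset d1), symbols from c on come
    from the downstream core (offset d2 > d1), and the implied intron
    dna[c + d1 : c + d2] starts with GT and ends with AG.
    Returns (c, (intron_start, intron_end)) or None if no such cut exists."""
    for c in range(lo, hi + 1):
        start, end = c + d1, c + d2
        if end - start >= 4 and dna[start:start + 2] == "GT" and dna[end - 2:end] == "AG":
            return c, (start, end)
    return None


dna = "AAC" + "GTCCCCCCAG" + "TTA"
# EST "AACTTA": the first core lies on diagonal 0, the second on diagonal 10;
# assume the EST projections of the two cores overlap at positions 2..4.
cut = choose_splice_point(dna, d1=0, d2=10, lo=2, hi=4)
```

When no cut with canonical dinucleotides exists in the overlap, the function returns `None`, which corresponds to the case where a further search between the cores would be needed.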
BLAT [17] is intended for identification and fast alignment of very similar sequences, in particular, human ESTs with the human genome, and the human and mouse genomes.
Again, a search for highly similar fragments is performed first. Several variants of local similarity regions are defined: an exact match of length exceeding the threshold, an inexact match with at most one mismatching position, and a chain of shorter exact matches within a given interval off one diagonal in the alignment matrix. The parameters are selected by considering the probability of a match in two sequences of the given identity so as to maximize the specificity at a fixed sensitivity level of 99%. The alignment procedure differs for EST-genome and protein-genome comparisons.
To construct the EST-genome alignments, the obtained alignment segments are extended as in sim4; to fill the remaining gaps, the procedure of the previous paragraph is applied iteratively with more liberal thresholds. The exon boundaries are selected so as to satisfy the GT–AG rule whenever possible. To construct a protein-DNA alignment, the initial fragments are extended without deletions, and then an oriented graph is constructed, whose vertices are alignment fragments, and the arcs connect vertices if the end of the fragment corresponding to the out-vertex is upstream of the start of the fragment corresponding to the in-vertex in both sequences.
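The oriented graph of alignment fragments can be illustrated with a short sketch. The text above only states how the graph is built; selecting the highest-scoring path through it, as done here, is an assumption made for illustration, as are the function name, the tuple layout, and the fragment scores. Intervals are half-open and fragments are assumed to be non-empty.

```python
def best_chain(frags):
    """frags: list of (q_start, q_end, t_start, t_end, score).
    An arc u -> v exists if u ends at or before the start of v in both the
    query (q) and the target (t). Returns (total score, fragment indices)."""
    # Sorting by query start gives a topological order: any arc u -> v
    # implies that u starts before v in the query.
    order = sorted(range(len(frags)), key=lambda i: frags[i][0])
    best, prev = {}, {}
    for pos, j in enumerate(order):
        q_start, _, t_start, _, score = frags[j]
        best[j], prev[j] = score, None
        for i in order[:pos]:
            _, q_end_i, _, t_end_i, _ = frags[i]
            if q_end_i <= q_start and t_end_i <= t_start and best[i] + score > best[j]:
                best[j], prev[j] = best[i] + score, i
    if not best:
        return 0, []
    node = max(best, key=best.get)
    total, chain = best[node], []
    while node is not None:  # trace the path back to its first fragment
        chain.append(node)
        node = prev[node]
    return total, chain[::-1]


fragments = [(0, 10, 100, 130, 10), (10, 20, 500, 530, 8),
             (12, 25, 90, 120, 20), (20, 30, 600, 630, 5)]
total, chain = best_chain(fragments)
```

In this toy input the fragment with the highest individual score is not part of the best chain, because it is not collinear with the others in the target sequence.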
Squall [28] is also intended for the alignment of ESTs with complete genomes. As in the previous two programs, a fast search using a hash table of the large genome sequence is used to identify initial exact alignments. They are then extended, and the remaining gaps are filled using the dynamic programming algorithm [12] if the lengths of the unaligned regions in the EST and the genomic sequence are similar. Otherwise it is assumed that the genomic sequence contains an intron: the hanging EST ends are aligned to DNA fragments of the same length at both termini of the unaligned region, and the intron position is selected so as to satisfy the GT–AG rule whenever possible.
It should be noted that all algorithms of this type have a number of common problems, the most important of which is the possibility of missing short exons.