
Generally, one can distinguish between global and local alignment procedures. A global alignment always covers the entire length of both input sequences, no matter how different they may be, and the algorithm determines the alignment that maximises the alignment score over the full length of both sequences. Global alignments are a reasonable approach for sequences that are related over their entire lengths. Local alignments, on the other hand, contain only contiguous parts of the sequences that are "similar". In such cases, the algorithm computes the optimal local alignment by finding the pair of substrings of the full sequences whose alignment yields the highest alignment score among the set of all substrings and their possible alignments.

Global Alignment (Needleman-Wunsch)

The most widely used dynamic programming algorithm for global sequence alignment is the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970). The idea behind all the versions is to build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences.

Pairwise sequence alignment 109

The maximum match can be determined by representing the two sequences A and B, of lengths m and n respectively, in a matrix indexed by i and j, one index for each sequence. The score value γ(i, j) assigned to each cell is built up by the following recurrence:

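$$
\gamma(i,j) =
\begin{cases}
0 & \text{if } i = 0 \text{ and } j = 0 \\
gap(i) & \text{if } i > 0 \text{ and } j = 0 \\
gap(j) & \text{if } i = 0 \text{ and } j > 0 \\
\max
\begin{cases}
\gamma(i-1,\,j-1) + \delta(a_i, b_j) \\
\max\limits_{1 \le k \le i} \bigl( \gamma(i-k,\,j) + gap(k) \bigr) \\
\max\limits_{1 \le k \le j} \bigl( \gamma(i,\,j-k) + gap(k) \bigr)
\end{cases}
& \text{otherwise}
\end{cases}
\qquad [5.30]
$$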

In this formulation, δ(a_i, b_j) is the score given for matching the i-th symbol in string A to the j-th symbol in string B. The penalty for a gap of size k is defined by gap(k). In its simplest form, δ(a, b) can be defined as 1 for a match and 0 for a mismatch. In the original paper, the symbols (amino acids) were numbered from the N-terminal end, although the direction makes no difference to the final result, so the table was filled in starting at the end of the sequences at position (0,0).

Every possible comparison is represented by a pathway through the array. An i or j can occur only once in a pathway because a particular symbol cannot occupy more than one position at one time. Furthermore, if cell (i', j') lies on the same pathway as cell (i, j), the only permissible relationships of their indices are i' > i, j' > j or i' < i, j' < j. Any other relationship represents a permutation of one or both amino acid sequences, which cannot be allowed since it destroys the significance of a sequence.

A pathway is signified by a line connecting cells in the array. Proceeding along complete diagonals with no deviations would imply an alignment without any gaps. The introduction of a gap (either by an insertion or a deletion) in either sequence would correspond to moving either above or below the main diagonal (Fig. 5.16).

110 Time Series Similarity

Fig. 5.16 The next three possible steps from the element (i, j), their representation in the alignment matrix and the corresponding alignment.

To find the best route, Needleman and Wunsch suggested modifying the matrix to represent this idea of tracing different pathways through the matrix. From all the possible pathways only the one which is best (in terms of maximising a score) can be chosen. Their method consists of two passes through the matrix. The first pass traces a score for all possible routes and moves right to left, bottom to top. Once the scores for all possible routes are found, the maximum can be chosen (it will be somewhere on the topmost row or leftmost column) and a second pass can be carried out, this time running left to right, top to bottom to find the alignment that gives the maximum score.

The reason that the algorithm works is that the score is made up of a sum of independent pieces, so the best score up to a point in the alignment is the best score up to the point one step before, plus the incremental score of the new step.
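The table filling and trace-back described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the original two-pass procedure: it assumes a linear gap penalty, gap(k) = k · gap, so that only single-step moves need to be compared, it fills the table forward from (0,0), and it traces back from the bottom-right cell. The function name and the default scores are illustrative choices.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment sketch with a linear gap penalty (assumption)."""
    m, n = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + delta,  # (mis)match
                              score[i - 1][j] + gap,        # gap in b
                              score[i][j - 1] + gap)        # gap in a

    # Trace back from (m, n) to (0, 0), preferring diagonal moves.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            delta = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + delta:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row_a, row_b = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(row_a)
    print(row_b)
```

Because the score is a sum of independent per-column contributions, each cell only needs its three neighbours, which is the property the paragraph above relies on.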

Local alignment

It is often the case that one or more regions of high similarity will exist in two sequences that are otherwise dissimilar. Then, short and highly similar subsequences may be missed in the global alignment because they are outweighed by the rest of the sequence. Hence, one would aim to create a locally optimal alignment.

A small modification of the original Needleman-Wunsch algorithm that allows the optimal local alignment to be determined was introduced by Smith and Waterman (1981). The key difference is the introduction of zero as a new option in the recursion relation (eq. [5.31]), which has the effect of terminating any path in the alignment matrix whose score drops below zero. The algorithm requires that the scoring function be negatively biased so that regions of low similarity have negative scores.

This is required in order to cause the score to drop as more and more mismatches are added. Hence, the score will rise in a region of high similarity and then fall outside of this region. If there are two segments of high similarity then these must be close enough to allow a path between them to be linked by a gap or they will be left as independent segments of local similarity. After optimal alignment scores have been calculated for all nodes in the usual recursive way, the optimal local alignment is found by locating the node with the largest alignment score and performing a trace-back starting from this node, until a node is encountered in which the alignment score is equal to zero.

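$$
\gamma(i,j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
\max
\begin{cases}
0 \\
\gamma(i-1,\,j) + \delta(a_i, -) \\
\gamma(i,\,j-1) + \delta(-, b_j) \\
\gamma(i-1,\,j-1) + \delta(a_i, b_j)
\end{cases}
& \text{otherwise}
\end{cases}
\qquad [5.31]
$$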

In eq. [5.31] vertical movements are penalised with the insertion function δ(a_i, -), while the horizontal movements are penalised with the deletion function δ(-, b_j). Usually these functions correspond to negative scores.
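The local recurrence and its trace-back can be sketched in Python as follows. This is an illustrative sketch rather than the published implementation: it assumes that the insertion and deletion functions are a single constant gap score, and the function name and default scores (negatively biased, as required above) are assumptions made for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment sketch; insertion and deletion both cost `gap`."""
    m, n = len(a), len(b)
    # Row 0 and column 0 stay at zero, as in the boundary case.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                            # restart the path
                              score[i - 1][j] + gap,        # vertical move
                              score[i][j - 1] + gap,        # horizontal move
                              score[i - 1][j - 1] + delta)  # diagonal move
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        delta = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + delta:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("xxxACGTyyy", "zzACGTww"))
```

In the usage example the flanking characters never match, so only the shared region contributes to the reported alignment.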

To illustrate the difference between global and local alignments, Fig. 5.17 shows two alignments of the same DNA sequences. The first shows a weak global alignment while the second shows a stronger local alignment.


Fig. 5.17 Global vs. Local Alignment.

Further research has been carried out on gap costs that allow block insertions and deletions (Gotoh, 1982; Sankoff and Kruskal, 1983). Gotoh (1982) devised a three-state model of mutation with affine gap costs, which improves the time efficiency of scoring gaps in sequence alignment. His work has been used in other research to treat gaps in alignments (Allison, 1993).
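A three-state recurrence with affine gap costs can be sketched in Python as shown here. This is a common textbook formulation given for illustration, not a transcription of Gotoh's paper: it assumes gap(k) = gap_open + (k - 1) · gap_extend, computes the global score only, and uses hypothetical names (`gotoh_score`, `M`, `X`, `Y`) and default scores.

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score under affine gap costs (three-state sketch)."""
    m, n = len(a), len(b)
    # M: alignment ends with a[i-1] aligned to b[j-1]
    # X: alignment ends with a[i-1] aligned to a gap
    # Y: alignment ends with a gap aligned to b[j-1]
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, n + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            delta = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1],
                          X[i - 1][j - 1],
                          Y[i - 1][j - 1]) + delta
            # Opening a gap costs gap_open; continuing one costs gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[m][n], X[m][n], Y[m][n])


if __name__ == "__main__":
    # One block gap of length 2 is cheaper than two separate gaps.
    print(gotoh_score("ACGGT", "ACT"))
```

Keeping one table per state means each cell is computed from a constant number of neighbours, instead of scanning every possible gap length k.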

The k-best variation of the Smith-Waterman algorithm (Waterman and Eggert, 1987) returns non-overlapping local alignments that score at or above a preset level. This is particularly useful if two sequences share multiple regions of similarity interrupted by dissimilar regions and with the order of the similar regions rearranged.

Heuristic methods

Motivated by the problem of finding sequences in large databases, the heuristic similarity search algorithms FASTA (Lipman and Pearson, 1985) and BLAST (Altschul et al., 1990) were created.

FASTA considers exact matches between short substrings of length k. If a significant number of such exact matches is found, FASTA uses the dynamic programming algorithm to compute optimal alignments. This approach allows precision to be traded for speed. The larger the parameter k, the smaller the number of exact matches. This makes the program faster but less precise, as it becomes less likely that the optimal alignment contains enough exact matches of length k, and the procedure may find nothing. Nevertheless, experience shows that with sensibly chosen parameters, FASTA misses very few cases of significant homology.
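The seeding idea, finding exact matches of length k and checking whether many of them fall on the same diagonal, can be illustrated with a short Python sketch. This is not the FASTA implementation; the function name `kmer_hits_by_diagonal` and the grouping by diagonal offset i - j are assumptions made for the example.

```python
from collections import Counter, defaultdict


def kmer_hits_by_diagonal(query, target, k):
    """Count exact k-length matches per diagonal (offset i - j)."""
    # Index every k-length substring of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Look up each k-length substring of the target in the index.
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals


if __name__ == "__main__":
    for k in (4, 6):
        hits = kmer_hits_by_diagonal("ACGTACGT", "TTACGTAC", k)
        print(k, dict(hits))
```

Running the example with k = 4 and then k = 6 shows the trade-off described above: the larger k yields fewer exact matches to examine.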

BLAST is another heuristic method based on a similar idea. BLAST focuses on ungapped alignments of (again) a certain fixed length k. Rather than requiring exact matches, it uses a scoring function that measures similarity rather than distance. In particular, for proteins, one can argue that segment pairs with no gaps and high similarity scores indicate regions of functional similarity. For a given threshold parameter S, BLAST reports to the user all database entries which have a segment pair with the query sequence that scores higher than S. If the scoring function used has a probabilistic interpretation, BLAST can also give an assessment of the statistical significance of the matches it reports.

Another heuristic method for sequence alignment was presented by Rognes and Seeberg (1998). The algorithm, called SALSA, has many similarities to FASTA and BLAST, but includes a post-processing stage that increases sensitivity. The idea is to build an alignment from all the fragments found in the initial stages of the search. The fragments are then joined by a gap placed so as to obtain the optimal score. The position of the gap is found using dynamic programming, and only if the score of the partial alignment on either side of the gap is higher than the gap penalty.

Subsequently, Rognes (2001) introduced an improved method. This time the algorithm exploits the parallel processing capability of the microprocessor to perform the same operation in parallel on several independent data sources. Like most heuristic algorithms, the method can be divided into two phases. First, it computes the exact optimal ungapped alignment score of each diagonal. Second, a novel heuristic search estimates a gapped alignment score, taking into account the amount of sequence similarity on several diagonals. The highest-scoring 1% of the database sequences is finally subjected to a rigorous Smith-Waterman alignment.
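The first phase, scoring each diagonal without gaps, can be illustrated with a small sequential Python sketch. It does not reproduce the parallel implementation of Rognes (2001); it only shows what an optimal ungapped score per diagonal means, taken here as the best-scoring contiguous run along the diagonal. The function name and the match/mismatch scores are assumptions.

```python
def best_ungapped_score_per_diagonal(a, b, match=1, mismatch=-1):
    """Return {diagonal offset i - j: best ungapped segment score}."""
    scores = {}
    for d in range(-(len(b) - 1), len(a)):
        # Start at the first cell of diagonal d inside the matrix.
        i, j = max(d, 0), max(-d, 0)
        best = running = 0
        while i < len(a) and j < len(b):
            delta = match if a[i] == b[j] else mismatch
            # Drop the running segment once its score falls below zero.
            running = max(0, running + delta)
            best = max(best, running)
            i += 1
            j += 1
        scores[d] = best
    return scores


if __name__ == "__main__":
    scores = best_ungapped_score_per_diagonal("xxACGT", "ACGTyy")
    top = max(scores, key=scores.get)
    print(top, scores[top])
```

A second phase could then combine the scores of neighbouring diagonals to estimate a gapped score, which this sketch leaves out.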