7.1.2 Multiple sequence alignment
7.2 Let us Sum up
7.3 Lesson end activities
7.4 Check your progress
7.5 Points for Discussion
7.6 References
7.0 Aims and Objectives
This unit discusses the methods of sequence alignment: alignment of pairs of sequences and multiple sequence alignment.
7.1 Overview Of Methods Of Sequence Alignment
7.1.1 Alignment of Pairs of Sequences
Alignment of two sequences is performed using the following methods:
1. Dot matrix analysis
2. The dynamic programming (or DP) algorithm
Unless the sequences are known to be very much alike, the dot matrix method should be used first, because it displays any possible sequence alignments as diagonals on the matrix. Dot matrix analysis can readily reveal the presence of insertions and deletions, as well as direct and inverted repeats, which are more difficult to find by the other, more automated methods. The major limitation of the method is that most dot matrix computer programs do not show an actual alignment.
The dynamic programming method, first used for global alignment of sequences by Needleman and Wunsch (1970) and for local alignment by Smith and Waterman (1981a), provides one or more alignments of the sequences. An alignment is generated by starting at the ends of the two sequences, attempting to match all possible pairs of characters between the sequences, and following a scoring scheme for matches, mismatches, and gaps. This procedure generates a matrix of numbers that represents all possible alignments between the sequences. The highest-scoring set of sequential scores in the matrix defines an optimal alignment. For proteins, an amino acid substitution matrix, such as the Dayhoff percent accepted mutation matrix 250 (PAM250) or the BLOSUM62 substitution matrix, is used to score matches and mismatches. Similar matrices are available for aligning DNA sequences.
The dynamic programming method is guaranteed in a mathematical sense to provide the optimal (highest-scoring) alignment for a given set of user-defined variables, including the choice of scoring matrix and gap penalties. Experience with the method has provided much guidance for making the best choices, and dynamic programming has become widely used.
The dynamic programming method can be slow, because the number of computational steps increases approximately as the square or cube of the sequence lengths. The computer memory requirement also increases as the square of the sequence lengths, so it is difficult to use the method for very long sequences. Computer scientists have greatly reduced the time and space requirements to near-linear relationships without compromising the reliability of the method, and these improvements are widely used in the available dynamic programming applications for sequence alignment.
Other shortcuts have been developed to speed up the early phases of finding an alignment. The word or k-tuple methods are used by the FASTA and BLAST algorithms. They align two sequences very quickly by first searching for identical short stretches of sequence (called words or k-tuples) and then joining these words into an alignment by the dynamic programming method. These methods are fast enough to search an entire database for the sequences that align best with an input test sequence. The FASTA and BLAST methods are heuristic; that is, they follow an empirical approach to computer programming in which rules of thumb are used to find solutions and feedback is used to improve performance. These methods are nevertheless reliable in a statistical sense and usually provide a reliable alignment.
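The word search that the k-tuple methods start from can be illustrated with a short Python sketch. This is not the FASTA or BLAST implementation; the function names word_hits and hits_by_diagonal and the word length k = 2 are assumptions chosen for illustration. Each hit (i, j) corresponds to a dot in a dot matrix built with word size k, and hits that share a diagonal (the same value of i - j) are candidates for joining into an alignment.

```python
from collections import defaultdict


def word_hits(seq1, seq2, k=2):
    """Return 0-based (i, j) pairs where the k-letter word at seq1[i:]
    is identical to the k-letter word at seq2[j:]."""
    # Index every word of seq2 by its start position.
    positions = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        positions[seq2[j:j + k]].append(j)

    # Look up every word of seq1 in that index.
    hits = []
    for i in range(len(seq1) - k + 1):
        for j in positions.get(seq1[i:i + k], []):
            hits.append((i, j))
    return hits


def hits_by_diagonal(hits):
    """Group hits by diagonal number i - j."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[i - j].append((i, j))
    return dict(diagonals)


hits = word_hits("GAATTCAGTTA", "GGATCGA", k=2)
print(hits)
print(hits_by_diagonal(hits))
```

Indexing the words of one sequence first means each word of the other sequence is looked up directly rather than compared against every position, which is what makes this kind of search fast.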
Dynamic Programming
The following is an example of global sequence alignment using the Needleman-Wunsch algorithm. For this example, the two sequences to be globally aligned are
GAATTCAGTTA (sequence #1)
GGATCGA (sequence #2)
So M = 11 and N = 7 (the lengths of sequence #1 and sequence #2, respectively). A simple scoring scheme is assumed, where
· $S_{i,j} = 1$ if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2 (match score)
· $S_{i,j} = 0$ otherwise (mismatch score)
· $w = 0$ (gap penalty)
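The scoring scheme above could be written in Python roughly as follows. The names score, S and GAP are assumptions made for this sketch. For proteins, the score function would instead look up the pair of residues in a substitution matrix such as PAM250 or BLOSUM62.

```python
def score(a, b, match=1, mismatch=0):
    """Match/mismatch score for two residues."""
    return match if a == b else mismatch


seq1 = "GAATTCAGTTA"  # sequence #1, length M = 11
seq2 = "GGATCGA"      # sequence #2, length N = 7
GAP = 0               # gap penalty w

# S[j - 1][i - 1] holds the score for position i of sequence #1
# against position j of sequence #2 (Python lists are 0-based).
S = [[score(a, b) for a in seq1] for b in seq2]

for residue, row in zip(seq2, S):
    print(residue, row)
```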
Three steps in dynamic programming
1. Initialization
2. Matrix fill (scoring)
3. Traceback (alignment)
Initialization Step
The first step in the global alignment dynamic programming approach is to create a matrix with M + 1 columns and N + 1 rows where M and N correspond to the size of the sequences to be aligned.
Since this example assumes there is no gap opening or gap extension penalty, the first row and first column of the matrix can be initially filled with 0.
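A minimal sketch of the initialization step in Python is shown below. The function name init_matrix and the list-of-rows layout are assumptions made for illustration. The handling of a nonzero gap value (first row and column filled with multiples of the gap) goes beyond this example, where w = 0, and assumes a single linear gap value that is added to the score, so a penalty would be passed as a negative number.

```python
def init_matrix(seq1, seq2, gap=0):
    """Create a table with len(seq2) + 1 rows and len(seq1) + 1 columns
    and fill in the first row and first column."""
    m, n = len(seq1), len(seq2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(m + 1):
        table[0][i] = i * gap  # first row: i leading gaps
    for j in range(n + 1):
        table[j][0] = j * gap  # first column: j leading gaps
    return table


table = init_matrix("GAATTCAGTTA", "GGATCGA")
print(len(table), "rows x", len(table[0]), "columns")
```

With gap = 0, as in this example, every entry of the first row and first column is 0.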
Matrix Fill Step
One possible (inefficient) solution of the matrix fill step finds the maximum global alignment score by starting in the upper left-hand corner of the matrix and finding the maximal score $M_{i,j}$ for each position in the matrix. To find $M_{i,j}$ for any $i,j$, it is sufficient to know the scores for the matrix positions to the left of, above, and diagonal to $i,j$. In terms of matrix positions, it is necessary to know $M_{i-1,j}$, $M_{i,j-1}$ and $M_{i-1,j-1}$.
For each position, $M_{i,j}$ is defined to be the maximum score at position $i,j$; i.e.
$$
M_{i,j} = \max \left\{
\begin{array}{ll}
M_{i-1,j-1} + S_{i,j} & \text{(match/mismatch in the diagonal)} \\
M_{i,j-1} + w & \text{(gap in sequence \#1)} \\
M_{i-1,j} + w & \text{(gap in sequence \#2)}
\end{array}
\right.
$$
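The recurrence can be turned into a short Python sketch of the matrix fill step. The function name fill_matrix and its parameters are assumptions made for illustration. The table is stored as a list of rows, so the entry written M(i,j) in the text (column i, row j) is table[j][i] in the code.

```python
def fill_matrix(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill the global alignment score table.

    Columns follow seq1 (index i), rows follow seq2 (index j).
    """
    m, n = len(seq1), len(seq2)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, m + 1):
        table[0][i] = i * gap
    for j in range(1, n + 1):
        table[j][0] = j * gap

    for j in range(1, n + 1):
        for i in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            table[j][i] = max(
                table[j - 1][i - 1] + s,  # diagonal: match/mismatch
                table[j - 1][i] + gap,    # from above: gap in sequence #1
                table[j][i - 1] + gap,    # from the left: gap in sequence #2
            )
    return table


table = fill_matrix("GAATTCAGTTA", "GGATCGA")
for row in table:
    print(row)
print("global alignment score:", table[-1][-1])
```

The bottom right entry of the table holds the maximum global alignment score for the two sequences under the chosen scoring scheme.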
Note that in the example, $M_{i-1,j-1}$ will be red, $M_{i,j-1}$ will be green and $M_{i-1,j}$ will be blue.
Using this information, the score at position 1,1 in the matrix can be calculated. Since the first residue in both sequences is a G, $S_{1,1} = 1$, and by the assumptions stated at the beginning, $w = 0$.
Thus, $M_{1,1} = \max[M_{0,0} + 1,\ M_{1,0} + 0,\ M_{0,1} + 0] = \max[1, 0, 0] = 1$.
Since the gap penalty (w) is 0, the rest of row 1 and column 1 can be filled in with the value 1. Take the example of row 1. At column 2, the value is the max of 0 (for a mismatch), 0 (for a vertical gap) or 1 (horizontal gap). The rest of row 1 can be filled out similarly until we get to column 8. At this point, there is a G in both sequences (light blue). Thus, the value for the cell at row 1 column 8 is the maximum of 1 (for a match), 0 (for a vertical gap) or 1 (horizontal gap). The value will again be 1. The rest of row 1 and column 1 can be filled with 1 using the above reasoning.
Now look at column 2. The location at row 2 will be assigned the maximum of 1 (mismatch), 1 (horizontal gap) or 1 (vertical gap), so its value is 1.
At the position column 2 row 3, there is an A in both sequences. Thus, its value will be the maximum of 2 (match), 1 (horizontal gap) or 1 (vertical gap), so its value is 2.
Moving along to position column 2 row 4, its value will be the maximum of 1 (mismatch), 1 (horizontal gap) or 2 (vertical gap), so its value is 2. Note that for all of the remaining positions in column 2 except the last one, the choices for the value will be exactly the same as in row 4, since there are no matches. The final row will contain the value 2, since it is the maximum of 2 (match), 1 (horizontal gap) and 2 (vertical gap).
Using the same techniques as described for column 2, we can fill in column 3.
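The column-by-column procedure followed in this walkthrough can be checked with a small Python sketch that computes each column from the one to its left. The function name next_column and the variable names are assumptions made for illustration. "Horizontal gap" is the value taken from the left and "vertical gap" is the value taken from above, as in the text.

```python
def next_column(prev_col, residue, seq2, match=1, mismatch=0, gap=0):
    """Compute one matrix column from the column to its left.

    prev_col[j] is the score in row j of the previous column;
    residue is the sequence #1 character for the new column.
    """
    col = [prev_col[0] + gap]  # row 0 of the new column
    for j in range(1, len(seq2) + 1):
        s = match if residue == seq2[j - 1] else mismatch
        col.append(max(
            prev_col[j - 1] + s,  # diagonal: match/mismatch
            prev_col[j] + gap,    # horizontal gap (from the left)
            col[j - 1] + gap,     # vertical gap (from above)
        ))
    return col


seq1, seq2 = "GAATTCAGTTA", "GGATCGA"
col = [0] * (len(seq2) + 1)  # column 0 after initialization
columns = [col]
for residue in seq1:
    col = next_column(col, residue, seq2)
    columns.append(col)

for number in (1, 2, 3):
    print("column", number, columns[number])
```

For columns 1 and 2 the printed values can be compared with the values derived by hand above; column 3 is obtained in the same way.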
Check your progress: