N-dimensional Dynamic Programming
To evaluate the relative goodness of each alignment $A$, we have introduced an objective function $H_N(A)$. The optimal alignment $A^*$ is then defined as the one that has the best score among all possible alternatives, i.e., $A^* = \arg\max_{A} H_N(A)$. Given $N$ sequences of geometric average length $\bar{L}$, a rough estimate of the number of possible distinct alignments among these sequences is $(N\bar{L})!/(\bar{L}!)^N \approx N^{N\bar{L}}$, which is tremendous even for a small $N$ and a moderate $\bar{L}$, say $\bar{L} \approx 100$.
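To get a feeling for the size of this estimate, the following minimal sketch evaluates $(N\bar{L})!/(\bar{L}!)^N$ exactly for a few small cases, assuming for simplicity that all sequences have the same length. The function name `alignment_count_estimate` is ours and is used only for illustration.

```python
from math import factorial

def alignment_count_estimate(n_seqs: int, length: int) -> int:
    """Rough count (N*L)! / (L!)^N of distinct alignments of N sequences of length L."""
    return factorial(n_seqs * length) // factorial(length) ** n_seqs

# Report only the order of magnitude, since the exact integers are very long.
for n_seqs, length in [(2, 10), (3, 10), (3, 100)]:
    count = alignment_count_estimate(n_seqs, length)
    print(f"N={n_seqs}, L={length}: about 10^{len(str(count)) - 1} alignments")
```

Python integers have arbitrary precision, so the factorials are computed exactly and only the number of digits is printed.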
For PSA, the computational explosion can be avoided by using the well-known dynamic programming (DP) algorithm, which rigorously and efficiently finds the alignment(s) with the optimal score among all possible alternatives by fulfilling the following recursion for $1 \le i \le I$ and $1 \le j \le J$ [120], where $I = |\bar{s}_1|$ and $J = |\bar{s}_2|$:
$$H_{i,j} = \max\{H^1_{i,j}, H^2_{i,j}, H^3_{i,j}\} \tag{3.8}$$
$$H^1_{i,j} = H_{i-1,j-1} + S_2(s_{1i}, s_{2j}) \tag{3.9}$$
$$H^2_{i,j} = \max_{1 \le k \le i}\{H_{i-k,j} - g(k)\} \tag{3.10}$$
$$H^3_{i,j} = \max_{1 \le k \le j}\{H_{i,j-k} - g(k)\}. \tag{3.11}$$
It is natural to extend the above algorithm to $N$ dimensions. Let $x = \{x_n\}$ ($1 \le n \le N$, $0 \le x_n \le |\bar{s}_n|$) be a coordinate indicating a node in the DP graph, and $b = \{b_n\}$ ($b_n \in \{0, 1\}$) be a bit vector indicating an edge that joins adjacent nodes. For the time being, we use a proportional gap penalty for simplicity. Then, the DP recursive relation is written as:
$$H_N(x) = \max_{b}\{H_N(x - b) + S_N(a)\}, \tag{3.12}$$
where $a = \{a_n\}$, $a_n = s_{x_n}$ (the $x_n$th residue of $\bar{s}_n$) if $b_n = 1$, $a_n =$ '-' if $b_n = 0$, and the elements of $b$ take all possible combinations of 0 and 1 except for $b = (0, 0, \ldots, 0)^T$. Most objective functions discussed in the previous section take $O(N)$ computational steps to evaluate a score for each column. There are $\prod_{1 \le n \le N}(|\bar{s}_n| + 1) \approx \bar{L}^N$ nodes in the DP graph, and we must examine $2^N - 1$ configurations (partial alignments) at each node to find the optimal path through the graph. Thus, the overall computation takes $O(N(2\bar{L})^N)$ steps using $O(\bar{L}^N)$ memory. In fact, it is known that MSA problems with reasonable objective functions are all NP-hard [118, 119, 51, 49]. If we use an affine or a more general gap penalty function, computational complexity is further increased. For the simplest case of an SP scoring system with $N = 3$ and an affine gap penalty function, we must consider 13 different configurations and seven types of state transitions, as shown in Tables 3.3 (a) and (b). Table 3.3 (c) shows the state transitions together with the number of gaps that open in association with the transition.
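The recursion (3.12) can be sketched directly in code. The example below is a minimal illustration rather than a practical implementation: it assumes a sum-of-pairs column score with a proportional gap penalty, the scoring values (`match`, `mismatch`, `gap`) are arbitrary placeholders, and the names `sp_column_score` and `nd_dp_score` are ours. It fills the whole lattice, so it is usable only for very short sequences.

```python
from itertools import combinations, product

GAP = "-"

def sp_column_score(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one alignment column (proportional gap penalty)."""
    score = 0
    for a, b in combinations(column, 2):
        if a == GAP and b == GAP:
            continue  # a pair of nulls contributes nothing
        if a == GAP or b == GAP:
            score += gap
        else:
            score += match if a == b else mismatch
    return score

def nd_dp_score(seqs):
    """Optimal SP score obtained by filling the full N-dimensional DP lattice."""
    n = len(seqs)
    # All bit vectors b except (0, ..., 0): 2^N - 1 edges per node.
    moves = [b for b in product((0, 1), repeat=n) if any(b)]
    H = {}
    # product() yields nodes in lexicographic order, so every
    # predecessor x - b has been filled before x is visited.
    for x in product(*(range(len(s) + 1) for s in seqs)):
        if not any(x):
            H[x] = 0  # start node
            continue
        best = None
        for b in moves:
            prev = tuple(xi - bi for xi, bi in zip(x, b))
            if min(prev) < 0:
                continue  # edge would leave the lattice
            column = [seqs[k][x[k] - 1] if b[k] else GAP for k in range(n)]
            cand = H[prev] + sp_column_score(column)
            if best is None or cand > best:
                best = cand
        H[x] = best
    return H[tuple(len(s) for s in seqs)]

print(nd_dp_score(["ACG", "AG", "ACG"]))
```

The dictionary `H` holds one entry per lattice node, which makes the $O(\bar{L}^N)$ memory requirement explicit, and the inner loop over `moves` is the $2^N - 1$ factor in the running time.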
For a large $N$, we would have to consider $\sim [N/(e \ln 2)]^N \sqrt{N}$ configurations and $2^N - 1$ types of transition [2]. Without the additional speed-up techniques discussed in the following sub-subsections, $N = 3$ is the upper limit of applicability of straightforward DP algorithms.
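The growth of these two counts can be tabulated with a short sketch. It rests on an assumption that is not stated in the text: that the number of configurations equals the ordered Bell (Fubini) number, i.e., the number of ways to rank $N$ items with ties allowed. This assumption reproduces the 13 configurations for $N = 3$ and has the asymptotic growth quoted above up to a constant factor. The function names are ours.

```python
from math import comb

def configuration_count(n: int) -> int:
    """Ordered Bell (Fubini) number: rankings of n items with ties allowed.

    Assumed here to count the gap configurations for n sequences.
    """
    counts = [1]  # counts[0] = 1 by convention
    for m in range(1, n + 1):
        # Choose the k items that share the first rank, then rank the rest.
        counts.append(sum(comb(m, k) * counts[m - k] for k in range(1, m + 1)))
    return counts[n]

def transition_count(n: int) -> int:
    """Number of non-zero bit vectors of length n."""
    return 2 ** n - 1

for n in range(2, 7):
    print(n, configuration_count(n), transition_count(n))
```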
3-8 Handbook of Computational Molecular Biology
TABLE 3.3 Thirteen states (a) and seven transitions (b) used in a three-way DP algorithm. An asterisk, a dash, and a dot indicate (runs of) non-null, null, and either character, respectively. (c) Transition table used for counting gap-opening penalty in the three-way DP alignment. The configuration and the transition type are encoded by a number, as shown in (a) and (b), respectively. The first number in each cell indicates the resulting configuration induced by the transition type of the row from the original configuration of the column. The second number indicates the number of gaps that open in association with the transition. For example, transition type 7 converts state 3 into state 11 whereby one new gap opens.
MSA
MSA is an implementation of an N-dimensional dynamic programming algorithm with a restricted search space, called the Carrillo-Lipman algorithm [13, 38, 68]. It reduces the search space by using upper bounds estimated from a provisional MSA $A^{\#}$ and from optimal pairwise sequence alignments. Note that MSA optimizes an objective function not by maximizing a similarity score but by minimizing a transformation cost.
The objective function of MSA is the WSP score:
$$C(A) = \sum_{1 \le m < n \le N} w_{m,n} C(A_{m,n}).$$
The pair-weights are calculated using the rationale-1 method of Altschul et al. [4].
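$C(A_{m,n})$ is the cost of the induced pairwise alignment between $\bar{a}_m$ and $\bar{a}_n$. Obviously, $C(A_{m,n}) \ge C^*(\bar{s}_m, \bar{s}_n)$, where $C^*(\bar{s}_m, \bar{s}_n)$ is the cost of the optimal pairwise alignment between $\bar{s}_m$ and $\bar{s}_n$, so that $L(S) = \sum_{1 \le m < n \le N} w_{m,n} C^*(\bar{s}_m, \bar{s}_n)$ is a lower bound of $C(A)$. Hence, for any pair $(p, q)$,
$$w_{p,q} C(A^*_{p,q}) \le C(A^{\#}) - L(S) + w_{p,q} C^*(\bar{s}_p, \bar{s}_q), \tag{3.15}$$
because $C(A^*) \le C(A^{\#})$. $C(A^{\#})$ is calculated using a simple progressive method.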
To calculate $C(A^*)$, the right-hand side of (3.15) is used as an upper bound for the $(p, q)$-plane to reduce the search space. This upper bound, however, is usually large, because $O(N^2)$ terms are discarded to obtain (3.15). Although MSA can use this upper bound, by default it uses the cost of the pairwise alignment induced from $A^{\#}$ as a heuristic upper bound for each plane. With this heuristic upper bound, more sequences can be aligned than with the upper bound of (3.15), although optimality of the alignment can no longer be guaranteed.
To determine the search space, MSA first determines the admissible points on each of the $\binom{N}{2}$ planes. Because only admissible points are candidates for those contributing to an optimal alignment, the search space is the intersection of the admissible points on every plane.
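The text does not spell out how the admissible points on a plane are found. One common way, assumed here, is to keep every point $(i, j)$ for which the cheapest pairwise alignment forced through that point costs no more than the bound for the plane. The sketch below uses hypothetical unit costs (`mismatch=1`, `gap=1`), omits the pair weights, and uses names of our own choosing.

```python
def prefix_costs(s, t, mismatch=1, gap=1):
    """F[i][j] = minimum cost of aligning s[:i] with t[:j] (linear gap cost)."""
    F = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i and j:
                options.append(F[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch))
            if i:
                options.append(F[i - 1][j] + gap)
            if j:
                options.append(F[i][j - 1] + gap)
            F[i][j] = min(options)
    return F

def admissible_points(s, t, upper_bound):
    """Points (i, j) lying on some pairwise alignment of cost <= upper_bound."""
    F = prefix_costs(s, t)
    # Suffix costs are prefix costs of the reversed sequences.
    R = prefix_costs(s[::-1], t[::-1])
    n, m = len(s), len(t)
    return {
        (i, j)
        for i in range(n + 1)
        for j in range(m + 1)
        if F[i][j] + R[n - i][m - j] <= upper_bound
    }

print(sorted(admissible_points("ACGT", "ACGT", 0)))
```

With a bound equal to the optimal pairwise cost, only points on optimal paths survive; loosening the bound widens the band around them.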
Before describing the algorithm for finding an optimal alignment, we describe two data structures (an open list and a cell) and the gap cost used by MSA. The open list Q stores pointers to the cells that are going to be visited. A cell consists of a previous node prev, a current node curr, and the cost from the start node to curr, cost. The open list is implemented as a priority queue. Since MSA uses the well-known Dijkstra's algorithm, all costs, including substitution matrix elements, are converted to non-negative values.
MSA uses a quasi-natural gap cost, which is an approximate affine gap cost. The use of the affine gap cost (or the so-called natural gap cost) is impractical, because it requires a large amount of memory and computation, as mentioned above. The quasi-natural gap cost penalizes gap opens based on the two adjacent edges $u = x \to y$ and $v = y \to z$ on a path, where $x$, $y$ and $z$ indicate nodes. Let $v_m$ be the $m$th element of $z - y$. If $v_m = 0$, $v_m$ means a null; otherwise, $v_m$ denotes the $z_m$th residue of $\bar{s}_m$. Table 3.4 shows the rule for penalizing gap opens. This rule penalizes no less than the actual number of gap opens.
When $(u_m, u_n) = (0, 0)$ and $(v_m, v_n)$ is either $(0, 1)$ or $(1, 0)$, an existing gap may extend at $z$. However, this rule always assigns a gap-open penalty in such cases, i.e., the quasi-natural gap cost adopts the pessimistic view. Let $C(u, v)$ be the transformation cost associated with the move from node $x$ to $z$ along edges $u$ and $v$; its gap-open component is the penalty determined from Table 3.4.
Using the upper bound $U = \sum_{1 \le m < n \le N} U_{m,n}$, MSA tries to find an optimal alignment within the reduced search space. It first extracts and removes a cell u with the minimum cost from the open list. Then, each admissible node (say z) adjacent to u.curr is examined.
If Q does not contain a cell v such that v.prev = u.curr and v.curr = z, such a cell v is inserted into Q with cost U + 1. Then, if u.cost + C(u, v) < v.cost, v.cost is replaced by u.cost + C(u, v).
After every admissible node adjacent to u.curr is checked, another cell is extracted from the open list, and these procedures are repeated until curr of the extracted cell is the end node $(|\bar{s}_1|, \ldots, |\bar{s}_N|)^T$.
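The search loop described above can be sketched with a binary heap as the open list. This is a simplified illustration, not the MSA program itself: it stores nodes rather than (prev, curr) cells, so it uses a proportional gap cost instead of the quasi-natural gap cost, and the cost values (`mismatch=1`, `gap=2`) and all names are hypothetical. The optional `admissible` argument stands in for the reduced search space.

```python
import heapq
from itertools import combinations, product

def column_cost(column, mismatch=1, gap=2):
    """Non-negative sum-of-pairs cost of one column; None marks a null."""
    cost = 0
    for a, b in combinations(column, 2):
        if a is None and b is None:
            continue
        if a is None or b is None:
            cost += gap
        elif a != b:
            cost += mismatch
    return cost

def dijkstra_alignment_cost(seqs, admissible=None):
    """Minimum SP cost from the start node to the end node, or None if unreachable."""
    n = len(seqs)
    start = (0,) * n
    end = tuple(len(s) for s in seqs)
    moves = [b for b in product((0, 1), repeat=n) if any(b)]
    best = {start: 0}
    open_list = [(0, start)]  # priority queue ordered by cost
    closed = set()
    while open_list:
        cost, curr = heapq.heappop(open_list)
        if curr in closed:
            continue  # stale entry left behind by a later improvement
        if curr == end:
            return cost
        closed.add(curr)
        for b in moves:
            z = tuple(c + d for c, d in zip(curr, b))
            if any(zi > ei for zi, ei in zip(z, end)):
                continue  # outside the lattice
            if admissible is not None and z not in admissible:
                continue  # outside the reduced search space
            column = [seqs[k][z[k] - 1] if b[k] else None for k in range(n)]
            new_cost = cost + column_cost(column)
            if new_cost < best.get(z, float("inf")):
                best[z] = new_cost
                heapq.heappush(open_list, (new_cost, z))
    return None

print(dijkstra_alignment_cost(["ACG", "AG", "ACG"]))
```

Instead of updating a cost in place, the sketch pushes a new heap entry and skips stale ones when they are popped, which is the usual idiom with Python's `heapq`.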
TABLE 3.4 The rule for penalizing gap opens. An element of an edge is expressed by a symbol, − for a null and ∗ for a residue. A gap open penalty is imposed on a cell with the value of unity.
Note that MSA does not always find an optimal alignment subject to the constraints; it may obtain an alignment whose WSP is greater than $U$. If this happens, the user must manually increase the upper bounds to find an optimal alignment. In addition, the number of sequences that can be aligned is limited to fewer than ten even when the heuristic upper bounds are used. Therefore, MSA cannot be used for aligning many sequences.
Several methods [47, 66, 98] use the A* algorithm for aligning more sequences than MSA can. These A*-based methods adopt essentially the same strategy as MSA. The major difference lies in the way of calculating the cost assigned to a move from node $y$ to $z$. Let $C(y, z)$ be the cost from $y$ to $z$. The modified cost $\tilde{C}(y, z)$ is defined as
$$\tilde{C}(y, z) = C(y, z) + L(y \to t) - L(z \to t), \tag{3.17}$$
where $t = (|\bar{s}_1|, \ldots, |\bar{s}_N|)^T$ is the end node and $L(y \to t)$ denotes the lower bound of the alignment costs for the subsequences $s_{m y_m}, \ldots, s_{m|\bar{s}_m|}$. $L(z \to t)$ is defined in a similar manner. The use of $\tilde{C}(y, z)$ reduces the number of admissible nodes more than MSA does, and hence speeds up the computation. GSA [66] achieves further performance improvements. It uses better lower bounds estimated from three-way alignments instead of pairwise alignments, and dynamically updates an upper bound during the alignment process. However, it is still difficult to align many sequences within a reasonable time. Reinert et al. [88] have improved the scalability of their A*-based algorithm by combining it with the divide and conquer algorithm (DCA) [111, 102], a heuristic for finding anchor points for global MSA from a set of PSAs.
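As an illustration of the lower bound $L(y \to t)$, the sketch below assumes the pairwise variant: the sum, over all sequence pairs, of the optimal pairwise cost of the remaining suffixes. The pair weights are omitted, the cost values (`mismatch=1`, `gap=2`) are hypothetical, node coordinates are taken to count the residues already consumed, and the function names are ours.

```python
from itertools import combinations

def pairwise_cost(s, t, mismatch=1, gap=2):
    """Minimum cost of a global pairwise alignment with a linear gap cost."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            sub = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def suffix_lower_bound(seqs, y):
    """Lower bound on the SP cost of aligning the suffixes that start at node y."""
    return sum(
        pairwise_cost(seqs[m][y[m]:], seqs[n][y[n]:])
        for m, n in combinations(range(len(seqs)), 2)
    )

seqs = ["ACG", "AG", "ACG"]
print(suffix_lower_bound(seqs, (0, 0, 0)), suffix_lower_bound(seqs, (3, 2, 3)))
```

Each pairwise projection of a multiple alignment costs at least as much as the optimal pairwise alignment of the same two suffixes, which is why the sum is a valid lower bound. A real implementation would precompute the pairwise suffix tables once instead of recomputing them for every node.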
COSA
COSA is an implementation of the integer linear programming (ILP) method instead of N-dimensional dynamic programming [1]. ILP maximizes an objective function $\sum_x w_x \cdot x$ subject to some constraints, where $x$ is an integer variable to be optimized and $w_x$ is a weight associated with $x$. An important feature of COSA is that it can accept any gap costs, such as affine, convex or position-specific gap costs. We briefly describe the algorithm of COSA.
Note that COSA maximizes the similarity score rather than minimizing a transformation cost.
To represent an alignment, two types of variables are used: alignment and gap variables.
Both types of variable take a value of either 0 or 1. An alignment variable denotes whether or not a pair of residues is aligned, whereas a gap variable represents whether or not a segment of a sequence is aligned with a gap of another sequence. Let $a(s_{mi}, s_{nj})$ be an alignment variable for the residue pair $s_{mi}$ and $s_{nj}$, and $g_q(s_{pi}, s_{pj})$ be a gap variable for a segment $s_{pi}, \ldots, s_{pj}$ and a sequence $\bar{s}_q$. $a(s_{mi}, s_{nj}) = 1$ means that residues $s_{mi}$ and $s_{nj}$ are aligned, and $g_q(s_{pi}, s_{pj}) = 1$ means that a segment $s_{pi}, \ldots, s_{pj}$ is aligned with a gap inserted in $\bar{s}_q$. The total number of these variables is
$\sum_{1 \le m < n \le N} |\bar{s}_m||\bar{s}_n| + N(N - 1)$
Four constraints are required to calculate an optimal alignment by ILP. First, each residue of a sequence must either correspond to a residue of another one, or be within a region represented by a gap variable. Second, there should not be any incompatibly aligned residue pairs. For example, if $a(s_{pi}, s_{qj}) = a(s_{pl}, s_{qk}) = 1$ with $i < l$ and $k < j$, then the corresponding residue pairs are incompatible. Third, the regions of two gap variables should not overlap and there must be at least one aligned residue pair between these regions. When a convex gap cost is used, this constraint is satisfied automatically. Fourth, the transitivity of three alignment variables has to be satisfied. Specifically, if $a(s_{pi}, s_{qj}) = 1$ and $a(s_{pi}, s_{rk}) = 1$, then $a(s_{qj}, s_{rk})$ must be 1.
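The second and fourth constraints can be checked on a candidate set of alignment variables with a few lines of code. The sketch below is only a feasibility check, not an ILP formulation; the data representation (a residue as a `(sequence, position)` tuple, an alignment variable set to 1 as a pair of residues) and the function names are assumptions made for illustration.

```python
from itertools import combinations

def crossing_pairs(aligned):
    """Incompatible pairs among (i, j) index pairs aligned between two sequences."""
    bad = []
    items = sorted(aligned)
    for idx, (i, j) in enumerate(items):
        for l, k in items[idx + 1:]:
            if i < l and k < j:  # the two aligned pairs cross each other
                bad.append(((i, j), (l, k)))
    return bad

def transitivity_violations(aligned):
    """Residue pairs that must be aligned by transitivity but are not.

    `aligned` is a set of ((p, i), (q, j)) tuples, one per variable equal to 1.
    """
    partners = {}
    for u, v in aligned:
        partners.setdefault(u, set()).add(v)
        partners.setdefault(v, set()).add(u)
    missing = set()
    for vs in partners.values():
        for v, w in combinations(sorted(vs), 2):
            different_sequences = v[0] != w[0]
            if different_sequences and (v, w) not in aligned and (w, v) not in aligned:
                missing.add((v, w))
    return missing

print(crossing_pairs({(1, 3), (2, 2)}))
print(transitivity_violations({((0, 1), (1, 1)), ((0, 1), (2, 1))}))
```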
Alignment and gap variables can be thinned out in advance, as most of them are unlikely to take the value of 1. The idea of reducing the number of variables is essentially the same as that of the search space determination of MSA. A heuristic alignment is first constructed to obtain a lower bound $L$. Then, an upper bound $U_v$ is calculated for every variable $v$. If $U_v \le L$, $v$ is set to 0 and therefore removed. Assuming that $v$ is associated with sequences $\bar{s}_m$ and $\bar{s}_n$,
$$U_v = C^*_v(\bar{s}_m, \bar{s}_n) + \sum_{(p,q) \ne (m,n)} C^*(\bar{s}_p, \bar{s}_q),$$
where $C^*_v(\bar{s}_m, \bar{s}_n)$ is the score of the optimal alignment between $\bar{s}_m$ and $\bar{s}_n$ such that $v = 1$, and $C^*(\bar{s}_p, \bar{s}_q)$ is the score of the optimal PSA between $\bar{s}_p$ and $\bar{s}_q$.
Because solving an ILP is NP-hard in general, COSA adopts a cutting plane algorithm. This algorithm solves a linear program (LP), called a relaxation of the ILP, which is obtained by omitting the integrality constraints of the ILP. Each variable of the relaxation then takes a value between 0 and 1. If a solution of the relaxation is integral, it is also a solution of the ILP. Otherwise, a cutting plane is calculated from the solution of the relaxation and added to the constraints of the relaxation. A cutting plane is a linear constraint that is violated by the current fractional solution but does not exclude an optimal solution of the ILP. This procedure is repeated until an integer solution is obtained. COSA thus provides another way to construct alignments exactly. Like MSA, however, it cannot align many sequences.
3.3.2 Progressive Methods