It is commonly recognized that the major drawback of progressive methods rests on the lack of an appropriate procedure to correct earlier errors when more sequences are added later on. Many ideas have been proposed to overcome this drawback, and most of the proposed methods have adopted some kind of iterative procedure [10, 12, 14, 103]. An extensive review of these methods has been published [34]; thus, we discuss only relatively recent results here. MSA methods based on a hidden Markov model (HMM) may be considered as a variant of this strategy, but will be discussed separately in subsection 3.3.6, since HMM methods have their own mathematical background.
General Strategy
3-18 Handbook of Computational Molecular Biology
The basic strategy of iterative refinement methods is summarized as follows. Given a preliminary MSA, the alignment is divided into a few (usually two) groups. Columns consisting of null characters only are removed from each group so that condition B in subsection 3.2.1 is satisfied. The two groups are then optimally realigned; because the original arrangement is itself one of the candidate alignments, the total score is never lower than that of the original alignment. By repeating the process for various ways of division, the overall alignment is gradually improved and ultimately reaches convergence. Several variations exist in the way and the order of division, as reviewed by Hirosawa et al. [44]. Recent methods that show good performance adopt similar strategies:
(1) The initial alignment is obtained by a progressive method.
(2) Division into two groups is guided by a tree that is constructed by a distance matrix method from the initial alignment. The order of division is either random or predetermined.
(3) Sequences or sequence pairs are weighted.
(4) Some heuristics are used for locating anchor points to accelerate overall calculation.
Three representative methods are introduced below.
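The division step of the general strategy can be illustrated with a minimal sketch. It only shows how an alignment might be split into two groups and how null-only columns are removed from each; the realignment step is not shown. The function names, the dictionary representation of the MSA, and the use of `-` as the null character are assumptions made for this illustration.

```python
from typing import Dict, List, Set, Tuple

GAP = "-"  # null character assumed for this sketch


def strip_null_columns(rows: List[str]) -> List[str]:
    """Remove every column that consists of null characters only."""
    if not rows:
        return []
    keep = [j for j in range(len(rows[0])) if any(r[j] != GAP for r in rows)]
    return ["".join(r[j] for j in keep) for r in rows]


def divide(msa: Dict[str, str], group_a: Set[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split an MSA into two groups and drop null-only columns from each."""
    names_a = [n for n in msa if n in group_a]
    names_b = [n for n in msa if n not in group_a]
    rows_a = strip_null_columns([msa[n] for n in names_a])
    rows_b = strip_null_columns([msa[n] for n in names_b])
    return dict(zip(names_a, rows_a)), dict(zip(names_b, rows_b))


# Toy alignment (made-up sequences): s1 and s2 form one group, s3 the other.
msa = {"s1": "AC-GT-", "s2": "AC-GTA", "s3": "A-CG-A"}
left, right = divide(msa, {"s1", "s2"})
```

In a tree-guided scheme, the set passed as `group_a` would be the leaves on one side of a tree edge, so that each edge defines one way of dividing the alignment.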
Prrn
The heart of iterative refinement methods is group-to-group pairwise alignment, with which the overall alignment score is improved. When we use a proportional gap penalty function, a DP algorithm solves the problem straightforwardly. However, the worst-case computational complexity is dramatically increased when we use an affine gap penalty under the (W)SP scoring system, as Kececioglu and Starrett have recently proven the problem to be NP-complete with respect to the total number, M + N, of sequences in the two groups [55].
In practice, Gotoh has devised very efficient algorithms that solve the problem in time complexity nearly independent of M + N [30, 31]. Two key ideas have made the algorithm feasible. First, the so-called candidate list paradigm [69] extends the usual DP procedures without loss of rigor but with only a moderate increase in time and space complexity.
Second, the data structure of ‘generalized profiles’ facilitates exact and efficient calculation of affine gap penalties. Note that the natural gap penalties are imposed rather than the quasi-natural gap costs [2] used in MSA [68].
Let $A_i = a_1 a_2 \cdots a_i$ and $B_j = b_1 b_2 \cdots b_j$ be the prefixes of the groups of sequences, $A$ and $B$, to be aligned. The set of candidates at each node $(i, j)$ of the DP procedure corresponds to distinct configurations of alignment between $A_i$ and $B_j$. By mutual competition, only those candidates that can still contribute to the final optimal alignment are retained. In the earlier versions of Prrn/Prrp, four criteria, which Kececioglu and Starrett call extremal pruning, were used to prune candidates. Since Ver. 3.0 [34], a single criterion, dominance pruning [55], has been adopted. The efficiencies of extremal and dominance pruning are virtually equivalent, whereas dominance pruning can be coded significantly more compactly. Incidentally, as of this version, Prrp for protein sequences has been merged into Prrn, which had previously been used to align nucleotide sequences only, to form a single program.
Another unique feature of Prrn is the use of a doubly nested randomized iterative strategy. In the inner loop of this strategy, the tree-partitioned iterative refinement of MSA is performed with a set of weights given to all pairs of sequences. These pair-weights are calculated from an unrooted tree by the ‘three-way’ method [32]. The tree used for partitioning and calculation of weights is obtained by a distance matrix method, UPGMA or the NJ method. The distance matrix is, in turn, obtained from an MSA. Thus, MSA, tree, and pair-weights are mutually interdependent. Prrn repeats the iteration until this triad becomes mutually consistent [33, 34].
From a practical point of view, the Prrn algorithm is somewhat more elaborate than necessary. For example, the rigorous group-to-group pairwise alignment algorithm (Algorithm D in [30]) may be replaced by a cheaper one (Algorithm B or C in [30]) without significant loss of accuracy, as assessed with a method discussed in section 3.4. MAFFT and MUSCLE, discussed below, follow this idea.
Multiple Sequence Alignment 3-19
MAFFT
The features of MAFFT [52] are rapid construction of the guide tree and fast search for anchor points by means of the fast Fourier transform (FFT) method [87]. MAFFT can construct accurate alignments even faster than ClustalW. MAFFT differs from Prrn in three major respects: (1) preparation of the initial alignment, (2) the method for detecting anchor points, and (3) treatment of the gap-open penalty. These differences are described in detail below.
MAFFT constructs an initial alignment by applying a progressive method twice. The first phase aligns sequences based on a roughly estimated guide tree. A modified version of the method of Jones et al. [50] is used for calculating distances. The distance between $\bar{s}_m$ and $\bar{s}_n$, $D_{m,n}$, is obtained by:
$$D_{m,n} = 1 - \frac{T_{m,n}}{\min(T_{m,m}, T_{n,n})}, \tag{3.19}$$
where $T_{m,n}$ is the number of $K$-mer segments ($K = 6$ in MAFFT) shared by $\bar{s}_m$ and $\bar{s}_n$. In the calculation of $T_{m,n}$, the 20 amino acids are grouped into six categories depending on their physico-chemical properties. This method requires only $O(\bar{L})$ computational steps for each sequence pair, provided that the 6-mer frequencies are precomputed for all sequences, which requires $O(N\bar{L})$. Hence, the total computation requires $O(N^2 \bar{L})$. By contrast, the dynamic programming method requires $O(\bar{L}^2)$ for each pair, and $O(N^2 \bar{L}^2)$ in total. Thus, much computation time can be saved by the $K$-mer method. MAFFT uses a slightly modified version of the UPGMA method [100] for the construction of the guide tree.
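The K-mer distance can be sketched as follows. This is an illustration under stated assumptions, not MAFFT's implementation: the six-class grouping in `GROUPS` is hypothetical (the text does not list the actual classes), and shared K-mers are counted with multiplicity, which is one possible reading of "the number of K-mer segments shared". All function names are invented for the example.

```python
from collections import Counter

# Hypothetical six-class grouping of the 20 amino acids (an assumption;
# the classes actually used by MAFFT are not given in the text).
GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]
CLASS_OF = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def kmer_counts(seq: str, k: int = 6) -> Counter:
    """Count K-mers of a sequence after mapping residues to their classes.

    This is the per-sequence precomputation; it is linear in sequence length.
    """
    reduced = "".join(CLASS_OF[aa] for aa in seq)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


def shared(ca: Counter, cb: Counter) -> int:
    """T: number of K-mers common to both count tables (with multiplicity)."""
    return sum(min(n, cb[kmer]) for kmer, n in ca.items())


def kmer_distance(ca: Counter, cb: Counter) -> float:
    """D = 1 - T_mn / min(T_mm, T_nn); both sequences need at least one K-mer."""
    return 1.0 - shared(ca, cb) / min(shared(ca, ca), shared(cb, cb))


# Toy usage with made-up sequences.
counts_m = kmer_counts("ACDEFGHIKL")
counts_n = kmer_counts("GCDEFGHIKL")  # differs only within one class
d = kmer_distance(counts_m, counts_n)
```

Because the count tables are computed once per sequence, each pairwise distance only needs a pass over one table, which is where the saving over pairwise dynamic programming comes from.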
The second phase uses the progressive method again. The second phase differs from the first phase in the guide tree, which is reconstructed from the distance matrix estimated from the MSA obtained by the first phase. The distance $D_{m,n}$ is defined as $D_{m,n} = -\log I_{m,n}$, where $I_{m,n}$ is the degree of sequence identity between $\bar{a}_m$ and $\bar{a}_n$. The second phase alignment is likely to be more accurate than that of the first phase, since the new guide tree is expected to be more reliable.
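A small sketch of the second-phase distance follows. The text does not say how the degree of identity is measured, so this example assumes it is the fraction of identical residues among the columns in which both aligned sequences carry a residue, and it assumes the natural logarithm. The function name and the `-` gap symbol are also assumptions.

```python
import math


def identity_distance(row_m: str, row_n: str, gap: str = "-") -> float:
    """Return -log(I), with I the identity over columns occupied in both rows."""
    pairs = [(x, y) for x, y in zip(row_m, row_n) if x != gap and y != gap]
    identical = sum(1 for x, y in pairs if x == y)
    if identical == 0:
        # No shared identical residue: the distance is unbounded.
        return math.inf
    return -math.log(identical / len(pairs))


# Two rows of a toy alignment: 2 of the 3 jointly occupied columns match.
d = identity_distance("AC-T", "AG-T")
```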
Next, we explain the methods of FFT preprocessing and the group-to-group sequence alignment algorithm used in MAFFT. The FFT preprocessing method finds anchor points that vertically divide a group into several disjoint sections. In this procedure, correlations between two groups are rapidly calculated with the FFT algorithm, and 20 positional lags (diagonals) with the highest correlation scores are identified. These diagonals are then searched for high-scoring segment pairs with average matching scores per column exceeding a threshold.
The correlation between groups $A$ and $B$ with positional lag $k$, $\rho(k)$, is defined as
$$\rho(k) = \rho_v(k) + \rho_p(k), \tag{3.20}$$
where
$$\rho_v(k) = \sum_{1 \le i \le \min(I, J-k)} \{ v_A(i)\, v_B(i+k) \}. \tag{3.21}$$
$\rho_v(k)$ denotes the correlation of the volume component. The correlation of the polarity component, $\rho_p(k)$, is defined in a similar way. $v_C(j)$ and $p_C(j)$ are the weighted sums of the volume values and the polarity values, respectively, for the $j$-th column of $C \in \{A, B\}$. MAFFT obtains the weights in the same way as ClustalW in the progressive phase, and by the three-way method [32] in the iterative phase. The calculation of $\rho(k)$ requires $O(\bar{L} \log \bar{L})$ computation, which is spent in the FFT procedure used to obtain $\rho_v(k)$ and $\rho_p(k)$.
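To make the FFT step concrete, the following sketch computes the volume correlation for all lags at once by means of the cross-correlation theorem. It is not MAFFT's code: the radix-2 FFT is a textbook implementation written out so that the example needs only the standard library, positions are 0-based, and the input vectors are made-up numbers standing in for the weighted volume sums of each column. The polarity component would be handled identically and added to the result.

```python
import cmath
from typing import Dict, List, Sequence


def fft(x: List[complex], inverse: bool = False) -> List[complex]:
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def correlation_by_lag(va: Sequence[float], vb: Sequence[float]) -> Dict[int, float]:
    """Return {k: sum_i va[i] * vb[i + k]} for every lag k with some overlap."""
    size = 1
    while size < len(va) + len(vb):  # zero-pad so lags do not wrap around
        size *= 2
    fa = fft([complex(v) for v in va] + [0j] * (size - len(va)))
    fb = fft([complex(v) for v in vb] + [0j] * (size - len(vb)))
    # Cross-correlation theorem: conj(FFT(a)) * FFT(b), then inverse FFT.
    prod = [a.conjugate() * b for a, b in zip(fa, fb)]
    corr = [c.real / size for c in fft(prod, inverse=True)]
    # Negative lags are stored at the end of the circular result.
    return {k: corr[k % size] for k in range(-(len(va) - 1), len(vb))}


# Made-up volume profiles; the second contains the first shifted by one column.
rho_v = correlation_by_lag([1.0, 2.0, 0.5], [0.0, 1.0, 2.0, 0.5, 0.0])
best_lag = max(rho_v, key=rho_v.get)
```

The lags with the highest correlation are the diagonals that would then be searched for high-scoring segment pairs.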
To determine the positions of high-scoring segment pairs in each diagonal, MAFFT uses a sliding window method with a window size of 30. If successive high-scoring segment pairs overlap, they are merged. Potential noise is removed by a sparse DP algorithm that finds the optimal combination of compatible high-scoring segment pairs. The midpoints of the high-scoring segments in this optimal combination serve as the boundaries by which each group is divided into several sections (Figure 3.3). Then, each pair of sections is aligned by the group-to-group sequence alignment algorithm described below.
FIGURE 3.3: FFT preprocessing. Open squares with thin lines denote the high-scoring segment pairs detected. In this example, four of the five high-scoring segment pairs are included in the optimal compatible assembly. Both groups are cut into five sections at the midpoint of each high-scoring segment pair, indicated by a filled circle. After division, each pair of GAi and GBi is aligned. Since only the shaded regions are examined, the calculation is significantly faster.
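The selection of a compatible set of high-scoring segment pairs can be sketched as a chaining problem. The example below substitutes a simple quadratic DP for the sparse DP mentioned in the text, so it illustrates the objective rather than MAFFT's algorithm. The `Segment` type, the compatibility rule (a segment may follow another only if it starts after the other ends in both groups), and all coordinates and scores are assumptions made for the illustration.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    a: int        # start column in group A
    b: int        # start column in group B
    length: int   # number of columns (at least 1)
    score: float


def best_chain(segments: List[Segment]) -> List[Segment]:
    """Highest-scoring chain of mutually compatible segments (O(n^2) DP)."""
    segs = sorted(segments)  # sorted by start in A, so predecessors come first
    if not segs:
        return []
    best = [s.score for s in segs]
    prev = [-1] * len(segs)
    for j, t in enumerate(segs):
        for i in range(j):
            s = segs[i]
            compatible = s.a + s.length <= t.a and s.b + s.length <= t.b
            if compatible and best[i] + t.score > best[j]:
                best[j] = best[i] + t.score
                prev[j] = i
    j = max(range(len(segs)), key=best.__getitem__)
    chain = []
    while j != -1:  # trace back from the best chain end
        chain.append(segs[j])
        j = prev[j]
    return chain[::-1]


def cut_points(chain: List[Segment]) -> List[Tuple[int, int]]:
    """Midpoint of each chained segment, used as a section boundary."""
    return [(s.a + s.length // 2, s.b + s.length // 2) for s in chain]


# Five made-up segments; the third conflicts with its neighbours.
hsps = [
    Segment(0, 0, 10, 8.0),
    Segment(20, 22, 10, 7.0),
    Segment(35, 15, 10, 9.0),
    Segment(40, 45, 10, 6.0),
    Segment(60, 70, 10, 5.0),
]
chain = best_chain(hsps)
cuts = cut_points(chain)
```

With these toy values, four of the five segments are chained, so the two groups would be cut into five sections, mirroring the situation drawn in Figure 3.3.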
MAFFT uses the WSP-type objective function for group-to-group sequence alignment, but the gap-opening penalty differs from that of the natural WSP scoring system. The gap-opening penalty assigned to a gap that opens or closes opposite to two successive columns $c_p$ and $c_q$ is defined as:
$$v \left[ \{1 - g^s_C(p)\} + \{1 - g^e_C(q)\} \right] / 2, \tag{3.22}$$
where $v$ is the basic gap-opening penalty, $g^s_C(p)$ is the number of gaps starting at $c_p$ in $C$, and $g^e_C(q)$ is the number of gaps ending at $c_q$. MAFFT also uses a normalized substitution matrix. A score of the substitution matrix, $S_2(a, b)$, is normalized as
$$\bar{S}_2(a, b) = \frac{S_2(a, b) - \mu_2}{\mu_1 - \mu_2} + u, \tag{3.23}$$
where $\mu_1 = \sum_{a \in \Sigma} f_a S_2(a, a)$ and $\mu_2 = \sum_{a, b \in \Sigma} f_a f_b S_2(a, b)$. $u$ is a parameter with $u \ll 1$, corresponding to a gap extension penalty. $f_a$ is the stationary composition of amino acid $a$ derived from the substitution matrix. An average score per position between segments satisfies $u \le \mu_{\bar{s}} \le 1 + u$. MAFFT uses an ordinary two-dimensional DP for aligning two groups.
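The effect of the normalization can be checked numerically with a short sketch. The two-letter alphabet, the scores, the composition, and the value of `u` are all made up for the illustration, and the function name is an assumption. After normalization, the composition-weighted mean of the diagonal scores equals 1 + u and the mean over all residue pairs equals u, which is why an average score per position falls between these two values.

```python
from typing import Dict, Tuple

Matrix = Dict[Tuple[str, str], float]


def normalize(s2: Matrix, freq: Dict[str, float], u: float) -> Matrix:
    """Rescale scores so that identity averages 1 + u and random pairs average u."""
    mu1 = sum(f * s2[(a, a)] for a, f in freq.items())
    mu2 = sum(fa * fb * s2[(a, b)]
              for a, fa in freq.items() for b, fb in freq.items())
    return {pair: (score - mu2) / (mu1 - mu2) + u for pair, score in s2.items()}


# Toy values (hypothetical): a two-letter alphabet with composition 0.6 / 0.4.
freq = {"A": 0.6, "B": 0.4}
s2 = {("A", "A"): 4.0, ("B", "B"): 6.0, ("A", "B"): -2.0, ("B", "A"): -2.0}
u = 0.1
norm = normalize(s2, freq, u)

mean_identity = sum(f * norm[(a, a)] for a, f in freq.items())
mean_random = sum(fa * fb * norm[(a, b)]
                  for a, fa in freq.items() for b, fb in freq.items())
```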
Note that MAFFT with FFT preprocessing is not always faster than MAFFT without it; since FFT preprocessing may fail to identify high-scoring segment pairs among highly divergent sequences, the search space may not be reduced sufficiently to compensate for the cost of the preprocessing. In addition, the alignment accuracy of MAFFT is slightly decreased because FFT preprocessing reduces the search space without guaranteeing optimality.
MUSCLE
Recently, a new program called MUSCLE [20] has been reported. MUSCLE uses nearly the same strategy as MAFFT. MUSCLE differs from MAFFT in that it detects anchor points by an oligomer counting method [19], which requires $O(\bar{L})$ computation, instead of FFT preprocessing. Moreover, instead of the classical profiles [37] used in Prrn or MAFFT, MUSCLE adopts a log expectation scoring scheme to evaluate the $S_2(a, b)$ term for group-to-group alignment. The log expectation score between profile columns $a$ and $b$ is defined as
$$(1 - a[-])(1 - b[-]) \log \sum_{x, y \in \Sigma} a[x]\, b[y]\, \frac{p_{x,y}}{p_x p_y}, \tag{3.24}$$
where $a[-]$ and $a[x]$ ($b[-]$ and $b[y]$) are the frequencies of the null character and of residue $x$ ($y$) in profile column $a$ ($b$), respectively. $p_{x,y}$ is a joint probability between residues $x$ and $y$, and $p_z$ is a background probability of residue $z$. Both $p_{x,y}$ and $p_z$ are derived from the probabilistic substitution matrix of amino acids.
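A minimal sketch of the log expectation score is given below. It reads the definition as the logarithm of the expected odds ratio over all residue pairs, scaled by the fraction of non-null characters in each column. The dictionary representation of profile columns, the function name, the guard for columns made only of null characters, and the two-letter toy probabilities are assumptions made for the illustration.

```python
import math
from typing import Dict, Tuple


def log_expectation(col_a: Dict[str, float], col_b: Dict[str, float],
                    joint: Dict[Tuple[str, str], float],
                    background: Dict[str, float], gap: str = "-") -> float:
    """Log expectation score between two profile columns."""
    occupancy = (1 - col_a.get(gap, 0.0)) * (1 - col_b.get(gap, 0.0))
    if occupancy == 0.0:
        # A column made only of null characters contributes nothing
        # (a convention assumed for this sketch).
        return 0.0
    expected = sum(
        col_a.get(x, 0.0) * col_b.get(y, 0.0)
        * joint[(x, y)] / (background[x] * background[y])
        for x in background for y in background
    )
    return occupancy * math.log(expected)


# Toy two-letter model (made-up probabilities; joint marginals match background).
background = {"A": 0.5, "B": 0.5}
joint = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

same = log_expectation({"A": 1.0}, {"A": 1.0}, joint, background)
different = log_expectation({"A": 1.0}, {"B": 1.0}, joint, background)
gapped = log_expectation({"A": 0.5, "-": 0.5}, {"A": 1.0}, joint, background)
```

Columns that agree score above zero and columns that disagree score below zero, while a partly null column is down-weighted by its occupancy factor.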
The gap penalty of MUSCLE is basically the same as that of MAFFT, but MUSCLE, like ClustalW, adjusts gap penalties according to the hydrophobicity of the surrounding residues.
3.3.5 Stochastic Methods