CHAPTER II THEORETICAL FRAMEWORK
2.3 Cybertrends
2.3.2 Information and Communication Technologies
ClustalW
ClustalW [Thompson et al., 1994] is a heuristic multiple sequence alignment (MSA) program, and it was the aligner used in our 18S rDNA pipeline. MSAs are vital to many areas of bioinformatics, for example homology modelling and phylogenetic analysis [Capella-Gutierrez et al., 2009]. The aim of MSA is to arrange a set of sequences so that similar sequence features are aligned together. This allows patterns shared among many sequences to be identified, and reveals changes that may explain functional and phenotypic variability. A feature can be any relevant biological information, such as structure, function or homology to the common ancestor [Kemena and Notredame, 2009]. The quality of an MSA is therefore critical to the reliability and accuracy of the analyses that depend on it.
A large number of MSA programs are presently available, applying different heuristics to approximate an optimal solution to the alignment problem. Accuracies of 80-90% have been reported for the best MSA algorithms, but even these can fail in specific regions of an alignment. The problem is worse for large-scale analyses, which rely on faster but less reliable algorithms [Capella-Gutierrez et al., 2009].
The basic ClustalW algorithm consists of three key parts. First, all pairs of sequences are aligned separately and a distance matrix is calculated from these alignments, giving the divergence of each pair of sequences. Second, a guide tree is calculated from the distance matrix using the Neighbor-Joining method [Saitou and Nei, 1987]. Initially an unrooted tree is constructed, with branch lengths proportional to the approximate divergence along each branch. A root is then placed by the mid-point method, at the position where the means of the branch lengths on either side of the root are equal. Third, the sequences are progressively aligned according to the branching order in the guide tree, from the leaves of the rooted tree towards the root.
At each stage of the progressive alignment, a dynamic programming algorithm is run with a residue weight matrix and penalties for opening and extending gaps. Each stage aligns two existing alignments or sequences, and gaps introduced in earlier alignments remain unaltered. New gaps added at each stage receive the full gap opening and extension penalties, regardless of whether they are added inside an existing gap position. The score between a position from one sequence or alignment and a position from another is the average of all the pairwise weight matrix scores between the residues of the two sets of sequences. If either set of sequences has one or more gaps at the position being considered, each gap versus a residue is scored as zero. The default amino acid weight matrices are rescaled so that they contain only positive values. Consequently, this treatment of gaps gives a residue versus a gap the worst possible score. When the sequences are weighted, each weight matrix value is multiplied by the weights of the two sequences involved [Thompson et al., 1994].
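The column scoring rule described above can be illustrated with a minimal Python sketch. It is not ClustalW's own code: the function name `column_score`, the toy matrix values and the choice to score gap versus gap as zero are assumptions made for illustration only.

```python
from itertools import product

GAP = "-"

# Hypothetical non-negative weight matrix for two residues (illustrative values only).
TOY_MATRIX = {("A", "A"): 4, ("G", "G"): 4, ("A", "G"): 1, ("G", "A"): 1}


def column_score(col_a, col_b, weights_a, weights_b, matrix):
    """Average weighted pairwise score between one column from each alignment.

    col_a, col_b: residues (or gaps) found in the two columns being compared.
    weights_a, weights_b: one sequence weight per entry of col_a / col_b.
    """
    total = 0.0
    pairs = product(zip(col_a, weights_a), zip(col_b, weights_b))
    for (res_a, w_a), (res_b, w_b) in pairs:
        if res_a == GAP or res_b == GAP:
            # A gap scores zero, the worst value in a non-negative matrix.
            # Scoring gap versus gap as zero as well is an assumption of this sketch.
            pair_score = 0.0
        else:
            pair_score = matrix[(res_a, res_b)]
        # Each matrix value is multiplied by the weights of the two sequences.
        total += pair_score * w_a * w_b
    return total / (len(col_a) * len(col_b))


# Two sequences in each alignment, all weights equal to 1.
unweighted = column_score(["A", "A"], ["A", GAP], [1.0, 1.0], [1.0, 1.0], TOY_MATRIX)

# Same idea, with the second sequence of the first alignment down-weighted.
weighted = column_score(["A", "G"], ["A", "G"], [1.0, 0.5], [1.0, 1.0], TOY_MATRIX)
```

In the first call the two gap pairs contribute nothing, so the average over four pairs is 2.0. In the second call the down-weighted sequence contributes half as much to the average.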
trimAl
trimAl [Capella-Gutierrez et al., 2009] is an automated trimming tool for multiple sequence alignments, and it was used in our 18S rDNA pipeline. It has been reported that removing poorly aligned regions from an alignment increases the quality of subsequent analyses [Capella-Gutierrez et al., 2009]. trimAl first reads all the columns in the alignment and calculates a score for each of them: a gap score, a similarity score or a consistency score. The score for a column is calculated from that column alone or, if a window size is given, as the average over the window of columns around the position being considered. The gap score for a given column is the fraction of sequences with no gap at that position.
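As a minimal sketch of the gap score and its window average, the following Python functions work on an alignment held as a list of equal-length strings. The function names, the `half_window` parameter and the truncation of the window at the ends of the alignment are assumptions of this example, not trimAl's documented behaviour.

```python
def gap_scores(alignment):
    """Fraction of sequences without a gap, for every column of the alignment."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    return [
        sum(seq[i] != "-" for seq in alignment) / n_seqs
        for i in range(length)
    ]


def windowed(scores, half_window):
    """Average each score with its neighbours up to half_window columns away.

    Windows are simply cut short at the alignment ends (an assumption).
    """
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half_window)
        hi = min(len(scores), i + half_window + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


toy_alignment = ["AC-T", "A--T", "ACGT", "A--T"]  # hypothetical sequences
per_column = gap_scores(toy_alignment)             # [1.0, 0.5, 0.25, 1.0]
smoothed = windowed(per_column, half_window=1)
```

With a window, a gap-free column next to gappy columns receives a lower score than it would on its own, which lets whole poorly aligned regions be scored together.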
The residue similarity score uses the mean distance score between pairs of residues, as defined by a given scoring matrix. The consistency score is only calculated when more than one alignment of the same set of sequences is given. It measures how consistently the residue pairs located in a given column are also aligned in the other alignments. The alignment with the highest consistency score is selected and trimmed to remove the columns that are less conserved.
trimAl can proceed in two ways once all the column scores have been calculated.
The first way uses thresholds set by the user. The conservation threshold is the minimum percentage of columns from the initial alignment that the user would like to keep in the trimmed multiple sequence alignment. If a score threshold and a minimum conservation threshold are provided, trimAl outputs a trimmed alignment consisting only of the columns with scores greater than the score threshold. If the number of such columns falls below the conservation threshold, trimAl adds further columns in decreasing order of score until the conservation threshold is reached. Alternatively, trimAl has three modes for the automated selection of parameters (gappyout, strict and strictplus), which are based on gap and similarity scores. In these modes trimAl calculates the specific score thresholds from the characteristics of each alignment. trimAl also has an option that implements a heuristic to decide the appropriate mode depending on the alignment characteristics [Capella-Gutierrez et al., 2009].
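The threshold-based column selection can be sketched in Python as follows. This is an illustration under stated assumptions rather than trimAl's implementation: the names `select_columns` and `trim` are hypothetical, the required number of columns is rounded up, and ties between equal scores are broken by column order.

```python
import math


def select_columns(scores, score_threshold, min_conserved_pct):
    """Return the sorted indices of the columns kept after trimming."""
    n_cols = len(scores)
    # Keep every column whose score is greater than the score threshold.
    keep = {i for i, s in enumerate(scores) if s > score_threshold}
    # Minimum number of columns to retain (rounding up is an assumption).
    required = math.ceil(n_cols * min_conserved_pct / 100)
    if len(keep) < required:
        # Add the best-scoring rejected columns until the minimum is reached.
        rejected = sorted(
            (i for i in range(n_cols) if i not in keep),
            key=lambda i: scores[i],
            reverse=True,
        )
        keep.update(rejected[: required - len(keep)])
    return sorted(keep)


def trim(alignment, columns):
    """Build the trimmed alignment from the selected column indices."""
    return ["".join(seq[i] for i in columns) for seq in alignment]


scores = [1.0, 0.5, 0.25, 1.0, 0.75]          # hypothetical column scores
kept = select_columns(scores, score_threshold=0.8, min_conserved_pct=60)
trimmed = trim(["ACGTA", "A-GTA", "A--TC"], kept)
```

Here only columns 0 and 3 pass the score threshold, but 60% of five columns means three must be kept, so column 4 (the best of the rejected columns) is added back.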