
CHAPTER III: PRESENTATION, ANALYSIS AND INTERPRETATION OF

3.2. Discussion of Results

Virus sequences were retrieved from the NCBI GenBank database, and a complete list of the downloaded accession numbers is provided in the corresponding chapters. Studies have demonstrated that errors in an inferred multiple sequence alignment (MSA) can be common when divergent sequences are analysed, and that these errors can inflate estimates of positive selection acting on the genes of interest. To circumvent this difficulty, GUIDANCE2 (Privman, Penn and Pupko, 2012) was used to filter unreliably aligned codons: codons with an alignment score of < 0.90 were masked for subsequent analysis. Alignments were initially carried out in MEGA 7.02 using the MUSCLE algorithm. However, alignments for the final input into the selection analysis software were realigned using the Probabilistic Alignment Kit (PRANK) (Fletcher and Yang, 2010).
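The masking step can be illustrated with a minimal sketch. It assumes one confidence score per codon column and an `NNN` mask; the function name, the score list and the example sequences are hypothetical and do not reproduce the GUIDANCE2 output format.

```python
def mask_low_confidence_codons(codon_seqs, column_scores, cutoff=0.90, mask="NNN"):
    """Replace every codon column whose score is below `cutoff` with `mask`.

    codon_seqs: dict of name -> aligned nucleotide string (length = 3 x columns)
    column_scores: one hypothetical confidence score per codon column
    """
    n_codons = len(column_scores)
    masked = {}
    for name, seq in codon_seqs.items():
        if len(seq) != 3 * n_codons:
            raise ValueError(f"{name}: length does not match the score list")
        # Cut the sequence into codons, then keep or mask each one.
        codons = [seq[3 * i:3 * i + 3] for i in range(n_codons)]
        masked[name] = "".join(
            codon if score >= cutoff else mask
            for codon, score in zip(codons, column_scores)
        )
    return masked


# Illustrative values only, not real alignment scores.
alignment = {"seqA": "ATGGCTAAA", "seqB": "ATGGCAAAG"}
scores = [0.99, 0.42, 0.95]
masked_alignment = mask_low_confidence_codons(alignment, scores)
```

Here the second codon column scores below 0.90, so it is masked in every sequence while the flanking columns are left unchanged.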

PRANK is a probabilistic multiple alignment program that can be used to align nucleotide, codon and amino-acid sequences. It is based on a novel algorithm that treats insertions and deletions as distinct events and can thus avoid the common alignment problem of assigning multiple penalties to a single deletion event (Löytynoja and Goldman, 2005). In theory, this means that the PRANK algorithm will create a more evolutionarily correct alignment. As selection analysis is fundamentally a study of the evolutionary patterns at work on a sequence, this factor may be crucial to improving analysis results.

2.13.1. Selection analysis

To test for positive selection in individual codons of the viral glycoproteins of interest in this study, the ratio of nonsynonymous (dN) to synonymous (dS) changes was compared using several distinct methods. The analysis was implemented using the Datamonkey webserver (Pond and Frost, 2005; Delport et al., 2010) to apply a number of different algorithms to the datasets. Datamonkey has several advantages over other methods for the detection of selection, e.g. PAML (Yang, 1997, 2007), the most significant of which for this study is that complex models are fitted quickly on the remote servers on which the tools run.

This allows an analysis that would otherwise take hours or days to run on a conventional PC to be completed in minutes (Pond and Frost, 2005). The Datamonkey tools require an MSA and an initial phylogenetic reconstruction or, where recombination has occurred in the dataset, multiple phylogenetic trees (one for each non-recombinant segment). To avoid frameshift and similar errors, the sequence alignment was conducted on translated protein sequences, which were then back-translated into nucleotides. This MSA was then used to perform an initial phylogenetic reconstruction, which formed the framework for comparing selective pressure within the MSA. When the phylogeny inferred from the available sequence data did not match published phylogenetic information, the tree was manually edited to reflect the most recent and robust phylogeny prior to selection analysis.
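Back-translation of a protein alignment into a codon alignment can be sketched as follows. This is a minimal illustration, assuming each coding sequence is in frame, carries no terminal stop codon and matches its protein residue for residue; the function name and the example sequences are hypothetical.

```python
def back_translate(aligned_protein, cds):
    """Thread an unaligned coding sequence onto its aligned protein sequence.

    Each residue consumes the next codon; each protein gap becomes '---'.
    """
    cds = cds.replace("-", "")
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) != 3 * n_residues:
        raise ValueError("coding sequence length is not 3 x number of residues")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


# Illustrative input: protein 'MAK' aligned with one gap after the first residue.
codon_aligned = back_translate("M-AK", "ATGGCTAAA")
```

Because gaps are inserted only as whole codons, the reading frame of every sequence is preserved in the resulting nucleotide alignment.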

Due to the well-known difficulties of performing this form of analysis in the presence of recombination (Fletcher and Yang, 2010; Brieuc and Naish, 2011), all MSA datasets were pre-screened for recombination using the genetic algorithm for recombination detection (GARD) tool in the Datamonkey web-based suite of selection detection tools (Delport et al., 2010). Recombination, in general, has a minor impact on estimates of global dN/dS rates but can have a profound effect on estimates of site-to-site and branch-to-branch variation in selection pressure. To perform selection analysis in the presence of recombination, it is necessary first to split the alignment into non-recombinant sequence fragments. GARD uses a genetic algorithm to identify breakpoints in an MSA that separate non-recombinant fragments. Once these fragments had been identified, selection analysis was run separately on each fragment.
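Splitting an alignment at inferred breakpoints can be sketched as below. The breakpoint convention (0-based column index at which a new fragment starts) and the example data are assumptions made for illustration; the coordinates reported by a given tool should be checked, and breakpoints may need to be moved to codon boundaries before codon-based analysis.

```python
def split_at_breakpoints(alignment, breakpoints):
    """Cut every aligned sequence into consecutive fragments.

    alignment: dict of name -> aligned sequence (all of equal length)
    breakpoints: 0-based column indices at which a new fragment starts
                 (a hypothetical convention used for this sketch)
    """
    length = len(next(iter(alignment.values())))
    edges = [0] + sorted(breakpoints) + [length]
    return [
        {name: seq[start:end] for name, seq in alignment.items()}
        for start, end in zip(edges, edges[1:])
    ]


# Illustrative alignment with a single breakpoint at column 6.
alignment = {"seqA": "ATGGCTAAATTT", "seqB": "ATGGCAAAGTTC"}
fragments = split_at_breakpoints(alignment, [6])
```

Each element of `fragments` is a smaller alignment that can be paired with its own tree and analysed separately.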

After screening the assembled MSA for evidence of recombination, several algorithms were implemented to search for sites under positive selection. Internal fixed effects likelihood (IFEL) uses the entire alignment to infer model parameters shared by all sites (e.g., branch lengths) and then fits dS and dN rates individually at every site. Neutrality of an individual site is tested using the likelihood ratio test. This method is useful for inferring selective pressure that is acting on internal branches of the phylogeny being tested (Delport et al., 2010).
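The per-site likelihood ratio test can be sketched as follows. The sketch assumes the null model constrains dN = dS at the site and the alternative leaves both rates free, so that the test statistic is compared with a chi-square distribution with one degree of freedom; the log-likelihood values are invented for illustration.

```python
import math


def lrt_p_value(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test with one degree of freedom."""
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of a chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(statistic / 2.0))


# Illustrative site log-likelihoods (statistic = 3.84, p close to 0.05).
p = lrt_p_value(lnl_null=-12.70, lnl_alt=-10.78)
```

A small p-value indicates only that dN and dS differ at the site; whether this reflects positive or negative selection depends on which of the two fitted rates is larger.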

Single likelihood ancestor counting (SLAC) involves a maximum likelihood reconstruction of ancestral codon states using a phylogenetic tree input together with the MSA. The observed ratio of nonsynonymous to synonymous substitutions is then compared with an approximate estimate of the ratio expected under neutral evolution (Kosakovsky Pond and Frost, 2005; Sorhannus and Kosakovsky Pond, 2006). Therefore, if the algorithm counts more nonsynonymous substitutions than expected under neutrality, the selective pressure acting on the gene/domain is not adequately explained by the neutral model of evolution. For SLAC analysis, sites identified with a p value < 0.2 were considered under positive selection.
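The counting logic can be illustrated with a simplified stand-in for the test applied to each site: an upper-tail binomial probability of observing at least the counted number of nonsynonymous substitutions, given the proportion expected under neutrality. The function name, the counts and the expected proportion are hypothetical, and the sketch is not the Datamonkey implementation.

```python
from math import comb


def binomial_upper_tail(n_nonsyn, n_syn, p_nonsyn):
    """P(at least n_nonsyn nonsynonymous changes out of all counted changes).

    p_nonsyn: proportion of changes expected to be nonsynonymous under
              neutrality at this site (an input assumed to be known here)
    """
    total = n_nonsyn + n_syn
    return sum(
        comb(total, k) * p_nonsyn**k * (1.0 - p_nonsyn) ** (total - k)
        for k in range(n_nonsyn, total + 1)
    )


# Illustrative site: 5 nonsynonymous and 0 synonymous changes were counted.
p_value = binomial_upper_tail(n_nonsyn=5, n_syn=0, p_nonsyn=0.7)
```

With few counted substitutions per site the test has little power, which is one reason a lenient significance cutoff is often used with counting methods.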

In the internal fixed effects likelihood (IFEL) analysis, every codon is assigned a single rate of synonymous substitution and two nonsynonymous rates. These two nonsynonymous rates account for the difference between selection acting as the codon evolves along terminal branches (i.e. usually indicating recent bouts of evolutionary pressure) and the potentially different rate of evolution of the codon along internal branches (i.e. episodic or historical evolution). This manner of testing has been demonstrated to be effective in cases where strong diversifying selection is in effect, and so codons are likely to be highly divergent (Kosakovsky Pond, Frost, et al., 2006). For the IFEL algorithm, sites significant at the 0.1 level were considered under positive selection.

Fast Unconstrained Bayesian Approximation for Inferring Selection (FUBAR) is a hierarchical Bayesian method that makes extensive use of a Markov chain Monte Carlo (MCMC) algorithm to return robust results even in cases where the fitted model is a poor fit for the data. This is achieved by averaging over a large number of predefined site classes. This differs from the SLAC and IFEL methods, which both use a small number of discrete categories that are combined with probability distributions to estimate the likelihood of selective pressure acting on a region/site. By employing a large number of categories, the FUBAR algorithm does not require a re-assessment and re-sorting of the maximum likelihood function of the dataset. In effect, the a priori categories of the FUBAR algorithm deliver a much faster analysis than the maximum likelihood approaches of SLAC or IFEL (Murrell et al., 2013).

For FUBAR analysis, sites with a posterior probability of positive selection > 0.8 were taken to be under positive selective pressure.

The mixed effects model of evolution (MEME) algorithm allows the distribution of dN/dS to vary from site to site (the fixed effect) and also from branch to branch at a specific site (the random effect) (Murrell et al., 2012). This method uses phylogenetic models to describe the evolution of codon characters along a given branch of a phylogenetic tree by a continuous-time stationary Markov process.

For MEME analysis, codon-specific positive selection was accepted at a p-value < 0.1, and the inference of the lineages in which diversifying selection occurred at a given codon was performed using a Bayes empirical Bayes approach. Codons in which the Bayes factor was greater than 1 were considered as targets of episodic diversifying selection.
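The per-method cutoffs described in this section can be collected into one lookup, as in the sketch below. It reads the SLAC cutoff as p < 0.2, which is an assumption, and the per-site values are invented; the names `CUTOFFS`, `is_selected` and `supporting_methods` are hypothetical.

```python
# (kind, cutoff): "p" is significant when value < cutoff,
# "posterior" is significant when value > cutoff.
CUTOFFS = {
    "SLAC": ("p", 0.2),  # assumption: cutoff read as p < 0.2
    "IFEL": ("p", 0.1),
    "MEME": ("p", 0.1),
    "FUBAR": ("posterior", 0.8),
}


def is_selected(method, value):
    """Apply the cutoff for `method` to a single site-level value."""
    kind, cutoff = CUTOFFS[method]
    return value < cutoff if kind == "p" else value > cutoff


def supporting_methods(site_results):
    """Return, in alphabetical order, the methods that flag the site."""
    return sorted(m for m, v in site_results.items() if is_selected(m, v))


# Made-up values for one codon, for illustration only.
site = {"SLAC": 0.15, "IFEL": 0.30, "MEME": 0.04, "FUBAR": 0.91}
support = supporting_methods(site)
```

Listing the supporting methods per codon makes it easy to see which sites are flagged by several independent approaches.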

For GARD, MEME, IFEL, FUBAR and SLAC, the best-fit nucleotide substitution models were chosen using a genetic algorithm implemented in the Datamonkey suite (Delport et al., 2010).

Chapter 3 Coronavirus PV Production and Optimisation of
