
Phylogenetic signal and support assessment using a hidden markov model approach based on rna secondary structure information


Academic year: 2020


PHYLOGENETIC SIGNAL AND SUPPORT ASSESSMENT USING A HIDDEN MARKOV MODEL APPROACH BASED ON RNA SECONDARY STRUCTURE INFORMATION

FRANCISCO MÉNDEZ RIAÑO

UNIVERSIDAD DE LOS ANDES
FACULTAD DE CIENCIAS
DEPARTAMENTO DE CIENCIAS BIOLÓGICAS
BOGOTÁ D.C.
2008

Undergraduate Thesis by FRANCISCO MÉNDEZ RIAÑO

Submitted in partial fulfilment of the requirements for the degree of BIOLOGIST

Approved by

JUAN ARMANDO SÁNCHEZ, Marine Biologist, PhD
Thesis Committee Director
UNIVERSIDAD DE LOS ANDES

JORGE EDUARDO ORTIZ, Systems Engineer, PhD
Thesis Committee Co-Director
UNIVERSIDAD NACIONAL DE COLOMBIA

BOGOTÁ D.C., January 2008

Abstract

Background. The use of the less variable biological patterns found in RNA secondary structures has become common practice in phylogenetic inference and comparative mapping of genomes, because these structures are under stronger functional constraint than the primary sequence. Evaluating phylogenetic signal consists of testing whether a given set of comparative data shows a significant tendency for the sequences of taxa to resemble one another, against the null assumption of randomly structured data; it is therefore used as an indicator of how well an alignment data set can be expected to perform in phylogenetic inference. If phylogenetic signal content can be shown to relate to particular functional properties of secondary structures, such as the bonding promiscuity of nucleotides across alternative predicted secondary configurations of the same sequence, then better and more reliable markers and pivot positions could be obtained from secondary foldings for all comparative biological tasks. To this end, improvements that allow more accurate identification and higher resolution of secondary structure data, such as those available in the Hidden Markov Model (HMM) context, are of interest.

Results. By means of a rigorous alignment construction based on local pairwise alignments and the estimation of a stochastic model (HMM context), promiscuity indexes of in silico predicted secondary structures could be related to true molecular functions in eukaryotic 5’ETS-18S and 25S LSU rDNA fragments. After applying two phylogenetic signal measurement routines, it was found that the relative amount of phylogenetic signal is directly proportional to the ratio of parsimony-informative characters in the alignments, and that different functional regions of secondary structural patterns provide different amounts of phylogenetic signal, corresponding to the level of functional selective pressure exerted on them. Thus, secondary regions of low bonding promiscuity exhibited higher thermodynamic stability and a correspondingly higher relative amount of phylogenetic signal than more promiscuous regions. Additionally, when testing for improvement in phylogenetic inference, an increase in the relative amount of phylogenetic signal did not enhance bootstrap support on all of the clades in the most parsimonious tree topology of a given alignment data set. However, HMM-Viterbi annotation not only improved data resolution without producing a generalized increase or decrease in phylogenetic signal content, but also enhanced bootstrap support on every clade in the most parsimonious tree topology of a given alignment data set.

Conclusions. The HMM-Viterbi de-noising strategy as used here can be proposed as a reliable phylogenetic inference routine because it does not interfere with the amount of phylogenetic signal while enhancing the support and definiteness of phylogenetic topologies. This strategy, together with the assessment of functional secondary structure positional markers that improve the amount of phylogenetic signal, can be employed to develop future mixed methodologies to compare genomes accurately.

ASSESSMENT OF PHYLOGENETIC SUPPORT AND SIGNAL USING A HIDDEN MARKOV MODEL BASED ON RNA SECONDARY STRUCTURES

Resumen

Introduction. The use of more conserved biological patterns, such as those found in RNA secondary structures, has become a more common practice in phylogenetic inference and comparative genome mapping, because the selective pressure exerted on these structures is stronger than that exerted on primary sequences. Because the evaluation of phylogenetic signal consists of testing whether a given group of taxa tends to resemble other taxon sequences, under the assumption of randomly structured data, phylogenetic signal is used as an indicator of the positive qualities of alignments for generating better phylogenies. If the phylogenetic signal content were found to be related to functional properties of the secondary structure, such as the bonding promiscuity of nucleotides across the possible secondary configurations of a single sequence, this would mean that even better markers could be obtained from secondary structures for use in all kinds of comparative biological tasks, including phylogenetic inference. To this end, advances in the identification and resolution of secondary structure pattern data, such as those implementing Hidden Markov Models (HMM), have been valued.

Results. By means of a rigorous alignment based on local pairwise sequence alignments and the estimation of a stochastic model in the HMM context, it was found that the promiscuity indexes of secondary structures predicted in silico can be related to real molecular functions in eukaryotic fragments of 5’ETS-18S and 25S LSU rDNA. After applying two routines for measuring phylogenetic signal, it was found that its relative amount is directly proportional to the proportion of parsimony-informative characters in the alignments, and that the different functional regions within secondary structure patterns provide different amounts of phylogenetic signal, corresponding to the level of selective pressure exerted on the molecular function they perform. Thus, regions of low bonding promiscuity in the secondary structure exhibit greater thermodynamic stability and an equally greater relative amount of phylogenetic signal than regions of higher promiscuity. Additionally, regarding the test of improvement in phylogenetic inference using secondary structure information, it was found that the increase in the relative amount of phylogenetic signal did not raise bootstrap support in a generalized manner across the topology of the most parsimonious tree obtained for a given alignment. Nevertheless, the annotation of the data by the Viterbi algorithm (HMM context) not only increased data resolution without increasing or decreasing the phylogenetic signal content in any case, but also raised bootstrap support on every clade in the most parsimonious tree topology found for a given alignment.

Conclusions. The HMM-Viterbi strategy for increasing data resolution, as used here, can be proposed as a reliable routine in phylogenetic inference because it does not directly interfere with the amount of phylogenetic signal while increasing the support and definition of phylogenetic topologies. This strategy, together with the evaluation of the increase in phylogenetic signal provided by functional molecular markers in the secondary structure, can be used to develop mixed methodologies for comparing genomes with greater credibility and precision.

Acknowledgements

I wish to thank my thesis committee chairman, Juan Armando Sánchez, from whom I have received not only the knowledge that makes us better scientists but also the advice that makes us better people. Thanks to my thesis committee co-director, Jorge Eduardo Ortiz, for his fitting comments and suggestions, and to my colleagues at BIOMMAR for their helpful discussions. I deeply appreciate the unconditional help and love of my parents, my brother and my very best friends, from whom I have drawn all the support and courage without which I could never have conceived this work.

TABLE OF CONTENTS

ABSTRACT
RESUMEN
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
BACKGROUND
METHODS
RESULTS AND DISCUSSION
CONCLUSIONS
REFERENCES

LIST OF FIGURES

1) LSU promiscuity states posterior probability vs. secondary sequence alignment position plots
2) ETS promiscuity states posterior probability vs. secondary sequence alignment position plots
3) LSU consensus promiscuity states posterior probability vs. alignment position plots
4) ETS consensus promiscuity states posterior probability vs. alignment position plots
5) LSU complete (non-partitioned) nucleotide sequence, gapped and un-gapped, alignments overview
6) ETS complete (non-partitioned) nucleotide sequence, gapped and un-gapped, alignments overview
7) Secondary structure promiscuity partitioned LSU nucleotide sequence alignments overview
8) Secondary structure promiscuity partitioned ETS nucleotide sequence alignments overview
9) LSU promiscuity states sequence gapped alignments, with and without Viterbi annotation
10) LSU promiscuity states sequence un-gapped alignments, with and without Viterbi annotation
11) ETS promiscuity states sequence gapped alignments, with and without Viterbi annotation
12) ETS promiscuity states sequence un-gapped alignments, with and without Viterbi annotation
13) Character-states permutation assays on LSU and ETS data set matrices (primary and promiscuity states alignments): tree-length distributions
14) Random-trees assays on LSU and ETS data set matrices (primary and promiscuity states alignments): tree-length distributions
15) Most parsimonious phylogenetic topologies (trees) found for LSU and ETS data set matrices (primary and promiscuity states alignments): bootstrap support values equal to or greater than 50% are mapped on their corresponding clades

LIST OF TABLES

I. 25S LSU rDNA gene fragments employed and similarity patterns
II. 5’ETS-18S rDNA gene fragments employed and similarity patterns
III. Functionally related 25S LSU rRNA sites
IV. Functionally related 5’ETS sites
V. g1 test for the skewness of the tree-length distribution and data structure
VI. Emission probabilities matrix construction overview
VII. Transition probabilities matrix construction overview

Background

One of the premises of comparative genomics is the identification of functional patterns, previously unknown in other organisms, that can be used for the comparative mapping of molecular markers for agricultural, medical and scientific purposes. Until a few years ago, identifying these markers was a matter of optimization and stringency in the evaluation of similarities among nucleotide sequences [1]. However, differences in evolutionary rates within and between genomes blur correlated nucleotide patterns, increasing the probability of finding false positives among putative orthologous regions when similarity is inferred between highly divergent species. A more conservative approach has been the use of less variable (non-primary-sequence) patterns that reduce the genomic variance between species, such as those found in secondary structures, on which change is more functionally constrained than on the primary sequence. In fact, it is well known that a given secondary structure shape can tolerate a considerable amount of nucleotide mutation without significant phenotypic disturbance, mainly because of Compensatory Base Changes (CBCs) in the double-stranded regions of the structure [2].

Because few secondary structures determined by crystallography are available, procedures have been developed that link highly similar nucleotide patterns to already known structures from the nucleotide sequences of reference species in order to predict the secondary structure. By constraining the sequences to fold with the known bondings that produce the double-stranded regions, and by prohibiting bonding in the loops that form the single-stranded ones, the folding shapes at adjacent sites have been inferred using folding algorithms based on theoretical and empirical information, such as MFOLD [3]. The original Zuker algorithm finds only the optimal structure, that is, the one with the lowest equilibrium free energy (∆G), and this has been taken as the standard secondary structure for the sequence of the organism under study [4]. Nonetheless, the biologically correct structure is often not the calculated optimal structure, but rather a structure within a few percent of the calculated minimum energy [5]. The most recent versions of MFOLD introduce the Zuker suboptimal folding algorithm, which provides a holistic overview of the ‘behaviour’ of the sequence. By specifying certain suboptimal folding parameters, the energy dot plot returned by MFOLD shows the degree of promiscuity of a nucleotide at a specific position in the sequence, that is, the extent to which it bonds with different nucleotides across foldings within a given range of the lowest free energy structure [6]. Here it will be shown that these promiscuity indexes, besides being equivalent to the degree of stability of the secondary structure shape (shape character plasticity), have potential functional relevance at the molecular level. A shape region of low promiscuity (i.e. of high shape stability) inside the secondary structure pattern of a given nucleotide sequence may therefore be more likely to be found under any laboratory or cellular environmental conditions.

Thus, these low-promiscuity regions may be used as more accurate markers for comparative genome maps if the sequence data they contain are shown to carry significant phylogenetic signal that distinguishes them from randomly generated data, or if at least a relative change in signal is observed between data sets partitioned by promiscuity level.

Data matrices can be partitioned by selecting alignment positions according to the probability of a particular bonding promiscuity category given the sequence of quantitative promiscuity values (indexes) assigned to each nucleotide of the primary sequence of each taxon. Phylogenetic signal can then be evaluated on positions that are truly representative of a promiscuity category, which may contribute to knowledge about molecular evolution in secondary structures. It may also help to establish whether future methods can be developed to identify more trustworthy markers for comparative genomics through phylogenetic signal assessment.

According to the results for real taxon sequences in [7], a directly proportional relation can be suggested between the degree of support of an internal node, belonging to a certain clade in the most parsimonious phylogenetic tree of a given data set, and the quantity of phylogenetic signal accounted for by that clade in that data set. With this in mind, it will be evaluated here whether an enhanced quantity of phylogenetic signal accounted for by a set of characters within a taxon data matrix leads to a generalized enhancement of the degree of support, not only of the clades already well supported on un-partitioned data, but of all the internal nodes in the topology of the most parsimonious tree containing all the taxa included in the matrix.

In addition, an HMM (Hidden Markov Model) approach to improving data resolution will be tested here, based on the reported use of the Viterbi algorithm as an error-correction scheme for de-noising digital communication links and analogous sequential patterns by finding the maximum likelihood sequence given a Markovian model [8].

In this context, it will be considered whether Viterbi annotation affects the phylogenetic signal content, and whether these data, together with the partitions of differing phylogenetic signal quantity, have a relative effect on the bootstrap support values of a phylogeny. What is reported here may be important not only for discerning strongly supported real phylogenetic relations between organisms, but also for developing future strategies to optimize the information content of biological sequence data.
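The de-noising idea behind the Viterbi algorithm can be illustrated with a minimal Python sketch. It is not the implementation used in this work; the two-state model, its probabilities and the function name `viterbi` are hypothetical values chosen only to show how an ambiguous symbol is resolved by its context.

```python
import numpy as np

def viterbi(obs, init, trans, emit):
    """Return the most probable hidden state path for a symbol sequence."""
    obs = np.asarray(obs)
    n_steps, n_states = len(obs), len(init)
    # Work in log space; log(0) = -inf is intended for impossible events.
    with np.errstate(divide="ignore"):
        log_init, log_trans, log_emit = np.log(init), np.log(trans), np.log(emit)
    score = np.full((n_steps, n_states), -np.inf)
    back = np.zeros((n_steps, n_states), dtype=int)
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, n_steps):
        # cand[i, j]: best log score of a path in state i at t-1 moving to j.
        cand = score[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    # Trace the best path backwards from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Hypothetical toy model: symbol 1 can be emitted by either state.
init = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit = np.array([[0.7, 0.3, 0.0], [0.0, 0.3, 0.7]])
noisy = [0, 0, 1, 0, 0, 2, 2, 1, 2]
denoised = viterbi(noisy, init, trans, emit)
```

In this toy case the ambiguous symbol 1 is assigned to whichever state its neighbours support, which is the sense in which the decoded path is a de-noised version of the symbol sequence.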

Methods

Data mining and sequence alignment construction. To identify true positional homology, a conservative computational strategy was followed. The NCBI nucleotide database was probed with two Saccharomyces cerevisiae sequences, one corresponding to a fragment of the nuclear 25S LSU (Large Sub-Unit) rDNA and the other to a segment of the nuclear 5’ETS (External Transcribed Spacer)-18S rDNA, representing a conserved and a variable primary sequence region respectively, for contrast. Note that, although the ETS region is transcribed together with the 18S cistron, it is not part of the highly functionally constrained ribosome. Selected sequences, all from eukaryotic organisms, had to meet the criterion of putative homology inferred from primary sequence similarity stated in [1]: they had to match on at least a single site at an expect value of ≤ exp-15 using the local alignment algorithm BLASTN (except for Xenopus laevis, see Table 2). Furthermore, the most significant matching positions of the chosen sequences had to share a common overlapping segment with respect to the S. cerevisiae sequence (see Tables 1 and 2). Some sequences and crystal structure data were taken from other databases [9, 10], as indicated in the Table 1 footnotes. For brevity, the 25S LSU rDNA data set is referred to here as the LSU data set, and the 5’ETS-18S rDNA data set as the ETS data set. Multiple alignment construction was based on the pairwise alignments between each of the selected sequences and its corresponding S. cerevisiae query, as follows:

1. The indels and substitutions assigned by the local alignment algorithm determined the editing of the high-similarity matching regions, while the editing of the remaining fragments was determined by the indels and substitutions assigned by the global alignment algorithm (ClustalW, default parameters).
2. All of the S. cerevisiae query sequences that had been separately aligned with the selected sequences in step 1 were stacked together. They were then aligned by their coordinate numbers so that their positions were re-established, which is possible because they are all the same (S. cerevisiae) sequence.
3. The newly edited query sequence was re-aligned with each of the selected sequences, so that the new indels (from step 2) were transferred from the edited query sequence to the selected sequence without breaking the alignment relations already established between them (the indels and substitutions from step 1).
4. Finally, gapped positions in the constructed matrix were deleted.

This procedure was applied to the two gene data sets separately.

Secondary structure folding and promiscuity index calculation. The final sequence segments that made up the gapped multiple alignment of each gene data set (step 3 above) were submitted to the MFOLD v.3.2 web server [11]. Their lengths are shown in Tables 1 and 2. Nucleotide sequences were folded with suboptimal folding parameters (Window = 1, Percent sub-optimality = 100%), which required at least one difference between two energetically adjacent suboptimal structures and allowed the spectrum of possible foldings to extend through the maximum energy range permitted by the algorithm (∆∆G = 12 kcal/mol from the most negative calculated free energy value, ∆G). The MFOLD web server interface returned the number of times each nucleotide was linked with different nucleotide positions of the same sequence across the spectrum of possible structures. The promiscuity indexes were calculated as a ratio (from 0 to 1) within a given sequence, relative to the most promiscuous nucleotide position in that sequence, and were then rounded to one decimal place.

HMM stochastic model construction. To obtain a posterior probability criterion for partitioning the data sets and to develop a de-noising method, a stochastic model was built in the HMM (Hidden Markov Model) context. Because few population-level sequence data were available for some of the chosen species, no standard estimation of HMM parameters was made; an insight-based approach was followed instead. In this model, each of the three categories into which the promiscuity indexes (Symbols) were grouped, specifying low, middle and high levels of promiscuity, was defined as a State. The assumptions about promiscuity were as follows: a low level of promiscuity comprises index values of 0, 0.1, 0.2 and 0.3; a middle level comprises index values of 0.4, 0.5 and 0.6; and a high level comprises index values of 0.7, 0.8, 0.9 and 1. For identification purposes, low, middle and high promiscuity are referred to as RED, GREEN and BLUE respectively in the tables, and are shown in the corresponding colours in Figures 1, 2, 3, 4, 9, 10, 11 and 12.
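The index calculation and the direct assignment of states described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the input list of per-position bonding counts and the function names `promiscuity_indexes` and `direct_state` are hypothetical, and the actual values in this work came from the MFOLD web server output.

```python
from typing import List

def promiscuity_indexes(partner_counts: List[int]) -> List[float]:
    """Scale per-position counts to [0, 1] relative to the most promiscuous
    position of the same sequence, rounded to one decimal place."""
    top = max(partner_counts)
    if top == 0:
        return [0.0 for _ in partner_counts]
    return [round(count / top, 1) for count in partner_counts]

def direct_state(index: float) -> str:
    """Map an index to a state using the stated assumptions:
    0-0.3 low (RED), 0.4-0.6 middle (GREEN), 0.7-1 high (BLUE)."""
    tenths = round(index * 10)  # compare in integer tenths to avoid float issues
    if tenths <= 3:
        return "RED"
    if tenths <= 6:
        return "GREEN"
    return "BLUE"

# Hypothetical counts for a five-nucleotide fragment.
counts = [0, 3, 5, 10, 7]
indexes = promiscuity_indexes(counts)
states = [direct_state(i) for i in indexes]
```

The direct assignment shown here corresponds to the state sequences without Viterbi annotation; it labels each position independently of its neighbours.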

A single emission probability matrix was employed, containing the probabilities that a given symbol is observed in a given state; an overview of its construction is presented in Table 6. It was constructed so that the low and high promiscuity states could share symbols only with the middle promiscuity state; the low and high promiscuity states had no symbols in common. Because the transition probabilities (the probability of a state given the previous state) and the initial probabilities (the probability of starting the Markov chain in a given state) had to be estimated from a single sequence, a strategy based on state compositional ratios was employed. This required two important and biologically non-conflicting assumptions:

1. The Markov chain representing the sequence of promiscuity states must be time independent. That is, the probability of observing a given output from the dynamic programming algorithm used to calculate the posterior state probabilities and/or to obtain the de-noised sequence (described later in this section) at a given step of the sequence does not depend on the step number at which it is calculated. This is equivalent to assuming that the state compositional ratios of a given sequence, when recalculated by the algorithm at each step, are constant.
2. The transition probability matrix must be symmetric.

Therefore, the initial probability vector is the eigenvector of the transition matrix corresponding to an eigenvalue of one, and a system of equations can be set up to solve for the transition matrix. The data needed to solve all of the equations were the self-transition probabilities of each state, which were found to be well represented by the compositional ratio of that state in each sequence, since at compositional equilibrium those changes of state depend linearly on its frequency in the sequence [12]. Because of the heterogeneity in promiscuity state compositional biases among the sequences used (data not shown), a different initial probability vector and transition probability matrix were calculated for each sequence, rather than being estimated from the average state composition of all sequences. An overview of the construction of the transition matrix is presented in Table 7; the matrix scaling process was derived from [13, 14].

Data matrix partitioning. The aligned data sets were partitioned according to promiscuity state. First, the symbol sequences (promiscuity indexes) of each species' nucleotide sequence fragment in the gapped alignment were used, together with the emission probability matrix and the corresponding initial vectors and transition matrices, as inputs for posterior decoding. Posterior decoding combines the Forward and Backward algorithms of HMM theory [5]; its final output is the probability of each promiscuity state at each sequence position given the observed sequence of symbols. Posterior decoding was run for every sequence using the hidden Markov model functions of the Statistics Toolbox in MATLAB® v.7.4.0.287 (R2007a, The MathWorks).
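Posterior decoding can be sketched with a scaled forward-backward pass in Python. This is an illustration only and not the MATLAB toolbox routine used here; the emission, transition and initial values below are hypothetical placeholders (they are not the values of Tables 6 and 7), chosen so that RED and BLUE share symbols only with GREEN and so that the uniform initial vector is the eigenvalue-one vector of the symmetric transition matrix.

```python
import numpy as np

def posterior_decode(obs, init, trans, emit):
    """Return P(state at step t | whole symbol sequence), shape (T, S)."""
    obs = np.asarray(obs)
    n_steps, n_states = len(obs), len(init)
    fwd = np.zeros((n_steps, n_states))
    bwd = np.ones((n_steps, n_states))
    scale = np.zeros(n_steps)
    # Forward pass, rescaled at each step to avoid numerical underflow.
    fwd[0] = init * emit[:, obs[0]]
    scale[0] = fwd[0].sum()
    fwd[0] /= scale[0]
    for t in range(1, n_steps):
        fwd[t] = (fwd[t - 1] @ trans) * emit[:, obs[t]]
        scale[t] = fwd[t].sum()
        fwd[t] /= scale[t]
    # Backward pass, reusing the forward scaling factors.
    for t in range(n_steps - 2, -1, -1):
        bwd[t] = (trans @ (emit[:, obs[t + 1]] * bwd[t + 1])) / scale[t + 1]
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

# Hypothetical model: states RED, GREEN, BLUE; symbols are indexes 0.0..1.0
# encoded as integers 0..10.
emit = np.zeros((3, 11))
emit[0, 0:4] = 0.225
emit[0, 4] = 0.10
emit[1, 3] = 0.10
emit[1, 4:7] = 0.80 / 3
emit[1, 7] = 0.10
emit[2, 6] = 0.10
emit[2, 7:11] = 0.225
trans = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
init = np.array([1 / 3, 1 / 3, 1 / 3])
post = posterior_decode([0, 1, 4, 9, 10], init, trans, emit)
```

Each row of `post` sums to one, so a per-position state probability of this kind is what a posterior probability threshold would be applied to when selecting positions for a partition.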

The outputs corresponding to gapped positions of the previously constructed multiple sequence alignment were discarded. A ‘consensus’ posterior probability for each state at each alignment position was then obtained by calculating the average and the median of the posterior probabilities, separately for each gene data set. The positions extracted from the nucleotide sequence alignment and joined to generate each promiscuity-based partition alignment were those whose median state posterior probability was equal to or higher than 0.7.

Promiscuity state sequence de-noising. Exploiting the ability of the Viterbi algorithm to enhance the resolution of sequence data, the most probable promiscuity state path (the most likely state sequence) was determined for the symbol sequence of every species. The symbol sequences and the previously estimated emission, transition and initial probability matrices were used as inputs to the Viterbi algorithm tool for hidden Markov model analysis in MATLAB® v.7.4.0.287 (R2007a, The MathWorks).

Phylogenetic signal measurement. The un-gapped multiple alignment matrices of the complete and promiscuity-partitioned nucleotide sequences, the promiscuity state sequences obtained by direct assignment from the promiscuity assumptions (applied to the symbol sequences, without Viterbi annotation), and the Viterbi promiscuity state sequences were analyzed using PAUP 4.0 Beta v.10 [15]. The signal structure of each matrix was tested against a random one by two routines:

1. The random-trees option was used to sample trees randomly (with replacement) from the set of all possible trees that can be assembled given the total number of terminal labelled taxa in the matrix.
2. The permute option was used to specify the Permutation Tail Probability (PTP) test, which performs a given number of random permutations of the data within the matrix columns (character states).

For the former option, 10^6 trees were sampled, and their lengths were computed under the Maximum Parsimony (MP) optimality criterion for each of the matrices of both gene data sets. From the resulting tree-length distribution, the g1 statistic, which measures the skewness of the distribution, was computed. Critical values of the g1 statistic, beyond which a non-random data structure can be suggested, were taken from [7], using the values corresponding to the closest data conditions (where conditions were highly dissimilar, the most negative values were used). For the latter option, the ETS gene data set matrices were permuted nine thousand times, except for the ETS GREEN partition (ninety thousand times), and the LSU gene data set matrices ninety thousand times. For each permuted matrix, an optimal (minimum-length) tree was then searched for with the branch-and-bound (bandb) algorithm for the ETS gene data set matrices, and with a heuristic search for the LSU gene data set matrices because an exhaustive search was impractical, using Maximum Parsimony (MP) as the optimality criterion to calculate tree lengths. From the optimal trees obtained in the permutation test, a tree-length distribution was built and a P-value was computed for the optimal tree length found for the unpermuted (original) data.
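Both statistics operate on lists of tree lengths, and their arithmetic can be sketched in Python. The sketch assumes the tree lengths have already been computed elsewhere (here, by PAUP); the function names and the example numbers are hypothetical, and the P-value follows one common PTP convention in which the original data set is counted among the replicates, which may differ in detail from what PAUP reports.

```python
from typing import Sequence

def g1_skewness(lengths: Sequence[float]) -> float:
    """Skewness g1 = m3 / m2**1.5, with m2 and m3 the second and third
    central moments of the tree-length distribution."""
    n = len(lengths)
    mean = sum(lengths) / n
    m2 = sum((x - mean) ** 2 for x in lengths) / n
    m3 = sum((x - mean) ** 3 for x in lengths) / n
    return m3 / m2 ** 1.5

def ptp_value(observed: float, permuted: Sequence[float]) -> float:
    """Proportion of data sets (permuted plus original) whose optimal tree
    is as short as, or shorter than, the optimal tree of the original data."""
    as_short = sum(1 for length in permuted if length <= observed)
    return (as_short + 1) / (len(permuted) + 1)

# Hypothetical tree lengths: a left-skewed sample and a small permutation run.
random_tree_lengths = [10, 10, 10, 4]
skew = g1_skewness(random_tree_lengths)
p_value = ptp_value(50, [60, 61, 59, 62])
```

A strongly negative g1 indicates a left-skewed distribution, with few trees much shorter than the rest, which is the pattern compared against the critical values from [7].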

Estimation of clade bootstrap support. The bootstrap analysis was performed with ten thousand bootstrap replicates of the matrices (re-sampling with replacement the characters, i.e. the alignment positions) for the ETS and LSU data sets, except for the LSU RED and GREEN partitions (twenty thousand replicates), in PAUP 4.0 Beta v.10 [15]. For each replicate, an optimal (minimum-length) tree was searched for with the branch-and-bound (bandb) algorithm for the ETS gene data set matrices, and with a heuristic search for the LSU gene data set matrices because an exhaustive search was impractical, using Maximum Parsimony (MP) as the optimality criterion to calculate tree lengths. Without specifying topological constraints, the clades (groups of taxa) that occurred in at least 50% of the optimal trees from all bootstrap replicates were marked, with the corresponding proportion of supporting bootstrap replicates, on the most parsimonious tree found for each original matrix by branch-and-bound search under the MP optimality criterion.
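The two bookkeeping steps of the bootstrap, column re-sampling and clade frequency counting, can be sketched in Python. The tree search itself is left out, since it was carried out in PAUP; the function names, the toy alignment and the representation of a tree as a list of clades (each a frozenset of taxon names) are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, FrozenSet, List

def bootstrap_replicate(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Re-sample alignment columns with replacement, keeping each column intact."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in columns) for taxon, seq in alignment.items()}

def clade_support(trees: List[List[FrozenSet[str]]], threshold: float = 0.5) -> Dict[FrozenSet[str], float]:
    """Proportion of replicate trees containing each clade, kept if >= threshold."""
    counts = Counter(clade for clades in trees for clade in set(clades))
    return {clade: n / len(trees) for clade, n in counts.items() if n / len(trees) >= threshold}

# Hypothetical two-taxon alignment and four replicate trees.
alignment = {"A": "ACGT", "B": "ACGA"}
replicate = bootstrap_replicate(alignment, random.Random(0))
trees = [[frozenset("AB")], [frozenset("AB")], [frozenset("AC")], [frozenset("AB")]]
support = clade_support(trees)
```

With these toy trees only the clade {A, B} passes the 50% threshold, with a support of 0.75.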

Results and Discussion

Secondary structure patterns and data matrix partitioning.

All of the aligned, un-gapped secondary sequences (sequences of promiscuity indexes) were plotted against the posterior probability of their promiscuity states, as shown in Figure 1 for the LSU sequences and in Figure 2 for the ETS sequences. Given the high primary sequence conservation among the LSU sequences, similar secondary structure patterns were expected between any two primary sequences, as was observed between S.paradoxus (SACPARADO) and K.lodderae (KLULODDER) (whose E-values were almost equally distant from S.cerevisiae and which shared a similar S.cerevisiae secondary phenotype; 100% high-similarity-match extension on SCEREVISI, see Table 1). Likewise, given the low primary sequence conservation among the ETS sequences, dissimilar secondary structure patterns were expected between any two primary sequences, as was observed between G.morsitans (TSETSEFLY) and D.melanogaster (DROSOPHIL) (whose E-values were almost equally distant from S.cerevisiae and neither of which shared a similar S.cerevisiae secondary phenotype; see Table 2). However, it was also interesting to observe how dissimilar the secondary structure patterns could be in the LSU data set between two primary sequences such as those of P.igniarius (PHEIGNIAR) and A.bisporigera (AMABISPOR) (whose E-values were almost equally distant from S.cerevisiae, and of which only P.igniarius shared a nearly similar S.cerevisiae secondary phenotype; 100% high-similarity-match extension on SCEREVISI, see Table 1), and how similar they could be in the ETS data set between two primary sequences such as those of K.lactis (KLULACTIS) and X.laevis (XENLAEVIS) (whose E-values were differently distant from S.cerevisiae and which shared a similar S.cerevisiae secondary phenotype; see Table 2), even though X.laevis was the only sequence that did not meet the criterion E-value<exp-15. This suggests that the promiscuity character is a legitimate indicator of the behaviour of the secondary-level phenotype: it can be changed by a few critical mutations, but it can also tolerate the accumulation of many mutations up to a certain point, after which the old secondary structure phenotype breaks down and gives way to a new one.

Furthermore, in the plots of the median and average state posterior probabilities for LSU and ETS, the low promiscuity regions (RED) are grouped into well-defined blocks (mostly above a posterior probability of 0.7), and they also constitute an important part of the secondary structure patterns observed, such as the 'hairpins' drawn by the lines in the plots (Figures 3 and 4). This suggested that a real molecular function may be involved in the observed promiscuity patterns, which was confirmed by checking the annotations of S.cerevisiae sequences in databases and reports [16, 17] (see Tables 3 and 4); the annotated sites mapped approximately to the immediate periphery of these blocks (see Figures 3 and 4). These low promiscuity regions are therefore thermodynamically very stable regions, to such a degree that only their flanking (less stable) regions allow other molecules, such as those involved in pre-mRNA processing and pre-rRNA processing, to bind and perform structurally important changes on the molecule. Moreover, the definiteness and the noticeable extension limits observed, especially in the low promiscuity blocks, seem to be characteristics that strongly restrict the primary sequence (embedded within the low promiscuity block) from adopting an uncertain secondary configuration, so that a well-defined configuration can fold and the molecular functions can be carried out permanently and effectively.

In addition to the molecular evidence already mentioned, at least the folded ETS data set sequences had a cistron-transcription sense: besides the 5'ETS region (involved in SSU rRNA maturation), these sequences also included a small segment of the 18S Small Sub-Unit (SSU) rRNA, which needs the 5'ETS to fold appropriately in the cellular system (see Figures 5 and 6). Thus, had the folded sequences not included the same coherent cistron-transcription sequence sense, the function-related sites found in S.cerevisiae would surely not have been mapped onto the secondary patterns as accurately as they were (compare the un-gapped alignment positions of Tables 3 and 4 with Figures 3 and 4).

In this way, such an identification process can tentatively be used in comparative genomics, all the more so considering that functional sites such as the mapped 5'ETS Fungi Motif (used for cleavage of the ETS region), although variable in primary sequence across some lower animals, are conserved in secondary structure in a positionally homologous region (A' and A0 pre-RNA cleavage sites) of the protozoan parasite Trypanosoma brucei [18]. Whether this structural bonding promiscuity has a real meaning for secondary structure capabilities should be tested further by determining whether position 106 of the LSU un-gapped alignment is directly or indirectly implicated with an existing snoRNA (small nucleolar RNA, a modification guide molecule), as the results reported here suggest, at least for S.cerevisiae (see Table 3).

With this in mind, a biologically meaningful partition of the alignment data seems a reasonable approach to the molecular evolution and phylogenetic signal content of the secondary structural patterns. In the partitioned nucleotide sequence alignments, the low promiscuity partitions (RED partitioned) were nearly as variable as the middle promiscuity partitions (GREEN partitioned) for both gene data sets (Figures 7 and 8). The high promiscuity partition (BLUE partitioned), in contrast, showed high primary sequence conservation across the taxa of the LSU data set (no reliable high promiscuity positions were found in the ETS data set). This seems logical from the perspective of how the promiscuity indexes are determined: if the reference point (the most promiscuous position or block) from which the low promiscuity regions are established is not constant relative to the other sequences, the secondary structure pattern cannot be preserved through the alignment.

Promiscuity states sequence data de-noising.

The alignments of the Viterbi de-noised promiscuity state sequences and of the directly assigned promiscuity categories (without Viterbi annotation) are shown in Figures 9 and 10 for the LSU data set and in Figures 11 and 12 for the ETS data set. The Viterbi annotation had a subtle effect on the alignment as a whole but a marked effect on localized zones of it, intensifying shared similarities and differences between the sequences through the expansion and/or contraction of the low (RED), middle (GREEN) and high (BLUE) promiscuity blocks within each sequence.
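The expansion and contraction of promiscuity blocks can be illustrated with a minimal log-space Viterbi sketch in Python for a three-state HMM. The transition and emission probabilities below are hypothetical values chosen only to show the de-noising effect; they are not the parameters used in this work.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path for a sequence of observations."""
    # Log-probability of the best path ending in each state at the first position.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = []
    for obs in observations[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best predecessor state for a path that is in state s at this position.
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs]))
            pointers[s] = prev
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical parameters: R = low, G = middle, B = high promiscuity.
STATES = ["R", "G", "B"]
START = {s: 1 / 3 for s in STATES}
TRANS = {a: {b: (0.9 if a == b else 0.05) for b in STATES} for a in STATES}
EMIT = {a: {b: (0.8 if a == b else 0.1) for b in STATES} for a in STATES}

# An isolated G inside a RED block is absorbed into the block: prints RRRRRRR
print("".join(viterbi("RRRGRRR", STATES, START, TRANS, EMIT)))
```

With these values a single discordant position is treated as noise, whereas a run of several consecutive G observations is long enough to be kept as its own block.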

Effect of partitions and data de-noising on phylogenetic signal quantity measurement.

From the P-values obtained by permuting the character-states of the nucleotide sequence alignments (4 character-states) and of the promiscuity states alignments (3 character-states), an evident non-random phylogenetic signal content is inferred, since the permuted data generated highly sub-optimal trees, for all data matrices except the ETS middle promiscuity partitioned primary sequence alignment (GREEN partition, P>0.05) (see Figure 13). No distribution plot is shown for the LSU high promiscuity partition (BLUE partition) because this matrix contains no synapomorphies from which to build phylogenetic relations in a tree. However, no partition effect can be assessed from these results, because no tree-length distribution generated from random data is available from which to determine a critical P-value, or a descriptive statistic, beyond which the optimal tree obtained from the un-permuted data can be considered to derive from a non-randomly structured data set, and against which the distance of that optimal tree from the critical value can be determined (this approach could be carried out for the random-trees assay, as described below).

The absolute values of the observed g1 statistic in the random-trees assay agreed with the results of the permutation assay in indicating a non-random phylogenetic signal content in all data set matrices, with the exception of the ETS middle promiscuity partitioned primary sequence alignment (GREEN partition) (see Figure 14).

In this case, however, a critical g1 value inferred from tree-length distributions generated by random data was already available in [7].

The difference between the observed g1 statistic and the critical one showed a common contrast between the low (RED) and the middle (GREEN) promiscuity partitions of both genes. In both gene data sets, the low promiscuity partitioned primary sequence alignment provided a greater phylogenetic signal quantity relative to the critical statistic value (i.e. a more negative difference between observed g1 and critical g1) than the middle promiscuity partition did (see Table 5). This result seems to be related to the relative quantity (ratio) of parsimony-informative characters: in both gene data sets, the proportion of parsimony-informative characters in the low promiscuity partition (RED partitioned) is larger than in the middle promiscuity partition (GREEN partitioned), and even larger than in the complete (un-partitioned) primary sequence alignment (see Table 5).

Whether it is convenient to partition the nucleotide sequence alignments depends on whether the absolute measure of phylogenetic signal needs to be enhanced, as can be deduced from the larger phylogenetic signal quantity in both un-partitioned ETS and LSU data sets (see Table 5). More important, however, is to enhance the parsimony-informative character ratio of a data set by choosing functionally representative alignment positions, in order to improve phylogenetic inference. In fact, there is already evidence that a greater number of parsimony-informative characters has a positive effect on phylogenetic signal quantity [19], which agrees with what is suggested here. As mentioned previously, it therefore seems reasonable that parsimony-informative character ratios could be even more reliable factors of phylogenetic signal quantity than the absolute number of parsimony-informative characters.

Therefore, when partitioned and un-partitioned data sets are compared for both genes, the relative phylogenetic signal quantity (difference between the observed g1 and the critical g1) appears to be overestimated in the un-partitioned (complete) nucleotide sequence alignments, if a directly proportional relation is presumed between the parsimony-informative character ratio (taken as %) and the quantity of phylogenetic signal beyond the critical statistic value (77.2%<86.1% in spite of 0.620313>0.24938 in the ETS data set, and 26.5%<31.7% in spite of 0.251448>0.146368 in the LSU data set; see Table 5). It is also interesting to note that the amount of overestimation, relative to the phylogenetic signal quantity expected from the corresponding parsimony-informative character ratio, is notably greater in the ETS un-partitioned nucleotide sequence alignment than in the LSU one. Such a deviation could only be explained by a simulation showing that artificially generated, less randomly structured data are more likely to be found in a matrix of fast-evolving sequences, like those of the 5'ETS fragment, than in a matrix of slow-evolving nucleotide sequences, like those of the 25S LSU fragment. No such simulation was performed here, but it would be interesting to carry it out in future work.
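For reference, the parsimony-informative character ratio discussed above can be computed with a short Python sketch. It assumes the usual definition (a position is parsimony-informative when at least two character-states each occur in at least two taxa) and treats '-' and '?' as missing data; both the function names and the handling of missing data are assumptions made for illustration, not the PAUP* implementation.

```python
from collections import Counter

def is_parsimony_informative(column, missing="-?"):
    """True when at least two character-states each occur in at least two taxa."""
    counts = Counter(state for state in column if state not in missing)
    return sum(1 for k in counts.values() if k >= 2) >= 2

def informative_ratio(sequences):
    """Fraction of alignment positions that are parsimony-informative."""
    columns = list(zip(*sequences))
    informative = sum(is_parsimony_informative(col) for col in columns)
    return informative / len(columns)
```

Applied to a partition and to the complete alignment, the two ratios can then be compared directly, independently of the number of positions in each matrix.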

Regarding the contrast between the noised and the Viterbi de-noised promiscuity states sequence alignments, it can again be deduced that the relative reduction of parsimony-informative characters is the principal reason why the matrices seem to be more randomly structured.

Concerning the effect of the Viterbi annotation on the phylogenetic signal quantity of the data sets, two contrasting results were observed:

1. As a consequence of the expansion and contraction of the low, middle and high promiscuity blocks in each sequence, the Viterbi annotation generally enhanced the level of consensus (i.e. presence of a most commonly occurring character-state at each position of an aligned series of sequences) at the majority of alignment positions in both gene data sets (see consensus bars in Figures 10 and 12), but it also diminished the level of consensus at alignment positions where the consensus was already conspicuously low.
2. However, because the level of consensus was already higher at the alignment positions of the LSU data set than at those of the ETS data set, the enhancements and reductions of consensus levels were more visually noticeable in the former than in the latter (see Figures 10 and 12).

Therefore, although this annotation apparently generated more random-looking data in the case of the ETS data set, it merely made evident that this alignment contained more autapomorphic (uninformative) and constant characters (i.e. fewer parsimony-informative characters) than expected from a less randomly structured data set (see Table 5).
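The level of consensus referred to above can be quantified per alignment position as the frequency of the most common character-state. The following Python sketch is one simple way to do so; it is an assumption about how the consensus bars of Figures 10 and 12 could be reproduced, not a description of the software that drew them.

```python
from collections import Counter

def consensus_levels(sequences):
    """Frequency of the most common character-state at each alignment position."""
    return [Counter(column).most_common(1)[0][1] / len(column)
            for column in zip(*sequences)]
```

Computing these levels for the alignments with and without Viterbi annotation, and subtracting one list from the other position by position, would show where the annotation raised or lowered the consensus.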

Effect of partition and data de-noising on phylogenetic tree's bootstrapping support values.

For strategies that aim to reveal new, strongly supported phylogenetic relations by functional partitioning of nucleotide sequence alignments, it would be desirable for data sets containing more phylogenetic signal to also show a generally strong bootstrap support for every clade of the optimal tree obtained from them. However, this was not the case for the LSU low promiscuity partitioned matrix (LSU RED partition) and the LSU middle promiscuity partition matrix (LSU GREEN partition): although the low promiscuity partition provided a greater phylogenetic signal, this did not correspond to a greater number of more strongly supported clades mapped on the tree (see Figure 15). The same behaviour was observed between the ETS Viterbi-annotated and the ETS non-Viterbi-annotated data sets, where the ETS data set without Viterbi annotation, despite being more structured in terms of phylogenetic signal, did not provide more numerous or stronger support values for its most parsimonious tree topology than the Viterbi-annotated ETS data set did. Thus, the only cause of the improvement in robustness and definiteness of the phylogenetic tree topology was the enhancement of data resolution by means of the Viterbi annotation, and not the functional partitioning of the alignments.

This may be one reason why some studies that take into account the partitioned nature of the data (heterogeneity among alignment positions), including those that use mixtures of substitution models such as BayesPhylogenies [20], have considered it premature to conclude that the application of phylogenetic mixture models always yields, whatever the case, a differential improvement in the support of phylogenetic relations that can lead to new insights about the phylogeny of a taxa data set [21].

In addition, although the clades that were already strongly supported in the un-partitioned ETS and LSU data sets (ETS_[STETANAKA, PMONOGYNU]→100%; LSU_[URECCAUPO, ARBPUNCTU]→98%; see Figure 15) were also well supported in the partitioned data sets, they were not as strongly supported as in the un-partitioned ones (ETS_[STETANAKA, PMONOGYNU]: RED Partition→92%, GREEN Partition→80%; LSU_[URECCAUPO, ARBPUNCTU]: RED Partition→52%, GREEN Partition→76%; see Figure 15). Thus, the partitions did not improve the bootstrap support of those clades, just as they did not improve the absolute value of the g1 statistic observed for the data matrices (ETS: Un-partitioned obs.g1=0.92, RED Partition obs.g1=-0.58, GREEN Partition obs.g1=-0.47; LSU: Un-partitioned obs.g1=-0.45, RED Partition obs.g1=-0.34, GREEN Partition obs.g1=-0.39; see Table 5).

Conclusions

The real role of the Viterbi annotation was only to improve the robustness of the topology implied by a given data set, whatever its phylogenetic signal 'condition'. It is also deduced that the relative amount of phylogenetic signal is directly proportional to the parsimony-informative character ratio of an alignment, and that different functional regions of secondary structural patterns can provide different amounts of phylogenetic signal, corresponding to the level of functional selective pressure exerted on them. Although these assays were designed to improve the phylogenetic signal content already present in the un-partitioned data set, by partitioning the alignment through the selection of characters with different functional relevance, they show that the bootstrap support values of the clades in the most parsimonious tree are not necessarily enhanced by increasing the relative amount of phylogenetic signal. It is suggested, however, that partitioning could have enhanced the support of the tree if it had also enhanced the absolute amount of phylogenetic signal.

The HMM-Viterbi de-noising method as used here can therefore be postulated as a reliable phylogenetic inference routine, because it enhances the support and definiteness of phylogenetic topologies without interfering with the amount of phylogenetic signal. This strategy, together with the assessment of functional secondary structure positional markers that improve the amount of phylogenetic signal, can be employed to develop future mixed strategies for accurately comparing genomes.

References

1. Fulton TM, Van der Hoeven R, Eannetta NT, Tanksley SD: Identification, analysis, and utilization of conserved ortholog set markers for comparative genomics in higher plants. Plant Cell 2002, 14(7):1457-1467.
2. Coleman AW: ITS2 is a double-edged tool for eukaryote evolutionary comparisons. Trends Genet 2003, 19(7):370-375.
3. Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res 1981, 9(1):133-148.
4. Aguilar C, Sánchez JA: Phylogenetic hypotheses of gorgoniid octocorals according to ITS2 and their predicted RNA secondary structures. Mol Phylogenet Evol 2007, 43(3):774-786.
5. Durbin R: Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge, UK; New York: Cambridge University Press; 1998.
6. Zuker M, Jacobson AB: Using reliability information to annotate RNA secondary structures. RNA 1998, 4(6):669-679.
7. Hillis DM, Huelsenbeck JP: Signal, noise, and reliability in molecular phylogenetic analyses. J Hered 1992, 83(3):189-195.
8. Viterbi AJ: A personal history of the Viterbi algorithm. Signal Processing Magazine, IEEE 2006, 23(4):120-142.
9. Christie KR, Weng S, Balakrishnan R, Costanzo MC, Dolinski K, Dwight SS, Engel SR, Feierbach B, Fisk DG, Hirschman JE et al: Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res 2004, 32(Database issue):D311-314.
10. Cannone JJ, Subramanian S, Schnare MN, Collett JR, D'Souza LM, Du Y, Feng B, Lin N, Madabusi LV, Muller KM et al: The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 2002, 3:2.
11. Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003, 31(13):3406-3415.
12. Miyamoto MM, Cracraft J: Phylogenetic analysis of DNA sequences. New York: Oxford University Press; 1991.
13. Holmes I: A probabilistic model for the evolution of RNA structure. BMC Bioinformatics 2004, 5:166.
14. Foster PG: "The Idiot's Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed". Unpublished manuscript; 2001.
15. Swofford DL: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sunderland, Massachusetts: Sinauer Associates; 2000.
16. Piekna-Przybylska D, Decatur WA, Fournier MJ: New bioinformatic tools for analysis of nucleotide modifications in eukaryotic rRNA. RNA 2007, 13(3):305-312.
17. Chen CA, Miller DJ, Wei NV, Dai C: The ETS/IGS Region in a Lower Animal, the Seawhip, Junceella fragilis (Cnidaria: Anthozoa: Octocorallia): Compactness, Low Variation and Apparent Conservation of a Pre-rRNA Processing Signal with Fungi. Zoological Studies 2000, 39(2):138-143.
18. Hartshorne T, Toyofuku W: Two 5'-ETS regions implicated in interactions with U3 snoRNA are required for small subunit rRNA maturation in Trypanosoma brucei. Nucleic Acids Res 1999, 27(16):3300-3309.
19. Simmons MP, Zhang LB, Webb CT, Reeves A: How can third codon positions outperform first and second codon positions in phylogenetic inference? An empirical example from the seed plants. Syst Biol 2006, 55(2):245-258.
20. Pagel M, Meade A: A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst Biol 2004, 53(4):571-581.
21. Collins AG, Schuchert P, Marques AC, Jankowski T, Medina M, Schierwater B: Medusozoan phylogeny and character evolution clarified by new large and small subunit rDNA data and an assessment of the utility of phylogenetic mixture models. Syst Biol 2006, 55(1):97-115.

(40) Figure 1 extended. 40.


Figure 1. LSU promiscuity states posterior probability vs. secondary sequence alignment position, plots. The LSU sequences are displayed in order of primary sequence similarity relative to S.cerevisiae (E-value), from the most similar (top) to the most dissimilar (bottom). RED: Low, GREEN: Middle and BLUE: high promiscuity.

(46) Figure 2 extended. 46.


Figure 2. ETS promiscuity states posterior probability vs. secondary sequence alignment position, plots. The ETS sequences are displayed in order of primary sequence similarity relative to S.cerevisiae (E-value), from the most similar (top) to the most dissimilar (bottom). For S.cerevisiae, a dashed-line region indicates the 5' upstream region of the 18S rDNA, whose primary sequence resembles an evolutionarily conserved motif in fungi, as mentioned in [17]. RED: Low, GREEN: Middle and BLUE: high promiscuity.

Figure 3. LSU consensus promiscuity states posterior probability vs. alignment position, plots.

Figure 4. ETS consensus promiscuity states posterior probability vs. alignment position, plots.

Figure 5. LSU complete (non-partitioned) nucleotide sequence, gapped and un-gapped, alignments overview. Within the limits of the boxed region lies the low promiscuity region flanked by snR78 and snR67. The colours represent the four character-states.

Figure 6. ETS complete (non-partitioned) nucleotide sequence, gapped and un-gapped, alignments overview. Notice the primary sequence conservation of the 18S fragment at the end of the alignment. Within the limits of the boxed region lies the 5'ETS fungi motif region. The colours represent the four character-states.

Figure 7. Secondary structure promiscuity partitioned LSU nucleotide sequence alignments overview. Top: LSU RED partition (120 positions), Left Bottom: LSU GREEN partition (71 positions), Right Bottom: LSU BLUE partition (12 positions). The colours represent the four character-states.

Figure 8. Secondary structure promiscuity partitioned ETS nucleotide sequence alignments overview. No reliable high promiscuity positions (posterior P≥0.7) were found to build the BLUE partitioned alignment. Top: ETS RED partition (137 positions), Bottom: ETS GREEN partition (30 positions). The colours represent the four character-states.

Figure 9. LSU promiscuity states sequence gapped alignments. In the 'WITHOUT Viterbi annotation' alignment, categories of promiscuity were directly assigned from the promiscuity assumptions, while in the 'WITH Viterbi annotation' alignment, categories were assigned through the Viterbi algorithm. Within the limits of the boxed region lies the low promiscuity region flanked by snR78 and snR67. From top to bottom the sequences correspond to: S.cerevisiae, S.paradoxus, K.lodderae, Z.smithiae, P.igniarius, A.bisporigera, B.nanus, S.ficus, H.circumcincta, C.coronatus, M.polymorpha, M.sativa, P.bachei, A.punctulata, P.annulata and U.caupo. The colours represent the three character-states (RED: Low, GREEN: Middle and BLUE: high promiscuity).

Figure 10. LSU promiscuity states sequence un-gapped alignments. In the 'WITHOUT Viterbi annotation' alignment, categories of promiscuity were directly assigned from the promiscuity assumptions, while in the 'WITH Viterbi annotation' alignment, categories were assigned through the Viterbi algorithm. Within the limits of the boxed region lies the low promiscuity region flanked by snR78 and snR67. From top to bottom the sequences correspond to: S.cerevisiae, S.paradoxus, K.lodderae, Z.smithiae, P.igniarius, A.bisporigera, B.nanus, S.ficus, H.circumcincta, C.coronatus, M.polymorpha, M.sativa, P.bachei, A.punctulata, P.annulata and U.caupo. Dark bars under the alignments represent the level of consensus (i.e. presence of a most commonly occurring character-state at each position of an aligned series of sequences). The colours represent the three character-states (RED: Low, GREEN: Middle and BLUE: high promiscuity).

Figure 11. ETS promiscuity states sequence gapped alignments. In the 'WITHOUT Viterbi annotation' alignment, categories of promiscuity were directly assigned from the promiscuity assumptions, while in the 'WITH Viterbi annotation' alignment, categories were assigned through the Viterbi algorithm. Within the limits of the boxed region lies the 5'ETS fungi motif region. From top to bottom the sequences correspond to: S.cerevisiae, K.lactis, H.wingei, D.melanogaster, G.morsitans, P.monogynus, S.jamesii, S.tanakae and X.laevis. The colours represent the three character-states (RED: Low, GREEN: Middle and BLUE: high promiscuity).

Figure 12. ETS promiscuity states sequence un-gapped alignments. In the 'WITHOUT Viterbi annotation' alignment, categories of promiscuity were directly assigned from the promiscuity assumptions, while in the 'WITH Viterbi annotation' alignment, categories were assigned through the Viterbi algorithm. Within the limits of the boxed region lies the 5'ETS fungi motif region. From top to bottom the sequences correspond to: S.cerevisiae, K.lactis, H.wingei, D.melanogaster, G.morsitans, P.monogynus, S.jamesii, S.tanakae and X.laevis. Dark bars under the alignments represent the level of consensus (i.e. presence of a most commonly occurring character-state at each position of an aligned series of sequences). The colours represent the three character-states (RED: Low, GREEN: Middle and BLUE: high promiscuity).

(59) Figure 13 extended. 59.


Figure 13. Character-states permutation assays on LSU and ETS data set matrices (primary and promiscuity states alignments). P-values correspond to the position of the tree obtained from the un-permuted data within the tree-length distribution obtained from the permuted data sets.
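The permutation assay summarized in Figure 13 can be outlined with a minimal Python sketch. It assumes that character-states are shuffled among taxa independently within each alignment position, and that the P-value is the proportion of data sets (the un-permuted one included) whose optimal tree is as short as or shorter than the observed one; this is one common convention, stated here as an assumption. The tree-length calculation itself is left out, and the function names are hypothetical.

```python
import random

def permute_columns(alignment, rng):
    """Shuffle character-states among taxa independently within each column.

    alignment: list of equal-length aligned sequences, one per taxon.
    """
    columns = [list(col) for col in zip(*alignment)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(row) for row in zip(*columns)]

def permutation_p_value(observed_length, permuted_lengths):
    """Proportion of data sets whose optimal tree is as short as the observed one.

    The un-permuted data set is counted as one of the data sets, so the
    smallest attainable value is 1 / (number of permutations + 1).
    """
    as_short = sum(1 for length in permuted_lengths if length <= observed_length)
    return (as_short + 1) / (len(permuted_lengths) + 1)
```

Each permuted matrix keeps the state frequencies of every position but destroys the covariation among positions, so a small P-value indicates that the observed tree is shorter than expected from randomly structured data.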

(65) Figure 14 extended. 65.


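Figure 14. Random-trees assays on LSU and ETS data set matrices (primary and promiscuity states alignments). The g1 value is the third central moment of the n tree lengths divided by the cube of their standard deviation: $g_1 = \frac{\sum_{i=1}^{n} (T_i - \bar{T})^3}{n \cdot s^3}$, where $T_i$ is the length of tree i, $\bar{T}$ is the mean tree length and s is the standard deviation of the tree lengths.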
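As a worked illustration of the g1 statistic defined in the caption of Figure 14, the following Python sketch computes it from a list of tree lengths. It assumes that s is the standard deviation computed with divisor n; using the sample standard deviation (divisor n-1) would change the value slightly.

```python
import math

def g1(tree_lengths):
    """Skewness of a tree-length distribution: third central moment over s cubed."""
    n = len(tree_lengths)
    mean = sum(tree_lengths) / n
    # Standard deviation with divisor n (an assumption, see the text above).
    s = math.sqrt(sum((t - mean) ** 2 for t in tree_lengths) / n)
    return sum((t - mean) ** 3 for t in tree_lengths) / (n * s ** 3)
```

A distribution with a long tail of short trees, which is the pattern interpreted as phylogenetic signal, gives a negative value; for example, the lengths 1, 5, 5, 5 give a g1 of about -1.15, whereas a symmetric distribution gives 0.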

(71) Figure 15 extended. 71.


Figure 15. Most parsimonious phylogenetic topologies (trees) found for each matrix. Only clades with a bootstrap support value equal to or greater than 50% have their values mapped. Blue: Animal; Yellow: Fungi; Green: Plant. A line segment at the bottom left of some trees (obtained from the consensus of multiple equally optimal trees) shows the value and scale of the parsimony distance in the tree.
