We now present two examples that demonstrate how tree constraints can be used to assess tree-compatibility. The two applications are in linguistics and phylogenetics, both fields in which latent tree models are used to describe evolutionary relationships. The examples shown here should be considered illustrative only. Although both applications explore questions typical of their subject areas and proceed in a sensible manner, we cannot give much weight to the conclusions of the analyses without consulting domain experts. With this proviso noted, we proceed with the applications, paying particular attention to the methodology that carries across to other data sets.
4.2.1 Tree-compatibility of Indo-European languages using binary random variables
The aim of this analysis is to determine whether four selected Indo-European languages (French, Italian, Spanish and Brazilian Portuguese) can be adequately described using a binary latent tree model. This is achieved by checking whether the inequality constraints derived in Section 4.1.2 are respected by the sample estimates from the data. Of course, it is important that a suitable set of binary random variables is selected to begin with. The data set is a subset of that used in Nicholls and Gray [2008] (where it is denoted Dyen et al. in Section 7.1 of that paper). The data set is based upon Dyen et al. [1997], which itself makes use of the famous Swadesh list of 200 word meanings [Swadesh, 1952]. The Swadesh list comprises word meanings that are known to have a low level of borrowing between languages; borrowing can be considered a linguistic equivalent of horizontal gene transfer in the genetic context. Thus the Swadesh list provides information about the historical relationships between languages that is largely focused on gradual evolutionary development. The data set was formed by taking one of 87 Indo-European languages and one of 200 word meanings and then identifying all words within that language with that particular meaning. This was then repeated for each language and meaning pair. Words with a shared meaning are said to be homologous if they are believed to share a common ancestor or origin. For each word meaning, words that are homologous (as judged by linguistic experts) are said to belong to the same cognate class. For example:
TABLE 4.1: Example of words with given meaning for each of four languages.

                        ‘all’       ‘to sit’    ‘to burn’
Brazilian Portuguese    todo(to)    -           queimar
French                  tout        asseoir     bruler
Italian                 tutto       sedere      ardere, bruciare
Spanish                 todo        sentasse    arder
TABLE 4.2: The corresponding cognate classes for the words in Table 4.1.

                        ‘all’       ‘to sit’    ‘to burn’
Brazilian Portuguese    c=1         -           c=3
French                  c=1         c=2         c=4
Italian                 c=1         c=2         c=4, 5
Spanish                 c=1         c=2         c=5
For the word meaning ‘all’ in English, an equivalent word was given in the data set for each of the four languages: ‘tout’, ‘tutto’, ‘todo’ and ‘todo(to)’. It can be read from the data set that these four words are deemed homologous and so they share a cognate class, denoted c = 1 (say). Now considering the verb ‘to sit’, we have the rare occurrence that the data set does not provide a word for one of the languages (or, in some circumstances, such a word does not exist). This means that there is no class code for Brazilian Portuguese for the word meaning ‘to sit’. This occurs rarely in the data set and is thus unlikely to materially affect the analysis. The final example word meaning is ‘to burn’, where two words are provided for Italian: ‘ardere’ and ‘bruciare’. There are three cognate classes, containing {bruler, bruciare}, {ardere, arder} and {queimar}. Notice that there are two cognate classes associated with Italian and ‘to burn’. The data set is presented as a 2665 × 87 binary data matrix Z where each row represents a cognate class c = 1, . . . , 2665 and each column relates to one of the 87 languages. If language j has a word in cognate class i then Z[i, j] = 1; otherwise absence is indicated by −1 or 0, depending on the binary coding chosen. Hence, the data matrix relating to the example in Tables 4.1 and 4.2 is

\[
\begin{pmatrix}
 1 &  1 &  1 &  1 \\
-1 &  1 &  1 &  1 \\
 1 & -1 & -1 & -1 \\
-1 &  1 &  1 & -1 \\
-1 & -1 &  1 &  1
\end{pmatrix}
\]
where the columns are ordered Brazilian Portuguese, French, Italian and Spanish.
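As a minimal sketch of this construction, the small matrix above can be rebuilt from the cognate classes of Table 4.2. The names `cognate_classes` and `build_matrix` are hypothetical and chosen only for illustration; the dictionary is copied by hand from Table 4.2 and the −1 coding for absence is assumed.

```python
# Illustrative sketch only: names are hypothetical, data copied from Table 4.2.
LANGUAGES = ["Brazilian Portuguese", "French", "Italian", "Spanish"]

# For each language, the set of cognate classes in which it has a word.
cognate_classes = {
    "Brazilian Portuguese": {1, 3},
    "French": {1, 2, 4},
    "Italian": {1, 2, 4, 5},
    "Spanish": {1, 2, 5},
}


def build_matrix(classes, languages, n_classes, absent=-1):
    """Return an n_classes x len(languages) matrix as a list of rows.

    Entry [i][j] is 1 if language j has a word in cognate class i + 1,
    and `absent` (assumed to be -1 here, 0 is the other common choice) otherwise.
    """
    return [
        [1 if c in classes[lang] else absent for lang in languages]
        for c in range(1, n_classes + 1)
    ]


Z_example = build_matrix(cognate_classes, LANGUAGES, n_classes=5)
for row in Z_example:
    print(row)
```

Passing `absent=0` gives the alternative binary coding mentioned in the text.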
We denote the submatrix of Z that contains the four languages of interest as Z∗. We treat each of the languages as an observed variable, with each of the cognate classes being considered as an observational unit. We can then assess whether Z∗ is compatible with a binary latent tree model using the constraints described in Section 4.1.1 and Section 4.1.2.
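The required central moment and mean estimates for checking for tree constraint violations are provided below (where the four languages are coded 1 = Brazilian Portuguese, 2 = French, 3 = Italian and 4 = Spanish).

\[
\begin{aligned}
&\hat{\mu}_1 = -0.8507, \quad \hat{\mu}_2 = -0.8544, \quad \hat{\mu}_3 = -0.8454, \quad \hat{\mu}_4 = -0.8522, \\
&\hat{\sigma}_{12} = 0.1839, \quad \hat{\sigma}_{13} = 0.1991, \quad \hat{\sigma}_{14} = 0.2256, \quad \hat{\sigma}_{23} = 0.2101, \quad \hat{\sigma}_{24} = 0.1916, \quad \hat{\sigma}_{34} = 0.2097, \\
&\hat{\sigma}_{123} = 0.2951, \quad \hat{\sigma}_{124} = 0.2953, \quad \hat{\sigma}_{134} = 0.3152, \quad \hat{\sigma}_{234} = 0.3017, \\
&\hat{\sigma}_{1234} = 0.3470
\end{aligned}
\]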
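Estimates of this kind could be computed along the following lines. This is a sketch under stated assumptions rather than the procedure actually used: it assumes plug-in estimators that divide by the number of observational units n, and the function names are hypothetical. The toy matrix is the 5 × 4 example from Tables 4.1 and 4.2, not Z∗, so the printed values will not match those above.

```python
from itertools import combinations
from math import prod


def sample_means(data):
    """Column means of a data matrix given as a list of rows."""
    n = len(data)
    return [sum(row[j] for row in data) / n for j in range(len(data[0]))]


def central_moment(data, cols, means=None):
    """Plug-in estimate of E[prod_{j in cols} (X_j - mu_j)], dividing by n.

    `cols` holds 0-based column indices. Dividing by n (rather than an
    unbiased correction) is an assumption made for this sketch.
    """
    if means is None:
        means = sample_means(data)
    return sum(prod(row[j] - means[j] for j in cols) for row in data) / len(data)


def all_central_moments(data):
    """All central moments of order 2..k, keyed by 1-based variable labels."""
    means = sample_means(data)
    k = len(data[0])
    return {
        tuple(j + 1 for j in cols): central_moment(data, cols, means)
        for order in range(2, k + 1)
        for cols in combinations(range(k), order)
    }


# Toy data: the example matrix for Tables 4.1 and 4.2 (rows = cognate classes).
Z_toy = [
    [1, 1, 1, 1],
    [-1, 1, 1, 1],
    [1, -1, -1, -1],
    [-1, 1, 1, -1],
    [-1, -1, 1, 1],
]
moments = all_central_moments(Z_toy)
print(sample_means(Z_toy))
print(moments[(1, 2)], moments[(1, 2, 3)], moments[(1, 2, 3, 4)])
```

For four variables this yields six second-order, four third-order and one fourth-order moment, matching the list of estimates reported above.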
The full set of inequality constraints was evaluated, namely (4.1.7), (4.1.8), (4.1.12)–(4.1.14) and (4.1.23). All of the constraints were satisfied with the exception of (4.1.23), which was violated. This suggests that the four languages are not tree-compatible, but that any three of the languages are indeed T3-compatible, as all of the tripod constraints are satisfied. We discuss some of the
4.2.2 Assessing evolutionary history using the COI gene
Phylogenetic trees can be constructed using gene sequences, where the vectors of data are obtained from DNA sequences coded into binary. Each base in a sequence is one of four chemicals: either T or C (pyrimidines) or A or G (purines). Transversions (in which a sequence jumps from a pyrimidine to a purine or vice versa) occur at a lower rate than transitions (jumps within the pyrimidines or within the purines) and so transversions can be considered of more interest [Yang, 2007]. We can thus encode T and C as 1, and A and G as −1 [Vij and Biswas, 2005, p. 8]. When fitting phylogenetic trees to data there has been some use of constraints implied by conditional independence (e.g. Casanellas and Fernández-Sánchez [2007]); however, without the inequality constraints given in Settimi and Smith [2000] and Zwiernik and Smith [2011] this can lead to erroneously fitting a tree to data. The sequences used for genetic analyses usually have hundreds of entries. For example, Barcode of Life Data Systems (BOLD Systems) requires sequences with a minimum of five hundred base pairs (BPs) [Ratnasingham and Hebert, 2007a].
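The binary encoding described above can be sketched as follows. The function name `encode_sequence` is hypothetical, and the decision to raise an error on any symbol other than T, C, A or G (for example ambiguity codes or gaps) is an assumption of this sketch, since the text does not say how such symbols are handled.

```python
PYRIMIDINES = {"T", "C"}  # encoded as 1
PURINES = {"A", "G"}      # encoded as -1


def encode_sequence(seq):
    """Encode a DNA string as a list of +1 (pyrimidine) / -1 (purine) values."""
    encoded = []
    for base in seq.upper():
        if base in PYRIMIDINES:
            encoded.append(1)
        elif base in PURINES:
            encoded.append(-1)
        else:
            # Assumption: anything outside T, C, A, G is rejected outright.
            raise ValueError(f"unexpected base: {base!r}")
    return encoded


print(encode_sequence("TCAG"))
```

Under this coding a transition (T to C, or A to G) leaves the encoded value unchanged, so only transversions alter the binary sequence.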
The genetic data obtained from BOLD Systems [Ratnasingham and Hebert, 2007b] are from a particular region of the mitochondrial gene cytochrome c oxidase I (COI). Hebert et al. [2004] published the first practical paper using this gene region, suggesting it “as a DNA barcode for the identification of animal species”, and since then COI has had increasingly widespread use in animal species classification.
To illustrate the technique we consider the unresolved problem of how to model the evolution of placental mammals. For example, Teeling and Hedges [2013] recently surveyed the competing theories as to the ancestral root of placental mammals and found that, despite advances in phylogenetic techniques and data sizes, there remain three serious possibilities. The disagreement is about the ordering of three groups of species called clades. A clade is a group of all species that are descendants of a common ancestor (and does not exclude any descendants). The three clades of interest are boreoeutheria, xenarthra and afrotheria.
In Teeling and Hedges [2013, Figure 1] the three proposed orderings of the clades are presented, and Figure 4.6 here generalises the graph; the question is then the location of each clade in the positions A, B and C. However, this precludes the possibility that a latent tree model is not the appropriate model class. When the tree is read as a rooted BN, a necessary condition for tree-compatibility is that tree constraints hold for any set of extant species within these groups. The minimum analysis involves selecting a species from each of these three clades and using the binary encoding of each COI gene to test them against the tripod inequality constraints. It is this approach that we use in our example. The proof that the tripod constraints apply is provided by Theorem 4.3.1. This minimal analysis is undoubtedly a simplification but the principle is correct. A more detailed analysis might involve a larger number of species, more constraints, and more extensive sections of genetic data.

FIGURE 4.6: Outline of possible placental mammal phylogenetic tree, where A, B and C are also trees and represent each of the three clades.
In our example we select Pongo pygmaeus (Bornean orangutan) from boreoeutheria, Dasypus novemcinctus (nine-banded armadillo) from xenarthra and Loxodonta africana (African bush elephant) from afrotheria. In this instance the choice of particular species was arbitrary, though motivated by the images used in Teeling and Hedges [2013]. Clearly a large number of such choices could be tested in a full-blown analysis.
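The species are coded 1 = orangutan, 2 = armadillo and 3 = elephant.

\[
\begin{aligned}
&\hat{\mu}_1 = 0.1170, \quad \hat{\mu}_2 = 0.2226, \quad \hat{\mu}_3 = 0.3132, \\
&\hat{\sigma}_{12} = 0.4004, \quad \hat{\sigma}_{13} = 0.4049, \quad \hat{\sigma}_{23} = 0.2208, \\
&\hat{\sigma}_{123} = 0.1656
\end{aligned}
\]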
We find that the tripod constraints (4.1.7) and (4.1.8) are satisfied and so, based on this analysis, the data set is T3-compatible. Therefore, this offers no evidence against a tree adequately describing
the evolutionary history of the species and thus one of the three rootings of the tree surveyed in
4.2.3 Discussions of examples
In these two examples we have motivated two contexts where a phylogenetic tree structure might be assessed, and we have demonstrated how to implement straightforward tests of tree-compatibility using up to fourth-order moments. However, it is worth considering the limitations of these analyses. Firstly, as already mentioned, there has been no expert linguistic or phylogenetic guidance during the analysis. Secondly, the data sets are not definitive. Although the Swadesh list is a sensible choice, it is just one of a number of lists, many of which have been constructed with more contemporary linguistic knowledge in mind. Likewise, the COI gene is only one of several genes that have been suggested for identifying species. With advances in technology it is possible for much larger numbers of BPs, or even entire genomes, to form the basis of genetic studies. Finally, we are only using point estimates for the moments and so ignoring any estimation error. It is interesting to note that for the linguistic example the only violations occur in the constraints that make use of fourth-order moments, and it is the higher-order moment estimates that tend to be least accurate, as argued earlier. To get a sense of the reliability of these results there are several approaches that could be taken. For example, a non-parametric method would be to bootstrap the data and record the proportion of samples that adhere to each constraint. Alternatively, a prior distribution could be assigned to each estimate, a Bayesian hierarchical model could be constructed, and a posterior probability of tree-compatibility could be estimated through simulation. These probabilistic methods could potentially be used to additionally incorporate the algebraic constraints, though that is beyond the scope of these short examples. A search of current research suggests that this is the first example of this type of diagnostic to appear in the literature.
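The bootstrap suggestion above might be sketched as follows. This is an illustration under assumptions, not an analysis that was carried out: the function names are hypothetical, the rows (observational units) are resampled with replacement, and the constraint is passed in as an arbitrary function because the inequalities of Section 4.1 are not reproduced here. The stand-in check `positive_covariance` is a placeholder only and is not one of the tree constraints.

```python
import random


def bootstrap_adherence(data, constraint, n_boot=1000, seed=0):
    """Proportion of bootstrap resamples for which `constraint(sample)` is True.

    Rows of `data` are resampled with replacement; `constraint` is any function
    that takes a resampled data matrix and returns True or False.
    """
    rng = random.Random(seed)
    n = len(data)
    hits = 0
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        if constraint(sample):
            hits += 1
    return hits / n_boot


def positive_covariance(sample):
    """Placeholder check (NOT a tree constraint): covariance of columns 1, 2 > 0."""
    n = len(sample)
    m0 = sum(row[0] for row in sample) / n
    m1 = sum(row[1] for row in sample) / n
    return sum((row[0] - m0) * (row[1] - m1) for row in sample) / n > 0


# Toy data only: the example matrix for Tables 4.1 and 4.2.
Z_toy = [
    [1, 1, 1, 1],
    [-1, 1, 1, 1],
    [1, -1, -1, -1],
    [-1, 1, 1, -1],
    [-1, -1, 1, 1],
]
print(bootstrap_adherence(Z_toy, positive_covariance, n_boot=200))
```

In practice `constraint` would evaluate one of the inequalities from Section 4.1 on the moment estimates of each resample, and the returned proportion would indicate how stable that constraint's satisfaction or violation is under resampling.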