Extensión de las Especificaciones de Caso de Uso

Once a method proves to be superior to others, we can use it for all data sets.

[ Tateno et al [93]]

7. 1

Introduction

This thesis was not intended as a comparison of the performance of phylogenetic methods. Rather, it was to study the effects of several factors on the performance of some phylogenetic methods.

Comparisons of the performance of methods are abundant in the literature

already; for example, see [48] , [50] , [54] , [56] , [58] , [75], [80], and [93] . These studies

have used specific models of evolution to generate the data: the type of transition matrix, the tree topology, the sequence length, and the edge lengths have all been taken from a small set of possibilities, and the performance of the methods assessed. O ne danger in such studies is of assuming that because in some set of circumstances,

method ' A ' works more accurately than method 'B' , it will be the better method

to use in some other circumstances. This may indeed be true, but is not definite, and the results described in this thesis do not support a single ' best' method. I

therefore have to disagree with the statement of Tateno et al quoted above.

Hence throughout Chapters 5 and 6 ( "Small n" and "Large n" , respectively) I have avoided making any overall j udgement of the superiority of any one method over the others. In differing sets of circumstances, each method has its own merits.

l 54 Chapter 7. Discussion

But such an investigation as this is bound to reveal trends in the relative per )rmances of these methods, and of course these trends must be considered with �gard to other simulation and theoretical studies which have been carried out. [ence in this chapter an overall view is provided of the results presented so far, nd how they relate to the results of some other experiments.

This study has been inspired in part by previous work, because of necessity 01ch work was limited by computational costs and theoretical and empirical un erstanding. It would be folly to suppose that, j ust because this study has used �latively large simulations \Vith variation of many parameters, it is in some way onclusive, and that no more need be done. For this reason I include an indication f the direction in which further research into this problem may be headed.

r .2

S ummary of results

'he investigations I have carried out investigating factors affecting the performance f phylogenetic methods have involved a large amount of computer processing. The esults from the two programs "sirn . c" and "big . c" have been summarized, to find hose which are the most informative and interesting for inclusion in this thesis. Ience some conclusions have been made with reference to data not shown. This mission of data has been made purely in the interests of saving space and not Q.terrupting the flow of the discussion with countless figures and tables.

The investigation has provided several conclusions about the effects on the 1erformance of phylogenetic methods of sampling error, tree topology, number of axa, edge length probability distribution, treatment of observed data, white noise .nd pink noise. It has also enabled the desirable properties (accuracy, consistency, ·fficiency, falsifiability, and robustness) of the phylogenetic methods used to be ested both empirically and theoretically.

T. 2 . 1

Sampling error

vly results support the expected conclusion that sampling error has a strong effect

m the accuracy of the phylogenetic methods used: all perform poorly when the

;equence length c is less than about 1 00 characters, and all the methods consistent Nith the model by which the data were generated perform well when c is large, in ;he order of 1 000 or 2000 characters. This is of course to be expected, but the rate

at which the accuracy changes with varying c is now easily observed, with the large set of sequence lengths used.

With the "large n" case, the same effect of sampling error was seen with respect

to the clustering methods NJ , SL, TD and UPGMA.

When the internal edges were very short, there was often no change of state of any characters across that edge (in the "small n" case, this would mean the

bipartition corresponding to that edge was not sampled). In such cases there was not enough information in the data set for phylogenetic methods to reconstruct the generating tree Tc, unless the sequences were very long.

When the internal edge lengths, or indeed the pendant edge lengths, grew very large, the sequences effectively grew more and more dissimilar, until once again there was not enough information in the observed data to reconstruct T c unless the sequences were very long. In the "small n" case, this took place when the upper bound a on the maximum path length exceeded 1 .4. With the "large n" case an experiment was carried out to find the variance of the inferred divergence time between two sequences, given a known rate of evolutionary change. This highlighted the difficulty of inferring edge lengths from sequences in which there is not enough change to be reliable, or when the data become too dissimilar.

By using the Hadamard conjugation technique to calculate the expected bipar tition frequencies and effectively eradicate sampling error [41), and by using very long sequences, it was possible to see easily which of the methods is consistent with the model of evolution used to generate the data.

It was also possible to prove theoretically that the neighbour-joining method (NJ ) uses the only possible weighting scheme for net divergence which is consistent with the model (see Appendix B, Theorem 5).

7

2 Tree topology

It has been found that, with the models used, tree topology has little effect on the performance of the tree reconstruction methods used, other than UPGMA. The other methods which are the most affected are NJa, SL, UPGMA , and the sequence spectrum based methods. Those least affected are the distance-spectrum based methods (CoDH, CTDH , MPDH), NJ, ST and TD. I have shown the accuracy of UPGMA to be strongly dependent on tree topology. This means simulation studies on UPGMA need to explicitly take tree topology into account.

156 Chapter 7. Discussion

Knowing that, apart from UPGMA , tree topology for small n makes little dif ference to the performance of these phylogenetic methods, it is then reasonable to use one tree topology for the generating tree

Ta,

or choose the topology of Tc at random, in simulation studies such as this. Therefore with the "large n" case Tc

could be selected at random, without unnecessary concern about the affect of the topology of

Ta

on the accuracy of NJ, SL, ST or TD, but bearing in mind that tree topology may have a large effect on the accuracy of UPGMA in this case also. This allowed the extension of the simulation experiments to 30 taxa.

7.2.3 Number of taxa

The number of taxa ( n

)

greatly affects performance of phylogenetic methods. The

number of unrooted binary trees with n pendant vertices is (2n - 5)!!, and that

of rooted binary trees is (2n - 3 ) ! ! : these numbers grow exponentially with n, so

with increasing n, we predict that the accuracy of the methods must decrease.

Also, as the maximum distance between any two sequences in a given trial was bounded above

(

usually by 0.25 - see Section 4.4. 1 ) , when n was increased, the

. edge lengths decreased. Hence, with the problems known to be caused by short internal edges, this was another reason to expect the accuracy of the reconstruction methods to decrease markedly with increasing n.

Sections 5. 7 and 6.3 discuss this problem, and the degree to which n affects the

mean number of edges wrongly inferred and the probability of reconstructing Tc.

It was possible to estimate the rate at which the sequence length must increase

(

which is related to the rate at which sampling error must decrease

)

with i ncreasing number of taxa, for a particular phylogenetic method to retain a given confidence in the inferred tree. This rate has been shown to be close to linear for the cases studied, which is a promising result, but it still suggests that for 1 00 or more taxa, proportionally long sequences must be obtained

(

in the order of 1 04 characters

)

, which m ay be infeasible.

The model used for data generation was 'ideal ' , in the sense that it was known, and the methods used for tree reconstruction were

(

in this experiment

)

consistent with that model. 'Real' data cannot be assumed to be so well behaved, and it may well be that with real DNA sequences the required growth rate in c; is much faster than a linear function of n, if a fixed confidence level is sought.

supported by these experiments: the mean number of edges wrongly inferred by

each method increased at least linearly with n , indicating that the probability of

wrongly inferring a particular edge is at least increasing. This leaves aside the position of the edge in the tree: edges which are ' deeper' than others {see Section

6. 7) may have a lower probability of being inferred than those which are 'shallower' .

A study of the above effect would be rather involved, and complicated by the relative lengths of the internal edges and the position of the edge( s) in question: this may be a fruitful avenue for further study.

In document 9806 pdf (página 110-122)