In the first part of this chapter I provided a comprehensive review of techniques for estimating local and global ancestry proportions. This overview extends the partial reviews by Churchhouse (2012), Churchhouse and Marchini (2013) and Liu et al. (2013). The review covered regression, clustering and window-based techniques, non-parametric Bayesian approaches, PCA-based approaches and machine learning approaches with a focus on kernel-based methods.
I also discussed the evolution of local, HMM-based approaches. Initially, researchers applied first-order HMMs before moving to second-order HMMs to account for LD between adjacent SNP markers. Because no fixed-order HMM can model long-range dependencies, researchers then focused on modelling entire haplotypes within a hierarchical HMM whose layers correspond either to populations or to haplotypes. Owing to computational constraints, HMMs were subsequently developed whose number of hidden states is much smaller than the number of ancestral individuals. Alternatively, to avoid both the oversimplification of fixed-order HMMs and the overspecification of modelling whole haplotypes, researchers studied haplotype clusters as hidden states, which adapt to the amount of locally observed LD in each ancestral population. All of the HMM variants described so far are either not applicable to many populations or have a time complexity that makes inference in a multiple-population setup infeasible. This limitation was overcome by models such as MULTIMIX and ChromoPainter. MULTIMIX segments chromosomes into windows of constant ancestry and, assuming normality, estimates the SNP frequencies and the covariance among the SNPs in each window. For an unknown admixed individual, the most likely hidden ancestry states are then inferred using a first-order HMM. ChromoPainter, on the other hand, represents each test dog as a linear combination of haplotype segments in the training data.
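The decoding step mentioned above, inferring the most likely hidden ancestry states with a first-order HMM, can be sketched as follows. This is a minimal illustration and not the MULTIMIX implementation: it assumes biallelic SNPs, independent Bernoulli emissions from per-population allele frequencies (instead of a windowed multivariate normal model), and a single switch probability between adjacent SNPs. The function name `viterbi_ancestry` and all of its parameters are hypothetical.

```python
import math


def viterbi_ancestry(alleles, freqs, switch_prob):
    """Most likely ancestry path along one haplotype under a first-order HMM.

    alleles     -- list of 0/1 alleles observed at consecutive SNPs
    freqs       -- freqs[k][i]: frequency of allele 1 at SNP i in population k
                   (assumed strictly between 0 and 1, at least 2 populations)
    switch_prob -- assumed probability of an ancestry switch between SNPs
    """
    n_pops = len(freqs)
    stay = math.log(1.0 - switch_prob)
    move = math.log(switch_prob / (n_pops - 1))

    def emit(k, i):
        # Log-probability of the observed allele given ancestry k.
        p = freqs[k][i] if alleles[i] == 1 else 1.0 - freqs[k][i]
        return math.log(p)

    # score[k]: best log-probability of any path ending in population k.
    score = [math.log(1.0 / n_pops) + emit(k, 0) for k in range(n_pops)]
    back = []
    for i in range(1, len(alleles)):
        new_score, pointers = [], []
        for k in range(n_pops):
            best_j = max(range(n_pops),
                         key=lambda j: score[j] + (stay if j == k else move))
            trans = stay if best_j == k else move
            new_score.append(score[best_j] + trans + emit(k, i))
            pointers.append(best_j)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_pops), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With two toy populations whose allele frequencies are 0.9 and 0.1 at every SNP, a haplotype carrying three 1-alleles followed by three 0-alleles is decoded as a single ancestry switch in the middle. A first-order chain of this kind treats SNPs as conditionally independent given ancestry, which is exactly the simplification that the higher-order and haplotype-based models above were designed to relax.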
After this review of ancestry inference techniques, in Section 5.2.1 I defined a set of characteristics for selecting one of these approaches as a comparison technique for DBAncestry. In particular, the characteristics include whether the algorithm is suitable for sparse or dense marker sets, how many populations it can process, how it scales with the number of populations, whether it accounts for correlation among SNPs, whether a publicly available implementation exists and whether there are opportunities for feedback from expert users.
Based on these characteristics I chose ChromoPainter, which I compared with DBAncestry in Section 5.2.1 on factors such as time complexity as a function of the number of SNPs and training samples, whether ancestry changes within a chromosome are considered, whether LD is handled, how much information is extracted from the training samples, and disk storage and working memory requirements.
After that I discussed how I obtained the haplotype representation for the test dogs (Section 5.2.2), how I estimated the recombination and mutation rates that are used as input to ChromoPainter (Section 5.2.3), and how I computed breed proportion estimates using either the standard or the NNLS variant (Section 5.2.4). Finally, I described the results in Section 5.2.5 and compared them with those obtained from DBAncestry.
Results showed that ChromoPainter predicts pure breed proportions reasonably well, with median estimates of 0.33 and 0.57 for the standard and NNLS variant, respectively. However, performance declines sharply for more complex lineages involving multiple breeds. In particular, parental breeds have median breed proportion estimates of only 0.14 to 0.16, which is around 0.35 less than the truth.
Similarly, at the grandparent (gp) level the median estimates are 0.05 to 0.08, which is much less than the true value of 0.25. Finally, at the great-grandparent (ggp) level the NNLS variant detects almost no breeds, while the standard variant underestimates these breed proportions with estimates of 0.05.
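The reference values used in this comparison follow from the expected genomic contribution of an ancestor, which halves with every generation: 0.5 for a parent, 0.25 for a grandparent and 0.125 for a great-grandparent. As a small worked example, the sketch below derives the expected breed proportions of a dog from the breeds of its eight great-grandparents. The function name and the toy breed labels are hypothetical, and the values are expectations; realized proportions vary with recombination.

```python
from collections import Counter
from fractions import Fraction


def expected_proportions(great_grandparents):
    """Expected breed proportions given the breeds of the 8 great-grandparents.

    Each great-grandparent contributes an expected 1/8 of the genome, so a
    breed's expected proportion is its count among the 8 ancestors over 8.
    """
    if len(great_grandparents) != 8:
        raise ValueError("expected exactly 8 great-grandparent breeds")
    counts = Counter(great_grandparents)
    return {breed: Fraction(n, 8) for breed, n in counts.items()}


# Toy lineage: one purebred parent of breed A, one purebred grandparent of
# breed B, and two single great-grandparents of breeds C and D.
lineage = ["A"] * 4 + ["B"] * 2 + ["C", "D"]
proportions = expected_proportions(lineage)
```

For this toy lineage the expected proportions are 1/2 for A, 1/4 for B and 1/8 each for C and D, which are the true values against which the parent, gp and ggp estimates above are judged.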
A comparison with DBAncestry showed that DBAncestry still yields a high AUC score for highly complex lineage trees (i.e. lineages with several gps and ggps), a level that ChromoPainter reaches only for pure breed trees. This difference in performance is most likely because ChromoPainter extracts less information from the training dataset than the MCMC approach used by DBAncestry.
Chapter 6
Conclusions and future work
This chapter briefly summarizes the thesis contributions along with pointers to future work.
6.1 Conclusions
This research on canine ancestry inference was initiated by Mars Veterinary and Prof. David Balding.
The underlying motivation for this work was to improve upon currently available ancestry inference techniques for the interesting case where canine samples are represented by short genetic sequences from a large number of breeds. In particular, we were interested in estimating the breed composition of synthetic test dogs at the level of recent ancestry, going back up to three generations, which corresponds to great-grandparent kinship.
As a first step in this thesis we provided some background on commercial ancestry testing and dog research. In particular, we discussed direct-to-consumer (DTC) tools for ancestry inference and why private users, academic researchers, medical professionals and employees of government bodies, housing associations and insurance agencies are interested in accurate global canine breed composition estimation. A variety of data sources are predictive of ancestry, and we reviewed some of them, including data based on phenotypes, language, stable isotopes and genetics. Genetic data can be further subdivided into SNP and microsatellite markers, both of which contain autosomal components as well as lineage-based parts for the maternal and paternal lineage. However, these ancestry inference tools also have limitations: their results depend on a definition of ancestry that is interpreted differently depending on audience and purpose, on the number of markers and their breed-discriminating characteristics, on the data sources used and on how completely the ancestral populations were sampled for use in their databases. Furthermore, DTC tools are limited in the accuracy and interpretation of genetic testing for diseases and other disorders. Ancestry inference can also cause emotional distress to dog owners who expected a different breed composition, or may lead to legal disputes with respect to supposedly purebred dogs. Finally, many DTC providers enforce proprietary confidentiality rules to protect their business, which may lead to the unexpected situation that different DTC companies return different, even contradictory, inference results for the same test sample. In the next section we explained how dog breeds diverged from wolves through two bottlenecks, related to domestication and Victorian breed formation, respectively.
Analysis of different types of genetic data shows that this breed formation led to more between-breed variability and a decline in genetic diversity within breeds, which can be exploited for ancestry prediction but also affects whether certain disorders are inherited.
Firstly, I explored the genetic proximity among breeds for the datasets used in this thesis. To this end, we looked at several multivariate distance and similarity measures, which can be computed using either the original genotype data or the frequencies of the phased haplotypes. We also discussed one similarity measure based on the ChromoPainter software, which takes marker correlation into account, although we could not exploit this strength much because of the limited LD in our sparse SNP marker profile composed of short genetic sequences. We then visualized the derived proximity matrices using heatmaps, dendrograms and 2D reconstructions obtained by multi-dimensional scaling, which approximately preserve the original high-dimensional distances. We noticed that some breeds are very distinct, while other breeds have individuals that overlap with members of other breeds.
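As an illustration of a genotype-based proximity measure, the sketch below computes Manhattan distances between breeds. It assumes genotypes coded as 0/1/2 allele counts per SNP and summarizes each breed by its mean allele count per SNP; both the coding and the function names are assumptions made for this example and need not match the measures evaluated in the thesis.

```python
def breed_mean_profiles(genotypes, breeds):
    """Average the 0/1/2 allele counts per SNP over the dogs of each breed."""
    sums, counts = {}, {}
    for g, b in zip(genotypes, breeds):
        if b not in sums:
            sums[b] = [0.0] * len(g)
            counts[b] = 0
        sums[b] = [s + x for s, x in zip(sums[b], g)]
        counts[b] += 1
    return {b: [s / counts[b] for s in sums[b]] for b in sums}


def manhattan_matrix(profiles):
    """Pairwise Manhattan (L1) distances between breed mean profiles."""
    names = sorted(profiles)
    dist = {a: {} for a in names}
    for a in names:
        for b in names:
            dist[a][b] = sum(abs(x - y)
                             for x, y in zip(profiles[a], profiles[b]))
    return dist


# Toy data: two dogs of breed A and one dog of breed B, typed at three SNPs.
genotypes = [[0, 0, 2], [0, 2, 2], [2, 2, 0]]
breeds = ["A", "A", "B"]
distances = manhattan_matrix(breed_mean_profiles(genotypes, breeds))
```

A matrix of this form can be passed directly to a heatmap, a hierarchical clustering routine or a multi-dimensional scaling procedure.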
After that I continued with a review of Mars Wisdom's proprietary canine ancestry inference implementation, which I used as the foundation for developing our own algorithm, DBAncestry, for breed composition inference. At first our work was concerned with an inference algorithm for the special case of purebred synthetic dogs. To characterize the different breeds in our algorithm we enumerated typical haplotypes with their frequencies for each combination of breed and chromosome. These frequencies were computed using the statistical genetics software PHASE. More specifically, we either phased the purebred training dataset by breed, which is adversely affected by small sample sizes in some of the breeds, or we phased all training samples together across breeds, which confounds population structure. Then, based on the test dog's genotype for a chromosome, we enumerated the consistent haplotype pairs and their corresponding frequencies to compute the most likely assignment of a breed pair to that chromosome. However, there are cases where the frequencies of those consistent haplotypes are zero, and we experimented with three inference options to deal with this case. Although a few breeds, especially subpopulations, were confused, the results were very encouraging.
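The per-chromosome assignment step can be sketched as follows. This is a simplified illustration under stated assumptions, not the DBAncestry implementation: genotypes are coded as 0/1/2 allele counts, haplotypes as 0/1 tuples, and the likelihood of a breed pair is taken to be the sum, over ordered consistent haplotype pairs, of the product of the two breed-specific haplotype frequencies. The names `consistent_pairs`, `best_breed_pair` and `hap_freqs` are hypothetical.

```python
from itertools import product


def consistent_pairs(genotype, haplotypes):
    """Ordered haplotype pairs whose allele sums reproduce the genotype."""
    return [(h1, h2) for h1, h2 in product(haplotypes, repeat=2)
            if all(a + b == g for a, b, g in zip(h1, h2, genotype))]


def best_breed_pair(genotype, hap_freqs):
    """Return (likelihood, breed pair) with the highest likelihood.

    hap_freqs[breed][haplotype] -- haplotype frequency on this chromosome.
    Returns (0.0, None) when every breed pair has zero likelihood; the three
    options examined in the thesis for that case are not reproduced here.
    """
    haplotypes = {h for table in hap_freqs.values() for h in table}
    pairs = consistent_pairs(genotype, sorted(haplotypes))
    breeds = sorted(hap_freqs)
    best = (0.0, None)
    for i, b1 in enumerate(breeds):
        for b2 in breeds[i:]:
            # Ordered pairs cover both phase assignments: the first
            # haplotype is drawn from b1 and the second from b2.
            lik = sum(hap_freqs[b1].get(h1, 0.0) * hap_freqs[b2].get(h2, 0.0)
                      for h1, h2 in pairs)
            if lik > best[0]:
                best = (lik, (b1, b2))
    return best


# Toy haplotype frequencies for two breeds on one chromosome with two SNPs.
toy_freqs = {
    "A": {(0, 0): 0.8, (1, 1): 0.2},
    "B": {(0, 1): 0.5, (1, 0): 0.5},
}
likelihood, pair = best_breed_pair((0, 1), toy_freqs)
```

In this toy example the genotype (0, 1) can only be explained by one haplotype from breed A and one from breed B, so the mixed pair is selected. When no breed pair has a non-zero likelihood the function returns no assignment, which is the situation the zero-frequency options above address.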
In the next chapter of this thesis we extended our novel algorithm DBAncestry from the special case of purebred dogs to mixed-breed dogs of varying lineage complexity. In this crossbred case each of the 25 chromosomes is assigned a breed pair from a database of 125 breeds. Assigning breed pairs across the genome therefore leads to a prohibitively large number of possible breed compositions.
To explore this space, we sampled possible breed configurations using a Metropolis-Hastings algorithm.
As the proposal update in this MCMC algorithm we either sampled new breeds uniformly or biased the proposal such that proposed breeds are likely to be similar to the current breed assignment. The breed-biased proposal uses a breed rank matrix based on Manhattan distances computed from the original genotype data. We also experimented with the run time of the MCMC algorithm, using either a short run of 700K or a long run of 7 million iterations in the main phase. As performance measures we used an adaptation from the classical classification literature, i.e. an extension of binary ROC curves to multiple classes, and we calculated how far the predicted breed proportion estimates deviate from the true ones. The results we obtained are very good, although the contributions of the true breeds in the ancestry are underestimated because of exploration in the MCMC. More complex ancestries as well as the short MCMC run lead to a slight drop in performance. Furthermore, we expected that the breed-biased update rule would improve either mixing time or prediction accuracy over the uniform update rule, but the results show no evidence for this hypothesis. I then showed how a lineage tree is derived from the estimated breed contributions, which serve as a proxy for the true ancestral proportions. Finally, I showed that the DBAncestry algorithm is robust against recombination events typical for canine datasets, showing only a very small decrease in prediction quality.
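The sampling scheme with the uniform update rule can be sketched as follows. This is a minimal illustration under stated assumptions rather than the DBAncestry implementation: the state holds one breed pair per chromosome, the prior over assignments is taken to be uniform, and `log_lik` is a hypothetical user-supplied function returning the log-likelihood of a breed pair on a chromosome. Because the uniform proposal is symmetric, the acceptance probability reduces to the likelihood ratio.

```python
import math
import random
from collections import Counter


def mh_breed_proportions(log_lik, n_chrom, n_breeds, n_iter, burn_in, seed=0):
    """Metropolis-Hastings over breed pairs; returns mean breed proportions.

    log_lik(c, b1, b2) -- hypothetical log-likelihood of breed pair (b1, b2)
                          on chromosome c
    """
    rng = random.Random(seed)
    state = [[rng.randrange(n_breeds), rng.randrange(n_breeds)]
             for _ in range(n_chrom)]
    counts = Counter()
    kept = 0
    for it in range(n_iter):
        # Propose a new breed for one slot of one chromosome, uniformly.
        c, slot = rng.randrange(n_chrom), rng.randrange(2)
        proposal = list(state[c])
        proposal[slot] = rng.randrange(n_breeds)
        log_ratio = log_lik(c, *proposal) - log_lik(c, *state[c])
        # Symmetric proposal: accept with probability min(1, ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state[c] = proposal
        if it >= burn_in:
            kept += 1
            for pair in state:
                counts.update(pair)
    total = 2 * n_chrom * kept
    return {b: counts[b] / total for b in range(n_breeds)}


def toy_log_lik(c, b1, b2):
    """Toy likelihood that favours breed 0 on every chromosome."""
    return -3.0 * ((b1 != 0) + (b2 != 0))


proportions = mh_breed_proportions(toy_log_lik, n_chrom=4, n_breeds=3,
                                   n_iter=20000, burn_in=2000)
```

Even in this toy run with a purebred truth, the estimated proportion of breed 0 stays below 1 because the chain keeps visiting alternative breeds, which mirrors the underestimation due to exploration noted above. A breed-biased proposal is generally not symmetric, so the acceptance ratio would then also need the ratio of proposal probabilities (the Hastings correction).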
In the final chapter of this thesis we sought a sophisticated algorithm for the inference of complex ancestries that could serve as a comparison technique for DBAncestry in predicting complex canine breed compositions. However, there is a lack of a comprehensive review of ancestry