The earliest studies in model organisms analyzed a single inbred line, where by inbred we mean that all individuals in the line are genetically identical. These simple observational analyses gave more precise phenotype values, because measurements can be averaged over many genetically identical individuals. To help dissect the genetic effects underlying such phenotypes, recombinant chromosome substitution lines were proposed. They have a long history of use in wheat breeding (Cavanagh et al., 2008). Typically, in a chromosome substitution set, all chromosomes except one are derived from a recurrent parent and the remaining chromosome from a donor parent. Although effective, substitution lines localize a QTL only to a whole chromosome, so the QTL regions they detect are far too large. To define the position of genes on substitution chromosomes, recombinant inbred chromosome substitution lines can be developed (Law, 1966); these have been successful in the cloning of genes underlying agricultural traits.
Whereas classical single-strain and substitution-line association studies had advantages in terms of cost, coverage and reproducibility, their main weaknesses were a lack of power for genome-wide association and low resolution. To obtain higher resolution and greater power, researchers began to examine genetic reference panels (GRPs). GRPs are defined as sets of individuals with fixed and known genomes that can be replicated indefinitely (i.e., inbred lines). Typically they consist of dozens to hundreds of inbred lines related by descent from a set of common ancestors (i.e., the founders). GRPs have been developed for many organisms, including yeast, plants, flies, and mammals (Crow, 2007; Buckler et al., 2009; Ayroles et al., 2009; Kover et al., 2009; Cubillos et al., 2011). While an individual inbred line is free of population structure, a large collection of inbred lines shows evidence of population structure between the lines (McClurg et al., 2007), which complicates the analysis.
The Hybrid Mouse Diversity Panel (HMDP; Bennett et al., 2010) is an example of a large, widely used GRP. The HMDP increases the statistical power and resolution of classical association studies by including a set of 70 recombinant inbred mouse strains in the mapping panel. In this design, approximately 100 strains are phenotyped (30 classical inbred strains and 70 recombinant inbred strains), and association is carried out after correcting for population structure using, for example, efficient mixed-model association (EMMA; Kang et al., 2008). The combined population in the HMDP provides high statistical power (from the recombinant inbred strains) and high resolution (from the classical inbred strains). A limitation of the HMDP is the number of available inbred strains, which places an upper limit on its statistical power.
Rather than simply observing the differences in phenotypes between lines in a GRP, another option is to cross two inbred lines. In a cross, two inbred strains are mated, and their offspring are either mated to each other (an intercross design) or to a progenitor strain (a backcross design). Second-generation offspring are then phenotyped and genotyped, and linkage analysis is carried out to identify a region that is associated with the trait. However, each QTL region is large (i.e., resolution is low), often spanning tens of megabases and containing hundreds of genes. The process of identifying the causal variant and the gene involved is therefore difficult and costly. On the other hand, in a genetic cross only a few hundred animals are required to identify loci that together explain 50% or more of the phenotypic variance for a particular trait. This finding is particularly striking compared to human studies, in which tens of thousands of individuals are typically required to identify loci that are involved in traits, and in which the loci identified typically explain only a small fraction of phenotypic variance (Flint and Eskin, 2012).
Another reference-panel approach is “in silico mapping” (Grupe et al., 2001). By “in silico” we mean a QTL mapping method that uses existing phenotypic and genotypic variation within common laboratory inbred strains for association studies. Years of breeding and inbreeding have produced the commonly used modern laboratory strains of mice, across which wide variation in phenotypic traits has been observed (McClurg et al., 2006). The genotypic structure of these strains is also being elucidated through dense SNP mapping, and variation among these strains is emerging in the form of haplotype structure (Yalcin et al., 2004; Wiltshire et al., 2003). It was originally hypothesized that “in silico” mapping meets the experimental requirements for QTL mapping (Grupe et al., 2001; Pletcher et al., 2004), suggesting that phenotype-specific mouse crosses are not needed to identify QTL, and that large-scale genotype data could be generated and combined in a phenotype-independent manner. While this may be true, many “in silico” mapping projects fell short of their goals because population structure could not be properly accounted for until methods such as EMMA were adopted from animal breeding, where analysts have long had to deal with related individuals.
The Collaborative Cross (CC) was proposed in 2002 as a large-scale multiparental recombinant inbred line (RIL) panel, a project aimed at generating a common platform for mammalian complex trait genetics that would overcome the limitations of existing resources (Threadgill, Hunter and Williams, 2002) and advance the field beyond complex trait analyses toward systems genetics (Threadgill, 2006). Unlike the HMDP, which consists of currently available strains, the Collaborative Cross has generated new inbred strains using a specific breeding scheme that increases power and resolution. The Collaborative Cross is also advantageous in that it has less population structure than would be expected in a standard GRP. While techniques such as EMMA are available to correct for population structure, the presence of population structure still reduces statistical power. The final eight-way RIL design of the CC was community driven (Churchill et al., 2004) and included founders from five classical inbred strains (A/J, C57BL/6J, 129S1/SvImJ, NOD/ShiLtJ, and NZO/HlLtJ) and three wild-derived strains selected to represent three Mus musculus subspecies (CAST/EiJ, PWK/PhJ, and WSB/EiJ).
An alternative strategy to inbred lines is to use outbred mice. Bi-parental populations, or advanced intercross lines (AILs; proposed by Darvasi and Soller (1995)), have been developed by selecting founder lines with large phenotypic differences for one or more traits, usually with unrelated parents selected to maximize marker polymorphism (Doerge, 2002). They were traditionally difficult to analyze because of the relatedness of individuals in the populations. With the introduction of methods to deal with this relatedness (see Sillanpää (2011) for a review of many options), AILs and more complicated populations have recently become more widely used. These include heterogeneous stock (HS) mice (Demarest et al., 1999; Valdar et al., 2006), in which animals are descended from eight classical inbred founder strains, and the Diversity Outbred (DO) mice (Svenson et al., 2012), in which animals are descended from the eight Collaborative Cross founder strains. Outbred mice can be viewed as similar to F2 animals generated from a cross, but they have ancestry from eight founder strains instead of only two, and the population is bred for more generations. The main advantage of the HS and DO strategies is that they can generate an almost limitless number of animals, enabling studies large enough to find weak genetic effects. In addition, owing to their breeding history, the animals have undergone many more recombination events, which increases mapping resolution.
1.3.2 Analysis of Outbred Populations
A number of experimental strategies have been proposed for association mapping of complex traits in model organisms. Many involve the use of highly recombinant populations derived from inbred lines. Examples of such populations are advanced intercross lines (AILs), where a pair of inbred progenitors is intercrossed for three or more generations, and heterogeneous stocks (HS; Demarest et al., 1999), where a number of inbred strains, usually eight, are intercrossed for many generations. The Diversity Outbred (DO; Svenson et al., 2012) population, recently developed in mice, resembles the HS in breeding structure. In theory, these strategies can achieve much higher-resolution mapping than is obtainable in standard inbred strain crosses, because they accumulate a greater density of recombination events, allowing finer mapping of the founder haplotypes. One complication is that the individuals in the population are related to some degree, which violates the assumption of independent subjects made by standard mapping techniques.
Multiple-founder recombinant populations use breeding schemes similar to those of AILs (Valdar et al., 2006) but descend from more than two inbred strains, typically eight, which adds complexity to the population. Because the markers used for genotyping have fewer alleles than the number of founder haplotypes in the cross, individual markers typically do not unambiguously identify the underlying strain haplotype. In particular, unless all variants are genotyped, single-marker association analyses will fail to capture some QTL effects (Mott et al., 2000).
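The ambiguity described above can be illustrated with a small sketch. The founder labels F1 to F8, the allele patterns, and the function name are all hypothetical, and the example deliberately ignores recombination between markers, heterozygosity, and genotyping error.

```python
# Hypothetical alleles carried by eight founder strains at two nearby
# biallelic SNPs. Labels and allele patterns are invented for illustration.
MARKERS = [
    {"F1": "A", "F2": "A", "F3": "G", "F4": "A",
     "F5": "G", "F6": "A", "F7": "A", "F8": "G"},
    {"F1": "C", "F2": "T", "F3": "C", "F4": "T",
     "F5": "C", "F6": "C", "F7": "T", "F8": "T"},
]


def compatible_founders(observed, markers=MARKERS):
    """Founders whose alleles match every observed allele, marker by marker.

    `observed` holds one allele per marker, in the same order as `markers`.
    Assumes the observed segment comes from a single founder haplotype.
    """
    candidates = set(markers[0])
    for allele, founder_alleles in zip(observed, markers):
        candidates &= {f for f, a in founder_alleles.items() if a == allele}
    return sorted(candidates)


# One biallelic marker cannot distinguish eight founders.
one_marker = compatible_founders(["A"])
# Adding a neighbouring marker narrows the candidates but need not resolve them.
two_markers = compatible_founders(["A", "C"])
print(one_marker)
print(two_markers)
```

With one marker, five founders remain compatible with the observed allele; with two markers, two remain. This is the motivation for inferring founder haplotypes from several markers jointly rather than testing markers one at a time.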
Polygenic based approaches
According to some recent views, population structure and relatedness between individuals each require their own correction terms (see Sillanpää (2011)), or may need additional correction after fitting a polygenic model (Amin, van Duijn and Aulchenko, 2007). Linear mixed models have been shown to correct effectively for population structure in the association mapping of quantitative traits (Yu et al., 2006). They incorporate the genetic relatedness between every pair of individuals directly as a random effect, which accounts for the correlation between individuals' phenotypes due to their degree of relatedness (e.g., siblings, first cousins, second cousins, etc.). This reflects the premise that the phenotypes of two genetically similar individuals are more likely to be correlated than those of genetically dissimilar individuals. Applications of mixed models to association mapping in maize and potato panels demonstrate that mixed models obtain fewer false positives and higher power than previous methods, including genomic control, structured association, and principal component analysis (Yu et al., 2006; Malosetti et al., 2007; Zhao et al., 2007).
Many highly recombinant model organism populations, such as the DO or HS, resemble those found in plant and animal breeding. Linear mixed models account for the relatedness of individuals through variance components parameterized by the kinship matrix (Valdar et al., 2009). Specifically, the effect of a single locus is estimated simultaneously with one or more random intercepts whose expected correlation structure is fixed by the kinship matrix, either derived from the pedigree or realized from the observed genotypes. These random intercepts model overall genetic relatedness and so account for effects from the rest of the genome (Kennedy, Quinton and van Arendonk, 1992; Jannink, Bink and Jansen, 2001; Zhao et al., 2007).
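As a rough illustration of a realized kinship matrix, the sketch below centres each marker by its sample mean and averages the cross-products over markers. This is only one simple convention, assumed here for illustration; software packages differ in how they centre and scale genotypes, and the function name and toy genotypes are hypothetical.

```python
def realized_kinship(genotypes):
    """Realized kinship from allele counts (0, 1 or 2) per individual.

    `genotypes` is a list of individuals, each a list of allele counts
    with one entry per marker. Assumed convention: centre every marker
    by its sample mean and average cross-products over markers.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    means = [sum(ind[j] for ind in genotypes) / n for j in range(m)]
    centred = [[ind[j] - means[j] for j in range(m)] for ind in genotypes]
    return [
        [sum(a * b for a, b in zip(centred[i], centred[k])) / m
         for k in range(n)]
        for i in range(n)
    ]


# Toy data: individuals 0 and 1 are genetically identical, 2 is dissimilar.
toy = [
    [0, 2, 2, 0],
    [0, 2, 2, 0],
    [2, 0, 0, 2],
]
K = realized_kinship(toy)
for row in K:
    print([round(v, 3) for v in row])
```

Under this convention the identical pair receives the same value as each member's own diagonal entry, while the dissimilar individual receives a negative entry against both, which is the pattern the random effect uses to model correlated phenotypes.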
This type of approach has been taken by two popular methods: Efficient Mixed-Model Association (EMMA; Kang et al., 2008) and QTLRel (Cheng et al., 2011).
EMMA was proposed as an efficient exact procedure that corrects for population structure and genetic relatedness in model organism association mapping, at a time when fitting linear mixed-effects models was computationally expensive. EMMA takes advantage of the specific nature of the optimization problem that arises when applying mixed models to association mapping, substantially increasing computational speed and improving the reliability of results by achieving near-global optimization (Kang et al., 2008). While this was a great improvement, the EMMA algorithm was still computationally infeasible for large data sets because the variance component parameters are estimated for each marker. A newer implementation, Efficient Mixed-Model Association eXpedited (EMMAX; Kang et al., 2010), makes the simplifying assumption that, because the effect of any given SNP on the trait is typically small, the variance parameters need to be estimated only once for the entire dataset rather than once for each marker. This change sacrifices the exact solution of EMMA for a feasible computation time.
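The difference between the two procedures can be written out as follows. The notation is assumed here for illustration and is not taken from the cited papers: y is the n-vector of phenotypes, X holds covariates including the intercept, x_j is the genotype vector at marker j, and K is the kinship matrix.

```latex
% Notation assumed for illustration only.
\begin{align*}
y &= X\beta + x_j\gamma_j + u + \varepsilon,
  \qquad u \sim N(0,\ \sigma_g^2 K),
  \quad \varepsilon \sim N(0,\ \sigma_e^2 I_n) \\[4pt]
\text{EMMA:}\quad
  & (\hat\sigma_{g,j}^2,\ \hat\sigma_{e,j}^2)
    \ \text{re-estimated for every marker } j \\[4pt]
\text{EMMAX:}\quad
  & \hat V = \hat\sigma_g^2 K + \hat\sigma_e^2 I_n
    \ \text{estimated once under the null model } y = X\beta + u + \varepsilon \\
  & \begin{pmatrix} \hat\beta \\ \hat\gamma_j \end{pmatrix}
    = \bigl(W_j^\top \hat V^{-1} W_j\bigr)^{-1} W_j^\top \hat V^{-1} y,
    \qquad W_j = [\, X \ \ x_j \,]
\end{align*}
```

Because the estimated covariance is shared by all markers under the EMMAX assumption, each marker test reduces to a generalized least squares fit, which is where the saving in computation time comes from.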
QTLRel (Cheng et al., 2011) is a more recent software package developed to perform genome-wide scans quickly, using a technique similar to EMMAX, with the advantage of allowing multiple random effects. While its authors use the pedigree to infer the relationship matrix between individuals, this can be replaced by a realized kinship matrix based on the observed genotypes. One of the main advantages of QTLRel over EMMA is its ability to include other random effects, such as cage, environment, or treatment effects.
While several approximate methods have been proposed to address the computation time of genome-wide scans (e.g., EMMAX and QTLRel), efficient exact options also exist. Zhou and Stephens (2012) propose an efficient exact method, referred to as genome-wide efficient mixed-model association (GEMMA), which makes approximations unnecessary in many contexts. The method is approximately n times faster than the exact method EMMA, where n is the sample size, and comparable in speed to many approximate methods, making exact genome-wide association analysis computationally practical for large numbers of individuals. We note that although in some settings the approximate methods provide results almost identical to those from the exact method (Kang et al., 2010; Zhang et al., 2010), this is not guaranteed in general.
Multiple locus based approaches
In a complex trait GWAS, the trait is affected by multiple functional loci and therefore a multiple locus association method would be preferred (Ayers and Cordell, 2010). To identify the important loci within the multiple locus model, variable selection or regularization of the predictors is required (e.g., Sillanpää and Bhattacharjee, 2005; Hoggart et al., 2008; O’Hara and Sillanpää, 2009; Wu et al., 2009; Ayers and Cordell, 2010; Cho et al., 2010).
The polygenic aspect of the model, which accounts for both the distant (i.e., between-population) and close (i.e., within-population) relatedness structures in the data, can be addressed by a multiple locus model, because the genetic relationships between individuals can be captured by the markers themselves (e.g., Habier, Fernando and Dekkers, 2007). This makes it possible to use such models without additional polygenic terms. Kärkkäinen and Sillanpää (2012) showed that multiple locus models that did not explicitly model polygenic effects worked well. Their observation that additional polygenic components are redundant agrees with, for example, Calus and Veerkamp (2007) and Pikkuhookana and Sillanpää (2009). Furthermore, Calus and Veerkamp (2007) claim that including polygenic effects at higher SNP densities will not improve the accuracy of total breeding values. Specifically, they found that when the average LD between adjacent markers, measured as r^2, is at least 0.10, there appears to be little reason to include a polygenic effect in the model, depending on the heritability of the trait.
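To make the r^2 criterion concrete, the following sketch computes r^2 between adjacent biallelic markers and averages it along the map. It assumes phased haplotypes (or fully inbred lines) coded 0/1; the function names and the toy data are hypothetical.

```python
def ld_r2(a, b):
    """r^2 between two biallelic markers coded 0/1 per haplotype.

    r^2 = D^2 / (pA (1 - pA) pB (1 - pB)), where D = pAB - pA * pB.
    """
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic marker")
    return d * d / denom


def mean_adjacent_r2(markers):
    """Average r^2 over consecutive marker pairs, markers given in map order."""
    pairs = list(zip(markers, markers[1:]))
    return sum(ld_r2(a, b) for a, b in pairs) / len(pairs)


# Toy data: three markers scored on eight haplotypes.
markers = [
    [0, 0, 1, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
]
avg = mean_adjacent_r2(markers)
print(avg, avg >= 0.10)
```

In this toy panel the first pair of markers is in moderate LD and the second pair is in none, so the average sits just above the 0.10 level discussed above.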
Utz, Melchinger and Schön (2000) implemented a multiple locus resampling based procedure for detecting functional loci in GWAS, and showed in their simulations that resampling was able to correct some biases and sampling errors in the model estimation. Schön et al. (2004) used composite interval mapping by the regression approach (Haley and Knott, 1992) in combination with resampling of a multiple locus additive genetic model, as in Utz, Melchinger and Schön (2000), with loci selected by stepwise regression, for the analysis of testcross progenies. They found that even for moderate sample sizes their procedure obtained estimates with very low bias. They concluded that for traits regulated by a few QTL with large effects, for which phenotypic selection is expensive or hampered by rare occurrence, the resampling based multiple locus approach to marker-assisted selection (MAS) (Utz, Melchinger and Schön, 2000) can be very useful.
Another resampling based multiple locus method called frequentist model averaging (FMA) was proposed in Hjort and Claeskens (2003). FMA fits a multiple locus model for each combination of predictors and averages over the models, with weights, to obtain parameter estimates. FMA can be implemented without much difficulty or protracted computation. One requirement of FMA is the specification of model weights. Several methods of defining the weights have been proposed, including AIC weights (Buckland, Burnham and Augustin, 1997), weights based on minimizing a Mallows criterion (Hansen, 2007), and weights based on the Focused Information Criterion (Claeskens and Consentino, 2008). Williams and Christian (2006) showed that FMA estimates of genetic effects in twin studies were more accurate than the standard estimates based on the criteria used for the model averaging weights. Schomaker, Wan and Heumann (2010) address the issue of missing data in the FMA framework. They propose imputing first and then performing FMA, rather than attempting to incorporate complex weighting adjustments into criteria such as AIC to allow for missing data (e.g., the EM-based AIC developed in Claeskens and Consentino (2008)). They also propose a frequentist model selection (FMS) estimator, a special case of FMA that focuses on the selected model rather than the estimated effects.
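The AIC-based weighting can be sketched as follows. The weights take the familiar form exp(-AIC_k / 2), normalized to sum to one; subtracting the smallest AIC first is a numerical convenience that leaves the weights unchanged. The candidate AIC values and effect estimates are invented, and treating the effect as zero in a model that omits the locus is an assumption made for this illustration.

```python
import math


def aic_weights(aics):
    """Model weights proportional to exp(-AIC / 2), normalized to sum to 1."""
    best = min(aics)
    raw = [math.exp(-(a - best) / 2.0) for a in aics]
    total = sum(raw)
    return [r / total for r in raw]


def model_average(estimates, weights):
    """Weighted average of one parameter's estimates across candidate models."""
    return sum(w * e for w, e in zip(weights, estimates))


# Hypothetical candidate models for one locus effect. The third model omits
# the locus, so its estimate is set to zero (an assumed convention).
aics = [100.0, 102.0, 110.0]
effects = [0.50, 0.40, 0.0]

weights = aic_weights(aics)
averaged = model_average(effects, weights)
print([round(w, 4) for w in weights], round(averaged, 4))
```

The averaged effect lies between the estimates from the two best-supported models, and the model that omits the locus contributes little because its AIC is much larger. Selecting only the top-weighted model would correspond to the FMS special case mentioned above.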
The QTLMAS XII meeting provided a common data set for which attendees could propose analysis methods. The summaries of the submitted methods support the recent preference for multiple locus models (Crooks et al., 2009). The results from LDHap (Ledur, Navarro and Pérez-Enciso, 2009) were best overall for this data set, with LABayes (Bink and van Eeuwijk, 2009) and LDBayes (Cleveland and Deeb, 2009) having the second highest power for QTL detection. As LDHap and LABayes both used information from several markers to detect QTLs, this suggests that multiple marker methods may have higher power to find QTLs.
Other approaches
Although polygenic and multiple locus modeling are popular, other methods have been proposed. Widely used methods for related individuals in human association mapping include genomic control (Devlin and Roeder, 1999), structured association (Pritchard, Stephens and Donnelly, 2000), and principal component analysis (Patterson, Price and Reich, 2006; Price et al., 2006). However, these methods have been shown to be inadequate for model organisms. Genomic control has reduced power when the effect of population structure is large, as would be expected in model organisms (Yu et al., 2006). Principal component based analyses, which assume only a small number of ancestral populations and admixture, only partially capture the multiple levels of population structure and genetic relatedness in model organisms (Aranzana et al., 2005; Yu et al., 2006; Zhao et al., 2007).
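A minimal sketch of the genomic control adjustment helps show why its power drops under strong structure. It assumes 1-df chi-square association statistics and estimates the inflation factor as the observed median divided by the null median; flooring the factor at 1 is a common convention assumed here, and the function name and test statistics are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return the inflation factor and the deflated 1-df chi-square statistics."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # assumed convention: never inflate the statistics
    return lam, [s / lam for s in chi2_stats]


# Invented genome-wide statistics with inflation and one strong signal.
observed = [0.2, 0.5, 0.9, 1.4, 3.0, 12.0, 0.7]
lam, adjusted = genomic_control(observed)
print(round(lam, 3), round(max(adjusted), 3))
```

Every statistic is divided by the same factor, so when structure inflates the median substantially, true signals are shrunk along with the spurious ones.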