
All operations for the GPWAS analyses are detailed in the R source code used to conduct the analysis, and in its associated documentation, both of which are available online (https://github.com/shanwai1234/GPWAS). Briefly, we employed a model selection approach to adaptively select the phenotypes most significantly associated with each gene. An F-test was used to compare a model explaining variation in SNPs based solely on population structure with a model incorporating both population structure and trait data. The significance of the difference in goodness of fit between these two models was used to determine the significance of the association of individual genes with phenotypic variation in the dataset.

The first stage is a stepwise selection procedure that iterates over all phenotypes in order to select individual phenotypes to incorporate into the model. This approach models all the SNP markers assigned to a given gene jointly, as multiple responses. During each iteration, the association between each single trait and all of the evaluated SNPs is determined using an F-test which incorporates the dependence among the SNPs (see the provided R code for details). If at least one trait passes a set threshold (p < 0.01 in the analyses presented in this paper), the single most significant trait is added to the model. If at least one trait is not significant at the same threshold, the single least significantly associated trait is removed from consideration. This process is repeated for a configurable number of iterations. For the analyses presented in this paper, the number of iterations was set to 35 because, with this number of iterations, none of the models for any gene included the maximum of 35 distinct traits.
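The control flow of this stepwise stage can be sketched as follows. This is a minimal illustration in Python, not the authors' R implementation: the function name `stepwise_select` and the callback `pvalue_fn` are hypothetical, and the sketch assumes that the removal step applies to traits already in the model, which is one reading of the description above. The statistical test itself is deliberately left to the callback.

```python
from typing import Callable, List, Sequence


def stepwise_select(
    traits: Sequence[str],
    pvalue_fn: Callable[[str, List[str]], float],
    threshold: float = 0.01,
    max_iter: int = 35,
) -> List[str]:
    """Stepwise trait selection for one gene (illustrative sketch).

    pvalue_fn(trait, others) is a hypothetical callback that returns the
    p-value for `trait` given a model that already contains `others`.
    """
    selected: List[str] = []
    for _ in range(max_iter):
        changed = False

        # Forward step: add the single most significant trait, if any passes.
        remaining = [t for t in traits if t not in selected]
        if remaining:
            best_p, best = min((pvalue_fn(t, selected), t) for t in remaining)
            if best_p < threshold:
                selected.append(best)
                changed = True

        # Backward step (assumed reading): drop the least significant trait
        # in the model if it no longer passes the threshold.
        if selected:
            worst_p, worst = max(
                (pvalue_fn(t, [s for s in selected if s != t]), t)
                for t in selected
            )
            if worst_p >= threshold:
                selected.remove(worst)
                changed = True

        if not changed:
            break
    return selected


# Toy usage with fixed, made-up p-values that ignore the current model.
toy_p = {"height": 0.001, "flowering": 0.005, "tassel": 0.2}
chosen = stepwise_select(list(toy_p), lambda trait, others: toy_p[trait])
```

Separating the loop from the test makes it easy to swap in whichever multi-response F-test is used in practice without touching the selection logic.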

After the number and identity of the phenotypes included in the model for a particular gene are finalized, the next stage is to evaluate how much the inclusion of phenotypic data improves model fit relative to a model based purely on population structure. To do this, two separate models are fit. The first model (initial model or IM) uses only population structure principal components to predict the values for all SNP markers associated with the target gene. The second model (GPWAS model or GM) uses both population structure and the phenotypes selected in stage one to predict the values for the same set of SNP markers. The goodness of fit of these two models is compared using an F-test. The final result of the F-test takes into account all of the SNPs included from the target interval, as well as the degree of correlation between these SNPs. One requirement of these F-tests is that the multiple response variables should not exhibit strong correlations with each other. For this reason, the set of SNPs within each gene/interval was first filtered to retain only one representative SNP from each group of SNPs in high linkage disequilibrium with each other.
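To make the IM versus GM comparison concrete, the sketch below compares two nested multi-response linear models using Wilks' lambda with Rao's F approximation. This is one standard statistic for such comparisons and is an assumption here; the authors' R code may compute the test differently. The function names and the simulated data are hypothetical. With a single response (one SNP) the statistic reduces to the ordinary partial F.

```python
import numpy as np


def residual_sscp(Y, X):
    """Residual sums-of-squares-and-cross-products matrix of Y regressed on X."""
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    R = Y - X @ beta
    return R.T @ R


def nested_model_f(Y, X_reduced, X_extra):
    """Rao's F approximation to Wilks' lambda for nested multivariate models.

    Y         : n x p matrix of responses (here, SNPs of one gene)
    X_reduced : n x r design matrix of the initial model (intercept + PCs)
    X_extra   : n x q matrix of added predictors (selected phenotypes)
    Returns (F, df1, df2).
    """
    n, p = Y.shape
    q = X_extra.shape[1]
    X_full = np.hstack([X_reduced, X_extra])

    E = residual_sscp(Y, X_full)        # error SSCP under the full model
    T = residual_sscp(Y, X_reduced)     # error + hypothesis SSCP
    wilks = np.linalg.det(E) / np.linalg.det(T)

    nu_e = n - np.linalg.matrix_rank(X_full)   # error degrees of freedom
    denom = p**2 + q**2 - 5
    t = np.sqrt((p**2 * q**2 - 4) / denom) if denom > 0 else 1.0
    w = nu_e - (p - q + 1) / 2
    df1 = p * q
    df2 = w * t - (p * q - 2) / 2

    lam = wilks ** (1.0 / t)
    F = (1 - lam) / lam * df2 / df1
    return F, df1, df2


# Simulated example: 277 individuals, 4 SNPs, 3 PCs, 2 selected phenotypes.
rng = np.random.default_rng(0)
n = 277
pcs = rng.normal(size=(n, 3))
phenos = rng.normal(size=(n, 2))
snps = rng.integers(0, 3, size=(n, 4)).astype(float)
X_im = np.column_stack([np.ones(n), pcs])      # initial model design
F_stat, df1, df2 = nested_model_f(snps, X_im, phenos)
```

A p-value would follow from the upper tail of an F distribution with (df1, df2) degrees of freedom, for example via `scipy.stats.f.sf(F_stat, df1, df2)` if SciPy is available.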

To calculate the principal components used above, a separate PCA was conducted for genes on each of the 10 maize chromosomes. For the analysis of a given gene, only markers from the other 9 chromosomes were used, in order to reduce the endogenous correlations between genes and principal components [228]. A subset of 1.24 million SNPs distributed across both intragenic and intergenic regions on all 10 chromosomes was used to perform PCA for both GPWAS and GWAS. The first three PCs were calculated using the R function prcomp and included in the GPWAS analysis.
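The leave-one-chromosome-out idea can be illustrated with a short Python sketch. The original analysis used R's prcomp; the version below is an assumed equivalent that centers each marker without scaling (prcomp's defaults) and obtains scores from a singular value decomposition. The function name `loco_pcs` and the simulated genotypes are hypothetical.

```python
import numpy as np


def loco_pcs(genotypes, marker_chrom, target_chrom, n_pcs=3):
    """PC scores computed from markers on every chromosome except the target.

    genotypes    : n_individuals x n_markers array
    marker_chrom : length n_markers array of chromosome labels
    """
    keep = marker_chrom != target_chrom           # drop the target chromosome
    X = genotypes[:, keep].astype(float)
    X -= X.mean(axis=0)                           # center each marker, no scaling
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]               # scores = centered X times loadings


# Simulated example: 20 individuals, 10 markers on each of 10 chromosomes.
rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(20, 100))
chrom = np.repeat(np.arange(1, 11), 10)
pcs_by_chrom = {c: loco_pcs(geno, chrom, c) for c in range(1, 11)}
```

Genes on chromosome c would then be analyzed with `pcs_by_chrom[c]` as the population structure covariates.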

The final model can be represented as:

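$$
g_{k,i} = \mathrm{PC}_{k,1}\beta_{i1} + \mathrm{PC}_{k,2}\beta_{i2} + \mathrm{PC}_{k,3}\beta_{i3} + \sum_{j=1}^{v_i} \mathrm{Phe}_{k,(j)}\tau_{i(j)} + \epsilon_{k,ij}. \tag{4.1}
$$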

Here, the subscripts $k$ and $i$ denote the $k$th observation and the $i$th gene, respectively. There are $v_i$ selected phenotypes for the $i$th gene, where $v_i \le 260$. The selected phenotypes $\{\mathrm{Phe}_{k,(j)}\}$ are a subset of the collection of all phenotypes $\{\mathrm{Phe}_{k,1}, \mathrm{Phe}_{k,2}, \ldots, \mathrm{Phe}_{k,260}\}$, and $\tau_{i(j)}$ is the coefficient for the selected phenotype $\mathrm{Phe}_{k,(j)}$ of the $i$th gene. The first three PC scores $\mathrm{PC}_1$, $\mathrm{PC}_2$ and $\mathrm{PC}_3$ were always included in the model, with effects $\beta_{i1}$, $\beta_{i2}$ and $\beta_{i3}$. Note that $g_{k,i}$, $\beta_{i1}$, $\beta_{i2}$, $\beta_{i3}$ and $\tau_{i(j)}$ may be vectors corresponding to the multiple SNPs within the $i$th gene. Phenotypes were iteratively selected for each scanned gene. The p-value of each gene was determined using a partial F-test comparing the final model, containing both the first three PCs and the selected phenotypes, with the initial model containing only the PCs.

FDR cutoffs for the partial F-test were based on the results from 20 permutation analyses, for which the values for each trait were independently shuffled among the 277 genotyped individuals and the entire GPWAS pipeline was rerun for all genes. Selected significant GPWAS genes with incorporated phenotypes are listed in Supplementary Table 8.
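The permutation step can be sketched as below. The independent per-trait shuffling follows the description above. The text does not state how the permuted results were turned into FDR cutoffs, so the estimator shown (mean number of permuted genes passing a cutoff divided by the number of observed genes passing it) is an assumption, and the function names are hypothetical.

```python
import numpy as np


def shuffle_traits(phenotypes, rng):
    """Shuffle each trait (column) independently among individuals (rows)."""
    shuffled = np.empty_like(phenotypes)
    for j in range(phenotypes.shape[1]):
        shuffled[:, j] = rng.permutation(phenotypes[:, j])
    return shuffled


def empirical_fdr(observed_p, permuted_p, cutoff):
    """Assumed FDR estimate at a p-value cutoff.

    observed_p : per-gene p-values from the real data
    permuted_p : list of per-gene p-value arrays, one per permutation run
    """
    n_observed = np.sum(np.asarray(observed_p) <= cutoff)
    if n_observed == 0:
        return float("nan")                       # no discoveries at this cutoff
    expected_false = np.mean(
        [np.sum(np.asarray(p) <= cutoff) for p in permuted_p]
    )
    return float(min(1.0, expected_false / n_observed))


# Simulated example: 277 individuals, 5 traits; made-up p-values for 4 genes.
rng = np.random.default_rng(0)
phen = rng.normal(size=(277, 5))
phen_perm = shuffle_traits(phen, rng)

obs = np.array([0.001, 0.002, 0.5, 0.9])
perm_runs = [np.array([0.001, 0.3, 0.6, 0.8]), np.array([0.2, 0.4, 0.7, 0.9])]
fdr_at_cutoff = empirical_fdr(obs, perm_runs, cutoff=0.01)
```

In a full analysis each shuffled phenotype matrix would be passed through the entire pipeline to produce one array of per-gene p-values per permutation.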
