4. RESULTS AND DISCUSSION
5.8 Development
5.8.2 Restructuring of the maintenance area
Due to the added complexity that the machine learning models introduce, it was imperative that the data inputs to the models were managed correctly. Two different datasets were used: a smaller set consisting of only those SNPs found to be GWAS significant in the PGC-2 study, and a larger one created by performing a PLINK clumping procedure on the SNPs and selecting the index SNPs below a p-value threshold of 0.05 from the GWAS. Throughout this chapter, these two datasets will be referred to as the “GWAS significant dataset” and the “threshold dataset”. Ideally, both preparations of the data would have come from a dataset that had undergone the same Quality-Control (QC) procedures, but this was not possible: the more stringent QC procedure used on the threshold dataset removed 49.6% of the GWAS significant SNPs. A full description of the QC procedures, together with possible caveats, is given here. Each of the two datasets is described separately.
The GWAS significant dataset
The only QC procedure carried out on the GWAS significant dataset was an INFO score threshold of 0.9 for the SNP imputation. All of the trials carried out in this chapter excluded the sex chromosomes, and this dataset contained 125 of the 128 GWAS significant SNPs located in the autosomes (Ripke et al., 2014); for details on the individual SNPs and their locations, refer to table 2 in the supplementary material of their paper. However, the missingness rates of the SNPs were not considered. The effect of this can be seen for both the I1M and OMNI data in figures 3.1 and 3.2. While the distribution of the number of missing SNPs per sample seems acceptable, the two histograms showing the number of missing samples per SNP make it clear that these datasets did not undergo any QC procedure for missingness. For both chips, some SNPs were missing for many thousands of samples. While this is clearly not desirable, nothing else could have been done here if these GWAS significant SNPs were to be studied.
Figure 3.1.: Histograms showing missing data information for the I1M, 125 GWAS SNPs dataset. (a) Missing data points per SNP (x-axis: Number of Missing Samples; y-axis: Frequency). (b) Missing data points per sample (x-axis: Number of Missing SNPs).
Figure 3.2.: Histograms showing missing data information for the OMNI, 125 GWAS SNPs dataset. (a) Missing data points per SNP (x-axis: Number of Missing Samples; y-axis: Frequency). (b) Missing data points per sample (x-axis: Number of Missing SNPs).
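The per-SNP and per-sample missingness counts summarised in figures 3.1 and 3.2 could be computed along the following lines. This is a minimal sketch, not the code used in the thesis: the toy genotype table, its sample and SNP names, and the use of NaN to mark a missing call are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy genotype table: rows are samples, columns are SNPs,
# values are allele counts in {0, 1, 2}, with NaN marking a missing call.
genotypes = pd.DataFrame(
    {
        "snp_1": [0, 1, np.nan, 2],
        "snp_2": [1, np.nan, np.nan, 0],
        "snp_3": [2, 2, 1, 0],
    },
    index=["sample_a", "sample_b", "sample_c", "sample_d"],
)

missing = genotypes.isna()

# Panel (a): number of samples with a missing call, for each SNP.
missing_samples_per_snp = missing.sum(axis=0)

# Panel (b): number of SNPs with a missing call, for each sample.
missing_snps_per_sample = missing.sum(axis=1)

# Fraction of samples missing per SNP, the quantity a genotyping
# missingness threshold would be applied to.
snp_missing_rate = missing.mean(axis=0)
```

Histograms of the two count vectors would then correspond to panels (a) and (b) of each figure.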
The threshold dataset
As the best performances in the risk profile scoring section of the PGC-2 study for the CLOZUK dataset were achieved when the GWAS significance threshold was set at 0.05 (Ripke et al., 2014), it was decided that this threshold would be used again here. The initial data from which the SNPs were selected had been through the following additional QC steps:
• Genotyping missingness rate 2%
• Minor Allele Frequency 10%
• Hardy-Weinberg Equilibrium (HWE) significance level 1 × 10⁻⁴
These additional QC steps could be carried out in this case as there were no specific SNPs that had to be included in the analysis. In addition, the Major Histocompatibility Complex (MHC) region (chr6: 25-34 Mb) was removed due to the highly correlated nature of the SNPs in this region.
At this point 2,680,814 SNPs remained. However, these were not all independent markers due to the effects of Linkage Disequilibrium (LD), so the data had to be clumped to find the desired index SNPs. The same parameters outlined in PGC-2 were used here:
• p1 (the significance threshold of the index SNPs): 0.05
• kb (kilobases between index SNPs): 500
• r2 (the r2 value for the LD threshold): 0.1
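To illustrate what the three clumping parameters control, the following is a simplified greedy clumping sketch for a single chromosome. It is not PLINK's implementation: it uses the squared Pearson correlation between allele counts as a stand-in for the LD r2, assumes no missing genotypes, and the function name `clump` and the toy data are hypothetical.

```python
import numpy as np

def clump(positions, pvalues, genotypes, p1=0.05, kb=500, r2=0.1):
    """Greedy LD clumping on one chromosome (simplified sketch).

    positions: base-pair positions, shape (n_snps,)
    pvalues:   GWAS p-values, shape (n_snps,)
    genotypes: allele counts, shape (n_samples, n_snps), no missing values
    Returns the indices of the index SNPs, most significant first.
    """
    positions = np.asarray(positions)
    pvalues = np.asarray(pvalues)
    genotypes = np.asarray(genotypes, dtype=float)

    # Squared Pearson correlation between allele counts as an LD estimate.
    ld_r2 = np.corrcoef(genotypes, rowvar=False) ** 2

    available = np.ones(len(pvalues), dtype=bool)
    index_snps = []
    for i in np.argsort(pvalues):
        # Skip SNPs already clumped, or not significant enough to be an index SNP.
        if not available[i] or pvalues[i] > p1:
            continue
        index_snps.append(int(i))
        # Clump every SNP inside the window that is in LD with the index SNP
        # (this also marks the index SNP itself as used).
        in_window = np.abs(positions - positions[i]) <= kb * 1000
        available &= ~(in_window & (ld_r2[i] >= r2))
    return index_snps

# Toy data: SNPs 0 and 1 are close together and strongly correlated;
# SNP 2 lies outside the 500 kb window; SNP 3 is not significant.
positions = [1000, 2000, 900000, 901000]
pvalues = [0.001, 0.01, 0.04, 0.2]
genotypes = np.column_stack([
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 1],
    [2, 0, 1, 1, 2, 0],
    [1, 1, 0, 2, 0, 1],
])
index_snps = clump(positions, pvalues, genotypes)
```

In this toy example SNP 1 is absorbed into the clump of SNP 0, SNP 2 becomes a second index SNP because it is more than 500 kb away, and SNP 3 is never eligible because its p-value exceeds p1.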
After the clumping procedure, which was carried out in PLINK, a total of 14,462 SNPs remained. As the genotyping missingness threshold was far more stringent on this occasion, the distribution of the number of missing samples per SNP is not as positively skewed as in the previous example, as can be seen in figures 3.3 and 3.4. The maximum number of missing samples per SNP is far lower than in the GWAS significant dataset, and while the distribution of the number of missing SNPs per sample shows higher values, this is due to the greatly increased number of SNPs being used.
Figure 3.3.: Histograms showing missing data information for the full I1M dataset. (a) Missing data points per SNP (x-axis: Number of Missing Samples; y-axis: Frequency). (b) Missing data points per sample (x-axis: Number of Missing SNPs).
Figure 3.4.: Histograms showing missing data information for the full OMNI dataset. (a) Missing data points per SNP (x-axis: Number of Missing Samples; y-axis: Frequency). (b) Missing data points per sample (x-axis: Number of Missing SNPs).
Reformatting the data for the machine learning models
After selecting the SNPs as described above, the data had to be transformed into a format suitable for input to the Python machine learning scripts. The information required was how many of the reference alleles an individual had for each SNP. The term reference allele is used to signify which of the two alleles was considered in the calculation of the odds ratio in the GWAS statistical tests. In a GWAS data format, the reference allele is normally assumed to be the least frequent allele seen at each location; in effect, the minor allele. However, in the case of the data used in this study, the GWAS results were calculated from all of the different datasets contributed by the teams in the Schizophrenia Consortium apart from CLOZUK and Cardiff COGS (as this data had to be held out), so the less frequent allele at a given position could well have been different in the CLOZUK dataset. In order to control for this, the PLINK command --recodeA was used together with the flag --reference-allele and a text file specifying which allele should be used as the reference for each SNP. This results in a flat text file showing how many of these reference alleles each sample had for each of the SNPs.
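The effect of fixing the reference allele can be illustrated with a small sketch. For a biallelic SNP, a sample carrying k copies of one allele carries 2 - k copies of the other, so counts made against the wrong allele can be converted by subtraction. This is an illustration of the idea only, not of PLINK's internals; the function name, SNP identifiers and allele labels are hypothetical, and strand issues are ignored.

```python
import pandas as pd

def count_reference_alleles(counts, counted_allele, reference_allele):
    """Re-express allele counts in terms of the chosen reference allele.

    counts:           DataFrame of allele counts (samples x SNPs), values 0/1/2
    counted_allele:   Series mapping SNP -> allele the counts currently refer to
    reference_allele: Series mapping SNP -> allele the GWAS effect refers to
    """
    # Align both allele tables to the column order of the count matrix.
    counted = counted_allele.reindex(counts.columns)
    reference = reference_allele.reindex(counts.columns)
    to_flip = counts.columns[(counted != reference).to_numpy()]

    flipped = counts.copy()
    # For a biallelic SNP, k copies of one allele means 2 - k of the other.
    flipped[to_flip] = 2 - counts[to_flip]
    return flipped

# Hypothetical example: rs2 was counted against C, but the GWAS used T.
counts = pd.DataFrame({"rs1": [0, 1, 2], "rs2": [2, 1, 0]}, index=["s1", "s2", "s3"])
counted = pd.Series({"rs1": "A", "rs2": "C"})
reference = pd.Series({"rs1": "A", "rs2": "T"})
recoded = count_reference_alleles(counts, counted, reference)
```

Here the rs1 column is left unchanged, while the rs2 column is converted from counts of C to counts of T.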
Data Analysis in Python
All of the data analysis was carried out in Python, using two main packages for machine learning and data processing. The scikit-learn package (Pedregosa et al., 2011) is an ongoing development of a variety of different algorithms for a vast array of machine learning requirements, and provides an interface to all of the algorithms used throughout this thesis. The Pandas package (McKinney, 2011) is another ongoing development that facilitates the application of a wide selection of data processing procedures, often used to prepare and shape the data to make it ready for use in scikit-learn.
The recoding procedure outputs a file showing (amongst other information) the number of reference alleles each sample in the analysis has for each SNP. This is essentially a large matrix of integers in the set {0, 1, 2}: because each individual carries two copies of each autosomal chromosome, each sample must have zero, one or two copies of the reference allele at each location. This information was used to create the three main types of input information that were used in this chapter:
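• Raw Allele Counts: a matrix of the integers {0, 1, 2} only
• Weighted Allele Counts: the allele counts multiplied by the log odds ratio (LOR) of each SNP from the GWAS results
• Polygenic Score: the mean of all the weighted allele counts for each individual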
Creating these inputs was straightforward using Pandas: the allele counts could be extracted directly from the output of the --recodeA operation. The weighting was carried out by taking the order of the SNPs from the columns of the recode output file and using it to look up the log odds ratio (LOR) of each SNP in the GWAS results file. This gave a matrix of allele counts and a vector of the corresponding weights. In a single operation, these two can be multiplied together to give the final matrix of weighted values. The polygenic score was then calculated by taking the mean of the weighted values for each sample.
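The weighting and scoring steps described above can be sketched in Pandas as follows. This is a minimal illustration under stated assumptions, not the thesis code: the SNP identifiers, the odds ratio values and the column names `SNP` and `OR` are hypothetical, and the weights are taken to be the natural log of the odds ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical allele counts (samples x SNPs), as taken from the recode output.
allele_counts = pd.DataFrame(
    {"rs1": [0, 1, 2], "rs2": [2, 1, 0], "rs3": [1, 1, 1]},
    index=["s1", "s2", "s3"],
)

# Hypothetical GWAS results, deliberately listed in a different SNP order.
gwas = pd.DataFrame({"SNP": ["rs3", "rs1", "rs2"], "OR": [1.10, 1.25, 0.80]})
log_odds = np.log(gwas.set_index("SNP")["OR"])

# Put the weights in the column order of the allele-count matrix.
weights = log_odds.reindex(allele_counts.columns)

# A single operation multiplies every SNP column by its own weight.
weighted_counts = allele_counts.mul(weights, axis="columns")

# Polygenic score: mean of the weighted values for each sample.
polygenic_score = weighted_counts.mean(axis=1)
```

Reindexing the weights by the columns of the count matrix is what guarantees that each SNP is multiplied by its own LOR, regardless of the row order in the GWAS results. Note that `mean` skips missing values by default, so a sample with missing calls is averaged over its observed SNPs only; whether that matches the original analysis is not stated in the text.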