This chapter presents the methodology of the proposed solution. It describes the dataset in terms of the number of samples and SNPs for the cases and controls, and the transformations applied to it. It then discusses the quality control criteria, the imputation performed, and the codifying strategy. The theory behind the dimensionality reduction techniques and the classifiers used in the baseline is explained next. Finally, the proposed scoring scheme and feature ranking methods are explained in detail. For each, an algorithm is given, along with an example whenever applicable. Figure 6 shows an overview of the methodology.
Dataset
The dataset used in this thesis is the Wellcome Trust Case Control Consortium (WTCCC1) dataset [24]. It contains around 14,000 cases for seven diseases and around 1,500 controls, genotyped for 500k SNPs using the Affymetrix 500k microarray [44]. Table 2 shows the details of the dataset.
Table 2 Dataset Details

| Dataset | Number of samples | Number of features | Number of classes |
|---|---|---|---|
| 1958 British Birth Cohort samples (Controls) | 1504 | 500,568 | 1 |
| Bipolar Disorder (BD) samples | 1998 | 500,568 | 1 |
| Coronary Artery Disease (CAD) samples | 1998 | 500,568 | 1 |
| Inflammatory Bowel Disease (IBD) samples | 2005 | 500,568 | 1 |
| Hypertension (HT) samples | 2001 | 500,568 | 1 |
| Rheumatoid Arthritis (RA) samples | 1999 | 500,568 | 1 |
| Type 1 Diabetes (T1D) samples | 2000 | 500,568 | 1 |
Dataset Preprocessing
This section describes how the dataset was transformed to be suitable for the Scikit-Learn [45] machine learning library.
Transformation
The obtained dataset consists of eight directories. Each directory corresponds to one class in Table 2 and contains 23 files. Each file contains the genotyping of the SNPs in a specific chromosome for all samples in the class. The size of the dataset was 253 GB. Figure 7 shows a snapshot of one of the files. The files were in TAB delimited format, and the order of the columns was: SNP ID, SAMPLE ID, SNP VALUE, CONFIDENCE. It is important to note that the SNP value NN represents a missing value.
After exploring the files, a lot of redundancy was observed. The ID of each sample was repeated a number of times equal to the number of SNPs. In addition, the ID of each SNP was repeated a number of times equal to the number of samples. The following transformations were applied to the files of each class.
• Sample IDs and SNP IDs were extracted and kept in separate files.
• The SNP values of each sample were appended in one row. The order of the rows is the same as in the sample IDs file, while the order of the SNPs is the same as in the SNP IDs file.
After applying the transformations, the SNP values of each class were combined in one TAB delimited file. Each row contains the values of all SNPs of one sample, while each column contains the values of one SNP for all samples in the class. It is worth noting that the confidence value was not kept, because it is not used in the classification task. The size of the dataset was reduced to 17.8 GB.
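The transformation described above can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: it assumes each TAB delimited line holds SNP ID, SAMPLE ID, SNP VALUE, CONFIDENCE in that order, and it works in memory, which would not be practical for the full 253 GB dataset. The function name `pivot_genotypes` and the sample and SNP identifiers are hypothetical.

```python
import csv
import io


def pivot_genotypes(lines):
    """Pivot long-format rows (snp_id, sample_id, value, confidence)
    into (sample_ids, snp_ids, matrix) with one row per sample."""
    sample_ids, snp_ids = [], []
    seen_samples, seen_snps = set(), set()
    values = {}
    for snp_id, sample_id, value, _confidence in csv.reader(lines, delimiter="\t"):
        if sample_id not in seen_samples:
            seen_samples.add(sample_id)
            sample_ids.append(sample_id)
        if snp_id not in seen_snps:
            seen_snps.add(snp_id)
            snp_ids.append(snp_id)
        values[(sample_id, snp_id)] = value  # the confidence column is dropped
    # Any (sample, SNP) pair absent from the input is filled with "NN",
    # the missing-value marker used in the dataset.
    matrix = [[values.get((s, p), "NN") for p in snp_ids] for s in sample_ids]
    return sample_ids, snp_ids, matrix


# Hypothetical four-line excerpt of one class file.
demo = io.StringIO(
    "rs1\tS1\tAA\t0.99\n"
    "rs1\tS2\tAG\t0.95\n"
    "rs2\tS1\tNN\t0.10\n"
    "rs2\tS2\tCC\t0.98\n"
)
samples, snps, matrix = pivot_genotypes(demo)
```

Keeping the IDs in separate lists mirrors the separate ID files, so the matrix itself stores each genotype only once.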
Quality Control
The original paper of the dataset [24] excluded some samples from its study; the same samples were excluded here. In addition, samples and SNPs with more than 20% missing data were also excluded. In total, 804 samples and 31,216 SNPs were excluded.
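A minimal sketch of the 20% missingness filter is given below. The order of filtering (samples first, then SNPs) is an assumption, since the text does not specify it, and the helper names `missing_rate` and `filter_by_missingness` are hypothetical.

```python
def missing_rate(values, missing="NN"):
    """Fraction of entries equal to the missing-value marker."""
    return sum(v == missing for v in values) / len(values)


def filter_by_missingness(matrix, threshold=0.20):
    """Drop samples (rows), then SNPs (columns), whose missing rate
    is strictly greater than the threshold."""
    kept_rows = [i for i, row in enumerate(matrix) if missing_rate(row) <= threshold]
    rows = [matrix[i] for i in kept_rows]
    n_cols = len(matrix[0]) if matrix else 0
    kept_cols = [
        j for j in range(n_cols)
        if rows and missing_rate([row[j] for row in rows]) <= threshold
    ]
    filtered = [[row[j] for j in kept_cols] for row in rows]
    return filtered, kept_rows, kept_cols


# Toy genotype matrix: 5 samples x 5 SNPs (values are made up).
toy = [
    ["AA", "AG", "CC", "TT", "GG"],
    ["NN", "NN", "CC", "TT", "GG"],  # 40% missing: dropped
    ["AA", "AG", "CC", "NN", "GG"],  # exactly 20% missing: kept
    ["AA", "AG", "CT", "NN", "GG"],
    ["AA", "AA", "CC", "NN", "GG"],
]
filtered, kept_rows, kept_cols = filter_by_missingness(toy)
```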
Imputation
After applying quality control, the observed percentage of missing values in the dataset was 0.81%. These missing values were replaced with the most frequent value of the same SNP. In other words, if sample X from class Y has a missing value for SNP Z, the missing value is replaced with the most frequent value of SNP Z among all samples in class Y.
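The imputation rule can be illustrated with a short sketch applied to the matrix of a single class. The function name `impute_most_frequent` is hypothetical, and the tie-breaking behaviour (the first value encountered wins) is an assumption that the text does not address.

```python
from collections import Counter


def impute_most_frequent(matrix, missing="NN"):
    """Replace missing genotypes in each SNP column with that column's
    most frequent observed value. The input holds samples of ONE class."""
    imputed = [list(row) for row in matrix]
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] != missing]
        if not observed:
            continue  # nothing to impute from; the column is left as-is
        mode = Counter(observed).most_common(1)[0][0]
        for row in imputed:
            if row[j] == missing:
                row[j] = mode
    return imputed


# Toy class matrix: 4 samples x 2 SNPs (values are made up).
class_matrix = [["AA", "CT"], ["NN", "CT"], ["AA", "NN"], ["AG", "CC"]]
imputed = impute_most_frequent(class_matrix)
```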
Codifying
As shown in Figure 7, each SNP value is represented by two letters. There are four possible letters: A, C, T, and G. Thus, there are 16 possible combinations. Each combination and its corresponding numerical value are shown in Table 3.
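Table 3 Codifying Strategy

| SNP Value | Numerical Value | SNP Value | Numerical Value |
|---|---|---|---|
| AA | 1 | CA | 9 |
| AG | 2 | CG | 10 |
| AC | 3 | CC | 11 |
| AT | 4 | CT | 12 |
| GA | 5 | TA | 13 |
| GG | 6 | TG | 14 |
| GC | 7 | TC | 15 |
| GT | 8 | TT | 16 |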
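The mapping in Table 3 follows a regular pattern: both letters run through the order A, G, C, T, and the codes count up from 1. A small sketch that builds the same mapping is shown below; the names `GENOTYPE_CODES` and `encode_row` are hypothetical.

```python
from itertools import product

LETTERS = "AGCT"  # letter order implied by Table 3

# "AA" -> 1, "AG" -> 2, ..., "TT" -> 16
GENOTYPE_CODES = {
    first + second: code
    for code, (first, second) in enumerate(product(LETTERS, repeat=2), start=1)
}


def encode_row(row):
    """Convert one sample's two-letter SNP values to their numerical codes."""
    return [GENOTYPE_CODES[genotype] for genotype in row]


encoded = encode_row(["AG", "TC", "CC"])
```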
Merging and Class Labels Generation
The problem is formulated as a binary classification problem. To generate datasets that suit this formulation and are compatible with the Scikit-Learn library, the following steps were applied.
• The control samples were merged with the samples of each disease in Table 2. This results in seven files, one for each disease.
• For each file, a separate file is generated for the class labels. In each label file, the control samples are assigned the class -1, while the disease samples are assigned the class 1.
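The two steps above can be sketched as follows, using in-memory lists in place of files. The function name `merge_with_controls` is hypothetical; the labels -1 and 1 follow the convention stated in the list.

```python
def merge_with_controls(controls, cases):
    """Stack control rows and disease rows, and build the label vector:
    -1 for controls, 1 for disease samples."""
    X = [list(row) for row in controls] + [list(row) for row in cases]
    y = [-1] * len(controls) + [1] * len(cases)
    return X, y


# Toy encoded samples (values are made up).
controls = [[1, 6, 11], [2, 6, 12]]
cases = [[1, 8, 16]]
X, y = merge_with_controls(controls, cases)
```

Calling the function once per disease yields one feature matrix and one label vector per disease.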
Baseline
This section describes the dimensionality reduction techniques and the classifiers used in the baseline. It is worth noting that both dimensionality reduction techniques are filter methods based on information theory measures.
Dimensionality Reduction
Symmetrical Uncertainty
The first dimensionality reduction technique selected for the baseline is ranking features based on Symmetrical Uncertainty (SU) [46]. SU was selected because it was part of a method presented in [41] that achieved high classification accuracy on the WTCCC1 dataset. The SU between two variables X and Y is a value in the range [0,1], where 0 indicates that the two variables are completely independent, and 1 indicates that one variable can be completely predicted by the other. SU is built on top of the Mutual Information (MI), which can be described as the reduction in the entropy of one variable achieved by knowing another variable [46]. The SU of two variables X and Y is given by the following equation, where MI is the mutual information and H is the entropy.
$$SU(X, Y) = 2\left[\frac{MI(X, Y)}{H(X) + H(Y)}\right] \tag{9}$$
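As an illustration of Equation (9), the sketch below estimates SU from two discrete sequences using empirical frequencies. It relies on the standard identity MI(X, Y) = H(X) + H(Y) - H(X, Y); the function names are hypothetical, and returning 0 when both entropies are zero is an assumption made to avoid division by zero.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * MI(X, Y) / (H(X) + H(Y)), as in Equation (9)."""
    denominator = entropy(x) + entropy(y)
    if denominator == 0:
        return 0.0  # both variables are constant (assumed convention)
    return 2 * mutual_information(x, y) / denominator


su_dependent = symmetrical_uncertainty([1, 1, 2, 2], [0, 0, 1, 1])
su_independent = symmetrical_uncertainty([1, 1, 2, 2], [0, 1, 0, 1])
```

In the first call the feature determines the label completely, so SU reaches its maximum; in the second the two sequences are empirically independent, so SU is zero.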
Conditional Entropy
The second dimensionality reduction technique selected for the baseline is ranking features based on Conditional Entropy (CE) [47]. CE was selected because, to the best of the author's knowledge, it has never been used as a feature ranking method with SNP data in classification problems. The CE of a variable Y given a variable X can be described as the amount of information required to describe Y when X is known.
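A minimal sketch of CE for discrete data is given below, using the standard identity H(Y | X) = H(X, Y) - H(X) with empirical frequencies. The function names are hypothetical. Under the usual reading, a lower H(Y | X) marks a feature X that is more informative about the label Y; the sort order used for ranking is an assumption here, since the text does not spell it out.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """H(Y | X) = H(X, Y) - H(X)."""
    return entropy(list(zip(x, y))) - entropy(x)


labels = [0, 0, 1, 1]
ce_informative = conditional_entropy(labels, [1, 1, 2, 2])    # X determines Y
ce_uninformative = conditional_entropy(labels, [1, 1, 1, 1])  # X is constant
```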
Classifiers
Two classifiers were used in the baseline, which are described in the following.
K-Nearest Neighbors
In the KNN classifier [48], training samples are represented as points in space. To classify a test sample Z, the Euclidean distances between the test sample and all training samples are computed. The most frequent class among the K nearest training samples is predicted as the class of sample Z. In Figure 8, the unknown sample is assigned the class A based on K equal to four. Several values of K were tested on a small subset, and eleven performed best. Thus, in this thesis, K is equal to eleven.
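The voting rule described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, with a hypothetical function name `knn_predict` and made-up toy data; in practice Scikit-Learn's `KNeighborsClassifier` with `n_neighbors=11` would be the natural choice. The toy call uses K equal to three only because the toy set is tiny.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, z, k=11):
    """Predict the class of z by majority vote among its k nearest
    training samples, using Euclidean distance."""
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], z))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy training set: three controls (-1) and three cases (1).
train_X = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
train_y = [-1, -1, -1, 1, 1, 1]
near_controls = knn_predict(train_X, train_y, [2, 2], k=3)
near_cases = knn_predict(train_X, train_y, [8, 7], k=3)
```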
Support Vector Machine
In SVM [50], the goal is to find an optimal hyperplane that separates the two classes of the training samples. A test sample is assigned a class based on its position relative to the separating hyperplane. An optimal hyperplane, as shown in Figure 9, places the training samples of each class on opposite sides, so that all training examples are correctly categorized. In addition, the optimal hyperplane should maximize the margin to the nearest training samples of both classes. The training samples nearest to the hyperplane are called the support vectors, while the shape of the separating surface is determined by the kernel, which can be linear, polynomial, or another function. Figure 9 shows a toy example of a linear SVM with three support vectors. In this thesis, a linear kernel is used with the default parameter values of the LinearSVC classifier in the Scikit-Learn library. These values are: penalty parameter C=1, loss function loss=squared_hinge, penalization norm penalty=l2, dual optimization problem solved dual=True, tolerance for the stopping criterion tol=1e-4, no class weights class_weight=None, and a maximum of 1000 iterations max_iter=1000.
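For reference, the configuration listed above can be written out explicitly as in the sketch below. The parameter values are those stated in the text; passing them explicitly is a choice made here so that the example does not depend on the defaults of a particular Scikit-Learn version. The toy data is made up for illustration.

```python
from sklearn.svm import LinearSVC

# Parameter values as listed in the text, passed explicitly.
clf = LinearSVC(
    penalty="l2",
    loss="squared_hinge",
    dual=True,
    tol=1e-4,
    C=1.0,
    class_weight=None,
    max_iter=1000,
)

# Toy, linearly separable data: three controls (-1) and three cases (1).
X = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
y = [-1, -1, -1, 1, 1, 1]
clf.fit(X, y)
predictions = clf.predict([[0, 0], [10, 10]])
```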
Proposed Scoring Scheme
The hypothesis behind the proposed scoring scheme is that specific values within a feature might be useful in predicting specific class labels, while other values are not. In this context, each group of identical values within a feature is considered a region, as shown in Figure 10. It is clear from the toy example in Figure 10 that whenever the value of feature 1 is A, the corresponding class label is 1 in almost all samples, whereas whenever the value of the feature is C, the corresponding class label is always zero. However, samples with the value A are more frequent than samples with the value C. Thus, the proposed scoring scheme must consider both the usefulness of the region and the region's significance within the feature. It is important to note that the scoring scheme only computes scores and assigns class labels for the regions; further steps are still required to select the best features. The following describes the proposed scoring scheme for one feature, while the pseudo code of the scheme is presented in Algorithm 1.
For each region, a score ranging from zero to one is computed and a class label is assigned. The assigned class is the class that the region is useful for predicting. A positive score means that the region is more informative for one of the two class labels in a binary classification problem. The higher the score, the more informative the region is for that class label. Each region's score is normalized by the length of the feature vector, so it also reflects how much of the feature the region covers.
Algorithm 1 Pseudo code of the scoring scheme
Input
    Vector V of the values of one feature, categorical values
    Vector T of the class label of each sample, class label can be 0 or 1
Output
    Vector O of tuples, each tuple in the form (n, s, c), where n is the region name, s is the score of the region, and c is the assigned class label.
1: UniqueValues ← FindUniqueValues (V)
2: Regions ← new list, O ← new list
3: for each UniqueValue u ∈ UniqueValues do
4:     Temp ← FindClassLabels (u, T)
5:     add (Regions, (u, Temp))
6: end for
7: for each Region r (value, labels) ∈ Regions do
8:     ClassZeroCount ← Count (0, r (labels))
9:     ClassOneCount ← Count (1, r (labels))
10:     MaxCount ← max (ClassZeroCount, ClassOneCount)
11:     MinCount ← min (ClassZeroCount, ClassOneCount)
12:     s ← (MaxCount - MinCount) / length (V)
13:     c ← FindMostFreqClass (r (labels))
14:     add (O, (r (value), s, c))
15: end for
In Algorithm 1, line 1 finds the unique values within the feature. In lines 3-6, the corresponding class labels for each unique value are found and stored in the array Regions. Lines 7-15 are repeated for each region in the Regions array. In lines 8-9, the frequency of each class label in the region is computed. The maximum and minimum of these two frequencies are identified in lines 10-11. In line 12, the score of the region is computed by subtracting the frequency of the class with the fewest occurrences from the frequency of the class with the most occurrences, and dividing the result by the length of the feature vector. In line 13, the class with the maximum frequency is assigned as the class label of the region.
According to the scoring scheme, the perfect region is one whose corresponding class labels all belong to a single class. In other words, whenever the region value is observed within the feature, the class label of the sample is always the same. This indicates that the region is useful for identifying samples of a specific class. Figure 11 shows how the toy example in Figure 10 is scored.
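A compact Python sketch of the scoring scheme is given below. It follows the steps of Algorithm 1, but the function name `score_regions`, the toy data (which is not the data of Figure 10), and the tie-breaking rule (ties are assigned class 0) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def score_regions(values, labels):
    """Score each region (distinct value) of one categorical feature.
    Returns a list of (region, score, assigned_class) tuples."""
    region_labels = defaultdict(list)
    for value, label in zip(values, labels):
        region_labels[value].append(label)

    n = len(values)
    scored = []
    for region, labs in region_labels.items():
        counts = Counter(labs)
        zero, one = counts[0], counts[1]
        # Difference between majority and minority counts, normalized
        # by the length of the whole feature vector.
        score = (max(zero, one) - min(zero, one)) / n
        assigned = 1 if one > zero else 0  # ties go to class 0 (assumption)
        scored.append((region, score, assigned))
    return scored


# Hypothetical feature with three regions over ten samples.
values = ["A"] * 5 + ["C"] * 2 + ["G"] * 3
labels = [1, 1, 1, 1, 0] + [0, 0] + [1, 0, 0]
scored = score_regions(values, labels)
```

Here region A is large but slightly impure, region C is pure but small, and region G is both small and impure, so the scores decrease in that order.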
Feature Ranking Method 1
After the scoring scheme is applied to all features, the first proposed heuristic for ranking the features is to sum the scores of all regions within each feature and rank the features in descending order of the summed score. The top N features are then selected and fed to a classifier one feature at a time, and the accuracy of the classifier is recorded after each addition.
The rationale behind this method is that the perfect feature will achieve a summed score of 1, while the worst feature will achieve a summed score of zero. When the summed score of a feature is 1, the regions within this feature can perfectly distinguish between the class labels. In other words, within each region, there is only one class label. Figure 12 shows how the total score for the example in Figure 10 is computed, while the pseudo code of the method is given in Algorithm 2.
Algorithm 2 Pseudo code of method 1
Input
    Vector F of tuples, each tuple in the form (fid, (s1, s2, … si)), where fid is the feature id and s1 to si are the scores of the regions in the feature.
    N, the number of top features to be retrieved.
Output
    Vector O of N ids
1: TotalScores ← new list
2: O ← new list
3: for each Feature (fid, (s1, s2, … si)) f ∈ F do
4:     add (TotalScores, (fid, sum (s1, s2, … si)))
5: end for
6: SortedTotalScores ← SortDescending (TotalScores)
7: for each SortedTotalScore (fid, TotalScore) s ∈ SortedTotalScores do
8:     add (O, fid)
9:     if length (O) == N
10:         Break
11: end for
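Method 1 reduces to a sum followed by a sort, as the following sketch shows. The function name `rank_by_total_score` and the feature ids and scores are hypothetical.

```python
def rank_by_total_score(scored_features, n):
    """Rank features by the sum of their region scores (descending)
    and return the ids of the top n features."""
    totals = [(fid, sum(scores)) for fid, scores in scored_features]
    totals.sort(key=lambda item: item[1], reverse=True)
    return [fid for fid, _ in totals[:n]]


# Hypothetical scored features: (feature id, region scores).
scored_features = [
    ("snp_a", (0.3, 0.2, 0.1)),
    ("snp_b", (0.5, 0.5)),  # every region is pure, so the total is 1
    ("snp_c", (0.0, 0.1)),
]
top_two = rank_by_total_score(scored_features, 2)
```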
Feature Ranking Method 2
In the problem investigated in this thesis, each feature can take only a predefined set of values. Therefore, the second proposed heuristic is to select, for each possible region, the feature that contains the highest scoring instance of that region. The selected features are then sorted in descending order of the score of the highest scoring region within each feature. They are then fed to a classifier one feature at a time, and the accuracy is recorded. It is worth mentioning that this method selects a number of features that is less than or equal to the number of possible values of the features.
In method 1, the values of the features that achieved the highest total scores are not investigated. In the top N selected features, some of the possible values might never appear. Thus, the rationale of method 2 is to consider the significance of each possible value.
There are two reasons why method 2 may select fewer features than the number of possible values. First, some possible values may never appear in the dataset. Second, the same feature may contain more than one highest scoring region.
The pseudo code of method 2 is given in Algorithm 3. The presented algorithm scans the scored features only once, whereas a naïve implementation may scan them several times. In line 3, the dictionary is initialized with empty tuples. The first entry of each tuple will hold the score of a region, while the second entry will hold a feature id.
Algorithm 3 Pseudo code of method 2
Input
    Vector F of tuples, each tuple in the form (fid, (s1, s2, … si), (r1, r2, … ri)), where fid is the feature id, s1 to si are the scores of the regions in the feature, and r1 to ri are the names of the regions.
    Vector P of all possible values of the features.
Output
    Vector O of N ids, where N <= length (P)
1: HighestScoringRegions ← new dictionary, O ← new list
2: for each PossibleValue p ∈ P do
3:     HighestScoringRegions [p] ← (ϕ, ϕ)
4: end for
5: for each Feature (fid, (s1, s2, … si), (r1, r2, … ri)) f ∈ F do
6:     for i repetitions do
7:         if HighestScoringRegions [ri] [0] < si
8:             HighestScoringRegions [ri] [0] ← si
9:             HighestScoringRegions [ri] [1] ← fid
10:     end for
11: end for
12: SortedRegionsScores ← SortDescending (HighestScoringRegions)
13: for each SortedRegionsScore (s, fid) sr ∈ SortedRegionsScores do
14:     add (O, sr [1])
15: end for
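The single-pass idea of method 2 can be sketched as below. The function name `rank_by_best_region` and the toy data are hypothetical. Skipping a feature id that was already selected is an assumption, made so that the output matches the statement that the number of selected features is at most the number of possible values.

```python
def rank_by_best_region(scored_features, possible_values):
    """For each possible region value, keep the feature holding its
    highest score, then order the kept features by that score."""
    best = {}  # region value -> (score, feature id)
    for fid, scores, regions in scored_features:  # single pass over features
        for score, region in zip(scores, regions):
            if region not in possible_values:
                continue
            if region not in best or score > best[region][0]:
                best[region] = (score, fid)

    ranked = sorted(best.values(), key=lambda entry: entry[0], reverse=True)
    selected = []
    for _, fid in ranked:
        if fid not in selected:  # a feature may win several regions
            selected.append(fid)
    return selected


# Hypothetical scored features: (feature id, region scores, region names).
scored_features = [
    ("f1", (0.30, 0.05), ("AA", "AG")),
    ("f2", (0.10, 0.40), ("AA", "AG")),
    ("f3", (0.20, 0.25), ("GG", "AA")),
]
selected = rank_by_best_region(scored_features, ["AA", "AG", "GG", "TT"])
```

In this toy run the value TT never appears, so only three of the four possible values contribute a feature.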
Feature Ranking Method 3
The problem is formulated as a binary classification problem. This means that the class label assigned to any region can only be one of two classes. However, within the same feature, multiple regions might be assigned the same class label. Thus, the third proposed heuristic is to sum the scores of all regions within the same feature that are assigned the same class label. As a result, two scores are computed for each feature, one for each class label. The features are then ranked in descending order of each class label's score, producing two separate lists. From each list, the top N features are selected and fed to a classifier, one feature from each list at a time, and the classification accuracy is recorded.
In method 1 and method 2, the classes assigned to the regions were not taken into account. This may result in an unbalanced selection of features. In other words, most of the selected features might contain high scoring regions for one class only. Thus, the rationale behind method 3 is to ensure a balanced selection of features for each class label. Figure 13 shows how the score for each class label is computed for the example in Figure 10, while the pseudo code of the method is given in Algorithm 4.
Algorithm 4 Pseudo code of method 3
Input
    Vector F of tuples, each tuple in the form (fid, (s1, s2, … si), (c1, c2, … ci)), where fid is the feature id, s1 to si are the scores of the regions in the feature, and c1 to ci are the class labels assigned to the regions.
    N, the number of top features to be retrieved.
Output
    Vector O of N ids
1: ClassOneScores ← new dictionary, ClassZeroScores ← new dictionary, O ← new list
2: for each Feature (fid, (s1, s2, … si), (c1, c2, … ci)) f ∈ F do
3:     ClassOneScores [fid] ← SumScores (1, (s1, … si), (c1, … ci))
4:     ClassZeroScores [fid] ← SumScores (0, (s1, … si), (c1, … ci))
5: end for
6: SortedClassOneScores ← SortDescending (ClassOneScores)
7: SortedClassZeroScores ← SortDescending (ClassZeroScores)
8: Counter ← 1
9: while length (O) < N do
10:     add (O, SortedClassZeroScores [Counter])
11:     if length (O) == N
12:         Break
13:     add (O, SortedClassOneScores [Counter])
14:     Counter++
15: end while
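The alternating selection of method 3 can be sketched as follows. The function name `rank_balanced` and the toy data are hypothetical. Skipping a feature id that was already selected from the other list is an assumption that the pseudo code does not state.

```python
def rank_balanced(scored_features, n):
    """Sum region scores per assigned class, rank features separately
    for class 0 and class 1, then pick from the two lists in turn."""
    zero_scores, one_scores = {}, {}
    for fid, scores, classes in scored_features:
        zero_scores[fid] = sum(s for s, c in zip(scores, classes) if c == 0)
        one_scores[fid] = sum(s for s, c in zip(scores, classes) if c == 1)

    by_zero = sorted(zero_scores, key=zero_scores.get, reverse=True)
    by_one = sorted(one_scores, key=one_scores.get, reverse=True)

    selected = []
    for fid_zero, fid_one in zip(by_zero, by_one):
        for fid in (fid_zero, fid_one):  # class 0 list first, then class 1
            if len(selected) < n and fid not in selected:
                selected.append(fid)
        if len(selected) >= n:
            break
    return selected


# Hypothetical scored features: (feature id, region scores, region classes).
scored_features = [
    ("f1", (0.30, 0.10), (1, 0)),
    ("f2", (0.05, 0.40), (1, 0)),
    ("f3", (0.20, 0.15), (0, 0)),
    ("f4", (0.25,), (1,)),
]
top_three = rank_balanced(scored_features, 3)
```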