Chapter 5: Discussion, Conclusions and Recommendations
5.1 Discussion
It is important to validate results and to show that they are not due to random chance or to overfitting the data. Because new and relevant data are difficult to obtain in this field, we needed more sophisticated techniques to validate the results we obtained. We used random label shuffling to test our results.
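The label-shuffling idea can be sketched as a permutation test. The following is a minimal illustration, not the procedure actually used in this work: the nearest-centroid classifier, the leave-one-out scoring, the number of shuffles and the function names `loo_accuracy` and `permutation_p_value` are all assumptions made for the example.

```python
import random


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in model)."""
    correct = 0
    for i, row in enumerate(X):
        # Build class centroids from every patient except patient i.
        members = {}
        for j, label in enumerate(y):
            if j != i:
                members.setdefault(label, []).append(X[j])
        centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in members.items()
        }
        # Predict the class whose centroid is closest to patient i.
        predicted = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
        )
        correct += predicted == y[i]
    return correct / len(X)


def permutation_p_value(X, y, n_shuffles=200, seed=0):
    """Compare accuracy on the true labels with accuracy on shuffled labels."""
    rng = random.Random(seed)
    observed = loo_accuracy(X, y)
    shuffled = list(y)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between data and labels
        if loo_accuracy(X, shuffled) >= observed:
            at_least_as_good += 1
    # The +1 terms count the true labeling as one of the possible labelings.
    return observed, (at_least_as_good + 1) / (n_shuffles + 1)
```

If shuffled labels rarely score as well as the true labels, the returned p-value is small, which suggests the observed accuracy is unlikely to be a product of chance alone.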
3.8 Further SNP Analysis
Because of the interesting results found for the SNP dataset, we decided to investigate it further. Many different experimental pathways can be explored during a data-mining experiment. We explored five: predicting relapse, graph analysis, reformatting the data, updating the patient labels, and cross validation of the attributes.
3.8.1 Predicting relapse
The previous attribute selection was performed using patient mortality as the class label. To select attributes that are predictive of relapse, the random forest algorithm was run on the SNP data with the class label being whether or not the patient relapsed.
3.8.2 Graph analysis
The previous analysis using SVD focused on an individual’s attribute values to determine their place in the object space. Another way to view this problem is to compare
CHAPTER 3. EXPERIMENTS 30
patients to each other using the dot product to create a similarity matrix. This matrix is n by n, and the entry x(i,j) represents the similarity between patients i and j. After a threshold is applied, any value less than the threshold becomes 0 and the matrix can be viewed as an adjacency matrix for a graph, in which every non-zero entry represents an edge between two points. SVD is then applied to this matrix and the result is plotted.
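The steps above can be sketched with NumPy as follows. This is an illustrative sketch only: the function name `similarity_graph_embedding`, the choice of three plotted dimensions and the threshold value are assumptions, and the text does not say whether the data were normalized first, so no normalization is applied here.

```python
import numpy as np


def similarity_graph_embedding(X, threshold, n_dims=3):
    """Dot-product similarity -> thresholded adjacency matrix -> SVD coordinates."""
    X = np.asarray(X, dtype=float)
    S = X @ X.T                           # n x n matrix, S[i, j] = similarity of i and j
    A = np.where(S < threshold, 0.0, S)   # values below the threshold become 0
    U, s, Vt = np.linalg.svd(A)
    # Every non-zero off-diagonal entry is an edge between two patients.
    n = A.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] != 0]
    return A, U[:, :n_dims], edges
```

The returned coordinates can be plotted as points, with a line drawn between each pair of patients listed in `edges`.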
3.8.3 Reformatting the data
As explained previously, the SNP data is coded to represent the three possible alleles. The previous data had one attribute for each SNP, which had three possible values. Another approach we developed was to split each SNP into three attributes, one for each allele. For each SNP, a patient would have a value of 1 for the allele they had and 0 for the other two alleles. We then performed attribute selection on this dataset in the hope that any specific alleles of interest for certain SNPs would be selected using this method. Finally, we applied SVD to the resulting datasets and observed the results.
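A minimal sketch of this reformatting is shown below. The string codes "AA", "AB" and "BB" and the function name `split_snps` are assumptions for illustration; the actual coding of the dataset may differ.

```python
GENOTYPES = ("AA", "AB", "BB")  # assumed codes for the three possible values


def split_snps(patients):
    """Expand each SNP column into three 0/1 attributes, one per possible value.

    patients: list of rows, each row a list of codes, one per SNP.
    """
    expanded = []
    for row in patients:
        new_row = []
        for genotype in row:
            # 1 for the value the patient has, 0 for the other two.
            new_row.extend(1 if genotype == g else 0 for g in GENOTYPES)
        expanded.append(new_row)
    return expanded
```

For example, a patient with codes `["AA", "BB"]` for two SNPs becomes `[1, 0, 0, 0, 0, 1]`, so the reformatted dataset has three times as many attributes as the original.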
3.8.4 Updating patient labels
Late in the development of this research, we were able to obtain updated data for the patients. This included five patients who had since died and seven who had relapsed. We were interested to see where these patients lie in our previous space, and so the SVD images were relabeled with the updated information. We also reran the random forest algorithm to see the effect on the attribute-selection process with updated labels.
3.8.5 Cross validation of the attributes
To further support the attribute-selection process, we divided the selected subsets into smaller subsets that we then ran through the random forest algorithm. We then compared the resulting attribute lists in order to see which attributes appeared in multiple lists or only in one.
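The comparison of the resulting attribute lists can be sketched as a simple count of how many lists each attribute appears in. The function name `attribute_recurrence` and the attribute identifiers in the usage note are hypothetical; the lists themselves would come from the random forest runs described above.

```python
from collections import Counter


def attribute_recurrence(attribute_lists):
    """Split attributes into those found in several lists and those found in one."""
    # set(lst) ensures an attribute is counted at most once per list.
    counts = Counter(a for lst in attribute_lists for a in set(lst))
    repeated = sorted(a for a, c in counts.items() if c > 1)
    unique = sorted(a for a, c in counts.items() if c == 1)
    return repeated, unique
```

Calling it on `[["snp_1", "snp_2", "snp_3"], ["snp_2", "snp_4"], ["snp_2", "snp_3"]]` separates `snp_2` and `snp_3`, which recur, from `snp_1` and `snp_4`, which appear only once.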
Chapter 4
Results
In this chapter we report and explain the results of the experiments. First, we look at the clustering of the data based on the SVD images for each of the individual datasets as well as the combined datasets. Next, we explore the subsets of the data created through the attribute-selection process and evaluate these using SVD as well as SVM. Then, we discuss the results of attribute selection and present the top attributes from each dataset. We also explore the SNP dataset further in order to find the biological significance of the results.
4.1 SVD Results
Each of the following results was obtained by plotting the first three dimensions of the U matrix in order to visualize the data. All of the images are labeled with clinical information, such as the patient mortality outcome or the subtype of the disease. By doing SVD on the entire dataset as an early step, we learn some general structures that may exist in these data. We can also use these images as a benchmark for how
CHAPTER 4. RESULTS 33
our subsequent analysis performs.
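Obtaining the plotted coordinates can be sketched as below. This is a sketch under assumptions: the function name `svd_coordinates` is hypothetical, and no centering or scaling is applied because the text does not state which preprocessing, if any, was used.

```python
import numpy as np


def svd_coordinates(X, n_dims=3):
    """Return the first n_dims columns of U for a patients-by-attributes matrix."""
    X = np.asarray(X, dtype=float)
    # full_matrices=False gives the compact SVD, X = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_dims], s
```

Each row of the returned array gives one patient's position in the three-dimensional plot, and the points can then be colored by mortality or subtype label.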
4.1.1 SVD on SNP data
The resulting SVD image for the full SNP dataset can be seen in Figure 4.1. There appear to be two fairly well defined clusters in the data, but they are clearly not related to mortality, seen in Figure 4.1(a), or to subtype of disease, seen in Figure 4.1(b). Since this dataset includes 13917 SNPs, not all of them are expected to be associated with leukemia. Upon further investigation we determined that the spread of the data is caused by the way in which the data is coded: the data tend to form three clusters based on whether patients have a majority of genes of type AA, AB or BB.
Figure 4.1: SVD images of SNP data. (a) SNP data labeled with mortality: blue = alive, red = deceased. (b) SNP data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.
4.1.2 SVD on cDNA data
The SVD image for the cDNA dataset does not appear to contain any noticeable clusters based on the mortality label, as seen in Figure 4.2(a). When this image is labeled with the patients’ subtype of the disease, more interesting results appear. As seen in Figure 4.2(b), there is a fairly well defined cluster of T-cell patients. This implies that the cDNA genes are able to capture a variation in the subtype of disease that the patients have. This is not surprising, since it is well known that the subtypes can be distinguished by the expression of only one gene [27].
Figure 4.2: SVD images of cDNA. (a) cDNA data labeled with mortality: blue = alive, red = deceased. (b) cDNA data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.
4.1.3 SVD on Affy data
The SVD results of the Affy dataset did not provide much useful information. The images, labeled with mortality and subtype, are seen in Figures 4.3(a) and (b). It is
clear that neither label cleanly separates the data. However, when labeling by subtype it can be seen that there are no T-cell patients in the main cluster of points. This suggests that the Affy data contains some information regarding the subtype of the disease, although because so many points have an unknown disease type it is difficult to be confident in this assessment. It is interesting to note that the cDNA data separates the patients better than the Affy data does. This was moderately surprising, as Affymetrix technology is generally accepted to be more reliable than cDNA microarrays. We believe this is related to the way in which the data was collected and perhaps to issues combining datasets from different operators.
Figure 4.3: SVD images of Affy. (a) Affy data labeled with mortality: blue = alive, red = deceased. (b) Affy data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.
4.1.4 SVD on clinical data
The shape of this dataset is interesting, as seen in Figure 4.4. There are four parallel clusters of data, but these cannot be explained by the mortality or subtype labels. Upon further investigation we found that the three leftmost clusters consist of the data for patients who had a genetic translocation. More specifically, the cluster farthest to the left contains data for patients who had a BCR-ABL translocation, and to the right of that is a cluster of data for patients who had a TCL-AML translocation. The cluster on the far right can only be described as the data points for patients who did not have a translocation. The spread of data from the bottom to the top of each cluster is affected by the size of the liver at diagnosis, the size of the spleen at diagnosis and the initial platelet count. It is quite clear that any medical decisions about diagnosis, prognosis or treatment based on these data alone would probably be poor. Yet this is the type of information that is presently being used for decisions regarding leukemia. Based on the results of the genetic datasets, we believe that it is important that genetic information be used to help support the diagnosis, prognosis and treatment decisions.
Figure 4.4: SVD images of Clinical. (a) Clinical data labeled with mortality: blue = alive, red = deceased. (b) Clinical data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.
4.1.5 SVD on combined SNP and cDNA data
The combined datasets have many more attributes than the individual datasets. A large dataset will contain many attributes that are irrelevant for our purposes and carry no useful information. If there are many more of these attributes than useful attributes, the SVD may not be able to discover any information from the useful attributes and the experiment will not be successful. The images in Figures 4.5(a) and (b) demonstrate this effect. When labeling by subtype, all but one of the T-cell patients are clustered together. However, there are also several B-cell patients in this cluster. This does support the theory that the cDNA dataset primarily contains information regarding the subtype of the disease.
Figure 4.5: SVD images of combined SNP-cDNA. (a) SNP-cDNA data labeled with mortality: blue = alive, red = deceased. (b) SNP-cDNA data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.
4.1.6 SVD on combined SNP and Affy data
It is interesting to note that the shape of the data in the SVD image, shown in Figure 4.6, is similar to that of the Affy dataset alone. This shows that the SNP dataset does not have a powerful global effect when it is combined with the larger Affy dataset. As such, the evaluation of this image is similar to that of the previously described Affy dataset. Once again, the mortality and subtype labels, shown in Figures 4.6(a) and (b), do not provide any explanation for the shape of this dataset.
Figure 4.6: SVD images of combined SNP-Affy. (a) SNP-Affy data labeled with mortality: blue = alive, red = deceased. (b) SNP-Affy data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.
4.1.7 SVD on combined Affy and cDNA data
This dataset was interesting, because the shape of the data is similar to that of the cDNA dataset alone. This is surprising because the Affy dataset is more than double the size of the cDNA dataset, and so in order for this to happen the cDNA dataset must contain many more globally powerful attributes. As seen in Figure 4.7(a), the mortality label does not provide any meaningful separation in the data, but in Figure 4.7(b), the subtype label does appear to be fairly well separated. It is clear that the cDNA data has the ability to capture variation in the patients based upon their subtype of the disease. It is important to note that the combination of these two datasets is different from combining either of them with the SNP data. These two datasets are intended to capture the same information, that is, the genetic expression
levels of certain genes. This suggests that they should be similar to each other and the combination may not provide any interesting information.
Figure 4.7: SVD images of combined cDNA-Affy. (a) Affy-cDNA data labeled with mortality: blue = alive, red = deceased. (b) Affy-cDNA data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.
4.1.8 SVD on combined SNP, cDNA and Affy data
The shape of this dataset is also similar to that of the cDNA dataset, suggesting again that the cDNA dataset contains the most obvious structure. The results of labeling by mortality are shown in Figure 4.8(a). It can be seen that there are no tight clusters of patients. When labeling by subtype, as shown in Figure 4.8(b), the T-cell patients appear to cluster together at the bottom left of the image. Since the separation based on subtype continues to appear with these large microarray datasets, it is clear that the genetic expression patterns for these two subtypes are quite distinct, which enables the SVD to discover the separation between them.
Figure 4.8: SVD images of combined SNP-cDNA-Affy. (a) SNP-Affy-cDNA data labeled with mortality: blue = alive, red = deceased. (b) SNP-Affy-cDNA data labeled with subtype: red = T-cell, blue = B-cell, green = unknown.
So far, we have seen that the entire datasets contain only weak clusters that are mostly related to the subtype of the disease. Next, we look at subsets of the attributes as determined by the random forests algorithm.
4.2 Combination of Datasets
To test which method of combination was better, we explored the combined SNP and cDNA dataset. The SVM results for the two methods of combination are shown in Table 4.1 and Table 4.2. It is quite clear that first combining the datasets and then performing attribute selection gives much better prediction accuracy than performing attribute selection on the individual datasets and then combining the top attributes. The reason this method works better is that the random forests
algorithm is able to find correlations between the two sets of data when selecting the best attributes for splitting. It is interesting that this SNP-cDNA dataset showed a significant improvement with this method of combination, because it means that the correlation between the SNP dataset and the cDNA dataset provides meaningful information about the mortality of patients. Because of this, we decided to use this method of combination for all experiments.
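The difference between the two orders of operation can be sketched as follows. This sketch deliberately swaps in a simple stand-in score, the gap between class means, in place of random forests, so it shows only the order of the steps and cannot reproduce the cross-dataset correlations discussed above. All function names are hypothetical, the class labels are assumed to be coded 0 and 1, and taking k attributes from each dataset in the second method is an assumption. A random forest importance function could be passed in through the `score` parameter.

```python
import numpy as np


def mean_gap_score(X, y):
    """Stand-in attribute score (NOT random forests): gap between class means."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    return np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))


def top_k(scores, k):
    """Indices of the k highest-scoring attributes, best first."""
    return np.argsort(scores)[::-1][:k]


def combine_then_select(A, B, y, k, score=mean_gap_score):
    """Join the two datasets first, then keep the k best attributes overall."""
    combined = np.hstack([np.asarray(A, dtype=float), np.asarray(B, dtype=float)])
    return combined[:, top_k(score(combined, y), k)]


def select_then_combine(A, B, y, k, score=mean_gap_score):
    """Keep the k best attributes of each dataset separately, then join them."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    return np.hstack([A[:, top_k(score(A, y), k)], B[:, top_k(score(B, y), k)]])
```

In the first function the scoring step sees attributes from both datasets at once, which is what would allow a multivariate selector such as random forests to exploit relationships between them. In the second function each dataset is scored in isolation.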
Table 4.1: SVM prediction accuracy of combining datasets and then doing attribute selection (6-fold cross validation)
Attributes    % Class Alive    % Class Deceased
25            100              100
50            100              100
100           100              100
250           100              100
Table 4.2: SVM prediction accuracy of attribute selection and then combining datasets (6-fold cross validation)
Attributes    % Class Alive    % Class Deceased
25            97.62            57.14
50            97.62            100
100           100              85.71