A dense model: (X)_{i,i} = 2, (X)_{i,i'} = 1 otherwise (i ≠ i')
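As a minimal sketch of this definition, the following builds the dense matrix for a small p and checks a closed-form inverse. It assumes that X denotes the p × p inverse covariance matrix of Model 2; the function names, the choice p = 4, and the use of the Sherman-Morrison identity are illustrative and are not taken from the thesis.

```python
def dense_model(p):
    """Return the p x p dense matrix X with 2 on the diagonal and 1 elsewhere."""
    return [[2.0 if i == j else 1.0 for j in range(p)] for i in range(p)]


def dense_model_inverse(p):
    """Closed-form inverse of X = I + 1 1^T (Sherman-Morrison):
    X^{-1} = I - 1 1^T / (p + 1)."""
    off = -1.0 / (p + 1)
    return [[1.0 + off if i == j else off for j in range(p)] for i in range(p)]


def matmul(a, b):
    """Plain-Python matrix product, adequate for small illustrative p."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
        for row in a
    ]


p = 4                              # illustrative size only
X = dense_model(p)                 # assumed to be the inverse covariance
Sigma = dense_model_inverse(p)     # implied covariance matrix
product = matmul(X, Sigma)         # numerically close to the identity
```

Because every off-diagonal entry of X is non-zero, no sparsity pattern is available for an l1 penalty to exploit, which is the sense in which this model is "dense".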
Figure 2.15 Model 2 Error between Graphical Lasso and Modified Graphical Lasso (p =10).
Figure 2.16 Model 2 Optimal penalty for Graphical Lasso and Modified Graphical Lasso (p =10).
Figure 2.15 shows the error across different sample sizes for the dense model when p=10. From Figure 2.15, it appears that at most sample sizes the Modified Graphical Lasso performs better than the original Graphical Lasso, with lower error. However, the statistical hypothesis tests in Appendix E, Table E.9 show that the original Graphical Lasso performs statistically significantly better than the Modified Graphical Lasso when 𝑁 > 30. When 𝑁 < 30, the two methods perform essentially the same. As expected in the dense scenario, the pseudoinverse performs better than both the Graphical Lasso and the Modified Graphical Lasso when 𝑁 > 30, and its performance improves as the number of samples increases.
Figure 2.17 Model 2 Error between Graphical Lasso and Modified Graphical Lasso (p =30).
Figure 2.18 Model 2 Optimal penalty for Graphical Lasso and Modified Graphical Lasso (p =30).
Figure 2.19 Model 2 error between Graphical Lasso and Modified Graphical Lasso (p =50).
Figure 2.20 Model 2 optimal penalty for Graphical Lasso and Modified Graphical Lasso (p =50).
Figure 2.17 shows the error across different sample sizes for the dense model when p=30, while Figure 2.19 shows the same for the dense model when p=50. In these two figures, the Modified Graphical Lasso appears to perform better than the Graphical Lasso. However, the statistical hypothesis tests in Appendix E, Table E.11 and Table E.13 show that the two methods perform essentially the same across all sample sizes. When p=30, both methods perform better than the pseudoinverse when 𝑁 < 70 (Appendix E, Table E.11). When p=50, both methods perform better than the pseudoinverse when 𝑁 > 30 (Appendix E, Table E.13). We now look at performance when p=70.
Figure 2.21 Model 2 error between Graphical Lasso and Modified Graphical Lasso (p =70).
Figure 2.22 Model 2 optimal penalty for Graphical Lasso and Modified Graphical Lasso (p =70).
Figure 2.21 shows the error across different sample sizes for the dense model when the number of variables is increased to p=70. In this figure, the Modified Graphical Lasso appears to perform better than the Graphical Lasso. However, the statistical hypothesis tests in Appendix E, Table E.15 show that the two methods perform essentially the same across all sample sizes, and that both methods perform better than the pseudoinverse when 𝑁 > 40.
Figure 2.16, Figure 2.18, Figure 2.20 and Figure 2.22 show the optimal regularization as the sample size increases for the dense model when p=10, p=30, p=50 and p=70 respectively. These figures show that, across all sample sizes, the optimal regularization for the Graphical Lasso always lies between the optimal off-diagonal and diagonal regularizations for the Modified Graphical Lasso.
Once again, the experiment in section 2.3.1 showed that the diagonal elements are over-estimated as the regularization increases. We therefore expected the Modified Graphical Lasso to need a higher regularization for the diagonal elements and a lower regularization for the off-diagonal elements. Our results go against this hypothesis: across all variable and sample sizes, the optimal off-diagonal regularization for the Modified Graphical Lasso is higher than the optimal diagonal regularization.
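For reference, the two penalization schemes being compared can be written as below. This is a sketch in assumed notation (Θ for the inverse covariance estimate, S for the sample covariance, ρ for the single Graphical Lasso penalty, ρ_d and ρ_o for the diagonal and off-diagonal penalties of the Modified Graphical Lasso); the exact formulation given in section 2.4.1 may differ.

```latex
\begin{align}
\hat{\Theta}_{\mathrm{GL}}
  &= \arg\max_{\Theta \succ 0}\;
     \log\det\Theta - \operatorname{tr}(S\Theta)
     - \rho \sum_{i,j} \lvert \Theta_{ij} \rvert \\
\hat{\Theta}_{\mathrm{MGL}}
  &= \arg\max_{\Theta \succ 0}\;
     \log\det\Theta - \operatorname{tr}(S\Theta)
     - \rho_{d} \sum_{i} \lvert \Theta_{ii} \rvert
     - \rho_{o} \sum_{i \neq j} \lvert \Theta_{ij} \rvert
\end{align}
```

Under this notation, setting ρ_d = ρ_o = ρ recovers the original Graphical Lasso, and the results reported here correspond to optimal values with ρ_o > ρ_d.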
2.6 Summary
In this chapter, we presented the Graphical Lasso algorithm, which is the method that we use to estimate all sparse inverse covariances for application in bioinformatics and finance problems in this thesis. We showed the importance of regularization in section 2.3.1 and presented existing methods for approximating the optimal penalty parameter in section 2.3.2. We pointed out that the Graphical Lasso tends to over-estimate the diagonal elements as the amount of regularization increases, and in section 2.4.1 we presented a new method that addresses this problem by using two different penalties. We presented the results of this new method, the 'Modified Graphical Lasso', whose performance was essentially the same as that of the original Graphical Lasso when 𝑝 > 10 on synthetically generated data from sparse and dense inverse covariance models.
For the sparse model, the Graphical Lasso and Modified Graphical Lasso performed better than the pseudoinverse across all variable and sample sizes. For the dense model, the pseudoinverse improved and performed as well as or better than the Graphical Lasso and Modified Graphical Lasso in several instances, especially when the sample size was very high relative to the number of variables. These results were consistent with our hypothesis that the Graphical Lasso and Modified Graphical Lasso would perform better than the pseudoinverse when the data come from a sparse model. In such scenarios, the Graphical Lasso is known to give a better approximation of the inverse covariance matrix than inverting the sample covariance matrix directly or using the pseudoinverse. Based on the results in section 2.3.1, we hypothesized that the Modified Graphical Lasso would need a higher regularization for the diagonal elements, to remedy the diagonal over-estimation problem, and a lower regularization for the off-diagonal elements. However, our results went against this hypothesis: performance was optimal when the off-diagonal elements had a higher regularization than the diagonal elements.
Chapter 3
Graphical Lasso Application to Bioinformatics
Machine learning and data mining have found a multitude of successful applications in microarray analysis, with gene clustering and classification of tissue samples being widely cited examples [32]. Deoxyribonucleic acid (DNA) microarray technology provides useful tools for profiling global gene expression patterns in different cell/tissue samples. One major challenge is the large number of genes 𝑝 relative to the number of samples 𝑁. Using all genes can reduce the performance of a classification rule because of the noise contributed by non-discriminatory genes [33]. Selecting an optimal subset of the original gene set therefore becomes an important preprocessing step in sample classification [33].
In this chapter, we propose the use of the sparse inverse covariance estimator introduced in chapter 2, the Graphical Lasso, to estimate the inverse covariance matrix even when 𝑁 < 𝑝. The estimated sparse inverse covariance is used for dimensionality reduction and to classify tissue samples given gene microarray data.
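The reason a regularized estimator is needed when 𝑁 < 𝑝 can be sketched as follows. The notation is chosen here for illustration and assumes the usual centered sample covariance S computed from N samples x_1, ..., x_N, each a vector of p gene expression values.

```latex
\begin{equation}
S = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^{\top},
\qquad
\operatorname{rank}(S) \le \min(p,\, N-1)
\end{equation}
```

When N ≤ p, the rank of S is strictly less than p, so S is singular and cannot be inverted directly. This is why a penalized estimator such as the Graphical Lasso, or a pseudoinverse, is used instead.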
3.1 Introduction
DNA microarrays enable scientists to study an entire genome’s expression under a variety of conditions. The advent of DNA microarrays has facilitated a fundamental shift from gene-centric science to genome-centric science [34].
Figure 3.1 DNA microarray image.
DNA microarrays are typically constructed by mounting a unique fragment of complementary DNA (cDNA) for a particular gene to a specific location on the microarray [34]. This process is repeated for N genes. The microarray is then hybridized with two solutions, one containing experimental DNA tagged with green fluorescent dye, the other containing reference or control DNA tagged with red fluorescent dye [34].
Figure 3.2 Acquiring the gene expression data from DNA microarray.
DNA microarrays are composed of thousands of individual DNA sequences printed in a high-density array on a glass microscope slide using a robotic arrayer, as shown in Fig. 3.2. The relative abundance of these spotted DNA sequences in two DNA or RNA samples may be assessed by monitoring the differential hybridization of the two samples to the sequences on the array [34]. For mRNA samples, the two samples are reverse-transcribed into cDNA, labeled with different fluorescent dyes (a red-fluorescent dye and a green-fluorescent dye), and mixed. After the hybridization of these samples with the arrayed DNA probes, the slides are imaged using a scanner that makes fluorescence measurements for each dye [34]. The log ratio of the two dye intensities is used as the gene expression data [35-37].
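The log-ratio step can be illustrated with a short sketch. The base-2 logarithm, the choice of the experimental channel as numerator, the function name, and the intensity values are all assumptions made for illustration; the text does not specify them.

```python
import math


def log_ratio(experimental, reference):
    """Base-2 log ratio of two spot intensities.

    Assumptions (not stated in the thesis): base 2, experimental channel
    in the numerator, and intensities already background-corrected.
    """
    if experimental <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(experimental / reference)


# Hypothetical (experimental, reference) intensities for three spots.
spots = [(800.0, 200.0), (150.0, 150.0), (50.0, 400.0)]
expression = [log_ratio(e, r) for e, r in spots]
```

With this convention, a positive value indicates higher expression in the experimental sample, zero indicates equal expression, and a negative value indicates lower expression, so up- and down-regulation of the same magnitude are symmetric around zero.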
Traditionally, genes have been studied in isolation in an attempt to characterize their behavior [34]. While this technique has been successful to a limited extent, it suffers from several fundamental drawbacks. The most significant of these is that in a real biological system, genes do not act alone; rather, they act in concert to affect a particular state in a cell [34]. As such, examining genes in isolation offers a perturbed and very limited view of their function [34]. DNA microarrays allow a scientist to observe the expression level of tens of thousands of genes at once. Rather than considering individual genes, a scientist can now observe the expression level of an entire genome.
The power of DNA microarrays is a double-edged sword: to handle the enormous amount of data being generated by microarray experiments, we need sophisticated data analysis techniques to match [34]. More specifically, we need to extract biologically meaningful insights from the morass of DNA microarray data, and apply this newly gained knowledge in a meaningful way. The types of information scientists want to extract from DNA microarray data can be regarded as patterns or regularities in the data [34]. One important application of gene expression data is the classification of samples into categories. For example, a scientist may want to discover which samples belong to a particular tissue or which samples belong to healthy or unhealthy patients. They may also want to know which genes are co-regulated, or attempt to infer what the gene expression regulatory pathways are [34]. Alternatively, a doctor may want to know whether the gene expression profile of an unhealthy patient can help predict an optimal treatment. In combination with classification methods, other machine learning techniques are designed to extract such patterns, which can be useful for supporting clinical management decisions for individual patients, e.g. in oncology. Standard statistical methodologies in classification or prediction do not work well when the number of variables p (genes) is much larger than the number of samples N, which is the case in gene microarray expression data [34].