The classifiers built with methods that favor sparsity can be unreliable and unstable under slight perturbations of the training set (Haury et al., 2010; Yu et al., 2008). This lack of robustness and stability originates in the fact that these methods aim to generate classifiers that are accurate but use only a small subset of the available features. As a result, one often discards features that are relevant for prediction but are highly correlated with features already selected, because the improvement in performance on the training set is not sufficient to compensate for the cost of including the additional features in the model. Consequently, the selected features can change significantly even when the training set is only slightly perturbed. These instabilities are especially severe when the amount of available data is limited and the dimensionality of the feature space is very high (Kalousis et al., 2007; Loscalzo et al., 2009).
NBSVM and GL can be significantly affected by these instabilities because they optimize a loss function on the training set with a penalty that favors the selection of a reduced number of features (Haury et al., 2010). This is illustrated by the graphs for NBSVM and GL in the bottom-left and bottom-middle of Figure 5.3, which display the relevance given by these methods to each feature on a particular training instance of the phoneme dataset. The most informative frequencies for prediction in this problem are a group of features clustered around the 50th frequency (see the top-middle plot in Figure 5.3). However, the bottom-left and bottom-middle plots in Figure 5.3 indicate that NBSVM and GL select only a reduced fraction of these features. Similar results can be observed for the handwritten digit dataset in Figure 5.5.
An advantage of a fully Bayesian method, such as NBSBC, over approaches that provide point estimates of the model parameters, such as NBSVM and GL, is that it is generally more stable, especially when the data available for induction are scarce. In a fully Bayesian approach all possible values of the model parameters are considered by computing averages over the posterior distribution. Thus, the problem of selecting only a reduced number of elements among a group of highly correlated features does not appear. This is illustrated by the plot in the top-right of Figure 5.3, which displays the relevance given by the posterior distribution of SBC to each feature in a specific training instance of the phoneme dataset. In this case, the estimates of the relevance of the features around the 50th frequency are all high. Furthermore, the results are more robust when prior information is incorporated into the Bayesian model using a network of features. In particular, the values of the relevance are spatially smoothed when the MRF prior is used (see the bottom-right plots in Figures 5.3 and 5.5). This smoothing reduces the magnitude of the fluctuations in the relevance values caused by small variations in the training set. The effect is similar to the reduction of noise in images by Markov random field models (Bishop, 2006; Geman et al., 1993).
To further investigate the stability of the different sparse linear classifiers when the training data are slightly perturbed, we employ an index of feature selection stability (Kuncheva, 2007). This index measures the level of agreement between the feature rankings generated by a feature selection method under different training conditions. For a given classification problem, let us assume that the prediction performance of the method under consideration is evaluated in $T$ train/test episodes and let $B_i^k$ be the set with the $k$ most relevant features as estimated by the method in the $i$-th train/test episode. The expression for the index of feature selection stability is

$$\mathrm{IFSS}(k) = \frac{2}{T(T-1)} \sum_{i=1}^{T-1} \sum_{j=i+1}^{T} \frac{o_{ijk}\, d - k^2}{k(d-k)}, \quad \text{for } k = 1, \ldots, d, \tag{5.18}$$

where $d$ is the data dimensionality and $o_{ijk}$ is the number of common elements between the sets $B_i^k$ and $B_j^k$. This index satisfies $-1 < \mathrm{IFSS}(k) \leq 1$, approaches its maximum value when the number of common features in the sets $B_1^k, \ldots, B_T^k$ increases and takes values close to zero when the sets $B_1^k, \ldots, B_T^k$ are independently drawn. The more stable a feature selection method is, the higher the value of $\mathrm{IFSS}(k)$.
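As an illustration, Equation (5.18) can be computed directly from the top-k feature sets of the individual train/test episodes. The following is a minimal sketch in Python; the function name `ifss` and the representation of each set of selected features as a Python set of feature indices are assumptions made for this example, not part of the code used in the experiments.

```python
from itertools import combinations


def ifss(top_k_sets, d):
    """Index of feature selection stability, following Equation (5.18).

    top_k_sets: list of T sets, each holding the indices of the k most
                relevant features found in one train/test episode
                (hypothetical representation chosen for this sketch).
    d:          total number of features.
    """
    T = len(top_k_sets)
    if T < 2:
        raise ValueError("at least two episodes are needed")
    k = len(top_k_sets[0])
    if any(len(s) != k for s in top_k_sets):
        raise ValueError("all sets must contain exactly k features")
    if not 0 < k < d:
        # The denominator k (d - k) vanishes at k = d, so this sketch
        # only handles 0 < k < d.
        raise ValueError("this sketch requires 0 < k < d")

    total = 0.0
    # Sum over all pairs (i, j) with i < j.
    for a, b in combinations(top_k_sets, 2):
        o = len(set(a) & set(b))  # number of common features, o_ijk
        total += (o * d - k * k) / (k * (d - k))
    return 2.0 * total / (T * (T - 1))


if __name__ == "__main__":
    print(ifss([{0, 1}, {0, 1}, {0, 1}], d=10))  # identical selections
    print(ifss([{0, 1}, {2, 3}], d=10))          # disjoint selections
```

With d = 10 and k = 2, three identical sets give an index of 1, while two disjoint sets give -0.25, in line with the interpretation that higher values indicate more stable selections.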
Figure 5.8 displays plots of IFSS(k) for each sparse linear classifier (NBSBC, SBC, NBSVM and GL) on each of the four datasets previously analyzed (phoneme, handwritten digit, precipitation and breast cancer). The graphs are computed using the feature rankings generated by NBSBC, SBC, NBSVM and GL on each of the 100 train/test episodes of the experiments described previously. The most stable method is clearly NBSBC, followed at a significant distance by SBC. The least stable methods are NBSVM and GL.
5.6 Summary and Discussion
Some classification problems are characterized by a feature space whose dimension d is very large when compared to the number n of data instances available for induction. Under these conditions, a common approach to obtain robust and reliable classifiers is to consider sparse linear models. Sparse models often have an improved prediction accuracy and can also be used to identify those features that are more relevant for classifying new data instances. Most of the sparse classification techniques analyzed in the literature assume that the features that characterize the data instances are independent of each other. While enforcing sparsity assuming independent features is often advantageous, better results can be achieved if prior information about feature dependencies is available. This information can be encoded in the form of a network whose nodes correspond to features and whose edges represent dependence relationships between features. Whenever two features are connected in the network, they are assumed to be both relevant or both irrelevant for the solution of the classification problem.

Chapter 5. Network-based Sparse Bayesian Classification 97

Figure 5.8: Stability of the feature rankings given by each sparse classification model (NBSBC, SBC, NBSVM and GL) on the four analyzed datasets (phoneme, handwritten digit, precipitation and breast cancer).
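The assumption that connected features tend to be jointly relevant or jointly irrelevant can be pictured with an Ising-style Markov random field over binary relevance indicators. The sketch below is only an illustration of that idea: the functional form, the encoding of relevance as +1 / -1, and the names `mrf_log_potential`, `alpha` and `beta` are assumptions introduced here and are not the exact prior specified in this chapter.

```python
import numpy as np


def mrf_log_potential(z, edges, alpha=0.0, beta=1.0):
    """Unnormalized log-probability of a relevance pattern z.

    z:     sequence with z[i] = +1 if feature i is relevant, -1 otherwise
           (hypothetical encoding chosen for this sketch).
    edges: list of (i, j) pairs, one per edge of the feature network.
    alpha: assumed parameter; negative values favor sparse patterns.
    beta:  assumed parameter; positive values reward connected features
           that share the same relevance state.
    """
    z = np.asarray(z, dtype=float)
    # Each edge contributes +1 when its two endpoints agree, -1 otherwise.
    agreement = sum(z[i] * z[j] for i, j in edges)
    return alpha * z.sum() + beta * agreement


if __name__ == "__main__":
    chain = [(0, 1), (1, 2)]  # three features connected in a chain
    print(mrf_log_potential([1, 1, 1], chain))   # all relevant
    print(mrf_log_potential([1, -1, 1], chain))  # middle feature disagrees
```

With a positive `beta`, patterns in which neighboring features agree receive a higher unnormalized log-probability than patterns in which they disagree, which is the qualitative behavior described above.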
In this chapter, we have presented a new network-based sparse Bayesian classifier (NBSBC) that makes use of the information encoded in such a network to improve its prediction performance and its ability to select relevant features in problems with a reduced amount of training data and a very high-dimensional feature space. NBSBC is based on an extension of the Bayes point machine (Herbrich et al., 2001; Minka, 2001) that is capable of learning the intrinsic noise in the class labels (Hernández-Lobato and Hernández-Lobato, 2008). Sparsity in the model is favored by a spike and slab prior distribution (George and McCulloch, 1997), which is combined with a Markov random field prior (Bishop, 2006; Wei and Li, 2007) that accounts for the network of feature dependencies. Approximate Bayesian inference is implemented using the expectation propagation algorithm (EP) (Bishop, 2006; Minka, 2001). NBSBC has a fairly low computational cost. When the network of feature dependencies is sparse and the training set includes n instances and d features, the computational complexity of NBSBC is $O(nd)$.

The performance of NBSBC has been evaluated in a series of experiments with phoneme data (Hastie et al., 1995, 2001), handwritten digits (Lecun et al., 1998), precipitation data (Razuvaev et al., 2008) and gene-expression data from microarray experiments (Bos et al., 2009). For each of these datasets, we have constructed a network of features using either information specific to the problem domain or additional unlabeled data. The experiments include an exhaustive comparison with four benchmark binary classification methods: the standard support vector machine (SVM) (Hastie et al., 2001; Vapnik, 1995), the sparse Bayesian classifier (SBC), which is obtained when NBSBC ignores the network of features, the network-based support vector machine (NBSVM) (Zhu et al., 2009) and the graph lasso method (GL) (Jacob et al., 2009). Like NBSBC, NBSVM and GL also assume sparse models and use a network to encode prior knowledge about pairwise feature dependencies. The results of these experiments show that NBSBC outperforms SVM, NBSVM, SBC and GL in all the problems analyzed except for the modeling of the precipitation data, where it ranks second. NBSBC is also very effective in the selection of features that are relevant to the solution of the classification problem.
An important property of NBSBC is the robustness of the estimates of the relative relevance of the individual features generated by this method. This stability derives from considering all the possible parameter values in the posterior distribution computed by this method. By contrast, GL and NBSVM, which employ point estimates for the model parameters, tend to discard features that, while being relevant for prediction, are highly correlated with previously selected features. Because of this, when the training data are slightly perturbed these methods generally select different groups of features. NBSBC is less affected by this instability because the Markov random field prior dampens the effect of the fluctuations in the feature relevance values arising from small variations in the training data. This effect is similar to the reduction of noise in images by Markov random field models (Bishop, 2006; Geman et al., 1993).
Chapter 6
Discovering Regulators from Gene Expression Data
This chapter describes a hierarchical sparse Bayesian model for the discovery of transcriptional regulators from gene expression data. The hierarchy incorporates the prior knowledge that only a few genes act as regulators, controlling the expression pattern of many other genes. This prior knowledge is incorporated via a spike and slab prior, in which the mixing weights are assumed to follow a hierarchical Bernoulli model. Expectation propagation is used to carry out approximate inference efficiently. The model is applied to gene expression data from the malaria parasite. Among the top ten genes identified as the most likely to be regulators, we found four genes with significant homology to transcription factors in an amoeba, another one is a known RNA regulator, three have an unknown function, and two are known not to be regulators. These results are promising, given that only gene expression data are used to identify the transcriptional regulators.
6.1 Introduction
Bioinformatics is a rich source for the application of automatic learning methods. In particular, the discovery of transcription regulatory networks has been addressed by a variety of machine learning algorithms (Gardner and Faith, 2005), including the sparse linear models with spike and slab priors described in subsection 4.4.1 of this thesis. In this chapter, we specifically focus on the identification of the genetic regulatory elements of the causative agent of severe malaria, Plasmodium falciparum (Hernández-Lobato et al., 2008). Several properties of this parasite necessitate a tailored method for the identification of regulators:
1. In most species gene regulation takes place at the first stage of gene expression when the DNA template is transcribed into an mRNA molecule. This transcriptional control is mediated by specific regulatory molecules called transcription factors (Alon, 2006). However, few specific transcription factors have been identified in Plasmodium based on sequence homology with other species (Balaji et al., 2005; Coulson et al., 2004). This could be due to Plasmodium possessing a unique set of transcription factors or due to the presence of other mechanisms of gene regulation, for example at the level of mRNA stability or post-transcriptional regulation.
2. Compared with yeast, gene expression in Plasmodium is hardly changed by perturbations, for example by adding chemicals or changing temperature (Sakata and Winzeler, 2007). The biological interpretation of this finding is that the parasite is so narrowly adapted to its environment inside a red blood cell that it follows a stereotyped program of gene expression. From a machine learning point of view, this means that network elucidation methods relying on perturbations of the gene expression process cannot be used.
3. Similar to yeast (Spellman et al., 1998), data for three different strains of the parasite with time series of gene expression are publicly available (Llinás et al., 2006). These assay all of Plasmodium’s 5600 genes for about 50 time points. In contrast to yeast, there are no ChIP-chip data available and fewer than ten transcription factor binding motifs are known (Aparicio et al., 2001).
These properties point to a vector autoregressive model, using the available gene expression time series data (point 3 above), for the identification of regulators in Plasmodium. The model should not rely on sequence homology information but it should be flexible enough to integrate sequence information in the future. This points to a Bayesian model as a favored approach since Bayesian methods can incorporate additional prior knowledge very easily (Buchan et al., 2009).
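To make the modeling choice concrete, the sketch below simulates and fits a first-order vector autoregressive model, $x_{t+1} = W x_t + \epsilon_t$, by ordinary least squares. The order of the model, the function names `simulate_var1` and `fit_var1`, and the use of least squares are assumptions made for illustration only; the model of this chapter instead places a sparse prior on the regression coefficients and relies on expectation propagation for inference.

```python
import numpy as np


def simulate_var1(W, x0, n_steps, noise_std=0.0, seed=0):
    """Simulate n_steps time points of x[t+1] = W x[t] + noise.

    Returns an array of shape (n_steps, G), one row per time point.
    """
    rng = np.random.default_rng(seed)
    W = np.asarray(W, dtype=float)
    G = W.shape[0]
    X = np.empty((n_steps, G))
    X[0] = x0
    for t in range(n_steps - 1):
        X[t + 1] = W @ X[t] + noise_std * rng.standard_normal(G)
    return X


def fit_var1(X):
    """Least-squares estimate of W from a (time points x genes) array X."""
    past, future = X[:-1], X[1:]
    # Each row satisfies future[t] = W @ past[t], i.e. future = past @ W.T,
    # so solving past @ B = future in the least-squares sense gives B = W.T.
    B, *_ = np.linalg.lstsq(past, future, rcond=None)
    return B.T


if __name__ == "__main__":
    # Hypothetical 3-gene system, used only to exercise the code.
    W_true = np.array([[0.9, 0.1, 0.0],
                       [0.0, 0.8, 0.2],
                       [0.1, 0.0, 0.7]])
    X = simulate_var1(W_true, x0=[1.0, 2.0, 3.0], n_steps=20)
    print(np.round(fit_var1(X), 3))
```

With about 50 time points and thousands of genes, the least-squares problem is heavily underdetermined (`np.linalg.lstsq` then returns a minimum-norm solution), which is one reason to prefer a model with sparsity-inducing priors over this plain estimator.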