Despite shortcomings, application of clustering methods to gene expression data has proven to be of immense value, providing insight into cell regulation, as well as into disease characterisation. Nevertheless, not all clustering methods are equally valuable for high dimensional gene expression data. Recognition that well-known, simple clustering techniques, such as K-Means and Hierarchical clustering, do not capture complex local structure has led to investigation of other options. In particular, bi-clustering has gained considerable recent popularity. Indications to date are that these methods provide increased sensitivity at local structure level in discovery of meaningful biological patterns.
An inherent problem with exploratory clustering is the need for ab initio knowledge of K, the number of clusters. Consequently, those methods for gene expression analysis which do not need K specified ab initio have an advantage. Most algorithms seek to determine this empirically at run time, but derive complicated thresholds that may not make sense in the context of gene expression data. There is a risk that determination of these thresholds is not a one-step process but requires testing and validation of the clusters produced. A comprehensive survey of robust cluster validation and evaluation methods is given by Handl et al. (2005), but it seems clear that a requirement for information-driven clustering is emerging, which integrates cluster and meta-information (Choi et al., 2004; Liu et al., 2004; Kasturi and Acharya, 2005; Gamberoni et al., 2006; Kustra and Zagdanski, 2006). This provides a basis for validation, independent of the current problem, as well as for interpretation of clustering results.
3.5 Summary
Cluster analysis, applied to gene expression data, aims to highlight meaningful patterns for gene co-regulation. The evidence suggests that, while commonly applied, agglomerative and partitive techniques are insufficiently powerful given the high dimensionality and nature of the data. While further testing on non-standard and diverse data sets is required, comparative assessment and numerical evidence to date support the view that bi-clustering methods, although computationally expensive, offer better interpretation in terms of data features and local structure. While the limitations of commonly-used algorithms are well documented in the literature, adoption by the bioinformatics community of new (and hybrid) techniques, developed specifically for gene expression analysis, has been slow, mainly due to the increased algorithmic complexity required. Adoption would be catalysed by more transparent guidelines and increased availability in specialised software and public dataset repositories.
Chapter 4
Cluster Analysis: A Practical Evaluation
In the assessment process, a clustering is tested for specific properties. Assessment measures are rarely a fixed set but together form a diagnostic toolkit targeted at improving the clustering process. In general, clustering techniques optimise some form of such a measure as a criterion function. Evaluation of clustering thus involves the synthesis of a number of assessment measures, used to gauge final cluster quality, in order to form an objective final judgement on the most suitable technique for the dataset involved. In this chapter we use these basic principles to investigate the applicability of clustering algorithms to gene expression data. The approach is to consider a series of measures which assess cluster quality on the basis of biological realism, amongst other criteria. It also involves comparison of these measures between clustering algorithms and for different datasets. We evaluate clusterings obtained with selected algorithms¹ identified in Chapter 3. Clusters obtained from real and synthetic datasets are compared between algorithms. We demonstrate that, with so many classification criteria for clustering, no one algorithm is good for all datasets, so that a preliminary review of the most appropriate methods is essential. Further, it is also strongly advocated that no single method or interpretation is sufficient and that recognition of valid clusters is frequently indefinite or misleading.

¹ Reporting for all algorithms would be prohibitively detailed. Our aim is to give a ‘flavour’ of techniques and their validation, by applying selected algorithms from each group in the Jain et al. (1999) taxonomy.
4.1 Introduction
Evaluation of clustering requires both internal and external assessment of the clusters obtained, and comparison between algorithms. This is a complicated area for gene expression data, owing to its unique properties and to the fact that little may be known about the data beforehand. Many clustering algorithms are designed to be exploratory, so that the clusters found (dependent on given criteria) will reveal “a structure” which, while meaningful in the context of those criteria, may yet fail to be optimal or even biologically realistic. Algorithms are inherently biased, as properties of clusters reflect built-in clustering criteria, while structures found are not usually the same for different algorithms. For example, with regard to the K-Means criterion the “best” structure is one that minimises the sum of squared errors (MacQueen, 1967), while for the Cheng and Church biclustering algorithm (Cheng and Church, 2000), it is that which minimises the Mean Residue Score (MRS, Eq. 3.3). The two assessments are generally not directly comparable, as the former highlights global patterns in the data and the latter local patterns (Section 3.2). Also, large deviations from the mean may correspond to large residue scores, but this is not always the case. For example, Fig. 4.1(a), and the corresponding table, highlight a simple case of three genes in a cluster across four samples. According to the K-Means criterion, the cluster (Euclidean and centroid) distance is approximately 11.02, while MRS = 0. In the second case (Fig. 4.1(b)) the scale of profile 1 was reduced by one third. In this case the cluster distance is decreased to 7.91 (indicating a better cluster), while the MRS is increased to 0.0168 (indicating an inferior cluster). Clearly, interpretation of cluster results relies at some level on a subjective choice of assessment criterion. This increases the need for independent validation by integration of findings with metadata. Subjective evaluation (based on experience, background knowledge and expected results), even for low dimensional data, is non-trivial at best, and becomes increasingly difficult for high dimensional gene expression data. From these considerations, it is clear that cluster validation is critical for algorithm development and verification of results, with the latter usually based on a manual, lengthy and subjective exploration process.
Figure 4.1: Cluster (B) has profile 1 scaled down by one third.
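The disagreement between the two criteria can be reproduced with a minimal sketch in Python. The data below are hypothetical and are not the values behind Fig. 4.1, so the numbers will not match those quoted in the text. Two further assumptions are made: the K-Means criterion is taken as the sum of squared Euclidean distances of each profile to the cluster centroid, and the MRS of Eq. 3.3 is taken to be the Cheng and Church mean squared residue. Profile 1 is scaled to two thirds of its original values, which is one reading of "reduced by one third". The function and variable names are illustrative only.

```python
# Hypothetical illustration: these values are NOT the data behind Fig. 4.1.

def column_means(rows):
    """Mean expression per sample (the cluster centroid)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sum_squared_error(rows):
    """Assumed K-Means criterion for a single cluster: sum of squared
    Euclidean distances of every profile to the centroid."""
    centroid = column_means(rows)
    return sum((x - c) ** 2 for row in rows for x, c in zip(row, centroid))

def mean_squared_residue(rows):
    """Assumed form of the MRS: the Cheng and Church mean squared residue,
    averaging (a_ij - rowmean_i - colmean_j + overallmean)^2 over all cells."""
    row_means = [sum(row) / len(row) for row in rows]
    col_means = column_means(rows)
    overall = sum(row_means) / len(row_means)  # rows have equal length
    total = sum(
        (x - r - c + overall) ** 2
        for row, r in zip(rows, row_means)
        for x, c in zip(row, col_means)
    )
    return total / (len(rows) * len(rows[0]))

# Three genes across four samples; the profiles differ only by a constant
# shift, so the residue is zero even though profile 1 lies far from the rest.
cluster_a = [
    [7.0, 9.0, 8.0, 10.0],  # profile 1
    [1.0, 3.0, 2.0, 4.0],   # profile 2
    [2.0, 4.0, 3.0, 5.0],   # profile 3
]

# Same cluster with profile 1 scaled to two thirds of its original values.
cluster_b = [[x * 2 / 3 for x in cluster_a[0]]] + cluster_a[1:]

for name, cluster in (("a", cluster_a), ("b", cluster_b)):
    print(name, sum_squared_error(cluster), mean_squared_residue(cluster))
```

With these hypothetical profiles, scaling profile 1 moves it closer to the centroid, so the sum of squared errors falls, while the profiles no longer differ by a pure shift, so the residue rises above zero. The two criteria therefore rank the two clusters in opposite order, which is the behaviour the example in the text describes.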