Chatfield and Collins [5], in the introduction to their chapter on cluster analysis, quote the first sentence of a review article on cluster analysis by Cormack [13]: ‘The availability of computer packages of classification techniques has led to the waste of more valuable scientific time than any other “statistical” innovation (with the possible exception of multiple-regression techniques).’ This is perhaps a little hard on cluster analysis and, for that matter, multiple regression, but it serves as a note of warning. The aim of this book is to explain the basic principles of the more popular and useful multivariate methods so that readers will be able to understand the results obtained from the techniques and, if interested, apply the methods to their own data. This is not a substitute for a formal training in statistics; the best way to avoid wasting one’s own valuable scientific time is to seek professional help at an early stage.
Cluster analysis (CA) has already been briefly mentioned in Section 2.3, and a dendrogram was used to show associations between variables in Section 3.6. The basis of CA is the calculation of distances between objects in a multidimensional space using an equation such as Equation (5.1). These distances are then used to produce a diagram, known as a dendrogram, which allows the easy identification of groups (clusters) of similar objects. Figure 5.7 gives an example of the process for a very simple two-dimensional data set.
The two most similar (closest) objects in the two-dimensional plot in part (a) of the figure are A and B. These are joined together in the dendrogram shown in part (b) of the figure, where they have a low value of dissimilarity (distance between points) as shown on the scale. This scale is calculated from the interpoint distance matrix by finding the minimum and maximum distances, setting these equal to some arbitrary scale numbers (e.g. 0 and 1), and scaling the other distances to lie between these limits. The next smallest interpoint distance is between point C and either A or B, so this point is joined to the A/B cluster. The next smallest distance is between D and E, so these two points form a cluster and, finally, the two clusters are joined together in the dendrogram. This process is hierarchical and the links between clusters have been single; the procedure is known, unsurprisingly, as single-link hierarchical cluster analysis and is one of the most commonly used methods. Another point to note from this description of CA is that clusters were built up from individual points; the process is agglomerative. CA can start off in the other direction by taking a single cluster of all the points and splitting off individual points or clusters, a divisive process.
Figure 5.7 Illustration of the production of a dendrogram for a simple two-dimensional data set (reproduced from ref. [3] with permission of Energia Nuclear & Agricultura).
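The agglomerative, single-link procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates of points A to E are hypothetical (they are not read from Figure 5.7, merely chosen to give the same order of joining), Equation (5.1) is assumed to be the Euclidean distance, and SciPy is used as a convenient implementation rather than as the software used in the original work.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Hypothetical coordinates for five points (an assumption, not taken from
# Figure 5.7); chosen so that A/B join first, then C, then D/E.
labels = ["A", "B", "C", "D", "E"]
points = np.array([[1.0, 1.0], [1.5, 1.2], [2.2, 2.0], [6.0, 5.0], [6.8, 5.9]])

# Interpoint distances (Euclidean assumed), as a condensed vector.
distances = pdist(points, metric="euclidean")

# Rescale so the smallest distance maps to 0 and the largest to 1.
scaled = (distances - distances.min()) / (distances.max() - distances.min())

# Single-link agglomerative clustering on the rescaled distances.
Z = linkage(scaled, method="single")

# Translate each row of the linkage matrix into the points it joins.
n = len(labels)
members = {i: [labels[i]] for i in range(n)}
merge_log = []
for step, (a, b, height, _size) in enumerate(Z):
    left, right = members[int(a)], members[int(b)]
    members[n + step] = left + right
    merge_log.append((tuple(sorted(left)), tuple(sorted(right)), float(height)))
    print(f"step {step + 1}: join {left} with {right} at height {height:.3f}")

# Left-to-right order of the points along the dendrogram (no plot is drawn).
leaf_order = dendrogram(Z, labels=labels, no_plot=True)["ivl"]
```

Because rescaling preserves the rank order of the distances, it changes only the heights on the dendrogram scale, not the order in which single-link clustering joins the points.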
There are many different ways in which clusters can be generated; all of the examples that will be described in this section use the agglomerative, hierarchical, single-linkage method, usually referred to as ‘cluster analysis’. Most textbooks of multivariate analysis have a chapter describing some of the alternative methods for performing CA, and Willett [14] deals with chemical applications. It may have been noticed that in this description of CA the points to be clustered were referred to as just that, points in a multidimensional space. They have not been identified as samples or variables since CA, like many multivariate methods, can be used to examine relationships between samples or variables. For the former we can view the data set as a collection of n objects in a p-dimensional parameter space. For the latter we can imagine a data set ‘turned on its side’ so that it is a collection of p objects in an n-dimensional sample space. When using CA to examine the relationships between variables, the distance measure employed is often based on the correlation coefficient between pairs of variables, as shown in Figure 3.5.
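The two views of a data set, n samples in a p-dimensional space or p variables in an n-dimensional space, can be illustrated as follows. The data matrix here is synthetic, and the conversion of a correlation coefficient r into a dissimilarity as 1 - |r| is an assumption made for this sketch; other conversions (such as 1 - r) are also in use.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Synthetic data matrix: n samples (rows) by p variables (columns).
rng = np.random.default_rng(0)
n, p = 20, 4
X = rng.normal(size=(n, p))
# Make variable 1 almost a copy of variable 0 so they should cluster first.
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)

# Clustering samples: n objects in a p-dimensional parameter space.
Z_samples = linkage(pdist(X, metric="euclidean"), method="single")

# Clustering variables: p objects, with dissimilarity derived from the
# correlation matrix (1 - |r| is an assumed choice for this sketch).
R = np.corrcoef(X, rowvar=False)        # p x p correlation matrix
D = 1.0 - np.abs(R)
np.fill_diagonal(D, 0.0)                # remove rounding error on the diagonal
Z_variables = linkage(squareform(D, checks=False), method="single")

print("first pair of variables joined:", Z_variables[0, :2].astype(int))
```

Each linkage matrix has one row per merge, so clustering n samples gives n - 1 rows and clustering p variables gives p - 1 rows.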
Figure 5.8 Dendrogram of water samples characterized by their concentrations of Ca, K, Na, and Si (reproduced from ref. [15] with permission of Wiley-Blackwell).
The study of mineral waters characterized by elemental analysis discussed in Section 5.2 [3] provides a nice example of the use of CA to classify samples. Figure 5.8 shows a dendrogram of water samples from one geographical region (Lindoya) described by the concentrations of four elements. The water samples were drawn from six different locations in this region and one group on the dendrogram, cluster IV, contained all the samples from one of these locations. The samples from the other five locations are contained in clusters I, II, and III. One sample, cluster V, is clearly an outlier from this set and must therefore be treated with suspicion.
The characterization of fruit juices by various analytical measurements was used as an example of a principal component scores plot (Figure 4.9) in Chapter 4 [15]. A dendrogram from this data is shown in Figure 5.9 where it is clearly seen that the grape, apple, and pineapple juice samples form distinct clusters. The apple and pineapple juice clusters are grouped together as a single cluster which is quite distinct from the cluster of grape juice samples. This is interesting in that it mimics the results of the PCA; on the scores plot, all three groups are separated, but the first component mainly serves to separate the grape juices from the others while the second component separates apple and pineapple juices. This is a good illustration of the way that different multivariate methods tend to produce complementary and consistent views of the same data set.
Figure 5.9 Dendrogram showing the associations between grape (G), apple (A), and pineapple (P) juice samples described by 15 variables (reproduced from ref. [16] with permission of Arzneimittel-Forschung).
The dendrogram in Figure 5.10 is derived from a data matrix of ED50 values for 40 neuroleptic compounds tested in 12 different assays in rats [16]. This is an example of a situation in which the data involves multiple dependent variables (see Chapter 8), but here the multiple biological data is used to characterize the tested compounds. The figure demonstrates that the compounds can be split up into five clusters with three compounds falling outside the clusters. Compounds within a cluster would be expected to show a similar pharmacological profile and, of course, there is the finer detail of clusters within the larger clusters. A procedure such as this can be very useful when examining new potential drugs. If the pharmacological profile of a new compound can be matched to that of a marketed compound, then the early clinical investigators may be forewarned as to the properties they might expect to see.
Figure 5.10 Dendrogram of the relationships between neuroleptic drugs characterized by 12 different biological tests (reproduced from ref. [12] with kind permission of Springer Science+Business Media).
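The idea of reading clusters off a dendrogram and matching a new compound to an existing profile can be sketched as below. Everything in this example is hypothetical: the activity profiles are synthetic numbers rather than the ED50 data of ref. [16], the number of clusters requested is chosen to suit the synthetic data, and matching by smallest Euclidean distance is simply one plausible way to compare profiles.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist

# Synthetic activity profiles: 12 hypothetical compounds in 3 assays,
# generated around three well-separated profile types (4 compounds each).
rng = np.random.default_rng(2)
centres = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [10.0, 0.0, 10.0]])
profiles = np.vstack([c + 0.3 * rng.normal(size=(4, 3)) for c in centres])

# Single-link hierarchical clustering of the compounds.
Z = linkage(profiles, method="single", metric="euclidean")

# Cut the dendrogram so that at most three clusters are formed.
cluster_ids = fcluster(Z, t=3, criterion="maxclust")

# A hypothetical new compound whose profile resembles the second type.
new_profile = np.array([[5.1, 4.9, 5.2]])

# Match it to the known compound with the most similar profile.
nearest = int(np.argmin(cdist(new_profile, profiles)))
print("most similar known compound:", nearest,
      "in cluster", int(cluster_ids[nearest]))
```

The cluster membership of the nearest known compound then suggests which pharmacological profile the new compound might be expected to share.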
The final example of a dendrogram to be shown here, Figure 5.11, is also one of the largest. This figure shows one thousand conformations of an insecticidal pyrethroid analogue (see Figure 5.5) described by the values of four torsion angles [12]. A dendrogram such as this was used for the selection of representative conformations from the one thousand conformations produced by molecular dynamics simulation. Conformations were chosen at equally spaced intervals across the dendrogram, ensuring an even sampling of the conformational space described by the torsion angles. In fact, the procedure is not as simple as this and various approaches were employed (see reference for details) but sampling at even intervals was shown to be suitable.
Figure 5.11 Dendrogram of the relationships of different conformations of a pyrethroid derivative described by the values of four torsion angles (reproduced from ref. [17] with permission from the American Chemical Society).
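Sampling at equally spaced intervals across a dendrogram can be sketched as follows. This is not the procedure of the cited study, whose details are given in the reference; the torsion angles are randomly generated, the number of representatives (20) is arbitrary, and the use of sine and cosine values to handle the periodicity of angles is an assumption introduced for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

# Synthetic torsion angles in degrees: 1000 conformations, 4 torsions.
rng = np.random.default_rng(1)
n_conf, n_torsions, n_pick = 1000, 4, 20
angles = rng.uniform(-180.0, 180.0, size=(n_conf, n_torsions))

# Assumed preprocessing: represent each angle by (sin, cos) so that
# -180 and +180 degrees are treated as the same geometry.
rad = np.deg2rad(angles)
features = np.hstack([np.sin(rad), np.cos(rad)])

# Single-link clustering and the left-to-right order of the leaves
# as they would appear along the dendrogram.
Z = linkage(features, method="single", metric="euclidean")
order = leaves_list(Z)

# Pick conformations at equally spaced positions along that order.
positions = np.linspace(0, n_conf - 1, n_pick).round().astype(int)
representatives = order[positions]
print("indices of selected conformations:", representatives)
```

Because neighbouring leaves on a dendrogram tend to be similar conformations, stepping along the leaf order at regular intervals spreads the selection across the clusters rather than concentrating it in one region.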