
The next two diversity methods involved the use of clustering analysis. Clustering analysis aims to divide a group of objects into clusters so that the objects within each cluster are similar, while objects taken from different clusters are structurally dissimilar.45 When molecules are clustered, a representative subset of one or more compounds can be selected from each cluster to represent that class of compounds, as illustrated by figure 3.9. Clustering analysis has found many uses, including in medicine and information sciences, and there are a large number of algorithms available for its implementation. The efficiency of these algorithms varies widely, with certain methods better suited to pharmaceutical applications and clustering databases of chemical structures.47-50

Fig. 3.9 Illustration of clustering analysis and the grouping together of similar compounds. Red circles show the selection of a representative molecule from a particular cluster.

The overall process of cluster-based compound selection is as follows:

1. Generate descriptors for each compound in the dataset.

2. Calculate the similarity or distance between all compounds in the dataset.

3. Use a clustering algorithm to group the compounds within the dataset.

4. Select a representative subset by selecting one or more compounds from each cluster.
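The final selection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how the representative compound is chosen, so the sketch picks the medoid (the member with the smallest total distance to the other members of its cluster). The function name `select_representatives`, the input format, and the toy one-dimensional values are all hypothetical.

```python
def select_representatives(clusters, distance):
    """Pick one representative (the medoid) from each cluster.

    clusters: list of lists of compound indices (hypothetical input format).
    distance: function (i, j) -> float giving the distance between two compounds.
    Choosing the medoid is an assumption made for this sketch.
    """
    representatives = []
    for members in clusters:
        # The medoid is the member with the smallest summed distance to all members.
        medoid = min(members, key=lambda i: sum(distance(i, j) for j in members))
        representatives.append(medoid)
    return representatives


# Toy one-dimensional values standing in for real compound descriptors.
values = [0.0, 0.1, 0.3, 5.0, 5.2, 9.0]
clusters = [[0, 1, 2], [3, 4], [5]]
reps = select_representatives(clusters, lambda i, j: abs(values[i] - values[j]))
print(reps)
```

Other selection rules, such as taking the compound closest to the cluster centroid or sampling several compounds per cluster, would fit the same interface.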

For step 1, the descriptors may include property values such as biological activity, or topological indexes and structural fragments.51, 52 For the two clustering analysis methods used here, FCFP_4 fingerprints were used to generate the descriptor values for each of the compounds in the dataset. In step 2, similarity measures are used to quantify the degree of structural resemblance between pairs of molecules,53 with the Tanimoto coefficient once again used to calculate the similarity between the compounds for both of the clustering methods employed. The two clustering methods differ in step 3. Most clustering methods are non-overlapping, with each object belonging to just one cluster (in overlapping methods, compounds can belong to more than one cluster). The non-overlapping methods are divided into two classes, hierarchical and non-hierarchical. For step 3, one hierarchical method and one non-hierarchical method were used.
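The similarity calculation in step 2 can be illustrated with a minimal Python sketch of the Tanimoto coefficient. It assumes each fingerprint is available as a set of "on" bit indices; the bit values below are invented for illustration and are not real FCFP_4 output, and the function name `tanimoto` is hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    shared = len(a & b)                  # bits set in both fingerprints
    total = len(a) + len(b) - shared     # bits set in at least one fingerprint
    # Returning 1.0 for two empty fingerprints is a convention assumed here.
    return shared / total if total else 1.0


# Invented fingerprints: two bits in common, five distinct bits overall.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
sim = tanimoto(fp1, fp2)
dist = 1.0 - sim   # a simple distance derived from the similarity
print(sim, dist)
```

Clustering algorithms that expect distances rather than similarities can use the complement, `1 - similarity`, as in the last lines of the sketch.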

3.3.1.1 Hierarchical Clustering

Hierarchical clustering methods organise compounds into clusters of increasing size, with small clusters of related compounds being grouped together into larger clusters. At one extreme each compound is in its own cluster, but after progressive joining of these smaller clusters, the compounds ultimately reside within a single cluster at the opposite extreme.54 The successive levels and relationships between clusters can be visualised using a dendrogram, an example of which is in figure 3.10.

Fig. 3.10 A dendrogram representing a hierarchical clustering of seven compounds.

The dataset is analysed in an iterative manner, such that at each step either a pair of clusters is merged or a single cluster is divided. Each level of the hierarchy represents a partitioning of the dataset. If a hierarchical method starts with all compounds as singletons, which are then merged iteratively until all compounds are in a single cluster, the method is said to be agglomerative, that is, it proceeds from the bottom up in terms of the dendrogram.

Clusters are formed to minimise the total variance of the dataset.55 The variance of a cluster is measured as the sum of the squared deviations from the mean of the cluster. For a cluster, $i$, of $N_i$ objects, where each object $j$ is represented by a vector $\mathbf{r}_{i,j}$, the mean (or centroid) of the cluster, $\bar{\mathbf{r}}_i$, is given by equation 3.3, with the intracluster variance, $E_i$, given by equation 3.4. The total variance is calculated as the sum of the intracluster variances for each cluster. At each iteration, the pair of clusters chosen is the one whose merger leads to the smallest increase in total variance. This is known as Ward’s method.55

$$\bar{\mathbf{r}}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbf{r}_{i,j}$$

Eq. 3.3 Definition of the cluster centroid.

$$E_i = \sum_{j=1}^{N_i} \left| \mathbf{r}_{i,j} - \bar{\mathbf{r}}_i \right|^2$$

Eq. 3.4 Definition of intracluster variance.
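The centroid and intracluster variance of equations 3.3 and 3.4, and the merge criterion built on them, can be sketched in plain Python as follows. The function names and the toy two-dimensional vectors are hypothetical, and the sketch recomputes the variance from scratch rather than using any of the faster update formulas a production implementation would likely employ.

```python
def centroid(cluster):
    """Mean vector of a cluster given as a list of equal-length vectors."""
    n = len(cluster)
    dim = len(cluster[0])
    return [sum(x[d] for x in cluster) / n for d in range(dim)]


def intracluster_variance(cluster):
    """Sum of squared deviations of each member from the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x[d] - c[d]) ** 2 for d in range(len(c))) for x in cluster)


def ward_merge_cost(a, b):
    """Increase in total variance caused by merging clusters a and b."""
    return intracluster_variance(a + b) - intracluster_variance(a) - intracluster_variance(b)


# Toy clusters of 2-D vectors (invented values).
a = [[0.0, 0.0], [2.0, 0.0]]
b = [[1.0, 3.0]]
c = [[10.0, 0.0]]

# The candidate merger with the smallest cost is the one selected at this iteration.
costs = {"a+b": ward_merge_cost(a, b), "a+c": ward_merge_cost(a, c), "b+c": ward_merge_cost(b, c)}
best = min(costs, key=costs.get)
print(costs, best)
```

In this toy case merging `a` and `b` raises the total variance least, so that pair would be joined first.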

The most commonly implemented hierarchical clustering methods are those belonging to the family of sequential agglomerative hierarchical non-overlapping (SAHN) methods. One particular example is AGNES,56 or agglomerative nesting. SAHN methods are traditionally implemented using the stored matrix algorithm. Each cluster initially corresponds to an individual item and, as clustering proceeds, pairs of clusters are merged, with each merger reducing the number of clusters by one. Eventually this process yields a single cluster containing all items. The stored matrix algorithm is as follows:


1. Calculate the initial proximity matrix containing the pairwise proximities between all pairs of clusters (singletons) in the dataset.

2. Scan the matrix to find the most similar pair of clusters, and merge them into a new cluster (thus replacing the original pair).

3. Update the proximity matrix by inactivating one set of entries of the original pair and updating the other set (now representing the new cluster).

4. Repeat steps 2 and 3 until just one cluster remains.
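The four steps above can be sketched in plain Python. The text does not specify how proximities to a newly merged cluster are recalculated, so this sketch assumes a single-linkage update (the minimum of the two old distances) purely for illustration; other SAHN methods differ only in that update rule. The function name and the toy distance matrix are hypothetical.

```python
def stored_matrix_clustering(dist):
    """Agglomerative clustering using a stored proximity matrix.

    dist: symmetric list-of-lists of pairwise distances (smaller = more similar).
    Returns the merges, in order, as (members_a, members_b, distance) tuples.
    The single-linkage update in step 3 is an assumption made for this sketch.
    """
    n = len(dist)
    d = [row[:] for row in dist]            # step 1: working copy of the proximity matrix
    members = {i: [i] for i in range(n)}    # active clusters, keyed by matrix row
    merges = []
    while len(members) > 1:                 # step 4: repeat until one cluster remains
        active = sorted(members)
        # Step 2: scan the active entries for the most similar pair of clusters.
        i, j = min(
            ((p, q) for idx, p in enumerate(active) for q in active[idx + 1:]),
            key=lambda pair: d[pair[0]][pair[1]],
        )
        merges.append((members[i], members[j], d[i][j]))
        # Step 3: row i now represents the new cluster; row j is inactivated.
        for k in active:
            if k not in (i, j):
                d[i][k] = d[k][i] = min(d[i][k], d[j][k])
        members[i] = members[i] + members[j]
        del members[j]
    return merges


# Toy distance matrix for four compounds (invented values).
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.6],
    [0.9, 0.7, 0.0, 0.3],
    [0.8, 0.6, 0.3, 0.0],
]
merges = stored_matrix_clustering(dist)
for merge in merges:
    print(merge)
```

The ordered list of merges contains the same information as a dendrogram: each tuple records which two clusters were joined and at what distance.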

In contrast to agglomerative methods, divisive hierarchical clustering algorithms start with all compounds in a single cluster and iteratively partition one cluster into two (top to bottom on the dendrogram) until all compounds are singletons. This approach is of particular use when only a small number of clusters is desired, so that only the first part of the hierarchy needs to be produced. Divisive methods can therefore be faster than their agglomerative counterparts, though their overall performance is generally inferior.57 This has been attributed to the fact that the initial criterion for partitioning a cluster is based on only a single descriptor, that is, it is monothetic, whereas agglomerative methods are polythetic.

When using hierarchical clustering methods it is necessary to choose a level from the hierarchy in order to define the appropriate number of clusters to represent the dataset. This corresponds to drawing an imaginary line across the dendrogram, with the number of vertical lines that intersect this line being equal to the number of clusters. This can be observed in figure 3.11, where the red line crosses the dendrogram, thus representing the seven molecules in four clusters. Visual inspection is a useful way to select the appropriate number of clusters, as the