In this section we consider another important question: once the proximity measure has been chosen, which criterion should be used to guide the cluster analysis and evaluate the derived results? Clustering can be regarded as a search problem in which each node corresponds to a particular partitioning [112], but exhaustively evaluating all possible partitions to find the optimal one is computationally infeasible in practice. Clustering algorithms therefore rely on heuristics to accelerate the search. It has also been pointed out that cluster analysis is considerably subjective in nature: the target data objects are partitioned into “a number of more or less homogeneous subgroups on the basis of an often subjectively chosen measure of similarity (i.e., chosen subjectively based on its ability to create interesting clusters)” [8]. Hence, the evaluation criterion plays an important role in cluster analysis: it guides the search through the space of partitions, and it quantitatively compares the partitions produced by clustering algorithms in order to identify the best one.
Some criteria for propositional datasets are outlined in [35], among which the Sum of Squared Error (SSE) is the most widely used for clustering. Let $K$ be the number of derived clusters, $N_k$ the number of data instances in cluster $C_k$, and $\bar{x}_k$ the center of these instances. The SSE criterion is defined as:
\[ E = \sum_{k=1}^{K} \sum_{x \in C_k} fd^{2}(x, \bar{x}_k) \tag{2.10} \]
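As a concrete reading of Eq. (2.10), the following minimal sketch computes the SSE of a labelled partition. It assumes that the data are vectors in $\mathbb{R}^m$, that the distance function is the Euclidean distance, and that the cluster center is the mean; the function name `sse` and the toy data are illustrative only.

```python
import numpy as np

def sse(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean.

    Assumption: the distance in Eq. (2.10) is Euclidean and the cluster
    center is the arithmetic mean of the cluster members.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]      # all instances in cluster k
        center = members.mean(axis=0)      # cluster center (mean)
        total += float(((members - center) ** 2).sum())
    return total

# Toy data: two compact, well-separated clusters of two points each.
toy_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
sse_two_clusters = sse(toy_points, [0, 0, 1, 1])
sse_one_cluster = sse(toy_points, [0, 0, 0, 0])
print(sse_two_clusters, sse_one_cluster)
```

On this toy data the two-cluster partition has a much smaller SSE than the single-cluster partition, which is the behaviour the criterion is meant to reward.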
The SSE criterion is appropriate when the clusters are compact and well separated from each other. However, when the cluster sizes in the optimal partition vary greatly, the partition with minimum SSE may not reveal the true underlying data structure, because a partition that splits large clusters yields a lower SSE in such circumstances. Consequently, clustering algorithms designed to minimize the SSE criterion, e.g. k-means, tend to generate clusters of roughly equal size. To address this issue, another criterion, Related Minimum Variance, might be used:
\[ E = \frac{1}{2} \sum_{k=1}^{K} N_k s_k \tag{2.11} \]
where
\[ s_k = \frac{1}{N_k^{2}} \sum_{x \in C_k} \sum_{x' \in C_k} fs(x, x') \quad \text{or} \quad s_k = \min_{x, x' \in C_k} fs(x, x') \]
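The size bias of the SSE criterion noted above can be checked on a small synthetic example. The sketch below is illustrative only: it assumes one-dimensional data, Euclidean distance and mean centers, and the dataset (one large, spread-out cluster next to one small, tight cluster) is invented for the demonstration.

```python
import numpy as np

def sse(points, labels):
    """SSE with Euclidean distance and mean centers (assumed, as in Eq. 2.10)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        total += float(((members - members.mean(axis=0)) ** 2).sum())
    return total

# Large cluster: 101 evenly spaced values 0.0, 0.1, ..., 10.0.
large = np.arange(101) / 10.0
# Small, tight cluster located next to the large one.
small = np.array([12.0, 12.1, 12.2, 12.3, 12.4])
points = np.concatenate([large, small]).reshape(-1, 1)

# "True" partition: the large cluster versus the small cluster.
true_labels = np.array([0] * 101 + [1] * 5)
# Alternative partition: the large cluster is cut in half and the small
# cluster is merged into the right half.
split_labels = np.array([0] * 51 + [1] * 50 + [1] * 5)

sse_true = sse(points, true_labels)
sse_split = sse(points, split_labels)
print(sse_true, sse_split)
```

Here the alternative partition has the lower SSE even though it ignores the small cluster, so an algorithm that purely minimizes SSE with $K = 2$ would prefer it over the intended structure.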
The SSE or Related Minimum Variance criterion can be applied in the functions $F_{intra}(C_k)$ and $F_{inter}(C_k, C_{k'})$ of Definition 2.1 to guide the clustering procedure or to evaluate the derived result. For example, in k-means we iteratively adjust the cluster membership of the data instances in order to reduce the SSE value over all the clusters.
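The way k-means uses the SSE as its guiding criterion can be sketched as follows. This is a minimal Lloyd-style iteration written for illustration, not the implementation used in this work; it assumes Euclidean distance, mean centers and user-supplied initial centers, and the name `kmeans_sse_trace` is hypothetical. The SSE recorded after each iteration never increases.

```python
import numpy as np

def kmeans_sse_trace(points, init_centers, n_iter=10):
    """Run Lloyd-style k-means and record the SSE after every iteration."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    trace = []
    for _ in range(n_iter):
        # Assignment step: each instance joins the cluster of its nearest center.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its current members
        # (an empty cluster keeps its previous center).
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
        # SSE of the current partition with respect to the updated centers.
        trace.append(float(((points - centers[labels]) ** 2).sum()))
    return labels, centers, trace

points = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
labels, centers, trace = kmeans_sse_trace(points, init_centers=[[0, 0], [1, 0]])
print(labels, trace)
```

Both steps can only lower (or keep) the SSE: reassignment picks the nearest center for every instance, and the mean is the point that minimizes the sum of squared Euclidean distances to the members of a cluster.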
The SSE criterion, also known as the Within-Cluster Scatter Matrix (denoted $S_W$) in multiple discriminant analysis, can be used to evaluate the intra-cluster distances. Additionally, the Between-Cluster Scatter Matrix and the Total Scatter Matrix (denoted $S_B$ and $S_T$ respectively) are used to evaluate the inter-cluster distances and the overall scatter of the whole dataset, respectively:
\[ S_B = \sum_{k=1}^{K} N_k \cdot fd^{2}(\bar{x}_k, \bar{x}) \quad \text{and} \quad S_T = \sum_{x \in D} fd^{2}(x, \bar{x}) \tag{2.12} \]
where $\bar{x}_k$ is the center of all data objects in cluster $C_k$ as before, and $\bar{x}$ is the center of the whole dataset. In propositional clustering, the center of a cluster is the mean of all data in that cluster: $\bar{x}_k = \frac{1}{N_k} \sum_{x \in C_k} x$ when $C_k \subset \mathbb{R}^m$. Similarly, $\bar{x} = \frac{1}{N} \sum_{x \in D} x$. In relational clustering, however, the center of a cluster is usually taken to be the medoid of the data objects in that cluster, so the identity $S_T = S_W + S_B$, which holds for propositional datasets, no longer holds in general. Additionally, determining the medoid of a cluster has quadratic computational complexity in the cluster size. This disadvantage heavily restricts the application of relational clustering algorithms that are designed to minimize the SSE criterion of a partition of a relational dataset. This issue will be investigated thoroughly in Section 2.2.1.
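The difference between mean and medoid centers can be verified numerically. The sketch below is an illustration under stated assumptions: the distance is Euclidean, the medoid is taken to be the member with the smallest total squared distance to the other members, and the helper names (`scatter`, `mean_center`, `medoid_center`) and the toy data are hypothetical. The medoid search builds a full pairwise distance table, which is where the quadratic cost comes from.

```python
import numpy as np

def mean_center(members):
    """Center used in propositional clustering: the arithmetic mean."""
    return members.mean(axis=0)

def medoid_center(members):
    """Member with the smallest total squared distance to all members.

    Builds an N x N table of pairwise distances, hence quadratic cost.
    """
    d2 = ((members[:, None, :] - members[None, :, :]) ** 2).sum(axis=2)
    return members[d2.sum(axis=1).argmin()]

def scatter(points, labels, center_fn):
    """Return (S_W, S_B, S_T) for the given notion of 'center'."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    overall = center_fn(points)                 # center of the whole dataset
    s_w = s_b = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        center = center_fn(members)             # center of cluster k
        s_w += float(((members - center) ** 2).sum())
        s_b += len(members) * float(((center - overall) ** 2).sum())
    s_t = float(((points - overall) ** 2).sum())
    return s_w, s_b, s_t

points = np.array([0, 1, 5, 10, 11, 12, 20], dtype=float).reshape(-1, 1)
labels = [0, 0, 0, 1, 1, 1, 1]

mean_sw, mean_sb, mean_st = scatter(points, labels, mean_center)
med_sw, med_sb, med_st = scatter(points, labels, medoid_center)
print(mean_sw + mean_sb, mean_st)   # equal up to rounding
print(med_sw + med_sb, med_st)      # differ
```

With mean centers the decomposition holds up to floating-point rounding, whereas with medoid centers the two sides differ on this example, so $S_B$ cannot be derived from $S_T$ and $S_W$ and has to be computed separately.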
To evaluate the quality of the clustering result when the class labels of the data objects are available, it is possible to examine the homogeneity within the clusters and the heterogeneity between the clusters. The Jaccard Coefficient [141] is suitable for this purpose: it is the number of pairs of objects that are in the same cluster and have the same class label, divided by the number of pairs of objects that are either in the same cluster or have the same class label. In Table 2.3 we set the parameters $b = 0$ and $w = 1$ to obtain the Jaccard Coefficient.
Alternatively, we can evaluate the quality of clusters based on the idea of entropy. Entropy was first introduced in thermodynamics as a measure related to a system's thermal energy; because that energy arises from disordered molecular motion, entropy reflects the molecular disorder of the thermodynamic system [51]. Later, entropy was extended to information theory to measure the uncertainty associated with a random variable [133]. More recently, entropy has been used to evaluate the disorder or impurity of clusters [145]. Formally, given the class labels of the data objects in a cluster $C_k$, the entropy of $C_k$ is computed by:
\[ E(C_k) = -\sum_{h} P_{h,k} \log_2 P_{h,k} \tag{2.13} \]
where $P_{h,k}$ is the proportion of data objects of class $h$ in the cluster $C_k$. The total entropy is defined as:
\[ E = \sum_{C_k} E(C_k) \tag{2.14} \]
Generally speaking, smaller entropy values indicate purer clusters and therefore a more accurate clustering result.
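Both label-based measures described above, the Jaccard Coefficient and the entropy of Eqs. (2.13) and (2.14), can be computed directly from the cluster assignments and class labels. The following is a minimal sketch; the function names and the six-object example are illustrative assumptions, and the total entropy is the plain (unweighted) sum over clusters, as in Eq. (2.14).

```python
from collections import Counter
from itertools import combinations
from math import log2

def jaccard(clusters, classes):
    """Pairs with same cluster AND same class over pairs with same cluster OR same class."""
    same_both = same_either = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        same_both += same_cluster and same_class
        same_either += same_cluster or same_class
    return same_both / same_either if same_either else 0.0

def cluster_entropy(class_labels):
    """Eq. (2.13): entropy of the class distribution inside one cluster."""
    n = len(class_labels)
    return -sum((c / n) * log2(c / n) for c in Counter(class_labels).values())

def total_entropy(clusters, classes):
    """Eq. (2.14): sum of the per-cluster entropies."""
    by_cluster = {}
    for k, h in zip(clusters, classes):
        by_cluster.setdefault(k, []).append(h)
    return sum(cluster_entropy(members) for members in by_cluster.values())

# Six objects: cluster 0 mixes classes "a" and "b", cluster 1 is pure "b".
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
jac = jaccard(clusters, classes)
ent = total_entropy(clusters, classes)
print(jac, ent)
```

In this example only the mixed cluster contributes to the total entropy, and the Jaccard Coefficient falls below 1 because some pairs share a cluster but not a class, or a class but not a cluster. A clustering that matches the class labels exactly would give a Jaccard Coefficient of 1 and a total entropy of 0.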