
A hierarchical algorithm produces a family of partitions of the $n$ initial statistical units or, better still, a succession of $n$ clusterings of the observations, with the number of groups decreasing from $n$ to 1. To verify that the partitions achieve the primary objective of cluster analysis (internal cohesion and external separation), the goodness of the partition obtained should be measured at every step of the hierarchical procedure.

A first intuitive criterion for the goodness of a clustering is the distance between the groups joined at each step; the process can be stopped when this distance increases abruptly. A more frequently used criterion is based on the decomposition of the total deviance of the $p$ variables, as in Ward's method. The idea is to have a low deviance within the groups ($W$) and a high deviance between the groups ($B$). For a partition into $g$ groups, the following synthetic index expresses this criterion:

$$R^2 = 1 - \frac{W}{T} = \frac{B}{T}.$$

Since $T = W + B$, the index $R^2 \in [0, 1]$. If the value of $R^2$ approaches 1, the corresponding partition is optimal, since the observations belonging to the same group are very similar (low $W$) and the groups are well separated (high $B$). Correspondingly, the goodness of the clustering decreases as $R^2$ approaches 0.

Note that $R^2 = 0$ when there is only one group and $R^2 = 1$ when there are as many groups as observations. As the number of groups increases, the homogeneity within the groups increases (as each group contains fewer observations), and so does $R^2$. But this leads to a loss in the parsimony of the clustering. Therefore the maximisation of $R^2$ cannot be considered the only criterion for defining the optimal number of groups. Ultimately it would lead to a clustering (for which $R^2 = 1$) of $n$ groups, each having one unit.
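The behaviour of $R^2$ at the two extremes can be checked numerically. The following is a minimal sketch, assuming a tiny made-up data set with two variables; the function names `deviance` and `r_squared` are illustrative and do not come from the text.

```python
def deviance(rows):
    """Sum of squared deviations from the column means, over all variables."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(p))


def r_squared(data, labels):
    """R^2 = 1 - W/T, where W sums the deviance inside each group."""
    total = deviance(data)
    groups = {}
    for row, label in zip(data, labels):
        groups.setdefault(label, []).append(row)
    within = sum(deviance(members) for members in groups.values())
    return 1.0 - within / total


# Hypothetical data: two tight pairs of points, far apart from each other.
data = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]

r2_one = r_squared(data, [0, 0, 0, 0])  # a single group: W = T
r2_two = r_squared(data, [0, 0, 1, 1])  # the two natural groups
r2_n = r_squared(data, [0, 1, 2, 3])    # one group per observation: W = 0
print(r2_one, r2_two, r2_n)
```

The two-group partition already gives a value close to 1, and splitting further can only raise $R^2$, which is why the index has to be weighed against parsimony.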

A common measure to accompany $R^2$ is the pseudo-F criterion. Let $c$ be a certain level of the procedure, corresponding to a number of groups equal to $c$, and let $n$ be the number of observations available. The pseudo-F criterion is defined as

$$F_c = \frac{B/(c-1)}{W/(n-c)}.$$

Generally $F_c$ decreases as groups are merged (that is, as $c$ falls), since the deviance between groups should decrease and the deviance within groups should increase. If there is an abrupt fall, it means that very different groups have been united. The advantage of the pseudo-F criterion is that, by analogy with what happens in the context of the normal linear model (Section 4.11), it is possible to show how to build a decision rule that allows us to establish whether to accept the fusion among the groups (null hypothesis) or to stop the procedure, choosing the less parsimonious representation (alternative hypothesis). This decision rule is specified by a confidence interval based on the F distribution, with $(c-1)$ and $(n-c)$ degrees of freedom. But in applying the decision rule, we assume that the observations follow a normal distribution, reducing the advantages of a model-free formulation, such as that adopted here.
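As a small worked illustration of the pseudo-F definition, the sketch below computes $F_c$ from $B$ and $W$, and also from $R^2$ alone, using the identities $B/T = R^2$ and $W/T = 1 - R^2$. The function names and the numeric values are assumptions chosen for illustration, not figures from the text.

```python
def pseudo_f(between, within, n, c):
    """Pseudo-F: [B / (c - 1)] / [W / (n - c)], defined for 1 < c < n."""
    if not 1 < c < n:
        raise ValueError("pseudo-F requires 1 < c < n")
    return (between / (c - 1)) / (within / (n - c))


def pseudo_f_from_r2(r2, n, c):
    """Same statistic written with R^2, since B = R^2 * T and W = (1 - R^2) * T."""
    if not 1 < c < n:
        raise ValueError("pseudo-F requires 1 < c < n")
    return (r2 / (c - 1)) / ((1.0 - r2) / (n - c))


# Hypothetical values: T = 100 split into B = 90 and W = 10,
# with n = 20 observations and c = 3 groups.
f_direct = pseudo_f(90.0, 10.0, n=20, c=3)
f_via_r2 = pseudo_f_from_r2(0.9, n=20, c=3)
print(f_direct, f_via_r2)
```

The second form is convenient when software reports only $R^2$ for each level of the hierarchy.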

An alternative to $R^2$ is the root mean square standard deviation (RMSSTD). This only considers the part of the deviance in the additional groups formed at each step of the hierarchical clustering. Considering the $h$th step ($h = 2, \ldots, n-1$) of the procedure, the RMSSTD is defined as

$$\mathrm{RMSSTD} = \sqrt{\frac{W_h}{p\,(n_h - 1)}},$$

where $W_h$ is the deviance in the group constituted at step $h$ of the procedure, $n_h$ is its size (the number of observations it contains) and $p$ is the number of available variables. A strong increase in RMSSTD from one step to the next shows that the two groups being united are strongly heterogeneous and therefore it would be appropriate to stop the procedure at the earlier step.
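The effect of merging heterogeneous groups on RMSSTD can be seen in a short sketch. It assumes a made-up data set with $p = 2$ variables; `deviance` and `rmsstd` are illustrative names.

```python
import math


def deviance(rows):
    """Sum of squared deviations from the column means, over all variables."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(p))


def rmsstd(group):
    """sqrt(W_h / (p * (n_h - 1))) for the group formed at a given step."""
    n_h = len(group)
    p = len(group[0])
    if n_h < 2:
        raise ValueError("RMSSTD needs a group with at least two observations")
    return math.sqrt(deviance(group) / (p * (n_h - 1)))


# Hypothetical groups: a homogeneous pair, and the same pair merged
# with two distant observations.
tight = [(1.0, 2.0), (1.5, 1.8)]
mixed = tight + [(8.0, 8.0), (9.0, 9.5)]

rmsstd_tight = rmsstd(tight)
rmsstd_mixed = rmsstd(mixed)
print(rmsstd_tight, rmsstd_mixed)
```

The jump between the two values is the kind of strong increase that suggests stopping at the earlier step.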

Another index that, similar to RMSSTD, measures the 'additional' contribution of the $h$th step of the procedure is the so-called 'semipartial' $R^2$ (SPRSQ), given by

$$\mathrm{SPRSQ} = \frac{W_h - W_r - W_s}{T},$$

where $h$ is the new group, obtained at step $h$ as a fusion of groups $r$ and $s$. $T$ is the total deviance of the observations, while $W_h$, $W_r$ and $W_s$ are the deviances of the observations in groups $h$, $r$ and $s$, respectively. In other words, the SPRSQ measures the increase in the within-group deviance $W$ obtained by joining groups $r$ and $s$. An abrupt increase in SPRSQ indicates that heterogeneous groups are being united and therefore it is appropriate to stop at the previous step.
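A minimal sketch of the SPRSQ calculation follows, assuming a tiny made-up data set in which the final merge joins two well-separated groups. The names `deviance` and `sprsq` are illustrative. Because the merged group here is the whole data set, $W_h = T$ in this particular example.

```python
def deviance(rows):
    """Sum of squared deviations from the column means, over all variables."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(p))


def sprsq(group_r, group_s, all_data):
    """(W_h - W_r - W_s) / T, where h is the union of groups r and s."""
    merged = list(group_r) + list(group_s)
    increase = deviance(merged) - deviance(group_r) - deviance(group_s)
    return increase / deviance(all_data)


# Hypothetical groups r and s; together they make up the whole data set.
group_r = [(1.0, 2.0), (1.5, 1.8)]
group_s = [(8.0, 8.0), (9.0, 9.5)]
data = group_r + group_s

last_step = sprsq(group_r, group_s, data)
print(last_step)
```

At the last step of a hierarchy the SPRSQ equals the $R^2$ of the two-group partition being abandoned, since all the remaining between-group deviance is absorbed into $W$.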

We believe that choosing one index from the 'global' indexes $R^2$ and pseudo-F and one index from the 'local' indexes RMSSTD and SPRSQ allows us to evaluate adequately the degree of homogeneity of the obtained groups at every step of a hierarchical clustering and therefore to choose the best partition.

Table 4.3 gives an example of cluster analysis, obtained with Ward's method, in which the indexes $R^2$ and SPRSQ are indeed able to give an indication of the number of partitions to choose.

Table 4.3 Output of a cluster analysis.

NCL   Clusters joined   FREQ   SPRSQ    RSQ
11    CL19    CL24        13   0.0004   0.998
10    CL14    CL18        42   0.0007   0.997
 9    CL11    CL13        85   0.0007   0.996
 8    CL16    CL15       635   0.0010   0.995
 7    CL17    CL26       150   0.0011   0.994
 6    CL9     CL27       925   0.0026   0.991
 5    CL34    CL12       248   0.0033   0.988
 4    CL6     CL10       967   0.0100   0.978
 3    CL4     CL5       1215   0.0373   0.941
 2    CL7     CL3       1365   0.3089   0.632
 1    CL2     CL8       2000   0.6320   0.000

A number of clusters (NCL) equal to 3 is more than satisfactory, as indicated by the row third from last, in which clusters 4 and 5 are united (these were obtained at NCL equal to 4 and 5, respectively). In fact, the further step of uniting groups 7 and 3 leads to a sizeable reduction in $R^2$ and to an abrupt increase in SPRSQ. On the other hand, choosing NCL equal to 4 does not give noticeable improvements in $R^2$. Note that the cluster joined at NCL = 3 contains 1215 observations (FREQ).
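The figures in Table 4.3 can be read mechanically. The sketch below checks that each SPRSQ value matches the drop in RSQ from the previous row, up to the rounding of the printed figures. It then applies one illustrative heuristic, which is an assumption of this sketch and not a rule stated in the text: stop just before the merge whose SPRSQ grows by the largest factor relative to the preceding merge.

```python
# (NCL, SPRSQ, RSQ) copied from Table 4.3, in order of decreasing NCL.
rows = [
    (11, 0.0004, 0.998),
    (10, 0.0007, 0.997),
    (9, 0.0007, 0.996),
    (8, 0.0010, 0.995),
    (7, 0.0011, 0.994),
    (6, 0.0026, 0.991),
    (5, 0.0033, 0.988),
    (4, 0.0100, 0.978),
    (3, 0.0373, 0.941),
    (2, 0.3089, 0.632),
    (1, 0.6320, 0.000),
]

# Gap between the printed SPRSQ and the drop in RSQ caused by each merge.
gaps = [
    abs((prev[2] - cur[2]) - cur[1])
    for prev, cur in zip(rows, rows[1:])
]
max_gap = max(gaps)

# Illustrative heuristic (an assumption): largest relative jump in SPRSQ.
ratios = [
    (cur[1] / prev[1], prev[0])  # prev[0] is the NCL before the merge
    for prev, cur in zip(rows, rows[1:])
]
largest_ratio, suggested_ncl = max(ratios)
print(max_gap, largest_ratio, suggested_ncl)
```

On these figures the heuristic points to the merge that takes NCL from 3 to 2, which agrees with the choice of three clusters discussed above. Other stopping rules may select a different level.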

To summarise, there is no unequivocal criterion for evaluating the methods of cluster analysis but a whole range of criteria. Their application should strike a balance between simplicity and information content.

4.2.3 Non-hierarchical methods

The non-hierarchical methods of clustering allow us to obtain one partition of the $n$ observations in $g$ groups ($g < n$), with $g$ defined a priori. Unlike what happens with hierarchical methods, the procedure gives as output only one partition that satisfies determined optimality criteria, such as the attainment of the grouping that allows us to get the maximum internal cohesion for the specified number of groups. For any given value of $g$, according to which it is intended to classify the $n$ observations, a non-hierarchical algorithm classifies each of the observations only on the basis of the selected criterion, usually stated by means of an objective function. In general, a non-hierarchical clustering can be summarised by the following algorithm:

1. Choose the number of groups $g$ and choose an initial clustering of the $n$ statistical units in that number of groups.

2. Evaluate the ‘transfer’ of each observation from the initial group to another group. The purpose is to maximise the internal cohesion of the groups. The variation in the objective function determined by the transfer is calculated and, if relevant, the transfer becomes permanent.

3. Repeat step 2 until a stopping rule is satisfied.

Non-hierarchical algorithms are generally much faster than hierarchical ones, because they employ an iterative calculation structure which does not require us to determine the distance matrix. The construction of non-hierarchical algorithms tends to make them more stable with respect to data variability. Furthermore, non-hierarchical algorithms are suitable for large data sets where hierarchical algorithms would be too slow. Nevertheless, there can be many possible ways of dividing $n$ observations into $g$ non-overlapping groups, especially for real data, and it is impossible to obtain and compare all these combinations. This can make it difficult to do a global maximisation of the objective function, and non-hierarchical algorithms may produce constrained solutions, often corresponding to local maxima of the objective function.
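To give a sense of how many partitions there are, the number of ways of dividing $n$ observations into $g$ non-empty, non-overlapping groups is the Stirling number of the second kind, $S(n, g)$, which satisfies $S(n, g) = g\,S(n-1, g) + S(n-1, g-1)$. The following sketch evaluates it; the function name `stirling2` is illustrative.

```python
def stirling2(n, g):
    """Number of partitions of n labelled items into g non-empty groups."""
    if g < 0 or g > n:
        return 0
    # table[k] holds S(m, k) for the current m; start from S(0, 0) = 1.
    table = [1] + [0] * g
    for m in range(1, n + 1):
        # Update from high k to low k so S(m-1, k-1) is still available.
        for k in range(min(m, g), 0, -1):
            table[k] = k * table[k] + table[k - 1]
        table[0] = 0  # S(m, 0) = 0 for m >= 1
    return table[g]


small = stirling2(4, 2)    # 4 observations into 2 groups
medium = stirling2(10, 3)  # 10 observations into 3 groups
large = stirling2(100, 5)  # already far too many to enumerate
print(small, medium, len(str(large)))
```

Even for a hundred observations and five groups the count has dozens of digits, which is why exhaustive comparison is out of reach and iterative search is used instead.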

In a non-hierarchical clustering we need to begin by defining the number of groups. This is usually done by conducting the analysis with different values of $g$ (and different algorithm initialisations) and determining the best solution by comparing appropriate indexes for the goodness of the clustering (such as $R^2$ or the pseudo-F index).

The most commonly used method of non-hierarchical clustering is the $k$-means method, where $k$ indicates the number of groups established a priori ($g$ in this section). The $k$-means algorithm performs a clustering of the $n$ starting elements, in $g$ distinct groups (with $g$ previously fixed), according to the following operational flow:

1. Initialisation. Having determined the number of groups, $g$ points, called seeds, are defined in the $p$-dimensional space. The seeds constitute the centroids (measures of position, usually means) of the clusters in the initial partition. There should be sufficient distance between them to improve the properties of convergence of the algorithm. For example, to space the centroids adequately in $\mathbb{R}^p$, one can select $g$ observations (seeds) whose reciprocal distance is greater than a predefined threshold, and greater than the distance between them and the observations. Once the seeds are defined, an initial partition of the observations is constructed, allocating each observation to the group whose centroid is closest.

2. Transfer evaluation. The distance of each observation from the centroids of the $g$ groups is calculated. The distance between an observation and the centroid of the group to which it has been assigned has to be a minimum; if it is not a minimum, the observation is moved to the cluster whose centroid is closest. The centroids of the old group and the new group are then recalculated.

3. Repetition. We repeat step 2 until we reach a suitable stabilisation of the groups.
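The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: it uses the Euclidean distance, takes the seeds as an explicit argument instead of selecting them by a spacing rule, and reassigns all observations before recomputing the centroids (a batch update), which simplifies the one-transfer-at-a-time recalculation described in step 2. The names `k_means`, `euclidean` and `centroid` are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points with the same dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(rows):
    """Vector of variable-wise means of a non-empty list of observations."""
    n = len(rows)
    return tuple(sum(column) / n for column in zip(*rows))


def k_means(data, seeds, max_iter=100):
    centroids = [tuple(seed) for seed in seeds]  # step 1: initial centroids
    labels = None
    for _ in range(max_iter):
        # Step 2: allocate each observation to the closest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda l: euclidean(row, centroids[l]))
            for row in data
        ]
        if new_labels == labels:  # step 3: stop once the groups are stable
            break
        labels = new_labels
        for l in range(len(centroids)):
            members = [row for row, lab in zip(data, labels) if lab == l]
            if members:  # keep the old centroid if a group becomes empty
                centroids[l] = centroid(members)
    return labels, centroids


# Hypothetical data and seeds (two observations that lie far apart).
data = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5), (1.2, 2.2)]
labels, centroids = k_means(data, seeds=[(1.0, 2.0), (9.0, 9.5)])
print(labels, centroids)
```

The stopping rule used here (no observation changes group) is one possible reading of 'suitable stabilisation'; a tolerance on centroid movement would be another.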

To calculate the distance between the observations and the centroids of the groups, the $k$-means algorithm employs the Euclidean distance: at the $t$th iteration, the distance between the $i$th observation and the centroid of group $l$ (with $i = 1, 2, \ldots, n$ and $l = 1, 2, \ldots, g$) will be equal to

$$d\left(x_i, \bar{x}_l^{(t)}\right) = \sqrt{\sum_{s=1}^{p} \left(x_{is} - \bar{x}_{sl}^{(t)}\right)^2},$$

where $\bar{x}_l^{(t)} = \left[\bar{x}_{1l}^{(t)}, \ldots, \bar{x}_{pl}^{(t)}\right]'$ is the centroid of group $l$ calculated at the $t$th iteration. This shows that the $k$-means method searches for the partition of the $n$ observations in $g$ groups (with $g$ fixed in advance) that satisfies a criterion of internal cohesion based on the minimisation of the within-group deviance $W$. The goodness of the obtained partition can therefore be evaluated by calculating the index $R^2$ or the pseudo-F statistic. A disadvantage of the $k$-means method is the possibility of obtaining distorted results when there are outliers in the data. In that case the non-anomalous units will tend to be classified into very few groups, while the outliers will tend to be put in very small groups on their own. This can create so-called 'elephant clusters', that is, clusters that are too big and contain most of the observations.
