Internal methods of validation use the variables that have already been used in the clustering process. These methods attempt to represent the goodness of fit between the input data and the resulting cluster solution (Milligan, 1996). Internal methods are closely linked to the choice of the number of clusters and many of the same statistical techniques are used.

2.2.2.1.2.1 Statistical methods

The SPSS 10.0 manual (1997) suggests that the best way to validate a cluster solution is to conduct a discriminant analysis on the clustered data. In this method, if the discriminant analysis indicates that the groups (clusters) are significantly different, then the solution is validated. However, the use of discriminant analysis, ANOVA and MANOVA has been strongly criticized (e.g. Milligan, 1996; Aldenderfer and Blashfield, 1984). Milligan (1996) argued that the clustering process separates cases into groups that minimize overlap, and so techniques such as discriminant analysis, ANOVA and MANOVA will always show good results when the cluster groups are compared. Interestingly, while Hair et al. (1995) also noted that this method is inappropriate, the researchers used a one-way ANOVA to compare cluster groups in an example presented in their work, although it was not used to validate the solution. It would seem that the use of these tests on clustering variables still offers useful information to the researcher in establishing which variables differ between groups, but this information does not validate the cluster solution on its own.

Milligan (1981) compared 30 methods for internal validation, using 108 Monte Carlo datasets with known cluster solutions and two external criterion measures (the Rand and Jaccard statistics). Applying each statistical technique to the clusters, the study ranked the 30 internal methods based on how close the result was to the known clustering solution as indicated by the external criterion measures. Milligan identified a group of six ‘strong’ methods that could form the basis of a validation procedure in applied research. Milligan (1996) noted that this testing was conducted on artificial data and may not hold for real-world data. However, it does offer strong objective data on the relative effectiveness of different methods, which the researcher can use to guide the choice of a validation method.
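The two external criterion measures mentioned above can be illustrated with a minimal sketch. It assumes the usual pair-counting definitions of the Rand and Jaccard statistics; the function names and the toy labels are hypothetical and are not taken from Milligan (1981).

```python
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Classify every pair of cases by whether each partition groups them together."""
    both = only_a = only_b = neither = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def rand_index(labels_a, labels_b):
    """Proportion of pairs on which the two partitions agree."""
    both, only_a, only_b, neither = pair_counts(labels_a, labels_b)
    return (both + neither) / (both + only_a + only_b + neither)


def jaccard_index(labels_a, labels_b):
    """As the Rand statistic, but ignoring pairs separated in both partitions."""
    both, only_a, only_b, _ = pair_counts(labels_a, labels_b)
    total = both + only_a + only_b
    return both / total if total else 1.0


# Hypothetical example: a known solution and an imperfect recovered solution.
known = [0, 0, 0, 1, 1, 1]
recovered = [1, 1, 0, 0, 0, 0]
print(rand_index(known, recovered), jaccard_index(known, recovered))
```

Both statistics equal 1 when the recovered partition matches the known one, regardless of how the cluster labels are numbered.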

2.2.2.1.2.2 Replication

Replication has been reported as a possible method for validating cluster analysis (e.g. McIntyre and Blashfield, 1980; Morey et al., 1983). Replication refers to the process of repeating the cluster analysis on a randomly drawn subset of the original data. If a cluster is robust (i.e. if its characteristics remain despite the use of different subsets from the sample population), the researcher has some evidence to support the solution’s existence. Milligan (1996) noted that replication is analogous to the cross-validation procedure in regression analysis. Replication was used by Hodge and Petlichkoff (2000), for example, who reanalyzed a randomly selected subset of two thirds of the original dataset. The researchers reported that 94% of the subset subjects maintained the same cluster membership as in the original analysis and concluded that the solution was robust based on these results.

Milligan (1996) reported a slightly different replication method. Two samples of data are obtained (usually by randomly dividing the initial dataset). Cluster analysis is performed on the first dataset and means for each cluster are calculated. Using these means, each case from the second dataset is allocated to the nearest cluster, and the cluster that each case is allocated to is noted for later comparison. Then the second dataset is cluster analysed. The two cluster solutions for the second dataset (i.e. from cluster means from dataset one and from cluster analysis of dataset two) are then compared. The level of agreement between the two cluster solutions reflects the stability of the cluster solution. Breckenridge (1989), as reported by Milligan (1996), found this method useful for validating clusters in work with Monte Carlo datasets.
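The split-sample procedure described above can be sketched as follows. This is an illustration under stated assumptions, not Milligan's implementation: a small k-means routine with a deterministic farthest-point start stands in for the cluster analysis, the Rand statistic stands in for the measure of agreement, and the data are synthetic with three well-separated groups.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iterations=100):
    """Plain k-means with a deterministic farthest-point start (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def rand_index(labels_a, labels_b):
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)


# Synthetic data: three well-separated groups, randomly divided into two samples.
rng = random.Random(0)
centres = [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0)]
data = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
        for cx, cy in centres for _ in range(40)]
rng.shuffle(data)
sample_one, sample_two = data[:60], data[60:]

# Step 1: cluster the first sample and keep its cluster means.
means_one, _ = kmeans(sample_one, k=3)
# Step 2: allocate each case in the second sample to the nearest of those means.
allocated = [nearest(p, means_one) for p in sample_two]
# Step 3: cluster the second sample directly.
_, clustered = kmeans(sample_two, k=3)
# Step 4: agreement between the two solutions for the second sample.
agreement = rand_index(allocated, clustered)
print(agreement)
```

A pair-counting measure is used for the comparison because the cluster numbering in the two solutions need not correspond.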

However, Aldenderfer and Blashfield (1984) criticised the replication method of validation. They suggested that finding a similar cluster structure using replication is a check of the internal consistency of the result. While the failure of a cluster solution to be replicated is reason for rejecting the solution, or the existence of an individual cluster, successful replication does not guarantee validity of a solution. Unfortunately, Aldenderfer and Blashfield did not expand on this issue. No other author has expressed concern with replication as a validation method. As well, replication is a method being used more extensively in association with other statistics (e.g. regression, Pedhazur, 1997). Re-sampling methods such as bootstrapping and jack-knifing (Zhu, 1997) are examples of replication analyses that are being used more extensively to provide confidence limits and validation to solutions using other statistics (e.g. Ball et al., 2003a). It should also be noted that no single measure (replication, statistical measures, theoretical assessment) completely validates a cluster solution.

2.2.2.1.2.3 Use of more than one clustering algorithm

The use of more than one method or measure of cluster calculation has been proposed as a useful method of validation of cluster analysis. There are a number of methods within the cluster process for measuring the distance between cases and between clusters. For example, the ‘between-groups’ method clusters cases that maximize the distance between clusters at each step of the hierarchical process, while the ‘within-groups’ method simply clusters the two nearest cases or clusters at each step. There are also a number of measures used to define how the distance between cases or clusters is quantified; these include Euclidean distance and squared Euclidean distance (referred to as measures of dissimilarity) and Pearson’s correlation (referred to as a measure of similarity).
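The three measures named above can be compared on a pair of cases with a short sketch. The two case profiles are hypothetical and were chosen so that the cases have the same shape across variables but different overall levels.

```python
import math


def squared_euclidean(x, y):
    """Sum of squared differences across variables (a dissimilarity)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Straight-line distance between two cases (a dissimilarity)."""
    return math.sqrt(squared_euclidean(x, y))


def pearson(x, y):
    """Pearson's correlation between two case profiles (a similarity)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical cases: identical profile shape, shifted by 10 on every variable.
case_a = [1.0, 2.0, 3.0, 4.0]
case_b = [11.0, 12.0, 13.0, 14.0]

print(euclidean(case_a, case_b))          # large: the cases are far apart in level
print(squared_euclidean(case_a, case_b))  # larger still: big differences are weighted more
print(pearson(case_a, case_b))            # close to 1: the profile shapes match
```

The example shows why the choice of measure matters: the dissimilarity measures treat these two cases as distant, while the correlation treats them as almost identical.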

Hair et al. (1995) suggested that re-analyzing a cluster solution using non-hierarchical techniques with random selection of starting seeds is a way to test the robustness of the cluster solution and validate results. In the example Hair et al. presented, the initial solution was calculated using hierarchical techniques. Clustering was then performed on the data again using a non-hierarchical process with k random seeds (i.e. randomly selected cluster centroids or means), where k equalled the number of clusters chosen from the hierarchical process. Hair et al. reported that this non-hierarchical cluster analysis obtained the same clusters as the hierarchical procedure and, based on this finding, concluded that the solution was robust and valid. However, Milligan (1996) reported that, while non-hierarchical procedures with known seed points are better at obtaining correct cluster numbers than hierarchical procedures, clustering is poor if random seeds are used. This being the case, the method suggested by Hair et al. (1995) would seem to be inappropriate. Interestingly, Hair et al. also noted that using a non-hierarchical method with random seeds leads to poor clustering solutions.

Kos and Psenick (2000) suggested that added validity is provided to a cluster solution if the clusters appear using different methods for measuring the distance between cases and clusters. Kos and Psenick clustered a dataset using both the between-groups method and the within-groups method, and suggested that for a cluster to be considered valid, it must appear in both analyses. Hair et al. (1995) also suggested that the use of more than one clustering method would be an appropriate way to validate a cluster solution, although the researchers did not use it in the example they provided. However, as noted by Hair et al. (1995) and Milligan (1996), the choice of method of determining clusters should have a theoretical basis. The use of more than one method suggested by Kos and Psenick (2000) may depart from this theoretical basis in many applied research applications, making the results of such comparisons invalid themselves.

2.2.2.1.2.4 Monte Carlo datasets

Aldenderfer and Blashfield (1984) suggested that Monte Carlo data sets might be a useful method of validation of a cluster solution. In this case, the Monte Carlo data set is generated so that its characteristics (such as means and standard deviations) are the same as those of the original data set, but with no pre-defined clusters. Both the original and Monte Carlo data sets are then cluster analysed. Aldenderfer and Blashfield suggested that the next step could involve performing one-way ANOVA on each of the parameters between clusters for the original data set. Similarly, one-way ANOVA between clusters from the Monte Carlo data set is also performed. If the difference in the F-ratios between the original and Monte Carlo data sets is large, then it might be considered that the cluster solution is sufficiently removed from a random result to be considered valid. Conversely, if the F-ratios are similar, then little support exists for the cluster groupings being valid, and the groupings more likely exist due to chance. Aldenderfer and Blashfield noted that this method had not been widely used (by 1984), and this researcher found little use of it in the literature since that time. Milligan used this technique in the series of studies evaluating different cluster methodologies but not to validate a real data solution (e.g. Milligan, 1981; Milligan and Cooper, 1985).
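The comparison of F-ratios described above can be sketched as follows. The sketch rests on several assumptions that are not from Aldenderfer and Blashfield (1984): the Monte Carlo data set is drawn from independent normal distributions matching each variable's mean and standard deviation, a small k-means routine with a deterministic start stands in for the cluster analysis, and the "original" data are synthetic with two clear groups.

```python
import random
import statistics


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans_labels(points, k, iterations=100):
    """Plain k-means with a deterministic farthest-point start (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return [nearest(p, centroids) for p in points]


def f_ratio(values, labels):
    """One-way ANOVA F-ratio for one variable across cluster groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = sum(values) / len(values)
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2 for label, g in groups.items())
    ss_within = sum((v - means[label]) ** 2 for label, g in groups.items() for v in g)
    return (ss_between / (len(groups) - 1)) / (ss_within / (len(values) - len(groups)))


def cluster_f_ratios(points, k):
    """Cluster the data, then compute an F-ratio for each variable."""
    labels = kmeans_labels(points, k)
    return [f_ratio(list(column), labels) for column in zip(*points)]


rng = random.Random(1)
# Synthetic "original" data with two clear groups.
original = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
            for cx, cy in [(0.0, 0.0), (20.0, 20.0)] for _ in range(50)]
# Monte Carlo data: same mean and SD per variable, but no built-in clusters.
monte_carlo = list(zip(*[
    [rng.gauss(statistics.mean(column), statistics.stdev(column)) for _ in column]
    for column in zip(*original)
]))

f_original = cluster_f_ratios(original, k=2)
f_monte_carlo = cluster_f_ratios(monte_carlo, k=2)
print(f_original, f_monte_carlo)
```

Under these assumptions the F-ratios for the structured data should be far larger than those for the Monte Carlo data. The Monte Carlo F-ratios can themselves be well above 1, which is consistent with Milligan's (1996) point that such tests favour any clustered grouping.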

2.2.2.1.2.5 Cophenetic correlation

Aldenderfer and Blashfield (1984) also discussed the use of cophenetic correlation to validate a cluster solution. Briefly, this method examines the dendrogram to see how well it represents the pattern among the clustered cases. An implied similarity matrix is developed based on when cases were clustered together. For example, similar cases will cluster together early in the process and will be assigned a small value. Conversely, dissimilar cases will cluster together late and will be assigned a large value. This matrix is compared with the original matrix obtained from the Euclidean distance between cases. The cophenetic correlation is the correlation between values in the original and implied matrices, with a larger value indicating a better clustering of the data.
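The calculation can be sketched on a small example. The sketch assumes single-linkage agglomeration (the text does not specify a linkage rule) and takes the implied value for a pair of cases to be the distance at which the two cases first join the same cluster; the five data points are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pair(i, j):
    """Canonical key for an unordered pair of case indices."""
    return (i, j) if i < j else (j, i)


def original_and_implied(points):
    """Return the original distances and the implied (cophenetic) values per pair."""
    n = len(points)
    dist = {pair(i, j): euclidean(points[i], points[j])
            for i, j in combinations(range(n), 2)}
    clusters = [[i] for i in range(n)]
    implied = {}
    while len(clusters) > 1:
        # Single linkage (an assumption): merge the two clusters whose closest cases are nearest.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(dist[pair(i, j)] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        height, a, b = best
        # Every pair joined by this merge is assigned the merge height.
        for i in clusters[a]:
            for j in clusters[b]:
                implied[pair(i, j)] = height
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return dist, implied


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical cases: two tight pairs and one outlying case.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 10.0)]
dist, implied = original_and_implied(points)
keys = sorted(dist)
cophenetic_correlation = pearson([dist[k] for k in keys], [implied[k] for k in keys])
print(cophenetic_correlation)
```

Cases that merge early share a small implied value and cases that merge late share a large one, so a dendrogram that preserves the original distances well gives a correlation near 1.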

Aldenderfer and Blashfield (1984) were critical of the use of cophenetic correlation, suggesting that the assumption of normality (required for correlation) is usually violated and so the correlation coefficient is not an optimal estimator of the degree of similarity between the two matrices. This might not be a valid criticism, given that numerous authors have suggested correlation is relatively robust to violations of normality (e.g. Tabachnick and Fidell, 1996) and that non-parametric tests might avoid this problem (Aldenderfer and Blashfield did not discuss this possibility). However, other limitations exist. The technique can only be used on data that have been clustered using the hierarchical method. As well, Aldenderfer and Blashfield reported that the two matrices contain different amounts of data and so contain considerably different information. Further, Holgersson (1978), as reported by Aldenderfer and Blashfield (1984), found the cophenetic correlation to be a generally misleading indicator of cluster quality based on assessment using Monte Carlo datasets.
