This clustering framework, defined in (Soria and Garibaldi, 2010), formulates a consensus clustering solution. It emphasises building a cluster ensemble characterised by quality and diversity, in line with the findings of (Fern and Lin, 2008; Kuncheva et al., 2006), and then uses the agreement between the different clustering results to define the consensus clusters, which in this case are called core classes. In more detail, the framework consists of five steps:

1. Data pre-processing: In the first step, the data is cleaned and normalised if necessary, and inconsistencies are fixed.

2. Clustering: Two clustering methods are utilised, K-means (KM) and Partitioning Around Medoids (PAM) (Kaufman and Rousseeuw, 1987). The two techniques differ in the proximity measure they use to assign data instances to clusters, so they group the data according to different characteristics present in it. They were chosen because they are among the most widely used clustering methods in data mining.

3. Determining the number of clusters: In this step, validity indices are applied to clustering results. These indices indicate the appropriate number of groups to consider in the analysis.

4. Data Visualisation: Graphs like box plots and biplots are employed in order to obtain a general characterisation of the clusters obtained.

5. Consensus: In this step, the clusters found by the different techniques are aligned and the core classes are established on those samples assigned to the same group by distinct techniques, while those that do not co-occur in the same cluster are considered unclassified.
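The consensus in step 5 can be illustrated with a minimal sketch. It assumes both clusterings use integer labels 0..k-1 and aligns them by brute force over all relabelings, which is an illustrative choice and not necessarily the alignment procedure of the framework. The names `align_labels`, `core_classes` and `UNCLASSIFIED` are hypothetical.

```python
from itertools import permutations

UNCLASSIFIED = None  # hypothetical marker for samples without a core class


def align_labels(reference, other, k):
    """Relabel `other` so that it agrees with `reference` as often as possible.

    Brute force over all k! relabelings (an assumption made for brevity);
    only sensible for a small number of clusters.
    """
    best_mapping, best_hits = None, -1
    for perm in permutations(range(k)):
        hits = sum(1 for r, o in zip(reference, other) if perm[o] == r)
        if hits > best_hits:
            best_mapping, best_hits = perm, hits
    return [best_mapping[o] for o in other]


def core_classes(km_labels, pam_labels, k):
    """Keep a label only where both clusterings agree after alignment."""
    aligned = align_labels(km_labels, pam_labels, k)
    return [a if a == b else UNCLASSIFIED for a, b in zip(km_labels, aligned)]


# Made-up labels: same structure, different label names, one disagreement.
km = [0, 0, 0, 1, 1, 2, 2]
pam = [2, 2, 1, 1, 1, 0, 0]
consensus = core_classes(km, pam, k=3)
```

In this toy example the third sample is placed in different groups by the two methods, so it is left unclassified, while every other sample keeps its label as a core class member.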

After the first essential step of pre-processing, the framework ensures the desired diversity in the second step by obtaining different clusterings from K-means and PAM. The third step guarantees the quality of the cluster ensemble: six validation indices, namely Calinski and Harabasz (Caliński and Harabasz, 1974), Hartigan (Hartigan, 1975), Scott and Symons (Scott and Symons, 1971), Marriott (Marriott, 1971), traceW and Friedman (Friedman and Rubin, 1967), are used to find the optimal number of clusters for each clustering algorithm. Each algorithm is run for a range of cluster numbers and the indices are evaluated following the rules given in (Dimitriadou et al., 2002). According to these rules, a number of clusters k is selected for an index if the index value at k is much better than the value at k − 1 and not much worse than the value at k + 1. This logic is very similar to the elbow method used to find the optimal number of factors in Factor Analysis. If the indices disagree, the groupings are ranked for each index and the one with the minimum sum of ranks is chosen.
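The rank-based tie-break described above can be sketched as follows. The sketch assumes that each index has already been turned into one score per candidate number of clusters, oriented so that a higher score is better. The index names and the numbers are made up for illustration, and ties are broken by order of appearance.

```python
def rank_scores(scores):
    """Rank candidates by score, 1 = best (highest score).

    Assumes higher is better; ties keep their order of appearance.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


def best_k_by_rank_sum(candidate_ks, index_scores):
    """Return the k with the smallest sum of ranks across all indices."""
    totals = [0] * len(candidate_ks)
    for scores in index_scores.values():
        for pos, rank in enumerate(rank_scores(scores)):
            totals[pos] += rank
    best_pos = min(range(len(candidate_ks)), key=lambda p: totals[p])
    return candidate_ks[best_pos], totals


# Hypothetical scores of three indices for k = 2..5.
candidate_ks = [2, 3, 4, 5]
index_scores = {
    "index_a": [0.2, 0.9, 0.6, 0.1],
    "index_b": [0.5, 0.7, 0.8, 0.3],
    "index_c": [0.1, 0.8, 0.4, 0.3],
}
best_k, rank_totals = best_k_by_rank_sum(candidate_ks, index_scores)
```

Here two of the three indices prefer k = 3 and one prefers k = 4. The rank sums settle the disagreement in favour of k = 3.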

The level of agreement between the algorithms can serve as a validation criterion for the quality of the clustering results. The greater the agreement between the two clustering algorithms, the greater the chance that they have identified true patterns within the data. This stems from the fact that patterns belonging to a natural cluster are likely to be in the same cluster in different data partitions (Duarte et al., 2010). The level of agreement can be measured by calculating Cohen's kappa (Cohen et al., 1960) between the different clusterings. Cohen's kappa is defined as:

$$\kappa = \frac{p_0 - p_c}{1 - p_c} \tag{3.5}$$

where p0 is the observed proportion of agreement and pc is the proportion of agreement expected by chance. Kappa takes negative values when there is less observed agreement than is expected by chance, zero when observed agreement can be exactly accounted for by chance and one when there is complete agreement.
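As a worked illustration of Equation (3.5), the following sketch computes kappa for two label vectors. It assumes the cluster labels of the two algorithms have already been aligned, so that equal labels denote the same group. The function name `cohen_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two aligned label sequences of equal length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # p0: observed proportion of samples on which the two clusterings agree.
    p0 = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # pc: agreement expected by chance, from the marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    pc = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if pc == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p0 - pc) / (1 - pc)


# Made-up labels: the clusterings agree on 3 of 4 samples.
kappa = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0])
```

For these labels p0 = 0.75 and pc = 0.5, so kappa = 0.5.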

In addition, we use the Silhouette criterion (Rousseeuw, 1987) to check how well separated, distinct and compact the clusters produced by the two algorithms are. The Silhouette coefficient of an object o is defined as:

$$s(o) = \frac{b(o) - a(o)}{\max\{a(o),\, b(o)\}} \tag{3.6}$$

where a(o) is the average distance between o and all other objects in the cluster to which o belongs, and b(o) is the minimum, over all clusters to which o does not belong, of the average distance from o to the objects of that cluster. The silhouette value of a clustering is calculated as the average silhouette of all objects and takes values from -1 to 1, with 1 being the best case. It is therefore a standardised metric for measuring the goodness of a clustering, which makes it appropriate for comparing clustering results across different datasets.
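Equation (3.6) can be made concrete with a small sketch that follows the definitions of a(o) and b(o) directly. Two choices here are assumptions, not stated in the text: objects in singleton clusters are given a silhouette of 0, and the distance function is passed in by the caller. The function name and the data are illustrative.

```python
def silhouette_values(points, labels, dist):
    """Silhouette s(o) of every object, computed from pairwise distances."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(o): average distance to the other members of o's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b(o): smallest average distance to the members of any other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Made-up one-dimensional data with two well separated groups.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
sil = silhouette_values(points, labels, dist=lambda x, y: abs(x - y))
mean_silhouette = sum(sil) / len(sil)
```

The two groups are far apart relative to their spread, so the average silhouette is close to 1.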

Apart from determining the number of clusters of the respective clustering solutions based on the values of validation metrics, characterisation of the clusters is also a very significant task. It can reveal the patterns that enhance Knowledge Discovery on the data, establishing the importance of the clusters for the domain of application. Visual inspection can be used, as the framework dictates, by producing a low-dimensional plot to depict the clusters or, alternatively, ANOVA techniques can be used to observe the differences in the numerical values. However, considering the large size of the data, some of these techniques can potentially lead to wrong conclusions. A pattern that characterises a cluster should have two important qualities. First, it should be common in the cluster, being expressed by the majority of its instances, and second, it should be uncommon in the whole population. For this reason we propose a different technique to characterise the returned clusters, which we apply in the fifth chapter.

The numerical variables are discretised into three classes (Low, Average, High) according to where they fall relative to the interquartile range: the Low class describes the bottom 25%, the High class the top 25% and the Average class the middle 50%. After this transformation is applied, each cluster is expressed by the mode of each numerical variable. As the Average class is the most common class, an occurrence of the Low or the High class as the mode is significant enough to characterise the cluster. For the categorical variables we borrow the measures of Confidence and Lift from association rule mining (Agrawal et al., 1993) in order to assess the impact of the clustering on the initial distribution of categorical levels. Clustering is seen as an event and we measure the Lift and the Confidence of the rule (Cl → L), where Cl is the cluster examined and L the level of the categorical variable under examination. The Lift of a rule (Cl → L) is defined as

$$\mathrm{Lift}(Cl \to L) = \frac{P(Cl \cap L)}{P(Cl)\,P(L)} \tag{3.7}$$

and it measures how much larger or smaller the probability of the level L becomes once the event of clustering Cl takes place. High values can be used to characterise a cluster only if the level of the categorical variable is possessed by the majority of the members of the cluster. On the other hand, the Confidence of the same rule is defined as

$$\mathrm{Conf}(Cl \to L) = \frac{P(Cl \cap L)}{P(Cl)} \tag{3.8}$$

and shows how many instances of the initial population that express the categorical level L fall into the cluster Cl. High values of Confidence can reveal patterns that are expressed exclusively in this cluster, even if they are not representative of the majority of the members of the cluster.
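The characterisation described above can be sketched in a few lines. The sketch makes several assumptions: quartiles come from Python's `statistics.quantiles` with its default method, values strictly below the first quartile are Low and values strictly above the third are High, and Lift and Confidence are computed from counts exactly as written in Equations (3.7) and (3.8). All function names and data are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def discretise(values):
    """Map numerical values to Low / Average / High around the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartile method is an assumption
    return ["Low" if v < q1 else "High" if v > q3 else "Average" for v in values]


def cluster_modes(classes, labels):
    """Most frequent discretised class within each cluster."""
    by_cluster = {}
    for cls, lab in zip(classes, labels):
        by_cluster.setdefault(lab, Counter())[cls] += 1
    return {lab: counts.most_common(1)[0][0] for lab, counts in by_cluster.items()}


def lift_and_confidence(in_cluster, has_level):
    """Lift (3.7) and Confidence (3.8) of the rule Cl -> L, from counts."""
    n = len(in_cluster)
    n_cl = sum(in_cluster)   # instances in the cluster
    n_l = sum(has_level)     # instances expressing the level
    if n_cl == 0 or n_l == 0:
        raise ValueError("cluster and level must each occur at least once")
    n_both = sum(1 for c, l in zip(in_cluster, has_level) if c and l)
    lift = (n_both * n) / (n_cl * n_l)  # P(Cl and L) / (P(Cl) * P(L))
    confidence = n_both / n_cl          # P(Cl and L) / P(Cl)
    return lift, confidence


# Made-up numerical variable and cluster labels.
values = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
modes = cluster_modes(discretise(values), labels)

# Made-up categorical level: 4 of 10 instances are in the cluster,
# 5 of 10 express the level, 3 do both.
in_cluster = [True, True, True, True] + [False] * 6
has_level = [True, True, True, False, True, True, False, False, False, False]
lift, confidence = lift_and_confidence(in_cluster, has_level)
```

In the numerical example the first cluster has mode Low, which would characterise it, while the second has the uninformative mode Average. In the categorical example the Lift is 1.5 and the Confidence is 0.75.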

Finally, the fifth step defines the core classes of the consensus, strictly modelling the agreement of the cluster ensemble.
