

The fundamental challenge, from a theoretical point of view, in analysing high-dimensional data is that data points become increasingly sparse as dimensionality increases (Steinbach et al., 2004). This is most easily seen using a grid-based representation. For a fixed number of grid cells partitioning each dimension, the total number of grid cells in the full-dimensional space grows exponentially with the dimension. Unless the number of available data points grows at least at the same rate, the ratio of non-empty cells to empty cells approaches zero as the number of dimensions becomes very large. The space is, in a sense, “almost everywhere” sparse.
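
The grid argument can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly on the unit hypercube and an equal-width grid; the function name `occupied_fraction` and its parameters are illustrative choices, not taken from the cited work.

```python
import random

def occupied_fraction(n_points, bins_per_dim, dim, seed=0):
    """Fraction of grid cells that contain at least one uniform random point."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Each coordinate is uniform on [0, 1); its cell index is floor(x * bins).
        cell = tuple(int(rng.random() * bins_per_dim) for _ in range(dim))
        occupied.add(cell)
    # The total number of cells grows exponentially with the dimension.
    total_cells = bins_per_dim ** dim
    return len(occupied) / total_cells

if __name__ == "__main__":
    for dim in (1, 2, 4, 8, 16):
        print(dim, occupied_fraction(n_points=1000, bins_per_dim=4, dim=dim))
```

With the sample size held fixed, the occupied fraction can never exceed `n_points / bins_per_dim ** dim`, so it is forced towards zero as `dim` grows.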

From a practical point of view, one of the major challenges for data partitioning is that certain distance measures lose meaning in very high dimensions (Kriegel et al., 2009). This is related to the fact that pairwise distances between points tend to be more uniform in high dimensions (Beyer et al., 1999; Aggarwal et al., 2001). This is expressed theoretically by Beyer et al. (1999), who show that for certain distributions underlying the data, the difference between the largest and smallest distance in a data set, divided by the smallest distance, tends to zero in probability as the dimension approaches infinity. There is therefore poor discrimination between the nearest and furthest neighbour (Aggarwal et al., 2001).
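
The concentration of distances can be illustrated with a small simulation. This sketch assumes uniform data on the unit hypercube, Euclidean distance, and a single random query point; these are illustrative assumptions and not the general conditions studied by Beyer et al. (1999).

```python
import math
import random

def relative_contrast(n_points, dim, seed=0):
    """(D_max - D_min) / D_min for distances from a random query to uniform data."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min

if __name__ == "__main__":
    for dim in (2, 10, 100, 500):
        print(dim, round(relative_contrast(n_points=200, dim=dim), 3))
```

Under these assumptions the printed ratio shrinks as `dim` increases, which is the loss of contrast between the nearest and furthest neighbour described above.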

The standard approach to handling high-dimensional data is via dimension reduction. Dimension reduction can be performed as a preprocessing task before any attempt to partition a data set is undertaken, or it can be performed in conjunction with the partitioning step. Dimension reduction techniques can also help significantly even in relatively low dimensions, by removing the effect of features which are irrelevant for determining clusters, or identifying pairs of features which are highly correlated with one another.

Subspace clustering usually refers to the case where it is assumed that only a subset of the features is relevant for determining clusters (Steinbach et al., 2004). A challenge in this context is that different subsets may be relevant for different clusters, and so attempting to cluster directly using only a single subset of features may not lead to meaningful results. Grid-based clustering, rather counterintuitively, offers a useful means for subspace clustering. It is counterintuitive since the number of grid cells to process is so large that it seems grid-based methods would be particularly limited in high-dimensional applications. Their use is well described by Agrawal et al. (1998) in relation to the CLIQUE algorithm. The observation is that a density-based cluster defined in a subset of dimensions, when projected onto each of those dimensions, will exhibit a (one-dimensional) high-density region. Importantly, the intersection of two or more one-dimensional high-density regions does not necessarily correspond to a dense grid cell in those dimensions. Low-dimensional dense grid cells, when intersected, therefore represent only the potential locations of higher-dimensional clusters. Only those intersections need to be considered when searching for clusters, rather than trying to find dense regions over an exponentially large number of grid cells.
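
The candidate-generation idea can be sketched for the step from one to two dimensions. This is a simplified illustration and not the CLIQUE algorithm itself: it assumes data on the unit hypercube, an equal-width grid, and a plain count threshold (`min_count`, a hypothetical parameter) as the definition of a dense cell.

```python
from collections import Counter
from itertools import combinations

def dense_units_1d(points, bins, min_count):
    """Return {dimension: set of dense interval indices} on a [0, 1) grid."""
    dim = len(points[0])
    dense = {}
    for j in range(dim):
        counts = Counter(int(p[j] * bins) for p in points)
        dense[j] = {b for b, c in counts.items() if c >= min_count}
    return dense

def dense_units_2d(points, bins, min_count):
    """Dense 2-D cells, searched only among intersections of dense 1-D intervals."""
    dense_1d = dense_units_1d(points, bins, min_count)
    result = {}
    for j, k in combinations(sorted(dense_1d), 2):
        # Candidate cells: every pairing of a dense interval in j with one in k.
        candidates = {(a, b) for a in dense_1d[j] for b in dense_1d[k]}
        counts = Counter((int(p[j] * bins), int(p[k] * bins)) for p in points)
        # A candidate is kept only if the joint cell itself is dense.
        kept = {cell for cell in candidates if counts[cell] >= min_count}
        if kept:
            result[(j, k)] = kept
    return result
```

For two clusters lying on the diagonal of the first two dimensions, each of those dimensions has two dense intervals and hence four candidate cells, but only two of the four survive the joint density check. Dimensions with no dense interval contribute no candidates at all.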

Other cluster definitions, such as centroid-based, have also been considered for subspace clustering. For example, the PROCLUS algorithm (Aggarwal et al., 1999) used a k-median based approach in which each cluster has an associated set of dimensions within which the associated data are most compact, or have least variability. Distance calculations for each cluster are only computed within their relevant subspace, using the L1 norm.
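
A subspace-restricted distance of this kind might look as follows. This is a sketch of the distance computation only, not of PROCLUS as a whole; the function names are illustrative, and dividing by the number of dimensions, so that clusters with subspaces of different sizes are comparable, is an assumption made here.

```python
def subspace_l1(x, centre, dims):
    """L1 distance between x and centre using only the coordinates in dims."""
    return sum(abs(x[j] - centre[j]) for j in dims)

def assign(point, centres, subspaces):
    """Index of the cluster whose centre is nearest within its own subspace."""
    # Assumption: average over the subspace so that clusters with more
    # dimensions are not penalised merely for summing more terms.
    costs = [subspace_l1(point, c, dims) / len(dims)
             for c, dims in zip(centres, subspaces)]
    return min(range(len(costs)), key=costs.__getitem__)
```

For example, with `centres = [(0, 0, 0), (10, 5, 9)]` and `subspaces = [(0, 2), (1,)]`, the point `(1, 5, 0)` is assigned to the second cluster: it matches that centre exactly in the only dimension the cluster regards as relevant.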

Subspace clustering is somewhat limited by restricting attention to clusters defined in axis-parallel subspaces. The term projected clustering will be used to refer to clustering techniques which attempt to find clusters in arbitrarily oriented subspaces. It is important to note that other authors have used “projected clustering” to refer to the subspace clustering above, and may refer to clustering within arbitrary subspaces as “correlation clustering”.

The most common approach to projected clustering uses Principal Component Analysis (PCA), either locally (on subsets of the data set) or globally, to determine subspaces within which data have high and low variability (Kriegel et al., 2009). ORCLUS (Aggarwal and Yu, 2000) is an extension of PROCLUS based on low-order (low-variability) PCA projections. Variations on this approach all use PCA on a local level.

The Principal Direction Divisive Partitioning algorithm (PDDP) (Boley, 1998) uses PCA iteratively within a divisive hierarchical procedure. First the entire data set is projected onto the first principal component (the univariate subspace in which the variability is maximised). The data are then split in two at their mean within this subspace. This process is then repeated recursively on the resulting subsets, selecting the next subset to be partitioned based on a heuristic measure of cohesion, called the scatter value. When the number of subsets reaches a chosen number the process terminates. This algorithm is not motivated by a particular cluster definition, but rather uses the reasoning that subspaces in which the data have high variability are likely to display high between-cluster variability, in which case partitioning at the projected mean is likely to separate clusters, rather than cut through them.
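
A single PDDP-style split can be sketched as below. The sketch covers only one split, not the recursive procedure or the scatter-value heuristic, and it approximates the first principal direction by power iteration, which is an implementation choice made here for self-containment. It assumes the data are not degenerate (non-zero variance).

```python
import math
import random

def first_principal_direction(data, iterations=200, seed=0):
    """Approximate the leading eigenvector of the sample covariance by power iteration."""
    n, dim = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(dim)]
    centred = [[row[j] - mean[j] for j in range(dim)] for row in data]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(dim)]
    for _ in range(iterations):
        # Multiply v by X^T X without forming the covariance matrix explicitly.
        scores = [sum(r[j] * v[j] for j in range(dim)) for r in centred]
        w = [sum(s * r[j] for s, r in zip(scores, centred)) for j in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return mean, v

def pddp_split(data):
    """Split data in two at the mean of its projection onto the first principal direction."""
    mean, v = first_principal_direction(data)
    left, right = [], []
    for row in data:
        # The projected mean is zero after centring, so the sign decides the side.
        score = sum((row[j] - mean[j]) * v[j] for j in range(len(v)))
        (left if score <= 0 else right).append(row)
    return left, right
```

When two compact groups are far apart along one direction, that direction dominates the variance and the split at the projected mean falls between them. Which group is returned as `left` depends on the arbitrary sign of the eigenvector.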

Two extensions to the PDDP algorithm were considered by Tasoulis et al. (2010). Both are motivated by density-based clustering, and the equivalent low-density separation assumption. The density-enhanced PDDP (dePDDP) algorithm projects a data set onto its first principal component, as in PDDP, and then uses a kernel density estimate of the projected data to find a low-density separator. It then splits the subset which induces the lowest-density separation based on its respective density estimate. The interval PDDP (iPDDP) method is similar, but rather than using a kernel estimate of the density it splits a data set at the largest gap between consecutive projected points, thereby separating by the largest-margin hyperplane orthogonal to the first principal component.
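
The two splitting rules differ only in how a split point is chosen from the one-dimensional projections. The sketch below shows one possible reading of each rule for a single subset. The Gaussian kernel, the fixed `bandwidth`, and the grid search for local minima are assumptions made for illustration and are not details taken from Tasoulis et al. (2010).

```python
import math

def largest_gap_split(projections):
    """iPDDP-style split point: midpoint of the widest gap between sorted values."""
    values = sorted(projections)
    gaps = [(b - a, (a + b) / 2) for a, b in zip(values, values[1:])]
    return max(gaps)[1]

def gaussian_kde(values, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    def density(t):
        return sum(math.exp(-0.5 * ((t - v) / bandwidth) ** 2) for v in values) / norm
    return density

def lowest_density_split(projections, bandwidth, grid_size=201):
    """dePDDP-style split point: the interior local minimum of the KDE with lowest density.

    Returns None when the estimate has no interior local minimum on the grid.
    """
    lo, hi = min(projections), max(projections)
    density = gaussian_kde(projections, bandwidth)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = [density(t) for t in grid]
    # Non-strict on the left so that a flat valley floor still yields a minimum.
    minima = [(dens[i], grid[i]) for i in range(1, grid_size - 1)
              if dens[i] <= dens[i - 1] and dens[i] < dens[i + 1]]
    return min(minima)[1] if minima else None
```

Returning `None` for a unimodal estimate reflects the low-density separation assumption: if the projected density has no valley, there is no low-density separator to split at.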

The generality offered by projected clustering over subspace clustering is clearly beneficial in many cases. PCA projections have been successfully applied in many areas; however, it is trivial to construct examples where PCA is inappropriate. Projection pursuit refers to a class of optimisation problems aimed at finding the most “interesting” subspace within a multivariate data set (Jones and Sibson, 1987). The interestingness of a data set within a given subspace is referred to as the projection index. The term projection pursuit is attributed to Friedman and Tukey (1974); however, an associated practice dates back to Kruskal (1969). By defining a projection index which is relevant to the ultimate task at hand, e.g. clustering, it is possible to overcome some of the shortcomings associated with using off-the-shelf dimension reduction techniques like PCA. While these off-the-shelf methods have been extremely useful in the modern era of data analysis, the subsequent task, while eased by the reduced size of the data, often remains a challenging problem. By performing dimension reduction in tandem with the corresponding analysis, the ultimate task can be made much easier. This may be of particular relevance in relation to clustering. In a theoretical study of the concept of clusterability, Ackerman and Ben-David (2009) observed that “Although most of the common clustering tasks are NP-hard, finding a close-to-optimal clustering for well clusterable data sets is easy (computationally)”.
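
One such example, and the effect of changing the projection index, can be sketched in two dimensions. The data set, the grid search over directions, and the use of negative kurtosis as a cluster-oriented index are all assumptions chosen for illustration; they are not drawn from the works cited above.

```python
import math
import random

def project(data, angle):
    """Project 2-D points onto the unit vector at the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return [x * c + y * s for x, y in data]

def variance(z):
    """Projection index whose maximiser is the first principal direction."""
    m = sum(z) / len(z)
    return sum((v - m) ** 2 for v in z) / len(z)

def negative_kurtosis(z):
    """Higher for flat or bimodal projections, lower for peaked or heavy-tailed ones."""
    m = sum(z) / len(z)
    fourth = sum((v - m) ** 4 for v in z) / len(z)
    return -fourth / variance(z) ** 2

def pursue(data, index, steps=180):
    """Grid search over directions in [0, pi) for the angle maximising the index."""
    angles = [math.pi * i / steps for i in range(steps)]
    return max(angles, key=lambda a: index(project(data, a)))

def make_data(n_per_cluster=200, seed=0):
    """Two clusters separated along y but with far larger spread along x."""
    rng = random.Random(seed)
    return [(rng.gauss(0, 5), rng.gauss(centre, 0.2))
            for centre in (-1.5, 1.5) for _ in range(n_per_cluster)]

if __name__ == "__main__":
    data = make_data()
    print("variance index:", math.degrees(pursue(data, variance)))
    print("kurtosis index:", math.degrees(pursue(data, negative_kurtosis)))
```

Under these assumptions the variance index selects a direction close to the x-axis, along which the two clusters overlap completely, whereas the kurtosis-based index selects a direction close to the y-axis, along which they separate.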
