rather objective and is extremely simplified by BOSS.

6.3 Summary and Discussion

In this chapter, we introduced some details regarding the implementation of BOSS. In addition, we outlined two sample applications of BOSS. The first is an application to visual data mining. We used BOSS for semi-automatic cluster analysis of a database of protein structures. BOSS significantly simplifies this procedure of extracting valuable knowledge. The second application of BOSS was to evaluate different similarity models for voxelized CAD data. We compared four different models using a database of car parts. Again, BOSS significantly simplifies this evaluation and allows the deduction of important hints for the usability of each model. The evaluation of the models using hierarchical clustering is much more objective than applying sample k-nn queries because all data objects are taken into account for the evaluation rather than some sample (random) data objects.

Part III

Adopting Density-Based Clustering to High Dimensional Data

Chapter 7

Clustering High Dimensional Data

Clustering high dimensional data is usually a difficult task. In fact, most traditional (“full dimensional”) clustering algorithms tend to break down when applied to high dimensional feature spaces. The reasons for this behavior are commonly summed up by the term curse of dimensionality and are worked out in Section 7.1. Since the importance of clustering high dimensional data is steadily increasing with new data generation capabilities, new approaches have been developed recently to address this problem. Section 7.2 provides a general classification of these approaches. Section 7.3 outlines two motivating examples and describes some data sets used as a test bed for evaluating the methods proposed in the next chapters.

7.1 The Curse of Dimensionality

In this section, we will explore some general properties of high dimensional feature spaces that have an impact on the performance of clustering algorithms. These phenomena are usually summed up by the term curse of dimensionality. Let us note that several other properties contribute to the curse of dimensionality; they are omitted from this section because they are less important in the context of this thesis.

Observation 7.1 The probability that points are located at the border of the data space increases with growing dimensionality.

The correctness of this observation can be made clear with the following considerations. If we assume a uniform distribution of the data points inside a hypercube with side length 1, i.e. $D \subseteq [0,1]^d$ (cf. Figure 7.1 left), the volume of such a data space is $1^d = 1$. The probability $P_{\mathrm{surface}}(r)$ that a point randomly taken from a uniform and independent distribution in a $d$-dimensional space has a distance of $r$ or below to the space boundary can be determined as given below:

$$P_{\mathrm{surface}}(r) = 1 - (1 - 2 \cdot r)^d.$$

As shown in Figure 7.1 (right), the probability that a point lies within a 10% border of the data space boundary increases rapidly with growing dimensionality. For d = 3 dimensions, Psurface(0.1) is already 0.488 (i.e. 48.8%) and reaches 0.965 (i.e. 96.5%) for d = 15 dimensions.
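The values quoted above can be reproduced directly from the formula for Psurface(r). The following is a minimal sketch, not part of the original work; the function name `p_surface` is chosen here purely for illustration.

```python
def p_surface(r: float, d: int) -> float:
    """Probability that a point drawn uniformly from [0,1]^d lies
    within distance r of the boundary of the data space."""
    if not 0.0 <= r <= 0.5:
        raise ValueError("r must lie in [0, 0.5]")
    if d < 1:
        raise ValueError("d must be a positive integer")
    # (1 - 2r)^d is the volume of the inner hypercube whose points
    # are further than r away from every face of the unit hypercube.
    return 1.0 - (1.0 - 2.0 * r) ** d


# Tabulate the 10% border probability for a few dimensionalities.
for d in (1, 2, 3, 5, 10, 15, 30):
    print(f"d = {d:2d}: P_surface(0.1) = {p_surface(0.1, d):.3f}")
```

For d = 3 and d = 15 this yields 0.488 and 0.965, respectively, matching the curve in Figure 7.1 (right).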

Observation 7.2 In high dimensional feature spaces, the ε-neighborhoods of the points will most likely exceed the boundaries of the data space.

Figure 7.1: Probability of a point near the data space boundary (left: unit data space with a border of width 0.1; right: Psurface(0.1) plotted against the dimension).

Due to Observation 7.1, the points tend to be located nearer to the boundaries of the data space with increasing dimensionality. As a consequence, the hypersphere of the ε-range query of these points, growing with each dimension, will exceed the boundaries of the data space. Since density-based clustering works on top of ε-neighborhoods, this observation may cause problems. If the points are located at the boundary of the data space, the ε-neighborhoods of these points are usually “smaller” because they exceed the boundaries of the data space, i.e. the part of the neighborhood that lies inside the data space shrinks, and the probability that it contains a certain number of points decreases.
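Observation 7.2 can be checked numerically under the same assumptions as Observation 7.1 (uniform data in the unit hypercube). The sketch below is an illustration added here, not an experiment from the original work: it estimates by simulation the fraction of points whose Euclidean ε-neighborhood crosses the boundary of the data space. Under these assumptions, the estimate should approach 1 − (1 − 2ε)^d. The function name, the sample size, and the seed are hypothetical choices.

```python
import random


def fraction_exceeding(eps: float, d: int, n: int = 20_000, seed: int = 0) -> float:
    """Estimate the fraction of uniform points in [0,1]^d whose
    eps-neighborhood (Euclidean ball) is not fully inside [0,1]^d."""
    rng = random.Random(seed)
    exceeding = 0
    for _ in range(n):
        point = [rng.random() for _ in range(d)]
        # The ball leaves the hypercube exactly when at least one
        # coordinate is closer than eps to the face at 0 or at 1.
        if any(x < eps or x > 1.0 - eps for x in point):
            exceeding += 1
    return exceeding / n


for d in (2, 5, 10, 20):
    estimate = fraction_exceeding(0.1, d)
    exact = 1.0 - (1.0 - 2.0 * 0.1) ** d
    print(f"d = {d:2d}: simulated = {estimate:.3f}, formula = {exact:.3f}")
```

The simulation only counts whether a neighborhood crosses the boundary; it does not measure how much of its volume is lost, which would require a separate estimate.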

The first two observations have an impact on the density-based clustering notion. However, the next observation challenges the entire idea of clustering in high dimensional feature spaces.

Observation 7.3 In high dimensional feature spaces, the furthest neighbor of a point is usually as far as the nearest neighbor.

In [HAK00], the authors show experimentally that with growing dimensionality the concept of density tends to become meaningless because the nearest and furthest neighbors of objects tend to become indistinguishable. Concepts like nearest neighbor or ε-neighborhood likewise tend to become meaningless in high dimensional spaces. The general consequence of this observation is that clustering makes no sense in high dimensional feature spaces because the data objects usually do not cluster any more but are sparsely distributed.
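The effect behind Observation 7.3 can be illustrated with a small simulation. This is a sketch under the assumption of uniformly distributed random data and the Euclidean distance; it is not the experimental setup of [HAK00]. It computes the relative contrast (d_max − d_min) / d_min between the furthest and the nearest neighbor of a random query point. The name `relative_contrast`, the sample size, and the seed are illustrative choices.

```python
import math
import random


def relative_contrast(d: int, n: int = 500, seed: int = 0) -> float:
    """Relative contrast (d_max - d_min) / d_min between the furthest
    and nearest of n uniform random points, seen from a random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = [
        math.dist(query, [rng.random() for _ in range(d)])
        for _ in range(n)
    ]
    nearest, furthest = min(dists), max(dists)
    return (furthest - nearest) / nearest


# The contrast shrinks as the dimensionality grows, i.e. the nearest
# and the furthest neighbor become harder to tell apart.
for d in (2, 10, 100, 1000):
    print(f"d = {d:4d}: relative contrast = {relative_contrast(d):.2f}")
```

A small relative contrast means that an ε-range query either returns almost no points or almost all of them, which is why the notion of an ε-neighborhood loses its discriminative power.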

Most clustering methods mentioned in Section 2.1 compute “full dimensional” clusters in a given feature space, i.e. each dimension of this feature space is equally weighted when computing the distance between points. These approaches are successful for low-dimensional feature spaces. However, in higher dimensional feature spaces, their accuracy and/or efficiency deteriorates significantly due to the curse of dimensionality, in particular due to Observation 7.3.
