6. RESULTS
6.3 CHI-SQUARE TEST FOR MEASURING AGREEMENT
The simplest way of clustering multi-represented objects is to select one representation Ri and cluster all objects according to this representation.
However, this approach restricts data analysis to a limited part of the available information and does not use the remaining representations to find a meaningful clustering. Another way to handle multi-represented objects is to combine the different representations and use a combined distance function. Then any established clustering algorithm can be applied. However, this approach has several drawbacks. First of all, the feature spaces of the different object representations might have various distance functions that are specialized to a certain kind of data, but often are not applicable to general data spaces. For example, for text objects the most established distance measure is the cosine distance, whereas trees and sequences are often compared by variants of the edit distance. Combining these different notions of similarity into one common distance function is difficult, since certain distance functions like the edit distance do not necessarily provide a finite range of values. Thus, a normalization to achieve comparability for each representation is a difficult task. Another problem of combined distance functions is the handling of missing representations. The part of the combined distance that relates to a missing representation has to be considered somehow. A common approach is to define some dummy value. However, the choice of such a dummy value might have a major influence on the distance and thus has to be considered carefully. Last but not least, the efficiency of processing ε-range queries strongly depends on the use of index structures and filters. Since these index structures also depend on the employed distance measures, building a common feature space usually prohibits the use of specialized index structures. Therefore, for combined data spaces only very general index structures like metric trees [CNBYM01] are applicable.
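The sensitivity to the dummy value can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the function name `combined_distance`, the choice of an unnormalized sum as the combination rule, the use of `None` to mark a missing representation, and the toy one-dimensional data. It only shows that two plausible dummy values can reverse which object appears closer.

```python
from typing import Callable, Optional, Sequence

Rep = Optional[float]  # None marks a missing representation (assumption)


def combined_distance(
    a: Sequence[Rep],
    b: Sequence[Rep],
    dists: Sequence[Callable[[float, float], float]],
    dummy: float,
) -> float:
    """Sum the per-representation distances (hypothetical combination rule).

    Whenever a representation is missing in either object, the dummy
    value is used in place of the real distance.
    """
    total = 0.0
    for rep_a, rep_b, dist in zip(a, b, dists):
        if rep_a is None or rep_b is None:
            total += dummy
        else:
            total += dist(rep_a, rep_b)
    return total


def abs_diff(x: float, y: float) -> float:
    return abs(x - y)


dists = [abs_diff, abs_diff]
o = (0.0, 2.0)
p = (1.0, None)  # second representation is missing
q = (0.5, 3.0)

for dummy in (0.0, 1.0):
    d_op = combined_distance(o, p, dists, dummy)
    d_oq = combined_distance(o, q, dists, dummy)
    nearer = "p" if d_op < d_oq else "q"
    print(f"dummy={dummy}: d(o,p)={d_op}, d(o,q)={d_oq}, nearer={nearer}")
```

With a dummy value of 0.0 the incomplete object p is ranked closer to o than q; with a dummy value of 1.0 the order flips, although the available data did not change.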
The idea of our approach is to combine the information of all different representations as early as possible, i.e. during the run of the clustering algorithm, and as late as necessary, i.e. after using the different distance functions of each representation. To do so, we adapt the core object property proposed for DBSCAN. To decide whether an object is a core object, we use the local ε-neighborhoods of each representation and combine the results into a global neighborhood. Therefore, we must adapt the predicate direct density-reachability proposed for DBSCAN. In the next two subsections, we will show how we can use the concepts of union and intersection of local neighborhoods to handle multi-represented objects.
8.3.3 Union of Different Representations
This variant is especially useful for sparse data. In this setting, the clusterings in each single representation will provide several small clusters and a large amount of noise. Simply enlarging ε would relieve the problem, but the separation of the clusters would suffer. The union method assigns objects to the same cluster if they are similar in at least one of the representations. Thus, it keeps up the separation of local clusters, but still overcomes the sparsity. If the object is placed in a dense area of at least one representation, it is still a core object regardless of how many other representations are missing. Thus, we do not need to define dummy values. Figure 8.2 illustrates the basic idea.

Figure 8.2: Union method: local clusters and a noise object are aggregated to a multi-represented cluster C.
We adapt some of the definitions of DBSCAN to capture our new notion of clusters. To decide whether an object o is a union core object, we unite all local εi-neighborhoods and check whether there are enough objects in the global neighborhood, i.e. whether the global neighborhood of o is dense.

Definition 8.2 (union core object)
Let $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m \in \mathbb{R}_0^+$, $k \in \mathbb{N}$. An object $o \in DB$ is called union core object, denoted by $\mathrm{CoreU}^k_{\varepsilon_1,\ldots,\varepsilon_m}(o)$, if the union of all local ε-neighborhoods contains at least $k$ objects, formally:
\[
\mathrm{CoreU}^k_{\varepsilon_1,\ldots,\varepsilon_m}(o) \Leftrightarrow \Bigl|\, \bigcup_{R_i(o) \in o} N^{R_i}_{\varepsilon_i}(o) \Bigr| \geq k.
\]
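The union core object test can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the database is assumed to be a dictionary mapping object IDs to tuples of representations, `None` is assumed to mark a missing representation, neighborhoods are found by a linear scan, and the names `local_neighborhood` and `is_union_core` are hypothetical.

```python
from typing import Callable, Dict, Hashable, Optional, Sequence, Set, Tuple

# Assumed layout: object id -> tuple of m representations (None = missing).
Database = Dict[Hashable, Tuple[Optional[float], ...]]
Dist = Callable[[float, float], float]


def local_neighborhood(
    db: Database, o: Hashable, i: int, eps_i: float, dist_i: Dist
) -> Set[Hashable]:
    """IDs of all objects within eps_i of o in representation i.

    Returns the empty set if o has no i-th representation.
    """
    rep_o = db[o][i]
    if rep_o is None:
        return set()
    return {
        p
        for p, reps in db.items()
        if reps[i] is not None and dist_i(rep_o, reps[i]) <= eps_i
    }


def is_union_core(
    db: Database,
    o: Hashable,
    eps: Sequence[float],
    dists: Sequence[Dist],
    k: int,
) -> bool:
    """True if the union of all local neighborhoods of o has at least k objects."""
    union: Set[Hashable] = set()
    for i, (eps_i, dist_i) in enumerate(zip(eps, dists)):
        union |= local_neighborhood(db, o, i, eps_i, dist_i)
    return len(union) >= k


def abs_diff(x: float, y: float) -> float:
    return abs(x - y)


# Toy data: two numeric representations per object.
db: Database = {
    "a": (0.0, None),
    "b": (0.4, 10.0),
    "c": (None, 10.3),
    "d": (5.0, 20.0),
}
eps = (0.5, 0.5)
dists = (abs_diff, abs_diff)

print(is_union_core(db, "b", eps, dists, k=3))
print(is_union_core(db, "a", eps, dists, k=3))
```

In this toy data, object b has only two objects in each local neighborhood, but the union of both neighborhoods contains a, b and c, so b qualifies for k = 3. Object a lacks its second representation and simply contributes an empty neighborhood there; no dummy value is involved.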
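Definition 8.3 (direct union-reachability)
Let $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m \in \mathbb{R}_0^+$, $k \in \mathbb{N}$. An object $p \in DB$ is directly union-reachable from $q \in DB$ if $q$ is a union core object and $p$ is an element of at least one local $N^{R_i}_{\varepsilon_i}(q)$, formally:
\[
\mathrm{DirReachU}^k_{\varepsilon_1,\ldots,\varepsilon_m}(q, p) \Leftrightarrow \mathrm{CoreU}^k_{\varepsilon_1,\ldots,\varepsilon_m}(q) \wedge \exists\, i \in \{1, \ldots, m\} : R_i(p) \in N^{R_i}_{\varepsilon_i}(q).
\]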
The predicate direct union-reachability is symmetric for pairs of core objects if all dist_i are symmetric distance functions, as required.
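The restriction of symmetry to pairs of core objects can be made concrete with a short sketch. It assumes that the local neighborhoods have already been computed and are stored per object as a list of sets, one set per representation; this layout and the function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, List, Set

# Assumed layout: object id -> [N^{R_1}(o), ..., N^{R_m}(o)] as sets of ids.
Neighborhoods = Dict[Hashable, List[Set[Hashable]]]


def is_union_core(q: Hashable, nbh: Neighborhoods, k: int) -> bool:
    """True if the union of q's local neighborhoods has at least k objects."""
    return len(set().union(*nbh[q])) >= k


def directly_union_reachable(
    q: Hashable, p: Hashable, nbh: Neighborhoods, k: int
) -> bool:
    """True if q is a union core object and p lies in at least one
    local neighborhood of q."""
    return is_union_core(q, nbh, k) and any(p in n_i for n_i in nbh[q])


# Toy neighborhoods for m = 2 representations.
nbh: Neighborhoods = {
    "a": [{"a", "b"}, set()],        # second representation missing
    "b": [{"a", "b"}, {"b", "c"}],
    "c": [set(), {"b", "c"}],        # first representation missing
}

print(directly_union_reachable("b", "a", nbh, k=3))
print(directly_union_reachable("a", "b", nbh, k=3))
```

Here a is directly union-reachable from the core object b, but b is not directly union-reachable from a, because a is not a union core object for k = 3. The relation is therefore only guaranteed to be symmetric when both objects are core objects.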
Figure 8.3: Intersection method: a local clustering is divided into the clusters C1 and C2.
Union-reachability and union-connectivity can be defined analogously to the original DBSCAN. A union-connected cluster is then defined as a set of union-connected objects which is maximal w.r.t. union-reachability. Thus, given the parameters ε1, ..., εm and k, we can discover a union-connected cluster in a two-step approach. First, we choose an arbitrary database object o satisfying the union core object property. Second, we retrieve all objects that are union-reachable from o, thereby obtaining the cluster containing o.
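The two-step approach can be sketched as a breadth-first expansion from a seed object. This is an illustrative sketch under assumptions, not the authors' implementation: the database layout (object ID mapped to a tuple of representations, `None` for a missing one), the linear-scan range queries, and the names `union_neighborhood` and `expand_union_cluster` are all hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Optional, Sequence, Set, Tuple

Database = Dict[Hashable, Tuple[Optional[float], ...]]
Dist = Callable[[float, float], float]


def union_neighborhood(
    db: Database, o: Hashable, eps: Sequence[float], dists: Sequence[Dist]
) -> Set[Hashable]:
    """Union of the local eps_i-neighborhoods of o over all representations."""
    result: Set[Hashable] = set()
    for i, (eps_i, dist_i) in enumerate(zip(eps, dists)):
        rep_o = db[o][i]
        if rep_o is None:
            continue  # a missing representation contributes nothing
        result |= {
            p
            for p, reps in db.items()
            if reps[i] is not None and dist_i(rep_o, reps[i]) <= eps_i
        }
    return result


def expand_union_cluster(
    db: Database,
    seed: Hashable,
    eps: Sequence[float],
    dists: Sequence[Dist],
    k: int,
) -> Set[Hashable]:
    """Return all objects union-reachable from seed, or an empty set
    if seed is not a union core object."""
    # Step 1: the seed must satisfy the union core object property.
    if len(union_neighborhood(db, seed, eps, dists)) < k:
        return set()
    # Step 2: collect everything that is union-reachable from the seed.
    cluster = {seed}
    queue = deque([seed])
    while queue:
        q = queue.popleft()
        neighbors = union_neighborhood(db, q, eps, dists)
        if len(neighbors) < k:
            continue  # q is a border object; it does not expand the cluster
        for p in neighbors - cluster:
            cluster.add(p)
            queue.append(p)
    return cluster


def abs_diff(x: float, y: float) -> float:
    return abs(x - y)


db: Database = {
    "a": (0.0, None),
    "b": (0.4, 10.0),
    "c": (None, 10.3),
    "d": (5.0, 20.0),
}
eps = (0.5, 0.5)
dists = (abs_diff, abs_diff)

print(sorted(expand_union_cluster(db, "b", eps, dists, k=3)))
print(sorted(expand_union_cluster(db, "d", eps, dists, k=3)))
```

Starting from the core object b, the expansion collects a through the first representation and c through the second, while d remains outside the cluster. A full clustering run would repeat this for every not yet assigned core object and label the remaining objects as noise.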