
most extreme case, the relationship might not be the same everywhere in the data set. This

makes anything but the simplest cases of intrinsic dimensionality hard to grasp analytically.

Roughly defined, the “intrinsic dimensionality” is the number of variables needed to represent the data. However, this definition is not usable in practice. Without additional constraints on the mapping function, any data set could be considered one-dimensional, either as a function of a unique object ID or by mapping each object to its closest position on a space-filling curve.³ Furthermore, for any practical use, one will want to allow an application-dependent amount of error in this representation.
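The space-filling-curve argument can be made concrete with a small sketch. The Z-order (Morton) curve is used here purely as an illustration; the text does not name a specific curve, and the function names and the restriction to non-negative integer grid coordinates are assumptions of this example.

```python
def morton_encode(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a single integer key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bit i of x goes to an even position
        key |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y goes to an odd position
    return key


def morton_decode(key, bits=16):
    """Invert morton_encode: recover (x, y) from the interleaved key."""
    x = y = 0
    for i in range(bits):
        x |= ((key >> (2 * i)) & 1) << i
        y |= ((key >> (2 * i + 1)) & 1) << i
    return x, y


point = (3, 5)
key = morton_encode(*point)          # one "coordinate" for a 2-D point
recovered = morton_decode(key)       # the mapping loses no information
```

The round trip is lossless, so in the unconstrained sense the two-dimensional point is "one-dimensional". This is exactly why the naive definition is of little practical use: the single key reveals nothing about the structure of the data.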

How to practically measure or use the intrinsic dimensionality remains an ongoing question.

There exist different definitions and measures to estimate a global or local intrinsic dimensionality, such as the expansion dimension [dCH10] and the generalized expansion dimension [HKN12]. Yet it has been noticed [HS05; dCH10; Hou+12] that many of the problems ascribed to the curse of dimensionality can be handled with appropriate approaches when the intrinsic dimensionality is low. As hinted at by Equation 4.2, the concentration of distances is expected to be governed by the intrinsic dimensionality. The Johnson-Lindenstrauss [JL84] error bounds of random projections [Ach01] can likewise be expected to depend on the intrinsic dimensionality rather than on the technical data set dimensionality. Other problems, however, will remain, such as the combinatorial explosion for grid-based approaches or the bias problems of approaches that test excessive combinations of subspaces, since these approaches do not exploit correlations in the data set that may reduce the effective dimensionality.
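The dependence of distance concentration on the intrinsic rather than the technical dimensionality can be probed with a small experiment. This is only a sketch under stated assumptions: Equation 4.2 is not reproduced in this section, so the relative contrast (d_max − d_min) / d_min is used as a stand-in measure, and the data generator `sample` is hypothetical.

```python
import random
from math import dist


def sample(n, latent_dim, ambient_dim, rng):
    """Draw n points with latent_dim degrees of freedom in ambient_dim coordinates.

    Hypothetical generator: uniform latent coordinates are pushed through a
    fixed random linear map, so the data lie in a latent_dim-dimensional
    subspace of the ambient space.
    """
    basis = [[rng.gauss(0.0, 1.0) for _ in range(ambient_dim)]
             for _ in range(latent_dim)]
    points = []
    for _ in range(n):
        z = [rng.random() for _ in range(latent_dim)]
        points.append([sum(z[k] * basis[k][a] for k in range(latent_dim))
                       for a in range(ambient_dim)])
    return points


def relative_contrast(data, query):
    """(d_max - d_min) / d_min over the distances from query to data."""
    d = [dist(query, p) for p in data]
    return (max(d) - min(d)) / min(d)


rng = random.Random(0)
low = sample(301, 2, 50, rng)    # 50 coordinates, 2 degrees of freedom
high = sample(301, 50, 50, rng)  # 50 coordinates, 50 degrees of freedom

# Use the first point of each set as the query and the rest as the data.
contrast_low = relative_contrast(low[1:], low[0])
contrast_high = relative_contrast(high[1:], high[0])
print(contrast_low, contrast_high)
```

Both data sets have 50 technical dimensions, yet the set with two degrees of freedom should retain a much larger contrast between its nearest and farthest neighbor. A single query on synthetic data is of course only an illustration, not an estimate of intrinsic dimensionality.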

4.4 Shared Nearest Neighbors (SNN)

The material discussed in this section is a condensed version of the following publications:

M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. “Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?” In: Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany. 2010, pp. 482–500. doi:10.1007/978-3-642-13818-8_34

T. Bernecker, M. E. Houle, H.-P. Kriegel, P. Kröger, M. Renz, E. Schubert, and A. Zimek. “Quality of Similarity Rankings in Time Series”. In: Proceedings of the 12th International Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN. 2011, pp. 422–440. doi:10.1007/978-3-642-22922-0_25

An interesting alternative to traditional similarity measurement is the definition of second-order distance measures based on the rankings induced by a primary similarity measure (such as an $L_p$-norm, cosine similarity, or a domain-specific distance). The simplest and most common of these methods uses shared nearest neighbor (SNN) information, in which the similarity value for an object pair $(x, y)$ is a function of the number of data objects in the intersection of the fixed-size neighborhoods centered at $x$ and $y$, as determined by the primary measure. The primary similarity measure can be any function that determines a ranking of the data objects relative to the query. The data objects need not be represented as vectors; for example, rankings obtained by a text search function can also be used as the primary similarity measure.

³ Assuming floating point precision, any point can actually be represented with a one-dimensional coordinate on a space-filling curve, since there are no irrational floating point numbers.

The most basic form of shared nearest neighbor similarity measure is the “overlap”. Given a data set $S$ consisting of $n = |S|$ objects and $s \in \mathbb{N}^+$, let $\mathrm{kNN}_s(x) \subseteq S$ be the set of $s$ nearest neighbors of $x \in S$ as determined using some specified primary similarity measure. The overlap between objects $x$ and $y$ is defined to be the intersection size

$$\mathrm{SNN}_s(x, y) = |\mathrm{kNN}_s(x) \cap \mathrm{kNN}_s(y)|. \tag{4.3}$$
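As a minimal sketch, the overlap of Equation 4.3 can be computed as follows. The helper names, the brute-force neighbor search, the choice of Euclidean distance as the default primary measure, and the exclusion of the query object from its own neighborhood are all assumptions of this example; the definition above leaves these points open.

```python
from math import dist


def knn(data, i, s, dissimilarity=dist):
    """Indices of the s nearest neighbors of data[i] under the primary measure.

    Assumptions: the query itself is excluded, and ties are broken by index.
    Any callable that ranks objects relative to the query can be passed in.
    """
    others = [j for j in range(len(data)) if j != i]
    others.sort(key=lambda j: (dissimilarity(data[i], data[j]), j))
    return set(others[:s])


def snn_overlap(data, i, j, s, dissimilarity=dist):
    """SNN_s(x, y) = |kNN_s(x) ∩ kNN_s(y)| for x = data[i], y = data[j]."""
    return len(knn(data, i, s, dissimilarity) & knn(data, j, s, dissimilarity))


# Two well-separated groups of points (toy data for illustration only).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

same_group = snn_overlap(points, 0, 3, s=2)   # both neighborhoods are {1, 2}
cross_group = snn_overlap(points, 0, 4, s=2)  # neighborhoods are disjoint
```

Because `dissimilarity` is only used to rank candidates, it can be replaced by any domain-specific function, which reflects the observation that the objects need not be vectors.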

Based on this definition of overlap, other shared-nearest-neighbor-based similarity measures such as the cosine measure have been proposed:

$$\mathrm{simcos}_s(x, y) = \frac{\mathrm{SNN}_s(x, y)}{s}, \tag{4.4}$$

so called since it is equivalent to the cosine of the angle between the zero-one set membership vectors for $\mathrm{kNN}_s(x)$ and $\mathrm{kNN}_s(y)$. This was used in [ESK03; Hou03] as a local density measure for clustering. An alternative similarity definition based on shared nearest neighbors is set correlation, given as:

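$$\mathrm{simcorr}_s(x, y) = \frac{n}{n-s}\left(\frac{\mathrm{SNN}_s(x, y)}{s} - \frac{s}{n}\right), \tag{4.5}$$

which results when the standard Pearson correlation formula

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}}$$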

is applied using the coordinates of the characteristic vectors of $\mathrm{kNN}_s(x)$ and $\mathrm{kNN}_s(y)$ as variable pairs. Objects of $S$ that appear in both $\mathrm{kNN}_s(x)$ and $\mathrm{kNN}_s(y)$, or in neither $\mathrm{kNN}_s(x)$ nor $\mathrm{kNN}_s(y)$, support the correlation of the two neighborhoods (and by extension the similarity of $x$ and $y$); those objects that appear in one neighborhood but not the other detract from the correlation. Note that the set correlation value tends to the cosine measure when $s/n$ tends to zero.

Set correlation was introduced in [Hou08] for the purpose of assessing the quality of cluster

candidates, as well as ranking the cluster objects according to their relevance (or centrality) to

the cluster.
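The relationship between Equations 4.4 and 4.5 and their vector counterparts can be checked numerically. The sketch below takes the two neighborhoods as precomputed sets of object indices; the function names and the toy neighborhoods are assumptions made for illustration.

```python
from math import sqrt


def simcos(nx, ny):
    """Equation 4.4: overlap divided by the neighborhood size s."""
    s = len(nx)
    return len(nx & ny) / s


def simcorr(nx, ny, n):
    """Equation 4.5: set correlation for neighborhoods of size s in a data set of size n."""
    s = len(nx)
    return n / (n - s) * (len(nx & ny) / s - s / n)


def characteristic(neighbors, n):
    """Zero-one membership vector of a neighborhood over object indices 0..n-1."""
    return [1 if i in neighbors else 0 for i in range(n)]


def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def pearson(x, y):
    """Standard Pearson correlation, written as in the formula above."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum(a * b for a, b in zip(x, y)) - n * mx * my
    den = sqrt((sum(a * a for a in x) - n * mx * mx)
               * (sum(b * b for b in y) - n * my * my))
    return num / den


n = 10
nx, ny = {1, 2, 3}, {2, 3, 4}          # toy neighborhoods with s = 3, overlap 2
vx, vy = characteristic(nx, n), characteristic(ny, n)

print(simcos(nx, ny), cosine(vx, vy))      # the two values should agree
print(simcorr(nx, ny, n), pearson(vx, vy)) # the two values should agree
```

With the same neighborhoods and a much larger `n`, `simcorr` moves toward `simcos`, in line with the remark about $s/n$ tending to zero.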

A common variant of shared nearest neighbor similarity uses kNN sparsification, where two objects are required to be in each other's kNN in order to be considered for similarity. This can

