In this section, we define the high-density clusters and low-density separators that we aim to locate throughout the main body of this thesis. We define high-density clusters based on the estimated density over $\mathcal{X}$, $\hat{p}_x$, by adapting the definition in Hartigan (1975) as follows:
Definition 1. [High-density clusters] (Hartigan, 1975) Let $\mathcal{X} = \{x_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$, be a set of realisations of a random variable $X$ with estimated probability density function $\hat{p}_x$. High-density clusters are defined as maximally connected subsets of the level sets of $\hat{p}_x$,
\[
L(c; \hat{p}_x) = \left\{ x \in \mathbb{R}^d \mid \hat{p}_x(x) \geqslant c \right\}, \quad c \geqslant 0.
\]
structure exists. If $\hat{p}_x$ is multi-modal, $L(c; \hat{p}_x)$ may or may not be connected, depending on the value of $c$. If it is disconnected, it is formed by two or more connected components, which correspond to regions surrounding the modes of $\hat{p}_x$ (Menardi and Azzalini, 2014). A direct consequence of defining clusters as observations that lie in contiguous regions of high density in $\hat{p}_x$ is that cluster boundaries pass through regions of low density. Therefore, we define a low-density separator according to Definition 2.
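Definition 2. [Low-density separator] For a connected set $S \subset \mathbb{R}^d$, the surface of $S$, $\partial S$, is a low-density separator if $\exists\, c \geqslant 0$ for which the following hold:
1. there exist distinct components $C_1$, $C_2$ of $L(c; \hat{p}_x)$ s.t. $C_1 \subset S$, $C_2 \cap S = \emptyset$;
2. $\max_{x \in \partial S} \hat{p}_x(x) \leqslant c$.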
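Definitions 1 and 2 can be illustrated numerically in one dimension. The following is a minimal sketch, not the procedure used in this thesis: it assumes a Gaussian kernel density estimate for $\hat{p}_x$, approximates $\mathbb{R}$ by a finite grid, and uses a hypothetical two-mode sample. The threshold `c` and the candidate boundary `b` are illustrative choices. In one dimension the connected set $S = (-\infty, b]$ has the single boundary point $\partial S = \{b\}$, so condition 2 reduces to checking $\hat{p}_x(b) \leqslant c$.

```python
import numpy as np
from scipy.ndimage import label
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical 1-D sample with two well-separated modes.
X = np.concatenate([rng.normal(-3.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])

p_hat = gaussian_kde(X)               # kernel estimate of the density
grid = np.linspace(-6.0, 6.0, 1201)   # finite grid standing in for R
dens = p_hat(grid)

def level_set_components(c):
    """Count connected components of {x : p_hat(x) >= c} on the grid."""
    _, n_components = label(dens >= c)
    return n_components

c = 0.05   # illustrative density level
b = 0.0    # candidate separator: boundary of S = (-inf, b]

n_clusters = level_set_components(c)      # components of L(c; p_hat)
is_separator = p_hat([b])[0] <= c         # condition 2 of Definition 2
print(n_clusters, is_separator)
```

Lowering `c` far enough merges the two components into one, which mirrors the remark that the connectedness of the level set depends on the value of $c$.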
If $\mathcal{X}$ contains a family of high-density clusters, then a collection of low-density separators can identify all of these clusters. However, the evaluation of the density along a cluster separator is computationally intractable for separators of arbitrary shape. Therefore, we must restrict attention to linear separators (hyperplanes) that partition dense, linearly separable sets, as defined in Definition 3.
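Definition 3. [Dense linearly separable sets] Let $\mathcal{X} = \{x_i\}_{i=1}^{n}$ be a set of realisations of a random variable $X$ with estimated probability density function $\hat{p}_x$. A family $C_1, \ldots, C_k$ of mutually disjoint subsets of $\mathcal{X}$ is dense and linearly separable if there exists $c > 0$ such that, for any $x_i, x_j \in C_m$, $m \in \{1, \ldots, k\}$,
\[
\min_{t \in [0,1]} \hat{p}_x\left( t x_i + (1-t) x_j \right) > c. \tag{2.9}
\]
Moreover, there exists $I$ with $\emptyset \neq I \subsetneq \{1, \ldots, k\}$ such that
\[
\operatorname{conv}\left( \cup_{i \in I} C_i \right) \cap \operatorname{conv}\left( \cup_{j \in I^C} C_j \right) = \emptyset, \tag{2.10}
\]
and for any $x_i \in \operatorname{conv}\left( \cup_{i \in I} C_i \right)$ and $x_j \in \operatorname{conv}\left( \cup_{j \in I^C} C_j \right)$,
\[
\max_{t \in [0,1]} \hat{p}_x\left( t x_i + (1-t) x_j \right) < c, \tag{2.11}
\]
where $I^C = \{1, \ldots, k\} \setminus I$ is the complement of $I$, and $\operatorname{conv}(\cdot)$ denotes the convex hull.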
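The segment condition (2.9) can be checked numerically by evaluating the estimated density along the line segment joining two points. The sketch below is illustrative only: it assumes a Gaussian kernel density estimate, a hypothetical two-group sample in $\mathbb{R}^2$, and an arbitrary level `c`, and it inspects a single pair of points rather than all pairs in a cluster. The helper `most_central` is introduced here purely to pick representative sample points. For contrast, the same quantity is computed for a segment joining the two groups, where the density dips well below `c`.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Hypothetical 2-D sample: two compact groups centred at (-3, 0) and (3, 0).
A = rng.normal([-3.0, 0.0], 0.5, size=(150, 2))
B = rng.normal([3.0, 0.0], 0.5, size=(150, 2))
p_hat = gaussian_kde(np.vstack([A, B]).T)   # gaussian_kde expects shape (d, n)

def density_along_segment(x_i, x_j, n_steps=201):
    """Evaluate p_hat at t*x_i + (1-t)*x_j for t on a grid over [0, 1]."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    points = t * x_i + (1.0 - t) * x_j      # shape (n_steps, 2)
    return p_hat(points.T)

def most_central(points, k):
    """Return the k sample points closest to the sample mean (illustrative helper)."""
    dist = np.linalg.norm(points - points.mean(axis=0), axis=1)
    return points[np.argsort(dist)[:k]]

a1, a2 = most_central(A, 2)   # two points from the first group
b1 = most_central(B, 1)[0]    # one point from the second group

c = 0.05                      # illustrative density level
within_min = density_along_segment(a1, a2).min()   # stays above c
across_min = density_along_segment(a1, b1).min()   # falls below c between groups
print(within_min > c, across_min < c)
```

A full check of Definition 3 would repeat the first computation for every pair within each cluster and would also need to verify (2.10) and (2.11), which this sketch does not attempt.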
As a consequence of applying Definition 3, the family of clusters in $\mathcal{X}$ is linearly separable if there exists a hyperplane along which the maximum value of $\hat{p}_x$ is at most $c$, and which also separates at least one cluster from the rest of the data. This definition results in the sets $C_1, \ldots, C_k$ corresponding to dense clusters, as defined in Definition 1, with the additional constraint of convexity. We further define the set $\mathcal{X}$ to be dense and linearly clusterable (with respect to the density estimator $\hat{p}_x$) if it contains a family of convex dense clusters, $C_1, \ldots, C_k$, such that any (non-trivial) subset of this family is linearly separable.
Definition 4. [Dense linearly clusterable sets] Let $\mathcal{X} = \{x_i\}_{i=1}^{n}$ be a set of realisations of the random variable $X$ with estimated probability density function $\hat{p}_x$. A family $C_1, \ldots, C_k$ of mutually disjoint subsets of $\mathcal{X}$ is dense and linearly clusterable if, for any subset $I \subsetneq \{1, \ldots, k\}$ satisfying $|I| > 1$, the family $\{C_i\}_{i \in I}$ is dense and linearly separable.
3 Continuous Representations of Mixed Data
Abstract
We consider the problem of locating clusters in datasets with diverse (mixed) attributes. A number of approaches to clustering, including density-based algorithms, require a set of continuous observations to correctly identify the clustering structure present. Therefore, we consider the production of a continuous representation of mixed datasets, upon which clustering may be performed. We apply three continuous representations across simulated and real-world datasets with varying characteristics, and evaluate the clustering performance of projective density-based and other well-established clustering algorithms over these representations. We find that locating an appropriate continuous representation can be challenging but, in general, the most consistently high-quality results were obtained using the continuous representation from constant shift embedding (Roth et al., 2003).
3.1 Introduction
Although there is no single, universally adopted definition of a cluster, the vast majority of approaches to clustering rely on spatial separation in some way to define the clusters present. Consequently, many clustering algorithms rely on the set of observations $\mathcal{X} = \{x_i\}_{i=1}^{n}$ having continuous attributes. However, many datasets contain observations with diverse types of features (mixed data). In this case, defining similarity solely on spatial separation induces maximal similarity between observations with the same set of outcomes in discrete dimensions. This is not meaningful for the detection of the true clustering structure, since each possible combination of outcomes in the discrete dimensions of $\mathcal{X}$ appears as an individual cluster. There exist general distance metrics, for example the Gower distance metric (Gower, 1971), which may be used in place of metrics such as Euclidean distance. These allow the construction of a more meaningful dissimilarity matrix for mixed data. Thus, the application of clustering algorithms that can operate on pairwise dissimilarities alone is still possible. However, this is not sufficient for the density-based approach to clustering, which requires a set of continuous observations in order to define a continuous estimated probability density function, in which subsets of observations in contiguous regions of high probability density are associated with clusters. Another, related, challenge associated with density-based clustering is that density estimation becomes unreliable in even moderate dimensions by modern standards (Rinaldo and Wasserman, 2010). This means that for the practical application of density-based clustering techniques, dimensionality reduction becomes a necessity. However, it is not clear how to specify appropriate projections for clustering in the case of non-continuous observations.
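For reference, the Gower dissimilarity mentioned above can be sketched in a few lines. This is a minimal illustration with equal feature weights, assuming only continuous and categorical (nominal) attributes; the function name `gower_distance` and the toy data are hypothetical and are not taken from this thesis. Continuous features contribute the absolute difference scaled by the feature range, categorical features contribute 0 for a match and 1 otherwise, and the contributions are averaged over all features.

```python
import numpy as np

def gower_distance(numeric, categorical):
    """Pairwise Gower dissimilarities with equal feature weights.

    numeric:     (n, p) array of continuous features
    categorical: (n, q) array of category labels
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical)
    p = numeric.shape[1]
    q = categorical.shape[1]

    # Continuous features: absolute difference scaled by the feature range.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0   # a constant feature contributes zero distance
    num_part = np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges

    # Categorical features: 0 if the labels match, 1 otherwise.
    cat_part = (categorical[:, None, :] != categorical[None, :, :]).astype(float)

    return (num_part.sum(axis=2) + cat_part.sum(axis=2)) / (p + q)

# Hypothetical toy data: one continuous and one categorical attribute.
ages = [[20.0], [30.0], [40.0]]
colours = [["red"], ["red"], ["blue"]]
D = gower_distance(ages, colours)
print(D)
```

The result is an $n \times n$ dissimilarity matrix rather than a set of coordinates, which is why a further embedding step is needed before a density can be estimated.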
These two limitations mean that to apply density-based clustering to mixed datasets, it is necessary to transform the original observations to obtain a continuous representation, upon which clustering can be performed. In this chapter, we investigate different continuous representations of mixed datasets, and their appropriateness for cluster detection. We quantify the quality of the continuous representations by the clustering performance of projective density-based algorithms (which are the focus of this thesis) and alternative algorithms, which also require a set of continuous observations.
The only work we are aware of that discusses the problem of finding an appropriate continuous representation of mixed data for density-based clustering is Azzalini and Menardi (2016). That work employs multi-dimensional scaling (MDS) (Borg and Groenen, 2005) and then locates clusters by constructing an estimated density over all dimensions of the transformed data. For this reason, it is limited to using a small number of dimensions in the continuous representation. Since we focus on projective density-based methods, which remain applicable for high-dimensional applications, we remove this restriction and allow the continuous representations used to have higher dimensionality. We further extend this work by also investigating the continuous representations produced by mixed probabilistic principal component analysis (mPPCA) (Khan et al., 2010), and constant shift embedding (CSE) (Roth et al., 2003).
Like standard probabilistic principal component analysis (PPCA) (Tipping and Bishop, 1999), mPPCA assumes that observations originate from a Gaussian latent variable model. Each categorical variable is assumed to be sampled from a multinomial distribution, with probabilities given by a multinomial logistic regression function applied to the latent variable. Both CSE and MDS make no assumptions about the data generating process, and rely exclusively on pairwise dissimilarities, defined by a metric which is appropriate for non-continuous data. In all our work, we use the Gower distance metric. MDS aims to produce a continuous representation which retains the pairwise distances. Meanwhile, CSE seeks a continuous representation upon which $k$-means clustering is guaranteed to assign all observations to the same clusters as pairwise clustering on the dissimilarity matrix.
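The constant shift idea can be sketched as follows. This is a minimal illustration of one common formulation, and whether it matches the implementation used in this chapter is an assumption: the dissimilarity matrix $D$ is double-centred, a constant equal to $-2$ times the smallest eigenvalue of the centred matrix is added to every off-diagonal entry, and coordinates are read off from the eigendecomposition of the centred, shifted matrix. The function names and the toy matrix are hypothetical. The coordinates reproduce the shifted dissimilarities as squared Euclidean distances, which is the property that lets $k$-means act on the embedding.

```python
import numpy as np

def centre(M):
    """Double-centre M, i.e. compute Q M Q with Q = I - (1/n) 1 1^T."""
    n = M.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    return Q @ M @ Q

def constant_shift_embedding(D, n_dims):
    """Embed a symmetric, zero-diagonal dissimilarity matrix D (sketch)."""
    n = D.shape[0]
    S_c = -0.5 * centre(D)
    # S_c always has a zero eigenvalue, so the shift below is non-negative.
    shift = -2.0 * np.linalg.eigvalsh(S_c).min()
    D_shift = D + shift * (1.0 - np.eye(n))      # off-diagonal constant shift
    eigval, eigvec = np.linalg.eigh(-0.5 * centre(D_shift))
    order = np.argsort(eigval)[::-1][:n_dims]    # largest eigenvalues first
    scale = np.sqrt(np.clip(eigval[order], 0.0, None))
    return eigvec[:, order] * scale, D_shift

# Hypothetical dissimilarities that violate the triangle inequality.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])
coords, D_shift = constant_shift_embedding(D, n_dims=2)

# Squared Euclidean distances between the embedded points.
sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
print(np.allclose(sq_dist, D_shift))
```

Keeping fewer leading dimensions than the rank of the shifted matrix gives an approximate, lower-dimensional representation; setting the shift to zero and discarding negative eigenvalues instead gives classical (Torgerson) MDS, which may differ from the MDS variant used in this chapter.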
The remainder of this chapter is organised as follows: Sections 3.2, 3.3 and 3.4 present the processes of locating continuous representations by MDS, mPPCA and CSE respectively. Section 3.5 discusses our choice of dimensionality for each of the continuous representations. Section 3.6 presents experimental results for the clustering performance of projective density-based and well-established clustering algorithms across the continuous representations of simulated and real-world benchmark datasets. Finally, the work is concluded in Section 3.7.