Clustering Algorithms

(1)

Clustering Algorithms

Dalya Baron (Tel Aviv University)

XXX Winter School, November 2018

(2)

Clustering

Feature 1

Featur e 2

(3)

Clustering

Feature 1

Featur e 2 cluster #1

cluster #2

(4)

Clustering

Feature 1

Featur e 2 cluster #1

cluster #2

Why should we look for clusters?

(5)

Clustering

(6)

K-means

Input: measured features, and the number of clusters, k. The algorithm will classify all the objects in the sample into k clusters.

Feature 1

Featur e 2

(7)

K-means

(I) The algorithm places randomly k points that represent the centroids of the clusters.

The algorithm performs several iterations, in each of them:

(II) The algorithm associates each object with a single cluster, according to its distance from the cluster centroid.

(III) The algorithm recalculates the cluster centroid according to the objects that are associated with it.

Feature 1

Featur e 2

(8)

K-means

Feature 1

Featur e 2

Two centroids are

randomly placed

(9)

K-means

Feature 1

Featur e 2

The objects are associated to the

closest cluster centroid (Euclidean

distance).

(10)

K-means

Feature 1

Featur e 2

New cluster centroids are

computed using the

average location of

the cluster members.

(11)

K-means

Feature 1

Featur e 2

The objects are associated to the

closest cluster centroid (Euclidean

distance).

(12)

K-means

Feature 1

Featur e 2

The process stops when the objects that

are associated with a given class do not

change.

(13)

The anatomy of K-means

Internal choices and/or internal cost function:

(I) Initial centroids are randomly selected from the set of examples.

(II) The global cost function that is minimized by K-means:

Euclidean distance cluster centroids

cluster members

(14)

The anatomy of K-means

(I) Initial centroids are randomly selected from the set of examples.

(II) The global cost function that is minimized by K-means:

Euclidean distance cluster centroids

cluster members

k=3, and two diﬀerent random placements of centroids

(15)

The anatomy of K-means

Input dataset: a list of objects with measured features.

For which datasets should we use K-means?

Feature 1

Featur e 2

Feature 1

Featur e 2

(16)

(17)

The anatomy of K-means

What happens when we have an outlier in the dataset?

Feature 1

Featur e 2 outlier!

(18)

The anatomy of K-means

What happens when we have an outlier in the dataset?

Feature 1

Featur e 2 outlier!

(19)

The anatomy of K-means

What happens when the features have diﬀerent physical units?

input dataset K-means output

(20)

The anatomy of K-means

What happens when the features have diﬀerent physical units?

input dataset K-means output

How can we avoid this?

(21)

The anatomy of K-means

Hyper-parameters: the number of clusters, k.

Can we find the optimal k using the cost function?

k=2 k=3 k=5

(22)

The anatomy of K-means

Hyper-parameters: the number of clusters, k.

Can we find the optimal k using the cost function?

k=2 k=3 k=5

Number of clusters

Minimal cost function

Elbow

(23)

Questions?

(24)

Hierarchal Clustering

Correa-Gallego+ 2016

or, how to visualize complicated similarity measures

(25)

Hierarchal Clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. Also, we must specify a linkage method.

Initialization: each object is a cluster of size 1.

Feature 1

Featur e 2

(26)

Hierarchal Clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. Also, we must specify a linkage method.

Initialization: each object is a cluster of size 1.

Feature 1

Featur e 2

Next: the algorithm merges the two closest clusters into a single cluster.

Then, the algorithm re-calculates the distance of the newly-formed cluster

to all the rest.

(27)

Hierarchal Clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. Also, we must specify a linkage method.

Initialization: each object is a cluster of size 1.

Feature 1

Featur e 2

Next: the algorithm merges the two closest clusters into a single cluster.

Then, the algorithm re-calculates the distance of the newly-formed cluster

to all the rest.

distance

Dendrogram

(28)

Hierarchal Clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. Also, we must specify a linkage method.

Initialization: each object is a cluster of size 1.

Feature 1

Featur e 2

Next: the algorithm merges the two closest clusters into a single cluster.

Then, the algorithm re-calculates the distance of the newly-formed cluster

to all the rest.

distance

Dendrogram

(29)

Hierarchal Clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. Also, we must specify a linkage method.

Initialization: each object is a cluster of size 1.

Feature 1

Featur e 2

Next: the algorithm merges the two closest clusters into a single cluster.

Then, the algorithm re-calculates the distance of the newly-formed cluster

to all the rest.

distance

Dendrogram

(30)

Hierarchal Clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. Also, we must specify a linkage method.

Initialization: each object is a cluster of size 1.

Feature 1

Featur e 2

Next: the algorithm merges the two closest clusters into a single cluster.

Then, the algorithm re-calculates the distance of the newly-formed cluster

to all the rest.

distance

Dendrogram

(31)

Hierarchal Clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. Also, we must specify a linkage method.

Initialization: each object is a cluster of size 1.

Feature 1

Featur e 2

Next: the algorithm merges the two closest clusters into a single cluster.

Then, the algorithm re-calculates the distance of the newly-formed cluster

to all the rest.

distance

Dendrogram

(32)

Hierarchal Clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. Also, we must specify a linkage method.

Initialization: each object is a cluster of size 1.

Feature 1

Featur e 2

Next: the algorithm merges the two closest clusters into a single cluster.

Then, the algorithm re-calculates the distance of the newly-formed cluster

to all the rest.

distance

Dendrogram

(33)

Hierarchal Clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. Also, we must specify a linkage method.

Initialization: each object is a cluster of size 1.

Feature 1

Featur e 2

Next: the algorithm merges the two closest clusters into a single cluster.

Then, the algorithm re-calculates the distance of the newly-formed cluster

to all the rest.

distance

Dendrogram

(34)

Hierarchal Clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. Also, we must specify a linkage method.

Initialization: each object is a cluster of size 1.

Feature 1

Featur e 2

The process stops when all the objects are merged into a single cluster

distance

Dendrogram

(35)

The anatomy of Hierarchal Clustering

The linkage method is used to define a distance between two newly formed clusters. Methods include: single (minimal), complete (maximal), average, etc.

Feature 1

Featur e 2 distance single

Dendrogram

(36)

Feature 1

Featur e 2 distance

complete

Dendrogram

The anatomy of Hierarchal Clustering

(37)

Feature 1

Featur e 2 distance

average

Dendrogram

The anatomy of Hierarchal Clustering

(38)

Feature 1

Featur e 2 distance

Dendrogram

d

The anatomy of Hierarchal Clustering

Hyper-parameters: clusters are defined beneath a threshold d. Alternatively, we can select a threshold d that corresponds to the desired number of clusters, k.

(39)

Feature 1

Featur e 2 distance

Dendrogram

d

The anatomy of Hierarchal Clustering

(40)

The anatomy of Hierarchal Clustering

We can use the resulting dendrogram to choose a “good” threshold:

distance

(41)

The anatomy of Hierarchal Clustering

We can use the resulting dendrogram to choose a “good” threshold:

distance

(42)

The anatomy of Hierarchal Clustering

Input dataset: can either be a list of objects with measured properties, or a distance matrix that represents pair-wise distances between objects.

What happens if we have an outlier in the dataset?

(43)

The anatomy of Hierarchal Clustering

What happens if the dataset does not have clear clusters?

distance

(44)

The anatomy of Hierarchal Clustering

Diﬀerent linkage methods are helpful with diﬀerent datasets.

single linkage complete linkage average linkage

(45)

“Statistics, Data Mining, and Machine Learning in Astronomy”, by Ivezic, Connolly, Vanderplas, and Gray (2013).

Hierarchal Clustering in Astronomy

(46)

Visualizing similarity matrices with Hierarchical Clustering

Input: 10,000 emission line spectra, covering the wavelength range 300 - 700 nm. There are ~90 emission lines in each spectrum, with an average SNR of 2-4.

wavelength (nm)

normalized flux normalized flux

(47)

Visualizing similarity matrices with Hierarchical Clustering

We compute a correlation matrix of all the observed wavelengths.

wavelength (nm)

wavelength (nm) corr elation coe ﬃ cient

(48)

Visualizing similarity matrices with Hierarchical Clustering

We convert the correlation matrix to a distance matrix, and build a dendrogram

(49)

Visualizing similarity matrices with Hierarchical Clustering

We reorder the correlation matrix (the wavelengths) according to the resulting dendrogram.

reordered axis

(50)

Visualizing similarity matrices with Hierarchical Clustering

de Souza et. al 2015

(51)

Questions?

(52)

Gaussian Mixture models

See: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm- covariances-py

(53)