
(1)

Dimensionality Reduction Algorithms

(and how to interpret their output)

Dalya Baron (Tel Aviv University)

XXX Winter School, November 2018

(2)

What is Dimensionality Reduction?

[Figure: a dimensionality reduction algorithm maps each object from 28 x 28 features to 2 features (axes: feature 1, feature 2).]

(3)

Why do we need dimensionality reduction?

“Practical”:

• Improve the performance of supervised learning algorithms: the original features can be correlated and redundant, and most algorithms cannot handle thousands of features.

• Compressing data (e.g., SKA).

“Artistic”:

• Data visualization and interpretation.

• Uncover complex trends.

• Look for “unknown unknowns”.

(4)

Two types of dimensionality reduction

1. Decomposition of the objects into “prototypes”. Each object can be represented using the prototypes.

We gain: prototypes that represent the population and low-dimensional embedding.

For example: SVD, PCA, ICA, NNMF, SOM and more…

(5)

Two types of dimensionality reduction

2. Embedding of a high-dimensional dataset into a lower-dimensional space.

We gain: low-dimensional embedding.

For example: tSNE, autoencoders

(6)

Principal Component Analysis (PCA)

[Figure: data in the (feature 1, feature 2) plane, with principal component 1 and principal component 2 overlaid.]

PCA is a transformation that converts a set of observations (possibly of correlated variables) into a set of values of linearly uncorrelated variables, called principal components.

- The first principal component has the largest possible variance.

- Each succeeding component has the highest possible variance, under the constraint that it is orthogonal to the preceding components.
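The definition above can be checked numerically. The following is a minimal sketch, assuming PCA is computed from the singular value decomposition of the mean-centered data; the function name `pca` and the toy two-feature dataset are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def pca(X):
    """PCA of X with shape (n_objects, n_features), via SVD of the centered data."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    variances = S**2 / (X.shape[0] - 1)   # variance along each component
    scores = X_centered @ Vt.T            # coordinates of each object in the new basis
    return Vt, variances, scores

# Toy data (an assumption): two strongly correlated features.
rng = np.random.default_rng(0)
feature_1 = rng.normal(size=500)
feature_2 = 0.8 * feature_1 + 0.3 * rng.normal(size=500)
X = np.column_stack([feature_1, feature_2])

components, variances, scores = pca(X)
print("correlation of the original features:", np.corrcoef(X.T)[0, 1])
print("covariance of the projected values:\n", np.cov(scores.T))
print("variance per component:", variances)
```

The off-diagonal terms of the printed covariance matrix should vanish to numerical precision, and the variances should come out in decreasing order, which is what "linearly uncorrelated" and "largest possible variance first" mean in practice.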

(7)

Principal Component Analysis (PCA)

[Figure: variance and cumulative percentage of the variance as a function of the index of the principal component.]

PCA allows us to compress the data by representing each object as a projection onto the first few principal components.
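As a rough illustration of this compression, the sketch below computes the cumulative fraction of the variance and keeps only as many components as are needed to reach a chosen threshold. The toy dataset, the 99% threshold, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy dataset (an assumption): 200 objects with 50 features that are really
# mixtures of 3 hidden signals plus a little noise.
hidden = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = hidden @ mixing + 0.05 * rng.normal(size=(200, 50))

mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

variance = S**2 / (X.shape[0] - 1)
cumulative = np.cumsum(variance) / variance.sum()

# Smallest number of components that captures at least 99% of the variance.
k = int(np.searchsorted(cumulative, 0.99) + 1)

# Compress: each object becomes k numbers instead of 50.
compressed = (X - mean) @ Vt[:k].T
# Decompress and measure what was lost.
restored = compressed @ Vt[:k] + mean
relative_error = np.linalg.norm(X - restored) / np.linalg.norm(X - mean)

print("components kept:", k)
print("compressed shape:", compressed.shape)
print("relative reconstruction error:", relative_error)
```

The threshold is a judgment call: a plot of `cumulative` against the component index is the usual way to choose it.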

(8)

Principal Component Analysis (PCA)

The principal components may represent the true building blocks of the objects in our dataset.

observed object = A * (principal comp. 1) + B * (principal comp. 2) + C * (principal comp. 3)

(9)

Principal Component Analysis (PCA)

The projection onto the principal components gives a low-dimensional representation of the objects in the sample.

[Figure: objects plotted by their coefficients A (corresponds to principal component 1) and B (corresponds to principal component 2), alongside principal comp. 1 and principal comp. 2.]

(10)

PCA: Pros & Cons

Advantages:

• Very simple and intuitive to use.

• No free parameters!

• Optimized to capture as much of the variance as possible.

Disadvantages:

• Linear decomposition: we will not be able to describe absorption lines, dust extinction, distance, etc.

• Can produce negative principal components, which is not always physical in astronomy.

(11)

From: http://www.astroml.org/book_figures/chapter7/fig_spec_decompositions.html

(12)

t-distributed stochastic neighbor embedding (tSNE)

Embedding high-dimensional data in a low-dimensional space (2 or 3 dimensions).

Input:

(1) raw data, extracted features, or a distance matrix

(2) hyper-parameters: perplexity

high-dimensional space:

low-dimensional space:

(13)

tSNE

high-dimensional space:

low-dimensional space:

Intuition: tSNE tries to find a low-dimensional embedding that preserves, as much as possible, the distribution of distances between different objects.

perplexity: sets the size of the neighborhood (roughly, the effective number of neighbors) that tSNE considers in the optimization

distance in neighborhood N
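To make the role of the perplexity concrete, here is a minimal sketch using the standard tSNE definitions: for object i, the conditional probabilities are p(j|i) ∝ exp(-d_ij² / 2σ_i²), and σ_i is tuned so that 2 raised to the Shannon entropy of p(·|i) equals the requested perplexity. The function names, the bisection bounds, and the toy data are assumptions made for the example.

```python
import numpy as np

def conditional_probabilities(sq_dists_i, sigma):
    """p(j|i) for one object i, given squared distances to all other objects."""
    # Subtracting the minimum keeps the exponentials from underflowing to all zeros.
    logits = -(sq_dists_i - sq_dists_i.min()) / (2.0 * sigma**2)
    p = np.exp(logits)
    return p / p.sum()

def perplexity_of(p):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    nonzero = p[p > 0]
    return 2.0 ** (-np.sum(nonzero * np.log2(nonzero)))

def sigma_for_perplexity(sq_dists_i, target, n_iter=60):
    """Bisection in log space for the Gaussian width sigma_i."""
    lo, hi = 1e-6, 1e6
    for _ in range(n_iter):
        mid = np.sqrt(lo * hi)
        if perplexity_of(conditional_probabilities(sq_dists_i, mid)) > target:
            hi = mid   # neighborhood too wide, shrink it
        else:
            lo = mid   # neighborhood too narrow, widen it
    return np.sqrt(lo * hi)

# Toy data (an assumption): 300 objects with 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
sq_dists = np.sum((X[1:] - X[0]) ** 2, axis=1)  # squared distances from object 0

sigmas = {}
for target in (5, 30, 100):
    sigmas[target] = sigma_for_perplexity(sq_dists, target)
    p = conditional_probabilities(sq_dists, sigmas[target])
    print(target, sigmas[target], perplexity_of(p))
```

A larger perplexity requires a larger σ_i, so more distant objects receive non-negligible weight when the embedding is optimized.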

(14)

tSNE - example

[Figure: tSNE embedding of objects with 28 x 28 features each (axes: feature 1, feature 2).]

(15)

tSNE - example

https://distill.pub/2016/misread-tsne/

(16)

tSNE: Pros & Cons

Advantages:

• Can take as an input a general distance matrix.

• Non-linear embedding.

• Preserves high-dimensional clustering well (depending on the chosen perplexity).

Disadvantages:

• No prototypes.

• Sensitive to distance scales < perplexity.

• Large distances in the embedding are not meaningful.
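The first advantage, taking a general distance matrix as input, can be sketched with scikit-learn's `TSNE`, which accepts `metric="precomputed"` (and in that mode needs `init="random"`). The toy data, the choice of Manhattan distance, and the perplexity value are assumptions made for the example.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy data (an assumption): two well-separated groups of 40 objects in 20 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 20)),
               rng.normal(6.0, 1.0, size=(40, 20))])

# Any symmetric, non-negative distance matrix will do; here, Manhattan distance.
distances = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

tsne = TSNE(n_components=2, perplexity=15, metric="precomputed",
            init="random", random_state=0)
embedding = tsne.fit_transform(distances)
print(embedding.shape)  # (80, 2)
```

Because only the distance matrix is passed in, the same call works for objects that have no natural feature vector, as long as a pairwise distance can be defined. Rerunning with a different `perplexity` or `random_state` is a cheap way to see which structures in the map are stable.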


(17)

UMAP

See: https://arxiv.org/abs/1802.03426 and https://github.com/lmcinnes/umap

(18)

Autoencoders

loss function = (input - reconstructed output)^2
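A squared reconstruction loss of this kind can be illustrated with the smallest possible autoencoder: one linear encoder and one linear decoder, trained by plain gradient descent. This is only a sketch under stated assumptions (toy data, no non-linearities, hand-written gradients, hypothetical variable names); practical autoencoders use deep non-linear networks and a dedicated library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (an assumption): 256 objects with 8 features lying near a 2D plane.
latent = rng.normal(size=(256, 2))
mixing = np.array([[1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]])
X = latent @ mixing + 0.01 * rng.normal(size=(256, 8))
X = X - X.mean(axis=0)

n_features, n_hidden = X.shape[1], 2
W_enc = 0.1 * rng.normal(size=(n_features, n_hidden))  # encoder weights
W_dec = 0.1 * rng.normal(size=(n_hidden, n_features))  # decoder weights

def loss(X, W_enc, W_dec):
    """Mean of (input - reconstruction)^2 over all objects and features."""
    return np.mean((X - X @ W_enc @ W_dec) ** 2)

initial_loss = loss(X, W_enc, W_dec)
learning_rate = 0.05
for step in range(2000):
    Z = X @ W_enc            # low-dimensional embedding (the bottleneck)
    X_hat = Z @ W_dec        # reconstruction
    residual = X_hat - X
    # Gradients of the mean squared error with respect to both weight matrices.
    grad_dec = 2.0 * Z.T @ residual / X.size
    grad_enc = 2.0 * X.T @ (residual @ W_dec.T) / X.size
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = loss(X, W_enc, W_dec)
embedding = X @ W_enc
print("loss before and after training:", initial_loss, final_loss)
print("embedding shape:", embedding.shape)
```

The two columns of `embedding` play the same role as the two dimensions of any other embedding in these slides. With purely linear layers the network can do no better than PCA; the non-linear layers of a real autoencoder are what let it go beyond that.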

(19)

Autoencoders - Pros & Cons

Advantages:

• Can reduce the dimensions of raw images (CNN) or time-series (RNN)!

• Can be used to produce an uncertainty on the embedding.

Disadvantages:

• No prototypes.

• Complex to build and train, and hard to interpret.

(20)

Self-Organizing Maps (SOM) and PINK

See: https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2016-116.pdf

http://www.astron.nl/LifeCycle2018/Documents/Talks_Session1/Harwood_LifeCycle18.pdf

(21)

How to interpret the output of a dimensionality reduction algorithm?

High-dimensional data -> Dimensionality reduction algorithm -> 2D embedding

(22)

How to interpret the output of a dimensionality reduction algorithm?

If we have prototypes - try to understand what they mean

(23)

How to interpret the output of a dimensionality reduction algorithm?

[Figure: 2D embedding (axes: Dimension #1, Dimension #2).]


(25)

[Figure: tSNE embedding in two dimensions (axes: Dimension #1, Dimension #2), shown with spectra (normalized flux versus wavelength in Å).]

(26)

Example with the APOGEE dataset

• APOGEE stars: infrared spectra of ~250K stars.

• Calculate a Random Forest distance matrix, then apply tSNE for dimensionality reduction.

• See Reis+17.
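One common way to obtain a distance matrix from a Random Forest without labels is sketched below: train the forest to separate the real objects from synthetic ones whose features have been shuffled independently, then call two real objects similar if they end up in the same leaf in many trees. This is a simplified sketch of that general idea, with assumed names and parameters and scikit-learn as the library; it is not a reproduction of the procedure in Reis+17.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def random_forest_distance(X, n_trees=100, seed=0):
    """Pairwise distance = 1 - fraction of trees in which two objects share a leaf."""
    rng = np.random.default_rng(seed)
    # Synthetic objects: shuffle each feature independently, which keeps the
    # marginal distributions but destroys the correlations between features.
    X_synthetic = np.column_stack([rng.permutation(column) for column in X.T])
    data = np.vstack([X, X_synthetic])
    labels = np.concatenate([np.ones(len(X)), np.zeros(len(X_synthetic))])

    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(data, labels)

    # leaves[i, t] is the index of the leaf that real object i reaches in tree t.
    leaves = forest.apply(X)
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    similarity = same_leaf.mean(axis=2)
    return 1.0 - similarity

# Toy data (an assumption): 100 objects whose 5 features are strongly correlated.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
X = base + 0.1 * rng.normal(size=(100, 5))

distances = random_forest_distance(X)
print(distances.shape, distances.min(), distances.max())
```

The resulting matrix can be passed to any tSNE implementation that accepts precomputed distances. The pairwise comparison here needs memory proportional to n² times the number of trees, so a sample of ~250K stars would need a chunked or approximate version.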

(27)

Example with the APOGEE dataset

[Figure: tSNE embedding of the APOGEE stars (axes: tSNE dimension #1, tSNE dimension #2).]

(28)

Example with the APOGEE dataset

[Figure: tSNE embedding (axes: tSNE dimension #1, tSNE dimension #2), annotated "stack spectra along this axis".]

1. Stack observations along different axes.
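Stacking along an axis of the embedding can be sketched as follows: split one embedding coordinate into bins and take the median spectrum of the objects in each bin. The function name, the equal-width binning, the use of the median, and the toy spectra are all assumptions made for the example.

```python
import numpy as np

def stack_along_axis(embedding, spectra, axis=0, n_bins=5):
    """Median-stack spectra in equal-width bins of one embedding coordinate."""
    coord = embedding[:, axis]
    edges = np.linspace(coord.min(), coord.max(), n_bins + 1)
    # np.digitize returns 1..n_bins for interior values; the clip puts the
    # object sitting exactly on the upper edge into the last bin.
    bin_index = np.clip(np.digitize(coord, edges) - 1, 0, n_bins - 1)
    stacks = np.full((n_bins, spectra.shape[1]), np.nan)
    for b in range(n_bins):
        members = bin_index == b
        if members.any():
            stacks[b] = np.median(spectra[members], axis=0)
    return edges, stacks

# Toy data (an assumption): an absorption line whose depth grows with dimension #1.
rng = np.random.default_rng(0)
embedding = rng.uniform(-1.0, 1.0, size=(500, 2))
wavelength = np.linspace(0.0, 1.0, 50)
line_profile = np.exp(-0.5 * ((wavelength - 0.5) / 0.05) ** 2)
depth = 0.25 * (embedding[:, 0] + 1.0)
spectra = 1.0 - depth[:, None] * line_profile + 0.01 * rng.normal(size=(500, 50))

edges, stacks = stack_along_axis(embedding, spectra, axis=0, n_bins=5)
print("minimum flux of each stack:", stacks.min(axis=1))
```

If a spectral feature changes smoothly from one stack to the next, that embedding dimension is tracing a physical trend; repeating the call with `axis=1` tests the other dimension.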

(29)

Example with the APOGEE dataset

[Figure: tSNE embedding colored by a tabulated parameter (axes: tSNE dimension #1, tSNE dimension #2).]

2. Color points according to tabulated parameters (e.g., from the SDSS).


(33)

Questions?
