Correspondence analysis and related methods

(1)

Michael Greenacre

WEEK 3:

MULTIDIMENSIONAL SCALING (MDS)

EIGENVALUE-EIGENVECTOR DECOMPOSITION MDS BIPLOTS

Correspondence analysis and related methods

Multidimensional scaling (MDS) biplots

Data set “countries” by student MT

Countries I E HR BR RU D T R MA PE N G F MX Z A

Italy 0 2 5 5 8 7 3 5 6 8 4 5 7

Spain 2 0 5 4 9 7 4 7 3 8 4 4 6

Croa tia 5 5 0 7 4 3 6 7 7 8 4 6 7

Brazil 5 4 7 0 9 8 3 6 2 7 4 3 5

Ru ssia 8 9 4 9 0 4 7 7 8 8 7 7 7

G erm a ny 7 7 3 8 4 0 7 8 8 8 4 8 8

T urkey 3 4 6 3 7 7 0 5 4 5 6 4 5

M orocco 5 7 7 6 7 8 5 0 7 4 6 6 4

Peru 6 3 7 2 8 8 4 7 0 6 7 2 4

Nigeria 8 8 8 7 8 8 5 4 6 0 6 3 3

F rance 4 4 4 4 7 4 6 6 7 6 0 8 7

Mexico 5 4 6 3 7 8 4 6 2 3 8 0 4

South A frica 7 6 7 5 7 8 5 4 4 3 7 4 0

Data on www.multivariatestatistics.org

(2)

MDS of “countries”, using classical scaling

-2 0 2 4

-2024

dim 1

dim 2

I

E

HR

BR

RU

D TR

MA

PE NG

F MX

ZA

Distance variance explained

= 33.3% + 23.4%

= 56.7%

Distances and dissimilarities...

• n _objects

• d

_ij

= distance between object i _{and object} j Properties of a distance (metric)

1. d

_ij

= d

_ji

2. d

_ij

 0, d

_ij

= 0  i = j

3. d

_ij

 d

_ik

+ d

_kj

(the triangle inequality) (If 3. not satisfied we call d a dissimilarity)

The chi-square distance is a true distance, whereas Bray-Curtis

is a dissimilarity

(3)

CITIES Amst. Aths. Barc. Basel Berlin Bordx

Amsterdam 0 2979 1533 768 676 1076 ...

Athens 2979 0 3261 2594 2486 3250 ...

Barcelona 1533 3261 0 1061 1945 600 ...

Basel 768 2594 1061 0 884 898 ...

Berlin 676 2486 1945 884 0 1631 ...

Bordeaux 1076 3250 600 898 1631 0 ...

: : : : : : :

Amsterdam

Bordeaux

Basel

Berlin

Athens Barcelona

OK ?

Distances and dissimilarities...

CITIES Amst. Aths. Barc. Basel Berlin Bordx

Amsterdam 0 d

₁₂

d

₁₃

d

₁₄

d

₁₅

d

₁₆

Athens d

₂₁

0 d

₂₃

d

₂₄

d

₂₅

d

₂₆

Barcelona d

₃₁

d

₃₂

0 d

₃₄

d

₃₅

d

₃₆

Basel d

₄₁

d

₄₂

d

₄₃

0 d

₄₅

d

₄₆

Berlin d

₅₁

d

₅₂

d

₅₃

d

₅₄

0 d

₅₆

Bordeaux d

₆₁

d

₆₂

d

₆₃

d

₆₄

d

65

0 Amsterdam

Bordeaux

Basel

Berlin

Athens Barcelona

Observed distances

d

_ij

etc...

ˆ

13

d ˆ

16

d

ˆ

15

d dˆ

ij

Fitted distances

ˆ

12

d ˆ

34

d

Multidimensional scaling (MDS)

(4)

Observed distances

d

_ij

 ^

ij

d

d ˆ )

²

(

Minimize

dˆ

ij

Fitted

distances  ^

ij

f d

d )

²

( ( ˆ )

Minimize

Objective is to minimize some measure of discrepancy, or error, between observed and fitted distances.

or

Maximize the agreement between the rank-ordered

distances in the map and the rank-ordering of the original distances (nonmetric MDS), similar idea to that of

Spearman’s rank correlation; R function isoMDS .

also called “Sammon’s non-linear mapping”; R function sammon

for any monotonically increasing function f

Multidimensional scaling (MDS)

points distances

• From a map to a distance matrix

(squared) distance

matrix (-1,3) • ₁

(-1,-1) • ₄

(3,2) • ₃ (3,4) • ₂

0 25 41 16

25 0

4 16

41 4

0 17

16 17

17 0

 





 





Classical (multidimensional) scaling

(5)

points distances

• suppose you have n _points x

_i

( i=1,...,n ) _in p -dimensional Euclidean space

 







 









 







 









np n

n

p p

n

x x x

x x

x

x x

x









2 1

2 22

21

1 12

11 2

1

T T T

x x x X

p _dimensions

n _points

• squared distance between the

i _{-th and} j -th points is









^p

k

jk ik

ij

x x

1

)

2

 (

2 1

2 22

21

1 12

11

 





 







nn n

n

n n









(squared) Δ

distance matrix

Classical (multidimensional) scaling

• in matrix notation:

S 1s

Δ s1 2

2 1

2 22

21

1 12

11







 





 







^T ^T

nn n

n

n n









where S  XX

^T

distances points

• the problem in classical scaling:

• given  _{solve for} X

is matrix of scalar products

) diag( S s 

and

Classical scaling



   











^p

k

jk ik p

k jk p

k ik p

k

jk ik

ij

x x x x x x

1 1

2 1

2

2 )

 (

(6)

• If we had scalar propducts S and had to recover X it would be simple:

S = XX ^T

• Recall the eigenvalue-eigenvector decomposition of a square symmetric matrix, for example of S:

S = UU ^T

where

UU ^T = I ; 0

0 0

2 1 2

1



 





 







_n

n









 Λ

so a possible solution would be:

X = U ^1/2

Classical scaling

• but we don’t have the scalar products S but rather the squared distances  =

• we can recover the matrix of scalar products S* with respect to the centroid of the n points by a transformation of  called double- centring:

– subtract the row means from all the squared distances – subtract column means from the resultant matrix

then multiply double-centred matrix by - 1/2 _{to obtain} S*

In matrix notation: **S* =**

Then carry on as before:

**S* = UU** T

**X* = U** ^1/2 S

1s s1

^T



^T

 2

Classical scaling

) (

)

( ¹ ¹

2

1

T T

11 I

Δ 11

I  n  n



(7)

R code to double-centre and eigendecompose

# read in the squared distances as a square matrix

d2 <- matrix(c(0,17,17,16,17,0,4,41,17,4,0,25,16,41,25,0),nrow=4)

# compute scalar products (mimmicking matrix algebra) n <- nrow(d2)

ones <- rep(1,n) I <- diag(ones)

S <- -0.5*(I-(1/n)*ones%*%t(ones)) %*% d2 %*% (I-(1/n)*ones%*%t(ones))

# alternatively use R function sweep to centre row- and column-wise rowm <- apply(d2,1,mean)

S <- sweep(d2,1,rowm) colm <- apply(S,2,mean) S <- sweep(S,2,colm) S <- -0.5*S

# compute eigenvalues and eigenvectors using R function eigen S.eig <- eigen(S)

# compute coordinates and plot

X <- S.eig$vectors[,1:2] %*% diag(sqrt(S.eig$values[1:2])) plot(X, type="n")

text(X, labels=1:4)

# check squared distances between displayed points dist(X)^2

Fits the scalar products directly, hence the distances indirectly.

Classical MDS situates the points in a space of as high dimensionality as possible to reproduce the observed distances and then projects the points onto low-dimensional suspaces, usually a plane:

•

• • •

•

• • •

•  



 



centroid

d

_i

dˆ

i centroid



Method aims to minimize the sum of squares of these ‘errors’

This is equivalent to maximizing (thanks, Pythagoras!).

The quality of the fit is usually measured by

expressed as a %. 

i

d ˆ

i²

/ 

i

d

i²



i

d ˆ

i²

R function cmdscale

• 

Classical scaling - geometry

(8)

R code to perform classical scaling

#---

# Classical multidimensional scaling

#---

# Data set "countries" on multivariatestatistics.org MT <- read.table("clipboard")

MT.mds <- cmdscale(MT)

plot(MT.mds, type="n", asp=1)

text(MT.mds, labels=colnames(MT), col="blue", font=2)

# MT.mds only contains coordinates, use option eig=T to get eigenvalues MT.mds <- cmdscale(MT, eig=T)

plot(MT.mds$points, type="n", asp=1)

text(MT.mds$points, labels=colnames(MT), col="blue", font=2)

# sum of ABSOLUTE values of eigenvalues is measure of total variance sum(abs(MT.mds$eig))

# sum of POSITIVE eigenvalues is the Euclidean-embeddable part sum(MT.mds$eig[MT.mds$eig>0])

# proportion of "Euclideanness" in student's dissimilarity matrix sum(MT.mds$eig[MT.mds$eig>0]) / sum(abs(MT.mds$eig))

# proportion of explained Euclidean variance in two dimensional solution sum(MT.mds$eig[1:2])/sum(MT.mds$eig[MT.mds$eig>0])

# proportion of explained total variance in two dimensional solution sum(MT.mds$eig[1:2])/sum(abs(MT.mds$eig))

Classical MDS of Bray-Curtis dissimilarities

-40 -20 0 20 40

-200204060

s1

s2 s3

s4 s5

s6 s7 s8

s9 s10

s11

s12

s14 s15 s13

s16 s17

s18

s19

s20 s21

s22

s23

s24

s25 s26s27

s28 s29

s30

Goodness

of fit:

53.1%

(9)

Stress:

13.5%

-40 -20 0 20 40

-200204060

s1

s2 s3

s4 s5

s6 s7 s8

s9 s10

s11 s12

s14 s13

s15

s16 s17

s18

s19

s20 s21

s22

s23

s24 s25

s26 s27

s28 s29

s30

Nonmetric MDS of Bray-Curtis dissimilarities

-1.0 -0.5 0.0 0.5 1.0 1.5

-1 .0 -0 .5 0 .0 0 .5

s1 s2

s3

s4 s5

s7 s6

s8

s9

s10

s11 s12

s13 s14

s16 s15 s17

s18 s19

s20

s21

s22 s23

s24 s25

s26

s27 s29 s28 s30

Goodness of fit:

74.4%

Classical MDS of chi-square distances

(10)

Goodness of fit:

75.2%

-1.0 -0.5 0.0 0.5 1.0 1.5

-1.0-0.50.0

s1 s2

s3

s4

s5 s6 s7

s8

s9

s10

s11 s12

s13 s14

s15 s16

s17 s18

s19 s20

s21

s22 s23

s24 s25

s26

s27 s28

s29 s30

a b

d c

e

Notice that the rows and the columns are depicted in a joint map.

To be

continued...

Correspondence analysis

Exercise 2

COUNTRIES A B C D E F

A: standard of living 1 (low) ... 9 (high) B: climate

1 (terrible) ... 9 (excellent) C: food

1 (awful) ... 9 (delicious) D: security

1 (dangerous) ... 9 (safe) E: hospitality

1 (friendly) ... 9 (unfriendly) F: infrastructure

1 (poor) ... 9 (excellent)

Belgium

Denmark

Estonia

France

Germany

Greece

Holland

Hungary

Italy

Romania

Spain

Turkey

U.K.

(11)

Data set “attributes”

C ountries living c limate food security hos pitality infras tructure dim 1 dim 2

Italy 7 8 9 5 3 7 0.01 -2.94

Spain 7 9 9 5 2 8 -1.02 -3.68

Croatia 5 6 6 6 5 6 3.70 -0.88

Brazil 5 8 7 3 2 3 -2.56 -2.01

Russia 6 2 2 3 7 6 4.41 2.91

G erm any 8 3 2 8 7 9 5.01 0.00

T urkey 5 8 9 3 1 3 -1.38 -0.48

M oro cco 4 7 8 2 1 2 -0.87 2.27

P eru 5 6 6 3 4 4 -2.77 -0.74

Nigeria 2 4 4 2 3 2 -1.97 3.91

F rance 8 4 7 7 9 8 2.18 -1.76

M exico 2 5 5 2 3 3 -2.58 0.77

S outh Africa 4 4 5 3 3 3 -2.16 2.62

Student MT’s ratings of the 13 countries on 6 attributes:

standard of living (1=low,…,9=high);

climate (1=terrible,…,9=excellent);

food (1=awful,…,9=delicious);

security (1=dangerous,…,9=safe);

hospitality (1=friendly,…,9=unfriendly);

infrastructure (1=poor,…,9=excellent).

On the right are the coordinates of the country points in the MDS map Data on www.multivariatestatistics.org

variances explained:

living: 74.4%

climate: 69.3%

food: 61.0%

security: 78.1%

hospitality: 56.9%

infrastructure: 81.8%

attributes dimns

countries

attributes plotted using regression coefficients



regrn coeffs

Adding variables in regression biplot

(12)

variances explained:

a: 76.7%

b:

18.6%

c: 92.1%

d:

14.6%

e:

95.8%

species dimns

sites

species plotted using regression coefficients



regrn coeffs

-1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0

-1.0-0.50.00.51.01.5

1

2 3

4 5

7 6

8

9

10

11 12

13 14

16 15

17 18

19 20

21

22 23

24 25

26

27 2928 30 a

b

c d

e C G

S

C G

S

Explained chi-square distances

52.4% + 22.0% = 74.4%

--- :logistic biplot

C, S, G

: averages of sites

MDS biplot, based on chi-square distances

R code to perform MDS biplot

#---

# MDS biplot, adding data set "attributes"

#---

# Data set "attributes" on multivariatestatistics.org MT_ratings <- read.table("clipboard")

# Save two-dimensional coordinates of previous classical MDS MT_dims <- MT.mds$points[,1:2]

colnames(MT_dims) <- c("dim1","dim2")

# calculate the regression coefficients of the 6 attributes on the coordinates MT_coefs <- lm(MT_ratings[,1] ~ MT_dims[,1] + MT_dims[,2])$coefficients

for(j in 2:6) MT_coefs <- rbind(MT_coefs,

lm(MT_ratings[,j] ~ MT_dims[,1] + MT_dims[,2])$coefficients)

# plot the regression coefficients on the MDS plot

plot(MT_dims, type = "n", xlab = "dim 1", ylab = "dim 2", asp = 1) text(MT_dims, labels = colnames(MT), col = "blue", font=2)

arrows(0, 0, MT_coefs[,2], MT_coefs[,3], col = "red", length = 0.1, angle = 10) text(1.2*MT_coefs[,2:3], labels = colnames(MT_ratings), col = "black", font = 4,

cex = 0.7)

Correspondence analysis and related methods

Michael Greenacre

WEEK 3:

MULTIDIMENSIONAL SCALING (MDS)

EIGENVALUE-EIGENVECTOR DECOMPOSITION MDS BIPLOTS