Michael Greenacre
WEEK 3:
MULTIDIMENSIONAL SCALING (MDS)
EIGENVALUE-EIGENVECTOR DECOMPOSITION MDS BIPLOTS
Correspondence analysis and related methods
Multidimensional scaling (MDS) biplots
Data set “countries” by student MT
Countries I E HR BR RU D T R MA PE N G F MX Z A
Italy 0 2 5 5 8 7 3 5 6 8 4 5 7
Spain 2 0 5 4 9 7 4 7 3 8 4 4 6
Croa tia 5 5 0 7 4 3 6 7 7 8 4 6 7
Brazil 5 4 7 0 9 8 3 6 2 7 4 3 5
Ru ssia 8 9 4 9 0 4 7 7 8 8 7 7 7
G erm a ny 7 7 3 8 4 0 7 8 8 8 4 8 8
T urkey 3 4 6 3 7 7 0 5 4 5 6 4 5
M orocco 5 7 7 6 7 8 5 0 7 4 6 6 4
Peru 6 3 7 2 8 8 4 7 0 6 7 2 4
Nigeria 8 8 8 7 8 8 5 4 6 0 6 3 3
F rance 4 4 4 4 7 4 6 6 7 6 0 8 7
Mexico 5 4 6 3 7 8 4 6 2 3 8 0 4
South A frica 7 6 7 5 7 8 5 4 4 3 7 4 0
Data on www.multivariatestatistics.org
MDS of “countries”, using classical scaling
-2 0 2 4
-2024
dim 1
dim 2
I
E
HR
BR
RU
D TR
MA
PE NG
F MX
ZA
Distance variance explained
= 33.3% + 23.4%
= 56.7%
Distances and dissimilarities...
• n objects
• d
ij= distance between object i and object j Properties of a distance (metric)
1. d
ij= d
ji2. d
ij 0, d
ij= 0 i = j
3. d
ij d
ik+ d
kj(the triangle inequality) (If 3. not satisfied we call d a dissimilarity)
The chi-square distance is a true distance, whereas Bray-Curtis
is a dissimilarity
CITIES Amst. Aths. Barc. Basel Berlin Bordx
Amsterdam 0 2979 1533 768 676 1076 ...
Athens 2979 0 3261 2594 2486 3250 ...
Barcelona 1533 3261 0 1061 1945 600 ...
Basel 768 2594 1061 0 884 898 ...
Berlin 676 2486 1945 884 0 1631 ...
Bordeaux 1076 3250 600 898 1631 0 ...
: : : : : : :
Amsterdam
Bordeaux
Basel
Berlin
Athens Barcelona
OK ?
Distances and dissimilarities...
CITIES Amst. Aths. Barc. Basel Berlin Bordx
Amsterdam 0 d
12d
13d
14d
15d
16Athens d
210 d
23d
24d
25d
26Barcelona d
31d
320 d
34d
35d
36Basel d
41d
42d
430 d
45d
46Berlin d
51d
52d
53d
540 d
56Bordeaux d
61d
62d
63d
64d
650
Amsterdam
Bordeaux
Basel
Berlin
Athens Barcelona
Observed distances
d
ijetc...
ˆ
13d ˆ
16d
ˆ
15d dˆ
ijFitted distances
ˆ
12d ˆ
34d
Multidimensional scaling (MDS)
Observed distances
d
ij
ij
ij
ij
d
d ˆ )
2(
Minimize
dˆ
ijFitted
distances
ij
ij
ij
f d
d )
2( ( ˆ )
Minimize
Objective is to minimize some measure of discrepancy, or error, between observed and fitted distances.
or
or
Maximize the agreement between the rank-ordered
distances in the map and the rank-ordering of the original distances (nonmetric MDS), similar idea to that of
Spearman’s rank correlation; R function isoMDS .
also called “Sammon’s non-linear mapping”; R function sammon
for any monotonically increasing function f
Multidimensional scaling (MDS)
points distances
• From a map to a distance matrix
(squared) distance
matrix (-1,3) • 1
(-1,-1) • 4
(3,2) • 3 (3,4) • 2
0 25 41 16
25 0
4 16
41 4
0 17
16 17
17 0
Classical (multidimensional) scaling
points distances
• suppose you have n points x
i( i=1,...,n ) in p -dimensional Euclidean space
np n
n
p p
n
x x x
x x
x
x x
x
2 1
2 22
21
1 12
11 2
1
T T T
x x x X
p dimensions
n points
• squared distance between the
i -th and j -th points is
pk
jk ik
ij
x x
1
)
2 (
2 1
2 22
21
1 12
11
nn n
n
n n
(squared) Δ
distance matrix
Classical (multidimensional) scaling
• in matrix notation:
S 1s
Δ s1 2
2 1
2 22
21
1 12
11
T Tnn n
n
n n
where S XX
Tdistances points
• the problem in classical scaling:
• given solve for X
is matrix of scalar products
) diag( S s
and
Classical scaling
pk
jk ik p
k jk p
k ik p
k
jk ik
ij
x x x x x x
1 1
2 1
2 1
2
2
)
(
• If we had scalar propducts S and had to recover X it would be simple:
S = XX T
• Recall the eigenvalue-eigenvector decomposition of a square symmetric matrix, for example of S:
S = UU T
where
UU T = I ; 0
0 0
0 0
0 0
2 1 2
1
nn
Λ
so a possible solution would be:
X = U 1/2
Classical scaling
• but we don’t have the scalar products S but rather the squared distances =
• we can recover the matrix of scalar products S* with respect to the centroid of the n points by a transformation of called double- centring:
– subtract the row means from all the squared distances – subtract column means from the resultant matrix
then multiply double-centred matrix by - 1/2 to obtain S*
In matrix notation: S* =
Then carry on as before:
S* = UU T
X* = U 1/2 S
1s s1
T
T 2
Classical scaling
) (
)
( 1 1
2
1
T T11 I
Δ 11
I n n
R code to double-centre and eigendecompose
# read in the squared distances as a square matrix
d2 <- matrix(c(0,17,17,16,17,0,4,41,17,4,0,25,16,41,25,0),nrow=4)
# compute scalar products (mimmicking matrix algebra) n <- nrow(d2)
ones <- rep(1,n) I <- diag(ones)
S <- -0.5*(I-(1/n)*ones%*%t(ones)) %*% d2 %*% (I-(1/n)*ones%*%t(ones))
# alternatively use R function sweep to centre row- and column-wise rowm <- apply(d2,1,mean)
S <- sweep(d2,1,rowm) colm <- apply(S,2,mean) S <- sweep(S,2,colm) S <- -0.5*S
# compute eigenvalues and eigenvectors using R function eigen S.eig <- eigen(S)
# compute coordinates and plot
X <- S.eig$vectors[,1:2] %*% diag(sqrt(S.eig$values[1:2])) plot(X, type="n")
text(X, labels=1:4)
# check squared distances between displayed points dist(X)^2
Fits the scalar products directly, hence the distances indirectly.
Classical MDS situates the points in a space of as high dimensionality as possible to reproduce the observed distances and then projects the points onto low-dimensional suspaces, usually a plane:
•
•
• •
•
•
• •
•
centroid
d
idˆ
i centroid
Method aims to minimize the sum of squares of these ‘errors’
This is equivalent to maximizing (thanks, Pythagoras!).
The quality of the fit is usually measured by
expressed as a %.
id ˆ
i2/
id
i2
id ˆ
i2R function cmdscale
•
Classical scaling - geometry
R code to perform classical scaling
#---
# Classical multidimensional scaling
#---
# Data set "countries" on multivariatestatistics.org MT <- read.table("clipboard")
MT.mds <- cmdscale(MT)
plot(MT.mds, type="n", asp=1)
text(MT.mds, labels=colnames(MT), col="blue", font=2)
# MT.mds only contains coordinates, use option eig=T to get eigenvalues MT.mds <- cmdscale(MT, eig=T)
plot(MT.mds$points, type="n", asp=1)
text(MT.mds$points, labels=colnames(MT), col="blue", font=2)
# sum of ABSOLUTE values of eigenvalues is measure of total variance sum(abs(MT.mds$eig))
# sum of POSITIVE eigenvalues is the Euclidean-embeddable part sum(MT.mds$eig[MT.mds$eig>0])
# proportion of "Euclideanness" in student's dissimilarity matrix sum(MT.mds$eig[MT.mds$eig>0]) / sum(abs(MT.mds$eig))
# proportion of explained Euclidean variance in two dimensional solution sum(MT.mds$eig[1:2])/sum(MT.mds$eig[MT.mds$eig>0])
# proportion of explained total variance in two dimensional solution sum(MT.mds$eig[1:2])/sum(abs(MT.mds$eig))
Classical MDS of Bray-Curtis dissimilarities
-40 -20 0 20 40
-200204060
s1
s2 s3
s4 s5
s6 s7 s8
s9 s10
s11
s12
s14 s15 s13
s16 s17
s18
s19
s20 s21
s22
s23
s24
s25 s26s27
s28 s29
s30
Goodness
of fit:
53.1%
Stress:
13.5%
-40 -20 0 20 40
-200204060
s1
s2 s3
s4 s5
s6 s7 s8
s9 s10
s11 s12
s14 s13
s15
s16 s17
s18
s19
s20 s21
s22
s23
s24 s25
s26 s27
s28 s29
s30
Nonmetric MDS of Bray-Curtis dissimilarities
-1.0 -0.5 0.0 0.5 1.0 1.5
-1 .0 -0 .5 0 .0 0 .5
s1 s2
s3
s4 s5
s7 s6
s8
s9
s10
s11 s12
s13 s14
s16 s15 s17
s18 s19
s20
s21
s22 s23
s24 s25
s26
s27 s29 s28 s30
Goodness of fit:
74.4%
Classical MDS of chi-square distances
Goodness of fit:
75.2%
-1.0 -0.5 0.0 0.5 1.0 1.5
-1.0-0.50.0
s1 s2
s3
s4
s5 s6 s7
s8
s9
s10
s11 s12
s13 s14
s15 s16
s17 s18
s19 s20
s21
s22 s23
s24 s25
s26
s27 s28
s29 s30
a b
d c
e
Notice that the rows and the columns are depicted in a joint map.
To be
continued...
Correspondence analysis
Exercise 2
COUNTRIES A B C D E F
A: standard of living 1 (low) ... 9 (high) B: climate
1 (terrible) ... 9 (excellent) C: food
1 (awful) ... 9 (delicious) D: security
1 (dangerous) ... 9 (safe) E: hospitality
1 (friendly) ... 9 (unfriendly) F: infrastructure
1 (poor) ... 9 (excellent)
Belgium
Denmark
Estonia
France
Germany
Greece
Holland
Hungary
Italy
Romania
Spain
Turkey
U.K.
Data set “attributes”
C ountries living c limate food security hos pitality infras tructure dim 1 dim 2
Italy 7 8 9 5 3 7 0.01 -2.94
Spain 7 9 9 5 2 8 -1.02 -3.68
Croatia 5 6 6 6 5 6 3.70 -0.88
Brazil 5 8 7 3 2 3 -2.56 -2.01
Russia 6 2 2 3 7 6 4.41 2.91
G erm any 8 3 2 8 7 9 5.01 0.00
T urkey 5 8 9 3 1 3 -1.38 -0.48
M oro cco 4 7 8 2 1 2 -0.87 2.27
P eru 5 6 6 3 4 4 -2.77 -0.74
Nigeria 2 4 4 2 3 2 -1.97 3.91
F rance 8 4 7 7 9 8 2.18 -1.76
M exico 2 5 5 2 3 3 -2.58 0.77
S outh Africa 4 4 5 3 3 3 -2.16 2.62
Student MT’s ratings of the 13 countries on 6 attributes:
standard of living (1=low,…,9=high);
climate (1=terrible,…,9=excellent);
food (1=awful,…,9=delicious);
security (1=dangerous,…,9=safe);
hospitality (1=friendly,…,9=unfriendly);
infrastructure (1=poor,…,9=excellent).
On the right are the coordinates of the country points in the MDS map Data on www.multivariatestatistics.org
variances explained:
living: 74.4%
climate: 69.3%
food: 61.0%
security: 78.1%
hospitality: 56.9%
infrastructure: 81.8%
attributes dimns
countries
attributes plotted using regression coefficients
regrn coeffs
Adding variables in regression biplot
variances explained:
a: 76.7%
b:
18.6%
c: 92.1%
d:
14.6%
e:
95.8%
species dimns
sites
species plotted using regression coefficients
regrn coeffs
-1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
-1.0-0.50.00.51.01.5
1
2 3
4 5
7 6
8
9
10
11 12
13 14
16 15
17 18
19 20
21
22 23
24 25
26
27 2928 30 a
b
c d
e C G
S
C G
S
Explained chi-square distances
52.4% + 22.0% = 74.4%
--- :logistic biplot
C, S, G: averages of sites
MDS biplot, based on chi-square distances
R code to perform MDS biplot
#---
# MDS biplot, adding data set "attributes"
#---
# Data set "attributes" on multivariatestatistics.org MT_ratings <- read.table("clipboard")
# Save two-dimensional coordinates of previous classical MDS MT_dims <- MT.mds$points[,1:2]
colnames(MT_dims) <- c("dim1","dim2")
# calculate the regression coefficients of the 6 attributes on the coordinates MT_coefs <- lm(MT_ratings[,1] ~ MT_dims[,1] + MT_dims[,2])$coefficients
for(j in 2:6) MT_coefs <- rbind(MT_coefs,
lm(MT_ratings[,j] ~ MT_dims[,1] + MT_dims[,2])$coefficients)
# plot the regression coefficients on the MDS plot
plot(MT_dims, type = "n", xlab = "dim 1", ylab = "dim 2", asp = 1) text(MT_dims, labels = colnames(MT), col = "blue", font=2)
arrows(0, 0, MT_coefs[,2], MT_coefs[,3], col = "red", length = 0.1, angle = 10) text(1.2*MT_coefs[,2:3], labels = colnames(MT_ratings), col = "black", font = 4,
cex = 0.7)