Correspondence analysis and related methods


Structure of Project Report

• Introduction

• Data

• Methods

• Results

• Discussion

• Bibliography

About 15 pages, including figures; text in 1.5 spacing, 11-point font.

Appendices:

• Data (if it fits on one page, or part of the data)

• R code (or description of other software used)

Michael Greenacre

WEEK 7:

CA OF CONCATENATED TABLES

MULTIPLE CORRESPONDENCE ANALYSIS (MCA)


• In simple CA we analyze the relationship between two categorical variables, a row variable and a column variable.

• To generalize this to more than two variables, we need to distinguish two situations:

1. There are several “explanatory” variables and one or more “response” variables, and we are interested in the relationship between these two sets.

2. There is a set of “homogeneous” variables, usually all measured on the same scale, and we are interested in their inter-relationships (similar to the factor analysis context, but with categorical data).

We have already seen an example of the first situation in the Spanish Health Survey: the explanatory variables were age and sex, and the response variable was perceived health. We combined the 7 age categories and 2 sex categories into one interactively coded variable with 14 categories and did a simple CA. We first look at the easier approach of stacking, or concatenating, tables.
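The interactive coding described above can be sketched in base R. This is only an illustration on simulated data: the variable names, the sample size and the five health categories are assumptions, not the Spanish Health Survey data.

```r
# Hypothetical toy data (NOT the survey data): 2 sexes, 7 age groups,
# and an assumed 5-category response
set.seed(1)
n <- 200
sex    <- factor(sample(c("M", "F"), n, replace = TRUE), levels = c("M", "F"))
age    <- factor(sample(paste0("A", 1:7), n, replace = TRUE), levels = paste0("A", 1:7))
health <- factor(sample(paste0("H", 1:5), n, replace = TRUE), levels = paste0("H", 1:5))

# Interactive coding: one combined variable with 2 x 7 = 14 categories
sex_age <- interaction(sex, age, sep = "/")

# Two-way table (14 rows x 5 columns) that a simple CA would analyse
tab <- table(sex_age, health)
dim(tab)
```

The resulting table has one row per sex-age combination, so a simple CA of it shows the interaction between the two explanatory variables rather than their separate effects.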

Many categorical variables: associations between or within sets?

Describing (“explanatory”) variables: A(GE), S(EX), E(DUCATION), etc., with levels A1, A2, ...; M, F; E1, E2, ...; etc.

Variables to be described (“response”): H(EALTH), P(RODUCTS), O(PINIONS), etc., with levels H1, H2, ...; P1, P2, ...; O1, O2, ...; etc.

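[Diagram: three table layouts. (i) Rows coded interactively (M/A1, M/A2, ..., F/A1, F/A2, ...) by columns H1, H2, ... (ii) Rows stacked (A1, A2, ...; M, F; E1, E2, ...) by columns H1, H2, ... (iii) The same stacked rows by concatenated columns H1, H2, ...; P1, P2, ...; O1, O2, ...]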

Concatenation of tables


• ISSP 1994 survey on Family & Changing Gender Roles

• 33123 respondents

• 24 countries (former West and East Germany still kept separate)

• We focus on four questions related to women’s participation in work outside the home

• Should women “work full-time” (W), “work part-time” (w), “stay at home” (H), or “unsure/don’t know” (?) at these four different periods of married life:

1. before having a child, with possible responses 1W 1w 1H 1?

2. with a pre-school child: 2W 2w 2H 2?

3. when the youngest child is still at school: 3W 3w 3H 3?

4. when all children have left home: 4W 4w 4H 4?

Example: “women working” data set from ISSP94

We also have various explanatory variables, from which we select the following six:

Country: 24 countries: AUS (Australia), DW (West Germany), DE (East Germany), GB (Great Britain), NI (Northern Ireland), USA, A (Austria), H (Hungary), I (Italy), IRL (Ireland), NL (Netherlands), N (Norway), S (Sweden), CZ (Czechoslovakia), SLO (Slovenia), PL (Poland), BG (Bulgaria), RUS (Russia), NZ (New Zealand), CDN (Canada), RP (Philippines), IL (Israel), J (Japan), E (Spain)

Sex: 2 categories: M, F

Age: 6 groups: A1 (up to 25), A2 (26-35), A3 (36-45), A4 (46-55), A5 (56-65), A6 (66 and over)

Marital status: 5 groups: ma (married), wi (widowed), di (divorced), se (separated), si (single)

Education: 7 groups: E0 (none), E1 (incomplete primary), E2 (primary), E3 (incomplete secondary), E4 (secondary), E5 (incomplete tertiary), E6 (tertiary)

Social class: 7 groups: S0 (other), S1 (lower class), S2 (working class), S3 (upper working/lower middle), S4 (middle), S5 (upper middle), S6 (upper)

“women working” data – explanatory variables


(a) Original data (33123 cases)

   1  2  3  4  Sex  Age  ...
   1  3  2  2    2    6  ...
   1  2  2  2    2    4  ...
   1  3  4  4    2    1  ...
   1  2  2  1    2    4  ...
   1  3  2  4    1    5  ...
   1  2  1  1    2    1  ...
   .  .  .  .    .    .  ...

(b) Indicator matrices Z₁ (questions 1 to 4) and Z₂ (Sex, Age, ...)

  1W 1w 1H 1? | 2W 2w 2H 2? | 3W 3w 3H 3? | 4W 4w 4H 4? |  M  F | A1 A2 A3 A4 A5 A6 ...
   1  0  0  0 |  0  0  1  0 |  0  1  0  0 |  0  1  0  0 |  0  1 |  0  0  0  0  0  1 ...
   1  0  0  0 |  0  1  0  0 |  0  1  0  0 |  0  1  0  0 |  0  1 |  0  0  0  1  0  0 ...
   1  0  0  0 |  0  0  1  0 |  0  0  0  1 |  0  0  0  1 |  0  1 |  1  0  0  0  0  0 ...
   1  0  0  0 |  0  1  0  0 |  0  1  0  0 |  1  0  0  0 |  0  1 |  0  0  0  1  0  0 ...
   1  0  0  0 |  0  0  1  0 |  0  1  0  0 |  0  0  0  1 |  1  0 |  0  0  0  0  1  0 ...
   1  0  0  0 |  0  1  0  0 |  1  0  0  0 |  1  0  0  0 |  0  1 |  1  0  0  0  0  0 ...
   .  .  .  .    .  .  .  .    .  .  .  .    .  .  .  .    .  .    .  .  .  .  .  . ...

(c) Concatenated table Z₂ᵀZ₁

            1W    1w    1H    1?    2W    2w    2H    2?    3W    3w    3H    3?    4W    4w    4H    4?
Sex M    10309  2436  1207  1091  3220  7634  3099  1090  1360  4972  7732   979  9300  3030  1182  1531
Sex F    13712  2435   842  1056  4020  9999  2829  1197  1628  6926  8324  1167 12100  3408   893  1644
Age A1    3562   744   272   361  1621  2408   527   383   494  2095  1947   403  3448   741   261   489
Age A2    5192   885   337   459  1829  3659   864   521   804  2840  2710   519  4668  1151   334   720
Age A3    5173   879   332   486  1523  3795  1038   514   738  2603  3053   476  4516  1337   321   696
Age A4    4006   767   321   326   953  3065  1062   340   399  1833  2879   309  3436  1154   326   504
Age A5    3158   749   338   231   711  2406  1120   239   317  1397  2577   185  2691  1037   397   351
Age A6    2953   851   453   288   609  2320  1322   294   240  1139  2908   258  2660  1027   436   422
...          .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .

(a) Original (b) Indicator (c) Concatenated
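The step from original codes to indicator matrices and then to the concatenated table can be sketched in base R. The six respondents are the example rows of panel (a). Two things are assumptions here: that sex code 1 means M and 2 means F (taken from the column order), and the helper name `indicator`, which is introduced only for this illustration.

```r
# The six example respondents of panel (a): questions 1-4, sex, age
dat <- data.frame(
  q1  = c(1, 1, 1, 1, 1, 1),
  q2  = c(3, 2, 3, 2, 3, 2),
  q3  = c(2, 2, 4, 2, 2, 1),
  q4  = c(2, 2, 4, 1, 4, 1),
  sex = c(2, 2, 2, 2, 1, 2),
  age = c(6, 4, 1, 4, 5, 1)
)

# Hypothetical helper: 0/1 dummy coding of a variable coded 1..k
indicator <- function(x, k) outer(x, seq_len(k), "==") * 1

# Z1: indicator matrix of the four questions (4 categories each)
Z1 <- cbind(indicator(dat$q1, 4), indicator(dat$q2, 4),
            indicator(dat$q3, 4), indicator(dat$q4, 4))
colnames(Z1) <- paste0(rep(1:4, each = 4), c("W", "w", "H", "?"))

# Z2: indicator matrix of the demographics (assumed: 1 = M, 2 = F)
Z2 <- cbind(indicator(dat$sex, 2), indicator(dat$age, 6))
colnames(Z2) <- c("M", "F", paste0("A", 1:6))

# Concatenated table: demographic categories by response categories
conc <- t(Z2) %*% Z1
conc
```

Each block of `conc` is an ordinary cross-tabulation of one demographic variable with one question, so on the full data set this product would give the table in panel (c).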


₁= 0.009451 ( 93.0%)

₂= 0.000582 ( 5.7%)


Concatenated format


[Figure: CA map of the concatenated matrix, displaying the categories of the six explanatory variables (country, sex, age, marital status, education, social class) together with the response categories 1W, 1w, 1H, 1?, ..., 4W, 4w, 4H, 4?. Horizontal axis ticks run from -0.4 to 0.5. Principal inertias: λ₁ = 0.018888 (48.8%), λ₂ = 0.008542 (22.1%).]

CA of concatenated matrix

S = 6 explanatory variables

Q = 4 questions

Inertia of concatenated table = average of the inertias of the Q × S = 24 subtables
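The averaging property can be checked numerically. The sketch below uses simulated data with no missing values (so every subtable has the same margins), computes the inertia directly as the chi-square statistic divided by the table total, and compares the concatenated table with the mean over its subtables. The data, the variable names and the helper functions `draw`, `inertia` and `subtab` are all illustrative assumptions.

```r
set.seed(42)
n <- 400

# Hypothetical helper: a factor with k levels, each level occurring at least once
draw <- function(k) factor(c(seq_len(k), sample(seq_len(k), n - k, replace = TRUE)))

# Simulated complete data: S = 3 explanatory and Q = 2 response variables
expl <- data.frame(s1 = draw(2), s2 = draw(6), s3 = draw(5))
resp <- data.frame(q1 = draw(4), q2 = draw(4))

# Total inertia of a table = chi-square statistic / grand total
inertia <- function(N) {
  P <- N / sum(N)
  E <- outer(rowSums(P), colSums(P))   # expected proportions under independence
  sum((P - E)^2 / E)
}

# One cross-tabulation per (explanatory, response) pair
subtab <- function(s, q) unclass(table(expl[[s]], resp[[q]]))

# Concatenated table: subtables side by side, then stacked
stacked <- do.call(rbind, lapply(names(expl), function(s)
  do.call(cbind, lapply(names(resp), function(q) subtab(s, q)))))

# Inertias of the Q x S subtables
sub_inertias <- sapply(names(expl), function(s)
  sapply(names(resp), function(q) inertia(subtab(s, q))))

c(concatenated = inertia(stacked), average = mean(sub_inertias))
```

The two numbers agree up to rounding error. The equality relies on every respondent appearing in every subtable; with missing values the margins differ between subtables and the result holds only approximately.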


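Tables (e.g., C(OUNTRIES) by Q(UESTION)) repeated over T(IME): T1, T2, ...

[Diagram: the tables are stacked by time and country (rows T1/C1, T1/C2, ...; T2/C1, T2/C2, ...; T3/C1, T3/C2, ...; T4/C1, T4/C2, ...) with columns Q1, Q2, ...; CA of the stacked table gives a map of the questions Q1 to Q5 with the positions of countries C1, C2, C3.]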

Trajectories of countries over time

Trend studies

Data set “women 2002”

ISSP 2002 survey on Family and Changing Gender Roles III. Spanish sample (N=2471), but respondents with some missing values were taken out, leaving N=2107.

Substantive questions:

A: a working mother can establish a warm relationship with her child

B: a pre-school child suffers if his or her mother works

C: when a woman works the family life suffers

D: what women really want is a home and kids

E: running a household is just as satisfying as a paid job

F: work is best for a woman’s independence

G: a man’s job is to work; a woman’s job is the household

H: working women should get paid maternity leave

with responses: 1=strongly agree, 2=agree, 3=neither agree nor disagree, 4=disagree, 5=strongly disagree

Demographics:

g: gender (1=male, 2=female)

m: marital status (1=married/living as married, 2=widowed, 3=divorced, 4=separated, but married, 5=single, never married)

e: education (0=no formal education, 1=lowest education, 2=above lowest education, 3=higher secondary completed, 4=above higher secondary level, below full university, 5=university degree completed)

a: age (1=16-25 years, 2=26-35, 3=36-45, 4=46-55, 5=56-65, 6=66 and older)


A B C D E F G H g m e a
2 4 3 3 4 1 4 1 1 1 3 4
2 4 3 9 4 1 4 1 1 1 3 3
3 2 2 3 4 1 3 2 2 2 1 6
3 9 2 2 2 1 3 1 1 1 1 6
9 1 2 2 3 2 3 1 2 1 1 4
2 4 4 4 2 2 5 1 1 5 4 2
1 3 2 3 2 4 4 2 2 1 5 5
2 9 4 9 9 1 3 1 1 5 3 5
2 4 2 3 4 4 5 1 2 1 5 3
2 4 2 2 4 2 4 1 2 1 2 3
. . . . . . . . . . . .

2471 rows (2107 without missing data)

Data file “women 2002”

        A1   A2   A3   A4   A5   B1   B2   B3   B4   B5  ···     n
m1     192  486   54  381   56   80  550  138  334   67  ···  1169
m2      21   68    9   63   11   13  101   16   37    5  ···   172
m3      10   17    2   12    4    5   16   10   12    2  ···    45
m4      12   32    5   16    4    4   30    7   24    4  ···    69
m5     162  329   21  126   14   22  259   66  258   47  ···   652
e0      23   97   16   90   14   20  138   26   52    4  ···   240
e1      76  203   21  192   30   39  286   62  114   21  ···   522
e2      99  263   28  168   22   37  264   69  182   28  ···   580
e3     100  203   16   95   17   17  178   43  160   33  ···   431
e4      48   81    2   32    3    5   52   13   78   18  ···   166
e5      51   85    8   21    3    6   38   24   79   21  ···   168
ma1     38   80    3   27    3    6   70   13   55    7  ···   151
ma2     41  116    9   53    7   10   92   23   92    9  ···   226
ma3     30   94   13   37    7   14   77   18   60   12  ···   181
ma4     25   65    9   40    4   13   68   12   43    7  ···   143
ma5     16   54    0   50    5   15   72   17   18    3  ···   125
ma6     15   43    9   64   15   18   95   13   18    2  ···   146
fa1     48   83    7   39    6    5   56   21   84   17  ···   183
fa2     59   92    5   58   13   10   82   23   86   26  ···   227
fa3     46   81    9   51    6    7   78   21   68   19  ···   193
fa4     37   75    7   60    3    7   76   30   55   14  ···   182
fa5     21   52    5   42    6    6   65   15   34    6  ···   126
fa6     21   97   15   77   14   13  125   31   52    3  ···   224

23 rows, 39 columns (H4 and H5 combined)

Row groups: marital status (m1-m5), education (e0-e5), interactive coding of gender and age (ma1-ma6, fa1-fa6)

Remember that:

In such a supermatrix, where the rows and columns have the same margins in each submatrix, the inertia is the average of the inertias of the subtables.

Concatenated table


[Figure: CA asymmetric row principal map/biplot of the concatenated table, showing the column categories A1, ..., H45; both axes run from -2 to 3.]

[Figure: CA symmetric map (strictly speaking, not a biplot) of the concatenated table, showing the row categories (m1-m5, e0-e5, ma1-ma6, fa1-fa6) and the column categories (A1, ..., H45). Principal inertias: 0.0571 (82.1%) and 0.00303 (4.4%).]


R script for subset CA of “women 2002” (N=2471)

# load the ca package, which provides mjca() and ca()
library(ca)

# read original data for Spain, 2002 (copied to the clipboard)
women02 <- read.table("clipboard", header = TRUE)

# get the stacked matrix from the Burt matrix
women02.B <- mjca(women02)$Burt
rownames(women02.B)
colnames(women02.B)
women02.stack <- women02.B[49:69, 1:48]
rownames(women02.stack)
colnames(women02.stack)

# check frequencies of each substantive category
apply(women02.stack, 2, sum) / 4

# combine H4 and H5 (H5 frequency very low)
women02.stack[, 46] <- women02.stack[, 46] + women02.stack[, 47]
women02.stack <- women02.stack[, -47]

# CA of whole stacked matrix
plot(ca(women02.stack))

# check frequencies of each demographic category
apply(women02.stack, 1, sum) / 8

# subset analysis (subsetting out the missing categories)
plot(ca(women02.stack, subsetrow = seq(1, 21)[-c(8, 15)],
        subsetcol = seq(1, 47)[-c(6, 12, 18, 24, 30, 36, 42, 47)]))

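[Diagram: Burt matrix with rows and columns numbered 1 to 69; the stacked matrix is the block formed by rows 49-69 and columns 1-48.]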

Subset CA of “women 2002”, excluding missings


Contribution biplot

(cleaned up – categories removed inside box)

Multiple correspondence analysis (MCA): the indicator matrix

Original data                 Indicator matrices

A B C D E F G H g m e a  ga | A1 A2 A3 A4 A5  B1 B2 B3 B4 B5 ···  g1 g2  m1 m2 m3 m4 m5 ···
2 4 3 3 4 1 4 1 1 1 3 4   4 |  0  1  0  0  0   0  0  0  1  0 ···   1  0   1  0  0  0  0 ···
3 2 2 3 4 1 3 2 2 2 1 6  12 |  0  0  1  0  0   0  1  0  0  0 ···   0  1   0  1  0  0  0 ···
2 4 4 4 2 2 5 1 1 5 4 2   2 |  0  1  0  0  0   0  0  0  1  0 ···   1  0   0  0  0  0  1 ···
1 3 2 3 2 4 4 2 2 1 5 5  11 |  1  0  0  0  0   0  0  1  0  0 ···   0  1   1  0  0  0  0 ···
2 4 2 3 4 4 5 1 2 1 5 3   9 |  0  1  0  0  0   0  0  0  1  0 ···   0  1   1  0  0  0  0 ···
: : : : : : : : : : : :   : |  :  :  :  :  :   :  :  :  :  : ···   :  :   :  :  :  :  : ···

Total inertia of an indicator matrix Z = (J − Q)/Q, where J = number of categories and Q = number of variables.

For example, for the substantive variables, i.e. the first 39 columns of the above indicator matrix:

total inertia = (39 − 8)/8 = 3.875
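The formula for the total inertia of an indicator matrix can be verified numerically in base R. The sketch below uses simulated data with Q = 4 variables and J = 14 categories. The data and the helper functions `draw`, `indicator` and `inertia` are illustrative assumptions, and every category is made to occur at least once so that no column of Z is empty.

```r
set.seed(7)
n <- 300

# Hypothetical helper: a factor with k levels, each level occurring at least once
draw <- function(k) factor(c(seq_len(k), sample(seq_len(k), n - k, replace = TRUE)))

# Simulated data: Q = 4 categorical variables with 3 + 4 + 5 + 2 = 14 categories
dat <- data.frame(v1 = draw(3), v2 = draw(4), v3 = draw(5), v4 = draw(2))

# 0/1 dummy coding of one factor, one column per level
indicator <- function(f) sapply(levels(f), function(l) as.numeric(f == l))

# Indicator matrix Z: the dummy blocks side by side (n rows, J columns)
Z <- do.call(cbind, lapply(dat, indicator))

# Total inertia = chi-square statistic / grand total
inertia <- function(N) {
  P <- N / sum(N)
  E <- outer(rowSums(P), colSums(P))
  sum((P - E)^2 / E)
}

J <- ncol(Z)     # number of categories
Q <- ncol(dat)   # number of variables
c(computed = inertia(Z), formula = (J - Q) / Q)

# The slide's example: 39 categories, 8 variables
(39 - 8) / 8
```

The total inertia depends only on J and Q, not on the associations in the data, which is one reason the percentages of inertia in an MCA of the indicator matrix tend to look low.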