Structure of Project Report
• Introduction
• Data
• Methods
• Results
• Discussion
• Bibliography
About 15 pages, including figures, text in 1.5 spacing, 11-point font + Appendices: Data (if it fits on one page, or part of the data)
R code (or description of other software used)
Michael Greenacre
WEEK 7:
CA OF CONCATENATED TABLES
MULTIPLE CORRESPONDENCE ANALYSIS (MCA)
Correspondence analysis
and related methods
In simple CA we analyze the relationship between two categorical variables, a row variable and a column variable.
In order to generalize this to more than two variables, we need to distinguish two situations:
1. There are several “explanatory” variables and one or more “response”
variables, and we are interested in the relationship between these two sets.
2. There is a set of “homogeneous” variables, usually all measured on the same scale, and we are interested in their inter-relationships (similar to factor analysis context, but with categorical data).
We have already had an example of the first situation: in the Spanish Health Survey example: explanatory variables = age and sex, response variable = perceived health; we combined the 7 age categories and 2 sex categories into one interactively coded variable with 14 categories and did a simple CA. We first look at this easier approach of stacking – or concatenating – tables
Many categorical variables:
associations between or within sets?
Describing (“explanatory”) variables: A(GE), S(EX), E(DUCATION), etc...
with levels A1, A2, ...; M, F ; E1, E2, ... ; etc...
Variables to be described (“response”): H(EALTH), P(RODUCTS), O(PINIONS), etc... with levels H1, H2, ...; P1, P2, ...; O1, O2, ... ; etc...
M/A1 M/A2
. . F/A1 F/A2
. .
Rows coded interactively H1 H2 ...
A1 A2 . . M
F E1 E2 . . .
H1 H2 ...
A1 A2 . . M
F E1 E2 . . .
H1 H2 ... P1 P2 ... O1 O2 ...
Concatenation of tables
• ISSP 1994 survey on Family & Changing Gender Roles
• 33123 respondents
• 24 countries (former West and East Germany still kept separate)
• We focus on four questions related to women’s participation in work outside the home
• Should women “work full-time” (W), “work part-time” (w), “stay at home”
(H), or “unsure/don’t know”(?) at these four different periods of married life:
1. before having a child, with possible responses 1W 1w 1H 1?
2. with a pre-school child 2W 2w 2H 2?
3. when youngest child is still at school 3W 3w 3H 3?
4. when all children have left home 4W 4w 4H 4?
Example: “women working” data set from ISSP94
We also have various explanatory variables, from which we select the following six:
Country 24 countries: AUS (Australia), DW (West Germany), DE (East Germany), GB (Great Britain), NI (Northern Ireland), USA,
A (Austria), H (Hungary), I (Italy), IRL (Ireland), NL (Netherlands), N (Norway), S (Sweden), CZ (Czechoslovakia), SLO (Slovenia), PL (Poland), BG (Bulgaria), RUS (Russia), NZ (New Zealand), CDN (Canada), RP (Philippines), IL (Israel), J (Japan), E (Spain)
Sex 2 categories: M, F
Age 6 groups: A1 (up to 25), A2 (26-35), A3 (36-45) A4 (46-55), A5 (56-65), A6 (66 and over)
Marital status5 groups: ma (married), wi(widowed), di (divorced), se (separated), si(single)
Education 7 groups:E0 (none), E1 (incomplete primary), E2 (primary), E3 (incomplete secondary), E4 (secondary), E5 (incomplete tertiary), E6 (tertiary)
Social class 7 groups: S0 (other), S1 (lower class), S2 (working class),
S3 (upper working/lower middle), S4 (middle), S5 (upper middle), S6 (upper)
“women working” data – explanatory variables
(a) (b)
1 2 3 4 Sex Age ...
1 2 3 4 Sex Age... W w P ? W w P ? W w P ? W w P ? M F 1 2 3 4 5 6
33123
cases
1 3 2 2 2 6 ...
1 2 2 2 2 4 ...
1 3 4 4 2 1 ...
1 2 2 1 2 4 ...
1 3 2 4 1 5 ...
1 2 1 1 2 1 ...
. . . . . . ...
. . . . . . ...
. . . . . . ...
. . . . . . ...
1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 ...
1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 ...
1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 ...
1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0 ...
1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 ...
1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 ...
. . . . . . . . . . . . . . . . . . . . . ...
. . . . . . . . . . . . . . . . . . . . . ...
. . . . . . . . . . . . . . . . . . . . . ...
. . . . . . . . . . . . . . . . . . . . . ...
1 2 3 4 W w H ? W w H ? W w H ? W w H ? Sex M
Sex F Age A1 Age A2 Age A3 Age A4 Age A5 Age A6 ...
...
...
10309 2436 1207 1091 3220 7634 3099 1090 1360 4972 7732 979 9300 3030 1182 1531 13712 2435 842 1056 4020 9999 2829 1197 1628 6926 8324 1167 12100 3408 893 1644 3562 744 272 361 1621 2408 527 383 494 2095 1947 403 3448 741 261 489 5192 885 337 459 1829 3659 864 521 804 2840 2710 519 4668 1151 334 720 5173 879 332 486 1523 3795 1038 514 738 2603 3053 476 4516 1337 321 696 4006 767 321 326 953 3065 1062 340 399 1833 2879 309 3436 1154 326 504 3158 749 338 231 711 2406 1120 239 317 1397 2577 185 2691 1037 397 351 2953 851 453 288 609 2320 1322 294 240 1139 2908 258 2660 1027 436 422 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(c)
Z
1Z
2Z
2TZ
1(a) Original (b) Indicator (c) Concatenated
1 2 3 4 W w H ? W w H ? W w H ? W w H ? Sex M
Sex F Age A1 Age A2 Age A3 Age A4 Age A5 Age A6 ...
...
...
10309 2436 1207 1091 3220 7634 3099 1090 1360 4972 7732 979 9300 3030 1182 1531 13712 2435 842 1056 4020 9999 2829 1197 1628 6926 8324 1167 12100 3408 893 1644 3562 744 272 361 1621 2408 527 383 494 2095 1947 403 3448 741 261 489 5192 885 337 459 1829 3659 864 521 804 2840 2710 519 4668 1151 334 720 5173 879 332 486 1523 3795 1038 514 738 2603 3053 476 4516 1337 321 696 4006 767 321 326 953 3065 1062 340 399 1833 2879 309 3436 1154 326 504 3158 749 338 231 711 2406 1120 239 317 1397 2577 185 2691 1037 397 351 2953 851 453 288 609 2320 1322 294 240 1139 2908 258 2660 1027 436 422 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1= 0.009451 ( 93.0%)
2= 0.000582 ( 5.7%)
average +
Concatenated format
S 6
S 5
S 4
S 3 S 2
S 1
E6 S 0
E5
E4
E3 E2
E1
E0 s i
s e
di
m a wi
A5 A6 A4
A3 A2 A1
F
M E
J IL
R P
C DN
NZ
R US B G
P L
S LO
C Z S
N NL
IR L
I
H
A US A
NIR L
GB DE
DW AUS
4?
4H
4w 4W
3?
3H 3w
3W
2?
2H
2w 2W
1?
1H
1w
1W
-0,4 -0,3 -0,2 -0,1 0 0,1 0,2 0,3 0,4 0,5
1= 0.018888 (48.8%)
2= 0.008542 (22.1%)
2H
3H
2H E2 2W
3W
CA of concatenated matrix
S = 6
Q = 4
Inertia of concatenated table = Average of inertias in QS subtables
Inertia = Average inertia over all tables
Tables (e.g., C(OUNTRIES) by Q(UESTION) repeated over T(IME) T1, T2, ...
T1/C1 T1/C2
. . T2/C1 T2/C2
. . T3/C1 T3/C2
. . T4/C1 T4/C2
. .
Q1 Q2 ...
CA
C1
C2 C3
Q1
Q2 Q3
Q4
Q5
Trajectories of countries over time
Trend studies
Data set “women 2002”
ISSP 2002 survey on Family and Changing Gender Roles III. Spanish sample
(N=2471), but respondents with some missing values taken out, leaving N=2107.
A: a working mother can establish a warm relationship with her child B: a pre-school child suffers if his or her mother works
C: when a woman works the family life suffers D: what women really want is a home and kids
E: running a household is just as satisfying as a paid job F: work is best for a woman’s independence
G: a man’s job is to work; a woman’s job is the household H: working women should get paid maternity leave
with responses: 1=strongly agree, 2=agree, 3=neither agree nor disagree, 4=disagree, 5=strongly disagree
g: gender (1=male, 2=female)
m: marital status (1=married/living as married, 2=widowed, 3=divorced, 4=separated, but married, 5=single, never married)
e: education (0=no formal education, 1=lowest education, 2= above lowest education, 3=higher secondary completed, 4=above higher secondary level, below full university, 5=university degree completed a: age (1=16-25 years, 2= 26-35, 3=36-45, 4=46-55, 5=56-65, 6=66 and older) Substantive
questions:
Demographics:
A B C D E F G H g m e a 2 4 3 3 4 1 4 1 1 1 3 4 2 4 3 9 4 1 4 1 1 1 3 3 3 2 2 3 4 1 3 2 2 2 1 6 3 9 2 2 2 1 3 1 1 1 1 6 9 1 2 2 3 2 3 1 2 1 1 4 2 4 4 4 2 2 5 1 1 5 4 2 1 3 2 3 2 4 4 2 2 1 5 5 2 9 4 9 9 1 3 1 1 5 3 5 2 4 2 3 4 4 5 1 2 1 5 3 2 4 2 2 4 2 4 1 2 1 2 3 . . . . . . . . . . . . . . . . . . . . . 2471rows
(2107 without missing data)
Data file “women 2002”
A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 n
m1 192 486 54 381 56 80 550 138 334 67 ··· 1169 m2 21 68 9 63 11 13 101 16 37 5 ··· 172
m3 10 17 2 12 4 5 16 10 12 2 ··· 45
m4 12 32 5 16 4 4 30 7 24 4 ··· 69
m5 162 329 21 126 14 22 259 66 258 47 ··· 652 e0 23 97 16 90 14 20 138 26 52 4 ··· 240 e1 76 203 21 192 30 39 286 62 114 21 ··· 522 e2 99 263 28 168 22 37 264 69 182 28 ··· 580 e3 100 203 16 95 17 17 178 43 160 33 ··· 431 e4 48 81 2 32 3 5 52 13 78 18 ··· 166 e5 51 85 8 21 3 6 38 24 79 21 ··· 168 ma1 38 80 3 27 3 6 70 13 55 7 ··· 151 ma2 41 116 9 53 7 10 92 23 92 9 ··· 226 ma3 30 94 13 37 7 14 77 18 60 12 ··· 181 ma4 25 65 9 40 4 13 68 12 43 7 ··· 143 ma5 16 54 0 50 5 15 72 17 18 3 ··· 125 ma6 15 43 9 64 15 18 95 13 18 2 ··· 146 fa1 48 83 7 39 6 5 56 21 84 17 ··· 183 fa2 59 92 5 58 13 10 82 23 86 26 ··· 227 fa3 46 81 9 51 6 7 78 21 68 19 ··· 193 fa4 37 75 7 60 3 7 76 30 55 14 ··· 182 fa5 21 52 5 42 6 6 65 15 34 6 ··· 126 fa6 21 97 15 77 14 13 125 31 52 3 ··· 224
23 rows, 39 columns (H4 and H5 combined) inter-
active coding of
gender and age marital
status
education
Remember that:
In such a supermatrix, where the rows and columns have the same margins in each
submatrix, the inertia is the average of the inertias of the subtables
Concatenated table
-2 -1 0 1 2 3
-2-10123
A1
A2
A3 A4
A5
B2 B1 B3
B4 B5
C1
C2
C3 C4
C5
D1
D2 D3
D4 D5
E1 E2 E3
E4 E5
F1
F2 F3
F4 F5
G2 G1
G3 G4
G5
H1
H2
H3
H45
CA asymmetric row principal map/biplot
H45
H3
H2 H1
G5
G4
G3
G2 G1
F5 F4 F3 F2
F1 E5
E4
E3 E2 E1
D5
D4
D3
D2 D1 C5
C4
C3
C2 C1 B5
B4
B3
B2
B1 A5
A4
A2 A3 A1
fa6 fa5
fa4 fa3
fa2
fa1
ma6
ma5 ma4
ma3 ma2 ma1 e5 e4
e3
e2
e1 e0
m5 m4 m3
m2
m1
-0.2 0 0.2
-0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8
0.0571 (82.1%) 0.00303 (4.4%)
CA symmetric map
(strictly speaking, not a biplot)
R script for subset CA of “women 2002” (N=2471)
# read original data for Spain, 2002
women02 <- read.table("clipboard", header=T)
# getting the stacked matrix from the Burt matrix women02.B <- mjca(women02)$Burt
rownames(women02.B) colnames(women02.B)
women02.stack <- women02.B[49:69,1:48]
rownames(women02.stack) colnames(women02.stack)
# checking frequencies of each substantive category apply(women02.stack,2,sum)/4
# combine H4 and H5 (H5 frequency very low)
women02.stack[,46] <- women02.stack[,46]+women02.stack[,47]
women02.stack <- women02.stack[,-47]
# CA of whole stacked matrix plot(ca(women02.stack))
# checking frequencies of each demographic category apply(women02.stack,1,sum)/8
# subset analysis (subsetting out the missing categories) plot(ca(women02.stack, subsetrow=seq(1,21)[-c(8,15)], + subsetcol=seq(1,47)[-c(6,12,18,24,30,36,42,47)]))
1
48 49
69
1 48 49 69
Subset CA of “women 2002”, excluding missings
Contribution biplot
(cleaned up – categories removed inside box)
Multiple correspondence analysis (MCA):
the indicator matrix
Original data Indicator matrices
A B C D E F G H g m e a ga A1 A2 A3 A4 A 5 B1 B2 B3 B4 B5 ··· g 1 g2 m 1 m 2 m 3 m 4 m 5 ···
2 4 3 3 4 1 4 1 1 1 3 4 4 0 1 0 0 0 0 0 0 1 0 ··· 1 0 1 0 0 0 0 ···
3 2 2 3 4 1 3 2 2 2 1 6 12 0 0 1 0 0 0 1 0 0 0 ··· 0 1 0 1 0 0 0 ···
2 4 4 4 2 2 5 1 1 5 4 2 2 0 1 0 0 0 0 0 0 1 0 ··· 1 0 0 0 0 0 1 ···
1 3 2 3 2 4 4 2 2 1 5 5 11 1 0 0 0 0 0 0 1 0 0 ··· 0 1 1 0 0 0 0 ···
2 4 2 3 4 4 5 1 2 1 5 3 9 0 1 0 0 0 0 0 0 1 0 ··· 0 1 1 0 0 0 0 ···
: : : : : : : : : : : : : : : : : : : : : : : ··· : : : : : : : ···
Total inertia of an indicator matrix Z = (J–Q)/Q
For example, for the substantive variables, i.e. first 39 columns of above indicator matrix,
total inertia = (39–8)/8 = 3.875
#categories #variables