The data for this example come from a study of the effects of childhood sexual abuse on adult females reported in Rodriguez et al. (1997): 45 women treated at a clinic, who reported childhood sexual abuse (csa), were measured for post-traumatic stress disorder (ptsd) and childhood physical abuse (cpa), both on standardized scales. Thirty-one women treated at the same clinic, who did not report childhood sexual abuse, were also measured. The full study was more complex than reported here, so readers interested in the subject matter should refer to the original article.
We take a look at the data and produce a summary subsetted by csa:
> data (sexab)
> sexab
      cpa    ptsd    csa
1 2.04786 9.71365 Abused
2 0.83895 6.16933 Abused
> plot(ptsd ~ cpa, pch=as.character(csa), sexab)
In the left panel of Figure 13.2, we see that those in the abused group have higher levels of PTSD than those in the nonabused group. We can test this difference:
> t.test(sexab$ptsd[1:45], sexab$ptsd[46:76])
        Welch Two Sample t-test
data: sexab$ptsd[1:45] and sexab$ptsd[46:76]
t = 8.9006, df = 63.675, p-value = 8.803e-13
alt. hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 5.6189 8.8716
sample estimates:
mean of x mean of y
  11.9411    4.6959
and find that it is clearly significant. However, in the right panel of Figure 13.2 we see that there is a positive correlation between PTSD and childhood physical abuse, and in the numerical summary we see that those in the abused group suffered higher levels of cpa (3.08 vs. 1.31) than those in the nonabused group. This suggests physical abuse as an alternative explanation of the higher PTSD in the sexually abused group.
Figure 13.2 PTSD comparison of abused and nonabused subjects on the left. A=Abused and N=NotAbused on the right.
ANCOVA allows us to disentangle these two competing explanations. We fit the separate regression lines model, which gives each group its own intercept and slope; ptsd ~ cpa*csa is an equivalent model formula:
> g <- lm(ptsd ~ cpa+csa+cpa:csa, sexab)
> summary(g)
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        10.557      0.806   13.09  < 2e-16
cpa                 0.450      0.208    2.16    0.034
csaNotAbused       -6.861      1.075   -6.38  1.5e-08
cpa:csaNotAbused    0.314      0.368    0.85    0.397

Residual standard error: 3.28 on 72 degrees of freedom
Multiple R-Squared: 0.583, Adjusted R-squared: 0.565
F-statistic: 33.5 on 3 and 72 DF, p-value: 1.13e-13
Because csa is nonnumeric, R automatically treats it as a qualitative variable and sets up a coding. We can discover the coding by examining the X-matrix:
> model.matrix(g)
   (Intercept)     cpa csaNotAbused cpa:csaNotAbused
1            1 2.04786            0          0.00000
2            1 0.83895            0          0.00000
……
75           1 2.85253            1          2.85253
76           1 0.81138            1          0.81138
We see that “Abused” is coded as zero and “NotAbused” is coded as one. The default choice is made alphabetically. This means that “Abused” is the reference level here and that the parameters represent the difference between “NotAbused” and this reference level. In this case, it would be slightly more convenient if the coding were reversed. The interaction term cpa:csaNotAbused is represented in the fourth column of the matrix as the product of the second and third columns, which represent the terms from which the interaction is formed.
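The coding described above can be reproduced on a small example. The following is a minimal sketch, assuming a four-row toy data frame (the values and the names `toy`, `x`, and `grp` are hypothetical and are not the sexab data). It builds the design matrix for a model with an interaction and checks that the interaction column is the product of the two main-effect columns.

```r
# Toy data: hypothetical values, not taken from sexab
toy <- data.frame(
  x   = c(1.5, 0.5, 2.0, 1.0),
  grp = factor(c("Abused", "Abused", "NotAbused", "NotAbused"))
)

# factor() orders levels alphabetically, so "Abused" becomes the reference level
levels(toy$grp)

# Design matrix for main effects plus interaction (same as ~ x + grp + x:grp)
X <- model.matrix(~ x * grp, data = toy)
print(X)

# The interaction column equals the elementwise product of its parent columns
all(X[, "x:grpNotAbused"] == X[, "x"] * X[, "grpNotAbused"])
```

Because the dummy variable is zero for the reference group, the interaction column is zero there and equals `x` for the other group.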
We see that the model can be simplified because the interaction term is not significant.
We reduce to this model:
> g <- lm(ptsd ~ cpa+csa, sexab)
> summary(g)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    10.248      0.719   14.26  < 2e-16
cpa             0.551      0.172    3.21    0.002
csaNotAbused   -6.273      0.822   -7.63  6.9e-11

Residual standard error: 3.27 on 73 degrees of freedom
Multiple R-Squared: 0.579, Adjusted R-squared: 0.567
F-statistic: 50.1 on 2 and 73 DF, p-value: 2e-14
No further simplification is possible because the remaining predictors are statistically significant.
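The decision to drop the interaction rested on the t-test for its coefficient. An equivalent way to frame the same decision is an F-test comparing the two nested models with anova(). The sketch below uses simulated data (the data frame `toy` and its variables are hypothetical, generated only for illustration) to show that, for a single-parameter term, the F-statistic is the square of the t-statistic.

```r
set.seed(1)  # simulated data, for illustration only
n <- 40
toy <- data.frame(
  x   = runif(n, 0, 5),
  grp = factor(rep(c("A", "B"), each = n / 2))
)
toy$y <- 2 + 0.5 * toy$x + 3 * (toy$grp == "B") + rnorm(n)

full    <- lm(y ~ x * grp, data = toy)  # separate regression lines
reduced <- lm(y ~ x + grp, data = toy)  # parallel regression lines

# F-test of the reduced model against the full model
cmp <- anova(reduced, full)
print(cmp)

# t-statistic for the interaction coefficient in the full model
t_int <- summary(full)$coefficients["x:grpB", "t value"]
F_int <- cmp$F[2]

# For a one-degree-of-freedom term the two tests agree: F = t^2
all.equal(F_int, t_int^2)
```

The anova() form is the one that generalizes when the factor has more than two levels, since the interaction then involves several coefficients and no single t-test covers it.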
We put the two parallel regression lines on the plot, as seen in the left panel of Figure 13.3:
> plot(ptsd ~ cpa, pch=as.character(csa), sexab)
> abline(10.248, 0.551)
> abline(10.248-6.273, 0.551, lty=2)
Figure 13.3 Model fit shown on the left and fitted vs. residuals plot on the right. A=Abused and N=NotAbused.
The slope of both lines is 0.551, but the “Abused” line is 6.273 higher than the “NotAbused” line. From the t-test above, the unadjusted estimate of the effect of childhood sexual abuse is 11.9411 - 4.6959 = 7.2452. So after adjusting for the effect of childhood physical abuse, our estimate of the effect of childhood sexual abuse on PTSD is mildly reduced, from 7.245 to 6.273.
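Written out from the coefficients of the reduced model, the two fitted lines are as follows (the hat notation is introduced here for the fitted value and does not appear in the original):

```latex
\begin{align*}
\text{Abused:}\quad \widehat{\text{ptsd}} &= 10.248 + 0.551\,\text{cpa} \\
\text{NotAbused:}\quad \widehat{\text{ptsd}} &= (10.248 - 6.273) + 0.551\,\text{cpa} = 3.975 + 0.551\,\text{cpa}
\end{align*}
```

The lines share a slope, so the vertical gap between them is 6.273 at every value of cpa. This is what makes a single number a sensible summary of the adjusted effect.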
We can also compare confidence intervals for the effect of csa:
> confint(g)[3,]
  2.5 %  97.5 %
-7.9108 -4.6347
compared to the (5.6189, 8.8716) found for the unadjusted difference. In this particular case, the confidence intervals are about the same width. In other cases, particularly designed experiments, adjusting for a covariate can increase the precision of the estimate of an effect.
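The width comparison is simple arithmetic on the two intervals quoted above. A minimal sketch, with the interval endpoints typed in by hand from the output (the variable names are illustrative):

```r
# Endpoints copied from the output above
unadjusted <- c(lower = 5.6189, upper = 8.8716)    # t-test: Abused minus NotAbused
adjusted   <- c(lower = -7.9108, upper = -4.6347)  # confint(g)[3,]: NotAbused minus Abused

# Flip the sign of the adjusted interval so both measure Abused minus NotAbused
adjusted_flipped <- c(lower = -adjusted[["upper"]], upper = -adjusted[["lower"]])

width_unadj <- unname(diff(unadjusted))        # 8.8716 - 5.6189 = 3.2527
width_adj   <- unname(diff(adjusted_flipped))  # 7.9108 - 4.6347 = 3.2761

round(c(unadjusted = width_unadj, adjusted = width_adj), 4)
```

The adjusted interval has the opposite sign only because the coefficient is expressed relative to the “Abused” reference level. Once flipped, it reads (4.6347, 7.9108), which is shifted toward zero relative to the unadjusted interval but is almost exactly as wide.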
The usual diagnostics should be checked. It is worth checking whether there is some difference related to the categorical variable as we do here:
> plot (fitted (g), residuals (g), pch=as.character (sexab$csa), xlab="Fitted", ylab="Residuals")
We see in the right panel of Figure 13.3 that there are no signs of heteroscedasticity.
Furthermore, because the two groups happen to separate, we can also see that the variation in the two groups is about the same. If this were not so, we would need to make some adjustments to the analysis, possibly using weights.
For convenience, you can change the reference level:
> sexab$csa <- relevel(sexab$csa, ref="NotAbused")
> g <- lm(ptsd ~ cpa+csa, sexab)
> summary(g)
Residual standard error: 3.27 on 73 degrees of freedom
Multiple R-Squared: 0.579, Adjusted R-squared: 0.567
F-statistic: 50.1 on 2 and 73 DF, p-value: 2e-14
Although some of the coefficients have different numerical values, this coding leads to the same conclusion as before.
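To see which coefficients change and which do not, the following sketch refits a parallel-lines model under both codings on simulated data (the data frame `toy` and its variables are hypothetical, generated only for illustration). It checks that releveling flips the sign of the group coefficient, shifts the intercept by the same amount, and leaves the slope and the fitted values unchanged.

```r
set.seed(2)  # simulated data, for illustration only
n <- 30
toy <- data.frame(
  x   = runif(n, 0, 5),
  grp = factor(rep(c("Abused", "NotAbused"), each = n / 2))
)
toy$y <- 10 + 0.5 * toy$x - 6 * (toy$grp == "NotAbused") + rnorm(n)

# Default coding: "Abused" is the reference level
fit1 <- lm(y ~ x + grp, data = toy)

# Reversed coding: "NotAbused" is the reference level
toy$grp2 <- relevel(toy$grp, ref = "NotAbused")
fit2 <- lm(y ~ x + grp2, data = toy)

b1 <- coef(fit1)  # names: (Intercept), x, grpNotAbused
b2 <- coef(fit2)  # names: (Intercept), x, grp2Abused

# The group effect changes sign; the new intercept is the other group's intercept
all.equal(b2[["grp2Abused"]], -b1[["grpNotAbused"]])
all.equal(b2[["(Intercept)"]], b1[["(Intercept)"]] + b1[["grpNotAbused"]])

# Slope and fitted values do not depend on the coding
all.equal(b2[["x"]], b1[["x"]])
all.equal(fitted(fit2), fitted(fit1))
```

By the same algebra, the releveled fit to the sexab data would be expected to have intercept 10.248 - 6.273 = 3.975 and a group coefficient of +6.273, with the cpa slope still 0.551.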
Finally, we should point out that childhood physical abuse might not be the only factor that is relevant to assessing the effects of childhood sexual abuse. It is quite possible that the two groups differ according to other variables such as socioeconomic status and age.
Issues such as these were addressed in Rodriguez et al. (1997).