7 ENVIRONMENTAL BASELINE ASSESSMENT
7.1 Physical Environment
Although most current survey data analysis software is programmed to correctly account for subclasses in analysis, a useful step in preparing for data analysis is to examine the distribution of the targeted subpopulation sample with respect to the sampling error strata and clusters that have been defined under the sampling error calculation model. Figure 4.4 illustrates the different distributional patterns that might be observed in practice.
Figure 4.4
Schematic illustration of subclass types for a stratified, clustered sample design. Panels: “Design Domain”, “Mixed Class”, and “Cross Class”; each panel shows strata (Str.) 1 to 5, with PSU 1 and PSU 2 in each stratum.
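The preparatory check described above, examining how the subclass sample falls across sampling error strata and clusters, can be sketched in a few lines of Python. This is a minimal illustration rather than a feature of any survey package: the function names `subclass_distribution` and `summarize`, the record layout, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def subclass_distribution(records):
    """Count subclass sample elements in every (stratum, PSU) cell.

    `records` is an iterable of (stratum, psu, in_subclass) tuples, one per
    sampled element (a hypothetical layout). Cells that contain no subclass
    cases are kept with a count of 0 so the full design stays visible.
    """
    counts = Counter()
    for stratum, psu, in_subclass in records:
        counts[(stratum, psu)] += 1 if in_subclass else 0
    return dict(counts)

def summarize(counts):
    """Summarize how widely the subclass is spread over strata and PSUs."""
    strata = {s for s, _ in counts}
    strata_with_cases = {s for (s, _), n in counts.items() if n > 0}
    return {
        "strata_total": len(strata),
        "strata_with_cases": len(strata_with_cases),
        "psus_total": len(counts),
        "psus_with_cases": sum(1 for n in counts.values() if n > 0),
    }

# Toy data (illustrative only): 3 strata with 2 PSUs each.
records = [
    (1, 1, True), (1, 1, False), (1, 2, True),
    (2, 1, False), (2, 2, False),
    (3, 1, True), (3, 2, False),
]
counts = subclass_distribution(records)
summary = summarize(counts)
```

A subclass that appears in only some strata, or in only some PSUs within a stratum, shows up as zero cells in `counts`. That is the kind of pattern the figure distinguishes.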
© 2010 by Taylor and Francis Group, LLC
112 Applied Survey Data Analysis
THEORY BOX 4.2 THEORETICAL MOTIVATION FOR UNCONDITIONAL SUBCLASS ANALYSES
To illustrate the importance of following the unconditional subclass analysis approach mathematically, we consider the variance of a sample total (the essential building block for variance estimation based on Taylor series linearization). We denote design strata by $h$ ($h = 1, 2, \ldots, H$), first-stage PSUs within strata by $\alpha$ ($\alpha = 1, 2, \ldots, a_h$), and sample elements within PSUs by $i$ ($i = 1, 2, \ldots, n_{h\alpha}$). The weight for element $i$, taking into account factors such as unequal probability of selection, nonresponse, and possibly poststratification, is denoted by $w_{h\alpha i}$. We refer to specific subclasses using the notation $S$. An estimate of the total for a variable $Y$ in a subclass denoted by $S$ is computed as follows (Cochran, 1977):
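$$\hat{Y}_S = \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i} \, I_{S,h\alpha i} \, y_{h\alpha i}$$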
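In this notation, $I_{S,h\alpha i}$ represents an indicator variable, equal to 1 if sample element $i$ belongs to subclass $S$ and 0 otherwise. The closed-form analytic formula for the variance of this subclass total can be written as follows:
$$\operatorname{var}(\hat{Y}_S) = \sum_{h=1}^{H} \frac{a_h}{a_h - 1} \left[ \sum_{\alpha=1}^{a_h} \left( \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i} \, I_{S,h\alpha i} \, y_{h\alpha i} \right)^{2} - \frac{1}{a_h} \left( \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i} \, I_{S,h\alpha i} \, y_{h\alpha i} \right)^{2} \right]$$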
This formula shows that the variance of the subclass total is calculated by summing the between-cluster variance in the subclass totals within strata, across the H sample strata. The formula also shows how the indicator variable is used to ensure that all sample elements (and their design strata and PSUs) are recognized in the variance calculation; this emphasizes the need for the software to recognize all of the original design strata and PSUs. Analysts should note that if none of the $n_{h\alpha}$ elements within a given stratum $h$ and PSU $\alpha$ belong to the subclass S (although elements from that subclass theoretically could belong to that PSU in any given sample), that PSU will still contribute to the variance estimation: The PSU helps to define the total number of PSUs within stratum h ($a_h$) and contributes a value of 0 to the sums in the variance estimation formula. In this way, the sample-to-sample variability in the estimated total that arises because the subclass sample size is a random variable is captured in the variance calculation.
Design domains, or subclasses that are restricted to only a subset of the primary stage strata (e.g., adults in the Census South Region, or residents of urban counties), constitute a broad category of analysis subclasses. In general, analysis of design domain subclasses should not be problematic in most contemporary survey analysis software. Analysts should recognize that sampling errors estimated for domain subclasses will be based on fewer degrees of freedom than estimates for the full sample or cross-classes, given that design domains are generally restricted to specific sampling strata.
A second pattern that may be observed is a mixed class, or a population subclass that is not isolated in a subset of the sample design strata but is unevenly and possibly sparsely distributed across the strata and clusters of the sampling error calculation model. Experience has shown that survey analysts often “push the limits” of survey design, focusing on rare or highly concentrated subclasses of the population (e.g., Hispanic adults with asthma in the HRS). While such analyses are by no means precluded by the survey design, survey analysts are advised to exercise care in approaching subclass analyses of this type for several reasons: (1) Nominal sample sizes for mixed classes may be small but design effects due to weighting and clustering may still be substantial; (2) a highly uneven distribution of cases to strata and clusters will introduce instability or bias in the variance estimates (especially those based on the Taylor series linearization method); and (3) software approximations to the complex sample design degrees of freedom (df = # clusters – # strata) may significantly overstate the true precision of the estimated sampling errors and confidence intervals. Survey analysts who intend to analyze data for mixed classes are encouraged to consult with a survey statistician, but we would recommend that these subclasses be handled using unconditional subclass analysis approaches (see Section 4.5.2).
The third pattern we will label a cross-class (Kish, 1987). Cross-classes are subclasses of the survey population that are broadly distributed across all strata and clusters of a complex sample design. Examples of cross-classes in a national area probability sample survey of adults might include males, or individuals age 40 and above. Properly identified to the software, subclasses that are true cross-classes present very few problems in survey data analysis: sampling error estimation and degrees of freedom determination for confidence intervals and hypothesis tests should be straightforward.
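The role of the indicator variable in Theory Box 4.2 can be made concrete with a small sketch. The code below assumes the standard with-replacement, between-PSU variance form for a weighted subclass total; the function name `subclass_total_and_variance`, the record layout, and the toy data are hypothetical and chosen only for illustration. Every sampled element is passed in, and elements outside the subclass contribute 0 rather than being dropped, so a PSU with no subclass cases still counts toward the number of PSUs in its stratum.

```python
from collections import defaultdict

def subclass_total_and_variance(records):
    """Unconditional estimate of a subclass total and its variance.

    `records` is an iterable of (stratum, psu, weight, y, in_subclass)
    tuples covering ALL sampled elements (a hypothetical layout).
    """
    # Weighted subclass total per PSU; non-subclass elements add 0.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for h, a, w, y, in_s in records:
        psu_totals[h][a] += w * y * (1.0 if in_s else 0.0)

    total = 0.0
    variance = 0.0
    for h, by_psu in psu_totals.items():
        z = list(by_psu.values())
        a_h = len(z)  # all PSUs in the stratum, including empty ones
        if a_h < 2:
            raise ValueError(f"stratum {h} has fewer than 2 PSUs")
        total += sum(z)
        # Between-PSU variability of the subclass totals within stratum h.
        variance += a_h / (a_h - 1) * (sum(v * v for v in z) - sum(z) ** 2 / a_h)
    return total, variance

# Toy data (illustrative only): PSU 2 of stratum 1 has no subclass cases.
records = [
    (1, 1, 10.0, 2.0, True), (1, 1, 10.0, 3.0, True),
    (1, 2, 10.0, 4.0, False),
    (2, 1, 5.0, 2.0, True),
    (2, 2, 5.0, 4.0, True),
]
total, variance = subclass_total_and_variance(records)  # 80.0, 2600.0

# Dropping non-subclass records first (a conditional analysis) leaves
# stratum 1 with a single PSU, so its variance contribution is undefined.
subclass_only = [r for r in records if r[4]]
try:
    subclass_total_and_variance(subclass_only)
    conditional_failed = False
except ValueError:
    conditional_failed = True
```

In the toy data, stratum 1 contributes 2500 to the variance precisely because its second PSU enters the sums with a total of 0. Once that PSU is deleted, no between-PSU comparison is possible in that stratum.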
We now consider two alternative approaches to subclass analysis that analysts of complex sample survey data sets could take in practice and discuss applications where one approach might be preferred over another.