
The estimation of a population total and its sampling variance has played a central role in the development of probability sampling theory. In survey practice, optimal methods for estimation of population totals are extremely important in government and academic surveys that focus on agriculture (e.g., acres in production), business (e.g., total employment), and organizational research (e.g., hospital costs). Acres of corn planted per year, total natural gas production for a 30-day period, and total expenditures attributable to a prospective change in benefit eligibility are all research questions that require efficient estimation of population totals. In those agencies and disciplines where estimation of population totals plays a central role, advanced “model-assisted” estimation procedures and specialized software are the norm. Techniques such as the generalized regression (GREG) estimator or calibration estimators integrate the survey data with known population controls on the distribution of the population weighting factors to produce efficient weighted estimates of population totals (Deville and Särndal, 1992; Valliant, Dorfman, and Royall, 2000).

Most statistical software packages do not currently support advanced techniques for estimating population totals such as the GREG or calibration methods. Stata and other software systems that support complex sample survey data analysis do provide the capability to compute simple weighted or expansion estimates of finite population totals and also include a limited set of options for including population controls in the form of poststratified estimation (see Theory Box 5.1).

[Figure 5.3: Boxplots of the gender-specific weighted distributions of total serum cholesterol for U.S. adults. (From 2005–2006 NHANES data.) Panels: Male, Female. Vertical axis: Total Cholesterol (mg/dl), 100 to 600.]

© 2010 by Taylor and Francis Group, LLC

124 Applied Survey Data Analysis

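In the case of a complex sample design including stratification (with strata indexed by h = 1, …, H) and clustering (with clusters within stratum h indexed by α = 1, 2, …, $a_h$), the simple weighted estimator for the population total can be written as

$$\hat{Y}_w = \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i} \qquad (5.1)$$

where $n_{h\alpha}$ is the number of sample elements in cluster α of stratum h, and $w_{h\alpha i}$ and $y_{h\alpha i}$ are the survey weight and the observed value for element i in that cluster.

A closed-form, unbiased estimate of the variance of this estimator is

$$\mathrm{var}(\hat{Y}_w) = \sum_{h=1}^{H} \frac{a_h}{a_h - 1} \left[ \sum_{\alpha=1}^{a_h} \left( \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i} \right)^{2} - \frac{1}{a_h} \left( \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i} \right)^{2} \right] \qquad (5.2)$$

(This variance estimator for the weighted total plays an important role in the computation of Taylor series linearization (TSL) estimates of sampling variances for more complex estimators that can be approximated as linear functions of estimated totals.)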
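As a numerical illustration of the weighted total and its variance, the following minimal Python sketch sums weight × value within each sample cluster and then applies the standard between-cluster variance formula within strata, assuming at least two clusters per stratum and no finite population correction. The function name `weighted_total_and_variance` and the toy records are hypothetical and are not taken from any survey data set.

```python
import math
from collections import defaultdict


def weighted_total_and_variance(records):
    """records: iterable of (stratum, cluster, weight, y) tuples.

    Returns (total, variance): the weighted total and the
    between-cluster variance estimate accumulated over strata.
    """
    # z[h][a] holds the weighted sum of y for cluster a in stratum h
    z = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, weight, y in records:
        z[stratum][cluster] += weight * y

    total = 0.0
    variance = 0.0
    for stratum, clusters in z.items():
        a_h = len(clusters)
        if a_h < 2:
            raise ValueError(f"stratum {stratum!r} needs at least 2 clusters")
        z_vals = list(clusters.values())
        z_sum = sum(z_vals)
        total += z_sum
        sum_sq = sum(v * v for v in z_vals)
        variance += a_h / (a_h - 1) * (sum_sq - z_sum ** 2 / a_h)
    return total, variance


# Hypothetical toy sample: 2 strata, 2 clusters each, 2 elements per cluster
toy = [
    (1, 1, 10.0, 1), (1, 1, 10.0, 0),
    (1, 2, 10.0, 1), (1, 2, 10.0, 1),
    (2, 1, 20.0, 0), (2, 1, 20.0, 1),
    (2, 2, 20.0, 0), (2, 2, 20.0, 0),
]
total, variance = weighted_total_and_variance(toy)
standard_error = math.sqrt(variance)
print(total, variance, standard_error)  # total = 50.0, variance = 500.0
```

Because y is a 0/1 indicator in the toy data, the total of 50 is read as an estimated count of elements with the attribute. Only the cluster-level sums enter the variance, which is why the cluster and stratum codes are enough to compute it.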

This simple weighted estimator of the finite population total is often labeled the Horvitz–Thompson or H–T estimator (Horvitz and Thompson, 1952). Practically speaking, this labeling is convenient, but the estimator in Equation 5.1 and the variance estimator in Equation 5.2 make additional assumptions beyond those that are explicit in Horvitz and Thompson’s original derivation. Theory Box 5.1 provides interested readers with a short summary of the theory underlying the H–T estimator.
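The H–T idea of weighting each sampled element by the inverse of its inclusion probability can be checked by brute force on a tiny population. The sketch below is only an illustration under an assumed design: simple random sampling without replacement of n = 2 elements from a hypothetical population of N = 4 values, so that every element has inclusion probability n/N and every pair has joint inclusion probability n(n − 1)/(N(N − 1)). It enumerates all possible samples and compares the average of the estimates with the true total, and the average of the H–T variance estimates with the true sampling variance.

```python
from fractions import Fraction
from itertools import combinations

y = [3, 5, 8, 12]  # hypothetical population values
N, n = len(y), 2
p = Fraction(n, N)                         # inclusion probability of each element
p_ij = Fraction(n * (n - 1), N * (N - 1))  # joint inclusion probability of a pair


def ht_total(sample):
    """H-T estimate: sampled values weighted by 1 / inclusion probability."""
    return sum(Fraction(y[i]) / p for i in sample)


def ht_variance(sample):
    """H-T variance estimate built from single and joint inclusion probabilities."""
    own = sum((1 - p) / p ** 2 * y[i] ** 2 for i in sample)
    cross = sum(
        (p_ij - p * p) / (p_ij * p * p) * y[i] * y[j]
        for i in sample for j in sample if i != j
    )
    return own + cross


# Under this design every sample of size n is equally likely,
# so a plain average over all samples is an expectation.
samples = list(combinations(range(N), n))
estimates = [ht_total(s) for s in samples]
mean_estimate = sum(estimates) / len(samples)
true_variance = sum((e - mean_estimate) ** 2 for e in estimates) / len(samples)
mean_variance_estimate = sum(ht_variance(s) for s in samples) / len(samples)

print(mean_estimate == sum(y))                  # True: unbiased for the total
print(mean_variance_estimate == true_variance)  # True: unbiased for the variance
```

Exact fractions are used so that the two equalities hold without rounding error. For unequal-probability designs the same functions would need element-specific and pair-specific probabilities in place of the constants `p` and `p_ij`.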

Two major classes of total statistics can be estimated using Equation 5.1. If $y_{h\alpha i}$ is a binary indicator for an attribute (e.g., 1 = has the disease, 0 = disease free), the result is an estimate of the size of the subpopulation that shares that attribute:

$$\hat{Y}_w = \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i}, \qquad y_{h\alpha i} \in \{0, 1\}$$

Alternatively, if y is a continuous measure of an attribute of the sample case (e.g., acres of corn, monthly income, annual medical expenses), the result is an estimate of the population total of y,

$$\hat{Y}_w = \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i} = \hat{Y}$$

Example 5.3 will illustrate the estimation of a subpopulation total, and Example 5.4 will illustrate the estimation of a population total.

THEORY BOX 5.1 THE HORVITZ–THOMPSON ESTIMATOR OF A POPULATION TOTAL

The Horvitz–Thompson estimator (Horvitz and Thompson, 1952) of the population total for a variable Y is written as follows:

$$\hat{Y} = \sum_{i=1}^{N} \frac{\delta_i\, y_i}{p_i}$$

Here $\delta_i$ is an indicator equal to 1 if element i is included in the sample and 0 otherwise, and $p_i$ is the probability of inclusion in the sample for element i. The H–T estimator is an unbiased estimator for the population total Y, because the only random variable defined in the estimator is the indicator of inclusion in the sample (the $y_i$ and $p_i$ values are fixed in the population):

$$E(\hat{Y}) = E\left( \sum_{i=1}^{N} \frac{\delta_i\, y_i}{p_i} \right) = \sum_{i=1}^{N} \frac{E(\delta_i)\, y_i}{p_i} = \sum_{i=1}^{N} \frac{p_i\, y_i}{p_i} = \sum_{i=1}^{N} y_i = Y$$

An unbiased estimator of the sampling variance of the H–T estimator is

$$\mathrm{var}(\hat{Y}) = \sum_{i=1}^{N} \delta_i\, \frac{(1 - p_i)}{p_i^{2}}\, y_i^{2} + \sum_{i=1}^{N} \sum_{j \neq i} \delta_i\, \delta_j\, \frac{(p_{ij} - p_i p_j)}{p_{ij}\, p_i\, p_j}\, y_i\, y_j$$

In this expression, $p_{ij}$ represents the probability that both elements i and j are included in the sample; these joint inclusion probabilities must be supplied to statistical software to compute these variance estimates.

The H–T estimator weights each sample observation in inverse proportion to its sample selection probability, $w_{HT,i} = w_{sel,i} = 1/p_i$, and does not explicitly consider nonresponse adjustment or poststratification (Section 2.7). In fact, when the analysis weight incorporates all three of these conventional weight factors, the variance estimator in Equation 5.2 does not fully reflect the stochastic sample-to-sample variability associated with the nonresponse mechanism, nor does it capture true gains in precision that may have been achieved through the poststratification of the weights to external population controls. Because survey nonresponse is a stochastic process that operates on the selected sample, the variance estimator could (in theory) explicitly capture this added component of sample-to-sample variability (Valliant, 2004). This method assumes that the data user can access the individual components of the survey weight. Stata does provide the capability to directly account for the reduction in sampling variance due to the poststratification using the poststrata() and postweight() options on the svyset command.

The effect of nonresponse and poststratification weighting on the sampling variance of estimated population totals and other descriptive statistics may also be captured through the use of replicate weights, in which the nonresponse adjustment and the poststratification controls are separately developed for each balanced repeated replication (BRR) or jackknife repeated replication (JRR) replicate sample of cases.

Example 5.3: Using the NCS-R Data to Estimate the Total Count of U.S. Adults with Lifetime Major Depressive Episodes (MDE)

The MDE variable in the NCS-R data set is a binary indicator (1 = yes, 0 = no) of whether an NCS-R respondent reported a major depressive episode at any point in his or her lifetime. The aim of this example analysis is to estimate the total number of individuals who have experienced a lifetime major depressive episode along with the standard error of the estimate (and the 95% confidence interval).

For this analysis, the NCS-R survey weight variable NCSRWTSH is selected to analyze all respondents completing the Part I survey (n = 9,282), where the lifetime diagnosis of MDE was assessed. Because the NCS-R data producers normalized the values of the NCSRWTSH variable so that the weights would sum to the sample size, the weight values must be expanded back to the population scale to obtain an unbiased estimate of the population total. This is accomplished by multiplying the Part I weight for each case by the ratio of the NCS-R survey population total (N = 209,128,094 U.S. adults age 18+) to the count of sample observations (n = 9,282). The SECLUSTR variable contains the codes representing NCS-R sampling error clusters while SESTRAT is the sampling error stratum variable:

* Expand the normalized Part I weight to the population scale (N / n)
gen ncsrwtsh_pop = ncsrwtsh * (209128094 / 9282)
* Declare the sampling error clusters, the expanded weight, and the strata
svyset seclustr [pweight = ncsrwtsh_pop], strata(sestrat)

Once the complex design features of the NCS-R sample have been identified using the svyset command, the svy: total command is issued to obtain an unbiased weighted estimate of the population total along with a standard error for the estimate. The Stata estat effects command is then used to compute an estimate of the design effect for this estimated total:

svy: total mde
estat effects

n       df   $\hat{Y}_w$    $se(\hat{Y}_w)$   $CI_{.95}(\hat{Y}_w)$        $d^2(\hat{Y}_w)$
9,282   42   40,092,206     2,567,488         (34,900,000, 45,300,000)     9.03

The resulting Stata output indicates that 9,282 observations have been analyzed and that there are 42 design-based degrees of freedom. The weighted estimate of the total population of U.S. adults who have experienced an episode of major depression in their lifetime is $\hat{Y}_w = 40{,}092{,}206$. The estimated value of the design effect for the weighted estimate of the population total is $d^2(\hat{Y}_w) = 9.03$, suggesting that the NCS-R variance of the estimated total is approximately nine times greater than that expected for a simple random sample of the same size.
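The arithmetic behind these results can be retraced in a few lines. The Python sketch below recomputes the weight expansion factor N/n, rebuilds the 95% confidence interval from the reported estimate and standard error, and converts the design effect into an effective sample size (n divided by the design effect, a common rule of thumb that is not computed in the example itself). The critical value 2.018 for a t distribution with 42 degrees of freedom is an assumed, rounded constant, and all variable names are hypothetical.

```python
import math

N_POP = 209_128_094   # NCS-R survey population total (from the text)
N_SAMPLE = 9_282      # Part I respondents (from the text)
total = 40_092_206    # estimated total with lifetime MDE
se = 2_567_488        # reported standard error
deff = 9.03           # reported design effect
T_CRIT = 2.018        # assumed approximate t quantile (0.975, 42 df)

# Factor that moves the normalized weights back to the population scale
expansion_factor = N_POP / N_SAMPLE

# 95% confidence interval: estimate +/- t * standard error
ci = (total - T_CRIT * se, total + T_CRIT * se)

# Size of a simple random sample that would give the same precision
effective_n = N_SAMPLE / deff

# Standard error expected under simple random sampling of the same size
srs_se = se / math.sqrt(deff)

print(round(expansion_factor, 1))
print(tuple(round(x, -5) for x in ci))
print(round(effective_n))
```

Rounded to the nearest hundred thousand, the recomputed limits agree with the interval shown in the output. A design effect near 9 means that the 9,282 interviews carry roughly the information of about 1,000 independent observations for this particular total.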

Weighted estimates of population totals can also be computed for subpopulations. Consider subpopulations of NCS-R adults classified by marital status (married, separated/widowed/divorced, and never married). Under the complex NCS-R sample design, correct unconditional subpopulation analyses (see Section 4.5.2) can be specified in Stata by adding the over() option to the svy: total command:

svy: total mde, over(mar3cat)

estat effects

Subpopulation     n       Estimated Total Lifetime MDE   Standard Error   95% Confidence Interval      $d^2(\hat{Y})$
Married           5,322   20,304,190                     1,584,109        (17,100,000, 23,500,000)     6.07
Sep./Wid./Div.    2,017   10,360,671                     702,601          (8,942,723, 11,800,000)      2.22
Never Married     1,943   9,427,345                      773,137          (7,867,091, 11,000,000)      2.95

Note that the MAR3CAT variable is included in parentheses to request estimates for each subpopulation defined by the levels of that variable.
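One way to picture an unconditional subpopulation analysis is that cases outside the subpopulation stay in the design but contribute zero to the total. The Python sketch below is a hedged illustration of that idea on hypothetical toy data, using the same between-cluster variance formula within strata and assuming at least two clusters per stratum. The function `domain_total` and the `group` labels are invented for the example and are not part of Stata or the NCS-R data.

```python
import math
from collections import defaultdict


def domain_total(rows, domain=None):
    """Weighted total and standard error for one domain.

    rows: list of dicts with keys 'stratum', 'cluster', 'w', 'y', 'group'.
    domain: value of 'group' to keep, or None for the full sample.
    """
    z = defaultdict(lambda: defaultdict(float))
    for r in rows:
        keep = domain is None or r["group"] == domain
        # Adding 0.0 still registers the cluster in its stratum, so clusters
        # without domain members remain part of the variance calculation.
        z[r["stratum"]][r["cluster"]] += r["w"] * r["y"] if keep else 0.0
    total = variance = 0.0
    for clusters in z.values():
        a_h = len(clusters)
        vals = list(clusters.values())
        total += sum(vals)
        variance += a_h / (a_h - 1) * (sum(v * v for v in vals) - sum(vals) ** 2 / a_h)
    return total, math.sqrt(max(variance, 0.0))


# Hypothetical toy sample: (stratum, cluster, weight, y, group)
data = [
    (1, 1, 10.0, 1, "a"), (1, 1, 10.0, 1, "b"),
    (1, 2, 10.0, 1, "a"), (1, 2, 10.0, 0, "b"),
    (2, 1, 20.0, 1, "b"), (2, 1, 20.0, 1, "a"),
    (2, 2, 20.0, 1, "b"), (2, 2, 20.0, 1, "b"),
]
rows = [dict(zip(("stratum", "cluster", "w", "y", "group"), t)) for t in data]

total_all, se_all = domain_total(rows)
total_a, se_a = domain_total(rows, "a")
total_b, se_b = domain_total(rows, "b")
print(total_a, total_b, total_all)  # 40.0 70.0 110.0

# The three reported marital-status totals add up to the overall estimate
reported = [20_304_190, 10_360_671, 9_427_345]
print(sum(reported))  # 40092206
```

In the toy data, cluster 2 of stratum 2 has no members of group "a" yet still enters the variance for that group with a cluster sum of zero. Dropping such cases before estimation would instead shrink the apparent number of clusters.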

Example 5.4: Using the HRS Data to Estimate Total Household Assets

Next, consider the example problem of estimating the total value of household assets for the HRS target population (U.S. households with adults born prior to 1954). We first identify the HRS variables containing the sampling error computation units, or ultimate clusters (SECU), and the sampling error stratum codes (STRATUM). We also specify the KWGTHH variable as the survey weight variable for the analysis, because we are performing an analysis at the level of the HRS household financial unit. The HRS data set includes an indicator variable (KFINR for 2006) that identifies the individual respondent who is the financial reporter for each HRS sample household. This variable is used to create a subpopulation indicator (FINR) that restricts the estimation to only sample members who are financial reporters for their HRS household unit. We then apply the svy: total command to the H8ATOTA variable, measuring the total value of household assets:

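* Flag the financial reporter of each HRS household (KFINR = 1)
gen finr=1
replace finr=0 if kfinr !=1
* Declare the sampling error clusters, the household weight, and the strata
svyset secu [pweight=kwgthh], strata(stratum)
* Estimate total household assets for the financial-reporter subpopulation
svy, subpop(finr): total h8atota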

n        df   $\hat{Y}_w$                 $se(\hat{Y}_w)$             $CI_{.95}(\hat{Y}_w)$
11,942   56   $\$2.84 \times 10^{13}$     $\$1.60 \times 10^{12}$     ($2.52 \times 10^{13}$, $3.16 \times 10^{13}$)

The Stata output indicates that the 2006 HRS target population includes approximately 53,853,000 households (not shown). In 2006, these 53.9 million estimated households owned household assets valued at an estimated $\hat{Y}_w = \$2.84 \times 10^{13}$, with a 95% confidence interval (CI) of ($\$2.52 \times 10^{13}$, $\$3.16 \times 10^{13}$).
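A total of this size is easier to interpret per household and relative to its own precision. The short Python sketch below divides the estimated total by the estimated number of households, expresses the standard error as a share of the total, and rebuilds the confidence interval. The critical value 2.003 for a t distribution with 56 degrees of freedom is an assumed, rounded constant, and the variable names are hypothetical.

```python
total_assets = 2.84e13     # estimated total household assets (from the text)
se_assets = 1.60e12        # reported standard error
households = 53_853_000    # estimated number of households (from the text)
T_CRIT = 2.003             # assumed approximate t quantile (0.975, 56 df)

# Average assets per household implied by the two estimated totals
mean_assets = total_assets / households

# Relative standard error (coefficient of variation) of the estimated total
relative_se = se_assets / total_assets

# 95% confidence interval: estimate +/- t * standard error
lower = total_assets - T_CRIT * se_assets
upper = total_assets + T_CRIT * se_assets

print(round(mean_assets))
print(round(relative_se, 3))
print(f"({lower:.2e}, {upper:.2e})")
```

The implied average is a little over half a million dollars per household, and the standard error is between 5% and 6% of the estimated total. This back-of-envelope ratio of two totals is only a rough check; a standard error for the mean itself would require a ratio-type estimator.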