The least squares estimators are less useful for data sets with severe heteroscedasticity. One strategy is to use a variation of least squares estimation that weights the observations. The idea is that, when minimizing the sum of squared errors using heteroscedastic data, the expected variability of some observations is smaller than that of others. Intuitively, it seems reasonable that the smaller the variability of a response, the more reliable that response is and the greater the weight it should receive in the minimization procedure. Weighted least squares is a technique that accounts for this “varying variability.”
Specifically, we use Section 3.2.3 assumptions E1, E2 and E4, with E3 replaced by $\mathrm{E}\,\varepsilon_i = 0$ and $\mathrm{Var}\,\varepsilon_i = \sigma^2/w_i$, so that the variability is inversely proportional to a known weight $w_i$. For example, if unit of analysis $i$ represents a geographical entity such as a state, you might use the number of people in the state as a weight. Or if $i$ represents a firm, you might use firm assets for the weighting variable. Larger values of $w_i$ indicate a more precise response variable through the smaller variability. In actuarial applications, weights are used to account for an exposure such as the amount of insurance premium, number of employees, size of the payroll, number of insured vehicles and so forth (further discussion is in Chapter 18).
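This model can be readily converted to the “ordinary” least squares problem by multiplying all regression variables by $\sqrt{w_i}$. That is, if we define $y_i^* = y_i \times \sqrt{w_i}$ and $x_{ij}^* = x_{ij} \times \sqrt{w_i}$, then from assumption E1 we have
$$y_i^* = y_i \times \sqrt{w_i} = (\beta_0 x_{i0} + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i)\sqrt{w_i} = \beta_0 x_{i0}^* + \beta_1 x_{i1}^* + \cdots + \beta_k x_{ik}^* + \varepsilon_i^*,$$
where $\varepsilon_i^* = \varepsilon_i \times \sqrt{w_i}$ has homoscedastic variance $\sigma^2$, because $\mathrm{Var}\,\varepsilon_i^* = w_i \,\mathrm{Var}\,\varepsilon_i = w_i \times \sigma^2/w_i = \sigma^2$. Thus, with the rescaled variables, all inference can proceed as earlier.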
This work has been automated in statistical packages where the user merely specifies the weights $w_i$ and the package does the rest. In terms of matrix algebra, this procedure can be accomplished by defining an $n \times n$ weight matrix $\mathbf{W} = \mathrm{diag}(w_i)$ so that the $i$th diagonal element of $\mathbf{W}$ is $w_i$. Extending equation (3.14), for example, the weighted least squares estimates can be expressed as
$$\mathbf{b}_{WLS} = \left(\mathbf{X}^{\prime}\mathbf{W}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{W}\mathbf{y}. \qquad (5.13)$$
Additional discussions of weighted least squares estimation will be presented in Section 15.1.1.
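As an illustration of the two routes described in this section, the following minimal sketch computes weighted least squares estimates both by rescaling the variables by the square root of the weights and by the matrix expression in equation (5.13). The simulated data, the function names `wls_by_rescaling` and `wls_by_matrix`, and the choice of NumPy are assumptions made for this example only; a statistical package would normally handle these steps for you.

```python
import numpy as np


def wls_by_rescaling(X, y, w):
    """Ordinary least squares on variables multiplied by sqrt(w_i)."""
    root_w = np.sqrt(w)
    X_star = X * root_w[:, None]   # x*_ij = x_ij * sqrt(w_i)
    y_star = y * root_w            # y*_i  = y_i  * sqrt(w_i)
    coef, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return coef


def wls_by_matrix(X, y, w):
    """Solve (X'WX) b = X'Wy with W = diag(w), as in equation (5.13)."""
    XtW = X.T * w                  # equals X.T @ np.diag(w) without forming W
    return np.linalg.solve(XtW @ X, XtW @ y)


# Simulated data (hypothetical): Var(eps_i) = sigma^2 / w_i with sigma = 2.
rng = np.random.default_rng(0)
n = 200
w = rng.integers(1, 50, size=n).astype(float)   # made-up exposure weights
x = rng.uniform(0.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])            # intercept column x_i0 = 1
eps = rng.normal(scale=2.0 / np.sqrt(w))
y = 1.0 + 0.5 * x + eps

b_rescaled = wls_by_rescaling(X, y, w)
b_matrix = wls_by_matrix(X, y, w)
print("rescaling:", b_rescaled)
print("matrix   :", b_matrix)
```

The two coefficient vectors should agree up to floating-point error, which is the numerical counterpart of the algebraic argument above.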
5.7.4 Transformations
Another approach that handles severe heteroscedasticity, introduced in Section 1.3, is to transform the dependent variable, typically with a logarithmic transformation of the form $y^* = \ln y$. As we saw in Section 1.3, transformations can serve to “shrink” spread-out data and symmetrize a distribution. Through a change of scale, a transformation also changes the variability, potentially altering a heteroscedastic dataset into a homoscedastic one. This is both a strength and a limitation of the transformation approach: a transformation simultaneously affects both the skewness of the distribution and the heteroscedasticity.
Power transformations, such as the logarithmic transform, are most useful when the variability of the data grows with the mean. In this case, the transform will serve to “shrink” the data to a scale that appears to be homoscedastic. Conversely, because transformations are monotonic functions, they will not help with patterns of variability that are nonmonotonic. Further, if your data are reasonably symmetric but heteroscedastic, a transformation will not be useful because any choice that mitigates the heteroscedasticity will skew the distribution.
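To make the case of variability growing with the mean concrete, here is a small sketch with made-up numbers. It assumes purely multiplicative errors (each response is a group mean times a factor near 1), which is an idealization chosen for this example and not a claim about any particular data set. Under that assumption the standard deviation on the original scale is proportional to the mean, whereas on the logarithmic scale it is the same in every group.

```python
import numpy as np

# Hypothetical multiplicative errors and group means (illustrative values only).
factors = np.array([0.8, 0.9, 1.0, 1.1, 1.25])
group_means = np.array([10.0, 100.0, 1000.0])

# Row g holds the responses for group g: group_means[g] * factors.
y = np.outer(group_means, factors)

sd_raw = y.std(axis=1, ddof=1)          # grows with the group mean
sd_log = np.log(y).std(axis=1, ddof=1)  # constant across groups

for m, s_raw, s_log in zip(group_means, sd_raw, sd_log):
    print(f"mean {m:7.1f}: sd(y) = {s_raw:9.3f}, sd(ln y) = {s_log:.4f}")
```

With additive errors of constant size the picture would reverse: the original scale would already be homoscedastic and the logarithm would introduce unequal spread.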
When data are nonpositive, it is common to add a constant to each observation so that all observations are positive prior to transformation. For example, the transform $\ln(1 + y)$ accommodates the presence of zeros. One can also multiply by a constant so that the approximate original units are retained. For example, the transform $100 \ln(1 + y/100)$ may be applied to percentage data, where negative percentages sometimes appear.
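The two shifted transforms just mentioned can be sketched as follows. The function names `shifted_log` and `scaled_log_percent` and the sample percentages are hypothetical, introduced only for this illustration. Note that $\ln(1 + y)$ still requires $y > -1$, and $100 \ln(1 + y/100)$ requires $y > -100$.

```python
import numpy as np


def shifted_log(y):
    """ln(1 + y): defined for y > -1, so zeros are accommodated."""
    return np.log1p(y)


def scaled_log_percent(y, scale=100.0):
    """scale * ln(1 + y / scale): defined for y > -scale."""
    return scale * np.log1p(y / scale)


# Made-up percentage changes, including negative values and zero.
pct = np.array([-20.0, -5.0, 0.0, 5.0, 20.0])
transformed = scaled_log_percent(pct)

for raw, new in zip(pct, transformed):
    print(f"{raw:6.1f}% -> {new:8.3f}")
```

Near zero the scaled transform is close to the identity, which is why the approximate original units are retained; farther from zero it compresses large positive values and stretches large negative ones.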
Our discussions of transformations have focussed on transforming dependent variables. As noted in Section 3.5, transformations of explanatory variables are also possible. This is because the regression assumptions condition on explanatory variables (Section 3.2.3). Some analysts prefer to transform variables to approximate normality, thinking of multivariate normal distributions as a foundation for regression analysis. Others are reluctant to transform explanatory variables because of the difficulties in interpreting resulting models. The approach taken here is to use transforms that are readily interpretable, such as those introduced in Section 3.5. Other transforms are certainly candidates to include in a selected model but they should provide substantial dividends in terms of fit or predictive power if they are difficult to communicate.
5.8 Further Reading and References
Long and Ervin (2000) gather compelling evidence for the use of alternative heteroscedasticity-consistent estimators of standard errors that have better finite sample performance than the classic versions. The large sample properties of empirical estimators have been established by Eicker (1967), Huber (1967), and White (1980) in the linear regression case. For the linear regression case, MacKinnon and White (1985) suggest alternatives that provide superior small-sample properties. For small samples, the evidence is based on (1) the biasedness of the estimators, (2) their motivation as jackknife estimators and (3) their performance in simulation studies.
Other measures of collinearity based on matrix algebra concepts involving eigenvalues, such as condition numbers and condition indices, are used by some analysts. See Belsley, Kuh, and Welsch (1980) for a solid treatment of collinearity and regression diagnostics. Hocking (2003) provides additional background reading on collinearity and principal components. See Carroll and Ruppert (1988) for further discussions of transformations in regression.
Hastie, Tibshirani, and Friedman (2001) give an advanced discussion of model selection issues, focusing on predictive aspects of models in the language of machine learning.
Chapter References
Belsley, David A., Edwin Kuh, and Roy E. Welsch (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley, New York.
Bendel, R. B., and A. A. Afifi (1977). Comparison of stopping rules in forward “stepwise” regression. Journal of the American Statistical Association 72, 46–53.
Box, George E. P. (1980). Sampling and Bayes inference in scientific modeling and robustness (with discussion). Journal of the Royal Statistical Society, Ser. A, 143, 383–430.
Breusch, T. S., and A. R. Pagan (1980). The Lagrange multiplier test and its applications to model specification in econometrics. Review of Economic Studies 47, 239–53.
Carroll, Raymond J., and David Ruppert (1988). Transformation and Weighting in Regression. Chapman-Hall, New York.
Eicker, F. (1967). Limit theorems for regressions with unequal and dependent errors. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. LeCam and J. Neyman, eds. University of California Press, Berkeley, CA, 1:59–82.
Hadi, A. S. (1988). Diagnosing collinearity-influential observations. Computational Statistics and Data Analysis 7, 143–59.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
Hocking, Ronald R. (2003). Methods and Applications of Linear Models: Regression and the Analysis of Variance. Wiley, New York.
Huber, P. J. (1967). The behaviour of maximum likelihood estimators under non-standard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. LeCam and J. Neyman, eds. University of California Press, Berkeley, CA, 1:221–33.
Long, J. S., and L. H. Ervin (2000). Using heteroscedasticity consistent standard errors in the linear regression model. American Statistician 54, 217–24.
MacKinnon, J. G., and H. White (1985). Some heteroskedasticity consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics 29, 53–7.
Mason, R. L., and R. F. Gunst (1985). Outlier-induced collinearities. Technometrics 27, 401–7.
Picard, R. R., and K. N. Berk (1990). Data splitting. American Statistician 44, 140–47.
Rencher, A. C., and F. C. Pun (1980). Inflation of $R^2$ in best subset regression. Technometrics 22, 49–53.
Snee, R. D. (1977). Validation of regression models. Methods and examples. Technometrics 19, 415–28.
5.9 Exercises
5.1. You are doing regression with one explanatory variable and so consider the basic linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.
a. Show that the $i$th leverage can be simplified to
$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{(n-1)s_x^2}.$$