where $\dot g(y)$ is the derivative of $g$ evaluated at $y$. Integrating $\dot g(\mu) = \mu^{-p/2}$ with respect to $\mu$ yields

$$g(y) = \begin{cases} y^{1-p/2} & \text{for } p \neq 2, \\ \ln(y) & \text{for } p = 2. \end{cases}$$
This is the Box–Cox transformation described in Section 1.3. As an example, when $y$ has variance proportional to the mean (such as in the case of Poisson data), $\mathrm{Var}(y) = \phi\mu$, $p = 1$ and hence the variance stabilizing transform is $g(y) = \sqrt{y}$. If the standard deviation is proportional to the mean and hence $p = 2$, then the log transform stabilizes the variance.
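The stabilizing effect can be checked numerically with the delta method, $\mathrm{Var}(g(y)) \approx \dot g(\mu)^2\,\mathrm{Var}(y)$. The following is a minimal sketch, assuming the variance function $\mathrm{Var}(y) = \phi\mu^p$; the function names, the default $\phi = 1$ and the finite-difference step `h` are illustrative choices, not part of the text.

```python
import math


def stabilize(y, p):
    """g(y) = y**(1 - p/2) for p != 2, and ln(y) for p == 2."""
    if p == 2:
        return math.log(y)
    return y ** (1 - p / 2)


def delta_method_variance(mu, p, phi=1.0, h=1e-6):
    """Approximate Var(g(y)) by g'(mu)**2 * phi * mu**p.

    g'(mu) is estimated with a central difference, so the result is
    only approximate.
    """
    slope = (stabilize(mu + h, p) - stabilize(mu - h, p)) / (2 * h)
    return slope ** 2 * phi * mu ** p


# The approximate variance of g(y) should not depend on the mean:
# about 0.25 for p = 1 (square root) and about 1.0 for p = 2 (log).
for mu in (2.0, 20.0, 200.0):
    print(mu, delta_method_variance(mu, p=1), delta_method_variance(mu, p=2))
```

The printed values stay essentially constant as the mean grows, which is what "variance stabilizing" means here.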
Variance stabilizing transformations were important before the GLM methodology became available. When non-normal responses were modeled, the only option was to find a transformation which would render the data amenable to analysis using the normal linear model. For example, when analyzing count data which had the Poisson distribution, a normal linear model analysis was performed using the square root of $y$ as the response. With GLMs the data is modeled directly, using the appropriate response distribution.
4.10 Categorical explanatory variables
When a potential explanatory variable is categorical, then it is “dummied up” for inclusion into a multiple linear regression. For example, suppose the area of residence of a policyholder is considered to be a potential explanatory variable in a model for claim size $y$. If there are three areas A, B and C, then two indicator (dummy) variables $x_1$ and $x_2$ are defined:
Area   x1   x2
A      1    0
B      0    1
C      0    0
Suppose area is the only explanatory variable. Then the relationship is modeled as

$$y \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2.$$
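This states that $y \approx \beta_0$ in area C, $y \approx \beta_0 + \beta_1$ in area A, and $y \approx \beta_0 + \beta_2$ in area B. Thus $\beta_1$ is the difference between area A and area C (A minus C), while $\beta_2$ is the difference between area B and area C (B minus C). Note the following: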
(i) An indicator variable $x_3$ for C is not used. The “left out” category is called the base level. In the above example, area C is the base level, meaning that differences are measured between C and each of the other levels. The choice of base level is up to the analyst: this is discussed in more detail below.
(ii) In general, when the explanatory variable has $r$ levels, $r - 1$ indicator variables are introduced, modeling the difference between each category and the base level.
(iii) It is not sensible to define, for example, a variable $x = 1$, 2 or 3 according to whether the level is A, B or C, since $y \approx \beta_0 + \beta_1 x$ implies equal spacing between A, B and C, i.e. a difference of $\beta_1$ between areas A and B, and $\beta_1$ between areas B and C.
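The indicator coding described above can be sketched in a few lines. This is an illustration only: the helper name `dummy_code` and the sample list of areas are hypothetical, and statistical software normally builds these columns automatically.

```python
def dummy_code(values, base):
    """Return (levels, rows) with one 0/1 indicator column per non-base level.

    The base level gets no column, so a variable with r levels
    produces r - 1 indicators.
    """
    levels = sorted(set(values) - {base})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical policyholder areas, with area C as the base level.
areas = ["A", "B", "C", "A", "C"]
levels, rows = dummy_code(areas, base="C")
print(levels)
for area, row in zip(areas, rows):
    print(area, row)
```

Cases in the base level C receive a row of zeros, so their fitted value is the intercept $\beta_0$ alone.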
Choice of base level. The base level should not be sparse. To explain this, suppose one has a categorical explanatory variable with $r$ levels, and level $r$ has been chosen as the base level. In the extreme scenario, if there are no cases having level $r$ then for each case one of $x_1, \ldots, x_{r-1}$ is always equal to 1, and the others equal to zero. This in turn implies that $x_1 + \cdots + x_{r-1} = 1$ and hence the $X$ matrix is singular: the sum of the last $r - 1$ columns equals the intercept. More realistically, if the base level has very few cases, then for most cases $x_1 + \cdots + x_{r-1} = 1$, implying near linear dependency between the columns of $X$. Although $\hat\beta$ is computable, it would be numerically unstable, analogous to the result obtained when dividing by a number close to zero.
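The singularity argument can be verified on a small example. The sketch below assumes the three-area layout used earlier (intercept plus indicators for A and B, with C as base); the helper names and the case counts are made up for illustration. For this layout the determinant of $X'X$ works out to the product of the three level counts, so it is zero when the base level is empty and small when the base level is sparse.

```python
def design_matrix(areas, levels=("A", "B")):
    """Intercept column followed by one indicator per non-base level."""
    return [[1] + [1 if a == level else 0 for level in levels] for a in areas]


def gram(X):
    """Compute X'X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Hypothetical counts: 5 cases in A, 4 in B, and 0, 1 or 30 in base level C.
for n_base in (0, 1, 30):
    areas = ["A"] * 5 + ["B"] * 4 + ["C"] * n_base
    print(n_base, det3(gram(design_matrix(areas))))
```

With no cases in C the determinant is exactly zero, so $X'X$ cannot be inverted; a single case in C makes it invertible but only barely.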
Any level which is not sparse is an appropriate base level. Since $\beta_j$ is the difference in the effect of the explanatory variable at level $j$ compared with the base level, it is convenient to choose the base level as the “normal” or “usual” level, against which other levels are to be compared. This is often the level having the most cases. For example, in the vehicle insurance data set, the most commonly occurring vehicle body type is “Sedan,” which comprises almost a third of the cases. Comparing other body types with Sedan makes good sense, and makes the latter a good choice as the base level. Note, however, that one is not limited to making comparisons relative to the base level. Differences between non-base levels are of the form $\beta_j - \beta_k$.
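As a small illustration of both points, the sketch below picks the most frequent level as the base and computes a difference between two non-base levels. The body-type list and the coefficient values are invented for the example; they are not taken from the vehicle insurance data set.

```python
from collections import Counter


def most_common_level(values):
    """Return the level with the most cases, a natural base-level candidate."""
    return Counter(values).most_common(1)[0][0]


def contrast(coefs, j, k):
    """beta_j - beta_k, treating the base level as having coefficient 0."""
    return coefs.get(j, 0.0) - coefs.get(k, 0.0)


# Made-up data and made-up fitted coefficients (relative to the base level).
body_types = ["Sedan", "Hatchback", "Sedan", "Utility", "Sedan", "Hatchback"]
base = most_common_level(body_types)
coefs = {"Hatchback": 0.12, "Utility": -0.05}

print(base)
print(contrast(coefs, "Hatchback", base))       # comparison with the base level
print(contrast(coefs, "Hatchback", "Utility"))  # comparison between non-base levels
```

A standard error for $\beta_j - \beta_k$ additionally needs the covariance between the two estimates, which this sketch does not attempt.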
SAS notes. The software chooses the base level as the highest level, numerically or alphabetically. When levels are coded numerically, the highest level is often the category “Other.” This is not a good choice of base level as it is usually sparse, and comparisons relative to “Other” are generally not helpful. In this case a more suitable base level needs to be specified. The terminology for base level in the SAS manual is “reference level.”
4.11 Polynomial regression
Given a single explanatory variable $x$, consider

$$y \approx \beta_0 + \beta_1 x + \beta_2 x^2.$$
This is a linear regression since the right hand side is linear in the $\beta$ coefficients. The relationship between $y$ and $x$, however, is quadratic. In essence there is one explanatory variable, albeit used twice in different forms. A unit increase in $x$ has the effect of changing $y$ by about $\beta_1 + 2\beta_2 x$ and hence the slope of the relationship depends on the value of $x$.
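The word “about” can be made precise. The derivative of the quadratic is $\beta_1 + 2\beta_2 x$, while the exact change for a unit increase is $\beta_1 + \beta_2(2x + 1)$, so the two differ by $\beta_2$. The sketch below checks this with arbitrary, made-up coefficient values.

```python
def quadratic(x, b0, b1, b2):
    """Fitted value b0 + b1*x + b2*x**2."""
    return b0 + b1 * x + b2 * x * x


def slope(x, b1, b2):
    """Derivative of the quadratic at x: b1 + 2*b2*x."""
    return b1 + 2 * b2 * x


def unit_change(x, b0, b1, b2):
    """Exact change in the fitted value when x increases by one unit."""
    return quadratic(x + 1, b0, b1, b2) - quadratic(x, b0, b1, b2)


# Illustrative coefficients only.
b0, b1, b2 = 1.0, 2.0, 0.5
for x in (0.0, 3.0, 10.0):
    print(x, slope(x, b1, b2), unit_change(x, b0, b1, b2))
```

In each row the exact unit change exceeds the slope by $\beta_2 = 0.5$, and both grow with $x$.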
This idea can be extended by defining further variables $x^3$, $x^4$, and so on. Incorporating more polynomial terms permits an increasingly complicated response structure. When a polynomial term of degree $m$, i.e. $x^m$, is included in a model, all lower order terms $x, x^2, \ldots, x^{m-1}$ are generally included. Note that fitting polynomials of unnecessarily high order results in models with fitted values close to the observations but low predictive ability.
This is illustrated in Figure 4.3, in which a small data set ($n = 20$), simulated with a quadratic relationship between $x$ and $y$, has had polynomials of degree $m = 1$, 2, 10 and 19 fitted. Clearly the linear relationship ($m = 1$) is inappropriate; $m = 2$ approximates the general shape of the relationship, without following local variation too closely; $m = 10$ produces fitted values much closer to the observations, but which tend to follow small local variations; and $m = 19$ (which fits 20 parameters to 20 observations) produces a perfect but useless fit, which “joins the dots.” The $\hat\beta$s from the degree 10 fit will have lower precision, and the resulting fitted values will be less reliable, than those from the quadratic model.

Fig. 4.3. Polynomial fits, simulated data set

Judgement about what degree of polynomial to fit to a continuous explanatory variable is guided by significance testing of the coefficients (Section 4.15), and model selection criteria (Section 4.19). Typically one starts with a linear term, and adds in increasingly higher order terms. This is illustrated, in the context of logistic regression, in Section 7.3.
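The behavior described for the polynomial fits can be reproduced in miniature. The sketch below is not the book's simulation: it assumes $n = 8$ points rather than 20 (so that a plain normal-equations solver stays numerically well behaved), an $x$ already scaled to $[-1, 1]$, and made-up coefficients, noise level and random seed. The helper names are hypothetical.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients (lowest order first) via the normal equations."""
    k = degree + 1
    # Build X'X and X'y for the design matrix with entries xs[i]**j.
    A = [[sum(x ** (r + c) for x in xs) for c in range(k)] for r in range(k)]
    rhs = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for j in range(col, k):
                A[r][j] -= factor * A[col][j]
            rhs[r] -= factor * rhs[col]
    # Back substitution on the upper triangular system.
    beta = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(A[r][j] * beta[j] for j in range(r + 1, k))
        beta[r] = (rhs[r] - tail) / A[r][r]
    return beta


def predict(beta, x):
    return sum(b * x ** j for j, b in enumerate(beta))


def truth(x):
    """Assumed true quadratic relationship (illustrative coefficients)."""
    return 1 + 2 * x - 3 * x * x


rng = random.Random(1)
n = 8
xs = [-1 + 2 * i / (n - 1) for i in range(n)]
ys = [truth(x) + rng.gauss(0, 0.3) for x in xs]
grid = [-1 + 2 * i / 200 for i in range(201)]

results = {}
for m in (1, 2, 5, n - 1):
    beta = fit_polynomial(xs, ys, m)
    # Residual sum of squares on the observed points.
    train_rss = sum((y - predict(beta, x)) ** 2 for x, y in zip(xs, ys))
    # Mean squared distance from the noise-free curve on a fine grid.
    grid_mse = sum((truth(x) - predict(beta, x)) ** 2 for x in grid) / len(grid)
    results[m] = (train_rss, grid_mse)
    print(m, train_rss, grid_mse)
```

The residual sum of squares on the observed points can only fall as the degree rises, reaching essentially zero at degree $n - 1$. The distance from the underlying curve between the observed points has no such guarantee, and is typically worse for the highest degrees.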
Numerical difficulties. These can arise when high order terms are included in a model, due to the finite length of computer storage. If $x$ is large (negative or positive), then overflow occurs in the computation of $x^m$ when $m$ is large. Frequently a further result of this is that underflow occurs in the corresponding $\hat\beta$. This problem is avoided if $x$ is suitably scaled before entering it into the regression. Specifically, denoting $x_{\max}$ as the maximum value of $x$ in the data, and $x_{\min}$ as the minimum, then $(x_{\max} + x_{\min})/2$ is the midrange of $x$, and $(x_{\max} - x_{\min})/2$ is half the range of $x$. The linear transformation

$$x^* = \frac{x - (x_{\max} + x_{\min})/2}{(x_{\max} - x_{\min})/2}$$

lies between $-1$ and 1. Overflow and underflow problems are overcome by including $x^*$ rather than $x$ in the polynomial regression.
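The scaling is straightforward to apply before building the polynomial terms. A minimal sketch follows; the function name and the sample values are illustrative, and the function assumes the data contain at least two distinct values of $x$ (otherwise the half range is zero).

```python
def scale_to_unit_interval(xs):
    """Map xs linearly onto [-1, 1] using the midrange and half range."""
    x_max, x_min = max(xs), min(xs)
    midrange = (x_max + x_min) / 2
    half_range = (x_max - x_min) / 2  # zero if all values are equal
    return [(x - midrange) / half_range for x in xs]


# Hypothetical large-valued explanatory variable.
raw = [10000.0, 25000.0, 40000.0, 70000.0]
scaled = scale_to_unit_interval(raw)
print(scaled)

# High powers of the scaled values stay within [-1, 1].
print([x ** 19 for x in scaled])
```

Because every scaled value has absolute value at most 1, its powers cannot grow, whatever the degree of the polynomial.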
Further problems of collinearity (see Section 4.14) can arise when computing high order polynomial regressions. Successive powers of $x$ may be highly correlated, leading again to numerical instability and consequent unreliability of the estimates. This problem is avoided by the use of orthogonal polynomials. Interested readers are referred to Chambers and Hastie (1991). Orthogonal polynomials are used in a regression model for mortality in Section 6.2.
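The correlation between successive powers is easy to see numerically. The sketch below uses a made-up variable $x = 1, \ldots, 20$ and a hand-written Pearson correlation. It also shows the effect of centering and scaling $x$, which is a simpler device than orthogonal polynomials and is included only for comparison: it removes the correlation between $x$ and $x^2$ for symmetric data, but not between $x$ and $x^3$.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(ss_u * ss_v)


x = list(range(1, 21))  # illustrative values only
raw_corr = pearson(x, [v ** 2 for v in x])

# Center at the midrange and divide by the half range.
mid = (max(x) + min(x)) / 2
half = (max(x) - min(x)) / 2
xs = [(v - mid) / half for v in x]
scaled_corr_sq = pearson(xs, [v ** 2 for v in xs])
scaled_corr_cube = pearson(xs, [v ** 3 for v in xs])

print(raw_corr, scaled_corr_sq, scaled_corr_cube)
```

Orthogonal polynomials go further by constructing terms that are mutually uncorrelated at every degree.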