2. ASSESSMENT OF SHORE-BASED RECREATIONAL FISHING
2.2. Sea urchin harvesting in the 2010 - 2011 season
Overdispersion, or extra-Poisson variation, was discussed in Section 2.9.
Third party claims. Table 6.8 shows the observed mean and variance of the number of claims in 176 geographical areas, when the data is segmented into five groups corresponding to quintiles of the number of accidents. The variance is much larger than the mean. Thus a Poisson model for the number of claims is inappropriate.
90 Models for count data

Table 6.9. Negative binomial regression results for third party claims

Response variable       Number of claims
Response distribution   Negative binomial
Link                    log
Offset                  log population
Deviance                192.3
Degrees of freedom      174

Parameter       df   β̂        se      χ²        p-value
Intercept       1    −6.954   0.162   1836.69   < 0.0001
log accidents   1    0.254    0.025   100.04    < 0.0001
κ                    0.172    0.020

One alternative to Poisson regression is negative binomial regression. In Section 2.9 it was shown that the negative binomial distribution may be viewed as a statistical model for counts, in the situation where overdispersion is explained by heterogeneity of the mean over the population. The negative binomial regression model, using the log link, is
$$y \sim \mathrm{NB}(\mu, \kappa), \qquad \ln \mu = \ln n + x\beta. \tag{6.2}$$
Third party claims. In Section 5.3 a linear relationship was demonstrated between the log of the number of third party claims and the log of the number of accidents. A suitable negative binomial regression model is then
$$y \sim \mathrm{NB}(\mu, \kappa), \qquad \ln \mu = \ln n + \beta_1 + \beta_2 \ln z, \tag{6.3}$$
where y is the number of third party claims, z is the number of accidents and n is the population size of an area. Statistical division is also significant in the regression, and it is shown in Section 10.3 that the model selected according to the AIC has both log accidents and statistical division as explanatory variables. However, to keep the discussion simple, this section considers only log accidents.
The results of the fit are given in Table 6.9 – see code and output on page 156. The mean number of claims is related to log accidents, correcting for exposure:
$$\hat{\mu} = n\, e^{-6.954 + 0.254 \ln z}.$$
If z increases by a factor of a to az then the rate μ/n is estimated to increase by a factor of $e^{0.254 \ln a} = a^{0.254}$. For example, with a 10% increase in the number of accidents, a = 1.1, the estimated effect on the expected claim rate is $1.1^{0.254} = 1.02$, a 2% increase.
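The arithmetic in this interpretation can be checked with a short sketch. It uses the coefficients reported in Table 6.9; the function names and the population and accident figures in the usage lines are hypothetical, chosen only for illustration.

```python
import math

# Fitted coefficients as reported in Table 6.9.
BETA_INTERCEPT = -6.954
BETA_LOG_ACCIDENTS = 0.254


def expected_claims(population, accidents):
    """Fitted mean number of claims: n * exp(b1 + b2 * ln z)."""
    linear_predictor = BETA_INTERCEPT + BETA_LOG_ACCIDENTS * math.log(accidents)
    return population * math.exp(linear_predictor)


def rate_multiplier(factor, slope=BETA_LOG_ACCIDENTS):
    """Multiplicative change in the claim rate mu/n when z becomes factor * z."""
    return factor ** slope


# A 10% increase in accidents (a = 1.1) multiplies the estimated claim rate by a**0.254.
print(round(rate_multiplier(1.1), 4))

# Hypothetical area: the ratio of fitted means depends only on the accident ratio,
# not on the population, because ln n enters as an offset.
print(expected_claims(2000, 110) / expected_claims(2000, 100))
```

Because the offset ln n has a fixed coefficient of one, the population cancels in the ratio of fitted means, which is why the effect can be quoted as a change in the claim rate.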
The above fit can be compared to a fit with κ = 0, that is the Poisson model – see code and output on page 155. This fitted model has a deviance of 15 836.7 on 174 degrees of freedom, indicating a clear lack of fit, which cannot be remedied by the addition of more explanatory variables to the model.
6.2 Poisson overdispersion and negative binomial regression 91
In comparison, the negative binomial model has a deviance of 192.3 on 174 degrees of freedom.
SAS notes. The log link is the default link for negative binomial fits in SAS.
The canonical link g(μ) = ln{μ/(1 + κμ)} is generally not useful.
Testing for Poisson overdispersion. With a negative binomial fit, an estimated κ close to zero suggests a Poisson response. A formal test of κ = 0 is based on the likelihood ratio test. Since κ = 0 is at the boundary of the possible range κ ≥ 0, the distribution of the test statistic is non-standard and requires care. The likelihood ratio test statistic is $2(\ell_{NB} - \ell_P)$, where $\ell_{NB}$ and $\ell_P$ are the values of the log-likelihood under the negative binomial and Poisson models, respectively. Under the hypothesis κ = 0, the distribution of the statistic has a mass of 0.5 at zero and a half-$\chi^2_1$ distribution above zero. A test at the 100α% significance level therefore requires a rejection region corresponding to the upper 2α point of the $\chi^2_1$ distribution (Cameron and Trivedi 1998).
Third party claims. The Poisson and negative binomial regressions yield $\ell_P = 644\,365$ and $\ell_{NB} = 651\,879$. Hence the likelihood ratio statistic is 15 027. For a significance level α = 0.05, the hypothesis κ = 0 is rejected if the likelihood ratio statistic is greater than the upper 10% point of the $\chi^2_1$ distribution, which is 2.71. The hypothesis κ = 0 is therefore rejected at all conventional significance levels, and the conclusion is that overdispersion is indeed present.
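The boundary-corrected test can be sketched in a few lines. This is a minimal illustration, not the SAS or R code referred to in the text: the function names are hypothetical, and the survival function of the chi-square distribution with one degree of freedom is written using the standard identity P(X > x) = erfc(sqrt(x/2)). Applied to the rounded log-likelihoods quoted above, the statistic comes out as 15 028, which agrees with the reported 15 027 up to rounding of the inputs.

```python
import math


def chi2_1_sf(x):
    """Upper tail probability of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def boundary_lr_test(loglik_poisson, loglik_negbin):
    """Likelihood ratio test of kappa = 0 against kappa > 0.

    Returns (statistic, p_value). Under kappa = 0 the statistic has mass 0.5
    at zero and a half chi-square(1) distribution above zero, so the p-value
    is half the usual chi-square(1) tail probability.
    """
    statistic = 2.0 * (loglik_negbin - loglik_poisson)
    if statistic <= 0.0:
        return statistic, 1.0
    return statistic, 0.5 * chi2_1_sf(statistic)


# Log-likelihoods as quoted in the text (rounded values).
statistic, p_value = boundary_lr_test(644365, 651879)
print(statistic, p_value)

# At alpha = 0.05 the critical value is the upper 10% point of chi-square(1).
print(boundary_lr_test(0.0, 2.7055 / 2.0))
```

Halving the tail probability is equivalent to comparing the statistic with the upper 2α point, as described above.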
SAS notes. The log-likelihoods for the Poisson and negative binomial reported by SAS are correct up to a constant: they omit the $\ln c(y, \phi)$ terms, which for both Poisson and negative binomial are $-\ln y!$. These terms cancel in the computation of the likelihood ratio.
Swedish mortality. These data, described on page 17, are displayed in the top panel of Figure 6.4. For male deaths, the Poisson GLM which treats both age and year as categorical variables is
$$y_{ij} \sim \mathrm{P}(\mu_{ij}), \qquad \ln(\mu_{ij}) = \ln(n_{ij}) + x_{ij}\beta. \tag{6.4}$$
Here $i = 0, \ldots, 109$ refers to age, and $j = 1, \ldots, 55$ to calendar years 1951 through to 2005. The response $y_{ij}$ is the number of deaths while $n_{ij}$ is the number at risk. The vector $x_{ij}$ contains the values of the explanatory variables for each age i, year j combination. The explanatory variables are the usual intercept 1, indicator variables corresponding to age, and indicators corresponding to calendar year. The model has 109 + 54 + 1 = 164 parameters.
The deviance for the fitted Poisson regression is 21 589 on 5704 degrees of freedom, suggesting overdispersion. Using a negative binomial response yields a much better fit: a deviance of 2838 on 5704 degrees of freedom. The fitted regression coefficients are displayed in Figure 6.5 (see code and output on page 157).
Fig. 6.4. Observed and fitted Swedish male death rates
Fig. 6.5. Fitted Swedish male death rates using negative binomial model. One panel plots log death rate against age (data at year 2005; fitted values for categorical age and polynomial age). The other plots log death rate against year (data at age 50; fitted values for categorical year and polynomial year).
The smooth progression of the estimated beta coefficients over age and year suggests that fewer parameters may suffice. Smoothness is exploited by specifying polynomials in the regression, for both age and year:
$$\ln(\mu_{ij}) = \ln(n_{ij}) + \beta_0 + \beta_1 i + \cdots + \beta_p i^p + \beta_{p+1} j + \cdots + \beta_{p+q} j^q.$$
Using the AIC as a selection criterion for p and q yields p = 25 and q = 4, and hence a model with 30 parameters. The AIC for this “optimal” model is 53 898, compared to a value of 54 023 corresponding to unconstrained coefficients. Figure 6.5 and the bottom panel of Figure 6.4 display the fitted values in two and three dimensions respectively (see code and output on page 157).
More weight can be given to recent data by using weights which increase with calendar year. Weighted fits are appropriate when future rates are predicted and the more recent data is seen as more relevant for prediction.
SAS notes. Numerical problems often occur when there are a large number of polynomial terms, as in the current Swedish mortality example. These problems manifest themselves in spuriously large standard errors of one or more coefficient estimates, or in the unavailability of coefficients for the high order polynomials. Standardization of the x variable is suggested in Section 4.11; however, in this case this is not successful. Orthogonal polynomials are used to avoid these numerical problems. These are not implemented in SAS proc genmod, but are available in the statistical language R (Ihaka and Gentleman 1996). The above application was carried out using R.
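To illustrate what an orthogonal polynomial basis is, the following sketch builds one in plain Python. It is not the R code used for the application above and makes no claim to reproduce R's output exactly; the function name and the choice of degree 4 in the usage lines are assumptions made for illustration. Each new column is obtained by multiplying the previous column by the centred variable and removing its components along the columns already built, so the columns are mutually orthogonal and of unit length, which avoids the near-collinearity of the raw powers i, i², i³ and so on.

```python
import math


def orthonormal_poly_basis(x, degree):
    """Return degree + 1 columns q_0, ..., q_degree evaluated at the points x.

    q_k is a polynomial of degree k in x, and the columns are orthonormal
    with respect to the ordinary dot product over the points x.
    Requires more distinct x values than the requested degree.
    """
    n = len(x)
    mean = sum(x) / n
    centred = [value - mean for value in x]
    basis = [[1.0 / math.sqrt(n)] * n]  # constant column, unit length
    for _ in range(degree):
        # Raise the degree by one: multiply the last column by the centred x.
        candidate = [c * q for c, q in zip(centred, basis[-1])]
        # Remove components along existing columns (two passes for stability).
        for _pass in range(2):
            for column in basis:
                projection = sum(c * q for c, q in zip(candidate, column))
                candidate = [c - projection * q for c, q in zip(candidate, column)]
        norm = math.sqrt(sum(c * c for c in candidate))
        basis.append([c / norm for c in candidate])
    return basis


# Ages 0, ..., 109 as in the Swedish mortality example; degree chosen for illustration.
ages = list(range(110))
columns = orthonormal_poly_basis(ages, 4)
print(len(columns), len(columns[0]))
```

Regressing on these columns spans the same polynomial space as the raw powers, so the fitted values are unchanged in exact arithmetic; only the coefficients and their numerical conditioning differ.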