
with corresponding changes in the covariance matrix and that $Z$ can always be transformed to $d$ independent standardized normal random variables. See also Note 2.6.

It follows that the pivots

$$(\theta - \hat\theta)^T i(\theta)(\theta - \hat\theta) \tag{6.53}$$

or

$$(\theta - \hat\theta)^T \hat\jmath\,(\theta - \hat\theta) \tag{6.54}$$

can be used to form approximate confidence regions for $\theta$. In particular, the second, and more convenient, form produces a series of concentric similar ellipsoidal regions corresponding to different confidence levels.

The three quadratic statistics discussed in Section 6.3.4 take the forms, respectively

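$$W_E = (\hat\theta - \theta_0)^T i(\theta_0)(\hat\theta - \theta_0), \tag{6.55}$$

$$W_L = 2\{l(\hat\theta) - l(\theta_0)\}, \tag{6.56}$$

$$W_U = U(\theta_0; Y)^T i^{-1}(\theta_0)\, U(\theta_0; Y). \tag{6.57}$$

Again we defer discussion of the relative merits of these until Sections 6.6 and 6.11.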
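For a concrete feel for (6.55) to (6.57), the following minimal sketch evaluates the three statistics for independent Poisson counts with a scalar mean $\theta$. The Poisson model, the data and the function name `quadratic_statistics` are assumptions chosen purely for illustration; in this model $l(\theta) = \sum y_k \log\theta - n\theta$ up to a constant, $U(\theta) = \sum y_k/\theta - n$ and $i(\theta) = n/\theta$.

```python
import math

def quadratic_statistics(y, theta0):
    """Return (W_E, W_L, W_U) for i.i.d. Poisson counts y and null value theta0."""
    n = len(y)
    total = sum(y)
    theta_hat = total / n                      # maximum likelihood estimate

    def loglik(theta):
        # Poisson log likelihood, dropping the term that does not involve theta
        return total * math.log(theta) - n * theta

    score0 = total / theta0 - n                # U(theta0; y)
    info0 = n / theta0                         # expected information i(theta0)

    w_e = (theta_hat - theta0) ** 2 * info0    # (6.55)
    w_l = 2.0 * (loglik(theta_hat) - loglik(theta0))  # (6.56)
    w_u = score0 ** 2 / info0                  # (6.57)
    return w_e, w_l, w_u

w_e, w_l, w_u = quadratic_statistics([3, 5, 4, 6, 2], theta0=5.0)
print(w_e, w_l, w_u)
```

In this particular model $W_E$ and $W_U$ coincide algebraically, while $W_L$ differs slightly; each would be referred to the chi-squared distribution with one degree of freedom.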

6.4 Nuisance parameters

6.4.1 The information matrix

In the great majority of situations with a multidimensional parameter $\theta$, we need to write $\theta = (\psi, \lambda)$, where $\psi$ is the parameter of interest and $\lambda$ the nuisance parameter. Correspondingly we partition $U(\theta; Y)$ into two components $U_\psi(\theta; Y)$ and $U_\lambda(\theta; Y)$. Similarly we partition the information matrix and its inverse in the form

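$$i(\theta) = \begin{pmatrix} i_{\psi\psi} & i_{\psi\lambda} \\ i_{\lambda\psi} & i_{\lambda\lambda} \end{pmatrix}, \tag{6.58}$$

and

$$i^{-1}(\theta) = \begin{pmatrix} i^{\psi\psi} & i^{\psi\lambda} \\ i^{\lambda\psi} & i^{\lambda\lambda} \end{pmatrix}. \tag{6.59}$$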

There are corresponding partitions for the observed information $\hat\jmath$.

6.4.2 Main distributional results

A direct approach for inference about $\psi$ is based on the maximum likelihood estimate $\hat\psi$, which is asymptotically normal with mean $\psi$ and covariance matrix $i^{\psi\psi}$ or equivalently $\hat\jmath^{\psi\psi}$. In terms of a quadratic statistic, we have for testing whether $\psi = \psi_0$ the form

$$W_E = (\hat\psi - \psi_0)^T (i^{\psi\psi})^{-1} (\hat\psi - \psi_0) \tag{6.60}$$

with the possibility of using $(\hat\jmath^{\psi\psi})^{-1}$ rather than $(i^{\psi\psi})^{-1}$. Note also that even if $\psi_0$ were used in the calculation of $i^{\psi\psi}$ it would still be necessary to estimate $\lambda$ except for those special problems in which the information does not depend on $\lambda$.

Now to study $\psi$ via the gradient vector, as far as possible separated from $\lambda$, it turns out to be helpful to write $U_\psi$ as a linear combination of $U_\lambda$ plus a term uncorrelated with $U_\lambda$, i.e., as a linear least squares regression plus an uncorrelated deviation. This representation is

$$U_\psi = i_{\psi\lambda}\, i_{\lambda\lambda}^{-1}\, U_\lambda + U_{\psi\cdot\lambda}, \tag{6.61}$$

say, where $U_{\psi\cdot\lambda}$ denotes the deviation of $U_\psi$ from its linear regression on $U_\lambda$. Then a direct calculation shows that

$$\mathrm{cov}(U_{\psi\cdot\lambda}) = i_{\psi\psi\cdot\lambda}, \tag{6.62}$$

where

$$i_{\psi\psi\cdot\lambda} = i_{\psi\psi} - i_{\psi\lambda}\, i_{\lambda\lambda}^{-1}\, i_{\lambda\psi} = (i^{\psi\psi})^{-1}. \tag{6.63}$$

The second form follows from a general expression for the inverse of a partitioned matrix.
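The identity in (6.63) is easy to check numerically. The sketch below does so for scalar $\psi$ and scalar $\lambda$, so that every block is a single number; the matrix entries and the helper `inverse_2x2` are hypothetical choices made only for this illustration.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical information matrix: scalar psi first, scalar lambda second
info = [[4.0, 1.0],
        [1.0, 2.0]]

i_pp, i_pl = info[0]
i_lp, i_ll = info[1]

# Adjusted information i_{psi psi . lambda}, first form in (6.63)
adjusted = i_pp - i_pl * (1.0 / i_ll) * i_lp

# (i^{psi psi})^{-1}: invert the full matrix, take the psi-psi entry, invert again
upper_pp = inverse_2x2(info)[0][0]
from_inverse = 1.0 / upper_pp

print(adjusted, from_inverse)
```

Both routes give the same value, which is smaller than $i_{\psi\psi}$ itself whenever $i_{\psi\lambda} \neq 0$: not knowing $\lambda$ costs information about $\psi$.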

A further property of the adjusted gradient which follows by direct evaluation of the resulting matrix products by (6.63) is that

$$E(U_{\psi\cdot\lambda}; \theta + \delta) = i_{\psi\psi\cdot\lambda}\,\delta_\psi + O(\|\delta\|^2), \tag{6.64}$$

i.e., to the first order the adjusted gradient does not depend on $\lambda$. This has the important consequence that in using the gradient-based statistic to test a null hypothesis $\psi = \psi_0$, namely

$$W_U = U_{\psi\cdot\lambda}^T(\psi_0, \lambda)\, i^{\psi\psi}(\psi_0, \lambda)\, U_{\psi\cdot\lambda}(\psi_0, \lambda), \tag{6.65}$$

it is enough to replace $\lambda$ by, for example, its maximum likelihood estimate given $\psi_0$, or even by inefficient estimates.

The second version of the quadratic statistic (6.56), corresponding more directly to the likelihood function, requires the collapsing of the log likelihood into a function of $\psi$ alone, i.e., the elimination of dependence on $\lambda$.


This might be achieved by a semi-Bayesian argument in which $\lambda$ but not $\psi$ is assigned a prior distribution but, in the spirit of the present discussion, it is done by maximization. For given $\psi$ we define $\hat\lambda_\psi$ to be the maximum likelihood estimate of $\lambda$ and then define the profile log likelihood of $\psi$ to be

$$l_P(\psi) = l(\psi, \hat\lambda_\psi), \tag{6.66}$$

a function of $\psi$ alone and, of course, of the data. The analogue of the previous likelihood ratio statistic for testing $\psi = \psi_0$ is now

$$W_L = 2\{l_P(\hat\psi) - l_P(\psi_0)\}. \tag{6.67}$$

Expansions of the log likelihood about the point $(\psi_0, \lambda)$ show that in the asymptotic expansion, we have to the first term that $W_L = W_U$ and therefore that $W_L$ has a limiting chi-squared distribution with $d_\psi$ degrees of freedom when $\psi = \psi_0$. Further, because of the relation between significance tests and confidence regions, the set of values of $\psi$ defined as

$$\{\psi : 2\{l_P(\hat\psi) - l_P(\psi)\} \le k^{*2}_{d_\psi;\,c}\} \tag{6.68}$$

forms an approximate $1 - c$ level confidence set for $\psi$.
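As an illustration of (6.66) to (6.68), consider a normal sample with the mean as $\psi$ and the variance as $\lambda$. This model, the data, the grid search and all function names below are assumptions of the sketch rather than anything prescribed by the text. For given mean $\mu$ the constrained estimate of the variance is $\hat\sigma^2_\mu = n^{-1}\sum (y_k - \mu)^2$, so the profile log likelihood is available in closed form.

```python
import math

CHI2_1_UPPER_5PCT = 3.841458820694124  # upper 5% point of chi-squared, 1 d.f.

def profile_loglik(mu, y):
    """l_P(mu) for a normal sample: variance replaced by its MLE given mu."""
    n = len(y)
    sigma2_hat_mu = sum((v - mu) ** 2 for v in y) / n   # lambda-hat given psi = mu
    return -0.5 * n * math.log(sigma2_hat_mu) - 0.5 * n  # terms in 2*pi dropped

def w_l(mu0, y):
    """Profile likelihood ratio statistic (6.67) for testing mu = mu0."""
    mu_hat = sum(y) / len(y)
    return 2.0 * (profile_loglik(mu_hat, y) - profile_loglik(mu0, y))

def confidence_interval(y, half_width=10.0, steps=20001):
    """Grid approximation to the set (6.68) with c = 0.05."""
    mu_hat = sum(y) / len(y)
    grid = [mu_hat - half_width + 2.0 * half_width * k / (steps - 1)
            for k in range(steps)]
    inside = [mu for mu in grid if w_l(mu, y) <= CHI2_1_UPPER_5PCT]
    return min(inside), max(inside)

y = [1.0, 2.0, 3.0, 4.0, 5.0]
print(w_l(2.0, y))
print(confidence_interval(y))
```

A grid is used only to keep the sketch dependency-free; a root finder applied to `w_l(mu, y) - CHI2_1_UPPER_5PCT` on each side of the maximum likelihood estimate would give the endpoints more precisely.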

6.4.3 More on profile likelihood

The possibility of obtaining tests and confidence sets from the profile log likelihood $l_P(\psi)$ stems from the relation between the curvature of $l_P(\psi)$ at its maximum and the corresponding properties of the initial log likelihood $l(\psi, \lambda)$.

To see this relation, let $\nabla_\psi$ and $\nabla_\lambda$ denote the $d_\psi \times 1$ and $d_\lambda \times 1$ operations of partial differentiation with respect to $\psi$ and $\lambda$ respectively and let $D_\psi$ denote total differentiation of any function of $\psi$ and $\hat\lambda_\psi$ with respect to $\psi$. Then, by the definition of total differentiation,

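$$D_\psi^T l_P(\psi) = \nabla_\psi^T l(\psi, \hat\lambda_\psi) + \nabla_\lambda^T l(\psi, \hat\lambda_\psi)\{D_\psi(\hat\lambda_\psi^T)\}^T. \tag{6.69}$$

Now apply $D_\psi$ again to get the Hessian matrix of the profile likelihood in the form

$$\begin{aligned}
D_\psi D_\psi^T l_P(\psi) ={}& \nabla_\psi \nabla_\psi^T l(\psi, \hat\lambda_\psi) + \{\nabla_\psi \nabla_\lambda^T l(\psi, \hat\lambda_\psi)(\nabla_\psi \hat\lambda_\psi^T)\}^T \\
&+ \{\nabla_\lambda^T l(\psi, \hat\lambda_\psi)\}\{\nabla_\psi \nabla_\psi^T \hat\lambda_\psi^T\} + (\nabla_\psi \hat\lambda_\psi^T)\nabla_\lambda \nabla_\psi^T l(\psi, \hat\lambda_\psi) \\
&+ (\nabla_\psi \hat\lambda_\psi^T)\{\nabla_\lambda \nabla_\lambda^T l(\psi, \hat\lambda_\psi)(\nabla_\psi \hat\lambda_\psi^T)^T\}.
\end{aligned} \tag{6.70}$$

Because $\hat\lambda_\psi$ depends on $\psi$ alone, $\nabla_\psi \hat\lambda_\psi^T$ here is the same as $D_\psi \hat\lambda_\psi^T$. The maximum likelihood estimate $\hat\lambda_\psi$ satisfies for all $\psi$ the equation

$$\nabla_\lambda^T l(\psi, \hat\lambda_\psi) = 0.$$

Differentiate totally with respect to $\psi$ to give

$$\nabla_\psi \nabla_\lambda^T l(\psi, \hat\lambda_\psi) + (D_\psi \hat\lambda_\psi^T)\nabla_\lambda \nabla_\lambda^T l(\psi, \hat\lambda_\psi) = 0. \tag{6.71}$$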

Thus three of the terms in (6.70) are equal except for sign and the third term is zero in the light of the definition of $\hat\lambda_\psi$. Thus, eliminating $D_\psi \hat\lambda_\psi$, we have that the formal observed information matrix calculated as minus the Hessian matrix of $l_P(\psi)$ evaluated at $\hat\psi$ is

$$\hat\jmath_{P,\psi\psi} = \hat\jmath_{\psi\psi} - \hat\jmath_{\psi\lambda}\,\hat\jmath_{\lambda\lambda}^{-1}\,\hat\jmath_{\lambda\psi} = \hat\jmath_{\psi\psi\cdot\lambda}, \tag{6.72}$$

where the two expressions on the right-hand side of (6.72) are calculated from $l(\psi, \lambda)$. Thus the information matrix for $\psi$ evaluated from the profile likelihood is the same as that evaluated via the full information matrix of all parameters.

This argument takes an especially simple form when both $\psi$ and $\lambda$ are scalar parameters.
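The scalar case can be checked numerically. The sketch below assumes, purely for illustration, the linear model $y_k = \lambda + \psi x_k + \varepsilon_k$ with independent unit-variance normal errors, so that $l(\psi, \lambda) = -\tfrac12 \sum (y_k - \lambda - \psi x_k)^2$ up to a constant; the data and names are hypothetical. It compares minus the second derivative of $l_P(\psi)$, obtained by central differences at $\hat\psi$, with the right-hand side of (6.72).

```python
def profile_loglik(psi, x, y):
    """l_P(psi) for y_k = lambda + psi * x_k + error, unit error variance."""
    n = len(x)
    lam_hat = sum(yk - psi * xk for xk, yk in zip(x, y)) / n   # lambda-hat given psi
    return -0.5 * sum((yk - lam_hat - psi * xk) ** 2 for xk, yk in zip(x, y))

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.9, 5.1, 7.0]
n = len(x)

# Observed information blocks from the full log likelihood l(psi, lambda)
j_pp = sum(xk * xk for xk in x)
j_pl = sum(x)
j_ll = float(n)
j_adjusted = j_pp - j_pl * j_pl / j_ll          # right-hand side of (6.72)

# Maximum likelihood estimate of psi (least squares slope)
x_bar, y_bar = sum(x) / n, sum(y) / n
psi_hat = (sum((xk - x_bar) * (yk - y_bar) for xk, yk in zip(x, y))
           / sum((xk - x_bar) ** 2 for xk in x))

# Minus the second derivative of l_P at psi_hat by central differences
h = 1e-3
j_profile = -(profile_loglik(psi_hat + h, x, y)
              - 2.0 * profile_loglik(psi_hat, x, y)
              + profile_loglik(psi_hat - h, x, y)) / h ** 2

print(j_adjusted, j_profile)
```

Here both quantities equal $\sum (x_k - \bar x)^2$, the familiar precision of a least squares slope when the intercept is unknown.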

6.4.4 Parameter orthogonality

An interesting special case arises when $i_{\lambda\psi} = 0$, so that approximately $j_{\lambda\psi} = 0$. The parameters are then said to be orthogonal. In particular, this implies that the corresponding maximum likelihood estimates are asymptotically independent and, by (6.71), that $D_\psi \hat\lambda_\psi = 0$ and, by symmetry, that $D_\lambda \hat\psi_\lambda = 0$. In nonorthogonal cases if $\psi$ changes by $O(1/\sqrt{n})$, then $\hat\lambda_\psi$ changes by $O_p(1/\sqrt{n})$; for orthogonal parameters, however, the change is $O_p(1/n)$. This property may be compared with that of orthogonality of factors in a balanced experimental design. There the point estimates of the main effect of one factor, being contrasts of marginal means, are not changed by assuming, say, that the main effects of the other factor are null. That is, the $O_p(1/n)$ term in the above discussion is in fact zero.
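A small sketch of the exactly-zero case, assuming for illustration the linear model $y_k = \lambda + \psi x_k + \varepsilon_k$ with unit-variance normal errors (the model, data and function name are hypothetical). The cross-information is $\sum x_k$, so centring the explanatory variable makes intercept and slope orthogonal, and the constrained estimate $\hat\lambda_\psi$ then does not move at all as $\psi$ is varied.

```python
def lam_hat_given_psi(psi, x, y):
    """Constrained MLE of the intercept lambda for fixed slope psi."""
    return sum(yk - psi * xk for xk, yk in zip(x, y)) / len(x)

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 2.9, 5.1, 7.0]
x_bar = sum(x) / len(x)
x_centred = [xk - x_bar for xk in x]

# Non-orthogonal: j_{psi lambda} = sum(x) != 0, so lambda-hat moves with psi
shift_raw = lam_hat_given_psi(2.5, x, y) - lam_hat_given_psi(2.0, x, y)

# Orthogonal: sum(x_centred) == 0, so lambda-hat does not depend on psi
shift_centred = (lam_hat_given_psi(2.5, x_centred, y)
                 - lam_hat_given_psi(2.0, x_centred, y))

print(shift_raw, shift_centred)
```

With the raw covariate the intercept estimate shifts by $-\bar x$ times the change in $\psi$; after centring the shift vanishes up to rounding error.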

There are a number of advantages to having orthogonal or nearly orthogonal parameters, especially component parameters of interest. Independent errors of estimation may ease interpretation, stability of estimates of one parameter under changing assumptions about another can give added security to conclusions and convergence of numerical algorithms may be speeded. Nevertheless, so far as parameters of interest are concerned, subject-matter interpretability has primacy.

Example 6.4. Mixed parameterization of the exponential family. Consider a full exponential family problem in which the canonical parameter $\phi$ and the canonical statistic $s$ are partitioned as $(\phi_1, \phi_2)$ and $(s_1, s_2)$ respectively, thought of as column vectors. Suppose that $\phi_2$ is replaced by $\eta_2$, the corresponding component of the mean parameter $\eta = \nabla k(\phi)$, where $k(\phi)$ is the cumulant generating function occurring in the standard form for the family. Then $\phi_1$ and $\eta_2$ are orthogonal.


To prove this, we find the Jacobian matrix of the transformation from $(\phi_1, \eta_2)$ to $(\phi_1, \phi_2)$ in the form

Here $\nabla_l$ denotes partial differentiation with respect to $\phi_l$ for $l = 1, 2$.

Combination with (6.51) proves the required result. Thus, in the analysis of the $2 \times 2$ contingency table the difference of column means is orthogonal to the log odds ratio; in a normal distribution mean and variance are orthogonal.
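One way to write out the block calculation behind Example 6.4 is sketched below. Several things here are assumptions of the sketch and not taken from the text: the shorthand $k_{ab} = \nabla_a \nabla_b^T k(\phi)$, the layout of the Jacobian (rows for $(\phi_1, \eta_2)$, columns for $(\phi_1, \phi_2)$), and the reading of (6.51) as the usual rule that information transforms as $J^T i(\phi) J$ under reparameterization. It also uses the standard fact that in a full exponential family $i(\phi) = \nabla\nabla^T k(\phi)$.

```latex
% Shorthand (assumed here): k_{ab} = \nabla_a \nabla_b^T k(\phi), a, b = 1, 2.
% Since \eta_2 = \nabla_2 k(\phi):
\[
A = \frac{\partial(\phi_1, \eta_2)}{\partial(\phi_1, \phi_2)}
  = \begin{pmatrix} I & 0 \\ k_{21} & k_{22} \end{pmatrix},
\qquad
J = A^{-1}
  = \begin{pmatrix} I & 0 \\ -k_{22}^{-1} k_{21} & k_{22}^{-1} \end{pmatrix}.
\]
% Information for (\phi_1, \eta_2): J^T i(\phi) J with i(\phi) = (k_{ab})
\[
J^T \begin{pmatrix} k_{11} & k_{12} \\ k_{21} & k_{22} \end{pmatrix} J
  = \begin{pmatrix} k_{11} - k_{12} k_{22}^{-1} k_{21} & 0 \\ 0 & k_{22}^{-1} \end{pmatrix}.
\]
```

The off-diagonal blocks vanish, which is the orthogonality of $\phi_1$ and $\eta_2$; the leading block has the same form as the adjusted information in (6.63).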

Example 6.5. Proportional hazards Weibull model. For the Weibull distribution with density

$$\gamma\rho(\rho y)^{\gamma - 1}\exp\{-(\rho y)^\gamma\} \tag{6.74}$$

and survivor function, or one minus the cumulative distribution function,

$$\exp\{-(\rho y)^\gamma\}, \tag{6.75}$$

the hazard function, being the ratio of the two, is $\gamma\rho(\rho y)^{\gamma - 1}$.

Suppose that $Y_1, \ldots, Y_n$ are independent random variables with Weibull distributions all with the same $\gamma$. Suppose that there are explanatory variables $z_1, \ldots, z_n$ such that the hazard is proportional to $e^{\beta z}$. This is achieved by writing the value of the parameter $\rho$ corresponding to $Y_k$ in the form $\exp\{(\alpha + \beta z_k)/\gamma\}$.

Here without loss of generality we take $\sum z_k = 0$. In many applications $z$ and $\beta$ would be vectors but here, for simplicity, we take one-dimensional explanatory variables. The log likelihood is

$$\sum (\log\gamma + \alpha + \beta z_k) + (\gamma - 1)\sum \log y_k - \sum \exp(\alpha + \beta z_k)\, y_k^\gamma. \tag{6.76}$$

Direct evaluation now shows that, in particular,

where Euler's constant, 0.5772, arises from the integral

$$\int_0^\infty v \log v\, e^{-v}\, dv. \tag{6.80}$$

Now locally near $\beta = 0$ the information elements involving $\beta$ are zero or small, implying local orthogonality of $\beta$ to the other parameters and in particular to $\gamma$. Thus not only are the errors of estimating $\beta$ almost uncorrelated with those of the other parameters but, more importantly in some respects, the value of $\hat\beta_\gamma$ will change only slowly with $\gamma$. In some applications this may mean that analysis based on the exponential distribution, $\gamma = 1$, is relatively insensitive to that assumption, at least so far as the value of the maximum likelihood estimate of $\beta$ is concerned.
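The log likelihood (6.76) is simple to code, and doing so gives a check on the parameterization $\rho_k = \exp\{(\alpha + \beta z_k)/\gamma\}$. The sketch below is illustrative only: the data, parameter values and function names are hypothetical. It evaluates (6.76) directly and compares the result with the sum of log densities from (6.74).

```python
import math

def weibull_log_density(y, rho, gamma):
    """Log of the density (6.74)."""
    return (math.log(gamma) + math.log(rho) + (gamma - 1.0) * math.log(rho * y)
            - (rho * y) ** gamma)

def log_likelihood(alpha, beta, gamma, y, z):
    """Log likelihood (6.76) with rho_k = exp{(alpha + beta * z_k) / gamma}."""
    total = 0.0
    for yk, zk in zip(y, z):
        eta = alpha + beta * zk
        total += (math.log(gamma) + eta + (gamma - 1.0) * math.log(yk)
                  - math.exp(eta) * yk ** gamma)
    return total

# Hypothetical data; z is centred so that sum(z) == 0
y = [0.5, 1.2, 2.0, 0.8]
z = [-1.5, -0.5, 0.5, 1.5]
alpha, beta, gamma = 0.1, 0.3, 1.4

direct = log_likelihood(alpha, beta, gamma, y, z)
via_density = sum(
    weibull_log_density(yk, math.exp((alpha + beta * zk) / gamma), gamma)
    for yk, zk in zip(y, z)
)
print(direct, via_density)
```

A function of this kind could be passed to a numerical optimizer with $\gamma$ held fixed at several values, to examine how slowly $\hat\beta_\gamma$ varies with $\gamma$ in a given data set.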
