
In document UNIVERSITAT DE BARCELONA (página 52-81)

M. E; GARCÍA DÍAZ, R.P; SUÁREZ QUIRÓS, J


with corresponding changes in the covariance matrix and that Z can always be transformed to d independent standardized normal random variables. See also Note 2.6.

It follows that the pivots

$$(\theta - \hat\theta)^T i(\theta)(\theta - \hat\theta) \tag{6.53}$$

or

$$(\theta - \hat\theta)^T \hat\jmath\,(\theta - \hat\theta) \tag{6.54}$$

can be used to form approximate confidence regions for θ. In particular, the second, and more convenient, form produces a series of concentric similar ellipsoidal regions corresponding to different confidence levels.

The three quadratic statistics discussed in Section 6.3.4 take the forms, respectively

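$$W_E = (\hat\theta - \theta_0)^T i(\theta_0)(\hat\theta - \theta_0), \tag{6.55}$$

$$W_L = 2\{l(\hat\theta) - l(\theta_0)\}, \tag{6.56}$$

$$W_U = U(\theta_0; Y)^T i^{-1}(\theta_0) U(\theta_0; Y). \tag{6.57}$$

Again we defer discussion of the relative merits of these until Sections 6.6 and 6.11.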
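The three statistics can be compared numerically in a one-parameter case. The following is a minimal sketch, assuming independent Poisson observations with mean θ, for which the log likelihood is Σy log θ − nθ up to a constant, the score is Σy/θ − n and the expected information is n/θ. The function name `quadratic_statistics` and the data are hypothetical, chosen only for illustration.

```python
import math

def quadratic_statistics(y, theta0):
    """Return (W_E, W_L, W_U) for testing theta = theta0 in a Poisson model.

    Assumes the observations are not all zero, so that the maximum
    likelihood estimate is positive.
    """
    n = len(y)
    total = sum(y)
    theta_hat = total / n                      # maximum likelihood estimate

    def loglik(theta):
        # Log likelihood up to a constant not involving theta.
        return total * math.log(theta) - n * theta

    score0 = total / theta0 - n                # U(theta0; Y)
    info0 = n / theta0                         # expected information i(theta0)

    w_e = (theta_hat - theta0) ** 2 * info0             # form (6.55)
    w_l = 2.0 * (loglik(theta_hat) - loglik(theta0))    # form (6.56)
    w_u = score0 ** 2 / info0                           # form (6.57)
    return w_e, w_l, w_u

y = [3, 5, 4, 6, 2, 7, 4, 5]                   # illustrative counts
w_e, w_l, w_u = quadratic_statistics(y, theta0=4.0)
print(w_e, w_l, w_u)
```

In this particular model W_E and W_U coincide algebraically, both being n(ȳ − θ0)²/θ0, while W_L differs from them by terms of higher order in ȳ − θ0.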

6.4 Nuisance parameters

6.4.1 The information matrix

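In the great majority of situations with a multidimensional parameter θ, we need to write θ = (ψ, λ), where ψ is the parameter of interest and λ the nuisance parameter. Correspondingly we partition $U(\theta; Y)$ into two components $U_\psi(\theta; Y)$ and $U_\lambda(\theta; Y)$. Similarly we partition the information matrix and its inverse in the form

$$i(\theta) = \begin{pmatrix} i_{\psi\psi} & i_{\psi\lambda} \\ i_{\lambda\psi} & i_{\lambda\lambda} \end{pmatrix}, \tag{6.58}$$

and

$$i^{-1}(\theta) = \begin{pmatrix} i^{\psi\psi} & i^{\psi\lambda} \\ i^{\lambda\psi} & i^{\lambda\lambda} \end{pmatrix}. \tag{6.59}$$

There are corresponding partitions for the observed information $\hat\jmath$.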

6.4.2 Main distributional results

A direct approach for inference about ψ is based on the maximum likelihood estimate $\hat\psi$, which is asymptotically normal with mean ψ and covariance matrix $i^{\psi\psi}$ or equivalently $\hat\jmath^{\psi\psi}$. In terms of a quadratic statistic, we have for testing whether $\psi = \psi_0$ the form

$$W_E = (\hat\psi - \psi_0)^T (i^{\psi\psi})^{-1} (\hat\psi - \psi_0) \tag{6.60}$$

with the possibility of using $(\hat\jmath^{\psi\psi})^{-1}$ rather than $(i^{\psi\psi})^{-1}$. Note also that even if $\psi_0$ were used in the calculation of $i^{\psi\psi}$ it would still be necessary to estimate λ except for those special problems in which the information does not depend on λ.

Now to study ψ via the gradient vector, as far as possible separated from λ, it turns out to be helpful to write $U_\psi$ as a linear combination of $U_\lambda$ plus a term uncorrelated with $U_\lambda$, i.e., as a linear least squares regression plus an uncorrelated deviation. This representation is

$$U_\psi = i_{\psi\lambda} i_{\lambda\lambda}^{-1} U_\lambda + U_{\psi\cdot\lambda}, \tag{6.61}$$

say, where $U_{\psi\cdot\lambda}$ denotes the deviation of $U_\psi$ from its linear regression on $U_\lambda$. Then a direct calculation shows that

$$\mathrm{cov}(U_{\psi\cdot\lambda}) = i_{\psi\psi\cdot\lambda}, \tag{6.62}$$

where

$$i_{\psi\psi\cdot\lambda} = i_{\psi\psi} - i_{\psi\lambda} i_{\lambda\lambda}^{-1} i_{\lambda\psi} = (i^{\psi\psi})^{-1}. \tag{6.63}$$

The second form follows from a general expression for the inverse of a partitioned matrix.
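The partitioned-inverse identity behind (6.63) can be checked exactly on a small example. This is only a sketch under stated assumptions: the 3 × 3 matrix `info` is a made-up positive definite information matrix with scalar ψ (index 0) and two-dimensional λ, and `inverse` is a hypothetical helper written here with exact rational arithmetic rather than a library routine.

```python
from fractions import Fraction

def inverse(matrix):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(row) + [Fraction(int(r == c)) for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        # Bring a row with a nonzero pivot into position.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        # Eliminate the pivot column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Hypothetical information matrix: psi is index 0, lambda is indices 1 and 2.
info = [[Fraction(4), Fraction(1), Fraction(2)],
        [Fraction(1), Fraction(3), Fraction(0)],
        [Fraction(2), Fraction(0), Fraction(5)]]

i_psi_psi = info[0][0]
i_psi_lam = info[0][1:]
i_lam_lam = [row[1:] for row in info[1:]]

# Left-hand form of (6.63): i_psipsi - i_psilam i_lamlam^{-1} i_lampsi.
lam_inv = inverse(i_lam_lam)
adjusted = i_psi_psi - sum(i_psi_lam[a] * lam_inv[a][b] * i_psi_lam[b]
                           for a in range(2) for b in range(2))

# Right-hand form of (6.63): inverse of the psi-psi block of the full inverse.
upper_block = inverse(info)[0][0]
print(adjusted, 1 / upper_block)
```

Both printed values agree, as (6.63) requires; exact fractions are used so that the comparison does not depend on rounding.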

A further property of the adjusted gradient which follows by direct evaluation of the resulting matrix products by (6.63) is that

$$E(U_{\psi\cdot\lambda}; \theta + \delta) = i_{\psi\psi\cdot\lambda}\,\delta_\psi + O(\|\delta\|^2), \tag{6.64}$$

i.e., to the first order the adjusted gradient does not depend on λ. This has the important consequence that in using the gradient-based statistic to test a null hypothesis $\psi = \psi_0$, namely

$$W_U = U_{\psi\cdot\lambda}^T(\psi_0, \lambda)\, i^{\psi\psi}(\psi_0, \lambda)\, U_{\psi\cdot\lambda}(\psi_0, \lambda), \tag{6.65}$$

it is enough to replace λ by, for example, its maximum likelihood estimate given $\psi_0$, or even by inefficient estimates.

The second version of the quadratic statistic (6.56), corresponding more directly to the likelihood function, requires the collapsing of the log likelihood into a function of ψ alone, i.e., the elimination of dependence on λ.


This might be achieved by a semi-Bayesian argument in which λ but not ψ is assigned a prior distribution but, in the spirit of the present discussion, it is done by maximization. For given ψ we define $\hat\lambda_\psi$ to be the maximum likelihood estimate of λ and then define the profile log likelihood of ψ to be

$$l_P(\psi) = l(\psi, \hat\lambda_\psi), \tag{6.66}$$

a function of ψ alone and, of course, of the data. The analogue of the previous likelihood ratio statistic for testing $\psi = \psi_0$ is now

$$W_L = 2\{l_P(\hat\psi) - l_P(\psi_0)\}. \tag{6.67}$$

Expansions of the log likelihood about the point $(\psi_0, \lambda)$ show that in the asymptotic expansion, we have to the first term that $W_L = W_U$ and therefore that $W_L$ has a limiting chi-squared distribution with $d_\psi$ degrees of freedom when $\psi = \psi_0$. Further because of the relation between significance tests and confidence regions, the set of values of ψ defined as

$$\{\psi : 2\{l_P(\hat\psi) - l_P(\psi)\} \le k^{*2}_{d_\psi;\,c}\} \tag{6.68}$$

forms an approximate 1 − c level confidence set for ψ.
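As an illustration of (6.66) to (6.68), consider a normal sample in which the mean is the parameter of interest and the variance is the nuisance parameter. The sketch below rests on those assumptions; the function names and data are hypothetical. For a given mean μ the constrained estimate of the variance is the average of (y − μ)², so the profile log likelihood is available in closed form, and the set (6.68) reduces to an interval. The chi-squared point with one degree of freedom is obtained as the square of a standard normal quantile.

```python
import math
from statistics import NormalDist

def profile_loglik(mu, y):
    """Profile log likelihood of the normal mean, variance maximized out."""
    n = len(y)
    var_hat_mu = sum((v - mu) ** 2 for v in y) / n   # constrained MLE of variance
    return -0.5 * n * math.log(var_hat_mu) - 0.5 * n

def profile_interval(y, c=0.05):
    """Approximate 1 - c confidence interval from the set (6.68), d_psi = 1."""
    n = len(y)
    mu_hat = sum(y) / n
    var_hat = sum((v - mu_hat) ** 2 for v in y) / n
    # Upper c point of chi-squared with 1 degree of freedom.
    k = NormalDist().inv_cdf(1 - c / 2) ** 2
    # Here 2{l_P(mu_hat) - l_P(mu)} = n log{1 + (mu - mu_hat)^2 / var_hat},
    # so the inequality in (6.68) can be solved explicitly for mu.
    half_width = math.sqrt(var_hat * (math.exp(k / n) - 1.0))
    return mu_hat - half_width, mu_hat + half_width, k

y = [4.1, 5.3, 3.8, 6.0, 4.7, 5.5, 4.9, 5.1]     # illustrative data
lower, upper, k = profile_interval(y)
mu_hat = sum(y) / len(y)
# At an endpoint the profile likelihood ratio statistic equals the threshold.
w_at_lower = 2.0 * (profile_loglik(mu_hat, y) - profile_loglik(lower, y))
print(lower, upper, w_at_lower, k)
```

In problems without a closed form for the constrained estimate, the same construction would require a numerical maximization over λ at each value of ψ.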

6.4.3 More on profile likelihood

The possibility of obtaining tests and confidence sets from the profile log likelihood $l_P(\psi)$ stems from the relation between the curvature of $l_P(\psi)$ at its maximum and the corresponding properties of the initial log likelihood $l(\psi, \lambda)$.

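To see this relation, let $\nabla_\psi$ and $\nabla_\lambda$ denote the $d_\psi \times 1$ and $d_\lambda \times 1$ operations of partial differentiation with respect to ψ and λ respectively and let $D_\psi$ denote total differentiation of any function of ψ and $\hat\lambda_\psi$ with respect to ψ. Then, by the definition of total differentiation,

$$D_\psi^T l_P(\psi) = \nabla_\psi^T l(\psi, \hat\lambda_\psi) + \nabla_\lambda^T l(\psi, \hat\lambda_\psi)\{D_\psi(\hat\lambda_\psi^T)\}^T. \tag{6.69}$$

Now apply $D_\psi$ again to get the Hessian matrix of the profile likelihood in the form

$$\begin{aligned}
D_\psi D_\psi^T l_P(\psi) = {} & \nabla_\psi\nabla_\psi^T l(\psi, \hat\lambda_\psi) + \nabla_\psi\nabla_\lambda^T l(\psi, \hat\lambda_\psi)(D_\psi\hat\lambda_\psi^T)^T \\
& + \{\nabla_\lambda^T l(\psi, \hat\lambda_\psi)\}\{D_\psi D_\psi^T \hat\lambda_\psi^T\} + (D_\psi\hat\lambda_\psi^T)\nabla_\lambda\nabla_\psi^T l(\psi, \hat\lambda_\psi) \\
& + (D_\psi\hat\lambda_\psi^T)\{\nabla_\lambda\nabla_\lambda^T l(\psi, \hat\lambda_\psi)(D_\psi\hat\lambda_\psi^T)^T\}.
\end{aligned} \tag{6.70}$$

The maximum likelihood estimate $\hat\lambda_\psi$ satisfies for all ψ the equation $\nabla_\lambda^T l(\psi, \hat\lambda_\psi) = 0$. Differentiate totally with respect to ψ to give

$$\nabla_\psi\nabla_\lambda^T l(\psi, \hat\lambda_\psi) + (D_\psi\hat\lambda_\psi^T)\nabla_\lambda\nabla_\lambda^T l(\psi, \hat\lambda_\psi) = 0. \tag{6.71}$$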

Thus three of the terms in (6.70) are equal except for sign and the third term is zero in the light of the definition of $\hat\lambda_\psi$. Thus, eliminating $D_\psi\hat\lambda_\psi$, we have that the formal observed information matrix calculated as minus the Hessian matrix of $l_P(\psi)$ evaluated at $\hat\psi$ is

$$\hat\jmath_{P,\psi\psi} = \hat\jmath_{\psi\psi} - \hat\jmath_{\psi\lambda}\hat\jmath_{\lambda\lambda}^{-1}\hat\jmath_{\lambda\psi} = \hat\jmath_{\psi\psi\cdot\lambda}, \tag{6.72}$$

where the two expressions on the right-hand side of (6.72) are calculated from $l(\psi, \lambda)$. Thus the information matrix for ψ evaluated from the profile likelihood is the same as that evaluated via the full information matrix of all parameters.

This argument takes an especially simple form when both ψ and λ are scalar parameters.

6.4.4 Parameter orthogonality

An interesting special case arises when $i_{\lambda\psi} = 0$, so that approximately $j_{\lambda\psi} = 0$. The parameters are then said to be orthogonal. In particular, this implies that the corresponding maximum likelihood estimates are asymptotically independent and, by (6.71), that $D_\psi\hat\lambda_\psi = 0$ and, by symmetry, that $D_\lambda\hat\psi_\lambda = 0$. In nonorthogonal cases if ψ changes by $O(1/\sqrt{n})$, then $\hat\lambda_\psi$ changes by $O_p(1/\sqrt{n})$; for orthogonal parameters, however, the change is $O_p(1/n)$. This property may be compared with that of orthogonality of factors in a balanced experimental design. There the point estimates of the main effect of one factor, being contrasts of marginal means, are not changed by assuming, say, that the main effects of the other factor are null. That is, the $O_p(1/n)$ term in the above discussion is in fact zero.

There are a number of advantages to having orthogonal or nearly orthogonal parameters, especially component parameters of interest. Independent errors of estimation may ease interpretation, stability of estimates of one parameter under changing assumptions about another can give added security to conclusions and convergence of numerical algorithms may be speeded. Nevertheless, so far as parameters of interest are concerned, subject-matter interpretability has primacy.

Example 6.4. Mixed parameterization of the exponential family. Consider a full exponential family problem in which the canonical parameter φ and the canonical statistic s are partitioned as $(\phi_1, \phi_2)$ and $(s_1, s_2)$ respectively, thought of as column vectors. Suppose that $\phi_2$ is replaced by $\eta_2$, the corresponding component of the mean parameter $\eta = \nabla k(\phi)$, where $k(\phi)$ is the cumulant generating function occurring in the standard form for the family. Then $\phi_1$ and $\eta_2$ are orthogonal.


To prove this, we find the Jacobian matrix of the transformation from $(\phi_1, \eta_2)$ to $(\phi_1, \phi_2)$ in the form

[Equation (6.73) is missing from the source.]

Here $\nabla_l$ denotes partial differentiation with respect to $\phi_l$ for l = 1, 2.

Combination with (6.51) proves the required result. Thus, in the analysis of the 2 × 2 contingency table the difference of column means is orthogonal to the log odds ratio; in a normal distribution mean and variance are orthogonal.
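The orthogonality of mean and variance in the normal distribution can be seen directly from the constrained estimate of the variance. The following sketch uses hypothetical names and data. For a fixed mean μ the maximum likelihood estimate of the variance is the average of (y − μ)², which exceeds its unconstrained value by exactly (μ − ȳ)². A shift δ in the mean therefore moves the variance estimate by δ², a second-order change, in line with the derivative of the constrained estimate being zero at the maximum.

```python
def var_hat_given_mu(mu, y):
    """Maximum likelihood estimate of the normal variance for a fixed mean mu."""
    return sum((v - mu) ** 2 for v in y) / len(y)

y = [4.1, 5.3, 3.8, 6.0, 4.7, 5.5, 4.9, 5.1]     # illustrative data
mu_hat = sum(y) / len(y)
base = var_hat_given_mu(mu_hat, y)

deltas = [0.1, 0.01, 0.001]
# Change in the constrained variance estimate when the mean is shifted by delta.
changes = [var_hat_given_mu(mu_hat + d, y) - base for d in deltas]
for d, ch in zip(deltas, changes):
    print(d, ch)
```

Each printed change is the square of the corresponding shift, up to rounding.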

Example 6.5. Proportional hazards Weibull model. For the Weibull distribution with density

$$\gamma\rho(\rho y)^{\gamma - 1}\exp\{-(\rho y)^\gamma\} \tag{6.74}$$

and survivor function, or one minus the cumulative distribution function,

$$\exp\{-(\rho y)^\gamma\}, \tag{6.75}$$

the hazard function, being the ratio of the two, is $\gamma\rho(\rho y)^{\gamma - 1}$.
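The relation between density, survivor function and hazard can be checked numerically. This is a minimal sketch; the function names and the parameter values are arbitrary choices for illustration.

```python
import math

def weibull_density(y, rho, gamma):
    """Weibull density in the form (6.74)."""
    return gamma * rho * (rho * y) ** (gamma - 1) * math.exp(-((rho * y) ** gamma))

def weibull_survivor(y, rho, gamma):
    """Weibull survivor function in the form (6.75)."""
    return math.exp(-((rho * y) ** gamma))

def weibull_hazard(y, rho, gamma):
    """Hazard function stated in the text."""
    return gamma * rho * (rho * y) ** (gamma - 1)

y_value, rho, gamma = 1.7, 0.8, 1.5            # arbitrary illustrative values
ratio = weibull_density(y_value, rho, gamma) / weibull_survivor(y_value, rho, gamma)
hazard = weibull_hazard(y_value, rho, gamma)
print(ratio, hazard)
```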

Suppose that $Y_1, \ldots, Y_n$ are independent random variables with Weibull distributions all with the same γ. Suppose that there are explanatory variables $z_1, \ldots, z_n$ such that the hazard is proportional to $e^{\beta z}$. This is achieved by writing the value of the parameter ρ corresponding to $Y_k$ in the form $\exp\{(\alpha + \beta z_k)/\gamma\}$.

Here without loss of generality we take $\sum z_k = 0$. In many applications z and β would be vectors but here, for simplicity, we take one-dimensional explanatory variables. The log likelihood is

$$\sum_k \{(\log\gamma + \alpha + \beta z_k) + (\gamma - 1)\log y_k - \exp(\alpha + \beta z_k)\, y_k^\gamma\}. \tag{6.76}$$

Direct evaluation now shows that, in particular,

[Equations (6.77) to (6.79) are missing from the source.]

where Euler's constant, 0.5772, arises from the integral

$$\int_0^\infty v \log v \, e^{-v} \, dv. \tag{6.80}$$

Now locally near β = 0 the information elements involving β are zero or small, implying local orthogonality of β to the other parameters and in particular to γ. Thus not only are the errors of estimating β almost uncorrelated with those of the other parameters but, more importantly in some respects, the value of $\hat\beta_\gamma$ will change only slowly with γ. In some applications this may mean that analysis based on the exponential distribution, γ = 1, is relatively insensitive to that assumption, at least so far as the value of the maximum likelihood estimate of β is concerned.
