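where $\phi = (\tilde\beta - \hat\beta)^T X^T W X (\tilde\beta - \hat\beta)$.

Proof: Denoting $\mu(\hat\beta)$ and $\mu(\tilde\beta)$ by $\hat\mu$ and $\tilde\mu$ respectively and observing that $y - \tilde\mu = y - \hat\mu + \hat\mu - \tilde\mu$, it can be verified easily that:

$$\phi(\tilde\beta) = [\hat\mu - \tilde\mu]^T V^{-1} [\hat\mu - \tilde\mu] + 2[\hat\mu - \tilde\mu]^T V^{-1} [y - \hat\mu].$$

Since $\hat\beta$ and $\tilde\beta$ are both consistent for $\beta$, first order Taylor series expansions of $\hat\mu$ and $\tilde\mu$ about $\beta$ give:

$$\hat\mu = \mu + VGX(\hat\beta - \beta) + o_p(n^{-1/2}), \qquad \tilde\mu = \mu + VGX(\tilde\beta - \beta) + o_p(n^{-1/2}).$$

This implies that

$$\phi(\tilde\beta) = [VGX(\hat\beta - \tilde\beta)]^T V^{-1} [VGX(\hat\beta - \tilde\beta)] + 2[VGX(\hat\beta - \tilde\beta)]^T V^{-1} [y - \hat\mu] + o_p(n^{-1/2})$$
$$= (\hat\beta - \tilde\beta)^T X^T GVG X (\hat\beta - \tilde\beta) + 2(\hat\beta - \tilde\beta)^T X^T G [y - \hat\mu] + o_p(n^{-1/2}).$$

As $GVG = W$ and $X^T G[y - \hat\mu] = 0$, $|\phi(\tilde\beta) - (\tilde\beta - \hat\beta)^T X^T W X (\tilde\beta - \hat\beta)| \to_p 0$ as required. ∎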
The measure $\phi$ may be interpreted as the asymptotic confidence region displacement from $\hat\beta$. Restricting our attention to the class of consistent and shrinkage estimators $\tilde\beta$ (i.e. having shorter length than $\hat\beta$), a ridge type estimator for GLMs can be obtained as follows.

Lemma 3.2:
The estimator $\tilde\beta$ that has minimum squared norm subject to the constraint $\phi = \phi_0$ is

$$\beta^* = (X^T W X + kI)^{-1} (X^T W X) \hat\beta \tag{3.2}$$

where $k > 0$ is chosen to satisfy the constraint $\phi = \phi_0$.
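Proof: This can be established by minimizing:

$$F(\tilde\beta) = \tilde\beta^T \tilde\beta + k^{-1} [(\tilde\beta - \hat\beta)^T X^T W X (\tilde\beta - \hat\beta) - \phi_0]$$

with respect to $\tilde\beta$, where $k^{-1}$ denotes the Lagrange multiplier. Consider:

$$\partial F / \partial\tilde\beta = 2\tilde\beta + k^{-1} [2 X^T W X (\tilde\beta - \hat\beta)]$$
$$\partial^2 F / \partial\tilde\beta\, \partial\tilde\beta^T = 2I + 2k^{-1} (X^T W X).$$

For $k > 0$, the Hessian matrix $\partial^2 F / \partial\tilde\beta\, \partial\tilde\beta^T$ is positive definite and hence a local minimum of $F$ exists which is given by setting $\partial F / \partial\tilde\beta$ to 0. This yields $\beta^*$ as in (3.2) with $k$ satisfying:

$$(\hat\beta - \beta^*)^T X^T W X (\hat\beta - \beta^*) = \phi_0. \tag{3.3}$$

We are not interested in the case $k < 0$ as this will not correspond to the class of shrinkage estimators.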
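The step from the first-order condition to (3.2) can be written out explicitly. In the sketch below, $A = X^T W X$ is shorthand introduced here for brevity and is not part of the original notation; $A + kI$ is invertible for $k > 0$ because $A$ is non-negative definite.

```latex
\begin{align*}
2\tilde\beta + 2k^{-1} A(\tilde\beta - \hat\beta) = 0
  &\iff (A + kI)\,\tilde\beta = A\hat\beta \\
  &\iff \tilde\beta = (A + kI)^{-1} A \hat\beta = \beta^*, \\
\hat\beta - \beta^*
  &= [\,I - (A + kI)^{-1} A\,]\,\hat\beta = k\,(A + kI)^{-1}\hat\beta .
\end{align*}
```

The last line shows that the displacement $\hat\beta - \beta^*$ is proportional to $k$, so the constraint (3.3) pins down the value of $k$.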
To solve for $k$, one substitutes the expression (3.2) for $\beta^*$ into (3.3), yielding:

$$\phi_0 = \{[I - (X^T W X + kI)^{-1} X^T W X]\hat\beta\}^T X^T W X \{[I - (X^T W X + kI)^{-1} X^T W X]\hat\beta\}$$
$$= k^2 \hat\beta^T (X^T W X + kI)^{-1} X^T W X (X^T W X + kI)^{-1} \hat\beta \tag{3.4}$$

since $I - (X^T W X + kI)^{-1} X^T W X = k (X^T W X + kI)^{-1}$.

Using (3.1), $(X^T W X + kI)^{-1} = P (E + kI)^{-1} P^T$ and thus (3.4) may be written as

$$\phi_0 = k^2 \alpha^T (E + kI)^{-1} E (E + kI)^{-1} \alpha = k^2 \sum_{j=1}^{p} e_j \alpha_j^2 / (e_j + k)^2,$$

where $\alpha = P^T \hat\beta$. The function $f(k) = \sum_{j=1}^{p} e_j \alpha_j^2 / (1 + e_j/k)^2$ is monotone increasing from $f(0) = 0$ to $\lim_{k \to \infty} f(k) = \sum_{j=1}^{p} e_j \alpha_j^2$. Therefore there exists a unique positive root, $k_0$ say, as the solution to the nonlinear equation (3.3) provided that $\phi_0 < \sum_{j=1}^{p} e_j \alpha_j^2$. The corresponding
An alternative derivation of $\beta^*$ is given in the next Lemma, the proof of which is similar to that of Lemma 3.2.

Lemma 3.3:
If the squared norm of $\tilde\beta$ is fixed at $r^2$, $0 < r^2 < \hat\beta^T \hat\beta$, then $\beta^*$ is the value of $\tilde\beta$ that gives a minimum asymptotic confidence ellipsoid displacement $\phi$, where $k > 0$ is now chosen so that $\beta^{*T} \beta^* = r^2$. ∎
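Both Lemma 3.2 and Lemma 3.3 reduce the choice of $k$ to a one-dimensional monotone root-finding problem. The following is a minimal numerical sketch of that idea, not a method given in the text: the matrix `A` (standing in for $X^T W X$), the vector `beta_hat`, the value `phi0` and the helper names `ridge`, `displacement` and `bisect_root` are all illustrative assumptions, and plain bisection is used only because monotonicity makes it safe.

```python
import numpy as np

def ridge(A, beta_hat, k):
    """Ridge-type estimate (A + kI)^{-1} A beta_hat, with A standing in for X'WX."""
    p = A.shape[0]
    return np.linalg.solve(A + k * np.eye(p), A @ beta_hat)

def displacement(A, beta_hat, beta):
    """Quadratic form (beta - beta_hat)' A (beta - beta_hat)."""
    d = beta - beta_hat
    return float(d @ A @ d)

def bisect_root(h, lo=0.0, hi=1.0, tol=1e-12, max_iter=200):
    """Root of a continuous monotone function h on [lo, inf) by bisection."""
    sign_lo = np.sign(h(lo))
    while np.sign(h(hi)) == sign_lo:      # expand the bracket until the sign changes
        hi *= 2.0
        if hi > 1e16:
            raise ValueError("no sign change found; the constraint may be infeasible")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if np.sign(h(mid)) == sign_lo:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    return 0.5 * (lo + hi)

# Hypothetical inputs: a small positive definite matrix and an estimate.
A = np.array([[4.0, 1.0], [1.0, 0.5]])
beta_hat = np.array([1.0, -2.0])
phi0 = 0.5                                # must lie below beta_hat' A beta_hat (= 2 here)

# Lemma 3.2: choose k so that the displacement equals phi0.
k0 = bisect_root(lambda k: displacement(A, beta_hat, ridge(A, beta_hat, k)) - phi0)
beta_star = ridge(A, beta_hat, k0)

# Lemma 3.3: fixing the squared norm at r^2 = ||beta_star||^2 should recover the same k.
r2 = float(beta_star @ beta_star)
k_dual = bisect_root(lambda k: float(ridge(A, beta_hat, k) @ ridge(A, beta_hat, k)) - r2)
```

In this sketch `k0` and `k_dual` should agree up to numerical tolerance, which illustrates that the two constrained problems trace out the same family of estimators.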
Definition 3.3: The general ridge estimator for GLMs is:

$$\beta^*(k) = (X^T W X + kI)^{-1} (X^T W X) \hat\beta \tag{3.5}$$

where the scalar $k$ is called the ridge parameter. ∎

The dependence of the general ridge estimator on $k$ is clearly emphasized in the above definition. Furthermore, the estimator (3.5) can be considered as a generalized version of the ordinary ridge estimator of Hoerl & Kennard (1970a). For the special case of binary logistic regression, it is also easy to show that the estimator simplifies to the 'ridge logistic estimator' of Schaefer, Roi & Wolfe (1984).

Lemma 3.4:
For the Normal linear model with identity link $\mu = x^T \beta$, the general ridge estimator becomes:

$$\beta^*(k) = (X^T X + kI)^{-1} X^T y. \tag{3.6}$$

Proof: The MLE and least squares estimator here is $\hat\beta = (X^T X)^{-1} X^T y$. Assuming without loss of generality that the prior weights $g_i = 1$ for all $i$ and the dispersion parameter $\phi = \sigma^2$, the information matrix $X^T W X$ simplifies to $\sigma^{-2} X^T X$. Thus,

$$\beta^* = (\sigma^{-2} X^T X + kI)^{-1} \sigma^{-2} X^T X (X^T X)^{-1} X^T y = (X^T X + \sigma^2 k I)^{-1} X^T y = (X^T X + kI)^{-1} X^T y,$$

where in the last step $\sigma^2 k$ has been relabelled as $k$.
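The reduction in Lemma 3.4 can be checked numerically. The sketch below uses simulated data and treats $\sigma^2$ as known; the data, the seed and the variable names are illustrative assumptions rather than anything taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.7, size=n)

sigma2 = 0.49    # dispersion, treated as known for this illustration
k = 2.0          # ridge parameter on the GLM scale

# Least squares / maximum likelihood estimate.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# General ridge estimator with information matrix X'WX = X'X / sigma^2.
info = (X.T @ X) / sigma2
general = np.linalg.solve(info + k * np.eye(p), info @ beta_hat)

# Ordinary ridge estimator with ridge parameter sigma^2 * k.
ordinary = np.linalg.solve(X.T @ X + sigma2 * k * np.eye(p), X.T @ y)
```

The two vectors `general` and `ordinary` should coincide up to rounding, which makes the rescaling of the ridge parameter by $\sigma^2$ explicit.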
In any application of $\beta^*(k)$, $W$ may be estimated using $\hat\beta$. Writing $\hat W$ for $W$ evaluated at $\hat\beta$, the resulting estimator is:

$$\beta^*(k) = (X^T \hat W X + kI)^{-1} (X^T \hat W X) \hat\beta. \tag{3.7}$$

Computations of $\beta^*(k)$ may be performed by combining stand-alone programs (or macros) with existing GLM software that can output $\hat\beta$ and $X^T \hat W X$. The latter two quantities are readily available from generalized linear modelling packages.
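As a rough illustration of this two-stage computation, the sketch below fits a binary logistic regression by iteratively reweighted least squares and then applies (3.7). It is a minimal NumPy example under stated assumptions: the functions `fit_logistic_irls` and `general_ridge`, the simulated data and the choice `k=1.0` are hypothetical, and in practice the first stage would be replaced by whatever GLM package supplies $\hat\beta$ and $X^T \hat W X$.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=50, tol=1e-10):
    """Logistic regression MLE by Newton/IRLS; returns (beta_hat, X'WX at beta_hat)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = mu * (1.0 - mu)                      # diagonal of W for the logit link
        XtWX = X.T @ (w[:, None] * X)
        step = np.linalg.solve(XtWX, X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    w = mu * (1.0 - mu)
    return beta, X.T @ (w[:, None] * X)

def general_ridge(XtWX, beta_hat, k):
    """Estimator (3.7): (X'WX + kI)^{-1} (X'WX) beta_hat."""
    p = len(beta_hat)
    return np.linalg.solve(XtWX + k * np.eye(p), XtWX @ beta_hat)

# Simulated data with two correlated covariates (illustrative only).
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
true_beta = np.array([-0.5, 1.0, 1.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_hat, XtWX = fit_logistic_irls(X, y)
beta_ridge = general_ridge(XtWX, beta_hat, k=1.0)
```

Note that the penalty $kI$ in (3.7) applies to every coefficient, so the intercept column is shrunk along with the others in this sketch.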
3.2.2 Geometric and Shrinkage Interpretations
As shown in Lemma 3.2, the ridge estimator $\beta^*$ is the point on the curve $\phi(\tilde\beta) = (\tilde\beta - \hat\beta)^T X^T W X (\tilde\beta - \hat\beta) = \phi_0$ fixed, that is closest to the origin. This curve $\phi(\tilde\beta)$ represents an asymptotic confidence ellipsoid in $\mathbb{R}^p$ centred at $\hat\beta$. Alternatively, the ridge estimator can be viewed as the solution that minimizes $\phi(\tilde\beta)$ subject to a spherical restriction on $\tilde\beta$, the appropriate value of the ridge parameter $k$ being determined by the radius $r$ of the restriction (Lemma 3.3).
A geometric representation of the general ridge estimator for the case $p = 2$ is displayed in Figure 3.1. The MLE $\hat\beta$ is positioned at the centre of the elliptical $\phi$ contours. The spherical (here circular) restriction is shown and the constrained solution is the point at which the innermost elliptical contour just touches the spherical restriction. The whole sequence of ridge solutions is obtained as one 'closes in' the circle, that is, reduces the radius $r$ from $\|\hat\beta\|$, where the circle would just pass through $\hat\beta$ (corresponding to $k = 0$), to zero, which is the point circle at the origin (corresponding to $k = \infty$). We are not interested in radius values beyond $\|\hat\beta\|$. The 'ridge' of ridge regression is the path or solution locus traced by $\beta^*$ as the circle radius is contracted or expanded from one extreme to the other of its effective range.
The general ridge estimator $\beta^*(k) = S\hat\beta$, where $S = (X^T W X + kI)^{-1} X^T W X$, may also be regarded as a shrunken estimator (see e.g. Copas (1983) for a review on shrunken estimators) with non-constant shrinkage factor $S$.

Lemma 3.5: For $k > 0$, $\|\beta^*(k)\| < \|\hat\beta\|$.

Proof:

$$\beta^{*T} \beta^* = \hat\beta^T (X^T W X)(X^T W X + kI)^{-2} (X^T W X) \hat\beta = \alpha^T E (E + kI)^{-2} E \alpha = \sum_{j=1}^{p} (1 + k/e_j)^{-2} \alpha_j^2 < \alpha^T \alpha = \hat\beta^T \hat\beta$$

where $\alpha = P^T \hat\beta$. ∎

At $k = 0$, $\beta^*(0) = \hat\beta$ and as $k$ increases, $\beta^*(k)$ is shrunk towards the origin since $\lim_{k \to \infty} \beta^*(k) = 0$.
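The shrinkage in Lemma 3.5 acts componentwise in the eigenbasis of $X^T W X$: each $\alpha_j$ is multiplied by $1/(1 + k/e_j)$. The sketch below illustrates this with a small positive definite matrix `A` standing in for $X^T W X$; the matrix, the vector `beta_hat`, the grid `ks` and the function names are illustrative assumptions.

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 0.5]])   # stands in for X'WX (positive definite)
beta_hat = np.array([1.0, -2.0])

# Spectral decomposition A = P diag(e) P' and rotated coefficients alpha = P' beta_hat.
e, P = np.linalg.eigh(A)
alpha = P.T @ beta_hat

def ridge_spectral(k):
    """beta*(k) via componentwise shrinkage factors 1 / (1 + k / e_j)."""
    return P @ (alpha / (1.0 + k / e))

def ridge_direct(k):
    """beta*(k) from the defining formula (A + kI)^{-1} A beta_hat."""
    return np.linalg.solve(A + k * np.eye(2), A @ beta_hat)

ks = [0.0, 0.1, 1.0, 10.0, 1000.0]
norms = [float(np.linalg.norm(ridge_spectral(k))) for k in ks]
```

Directions with small eigenvalues $e_j$ are shrunk most strongly, which is why the shrinkage factor is described as non-constant; the list `norms` decreases from $\|\hat\beta\|$ towards zero as $k$ grows.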
Figure 3.1 Geometric representation of the general ridge estimator for the case $p = 2$. The ellipses correspond to contours of constant $\phi$.
3.2.3 Asymptotic Properties
Under the appropriate regularity conditions, consistency and asymptotic Normality of the general ridge estimator $\beta^*(k) = S(k)\hat\beta$ are established below, where $S(k) = (X^T W X + kI)^{-1} X^T W X = I - k (X^T W X + kI)^{-1}$.

Lemma 3.6:
$\beta^*(k)$ is consistent for $\beta$ provided that $k = o_p(n)$.

Proof:

$$S(k) = (n^{-1} X^T W X + n^{-1} kI)^{-1} n^{-1} X^T W X = (\mathcal{I}_\beta + o_p(1))^{-1} \mathcal{I}_\beta \to_p I$$
where i