In this section, we begin by defining prediction error, which represents the error from the perspective of accuracy in prediction, and then discuss the evaluation criteria derived as its estimator.
5.1.1 Prediction Errors
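Suppose that we have $n$ sets of independent data $\{(y_i, \boldsymbol{x}_i);\ i = 1, 2, \cdots, n\}$ observed for a response variable $y$ and $p$ predictor variables $\boldsymbol{x} = (x_1, x_2, \cdots, x_p)^T$. Consider the regression model

$$y_i = u(\boldsymbol{x}_i; \boldsymbol{\beta}) + \varepsilon_i, \quad i = 1, 2, \cdots, n, \tag{5.1}$$

where $\boldsymbol{\beta}$ is a parameter vector. Let $y = u(\boldsymbol{x}; \hat{\boldsymbol{\beta}})$ be the fitted model. The residual sum of squares (RSS)

$$\mathrm{RSS} = \sum_{i=1}^{n} \left\{ y_i - u(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}) \right\}^2 \tag{5.2}$$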
may be used as a measure of the goodness of fit of a regression model to the observed data. It is not effective, however, as a model evaluation criterion for variable selection or order selection. This may be verified by considering the following polynomial model for a predictor variable x
$$u(x; \boldsymbol{\beta}) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p. \tag{5.3}$$

The curves in Figure 5.1 represent the fitting of 3rd-, 8th-, and 12th-order polynomial models to 15 data points. As shown by this figure, increasing the degree of the polynomial increases the closeness of the curve to the data, and effectively reduces the RSS. Ultimately, in this case, the 14th-order polynomial model (i.e., of degree one less than the number of data points) passes through all of the data points and reduces the RSS to 0. In this way, the RSS generally results in selection of the highest-degree polynomial and thus the most complex model, and therefore does not function effectively as a criterion for variable selection or order selection.
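As a rough numerical illustration of this point, the following sketch fits polynomials of increasing degree to 15 simulated points and prints the RSS of (5.2) for each. The data-generating function, the noise level, and the random seed are assumptions made for this example; they are not the data behind Figure 5.1.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# 15 hypothetical data points: a smooth trend plus Gaussian noise (assumed).
n = 15
x = np.linspace(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=n)


def rss_for_degree(degree):
    """Fit a polynomial of the given degree by least squares and return its RSS."""
    fit = Polynomial.fit(x, y, deg=degree)  # rescales x internally for stability
    residuals = y - fit(x)
    return float(residuals @ residuals)


degrees = [3, 8, 12, 14]
rss = {d: rss_for_degree(d) for d in degrees}
for d in degrees:
    print(f"degree {d:2d}: RSS = {rss[d]:.3e}")
```

Because the models are nested, the RSS cannot increase with the degree, and the degree-14 fit interpolates all 15 points, so its RSS is zero up to rounding error.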
This shows that a predictive perspective is needed for the model evaluation. More specifically, to evaluate the goodness of a model that has been constructed on the basis of observed data (training data), it is necessary to use data (test data) that have been obtained independently from those data. For this purpose, the goodness of a model $y = u(\boldsymbol{x}; \hat{\boldsymbol{\beta}})$ that has been estimated on the basis of observed data is evaluated by applying the predictive sum of squares (PSS), rather than the RSS, to data $z_i$ obtained independently from the observed data at points $\boldsymbol{x}_i$, as follows:

$$\mathrm{PSS} = \sum_{i=1}^{n} \left\{ z_i - u(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}) \right\}^2. \tag{5.4}$$
Figure 5.1 Fitting of 3rd-, 8th-, and 12th-order polynomial models to 15 data points.
It should also be noted that RSS generally yields values that are smaller than those of PSS and thus overestimates the goodness of an estimated model. This is important, as the underlying objective of evaluation is to determine which of the constructed models will best predict future phenomena, and not simply the one that best fits the observed data.
Equation (5.4) represents the error that will occur with a single future data set. On the other hand, the error that will occur when data sets of the same size are repeatedly obtained is defined as follows:
$$\mathrm{PSE} = \sum_{i=1}^{n} E\left[ \left\{ Z_i - u(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}) \right\}^2 \right]. \tag{5.5}$$
The predictive mean squared error (PSE) is a measure of the difference that occurs on average in data $Z_i = z_i$ obtained randomly at points $\boldsymbol{x}_i$ independently of the observed data, and it is essential to find an estimator for this measure.
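The gap between the RSS and the prediction error can be seen in a small simulation. The sketch below repeatedly draws training data $y_i$ and independent test data $z_i$ at the same points, fits a cubic polynomial, and averages the resulting RSS and PSS. The true regression function, noise level, polynomial degree, and number of replications are assumptions chosen for illustration. Note also that the average here is taken over the training draws as well as the future draws, which goes one step beyond the expectation written in (5.5).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

n, sigma, degree = 15, 0.3, 3
x = np.linspace(0.0, 1.0, n)
mean = np.sin(2.0 * np.pi * x)  # hypothetical true regression function


def rss_and_pss():
    """One replication: fit on y, then score on y (RSS) and on independent z (PSS)."""
    y = mean + rng.normal(scale=sigma, size=n)  # training data
    z = mean + rng.normal(scale=sigma, size=n)  # independent data at the same x
    fitted = Polynomial.fit(x, y, deg=degree)(x)
    return np.sum((y - fitted) ** 2), np.sum((z - fitted) ** 2)


results = np.array([rss_and_pss() for _ in range(2000)])
mean_rss, mean_pss = results.mean(axis=0)
print(f"average RSS = {mean_rss:.3f}, average PSS = {mean_pss:.3f}")
```

On average the RSS falls below the PSS, since the fitted curve has already adapted to the noise in the training data but not to the noise in the independent data.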
Using the RSS as a PSE estimator would simply mean reusing the same data $y_i$ that were used in estimating the model as a substitute for future data $z_i$, and the RSS would thus not function effectively as a model evaluation criterion. Unfortunately, it is generally impractical to obtain future data separately from the observed data. Various methods have therefore been considered for model evaluation in the predictive perspective based solely on the observed data. One such method is cross-validation.
5.1.2 Cross-Validation
In cross-validation (CV), the data used for model evaluation are separated from those used for the model estimation. This can be performed by the following steps:
(1) The model is estimated on the basis of $(n - 1)$ data, with the $i$-th observation $(y_i, \boldsymbol{x}_i)$ excluded from the $n$ observed data, and the value of $u(\boldsymbol{x}; \hat{\boldsymbol{\beta}}^{(-i)})$ is calculated.
(2) The value of $\{y_i - u(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}^{(-i)})\}^2$ is then obtained for the $i$-th observation $(y_i, \boldsymbol{x}_i)$, which was excluded in Step 1.
(3) Steps 1 and 2 are repeated for all $i \in \{1, 2, \cdots, n\}$, and the resulting

$$\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} \left\{ y_i - u(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}^{(-i)}) \right\}^2 \tag{5.6}$$

is taken as the criterion for assessment of the goodness of the estimated model based on the observed data.
This process is known as leave-one-out cross-validation. In polynomial modeling, the CV is computed for various polynomial degrees, and the model yielding the smallest CV is selected as the optimum model. Similarly, in linear regression modeling the model yielding the smallest CV is selected from among the different models obtained in correspondence with the combination of predictor variables. Theoretical derivation has shown, moreover, that the CV is an estimator of the PSE as defined in (5.5) (e.g., Konishi and Kitagawa, 2008, p. 241).
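The three steps of leave-one-out cross-validation, applied to the choice of polynomial degree, can be sketched as follows. The simulated data, the range of candidate degrees, and the helper name `loocv` are assumptions introduced for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

# Hypothetical data: 15 noisy observations of a smooth curve (assumed).
n = 15
x = np.linspace(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=n)


def loocv(degree):
    """Leave-one-out CV of (5.6) for a polynomial of the given degree."""
    squared_errors = []
    for i in range(n):
        keep = np.arange(n) != i  # Step 1: exclude the i-th observation
        fit = Polynomial.fit(x[keep], y[keep], deg=degree)
        squared_errors.append((y[i] - fit(x[i])) ** 2)  # Step 2: error at x_i
    return float(np.mean(squared_errors))  # Step 3: average over all i


cv = {d: loocv(d) for d in range(1, 9)}
best_degree = min(cv, key=cv.get)
for d, value in cv.items():
    print(f"degree {d}: CV = {value:.4f}")
print("selected degree:", best_degree)
```

Unlike the RSS, the CV score does not decrease automatically with the degree, because each squared error is measured at a point that was not used in the fit.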
K-fold cross-validation. More generally, the $n$ observed data are partitioned into $K$ data sets $\{\chi_1, \chi_2, \cdots, \chi_K\}$ with approximately the same number of data in each set. The model is estimated using the $(K - 1)$ data sets remaining after exclusion of the $i$-th data set $\chi_i$, and then evaluated using the excluded data set $\chi_i$. This process is performed for $i = 1, \cdots, K$, in this order, and the resulting average value is taken as the estimated value of the PSE. This is known as K-fold cross-validation. Leave-one-out cross-validation is the special case $K = n$.
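A minimal sketch of this procedure for a linear regression model fitted by least squares is given below. The simulated design, the coefficient values, and the function name `k_fold_cv` are hypothetical. The score is averaged per observation, which coincides with averaging the per-fold means when the folds have equal size.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: intercept plus p predictors (assumed for illustration).
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)


def k_fold_cv(X, y, k, rng):
    """Average squared prediction error over k folds of roughly equal size."""
    folds = np.array_split(rng.permutation(len(y)), k)
    total = 0.0
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False  # exclude the current fold from estimation
        beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        total += np.sum((y[test_idx] - X[test_idx] @ beta_hat) ** 2)
    return total / len(y)


cv5 = k_fold_cv(X, y, k=5, rng=rng)
cv_loo = k_fold_cv(X, y, k=n, rng=rng)  # k = n gives leave-one-out CV
print(f"5-fold CV = {cv5:.4f}, leave-one-out CV = {cv_loo:.4f}")
```

Smaller values of the number of folds reduce the number of model fits, at the price of training each model on a smaller share of the data.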
Generalized cross-validation. Computing time may become a problem in executing CV for model evaluation when the number of data is large. It can be substantially reduced in cases where the predicted value vector $\hat{\boldsymbol{y}}$ can be given as $\hat{\boldsymbol{y}} = H\boldsymbol{y}$ for a matrix $H$ that does not depend on the observation vector $\boldsymbol{y}$, since it then becomes unnecessary to perform the estimation process $n$ times (once for each excluded observation). Matrix $H$ is known as the hat matrix, as it maps observed data $\boldsymbol{y}$ to predicted values $\hat{\boldsymbol{y}}$. It is also known as the smoother matrix (see Section 3.4.3) because of its application to curve (surface) estimations such as those of nonlinear regression models with basis functions.
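One example of the hat matrix occurs in the predicted value of the linear regression model given by (2.24) as $\hat{\boldsymbol{y}} = X\hat{\boldsymbol{\beta}} = X(X^TX)^{-1}X^T\boldsymbol{y}$, in which $H = X(X^TX)^{-1}X^T$ is the hat matrix and is thus not dependent on data $\boldsymbol{y}$. Others occur in the predicted value $\hat{\boldsymbol{y}} = X\hat{\boldsymbol{\beta}} = X(X^TX + \lambda I_{p+1})^{-1}X^T\boldsymbol{y}$ of the ridge estimator in Section 2.3.1 and in the predicted value $\hat{\boldsymbol{y}} = B\hat{\boldsymbol{w}} = B(B^TB + \gamma K)^{-1}B^T\boldsymbol{y}$ for the nonlinear regression model (3.26) (see (3.51) for the parameter estimation). The hat matrices are then $H(\lambda) = X(X^TX + \lambda I_{p+1})^{-1}X^T$ and $H(\lambda, m) = B(B^TB + \lambda\hat{\sigma}^2 K)^{-1}B^T$, respectively. Here the identity matrix in the ridge estimator has the same order as $X^TX$, namely $p + 1$ for an $n \times (p+1)$ design matrix $X$.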
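The first two hat matrices can be formed directly, as in the following sketch. The simulated design matrix, the value of the ridge parameter, and the function names `hat_ols` and `hat_ridge` are assumptions for illustration; the identity matrix in the ridge case is taken to have the same order as $X^TX$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical design: intercept plus p predictors (assumed for illustration).
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)


def hat_ols(X):
    """Hat matrix X (X^T X)^{-1} X^T of the least squares fit."""
    return X @ np.linalg.solve(X.T @ X, X.T)


def hat_ridge(X, lam):
    """Hat matrix X (X^T X + lam I)^{-1} X^T of the ridge fit."""
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)


H = hat_ols(X)
H_ridge = hat_ridge(X, lam=2.0)

# H maps y to the least squares predictions without using y to build H.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(H @ y, X @ beta_hat))
print("tr H =", np.trace(H), " tr H(lambda) =", np.trace(H_ridge))
```

The trace of the least squares hat matrix equals the number of regression coefficients, while the ridge penalty shrinks the trace below that value.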
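When the predicted values can be written as $\hat{\boldsymbol{y}} = H\boldsymbol{y}$ with such a hat matrix, the CV in (5.6) can be computed from a single fit to all $n$ observations as

$$\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{y_i - \hat{y}_i}{1 - h_{ii}} \right\}^2, \tag{5.7}$$

where $\hat{y}_i$ is the $i$-th element of $\hat{\boldsymbol{y}}$ and $h_{ii}$ is the $i$-th diagonal element of the hat matrix. If we replace the term $1 - h_{ii}$ in the denominator with the mean value $1 - n^{-1}\mathrm{tr}H$, moreover, we then have the generalized cross-validation (Craven and Wahba, 1979) written as follows:

$$\mathrm{GCV} = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{y_i - \hat{y}_i}{1 - n^{-1}\mathrm{tr}H} \right\}^2. \tag{5.8}$$

This equation eliminates the need to execute $n$ iterations with one observation excluded in each iteration and therefore facilitates efficient computation. For an explanation of why the replacement can be performed using the hat matrix in this manner, see Green and Silverman (1994) and Konishi and Kitagawa (2008, p. 243).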
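The saving can be checked numerically for least squares regression. The sketch below computes leave-one-out CV in two ways, by refitting $n$ times and by dividing each residual of the full fit by $1 - h_{ii}$, and then computes the GCV by replacing each $h_{ii}$ with the average $n^{-1}\mathrm{tr}H$. The simulated data and variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical linear regression data (assumed for illustration).
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.5, 1.0, -2.0, 0.0]) + rng.normal(scale=0.7, size=n)

# One fit to all data: hat matrix, residuals, and leverages h_ii.
H = X @ np.linalg.solve(X.T @ X, X.T)
residuals = y - H @ y
h = np.diag(H)

cv_shortcut = np.mean((residuals / (1.0 - h)) ** 2)
gcv = np.mean((residuals / (1.0 - np.trace(H) / n)) ** 2)

# Brute force: n separate fits, each with one observation excluded.
cv_brute = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    cv_brute += (y[i] - X[i] @ beta_i) ** 2
cv_brute /= n

print(f"CV (n refits) = {cv_brute:.6f}")
print(f"CV (hat matrix) = {cv_shortcut:.6f}")
print(f"GCV = {gcv:.6f}")
```

For least squares the two CV values agree up to rounding error, and the GCV is typically close to them when the leverages $h_{ii}$ do not vary much.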
5.1.3 Mallows’ Cp
One of the model evaluation criteria based on prediction error is Mallows' $C_p$ (Mallows, 1973), which is particularly used for variable selection in regression modeling. This criterion was derived under the assumption that the probabilistic structure of the specified model is different from the true probabilistic structure that generates data in the framework of a linear regression model. It is assumed that the expectation and the variance-covariance matrix of the $n$-dimensional observation vector $\boldsymbol{y} = (y_1, y_2, \cdots, y_n)^T$ for response variable $y$ are

$$E[\boldsymbol{y}] = \boldsymbol{\mu}, \quad \mathrm{cov}(\boldsymbol{y}) = E[(\boldsymbol{y} - \boldsymbol{\mu})(\boldsymbol{y} - \boldsymbol{\mu})^T] = \omega^2 I_n, \tag{5.9}$$

respectively. We then estimate the true expectation $\boldsymbol{\mu}$, using the linear regression model

$$\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \quad E[\boldsymbol{\varepsilon}] = \boldsymbol{0}, \quad \mathrm{cov}(\boldsymbol{\varepsilon}) = \sigma^2 I_n, \tag{5.10}$$

where $\boldsymbol{\beta} = (\beta_0, \beta_1, \cdots, \beta_p)^T$, $\boldsymbol{\varepsilon} = (\varepsilon_1, \cdots, \varepsilon_n)^T$ and $X$ is an $n \times (p+1)$ design matrix formed from the data for the predictor variables. The expectation and the variance-covariance matrix of the observation vector $\boldsymbol{y}$ under this linear regression model are, respectively,

$$E[\boldsymbol{y}] = X\boldsymbol{\beta}, \quad \mathrm{cov}(\boldsymbol{y}) = E[(\boldsymbol{y} - X\boldsymbol{\beta})(\boldsymbol{y} - X\boldsymbol{\beta})^T] = \sigma^2 I_n. \tag{5.11}$$

Comparison of (5.9) and (5.11) shows that the objective is estimation of the true structure $\boldsymbol{\mu}$ via the assumed model under the assumption that the variance ($\omega^2$) of the observed data is different from the one ($\sigma^2$) in the linear regression model.
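Using the least squares estimator $\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\boldsymbol{y}$ of the regression coefficient vector $\boldsymbol{\beta}$, $\boldsymbol{\mu}$ is estimated by

$$\hat{\boldsymbol{\mu}} = X\hat{\boldsymbol{\beta}} = X(X^TX)^{-1}X^T\boldsymbol{y} \equiv H\boldsymbol{y}. \tag{5.12}$$

The goodness of this estimator $\hat{\boldsymbol{\mu}}$ is measured by the mean square error

$$\Delta_p = E[(\hat{\boldsymbol{\mu}} - \boldsymbol{\mu})^T(\hat{\boldsymbol{\mu}} - \boldsymbol{\mu})]. \tag{5.13}$$

It follows from (5.9) and (5.12) that the expectation of the estimator $\hat{\boldsymbol{\mu}}$ is

$$E[\hat{\boldsymbol{\mu}}] = X(X^TX)^{-1}X^TE[\boldsymbol{y}] \equiv H\boldsymbol{\mu}. \tag{5.14}$$

Hence the mean square error $\Delta_p$ can be expressed as

$$
\begin{aligned}
\Delta_p &= E[(\hat{\boldsymbol{\mu}} - \boldsymbol{\mu})^T(\hat{\boldsymbol{\mu}} - \boldsymbol{\mu})] \\
&= E\left[\{H\boldsymbol{y} - H\boldsymbol{\mu} - (I_n - H)\boldsymbol{\mu}\}^T\{H\boldsymbol{y} - H\boldsymbol{\mu} - (I_n - H)\boldsymbol{\mu}\}\right] \\
&= E\left[(\boldsymbol{y} - \boldsymbol{\mu})^TH(\boldsymbol{y} - \boldsymbol{\mu})\right] + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu} \\
&= \mathrm{tr}\left\{HE[(\boldsymbol{y} - \boldsymbol{\mu})(\boldsymbol{y} - \boldsymbol{\mu})^T]\right\} + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu} \\
&= (p + 1)\omega^2 + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu}.
\end{aligned} \tag{5.15}
$$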
Here, since $H$ and $I_n - H$ are idempotent matrices, we used the relationships $H^2 = H$, $(I_n - H)^2 = I_n - H$, $H(I_n - H) = 0$ and $\mathrm{tr}H = \mathrm{tr}\{X(X^TX)^{-1}X^T\} = \mathrm{tr}I_{p+1} = p + 1$, $\mathrm{tr}(I_n - H) = n - p - 1$.

In (5.15) the first term, $(p+1)\omega^2$, increases as the number of parameters increases. The second term, $\boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu}$, is the sum of squared biases of the estimator $\hat{\boldsymbol{\mu}}$, and decreases as the number of parameters increases. If an estimator of $\Delta_p$ is available, then it can be used as a criterion for model evaluation.
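The decomposition in (5.15) can be checked by simulation. In the sketch below, the true mean vector is deliberately chosen outside the column space of the design matrix, so that the bias term is nonzero. The true mean, the quadratic working model, the value of $\omega$, and the number of replications are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

n, p, omega = 25, 2, 1.5
x = np.linspace(-1.0, 1.0, n)
X = np.column_stack([x ** j for j in range(p + 1)])  # quadratic working model
mu = np.exp(x)  # hypothetical true mean, not a quadratic in x

H = X @ np.linalg.solve(X.T @ X, X.T)
I_n = np.eye(n)

# Projection properties used in the derivation.
assert np.allclose(H @ H, H)
assert np.allclose(H @ (I_n - H), 0.0)
assert np.isclose(np.trace(H), p + 1)

# Closed form from (5.15): variance term plus squared-bias term.
variance_term = (p + 1) * omega ** 2
bias_term = float(mu @ (I_n - H) @ mu)
delta_formula = variance_term + bias_term

# Monte Carlo estimate of E[(mu_hat - mu)^T (mu_hat - mu)].
Y = mu + rng.normal(scale=omega, size=(20000, n))
mu_hat = Y @ H  # H is symmetric, so each row equals (H y)^T
delta_mc = float(np.mean(np.sum((mu_hat - mu) ** 2, axis=1)))

print(f"formula: {delta_formula:.3f}, simulation: {delta_mc:.3f}")
```

Adding columns to the design matrix raises the variance term by $\omega^2$ per parameter and can only lower the bias term, which is the trade-off that a criterion based on $\Delta_p$ has to balance.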
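The expectation of the residual sum of squares can be calculated as

$$
\begin{aligned}
E[(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})]
&= E[(\boldsymbol{y} - H\boldsymbol{y})^T(\boldsymbol{y} - H\boldsymbol{y})] \\
&= E\left[\{(I_n - H)(\boldsymbol{y} - \boldsymbol{\mu}) + (I_n - H)\boldsymbol{\mu}\}^T\{(I_n - H)(\boldsymbol{y} - \boldsymbol{\mu}) + (I_n - H)\boldsymbol{\mu}\}\right] \\
&= E\left[(\boldsymbol{y} - \boldsymbol{\mu})^T(I_n - H)(\boldsymbol{y} - \boldsymbol{\mu})\right] + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu} \\
&= \mathrm{tr}\left\{(I_n - H)E[(\boldsymbol{y} - \boldsymbol{\mu})(\boldsymbol{y} - \boldsymbol{\mu})^T]\right\} + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu} \\
&= (n - p - 1)\omega^2 + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu}.
\end{aligned} \tag{5.16}
$$

Comparison between (5.15) and (5.16) reveals that, if $\omega^2$ is assumed known, then the unbiased estimator of $\Delta_p$ is given by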
$$\hat{\Delta}_p = (\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}}) + \{2(p + 1) - n\}\omega^2. \tag{5.17}$$

Replacing $\omega^2$ by an estimator $\hat{\omega}^2$ and dividing both sides of the above equation by it, we obtain Mallows' $C_p$ criterion as an estimator of the scaled error $\Delta_p/\omega^2$ in the form

$$C_p = \frac{(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})}{\hat{\omega}^2} + \{2(p + 1) - n\}. \tag{5.18}$$

It may be seen that model preferability increases as the $C_p$ criterion decreases. The estimator $\hat{\omega}^2$ is usually represented by the unbiased estimator of error variance in the most complex model. In a linear regression model, for example, it is represented by the unbiased estimator of error variance in the model including all of the predictor variables.
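As an illustration of variable selection with this criterion, the following sketch evaluates $C_p$ for every subset of four candidate predictors, taking $\hat{\omega}^2$ from the model that includes all of them. The simulated data, in which only the first two predictors affect the response, and the helper names `rss` and `mallows_cp` are assumptions for this example.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: only predictors 0 and 1 influence the response (assumed).
n, p_full = 80, 4
Z = rng.normal(size=(n, p_full))
y = 1.0 + 2.0 * Z[:, 0] - 1.5 * Z[:, 1] + rng.normal(size=n)


def rss(columns):
    """Residual sum of squares of the least squares fit on the chosen columns."""
    X = np.column_stack([np.ones(n)] + [Z[:, j] for j in columns])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta_hat
    return float(r @ r)


all_columns = tuple(range(p_full))
# Unbiased error variance estimate from the model with all predictors.
omega2_hat = rss(all_columns) / (n - p_full - 1)


def mallows_cp(columns):
    """Mallows' C_p of (5.18) for a model with len(columns) predictors."""
    p = len(columns)
    return rss(columns) / omega2_hat + 2 * (p + 1) - n


subsets = [c for k in range(p_full + 1) for c in combinations(all_columns, k)]
cp = {c: mallows_cp(c) for c in subsets}
best = min(cp, key=cp.get)
print("selected predictors:", best, " C_p =", round(cp[best], 3))
```

With this choice of $\hat{\omega}^2$, the full model always has $C_p$ equal to its number of coefficients, so the comparison is informative only for the smaller models.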
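For order selection in a time series autoregressive model, the final prediction error (FPE; Akaike, 1969) is given as the PSE estimator. For a linear regression model $y = \hat{\beta}_0 + \hat{\beta}_1x_1 + \cdots + \hat{\beta}_px_p$ estimated by least squares, it is given as

$$\mathrm{FPE} = \frac{n + p + 1}{n(n - p - 1)}(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}}). \tag{5.19}$$

Moreover, FPE can be rewritten as

$$
\begin{aligned}
n \log \mathrm{FPE} &= n \log\left\{\frac{n + p + 1}{n - (p + 1)}\right\} + n \log\left\{\frac{1}{n}(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})\right\} \\
&\approx n \log\left\{1 + \frac{2}{n}(p + 1)\right\} + n \log\left\{\frac{1}{n}(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})\right\} \\
&\approx 2(p + 1) + n \log\left\{\frac{1}{n}(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})\right\}.
\end{aligned} \tag{5.20}
$$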
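The quality of the approximation in the last line can be checked numerically. The sketch below computes $n \log \mathrm{FPE}$ exactly and compares it with $2(p+1) + n \log\{\mathrm{RSS}/n\}$ for a simulated regression. The sample size, number of predictors, and coefficient values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical linear regression data (assumed for illustration).
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = float(np.sum((y - X @ beta_hat) ** 2))

# Final prediction error for the least squares fit, as in (5.19).
fpe = (n + p + 1) / (n * (n - p - 1)) * rss

# Exact value of n log FPE and its approximation from the last line of (5.20).
n_log_fpe = n * np.log(fpe)
approx = 2 * (p + 1) + n * np.log(rss / n)

print(f"n log FPE = {n_log_fpe:.4f}")
print(f"2(p+1) + n log(RSS/n) = {approx:.4f}")
```

The two quantities differ only through the term $n \log\{(n+p+1)/(n-p-1)\} - 2(p+1)$, which is small whenever $p + 1$ is small relative to $n$.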
In contrast to $\mathrm{RSS} = (\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})$, which decreases with increasing model complexity, both Mallows' $C_p$ criterion and the FPE criterion thus include the number of model parameters and thereby penalize model complexity. It may also be noted that they yield equations equivalent to that of the AIC in (5.43) for the Gaussian linear regression model in Example 5.4 of Section 5.2.2, which shows that they are closely related to AIC.