
In this section, we begin by defining prediction error, which represents the error from the perspective of accuracy in prediction, and then discuss the evaluation criteria derived as its estimator.


5.1.1 Prediction Errors

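Suppose that we have $n$ independent observations $\{(y_i, \boldsymbol{x}_i);\ i = 1, 2, \ldots, n\}$ for a response variable $y$ and $p$ predictor variables $\boldsymbol{x} = (x_1, x_2, \ldots, x_p)^T$. Consider the regression model

$$y_i = u(\boldsymbol{x}_i; \boldsymbol{\beta}) + \varepsilon_i, \qquad i = 1, 2, \ldots, n, \tag{5.1}$$

where $\boldsymbol{\beta}$ is a parameter vector. Let $y = u(\boldsymbol{x}; \hat{\boldsymbol{\beta}})$ be the fitted model. The residual sum of squares (RSS)

$$\mathrm{RSS} = \sum_{i=1}^{n} \left\{ y_i - u(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}) \right\}^2 \tag{5.2}$$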

may be used as a measure of the goodness of fit of a regression model to the observed data. It is not effective, however, as a model evaluation criterion for variable selection or order selection. This may be verified by considering the following polynomial model for a predictor variable x

$$u(x; \boldsymbol{\beta}) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p. \tag{5.3}$$

The curves in Figure 5.1 represent the fitting of 3rd-, 8th-, and 12th-order polynomial models to 15 data points. As shown by this figure, increasing the degree of the polynomial brings the curve closer to the data and reduces the RSS. Ultimately, in this case, the 14th-order polynomial model (i.e., of degree one less than the number of data points) passes through all of the data points and reduces the RSS to 0. In this way, the RSS generally results in selection of the highest-degree polynomial and thus the most complex model, and therefore does not function effectively as a criterion for variable selection or order selection.
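The behavior of the RSS described above can be reproduced numerically. The following is a minimal sketch, not the data behind Figure 5.1: the 15 design points, the sine-shaped true curve, the noise level, and the helper name `rss` are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)           # fixed seed, illustration only
n = 15
x = np.linspace(-1.0, 1.0, n)            # hypothetical design points
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=n)  # hypothetical truth plus noise

def rss(degree):
    """Residual sum of squares of a least squares polynomial fit of the given degree."""
    fit = Polynomial.fit(x, y, deg=degree)
    resid = y - fit(x)
    return float(resid @ resid)

# RSS for increasingly complex models, up to degree n - 1 = 14
rss_by_degree = {d: rss(d) for d in (1, 3, 8, 12, 14)}
for d, value in rss_by_degree.items():
    print(f"degree {d:2d}: RSS = {value:.3e}")
```

Because the models are nested, the RSS cannot increase with the degree, and the degree-14 fit interpolates the 15 points up to rounding error.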

This shows that a predictive perspective is needed for model evaluation. More specifically, to evaluate the goodness of a model that has been constructed on the basis of observed data (training data), it is necessary to use data (test data) that have been obtained independently from those data. For this purpose, the goodness of a model $y = u(\boldsymbol{x}; \hat{\boldsymbol{\beta}})$ that has been estimated on the basis of observed data is evaluated by applying the predictive sum of squares (PSS), rather than the RSS, to data $z_i$ obtained independently from the observed data at points $\boldsymbol{x}_i$, as follows:

$$\mathrm{PSS} = \sum_{i=1}^{n} \left\{ z_i - u(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}) \right\}^2. \tag{5.4}$$

Figure 5.1 Fitting of 3rd-, 8th-, and 12th-order polynomial models to 15 data points.

It should also be noted that RSS generally yields values that are smaller than those of PSS and thus overestimates the goodness of an estimated model. This is important, as the underlying objective of evaluation is to determine which of the constructed models will best predict future phenomena, and not simply the one that best fits the observed data.
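A small simulation can illustrate the gap between the RSS and the PSS. This is only a sketch under assumed settings: a quadratic true mean, Gaussian noise with unit variance, 30 fixed design points, and 2000 replications are all hypothetical choices, as is the helper name `one_replication`.

```python
import numpy as np

rng = np.random.default_rng(1)            # fixed seed, illustration only
n, sigma = 30, 1.0
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x, x**2])     # quadratic model, 3 parameters
true_mean = 1.0 + 2.0 * x - 1.5 * x**2         # hypothetical true structure

def one_replication():
    """Fit on training data y, then score on y itself (RSS) and on fresh z (PSS)."""
    y = true_mean + rng.normal(scale=sigma, size=n)   # observed (training) data
    z = true_mean + rng.normal(scale=sigma, size=n)   # independent data at the same x_i
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    return np.sum((y - fitted) ** 2), np.sum((z - fitted) ** 2)

results = np.array([one_replication() for _ in range(2000)])
mean_rss, mean_pss = results.mean(axis=0)
print(f"average RSS = {mean_rss:.2f}, average PSS = {mean_pss:.2f}")
```

In any single replication the PSS may happen to fall below the RSS; the ordering shows up in the averages.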

Equation (5.4) represents the error that will occur with a single future data set. On the other hand, the error that will occur on average when data sets of the same size are repeatedly obtained is defined as follows:

$$\mathrm{PSE} = \sum_{i=1}^{n} E\left[ \left\{ Z_i - u(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}) \right\}^2 \right]. \tag{5.5}$$

The predictive mean squared error (PSE) is a measure of the difference that occurs on average in data $Z_i = z_i$ obtained randomly at points $\boldsymbol{x}_i$ independently of the observed data, and it is essential to find an estimator for this measure.

Using the RSS as a PSE estimator would simply mean reusing the same data $y_i$ that were used in estimating the model as a substitute for future data $z_i$, and the RSS would thus not function effectively as a model evaluation criterion. Unfortunately, it is generally impractical to obtain future data separately from the observed data. Various methods have therefore been considered for model evaluation from the predictive perspective based solely on the observed data. One such method is cross-validation.

5.1.2 Cross-Validation

In cross-validation (CV), the data used for model evaluation are separated from those used for model estimation. This can be performed by the following steps:

(1) The model is estimated on the basis of the $(n - 1)$ data that remain after the $i$-th observation $(y_i, \boldsymbol{x}_i)$ is excluded from the $n$ observed data, which gives the fitted model $u(\boldsymbol{x}; \hat{\boldsymbol{\beta}}^{(-i)})$.

(2) The value of $\{y_i - u(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}^{(-i)})\}^2$ is then obtained for the $i$-th observation $(y_i, \boldsymbol{x}_i)$, which was excluded in Step 1.

(3) Steps 1 and 2 are repeated for all $i \in \{1, 2, \ldots, n\}$, and the resulting

$$\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} \left\{ y_i - u(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}}^{(-i)}) \right\}^2 \tag{5.6}$$

is taken as the criterion for assessment of the goodness of the estimated model based on the observed data.

This process is known as leave-one-out cross-validation. In polynomial modeling, the CV is computed for various polynomial degrees, and the model yielding the smallest CV is selected as the optimum model.
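The three steps of leave-one-out cross-validation, applied to the choice of polynomial degree, can be sketched as follows. The data are simulated from a hypothetical cubic curve, and the helper names `design` and `loo_cv` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)            # fixed seed, illustration only
n = 40
x = np.sort(rng.uniform(-1.0, 1.0, size=n))
y = 1.0 - 2.0 * x + 3.0 * x**3 + rng.normal(scale=0.3, size=n)  # hypothetical cubic truth

def design(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def loo_cv(x, y, degree):
    """Leave-one-out CV: average squared error on each held-out observation."""
    n = len(y)
    sq_err = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                       # Step 1: exclude observation i
        beta, *_ = np.linalg.lstsq(design(x[keep], degree), y[keep], rcond=None)
        pred = design(x[i:i + 1], degree) @ beta       # predict the excluded point
        sq_err[i] = (y[i] - pred[0]) ** 2              # Step 2: squared prediction error
    return float(sq_err.mean())                        # Step 3: average over all i

cv_scores = {d: loo_cv(x, y, d) for d in range(1, 9)}
best_degree = min(cv_scores, key=cv_scores.get)
print(cv_scores, best_degree)
```

Unlike the RSS, the CV score does not keep falling as the degree grows, so its minimizer is a usable choice of model order.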

Similarly, in linear regression modeling the model yielding the smallest CV is selected from among the different models obtained in correspondence with the combination of predictor variables. Theoretical derivation has shown, moreover, that the CV is an estimator of the PSE as defined in (5.5) (e.g., Konishi and Kitagawa, 2008, p. 241).

K-fold cross-validation. In CV, the $n$ observed data are generally partitioned into $k$ data sets $\{\chi_1, \chi_2, \ldots, \chi_k\}$ with approximately the same number of data in each set. The model is estimated using the $(k - 1)$ data sets remaining after exclusion of the $i$-th data set $\chi_i$, and then evaluated using the excluded data set $\chi_i$. This process is performed for $i = 1, \ldots, k$, in this order, and the resulting average value is taken as the estimated value of the PSE. This is known as K-fold cross-validation.
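A minimal sketch of K-fold cross-validation for a linear model fitted by least squares is given below. The function name `k_fold_cv`, the random partition, and the simulated data are assumptions for illustration; the sketch also assumes that the "average value" is the mean of the per-fold mean squared errors.

```python
import numpy as np

def k_fold_cv(X, y, k, rng):
    """Average held-out mean squared error over k folds of nearly equal size."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)    # index sets chi_1, ..., chi_k
    fold_errors = []
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False                      # exclude the i-th data set
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[test_idx] - X[test_idx] @ beta     # evaluate on the excluded set
        fold_errors.append(np.mean(resid ** 2))
    return float(np.mean(fold_errors))

# Hypothetical data: straight line plus noise
rng = np.random.default_rng(3)
n = 20
x = rng.uniform(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 0.5 + 2.0 * x + rng.normal(scale=0.4, size=n)

cv_5 = k_fold_cv(X, y, 5, rng)
cv_n = k_fold_cv(X, y, n, rng)    # k = n reduces to leave-one-out CV
print(cv_5, cv_n)
```

Setting `k` equal to the sample size puts one observation in each fold, which recovers leave-one-out cross-validation regardless of the permutation.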

Generalized cross-validation. Computing time may become a problem in executing CV for model evaluation with large-scale data. It can be substantially reduced in cases where the predicted value vector $\hat{\boldsymbol{y}}$ can be given as $\hat{\boldsymbol{y}} = H\boldsymbol{y}$ for a matrix $H$ that does not depend on the observation vector $\boldsymbol{y}$, since it then becomes unnecessary to perform the estimation process $n$ times (once for each excluded observation). Matrix $H$ is known as the hat matrix, as it maps the observed data $\boldsymbol{y}$ to the predicted values $\hat{\boldsymbol{y}}$. It is also known as the smoother matrix (see Section 3.4.3) because of its application to curve (surface) estimation, such as that of nonlinear regression models with basis functions.

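One example of the hat matrix occurs in the predicted value of the linear regression model given by (2.24) as $\hat{\boldsymbol{y}} = X\hat{\boldsymbol{\beta}} = X(X^TX)^{-1}X^T\boldsymbol{y}$, in which $H = X(X^TX)^{-1}X^T$ is the hat matrix and thus does not depend on the data $\boldsymbol{y}$. Others occur in the predicted value $\hat{\boldsymbol{y}} = X\hat{\boldsymbol{\beta}} = X(X^TX + \lambda I)^{-1}X^T\boldsymbol{y}$ of the ridge estimator in Section 2.3.1, where $I$ is the identity matrix of the same order as $X^TX$, and in the predicted value $\hat{\boldsymbol{y}} = B\hat{\boldsymbol{w}} = B(B^TB + \gamma K)^{-1}B^T\boldsymbol{y}$ for the nonlinear regression model (3.26) (see (3.51) for the parameter estimation). The hat matrices are then $H(\lambda) = X(X^TX + \lambda I)^{-1}X^T$ and $H(\lambda, m) = B(B^TB + \lambda\hat{\sigma}^2 K)^{-1}B^T$, respectively.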

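When the predicted values can be written as $\hat{\boldsymbol{y}} = H\boldsymbol{y}$, the CV in (5.6) can be rewritten as

$$\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{y_i - \hat{y}_i}{1 - h_{ii}} \right\}^2, \tag{5.7}$$

where $\hat{y}_i$ is the $i$-th element of $\hat{\boldsymbol{y}}$ and $h_{ii}$ is the $i$-th diagonal element of the hat matrix. If we replace the term $1 - h_{ii}$ in the denominator with the mean value $1 - n^{-1}\mathrm{tr}H$, moreover, we then have the generalized cross-validation (Craven and Wahba, 1979) written as follows:

$$\mathrm{GCV} = \frac{1}{n} \sum_{i=1}^{n} \left\{ \frac{y_i - \hat{y}_i}{1 - n^{-1}\mathrm{tr}H} \right\}^2. \tag{5.8}$$

This equation eliminates the need to execute $n$ iterations with one observation excluded in each iteration and therefore facilitates efficient computation. For an explanation of why the replacement can be performed using the hat matrix in this manner, see Green and Silverman (1994) and Konishi and Kitagawa (2008, p. 243).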
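The saving can be checked numerically. The sketch below compares a brute-force leave-one-out loop with the hat-matrix shortcut, in which each ordinary residual is divided by $1 - h_{ii}$, and also computes the generalized version that uses $1 - n^{-1}\mathrm{tr}H$ instead. The simulated data, the ridge parameter value, and all function names are assumptions for illustration; the ridge penalty here is applied to every coefficient, including the intercept.

```python
import numpy as np

rng = np.random.default_rng(4)            # fixed seed, illustration only
n, lam = 25, 0.5                          # hypothetical sample size and ridge parameter
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.5, size=n)

def hat_matrix(X, lam=0.0):
    """H = X (X^T X + lam I)^{-1} X^T; lam = 0 gives ordinary least squares."""
    k = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(k), X.T)

def cv_shortcut(H, y):
    """Leave-one-out CV from a single fit: residuals scaled by 1 - h_ii."""
    resid = y - H @ y
    return float(np.mean((resid / (1.0 - np.diag(H))) ** 2))

def gcv(H, y):
    """Generalized CV: 1 - h_ii replaced by its mean value 1 - tr(H) / n."""
    resid = y - H @ y
    return float(np.mean((resid / (1.0 - np.trace(H) / len(y))) ** 2))

def cv_brute_force(X, y, lam=0.0):
    """Leave-one-out CV by refitting n times, once per excluded observation."""
    n, k = X.shape
    sq_err = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        A = X[keep].T @ X[keep] + lam * np.eye(k)
        beta = np.linalg.solve(A, X[keep].T @ y[keep])
        sq_err[i] = (y[i] - X[i] @ beta) ** 2
    return float(sq_err.mean())

for penalty in (0.0, lam):
    H = hat_matrix(X, penalty)
    print(penalty, cv_brute_force(X, y, penalty), cv_shortcut(H, y), gcv(H, y))
```

The first two printed values agree up to rounding error for both least squares and ridge, while the generalized version is close to them but not identical.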

5.1.3 Mallows’ Cp

One of the model evaluation criteria based on prediction error is Mallows' $C_p$ (Mallows, 1973), which is used in particular for variable selection in regression modeling. This criterion was derived under the assumption that the probabilistic structure of the specified model is different from the true probabilistic structure that generates the data in the framework of a linear regression model. It is assumed that the expectation and the variance-covariance matrix of the $n$-dimensional observation vector $\boldsymbol{y} = (y_1, y_2, \ldots, y_n)^T$ for response variable $y$ are

$$E[\boldsymbol{y}] = \boldsymbol{\mu}, \qquad \mathrm{cov}(\boldsymbol{y}) = E[(\boldsymbol{y} - \boldsymbol{\mu})(\boldsymbol{y} - \boldsymbol{\mu})^T] = \omega^2 I_n, \tag{5.9}$$

respectively. We then estimate the true expectation $\boldsymbol{\mu}$, using the linear regression model

$$\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad E[\boldsymbol{\varepsilon}] = \boldsymbol{0}, \qquad \mathrm{cov}(\boldsymbol{\varepsilon}) = \sigma^2 I_n, \tag{5.10}$$

where $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T$, $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^T$, and $X$ is an $n \times (p + 1)$ design matrix formed from the data for the predictor variables. The expectation and the variance-covariance matrix of the observation vector $\boldsymbol{y}$ under this linear regression model are, respectively,

$$E[\boldsymbol{y}] = X\boldsymbol{\beta}, \qquad \mathrm{cov}(\boldsymbol{y}) = E[(\boldsymbol{y} - X\boldsymbol{\beta})(\boldsymbol{y} - X\boldsymbol{\beta})^T] = \sigma^2 I_n. \tag{5.11}$$

Comparison of (5.9) and (5.11) shows that the objective is estimation of the true structure $\boldsymbol{\mu}$ via the assumed model, under the assumption that the variance ($\omega^2$) of the observed data is different from the one ($\sigma^2$) in the linear regression model.

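Using the least squares estimator $\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\boldsymbol{y}$ of the regression coefficient vector $\boldsymbol{\beta}$, $\boldsymbol{\mu}$ is estimated by

$$\hat{\boldsymbol{\mu}} = X\hat{\boldsymbol{\beta}} = X(X^TX)^{-1}X^T\boldsymbol{y} \equiv H\boldsymbol{y}. \tag{5.12}$$

The goodness of this estimator $\hat{\boldsymbol{\mu}}$ is measured by the mean square error

$$\Delta_p = E[(\hat{\boldsymbol{\mu}} - \boldsymbol{\mu})^T(\hat{\boldsymbol{\mu}} - \boldsymbol{\mu})]. \tag{5.13}$$

It follows from (5.9) and (5.12) that the expectation of the estimator $\hat{\boldsymbol{\mu}}$ is

$$E[\hat{\boldsymbol{\mu}}] = X(X^TX)^{-1}X^T E[\boldsymbol{y}] = H\boldsymbol{\mu}. \tag{5.14}$$

Hence the mean square error $\Delta_p$ can be expressed as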

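$$\begin{aligned}
\Delta_p &= E[(\hat{\boldsymbol{\mu}} - \boldsymbol{\mu})^T(\hat{\boldsymbol{\mu}} - \boldsymbol{\mu})] \\
&= E\left[\{H\boldsymbol{y} - H\boldsymbol{\mu} - (I_n - H)\boldsymbol{\mu}\}^T\{H\boldsymbol{y} - H\boldsymbol{\mu} - (I_n - H)\boldsymbol{\mu}\}\right] \\
&= E\left[(\boldsymbol{y} - \boldsymbol{\mu})^T H (\boldsymbol{y} - \boldsymbol{\mu})\right] + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu} \\
&= \mathrm{tr}\left\{H E[(\boldsymbol{y} - \boldsymbol{\mu})(\boldsymbol{y} - \boldsymbol{\mu})^T]\right\} + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu} \\
&= (p + 1)\omega^2 + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu}.
\end{aligned} \tag{5.15}$$

Here, since $H$ and $I_n - H$ are idempotent matrices, we used the relationships $H^2 = H$, $(I_n - H)^2 = I_n - H$, $H(I_n - H) = 0$ and $\mathrm{tr}H = \mathrm{tr}\{X(X^TX)^{-1}X^T\} = \mathrm{tr}I_{p+1} = p + 1$, $\mathrm{tr}(I_n - H) = n - p - 1$.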
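The matrix relationships used in this derivation are easy to confirm numerically for a particular design matrix. The following sketch uses a randomly generated $n \times (p + 1)$ matrix with hypothetical sizes $n = 12$ and $p = 3$; the dictionary name `checks` is introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)            # fixed seed, illustration only
n, p = 12, 3                              # hypothetical sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n x (p + 1) design matrix
H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix X (X^T X)^{-1} X^T
I_n = np.eye(n)

checks = {
    "H^2 = H": np.allclose(H @ H, H),
    "(I - H)^2 = I - H": np.allclose((I_n - H) @ (I_n - H), I_n - H),
    "H (I - H) = 0": np.allclose(H @ (I_n - H), 0.0),
    "tr H = p + 1": np.isclose(np.trace(H), p + 1),
    "tr (I - H) = n - p - 1": np.isclose(np.trace(I_n - H), n - p - 1),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

The checks hold for any design matrix of full column rank, since they follow from $H$ being an orthogonal projection onto the column space of $X$.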

In (5.15) the first term, $(p + 1)\omega^2$, increases as the number of parameters increases. The second term, $\boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu}$, is the sum of squared biases of the estimator $\hat{\boldsymbol{\mu}}$, and decreases as the number of parameters increases. If an estimator of $\Delta_p$ is available, then it can be used as a criterion for model evaluation.

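The expectation of the residual sum of squares can be calculated as

$$\begin{aligned}
E[(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})]
&= E[(\boldsymbol{y} - H\boldsymbol{y})^T(\boldsymbol{y} - H\boldsymbol{y})] \\
&= E\left[\{(I_n - H)(\boldsymbol{y} - \boldsymbol{\mu}) + (I_n - H)\boldsymbol{\mu}\}^T\{(I_n - H)(\boldsymbol{y} - \boldsymbol{\mu}) + (I_n - H)\boldsymbol{\mu}\}\right] \\
&= E\left[(\boldsymbol{y} - \boldsymbol{\mu})^T(I_n - H)(\boldsymbol{y} - \boldsymbol{\mu})\right] + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu} \\
&= \mathrm{tr}\left\{(I_n - H)E[(\boldsymbol{y} - \boldsymbol{\mu})(\boldsymbol{y} - \boldsymbol{\mu})^T]\right\} + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu} \\
&= (n - p - 1)\omega^2 + \boldsymbol{\mu}^T(I_n - H)\boldsymbol{\mu}.
\end{aligned} \tag{5.16}$$

Comparison between (5.15) and (5.16) reveals that, if $\omega^2$ is assumed known, then the unbiased estimator of $\Delta_p$ is given by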

$$\hat{\Delta}_p = (\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}}) + \{2(p + 1) - n\}\omega^2. \tag{5.17}$$

By dividing both sides of the above equation by the estimator $\hat{\omega}^2$ of $\omega^2$, we obtain Mallows' $C_p$ criterion as an estimator of $\Delta_p$ in the form

$$C_p = \frac{(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})}{\hat{\omega}^2} + \{2(p + 1) - n\}. \tag{5.18}$$

It may be seen that model preferability increases as the $C_p$ criterion decreases. The estimator $\hat{\omega}^2$ is usually represented by the unbiased estimator of error variance in the most complex model. In a linear regression model, for example, it is represented by the unbiased estimator of error variance in the model including all of the predictor variables.
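As an illustration of (5.18), the sketch below computes $C_p$ for a sequence of nested subsets of predictors, with $\hat{\omega}^2$ taken from the model that includes all predictors. The simulated data (four candidate predictors, of which only the first two affect the response) and the function names `rss_and_size` and `mallows_cp` are assumptions for illustration.

```python
import numpy as np

def rss_and_size(X, y):
    """Residual sum of squares and number of coefficients of a least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid), X.shape[1]

def mallows_cp(X_sub, y, omega2_hat):
    """C_p = RSS / omega2_hat + {2(p + 1) - n}, where p + 1 counts the intercept."""
    rss, n_params = rss_and_size(X_sub, y)
    return rss / omega2_hat + (2 * n_params - len(y))

rng = np.random.default_rng(6)            # fixed seed, illustration only
n = 60
Z = rng.normal(size=(n, 4))               # four hypothetical predictors
y = 1.0 + 2.0 * Z[:, 0] - 1.5 * Z[:, 1] + rng.normal(size=n)  # only two matter
X_full = np.column_stack([np.ones(n), Z])

# Unbiased error variance estimate from the most complex (full) model
rss_full, k_full = rss_and_size(X_full, y)
omega2_hat = rss_full / (n - k_full)

cp = {}
for subset in [(0,), (0, 1), (0, 1, 2), (0, 1, 2, 3)]:
    cols = [0] + [j + 1 for j in subset]  # always keep the intercept column
    cp[subset] = mallows_cp(X_full[:, cols], y, omega2_hat)
    print(subset, round(cp[subset], 3))
```

With this choice of $\hat{\omega}^2$ the full model always has $C_p = p + 1$ by construction, so the comparison of interest is among the smaller models.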

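For order selection in a time series autoregressive model, the final prediction error (FPE; Akaike, 1969) is given as the PSE estimator. For a linear regression model $y = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$ estimated by least squares, it is given as

$$\mathrm{FPE} = \frac{n + p + 1}{n(n - p - 1)} (\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}}). \tag{5.19}$$

Moreover, FPE can be rewritten as

$$\begin{aligned}
n \log \mathrm{FPE} &= n \log\left\{\frac{n + p + 1}{n - (p + 1)}\right\} + n \log\left\{\frac{1}{n}(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})\right\} \\
&\approx n \log\left\{1 + \frac{2}{n}(p + 1)\right\} + n \log\left\{\frac{1}{n}(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})\right\} \\
&\approx 2(p + 1) + n \log\left\{\frac{1}{n}(\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})\right\}.
\end{aligned} \tag{5.20}$$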

In contrast to $\mathrm{RSS} = (\boldsymbol{y} - \hat{\boldsymbol{y}})^T(\boldsymbol{y} - \hat{\boldsymbol{y}})$, which decreases with increasing model complexity, both Mallows' $C_p$ criterion and the FPE criterion thus include the number of model parameters and thereby penalize model complexity. It may also be noted that they yield equations equivalent to that of the AIC in (5.43) for the Gaussian linear regression model in Example 5.4 of Section 5.2.2, which shows that they are closely related to AIC.
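The quality of the approximation $n \log \mathrm{FPE} \approx 2(p + 1) + n \log\{\mathrm{RSS}/n\}$ can be checked with a quick calculation. The values of $n$, $p$, and the RSS below are hypothetical, and the function names are introduced only for this sketch.

```python
import numpy as np

def fpe(rss, n, p):
    """Final prediction error for a least squares fit with p predictors plus intercept."""
    return (n + p + 1) / (n * (n - p - 1)) * rss

def n_log_fpe_approx(rss, n, p):
    """Large-n approximation: 2(p + 1) + n log(RSS / n)."""
    return 2 * (p + 1) + n * np.log(rss / n)

n, p, rss = 200, 3, 180.0                 # hypothetical values
exact = n * np.log(fpe(rss, n, p))
approx = n_log_fpe_approx(rss, n, p)
print(exact, approx, exact - approx)
```

The two quantities differ only through the replacement of $n \log\{(n + p + 1)/(n - p - 1)\}$ by $2(p + 1)$, so the gap shrinks as $n$ grows relative to $p$.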
