Our proposed method to accommodate incomplete observations is motivated by EM principles as described in section 1.2.3. The focus of our analysis is the regularized empirical risk function which defines the SVM solution (section 1.4). Here, we make use of the weighted form of the objective function derived in Yang et al. [114]. With observation weights $w_i$, the weighted SVM objective function is
\[
\hat{f}_{T_n} = \arg\min_{f \in H} \; \frac{1}{2}\|f\|_H^2 + \sum_{i}^{n} w_i \max(0,\, 1 - y_i f(x_i)). \tag{2.1}
\]
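As a rough illustration of equation (2.1), the sketch below evaluates the weighted objective for a candidate decision function. It assumes a linear kernel, so that $f(x) = b_0 + \beta^\top x$ and $\|f\|_H^2 = \|\beta\|^2$; the function names are hypothetical and are not taken from the original work.

```python
def hinge(margin):
    """Hinge loss max(0, 1 - margin), where margin = y * f(x)."""
    return max(0.0, 1.0 - margin)


def weighted_svm_objective(beta, intercept, X, y, w):
    """Evaluate 0.5 * ||beta||^2 + sum_i w_i * hinge(y_i * f(x_i)).

    Assumes a linear decision function f(x) = intercept + beta . x,
    so the RKHS norm of f reduces to the Euclidean norm of beta.
    """
    penalty = 0.5 * sum(b * b for b in beta)
    risk = 0.0
    for x_i, y_i, w_i in zip(X, y, w):
        f_x = intercept + sum(b * v for b, v in zip(beta, x_i))
        risk += w_i * hinge(y_i * f_x)
    return penalty + risk
```

Minimizing this quantity over `beta` and `intercept` would give the weighted SVM solution in the linear case; the sketch only evaluates the objective and does not perform the optimization.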
We use the same notation as chapter 1; recall that outcomes and covariates are denoted as $y_i$ and $x_i$ for the $n$ observations of the training set $T_n$. Building on the weighted objective function, we first assume that the sampling model of $y_i \mid x_i$ can be approximated by a quasi-probability model which is a function of the loss $L_h$ and decision boundary $f$. We denote this quasi-probability model as $\tilde{p}(y \mid x, f)$. Second, we assume a distribution $p(x; \theta)$. Together, these assumptions imply a conditional quasi-probability model for the missing covariates $x^m$ conditional on the observed data and parameters, i.e.,
\[
\tilde{p}(x_i^m \mid x_i^o, y_i, f, \theta) = \frac{\tilde{p}(y_i \mid x_i, f)\, p(x_i^m \mid x_i^o, \theta)}{\int \tilde{p}(y_i \mid x_i, f)\, p(x_i^m \mid x_i^o, \theta)\, dx_i^m}. \tag{2.2}
\]
The proposed method is to replace $\max[0,\, 1 - y_i f(x_i)]$ in equation (2.1) with its conditional expectation $\tilde{E}\{\max[0,\, 1 - y_i f(x_i)] \mid x_i^o, y_i, f, \theta\}$ for incomplete observations. Further, we propose that the expectation be replaced by an average over a finite number of draws:
\[
\tilde{E}\{\max[0,\, 1 - y_i f(x_i)] \mid \cdots\} = \frac{1}{r} \sum_{k=1}^{r} \max[0,\, 1 - y_i f(x_{(ik)})],
\]
where $x_{(ik)}$ is $x_i$ with missing values replaced with a draw from $\tilde{p}(x_i^m \mid x_i^o, y_i, f, \theta)$.
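The finite-draw average can be sketched as follows. This is a minimal illustration only: `draw_missing` is a hypothetical callable standing in for a sampler from $\tilde{p}(x_i^m \mid x_i^o, y_i, f, \theta)$, and missing entries are marked with `None` by assumption.

```python
import random


def mc_expected_hinge(f, y, x_obs, draw_missing, r, rng):
    """Average the hinge loss over r completed copies of one incomplete row.

    x_obs        : list of covariate values, with None in missing positions.
    draw_missing : hypothetical sampler; returns one value per missing
                   position, in order, for a single draw.
    """
    total = 0.0
    for _ in range(r):
        fill = iter(draw_missing(x_obs, rng))
        # Keep observed values; substitute drawn values where missing.
        x_k = [next(fill) if v is None else v for v in x_obs]
        total += max(0.0, 1.0 - y * f(x_k))
    return total / r
```

Larger values of `r` reduce the Monte Carlo error of the average, at the cost of more draws per incomplete observation.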
The key observation is that this setup essentially amounts to replacing each incomplete observation with $r$ draws from the conditional distribution and then adjusting the corresponding weight for each observation. Thus the algorithm begins with a postulated rule, usually the complete case solution, and then alternates between drawing replacement observations for missing values and constructing a new decision rule, which is then used in the next step of drawing replacement values. The iterative solution is outlined in Table 2.1 for the standard case when each observation is initially weighted the same, i.e., $w_i = C/n$.
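The bookkeeping behind this replacement can be sketched as below. The data layout is an assumption made for illustration: rows are lists with `None` marking missing values, and `draws` is a hypothetical dictionary mapping the index of each incomplete row to its $r$ completed replicates.

```python
def augment_and_weight(rows, labels, draws, C):
    """Build the augmented training set and its observation weights.

    Complete rows keep weight C/n. Each incomplete row is replaced by its
    r completed replicates, each carrying weight C/(r*n), so that every
    original observation contributes total weight C/n.
    """
    n = len(rows)
    X_aw, y_aw, w_aw = [], [], []
    for i, (x, y) in enumerate(zip(rows, labels)):
        if all(v is not None for v in x):
            X_aw.append(list(x))
            y_aw.append(y)
            w_aw.append(C / n)
        else:
            replicates = draws[i]
            r = len(replicates)
            for x_k in replicates:
                X_aw.append(list(x_k))
                y_aw.append(y)
                w_aw.append(C / (r * n))
    return X_aw, y_aw, w_aw
```

The augmented rows and weights could then be passed to any SVM routine that accepts per-observation weights.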
In many situations, sampling from the conditional distribution $\tilde{p}(x_i^m \mid x_i^o, y_i, f, \theta)$ will require MCMC methods. As such, we describe a Metropolis-Hastings algorithm with a normal proposal distribution which can be used in the augment step. For an observation $x$ with missing values at an arbitrary augment step $(t+1)$ of the AWSVM algorithm, start the sampling chain with an initial, prior value $x_{(1)}$. Let $x^o_{(1)} = x^o$. Then generate a proposal value from the normal distribution centered at $x_{(1)}$; call it $x^p_{(1)}$. Again, ensure that the observed elements of $x$ match the corresponding elements in $x^p_{(1)}$. (The proposal distribution may be tuned via its covariance if needed.) Let $u$ be a uniform random variable on $(0, 1)$, and let $\phi$ be the normal density function. If
\[
\frac{\tilde{p}(y_i \mid x^p_{(1)}, f^{(t)})\, \phi(x^p_{(1)}; \theta^{(t)})}{\tilde{p}(y_i \mid x_{(1)}, f^{(t)})\, \phi(x_{(1)}; \theta^{(t)})} \ge u,
\]
then accept the proposal and set $x_{(2)} = x^p_{(1)}$; otherwise set $x_{(2)} = x_{(1)}$.
Table 2.1: AWSVM Algorithm

Step  Procedure
0     Choose a distribution for $p(x; \theta)$, say normal. Let $f^{(0)}$ and $\theta^{(0)}$ be initial starting values of $f$ and $\theta$.
1     Let $t$ index the iterations. Start at $t = 0$.
1a    (Augment) For each observation with missing values, construct $r$ replicates (indexed by $k$) of $x_{(ik)}$ with draws from $\tilde{p}(x_i^m \mid x_i^o, y_i, f^{(t)}, \theta^{(t)})$. Augment the data set with these draws; call it $T_{aw}$.
1b    (Weight) For entries corresponding to complete cases, set the weight to $C/n$. For entries of incomplete cases, set the weight to $C/(rn)$.
1c    Find $\hat{f}_{T_{aw}}(x)$ and weighted maximum likelihood estimates of $\theta$. Set $f^{(t+1)} = \hat{f}_{T_{aw}}(x)$ and $\theta^{(t+1)} = \hat{\theta}$.
2     Repeat step 1 until $f^{(t+1)}$ converges.
The procedure continues with $x_{(2)}$ acting as the prior value in order to generate $x_{(3)}$, and so forth. The first 1000 values are discarded as burn-in, and then every 20th draw is kept. The set of $r$ values drawn from this procedure are the replicates $x_{(ik)}$ described in step 1a of the AWSVM algorithm reported in Table 2.1. These draws are appended to the training set with the corresponding weights set to $1/r$ of the complete case weight.
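The sampling scheme just described can be sketched in a few lines. The sketch rests on several simplifying assumptions that are not part of the original text: $p(x; \theta)$ is taken to be normal with independent components (means `mean`, standard deviations `sd`), the chain starts with missing entries at their model means, the proposal perturbs only the missing coordinates with a fixed step size, and $\tilde{p}(y \mid x, f)$ is the logistic-type form defined in section 2.2.1. All function and parameter names are hypothetical.

```python
import math
import random


def quasi_prob(y, fx):
    """Quasi-probability 1 / (1 + exp(-y * delta_L)) from section 2.2.1."""
    delta = max(0.0, 1.0 + fx) - max(0.0, 1.0 - fx)
    return 1.0 / (1.0 + math.exp(-y * delta))


def normal_density(x, mean, sd):
    """Product of univariate normal densities (diagonal covariance assumed)."""
    dens = 1.0
    for v, m, s in zip(x, mean, sd):
        dens *= math.exp(-0.5 * ((v - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    return dens


def mh_replicates(x_obs, y, f, mean, sd, r, rng, step=0.5, burn_in=1000, thin=20):
    """Draw r replicates of one incomplete row by Metropolis-Hastings."""
    missing = [j for j, v in enumerate(x_obs) if v is None]
    # Starting value: observed entries fixed, missing entries at the model mean.
    x = [mean[j] if v is None else v for j, v in enumerate(x_obs)]
    kept, t = [], 0
    while len(kept) < r:
        # Normal proposal centered at the current value; observed entries unchanged.
        prop = list(x)
        for j in missing:
            prop[j] = rng.gauss(x[j], step)
        num = quasi_prob(y, f(prop)) * normal_density(prop, mean, sd)
        den = quasi_prob(y, f(x)) * normal_density(x, mean, sd)
        if num / den >= rng.random():
            x = prop  # accept the proposal; otherwise keep the current value
        t += 1
        # Discard the burn-in draws, then keep every `thin`-th draw.
        if t > burn_in and (t - burn_in) % thin == 0:
            kept.append(list(x))
    return kept
```

Because the observed coordinates are held fixed, the ratio of joint densities equals the ratio of conditional densities of the missing coordinates, which is why the full density of $x$ can be used in the acceptance step. The `step` argument plays the role of the tunable proposal covariance.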
Users of AWSVM must select the cost parameter $C$ and any kernel-specific parameters just as they would in a situation without missing data. We propose two options. The first is standard cross-validation, in which one uses the same proposed value of the tuning parameters at every iteration. The second is a cross-validation step at the start of each iteration, thus allowing the tuning parameters to change. Depending on the computational burden of (a) generating draws from the quasi-conditional distribution and (b) cross-validation, one option may be more time-effective. If (b) is more time intensive, then the standard method is likely more time-effective. Conversely, if (a) is more time intensive, then allowing the parameters to change at each iteration is preferred. Our simulations have not shown either option to be better than the other in terms of prediction error.
2.2.1 Properties of the AWSVM
We frame the AWSVM in terms of a quasi-likelihood; it is in that context that we derive the method's properties. When weights are uniform, each observation contributes
\[
\frac{1}{C}\|f\|_H + \max(0,\, 1 - y_i f(x_i))
\]
to the penalized empirical risk. The loss function is such that properly classified observations contribute less while misclassified observations contribute more. We build the proposed quasi-conditional probability, $\tilde{p}(y_i \mid x_i, f)$, with this risk contribution.
Let $D$ be a normalizing constant, and let $\Delta L(x_i) = \max[0,\, 1 + f(x_i)] - \max[0,\, 1 - f(x_i)]$. We define the quasi-conditional probability model as
\[
\tilde{p}(y_i \mid x_i, f) = D \exp\{-\text{empirical risk contribution}\} = \left[1 + \exp\{-y_i \Delta L(x_i)\}\right]^{-1}.
\]
If $x_i$ is on the boundary of the decision rule $f$, the induced conditional probability is $\frac{1}{2}$. Otherwise, $\tilde{p}(y_i \mid x_i, f)$ is larger in regions of smaller loss and, conversely, smaller in regions of larger loss.
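The closed form follows from normalizing over the two labels $y \in \{-1, 1\}$: the penalty term does not depend on $y$ and cancels, leaving a ratio of the two exponentiated hinge losses. A small sketch (function names are illustrative only) makes the stated properties easy to check numerically.

```python
import math


def delta_loss(fx):
    """Delta L = max(0, 1 + f(x)) - max(0, 1 - f(x))."""
    return max(0.0, 1.0 + fx) - max(0.0, 1.0 - fx)


def quasi_prob(y, fx):
    """Quasi-conditional probability [1 + exp(-y * Delta L)]^(-1), y in {-1, 1}."""
    return 1.0 / (1.0 + math.exp(-y * delta_loss(fx)))


def quasi_prob_from_hinge(y, fx):
    """Same quantity computed directly as a normalized exp(-hinge loss)."""
    num = math.exp(-max(0.0, 1.0 - y * fx))
    den = math.exp(-max(0.0, 1.0 - fx)) + math.exp(-max(0.0, 1.0 + fx))
    return num / den
```

The two functions should agree for any value of `fx`; on the decision boundary (`fx = 0`) both labels receive probability one half, and the two label probabilities sum to one everywhere.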
The proposed method has important asymptotic properties, which we state here and prove in section A.1 (page 104) and section A.2 (page 108) of the appendix.
Proposition 1. The AWSVM solution maximizes the observed data quasi-likelihood.
Proposition 2. If the data model $p(x; \theta)$ is specified correctly, then the decision function that
The first proposition and its proof indicate that the AWSVM algorithm does converge to a meaningful classifier. Similar to other EM methods, the AWSVM algorithm generates a rule which maximizes the observed data quasi-likelihood. The second proposition indicates that, under certain conditions, the AWSVM solution is consistent in the sense that it is a Bayes classifier.