Due to the popularity of SVMs in the two-class setting, a natural extension of the method is the multi-class setting in which one wants to construct a classifier that distinguishes between more than two classes. There are two families of multi-class SVMs, which we consider in turn.

1.5.1 Composite-of-binary SVMs

The first family, which we call the composite-of-binary family, constructs a set of two-class SVM rules from which a single multi-class rule is constructed. Procedures of this type are widely used in many classification settings and are not specific to SVMs [38, 49]. The three most popular composite-of-binary SVMs are the one-versus-one SVM, the one-versus-many SVM, and the one-versus-one DAGSVM [81]. In a $K$-class setting, the one-versus-one multi-class SVM is constructed by generating all $K(K-1)/2$ binary rules that separate generic class $i$ from generic class $j$. To classify a point, the covariate vector is input to each binary classifier, which outputs a vote for one of its two classes. The overall multi-class rule outputs the class which receives the most votes. The one-versus-one DAGSVM is a similar alternative in which classification is based on a decision tree with a one-versus-one classifier at each node. The decision tree is constructed so that one potential class is eliminated at each decision node, and the terminal node represents the class prediction. Lastly, we describe the one-versus-many classifier, which generates $K$ binary rules, each separating generic class $i$ from all other classes. To classify a point, the covariate vector is input to each two-class rule; the final prediction corresponds to the class which generated the largest signed distance, i.e., $\hat{y} = \arg\max_{i=1,\dots,K} f_i(x)$. Variations of these composite-of-binary SVMs exist, and generally vary by how each component classifier casts a vote for a potential class. For example, the component SVMs can be used to generate probability estimates, which in turn are combined in a final decision rule. See [49, 80, 20, 110] for a discussion of generating and combining multi-class probabilities from SVMs.
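The two voting schemes described above can be sketched in a few lines of Python. This is a minimal illustration, not a fitted model: the pairwise and per-class rules below are hypothetical stand-ins for trained two-class SVM decision functions, the names `CENTERS`, `make_pairwise_rule`, and `make_class_rule` are ours, and classes are indexed from 0 rather than 1.

```python
from itertools import combinations

# Hypothetical stand-ins for trained two-class SVM decision functions on a
# scalar covariate; classes are indexed 0..K-1 here rather than 1..K.
CENTERS = [-1.0, 0.0, 1.0]
K = len(CENTERS)

def make_pairwise_rule(i, j):
    # A positive value is a vote for class i, otherwise for class j.
    return lambda x: abs(x - CENTERS[j]) - abs(x - CENTERS[i])

def make_class_rule(i):
    # A larger value means x looks more like class i than the other classes.
    return lambda x: -(x - CENTERS[i]) ** 2

pairwise_rules = {(i, j): make_pairwise_rule(i, j)
                  for i, j in combinations(range(K), 2)}
class_rules = [make_class_rule(i) for i in range(K)]

def one_vs_one_predict(x, rules, n_classes):
    """Count one vote per pairwise rule and return (winning class, votes)."""
    votes = [0] * n_classes
    for i, j in combinations(range(n_classes), 2):
        winner = i if rules[(i, j)](x) > 0 else j
        votes[winner] += 1
    # max() keeps the first class with the top count, so ties go to the
    # lowest class index (an arbitrary choice made for this sketch).
    return max(range(n_classes), key=lambda c: votes[c]), votes

def one_vs_many_predict(x, rules):
    """Return (arg max_i f_i(x), scores) over the K class-versus-rest rules."""
    scores = [f(x) for f in rules]
    return max(range(len(scores)), key=lambda c: scores[c]), scores

ovo_label, ovo_votes = one_vs_one_predict(0.8, pairwise_rules, K)
ovm_label, ovm_scores = one_vs_many_predict(0.8, class_rules)
print(ovo_label, ovo_votes, ovm_label)
```

With $K = 3$ the one-versus-one scheme evaluates $K(K-1)/2 = 3$ rules per point, while the one-versus-many scheme evaluates $K = 3$.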

Because of their operational efficiency, versions of the composite-of-binary multi-class SVM are implemented widely in statistical software packages [20]. In both simulation and applied settings, the composite-of-binary family has demonstrated its usefulness [54, 32]. Much of the operational efficiency of the composite-of-binary SVM stems from the fact that it takes advantage of specialized but widely distributed algorithms for the two-class SVM. The drawback of this type of multi-class SVM is that voting does not always generate a clear winning class. For example, a one-versus-one multi-class rule in a $K = 3$ class setting is built on 3 two-class rules. For some covariate combinations, the 3 two-class rules may each vote for a different class, so that every class receives exactly one vote. The same scenario can occur with one-versus-all multi-class rules as well. The primary advantage of the DAGSVM is that it avoids ambiguities such as ties that can arise in the standard voting scheme. Beyond the issue of ties is the issue of consistency: the author of [68] gives examples in which one-versus-all is not Fisher consistent. Despite this issue, the composite-of-binary approach continues to be a popular multi-class SVM method [84].
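The three-way tie in the $K = 3$ example, and the way a DAG-style elimination avoids it, can be illustrated as follows. This is a sketch under stated assumptions: the three constant pairwise rules are hypothetical and chosen only to produce a cyclic vote, and the elimination order (first remaining class against last remaining class) is one possible arrangement of the decision nodes, so the class returned in a cyclic case depends on that ordering.

```python
from itertools import combinations

K = 3

# Hypothetical pairwise rules at one fixed covariate value. A positive value
# votes for the first class of the pair, a negative value for the second.
# Together they form a cycle: 0 beats 1, 1 beats 2, and 2 beats 0.
pairwise_rules = {
    (0, 1): lambda x: +1.0,
    (1, 2): lambda x: +1.0,
    (0, 2): lambda x: -1.0,
}

def count_votes(x, rules, n_classes):
    """Standard one-versus-one voting: one vote per pairwise rule."""
    votes = [0] * n_classes
    for i, j in combinations(range(n_classes), 2):
        votes[i if rules[(i, j)](x) > 0 else j] += 1
    return votes

def dag_predict(x, rules, n_classes):
    """Eliminate one candidate class per node until one class remains."""
    candidates = list(range(n_classes))
    while len(candidates) > 1:
        i, j = candidates[0], candidates[-1]
        if rules[(i, j)](x) > 0:
            candidates.pop()       # the rule prefers i, so j is eliminated
        else:
            candidates.pop(0)      # the rule prefers j, so i is eliminated
    return candidates[0]

votes = count_votes(None, pairwise_rules, K)
dag_label = dag_predict(None, pairwise_rules, K)
print(votes, dag_label)
```

Here every class receives one vote, so plain voting has no winner, whereas the elimination scheme always terminates in a single class after $K - 1$ rule evaluations.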

1.5.2 Simultaneously Trained SVMs

The second family of multi-class SVM builds a decision rule by simultaneously training $K$ functions, where each function corresponds to a single class [107, 26, 62, 69]. Distinguished by specific multi-class loss functions, the various flavors of multi-class SVMs simultaneously construct the $K$ functions so that the predicted class corresponds to the function with the largest value. The simultaneous estimation of the $K$ functions ensures that the estimated multi-class rule targets the multi-class Bayes rule.

The general setup is very similar to the two-class setting, and we describe it here. Consider a training set, $T_n$, of $n$ observations, each consisting of a $d$-vector of covariates, $x \in \mathbb{R}^d$, and a multi-class outcome, $y \in \{1, \dots, k\}$. Each observation is an iid draw from an unknown distribution $P(x, y)$. Consider functions $f(x) = \{f_1(x), \dots, f_k(x)\}$ so that the class label of $x$ can be predicted as

$$\hat{y} = \arg\max_{i} f_i(x).$$

The classifier which minimizes the average classification error over $P(x, y)$ is the Bayes classifier,

$$f_{\mathrm{bayes}} = \arg\min_{f} E_P\left[\, y \neq \arg\max_{i} f_i(x) \,\right].$$

The average classification error of the Bayes classifier is called the Bayes risk. The goal is to construct a classifier from the training set $T_n$ which is asymptotically a Bayes classifier but also performs well in finite sample situations.
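As a small worked illustration of the Bayes classifier and the Bayes risk, consider a distribution $P(x, y)$ that is fully known. The joint table below is hypothetical (the covariate takes only two values and classes are indexed from 0), and the sketch relies on the standard fact that the error-minimizing rule predicts, for each $x$, the class with the largest probability.

```python
# Hypothetical joint distribution P(x, y): x takes values "a" or "b",
# y takes values 0, 1, 2. The six entries sum to one.
joint = {
    "a": [0.30, 0.15, 0.05],
    "b": [0.05, 0.10, 0.35],
}

def bayes_rule(x):
    """Predict the class with the largest probability at covariate value x."""
    probs = joint[x]
    return max(range(len(probs)), key=lambda c: probs[c])

def error_rate(rule):
    """Average classification error of a rule under the known joint table."""
    return sum(p
               for x, probs in joint.items()
               for c, p in enumerate(probs)
               if rule(x) != c)

bayes_risk = error_rate(bayes_rule)            # error of the Bayes classifier
always_zero_risk = error_rate(lambda x: 0)     # a rule that ignores x
print(bayes_risk, always_zero_risk)
```

Any other rule on this table, such as the constant rule above, has an error rate at least as large as the Bayes risk. In practice $P(x, y)$ is unknown, which is why the classifier must be estimated from $T_n$.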

The SVM solution frames the task within empirical risk minimization; specifically, the SVM solution is

$$\hat{f} = \arg\min_{f \in H} \; \lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^{n} L[y_i, f(x_i)] \tag{1.9}$$

such that

$$\sum_{i=1}^{k} f_i(x) = 0,$$

where $\|f\|_H^2 = \sum_{i=1}^{k} \|f_i\|_H^2$ and $L[y_i, f(x_i)]$ is a loss function which penalizes misclassification. The multi-class SVM methods discussed here build on the reinforced multi-class SVM (RMSVM) proposed in [69] because it provides a general multi-class loss function which includes the earlier work of [62] and [107] as special cases. The RMSVM loss function is

$$L[y, f(x)] = \gamma\,[(k-1) - f_y(x)]_+ + (1-\gamma) \sum_{j \neq y} [1 + f_j(x)]_+$$

where the function $[t]_+ = \max\{0, t\}$ and $\gamma$ is a tuning parameter which calibrates the loss. The set of solutions, $H$, is constructed so that each component of the solution, $\hat{f}$, is of the form

$$f_i(x) = b + \sum_{j=1}^{n} c_j K(x, x_j), \qquad x_j \in T_n,$$

where $K(u, v)$ is a kernel function. The linear kernel, $K(u, v) = u^t v$, and the Gaussian kernel, $K(u, v) = \exp\{-\sigma \|u - v\|^2\}$, are commonly used.
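The kernel expansion and the resulting prediction rule can be sketched as below. This is an illustration under assumptions rather than a fitted RMSVM: the training points, coefficients, and intercepts are hand-picked and hypothetical, each class is given its own coefficient vector and intercept, and classes are indexed from 0.

```python
import math

def linear_kernel(u, v):
    """K(u, v) = u^t v."""
    return sum(a * b for a, b in zip(u, v))

def gaussian_kernel(u, v, sigma=1.0):
    """K(u, v) = exp(-sigma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sigma * sq_dist)

def component(x, train_x, coefs, intercept, kernel):
    """One component f_i(x) = b + sum_j c_j K(x, x_j) over the training set."""
    return intercept + sum(c * kernel(x, xj) for c, xj in zip(coefs, train_x))

def predict(x, train_x, coefs_by_class, intercepts, kernel):
    """Return (arg max_i f_i(x), list of component values)."""
    scores = [component(x, train_x, c, b, kernel)
              for c, b in zip(coefs_by_class, intercepts)]
    return max(range(len(scores)), key=lambda i: scores[i]), scores

# Hypothetical training covariates and hand-picked (not estimated) parameters.
# The coefficients of each training point sum to zero across classes, so the
# component values sum to zero at every x.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
coefs_by_class = [
    [1.0, -0.5, -0.5],
    [-0.5, 1.0, -0.5],
    [-0.5, -0.5, 1.0],
]
intercepts = [0.0, 0.0, 0.0]

x_new = (0.9, 0.1)
lin_label, lin_scores = predict(x_new, train_x, coefs_by_class, intercepts,
                                linear_kernel)
rbf_label, rbf_scores = predict(x_new, train_x, coefs_by_class, intercepts,
                                gaussian_kernel)
print(lin_label, lin_scores, rbf_label)
```

Swapping the kernel changes only how similarity to each training point is measured; the form of each component and the arg max rule stay the same.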

As the solution to the empirical risk minimization problem, the SVM targets the conditional expectation of the loss function, $E\{L[y, f(x)] \mid x\}$. When $\gamma \le 1/2$, the RMSVM solution is Fisher consistent in the sense that the minimizer of $E\{L[y, f(x)] \mid x\}$ is also the multi-class Bayes rule [69]. Simulation examples in [69] show that the RMSVM performs better when $\gamma = 1/2$ than when $\gamma = 0$ or $1$.
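The RMSVM loss and its conditional expectation are simple to evaluate numerically. The sketch below assumes hypothetical class probabilities at a single covariate value and indexes classes from 0. It compares only three candidate vectors $f(x)$, so it illustrates the consistency statement for one case and does not prove it.

```python
def hinge(t):
    """[t]_+ = max{0, t}."""
    return max(0.0, t)

def rmsvm_loss(y, f, gamma):
    """gamma * [(k-1) - f_y]_+ + (1 - gamma) * sum_{j != y} [1 + f_j]_+."""
    k = len(f)
    own_class = gamma * hinge((k - 1) - f[y])
    other_classes = (1 - gamma) * sum(hinge(1 + f[j])
                                      for j in range(k) if j != y)
    return own_class + other_classes

def conditional_risk(class_probs, f, gamma):
    """E{L[y, f(x)] | x} when P(y = j | x) = class_probs[j]."""
    return sum(p * rmsvm_loss(y, f, gamma) for y, p in enumerate(class_probs))

# Hypothetical conditional class probabilities; the Bayes rule picks class 0.
class_probs = [0.6, 0.3, 0.1]
gamma = 0.5

# Three candidates that satisfy the sum-to-zero constraint, each favoring
# a different class.
candidates = {
    0: [2.0, -1.0, -1.0],
    1: [-1.0, 2.0, -1.0],
    2: [-1.0, -1.0, 2.0],
}
risks = {c: conditional_risk(class_probs, f, gamma)
         for c, f in candidates.items()}
best_class = min(risks, key=risks.get)
print(risks, best_class)
```

Among these candidates, the one whose largest component matches the most probable class has the smallest conditional risk, in line with the Fisher consistency result cited from [69] for $\gamma \le 1/2$.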
