Although linear SVM algorithms provide a statistically principled approach to learning from data, they are limited in practical applications where flexible methods that can handle nonlinearity are required. As in classical linear modelling, one can introduce nonlinearity by pre-processing the data with some nonlinear function before learning a decision function, that is
$$\Phi : \mathcal{X} \to \mathcal{H}, \qquad x \mapsto \Phi(x) \tag{3.45}$$
where $\mathcal{H}$ is some feature space. For example, given data in $\mathbb{R}^2$ and assuming most information is contained in second-order products of the vector entries, $x_i x_j$ for $i, j \in \{1, 2\}$, it may be preferable to work in the $\mathbb{R}^3$ feature space given by the mapping
$$\Phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2). \tag{3.46}$$
Here, the factor $\sqrt{2}$ compensates for the number of occurrences of the term $x_1 x_2$ (Schölkopf, 1997). Generalizing to $n$-dimensional inputs, it can be shown that extracting $p$-th order monomials gives a feature space of dimensionality (Schölkopf, 1997; Schölkopf and Smola, 2002)
$$\dim(\mathcal{H}) = \binom{n + p - 1}{p} \tag{3.47}$$
$$= \frac{(n + p - 1)!}{p!\,(n - 1)!} \tag{3.48}$$
and, consequently, a computationally complex problem to solve. For example, for $2^4 \times 2^4$-dimensional inputs ($n = 256$) and $p = 5$, the feature space has a dimension of the order of $10^{10}$.
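The growth described by Equations (3.47) and (3.48) can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `monomial_feature_dim` is hypothetical, and the choice $n = 256$ (a $16 \times 16$ input) is an assumption made for illustration.

```python
from math import comb


def monomial_feature_dim(n: int, p: int) -> int:
    """Number of distinct p-th order monomials in n variables: C(n + p - 1, p)."""
    return comb(n + p - 1, p)


# The R^2 -> R^3 map of Equation (3.46): n = 2 inputs, p = 2.
print(monomial_feature_dim(2, 2))  # 3

# Assumed illustration: n = 256 inputs (16 x 16) and p = 5.
dim_large = monomial_feature_dim(256, 5)
print(f"{dim_large:.2e}")  # on the order of 10^10
```

Even a modest monomial order makes explicit feature construction impractical, which is the motivation for computing dot products implicitly.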
Nonlinear SVMs and related kernel-based methods are based on the observation that the training data appear only as inner (dot) products in the dual optimization problems, Equations (3.29), (3.39), and (3.43) (Boser et al., 1992). In other words, one never works directly with the feature space representation except via the dot products (6). Thus, pre-processing the data using Equation (3.45) implies that the transformed data appear in the learning algorithm only as pairwise comparisons $\langle \Phi(x_i), \Phi(x_j) \rangle$ in the feature space $\mathcal{H}$. The combinatorial problem associated with working in high-dimensional spaces can be circumvented if the dot products can be computed implicitly via a function $k$ such that
$$k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle. \tag{3.49}$$
This raises the question of which functions $k(x_i, x_j)$ are equivalent to computing dot products in a high-dimensional feature space $\mathcal{H}$. Mercer's theorem guarantees that the so-called Mercer kernels evaluated on data in the input space $\mathcal{X}$ correspond to computing dot products in $\mathcal{H}$ (Boser et al., 1992; Burges, 2004; Schölkopf and Smola, 2002; Vapnik, 2000). Mercer's condition is formally stated as follows.
Theorem 3.3 (Mercer's Theorem). Let $k$ be a continuous symmetric kernel of an integral operator $K$, $(Kf)(y) = \int_{\mathcal{X}} k(x, y) f(x)\,dx$, that is positive definite, that is,
$$\int_{\mathcal{X} \times \mathcal{X}} k(x, x') f(x) f(x')\,dx\,dx' \ge 0 \tag{3.50}$$
for all $f \in L_2(\mathcal{X})$ (where $\mathcal{X}$ is compact). Then $k$ can be expanded in a uniformly convergent series on $\mathcal{X} \times \mathcal{X}$ in terms of the normalized eigenfunctions $\psi_i$ of $K$, with $\lambda_i$ the corresponding eigenvalues:
$$k(x, y) = \sum_{i=1}^{N_F} \lambda_i \psi_i(x) \psi_i(y). \tag{3.51}$$
Thus for any k satisfying Theorem 3.3 the relationship in Equation (3.49) holds.
As an illustration of the effect of a mapping function on an input space, consider data defined on the square $[-1, 1] \times [-1, 1] \subset \mathbb{R}^2$. Assuming most of the information is contained in second-order monomials (independent of the ordering), the $\mathbb{R}^3$ features are extracted via the mapping (7)
$$\Phi : \mathbb{R}^2 \to \mathbb{R}^3, \qquad (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2). \tag{3.52}$$
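(6) Recall the definition of a dot product:
Definition 1 (Dot Product). Given a vector space $\mathcal{X}$, a dot product is a mapping $\langle \cdot, \cdot \rangle : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which for all $\lambda \in \mathbb{R}$ and $x, y, z \in \mathcal{X}$ satisfies
1. $\langle x, y \rangle = \langle y, x \rangle$ (symmetry)
2. $\langle \lambda x, y \rangle = \lambda \langle x, y \rangle$ (linearity)
3. $\langle x, y + z \rangle = \langle x, y \rangle + \langle x, z \rangle$ (additivity)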
3.2 SUPERVISED LEARNING 43
As shown in Figure 3.5, the transformation warps the data into a surface embedded in the higher-dimensional feature space $\mathbb{R}^3$, although the intrinsic dimension of the data remains that of $\mathbb{R}^2$ (Burges, 2004).
[Figure 3.5 panels: (a) axes $x_1$, $x_2$; (b) axes $x_1^2$, $x_2^2$, and the $x_1 x_2$ coordinate.]
Figure 3.5: (a) Input space $\mathbb{R}^2$ and (b) resulting image when pre-processed by $\Phi$.
More generally, for any positive definite kernel $k$ one can define a mapping $\Phi$ into a feature space $\mathcal{H}_k$ such that $k$ computes the dot product under the image of $\Phi$. Positive definite kernels are kernels which are symmetric and satisfy
$$\sum_{i,j} a_i a_j K_{ij} \ge 0 \tag{3.53}$$
for any $a_1, \dots, a_m \in \mathbb{R}$, where $K_{ij} = k(x_i, x_j)$ is the corresponding entry of the kernel matrix. Moreover, positive definite kernels satisfy $K_{ii} \ge 0$.
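Condition (3.53) can be probed numerically for a given kernel and data set. The sketch below builds the kernel matrix of a Gaussian kernel on a few random points and evaluates the quadratic form for many random coefficient vectors. All names (`gaussian_kernel`, `gram_matrix`, `quadratic_form`) and the choice of kernel, sample size, and seed are assumptions for illustration. A sampled check of this kind can reveal a violation but does not prove positive definiteness.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(points, kernel):
    """Kernel matrix with entries K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in points] for xi in points]


def quadratic_form(K, a):
    """Left-hand side of (3.53): sum over i, j of a_i a_j K_ij."""
    m = len(a)
    return sum(a[i] * a[j] * K[i][j] for i in range(m) for j in range(m))


rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
K = gram_matrix(points, gaussian_kernel)

# Evaluate the quadratic form for 200 random coefficient vectors.
forms = [quadratic_form(K, [rng.gauss(0, 1) for _ in points]) for _ in range(200)]
print(min(forms) >= -1e-9)  # small tolerance for floating-point rounding
```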
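The feature space associated with positive definite kernels can be constructed as follows. Given a positive definite kernel $k$ and a data set $\{x_1, \dots, x_m\} \subset \mathcal{X}$, define $\mathcal{H}_k$ as a set of functions $f : \mathcal{X} \to \mathbb{R}$. Let $\Phi$ be the mapping
$$\Phi : \mathcal{X} \to \mathbb{R}^{\mathcal{X}}, \qquad x \mapsto k(\cdot, x) \tag{3.54}$$
where $\mathbb{R}^{\mathcal{X}}$ is the space of all functions mapping $\mathcal{X}$ into $\mathbb{R}$. The image of $\mathcal{X}$ under $\Phi$ can be turned into a linear space by taking linear combinations
$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i). \tag{3.55}$$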
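Given two functions of $\mathcal{H}_k$,
$$f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i) \quad \text{and} \quad g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x'_j),$$
the dot product can be found as
$$\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j). \tag{3.56}$$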
Hence, $\mathcal{H}_k$ is a dot product space. The Hilbert space is obtained by completing $\mathcal{H}_k$, that is, by adding all the limit points under the norm induced by the kernel,
$$\|f\|^2_{\mathcal{H}_k} = \sum_{i,j=1}^{m} \alpha_i \alpha_j k(x_i, x_j). \tag{3.57}$$
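Interesting properties follow from the foregoing construction of $\mathcal{H}_k$. The value of a function $f \in \mathcal{H}_k$ at a point $x$ can be expressed as a dot product in the feature space
$$f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}_k}. \tag{3.58}$$
In particular, we have
$$\langle \Phi(x_i), \Phi(x_j) \rangle = \langle k(\cdot, x_j), k(\cdot, x_i) \rangle_{\mathcal{H}_k} \tag{3.59}$$
$$= k(x_i, x_j). \tag{3.60}$$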
Because of the last property, $k$ is sometimes referred to as the reproducing kernel: taking the dot product of the kernel with itself recovers the kernel again. The corresponding space of functions $\mathcal{H}_k$ is called the reproducing kernel Hilbert space (RKHS) associated with $k$ (Schölkopf and Smola, 2002), formally defined as follows.
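Definition 2. Let $\mathcal{X}$ be a nonempty set and $\mathcal{H}$ a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. Then $\mathcal{H}$ is called a reproducing kernel Hilbert space endowed with the dot product $\langle \cdot, \cdot \rangle$ if there exists a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the properties that
1. $k$ has the reproducing property $\langle f, k(\cdot, x) \rangle = f(x)$ for all $f \in \mathcal{H}$; in particular $\langle k(\cdot, x_j), k(\cdot, x_i) \rangle_{\mathcal{H}} = k(x_i, x_j)$;
2. $k$ spans $\mathcal{H}$, that is $\mathcal{H} = \overline{\operatorname{span}\{k(x, \cdot) \mid x \in \mathcal{X}\}}$, where the overline denotes completion of the argument.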
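The reproducing property can be made concrete for functions that are finite kernel expansions, for which the dot product is given by Equation (3.56). The sketch below is an illustration under assumptions: the class name `KernelExpansion`, the Gaussian kernel, and the coefficients and centres are all chosen arbitrarily, and only finite expansions (not limit points of the completed space) are represented.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


class KernelExpansion:
    """A function f(.) = sum_i coeffs[i] * k(., centers[i])."""

    def __init__(self, coeffs, centers, kernel):
        self.coeffs = list(coeffs)
        self.centers = list(centers)
        self.kernel = kernel

    def __call__(self, x):
        """Pointwise evaluation f(x)."""
        return sum(a * self.kernel(c, x) for a, c in zip(self.coeffs, self.centers))

    def inner(self, other):
        """Dot product of two expansions, as in Equation (3.56)."""
        return sum(
            a * b * self.kernel(xi, xj)
            for a, xi in zip(self.coeffs, self.centers)
            for b, xj in zip(other.coeffs, other.centers)
        )


k = gaussian_kernel
f = KernelExpansion([0.5, -1.0, 2.0], [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], k)
x = (0.3, -0.7)
k_x = KernelExpansion([1.0], [x], k)  # the function k(., x)

# Reproducing property: <f, k(., x)> equals f(x).
print(math.isclose(f.inner(k_x), f(x)))  # True
```

The squared norm of Equation (3.57) is obtained in the same way as `f.inner(f)`.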
The RKHS as defined uniquely determines the kernel $k$ (Schölkopf and Smola, 2002). Therefore, instead of explicitly specifying the mapping function in Equation (3.45), use of a kernel function satisfying Mercer's condition is enough to guarantee the existence of an appropriate Hilbert space. Thus, in applying a learning algorithm in feature space, the exact formulation of the mapping function is not necessary; choosing an appropriate $k$ is sufficient. Table 3.1 lists some commonly used kernel functions.
Table 3.1: Examples of commonly used Mercer kernel functions

Name                              Functional form
Homogeneous polynomial            $(x \cdot y)^d$, $d \in \mathbb{N}$
Inhomogeneous polynomial          $(x \cdot y + \theta)^d$, $d \in \mathbb{N}$ and $\theta \ge 0$
Gaussian radial basis function    $\exp\left(-\|x - y\|^2 / 2\sigma^2\right)$, $\sigma > 0$
Sigmoid*                          $\tanh(\kappa\,(x \cdot y) + \vartheta)$, $\kappa > 0$ and $\vartheta < 0$

*This is equivalent to a specific two-layer multilayer perceptron and is positive definite only for certain values of the hyperparameters (Vapnik, 2000).
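The kernels in Table 3.1 translate directly into code. The following sketch uses plain Python tuples as vectors. The function names, parameter names (`degree`, `offset`, `sigma`, `kappa`, `shift`), and default values are hypothetical choices made here for illustration.

```python
import math


def dot(x, y):
    """Standard dot product in input space."""
    return sum(a * b for a, b in zip(x, y))


def homogeneous_poly(x, y, degree=2):
    """(x . y)^degree with degree a positive integer."""
    return dot(x, y) ** degree


def inhomogeneous_poly(x, y, degree=2, offset=1.0):
    """(x . y + offset)^degree with offset >= 0."""
    return (dot(x, y) + offset) ** degree


def gaussian_rbf(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)) with sigma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid(x, y, kappa=1.0, shift=-1.0):
    """tanh(kappa * (x . y) + shift); not positive definite for all settings."""
    return math.tanh(kappa * dot(x, y) + shift)


x, y = (1.0, 2.0), (3.0, -1.0)   # x . y = 1, ||x - y||^2 = 13
for kernel in (homogeneous_poly, inhomogeneous_poly, gaussian_rbf, sigmoid):
    print(kernel.__name__, kernel(x, y))
```

Because each function takes two points and returns a scalar, any of them can be substituted wherever a dot product appears in a learning algorithm.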
The idea of performing operations in Hilbert space is not a novel one; it has been known in the mathematical sciences for a long time (see, e.g., Aronszajn (1950)). Although the idea of using kernels in machine learning applications was first proposed in Aizerman et al. (1964) (who used it in a convergence proof for the linear perceptron), the insight that it can be used in a mathematical programming setup is central to the development of powerful algorithms such as support vector machines (Boser et al., 1992), kernel discriminant analysis (Baudat and Anouar, 2000; Mika et al., 1999), kernel principal component analysis (Schölkopf et al., 1998), and density estimation (Girolami, 2002; Mukherjee and Vapnik, 1999), among others. Kernel functions are also used in Gaussian processes, a family of algorithms in machine learning closely related to support vector algorithms, where they are known as covariance operators (Rasmussen and Williams, 2006). The use of kernels has been extended to handle non-vectorial data such as strings in natural language processing (Haussler, 1999; Joachims, 1998; Watkins, 2000), graphs (Gärtner, 2003; Kashima et al., 2004), and trees (Collins and Duffy, 2002).
Kernels as regularization operators
The RKHS representation of the feature space permits a useful interpretation of SVMs and other kernel methods from a functional analysis viewpoint. In regularization theory, instead of minimizing an upper bound composed of the empirical risk and a capacity term (Equation 3.6), one minimizes a regularized risk (Poggio and Girosi, 1990; Tikhonov and Arsenin, 1977),
$$R_{\mathrm{reg}}(f) = R_{\mathrm{emp}}(f; Z) + \frac{\lambda}{2} \|f\|^2_{\mathcal{H}} \tag{3.61}$$
over the entire space $\mathcal{H}$, where $R_{\mathrm{emp}}(f; Z)$ is small when $f$ fits the data well and $\lambda > 0$ weights the penalty term. The norm $\|f\|_{\mathcal{H}}$ favours a smooth solution, which guards against overfitting (Schölkopf and Smola, 2002; Vert et al., 2004).
The representer theorem, originally due to Kimeldorf and Wahba (1971) and generalized in Schölkopf et al. (2000a), states that the solutions of certain risk minimization problems involving regularized risks of the form in Equation (3.61) can be expanded in terms of the training sample mapped into feature space, even though the optimization is carried out over a potentially infinite-dimensional space:
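Theorem 3.4 (Representer Theorem (Kimeldorf and Wahba, 1971)). Let $\Omega : [0, \infty) \to \mathbb{R}$ be a strictly monotonic increasing function, $x_1, \dots, x_m \in \mathcal{X}$, and $L : (\mathcal{X} \times \mathbb{R}^2)^m \to \mathbb{R} \cup \{\infty\}$ an arbitrary loss function. Then each minimizing function $f \in \mathcal{H}$ of the regularized risk
$$L\big((x_1, y_1, f(x_1)), \dots, (x_m, y_m, f(x_m))\big) + \Omega(\|f\|_{\mathcal{H}}) \tag{3.62}$$
admits a representation of the form
$$f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x), \quad \text{for all } x \in \mathcal{X}. \tag{3.63}$$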
The computational advantage arising from the representer theorem is that the optimal solution to Equation (3.62) can be obtained by reformulating the problem as an $m$-dimensional optimization problem: one substitutes Equation (3.63) and expresses the solution in terms of the coefficients $\alpha_1, \dots, \alpha_m$ (Vert et al., 2004).
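The substitution can be written out explicitly. The derivation below is a sketch: the first two lines follow from Equations (3.63) and (3.57), while the squared-error loss in the last two lines is an assumption introduced here only to obtain a closed form. $K$ denotes the $m \times m$ kernel matrix, $\alpha = (\alpha_1, \dots, \alpha_m)^\top$, and $y = (y_1, \dots, y_m)^\top$.

```latex
% Values of f on the training sample and its norm, using (3.63) and (3.57):
f(x_j) = \sum_{i=1}^{m} \alpha_i k(x_i, x_j) = (K\alpha)_j,
\qquad
\|f\|_{\mathcal{H}}^2 = \sum_{i,j=1}^{m} \alpha_i \alpha_j k(x_i, x_j) = \alpha^\top K \alpha .

% The regularized risk (3.62) becomes a function of \alpha \in \mathbb{R}^m only:
\min_{\alpha \in \mathbb{R}^m} \;
L\big((x_1, y_1, (K\alpha)_1), \dots, (x_m, y_m, (K\alpha)_m)\big)
+ \Omega\big(\sqrt{\alpha^\top K \alpha}\big).

% Assumed example: squared-error empirical risk with the penalty of (3.61):
\min_{\alpha} \; \frac{1}{m} \|y - K\alpha\|^2 + \frac{\lambda}{2}\, \alpha^\top K \alpha .

% Setting the gradient to zero gives K\big[(K + \tfrac{\lambda m}{2} I)\alpha - y\big] = 0,
% which is satisfied by the solution of the m x m linear system
\Big(K + \frac{\lambda m}{2} I\Big) \alpha = y .
```

Under that assumed loss, the search over a possibly infinite-dimensional function space reduces to solving one linear system in $m$ unknowns.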
Nonlinear Support Vector Machines
In the preceding sections, the key ideas underlying support vector machines have been outlined, namely (i) a learning bias (large margin) from statistical learning theory and (ii) a method for implicitly evaluating dot products in feature spaces. In this section it is shown how to extend the flexibility of linear SVMs using the kernel trick for both separable and non-separable pattern recognition problems, that is, Equations (3.32) and (3.39). This is simple to achieve since the data appear only as dot products in both algorithms. Hence, each occurrence of an inner product is replaced with a kernel function, for example the Gaussian kernel (Table 3.1). This corresponds to pre-processing the data using a nonlinear mapping $\Phi : x \mapsto \Phi(x)$ and subsequently learning a linear function in the feature space $\mathcal{H}$. The choice of the kernel induces nonlinearity in input space.
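Replacing all occurrences of the dot product with a kernel function between two data points, the equivalent nonlinear formulation of the linearly separable (hard margin) dual optimization problem in Equation (3.32) is
$$\max_{\alpha} L_D := \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \tag{3.64}$$
subject to
$$\sum_{i=1}^{m} \alpha_i y_i = 0, \qquad \alpha_i \ge 0, \quad i = 1, \dots, m. \tag{3.65}$$
Similarly, the nonlinear soft margin equivalent of Equation (3.39) is
$$\max_{\alpha} L_D := \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \tag{3.66}$$
subject to
$$0 \le \alpha_i \le C, \quad i = 1, \dots, m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0. \tag{3.67}$$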
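To make the dual concrete, the sketch below evaluates the objective and checks the constraints for the four-point XOR problem with the inhomogeneous polynomial kernel $(1 + x \cdot y)^2$. The example problem, the helper names `dual_objective` and `is_feasible`, and the multipliers $\alpha_i = 1/8$ are assumptions brought in for illustration. For this small problem those multipliers make the gradient of the objective vanish while satisfying the constraints, so they maximize the hard margin dual. No optimizer is implemented here.

```python
def dot(x, y):
    """Standard dot product in input space."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y):
    """Inhomogeneous polynomial kernel of degree 2 with unit offset."""
    return (1.0 + dot(x, y)) ** 2


def dual_objective(alpha, X, y, kernel):
    """sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j)."""
    m = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
        for i in range(m)
        for j in range(m)
    )
    return sum(alpha) - 0.5 * quad


def is_feasible(alpha, y, C=None, tol=1e-12):
    """Check sum_i alpha_i y_i = 0 and 0 <= alpha_i (<= C for the soft margin)."""
    if abs(sum(a * yi for a, yi in zip(alpha, y))) > tol:
        return False
    upper = float("inf") if C is None else C
    return all(-tol <= a <= upper + tol for a in alpha)


# XOR-type data: points with equal coordinates are labelled -1.
X = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]
alpha = [0.125, 0.125, 0.125, 0.125]

print(is_feasible(alpha, y))                     # True
print(dual_objective(alpha, X, y, poly_kernel))  # 0.25
```

Passing a finite `C` to `is_feasible` checks the box constraint of the soft margin problem instead.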
In both cases the decision function for a test point $x$ is obtained as
$$f(x) = \operatorname{sgn}\left(\sum_{i \in \mathrm{SVs}} \alpha_i y_i k(x_i, x) + b\right) \tag{3.68}$$
where SVs is the set of support vectors, that is, training points with $\alpha_i > 0$. The bias parameter $b$ is obtained by exploiting the equivalent KKT conditions (Equation 3.34) or from the solution of an interior point optimization algorithm. Appendix B.1 contains sample MATLAB code that implements a basic support vector algorithm for solving a binary pattern recognition problem.
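A kernel decision function of this kind can be sketched in a few lines. The example reuses the four-point XOR problem with the kernel $(1 + x \cdot y)^2$ and multipliers $\alpha_i = 1/8$, all of which are assumptions for illustration, as are the function names. The expansion includes the labels $y_i$, as in the dual, and the bias is recovered from the hard margin KKT condition $y_s f(x_s) = 1$ at a support vector. The sign convention at exactly zero (returning $+1$) is also a choice made here.

```python
def dot(x, y):
    """Standard dot product in input space."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y):
    """Inhomogeneous polynomial kernel of degree 2 with unit offset."""
    return (1.0 + dot(x, y)) ** 2


def kernel_expansion(x, svs, labels, alphas, kernel):
    """sum_i alpha_i y_i k(x_i, x) over the support vectors."""
    return sum(a * yi * kernel(sv, x) for a, yi, sv in zip(alphas, labels, svs))


def bias_from_kkt(svs, labels, alphas, kernel, s=0):
    """Hard margin KKT: y_s * (expansion(x_s) + b) = 1, so b = y_s - expansion(x_s)."""
    return labels[s] - kernel_expansion(svs[s], svs, labels, alphas, kernel)


def decide(x, svs, labels, alphas, b, kernel):
    """Sign of the kernel expansion plus bias; zero is mapped to +1 by convention."""
    return 1 if kernel_expansion(x, svs, labels, alphas, kernel) + b >= 0 else -1


svs = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
labels = [-1, 1, 1, -1]
alphas = [0.125] * 4          # every training point is a support vector here

b = bias_from_kkt(svs, labels, alphas, poly_kernel)
print(b)                                                          # 0.0
print([decide(x, svs, labels, alphas, b, poly_kernel) for x in svs])
print(decide((0.5, 0.5), svs, labels, alphas, b, poly_kernel))    # -1
```

For this example the expansion simplifies to $-x_1 x_2$, so the decision boundary in input space is nonlinear although the function is linear in the feature space.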
Use of kernels circumvents the need to know the mapping $\Phi$ or the feature space $\mathcal{H}$ explicitly, beyond the fact that $\mathcal{H}$ is a vector space. By using geometrical concepts of angles and distances, the kernel representation reduces otherwise complex nonlinear algorithms in $\mathcal{X}$ to simple linear formulations in $\mathcal{H}$. This insight is summed up as the kernel trick: any algorithm in which the data appear only as dot products can be implicitly performed in $\mathcal{H}$ by using kernel functions, which permits the design of nonlinear versions of linear algorithms using rich function classes in input space (Müller et al., 2001; Schölkopf and Smola, 2002). Figure 3.6 illustrates some of the learning properties of SVMs using different models and hyperparameter specifications in the case of a Gaussian kernel.