
The learning problem as stated above cannot be solved directly because $P(\mathcal{X}, \mathcal{Y})$ is unknown. Hence, only an approximate solution can be obtained, on the basis of the available data and the properties of the hypothesis space $\mathcal{H}$. In particular, one wants to find a function $f$ in $\mathcal{H}$ whose expected loss converges to the minimal actual risk (Equation 3.2) over all $f \in \mathcal{H}$ in the limit of infinite data ($m \to \infty$). An induction principle is required to choose such an $f$. The empirical risk minimization (ERM) principle is typically used to select an optimal $f$ (in the sense described above) that minimizes the empirical error

$$R_{\mathrm{emp}}(f) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i). \tag{3.3}$$
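As a minimal sketch of Equation 3.3, the snippet below computes the empirical error under a 0/1 loss and picks the minimizer from a small finite hypothesis class. The toy sample, the threshold classifiers, and all function names are illustrative assumptions, not part of the original text.

```python
from typing import Callable, Dict, List, Tuple

Sample = List[Tuple[float, int]]


def zero_one_loss(prediction: int, label: int) -> float:
    """0/1 loss: 0 for a correct prediction, 1 otherwise."""
    return 0.0 if prediction == label else 1.0


def empirical_risk(f: Callable[[float], int], sample: Sample) -> float:
    """Average loss of f over the sample (Equation 3.3)."""
    return sum(zero_one_loss(f(x), y) for x, y in sample) / len(sample)


def make_threshold(t: float) -> Callable[[float], int]:
    """Hypothetical classifier: predict +1 if x > t, else -1."""
    return lambda x: 1 if x > t else -1


# Toy training sample (assumed for illustration only).
sample: Sample = [(0.1, -1), (0.3, -1), (0.4, -1), (0.6, 1), (0.8, 1), (0.9, 1)]

# A small finite hypothesis space, indexed by threshold.
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
risks: Dict[float, float] = {
    t: empirical_risk(make_threshold(t), sample) for t in thresholds
}

# ERM: choose the hypothesis with the smallest empirical error.
best_t = min(risks, key=risks.get)
print(best_t, risks[best_t])
```

With a finite class the minimization is a plain search; for richer classes the same objective would be minimized by an optimization routine instead.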

The ERM principle has been the main focal point of much research since the introduction of Rosenblatt's perceptron. Specifically, it was then argued that learning corresponded to choosing a network structure (coefficients or weights) with the best performance on a training set, and that achieving an optimal error on the training set automatically guaranteed similar performance on test data, that is, generalization(1) (Vapnik, 2000). Unfortunately, a function with minimal empirical error (Equation 3.3) does not necessarily give the minimal actual risk (Equation 3.2) because of the overfitting phenomenon (referred to as the bias-variance trade-off or capacity control in some contexts). Briefly, given a large class of functions $\mathcal{H}$ that contains all possible mappings $f : \mathcal{X} \to \mathcal{Y}$, the function chosen according to ERM will attain zero training error. However, the test error may not converge to the training error if the function takes arbitrary values on test data. This is the key (though trivial) insight captured in the so-called No Free Lunch Theorems(2) (Wolpert and Macready, 1997). Without restricting the capacity of the hypothesis space, it is impossible to estimate the true underlying function using empirical data, as illustrated in Figure 3.1.
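The overfitting argument can be made concrete with a small sketch. It assumes a hypothetical target rule (label +1 when x > 0.5) and a "memorizer" that stores the training pairs and returns an arbitrary fixed value (-1 here) on unseen inputs. Both the target and the data grids are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Sample = List[Tuple[float, int]]


def true_label(x: float) -> int:
    """Assumed underlying target, used only to generate toy data."""
    return 1 if x > 0.5 else -1


def error_rate(f: Callable[[float], int], sample: Sample) -> float:
    """Fraction of sample points that f misclassifies (0/1 loss)."""
    return sum(1 for x, y in sample if f(x) != y) / len(sample)


def make_memorizer(sample: Sample, default: int = -1) -> Callable[[float], int]:
    """Unrestricted-capacity learner: a lookup table over the training set.

    It reproduces every training label exactly and returns an arbitrary
    default value on any input it has not seen.
    """
    table: Dict[float, int] = {x: y for x, y in sample}
    return lambda x: table.get(x, default)


# Deterministic toy grids: training and test inputs never coincide.
train: Sample = [(i / 10, true_label(i / 10)) for i in range(10)]
test: Sample = [((i + 0.5) / 10, true_label((i + 0.5) / 10)) for i in range(10)]

memorizer = make_memorizer(train)
train_error = error_rate(memorizer, train)
test_error = error_rate(memorizer, test)

# A capacity-restricted alternative: a single threshold at 0.5.
threshold_test_error = error_rate(lambda x: 1 if x > 0.5 else -1, test)

print(train_error, test_error, threshold_test_error)
```

The memorizer attains zero training error yet its test error is unrelated to it, whereas the restricted threshold class generalizes on this toy problem.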

Statistical learning theory, which was developed mainly by Vapnik and Chervonenkis (see, e.g., Vapnik (1998, 2000)), provides a complete characterization of the necessary and sufficient conditions for the generalization and consistency of the ERM principle. Additionally, it provides a means of understanding and controlling the rate of convergence of $R_{\mathrm{emp}}(f)$ to the actual risk $R(f)$. Using such a framework, it is possible to construct learning algorithms with improved generalization performance that optimize these quantities using finite data.

(1) More formally, the i.i.d. assumption implies that the performance on the training and testing data sets can be related by probability theory. In particular, confidence intervals for the risk functional $R(f)$ can be obtained on the basis of the corresponding $R_{\mathrm{emp}}(f)$ for each $f \in \mathcal{H}$.

(2) In simple terms, given any two functions $f^{(a)}$ and $f^{(b)}$, there are as many targets (or priors over targets) for which $f^{(a)}$ has lower expected training error than $f^{(b)}$ as vice versa, for loss functions like the 0/1 loss.

Figure 3.1: Empirical risk minimization and the overfitting phenomenon. Given the sample set consisting of two classes, a learning machine with limited capacity learns a simple function, as shown by the solid decision boundary in (a), whereas a very flexible machine achieves zero training error (c) but fails to generalize to the true underlying function indicated by the dashed line. The optimal machine (b) trades off capacity against minimizing the training error.

An important consideration in learning is consistency, or how well a learned model approximates the true underlying function as more training data becomes available. The following key theorem of VC theory provides sufficient and necessary conditions for convergence of the ERM principle; any learning algorithm that is based on the ERM principle must satisfy it.

Theorem 3.1 (Asymptotic Consistency (Vapnik and Chervonenkis, 1991)). One-sided uniform convergence in probability,

$$\lim_{m \to \infty} P\left\{ \sup_{f \in \mathcal{H}} \big( R(f) - R_{\mathrm{emp}}(f) \big) > \epsilon \right\} = 0, \tag{3.4}$$

for all $\epsilon > 0$, is a necessary and sufficient condition for (nontrivial(3)) consistency of empirical risk minimization. The set of functions is assumed to have a bounded loss for some probability measure $\mu$, that is,

$$A \le \int f \, d\mu \le B \quad \text{for all } f \in \mathcal{H}. \tag{3.5}$$
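The one-sided uniform deviation in Theorem 3.1 can be estimated numerically on a toy problem where the actual risk is known in closed form. The sketch below assumes inputs uniform on [0, 1], a noiseless target (+1 when x > 0.5), and a hypothesis class of threshold classifiers evaluated on a finite grid, so the supremum is only approximated. All of these choices are illustrative assumptions.

```python
import random
from typing import List


def one_sided_sup_gap(m: int, rng: random.Random, grid_size: int = 101) -> float:
    """Approximate sup over thresholds t of R(f_t) - R_emp(f_t) for one sample.

    f_t predicts +1 if x > t, else -1. Under the assumed uniform inputs and
    target threshold 0.5, the actual 0/1 risk of f_t is |t - 0.5|.
    """
    xs: List[float] = [rng.random() for _ in range(m)]
    ys: List[int] = [1 if x > 0.5 else -1 for x in xs]

    worst = float("-inf")
    for k in range(grid_size):
        t = k / (grid_size - 1)
        actual_risk = abs(t - 0.5)
        mistakes = sum(1 for x, y in zip(xs, ys) if (1 if x > t else -1) != y)
        empirical_risk = mistakes / m
        worst = max(worst, actual_risk - empirical_risk)
    return worst


rng = random.Random(0)  # fixed seed so the run is repeatable
gaps = {m: one_sided_sup_gap(m, rng) for m in (50, 500, 5000)}
for m, gap in gaps.items():
    print(m, round(gap, 4))
```

For this low-capacity class the worst-case gap is expected to shrink as the sample grows, which is the behaviour the theorem ties to consistency. A single run is only an illustration, not a proof of convergence.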

Theorem 3.1 asserts that the worst case over all functions that the learning machine can implement determines the consistency of ERM (Vapnik, 1999). Intuitively, the key learning theorem requires one to choose $f$ from a set of functions that satisfies the necessary and sufficient conditions. To this end, a notion of the dimensionality or capacity of $\mathcal{H}$ that captures the complexity of the functions in it is required. A simple measure of the complexity of a hypothesis space, originally proposed in VC theory, is the Vapnik-Chervonenkis dimension ($h_d$). The complexity metric $h_d$ is the largest number of (training) points that can be separated, or shattered, for all possible labellings using functions of the class. As an illustration of the concept, consider a binary classification problem in $\mathbb{R}^2$. Taking the set of linear separating hyperplanes as the hypothesis space, a maximum of three instances can be separated without error for all arbitrary labellings (Figure 3.2). However, a set of four points cannot be shattered by the same class. Hence, the shattering dimension $h_d$ of the class of linear hyperplanes in $\mathbb{R}^2$ is three. In general, a maximum of $d + 1$ points can be shattered by the class of hyperplanes in $\mathbb{R}^d$.

(3) This requires removing atypical functions from the hypothesis space; otherwise, if there is a function that has the smallest error over all $f \in \mathcal{H}$ for all sample sizes $m$, the learning algorithm will always choose that function.

Figure 3.2: An illustration of the shattering dimension in $\mathbb{R}^2$ for the class of linear separating hyperplanes. Here, filled circles indicate negative instances and open circles indicate positive instances.
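The shattering argument can be checked by brute force on small point sets. The sketch below enumerates every labelling and tests linear separability with a perceptron that has a capped number of passes. Treating "no convergence within the cap" as "not separable" is a heuristic assumption that is adequate for these tiny integer-valued examples but is not a general proof. The two point sets are also illustrative choices; the code checks one particular four-point configuration only.

```python
from itertools import product
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def perceptron_separates(
    points: Sequence[Point], labels: Sequence[int], max_epochs: int = 1000
) -> bool:
    """Return True if a perceptron (with bias) fits the labelling exactly.

    Non-convergence within max_epochs is treated as non-separable, which is
    a heuristic assumption rather than a certificate.
    """
    w = [0.0, 0.0, 0.0]  # weights for x1, x2 and the bias term
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            activation = w[0] * x1 + w[1] * x2 + w[2]
            if y * activation <= 0:  # misclassified or on the boundary
                w[0] += y * x1
                w[1] += y * x2
                w[2] += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def is_shattered(points: Sequence[Point], max_epochs: int = 1000) -> bool:
    """Check whether every +1/-1 labelling of the points is separable."""
    return all(
        perceptron_separates(points, labels, max_epochs)
        for labels in product((-1, 1), repeat=len(points))
    )


three_points: List[Point] = [(0, 0), (1, 0), (0, 1)]
four_points: List[Point] = [(0, 0), (1, 0), (0, 1), (1, 1)]

print(is_shattered(three_points))
print(is_shattered(four_points))
```

The four-point set fails on the XOR-style labelling, where opposite corners of the square share a label.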

Generalization bounds using capacity metrics such as $h_d$ can be derived to characterize the performance of a learning algorithm as

$$R(f) \le R_{\mathrm{emp}}(f; T) + G(\mathcal{H}, m, \eta), \quad \eta > 0, \tag{3.6}$$

where $G$ is a confidence function, $\eta$ a probability, $T$ the training sample, and $m$ the sample size. The generalization bound is the sum of the empirical error and a confidence term that depends on the hypothesis space from which $f$ is chosen and on the size of the training set. Ideally, to achieve some guarantee (up to a probability specified by $\eta$) that the actual risk or generalization error is small, an induction principle is needed that minimizes both terms. In the case of pattern recognition, an example of such a bound is as follows (Vapnik, 2000). Given some $0 \le \eta \le 1$ and a 0/1 loss function, the following bound on the risk functional

$$R(f) \le R_{\mathrm{emp}}(f) + \sqrt{\frac{h_d \left( \log(2m/h_d) + 1 \right) - \log(\eta/4)}{m}} \tag{3.7}$$

holds with probability $1 - \eta$ for $m > h_d$ over a random draw of the sample $T$.
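To see how the confidence term in Equation 3.7 behaves, the following sketch evaluates it directly. The function names and the example values of the VC dimension, sample size, and confidence parameter are assumptions for illustration. The code requires a strictly positive confidence parameter, since the logarithm is undefined at zero.

```python
import math


def vc_confidence(h_d: int, m: int, eta: float) -> float:
    """Square-root confidence term of Equation 3.7 (natural logarithms)."""
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    if m <= h_d:
        raise ValueError("the bound is stated for m > h_d")
    numerator = h_d * (math.log(2 * m / h_d) + 1) - math.log(eta / 4)
    return math.sqrt(numerator / m)


def vc_risk_bound(r_emp: float, h_d: int, m: int, eta: float) -> float:
    """Right-hand side of Equation 3.7: empirical error plus confidence term."""
    return r_emp + vc_confidence(h_d, m, eta)


# Example values (assumed): lines in the plane, 1000 samples, 95% confidence.
term = vc_confidence(h_d=3, m=1000, eta=0.05)
print(round(term, 3))

# The term shrinks as m grows and grows with the capacity h_d.
print(vc_confidence(3, 10_000, 0.05) < term < vc_confidence(30, 1000, 0.05))
```

Note that the bound can exceed 1 for small samples or large capacity, in which case it carries no information for a 0/1 loss.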

It must be noted that Equation (3.7) is independent of $P(\mathcal{X}, \mathcal{Y})$ (all the information available concerning the generating distribution is the i.i.d. training data). Although the term $R(f)$ may not be computable, the right-hand side can be evaluated if $h_d$ is specified. Therefore, given a hypothesis space $\mathcal{H}$, selecting an $f$ that minimizes the right-hand side gives an $f$ with minimal upper bound on the expected loss, with probability $1 - \eta$. This motivates the induction principle of structural risk minimization (SRM) (Burges, 2004; Vapnik, 1979).
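A rough sketch of the selection step that this motivates is given below. It assumes a nested sequence of hypothesis classes, each summarized by its VC dimension and the best training error found within it. The numbers are hypothetical, invented purely to illustrate the trade-off, and are not results from the original text.

```python
import math
from typing import Dict, List, Tuple


def vc_risk_bound(r_emp: float, h_d: int, m: int, eta: float) -> float:
    """Right-hand side of Equation 3.7 for a 0/1 loss (requires m > h_d)."""
    confidence = math.sqrt(
        (h_d * (math.log(2 * m / h_d) + 1) - math.log(eta / 4)) / m
    )
    return r_emp + confidence


m, eta = 200, 0.05  # assumed sample size and confidence parameter

# Hypothetical (VC dimension, best training error) pairs for nested classes:
# richer classes fit the training set better but pay a larger capacity term.
candidates: List[Tuple[int, float]] = [(2, 0.30), (5, 0.15), (20, 0.10), (80, 0.02)]

bounds: Dict[int, float] = {
    h_d: vc_risk_bound(r_emp, h_d, m, eta) for h_d, r_emp in candidates
}

# Pick the class whose guaranteed risk (upper bound) is smallest.
selected_h = min(bounds, key=bounds.get)
for h_d, bound in bounds.items():
    print(h_d, round(bound, 3))
print("selected:", selected_h)
```

In this toy setting the class with the lowest training error is not the one selected, because its confidence term dominates the bound.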
