
In Chapter 6, we showed that networks which can approximate monomials are unlikely to be efficiently agnostically learnable. This means that fairly severe conditions have to be imposed on function classes in order for them to be efficiently agnostically learnable. In this chapter, we study how smoothness of the function class affects efficient agnostic learning.

Barron (1993) has shown that the class of functions with bounded first absolute moment of the Fourier transform can be learned with sample complexity O ^ j I n J + l o g i ) ) . However, it is not known whether this function class is efficiently learnable (computationally). In this chapter, we put more restrictions on the function class by demanding that the q-th absolute moment of the Fourier transform of the functions (where q depends linearly on the input dimension n) and the $L_1$ norm of the functions in the class be uniformly bounded. We show that such a function class is efficiently agnostically learnable.

Previous work in nonparametric statistics on learning functions in high input dimensions usually does not concentrate on the computational complexity but rather on the sample size required (stated in terms of the rate of convergence of the risk of the estimator as a function of the sample size). Kernel methods are computationally efficient in high dimensions and give good rates of convergence for certain classes, such as the class of s-times differentiable functions when the s-th derivative is Hölder continuous and s is proportional to n (see (Härdle 1990)). However, unlike our framework, which requires bounds to hold for arbitrary input distributions, the bounds for kernel methods depend on the input distribution. For functions where all partial derivatives of order s are square-integrable, the asymptotic minimax rate of convergence of the mean integrated squared error is $O(m^{-2s/(2s+n)})$ (Ibragimov & Hasminskii 1980, Pinsker 1980, Stone 1982, Nussbaum 1986), where m is the sample size and n is the input dimension, and this rate can be achieved by using a linear combination of fixed basis functions. With s of order n, learning can be done with a reasonable sample size. However, to achieve this rate, an exponential number (with respect to the input dimension) of basis functions is used.

Basis Functions Agnostic learning Learning with noise

Sinusoidal (In ( 0 + I n J ) ) 0 ( l n ( 0 + l n | ) )

Linear threshold a i n ( i ) + l n i ) )

Sigmoid 0 , „ ( ! ) + i n D )

Table 7.1: Number of basis functions used for efficiently learning the class of functions with bounded q-th absolute moment of the Fourier transform (q and n fixed).

To obtain our results for functions with bounded Fourier transform moments, we use a Monte Carlo method to evaluate the function via the inverse Fourier transform. For computational efficiency, we multiply the Fourier transform by an appropriately sized uniform window and evaluate the resulting inverse Fourier transform (an integral) by sampling uniformly over the appropriate subset of the parameter space. Because we are using the Fourier transform, our hypothesis class consists of linear combinations of sinusoidal basis functions. Similar results can be achieved using linear threshold basis functions and sigmoid basis functions. The sample complexity is bounded by O ( i [k In ( i ) + j ) ) where k, the number of basis functions used, is shown in Table 7.1.
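The Monte Carlo idea described above can be illustrated with a minimal one-dimensional sketch. It is not the chapter's learning algorithm: it assumes the Fourier transform is known in closed form (a Gaussian, under the convention $F(u) = \int f(x) e^{-i 2\pi u x}\,dx$), and the names `fourier_transform`, `monte_carlo_inverse`, `window` and `num_samples` are hypothetical. Frequencies are drawn uniformly from the window $[-W, W]$, and the inverse transform is estimated by an average of sinusoids at those frequencies.

```python
import math
import random


def fourier_transform(u):
    """Assumed example: transform of f(x) = exp(-pi x^2), which is exp(-pi u^2)."""
    return math.exp(-math.pi * u * u)


def monte_carlo_inverse(x, window=3.0, num_samples=20000, seed=0):
    """Estimate f(x) by sampling frequencies uniformly from [-window, window]."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(num_samples):
        u = rng.uniform(-window, window)
        # F is real and even in this example, so only the cosine term survives.
        total += fourier_transform(u) * math.cos(2.0 * math.pi * u * x)
    # The window has length 2 * window; the sample mean times that length
    # approximates the windowed integral.
    return 2.0 * window * total / num_samples


estimate = monte_carlo_inverse(0.5)
exact = math.exp(-math.pi * 0.5 ** 2)
print(f"estimate={estimate:.4f} exact={exact:.4f}")
```

The estimate is a linear combination of `num_samples` cosines with random frequencies, which is the sense in which the hypothesis class consists of sinusoidal basis functions. The error has two parts: truncation to the window and sampling noise that shrinks as the number of samples grows.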

The results are obtained in the agnostic learning framework. However, as shown in Table 7.1, we are able to obtain a better bound for the case of learning with noise, where the target conditional expectation satisfies the assumptions we are using.

With a uniform bound on the q-th absolute moment of the Fourier series and a uniform bound on the $L_1$ norms of the functions (with q growing linearly with n), we are also able to show that, for a desired accuracy of approximation, the size of a fixed set of basis functions which will provide the required approximation to all the functions in the class grows only polynomially (instead of exponentially) with the input dimension. The fact that the set of basis functions is fixed means that, for multi-output networks, the number of hidden units does not need to grow as the number of outputs grows. This is interesting because in most neural network applications, all the different outputs of the network share the same hidden units.

In Section 7.1, we describe the class of functions with bounded q-th absolute moment of the Fourier transform. We state the results and describe the algorithm used in Section 7.2. We discuss the results on learning smooth functions in Section 7.3.

In Section 7.4, we show the existence of small (polynomial size) sets of fixed basis functions that can be used to uniformly approximate all the functions with uniform bounds on the q-th absolute moment of the Fourier series and the $L_1$ norm (q growing linearly with n).

7.1 Functions with Bounded q-th Absolute Moment of the Fourier Transform

We will restrict the domain to $[-\pi, \pi]^n$. Any bounded subset of $\mathbb{R}^n$ can be rescaled to be within this domain. For $T, M, C \in \mathbb{R}^+$, let $\mathcal{F}_q$ be the class of functions satisfying the following conditions:

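1. $|f(x)| \le T$ for all $x \in \mathbb{R}^n$

2. $\int_{\mathbb{R}^n} |f(x)|\,dx \le M$

3. $\int_{\mathbb{R}^n} |2\pi u_j|^q |F(u)|\,du \le C$, where $F(u) = \int_{\mathbb{R}^n} f(x) e^{-i 2\pi u \cdot x}\,dx$ is the Fourier transform of $f$ and $u_j$ is the $j$-th component of $u$.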
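As a concrete illustration of the three conditions, the following sketch evaluates the corresponding quantities numerically for one assumed example in one dimension: $f(x) = e^{-\pi x^2}$, whose Fourier transform under the convention $e^{-i 2\pi u x}$ is $F(u) = e^{-\pi u^2}$. The choice of function, the value q = 2, the truncation to $[-6, 6]$ and the helper `midpoint_integral` are all assumptions made for illustration.

```python
import math


def f(x):
    """Assumed example function: a Gaussian with unit L1 norm."""
    return math.exp(-math.pi * x * x)


def F(u):
    """Fourier transform of f under the e^{-i 2 pi u x} convention."""
    return math.exp(-math.pi * u * u)


def midpoint_integral(g, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (k + 0.5) * h) for k in range(steps))


q = 2  # order of the absolute moment (illustrative choice)

# Condition 1: sup of |f| on a grid that contains the maximiser x = 0.
T = max(abs(f(k / 100.0)) for k in range(-600, 601))
# Condition 2: L1 norm of f (the tails beyond |x| = 6 are negligible).
M = midpoint_integral(lambda x: abs(f(x)), -6.0, 6.0)
# Condition 3: q-th absolute moment of the Fourier transform.
C = midpoint_integral(lambda u: abs(2.0 * math.pi * u) ** q * abs(F(u)), -6.0, 6.0)

print(f"T={T:.4f} M={M:.4f} C={C:.4f}")
```

For this example the exact values are T = 1, M = 1 and C = 2π, so the Gaussian belongs to the class whenever the three bounds are at least these values.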

Functions on a bounded domain can be represented as a Fourier series by having a periodic extension outside the domain. However, for a condition similar to (3) on the Fourier series to be satisfied, the functions and their derivatives have to be continuous on the boundary of the domain. By having a Fourier transform representation, the functions do not have to satisfy the boundary conditions.

Using techniques from (Barron 1993), it is possible to show that if all partial derivatives of order less than or equal to $s = \lfloor n/2 \rfloor + q + 1$ of a function $f$ are square-integrable, then it satisfies the moment condition for $\mathcal{F}_q$. Write $|u_j|^q |F(u)| = a(u) b(u)$ with $a(u) = (1 + |u|^{2t})^{-1/2}$ and $b(u) = |u_j|^q |F(u)| (1 + |u|^{2t})^{1/2}$. By the Cauchy-Schwarz inequality, $\int a(u) b(u)\,du \le (\int a^2(u)\,du)^{1/2} (\int b^2(u)\,du)^{1/2}$. The integral $\int a^2(u)\,du = \int (1 + |u|^{2t})^{-1}\,du$ is finite for $2t > n$. By Parseval's theorem, the integral $\int b^2(u)\,du = \int |u_j|^{2q} |F(u)|^2 (1 + |u|^{2t})\,du$ is finite when the partial derivatives of $f$ of order $t + q$ and of order $q$ are square-integrable on $\mathbb{R}^n$. Taking $t = \lfloor n/2 \rfloor + 1$ satisfies $2t > n$ and gives $s = t + q$. This relates the class $\mathcal{F}_q$ to more traditional smoothness classes considered in nonparametric statistics.
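The Cauchy-Schwarz step can be checked numerically in a small case. The sketch below assumes n = 1, q = 1 and t = 1 (so that 2t > n), and uses the Gaussian transform $F(u) = e^{-\pi u^2}$ as an illustrative example; the truncation to $[-50, 50]$ and the helper `midpoint_integral` are assumptions of the sketch, not part of the argument.

```python
import math


def midpoint_integral(g, lo, hi, steps=100000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (k + 0.5) * h) for k in range(steps))


q, t = 1, 1  # n = 1, so the requirement 2t > n holds


def F(u):
    """Assumed example transform: exp(-pi u^2)."""
    return math.exp(-math.pi * u * u)


def a(u):
    return (1.0 + abs(u) ** (2 * t)) ** -0.5


def b(u):
    return abs(u) ** q * abs(F(u)) * (1.0 + abs(u) ** (2 * t)) ** 0.5


L = 50.0  # truncation of the integration range
lhs = midpoint_integral(lambda u: a(u) * b(u), -L, L)   # integral of |u|^q |F(u)|
int_a2 = midpoint_integral(lambda u: a(u) ** 2, -L, L)  # finite because 2t > n
int_b2 = midpoint_integral(lambda u: b(u) ** 2, -L, L)
rhs = math.sqrt(int_a2 * int_b2)

print(f"lhs={lhs:.4f} rhs={rhs:.4f}")
```

In this example the left-hand side equals 1/π, and the product of the two square-root factors bounds it from above, as the inequality requires.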

7.2 Results and Algorithms

The results are stated in the following theorems.