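learnable if $G$ is properly efficiently agnostically learnable. Furthermore, the sample complexity for properly efficiently learning $\mathcal{N}_G^K$ is at most
$$\frac{14000C^2}{\epsilon}\left(\frac{32C^2}{\epsilon}\ln\left(\max_{x \in X^{2m}} N\!\left(\frac{\epsilon}{1024CK}, G_{|x}, l_1\right) + 1\right) + \frac{32C^2}{\epsilon}\ln 2 + \ln\frac{12}{\delta}\right),$$
where $C = \max\{Kb, T, 1\}$.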
Proof. As in the proof of Theorem 5.2, we can scale the function class and target random variable
by dividing by C.
The $\epsilon$-covering number of the scaled function class is the same as the $C\epsilon$-covering number of the unscaled class. By calculating the bounds for accuracy $\epsilon/C^2$ and rescaling back, we obtain the desired bound. Assume the scaled function class is $F$. Let $\alpha = 1/2$ in Theorem 3.7 and use Lemma 6.4 with $\nu = \nu_c = \epsilon/(4C^2)$ to get
$$P^m\left\{z \in Z^m : \exists f \in F,\ E[(y - f(x))^2 - (y - f_a(x))^2] > 2E_z[(y - f(x))^2 - (y - f_a(x))^2] + \frac{\epsilon}{2C^2}\right\} \le 6 \times 2^k \max_{x \in X^{2m}} \left(N\!\left(\frac{\epsilon}{1024CK}, G_{|x}, l_1\right) + 1\right)^k \exp\left(-\frac{\epsilon m}{14000C^2}\right).$$
Let $f'(x) = E_z[y \mid X = x]$. Let $f_k$ be the estimated function and let $\hat{f}_a$ be the function in the convex closure which minimizes the empirical error. Then
$$E_z[(y - f_k(x))^2 - (y - f_a(x))^2] = E_z[(f'(x) - f_k(x))^2 - (f'(x) - f_a(x))^2].$$
Note that
$$E_z[(f'(x) - f_k(x))^2 - (f'(x) - f_a(x))^2] \le E_z[(f'(x) - f_k(x))^2 - (f'(x) - \hat{f}_a(x))^2].$$
In Theorem 6.1, set $c = $ [...]. To get approximation within $\epsilon/4$ (with respect to the empirical mean squared error), we require $k \ge 32C^2/\epsilon$. Setting the right hand side of (5.2) to be $\delta/2$ and $k = 32C^2/\epsilon$ and solving for $m$, we get a sample size bound of
$$\frac{14000C^2}{\epsilon}\left(\frac{32C^2}{\epsilon}\left(\ln\left(\max_{x \in X^{2m}} N\!\left(\frac{\epsilon}{1024CK}, G_{|x}, l_1\right) + 1\right) + \ln 2\right) + \ln\frac{12}{\delta}\right).$$
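As a rough numerical illustration, the sample size bound can be evaluated directly once a value for the covering number is available. The sketch below assumes the bound reads $m = (14000C^2/\epsilon)(k(\ln(N + 1) + \ln 2) + \ln(12/\delta))$ with $k = 32C^2/\epsilon$. The function name `sample_size_bound` and the argument `cover` (a caller-supplied upper bound on the covering number, which is not computed here) are hypothetical.

```python
import math


def sample_size_bound(C, eps, delta, cover):
    """Evaluate the assumed sample size bound.

    C     : scaling constant (assumed >= 1)
    eps   : accuracy parameter epsilon
    delta : confidence parameter
    cover : hypothetical upper bound on the covering number N(...)
    """
    k = 32.0 * C ** 2 / eps  # number of iterations / hidden units
    log_terms = k * (math.log(cover + 1.0) + math.log(2.0))
    return (14000.0 * C ** 2 / eps) * (log_terms + math.log(12.0 / delta))


if __name__ == "__main__":
    # Illustrative inputs only; they do not come from the text.
    print(sample_size_bound(C=1.0, eps=0.5, delta=0.1, cover=9.0))
```

Because the covering number generally grows with the sample size, a caller would typically pass a bound that holds uniformly over the sample sizes of interest.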
Having selected a sample of size $m$, we now need an algorithm to find $f_k$. We show that a proper efficient agnostic learning algorithm for $G$ can be used as an efficient randomized algorithm for optimizing the error on the sample using $\mathcal{N}_G^K$. Learning algorithms are often formed from optimization algorithms, and in such cases, the algorithms can be used directly to minimize the error on the sample. The idea is to use the learning algorithm to sample and learn from the empirical distribution so that at each stage $i$ of the iterative approximation, the error relative to the optimum is less than $\epsilon_i$ (from Theorem 6.1) with probability greater than $1 - \delta/(2k)$. This can be done in a way similar to the proof of Theorem 6.3, except that we can test the hypotheses directly using the same sample (and we do not have to compose the resulting function with $\mathrm{clip}_T$ at each iteration). Knowing the covering number of $G$ enables us to bound the size of the sample that must be drawn from the empirical distribution (Corollary 3.5). Note that since we are sampling from the empirical distribution, no new observations need to be drawn from the original distribution. Theorem 6.1 assures us that if we are successful at each iteration, we will be within the desired error on the empirical distribution, which gives us the desired error on the sample. □
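Sampling from the empirical distribution amounts to drawing indices uniformly with replacement from the fixed sample. A minimal sketch follows, assuming the sample is stored as a list of $(x, y)$ pairs; the name `resample_empirical` and the fixed seed are illustrative choices, not part of the original construction.

```python
import random


def resample_empirical(sample, n, seed=0):
    """Draw n pairs i.i.d. from the empirical distribution of `sample`.

    Each draw picks one of the stored (x, y) pairs uniformly at random,
    with replacement, so no new observations are needed.
    """
    rng = random.Random(seed)
    return [sample[rng.randrange(len(sample))] for _ in range(n)]


if __name__ == "__main__":
    sample = [(0.1, 1.0), (0.5, 0.0), (0.9, 1.0)]
    print(resample_empirical(sample, 5))
```

The resampled pairs can then be passed to the learning algorithm as if they were fresh observations.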
6.3 Relationship with Agnostic PAC learning
Let $G$ be a class of $\{0,1\}$-valued functions. Let the observed range be $\{0,1\}$. We call proper agnostic learning with discrete loss under these assumptions proper agnostic PAC learning. In this section, we show that if $G$ is properly efficiently agnostically PAC learnable, then $\mathcal{N}_G^K$ is properly efficiently agnostically learnable (with the squared loss function). Note that $\mathcal{N}_G^K$ has real-valued output with real-valued targets, while the algorithm for agnostic PAC learning only handles $\{0,1\}$-valued function classes with $\{0,1\}$-valued targets.
As shown by Jones (1992), the iterative approximation result holds even if the inner product of the basis function with $f_k - f$ (where $f$, the target function, is in the closure of the convex hull and $f_k$ is the current network) is minimized instead of the empirical quadratic error. This is also true for the proof given by Koiran (1994) for the case where the target function is not in the closure of the convex hull of the function class. We use this property and transform the problem of minimizing the inner product on a finite set of observations into the problem of agnostic PAC learning.
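One way to see why such a transformation is plausible: for a $\{0,1\}$-valued $g$ and residuals $r_i$ on the sample, the empirical inner product $\frac{1}{m}\sum_i r_i g(x_i)$ equals a weighted disagreement with the labels $y_i = 1[r_i < 0]$ (weights $|r_i|$) minus a term that does not depend on $g$. The sketch below checks this identity numerically. It is an illustration under these assumptions and is not claimed to be the construction used in the proof; the names `inner_product`, `weighted_error` and `offset` are hypothetical.

```python
from itertools import product


def inner_product(r, g):
    """Empirical inner product (1/m) * sum_i r_i * g_i, with g_i in {0, 1}."""
    return sum(ri * gi for ri, gi in zip(r, g)) / len(r)


def weighted_error(r, g):
    """Weighted disagreement of g with labels y_i = 1[r_i < 0], weights |r_i|."""
    labels = [1 if ri < 0 else 0 for ri in r]
    return sum(abs(ri) for ri, gi, yi in zip(r, g, labels) if gi != yi) / len(r)


def offset(r):
    """Part of the inner product that does not depend on g."""
    return sum(-ri for ri in r if ri < 0) / len(r)


if __name__ == "__main__":
    r = [0.3, -0.7, 0.0, -0.2]  # illustrative residuals f_k(x_i) - f(x_i)
    for g in product([0, 1], repeat=len(r)):
        lhs = inner_product(r, g)
        rhs = weighted_error(r, g) - offset(r)
        assert abs(lhs - rhs) < 1e-12
    print("identity holds for all 0/1 labelings")
```

Since the offset is constant in $g$, a $g$ that minimizes the weighted disagreement also minimizes the inner product.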
The following theorem follows from the proof of Theorem 1 given in (Koiran 1994) with minor changes. For completeness we include the proof here.
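Theorem 6.6 Let $G$ be a subset of a Hilbert space $H$ with $\|g\| \le b$ for each $g \in G$. Let $\mathrm{co}(G)$ be the convex hull of $G$. For any $f \in H$, let $d_f = \inf_{g' \in \mathrm{co}(G)} \|g' - f\|$. Let $f_0 = 0$, $c > 2b + d_f$, and iteratively for $k \ge 1$, suppose $f_k$ is chosen to be $f_k = (1 - 1/k) f_{k-1} + g'/k$, where $g' \in G$ is chosen to satisfy
$$\langle f_{k-1} - f, g' \rangle \le \inf_{g \in G} \langle f_{k-1} - f, g \rangle + \epsilon_k$$
and $\epsilon_k \le \frac{c^2 - (2b + d_f)^2}{k^2}$. Then
$$\|f - f_k\|^2 - d_f^2 \le \frac{c^2}{k} + \frac{2cd_f}{\sqrt{k}}.$$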
Proof. We will show that for any function $h$ in $H$, any $\alpha \in [0,1]$ and $\bar{\alpha} = 1 - \alpha$,
$$\|\alpha h + \bar{\alpha} g - f\|^2 \le \alpha^2\|h - f\|^2 + 2\alpha\bar{\alpha} d_f \|h - f\| + \bar{\alpha}^2 (2b + d_f)^2 + \epsilon_k \qquad (6.4)$$
where $g$ is chosen to satisfy $\langle h - f, g \rangle \le \inf_{g' \in G} \langle h - f, g' \rangle + \epsilon_k$. Setting $\alpha = 0$ shows that the result holds for $k = 1$. Assume the desired inequality holds for $f_{k-1}$. The result then follows by induction. From (6.4), with $\alpha = 1 - 1/k$ and $\bar{\alpha} = 1/k$, we get
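$$\|f_k - f\|^2 \le \frac{(k-1)^2}{k^2}\|f_{k-1} - f\|^2 + \frac{2(k-1)d_f}{k^2}\|f_{k-1} - f\| + \frac{(2b + d_f)^2}{k^2} + \frac{c^2 - (2b + d_f)^2}{k^2}.$$
By the induction hypothesis,
$$\|f_k - f\|^2 \le \frac{(k-1)^2}{k^2}\left(d_f^2 + \frac{2d_f c}{\sqrt{k-1}} + \frac{c^2}{k-1}\right) + \frac{2d_f(k-1)}{k^2}\left(d_f + \frac{c}{\sqrt{k-1}}\right) + \frac{c^2}{k^2} = \frac{(k^2 - 1)d_f^2}{k^2} + \frac{2cd_f\sqrt{k-1}}{k} + \frac{c^2}{k}.$$
It follows that
$$\|f_k - f\|^2 - d_f^2 \le \frac{c^2}{k} + \frac{2cd_f}{\sqrt{k}}$$
as required. We now verify (6.4). For any $g \in G$,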
$$\|\alpha h + \bar{\alpha} g - f\|^2 = \alpha^2\|h - f\|^2 + \bar{\alpha}^2\|g - f\|^2 + 2\alpha\bar{\alpha}\langle h - f, g - f\rangle.$$
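Given $\delta > 0$, let $f^* \in \mathrm{co}(G)$ be such that $\|f^* - f\| \le d_f + \delta$. For some sufficiently large $p$, $f^*$ is of the form $\sum_{i=1}^p \gamma_i g_i$ with $\gamma_i \ge 0$, $\sum_{i=1}^p \gamma_i = 1$ and $g_i \in G$. The average value of the inner product $\langle h - f, g - f \rangle$ for $g \in \{g_1, \ldots, g_p\}$, weighted by the $\gamma_i$, is
$$\sum_{i=1}^p \gamma_i \langle h - f, g_i - f \rangle = \langle h - f, f^* - f \rangle \le (d_f + \delta)\|h - f\|. \qquad (6.5)$$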
Furthermore, for any $g \in G$,
$$\|g - f\|^2 = \|g - f^* + f^* - f\|^2 \le (\|g\| + \|f^*\| + \|f^* - f\|)^2 \le (2b + d_f + \delta)^2.$$
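Hence, with the average of the inner product bounded as in (6.5), some $g_i$ attains at most the average, so if $g$ is chosen as described,
$$\|\alpha h + \bar{\alpha} g - f\|^2 \le \alpha^2\|h - f\|^2 + \bar{\alpha}^2(2b + d_f + \delta)^2 + 2\alpha\bar{\alpha}\|h - f\|(d_f + \delta) + 2\alpha\bar{\alpha}\epsilon_k.$$
Letting $\delta$ go to 0 and noting that $2\alpha\bar{\alpha} < 1$ completes the proof. □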
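The iteration in Theorem 6.6 can be tried numerically in a small finite-dimensional setting. The sketch below assumes $H = \mathbb{R}^n$ with the Euclidean inner product, a finite set $G$ given as a list of vectors, and exact minimization of the inner product (so $\epsilon_k = 0$). The name `greedy_approx` and the example inputs are hypothetical and serve only as an illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def greedy_approx(G, f, steps):
    """Run f_k = (1 - 1/k) f_{k-1} + g/k for k = 1..steps.

    At each step g is the element of the finite set G that minimises
    <f_{k-1} - f, g> exactly. Returns the final f_k and the list of
    distances ||f_k - f|| for k = 1..steps.
    """
    fk = [0.0] * len(f)  # f_0 = 0
    history = []
    for k in range(1, steps + 1):
        resid = [a - b for a, b in zip(fk, f)]
        g = min(G, key=lambda cand: dot(resid, cand))
        fk = [(1 - 1 / k) * a + b / k for a, b in zip(fk, g)]
        history.append(norm([a - b for a, b in zip(fk, f)]))
    return fk, history


if __name__ == "__main__":
    # Standard basis of R^3, so ||g|| <= b = 1; the target lies in co(G).
    G = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    f = [1 / 3, 1 / 3, 1 / 3]
    _, hist = greedy_approx(G, f, 10)
    for k, h in enumerate(hist, start=1):
        print(k, h * h)
```

In this example $d_f = 0$, so the squared distance should stay below $c^2/k$ for any $c > 2$.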
Theorem 6.7 Let $G$ be a class of admissible $\{0,1\}$-valued basis functions. Then $\mathcal{N}_G^K$ is properly