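Let $\Omega = \Omega_1 \times \dots \times \Omega_n$. For a vector $\alpha \in \mathbb{R}^n_+$, we define the distance $d_\alpha : \Omega^2 \to \mathbb{R}$ as $d_\alpha(x, y) = \sum_{i=1}^n \alpha_i \mathbf{1}[x_i \neq y_i]$. We define Talagrand’s convex distance between a set $A \subset \Omega$ and a point $x \in \Omega$ as

$$d_T(x, A) := \sup_{\alpha \in \mathbb{R}^n_+,\ \sum_i \alpha_i^2 \le 1}\ \inf_{y \in A} d_\alpha(x, y). \tag{2.3.2}$$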
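To make the definition concrete, the following minimal sketch evaluates the weighted Hamming distance and its infimum over a finite set. The function names are illustrative and not part of the text. It also uses the elementary fact that, for a single-point set, the supremum over unit-norm weights equals the square root of the number of differing coordinates (by the Cauchy-Schwarz inequality).

```python
import math

def d_alpha(alpha, x, y):
    """Weighted Hamming distance: sum of alpha_i over coordinates where x_i != y_i."""
    return sum(a for a, xi, yi in zip(alpha, x, y) if xi != yi)

def dist_to_set(alpha, x, A):
    """Infimum over y in A of d_alpha(x, y), for a finite, non-empty set A."""
    return min(d_alpha(alpha, x, y) for y in A)

def convex_distance_to_point(x, y):
    """d_T(x, {y}): the sup over unit-norm alpha is sqrt(number of differing coordinates)."""
    k = sum(1 for xi, yi in zip(x, y) if xi != yi)
    return math.sqrt(k)

x = (0, 0, 0, 0)
y = (1, 1, 0, 0)
uniform = [0.5] * 4                          # unit Euclidean norm, spread over all coordinates
tuned = [1 / math.sqrt(2)] * 2 + [0.0, 0.0]  # unit norm, supported on the differing coordinates

print(dist_to_set(uniform, x, [y]))     # a valid alpha, but not the optimal one
print(dist_to_set(tuned, x, [y]))       # attains the supremum for this example
print(convex_distance_to_point(x, y))
```

For a general finite set the supremum over the weights is a convex optimisation problem, which this sketch does not attempt to solve.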

Then the strongest form of Talagrand’s convex distance inequality for product spaces is the following.

Theorem 2.3.3. Let $\mu$ be a product measure on $\Omega$. Let $X \sim \mu$. Then for any $A \subset \Omega$,

$$\mathbb{E}\left[e^{d_T^2(X, A)/4}\right] \le \frac{1}{\mathbb{P}(A)}. \tag{2.3.3}$$

The original proof of this result is based on mathematical induction in the dimension n.

Let $A_t := \{x \in \Omega : d_T(x, A) > t\}$. Then this theorem implies the following weaker form: for any $t \ge 0$ and any $A \subset \Omega$,

$$\mathbb{P}(A) \cdot \mathbb{P}(A_t) \le \exp(-t^2/4). \tag{2.3.4}$$
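One way to see how (2.3.4) follows from (2.3.3) is Markov's inequality applied to the exponential of the squared distance. The lines below are a sketch of that step, writing $\mathbb{P}(A_t)$ for $\mathbb{P}(X \in A_t)$ and using $t \ge 0$ and $d_T \ge 0$ in the first inequality.

```latex
\begin{align*}
\mathbb{P}(A_t) = \mathbb{P}\big(d_T(X, A) > t\big)
  &\le \mathbb{P}\big(e^{d_T^2(X, A)/4} \ge e^{t^2/4}\big) \\
  &\le e^{-t^2/4}\, \mathbb{E}\big[e^{d_T^2(X, A)/4}\big]
   \le \frac{e^{-t^2/4}}{\mathbb{P}(A)}.
\end{align*}
```

Multiplying both sides by $\mathbb{P}(A)$ gives the stated product form.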

This, in turn, implies the method of non-uniformly bounded differences (Theorem 2.2.2). For a short proof of these, see pages 139-140 of Dubhashi and Panconesi (2009).

Besides product spaces, Talagrand’s convex distance inequality also holds for uniform permutations (see Talagrand (1995)). In this case, an inequality of the form (2.3.3) holds, with constant 16 instead of 4.

In addition to the definition (2.3.2), Talagrand has defined set distances in several other ways as well. His so-called “control by several points” method generalises $d_T(x, A)$ to define a distance between a point and several sets, of the type $d_T(x, A_1, \dots, A_q)$. This method has led to important new concentration inequalities for suprema of empirical processes in product spaces (in particular, for sums of the form $\sup_{f \in \mathcal{F}} \sum_{i=1}^n f(X_i)$, where $X_1, \dots, X_n$ are independent random variables, and $\mathcal{F}$ is a countable set of real valued functions). These inequalities have proven to be very useful for applications in model selection and machine learning.

For a concise proof of Talagrand’s inequality for uniform permutations, see Section 8.2 of Ledoux (2001). This inequality was further generalised in McDiarmid (2002), and Luczak and McDiarmid (2003).

Boucheron, Bousquet, and Lugosi (2005a) is a great survey on applications of concentration inequalities for empirical processes to the theory of classification. For applications to model selection problems, see Massart (2007).

Finally, it is worth noting that most of the inequalities obtained by Talagrand’s set distance method have also been proven using Ledoux’s log-Sobolev-type entropy method (Ledoux (1995/97), Massart (2000)), and using transportation cost inequalities (see Dembo (1997)).

2.3.3 Log-Sobolev inequalities and the entropy method

In this section, we first state the simplest form of log-Sobolev inequalities and show how they imply concentration via the so-called Herbst argument. Then we explain the basics of the entropy method.

Log-Sobolev inequalities were introduced in Gross (1975) in relation to quantum field theory. They have later found applications in many fields of mathematics, see the lecture notes Guionnet and Zegarlinski (2003), and Ané, Blachère, Chafaï, Fougères, Gentil, Malrieu, Roberto, and Scheffer (2000). For applications to Markov chains (bounding the spectral gap of the chain), see Diaconis and Saloff-Coste (1996).

More recently, a version of log-Sobolev inequalities, the entropy method, has proven to be a powerful method to prove concentration inequalities (see Boucheron, Lugosi, and Massart (2013b)).

Given a probability space $(\Omega, \mathcal{F}, \mu)$ and a nonnegative measurable function $f : \Omega \to \mathbb{R}$, we define its entropy as

$$\mathrm{Ent}_\mu(f) := \mathbb{E}_\mu(f \log(f)) - \mathbb{E}_\mu(f) \log(\mathbb{E}_\mu(f)).$$
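As an illustration of this definition, the sketch below evaluates the entropy functional on a finite probability space. The helper name `entropy_functional` and the convention $0 \log 0 := 0$ are assumptions made for the example. It assumes nonnegative values of f, so that the logarithm is defined.

```python
import math

def entropy_functional(p, f):
    """Ent_mu(f) for a finite probability vector p and nonnegative values f.

    Uses the convention 0 * log(0) := 0 (an assumption of this sketch).
    """
    def xlogx(v):
        return v * math.log(v) if v > 0 else 0.0

    mean = sum(pi * fi for pi, fi in zip(p, f))
    return sum(pi * xlogx(fi) for pi, fi in zip(p, f)) - xlogx(mean)

p = [0.5, 0.5]
print(entropy_functional(p, [2.0, 2.0]))  # constant functions have zero entropy
print(entropy_functional(p, [1.0, 3.0]))  # nonnegative, by Jensen's inequality for x*log(x)
```

The functional is homogeneous of degree one, that is, Ent(cf) = c Ent(f) for c > 0, which is one reason it is applied to $f^2$ in inequalities that compare it with a quadratic gradient term.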

Now in the case of $\Omega = \mathbb{R}^n$, with $\mathcal{F}$ being all the Borel sets of $\mathbb{R}^n$, we say that $\mu$ satisfies the log-Sobolev inequality with constant $C$ if for all smooth enough functions $f$,

$$\mathrm{Ent}_\mu(f^2) \le 2C\, \mathbb{E}_\mu(|\nabla f|^2), \tag{2.3.5}$$

with $|\nabla f(x)|$ denoting the Euclidean length of the gradient vector of $f$ at point $x$.

The following theorem gives a class of log-concave distributions for which the log-Sobolev constant C can be bounded.

Theorem (Theorem 5.2 of Ledoux (2001)). Suppose that $\Omega = \mathbb{R}^n$, and $\mathcal{F}$ contains all the Borel sets. Let $d\mu = e^{-U(x)} dx$, where for some $c > 0$, $\lambda_{\min}(\mathrm{Hess}\, U(x)) \ge c$ uniformly for every $x \in \mathbb{R}^n$ ($\lambda_{\min}$ denotes the smallest eigenvalue). Then for all smooth enough functions $f$ on $\mathbb{R}^n$,

$$\mathrm{Ent}_\mu(f^2) \le \frac{2}{c}\, \mathbb{E}_\mu(|\nabla f|^2),$$

i.e. the log-Sobolev inequality holds with constant $C = 1/c$.

Remark 2.3.4. In the special case of the n dimensional standard Gaussian distribution, $U(x) = \|x\|_2^2/2 + (n/2) \log(2\pi)$, and thus $\lambda_{\min}(\mathrm{Hess}\, U(x)) = \lambda_{\min}(I) = 1$, therefore we have $c = 1$ and $C = 1$.

The following proposition relates the log-Sobolev inequality with concentration of Lipschitz functions.

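Proposition 2.3.5 (Herbst argument). Suppose that $\mu$ satisfies (2.3.5). Let $X \sim \mu$. Then for any Lipschitz function $f : \mathbb{R}^n \to \mathbb{R}$ and any $t \ge 0$,

$$\mathbb{P}(|f(X) - \mathbb{E}(f)| \ge t) \le 2 \exp\left(-\frac{t^2}{2C \|f\|_{\mathrm{Lip}}^2}\right),$$

where $\|f\|_{\mathrm{Lip}}$ denotes the Lipschitz coefficient of $f$ with respect to the Euclidean distance.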
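The main steps behind this bound can be sketched as follows, under the simplifying assumption that $f$ is smooth with $|\nabla f| \le L := \|f\|_{\mathrm{Lip}}$. The notation $H(\lambda)$ and $L$ is introduced here only for illustration. This is an outline of the standard argument rather than a complete proof.

```latex
% Sketch only: assumes f smooth with |\nabla f| \le L; H and L are illustrative notation.
\begin{align*}
H(\lambda) &:= \mathbb{E}_\mu\big(e^{\lambda f}\big), \qquad \lambda > 0. \\
\intertext{Apply (2.3.5) to $g = e^{\lambda f/2}$, for which
  $|\nabla g|^2 = \tfrac{\lambda^2}{4} |\nabla f|^2 e^{\lambda f} \le \tfrac{\lambda^2 L^2}{4} e^{\lambda f}$:}
\lambda H'(\lambda) - H(\lambda) \log H(\lambda)
  &= \mathrm{Ent}_\mu\big(e^{\lambda f}\big) \le \frac{C \lambda^2 L^2}{2} H(\lambda). \\
\intertext{Divide by $\lambda^2 H(\lambda)$ and integrate, using
  $\lim_{\lambda \to 0^+} \lambda^{-1} \log H(\lambda) = \mathbb{E}_\mu(f)$:}
\frac{d}{d\lambda}\left(\frac{\log H(\lambda)}{\lambda}\right) &\le \frac{C L^2}{2},
  \qquad
  \log H(\lambda) \le \lambda\, \mathbb{E}_\mu(f) + \frac{C \lambda^2 L^2}{2}. \\
\intertext{Chernoff's bound with $\lambda = t/(C L^2)$ then gives}
\mathbb{P}\big(f(X) - \mathbb{E}(f) \ge t\big)
  &\le \exp\left(-\lambda t + \frac{C \lambda^2 L^2}{2}\right)
   = \exp\left(-\frac{t^2}{2 C L^2}\right).
\end{align*}
```

Applying the same bound to $-f$ and taking a union bound accounts for the factor 2 in the two-sided statement.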

Remark 2.3.6. The proof of this result is given on pages 94-95 of Ledoux (2001).

In the special case of the standard normal distribution, C = 1, thus we obtain the Cirelson-Ibragimov-Sudakov inequality (see Section 1.2.1 of Massart (2007)).

Proposition 2.3.7. Let $X = (X_1, \dots, X_n)$ be a vector of independent standard normal random variables. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a 1-Lipschitz function with respect to the Euclidean distance. Then for any $t \ge 0$,

$$\mathbb{P}(|f(X) - \mathbb{E}(f)| \ge t) \le 2 \exp(-t^2/2).$$
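A quick Monte Carlo experiment can illustrate this Gaussian concentration bound. The sketch below takes the Euclidean norm as the 1-Lipschitz function and replaces the exact mean E(f) by the sample mean. The function choice, sample size and seed are all assumptions of the example, so the output is only a rough numerical check.

```python
import math
import random

def empirical_tail(n, t, samples, seed=0):
    """Estimate P(|f(X) - E f| >= t) for f(x) = ||x||_2 and X standard normal in R^n."""
    rng = random.Random(seed)
    values = []
    for _ in range(samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        values.append(math.sqrt(sum(v * v for v in x)))  # the norm is 1-Lipschitz
    mean = sum(values) / samples  # Monte Carlo stand-in for E(f)
    return sum(1 for v in values if abs(v - mean) >= t) / samples

def gaussian_bound(t):
    """Right hand side 2 * exp(-t^2 / 2) of the concentration inequality."""
    return 2.0 * math.exp(-t * t / 2.0)

for t in (1.5, 2.0, 2.5):
    print(t, empirical_tail(10, t, 5000), gaussian_bound(t))
```

For this particular function the empirical tails are typically far below the bound, since the bound has to hold for every 1-Lipschitz function simultaneously.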

Now we turn to the basics of the entropy method.

A classical inequality from probability theory is the Efron-Stein inequality (in this form, see Boucheron, Lugosi, and Massart (2003)).

Theorem 2.3.8. Let $X_1, \dots, X_n$ be independent random variables, and let $X_1', \dots, X_n'$ be independent copies of them. For some real valued function $g$, let $Z := g(X_1, \dots, X_n)$ be square integrable, and let $Z^{(i)} := g(X_1, \dots, X_i', \dots, X_n)$. Then

$$\mathrm{Var}(Z) \le \frac{1}{2} \sum_{i=1}^n \mathbb{E}\left[(Z - Z^{(i)})^2\right],$$

whenever all the expectations exist.
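The Efron-Stein inequality can be checked exactly on small discrete examples by enumeration. The sketch below does this for i.i.d. variables that are uniform on a finite set, using exact rational arithmetic. The function name and the choice of examples are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def efron_stein(g, values, n):
    """Return (Var(Z), Efron-Stein bound) for Z = g(X_1..X_n), X_i i.i.d. uniform on `values`."""
    points = list(product(values, repeat=n))
    p = Fraction(1, len(points))   # probability of each point x
    q = Fraction(1, len(values))   # probability of each value of the independent copy X_i'
    mean = sum(p * g(x) for x in points)
    var = sum(p * (g(x) - mean) ** 2 for x in points)
    total = Fraction(0)
    for i in range(n):
        for x in points:
            for v in values:
                x_resampled = x[:i] + (v,) + x[i + 1:]  # replace coordinate i by its copy
                total += p * q * (g(x) - g(x_resampled)) ** 2
    return var, total / 2

print(efron_stein(sum, (0, 1), 3))  # for sums the bound holds with equality
print(efron_stein(max, (0, 1), 3))  # for the maximum the bound is strict
```

For the sum of three fair bits both quantities equal 3/4, while for the maximum the variance is 7/64 and the bound is 3/16.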

The advantage of this result is that it uses the typical deviations of g when changing each of the random variables X1, . . . , Xn separately (instead of using the maximal possible deviations, as in the bounded differences inequality). The disadvantage is that it only gives a bound on the variance, and not an exponential concentration result. The entropy method allows us to recover exponential concentration bounds of similar type.

The following theorem is an exponential version of the Efron-Stein inequality (see Boucheron, Lugosi, and Massart (2003)).

Theorem 2.3.9. Let X1, . . . , Xn, Z, and Z(i) be as in Theorem 2.3.8. Let

These inequalities give bounds on the moment generating function of $Z - \mathbb{E}(Z)$ in terms of the moment generating functions of $V^+$ and $V^-$. The mean of $V^+$ and $V^-$ is exactly the bound from the Efron-Stein inequality. If we assume that $V^+$ has finite exponential moments for a non-empty range of positive exponents, then it follows from the theorem that for small values of $\lambda$, $\mathbb{E}[\exp(\lambda(Z - \mathbb{E}(Z)))]$ is bounded by roughly $\exp(\lambda^2 \mathbb{E}(V^+))$, which in turn implies that for sufficiently small deviations, Gaussian tails hold with constants proportional to the right hand side of the Efron-Stein inequality, $\sum_{i=1}^n \mathbb{E}(Z - Z^{(i)})^2$. Thus whenever the Efron-Stein bound gives the right order of variance, we can get sharp Gaussian tails for sufficiently small deviations.

The proof of Theorem 2.3.9 is based on the following modified log-Sobolev inequality (see Massart (2000)).

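Theorem 2.3.10. Let $\psi(x) := e^x - x - 1$. Suppose that $X_1, \dots, X_n$ are independent random variables, and $X_1', \dots, X_n'$ are independent copies of them. For some real valued function $g$, let $Z := g(X_1, \dots, X_n)$, and $Z_i' := g(X_1, \dots, X_i', \dots, X_n)$. Then for any $s > 0$,

$$s\mathbb{E}[Ze^{sZ}] - \mathbb{E}[e^{sZ}] \log \mathbb{E}[e^{sZ}] \le \sum_{i=1}^n \mathbb{E}\left[e^{sZ} \psi(-s(Z - Z_i'))\right].$$

Moreover, denote $\tau(x) := x(e^x - 1)$. Then for all $s \in \mathbb{R}$,

$$s\mathbb{E}[Ze^{sZ}] - \mathbb{E}[e^{sZ}] \log \mathbb{E}[e^{sZ}] \le \sum_{i=1}^n \mathbb{E}\left[e^{sZ} \tau(-s(Z - Z_i')) \mathbf{1}[Z > Z_i']\right],$$

and

$$s\mathbb{E}[Ze^{sZ}] - \mathbb{E}[e^{sZ}] \log \mathbb{E}[e^{sZ}] \le \sum_{i=1}^n \mathbb{E}\left[e^{sZ} \tau(s(Z_i' - Z)) \mathbf{1}[Z < Z_i']\right].$$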
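The first inequality of Theorem 2.3.10 can be verified numerically on a small example. The sketch below enumerates i.i.d. fair bits and evaluates both sides of the inequality involving ψ. The function names, the choice of fair bits and the test functions are assumptions made for illustration.

```python
import math
from itertools import product

def psi(x):
    """psi(x) = e^x - x - 1, as in the statement of the theorem."""
    return math.exp(x) - x - 1.0

def modified_log_sobolev_sides(g, n, s):
    """Both sides of the psi-inequality for Z = g(X_1..X_n) with i.i.d. fair bits X_i."""
    points = list(product((0, 1), repeat=n))
    p = 1.0 / len(points)
    mgf = sum(p * math.exp(s * g(x)) for x in points)
    lhs = s * sum(p * g(x) * math.exp(s * g(x)) for x in points) - mgf * math.log(mgf)
    rhs = 0.0
    for i in range(n):
        for x in points:
            for v in (0, 1):  # value taken by the independent copy X_i', probability 1/2 each
                z = g(x)
                z_i = g(x[:i] + (v,) + x[i + 1:])
                rhs += p * 0.5 * math.exp(s * z) * psi(-s * (z - z_i))
    return lhs, rhs

for s in (0.5, 1.0, 2.0):
    print(s, modified_log_sobolev_sides(max, 3, s))
```

The left hand side is the entropy of $e^{sZ}$, so the inequality bounds this entropy by a sum of one-coordinate terms, which is what makes the tensorisation argument of the entropy method work.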

The entropy method was shown to imply the strongest form (2.3.3) of Talagrand’s convex distance inequality in Boucheron, Lugosi, and Massart (2009). In Chapter 7, we use parts of the approach of that paper to prove Talagrand’s convex distance inequality for dependent variables. The entropy method has also been generalised to obtain moment bounds for functions of independent random variables in Boucheron, Bousquet, Lugosi, and Massart (2005b).

2.3.4 Transportation cost inequality method

Transportation cost inequalities are a powerful tool for proving concentration results. They were introduced by Marton, based on ideas from information theory. Here we briefly review the basics of this method.

Suppose that we have a Polish metric space $(\Omega, d)$, and distributions $\mu$ and $\nu$ on it. Then the $L^1$ and $L^2$ Wasserstein distances are defined as

$$W_1(\mu, \nu) := \inf_{\pi[X \sim \mu,\, Y \sim \nu]} \mathbb{E}_\pi(d(X, Y)), \tag{2.3.6}$$

$$W_2(\mu, \nu) := \inf_{\pi[X \sim \mu,\, Y \sim \nu]} \left[\mathbb{E}_\pi(d^2(X, Y))\right]^{1/2}, \tag{2.3.7}$$

where the infimum is taken over all distributions $\pi$ defined on $\Omega^2$ having marginals $\mu$ and $\nu$. Define the relative entropy of two measures $\nu$ and $\mu$ as

$$D(\nu \| \mu) = \int \log\left(\frac{d\nu}{d\mu}\right) d\nu, \tag{2.3.8}$$

with the convention that it is infinity if $\nu$ is not absolutely continuous with respect to $\mu$. A distribution $\mu$ on $(\Omega, d)$ is said to satisfy a transportation cost inequality with constant $c$ if for any distribution $\nu$ on $(\Omega, d)$,

$$W_1(\nu, \mu) \le \sqrt{2c\, D(\nu \| \mu)}.$$

Alternatively, a distribution $\mu$ on $(\Omega, d)$ is said to satisfy a quadratic transportation cost inequality with constant $c$ if for any distribution $\nu$ on $(\Omega, d)$,

$$W_2(\nu, \mu) \le \sqrt{2c\, D(\nu \| \mu)}.$$
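A simple instance of a transportation cost inequality arises on a finite set with the discrete metric d(x, y) = 1[x ≠ y]. There W1 coincides with the total variation distance, and Pinsker's inequality states that the total variation distance is at most the square root of half the relative entropy, which is the inequality above with c = 1/4. The sketch below checks this numerically for one pair of distributions. The function names and the example distributions are assumptions of the example.

```python
import math

def total_variation(nu, mu):
    """Total variation distance; equals W_1 under the discrete metric 1[x != y]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(nu, mu))

def relative_entropy(nu, mu):
    """D(nu || mu) for finite distributions; infinite if nu is not absolutely continuous w.r.t. mu."""
    total = 0.0
    for a, b in zip(nu, mu):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
c = 0.25  # constant given by Pinsker's inequality for the discrete metric
w1 = total_variation(nu, mu)
bound = math.sqrt(2 * c * relative_entropy(nu, mu))
print(w1, bound)
```

On general metric spaces computing W1 requires solving an optimal transport problem, which this sketch deliberately avoids by restricting to the discrete metric.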

In general spaces, transportation cost inequalities imply Gaussian concentration for d-Lipschitz functions (in fact, as was shown in Djellout, Guillin, and Wu (2004), Gaussian concentration is equivalent to transportation cost inequalities). In product-like spaces (such as independent random variables, or uniform permutations) they can be shown to imply McDiarmid’s bounded differences inequality.

Quadratic transportation cost inequalities are stronger results. In product-like spaces, some special type of quadratic transportation cost inequalities also imply Talagrand’s convex distance inequality, Bernstein’s inequality, and further inequalities, see Samson (2000), Marton (2003). In the seminal work Otto and Villani (2000), it was shown that in a general setting, log-Sobolev inequalities imply quadratic transportation cost inequalities.

One great success of the transportation cost inequality method was proving concentration inequalities for so-called contracting Markov chains. For a homogeneous Markov chain with Polish state space $\Omega$ and transition probabilities $P(x, y)$, let us denote $a := \sup_{x, y \in \Omega} d_{TV}(P(x, \cdot), P(y, \cdot))$. Then Proposition 1 of Marton (1996b) proves a transportation cost inequality $1/(1 - a)^2$ times worse than in the independent case (see (3.1.1) for the definition of the total variational distance). Marton (1996a) shows a quadratic transportation cost inequality for such chains, again with constants $1/(1 - a)^2$ times weaker than in the independent case. Further extensions were given in Samson (2000) and in an unpublished manuscript of Marton.
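For a finite state space the contraction coefficient a is easy to compute from the transition matrix. The sketch below does so for a hypothetical two-state chain and reports the corresponding factor 1/(1 − a)². The matrix and the function name are illustrative assumptions, and the factor is only meaningful when a < 1.

```python
def contraction_coefficient(P):
    """a = max over pairs of states of the total variation distance between rows of P."""
    def tv(p, q):
        return 0.5 * sum(abs(u - v) for u, v in zip(p, q))
    return max(tv(p, q) for p in P for q in P)

# Hypothetical two-state transition matrix, chosen only for illustration.
P = [[0.9, 0.1],
     [0.2, 0.8]]
a = contraction_coefficient(P)
factor = 1.0 / (1.0 - a) ** 2  # loss in the constant relative to the independent case
print(a, factor)
```

The factor grows quickly as a approaches 1, so bounds of this type degrade for slowly mixing chains.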

In this thesis, we improve upon these bounds for Markov chains, and show that McDiarmid’s bounded differences inequality holds with constants that are weaker than in the independent case by a factor equal to the mixing time of the chain. In fact, we have found two proofs for this result, one using transportation cost inequalities (which is more general, and also yields Talagrand’s convex distance inequality, Bernstein’s inequality, and further inequalities), and one simpler approach using a martingale-type argument.

Because of space considerations, we have decided to only include the martingale-type approach in this thesis.

In this short section, we have only attempted to cover the basics of the transportation cost inequality method, which has become popular in the last decade and has found many connections with other fields. More complete references are Villani (2009), and Gozlan and Léonard (2010).
