5.3 The chaining method
In the previous section, we developed a simple method to bound the supremum of a random process that satisfies the Lipschitz property $X_t - X_s \lesssim d(t,s)$ in an almost sure sense. However, we have seen that this requirement is very restrictive: in many cases, the typical size of the increments $X_t - X_s$ is much smaller than in the worst case. We therefore aim to develop a method to bound the suprema of random processes that only requires the Lipschitz property $X_t - X_s \lesssim d(t,s)$ to hold in probability in a suitable sense.
To understand how one might approach this problem, let us recall the basic idea behind the proof of Lemma 5.7. If $N$ is an $\varepsilon$-net, we can estimate
$$\mathbf{E}\sup_{t\in T} X_t \le \mathbf{E}\sup_{t\in T} X_{\pi(t)} + \mathbf{E}\sup_{t\in T}\{X_t - X_{\pi(t)}\}.$$
The first term is a finite maximum that can be controlled by the maximal inequality of Lemma 5.1. The second term is a small remainder: each variable inside the supremum has magnitude of order $\varepsilon$ by the Lipschitz property of the process. If the Lipschitz property holds in an almost sure sense, the supremum drops out and we can immediately control the remainder term.
However, if the Lipschitz property only holds in probability, we cannot directly control the remainder term. Indeed, in this case each variable inside the supremum has “typical” size $\varepsilon$; however, we have to control the supremum of many such variables, whose magnitude can be much larger than $\varepsilon$ (e.g., the maximum of $n$ independent $N(0,\sigma^2)$ variables is of order $\sigma\sqrt{\log n}$, even though each variable is only of order $\sigma$). Therefore, in this case, the problem of controlling the remainder term is essentially of the same type as that of controlling the original supremum of interest. Nonetheless, we expect that the remainder term is smaller than the original supremum, as the size of each variable in the remainder term is now smaller. To shrink the remainder term further, we can approximate it once again by a finite maximum at a smaller scale. For example, if $N'$ is an $\varepsilon/2$-net, then we can estimate
$$\mathbf{E}\sup_{t\in T}\{X_t - X_{\pi(t)}\} \le \mathbf{E}\sup_{t\in T}\{X_{\pi'(t)} - X_{\pi(t)}\} + \mathbf{E}\sup_{t\in T}\{X_t - X_{\pi'(t)}\}.$$
The first term on the right is a finite maximum that can be controlled by Lemma 5.1. The remainder term is still an infinite supremum, but now each variable inside the supremum is only of order $\varepsilon/2$: that is, we have cut the remainder term roughly by half.
The key idea of this section is that we can repeat this procedure over and over again, each time cutting the size of the remainder term roughly by half. Let us investigate this idea a bit more systematically. For each $k \ge 0$, let $N_k$ be a $2^{-k}$-net and choose $\pi_k(t) \in N_k$ such that $d(t,\pi_k(t)) \le 2^{-k}$. Repeating the approximation step above $n$ times yields
$$\mathbf{E}\sup_{t\in T} X_t \le \mathbf{E}\sup_{t\in T} X_{\pi_0(t)} + \sum_{k=1}^{n} \mathbf{E}\sup_{t\in T}\{\overbrace{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}}^{\sim 2^{-k}}\} + \mathbf{E}\sup_{t\in T}\{\overbrace{X_t - X_{\pi_n(t)}}^{\sim 2^{-n}}\}.$$
The remainder term is now a supremum of variables of order $2^{-n}$. Under mild conditions, the remainder term will disappear if we let $n \to \infty$ without having to invoke any almost sure Lipschitz property of the process. Thus we surmount the inefficiency of Lemma 5.7 by approximating the supremum not at a single scale, but at infinitely many scales. The remaining bound is now an infinite sum: the $k$th term in the sum is a finite maximum of random variables at the scale $2^{-k}$. To control these finite maxima, we also do not require an almost sure Lipschitz property: in view of Lemma 5.1, it suffices to assume that the Lipschitz property holds “in probability” in the following sense.
Definition 5.20 (Subgaussian process). A random process $\{X_t\}_{t\in T}$ on the metric space $(T,d)$ is called subgaussian if $\mathbf{E}[X_t] = 0$ and
$$\mathbf{E}[e^{\lambda\{X_t - X_s\}}] \le e^{\lambda^2 d(t,s)^2/2} \quad\text{for all } t,s\in T,\ \lambda \ge 0.$$
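As a basic sanity check on Definition 5.20, consider a centered Gaussian process equipped with its natural (pseudo)metric. This particular choice of $d$ is an assumption made here for illustration only; it is not part of the definition. Every increment is then a centered Gaussian variable with variance $d(t,s)^2$, so the defining inequality holds with equality:

```latex
d(t,s) := \big(\mathbf{E}|X_t - X_s|^2\big)^{1/2},
\qquad
X_t - X_s \sim N\big(0,\, d(t,s)^2\big)
\quad\Longrightarrow\quad
\mathbf{E}\big[e^{\lambda\{X_t - X_s\}}\big] = e^{\lambda^2 d(t,s)^2/2}
\quad\text{for all } \lambda \in \mathbb{R}.
```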
Remark 5.21. The subgaussian property should indeed be interpreted as an “in probability” form of the Lipschitz property: by Problem 3.1, the subgaussian assumption is equivalent up to constants to an assumption of the form
$$\mathbf{P}[|X_t - X_s| \ge x\,d(t,s)] \le Ce^{-x^2/C}.$$
Note also that the assumption $\mathbf{E}[e^{\lambda\{X_t - X_s\}}] \le e^{\lambda^2 d(t,s)^2/2}$ already implies $\mathbf{E}[X_t - X_s] = 0$ (as $\lim_{\lambda\downarrow 0}\{e^{c\lambda^2/2} - 1\}/\lambda = 0$), so the assumption $\mathbf{E}[X_t] = 0$ merely imposes a convenient normalization. In section 5.4, we will see how to control the suprema of random processes with nontrivial mean $t \mapsto \mathbf{E}[X_t]$.
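The multiscale approximation described before Definition 5.20 can be sketched numerically. The following is a minimal illustration, not part of the original argument: the index set (a dyadic grid in the unit square), the greedy construction of the nets, and the sample path $X_t = \langle g, t\rangle$ are all assumptions chosen for convenience. It builds $2^{-k}$-nets $N_k$, the projections $\pi_k(t)$, and checks that the increments between successive approximations telescope.

```python
import math
import random

def greedy_net(points, eps, dist):
    """Return a subset of `points` such that every point lies within eps of it."""
    net = []
    for p in points:
        # p is added only if no previously chosen net point is within eps of it
        if all(dist(p, q) > eps for q in net):
            net.append(p)
    return net

def nearest(t, net, dist):
    """A choice of pi(t): a net point closest to t."""
    return min(net, key=lambda q: dist(t, q))

def euclid(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical finite index set: a 9 x 9 dyadic grid in the unit square.
T = [(i / 8, j / 8) for i in range(9) for j in range(9)]

n = 4  # 2^-4 is below the grid spacing 1/8, so the finest net is all of T
nets = [greedy_net(T, 2.0 ** -k, euclid) for k in range(n + 1)]
chains = {t: [nearest(t, nets[k], euclid) for k in range(n + 1)] for t in T}

# One sample path of an illustrative process X_t = <g, t> with Gaussian g.
rng = random.Random(0)
g = (rng.gauss(0, 1), rng.gauss(0, 1))
X = {t: g[0] * t[0] + g[1] * t[1] for t in T}

# Largest distance between consecutive approximations at each scale k = 1..n.
max_link = [
    max(euclid(chains[t][k], chains[t][k - 1]) for t in T)
    for k in range(1, n + 1)
]

# Sum of the increments along the chain of each t.
telescoped = {
    t: sum(X[chains[t][k]] - X[chains[t][k - 1]] for k in range(1, n + 1))
    for t in T
}

for k in range(n + 1):
    print("scale 2^-%d: net size %d" % (k, len(nets[k])))
print("max link lengths:", max_link)
```

In this sketch the finest net coincides with $T$, so the approximation at the last scale is exact and the sum of the increments recovers $X_t - X_{\pi_0(t)}$. A greedy net is generally not of minimal cardinality; it only gives an upper bound on the covering number.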
The technique that we have outlined above is known as chaining: the idea is to approximate $X_t$ by a “chain” $X_{\pi_k(t)}$ of increasingly accurate approximations (the “links” in the chain are the increments $X_{\pi_k(t)} - X_{\pi_{k-1}(t)}$). The main remaining difficulty in implementing the method is to show that the remainder term does indeed vanish as $n \to \infty$. To this end, we will impose a very mild technical assumption that holds in almost all cases of interest.
Definition 5.22 (Separable process). A random process $\{X_t\}_{t\in T}$ is called separable if there is a countable set $T_0 \subseteq T$ such that
$$X_t \in \lim_{\substack{s\to t \\ s\in T_0}} X_s \quad\text{for all } t\in T \text{ a.s.}$$
Remark 5.23. The assumption of separability is technical, and is almost always trivially satisfied. For example, if $t \mapsto X_t$ is continuous a.s., we can take $T_0$ to be any countable dense subset of $T$. At the same time, the separability assumption is in some sense intrinsic to the chaining argument. After all, the main idea of the chaining argument is to approximate $X_t = \lim_{k\to\infty} X_{\pi_k(t)}$ for every $t \in T$. If this is in fact valid, however, then the definition of a separable process will hold for the countable set $T_0 = \{\pi_k(t) : k \ge 0,\ t \in T\}$.
For completeness, let us note a somewhat esoteric point that we swept under the rug. If $T$ is uncountable, $\sup_{t\in T} X_t$ is the supremum of an uncountable family of random variables. In general, the supremum of uncountably many measurable functions is not even necessarily measurable. Measurability issues do arise, on occasion, in the control of suprema, but we will shamelessly ignore such problems in these notes. Under the separability assumption, however, $\sup_{t\in T} X_t = \sup_{t\in T_0} X_t$ a.s., and thus no measurability problems arise (as a countable supremum of measurable functions is always measurable).
We now have all the ingredients to implement the chaining argument.
Theorem 5.24 (Dudley). Let $\{X_t\}_{t\in T}$ be a separable subgaussian process on the metric space $(T,d)$. Then we have the following estimate:
$$\mathbf{E}\sup_{t\in T} X_t \le 6\sum_{k\in\mathbb{Z}} 2^{-k}\sqrt{\log N(T,d,2^{-k})}.$$
Proof. We first prove the result in the finite case $|T| < \infty$, which allows us to easily eliminate the remainder term in the chaining argument. We subsequently use the separability assumption to lift this restriction.
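Let $|T| < \infty$. Let $k_0$ be the largest integer such that $2^{-k_0} \ge \mathrm{diam}(T)$. Then any singleton $N_{k_0} = \{t_0\}$ is trivially a $2^{-k_0}$-net. We therefore start chaining at the scale $2^{-k_0}$. For $k > k_0$, let $N_k$ be a $2^{-k}$-net such that $|N_k| = N(T,d,2^{-k})$. Running the chaining argument up to the scale $2^{-n}$ yields
$$\mathbf{E}\sup_{t\in T} X_t \le \mathbf{E}[X_{t_0}] + \sum_{k=k_0+1}^{n} \mathbf{E}\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\} + \mathbf{E}\sup_{t\in T}\{X_t - X_{\pi_n(t)}\}.$$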
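Let us consider each of the terms. As $\mathbf{E}[X_{t_0}] = 0$ by assumption, the first term disappears. Moreover, as $|T| < \infty$, we can choose $n$ sufficiently large so that $N_n = T$. Then the last term disappears. To control the terms inside the sum, note that the maximum in the $k$th term contains at most $|N_k||N_{k-1}| \le |N_k|^2$ terms (as $|N_{k-1}| \le |N_k|$). Moreover, we can readily estimate
$$d(\pi_k(t),\pi_{k-1}(t)) \le d(t,\pi_k(t)) + d(t,\pi_{k-1}(t)) \le 3\times 2^{-k}.$$
As $X_{\pi_k(t)} - X_{\pi_{k-1}(t)}$ is $d(\pi_k(t),\pi_{k-1}(t))^2$-subgaussian and $d(\pi_k(t),\pi_{k-1}(t))^2 \le 9\times 2^{-2k}$, applying the maximal inequality of Lemma 5.1 to each term in the sum yields
$$\mathbf{E}\sup_{t\in T} X_t \le 6\sum_{k>k_0} 2^{-k}\sqrt{\log|N_k|}.$$
But $|N_k| = N(T,d,2^{-k})$ by construction, so the proof is complete.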
In the proof we have used the assumption $|T| < \infty$ to control the remainder term in the chaining argument. We now use separability to show that one can approximate the general case by the finite case. Indeed, by separability, there is a countable subset $T_0 \subseteq T$ such that $\sup_{t\in T} X_t = \sup_{t\in T_0} X_t$ a.s. Denote by $T_k$ the first $k$ elements of $T_0$ (in arbitrary order). Then
$$\mathbf{E}\sup_{t\in T} X_t = \mathbf{E}\sup_{t\in T_0} X_t = \sup_{k\ge 1}\mathbf{E}\sup_{t\in T_k} X_t$$
by monotone convergence. Applying the chaining inequality to each finite maximum and using $N(T_k,d,\varepsilon) \le N(T,d,\varepsilon)$ yields the general result. □
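To get a feeling for the size of the bound in Theorem 5.24, the following sketch evaluates it in a toy case where everything can be computed by hand. The setting is an assumption made here for illustration: $X_1,\dots,X_m$ are independent standard Gaussian variables, viewed as a process indexed by $m$ points at mutual distance $\sqrt{2}$, so that $N(T,d,\varepsilon) = m$ for $\varepsilon < \sqrt{2}$ and $N(T,d,\varepsilon) = 1$ otherwise. The sum in Theorem 5.24 is compared with a Monte Carlo estimate of the expected maximum and with the classical bound $\sqrt{2\log m}$ for the maximum of $m$ standard Gaussian variables.

```python
import math
import random

def dudley_sum(m, max_scale=60):
    """6 * sum_k 2^-k sqrt(log N(T, d, 2^-k)) for m points at mutual distance sqrt(2)."""
    # N(T, d, eps) = m when eps < sqrt(2) and 1 otherwise, so only the
    # scales k >= 0 contribute; the series is truncated at `max_scale`.
    return 6 * sum(2.0 ** -k * math.sqrt(math.log(m)) for k in range(max_scale))

def mc_expected_max(m, trials, rng):
    """Monte Carlo estimate of E max of m independent N(0,1) variables."""
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0, 1) for _ in range(m))
    return total / trials

rng = random.Random(0)
m = 1000
estimate = mc_expected_max(m, 200, rng)
classical = math.sqrt(2 * math.log(m))
dudley = dudley_sum(m)
print("Monte Carlo E max:", estimate)
print("sqrt(2 log m):    ", classical)
print("Dudley sum:       ", dudley)
```

In this example the geometric series sums to $12\sqrt{\log m}$, which has the correct order $\sqrt{\log m}$ but a much larger constant than the direct bound: all points live at a single scale, so nothing is gained from the multiscale argument.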
Very often the result of Theorem 5.24 is written in a slightly different form by noting that the sum can be viewed as a Riemann sum approximation to a certain integral. There is no particular mathematical significance to this reformulation: it is made for purely aesthetic reasons.
Corollary 5.25 (Entropy integral). Let $\{X_t\}_{t\in T}$ be a separable subgaussian process on the metric space $(T,d)$. Then we have the following estimate:
$$\mathbf{E}\sup_{t\in T} X_t \le 12\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon.$$
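Proof. We can readily estimate
$$\sum_{k\in\mathbb{Z}} 2^{-k}\sqrt{\log N(T,d,2^{-k})} = 2\sum_{k\in\mathbb{Z}}\int_{2^{-k-1}}^{2^{-k}}\sqrt{\log N(T,d,2^{-k})}\,d\varepsilon \le 2\sum_{k\in\mathbb{Z}}\int_{2^{-k-1}}^{2^{-k}}\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon = 2\int_0^\infty\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon,$$
where we used that $N(T,d,\varepsilon)$ is decreasing in $\varepsilon$. □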
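The Riemann sum comparison in this proof can be checked numerically in a simple model case. The sketch below assumes $T = [0,1]$ with $d(t,s) = |t-s|$, for which $N(T,d,\varepsilon) = \lceil 1/(2\varepsilon)\rceil$ when $\varepsilon < 1/2$ and $N(T,d,\varepsilon) = 1$ otherwise; this example and the midpoint quadrature are choices made here for illustration.

```python
import math

def covering_number(eps):
    """N([0,1], |.|, eps): number of intervals of radius eps needed to cover [0,1]."""
    return max(1, math.ceil(1.0 / (2.0 * eps)))

def root_entropy(eps):
    return math.sqrt(math.log(covering_number(eps)))

# Dyadic sum over k; terms with 2^-k >= 1/2 vanish, and the tail beyond
# k = 60 is negligible at double precision.
dyadic_sum = sum(2.0 ** -k * root_entropy(2.0 ** -k) for k in range(-5, 60))

# Midpoint rule on [0, 1/2]; the integrand vanishes for eps >= 1/2.
cells = 200_000
h = 0.5 / cells
entropy_integral = h * sum(root_entropy((i + 0.5) * h) for i in range(cells))

print("dyadic sum:      ", dyadic_sum)
print("entropy integral:", entropy_integral)
```

In this example the dyadic sum lies between the entropy integral and twice the entropy integral, so passing from the sum to the integral costs at most a factor of 2.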
Remark 5.26. It is important to note that we always have $N(T,d,\varepsilon) = 1$ when $\varepsilon \ge \mathrm{diam}(T)$, as in this case any singleton $N = \{t_0\}$ is trivially an $\varepsilon$-net. Thus it suffices to take the integral in Corollary 5.25 only up to $\varepsilon = \mathrm{diam}(T)$.
Remark 5.27. The logarithm of the covering number $\log N(T,d,\varepsilon)$ is often called metric entropy in analogy with information theory: it measures the number of bits needed to specify an element of $T$ up to precision $\varepsilon$. It is customary to refer to the integral in Corollary 5.25 as the entropy integral.
To illustrate Corollary 5.25, let us revisit Example 5.15.
Example 5.28 (Wasserstein law of large numbers revisited). We adopt the same setting and notation as in Example 5.15. Recall that we want to estimate the expected Wasserstein distance between the empirical and true measures
$$W_1(\mu_n,\mu) = \sup_{f\in\mathcal{F}} X_f,$$
where $X_1, X_2, \ldots$ are i.i.d. variables in $[0,1]$ with distribution $\mu$ and
$$X_f = \sum_{k=1}^{n}\frac{f(X_k) - \mu f}{n}, \qquad \mathcal{F} = \{f \in \mathrm{Lip}([0,1]) : 0 \le f \le 1\}.$$
By the Azuma-Hoeffding inequality (Corollary 3.9), we have
$$\mathbf{E}[e^{\lambda\{X_f - X_g\}}] \le e^{\lambda^2\|f-g\|_\infty^2/2n}.$$
The process $\{X_f\}_{f\in\mathcal{F}}$ is therefore subgaussian with respect to the metric $d(f,g) = n^{-1/2}\|f-g\|_\infty$. We can consequently estimate using Corollary 5.25
$$\mathbf{E}[W_1(\mu_n,\mu)] \le 12\int_0^\infty\sqrt{\log N(\mathcal{F}, n^{-1/2}\|\cdot\|_\infty, \varepsilon)}\,d\varepsilon.$$
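But it is easily seen that
$$N(\mathcal{F}, n^{-1/2}\|\cdot\|_\infty, \varepsilon) = N(\mathcal{F}, \|\cdot\|_\infty, n^{1/2}\varepsilon),$$
so that changing variables in the integral and using Lemma 5.16 yields
$$\mathbf{E}[W_1(\mu_n,\mu)] \le \frac{12}{\sqrt{n}}\int_0^\infty\sqrt{\log N(\mathcal{F},\|\cdot\|_\infty,\varepsilon)}\,d\varepsilon \le \frac{12}{\sqrt{n}}\int_0^{1/2}\sqrt{\frac{c}{\varepsilon}}\,d\varepsilon.$$
As $\varepsilon^{-1/2}$ is integrable at the origin, we have proved
$$\mathbf{E}[W_1(\mu_n,\mu)] \lesssim n^{-1/2},$$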
which is a huge improvement over the $n^{-1/3}$ rate obtained by the crude method used in Example 5.15. It is evident from the above computations that the crucial improvement is due to the fact that $|X_f - X_g| \lesssim n^{-1/2}\|f-g\|_\infty$ in probability (as is made precise by the subgaussian property), while the best almost sure Lipschitz bound one can hope for is $|X_f - X_g| \lesssim \|f-g\|_\infty$.
In the present example, it is rather easy to obtain a matching lower bound on the Wasserstein distance. Indeed, note that for any function $f \in \mathcal{F}$ that is not constant $\mu$-a.s., we obtain by the central limit theorem
$$\mathbf{E}[W_1(\mu_n,\mu)] \ge \mathbf{E}[X_f \vee X_{1-f}] = \mathbf{E}|X_f| \sim n^{-1/2}.$$
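The $n^{-1/2}$ rate can be observed in a small simulation. The sketch below makes several assumptions that are not part of the example: $\mu$ is taken to be the uniform distribution on $[0,1]$, $\mathrm{Lip}([0,1])$ is taken to denote the 1-Lipschitz functions, and the Wasserstein distance is computed through the standard one-dimensional identity $W_1(\mu_n,\mu) = \int_0^1 |F_n(x) - F(x)|\,dx$ in terms of the distribution functions, rather than through the supremum over $\mathcal{F}$.

```python
import math
import random

def abs_dev_integral(a, b, c):
    """Exact integral of |x - c| over [a, b], assuming a <= b."""
    if c <= a:
        return ((b - c) ** 2 - (a - c) ** 2) / 2
    if c >= b:
        return ((c - a) ** 2 - (c - b) ** 2) / 2
    return ((c - a) ** 2 + (b - c) ** 2) / 2

def w1_to_uniform(sample):
    """W1 distance between the empirical measure of `sample` and Uniform[0,1]."""
    xs = sorted(sample)
    n = len(xs)
    knots = [0.0] + xs + [1.0]
    # On [knots[i], knots[i+1]] the empirical CDF equals i/n and the true CDF is x.
    return sum(
        abs_dev_integral(knots[i], knots[i + 1], i / n) for i in range(n + 1)
    )

def mean_w1(n, trials, rng):
    """Monte Carlo estimate of E[W1(mu_n, mu)] for sample size n."""
    total = 0.0
    for _ in range(trials):
        total += w1_to_uniform([rng.random() for _ in range(n)])
    return total / trials

rng = random.Random(0)
results = {n: mean_w1(n, 200, rng) for n in (100, 400, 1600)}
for n, w in results.items():
    print("n = %5d   E[W1] ~ %.5f   sqrt(n) * E[W1] ~ %.3f" % (n, w, math.sqrt(n) * w))
```

If the rate $n^{-1/2}$ is correct, the rescaled values $\sqrt{n}\,\mathbf{E}[W_1(\mu_n,\mu)]$ should be roughly constant in $n$, and multiplying $n$ by 16 should divide the expected distance by about 4.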
Now that we understand the chaining principle, we can use it to obtain more sophisticated results. For example, just as we could obtain a tail bound in Lemma 5.2 corresponding to the maximal inequality of Lemma 5.1, we can obtain a tail bound counterpart to Corollary 5.25.
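Theorem 5.29 (Chaining tail inequality). Let $\{X_t\}_{t\in T}$ be a separable subgaussian process on the metric space $(T,d)$. Then for all $t_0 \in T$ and $x \ge 0$
$$\mathbf{P}\bigg[\sup_{t\in T}\{X_t - X_{t_0}\} \ge C\int_0^\infty\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon + x\bigg] \le Ce^{-x^2/C\,\mathrm{diam}(T)^2},$$
where $C < \infty$ is a universal constant.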
Proof. The beginning of the proof is identical to that of Theorem 5.24, and we adopt the notations used there. As in Theorem 5.24, it is easily seen that it suffices to consider $|T| < \infty$, as we will assume in the remainder of the proof.
The idea here is to run the chaining argument without taking the expectation. As $|T| < \infty$, we have $\pi_n(t) = t$ for $n$ sufficiently large. Thus
$$X_t - X_{t_0} = \sum_{k>k_0}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\}$$
by the telescoping property of the sum. This elementary chaining identity lies at the heart of the chaining argument. We immediately obtain
$$\sup_{t\in T}\{X_t - X_{t_0}\} \le \sum_{k>k_0}\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\}.$$
Rather than bounding the expectation of this quantity, as we did in Theorem 5.24, we will bound the tail behavior of every term in this sum. To this end, note that the subgaussian property of $\{X_t\}_{t\in T}$ and Lemma 5.2 yield
$$\mathbf{P}\bigg[\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\} \ge 6\times 2^{-k}\sqrt{\log|N_k|} + 3\times 2^{-k}z\bigg] \le e^{-z^2/2}.$$
Thus with high probability, every link $X_{\pi_k(t)} - X_{\pi_{k-1}(t)}$ at the scale $k$ is small. We would like to show that all links at every scale are small simultaneously, that is, that the probability of the union over all $k$ of the events in the above bound is small. We can use a crude union bound to control the latter probability, but it is clear that we must then choose $z = z_k$ to be increasing in $k$ in such a way that the probabilities of the individual events are summable: that is,
$$\mathbf{P}[\Omega] := \mathbf{P}\bigg[\exists\,k>k_0 \text{ s.t. } \sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\} \ge 6\times 2^{-k}\sqrt{\log|N_k|} + 3\times 2^{-k}z_k\bigg]$$
$$\le \sum_{k>k_0}\mathbf{P}\bigg[\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\} \ge 6\times 2^{-k}\sqrt{\log|N_k|} + 3\times 2^{-k}z_k\bigg] \le \sum_{k>k_0} e^{-z_k^2/2}.$$
How to choose $z_k$ is not so important. An easy choice $z_k = x + \sqrt{k-k_0}$ yields
$$\mathbf{P}[\Omega] \le \sum_{k>k_0} e^{-z_k^2/2} \le e^{-x^2/2}\sum_{k>0} e^{-k/2} \le Ce^{-x^2/2}.$$
Now note that on the event $\Omega^c$, we have
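$$\sup_{t\in T}\{X_t - X_{t_0}\} \le \sum_{k>k_0}\sup_{t\in T}\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\} \le 6\sum_{k>k_0} 2^{-k}\sqrt{\log|N_k|} + 3\times 2^{-k_0}\sum_{k>0} 2^{-k}\sqrt{k} + 3\times 2^{-k_0}\sum_{k>0} 2^{-k}x$$
$$\le C\int_0^\infty\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon + C\,\mathrm{diam}(T)\,x,$$
where we have used that $2^{-k_0} \le 2\,\mathrm{diam}(T)$ and
$$2^{-k_0} \le C\,2^{-k_0-1}\sqrt{\log N(T,d,2^{-k_0-1})} \le C\sum_{k>k_0} 2^{-k}\sqrt{\log|N_k|}$$
by the definition of $k_0$. Thus
$$\mathbf{P}\bigg[\sup_{t\in T}\{X_t - X_{t_0}\} \ge C\int_0^\infty\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon + C\,\mathrm{diam}(T)\,x\bigg] \le \mathbf{P}[\Omega],$$
and the proof is readily completed. □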
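For the reader's convenience, the final step of the proof can be spelled out as follows. This is a sketch of one way to complete the argument; the names $C_1, C_2$ for the constants appearing in the last two displays are introduced here and do not appear in the text. Assuming $\mathrm{diam}(T) > 0$ (the case $\mathrm{diam}(T) = 0$ is trivial), substitute $y = C_1\,\mathrm{diam}(T)\,x$ and enlarge the constants:

```latex
\mathbf{P}\bigg[\sup_{t\in T}\{X_t - X_{t_0}\} \ge
  C_1\int_0^\infty \sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon + y\bigg]
\le C_2\, e^{-x^2/2}
= C_2 \exp\bigg(-\frac{y^2}{2C_1^2\,\mathrm{diam}(T)^2}\bigg)
\le C \exp\bigg(-\frac{y^2}{C\,\mathrm{diam}(T)^2}\bigg),
\qquad C := \max\{C_1,\, C_2,\, 2C_1^2\}.
```

As $C \ge C_1$ and the entropy integral is nonnegative, replacing $C_1$ by $C$ in the threshold on the left-hand side only decreases the probability, which gives the inequality in the form stated in Theorem 5.29.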
Remark 5.30. Note that the result of Theorem 5.29 is reminiscent of a concentration inequality. Indeed, if we could establish the concentration inequality
$$\mathbf{P}\bigg[\sup_{t\in T}\{X_t - X_{t_0}\} \ge \mathbf{E}\sup_{t\in T}\{X_t - X_{t_0}\} + x\bigg] \le Ce^{-x^2/C\,\mathrm{diam}(T)^2},$$
then the conclusion of Theorem 5.29 would follow directly by combining this inequality with the chaining bound of Corollary 5.25 for the expected supremum. Despite the similarities, however, Theorem 5.29 should not be confused with a concentration inequality. Its conclusion is both weaker and stronger: weaker, because Theorem 5.29 cannot establish a deviation inequality from the mean, but only from a particular upper bound on the mean; stronger, because the subgaussian assumption of Theorem 5.29 is much weaker than would be required to establish a concentration inequality.
The proof of Theorem 5.29 suggests that at its core, the chaining method boils down to simultaneously controlling, using a union bound, the magnitude of all the links $X_{\pi_k(t)} - X_{\pi_{k-1}(t)}$ in the chaining identity. We might therefore expect that chaining yields sharp results if the links $\{X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\}_{t\in T,\,k>k_0}$ are “nearly independent” in some sense. This is not entirely implausible, as two links are either far apart or are at a different scale. It turns out that the chaining method that we have developed here yields sharp results in many cases, but falls short in others. In the next chapter, we will see that the chaining method can be further improved to adapt to the structure of the set $T$. The resulting method, called the generic chaining, is so efficient that it captures exactly (up to universal constants) the magnitude of the supremum of Gaussian processes! Once this has been understood, we can truly conclude that chaining is the “correct” way to think about the suprema of random processes. Nonetheless, considering that we have ultimately used no idea more sophisticated than the union bound, the remarkably far-reaching power of the chaining method remains somewhat of a miracle to this author.
Problems
5.9 (The entropy integral and sum). Show that
$$\int_0^\infty\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon \le \sum_{k\in\mathbb{Z}} 2^{-k}\sqrt{\log N(T,d,2^{-k})} \le 2\int_0^\infty\sqrt{\log N(T,d,\varepsilon)}\,d\varepsilon.$$
Thus nothing is lost in expressing the chaining bound as an integral rather than a sum, as we have done in Corollary 5.25, up to a constant factor.
5.10 (Chaining with arbitrary tails). The chaining method is not restricted to subgaussian processes: it can be developed analogously for processes that are Lipschitz “in probability” in a more general sense.
Let $\{X_t\}_{t\in T}$ be a separable process with $\mathbf{E}[X_t] = 0$ and
$$\log\mathbf{E}[e^{\lambda\{X_t - X_s\}/d(t,s)}] \le \psi(\lambda) \quad\text{for all } t,s\in T,\ \lambda \ge 0,$$
where $\psi$ is as in Lemma 5.1. Show that
$$\mathbf{E}\sup_{t\in T} X_t \lesssim \int_0^\infty \psi^{*-1}(2\log N(T,d,\varepsilon))\,d\varepsilon.$$
5.11 (An improved chaining bound and Wasserstein LLN). The key improvement of the chaining bound of Corollary 5.25 over the crude approximation of Lemma 5.7 is that the former uses only an in probability Lipschitz property, while the latter uses a stronger almost sure Lipschitz property. These two ideas are not mutually exclusive, however: when the process $\{X_t\}_{t\in T}$ satisfies both types of Lipschitz property, we can obtain an improved chaining bound that is a sort of hybrid between Corollary 5.25 and Lemma 5.7.