INSTRUCCIONES PARA LA INSTALACIÓN - Refrigerador de Dos Puertas

In this section we present a generic approach for implementing the TEST function used in Algorithm 4, together with some examples. Our approach generalizes the use of the $L_\infty$ and $L_\infty^p$ distances for distinguishing states in algorithms for learning PDFA (see Chapter 2 for details). Basically, we define a measure of distinguishability by taking the supremum, over some collection of events, of the difference between the probabilities that two distributions assign to these events. In particular, we will choose the set of events in a way that lets us use VC bounds to obtain finite-sample guarantees like the ones required by Assumption 1.

We begin by recalling the definition of the $L_\infty$ and $L_\infty^p$ distances between distributions on $\Sigma^\star$. This will easily hint at the correct generalization in the case of distributions over $(\Sigma \times \mathcal{X})^\star$. Given distributions $D$ and $D'$ over $\Sigma^\star$, the supremum distance $L_\infty$ is defined as

$$L_\infty(D, D') = \sup_{x \in \Sigma^\star} |D(x) - D'(x)| .$$

The prefix supremum distance $L_\infty^p$ is defined as

$$L_\infty^p(D, D') = \sup_{x \in \Sigma^\star} |D(x\Sigma^\star) - D'(x\Sigma^\star)| .$$
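As an illustration, the two definitions above can be evaluated on the empirical distributions of two finite samples of strings. The following is a minimal sketch under that assumption; the function names `empirical_linf`, `prefix_counts` and `empirical_prefix_linf` are ours and do not come from the text.

```python
from collections import Counter


def empirical_linf(sample_a, sample_b):
    """Supremum distance between the empirical distributions of two string samples."""
    freq_a, freq_b = Counter(sample_a), Counter(sample_b)
    # Strings seen in neither sample contribute a difference of 0,
    # so it is enough to range over the observed strings.
    support = set(freq_a) | set(freq_b)
    return max(
        abs(freq_a[x] / len(sample_a) - freq_b[x] / len(sample_b)) for x in support
    )


def prefix_counts(sample):
    """Count, for every prefix x, how many strings in the sample start with x."""
    counts = Counter()
    for s in sample:
        for k in range(len(s) + 1):
            counts[s[:k]] += 1
    return counts


def empirical_prefix_linf(sample_a, sample_b):
    """Prefix supremum distance between the empirical distributions of two samples."""
    pref_a, pref_b = prefix_counts(sample_a), prefix_counts(sample_b)
    support = set(pref_a) | set(pref_b)
    return max(
        abs(pref_a[x] / len(sample_a) - pref_b[x] / len(sample_b)) for x in support
    )
```

For the samples `["a", "a"]` and `["ab", "ac"]` the first function returns 1.0 while the second returns 0.5, since both samples agree on the probability of the prefix `a` but not on the probability of the string `a`.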

Looking at these definitions, it is immediate that both distances are particular instances of a general measure of discrepancy between distributions given by

$$L^{\mathsf{E}}_\infty(D, D') = \sup_{E \in \mathsf{E}} |D(E) - D'(E)| ,$$

where $\mathsf{E}$ is a collection of events over $\Sigma^\star$. In particular, $\mathsf{E} = \{\{x\} : x \in \Sigma^\star\}$ in the case of $L_\infty$, and $\mathsf{E} = \{x\Sigma^\star : x \in \Sigma^\star\}$ in the case of $L_\infty^p$. Now, since there is nothing special about $\Sigma^\star$ in the definition of $L^{\mathsf{E}}_\infty$, we can easily use this general form to define discrepancy measures between distributions on $(\Sigma \times \mathcal{X})^\star$.

Note that in general $L^{\mathsf{E}}_\infty$ is not necessarily a distance: if $\mathsf{E}$ is not large enough, there may exist distributions $D \neq D'$ such that $L^{\mathsf{E}}_\infty(D, D') = 0$. On the other hand, we will see that very large sets of events $\mathsf{E}$ yield worse statistical performance when estimating $L^{\mathsf{E}}_\infty(D, D')$ from samples, in addition to the larger computational effort required to compute such estimates. We will thus need to balance this trade-off if we want to obtain efficient learning algorithms for large classes of distributions on $(\Sigma \times \mathcal{X})^\star$ using tests based on this principle. In the most general case, when defining $\mathsf{E}$ one should also take into account measurability issues that may arise if $\mathcal{X}$ is a continuous space, for example $\mathcal{X} = \mathbb{R}$ equipped with the Lebesgue measure. We will however ignore such considerations because measurability problems will not appear in the applications we have in mind.

An important and desirable condition for $L^{\mathsf{E}}_\infty$ is that we can estimate its true value using random samples. We will see that this is a statistical property that is independent of whether $L^{\mathsf{E}}_\infty$ defines a proper distance or not. Also note that the statistical possibility of estimating $L^{\mathsf{E}}_\infty$ from a finite set of random examples does not a priori imply that such estimation can be performed efficiently. Let $D$ be a distribution over $(\Sigma \times \mathcal{X})^\star$ and let $S = (z^1, \ldots, z^m)$ be a sample of $m$ i.i.d. examples from $D$. Given an event $E \in \mathsf{E}$ we define the empirical probability of $E$ with respect to sample $S$ as

$$S(E) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{z^i \in E} .$$

If $S'$ is an i.i.d. sample with $m'$ examples from another distribution $D'$, we define the empirical discrepancy between $S$ and $S'$ as

$$L^{\mathsf{E}}_\infty(S, S') = \sup_{E \in \mathsf{E}} |S(E) - S'(E)| .$$
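When the collection of events is finite and given explicitly, the empirical discrepancy can be computed directly from its definition. The sketch below assumes each event is represented by a membership predicate; the names `empirical_probability` and `empirical_discrepancy` are illustrative choices of ours.

```python
def empirical_probability(sample, event):
    """S(E): fraction of examples in the sample that belong to the event."""
    return sum(1 for z in sample if event(z)) / len(sample)


def empirical_discrepancy(sample_a, sample_b, events):
    """Maximum over a finite list of events of |S(E) - S'(E)|."""
    return max(
        abs(empirical_probability(sample_a, e) - empirical_probability(sample_b, e))
        for e in events
    )


# Two prefix events over strings: "starts with a" and "starts with b".
events = [lambda z: z.startswith("a"), lambda z: z.startswith("b")]
gap = empirical_discrepancy(["ab", "ab", "b"], ["b", "b", "ab", "b"], events)
```

This brute-force evaluation costs one pass over both samples per event, which is only practical for small collections; the rest of the section discusses how to exploit the structure of the events instead.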

We note that even though $S$ and $S'$ are finite, computing this quantity requires evaluating a maximum over $\mathsf{E}$. This computation and the statistical accuracy with which we can estimate $L^{\mathsf{E}}_\infty(S, S')$ will depend on the shattering function of $\mathsf{E}$, defined in Appendix A.1 and denoted by $\Pi_{\mathsf{E}}$. As we did in Chapter 2, we use this function to define, for any $0 < \delta < 1$, the following quantity:

$$\Delta(\delta) = \sqrt{\frac{8}{M'} \ln \frac{4\left(\Pi_{\mathsf{E}}(2m) + \Pi_{\mathsf{E}}(2m')\right)}{\delta}} ,$$

where $M' = mm'/(\sqrt{m} + \sqrt{m'})^2$. Now let us write $\mu_\star = L^{\mathsf{E}}_\infty(D, D')$ and $\hat{\mu} = L^{\mathsf{E}}_\infty(S, S')$. With these definitions, the following two results follow from the same proofs used in Chapter 2. Note that probabilities in these results are with respect to the sampling of $S$ and $S'$.

Proposition 4.4.1. With probability at least $1 - \delta$ we have $\mu_\star \leq \hat{\mu} + \Delta(\delta)$.

Proposition 4.4.2. With probability at least $1 - \delta$ we have $\mu_\star \geq \hat{\mu} - \Delta(\delta)$.
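To get a feeling for the size of these confidence intervals, the quantity $\Delta(\delta)$ can be evaluated numerically once a bound on the shattering function is available. The sketch below assumes such a bound is supplied by the caller as a Python function; `delta_bound`, `confidence_interval` and the parameter `shatter` are hypothetical names introduced here for illustration.

```python
import math


def delta_bound(m, m_prime, delta, shatter):
    """Evaluate Delta(delta) for sample sizes m and m_prime.

    `shatter` is an assumed, caller-supplied function k -> Pi_E(k)
    (or an upper bound on it).
    """
    m_eff = m * m_prime / (math.sqrt(m) + math.sqrt(m_prime)) ** 2  # M'
    log_term = math.log(4.0 * (shatter(2 * m) + shatter(2 * m_prime)) / delta)
    return math.sqrt(8.0 * log_term / m_eff)


def confidence_interval(mu_hat, m, m_prime, delta, shatter):
    """Lower and upper estimates for the true discrepancy.

    Each endpoint holds with probability at least 1 - delta on its own;
    by a union bound both hold together with probability at least 1 - 2*delta.
    The lower endpoint is clipped at 0 because a discrepancy is non-negative.
    """
    width = delta_bound(m, m_prime, delta, shatter)
    return max(0.0, mu_hat - width), mu_hat + width


# Example with a linearly growing shattering function, Pi_E(k) = k + 1.
width = delta_bound(10_000, 10_000, 0.05, lambda k: k + 1)
```

With these (assumed) inputs the interval half-width is roughly 0.22, which shows that many thousands of examples are needed before small discrepancies can be certified.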

These confidence intervals can be used to build a procedure Test with finite-sample guarantees; see Chapter 2 for details. In particular, one can show the following useful result which quantifies the number of examples needed to make confident decisions using this Test in a particular case of interest.

Corollary 4.4.3. Let $S$ and $S'$ be i.i.d. samples from distributions $D$ and $D'$ respectively, and write $m = |S|$ and $m' = |S'|$. Suppose that $L^{\mathsf{E}}_\infty$ is such that $\Pi_{\mathsf{E}}(m) = \Theta(m)$. If $\mu_\star = L^{\mathsf{E}}_\infty(D, D') > 0$, then with probability at least $1 - \delta$ Test certifies this fact when $\min\{m, m'\} \geq N = \tilde{O}((1/\mu_\star^2) \ln(1/\delta))$. Furthermore, if $\mu_\star = L^{\mathsf{E}}_\infty(D, D') = 0$, then with probability at least $1 - \delta$ Test certifies $\mu_\star < \mu$ when $\min\{m, m'\} \geq N' = \tilde{O}((1/\mu^2) \ln(1/\delta))$.

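Now we give a general construction for a set of events $\mathsf{E}$ which guarantees that $L^{\mathsf{E}}_\infty$ is a distance under very mild conditions. Let $\mathsf{X}$ be a set of events on $\mathcal{X}$, i.e. $\mathsf{X} \subseteq 2^{\mathcal{X}}$, such that $L^{\mathsf{X}}_\infty$ is a distance between distributions over $\mathcal{X}$. Besides the usual $L_\infty$, this setting includes other well-known distances. For example, if $\mathcal{X} = \mathbb{R}$ and $\mathsf{X} = \{(-\infty, x] : x \in \mathbb{R}\}$, then $L^{\mathsf{X}}_\infty$ is the Kolmogorov–Smirnov distance. Another example is the total variation distance, which corresponds to taking $\mathsf{X}$ to be the full σ-algebra of $\mathcal{X}$ as a measure space; in particular $\mathsf{X} = 2^{\mathcal{X}}$ if $\mathcal{X}$ is finite or countable. Given $x \in \Sigma^+$ with $|x| = t$ and $X \in \mathsf{X}$ we write $E_{x,X}$ to denote the subset $x\Sigma^\star \times \mathcal{X}^{t-1} X \mathcal{X}^\star$ of $(\Sigma \times \mathcal{X})^\star$ that contains all $z = (w, y)$ such that $x \sqsubseteq w$ (that is, $x$ is a prefix of $w$) and $y_t \in X$. With this notation we define the following set of events on $(\Sigma \times \mathcal{X})^\star$:

$$\mathsf{E} = \{E_{x,X} : x \in \Sigma^+, X \in \mathsf{X}\} .$$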

It is easy to see in this case that $L^{\mathsf{E}}_\infty$ is in fact a distance. Since the proof involves the construction of a GPDFA with an infinite number of states, we give it only in the case of distributions over $(\Sigma \times \mathcal{X})^\star \times (\{\xi\} \times \mathcal{X})$ because it fits better into the model of GPDFA we use in this chapter.

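Proposition 4.4.4. Let $D$ and $D'$ be distributions over $(\Sigma \times \mathcal{X})^\star \times (\{\xi\} \times \mathcal{X})$. Suppose $L^{\mathsf{X}}_\infty$ is a distance between distributions on $\mathcal{X}$. If $D \neq D'$ then $L^{\mathsf{E}}_\infty(D, D') > 0$.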

Proof. We begin by noting that $D$ and $D'$ can be easily realized by two GPDFA with an infinite number of states. Indeed, let $A = \langle Q, \Sigma, \tau, \gamma, q_0, \xi, D \rangle$ be defined as follows: $Q = \Sigma^\star$; $q_0 = \lambda$; $\tau(x, \sigma) = x\sigma$ for all $x \in Q$; for all $x \in Q$ with $|x| = t$ and $\sigma \in \Sigma'$ let

$$\gamma(x, \sigma) = \frac{D((x_1, \mathcal{X}) \cdots (x_t, \mathcal{X})(\sigma, \mathcal{X})(\Sigma' \times \mathcal{X})^\star)}{D((x_1, \mathcal{X}) \cdots (x_t, \mathcal{X})(\Sigma' \times \mathcal{X})^\star)} ;$$

and for $x \in Q$ with $|x| = t$ and any measurable $X \subseteq \mathcal{X}$, define

$$D_{x,\sigma}(X) = \frac{D((x_1, \mathcal{X}) \cdots (x_t, \mathcal{X})(\sigma, X)(\Sigma' \times \mathcal{X})^\star)}{D((x_1, \mathcal{X}) \cdots (x_t, \mathcal{X})(\Sigma' \times \mathcal{X})^\star)} .$$

It is immediate to check that $A$ realizes $D$. Similarly we define an infinite GPDFA $A'$ realizing $D'$. Now, if $D \neq D'$, then $A$ and $A'$ are not equal, and this means there exists some string $x \in \Sigma^\star$ for which the corresponding local distributions on $A$ and $A'$ are different: $D_x \neq D'_x$. If there are many such $x$, we take the shortest one, resolving ties using a lexicographical order on $\Sigma'$. For such $x$ we must either have $\gamma(x, \sigma) \neq \gamma'(x, \sigma)$ for some $\sigma \in \Sigma'$ or $D_{x,\sigma}(X) \neq D'_{x,\sigma}(X)$ for some $\sigma \in \Sigma'$ and $X \in \mathsf{X}$, where this last claim follows from $L^{\mathsf{X}}_\infty$ being a distance. This certifies the existence of an event $E_{x,X} \in \mathsf{E}$ such that $|D(E_{x,X}) - D'(E_{x,X})| > 0$.

Observe that even if $L^{\mathsf{E}}_\infty$ is a distance, the set of events $\mathsf{E}$ does not necessarily have a slowly growing shattering function. In particular, it is not hard to see that if we allow samples with arbitrarily long strings then $\Pi_{\mathsf{E}}(m) = \Omega(2^m)$ for any non-trivial $\mathsf{X}$. Fortunately, since the distributions we will need to distinguish are not arbitrary, there is a workaround to this problem. In particular, if $D$ and $D'$ are distinct distributions over $(\Sigma' \times \mathcal{X})^\star$ realized by GPDFA $A_D$ and $A_{D'}$ with at most $n$ states, then we must have $|D(E_{x,X}) - D'(E_{x,X})| > 0$ for some $x \in \Sigma^{\leq n}$ and $X \in \mathsf{X}$. Indeed, note that since $A_D$ has at most $n$ states, any transition in $A_D$ can be accessed with a string of length at most $n$; thus, if $D$ and $D'$ assign the same probability to all events $E_{x,X}$ with $|x| \leq n$, then $A_D$ and $A_{D'}$ must define the same probability distribution. Hence, given $n$ and $\mathsf{X}$ we define

$$\mathsf{E}_n = \{E_{x,X} : x \in \Sigma^{1:n}, X \in \mathsf{X}\} .$$

Now we can bound $\Pi_{\mathsf{E}_n}(m)$ as a function of $n$ and $\Pi_{\mathsf{X}}(m)$.

Lemma 4.4.5. The following holds: $\Pi_{\mathsf{E}_n}(m) \leq 2mn\,\Pi_{\mathsf{X}}(m)$.

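Proof. The bound follows from well-known properties of shattering coefficients; see Appendix A.1 for details. Let us define for any $k \geq 1$ the following collection of events on $\mathcal{X}^\star$:

$$\mathsf{X}^{(k)} = \{\mathcal{X}^{k-1} X \mathcal{X}^\star : X \in \mathsf{X}\} .$$

Writing $\mathsf{P} = \{x\Sigma^\star : x \in \Sigma^\star\}$, it is immediate to see that we have $\mathsf{E}_n \subseteq \mathsf{P} \otimes \mathsf{X}^{(1:n)}$, where $\mathsf{X}^{(1:n)} = \cup_{k=1}^{n} \mathsf{X}^{(k)}$. Thus, the following holds for any $m$:

$$\Pi_{\mathsf{E}_n}(m) \leq \Pi_{\mathsf{P} \otimes \mathsf{X}^{(1:n)}}(m) \leq \Pi_{\mathsf{P}}(m)\,\Pi_{\mathsf{X}^{(1:n)}}(m) \leq 2mn\,\Pi_{\mathsf{X}}(m) ,$$

where we used Lemma A.5.2 and $\Pi_{\mathsf{X}^{(k)}}(m) = \Pi_{\mathsf{X}}(m)$.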
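The last inequality is consistent with the prefix events picking out at most $2m$ distinct subsets of any $m$ strings; this is our reading of the chain of inequalities rather than a statement quoted from Appendix A.1. The following sketch, with the hypothetical helper `prefix_dichotomies`, enumerates those subsets for one concrete sample so the count can be inspected. It checks a single sample only and is not a proof of the bound on the shattering function.

```python
def prefix_dichotomies(strings):
    """Distinct subsets of positions picked out by prefix events on the given strings."""
    # Any string that is a prefix of no sample element picks out the empty set.
    picked = {frozenset()}
    for s in strings:
        for k in range(len(s) + 1):
            x = s[:k]
            picked.add(
                frozenset(i for i, w in enumerate(strings) if w.startswith(x))
            )
    return picked


sample = ["aa", "ab", "ba", "bb"]
count = len(prefix_dichotomies(sample))
```

For this sample of $m = 4$ strings the count is 8, which matches $2m$ exactly.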

Now we turn our attention to the issue of how to actually compute the empirical $L^{\mathsf{E}_n}_\infty$ distance between two samples drawn from distributions $D$ and $D'$ over $(\Sigma \times \mathcal{X})^\star$. First we introduce some notation. Recall that if $S_\Sigma$ is a multiset of strings from $\Sigma^\star$ we use $S_\Sigma(x\Sigma^\star)$ to denote the empirical frequency of prefix $x$ in $S_\Sigma$. We use $\mathrm{pref}(S_\Sigma)$ to denote the set of all prefixes of strings in $S_\Sigma$. Furthermore, recall that if $S_{\mathcal{X}}$ is a multiset of elements from $\mathcal{X}$ we use $S_{\mathcal{X}}(X)$ to denote the empirical frequency of $X$ in $S_{\mathcal{X}}$. Let $x \in \Sigma^\star$ with $x = x_1 \cdots x_t$. For any $n \geq 1$ we write $x_{1:n}$ to denote $x_1 \cdots x_n$ if $n \leq t$ and $x_1 \cdots x_t$ otherwise. Next, let $S = (z^1, \ldots, z^m)$ be a multiset from $(\Sigma \times \mathcal{X})^\star$ with $z^i = (x^i, y^i)$, $x^i \in \Sigma^\star$, and $y^i \in \mathcal{X}^{|x^i|}$. We define $S_{\Sigma^\star} = (x^1, \ldots, x^m)$ and $S_{\Sigma^{\leq n}} = (x^1_{1:n}, \ldots, x^m_{1:n})$ for any $n \geq 1$. Furthermore, given $x \in \Sigma^\star$ with $|x| = t$ we write $S_{x,\mathcal{X}}$ to denote the multiset from $\mathcal{X}$ containing all the $y^i_t$ for which $i$ is such that $x \sqsubseteq x^i$.

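Now suppose we are given two samples $S$ and $S'$ from distributions over $(\Sigma \times \mathcal{X})^\star$ and let $U = \mathrm{pref}(S_{\Sigma^{\leq n}} \cup S'_{\Sigma^{\leq n}})$. Then we can write $L^{\mathsf{E}_n}_\infty(S, S')$ as the following maximum:

$$L^{\mathsf{E}_n}_\infty(S, S') = \max_{x \in U} \max_{X \in \mathsf{X}} \left| S_{\Sigma^\star}(x\Sigma^\star)\, S_{x,\mathcal{X}}(X) - S'_{\Sigma^\star}(x\Sigma^\star)\, S'_{x,\mathcal{X}}(X) \right| .$$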

It is obvious that the maximum over $U$ can be computed in time $O(|U|)$ if we know

$$\max_{X \in \mathsf{X}} \left| S_{\Sigma^\star}(x\Sigma^\star)\, S_{x,\mathcal{X}}(X) - S'_{\Sigma^\star}(x\Sigma^\star)\, S'_{x,\mathcal{X}}(X) \right|$$

for each $x \in U$. In particular, it is easy to see that we have $|U| = O(n(m + m'))$ by bounding the number of prefixes in $S_{\Sigma^{\leq n}} \cup S'_{\Sigma^{\leq n}}$. Thus, we are left with the problem of computing

$$\max_{X \in \mathsf{X}} \left| p\, S_{\mathcal{X}}(X) - p'\, S'_{\mathcal{X}}(X) \right| ,$$

where $p, p' \in [0, 1]$ are arbitrary and $S_{\mathcal{X}}$ and $S'_{\mathcal{X}}$ are multisets over $\mathcal{X}$. Note that by definition of shattering coefficients the quantity inside the maximum cannot take more than $\Pi_{\mathsf{X}}(|S_{\mathcal{X}}| + |S'_{\mathcal{X}}|)$ different values. However, how the actual computation of this maximum can be done in each case depends on the particular structure of $\mathcal{X}$ and $\mathsf{X}$. We now give three illustrative examples of how to perform such a computation.
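The outer maximum over prefixes can be sketched independently of the choice of events on $\mathcal{X}$. The code below is an illustrative sketch under several assumptions of ours: each example is a pair `(w, y)` with `w` a Python string and `y` a sequence of the same length, only non-empty prefixes of length at most `n` are considered (matching $x \in \Sigma^{1:n}$), and the inner maximum is delegated to a hypothetical callback `inner_max(p, s, p_prime, s_prime)` that must treat an empty multiset as contributing 0.

```python
from collections import defaultdict


def group_by_prefix(sample, n):
    """Map each non-empty prefix x with |x| <= n to the multiset S_{x,X}.

    For every example (w, y) having x as a prefix, the multiset contains
    the symbol y_t with t = |x| (stored at index t - 1).
    """
    groups = defaultdict(list)
    for w, y in sample:
        for t in range(1, min(n, len(w)) + 1):
            groups[w[:t]].append(y[t - 1])
    return groups


def empirical_en_distance(sample_a, sample_b, n, inner_max):
    """Outer maximum over prefixes; the inner maximum over events is a callback."""
    groups_a = group_by_prefix(sample_a, n)
    groups_b = group_by_prefix(sample_b, n)
    best = 0.0
    for x in set(groups_a) | set(groups_b):
        s_a, s_b = groups_a.get(x, []), groups_b.get(x, [])
        # Each example with prefix x contributes exactly one element to the
        # multiset, so its size divided by the sample size is the prefix frequency.
        p = len(s_a) / len(sample_a)
        p_prime = len(s_b) / len(sample_b)
        best = max(best, inner_max(p, s_a, p_prime, s_b))
    return best


# With the single event X = whole space, the inner maximum is just |p - p'|.
a = [("ab", [0, 1]), ("ab", [0, 1])]
b = [("ab", [0, 0]), ("ab", [0, 0])]
prefix_only = empirical_en_distance(a, b, 2, lambda p, s, q, r: abs(p - q))
```

In this usage example both samples contain the same strings, so the prefix-only discrepancy is 0 even though the second components differ; a richer callback is needed to detect that difference.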

4.4.1 Examples with X = ∆

We begin by taking $\mathcal{X} = \Delta$, a finite alphabet, and $L^{\mathsf{X}}_\infty$ to be the usual $L_\infty$ distance; that is, $\mathsf{X} = \{\{\delta\} : \delta \in \Delta\}$. Hence, if $S$ and $S'$ are samples from $\Delta$, then we can compute

$$\max_{\delta \in \Delta} |p\,S(\delta) - p'\,S'(\delta)|$$

by reading each sample once, storing the frequencies of each symbol in a table, and then taking the maximum over the $|\Delta|$ possible differences. Assuming constant time read-write data structures, this computation takes time $O(|S| + |S'| + |\Delta|)$. Note also that in this case we have $\Pi_{\mathsf{X}}(m) \leq |\Delta| + 1$.
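The table-based computation described above can be sketched as follows. The function name `max_singleton_gap` is ours, and the sketch iterates only over symbols that occur in at least one sample, which is enough because unseen symbols contribute a difference of 0.

```python
from collections import Counter


def max_singleton_gap(p, sample, p_prime, sample_prime):
    """Maximum over symbols d of |p * S(d) - p' * S'(d)| using frequency tables."""
    table, table_prime = Counter(sample), Counter(sample_prime)
    m, m_prime = len(sample), len(sample_prime)
    best = 0.0
    for symbol in set(table) | set(table_prime):
        # An empty sample is treated as assigning weight 0 to every symbol.
        left = p * table[symbol] / m if m else 0.0
        right = p_prime * table_prime[symbol] / m_prime if m_prime else 0.0
        best = max(best, abs(left - right))
    return best


gap = max_singleton_gap(0.5, list("aab"), 1.0, list("abbb"))
```

Here the symbol `b` achieves the maximum, with weighted frequencies 1/6 and 3/4.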

In our second example we take $\mathcal{X} = \Delta$ again but now let $L^{\mathsf{X}}_\infty$ be the total variation distance; that is, $\mathsf{X} = \{X : X \subseteq \Delta\} = 2^{\Delta}$. We note that in this particular example we have $\Pi_{\mathsf{X}}(m) \leq 2^{|\Delta|}$. In this case, given samples $S$ and $S'$, the naive approach to compute

$$\max_{X \subseteq \Delta} |p\,S(X) - p'\,S'(X)|$$

would take time $\Theta(2^{|\Delta|})$. However, we can take advantage of a property that this maximum shares with the usual total variation distance. In particular, using the following result we see that this computation can also be done in time $O(|S| + |S'| + |\Delta|)$.

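Lemma 4.4.6. For any $p, p' \in [0, 1]$ and any samples $S$ and $S'$ on $\Delta$ the following holds:

$$\max_{X \subseteq \Delta} |p\,S(X) - p'\,S'(X)| = \frac{|p - p'|}{2} + \frac{1}{2} \sum_{\delta \in \Delta} |p\,S(\delta) - p'\,S'(\delta)| .$$

Proof. Let us assume without loss of generality that $p \geq p'$. Then it is easy to see using a standard argument that the maximum over $X \subseteq \Delta$ is achieved on

$$X = \{\delta : p\,S(\delta) \geq p'\,S'(\delta)\} .$$

On the other hand, using $p = \sum_{\delta} p\,S(\delta)$ and $p' = \sum_{\delta} p'\,S'(\delta)$, we have

$$\begin{aligned}
\sum_{\delta \in \Delta} |p\,S(\delta) - p'\,S'(\delta)|
&= \sum_{\delta \in X} (p\,S(\delta) - p'\,S'(\delta)) + \sum_{\delta \in \bar{X}} (p'\,S'(\delta) - p\,S(\delta)) \\
&= \sum_{\delta \in X} p\,S(\delta) - \Big(p - \sum_{\delta \in X} p\,S(\delta)\Big) - \sum_{\delta \in X} p'\,S'(\delta) + \Big(p' - \sum_{\delta \in X} p'\,S'(\delta)\Big) \\
&= p' - p + 2\,(p\,S(X) - p'\,S'(X)) .
\end{aligned}$$

Rearranging and using $p \geq p'$, so that $|p - p'| = p - p'$, yields the claim.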
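The identity in Lemma 4.4.6 is easy to check numerically on small alphabets by comparing the closed form with an exhaustive search over subsets. The sketch below assumes non-empty samples whose symbols all belong to the given alphabet; both function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def weighted_tv_closed_form(p, sample, p_prime, sample_prime, alphabet):
    """Right-hand side of the lemma, computed in one pass over the alphabet."""
    table, table_prime = Counter(sample), Counter(sample_prime)
    total = sum(
        abs(p * table[d] / len(sample) - p_prime * table_prime[d] / len(sample_prime))
        for d in alphabet
    )
    return abs(p - p_prime) / 2 + total / 2


def weighted_tv_brute_force(p, sample, p_prime, sample_prime, alphabet):
    """Left-hand side of the lemma, by enumerating every subset of the alphabet."""
    table, table_prime = Counter(sample), Counter(sample_prime)
    best = 0.0
    for size in range(len(alphabet) + 1):
        for subset in combinations(alphabet, size):
            left = sum(p * table[d] / len(sample) for d in subset)
            right = sum(p_prime * table_prime[d] / len(sample_prime) for d in subset)
            best = max(best, abs(left - right))
    return best


args = (0.5, list("aab"), 1.0, list("abbb"), "ab")
closed, brute = weighted_tv_closed_form(*args), weighted_tv_brute_force(*args)
```

The exhaustive version is exponential in the alphabet size and is only meant as a sanity check; the closed form is what gives the linear running time claimed in the text.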

4.4.2 Example with X = R

In the last example we let $\mathcal{X} = \mathbb{R}$ and $L^{\mathsf{X}}_\infty$ be the Kolmogorov–Smirnov distance; that is, $\mathsf{X} = \{(-\infty, x] : x \in \mathbb{R}\}$. Now note that if $S$ is a sample from $\mathbb{R}$, in order to compute $S((-\infty, x])$ we only need to count how many points in $S$ are less than or equal to $x$. Furthermore, it is obvious that while $x$ can range over $\mathbb{R}$, $S((-\infty, x])$ can only take values of the form $k/|S|$ for $0 \leq k \leq |S|$. Thus, given two samples $S$ and $S'$, we can compute

$$\max_{x \in \mathbb{R}} |p\,S((-\infty, x]) - p'\,S'((-\infty, x])|$$

by sorting the values in $S$ and $S'$ and then taking the maximum over the points $x \in S \cup S'$. Overall, this computation takes time $O(|S| \log |S| + |S'| \log |S'|)$. Furthermore, it is easy to see that we have
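The sorting-based computation described in this example can be sketched as follows. This is an illustrative sketch assuming non-empty samples of real numbers; the name `max_weighted_cdf_gap` is ours. It sorts both samples and sweeps once over the distinct observed points with two pointers.

```python
def max_weighted_cdf_gap(p, sample, p_prime, sample_prime):
    """Maximum over x of |p * S((-inf, x]) - p' * S'((-inf, x])|."""
    xs, ys = sorted(sample), sorted(sample_prime)
    i = j = 0  # number of points in xs and ys that are <= the current point
    best = 0.0
    for t in sorted(set(xs) | set(ys)):
        while i < len(xs) and xs[i] <= t:
            i += 1
        while j < len(ys) and ys[j] <= t:
            j += 1
        best = max(best, abs(p * i / len(xs) - p_prime * j / len(ys)))
    return best


gap = max_weighted_cdf_gap(1.0, [1, 2, 3], 1.0, [2, 3, 4])
```

Only the observed points need to be examined because both empirical distribution functions are constant between consecutive sample points; the value at the largest point equals $|p - p'|$, which covers the limit as $x$ grows.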
