In this section we present a generic approach for implementing the TEST function used in Algorithm 4, together with some examples. Our approach generalizes the use of the $L_\infty$ and $L_\infty^p$ distances for distinguishing states in algorithms for learning PDFA (see Chapter 2 for details). Basically, we define a measure of distinguishability by taking the supremum, over some collection of events, of the difference between the probabilities that two distributions assign to these events. In particular, we will choose the collection of events in a way that lets us use VC bounds to obtain finite-sample guarantees like the ones required by Assumption 1.
We begin by recalling the definition of the $L_\infty$ and $L_\infty^p$ distances between distributions on $\Sigma^\star$. This will hint at the correct generalization to distributions over $(\Sigma \times \mathcal{X})^\star$. Given distributions $D$ and $D'$ over $\Sigma^\star$, the supremum distance $L_\infty$ is defined as
$$L_\infty(D, D') = \sup_{x \in \Sigma^\star} |D(x) - D'(x)| .$$
The prefix supremum distance $L_\infty^p$ is defined as
$$L_\infty^p(D, D') = \sup_{x \in \Sigma^\star} |D(x\Sigma^\star) - D'(x\Sigma^\star)| .$$
Looking at these definitions, it is immediate that both distances are particular instances of a general measure of discrepancy between distributions given by
$$L_\infty^{\mathsf{E}}(D, D') = \sup_{E \in \mathsf{E}} |D(E) - D'(E)| ,$$
where $\mathsf{E}$ is a collection of events over $\Sigma^\star$. In particular, $\mathsf{E} = \{\{x\} : x \in \Sigma^\star\}$ in the case of $L_\infty$, and $\mathsf{E} = \{x\Sigma^\star : x \in \Sigma^\star\}$ in the case of $L_\infty^p$. Now, since there is nothing special about $\Sigma^\star$ in the definition of $L_\infty^{\mathsf{E}}$, we can use this general form to define discrepancy measures between distributions on $(\Sigma \times \mathcal{X})^\star$.
Note that in general $L_\infty^{\mathsf{E}}$ is not necessarily a distance: if $\mathsf{E}$ is not large enough, there may exist distributions $D \neq D'$ such that $L_\infty^{\mathsf{E}}(D, D') = 0$. On the other hand, we will see that very large sets of events $\mathsf{E}$ yield worse statistical performance when estimating $L_\infty^{\mathsf{E}}(D, D')$ from samples, in addition to requiring more computational effort to compute such estimates. We will thus need to balance this trade-off if we want to obtain efficient learning algorithms for large classes of distributions on $(\Sigma \times \mathcal{X})^\star$ using tests based on this principle. In the most general case, when defining $\mathsf{E}$ one should also take into account measurability issues that may arise if $\mathcal{X}$ is a continuous space, for example $\mathcal{X} = \mathbb{R}$ equipped with the Lebesgue measure. We will however ignore such considerations, because measurability problems will not appear in the applications we have in mind.
An important and desirable property of $L_\infty^{\mathsf{E}}$ is that its true value can be estimated from random samples. We will see that this is a statistical property that is independent of whether $L_\infty^{\mathsf{E}}$ defines a proper distance or not. Also note that the statistical possibility of estimating $L_\infty^{\mathsf{E}}$ from a finite set of random examples does not a priori imply that such an estimation can be performed efficiently. Let $D$ be a distribution over $(\Sigma \times \mathcal{X})^\star$ and let $S = (z_1, \ldots, z_m)$ be a sample of $m$ i.i.d. examples from $D$. Given an event $E \in \mathsf{E}$ we define the empirical probability of $E$ with respect to the sample $S$ as
$$S(E) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{z_i \in E} .$$
If $S'$ is an i.i.d. sample with $m'$ examples from another distribution $D'$, we define the empirical discrepancy between $S$ and $S'$ as
$$L_\infty^{\mathsf{E}}(S, S') = \sup_{E \in \mathsf{E}} |S(E) - S'(E)| .$$
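As an illustration, the empirical discrepancy can be evaluated directly for the two classical event collections on plain strings: singletons (the empirical supremum distance) and prefix sets (the empirical prefix supremum distance). The following Python sketch assumes that samples are non-empty lists of Python strings; the function names are hypothetical and not part of the original text.

```python
from collections import Counter


def empirical_linf(sample_a, sample_b):
    """Empirical discrepancy over singleton events {x}."""
    m_a, m_b = len(sample_a), len(sample_b)
    freq_a, freq_b = Counter(sample_a), Counter(sample_b)
    # Strings absent from both samples contribute 0, so the supremum
    # is attained on the finite set of observed strings.
    support = set(freq_a) | set(freq_b)
    return max(abs(freq_a[x] / m_a - freq_b[x] / m_b) for x in support)


def prefix_counts(sample):
    """Count, for every prefix x, how many strings in the sample start with x."""
    counts = Counter()
    for s in sample:
        for t in range(len(s) + 1):
            counts[s[:t]] += 1
    return counts


def empirical_prefix_linf(sample_a, sample_b):
    """Empirical discrepancy over prefix events x Sigma^*."""
    m_a, m_b = len(sample_a), len(sample_b)
    pref_a, pref_b = prefix_counts(sample_a), prefix_counts(sample_b)
    # Only prefixes of observed strings can have non-zero frequency.
    support = set(pref_a) | set(pref_b)
    return max(abs(pref_a[x] / m_a - pref_b[x] / m_b) for x in support)
```

For instance, the samples `["aa", "ab"]` and `["ba", "bb"]` are separated more strongly by prefix events than by singletons, because the prefix `a` has frequency 1 in the first sample and 0 in the second.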
We note that even though $S$ and $S'$ are finite, computing this quantity requires evaluating a maximum over $\mathsf{E}$. This computation, and the statistical accuracy of the estimate $L_\infty^{\mathsf{E}}(S, S')$, will depend on the shattering function of $\mathsf{E}$, defined in Appendix A.1 and denoted by $\Pi_{\mathsf{E}}$. As in Chapter 2, we use this function to define, for any $0 < \delta < 1$, the following quantity:
$$\Delta(\delta) = \sqrt{\frac{8}{M'} \ln \frac{4(\Pi_{\mathsf{E}}(2m) + \Pi_{\mathsf{E}}(2m'))}{\delta}} ,$$
where $M' = mm'/(\sqrt{m} + \sqrt{m'})^2$. Now let us write $\mu_\star = L_\infty^{\mathsf{E}}(D, D')$ and $\hat{\mu} = L_\infty^{\mathsf{E}}(S, S')$. With these definitions, the following two results follow from the same proofs used in Chapter 2. Note that probabilities in these results are with respect to the sampling of $S$ and $S'$.
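To make the confidence width concrete, the following Python sketch evaluates the quantity defined above for given sample sizes. It assumes the shattering function is supplied by the caller as a callable `shatter`; the function name `delta_bound` and its parameter names are hypothetical.

```python
import math


def delta_bound(m, m_prime, confidence, shatter):
    """Confidence width Delta(delta) for sample sizes m and m'.

    `confidence` plays the role of delta (0 < delta < 1) and `shatter`
    is a callable returning the shattering coefficient of the event
    collection at a given sample size (an assumption of this sketch).
    """
    # M' = m m' / (sqrt(m) + sqrt(m'))^2
    big_m = m * m_prime / (math.sqrt(m) + math.sqrt(m_prime)) ** 2
    growth = shatter(2 * m) + shatter(2 * m_prime)
    return math.sqrt((8.0 / big_m) * math.log(4.0 * growth / confidence))
```

With a linearly growing shattering function, for example `delta_bound(400, 400, 0.05, lambda k: k + 1)`, the width shrinks roughly like the inverse square root of the sample size, up to logarithmic factors.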
Proposition 4.4.1. With probability at least $1 - \delta$ we have $\mu_\star \leq \hat{\mu} + \Delta(\delta)$.
Proposition 4.4.2. With probability at least $1 - \delta$ we have $\mu_\star \geq \hat{\mu} - \Delta(\delta)$.
These confidence intervals can be used to build a procedure Test with finite-sample guarantees; see Chapter 2 for details. In particular, one can show the following useful result which quantifies the number of examples needed to make confident decisions using this Test in a particular case of interest.
Corollary 4.4.3. Let $S$ and $S'$ be i.i.d. samples from distributions $D$ and $D'$ respectively, and write $m = |S|$ and $m' = |S'|$. Suppose that $L_\infty^{\mathsf{E}}$ is such that $\Pi_{\mathsf{E}}(m) = \Theta(m)$. If $\mu_\star = L_\infty^{\mathsf{E}}(D, D') > 0$, then with probability at least $1 - \delta$ Test certifies this fact when $\min\{m, m'\} \geq N = \tilde{O}((1/\mu_\star^2) \ln(1/\delta))$. Furthermore, if $\mu_\star = L_\infty^{\mathsf{E}}(D, D') = 0$, then with probability at least $1 - \delta$ Test certifies $\mu_\star < \mu$ when $\min\{m, m'\} \geq N' = \tilde{O}((1/\mu^2) \ln(1/\delta))$.
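The precise Test procedure is given in Chapter 2 and is not reproduced here. As a rough sketch only, one plausible decision rule built on the two confidence bounds above could look as follows in Python. The function name `certify`, its return labels, and the three-way outcome are assumptions of this sketch, not the author's definition.

```python
def certify(mu_hat, width, mu):
    """Hypothetical decision rule based on the two confidence bounds.

    mu_hat : empirical discrepancy between the two samples
    width  : confidence width Delta(delta) for the chosen delta
    mu     : threshold below which the true discrepancy should be certified
    """
    lower = mu_hat - width  # lower confidence bound on the true discrepancy
    upper = mu_hat + width  # upper confidence bound on the true discrepancy
    if lower > 0:
        return "distinct"   # true discrepancy certified to be positive
    if upper < mu:
        return "similar"    # true discrepancy certified to be below mu
    return "unknown"        # not enough data to decide either way
```

Under this rule, larger samples shrink `width`, so the "unknown" outcome becomes less likely, which is in line with the sample-size requirements stated in the corollary.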
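Now we give a general construction for a set of events $\mathsf{E}$ which guarantees that $L_\infty^{\mathsf{E}}$ is a distance under very mild conditions. Let $\mathsf{X}$ be a set of events on $\mathcal{X}$, i.e. $\mathsf{X} \subseteq 2^{\mathcal{X}}$, such that $L_\infty^{\mathsf{X}}$ is a distance between distributions over $\mathcal{X}$. Besides the usual $L_\infty$, this setting includes other well-known distances. For example, if $\mathcal{X} = \mathbb{R}$ and $\mathsf{X} = \{(-\infty, x] : x \in \mathbb{R}\}$, then $L_\infty^{\mathsf{X}}$ is the Kolmogorov–Smirnov distance. Another example is the total variation distance, which corresponds to taking $\mathsf{X}$ to be the full σ-algebra of $\mathcal{X}$ as a measure space; in particular, $\mathsf{X} = 2^{\mathcal{X}}$ if $\mathcal{X}$ is finite or countable. Given $x \in \Sigma^+$ with $|x| = t$ and $X \in \mathsf{X}$, we write $E_{x,X}$ to denote the subset $x\Sigma^\star \times \mathcal{X}^{t-1} X \mathcal{X}^\star$ of $(\Sigma \times \mathcal{X})^\star$ that contains all $z = (w, y)$ such that $x \sqsubseteq_0 w$ (that is, $x$ is a prefix of $w$) and $y_t \in X$. With this notation we define the following set of events on $(\Sigma \times \mathcal{X})^\star$:
$$\mathsf{E} = \{E_{x,X} : x \in \Sigma^+, X \in \mathsf{X}\} .$$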
It is easy to see that in this case $L_\infty^{\mathsf{E}}$ is in fact a distance. Since the proof involves the construction of a GPDFA with an infinite number of states, we give it only in the case of distributions over $(\Sigma \times \mathcal{X})^\star \times (\{\xi\} \times \mathcal{X})$, because it fits better into the model of GPDFA we use in this chapter.
Proposition 4.4.4. Let $D$ and $D'$ be distributions over $(\Sigma \times \mathcal{X})^\star \times (\{\xi\} \times \mathcal{X})$. Suppose $L_\infty^{\mathsf{X}}$ is a distance between distributions on $\mathcal{X}$. If $D \neq D'$ then $L_\infty^{\mathsf{E}}(D, D') > 0$.
Proof. We begin by noting that $D$ and $D'$ can be easily realized by two GPDFA with an infinite number of states. Indeed, let $A = \langle Q, \Sigma, \tau, \gamma, q_0, \xi, D \rangle$ be defined as follows: $Q = \Sigma^\star$; $q_0 = \lambda$; $\tau(x, \sigma) = x\sigma$ for all $x \in Q$; for all $x \in Q$ with $|x| = t$ and $\sigma \in \Sigma'$ let
$$\gamma(x, \sigma) = \frac{D((x_1, \mathcal{X}) \cdots (x_t, \mathcal{X})(\sigma, \mathcal{X})(\Sigma' \times \mathcal{X})^\star)}{D((x_1, \mathcal{X}) \cdots (x_t, \mathcal{X})(\Sigma' \times \mathcal{X})^\star)} ;¹$$
and for $x \in Q$ with $|x| = t$ and any measurable $X \subseteq \mathcal{X}$, define
$$D_{x,\sigma}(X) = \frac{D((x_1, \mathcal{X}) \cdots (x_t, \mathcal{X})(\sigma, X)(\Sigma' \times \mathcal{X})^\star)}{D((x_1, \mathcal{X}) \cdots (x_t, \mathcal{X})(\Sigma' \times \mathcal{X})^\star)} .$$
It is immediate to check that $A$ realizes $D$. Similarly we define an infinite GPDFA $A'$ realizing $D'$. Now, if $D \neq D'$, then $A$ and $A'$ are not equal, and this means there exists some string $x \in \Sigma^\star$ for which the corresponding local distributions on $A$ and $A'$ are different: $D_x \neq D'_x$. If there are many such $x$, we take the shortest one, resolving ties using a lexicographical order on $\Sigma'$. For such $x$ we must either have $\gamma(x, \sigma) \neq \gamma'(x, \sigma)$ for some $\sigma \in \Sigma'$, or $D_{x,\sigma}(X) \neq D'_{x,\sigma}(X)$ for some $\sigma \in \Sigma'$ and $X \in \mathsf{X}$, where this last claim follows from $L_\infty^{\mathsf{X}}$ being a distance. This certifies the existence of an event $E_{x,X} \in \mathsf{E}$ such that $|D(E_{x,X}) - D'(E_{x,X})| > 0$.
Observe that even if $L_\infty^{\mathsf{E}}$ is a distance, the set of events $\mathsf{E}$ does not necessarily have a slowly growing shattering function. In particular, it is not hard to see that if we allow samples with arbitrarily long strings then $\Pi_{\mathsf{E}}(m) = \Omega(2^m)$ for any non-trivial $\mathsf{X}$. Fortunately, since the distributions we will need to distinguish are not arbitrary, there is a workaround to this problem. In particular, if $D$ and $D'$ are distinct distributions over $(\Sigma' \times \mathcal{X})^\star$ realized by GPDFA $A_D$ and $A_{D'}$ with at most $n$ states, then we must have $|D(E_{x,X}) - D'(E_{x,X})| > 0$ for some $x \in \Sigma^{\leq n}$ and $X \in \mathsf{X}$. Indeed, note that since $A_D$ has at most $n$ states, any transition in $A_D$ can be accessed with a string of length at most $n$; thus, if $D$ and $D'$ assign the same probability to all events $E_{x,X}$ with $|x| \leq n$, then $A_D$ and $A_{D'}$ must define the same probability distribution. Hence, given $n$ and $\mathsf{X}$ we define
$$\mathsf{E}_n = \{E_{x,X} : x \in \Sigma^{1:n}, X \in \mathsf{X}\} .$$
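Now we can bound $\Pi_{\mathsf{E}_n}(m)$ as a function of $n$ and $\Pi_{\mathsf{X}}(m)$.
Lemma 4.4.5. The following holds: $\Pi_{\mathsf{E}_n}(m) \leq 2mn\,\Pi_{\mathsf{X}}(m)$.
Proof. The bound follows from well-known properties of shattering coefficients; see Appendix A.1 for details. Let us define for any $k \geq 1$ the following collection of events on $\mathcal{X}^\star$:
$$\mathsf{X}^{(k)} = \{\mathcal{X}^{k-1} X \mathcal{X}^\star : X \in \mathsf{X}\} .$$
Writing $\mathsf{P} = \{x\Sigma^\star : x \in \Sigma^\star\}$, it is immediate that $\mathsf{E}_n \subseteq \mathsf{P} \otimes \mathsf{X}^{(1:n)}$, where $\mathsf{X}^{(1:n)} = \cup_{k=1}^{n} \mathsf{X}^{(k)}$. Thus, the following holds for any $m$:
$$\Pi_{\mathsf{E}_n}(m) \leq \Pi_{\mathsf{P} \otimes \mathsf{X}^{(1:n)}}(m) \leq \Pi_{\mathsf{P}}(m)\,\Pi_{\mathsf{X}^{(1:n)}}(m) \leq 2mn\,\Pi_{\mathsf{X}}(m) ,$$
where we used Lemma A.5.2 and $\Pi_{\mathsf{X}^{(k)}}(m) = \Pi_{\mathsf{X}}(m)$.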
Now we turn our attention to the issue of how to actually compute the empirical $L_\infty^{\mathsf{E}_n}$ distance between two samples drawn from distributions $D$ and $D'$ over $(\Sigma \times \mathcal{X})^\star$. First we introduce some notation. Recall that if $S_\Sigma$ is a multiset of strings from $\Sigma^\star$ we use $S_\Sigma(x\Sigma^\star)$ to denote the empirical frequency of prefix $x$ in $S_\Sigma$. We use $\mathrm{pref}(S_\Sigma)$ to denote the set of all prefixes of strings in $S_\Sigma$. Furthermore, recall that if $S_{\mathcal{X}}$ is a multiset of elements from $\mathcal{X}$ we use $S_{\mathcal{X}}(X)$ to denote the empirical frequency of $X$ in $S_{\mathcal{X}}$. Let $x \in \Sigma^\star$ with $x = x_1 \cdots x_t$. For any $n \geq 1$ we write $x_{1:n}$ to denote $x_1 \cdots x_n$ if $n \leq t$ and $x_1 \cdots x_t$ otherwise. Next, let $S = (z_1, \ldots, z_m)$ be a multiset from $(\Sigma \times \mathcal{X})^\star$ with $z_i = (x^i, y^i)$, $x^i \in \Sigma^\star$, and $y^i \in \mathcal{X}^{|x^i|}$. We define $S_{\Sigma^\star} = (x^1, \ldots, x^m)$ and $S_{\Sigma^{\leq n}} = (x^1_{1:n}, \ldots, x^m_{1:n})$ for any $n \geq 1$. Furthermore, given $x \in \Sigma^\star$ with $|x| = t$ we write $S_{x,\mathcal{X}}$ to denote the multiset from $\mathcal{X}$ containing all the $y^i_t$ for which $i$ is such that $x \sqsubseteq_0 x^i$.
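Now suppose we are given two samples $S$ and $S'$ from distributions over $(\Sigma \times \mathcal{X})^\star$ and let $U = \mathrm{pref}(S_{\Sigma^{\leq n}} \cup S'_{\Sigma^{\leq n}})$. Then we can write $L_\infty^{\mathsf{E}_n}(S, S')$ as the following maximum:
$$L_\infty^{\mathsf{E}_n}(S, S') = \max_{x \in U} \max_{X \in \mathsf{X}} |S_{\Sigma^\star}(x\Sigma^\star) S_{x,\mathcal{X}}(X) - S'_{\Sigma^\star}(x\Sigma^\star) S'_{x,\mathcal{X}}(X)| .$$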
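The double maximum above can be sketched naively in Python as follows. This is only an illustration under stated assumptions: each example is a pair `(x, y)` where `x` is a string over the alphabet and `y` is a sequence of observations of the same length, and the collection of events on the observation space is a finite list of membership predicates `events` supplied by the caller. All names are hypothetical, and only non-empty prefixes are considered since the events are indexed by non-empty strings.

```python
from collections import defaultdict


def group_by_prefix(sample, n):
    """Map each non-empty prefix x with |x| <= n to the observations y_t, t = |x|."""
    groups = defaultdict(list)
    for x, y in sample:
        for t in range(1, min(n, len(x)) + 1):
            groups[x[:t]].append(y[t - 1])
    return groups


def empirical_discrepancy(sample, sample_prime, n, events):
    """Naive maximum over observed prefixes and over a finite list of events."""
    m, m_prime = len(sample), len(sample_prime)
    groups = group_by_prefix(sample, n)
    groups_prime = group_by_prefix(sample_prime, n)
    best = 0.0
    for x in set(groups) | set(groups_prime):
        ys, ys_prime = groups.get(x, []), groups_prime.get(x, [])
        for event in events:
            # Prefix frequency times conditional event frequency reduces to
            # (number of matching examples) / (sample size).
            a = sum(1 for y in ys if event(y)) / m
            b = sum(1 for y in ys_prime if event(y)) / m_prime
            best = max(best, abs(a - b))
    return best
```

This brute-force inner loop costs time proportional to the number of events for every prefix; exploiting the structure of a specific event collection can make the inner maximization much cheaper.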
It is obvious that the maximum over $U$ can be computed in time $O(|U|)$ if we know
$$\max_{X \in \mathsf{X}} |S_{\Sigma^\star}(x\Sigma^\star) S_{x,\mathcal{X}}(X) - S'_{\Sigma^\star}(x\Sigma^\star) S'_{x,\mathcal{X}}(X)|$$
for each $x \in U$. In particular, it is easy to see that we have $|U| = O(n(m + m'))$ by bounding the number of prefixes in $S_{\Sigma^{\leq n}} \cup S'_{\Sigma^{\leq n}}$. Thus, we are left with the problem of computing
$$\max_{X \in \mathsf{X}} |p S_{\mathcal{X}}(X) - p' S'_{\mathcal{X}}(X)| ,$$
where $p, p' \in [0, 1]$ are arbitrary and $S_{\mathcal{X}}$ and $S'_{\mathcal{X}}$ are multisets over $\mathcal{X}$. Note that by definition of shattering coefficients the quantity inside the maximum cannot take more than $\Pi_{\mathsf{X}}(|S_{\mathcal{X}}| + |S'_{\mathcal{X}}|)$ different values. However, how the actual computation of this maximum can be done in each case depends on the particular structure of $\mathcal{X}$ and $\mathsf{X}$. We now give three illustrative examples of how to perform such a computation.
4.4.1 Examples with $\mathcal{X} = \Delta$
We begin by taking $\mathcal{X} = \Delta$, a finite alphabet, and $L_\infty^{\mathsf{X}}$ to be the usual $L_\infty$ distance; that is, $\mathsf{X} = \{\{\delta\} : \delta \in \Delta\}$. Hence, if $S$ and $S'$ are samples from $\Delta$, then we can compute
$$\max_{\delta \in \Delta} |p S(\delta) - p' S'(\delta)|$$
by reading each sample once, storing the frequencies of each symbol in a table, and then taking the maximum over the $|\Delta|$ possible differences. Assuming constant-time read-write data structures, this computation takes time $O(|S| + |S'| + |\Delta|)$. Note also that in this case we have $\Pi_{\mathsf{X}}(m) \leq |\Delta| + 1$.
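The table-based computation just described can be sketched in Python as follows. The sketch assumes both samples are non-empty sequences of hashable symbols and that `alphabet` lists the symbols of the finite alphabet; the function name `weighted_linf` is hypothetical.

```python
from collections import Counter


def weighted_linf(p, sample, p_prime, sample_prime, alphabet):
    """Maximum over single symbols d of |p * S(d) - p' * S'(d)|."""
    # One pass over each sample builds the frequency tables.
    freq, freq_prime = Counter(sample), Counter(sample_prime)
    size, size_prime = len(sample), len(sample_prime)
    # One pass over the alphabet takes the maximum difference.
    return max(
        abs(p * freq[d] / size - p_prime * freq_prime[d] / size_prime)
        for d in alphabet
    )
```

Since `Counter` is backed by a hash table, the expected running time matches the bound stated above, linear in the two sample sizes plus the alphabet size.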
In our second example we take $\mathcal{X} = \Delta$ again but now let $L_\infty^{\mathsf{X}}$ be the total variation distance; that is, $\mathsf{X} = 2^\Delta = \{X : X \subseteq \Delta\}$. We note that in this particular example we have $\Pi_{\mathsf{X}}(m) \leq 2^{|\Delta|}$. In this case, given samples $S$ and $S'$, the naive approach to compute
$$\max_{X \subseteq \Delta} |p S(X) - p' S'(X)|$$
would take time $\Theta(2^{|\Delta|})$. However, we can take advantage of a property that this maximum shares with the usual total variation distance. In particular, using the following result we see that this computation can also be done in time $O(|S| + |S'| + |\Delta|)$.
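Lemma 4.4.6. For any $p, p' \in [0, 1]$ and any samples $S$ and $S'$ on $\Delta$ the following holds:
$$\max_{X \subseteq \Delta} |pS(X) - p'S'(X)| = \frac{|p - p'|}{2} + \frac{1}{2} \sum_{\delta \in \Delta} |pS(\delta) - p'S'(\delta)| .$$
Proof. Let us assume without loss of generality that $p \geq p'$. Then it is easy to see using a standard argument that the maximum over $X \subseteq \Delta$ is achieved on $X = \{\delta : pS(\delta) \geq p'S'(\delta)\}$. On the other hand, using $p = \sum_{\delta} pS(\delta)$ and $p' = \sum_{\delta} p'S'(\delta)$, we have
$$\begin{aligned}
\sum_{\delta \in \Delta} |pS(\delta) - p'S'(\delta)| &= \sum_{\delta \in X} (pS(\delta) - p'S'(\delta)) + \sum_{\delta \in \bar{X}} (p'S'(\delta) - pS(\delta)) \\
&= \sum_{\delta \in X} pS(\delta) - \Big(p - \sum_{\delta \in X} pS(\delta)\Big) - \sum_{\delta \in X} p'S'(\delta) + \Big(p' - \sum_{\delta \in X} p'S'(\delta)\Big) \\
&= p' - p + 2(pS(X) - p'S'(X)) .
\end{aligned}$$
Solving for $pS(X) - p'S'(X)$ and using $p \geq p'$ gives the claimed identity.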
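The identity in the lemma can be checked numerically with a short Python sketch that compares the closed form against exhaustive enumeration of subsets. The sketch assumes non-empty samples whose symbols all belong to `alphabet`; the function names are hypothetical, and the brute-force version is included only as a sanity check for small alphabets.

```python
from collections import Counter
from itertools import combinations


def weighted_tv(p, sample, p_prime, sample_prime, alphabet):
    """Closed form: |p - p'| / 2 plus half the sum of per-symbol differences."""
    freq, freq_prime = Counter(sample), Counter(sample_prime)
    size, size_prime = len(sample), len(sample_prime)
    l1 = sum(
        abs(p * freq[d] / size - p_prime * freq_prime[d] / size_prime)
        for d in alphabet
    )
    return abs(p - p_prime) / 2 + l1 / 2


def weighted_tv_bruteforce(p, sample, p_prime, sample_prime, alphabet):
    """Exhaustive maximum over all subsets of the alphabet (exponential time)."""
    freq, freq_prime = Counter(sample), Counter(sample_prime)
    size, size_prime = len(sample), len(sample_prime)
    symbols = list(alphabet)
    best = 0.0
    for r in range(len(symbols) + 1):
        for subset in combinations(symbols, r):
            a = p * sum(freq[d] for d in subset) / size
            b = p_prime * sum(freq_prime[d] for d in subset) / size_prime
            best = max(best, abs(a - b))
    return best
```

On small inputs the two functions agree up to floating-point error, while only the closed form runs in time linear in the sample sizes and the alphabet size.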
4.4.2 Example with $\mathcal{X} = \mathbb{R}$
In the last example we let $\mathcal{X} = \mathbb{R}$ and $L_\infty^{\mathsf{X}}$ be the Kolmogorov–Smirnov distance; that is, $\mathsf{X} = \{(-\infty, x] : x \in \mathbb{R}\}$. Now note that if $S$ is a sample from $\mathbb{R}$, in order to compute $S((-\infty, x])$ we only need to count how many points in $S$ are less than or equal to $x$. Furthermore, it is obvious that while $x$ can range over $\mathbb{R}$, $S((-\infty, x])$ can only take values of the form $k/|S|$ for $0 \leq k \leq |S|$. Thus, given two samples $S$ and $S'$, we can compute
$$\max_{x \in \mathbb{R}} |pS((-\infty, x]) - p'S'((-\infty, x])|$$
by sorting the values in $S$ and $S'$ and then taking the maximum over the points $x \in S \cup S'$. Overall, this computation takes time $O(|S| \log |S| + |S'| \log |S'|)$. Furthermore, it is easy to see that we have