A common drawback of all the tests described so far is that they need to store the whole sample in memory in order to compute some statistic. Since in the case $D = D'$ samples of size $\Omega(1/\mu^2)$ are needed before a decision can be made confidently, this implies a lower bound of the same order on the amount of memory used by such tests. This raises the question of whether using this much memory is really necessary for confidently testing similarity. The answer is no: tests using just $O(1/\mu)$ memory can be designed. The key to such constructions is to use statistics that can be computed from sketches of the sample instead of from the sample itself.
Sketches are data structures that receive the items in the sample one at a time and process them sequentially. After processing the whole sample, a sketch keeps a summary that contains only its most relevant features. This summary can then be used to efficiently compute some statistics of interest. Interestingly, the sketch can process every item in the sample in constant time, and the memory used by the sketch depends only on the accuracy required to compute the desired statistics, i.e. it is independent of the sample size. Concrete implementations of sketches will be discussed in the next chapter. In the present section we adopt an axiomatic approach: we just assume that sketches exist and satisfy a certain property, and then concentrate on how to implement similarity testing using the information contained in a sketch.
Let $D$ be some distribution over $\Sigma^\star$ and $S$ a sample of i.i.d. examples drawn from $D$. The empirical distribution defined by $S$ can be interpreted as a function $S : 2^{\Sigma^\star} \to [0, 1]$, where for $A \subseteq \Sigma^\star$ the value $S(A)$ is the empirical probability of event $A$ under sample $S$. A sketch is essentially an approximation of this function over a particular family of subsets of $\Sigma^\star$. Let $0 < \nu < 1$. A $\nu$-sketch for $S$ with respect to $L_\infty^p$ is a function $\hat{S} : 2^{\Sigma^\star} \to [0, 1]$ such that $L_\infty^p(S, \hat{S}) \le \nu$.
In some cases, using $\nu$-sketches for testing similarity of probability distributions is rather trivial. Suppose $S$ and $S'$ are samples from distributions $D$ and $D'$ respectively. If instead of $S$ and $S'$ we only have access to their $\nu$-sketches $\hat{S}$ and $\hat{S}'$, then we can compute the statistic $\hat{\mu}_\nu = L_\infty^p(\hat{S}, \hat{S}')$. By the definition of $\nu$-sketch and the triangle inequality we have $|\hat{\mu}_\nu - L_\infty^p(S, S')| \le 2\nu$. Therefore, trivial modifications of the proofs of Propositions 2.2.1 and 2.2.2 yield the following two results for testing with sketches using this statistic. The definition of $\Delta(\delta)$ is the same as in Equation (2.1).
Corollary 2.4.1. With probability at least $1 - \delta$ we have $\mu_\star \le \hat{\mu}_\nu + 2\nu + \Delta(\delta)$.
Corollary 2.4.2. With probability at least $1 - \delta$ we have $\mu_\star \ge \hat{\mu}_\nu - 2\nu - \Delta(\delta)$.
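The two corollaries together give a two-sided confidence interval around the sketched statistic. The following is a minimal sketch of that computation, not the implementation discussed later. It assumes (hypothetically) that a sketch is represented as a dictionary mapping a prefix $x$ to the estimated probability of $x\Sigma^\star$, and that $L_\infty^p$ is the largest absolute difference over prefixes. The value of $\Delta(\delta)$ is taken as an input because Equation (2.1) is stated elsewhere. The names `prefix_linf` and `sketched_interval` are illustrative only.

```python
def prefix_linf(sketch_a, sketch_b):
    """Largest absolute difference between two prefix-probability maps.

    Assumption: each sketch is a dict {prefix: estimated probability};
    a prefix missing from a sketch is treated as having probability 0.
    """
    prefixes = set(sketch_a) | set(sketch_b)
    return max(
        (abs(sketch_a.get(x, 0.0) - sketch_b.get(x, 0.0)) for x in prefixes),
        default=0.0,
    )


def sketched_interval(mu_hat, nu, delta_bound):
    """Interval [mu_hat - 2nu - Delta, mu_hat + 2nu + Delta] from the corollaries.

    delta_bound is Delta(delta) from Equation (2.1), supplied by the caller.
    The interval is clipped to [0, 1], which is valid under the assumption
    that the distance is a difference of probabilities.
    """
    slack = 2.0 * nu + delta_bound
    return max(0.0, mu_hat - slack), min(1.0, mu_hat + slack)
```

For instance, with $\hat{\mu}_\nu = 0.1$, $\nu = 0.01$ and $\Delta(\delta) = 0.05$, the slack is $2\nu + \Delta(\delta) = 0.07$, so the interval is approximately $[0.03, 0.17]$.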
Based on these results, it is easy to obtain a provably correct similarity test for $\mu_\star$ using $O(1/\mu)$ memory by choosing, for example, $\nu = \mu/4$; this is because there exist $\nu$-sketches using $O(1/\nu)$ memory. Details will be given in Section 3.3. We now turn to the possibility of adapting the bootstrap approach from Section 2.3 to a sketching setting.
Deriving a test based on the bootstrap using sketches is much less straightforward than in the case based on uniform convergence bounds. The main obstruction is the need to sample with replacement from a sample $S$ to obtain bootstrapped samples $B_1, \ldots, B_r$. Obviously, the whole sample $S$ needs to be stored in order to perform (exact) resampling. The rest of this section describes a method to obtain bootstrapped sketches $\hat{B}_1, \ldots, \hat{B}_r$ and derives a similarity test based on those sketches. We also state a theorem about confidence intervals built from bootstrapped sketches; full proofs are given in the next section.
Let $r$ be some fixed integer. Suppose that an i.i.d. sample $S = (x_1, \ldots, x_m)$ from some distribution $D$ is presented to an algorithm one example at a time. The algorithm constructs $r$ sketch-bootstrapped samples $\hat{B}_1, \ldots, \hat{B}_r$ as follows. For each element $x_t \in S$ we draw $r$ indices $i_1, \ldots, i_r \in [r]$ independently at random from the uniform distribution over $[r]$. Then a copy of $x_t$ is added to the sketch $\hat{B}_{i_j}$ for each $1 \le j \le r$ (repetitions are allowed). Note that this sampling process matches several first-order moments of the original bootstrap using sampling with replacement. In particular, the expected number of elements introduced in each sketch is $m$, and the expected number of occurrences of each string $x$ in a particular sketch is $S[x]$, the number of occurrences of $x$ in $S$. The same process is repeated with a sample $S'$ from distribution $D'$, and sketch-bootstrapped samples $\hat{B}'_1, \ldots, \hat{B}'_r$ are obtained.
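The assignment process just described can be sketched in a few lines. This is only an illustration under a simplifying assumption: each "sketch" is an exact `Counter`, a stand-in that does not bound memory, whereas a real $\nu$-sketch would replace it with a bounded-memory summary. The function name `sketch_bootstrap` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def sketch_bootstrap(stream, r, seed=None):
    """Distribute a stream into r bootstrapped summaries in a single pass.

    For every item, r indices are drawn independently and uniformly from
    {0, ..., r-1}, and one copy of the item is added to the summary at each
    drawn index (the same summary may be hit several times).
    Assumption: Counter stands in for a real bounded-memory sketch.
    """
    rng = random.Random(seed)
    sketches = [Counter() for _ in range(r)]
    for x in stream:
        for _ in range(r):
            sketches[rng.randrange(r)][x] += 1
    return sketches
```

Each item contributes exactly $r$ copies in total, and each copy lands in a given sketch with probability $1/r$, so every sketch receives $m$ items in expectation, in agreement with the moment-matching remark above.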
Instead of $r$ statistics as in the original bootstrap setting, in the sketched bootstrap we compute $r^2$ statistics. Since the main motivation for using sketches is to save memory, and memory usage turns out to grow linearly with $r$, this approach allows us to obtain more statistics with less memory, at the price of increasing the dependencies between them. Thus, for any $i, j \in [r]$ we compute the statistic $\hat{\mu}_{i,j} = L_\infty^p(\hat{B}_i, \hat{B}'_j)$. As before, we sort these statistics increasingly and obtain $\hat{\mu}_{\omega(1)} \le \cdots \le \hat{\mu}_{\omega(r^2)}$, where now $\omega$ is a bijection between $[r^2]$ and $[r] \times [r]$.
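Computing and ordering the $r^2$ cross statistics can be illustrated as follows. This is a sketch under stated assumptions: `distance` is a caller-supplied function standing in for $L_\infty^p$ evaluated on two sketches, and the name `sorted_pairwise_statistics` is hypothetical. The returned list records both the sorted values and the index pairs, so its $k$-th entry plays the role of $(\hat{\mu}_{\omega(k)}, \omega(k))$.

```python
from itertools import product


def sorted_pairwise_statistics(sketches, sketches_prime, distance):
    """Return [(mu_hat, (i, j)), ...] sorted by increasing mu_hat.

    sketches and sketches_prime must both contain r summaries; indices in
    the output are 1-based to match the notation [r] x [r] in the text.
    """
    r = len(sketches)
    if len(sketches_prime) != r:
        raise ValueError("both collections must contain r sketches")
    stats = [
        (distance(sketches[i], sketches_prime[j]), (i + 1, j + 1))
        for i, j in product(range(r), repeat=2)
    ]
    # Sorting by value only; the attached pairs define the bijection omega.
    stats.sort(key=lambda entry: entry[0])
    return stats
```

Only $2r$ sketches are kept in memory, yet $r^2$ statistics are produced, which is the memory saving mentioned above. The statistics that share a sketch are of course dependent.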
In the same way as before, a heuristic based on Efron's bootstrap can be applied to this collection of statistics to say that $\hat{\mu}_{\omega(\lceil (1-\delta) r^2 \rceil)}$ is an upper confidence limit for $\mu_\star$ at level $\delta$. However, in order to obtain formally justified bounds based on finite sample analyses, we proceed in a different direction. Let $1 \le k \le r^2$, $0 < \delta < 1$, and define the following.
$$\Delta_k(\delta) = \sqrt{\frac{8}{M'} \ln \frac{32 M}{\delta}} + \sqrt{\frac{192}{M'} \ln \frac{800\, r^3 M^2}{\delta k^2}} .$$
Then the following theorem gives an upper confidence limit for $\mu_\star$ based on the sketch-bootstrapped statistic $\hat{\mu}_{\omega(k)}$.
Theorem 2.4.3. Let $1 \le k \le \min\{r^2, 30\, r^{10/7} \delta^{-4/7} M\}$. With probability at least $1 - \delta$ we have $\mu_\star \le \hat{\mu}_{\omega(k)} + 2\nu + \Delta_k(\delta)$.
Using this result, a union bound argument automatically yields the following upper confidence limit based on the whole set of statistics.
Corollary 2.4.4. With probability at least $1 - \delta$ it holds that
$$\mu_\star \le \min_{1 \le k \le r^2} \left\{ \hat{\mu}_{\omega(k)} + 2\nu + \Delta_k(\delta) \right\} .$$
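The bound of Corollary 2.4.4 is a trade-off: $\hat{\mu}_{\omega(k)}$ grows with $k$ while $\Delta_k(\delta)$ shrinks. The following sketch evaluates it numerically. It reads $\Delta_k(\delta)$ as the sum of two square-root terms, with $M'$ in both denominators and $\delta k^2$ in the denominator of the second logarithm; this reading is an assumption. The quantities $M$ and $M'$ are taken as inputs because they are defined together with Equation (2.1), outside this section. The function names are hypothetical.

```python
import math


def delta_k(k, delta, r, M, M_prime):
    """Delta_k(delta), read as a sum of two square-root terms (assumed form)."""
    first = math.sqrt(8.0 / M_prime * math.log(32.0 * M / delta))
    second = math.sqrt(
        192.0 / M_prime * math.log(800.0 * r**3 * M**2 / (delta * k**2))
    )
    return first + second


def upper_confidence_limit(sorted_stats, r, nu, delta, M, M_prime):
    """min over 1 <= k <= r^2 of mu_hat_omega(k) + 2*nu + Delta_k(delta).

    sorted_stats must hold the r^2 statistics in increasing order, so that
    sorted_stats[k - 1] is the k-th smallest one.
    """
    if len(sorted_stats) != r * r:
        raise ValueError("expected r^2 sorted statistics")
    return min(
        mu_hat + 2.0 * nu + delta_k(k, delta, r, M, M_prime)
        for k, mu_hat in enumerate(sorted_stats, start=1)
    )
```

A call such as `upper_confidence_limit(stats, r, nu, delta, M, M_prime)` returns the smallest of the $r^2$ candidate limits. The logarithms are well defined as long as their arguments exceed 1, which holds for realistic sample sizes.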