To prove the correctness of our algorithm, the next theorem summarizes key properties of the partitioning $\{T_1,\dots,T_k\}$ produced by the PP algorithm when $SA' = SA$. Property (i) ensures that each $T_i$ is as balanced as $T$, and Property (ii) ensures that the condition $\rho_{1i} < \rho_2$ holds, which is a problem requirement in Definition 4. We handle the case $SA' \subset SA$ in Theorem 5, immediately after Theorem 4.
Theorem 4 (Partitioning properties when $SA' = SA$). Let $SA' = SA$. If $\rho_{1i}$ is the maximum relative frequency in sub-table $T_i$, $\rho_2$ is the bound on posterior knowledge $\Pr[X = x \mid Y = y]$, $\{T_1,\dots,T_k\}$ is a partitioning of $T$ returned by the PP algorithm, and $\ell = \lfloor |T|/f_{\max} \rfloor$, where $f_{\max}$ is the maximum frequency of SA-values in $T$, then
(i) $T_i$ is $\ell$-balanced wrt SA and $\rho_{1i} \le 1/\ell$;
(ii) if $\rho_2 > 1/\ell$, then $\rho_{1i} < \rho_2$.
Proof: We will prove (i) by showing $\rho_{1i} \le 1/\ell$, i.e., that $T_i$ is $\ell$-balanced wrt SA. Note that (ii) immediately follows from (i): we are given $\rho_2 > 1/\ell$, so $\rho_{1i} \le 1/\ell < \rho_2$. We prove (i) in several steps and provide details for each step after Equation (54).
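Let $f_i$ and $f_j$ be the frequencies of the same SA-value in initial groups $g_i$ and $g_j$ before merging. We have
\begin{align*}
\frac{f_i + f_j}{|g_i| + |g_j|} &\le \max\left\{\frac{f_i}{|g_i|}, \frac{f_j}{|g_j|}\right\} \tag{52} \\
&\le \frac{1}{\ell} \tag{53} \\
\Rightarrow \quad \rho_{1i} &\le \frac{1}{\ell} \tag{54}
\end{align*}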
Equation (52) says a relative frequency in a merged group can never be larger than the maximum relative frequency in an initial group prior to merging. We prove this inequality by contradiction. Assume $f_i/|g_i|$ is at least as large as $f_j/|g_j|$ (the proof also works, symmetrically, if we assume $f_j/|g_j|$ is larger). For the purpose of contradiction, assume $\frac{f_i + f_j}{|g_i| + |g_j|} > \frac{f_i}{|g_i|}$. We cross-multiply and simplify to get $f_j |g_i| > f_i |g_j|$, and we divide both sides by $|g_i||g_j|$ to get $f_j/|g_j| > f_i/|g_i|$, which contradicts our assumption that $f_i/|g_i| \ge f_j/|g_j|$; therefore, Equation (52) must hold.
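The inequality argued above, $(f_i + f_j)/(|g_i| + |g_j|) \le \max\{f_i/|g_i|, f_j/|g_j|\}$, can also be sanity-checked numerically. The following is a minimal sketch, not part of the PP algorithm: the function names are hypothetical, and the exhaustive range of small group sizes is an arbitrary choice made for illustration. Exact rational arithmetic avoids floating-point rounding in the comparison.

```python
from fractions import Fraction
from itertools import product


def merged_relative_frequency(f_i, size_i, f_j, size_j):
    """Relative frequency of one SA-value after merging two groups."""
    return Fraction(f_i + f_j, size_i + size_j)


def satisfies_merge_bound(f_i, size_i, f_j, size_j):
    """Check (f_i + f_j)/(|g_i| + |g_j|) <= max(f_i/|g_i|, f_j/|g_j|)."""
    before = max(Fraction(f_i, size_i), Fraction(f_j, size_j))
    return merged_relative_frequency(f_i, size_i, f_j, size_j) <= before


# Exhaustively test every pair of small groups (sizes 1..8, frequency <= size).
all_hold = all(
    satisfies_merge_bound(f_i, size_i, f_j, size_j)
    for size_i, size_j in product(range(1, 9), repeat=2)
    for f_i in range(size_i + 1)
    for f_j in range(size_j + 1)
)
print(all_hold)
```

A finite check of this kind does not replace the proof by contradiction; it only guards against a mis-stated inequality.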
Equation (53) is derived from the fact that both $g_i$ and $g_j$ are $\ell$-balanced (Lemma 4), so we know that the maximum relative frequency in either $g_i$ or $g_j$ is at most $1/\ell$ (Definition 5); i.e., $\max\{f_i/|g_i|, f_j/|g_j|\} \le 1/\ell$.
Finally, Equation (54) holds because $\frac{f_i + f_j}{|g_i| + |g_j|}$ is a general expression for any relative frequency in a merged group $T_i$, which we can replace with the specific maximum relative frequency of $T_i$, namely $\rho_{1i}$, to get the desired result: $\rho_{1i} \le 1/\ell$.
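To illustrate what Property (i) asserts, the sketch below merges several balanced groups and checks that the merged sub-table is still balanced. It rests on assumptions: a group is modelled as a plain list of SA-values, "balanced" is taken to mean that no SA-value has relative frequency above $1/\ell$ with $\ell = \lfloor |T|/f_{\max} \rfloor$, and the groups and SA-values are made-up toy data, not output of the PP algorithm.

```python
from collections import Counter
from fractions import Fraction


def max_relative_frequency(group):
    """Largest relative frequency of any SA-value in a group (list of SA-values)."""
    counts = Counter(group)
    return Fraction(max(counts.values()), len(group))


def is_balanced(group, ell):
    """Assumed reading of 'ell-balanced': max relative frequency <= 1/ell."""
    return max_relative_frequency(group) <= Fraction(1, ell)


# Hypothetical initial groups, each balanced for ell = 3.
ell = 3
groups = [
    ["flu", "hiv", "cold"],
    ["flu", "flu", "hiv", "hiv", "cold", "asthma"],
    ["cold", "asthma", "hiv", "flu"],
]

# Merging concatenates the records of the initial groups.
merged = [value for group in groups for value in group]

print([is_balanced(g, ell) for g in groups])
print(max_relative_frequency(merged), is_balanced(merged, ell))
```

In this toy instance the merged maximum relative frequency is 4/13, which stays below 1/3, in line with the bound $\rho_{1i} \le 1/\ell$.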
A question remains: how likely is it that the condition $\rho_2 > 1/\ell$ in Property (ii) of Theorem 4 holds? It is as likely as the gap $\rho_2 - \rho_1$ being greater than $1/\ell - f_{\max}/|T|$, which is the gap created by the floor function in $\ell = \lfloor |T|/f_{\max} \rfloor$. In fact, if $\rho_2 - 1/\ell > \rho_1 - f_{\max}/|T|$, then $\rho_2 > 1/\ell$ follows because $\rho_1 \ge f_{\max}/|T|$ (Equation (8)). In practice, $\rho_2 - 1/\ell > \rho_1 - f_{\max}/|T|$ normally holds.
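A small worked example may make the size of this gap concrete. The sketch below computes $\ell = \lfloor |T|/f_{\max} \rfloor$ and the gap $1/\ell - f_{\max}/|T|$, then evaluates both the sufficient test $\rho_2 - \rho_1 > 1/\ell - f_{\max}/|T|$ and the target condition $\rho_2 > 1/\ell$. The table size, the frequency, the values of $\rho_1$ and $\rho_2$, and the function names are all hypothetical choices for illustration.

```python
from fractions import Fraction


def floor_gap(table_size, f_max):
    """Return (ell, 1/ell - f_max/|T|) for ell = floor(|T| / f_max)."""
    ell = table_size // f_max
    return ell, Fraction(1, ell) - Fraction(f_max, table_size)


def check_condition(rho_1, rho_2, table_size, f_max):
    """Return (sufficient test holds, target condition rho_2 > 1/ell holds)."""
    ell, gap = floor_gap(table_size, f_max)
    sufficient = (rho_2 - rho_1) > gap
    target = rho_2 > Fraction(1, ell)
    return sufficient, target


# Hypothetical table: 1000 records, most frequent SA-value occurs 300 times.
ell, gap = floor_gap(1000, 300)
print(ell, gap)  # ell = 3, gap = 1/3 - 3/10 = 1/30

# rho_1 = 3/10 respects rho_1 >= f_max/|T|; rho_2 = 1/2 is an arbitrary bound.
print(check_condition(Fraction(3, 10), Fraction(1, 2), 1000, 300))
```

Here the floor costs only 1/30, so any $\rho_2$ that exceeds $\rho_1$ by more than that margin satisfies the condition of Property (ii).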
The next theorem is the correctness counterpart of Theorem 4 for the case $SA' \subset SA$. Notice that Theorem 5 (i) differs from Theorem 4 (i): in Theorem 5 (i) we say $T_i$ is “nearly” $\ell'$-balanced wrt $SA'$. We say this because $\alpha/\ell'$ approaches $1/\ell'$ when $\alpha$ approaches 1, and the definition $\alpha = 1/(1 - (1/\ell'))$ given in Theorem 5 implies that $\alpha$ approaches 1 when $\ell' = \lfloor |T'|/f_{\max} \rfloor$ is large. Note that we expect $\ell'$ to be large when $SA' \subset SA$ because we expect $SA'$ to contain SA-values with small frequencies (including the maximum frequency in $SA'$, $f_{\max}$).
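The limiting behaviour described above can be tabulated directly. The sketch below writes $\ell'$ for $\lfloor |T'|/f_{\max} \rfloor$ and $\alpha$ for the factor $1/(1 - (1/\ell'))$; it assumes the bound of Theorem 5 (i) has the form $\alpha/\ell'$, which then simplifies to $1/(\ell' - 1)$. The function names and the sample values of $\ell'$ are hypothetical.

```python
from fractions import Fraction


def alpha(ell_prime):
    """alpha = 1 / (1 - 1/ell'), defined only for ell' > 1."""
    if ell_prime <= 1:
        raise ValueError("ell' must exceed 1 for alpha to be defined")
    return 1 / (1 - Fraction(1, ell_prime))


def near_balance_bound(ell_prime):
    """Assumed bound alpha / ell', which equals 1 / (ell' - 1)."""
    return alpha(ell_prime) / ell_prime


# As ell' grows, alpha tends to 1 and the bound tends to 1/ell'.
for ell_prime in (2, 5, 10, 100, 1000):
    a = alpha(ell_prime)
    bound = near_balance_bound(ell_prime)
    print(ell_prime, float(a), float(bound), float(Fraction(1, ell_prime)))
```

For $\ell' = 5$ the factor is 5/4 and the bound is 1/4 rather than 1/5; for $\ell' = 1000$ the factor exceeds 1 by only 1/999, which is the sense in which the sub-table is "nearly" balanced.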
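Theorem 5 (Partitioning properties when $SA' \subset SA$). Let $SA' \subset SA$. If $\rho_{1i}$ is the maximum relative frequency in sub-table $T_i$, $\rho_2$ is the bound on posterior knowledge $\Pr[X = x \mid Y = y]$, $\{T_1,\dots,T_k\}$ is a partitioning of $T$ returned by the PP algorithm, and $\alpha = 1/(1 - (1/\ell'))$ with $\ell' = \lfloor |T'|/f_{\max} \rfloor$, where $f_{\max}$ is the maximum frequency of $SA'$-values in $T$ (and $T'$), then
(i) $T_i$ is nearly $\ell'$-balanced wrt $SA'$ and $\rho_{1i} \le \alpha/\ell'$;
(ii) if $\rho_2 > \alpha/\ell'$, then $\rho_{1i} < \rho_2$.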
Proof: We will prove (i) by showing $\rho_{1i} \le \alpha/\ell'$. Note that (ii) immediately follows from (i): we are given $\rho_2 > \alpha/\ell'$, so $\rho_{1i} \le \alpha/\ell' < \rho_2$. We prove (i) in several steps and provide details for each step after Equation (57).
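Let $f_i$ and $f_j$ be the frequencies of the same $SA'$-value in initial groups $g_i$ and $g_j$ before merging. We have
\begin{align*}
\frac{f_i + f_j}{|g_i| + |g_j|} &\le \max\left\{\frac{f_i}{|g_i|}, \frac{f_j}{|g_j|}\right\} \tag{55} \\
&\le \frac{\alpha}{\ell'} \tag{56} \\
\Rightarrow \quad \rho_{1i} &\le \frac{\alpha}{\ell'} \tag{57}
\end{align*}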
Equation (55) says a relative frequency in a merged group can never be larger than the maximum relative frequency in an initial group prior to merging, as in the proof of Theorem 4 (Equation (52)).
Equation (56) follows from Lemma 5, which says that when $SA' \subset SA$, an $SA'$-value in an initial group has a relative frequency less than or equal to $\alpha/\ell'$, so we know that the maximum relative frequency in either $g_i$ or $g_j$ is at most $\alpha/\ell'$; i.e., $\max\{f_i/|g_i|, f_j/|g_j|\} \le \alpha/\ell'$.
Finally, Equation (57) holds because $\frac{f_i + f_j}{|g_i| + |g_j|}$ is a general expression for any relative frequency in a merged group $T_i$, which we can replace with the specific maximum relative frequency of $T_i$, namely $\rho_{1i}$, to get the desired result: $\rho_{1i} \le \alpha/\ell'$.
Now that we have shown that the PP algorithm is correct, we consider its time complexity.
Theorem 6 (Time complexity). Let n be the size of the dataset T, i.e., n = |T|, let
m be the domain size of SA, and let t be the number of initial groups generated by the balancing phase of the PP algorithm. The time complexity of the PP algorithm is .
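Proof: The total time complexity is the sum of the time complexities of the balancing, rearranging, and merging phases:
\[
\text{time}(\text{PP}) = \text{time}(\text{balancing}) + \text{time}(\text{rearranging}) + \text{time}(\text{merging}) \tag{58}
\]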
The balancing phase first requires the $m$ SA-values to be sorted, which takes $O(m \log m)$ time. Then each of the $n$ records is examined exactly once, taking $O(n)$ time. Therefore, the balancing phase takes $O(m \log m + n)$ time.
The rearranging phase first multiplies two t × m matrices A × AT (recall A represents the t initial groups g1,…,gt), which takes time. Then the
resulting t × t matrix A × AT matrix is used as input to the Reverse Cuthill-McKee algorithm, taking time [23]. Therefore,
Finally, the merging phase involves running the dynamic programming algorithm in Figure 14. For the input sequence of size t, first all the values (g[i..j]), i < j, can be computed in a pre-processing step, which takes time. Then i, at most t values of r are evaluated in the recursion in Step 2 of Figure 14, taking time. Therefore,
Hence, using Equation (58), we can say the time complexity is
Since , adding to leaves us with a term , which can be simplified to ). Therefore, the overall time complexity is
as desired.
Our experiments show that the number of initial groups, $t$, is quite small on real-life datasets (no more than 20). This is because the balancing phase maximizes the size of each initial group. Therefore, in practice the PP algorithm is linear in the cardinality $n$ of $T$.