An alternative technique to compute the distribution of the sum of independent Bernoulli variables is the generating functions technique. While it has the same complexity as the Poisson binomial recurrence, its advantage is its intuitiveness.
Represent each Bernoulli trial $X_i$ by a polynomial $poly(X_i) = p_i \cdot x + (1 - p_i)$. Consider the generating function
$$F_N = \prod_{i=1}^{N} poly(X_i) = \sum_{i=0}^{N} c_i x^i. \quad (3.4)$$
The coefficient $c_i$ of $x^i$ in the expansion of $F_N$ equals the probability $P(\sum_{n=1}^{N} X_n = i)$ ([123]). For example, the monomial $0.25 \cdot x^4$ implies that with a probability of 0.25, the sum of all Bernoulli random variables equals four.
The expansion of $N$ polynomials, each containing two monomials, leads to a total of $2^N$ monomials, one monomial for each sequence of successful and unsuccessful Bernoulli trials, i.e., one monomial for each possible world. To reduce this complexity, an iterative computation of $F_N$ can again be used, by exploiting that
$$F_k = F_{k-1} \cdot poly(X_k). \quad (3.5)$$
This rewriting of Equation 3.4 allows $F_k$ to be computed inductively from $F_{k-1}$. The induction is started by computing the polynomial $F_0$, which is the empty product and thus equals the neutral element of multiplication, i.e., $F_0 = 1$. To understand the semantics of this polynomial, $F_0 = 1$ can be rewritten as $F_0 = 1 \cdot x^0$, which we can interpret as the following tautology: "with a probability of one, the sum of all zero Bernoulli trials equals zero." After each iteration, we can unify monomials having the same exponent, leading to a total of at most $k + 1$ monomials after iteration $k$. This unification step removes the combinatorial aspect of the problem, since any monomial $x^i$ corresponds to a class of equivalent worlds, such that this class contains exactly the worlds in which the sum $\sum_{n=1}^{k} X_n = i$. After iteration $k$, the number of these classes is at most $k + 1$, and the probability of each class is given by the coefficient of $x^i$.
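The iterative scheme of Equation 3.5, including the unification of monomials with equal exponents, can be sketched in a few lines of Python. This is a minimal illustration, not the thesis' implementation: the function name `poisson_binomial_pmf` and the representation of a polynomial as a list of coefficients (index = exponent) are assumptions made for this example.

```python
def poisson_binomial_pmf(probs):
    """Return [c_0, ..., c_N], where c_i = P(sum of the Bernoulli trials = i).

    probs is the list of success probabilities p_1, ..., p_N.
    A polynomial is stored as a coefficient list: coeffs[i] belongs to x^i.
    """
    coeffs = [1.0]  # F_0 = 1 * x^0 (empty product)
    for p in probs:
        # Multiply by poly(X_k) = p * x + (1 - p). Writing both products
        # into the same slot of `nxt` unifies equal exponents on the fly.
        nxt = [0.0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] += c * (1.0 - p)  # trial k fails: exponent stays i
            nxt[i + 1] += c * p      # trial k succeeds: exponent becomes i + 1
        coeffs = nxt
    return coeffs
```

Because products with equal exponents are accumulated in the same list slot, the list never grows beyond $k + 1$ entries after iteration $k$, which mirrors the unification step described above.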
An example showcasing the generating functions technique is given in the following. This example uses the same Bernoulli random variables as Example 14.
Example 15. Again, let $N = 4$ and let $p_1 = 0.1$, $p_2 = 0.2$, $p_3 = 0.3$ and $p_4 = 0.4$. We obtain the four generating polynomials $poly(X_1) = (0.1x + 0.9)$, $poly(X_2) = (0.2x + 0.8)$, $poly(X_3) = (0.3x + 0.7)$, and $poly(X_4) = (0.4x + 0.6)$. We trivially obtain $F_0 = 1$. Using Equation 3.5 we get
$$F_1 = F_0 \cdot poly(X_1) = 1 \cdot (0.1x + 0.9) = 0.1x + 0.9.$$
Semantically, this polynomial implies that out of the first Bernoulli variable, the probability of having a sum of one is 0.1 (according to monomial $0.1x = 0.1x^1$), and the probability of having a sum of zero is 0.9 (according to monomial $0.9 = 0.9x^0$). Next, we compute $F_2$, again using Equation 3.5:
$$F_2 = F_1 \cdot poly(X_2) = (0.1x^1 + 0.9x^0) \cdot (0.2x^1 + 0.8x^0) = 0.02x^1x^1 + 0.08x^1x^0 + 0.18x^0x^1 + 0.72x^0x^0$$
In this expansion, the monomials have deliberately not been unified to give an intuition of how the generating function technique is able to identify and unify equivalent worlds. In the above expansion, there is one monomial for each possible world. For example, the monomial $0.18x^0x^1$ represents the world where the first trial was unsuccessful (represented by the zero of the first exponent) and the second trial was successful (represented by the one of the second exponent). The above notation allows the sequence of successful and unsuccessful Bernoulli trials to be identified, clearly leading to a total of $2^k$ possible worlds for $F_k$.
However, we only need to compute the total number of successful trials; we do not need to know the sequence of successful trials. Thus, we need to treat worlds having the same number of successful Bernoulli trials equivalently, to avoid the enumeration of an exponential number of sequences. This is done implicitly by polynomial multiplication, exploiting that
$$0.02x^1x^1 + 0.08x^1x^0 + 0.18x^0x^1 + 0.72x^0x^0 = 0.02x^2 + 0.08x^1 + 0.18x^1 + 0.72x^0$$
This representation no longer allows the sequence of successful Bernoulli trials to be distinguished. This loss of information is beneficial, as it allows possible worlds having the same sum of Bernoulli trials to be unified:
$$0.02x^2 + 0.08x^1 + 0.18x^1 + 0.72x^0 = 0.02x^2 + 0.26x^1 + 0.72x^0$$
The remaining monomials represent equivalence classes of possible worlds. For example, monomial $0.26x^1$ represents all worlds having a total of one successful Bernoulli trial. This is evident, since the coefficient of this monomial was derived from the sum of both worlds having a total of one successful Bernoulli trial. In the next iteration, we compute:
$$F_3 = F_2 \cdot poly(X_3) = (0.02x^2 + 0.26x^1 + 0.72x^0) \cdot (0.3x + 0.7)$$
$$= 0.006x^2x^1 + 0.014x^2x^0 + 0.078x^1x^1 + 0.182x^1x^0 + 0.216x^0x^1 + 0.504x^0x^0$$
This polynomial represents the three classes of possible worlds in $F_2$ combined with the two possible results of the third Bernoulli trial, yielding a total of $3 \cdot 2 = 6$ monomials. Unification yields
$$0.006x^2x^1 + 0.014x^2x^0 + 0.078x^1x^1 + 0.182x^1x^0 + 0.216x^0x^1 + 0.504x^0x^0 = 0.006x^3 + 0.092x^2 + 0.398x^1 + 0.504$$
The final generating function is given by
$$F_4 = F_3 \cdot poly(X_4)$$
$$= (0.006x^3 + 0.092x^2 + 0.398x^1 + 0.504) \cdot (0.4x + 0.6)$$
$$= 0.0024x^4 + 0.0036x^3 + 0.0368x^3 + 0.0552x^2 + 0.1592x^2 + 0.2388x^1 + 0.2016x^1 + 0.3024x^0$$
$$= 0.0024x^4 + 0.0404x^3 + 0.2144x^2 + 0.4404x^1 + 0.3024x^0$$
This polynomial describes the distribution of $\sum_{i=1}^{4} X_i$, since each monomial $c_i x^i$ implies that the probability that, out of all four Bernoulli trials, the total number of successful events equals $i$ is $c_i$. Thus, we get $P(\sum_{i=1}^{4} X_i = 0) = 0.3024$, $P(\sum_{i=1}^{4} X_i = 1) = 0.4404$, $P(\sum_{i=1}^{4} X_i = 2) = 0.2144$, $P(\sum_{i=1}^{4} X_i = 3) = 0.0404$ and $P(\sum_{i=1}^{4} X_i = 4) = 0.0024$. Note that this result equals the result we obtained by using the Poisson binomial recurrence in the previous section.
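The distribution of Example 15 can be cross-checked by brute force: enumerate all $2^N$ possible worlds and add up the probabilities of worlds with the same number of successes. The following Python sketch does exactly that. The function name `pmf_by_enumeration` is an assumption for this illustration, and the approach is exponential in $N$, so it only serves as a sanity check for small inputs.

```python
from itertools import product


def pmf_by_enumeration(probs):
    """Distribution of the number of successes, by enumerating all 2^N worlds."""
    pmf = [0.0] * (len(probs) + 1)
    # Each world is a tuple of outcomes, 1 = success and 0 = failure.
    for world in product((0, 1), repeat=len(probs)):
        weight = 1.0
        for outcome, p in zip(world, probs):
            weight *= p if outcome else 1.0 - p
        # Worlds with the same number of successes fall into the same class.
        pmf[sum(world)] += weight
    return pmf


if __name__ == "__main__":
    for i, c in enumerate(pmf_by_enumeration([0.1, 0.2, 0.3, 0.4])):
        print(f"P(sum = {i}) is approximately {c:.4f}")
```

For the probabilities of Example 15, the entry for a sum of zero is the probability of the single world in which all four trials fail, $0.9 \cdot 0.8 \cdot 0.7 \cdot 0.6 = 0.3024$, which is the constant term of $F_4$.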
Complexity Analysis
The generating function technique requires a total of $N$ iterations. In each iteration $1 \le k \le N$, a polynomial of degree at most $k$, and thus of maximum length $k + 1$, is multiplied with a polynomial of degree 1, thus having a length of 2. This requires computing a total of $(k + 1) \cdot 2$ monomials in each iteration, each requiring a scalar multiplication. This leads to a total time complexity of $\sum_{k=1}^{N} (2k + 2) \in O(N^2)$ for the polynomial expansions. Unification of a polynomial of length $k$ can be done in $O(k)$ time, exploiting that the monomials are sorted by exponent after expansion. Unification at each iteration leads to an $O(N^2)$ complexity for the unification step. This results in a total complexity of $O(N^2)$, the same as for the Poisson binomial recurrence approach.
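For reference, the sum bounding the number of scalar multiplications can be evaluated in closed form. The short derivation below uses only the per-iteration bound of $2k + 2$ monomials stated above.

```latex
\sum_{k=1}^{N} (2k + 2)
  = 2 \sum_{k=1}^{N} k + 2N
  = N(N + 1) + 2N
  = N^2 + 3N \in O(N^2)
```

For $N = 4$, this bound evaluates to $4 + 6 + 8 + 10 = 28 = 4^2 + 3 \cdot 4$.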
An advantage of the generating function approach is that this naive polynomial multiplication can be accelerated using the Discrete Fourier Transform (DFT). This technique reduces the total complexity of computing the sum of $N$ Bernoulli random variables to $O(N \log^2 N)$ ([126]). This acceleration is achieved by exploiting that the DFT allows two polynomials of size $k$ to be multiplied in $O(k \log k)$ time. Equi-sized polynomials are obtained in the approach of [126] by using a divide and conquer approach that iteratively divides the set of $N$ Bernoulli trials into two equi-sized sets. Their recursive algorithm then combines these results by performing a polynomial multiplication of the generating polynomials of each set. More details of this algorithm can be found in [126].
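The divide and conquer idea can be illustrated with the following Python sketch. It is not the algorithm of [126], whose details are not reproduced here. It only shows the general pattern under stated assumptions: a textbook recursive radix-2 FFT written with the standard library, polynomials stored as coefficient lists (index = exponent), and hypothetical function names `fft`, `multiply` and `pmf_divide_and_conquer`.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.

    With invert=True the unnormalised inverse transform is computed,
    so the caller has to divide the result by len(a).
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)  # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def multiply(p, q):
    """Multiply two coefficient lists via FFT (pointwise product of transforms)."""
    size = len(p) + len(q) - 1
    n = 1
    while n < size:  # pad to the next power of two
        n *= 2
    fp = fft([complex(c) for c in p] + [0j] * (n - len(p)))
    fq = fft([complex(c) for c in q] + [0j] * (n - len(q)))
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    return [v.real / n for v in prod[:size]]


def pmf_divide_and_conquer(probs):
    """Split the trials into two halves, solve each, and multiply the results."""
    if not probs:
        return [1.0]  # empty product
    if len(probs) == 1:
        return [1.0 - probs[0], probs[0]]  # poly(X) = (1 - p) + p * x
    mid = len(probs) // 2
    return multiply(pmf_divide_and_conquer(probs[:mid]),
                    pmf_divide_and_conquer(probs[mid:]))
```

Since the FFT works with floating point complex numbers, the resulting coefficients are only accurate up to rounding error. For small $N$, the simple iterative multiplication is typically the more practical choice.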
3.6 Summary
Both presented techniques, the Poisson-Binomial recurrence and the generating functions technique, allow the distribution of a Poisson-Binomial distributed random variable, i.e., the sum of independent but not necessarily identically distributed Bernoulli trials, to be computed efficiently. Both approaches achieve this efficient computation by unifying sets of possible worlds and treating the resulting set as a whole, rather than treating each world individually. In particular, the Poisson-Binomial recurrence unifies, in each iteration, all worlds having the same number of successful Bernoulli trials out of the currently considered Bernoulli trials. The generating functions technique unifies worlds using simple algebra, adding monomials having the same exponent. Thus, both approaches identify and unify sets of possible worlds, following the paradigm of equivalent worlds introduced in this chapter. The efficient computation resulting from either technique is paramount in answering many probabilistic queries on spatial and spatio-temporal data. Both techniques will be applied, adapted and improved throughout this thesis.
Part III
Probabilistic Spatial Queries on Uncertain Data
In this part, the paradigm of equivalent worlds presented in Chapter 3 is applied to permit efficient spatial similarity search on uncertain spatial data. For the most relevant spatial query types, which have been presented in Chapter 1, efficient solutions are presented. This part is structured as follows.
• Chapter 4 will give an efficient solution to answer range queries on uncertain data, both for the case where the query object is a (certain) point and for the case where the query object is uncertain itself. Furthermore, the problem of answering range queries on uncertain data will be extended by considering the range count query, which is to return the number of objects within a given range of a query object. In an uncertain database, this number is a random variable, for which this chapter will give an efficient solution to compute its distribution. The solutions shown in this chapter are rather straightforward, given the paradigm of equivalent worlds. Nonetheless, solutions for range queries on uncertain data are necessary for completeness of this thesis and, at the same time, showcase the function of the paradigm of equivalent worlds. Some of the results of this chapter have been published in [30].
• Chapter 5 presents a novel approach to facilitate spatial pruning of uncertain objects that are conservatively approximated by minimal bounding boxes. This spatial pruning approach is a key technique to boost the efficiency of similarity queries such as k-nearest neighbor queries, ranking queries and reverse k-nearest neighbor queries on uncertain data. Parts of this chapter have been published at the ACM SIGMOD Conference in 2010 ([65]) as a full paper.
• In Chapter 6 a solution for kNN queries on uncertain data is presented. Here, the main challenge is that the predicate of an object being a kNN of a query object is a random variable, which stochastically depends on the position of other objects. An efficient solution to handle these dependencies is given in this chapter. Parts of this chapter have been published at the IEEE International Conference on Data Engineering (ICDE) 2011 ([22]) as a full paper.
• Chapter 7 extends Chapter 6 by giving a solution to the problem of answering spatial ranking queries in uncertain databases. This chapter will show how to efficiently compute the rank distribution, i.e., the probability of each rank of each object in an uncertain database. Parts of this chapter have been published in the IEEE Transactions on Knowledge and Data Engineering journal 2010 ([25]) as a regular paper.
• Finally, Chapter 8 will give the first efficient solution for the problem of reverse kNN queries in uncertain databases. Parts of this chapter have been published in the Proceedings of the VLDB Endowment (PVLDB), Volume 4, 2011 ([23]) as a full paper.
Chapter 4
Probabilistic Range Queries on Uncertain Data
Figure 4.1: Applications for hot item detection. (a) Astrological hot items in terms of interesting constellations. (b) Hot item detection for crime defence applications (hot spots in terms of drug offences).
4.1 Introduction
The detection of dense regions in a feature space is paramount in a variety of density-based data mining techniques, in particular density-based clustering [68, 161], density-based outlier detection [38] and other density-based mining applications [113, 175]. We call a region R for which a sufficiently large population of objects in DB exists a dense region or hot location. Analogously, we call an object o for which a sufficiently large number of other objects in DB are similar to o a hot item. Intuitively, an item that shares its attributes with a lot of other items could be potentially interesting, as it shows a typical occurrence of items in the database. Deciding for a given item whether it is a hot item is an important subtask in density-based clustering algorithms such as DBSCAN [68], where such items are called core items or core points. Further application areas where the detection of hot items is potentially important include scientific applications, e.g., astrophysics (cf. Figure 4.1(a), showing a clipped capture of a star field in the Sagittarius area), as well as biomedical, sociological and economic applications. In particular, the following applications give a good motivation for the efficient detection of hot items:
Pre-detection of criminal activities: After a soccer game, one might be interested in the detection of larger groups of hooligans that should be accompanied by guards in order to avoid criminal excesses. If we assume that the locations of all hooligans are monitored, then it would be interesting to know which of these individuals have a lot of other hooligans in their immediate vicinity. Another example is the detection of notable criminal offences, e.g., cases of drug abuse in areas with a high density of drug offences, as depicted in Figure 4.1(b).
To find hot items, the density of other items in the vicinity of the probed object has to be assessed. In traditional databases, a range query can be utilized to perform this task: if the number of other database objects inside a certain range $\varepsilon$ around an object exceeds a minimal threshold $minItems$, this object is declared a hot item. Formally:
Definition 17 (Hot Item). Given a spatial point database $DB$, a scalar spatial distance threshold $\varepsilon$ and an integer minimum population threshold $minItems$, an object $o \in DB$ is called a hot item if and only if
$$|\{o' \in DB \setminus \{o\} : dist(o, o') < \varepsilon\}| > minItems.$$
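On certain (non-uncertain) data, Definition 17 translates directly into a counting range query. The following Python sketch assumes that objects are points given as coordinate tuples and that `dist` is the Euclidean distance. Both choices, as well as the function name `is_hot_item` and the use of a list index to identify the probed object, are assumptions made for this illustration.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def is_hot_item(db, idx, eps, min_items):
    """Check Definition 17 for the object db[idx] by a linear scan.

    db is a list of points (coordinate tuples). The probed object itself
    is excluded from the count, matching DB \\ {o}.
    """
    o = db[idx]
    neighbours = sum(
        1 for j, other in enumerate(db)
        if j != idx and dist(o, other) < eps
    )
    return neighbours > min_items
```

The linear scan keeps the example short. In practice, a spatial index would typically be used to answer the underlying range query.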
Both parameters $\varepsilon$ and $minItems$ are user-specified and application dependent. An example is depicted in Figure 4.2(a), where the $\varepsilon$-range for two objects of an exemplary database is depicted. For the depicted value of $\varepsilon$ and $minItems = 5$, only one of these items is a hot item. On uncertain data, the task of deciding whether a database object is a hot item becomes more challenging, as seen in Figure 4.2(b). For the highlighted object $o$, the predicate $dist(o, o') < \varepsilon$, stating that $o$ is sufficiently close to another object $o'$, becomes a random variable that is true with some probability and false otherwise. Consequently, the number of objects in range of $o$ also turns into a random variable having a probability mass function defined on $\mathbb{N}_0$.
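To make the random variable "number of objects in range of $o$" concrete, the following Python sketch computes its probability mass function by naively enumerating all possible worlds. It is only an illustration under stated assumptions and not one of the approaches developed in this chapter: each uncertain object is assumed to be given as a finite list of alternative positions with probabilities summing to one, objects are assumed to be mutually independent, and the name `range_count_pmf` is hypothetical. The run time is exponential in the number of objects.

```python
from itertools import product
from math import dist


def range_count_pmf(objects, query_idx, eps):
    """PMF of the number of other objects within distance eps of objects[query_idx].

    objects: list of uncertain objects, each a list of (position, probability)
             alternatives. One alternative per object forms a possible world.
    """
    pmf = [0.0] * len(objects)  # possible counts are 0 .. len(objects) - 1
    for world in product(*objects):
        # Objects are assumed independent, so the world probability is a product.
        weight = 1.0
        for _, prob in world:
            weight *= prob
        q_pos = world[query_idx][0]
        count = sum(
            1 for j, (pos, _) in enumerate(world)
            if j != query_idx and dist(q_pos, pos) < eps
        )
        pmf[count] += weight
    return pmf
```

The probability that the probed object is a hot item is then the total mass of all counts greater than $minItems$, e.g. `sum(pmf[min_items + 1:])`.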
This chapter will show how to efficiently answer range queries on uncertain data. Related work is presented in Section 4.2. Next, the paradigm of equivalent worlds (cf. Chapter 3) is applied in Section 4.3 in order to find an efficient solution for the problem of answering probabilistic range queries on uncertain data in the special case where the query object is a (certain) point. While this type of query is rather trivial to answer efficiently due to object independence in the X-tuple model, it is a good showcase of the paradigm of equivalent worlds introduced in Chapter 3. This solution is generalized to the case of an uncertain query point in Section 4.4, again using the paradigm of equivalent worlds. Since range queries do not directly return a scalar measuring the density of a region, but rather return a set of potential result objects associated with their respective probability of being a result, Section 4.5 presents an efficient solution to the problem of computing the distribution of the total number of objects located in a given range, which allows the probability that an object is hot to be assessed. This problem is particularly challenging, since distances between objects are stochastically dependent even in the case where the locations of objects are stochastically independent. A detailed explanation and an example of this problem are also given in Section 4.5. Again, an efficient solution can be found using the paradigm of equivalent worlds. To allow an experimental evaluation of the efficiency of this approach, a straightforward approach as well as a competitive approach are introduced. The results of an experimental evaluation of all approaches are presented in Section 4.6 before this chapter is concluded in Section 4.7.

Figure 4.2: Examples of hot items. (a) Hot items in certain data. (b) Hot items in uncertain data.