This section gives an efficient solution to the problem of finding the distribution of the total number |{U ∈ DB | dist(Q, U) ≤ ε}| of database objects located within the ε-range of a query object Q. At first glance, this task may seem rather straightforward, given the probabilities P(dist(Q, U) ≤ ε) for each U ∈ DB, which can be computed efficiently as shown in Section 4.4. Given the probability of each object being located in the ε-range of Q, it may seem possible to use the techniques presented in Section 3.3 to compute the sum of independent Bernoulli trials. However, the events dist(Q, U1) ≤ ε and dist(Q, U2) ≤ ε for U1, U2 ∈ DB are not stochastically independent, as both events depend on the position of Q. This problem is described in detail in the following example.
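4.5 Range Count Queries on Uncertain Data
Figure 4.3: An example database showing the stochastic dependencies between probabilistic distances. (The figure shows a query object Q with two alternative positions q1 and q2, each labelled 0.5, and the database objects A, B and C.)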
Example 16. To illustrate this problem, consider Figure 4.3. It shows a query object Q with two alternative positions, each having a probability of 0.5. If Q has position q1, i.e., assuming that Q = q1, all three database objects A, B and C are certainly located in the ε-range of q1, which is depicted by the large circle centered at q1. If Q has position q2, i.e., if Q = q2, then all objects are certainly located outside the ε-range of q2. Thus, for any object O ∈ {A, B, C} it holds that P(dist(Q, O) ≤ ε) = 0.5: with a probability of 0.5, Q is at location q1 and certainly has O in its range, and with a probability of 0.5, Q is at location q2 and O is certainly out of range of Q. If we (incorrectly) assume independence between the events dist(Q, O) ≤ ε, we can compute the distribution of the number count(Q, ε, DB) := |{O ∈ DB | dist(Q, O) ≤ ε}| of objects in range of Q using the technique of generating functions by
F(x) = (0.5 + 0.5x)·(0.5 + 0.5x)·(0.5 + 0.5x) = 0.125x^3 + 0.375x^2 + 0.375x + 0.125.
Interpreting the coefficients of the monomials of this expansion yields the following (incorrect) probabilities: P(count(Q, ε, DB) = 0) = P(count(Q, ε, DB) = 3) = 0.125 and P(count(Q, ε, DB) = 1) = P(count(Q, ε, DB) = 2) = 0.375. These probabilities must be wrong, since in all worlds where Q = q1, count(Q, ε, DB) must equal 3. Knowing that P(Q = q1) = 0.5, we get P(count(Q, ε, DB) = 3) = 0.5. Analogously, for the case Q = q2 we get P(count(Q, ε, DB) = 0) = 0.5. The wrong results returned by the above generating functions technique are caused by the incorrect assumption of independence between the three stochastic events dist(Q, O) ≤ ε, O ∈ {A, B, C}. In this example, these events are indeed highly correlated, since dist(Q, A) ≤ ε if and only if dist(Q, B) ≤ ε if and only if dist(Q, C) ≤ ε. The three events are thus equivalent and have a correlation of one.
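The numbers of Example 16 can be reproduced with a short Python sketch. It is only an illustration: the function names and the dictionary layout are hypothetical, and the generating function is expanded by plain polynomial multiplication. The sketch first computes the (incorrect) distribution obtained from the marginal probabilities, then the distribution obtained by conditioning on the position of Q.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = exponent)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def count_distribution(probs):
    """Expand prod_i ((1 - p_i) + p_i * x); coefficient k is P(count = k),
    valid only if the underlying events are independent."""
    dist = [1.0]
    for p in probs:
        dist = poly_mul(dist, [1.0 - p, p])
    return dist


# Setting of Example 16 (layout is an assumption made for this sketch):
# Q is at q1 or q2 with probability 0.5 each; A, B, C are certainly in
# range of q1 and certainly out of range of q2.
positions = {"q1": 0.5, "q2": 0.5}
in_range = {"q1": [1.0, 1.0, 1.0], "q2": [0.0, 0.0, 0.0]}

# Incorrect: use the marginals P(dist(Q, O) <= eps) = 0.5 as if independent.
marginals = [sum(positions[q] * in_range[q][i] for q in positions) for i in range(3)]
naive = count_distribution(marginals)

# Correct: condition on each position of Q, then mix the distributions.
correct = [0.0] * 4
for q, p_q in positions.items():
    for k, value in enumerate(count_distribution(in_range[q])):
        correct[k] += p_q * value

print(naive)    # coefficients of F(x) from Example 16
print(correct)  # distribution obtained by conditioning on Q
```

The first list matches the coefficients of F(x), while the second assigns probability 0.5 to each of the counts 0 and 3, as argued in the example.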
4.5.1 Probabilistic Hot Items
The problem of computing the density in the range of a query object is defined as follows.
Definition 21 (Probabilistic Hot Item). Let DB be an uncertain spatial database, let ε be a real value and let minItems be a positive integer. A database object U ∈ DB is called a hot item if
U is a hot item := |{U′ ∈ DB \ {U} | dist(U, U′) ≤ ε}| > minItems
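Using possible worlds semantics, the probability of this random event is given by
\[
P\bigl(|\{U' \in DB \setminus \{U\} \mid \mathrm{dist}(U, U') \le \varepsilon\}| > \mathit{minItems}\bigr) = \sum_{w \in W} I\bigl(\{U' \in DB \setminus \{U\} \mid \mathrm{dist}(U, U') \le \varepsilon\}, DB\bigr) \cdot P(w) \tag{4.3}
\]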
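Equation 4.3 can be read as an enumeration of all possible worlds. The following Python sketch spells this out under several assumptions that are not stated in the text: each uncertain object is modelled as a list of (location, probability) alternatives, objects are mutually independent, and the indicator is taken to be 1 exactly when the count in world w exceeds minItems. The function name and the toy database are hypothetical. The enumeration is exponential in the number of objects and is meant only as a reference for small inputs.

```python
from itertools import product
from math import dist, prod


def hot_item_probability_bruteforce(db, u, eps, min_items):
    """Sum P(w) over all possible worlds w in which object `u` has more
    than `min_items` other objects within distance `eps`."""
    names = list(db)
    total = 0.0
    # One world = one chosen alternative per object.
    for world in product(*(db[name] for name in names)):
        p_world = prod(p for _, p in world)  # objects assumed independent
        location = {name: loc for name, (loc, _) in zip(names, world)}
        count = sum(
            1 for name in names
            if name != u and dist(location[u], location[name]) <= eps
        )
        if count > min_items:
            total += p_world
    return total


# Hypothetical toy database: object name -> list of (location, probability).
db = {
    "U": [((0.0, 0.0), 0.5), ((10.0, 0.0), 0.5)],
    "A": [((1.0, 0.0), 1.0)],
    "B": [((0.0, 1.0), 0.5), ((0.0, 5.0), 0.5)],
}
p_hot = hot_item_probability_bruteforce(db, "U", eps=2.0, min_items=1)
print(p_hot)
```

In this toy database, U is hot only if it is at (0, 0) and B is at (0, 1), so the result is 0.5 · 0.5 = 0.25.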
Based on the uncertainty models and the corresponding definitions given above, we can compute hot items in uncertain data in a probabilistic way. However, we have to solve the problem of distance dependencies between the uncertain attributes. Although we assume that the locations of uncertain spatial objects are independent of each other, we have to respect that the random variables dist(U1, U2) and dist(U3, U4) for objects U1, ..., U4 ∈ DB are mutually dependent if {U1, U2} ∩ {U3, U4} ≠ ∅. The problem is that the same uncertain object Q is used in both random events dist(Q, A) ≤ ε and dist(Q, B) ≤ ε, rendering these events mutually dependent despite the independence between the objects A, B and Q, as shown in Example 16. To avoid this dependency, we partition the possible worlds into subsets: one partition for each possible position q ∈ Q. In each such partition, the random events dist(q, A) ≤ ε and dist(q, B) ≤ ε are independent, since the only uncertain objects involved are A and B, which are independent by model definition.
Definition 22 (Conditional Probabilistic Hot Item). Let DB be a database of uncertain objects, let minItems be a minimum population threshold and let ε ∈ ℝ₀⁺ be a scalar. Under the condition that an uncertain object U ∈ DB has a certain location x ∈ ℝ^d, the probability that U is a hot item, given that U = x, is denoted as
P(U is a hot item | U = x)
This definition allows us to compute the probability that U is a hot item as follows:
Corollary 4.
\[
P(U \text{ is a hot item}) = \sum_{x \in U} P(U \text{ is a hot item} \mid U = x) \cdot P(x)
\]
Proof. Corollary 4 follows directly from the law of total probability ([220]).
Corollary 4 allows us to reduce the problem of computing the probability that an uncertain item is hot to the problem of computing the probability that a single location is hot. To compute the probability
\[
P(U \text{ is a hot item} \mid U = x) = P\bigl(|\{U' \in DB \setminus \{U\} \mid \mathrm{dist}(U, U') \le \varepsilon\}| > \mathit{minItems} \mid U = x\bigr) = P\bigl(|\{U' \in DB \setminus \{U\} \mid \mathrm{dist}(x, U') \le \varepsilon\}| > \mathit{minItems}\bigr)
\]
we first compute the probabilities
\[
P(\mathrm{dist}(x, U') \le \varepsilon)
\]
using the technique presented in Section 4.3. Note that, depending on the value of ε, usually only a small portion DB′ ⊂ DB of the database has a non-zero probability of being in the ε-range of x. The objects that have to be taken into account can be found efficiently by means of an index structure, e.g. the R*-tree. In particular, the index-supported ε-range join [39] can be used to speed up the search as proposed in [34]. Here, approximate representations such as the minimum bounding rectangle (MBR) of an uncertain object are well suited as index keys for a filter step following the multi-step query processing paradigm. A solution for the ε-range join on uncertain data is proposed in [109], which can be used as a preprocessing step for our proposed algorithm for the detection of hot items.
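The idea of the MBR-based filter step can be sketched in a few lines of Python. This is not the index-supported ε-range join of [39] or [109]: the sketch replaces the R*-tree by a linear scan over precomputed bounding boxes, assumes that every uncertain object is given by a finite list of alternative locations, and uses hypothetical function names. An object whose MBR has a minimum distance greater than ε to x cannot have any alternative within the ε-range of x and can therefore be pruned.

```python
from math import sqrt


def mbr(alternatives):
    """Minimum bounding rectangle (lower corner, upper corner) of a list
    of alternative locations of one uncertain object."""
    dims = range(len(alternatives[0]))
    lower = tuple(min(a[d] for a in alternatives) for d in dims)
    upper = tuple(max(a[d] for a in alternatives) for d in dims)
    return lower, upper


def min_dist_to_mbr(x, lower, upper):
    """Smallest Euclidean distance between point x and the rectangle."""
    return sqrt(sum(
        max(lo - xi, 0.0, xi - hi) ** 2
        for xi, lo, hi in zip(x, lower, upper)
    ))


def filter_candidates(x, boxes, eps):
    """Keep only objects whose MBR intersects the eps-range of x.
    Linear scan used here in place of an index structure."""
    return [
        name for name, (lower, upper) in boxes.items()
        if min_dist_to_mbr(x, lower, upper) <= eps
    ]


# Hypothetical example: A lies near the origin, B far away.
boxes = {
    "A": mbr([(1.0, 0.0), (2.0, 1.0)]),
    "B": mbr([(10.0, 10.0), (12.0, 11.0)]),
}
candidates = filter_candidates((0.0, 0.0), boxes, eps=2.0)
print(candidates)
```

Objects that survive the filter are only candidates; their exact probabilities P(dist(x, U′) ≤ ε) still have to be computed in a refinement step.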
To efficiently compute the distribution of the number |{U′ ∈ DB \ {U} | dist(x, U′) ≤ ε}| of objects close to instance x ∈ U, we proceed as follows. Since each random event dist(x, Ui) ≤ ε, Ui ∈ DB \ {U}, is binary, having the two possible outcomes true and false, we can define a Bernoulli distributed random variable Bi that returns one if dist(x, Ui) ≤ ε and zero otherwise. By definition, the sum
\[
\sum_{U_i \in DB \setminus \{U\}} B_i
\]
corresponds to the number of random events dist(x, Ui) ≤ ε, Ui ∈ DB \ {U}, that hold true. The events dist(x, Ui) ≤ ε and dist(x, Uj) ≤ ε with Ui, Uj ∈ DB \ {U}, Ui ≠ Uj are now independent, since x is a fixed location and not a random variable. Consequently, the corresponding Bernoulli random variables Bi and Bj are also mutually independent. Thus, we can use the techniques presented in Section 3.3 to compute the distribution of Σ_{Ui ∈ DB \ {U}} Bi efficiently. This distribution can be used to compute P(U is a hot item | U = x) for each possible alternative x ∈ U of U. Given these probabilities, we can apply Corollary 4 to compute the probability P(U is a hot item).
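The complete procedure can be summarised in a minimal Python sketch. It rests on assumptions that go beyond the text: uncertain objects are modelled as finite lists of (location, probability) alternatives, P(dist(x, Ui) ≤ ε) is obtained by summing the probabilities of the alternatives in range (a stand-in for the technique of Section 4.3), and the distribution of the Bernoulli sum is obtained by expanding the generating function, as in Example 16. All function names and the toy database are hypothetical.

```python
from math import dist


def in_range_probability(x, alternatives, eps):
    """P(dist(x, U_i) <= eps) for a fixed location x, assuming U_i is a
    discrete list of (location, probability) alternatives."""
    return sum(p for loc, p in alternatives if dist(x, loc) <= eps)


def count_distribution(probs):
    """Distribution of a sum of independent Bernoulli variables B_i with
    success probabilities `probs` (generating function expansion)."""
    d = [1.0]
    for p in probs:
        nxt = [0.0] * (len(d) + 1)
        for k, value in enumerate(d):
            nxt[k] += value * (1.0 - p)  # B_i = 0
            nxt[k + 1] += value * p      # B_i = 1
        d = nxt
    return d


def hot_item_probability(db, u, eps, min_items):
    """Corollary 4: sum over the alternatives x of U of
    P(U is a hot item | U = x) * P(x)."""
    total = 0.0
    for x, p_x in db[u]:
        probs = [
            in_range_probability(x, alts, eps)
            for name, alts in db.items() if name != u
        ]
        d = count_distribution(probs)
        total += p_x * sum(d[min_items + 1:])  # P(count > min_items | U = x)
    return total


# Hypothetical toy database: object name -> list of (location, probability).
db = {
    "U": [((0.0, 0.0), 0.5), ((10.0, 0.0), 0.5)],
    "A": [((1.0, 0.0), 1.0)],
    "B": [((0.0, 1.0), 0.5), ((0.0, 5.0), 0.5)],
}
p_hot = hot_item_probability(db, "U", eps=2.0, min_items=1)
print(p_hot)
```

In contrast to an enumeration of all possible worlds, the work per alternative x is quadratic in the number of candidate objects, because the conditioning on U = x makes the Bernoulli variables independent.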