In Section 3.3 and in Section 4.5.1, the paradigm of equivalent worlds was used to efficiently compute the probability distribution of a sum of Bernoulli-distributed random variables. To show the power of this technique, this section presents two competing approaches for computing the probability that a database object is a hot item. The first is a straightforward baseline algorithm; the second is an adaptation of an algorithm proposed in [27].
4.6.1 Brute-Force Algorithm
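The conditional probability that an object $U \in DB$ is a hot item, given that $U$ is at position $x$, can be rewritten as follows:
\[
P(U \text{ is a hot item} \mid U = x)
= P\bigl(|\{U' \in DB \setminus \{U\} \mid dist(x, U') < \varepsilon\}| \ge minItems\bigr)
= \sum_{\substack{S_{minItems} \subseteq DB \setminus \{U\} \\ |S_{minItems}| \ge minItems}}
\Bigl( \prod_{U' \in S_{minItems}} P(dist(x, U') \le \varepsilon) \cdot \prod_{U' \in DB \setminus (S_{minItems} \cup \{U\})} \bigl(1 - P(dist(x, U') \le \varepsilon)\bigr) \Bigr).
\]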
This computation is very expensive, since a total of $\sum_{i=minItems}^{|DB|-1} \binom{|DB|-1}{i}$ different sets $S_{minItems}$ have to be taken into account to calculate the sum term. Furthermore, for each summand we have to compute a product of $|DB|-1$ factors. Even if we ignore the cost of the product, the computational complexity of the sum term remains $O(2^{|DB|})$. However, only those objects for which the probability that the predicate $dist(x, U') \le \varepsilon$ is satisfied is greater than zero have to be taken into account. In total, the computational complexity is $O(2^{|DB'|})$, where $DB' \subseteq DB$ denotes the set of objects $U'$ for which $P(dist(x, U') \le \varepsilon) > 0$ holds. Even if $|DB'| \ll |DB|$, the computational cost would explode for reasonably large databases and reasonable settings of the minItems value. Still, this approach is faster than the naive approach that enumerates all possible worlds, whose runtime corresponds to the case $|DB'| = |DB|$.
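The enumeration described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function name `hot_item_probability_bruteforce` and the input list `probs` (holding the values $P(dist(x, U') \le \varepsilon)$ for all other objects, assumed mutually independent) are hypothetical.

```python
from itertools import combinations
from math import prod


def hot_item_probability_bruteforce(probs, min_items):
    """P(at least min_items objects are in range), by enumerating subsets.

    probs[j] is assumed to be P(dist(x, U'_j) <= eps) for the j-th other
    object; the objects are assumed to be mutually independent.
    """
    # Objects with probability zero cannot be in range, so only the
    # remaining ones (the set DB' in the text) are enumerated.
    relevant = [p for p in probs if p > 0.0]
    n = len(relevant)
    total = 0.0
    for size in range(min_items, n + 1):
        for subset in combinations(range(n), size):
            inside = set(subset)
            # Objects in the subset are in range, all others are not.
            total += prod(
                relevant[j] if j in inside else 1.0 - relevant[j]
                for j in range(n)
            )
    return total
```

The nested loops visit every subset of size at least `min_items`, which makes the exponential growth in the number of relevant objects directly visible.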
4.6.2 Bisection-Based Algorithm
The computational cost can be significantly reduced if we utilize the bisection-based algorithm proposed in [27]. The bisection-based algorithm uses a divide-and-conquer approach to compute the probability $P(|\{U' \in DB \setminus \{U\} \mid dist(x, U') < \varepsilon\}| = k)$ that exactly $k$ objects are inside the query range by iteratively dividing the database $DB \setminus \{U\}$ into two equally sized subsets $DB_1$ and $DB_2$, exploiting the law of total probability as follows:
\[
P(|\{U' \in DB \setminus \{U\} \mid dist(x, U') < \varepsilon\}| = k) = \dots
\]
[Figure 4.4: Performance w.r.t. database size. (a) Evaluation of competing techniques (BF, BSB, DPB, PHID): runtime [s] over the number of objects in the database. (b) Scalability experiments for ε = 1 and ε = 8: runtime [ms] over the number of objects in the database.]
This approach reduces the effective size of the database relevant for each probability computation to $\frac{1}{2^a}$ of its original size, but incurs a total of $k^a$ probability computations rather than a single one, where $a$ is the number of divide-and-conquer iterations. For each probability computation, the approach of [27] applies a straightforward computation with exponential worst-case complexity in the size of the database considered for that probability computation.
This technique can easily be adapted for our problem. The main idea is to recursively perform a binary split of the set of relevant objects, i.e. objects which have to be taken into account for the probability computation. Details of this algorithm can be found in [27].
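As an illustration of the split step, one plausible reading (an assumption made here, not the formulation of [27]) is that, for mutually independent objects, the count distribution of the whole set is the convolution of the count distributions of the two halves: $P(N = k) = \sum_{i=0}^{k} P(N_1 = i) \cdot P(N_2 = k - i)$, where $N_1$ and $N_2$ count the objects of $DB_1$ and $DB_2$ in range. The sketch below uses the hypothetical name `count_distribution` and, unlike the method of [27], recurses down to single objects instead of applying an exponential computation to the subsets.

```python
def count_distribution(probs):
    """Return [P(N = 0), ..., P(N = n)] for n independent objects.

    probs[j] is assumed to be the probability that the j-th object is in
    range. The list is halved recursively and the halves are recombined
    with the law of total probability (a discrete convolution).
    """
    if not probs:
        return [1.0]
    if len(probs) == 1:
        return [1.0 - probs[0], probs[0]]
    mid = len(probs) // 2
    left = count_distribution(probs[:mid])    # counts within DB_1
    right = count_distribution(probs[mid:])   # counts within DB_2
    combined = [0.0] * (len(left) + len(right) - 1)
    for i, p_left in enumerate(left):
        for j, p_right in enumerate(right):
            # i objects from DB_1 and j objects from DB_2 give i + j in total.
            combined[i + j] += p_left * p_right
    return combined
```

For example, `count_distribution([0.5, 0.5])` yields the probabilities of zero, one and two objects being in range.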
4.6.3 Run-Time Experiments
In this section, we present the results of an experimental evaluation of the proposed methods w.r.t. efficiency.
Datasets and Experimental Setup
The hot item detection methods were applied to one artificial dataset (ART) and two scientific real-world datasets (SCI1, SCI2), based on the discrete uncertainty model.
In the ART dataset, each object is represented by a set of positions sampled uniformly from the $[0,1]^5$ space.
Each of the 1500 objects of the datasets SCI1 and SCI2 consists of 10 samples, where each sample corresponds to a set of environmental sensor measurements of one single day that consist of several dimensions (attributes). The attribute set of SCI1 describes temperature, humidity and CO concentration, whereas SCI2 has a larger set of attributes (temperature, humidity, speed and direction of wind as well as concentrations of CO, SO2,
Here, we compare two variants of our approach, denoted by DPB and PHID. The algorithm DPB applies the techniques presented in Section 4.5.1 to the complete database. In contrast, PHID applies a spatial pruning filter using the R*-tree to identify objects that must have a probability of zero of being in the range of the query object. The performance of PHID and DPB is compared to that of the brute-force solution (BF), which applies the formula given in Definition 4.3. Furthermore, we compare them to the bisection-based method [27] (BSB). Note that BSB is considerably more efficient than the brute-force solution and, thus, a more challenging competitor than BF.
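The idea behind such a pruning filter can be sketched as follows. This is only an illustration under stated assumptions: each uncertain object is assumed to be summarized by an axis-aligned bounding box of its samples, the Euclidean distance is assumed, and a linear scan stands in for the R*-tree traversal. The names `min_dist_point_to_box` and `candidate_objects` are hypothetical.

```python
import math


def min_dist_point_to_box(point, lower, upper):
    """Smallest Euclidean distance from a point to an axis-aligned box."""
    squared = 0.0
    for x, lo, hi in zip(point, lower, upper):
        # Per-dimension gap between the point and the box (0 if inside).
        gap = max(lo - x, 0.0, x - hi)
        squared += gap * gap
    return math.sqrt(squared)


def candidate_objects(point, boxes, eps):
    """Indices of objects that may have a sample within eps of the point.

    boxes[i] = (lower, upper) is assumed to bound all samples of object i.
    An object whose box is farther away than eps has probability zero of
    being in range and is pruned.
    """
    return [
        i
        for i, (lower, upper) in enumerate(boxes)
        if min_dist_point_to_box(point, lower, upper) <= eps
    ]
```

Only the surviving candidates need to enter the subsequent probability computation; an index structure would avoid inspecting every box.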
Note that we concentrate on the evaluation of the CPU cost only, since the PHID algorithm is clearly CPU-bound. The only I/O bottleneck is the initial computation of the likelihood that $U'$ is in the $\varepsilon$-range of $u_i$, for each object $U' \in DB$ and each sample $u_i$, where $U, U' \in DB$, $u_i \in U$ and $U \ne U'$. This requires a distance-range self-join of the database, which can be performed by a nested-block-loop join that requires $O(|DB|^2)$ page faults in the worst case. In contrast, the CPU time of the PHID algorithm is cubic: to compute the number of objects in the $\varepsilon$-range of an instance of an object, i.e., to compute the sum of Bernoulli variables $B_i$, $1 \le i \le |DB|$, we can use either the Poisson binomial recurrence technique presented in Section 3.4 or the generating function technique presented in Section 3.5. Either way, the incurred complexity is $O(|DB|^2)$, and the computation has to be performed once for each sample in the database. In our experimental evaluation, we use an implementation of the Poisson binomial recurrence to compute the number of objects in the range of an instance of an object.
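To make the quadratic cost per sample concrete, the following is a minimal sketch of the standard Poisson binomial recurrence; the details of the variant in Section 3.4 may differ. The function name `poisson_binomial_tail` and the input list `probs` (the success probabilities of the Bernoulli variables $B_i$, assumed independent) are hypothetical.

```python
def poisson_binomial_tail(probs, min_items):
    """P(B_1 + ... + B_n >= min_items) for independent Bernoulli variables.

    probs[i] is assumed to be the success probability of B_i, i.e. the
    probability that the i-th object is in range of the given sample.
    """
    # dist[k] = P(exactly k of the variables processed so far are 1)
    dist = [1.0]
    for p in probs:
        updated = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            updated[k] += mass * (1.0 - p)   # current variable is 0
            updated[k + 1] += mass * p       # current variable is 1
        dist = updated
    return sum(dist[min_items:])
```

Each of the $n$ variables updates at most $n + 1$ entries, which gives the $O(n^2)$ cost per sample, in contrast to the exponential enumeration of the brute-force approach.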
The first experiments relate to the scalability of the proposed approaches. The results depicted in Figure 4.4 demonstrate how the runtime of the competing techniques is influenced by the database size. Figure 4.4(a) shows that, although the bisection-based approach has exponential runtime, it outperforms the brute-force approach by several orders of magnitude. However, the Poisson binomial recurrence approaches scale significantly better than their competitors because, in contrast to BF and BSB, DPB and PHID have polynomial runtime. Furthermore, the pre-processing step of PHID clearly pays off: the performance can be further improved by an order of magnitude when applying the Poisson binomial recurrence only to objects $U'$ having a non-zero chance of being in range. The next experiment shows the scalability of PHID for different ε-range values. Here, the average time required to compute the hot item probability for an object was measured. The results shown in Figure 4.4(b) demonstrate that PHID scales well, even for very large databases.
Figure 4.5(a) demonstrates the performance w.r.t. the minItems value for different database sizes. In contrast to DPB and PHID, BSB is strongly affected by the minItems value due to the expensive probability computation. The slight increase in the performance of DPB and PHID can be explained by the reduced number of hot items with increasing minItems value.
Finally, we evaluate the performance based on real-world data (cf. Figure 4.5(b)). Unlike the exponential algorithms, DPB and PHID are able to perform a full hot item scan of the database in reasonable time, even for a relatively large database size.
4.7 Conclusions