
6.1 Introduction

This section introduces a novel, scalable pruning approach to identify candidates for a class of probabilistic similarity queries. The pruning method is applied to the most prominent query of this class, the probabilistic k-nearest neighbor (PkNN) query.

Figure 6.1: A dominates B w.r.t. R with high probability.

6.1.1 Uncertainty Model

In this chapter, we assume a continuous model, where each object $U_i \in DB = \{U_1, ..., U_N\}$ is represented by a continuous probability density function $f_i$. Following the convention of uncertain databases [31, 45, 48, 51, 55, 131, 170], we assume that $f_i$ is (minimally) bounded by an uncertainty region $U_i$ such that $\forall x \notin U_i : f_i(x) = 0$ and

$$\int_{U_i} f_i(x)\,dx \le 1.$$

Specifically, the case $\int_{U_i} f_i(x)\,dx < 1$ implements existential uncertainty, i.e., the object may not exist in the database at all with a probability greater than zero. In this chapter, we focus on the case $\int_{U_i} f_i(x)\,dx = 1$, but the proposed concepts can be easily adapted to existentially uncertain objects. If $f_i$ is an unbounded PDF, e.g., a Gaussian PDF, we truncate PDF tails with negligible probabilities and normalize the resulting PDF. This procedure is also used in related work [45, 48, 31]. Specifically, [31] shows that for a reasonably low truncation threshold, the impact on the accuracy of probabilistic ranking queries is very low.
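The truncate-and-normalize step can be illustrated with a minimal one-dimensional sketch. The function names, the cutoff of three standard deviations, and the midpoint-rule integrator are assumptions made for this illustration only; the text does not prescribe a particular truncation threshold.

```python
import math

def truncated_gaussian_pdf(mean, sigma, cutoff=3.0):
    """Return (pdf, lo, hi) for a 1-D Gaussian truncated to mean +/- cutoff*sigma.

    The cutoff value is a hypothetical choice for illustration.
    """
    lo, hi = mean - cutoff * sigma, mean + cutoff * sigma
    # Probability mass of the untruncated Gaussian inside [lo, hi].
    mass = math.erf(cutoff / math.sqrt(2.0))

    def pdf(x):
        if x < lo or x > hi:
            return 0.0  # f_i(x) = 0 outside the uncertainty region
        z = (x - mean) / sigma
        density = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
        return density / mass  # renormalize so the PDF integrates to one

    return pdf, lo, hi

def integrate(f, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

pdf, lo, hi = truncated_gaussian_pdf(mean=0.0, sigma=1.0)
total_mass = integrate(pdf, lo, hi)  # close to 1 after normalization
```

After normalization, the interval [lo, hi] plays the role of the uncertainty region, and the density integrates to one over it.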

In this way, each uncertain object can be considered as a d-dimensional rectangle with an associated multi-dimensional object PDF (cf. Figure 6.1). Here, we assume that uncertain attributes may be mutually dependent. Therefore, the object PDF can have an arbitrary form and, in general, cannot simply be derived from the marginal distributions of the uncertain attributes. Note that in many applications, a discrete uncertainty model is appropriate, meaning that the probability distribution of an uncertain object is given by a finite number of alternatives, each assigned a probability. This is a special case of the model used here.


6.1.2 Problem Formulation

To answer kNN queries on uncertain objects efficiently, we exploit the observation that an object o is a k-nearest neighbor of an object q if and only if there are fewer than k objects in the database that are closer to q than o. Thus, for a given uncertain query object Q and an uncertain object B, we address the problem of determining the number of uncertain objects of an uncertain database DB that are closer to Q than B, i.e., that dominate B w.r.t. Q. We call this number the domination count of B w.r.t. Q, as defined in the following.

Definition 30 (Probabilistic Domination). Consider an uncertain database $DB = \{U_1, ..., U_N\}$ and an uncertain query object $Q$. Let $A, B \in DB$. Let $(A \prec_Q B)$ denote the random indicator variable that returns one if and only if A dominates B w.r.t. Q, formally:

$$(A \prec_Q B) := I(dist(A, Q) < dist(B, Q), DB)$$

where $I(dist(A, Q) < dist(B, Q), DB)$ is a random indicator variable that returns one if the random location of A is closer to the random location of Q than the random location of B, and zero otherwise.

Note that in Section 5, the predicate $(A \prec_Q B)$ has been defined for rectangles. In Definition 30 above, the notation $(A \prec_Q B)$ has deliberately been overloaded as a random variable having uncertain objects as formal parameters, rather than a predicate having rectangles as formal parameters. The random variable $(A \prec_Q B)$ follows a Bernoulli distribution, i.e., a distribution having a success probability $P(A \prec_Q B)$ of taking value one, and a probability $1 - P(A \prec_Q B)$ of taking value zero.

Definition 31 (Domination Probability). The probability $P(A \prec_Q B)$ that object A dominates object B with respect to Q is called the domination probability. If $P(A \prec_Q B) = 0$, then we say that A does not dominate B with respect to Q. If $P(A \prec_Q B) = 1$, then we say that A certainly dominates B with respect to Q. If $0 < P(A \prec_Q B) < 1$, then we say that A dominates B probabilistically with respect to Q. This case is called probabilistic domination.
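For the discrete special case of the uncertainty model, the domination probability can be computed by brute force. The following is a minimal sketch, not the pruning method of this chapter: it assumes that each object is a list of (point, probability) alternatives, that distinct objects are mutually independent, and that dist is the Euclidean distance. The function names are hypothetical.

```python
import itertools
import math

def domination_probability(A, B, Q):
    """P(A dominates B w.r.t. Q) for discrete uncertain objects.

    Each object is a list of (point, probability) alternatives.
    Assumes A, B and Q are mutually independent (an assumption of this sketch).
    """
    total = 0.0
    for (a, pa), (b, pb), (q, pq) in itertools.product(A, B, Q):
        # The indicator of Definition 30 is one iff A is strictly closer to Q.
        if math.dist(a, q) < math.dist(b, q):
            total += pa * pb * pq
    return total

def classify(p):
    """Map a domination probability to the three cases of Definition 31."""
    if p == 0.0:
        return "does not dominate"
    if p == 1.0:
        return "certainly dominates"
    return "dominates probabilistically"

Q = [((0.0, 0.0), 1.0)]
A = [((1.0, 0.0), 0.5), ((3.0, 0.0), 0.5)]
B = [((2.0, 0.0), 1.0)]
p = domination_probability(A, B, Q)
```

In this example, A is closer to Q than B in exactly one of its two equally likely alternatives, so A dominates B probabilistically. The enumeration is cubic in the number of alternatives and only serves to make the definition concrete.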

Definition 32 (Probabilistic Domination Count). Consider an uncertain database $DB = \{U_1, ..., U_N\}$ and an uncertain query object $Q$. For each uncertain object $B \in DB$, the probabilistic domination count $DomCount(B, Q)$ is defined as the random variable of the number of uncertain objects $A \in DB$ ($A \neq B$) that are closer to Q than B:

$$DomCount(B, Q) := \sum_{A \in DB, A \neq B} (A \prec_Q B)$$

DomCount(B, Q) is the sum of N−1 Bernoulli variables that are not necessarily identically distributed and not necessarily independent. The domination count can be used directly to efficiently answer probabilistic threshold kNN queries.
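To make the dependence between the summands concrete, the following sketch computes the exact probability mass function of DomCount(B, Q) for small discrete objects by enumerating all possible worlds. It assumes discrete alternatives given as (point, probability) pairs, mutually independent objects, and Euclidean distance; the function name is hypothetical and the enumeration is exponential, so this is an illustration only.

```python
import itertools
import math
from collections import defaultdict

def domcount_pmf(others, B, Q):
    """Exact PMF of DomCount(B, Q) by enumerating possible worlds.

    others: list of uncertain objects A != B, each a list of (point, prob).
    B, Q:   uncertain objects in the same representation.
    Assumes all objects are mutually independent (assumption of this sketch).
    """
    pmf = defaultdict(float)
    for world in itertools.product(B, Q, *others):
        (b, pb), (q, pq) = world[0], world[1]
        prob = pb * pq
        count = 0
        for a, pa in world[2:]:
            prob *= pa
            if math.dist(a, q) < math.dist(b, q):
                count += 1  # this A dominates B in the current world
        pmf[count] += prob
    return dict(pmf)

Q = [((0.0, 0.0), 1.0)]
B = [((1.0, 0.0), 0.5), ((3.0, 0.0), 0.5)]
A1 = [((2.0, 0.0), 1.0)]
A2 = [((2.0, 1.0), 1.0)]
pmf = domcount_pmf([A1, A2], B, Q)
```

Here each of A1 and A2 dominates B with probability 0.5, yet the count is either 0 or 2 and never 1, because both events depend on the same location of B. Treating the two indicators as independent would instead assign probability 0.5 to a count of 1.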

Corollary 5. Let Q be an uncertain query object and let k be a scalar. The problem is to find all uncertain objects $kNN_\tau(Q)$ that are the k-nearest neighbors of Q with a probability of at least $\tau$. Given the probability mass function of $DomCount(B, Q)$, we can compute the probability $P_{kNN}(B, Q)$ that an object B is a kNN of Q as follows:

$$P_{kNN}(B, Q) = \sum_{i=0}^{k-1} P(DomCount(B, Q) = i)$$

Proof. The above corollary is evident, since the proposition “B is a kNN of Q” is equivalent to the proposition “B is dominated by fewer than k objects”.

To decide whether B is a threshold kNN of Q, i.e., whether $B \in P\tau kNN(Q, DB)$, we just need to check whether $P_{kNN}(B, Q) \ge \tau$.
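Corollary 5 and the threshold check translate into a few lines of code once the probability mass function of the domination count is available. The sketch below assumes the PMF is given as a dictionary from count to probability; the function names and the example values are hypothetical.

```python
def pknn(pmf, k):
    """P(B is a kNN of Q): probability that fewer than k objects dominate B.

    pmf maps a domination count i to P(DomCount(B, Q) = i).
    """
    return sum(p for count, p in pmf.items() if count < k)

def is_threshold_knn(pmf, k, tau):
    """True if B is a k-nearest neighbor of Q with probability at least tau."""
    return pknn(pmf, k) >= tau

# Hypothetical PMF of DomCount(B, Q) for some object B.
example_pmf = {0: 0.2, 1: 0.3, 2: 0.5}
p_2nn = pknn(example_pmf, k=2)                             # 0.2 + 0.3
qualifies = is_threshold_knn(example_pmf, k=2, tau=0.4)
```

With this example PMF, B is among the two nearest neighbors with probability 0.5, so it qualifies for any threshold up to that value.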

The problem solved in this chapter is to efficiently compute the probability mass function of DomCount(B, Q). The solutions for computing the sum of independent Bernoulli trials presented in Section 3.3 cannot be applied directly, since the two random events $(A_i \prec_Q B)$ and $(A_j \prec_Q B)$ are mutually dependent, as they both depend on the locations of the uncertain objects B and Q. In this chapter, the technique of generating functions introduced in Section 3.3 is adapted to give upper and lower bound functions of the probability mass function of DomCount(B, Q). Experiments will show that the resulting bound functions are very tight, which makes it possible to guarantee correct kNN results in most cases and avoids the expensive integration needed to obtain exact results.
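For reference, the generating-function technique for independent Bernoulli variables can be sketched as follows. This assumes the standard form, in which the product of the factors (1 − p_i + p_i x) is expanded and the coefficient of x^j is the probability that exactly j variables are one; whether Section 3.3 uses exactly this formulation is an assumption here, and the function name is hypothetical.

```python
def independent_sum_pmf(probs):
    """PMF of a sum of independent Bernoulli variables via polynomial expansion.

    probs: success probabilities p_i.
    Returns a list c where c[j] is the coefficient of x^j in
    prod_i (1 - p_i + p_i * x), i.e. P(sum = j) under independence.
    """
    coeffs = [1.0]  # the empty product is the constant polynomial 1
    for p in probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            nxt[j] += c * (1.0 - p)  # this variable is zero
            nxt[j + 1] += c * p      # this variable is one
        coeffs = nxt
    return coeffs

pmf = independent_sum_pmf([0.5, 0.5])
```

The expansion takes quadratic time in the number of variables. It is only valid under independence, which is why it cannot be applied directly to the domination indicators: two indicators that depend on the same location of B can be perfectly correlated, in which case the sum never takes the intermediate value that this expansion would predict.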

6.1.3 Basic Idea

First, Section 6.3 presents a methodology to efficiently find objects in DB that certainly dominate B w.r.t. Q, as well as objects in DB that do not dominate B. At the same time, we find the set of objects that dominate B probabilistically. Using a decomposition technique, for each object A in this set, we can derive a lower and an upper bound for $P(A \prec_Q B)$, i.e., for the probability that A dominates B w.r.t. Q. In Section 6.4, we show that, due to dependencies between object distances to Q, these probabilities cannot be combined in a straightforward manner to approximate the distribution of DomCount(B, Q). We propose a solution that copes with these dependencies and introduce techniques that help to compute the probabilistic domination count efficiently. In particular, we prove that the bounds of $P(A \prec_Q B)$ are mutually independent if they are computed without a decomposition of B and Q. Then, we provide a class of uncertain generating functions that use these bounds to bound the distribution of DomCount(B, Q). In Section 6.5, we propose an algorithm that progressively refines DomCount(B, Q) by iteratively decomposing the objects that influence its computation. In Section 6.6, we experimentally demonstrate the effectiveness and efficiency of our probabilistic pruning methods for various parameter settings on artificial and real-world datasets.
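One way in which independent lower and upper bounds on the individual domination probabilities could be combined into bounds on the distribution of the count is sketched below. The concrete polynomial form is an assumption of this sketch and is not taken from this section: each object contributes a factor lb·x + (ub − lb)·y + (1 − ub), where x marks "certainly counted", y marks "undecided", and the constant term marks "certainly not counted". The function name is hypothetical.

```python
def count_bounds(bounds):
    """Bound P(count = k) from independent per-object probability bounds.

    bounds: list of (lb, ub) with lb <= true probability <= ub.
    Returns a list of (lower, upper) pairs for k = 0..n.
    The two-variable polynomial used here is an assumed construction.
    """
    # coeffs[(i, j)] is the coefficient of x^i * y^j in the expanded product.
    coeffs = {(0, 0): 1.0}
    for lb, ub in bounds:
        nxt = {}
        for (i, j), c in coeffs.items():
            for key, weight in (((i + 1, j), lb),        # certainly counted
                                ((i, j + 1), ub - lb),   # undecided
                                ((i, j), 1.0 - ub)):     # certainly not counted
                nxt[key] = nxt.get(key, 0.0) + c * weight
        coeffs = nxt
    result = []
    for k in range(len(bounds) + 1):
        # Exactly k certain, nothing undecided: the count must equal k.
        lower = coeffs.get((k, 0), 0.0)
        # k is reachable if at most k are certain and enough are undecided.
        upper = sum(c for (i, j), c in coeffs.items() if i <= k <= i + j)
        result.append((lower, upper))
    return result

bounds_per_k = count_bounds([(0.5, 0.5), (0.25, 0.75)])
```

When every lower bound equals its upper bound, the undecided terms vanish and the lower and upper bounds for each k coincide. Tightening the per-object bounds, for example by decomposing objects further, shrinks the weight of the undecided terms and therefore narrows the gap.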


6.2 Related Work

Uncertain similarity query processing has focused on various aspects. Much of the existing work on uncertain data addresses probabilistic one-nearest-neighbor (1NN) queries for certain query objects [51, 110] and for uncertain queries [95]. To reduce computational effort, [48] adds threshold constraints in order to retrieve only objects whose probability of being the nearest neighbor exceeds a user-specified threshold, which controls the confidence required in a query answer. Similar query semantics in probabilistic databases are provided by Top-k nearest neighbor queries [31], where the k objects that are most likely to be the nearest neighbor of a certain query point are returned. The solution of [133] for probabilistic k-nearest neighbor (kNN) queries is restricted to expected distances of uncertain objects to the query object. The use of expected distances drops any uncertainty information, yielding expected results whose reliability cannot be assessed ([171, 125]).

The work of [49] uses result-based query semantics to answer probabilistic kNN queries. For a probabilistic kNN query, the set of possible answers using result-based semantics equals $DB^k$, i.e., one possible result for each k-combination of objects in DB. Since the result is exponentially large in the worst case, the problem itself must be exponentially hard, because merely returning the computed results requires exponential time. For this reason, [49] presents approximation techniques to find result sets having a probability of at least $\tau$. However, due to the exponentially large set of possible answers, a proper $\tau$ value that still returns any results becomes exponentially small. Furthermore, the approach of [49] assumes a certain query point.
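The size of the answer space under result-based semantics can be made concrete with a short calculation. The sketch below simply counts k-combinations with the standard library; the database sizes used are hypothetical and chosen only for illustration.

```python
import math

def num_candidate_result_sets(n, k):
    """Number of k-combinations of a database with n objects."""
    return math.comb(n, k)

# Hypothetical database sizes; k = 10 neighbors in each case.
sizes = {n: num_candidate_result_sets(n, 10) for n in (20, 50, 100)}
```

Already for 100 objects and k = 10 there are more than 10^13 candidate result sets. If the probability mass were spread evenly over them, each set would have a probability below 10^-13, which illustrates why a threshold that still returns results has to be extremely small.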

Several approaches return the full result to queries as a ranking of probabilistic objects according to their distance to a certain query point [27, 55, 125, 171]. However, all these prior works have in common that the query is given as a single (certain) point.