band Processing

The K-skyband query is a generalization of the well-known skyline concept. As defined in [57], a K-skyband query reports all points that are dominated by no more than K points. The case K = 0 corresponds to a conventional skyline. The key idea underlying this skyline concept is to define the domination relationship between any two data points. As a simple example, consider a dataset D composed of n one-dimensional data points, namely n distinct values {p_1, p_2, p_3, ..., p_n}. Assume the domination relationship between any pair of data points p_i and p_j (1 ≤ i, j ≤ n) is defined as: p_i dominates p_j if p_i > p_j. Then the K-skyband (K = 2) query on dataset D returns the three largest points in D: {p_max, p_{max−1}, p_{max−2}}. The point p_{max−2} is dominated by the two data points p_max and p_{max−1}, while all other data points in D are dominated by at least these three points.
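
The one-dimensional example above can be checked with a few lines of Python. This is only a minimal sketch: the function name `k_skyband_1d` and the sample values are hypothetical, and the quadratic scan is chosen for clarity rather than efficiency.

```python
def k_skyband_1d(values, K):
    """Return the values dominated by at most K other values,
    where a dominates b if a > b (the 1-D rule from the text)."""
    result = []
    for v in values:
        # Count how many values in the dataset dominate v.
        dominators = sum(1 for other in values if other > v)
        if dominators <= K:
            result.append(v)
    return sorted(result, reverse=True)


data = [7, 3, 9, 1, 5, 8]          # hypothetical distinct values
print(k_skyband_1d(data, K=2))     # [9, 8, 7]: the three largest values
print(k_skyband_1d(data, K=0))     # [9]: the conventional skyline
```

With K = 2 the result is exactly the three largest values, and with K = 0 it collapses to the single maximum, matching the skyline case.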

To map our problem of determining the outlier status of a given point p to the K-skyband problem, we similarly have to define the domination relationship between any pair of data points in the dataset D_{W_c}, i.e., the population of the current window W_c. The key observation here is that, given any two points p_i and p_j, two factors, namely their relative arrival time and their distance to the point p under evaluation, determine whether p_i is more important than p_j in terms of evaluating the outlier status of p.

Let us introduce a query group Q used in the remainder of this section. Assume we have a query group Q: {q_1(r_1), q_2(r_2), ..., q_m(r_m), q_{m+1}(r_{m+1}), ..., q_n(r_n)}¹, where r_m represents the r parameter of query q_m. The r parameters of q_1, q_2, ..., q_n increase monotonically, that is, r_1 < r_2 < ... < r_m < r_{m+1} < ... < r_n.

Distance Dimension. In the distance-based outlier definition (Def. 2.1), points in a dataset D are classified either as outliers or inliers. Thus, the process of identifying outliers in D

¹ For ease of readability, we only list those parameters in the query notation q_i(r, k, win, slide) that

8.1 K-SKY: VARYING PARAMETER - R

is equivalent to the process of finding and eliminating inliers from it. By Def. 2.1, p is guaranteed to be an inlier once k neighbors are acquired in D. Given two points p_i and p_j, assume dist(p_i, p) < r_m < dist(p_j, p) ≤ r_{m+1}. Then p_i is a neighbor of p with respect to the query subset Q_i = {q_m, ..., q_n}, while p_j is a neighbor of p only with respect to the query subset Q_j = {q_{m+1}, ..., q_n}, so Q_i ⊃ Q_j. In other words, p_i satisfies the neighbor requirement of more queries than p_j. For the evaluation of p, p_i is more important than p_j, because p_i brings the outlier status of p closer to being determined with respect to all queries in Q than p_j does. From this perspective, p_i dominates p_j.

On the other hand, assume r_m < dist(p_i, p) < dist(p_j, p) ≤ r_{m+1}. Then p_i and p_j are both neighbors of p for the same set of queries {q_{m+1}, ..., q_n}. In this scenario p_i and p_j equally affect the outlier status of p although dist(p_i, p) ≠ dist(p_j, p). Based on this observation we are now ready to redefine the distance function dist(p, p_i) so as to normalize the distance between data points. The original distance function is denoted as dist_o(p, p_i) instead.

Definition 8.1 Given a query group Q: {q_1(r_1), q_2(r_2), ..., q_m(r_m), q_{m+1}(r_{m+1}), ..., q_n(r_n)} with r_1 < r_2 < ... < r_m < r_{m+1} < ... < r_n, dist(p, p_i) = m + 1 if r_m < dist_o(p, p_i) ≤ r_{m+1}, for 0 ≤ m ≤ n, with r_0 defined as −∞ and r_{n+1} defined as ∞.

By Def. 8.1, dist(p_i, p) = dist(p_j, p) if r_m < dist_o(p_i, p) < dist_o(p_j, p) ≤ r_{m+1}. This new normalized distance calculated using Def. 8.1 now accurately represents the importance of each data point to p.
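
The normalization of Def. 8.1 amounts to locating the original distance among the sorted r values. The following is a minimal sketch, assuming the r parameters are held in an ascending Python list; the name `normalized_distance` is hypothetical and not taken from the source.

```python
from bisect import bisect_left


def normalized_distance(d_orig, radii):
    """Def. 8.1: return m + 1 where r_m < d_orig <= r_{m+1},
    with r_0 = -inf and r_{n+1} = +inf.

    `radii` is the ascending list [r_1, ..., r_n] (an assumed representation).
    bisect_left returns the number of radii strictly smaller than d_orig,
    which is exactly m.
    """
    return bisect_left(radii, d_orig) + 1


radii = [1, 2, 3]
print(normalized_distance(1.2, radii))  # 2
print(normalized_distance(1.9, radii))  # 2: same bucket as 1.2
print(normalized_distance(3.5, radii))  # 4: beyond r_n, not a neighbor for any query
```

Two points whose original distances fall between the same pair of consecutive r values receive the same normalized distance, and a value of n + 1 marks a point that is a neighbor for no query in the group.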

Time Dimension. In the streaming context, the presence of the time dimension further complicates matters. In particular, we cannot simply claim that a data point p_i closer to p impacts the status of p more than the other points. Instead, the arrival time of the data points also has to be taken into consideration. A point p_i that arrived later in the

to an earlier arriving p_j even if p_i is not closer to p than p_j. This is so because the younger a data point p_i is, the longer its neighbor relationships (if any) with p will persist into the future.

Domination Relationship. We now define the domination relationship between a pair of points in dataset D_{W_c} that takes both the distance and time dimensions into consideration.

Definition 8.2 Domination Relationship. Given a query group Q: {q_1(r_1), q_2(r_2), ..., q_m(r_m), q_{m+1}(r_{m+1}), ..., q_n(r_n)} with r_1 < r_2 < ... < r_m < r_{m+1} < ... < r_n, point p_i dominates p_j with respect to point p if: (1) p_i.time > p_j.time; (2) dist(p, p_i) ≤ dist(p, p_j) (p_i, p_j ∈ D_{W_c} − {p} and p ∈ D_{W_c}); (3) dist(p, p_i) ≤ n, with dist() the normalized distance of Q defined in Def. 8.1.

In other words, a data point p_i dominates another point p_j only if p_i expires later than p_j from window W_c (Condition 1) and it is not farther away from p than p_j (Condition 2). The third condition in the domination rule filters out any data point p_i that is not a neighbor of p for any query in Q, since such a p_i would never influence the outlier status of p.
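
The three conditions of Def. 8.2 translate directly into a predicate. The sketch below assumes each point is represented only by its arrival time and its normalized distance to the fixed point p; the `Point` class and the `dominates` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    time: int  # arrival time; a larger value means a younger point
    dist: int  # normalized distance to p (Def. 8.1)


def dominates(pi, pj, n):
    """Def. 8.2 with respect to a fixed point p, for a group of n queries."""
    return (
        pi.time > pj.time        # Condition 1: pi expires later than pj
        and pi.dist <= pj.dist   # Condition 2: pi is not farther from p
        and pi.dist <= n         # Condition 3: pi is a neighbor for some query
    )


p7, p8 = Point(time=7, dist=3), Point(time=8, dist=2)
print(dominates(p8, p7, n=3))  # True: younger and closer
print(dominates(p7, p8, n=3))  # False: older
print(dominates(Point(time=9, dist=4), Point(time=6, dist=4), n=3))  # False: not a neighbor
```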

Based on the domination relationship defined in Def. 8.2, the outlier status of p with respect to all queries in Q can now be correctly determined from the skyband points delivered by one single (k − 1)-skyband query denoted as Q_s, namely the K-skyband query with K specified as k − 1.

Lemma 8.1 Given a query group Q, for any data point p, the output of the skyband query Q_s corresponding to Q, denoted as S_p, is sufficient and necessary to continuously determine the outlier status of p with respect to all queries in Q.

Here we sketch the key ideas of the proof for this lemma.

Sufficiency. The sufficiency of this mapping is based on two observations, namely the KNN observation and the K-distance observation, as explained below.

KNN Observation. First, Q_s always returns the k nearest neighbors of p as part of the skyband points. The k nearest neighbors of p, denoted as kNN(p), are k points in D_{W_c} that do not have a larger distance to p than any other point in D_{W_c}. The proof of this observation is intuitive. Given any point p_i ∈ kNN(p), at most k − 1 points in D_{W_c} are closer to p than p_i. By the domination relationship defined in Def. 8.2, at most k − 1 points in D_{W_c} dominate p_i. Therefore p_i is a skyband point of our skyband query Q_s.

K-distance Observation. Second, once kNN(p) is discovered, the outlier status of p with respect to each query in Q can be determined by examining the distance between p and its kth nearest neighbor, called k-distance(p). If r_m < k-distance(p) ≤ r_{m+1}, then p is guaranteed to be an outlier for queries {q_1, q_2, ..., q_m} and an inlier for queries {q_{m+1}, ..., q_n}.

Justifying this observation is straightforward. If k-distance(p) ≤ r_{m+1}, then all points in kNN(p) are neighbors of p for queries {q_{m+1}, ..., q_n}. Therefore p is an inlier for these queries. On the other hand, since k-distance(p) > r_m, p does not have k neighbors for queries {q_1, q_2, ..., q_m}; otherwise the points in kNN(p) would not be the k nearest points to p in D_{W_c}. Thus p is an outlier for queries {q_1, q_2, ..., q_m}.
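
The K-distance observation reduces the per-query decision to one comparison against each r value. A minimal sketch follows, assuming the r parameters are given as an ascending list and that a point with fewer than k neighbors is assigned an infinite k-distance; the function name `outlier_status` is hypothetical.

```python
import math


def outlier_status(k_distance, radii):
    """Return one boolean per query, in ascending-r order.

    True means p is an outlier for that query: its k-th nearest
    neighbor lies farther away than the query's r parameter.
    """
    return [k_distance > r for r in radii]


radii = [1, 2, 3]
print(outlier_status(2, radii))         # [True, False, False]
print(outlier_status(3, radii))         # [True, True, False]
print(outlier_status(math.inf, radii))  # [True, True, True]
```

Because the r values are sorted, the result is always a run of outlier verdicts followed by a run of inlier verdicts, which is the split at q_m described above.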

[Figure: stream of points p_1, ..., p_12, each given as <arrival time, distance to p>. Window W_c holds p_1 to p_8 and window W_{c+1} holds p_5 to p_12, where the new arrivals are p_9: <t_9, 4>, p_10: <t_10, 5>, p_11: <t_11, 6>, p_12: <t_12, 4>.]

Figure 8.1: SOP: Sliding window stream

Example 8.1 Given a query group Q: {q_1(1), q_2(2), q_3(3)} with the k parameter set to 3 and the dataset D_{W_c} composed of points p_i represented in the arrival time and distance space (<t_i, d_i>): {p_1: <t_1, 2>, p_2: <t_2, 3>, p_3: <t_3, 2>, p_4: <t_4, 1>, p_5: <t_5, 1>, p_6: <t_6, 4>, p_7: <t_7, 3>, p_8: <t_8, 2>} as shown in Fig. 8.1. Here t_i indicates the arrival time of p_i (t_1 < t_2 < ... < t_8) and d_i indicates the distance of p_i to p. The (k − 1)-skyband query Q_s with k = 3 will return {p_4: <t_4, 1>, p_5: <t_5, 1>, p_7: <t_7, 3>, p_8: <t_8, 2>} as the skyband points in window W_c. A subset of this result, namely {p_4: <t_4, 1>, p_5: <t_5, 1>, p_8: <t_8, 2>}, is the kNN of p. The k-distance of p thus is 2. By the k-distance observation we can correctly derive the outlier status of p: p is an outlier for q_1, while being an inlier for q_2 and q_3.
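
Example 8.1 can be reproduced with a brute-force skyband computation. The sketch below is an illustration under stated assumptions, not the evaluation strategy of the source: arrival times are replaced by integer indices, the distances in the example already coincide with their normalized values for r = 1, 2, 3, and points that are neighbors for no query (normalized distance above n) are assumed never to be reported, which is what the result listed in the example implies for p_6. The function name `skyband` is hypothetical.

```python
import math

# Window W_c from Example 8.1: name -> (arrival index, distance to p).
window = {
    "p1": (1, 2), "p2": (2, 3), "p3": (3, 2), "p4": (4, 1),
    "p5": (5, 1), "p6": (6, 4), "p7": (7, 3), "p8": (8, 2),
}
radii = [1, 2, 3]
k = 3


def skyband(points, n, K):
    """Names of neighbor points (dist <= n) dominated by at most K points."""
    result = []
    for name, (t, d) in points.items():
        if d > n:
            continue  # assumption: non-neighbors are never reported
        # Conditions 1 and 2 of Def. 8.2; Condition 3 holds since d2 <= d <= n.
        dominators = sum(1 for (t2, d2) in points.values() if t2 > t and d2 <= d)
        if dominators <= K:
            result.append(name)
    return result


sky = skyband(window, n=len(radii), K=k - 1)
# Pick the k nearest skyband points, breaking distance ties by latest arrival.
knn = sorted(sky, key=lambda name: (window[name][1], -window[name][0]))[:k]
k_distance = window[knn[-1]][1] if len(knn) == k else math.inf

print(sky)         # ['p4', 'p5', 'p7', 'p8']
print(k_distance)  # 2
print([k_distance > r for r in radii])  # [True, False, False]
```

The output agrees with the example: the skyband holds p_4, p_5, p_7 and p_8, the k-distance is 2, and p is an outlier only for q_1.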

Necessity. Note that in the above example, since the skyband point p_7: <t_7, 3> is not in the kNN(p) set of W_c, p_7 is not utilized to evaluate p in W_c. However, p_7 arrived later than p_4 and p_5 in kNN(p). It might therefore still benefit the evaluation of p in future windows.

As shown in Fig. 8.1, when the window slides from W_c to W_{c+1}, p_4: <t_4, 1> will expire. Since all new arrivals p_i ({p_9, p_10, p_11, p_12}) in W_{c+1} are far from p, namely dist(p, p_i) > 3, p_7 will now be in kNN(p) = {<t_5, 1>, <t_7, 3>, <t_8, 2>} of W_{c+1}. Since p_7 is the third nearest neighbor of p, the distance dist(p_7, p) = 3 will be used to determine the outlier status of p. Now p is an outlier for q_1 and q_2, while being an inlier only for q_3.
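
The necessity argument can be illustrated by comparing what survives the slide when the full skyband of W_c is retained with what survives when only kNN(p) is retained. This is a hedged sketch: the skyband and kNN sets are copied from Example 8.1, arrival times are replaced by integer indices, and the helper names `k_distance` and `expire` are hypothetical. The new arrivals p_9 to p_12 are left out because their distance to p exceeds r_3 = 3, so they are neighbors for no query.

```python
import math


def k_distance(candidates, k):
    """k-th smallest distance among candidates, or inf if fewer than k exist."""
    dists = sorted(d for (_t, d) in candidates.values())
    return dists[k - 1] if len(dists) >= k else math.inf


def expire(candidates, expired):
    """Drop the points that left the window."""
    return {name: v for name, v in candidates.items() if name not in expired}


radii, k = [1, 2, 3], 3
skyband_wc = {"p4": (4, 1), "p5": (5, 1), "p7": (7, 3), "p8": (8, 2)}
knn_wc = {name: skyband_wc[name] for name in ("p4", "p5", "p8")}

# The window slides to W_{c+1}: p4 expires.
kd_skyband = k_distance(expire(skyband_wc, {"p4"}), k)
kd_knn_only = k_distance(expire(knn_wc, {"p4"}), k)

print(kd_skyband)                           # 3
print([kd_skyband > r for r in radii])      # [True, True, False]
print(kd_knn_only)                          # inf
print([kd_knn_only > r for r in radii])     # [True, True, True]
```

Keeping only kNN(p) would leave two neighbors after p_4 expires and would report p as an outlier for q_3 as well, whereas the retained skyband point p_7 yields the k-distance of 3 stated in the text.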