Chapter 4: RESEARCH RESULTS
4.2. Estimation of the measurement model
4.2.2. Validation of the reflective indices
band Processing
The K-skyband query is a generalization of the well-known skyline concept. As defined in [57], a K-skyband query reports all points that are dominated by no more than K points. The case K = 0 corresponds to a conventional skyline. The key idea underlying the skyline concept is to define the domination relationship between any two data points. As a simple example, consider a dataset $D$ composed of $n$ one-dimensional data points, namely $n$ distinct values $\{p_1, p_2, p_3, \ldots, p_n\}$. Assume the domination relationship between any pair of data points $p_i$ and $p_j$ ($1 \le i, j \le n$) is defined as: $p_i$ dominates $p_j$ if $p_i > p_j$. Then the K-skyband query with K = 2 on dataset $D$ returns the three largest points in $D$: $\{p_{max}, p_{max-1}, p_{max-2}\}$. The point $p_{max-2}$ is dominated by the two data points $p_{max}$ and $p_{max-1}$, while every other data point in $D$ is dominated by at least these three points.
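The one-dimensional example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `k_skyband_1d` and the sample values are assumptions made here, and the quadratic scan is chosen for clarity rather than efficiency.

```python
def k_skyband_1d(values, K):
    """Return the values dominated by at most K other values.

    Domination follows the toy rule from the text: u dominates v if u > v.
    """
    result = []
    for v in values:
        dominators = sum(1 for u in values if u > v)
        if dominators <= K:
            result.append(v)
    return result


points = [7, 3, 9, 1, 5]          # hypothetical distinct values
print(k_skyband_1d(points, 2))    # [7, 9, 5], the three largest values
print(k_skyband_1d(points, 0))    # [9], the conventional skyline
```

With K = 2 the result is exactly the three largest values, and with K = 0 it collapses to the single maximum.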
To map our problem of determining the outlier status of a given point $p$ to the K-skyband problem, we similarly have to define the domination relationship between any pair of data points in the dataset $D_{W_c}$, i.e., the population of the current window $W_c$. The key observation here is that, given any two points $p_i$ and $p_j$, two key factors, namely their relative arrival time and their distance to the point $p$ under evaluation, determine whether $p_i$ is more important than $p_j$ in terms of evaluating the outlier status of $p$.
Let us introduce a query group $Q$ used in the remainder of this section. Assume we have a query group $Q$: $\{q_1(r_1), q_2(r_2), \ldots, q_m(r_m), q_{m+1}(r_{m+1}), \ldots, q_n(r_n)\}$¹, where $r_m$ represents the $r$ parameter of query $q_m$. The $r$ parameter of $q_1, q_2, \ldots, q_n$ monotonically increases, that is, $r_1 < r_2 < \ldots < r_m < r_{m+1} < \ldots < r_n$.
¹ For the ease of readability, we only list those parameters in the query notation $q_i(r, k, win, slide)$ that
8.1 K-SKY: VARYING PARAMETER - R
Distance Dimension. In the distance-based outlier definition (Def. 2.1), points in a dataset $D$ are classified either as outliers or inliers. Thus, the process of identifying outliers in $D$ is equivalent to the process of finding and eliminating inliers from it. By Def. 2.1, $p$ is guaranteed to be an inlier once $k$ neighbors are acquired in $D$. Given two points $p_i$ and $p_j$,
assume $dist(p_i, p) < r_m < dist(p_j, p) < r_{m+1}$. Then $p_i$ is a neighbor of $p$ with respect to the query subset $Q_i = \{q_m, \ldots, q_n\}$, while $p_j$ is a neighbor of $p$ only with respect to the query subset $Q_j = \{q_{m+1}, \ldots, q_n\}$, so $Q_i \supset Q_j$. In other words, $p_i$ satisfies the neighbor requirement of more queries than $p_j$. For the evaluation of $p$, $p_i$ is more important than $p_j$, because $p_i$ brings the outlier status of $p$ closer to being determined with respect to all queries in $Q$ than $p_j$ does. From this perspective, $p_i$ dominates $p_j$.
On the other hand, assume $r_m < dist(p_i, p) < dist(p_j, p) < r_{m+1}$. Then $p_i$ and $p_j$ are both neighbors of $p$ for the same set of queries $\{q_{m+1}, \ldots, q_n\}$. In this scenario $p_i$ and $p_j$ equally affect the outlier status of $p$, although $dist(p_i, p) \ne dist(p_j, p)$. Based on this observation we are now ready to redefine the distance function $dist(p, p_i)$ so as to normalize the distance between data points. The original distance function is denoted as $dist_o(p, p_i)$ instead.
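Definition 8.1 Given a query group $Q$: $\{q_1(r_1), q_2(r_2), \ldots, q_m(r_m), q_{m+1}(r_{m+1}), \ldots, q_n(r_n)\}$ with $r_1 < r_2 < \ldots < r_m < r_{m+1} < \ldots < r_n$, $dist(p, p_i) = m + 1$ if $r_m < dist_o(p, p_i) \le r_{m+1}$ for $0 \le m \le n$, with $r_0$ defined as $-\infty$ and $r_{n+1}$ defined as $\infty$.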
By Def. 8.1, $dist(p_i, p) = dist(p_j, p)$ if $r_m < dist_o(p_i, p) < dist_o(p_j, p) < r_{m+1}$. This new normalized distance calculated using Def. 8.1 now accurately represents the importance of each data point to $p$.
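The normalization of Def. 8.1 amounts to counting how many radii lie strictly below the original distance. The sketch below illustrates this with Python's standard `bisect` module; the function name `normalized_dist` and the sample radii are assumptions introduced here for illustration.

```python
from bisect import bisect_left


def normalized_dist(d_orig, radii):
    """Normalized distance per Def. 8.1.

    radii holds r_1 < r_2 < ... < r_n in ascending order. bisect_left returns
    the number m of radii strictly smaller than d_orig, so that
    r_m < d_orig <= r_{m+1}, and the normalized distance is m + 1.
    """
    return bisect_left(radii, d_orig) + 1


radii = [1.0, 2.0, 3.0]                 # hypothetical r parameters of q1, q2, q3
print(normalized_dist(1.2, radii))      # 2
print(normalized_dist(2.0, radii))      # 2, same bucket as 1.2
print(normalized_dist(4.5, radii))      # 4, i.e. n + 1: a neighbor for no query
```

Two points whose original distances fall between the same pair of adjacent radii receive the same normalized distance, which is the property stated above.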
Time Dimension. In the streaming context the presence of the time dimension further complicates matters. In particular, we cannot simply claim that a data point $p_i$ closer to $p$ impacts the status of $p$ more than the other points. Instead, the arrival time of the data points also has to be taken into consideration. A point $p_i$ that arrived later in the window can be more important than an earlier arriving $p_j$ even if $p_i$ is not closer to $p$ than $p_j$. This is so because the younger a data point $p_i$ is, the longer its neighbor relationships (if any) with $p$ will persist into the future.
Domination Relationship. We now define the domination relationship between a pair of points in dataset $D_{W_c}$ that takes both the distance and time dimensions into consideration.
Definition 8.2 Domination Relationship. Given a query group $Q$: $\{q_1(r_1), q_2(r_2), \ldots, q_m(r_m), q_{m+1}(r_{m+1}), \ldots, q_n(r_n)\}$ with $r_1 < r_2 < \ldots < r_m < r_{m+1} < \ldots < r_n$, point $p_i$ dominates $p_j$ with respect to point $p$ if: (1) $p_i.time > p_j.time$; (2) $dist(p, p_i) \le dist(p, p_j)$, where $p_i, p_j \in D_{W_c} \setminus \{p\}$ and $p \in D_{W_c}$; (3) $dist(p, p_i) \le n$, with $dist()$ the normalized distance of $Q$ defined in Def. 8.1.
In other words, a data point $p_i$ dominates another point $p_j$ only if $p_i$ expires later than $p_j$ from window $W_c$ (Condition 1) and it is not farther away from $p$ than $p_j$ (Condition 2). The third condition in the domination rule filters out any data point $p_i$ that is not a neighbor of $p$ for any query in $Q$, since such a $p_i$ would never influence the outlier status of $p$.
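The three conditions translate directly into a predicate. The following is a minimal sketch, assuming each point is represented by its arrival time and its already normalized distance to $p$; the `Point` type and the `dominates` name are hypothetical and introduced only for this illustration.

```python
from typing import NamedTuple


class Point(NamedTuple):
    time: int   # arrival time; a later arrival also expires later
    dist: int   # normalized distance to p (Def. 8.1)


def dominates(pi: Point, pj: Point, n: int) -> bool:
    """True if pi dominates pj with respect to p, for a group of n queries."""
    arrives_later = pi.time > pj.time     # Condition 1
    not_farther = pi.dist <= pj.dist      # Condition 2
    is_neighbor = pi.dist <= n            # Condition 3
    return arrives_later and not_farther and is_neighbor


print(dominates(Point(8, 2), Point(7, 3), n=3))   # True
print(dominates(Point(7, 3), Point(8, 2), n=3))   # False: arrives earlier
print(dominates(Point(7, 4), Point(6, 4), n=3))   # False: not a neighbor for any query
```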
Based on the domination relationship defined in Def. 8.2, the outlier status of $p$ with respect to all queries in $Q$ can now be correctly answered based on the skyband points delivered by one single $(k-1)$-skyband query denoted as $Q_s$, namely the K-skyband query with K specified as $k - 1$.
Lemma 8.1 Given a query group $Q$, for any data point $p$, the output of the skyband query $Q_s$ corresponding to $Q$, denoted as $S_p$, is sufficient and necessary to continuously determine the outlier status of $p$ with respect to all queries in $Q$.
Here we sketch the key ideas of the proof for this lemma.
Sufficiency. The sufficiency of this mapping is based on two observations, namely the KNN observation and the K-distance observation, as explained below.
KNN Observation. First, $Q_s$ always returns the $k$ nearest neighbors of $p$ as part of the skyband points. The $k$ nearest neighbors of $p$, denoted as $kNN(p)$, are $k$ points in $D_{W_c}$ that do not have a larger distance to $p$ than any other point in $D_{W_c}$. The proof of this observation is intuitive. Given any point $p_i \in kNN(p)$, at most $k - 1$ points in $D_{W_c}$ are closer to $p$ than $p_i$. By the domination relationship defined in Def. 8.2, at most $k - 1$ points in $D_{W_c}$ dominate $p_i$. Therefore $p_i$ is a skyband point of our skyband query $Q_s$.
K-distance Observation. Second, once $kNN(p)$ is discovered, the outlier status of $p$ with respect to each query in $Q$ can be determined by examining the distance between $p$ and its $k$th-nearest neighbor, called $k$-distance($p$). If $r_m <$ $k$-distance($p$) $\le r_{m+1}$, then $p$ is guaranteed to be an outlier for queries $\{q_1, q_2, \ldots, q_m\}$ and an inlier for queries $\{q_{m+1}, \ldots, q_n\}$.
Justifying this observation is straightforward. If $k$-distance($p$) $\le r_{m+1}$, then all points in $kNN(p)$ are neighbors of $p$ for queries $\{q_{m+1}, \ldots, q_n\}$. Therefore $p$ is an inlier for these queries. On the other hand, since $k$-distance($p$) $> r_m$, $p$ does not have $k$ neighbors for queries $\{q_1, q_2, \ldots, q_m\}$. Otherwise the points in $kNN(p)$ would not be the $k$ nearest points to $p$ in $D_{W_c}$. Thus $p$ is an outlier for queries $\{q_1, q_2, \ldots, q_m\}$.
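The K-distance observation can be sketched as follows. This is an illustration rather than the implementation used in this work: the function name `outlier_status`, its parameters, and the sample distances are assumptions, and the distances passed in are original (not normalized) distances from $p$ to the other points in the window.

```python
import math


def outlier_status(dists_to_p, k, radii):
    """Map each radius r_j to True (outlier) or False (inlier) for point p.

    p is an outlier for a query with radius r exactly when its k-distance,
    the distance to its k-th nearest neighbor, exceeds r.
    """
    nearest = sorted(dists_to_p)[:k]
    # With fewer than k points in total, p can never reach k neighbors.
    k_distance = nearest[-1] if len(nearest) == k else math.inf
    return {r: k_distance > r for r in radii}


# Hypothetical window: k-distance is 2.4, so r_2 < k-distance <= r_3.
print(outlier_status([0.5, 2.4, 1.7, 3.9], k=3, radii=[1, 2, 3]))
# {1: True, 2: True, 3: False}
```

A single k-distance therefore settles the status of $p$ for every query in the group at once.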
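[Figure 8.1, image not reproduced. The stream consists of points $p_i: \langle t_i, d_i \rangle$. Window $W_c$ holds $p_1: \langle t_1, 2 \rangle$, $p_2: \langle t_2, 3 \rangle$, $p_3: \langle t_3, 2 \rangle$, $p_4: \langle t_4, 1 \rangle$, $p_5: \langle t_5, 1 \rangle$, $p_6: \langle t_6, 4 \rangle$, $p_7: \langle t_7, 3 \rangle$, $p_8: \langle t_8, 2 \rangle$. Window $W_{c+1}$ holds $p_5$ through $p_8$ together with the new arrivals $p_9: \langle t_9, 4 \rangle$, $p_{10}: \langle t_{10}, 5 \rangle$, $p_{11}: \langle t_{11}, 6 \rangle$, $p_{12}: \langle t_{12}, 4 \rangle$.]
Figure 8.1: SOP: Sliding window stream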
Example 8.1 Given a query group $Q$: $\{q_1(1), q_2(2), q_3(3)\}$ with the $k$ parameter set to 3 and the dataset $D_{W_c}$ composed of points $p_i$ represented in the arrival time and distance space ($\langle t_i, d_i \rangle$): $\{p_1: \langle t_1, 2 \rangle, p_2: \langle t_2, 3 \rangle, p_3: \langle t_3, 2 \rangle, p_4: \langle t_4, 1 \rangle, p_5: \langle t_5, 1 \rangle, p_6: \langle t_6, 4 \rangle, p_7: \langle t_7, 3 \rangle, p_8: \langle t_8, 2 \rangle\}$, as shown in Fig. 8.1. Here $t_i$ indicates the arrival time of $p_i$ ($t_1 < t_2 < \ldots < t_8$) and $d_i$ indicates the distance of $p_i$ to $p$. The $(k-1)$-skyband query $Q_s$ with $k = 3$ will return $\{p_4: \langle t_4, 1 \rangle, p_5: \langle t_5, 1 \rangle, p_7: \langle t_7, 3 \rangle, p_8: \langle t_8, 2 \rangle\}$ as the skyband points in window $W_c$. A subset of this result, namely $\{p_4: \langle t_4, 1 \rangle, p_5: \langle t_5, 1 \rangle, p_8: \langle t_8, 2 \rangle\}$, is the $kNN$ of $p$. The $k$-distance of $p$ thus is 2. By the K-distance observation we can correctly derive the outlier status of $p$: $p$ is an outlier for $q_1$, while being an inlier for $q_2$ and $q_3$.
Necessity. Note that in the above example, since the skyband point $p_7: \langle t_7, 3 \rangle$ is not in the $kNN(p)$ set of $W_c$, $p_7$ is not utilized to evaluate $p$ in $W_c$. However, $p_7$ arrived later than $p_4$ and $p_5$ in $kNN(p)$. Potentially it might still benefit the evaluation of $p$ in future windows.
As shown in Fig. 8.1, when the window slides from $W_c$ to $W_{c+1}$, $\langle t_4, 1 \rangle$ will expire. Since all new arrivals $p_i$ ($\{p_9, p_{10}, p_{11}, p_{12}\}$) in $W_{c+1}$ are far from $p$, namely $dist(p, p_i) > 3$, $p_7$ will now be in $kNN(p) = \{\langle t_5, 1 \rangle, \langle t_7, 3 \rangle, \langle t_8, 2 \rangle\}$ of $W_{c+1}$. As $p_7$ is the third nearest neighbor of $p$, the distance $dist(p_7, p) = 3$ will be utilized to determine the outlier status of $p$. Now $p$ is an outlier for $q_1$ and $q_2$, while being an inlier only for $q_3$.
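Example 8.1 and the subsequent window slide can be replayed with a short brute-force sketch. The names `skyband` and `status` and the dictionary layout are assumptions made for this illustration. The sketch also assumes that points that are neighbors of $p$ for no query (normalized distance greater than $n$) are dropped before counting dominators, which reproduces the skyband reported in the example. Because the radii are 1, 2, 3, the normalized distance equals the original distance for values up to 3, and every larger distance normalizes to 4.

```python
def skyband(window, k, n):
    """Names of the (k-1)-skyband points for p.

    window maps a point name to (arrival_time, normalized_distance_to_p).
    """
    # Assumption: points with distance > n are neighbors for no query and are
    # dropped up front, matching the result reported in Example 8.1.
    cands = {name: td for name, td in window.items() if td[1] <= n}
    result = []
    for name, (t, d) in cands.items():
        # A dominator arrives later and is not farther from p (Def. 8.2).
        dominators = sum(1 for t2, d2 in cands.values() if t2 > t and d2 <= d)
        if dominators <= k - 1:
            result.append(name)
    return result


def status(window, sky, k, n):
    """Label p for q_1..q_n using only the skyband points."""
    dists = sorted(window[name][1] for name in sky)[:k]
    k_distance = dists[-1] if len(dists) == k else float("inf")
    # q_j has normalized radius j, so p is an outlier for q_j if k_distance > j.
    return {f"q{j}": "outlier" if k_distance > j else "inlier"
            for j in range(1, n + 1)}


wc = {"p1": (1, 2), "p2": (2, 3), "p3": (3, 2), "p4": (4, 1),
      "p5": (5, 1), "p6": (6, 4), "p7": (7, 3), "p8": (8, 2)}
# After the slide p1..p4 have expired; p9..p12 all normalize to distance 4.
wc1 = {"p5": (5, 1), "p6": (6, 4), "p7": (7, 3), "p8": (8, 2),
       "p9": (9, 4), "p10": (10, 4), "p11": (11, 4), "p12": (12, 4)}

for w in (wc, wc1):
    sky = skyband(w, k=3, n=3)
    print(sky, status(w, sky, k=3, n=3))
# ['p4', 'p5', 'p7', 'p8'] {'q1': 'outlier', 'q2': 'inlier', 'q3': 'inlier'}
# ['p5', 'p7', 'p8'] {'q1': 'outlier', 'q2': 'outlier', 'q3': 'inlier'}
```

The second line of output shows why $p_7$ has to be kept even though it is not among the nearest neighbors in $W_c$: once $p_4$ expires, $p_7$ becomes the third nearest neighbor and decides the status of $p$ for $q_2$.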