parallelization for outlier detection among different grid cells, we introduce the notion of supporting area. As formally defined in Def. 13.2, the data points within the supporting area of cell $C_i$ may affect the outlier decision of at least one core point of $C_i$.

Definition 13.2 Supporting Area. The supporting area of a grid cell $C_i$, denoted as $C_i.\text{suppArea}$, is an extension of the boundaries of $C_i$ in each dimension of $D$. All data points $p_j \in C_i.\text{suppArea}$ (also called support points) satisfy the following two conditions: (1) $p_j \notin C_i.\text{core}$, and (2) there exists at least one point $p_k \in C_i.\text{core}$ such that $\text{dist}(p_j, p_k) \le r$, where $r$ is the distance threshold parameter in Def. 2.1.

Figure 13.1(a) highlights in grey the supporting areas of grid cells $C_1$ and $C_7$, respectively. Each grid cell $C_i$ will now be augmented with its support points in addition to its core points. For example, $C_1$ will be extended to contain the support points $\{p_1, p_3\}$ in $C_1$'s supporting area, along with its circle-shaped core points.
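The membership test behind the supporting area can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: cells are taken to be axis-aligned and half-open, described by hypothetical `lo`/`hi` corner tuples, and the function names are invented for this example. The test is purely geometric (a point outside the cell but within distance r of its boundary), so it is a conservative stand-in for condition (2): it does not check that an actual core point lies within r.

```python
import math

def in_cell(p, lo, hi):
    """True if p lies inside the half-open cell [lo, hi) (assumed layout)."""
    return all(l <= x < h for x, l, h in zip(p, lo, hi))

def dist_to_cell(p, lo, hi):
    """Euclidean distance from p to the closest location of the cell."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(p, lo, hi)))

def in_supp_area(p, lo, hi, r):
    """Geometric supporting-area test: outside the cell, within r of it."""
    return (not in_cell(p, lo, hi)) and dist_to_cell(p, lo, hi) <= r

if __name__ == "__main__":
    lo, hi, r = (0.0, 0.0), (1.0, 1.0), 0.5
    for p in [(1.3, 0.5), (1.3, 1.3), (1.4, 1.4), (0.5, 0.5)]:
        print(p, in_supp_area(p, lo, hi, r))
```

The corner case (1.4, 1.4) is rejected because corners of the supporting area are quarter circles of radius r rather than squares.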

Step 3: Parallelized Outlier Detection. The final step is to directly apply any centralized outlier detection algorithm, e.g., the Nested-Loop algorithm [8], to each of the grid cells $C_i$ to identify the outliers contained within that cell. This step can now be applied to each grid cell in total isolation from the others. Hence each cell can be distributed to a different machine.
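A per-cell detection step in the spirit of a nested-loop scan might look like the sketch below. It assumes that a point is an outlier when it has fewer than k neighbors within distance r (Def. 2.1 is not reproduced in this section, so this reading is an assumption), uses Euclidean distance, and the function name `detect_outliers_in_cell` is hypothetical. Only core points are classified; support points serve solely as potential neighbors.

```python
import math

def detect_outliers_in_cell(core, support, r, k):
    """Nested-loop scan over one grid cell.

    core    -- points whose outlier status this cell decides
    support -- points from the supporting area (neighbors only)
    Assumption: outlier <=> fewer than k neighbors within distance r.
    """
    candidates = list(core) + list(support)
    outliers = []
    for i, p in enumerate(core):
        neighbors = 0
        for j, q in enumerate(candidates):
            if j == i:            # core points come first, so index i is p itself
                continue
            if math.dist(p, q) <= r:
                neighbors += 1
                if neighbors >= k:  # enough neighbors found, stop early
                    break
        if neighbors < k:
            outliers.append(p)
    return outliers

if __name__ == "__main__":
    core = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
    support = [(0.0, 0.1)]
    print(detect_outliers_in_cell(core, support, r=0.5, k=2))
    print(detect_outliers_in_cell(core, [], r=0.5, k=2))
```

Running the second call without the support point shows why the augmentation matters: the two nearby core points then find only one neighbor each and would be misreported as outliers.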

13.2 Optimality in Duplication Rate

The cost of a MapReduce algorithm is usually determined by two factors, namely communication and computation costs. The communication costs correspond to the costs of transmitting data from mappers to reducers. Often, if not always, the communication costs are the dominant costs of a MapReduce job [60]. Similar to the communication costs, the computation costs (especially those of the reducers) are also directly related to the number of data points transmitted from mappers to reducers, i.e., the more data points the reducers receive, the more computational work they perform.

In the DOD framework presented in Figure 13.1, each data point has to be transmitted at least once from mappers to reducers, since the latter perform the detection of the outliers. Therefore, the efficiency of the framework can be modeled using the notion of "duplication rate", which refers to the average number of duplicates that mappers need to create for each input data point. The larger the duplication rate, the more data points must be transmitted from mappers to reducers, and thus the higher the communication and computation costs. The duplication rate is defined next.

Definition 13.3 Duplication Rate (dr). For a dataset $D$ and a MapReduce algorithm $A$ for detecting distance-based outliers in $D$, the "duplication rate" $dr(D, A) \in [1, \infty)$ represents the average number of duplicates that the mapper phase of $A$ generates per data point $p_i \in D$.
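As a small illustration of this definition, the duplication rate can be measured directly from what the mappers emit. The sketch below assumes a hypothetical bookkeeping structure, a dictionary from point identifier to the list of cells that receive a copy of the point; neither the structure nor the function name comes from the framework itself.

```python
def duplication_rate(assignments):
    """Average number of copies emitted per input point.

    assignments -- dict: point id -> list of cell ids receiving a copy
                   (hypothetical representation of the mapper output).
    """
    if not assignments:
        raise ValueError("duplication rate is undefined for an empty dataset")
    total_copies = sum(len(cells) for cells in assignments.values())
    return total_copies / len(assignments)

if __name__ == "__main__":
    emitted = {
        "p1": ["C1"],                    # core point only
        "p2": ["C1", "C2"],              # core in C1, support for C2
        "p3": ["C2", "C1", "C4", "C5"],  # near a corner shared by four cells
    }
    print(duplication_rate(emitted))     # 7 copies over 3 points
```

Because every point is sent to at least the cell that owns it, the value can never drop below 1.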

Intuitively, to minimize the overall costs, an algorithm should produce all outliers for the input dataset $D$ in the fewest possible rounds of MapReduce jobs. In addition, mappers should transmit the smallest number of data points to the reducers to minimize the duplication rate. In the following, we show that the DOD framework using the uniSpace partitioning strategy is optimal w.r.t. the duplication rate in the case of a uniformly distributed dataset. Without loss of generality, we will use our working example of a two-dimensional space in Figure 13.1(a).

Lemma 13.1 Correctness and Minimal Duplication Rate. Assume a two-dimensional uniformly distributed dataset $D$, where the values in each dimension $d_1$, $d_2$ are normalized to $[0,1]$ ($0 \le d_1 \le 1$ and $0 \le d_2 \le 1$). Let $n$ be the number of equi-width grid cells and $r$ the distance threshold parameter in the outlier detection problem. Then the DOD framework correctly detects the distance-based outliers in $D$ in a single MapReduce job with the minimal duplication rate $dr(D,\text{DOD}) = 1 + \pi r^2 n + 4r\sqrt{n}$.

Proof. Since the two-dimensional domain space of $D$ is partitioned into equi-width square grid cells of equal area, the area that one grid cell $C_i$ covers is $A(C_i) = \frac{|d_1| \times |d_2|}{n}$, where $|d_1| \times |d_2|$ represents the area of the entire domain. Since $0 \le d_i \le 1$ for $i \in \{1, 2\}$, we have $|d_1| \times |d_2| = 1$. Therefore $A(C_i) = \frac{1}{n}$, and the side length of $C_i$, denoted as $l$, is computed as $l = \sqrt{\frac{1}{n}}$.

Since $D$ is uniformly distributed, each grid cell $C_i$ will hold the same number of core points, $|\text{core}(C_i)| = \frac{|D|}{n}$. Moreover, the duplication rate over the entire dataset $D$, denoted as $dr(D,\text{DOD})$, will be equivalent to the duplication rate of a single grid cell $C_i$, denoted as $dr(C_i,\text{DOD})$, where

$$dr(D,\text{DOD}) = dr(C_i,\text{DOD}) = \frac{|\text{core}(C_i)| + |C_i.\text{suppArea}|}{|\text{core}(C_i)|} \tag{1}$$

Since the data values are uniformly distributed over the space, the cardinalities can be directly mapped to the underlying areas as follows:

$$dr(D,\text{DOD}) = dr(C_i,\text{DOD}) = \frac{A(C_i.\text{suppArea}) + A(C_i)}{A(C_i)} \tag{2}$$

where $A(C_i.\text{suppArea})$ represents the size of the supporting area of $C_i$.

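As illustrated in Figure 13.1(a), $C_i.\text{suppArea}$ is composed of four ($r \times l$) rectangles plus four quarter circles, each of radius $r$. Therefore, $A(C_i.\text{suppArea}) = \pi r^2 + 4rl = \pi r^2 + 4r\sqrt{\frac{1}{n}}$. By replacement in Eq. (2), we get:

$$dr(D,\text{DOD}) = dr(C_i,\text{DOD}) = \frac{\pi r^2 + 4rl + l^2}{l^2} = \frac{\pi r^2 + 4r\sqrt{\frac{1}{n}} + \frac{1}{n}}{\frac{1}{n}} = 1 + \pi r^2 n + 4r\sqrt{n} \tag{3}$$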
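The closed form can be sanity-checked numerically. The sketch below evaluates the duplication rate three ways for one interior cell: from the final expression, from the area ratio, and from a seeded Monte Carlo estimate that samples locations around the cell. The function names, the sample count, and the choice of r and n are illustrative assumptions; the Monte Carlo variant models a single interior cell and therefore ignores effects at the border of the domain.

```python
import math
import random

def dr_closed_form(r, n):
    """Duplication rate 1 + pi*r^2*n + 4*r*sqrt(n)."""
    return 1 + math.pi * r ** 2 * n + 4 * r * math.sqrt(n)

def dr_from_areas(r, n):
    """Same quantity as (supporting area + cell area) / cell area."""
    l = math.sqrt(1.0 / n)                  # side length of one cell
    supp = math.pi * r ** 2 + 4 * r * l     # four rectangles + four quarter circles
    return (supp + l * l) / (l * l)

def dr_monte_carlo(r, n, samples=200_000, seed=0):
    """Estimate the ratio by uniform sampling around one interior cell."""
    rng = random.Random(seed)
    l = math.sqrt(1.0 / n)
    core = supp = 0
    for _ in range(samples):
        x = rng.uniform(-r, l + r)
        y = rng.uniform(-r, l + r)
        dx = max(-x, 0.0, x - l)            # horizontal distance to the cell
        dy = max(-y, 0.0, y - l)            # vertical distance to the cell
        if dx == 0.0 and dy == 0.0:
            core += 1
        elif math.hypot(dx, dy) <= r:
            supp += 1
    return (core + supp) / core

if __name__ == "__main__":
    r, n = 0.02, 100
    print(dr_closed_form(r, n), dr_from_areas(r, n), dr_monte_carlo(r, n))
```

For r = 0.02 and n = 100, the first two values agree up to floating-point rounding, and the sampled estimate should land close to them.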

For a given grid cell $C_i$, the support points in $C_i.\text{suppArea}$ are the necessary and sufficient set of points to determine the outlier status of the core points in $C_i$.

"Necessity" Proof. By Def. 13.2, any point $p_j \in C_i.\text{suppArea}$ is a neighbor of at least one point $p_i \in C_i$. If $p_j$ were excluded from $C_i.\text{suppArea}$, then $p_i$ could be falsely reported as an outlier whenever $p_i$ acquires only $k-1$ neighbors among the remaining points.

"Sufficiency" Proof. Any data point $p_j$ that lies neither in $C_i$ nor in $C_i.\text{suppArea}$ is not a neighbor of any point $p_i \in C_i$. Therefore $p_j$ has no influence on the decision of whether or not $p_i$ is an outlier by the distance-based outlier definition in Def. 2.1.

Next we show that square-shaped grid cells lead to the lowest duplication rate. In other words, any other rectangle-shaped grid cell would lead to a larger duplication rate. Suppose $C_j$ is a $y$ by $z$ rectangular cell. $C_j$ covers the same area of the domain space as the square cell $C_i$ with side length $x$, and hence $x^2 = y \times z$. By applying Eq. (2) to $C_i$ and $C_j$, we get:

$$dr(C_i,\text{DOD}) = 1 + \frac{\pi r^2 + 4rx}{x^2} \tag{4}$$

$$dr(C_j,\text{DOD}) = 1 + \frac{\pi r^2 + 2ry + 2rz}{yz} \tag{5}$$

Given that $x^2 = y \times z$, to prove that $dr(C_i,\text{DOD}) \le dr(C_j,\text{DOD})$ we only need to show that $2x \le y + z$. Since both sides are positive, this is equivalent to proving that $(2x)^2 \le (y+z)^2$, or equivalently $(y+z)^2 - 4x^2 \ge 0$. Since $x^2 = y \times z$, by replacement of $x$ we need to prove that $(y+z)^2 - 4yz \ge 0$. The left-hand side is equivalent to $(y-z)^2$, which is always greater than or equal to zero. That is:

$$dr(C_i,\text{DOD}) \le dr(C_j,\text{DOD}) \tag{6}$$
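The comparison between square and rectangular cells can also be checked numerically. The following sketch evaluates the rectangle form of the duplication rate for several cells of equal area; the function name and the sample dimensions are illustrative assumptions, and a square is simply the case y = z.

```python
import math

def dr_rect(r, y, z):
    """Duplication rate of a y-by-z cell: 1 + (pi*r^2 + 2*r*y + 2*r*z) / (y*z)."""
    return 1 + (math.pi * r ** 2 + 2 * r * y + 2 * r * z) / (y * z)

if __name__ == "__main__":
    r = 0.02
    # All three cells cover the same area (0.01); only the aspect ratio changes.
    for y, z in [(0.1, 0.1), (0.2, 0.05), (0.4, 0.025)]:
        print(f"{y} x {z}: dr = {dr_rect(r, y, z):.5f}")
```

The duplication rate grows as the cell becomes more elongated, because the perimeter term 2r(y + z) increases while the area yz stays fixed.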
