
the supporting area of cell Ci that covers all possible kNNs for all points in Ci. More specifically, it adopts the maximum highxi of each Ĉi as the final highxi and the minimum lowxi of each Ĉi as the final lowxi, as shown in Fig. 20.4 (b).

During the reduce phase, the "local" kNNs are attached to each point and written out to HDFS. For each cell Ci, the boundaries of its supporting area (Ĉi) together with the boundaries of Ci itself form the partition plan of the next MapReduce job, which performs the outer partition kNN search.

20.3 Outer Partition kNN Search

[Figure: the mappers Map 1 ... Map N read HDFS data blocks together with the partitioning plan and emit tagged records, e.g. (K = C1, V = "0-p1"), (K = C1, V = "1-p2"), (K = C1, V = "1-p3"), (K = C2, V = "0-p2"), (K = Cn, V = "0-pn"). After shuffling and sorting, each of the reducers Reduce 1 ... Reduce M receives as a group one cell_id (key) with the cell's core objects (tag = 0) and supporting objects (tag = 1).]

Figure 20.5: DDLOF: Outer Partition kNN Search.

Unlike in the inner partition kNN search, in the partition plan passed to the outer partition kNN search MapReduce job each cell is augmented with its supporting area. The points in the original cell Ci are the points whose kNNs we have to calculate, the so-called core points. The points in the supporting area, namely the points within Ĉi but outside Ci, are only used to support the kNN search of the core points and are called support points. Therefore, at the map phase each mapper retrieves one data block as well as the space partitioning strategy (Figure 20.5). Then, for each data point pi, the map function produces two types of output records, i.e., core- and supporting-related records.

The core-related record is one key-value pair in the form of (K = Ci, V = "0-pi"), where the key is the ID of the grid cell for which pi is a core point, i.e., pi ∈ Ci. The prefixed flag "0" in the value component indicates that pi is a core point for Ci. For example, referring to Figure 20.5, the mapper Map 1 generates the output record (K = C1, V = "0-p1") for data point p1.

Mappers also create zero or more supporting-related records for an input data point pi in the form of (K = Cj, V = "1-pi"), where the key Cj is the ID of a grid cell for which pi is a support point, i.e., pi lies within Ĉj but outside Cj. The prefixed flag "1" in the value component indicates that pi is a support point for Cj. For example, in Fig. 20.5, the mapper Map 1 generates one output record (K = C1, V = "1-p2") for point p2, since it is a support point for C1.
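
The map step described above can be sketched as follows. This is a minimal illustration in Python rather than the Hadoop job itself. It assumes two-dimensional points and rectangular cells, and the names Rect, Cell, and map_point are hypothetical. For brevity the value carries only the point ID, whereas a real record would presumably also carry the point's coordinates and its local kNNs.

```python
from typing import Iterator, List, NamedTuple, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle (hypothetical helper, 2-D only)."""
    x_lo: float
    y_lo: float
    x_hi: float
    y_hi: float

    def contains(self, x: float, y: float) -> bool:
        # Half-open bounds are an assumption of this sketch, chosen so
        # that every point is a core point of exactly one cell.
        return self.x_lo <= x < self.x_hi and self.y_lo <= y < self.y_hi


class Cell(NamedTuple):
    """One entry of the partition plan: a cell Ci and its supporting area."""
    cell_id: str
    core: Rect      # boundaries of Ci
    support: Rect   # boundaries of the augmented cell, which encloses Ci


def map_point(point_id: str, x: float, y: float,
              plan: List[Cell]) -> Iterator[Tuple[str, str]]:
    """Emit (cell_id, value) records for a single input point."""
    for cell in plan:
        if cell.core.contains(x, y):
            # Core-related record: flag "0".
            yield cell.cell_id, f"0-{point_id}"
        elif cell.support.contains(x, y):
            # Supporting-related record: flag "1" (inside the supporting
            # area but outside the cell itself).
            yield cell.cell_id, f"1-{point_id}"
```

With a hypothetical plan of two adjacent cells C1 and C2, a point p2 that lies in C2 close to the shared border yields ("C1", "1-p2") and ("C2", "0-p2"), which matches the records shown for p2 in Figure 20.5.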

After the internal shuffling and sorting phase based on the cell ID, each group received by a reducer corresponds to a specific grid cell, say Ci, and consists of the union of the core and support points belonging to Ci (see Figure 20.5). The reducer function categorizes the data points according to the flag encoded in the value. Lastly, it executes a kNN search algorithm on each core point.
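
A minimal sketch of how a reducer might separate one cell's group by flag is given here. It assumes the "flag-payload" string encoding of the map output, and split_by_flag is a hypothetical name.

```python
from typing import Iterable, List, Tuple


def split_by_flag(values: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split the values of one cell group into (core, support) payloads."""
    core: List[str] = []
    support: List[str] = []
    for value in values:
        # Split only at the first "-", so the payload may itself contain
        # "-" characters (for example negative coordinates).
        flag, _, payload = value.partition("-")
        if flag == "0":
            core.append(payload)
        elif flag == "1":
            support.append(payload)
        else:
            raise ValueError(f"unexpected flag in record: {value!r}")
    return core, support
```

For the group of cell C1 in Figure 20.5, the values "0-p1", "1-p2", and "1-p3" would be separated into the core point p1 and the support points p2 and p3.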

In this step the local kNNs attached to each core point are fully utilized. Given a core point pj, the local kNNs of pj are first parsed and stored in a list tempKNN(p). The points in this list are sorted in ascending order of their distances to pj. The kNN of pj then only need to be searched among the support points. If a support point ps is closer to pj than at least one point in tempKNN(p), ps is inserted into tempKNN(p) and the point at the tail is removed. This process continues until all support points have been examined. The points remaining in tempKNN(p) are then the actual kNN of pj. Duplicate kNN search between the inner partition kNN search and the outer partition kNN search is therefore completely avoided.
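
The refinement loop above can be sketched as follows, assuming Euclidean distance, points represented as coordinate tuples, and Python 3.8 or later for math.dist. The function name refine_knn is hypothetical. Comparing a support point against the tail of the sorted list is equivalent to checking whether it is closer than at least one point in tempKNN(p). The branch that handles a list shorter than k is an added assumption that the text does not specify.

```python
import bisect
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def refine_knn(core: Point, local_knn: Sequence[Point],
               support_points: Sequence[Point], k: int) -> List[Point]:
    """Refine the local kNN of one core point using only support points."""
    # tempKNN(p): (distance, point) pairs in ascending order of distance.
    temp_knn = sorted((math.dist(core, q), q) for q in local_knn)
    for s in support_points:
        d = math.dist(core, s)
        if len(temp_knn) < k:
            # Assumption: the cell held fewer than k local neighbors.
            bisect.insort(temp_knn, (d, s))
        elif d < temp_knn[-1][0]:
            # s is closer than the current farthest neighbor:
            # insert it in order, then drop the point at the tail.
            bisect.insort(temp_knn, (d, s))
            temp_knn.pop()
    return [q for _, q in temp_knn]
```

Because the list starts from the local kNN computed in the inner partition job, only the support points are scanned here. The core points of the cell are not compared against each other a second time.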

21 Performance Evaluation

21.1 Experimental Setup & Methodologies

Experimental Infrastructure. All experiments are conducted on a shared-nothing cluster with one master node and 40 slave nodes. Each node consists of 16 core AMD 3.0GHz processors, 32GB RAM, and a 250GB disk, and the nodes are interconnected with 1Gbps Ethernet. Each server runs CentOS Linux (kernel version 2.6.32), Java 1.6, and Hadoop 1.0.1. Each node is configured to run up to 8 map and 8 reduce tasks concurrently. The sort buffer size is set to 512MB. Speculative execution is disabled to boost performance. The replication factor is set to 3.

Dataset. We evaluated the performance of our proposed distributed local outlier detection algorithms (PDLOF and DDLOF) on the open data set OpenStreetMap (OSM: http://wiki.openstreetmap.org/wiki/Main_Page). OSM contains points with geographic position, stored as coordinates (pairs of a latitude and a longitude). The raw data was stored in a 500G XML file. Each row in the dataset represents a building. Three attributes are utilized in the experiments, namely ID, longitude and latitude. In order to adjust parameters and evaluate the basic properties of our proposed algorithms, we extracted from