ously detects and outputs the outliers in the current window $W_c$ when the window slides.

2.5 MapReduce Basics

MapReduce is a framework for parallel processing of massive data sets, popular for its scalability to thousands of machines, flexibility in the data model, efficient fault-tolerant execution, and cost effectiveness. A job to be performed using the MapReduce framework has to be specified as two phases: the map phase, specified by a Map function (also called the mapper), takes key/value pairs as input, possibly performs some computation on this input, and produces intermediate results in the form of key/value pairs; the reduce phase processes these results as specified by a Reduce function (also called the reducer). The data from the map phase are shuffled, i.e., exchanged and merge-sorted, to the machines performing the reduce phase. Note that the shuffle phase can itself be more time-consuming than the other two, depending on network bandwidth availability and other resources.

In more detail, the data are processed through the following 6 steps as illustrated in Figure 2.1:

Worcester Polytechnic Institute

Figure 2.1: MapReduce Dataflow (three parallel pipelines, each consisting of Input Reader, Map, Combiner, Partition, Reduce, and Output Writer)

1. Input reader: In its basic form, the input reader takes input from files (large blocks) and converts it to key/value pairs. It is possible to add support for other input types, so that input data can be retrieved from a database or even from main memory. The data are divided into splits, which are the unit of data processed by a map task. A typical split size is the size of a block, which for example in HDFS is 64 MB by default, but this is configurable.

2. Map function: A map task takes as input a key/value pair from the input reader, performs the logic of the Map function on it, and outputs the result as a new key/value pair. The results from a map task are initially written to a main memory buffer and, when the buffer is almost full, spilled to disk. The spill files are in the end merged into one sorted file.

3. Combiner function: This optional function is provided for the common case when (a) there is significant repetition in the intermediate keys produced by each map task, and (b) the user-specified Reduce function is commutative and associative. In this case, a Combiner function will perform partial reduction so that pairs with the same key will be processed as one group by a reduce task.

4. Partition function: By default, a hashing function is used to partition the intermediate keys output from the map tasks to reduce tasks. While this in general provides good balancing, in some cases it is still useful to employ other partitioning functions, and this can be done by providing a user-defined Partition function.

5. Reduce function: The Reduce function is invoked once for each distinct key and is applied to the set of associated values for that key, i.e., the pairs with the same key will be processed as one group. The input to each reduce task is guaranteed to be processed in increasing key order. It is possible to provide a user-specified comparison function to be used during the sort process.

6. Output writer: The output writer writes the output to storage. In the basic case, this is a file; however, the function can be modified so that data can be stored in, e.g., a database.

As can be noted, for a particular job, only a Map function is strictly needed, although for most jobs a Reduce function is also used. The need for providing an Input reader and Output writer depends on data source and destination, while the need for Combiner and Partition functions depends on data distribution.
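The six steps above can be illustrated with a small single-process sketch. This is not Hadoop code: the function names (`input_reader`, `map_fn`, `combine`, `partition`, `reduce_fn`, `run_job`), the word-count task, and the toy byte-sum hash are all assumptions chosen for illustration, and the "output writer" simply fills a dictionary.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def input_reader(lines):
    # Step 1: convert raw records of one split into key/value pairs.
    for offset, line in enumerate(lines):
        yield offset, line


def map_fn(key, value):
    # Step 2: emit an intermediate (word, 1) pair for every word.
    for word in value.split():
        yield word.lower(), 1


def combine(pairs):
    # Step 3: partial reduction within one map task. This is valid here
    # because addition is commutative and associative.
    partial = defaultdict(int)
    for k, v in pairs:
        partial[k] += v
    return list(partial.items())


def partition(key, num_reducers):
    # Step 4: hash partitioning. A toy byte-sum hash is used (assumption)
    # so that the assignment is deterministic across runs.
    return sum(key.encode()) % num_reducers


def reduce_fn(key, values):
    # Step 5: invoked once per distinct key with all of its values.
    return key, sum(values)


def run_job(splits, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for split in splits:  # one map task per split
        mapped = [kv for k, v in input_reader(split) for kv in map_fn(k, v)]
        for k, v in combine(mapped):
            buckets[partition(k, num_reducers)].append((k, v))
    output = {}
    for bucket in buckets:  # one reduce task per bucket
        bucket.sort(key=itemgetter(0))  # shuffle: sort so keys arrive in order
        for k, group in groupby(bucket, key=itemgetter(0)):
            key, total = reduce_fn(k, [v for _, v in group])
            output[key] = total  # Step 6: "write" the result
    return output
```

Calling `run_job([["a b a", "c"], ["b a"]])` treats each inner list as one split, so the combiner collapses the two occurrences of `a` in the first split before any data reaches a reducer.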

Hadoop [50] is an open-source implementation of MapReduce and, without doubt, the most popular MapReduce variant. It is currently in use at an increasing number of prominent companies with large user bases, such as Yahoo! and Facebook.

Hadoop consists of two main parts: the Hadoop distributed file system (HDFS) and MapReduce for distributed processing. As illustrated in Figure 2.2, Hadoop consists of a number of different daemons/servers: NameNode, DataNode, and Secondary NameNode for managing HDFS, and JobTracker and TaskTracker for performing MapReduce.


Figure 2.2 (diagram): a Client; JobTracker and three TaskTrackers in the MapReduce layer; NameNode, Secondary NameNode, and three DataNodes in the HDFS layer.

Part I

3 A Generic Outlier Detection Framework

We now introduce our scalable framework called LEAP, capable of continuously processing distance-based outliers with low CPU and memory resource utilization. LEAP is built on two fundamental optimization principles, namely minimal probing and lifespan-aware prioritization, as described below.

3.1 Theoretical Foundation

In all distance-based outlier definitions, points in a dataset $D$ are classified either as outliers or inliers. Thus, the process of identifying outliers in $D$ is equivalent to the process of eliminating inliers from it. In fact, initially, each point $p_i$ in the dataset is a potential outlier candidate, until one has acquired enough evidence to show that $p_i$ is an inlier. For example, in the process of identifying $O^{thres}_{(k,R)}$ outliers, until finding that $p_i$ has at least $k$ neighbors and thus qualifies as an inlier, $p_i$ cannot be safely removed from the outlier candidates.


This fact leads us to an important observation. That is, to identify whether a point $p_i$ is a distance-based outlier in a dataset $D$, one may not need the distances between $p_i$ and every other point in $D$. Instead, a potentially small subset of points will be sufficient to prove that $p_i$ is an inlier. Also, due to the rarity of outliers, the majority of points in the dataset could be labeled as inliers in this way by collecting only a small amount of information. To describe the least amount of information needed to prove $p_i$'s inlier status, we define the concept of Minimal Evidence Set for Inlier (MESI).

Definition 3.1 Given an outlier query and a dataset $D$, the MESI set for a data point $p_i \in D$ is a dataset $M \subseteq D$ such that the distance set $DistSet(M, p_i) = \{d(p_1, p_i), d(p_2, p_i), \ldots, d(p_n, p_i) \mid p_j\ (1 \le j \le n) \in M\}$ is sufficient to label $p_i$ as an inlier, and there does not exist any $M' \subseteq D$ such that $|M'| < |M|$ and $DistSet(M', p_i) = \{d(p_1, p_i), d(p_2, p_i), \ldots, d(p_m, p_i) \mid p_j\ (1 \le j \le m) \in M'\}$ is sufficient to label $p_i$ as an inlier.

The size of the MESI for a point $p_i$ is usually much smaller than the size of $p_i$'s complete neighborhood. For example, for $O^{thres}_{(k,R)}$ outliers, the MESI for any point $p_i$ is composed of any $k$ points that are within distance $R$ from $p_i$. Thus its size is $k$. In general, this input parameter $k$ is much smaller than the average number of neighbors each point may have within distance $R$. Otherwise, the outliers detected with fewer than $k$ neighbors would not be considered abnormal phenomena in the dataset. The cardinality of the MESI for a point $p_i$ in the kNN outlier definitions is also bounded by a constant value $k$, as we will show in Chapter 5. This observation guides us to propose the Minimal Probing optimization principle (Sec. 3.2).
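For the threshold-based case, collecting a MESI can be sketched as a scan that stops as soon as $k$ neighbors have been found. The sketch below is only an illustration under stated assumptions: the function name `find_mesi` is hypothetical, points are coordinate tuples, distance is Euclidean, and "within distance $R$" is read as less than or equal to $R$. It is not the LEAP algorithm itself.

```python
import math


def find_mesi(p, others, k, r):
    """Return k points of `others` within distance r of p, or None.

    Assumptions (not from the text): Euclidean distance, and a point q
    counts as a neighbor when dist(p, q) <= r. `others` must not contain p.
    """
    evidence = []
    for q in others:
        if math.dist(p, q) <= r:
            evidence.append(q)
            if len(evidence) == k:
                # k neighbors are enough to label p an inlier,
                # so the remaining points are never examined.
                return evidence
    # Fewer than k neighbors exist: no evidence set, p stays a candidate.
    return None
```

With `others = [(0.0, 1.0), (5.0, 5.0), (1.0, 0.0), (0.5, 0.5)]`, `k = 2` and `r = 1.0`, the scan for `p = (0.0, 0.0)` returns after the third point; the fourth point is also a neighbor but is not needed as evidence.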

Although MESI is sufficient to prove a point's inlier status in the current window, unlike in static environments, locating more neighbors beyond MESI for a given point may be beneficial in streaming environments. These additional neighbors may help us to determine the status of this point in future windows. Thus, we now extend the concept of MESI in a static dataset to MESI in a sequence of stream windows. In particular, we define the concept of Minimal Evidence Set for Inlier in a Window Sequence as below.

Definition 3.2 Given a streaming outlier detection query $Q$ and all points in the current window $W_c$, denoted by $D_{W_c}$, $MESI(W_c, c+x)$ for $p_i$ in a window sequence from $W_c$ to $W_{c+x}$ is a dataset $M \subseteq D_{W_c}$ such that the distance set $DistSet(M, p_i) = \{d(p_1, p_i), d(p_2, p_i), \ldots, d(p_n, p_i) \mid p_j\ (1 \le j \le n) \in M\}$ is sufficient to label $p_i$ as an inlier in windows $W_c$ to $W_{c+x}$, and there does not exist any $M' \subseteq D_{W_c}$ with $|M'| < |M|$ such that $DistSet(M', p_i) = \{d(p_1, p_i), d(p_2, p_i), \ldots, d(p_m, p_i) \mid p_j\ (1 \le j \le m) \in M'\}$ is sufficient to label $p_i$ as an inlier in windows $W_c$ to $W_{c+x}$.

In other words, the $MESI(W_c, c+x)$ for a point $p_i$ is a minimal subset of the current window population $D_{W_c}$ that provides sufficient evidence to prove that $p_i$ is an inlier in windows $W_c$ to $W_{c+x}$, regardless of the characteristics of the future incoming stream. This is possible because by analyzing the time stamp of a point $p_i$ and the query window (the slide and window sizes), we can determine the number of windows that $p_i$ will survive in. For example, for a point $p_i$ that just arrived with the latest slide in the current window $W_c$, if we find $k$ points within distance $R$ from $p_i$ that arrived when $p_i$ did, then these $k$ points form $MESI(W_c, c+x)$ for $p_i$, where $W_{c+x}$ is the last window in which $p_i$ will be alive. This is because these points will accompany $p_i$ as its neighbors until $p_i$ expires. We are now ready to define the concept of Life Time Minimal Evidence Set for Inlier.

Definition 3.3 $MESI(W_c, c+x)$ for $p_i$ is a life time MESI of $p_i$, denoted as $MESI_{lt}$, if $W_{c+x}$ is the last window in which $p_i$ participates before its expiration.
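The lifespan reasoning behind Definitions 3.2 and 3.3 can be made concrete with a short sketch. The windowing convention here is an assumption, not taken from the text: timestamps are non-negative integers, window $W_j$ covers the interval $[j \cdot slide,\ j \cdot slide + win)$, and $win$ is a multiple of $slide$. All function names are hypothetical, and the neighbors passed in are assumed to have already been verified as lying within distance $R$ of the point and alive in the current window.

```python
def last_window(t, slide):
    # Under the assumed convention, W_j contains t iff
    # j*slide <= t < j*slide + win, so the largest such j is t // slide.
    return t // slide


def windows_remaining(t, c, win, slide):
    """Number of windows, starting with W_c, in which the point is alive."""
    start = c * slide
    if not (start <= t < start + win):
        raise ValueError("point is not alive in window W_c")
    return last_window(t, slide) - c + 1


def evidence_valid_through(t_p, neighbor_times, slide):
    """Index c+x of the last window in which p and all its evidence
    points are still alive."""
    return min(last_window(t, slide) for t in [t_p, *neighbor_times])


def is_lifetime_mesi(t_p, neighbor_times, k, slide):
    """True if k verified neighbors all survive at least as long as p."""
    return (len(neighbor_times) == k
            and evidence_valid_through(t_p, neighbor_times, slide)
            == last_window(t_p, slide))
```

For example, with `win = 12` and `slide = 4`, a point with timestamp 9 in window $W_0$ survives three windows ($W_0$ to $W_2$). Neighbors with timestamps 8, 10, and 11 arrived in the same slide and therefore form a life time MESI for `k = 3`, whereas replacing one of them with a neighbor at timestamp 3 yields evidence that holds only for $W_0$.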

A $MESI_{lt}$ for $p_i$ is an ideal evidence set because it proves the inlier identity of $p_i$ during its entire remaining life; such a point is hence named a safe inlier. It eliminates the need for any future maintenance effort on $p_i$ for the potential detection of its outlier status. Acquiring