ously detects and outputs the outliers in the current window $W_c$ when the window slides.
2.5 MapReduce Basics
MapReduce is a framework for parallel processing of massive data sets, popular for its scalability to thousands of machines, flexibility in the data model, efficient fault-tolerant execution, and cost effectiveness. A job to be performed using the MapReduce framework has to be specified as two phases: the map phase, as specified by a Map function (also called mapper), takes key/value pairs as input, possibly performs some computation on this input, and produces intermediate results in the form of key/value pairs; and the reduce phase, which processes these results as specified by a Reduce function (also called reducer). The data from the map phase are shuffled, i.e., exchanged and merge-sorted, to the machines performing the reduce phase. It should be noted that the shuffle phase can itself be more time-consuming than the other two, depending on network bandwidth availability and other resources.
In more detail, the data are processed through the following 6 steps as illustrated in Figure 2.1:
Worcester Polytechnic Institute
Figure 2.1: MapReduce Dataflow. Each of the parallel pipelines in the figure passes through the stages Input Reader, Map, Combiner, Partition, Reduce, and Output Writer.
1. Input reader: The input reader in the basic form takes input from files (large blocks) and converts them to key/value pairs. It is possible to add support for other input types, so that input data can be retrieved from a database or even from main memory. The data are divided into splits, which are the unit of data processed by a map task. A typical split size is the size of a block, which for example in HDFS is 64 MB by default, but this is configurable.
2. Map function: A map task takes as input a key/value pair from the input reader, performs the logic of the Map function on it, and outputs the result as a new key/value pair. The results from a map task are initially output to a main memory buffer and, when the buffer is almost full, spilled to disk. The spill files are in the end merged into one sorted file.
3. Combiner function: This optional function is provided for the common case when there is (a) significant repetition in the intermediate keys produced by each map task, and (b) the user-specified Reduce function is commutative and associative. In this case, a Combiner function will perform partial reduction, so that pairs with the same key will be processed as one group by a reduce task.
4. Partition function: By default, a hashing function is used to partition the intermediate keys output from the map tasks to reduce tasks. While this in general provides good balancing, in some cases it is still useful to employ other partitioning functions, and this can be done by providing a user-defined Partition function.
5. Reduce function: The Reduce function is invoked once for each distinct key and is applied on the set of associated values for that key, i.e., the pairs with the same key will be processed as one group. The input to each reduce task is guaranteed to be processed in increasing key order. It is possible to provide a user-specified comparison function to be used during the sort process.
6. Output writer: The output writer writes the output to stable storage. In the basic case, this is to a file; however, the function can be modified so that data can be stored in, e.g., a database.
As can be noted, for a particular job, only a Map function is strictly needed, although for most jobs a Reduce function is also used. The need for providing an Input reader and Output writer depends on data source and destination, while the need for Combiner and Partition functions depends on data distribution.
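The six steps can be illustrated with a small single-process simulation. The following is only a sketch under stated assumptions, not Hadoop code: the word-count job, the in-memory "splits", the CRC32-based partitioner, and all function names (`input_reader`, `map_fn`, `combine`, `partition`, `reduce_fn`, `run_job`) are hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter
from zlib import crc32


def input_reader(lines, split_size=2):
    """Step 1: convert raw lines to (key, value) pairs and group them into splits."""
    pairs = list(enumerate(lines))  # key = line offset, value = line text
    return [pairs[i:i + split_size] for i in range(0, len(pairs), split_size)]


def map_fn(key, value):
    """Step 2: emit one intermediate (word, 1) pair per word in the line."""
    for word in value.split():
        yield word.lower(), 1


def combine(pairs):
    """Step 3: partial reduction of one map task's output.

    Valid here because summation is commutative and associative.
    """
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def partition(key, num_reducers):
    """Step 4: hash partitioning (CRC32 is an assumption, picked for determinism)."""
    return crc32(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Step 5: called once per distinct key with all values for that key."""
    return key, sum(values)


def run_job(lines, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for split in input_reader(lines):          # one map task per split
        mapped = [kv for k, v in split for kv in map_fn(k, v)]
        for key, value in combine(mapped):
            buckets[partition(key, num_reducers)].append((key, value))

    output = {}                                # Step 6: "output writer" is a dict here
    for bucket in buckets:                     # one reduce task per bucket
        bucket.sort(key=itemgetter(0))         # shuffle: sort by key before reducing
        for key, group in groupby(bucket, key=itemgetter(0)):
            out_key, total = reduce_fn(key, (v for _, v in group))
            output[out_key] = total
    return output


if __name__ == "__main__":
    print(run_job(["a b a", "b c", "a"]))
```

Because the combiner already sums counts within each split, the reduce tasks receive at most one pair per word per map task, which is the data reduction the Combiner step is meant to provide.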
Hadoop [50] is an open-source implementation of MapReduce and, without doubt, the most popular MapReduce variant. It is in use at an increasing number of prominent companies with large user bases, such as Yahoo! and Facebook.
Hadoop consists of two main parts: the Hadoop distributed file system (HDFS) and MapReduce for distributed processing. As illustrated in Figure 2.2, Hadoop consists of a number of different daemons/servers: NameNode, DataNode, and Secondary NameNode for managing HDFS, and JobTracker and TaskTracker for performing MapReduce.
Figure 2.2 (labels recovered from the figure): Client; MapReduce Layer: JobTracker, TaskTrackers; HDFS Layer: NameNode, Secondary NameNode, DataNodes.
Part I
3 A Generic Outlier Detection Framework
We now introduce our scalable framework called LEAP, capable of continuously processing distance-based outliers with low CPU and memory resource utilization. LEAP is built on two fundamental optimization principles, namely minimal probing and lifespan-aware prioritization, as described below.
3.1 Theoretical Foundation
In all distance-based outlier definitions, points in a dataset $D$ are classified either as outliers or inliers. Thus, the process of identifying outliers in $D$ is equivalent to the process of eliminating inliers from it. In fact, initially, each point $p_i$ in the dataset is a potential outlier candidate, until one has acquired enough evidence to show that $p_i$ is an inlier. For example, in the process of identifying $O^{thres}_{(k,R)}$ outliers, until finding that $p_i$ has at least $k$ neighbors and thus qualifies as an inlier, $p_i$ cannot be safely removed from the outlier candidate set.
This fact leads us to an important observation. That is, to identify whether a point $p_i$ is a distance-based outlier in a dataset $D$, one may not need the distance between $p_i$ and every other point in $D$. Instead, a potentially small subset of points will be sufficient to prove that $p_i$ is an inlier. Also, due to the rarity of outliers, the majority of points in the dataset could be labeled as inliers in this way by collecting only a small amount of information. To describe the least amount of information needed to prove $p_i$'s inlier status, we define the concept of Minimal Evidence Set for Inlier (MESI).
Definition 3.1 Given an outlier query and a dataset $D$, the MESI set for a data point $p_i \in D$ is a dataset $M$ such that $M \subseteq D$, if the distance set $\mathit{DistSet}(M, p_i) = \{d(p_1, p_i), d(p_2, p_i), \ldots, d(p_n, p_i) \mid p_j\,(1 \le j \le n) \in M\}$ is sufficient to label $p_i$ as an inlier, and there does not exist any $M' \subseteq D$ such that $|M'| < |M|$ and $\mathit{DistSet}(M', p_i) = \{d(p_1, p_i), d(p_2, p_i), \ldots, d(p_m, p_i) \mid p_j\,(1 \le j \le m) \in M'\}$ is sufficient to label $p_i$ as an inlier.
The size of the MESI for a point $p_i$ is usually much smaller than the size of $p_i$'s complete neighborhood. For example, for the $O^{thres}_{(k,R)}$ outlier, the MESI for any point $p_i$ is composed of any $k$ points that are within distance $R$ from $p_i$. Thus its size is $k$. In general, this input parameter $k$ is much smaller than the average number of neighbors each point may have within distance $R$. Otherwise, the outliers detected with fewer than $k$ neighbors would not be considered abnormal phenomena in the dataset. The cardinality of the MESI for a point $p_i$ in the kNN outlier definitions is also bounded by a constant value $k$, as we will show in Chapter 5. This observation guides us to propose the Minimal Probing optimization principle (Sec. 3.2).
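The idea that $k$ neighbors within distance $R$ already form a MESI can be sketched as an early-terminating scan. This is a minimal illustration only, not the LEAP algorithm itself: it assumes Euclidean distance, treats "within distance $R$" as $d \le R$, scans points in list order, and uses the hypothetical names `find_mesi` and `detect_outliers`.

```python
import math


def find_mesi(i, dataset, k, R):
    """Return k neighbors of dataset[i] within distance R, or None.

    Probing stops as soon as k neighbors are found, since those k
    points already suffice to label dataset[i] as an inlier.
    """
    p = dataset[i]
    evidence = []
    for j, q in enumerate(dataset):
        if j == i:
            continue  # a point is not its own neighbor
        if math.dist(p, q) <= R:  # assumption: Euclidean distance, inclusive bound
            evidence.append(q)
            if len(evidence) == k:
                return evidence  # minimal evidence reached, stop probing
    return None  # fewer than k neighbors exist: the point is an outlier


def detect_outliers(dataset, k, R):
    """Points for which no evidence set of size k can be found."""
    return [p for i, p in enumerate(dataset) if find_mesi(i, dataset, k, R) is None]


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
    print(detect_outliers(points, k=2, R=1.5))
```

For an inlier the scan ends after at most $k$ successful probes, whereas an outlier still requires a full pass; given the rarity of outliers, most points fall in the cheaper case.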
Although MESI is sufficient to prove a point's inlier status in the current window, unlike in static environments, locating more neighbors beyond MESI for a given point may be beneficial in streaming environments. These additional neighbors may help us to determine the status of this point in future windows. Thus, we now extend the concept of MESI in a static dataset to MESI in a sequence of stream windows. In particular, we define the concept of Minimal Evidence Set for Inlier in a Window Sequence as below.
Definition 3.2 Given a streaming outlier detection query $Q$ and all points in the current window $W_c$, denoted by $D_{W_c}$, $\mathit{MESI}(W_c, c+x)$ for $p_i$ in a window sequence from $W_c$ to $W_{c+x}$ is a dataset $M$ with $M \subseteq D_{W_c}$, if the distance set $\mathit{DistSet}(M, p_i) = \{d(p_1, p_i), d(p_2, p_i), \ldots, d(p_n, p_i) \mid p_j\,(1 \le j \le n) \in M\}$ is sufficient to label $p_i$ as an inlier in windows $W_c$ to $W_{c+x}$, and there does not exist any $M' \subseteq D_{W_c}$ with $|M'| < |M|$ and $\mathit{DistSet}(M', p_i) = \{d(p_1, p_i), d(p_2, p_i), \ldots, d(p_m, p_i) \mid p_j\,(1 \le j \le m) \in M'\}$ is sufficient to label $p_i$ as an inlier in windows $W_c$ to $W_{c+x}$.
In other words, the $\mathit{MESI}(W_c, c+x)$ for a point $p_i$ is a minimal subset of the current window population $D_{W_c}$ that provides sufficient evidence to prove that $p_i$ is an inlier in windows $W_c$ to $W_{c+x}$, regardless of the characteristics of the future incoming stream. This is possible because, by analyzing the time stamp of a point $p_i$ and the query window (the slide and window sizes), we can determine the number of windows that $p_i$ will survive in. For example, for a point $p_i$ that just arrived with the latest slide in the current window $W_c$, if we find $k$ points within distance $R$ from $p_i$ that arrived when $p_i$ did, then these $k$ points form $\mathit{MESI}(W_c, c+x)$ for $p_i$, where $W_{c+x}$ is the last window in which $p_i$ will be alive. This is because these points will accompany $p_i$ as its neighbors until $p_i$ expires. We are now ready to define the concept of Life Time Minimal Evidence Set for Inlier.
Definition 3.3 $\mathit{MESI}(W_c, c+x)$ for $p_i$ is a life time MESI of $p_i$, denoted as $\mathit{MESI}_{lt}$, if $W_{c+x}$ is the last window in which $p_i$ participates before its expiration.
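The lifespan reasoning behind a life time MESI can be made concrete with a small sketch. It rests on assumptions that are not taken from the text: windows are time-based, window $n$ covers the interval $[n \cdot slide,\ n \cdot slide + win)$, distance is Euclidean with an inclusive bound, and the names `last_window`, `remaining_windows`, and `lifetime_mesi` are hypothetical. Under this window model the last window containing a point depends only on its time stamp and the slide size, so the window size does not appear in the code.

```python
import math


def last_window(t, slide):
    """Index of the last window that still contains a point with time stamp t.

    Assumes window n covers [n * slide, n * slide + win).
    """
    return int(t // slide)


def remaining_windows(t, current_window, slide):
    """The x in W_{c+x}: how many further windows the point survives."""
    return last_window(t, slide) - current_window


def lifetime_mesi(i, window_points, k, R, slide):
    """Return k neighbors that prove window_points[i] is an inlier for its
    entire remaining life, or None if no such set exists yet.

    Each point is a (time stamp, coordinates) pair. A neighbor only counts
    if it expires no earlier than the point itself.
    """
    t_p, x_p = window_points[i]
    expiry_p = last_window(t_p, slide)
    evidence = []
    for j, (t_q, x_q) in enumerate(window_points):
        if j == i:
            continue
        outlives_p = last_window(t_q, slide) >= expiry_p
        if outlives_p and math.dist(x_p, x_q) <= R:
            evidence.append(window_points[j])
            if len(evidence) == k:
                return evidence
    return None


if __name__ == "__main__":
    pts = [(25, (0.0, 0.0)), (12, (0.0, 1.0)), (27, (1.0, 0.0)), (31, (0.0, 0.5))]
    print(lifetime_mesi(0, pts, k=2, R=1.5, slide=10))
```

In the usage example, the neighbor with time stamp 12 is close enough but expires before the probed point, so it cannot contribute to a life time evidence set; the two later arrivals can.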
A $\mathit{MESI}_{lt}$ for $p_i$ is an ideal evidence set because it proves the inlier identity of $p_i$ during its entire remaining life; such a point is hence named a safe inlier. It eliminates the need for any future maintenance effort on $p_i$ for the potential detection of its outlier status. Acquiring