large datasets other approaches have to be explored instead of extensively maintaining all possible final results.
In density-based clustering, the OPTICS algorithm [42] produces the clusters for any parameter setting picked from a given parameter range in time linear in the size of the dataset. OPTICS achieves this by creating an augmented ordering of the dataset that represents the clustering structure corresponding to a set of parameter settings. However, producing outliers as by-products of clustering has already been shown to be ineffective in capturing abnormal phenomena [18]. Furthermore, the ordering information is only effective in representing the clusters with respect to a small range of parameter settings. Our work instead aims to support any outlier detection request with any possible parameter setting in a near real-time fashion.
1.3 Research Challenges Addressed in This Dissertation
Continuous Outlier Detection Over Data Streams. First, designing scalable stream outlier detection strategies that satisfy the stringent response time requirements of online monitoring applications is extremely difficult, because processing an outlier detection request is resource-consuming due to the algorithmic complexity of the mining process. As shown in [1], the algorithmic complexity of most outlier detection techniques is known to be quadratic with respect to the number of points. Continuously mining outliers from high-volume, high-velocity stream data is like mining needles in a haystack. There is so much hay to mine and so little time to utilize.
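The quadratic cost mentioned above can be made concrete with a minimal sketch. It assumes a distance-threshold outlier definition (a point is an outlier if it has fewer than k neighbors within distance r); this section does not fix a definition, so the definition, the function name `naive_distance_outliers`, and the parameters `r` and `k` are illustrative assumptions rather than the method developed in this dissertation.

```python
import math
from typing import List, Sequence


def naive_distance_outliers(
    points: Sequence[Sequence[float]], r: float, k: int
) -> List[int]:
    """Return the indices of points with fewer than k neighbors within r.

    Assumed definition (distance-threshold outliers), used for illustration
    only. Every point is compared against every other point, so the number
    of distance computations is n * (n - 1) for n points.
    """
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i != j and math.dist(p, q) <= r:
                neighbors += 1
        if neighbors < k:
            outliers.append(i)
    return outliers


# Four points packed closely together and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
result = naive_distance_outliers(points, r=0.5, k=2)
print(result)  # only the isolated point, index 4, is reported
```

In a streaming setting this nested loop would have to be re-evaluated as the window content changes, which is why a from-scratch evaluation per window is unlikely to meet online response time requirements.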
Second, to handle a large workload composed of hundreds or even thousands of outlier requests over data streams in real time, the system resources used to process each of these queries must be shared effectively. However, outlier mining requests with different parameter settings may cause totally different outliers to be identified. Furthermore, given a data point p, the evidence needed to prove its outlier status, i.e., whether it is an outlier or an inlier, with respect to distinct outlier interpretations (parameter settings) can differ. Therefore, a sharing-aware execution strategy that completely avoids redundant computation across the processing of different outlier detection requests on the data is hard to develop.
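As a small illustration of why sharing is hard, the sketch below evaluates three hypothetical requests over the same data. It again assumes a distance-threshold definition with parameters (r, k), which is an assumption made for this example only; the helper names are likewise hypothetical.

```python
import math


def neighbor_count(points, i, r):
    """Number of other points within distance r of points[i]."""
    return sum(
        1 for j, q in enumerate(points) if j != i and math.dist(points[i], q) <= r
    )


def outliers(points, r, k):
    """Indices of points with fewer than k neighbors within r (assumed definition)."""
    return [i for i in range(len(points)) if neighbor_count(points, i, r) < k]


# One-dimensional toy data and three hypothetical requests (r, k).
points = [(0.0,), (1.0,), (2.0,), (6.0,), (7.0,), (20.0,)]
requests = [(1.5, 1), (1.5, 2), (5.5, 3)]
results = {req: outliers(points, *req) for req in requests}

for req, found in results.items():
    print(req, found)
```

On this toy data the three requests return different outlier sets, and the point at 0.0 needs only one neighbor as evidence of inlier status under the first request but two under the second, so its status flips between them.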
Distributed Outlier Detection. The design of an efficient distributed outlier detection algorithm is challenging.
First, designing an effective partitioning strategy for the MapReduce-based outlier detection approach is challenging. Intuitively, the default partitioning solution in MapReduce would randomly spread the information that is necessary to prove the status of one point across possibly numerous nodes. Therefore, a point p would not be able to prove its outlier status on the local reducer node on which it resides. This would inevitably lead to a multi-pass solution, thus introducing heavy communication costs due to the repeated re-distribution of the whole dataset. On the other hand, partitioning the data points with similar characteristics to the same node might preserve the norm on the local node against which each data point can evaluate its abnormality. However, real-world datasets tend to be skewed [43] instead of being uniformly distributed over their domain spaces. For this reason, data characteristics-based partitioning suffers from the problem that the number of points allocated to each node may vary widely, leading to an unbalanced workload.
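The trade-off described above can be sketched on a toy one-dimensional dataset. In this sketch, round-robin assignment stands in for a data-oblivious default partitioner and equal-width ranges stand in for characteristics-based partitioning; both functions, the dataset, and the node count are assumptions made for illustration and do not correspond to a specific MapReduce API.

```python
from collections import Counter

NUM_NODES = 4

# Skewed toy data: 90 values packed into [0, 9) and 10 values spread over [10, 40).
data = [i * 0.1 for i in range(90)] + [10.0 + 3.0 * i for i in range(10)]


def round_robin_partition(values, num_nodes):
    """Data-oblivious assignment: the node depends only on the record position."""
    return [i % num_nodes for i in range(len(values))]


def range_partition(values, num_nodes, lo, hi):
    """Characteristics-based assignment: equal-width ranges over the domain."""
    width = (hi - lo) / num_nodes
    return [min(int((v - lo) / width), num_nodes - 1) for v in values]


def colocated_neighbors(values, assignment):
    """Fraction of value-adjacent pairs that land on the same node."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    same = sum(1 for a, b in zip(order, order[1:]) if assignment[a] == assignment[b])
    return same / (len(values) - 1)


rr = round_robin_partition(data, NUM_NODES)
rg = range_partition(data, NUM_NODES, lo=0.0, hi=40.0)

rr_load, rg_load = Counter(rr), Counter(rg)
print("round-robin load:", dict(rr_load), "co-located:", colocated_neighbors(data, rr))
print("range load:      ", dict(rg_load), "co-located:", colocated_neighbors(data, rg))
```

On this data the data-oblivious assignment balances the load perfectly but separates every pair of adjacent values, while the range-based assignment keeps almost all neighbors together at the price of placing 90 of the 100 points on a single node.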
Second, a common limitation of distributed analytics work [35, 36, 37, 38] is that it applies one single detection algorithm to all compute nodes. This “monolithic” detection approach is based on the implicit assumption that there is one outlier algorithm that is superior to all others for all types of datasets. However, we observe that although numerous centralized algorithms have been proposed to speed up the outlier detection process, e.g., [8, 44], none of them has shown consistent superiority in all circumstances. Instead, the performance varies strongly depending on the characteristics of the dataset being processed. Since the data partitions in a distributed environment may each have different characteristics, this “monolithic” detection approach misses important optimization opportunities to minimize the overall costs of the distributed outlier detection process. To solve this problem, we must assign an appropriate detection algorithm to each partition based on its characteristics. This requires a thorough understanding of the correlations between the characteristics of the data and the performance of the algorithms. However, to date no such work appears in the literature.
Third, the partition generation problem (Challenge 1) and the algorithm-selection problem (Challenge 2) are strongly interdependent, i.e., a change in one may cause a modification in the other. For example, to minimize the overall detection costs, the effectiveness of a partitioning plan should be evaluated based on the costs estimated for the detection algorithms assigned to each partition. On the other hand, the algorithm assignments must be determined based on the characteristics of the data subsets produced by the partitioning plan. This raises the proverbial chicken-and-egg question.
Interactive Outlier Exploration. Designing an interactive outlier detection system that effectively derives the outliers of interest to analysts with real-time responsiveness, thereby meeting the requirements of online analytics applications, is challenging. Such a system must recommend appropriate parameter settings and allow users to analyze outliers over big data online.
First, due to the algorithmic complexity of mining techniques [45], processing each outlier request from scratch over big datasets every time it is submitted clearly cannot satisfy the response time requirement of interactive systems. On the other hand, precomputing and storing the results for all potential detection requests beforehand appears infeasible at first sight because of the infinite number of possible parameter settings.