The basic idea of the probabilistic approach is to build a data model of the conditional probability $P(Y \mid X)$ from data $D$. The model is then used to identify outliers by calculating $\tilde{P}(Y = y^{(n)} \mid X = x^{(n)})$, where $(x^{(n)}, y^{(n)})$ denotes the data instance being examined.¹ In general, an instance is an outlier when it has a low conditional probability under the model. Since it may be hard to define a fixed "low probability" threshold, the conditional probability can also be used to score and rank data instances by their outlier strength. We note that the probabilistic approach has been successfully applied to multiple UCOD problems in the literature [Hauskrecht et al., 2007, Song et al., 2007, Hauskrecht et al., 2010, Hauskrecht et al., 2016].
Figure 4.1 illustrates the basics of the probabilistic approach and its two phases: data modeling and outlier scoring. In the following, we review methods one can use for building the data model and discuss the outlier scoring step.
¹ The vertical bar (|) denotes conditioning; variables to the left of the symbol are conditioned on those to the right.
Figure 4.1: Probabilistic conditional outlier detection.
4.2.1.1 Data Modeling The first phase of the probabilistic approach is model building, which produces a probabilistic model $M$ that captures the stochastic dependence relations among data attributes. The goal of this phase is to obtain a precise data representation that can efficiently estimate the conditional probability $\tilde{P}(y \mid x; M)$ for any observed input and output pair $(x, y)$. Various statistical machine learning models and methods can be used for this purpose. These include generative and discriminative models.
Generative models compute the conditional probability using Bayes rule (i.e., $P(Y \mid X) = P(X, Y)/P(X)$). That is, approaches based on generative models first learn the joint distribution of both input and output, $P(X, Y)$, using a set of model parameters. Subsequently, the joint distribution is used to estimate the conditional distribution through an algebraic transformation defined by Bayes rule. This approach was applied to COD by [Hauskrecht et al., 2007] to detect unusual emergency room admissions from emergency room observations and findings. The authors built a probabilistic model of the admission action conditioned on the current patient status (such as symptoms, observations) using Bayesian belief networks (BBN) [Pearl, 1988, Lauritzen and Spiegelhalter, 1988, Cooper and Herskovits, 1992]. A similar approach for COD was also used in [Song et al., 2007]. This work tackled a slightly different type of COD problem where both input and output attributes are continuous. The generative model used in the work was based on the Gaussian mixture model (GMM) [Nowlan, 1991, Titterington et al., 1985] which was used to represent the conditional probability by modeling the correlations among the input and output spaces respectively.
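As a minimal sketch of the generative route, the snippet below estimates a discrete joint distribution by relative frequency and derives the conditional probability from it via P(Y|X) = P(X, Y)/P(X). It is not the Bayesian belief network or GMM used in the cited works; the toy data and the function names `fit_joint` and `conditional` are assumptions made purely for illustration.

```python
from collections import Counter

def fit_joint(data):
    """Estimate the joint P(X, Y) and the marginal P(X) by relative frequency."""
    n = len(data)
    joint = Counter(data)                      # counts of (x, y) pairs
    marginal = Counter(x for x, _ in data)     # counts of x alone
    p_xy = {pair: c / n for pair, c in joint.items()}
    p_x = {x: c / n for x, c in marginal.items()}
    return p_xy, p_x

def conditional(p_xy, p_x, x, y):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)."""
    if x not in p_x:
        raise ValueError(f"input {x!r} never observed; P(X = x) is zero")
    return p_xy.get((x, y), 0.0) / p_x[x]

# Toy data (invented): x is a finding, y is an admission decision.
data = (
    [("fever", "admit")] * 8 + [("fever", "discharge")] * 2
    + [("cough", "admit")] * 1 + [("cough", "discharge")] * 9
)

p_xy, p_x = fit_joint(data)
p_common = conditional(p_xy, p_x, "fever", "admit")   # frequent pairing
p_rare = conditional(p_xy, p_x, "cough", "admit")     # infrequent pairing
```

Under this toy model the pair ("cough", "admit") receives a much lower conditional probability than ("fever", "admit"), so it would be the stronger conditional outlier candidate.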
In contrast to generative models, discriminative models directly learn the conditional distribution $P(Y \mid X)$ by optimizing a likelihood or loss function expressed by a set of parameters. Discriminative models were used to support COD for the identification of unusual patient management actions (medication and lab orders) in clinical workflow [Hauskrecht et al., 2010, Hauskrecht et al., 2013, Hauskrecht et al., 2016]. More specifically, the approach applied in this work used calibrated support vector machine (SVM) models that first learn a discriminative projection of the input attributes that reflects the associated output values. Then, a transformation from the projection to a probability estimate is obtained using a post-hoc recalibration approach [Platt, 1999, DeGroot and Fienberg, 1983].
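The post-hoc recalibration step can be sketched as a sigmoid fitted to the projection values, in the spirit of [Platt, 1999]. The version below is a simplified assumption: it uses plain gradient descent and omits the target smoothing and the solver described by Platt. The decision scores and labels are invented, and `fit_platt` and `platt_prob` are hypothetical names.

```python
import math

def platt_prob(f, A, B):
    """Map a decision score f to a probability: 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))

def fit_platt(scores, labels, lr=0.05, epochs=5000):
    """Fit A and B by gradient descent on the mean log-loss.
    Simplification: no label smoothing, no second-order solver."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_A = grad_B = 0.0
        for f, t in zip(scores, labels):
            p = platt_prob(f, A, B)
            # Gradients of the log-loss: dL/dA = (t - p) * f, dL/dB = (t - p)
            grad_A += (t - p) * f
            grad_B += (t - p)
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B

# Invented SVM-style decision values and binary labels (not linearly separable).
scores = [-2.1, -1.3, -0.7, -0.2, 0.3, 0.4, 0.9, 1.8, 2.5]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]

A, B = fit_platt(scores, labels)
p_low = platt_prob(-2.1, A, B)    # probability of y = 1 far on the negative side
p_high = platt_prob(2.5, A, B)    # probability of y = 1 far on the positive side
```

A negative fitted `A` means the probability of the positive output grows with the decision score, which is the monotonic relationship this kind of recalibration assumes.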
Generative and discriminative models have very different properties as well as complementary strengths and weaknesses (e.g., generative models allow one to generate new data similar to existing data, whereas discriminative models generally outperform generative models in classification tasks). A detailed comparison of generative and discriminative models can be found in [Ng and Jordan, 2002, Ulusoy and Bishop, 2006, Bishop and Lasserre, 2007].
To represent the probabilistic approach in our empirical studies, we implement the baseline probabilistic model using the discriminative approach, where the L2-regularized logistic regression model is used to directly learn the conditional probability from D.
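For illustration only, an L2-regularized logistic regression of this kind can be sketched in a few lines. The text does not specify the optimizer, regularization strength, or data, so the batch gradient descent, the values of `lam`, `lr`, and `epochs`, and the toy data below are all assumptions; `fit_logreg_l2` and `cond_prob` are hypothetical names.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logreg_l2(X, y, lam=0.1, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss plus (lam / 2) * ||w||^2.
    The bias term is left unregularized."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def cond_prob(w, b, x, y):
    """Model estimate of P(Y = y | X = x) for a binary output y in {0, 1}."""
    p1 = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return p1 if y == 1 else 1.0 - p1

# Toy 1-D data (invented): small x tends to y = 0, large x to y = 1.
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logreg_l2(X, y)

p_typical = cond_prob(w, b, [2.0], 1)   # output agrees with the learned pattern
p_unusual = cond_prob(w, b, [2.0], 0)   # output contradicts it
```

The instance ([2.0], 0) receives the lower conditional probability, so it is the one a probabilistic detector would flag.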
4.2.1.2 Outlier Scoring The second phase of the probabilistic approach aims to compute outlier scores using the obtained data model. The goal is to assign each instance an outlier score such that the higher the score is, the more likely the instance is an outlier.
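Since outliers are associated with low probabilities, we can convert probabilities to outlier scores (where stronger outliers are associated with a higher score) using one of the following transformations:
$$\mathrm{Score}_{\mathrm{PROB}}(y^{(n)} \mid x^{(n)}) = 1 - \tilde{P}(y^{(n)} \mid x^{(n)}; M) \tag{4.1}$$
or
$$\mathrm{Score}_{\mathrm{PROB}}(y^{(n)} \mid x^{(n)}) = \frac{1}{\tilde{P}(y^{(n)} \mid x^{(n)}; M)} \tag{4.2}$$
In the following discussion, we use a minor modification of the second score, where we take the logarithm of the inverse probability to rank the conditional outliers:
$$\mathrm{Score}_{\mathrm{PROB}}(y^{(n)} \mid x^{(n)}) = \log \frac{1}{\tilde{P}(y^{(n)} \mid x^{(n)}; M)} = -\log \tilde{P}(y^{(n)} \mid x^{(n)}; M) \tag{4.3}$$
Note that the logarithm is a monotonically increasing function. Therefore, the order of the scores is the same before and after the transformation.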
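The order-preservation claim is easy to check numerically. The sketch below applies the complement score of Eq. (4.1), the inverse score of Eq. (4.2), and the logarithm of the inverse probability to a handful of invented probabilities and compares the resulting rankings. The instance labels and probability values are assumptions, and all probabilities are taken to be strictly positive so that the inverse and the logarithm are defined.

```python
import math

def score_complement(p):
    return 1.0 - p           # Eq. (4.1)

def score_inverse(p):
    return 1.0 / p           # Eq. (4.2)

def score_neg_log(p):
    return -math.log(p)      # logarithm of the inverse probability

# Invented conditional probabilities for four instances.
probs = {"a": 0.90, "b": 0.40, "c": 0.05, "d": 0.001}

def ranking(score_fn):
    """Instance ids sorted from strongest to weakest outlier."""
    return sorted(probs, key=lambda k: score_fn(probs[k]), reverse=True)

r_complement = ranking(score_complement)
r_inverse = ranking(score_inverse)
r_neg_log = ranking(score_neg_log)
```

All three rankings coincide because each score is a strictly decreasing function of the probability. The scores differ only in scale: the log form spreads out the very small probabilities that the complement score compresses near 1.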
4.2.1.3 Limitations of Probabilistic Models Probabilistic outlier detection approaches, however, have several fundamental drawbacks that may affect COD performance. These mainly concern the accuracy of the underlying data models that produce the probability estimates from which we compute the outlier score.
More specifically, standard parameter optimization criteria for generative models, such as Bayesian belief networks, the naïve Bayes model, and linear discriminant analysis, assume that data instances are drawn independently from an unknown population (i.e., independently and identically distributed, or i.i.d.). Accordingly, they treat all instances as equally important and minimize the expected loss under the i.i.d. assumption. However, this assumption is often violated in practical problems [Dundar et al., 2007].
Although discriminative models, such as logistic regression, are less strict with the i.i.d. assumption, they still often fail to produce well-calibrated probabilities for sparse regions of the input (X) space (i.e., the regions where X has low support) [MacKay, 2003]. In addition, the fixed representation of parametric data models may constrain the accuracy of the estimates. A parametric approach relies on a set of model parameters that reflect the underlying assumptions about the population. When the assumptions are correct, the approach will produce accurate and precise probability estimates. However, if the assumptions are not correct, the approach is likely to fail; e.g., when one trains a linear model for a nonlinear domain, the assumption that the probability is monotonically increasing along the discriminative projection does not hold, which leads to imprecise probability estimates.
The issues described above may have a crucial impact on outlier detection performance. Unfortunately, there are no rules of thumb for avoiding them. For example, it is possible to consider local (instance-based) models instead of building a global model, as in the work by [Valko and Hauskrecht, 2008]. However, this approach typically reduces the sample size, and thus the resulting probability estimates may still be inaccurate. Alternatively, calibration via binning [Tukey, 1961, Bella et al., 2009, Pakdaman, 2017] might address the general issues with imprecise probability estimates. However, this again would not be a good solution for outlier detection, in which we want to correctly estimate very small probabilities. Since outlier detection is often done on a finite dataset, this becomes particularly hard as binning reduces the sample size.
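The difficulty with binning can be seen in a small sketch of equal-width histogram binning, one common form of calibration via binning. This is an illustrative assumption rather than the specific procedure of the cited works; the predictions, labels, bin count, and the names `fit_histogram_binning` and `calibrate` are invented for the example.

```python
def fit_histogram_binning(pred, labels, n_bins=5):
    """Equal-width bins on [0, 1]; each bin stores (count, fraction of positives)."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for p, t in zip(pred, labels):
        i = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        counts[i] += 1
        positives[i] += t
    return [(c, (s / c) if c else None) for c, s in zip(counts, positives)]

def calibrate(bins, p):
    """Replace a raw prediction by the empirical positive rate of its bin."""
    n_bins = len(bins)
    _, rate = bins[min(int(p * n_bins), n_bins - 1)]
    return p if rate is None else rate         # keep the raw value for empty bins

# Invented raw model outputs and true binary labels.
pred = [0.02, 0.05, 0.08, 0.15, 0.30, 0.35, 0.55, 0.62, 0.70, 0.90, 0.95, 0.97]
labels = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1]

bins = fit_histogram_binning(pred, labels)
low_a = calibrate(bins, 0.02)
low_b = calibrate(bins, 0.15)
```

With twelve instances spread over five bins, each bin estimate rests on at most four samples, and every raw prediction below 0.2 is mapped to the same calibrated value. The calibrated output therefore cannot separate a probability of 0.02 from one of 0.15, which is exactly the resolution outlier ranking needs.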