This section provides an introduction to anomaly detection, starting with a definition of anomaly. To keep the problem as general as possible, the whole discussion is based on a generic dataset D and distance measure d. For all the applications considered here, it may be assumed that $D = \mathbb{R}^p$ and that d is the Euclidean distance.
6.2.1 Definition of Anomalies
The aim of an anomaly detection system is to detect anomalies or outliers. An intuitive definition of outlier was provided by Hawkins [206].
Definition 6.2.1. Hawkins-Outlier: An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.
A more formal, distance-based (DB) definition is instead provided in [207]:
Definition 6.2.2. DB(α, $d_{\min}$)-Outlier: An object p in a dataset D is a DB(α, $d_{\min}$)-outlier if at least a percentage α of the objects in D lie at a distance greater than $d_{\min}$ from p, i.e., the cardinality of the set
\[ \{ q \in D : d(p, q) \le d_{\min} \} \tag{6.2} \]
is less than or equal to (100 − α)% of the size of D.
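Definition 6.2.2 can be checked directly by counting the objects that fall within $d_{\min}$ of p. The following is a minimal sketch under stated assumptions, not an implementation taken from [207]: the function name `is_db_outlier`, the toy dataset and the parameter values are hypothetical, and d is taken to be the Euclidean distance.

```python
import math

def is_db_outlier(p, data, alpha, d_min):
    """Return True if p is a DB(alpha, d_min)-outlier of data.

    alpha is a percentage in [0, 100]; data is a list of points containing p.
    """
    # Cardinality of the set (6.2): objects (p included) within d_min of p.
    close = sum(1 for q in data if math.dist(p, q) <= d_min)
    # p is an outlier when that count is at most (100 - alpha)% of |D|.
    return close <= (100 - alpha) / 100 * len(data)

# Hypothetical toy dataset: four clustered points and one distant point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
flags = [is_db_outlier(p, data, alpha=75, d_min=1.0) for p in data]
print(flags)  # only the last point is flagged
```

With α = 75 and $d_{\min}$ = 1, only the distant point has at most 25% of the dataset within $d_{\min}$, so it is the only one flagged.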
The above definition only captures certain kinds of outliers. Because the definition takes a global view of the dataset, these outliers can be viewed as "global" outliers. However, for many interesting real-world datasets which exhibit a more complex structure, there are other kinds of outliers. These can be objects that are outlying relative to their local neighbourhoods, particularly with respect to the densities of the neighbourhoods. These outliers are regarded as "local" outliers. An example dataset with global and local outliers is depicted in Figure 6.1.
6.2.1.1 Local Outliers
Breunig et al. [208] provide a formal definition of local outliers. The key difference between this notion and the previous notions of outliers is that being outlying is not a binary property. Instead, it assigns to each object an outlier factor, which is the degree to which the object is considered outlying. The proposed local outlier factor (LOF) is mathematically described in Equation (6.6), but some preliminary definitions are required.
Figure 6.1: A two-dimensional dataset defined by a main cluster (green) and two smaller clusters (red and blue). Global anomalies are represented with yellow stars, while local anomalies associated with the smaller clusters are represented with blue and red stars respectively.
Definition 6.2.3. (k-distance of an object p) For any positive integer k, the k-distance of object p, denoted as k-distance(p), is defined as the distance d(p, o) between p and an object o ∈ D such that:
i) for at least k objects o′ ∈ D \ {p} it holds that d(p, o′) ≤ d(p, o);
ii) for at most k − 1 objects o′ ∈ D \ {p} it holds that d(p, o′) < d(p, o).
Definition 6.2.4. (k-distance neighbourhood of an object p) Given the k-distance of p, the k-distance neighbourhood of p contains every object whose distance from p is not greater than the k-distance, i.e.
\[ N_k(p) = \{ q \in D \setminus \{p\} : d(p, q) \le k\text{-distance}(p) \} \tag{6.3} \]
The objects q are called the k-nearest neighbours of p.
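Definitions 6.2.3 and 6.2.4 can be illustrated with a short sketch. It assumes Euclidean distance and distinct points; the helper names `k_distance` and `k_neighbourhood` and the toy dataset are hypothetical. Sorting the distances from p and taking the k-th smallest satisfies conditions i) and ii), and ties at the k-distance make the neighbourhood larger than k.

```python
import math

def k_distance(p, data, k):
    """k-distance of p: distance to its k-th nearest object in data minus {p}."""
    dists = sorted(math.dist(p, q) for q in data if q != p)
    return dists[k - 1]

def k_neighbourhood(p, data, k):
    """N_k(p): every object other than p within the k-distance of p."""
    kd = k_distance(p, data, k)
    return [q for q in data if q != p and math.dist(p, q) <= kd]

# Hypothetical data: 1 object at distance 1 from p, 2 at distance 2, 3 at distance 3.
p = (0.0, 0.0)
data = [p, (1.0, 0.0), (2.0, 0.0), (0.0, 2.0),
        (3.0, 0.0), (0.0, 3.0), (-3.0, 0.0)]

print(k_distance(p, data, 2), k_distance(p, data, 3))  # both equal 2.0
print(len(k_neighbourhood(p, data, 4)))                # 6, i.e. more than k = 4
```

The example shows that the object o in Definition 6.2.3 need not be unique, in which case the cardinality of $N_k(p)$ exceeds k.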
Definition 6.2.5. (reachability distance of an object p w.r.t. object o) Let k be a natural number. The reachability distance of object p with respect to object o is defined as
\[ \text{reach-dist}_k(p, o) = \max\{ k\text{-distance}(o),\ d(p, o) \} \tag{6.4} \]
Intuitively, if object p is far away from o, then the reachability distance between the two is simply their actual distance. However, if they are "sufficiently" close, the actual distance is replaced by the k-distance of o. An illustration of this concept (reproduced from the original LOF paper [208]) is provided in Figure 6.2.
Figure 6.2: reach-dist(p1, o) and reach-dist(p2, o), for k = 4. Figure from [208].
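The behaviour shown in Figure 6.2 can be reproduced numerically. The sketch below assumes Euclidean distance; the coordinates of o, p1 and p2 are hypothetical and are not those of the original figure, and the helper names are illustrative only.

```python
import math

def k_distance(o, data, k):
    """Distance from o to its k-th nearest object in data minus {o}."""
    return sorted(math.dist(o, q) for q in data if q != o)[k - 1]

def reach_dist(p, o, data, k):
    """Reachability distance of p with respect to o."""
    return max(k_distance(o, data, k), math.dist(p, o))

o, p1, p2 = (0.0, 0.0), (0.5, 0.0), (4.0, 0.0)
data = [o, p1, p2, (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

k = 4
print(k_distance(o, data, k))      # 4-distance of o
print(reach_dist(p1, o, data, k))  # p1 is close to o: the 4-distance of o is used
print(reach_dist(p2, o, data, k))  # p2 is far from o: the actual distance is used
```

Here p1 lies inside the 4-distance neighbourhood of o, so its reachability distance is raised to the 4-distance of o, whereas p2 keeps its actual distance.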
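Definition 6.2.6. The local reachability density of p is defined as the inverse of the average reachability distance of p with respect to its nearest neighbours, that is:
\[ \mathrm{lrd}_v(p) = \left( \frac{\sum_{o \in N_v(p)} \text{reach-dist}_v(p, o)}{|N_v(p)|} \right)^{-1} \tag{6.5} \]
where v is the neighbourhood size (the parameter denoted k in Definitions 6.2.3 to 6.2.5).
Definition 6.2.7. It is finally possible to define the Local Outlier Factor (LOF) for a point p as:
\[ \mathrm{LOF}_v(p) = \frac{\sum_{o \in N_v(p)} \frac{\mathrm{lrd}_v(o)}{\mathrm{lrd}_v(p)}}{|N_v(p)|} \tag{6.6} \]
$\mathrm{lrd}_v(p)$ is a measure of the density of points around p. The LOF is the average ratio between the local reachability density of the v nearest neighbours of p and that of p itself. Hence, if the density of the data around p is much lower than that of its v nearest neighbours, the LOF of p will be large.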
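Equations (6.5) and (6.6) translate almost line by line into code. The following is a naive sketch for illustration only: it assumes Euclidean distance and distinct points (duplicates would make a reachability sum zero), recomputes every quantity on demand, and is therefore far slower than a practical implementation. All function names and the toy dataset are hypothetical.

```python
import math

def k_distance(p, data, v):
    """Distance from p to its v-th nearest object in data minus {p}."""
    return sorted(math.dist(p, q) for q in data if q != p)[v - 1]

def neighbourhood(p, data, v):
    """N_v(p): all objects within the v-distance of p (ties included)."""
    kd = k_distance(p, data, v)
    return [q for q in data if q != p and math.dist(p, q) <= kd]

def reach_dist(p, o, data, v):
    """Reachability distance of p with respect to o."""
    return max(k_distance(o, data, v), math.dist(p, o))

def lrd(p, data, v):
    """Local reachability density, Equation (6.5)."""
    nbrs = neighbourhood(p, data, v)
    return len(nbrs) / sum(reach_dist(p, o, data, v) for o in nbrs)

def lof(p, data, v):
    """Local outlier factor, Equation (6.6)."""
    nbrs = neighbourhood(p, data, v)
    return sum(lrd(o, data, v) / lrd(p, data, v) for o in nbrs) / len(nbrs)

# Hypothetical data: a 3 x 3 unit grid plus one isolated point.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
isolated = (10.0, 10.0)
data = grid + [isolated]

lof_isolated = lof(isolated, data, v=3)
lof_centre = lof((1.0, 1.0), data, v=3)
print(lof_isolated, lof_centre)
```

On this toy dataset the isolated point obtains an LOF well above 1, while the centre of the grid obtains a value below 1, which matches the interpretation of the LOF as a relative density.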
In [208] the authors use the LOF to detect anomalies. The method is important because it provides a formal definition of local outliers and introduces the idea of using an outlier score instead of a binary classification. However, it will not be discussed further in this thesis because it has been outperformed by more recent methods in both computing time and accuracy [203].
To complete the classification of the different kinds of anomaly, the definition of contextual anomaly is now reported:
Definition 6.2.8 (Contextual Anomalies). The term Contextual Anomaly is reported in [209] to describe a particular case of outlier. If a data instance is anomalous in a specific context (but not otherwise), then it is termed as a contextual anomaly (also referred to as conditional anomaly). The notion of a context is induced by the structure in the data set and has to be specified as a part of the problem formulation. For example the time in a time series dataset is an attribute that defines a context.
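As an illustration of Definition 6.2.8, the sketch below flags a value only with respect to the context in which it is observed. The season labels, the reference readings and the z-score rule are all hypothetical choices made for this example; they are not taken from [209].

```python
from statistics import mean, stdev

# Hypothetical reference readings, grouped by context (here, the season).
history = {
    "winter": [2.0, 4.0, 3.0, 5.0, 1.0],
    "summer": [28.0, 30.0, 27.0, 31.0, 29.0],
}

def is_contextual_anomaly(value, context, history, threshold=3.0):
    """Flag value if it lies more than threshold standard deviations
    from the mean of the reference readings of its own context."""
    ref = history[context]
    return abs(value - mean(ref)) / stdev(ref) > threshold

print(is_contextual_anomaly(29.0, "summer", history))  # typical for summer
print(is_contextual_anomaly(29.0, "winter", history))  # anomalous for winter
```

The same reading is unremarkable in one context and anomalous in the other, which is why the context has to be specified as part of the problem formulation.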
6.2.2 Introduction to Anomaly Detection Algorithms
Anomaly detection is a classical field of study, and numerous anomaly detection algorithms have been developed over the years. According to [210], anomaly detection algorithms can be classified into three main groups:
• Supervised Anomaly Detection. In this case the anomaly detection problem is reduced to a classification task. It assumes the availability of a training dataset with labelled instances of both normal and anomalous behaviour, in which case standard classifiers such as Linear Discriminant Analysis (LDA), Support Vector Machines (SVM) and k-nearest neighbours can be trained to distinguish between normal and anomalous samples [210].
• Semi-supervised Anomaly Detection. It assumes that the training data has labelled instances for only the normal class. Systems can then be trained to assign an anomaly score to new samples according to how distant they are from the normally behaving ones. Several algorithms have been developed with this aim, including Multivariate Control Charts [211], one-class SVMs [205] and Unsupervised Random Forests [201], [23]. While some methods are explicitly designed for unsupervised anomaly detection, it is possible to transform an unsupervised problem to a supervised one using the technique described in [201]. Since semi-supervised anomaly detection methods do not require labels for the anomaly class, they are more widely applicable than supervised techniques.
• Unsupervised Anomaly Detection. Techniques that operate in unsupervised mode do not require labelled data and thus are the most widely applicable. The techniques in this category make the implicit assumption that normal instances are far more frequent than anomalies in the test data. This structure is then revealed, and potential anomalies identified, through the application of unsupervised clustering techniques such as DBSCAN [212] and Max Separation clustering [90], or through isolation-based methods such as Isolation Forest [203].
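The semi-supervised setting described above can be sketched with a simple distance-based score. This is a hypothetical illustration, not one of the cited algorithms: only samples labelled as normal are stored, and a new sample is scored by its mean Euclidean distance to its k nearest normal samples. The function name, the data and the value of k are assumptions.

```python
import math

def anomaly_score(x, normal_samples, k=3):
    """Mean distance from x to its k nearest normal training samples.

    Larger scores indicate samples that lie further from normal behaviour.
    """
    dists = sorted(math.dist(x, q) for q in normal_samples)
    return sum(dists[:k]) / k

# Hypothetical training set containing normal samples only.
normal_samples = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.3), (0.1, -0.2), (-0.3, -0.1)]

score_typical = anomaly_score((0.05, 0.05), normal_samples)
score_unusual = anomaly_score((3.0, 3.0), normal_samples)
print(score_typical, score_unusual)
```

No anomalous examples are needed to compute the score; a threshold on it would still have to be chosen, for instance from the scores of held-out normal samples.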
6.2.2.1 Unsupervised Anomaly Detection
In semiconductor manufacturing, measurements of normally behaving wafers are generally available, while often no information regarding anomalous data is available. For this reason, the focus of this thesis is on semi-supervised and unsupervised techniques, as they do not require data labelled as anomalous. In addition, due to the high dimension of the OES datasets, algorithms that can scale to large and high-dimensional datasets are of particular importance. Unless otherwise specified, in the remainder of the thesis the term unsupervised algorithm will be used to refer to both unsupervised and semi-supervised algorithms. Unsupervised algorithms for anomaly detection can be divided into four main groups. Good descriptions of these groups are reported in [208] and [213].
• Distribution-Based. In distribution-based methods a standard distribution (e.g. Normal, Poisson, etc.) is used to fit the data. Outliers are then defined based on the probability distribution. Over one hundred tests in this category, called discordancy tests, have been developed for different scenarios [214]. A key drawback of this category of tests is that most of the distributions used are univariate. Some examples of multivariate tests are described in [215]. In addition, for the majority of applications, the underlying distribution is unknown. Fitting the data with standard distributions is costly, and may not produce satisfactory results.
• Depth-Based. Each data object is represented as a point in a p-dimensional space, and is assigned a depth. With respect to outlier detection, outliers are more likely to be data objects with smaller depths. Many definitions of depth have been proposed, such as the Mahalanobis depth [216] and the convex hull peeling depth [217]. In theory, depth-based approaches could work for large values of p. However, in practice, while there exist efficient algorithms for p = 2 or 3 [218], [219], depth-based approaches become inefficient for large datasets for p ≥ 4. This is because depth-based approaches rely on the computation of p-dimensional convex hulls, which have a lower bound complexity of $O(n^{p/2})$ for n objects.
• Cluster-Based. Most clustering algorithms (e.g. DBSCAN [212], BIRCH [220]) are to some extent capable of handling exceptions. However, since the main objective of a clustering algorithm is to find clusters, they are developed to optimize clustering, and not to optimize outlier detection. The exceptions (called "noise" in the context of clustering) are typically just tolerated or ignored when producing the clustering result. Even if the outliers are not ignored, the notions of outliers are essentially binary, and there is no quantification of how outlying an object is.
• Density-Based. As previously mentioned, this methodology was originally proposed by Breunig et al. [208]. It relies on the local outlier factor (LOF) of each object, which depends on the local density of its neighbourhood. Algorithms that are part of this group include LOF [208], LOCI [213] and Isolation Forest [203].
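To make the distribution-based group more concrete, the sketch below fits a univariate Normal distribution to the data and flags the observations that lie far from the fitted mean. It is a simplified illustration rather than one of the discordancy tests of [214]: the measurements, the threshold of 2.5 standard deviations and the function name are hypothetical.

```python
from statistics import mean, stdev

def normal_outliers(values, threshold=2.5):
    """Return the values lying more than threshold sample standard
    deviations from the sample mean of a fitted Normal distribution."""
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs(x - mu) / sigma > threshold]

# Hypothetical univariate measurements with one gross error.
values = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.0, 15.0]
flagged = normal_outliers(values)
print(flagged)  # only the gross error is flagged
```

The sketch also shows the limitations noted above: it handles a single variable, and it is only meaningful if the Normal distribution is a reasonable model of the data. In addition, the anomaly itself inflates the estimated mean and standard deviation, so the threshold cannot be set too high on small samples.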
In the next sections a review is presented of selected anomaly detection algorithms. The algorithms have been chosen according to their popularity and for their performance on large and high-dimensional datasets.