Actividades diarias
3.1 ESTUDIO DE NECESIDADES Y ACTIVIDADES DEL ESTUDIANTE UNIVERSITARIO
3.1.6 ANALISIS DE ÁREAS 1 COCINA
3.1.6.2 SALA COMEDOR
This section presents an experimental comparison of theEx-
plainerto the state of the art algorithms by [25], [26] and [28] on 36 classification datasets from the UCI machine learning repository [36], with the number of features ranging from 4 to 5000 and number of samples from 22 up to nearly 100k.
Since there is a lack of the prior art on explaining anoma- lies, the methodology to measure quality of an explanation has not been established yet. Therefore, this work measures how the area under the receiver operating characteristic (AUC) of an anomaly detector improves if the anomaly detection uses only features used in the explanation in comparison to the baseline when all features are used. The rationale behind is that the explanation should contain only features in which the anomaly is most different from the remaining data. Therefore, the AUC in the reduced space should be comparable or even higher than the AUC computed on the full space. Also recall that all recent prior
art to which theExplaineris compared stop the explanation
after identifying the set of features in which anomaly is visible, which further justifies the choice of using AUC for the comparison.
The algorithms are also compared on the basis of the number of returned features and scalability, which is impor- tant for contemporary large data.
4.1 Experimental settings
The Explaineris compared to the algorithms of [25], [28] and [26]. The implementations of the first two algorithms
were kindly provided by their authors, while the last algo- rithm was implemented by ourselves. The compared algo- rithms tried to explain anomalies identified by Local Outlier
Factor (LOF) by [23], ak−nearest neighbor (knn) classifier
by [38] and the Isolation Forest (iForest) by [31]. To avoid the influence of incorrect detections, the perfect detector scenario was simulated by explanation of the ground truth (GT) anomalies.
As mentioned above, the performance of anomaly detec- tors was measured by the area under the receiver operating characteristic (AUC) of an anomaly detection algorithm us- ing only the features selected by the explanation algorithm. The reported AUC values were measured by the anomaly detector under investigation, e.g. LOF was used to measure AUC in all LOF related experiments. For the evaluation of experiments with ground truth anomalies the Isolation forest detector was used because it is the fastest detector with very good overall performance.
The anomaly detection algorithms were set as follows. The Isolation forest was set to create 100 trees using 256 samples for each, the thresholds between normal and anomalous samples were estimated from data, separately for each dataset. The size of local neighbourhood for the LOF detecter was set to 10 and detection threshold was set to 2, in accordance with the recommendation in [23]. The
k-nn used onlyk= 4, while the threshold on a distance to
the average of theknearest neighbours was estimated from
data, separately for each dataset.
TheExplainerused training sets of size|T |= 100. The
number of trained trees was set to nT = 5for minimal
explanation, whereas it is determined automatically during the run of the algorithm for maximal explanation. The other algorithms were set up as recommended by the authors in
the respective papers. In [25], the neighbouring set sizek
varied from 10 to 40, in [26], the regularization parameter
was set toα= 0.1and the neighbourhood sizes varied from
5 to 25. Finally, [28] has two parameters and according to the authors, the algorithm should be robust to their setting,
which our experiments withαvaried from 0.1 to 0.7 and
kfrom 20 to 50 have confirmed. The reported results were
obtained with the settingα= 0.5andk= 50.
The experimental comparison used 36 different datasets from UCI machine learning repository, [36], covering many application domains. Since they are designed for supervised classification, they were converted to anomaly detection problems as recommended by [11]. The LOF alone showed to be the worst of all tested detection algorithms as it did not found any outliers in 6 out of 36 datasets. Therefore, the comparison was done using only 30 datasets.
4.2 Experimental results
Quality of explanation measured by the AUC At first, the quality of explanation was compared by measuring the aforementioned improvement in AUC. Due to the large number of problems on which the algorithms are compared, the results are summarized by critical difference diagrams as proposed by [12]. Figure 3 shows the average rank of each algorithm over all 36 problems, groups of algorithms that are not significantly different according to the Nemenyi
post-hoc analysis are connected. Table??contains all results
8 1 2 3 4 5 Explainermin Explainermax Dang 2014 Dang 2013 Micenkova
(a) ground truth
1 2 3 4 5 Explainermin Dang 2014 Explainermax Dang 2013 Micenkova (b) Isolation forest 1 2 3 4 5 Explainermin Dang 2013
Dang 2014 Explainermax Micenkova (c)k-nn 1 2 3 4 5 Explainermin Explainermax Dang 2014 Dang 2013 Micenkova (d) LOF
Fig. 3. Critical comparison of explainers when ground truth anomalies (top-left) and anomalies discovered by Isolation forest (top-right),k-nn (bottom-left) and LOF (bottom-right) were explained.
TheExplainerminperforms consistently the best regard-
less the anomaly detector used to identify the anomalies, though the margin of the improvement over the prior art depends on that detector. For anomalies identified by
Isolation Forest andk-nn, the statistical test did not reject
the hypothesis that the Explainermin and Dang 2013 and
Dang 2014 are statistically different. Still, in both cases theExplainermin achieved the lowest rank. For anomalies
identified by the LOF, theExplainerwas significantly better
than prior methods.
When the effect of an anomaly detector is removed
by explaining the ground truth, both theExplainerminand
Explainermax are significantly better than all of the prior
art according to the [39] test and the post-hoc tests with the corrections for simultaneous hypotheses testing by [40]. Precise results are shown in Table 3.
The algorithm of [28] didn’t work well with thek-nn
detector as it didn’t return any features as an explanation in 17 out of 36 cases, despite the algorithm has been executed with a wide range of parameters.
Further investigation, of the cases where the Ex-
plainerminperformed worse revelled that in multiple fea-
tures, waveform-1 and waveform-2 datasets in combination
withk-nn anomaly detector, theExplainerminhas removed
features responsible for clustering normal samples together,
which misleads thek-nn detector to create new anomalies
that were not present in the full space version of the dataset. Yet authors believe that the explanation was correct and the worse performance is due to the detector rather than due to theExplainer. TheExplainerminexcels on high dimensional
datasets like gissete, madelon, libras and on noisy datasets. theExplainermaxseems to be more susceptible to mistakes
in detection than theExplainermin. While the explanations
were reasonable, anomaly detection on more features was typically worse (see Table 6) since more features increase the signal to noise ration to which anomaly detectors are sensitive, and which explains the worse AUC.
To conclude this comparison, the performance of [28] algorithm strongly depends on the particular dataset. The [26] improved the [25] and is the closest prior art to the
Explainerin the terms of identification of features in which the anomaly is detectable.
The compactness of explanation by number of features
TABLE 3
Results of the statistical test of [39] and its post-hoc tests with the correction for simultaneous hypotheses testing by [40]. The rejection thresholds are computed for the family-wise significance levelα= 0.05
algorithms p rejection threshold Explainerminvs. Micenkova 2013 1.52e-15 0.005
Explainerminvs. Dang 2013 2.45e-12 0.008
Explainermaxvs. Micenkova 2013 6.15e-12 0.008
Explainermaxvs. Dang 2013 1.81e-12 0.008
Explainerminvs. Dang 2014 1.27e-6 0.008
Dang 2014 vs. Micenkova 0.0017 0.013 Explainerminvs. Dang 2014 0.0022 0.013
Dang 2013 vs. Dang 2014 0.0306 0.017 Explainerminvs.Explainermax 0.0736 0.025
Dang 2013 vs. Micenkova 0.3325 0.050
Figure 4a shows the median of selected features used in
explanation of the ground truth anomalies. Since theEx-
plainerminwas designed to select as few features as possi-
ble, it consistently selects the minimal number of features among all tested methods. Taking into account that it also outperforms all prior art in terms of the AUC, it can be con- cluded that its explanations are more compact and therefore more precisely specifying root causes of anomalies, which
was its goal. Surprisingly, even though the Explainermax
should return all relevant features (and therefore returns
more features than theExplainermin), their number is still
considerably lower than in case of all prior art methods, especially for large datasets.
The comparison of running times The last compar- ison was done with respect to the computational com- plexity, measured as a total time to explain all ground- truth anomalies. As in the previous experiment, the last two datasets are missing for both Dang’s algorithms and one for Micenkov´a, because of their excessive memory and computational requirements. Figure 4b show running times averaged over 20 runs. On smaller datasets, the algorithm of [26] clearly outperformed all other algorithms, but for the
larger datasets, it is evident that theExplainer methods are
much more efficient. The break point is approximately 1000 samples for low dimensional datasets or 200 samples for high dimensional datasets.
Comparison with Isolation forest:
The Explainer and Isolation forest are both based on a random forests trained only on a small sub sample of data, so they may seem similar. This section shows that not only their main purpose but also their outcomes differ substantially.
Isolation forest is an ensemble of random trees. More specifically, both the feature and threshold are selected randomly for each splitting node. The main idea is that anomalies differ strongly from the normal samples, there- fore, they should be separated with fewer splits. There is no focus on identifying key features or any other way of explanation or description of findings.
The main goal of theExplainer is to identify features
responsible for the anomalous nature of the sample under analysis. Therefore, each split uses the feature and threshold that produces the best possible separation of an anomaly and normal samples. Features used in spilt nodes are then assembled into classification rules.
9 5 10 15 20 25 30 35 100 101 102 103 104 datasets n um b er of features All features Explainermin Explainermax Dang 2013 Dang 2014 Micenkova
(a) number of features
76 200 240 570 1528 3000 45586 10−2 10−1 100 101 102 103 104 105 number of samples time [s] Explainermin Explainermax Dang 2013 Dang 2014 Micenkova (b) running times
Fig. 4. The comparison of the number of selected features per dataset. The curve marked as all features shows the number of features of a particular dataset (left). The comparison of the running times (right). Please notice that number of features as well as running times are plotted in logarithmic scale.
One of the basic properties indicating the comprehensibility of a rule is its length. A comparison of the average rule length for each dataset is depicted at Figure 5 and sum- marised in Table 4.
TABLE 4
The average length of rules extracted fromExplainermin,Explainermax
and Isolation forest over all datasets. Explainermin Explainermax Isolation forest
2.47 4.06 5.79
The rules produced by Isolation forests are not only
longer (even against the Explainermax) but they have also
other problems, caused by the random nature of this method. First, extracted rules are inconsistent and used fea- tures vary every time a forest is trained. Secondly, rules ex- tracted from a random tree can contain a rule on one feature many times, even in succession, or contain rules without any impact on resulting set. Therefore, rules extracted from Isolation forests are not suitable for the explanation.
5 10 15 20 25 30 35 0 5 10 15 datasets av erage rule length Explainermin Explainermax Isolation forest
Fig. 5. Comparison of the average length of the rules extracted from the
Explainerand the Isolation forest anomaly detector.
2 4 6 8 10 12 14 0.5 0.6 0.7 0.8 0.9 1 datasets A UC |T |=10 |T |=20 |T |=50 |T |=100 |T |=200 |T |=500 |T |=1000
(a) ground truth
2 4 6 8 10 12 14 0.5 0.6 0.7 0.8 0.9 datasets A UC |T |=10 |T=20 |T |=50 |T |=100 |T |=200 |T |=500 |T |=1000 (b) Isolation forest 5 10 15 20 25 30 35 0.5 0.6 0.7 0.8 0.9 1 datasets A UC nT=1 nT=2 nT=5 nT=10 nT=20 nT=50 (c) ground truth 5 10 15 20 25 30 35 0.5 0.6 0.7 0.8 0.9 datasets A UC nT=1 nT=2 nT=5 nT=10 nT=20 nT=50 (d) Isolation forest Fig. 6. AUCs of anomaly detection on datasets with more than 1000 samples (14 of all used dataset) using only features suggested by the
Explainerwith varying training set sizes when ground truth anomalies were explained (top-left) and when anomalies identified by Isolation forest were explained (top-right). The reported values are the average
of 10 runs training 10 trees per anomaly. TheExplainer was used
to explain the ground truth anomalies (bottom-left) and the anomalies identified by the Isolation forest detector (bottom-right). The AUCs are average of 10 runs using only features suggested by theExplainerwith the varying number of trees trained per anomaly and fixed training set size to 100 samples.The results for each setting were sorted according to their AUC, showing general performance rather than performance on specific datasets. The interpretation is simple, the line of the setting leading to better average AUCs will be higher than the line representing the setting leading to lower average AUCs.
4.3 Sensitivity to parameters
TheExplaineris controlled by two parameters: by the train-
ing set size |T | and by the number of trees trained per
anomaly nT. In this section we show the robustness of
theExplainerto the setting of those parameters, effectively
10 TABLE 5
Results of the Friedman’s statistical test with the correction for simultaneous hypotheses testing by [40], calculated while testing the training set size of theExplainerminfor explanation of the ground truth
anomalies. The rejection thresholds are computed for the family-wise significance levelα= 0.05
settings p rejection threshold
|T |=200 vs.|T |=1000 0.325 0.0023 |T |=10 vs.|T |=1000 0.407 0.0025 |T |=20 vs.|T |=1000 0.437 0.0026 |T |=100 vs.|T |=1000 0.468 0.0028 |T |=50 vs.|T |=200 0.501 0.0029 |T |=200 vs.|T |=500 0.534 0.0031 |T |=10 vs.|T |=50 0.604 0.0033 |T |=20 vs.|T |=50 0.641 0.0036 |T |=10 vs.|T |=500 0.641 0.0038 |T |=50 vs.|T |=100 0.678 0.0042 |T |=20 vs.|T |=500 0.678 0.0045 |T |=100 vs.|T |=500 0.717 0.0050 |T |=500 vs.|T |=1000 0.717 0.0056 |T |=50 vs.|T |=1000 0.756 0.0063 |T |=100 vs.|T |=200 0.795 0.0071 |T |=20 vs.|T |=200 0.835 0.0083 |T |=10 vs.|T |=200 0.876 0.0100 |T |=10 vs.|T |=100 0.917 0.0125 |T |=50 vs.|T |=500 0.958 0.0167 |T |=10 vs.|T |=20 0.958 0.0250 |T |=20 vs.|T |=100 0.958 0.0500
At first, theExplainerminwas executed with the number
of trees per anomaly nT = 5while varying the training
set size T ∈ {10,20,50,100,500,1000}, for that reason the experiment was performed only on datasets with more than 1000 samples. AUCs of different configurations, mea- sured by the Isolation forest detector, are shown for ground truth anomalies in Figure 6a and anomalies discovered by the Isolation forest 6b. According to these results, the
Explainerdoes not seem to be sensitive to the size of the training set, as all AUCs are very similar. Results of the Friedman’s test and its post-hoc tests are all in 5. When explaining ground truth, training sets as small as 50 - 100 samples were sufficient. When explaining anomalies identi- fied by a real anomaly detector, larger training sets like 100– 200 samples seems to be more appropriate. The finding that even a small training set is sufficient for good explanations is especially important for data-streams, where all samples needed for explanation have to be stored in memory.
Fixing the size of the training set to T = 100,theEx-
plainerminwas executed with different number of trees per
anomalynT ∈ {1,2,5,10,20,50}. AUCs of all configura-
tions, measured by the Isolation forest detector, are shown in Figure 6c for ground truth anomalies and in Figure 6d for anomalies identified by Isolation Forests. Similarly to the
size of the training dataset, theExplainerdoes not seem to
be sensitive to this setting. Results of the Friedman’s test and its post-hoc tests are all in the Appendix C. According to the results using 5-10 trees per anomaly is recommended because its stable while still being computationally efficient. 5 CONCLUSION
This paper has presented a novel approach to explaining the deviation of a sample, identified by an arbitrary anomaly detection algorithm, from the rest of data. The proposed
approach relies on specifically trained random forests to identify rules explaining the anomaly. Those rules are then assembled into the explanation in the form of a set of association rules or in the disjunctive normal form and presented to the user. The set of association rules is easy to read and understand by humans, whereas the disjunctive normal form was chosen because it is compact and practical for further machine processing.
The proposed approach has been extensively compared to the relevant prior art on 36 UCI machine learning repository datasets while explaining anomalies identified by multiple anomaly detection algorithms. The results show a superiority of the proposed solution, as it provides tighter and more accurate explanations. In addition, our testing has shown that it is also superior with respect to the time complexity, especially on the high dimensional datasets.
As to the theoretical fundamentals of our approach, we proved that our simplified splitting criterion used during the specific training of random forest is for this kind of ran- dom forest equivalent to the common best practice splitting criteria based on information gain or Gini impurity. APPENDIX
LetHdenotes the space of all possible splitting rules defined
as: H={hj,θ|j∈ {1, . . . , d}, θ∈R}, where hj,θ(x) = ( +1 ifxj> θ −1 otherwise
with xj being the jth feature of xand θ representing a
threshold. The most popular criteria to select the splitting
functionh∈ Hof a new internal node areinformation gain
defined as IG(S, h) =H(S)− X b∈{L,R} |Sb(h)| |S| H(S b) (7)
andGini impuritydefined as
GI(S) = 1− X
b∈{L,R}
f2(
Sb(h)), (8)
whereSis the subset of the training setTreaching the leaf
being split,SL(h) ={x∈ S|h(x) = +1}andSR(h) ={x∈
S|h(x) =−1},H(S)is the Shanon entropy ofSandfis
the fraction of samples of the specified class in the set. In the following lemma, we show that when training set
T contains exactly one anomaly, both criteria (7) and (8) are
equivalent and equal to
arg min h∈H|S
a
(h)|, (9)
where|Sa|is the size of the set containing the anomaly after
splitting. This splitting criterion is easy to interpret and can be calculated more effectively than information gain or Gini impurity.
Lemma A.1. Denote Tq(p1, . . . , pk)the Havrda-Charv´at en- tropy with indexq≥1, up to a multiplicative constant coinciding with the more recently introduced Tsallis entropy, [41]) for a finite-dimensional probability distribution(p1, . . . , pk), i.e.,