The output generated by the baseline system was evaluated against the gold-standard (see Section 3.2.1). As shown in Table 5, the gold-standard only contains the evaluation of 16,826 mappings.
Tag          Number of Mappings
Correct      9,222
Acceptable   5,496
Incorrect    2,108
Total        16,826
Table 5: Statistics of the Gold-standard Built by Two Cyber Security Analysts
However, with 32,246 snort messages mapped to a maximum of 6 CAPEC fields, the total number of possible mappings is 193,476 (6 × 32,246). The evaluation of the outputs of the baseline system was therefore made only on the overlapping answers; mappings provided by the system that were not included in the gold-standard were not evaluated. For each overlap, one of three outcomes was recorded: Correct Mapping, Acceptable Mapping or Incorrect Mapping, depending on how the mapping quality was judged in the gold-standard dataset. Following the advice of our security analysts, recall was deemed more important than precision. Indeed, in this domain, it is preferable to alert clients too often with false alarms than to miss potential cyber threats. To account for this, two types of precision were computed: strict precision (PS) and lenient precision (PL), which are defined as:
Strict Precision: $$P_S = \frac{\text{Correct Mappings}}{\text{(Correct + Acceptable + Incorrect) Mappings}}$$
Lenient Precision: $$P_L = \frac{\text{(Correct + Acceptable) Mappings}}{\text{(Correct + Acceptable + Incorrect) Mappings}}$$
as well as two types of recall: strict recall (RS) and lenient recall (RL):
Strict Recall: $$R_S = \frac{\text{Correct Mappings}}{\text{Correct + Acceptable + Incorrect}}$$
Lenient Recall: $$R_L = \frac{\text{(Correct + Acceptable) Mappings}}{\text{Correct + Acceptable}}$$
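The overlap-based evaluation and the four measures defined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in this work: the function names `tally_overlap` and `precision_recall`, the dictionary layout, the field names and the toy data are all hypothetical, and reading the recall denominators as counts over the whole gold-standard is an assumption.

```python
from collections import Counter


def tally_overlap(system_pairs, gold_labels):
    """Count gold-standard labels for the system mappings covered by the gold-standard.

    system_pairs: iterable of (snort_message_id, capec_field) pairs (hypothetical layout).
    gold_labels:  dict mapping such pairs to "correct", "acceptable" or "incorrect".
    Pairs absent from the gold-standard are skipped, i.e. not evaluated.
    """
    return Counter(gold_labels[p] for p in system_pairs if p in gold_labels)


def precision_recall(found, gold):
    """Strict and lenient precision and recall, following the definitions above.

    found: label counts over the evaluated system mappings.
    gold:  label counts over the whole gold-standard (assumed recall denominators).
    Raises ZeroDivisionError if no system mapping overlaps the gold-standard.
    """
    evaluated = found["correct"] + found["acceptable"] + found["incorrect"]
    lenient_hits = found["correct"] + found["acceptable"]
    return {
        "PS": found["correct"] / evaluated,
        "PL": lenient_hits / evaluated,
        "RS": found["correct"] / (gold["correct"] + gold["acceptable"] + gold["incorrect"]),
        "RL": lenient_hits / (gold["correct"] + gold["acceptable"]),
    }


# Toy data, invented for illustration only.
gold_labels = {
    (1, "field_a"): "correct",
    (1, "field_b"): "acceptable",
    (2, "field_a"): "incorrect",
    (3, "field_a"): "correct",
}
system_pairs = [(1, "field_a"), (1, "field_b"), (2, "field_a"), (9, "field_a")]

found = tally_overlap(system_pairs, gold_labels)  # (9, "field_a") is not in the gold-standard
gold = Counter(gold_labels.values())
scores = precision_recall(found, gold)
print(scores)
```

Passing the counts as `Counter` objects means that a label that never occurs simply counts as zero.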
Finally, we also calculated a series of F-Measures, which are weighted harmonic means of precision and recall. The F-Measure is defined as
$$F_\beta = \frac{(\beta^2 + 1) \times P \times R}{\beta^2 \times P + R}$$
If β = 1, then precision and recall have the same importance; if β < 1, precision is favored; if β > 1, recall is more important. In these experiments, we set the weight β to 0.5 (F0.5), 1 (F1) and 2 (F2) respectively, and computed two versions of each: lenient F-Measures and strict F-Measures. These F-Measures are defined as:
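Strict: $$F^S_{0.5} = \frac{(0.5^2 + 1) \times P_S \times R_S}{0.5^2 \times P_S + R_S} = \frac{1.25\,P_S R_S}{0.25\,P_S + R_S}$$
Lenient: $$F^L_{0.5} = \frac{(0.5^2 + 1) \times P_L \times R_L}{0.5^2 \times P_L + R_L} = \frac{1.25\,P_L R_L}{0.25\,P_L + R_L}$$
Strict: $$F^S_{1} = \frac{(1^2 + 1) \times P_S \times R_S}{1^2 \times P_S + R_S} = \frac{2\,P_S R_S}{P_S + R_S}$$
Lenient: $$F^L_{1} = \frac{(1^2 + 1) \times P_L \times R_L}{1^2 \times P_L + R_L} = \frac{2\,P_L R_L}{P_L + R_L}$$
Strict: $$F^S_{2} = \frac{(2^2 + 1) \times P_S \times R_S}{2^2 \times P_S + R_S} = \frac{5\,P_S R_S}{4\,P_S + R_S}$$
Lenient: $$F^L_{2} = \frac{(2^2 + 1) \times P_L \times R_L}{2^2 \times P_L + R_L} = \frac{5\,P_L R_L}{4\,P_L + R_L}$$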
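As a quick check of these definitions, the general F-Measure can be written as one small function and applied to the lenient precision and recall reported for the baseline in Table 7. This is only a sketch: the function name `f_beta` and the choice to return 0 when both inputs are 0 are assumptions made for the illustration.

```python
def f_beta(precision, recall, beta):
    """F-Measure as defined above: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0  # convention chosen here for the degenerate case P = R = 0
    return (b2 + 1) * precision * recall / denominator


# Lenient precision and recall of the baseline, as reported in Table 7.
p_l, r_l = 0.9796, 0.3522

for beta in (0.5, 1, 2):
    print(f"lenient F{beta} = {f_beta(p_l, r_l, beta):.2%}")
```

Because the inputs are already rounded to two decimals, the printed values agree with the lenient columns of Table 8 only up to rounding. With these inputs the score moves from near the precision toward the recall as β grows from 0.5 to 2.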
Table 6 shows the default parameters indicated in Section 3.1.2 that we used to evaluate the baseline system. SimMIN represents the minimum similarity threshold to match messages. With SimMIN = 0, a snort message and an attack field are considered similar as long as they are not completely orthogonal. Expansion indicates the use of the snort rule name description to extend snort messages (see Section 3.1.1).
As Table 7 shows, the number of acceptable mappings is quite high, as it accounts for 94% (5,178 / 5,496) of the total acceptable mappings, whereas only 1% of the correct mappings were found. The PL was 97.96% because of the contribution of acceptable mappings, while the RL was only 35.22%. Table 8 shows that the FL0.5 and FL1 were 72.23% and 51.81% respectively.
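The effect of a SimMIN threshold of 0 can be illustrated with a small sketch. It assumes that similarity is the cosine between feature-count vectors and that a pair matches when its similarity is strictly greater than the threshold; both are assumptions made for this illustration, suggested by the "not completely orthogonal" wording. The function names and example strings are invented.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse feature-count vectors (Counters)."""
    dot = sum(count * b[feature] for feature, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector shares nothing with any other vector
    return dot / (norm_a * norm_b)


def is_match(message_features, field_features, sim_min=0.0):
    """A pair matches when its similarity is strictly above the threshold (assumed)."""
    return cosine_similarity(message_features, field_features) > sim_min


# Invented examples: unigram counts of a message and of two attack fields.
message = Counter("sql injection attempt".split())
field_a = Counter("attacker performs sql injection".split())
field_b = Counter("physical theft of hardware".split())

print(is_match(message, field_a))  # shares "sql" and "injection" with the message
print(is_match(message, field_b))  # shares no feature with the message
```

Under these assumptions, a single shared feature is enough for a match when the threshold is 0, which fits a system that maps almost every message while giving no guarantee about mapping quality.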
System     SimMIN   DF   TV     Expansion   Nb of Features
Baseline   0        40   0.98   Yes         140
Table 6: Description of Input Parameters in Baseline System
System     Number of Mappings                   Lenient            Strict
           Correct   Acceptable   Incorrect     PL       RL        PS      RS
Baseline   108       5,178        6             97.96%   35.22%    0.11%   0.07%
Table 7: Precision and Recall of the Baseline System
As we can see, although the mapping rate is 98.94%, the mapping quality is low because only 1% of the correct mappings were found. In the next three chapters, we will describe several approaches to address this problem.
System     Lenient                      Strict
           FL0.5    FL1      FL2        FS0.5   FS1     FS2
Baseline   72.23%   51.81%   40.40%     0.09%   0.08%   0.07%
Table 8: F-Measure of the Baseline System
In this chapter, we have described the workflow of the baseline system and the attempt of [Scarabeo et al., 2015] to improve it through snort rule expansion. In addition, we explained how the mapping rate was initially evaluated (see Section 3.1.2) and why that measure did not reflect the quality of the mappings. We then described our work to evaluate the quality of the baseline's output by creating a gold-standard and using the standard metrics of precision, recall and F-measure. In order to enhance the performance of the baseline system, the next chapters investigate three approaches:
1. Feature Selection and Snort Messages Supplement.
2. Pre-clustering Snort Messages.
3. Semantic Mapping by Latent Semantic Analysis.
In the next chapter, we will provide a detailed description of the snort messages supplement methodology as well as an analysis of the evaluation of the outputs.
Chapter 4
Feature Selection and Snort Messages Supplement
Table 7 in Chapter 3 showed that the lenient recall of the baseline system was only 35.22%. In order to improve the system performance, we experimented with three approaches:
1. Feature Selection and Snort Messages Supplement.
2. Pre-clustering Snort Messages.
3. Semantic Mapping by Latent Semantic Analysis.
In this chapter, we describe the first approach: n-gram feature selection, used to analyze the feature distribution, and snort messages supplement. Section 4.1 describes our experiments with a variety of feature sets and their effect on the evaluation of the system. After analyzing the feature distribution, we noticed that many snort messages suffered from a sparse representation. Indeed, although the snort rule descriptions extend the length of the original snort messages (see Section 3.1.1), most of these messages are still quite short (below 15 words). To address this issue, we investigated the use of entities in the Common Vulnerabilities and Exposures (CVE) (see Section 2.1.1) to further supplement snort messages (see Section 4.2). The effect of this strategy is analyzed in Section 4.2.3.
4.1 Feature Selection
In the baseline system (see Chapter 3), snort messages and CAPEC fields are represented by a mixture of unigrams, bigrams and trigrams. However, the contribution of each type of n-gram was not clear. To measure the usefulness of each type of n-gram, three experiments were performed: the use of unigrams only, bigrams only and trigrams only. Sections 4.1.1, 4.1.2 and 4.1.3 describe these experiments, while Section 4.1.4 provides an overall evaluation.
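To make the three feature sets concrete, the sketch below extracts unigrams, bigrams and trigrams from a short message. It is an illustration only: lowercasing and whitespace tokenization are assumptions, the function names `ngrams` and `features` are hypothetical, and the example message is invented.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, sizes=(1, 2, 3)):
    """Extract n-gram features of the requested sizes from a text.

    The default mixes unigrams, bigrams and trigrams; passing a single size,
    e.g. sizes=(2,), restricts the representation to one type of n-gram.
    """
    tokens = text.lower().split()  # assumed tokenization: lowercase, split on whitespace
    return [gram for n in sizes for gram in ngrams(tokens, n)]


message = "Web attack attempt"  # invented example message

mixed = features(message)                     # unigrams + bigrams + trigrams
unigrams_only = features(message, sizes=(1,))
bigrams_only = features(message, sizes=(2,))
trigrams_only = features(message, sizes=(3,))
print(bigrams_only)
```

A message of k tokens yields only k - n + 1 n-grams of size n, so short messages produce very few bigrams and trigrams. This is one way to see why short messages end up with a sparse representation.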