1. PLANTEAMIENTO DEL PROBLEMA
6.3 FASE 3 FORMULACIÓN DEL PLAN ESTRATÉGICO PARA LA
This section presents the core experimental design employed in all of the experiments undertaken in this thesis. It also sets out the decisions relating to data selection, the preparation requirements, and the parameter settings of the ML algorithms.
3.3.1
Research design
The study performed in this thesis employed a factorial research design and experimental evaluation to assess the performance (i.e. the Geometric Mean of Accuracy) of the models of the three ML algorithms under various setups. The choice of design was driven by the desire to investigate every possible combination, at every level, of all of the factors, on the overall performance of the developed models. In addition, the design was chosen so as to study the relationship between various factors to identify which ML algorithm was the most adaptable to a change in pattern (concept) in the network traffic over time. Table 3.1 shows the different factors and their levels for every experiment undertaken in this thesis.
3.3.2
Selection of datasets
A number of datasets were used over the course of this research, including synthetic data and simulated domain specific (network flow) data.
74
Experiments Factors Levels
Exp.1 ML Algorithm 3 [C5.0, RF, SVM]
Thresholds 2 [fixed, adaptive]
Exp.2 ML Algorithm 3 [C5.0, RF, SVM]
Data Balance 2 [balanced, imbalanced]
Feature sets 5 [Full, MDA, MDG, MDABal., MDGBal.]
Thresholds 2 [fixed, adaptive]
Exp.3 ML Algorithm 3 [C5.0, RF, SVM]
Data Balance 2 [balanced, imbalanced]
Feature sets 5 [Full, MDA, MDG, MDABal., MDGBal.]
Sample size
11 [10%, 5%, 1%, 0.5%, 0.1%, 0.05%, 0.01%, 0.005%, 0.001%, 0.0005%, 0.0001%] Bins (sampling strategy) 5 [B1, B10, B20, B50, B100]
Errors 4 [0%, 1%, 5%, 10%,]
Table 3.1: Factors and levels of every experiment.
3.3.2.1Existing Datasets
As the first experiment aimed to investigate the potential of prediction threshold adaptation in addressing the issue of concept drift between training and evaluation datasets, synthetic and domain specific (network flow) datasets were used. The synthetic datasets, SEA [317] and AGR [318], were chosen as they provided the control required over concept drift (between data files) to analyse the effect of threshold adaptation on model predictions when the evaluation data were either of the same or of a different concept to the training data used to build the prediction model. This analysis was extended to the domain specific dataset (gureKDD [160- 162]) to investigate the effectiveness of the adaptation approach in a near to real life scenario. The gureKDD was selected over the KDD Cup 1999 and NSL-KDD datasets for two main reasons: firstly, the gureKDD dataset maintains the chronological order of every connection, which makes it possible to split connections based on their time; secondly, the KDD Cup 1999 dataset has a large number of limitations (listed in Al Tobi and Duncan [1]) including miscalculated values, which raised doubts over its suitability.
As the decision to use these synthetic datasets was based on the need to generate different levels of concept differences, the data files generated from each of these datasets had different combinations of similar and differing concepts. Similarly, gureKDD was divided into multiple
75 files using a time window of one week which resulted in different network patterns (concepts) in the traffic for each week. A detailed discussion on data generation, preparation and descriptions is provided in Chapter 4.
3.3.2.2Newly Generated Dataset
As set out in Chapter 2, there are many limitations in the publicly available datasets. In addition, as the aim in this thesis is to extend the analysis to data that were more up to date and more representative of real network traffic, a new dataset was generated by transforming the flows of the ISCX2012 dataset [14] into a suitable format for ML algorithms.
The resultant dataset, called STA2018, profiled every connection using 550 features and can be used as balanced or imbalanced (original) data. The decision to transform the ISCX2012 dataset was governed by its availability at the time. It was also the best option out of all of the available datasets. For instance, the ISCX2012 dataset contains modern traffic with full captures. In addition, unlike many other datasets, it provided a reasonable number of simulation days (seven). Although the DARPA datasets have more simulation days than ISCX2012, the traffic included in those datasets is very old, does not represent real network traffic and is very limited. For example, one day of the ISCX2012 captures equals the whole of DARPA in size.
Data records from the resultant dataset (STA2018) were grouped by day so that every data file aggregated all of the connections within that simulation day. Overall, the transformation process had five main stages: basic-features extraction; validation and labelling; extending the basic-features; balancing; and cleaning-up. Full details of the transformation are discussed in
Chapter 5.
The large number of features space combined with having a balanced version of the data made it possible to analyse the effect of various decisions such as feature selection and data balancing, which are usually taken at the model development stage. Many of these decisions were based on the overall performance of the prediction models that were developed and tested using the
76
available training data. This was especially the case when new data with a different concept (and hence, with different important features) were evaluated with the models generated. Therefore, using this dataset made the analysis of the effect of predictions threshold adaptation on these models more systematic.
3.3.3
Parameter setting for the ML algorithms
As stated by Bhuyan et al. [12], classification-based methods usually produce better results than unsupervised learning, such as clustering methods, due to the use of data labels at the training stage. However, as shown by Laskov et al. [319], the accuracy of the supervised learning algorithms could deteriorate when their models are faced with novel attacks. Therefore, three well-known classification algorithms in the domain of ID were investigated, namely C5.0, Random Forest (RF) and Support Vector Machine (SVM) [see Section 3.1]. These algorithms were used to process the training data in order to learn a classification (prediction) model, which was then tested on the evaluation (test) data with different statistical properties (concept). The performance of these models were analysed before and after threshold adaptation to examine their ability to adapt to changing traffic patterns and to detect novel attacks. The following subsections provide further details about the packages used and the parameter settings for each of the algorithms used.
3.3.3.1C5.0 algorithm
The “c50” package {version 0.1.0-24} [320] in the R environment [321] was used in this thesis. All experiments used the default settings of this algorithm, with the 10-trials option (trials = 10) set to return the results of the classification as a probability score (type = "prob") when the model was used to predict the evaluation (test) data.
3.3.3.2Random Forest
The “ranger” package {version 0.8.0} [297, 322] in the R environment [321] was used over the course of this research. This package was selected because of its fast implementation of RF in
77 C++. All experiments used the default settings of 500 trees (nTree) to grow, and with the number of features to evaluate at every node being the square root of the total number of features in the dataset (mtry = ⌊√𝑝⌋), where p is the number of features. The algorithm was instructed to return results in the form of classification probabilities (probability = TRUE).
3.3.3.3Support Vector Machine (SVM)
The open source SVM package (LiblineaR) {version 2.10-8} [323, 324] in the R software [321] was used in these experiments. This package executes an optimized linear version of SVM. All experiments used the default settings of L2-regularized logistic regression linear model type (type = 0) with the cost set to one (cost = 1).
The choice to use the linear version of SVM was driven by the very large differences in the runtime of experiments between its linear and nonlinear kernel versions. Some preliminary experimentations have been conducted to compare the two versions. Table 3.2 presents the runtimes (in seconds) of the kernel SVM (with type = 2 [radial basis function], cost = 2 and
gama = 2) and the linear SVM (with cost = 2). These experiments were performed on 10% subsets of Day 2 (12/Jun) and Day 3 (13/Jun) of the STA2018 dataset. This table shows the large difference between the two versions where the linear version of SVM is much faster. It also shows that the runtime of the kernel SVM grows exponentially as the number of instances increase, especially when the data is balanced.
As SVM can only handle numerical data, it was necessary to pre-process the data before the training or testing phase took place. Therefore, all categorical (nominal) features were converted into dummy attributes, so that every value for each of the levels of these categorical features was converted into an independent feature that contained a ‘zero’ or a ‘one’ [325]. Another pre-processing stage, as recommended by the SVM library package [323, 326], was to standardise the data. Therefore, every feature in the training data was standardised (as per
78 Day 2 (12/Jun) 10% Day 3 (13/Jun) 10% Kernel Cost=2 , Gama=2 Linear Cost=2 Kernel Cost=2 , Gama=2 Linear Cost=2 Time Sec. (imbalanced) Time Sec. (balanced) Time Sec. (imbalanced) Time Sec. (balanced) Time Sec. (imbalanced) Time Sec. (balanced) Time Sec. (imbalanced) Time Sec. (balanced) Fold 1 3,330 16,602 4 4 4,186 23,693 4 6 Fold 2 3,505 12,101 2 4 3,463 21,595 4 4 Fold 3 3,385 12,309 2 6 4,054 20,559 2 6 Total (seconds) 10,220 41,012 8 14 11,703 65,847 10 16
Table 3.2: Runtime in seconds between kernel and linear SVM setups on a 10% subset of Day 2 (12/Jun) and Day 3 (13/Jun) of the STA2018 dataset
Eq.(3.7)). The standardisation parameters (the mean and standard deviation) of the training data, which is utilized to build the classification model, were used to standardise the features of the test data before being classified by the model:
𝑥̇ =𝑥 − 𝜇
𝜎
Eq.(3.7)
where, 𝑥̇ is the new standardised value, x is the actual value to be standardised, 𝜇 is the mean and 𝜎 is the standard deviation of a feature column.
In addition, all class labels were mapped for every dataset as shown in Table 3.3.
Dataset Class label SVM label
SEA and AGR groupA -1
groupB +1
gureKDD and STA2018 Attack -1
Normal +1
Table 3.3: Class labels mapping for SVM algorithm.
3.3.4
Evaluation measures
As discussed in Chapter 2, many of the common evaluation measures suffer in model performance assessment when imbalanced data are used. It has also been shown that network traffic exhibits an imbalance of traffic classes i.e. more normal traffic exists than anomalous traffic [98]. In an attempt to avoid this problem, the Geometric Mean (G-Mean) of Accuracy
79 as the main measure to evaluate the performance of all of the models developed in all experiments in this thesis.
3.3.5
Performance assessment techniques
Cross-Validation is a technique used to estimate a model’s performance and to assess how its results could be generalised for unseen datasets [327-329]. This technique addresses the issue of data shortage when the holdout technique (for training and testing sets) loses important modelling or testing capability [330]. It can also be used to assess the expected performance of a prediction model in a real environment. In addition, it is a very useful technique to avoid
overfitting problems, the situation that arises when the model perfectly predicts the data used to generate the classification model, but fails to generalise to new data. It is also a useful technique to identify and fine tune the main parameters of the model [331]. As discussed in
Section 2.3.1, there are many types of Cross-Validation techniques, ranging from ‘leave-one-out’ to ‘holdout’ methods, but the most popular one is the K-folds Cross-Validation. In K-folds Cross-Validation (Figure 3.6), the dataset is randomly divided into K parts. A model is then trained using K-1 parts and tested on the remaining part. This process is repeated K times, so that each one of the K parts is only used once as test data. The model’s overall performance is estimated by aggregating the performance of the K models (through averaging or a majority vote). Although this evaluation method is frequently cited in the literature, it requires a long time to process as larger values of K are used. Also, as discussed in Chapter 2, it could provide overly optimistic results due to the random division of datasets.
Despite these drawbacks, the K-folds Cross-Validation technique was used in all experiments at every model building (training) stage to estimate the prediction thresholds for every developed model as per the recommendation of Ambroise and McLachlan [205].
80
Figure 3.6: K-folds Cross-Validation process.
Prospective sampling [225], as discussed in Section 2.3.1, is a technique underused in batch- learning tasks in the domain of network ID. However, this was the sampling approach used in this thesis to evaluate the performance of the generated models, by assessing them using evaluation (test) data collected at a different time from the training data. This evaluation method aimed to mirror real life, given that models are usually trained on data that have been collected in the past to predict future data.
3.3.6
Statistical evaluation
Normality tests were used to assess whether or not the data was following a normal distribution. Such tests are useful in selecting the right statistical analysis tests, such as parametric or non-parametric tests. As a result, in order to assess the normality of the results prior to analysis, the Shapiro-Wilk normality test [332] was used. However, this test is limited in its ability to handle more than 5,000 records. Therefore, in such cases (as in Experiment 3 and discussed in
81
Friedman’s test [335, 336] is a non-parametric test used to assess the difference between treatments or effects during multiple test trials, without assuming a specific distribution in the data. This test involves ranking the observations in each row, then computing the mean of the ranked observations across treatments/effects (for each column).
Parametric tests, such as ANOVA, require asserting some assumptions before they can be applied. One of these assumptions is that the data should be normally distributed. As this core assumption could not be made for these experiments, a non-parametric approach was applied to analyse the results. Using an example, the following figure (Figure 3.7) illustrates how the Friedman test works.
Figure 3.7: Friedman’s Test computation and interpretation
If the Friedman test indicates that there is a significant difference between treatments/effects, post-hoc tests [337] can then determine which of these treatments/effects are significantly different by comparing the differences between their mean ranks. In these experiments, the
82
treatments and to determine which pairs’ differences were statistically significant. To visualise these differences Demšar [17] proposed the Critical Difference Plot; in this plot, any two effects joined by a line indicates that they are not statistically different (see Figure 3.7).