tection. They state that although machine learning has been applied successfully in domains such as spam detection and product recommendation, the vast majority of IDSs found in operational deployment are misuse detectors. Furthermore, the anomaly detectors found in operational deployment mostly employ statistical techniques. The authors identify the following challenges in using machine learning for anomaly detection. We also present a set of guidelines for strengthening future research. We emphasise that the authors do not consider machine learning to be an inappropriate tool for anomaly detection. On the contrary, they believe its use is reasonable and possible, but that great care is required.
2.2.5.1 Challenges
High Cost of Errors
Unlike in other domains, the cost of errors in anomaly detection is very high. On the one hand, a false positive is expensive because an analyst must spend time thoroughly examining the reported incident. On the other hand, a false negative can cause serious damage to an organisation’s IT infrastructure. A related challenge is that there are “too few attacks”. Anderson (2008) gives the following example: “If there are ten real attacks per million sessions - which is certainly an overestimation - then even if the system has a false alarm rate as low as 0.1%, the ratio of false to real alarms will be 100”. By contrast, in a product recommendation system a relevant recommendation can potentially increase sales, whereas an irrelevant one merely leads the customer to continue shopping. According to Sommer & Paxson (2010), the high cost of errors is the primary reason for the lack of machine learning-based anomaly detectors in operational deployment. The high rate of false alarms occurs for the following two reasons.
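The arithmetic behind Anderson's example can be checked with a short sketch. The function name and the `detection_rate` parameter are illustrative assumptions; the example in the text implicitly assumes that every real attack raises an alarm.

```python
def false_to_real_alarm_ratio(sessions, real_attacks, false_alarm_rate,
                              detection_rate=1.0):
    """Return (false alarms, real alarms, ratio) for a batch of sessions.

    detection_rate is an assumption added for illustration; 1.0 means
    every real attack is detected, as the quoted example implies.
    """
    benign_sessions = sessions - real_attacks
    false_alarms = benign_sessions * false_alarm_rate
    real_alarms = real_attacks * detection_rate
    return false_alarms, real_alarms, false_alarms / real_alarms


# Ten real attacks per million sessions, 0.1% false alarm rate.
false_alarms, real_alarms, ratio = false_to_real_alarm_ratio(
    sessions=1_000_000, real_attacks=10, false_alarm_rate=0.001)
print(f"{false_alarms:.0f} false alarms, {real_alarms:.0f} real alarms, "
      f"ratio about {ratio:.0f} to 1")
```

With these figures an analyst would have to examine roughly a thousand false alarms in order to find ten real ones, which is why even a seemingly low false alarm rate is costly.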
Firstly, Sommer & Paxson (2010) claim that machine learning is better at identifying similarities than anomalies. A fundamental rule of machine learning is that training requires a large and representative set of instances of both the positive (normal specimens) and negative (anomalous specimens) classes. Consider, for example, spam detection: a machine learning algorithm is provided with large amounts of spam and ham, and after the training period it is able to reliably identify unsolicited email.
The anomaly detection domain, however, does not fit this pattern. Anomaly detection is used to reveal novel attacks whose symptoms deviate from normal behaviour. Therefore, one cannot train a machine learning algorithm with anomalous specimens. Furthermore, there is no perfect model of normality. It is very difficult to define a “normal” region which contains every possible normal activity. Moreover, normal activity evolves, and a current notion of normality may not be representative in the near future (Chandola et al. 2009). Sommer & Paxson (2010) state that if the machine learning algorithm is trained using specimens of known attacks and specimens of normal activity, then machine learning is better suited to detecting mutated intrusions (variations of existing ones) than novel ones.
Section 2.2 Network Attacks and Defence 39
characteristics such as bandwidth, duration of connections and application mix can exhibit great variability, resulting in unpredictable behaviour. This diversity and variability occur regularly and are often falsely considered anomalous by an IDS. For this reason it is difficult to define what is actually “normal”. Furthermore, the Internet is an environment full of noisy data, such as packets created by software bugs and out-of-date or corrupt DNS data (Anderson 2008). Similarly, such data can be falsely considered anomalous by an IDS.
Difficulties with Evaluation
Two main difficulties arise in the evaluation of anomaly detectors (Sommer & Paxson 2010). The first is the difficulty of obtaining training data. This stems from privacy concerns arising from the sensitivity of such data (e.g. confidential communications and business secrets). Also, no standardised datasets are available. According to Sommer & Paxson (2010), the two publicly available datasets, namely the DARPA/Lincoln Labs and KDD Cup datasets, should not be used for any current study since they are now more than a decade old. Again, this contrasts with other domains such as spam detection, where large datasets that are free of privacy concerns do exist.
The second problem concerns adversarial drift, that is, attackers modifying their behaviour to evade detection. This is the classic arms race between attackers and defenders. The authors do admit, however, that exploiting a machine learning technique requires considerable effort, time and expertise on the attacker’s part. In addition, since most attackers target weak and vulnerable systems rather than handpicking victims, the probability of a sophisticated attack exploiting a machine learning technique is low (Sommer & Paxson 2010). Having said that, sophisticated attacks do occur, as people are sometimes willing to spend very large amounts of money to achieve their goals (as in the case of the Stuxnet worm (Zetter 2014)). Therefore, if adversarial drift is ignored or de-prioritised, the risk associated with a sophisticated attack exploiting a machine learning technique is very high.
Semantic Gap
Anomaly detectors typically generate an output label, that is, an instance is labelled as either normal or anomalous. Some detectors improve on this by generating an anomaly score, indicating the degree to which an instance is considered anomalous (Chandola et al. 2009). Sommer & Paxson (2010) state that this is not enough. Consider, for example, the output “HTTP traffic of host did not match the normal profile”. Even if we assume that the system correctly detected a web server exploit, a network operator must still spend considerable effort to figure out what happened. In other words, anomaly detectors should ideally translate their results into actionable reports. Sommer & Paxson (2010) term this the semantic gap.
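The difference between a label, a score and a slightly more informative report can be illustrated with a minimal sketch. It assumes a simple per-feature z-score against a profile of past normal traffic; the feature names, the threshold of 3 and the sample values are all hypothetical, and this is not a technique proposed by the cited authors.

```python
from statistics import mean, stdev


def fit_profile(history):
    """Per-feature (mean, standard deviation) of past 'normal' observations."""
    profile = {}
    for feature in history[0]:
        values = [obs[feature] for obs in history]
        profile[feature] = (mean(values), stdev(values))
    return profile


def report(host, profile, observation, threshold=3.0):
    """Return a label, an anomaly score and the most deviant feature."""
    z_scores = {
        feature: abs(observation[feature] - mu) / sigma
        for feature, (mu, sigma) in profile.items()
    }
    top_feature = max(z_scores, key=z_scores.get)
    score = z_scores[top_feature]
    label = "anomalous" if score > threshold else "normal"
    return {"host": host, "label": label,
            "score": round(score, 2), "top_feature": top_feature}


# Hypothetical HTTP features observed for one host during normal operation.
history = [
    {"requests_per_min": 50, "mean_uri_length": 40},
    {"requests_per_min": 52, "mean_uri_length": 42},
    {"requests_per_min": 48, "mean_uri_length": 38},
    {"requests_per_min": 51, "mean_uri_length": 41},
    {"requests_per_min": 49, "mean_uri_length": 39},
]
profile = fit_profile(history)
print(report("web-01", profile,
             {"requests_per_min": 51, "mean_uri_length": 400}))
```

Even with the most deviant feature named, the operator still has to work out what unusually long URIs mean for this server, which is the gap the authors describe.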
Also, anyone who uses a machine learning-based IDS is at serious risk of discriminating. For example, how would someone defend themselves in court if they cannot explain the underlying rules of a neural network? Anderson (2008) calls this redlining.
2.2.5.2 Guidelines
Cost Reduction
A way to reduce the false alarm rate, and therefore the cost, is to keep the scope narrow, that is, to limit the types of intrusions an anomaly detector is trained to identify (Sommer & Paxson 2010). Kantchelian et al. (2013) propose the use of an ensemble of classifiers, one for each family of malicious behaviour. Each individual classifier generates a decision, and the network operator combines these decisions into a final one, which is likely to be more accurate since each classifier specialises in detecting a specific family. The classifiers can all use the same machine learning technique, or a different technique can be used for each family. Google uses an ensemble of classifiers for detecting malicious advertisements (Sculley et al. 2011). The drawback of this approach is that, although there has been some work on automatically classifying instances into families, deployed systems do so manually.
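The structure of such a per-family ensemble can be sketched as follows. This is only an illustration of the idea: the family names, features and thresholds are hypothetical, simple rules stand in for trained classifiers, and combining the decisions with a logical OR is an assumption rather than the combination scheme of Kantchelian et al. (2013).

```python
def port_scan_detector(flow):
    """Stand-in for a classifier specialised in the port-scan family."""
    return flow["distinct_ports"] > 100


def brute_force_detector(flow):
    """Stand-in for a classifier specialised in the brute-force family."""
    return flow["failed_logins"] > 20


# One detector per family of malicious behaviour (hypothetical families).
FAMILY_DETECTORS = {
    "port_scan": port_scan_detector,
    "brute_force": brute_force_detector,
}


def classify(flow, detectors=FAMILY_DETECTORS):
    """Collect each family's decision and combine them into a final one."""
    decisions = {family: detect(flow) for family, detect in detectors.items()}
    flagged = [family for family, hit in decisions.items() if hit]
    return {"malicious": bool(flagged), "families": flagged,
            "decisions": decisions}


print(classify({"distinct_ports": 500, "failed_logins": 0}))
print(classify({"distinct_ports": 3, "failed_logins": 1}))
```

Because the result records which family raised the alarm, the operator receives a more specific starting point than a single undifferentiated verdict.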
Another way to reduce the cost is to use aggregated features such as “number of connections per source” or to average features over a period of time such as “volume per hour” in order to deal with the diversity, variability and noise of network traffic. One type of machine learning-based anomaly detection system that is indeed found in operational deployment is the one that takes into consideration highly aggregated information (Sommer & Paxson 2010). Finally, a well-designed machine learning algorithm could potentially reduce the false alarm rate. For example, a careful inspection of the feature space will most likely reveal that some features are irrelevant, and some are more relevant than others.
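The two aggregated features mentioned above, connections per source and volume per hour, can be computed with a short sketch. The record layout (a timestamp in seconds, a source address and a byte count) and the sample values are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def connections_per_source(records):
    """Count how many connections each source address opened."""
    return Counter(record["src"] for record in records)


def volume_per_hour(records):
    """Sum transferred bytes into one-hour buckets (bucket = ts // 3600)."""
    volume = defaultdict(int)
    for record in records:
        volume[record["ts"] // 3600] += record["bytes"]
    return dict(volume)


# Hypothetical connection records: ts is seconds since the capture started.
records = [
    {"ts": 10, "src": "10.0.0.1", "bytes": 500},
    {"ts": 1800, "src": "10.0.0.1", "bytes": 1500},
    {"ts": 3700, "src": "10.0.0.2", "bytes": 200},
]
print(connections_per_source(records))
print(volume_per_hour(records))
```

Aggregating in this way smooths out much of the per-connection variability, at the price of hiding any anomaly that is only visible in an individual connection.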
Evaluation
As already discussed, the two publicly available datasets should not be used by current studies since they are now more than a decade old. Researchers have therefore turned to alternative approaches (Sommer & Paxson 2010). One alternative is to use simulation. The advantage is that it is free of privacy concerns. However, it is very difficult to simulate Internet traffic realistically. Another alternative is to anonymise or remove sensitive information from captured data. However, anomaly detectors often rely on precisely the information that is removed during the anonymisation process. Besides, there is always the fear that sensitive information can still be leaked.
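The trade-off in anonymising captured data can be made concrete with a minimal sketch. It assumes one common approach, replacing addresses with a keyed hash and dropping the payload; the record layout and the key are hypothetical, and this is not a method described by the cited authors.

```python
import hashlib
import hmac

# Hypothetical secret, kept private by the organisation that owns the data.
SECRET_KEY = b"replace-with-a-secret-key"


def pseudonymise_ip(ip, key=SECRET_KEY):
    """Map an address to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def anonymise(record, key=SECRET_KEY):
    """Pseudonymise both addresses and keep only coarse metadata.

    The payload is deliberately not copied, so a detector that inspects
    payload content can no longer be evaluated on the resulting data.
    """
    return {
        "src": pseudonymise_ip(record["src"], key),
        "dst": pseudonymise_ip(record["dst"], key),
        "bytes": record["bytes"],
    }


record = {"src": "192.0.2.10", "dst": "198.51.100.7",
          "bytes": 420, "payload": b"GET /index.html HTTP/1.1"}
print(anonymise(record))
```

The same address always maps to the same pseudonym, so per-host behaviour remains analysable, but consistent pseudonyms are also one of the routes by which sensitive information may still leak.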
Another option is to capture data from a small-scale lab. However, the traffic obtained is different from the aggregated traffic seen upstream (where intrusion detectors are typically deployed). Sommer & Paxson (2010) state that the “gold standard” is to obtain data from large-scale environments. Even this choice suffers from some limitations. Specifically, data obtained from large-scale environments lack information that had been filtered out or unintentionally lost. Furthermore, such a dataset will contain a lot of noisy information. The important point here is to realise that no evaluation method is perfect, and researchers always need to acknowledge the
shortcomings of the approach they use.
Regarding adversarial drift, the use of an ensemble of classifiers, as discussed earlier, can help to tackle this issue (Kantchelian et al. 2013). Adversarial drift could theoretically be arbitrarily radical; in practice, however, it is limited by the adversary’s resources. Typically, during a campaign an adversary recycles techniques from previous ones, and therefore a campaign evolves slowly over time. For example, two different spam emails may attempt to sell the same product under a different (misspelled) name. Owing to their evolutionary nature, campaigns can be grouped into families in which adversarial drift can be captured and organised into distinct trends. It is often the case that attacks within the same family can be detected with a similar strategy.
Mind the Gap
The general advice here is to gain insight into the anomaly detector’s capabilities. Researchers need to determine not only why false positives are generated (or why alarms are not generated, in the case of false negatives), but also why the detector produces correct results. Consider the following case from the 1980s (Sommer & Paxson 2010). A Pentagon project involved the use of a neural network to detect pictures of tanks. The classifier was indeed correctly identifying the pictures of tanks. However, it turned out that it was recognising the colour of the sky, as all the pictures of tanks had been taken on a cloudy day.
Furthermore, the use of an ensemble of classifiers, as discussed earlier, provides isolation, which closes the semantic gap (Kantchelian et al. 2013). Based on the assumption that malware campaigns are most cost-effective when they use a single family, multiple classifiers can isolate malicious campaigns from one another and give the human expert better understanding and accuracy. A multi-family attack is of course possible, but it requires the adversary to spend more resources.
Finally, as discussed earlier, redlining should be taken seriously in commercial systems, as such cases can contravene European data protection law (Anderson 2008).