We conclude that packet payload information can serve as signatures for classifying malicious streams in network data, and that similarity measurements on packet payload data can be performed with confidence. Experimentally, we found the system to be robust: it tolerates significant changes in the signatures without misclassifying the stream. For this type of data, NCD and the Spamsum metric were the most appropriate similarity metrics. Since the two are highly correlated and each has some advantages over the other, they can be used together in a hybrid approach to similarity measurement.
Clustering and Classification for Intrusion Detection
“There are known knowns; there are things we know that we know. There are known unknowns; that is to say, there are things that we now know we don’t know. But there are also unknown unknowns; there are things we do not know we don’t know.”
–Donald Rumsfeld, US Secretary of Defence
The previous chapter demonstrated how edit distance and information theoretic similarity metrics can help determine the similarity between network profiles for intrusion detection. Several similarity metrics were reviewed, a few were chosen from the literature, and these were used to determine the similarity between various packet or stream profiles. Preliminary tests were conducted to identify the most suitable metrics for our domain. Different combinations of measurements were discussed, and metrics of choice were proposed based on theoretical analysis and some initial experiments.
This chapter builds on that work and investigates the use of machine learning techniques for the classification of malicious network streams. It discusses clustering and classification techniques applied to data obtained from similarity measurements of malicious network streams, and reviews some machine learning approaches for intrusion detection. Experiments are designed to find the best clustering algorithm and cut-off threshold for our domain given the similarity metrics. The results are analysed using ROC-AUC based measures and Accuracy. Finally, the results of all the clustering algorithms used in the evaluation are compared and the best ones highlighted.
5.1 Problem Description
In order to establish the best-performing metric for our domain, the metrics are compared and evaluated both theoretically and empirically, using statistical machine learning techniques. These techniques help assign the measurements obtained from the similarity metrics to their respective groups for classification. To achieve this classification goal, it is necessary to find the best combination of similarity metrics and clustering algorithms. The following questions are addressed in this chapter:
1. How to group similar items together?
2. How to determine the optimal threshold to partition data into groups?
3. How to find the best clustering algorithm(s) for our domain?
4. How to evaluate the classifier results for correctness?
5. How to identify the most suitable similarity metric for our domain using the above evaluations?
5.2 Background
5.2.1 Machine Learning
The discipline of machine learning provides a set of algorithms for solving problems that require learning, adapting or evolving a computer program based on empirical data. Examples or observations from the training data can be used to extract unique features or characteristics from which models are created. These models can then be used to automatically learn or recognize complex patterns that emerge in the data, or to make intelligent decisions based on them. The performance of the learner depends strongly on how well the training data covers the relevant features. Since not all features can be made available to the system during training, the learner should generalize from known features in order to identify novel features in the test set. In other words, the learner should be able to generalize from its experience.
5.2.2 Machine Learning for Intrusion Detection
Cyber criminals target the data of individuals and organizations for their own gain. Security tools are deployed in cyber infrastructures to defend against these criminals and their ill intent. Information security tools such as intrusion detection and prevention systems, anti-virus systems and firewalls produce large amounts of data in the form of logs. With all these security appliances generating so much data, the industry is finding it hard to cope and to make appropriate, intelligent decisions. Data mining, statistics and machine learning capabilities are required to address this challenge of cyber security, i.e. to process vast amounts of data. We intend to use machine learning techniques to extract knowledge from data for better intrusion detection and prevention. These techniques give us the capability to extract strong patterns or derived rules from the data, to group entities, and to use that knowledge to predict the classes of new data. This serves as an iterative process of extracting knowledge from data. Applying this concept to intrusion detection can help detect intrusive behaviour in networks and systems better than earlier research has done.
Clustering and Classification
Machine learning algorithms are used in our domain for the clustering and classification of malicious network streams. Instances of malicious and benign network traffic profiles are used as templates, or examples, to classify other similar instances, thus “learning from examples” (Aha et al., 1991). This is an example of supervised learning. An instance space consists of a set of stored instances. Given an instance from the instance space, an instance-based learning algorithm is used to map instances to categories. Instance-based learning algorithms rely on similarity and classification functions for their operation. We hypothesize that instance-based machine learning algorithms with random instance selection, utilizing input from similarity metrics and clustering algorithms, can provide good classification.
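The mapping from instances to categories can be illustrated with a minimal sketch. It assumes NCD computed with zlib as the similarity function and a hypothetical per-instance distance threshold; the payloads, threshold values and the names `Instance`, `ncd` and `classify` are illustrative assumptions, not the implementation evaluated in this work.

```python
import zlib
from typing import List, NamedTuple


class Instance(NamedTuple):
    payload: bytes    # stored example payload
    label: str        # category of the example, e.g. "malicious"
    threshold: float  # largest NCD still accepted as a match (illustrative)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, assuming zlib as the compressor."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(payload: bytes, instances: List[Instance],
             default: str = "unknown") -> str:
    """Return the label of the closest stored instance within its threshold."""
    best_label, best_distance = default, float("inf")
    for inst in instances:
        d = ncd(payload, inst.payload)
        if d <= inst.threshold and d < best_distance:
            best_label, best_distance = inst.label, d
    return best_label


# Hypothetical instance space with one malicious and one benign example.
INSTANCES = [
    Instance(b"GET /index.php?id=1' UNION SELECT username, password "
             b"FROM users-- HTTP/1.1\r\nHost: victim.example\r\n\r\n",
             "malicious", 0.5),
    Instance(b"GET /images/logo.png HTTP/1.1\r\nHost: www.example.org\r\n"
             b"Accept: image/png\r\n\r\n",
             "benign", 0.5),
]

# A slightly modified version of the malicious payload.
variant = (b"GET /index.php?id=7' UNION SELECT username, password "
           b"FROM admins-- HTTP/1.1\r\nHost: victim.example\r\n\r\n")
label = classify(variant, INSTANCES)
```

In this sketch a stream that matches no stored instance within its threshold falls back to a default label. How such streams should be handled is a design choice that the sketch does not settle.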
Classification is the process of identifying which of a set of categories an observation belongs to. A similarity score is established between incoming packets or streams and the instances in the instance space. Classification is carried out using a model comprising instances and their corresponding thresholds to categorize the input streams. A confusion matrix is generated as a result of this classification, populating table entries for the predicted class and the actual class, to evaluate the performance of the classifier. From this confusion matrix we calculate Accuracy and Area Under the Curve (AUC), and finally generate a Receiver Operating Characteristics (ROC) curve to visualize the performance of the binary classifier.
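The evaluation measures named above can be sketched as follows. This is a minimal illustration under stated assumptions: labels are binary with 1 denoting malicious, a higher score means "more likely malicious", the ROC curve is obtained by sweeping a cut-off over the scores, and the AUC is approximated with the trapezoidal rule. The function names and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(actual: Sequence[int],
                     predicted: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels, where 1 = malicious."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all observations that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def roc_points(actual: Sequence[int],
               scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Sweep the cut-off from high to low; return (FPR, TPR) points."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        predicted = [1 if s >= cutoff else 0 for s in scores]
        tp, fp, _, _ = confusion_matrix(actual, predicted)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under ROC points ordered by increasing FPR."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical ground truth and classifier scores for eight streams.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2]

# Confusion matrix and Accuracy at a single cut-off of 0.5.
predicted = [1 if s >= 0.5 else 0 for s in scores]
tp, fp, tn, fn = confusion_matrix(actual, predicted)
acc = accuracy(tp, fp, tn, fn)

# ROC curve and AUC over all cut-offs.
area = auc(roc_points(actual, scores))
```

A single cut-off yields one confusion matrix and one Accuracy value, whereas the ROC curve and AUC summarize the classifier's behaviour across all cut-offs.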