In paragraph 4.2 we noted that because the monitored environment is often subject to change, the model should also adapt to this change. In this chapter we have seen several approaches in which the model is updated or retrained in order to account for concept-drift. We have made a distinction between continuous learning and threshold based systems. This separation is also used in [30], where a distinction is made between complete retraining, gradual retraining and spot retraining. Gradual retraining represents our notion of continuous learning, where new data is continuously incorporated into the model as a way to adapt gradually to legitimate changes in the environment. Spot retraining makes the process of retraining more efficient; the model is updated only when necessary by identifying when a legitimate change occurs. This is similar to the threshold-based learning we described. In addition, the complete retraining approach represents a method in which the entire model is rebuilt. For this approach, the system may be re-learning data that was already considered “normal” and this approach can therefore be considered to be less efficient than the other two. Complete retraining has similarities with our notion of semi-continuous learning which was described in paragraph 6.2.1.
In addition to this higher level distinction for retraining methods, which applies to the entirety of the model, we can also look at the structure of the model itself, which can be composed out of several sub-models. For these sub-models the decision whether either the existing model is re-used and is retrained based on newly observed legitimate behavior, or an entire new model is created, is independent from the retraining approach that is used on the higher level. For example, when continuous learning is applied to the entire model, some of its sub-models may require complete re-training. Whether sub-models require complete retraining or otherwise can retain some of the historic data largely depends on the level of granularity that is considered for the observed environment with respect to the sub-models. As an example, consider the HTTP requests for a website as the subject of monitoring. We can either decide to use a low level of granularity and create one model for all of the possible requests, use a moderate level of granularity and create a model for each requested web resource, or use a high level of granularity and create a model for each parameter in the request for each web resource. Although there are many aspects involved in the adaptation of models when it comes to legitimate changes in the monitored environment, for now simply assume that in the described situation there is a legitimate change in a certain parameter value and the underlying model should be updated based on this observation. In this case, with a low level of granularity it would not be efficient to create an entire new model for all the possible web requests, when the change only applies to a parameter of a certain web resource. In such cases, when the model creation algorithm allows for it, retraining the existing model would take much less time than creating a new model. However, with the higher level of granularity, for which a model exists for each individual parameter, creating a completely new model will usually take less time and may improve the accuracy of the model. The improved accuracy that is achieved by replacing the existing model could then be the determining factor to choose for this option. Whether the model is updated
or replaced will also determine the amount of historic information that will be lost. Replacing a model and completely eliminating the previous pattern can result in having to constantly retrain the system in a case where cyclical traffic pattern changes occur. On the other hand, incorporating every observed traffic pattern can lead to an oversize database that suffers from a higher false negative rate since its pattern database is too general. It is therefore important to keep in mind that the level of granularity of the models has an effect on the optimal retraining behavior.
6.4.1 ALGORITHMS SUITABLE FOR REAL-TIME UPDATING
In this chapter we have already discussed some approaches that use continuous learning based on live traffic. Especially for these situations it is important that the algorithm is capable of updating on-the-fly. We have seen that there are several modeling techniques that are capable of real-time updating, such as the cluster- based methods (DBSCAN or the method described in paragraph 6.3), self-organizing maps with self-adaptive extensions (such as the GHSOM) and SVDD. Although there is a lack of research on comparisons between these methods, they can be considered state-of-the-art in modeling algorithms, outperforming the more static algorithms like the SOM, SVM and k-means with respect to their autonomicity for self-adaptation.
6.4.2 OBSERVING TRAFFIC VERSUS OBSERVING THE UNDERLYING SYSTEM
We have also observed a difference between systems that detect changes in the underlying system and systems that detect changes in observed traffic. Re-generating or retraining an existing model based on observed traffic in the monitored environment is time consuming, costly and attack-prone. This is a benefit of detecting changes in the underlying system; the model can directly be updated without any training phase, because changes are directly known. However, detecting changes in the underlying system is usually
considered to be a complex task. On the other hand, the delay that is caused before concept drift is detected in live traffic and incorporated in the model can be a real issue, especially when the concept drift is abrupt and the packet scores occur above the anomaly-detection threshold, such that false positives are generated.
6.4.3 DETERMINING THE STABILIZATION POINT AFTER CONCEPT-DRIFT
For some retraining approaches that do not employ continuous learning, in addition to determining the point in time when concept drift is occurring in observed traffic, it may also be desirable to identify the point in time when the concept drift has stopped and the input is stable again. As we have seen in this chapter, for systems that identify concept drift by observing live traffic, this process takes some time, as in all of the methods it requires several packets with drifted values before concept drift will be assumed. When concept drift has then been determined, usually the packets in the sliding window before that point in time will be used as the data for retraining. However, it may also be useful to incorporate new incoming packets in that specific training set or in a new training set and then to perform a second retraining. This is because the concept drift may still be going on. Consider, for example, the situation when the granularity of the models is on an HTTP request parameter-specific level and one parameter value that is modeled as being either the integer zero or one. During the process of live detection, suddenly different numbers and alphabetical characters are occurring as values for this parameter. After several packets concept drift is assumed and the
model is retrained with the data that has been observed up until that point in time. After this retraining live detection will resume while the complete representation of the concept drift, which could also have introduced a certain other characters in the value, has not yet been observed. This may be observed soon after the retraining, although these changes may not be sufficient to let the system assume another instance of concept drift. In this case the model has not achieved the optimal accuracy. It may therefore be desirable to detect when the process of concept drift has stabilized. An example of such an approach is given in [28]. There a method is proposed in which that estimates the likelihood of the system seeing new n-grams, and therefore new content, in the immediate future based on the characteristics of the traffic seen so far. The calculation for this property makes use of the rate of unique n-grams that is observed. Now the system will know the point in time that represents the upper time-bound for which to incorporate packets in the training data, with which it is ensured that the micro-models are well-trained is based on the rate at which new content appears in the training data.