In chapter 6 we distinguished between threshold-based systems and systems that learn continuously from live data. Each approach has advantages over the other. Continuously learning systems incorporate incoming data into the model immediately, so the model adapts to gradual changes in real time. In threshold-based systems, gradual changes cause the model to drift away from the actual state of the environment, and only after a certain delay is a potentially costly retraining process started. On the other hand, continuously incorporating live data into the model comes at a cost in efficiency.
With one of our research questions we aimed to discover whether some modeling algorithms are preferable to others for the self-update method that we propose. Based on the analysis in this report, we argue that the algorithm should at least be capable of autonomous training, that is, of updating the old model with the features extracted from new data. One requirement for this is that the algorithm has no parameters that must be configured manually and that relate to properties of the training data that are unknown beforehand. A modeling technique that does not meet this requirement is the basic SOM, whose parameters are set through a trial-and-error process that iterates over the training data. This behavior is undesirable for continuous updating because it is less efficient. However, nearly all of the state-of-the-art modeling techniques named in this report do not have this limitation, or could easily be adapted to allow for continuous retraining. Cluster-based models have a small advantage because the clusters can be visualized. This gives a human operator an understandable representation of the relevant data, which can aid in activities that optimize the system, such as determining the concept-drift and anomaly thresholds.
An intuitive approach that combines the advantages of continuous updating and threshold-based updating is the clustering technique described in paragraph 6.3 and, more extensively, in [34] and [35]. At its core, this technique can account for concept-drift autonomously, and it is the method we will use to model the “irregular” parameters defined in the previous paragraph.
As an extension we will use the legitimate-client feature described in [33] and in paragraph 6.1.1.1. There, the reputation of clusters was used to prevent legitimate outliers from being filtered out of a set of training data that was modeled using a cluster-based approach. We will use a slightly modified version of this method to detect legitimate changes in live data streams. We propose a system in which observed outliers that exceed the anomaly threshold do not raise alarms when they originate from legitimate clients. Instead, these outliers are added to the reservoir so that the model is retrained based on these changes. In [34] it was already mentioned as a limitation of the clustering-based method that its detection accuracy still needs to be improved, and that one solution would be to use more features of the data. Our proposal is to use the reputation index of outliers as such a feature, which contributes directly to the adaptation to concept-drift in live data. In short, where items (in live detection) and clusters (in retraining) would be considered anomalous in [34], we may consider them valid based on the reputation index. The reputation index is calculated per cluster after retraining. In addition to storing the suspicious items in the “reservoir”, as in [34], we also store the anomalous items observed during live detection and flag them as “suspicious-anomalous”. We also refrain from directly triggering an anomaly alert when observing a suspicious-anomalous item. Instead we issue a low-priority suspicious-anomaly alert, while the retraining process is responsible for triggering a high-priority anomaly alert for clusters that are considered anomalous after retraining. During retraining we treat items that have exceeded the anomaly threshold differently from items that have only exceeded the suspicious threshold.
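The live-detection behavior described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the anomaly score is taken to be a distance to the nearest exemplar, a threshold counts as exceeded when the score is strictly greater, and the names `LiveDetector`, `observe`, `reservoir` and `alerts` are hypothetical and do not come from [34].

```python
from dataclasses import dataclass, field
from typing import List, Tuple

NORMAL = "normal"
SUSPICIOUS = "suspicious"
SUSPICIOUS_ANOMALOUS = "suspicious-anomalous"


@dataclass
class LiveDetector:
    """Hypothetical live-detection stage with two thresholds."""
    suspicious_threshold: float
    anomaly_threshold: float  # assumed to be larger than suspicious_threshold
    reservoir: List[Tuple[str, str]] = field(default_factory=list)
    alerts: List[Tuple[str, str]] = field(default_factory=list)

    def observe(self, item: str, distance: float) -> str:
        """Flag one live item given its (assumed) distance to the nearest exemplar."""
        if distance > self.anomaly_threshold:
            flag = SUSPICIOUS_ANOMALOUS
            # No direct anomaly alert: only a low-priority alert is issued here.
            # High-priority alerts are left to the retraining step.
            self.alerts.append(("low", item))
        elif distance > self.suspicious_threshold:
            flag = SUSPICIOUS
        else:
            return NORMAL
        # Both kinds of flagged items are kept for retraining.
        self.reservoir.append((item, flag))
        return flag
```

Keeping the flag next to each item in the reservoir lets the retraining step apply different validity criteria to the two kinds of items.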
For items that have only exceeded the suspicious threshold, we adopt the same basic approach as in [34] and check the size and sparseness of each cluster. Our extension is that when the cluster fails to meet one of these criteria, a sufficient reputation index will still result in the cluster being considered valid. Note that a sufficient reputation index is not a requirement: when the size and sparseness criteria are met, the cluster is considered valid regardless of whether the reputation index is sufficient. For suspicious items, our extension is therefore similar to the original method, but less strict towards clusters that would otherwise be considered anomalous. For items that have exceeded the anomaly threshold, on the other hand, a sufficient reputation of the cluster is required in addition to the size and sparseness criteria. This makes the model stricter towards including suspicious-anomalous items than towards including regular suspicious items.
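The two validity rules can be summarized in a short sketch. The names `ClusterSummary`, `is_valid_cluster`, `min_size`, `max_sparseness` and `min_reputation` are hypothetical, as is the way the criteria are expressed as simple bounds. The sketch also assumes that a cluster is handled under the stricter rule as soon as it contains suspicious-anomalous items, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSummary:
    """Hypothetical summary of one cluster after retraining."""
    size: int
    sparseness: float
    reputation: float
    contains_anomalous: bool  # assumption: any suspicious-anomalous item sets this


def is_valid_cluster(cluster: ClusterSummary, min_size: int,
                     max_sparseness: float, min_reputation: float) -> bool:
    """Decide whether a retrained cluster is accepted into the model."""
    # Size and sparseness criteria, as in the original method.
    shape_ok = (cluster.size >= min_size
                and cluster.sparseness <= max_sparseness)
    reputation_ok = cluster.reputation >= min_reputation
    if cluster.contains_anomalous:
        # Stricter rule: reputation is required on top of size and sparseness.
        return shape_ok and reputation_ok
    # Looser rule: reputation can compensate for a failed size/sparseness check.
    return shape_ok or reputation_ok
```

A cluster for which this check fails would be the one that triggers the high-priority anomaly alert during retraining.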
Note that we only regard it as an anomaly when a suspicious(-anomalous) item becomes an exemplar of a suspicious cluster. Because existing non-suspicious exemplars could in principle be complemented with an unbounded number of suspicious items, clusters may become sparse, so sparseness has to be regulated in the AP algorithm. In short, we want to make sure that clusters do not become very sparse during rebuilding, which implies that the AP algorithm must be sufficiently strict about including outliers, i.e. the suspicious items, in existing “valid” clusters.
For the “regular” parameters we use a simpler approach. The features extracted for this type of parameter can be represented well as regular expressions, an approach taken directly from [15]. Details regarding the retraining of this model are described in the next paragraph.
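As a rough illustration of a regular-expression model, the sketch below checks parameter values against one pattern per parameter. The parameter names `id` and `lang`, their patterns, and the function `is_anomalous_value` are invented for this example. They do not reflect the features or the procedure of [15].

```python
import re

# Hypothetical model: one regular expression per "regular" parameter.
REGULAR_PARAMETER_MODEL = {
    "id": re.compile(r"[0-9]{1,6}"),   # assumed: numeric identifier, 1 to 6 digits
    "lang": re.compile(r"[a-z]{2}"),   # assumed: two-letter language code
}


def is_anomalous_value(parameter: str, value: str) -> bool:
    """Return True when the value does not fully match the parameter's pattern.

    Raises KeyError for parameters that have no pattern in the model.
    """
    pattern = REGULAR_PARAMETER_MODEL[parameter]
    # fullmatch requires the whole value to match, not just a prefix.
    return pattern.fullmatch(value) is None
```

For example, `is_anomalous_value("id", "12345")` is false, while a value such as `"1 OR 1=1"` for the same parameter is reported as anomalous.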