As mentioned above, the five types of activity logs are prepared with the raw attributes that describe the activities carried out by users who belong to a specific community. The raw activity logs are not suitable for direct analysis by a machine learning method; it is therefore necessary to transform these raw attributes into quantitative features. Fig. 4.2 shows that the activity logs are input to preprocess (2), ‘feature extraction’, which outputs a ‘community behaviour profile’ (the ‘community matrix’).
We define the feature space of the insider threat problem according to the literature [21], [33], [76]. The feature space comprises a set of features that assess the behaviour of users and allow it to be compared with the previous behaviour of these users or of their community. We define each feature in the feature set based on the evidence it would give about any ongoing anomalous behaviour. For example, we define the feature logon_after_hours={0,1}. If the value of the feature is 1, it gives evidence of an unusual logon activity by a user in the community after working hours. This feature therefore contributes to the system's overall decision on whether an alarm of a malicious insider threat should be raised.
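As a minimal sketch of how a Boolean feature such as logon_after_hours could be derived, the snippet below flags any logon that falls outside a working-hours window. The window boundaries (WORK_START, WORK_END) and the function signature are assumptions made for illustration; the text does not state the working hours actually used.

```python
from datetime import datetime, time

# Assumed working hours; the actual boundaries are not given in the text.
WORK_START = time(8, 0)
WORK_END = time(18, 0)

def logon_after_hours(logon_times):
    """Return 1 if any logon falls outside the assumed working hours, else 0."""
    for ts in logon_times:
        if not (WORK_START <= ts.time() < WORK_END):
            return 1
    return 0

# Hypothetical logon timestamps for one community and one period.
in_hours = [datetime(2010, 1, 4, 9, 15), datetime(2010, 1, 4, 13, 40)]
late = in_hours + [datetime(2010, 1, 4, 23, 5)]

print(logon_after_hours(in_hours), logon_after_hours(late))
```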
We categorise the features defined in this work into five groups. We give below a brief description of each category:
• Frequency-based ‘integer’: assess the frequency of an activity carried out by the users in a specified community during a defined period of time;
• Time-based ‘integer’: assess an activity carried out within the non-working hours;
• Boolean ‘flag={0,1}’: assess the presence/absence of activity-related information;
• Attribute-based ‘integer’: more specialised features which assess an activity with respect to a particular value of an attribute; and
4.3. Data preprocessing for CMU-CERT Insider Threat Data Sets 43
TABLE 4.4: The defined features and their categories.
Feature Frequency-based Time-based Boolean Attribute-based Others
freq_logon x
logon_after_hours x
logon_new_pc x
freq_connect x
connect_after_hours x
freq_browse_urls x
freq_browse_job_urls x x
freq_browse_wikileaks_url x x
freq_copy_files x
file_access_ext_exe x
freq_send_emails x
nbr_to_recip x
nbr_cc_recip x
nbr_bcc_recip x
nbr_all_recip x
non_emp_recip x
avg_size_emails x
nbr_attach x
The cross symbol (x) denotes that this feature belongs to the corresponding category.
Table 4.4 lists all the features defined in this work and the categories to which they belong. Note that some features belong to more than one category. For example, ‘freq_browse_job_urls’ is (1) an attribute-based feature, which assesses the activity of browsing particular URLs (i.e. URLs of job websites), and (2) a frequency-based feature, which assesses how often this activity is carried out.
Fig. 4.4 shows preprocess (2), feature extraction, for a specific community. Let {f1, f2, ..., fm} represent the set of features used to construct the community behaviour profile ‘community matrix’, such that fi (1 ≤ i ≤ m) represents a feature from the features defined above. The raw attributes in the prepared activity logs are used to extract the values of the features. We define a session_slot as a period of time from start_time to end_time. A community behaviour profile represents a set of feature vectors (i.e. a matrix) over sorted session slots. Each feature vector (i.e. instance) is the set of values of the features {f1, f2, ..., fm} over a certain session slot. Consider session_slot t, and assume that f1 is the feature ‘freq_connect’; then the value of feature f1 is the frequency of the connect activity for the users who belong to the specified community between the start_time and end_time of session_slot t.
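The construction described above can be sketched as follows. This is an illustrative sketch only: the event tuple layout (timestamp, user, activity), the half-open slot interval [start_time, end_time), and the helper names are assumptions, not the format of the prepared activity logs.

```python
from datetime import datetime

def count_activity(events, start_time, end_time, activity):
    """Frequency of `activity` within the half-open slot [start_time, end_time)."""
    return sum(
        1
        for ts, _user, act in events
        if act == activity and start_time <= ts < end_time
    )

def build_community_matrix(events, slots, feature_activities):
    """One row (feature vector) per session slot, one column per feature."""
    return [
        [count_activity(events, start, end, act) for act in feature_activities]
        for start, end in slots
    ]

# Hypothetical events for the users of one community.
events = [
    (datetime(2010, 1, 4, 9, 0), "u1", "logon"),
    (datetime(2010, 1, 4, 9, 30), "u1", "connect"),
    (datetime(2010, 1, 4, 10, 0), "u2", "connect"),
    (datetime(2010, 1, 4, 13, 0), "u2", "logon"),
]
# Sorted session slots as (start_time, end_time) pairs.
slots = [
    (datetime(2010, 1, 4, 8, 0), datetime(2010, 1, 4, 12, 0)),
    (datetime(2010, 1, 4, 12, 0), datetime(2010, 1, 4, 16, 0)),
]
# Columns: freq_logon, freq_connect (each mapped to the activity it counts).
feature_activities = ["logon", "connect"]

matrix = build_community_matrix(events, slots, feature_activities)
print(matrix)  # [[1, 2], [1, 0]]
```

Only frequency-based features are shown here; time-based, Boolean and attribute-based features would need their own extraction functions applied to the same slots.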
The constructed community behaviour profile is then input to a machine learning approach to generate a community behaviour model. This model defines the
44 Chapter 4. Feature Space for Insider Threat Detection
FIGURE 4.4: Feature extraction for a specific community. {f1, f2, ..., fm} represent the set of features to construct the community behaviour profile ‘community matrix’, such that fi (1 ≤ i ≤ m) represents a feature from the defined features. A session_slot is a period of time from start_time to end_time. A community behaviour profile represents a set of feature vectors over sorted session slots.
baseline behaviour for the users in the specified community. Any deviation from the baseline is analysed by the detection system to identify whether it is a normal drift of behaviour or it is associated with a malicious insider threat.
In this thesis, based on deep analysis of the data distribution, we define the session_slot as a four-hour period in order to find local anomalous behaviour within a day that would not be detected at the granularity of a whole day. The rationale behind choosing a four-hour session slot is that this period is long enough to extract an instance (i.e. vector of feature values) that provides adequate evidence of anomalous behaviour. Thus, it allows the system to capture anomalous behaviours in the feature space. If the session slot is set to minutes, for example, the extracted instances would lack adequate evidence of the occurrence of anomalous behaviour. On the other hand, if the session slot is set to days or weeks, for example, the period of time
FIGURE 4.5: Sample of a community behaviour profile ‘Community Matrix’.
will be too long to capture the anomalous behaviour, which becomes blurred among the normal behaviour in the extracted vector of feature values. Note that early experiments in this work used session slots of minutes; however, the low detection performance guided our work towards session slots of hours, because, as mentioned above, a session slot of minutes is too short to carry adequate evidence of the occurrence of anomalous behaviours.
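To make the four-hour session slot concrete, the sketch below maps a timestamp to the slot that contains it. It assumes that slots are aligned to midnight (00:00, 04:00, 08:00, ...), which the text does not specify; the function name slot_bounds is hypothetical.

```python
from datetime import datetime, timedelta

SLOT_HOURS = 4  # session slot length used in this work

def slot_bounds(ts, slot_hours=SLOT_HOURS):
    """Return (start_time, end_time) of the slot containing `ts`.

    Assumes slots are aligned to midnight; this alignment is an assumption.
    """
    start_hour = (ts.hour // slot_hours) * slot_hours
    start = ts.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=slot_hours)

start, end = slot_bounds(datetime(2010, 1, 4, 22, 47))
print(start, end)  # 2010-01-04 20:00:00 2010-01-05 00:00:00
```

With this granularity a day yields six instances, so a burst of activity late in the evening lands in its own instance instead of being averaged into a whole day of normal behaviour.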
After constructing the community behaviour profile, we normalise each vector of feature values (over a session slot) to the range [0,1], and associate it with a class_label {Normal, Anomalous} and a threat_label {Normal, scenRef_insiderID}. scenRef_insiderID (e.g. s1_ALT1465) has two parts: scenRef (e.g. s1, s2, s3, or s4), the reference number of the scenario followed in the malicious insider threat, and insiderID (e.g. ALT1465, AYG1697), the user ID of the insider attributed to the threat.
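The normalisation and labelling step might look like the sketch below. The text does not name the normalisation method, so column-wise min-max scaling is an assumption here, as are the helper names min_max_normalise and parse_threat_label.

```python
def min_max_normalise(matrix):
    """Scale every feature column to [0, 1] (assumed: column-wise min-max)."""
    columns = list(zip(*matrix))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in matrix
    ]

def parse_threat_label(label):
    """Split 'scenRef_insiderID' into its two parts; 'Normal' yields None."""
    if label == "Normal":
        return None
    scen_ref, insider_id = label.split("_", 1)
    return scen_ref, insider_id

# Hypothetical raw feature vectors for three session slots.
raw = [[2, 0], [4, 5], [6, 10]]
print(min_max_normalise(raw))            # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(parse_threat_label("s1_ALT1465"))  # ('s1', 'ALT1465')
```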
Fig. 4.5 provides a sample of a community behaviour profile ‘community matrix’. It shows the attributes (session_id, session_slot), the features (freq_logon to nbr_attach), and the labels (class_label, threat_label) for the session slots with session_id = 2248–2257.