In the general case of feature selection, selecting the best predictive features for a classifier serves two purposes: (1) improving the accuracy of the classifier, and (2) reducing the
computational cost caused by the high dimensionality of the original feature space. Maxion pointed out that most prior profiling approaches apply the same features to all user models [Maxion, 2005]. I posit that diversifying features, by selecting the best features for
a specific user model independently of all other user models, improves the overall aggregated accuracy of all classifiers. Furthermore, diversifying the model features also hardens
the classifier against mimicry attacks: an adversary now has to identify the specific features modeled for the individual victim before being able to launch a mimicry attack successfully.
CHAPTER 6. PERSONALIZED DIVERSIFIED USER MODELS
Feature Number   Feature Description
 1               Browsing
 2               Communications
 3               Database Access
 4               Desktop Configuration
 5               Development
 6               Editing
 7               File Compression
 8               File System Management
 9               Games
10               Installation
11               IO Peripherals
12               Learning
13               Media
14               Modeling
15               Networking
16               Organization
17               Other
18               Search
19               Security
20               Software Management
21               System Management
22               Web Hosting
Table 6.1: Features Extracted for Building User Models
I therefore increase the costs of launching such attacks against a wide set of users.
I apply this concept of diversified modeling to support vector machines (SVMs), which
have been shown to achieve the best accuracy results when used for user behavior profiling [Seo and Cha, 2007; Ben-Salem et al., 2008]. The choice of SVMs was suitable for online
learning due to their adequacy for block-by-block incremental learning: when new data arrive and the model may need updating in order to deal with concept
drift, SVMs do not have to be retrained on the whole set of new and old data. Instead, it is sufficient to retrain on the most recent data together with the support vectors
identified in the old SVM model [Vapnik, 1999; Syed et al., 1999].
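
The block-by-block update described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this work: the SVM is a small linear model trained by batch subgradient descent on the hinge loss, support vectors are approximated as the points whose margin lies within a tolerance of 1, and all function names (train_linear_svm, support_vectors, incremental_update) and the toy data are hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularised hinge loss (assumed stand-in solver)."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:  # this point violates the margin
                for d in range(dim):
                    grad_w[d] -= yi * xi[d] / n
                grad_b -= yi / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def support_vectors(X, y, w, b, tol=0.3):
    """Approximate support vectors: points on or near the margin (tol is an assumption)."""
    return [(xi, yi) for xi, yi in zip(X, y) if yi * (dot(w, xi) + b) <= 1.0 + tol]

def incremental_update(old_X, old_y, w, b, new_X, new_y):
    """Retrain on the old model's support vectors plus the newest block only."""
    kept = support_vectors(old_X, old_y, w, b)
    X = [xi for xi, _ in kept] + list(new_X)
    y = [yi for _, yi in kept] + list(new_y)
    return train_linear_svm(X, y), kept

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1

# Toy data (hypothetical): an old block, then a newer block with a slightly shifted pattern.
old_X = [(2.0, 2.0), (-2.0, -2.0), (3.0, 3.0), (-3.0, -3.0), (4.0, 4.0), (-4.0, -4.0)]
old_y = [1, -1, 1, -1, 1, -1]
w_old, b_old = train_linear_svm(old_X, old_y)

new_X = [(1.0, 2.0), (-1.0, -2.0), (2.0, 4.0), (-2.0, -4.0)]
new_y = [1, -1, 1, -1]
(w_new, b_new), kept = incremental_update(old_X, old_y, w_old, b_old, new_X, new_y)
```

Only the two old points closest to the decision boundary are carried into the second training round; the remaining old data can be discarded, which is what keeps the update cost bounded as new blocks arrive.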
To implement the concept of diversified modeling, I use the maximum entropy discrimination framework. In the following subsections, I briefly introduce this framework and
explain how I use it to apply the diversified modeling approach.
6.1.2.1 Maximum Entropy Discrimination Framework
Jebara and Jaakkola developed a Maximum Entropy Discrimination (MED) framework for
support vector machines and large-margin linear classifiers, which they later extended to sparse SVMs and to multi-task SVMs [Jebara and Jaakkola, 2000].
Solving a standard convex quadratic program returns an SVM model Θ = {θ, b}. The MED
framework can instead be used to return a distribution over model parameters, P(θ), in lieu of a single parameter value θ, such that the expected value of the discriminant under this distribution
matches the labeling of the feature vectors [Jebara, 2004]. The framework can therefore be considered a generalization of support vector machines.
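
For reference, the constraint described above can be written out as follows. This is a sketch in the spirit of [Jebara and Jaakkola, 2000]; the prior P_0, the margin γ (treated here as a fixed constant for simplicity), and the index t over the T training pairs (X_t, y_t) are notational assumptions that the text above does not introduce.

```latex
\[
\min_{P(\Theta)} \; \mathrm{KL}\bigl(P(\Theta)\,\|\,P_0(\Theta)\bigr)
\quad \text{subject to} \quad
\int P(\Theta)\,\bigl[\,y_t\,L(X_t;\Theta) - \gamma\,\bigr]\,d\Theta \;\ge\; 0,
\qquad t = 1,\dots,T,
\]
\[
L(X;\Theta) = \theta^{\top} X + b,
\qquad
\hat{y} = \operatorname{sign}\int P(\Theta)\,L(X;\Theta)\,d\Theta .
\]
```

The inequality states that each training example is classified correctly, with margin γ, in expectation under P(Θ); the second line gives the linear discriminant and the resulting prediction rule.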
As the developers of the framework note, one can augment the discriminant with a feature selection switch vector s, which becomes a part of the more complex model Θ = {θ, b, s} [Jebara, 2004]. The switch vector represents weights corresponding to the features
used to train the model. If the weight is equal to zero, then the corresponding feature is
not included in the model and can be discarded from the input data. I use these augmented models to select the best features for the user models. This is done for each user model
independently of the remaining user models.
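
The effect of the switch vector can be illustrated with a minimal sketch. It assumes a linear discriminant of the form L(x) = Σ_d s(d) θ(d) x(d) + b, and the helper names, parameter values, and switch settings below are hypothetical; only the feature names come from Table 6.1.

```python
def discriminant(x, theta, b, switch):
    """Linear discriminant gated by the feature switch: sum_d s_d * theta_d * x_d + b."""
    return sum(s * t * xi for s, t, xi in zip(switch, theta, x)) + b

def selected(values, switch):
    """Keep only the entries whose switch weight is non-zero."""
    return [v for v, s in zip(values, switch) if s != 0]

# Hypothetical example over four of the features listed in Table 6.1.
names = ["Browsing", "Communications", "Editing", "Games"]
theta = [0.5, -1.0, 2.0, 0.25]   # assumed model weights
switch = [1, 0, 1, 0]            # assumed switch: features 2 and 4 are turned off
b = -1.0
x = [3.0, 1.0, 2.0, 5.0]

full_score = discriminant(x, theta, b, switch)

# Discarding the switched-off features from both model and input gives the same score.
kept_names = selected(names, switch)
reduced_score = discriminant(selected(x, switch), selected(theta, switch), b, [1, 1])
```

Because a zero switch weight removes a feature's contribution entirely, the reduced model over the two kept features produces exactly the same discriminant value as the full model.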
The model returned by the MED framework can be further augmented to include joint
densities over parameters of several classifiers and feature and kernel selection configurations, rather than parameters of a single classifier only [Jebara, 2011]. This constitutes the
multi-task learning variant of the framework. Instead of learning each classifier or task independently, one can pool all tasks and corresponding data together to form one global
prediction problem and learn a single classifier for all of them.
Although the models are conditionally independent given the data, the model representation used makes them dependent otherwise.
Observing the data with the latent shared parameter s introduces such dependencies among the multiple tasks or classifiers. For example,
in the simple case of multi-task learning with two models (Θ1 → D1 → s ← D2 ← Θ2),
observing the data D1 and D2 links the two user models Θ1 and Θ2 unless the shared binary feature
switch s is also observed. This shared-classifier, or shared-model, setup may be particularly beneficial when only a limited number of training examples is available for each task.
6.1.2.2 Feature Selection
I use the MED framework to select the best features for the user models, applying its feature selection capability in two different ways. The first is to select
the best features for individual user models independently. I call this the independent learning technique, which results in the diversified modeling approach, as the selected
features for each classifier vary from one user model to another. In the second approach, I present the MED framework with training data from all the users, i.e., for all classifiers
or tasks, and use it to select the best features for all user models. The MED framework returns one global solution for all user models, in which the same features are selected for
all classifiers. I call this the multi-task learning approach or, more accurately in this case, the meta-learning approach, since no additional samples are made available to the learner.
Meta-learning improves accuracy through the inter-dependencies between the binary tasks, and not through additional data samples.
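
The difference in data flow between the two approaches can be sketched as follows. Note that this sketch does not implement MED: as a loudly labeled stand-in, each feature is scored by the absolute difference between its mean over positive and negative samples, and the top k features are kept. The function names, the user names, and the toy data are all hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def score_features(samples):
    """Stand-in relevance score (NOT MED): |mean over positives - mean over negatives|."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == -1]
    dim = len(samples[0][0])
    return [abs(mean([x[d] for x in pos]) - mean([x[d] for x in neg]))
            for d in range(dim)]

def top_k(scores, k):
    """Indices of the k highest-scoring features, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    return sorted(ranked[:k])

def independent_selection(data_by_user, k):
    """Independent learning: each user model selects its own features."""
    return {user: top_k(score_features(samples), k)
            for user, samples in data_by_user.items()}

def pooled_selection(data_by_user, k):
    """Multi-task (meta-learning) variant: one selection shared by all user models."""
    pooled = [s for samples in data_by_user.values() for s in samples]
    shared = top_k(score_features(pooled), k)
    return {user: shared for user in data_by_user}

# Toy data: (feature vector, label), label +1 for the user's own behavior, -1 otherwise.
data_by_user = {
    "alice": [((5, 1, 1), 1), ((6, 1, 2), 1), ((1, 1, 1), -1), ((0, 1, 2), -1)],
    "bob":   [((1, 1, 7), 1), ((2, 1, 8), 1), ((1, 1, 1), -1), ((2, 1, 0), -1)],
}

diversified = independent_selection(data_by_user, k=1)
shared = pooled_selection(data_by_user, k=1)
```

In this toy setting the independent variant picks a different feature for each user, which is the diversification effect, whereas the pooled variant returns the same feature for every user model.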
To apply the MED framework, I build a Gram matrix G, whose entries are given by the
kernel function evaluated over all pairs of data points, $G(x, \bar{x}) = \sum_{d=1}^{D} \hat{s}(d)\, k_d(x, \bar{x})$, where
$k_d(x, \bar{x}) = x(d)\, \bar{x}(d)$ is the scalar product of the d'th dimension of the input data needed
for feature selection. The MED framework will return the optimal weighted combination of features $\hat{s}$.
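
The Gram matrix construction above can be sketched in a few lines. The function name weighted_gram, the data points, and the weight values are hypothetical; in the actual approach the weights ŝ are returned by the MED framework rather than fixed by hand.

```python
def weighted_gram(X, s_hat):
    """G[i][j] = sum_d s_hat[d] * X[i][d] * X[j][d], i.e. a weighted sum of
    one-dimensional linear kernels k_d(x, x') = x[d] * x'[d]."""
    n = len(X)
    return [[sum(s * xi * xj for s, xi, xj in zip(s_hat, X[i], X[j]))
             for j in range(n)]
            for i in range(n)]

# Two hypothetical data points with D = 2 features.
X = [[1.0, 2.0], [3.0, 4.0]]

# A zero weight removes the second feature from every kernel evaluation.
G_first_only = weighted_gram(X, [1.0, 0.0])

# Fractional weights blend the per-feature kernels.
G_blended = weighted_gram(X, [0.5, 1.0])
```

Setting a weight to zero makes the corresponding feature vanish from every entry of G, which is how the weighting doubles as feature selection.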