2.3 Development of intellectual skills
2.3.1 Intellectual skills
2.3.1.2 Numerical aptitude
Given a dataset of $p$ random variables, the task of feature selection for predictive models is to obtain a subset of these variables (features), which are given for each instance and are then input to the predictive model. In the supervised setting, the predictive model is used to solve a classification or regression problem; that is, the chosen features represent the independent variables used to predict a given dependent variable. The number of variables is also referred to as the dimensionality of the dataset. Feature selection can be seen as a search through the powerset consisting of all $2^p - 1$ non-empty subsets, which is itself an optimization problem.
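To make the size of this search space concrete, the following minimal sketch enumerates all non-empty feature subsets of a small feature set and prints how quickly $2^p - 1$ grows. The feature names and the helper `nonempty_subsets` are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def nonempty_subsets(features):
    """Yield every non-empty subset of `features`, smallest subsets first."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


# Hypothetical feature names, p = 3.
features = ["age", "height", "weight"]
subsets = list(nonempty_subsets(features))
for subset in subsets:
    print(subset)
print(len(subsets) == 2 ** len(features) - 1)

# Number of candidate subsets for larger p (computed, not enumerated).
for p in (10, 20, 30):
    print(p, 2 ** p - 1)
```

Exhaustive enumeration of this kind is only feasible for very small $p$, which is why practical methods rely on heuristic search.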
The importance of feature selection has grown over the last decade due to the increasing availability of high-dimensional datasets. Generalization of machine learning models over such data, i.e. the capability of a model to fit the data correctly without being too biased towards the given samples (training data), becomes exponentially harder as the number of variables increases [Dom12]. This is closely connected to the problem of overfitting in the machine learning literature, as every learning algorithm comes with a certain bias and variance with respect to its parameters. Highly biased algorithms need a lot of evidence in the data to change their initial parameter choice, while high-variance algorithms very quickly fit to the peculiarities of a dataset. Overfitting happens when the model parameters are over-specific to the characteristics of the training data and fail to generalize to unseen samples.
Another related challenge of high-dimensional datasets is the curse of dimensionality, which holds for a broad family of data distributions and distance measures [HKK+10]. Formally, it has been shown that with increasing data dimensionality the proportional difference between the farthest-point distance $d_{\max}$ and the closest-point distance $d_{\min}$ converges to zero. For a dataset $D = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{n \times p}$ consisting of $p$ features, with $d_{\min}(D) = \min_{i \neq j} \operatorname{dist}(x_i, x_j)$ and $d_{\max}(D)$ defined analogously as the maximum, we have

$$\lim_{p \to \infty} \frac{d_{\max} - d_{\min}}{d_{\min}} = 0.$$

Since a stable distance measure is vital for many predictive models, reducing high dimensionality is an important challenge.
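The shrinking contrast between nearest and farthest distances can be observed empirically. The sketch below is an illustration under stated assumptions, not part of the cited result: it draws points uniformly from the unit hypercube, uses the Euclidean distance, and fixes $n = 50$; the helper names are hypothetical. It computes $(d_{\max} - d_{\min}) / d_{\min}$ for several values of $p$.

```python
import math
import random


def relative_contrast(points):
    """Return (d_max - d_min) / d_min over all pairs of distinct points."""
    dists = [
        math.dist(a, b)
        for i, a in enumerate(points)
        for b in points[i + 1:]
    ]
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


def sample_uniform(n, p, rng):
    """Draw n points uniformly at random from the unit hypercube [0, 1]^p."""
    return [[rng.random() for _ in range(p)] for _ in range(n)]


rng = random.Random(0)  # fixed seed so that repeated runs give the same numbers
n = 50
contrasts = {
    p: relative_contrast(sample_uniform(n, p, rng))
    for p in (2, 10, 100, 1000)
}
for p, c in contrasts.items():
    print(f"p={p:5d}  relative contrast={c:.3f}")
```

For this kind of data the printed contrast should decrease as $p$ grows, which is the finite-sample counterpart of the limit stated above.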
[Figure 3.1: The wrapper-based feature selection framework is an iterative approach to select the best subset of features according to a search strategy. Diagram elements: all features, search strategy, feature subset, predictive model, estimated performance, best subset.]
To avoid this problem, it is necessary to develop robust yet cost-efficient predictive models by decreasing the data dimensionality beforehand. This reduction of input feature dimensionality is part of the well-known problem of feature selection. There are various feature selection methods that try to judge the usefulness of features with respect to the objective function of the predictive model. They can be grouped into the following categories, cf. [TAL14]: filter, wrapper, and embedded.
The wrapper-based framework is shown in Figure 3.1. In this iterative approach, feature selection is performed by a search strategy (e.g. greedy) that selects a subset of all features. Then the predictive model is evaluated on this subset with respect to some performance metric, e.g. accuracy. Finally, the best subset according to the metric is chosen. This generic framework is agnostic of the predictive model and can therefore be applied universally. In contrast to embedded methods, the wrapper setting uses an external evaluation criterion that is not built into the predictive model itself, so the model has to be trained and evaluated again for every candidate subset. This is why wrapper approaches are computationally more expensive; the crucial part is to come up with an efficient search strategy that avoids exhaustive enumeration.
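The wrapper loop can be sketched as greedy forward selection. The code below is a minimal illustration under stated assumptions: the predictive model is a leave-one-out 1-nearest-neighbour classifier chosen only as a stand-in, the toy data are invented, and the function names `loo_1nn_accuracy` and `greedy_forward_selection` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]


def loo_1nn_accuracy(X: Sequence[Row], y: Sequence[int], subset: Sequence[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `subset`."""
    correct = 0
    for i, xi in enumerate(X):
        # Nearest other point, using only the selected feature columns.
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((xi[f] - X[j][f]) ** 2 for f in subset),
        )
        correct += int(y[nearest] == y[i])
    return correct / len(X)


def greedy_forward_selection(
    n_features: int, evaluate: Callable[[List[int]], float]
) -> Tuple[List[int], float]:
    """Add one feature at a time while the estimated performance strictly improves."""
    selected: List[int] = []
    best_score = float("-inf")
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        # The predictive model is evaluated once per candidate subset.
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, feature = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break  # no candidate improves the metric, so stop searching
        selected.append(feature)
        best_score = score
    return selected, best_score


# Invented toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, -5.0], [0.2, 5.0], [0.3, -5.0],
     [1.0, 5.0], [1.1, -5.0], [1.2, 5.0], [1.3, -5.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected, score = greedy_forward_selection(2, lambda s: loo_1nn_accuracy(X, y, s))
print(selected, score)
```

A greedy search of this form evaluates at most $p(p+1)/2$ subsets instead of all $2^p - 1$, at the price of possibly missing the globally best subset.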
One form of embedded feature selection can be seen as adding constraints on the model parameters, a technique called regularization. The $L_1$-norm penalty on the model parameters induces a sparse feature representation. In linear regression models this is called the Lasso, which corresponds to a Laplace prior on the parameters. For very high-variance models such as neural networks with multiple stacked layers of hidden parameters, best practices to avoid overfitting are applied to the objective function by introducing a penalty on the magnitude of the parameters. For example, given a parameter matrix $W$, the $L_2$-norm $\|W\|_2$ is added to the objective function, thereby favoring solutions with small parameter magnitudes. To increase the accuracy of predictions on unseen data, these regularization techniques have been applied to linear models as well, e.g. in Ridge or Lasso regression models. Such linear models are also used in cases where the number of features $p$ is greater than the number of data points $n$ ($p > n$). Here, further regularization schemes have been developed, especially focused on collinearity between predictor variables [ZH05]. Removing predictor variables from the model also has the benefit of increasing the interpretability of results.
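Why the $L_1$ penalty removes features while the $L_2$ penalty only shrinks them can be illustrated in a simplified special case. The sketch below assumes an orthonormal design ($X^\top X = I$), for which the Lasso objective $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$ is solved by soft-thresholding the least-squares coefficients, and the ridge objective $\tfrac{1}{2}\|y - X\beta\|_2^2 + \tfrac{\lambda}{2}\|\beta\|_2^2$ by uniform rescaling. The coefficient values and function names are hypothetical.

```python
def lasso_shrink(b: float, lam: float) -> float:
    """Soft-thresholding: Lasso solution for one coefficient (orthonormal design)."""
    if abs(b) <= lam:
        return 0.0  # small coefficients are set exactly to zero
    return b - lam if b > 0 else b + lam


def ridge_shrink(b: float, lam: float) -> float:
    """Ridge solution for one coefficient (orthonormal design): pure rescaling."""
    return b / (1.0 + lam)


# Hypothetical least-squares coefficients and penalty strength.
ols = [3.0, -0.4, 0.2, -1.5]
lam = 0.5

lasso = [lasso_shrink(b, lam) for b in ols]
ridge = [ridge_shrink(b, lam) for b in ols]

# Features with a non-zero Lasso coefficient are the selected ones.
selected = [j for j, b in enumerate(lasso) if b != 0.0]
print("lasso:", lasso)
print("ridge:", ridge)
print("selected features:", selected)
```

In this setting the Lasso acts as an embedded feature selector because some coefficients become exactly zero, whereas every ridge coefficient stays non-zero.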
Filter feature selection methods are based on characteristics of the features themselves, such as their variance. This is closely related to dimensionality reduction approaches that transform the original feature space into a lower-dimensional one that still keeps most of the original variance in the dataset. This can be accomplished by combining highly correlated features, as in principal component analysis (PCA), which computes the eigenvectors of the features' covariance matrix over the whole dataset. Hence, most of the original variance can be retained by a subset of eigenvectors used as a low-dimensional projection. The shortcoming of techniques like PCA is that they retain only linear correlations within the dataset. Furthermore, the low-dimensional projection can be hard to interpret when the predictive model is later explained to domain experts.
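A very simple filter of the kind mentioned at the start of this paragraph keeps only features whose variance exceeds a threshold. The following sketch assumes tabular data given as a list of rows; the data values, the threshold, and the function name `variance_filter` are hypothetical. Note that variance depends on the scale of each feature, so in practice features would typically be brought to comparable units first.

```python
from statistics import pvariance


def variance_filter(X, threshold):
    """Return indices of feature columns whose population variance exceeds `threshold`."""
    columns = list(zip(*X))  # transpose rows into feature columns
    return [j for j, col in enumerate(columns) if pvariance(col) > threshold]


# Invented data: column 0 is constant, column 1 barely varies, column 2 varies a lot.
X = [
    [1.0, 0.50, 10.0],
    [1.0, 0.52, 20.0],
    [1.0, 0.49, 30.0],
    [1.0, 0.51, 40.0],
]

kept = variance_filter(X, threshold=0.01)
print(kept)
```

Unlike the wrapper setting, this filter never consults a predictive model, which makes it cheap but also blind to how useful a feature is for the prediction task.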
In general, both dimensionality reduction and feature selection approaches have been targeted towards traditional tabular, i.e. propositionalized, data structures. However, the proliferation of complex-structured data has pushed the demand for feature selection approaches that deal directly with these representations, e.g. connected datasets [GH11], common single-relational graphs (networks) [AW06], and multi-relational graphs [GLT+16]. Due to the difficulty of applying predictive models to complex-structured data, the border between pure feature selection and feature extraction vanishes, and applying these methods to graph-structured data again yields a tabular data representation, thereby enabling efficient statistical inference, which will be introduced in Chapter 4.