
Support vector methods have developed continuously since their introduction by Boser et al. (1992) and are highly competitive machine learning techniques. As a representative state-of-the-art machine learning technique for the evaluation of visit potential under missing data, we therefore selected support vector regression (SVR) as introduced by Drucker et al. (1997). Standard SVR implementations such as the R packages e1071 (relying on LIBSVM) and klaR (relying on SVMlight) cannot handle missing data (Dimitriadou et al., 2011; Roever et al., 2011), so we designed a two-step single imputation schema to apply the method. The schema is a general framework for machine learning methods and allows for an easy exchange of the base learner. It is not specifically adapted to SVR and may therefore not reach the results that a specialized SVR could obtain. Our imputation schema handles arbitrary patterns of missing data and, through the inclusion of further independent variables, SVR can compensate for MAR dependencies. Similar to GLM, SVR predicts real-valued, possibly negative numbers, so an additional data transformation step must be applied. Support vector methods are also known to require careful parameter tuning, which is a disadvantage for the practical use of the method. Nevertheless, SVR is a state-of-the-art machine learning method that has been successfully applied in a number of application domains.

We begin this section with an introduction to support vector regression, which is based on the tutorial of Smola and Schölkopf (1998). Subsequently, we present our imputation schema. Similar to GLM, we postpone the treatment of real-valued results until Section 5.4.

Learning algorithm. Let $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \subset \mathcal{X} \times \mathbb{R}$ denote a training data set, where $\mathcal{X}$ denotes the space of the independent variables, for example $\mathcal{X} = \mathbb{R}^d$ with $d \in \mathbb{Z}$ and $d \geq 1$. The aim of ε-support vector regression is to find a function $f : \mathcal{X} \to \mathbb{R}$ which a) shows a deviation of at most $\varepsilon$ from the actual values $y_i$ for $i = 1, \ldots, n$ and b) is as flat as possible. Flat hereby refers to a restriction of the complexity of the function in order to avoid overfitting. In the case of a linear function, $f$ has the following form

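$$f(x) = \langle w, x \rangle + b \quad \text{with } w \in \mathcal{X},\ b \in \mathbb{R}. \tag{5.45}$$

Hereby, $\langle \cdot, \cdot \rangle$ denotes the inner product. We can formulate a convex optimization problem for SVR as follows

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2} \lVert w \rVert^2 \\ \text{subject to} \quad & \begin{cases} y_i - \langle w, x_i \rangle - b \leq \varepsilon \\ \langle w, x_i \rangle + b - y_i \leq \varepsilon \end{cases} \end{aligned} \tag{5.46}$$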

The term $\lVert w \rVert^2$ denotes the squared norm of $w$. It is a regularization term that reduces the complexity of the function by minimization. However, Equation 5.46 has the disadvantage that it assumes that a function $f$ exists which predicts the data with an error of at most $\varepsilon$, i.e. $|y_i - f(x_i)| \leq \varepsilon$ for all $i = 1, \ldots, n$. Often, however, this is not the case. We can then relax the optimization problem by introducing slack variables $\xi_i, \xi_i^*$ similar to a soft-margin approach. This leads to the following optimization problem

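$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \\ \text{subject to} \quad & \begin{cases} y_i - \langle w, x_i \rangle - b \leq \varepsilon + \xi_i \\ \langle w, x_i \rangle + b - y_i \leq \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \geq 0 \end{cases} \end{aligned} \tag{5.47}$$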
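To make the role of the slack variables in Equation 5.47 concrete, the following Python sketch evaluates the objective for a fixed linear function. It relies on the observation that, for fixed $w$ and $b$, the smallest feasible slacks are $\xi_i = \max(0, y_i - f(x_i) - \varepsilon)$ and $\xi_i^* = \max(0, f(x_i) - y_i - \varepsilon)$. The function names and the toy data are illustrative assumptions and are not taken from the original experiments.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def svr_primal_objective(w, b, xs, ys, eps, cost):
    """Evaluate the soft-margin objective of Equation 5.47 for fixed (w, b).

    The slack variables are set to their smallest feasible values, so only
    deviations larger than eps contribute to the penalty term.
    """
    slack_sum = 0.0
    for x, y in zip(xs, ys):
        residual = y - (dot(w, x) + b)
        xi = max(0.0, residual - eps)        # f underestimates y by more than eps
        xi_star = max(0.0, -residual - eps)  # f overestimates y by more than eps
        slack_sum += xi + xi_star
    return 0.5 * dot(w, w) + cost * slack_sum


# Toy data (assumed for illustration): only the third point lies outside the eps-tube.
xs = [[0.0], [1.0], [2.0]]
ys = [0.0, 1.0, 3.0]
objective = svr_primal_objective([1.0], 0.0, xs, ys, eps=0.5, cost=10.0)
print(objective)
```

In this toy setting the third instance deviates by 1.0 from the prediction, so it contributes a slack of 0.5, while the other two instances lie inside the ε-tube and incur no cost.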

The constant $C$ determines the cost of prediction errors that are larger than $\varepsilon$. Equation 5.47 is typically solved using the dual formulation. This has the advantage that non-linear functions can be easily integrated into the framework later on. The transformation of the primal (Equation 5.47) to the dual can be accomplished using Lagrange multipliers, which yields

$$\begin{aligned} L := {} & \tfrac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) - \sum_{i=1}^{n} (\eta_i \xi_i + \eta_i^* \xi_i^*) \\ & - \sum_{i=1}^{n} \alpha_i (\varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b) \\ & - \sum_{i=1}^{n} \alpha_i^* (\varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b). \end{aligned} \tag{5.48}$$

$L$ denotes the Lagrangian and $\eta_i, \eta_i^*, \alpha_i, \alpha_i^*$ are the dual variables or Lagrange multipliers, which must be non-negative. At the optimum, the partial derivatives of $L$ with respect to the primal variables $w, b, \xi_i, \xi_i^*$ must be equal to zero. The derivatives are

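$$\begin{aligned} \frac{\partial L}{\partial b} &= \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) = 0 \\ \frac{\partial L}{\partial w} &= w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, x_i = 0 \\ \frac{\partial L}{\partial \xi_i} &= C - \alpha_i - \eta_i = 0 \\ \frac{\partial L}{\partial \xi_i^*} &= C - \alpha_i^* - \eta_i^* = 0. \end{aligned} \tag{5.49}$$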

In order to obtain the dual optimization problem, the derivatives must be substituted into Equation 5.48. Further details on the solution of the dual can be found in Smola and Schölkopf (1998). Note, however, that the derivative in the second line of Equation 5.49 can be rewritten as

$$w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, x_i. \tag{5.50}$$

It implies that $w$ is a linear combination of the input data. In combination with Equation 5.45 we obtain

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b. \tag{5.51}$$

From Equation 5.51 we can predict the value of the dependent variable of some data instance $x$ without explicitly computing $w$. The calculation relies only on the inner product of $x$ and the training instances. More specifically, the calculation relies only on those training instances for which $|f(x_i) - y_i| \geq \varepsilon$, the so-called support vectors.
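The equivalence between the primal form $f(x) = \langle w, x \rangle + b$ and the expansion over training instances can be checked numerically. The Python sketch below is a minimal illustration: the coefficient values are made up for demonstration (they satisfy $\sum_i (\alpha_i^* - \alpha_i) = 0$ but are not the solution of an actual optimization), and the function names are assumptions of this example.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, alphas_star, xs):
    """w as a linear combination of the training instances: sum_i (alpha_i - alpha*_i) x_i."""
    dim = len(xs[0])
    w = [0.0] * dim
    for a, a_star, x in zip(alphas, alphas_star, xs):
        for k in range(dim):
            w[k] += (a - a_star) * x[k]
    return w


def predict_dual(x, alphas, alphas_star, xs, b):
    """f(x) computed from inner products with the training instances only."""
    return sum((a - a_star) * dot(xi, x)
               for a, a_star, xi in zip(alphas, alphas_star, xs)) + b


# Illustrative values; the third instance has zero coefficients,
# so it is not a support vector and does not influence the prediction.
xs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
alphas = [0.5, 0.0, 0.0]
alphas_star = [0.0, 0.5, 0.0]
b = 0.1
x_new = [2.0, 3.0]

w = weight_vector(alphas, alphas_star, xs)
primal_prediction = dot(w, x_new) + b
dual_prediction = predict_dual(x_new, alphas, alphas_star, xs, b)
print(w, primal_prediction, dual_prediction)
```

Both predictions agree up to floating-point rounding, and removing the third training instance leaves the result unchanged because its coefficients are zero.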

Equation 5.51 additionally allows us to extend SVR to non-linear functions by applying a mapping function $\Phi : \mathcal{X} \to \mathcal{F}$ which transforms instances of the input space $\mathcal{X}$ to some higher-dimensional feature space $\mathcal{F}$. The inner product now takes the form $\langle \Phi(x_i), \Phi(x) \rangle$, which can be considered a similarity function of the instances in the transformed space $\mathcal{F}$. However, the kernel trick makes it possible to avoid the explicit computation of the transformation and the inner product in $\mathcal{F}$. If a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfies Mercer's condition, i.e. as long as it is positive definite, a transformation exists such that the value of the kernel function computed for two instances is equal to the inner product of the instances in the transformed feature space (Mercer, 1909). Typical examples of such kernel functions are

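$$\begin{aligned} \text{polynomial:} \quad & k(x_i, x_j) = (\gamma \langle x_i, x_j \rangle)^d, \quad \gamma > 0, \\ \text{radial basis:} \quad & k(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2), \quad \gamma > 0, \\ \text{sigmoid:} \quad & k(x_i, x_j) = \tanh(\gamma \langle x_i, x_j \rangle). \end{aligned} \tag{5.52}$$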
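The kernel functions of Equation 5.52 translate directly into code. The Python sketch below implements them as written above (without any additional offset term) and plugs a kernel into the prediction function in place of the inner product. Function names, default parameter values and the example coefficients are assumptions made for illustration only.

```python
import math


def dot(u, v):
    """Inner product <u, v> of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(xi, xj, gamma=1.0, degree=2):
    """k(xi, xj) = (gamma * <xi, xj>)^degree with gamma > 0."""
    return (gamma * dot(xi, xj)) ** degree


def rbf_kernel(xi, xj, gamma=1.0):
    """k(xi, xj) = exp(-gamma * ||xi - xj||^2) with gamma > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(xi, xj, gamma=1.0):
    """k(xi, xj) = tanh(gamma * <xi, xj>)."""
    return math.tanh(gamma * dot(xi, xj))


def predict_kernel(x, alphas, alphas_star, xs, b, kernel):
    """Prediction with the inner product <xi, x> replaced by kernel(xi, x)."""
    return sum((a - a_star) * kernel(xi, x)
               for a, a_star, xi in zip(alphas, alphas_star, xs)) + b


# Illustrative coefficients, not the result of an actual optimization.
xs = [[0.0], [1.0]]
alphas = [0.5, 0.0]
alphas_star = [0.0, 0.5]
prediction = predict_kernel([0.0], alphas, alphas_star, xs, 0.0, rbf_kernel)
print(prediction)
```

Passing `dot` as the `kernel` argument recovers the linear case, which shows that the kernelized prediction only differs in how the similarity of two instances is computed.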

In summary, when applying SVR we have to select an appropriate kernel function along with its parameterization and a cost $C$ for prediction errors. Prior to our experiments we therefore tested different parameterizations of SVR, which are described in more detail in Section 5.4.2.

Treatment of missing data. In our experiments we apply SVR to predict missing values of the variable $Y = (Y_1, \ldots, Y_q)$. We iteratively predict missing values of one variable $Y_j$ using the remaining $(q - 1)$ variables $(Y_k \mid k = 1, \ldots, q,\ k \neq j)$ and possibly additional sociodemographic variables $X$ as independent variables. The training data set then comprises all entities for which $Y_j$ is observed. However, the variables $Y_k$ may also contain missing values, which cannot be treated with standard SVR. We therefore perform a secondary imputation step prior to the application of SVR, during which we temporarily impute missing values for the variables $Y_k$ with $k = 1, \ldots, q,\ k \neq j$. The secondary imputation is a simple mean substitution in which we replace missing values with average values. We implemented two variants of mean substitution, which we call vertical and horizontal mean substitution. The terms vertical and horizontal indicate the direction in the data matrix over which the average is formed, i.e. over a column or a row. During vertical mean substitution (VMS) we replace missing values of $Y_k$ with an average of the same variable, possibly subject to conditioning on sociodemographic characteristics of the entity of interest. During horizontal mean substitution (HMS) we form averages for $Y_k$ over all observed values of $Y_1, \ldots, Y_q$ for a given entity. We tested both types of mean substitution during parameter tuning of SI-SVR. The results are given in Section 5.4.2. We perform both imputation steps of SI-SVR independently for all variables $Y_1, \ldots, Y_q$, leading to the approach depicted in Algorithm 2.

Algorithm 2: Single imputation via SVR (SI-SVR)

Input:
    X = (X_1, …, X_p)    // data set of completely observed variables
    Y = (Y_1, …, Y_q)    // data set of partially observed variables
    θ                    // SVR parameterization
    ϕ                    // mean substitution parameterization
Output:
    visit potential quantities for data set (X, Y)

for j = 1 to q do
    // determine set of independent partially missing variables
    Y_{1..q \ j} = (Y_k | k = 1..q, k ≠ j)
    // temporarily impute missing values
    Y_{1..q \ j}^(mis) = applyMeanSubstitution(X, Y, ϕ)
    // determine training and prediction data set
    (X, Y_{1..q \ j}, Y_j)_train = { (x_i1, …, x_ip, y_i1, …, y_iq) | y_ij observed, i = 1..n }
    (X, Y_{1..q \ j})_predict = { (x_i1, …, x_ip, y_ik) | y_ij missing, i = 1..n }
    // train SVR and impute missing values
    h = trainSVR(θ, (X, Y_{1..q \ j}, Y_j)_train)
    Y_j* = ( Y_j^(obs), applySVR(h, (X, Y_{1..q \ j})_predict) )
end
calculate visit potential quantities on observed and imputed values Y* = (Y_1*, …, Y_q*)
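The structure of Algorithm 2 can be sketched in a few lines of Python. The sketch below is an illustration under several assumptions and not the implementation used in the experiments: missing values are encoded as `None`, the mean substitution is unconditional (no conditioning on sociodemographic characteristics), every column and every entity has at least one observed value, and the base learner is passed in as a function. Since the schema allows the base learner to be exchanged, the example uses a trivial hypothetical stand-in (`train_mean_learner`) where a real SVR training routine would be plugged in. All names are hypothetical.

```python
def vertical_mean_substitution(rows, j):
    """VMS: fill missing Y_k (k != j) with the mean of column k."""
    filled = [list(r) for r in rows]
    for k in range(len(rows[0])):
        if k == j:
            continue
        observed = [r[k] for r in rows if r[k] is not None]
        mean = sum(observed) / len(observed)  # assumes >= 1 observed value per column
        for r in filled:
            if r[k] is None:
                r[k] = mean
    return filled


def horizontal_mean_substitution(rows, j):
    """HMS: fill missing Y_k (k != j) with the mean of the entity's observed values."""
    filled = [list(r) for r in rows]
    for r_in, r_out in zip(rows, filled):
        observed = [v for v in r_in if v is not None]
        mean = sum(observed) / len(observed)  # assumes >= 1 observed value per entity
        for k, v in enumerate(r_in):
            if k != j and v is None:
                r_out[k] = mean
    return filled


def train_mean_learner(features, targets):
    """Hypothetical stand-in for trainSVR: always predicts the training mean."""
    mean = sum(targets) / len(targets)
    return lambda feature_row: mean


def single_imputation(xs, ys, train, substitute):
    """Two-step single imputation: temporary mean substitution, then prediction of Y_j."""
    n, q = len(ys), len(ys[0])
    completed = [list(r) for r in ys]
    for j in range(q):
        # Step 1: temporarily impute the independent variables Y_k, k != j.
        temp = substitute(ys, j)
        features = [list(xs[i]) + [temp[i][k] for k in range(q) if k != j]
                    for i in range(n)]
        # Split entities by whether Y_j is observed.
        train_idx = [i for i in range(n) if ys[i][j] is not None]
        predict_idx = [i for i in range(n) if ys[i][j] is None]
        if not train_idx or not predict_idx:
            continue
        # Step 2: train the base learner and impute the missing values of Y_j.
        model = train([features[i] for i in train_idx],
                      [ys[i][j] for i in train_idx])
        for i in predict_idx:
            completed[i][j] = model(features[i])
    return completed


xs = [[1.0], [2.0], [3.0]]
ys = [[1.0, None], [None, 4.0], [3.0, 6.0]]
completed = single_imputation(xs, ys, train_mean_learner, vertical_mean_substitution)
print(completed)
```

Each variable is imputed from the original data `ys` rather than from previously imputed values, which mirrors the statement that both imputation steps are performed independently for all variables. Swapping `vertical_mean_substitution` for `horizontal_mean_substitution` changes only the temporary imputation step.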
