
Based on inverse linear programming, described in Sect. 2.6, we develop a novel approach for learning parameters in graphical models.

The learning approach we propose in this section consists of two independent phases: (i) model parameter perturbation, where we propose a novel inverse linear programming approach for learning model parameters, and (ii) model parameter prediction based on the results from (i), where any model parameter prediction method can be used.

4.2.1. Model Parameter Perturbation

In this section we apply the inverse linear programming approach of [ZL96, AO01] to the problem of learning the model parameters of graphical models. For the required background on the minimal representation and the exponential family model we refer to Sect. 2.3.

We consider the local polytope relaxation of the MAP inference problem, as given by

$$\min_{\mu \in \mathcal{LM}_G} \langle \hat\theta, \mu \rangle, \tag{4.6}$$

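where the local polytope $\mathcal{LM}_G$ is defined by the so-called minimal representation,

$$\mathcal{LM}_G := \left\{ \mu \;\middle|\; \begin{array}{ll} \mu \ge 0, & \\ -\mu_{ij} + \mu_i \ge 0, & i \in V,\ ij \in E, \\ -\mu_{ij} + \mu_j \ge 0, & i \in V,\ ij \in E, \\ -\mu_i - \mu_j + \mu_{ij} \ge -1, & i \in V,\ ij \in E \end{array} \right\}. \tag{4.7}$$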

The second and third constraints come from the probabilistic interpretation for the binary case of two labels 0 and 1 with variables $x_i$, that is, $\mu_i = P(x_i = 1)$ and $\mu_{ij} = P(x_i = 1 \wedge x_j = 1)$. In this section we assume the minimal representation when referring to the local polytope representation. Furthermore, for compactness we write the constraints (4.7) as a linear inequality system, i.e. $A\mu \ge b$.
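As an illustration of the inequality system $A\mu \ge b$, the following minimal sketch assembles $A$ and $b$ for a small binary model. It rests on several assumptions that are not fixed by the text: the ordering $\mu = (\mu_1, \dots, \mu_n, \mu_e \text{ for } e \in E)$, the third constraint read as $-\mu_i - \mu_j + \mu_{ij} \ge -1$ (equivalently $\mu_i + \mu_j - \mu_{ij} \le 1$), non-negativity $\mu \ge 0$ kept outside of $A$, and the hypothetical helper names `local_polytope_constraints` and `labeling_to_mu`.

```python
import numpy as np

def local_polytope_constraints(n, edges):
    """Return (A, b) such that A @ mu >= b encodes the edge constraints.

    Assumed ordering: mu = (mu_1, ..., mu_n, mu_e for e in edges).
    Non-negativity mu >= 0 is NOT part of A; it is handled separately.
    """
    m = len(edges)
    A = np.zeros((3 * m, n + m))
    b = np.zeros(3 * m)
    for e, (i, j) in enumerate(edges):
        r = 3 * e
        A[r, i], A[r, n + e] = 1.0, -1.0          # mu_i - mu_ij >= 0
        A[r + 1, j], A[r + 1, n + e] = 1.0, -1.0  # mu_j - mu_ij >= 0
        A[r + 2, i] = A[r + 2, j] = -1.0          # -mu_i - mu_j + mu_ij >= -1
        A[r + 2, n + e] = 1.0
        b[r + 2] = -1.0
    return A, b

def labeling_to_mu(x, edges):
    """Map a binary labeling x to mu: mu_i = x_i, mu_ij = x_i * x_j."""
    x = np.asarray(x, dtype=float)
    pair = np.array([x[i] * x[j] for i, j in edges])
    return np.concatenate([x, pair])

# Chain with three nodes and two edges (0-based node indices).
edges = [(0, 1), (1, 2)]
A, b = local_polytope_constraints(3, edges)
mu = labeling_to_mu([1, 0, 1], edges)
feasible = bool(np.all(A @ mu >= b) and np.all(mu >= 0))
```

Every integral labeling mapped through `labeling_to_mu` should satisfy the system, which gives a quick sanity check on the sign conventions.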

In the following we follow the inverse linear programming approach from [ZL96, AO01]. We want a ground-truth segmentation $\mu^*$ to be an optimum of the labeling problem after perturbing an initial vector $\hat\theta$ by $\theta$:

$$\mu^* \in \arg\min_{\mu \in \mathcal{LM}_G} \langle \hat\theta + \theta, \mu \rangle. \tag{4.8}$$

The initial model parameter vector $\hat\theta$ can be obtained with any learning method.

Our original problem (4.6) is an LP, and using the results from Sect. 2.6 we can formulate the inverse LP by first constructing the dual and deriving the complementary slackness conditions. This will allow us to derive the optimal perturbation $\theta$. In the following we denote $n := |V|$ and $m := |E|$.

4.2. invLPA: Inverse Linear Programming Approach

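Let us define the set of all parameter vectors $\tilde\theta$ that correspond to the ground-truth segmentation $\mu^*$,

$$\Theta(\mu^*) := \Big\{ \tilde\theta \in \mathbb{R}^{m+n} \;\Big|\; \min_{\mu \in \mathcal{LM}_G} \langle \tilde\theta, \mu \rangle = \langle \tilde\theta, \mu^* \rangle \Big\}. \tag{4.9}$$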

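We want to perturb the given initial vector $\hat\theta \notin \Theta(\mu^*)$ minimally in the sense of the $\ell_1$ distance, i.e. we seek the perturbation $\theta = \tilde\theta - \hat\theta$ with

$$\tilde\theta \in \arg\min \big\{ \|\tilde\theta - \hat\theta\|_1 \;\big|\; \tilde\theta \in \Theta(\mu^*) \big\}. \tag{4.10}$$

Next we summarize the primal-dual formulation of (4.6) and use the inequality form $A\mu \ge b$ to represent the local polytope: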

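$$\begin{array}{llr}
\textbf{Primal:} & \textbf{Dual:} & \\[2pt]
\min_{\mu}\ \langle \theta, \mu \rangle, & \max_{\nu}\ \langle b, \nu \rangle & (4.11\text{a}) \\[2pt]
\text{s.t. } A\mu \ge b,\ \mu \ge 0 & \text{s.t. } A^\top \nu \le \theta,\ \nu \ge 0. & (4.11\text{b})
\end{array}$$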
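The primal-dual pair can be checked numerically on a toy instance. The sketch below is an illustration only: it assumes a single edge ($n = 2$, $m = 1$, $\mu = (\mu_1, \mu_2, \mu_{12})$), the last constraint of the local polytope read as $-\mu_1 - \mu_2 + \mu_{12} \ge -1$, an arbitrary illustrative `theta`, and SciPy's `linprog` with the HiGHS backend as the solver.

```python
import numpy as np
from scipy.optimize import linprog

# Single-edge local polytope in the form A @ mu >= b (assumed ordering and signs).
A = np.array([[1.0, 0.0, -1.0],    # mu_1 - mu_12 >= 0
              [0.0, 1.0, -1.0],    # mu_2 - mu_12 >= 0
              [-1.0, -1.0, 1.0]])  # -mu_1 - mu_2 + mu_12 >= -1
b = np.array([0.0, 0.0, -1.0])
theta = np.array([-1.0, 1.0, -0.5])  # illustrative parameter vector

# Primal: min <theta, mu> s.t. A mu >= b, mu >= 0 (linprog expects A_ub x <= b_ub).
primal = linprog(theta, A_ub=-A, b_ub=-b, bounds=(0, None), method="highs")

# Dual: max <b, nu> s.t. A^T nu <= theta, nu >= 0 (solved as min <-b, nu>).
dual = linprog(-b, A_ub=A.T, b_ub=theta, bounds=(0, None), method="highs")

primal_value = primal.fun
dual_value = -dual.fun
mu_opt = primal.x
```

By strong duality the two optimal values coincide, which is the property the complementary slackness argument below relies on.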

Let us denote with $\mu^*$ the optimal solution to the primal problem. Furthermore, let $\nu$ and $\nu_\mu$ be the feasible dual variables to the optimal $\mu^*$, corresponding to the primal constraints $A\mu \ge b$ and $\mu \ge 0$, respectively. Since we are considering a convex problem, the necessary and sufficient optimality conditions are given by the Karush-Kuhn-Tucker conditions [BV04]:

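$$\begin{array}{llr}
\text{stationarity} & A^\top \nu + \nu_\mu = \theta & (4.12\text{a}) \\
\text{primal feasibility} & A\mu^* \ge b,\ \mu^* \ge 0 & (4.12\text{b}) \\
\text{dual feasibility} & \nu \ge 0,\ \nu_\mu \ge 0 & (4.12\text{c}) \\
\text{complementary slackness} & \langle \nu, A\mu^* - b \rangle = 0,\ \langle \nu_\mu, \mu^* \rangle = 0. & (4.12\text{d})
\end{array}$$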

Let us define the following sets

$$I := \{ i \in [\dim(b)] : (A\mu^* - b)_i > 0 \} \quad \text{and} \tag{4.13a}$$

$$J := \{ j \in [n + m] : \mu^*_j > 0 \}. \tag{4.13b}$$

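Since $\nu \ge 0$, $\nu_\mu \ge 0$, $A\mu^* - b \ge 0$ and $\mu^* \ge 0$, every term of the inner products in (4.12d) is non-negative and must therefore vanish individually. Hence the complementary slackness conditions imply $\nu_i = 0$ for all $i \in I$ and likewise $(\nu_\mu)_j = 0$ for all $j \in J$.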
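The index sets $I$ and $J$ are simple to compute once $A$, $b$ and $\mu^*$ are available. A minimal sketch, assuming 0-based indices, a numerical tolerance `tol` in place of the strict inequality, and the same single-edge toy instance as a hypothetical example:

```python
import numpy as np

def slack_and_support_sets(A, b, mu_star, tol=1e-9):
    """Return (I, J): indices of inactive constraints and of positive entries of mu_star."""
    I = np.flatnonzero(A @ mu_star - b > tol)  # (A mu* - b)_i > 0, so nu_i must be 0
    J = np.flatnonzero(mu_star > tol)          # mu*_j > 0, so (nu_mu)_j must be 0
    return I, J

# Single edge, mu = (mu_1, mu_2, mu_12); ground truth x = (1, 0).
A = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0],
              [-1.0, -1.0, 1.0]])
b = np.array([0.0, 0.0, -1.0])
mu_star = np.array([1.0, 0.0, 0.0])
I, J = slack_and_support_sets(A, b, mu_star)
```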

Based on the primal problem in (4.11) and using the concept of inverse linear programming (see Sect. 2.6), we derive the main result of this section.

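Proposition 4.2.1. Let $n = |V|$, $m = |E|$ and let $\hat\theta \in \mathbb{R}^{n+m}$ be a given model parameter. Suppose the local polytope based on the minimal problem representation is given by

$$\mathcal{LM}_G = \{ \mu \in \mathbb{R}^{n+m}_+ : A\mu \ge b \}. \tag{4.14}$$

Then a minimal $\ell_1$-norm perturbation $\theta \in \mathbb{R}^{n+m}$ of $\hat\theta$ such that $\mu^* \in \arg\min\{ \langle \hat\theta + \theta, \mu \rangle : \mu \in \mathcal{LM}_G \}$ is a solution to the linear program

$$\begin{array}{lr}
\min_{\theta, \nu_\mu, \nu}\ \|\theta\|_1 \quad \text{s.t.} \quad A^\top \nu + \nu_\mu = \hat\theta + \theta, & (4.15\text{a}) \\
\theta \in \mathbb{R}^{n+m},\ \nu_\mu \in \mathbb{R}^{n+m}_+,\ \nu \in \mathbb{R}^{\dim(b)}_+,\ \nu_I = 0,\ (\nu_\mu)_J = 0, & (4.15\text{b}) \\
I := \{ i \in [\dim(b)] : (A\mu^* - b)_i > 0 \},\ J := \{ j \in [n + m] : \mu^*_j > 0 \}. & (4.15\text{c})
\end{array}$$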

Proof. Given an initial $\hat\theta$, we want to adjust (correct) it so that $\mu^*$ is an optimal solution to the primal problem in (4.11) for the final parameter $\hat\theta + \theta$. From the theory of inverse linear programming [ZL96, AO01] we know that this amounts to solving the problem (4.10), which can be written as an LP. Moreover, the necessary KKT optimality conditions are also sufficient, since the objective is linear (and hence convex). It follows that the minimal-norm perturbation $\theta$ can be found by replacing $\theta$ by $\hat\theta + \theta$ in (4.12), which leads to the constraints in (4.15).

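The problem in (4.15) is not a linear program in the form written above, but it can easily be converted into one:

$$\begin{array}{lr}
\min_{\nu, \nu_\mu \ge 0,\ \theta}\ \|\theta\|_1 \quad \text{s.t.} & (4.16\text{a}) \\
A^\top \nu + \nu_\mu - \theta = \hat\theta & (4.16\text{b}) \\
D_1(\mu^*)\, \nu = 0 & (4.16\text{c}) \\
D_2(\mu^*)\, \nu_\mu = 0 & (4.16\text{d})
\end{array}$$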

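where $D_1$ and $D_2$ are the diagonal matrices that correspond to the constraints from (4.15), i.e.

$$D_1(\mu^*)_{ii} := \begin{cases} 1, & \text{if } i \in I \\ 0, & \text{otherwise} \end{cases} \qquad i \in [\dim(b)], \tag{4.17a}$$

$$D_2(\mu^*)_{jj} := \begin{cases} 1, & \text{if } j \in J \\ 0, & \text{otherwise} \end{cases} \qquad j \in [n + m]. \tag{4.17b}$$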

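Problem (4.16) above is equivalent to the linear program

$$\min_{\nu, \nu_\mu, \theta^+, \theta^- \ge 0}\ \langle \mathbf{1}, \theta^+ + \theta^- \rangle \quad \text{subject to} \tag{4.18a}$$

$$\begin{pmatrix} A^\top & I_{n+m} & -I_{n+m} & I_{n+m} \\ D_1(\mu^*) & 0 & 0 & 0 \\ 0 & D_2(\mu^*) & 0 & 0 \end{pmatrix} \begin{pmatrix} \nu \\ \nu_\mu \\ \theta^+ \\ \theta^- \end{pmatrix} = \begin{pmatrix} \hat\theta \\ 0 \\ 0 \end{pmatrix} \tag{4.18b}$$

where $\theta^+ = \max\{\theta, 0\}$ and $\theta^- = -\min\{\theta, 0\}$. The linear program above can be solved with a linear programming solver, for example MOSEK [ApS15].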
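The linear program (4.18) can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the implementation used here: it swaps in SciPy's `linprog` (HiGHS) for MOSEK, keeps only the nonzero rows of $D_1(\mu^*)$ and $D_2(\mu^*)$ (the dropped rows read $0 = 0$), uses a tolerance `tol` for the strict inequalities defining $I$ and $J$, and the function name `inverse_lp_perturbation` as well as the single-edge toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def inverse_lp_perturbation(A, b, theta_hat, mu_star, tol=1e-9):
    """Solve (4.18) for the minimal l1-norm perturbation theta."""
    d, p = A.shape                               # d = dim(b), p = n + m
    I = np.flatnonzero(A @ mu_star - b > tol)    # inactive constraints
    J = np.flatnonzero(mu_star > tol)            # positive entries of mu*
    Ip = np.eye(p)
    # Variable order: z = (nu, nu_mu, theta_plus, theta_minus), all >= 0.
    top = np.hstack([A.T, Ip, -Ip, Ip])          # stationarity block of (4.18b)
    D1 = np.hstack([np.eye(d)[I], np.zeros((len(I), 3 * p))])                  # nu_I = 0
    D2 = np.hstack([np.zeros((len(J), d)), Ip[J], np.zeros((len(J), 2 * p))])  # (nu_mu)_J = 0
    A_eq = np.vstack([top, D1, D2])
    b_eq = np.concatenate([theta_hat, np.zeros(len(I) + len(J))])
    c = np.concatenate([np.zeros(d + p), np.ones(2 * p)])  # <1, theta_plus + theta_minus>
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    z = res.x
    return z[d + p:d + 2 * p] - z[d + 2 * p:]    # theta = theta_plus - theta_minus

# Toy instance: single edge, mu = (mu_1, mu_2, mu_12), ground truth x = (1, 0).
A = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0],
              [-1.0, -1.0, 1.0]])
b = np.array([0.0, 0.0, -1.0])
theta_hat = np.array([1.0, 1.0, 0.0])
mu_star = np.array([1.0, 0.0, 0.0])
theta = inverse_lp_perturbation(A, b, theta_hat, mu_star)
theta_tilde = theta_hat + theta
```

In this toy instance the all-zero labeling is cheaper than $\mu^*$ under $\hat\theta$, so the unary cost of node 1 has to be lowered until $\mu^*$ becomes an optimum of the perturbed problem.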

4.2.2. Model Parameter Prediction

In the previous section we saw how to compute the perturbation potentials $\theta$, given the ground-truth labelings $\mu^*$. In this section we describe the second phase of invLPA, which is completely independent of the first one. Here we use the perturbation potentials computed in the first phase to predict new potentials based on novel data features. Our training data comprises the learned perturbed potentials $\tilde\theta_k = \hat\theta + \theta_k$ and the corresponding features $f_k$ (unary or pairwise), $k \in [N]$. Note that the outcome of the invLPA (4.18) is a vector $\theta = (\dots, \theta_k, \dots)$, where $\theta_k$ is a scalar value which is the perturbation value for the corresponding node $k$ (for the case of learning unary potentials).

We are free to choose any model prediction method that returns a model parameter vector $\theta$ based on observed novel features. However, in this work we limit ourselves to simple linear prediction methods and a nonlinear (NL) Gaussian regression method that is able to capture a richer model structure.

Linear Prediction

For linear prediction we consider two common methods, linear least-squares (LS) and a sparse $\ell_1$-norm approach. Furthermore, we assume a linear dependency of our potentials on a vector $w$, i.e.

$$\theta_k = \langle f_k, w \rangle, \quad k \in [N], \tag{4.19}$$

where $f_k, w \in \mathbb{R}^{\dim f_k}$. Here $\theta_k$ are the potentials of a discrete graphical model, see (4.1), and can be either unary or pairwise ones. Furthermore, $f_k$ represent the corresponding features. In the next step, the perturbed model parameters $\tilde\theta_k = \hat\theta + \theta_k$, $k \in [N]$, obtained by solving (4.15) are fit to the observed features $f_k$. To this end we collect all our corrected model parameters learned in the first phase and fit linearly parametrized model parameters to them. We do this with two linear fitting methods as described next.

Least-squares fitting: We set up an overdetermined system and solve

$$\min_{w} \sum_{k \in [N]} |\langle f_k, w \rangle - \tilde\theta_k|^2. \tag{4.20}$$

Note that the features $f_k$ in the equation above are the ones corresponding to the perturbed potentials $\tilde\theta_k$.
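A minimal sketch of the least-squares fit (4.20) and of the subsequent prediction via (4.19), assuming the features are stacked row-wise into a matrix `F` (one row per $f_k$) and using NumPy's `lstsq`. The function name `fit_least_squares` and the toy data are hypothetical.

```python
import numpy as np

def fit_least_squares(F, theta_tilde):
    """Minimize sum_k |<f_k, w> - theta_tilde_k|^2 over w, with the f_k as rows of F."""
    w, *_ = np.linalg.lstsq(F, theta_tilde, rcond=None)
    return w

# Toy training data: three feature vectors in R^2 and their perturbed potentials.
F = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
w_true = np.array([2.0, -1.0])
theta_tilde = F @ w_true           # noise-free targets, for illustration only

w = fit_least_squares(F, theta_tilde)

# Prediction for a novel feature vector as in (4.19): theta = <f, w>.
f_new = np.array([3.0, 1.0])
theta_new = f_new @ w
```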

$\ell_1$-norm fitting: In addition to the smooth least-squares approach, we also apply the sparse regularization approach

$$\min_{w, s_k}\ \|w\|_1 + \lambda \sum_{k \in [N]} |s_k|, \quad \text{s.t.} \quad \langle f_k, w \rangle - s_k = \tilde\theta_k, \quad k \in [N], \tag{4.21}$$

where $\lambda > 0$ is some parameter. Here again $f_k$ are the features corresponding to $\tilde\theta_k$ in the training data.
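Problem (4.21) becomes a standard LP once $w$ and $s$ are split into non-negative parts. The sketch below shows one such reformulation, assuming the splits $w = w^+ - w^-$ and $s = s^+ - s^-$, SciPy's `linprog` (HiGHS) as the solver, and a hypothetical function name `fit_l1`; the choice `lam=10.0` is purely illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def fit_l1(F, theta_tilde, lam):
    """Solve min ||w||_1 + lam * sum_k |s_k| s.t. F w - s = theta_tilde."""
    N, dim = F.shape
    I_N = np.eye(N)
    # Variable order: z = (w_plus, w_minus, s_plus, s_minus), all >= 0.
    c = np.concatenate([np.ones(2 * dim), lam * np.ones(2 * N)])
    A_eq = np.hstack([F, -F, -I_N, I_N])   # F (w+ - w-) - (s+ - s-) = theta_tilde
    res = linprog(c, A_eq=A_eq, b_eq=theta_tilde, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    w = res.x[:dim] - res.x[dim:2 * dim]
    s = res.x[2 * dim:2 * dim + N] - res.x[2 * dim + N:]
    return w, s

# Toy training data, same layout as for least squares: rows of F are the f_k.
F = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
w_true = np.array([2.0, -1.0])
theta_tilde = F @ w_true
w, s = fit_l1(F, theta_tilde, lam=10.0)
```

With noise-free targets and a large $\lambda$ the residuals $s_k$ are driven to zero; smaller values of $\lambda$ trade fitting accuracy for a sparser $w$.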

After having found one fitting vector $w$ for all training data, we can use it, together with the novel features of the test data, to construct new predicted linearized model parameter vectors as in (4.19).

Nonlinear Prediction

To demonstrate the flexibility of our method we also apply a nonlinear prediction method, namely Gaussian regression [RW06], which can capture more of the model structure than simple linear methods. The Gaussian prediction model for obtaining the predicted potentials $\theta(f)$ is given in the form

$$\begin{array}{lr}
\theta(f) := k_N(f)^\top \big( K(F) + \sigma_n^2 I \big)^{-1} \tilde\theta \qquad (\sigma_n^2 \text{ is a parameter}) & (4.22\text{a}) \\
\phantom{\theta(f) :}= \sum_{k \in [N]} w_k(F, \tilde\theta)\, k(f_k, f), \qquad w(F, \tilde\theta) := \big( K(F) + \sigma_n^2 I \big)^{-1} \tilde\theta, & (4.22\text{b})
\end{array}$$

where $K(F)$ is the covariance matrix induced by the training data, that is

$$K(F) = \big( k(f_k, f_l) \big)_{k, l \in [N]} \tag{4.23}$$

where

$$k(f_k, f_l) := \sigma_m^2 \exp\Big( -\frac{1}{2\sigma_f^2} \|f_k - f_l\|^2 \Big) \qquad (\sigma_f^2, \sigma_m^2 \text{ are parameters}), \tag{4.24}$$

and

$$k_N(f) := \big( k(f_1, f), \dots, k(f_N, f) \big)^\top \tag{4.25}$$

evaluates, for any novel feature vector $f$, the kernel function using all given feature vectors $f_k$, $k \in [N]$, of the training data. Thus, given a novel image with feature vector $f$, the corresponding model parameter is $\theta = \theta(f)$.
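The predictor (4.22) with the kernel (4.24) can be sketched directly in NumPy. This is a minimal illustration, not a tuned implementation: the function names `se_kernel`, `gp_fit` and `gp_predict`, the default parameter values, and the toy data are assumptions, and the linear system is solved with `np.linalg.solve` instead of forming the inverse explicitly.

```python
import numpy as np

def se_kernel(X, Y, sigma_m, sigma_f):
    """Kernel (4.24) between all rows of X and all rows of Y."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return sigma_m**2 * np.exp(-sq_dist / (2.0 * sigma_f**2))

def gp_fit(F, theta_tilde, sigma_m=1.0, sigma_f=1.0, sigma_n=0.1):
    """Weights w(F, theta_tilde) = (K(F) + sigma_n^2 I)^{-1} theta_tilde from (4.22b)."""
    K = se_kernel(F, F, sigma_m, sigma_f)
    return np.linalg.solve(K + sigma_n**2 * np.eye(len(F)), theta_tilde)

def gp_predict(F, w, f_new, sigma_m=1.0, sigma_f=1.0):
    """theta(f) = k_N(f)^T w for one or several novel feature vectors (rows of f_new)."""
    k_N = se_kernel(np.atleast_2d(f_new), F, sigma_m, sigma_f)
    return k_N @ w

# Toy training data: three one-dimensional features with perturbed potentials.
F = np.array([[0.0], [1.0], [2.0]])
theta_tilde = np.array([0.0, 1.0, 0.0])
w = gp_fit(F, theta_tilde, sigma_n=1e-4)
theta_train = gp_predict(F, w, F)                  # predictions at the training features
theta_new = gp_predict(F, w, np.array([1.5]))[0]   # prediction for a novel feature
```

With a very small noise parameter the predictor nearly interpolates the training potentials; larger values of `sigma_n` smooth the fit.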

One drawback of Gaussian regression is its cubic complexity in the number of training samples $N$. However, many solutions to this problem have been suggested in the literature, all of which approximate the full model, e.g. by using a sparse Gaussian regression and selecting a subset of the training data at random. Another option is to approximate the matrix $K(F) + \sigma_n^2 I$ with a low-rank matrix. In contrast, the Bayesian Committee machine [Tre00, ST02] splits the whole data into random subsets (clusters) and predicts subset-wise, while considering the testing data when making a prediction.

An approach similar to the Bayesian Committee machine was proposed in [ND14], which considers the inductive property instead of the transductive one. The training data is split at random into smaller subsets and exact inference is performed on each of these subsets. In this way the computation time can be reduced if we can parallelize the inference. In addition, the complexity can be significantly reduced. In fact, the complexity is now linear in the number of data points $N$. If we consider splitting into $M$ subsets, then for each subset we have to invert a matrix of size $(N/M) \times (N/M)$, and we can write $M = N/\alpha$ for some scalar $\alpha$.
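To make the linear scaling explicit, here is a short worked estimate. It assumes (these assumptions are not stated in the text) equally sized subsets of $\alpha = N/M$ points each and a dense inversion cost that is cubic in the matrix size:

$$\underbrace{M}_{\text{subsets}} \cdot \underbrace{O(\alpha^3)}_{\text{one inversion}} = \frac{N}{\alpha} \cdot O(\alpha^3) = O(N \alpha^2), \qquad \text{compared with } O(N^3) \text{ for the full model}.$$

For a fixed subset size $\alpha$ the total cost therefore grows linearly in $N$, and the $M$ inversions are independent of each other, so they can be carried out in parallel.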
