Talleres avanzados (invierno y primavera 2011)

Learning is the dual problem to inference. In inference we saw that given the canonical parameters θ we infer the mean parameters µ. In learning we have given the mean parameters µ which represent a subset of our data, the training data, from which we learn the canonical parameters θ, also called model parameters. Just as inference, learning is a non-trivial problem, however, there are efficient approximations for learning as well.

There are different approaches to learning. The most natural way to solve the learning problem is with MLE as we saw before, see (2.70). Due to the close connection of MLE to Kullback-Leibler (KL) divergence [KL51] the learning problem has been also formulated in other ways.

Depending on how the objective function of the learning problem is formulated, (approximate) learning can be divided into two classes: probabilistic parameter learning which is MLE with different type of approximations, and loss minimizing parameter learning, when the learning problem is defined as minimizing a loss function defined as the cost of predicting approximate mean or canonical parameters on data on which the correct mean parameters are given. Basically the loss function measures some distance between predicted and correct mean or canonical parameters.

In the following sections we describe these two types of learning.

Many other approaches not discussed fall into this categorization. For more on this subject we refer to [NL11].

2.5.1. Probabilistic Parameter Learning Given empirical mean parameters ˆµ = _n1Pn

i=1φ(xi) ∈ M the problem of learning

canonical parameters, or probabilistic learning, is concerned with the problem sup

θ∈Θ

hθ, ˆµi − A(θ). (2.94)

The equation above is A∗(µ), the conjugate dual to the log-partition function which is finite for mean parameters µ ∈ M due to the result of Theorem 2.3.6. Solving the learning problem or optimizing (2.94) is equivalent to computing ∇A∗. This is computationally intractable in the general case. The reason is the high complexity of the marginal polytope space parameters as well as of the lack of a closed form

2.5. Learning

expression of A∗ in the general case. Exceptions are tree structured graphs where junction tree theory applies, see [Lau96]. In the case of tree structured graphs the log-partition function can be written in closed form. When the graph is a tree its probability function can be decomposed as

p_θ(µ)(x) := Y i∈V µi(xi) Y ij∈E µij(xi, xj) µi(xi)µj(xj) , (2.95)

a result from the junction tree theory. When the probability can be decomposed as above A∗ can be written in closed form. Using this probability distribution decomposition rule and the result from2.3.6, that A∗ is equivalent to the negative entropy for µ ∈ rint M, we have

A∗(µ) = −H(p_θ(µ)_{) = −E}_pθ(µ)[log p_θ(µ)(X)] (2.96a)

= −X i∈V Hi(µi) + X ij∈E Iij(µij) (2.96b)

where the singleton entropy Hi(µi) := −Pxiµi(xi) log µi(xi), for i ∈ V and the

pairwise information Iij :=Pxixjµij(xi, xj) log_µiµij_(x(xi,xj) i)µj(xj).

For general graphs however due to intractability, approximations have been proposed. When relaxing the inference problem we introduced the approximation of the marginal polytope M(G) with a local polytope L(G), which is tight for trees. For the learning problem in addition to relaxing the marginal polytope, the intractable

A(θ) should be approximated.

In [WJW05a] a strictly convex twice differentiable approximation to A∗ was de- signed so that the domain of the approximation of A∗ is contained in the local polytope relaxation of the marginal polytope. Let us denote with B∗ the approxi- mation to −A∗. The motivation for constructing the approximate function comes from the closed form of A∗ on trees. When the general graph can be considered as a combination of tree graphs, then due to (2.96) the functions A∗ on each of the trees would have a closed form expression. In fact any connected graph can be decomposed into spanning trees. Considering a spanning tree decomposition of the connected graph G into trees T ∈ T , A∗ can be considered as a decomposition of A∗ functions in closed form on spanning trees composing the whole graph while weighting every spanning tree T by a probability P (T ) ∈ [0, 1]. Using (2.96) the approximation proposed by Wainwright et al. [WJW05a] also known as convexified

Bethe entropy approximation is given by

B∗ := X T ∈T P (T ) X i∈V Hi(µi) − X ij∈E(T ) Iij(µij) =X i∈V Hi(µi) − X ij∈E(T ) PijIij(µij), (2.97) where P_ij =P

T P (T )IT, also called edge appearance probability, and IT being the

indicator set function, see (2.16). When P_ij > 0 for all ij ∈ E (T ), the approximation B∗ is strictly convex. Using this result the log-partition function A when expressed as in (2.66) can be approximated using the approximation to −A∗, B∗ and the relaxed

marginal polytope, LG. We denote the approximation to A as B defined by

B := max

µ∈L(G)

(hθ, µi + B∗(µ)). (2.98)

It can be shown that when the approximation B∗ is strictly convex and twice continuously differentiable, the approximation B is convex and differentiable and has a unique solution, see [Wai06]. This approximation of the log-partition function is used in approximate methods for computing marginals, in particular in tree reweighted message passing methods (TRW) [WJW05b]. We note that loopy belief propagation methods use the normal Bethe approximation which differes from the convexified version in that the weighting probability P from (2.97) is excluded [YFW05].

Moreover, a learning method which uses a strongly convex approximation to −A∗, (the entropy) was proven to be globally stable [Wai06].

In practice learning is used to predict mean parameters on novel data. First canonical parameters from given empirical mean parameters are learned, for instance with an approximation of the intractable MLE (2.70). Using the introduced approximation to the log-partition function as in [WJW05a] and given empirical mean parameters of given data samples ˆµ = 1_nPn

i=1φ(Xi), approximate MLE, also known as surrogate

likelihood can be computed by maximizing

lB(θ) := hθ, ˆµi − B(θ). (2.99)

Optionally a regularizer term R(θ) can be added and weighted by some λ > 0 to trade off between the data term and the regularizer. Maximizing the surrogate likelihood can be achieved simply with gradient descent. In the following we call this step perturbation as the canonical parameters can be thought as perturbed exact parameters. Based on the learned approximate parameters and given noisy observations the canonical parameters can be fitted to the observed data, in order to compute new predicted parameters. We call this step fitting or prediction. Perturbation and fitting can be done jointly in one step or separately. Finally, using the predicted canonical parameters mean parameters can be computed in an inference step.

Wainwright et al. proved in [Wai06] that it is of great importance to use the same approximate method for computing the approximate parameters and when inferring the mean parameters in the last step. In particular when learning with approximate MLE the same approximation of the log-partition function should be used for the approximate MLE and the computation of the mean parameters (marginals). Furthermore, Wainwright et al. showed that learning benefits from approximations (when the same type of approximation is used for learning and inference) since in practice the errors introduced by the inexact canonical parameters are partly compensated by the error of the inexact inference.

Learning with approximate MLE can lead to good predictions when the training data is big, as we will see in the following. To this end we interpret MLE as the Kullback-Leibler (KL) divergence [KL51] which is a measure of distance between

2.5. Learning

probability distributions and defined by KL(p_θ∗||p_θ) = X i pθ∗(x_i) logpθ ∗(x_i) pθ(xi) . (2.100)

Let θ∗ be the parameters corresponding to the given empirical mean parameters now denoted by µ∗ and let θ be the parameter we obtain from approximate MLE learning. Then minimizing the KL divergence (2.100) between the two distributions we get

min θ∗ KL(pθ∗||pθ) = X i pθ∗(x_i) log p_θ∗(x_i) −X i pθ∗(x_i) log p_θ(x_i) (2.101a) = const −X i pθ∗(x_i) log p_θ(x_i) = (2.101b) = −Epθ∗[log pθ]. (2.101c)

If we consider the MLE as defined in (2.70), then minimizing the negative MLE

− 1 n n X i=1 log p_θ(x_i) (2.102)

converges to the expectation −E[log pθ] due to the strong law of large numbers, see

[Ete81].

That is for very large training data (n → ∞) minimizing the negative MLE is equivalent to minimizing the Kullback-Leibler divergence between the true (or exact) distribution and the searched canonical parameters. This insight motivates the definition of learning as minimization of a distance between true and estimated parameters. However, in practice we are not always given a lot of data in which case learning with MLE performs poorly.

2.5.2. Loss Minimizing Parameter Learning

In the machine learning community learning is usually defined as the minimization of a loss function. We give a brief description of learning from this point of view. The defined objective loss function measures a distance between the predicted approximate mean parameters and the given correct mean parameters, the empirical mean parameters. As described in the previous subsection, one way is to learn the canonical parameters using approximate MLE from empirical mean parameters. Based on that result, in the inference step the predicted mean parameters can be computed.

In other words, learning in graphical models is concerned with building a graphical model that describes the problem we want to solve. In practice learning methods minimize a loss function which measures the similarity between the model we want to obtain and the model that corresponds to the observed data. Interpreted differently, the loss function evaluates the distance ∆ between ground truth µ∗ and the minimum

energy configuration of the MRF model. This can be written as min

θ ∆(µ(θ), µ

∗₎ _{subject to} _{µ(θ) = arg min}

θ Eθ(µ), (2.103)

with the energy function defined by

Eθ(µ) = X i∈V θi(xi)µi(xi) + X ij∈V θij(xij)µij(xij). (2.104)

In this work we concentrate on supervised learning, i.e. ground truth is provided for the training data, based on what we have to predict output for new observed test data. We refer to the learning problem to be unsupervised or semi-supervised if ground truth is not or only partially available for the training data.

This thesis is concerned with the problem of learning for graphical models. We introduce 2 novel methods for loss minimizing parameter learning which we will describe in Chapter4.

One of our novel learning methods is based on inverse linear programming, which we will briefly introduce in the next section.

In document La enseñanza del español a adultos mayores en Quebec : experiencia de talleres prácticos (página 165-174)