
Expectation propagation (EP) [Min01] is like belief propagation except it requires that the posteriors (beliefs) on each variable have a restricted form. Specifically, the posterior must be in the exponential family, i.e., of the form $q(\theta) \propto \exp(\gamma^\top f(\theta))$, where $\theta$ is a variable. This ensures that beliefs can be represented using a fixed number of sufficient statistics. We choose the parameters of the beliefs s.t.

$$\gamma^* = \arg\min_\gamma D(p(\theta) \,\|\, q_\gamma(\theta)) \quad \text{where} \quad p(\theta) = \frac{q^{\mathrm{prior}}(\theta) \times t(\theta)}{Z}$$

is the exact posterior, $Z = \int_\theta q^{\mathrm{prior}}(\theta) \times t(\theta)$ is the exact normalizing constant, $t(\theta)$ is the likelihood term (message coming in from a factor) and $q_\gamma(\theta)$ is the approximate posterior. This can be solved by moment matching. When we combine sequential Bayesian updating with this approximation (projection) after every update step, the result is called Assumed Density Filtering (ADF), or the probabilistic editor [SM80]. It is clear that the BK algorithm (see Section 4.2.1) is an example of ADF, as is the standard GBP algorithm for switching Kalman filters (see Section 4.3).
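As a rough illustration of the projection step, the following sketch moment-matches a one-dimensional Gaussian to the exact posterior on a grid and evaluates the resulting KL divergence numerically. The grid, the prior N(0, 4), the mixture likelihood, and all function names are assumptions made for this example; they do not come from the text.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def adf_project(q_prior, t, grid):
    """Moment-match a Gaussian to p = q_prior * t / Z on a uniform grid.

    Returns (mean, var, Z); integrals are approximated by Riemann sums.
    """
    dx = grid[1] - grid[0]
    unnorm = [q_prior(x) * t(x) for x in grid]
    Z = sum(unnorm) * dx
    mean = sum(x * u for x, u in zip(grid, unnorm)) * dx / Z
    second = sum(x * x * u for x, u in zip(grid, unnorm)) * dx / Z
    return mean, second - mean ** 2, Z

def kl_p_to_gaussian(q_prior, t, grid, Z, mean, var):
    """Riemann-sum estimate of D(p || N(mean, var))."""
    dx = grid[1] - grid[0]
    total = 0.0
    for x in grid:
        p = q_prior(x) * t(x) / Z
        if p > 0.0:
            log_q = -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var
            total += p * (math.log(p) - log_q) * dx
    return total

# Hypothetical example: Gaussian prior and a non-Gaussian (mixture) likelihood term.
grid = [-10.0 + 0.01 * i for i in range(2001)]
q_prior = lambda x: normal_pdf(x, 0.0, 4.0)
t = lambda x: 0.7 * normal_pdf(2.0, x, 1.0) + 0.3 * normal_pdf(2.0, 0.0, 10.0)

mean, var, Z = adf_project(q_prior, t, grid)
kl_best = kl_p_to_gaussian(q_prior, t, grid, Z, mean, var)
```

Perturbing `mean` or `var` away from the matched moments should only increase the value returned by `kl_p_to_gaussian`, which is what "solved by moment matching" means here.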

One drawback with ADF is its dependence on the order in which the updates are performed. EP, which is a batch algorithm, reduces the sensitivity to ordering by iterating. Intuitively, this allows EP to go back and reoptimize each belief in the revised context of all the other updated beliefs. This requires that we store the messages so their effect can be later “undone” by division. In EP, rather than approximating the messages directly (which is hard, since they represent conditional likelihoods), we approximate the posterior using moment matching, and then infer the corresponding equivalent message. We explain this below.

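To explain the difference between EP and BP procedurally, consider the simple factor graph shown in Figure B.22. We send a message from $f$ to $x$, and then update $x$'s belief, as follows:

$$\begin{aligned}
\phi_x^{\mathrm{prior}} &= \phi_x / \mu^{\mathrm{old}}_{f \to x} = \mu^{\mathrm{old}}_{g \to x} \\
\phi_f^{\mathrm{prior}} &= \phi_f / \mu^{\mathrm{old}}_{x \to f} = f(x, y)\, \mu^{\mathrm{old}}_{y \to f}(y) \\
\mu_{f \to x} &= \phi_f^{\mathrm{prior}} \downarrow x = \int_y f(x, y)\, \mu^{\mathrm{old}}_{y \to f}(y) \\
\phi_x &= \phi_x^{\mathrm{prior}} \times \mu_{f \to x} = \mu^{\mathrm{old}}_{g \to x} \times \mu_{f \to x}
\end{aligned}$$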

The terms after the second equality on each line are for this particular example.
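The BP update for this example can be sketched with small discrete tables, where the marginalization over y becomes a sum. The binary potentials below are hypothetical values chosen only for illustration.

```python
# Hypothetical potentials over binary variables x and y.
h = [0.6, 0.4]                  # h(y)
f = [[0.9, 0.1], [0.2, 0.8]]    # f[x][y] = f(x, y)
g = [0.3, 0.7]                  # g(x)

# Leaf factors send their potentials as messages.
mu_g_to_x = g
mu_y_to_f = h                   # y's only other neighbour is h

# phi_f^prior = f(x, y) * mu_{y->f}(y)
phi_f_prior = [[f[x][y] * mu_y_to_f[y] for y in range(2)] for x in range(2)]

# mu_{f->x} = phi_f^prior marginalized onto x (sum over y)
mu_f_to_x = [sum(phi_f_prior[x]) for x in range(2)]

# phi_x = mu_{g->x} * mu_{f->x}
phi_x = [mu_g_to_x[x] * mu_f_to_x[x] for x in range(2)]

# Brute-force unnormalized marginal of x, for comparison.
exact = [sum(h[y] * f[x][y] * g[x] for y in range(2)) for x in range(2)]
```

Because this graph is a tree, `phi_x` agrees with the brute-force marginal `exact`.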

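In EP, we compute the approximate posterior $\phi_x$ first, and then derive the message $\mu_{f \to x}$, which, had it been combined with the prior $\phi_x^{\mathrm{prior}}$, would result in the same approximate posterior:

$$\begin{aligned}
\phi_x^{\mathrm{prior}} &= \phi_x^{\mathrm{old}} / \mu^{\mathrm{old}}_{f \to x} \\
\phi_f^{\mathrm{prior}} &= \phi_f / \mu^{\mathrm{old}}_{x \to f} \\
(\phi_x, Z) &= \mathrm{ADF}(\phi_x^{\mathrm{prior}} \times \phi_f^{\mathrm{prior}} \downarrow x) \\
\mu_{f \to x} &= (Z \phi_x) / (\phi_x^{\mathrm{prior}})
\end{aligned}$$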

function [φ, μ] = sweep(φ^old, μ^old, order)
  for each factor f in order
    φ_f = φ_f^old
    for each variable x in pred(f, order)
      μ_{x→f} = φ_x^old / μ^old_{f→x}
      φ_f = φ_f × μ_{x→f} / μ^old_{x→f}
    for each variable x in succ(f, order)
      φ_x^prior = φ_x^old / μ^old_{f→x}
      φ_f^prior = φ_f / μ^old_{x→f}
      (φ_x, Z) = ADF(φ_x^prior × φ_f^prior ↓ x)
      μ_{f→x} = (Z φ_x) / φ_x^prior

Figure B.21: Pseudo-code for expectation propagation on a factor graph using a serial updating protocol. This sweep function is called iteratively, as in Figure B.20.

[Figure: chain factor graph with variables y, x and factors h(y), f(x, y), g(x), connected as h - y - f - x - g.]

Figure B.22: A simple factor graph. Square nodes are factors, round nodes are variables.

where $(q, Z) = \mathrm{ADF}(p)$ produces the best approximation $q$ to $p$ within a specified family of distributions. The code for EP is shown in Figure B.21.

For example, consider the switching Markov chain in Figure B.17. The factor graph for the first few slices is shown in Figure B.22, where the variables are the separators, $y = (S_1, X_1)$, $x = (S_2, X_2)$, etc., and the factors are the clique potentials, $h(y) = P(S_1) P(X_1 | S_1)$, $f(x, y) = P(S_2 | S_1) P(X_2 | X_1, S_1)$, etc. When $f$ sends a message to $x$, it needs to compute

$$\phi_x(X_2, S_2) = \mathrm{ADF}\left( \int_{X_1} \sum_{S_1} \phi_x^{\mathrm{prior}}(X_2, S_2) \times \phi_f^{\mathrm{prior}}(X_2, S_2, X_1, S_1) \right)$$

which can be done by weak marginalization (moment matching for a Gaussian mixture).
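Weak marginalization can be sketched for the scalar case as collapsing a Gaussian mixture into the single Gaussian with the same mean and variance. This is a simplified illustration: the function name is hypothetical, and in the switching chain the collapse would be applied separately for each value of the discrete variable and with vector-valued means.

```python
def collapse_mixture(weights, means, variances):
    """Moment-match one Gaussian to a 1-D Gaussian mixture.

    Returns (mean, variance) of the mixture, which define the collapsed Gaussian.
    """
    total = sum(weights)
    w = [wk / total for wk in weights]          # normalize the mixing weights
    mean = sum(wk * mk for wk, mk in zip(w, means))
    # E[x^2] of the mixture is the weighted sum of (variance + mean^2).
    second = sum(wk * (vk + mk * mk) for wk, mk, vk in zip(w, means, variances))
    return mean, second - mean * mean

# Two equally weighted unit-variance components centred at -1 and +1.
mean, var = collapse_mixture([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The collapsed variance exceeds each component variance because it also absorbs the spread between the component means.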

In the special case in which we only have a single variable node (as Minka typically assumes), things simplify, since the factor nodes do not contain any other variables (e.g., $\phi_f^{\mathrm{prior}} = f(x)$, as opposed to $\phi_f^{\mathrm{prior}} = f(x, y) \times \mu_{y \to f}(y)$), and hence the projection onto $x$ is unnecessary. In this case, the ADF step simplifies to

$$(\phi_x, Z) = \mathrm{ADF}(\phi_x^{\mathrm{prior}} \times f(x))$$

To map this to Minka's notation, use: $x = \theta$, $\phi_x = q(\theta)$, and $f(x) = t_i(\theta)$.

We now discuss how to implement $(q(\theta), Z) = \mathrm{ADF}(q^{\mathrm{prior}}(\theta) \times t(\theta))$ for different kinds of distribution $q(\theta)$ and likelihood terms $t(\theta)$.

EP with a Gaussian posterior

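We can compute the parameters of an approximate Gaussian posterior $q(\theta)$ as follows:

$$\begin{aligned}
m^{\mathrm{post}} &= E_p[\theta] = m + V \nabla_m \\
V^{\mathrm{post}} &= E_p[\theta \theta^\top] - (E_p \theta)(E_p \theta)^\top = V - V(\nabla_m \nabla_m^\top - 2 \nabla_V) V
\end{aligned}$$

where $m = m^{\mathrm{prior}}$, $V = V^{\mathrm{prior}}$, $\nabla_m = \nabla_m \log Z(m, V)$, $\nabla_V = \nabla_V \log Z(m, V)$ and $Z(m, V) = \int_\theta t(\theta)\, q(\theta; m, V)$. These equations hold for any $t_i(\theta)$. Minka works the details out for the example of a likelihood term that is a mixture of Gaussians (modelling a point buried in clutter):

$$t_i(\theta) = (1 - w)\, N(x_i; \theta, I) + w\, N(x_i; 0, 10 I)$$

for a fixed, known $w$.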
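For a scalar θ, the clutter update can be sketched in closed form. Integrating the likelihood term against the prior N(θ; m, V) gives Z(m, V) = (1 − w) N(x; m, V + 1) + w N(x; 0, 10), and the two gradients of log Z follow by differentiation. The function name and the numbers used below are assumptions for illustration only.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def adf_clutter_update(m, V, x, w, clutter_var=10.0):
    """One ADF step for the scalar clutter likelihood.

    Prior is N(theta; m, V); returns (m_post, V_post, Z).
    """
    z_signal = (1.0 - w) * normal_pdf(x, m, V + 1.0)
    Z = z_signal + w * normal_pdf(x, 0.0, clutter_var)
    r = z_signal / Z                       # probability that x is not clutter
    d = x - m
    grad_m = r * d / (V + 1.0)             # d log Z / d m
    grad_V = r * (d * d / (2.0 * (V + 1.0) ** 2) - 1.0 / (2.0 * (V + 1.0)))  # d log Z / d V
    m_post = m + V * grad_m
    V_post = V - V * (grad_m * grad_m - 2.0 * grad_V) * V
    return m_post, V_post, Z

# Hypothetical numbers: prior N(0, 4), observation x = 2, clutter weight w = 0.3.
m_post, V_post, Z = adf_clutter_update(m=0.0, V=4.0, x=2.0, w=0.3)
```

The returned moments can be checked against direct numerical integration of the prior times the likelihood term, since the exact posterior here is a two-component Gaussian mixture.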

Once we have approximated the beliefs, we can compute the new message by division. In the case of Gaussians, the easiest way to do this is to convert the beliefs to canonical form, and then divide. Note that this division might result in a negative or infinite variance term (since division in canonical form is implemented by subtracting precision matrices). A negative variance represents a function that curves upwards (like a U), rather than the usual downward (bell-shaped) curve of a Gaussian; an infinite variance corresponds to a flat (uniform) distribution; zero variance corresponds to a constant. It is okay if messages have negative variance, but a belief with a negative variance is a problem. In this case, one crude approximation is to replace a negative variance with an infinite variance, which means the absorbing variable will ignore this message (since it is completely uncertain). This will always result in convergence, but [Min01, p22] says that "when EP does converge with negative $v_i$'s, the result is always better than having forced $v_i > 0$" (where the variance is $v_i I$ in the case of a spherical Gaussian).
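A minimal scalar sketch of this division follows, assuming the canonical parameters are the precision λ = 1/v and h = m/v. The helper names are hypothetical, and `clip_negative` implements only the crude fix described above.

```python
import math

def divide_gaussians(m_num, v_num, m_den, v_den):
    """Divide N(m_num, v_num) by N(m_den, v_den) in canonical form.

    Returns (precision, h) of the quotient; the precision may be negative.
    """
    lam = 1.0 / v_num - 1.0 / v_den
    h = m_num / v_num - m_den / v_den
    return lam, h

def clip_negative(lam, h):
    """Crude fix: treat a negative-variance message as flat (infinite variance)."""
    return (0.0, 0.0) if lam < 0.0 else (lam, h)

def to_moments(lam, h):
    """Convert canonical parameters back to (mean, variance)."""
    if lam == 0.0:
        return 0.0, math.inf    # flat message: the mean is arbitrary
    return h / lam, 1.0 / lam

# Belief N(1, 0.5) divided by prior N(0, 1) gives a proper message.
proper = to_moments(*divide_gaussians(1.0, 0.5, 0.0, 1.0))

# A belief that is broader than its prior gives a negative-variance message.
lam_neg, h_neg = divide_gaussians(0.0, 2.0, 0.0, 1.0)
flat = to_moments(*clip_negative(lam_neg, h_neg))
```

In the first case the quotient is N(2, 1), which multiplied by the prior N(0, 1) recovers the belief N(1, 0.5). In the second the precision is −0.5, so the clipped message carries no information.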

EP with a Dirichlet posterior

It is possible to use EP to compute a Dirichlet approximation to $P(\theta | y)$ even when the likelihood functions have the form of a mixture, $P(y_i | \theta) = \sum_z P(z | \theta) P(y_i | z)$. This has been applied to a model in which the $y$'s represent words, the $z$'s represent latent topics, and $\theta$ is a distribution over topics for a particular document [ML02].

EP with mixed types of posterior

It is not clear how to use EP for arbitrary factor graphs, where the belief on each factor (and hence the corresponding outgoing messages) might be a different member of the exponential family. This is a subject for future research.

EP with a fully factorized posterior is LBP

Minka points out that LBP is a special case of EP when we make the approximation that the posterior is fully factorized: $q(x) = \prod_{i=1}^{N} q_i(x_i)$. To see this, consider combining this factored prior with a term $t_i(x) = P(X_i | \mathrm{Pa}(X_i))$ to yield the posterior $p(x)$. We now seek the distribution $q$ such that $D(p(x) \,\|\, q(x))$ is minimized, subject to the constraint that $q$ be fully factorized. This means

$$q_i(x_i) = p(x_i) = \sum_{x \setminus x_i} p(x)$$

i.e., the marginals must match. These are expectation constraints:

$$E_q[\delta(x_i - v)] = E_p[\delta(x_i - v)]$$

for all values $v$ and all nodes $i$. This corresponds to running LBP on a factor graph with no need for the ADF step to approximate messages.
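The marginal-matching property can be checked numerically on a small discrete joint. The 2×2 table below is a hypothetical example; the sketch compares D(p||q) for the product of marginals against a few other factorized candidates.

```python
import math

# Hypothetical joint distribution p(x1, x2) over two binary variables.
p = [[0.30, 0.10],
     [0.15, 0.45]]

def marginals(p):
    """Return the marginals (p(x1), p(x2)) of a 2-D table."""
    q1 = [sum(row) for row in p]
    q2 = [sum(p[i][j] for i in range(len(p))) for j in range(len(p[0]))]
    return q1, q2

def kl_to_product(p, q1, q2):
    """D(p || q) for the factorized distribution q(x1, x2) = q1(x1) q2(x2)."""
    total = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            if pij > 0.0:
                total += pij * math.log(pij / (q1[i] * q2[j]))
    return total

q1, q2 = marginals(p)
kl_marginals = kl_to_product(p, q1, q2)

# Other factorized candidates; none should give a smaller divergence.
candidates = [([0.5, 0.5], q2), (q1, [0.5, 0.5]), ([0.3, 0.7], [0.6, 0.4])]
kl_others = [kl_to_product(p, a, b) for a, b in candidates]
```

The divergence splits into a constant minus the cross-entropies of each marginal with its factor, which is why each factor is optimized independently by the matching marginal.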

B.7.3 Variational methods

The simplest example of a variational method is the mean-field approximation, which, roughly speaking, exploits the law of large numbers to approximate large sums of random variables by their means. In particular, we essentially decouple all the nodes, introduce a new parameter, called a variational parameter, for each node, and iteratively update these parameters so as to minimize the KL divergence (relative entropy) between the approximate and true probability distributions, i.e., we seek to minimize $D(q \,\|\, p)$, where $q$ is the approximate distribution. (EP, by contrast, (locally) minimizes $D(p \,\|\, q)$.) Updating the variational parameters becomes a proxy for inference. The mean-field approximation produces a lower bound on the likelihood.
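A minimal sketch of mean-field on two discrete variables shows both the iterative updates and the lower bound. The table is a hypothetical joint P(x1, x2, e) with the evidence e already clamped, so its sum is the likelihood P(e); the update rule used is the standard coordinate-ascent form, not something specified in the text.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [v / total for v in weights]

def lower_bound(log_joint, q1, q2):
    """E_q[log P(x1, x2, e)] + H(q), which is at most log P(e)."""
    energy = sum(q1[i] * q2[j] * log_joint[i][j]
                 for i in range(len(q1)) for j in range(len(q2)))
    entropy = -sum(v * math.log(v) for v in q1 + q2 if v > 0.0)
    return energy + entropy

def mean_field(joint, n_iters=50):
    """Coordinate ascent on q(x1, x2) = q1(x1) q2(x2), minimizing D(q || p)."""
    log_joint = [[math.log(v) for v in row] for row in joint]
    n1, n2 = len(joint), len(joint[0])
    q1, q2 = [1.0 / n1] * n1, [1.0 / n2] * n2
    bounds = []
    for _ in range(n_iters):
        # Each factor is set to exp of the expected log joint under the other factor.
        q1 = normalize([math.exp(sum(q2[j] * log_joint[i][j] for j in range(n2)))
                        for i in range(n1)])
        q2 = normalize([math.exp(sum(q1[i] * log_joint[i][j] for i in range(n1)))
                        for j in range(n2)])
        bounds.append(lower_bound(log_joint, q1, q2))
    return q1, q2, bounds

# Hypothetical table of P(x1, x2, e) for binary x1, x2.
joint = [[0.30, 0.02],
         [0.03, 0.15]]
q1, q2, bounds = mean_field(joint)
log_evidence = math.log(sum(sum(row) for row in joint))
```

Each sweep cannot decrease the bound, and the bound never exceeds `log_evidence`; the gap between them is the remaining D(q||p).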

It is possible to combine variational and exact inference; this is called a structured variational approximation. See [JGJS98, Jaa01] for good general tutorials.
