Expectation propagation (EP) [Min01] is like belief propagation, except that it requires the posteriors (beliefs) on each variable to have a restricted form. Specifically, the posterior must be in the exponential family, i.e., of the form $q(\theta) \propto \exp(\gamma' f(\theta))$, where $\theta$ is a variable. This ensures that beliefs can be represented using a fixed number of sufficient statistics. We choose the parameters of the beliefs s.t.

$$\gamma^* = \arg\min_{\gamma} D(p(\theta) \,\|\, q_\gamma(\theta))$$

where

$$p(\theta) = \frac{q^{\mathrm{prior}}(\theta) \times t(\theta)}{Z}$$

is the exact posterior, $Z = \int_\theta q^{\mathrm{prior}}(\theta) \times t(\theta)$ is the exact normalizing constant, $t(\theta)$ is the likelihood term (message coming in from a factor), and $q_\gamma(\theta)$ is the approximate posterior. This can be solved by moment matching. When we combine sequential Bayesian updating with this approximation (projection) after every update step, the result is called Assumed Density Filtering (ADF), or the probabilistic editor [SM80]. It is clear that the BK algorithm (see Section 4.2.1) is an example of ADF, as is the standard GBP algorithm for switching Kalman filters (see Section 4.3).
One drawback with ADF is its dependence on the order in which the updates are performed. EP, which is a batch algorithm, reduces the sensitivity to ordering by iterating. Intuitively, this allows EP to go back and reoptimize each belief in the revised context of all the other updated beliefs. This requires that we store the messages so their effect can be later “undone” by division. In EP, rather than approximating the messages directly (which is hard, since they represent conditional likelihoods), we approximate the posterior using moment matching, and then infer the corresponding equivalent message. We explain this below.
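To explain the difference between EP and BP procedurally, consider the simple factor graph shown in Figure B.22. We send a message from $f$ to $x$, and then update $x$'s belief, as follows:

$$\begin{aligned}
\phi_x^{\mathrm{prior}} &= \phi_x / \mu^{\mathrm{old}}_{f \to x} = \mu^{\mathrm{old}}_{g \to x} \\
\phi_f^{\mathrm{prior}} &= \phi_f / \mu^{\mathrm{old}}_{x \to f} = f(x, y)\, \mu^{\mathrm{old}}_{y \to f}(y) \\
\mu_{f \to x} &= \phi_f^{\mathrm{prior}} \downarrow x = \int_y f(x, y)\, \mu^{\mathrm{old}}_{y \to f}(y) \\
\phi_x &= \phi_x^{\mathrm{prior}} \times \mu_{f \to x} = \mu^{\mathrm{old}}_{g \to x} \times \mu_{f \to x}
\end{aligned}$$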
The terms after the second equality on each line are for this particular example.
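As a concrete illustration of the BP update for this example, the following minimal Python sketch uses discrete variables, so the integral over $y$ becomes a sum. The tables `h`, `f`, `g` and the function names are hypothetical values chosen for illustration, not taken from the text. Because the graph is a chain, the resulting belief on $x$ should equal the exact marginal.

```python
# Hypothetical discrete tables for the chain h(y) - y - f(x, y) - x - g(x).
h = [0.7, 0.3]                  # h(y), y in {0, 1}
f = [[0.9, 0.2], [0.1, 0.8]]    # f[x][y] = f(x, y), x in {0, 1}
g = [0.5, 1.5]                  # g(x)


def normalize(v):
    """Rescale a list of non-negative numbers so that it sums to one."""
    total = sum(v)
    return [a / total for a in v]


def bp_update_x(h, f, g):
    """Send the message f -> x and update the belief on x."""
    mu_y_to_f = h   # y's only other neighbour is h, so mu_{y->f}(y) = h(y)
    mu_g_to_x = g   # the leaf factor g sends itself; this is phi_x^prior
    # mu_{f->x}(x) = sum_y f(x, y) * mu_{y->f}(y)   (phi_f^prior projected onto x)
    mu_f_to_x = [
        sum(f[x][y] * mu_y_to_f[y] for y in range(len(h)))
        for x in range(len(g))
    ]
    # phi_x = phi_x^prior * mu_{f->x}
    phi_x = [mu_g_to_x[x] * mu_f_to_x[x] for x in range(len(g))]
    return mu_f_to_x, normalize(phi_x)


def exact_marginal_x(h, f, g):
    """Brute-force marginal of x from the full joint h(y) f(x, y) g(x)."""
    unnorm = [
        sum(h[y] * f[x][y] * g[x] for y in range(len(h)))
        for x in range(len(g))
    ]
    return normalize(unnorm)


message, belief = bp_update_x(h, f, g)
```

Comparing `belief` with `exact_marginal_x(h, f, g)` is a quick way to check the message schedule on a tree-structured graph.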
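In EP, we compute the approximate posterior $\phi_x$ first, and then derive the message $\mu_{f \to x}$, which, had it been combined with the prior $\phi_x^{\mathrm{prior}}$, would result in the same approximate posterior:

$$\begin{aligned}
\phi_x^{\mathrm{prior}} &= \phi_x^{\mathrm{old}} / \mu^{\mathrm{old}}_{f \to x} \\
\phi_f^{\mathrm{prior}} &= \phi_f / \mu^{\mathrm{old}}_{x \to f} \\
(\phi_x, Z) &= \mathrm{ADF}(\phi_x^{\mathrm{prior}} \times \phi_f^{\mathrm{prior}} \downarrow x) \\
\mu_{f \to x} &= (Z \phi_x) / \phi_x^{\mathrm{prior}}
\end{aligned}$$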
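function $[\phi, \mu]$ = sweep($\phi^{\mathrm{old}}$, $\mu^{\mathrm{old}}$, order)
  for each factor $f$ in order
    $\phi_f = \phi_f^{\mathrm{old}}$
    for each variable $x$ in pred($f$, order)
      $\mu_{x \to f} = \phi_x^{\mathrm{old}} / \mu^{\mathrm{old}}_{f \to x}$
      $\phi_f = \phi_f \times \mu_{x \to f} / \mu^{\mathrm{old}}_{x \to f}$
    for each variable $x$ in succ($f$, order)
      $\phi_x^{\mathrm{prior}} = \phi_x^{\mathrm{old}} / \mu^{\mathrm{old}}_{f \to x}$
      $\phi_f^{\mathrm{prior}} = \phi_f / \mu^{\mathrm{old}}_{x \to f}$
      $(\phi_x, Z) = \mathrm{ADF}(\phi_x^{\mathrm{prior}} \times \phi_f^{\mathrm{prior}} \downarrow x)$
      $\mu_{f \to x} = (Z \phi_x) / \phi_x^{\mathrm{prior}}$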
Figure B.21: Pseudo-code for expectation propagation on a factor graph using a serial updating protocol. This sweep function is called iteratively, as in Figure B.20.
[Figure: chain-structured factor graph, h(y) - y - f(x, y) - x - g(x)]
Figure B.22: A simple factor graph. Square nodes are factors, round nodes are variables.
where $(q, Z) = \mathrm{ADF}(p)$ produces the best approximation $q$ to $p$ within a specified family of distributions. The code for EP is shown in Figure B.21.
For example, consider the switching Markov chain in Figure B.17. The factor graph for the first few slices is shown in Figure B.22, where the variables are the separators, $y = (S_1, X_1)$, $x = (S_2, X_2)$, etc., and the factors are the clique potentials, $h(y) = P(S_1) P(X_1|S_1)$, $f(x, y) = P(S_2|S_1) P(X_2|X_1, S_1)$, etc. When $f$ sends a message to $x$, it needs to compute

$$\phi_x(X_2, S_2) = \mathrm{ADF}\left( \int_{X_1} \sum_{S_1} \phi_x^{\mathrm{prior}}(X_2, S_2) \times \phi_f^{\mathrm{prior}}(X_2, S_2, X_1, S_1) \right)$$
which can be done by weak marginalization (moment matching for a Gaussian mixture).
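Weak marginalization replaces a Gaussian mixture by the single Gaussian that has the same mean and variance. A minimal one-dimensional Python sketch follows; the function name `collapse_mixture` and the scalar setting are assumptions made for illustration (the switching Kalman filter case uses vector means and covariance matrices).

```python
def collapse_mixture(weights, means, variances):
    """Moment-match a 1-D Gaussian mixture with a single Gaussian.

    Returns the mean and variance of the mixture, which are the parameters
    of the Gaussian q minimizing D(p || q) for the mixture p.
    """
    total = sum(weights)
    w = [wi / total for wi in weights]   # normalize the mixing weights
    mean = sum(wi * mi for wi, mi in zip(w, means))
    # Law of total variance: within-component plus between-component spread.
    var = sum(
        wi * (vi + (mi - mean) ** 2)
        for wi, mi, vi in zip(w, means, variances)
    )
    return mean, var


# Two equally weighted unit-variance components centred at -1 and +1.
mean, var = collapse_mixture([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The between-component term is what distinguishes this from simply averaging the component variances: well-separated components yield a broad collapsed Gaussian.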
In the special case in which we only have a single variable node (as Minka typically assumes), things simplify, since the factor nodes do not contain any other variables (e.g., $\phi_f^{\mathrm{prior}} = f(x)$, as opposed to $\phi_f^{\mathrm{prior}} = f(x, y) \times \mu_{y \to f}(y)$), and hence the projection onto $x$ is unnecessary. In this case, the ADF step simplifies to

$$(\phi_x, Z) = \mathrm{ADF}(\phi_x^{\mathrm{prior}} \times f(x))$$

To map this to Minka's notation, use: $x = \theta$, $\phi_x = q(\theta)$, and $f(x) = t_i(\theta)$.
We now discuss how to implement $(q(\theta), Z) = \mathrm{ADF}(q^{\mathrm{prior}}(\theta) \times t(\theta))$ for different kinds of distribution $q(\theta)$ and likelihood terms $t(\theta)$.
EP with a Gaussian posterior
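We can compute the parameters of an approximate Gaussian posterior $q(\theta)$ as follows:

$$\begin{aligned}
m_{\mathrm{post}} &= E_p[\theta] = m + V \nabla_m \\
V_{\mathrm{post}} &= E_p[\theta \theta'] - [E_p \theta][E_p \theta]' = V - V(\nabla_m \nabla_m' - 2 \nabla_V) V
\end{aligned}$$

where $m = m_{\mathrm{prior}}$, $V = V_{\mathrm{prior}}$, $\nabla_m = \nabla_m \log Z(m, V)$, $\nabla_V = \nabla_V \log Z(m, V)$ and $Z(m, V) = \int_\theta t(\theta)\, q(\theta; m, V)$. These equations hold for any $t_i(\theta)$. Minka works the details out for the example of a likelihood term that is a mixture of Gaussians (modelling a point buried in clutter):

$$t_i(\theta) = (1 - w)\, N(x_i; \theta, I) + w\, N(x_i; 0, 10 I)$$

for a fixed, known $w$.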
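To make the moment-matching update concrete, here is a minimal one-dimensional Python sketch of the ADF step for the clutter likelihood. It is an illustration under stated assumptions, not Minka's implementation: $\theta$ is scalar, and the names `adf_gaussian_clutter` and `brute_force_moments` are hypothetical. In one dimension, $Z(m, V) = (1 - w) N(x; m, V + 1) + w N(x; 0, 10)$, so both gradients of $\log Z$ are available in closed form. The second function recomputes the same moments by numerical integration as a sanity check.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def adf_gaussian_clutter(m, V, x, w, clutter_var=10.0):
    """One ADF update of the prior N(theta; m, V) with the clutter term t(theta)."""
    s = V + 1.0                                  # variance of x under the signal component
    signal = (1.0 - w) * normal_pdf(x, m, s)
    Z = signal + w * normal_pdf(x, 0.0, clutter_var)
    r = signal / Z                               # posterior probability that x is not clutter
    grad_m = r * (x - m) / s                     # d log Z / dm
    grad_V = r * ((x - m) ** 2 / (2.0 * s * s) - 1.0 / (2.0 * s))   # d log Z / dV
    m_post = m + V * grad_m
    V_post = V - V * (grad_m * grad_m - 2.0 * grad_V) * V
    return m_post, V_post, Z


def brute_force_moments(m, V, x, w, clutter_var=10.0, n=20001):
    """Same moments by a Riemann sum over a grid of theta values (check only)."""
    half = 12.0 * math.sqrt(V)
    step = 2.0 * half / (n - 1)
    z = first = second = 0.0
    for k in range(n):
        theta = m - half + k * step
        t = (1.0 - w) * normal_pdf(x, theta, 1.0) + w * normal_pdf(x, 0.0, clutter_var)
        weight = normal_pdf(theta, m, V) * t * step
        z += weight
        first += weight * theta
        second += weight * theta * theta
    mean = first / z
    return mean, second / z - mean * mean, z


m_post, V_post, Z = adf_gaussian_clutter(m=0.0, V=100.0, x=2.0, w=0.5)
```

The closed-form result should agree with `brute_force_moments` up to integration error, which gives a cheap regression test when the likelihood term is changed.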
Once we have approximated the beliefs, we can compute the new message by division. In the case of Gaussians, the easiest way to do this is to convert the beliefs to canonical form, and then divide. Note that this division might result in a negative or infinite variance term (since division in canonical form is implemented by subtracting precision matrices). A negative variance represents a function that curves upwards (like a U), rather than the usual downward (bell-shaped) curve of a Gaussian; an infinite variance corresponds to a flat (uniform) distribution; zero variance corresponds to a constant. It is okay if messages have negative variance, but a belief with a negative variance is a problem. In this case, one crude approximation is to replace a negative variance with an infinite variance, which means the absorbing variable will ignore this message (since it is completely uncertain). This will always result in convergence, but [Min01, p22] says that “when EP does converge with negative $v_i$'s, the result is always better than having forced $v_i > 0$” (where the variance is $v_i I$ in the case of a spherical Gaussian).
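The division step can be sketched in a few lines of Python for scalar Gaussians. This is a minimal illustration, assuming a one-dimensional belief; the function names and the `(precision, shift)` representation (precision $1/v$ and precision-weighted mean $m/v$) are choices made here, not notation from the text. Working directly with canonical parameters avoids dividing by a zero precision when the result is flat.

```python
def to_canonical(mean, var):
    """Moment form (mean, var) -> canonical form (precision, shift)."""
    return 1.0 / var, mean / var


def divide_gaussians(m_num, v_num, m_den, v_den):
    """Divide N(m_num, v_num) by N(m_den, v_den), ignoring the scale factor.

    Division subtracts canonical parameters. The returned precision may be
    negative (a U-shaped function) or zero (a flat function).
    """
    p_num, s_num = to_canonical(m_num, v_num)
    p_den, s_den = to_canonical(m_den, v_den)
    return p_num - p_den, s_num - s_den


def multiply_canonical(a, b):
    """Multiply two canonical-form terms by adding their parameters."""
    return a[0] + b[0], a[1] + b[1]


def clip_message(precision, shift):
    """Crude fix: replace a negative-variance message with a flat one."""
    if precision < 0.0:
        return 0.0, 0.0   # zero precision, i.e. infinite variance
    return precision, shift


# Belief N(1, 0.5) divided by prior N(0, 1) gives a proper Gaussian message.
message = divide_gaussians(1.0, 0.5, 0.0, 1.0)

# A belief broader than the prior yields a negative precision.
odd_message = divide_gaussians(0.0, 2.0, 0.0, 1.0)
flat_message = clip_message(*odd_message)
```

Multiplying `message` back into the prior with `multiply_canonical` recovers the canonical parameters of the belief, which is the property the stored messages must satisfy for their effect to be undone later.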
EP with a Dirichlet posterior
It is possible to use EP to compute a Dirichlet approximation to $P(\theta|y)$ even when the likelihood functions have the form of a mixture, $P(y_i|\theta) = \sum_z P(z|\theta) P(y_i|z)$. This has been applied to a model in which the $y$'s represent words, the $z$'s represent latent topics, and $\theta$ is a distribution over topics for a particular document [ML02].
EP with mixed types of posterior
It is not clear how to use EP for arbitrary factor graphs, where the belief on each factor (and hence the corresponding outgoing messages) might be a different member of the exponential family. This is a subject for future research.
EP with a fully factorized posterior is LBP
Minka points out that LBP is a special case of EP when we make the approximation that the posterior is fully factorized: $q(x) = \prod_{i=1}^{N} q_i(x_i)$. To see this, consider combining this factored prior with a term $t_i(x) = P(X_i|\mathrm{Pa}(X_i))$ to yield the posterior $p(x)$. We now seek the distribution $q$ s.t. $D(p(x) \,\|\, q(x))$ is minimized, subject to the constraint that $q$ be fully factorized. This means

$$q_i(x_i) = p(x_i) = \sum_{x \setminus x_i} p(x)$$

i.e., the marginals must match. These are expectation constraints:

$$E_q[\delta(x_i - v)] = E_p[\delta(x_i - v)]$$

for all values $v$ and all nodes $i$. This corresponds to running LBP on a factor graph with no need for the ADF step to approximate messages.
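The claim that the product of marginals is the best fully factorized approximation under $D(p \,\|\, q)$ can be checked numerically on a small table. The following Python sketch assumes two binary variables and a hypothetical joint `p`; the function names are likewise illustrative.

```python
import math


def marginals(p):
    """Row and column marginals of a joint table p[i][j] = p(x1 = i, x2 = j)."""
    p1 = [sum(row) for row in p]
    p2 = [sum(p[i][j] for i in range(len(p))) for j in range(len(p[0]))]
    return p1, p2


def kl_to_factorized(p, q1, q2):
    """D(p || q) for the factorized approximation q(x1, x2) = q1(x1) q2(x2)."""
    return sum(
        p[i][j] * math.log(p[i][j] / (q1[i] * q2[j]))
        for i in range(len(p))
        for j in range(len(p[0]))
        if p[i][j] > 0.0
    )


# Hypothetical correlated joint distribution over two binary variables.
p = [[0.4, 0.1], [0.2, 0.3]]
p1, p2 = marginals(p)

kl_at_marginals = kl_to_factorized(p, p1, p2)
kl_perturbed = kl_to_factorized(p, [0.6, 0.4], p2)   # any other q1 does worse
```

At the minimum, `kl_at_marginals` equals the mutual information between the two variables, which is the part of $p$ that no fully factorized $q$ can capture.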
B.7.3 Variational methods
The simplest example of a variational method is the mean-field approximation, which, roughly speaking, exploits the law of large numbers to approximate large sums of random variables by their means. In particular, we essentially decouple all the nodes, and introduce a new parameter, called a variational parameter, for each node, and iteratively update these parameters so as to minimize the cross-entropy (KL distance) between the approximate and true probability distributions, i.e., we seek to minimize $D(q \,\|\, p)$, where $q$ is the approximate distribution. (EP, by contrast, (locally) minimizes $D(p \,\|\, q)$.) Updating the variational parameters becomes a proxy for inference. The mean-field approximation produces a lower bound on the likelihood.
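A minimal Python sketch of mean-field coordinate updates on a two-variable model illustrates both the iterative scheme and the lower bound. The unnormalized table `p_tilde`, the initial values, and the function names are assumptions made for illustration. Each update sets $q_1(x_1) \propto \exp(\sum_{x_2} q_2(x_2) \log \tilde{p}(x_1, x_2))$, and symmetrically for $q_2$; the bound is $\sum_x q(x) \log \tilde{p}(x) + H(q) \le \log \sum_x \tilde{p}(x)$.

```python
import math

# Hypothetical unnormalized joint over two binary variables (all entries > 0).
p_tilde = [[4.0, 1.0], [1.0, 3.0]]


def normalize(v):
    total = sum(v)
    return [a / total for a in v]


def update(p_tilde, q_other, axis):
    """Mean-field update of one factor given the other (axis 0: q1, axis 1: q2)."""
    expected_log = []
    for a in range(2):
        if axis == 0:
            e = sum(q_other[b] * math.log(p_tilde[a][b]) for b in range(2))
        else:
            e = sum(q_other[b] * math.log(p_tilde[b][a]) for b in range(2))
        expected_log.append(e)
    return normalize([math.exp(e) for e in expected_log])


def lower_bound(p_tilde, q1, q2):
    """E_q[log p_tilde] + H(q1) + H(q2), a lower bound on log sum(p_tilde)."""
    value = sum(
        q1[a] * q2[b] * math.log(p_tilde[a][b])
        for a in range(2)
        for b in range(2)
    )
    for q in (q1, q2):
        value -= sum(qi * math.log(qi) for qi in q if qi > 0.0)
    return value


def mean_field(p_tilde, n_iters=20):
    """Alternate the two updates, recording the bound after each sweep."""
    q1, q2 = [0.6, 0.4], [0.5, 0.5]   # arbitrary asymmetric initialization
    history = []
    for _ in range(n_iters):
        q1 = update(p_tilde, q2, axis=0)
        q2 = update(p_tilde, q1, axis=1)
        history.append(lower_bound(p_tilde, q1, q2))
    return q1, q2, history


q1, q2, history = mean_field(p_tilde)
log_z = math.log(sum(sum(row) for row in p_tilde))
```

The recorded bound should never decrease from one sweep to the next and should stay below `log_z`; the gap is exactly $D(q \,\|\, p)$ for the normalized $p$.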
It is possible to combine variational and exact inference; this is called a structured variational approximation. See [JGJS98, Jaa01] for good general tutorials.