BLOQUE I. MARCO TEÓRICO: LA VALIDACIÓN DE LOS APRENDIZAJES y Los sistemas de gestión de la
Capítulo 4. LOS SISTEMAS Y MODELOS DE GESTIÓN
4.1 Los Sistemas de Gestión basada en Procesos
4.1.1 Los procesos como base de la gestión de las organizaciones
When training an HHMM for ASR, we are essentially learning a probabilistic mapping fromw~ = W1:Tin
toY1:Tout, where typicallyTinTout. This is like an asychronous input-output HMM [BB96], which can
handle input and output sequences of different lengths. (Note that the complexity of the inference algorithm for asychronous IO-HMMs in [BB96] isO(TinTout), just like the DBN method. By contrast, inference is a synchronous IO-HMM, whereTin=Tout=T, only takes timeO(T).)
Bengio [Ben99] suggested using IO-HMMs for ASR, treating the acoustic sequence as input and the words as output, the opposite of what we discussed above. There are two problems with this. First, we may be forced to assign high probability to unlikely observation sequences; this is because we are conditioning on the acoustics instead of generating them: if there is only one outgoing arc from a state, it must have probability one no matter what the input is, because we normalize transition probabilities on a per-state level. This has been called the “label bias problem” [MFP00]. A second and more important problem is that the “meaning” of the states in an IO-HMM is different than in an HMM. In particular, the states summarize past acoustics, rather than past words, so the kind of sharing of phone models we discussed above, that is vital to successful performance, cannot be used.9
2.3.12
Variable-duration (semi-Markov) HMMs
The self-arc on a state in an HMM defines a geometric distribution over waiting times. Specifically, the probability we remain in stateifor exactlydsteps isp(d) = (1−p)pd−1, wherep=A(i, i)is the self-loop probability. To allow for more general durations, one can use a semi-Markov model. (It is called semi- Markov because to predict the next state, it is not sufficient to condition on the past state: we also need to know how long we’ve been in that state.) We can model this as a special kind of 2-level HHMM, as shown in Figure 2.22. The reason there is noQtoF arc is that the termination process is deterministic. In particular,
Q1 Q2 Q3
F1 F2 F3
QD
1 QD2 QD3
Y1 Y2 Y3
Figure 2.22: A variable-duration HMM modelled as a 2-level HHMM. Qt represents the state, and QD t
represents how long we have been in that state (duration).
the bottom level counts how long we have been in a certain state; when the counter expires (reaches 0), the
F node turns on, the parent nodeQcan change state, and the duration counter,QD, resets. Hence the CPD for the bottom level is as follows:
P(QD t =d0|QDt−1=d, Qt=k, Ft−1= 1) = pk(d0) P(QDt =d0|QDt−1=d, Qt=k, Ft−1= 0) = 1 ifd >1andd0 =d−1 0 otherwise
Note thatpk(d)could be represented as a table (a non-parametric approach) or as some kind of parametric function. Ifpk(d)is a geometric distribution, this emulates a standard HMM.
The naive approach to inference in this HHMM takes O(T D2K2)time, where D is the maximum number of steps we can spend in any state, andKis the number of HMM states; this is the same complexity as the original algorithm [Rab89]. However, we can exploit the fact that the CPD forQDis deterministic to reduce this toO(T DK2). This is faster than the original algorithm because we do dynamic programming on the joint state-space ofQandQD, instead of justQ. Also, we can apply standard inference and learning
procedures to this model, without having to derive the (somewhat hairy) formulas by hand.
A more efficient, but less flexible, way to model non-geometric waiting times is to replace each state with
nnew states, each with the same emission probabilities as the original state [DEKM98, p69]. For example, consider this model.
1 2 3 4
1−p 1−p 11−−pp
p p p p
Obviously the smallest sequence this can generate is of lengthn= 4. Any path of lengthlthrough the model has probabilitypl−n(1−p)n; the number of possible paths is l−1
n−1
, so the total probability of a path of lengthlis p(l) = l−1 n−1 pl−n(1−p)n
This is the negative binomial distribution. By adjustingnand the self-loop probabilities of each state, we can model a wide range of waiting times.
2.3.13
Mixtures of HMMs
[Smy96] defines a mixture of HMMs, which can be used for clustering (pre-segmented) sequences. The basic idea is to add a hidden cluster variable as a parent to all the nodes in the HMM; hence all the distributions are conditional on the (hidden) class variable. A generalization of this, which I call a dynamical mixture of HMMs, adds Markovian dynamics to the mixture/class variable, as in Figure 2.23. This model has several
X1 X2
X11 X12 X13 X21 X22 X23
Y11 Y12 Y13 Y21 Y22 Y23
Figure 2.23: A dynamical mixture of HMMs, aka an embedded HMM. Here we have 2 sequences, both of length 3. Obviously we could have more sequences of different lengths, and multiple levels of the hierarchy. A static mixture of HMMs would combineX2andX1.
X 1,1 X 1,2 X 1,3 X 1 X 2 X 2,1 X 2,2 X 2,3 Y1,1 Y 1,2 Y1,3 Y 2,1 Y2,2 Y2,3 forehead eyes nose mouth chin end end end end end (a) (b)
Figure 2.24: An embedded HMM for face recognition represented (a) as a DBN and (b) as the state transition diagram of an HHMM (dotted arcs represent calling a sub-HMM). We assume there are 2 rows and 3 columns in the image, which is clear from the DBN representation, but not from the HHMM representation. The number of states in each sub-model can be different, which is clear from the HHMM but not the DBN. The DBN in (a) is just a rotated version of the one in Figure 2.23.
names in the literature: “hierarchical mixture of Markov chains” [Hoe01], “embedded HMM” [NI00], and “pseudo-2D” HMM [KA94]. The reason for the term pseudo-2D is because this model provides a way of extending HMMs to 2D data such as images. The basic idea is that each row of an image is modelled by a row HMM, all of whose states are conditioned on the state of the column HMM. The overall effect is to allow dynamic warping in both the horizontal and vertical directions (although the warping in each row is independent, as noted by Mark Paskin).
Nefian [NI00] used embedded (2D) HMMs for face recognition. In this model, the state of each row could be forehead, eyes, nose, mouth or chin; each pixel within each row could be in one of 3 states (if the row was in the forehead or chin state) or in one of 6 states otherwise. The “meaning” of the horizontal states was not defined; they just acted as positions within a left-to-right HMM, to allow horizontal warping of each feature. See Figure 2.24.
The key difference between an embedded HMM and a hierarchical HMM is that, in an embedded HMM, the ending “time” of each sub-HMM is known in advance, since each horizontal HMM models exactlyC
observations, whereCis the number of columns in the image. Hence the problem of deciding when to return control to the (unique) parent state is avoided. This makes inference in embedded HMMs very easy.
from an embedded HMM because each substate emits a single observation, not a fixed-length vector; also, it is different from a 2-level HHMM, because there is nothing forcing the top level to change more slowly than the bottom level.
Q1
1 Q12 Q13
Q2
1 Q22 Q23
Y1 Y2 Y3
Figure 2.25: A switching HMM, aka a 2-level hidden Markov decision tree. This is different from a 2-level HHMM, because there is nothing forcing the top level to change more slowly than the bottom level.
2.3.14
Segment models
τ
Q1 Q2 Qτ l 2 lτ l 1...
...
...
...
...
...
X1 X2 Xl1 Xl1+1 Xl 1+l2 XΣli XTFigure 2.26: A schematic depiction of a segment model, from [Bil01]. TheXtnodes are observed, the rest are hidden.
The basic idea of the segment model [ODK96] is that each HMM state can generate a sequence of obser- vations instead of just a single observation. The difference from an HHMM is that the length of the sequence generated from stateqiis explicitly determined by a random variableli, as opposed to being implicitly de- termined by when its sub-HHMM enters its end state. Furthermore, the total number of segments, τ, is a random variable. Hence the likelihood is given by
P(y1:T) = X τ X q1:τ X l1:τ li Y i=1
P(τ)P(qi|qi−1, τ)P(li|qi)P(yt0(i):t1(i)|qi, li)
where andt0(i) = Pij−=11lj+ 1 andt1(i) = t0(i) +li−1 are the start and ending times of segmenti. A first attempt to represent this as a graphical model is shown in Figure 2.26. This is not a fully specified graphical model, since theLi’s are random variables, and hence the topology is variable. Also, there are some dependencies that are not shown (e.g., the fact that Pτi=1li = T), and some dependencies might not exist (e.g., theYi’s in a segment may not be fully interconnected). We will give a more accurate DBN representation of the model below.
Let us consider a particular segment and renumber so thatt0= 1andt1=l. In a variable-length HMM,
P(y1:l|q, l) = l
Y
k=1
τ S1 S2 S3 Q1 Q2 Q3 F1 F2 F3 L1 L2 L3 R1 R2 R3 Y1 Y2 Y3
Figure 2.27: A stochastic segment model.
(Ifp(l|q)is a geometric distribution, this becomes a regular HMM.) A simple generalization is to condition on some function of the position within the segment:
P(y1:l|q, l) = l
Y
k=1
P(yk|q, rk)
where rk is the region corresponding to position kwithin the segment. This can be modelled as shown in Figure 2.27. We have added a deterministic counter, St, which counts the number of segments. We constrainST = τ, rather like the end-of-sequence trick for HHMMs (see Section 2.3.10). TheLtnodes is a deterministic down-counter, and turns on F when it reaches zero, as in a semi-Markov model (see Section 2.3.12).
Alternatively, we can define a conditional Gaussian segment model [DRO93]
P(y1:l|q, l) = l
Y
k=1
P(yk|yk−1, q, rk)
We just modify Figure 2.27 by adding an arc fromYt−1toYt; however, we want this arc to “disappear” when
Ft−1 = 1, since that indicates the end of a segment. Hence we need to add an arc fromFt−1toYtas well. We then define
P(Yt|Yt−1, Rt=r, Qt=q, Ft−1=f) =
N(yt;Ar,qyt−1,Σr,q) iff = 0
N(yt;µr,q,Σr,q) iff = 1
In general, P(y1:l|q, l)can be an arbitrary joint distribution, e.g., consider an undirected graphical model
connectingyktoyk−1andyk+1. However, representing this graphically can get messy.
2.3.15
Abstract HMMs
[BVW00a] describes the “Abstract HMM” (AHMM), which is very closely related to HHMMs. (They have 3 slightly different models in [BVW00a, BVW00b, BVW01]; we consider the version in [BVW00b].) These
F1 1 F21 F31 abstractπ Q1 1 Q12 Q13 F2 1 F22 F32 concreteπ Q2 1 Q22 Q23 action A1 A2 A3 state S1 S2 S3 obs Y1 Y2 Y3
Figure 2.28: The DBN for a 3-level abstract HMM, from [BVW00b]. Note that we call the top layer level 1, they call it levelD. Note also that we drawFd
t aboveQdt, they draw it below, but the topology is the same.
The lines from the global world stateStare shown dotted merely to reduce clutter. The top level encodes the probability of choosing an abstract policy:P(Q1
t =π0|St−1=s) =σ(s, π0). The second level encodes the probability of choosing a concrete policy: P(Q2
t =π|St−1=s, Q1t =π0) =σπ0(s, π). The third level
encodes the probability of choosing a concrete action: P(At=a|St−1=s, Q2t =π) =σπ(s, a). The last
level encodes the world model:P(St=s0|St
−1=s, At=a).
authors are interested in inferring what policy an agent is following by observing its effects in the world. (We do not observe the world state, but we assume that the agent does.)
A (stochastic) policy is a (probabilistic) mapping from (fully observed) states to actions: σπ(s, a) = P(do actiona|in states).An abstract (stochastic) policy is a (probabilistic) mapping from states to lower level abstract policies, or “macro actions”. An abstract policy can call a sub-policy, which runs until termina- tion, returning control to the parent policy; the sub-policy can in turn call lower-level abstract policies, until we reach the bottom level of the hierarchy, where a policy can only produce concrete actions. [BVW00a] consider abstract policies of the “options” kind [SPS99], which is equivalent to assuming that there are no horizontal transitions within a level. (HAMs [PR97a] generalize this by allowing horizontal transitions (i.e., internal state) within a controller.) This simplifies the CPDs, as we see below.
This can be modelled by the DBN shown in Figure 2.28. It is obviously very similar to an HHMM, but is different in several respects, which we now discuss.
• There is a global world state variableSt, that is a parent of everything.
• The bottom level represents a concrete action, which always finishes in one step; hence there is no
FA
t =Ft3node, and noAt−1toAtarc.
• There is no arc fromF2
t toQ1t+1, since levels cannot make horizontal transitions (soFt1turns on as
soon asF2
t does). • There is no arc fromQ1
ttoAtorYt. In general, they assumeQdtonly depends on its immediate parent, Qdt−1, not its whole context,Q1:td−1. This substantially reduces the number of parameters (and allows
for more efficient approximate inference10). However, it may also lose some information. For example, 10We can apply Rao-Blackwellised particle filtering to sample theFandSnodes; if eachQnode has a single parent, theQnodes
if the bottom level controller determines how to leave a room via a door, we may be able to predict the future location of the agent (e.g., will it leave via the left or right door) more accurately if we take into account the complete calling context (e.g., the top-level goals).
• There is no arc fromQ1
t toFt2. They assume the termination probability depends on the current policy
only, not on the calling context.
Given these assumptions, the CPDs simplify as follows.
P(Fd
t = 1|Qdt =π, St=s, Ftd+1=b) =
0 ifb= 0 βπ(s) ifb= 1
whereβπ(s)is the probability of terminating in statesgiven that the current policy isπ.
P(Qdt =j|Qdt−1=i, Qtd−1=π, St−1=s, Ftd−1=f) =
δ(i, j) iff = 0 σπ(s, j) iff = 1
whereσπ(s, j)is the probability that abstract policyπchooses to call sub-policyjin states.
1-level AHMMs G1 G2 G3 FG 1 F2G F3G S1 S2 S3 Y1 Y2 Y3
Figure 2.29: A 1-level AHMM.FG
t turns on if stateStstatisfies goalGt; this causes a new goal to be chosen.
However, the new state may depend on the old state; hence there is no arc fromFG
t−1 toStto “reset” the submodel.
In [BVW01], they consider the special-case of a 1-level AHMM, which is shown in Figure 2.29. This has a particularly simple interpretation: Q1
t ≡Gtrepresents the goal that the agent currently has, andQ2t ≡St
is the world state. It is similar to a variable duration HMM (see Section 2.3.12), except the duration depends on how long it takes to satisfy the goal.
The transition matrix forGis the same as forQ1in an HMMM, except we never reset.
P(Gt=j|Gt−1=i, FtG−1=f) =
δ(i, j) iff = 0 AG(i, j) iff = 1
([BVW01] consider the special case whereAG(i, j)is a deterministic successor-state function for goals.) The transition matrix forFGis the same as forFDin an HMMM:
P(FG
t = 1|Gt=g, St=i) =ASg(i,end)
The transition matrix forSis the same as forQD in an HHMM, except we never “reset”, sinceFGis
not a parent (since the next state always depends on the previous state, even if the goal changes):
2.4
Continuous-state DBNs
So far, all the DBNs we have looked at have only had discrete-valued hidden nodes. We now consider DBNs with continuous-valued hidden nodes, or mixed discrete-continuous systems (sometimes called hybrid systems). In this section, we adopt the (non-standard) convention of representing discrete variables as squares and continuous variables as circles.11
2.4.1
Representing KFMs as DBNs
The graph structure for a KFM looks identical to that for an HMM or an IO-HMM (see Figures 2.1 and Figure 2.9), since it makes the same conditional independence assumptions. However, all the nodes are continuous, and the CPDs are all linear-Gaussian, i.e.,
P(Xt=xt|Xt−1=xt−1, Ut=u) = N(xt;Axt−1+Bu+µX, Q)
P(Yt=y|Xt=x, Ut=u) = N(y;Cx+Du+µY, R)
In the linear-Gaussian case, but not the HMM case, there is a one-to-one correspondence between zeros in the parameter matrices and absent arcs in the graphical model, as we discuss in Section 2.4.2.