This section presents an overview of the gradient ascent procedure used for all the experiments reported in this thesis. The methods are not novel so detailed descriptions are deferred to Appendix B.
All the experiments reported in this thesis used the Polak-Ribière conjugate gradient ascent algorithm [Fine, 1999, §5.5.2] described in Appendix B.1. This algorithm returns a search direction θ∗ (which we assume encompasses all parameters) that is orthogonal to any previous search direction.
Unless stated otherwise, all experiments used the GSEARCH algorithm [Baxter et al., 2001a] to conduct a line search along the direction θ∗ for the best step size γ. Traditional line search methods estimate η for each trial step size, attempting to find two values of γ that bracket the maximum of η(θ + γθ∗). Instead, GSEARCH computes the sign of the dot product between the search direction and local gradient estimates for each trial step size. A change in the sign of the dot product indicates that we have stepped past the local maximum we are searching for. Empirically, GSEARCH is more robust in the presence of noise than value-bracketing methods. Reasons for this are discussed in Appendix B.1, along with the details of the GSEARCH algorithm.
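The sign-based idea can be illustrated with a short sketch. This is not the GSEARCH algorithm of Appendix B.1; it is a simplified stand-in that we assume for illustration only. The function name `sign_line_search`, the doubling-then-bisection schedule, and the toy objective are all hypothetical choices.

```python
import random

def sign_line_search(theta, direction, grad_estimate,
                     step0=0.1, max_doublings=20, bisections=30):
    """Return a step size near the point where <gradient, direction> changes sign."""

    def uphill(gamma):
        # True while the (possibly noisy) gradient still points along the search direction.
        point = [t + gamma * d for t, d in zip(theta, direction)]
        grad = grad_estimate(point)
        return sum(g * d for g, d in zip(grad, direction)) > 0.0

    lo, hi = 0.0, step0
    for _ in range(max_doublings):
        if not uphill(hi):
            break              # sign changed: the maximum lies between lo and hi
        lo, hi = hi, 2.0 * hi
    else:
        return hi              # no sign change found within the step budget

    for _ in range(bisections):
        mid = 0.5 * (lo + hi)
        if uphill(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy problem (an assumption): eta(theta) = -(theta_0 - 1)^2 - (theta_1 + 2)^2,
# observed only through gradients corrupted by Gaussian noise.
rng = random.Random(0)

def noisy_gradient(theta):
    exact = [-2.0 * (theta[0] - 1.0), -2.0 * (theta[1] + 2.0)]
    return [g + rng.gauss(0.0, 0.01) for g in exact]

gamma = sign_line_search([0.0, 0.0], [2.0, -4.0], noisy_gradient)
```

For this toy objective the exact maximiser along the search direction is γ = 0.5. The search never evaluates η itself, only the sign of a dot product, which is the property the text credits for the robustness to noise.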
Quadratic weight penalties [Kushner and Clark, 1978, §5.2] were used in our experiments to prevent the parameters from entering sub-optimal maxima of the soft-max function. Maxima occur whenever the parameters grow large, saturating the soft-max function. See Appendix B.2 for details. The benefits of using penalty terms are demonstrated and discussed as part of the first experiment in Section 4.5.1.
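The saturation effect can be seen numerically. The following sketch is ours and is not taken from Appendix B.2: it assumes a penalised objective of the form η(θ) − κ Σ θ_c², and the names `softmax_sensitivity`, `penalised_gradient`, and `kappa` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable soft-max distribution over the given logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def softmax_sensitivity(logits):
    """Largest |d p_a / d logit_b|; this shrinks towards zero as the soft-max saturates."""
    p = softmax(logits)
    n = len(p)
    return max(abs(p[a] * ((1.0 if a == b else 0.0) - p[b]))
               for a in range(n) for b in range(n))

def penalised_gradient(grad_eta, theta, kappa):
    """Gradient of the assumed objective eta(theta) - kappa * sum_c theta_c^2."""
    return [g - 2.0 * kappa * t for g, t in zip(grad_eta, theta)]

small = softmax_sensitivity([0.1, -0.1])     # parameters near zero
large = softmax_sensitivity([50.0, -50.0])   # saturated parameters
pull_back = penalised_gradient([0.0, 0.0], [50.0, -50.0], kappa=0.01)
```

With saturated parameters the soft-max derivatives are vanishingly small, so the gradient of η alone gives almost no signal. The penalty term contributes a gradient that points back towards the origin, which is the role the quadratic penalty plays in the text.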
3.6 Summary
Key Points
I The FSC and the world-state induce a global-state Markov chain with transition matrix P(φ, θ) = [p(j, h|φ, θ, i, g)].
[Figure 3.2 diagram. Axes: World state, Internal state. Region labels: unknown dynamics, known dynamics. Algorithms shown: IState-GPOMDP, IOHMM-GPOMDP, Exp-GPOMDP, Alternative Exp-GPOMDP, GAMP.]
Figure 3.2: The relationship between the algorithms covered in the next three chapters. If a model of the world is available then we can use the algorithms on the right of the world-state axis, otherwise we are restricted to the left-hand side. On the left there is a choice of algorithms that make varying degrees of use of the internal-state model.
µ(u|θ, h, y), and r(i) are well behaved, then the derivative of the long-term average reward ∇η exists.
III Distributions generated using the soft-max function allow deterministic policies. They are used for most of the experiments reported in this thesis.
IV We use a standard conjugate gradient ascent method. The line search is unusual because it avoids using estimates of η, improving the robustness of the line search.
Sneak Preview
The next three chapters describe our novel policy-gradient algorithms, which were originally presented in Aberdeen and Baxter [2002]. All can incorporate internal state to allow them to cope with partially observable environments. Figure 3.2 classifies the algorithms by their use of knowledge of the global dynamics. On the right of the world-state axis, q(j|i, u), ν(y|i), and r(i) are assumed to be known. On the left, the dynamics are completely hidden. Similarly, the algorithms on top of the internal-state axis make use of knowledge of the internal-state transition probabilities ω(h|φ, g, y).
The GAMP algorithm is covered in Chapter 4. The IState-GPOMDP algorithm is covered in Chapter 5. The remaining algorithms in the top-left of Figure 3.2 are covered in Chapter 6.
Model-Based Policy Gradient
Research is what I’m doing when I don’t know what I’m doing.
—Wernher von Braun
Without a model of the world there is little choice but to learn through interaction with the world. However, if we are able to at least approximately model the world then gradients can be computed without simulation. For example, manufacturing plants may be reasonably well modeled by hand, or models can be estimated using methods from state identification theory [Ogata, 1990]. Given a model we can compute zero-variance gradient approximations quickly and with less bias than Monte-Carlo methods. This chapter introduces one such approximation: the GAMP algorithm. GAMP is feasible for many thousands of states. This is an order of magnitude improvement over model-based value-function algorithms that can handle tens to hundreds of states [Geffner and Bonet, 1998].
We begin with a generic description of how ∇η is computed analytically, then in Section 4.2 we describe GAMP, followed by some experimental results in Sections 4.4 and 4.5.
4.1 Computing ∇η with Internal State
For the purposes of this discussion the model of the POMDP is represented by the global-state transition matrix P(φ, θ). The parameter vector φ parameterises the FSC model ω(h|φ, g, y) and θ parameterises the policy model µ(u|θ, h, y). The entries of P(φ, θ) are given by Equation (3.1). This matrix has square dimension |S||G| and incorporates our knowledge of the world-state transitions given by q(j|i, u), the observation hiding process ν(y|i), the reward r(i), and the current parameters θ and φ.
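As an illustration of how such a matrix might be assembled, the sketch below assumes that Equation (3.1), which is not reproduced in this section, has the form p(j, h|φ, θ, i, g) = Σ_{y,u} ν(y|i) ω(h|φ, g, y) µ(u|θ, h, y) q(j|i, u). The array layouts, the random distributions, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(*shape):
    """Random conditional distribution; the last axis sums to one."""
    x = rng.random(shape)
    return x / x.sum(axis=-1, keepdims=True)

def global_transition_matrix(q, nu, omega, mu):
    """Global-state matrix under the assumed form of Equation (3.1).

    Assumed layouts: q[i, u, j], nu[i, y], omega[g, y, h], mu[h, y, u].
    Rows are indexed by the current global state (i, g) and columns by
    the next global state (j, h).
    """
    n_s, n_g = nu.shape[0], omega.shape[0]
    p = np.einsum("iy,gyh,hyu,iuj->igjh", nu, omega, mu, q)
    return p.reshape(n_s * n_g, n_s * n_g)

n_s, n_g, n_y, n_u = 3, 2, 2, 2          # |S|, |G|, observations, actions
q = random_dist(n_s, n_u, n_s)           # world-state transitions q(j|i, u)
nu = random_dist(n_s, n_y)               # observation process nu(y|i)
omega = random_dist(n_g, n_y, n_g)       # FSC transitions omega(h|g, y)
mu = random_dist(n_g, n_y, n_u)          # policy mu(u|h, y)
P = global_transition_matrix(q, nu, omega, mu)
```

Under this assumed form each row of `P` is a probability distribution over the |S||G| next global states, which is the property the derivation in this section relies on.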
Recall that the stationary distribution of the global state is π(φ, θ), a column vector of length |S||G|. The reward in each global state is assumed known and given by the column vector r. Let e be a length |S||G| column vector of all ones. Dropping the explicit dependence on φ and θ, π′e is a scalar with value 1. The symbol ′ is used to denote the transpose of a matrix. Also, eπ′ is the outer product of e with π, that is, a rank-1 |S||G| × |S||G| matrix with the stationary distribution in each row.
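These objects are easy to construct for a small chain. The sketch below uses a hypothetical 3-state transition matrix in place of the |S||G|-state global chain, and obtains π from an eigendecomposition, which is one of several reasonable ways to compute it.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, normalised to sum to one."""
    eigvals, eigvecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, k])
    return pi / pi.sum()

# Hypothetical ergodic chain standing in for P(phi, theta).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

pi = stationary_distribution(P)
e = np.ones(len(pi))          # column vector of all ones
e_pi = np.outer(e, pi)        # e pi': every row is the stationary distribution
```

Here `pi @ e` plays the role of the scalar π′e, and `e_pi` is the rank-1 matrix eπ′ used later to condition [I − P].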
We now derive an exact expression for the gradient of η with respect to the agent's parameters. The rest of this section follows the original derivation by Baxter and Bartlett [2001]. We start by rewriting the scalar long-term average reward (3.7) and its gradient as
\[
\begin{aligned}
\eta &= \pi' r \\
\nabla\eta &= (\nabla\pi')\, r.
\end{aligned}
\tag{4.1}
\]
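We can derive an expression for (∇π′) by differentiating both sides of the balance equation (3.6)
\[
\begin{aligned}
\nabla\pi' &= \nabla(\pi' P) \\
&= (\nabla\pi') P + \pi' (\nabla P) \\
\nabla\pi' - (\nabla\pi') P &= \pi' (\nabla P) \\
(\nabla\pi')\,[I - P] &= \pi' (\nabla P),
\end{aligned}
\tag{4.2}
\]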
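which should be understood as a set of linear equations for each of the $n_\phi + n_\theta$ parameters. For example, for parameter $\theta_c$ we have
\[
\begin{bmatrix}
\frac{\partial \pi(1,1)}{\partial\theta_c} & \cdots & \frac{\partial \pi(|S|,|G|)}{\partial\theta_c}
\end{bmatrix}
\begin{bmatrix}
1 - p(1,1 \mid \phi,\theta,1,1) & \cdots & -p(|S|,|G| \mid \phi,\theta,1,1) \\
\vdots & \ddots & \vdots \\
-p(1,1 \mid \phi,\theta,|S|,|G|) & \cdots & 1 - p(|S|,|G| \mid \phi,\theta,|S|,|G|)
\end{bmatrix}
=
\begin{bmatrix}
\pi(1,1) & \cdots & \pi(|S|,|G|)
\end{bmatrix}
\begin{bmatrix}
\frac{\partial p(1,1 \mid \phi,\theta,1,1)}{\partial\theta_c} & \cdots & \frac{\partial p(|S|,|G| \mid \phi,\theta,1,1)}{\partial\theta_c} \\
\vdots & \ddots & \vdots \\
\frac{\partial p(1,1 \mid \phi,\theta,|S|,|G|)}{\partial\theta_c} & \cdots & \frac{\partial p(|S|,|G| \mid \phi,\theta,|S|,|G|)}{\partial\theta_c}
\end{bmatrix}.
\]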
This system is under-constrained because [I − P] is not invertible. This can be shown by rearranging the balance equation to reveal that π′, the leading left eigenvector of P, is a left eigenvector of [I − P] with zero eigenvalue (all-zero vectors and matrices are represented by [0]):
\[
\begin{aligned}
\pi' &= \pi' P \\
\pi'\,[I - P] &= [0].
\end{aligned}
\tag{4.3}
\]
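The singularity is easy to confirm numerically. The sketch below uses a hypothetical 3-state chain whose stationary distribution can be checked by hand; it is an illustration only, not part of the original derivation.

```python
import numpy as np

# Hypothetical ergodic chain; pi below satisfies pi' P = pi'.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

A = np.eye(3) - P
residual = pi @ A                       # pi'[I - P], the zero row vector
rank = np.linalg.matrix_rank(A)         # one less than the dimension

# Under-constrained: adding any multiple of pi' to a candidate solution x
# leaves the left-hand side x[I - P] unchanged.
x = np.array([1.0, -2.0, 0.5])
same_lhs = np.allclose((x + 3.0 * pi) @ A, x @ A)
```

Because π′ lies in the left null space of [I − P], Equation (4.2) alone cannot pin down (∇π′); an extra constraint is needed.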
Recall that eπ′ is the |S||G| × |S||G| matrix with the stationary distribution π′ in each row. Since
\[
(\nabla\pi')\,e = \sum_{i,g} \nabla\pi(i,g) = \nabla \sum_{i,g} \pi(i,g) = \nabla 1 = 0,
\]
we obtain (∇π′)eπ′ = [0]. Thus, adding eπ′ to I − P adds [0] to (∇π′)[I − P] and we can rewrite (4.2) as
\[
(\nabla\pi')\,[I - P + e\pi'] = \pi' (\nabla P).
\]
To show that [I − P + eπ′] is invertible we call upon a classic matrix theorem:
Theorem 1 (Theorem 1, §4.5, Kincaid and Cheney [1991]). Let A be an n × n matrix with elements $a_{ij}$. Let $\|A\|_p$ be the subordinate matrix norm induced by the vector p-norm. For example,
\[
\|A\|_\infty := \max_i \sum_{j=1}^{n} |a_{ij}|.
\]
Then for any p, if $\lim_{n\to\infty} \|A^n\|_p = 0$, we have
\[
[I - A]^{-1} = \sum_{n=0}^{\infty} A^n. \tag{4.4}
\]
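The series in (4.4) can be checked numerically on a small example. The matrix below is a hypothetical choice with infinity-norm 0.5, so its powers vanish; the function name `neumann_inverse` and the truncation at 200 terms are our own assumptions.

```python
import numpy as np

def neumann_inverse(A, terms=200):
    """Approximate [I - A]^{-1} by the partial sum of A^n for n = 0, ..., terms - 1."""
    total = np.zeros_like(A)
    power = np.eye(A.shape[0])      # A^0
    for _ in range(terms):
        total += power
        power = power @ A           # advance to the next power of A
    return total

# Hypothetical matrix with maximum absolute row sum 0.5.
A = np.array([[0.2, 0.3],
              [0.1, 0.4]])

row_sum_norm = np.linalg.norm(A, np.inf)                        # max_i sum_j |a_ij|
tail_norm = np.linalg.norm(np.linalg.matrix_power(A, 200), np.inf)
series = neumann_inverse(A)
direct = np.linalg.inv(np.eye(2) - A)
```

The truncated series and the direct inverse agree to floating-point precision here because the powers of `A` decay geometrically.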
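We now demonstrate that $A^n = (P - e\pi')^n$ converges to [0] as n → ∞, and hence that [I − (P − eπ′)] is invertible. The first step is to show that $(P - e\pi')^n = P^n - e\pi'$. This is trivially true for n = 1. Now we assume it is true for some n and demonstrate that it is true for n + 1:
\[
\begin{aligned}
(P - e\pi')^{n+1} &= (P^n - e\pi')(P - e\pi') && \text{from the assumption} \\
&= P^{n+1} - e\pi' P - P^n e\pi' + e\pi' e\pi' \\
&= P^{n+1} - e\pi' - P^n e\pi' + e 1 \pi' && \text{from } \pi' P = \pi' \text{ and } \pi' e = 1 \\
&= P^{n+1} - P^n e\pi' \\
&= P^{n+1} - e\pi' && \text{from } P e = e \text{, since each row of } P \text{ sums to one,}
\end{aligned}
\]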
and by induction it is true for all n. As n → ∞ we have $P^n \to e\pi'$, so
\[
\lim_{n\to\infty} A^n = \lim_{n\to\infty} P^n - e\pi' = e\pi' - e\pi' = [0].
\]
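Both the induction identity and the limit can be checked numerically. The sketch below uses a hypothetical ergodic 3-state chain with a stationary distribution that can be verified by hand; the helper name `identity_gap` is ours.

```python
import numpy as np

# Hypothetical ergodic chain and its stationary distribution.
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])
e_pi = np.outer(np.ones(3), pi)          # e pi'

def identity_gap(n):
    """Largest entry of |(P - e pi')^n - (P^n - e pi')|."""
    lhs = np.linalg.matrix_power(P - e_pi, n)
    rhs = np.linalg.matrix_power(P, n) - e_pi
    return np.max(np.abs(lhs - rhs))

gaps = [identity_gap(n) for n in range(1, 11)]

# P^n approaches e pi', so A^n = P^n - e pi' approaches [0].
limit_gap = np.max(np.abs(np.linalg.matrix_power(P, 100) - e_pi))
```

The gaps are at the level of floating-point rounding for every n tested, and the hundredth power of `P` is numerically indistinguishable from eπ′ for this chain.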
Thus [I − (P − eπ′)] is invertible and we can write
\[
(\nabla\pi') = \pi' (\nabla P)\,[I - P + e\pi']^{-1}.
\]
So, applying Equation (4.1),
\[
\nabla\eta = \pi' (\nabla P)\,[I - P + e\pi']^{-1} r. \tag{4.6}
\]
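As a sanity check, Equation (4.6) can be evaluated on a toy model and compared with finite differences of η = π′r. The sketch below is not the GAMP algorithm; it assumes a hypothetical parameterisation in which row i of P is the soft-max of row i of a parameter matrix `theta`, and all function names are ours.

```python
import numpy as np

def transition_matrix(theta):
    """Hypothetical model: row i of P is the soft-max of row i of theta."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def stationary_distribution(P):
    """Solve pi'[I - P] = [0] together with pi' e = 1 by least squares."""
    n = P.shape[0]
    A = np.vstack([(np.eye(n) - P).T, np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def average_reward(theta, r):
    """eta = pi' r."""
    return stationary_distribution(transition_matrix(theta)) @ r

def exact_gradient(theta, r):
    """Evaluate Equation (4.6) for every entry of theta."""
    P = transition_matrix(theta)
    n = P.shape[0]
    pi = stationary_distribution(P)
    # x = [I - P + e pi']^{-1} r, via a linear solve rather than an explicit inverse.
    x = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), r)
    grad = np.zeros_like(theta)
    for i in range(n):
        for k in range(n):
            # Only row i of P depends on theta[i, k]:
            # dP[i, j] / dtheta[i, k] = P[i, j] * ((j == k) - P[i, k]).
            dP_row = P[i] * ((np.arange(n) == k).astype(float) - P[i, k])
            grad[i, k] = pi[i] * (dP_row @ x)
    return grad

def finite_difference_gradient(theta, r, eps=1e-6):
    """Central differences of eta, used only to check the exact gradient."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        grad[idx] = (average_reward(theta + step, r)
                     - average_reward(theta - step, r)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 3))
r = np.array([0.0, 1.0, 5.0])
grad_exact = exact_gradient(theta, r)
grad_fd = finite_difference_gradient(theta, r)
```

The linear solve replaces the explicit inverse in (4.6); both give the same vector, but solving is the more common numerical choice. For a soft-max parameterisation each row of the gradient sums to zero, since adding a constant to a row of `theta` leaves P unchanged.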