Most interesting MDP problems are too large to be solved precisely and must be approximated. The methods for approximately solving Markov decision processes can be divided into two main types: 1) policy search methods, and 2) approximate dynamic programming methods. This thesis focuses on approximate dynamic programming.
Policy search methods rely on local search in a restricted policy space. The policy may be represented, for example, as a finite-state controller (Stanley & Miikkulainen, 2004) or as a greedy policy with respect to an approximate value function (Szita & Lorincz, 2006). Policy search methods have achieved impressive results in such domains as Tetris (Szita & Lorincz, 2006) and helicopter control (Abbeel et al., 2006). However, they are notoriously hard to analyze. We are not aware of any theoretical guarantees regarding the quality of the solution.
Approximate dynamic programming (ADP) methods, also known as value function approximation, first calculate the value function approximately (Bertsekas & Ioffe, 1997; Powell, 2007a; Sutton & Barto, 1998). The policy is then calculated from this value function. The advantage of value function approximation is that it is easy to determine the quality of a value function using samples, while determining the quality of a policy usually requires extensive simulation. We discuss these properties in greater detail in Section 2.3.
A basic setup of value function approximation is depicted in Figure 2.2. The ovals represent inputs, the rectangles represent computational components, and the arrows represent information flow. The input “Features” represents functions that assign a set of real values to each state, as described in Section 2.4.1. The input “Samples” represents a set of simple sequences of states and actions generated using the transition model of an MDP, as described in Section 2.4.2.
[Figure 2.2. Approximate Dynamic Programming Framework. Diagram labels: Features, Samples, Calculate value function, Value function, Compute policy, Policy, Execute policy, Offline, Online: samples.]
A significant part of the thesis is devoted to studying methods that calculate the value function from the samples and state features. These methods are described and analyzed in Chapters 3, 4, and 5. A crucial consideration in the development of the methods is the way in which a policy is constructed from a value function.
Value function approximation methods can be classified based on the source of samples into online and offline methods, as Figure 2.2 shows. Online methods interleave execution of a calculated policy with sample gathering and value function approximation. As a result, a new policy may be calculated often during execution. Offline methods use a fixed number of samples gathered earlier, prior to plan execution. They are simpler to analyze and implement than online methods, but may perform worse due to fewer available samples. To simplify the analysis, our focus is on the offline methods, and we indicate the potential difficulties with extending the methods to the online ones when appropriate.
2.4.1 Features
A set of state features is a necessary component of value function approximation. These features must be supplied in advance and must roughly capture the properties of the problem. For each state $s$, we define a vector $\phi(s)$ of features and let $\phi_i : \mathcal{S} \to \mathbb{R}$ denote the function that maps each state to the value of feature $i$:
\[ \phi_i(s) = (\phi(s))_i. \]
The desirable properties of the features depend strongly on the algorithm, samples, and attributes of the problem, and their best choice is not yet fully understood. The feature function $\phi_i$ can also be treated as a vector, similarly to the value function $v$. We use $|\phi|$ to denote the number of features.
Value function approximation methods combine the features into a value function. The main purpose is to limit the set of value functions that can be represented, as the following definition shows.
Definition 2.20. Assume a given convex polyhedral set $\mathcal{M} \subseteq \mathbb{R}^{|\mathcal{S}|}$. A value function $v$ is representable (in $\mathcal{M}$) if $v \in \mathcal{M}$.
This definition captures the basic properties, and in general we assume specific instantiations of it. In particular, we generally assume that $\mathcal{M}$ can be represented using a set of linear constraints, although most approaches we propose and study can be easily generalized to quadratic functions.
Many complex methods that combine features into a value function have been developed, such as neural networks and genetic algorithms (Bertsekas & Ioffe, 1997). Most of these complex methods are extremely hard to analyze, computationally complex, and hard to use. A simpler, and more common, method is linear value function approximation. In linear value function approximation, the value function is represented as a linear combination of nonlinear features φ(s). Linear value function approximation is easy to apply and analyze, and therefore commonly used.
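It is helpful to represent linear value function approximation in terms of matrices. To do that, let the $|\mathcal{S}| \times m$ matrix $\Phi$ represent the features, where $m$ is the number of features. The feature matrix $\Phi$, also known as the basis, has the feature vectors $\phi(s)$ of the states as rows and, equivalently, the feature functions $\phi_i$ as columns:
\[
\Phi = \begin{pmatrix} - & \phi(s_1)^T & - \\ - & \phi(s_2)^T & - \\ & \vdots & \end{pmatrix}
\qquad
\Phi = \begin{pmatrix} | & | & \\ \phi_1 & \phi_2 & \dots \\ | & | & \end{pmatrix}
\]
The value function $v$ is then represented as $v = \Phi x$ and $\mathcal{M} = \operatorname{colspan}(\Phi)$.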
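As a minimal sketch of the matrix representation above, the following Python fragment builds $\Phi$ row by row and evaluates $v = \Phi x$ for a toy problem. The three states, the feature map `features`, and the coefficients `x` are hypothetical and chosen only for illustration.

```python
# Toy illustration; the states, features, and coefficients are made up.

def features(state):
    """Return phi(s): a constant feature and a hypothetical state index feature."""
    return [1.0, float(state)]

def feature_matrix(states):
    """Stack the feature vectors phi(s) as the rows of Phi (size |S| x m)."""
    return [features(s) for s in states]

def value_function(Phi, x):
    """Compute v = Phi x, which yields one value per state."""
    return [sum(p * c for p, c in zip(row, x)) for row in Phi]

states = [0, 1, 2]
Phi = feature_matrix(states)
x = [0.5, 2.0]
v = value_function(Phi, x)
print(v)
```

Every vector produced this way lies in the column span of $\Phi$ by construction, which is why restricting the search to coefficients $x$ restricts the representable value functions.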
Generally, it is desirable that the total number of features is relatively small, for two main reasons. First, a limited set of features enables generalization from an incomplete set of samples. Second, it reduces computational complexity since it restricts the space of representable value functions. When the number of features is large, it is possible to achieve these goals using regularization. Regularization restricts the coefficients $x$ in $v = \Phi x$ using a norm: $\|x\| \le \psi$. Therefore, we consider the following two types of representation:
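Linear space: $\mathcal{M} = \{ v \in \mathbb{R}^{|\mathcal{S}|} : v = \Phi x \}$
Regularized: $\mathcal{M}(\psi) = \{ v \in \mathbb{R}^{|\mathcal{S}|} : v = \Phi x,\ \Omega(x) \le \psi \}$, where $\Omega : \mathbb{R}^m \to \mathbb{R}$ is a convex regularization function and $m$ is the number of features.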
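The regularized representation can be illustrated with a short membership test. This is only a sketch: it assumes that $\Omega$ is the $L_1$ norm, which is one possible convex choice and not one fixed by the definition, and the coefficient vectors and the bound `psi` are hypothetical.

```python
def l1_norm(x):
    """One possible convex regularization function Omega (an assumption here)."""
    return sum(abs(c) for c in x)

def in_regularized_set(x, psi, omega=l1_norm):
    """Return True when v = Phi x belongs to M(psi), that is, Omega(x) <= psi."""
    return omega(x) <= psi

psi = 1.0
x_small = [0.5, -0.25]   # Omega(x) = 0.75
x_large = [3.0, -2.0]    # Omega(x) = 5.0
print(in_regularized_set(x_small, psi))
print(in_regularized_set(x_large, psi))
```

Any coefficient vector gives a member of the linear space, while only vectors that satisfy the bound give a member of the regularized set.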
When not necessary, we omit $\psi$ in the notation of $\mathcal{M}(\psi)$. Methods that we propose require the following standard assumption (Schweitzer & Seidmann, 1985).
Assumption 2.21. All multiples of the constant vector $\mathbf{1}$ are representable in $\mathcal{M}$. That is, for all $k \in \mathbb{R}$ we have that $k\mathbf{1} \in \mathcal{M}$.
We implicitly assume that the first column of $\Phi$, that is $\phi_1$, is the constant vector $\mathbf{1}$.
Assumption 2.21 is satisfied when the first column of $\Phi$ is $\mathbf{1}$ (a constant feature) and the regularization, if any, does not place any penalty on this feature. The influence of the constant feature is typically negligible because adding a constant to the value function does not influence the policy, as Lemma C.5 shows.
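The effect of the constant feature can be checked numerically on a small example. The sketch below assumes that the policy is greedy with respect to the value function; the two-state model, the rewards, and the discount factor are hypothetical. It illustrates the statement on one instance and is not a proof of it.

```python
GAMMA = 0.9  # hypothetical discount factor

# Hypothetical two-state, two-action model: P[s][a] is the distribution over
# next states and R[s][a] is the immediate reward.
P = {0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
     1: {0: [0.5, 0.5], 1: [1.0, 0.0]}}
R = {0: {0: 1.0, 1: 0.0},
     1: {0: 0.0, 1: 2.0}}

def greedy_policy(v):
    """Pick, in every state, the action maximizing r + gamma * E[v(s')]."""
    policy = {}
    for s in P:
        def backup(a):
            expected = sum(p * v[t] for t, p in enumerate(P[s][a]))
            return R[s][a] + GAMMA * expected
        policy[s] = max(P[s], key=backup)
    return policy

v = [1.0, 5.0]
shifted = [value + 10.0 for value in v]  # v + 10 * 1
print(greedy_policy(v))
print(greedy_policy(shifted))
```

The shift adds the same amount, $\gamma k$, to the backup of every action, so the maximizing action in each state is unchanged.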
[Figure 2.3. Example of equivalence of pairs of state-action values. Diagram labels: states s1, s2; actions a1, a2; state-action pairs (s1, a1), (s1, a2), (s2, a1).]
The state-action value function $q$ can be approximated similarly. The main difference is that $q$ is approximated for every action. That is, for all actions $a \in \mathcal{A}$:
\[ q(a) = \Phi_a x_a. \]
Notice that $\Phi_a$ and $x_a$ may be the same for multiple states or actions.
Policies, like value functions, can be represented as vectors. That is, a policy $\pi$ can be represented as a vector over state-action pairs.
2.4.2 Samples
In most practical problems, the number of states is too large to be explicitly enumerated. Therefore, even though the value function is restricted as described in Section 2.4.1, the problem cannot be solved optimally. The approach taken in reinforcement learning is to sample a limited number of states, actions, and their transitions to approximately calculate the value function. It is possible to rely on state samples because the value function is restricted to the representable set $\mathcal{M}$. Issues raised by sampling are addressed in greater detail in Chapter 9.
Samples are usually used to approximate the Bellman residual. First, we show a formal definition of the samples and then show how to use them.
Definition 2.22. One-step simple samples are defined as:
\[ \tilde{\Sigma} \subseteq \{ (s, a, (s_1 \dots s_n), r(s, a)) : s \in \mathcal{S},\ a \in \mathcal{A} \}, \]
where $s_1 \dots s_n$ are selected i.i.d. from the distribution $P(s, a)$ for every $s, a$ independently.
Definition 2.23. One-step samples with expectation are defined as follows:
\[ \bar{\Sigma} \subseteq \{ (s, a, P(s, a), r(s, a)) : s \in \mathcal{S},\ a \in \mathcal{A} \}. \]
Notice that $\bar{\Sigma}$ are more informative than $\tilde{\Sigma}$, but are often unavailable. Membership of states in the samples is denoted simply as $s \in \Sigma$ or $(s, a) \in \Sigma$, with the remaining variables, such as $r(s, a)$, considered to be available implicitly. Examples of these samples are sketched in Figure 2.4.
[Figure 2.4. Sample types. Diagram labels: $\tilde{\Sigma}$, $\bar{\Sigma}$.]
We use $|\bar{\Sigma}|_s$ to denote the number of samples in terms of distinct states, and $|\bar{\Sigma}|_a$ to denote the number of samples in terms of state–action pairs. The same notation is used for $\tilde{\Sigma}$. As defined here, the samples do not repeat for states and actions, which differs from the traditional sampling assumptions in machine learning. Usually, the samples are assumed to be drawn with repetition from a given distribution. In comparison, we do not assume a distribution over states and actions.
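When a model is available, both sample types can be generated directly from it. The following sketch assumes that a simple sample stores the tuple (s, a, (s_1 ... s_n), r(s, a)) and that a sample with expectation stores (s, a, P(s, a), r(s, a)). The model `P`, the rewards `R`, the chosen state-action pairs, and all function names are hypothetical.

```python
import random

# Hypothetical model: P[(s, a)] is the next-state distribution and
# R[(s, a)] is the reward of the state-action pair.
P = {(0, 0): [0.8, 0.2], (0, 1): [0.1, 0.9], (1, 0): [0.5, 0.5]}
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0}

def samples_with_expectation(pairs):
    """Build tuples (s, a, P(s, a), r(s, a)) for the chosen pairs."""
    return [(s, a, P[(s, a)], R[(s, a)]) for (s, a) in pairs]

def simple_samples(pairs, n, rng):
    """Build tuples (s, a, (s_1 ... s_n), r(s, a)), with each s_i drawn
    i.i.d. from P(s, a), independently for every pair."""
    result = []
    for (s, a) in pairs:
        dist = P[(s, a)]
        next_states = tuple(rng.choices(range(len(dist)), weights=dist, k=n))
        result.append((s, a, next_states, R[(s, a)]))
    return result

def count_states(samples):
    """Number of samples in terms of distinct states."""
    return len({sample[0] for sample in samples})

def count_pairs(samples):
    """Number of samples in terms of distinct state-action pairs."""
    return len({(sample[0], sample[1]) for sample in samples})

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
pairs = [(0, 0), (0, 1), (1, 0)]  # no pair is repeated
sigma_bar = samples_with_expectation(pairs)
sigma_tilde = simple_samples(pairs, n=5, rng=rng)
print(count_states(sigma_tilde), count_pairs(sigma_tilde))
```

The list of pairs is fixed by hand rather than drawn from a distribution, which matches the remark that no distribution over states and actions is assumed.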
The sampling models vary significantly across domains. In some domains, it may be very easy and cheap to gather samples. For example, in the blood inventory management problem, the model of the problem has been constructed based on historical statistics. The focus of this work is on problems with a model available. This fact simplifies many of the assumptions on the source and structure of the samples, since they can be essentially generated for an arbitrary number of states. In general reinforcement learning, often the only option is to gather samples during the execution. Much work has focused on defining settings and sample types appropriate for this case (Kakade, 2003), and we discuss some of them in Chapter 9.
In online reinforcement learning algorithms (see Figure 2.2), the samples are generated during the execution. It is then important to determine the tradeoff between exploration and exploitation. We, however, consider offline algorithms, in which the samples are generated in advance. While it is still desirable to minimize the number of samples needed, there is no tradeoff with the final performance. Offline methods are much easier to analyze and are more appropriate in many planning settings.
The samples, as mentioned above, are used to approximate the Bellman operator and the set of transitive-feasible value functions.
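Definition 2.24. The sampled Bellman operator and the corresponding set of sampled transitive-feasible functions are defined as:
\[
(\bar{L}(v))(\bar{s}) =
\begin{cases}
\max_{\{a \,:\, (\bar{s}, a) \in \bar{\Sigma}\}} \; r(\bar{s}, a) + \gamma \sum_{s' \in \mathcal{S}} P(\bar{s}, a, s')\, v(s') & \text{when } \bar{s} \in \bar{\Sigma} \\
-\infty & \text{otherwise}
\end{cases}
\tag{2.3}
\]
\[
\bar{\mathcal{K}} = \{ v : \forall s \in \mathcal{S} \;\; v(s) \ge (\bar{L} v)(s) \}
\tag{2.4}
\]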
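A direct, unoptimized evaluation of the sampled operator can be sketched as follows. The code assumes that each sample with expectation is stored as a tuple (s, a, P(s, a), r(s, a)) with P(s, a) given as a list over all states; the samples, the discount factor, and the function names are hypothetical.

```python
import math

GAMMA = 0.9  # hypothetical discount factor

# Hypothetical samples with expectation: (s, a, P(s, a), r(s, a)).
# State 2 does not appear in the samples.
sigma_bar = [
    (0, 0, [0.8, 0.2, 0.0], 1.0),
    (0, 1, [0.0, 0.5, 0.5], 0.0),
    (1, 0, [0.0, 0.0, 1.0], 2.0),
]

def sampled_bellman(v, samples, state):
    """Evaluate the sampled operator at one state; -inf when the state is unsampled."""
    backups = [r + GAMMA * sum(p * v[t] for t, p in enumerate(dist))
               for (s, a, dist, r) in samples if s == state]
    return max(backups, default=-math.inf)

def in_sampled_feasible_set(v, samples):
    """Check v(s) >= (sampled operator applied to v)(s) for every state."""
    return all(v[s] >= sampled_bellman(v, samples, s) for s in range(len(v)))

v = [12.0, 10.0, 5.0]
print([sampled_bellman(v, sigma_bar, s) for s in range(len(v))])
print(in_sampled_feasible_set(v, sigma_bar))
```

Because the operator returns $-\infty$ for unsampled states, those states impose no constraint in the feasibility check, so sampling can only remove constraints.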
The less-informative $\tilde{\Sigma}$ can be used as follows.
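Definition 2.25. The estimated Bellman operator and the corresponding set of estimated transitive-feasible functions are defined as:
\[
(\tilde{L}(v))(\bar{s}) =
\begin{cases}
\max_{\{a \,:\, (\bar{s}, a) \in \tilde{\Sigma}\}} \; r(\bar{s}, a) + \gamma \frac{1}{n} \sum_{i=1}^{n} v(s_i) & \text{when } \bar{s} \in \tilde{\Sigma} \\
-\infty & \text{otherwise}
\end{cases}
\tag{2.5}
\]
\[
\tilde{\mathcal{K}} = \{ v : \forall s \in \mathcal{S} \;\; v(s) \ge (\tilde{L} v)(s) \}
\tag{2.6}
\]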
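The estimated operator replaces the expectation over next states with an average over the sampled next states. A minimal sketch, assuming that each simple sample is stored as a tuple (s, a, (s_1 ... s_n), r(s, a)); the samples, the discount factor, and the function name are hypothetical.

```python
import math

GAMMA = 0.9  # hypothetical discount factor

# Hypothetical simple samples: (s, a, (s_1 ... s_n), r(s, a)).
# State 1 does not appear in the samples.
sigma_tilde = [
    (0, 0, (0, 0, 1, 0), 1.0),
    (0, 1, (1, 1, 1, 1), 0.0),
]

def estimated_bellman(v, samples, state):
    """Evaluate the estimated operator at one state; -inf when the state is unsampled."""
    backups = [r + GAMMA * sum(v[t] for t in next_states) / len(next_states)
               for (s, a, next_states, r) in samples if s == state]
    return max(backups, default=-math.inf)

v = [4.0, 8.0]
print(estimated_bellman(v, sigma_tilde, 0))
print(estimated_bellman(v, sigma_tilde, 1))
```

With few sampled next states, the average can differ noticeably from the true expectation, and this difference is the source of the transition estimation error.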
Notice that the operators $\tilde{L}$ and $\bar{L}$ map value functions to a subset of all states, namely only the states that are sampled. The values for other states are assumed to be undefined.
The samples can also be used to create an approximation of the initial distribution, or of the distribution of visitation frequencies of a given policy. The estimated initial distribution $\bar{\alpha}$ is defined as:
\[
\bar{\alpha}(s) =
\begin{cases}
\alpha(s) & s \in \bar{\Sigma} \\
0 & \text{otherwise}
\end{cases}
\]
Although we define the sampled operators and distributions directly above, in applications only their estimated versions are typically calculated. That means calculating $\bar{\alpha}^T \Phi$ instead of estimating $\bar{\alpha}$ first. The generic definitions above help to generalize the analysis to various types of approximate representations.
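To make the two quantities concrete, the sketch below first forms the estimated initial distribution explicitly and then multiplies it by the feature matrix. In an application only the product would be accumulated over the sampled states; the explicit vector is kept here for clarity. The distribution `alpha`, the matrix `Phi`, and the set of sampled states are hypothetical.

```python
def estimated_alpha(alpha, sampled_states):
    """Keep alpha(s) for sampled states and use 0 for all other states."""
    return [a if s in sampled_states else 0.0 for s, a in enumerate(alpha)]

def vector_times_matrix(vec, Phi):
    """Compute the row vector vec^T Phi, one entry per feature."""
    return [sum(vec[s] * Phi[s][i] for s in range(len(Phi)))
            for i in range(len(Phi[0]))]

alpha = [0.5, 0.25, 0.25]                     # hypothetical initial distribution
Phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]    # hypothetical feature matrix
sampled_states = {0, 1}                       # state 2 is not sampled

alpha_bar = estimated_alpha(alpha, sampled_states)
alpha_bar_phi = vector_times_matrix(alpha_bar, Phi)
print(alpha_bar)
print(alpha_bar_phi)
```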
To define bounds on the sampling behavior, we propose the following assumptions. These assumptions are intentionally generic to apply to a wide range of scenarios. Chapter 9 examines some more-specific sampling conditions and their implications in practice. Note in particular that the assumptions apply only to value functions that are representable. The first assumption limits the error due to missing transitions in the sampled Bellman operator $\bar{L}$.
Assumption 2.26 (State Selection Behavior). The representable value functions satisfy for some $\epsilon_p$:
\[ \mathcal{K} \cap \mathcal{M} \subseteq \bar{\mathcal{K}} \cap \mathcal{M} \subseteq \mathcal{K}(\epsilon_p) \cap \mathcal{M}. \]
When $\mathcal{M}(\psi)$ is a function of $\psi$, we write $\epsilon_p(\psi)$ to denote the dependence.
The constant $\epsilon_p$ bounds the potential violation of the Bellman residual on states that are not provided as a part of the sample. In addition, all value functions that are transitive-feasible for the full Bellman operator are transitive-feasible in the sampled version; the sampling only removes constraints on the set.
The second assumption bounds the error due to sampling non-uniformity.
Assumption 2.27 (Uniform Sampling Behavior). For all representable value functions $v \in \mathcal{M}$:
\[ |(\alpha - \bar{\alpha})^T v| \le \epsilon_c. \]
This assumption is necessary to estimate the initial distribution from samples. This is important only in some of the methods that we study and strongly depends on the actual domain. For example, the initial distribution is very easy to estimate when there is only a single initial state. The constant $\epsilon_c$ essentially represents the maximal difference between the true expected return and the sampled expected return for a representable value function.
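The quantity bounded in Assumption 2.27 can be evaluated for any single value function. The sketch below does so for one hypothetical representable $v$. The constant in the assumption must hold for all representable value functions, so the number computed here is only a lower bound on it. All vectors are made up for illustration.

```python
def estimation_error(alpha, alpha_bar, v):
    """Compute |(alpha - alpha_bar)^T v| for one value function v."""
    return abs(sum((a - b) * value for a, b, value in zip(alpha, alpha_bar, v)))

v = [0.5, 2.5, 4.5]              # hypothetical representable value function

# State 2 carries initial probability but is not sampled.
alpha = [0.5, 0.25, 0.25]
alpha_bar = [0.5, 0.25, 0.0]
print(estimation_error(alpha, alpha_bar, v))

# A single initial state that is sampled: the estimate is exact.
single = [1.0, 0.0, 0.0]
print(estimation_error(single, single, v))
```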
The third assumption quantifies the error on the estimated Bellman operator $\tilde{L}$.
Assumption 2.28 (Transition Estimation Behavior). The representable value functions satisfy for some $\epsilon_s$:
\[ \bar{\mathcal{K}}(-\epsilon_s) \cap \mathcal{M} \subseteq \tilde{\mathcal{K}} \cap \mathcal{M} \subseteq \bar{\mathcal{K}}(\epsilon_s) \cap \mathcal{M}, \]
where $\bar{\Sigma}$ and $\tilde{\Sigma}$ (and therefore $\bar{\mathcal{K}}$ and $\tilde{\mathcal{K}}$) are defined for identical sets of states. When $\mathcal{M}(\psi)$ is a function of $\psi$, we write $\epsilon_s(\psi)$ to denote the dependence.
The constant $\epsilon_s$ limits the maximal error in estimating the constraints from samples. This error is zero when the samples $\bar{\Sigma}$ are available. Note that unlike $\epsilon_p$, the error $\epsilon_s$ applies both to the left and right sides of the subset definition.