During this description of reinforcement learning, one fundamental problem has not been mentioned. Regardless of the method selected, how can we both keep a record of, and continually update, every state and state-action pair in problems with huge state spaces, such as chess, which contains roughly $10^{43}$ positions? With large state spaces, many states may never be visited even once after thousands of episodes, let alone visited often enough to learn a meaningful policy.
Clearly, the only way this can be achieved is by generalizing from previously experienced states to states not yet seen. Fortunately, generalization from examples has been researched extensively for many years across many fields, and we need only integrate these methods into our reinforcement-learning model. The type of generalization required is often called function approximation, because the examples are drawn from a desired function, such as a value function. Function approximation can be achieved using techniques such as supervised-learning neural networks, as described in chapter 2, pattern recognition and statistical curve fitting (Sutton and Barto, 1998).

3.4.2.1 Value Prediction with Neural Networks
Generally, the approximate value function $V_t$ at time $t$ has been stored for every state, probably in a table. However, when building a generalized function we instead represent this data in a parameterized functional form with a parameter vector $\vec{\theta}_t$. Therefore, the function $V_t$ now depends totally on $\vec{\theta}_t$ and varies from time step to time step only as $\vec{\theta}_t$ varies. This is particularly useful when implemented with a neural network, as such networks are primarily a means of calculating vector equations (Sutton and Barto, 1998). For instance, if $V_t$ is the function computed by the network, with $\vec{\theta}_t$ being the vector of connection weights, then by adjusting these weights any of a wide range of different functions $V_t$ can be implemented by the network. Generally, $\vec{\theta}_t$ does not have to have a value for every state. Instead, each time a parameter is adjusted, a number of states have their estimated values changed (Sutton and Barto, 1998).

Chapter 3: Reinforcement Learning Richard Dazeley
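The effect of a parameterized value function can be illustrated with a minimal sketch. It assumes a linear function of three hand-picked features rather than the neural network discussed in the text, and the names `features` and `value`, together with the 1000-state problem, are hypothetical. The point it shows is that adjusting a single parameter changes the estimated value of almost every state at once.

```python
# Hypothetical illustration: 1000 states described by only 3 parameters.

def features(state, n_states=1000):
    """Map a state index to a small feature vector (an assumed encoding)."""
    x = state / (n_states - 1)      # position scaled to [0, 1]
    return [1.0, x, x * x]          # bias, linear and quadratic terms


def value(theta, state):
    """V(s) = theta . phi(s): the estimate is determined entirely by theta."""
    return sum(w * f for w, f in zip(theta, features(state)))


theta = [0.0, 1.0, 0.0]
before = [value(theta, s) for s in range(1000)]

theta[1] += 0.5                     # adjust a single parameter
after = [value(theta, s) for s in range(1000)]

# Count how many state values moved as a result of that one adjustment.
changed = sum(1 for b, a in zip(before, after) if a != b)
print(f"{len(theta)} parameters, {changed} of 1000 state values changed")
```

Only state 0, whose linear feature is zero, keeps its old value. A table would instead need one entry per state, and each update would touch only one of them.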
The problem here is that neural networks using supervised learning expect training examples whose targets they can attempt to converge towards. Therefore, the return $R_t$ from the environment must serve as the training target, with the difference between it and the network's current estimate providing the error used to adjust the weights through backpropagation. Thus, we must view each backup as a conventional training example (Sutton and Barto, 1998).
When using supervised learning we are generally seeking to minimize the mean-squared error (MSE) over some distribution, $P$, of the inputs. In the value prediction problem the inputs are states and the target function is the true value function $V^\pi$; therefore, the MSE for an approximation $V_t$, using $\vec{\theta}_t$, is given in Equation 3.26. The distribution $P$ is important, as it is not usually possible to reduce the error to zero for all states (Sutton and Barto, 1998). Therefore, $P$ determines how the errors in different states are traded off against one another.
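To illustrate the role of $P$ in the weighted error that Equation 3.26 defines, the following sketch evaluates one fixed approximation under two different state distributions. The three-state problem, the function name `weighted_mse` and all numbers are assumptions made purely for illustration.

```python
def weighted_mse(true_values, approx_values, distribution):
    """Sum over states of P(s) * (V_pi(s) - V_t(s))**2."""
    return sum(
        p * (v_true - v_approx) ** 2
        for v_true, v_approx, p in zip(true_values, approx_values, distribution)
    )


# Hypothetical three-state problem; all numbers are illustrative only.
v_pi = [1.0, 2.0, 3.0]           # assumed true values
v_t = [1.0, 2.5, 2.0]            # assumed approximation
uniform = [1 / 3, 1 / 3, 1 / 3]  # every state equally important
skewed = [0.1, 0.8, 0.1]         # the middle state dominates

print(weighted_mse(v_pi, v_t, uniform))
print(weighted_mse(v_pi, v_t, skewed))
```

The same approximation scores roughly 0.42 under the uniform distribution and roughly 0.30 under the skewed one, because the skewed distribution gives little weight to the state with the largest error. An approximator trained against the skewed distribution is therefore free to sacrifice accuracy in rarely weighted states.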
$$MSE(\vec{\theta}_t) = \sum_{s \in S} P(s)\left[V^\pi(s) - V_t(s)\right]^2$$

Equation 3.26: Calculation of MSE for an approximation $V_t$ using $\vec{\theta}_t$.

Basically, integration of TD(λ) and neural networks is achieved by taking the TD-error, $\delta_t$, as defined earlier, and adjusting our network's vector of weights by applying Equation 3.27, where $\vec{e}_t$ is a vector of eligibility traces.
$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\,\delta_t\,\vec{e}_t$$

Equation 3.27: Vector weight update rule.
There is one element in $\vec{e}_t$ for each element in $\vec{\theta}_t$, and they are updated using Equation 3.28. The complete algorithm for gradient-descent TD(λ) is given in Figure 3.16.

$$\vec{e}_t = \gamma\lambda\,\vec{e}_{t-1} + \nabla_{\vec{\theta}_t} V_t(s_t)$$

Equation 3.28: Eligibility trace update rule.
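As a worked special case, suppose the approximator is linear in its parameters rather than a multi-layer network. This is an assumption made only for illustration, and the feature vector $\vec{\phi}(s)$ is a symbol introduced here, not one used elsewhere in this chapter. The gradient in Equation 3.28 then reduces to the feature vector itself:

$$V_t(s) = \vec{\theta}_t^{\top}\vec{\phi}(s) \quad\Rightarrow\quad \nabla_{\vec{\theta}_t} V_t(s_t) = \vec{\phi}(s_t)$$

$$\vec{e}_t = \gamma\lambda\,\vec{e}_{t-1} + \vec{\phi}(s_t), \qquad \vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\,\delta_t\,\vec{e}_t$$

Setting $\lambda = 0$ gives $\vec{e}_t = \vec{\phi}(s_t)$, so only the features of the current state are adjusted, in proportion to the TD-error. For a multi-layer network the same two updates apply, with the gradient obtained by backpropagation instead of being read directly from the features.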
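Initialise $\vec{\theta}$ arbitrarily and $\vec{e} = \vec{0}$
Repeat for each episode:
    $s \leftarrow$ initial state of episode
    Repeat for every step in the episode:
        $a \leftarrow$ action given by $\pi$ for $s$
        Take action $a$, observe $r$, and next state $s'$
        $\delta \leftarrow r + \gamma V(s') - V(s)$
        $\vec{e} \leftarrow \gamma\lambda\vec{e} + \nabla_{\vec{\theta}} V(s)$
        $\vec{\theta} \leftarrow \vec{\theta} + \alpha\delta\vec{e}$
        $s \leftarrow s'$
    until $s$ is terminal

Figure 3.16: On-line gradient-descent TD(λ) algorithm for estimating $V^\pi$.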
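A minimal Python sketch of the algorithm in Figure 3.16 follows. It is not the implementation used in this thesis. It assumes a linear approximator with one-hot features in place of a neural network, so that the gradient of $V(s)$ is simply the feature vector, and a hypothetical five-state random-walk task with a random policy. The traces are also reset at the start of every episode, which is a choice made for this sketch. All function names and parameter values are illustrative.

```python
import random


def one_hot(state, n_states):
    """Feature vector phi(s): 1 at the state's index, 0 elsewhere (assumed)."""
    phi = [0.0] * n_states
    phi[state] = 1.0
    return phi


def value(theta, phi):
    """Linear value estimate V(s) = theta . phi(s)."""
    return sum(w * f for w, f in zip(theta, phi))


def td_lambda(n_states=5, episodes=5000, alpha=0.005, gamma=1.0, lam=0.8, seed=0):
    """Estimate V for a random policy on a hypothetical random-walk task.

    The walk starts in the middle state and moves left or right with equal
    probability. Leaving the right end yields reward 1, the left end reward 0.
    """
    rng = random.Random(seed)
    theta = [0.0] * n_states              # "arbitrary" initial weights: zeros
    for _ in range(episodes):
        e = [0.0] * n_states              # traces reset per episode (assumption)
        s = n_states // 2                 # initial state of episode
        terminal = False
        while not terminal:
            s_next = s + rng.choice((-1, 1))      # action from the random policy
            terminal = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next >= n_states else 0.0
            phi = one_hot(s, n_states)
            # The value of a terminal state is taken to be zero.
            v_next = 0.0 if terminal else value(theta, one_hot(s_next, n_states))
            delta = r + gamma * v_next - value(theta, phi)       # TD error
            # For a linear approximator the gradient of V(s) is phi(s).
            e = [gamma * lam * e_i + f for e_i, f in zip(e, phi)]
            theta = [w + alpha * delta * e_i for w, e_i in zip(theta, e)]
            s = s_next
    return theta


estimates = td_lambda()
print([round(v, 2) for v in estimates])
```

For this task the true values are 1/6, 2/6, 3/6, 4/6 and 5/6, so the printed estimates should lie close to those figures. The exact numbers depend on the seed and the step size.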
3.5 Conclusion
Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. Like many artificial intelligence techniques, it stems from the psychology of animal learning, but it is distinguished by its emphasis on learning by the individual agent from direct interaction with the environment, without the need for supervision or a complete model of the environment. It achieves this through a formal framework that defines the interaction between a learning agent and its environment in terms of states, actions and rewards. This framework represents the essential elements involved in learning, such as a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals.

This chapter has described the reinforcement-learning model in detail, covering all of the primary elements and how they interact. It then described a number of methods that agents use to evaluate their actions and explore a problem domain through careful action selection. It has also given a detailed description, including procedural algorithms and equations, of the three main categories of methods used in reinforcement learning: Dynamic Programming, Monte Carlo and Temporal Difference. Finally, it discussed a technique, called eligibility traces, for amalgamating the Temporal Difference technique smoothly with Monte Carlo methods. This allows for on-line learning and is the foundation for the TD(λ) algorithms used in this thesis.