
Reinforcement Learning borrowed many of its initial intuitions from the learning behavior of living species, trying to mimic the learning process induced by sequences of trials and errors. In particular, we, as humans, usually keep track of our behavior (policy) rather than its expected reward (value function), even though we have a rough idea of what our behavior is worth. We therefore perform some sort of direct policy reinforcement, changing our behavior when it seems that a new action can improve our “expected gain”. This idea underlies the Policy Iteration approach, which we review throughout this chapter.

We first recall the basics of the Policy Iteration algorithm and highlight its main drawback: the evaluation phase. We postpone to section 12.2 the discussion of approximate policy evaluation methods, which define the variants of Approximate Policy Iteration. Instead, we first focus on the specification of an Asynchronous Policy Iteration algorithm.

Let us start with a reminder on Policy Iteration.

Reminder on Policy Iteration

Policy Iteration (cf. [Bertsekas and Tsitsiklis, 1996; Puterman, 1994]) is a dynamic programming method which operates directly in policy space. It can be summarized as follows: if one has a current policy $\pi_n$ and is able to evaluate exactly the expected gain $V^{\pi_n}$ of this policy in every state of the process, then performing a Bellman backup in state $s$ with respect to $V^{\pi_n}$ yields an action $a$ that is at least as good in $s$ as the one specified by $\pi_n(s)$. Therefore, replacing $\pi_n(s)$ by $a$ yields a new policy $\pi_{n+1}$ whose expected reward is better or equivalent. Consequently, Policy Iteration jumps from policy to policy in the policy space and from value functions to better value functions in the value function space.

We recall below the Policy Iteration algorithm as presented in algorithm 2.2. It alternates two phases: the policy evaluation phase and the improvement phase. The evaluation phase consists in evaluating exactly the policy’s value function $V^{\pi_n}$ (without any optimization).

The improvement phase sweeps through the state space, updating the action in every single state by performing a Bellman backup based on $V^{\pi_n}$, and builds the new policy $\pi_{n+1}$.

Evaluating the policy can be done in a number of ways. For example, one can use the Value Iteration algorithm without the maximization step (namely, using the $L^{\pi}$ operator instead of $L$) in order to build a sequence of functions converging to $V^{\pi}$. Organizing the Bellman backups and focusing on relevant states was actually the first idea behind the prioritized sweeping method of [Moore and Atkeson, 1993], which was first introduced for Markov prediction problems and later extended to Markov decision tasks. Another option consists in performing explicit matrix inversion in order to solve the linear system of equations $V^{\pi} = L^{\pi} V^{\pi}$. This kind of resolution can exploit the fact that transition matrices are

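Chapter 12. Real-Time Policy Iteration

Algorithm 12.1: Policy Iteration

$\pi_0 \in D$
$n \leftarrow 0$
repeat
    Solve the system of $|S|$ equations:
        $\forall s \in S, \quad V_n(s) = r(s, \pi_n(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi_n(s)) \, V_n(s')$
    for $s \in S$ do
        $\pi_{n+1}(s) \leftarrow \arg\max_{a \in A} \left[ r(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a) \, V_n(s') \right]$
    $n \leftarrow n + 1$
until $\pi_n = \pi_{n-1}$
return $\pi_n$ and its value function $V_{n-1}$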
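As an illustration of Algorithm 12.1, the following is a minimal sketch in plain Python, not a reference implementation. It assumes a small finite MDP stored as nested lists, with `P[s][a][t]` standing for $p(t \mid s, a)$ and `R[s][a]` for $r(s, a)$. These names, the helper functions, and the two-state example are hypothetical and chosen only for this sketch. The evaluation phase solves the linear system exactly by Gauss-Jordan elimination.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(n):
            if row != col:
                factor = M[row][col] / M[col][col]
                M[row] = [x - factor * y for x, y in zip(M[row], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate_policy(P, R, policy, gamma):
    """Evaluation phase: solve (I - gamma * P_pi) V = r_pi exactly."""
    n = len(P)
    A = [[(1.0 if s == t else 0.0) - gamma * P[s][policy[s]][t]
          for t in range(n)] for s in range(n)]
    b = [R[s][policy[s]] for s in range(n)]
    return solve_linear_system(A, b)


def backup(P, R, V, gamma, s, a):
    """One-step lookahead value of action a in state s with respect to V."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def policy_iteration(P, R, gamma):
    n_states, n_actions = len(P), len(P[0])
    policy = [0] * n_states  # arbitrary initial policy pi_0
    while True:
        V = evaluate_policy(P, R, policy, gamma)  # evaluation phase
        new_policy = [  # improvement phase: greedy action in every state
            max(range(n_actions), key=lambda a: backup(P, R, V, gamma, s, a))
            for s in range(n_states)
        ]
        if new_policy == policy:  # the policy is stable: stop
            return policy, V
        policy = new_policy


# Hypothetical two-state, two-action MDP (assumption made for this sketch).
# Action 0 stays in the current state, action 1 moves to the other state.
# Only staying in state 1 is rewarded.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]
policy, V = policy_iteration(P, R, gamma=0.9)
```

On this example the returned policy moves from state 0 to state 1 and then stays there. The cost of each iteration is dominated by the call to `evaluate_policy`, which is the point made in the text about the evaluation phase.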

However, the evaluation phase usually remains the bottleneck of most Policy Iteration methods.

Therefore, one often uses Approximate Policy Iteration, which allows for faster, approximate evaluation despite the lack of theoretical guarantees. We discuss this question in section 12.2.
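To make the cost of evaluation more concrete, the sketch below evaluates a fixed policy by repeatedly applying the operator without maximization, one of the evaluation options mentioned earlier. It is only an illustration under stated assumptions: the names `apply_L_pi` and `iterative_evaluation`, the stopping tolerance, and the two-state example are hypothetical. `P_pi[s][t]` stands for $p(t \mid s, \pi(s))$ and `r_pi[s]` for $r(s, \pi(s))$.

```python
def apply_L_pi(V, P_pi, r_pi, gamma):
    """One synchronous sweep of the policy's Bellman operator (no max)."""
    return [r_pi[s] + gamma * sum(p * v for p, v in zip(P_pi[s], V))
            for s in range(len(V))]


def iterative_evaluation(P_pi, r_pi, gamma, tol=1e-10, max_sweeps=10_000):
    """Build a sequence V_0, V_1, ... converging to the policy's value."""
    V = [0.0] * len(r_pi)
    for sweep in range(1, max_sweeps + 1):
        V_next = apply_L_pi(V, P_pi, r_pi, gamma)
        # Assumed stopping rule: successive iterates closer than tol.
        if max(abs(x - y) for x, y in zip(V_next, V)) < tol:
            return V_next, sweep
        V = V_next
    return V, max_sweeps


# Hypothetical fixed policy on a two-state chain: state 0 moves to state 1,
# state 1 stays in place and collects a reward of 1 at every step.
P_pi = [[0.0, 1.0],
        [0.0, 1.0]]
r_pi = [0.0, 1.0]
V_iter, sweeps = iterative_evaluation(P_pi, r_pi, gamma=0.9)
```

For this policy the value can be computed by hand: state 1 is worth $1 / (1 - \gamma) = 10$ and state 0 is worth $\gamma \cdot 10 = 9$. The iterative scheme needs more than a hundred sweeps to reach the chosen tolerance with $\gamma = 0.9$, since the error only shrinks by a factor $\gamma$ per sweep.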

Asynchronous Policy Iteration

Dynamic Programming is not limited to Value Iteration. Building an Asynchronous Policy Iteration algorithm seems a little more complicated at first because of the two distinct phases of the standard Policy Iteration algorithm. We use this section to review the idea of asynchronism in Policy Iteration.

In the following paragraphs, we will make the rather drastic assumption that there exists a black box which quickly evaluates the policy’s value function. This assumption is made for clarity of presentation, and we will discuss it along with the Approximate Policy Iteration methods in the next sections.

The Modified Policy Iteration algorithm provides a smooth transition from Policy Iteration to Asynchronous Policy Iteration. It builds on the idea that one can use value functions other than the exact evaluation of $\pi_n$ to find $\pi_{n+1}$, as long as the value function used respects some properties. Namely, the evaluation phase of Modified Policy Iteration at iteration $n$ consists in applying the operation $V_{k+1} = L^{\pi_{n+1}} V_k$ $m_n$ times in order to approach $V^{\pi_{n+1}}$.

[Puterman, 1994] shows that Modified Policy Iteration converges for any non-zero value of $m_n$.
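The scheme just described can be sketched as follows, under several simplifying assumptions that are not taken from the text. The number of partial evaluation sweeps is a fixed `m` rather than a sequence $m_n$, the initial value function is zero (with non-negative rewards in the example), and the loop stops when the Bellman residual falls below a hypothetical tolerance `tol`. The MDP layout (`P[s][a][t]`, `R[s][a]`) and all function names are illustrative.

```python
def backup(P, R, V, gamma, s, a):
    """One-step lookahead value of action a in state s with respect to V."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))


def modified_policy_iteration(P, R, gamma, m=5, tol=1e-8, max_iters=10_000):
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    policy = [0] * n_states
    for _ in range(max_iters):
        # Improvement: greedy policy with respect to the current, inexact V.
        Q = [[backup(P, R, V, gamma, s, a) for a in range(n_actions)]
             for s in range(n_states)]
        policy = [max(range(n_actions), key=lambda a: Q[s][a])
                  for s in range(n_states)]
        # Assumed stopping rule: Bellman residual max_s |(LV)(s) - V(s)| < tol.
        if max(abs(max(Q[s]) - V[s]) for s in range(n_states)) < tol:
            break
        # Partial evaluation: m sweeps of the new policy's operator
        # instead of an exact solve.
        for _k in range(m):
            V = [backup(P, R, V, gamma, s, policy[s]) for s in range(n_states)]
    return policy, V


# Hypothetical two-state, two-action MDP (assumption made for this sketch).
# Action 0 stays in the current state, action 1 moves to the other state.
# Only staying in state 1 is rewarded.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 0.0],
     [1.0, 0.0]]
mpi_policy, mpi_V = modified_policy_iteration(P, R, gamma=0.9, m=5)
```

With `m = 1` each iteration reduces to a single Value Iteration backup, while a very large `m` approaches the exact evaluation of Policy Iteration. Intermediate values trade evaluation accuracy against the cost of each iteration.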

Because of the alternation of evaluation and improvement phases, Policy Iteration seems to be a Synchronous Dynamic Programming method by nature. Introducing asynchronism in Policy Iteration implies allowing these two phases to mix, thereby performing partial evaluations and partial improvements of the policy. Modified Policy Iteration is a good illustration of the fact that Asynchronous Policy Iteration necessarily relies on some sort of approximation of the policy’s value function. In the case of Modified Policy Iteration, this approximation has no impact on optimality, whereas less conservative approximation methods might require the error bounds of Approximate Policy Iteration.

12.2. Approximation for Policy Iteration
