There is one last thing that needs to be mentioned about time. Even though, as we have just seen, it often is a bounded state variable, the fact that this variable is non-replenishable introduces structure in the evolution of the process. Namely, all states with t being strictly smaller than the current date are non-reachable states. Moreover, in real-life problems, instantaneous loops always come to an end and the time variable eventually grows and reaches the pseudo-horizon. This means there is a null probability of observing an infinite sequence of instantaneous transitions. In other words: executing a plan always reaches the pseudo-horizon.

3We do not consider periodic problems on purpose here. Namely, we suppose these problems can be dealt

Chapter 2. Temporal Markov Decision Problems — Modeling

Finally, having explored the impact of making time continuous and observable in MDPs, without going too far into the modeling details, it appears that:

• This time is (indeed) a state variable,

• but it shouldn’t be confused with the process’ discrete time (succession of discrete decision epochs).

• It can usually be bounded, at least as a moving horizon.

• However, it induces a specific quasi-loopless structure.

• Modeling and exploiting this structure in the framework of MDPs seems necessary to build efficient algorithms in order to generate efficient time-dependent plans or policies.

3

Thesis outline

In order to organize the successive ideas leading to our contributions and to facilitate the reader’s progression across the chapters, this thesis is divided in four main parts.

Part I provided an introduction, both to the general problem of decision and to the question of introducing time in MDPs. This general introduction, in chapter 1, led to a review of models in chapter 2. These models focus on the integration of the time variable in the MDP framework. They are discussed and compared in order to highlight their specificities and to introduce the first ideas as to the mechanisms involved in their resolution. These formalisms provide the modeling basis which is reused and developed throughout the thesis.

When dealing with explicit time-dependent models, one needs to question a strong hypothesis of standard MDPs: is the model still stationary? More specifically, how do we model the exogenous evolution due to the environment, the system’s intrinsic temporal behaviour, the opponent’s or ally’s actions, etc.? [Boutilier et al., 1999] makes a distinction between implicit-event models, where the environment’s evolution and effects are factored into the representation of stochastic actions, and explicit-event models, where change caused by the environment is modeled separately from change caused by the agent’s actions. Part II deals with implicit-event temporal models, trying to highlight the structure of the temporal problem and to build an algorithmic solution adapted to the resolution of the associated problem. Then, part III illustrates why such implicit-event models are hard to build and how one can use explicit-event models to learn a policy. Thus, one can summarize the question addressed by each part as:

Part I General introduction and models

Part II Implicit-event models and continuous observable time

Part III Learning policies in explicit-event temporal models with hybrid state spaces

Part IV General conclusion

In part II, our attention goes to the straightforward idea of introducing an observable, continuous time variable in an MDP model. In the literature, this approach is known as the TMDP model. We link TMDPs with SMDPs by introducing observable time in SMDPs (chapter 4). Then we improve the TMDP framework’s expressiveness by extending the family of continuous functions its resolution can handle in chapter 5. We also improve the resolution scheme itself by introducing the specific TMDPpoly algorithm in chapter 6 and evaluate this resolution in chapter 7. This work inside the TMDP framework extends to the more generic framework of time-dependent, implicit-event, hybrid state and action problems, for which we introduce the XMDP formalism in chapter 8. Chapter 9 introduces unfinished work presenting an alternative to the previous approaches. We keep this chapter in the body of the thesis for three main reasons: first, it provides an interesting algorithmic alternative in itself; second, it highlights one of the weaknesses of the previous TMDPpoly approach; and finally, it introduces the first ideas underlying part III. Finally, chapter 10 summarizes our results on the question of introducing a continuous, observable time variable in implicit-event, time-dependent MDPs.

Part III begins with something of an admission of failure: for complex domains, implicit-event models are generally not available. Chapter 11 explores the question of modeling temporal complexity in stochastic problems. It does so from the generic discrete event systems point of view and makes a link with the Generalized Semi-Markov Decision Processes framework, illustrating why constructing an implicit-event model is much harder than assembling the corresponding explicit-event model. Then, chapter 12 takes a brief step out of the framework of temporal problems to review the approximate and asynchronous Policy Iteration approaches in order to introduce the general idea of Real-Time Policy Iteration (RTPI) and to relate it as much as possible to existing approaches. Finally, chapters 13 and 14 apply the RTPI ideas to the case of temporal domains, using the simulation properties of explicit-event models introduced in chapter 11 and introducing specific notions related to exploration and generalization.

Finally, part IV contains a single conclusion chapter which tries to summarize the thesis’ contributions.

Each part begins with a short overview, introducing the problem at hand, summarizing the questions addressed in each chapter and presenting the organization of the ideas developed. Then we introduce each chapter with a brief abstract of the problem addressed and, throughout the document, framed boxes highlight the essential results punctuating the reasoning’s progression.

Part II

Planning with Continuous

Observable Time in Markov

Overview

This part presents our contribution to model-based MDP solving when time is made continuous and observable in the decision-maker’s model. This characteristic makes it possible to consider non-stationary problems where the transition and reward functions depend explicitly on the continuous time variable.

Introducing explicit continuous time in MDP modeling raises a certain number of issues. Among these, we will look specifically at the following questions:

• How do we model the actions affecting the time variable? How should we represent the temporal consequences of actions within an MDP framework?

• Can we represent idleness in a discrete event model? Is there a difference between idleness and waiting?

• Which is the most suitable way to represent continuous evolution of the model? In practice, what kind of methods can we use and what are the appropriate representations (function classes) for these methods?

• How do we represent a policy? What kind of algorithmic precautions should we take to infer policies in practice?

• How do we make the link between policies and value functions with respect to this continuous time?

• How should we exploit this observable time to structure our policy search?

The course of our reasoning goes as follows. We start with the classical model of Semi-MDPs, which includes temporal extensions of transitions, and investigate what is needed to use this model in order to plan with respect to an observable time. This leads us to consider the following questions:

• Is the SMDP hypothesis of transition probability and transition duration independence still valid when one wishes to plan with respect to this observable time? How should the SMDP model be adapted to such representation constraints?

• Can we model idleness in a discrete event model? Is there a difference between idleness and waiting?

Then our attention turns to the class of problems introduced by [Boyan and Littman, 2001], known as Time-dependent MDPs (TMDPs). We try to relate the model of SMDPs with continuous observable time (which we call SMDP+) to the TMDP model. This helps us answer the following questions:

• What criterion is really optimized with the dynamic programming equations of [Boyan and Littman, 2001]?

• Are there implicit assumptions concerning the TMDP model which need to be pointed out to improve the resolution of TMDP problems? Namely:

• Can TMDPs represent all time dependent problems? Including the ones where the outcome state depends on the transition duration (and not the opposite)?

• What are the assumptions behind the “dawdling” authorized by TMDPs and how do they affect the optimality equations?

This exploration of the TMDP model will highlight both its advantages and limitations. Then we focus on the TMDP resolution itself. Boyan and Littman introduced an exact resolution scheme for TMDPs. We try to find out to what extent it is possible to expand this exact resolution to a wider class of continuous temporal descriptions. This leads us to investigate the questions of:

• Given the TMDP optimality equations, can we find a class of functions which would be stable through value iterations, i.e. for which $V_{n+1}$ would belong to the same function space as $V_n$?

• What would be a reasonable set of hypotheses on the model to ensure that the value function belongs to this function space?

• How would these hypotheses relate to the exact resolution framework of [Boyan and Littman, 2001]?

Finally, based on the previous analysis, we slightly extend the exact resolution framework and design an approximate algorithm which provides L∞ bounds on the value function and exhibits good convergence properties thanks to the adaptation of the Prioritized Sweeping algorithm to TMDPs. The efficiency of this algorithm also relies heavily on the introduction of a specific piecewise polynomial framework and dedicated approximation algorithms. This allows us to answer the following practical questions:

• What are the advantages and drawbacks of our TMDPpoly algorithm, which is meant to extend the standard TMDP resolution?

• More specifically: what can we expect from the “formal Bellman backups” on piecewise polynomial representations?

• And finally: how does this approach scale to temporal planning domains such as the Mars rover benchmark or the UAV coordination problem?

This exploration of the TMDP framework then leads us to a second thought about the nature of the wait action and the place of time in our problem. We consider the idea that wait is a specific continuous parametric action; this leads us to generalize the framework of TMDPs to a more general model which we call XMDP and which improves on the MDP model in two ways:

• First, it considers a generalization of actions. Instead of considering raw discrete or continuous actions, it introduces structure by differentiating actions of distinct nature (wait, walk, ...) and by associating them with their respective continuous or discrete parameters. Hence, XMDPs consider parametric actions.

• Secondly, it provides an extension of the standard Bellman equation to the case of discounted MDPs with observable time, hence proving the soundness of a formal extension of TMDPs to hybrid state spaces, hybrid parametric action spaces and discounted criteria.

This XMDP framework thus provides a general model for implicit-event Temporal Markov Decision Problems.

This course of reasoning and the associated mathematical, modeling and algorithmic issues are linearly addressed throughout the following chapters.

• Chapter 4 establishes the link between the well-explored framework of Semi-Markov Decision Processes and TMDPs. Its goal is to point out two different features: on the one hand, we consider TMDPs in the light of temporal extensions of MDPs, showing which hypotheses are implicitly made to transform SMDPs with observable time into TMDPs. On the other hand, we try to highlight why and how a TMDP differs from a hybrid variable MDP.

• Chapter 5 focuses on the dynamic programming equations introduced in [Boyan and Littman, 2001]. It presents our attempt at finding a class of functions which is stable under the Bellman operator for TMDPs. More specifically, our contribution slightly extends the results for exact resolution presented by Boyan and Littman, highlights the difficulties and benefits of using piecewise polynomial functions for TMDP solving and opens the door to the approximate resolution scheme presented in the following chapter.

• Then, in chapter 6 we present our TMDPpoly algorithm designed to efficiently solve generalized TMDPs. It relies on the properties of exact and approximate operations on piecewise polynomial functions, makes use of convergence bounds for Approximate Value Iteration and implements an adapted version of Prioritized Sweeping for generalized TMDPs.

• Chapter 7 presents the experimental results of the TMDPpoly planner implemented from the TMDPpoly algorithm. Its performance and outputs are experimentally evaluated on different temporal Markov problems.

• In chapter 8 we bring mathematical foundations to an extension of TMDPs. We generalize the concept of idleness defined in TMDPs to the case of hybrid (continuous and discrete) actions. We define the XMDP framework on the basis of MDPs with observable time and hybrid states and actions. Then we introduce an extended Bellman equation for XMDPs and provide a sound set of hypotheses in order to extend the classical Bellman operator’s properties. XMDPs include standard MDPs and TMDPs and provide a more general mathematical foundation to the problem of modeling and solving MDPs with observable time.

• Chapter 9 presents a possible extension of the previous work. It introduces the idea of incrementally finding the policy’s temporal bounds via the resolution of a sequence of discrete problems. Somewhere in between Value Iteration and Policy Iteration, the proposed method gives the first hints as to the model-free algorithms which will be presented in the next part of the thesis.

• Finally, chapter 10 summarizes the results and contributions seen throughout the previous chapters, highlights their strengths and weaknesses, and presents how they can contribute to more general MDP optimization methods.

4

Bridging the gap between SMDP and TMDP: the SMDP+ model

The previous part provided an introduction to models and frameworks designed to take the temporal consequences of actions into account in the MDP framework. The TMDP formalism of [Boyan and Littman, 2001] seems to be a natural way of modelling time dependency in MDPs. However, the connection with continuous-time discrete-event decision processes such as SMDPs is unclear. In this chapter, we will focus on the continuous observable time variable of the TMDP model and try to establish the link between SMDPs and TMDPs. Namely, we answer the question “are TMDPs equivalent to SMDPs with observable time?”. Another important question we will try to answer regards the definition of inactivity: “How should we describe idleness? Can it be described within a discrete event framework? Is it equivalent to waiting?”. We introduce the SMDP+ model for this purpose, highlight which criterion is really optimized in TMDPs in order to define policies, and clarify these questions concerning idleness.

4.1

Making time observable in SMDPs

The first step in introducing time in MDPs was to define Semi-MDPs (section 2.2.2) and to introduce continuous action duration. It appears natural to build on the SMDP model in order to go one step further. This step corresponds to defining a model where time intervenes not only as a random duration between decision epochs, but also as an observable continuous variable in the state space, therefore permitting the definition of non-stationary, continuous time, discrete event problems.

In the SMDP model, writing the transition model in the form $Q(\tau, s' | s, a) = P(s' | s, a) \cdot F(\tau | s, a)$ implicitly assumes that:

• The model is stationary (no dependency on $t$ in $Q$),

• The transition duration $\tau$ and the post-action state $s'$ are independent.
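As an illustration of these two assumptions, the following minimal sketch samples one SMDP transition by drawing the post-action state and the duration separately. The state names, the probabilities and the exponential duration law are hypothetical choices made only for this example; they do not come from the benchmarks discussed in this thesis.

```python
import random

# Hypothetical SMDP fragment used only for illustration.
# P[(s, a)] maps each post-action state to its probability P(s'|s, a).
P = {("base", "move"): {"site": 0.9, "base": 0.1}}

# Mean of the duration law F(tau|s, a); the exponential law is an assumption.
MEAN_DURATION = {("base", "move"): 2.0}


def sample_smdp_transition(s, a, rng):
    """Draw (s', tau) from Q(tau, s'|s, a) = P(s'|s, a) * F(tau|s, a)."""
    outcomes = P[(s, a)]
    states = list(outcomes)
    weights = [outcomes[x] for x in states]
    s_next = rng.choices(states, weights=weights, k=1)[0]
    # tau is drawn without looking at s_next (independence) and without
    # looking at any date t (stationarity).
    tau = rng.expovariate(1.0 / MEAN_DURATION[(s, a)])
    return s_next, tau


rng = random.Random(0)
samples = [sample_smdp_transition("base", "move", rng) for _ in range(5)]
```

Neither the current date nor the sampled post-action state appears in the duration draw, which is exactly what the product form encodes.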

We introduce the SMDP+ model which extends the SMDP model with the following features:

• Possible dependency between post-action state and sojourn time.

The problems we wish to consider do not usually satisfy the above conditions of stationarity and independence between variables. For example, the outcome of a “take a photo” action for the Mars rover depends on the time of day (non-stationarity) and its duration depends on the success or failure of the action.

Time-dependency is expressed through continuous evolution of the model with respect to the continuous time variable. Post-action states and action durations are often linked.

In order to overcome this modeling issue, we define an SMDP+ as a 4-tuple $\langle \Sigma, A, Q, R \rangle$:

• $\Sigma$ is the augmented state space containing all $\sigma = (s, t)$ elements. This state space can be decomposed into:

– a discrete state space $s \in S$,

– a continuous time axis $t \in \mathbb{R}$.

• $A$ is the discrete action space.

• $Q(\sigma' | \sigma, a)$ is the cumulative transition model. It can be written $Q(\sigma' | \sigma, a) = P(s' | s, t, a) \cdot F(t' | s, t, a, s')$. As in SMDPs, $F$ is the duration model’s cumulative distribution function. As previously and for convenience, we will write the probability density functions interchangeably as $f(t' | s, t, a, s')$ or $f(\tau | s, t, a, s')$, with:

$$f(t' | s, t, a, s') = \begin{cases} 0 & \text{if } t' < t \\ f(\tau = t' - t | s, t, a, s') & \text{if } t' \geq t \end{cases}$$

• $R(\sigma', a, \sigma)$ is the reward model.
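The following sketch illustrates this definition on the “take a photo” example: the post-action state is drawn from a distribution that depends on the current date, and the duration is then drawn conditionally on that post-action state. All numerical values, the state names and the uniform duration law are hypothetical and serve only to make the two dependencies visible.

```python
import random

# Hypothetical duration bounds (in hours) for F(t'|s, t, a, s'): a failed
# photo is assumed to take longer than a successful one.
DURATION_BOUNDS = {"photo_ok": (0.1, 0.2), "photo_failed": (0.3, 0.6)}


def p_success(t):
    """Hypothetical P(s' = photo_ok | s, t, a): good light from 8h to 16h."""
    hour = t % 24.0
    return 0.9 if 8.0 <= hour <= 16.0 else 0.2


def sample_smdp_plus_transition(s, t, a, rng):
    """Draw sigma' = (s', t') from P(s'|s, t, a) * F(t'|s, t, a, s').

    s and a are kept in the signature to mirror the notation; this toy
    model only has one state-action pair.
    """
    # Step 1: post-action state, conditioned on the current date t.
    s_next = "photo_ok" if rng.random() < p_success(t) else "photo_failed"
    # Step 2: duration tau, conditioned on the post-action state.
    low, high = DURATION_BOUNDS[s_next]
    tau = rng.uniform(low, high)
    # tau > 0, so the post-action date t' = t + tau never precedes t.
    return s_next, t + tau


rng = random.Random(0)
noon = [sample_smdp_plus_transition("at_site", 12.0, "take_photo", rng)
        for _ in range(1000)]
night = [sample_smdp_plus_transition("at_site", 2.0, "take_photo", rng)
         for _ in range(1000)]
```

Comparing the `noon` and `night` samples shows the non-stationarity (different success frequencies), while comparing durations across outcomes shows the dependency between post-action state and sojourn time.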

One can note that we can write either $F(t' | s, t, a, s')$ or $F(\tau | s, t, a, s')$ as long as there is no place left for ambiguity. In our notations, $t'$ always stands for the post-action date, while $\tau = t' - t$ always describes the transition duration (or the state’s sojourn time).

Using Bayes’ rule, we could similarly write the transition model on $S$ as $P(s' | s, t, a, t')$ and the duration model on $t'$ as $F(t' | s, t, a)$ and obtain $Q(\sigma' | \sigma, a) = P(s' | s, t, a, t') \cdot F(t' | s, t, a)$. This is why the SMDP+ model is defined in terms of a $Q(\sigma' | \sigma, a)$ function which, in practice, can be provided either as $P(s' | s, t, a) \cdot F(t' | s, t, a, s')$ or as $P(s' | s, t, a, t') \cdot F(t' | s, t, a)$. In our experiments, the transition duration often depends on the post-action state (for movement actions, for example), so we choose to use the $P(s' | s, t, a) \cdot F(t' | s, t, a, s')$ notation, but some examples where post-action states are more likely to depend on transition durations can be expressed using the other formulation (as for a “run to catch the bus” action).
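The equivalence of the two formulations can be checked numerically on a small example. The sketch below is a simplification: it fixes one (s, t, a) triple, discretizes the post-action date into three values and works with probability masses rather than cumulative distributions. The state names and numbers are hypothetical.

```python
# Hypothetical model for one fixed (s, t, a): two post-action states and
# three possible post-action dates t' on a discretized time axis.
p_state = {"caught_bus": 0.6, "missed_bus": 0.4}      # P(s'|s, t, a)
f_date_given_state = {                                # f(t'|s, t, a, s')
    "caught_bus": {1.0: 0.7, 2.0: 0.3, 3.0: 0.0},
    "missed_bus": {1.0: 0.0, 2.0: 0.25, 3.0: 0.75},
}

# Joint distribution over (s', t') built from the first formulation.
joint = {
    (s2, t2): p_state[s2] * f
    for s2, dist in f_date_given_state.items()
    for t2, f in dist.items()
}

# Second formulation: marginal on t', then P(s'|s, t, a, t') by Bayes' rule.
dates = sorted({t2 for (_, t2) in joint})
f_date = {t2: sum(joint[(s2, t2)] for s2 in p_state) for t2 in dates}
p_state_given_date = {
    (s2, t2): joint[(s2, t2)] / f_date[t2]
    for (s2, t2) in joint
    if f_date[t2] > 0.0
}

# Both products describe the same joint distribution.
for (s2, t2), q in joint.items():
    assert abs(p_state_given_date[(s2, t2)] * f_date[t2] - q) < 1e-12
```

In this toy model an early arrival date implies that the bus was caught, which is the kind of situation where providing the post-action state conditionally on the date is the more natural way to specify the model.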

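An SMDP+ policy is defined as a function from $S \times \mathbb{R}$ into $A$. Evaluating an SMDP+ policy with respect to the discounted criterion of equation 4.1 yields equation 4.2.

$$V^{\pi}(\sigma) = E\left( \sum_{\delta=0}^{\infty} \gamma^{t_\delta} r^{\pi}_{\delta} \;\middle|\; \sigma_0 = \sigma \right) \tag{4.1}$$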
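To make the role of the dates in this criterion concrete, the following sketch computes the discounted sum along one trajectory and averages it over simulated episodes. The simulator is hypothetical (unit rewards, uniform sojourn times), and the infinite sum is truncated at a finite horizon, which is an assumption of this example rather than part of the criterion.

```python
import random


def discounted_return(trajectory, gamma):
    """Sum of gamma**t_delta * r_delta over (t_delta, r_delta) pairs.

    The discount exponent is the date t_delta of decision epoch delta,
    not the epoch index delta itself.
    """
    return sum(gamma ** t_delta * r_delta for t_delta, r_delta in trajectory)


def simulate_episode(t_start, horizon, rng):
    """Hypothetical simulator: unit rewards at random dates up to a horizon."""
    trajectory = []
    t = t_start
    while t < horizon:
        trajectory.append((t, 1.0))
        t += rng.uniform(0.5, 1.5)  # sojourn time tau > 0
    return trajectory


def estimate_value(t_start, gamma, n_episodes, rng, horizon=20.0):
    """Monte Carlo estimate of the expected discounted return."""
    total = 0.0
    for _ in range(n_episodes):
        episode = simulate_episode(t_start, horizon, rng)
        total += discounted_return(episode, gamma)
    return total / n_episodes


rng = random.Random(1)
value_estimate = estimate_value(t_start=0.0, gamma=0.9, n_episodes=200, rng=rng)
```

Two trajectories with the same rewards but different dates receive different returns, which is why the policy has to be evaluated on the augmented state (s, t) rather than on s alone.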

4.2. Idleness in the SMDP+ model