4. Proposal for an interactional repertoire
4.2 Obstacles in selecting the content of the repertoire
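$$
\sum_{s' \in S} \int_{0}^{\infty} \left( R(s', t + \tau, \pi(\sigma), \sigma) + \gamma^{\tau} V^{\pi}(\sigma') \right) \cdot f(\tau \mid \sigma, \pi(\sigma), s') \, P(s' \mid \sigma, \pi(\sigma)) \, d\tau = L^{\pi}(V^{\pi})(\sigma) \tag{4.2}
$$

This equation is a natural extension of the standard MDP $L^{\pi}$ operator to the SMDP+ case. Similarly, the optimality equation becomes equation 4.3.

$$
V^{*}(\sigma) = \max_{a \in A} \sum_{s' \in S} \int_{0}^{\infty} \left( R(s', t + \tau, a, \sigma) + \gamma^{\tau} V^{*}(\sigma') \right) \cdot f(\tau \mid \sigma, a, s') \, P(s' \mid \sigma, a) \, d\tau \tag{4.3}
$$

$$
V^{*}(\sigma) = L V^{*}(\sigma)
$$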
This chapter focuses on modeling and solving TMDP problems. For clarity, we accept for now the intuition that the $L^{\pi}$ and $L$ operators do provide the value functions of $\pi$ and $\pi^{*}$. Chapter 8 will focus on proving the mathematical foundations and correctness of equations 4.2 and 4.3 in a more general framework.
Equations 4.2 and 4.3 illustrate the tight coupling between transition dynamics and criterion whenever time is made observable: the duration $\tau$ used in the discount factor $\gamma^{\tau}$ also conditions the post-action augmented state $\sigma' = (s', t + \tau)$.
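The coupling described above can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions, not the method used in this work: the integral over $\tau$ is replaced by a finite sum over a hypothetical list of durations, and the function name `backup` as well as the callables `P`, `f`, `R` and `V` are placeholders introduced here, with argument orders mirroring equations 4.2 and 4.3.

```python
from typing import Callable, Sequence, Tuple

Sigma = Tuple[str, float]  # augmented state sigma = (s, t)


def backup(
    sigma: Sigma,
    action: str,
    next_states: Sequence[str],
    durations: Sequence[float],
    P: Callable[[str, Sigma, str], float],
    f: Callable[[float, Sigma, str, str], float],
    R: Callable[[str, float, str, Sigma], float],
    V: Callable[[Sigma], float],
    gamma: float,
) -> float:
    """Discretized version of the expectation inside equations 4.2 and 4.3.

    Assumption: f(tau, sigma, action, s_next) is a probability mass over the
    finite list `durations`, standing in for the continuous density.
    """
    _, t = sigma
    total = 0.0
    for s_next in next_states:
        p = P(s_next, sigma, action)
        for tau in durations:
            w = f(tau, sigma, action, s_next)
            # The same tau sets the discount gamma**tau and the arrival date t + tau.
            sigma_next = (s_next, t + tau)
            gain = R(s_next, t + tau, action, sigma) + gamma ** tau * V(sigma_next)
            total += gain * w * p
    return total


# Hypothetical toy problem: one successor state, two equally likely durations.
value = backup(
    sigma=("A", 0.0),
    action="move",
    next_states=["B"],
    durations=[1.0, 2.0],
    P=lambda s_next, sigma, a: 1.0,
    f=lambda tau, sigma, a, s_next: 0.5,
    R=lambda s_next, t_next, a, sigma: 1.0,
    V=lambda sigma_next: 10.0,
    gamma=0.5,
)
print(value)
```

In this toy setting the two durations contribute $(1 + 0.5 \cdot 10) \cdot 0.5 = 3$ and $(1 + 0.25 \cdot 10) \cdot 0.5 = 1.75$, so the backup evaluates to 4.75. Taking the maximum of `backup` over the actions would give the corresponding discretized form of equation 4.3.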
4.2 Idleness in the SMDP+ model
As we anticipated in section 2.2, as soon as we introduce continuous time, the idea of using an available “wait” action comes to mind. Hence we need to answer the question: is there an idle action in the SMDP+ model? If so, how do we write its transition and reward functions?
We need to consider two options: either we put an “idle” action in the action space $A$ or we don’t. The latter implies disabling the option of acting at specific times. If we do not allow idleness, then actions are executed without interruption and we lose one of the benefits of considering a continuous observable time. In the former case, we need to define the transition and reward functions associated with the “wait” action. This highlights the fact that “wait” is an abstract action which has no physical impact on the system as long as we do not associate it with a duration or an end date. More specifically, in TMDPs, the “wait” action is naturally modelled so as to have a deterministic effect on the time variable. This effect is itself conditioned on the duration or end date parameter of the action. Hence, “wait” needs to be associated with a proper parameter to gain the meaning of an action operator. Then, for a “wait($t_{next}$)” or “wait($\tau$)” action, one can write the transition and reward models.
The “wait” action’s model can only be formalized with respect to some idleness duration or ending date parameter. The transition and reward models are conditioned on these parameters.
Chapter 4. Bridging the gap between SMDP and TMDP: the SMDP+ model

One simple remark concerning the fact that “wait” is chosen to be deterministic with respect to the time variable in TMDPs: an engineer with a good sense of humour could decide to model the sleepy behaviour of their robot. They could then state that the decision to wait for 8 minutes might result in a different waiting duration (described, for example, by a Gaussian distribution with mean 8 and standard deviation 1) because the robot can fall asleep during idleness phases and not wake up exactly on time. This little example has echoes in real-world problems: for instance, waiting before sending a request to a web service can sometimes end up taking much longer than expected. This remark only highlights the fact that using a deterministic “wait” action is a deliberate choice, adapted to the problem at hand, but one which can be questioned for some applications. Since our purpose here is to bridge the gap between SMDPs and TMDPs, and since TMDPs consider a deterministic “wait” action, we will use deterministic idleness in SMDP+. However, one should keep in mind that “wait($\tau$)” is not necessarily deterministic in real-world problems.
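The contrast between the two modelling choices can be sketched in a few lines. This is a hedged illustration only: the function names `wait_deterministic` and `wait_sleepy` are hypothetical, and clipping the sampled duration at zero is an extra assumption added here so that time never runs backwards.

```python
import random
from typing import Optional


def wait_deterministic(t: float, tau: float) -> float:
    """Deterministic idleness, as in TMDPs: the end date is exactly t + tau."""
    return t + tau


def wait_sleepy(t: float, tau: float, std: float = 1.0,
                rng: Optional[random.Random] = None) -> float:
    """Stochastic idleness: the actual duration is Gaussian around tau.

    Assumption (not from the text): negative samples are clipped to 0.
    """
    rng = rng or random.Random()
    actual = max(0.0, rng.gauss(tau, std))
    return t + actual


# Deciding to wait 8 minutes from date t = 10.
print(wait_deterministic(10.0, 8.0))
print(wait_sleepy(10.0, 8.0, rng=random.Random(0)))
```

Only the first function matches the deterministic idleness retained for SMDP+; the second shows what would change if the duration of “wait($\tau$)” were itself uncertain.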
Additionally, being idle does not really correspond to “making no change” to the process, since the system might evolve by itself during idleness phases: for example, the fuel resource can decrease, the exogenous processes might trigger transitions and change their state, and, of course, our observable time changes. Thus, instead of defining passive idleness, the wait($\tau$) (or wait($t_{next}$)) action is a particular action which we consider deterministic with respect to the time variable.
Intuitively, the notion of idleness in mission planning implicitly means “wait until it is time to undertake a new action”. Thus, we can interpret the “idle” action as “let the system change on its own until the next decision epoch”. This next decision epoch occurs whenever we enter any state whose date corresponds to the end of the idleness. In other words, since we only take decisions at the dates of decision epochs, the end of a “wait” action must match the date of the next decision epoch. Defining the “wait” transition function therefore requires knowledge of the decision epoch’s date. This “wait” action, applied in $(s, t)$, takes the process to a new state $s'$ described by the natural evolution of the process (everything that happens if the agent does not interact with the world, as described by equation 4.4) and to the date corresponding to the time of the next decision epoch, as described by equation 4.5. Thus, the “wait” action’s model depends on the dates of decision epochs. More specifically, the “wait” action is an instantaneous jump to the date of the next decision epoch and to a state drawn according to the undisturbed dynamics of the system $W(s' \mid s, t, t')$. This function $W(s' \mid s, t, t')$ captures all the influences of what would be the exogenous processes if we were in an explicit-event model.
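$$
Q(s', t' \mid s, t, a) = P(s' \mid s, t, a) \cdot F(t' \mid s, t, a, s')
$$

$$
P(s' \mid s, t, \text{wait}, t') = W(s' \mid s, t, t') \tag{4.4}
$$

$$
f(t' \mid s, t, \text{wait}) = \mathbf{1}_{t_{next}}(t') \quad \text{with} \quad t_{next} = \min_{\delta \in \mathbb{N}} \{ t_{\delta} \mid t_{\delta} > t \} \tag{4.5}
$$

$$
Q(s', t' \mid s, t, \text{wait}) = \int_{-\infty}^{\infty} P(s' \mid s, t, \text{wait}, t') \cdot f(t' \mid s, t, \text{wait}) \, dt'
$$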
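As a minimal sketch of equations 4.4 and 4.5, the code below computes the date of the next decision epoch and then draws the arrival state from the undisturbed dynamics. Several elements are assumptions made for illustration: the decision epochs are given as a sorted list `epochs`, `W` is a hypothetical callable returning a dictionary of probabilities over arrival states, and the names `next_decision_epoch` and `wait` are introduced here only.

```python
import bisect
import random
from typing import Callable, Dict, Sequence, Tuple


def next_decision_epoch(t: float, epochs: Sequence[float]) -> float:
    """t_next = min{t_delta | t_delta > t} (equation 4.5); epochs sorted ascending."""
    i = bisect.bisect_right(epochs, t)  # first index whose date is strictly after t
    if i == len(epochs):
        raise ValueError("no decision epoch after date t")
    return epochs[i]


def wait(
    s: str,
    t: float,
    epochs: Sequence[float],
    W: Callable[[str, float, float], Dict[str, float]],
    rng: random.Random,
) -> Tuple[str, float]:
    """Jump to the next decision epoch, drawing s' from W(. | s, t, t_next) (equation 4.4)."""
    t_next = next_decision_epoch(t, epochs)
    dist = W(s, t, t_next)
    states = list(dist)
    s_next = rng.choices(states, weights=[dist[x] for x in states])[0]
    return s_next, t_next


# Hypothetical example: fuel burns at one unit per time unit while idle.
def fuel_dynamics(s: str, t: float, t_next: float) -> Dict[str, float]:
    remaining = int(s) - int(t_next - t)
    return {str(max(remaining, 0)): 1.0}


print(wait("10", 2.0, [1.0, 2.0, 5.0], fuel_dynamics, random.Random(0)))
```

With these toy dynamics, waiting from date 2 with 10 units of fuel lands on the epoch at date 5 with 7 units left: the date is deterministic, while the state is whatever the undisturbed process produces.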
This last paragraph illustrates the specificity of the time variable among state variables. Planning with respect to a continuous observable time in MDPs and allowing idleness actions does not imply knowing in advance the dates of the successive decision epochs. It does, however, imply considering decision variables which correspond to these dates. In the case of TMDPs, these dates are the parameters of the deterministic “wait” action.
4.3. Then what is the difference between waiting and idleness?