Throughout this section, we assume that the realisations of all random variables can be completely observed. We define a stochastic process which satisfies the Markov condition. After defining a controllable Markov process, we ultimately define a Markov decision process.
Definition 2.1 (Stochastic process). A stochastic process is a collection of random variables $\{S_t \mid t \in \mathcal{T}\}$ indexed by a set $\mathcal{T}$.
The index set $\mathcal{T}$ typically models time and can be either discrete or continuous, e.g. $\mathcal{T} = \{0, 1, \ldots\}$ or $\mathcal{T} = \{t \in \mathbb{R} \mid 0 \le t < \infty\}$, respectively, or some closed interval subset of either.
Consider a discrete-time stochastic process $\{S_t\}$ with $\mathcal{T} = \{0, 1, \ldots\}$. Each $S_t$ models the system state at time $t$ and assumes values in a state space $S$. In a general causal stochastic process, $S_{t+1}$ may depend on the realisations of any $S_k$, $k \le t$. This leads to a state transition model defined as a conditional PDF $p(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0)$. Note that we assume the state transition model to be independent of $t$. This assumption is made to simplify notation, and no conceptual difficulties arise from defining state transition models dependent on $t$, as is done e.g. by Puterman (1994).
When dependency on past states is reduced to $S_t$ only, a Markov process is defined.
Definition 2.2 (Markov process). A discrete-time stochastic process $\{S_t\}$ with $\mathcal{T} = \{0, 1, \ldots\}$ is a Markov process if its state transition model satisfies the Markov condition
$$p(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0) = p(s_{t+1} \mid s_t) \tag{2.1}$$
for all $t \in \mathcal{T}$ and all $s_0, s_1, \ldots, s_{t+1} \in S$.
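The Markov condition can be illustrated with a minimal simulation sketch. It assumes a finite state space, so that the conditional PDF reduces to a table of probabilities; the two states, the numerical values, and the names `P`, `step` and `simulate` are hypothetical and serve only as an illustration.

```python
import random

# Hypothetical two-state chain: P[s][s_next] = p(s_next | s). Values are illustrative.
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def step(state, rng):
    """Sample s_{t+1} from p(. | s_t); only the current state is consulted."""
    successors = list(P[state])
    weights = [P[state][s] for s in successors]
    return rng.choices(successors, weights=weights, k=1)[0]


def simulate(s0, horizon, rng):
    """Return the trajectory s_0, s_1, ..., s_horizon."""
    trajectory = [s0]
    for _ in range(horizon):
        # The earlier entries of the trajectory are never passed to step().
        trajectory.append(step(trajectory[-1], rng))
    return trajectory


rng = random.Random(0)  # fixed seed so that the run is repeatable
trajectory = simulate("sunny", 10, rng)
print(trajectory)
```

Because `step` receives the current state only, the sampled process satisfies (2.1) by construction.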
In a decision process, an agent has the opportunity of influencing a stochastic process by applying actions. Fundamentally, this means that the state transition model of the process is an action-dependent function. Actions can be selected at the decision epochs determined by an index set $\mathcal{T}$. If $\mathcal{T} = \{0, 1, \ldots\}$, the decision process is of infinite horizon. If $\mathcal{T} = \{0, 1, \ldots, d\}$, $d \in \mathbb{N}$, the decision process is of finite horizon, and the last decision is made at epoch $(d - 1)$. If at some decision epoch the system is in state $s \in S$, the agent may choose the action to apply from the set of actions allowed in $s$, denoted $A_s$. Let $A = \bigcup_{s \in S} A_s$ denote the action space of the decision process. With the introduction of actions, a controlled stochastic process is defined, where in general the next state depends on both the past states and the past actions via a state transition model $p(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0, a_t, a_{t-1}, \ldots, a_0)$. In a controlled Markov process, state transitions satisfy the Markov condition with respect to the states and actions.
Definition 2.3 (Controlled Markov process). In a controlled Markov process, the state transition model satisfies
$$p(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0, a_t, a_{t-1}, \ldots, a_0) = p(s_{t+1} \mid s_t, a_t) \tag{2.2}$$
for all $t \in \mathcal{T}$ and all $s_0, s_1, \ldots, s_{t+1} \in S$, $a_0, a_1, \ldots, a_t \in A$.
The state transition model in a controlled Markov process is concisely represented as a function $T : S \times A \times S \to \mathbb{R}^{+}$, such that $T(s', a, s)$ is the value of the PDF over the new system state $s'$ when the system is currently in state $s$ and action $a$ is executed. A valid state transition model must satisfy $\int_S T(s', a, s)\,\mathrm{d}s' = 1$ for all $s \in S$ and $a \in A$.
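The normalisation requirement can be checked mechanically. The sketch below assumes finite state and action sets, so the integral over $s'$ becomes a sum; the states, actions, probabilities, and the helper name `is_valid_transition_model` are hypothetical. The dictionary keys follow the argument order $(s', a, s)$ used in the text.

```python
import math

STATES = ["low", "high"]
ACTIONS = ["wait", "push"]

# T[(s_next, a, s)]: probability of s_next given current state s and action a.
# All numbers are illustrative.
T = {
    ("low", "wait", "low"): 0.8, ("high", "wait", "low"): 0.2,
    ("low", "push", "low"): 0.3, ("high", "push", "low"): 0.7,
    ("low", "wait", "high"): 0.4, ("high", "wait", "high"): 0.6,
    ("low", "push", "high"): 0.1, ("high", "push", "high"): 0.9,
}


def is_valid_transition_model(T, states, actions, tol=1e-9):
    """Check that T is non-negative and sums to 1 over s_next for every (s, a)."""
    if any(p < 0 for p in T.values()):
        return False
    return all(
        math.isclose(sum(T[(s_next, a, s)] for s_next in states), 1.0, abs_tol=tol)
        for s in states
        for a in actions
    )


print(is_valid_transition_model(T, STATES, ACTIONS))
```

For a continuous state space the same check would require numerical integration of the PDF, which is outside the scope of this sketch.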
The actions are applied in a sequential manner. At decision epoch $t$, the system is in a state $s_t$. The agent then executes action $a_t$, and the system transitions to a new state $s_{t+1}$ according to $T$. The agent receives a reward $R'(s_t, a_t, s_{t+1})$, which is a random quantity as it depends on the system state at decision epoch $(t + 1)$. The reward function is assumed to be independent of the decision epoch, although no extra difficulty beyond notational inconvenience arises from epoch-dependent rewards. Positive reward is interpreted as an income, and negative reward as a cost. We adopt the alternative view of the reward function $R'$ where it is replaced by its expected value calculated by
$$R(s_t, a_t) = \mathbb{E}_{S_{t+1}}\left[ R'(s_t, a_t, s_{t+1}) \right], \tag{2.3}$$
defining a new expected reward function $R : S \times A \to \mathbb{R}$. The expectation in the expression above is taken with respect to $T \equiv p(s_{t+1} \mid s_t, a_t)$. In a finite-horizon decision process with $\mathcal{T} = \{0, 1, \ldots, d\}$, the last action is selected at decision epoch $(d - 1)$, and an additional real-valued terminal reward $R_d(s_d)$ is sometimes defined. Throughout the rest of the thesis we assume the terminal reward is equal to zero.
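As a worked illustration of the expected reward in (2.3), the following sketch assumes a finite state space, so the expectation is a weighted sum over successor states. The transition table, the reward rule in `reward_prime`, and all numerical values are hypothetical.

```python
STATES = [0, 1]
ACTIONS = ["stay", "move"]

# T[(s_next, a, s)]: probability of s_next given state s and action a (illustrative).
T = {
    (0, "stay", 0): 0.9, (1, "stay", 0): 0.1,
    (0, "move", 0): 0.2, (1, "move", 0): 0.8,
    (0, "stay", 1): 0.1, (1, "stay", 1): 0.9,
    (0, "move", 1): 0.8, (1, "move", 1): 0.2,
}


def reward_prime(s, a, s_next):
    """Hypothetical R'(s, a, s_next): income 1 for arriving in state 1, cost 0.5 for moving."""
    income = 1.0 if s_next == 1 else 0.0
    cost = 0.5 if a == "move" else 0.0
    return income - cost


def expected_reward(s, a):
    """R(s, a): sum over s_next of T(s_next, a, s) * R'(s, a, s_next)."""
    return sum(T[(s_next, a, s)] * reward_prime(s, a, s_next) for s_next in STATES)


for s in STATES:
    for a in ACTIONS:
        print(s, a, round(expected_reward(s, a), 3))
```

With these numbers, moving from state 0 gives $0.2 \cdot (-0.5) + 0.8 \cdot 0.5 = 0.3$, whereas staying gives $0.9 \cdot 0 + 0.1 \cdot 1 = 0.1$.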
Adding the reward process to a controlled Markov process defines a Markov decision process (MDP).
Definition 2.4 (Markov decision process; Puterman, 1994). A Markov decision process is a tuple $\langle \mathcal{T}, S, \{A_s\}, T, R \rangle$, where $\mathcal{T}$ is the set of decision epochs, $S$ is the state space, $A_s$ is the set of actions allowed in $s \in S$ such that $A = \bigcup_{s \in S} A_s$, $T : S \times A \times S \to \mathbb{R}^{+}$ is the state transition model, and $R : S \times A \to \mathbb{R}$ is the real-valued reward function.
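To make the tuple concrete, the sketch below stores the five components in a small container and runs the sequential interaction for a finite horizon with zero terminal reward. It assumes finite state and action sets. The class `FiniteMDP`, the function `rollout`, the fixed decision rule, and the machine-maintenance numbers are all hypothetical illustrations and are not part of the definition.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FiniteMDP:
    """Finite-horizon MDP with finite state and action sets (illustrative container)."""
    horizon: int                                   # d; decisions at epochs 0, ..., d-1
    states: List[str]                              # state space S
    allowed: Dict[str, List[str]]                  # A_s for every s in S
    transition: Dict[Tuple[str, str, str], float]  # T(s_next, a, s); missing keys mean 0
    reward: Dict[Tuple[str, str], float]           # expected reward R(s, a)

    def step(self, s, a, rng):
        """Sample s_next from T(., a, s) after checking that a is allowed in s."""
        if a not in self.allowed[s]:
            raise ValueError(f"action {a!r} is not allowed in state {s!r}")
        weights = [self.transition.get((s_next, a, s), 0.0) for s_next in self.states]
        return rng.choices(self.states, weights=weights, k=1)[0]


def rollout(mdp, decision_rule, s0, rng):
    """Apply decision_rule at epochs 0..d-1; return visited (s, a) pairs, s_d, total reward."""
    s, total, history = s0, 0.0, []
    for t in range(mdp.horizon):
        a = decision_rule(t, s)
        total += mdp.reward[(s, a)]   # expected reward R(s_t, a_t)
        history.append((s, a))
        s = mdp.step(s, a, rng)
    return history, s, total          # terminal reward is taken to be zero


mdp = FiniteMDP(
    horizon=5,
    states=["ok", "broken"],
    allowed={"ok": ["run", "service"], "broken": ["repair"]},
    transition={
        ("ok", "run", "ok"): 0.9, ("broken", "run", "ok"): 0.1,
        ("ok", "service", "ok"): 1.0,
        ("ok", "repair", "broken"): 1.0,
    },
    reward={("ok", "run"): 1.0, ("ok", "service"): -0.2, ("broken", "repair"): -2.0},
)


def decision_rule(t, s):
    """Hypothetical fixed rule: run the machine when it works, repair it otherwise."""
    return "run" if s == "ok" else "repair"


history, final_state, total = rollout(mdp, decision_rule, "ok", random.Random(1))
print(history, final_state, total)
```

The state-dependent action sets $A_s$ are enforced in `step`, so an action outside $A_s$ raises an error instead of silently sampling a transition.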