
Throughout this section, we assume that the realisations of all random variables can be completely observed. We define a stochastic process which satisfies the Markov condition. After defining a controllable Markov process, we ultimately define a Markov decision process.

Definition 2.1 (Stochastic process). A stochastic process is a collection of random variables indexed by a set $\mathcal{T}$.

The index set $\mathcal{T}$ typically models time and can be either discrete or continuous, e.g. $\mathcal{T} = \{0, 1, \ldots\}$ or $\mathcal{T} = \{t \in \mathbb{R} \mid 0 \le t < \infty\}$, respectively, or some closed interval subset of either.

Consider a discrete-time stochastic process $\{S_t\}$ with $\mathcal{T} = \{0, 1, \ldots\}$. Each $S_t$ models the system state at time $t$ and assumes values in a state space $\mathcal{S}$. In a general causal stochastic process, $S_{t+1}$ may depend on the realisations of any $S_k$, $k \le t$. This leads to a state transition model defined as a conditional PDF $p(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0)$. Note that we assume the state transition model to be independent of $t$. This assumption is made to simplify notation, and no conceptual difficulties arise from defining state transition models dependent on $t$, as is done e.g. by Puterman (1994).

When dependency on past states is reduced to St only, a Markov process is defined.

Definition 2.2 (Markov process). A discrete-time stochastic process $\{S_t\}$ with $\mathcal{T} = \{0, 1, \ldots\}$ is a Markov process if its state transition model satisfies the Markov condition

$$p(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0) = p(s_{t+1} \mid s_t) \tag{2.1}$$

for all $t \in \mathcal{T}$ and all $s_0, s_1, \ldots, s_{t+1} \in \mathcal{S}$.
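As an illustration of the Markov condition, the following minimal sketch simulates a process on a finite state space, so the conditional PDF reduces to a table of probabilities. The two-state chain, its probabilities, and the names `P`, `step` and `simulate` are hypothetical and chosen only for this example. The point to notice is that `step` receives the current state alone, never the earlier history.

```python
import random

# Hypothetical two-state chain; P[s][s_next] plays the role of p(s_next | s).
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def step(state, rng):
    """Sample s_{t+1} from p(. | s_t); earlier states are never consulted."""
    next_states = list(P[state])
    weights = [P[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


def simulate(s0, horizon, rng):
    """Return one realisation s_0, s_1, ..., s_horizon of the process."""
    path = [s0]
    for _ in range(horizon):
        path.append(step(path[-1], rng))
    return path


# Seeded generator so that repeated runs give the same realisation.
path = simulate("sunny", 10, random.Random(0))
```

A process violating (2.1) would need `step` to take the whole `path` as an argument.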

In a decision process, an agent can influence a stochastic process by applying actions. Fundamentally, this means that the state transition model of the process is an action-dependent function. Actions can be selected at the decision epochs determined by an index set $\mathcal{T}$. If $\mathcal{T} = \{0, 1, \ldots\}$, the decision process is of infinite horizon. If $\mathcal{T} = \{0, 1, \ldots, d\}$, $d \in \mathbb{N}$, the decision process is of finite horizon, and the last decision is made at epoch $(d - 1)$. If at some decision epoch the system is in state $s \in \mathcal{S}$, the agent may choose the action to apply from the set of actions allowed in $s$, denoted $\mathcal{A}_s$. Let $\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}_s$ denote the action space of the decision process. With the introduction of actions, a controlled stochastic process is defined, where in general the next state depends on both the past states and the past actions via a state transition model $p(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0, a_t, a_{t-1}, \ldots, a_0)$. In a controlled Markov process, state transitions satisfy the Markov condition with respect to the states and actions.

Definition 2.3 (Controlled Markov process). In a controlled Markov process, the state transition model satisfies

$$p(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0, a_t, a_{t-1}, \ldots, a_0) = p(s_{t+1} \mid s_t, a_t) \tag{2.2}$$

for all $t \in \mathcal{T}$ and all $s_0, s_1, \ldots, s_{t+1} \in \mathcal{S}$, $a_0, a_1, \ldots, a_t \in \mathcal{A}$.

The state transition model in a controlled Markov process is concisely represented as a function $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^{+}$, such that $T(s', a, s)$ is the value of the PDF over the new system state $s'$ when the system is currently in state $s$ and action $a$ is executed. A valid state transition model must satisfy $\int_{\mathcal{S}} T(s', a, s)\, \mathrm{d}s' = 1$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.
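The validity condition can be checked mechanically once the state space is finite, in which case the integral over $\mathcal{S}$ becomes a sum over states. The sketch below assumes such a finite model; the states, actions, probabilities, and the helper `is_valid_transition_model` are hypothetical. The dictionary keys follow the argument order $T(s', a, s)$ used in the text.

```python
import math

STATES = ["low", "high"]
ACTIONS = ["wait", "push"]

# T[(s_next, a, s)] mirrors the argument order T(s', a, s).
T = {
    ("low", "wait", "low"): 1.0, ("high", "wait", "low"): 0.0,
    ("low", "push", "low"): 0.3, ("high", "push", "low"): 0.7,
    ("low", "wait", "high"): 0.2, ("high", "wait", "high"): 0.8,
    ("low", "push", "high"): 0.0, ("high", "push", "high"): 1.0,
}


def is_valid_transition_model(T, states, actions, tol=1e-9):
    """Check non-negativity and that T(., a, s) sums to one for every (s, a)."""
    for s in states:
        for a in actions:
            probs = [T[(s_next, a, s)] for s_next in states]
            if any(p < 0.0 for p in probs):
                return False
            if not math.isclose(sum(probs), 1.0, abs_tol=tol):
                return False
    return True


valid = is_valid_transition_model(T, STATES, ACTIONS)
```

The tolerance guards against floating-point rounding in sums such as 0.3 + 0.7.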

The actions are applied in a sequential manner. At decision epoch $t$, the system is in a state $s_t$. The agent then executes action $a_t$, and the system transitions to a new state $s_{t+1}$ according to $T$. The agent receives a reward $R'(s_t, a_t, s_{t+1})$, which is a random quantity as it depends on the system state at decision epoch $(t + 1)$. The reward function is assumed to be independent of the decision epoch, although no extra difficulty beyond notational inconvenience arises from epoch-dependent rewards. Positive reward is interpreted as an income, and negative reward as a cost. We adopt the alternative view of the reward function $R'$ where it is replaced by its expected value calculated by

$$R(s_t, a_t) = \mathbb{E}_{S_{t+1}}\left[ R'(s_t, a_t, s_{t+1}) \right], \tag{2.3}$$

defining a new expected reward function $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The expectation in the expression above is taken with respect to $T \equiv p(s_{t+1} \mid s_t, a_t)$. In a finite-horizon decision process with $\mathcal{T} = \{0, 1, \ldots, d\}$, the last action is selected at decision epoch $(d - 1)$, and an additional real-valued terminal reward $R_d(s_d)$ is sometimes defined. Throughout the rest of the thesis we assume the terminal reward is equal to zero.
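For a finite state space, the expectation in (2.3) is a probability-weighted sum over next states. The following sketch works through one state-action pair. The transition probabilities and the transition reward `r_prime` (standing in for the three-argument reward) are hypothetical values chosen for illustration.

```python
STATES = ["low", "high"]

# T[(s_next, a, s)]: probability of s_next given current state s and action a.
# Only the entries needed for the pair (s="low", a="push") are listed.
T = {
    ("low", "push", "low"): 0.3,
    ("high", "push", "low"): 0.7,
}


def r_prime(s, a, s_next):
    """Hypothetical transition reward: income of 10 for reaching 'high',
    cost of 2 for every 'push'."""
    income = 10.0 if s_next == "high" else 0.0
    cost = 2.0 if a == "push" else 0.0
    return income - cost


def expected_reward(s, a):
    """R(s, a) = sum over s' of T(s', a, s) * R'(s, a, s')."""
    return sum(T[(s_next, a, s)] * r_prime(s, a, s_next) for s_next in STATES)


r = expected_reward("low", "push")
```

Here the sum is 0.3 · (−2) + 0.7 · 8 = 5, so `r` equals 5.0 up to floating-point rounding: the random reward has been replaced by a deterministic function of the state and action alone.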

Adding the reward process to a controlled Markov process defines a Markov decision process (MDP).

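Definition 2.4 (Markov decision process; Puterman, 1994). A Markov decision process is a tuple $\langle \mathcal{T}, \mathcal{S}, \{\mathcal{A}_s\}, T, R \rangle$, where $\mathcal{T}$ is the set of decision epochs, $\mathcal{S}$ is the state space, $\mathcal{A}_s$ is the set of actions allowed in $s \in \mathcal{S}$ such that $\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}_s$, $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^{+}$ is the state transition model, and $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the real-valued reward function.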
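The tuple in Definition 2.4 maps naturally onto a small container type. The sketch below is one possible encoding for a finite MDP, not a prescribed implementation: the class `FiniteMDP`, its field names, and the two-state switch example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List

State = Hashable
Action = Hashable


@dataclass(frozen=True)
class FiniteMDP:
    epochs: range                                        # decision epochs
    states: List[State]                                  # state space
    allowed_actions: Dict[State, List[Action]]           # A_s for each state s
    transition: Callable[[State, Action, State], float]  # T(s', a, s)
    reward: Callable[[State, Action], float]             # R(s, a)

    @property
    def action_space(self):
        """A = union of A_s over all states, in first-seen order."""
        space = []
        for s in self.states:
            for a in self.allowed_actions[s]:
                if a not in space:
                    space.append(a)
        return space


def transition(s_next, a, s):
    """Deterministic switch: 'flip' toggles the state, 'stay' keeps it."""
    target = s if a == "stay" else ("on" if s == "off" else "off")
    return 1.0 if s_next == target else 0.0


mdp = FiniteMDP(
    epochs=range(0, 4),  # epochs {0, 1, 2, 3}: d = 3, last decision at epoch 2
    states=["off", "on"],
    allowed_actions={"off": ["flip", "stay"], "on": ["stay"]},
    transition=transition,
    reward=lambda s, a: 1.0 if s == "on" else 0.0,
)
```

Storing the per-state action sets and deriving the action space from them keeps the two consistent by construction.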
