Controlled Markov Chains: Some Stability Problems

by

Daniel Felipe Ávila Girardot

Thesis advisor: Ph.D. Mauricio Junca

A thesis presented for the degree of Master in Mathematics
Departamento de Matemáticas
Facultad de Ciencias
Universidad de los Andes
Colombia, 2016


First of all, I would like to express my deep gratitude to my advisor, Mauricio Junca. He always offered his time to help me and to provide guidance throughout the development of this work. Without him the completion of this thesis would have been impossible.

Secondly, I would like to thank my classmates and friends Santiago Pinzón, Andrés Galindo, Jerson Caro, Hernan García, Julian Forero, Gustavo Chaparro, Rodolfo Quintero, Andrés Garcia, Sergio Gil, Andrei Fuentes and Carlos Umana, who gave me their support throughout the development of this work.

Finally, I would like to thank my family for supporting me throughout all these years of study, and in particular for encouraging me to complete my master's studies.


Throughout this work we study some stability-related problems in the context of controlled Markov chains. As a first problem, we consider a division of the state space, and the goal is to construct a control policy such that the chain stabilizes as much as possible in each portion of the division without leaving it.

As a second problem, we characterize the domain of attraction and escape set of a controlled Markov chain via a function v, which happens to be the solution of a Bellman's equation. The interpretation of v as the solution of a Bellman's equation also provides a way to calculate such a function via a linear program.

Finally, under some assumptions, we find a policy that maximizes the probability of reaching a certain set A. Our approach uses certain cost functions and dynamic programming, so the problem can be solved using a linear program.


Contents

Introduction
1 Preliminaries
  1.1 Markov Chains
  1.2 Controlled Markov Chains
  1.3 Dynamic programming
    1.3.1 Discounted Problem
    1.3.2 Average Problem
2 Recurrent States and Entropy
  2.1 Forbidden states
  2.2 Partition of the state space
3 Zubov's Method
  3.1 Uncontrolled case
  3.2 Controlled case
  3.3 Forbidden states
4 Reaching a set of states A
  4.1 Relation with Zubov's Method
  4.2 Average Cost Approach
Conclusions and Future Directions


Controlled Markov chains provide a mathematical frame for dynamical systems whose outcomes contain certain ambiguity. Such models have been used in inventory control problems, communication models and even in biological models. The goal of this work is to provide control policies that guarantee some kind of stabilization for the system.

In [1, 2] the authors treat the problem of finding control policies such that the induced Markov chain has a maximum number of recurrent states, so the chain will stabilize as much as possible. Their approach uses the concept of entropy and states a convex program to find such policies. The first problem we consider is to find a policy such that, given any partition of the state space, the chain has a maximum number of recurrent states that cannot evolve to states belonging to another piece of the partition. Our approach follows the ideas from those papers together with a kind of Lyapunov function.

Following the idea of stabilizing the system, what if instead of stabilizing the chain as much as possible, we want to ensure that the chain will eventually reach some desired set? Stated more rigorously, let ν be an initial distribution over the state space. Our objective will be to find a policy π that maximizes the quantity

lim inf_{t→∞} P^π_ν(X_t ∈ A).

With this problem in mind, we first studied Zubov's method. This method, used in deterministic dynamical systems, allows us to describe the domain of attraction of a stable point (see [7, 11]); that is, it allows us to compute the initial states from which the system approaches the stable point. In this work, we develop a Zubov method for controlled Markov chains. Using a function v, which is the solution of a Bellman's equation, we characterize the domain of attraction and the escape set. Moreover, the calculation of v provides a policy such that any state of the domain of attraction has positive probability of reaching A.

Finally, we studied the relation between Zubov's method and the proposed problem. The method provides a lower bound for the desired quantity. However, using a variation of the function v it is possible to obtain an exact answer, although the Bellman's equations for such a variation are more complicated.


Preliminaries

The objective of this chapter is to introduce the terminology, notation and some important results that will be used throughout this work. The first section is dedicated to Markov chains; of particular interest is the discussion concerning invariant distributions, since it will be fundamental in Chapter 2. In the second section, controlled Markov chains are treated, and the definition of a control policy is given. The last section is dedicated to a particular case of controlled Markov chains, where a reward function is used.

1.1 Markov Chains

Let X be a countable set and let {X_t}_{t≥0} be an X-valued stochastic process defined over a probability space (Ω, F, P). We say that the process is a Markov chain if for all k ∈ N,

P(X_{k+1} = j_{k+1} | X_k = j_k, ..., X_0 = j_0) = P(X_{k+1} = j_{k+1} | X_k = j_k).

In case P(X_{k+1} = j | X_k = i) does not depend on time, that is on k, we say the Markov chain is time homogeneous; in such a case we write P(X_{k+1} = j | X_k = i) = P_{ij}. In order to simplify notation we will write P_i(·) := P(· | X_0 = i).

Definition 1.1. Let A ⊂ X. The random variable τ_A = inf{n ≥ 0 | X_n ∈ A} is called the hitting time of A.
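To make the definition concrete, the hitting time can be simulated directly for a finite chain. The following Python sketch is an illustration, not part of the thesis; the three-state transition matrix is made up. It returns the first n ≥ 0 with X_n ∈ A, or None if A is not reached within the simulated horizon.

```python
import random

def simulate_hitting_time(P, start, A, max_steps=10_000, rng=random):
    """Simulate the chain from `start` and return tau_A = inf{n >= 0 | X_n in A},
    or None if A is not hit within max_steps."""
    x = start
    for n in range(max_steps + 1):
        if x in A:
            return n
        # draw the next state according to row x of the transition matrix
        x = rng.choices(range(len(P)), weights=P[x])[0]
    return None

# Made-up 3-state chain: state 2 is absorbing and reachable from everywhere.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.0, 1.0]]

tau = simulate_hitting_time(P, start=0, A={2})
print(tau)  # almost surely an integer >= 1 here, since {2} is hit with probability 1
```

Note that τ_A = 0 whenever the chain starts inside A, matching the convention inf{n ≥ 0 | X_n ∈ A}.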

With such definition in mind we can decompose the state space via an equivalence relation as follows. First of all we need a way to relate states.

Definition 1.2. Let i, j ∈ X. If P_i(τ_j < ∞) > 0 we say j is accessible from i, denoted as i → j. Moreover, if i → j and j → i we say states i and j communicate, denoted as i ↔ j.

The communication relation just defined is an equivalence relation (see [9] for details); as a consequence, we can decompose the state space into disjoint equivalence classes. Using this, we will be able to decompose the state space in a way that is helpful to understand the long time behavior of the chain. First of all we need a notion that captures such behavior.


Definition 1.3. Given i ∈ X define τ_i(1) = inf{n ≥ 1 | X_n = i}. If P_i(τ_i(1) < ∞) = 1 the state i is called recurrent; otherwise it is called transient.

The following lemma contains an important fact that illustrates why we expect the chain to stabilize in recurrent states.

Lemma 1.4. Let j be a transient state. Then for all x ∈ X,

lim_{t→∞} P_x(X_t = j) = 0.

We need another definition that will be fundamental in our work.

Definition 1.5. Let A ⊂ X. We say A is closed if for all x ∈ A,

P_x(τ_{A^c} < ∞) = 0.

We now give an important characterization of closed sets.

Lemma 1.6. A set A is closed if and only if P_{x,i} = 0 for all x ∈ A, i ∈ A^c.

Proof. Suppose A is closed and let x ∈ A, i ∈ A^c. The event {X_1 = i} is contained in {τ_{A^c} < ∞}, so P_{x,i} ≤ P_x(τ_{A^c} < ∞) = 0. On the other hand, assume P_{x,i} = 0 for all x ∈ A, i ∈ A^c. Let n ∈ N; given x ∈ A, i ∈ A^c we have that

P_x(X_n = i) = Σ_{i_1 ∈ X, ..., i_{n−1} ∈ X} P_{x,i_1} ··· P_{i_{n−1},i} = Σ_{i_1 ∈ A, ..., i_{n−1} ∈ A} P_{x,i_1} ··· P_{i_{n−1},i} = 0,

where the last equality follows from the fact that i_{n−1} ∈ A and i ∈ A^c, implying P_{i_{n−1},i} = 0. Finally, since

{τ_{A^c} < ∞} ⊂ ∪_n {X_n ∈ A^c},

then

P_x(τ_{A^c} < ∞) ≤ Σ_n P_x(X_n ∈ A^c) = 0,

where the last equality follows from the above calculation.

So, intuitively, a closed set is such that if the chain enters it, it will never escape from it. The state space decomposition is the following; for a proof and further details see [9].

Proposition 1.7. The state space X can be decomposed as

X = T ∪ (∪_i C_i),


where T is a set of transient states (not necessarily a class), and the C_i's are closed, disjoint classes of recurrent states. Moreover, after reordering the state space, the transition matrix P can be written as

P =
| P_1  0    0   ... |
| 0    P_2  0   ... |
| ...       ...     |
| Q_1  Q_2  ...     |

where the P_i are stochastic matrices for the C_i. Furthermore, in case the state space is finite, such a decomposition can be accomplished as follows:

1. Decompose X into equivalence classes.
2. The closed classes are recurrent.
3. Classes which are not closed consist of transient states.

Therefore, recurrent states belong to closed classes, and in case the state space is finite, all closed classes consist of recurrent states. So in some sense the chain will stabilize in such closed classes. As a consequence, an interesting question is how to characterize those sets; invariant distributions help to answer it.
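For a finite chain, the decomposition of Proposition 1.7 can be computed mechanically: form the communicating classes (mutual accessibility) and keep the closed ones as the recurrent classes. A small Python sketch, given here only as an illustration, with a made-up four-state matrix:

```python
def decompose(P):
    """Return (transient_states, recurrent_classes) for a finite transition
    matrix P: closed communicating classes are recurrent, the rest transient."""
    n = len(P)
    # reach[i][j]: is j accessible from i (in >= 0 steps)?
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):  # Floyd-Warshall style transitive closure
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    # communicating classes: i ~ j iff i -> j and j -> i
    classes, seen = [], set()
    for i in range(n):
        if i not in seen:
            c = {j for j in range(n) if reach[i][j] and reach[j][i]}
            classes.append(c)
            seen |= c
    # a class is closed iff no probability mass leaves it
    recurrent = [c for c in classes
                 if all(P[i][j] == 0 for i in c for j in range(n) if j not in c)]
    transient = {i for i in range(n) if not any(i in c for c in recurrent)}
    return transient, recurrent

# Made-up example: {2, 3} is a closed class; 0 and 1 are transient.
P = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
print(decompose(P))  # ({0, 1}, [{2, 3}])
```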

Definition 1.8. Let ν be a probability distribution over the state space. It is called an invariant or stationary distribution if

ν^T = ν^T P.

The relation between invariant distributions and closed classes is the following, see [9].

Proposition 1.9. Suppose the chain is irreducible (there is only one class), recurrent and that X is finite. Then there exists a unique invariant distribution ν. Moreover, ν(x) > 0 for all x ∈ X.

Remark. Note that the above proposition implies that whenever all recurrent classes are finite, there exists a unique invariant distribution fully supported on each recurrent class.
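As a numerical illustration (not from the thesis): for a finite irreducible chain, ν can be approximated by power iteration on the lazy matrix (P + I)/2, which has the same invariant distribution as P but is aperiodic, so the iteration always converges. The two-state matrix below is made up.

```python
def invariant_distribution(P, iters=2000):
    """Approximate nu with nu^T = nu^T P by iterating nu <- nu * (P + I)/2.
    The lazy chain shares P's invariant distribution and is aperiodic."""
    n = len(P)
    nu = [1.0 / n] * n  # start from the uniform distribution
    lazy = [[(P[i][j] + (i == j)) / 2.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        nu = [sum(nu[i] * lazy[i][j] for i in range(n)) for j in range(n)]
    return nu

# Made-up chain: nu = (1/3, 2/3) solves nu^T = nu^T P.
P = [[0.6, 0.4],
     [0.2, 0.8]]
nu = invariant_distribution(P)
print(nu)  # approximately [1/3, 2/3]
```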

As a corollary, we obtain an important theorem, which states that invariant distributions are convex combinations of the invariant distributions corresponding to the recurrent classes.

Theorem 1.10. Let X be a countable state space, and decompose it as X = T ∪ (∪_i C_i), where T is a set of transient states and the C_i's are closed, disjoint classes of recurrent states. Suppose all C_i's are finite and let ν_i be the invariant distribution associated to C_i. Then any invariant distribution ν satisfies

ν = Σ_i α_i ν_i, where Σ_i α_i = 1.

Proof. Let ν be an invariant distribution. Therefore, for all t ∈ N,

ν(j) = Σ_{x∈X} ν(x) P_x(X_t = j).

Let j be a transient state. Taking the limit in the above expression, and recalling that

lim_{t→∞} P_x(X_t = j) = 0,

we obtain ν(j) = 0. As a consequence, ν must be supported on the C_i's. Recalling that states can be reorganized to obtain

P =
| P_1  0    0   ...  0 |
| 0    P_2  0   ...  0 |
| ...       ...        |
| Q_1  Q_2  ...        |

we note that ν^T = ν^T P implies ν^T_{C_i} = ν^T_{C_i} P_i. Because of the uniqueness of ν_i we conclude ν^T_{C_i} = α_i ν_i for some α_i. Finally, evaluating the last expression on C_i we obtain α_i = ν(C_i).

To end this section, we present some useful lemmas. Recall that given a sequence of events {E_n | n ∈ N} we can define the events

lim inf_{n→∞} E_n := ∪_m ∩_{n≥m} E_n,
lim sup_{n→∞} E_n := ∩_m ∪_{n≥m} E_n.

Lemma 1.11 (Fatou's lemma for sets; see [10]).

P(lim inf_{n→∞} E_n) ≤ lim inf_{n→∞} P(E_n),
P(lim sup_{n→∞} E_n) ≥ lim sup_{n→∞} P(E_n).

Lemma 1.12. Let A ⊂ X and consider the hitting time of A. Then

lim inf_{n→∞} {τ_A ≤ n} = {τ_A < ∞} = lim sup_{n→∞} {τ_A ≤ n}.

Proof. Let us prove the first equality. Let ω ∈ lim inf{τ_A ≤ n}, so there exists m ∈ N such that for all n ≥ m, ω ∈ {τ_A ≤ n}; that is, for all n ≥ m, τ_A(ω) ≤ n, which implies ω ∈ {τ_A < ∞}. Now let ω ∈ {τ_A < ∞}; defining m := τ_A(ω) we obtain ω ∈ ∩_{n≥m} {τ_A ≤ n}, which implies ω ∈ lim inf{τ_A ≤ n}.

For the second equality, let ω ∈ {τ_A < ∞} and let m ∈ N. Choose i big enough so that i ≥ m and i ≥ τ_A(ω). Therefore ω ∈ ∪_{n≥m} {τ_A ≤ n}, and since m was arbitrary, ω ∈ lim sup{τ_A ≤ n}. Now let ω ∈ lim sup{τ_A ≤ n}, so for all m ∈ N there exists n ≥ m such that ω ∈ {τ_A ≤ n}; that is, τ_A(ω) ≤ n < ∞, so ω ∈ {τ_A < ∞}.

Lemma 1.13. Let A be a closed set and consider the event

E = {∃ n, m ∈ N such that n ≤ m, X_n ∈ A, X_m ∉ A}.

Then P_x(E) = 0 for all x ∈ X.


Proof. We can write the event as E = ∪_n ∪_{m≥n} {X_n ∈ A, X_m ∉ A}. Therefore,

P_x(E) ≤ Σ_n Σ_{m≥n} P_x(X_n ∈ A, X_m ∉ A).

Let us see that for all m ≥ n, P_x(X_n ∈ A, X_m ∉ A) = 0. We have that

P_x(X_n ∈ A, X_m ∉ A) = Σ_{j∈A} Σ_{i∉A} P_x(X_n = j, X_m = i).

Now, each term P_x(X_n = j, X_m = i) = P_x(X_n = j) P_j(X_{m−n} = i). Therefore, since A is closed we have P_j(X_{m−n} = i) = 0, so that P_x(X_n ∈ A, X_m ∉ A) = 0 and, as a consequence, P_x(E) = 0.

1.2 Controlled Markov Chains

Controlled Markov chains, or Markov decision processes, provide a mathematical scheme for problems where the outcomes contain certain ambiguity. In this section we briefly introduce the fundamental aspects of controlled Markov chains in the discrete context; a reference for this topic can be found in [8]. The ideas for the non-discrete context are quite similar, although some mathematical issues appear; the interested reader can consult [4].

Let {(X_t, U_t)}_{t≥0} be a stochastic process over a countable set X × U. The set X is the state space of the system, while U is the set of control actions. Such a process is a controlled Markov chain if for all k ∈ N, i_{[0,k]} ⊂ X, u_{[0,k]} ⊂ U,

P(X_{k+1} = j | X_{[0,k]} = i_{[0,k]}, U_{[0,k]} = u_{[0,k]}) = T(X_{k+1} = j | X_k = i_k, U_k = u_k),

where T satisfies Σ_{j∈X} T(X_{k+1} = j | X_k = i_k, U_k = u_k) = 1. So the dynamics of the process are defined by the probability of X_{k+1} given the previous state X_k and a control action U_k. Since we will only use homogeneous Markov chains, the following notation makes sense:

P^u_{ij} := P(X_{k+1} = j | X_k = i, U_k = u), where i, j ∈ X, u ∈ U.

We are going to be interested in finding control policies, or strategies, such that the evolution of the process has a certain desired property. We will only deal with fully observed controlled Markov chains, which means that at each instant k the controller observes exactly the state x_k. Before giving a precise definition of a control policy we need the following notation: denote by U(x) ⊂ U the set of feasible actions at state x ∈ X, and let K = {(x, u) | x ∈ X, u ∈ U(x)}.

Definition 1.14. Let P(U) be the set of probability measures over U. A control policy π is a sequence of functions (μ_0, μ_1, μ_2, ...), where each μ_k : K^{k−1} × X → P(U) and

Σ_{u∈U(x_k)} μ_k(u | x_0, u_0, ..., x_{k−1}, u_{k−1}, x_k) = 1.


Figure 1.1: Graph representations of transition matrices P1 and P2.

Denote by Π the set of all control policies. We have different types of control policies:

1. We say π = (μ_0, μ_1, μ_2, ...) is deterministic if for any k ∈ N and any sequence x_0, u_0, ..., x_{k−1}, u_{k−1}, x_k ∈ K^{k−1} × X, the measure μ_k(· | x_0, u_0, ..., x_{k−1}, u_{k−1}, x_k) is zero for all but one element of U(x_k). Note that in such a case we can regard μ_k just as a function from K^{k−1} × X to U.

2. We say π = (μ_0, μ_1, μ_2, ...) is a Markov policy if μ_k does not depend on the history; that is, it only depends on the current state x_k. Denote by Π_M ⊂ Π the set of such policies.

3. If π is a Markov policy such that π = (μ, μ, μ, ...) we say it is a stationary policy. Denote by Π_S ⊂ Π_M the set of such policies.

As we will see in the next section, Markov deterministic policies play an important role in several optimization problems; in particular, deterministic stationary policies will be fundamental in infinite horizon problems. We first have an easily proven lemma telling us that Markov policies are, in some sense, well behaved.

Lemma 1.15. Let π = (μ_0, μ_1, μ_2, ...) be a Markov policy. The transition probabilities of the Markov chain under the induced probability measure P^π can be computed as

P^π(X_{k+1} = i | X_k = x) = Σ_{u∈U(x)} μ_k(u|x) · P(X_{k+1} = i | X_k = x, U_k = u).

Moreover, if π is stationary we obtain a homogeneous Markov chain.

Example 1.16. Let X = {1, 2, 3} and U = {1, 2}. For each control action we have the transition matrices

P1 =
| 0.3  0.7  0   |
| 0    1    0   |
| 0.8  0    0.2 |

P2 =
| 1    0    0 |
| 0.8  0.2  0 |
| 0    0    1 |

We can represent these matrices as graphs, as shown in Figure 1.1. Consider the stationary policy π = (μ, μ, μ, ...),

μ =
| 1    0   |
| 0    1   |
| 0.2  0.8 |

So the transition matrix P^π is given by

P^π =
| 0.3   0.7  0    |
| 0.8   0.2  0    |
| 0.16  0    0.84 |

As a consequence, the induced Markov chain has one recurrent class {1, 2} and one transient class {3}. Note that {1, 2} was not even a class under P1 or P2.

Graph representation of P^π.
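Lemma 1.15's formula for a stationary policy, P^π(X_{k+1} = i | X_k = x) = Σ_u μ(u|x) P^u_{xi}, is easy to verify numerically. The sketch below (an illustration only; states are 0-indexed in the code) reproduces the matrix P^π of Example 1.16.

```python
def induced_matrix(Ps, mu):
    """P^pi[x][i] = sum_u mu[x][u] * Ps[u][x][i]  (Lemma 1.15, stationary policy)."""
    n, n_actions = len(mu), len(Ps)
    return [[sum(mu[x][u] * Ps[u][x][i] for u in range(n_actions))
             for i in range(n)] for x in range(n)]

# Transition matrices of Example 1.16 (actions 1 and 2).
P1 = [[0.3, 0.7, 0.0], [0.0, 1.0, 0.0], [0.8, 0.0, 0.2]]
P2 = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 0.0, 1.0]]
mu = [[1.0, 0.0],   # state 1: always action 1
      [0.0, 1.0],   # state 2: always action 2
      [0.2, 0.8]]   # state 3: randomize
P_pi = induced_matrix([P1, P2], mu)
print(P_pi)  # matches P^pi from the example, up to floating point
```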

In the next section we will briefly discuss how these policies are related, and in particular how they behave with respect to the maximization of cost functions.

1.3 Dynamic programming

In the following section we will discuss the optimization of several cost functions. First of all we need a reward function: for all t ∈ N, let r_t be a function from X × U to R. The purpose of r_t(x, u) is to define a cost per stage, that is, a cost for being in state x and selecting a control u at step or time t.

Depending on the kind of problem there are different cost functions involved. In finite horizon problems, upon selecting a policy the controller receives rewards in the periods of time 1, ..., N. The objective is to find a policy that minimizes or maximizes such rewards. More precisely, given an initial state x we are interested in finding a policy π = (μ_0, μ_1, ...) that minimizes the cost function

v^π_N(x) := E^π_x [ Σ_{t=0}^{N−1} r_t(X_t, U_t) + r_N(X_N) ].


Since we are interested in the long time behaviour of the chain, this model will not be our focus; for further details of such models see [8] or [3].

On the other hand, in infinite horizon problems there is no finite planning horizon. We will be interested in two kinds of infinite horizon problems: the discounted case and the average case. In the first case, the controller receives a penalization 0 < γ < 1 for each period of time. So given an initial state x, we will be interested in finding a policy π that maximizes the cost function

v^π_γ(x) := lim sup_{N→∞} E^π_x [ Σ_{t=0}^{N} γ^t · r_t(X_t, U_t) ].

In the average case, we will be interested in finding a policy that maximizes the cost function

v^π(x) := lim sup_{N→∞} (1/N) · E^π_x [ Σ_{t=0}^{N−1} r_t(X_t, U_t) ].
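Both criteria can be estimated by plain simulation once a stationary policy is fixed. The sketch below is an illustration with made-up data, not a method from the thesis; it truncates the discounted series at a finite horizon, which for 0 < γ < 1 introduces an error of at most γ^horizon · sup|r| / (1 − γ). Here r_pi denotes a per-state reward already averaged over the policy's actions.

```python
import random

def discounted_return(P_pi, r_pi, x0, gamma, horizon, runs=2000, seed=0):
    """Monte Carlo estimate of E^pi_x0 [ sum_t gamma^t r(X_t) ] under a fixed
    stationary policy; P_pi and r_pi are the induced matrix and state rewards."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        x, acc, g = x0, 0.0, 1.0
        for _ in range(horizon):
            acc += g * r_pi[x]
            g *= gamma
            x = rng.choices(range(len(P_pi)), weights=P_pi[x])[0]
        total += acc
    return total / runs

# Made-up example: state 0 is absorbing and pays 1, state 1 is absorbing and pays 0.
P_pi = [[1.0, 0.0], [0.0, 1.0]]
r_pi = [1.0, 0.0]
v = discounted_return(P_pi, r_pi, x0=0, gamma=0.9, horizon=200)
print(v)  # close to 1/(1 - 0.9) = 10
```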

In what follows we will briefly discuss whether there exists a solution for the previously presented problems, and computational approaches to solve them.

1.3.1 Discounted Problem

To treat this problem we will make some assumptions:

1. Assume there are stationary rewards r(x, u).
2. Assume the rewards are bounded.
3. Assume the discount factor γ ∈ (0, 1).

Under those hypotheses and using the dominated convergence theorem it is clear that

v^π_γ(x) = lim sup_{N→∞} E^π_x [ Σ_{t=0}^{N} γ^t · r(X_t, U_t) ] = E^π_x [ Σ_{t=0}^{∞} γ^t · r(X_t, U_t) ].

We are interested in finding a policy π* such that v^{π*}_γ = v*_γ, where

v*_γ(x) = sup_{π∈Π} v^π_γ(x), x ∈ X.

There is a first interesting result, which states that for each x ∈ X and any π ∈ Π there exists π′ ∈ Π_M such that v^π_γ(x) = v^{π′}_γ(x); see [8] for instance. As a consequence,

v*_γ(x) = sup_{π∈Π} v^π_γ(x) = sup_{π∈Π_M} v^π_γ(x), x ∈ X.

The existence of an optimal policy is not a trivial issue. Such existence relies on the Bellman's equations, which are defined as follows.


Definition 1.17. The system of equations

v_γ(x) = sup_{u∈U(x)} { r(x, u) + γ Σ_{j∈X} P^u_{xj} · v_γ(j) }, x ∈ X,

is called the Bellman's equation for the discounted system. Denote by D the set of deterministic stationary policies, so the Bellman's equations can be expressed in vector notation as

v_γ = sup_{π∈D} { r_π + γ · P_π · v_γ }.

To get a sense of why such equations make sense, let us fix a stationary policy π. Given x ∈ X we have that

v^π_γ(x) = E^π_x [ Σ_{t=0}^{∞} γ^t · r(X_t, U_t) ]
= E^π_x[r(X_0, U_0)] + E^π_x [ Σ_{t=1}^{∞} γ^t · r(X_t, U_t) ]
= E^π_x[r(X_0, U_0)] + Σ_{j∈X} E^π [ Σ_{t=1}^{∞} γ^t · r(X_t, U_t) | X_1 = j, X_0 = x ] · P^π_{x,j}
= E^π_x[r(X_0, U_0)] + γ Σ_{j∈X} E^π [ Σ_{t=1}^{∞} γ^{t−1} · r(X_t, U_t) | X_1 = j ] · P^π_{x,j}
= Σ_{u∈U(x)} r(x, u) · π(u|x) + γ Σ_{j∈X} v^π_γ(j) · P^π_{x,j}
= Σ_{u∈U(x)} π(u|x) · [ r(x, u) + γ Σ_{j∈X} v^π_γ(j) · P^u_{x,j} ]
≤ v_γ(x).

So for stationary policies we have sup_{π∈Π_S} v^π_γ(x) ≤ v_γ(x). In general, we would like to prove v_γ = v*_γ. To achieve this we need to express the Bellman's equations in operator notation. Let V denote the space of bounded real-valued functions on X endowed with the supremum norm, and with the partial order relation

given u, v ∈ V, we say u ≥ v ⇔ u(x) ≥ v(x) for all x ∈ X.

It is possible to prove that if v ∈ V then sup_{π∈D} { r_π + γ · P_π · v } ∈ V. Therefore, it makes sense to define an operator L : V → V,

L(v) := sup_{π∈D} { r_π + γ · P_π · v }.


Lemma 1.18. Suppose there exists v ∈ V such that:

• v ≥ L(v). Then v ≥ v*_γ.
• v ≤ L(v). Then v ≤ v*_γ.
• v = L(v). Then v is the only element with such property, and v = v*_γ.

Furthermore, the solution v_γ of the Bellman's equations is a fixed point of L, so the last item of the previous lemma tells us that in case v_γ exists, then v_γ = v*_γ. To conclude the existence of a solution of the Bellman's equation, theorems such as the Banach fixed point theorem can be applied. Those equations also provide a way to conclude the existence of a policy π such that v^π_γ = v*_γ. To sum up, we have the following; for a proof and further details see [8] or [3].

Theorem 1.19.

1. There exists a solution of the Bellman's equation.
2. The solution v_γ of the Bellman's equations satisfies v_γ = v*_γ.
3. Suppose the state space X is either countable or finite, and U(x) is finite for all x ∈ X. Then there exists a stationary deterministic policy π such that v^π_γ = v*_γ.
4. Such an optimal policy can be found by defining π = (μ, μ, ...) where

μ(x) ∈ arg max_{u∈U(x)} { r(x, u) + γ Σ_{j∈X} P^u_{xj} · v_γ(j) }.

Remark. There are more general assumptions that can be used in item 3. Since we are not using them in the present work we will not discuss them.
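Items 1–4 suggest the standard computational route: since the operator L is a γ-contraction on (V, ‖·‖_∞), iterating v ← L(v) (value iteration) converges to v*_γ, and a greedy policy as in item 4 can then be read off. A minimal sketch, on a made-up two-state, two-action model (an illustration, not part of the thesis):

```python
def value_iteration(P, r, gamma, iters=500):
    """Iterate v <- L(v), L(v)(x) = max_u { r[x][u] + gamma * sum_j P[u][x][j] v[j] },
    then return v and a greedy deterministic policy (item 4 of Theorem 1.19)."""
    n = len(r)

    def q(x, u, v):
        return r[x][u] + gamma * sum(P[u][x][j] * v[j] for j in range(n))

    v = [0.0] * n
    for _ in range(iters):
        v = [max(q(x, u, v) for u in range(len(P))) for x in range(n)]
    policy = [max(range(len(P)), key=lambda u: q(x, u, v)) for x in range(n)]
    return v, policy

# Made-up model: action 1 moves to state 1, where action 0 earns reward 1 forever.
P = [[[1.0, 0.0], [0.0, 1.0]],   # P^0: stay put
     [[0.0, 1.0], [0.0, 1.0]]]   # P^1: go to state 1
r = [[0.0, 0.0],                 # r(x=0, u)
     [1.0, 0.0]]                 # r(x=1, u)
v, policy = value_iteration(P, r, gamma=0.9)
print(policy)  # [1, 0]: move to state 1, then keep collecting the reward
```

Here v(1) converges to 1/(1 − γ) = 10 and v(0) to γ · 10 = 9, as the fixed-point equation predicts.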

To end this section we present a way to solve such equations via linear programming. Recall that, because of the first item of the previous lemma, if v ∈ V satisfies

v ≥ sup_{π∈D} { r_π + γ · P_π · v },

then v provides an upper bound for the optimal solution v*_γ. Therefore, it makes sense to approach the optimal solution by the least large v that satisfies v ≥ L(v). However, the restriction v ≥ L(v) is non-linear; fortunately, it can be written as the linear restrictions

v(x) − Σ_{j∈X} γ P^u_{x,j} v(j) ≥ r(x, u) for u ∈ U(x), x ∈ X.

The linear program is then as follows. Consider a vector ν ∈ R^{|X|}_{≥0}; in case Σ_{x∈X} ν(x) = 1 we can regard it as an initial distribution.


Primal Linear Program

Minimize  Σ_{x∈X} ν(x) v_γ(x)
Subject to  v_γ(x) − Σ_{j∈X} γ P^u_{x,j} v_γ(j) ≥ r(x, u) for u ∈ U(x), x ∈ X.

However, in order to reconstruct the control it is more appropriate to consider the dual linear program.

Dual Linear Program

Maximize  Σ_{s∈X} Σ_{u∈U(s)} r(s, u) x(s, u)
Subject to  Σ_{u∈U(j)} x(j, u) − Σ_{s∈X} Σ_{u∈U(s)} γ P^u_{s,j} x(s, u) = ν(j), j ∈ X,
            x(j, u) ≥ 0 for u ∈ U(j), j ∈ X.

Finally, we have the following theorem, which ensures that solving the above program is equivalent to solving the Bellman's equation.

Theorem 1.20.

1. There exists an optimal solution x* of the dual linear program.
2. Define a stationary policy π = (μ, μ, ...) as

μ(u^+|s) = x*(s, u^+) / Σ_{u∈U(s)} x*(s, u).

Then v^π_γ is a solution to the Bellman's equations.
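The normalization in item 2 is mechanical. The sketch below applies it to a hypothetical dual solution x(s, u) (made-up numbers, not produced by an actual LP solver), assuming every state has positive total mass:

```python
def policy_from_occupation(x):
    """mu(u|s) = x[s][u] / sum_u x[s][u], as in item 2 of Theorem 1.20.
    Assumes each row of x has positive sum."""
    return [[xu / sum(row) for xu in row] for row in x]

# Hypothetical dual solution for 2 states and 2 actions.
x = [[3.0, 1.0],
     [0.0, 2.0]]
mu = policy_from_occupation(x)
print(mu)  # [[0.75, 0.25], [0.0, 1.0]]
```

Each row of μ is a probability distribution over actions, i.e. a stationary randomized policy.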

1.3.2 Average Problem

We need some assumptions:

1. Assume there are stationary rewards r(x, u).
2. Assume the rewards are bounded.

As before, we are going to be interested in finding a policy π* such that v^{π*} = v*, where

v*(x) = sup_{π∈Π} v^π(x),  v^π(x) = lim sup_{N→∞} (1/N) · E^π_x [ Σ_{t=0}^{N−1} r_t(X_t, U_t) ],  x ∈ X.

First of all, there is an interesting result relating discounted cost functions with average cost functions (see [8]).


Lemma 1.21. Let π be a stationary policy, and let v^π_γ be the discounted cost function for some γ ∈ (0, 1). Then

v^π = lim_{γ→1} (1 − γ) v^π_γ.
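Lemma 1.21 can be checked numerically for a fixed stationary policy: solve v = r_π + γ P_π v by fixed-point iteration and watch (1 − γ) v^π_γ approach the average reward Σ_x ν(x) r_π(x) as γ → 1. The two-state chain below is made up; its invariant distribution is (1/3, 2/3), so the average reward is 3 · 1/3 = 1.

```python
def evaluate_discounted(P_pi, r_pi, gamma, iters=20_000):
    """Fixed-point iteration for v = r_pi + gamma * P_pi v (policy evaluation)."""
    n = len(r_pi)
    v = [0.0] * n
    for _ in range(iters):
        v = [r_pi[x] + gamma * sum(P_pi[x][j] * v[j] for j in range(n))
             for x in range(n)]
    return v

# Made-up irreducible chain with invariant distribution nu = (1/3, 2/3).
P_pi = [[0.6, 0.4], [0.2, 0.8]]
r_pi = [3.0, 0.0]
vals = []
for gamma in (0.9, 0.99, 0.999):
    v = evaluate_discounted(P_pi, r_pi, gamma)
    vals.append((1 - gamma) * v[0])
    print(gamma, vals[-1])  # approaches the average reward 1 as gamma -> 1
```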

As shown in [8], for each x ∈ X and π ∈ Π there exists a Markovian policy π′ such that v^π = v^{π′}. So, as in the discounted case, it is enough to consider only policies in Π_M. Following the ideas of the discounted case, one would like to define some kind of Bellman's equation that captures the desired cost function.

Definition 1.22. The system of equations

sup_{u∈U(x)} { Σ_{j∈X} P^u_{x,j} v(j) − v(x) } = 0

and

sup_{u∈U(x)} { r(x, u) − v(x) + Σ_{j∈X} P^u_{x,j} h(j) − h(x) } = 0,

for x ∈ X, is called the optimality equations for the multi-chain model.

Remark. Multi-chain models are those where there exists a stationary policy for which the induced Markov chain has at least two recurrent classes. Uni-chain models are easier: in such models there is just one recurrent class, and the v function is constant, so there is no need to introduce the first equation. We have the following theorem, which relates the solution of the above equations with the average cost function; for a proof see [8].

Theorem 1.23.

• Suppose the above equations have a solution (v, h); then v = v*.
• Suppose the state space X is finite, and U(x) is finite for all x ∈ X. Then the optimality equations have a solution.
• Suppose the state space X is finite, and U(x) is finite for all x ∈ X. Then there exists a stationary deterministic policy π such that v^π = v*.
• Such an optimal policy can be found by defining π = (μ, μ, ...) where

μ(x) ∈ arg max_{u∈U(x)} { r(x, u) + Σ_{j∈X} P^u_{xj} · h(j) },

with the additional condition that P_π v = v.

Those equations can be solved via linear programming. Consider a vector ν ∈ R^{|X|}_{≥0}. The linear program is as follows.


Primal Linear Program

Minimize  Σ_{x∈X} ν(x) v(x)
Subject to  v(x) ≥ Σ_{j∈X} P^u_{x,j} v(j) for u ∈ U(x), x ∈ X,
and  v(x) ≥ r(x, u) + Σ_{j∈X} P^u_{x,j} h(j) − h(x) for u ∈ U(x), x ∈ X.

And the dual linear program is given by:

Dual Linear Program

Maximize  Σ_{s∈X} Σ_{u∈U(s)} r(s, u) x(s, u)
Subject to  Σ_{u∈U(j)} x(j, u) − Σ_{s∈X} Σ_{u∈U(s)} P^u_{s,j} x(s, u) = 0, j ∈ X,
and  Σ_{u∈U(j)} x(j, u) + Σ_{u∈U(j)} y(j, u) − Σ_{s∈X} Σ_{u∈U(s)} P^u_{s,j} y(s, u) = ν(j), j ∈ X,
and  x(j, u) ≥ 0, y(j, u) ≥ 0, u ∈ U(j), j ∈ X.

The theorem relating the above program with the multi-chain optimality equations is the following.

Theorem 1.24.

1. There exists an optimal solution (x*, y*) of the dual linear program.
2. Define a stationary policy π = (μ, μ, ...) as

μ(u^+|s) = x*(s, u^+) / Σ_{u∈U(s)} x*(s, u)  if Σ_{u∈U(s)} x*(s, u) > 0,
μ(u^+|s) = y*(s, u^+) / Σ_{u∈U(s)} y*(s, u)  otherwise.


Recurrent States and Entropy

In this chapter we will study the relation between the entropy function and recurrent states. In [1] the authors treat the problem of finding a control that maximizes the number of recurrent states, in such a way that the recurrent states must avoid a certain set of forbidden states F. In a more recent paper [2] the authors generalize this methodology: they treat the problem of finding a control that maximizes the number of recurrent states subject to a set of convex constraints. Using those ideas we construct a control such that, given any partition of the state space, the control finds the maximum number of recurrent states in each piece of the partition, avoiding states that belong to another piece. This chapter is organized as follows: the first section briefly discusses the algorithm proposed in [1], while in the second section we treat the proposed problem.

Let X be a finite state space and U a finite set of control actions. Recall that each control u has an associated transition matrix P^u; we will also assume that all control actions are feasible for each state x, that is, U(x) = U. Since we will only deal with stationary policies π = (μ, μ, ...), Lemma 1.15 implies that the transition probabilities can be computed as

P^π(X_{t+1} = i | X_t = x) = Σ_{u∈U} μ(u|x) · P^u_{xi}.

2.1 Forbidden states

Fix a set of states F ⊂ X. Let π be a policy and denote the set of recurrent states under that policy as

X^π_R := {x ∈ X | P^π_x(τ_x(1) < ∞) = 1}.

The F-safe recurrent states under policy π are defined as

X^π_{R,F} := {x ∈ X^π_R | P^π_{x,x^+} = 0 for all x^+ ∈ F}.


So the maximal set of F-safe recurrent states is defined as

X_F := ∪_π X^π_{R,F}.

The goal is to find a policy π such that X^π_{R,F} = X_F. Before stating the solution to this problem we require some notation. Denote the set of probability measures over a set S as P(S). The entropy function is defined as follows.

Definition 2.1. The function H : P(S) → R_{≥0},

H(f) = − Σ_{s∈S} f(s) · ln(f(s)),

where the convention 0 · ln(0) = 0 is used, is called the entropy of f.
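Definition 2.1 is straightforward to implement with the convention 0 · ln(0) = 0; for instance, the uniform distribution on n points attains the maximal entropy ln n, while a point mass has entropy 0. A small sketch (an illustration, not part of the thesis):

```python
from math import log

def entropy(f):
    """H(f) = -sum_s f(s) ln f(s), with the convention 0 * ln(0) = 0."""
    s = sum(p * log(p) for p in f if p > 0)
    return -s if s else 0.0

print(entropy([0.25, 0.25, 0.25, 0.25]))  # ln 4, about 1.3863: the maximum on 4 points
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0: a point mass has zero entropy
```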

The importance of the entropy relies on the following lemma; for a proof see [1].

Lemma 2.2. Let W be a convex subset of P(S). Define f*_S ∈ arg max H(f_S), where f_S ∈ W, and let S_f := {s ∈ S | f(s) > 0}. Then S_{f_S} ⊂ S_{f*_S} for all f_S ∈ W.

Therefore, f*_S has the fewest zeros, and as we will see later this corresponds to a maximum number of recurrent states. The following program solves the desired problem.

max H(f_{X,U}),  f_{X,U} ∈ P(X × U)

Subject to
Σ_{u^+∈U} f_{X,U}(x^+, u^+) = Σ_{x∈X, u∈U} P^u_{x x^+} f_{X,U}(x, u) for all x^+ ∈ X,
Σ_{u∈U} f_{X,U}(x, u) = 0 for all x ∈ F.

Note that between 0 and 1 the entropy is a concave function. Moreover, the constraints are linear, so the distributions that satisfy those constraints form a convex subset of P(X × U); hence the above program is convex. The first constraint defines an invariant distribution, and the second constraint captures the F-safety. Intuitively, the program is just saying that we must take the invariant distribution with the fewest zeros, but those zeros must contain F. So, because of Theorem 1.10, the places where the distribution is not zero must be recurrent states. The theorem is as follows.

Theorem 2.3. Assume the previous problem is feasible and that f*_{X,U} is an optimal solution. Use the notation f*_X(x) = Σ_{u∈U} f*_{X,U}(x, u). Then X_F = S_{f*_X} and X_F = X^π_{R,F}, where π is given by

π(u|x) = f*_{X,U}(x, u) / f*_X(x)  if x ∈ S_{f*_X},
π(u|x) = 1/|U|  otherwise.

2.2 Partition of the state space

Let us fix a partition of the state space, that is, a family of mutually disjoint sets P_i such that ∪_i P_i = X. We are interested in states x that are recurrent under some policy π, such that if x ∈ P_i then P^π_x(X_1 = j) = 0 for all j ∈ P^c_i. Recall that, because of Proposition 1.7, the state x belongs to a closed class; let C^π_x be such class. Therefore, the above assertion can be expressed just as C^π_x ⊂ P_i. Denote the recurrent states satisfying such property as

X^π_{R,P} := {x ∈ X | x ∈ X^π_R and C^π_x ⊂ P_i for some i}.

So the maximal set of recurrent states with such property is defined as

X_P := ∪_π X^π_{R,P}.

Then the problem of interest is to find a policy π such that

X^π_{R,P} = X_P.

Remark. One possible approach to solve this problem would be to fix a piece P_i of the partition, define the set P^c_i as forbidden, and apply the algorithm proposed in the last section; then repeat with all the other pieces of the partition and define a new policy using the policies calculated. However, this would require solving a convex problem for each piece of the partition.

One interesting observation is that policies arising from problems related to maximizing the number of recurrent states may not be deterministic, as the following example shows.

Example 2.4. Let X = {1, 2, 3} and U = {1, 2}, where the transition matrices are given by

P1 =
| 0.4  0.6  0 |
| 1    0    0 |
| 0    1    0 |

P2 =
| 0  1  0 |
| 0  0  1 |
| 0  0  1 |

Let us consider a policy π, with corresponding transition matrix P^π,

π(u|x) =
| 0    1   |
| 0.5  0.5 |
| 1    0   |

P^π =
| 0    1  0   |
| 0.5  0  0.5 |
| 0    1  0   |

Figure 2.1: Graph representation of transition matrices P1, P2 and P^π.

Note that the induced Markov chain consists of one class, so there is just one recurrent class. However, all deterministic policies induce at least two classes. To see this, note that, starting from state 2, if we select just the first transition matrix, no matter what control action we define on states 1, 3, there is no way 3 can be accessible from 2. On the other hand, if we select the second transition matrix, no matter what control action we define on states 1, 3, there is no way 1 can be accessible from 2.

Our approach to solve the proposed problem uses a kind of Lyapunov function. Let L : X → N be defined as

L(P_i) = i,

that is, L(x) = i for every x ∈ P_i.

The idea is to capture the notion of a closed set using such a function. The next lemma relates a closed set with the function just described, and it will be fundamental to solve the proposed problem.

Lemma 2.5. Fix a policy π, and let C^π ⊂ X be a closed set under that policy. Suppose for all x ∈ C^π,

E^π_x[L(X_1)] ≤ E^π_x[L(X_0)].

Then the value of the L function along C^π is constant.

Proof. Suppose, for contradiction, that L is not constant along C^π. Then there exist a, b ∈ C^π with P^π_b(X_1 = a) > 0 such that L(a) > L(b), and assume L(b) is the minimum of A := {L(x) | x ∈ C^π}. We have that

E^π_b[L(X_1)] = Σ_{x∈X} L(x) · P^π_b(X_1 = x) = Σ_{x∈C^π} L(x) · P^π_b(X_1 = x),
E^π_b[L(X_0)] = L(b).

Now, since L(b) is the minimum of A, and L(a) > L(b) with P^π_b(X_1 = a) > 0, we have

L(b) · Σ_{x∈C^π} P^π_b(X_1 = x) < Σ_{x∈C^π} L(x) · P^π_b(X_1 = x) ≤ L(b).

However, since C^π is a closed set, Σ_{x∈C^π} P^π_b(X_1 = x) = 1, so the above inequality tells us that L(b) < L(b), a contradiction.
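The drift hypothesis E^π_x[L(X_1)] ≤ E^π_x[L(X_0)] of Lemma 2.5 is cheap to verify numerically for a given induced matrix P^π and partition labels; the sketch below is an illustration with made-up data, not part of the thesis.

```python
def satisfies_drift(P_pi, L_vals, states):
    """Check E_x[L(X_1)] <= L(x) for every x in `states`,
    where L_vals[x] is the partition label of state x."""
    n = len(P_pi)
    return all(sum(P_pi[x][j] * L_vals[j] for j in range(n)) <= L_vals[x]
               for x in states)

# Made-up chain: {0, 1} is a closed set with label 1, state 2 has label 2.
P_pi = [[0.5, 0.5, 0.0],
        [1.0, 0.0, 0.0],
        [0.3, 0.0, 0.7]]
L_vals = [1, 1, 2]
print(satisfies_drift(P_pi, L_vals, {0, 1}))  # True: L is constant on the closed set
print(satisfies_drift(P_pi, L_vals, {2}))     # True: state 2 can only drop to label 1
```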

The L function also provides a way to identify closed sets.

Lemma 2.6. Let B ⊂ S and assume the Lyapunov function is constant and positive along B and 0 outside B. Suppose that for all i ∈ B we have

E_i[L(X_1)] = E_i[L(X_0)].

Then B is a closed set.

Proof. Let b ∈ B. Then

E_b[L(X_1)] = Σ_{i∈S} L(i) · P(X_1 = i | X_0 = b)
= Σ_{i∈B} L(i) · P(X_1 = i | X_0 = b)
= L(b) · Σ_{i∈B} P(X_1 = i | X_0 = b)
= E_b[L(X_0)] = L(b).

That is, L(b) · Σ_{i∈B} P(X_1 = i | X_0 = b) = L(b), and since L(b) > 0 we conclude Σ_{i∈B} P(X_1 = i | X_0 = b) = 1. Therefore, for all b ∈ B we have Σ_{i∈B} P(X_1 = i | X_0 = b) = 1, so B is a closed set.

Following the algorithm described in the previous section, we state that the following program solves the desired problem:
$$\begin{aligned}
\max_{f_{X,U}\in\mathcal{P}(X\times U)} \quad & H(f_{X,U})\\
\text{subject to} \quad & \sum_{u^+\in U} f_{X,U}(x^+,u^+) = \sum_{x\in X,\,u\in U} P^u_{xx^+}\,f_{X,U}(x,u) \quad \text{for all } x^+\in X\\
& \sum_{u\in U} f_{X,U}(x,u)\,L(x) \ge \sum_{x^+\in X,\,u\in U} L(x^+)\,f_{X,U}(x,u)\,P^u_{xx^+} \quad \text{for all } x\in X
\end{aligned}$$
Recall that the entropy is a concave function on $[0,1]$, and since the constraints are linear, the above program is convex. The first constraint defines an invariant distribution, while the second constraint is just the condition presented in Lemma 2.5. The theorem is as follows,
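Before solving the program, candidate joint distributions can be checked directly against its two constraint families. The following sketch (our own helper, not from the text) encodes both constraints for a small hypothetical instance: the chain of Example 2.4 with partition $P_1=\{1\}$, $P_2=\{2,3\}$, so $L=(1,2,2)$; the candidate $f$ places all mass on the pair (state 3, action 2), which makes state 3 absorbing.

```python
import numpy as np

# Controlled chain of Example 2.4; partition choice here is hypothetical.
P = [np.array([[0.4, 0.6, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
     np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])]
L = np.array([1.0, 2.0, 2.0])

def is_feasible(f, P, L, tol=1e-9):
    """Check f[x, u] against the two constraint families of the program."""
    f = np.asarray(f)
    fX = f.sum(axis=1)                 # marginal over actions
    # First constraint: fX is invariant under the induced dynamics.
    flow = sum(P[u].T @ f[:, u] for u in range(f.shape[1]))
    invariant = np.allclose(flow, fX, atol=tol)
    # Second constraint: expected L does not increase, state by state.
    lhs = fX * L
    rhs = sum((P[u] @ L) * f[:, u] for u in range(f.shape[1]))
    drift_ok = np.all(rhs <= lhs + tol)
    return invariant and drift_ok and abs(f.sum() - 1.0) < tol

f = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
print(is_feasible(f, P, L))   # the absorbing candidate is feasible
```

Any point passing this check is feasible for the convex program; the entropy objective then selects the feasible point spreading mass as evenly as possible.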

Theorem 2.7. Assume the previous problem is feasible and that $f^*_{X,U}$ is an optimal solution. Adopt the notation $f^*_X(x) = \sum_{u\in U} f^*_{X,U}(x,u)$. Then,
$$X_P = S_{f^*_X} \qquad X_P = X^{\pi}_{R,P},$$
where $\pi$ is given by,
$$\pi(u\mid x) = \begin{cases} \dfrac{f^*_{X,U}(x,u)}{f^*_X(x)} & \text{if } x\in S_{f^*_X}\\[4pt] \dfrac{1}{|U|} & \text{otherwise} \end{cases}$$

Proof. Let us begin by showing that $X_P\subset S_{f^*_X}$. Let $b\in X_P$, so there exists a policy $\pi$ such that $b$ is recurrent and $C^\pi_b\subset P_i$ for some $i$. Recall that because of Theorem 1.10 there must be an invariant distribution $f_X$ such that $f_X(b)>0$ and $f_X(C^\pi_b)=1$. The idea will be to show $S_{f_X}\subset S_{f^*_X}$. Now, let $f_{X,U}\in\mathcal{P}(X\times U)$ be defined as $f_{X,U}(x,u)=\pi(u\mid x)\,f_X(x)$. Let us see that $f_{X,U}$ is a feasible solution of the program. Because of the invariance of $f_X$ we have,
$$f_X(x^+) = \sum_{x\in X} f_X(x)\,P^\pi_{x,x^+} = \sum_{x\in X} f_X(x)\Big[\sum_{u\in U}\pi(u\mid x)\,P^u_{x,x^+}\Big] = \sum_{x\in X}\sum_{u\in U} f_{X,U}(x,u)\,P^u_{x,x^+},$$
which corresponds to the first restriction of our program. For the second restriction, let $x\in X$; there are two cases: $x\in C^\pi_b$ or $x\notin C^\pi_b$.

• Assume $x\in C^\pi_b$. Since $C^\pi_b\subset P_i$, the value of the function $L$ along $C^\pi_b$ is constant and equal to $L(b)$. Therefore,
$$\sum_{x^+\in X,\,u\in U} L(x^+)\,f_{X,U}(x,u)\,P^u_{xx^+} = \sum_{x^+\in X,\,u\in U} L(x^+)\,\pi(u\mid x)\,f_X(x)\,P^u_{xx^+} = \sum_{x^+\in X} L(x^+)\,f_X(x)\,P^\pi_{xx^+}.$$
Since $C^\pi_b$ is closed under $\pi$ we have,
$$\sum_{x^+\in X} L(x^+)\,f_X(x)\,P^\pi_{xx^+} = \sum_{x^+\in C^\pi_b} L(x^+)\,f_X(x)\,P^\pi_{xx^+} = L(b)\,f_X(x)\sum_{x^+\in C^\pi_b} P^\pi_{xx^+} = L(b)\,f_X(x) = \sum_{u\in U} f_{X,U}(x,u)\,L(x).$$
• Assume $x\notin C^\pi_b$. Recall that by construction $f_X(C^\pi_b)=1$, so we have $f_X(x)=0$; then $f_{X,U}(x,u)=\pi(u\mid x)\,f_X(x)=0$ for all $u\in U$, so the restriction is satisfied.

Therefore, $f_{X,U}$ is a feasible solution of the program. Finally, since the feasible solutions are a convex subset of $\mathcal{P}(X\times U)$, Lemma 2.2 implies $S_{f_{X,U}}\subset S_{f^*_{X,U}}$, so that $S_{f_X}\subset S_{f^*_X}$, and since $f_X(b)>0$ we obtain $b\in S_{f^*_X}$. Hence $X_P\subset S_{f^*_X}$.

Let us show that $S_{f^*_X}\subset X_P$. The idea will be to show that the states in $S_{f^*_X}$ are recurrent under some policy and then use Lemma 2.5. Consider the policy,
$$\pi(u\mid x) = \begin{cases} \dfrac{f^*_{X,U}(x,u)}{f^*_X(x)} & \text{if } x\in S_{f^*_X}\\[4pt] \dfrac{1}{|U|} & \text{otherwise} \end{cases}$$
Let us see that under such a policy $f^*_X$ is an invariant distribution. First of all, note that if $x\notin S_{f^*_X}$ then $f^*_{X,U}(x,u)=0$ for all $u\in U$. Now, let $x^+\in X$; because of the first restriction,
$$f^*_X(x^+) = \sum_{x\in X,\,u\in U} P^u_{xx^+}\,f^*_{X,U}(x,u) = \sum_{x\in S_{f^*_X},\,u\in U} P^u_{xx^+}\,f^*_X(x)\,\pi(u\mid x) = \sum_{x\in S_{f^*_X}} f^*_X(x)\,P^\pi_{xx^+} = \sum_{x\in X} f^*_X(x)\,P^\pi_{xx^+}.$$

So $f^*_X$ is an invariant distribution under $\pi$. Therefore, Theorem 1.10 tells us that $S_{f^*_X}$ consists of recurrent states. Now, let $x\in S_{f^*_X}$ and consider the closed class of $x$ under $\pi$, $C^\pi_x$. In order to apply Lemma 2.5 we need to show that for all $b\in C^\pi_x$,
$$E^\pi_b[L(X_1)] \le E^\pi_b[L(X_0)].$$
Let $b\in C^\pi_x$. Because of the second restriction,
$$\sum_{u\in U} f^*_{X,U}(b,u)\,L(b) \ge \sum_{x^+\in X,\,u\in U} L(x^+)\,f^*_{X,U}(b,u)\,P^u_{bx^+}.$$
Since $f^*_{X,U}(b,u) = f^*_X(b)\,\pi(u\mid b)$, the above expression can be written as,
$$\sum_{u\in U} f^*_X(b)\,\pi(u\mid b)\,L(b) \ge \sum_{x^+\in X,\,u\in U} L(x^+)\,f^*_X(b)\,\pi(u\mid b)\,P^u_{bx^+}.$$
Dividing by $f^*_X(b)$ we obtain $E^\pi_b[L(X_1)]\le E^\pi_b[L(X_0)]$. Finally, using Lemma 2.5 we obtain that the value of $L$ along $C^\pi_x$ is constant, so that $C^\pi_x\subset P_i$ for some $i$, implying $x\in X_P$.

To end this chapter we show an example of the proposed algorithm. The implementation can be done easily using Matlab and a package for convex programming such as cvx.

Example 2.8. Let $X=\{1,2,3,4,5,6,7\}$ and $U=\{1,2\}$, where the transition matrices are given by,
$$P_1 = \begin{pmatrix}
0.4 & 0.3 & 0 & 0 & 0.3 & 0 & 0\\
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0.2 & 0.8 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0.2 & 0.8 & 0 & 0 & 0
\end{pmatrix} \qquad
P_2 = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0.5 & 0.5 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0.8 & 0.2 & 0
\end{pmatrix}$$
Let us consider the partition $P_1=\{1\}$, $P_2=\{2,3,4\}$, $P_3=\{5,6,7\}$, and define $L=[1,2,2,2,3,3,3]$. Running the algorithm, we obtain the joint density distribution (rows indexed by states, columns by actions),
$$f_{X,U} = \begin{pmatrix}
0 & 0.1015\\
0 & 0.0915\\
0.1411 & 0.1015\\
0.0672 & 0.0915\\
0.1015 & 0.1015\\
0.1015 & 0.1015\\
0 & 0
\end{pmatrix}$$

[Figure: graph representations of $P_1$, $P_2$, $P^\pi$ and $P^{\pi^+}$ on the state space $\{1,\ldots,7\}$.]

Therefore, using the above theorem we obtain that all states, except $7$, are recurrent under the policy $\pi$, with corresponding transition matrix $P^\pi$,
$$\pi = \begin{pmatrix}
0 & 1\\
0 & 1\\
0.5818 & 0.4182\\
0.4232 & 0.5768\\
0.5 & 0.5\\
0.5 & 0.5\\
0.5408 & 0.4592
\end{pmatrix} \qquad
P^\pi = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0.5 & 0.5 & 0 & 0 & 0\\
0 & 0 & 0.5346 & 0.4654 & 0 & 0 & 0\\
0 & 0.5768 & 0.4232 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0.5 & 0.5 & 0\\
0 & 0 & 0 & 0 & 0.5 & 0.5 & 0\\
0 & 0 & 0.1082 & 0.4327 & 0.3673 & 0.0918 & 0
\end{pmatrix}$$

Using the above figure we can see that under $P^\pi$ the classes are $\{1\}$, $\{2,3,4\}$, $\{5,6\}$, $\{7\}$. The recurrent classes are $\{1\}\subset P_1$, $\{2,3,4\}\subset P_2$, $\{5,6\}\subset P_3$, so each recurrent class is fully contained in a piece of the partition. Note also that $\{2,3,4\}$ and $\{5,6\}$ were not recurrent classes under $P_1$ or $P_2$. As another example consider the partition $P_1=\{1,2,3\}$, $P_2=\{4,5,6,7\}$. Applying the algorithm we obtain the joint density distribution $f_{X,U}$ and policy $\pi^+$,
$$f_{X,U} = \begin{pmatrix}
0 & 0.1667\\
0 & 0\\
0 & 0.1667\\
0 & 0\\
0.1667 & 0.1667\\
0.1667 & 0.1667\\
0 & 0
\end{pmatrix} \qquad
\pi^+ = \begin{pmatrix}
0 & 1\\
0.5174 & 0.4826\\
0 & 1\\
0.3510 & 0.6490\\
0.5 & 0.5\\
0.5 & 0.5\\
0.6325 & 0.3675
\end{pmatrix}$$
Therefore, states $1,3,5,6$ are recurrent under $\pi^+$, with transition matrix $P^{\pi^+}$,
$$P^{\pi^+} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0\\
0.5174 & 0 & 0.2413 & 0.2413 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0 & 0\\
0 & 0.6490 & 0.3510 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0.5 & 0.5 & 0\\
0 & 0 & 0 & 0 & 0.5 & 0.5 & 0\\
0 & 0 & 0.1265 & 0.5060 & 0.2940 & 0.0735 & 0
\end{pmatrix}$$
So the recurrent classes are $\{1\},\{3\}\subset P_1$ and $\{5,6\}\subset P_2$. Note that the transient class $\{2,4\}$ is not contained in a single piece of the partition.


Zubov’s Method

In this chapter we study Zubov's method in the context of Markov chains. Zubov's method appears in the context of deterministic dynamical systems when describing the domain of attraction of a stable point; see [7, 11]. Specifically, it allows us to characterize the domain of attraction and the escape set in terms of an appropriate function $v$. Such a function is also characterized as the solution of a differential equation. In [5] Zubov's method is generalized to deterministic controlled systems, and in [6] it is generalized to stochastic differential equations. For the present work we took as a guide the constructions made in this last paper.

We will proceed as follows. The first section deals with the uncontrolled case; our objective is to describe both the domain of attraction and the escape set of a certain set of states in terms of a value function $v$. In the second section we deal with the controlled case, and again the objective is to describe those sets in terms of a value function and to compute the Bellman equation. Finally, in the third section we consider the opposite of the domain of attraction for controlled systems.

3.1 Uncontrolled case

Consider a countable set $X$, and let $\{X_t\}_{t=0}^\infty$ be an $X$-valued Markov chain defined over a probability space $(\Omega,\mathcal{F},P)$. Fix a set $A\subset X$. In order to ensure some kind of stability, assume that $A$ is a closed set. The main objects of study are the following,

Definition 3.1. The domain of attraction and escape set of $A$ are defined as,
$$\Lambda = \Big\{x\in X \,\Big|\, \liminf_{t\to\infty} P_x(X_t\in A) > 0\Big\} \qquad \Gamma = \Big\{x\in X \,\Big|\, \liminf_{t\to\infty} P_x(X_t\in A) = 0\Big\}$$
In order to understand those sets we are going to describe them in terms of hitting times. The following lemmas will help us achieve this objective.

Lemma 3.2. For all $x\in X$, the probabilities $P_x(\tau_A\le t)$ and $P_x(X_t\in A)$ are equal.

Proof. Define $B:=\{\omega\in\Omega \mid \forall m>\tau_A(\omega),\ X_m(\omega)\in A\}$. So that,
$$\{\tau_A\le t\} = (\{\tau_A\le t\}\cap B)\cup(\{\tau_A\le t\}\cap B^c)$$
$$\{X_t\in A\} = (\{X_t\in A\}\cap B)\cup(\{X_t\in A\}\cap B^c)$$
Let us see that the events $\{X_t\in A\}\cap B$ and $\{\tau_A\le t\}\cap B$ are equal. It is clear that $\{X_t\in A\}\cap B\subset\{\tau_A\le t\}\cap B$. Now, let $\omega\in\{\tau_A\le t\}\cap B$, so $\tau_A(\omega)\le t$. Since $\omega\in B$, for all $m\ge\tau_A(\omega)$ we have $X_m(\omega)\in A$; in particular $X_t(\omega)\in A$, so that $\{\tau_A\le t\}\cap B\subset\{X_t\in A\}\cap B$. Therefore,
$$P_x(\{\tau_A\le t\}\cap B) = P_x(\{X_t\in A\}\cap B)$$
Finally, let us consider the event $E=\{\exists\, j,m\in\mathbb{N} \text{ such that } j\le m,\ X_j\in A,\ X_m\notin A\}$. Note that $\{\tau_A\le t\}\cap B^c\subset E$ and $\{X_t\in A\}\cap B^c\subset E$. Since $A$ is closed, Lemma 1.13 implies $P_x(E)=0$, so that,
$$0 = P_x(\{\tau_A\le t\}\cap B^c) = P_x(\{X_t\in A\}\cap B^c)$$
Combining both results we obtain the desired equality.

Lemma 3.3. Consider the hitting time of $A$, $\tau_A$. Then,
$$\lim_{t\to\infty} P_x(X_t\in A) = P_x(\tau_A<\infty)$$
Proof. By Lemma 1.12 we know $\liminf_t\{\tau_A\le t\} = \{\tau_A<\infty\} = \limsup_t\{\tau_A\le t\}$, so by Fatou's lemma (Lemma 1.11),
$$P_x(\tau_A<\infty) = P_x\Big(\liminf_{t\to\infty}\{\tau_A\le t\}\Big) \le \liminf_{t\to\infty} P_x(\tau_A\le t) = \liminf_{t\to\infty} P_x(X_t\in A)$$
On the other hand, Fatou's lemma (Lemma 1.11) also implies,
$$P_x(\tau_A<\infty) = P_x\Big(\limsup_{t\to\infty}\{\tau_A\le t\}\Big) \ge \limsup_{t\to\infty} P_x(\tau_A\le t) = \limsup_{t\to\infty} P_x(X_t\in A)$$
So that,
$$\limsup_{t\to\infty} P_x(X_t\in A) \le P_x(\tau_A<\infty) \le \liminf_{t\to\infty} P_x(X_t\in A),$$
and the limit exists and equals $P_x(\tau_A<\infty)$.

Therefore, the domain of attraction and escape set can be understood as follows,

Corollary 3.4.
$$\Lambda = \{x\in X \mid P_x(\tau_A<\infty)>0\} \qquad \Gamma = \{x\in X \mid P_x(\tau_A<\infty)=0\}$$
We have another characterization of those sets, now in terms of the expectation of $\tau_A$.

Lemma 3.5.
$$\Lambda = \{x\in X \mid E_x[e^{-\tau_A}]>0\} \qquad \Gamma = \{x\in X \mid E_x[e^{-\tau_A}]=0\}$$

Proof. We have that
$$E_x[e^{-\tau_A}] = \sum_j e^{-j}\,P_x(\tau_A=j)$$
Therefore, $E_x[e^{-\tau_A}]=0$ if and only if $P_x(\tau_A=j)=0$ for all $j$, which happens if and only if $P_x(\tau_A<\infty)=0$. So using the previous corollary we can conclude the proof.
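On a finite chain, the hitting probabilities $h(x)=P_x(\tau_A<\infty)$ appearing in Corollary 3.4 can be computed by first-step analysis: $h\equiv 1$ on $A$ and $h(x)=\sum_j P_{xj}\,h(j)$ elsewhere, with the minimal solution obtained by iterating from $h\equiv 0$. A sketch on a small illustrative chain (not one from the text):

```python
import numpy as np

# States {0, 1, 2}, A = {0}, state 2 absorbing; from state 1 the chain
# moves to 0 or 2 with probability 1/2 each.
P = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])
in_A = np.array([True, False, False])

# Iterate h <- 1 on A, P h off A; converges to the minimal solution.
h = np.zeros(3)
for _ in range(1000):
    h = np.where(in_A, 1.0, P @ h)

print(h)   # approximately [1.0, 0.5, 0.0]
```

By Corollary 3.4, states with $h(x)>0$ form $\Lambda$ and states with $h(x)=0$ form $\Gamma$; here $\Lambda=\{0,1\}$ and $\Gamma=\{2\}$.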

The idea now is to characterize both sets in terms of an appropriate function.

Definition 3.6. Let $v_\gamma:X\to\mathbb{R}$ be defined as,
$$v_\gamma(x) := E_x\Big[\sum_{t=0}^\infty \mathbb{1}_{A^c}(X_t)\cdot\gamma^t\Big]$$
for some $\gamma\in(0,1)$.

Note that $\sum_{t=0}^\infty \mathbb{1}_{A^c}(X_t)\cdot\gamma^t \le L$, where $L=\sum_{t=0}^\infty\gamma^t$. Therefore $0\le v_\gamma(x)\le L$, so the function is well defined. Before we can state and prove the characterization we need the following lemma; denote $Z=\sum_{t=0}^\infty \mathbb{1}_{A^c}(X_t)\cdot\gamma^t$.

Lemma 3.7. For all $x\in X$,
$$P_x(\tau_A<\infty) = P_x(Z<L)$$

Proof. Define $B:=\{\omega\in\Omega \mid \forall m>\tau_A(\omega),\ X_m(\omega)\in A\}$. So that,
$$\{\tau_A<\infty\} = (\{\tau_A<\infty\}\cap B)\cup(\{\tau_A<\infty\}\cap B^c)$$
$$\{Z<L\} = (\{Z<L\}\cap B)\cup(\{Z<L\}\cap B^c)$$
Let us begin by proving $\{Z<L\}\cap B = \{\tau_A<\infty\}\cap B$. Let $\omega\in\{Z<L\}\cap B$, so $\sum_t \mathbb{1}_{A^c}(X_t(\omega))\cdot\gamma^t < L$. Then there exists $t\in\mathbb{N}$ such that $\mathbb{1}_{A^c}(X_t(\omega))=0$, which implies $X_t(\omega)\in A$, so $\tau_A(\omega)<\infty$. Now, let $\omega\in\{\tau_A<\infty\}\cap B$. Then $\tau_A(\omega)<\infty$, and since $\omega$ belongs to $B$, for all $m\ge\tau_A(\omega)$ we have $X_m(\omega)\in A$, so $\mathbb{1}_{A^c}(X_m(\omega))=0$, implying $Z(\omega)<L$. Therefore,
$$P_x(\{\tau_A<\infty\}\cap B) = P_x(\{Z<L\}\cap B)$$
Now, let us consider the event $E=\{\exists\, j,m\in\mathbb{N}\text{ such that } j\le m,\ X_j\in A,\ X_m\notin A\}$. We have that $\{\tau_A<\infty\}\cap B^c\subset E$ and $\{Z<L\}\cap B^c\subset E$. Since $A$ is closed, Lemma 1.13 implies $P_x(E)=0$, so that,
$$0 = P_x(\{\tau_A<\infty\}\cap B^c) = P_x(\{Z<L\}\cap B^c)$$
Both results imply,
$$P_x(\tau_A<\infty) = P_x(Z<L)$$

The characterization of the domain of attraction and escape set in terms of $v_\gamma$ is the following,

Theorem 3.8.
$$\Lambda = \{x\in X \mid v_\gamma(x)<L\} \qquad \Gamma = \{x\in X \mid v_\gamma(x)=L\}$$
Proof. Let us consider the partial sums $S_n=\sum_{i=0}^n\gamma^i$. Note that the event on which $Z$ takes a value other than $0$, some $S_n$, or $L$ is contained in the event
$$E = \{\exists\, j,m\in\mathbb{N}\text{ such that } j\le m,\ X_j\in A,\ X_m\notin A\}.$$
Since $A$ is closed, Lemma 1.13 implies that with $P_x$-probability one $Z\in\{0\}\cup\{S_n\mid n\in\mathbb{N}\}\cup\{L\}$. As a consequence,
$$P_x(Z<L) = P_x(Z=0) + \sum_n P_x(Z=S_n)$$
Let $x\in\Lambda$, then $0<P_x(\tau_A<\infty)$. Using the previous lemma we conclude $0<P_x(Z<L)$, which implies $P_x(Z=0)>0$ or $P_x(Z=S_n)>0$ for some $n$, so that,
$$v_\gamma(x) = \sum_{n=0}^\infty S_n\,P_x(Z=S_n) + L\cdot P_x(Z=L) \le P_x(Z=0) + \sum_{n=0}^\infty S_n\,P_x(Z=S_n) + L\cdot P_x(Z=L)$$
$$< L\Big[P_x(Z=0) + \sum_{n=0}^\infty P_x(Z=S_n)\Big] + L\cdot P_x(Z=L) \quad (\text{since } L>1,\ S_n<L, \text{ and one of these probabilities is positive})$$
$$= L\cdot P_x(Z<L) + L\cdot P_x(Z=L) = L.$$
On the other hand, if $x\in\Gamma$ then $P_x(\tau_A<\infty)=0$, so $P_x(Z<L)=0$ and therefore $P_x(Z=L)=1$. As a consequence,
$$v_\gamma(x) = \sum_{n=0}^\infty S_n\,P_x(Z=S_n) + L\cdot P_x(Z=L) = L.$$
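On a finite state space, $v_\gamma$ can be computed with a single linear solve, using the first-step identity $v_\gamma = \mathbb{1}_{A^c} + \gamma P v_\gamma$ (a standard conditioning argument, not spelled out in the text). A sketch on the same small illustrative chain as before (not one from the text), with $A=\{0\}$ closed:

```python
import numpy as np

gamma = 0.5
L = 1.0 / (1.0 - gamma)            # L = sum_t gamma^t = 2
P = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0]])
cost = np.array([0.0, 1.0, 1.0])   # indicator of A^c

# Solve (I - gamma P) v = 1_{A^c}; invertible since gamma < 1.
v = np.linalg.solve(np.eye(3) - gamma * P, cost)

# Classification of Theorem 3.8: Lambda = {v < L}, Gamma = {v = L}.
domain = [x for x in range(3) if v[x] < L - 1e-9]
escape = [x for x in range(3) if abs(v[x] - L) < 1e-9]
print(v, domain, escape)   # v approx [0.0, 1.5, 2.0]; Lambda = {0, 1}, Gamma = {2}
```

This agrees with the hitting-probability computation above: the state that never reaches $A$ attains the maximal value $L$.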

3.2 Controlled case

Let $X$ be a countable state space, and $U$ a countable set of control actions. Let $\{(X_t,U_t)\}_{t=0}^\infty$ be an $X\times U$-valued controlled Markov chain, and assume all control actions are feasible for each state; that is, $U(x)=U$ for all $x\in X$. Fix a set $A\subset X$; in order to guarantee some stability, assume $A$ is a closed set under some policy $\pi\in\Pi_S$.

Definition 3.9. The domain of attraction and escape set of $A$ are defined as,
$$\Lambda = \Big\{x\in X \,\Big|\, \liminf_{t\to\infty} P^\pi_x(X_t\in A)>0 \text{ for some policy } \pi\in\Pi_S\Big\}$$
$$\Gamma = \Big\{x\in X \,\Big|\, \liminf_{t\to\infty} P^\pi_x(X_t\in A)=0 \text{ for all policies } \pi\in\Pi_S\Big\}$$

In order to use the results of the previous section, we would like to know whether it is enough to consider only policies that make $A$ a closed set.

Definition 3.10. Let $\Pi_A\subset\Pi_S$ be the set of control policies that make $A$ a closed set. Recall that by assumption such a set is non-empty.

The following lemma is fundamental to answering the above question,

Lemma 3.11. Let $\pi$ be a policy, and let $\pi_A\in\Pi_A$. Define the policy $\pi_{op}\in\Pi_A$ as,
$$\pi_{op}(u\mid x) = \begin{cases} \pi_A(u\mid x) & \text{for } x\in A\\ \pi(u\mid x) & \text{for } x\notin A \end{cases}$$
Then, for all $x\in X$,
$$\liminf_{t\to\infty} P^\pi_x(X_t\in A) \le \lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A) \qquad \limsup_{t\to\infty} P^\pi_x(X_t\in A) \le \lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A)$$
Remark. Since $A$ is closed under the policy $\pi_{op}$, Lemma 3.3 implies that $\lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A)$ exists and is equal to $P^{\pi_{op}}_x(\tau_A<\infty)$.

Proof. Let $x\in A$. Since $A$ is closed under the policy $\pi_{op}$, we have $P^{\pi_{op}}_x(X_t\in A)=1$; as a consequence,
$$\liminf_{t\to\infty} P^\pi_x(X_t\in A) \le \lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A) \qquad \limsup_{t\to\infty} P^\pi_x(X_t\in A) \le \lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A)$$
Suppose now $x\notin A$. Let us show by induction on $t$ that,
$$P^\pi_x(\tau_A\le t) = P^{\pi_{op}}_x(\tau_A\le t)$$

For $t=1$ we have (using that $x\notin A$, so $P^\pi_x(X_0\in A)=0$),
$$P^\pi_x(\tau_A\le 1) = P^\pi_x(X_0\in A) + P^\pi_x(X_1\in A) = \sum_{j\in A} P^\pi_{xj} = \sum_{j\in A,\,u\in U}\pi(u\mid x)\,P^u_{xj} = \sum_{j\in A,\,u\in U}\pi_{op}(u\mid x)\,P^u_{xj} = P^{\pi_{op}}_x(\tau_A\le 1)$$
Assume the property is true for $t$. Let us prove it for $t+1$,
$$P^\pi_x(\tau_A\le t+1) = P^\pi_x(\tau_A\le t) + P^\pi_x(X_1\notin A,\ldots,X_t\notin A,\ X_{t+1}\in A)$$
$$= P^{\pi_{op}}_x(\tau_A\le t) + \sum_{j_1\notin A,\ldots,j_t\notin A,\ j_{t+1}\in A} P^\pi_x(X_1=j_1,\ldots,X_t=j_t,\ X_{t+1}=j_{t+1})$$
$$= P^{\pi_{op}}_x(\tau_A\le t) + \sum_{j_1\notin A,\ldots,j_t\notin A,\ j_{t+1}\in A} P^\pi_{x,j_1}\cdots P^\pi_{j_t,j_{t+1}}$$
Finally, since $x,j_1,\ldots,j_t\notin A$, we have $P^\pi_{x,j_1}\cdots P^\pi_{j_t,j_{t+1}} = P^{\pi_{op}}_{x,j_1}\cdots P^{\pi_{op}}_{j_t,j_{t+1}}$, because $\pi$ and $\pi_{op}$ coincide outside $A$.

Now, note that $\{X_t\in A\}\subset\{\tau_A\le t\}$, which implies,
$$P^\pi_x(X_t\in A) \le P^\pi_x(\tau_A\le t) = P^{\pi_{op}}_x(\tau_A\le t)$$
Therefore,
$$\liminf_{t\to\infty} P^\pi_x(X_t\in A) \le \liminf_{t\to\infty} P^{\pi_{op}}_x(\tau_A\le t) \qquad \limsup_{t\to\infty} P^\pi_x(X_t\in A) \le \limsup_{t\to\infty} P^{\pi_{op}}_x(\tau_A\le t)$$
Finally, since $A$ is closed under $\pi_{op}$, Lemma 3.2 implies,
$$\liminf_{t\to\infty} P^\pi_x(X_t\in A) \le \liminf_{t\to\infty} P^{\pi_{op}}_x(\tau_A\le t) = \lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A)$$
$$\limsup_{t\to\infty} P^\pi_x(X_t\in A) \le \limsup_{t\to\infty} P^{\pi_{op}}_x(\tau_A\le t) = \lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A)$$

Corollary 3.12.
$$\Lambda = \{x\in X \mid P^\pi_x(\tau_A<\infty)>0 \text{ for some policy } \pi\in\Pi_A\}$$
$$\Gamma = \{x\in X \mid P^\pi_x(\tau_A<\infty)=0 \text{ for all policies } \pi\in\Pi_A\}$$
Proof. Clearly,
$$\{x\in X \mid P^\pi_x(\tau_A<\infty)>0 \text{ for some policy } \pi\in\Pi_A\} \subset \Lambda$$
Let $x\in\Lambda$, so there exists a policy $\pi\in\Pi_S$ such that,
$$\liminf_{t\to\infty} P^\pi_x(X_t\in A)>0$$
Now, let $\pi_{op}\in\Pi_A$ be the policy defined in the previous lemma. Therefore,
$$0 < \liminf_{t\to\infty} P^\pi_x(X_t\in A) \le \lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A) = P^{\pi_{op}}_x(\tau_A<\infty)$$

So we can focus on policies that make $A$ a closed set. The idea now is to describe a function that characterizes those sets.

Definition 3.13. Let $v_\gamma:X\to\mathbb{R}$ be defined as,
$$v_\gamma(x) := \inf_{\pi\in\Pi_A} v^\pi_\gamma(x)$$
where,
$$v^\pi_\gamma(x) = E^\pi_x\Big[\sum_{t=0}^\infty \mathbb{1}_{A^c}(X_t)\cdot\gamma^t\Big]$$
for some $\gamma\in(0,1)$ and policy $\pi\in\Pi_A$.

The characterization is as follows,

Theorem 3.14.
$$\Lambda = \{x\in X \mid v_\gamma(x)<L\} \qquad \Gamma = \{x\in X \mid v_\gamma(x)=L\}$$
Proof. Let $x\in\Lambda$, so there exists a policy $\pi\in\Pi_A$ such that $P^\pi_x(\tau_A<\infty)>0$. Therefore, since $A$ is closed under $\pi$, Theorem 3.8 implies $v^\pi_\gamma(x)<L$; as a consequence $v_\gamma(x)<L$.
Let $x\in\Gamma$, so for any policy $\pi\in\Pi_A$ we have $P^\pi_x(\tau_A<\infty)=0$. As a consequence, Theorem 3.8 implies that for any policy $\pi\in\Pi_A$ the value function satisfies $v^\pi_\gamma(x)=L$, so that $v_\gamma(x)=L$.

Finally, we would like to show how to calculate the value function $v_\gamma$. There is an important issue about calculating such a function: the optimization must be carried out only over elements of $\Pi_A$, so it would seem that we have to compute that set first. The next theorem shows that this is not necessary.

Theorem 3.15. Define,
$$v^*_\gamma = \inf_{\pi\in\Pi_S} v^\pi_\gamma$$
Assume $U$ is finite, and let $\pi^*$ be an optimal policy (such a policy exists; Theorem 1.19). Then the set $A$ is closed under $\pi^*$, so $v_\gamma = v^*_\gamma$. Moreover, $v^{\pi^*}_\gamma(A)=0$.

Proof. Let $\pi\in\Pi_S$. Given $x\in X$, the dominated convergence theorem implies,
$$v^\pi_\gamma(x) = E^\pi_x\Big[\sum_{t=0}^\infty \mathbb{1}_{A^c}(X_t)\cdot\gamma^t\Big] = \lim_{N\to\infty} E^\pi_x\Big[\sum_{t=0}^N \mathbb{1}_{A^c}(X_t)\cdot\gamma^t\Big] = \lim_{N\to\infty}\sum_{t=0}^N E^\pi_x\big[\mathbb{1}_{A^c}(X_t)\cdot\gamma^t\big] = \sum_{t=0}^\infty \gamma^t\cdot P^\pi_x(X_t\notin A)$$
Let $\pi^*$ be an optimal policy and suppose $A$ is not closed under that policy. Let $\pi_A\in\Pi_A$. Since $A$ is closed under $\pi_A$, if $x\in A$,
$$P^{\pi_A}_x(X_t\notin A) = 0 \quad \text{for all } t\ge 0.$$
So, in view of the previous calculation, we obtain $v^{\pi_A}_\gamma(x)=0$ for all $x\in A$. However, since $A$ is not closed under $\pi^*$, there exist $i\in A$ and $j\notin A$ such that $P^{\pi^*}_i(X_1=j)>0$, so that $v^{\pi^*}_\gamma(i)>0$, contradicting the fact that $v^{\pi^*}_\gamma(i)=\inf_{\pi\in\Pi_S} v^\pi_\gamma(i)$.

To end this section we present the Bellman equation for Zubov's method. Because of the previous theorem,
$$v_\gamma(x) = \inf_{\pi\in\Pi_S} E^\pi_x\Big[\sum_{t=0}^\infty \mathbb{1}_{A^c}(X_t)\cdot\gamma^t\Big] = \frac{1}{1-\gamma} - \sup_{\pi\in\Pi_S} E^\pi_x\Big[\sum_{t=0}^\infty \mathbb{1}_A(X_t)\cdot\gamma^t\Big]$$
So, defining,
$$\widehat{v}_\gamma(x) = \sup_{\pi\in\Pi_S} E^\pi_x\Big[\sum_{t=0}^\infty \mathbb{1}_A(X_t)\cdot\gamma^t\Big],$$
we can recover $v_\gamma$ from $\widehat{v}_\gamma$.

Corollary 3.16. The Bellman equation for $\widehat{v}_\gamma$ is,
$$\widehat{v}_\gamma(x) = \sup_{u\in U}\Big\{\mathbb{1}_A(x) + \gamma\sum_{j\in X} P^u_{x,j}\,\widehat{v}_\gamma(j)\Big\}$$
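The Bellman equation can be solved by value iteration, since the right-hand side is a $\gamma$-contraction. A sketch on an illustrative instance (our own choice, not from the text): the two matrices of Example 2.4 with $A=\{3\}$, which is closed under action 2 at state 3, and $\gamma=0.5$.

```python
import numpy as np

# P[u, x, j]: transition matrices of Example 2.4, stacked over actions.
P = np.array([[[0.4, 0.6, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
              [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
gamma = 0.5
r = np.array([0.0, 0.0, 1.0])   # indicator of A = {3} (index 2)

# Iterate v <- max_u { 1_A + gamma * P^u v }; contraction with factor gamma.
v = np.zeros(3)
for _ in range(200):
    v = (r + gamma * (P @ v)).max(axis=0)

print(v)   # approximately [0.5, 1.0, 2.0]
```

The fixed point can be checked by hand: keeping state 3 absorbing gives $\widehat{v}_\gamma(3)=1/(1-\gamma)=2$, and the other two values follow from one application of the Bellman operator.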

Therefore, the linear programs are as follows. Let $\nu\in\mathbb{R}^{|X|}_{\ge 0}$.

Primal linear program:
$$\begin{aligned}
\text{minimize} \quad & \sum_{x\in X}\nu(x)\cdot\widehat{v}_\gamma(x)\\
\text{subject to} \quad & \widehat{v}(x) - \sum_{j\in X}\gamma\cdot P^u_{x,j}\cdot\widehat{v}(j) \ge \mathbb{1}_A(x) \quad \text{for } u\in U,\ x\in X
\end{aligned}$$
Dual linear program:
$$\begin{aligned}
\text{maximize} \quad & \sum_{s\in X}\sum_{u\in U} \mathbb{1}_A(s)\cdot x(s,u)\\
\text{subject to} \quad & \sum_{u\in U} x(j,u) - \sum_{s\in X}\sum_{u\in U}\gamma\cdot P^u_{s,j}\cdot x(s,u) = \nu(j), \quad x(j,u)\ge 0 \quad \text{for } u\in U,\ j\in X
\end{aligned}$$
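The primal program is small enough to assemble directly. The following sketch uses scipy's `linprog` rather than the MATLAB/CVX setup used in this work, on the same illustrative instance as before (Example 2.4 matrices, $A=\{3\}$, $\gamma=0.5$, uniform $\nu$); these choices are ours, not from the text.

```python
import numpy as np
from scipy.optimize import linprog

P = np.array([[[0.4, 0.6, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
              [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
gamma, n = 0.5, 3
r = np.array([0.0, 0.0, 1.0])   # indicator of A = {3}
nu = np.ones(n)

# Constraint v(x) - gamma * sum_j P^u[x, j] v(j) >= 1_A(x) rewritten in the
# linprog convention A_ub @ v <= b_ub: (gamma * P^u - I) v <= -1_A.
A_ub = np.vstack([gamma * P[u] - np.eye(n) for u in range(P.shape[0])])
b_ub = np.concatenate([-r, -r])

res = linprog(c=nu, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * n)
print(res.x)   # approximately [0.5, 1.0, 2.0]
```

With $\nu>0$ componentwise, the LP optimum coincides with the value-iteration fixed point computed above.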

Such programs can be implemented easily using MATLAB and CVX. To end this section we present an example.

Example 3.17. Let $X=\{1,2,3,4,5,6,7\}$ and $U=\{1,2\}$, where the transition matrices $P_1$ and $P_2$ are the same as in Example 2.8. Let $A=\{2,3,4\}$, and recall that in that example we showed that there exists a policy for which $A$ is closed, so the presented theory applies. Let $\gamma=0.5$, so $L=2$. Running the algorithms, we obtain that the dual program has an optimal solution (rows indexed by states, columns by actions),
$$x = \begin{pmatrix}
1.2500 & 0\\
0 & 1.8253\\
1.5851 & 1.4742\\
1.2148 & 1.2756\\
1.1120 & 1.1504\\
1.0379 & 1.0746\\
1 & 0
\end{pmatrix}$$

Figure 3.1: Graph representation of $P^\pi$.

Therefore, as stated in the preliminaries section, an optimal policy is,
$$\pi = \begin{pmatrix}
1 & 0\\
0 & 1\\
0.5181 & 0.4819\\
0.4878 & 0.5122\\
0.4915 & 0.5085\\
0.4913 & 0.5087\\
1 & 0
\end{pmatrix}$$
As a consequence, the functions $\widehat{v}$ and $v$ are,
$$\widehat{v} = \begin{pmatrix} 0.375\\ 2\\ 2\\ 2\\ 0\\ 0\\ 1 \end{pmatrix} \qquad v = \begin{pmatrix} 1.625\\ 0\\ 0\\ 0\\ 2\\ 2\\ 1 \end{pmatrix}$$
Therefore, the domain of attraction consists of the states $\{1,2,3,4,7\}$, while the escape set is $\{5,6\}$, a result that can be checked by inspection of the graphs of $P_1$ and $P_2$. Using the policy $\pi$, we obtain the

transition matrix,
$$P^\pi = \begin{pmatrix}
0.4 & 0.3 & 0 & 0 & 0.3 & 0 & 0\\
0 & 0 & 0.5 & 0.5 & 0 & 0 & 0\\
0 & 0 & 0.5855 & 0.4145 & 0 & 0 & 0\\
0 & 0.5122 & 0.4878 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0.4915 & 0.5085 & 0\\
0 & 0 & 0 & 0 & 0.4913 & 0.5087 & 0\\
0 & 0 & 0.2 & 0.8 & 0 & 0 & 0
\end{pmatrix}$$
Note that with such a policy, all elements of the domain of attraction have positive probability of reaching $A$, as is shown in the figure above.

3.3 Forbidden states

As we saw in the previous section, the objective of the domain of attraction is to make the set $A$ as reachable as possible. So an interesting question arises: what if we do not want to reach $A$? That is, how do we construct a control policy that makes $A$ as forbidden as possible? Let $X$ be a countable state space, and let $U$ be a finite set of control actions. Fix a set $A\subset X$, and assume that all control policies make $A$ a closed set. We are going to be interested in the following set.

Definition 3.18.
$$\Gamma_F = \Big\{x\in X \,\Big|\, \liminf_{t\to\infty} P^\pi_x(X_t\in A)=0 \text{ for some policy } \pi\in\Pi_S\Big\}$$
As in the previous sections, we have that

Lemma 3.19.
$$\Gamma_F = \{x\in X \mid P^\pi_x(\tau_A<\infty)=0 \text{ for some policy } \pi\in\Pi_S\}$$
We now introduce a value function characterizing $\Gamma_F$.

Definition 3.20. Let $v:X\to\mathbb{R}$ be defined as,
$$v(x) := \sup_{\pi\in\Pi_S} v^\pi(x)$$
where,
$$v^\pi(x) = E^\pi_x\Big[\sum_{t=0}^\infty \mathbb{1}_{A^c}(X_t)\cdot\gamma^t\Big]$$
for some $\gamma\in(0,1)$ and policy $\pi\in\Pi_S$.

The characterization is the following,

Theorem 3.21.
$$\Gamma_F = \{x\in X \mid v(x)=L\}$$

Proof. Let $x\in\Gamma_F$, so there exists a policy $\pi$ such that $P^\pi_x(\tau_A<\infty)=0$. Therefore, by Theorem 3.8 we have $v^\pi(x)=L$, which implies $v(x)=L$. On the other hand, assume $x\in\Gamma^c_F$. Then for all policies $\pi$, $P^\pi_x(\tau_A<\infty)>0$. In particular, for the optimal policy $\pi^*$ we would have $P^{\pi^*}_x(\tau_A<\infty)>0$, so by Theorem 3.8, $v(x)=v^{\pi^*}(x)<L$.

Reaching a set of states A

In this chapter we will study a particular optimization problem. Let $X$ denote the set of states and let $U$ denote the set of control actions. Let $\nu$ be an initial distribution over the state space. Our objective will be to find a policy in $\Pi_S$, that is, a stationary policy, that maximizes the probability that the chain will eventually reach a set $A\subset X$; that is,
$$\sup_{\pi\in\Pi_S}\Big\{\liminf_{t\to\infty} P^\pi_\nu(X_t\in A)\Big\}$$
Note that the problem is well defined, since for all control policies the quantity
$$\liminf_{t\to\infty} P^\pi_\nu(X_t\in A)$$
is a nonnegative real number less than or equal to $1$. In order to guarantee some kind of stability, let us assume that there exists a policy $\pi$ such that $A$ is a closed set. We will proceed as follows. In the first section, we will show that it is enough to consider controls that make $A$ a closed set; using this, we will be able to relate the problem to Zubov's method, an approach that will lead us to an average cost function. In the second section, we will use such an average cost function to conclude the existence of an optimal policy and show how to calculate it. As an interesting corollary we will also obtain a way to calculate absorption probabilities via dynamic programming, and a way to describe the domain of attraction via average cost functions.

4.1 Relation with Zubov's method

Recall that in Zubov's method we used a discounted value function to describe the domain of attraction. Therefore, a first question is whether the calculation of such a domain helps us solve the proposed problem.

We have a first result, which tells us that it is enough to consider control policies that make $A$ a closed set, and which gives a different description of our problem.

Corollary 4.1.
$$\sup_{\pi\in\Pi_S}\Big\{\liminf_{n\to\infty} P^\pi_\nu(X_n\in A)\Big\} = \sup_{\pi\in\Pi_A}\Big\{\lim_{n\to\infty} P^\pi_\nu(X_n\in A)\Big\}$$
Moreover, the limit on the right is equal to $P^\pi_\nu(\tau_A<\infty)$.

Proof. Let $\pi$ be an arbitrary stationary policy. We have that,
$$\liminf_{t\to\infty} P^\pi_\nu(X_t\in A) \le \limsup_{t\to\infty} P^\pi_\nu(X_t\in A) = \limsup_{t\to\infty}\sum_{x\in X} P^\pi_x(X_t\in A)\cdot\nu(x) \le \sum_{x\in X}\limsup_{t\to\infty} P^\pi_x(X_t\in A)\cdot\nu(x)$$
Let $\pi_{op}\in\Pi_A$ be defined as in Lemma 3.11. Therefore, using that lemma,
$$\sum_{x\in X}\limsup_{t\to\infty} P^\pi_x(X_t\in A)\cdot\nu(x) \le \sum_{x\in X}\lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A)\cdot\nu(x)$$
Now, using dominated convergence we obtain,
$$\sum_{x\in X}\lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A)\cdot\nu(x) = \lim_{t\to\infty}\sum_{x\in X} P^{\pi_{op}}_x(X_t\in A)\cdot\nu(x) = \lim_{t\to\infty} P^{\pi_{op}}_\nu(X_t\in A)$$
Therefore,
$$\liminf_{t\to\infty} P^\pi_\nu(X_t\in A) \le \lim_{t\to\infty} P^{\pi_{op}}_\nu(X_t\in A)$$
Finally, the last limit is equal to $P^{\pi_{op}}_\nu(\tau_A<\infty)$ since,
$$\sum_{x\in X}\lim_{t\to\infty} P^{\pi_{op}}_x(X_t\in A)\cdot\nu(x) = \sum_{x\in X} P^{\pi_{op}}_x(\tau_A<\infty)\cdot\nu(x) = P^{\pi_{op}}_\nu(\tau_A<\infty)$$

The next lemma provides a relation between the proposed problem and the discounted value function characterizing the domain of attraction in Zubov's method.

Lemma 4.2.
$$\sup_{\pi\in\Pi_S}\Big\{\liminf_{n\to\infty} P^\pi_\nu(X_n\in A)\Big\} \ge 1 - (1-\gamma)\sum_{x\in X} v^{\pi^*}_\gamma(x)\,\nu(x)$$
