Optimal policies for constrained average cost Markov decision processes

(1)

DOI 10.1007/s11750-009-0110-7 O R I G I N A L PA P E R

Optimal policies for constrained average-cost Markov

decision processes

Juan González-Hernández·César E. Villarreal

Received: 12 December 2008 / Accepted: 2 July 2009 © Sociedad de Estadística e Investigación Operativa 2009

Abstract We give mild conditions for the existence of optimal solutions for a Markov decision problem with average cost, undermconstraints of the same kind, in Borel actions and states spaces. Moreover, there is an optimal policy that is a convex combination of at mostm+1 deterministic policies.

Keywords Markov decision processes·Constraints·Stable measures Mathematics Subject Classification (2000) 90C40

1 Introduction

This paper is concerned with the expected average cost optimization of a Markov decision problem with constraints. We study the case in which the state and action spaces are Borel spaces. The cost function may be unbounded, and it is subjected to

mexpected average cost constraints. We give general conditions for the existence of solutions of the Markov decision problem and for the existence of an optimal stable policy, which is a convex combination ofm+1 stable deterministic polices.

It is already known that for Markov decision constraint problems, there ex-ist optimal randomized policies (Beutler and Ross1985; Borkar 1994; Frid1972;

J. González-Hernández

Departamento de Probabilidad y Estadística, Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Apartado postal 20-726, Admon. No. 20, delegación Álvaro Obregón, 01000 Mexico City, D.F., Mexico

e-mail:[email protected]

C.E. Villarreal (

)

(2)

González-Hernández and Hernández-Lerma2005; Haviv1996; Sennott1991,1993). The case of finite, denumerable, and compact state spaces has been widely dealed (Beutler and Ross1985; Sennott1993; Borkar1994; Kurano et al.2000a; Piunovskiy 1993, 1997; Piunovskiy and Khametov 1991; Tanaka 1991; Hu and Yue 2008). The discounted performance criteria has also already been dealed (Feimberg and Shwartz1996; González-Hernández and Hernández-Lerma2005; Hernández-Lerma and González-Hernández2000; and Sennott1991). We can see other related aspects (Collins and McNamara1998; Kurano et al.2000b; and Yushkevich1997).

To give a characterization of some optimal solutions of the control problem (The-orems1 and2bellow), we follow the procedure used by González-Hernández and Hernández-Lerma (2005, Theorem 2.6), which is a generalization of a theorem given by Winkler (1988, Theorem 2.1) and which in turn is an extension of a theorem given by Karr (1983, Theorem 2.1).

In the next section we give a brief background of the Markov decision processes, including the model with constraints and the construction of the process. In Sect.3 we raise the control problem we are interested in, provide the hypothesis to assure the existence of solutions, and set some lemmas in order to prove the main theorem (The-orem2), which shows the existence of optimal randomized policies in Borel spaces for average criterium. Finally, Sect.4shows three examples with optimal policy: the first is related to an inventory problem with optimal stable policy, the second one is an example with stable policies but without optimal stable policies, and the third is an example without stable policies.

2 Preliminaries

We shall use the following concepts and definitions throughout the article.

We suppose that a metric spaceSis always endowed with its Borel σ-algebra, which is denoted byB(S). A Polish space is a complete and separable metric space, and a Borel space is a subspace of a Polish space. The symbolM(S)stands for the linear space of finite signed measures onS, and P(S)is the subset of probability measures.

Given two Borel spacesSandS, a stochastic kernel onSgivenSis a real-valued function(x, B)→K(B|x)onS×B(S)such thatK(B|·)is a measurable function onSfor each fixedB∈B(S)andK(·|x)is a probability measure onB(S)for each

x∈S.

We shall give a brief review of the main concepts of Markov decision processes. A deeper study of these concepts can be seen in several books (Altman1999; Borkar 1994; Hernández-Lerma and Lasserre1996; Hu and Yue2008; and Piunovskiy1997).

Definition 1 Letm be a nonnegative integer. Anm-constrained Markov decision model (m-CMDM) is a sequence of 5+mcomponents

(3)

(a) Xis a nonempty Borel space, called the state space. (b) Aalso is a nonempty Borel space, called the action space.

(c) A:X−→B(A)is called the function of admissible actions. We denote by K the set of admissible pairs{(x, a)∈X×A:a∈A(x)}, which we assume to be

measurable and containing the graph of a measurable function fromXtoA. (d) Qis a stochastic kernel onX givenX×A. It represents the dynamics of the

system.

(e) c:K−→ [0,+∞)is a measurable function called the cost function.

(f) di :K−→ [0,+∞)for i=1, . . . , m are measurable functions that we use to define the constraints.

A measurable functionf :X−→Asuch thatf (x)∈A(x)is called a measurable

selector. SinceKcontains the graph of a measurable map (see Definition1(c) above), the set of measurable selectors is nonempty.

Definition 2 Randomized, deterministic, andm-randomized control.

(a) A randomized control is a stochastic kernel ϕ on A given X, such that

ϕ(A(x)|x)=1 for eachx∈X.

(b) If ϕ is a randomized control and there is a measurable selector f such that

ϕ({f (x)}|x)=1 for all x∈X, then we say thatϕ is a deterministic control, that is,

ϕ(·|x)=δf (x)(·) forx∈X,

whereδydenotes the Dirac measure concentrated ony. (Note that each determin-istic control is identified with a measurable selector and vice versa.)

(c) A randomized controlϕis anm-randomized control if there aremmeasurable se-lectorsf1, . . . , fmandmnonnegative numbersα1, . . . , αmsuch that

m

i=1αi=1 and

ϕ(·|x)=

m

i=1

αiδfi(x)(·).

We need to defineHn, “the set of possible histories up to timen.” LetH0:=X andHn:=Kn×Xforn=1,2, . . .. An arbitrary elementhn∈Hnis represented by

hn=(x0, a0, x1, a1, . . . , xn), where(xi, ai)∈Kfori=0,1, . . . , n−1 andxn∈X. Definition 3 Policies.

(a) A policy is a sequenceπ=(πn)of stochastic kernels onAgivenHn, satisfying

πn

A(xn)|hn

=1

for each historyhn=(x0, a0, . . . , xn−1, an−1, xn)inHn. We denote byΠthe set of all policies.

(b) We say that a policyπ=(πn)is stationary if there is a randomized controlϕ such that, for every historyhn=(x0, a0, . . . , xn), we have

(4)

We specify this dependence by writing πϕ. Also, we denote by Φ the set of randomized controls, and byΦthe set of stationary policies.

(c) A stationary policyπϕ=(πn)is called deterministic if the corresponding ran-domized control ϕ given in (b) is deterministic, that is, there is a measurable selectorf such that, for every historyhn=(x0, a0, . . . , xn), we have

πn(·|hn)=δf (xn)(·).

Let us denote byΦ1the set of deterministic controls, and byΦ1the set of deter-ministic stationary policies.

(d) Letmbe a positive integer. A stationary policy is said to bem-randomized

pol-icy if it is a convex combination ofmdeterministic policies. The corresponding randomized control is calledm-randomized control. We denote byΦmthe set of

m-randomized controls, and byΦmthe set ofm-randomized policies.

These definitions establish bijections between the setΦ of randomized controls and the setΦof stationary policies; the setΦ1of deterministic controls and the set

Φ1of deterministic policies; the setΦmofm-randomized controls and the setΦmof

m-randomized policies. We have the following inclusions diagram:

Φ1⊂Φm⊂Φm+1⊂Φ⊂Π.

Construction of the process Suppose that S1,S2, and S3 are metric spaces, μ∈ P(S1),ϕ1is a stochastic kernel onS2givenS1, andϕ2is a stochastic kernel onS3 givenS2. The product measureμ⊗ϕ1onS1×S2is defined as the measure generated by the formula

μ⊗ϕ1(B×C):=

B

ϕ1(C|s1)μ (ds1) (1)

forB∈B(S1)andC∈B(S2). Also, the kernel productϕ1⊗ϕ2onS2×S3givenS1 is defined as the kernel generated by the formula

ϕ1⊗ϕ2(C×D|s1):=

C

ϕ2(D|s2)ϕ1(ds2|s1) (2) forC∈B(S2)andD∈B(S3).

Let us construct a discrete-time stochastic process onX×A. Given an initial distri-butionν∈P(X)(the distribution ofx0) and a policyπ=(πn), by the Ionescu–Tulcea Theorem (Ash1972, Theorem 2.7.2; Hinderer1970, Sect. 11; Loève1977, pp. 137– 139), we have a stochastic process onX×Asuch that the initial distributionμ0(of the process) isν⊗π0, and the joint distributionμn+1of(x0, a0, x1, a1, . . . , xn+1, an+1) is (μn⊗πn)⊗Q, where μn is the distribution of (x0, a0, x1, a1, . . . , xn, an) for

n∈ {0,1,2, . . .}. We consider the measurable space of trajectories of the process Ω :=(X×A)∞ and Pπ

ν ∈ P(Ω) such that Pπν(Bn×(X×A)∞)=μn(Bn) for

Bn∈B((X×A)n+1). Note that Pπν is supported onK∞.

(5)

expected value ofY by Eπ_ν(Y ), that is, E_νπ(Y )=YdPπ_ν. Furthermore, in the case

ν=δx, we denote Pπν and Eπν by Pπx and Eπx, respectively, instead.

3 Solution of a control problem

The control problem (CP) consists in finding a policy π and an initial distribu-tion ν that minimize the objective function (or performance index function) J : Π×P(X)−→R∪ {+∞}given by

minimize: J (π, ν):=lim sup N→∞

1

N+1E π ν

N

t=0

c(xt, at)

, (3)

subject to

Ji(π, ν):=lim sup n→∞

1

N+1E π ν

N

t=0

di(xt, at)

≤ki (4)

fori∈ {1, . . . , m}andk1, . . . , km≥0. We could interpret the functionsJi(π, ν)as long-run average costs that needed to keep bounded.

LetΔbe the set of feasible pairs for the CP, that is,

Δ:=(π, ν)∈Π×P(X):J (π, ν) <∞andJi(π, ν)≤ki, i∈ {1, . . . , m}

. (5) Whenπ∗is a policy andν∗is an initial distribution such that(π∗, ν∗)∈Δand

J (π∗, ν∗)=infJ (π, ν):(π, ν)∈Δ, (6) we say that the pair(π∗, ν∗)is an optimal solution of the CP.

Hypothesis 1

(a) CP is consistent. That is, the set of feasible pairsΔis non empty.

(b) c≥0 is an inf-compact function, that is, for eachr∈R, the set{(x, a)∈K: c(x, a)≤r}is compact.

(c) Eachdi is a nonnegative lower semicontinuous function.

(d) The stochastic kernelQis weakly continuous, that is,_Xu(y)Q (dy|·)∈Cb(K) for every functionu∈Cb(X)(where for a topological spaceS,Cb(S)denotes the space of bounded continuous real functions).

Under this hypothesis, we have the following theorem (Hernández-Lerma et al. 2003, Theorem 3.2).

Theorem 1 Under Hypothesis1, there is an optimal solution of the CP.

Ifμ∈M(X×A), there are a randomized controlϕ and a signed measureμˆ ∈

M(X)such that

μ(B×C)= ˆμ⊗ϕ(B×C)=

B

(6)

forB∈B(X)andC∈B(A). The measureμˆ in (7) is called the marginal measure of

μonX, and it is defined asμˆ:=μ(·×A). Moreover, for eachC∈B(A), the function

ϕ(C|·)is the Radon–Nikodým derivative ofμ(·×C)with respect toμˆ. Conversely, if

ϕis a randomized control andμˆ∈M(X), there is a signed measureμ∈M(X×A)

such that (1) is satisfied.

Ifg:K−→Ris measurable,ϕ is a randomized control, and f is a measurable selector, for simplicity, we denote

g(x, f ):=gx, f (x)

and

g(x, ϕ):=

g(x, a)ϕ (da|x).

Definition 4 A measureμ= ˆμ⊗ϕ∈M(X×A)is said to be stable if

(a) μ, c :=

c(x, ϕ)μ (ˆ dx) <+∞

and

(b) μ(B)ˆ =

Q(B|x, ϕ)μ (ˆ dx) forB∈B(X).

Also, a randomized controlϕis called stable control if there existsμˆ ∈M(X)such that the signed measureμˆ ⊗ϕ is stable. We denote the set of probability measures stables concentrated onKbyPs(K).

Remark 1 For a policyπ=(πn)∈Π and initial distributionν, let us consider the occupation measure in then-stepμn(B):=Eνπ(δ(xn,an)(B)), and let us disintegrate

this measure asμn= ˆμn⊗ϕn; then the Markovian policyπ=(ϕn)is equivalent to the former policy in the sense that J (π, ν)=J (π, ν) andJi(π, ν)=Ji(π, ν) fori∈ {1,2, . . . , m}. Hence, Markovian policies are sufficient for CP. Even more,

Lemma1below shows that the search of optimal policies can be reduced to stationary and stable policies.

Letμˆ be an initial distribution, and ϕ a stable control. The Individual Ergodic Theorem (Yoshida1978, p. 338; Hernández-Lerma and Lasserre1996, Theorem E11) implies that ifμˆ⊗ϕis stable, then the long-run average valueJ (ϕ,μ)ˆ ofc(xt, at)in (3) is given just by the limit rather than the upper limit, i.e.,

Jπϕ,μˆ= lim N→∞

1

N+1E πϕ

ˆ

μ N

t=0

c(xt, at)

= μ, c,

and analogouslyJi(πϕ,μ)ˆ = μ, difori∈ {1,2, . . . , m}. In brief, we have

μ∈Ps(K) =⇒

(a)J (πϕ,μ)ˆ = μ, c and

(7)

The key of the CP is that we can reduce the search of optimum policies to the search of optimal stable controls. If this is the case, the CP can be reformulated as

minimize: μ, c,

subject to μ, di ≤ki, fori∈ {1,2. . . , m}.

(9)

We shall denote the problem (9) by CP.

As we can see, CP is a linear programming problem withmconstraints whose dimension is not necessarily finite. The following lemma (Hernández-Lerma et al. 2003, Lemma 3.5) gives us a guide to do such a reformulation.

Lemma 1 (Reduction of CP to the set of stable controls) Under Hypothesis1, for

each feasible pair(π, ν)∈Δof the CP, there is a stable measureμ= ˆμ⊗ϕ, such

that

(a) (πϕ,μ)ˆ ∈Δ, and

(b) J (π, ν)≥J (πϕ,μ)ˆ = μ, c.

Remark 2 If we consider the sequence(μn)of empirical measures onX×Agiven by

μn(B):= 1

n+1E π

ˆ

μ n

t=0

δ(xt,at)(B)

forn∈N,

then we need Hypothesis1(b) to apply Prokhorov’s Theorem. In last section we pro-vide two examples showing that without this hypothesis we cannot assure the reduc-tion of CP to stable policies. Even more, if we can apply Prokhorov’s Theorem to the sequence(μn), then there is a convergent subsequence(μkn)of(μn). So, ifμis the

limit of(μkn), then we need Hypothesis1(c) and (d) to show that the limit measure

μsatisfies Lemma1(a) and (b).

Lemma1and Theorem1 yield the existence of an optimal solution(πϕ∗, ν∗)∈ Φ×P(X)for the CP, withμ∗=ν∗⊗ϕ∗stable.

With the next hypothesis we can characterize some optimal policies. Let us define some concepts that will be used in Hypothesis2.

Letμ be a finite (nonnegative) measure onB(Y ). The measure μis said to be

regular if, for everyB∈B(Y ),μ(B)=sup{μ(F ): F ⊂B andF is closed}. The measureμ is said to be τ-smooth if, for each decreasing net(Fα)of closed sub-sets ofY, we haveμ(_αFα)=infαμ(Fα). A probabilityP or a probability space

(Ω,F, P ) is nonatomic ifP (A) >0 implies that there is B∈F such thatB⊂A

and 0< P (B) < P (A)(Billingsley1995, p. 35). In our particular case in whichFis the Borelσ-algebra of some Borel space, the fact thatP is nonatomic is reduced to

P ({x})=0 for allx∈Ω. Hypothesis 2

(8)

(b) The stochastic kernelQis nonatomic, that is, for every couple(x, a)∈K, the probability measureQ(·|x, a)is nonatomic.

Theorem 2 Under Hypotheses1and2, there is an optimal solution(πϕ∗, ν∗)∈Δ for CP such thatϕ∗is a stable(m+1)-randomized control.

We can see that, for a Borel space, each probability measure is regular andτ -smooth (Munkres1975, Theorems 1.2 and 1.3, and Exercise 7 of Sect. 4.1). To prove Theorem2, we shall use several lemmas. In the sequel of this section we fixν∗∈

P(X)such that, for someϕ∗∈Φ, we have an optimal solution(πϕ∗, ν∗)for CP, with

ν∗⊗ϕ∗stable. By Lemma1and Theorem1, we get that the minimum valueρ∗of CP can be written as

ρ∗=infν∗⊗ϕ, c :ϕ∈Φs

,

whereΦs:= {ϕ∈Φ:ν∗⊗ϕ∈Ps(K)}. LetΦ1,s:= {ϕ∈Φ1:ν∗⊗ϕ∈Ps(K)}. Lemma 2 The set of extreme points ofΦsisΦ1,s.

Proof Let ϕ ∈Φs. Suppose that there are x ∈ X and B ∈ B(A) such that 0<

ϕ(B|x) <1. Thenϕ is not an extreme point ofΦs, because we can express ϕ as

ϕ=αϕ1+(1−α)ϕ2with 0< α <1 andϕ1=ϕ2. Indeed, takingα:=ϕ(B|x), let

ϕ1(·|y):=

_ϕ(_·∩_B_|_x)

α ify=x,

ϕ(·|y) ify=x,

and

ϕ2(·|y):=

_ϕ(_·∩_Bc_|_x)

1−α ify=x,

ϕ(·|y) ify=x,

whereBc:=A\B. Also, Hypothesis2(b) and the definition of stable measure imply

thatν∗⊗ϕ1andν∗⊗ϕ2are stable.

LetΦ:= {ϕ∈Φ:ν∗⊗ϕ∈Ps(K)and(πϕ, ν∗)∈Δ}andΦ_m₊₁:= {ϕ∈Φm+1:

ν∗⊗ϕ∈Ps(K)and(πϕ, ν∗)∈Δ}. We have the following lemma, which is an

anal-ogous result to that given for Markov decision processes with discounted cost in González-Hernández and Hernández-Lerma (2005, Theorem 2.6).

Lemma 3 The setΦis convex, andΦ_m₊₁is the set of its extreme points.

Proof The proof of a theorem given by González-Hernández and Hernández-Lerma

(2005, Theorem 2.6) works in the present case if we use Lemma2, taking stable policies, initial distributionν∗, and posingΦs,Φ1,s,ΦandΦ_m₊₁in place ofΦ,F,

(9)

Letν∗⊗Φ:= {ν∗⊗ϕ:ϕ∈Φ}andν∗⊗Φ_m₊₁:={ν∗⊗ϕ:ϕ∈Φ_m₊₁}. Note

that by Lemma3, the set of extreme points ofν∗⊗Φisν∗⊗Φ_m₊₁. Moreover, we can deduce the next lemma.

Lemma 4 The set of extreme points ofν∗⊗Φisν∗⊗Φ_m₊₁. Moreover,ν∗⊗Φis convex, (weakly) closed, and sequentially compact.

Proof By Lemma3, the setν∗⊗Φis convex, andν∗⊗Φ_m₊₁is the set of its extreme points.

Letμk =ν∗⊗ϕk withϕk∈Φ such that (μk)converges weakly to a measure

μ∈M(X×A). Observe thatν∗= ˆμk, so μˆ =ν∗ (recall that ρˆ=ρ(· ×A)for

ρ∈M(X×A)). Hence, we getμ=ν∗⊗ϕ for someϕ∈Φ. We need to prove that

ϕ∈Φ.

Letu∈Cb(X). By Hypothesis1(d) together with weak convergence and Defini-tion4,

u(x)ν∗(dx)=

u(y)Q (dy|x, a)

ϕk(da|x)ν∗(dx)

−→

u(y)Q (dy|x, a)

ϕ (da|x)ν∗(dx);

therefore,ν∗=μ⊗Q, that is,

ν∗(B)=

Q(B|x, ϕ)ν∗(dx),

which meansϕ∈Φ.

Finally, by Hypothesis1(b) and Prokhorov’s Theorem (Bourbaki1969, No. 5.5),

ν∗⊗Φis sequentially compact.

End of the proof of Theorem2 Letμbe an extreme point ofν∗⊗Φ. From a theo-rem given by Winkler (1988, Theotheo-rem 2.1), there areμ1, . . . , μm+1∈ν∗⊗Φs and

α1, . . . , αm+1∈ [0,1]such that

m+1

k=1 αk=1 andμ=

m+1

k=1αkμk. Following the proof of a theorem given by Piunovskiy (1997, 1993, Theorem 10, Sect. 2.2.3), for each k ∈ {1, . . . , m+1}, there is a ϕk ∈Φ1,s such that μk =ν∗ ⊗ϕk. Let

ϕ=m_k₌+₁1αkϕk. Then, by Lemma3,ϕis an extreme point ofΦ.

We have that ν∗⊗Φ is a subset of a metric linear space with the Prokhorov metric (Billingsley1999, p. 72). Sinceν∗⊗Φis closed and sequentially compact, it is compact.

By the Krein–Milman Theorem (Phelps1966, p. 59), the minimum of the CP is attained in an extreme pointν∗⊗ϕ∗ofν∗⊗Φ.

4 Examples

(10)

x_tibe the observed stock volume of theith kind of cereal at the beginning of the stage

t when it is not negative and minus the nonsatisfied demand to spurt in the follow-ing period, which will be given to the client directly if it is negative. The manager then orders a quantitya_ti of theith kind of cereal. The state and control variables are

xt =(xt1, . . . , xtn)andat=(at1, . . . , ant), respectively, and are such thatxti+ati ≥0 andn_j₌₁(x_tj+a_tj)≤Mfor alli∈ {1, . . . , n}andt∈ {0,1,2, . . .}.

LetD_ti be the demand of theith kind of cereal through the periodt. We assume that(D_ti)∞_t₌₀is a sequence of continuous identically distributed random variables and 0≤Di

t ≤R for eachi∈ {1, . . . , n}and some constantR > M. The dynamics of the process is given by

x_ti₊₁=x_ti+a_ti−D_ti.

In this example, the state space isX= {(y1, . . . , yn)∈ [−R, M]n:

n

i=1yi≤M}, the action space isA= [0, M+R]n, and

A(y1, . . . , yn) =

(b1, . . . , bn)∈A: n

i=1

(bi+yi)≤Mandbj+yj≥0 for allj∈ {1, . . . , n}

.

The objective of the control problem is to maximize

K(π, ν):=lim inf N→∞

1

N+1E π ν

N

t=0 n

i=1

Pi

x_ti+a_ti−x_ti₊₁−Ciait−S

x_ti+a_ti

,

subject to

J1(π, ν):=lim sup N→∞

1

N+1E π ν

N

t=0 n

i=1

Pimax

−x_ti,0

≤k1,

where for eachi,Pi is the price, Ci is the cost,S is the storage cost, andk1 is a given positive number. Note that the expressionn_i₌₁(Pi(xti+ati−xti+1)−Ciait−

S(x_ti+a_ti))depends of the three variablesxt,at, andxt+1. Under the same constraint, the problem is equivalent to minimize

J (π, ν):=lim sup N→∞

1

N+1E π ν

N

t=0

c(xt, at)

,

wherec(xt, at):=

n

i=1(Ciati+S(xti+ati)−Piait+RPi). Effectively, to maximize

lim inf N→∞

1

N+1E π ν

N

t=0 n

i=1

Pi

x_ti+ai_t−x_ti₊₁−Ciait−S

(11)

is equivalent to minimize

lim sup N→∞

−1

N+1E π ν

N

t=0 n

i=1

Pi

x_ti+a_ti−x_ti₊₁−Ciati−S

x_ti+ai_t , but lim sup N→∞ −1

N+1E π ν

N

t=0 n

i=1

Pi

x_ti+ai_t−x_ti₊₁−Ciati−S

x_ti+a_ti

=lim sup N→∞

1

N+1E π ν

N

t=0

c(xt, at)+ N

t=0 n

i=1

Pi

x_ti₊₁−x_ti−R

=lim sup N→∞

1

N+1E π ν

N

t=0

c(xt, at)+ n

i=1

Pi

x_Ni ₊₁−x₀i−(N+1)R

n

i=1

Pi

=lim sup N→∞

1

N+1E π ν

N

t=0

c(xt, at)+ 1

N+1E π ν

n

i=1

Pi

x_Ni ₊₁−x₀i−R

n

i=1

Pi

,

and sincen_i₌₁Pi(x_Ni ₊1−x0i)is bounded, we have

lim sup N→∞

1

N+1E π ν

N

t=0

c(xt, at)+ 1

N+1E π ν

n

i=1

Pi

x_Ni ₊₁−x₀i−R

n

i=1

Pi

=lim sup N→∞

1

N+1E π ν

N

t=0

c(xt, at)−R n

i=1

Pi

.

Now, Rn_i₌₁Pi is constant with respect toN and(π, ν), therefore to maximize

K(π, ν)is equivalent to minimizeJ (π, ν)=lim sup_N_→∞_N1₊₁Eπ ν(

N

t=0c(xt, at)). We have that the new equivalent problem fulfills Hypotheses 1 and 2.

To illustrate Theorem 2 in this example, assume that we have only a kind of cereal,

Dt:=Dt1is uniformly distributed in[0, R], andP1> S+C1. For eachγ ∈ [0,1], let us define the measurable selectorfγ as

fγ(x):=

γ M−x ifx≤γ M,

0 ifx > γ M,

let ϕγ be the deterministic control such thatϕγ({fγ(x)}|x)=1, and πγ :=πϕγ. The policyπγ represents the action to start every period with a inventory levelγ if possible. Ifλ is the Lebesgue measure inR andνγ(·):=λ(·∩[γ M_R−R,γ M]), then

νγ⊗ϕγ is a stable measure, so the deterministic controlϕγ is stable. Moreover,

lim t→∞E

πγ

νγ

P1max{−xt,0}

=

0 γ M−R

−P1x

R dx=

P1(R−γ M)2

(12)

thus,πγ fulfills the constraint if and only if P1(R−γ M)

2

2R ≤k1, that is,πγ fulfills the constraint if and only ifγ≥

R−

2k1R P₁

M . Now, if R−

2k1R P₁

M ≤1, an optimal solution for CP is given by(πγ∗, νγ∗), whereγ∗:=max{

R−

2k₁R P₁

M ,0}.

An example without optimal stable policies LetX=Q∩(−1,1)be the state space, and letA= {−1,1}be the action space. The dynamics of the system is given by the conditional probabilityQ({αx+(1−α)a}|x, a)=1 for everyx∈X, whereαis a fixed number in(0,1₂] ∩Q. The cost function is given byc(x, a):=1+x.

In this case the definitions of stable measuresμ= ˆμ⊗ϕand stable policiesϕare the following:

(a) J (πϕ,μ)ˆ =

x∈X

a∈A

c(x, a)ϕ{a} |xμˆ{x}<+∞

and

(b) μ(B)ˆ =

y∈X

a∈A

Q(B|y, a)ϕ{a}|yμˆ{y}, for everyB⊂X.

If we take two points with the same imagex=αx0+(1−α)a0=αx1+(1−α)a1, then the only solution inXfulfillsa0=a1andx0=x1. Note that applying the former definition (b) in the case thatBis a singleton{x}, the first sum of this definition has

at most a nonzero term. Hence, the stable policies are deterministic policies. Let

πϕf _{any of these policies, where}_ϕ

f is the control corresponding to the measurable selectorf. We have, for any singleton{x1}with positive measureμ(ˆ {x1}), that (b) becomes

ˆ

μ{x1}=Qαx0+(1−α)f (x0)

|x0, f (x0)

ˆ

μ{x0}= ˆμ{x0}.

Now we apply the same argue tox1to obtainx2, and so on. In this way we obtain a sequence(xi)∞_i₌0such thatμ(ˆ {x0})= ˆμ({x1})= ˆμ({x2})= · · ·. We can not have an infinite number of equiprobable points; therefore, there are two natural numbersn,m

withm < nsuch thatxm=xn. We can suppose thatx0=xm. In this way we have

a0αn−1(1−α)+a1αn−2(1−α)+ · · · +an−1(1−α)=x0

1−αn, (10) whereai∈ {−1,1}fori∈ {0,1, . . . , n−1}. For any stable measure, the points in the support of this measure satisfies (10). Then the pointsxi in a “cycle” satisfy

xi=a0αi−1(1−α)+a1αi−2(1−α)+ · · · +ai−1(1−α) fori∈ {1, . . . , n}, (11) whereai∈ {−1,1}fori∈ {0,1, . . . , n−1}.

The average expected cost for one cycle of (11) is

J (πϕf_{, ν)}=1

n

n+x0 1−αn

1−α +a0(1−α)

n−1_{+ · · · +}_a

n−2(1−α)

(13)

where the initial distribution is given byμ(xˆ i)=_n1 for the pointsxi satisfying (11). Note thatJ (πϕf, ν)is positive becausecis strictly positive.

On the other hand, letf∗ be the function given byf∗(x):= −1, let x0be any element inX, and let us define xi again as in (11) but now not in a cycle. By a straightforward calculation we getJ (πϕf∗_{, ν)}=_{0. Hence, the optimal policy is not}

a stable policy.

An example without stable policies Now let us consider the same model as in the former example but with the space

Y:=

x∈R\Q:x=αix0+(1−α) i−1

k=0

akαi−k−1

forai∈ {−1,1}andi∈ {0,1,2, . . .}

,

wherex0is any fixed irrational number in(−1,1). Then the same formula (10) holds for the stable policies, but its solution are rational numbers. Hence, there are no solu-tions inY. Therefore, there are no stable policies. On the other hand, the deterministic stationary policyπϕf∗_{given by}_f∗_(x)= −1 still is an optimal policy.

Acknowledgements This work was partially sponsored by CONACYT grant SEP-2003-C02-45448/A-1 and PAICYT-UANL grant CA826-04. The authors are thankful to the referees for their valuable comments.

References

Altman E (1999) Constraint Markov decision processes. Chapman & Hall/CRC, Boca Raton Ash RB (1972) Real analysis and probability. Academic Press, London

Beutler FJ, Ross KW (1985) Optimal policies for controlled Markov chains with a constraint. J Math Anal Appl 112:236–252

Billingsley P (1995) Probability and measure, 3rd edn. Wiley–Interscience, New York

Billingsley P (1999) Convergence of probability measures, 2nd edn. Wiley–Interscience, New York Borkar VS (1994) Ergodic control of Markov chains with constraints—the general case. SIAM J Control

Optim 32:176–186

Bourbaki N (1969) Éléments de mathématique, chapitre IX. Hermann, Paris

Collins EJ, McNamara JM (1998) Finite-horizon dynamic optimization when the terminal reward is a concave function of the distribution of the final state. Adv Appl Probab 30:122–136

Feimberg EA, Shwartz A (1996) Constrained discounted dynamic programming. Math Oper Res 21:922– 945

Frid EB (1972) On optimal strategies in control problems with constraints. Theory Probab Appl 17:188– 192

González-Hernández J, Hernández-Lerma O (2005) Extreme points of sets randomized strategies in con-strained optimization and control problems. SIAM J Optim 15(4):1085–1104

Haviv M (1996) On constrained Markov decision processes. Oper Res Lett 19(1):25–28

Hernández-Lerma O, González-Hernández J (2000) Constrained Markov control processes in Borel spaces: the discounted case. Math Methods Oper Res 52:271–285

Hernández-Lerma O, Lasserre JB (1996) Discrete-time Markov control processes: basic optimality criteria. Springer, New York

(14)

Hinderer K (1970) Foundations of non-stationary dynamic programming discrete-time parameter. Lecture notes oper res math syst, vol 33. Springer, Berlin

Hu Q, Yue W (2008) Markov decision processes with their applications. Springer, New York

Karr AF (1983) Extreme points of certain sets of probability measures with applications. Math Oper Res 8:74–85

Kurano M, Nakagami J, Huang Y (2000a) Constrained Markov decision processes with compact state and action spaces: the average case. Optimization 48(2):255–269

Kurano M, Yasuda M, Nakagami J, Yoshida Y (2000b) A fuzzy treatment of uncertain Markov decision processes: average case. In Proceedings of ASSM2000 International Conference on Applied Stochas-tic System Modeling, Kyoto, pp 148–157

Loève M (1977) Probability theory, vol 1, 4th edn. Springer, New York Munkres JR (1975) Topology: a first course. Prentice–Hall, New Jersey Phelps RR (1966) Lectures on Choquet’s theorem. Van Nostrand, New York

Piunovskiy AB (1993) Control of random sequences in problems with constraints. Theory Probab Appl 38(4):751–762

Piunovskiy AB (1997) Optimal control of random sequences in problems with constraints. Kluwer Acad-emic, Dordrecht

Piunovskiy AB, Khametov VM (1991) Optimal control by random sequences with constraints. Math Notes 49:654–656

Sennott LI (1991) Constrained discounted Markov decision chains. Probab Eng Inf Sci 5:463–475 Sennott LI (1993) Constrained average cost Markov decision chains. Probab Eng Inf Sci 7:69–83 Tanaka K (1991) On discounted dynamic programming with constraints. J Math Anal Appl 155:264–277 Winkler G (1988) Extreme points of moment sets. Math Oper Res 13:581–587

Yoshida K (1978) Functional analysis, 5th edn. Springer, Berlin