
Let us first consider the task of computing the maximal probabilities for $\psi[r]$. The MDP $\mathcal{M}^{\mathrm{lo}}_{\max}$ is obtained from $\mathcal{M}$ as follows. In each state of $B$, there is a choice between two fresh actions $\tau$ and $\iota$. As in $\mathcal{M}^{\mathrm{up}}_{\max}$, the action $\tau$ represents the choice of maximising $\mathit{Acc}$, going to the states $\mathit{goal}$ and $\mathit{fail}$ with the corresponding probabilities. The action $\iota$ from a state $b \in B$ represents the choice of continuing normally, going with probability one to a fresh copy $b^c$, where the actions for $b$ in $\mathcal{M}$ are enabled. Effectively, each $B$-state is split into two parts: the first part allows choosing $\tau$, and the second part allows the original choices.

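Formally, $\mathcal{M}^{\mathrm{lo}}_{\max} = (\tilde{S}, \mathit{Act} \cup \{\tau, \iota\}, P^{\mathrm{lo}}_{\max}, \widetilde{\mathit{rew}})$ arises from $\mathcal{M}$ by adding trap states $\mathit{goal}$ and $\mathit{fail}$ and fresh, pairwise distinct copies $b^c$ for each state $b \in B$, i.e.,

\[ \tilde{S} = S \cup B^c \cup \{\mathit{goal}, \mathit{fail}\} \]

with $B^c = \{ b^c : b \in B \}$. The action set of $\mathcal{M}^{\mathrm{lo}}_{\max}$ is $\mathit{Act} \cup \{\tau, \iota\}$, where:

\[
\begin{aligned}
\mathit{Act}_{\mathcal{M}^{\mathrm{lo}}_{\max}}(s) &= \mathit{Act}_{\mathcal{M}}(s) && \text{if } s \in S \setminus B \\
\mathit{Act}_{\mathcal{M}^{\mathrm{lo}}_{\max}}(s) &= \{\tau, \iota\} && \text{if } s \in B \\
\mathit{Act}_{\mathcal{M}^{\mathrm{lo}}_{\max}}(s^c) &= \mathit{Act}_{\mathcal{M}}(s) && \text{if } s \in B
\end{aligned}
\]

The probabilistic effect of the actions $\mathit{act} \in \mathit{Act}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ is the same as in $\mathcal{M}$. More precisely, for $t \in S$ and $\mathit{act} \in \mathit{Act}$:

\[
\begin{aligned}
P^{\mathrm{lo}}_{\max}(s, \mathit{act}, t) &= P(s, \mathit{act}, t) && \text{if } s \in S \setminus B \\
P^{\mathrm{lo}}_{\max}(s^c, \mathit{act}, t) &= P(s, \mathit{act}, t) && \text{if } s \in B
\end{aligned}
\]

Furthermore, for $s \in B$:

\[
\begin{aligned}
P^{\mathrm{lo}}_{\max}(s, \tau, \mathit{goal}) &= \Pr^{\max}_{\mathcal{M},s}(\mathit{Acc}) \\
P^{\mathrm{lo}}_{\max}(s, \tau, \mathit{fail}) &= 1 - P^{\mathrm{lo}}_{\max}(s, \tau, \mathit{goal})
\end{aligned}
\]

and, in line with the informal description of $\iota$ as moving to the copy with probability one, $P^{\mathrm{lo}}_{\max}(s, \iota, s^c) = 1$. Furthermore, $P^{\mathrm{lo}}_{\max}(\mathit{goal}, \tau, \mathit{goal}) = P^{\mathrm{lo}}_{\max}(\mathit{fail}, \tau, \mathit{fail}) = 1$, and no other action is enabled in the states $\mathit{goal}$ and $\mathit{fail}$. The reward structure $\widetilde{\mathit{rew}}$ for $\mathcal{M}^{\mathrm{lo}}_{\max}$ is given by:

\[
\begin{aligned}
\widetilde{\mathit{rew}}(s, \mathit{act}) &= \mathit{rew}(s, \mathit{act}) && \text{for } s \in S \setminus B \\
\widetilde{\mathit{rew}}(s^c, \mathit{act}) &= \mathit{rew}(s, \mathit{act}) && \text{for } s \in B \\
\widetilde{\mathit{rew}}(s, \tau) &= 0 && \text{for } s \in B \cup \{\mathit{goal}, \mathit{fail}\} \\
\widetilde{\mathit{rew}}(s, \iota) &= 0 && \text{for } s \in B
\end{aligned}
\]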
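The construction can be illustrated with a minimal Python sketch. The dictionary encoding of the MDP, the names `build_m_lo_max` and `copy_of`, and the tiny example MDP are assumptions made for illustration only. The values $\Pr^{\max}_{\mathcal{M},b}(\mathit{Acc})$ are assumed to be precomputed and passed in as `pr_max_acc`.

```python
from fractions import Fraction

TAU, IOTA, GOAL, FAIL = "tau", "iota", "goal", "fail"

def copy_of(b):
    """Fresh copy b^c of a B-state b (hypothetical encoding as a tagged tuple)."""
    return (b, "c")

def build_m_lo_max(P, rew, B, pr_max_acc):
    """Return (P_lo, rew_lo) for M^lo_max.

    P[s][act]     : dict successor -> probability in M
    rew[(s, act)] : reward of action act in state s of M
    B             : set of B-states
    pr_max_acc[b] : assumed precomputed Pr^max_{M,b}(Acc) for each b in B
    """
    P_lo, rew_lo = {}, {}
    for s, actions in P.items():
        # The original actions of s stay in s (s not in B) or move to the copy s^c (s in B).
        src = copy_of(s) if s in B else s
        P_lo[src] = {act: dict(dist) for act, dist in actions.items()}
        for act in actions:
            rew_lo[(src, act)] = rew[(s, act)]
        if s in B:
            p = pr_max_acc[s]
            # tau bets on Acc, iota continues normally in the copy.
            P_lo[s] = {TAU: {GOAL: p, FAIL: 1 - p}, IOTA: {copy_of(s): 1}}
            rew_lo[(s, TAU)] = 0
            rew_lo[(s, IOTA)] = 0
    for trap in (GOAL, FAIL):
        # Trap states: tau is the only enabled action and loops with probability one.
        P_lo[trap] = {TAU: {trap: 1}}
        rew_lo[(trap, TAU)] = 0
    return P_lo, rew_lo

# Hypothetical two-state MDP: s0 --a--> b or s0 (1/2 each), b --d--> s0.
P = {"s0": {"a": {"b": Fraction(1, 2), "s0": Fraction(1, 2)}},
     "b": {"d": {"s0": Fraction(1)}}}
rew = {("s0", "a"): 1, ("b", "d"): 0}
P_lo, rew_lo = build_m_lo_max(P, rew, B={"b"}, pr_max_acc={"b": Fraction(3, 4)})
```

Exact rationals (`Fraction`) are used here so that the distributions sum to exactly one, which avoids floating-point noise when checking the construction.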

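Each infinite path $\tilde{\pi}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ in which action $\tau$ is never scheduled induces a path $\tilde{\pi}|_{\mathcal{M}}$ in $\mathcal{M}$ by dropping all occurrences of action $\iota$ and the $B^c$-states. Vice versa, whenever $\pi$ is an infinite path in $\mathcal{M}$, a corresponding path $\tilde{\pi}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ is obtained by replacing each $B$-state $s$ in $\pi$ with $s \, \iota \, s^c$. Analogous transformations can be provided for finite paths in $\mathcal{M}^{\mathrm{lo}}_{\max}$ and $\mathcal{M}$. For the transformation of a finite path $\rho$ in $\mathcal{M}$ that ends in a $B$-state, we skip the copy at the end and suppose that $\mathit{last}(\tilde{\rho}) = \mathit{last}(\rho) \in B$.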
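The two path transformations can be sketched in Python as follows. The encoding of a finite path as a list alternating states and actions, and the names `lift`, `project` and `copy_of`, are assumptions made for illustration. The sketch only covers paths on which $\tau$ is never scheduled.

```python
IOTA = "iota"

def copy_of(b):
    """Fresh copy b^c of a B-state b (hypothetical encoding as a tagged tuple)."""
    return (b, "c")

def lift(path, B):
    """Map a finite path of M to the corresponding path of M^lo_max.

    Each B-state s is replaced by s, iota, s^c. A B-state at the very end
    of the path is kept without its copy, as stipulated in the text.
    """
    lifted = []
    for i, item in enumerate(path):
        lifted.append(item)
        is_state = i % 2 == 0
        if is_state and item in B and i != len(path) - 1:
            lifted += [IOTA, copy_of(item)]
    return lifted

def project(lifted, B):
    """Drop all iota-actions and B^c-states from a tau-free path of M^lo_max."""
    copies = {copy_of(b) for b in B}
    return [x for x in lifted if x != IOTA and x not in copies]

# Hypothetical example path s0 a b d s0 a b with B = {b}.
rho = ["s0", "a", "b", "d", "s0", "a", "b"]
rho_tilde = lift(rho, {"b"})
```

On the example, `rho_tilde` is `s0 a b iota b^c d s0 a b`, and `project(rho_tilde, {"b"})` returns the original `rho`.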

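Intuitively, $\mathcal{M}^{\mathrm{lo}}_{\max}$ can simulate $\mathcal{M}$-paths satisfying $\psi[r] = (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc}$ by paths satisfying $A' \,\mathrm{U}^{\geq r}\, \mathit{goal}$, and vice versa, where $A'$ consists of the $A$-states, the $B$-states and those $B^c$-states $b^c$ where $b \in A$. For an infinite path $\pi$ in $\mathcal{M}$ that satisfies $(A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc}$, the corresponding path $\tilde{\pi}$ is obtained by choosing action $\tau$ and going to $\mathit{goal}$ the first time $B$ is visited with an accumulated reward $\geq r$. As all $B$-states visited on $\pi$ strictly before that point have to be included in $A$ (otherwise, $A \,\mathrm{U}^{\geq r}\, B$ would not hold), the corresponding $B^c$-states visited along $\tilde{\pi}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ are included in $A'$, as are all the $B$-states. Thus, $\tilde{\pi}$ satisfies $A' \,\mathrm{U}^{\geq r}\, \mathit{goal}$. Vice versa, paths in $\mathcal{M}^{\mathrm{lo}}_{\max}$ satisfying $A' \,\mathrm{U}^{\geq r}\, \mathit{goal}$ induce paths satisfying $(A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc}$, with path fragments $\tilde{\rho}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ satisfying $A' \,\mathrm{U}^{\geq r}\, B$ inducing path fragments $\tilde{\rho}|_{\mathcal{M}}$ in $\mathcal{M}$ satisfying $A \,\mathrm{U}^{\geq r}\, B$.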

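Lemma 3.6.2. For all states $s$ of $\mathcal{M}$ and all $r \in \mathbb{N}$:

\[ \Pr^{\max}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc} \bigr) \;=\; \Pr^{\max}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr) \]

where $A' = A \cup B \cup \{ b^c \in B^c : b \in A \}$.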
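As a small illustration of the set $A'$ used in the lemma, the following Python sketch computes it from $A$ and $B$. The tagged-tuple encoding of the copies $b^c$ and the names `a_prime` and `copy_of` are assumptions made for this example.

```python
def copy_of(b):
    """Fresh copy b^c of a B-state b (hypothetical encoding as a tagged tuple)."""
    return (b, "c")

def a_prime(A, B):
    """A' = A ∪ B ∪ { b^c : b in B and b in A }."""
    return set(A) | set(B) | {copy_of(b) for b in B if b in A}

# Hypothetical example: b1 lies in both A and B, b2 only in B.
A_prime = a_prime(A={"s0", "b1"}, B={"b1", "b2"})
```

In this example the copy of `b1` belongs to $A'$ while the copy of `b2` does not, so a path that continues through `b2` via $\iota$ leaves $A'$.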

Proof. As for Lemma 3.6.1, the proof relies on a scheduler transformation.

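Part 1. To prove that the maximal probability for $\psi[r] = (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc}$ is bounded from above by the maximal probability for $A' \,\mathrm{U}^{\geq r}\, \mathit{goal}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$, we pick an arbitrary scheduler $\mathfrak{S}$ for $\mathcal{M}$ and define a scheduler $\mathfrak{T}$ for $\mathcal{M}^{\mathrm{lo}}_{\max}$ as follows. In its first mode, scheduler $\mathfrak{T}$ simulates $\mathfrak{S}$ by scheduling $\iota$ in the $B$-states, i.e., $\mathfrak{T}(\tilde{\rho}) = \mathfrak{S}(\tilde{\rho}|_{\mathcal{M}})$ if $\mathit{last}(\tilde{\rho}) \in \tilde{S} \setminus B$ and $\mathfrak{T}(\tilde{\rho}) = \iota$ if $\mathit{last}(\tilde{\rho}) \in B$. As soon as a finite path $\tilde{\rho}$ with

\[ \tilde{\rho} = \tilde{s}_0 \, \mathit{act}_0 \, \tilde{s}_1 \, \mathit{act}_1 \ldots \tilde{s}_n \quad \text{where } \tilde{s}_0, \ldots, \tilde{s}_{n-1} \in A', \ \tilde{s}_n \in B \text{ and } \widetilde{\mathit{rew}}(\tilde{\rho}) \geq r \]

has been generated, $\mathfrak{T}$ switches mode and schedules $\tau$ from now on.

Obviously, whenever $\tilde{\rho}$ is a finite $\mathfrak{T}$-path not ending in $\mathit{goal}$ or $\mathit{fail}$, then by dropping the $\iota$-actions and the states $s^c \in B^c$, we obtain an $\mathfrak{S}$-path in $\mathcal{M}$.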
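The two-mode scheduler of Part 1 can be sketched in Python as a function on finite paths. The list encoding of paths (alternating states and actions), the representation of a scheduler as a function from a finite path to an action, and the names `make_T`, `project` and `copy_of` are assumptions made for illustration.

```python
TAU, IOTA = "tau", "iota"

def copy_of(b):
    """Fresh copy b^c of a B-state b (hypothetical encoding as a tagged tuple)."""
    return (b, "c")

def project(lifted, B):
    """Drop iota-actions and B^c-states, yielding the underlying path of M."""
    copies = {copy_of(b) for b in B}
    return [x for x in lifted if x != IOTA and x not in copies]

def make_T(S, A_prime, B, rew_lo, r):
    """Build the scheduler T for M^lo_max from a scheduler S for M."""

    def switched(lifted):
        # True iff some prefix ends in a B-state with accumulated reward >= r
        # while all earlier states lie in A'.
        acc = 0
        for i in range(0, len(lifted), 2):
            state = lifted[i]
            if state in B and acc >= r:
                return True
            if state not in A_prime:
                return False
            if i + 1 < len(lifted):
                acc += rew_lo[(state, lifted[i + 1])]
        return False

    def T(lifted):
        if switched(lifted):
            return TAU               # second mode: tau from now on
        if lifted[-1] in B:
            return IOTA              # first mode: continue normally via the copy
        return S(project(lifted, B)) # first mode: simulate S on the projected path

    return T

# Hypothetical memoryless scheduler S for M and reward bound r = 2.
S = lambda path: {"s0": "a", "b": "d"}[path[-1]]
rew_lo = {("s0", "a"): 1, ("b", IOTA): 0, (copy_of("b"), "d"): 0, ("b", TAU): 0}
T = make_T(S, A_prime={"s0", "b"}, B={"b"}, rew_lo=rew_lo, r=2)
```

In this example, `T` picks $\iota$ when `b` is reached with accumulated reward 1, and switches to $\tau$ once `b` is reached with accumulated reward 2.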

3.6 Quantiles under side conditions

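Let $\mathit{FP}[A \,\mathrm{U}^{\geq r}\, B]$ denote the set of all finite $\mathfrak{S}$-paths $\rho$ in $\mathcal{M}$ that have the following form:

\[ \rho = s_0 \, \mathit{act}_0 \, s_1 \, \mathit{act}_1 \ldots s_n \quad \text{where } s_n \in B, \ s_0, \ldots, s_{n-1} \in A \text{ and } \mathit{rew}(\rho) \geq r \]

such that no proper prefix of $\rho$ belongs to $\mathit{FP}[A \,\mathrm{U}^{\geq r}\, B]$, i.e., if $m < n$ and $s_m \in B$ then $\mathit{rew}(s_0 \, \mathit{act}_0 \, s_1 \, \mathit{act}_1 \ldots \mathit{act}_{m-1} \, s_m) < r$. Similarly, let $\widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r}\, B]$ denote the set of all finite $\mathfrak{T}$-paths $\tilde{\rho}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ that have the following form:

\[ \tilde{\rho} = \tilde{s}_0 \, \mathit{act}_0 \, \tilde{s}_1 \, \mathit{act}_1 \ldots \tilde{s}_n \quad \text{where } \tilde{s}_n \in B, \ \tilde{s}_0, \ldots, \tilde{s}_{n-1} \in A' \text{ and } \widetilde{\mathit{rew}}(\tilde{\rho}) \geq r \]

such that no proper prefix of $\tilde{\rho}$ belongs to $\widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r}\, B]$.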
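The shape and minimality conditions of these path sets can be checked mechanically. The following Python sketch tests the conditions for a path of $\mathcal{M}$. It ignores the requirement that the path is generated by the scheduler, and the list encoding of paths and the name `in_fp` are assumptions made for illustration.

```python
def in_fp(path, A, B, rew, r):
    """Check the conditions defining FP[A U^{>=r} B] for a finite path.

    path alternates states and actions: [s0, act0, s1, ..., sn].
    """
    states, actions = path[0::2], path[1::2]
    if states[-1] not in B or any(s not in A for s in states[:-1]):
        return False
    acc = 0
    for m, act in enumerate(actions):
        # acc is the reward of the prefix ending in states[m], with m < n.
        if states[m] in B and acc >= r:
            return False  # a proper prefix already satisfies the conditions
        acc += rew[(states[m], act)]
    return acc >= r

# Hypothetical rewards and sets: b lies in both A and B, reward bound r = 1.
rew = {("s0", "a"): 1, ("b", "d"): 1}
A, B = {"s0", "b"}, {"b"}
```

With these values, `s0 a b` satisfies the conditions, whereas `s0 a b d b` does not, because its proper prefix `s0 a b` already does.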

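Note that each finite path $\rho \in \mathit{FP}[A \,\mathrm{U}^{\geq r}\, B]$ in $\mathcal{M}$ has a corresponding path $\tilde{\rho} \in \widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r}\, B]$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ with $\tilde{\rho}|_{\mathcal{M}} = \rho$, and vice versa. For $t \in B$, we define $\mathit{FP}[A \,\mathrm{U}^{\geq r}\, t]$ as the set of paths $\rho \in \mathit{FP}[A \,\mathrm{U}^{\geq r}\, B]$ with $\mathit{last}(\rho) = t$ and $\widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r}\, t]$ as the set of paths $\tilde{\rho} \in \widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r}\, B]$ with $\mathit{last}(\tilde{\rho}) = t$.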

As in the proof of Lemma 3.6.1, we write $\Pr(\rho)$ for the probability of $\rho$ given by the product of the transition probabilities. Note that $\Pr(\tilde{\rho}) = \Pr(\tilde{\rho}|_{\mathcal{M}})$ for $\tilde{\rho} \in \widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r}\, B]$, i.e., the transformation of adding or dropping the $\iota$-actions and $B^c$-states does not change the probability. We then have:

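\begin{align*}
\Pr^{\mathfrak{S}}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc} \bigr)
 &= \sum_{t \in B} \ \sum_{\rho \in \mathit{FP}[A \,\mathrm{U}^{\geq r}\, t]} \Pr(\rho) \cdot \Pr^{\mathfrak{S}[\rho]}_{\mathcal{M},t}(\mathit{Acc}) \\
 &\leq \sum_{t \in B} \ \sum_{\rho \in \mathit{FP}[A \,\mathrm{U}^{\geq r}\, t]} \Pr(\rho) \cdot \Pr^{\max}_{\mathcal{M},t}(\mathit{Acc}) \\
 &= \sum_{t \in B} \ \sum_{\tilde{\rho} \in \widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r}\, t]} \Pr(\tilde{\rho}) \cdot P^{\mathrm{lo}}_{\max}(t, \tau, \mathit{goal}) \\
 &\overset{(\ast)}{=} \Pr^{\mathfrak{T}}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr)
\end{align*}

Equation $(\ast)$ in the above calculation holds because the set of infinite $\mathfrak{T}$-paths satisfying $A' \,\mathrm{U}^{\geq r}\, \mathit{goal}$ consists of all infinite paths in $\mathcal{M}^{\mathrm{lo}}_{\max}$ that have a prefix $\tilde{\rho}$ in $\widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r}\, t]$ for some $t \in B$ and move from $\mathit{last}(\tilde{\rho}) = t$ to $\mathit{goal}$ (rather than $\mathit{fail}$) via action $\tau$. It can be concluded:

\[ \Pr^{\max}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc} \bigr) \;\leq\; \Pr^{\max}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr) \]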

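Part 2. To prove that the maximal probability for $(A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc}$ in the pointed MDP $(\mathcal{M}, s)$ is greater than or equal to the maximal probability for $A' \,\mathrm{U}^{\geq r}\, \mathit{goal}$ in $(\mathcal{M}^{\mathrm{lo}}_{\max}, s)$, we consider an arbitrary scheduler $\mathfrak{T}$ for $\mathcal{M}^{\mathrm{lo}}_{\max}$. Let $\mathfrak{S}_{\max}$ be a scheduler for $\mathcal{M}$ maximising the probability for $\mathit{Acc}$ from all states $s \in S$. We now construct a scheduler $\mathfrak{S}$ for $\mathcal{M}$ as follows. In its initial mode, $\mathfrak{S}$ mimics $\mathfrak{T}$ by using $\mathfrak{T}$'s choice in the copy $b^c$ for states $b \in B$, provided that $\mathfrak{T}$ does not select action $\tau$ in $b$. As soon as $\mathfrak{T}$ schedules action $\tau$, scheduler $\mathfrak{S}$ switches its mode and simulates $\mathfrak{S}_{\max}$ from then on.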
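The reverse scheduler transformation of Part 2 admits a similar Python sketch. As before, the list encoding of paths, schedulers as functions from finite paths to actions, and the names `make_S`, `lift` and `copy_of` are assumptions made for illustration. After the mode switch, the sketch applies the optimal scheduler to the suffix of the path that starts in the state where the first $\tau$ was scheduled.

```python
TAU, IOTA = "tau", "iota"

def copy_of(b):
    """Fresh copy b^c of a B-state b (hypothetical encoding as a tagged tuple)."""
    return (b, "c")

def lift(path, B):
    """Replace each non-final B-state s by s, iota, s^c."""
    lifted = []
    for i, item in enumerate(path):
        lifted.append(item)
        if i % 2 == 0 and item in B and i != len(path) - 1:
            lifted += [IOTA, copy_of(item)]
    return lifted

def make_S(T, S_max, B):
    """Build the scheduler S for M from T (for M^lo_max) and an optimal S_max."""

    def S(path):
        # Look for the first B-state at which T schedules tau.
        for k in range(0, len(path), 2):
            prefix = path[: k + 1]
            if prefix[-1] in B and T(lift(prefix, B)) == TAU:
                # Second mode: behave like S_max from that state onwards.
                return S_max(path[k:])
        # Initial mode: mimic T, using its choice in the copy for B-states.
        lifted = lift(path, B)
        if path[-1] in B:
            lifted = lifted + [IOTA, copy_of(path[-1])]
        return T(lifted)

    return S

# Hypothetical T: in b, pick tau once action a has been taken twice, else iota.
def T(lifted):
    last = lifted[-1]
    if last == "b":
        return TAU if lifted.count("a") >= 2 else IOTA
    return "d" if last == copy_of("b") else "a"

# Hypothetical S_max: picks action e in b and a elsewhere.
S_max = lambda path: "e" if path[-1] == "b" else "a"
S = make_S(T, S_max, B={"b"})
```

In the example, `S` follows `T`'s choice `d` on the first visit to `b` and switches to `S_max`'s choice `e` on the second visit, where `T` schedules $\tau$.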

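Let $\widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, B]$ be the set of $\mathfrak{T}$-paths $\tilde{\rho}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ of the form

\[ \tilde{\rho} = \tilde{s}_0 \, \mathit{act}_0 \, \tilde{s}_1 \, \mathit{act}_1 \ldots \mathit{act}_{n-1} \, \tilde{s}_n \quad \text{with } \tilde{s}_0, \ldots, \tilde{s}_{n-1} \in A', \ \tilde{s}_n \in B, \]

with $\widetilde{\mathit{rew}}(\tilde{\rho}) \geq r$ and $\mathfrak{T}(\tilde{\rho}) = \tau$, i.e., up to the $B$-state where $\mathfrak{T}$ schedules $\tau$ for the first time. Obviously, no path $\tilde{\rho} \in \widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, B]$ is a proper prefix of some other path in $\widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, B]$. Similarly, let $\mathit{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, B]$ be the set of $\mathfrak{S}$-paths $\rho$ in $\mathcal{M}$ with $\rho = s_0 \, \mathit{act}_0 \, s_1 \, \mathit{act}_1 \ldots \mathit{act}_{n-1} \, s_n$ with $s_0, \ldots, s_{n-1} \in A$, $s_n \in B$, and $\mathit{rew}(\rho) \geq r$ such that $\mathfrak{T}(\tilde{\rho}) = \tau$, where $\tilde{\rho}$ is the path in $\mathcal{M}^{\mathrm{lo}}_{\max}$ corresponding to $\rho$ (by adding the $\iota$-actions and $B^c$ copies), and such that no proper prefix of $\rho$ belongs to $\mathit{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, B]$.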

Then:

\[ \mathfrak{S}[\rho] = \mathfrak{S}_{\max} \quad \text{for all } \rho \in \mathit{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, B] \]

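Additionally, we have a one-to-one correspondence between the paths $\rho \in \mathit{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, B]$ and $\tilde{\rho} \in \widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, B]$, satisfying $\tilde{\rho}|_{\mathcal{M}} = \rho$, $\widetilde{\mathit{rew}}(\tilde{\rho}) = \mathit{rew}(\rho)$ and $\Pr(\tilde{\rho}) = \Pr(\rho)$. For $t \in B$, let $\mathit{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, t]$ be the set of finite paths $\rho \in \mathit{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, B]$ with $\mathit{last}(\rho) = t$. Likewise, let $\widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, t]$ be the set of finite paths $\tilde{\rho} \in \widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, B]$ with $\mathit{last}(\tilde{\rho}) = t$. We get:

\begin{align*}
\Pr^{\mathfrak{S}}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc} \bigr)
 &= \sum_{t \in B} \ \sum_{\rho \in \mathit{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, t]} \Pr(\rho) \cdot \Pr^{\mathfrak{S}[\rho]}_{\mathcal{M},t}(\mathit{Acc}) \\
 &= \sum_{t \in B} \ \sum_{\rho \in \mathit{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, t]} \Pr(\rho) \cdot \Pr^{\mathfrak{S}_{\max}}_{\mathcal{M},t}(\mathit{Acc}) \\
 &= \sum_{t \in B} \ \sum_{\rho \in \mathit{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, t]} \Pr(\rho) \cdot \Pr^{\max}_{\mathcal{M},t}(\mathit{Acc}) \\
 &= \sum_{t \in B} \ \sum_{\tilde{\rho} \in \widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, t]} \Pr(\tilde{\rho}) \cdot P^{\mathrm{lo}}_{\max}(t, \tau, \mathit{goal}) \\
 &= \Pr^{\mathfrak{T}}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr)
\end{align*}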

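In summary, this yields for every scheduler $\mathfrak{T}$ for $\mathcal{M}^{\mathrm{lo}}_{\max}$ a scheduler $\mathfrak{S}$ for $\mathcal{M}$ with

\[ \Pr^{\mathfrak{S}}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc} \bigr) \;=\; \Pr^{\mathfrak{T}}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr) \]

Hence:

\[ \Pr^{\max}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathit{Acc} \bigr) \;\geq\; \Pr^{\max}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr) \]

This completes the proof of Lemma 3.6.2.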
