Let us first consider the task of computing the maximal probabilities for $\psi[r]$. The MDP $\mathcal{M}^{\mathrm{lo}}_{\max}$ is obtained from $\mathcal{M}$ as follows. In each state of $B$, there is a choice between two fresh actions $\tau$ and $\iota$. As in $\mathcal{M}^{\mathrm{up}}_{\max}$, the action $\tau$ represents the choice of maximising $\mathrm{Acc}$, going to the states $\mathit{goal}$ and $\mathit{fail}$ with the corresponding probabilities. The action $\iota$ from a state $b \in B$ represents the choice of continuing normally, going with probability one to a fresh copy $b^c$, where the actions for $b$ in $\mathcal{M}$ are enabled. Effectively, each $B$-state is split into two parts: the first part allows choosing $\tau$, and the second part offers the original choices.
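Formally, $\mathcal{M}^{\mathrm{lo}}_{\max} = (\tilde{S}, \mathit{Act} \cup \{\tau, \iota\}, P^{\mathrm{lo}}_{\max}, \widetilde{\mathit{rew}})$ arises from $\mathcal{M}$ by adding trap states $\mathit{goal}$ and $\mathit{fail}$ and fresh and pairwise distinct copies $b^c$ for each state $b \in B$, i.e.,
\[
  \tilde{S} = S \cup B^c \cup \{\mathit{goal}, \mathit{fail}\}
\]
with $B^c = \{ b^c : b \in B \}$. The action set of $\mathcal{M}^{\mathrm{lo}}_{\max}$ is $\mathit{Act} \cup \{\tau, \iota\}$ where:
\[
\begin{aligned}
  \mathit{Act}_{\mathcal{M}^{\mathrm{lo}}_{\max}}(s) &= \mathit{Act}_{\mathcal{M}}(s) && \text{if } s \in S \setminus B \\
  \mathit{Act}_{\mathcal{M}^{\mathrm{lo}}_{\max}}(s) &= \{\tau, \iota\} && \text{if } s \in B \\
  \mathit{Act}_{\mathcal{M}^{\mathrm{lo}}_{\max}}(s^c) &= \mathit{Act}_{\mathcal{M}}(s) && \text{if } s \in B
\end{aligned}
\]
The probabilistic effect of the actions $\mathit{act} \in \mathit{Act}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ is the same as in $\mathcal{M}$. More precisely, for $t \in S$ and $\mathit{act} \in \mathit{Act}$:
\[
\begin{aligned}
  P^{\mathrm{lo}}_{\max}(s, \mathit{act}, t) &= P(s, \mathit{act}, t) && \text{if } s \in S \setminus B \\
  P^{\mathrm{lo}}_{\max}(s^c, \mathit{act}, t) &= P(s, \mathit{act}, t) && \text{if } s \in B
\end{aligned}
\]
Furthermore, for $s \in B$:
\[
\begin{aligned}
  P^{\mathrm{lo}}_{\max}(s, \tau, \mathit{goal}) &= \mathrm{Pr}^{\max}_{\mathcal{M},s}(\mathrm{Acc}) \\
  P^{\mathrm{lo}}_{\max}(s, \tau, \mathit{fail}) &= 1 - P^{\mathrm{lo}}_{\max}(s, \tau, \mathit{goal})
\end{aligned}
\]
Furthermore, $P^{\mathrm{lo}}_{\max}(\mathit{goal}, \tau, \mathit{goal}) = P^{\mathrm{lo}}_{\max}(\mathit{fail}, \tau, \mathit{fail}) = 1$ and no other action is enabled in the states $\mathit{goal}$ and $\mathit{fail}$. The reward structure $\widetilde{\mathit{rew}}$ for $\mathcal{M}^{\mathrm{lo}}_{\max}$ is given by:
\[
\begin{aligned}
  \widetilde{\mathit{rew}}(s, \mathit{act}) &= \mathit{rew}(s, \mathit{act}) && \text{for } s \in S \setminus B \\
  \widetilde{\mathit{rew}}(s^c, \mathit{act}) &= \mathit{rew}(s, \mathit{act}) && \text{for } s \in B \\
  \widetilde{\mathit{rew}}(s, \tau) &= 0 && \text{for } s \in B \cup \{\mathit{goal}, \mathit{fail}\} \\
  \widetilde{\mathit{rew}}(s, \iota) &= 0 && \text{for } s \in B
\end{aligned}
\]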
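The construction can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not part of the text: the MDP is stored as nested dictionaries `trans[state][action] = {successor: probability}`, rewards as `rew[(state, action)]`, the copy $b^c$ is encoded as the tuple `(b, "c")`, the names `"tau"`, `"iota"`, `"goal"` and `"fail"` are assumed not to occur in $\mathcal{M}$, and the values $\mathrm{Pr}^{\max}_{\mathcal{M},b}(\mathrm{Acc})$ are assumed to be precomputed and passed in as `pr_max_acc`. The transition $b \xrightarrow{\iota} b^c$ with probability one follows the informal description of $\iota$ given earlier.

```python
def build_lo_max(trans, rew, B, pr_max_acc):
    """Sketch of the split construction; returns (trans_lo, rew_lo).

    trans[s][a]   : dict successor -> probability (hypothetical encoding)
    rew[(s, a)]   : reward of action a in state s
    pr_max_acc[b] : assumed precomputed maximal probability of Acc from b
    """
    TAU, IOTA, GOAL, FAIL = "tau", "iota", "goal", "fail"
    trans_lo, rew_lo = {}, {}
    for s, actions in trans.items():
        # Original actions stay in s if s is not in B, else move to the copy.
        src = (s, "c") if s in B else s
        trans_lo[src] = {a: dict(dist) for a, dist in actions.items()}
        for a in actions:
            rew_lo[(src, a)] = rew[(s, a)]
        if s in B:
            p = pr_max_acc[s]
            trans_lo[s] = {
                TAU: {GOAL: p, FAIL: 1.0 - p},   # commit to maximising Acc
                IOTA: {(s, "c"): 1.0},           # continue normally
            }
            rew_lo[(s, TAU)] = 0
            rew_lo[(s, IOTA)] = 0
    for trap in (GOAL, FAIL):
        # Trap states: only tau is enabled, as a self-loop with reward 0.
        trans_lo[trap] = {TAU: {trap: 1.0}}
        rew_lo[(trap, TAU)] = 0
    return trans_lo, rew_lo
```

Note that successors of the copy `(b, "c")` are original states of $S$, so a path that returns to a $B$-state faces the choice between $\tau$ and $\iota$ again.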
Each infinite path $\tilde{\pi}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ where action $\tau$ is never scheduled induces a path $\tilde{\pi}|_{\mathcal{M}}$ in $\mathcal{M}$ by dropping all occurrences of action $\iota$ and the $B^c$-states. Vice versa, whenever $\pi$ is an infinite path in $\mathcal{M}$, a corresponding path $\tilde{\pi}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ is obtained by replacing each $B$-state $s$ in $\pi$ with $s \, \iota \, s^c$. Analogous transformations can be provided for finite paths in $\mathcal{M}^{\mathrm{lo}}_{\max}$ and $\mathcal{M}$. For the transformation of a finite path $\rho$ in $\mathcal{M}$ that ends in a $B$-state, we skip the copy at the end and suppose that $\mathit{last}(\tilde{\rho}) = \mathit{last}(\rho) \in B$.
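The two path transformations can be illustrated as follows. This is a sketch under assumed conventions that do not come from the text: a finite path is a Python list alternating states and actions, the copy $b^c$ is encoded as the tuple `(b, "c")`, and `"iota"` is assumed to be a fresh action name.

```python
def is_copy(x):
    """True for the (hypothetical) encoding (b, "c") of a copy state b^c."""
    return isinstance(x, tuple) and len(x) == 2 and x[1] == "c"


def lift_path(path, B):
    """Map a finite path of M to the corresponding path of the split MDP."""
    lifted = []
    last = len(path) - 1
    for i, x in enumerate(path):
        lifted.append(x)
        is_state = i % 2 == 0  # even positions hold states, odd ones actions
        # Replace each B-state s by s, iota, s^c, but skip the copy
        # for a B-state at the very end of the path.
        if is_state and x in B and i != last:
            lifted += ["iota", (x, "c")]
    return lifted


def project_path(lifted):
    """Drop iota actions and copy states (assumes tau never occurs)."""
    return [x for x in lifted if x != "iota" and not is_copy(x)]
```

For example, with `B = {"b"}` the path `["s0", "a", "b", "a", "b"]` is lifted to `["s0", "a", "b", "iota", ("b", "c"), "a", "b"]`, and projecting the lifted path returns the original one.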
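Intuitively, $\mathcal{M}^{\mathrm{lo}}_{\max}$ can simulate $\mathcal{M}$-paths satisfying $\psi[r] = (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc}$ by paths satisfying $(A' \,\mathrm{U}^{\geq r}\, \mathit{goal})$, and vice versa, where $A'$ consists of the $A$-states, the $B$-states and those $B^c$-states $b^c$ where $b \in A$. For an infinite path $\pi$ in $\mathcal{M}$ that satisfies $(A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc}$, the corresponding path $\tilde{\pi}$ is obtained by choosing action $\tau$ and going to $\mathit{goal}$ the first time $B$ is visited with an accumulated reward $\geq r$. As all $B$-states visited on $\pi$ strictly before that point have to be included in $A$ (otherwise, $A \,\mathrm{U}^{\geq r}\, B$ would not hold), the corresponding $B^c$-states visited along $\tilde{\pi}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ are included in $A'$, as are all the $B$-states. Thus, $\tilde{\pi}$ satisfies $(A' \,\mathrm{U}^{\geq r}\, \mathit{goal})$. Vice versa, paths in $\mathcal{M}^{\mathrm{lo}}_{\max}$ satisfying $(A' \,\mathrm{U}^{\geq r}\, \mathit{goal})$ induce paths satisfying $(A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc}$, with path fragments $\tilde{\rho}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ satisfying $(A' \,\mathrm{U}^{\geq r}\, B)$ inducing path fragments $\tilde{\rho}|_{\mathcal{M}}$ in $\mathcal{M}$ satisfying $(A \,\mathrm{U}^{\geq r}\, B)$.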
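Lemma 3.6.2. For all states $s$ of $\mathcal{M}$ and all $r \in \mathbb{N}$:
\[
  \mathrm{Pr}^{\max}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc} \bigr)
  \;=\;
  \mathrm{Pr}^{\max}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr)
\]
where $A' = A \cup B \cup \{ b^c \in B^c : b \in A \}$.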
Proof. As for Lemma 3.6.1, the proof relies on a scheduler transformation.
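Part 1. To prove that the maximal probability for $\psi[r] = (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc}$ is bounded from above by the maximal probability for $A' \,\mathrm{U}^{\geq r}\, \mathit{goal}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$, we pick an arbitrary scheduler $\mathfrak{S}$ for $\mathcal{M}$ and define a scheduler $\mathfrak{T}$ for $\mathcal{M}^{\mathrm{lo}}_{\max}$ as follows. In its first mode, scheduler $\mathfrak{T}$ simulates $\mathfrak{S}$ by scheduling $\iota$ in the $B$-states, i.e., $\mathfrak{T}(\tilde{\rho}) = \mathfrak{S}(\tilde{\rho}|_{\mathcal{M}})$ if $\mathit{last}(\tilde{\rho}) \in \tilde{S} \setminus B$ and $\mathfrak{T}(\tilde{\rho}) = \iota$ if $\mathit{last}(\tilde{\rho}) \in B$. As soon as a finite path $\tilde{\rho}$ with
\[
  \tilde{\rho} = \tilde{s}_0 \, \mathit{act}_0 \, \tilde{s}_1 \, \mathit{act}_1 \ldots \tilde{s}_n
  \quad \text{where } \tilde{s}_0, \ldots, \tilde{s}_{n-1} \in A', \ \tilde{s}_n \in B \text{ and } \widetilde{\mathit{rew}}(\tilde{\rho}) \geq r
\]
has been generated, $\mathfrak{T}$ switches mode and schedules $\tau$ from now on.
Obviously, whenever $\tilde{\rho}$ is a finite $\mathfrak{T}$-path not ending in $\mathit{goal}$ or $\mathit{fail}$, then by dropping the $\iota$-actions and the states $s^c \in B^c$, we obtain an $\mathfrak{S}$-path in $\mathcal{M}$.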
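The definition of the scheduler for the split MDP in Part 1 can be read as a wrapper around the given scheduler for $\mathcal{M}$. The sketch below is only an illustration under assumed conventions: schedulers are Python functions from finite paths (lists alternating states and actions) to actions, the copy $b^c$ is encoded as `(b, "c")`, and `make_T`, `project_path` and `path_reward` are hypothetical helper names. It uses the fact that $\iota$ carries reward 0, so the accumulated reward of a lifted path equals that of its projection.

```python
def is_copy(x):
    """True for the (hypothetical) encoding (b, "c") of a copy state b^c."""
    return isinstance(x, tuple) and len(x) == 2 and x[1] == "c"


def project_path(lifted):
    """Drop iota actions and copy states (assumes tau never occurs)."""
    return [x for x in lifted if x != "iota" and not is_copy(x)]


def path_reward(path, rew):
    """Accumulated reward of a finite path [s0, a0, s1, ..., sn] in M."""
    return sum(rew[(path[i], path[i + 1])] for i in range(0, len(path) - 1, 2))


def make_T(S, A, B, rew, r):
    """Build the scheduler for the split MDP from a scheduler S for M."""
    def in_A_prime(x):
        return x in A or x in B or (is_copy(x) and x[0] in A)

    def T(lifted):
        last = lifted[-1]
        if last in ("goal", "fail"):
            return "tau"  # only tau is enabled in the trap states
        if last in B:
            earlier_states = lifted[0:-1:2]
            acc = path_reward(project_path(lifted), rew)
            if acc >= r and all(in_A_prime(x) for x in earlier_states):
                return "tau"   # switch mode: commit to goal/fail
            return "iota"      # first mode: continue via the copy
        # Non-B state or copy state: mimic S on the projected path.
        return S(project_path(lifted))

    return T
```

With `S = lambda path: "a"`, `A = {"s0", "b"}`, `B = {"b"}`, rewards 1 for `("s0", "a")` and 2 for `("b", "a")`, and `r = 3`, the resulting scheduler picks `"iota"` on the first visit to `"b"` (accumulated reward 1) and `"tau"` on the second visit (accumulated reward 3).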
3.6 Quantiles under side conditions
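Let $\mathrm{FP}[A \,\mathrm{U}^{\geq r}\, B]$ denote the set consisting of all finite $\mathfrak{S}$-paths $\rho$ in $\mathcal{M}$ that have the following form:
\[
  \rho = s_0 \, \mathit{act}_0 \, s_1 \, \mathit{act}_1 \ldots s_n
  \quad \text{where } s_n \in B, \ s_0, \ldots, s_{n-1} \in A \text{ and } \mathit{rew}(\rho) \geq r
\]
such that no proper prefix of $\rho$ belongs to $\mathrm{FP}[A \,\mathrm{U}^{\geq r}\, B]$, i.e., if $m < n$ and $s_m \in B$ then $\mathit{rew}(s_0 \, \mathit{act}_0 \, s_1 \, \mathit{act}_1 \ldots \mathit{act}_{m-1} \, s_m) < r$. Similarly, let $\widetilde{\mathrm{FP}}[A' \,\mathrm{U}^{\geq r}\, B]$ denote the set consisting of all finite $\mathfrak{T}$-paths $\tilde{\rho}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ that have the following form:
\[
  \tilde{\rho} = \tilde{s}_0 \, \mathit{act}_0 \, \tilde{s}_1 \, \mathit{act}_1 \ldots \tilde{s}_n
  \quad \text{where } \tilde{s}_n \in B, \ \tilde{s}_0, \ldots, \tilde{s}_{n-1} \in A' \text{ and } \widetilde{\mathit{rew}}(\tilde{\rho}) \geq r
\]
such that no proper prefix of $\tilde{\rho}$ belongs to $\widetilde{\mathrm{FP}}[A' \,\mathrm{U}^{\geq r}\, B]$.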
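Note that each finite path $\rho \in \mathrm{FP}[A \,\mathrm{U}^{\geq r}\, B]$ in $\mathcal{M}$ has a corresponding path $\tilde{\rho} \in \widetilde{\mathrm{FP}}[A' \,\mathrm{U}^{\geq r}\, B]$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ with $\tilde{\rho}|_{\mathcal{M}} = \rho$ and vice versa. For $t \in B$, we define $\mathrm{FP}[A \,\mathrm{U}^{\geq r}\, t]$ as the set of paths $\rho \in \mathrm{FP}[A \,\mathrm{U}^{\geq r}\, B]$ with $\mathit{last}(\rho) = t$ and $\widetilde{\mathrm{FP}}[A' \,\mathrm{U}^{\geq r}\, t]$ as the set of paths $\tilde{\rho} \in \widetilde{\mathrm{FP}}[A' \,\mathrm{U}^{\geq r}\, B]$ with $\mathit{last}(\tilde{\rho}) = t$.
As in the proof of Lemma 3.6.1, we write $\mathrm{Pr}(\rho)$ for the probability of $\rho$ given by the product of the transition probabilities. Note that $\mathrm{Pr}(\tilde{\rho}) = \mathrm{Pr}(\tilde{\rho}|_{\mathcal{M}})$ for $\tilde{\rho} \in \widetilde{\mathrm{FP}}[A' \,\mathrm{U}^{\geq r}\, B]$, i.e., the transformation of adding or dropping the $\iota$ actions and $B^c$-states does not change the probability. We then have: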
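\[
\begin{aligned}
  \mathrm{Pr}^{\mathfrak{S}}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc} \bigr)
  &= \sum_{t \in B} \ \sum_{\rho \in \mathrm{FP}[A \,\mathrm{U}^{\geq r}\, t]} \mathrm{Pr}(\rho) \cdot \mathrm{Pr}^{\mathfrak{S}[\rho]}_{\mathcal{M},t}(\mathrm{Acc}) \\
  &\leq \sum_{t \in B} \ \sum_{\rho \in \mathrm{FP}[A \,\mathrm{U}^{\geq r}\, t]} \mathrm{Pr}(\rho) \cdot \mathrm{Pr}^{\max}_{\mathcal{M},t}(\mathrm{Acc}) \\
  &= \sum_{t \in B} \ \sum_{\tilde{\rho} \in \widetilde{\mathrm{FP}}[A' \,\mathrm{U}^{\geq r}\, t]} \mathrm{Pr}(\tilde{\rho}) \cdot P^{\mathrm{lo}}_{\max}(t, \tau, \mathit{goal}) \\
  &\overset{(\dagger)}{=} \mathrm{Pr}^{\mathfrak{T}}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr)
\end{aligned}
\]
Equation $(\dagger)$ in the above calculation holds because the set of infinite $\mathfrak{T}$-paths satisfying $A' \,\mathrm{U}^{\geq r}\, \mathit{goal}$ consists of all infinite paths in $\mathcal{M}^{\mathrm{lo}}_{\max}$ that have a prefix $\tilde{\rho}$ in $\widetilde{\mathrm{FP}}[A' \,\mathrm{U}^{\geq r}\, t]$ for some $t \in B$ and move from $\mathit{last}(\tilde{\rho}) = t$ to $\mathit{goal}$ (rather than $\mathit{fail}$) via action $\tau$. It can be concluded:
\[
  \mathrm{Pr}^{\max}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc} \bigr)
  \;\leq\;
  \mathrm{Pr}^{\max}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr)
\]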
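Part 2. To prove that the maximal probability for $(A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc}$ in the pointed MDP $(\mathcal{M}, s)$ is greater than or equal to the maximal probability for $A' \,\mathrm{U}^{\geq r}\, \mathit{goal}$ in $(\mathcal{M}^{\mathrm{lo}}_{\max}, s)$, we consider an arbitrary scheduler $\mathfrak{T}$ for $\mathcal{M}^{\mathrm{lo}}_{\max}$. Let $\mathfrak{S}_{\max}$ be a scheduler for $\mathcal{M}$ maximising the probability for $\mathrm{Acc}$ from all states $s \in S$. We now construct a scheduler $\mathfrak{S}$ for $\mathcal{M}$ as follows. In its initial mode, $\mathfrak{S}$ mimics $\mathfrak{T}$ by using $\mathfrak{T}$'s choice in the copy $b^c$ for states $b \in B$, provided that $\mathfrak{T}$ does not select action $\tau$ in $b$. As soon as $\mathfrak{T}$ schedules action $\tau$, scheduler $\mathfrak{S}$ switches its mode and simulates $\mathfrak{S}_{\max}$ from then on.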
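The reverse scheduler transformation of Part 2 can be sketched in the same style. The conventions are assumptions made for illustration only: schedulers are Python functions from finite paths (lists alternating states and actions) to actions, the copy $b^c$ is encoded as `(b, "c")`, `make_S` and `lift_path` are hypothetical names, and after the mode switch the maximising scheduler is applied to the path suffix that starts in the $B$-state where $\tau$ was selected.

```python
def lift_path(path, B):
    """Insert iota and the copy state after every B-state except the last state."""
    lifted = []
    last = len(path) - 1
    for i, x in enumerate(path):
        lifted.append(x)
        if i % 2 == 0 and x in B and i != last:
            lifted += ["iota", (x, "c")]
    return lifted


def make_S(T, S_max, B):
    """Build a scheduler for M from a scheduler T for the split MDP."""
    def S(path):
        # Look for the first B-state along the path where T selects tau.
        for i in range(0, len(path), 2):
            prefix = path[: i + 1]
            if prefix[-1] in B and T(lift_path(prefix, B)) == "tau":
                # Second mode: behave like S_max from that state onwards.
                return S_max(path[i:])
        # Initial mode: mimic T; in a B-state use T's choice in the copy.
        lifted = lift_path(path, B)
        if path[-1] in B:
            lifted = lifted + ["iota", (path[-1], "c")]
        return T(lifted)

    return S
```

The loop re-evaluates the split-MDP scheduler on every prefix, which is simple but quadratic in the path length; a stateful implementation would remember the mode switch instead.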
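Let $\widetilde{\mathrm{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, B]$ be the set of $\mathfrak{T}$-paths $\tilde{\rho}$ in $\mathcal{M}^{\mathrm{lo}}_{\max}$ of the form
\[
  \tilde{\rho} = \tilde{s}_0 \, \mathit{act}_0 \, \tilde{s}_1 \, \mathit{act}_1 \ldots \mathit{act}_{n-1} \, \tilde{s}_n
  \quad \text{with } \tilde{s}_0, \ldots, \tilde{s}_{n-1} \in A', \ \tilde{s}_n \in B,
\]
with $\widetilde{\mathit{rew}}(\tilde{\rho}) \geq r$ and $\mathfrak{T}(\tilde{\rho}) = \tau$, i.e., up to the $B$-state where $\mathfrak{T}$ schedules $\tau$ for the first time. Obviously, no path $\tilde{\rho} \in \widetilde{\mathrm{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, B]$ is a proper prefix of some other path in $\widetilde{\mathrm{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, B]$. Similarly, let $\mathrm{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, B]$ be the set of $\mathfrak{S}$-paths $\rho$ in $\mathcal{M}$ with $\rho = s_0 \, \mathit{act}_0 \, s_1 \, \mathit{act}_1 \ldots \mathit{act}_{n-1} \, s_n$ with $s_0, \ldots, s_{n-1} \in A$, $s_n \in B$, and $\mathit{rew}(\rho) \geq r$ and such that $\mathfrak{T}(\tilde{\rho}) = \tau$, where $\tilde{\rho}$ is the path in $\mathcal{M}^{\mathrm{lo}}_{\max}$ corresponding to $\rho$ (by adding the $\iota$ actions and $B^c$ copies) and such that no proper prefix of $\rho$ belongs to $\mathrm{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, B]$.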
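Then:
\[
  \mathfrak{S}[\rho] = \mathfrak{S}_{\max} \quad \text{for all } \rho \in \mathrm{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, B]
\]
Additionally, we have a one-to-one correspondence between the paths $\rho \in \mathrm{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, B]$ and $\tilde{\rho} \in \widetilde{\mathrm{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, B]$, satisfying $\tilde{\rho}|_{\mathcal{M}} = \rho$, $\widetilde{\mathit{rew}}(\tilde{\rho}) = \mathit{rew}(\rho)$ and $\mathrm{Pr}(\tilde{\rho}) = \mathrm{Pr}(\rho)$. For $t \in B$, let $\mathrm{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, t]$ be the set of finite paths $\rho \in \mathrm{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, B]$ with $\mathit{last}(\rho) = t$. Likewise, let $\widetilde{\mathrm{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, t]$ be the set of finite paths $\tilde{\rho} \in \widetilde{\mathrm{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, B]$ with $\mathit{last}(\tilde{\rho}) = t$. We get:
\[
\begin{aligned}
  \mathrm{Pr}^{\mathfrak{S}}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc} \bigr)
  &= \sum_{t \in B} \ \sum_{\rho \in \mathrm{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, t]} \mathrm{Pr}(\rho) \cdot \mathrm{Pr}^{\mathfrak{S}[\rho]}_{\mathcal{M},t}(\mathrm{Acc}) \\
  &= \sum_{t \in B} \ \sum_{\rho \in \mathrm{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, t]} \mathrm{Pr}(\rho) \cdot \mathrm{Pr}^{\mathfrak{S}_{\max}}_{\mathcal{M},t}(\mathrm{Acc}) \\
  &= \sum_{t \in B} \ \sum_{\rho \in \mathrm{FP}_{\tau}[A \,\mathrm{U}^{\geq r}\, t]} \mathrm{Pr}(\rho) \cdot \mathrm{Pr}^{\max}_{\mathcal{M},t}(\mathrm{Acc}) \\
  &= \sum_{t \in B} \ \sum_{\tilde{\rho} \in \widetilde{\mathrm{FP}}_{\tau}[A' \,\mathrm{U}^{\geq r}\, t]} \mathrm{Pr}(\tilde{\rho}) \cdot P^{\mathrm{lo}}_{\max}(t, \tau, \mathit{goal}) \\
  &= \mathrm{Pr}^{\mathfrak{T}}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr)
\end{aligned}
\]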
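In summary, this yields for every scheduler $\mathfrak{T}$ for $\mathcal{M}^{\mathrm{lo}}_{\max}$ a scheduler $\mathfrak{S}$ for $\mathcal{M}$ with
\[
  \mathrm{Pr}^{\mathfrak{S}}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc} \bigr)
  \;=\;
  \mathrm{Pr}^{\mathfrak{T}}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr)
\]
Hence:
\[
  \mathrm{Pr}^{\max}_{\mathcal{M},s}\bigl( (A \,\mathrm{U}^{\geq r}\, B) \wedge \mathrm{Acc} \bigr)
  \;\geq\;
  \mathrm{Pr}^{\max}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl( A' \,\mathrm{U}^{\geq r}\, \mathit{goal} \bigr)
\]
This completes the proof of Lemma 3.6.2.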
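By Lemma 3.6.2, the maximal probability for $\psi[r]$ reduces to a reward-bounded until probability in the split MDP. One possible way to approximate the right-hand side is sketched below. It is not the method of the text but a plain value iteration on pairs (state, accumulated reward capped at $r$), assuming non-negative integer rewards, at least one enabled action per state, and the hypothetical dictionary encoding `trans[state][action] = {successor: probability}` with rewards `rew[(state, action)]`. The function name and the `iterations` parameter are illustrative.

```python
def max_prob_reward_until(trans, rew, A_prime, goal, r, s0, iterations=1000):
    """Approximate Pr^max_{s0}(A' U^{>=r} goal) from below by value iteration.

    The accumulated reward is tracked up to the cap r, since only the
    comparison with r matters.
    """
    V = {(s, c): 0.0 for s in trans for c in range(r + 1)}
    for _ in range(iterations):
        new_V = {}
        for (s, c) in V:
            if s == goal:
                # goal counts only if the reward bound has been met.
                new_V[(s, c)] = 1.0 if c >= r else 0.0
            elif s not in A_prime:
                new_V[(s, c)] = 0.0
            else:
                new_V[(s, c)] = max(
                    sum(p * V[(t, min(c + rew[(s, a)], r))]
                        for t, p in dist.items())
                    for a, dist in trans[s].items()
                )
        V = new_V
    return V[(s0, 0)]
```

Because $\tau$ and $\iota$ carry reward 0, a single backward pass over reward levels does not suffice in general, which is why the sketch iterates over all pairs until the values stabilise. Starting from 0 yields a lower approximation of the maximal probability.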