Finally, we address the task of computing the minimal probabilities for $\psi[r]$. The MDP $\mathcal{M}^{\mathrm{lo}}_{\min}$ is largely the same as $\mathcal{M}^{\mathrm{lo}}_{\max}$, with two differences: we assign reward 1 to the self-loop in the goal state, and the probabilities of the $\tau$-transitions from the $B$-states to the goal state are defined via the minimal probability for $\mathrm{Acc}$ in $\mathcal{M}$.
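Formally, $\mathcal{M}^{\mathrm{lo}}_{\min} = (\tilde{S}, \mathit{Act} \cup \{\tau, \iota\}, P^{\mathrm{lo}}_{\min}, \widetilde{\mathit{rew}})$, where $\tilde{S}$ as well as the action set and the enabled actions are as for $\mathcal{M}^{\mathrm{lo}}_{\max}$. Likewise, $P^{\mathrm{lo}}_{\min}$ is as for $P^{\mathrm{lo}}_{\max}$, except for the probabilities in the $B$-states, i.e., for $s \in B$,
$$
\begin{aligned}
P^{\mathrm{lo}}_{\min}(s, \tau, \mathrm{goal}) &= \mathrm{Pr}^{\min}_{\mathcal{M},s}(\mathrm{Acc}) \\
P^{\mathrm{lo}}_{\min}(s, \tau, \mathrm{fail}) &= 1 - P^{\mathrm{lo}}_{\min}(s, \tau, \mathrm{goal}) \\
P^{\mathrm{lo}}_{\min}(s, \iota, s^{c}) &= 1
\end{aligned}
$$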
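The reward structure for $\mathcal{M}^{\mathrm{lo}}_{\min}$ is given by:
$$
\begin{aligned}
\widetilde{\mathit{rew}}(s, \mathit{act}) &= \mathit{rew}(s, \mathit{act}) && \text{for } s \in S \setminus B \\
\widetilde{\mathit{rew}}(s^{c}, \mathit{act}) &= \mathit{rew}(s, \mathit{act}) && \text{for } s \in B \\
\widetilde{\mathit{rew}}(s, \tau) &= 0 && \text{for } s \in B \cup \{\mathrm{fail}\} \\
\widetilde{\mathit{rew}}(s, \iota) &= 0 && \text{for } s \in B \\
\widetilde{\mathit{rew}}(s, \tau) &= 1 && \text{for } s = \mathrm{goal}
\end{aligned}
$$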
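The part of the construction that differs from the max case can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions: the names `copy_of`, `b_state_transitions` and `lifted_rewards`, the string labels for goal, fail, τ and ι, and the dictionary encoding are hypothetical and not part of the original construction, and the values $\mathrm{Pr}^{\min}_{\mathcal{M},s}(\mathrm{Acc})$ are assumed to have been computed beforehand and passed in as `pr_min_acc`.

```python
from fractions import Fraction

# Hypothetical labels for the fresh states and actions.
GOAL, FAIL = "goal", "fail"
TAU, IOTA = "tau", "iota"


def copy_of(s):
    """Hypothetical encoding of the copy s^c of a B-state s."""
    return (s, "c")


def b_state_transitions(B, pr_min_acc):
    """Transition probabilities of the min construction in the B-states.

    pr_min_acc[s] is assumed to hold the precomputed minimal
    probability of Acc from s in the original MDP.
    """
    P = {}
    for s in B:
        p_goal = pr_min_acc[s]
        P[(s, TAU)] = {GOAL: p_goal, FAIL: 1 - p_goal}
        P[(s, IOTA)] = {copy_of(s): 1}
    return P


def lifted_rewards(S, B, enabled, rew):
    """Reward structure of the min construction, built from rew."""
    rew_t = {}
    for s in S:
        for act in enabled[s]:
            # Original actions of a B-state are attached to its copy.
            source = copy_of(s) if s in B else s
            rew_t[(source, act)] = rew[(s, act)]
    for s in B:
        rew_t[(s, TAU)] = 0
        rew_t[(s, IOTA)] = 0
    rew_t[(FAIL, TAU)] = 0
    rew_t[(GOAL, TAU)] = 1  # positive-reward self-loop on goal
    return rew_t
```

Exact fractions are used here only to keep the two $\tau$-probabilities summing to 1 without rounding; floating-point values would work equally well for a sketch.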
As for $\mathcal{M}^{\mathrm{lo}}_{\max}$ in the previous transformation, we can transform between paths in $\mathcal{M}$ and $\mathcal{M}^{\mathrm{lo}}_{\min}$.
Intuitively, $\mathcal{M}^{\mathrm{lo}}_{\min}$ can simulate $\mathcal{M}$-paths satisfying $\psi[r] = (A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}$ by paths satisfying $(A' \,\mathrm{U}^{\geq r} C)$, and vice versa, where $A'$ consists of the $A$-states, the $B$-states, those $B^{c}$-states $b^{c}$ where $b \in A$, and the goal state. The right-hand side $C$ of the until consists of all $B^{c}$-states and the goal state. In contrast to $\mathcal{M}^{\mathrm{lo}}_{\max}$, the goal state is included on the left side of the until to ensure (in conjunction with the self-loop on goal with reward 1) that every finite path $\tilde{\rho}$ that satisfies $(A' \,\mathrm{U}\, \mathrm{goal})$ will be extended to an infinite path that satisfies $(A' \,\mathrm{U}^{\geq r} \mathrm{goal})$, regardless of the concrete value of $r$. Without the positive reward loop for the goal state, a minimising scheduler could ensure by scheduling $\tau$ that every path fragment $\tilde{\rho}$ consisting of $A'$-states, ending in a $B$-state $b$ and having reward $\widetilde{\mathit{rew}}(\tilde{\rho}) < r$ is extended in such a way that $(A' \,\mathrm{U}^{\geq r} \mathrm{goal})$ (and thus also $(A' \,\mathrm{U}^{\geq r} C)$) is never satisfied.

With the presented construction of $\mathcal{M}^{\mathrm{lo}}_{\min}$, a minimising scheduler has the following choice for continuing these paths. If it chooses $\tau$, then $(A' \,\mathrm{U}^{\geq r} C)$ will be satisfied with probability $\mathrm{Pr}^{\min}_{\mathcal{M},b}(\mathrm{Acc})$. Essentially, choosing $\tau$ focuses on minimising the probability of $\psi[r] = (A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}$ by minimising the probability only of $\mathrm{Acc}$, ignoring the possibility of minimising the probability for $(A \,\mathrm{U}^{\geq r} B)$. The choice $\iota$ in this situation postpones the minimisation of the probability of $\mathrm{Acc}$ to a later moment. For a path fragment $\tilde{\rho}$ consisting of $A'$-states, ending in a $B$-state $b$ and having reward $\widetilde{\mathit{rew}}(\tilde{\rho}) \geq r$ that has not yet satisfied $(A' \,\mathrm{U}^{\geq r} C)$, the choice of $\tau$ becomes attractive in all cases: choosing $\iota$ will surely lead to satisfaction of $(A' \,\mathrm{U}^{\geq r} C)$, because a $B^{c}$-state is reached in the next step.
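Lemma 3.6.3. For all states $s$ of $\mathcal{M}$ and all $r \in \mathbb{N}$:
$$
\mathrm{Pr}^{\min}_{\mathcal{M},s}\bigl((A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}\bigr) = \mathrm{Pr}^{\min}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} C\bigr)
$$
where $A' = A \cup B \cup \{b^{c} \in B^{c} : b \in A\} \cup \{\mathrm{goal}\}$ and $C = B^{c} \cup \{\mathrm{goal}\}$.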
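Proof. As before, we will provide scheduler transformations in both directions.
Part 1. We first show that the minimal probability for $(A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}$ in $\mathcal{M}$ is greater than or equal to the minimal probability for $A' \,\mathrm{U}^{\geq r} C$ in $\mathcal{M}^{\mathrm{lo}}_{\min}$. For this, we consider an arbitrary scheduler $S$ for $\mathcal{M}$. We define a scheduler $T$ for $\mathcal{M}^{\mathrm{lo}}_{\min}$ as in Part 1 of the proof of Lemma 3.6.2, i.e., $T$ first simulates $S$ and switches to scheduling $\tau$ continuously as soon as a finite path
$$
\tilde{\rho} = \tilde{s}_0 \,\mathit{act}_0\, \tilde{s}_1 \,\mathit{act}_1 \ldots \tilde{s}_n \quad \text{where } \tilde{s}_0, \ldots, \tilde{s}_{n-1} \in A',\ \tilde{s}_n \in B \text{ and } \widetilde{\mathit{rew}}(\tilde{\rho}) \geq r
$$
has been generated.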
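The switching rule of $T$ can be sketched as follows. This is a minimal sketch under assumptions: finite paths are encoded as Python lists alternating states and actions, `simulate_S` is a hypothetical callable standing in for the simulation of $S$ on paths of $\mathcal{M}^{\mathrm{lo}}_{\min}$ (its details come from the proof of Lemma 3.6.2 and are not reproduced here), and `make_T` and `path_reward` are illustrative names.

```python
def path_reward(path, rew_t):
    """Accumulated reward of a finite path [s0, act0, s1, ..., sn]."""
    return sum(rew_t[(path[i], path[i + 1])]
               for i in range(0, len(path) - 1, 2))


def make_T(simulate_S, B, A_prime, rew_t, r, tau="tau"):
    """Build the scheduler T: follow simulate_S until some prefix
    stays in A', ends in a B-state and has reward >= r; from then
    on always schedule tau."""
    def T(path):
        for end in range(0, len(path), 2):
            prefix = path[:end + 1]
            stays_in_a_prime = all(prefix[i] in A_prime
                                   for i in range(0, end, 2))
            if (prefix[-1] in B and stays_in_a_prime
                    and path_reward(prefix, rew_t) >= r):
                return tau  # switch already triggered by this prefix
        return simulate_S(path)
    return T
```

Checking every prefix, rather than only the current path, reflects that the switch to $\tau$ is permanent once it has been triggered.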
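With the above definition of $T$, there is no finite $T$-path of the form:
$$
\tilde{\rho} = \tilde{s}_0 \,\mathit{act}_0\, \tilde{s}_1 \,\mathit{act}_1 \ldots \tilde{s}_n \quad \text{where } \tilde{s}_0, \ldots, \tilde{s}_{n-1} \in A',\ \tilde{s}_n \in B^{c} \text{ and } \widetilde{\mathit{rew}}(\tilde{\rho}) \geq r
$$
Note that for such paths we would have $\tilde{s}_{n-1} \in B$ and $\mathit{act}_{n-1} = \iota$, which conflicts with $T(\tilde{s}_0 \,\mathit{act}_0\, \tilde{s}_1 \,\mathit{act}_1 \ldots \mathit{act}_{n-2}\, \tilde{s}_{n-1}) = \tau$. This yields:
$$
\mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} \mathrm{goal}\bigr) = \mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} (\mathrm{goal} \vee B^{c})\bigr) = \mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} C\bigr)
$$
With a calculation as in the proof of Lemma 3.6.2 we get:
$$
\mathrm{Pr}^{S}_{\mathcal{M},s}\bigl((A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}\bigr) \geq \mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} \mathrm{goal}\bigr)
$$
Hence, we get:
$$
\mathrm{Pr}^{S}_{\mathcal{M},s}\bigl((A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}\bigr) \geq \mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} C\bigr)
$$
As a consequence we obtain:
$$
\mathrm{Pr}^{\min}_{\mathcal{M},s}\bigl((A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}\bigr) \geq \mathrm{Pr}^{\min}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} C\bigr)
$$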
Part 2. To prove that the minimal probability for $A' \,\mathrm{U}^{\geq r} C$ in $\mathcal{M}^{\mathrm{lo}}_{\min}$ is greater than or equal to the minimal probability for $(A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}$ in $\mathcal{M}$, we pick an arbitrary scheduler $T$ for $\mathcal{M}^{\mathrm{lo}}_{\min}$. Let $S_{\min}$ be a scheduler for $\mathcal{M}$ that minimises the probabilities for $\mathrm{Acc}$ from all states. We now construct a scheduler $S$ for $\mathcal{M}$ as follows. In its initial mode, $S$ mimics $T$, provided that $T$ does not select action $\tau$. As soon as $T$ schedules action $\tau$, scheduler $S$ switches its mode and simulates $S_{\min}$ from then on. The goal is now to show that:
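$$
\mathrm{Pr}^{S}_{\mathcal{M},s}\bigl((A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}\bigr) \leq \mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} C\bigr)
$$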
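The two-mode scheduler $S$ can be sketched as follows. This is a rough sketch under several assumptions that the text does not spell out: $T$ is treated as a deterministic function on finite paths (lists alternating states and actions), the only actions of $T$ in a $B$-state are $\tau$ and $\iota$, an $\mathcal{M}$-path is lifted to $\mathcal{M}^{\mathrm{lo}}_{\min}$ by inserting $\iota$ and the copy $b^{c}$ after each $B$-state $b$, and $S_{\min}$ is given the suffix of the path starting at the switching point. The names `make_S` and `copy_of` are hypothetical.

```python
def make_S(T, S_min, B, copy_of, tau="tau", iota="iota"):
    """Build S: mimic T on the lifted path until T schedules tau
    in some B-state, then follow S_min from that state on."""
    def S(path):
        lifted = [path[0]]
        for i in range(0, len(path), 2):
            s = path[i]
            if s in B:
                if T(lifted) == tau:
                    # Mode switch: S_min takes over at this B-state.
                    return S_min(path[i:])
                # T chose iota: move to the copy of s in the lifted path.
                lifted += [iota, copy_of(s)]
            if i + 2 < len(path):
                # Replay the next step of the original path.
                lifted += [path[i + 1], path[i + 2]]
        return T(lifted)
    return S
```

Because the lifted path is rebuilt from scratch on every call, the sketch needs no mutable mode flag: the first $B$-state at which $T$ answers $\tau$ determines the switching point.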
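The set of $T$-paths in $\mathcal{M}^{\mathrm{lo}}_{\min}$ that satisfy $A' \,\mathrm{U}^{\geq r} C$ can be partitioned into two sets: those that satisfy $A' \,\mathrm{U}^{\geq r} B^{c}$ and those that satisfy $A' \,\mathrm{U}^{\geq r} \mathrm{goal}$. Since a path can satisfy both $A' \,\mathrm{U}^{\geq r} B^{c}$ and $A' \,\mathrm{U}^{\geq r} \mathrm{goal}$, namely when a prefix satisfies $A' \,\mathrm{U}^{\geq r} B^{c}$ before the $\tau$-action is scheduled at a later point, we partition according to which of the two path formulas is satisfied first. Formally, let $\widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r} B^{c}]$ denote the set of finite $T$-paths of the form
$$
\tilde{\rho} = \tilde{s}_0 \,\mathit{act}_0\, \tilde{s}_1 \,\mathit{act}_1 \ldots \tilde{s}_n \quad \text{where } \tilde{s}_n \in B^{c},\ \tilde{s}_0, \ldots, \tilde{s}_{n-1} \in A' \text{ and } \widetilde{\mathit{rew}}(\tilde{\rho}) \geq r
$$
such that no proper prefix of $\tilde{\rho}$ belongs to $\widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r} B^{c}]$, i.e., if $m < n$ and $\tilde{s}_m \in B^{c}$ then $\widetilde{\mathit{rew}}(\tilde{s}_0 \,\mathit{act}_0\, \tilde{s}_1 \,\mathit{act}_1 \ldots \mathit{act}_{m-1}\, \tilde{s}_m) < r$. Let $\widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}\, B]$ be the set of finite $T$-paths $\tilde{\rho}$ in $\mathcal{M}^{\mathrm{lo}}_{\min}$ of the form
$$
\tilde{\rho} = \tilde{s}_0 \,\mathit{act}_0\, \tilde{s}_1 \,\mathit{act}_1 \ldots \mathit{act}_{n-1}\, \tilde{s}_n \quad \text{with } \tilde{s}_0, \ldots, \tilde{s}_{n-1} \in A',\ \tilde{s}_n \in B,
$$
with $T(\tilde{\rho}) = \tau$, i.e., the paths up to the $B$-state where $T$ schedules $\tau$ for the first time, and which do not have a prefix in $\widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r} B^{c}]$. Obviously, no path $\tilde{\rho} \in \widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}\, B]$ is a proper prefix of some other path in $\widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}\, B]$, and all infinite $T$-paths that satisfy $A' \,\mathrm{U}^{\geq r} \mathrm{goal}$ have a prefix in $\widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}\, B]$. We write $\widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}\, t]$ for $t \in B$ to denote the set of paths in $\widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}\, B]$ ending in $t$, with an equivalent notation for $\widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r} t]$. Then:
$$
\mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} C\bigr) = \mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(\widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r} B^{c}]\bigr) + \sum_{t \in B} \ \sum_{\tilde{\rho} \in \widetilde{\mathit{FP}}_{\tau}[A' \mathrm{U}\, t]} \mathrm{Pr}(\tilde{\rho}) \cdot P^{\mathrm{lo}}_{\min}(t, \tau, \mathrm{goal})
$$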
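Let $\mathit{FP}[A \,\mathrm{U}^{\geq r} B]$ be the set of finite $\mathcal{M}$-paths $\rho$ with $\tilde{\rho}|_{\mathcal{M}} = \rho$ and $\tilde{\rho} \in \widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r} B^{c}]$, and let $\mathit{FP}_{\tau}[A \,\mathrm{U}\, B]$ be the set of finite $\mathcal{M}$-paths $\rho$ with $\tilde{\rho}|_{\mathcal{M}} = \rho$ and $\tilde{\rho} \in \widetilde{\mathit{FP}}_{\tau}[A' \,\mathrm{U}\, B]$. No path in $\mathit{FP}[A \,\mathrm{U}^{\geq r} B]$ has a prefix in $\mathit{FP}_{\tau}[A \,\mathrm{U}\, B]$ and vice versa. Additionally, all paths in $\mathit{FP}[A \,\mathrm{U}^{\geq r} B]$ or $\mathit{FP}_{\tau}[A \,\mathrm{U}\, B]$ are $S$-paths.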
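In particular, all $S$-paths that satisfy $(A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}$ have a prefix in either $\mathit{FP}[A \,\mathrm{U}^{\geq r} B]$ or $\mathit{FP}_{\tau}[A \,\mathrm{U}\, B]$. However, not all infinite $S$-paths that have a prefix in $\mathit{FP}[A \,\mathrm{U}^{\geq r} B]$ satisfy $(A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}$, as $\mathrm{Acc}$ is not guaranteed to be satisfied. Likewise, not all infinite $S$-paths that satisfy $\mathrm{Acc}$ with a prefix in $\mathit{FP}_{\tau}[A \,\mathrm{U}\, B]$ satisfy $(A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}$, as it is not guaranteed that $(A \,\mathrm{U}^{\geq r} B)$ holds. Thus, we have
$$
\begin{aligned}
\mathrm{Pr}^{S}_{\mathcal{M},s}\bigl((A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}\bigr)
&= \mathrm{Pr}^{S}_{\mathcal{M},s}\bigl(\mathit{FP}[A \,\mathrm{U}^{\geq r} B] \wedge \mathrm{Acc}\bigr) + \mathrm{Pr}^{S}_{\mathcal{M},s}\bigl(\mathit{FP}_{\tau}[A \,\mathrm{U}\, B] \wedge (A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}\bigr) \\
&\leq \mathrm{Pr}^{S}_{\mathcal{M},s}\bigl(\mathit{FP}[A \,\mathrm{U}^{\geq r} B]\bigr) + \mathrm{Pr}^{S}_{\mathcal{M},s}\bigl(\mathit{FP}_{\tau}[A \,\mathrm{U}\, B] \wedge \mathrm{Acc}\bigr) \\
&= \mathrm{Pr}^{S}_{\mathcal{M},s}\bigl(\mathit{FP}[A \,\mathrm{U}^{\geq r} B]\bigr) + \sum_{t \in B} \ \sum_{\rho \in \mathit{FP}_{\tau}[A \mathrm{U}\, t]} \mathrm{Pr}(\rho) \cdot \mathrm{Pr}^{S}_{\mathcal{M},t}(\mathrm{Acc}) \\
&= \mathrm{Pr}^{S}_{\mathcal{M},s}\bigl(\mathit{FP}[A \,\mathrm{U}^{\geq r} B]\bigr) + \sum_{t \in B} \ \sum_{\rho \in \mathit{FP}_{\tau}[A \mathrm{U}\, t]} \mathrm{Pr}(\rho) \cdot \mathrm{Pr}^{S_{\min}}_{\mathcal{M},t}(\mathrm{Acc}) \\
&= \mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(\widetilde{\mathit{FP}}[A' \,\mathrm{U}^{\geq r} B^{c}]\bigr) + \sum_{t \in B} \ \sum_{\tilde{\rho} \in \widetilde{\mathit{FP}}_{\tau}[A' \mathrm{U}\, t]} \mathrm{Pr}(\tilde{\rho}) \cdot P^{\mathrm{lo}}_{\min}(t, \tau, \mathrm{goal}) \\
&= \mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} C\bigr)
\end{aligned}
$$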
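Thus, for every scheduler $T$ in $\mathcal{M}^{\mathrm{lo}}_{\min}$ we can construct a scheduler $S$ in $\mathcal{M}$ with
$$
\mathrm{Pr}^{S}_{\mathcal{M},s}\bigl((A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}\bigr) \leq \mathrm{Pr}^{T}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} C\bigr)
$$
Hence, we get:
$$
\mathrm{Pr}^{\min}_{\mathcal{M},s}\bigl((A \,\mathrm{U}^{\geq r} B) \wedge \mathrm{Acc}\bigr) \leq \mathrm{Pr}^{\min}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(A' \,\mathrm{U}^{\geq r} C\bigr)
$$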
This completes the proof of Lemma 3.6.3.
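As a consequence of Lemma 3.6.2 and Lemma 3.6.3, the transformations from $\mathcal{M}$ to $\mathcal{M}^{\mathrm{lo}}_{\min}$ and from $\mathcal{M}$ to $\mathcal{M}^{\mathrm{lo}}_{\max}$ make it possible to apply the methods presented in Section 3.4 to compute quantiles for lower reward-bounded until properties under side conditions:
$$
\begin{aligned}
\mathrm{qu}_{\mathcal{M},s}\bigl(\exists \mathrm{P}_{\trianglerighteq p}((A \,\mathrm{U}^{\geq ?} B) \wedge \mathrm{Acc})\bigr) &= \mathrm{qu}_{\mathcal{M}^{\mathrm{lo}}_{\max},s}\bigl(\exists \mathrm{P}_{\trianglerighteq p}(A' \,\mathrm{U}^{\geq ?} \mathrm{goal})\bigr) \\
\mathrm{qu}_{\mathcal{M},s}\bigl(\forall \mathrm{P}_{\trianglerighteq p}((A \,\mathrm{U}^{\geq ?} B) \wedge \mathrm{Acc})\bigr) &= \mathrm{qu}_{\mathcal{M}^{\mathrm{lo}}_{\min},s}\bigl(\forall \mathrm{P}_{\trianglerighteq p}(A' \,\mathrm{U}^{\geq ?} C)\bigr)
\end{aligned}
$$
where $A'$ and $C$ are defined as in Lemma 3.6.2 and Lemma 3.6.3, respectively. Note that the definition of $A'$ differs slightly between the two cases.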