Theorem 4.12 states the conditions under which the relaxed environment monitoring problem is a POMDP MAB. In this subsection we show how this property may be exploited to compute an upper bound for the optimal value function of the original unrelaxed problem. We prove a monotonicity result for the Gittins index, which makes computing the upper bound practical. A similar monotonicity result was shown for reward functions linear in the belief state by Krishnamurthy and Wahlberg (2009). We extend it to reward functions concave in the belief state.
Consider a POMDP MAB with $n$ arms. We restrict our attention to the special case where $|Y_i| = 2$, $1 \le i \le n$, and $|Z| = 2$, with MI as the reward function. If the $i$th arm is activated, its state transition model is a two-state Markov chain with parameters $p^i_{11} \in [0, 1]$ and $p^i_{01} \in [0, 1]$, where $p^i_{jk}$ denotes the probability that arm $i$ transitions from state $j$ to state $k$. The observation model can be described by two parameters $p(z' = 1 \mid y_i' = 0) = q^+$ and $p(z' = 0 \mid y_i' = 1) = q^-$, denoting the false positive and false negative probabilities, respectively.
The belief about the state of the $i$th arm can be represented by a single parameter $p_i \in [0, 1]$ denoting the probability that the arm is in state 1, i.e. $b_i = [1 - p_i, p_i]^T$. In this case it is more convenient to work with the representative parameter $p_i$ instead of the belief vector. We make the following surrogate definition for the belief update equation.
Definition 4.13. Let $\tau_i : B_i \times A \times Z \to B_i$ be the belief update equation for the $i$th arm. An equivalent surrogate belief update equation $\hat{\tau}_i : [0, 1] \times Z \to [0, 1]$ that operates on the real parameter $p_i$ is defined as
$$\hat{\tau}_i(p_i, z') = g(f(p_i), z'), \tag{4.22}$$
where
$$f(p_i) = (p^i_{11} - p^i_{01})\, p_i + p^i_{01} \tag{4.23}$$
is the Chapman-Kolmogorov equation (Equation (2.5)) and
$$g(p_i, z') = \frac{p(z' \mid y_i' = 1)\, p(y_i = 1)}{p(z' \mid b_i)} \tag{4.24}$$
is Bayes' rule for this specific case.
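For the two-parameter observation model, the Bayes update can be written out explicitly. The following is a sketch rather than part of the original definition: it assumes the first argument of $g$, written $p$ here, is the predicted probability that the arm is in state 1, and that the denominators are nonzero.
$$p(z' = 1 \mid b_i) = (1 - q^-)\,p + q^+ (1 - p), \qquad p(z' = 0 \mid b_i) = q^-\, p + (1 - q^+)(1 - p),$$
$$g(p, 1) = \frac{(1 - q^-)\, p}{(1 - q^-)\, p + q^+ (1 - p)}, \qquad g(p, 0) = \frac{q^-\, p}{q^-\, p + (1 - q^+)(1 - p)}.$$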
The following monotonicity results for the Chapman-Kolmogorov equation, the surrogate belief update equation and the prior probability of observations are obtained as a special case of (Lovejoy, 1987, Lemma 1.2.).
Lemma 4.14. Let $p(y_i = 1) \equiv p_i$ denote the prior probability that $Y_i = 1$. If $p^i_{01} < p^i_{11}$, $q^- < 0.5$ and $q^+ < 0.5$, then
1. $f(p_i)$ is increasing in $p_i$ and $f(p_i) \in [p^i_{01}, p^i_{11}]$,
2. $\hat{\tau}_i(p_i, z')$ is increasing in both $p_i$ and $z'$, and
3. $p(z' = 0 \mid b_i)$ is decreasing in $p_i$ and $p(z' = 1 \mid b_i)$ is increasing in $p_i$.
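The three claims of the lemma can be checked numerically on a grid. The sketch below is illustrative only: the function names and the parameter values (0.2, 0.7, 0.1, 0.2) are assumptions chosen to satisfy the conditions of the lemma, and a grid check is of course not a proof.

```python
def f(p, p01, p11):
    """Chapman-Kolmogorov prediction step, Equation (4.23)."""
    return (p11 - p01) * p + p01


def obs_prob(p_pred, z, q_pos, q_neg):
    """Probability of observing z given predicted probability p_pred of state 1."""
    p_z1 = (1.0 - q_neg) * p_pred + q_pos * (1.0 - p_pred)
    return p_z1 if z == 1 else 1.0 - p_z1


def g(p_pred, z, q_pos, q_neg):
    """Bayes' rule, Equation (4.24), applied to the predicted probability."""
    likelihood = (1.0 - q_neg) if z == 1 else q_neg
    return likelihood * p_pred / obs_prob(p_pred, z, q_pos, q_neg)


def tau_hat(p, z, p01, p11, q_pos, q_neg):
    """Surrogate belief update, Equation (4.22)."""
    return g(f(p, p01, p11), z, q_pos, q_neg)


def is_increasing(values):
    """True if the sequence is non-decreasing."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical parameters with p01 < p11, q_pos < 0.5 and q_neg < 0.5.
P01, P11, Q_POS, Q_NEG = 0.2, 0.7, 0.1, 0.2
grid = [k / 100 for k in range(101)]

# Claim 1: f is increasing in p.
claim_1 = is_increasing([f(p, P01, P11) for p in grid])

# Claim 2: tau_hat is increasing in p for each z, and in z for each p.
claim_2 = all(
    is_increasing([tau_hat(p, z, P01, P11, Q_POS, Q_NEG) for p in grid])
    for z in (0, 1)
) and all(
    tau_hat(p, 0, P01, P11, Q_POS, Q_NEG) <= tau_hat(p, 1, P01, P11, Q_POS, Q_NEG)
    for p in grid
)

# Claim 3: p(z' = 1 | b) is increasing in p, so p(z' = 0 | b) is decreasing.
claim_3 = is_increasing([obs_prob(f(p, P01, P11), 1, Q_POS, Q_NEG) for p in grid])

print(claim_1, claim_2, claim_3)
```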
We make the following surrogate definition for the reward function.
Definition 4.15. Let $R_i : B_i \times Y_i \to \mathbb{R}$ denote the reward function for the $i$th arm. Define an equivalent surrogate reward function $\hat{R}_i : [0, 1] \times \{0, 1\} \to \mathbb{R}$ as
$$\hat{R}_i(p_i, y_i) = R_i([1 - p_i, p_i]^T, y_i). \tag{4.25}$$
In the environment monitoring problem, $R_i$ is the MI of the state and observation. Applying Equation (3.1) on page 54, the surrogate reward function in this case is independent of $y_i$ and equal to
$$\hat{R}_i(p_i) = H(f(p_i)) - E_Z\left[H(g(f(p_i), z'))\right], \tag{4.26}$$
where $H(\cdot)$ refers to the entropy of a binary random variable (see Appendix A, Equation A.2). According to (Cover and Thomas, 2006, Thm. 2.7.4), $\hat{R}_i(p_i)$ is concave in $f(p_i)$.
To prove the monotonicity result for the Gittins index, we will need the following lemma determining the conditions when MI is increasing in pi. The lemma is given for MI reward, but can be generalised to other reward functions concave in pi.
Lemma 4.16. Let $p^i_{11} \in [0, 1]$ and $p^i_{01} \in [0, 1]$, let $p(z' = 1 \mid y_i' = 0) = q^+$ and $p(z' = 0 \mid y_i' = 1) = q^-$, and define functions $f$ and $g$ as in Definition 4.13. Let $\hat{R}_i(p_i) = H(f(p_i)) - E_Z\left[H(g(f(p_i), z'))\right]$, which is concave, and let
$$p^* = \operatorname*{argmax}_{p \in [0, 1]} \left( H(p) - E_Z\left[H(g(p, z'))\right] \right). \tag{4.27}$$
If $p^i_{01} < p^i_{11} \le p^*$, then $\hat{R}_i(p_i)$ is increasing in $p_i \in [0, 1]$.
Proof. Let $p_1, p_2 \in [0, 1]$ such that $p_1 > p_2$. By Lemma 4.14, $f(p_1) > f(p_2)$, and both are in the range $[p^i_{01}, p^i_{11}]$. Since $\hat{R}_i$ is concave in $f(p_i)$ and attains its maximum at $f(p_i) = p^*$, it is increasing for $f(p_i) \in [0, p^*]$. Since $f(p_2) < f(p_1) \le p^i_{11} \le p^*$, we have $\hat{R}_i(p_1) > \hat{R}_i(p_2)$.
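The condition of Lemma 4.16 can be checked numerically for a given arm. The sketch below is an illustration under stated assumptions: it replaces the exact argmax in Equation (4.27) with a grid search, measures entropy in bits, and uses hypothetical parameter values (a symmetric observation model with $q^+ = q^- = 0.1$, and $p^i_{01} = 0.1$, $p^i_{11} = 0.4$). All function names are chosen for this example.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary random variable with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def mutual_information(p_pred, q_pos, q_neg):
    """H(p) - E_Z[H(g(p, z'))] for predicted probability p_pred of state 1."""
    p_z1 = (1.0 - q_neg) * p_pred + q_pos * (1.0 - p_pred)
    total = binary_entropy(p_pred)
    # Subtract the expected posterior entropy over both observations.
    for p_z, likelihood in ((p_z1, 1.0 - q_neg), (1.0 - p_z1, q_neg)):
        if p_z > 0.0:
            total -= p_z * binary_entropy(likelihood * p_pred / p_z)
    return total


def p_star(q_pos, q_neg, n=1001):
    """Grid-search approximation of the maximiser in Equation (4.27)."""
    grid = [k / (n - 1) for k in range(n)]
    return max(grid, key=lambda p: mutual_information(p, q_pos, q_neg))


def reward(p, p01, p11, q_pos, q_neg):
    """Surrogate MI reward: mutual information at the predicted probability f(p)."""
    return mutual_information((p11 - p01) * p + p01, q_pos, q_neg)


# Hypothetical arm parameters.
P01, P11, Q_POS, Q_NEG = 0.1, 0.4, 0.1, 0.1

best_p = p_star(Q_POS, Q_NEG)
condition_holds = P01 < P11 <= best_p

rewards = [reward(k / 100, P01, P11, Q_POS, Q_NEG) for k in range(101)]
reward_increasing = all(a < b for a, b in zip(rewards, rewards[1:]))

print(best_p, condition_holds, reward_increasing)
```

For an asymmetric observation model the maximiser generally moves away from 0.5, so the admissible range for $p^i_{11}$ has to be recomputed per arm.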
We next prove the monotonicity of the Gittins index of a POMDP MAB arm for this special case with a concave belief-dependent reward function. The proof technique is similar to (Krishnamurthy and Wahlberg, 2009, Theorem 4.1.).
Theorem 4.17 (Monotonicity of the Gittins index). Consider the $i$th arm of a POMDP MAB with $Y_i = \{0, 1\}$ and $Z = \{0, 1\}$, with $p^i_{11} \in [0, 1]$, $p^i_{01} \in [0, 1]$, $p(z' = 1 \mid y_i' = 0) = q^+$ and $p(z' = 0 \mid y_i' = 1) = q^-$, and a reward function $\hat{R}_i(p_i) = H(f(p_i)) - E_Z\left[H(g(f(p_i), z'))\right]$, and define $p^* \in [0, 1]$ as in Equation (4.27). If
1. $p^i_{01} < p^i_{11} \le p^*$, and
2. $q^- < 0.5$ and $q^+ < 0.5$,
then the Gittins index $v_i(p_i)$ for the arm is increasing in $p_i$.
Proof. Consider the characterisation of the Gittins index presented in Equation (2.37) through the Bellman recursion.
We first show inductively that the value function $V^i(p_i, M)$ is increasing in $p_i$. Consider the following value iteration scheme stated in terms of the surrogate belief update function (Definition 4.13) and surrogate reward function (Definition 4.15):
$$V^i_{k+1}(p_i, M) = \max\Big\{ \hat{R}_i(p_i) + \gamma \sum_{z' \in Z} p(z' \mid b_i)\, V^i_k(\hat{\tau}_i(p_i, z'), M),\; M \Big\}. \tag{4.28}$$
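Now choose $V^i_0(p_i, M) = \hat{R}_i(p_i)$, which by Lemma 4.16 is increasing in $p_i$. Then assume the induction hypothesis: for $p_1 > p_2$, $V^i_k(p_1, M) \ge V^i_k(p_2, M)$, i.e. $V^i_k(p_i, M)$ is increasing in $p_i$. Consider $V^i_{k+1}$ as defined above in Equation (4.28). Lemma 4.16 indicates that $\hat{R}_i(p_1) \ge \hat{R}_i(p_2)$, so we can concentrate on the sum in the equation. Furthermore, by Lemma 4.14, $p(z' = 0 \mid b_1) \le p(z' = 0 \mid b_2)$ and $p(z' = 1 \mid b_1) \ge p(z' = 1 \mid b_2)$, and consequently for any increasing function $\phi : Z \to \mathbb{R}$, $\sum_{z' \in Z} p(z' \mid b_1)\, \phi(z') \ge \sum_{z' \in Z} p(z' \mid b_2)\, \phi(z')$. The function $z' \mapsto V^i_k(\hat{\tau}_i(p_1, z'), M)$ is increasing, because $\hat{\tau}_i$ is increasing in $z'$ by Lemma 4.14 and $V^i_k$ is increasing by the induction hypothesis. We may thus write
$$\sum_{z' \in Z} p(z' \mid b_1)\, V^i_k(\hat{\tau}_i(p_1, z'), M) \ge \sum_{z' \in Z} p(z' \mid b_2)\, V^i_k(\hat{\tau}_i(p_1, z'), M), \tag{4.29}$$
and by Lemma 4.14 $\hat{\tau}_i(p_1, z') \ge \hat{\tau}_i(p_2, z')$, by which we further bound the above expression from below:
$$\ge \sum_{z' \in Z} p(z' \mid b_2)\, V^i_k(\hat{\tau}_i(p_2, z'), M). \tag{4.30}$$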
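The dominance inequality used to obtain (4.29) can be verified directly in the binary case. As a short worked step, write $\Delta = p(z' = 1 \mid b_1) - p(z' = 1 \mid b_2)$, which is nonnegative by Lemma 4.14, and use $p(z' = 0 \mid b) = 1 - p(z' = 1 \mid b)$:
$$\sum_{z' \in Z} \big[ p(z' \mid b_1) - p(z' \mid b_2) \big]\, \phi(z') = \Delta\, \phi(1) - \Delta\, \phi(0) = \Delta\, \big( \phi(1) - \phi(0) \big) \ge 0,$$
since $\phi(1) \ge \phi(0)$ for an increasing $\phi$.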
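Combining the above, and noting that $\max\{\cdot, M\}$ is nondecreasing in its first argument, we have shown
$$\begin{aligned} V^i_{k+1}(p_1, M) &= \max\Big\{ \hat{R}_i(p_1) + \gamma \sum_{z' \in Z} p(z' \mid b_1)\, V^i_k(\hat{\tau}_i(p_1, z'), M),\; M \Big\} \\ &\ge \max\Big\{ \hat{R}_i(p_2) + \gamma \sum_{z' \in Z} p(z' \mid b_2)\, V^i_k(\hat{\tau}_i(p_2, z'), M),\; M \Big\} = V^i_{k+1}(p_2, M). \end{aligned} \tag{4.31}$$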
As value iteration converges to a fixed point, i.e. $V^i_{k+1}(p_i, M) \to V^i(p_i, M)$ as $k \to \infty$, we conclude that $V^i(p_i, M)$ is increasing in $p_i$. The rest of the proof follows the same steps as (Krishnamurthy and Wahlberg, 2009, Theorem 4.1.). According to Equation (2.37) the Gittins index is
$$v_i(p_i) = \min\{ M \in \mathbb{R} : V^i(p_i, M) - M = 0 \}. \tag{4.32}$$
Suppose again that $p_1 > p_2$, implying $V^i(p_1, M) \ge V^i(p_2, M)$ for all $M$. Further, $V^i(p_1, v_i(p_2)) - v_i(p_2) \ge V^i(p_2, v_i(p_2)) - v_i(p_2) = 0$. According to (Ross, 1983, Lemma 2.1), $V^i(p_i, M) - M$ is decreasing in $M$. It follows that
$$\min\{ M : V^i(p_1, M) - M = 0 \} > \min\{ M : V^i(p_2, M) - M = 0 \}. \tag{4.33}$$
Thus, $v_i(p_1) > v_i(p_2)$.
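The recursion (4.28) and the characterisation (4.32) suggest a simple numerical procedure for approximating the index. The sketch below is one possible discretisation, not the procedure used in the original work: it assumes a uniform grid over $p_i$, linear interpolation of the value function between grid points, a finite grid of candidate values of $M$, and a small numerical tolerance in place of the exact equality in (4.32). The function names and all parameter values are hypothetical.

```python
import numpy as np


def binary_entropy(p):
    """Entropy in bits of Bernoulli(p), elementwise; clipped to avoid log(0)."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)


def arm_model(p, p01, p11, q_pos, q_neg):
    """Reward, observation probabilities and updated beliefs on the grid p."""
    f = (p11 - p01) * p + p01                     # Equation (4.23)
    pz1 = (1.0 - q_neg) * f + q_pos * (1.0 - f)   # p(z' = 1 | b)
    pz0 = 1.0 - pz1                               # p(z' = 0 | b)
    tau1 = (1.0 - q_neg) * f / pz1                # Equation (4.22), z' = 1
    tau0 = q_neg * f / pz0                        # Equation (4.22), z' = 0
    reward = binary_entropy(f) - (pz0 * binary_entropy(tau0)
                                  + pz1 * binary_entropy(tau1))
    return reward, (pz0, pz1), (tau0, tau1)


def value_function(p, M, model, gamma, tol=1e-10, max_iter=10_000):
    """Value iteration for Equation (4.28) with a fixed M."""
    reward, (pz0, pz1), (tau0, tau1) = model
    V = reward.copy()                             # V_0 equals the reward
    for _ in range(max_iter):
        cont = reward + gamma * (pz0 * np.interp(tau0, p, V)
                                 + pz1 * np.interp(tau1, p, V))
        V_new = np.maximum(cont, M)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


def gittins_index(p, p01, p11, q_pos, q_neg, gamma=0.9, n_M=101):
    """Smallest candidate M with V(p, M) - M numerically zero, per grid point."""
    model = arm_model(p, p01, p11, q_pos, q_neg)
    # Above max reward / (1 - gamma), continuing cannot beat the value M.
    M_grid = np.linspace(0.0, model[0].max() / (1.0 - gamma), n_M)
    index = np.full(p.shape, M_grid[-1])
    for M in M_grid[::-1]:                        # sweep from large to small M
        V = value_function(p, M, model, gamma)
        index = np.where(V - M <= 1e-9, M, index)
    return index


# Hypothetical arm satisfying the conditions of Theorem 4.17 (p* = 0.5 here).
p_grid = np.linspace(0.0, 1.0, 41)
index = gittins_index(p_grid, p01=0.1, p11=0.4, q_pos=0.1, q_neg=0.1)
print(bool(np.all(np.diff(index) >= 0.0)))
```

The resolution of the computed index is limited by the spacing of the candidate values of $M$, so neighbouring grid points may share the same value; a bisection over $M$ per belief point would refine this at a higher computational cost.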
Theorem 4.17 says that the Gittins index for an arm satisfying the required properties is increasing in the probability $p_i$ that the arm state is 1. Suppose that $p^i_{01}$, $p^i_{11}$, $q^-$ and $q^+$ are the same for all arms. As the optimal policy in a MAB is to play the arm with the greatest Gittins index, the optimal policy in this case assumes a simple form: select the arm $\operatorname{argmax}_i\, p_i$.
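The resulting policy needs no index computation at run time. A minimal sketch follows, assuming identical parameters for all arms and, as in a classical MAB, that the beliefs of arms that are not played stay unchanged; the function names and numerical values are hypothetical.

```python
def tau_hat(p, z, p01, p11, q_pos, q_neg):
    """Surrogate belief update of Definition 4.13 for a binary observation z."""
    f = (p11 - p01) * p + p01                      # prediction, Equation (4.23)
    p_z1 = (1.0 - q_neg) * f + q_pos * (1.0 - f)   # p(z' = 1 | b)
    if z == 1:
        return (1.0 - q_neg) * f / p_z1
    return q_neg * f / (1.0 - p_z1)


def select_arm(beliefs):
    """Index of the arm with the largest probability of being in state 1."""
    return max(range(len(beliefs)), key=lambda i: beliefs[i])


def play(beliefs, z, p01, p11, q_pos, q_neg):
    """Play the selected arm and update only its belief with observation z."""
    arm = select_arm(beliefs)
    updated = list(beliefs)
    updated[arm] = tau_hat(beliefs[arm], z, p01, p11, q_pos, q_neg)
    return arm, updated


# Hypothetical beliefs for three arms and an observation z' = 1.
beliefs = [0.2, 0.7, 0.5]
arm, new_beliefs = play(beliefs, z=1, p01=0.1, p11=0.4, q_pos=0.1, q_neg=0.1)
print(arm, new_beliefs)
```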