Proof. This proof follows the proof of Corollary 6.14. We write the optimal value explicitly as in (4.2). For a fixed $m$, all involved quantities are reflective-oracle-computable. Moreover, this quantity is monotone increasing in $m$, and the tail sum from $m + 1$ to $\infty$ is bounded by $\Gamma_{m+1}$, which is computable according to Assumption 4.6a and converges to 0 as $m \to \infty$. Therefore we can enumerate all rationals above and below $V^*_\nu$.
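The bracketing argument in this proof can be illustrated numerically. The following is a minimal sketch under simplifying assumptions that are not taken from the text: a fixed deterministic reward sequence replaces the expectation over histories, the discount is the hypothetical choice $\gamma_k = 2^{-k}$ (so that $\Gamma_{m+1} = 2^{-m}$), and the names `discount`, `tail`, `bracket_value`, and `reward` are illustrative only.

```python
from fractions import Fraction


def discount(k):
    """Hypothetical discount gamma_k = 2^-k for k = 1, 2, ... (an assumption)."""
    return Fraction(1, 2 ** k)


def tail(m):
    """Gamma_{m+1} = sum_{k > m} 2^-k = 2^-m, in closed form for this discount."""
    return Fraction(1, 2 ** m)


def bracket_value(reward, m):
    """Return rational (lower, upper) bounds on sum_k gamma_k * reward(k).

    The partial sum up to m is a lower bound that increases with m. Adding
    the tail bound Gamma_{m+1} gives an upper bound because rewards lie
    in [0, 1].
    """
    partial = sum(discount(k) * reward(k) for k in range(1, m + 1))
    return partial, partial + tail(m)


def reward(k):
    """Toy reward sequence: 1 on odd steps, 0 on even steps."""
    return Fraction(k % 2)


# The bracket shrinks around the true value 2/3 as m grows.
for m in (1, 2, 4, 8):
    lower, upper = bracket_value(reward, m)
    print(m, lower, upper)
```

Every rational below the lower bound or above the upper bound is eventually listed as $m$ grows, which is the sense in which the value is approximated from both sides.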
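Proof of Theorem 7.19. According to Lemma 7.20 the optimal value function $V^*_\nu$ is reflective-oracle-computable. Hence there is a probabilistic Turing machine $T$ such that
\[
\lambda_T^O(1 \mid æ_{<t}) = \frac{V^*_\nu(æ_{<t}\alpha) - V^*_\nu(æ_{<t}\beta) + 1}{2}.
\]
We define the policy
\[
\pi(æ_{<t}) :=
\begin{cases}
\alpha & \text{if } O(T, æ_{<t}, 1/2) = 1, \text{ and} \\
\beta & \text{if } O(T, æ_{<t}, 1/2) = 0.
\end{cases}
\]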
This policy is stochastic because the answer of the oracle $O$ is stochastic.
It remains to show that $\pi$ is a $\nu$-optimal policy. If $V^*_\nu(æ_{<t}\alpha) > V^*_\nu(æ_{<t}\beta)$, then $\lambda_T^O(1 \mid æ_{<t}) > 1/2$, thus $O(T, æ_{<t}, 1/2) = 1$ since $O$ is reflective, and hence $\pi$ takes action $\alpha$. Conversely, if $V^*_\nu(æ_{<t}\alpha) < V^*_\nu(æ_{<t}\beta)$, then $\lambda_T^O(1 \mid æ_{<t}) < 1/2$, thus $O(T, æ_{<t}, 1/2) = 0$ since $O$ is reflective, and hence $\pi$ takes action $\beta$. Lastly, if $V^*_\nu(æ_{<t}\alpha) = V^*_\nu(æ_{<t}\beta)$, then both actions are optimal and thus it does not matter which action is returned by policy $\pi$. (This is the case where the oracle may randomize.)
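The case analysis above can be mimicked with a toy stand-in for the oracle. This is only a sketch: a genuine reflective oracle is not computable, so the code assumes the two optimal values are simply handed to it as numbers. The names `toy_oracle` and `policy` are hypothetical, and the fair coin used in the tie case is an arbitrary choice made for this illustration.

```python
import random
from fractions import Fraction

HALF = Fraction(1, 2)


def toy_oracle(lam, p, rng=random):
    """Toy stand-in for the query O(T, x, p), NOT a real reflective oracle.

    `lam` plays the role of lambda_T^O(1 | x). The answer is 1 if lam > p,
    0 if lam < p, and a coin flip if lam == p (the randomized case).
    """
    if lam > p:
        return 1
    if lam < p:
        return 0
    return rng.randint(0, 1)


def policy(v_alpha, v_beta, rng=random):
    """Pick an action from the two optimal values (assumed known here)."""
    # lam = (V(alpha) - V(beta) + 1) / 2 lies in [0, 1] for values in [0, 1].
    lam = (Fraction(v_alpha) - Fraction(v_beta) + 1) / 2
    return "alpha" if toy_oracle(lam, HALF, rng) == 1 else "beta"


print(policy(Fraction(3, 4), Fraction(1, 4)))  # alpha has the higher value
print(policy(Fraction(1, 4), Fraction(3, 4)))  # beta has the higher value
```

Exact rationals are used so that the tie case is detected exactly; with floating-point values a tiny difference between the two values could be rounded away.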
7.2.3 Solution to the Grain of Truth Problem
Together, Proposition 7.18 and Theorem 7.19 provide the necessary ingredients to solve the grain of truth problem (Problem 7.1).
Corollary 7.21 (Solution to the Grain of Truth Problem). For every lower semicomputable prior $w \in \Delta\mathcal{M}^O_{\text{refl}}$ the Bayes optimal policy $\pi^*_\xi$ is reflective-oracle-computable, where $\xi$ is the Bayes-mixture corresponding to $w$ defined in (7.2).
Proof. From Proposition 7.18 and Theorem 7.19.
Hence the environment class $\mathcal{M}^O_{\text{refl}}$ contains any reflective-oracle-computable modification of the Bayes optimal policy $\pi^*_\xi$. In particular, this includes computable multi-agent environments that contain other Bayesian agents over the class $\mathcal{M}^O_{\text{refl}}$. So any Bayesian agent over the class $\mathcal{M}^O_{\text{refl}}$ has a grain of truth even though the environment may contain other Bayesian agents of equal power. We proceed to sketch the implications for multi-agent environments in the next section.
7.3 Multi-Agent Environments
Figure 7.2: Agents $\pi_1, \ldots, \pi_n$ interacting in a multi-agent environment $\sigma$. (Diagram: each agent $\pi_i$ sends its action $a^i_t$ to the environment and receives its percept $e^i_t$.)
In a multi-agent environment there are $n$ agents each taking sequential actions from the finite action space $\mathcal{A}$. In each time step $t = 1, 2, \ldots$, the environment receives action $a^i_t$ from agent $i$ and outputs $n$ percepts $e^1_t, \ldots, e^n_t \in \mathcal{E}$, one for each agent. Each percept $e^i_t = (o^i_t, r^i_t)$ contains an observation $o^i_t$ and a reward $r^i_t \in [0, 1]$. Importantly, agent $i$ only sees its own action $a^i_t$ and its own percept $e^i_t$ (see Figure 7.2). We use the shorthand notation $a_t := (a^1_t, \ldots, a^n_t)$ and $e_t := (e^1_t, \ldots, e^n_t)$ and denote $æ^i_{<t} = a^i_1 e^i_1 \ldots a^i_{t-1} e^i_{t-1}$ and $æ_{<t} = a_1 e_1 \ldots a_{t-1} e_{t-1}$. Formally, multi-agent environments are defined as follows.
Definition 7.22 (Multi-Agent Environment). A multi-agent environment is a function
\[
\sigma : (\mathcal{A}^n \times \mathcal{E}^n)^* \times \mathcal{A}^n \to \Delta(\mathcal{E}^n).
\]
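Together with the policies $\pi_1, \ldots, \pi_n$ the multi-agent environment $\sigma$ induces a history distribution $\sigma^{\pi_{1:n}}$ where, with $\epsilon$ denoting the empty history,
\[
\begin{aligned}
\sigma^{\pi_{1:n}}(\epsilon) &:= 1, \\
\sigma^{\pi_{1:n}}(æ_{1:t}) &:= \sigma^{\pi_{1:n}}(æ_{<t} a_t)\, \sigma(e_t \mid æ_{<t} a_t), \\
\sigma^{\pi_{1:n}}(æ_{<t} a_t) &:= \sigma^{\pi_{1:n}}(æ_{<t}) \prod_{i=1}^{n} \pi_i(a^i_t \mid æ^i_{<t}).
\end{aligned}
\]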
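The recursive definition of the history distribution can be unrolled into a loop over time steps. The sketch below is an illustration under assumptions: the function signatures (`sigma(e_t, joint_history, a_t)` and `policy(action, own_history)`), the uniform policies, and the toy coordination environment are all hypothetical and not taken from the text.

```python
def history_probability(sigma, policies, history):
    """Probability of a joint history under the induced history distribution.

    history  -- list of (a_t, e_t) pairs; a_t and e_t are tuples of length n
    policies -- policies[i](action, own_history) -> probability
    sigma    -- sigma(e_t, joint_history, a_t) -> probability
    """
    prob = 1.0  # base case: the empty history has probability 1
    for t, (a_t, e_t) in enumerate(history):
        past = history[:t]
        for i, pi in enumerate(policies):
            # Agent i conditions only on its own past actions and percepts.
            own_past = [(a[i], e[i]) for a, e in past]
            prob *= pi(a_t[i], own_past)
        # The environment conditions on the full joint history.
        prob *= sigma(e_t, past, a_t)
    return prob


# Hypothetical toy setup: two agents with uniformly random policies over two
# actions, and an environment that rewards both agents iff their actions agree.
def uniform(action, own_history):
    return 0.5


def coordination(e_t, joint_history, a_t):
    reward = 1 if a_t[0] == a_t[1] else 0
    return 1.0 if e_t == (reward, reward) else 0.0


h = [(("alpha", "alpha"), (1, 1)), (("alpha", "beta"), (0, 0))]
print(history_probability(coordination, [uniform, uniform], h))  # 0.0625
```

Each time step contributes one factor per agent policy and one factor for the environment, mirroring the two recursive cases of the definition.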
Agent $i$ acts in a subjective environment $\sigma_i$ given by joining the multi-agent environment $\sigma$ with the policies $\pi_1, \ldots, \pi_n$ and marginalizing over the histories that $\pi_i$ does not see. Together with policy $\pi_i$, the environment $\sigma_i$ yields a distribution over the histories of agent $i$
\[
\sigma_i^{\pi_i}(æ^i_{<t}) := \sum_{æ^j_{<t},\, j \neq i} \sigma^{\pi_{1:n}}(æ_{<t}).
\]
We get the definition of the subjective environment $\sigma_i$ with the identity $\sigma_i(e^i_t \mid æ^i_{<t} a^i_t) := \sigma_i^{\pi_i}(e^i_t \mid æ^i_{<t} a^i_t)$. The subjective environment $\sigma_i$ depends on $\pi_i$ because other policies' actions may depend on the actions of $\pi_i$. It is crucial to note that the subjective environment $\sigma_i$ and the policy $\pi_i$ are ordinary environments and policies, so we can use
Our definition of a multi-agent environment is very general and encompasses most of game theory. It allows for cooperative, competitive, and mixed games; infinitely repeated games or any (infinite-length) extensive form games with finitely many players.
Example 7.23 (Matching Pennies). In the game of matching pennies there are two agents ($n = 2$), and two actions $\mathcal{A} = \{\alpha, \beta\}$ representing the two sides of a penny. In each time step agent 1 wins if the two actions are identical and agent 2 wins if the two actions are different. The payoff matrix is as follows.
\[
\begin{array}{c|cc}
 & \alpha & \beta \\
\hline
\alpha & 1,0 & 0,1 \\
\beta & 0,1 & 1,0
\end{array}
\]
We use $\mathcal{E} = \{0, 1\}$ to be the set of rewards (observations are vacuous) and define the multi-agent environment $\sigma$ to give reward 1 to agent 1 iff $a^1_t = a^2_t$ (0 otherwise) and reward 1 to agent 2 iff $a^1_t \neq a^2_t$ (0 otherwise). Formally,
\[
\sigma(r^1_t r^2_t \mid æ_{<t} a_t) :=
\begin{cases}
1 & \text{if } r^1_t = 1,\ r^2_t = 0,\ a^1_t = a^2_t, \\
1 & \text{if } r^1_t = 0,\ r^2_t = 1,\ a^1_t \neq a^2_t, \text{ and} \\
0 & \text{otherwise.}
\end{cases}
\]
Let $\pi_\alpha$ denote the policy that always takes action $\alpha$. If two agents each using policy $\pi_\alpha$ play matching pennies, agent 1 wins in every step. Formally, setting $\pi_1 := \pi_2 := \pi_\alpha$ we get a history distribution that assigns probability one to the history $\alpha\alpha 1 0\, \alpha\alpha 1 0 \ldots$.
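The subjective environment of agent 1 is
\[
\sigma_1(r^1_t \mid æ^1_{<t} a^1_t) =
\begin{cases}
1 & \text{if } r^1_t = 1,\ a^1_t = \alpha, \\
1 & \text{if } r^1_t = 0,\ a^1_t = \beta, \text{ and} \\
0 & \text{otherwise.}
\end{cases}
\]
Therefore policy $\pi_\alpha$ is optimal in agent 1's subjective environment. ◇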
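The subjective environment in this example can be checked by summing out agent 2's action and reward. The sketch below assumes the one-step form of the marginalization, which suffices here because both the environment and $\pi_\alpha$ ignore the history; the names `sigma`, `pi_alpha`, and `subjective_sigma1` are hypothetical.

```python
from itertools import product

ACTIONS = ("alpha", "beta")
REWARDS = (0, 1)


def sigma(r1, r2, a1, a2):
    """Matching pennies: agent 1 wins on a match, agent 2 on a mismatch."""
    if a1 == a2:
        return 1.0 if (r1, r2) == (1, 0) else 0.0
    return 1.0 if (r1, r2) == (0, 1) else 0.0


def pi_alpha(action):
    """The policy that always plays alpha."""
    return 1.0 if action == "alpha" else 0.0


def subjective_sigma1(r1, a1, pi2=pi_alpha):
    """sigma_1(r1 | a1): marginalize over agent 2's action and reward."""
    return sum(pi2(a2) * sigma(r1, r2, a1, a2)
               for a2, r2 in product(ACTIONS, REWARDS))


# Tabulate agent 1's subjective environment against pi_alpha.
for a1, r1 in product(ACTIONS, REWARDS):
    print(a1, r1, subjective_sigma1(r1, a1))
```

Against $\pi_\alpha$, the table assigns probability one to reward 1 after action $\alpha$ and to reward 0 after action $\beta$, in agreement with the case distinction above.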
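Definition 7.24 ($\varepsilon$-Best Response). A policy $\pi_i$ acting in multi-agent environment $\sigma$ with policies $\pi_1, \ldots, \pi_n$ is an $\varepsilon$-best response after history $æ^i_{<t}$ iff
\[
V^*_{\sigma_i}(æ^i_{<t}) - V^{\pi_i}_{\sigma_i}(æ^i_{<t}) < \varepsilon.
\]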
If at some time step $t$, all agents' policies are $\varepsilon$-best responses, we have an $\varepsilon$-Nash equilibrium. The property of multi-agent systems that is analogous to asymptotic optimality is convergence to an $\varepsilon$-Nash equilibrium.
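As a concrete reading of these definitions, the following sketch checks the $\varepsilon$-best-response condition in matching pennies. It rests on a strong simplifying assumption that is not part of the text: both agents use stationary, memoryless policies, so each agent's value is taken to be its expected per-step reward. The names `value_agent1`, `value_agent2`, `gaps`, and `is_eps_nash` are hypothetical.

```python
from fractions import Fraction


def value_agent1(p, q):
    """Expected per-step reward of agent 1, who wins on a match.

    p, q -- probabilities that agent 1 resp. agent 2 plays alpha.
    """
    return p * q + (1 - p) * (1 - q)


def value_agent2(p, q):
    """Expected per-step reward of agent 2, who wins on a mismatch."""
    return 1 - value_agent1(p, q)


def gaps(p, q):
    """Return (gap1, gap2): optimal value minus achieved value per agent."""
    # Each value is linear in the agent's own probability, so the optimum
    # against a fixed opponent is attained at a pure action (0 or 1).
    best1 = max(value_agent1(x, q) for x in (0, 1))
    best2 = max(value_agent2(p, y) for y in (0, 1))
    return best1 - value_agent1(p, q), best2 - value_agent2(p, q)


def is_eps_nash(p, q, eps):
    """True iff both policies are eps-best responses to each other."""
    return all(gap < eps for gap in gaps(p, q))


half = Fraction(1, 2)
print(gaps(half, half))  # both agents mix uniformly
print(gaps(1, 1))        # both agents always play alpha
```

Under this simplification, uniform mixing by both agents leaves neither agent any gap, whereas if both always play $\alpha$, agent 2 has a gap of 1 and so is not an $\varepsilon$-best responder for any $\varepsilon \le 1$.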