
Proof. This proof follows the proof of Corollary 6.14. We write the optimal value explicitly as in (4.2). For a fixed $m$, all involved quantities are reflective-oracle-computable. Moreover, this quantity is monotone increasing in $m$, and the tail sum from $m+1$ to $\infty$ is bounded by $\Gamma_{m+1}$, which is computable according to Assumption 4.6a and converges to 0 as $m \to \infty$. Therefore we can enumerate all rationals above and below $V^*_\nu$.
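The sandwich argument in this proof can be illustrated numerically. The following is a minimal sketch and not the construction used in the proof: it assumes geometric discounting (so that $\Gamma_{m+1}$ has a closed form), rewards in $[0,1]$, and a hypothetical function `expected_reward(k)` standing in for the per-step quantities. The truncated sum serves as a lower bound, adding the tail bound gives an upper bound, and the gap shrinks to 0 as $m$ grows.

```python
from fractions import Fraction

def tail_bound(gamma, m):
    """Gamma_{m+1} = sum_{k=m+1}^inf gamma^k, assuming geometric discounting."""
    return gamma ** (m + 1) / (1 - gamma)

def value_bounds(expected_reward, gamma, m):
    """Return (lower, upper) bounds on sum_{k=1}^inf gamma^k * expected_reward(k).

    expected_reward is a hypothetical stand-in with values in [0, 1].
    """
    partial = sum(gamma ** k * expected_reward(k) for k in range(1, m + 1))
    return partial, partial + tail_bound(gamma, m)

# Hypothetical reward stream: expected reward 1/2 at every time step.
gamma = Fraction(1, 2)
for m in (1, 5, 10):
    lower, upper = value_bounds(lambda k: Fraction(1, 2), gamma, m)
    print(m, lower, upper)
```

Exact rationals are used so that the printed bounds are themselves rationals below and above the limit, mirroring the enumeration in the proof.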

Proof of Theorem 7.19. According to Lemma 7.20 the optimal value function $V^*_\nu$ is reflective-oracle-computable. Hence there is a probabilistic Turing machine $T$ such that

$$\lambda^O_T(1 \mid æ_{<t}) = \frac{V^*_\nu(æ_{<t}\alpha) - V^*_\nu(æ_{<t}\beta) + 1}{2}.$$

We define the policy

$$\pi(æ_{<t}) := \begin{cases} \alpha & \text{if } O(T, æ_{<t}, 1/2) = 1, \text{ and} \\ \beta & \text{if } O(T, æ_{<t}, 1/2) = 0. \end{cases}$$

This policy is stochastic because the answer of the oracle $O$ is stochastic.

It remains to show that $\pi$ is a $\nu$-optimal policy. If $V^*_\nu(æ_{<t}\alpha) > V^*_\nu(æ_{<t}\beta)$, then $\lambda^O_T(1 \mid æ_{<t}) > 1/2$, thus $O(T, æ_{<t}, 1/2) = 1$ since $O$ is reflective, and hence $\pi$ takes action $\alpha$. Conversely, if $V^*_\nu(æ_{<t}\alpha) < V^*_\nu(æ_{<t}\beta)$, then $\lambda^O_T(1 \mid æ_{<t}) < 1/2$, thus $O(T, æ_{<t}, 1/2) = 0$ since $O$ is reflective, and hence $\pi$ takes action $\beta$. Lastly, if $V^*_\nu(æ_{<t}\alpha) = V^*_\nu(æ_{<t}\beta)$, then both actions are optimal and thus it does not matter which action is returned by policy $\pi$. (This is the case where the oracle may randomize.)
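The case distinction in this proof can be mimicked by a toy program. This is only a sketch: a reflective oracle is not computable, so `oracle_answer` is a hypothetical stand-in that is handed the probability $\lambda^O_T(1 \mid æ_{<t})$ directly, and the fair coin used at the threshold is an assumption of the sketch (the oracle may randomize in any way there). The two values are assumed to lie in $[0,1]$.

```python
import random

ALPHA, BETA = "alpha", "beta"

def oracle_answer(prob_one, p, rng):
    """Toy stand-in for the oracle query O(T, history, p).

    prob_one plays the role of lambda_T^O(1 | history). The answer is forced
    whenever prob_one != p; at equality this sketch flips a fair coin.
    """
    if prob_one > p:
        return 1
    if prob_one < p:
        return 0
    return rng.randint(0, 1)

def policy(v_alpha, v_beta, rng=random):
    """Take alpha iff the (toy) oracle answers 1 at threshold 1/2."""
    prob_one = (v_alpha - v_beta + 1) / 2  # in [0, 1] when both values are in [0, 1]
    return ALPHA if oracle_answer(prob_one, 0.5, rng) == 1 else BETA

print(policy(0.7, 0.2))  # alpha has the strictly larger value
print(policy(0.2, 0.7))  # beta has the strictly larger value
print(policy(0.5, 0.5))  # tie: either action may be returned
```

Only the tie case involves randomness, which matches the remark that this is where the oracle may randomize.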

7.2.3 Solution to the Grain of Truth Problem

Together, Proposition 7.18 and Theorem 7.19 provide the necessary ingredients to solve the grain of truth problem (Problem 7.1).

Corollary 7.21 (Solution to the Grain of Truth Problem). For every lower semicomputable prior $w \in \Delta\mathcal{M}^O_{\text{refl}}$ the Bayes optimal policy $\pi^*_\xi$ is reflective-oracle-computable, where $\xi$ is the Bayes-mixture corresponding to $w$ defined in (7.2).

Proof. From Proposition 7.18 and Theorem 7.19.

Hence the environment class $\mathcal{M}^O_{\text{refl}}$ contains any reflective-oracle-computable modification of the Bayes optimal policy $\pi^*_\xi$. In particular, this includes computable multi-agent environments that contain other Bayesian agents over the class $\mathcal{M}^O_{\text{refl}}$. So any Bayesian agent over the class $\mathcal{M}^O_{\text{refl}}$ has a grain of truth even though the environment may contain other Bayesian agents of equal power. We proceed to sketch the implications for multi-agent environments in the next section.

7.3 Multi-Agent Environments

[Figure 7.2: Agents $\pi_1, \ldots, \pi_n$ interacting in a multi-agent environment $\sigma$. At time step $t$, agent $\pi_i$ sends action $a^i_t$ to $\sigma$ and receives percept $e^i_t$.]

In a multi-agent environment there are $n$ agents, each taking sequential actions from the finite action space $\mathcal{A}$. In each time step $t = 1, 2, \ldots$, the environment receives action $a^i_t$ from agent $i$ and outputs $n$ percepts $e^1_t, \ldots, e^n_t \in \mathcal{E}$, one for each agent. Each percept $e^i_t = (o^i_t, r^i_t)$ contains an observation $o^i_t$ and a reward $r^i_t \in [0,1]$. Importantly, agent $i$ only sees its own action $a^i_t$ and its own percept $e^i_t$ (see Figure 7.2). We use the shorthand notation $a_t := (a^1_t, \ldots, a^n_t)$ and $e_t := (e^1_t, \ldots, e^n_t)$ and denote $æ^i_{<t} = a^i_1 e^i_1 \ldots a^i_{t-1} e^i_{t-1}$ and $æ_{<t} = a_1 e_1 \ldots a_{t-1} e_{t-1}$. Formally, multi-agent environments are defined as follows.

Definition 7.22 (Multi-Agent Environment). A multi-agent environment is a function

$$\sigma : (\mathcal{A}^n \times \mathcal{E}^n)^* \times \mathcal{A}^n \to \Delta(\mathcal{E}^n).$$

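Together with the policies $\pi_1, \ldots, \pi_n$ the multi-agent environment $\sigma$ induces a history distribution $\sigma^{\pi_{1:n}}$ where

$$\begin{aligned}
\sigma^{\pi_{1:n}}(\epsilon) &:= 1, \\
\sigma^{\pi_{1:n}}(æ_{1:t}) &:= \sigma^{\pi_{1:n}}(æ_{<t} a_t)\, \sigma(e_t \mid æ_{<t} a_t), \\
\sigma^{\pi_{1:n}}(æ_{<t} a_t) &:= \sigma^{\pi_{1:n}}(æ_{<t}) \prod_{i=1}^{n} \pi_i(a^i_t \mid æ^i_{<t}),
\end{aligned}$$

with $\epsilon$ denoting the empty history.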

Agent $i$ acts in a subjective environment $\sigma_i$ given by joining the multi-agent environment $\sigma$ with the policies $\pi_1, \ldots, \pi_n$ and marginalizing over the histories that $\pi_i$ does not see. Together with policy $\pi_i$, the environment $\sigma_i$ yields a distribution over the histories of agent $i$:

$$\sigma_i^{\pi_i}(æ^i_{<t}) := \sum_{æ^j_{<t},\, j \neq i} \sigma^{\pi_{1:n}}(æ_{<t}).$$

We get the definition of the subjective environment $\sigma_i$ with the identity $\sigma_i(e^i_t \mid æ^i_{<t} a^i_t) := \sigma_i^{\pi_i}(e^i_t \mid æ^i_{<t} a^i_t)$. The subjective environment $\sigma_i$ depends on $\pi_i$ because other policies' actions may depend on the actions of $\pi_i$. It is crucial to note that the subjective environment $\sigma_i$ and the policy $\pi_i$ are ordinary environments and policies, so we can use


Our definition of a multi-agent environment is very general and encompasses most of game theory. It allows for cooperative, competitive, and mixed games; infinitely repeated games or any (infinite-length) extensive form games with finitely many players.

Example 7.23 (Matching Pennies). In the game of matching pennies there are two agents ($n = 2$) and two actions $\mathcal{A} = \{\alpha, \beta\}$ representing the two sides of a penny. In each time step agent 1 wins if the two actions are identical and agent 2 wins if the two actions are different. The payoff matrix is as follows.

$$\begin{array}{c|cc}
 & \alpha & \beta \\ \hline
\alpha & 1,0 & 0,1 \\
\beta & 0,1 & 1,0
\end{array}$$

We use $\mathcal{E} = \{0, 1\}$ as the set of rewards (observations are vacuous) and define the multi-agent environment $\sigma$ to give reward 1 to agent 1 iff $a^1_t = a^2_t$ (0 otherwise) and reward 1 to agent 2 iff $a^1_t \neq a^2_t$ (0 otherwise). Formally,

$$\sigma(r^1_t r^2_t \mid æ_{<t} a_t) := \begin{cases}
1 & \text{if } r^1_t = 1,\ r^2_t = 0,\ a^1_t = a^2_t, \\
1 & \text{if } r^1_t = 0,\ r^2_t = 1,\ a^1_t \neq a^2_t, \text{ and} \\
0 & \text{otherwise.}
\end{cases}$$

Let $\pi_\alpha$ denote the policy that always takes action $\alpha$. If two agents each using policy $\pi_\alpha$ play matching pennies, agent 1 wins in every step. Formally, setting $\pi_1 := \pi_2 := \pi_\alpha$ we get a history distribution that assigns probability one to the history $\alpha\alpha 1 0 \alpha\alpha 1 0 \ldots$.

The subjective environment of agent 1 is

$$\sigma_1(r^1_t \mid æ^1_{<t} a^1_t) = \begin{cases}
1 & \text{if } r^1_t = 1,\ a^1_t = \alpha, \\
1 & \text{if } r^1_t = 0,\ a^1_t = \beta, \text{ and} \\
0 & \text{otherwise.}
\end{cases}$$

Therefore policy $\pi_\alpha$ is optimal in agent 1's subjective environment.
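The matching pennies example can be checked mechanically. The following is a minimal sketch under a simplifying assumption that is not part of Definition 7.22: both policies are memoryless, so the marginalization over what agent 1 does not see reduces to summing out agent 2's current action and reward. All function names are hypothetical.

```python
from itertools import product

ACTIONS = ("alpha", "beta")
REWARDS = (0, 1)

def sigma(r1, r2, a1, a2):
    """Matching pennies: sigma(r1 r2 | a1 a2); the past history is ignored."""
    if a1 == a2:
        return 1.0 if (r1, r2) == (1, 0) else 0.0
    return 1.0 if (r1, r2) == (0, 1) else 0.0

def pi_alpha(action):
    """Memoryless policy that always takes action alpha."""
    return 1.0 if action == "alpha" else 0.0

def history_probability(history, pi1, pi2):
    """Probability of a joint history [(a1, a2, r1, r2), ...] under both policies."""
    prob = 1.0  # the empty history has probability 1
    for a1, a2, r1, r2 in history:
        prob *= pi1(a1) * pi2(a2) * sigma(r1, r2, a1, a2)
    return prob

def subjective_env_agent1(r1, a1, pi2):
    """sigma_1(r1 | a1): sum out agent 2's action and reward."""
    return sum(pi2(a2) * sigma(r1, r2, a1, a2)
               for a2, r2 in product(ACTIONS, REWARDS))

step = ("alpha", "alpha", 1, 0)
print(history_probability([step, step], pi_alpha, pi_alpha))
for a1, r1 in product(ACTIONS, REWARDS):
    print(a1, r1, subjective_env_agent1(r1, a1, pi_alpha))
```

Summing out agent 2's action directly, rather than conditioning on the joint history distribution, keeps the subjective environment defined for action $\beta$ even though $\pi_\alpha$ never takes it.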

Definition 7.24 ($\varepsilon$-Best Response). A policy $\pi_i$ acting in multi-agent environment $\sigma$ with policies $\pi_1, \ldots, \pi_n$ is an $\varepsilon$-best response after history $æ^i_{<t}$ iff

$$V^*_{\sigma_i}(æ^i_{<t}) - V^{\pi_i}_{\sigma_i}(æ^i_{<t}) < \varepsilon.$$

If at some time step $t$ all agents' policies are $\varepsilon$-best responses, we have an $\varepsilon$-Nash equilibrium. The property of multi-agent systems that is analogous to asymptotic optimality is convergence to an $\varepsilon$-Nash equilibrium.
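As a worked instance, consider matching pennies (Example 7.23) with both agents using the policy that always takes action $\alpha$. The computation below is a sketch under an assumption not restated in this section: values are normalized discounted reward sums in $[0,1]$, so a constant reward stream $r$ has value $r$. Agent 1 receives reward 1 in every step, which is the maximum. Agent 2 receives reward 0 in every step, although always taking $\beta$ would yield reward 1 in its subjective environment. Hence, after any history,

$$V^*_{\sigma_1} - V^{\pi_1}_{\sigma_1} = 1 - 1 = 0 < \varepsilon
\qquad \text{and} \qquad
V^*_{\sigma_2} - V^{\pi_2}_{\sigma_2} = 1 - 0 = 1.$$

Under this assumption, agent 1's policy is an $\varepsilon$-best response for every $\varepsilon > 0$, while agent 2's policy is not an $\varepsilon$-best response for any $\varepsilon \leq 1$, so this pair of policies is not an $\varepsilon$-Nash equilibrium for such $\varepsilon$.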
