

The simplest reinforcement learning problems are multi-armed bandits.

Definition 4.7 (Multi-Armed Bandit). An environment $\nu$ is a multi-armed bandit iff $\mathcal{O} = \{\bot\}$ and $\nu(e_t \mid æ_{<t} a_t) = \nu(e_t \mid a_t)$ for all histories $æ_{1:t} \in (\mathcal{A} \times \mathcal{E})^*$.

In a multi-armed bandit problem there are no observations and the next reward only depends on the previous action. Intuitively, we are deciding between $\#\mathcal{A}$ different slot machines (so-called one-armed bandits): we pull a lever and obtain a reward. The reward is stochastic, but it is drawn from a distribution that is time-invariant and fixed for each arm.

A multi-armed bandit is also called a bandit for short. Although bandits are the simplest reinforcement learning problem, they already exhibit the exploration-exploitation tradeoff that makes reinforcement learning difficult: do you pull the arm that has the best empirical mean or do you pull an arm that has the highest uncertainty? In bandits it is very easy to come up with policies that perform (close to) optimally asymptotically (e.g., $\varepsilon_t$-greedy with $\varepsilon_t = 1/t$). But coming up with algorithms that perform well in practice is difficult, and research focuses on the multiplicative and additive constants in the asymptotic guarantees. Bandits exist in many flavors; see Bubeck and Cesa-Bianchi (2012) for a survey.
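As a minimal sketch of the $\varepsilon_t$-greedy policy mentioned above, the following Python code explores a uniformly random arm with probability $\varepsilon_t = 1/t$ and otherwise pulls the arm with the best empirical mean. The Bernoulli rewards, the arm probabilities, and all function and variable names are assumptions made for illustration; they are not taken from the text.

```python
import random


def epsilon_greedy_bandit(arm_means, steps, seed=0):
    """Run eps_t-greedy with eps_t = 1/t on a Bernoulli bandit.

    arm_means: hypothetical success probability of each arm (an assumption).
    Returns the pull counts and empirical mean rewards after `steps` rounds.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, steps + 1):
        if rng.random() < 1.0 / t:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best empirical mean so far.
            arm = max(range(n_arms), key=lambda a: means[a])
        # The reward depends only on the chosen arm, as in Definition 4.7.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the empirical mean of the pulled arm.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means


counts, means = epsilon_greedy_bandit([0.2, 0.5, 0.8], steps=10_000)
```

Because exploration decays quickly, a single finite run of this sketch may settle on a suboptimal arm for a long time, which illustrates why asymptotic guarantees alone say little about practical performance.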

Definition 4.8 (Markov Decision Process). An environment $\nu$ is a Markov decision process (MDP) iff $\nu(e_t \mid æ_{<t} a_t) = \nu(e_t \mid o_{t-1} a_t)$ for all histories $æ_{1:t} \in (\mathcal{A} \times \mathcal{E})^*$.

Intuitively, in MDPs, the previous observation $o_{t-1}$ provides a sufficient statistic for the history: given $o_{t-1}$ and the current action $a_t$, the next percept $e_t$ is independent of the rest of the history. In other words, everything that the agent needs to know to make optimal decisions is readily available in the previous percept. This is why observations are called states in MDPs. Note that bandits are MDPs with a single state.

§4.1 The General Reinforcement Learning Problem

Much of today’s literature on reinforcement learning focuses on MDPs (Sutton and Barto, 1998). They provide a particularly good framework to study reinforcement learning because they are simple enough to be tractable for today’s algorithms, yet general enough to encompass many interesting problems. For example, most of the Atari games (see Figure 1.1 for an overview) are (deterministic) MDPs when combining the previous four frames into one percept. While they have a huge state space (see footnote 2), they can still be learned using Q-learning with function approximation (Mnih et al., 2015).
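Q-learning with function approximation is beyond the scope of a short example. As a simpler stand-in, the sketch below runs tabular Q-learning on a two-state deterministic MDP. The MDP, the discount factor `gamma`, the learning rate `alpha`, and the uniformly random behaviour policy are all assumptions chosen for illustration and do not come from the text.

```python
import random


# Hypothetical two-state deterministic MDP (not from the text).
# Action 0 stays in the current state, action 1 switches to the other state.
# Staying in state 1 pays reward 1; every other transition pays reward 0.
def step(state, action):
    next_state = state if action == 0 else 1 - state
    reward = 1.0 if (state == 1 and action == 0) else 0.0
    return next_state, reward


def q_learning(steps=5000, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]
    state = 0
    for _ in range(steps):
        action = rng.randrange(2)
        next_state, reward = step(state, action)
        # One-step target: reward plus discounted value of the best next action.
        target = reward + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


q_table = q_learning()
# Greedy policy: the action with the highest Q-value in each state.
greedy_policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in (0, 1)]
```

Under these assumptions the greedy policy switches out of state 0 and stays in state 1. The update only ever looks at the current state, the action, and the next state, which is exactly the information that Definition 4.8 says is sufficient.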

The MDP framework is restrictive because it requires the agent to be more powerful than the environment. Since the agent learns, its actions are not independent of the rest of the history given the last action and percept. In other words, learning agents are not Markov. The following definition lifts this restriction and allows the environment to be partially observable.

Definition 4.9 (Partially Observable Markov Decision Process). An environment $\nu$ is a partially observable Markov decision process (POMDP) iff there is a set of states $\mathcal{S}$, an initial state $s_0 \in \mathcal{S}$, a state transition function $\nu' : \mathcal{S} \times \mathcal{A} \to \Delta\mathcal{S}$, and a percept distribution $\nu'' : \mathcal{S} \to \Delta\mathcal{E}$ such that

$$\nu(e_{1:t} \,\|\, a_{1:t}) = \prod_{k=1}^{t} \nu''(e_k \mid s_k)\, \nu'(s_k \mid s_{k-1}, a_k).$$
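The product in Definition 4.9 is written in terms of the states $s_1, \ldots, s_t$, which the agent does not observe. The sketch below evaluates that product for one given state sequence and, as one possible reading that is an assumption of this example rather than something stated in the text, also sums it over all state sequences to obtain a probability for the percepts alone. The two-state POMDP and all names are hypothetical.

```python
from itertools import product

# Hypothetical two-state POMDP used only for illustration.
STATES = ["s0", "s1"]
PERCEPTS = ["x", "y"]
INITIAL_STATE = "s0"

# State transition function nu'(s_k | s_{k-1}, a_k).
TRANSITION = {
    ("s0", "a"): {"s0": 0.9, "s1": 0.1},
    ("s0", "b"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a"): {"s0": 0.5, "s1": 0.5},
    ("s1", "b"): {"s0": 0.0, "s1": 1.0},
}

# Percept distribution nu''(e_k | s_k).
EMISSION = {
    "s0": {"x": 0.7, "y": 0.3},
    "s1": {"x": 0.1, "y": 0.9},
}


def path_probability(percepts, actions, states):
    """Product over k of nu''(e_k | s_k) * nu'(s_k | s_{k-1}, a_k)."""
    prob, prev = 1.0, INITIAL_STATE
    for e, a, s in zip(percepts, actions, states):
        prob *= EMISSION[s][e] * TRANSITION[(prev, a)][s]
        prev = s
    return prob


def percept_probability(percepts, actions):
    """Sum of path_probability over all hidden state sequences.

    Marginalizing over states is an assumption of this sketch.
    """
    return sum(
        path_probability(percepts, actions, path)
        for path in product(STATES, repeat=len(percepts))
    )


p = percept_probability(["x", "y", "y"], ["a", "b", "a"])
```

Under this reading, summing `percept_probability` over all percept sequences of a fixed length for a fixed action sequence gives 1, so the result is a proper distribution over percepts.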

Usually the set $\mathcal{S}$ is assumed to be finite; with infinite-state POMDPs we can model any environment $\nu$ by setting the set of states to be the set of histories, $\mathcal{S} := (\mathcal{A} \times \mathcal{E})^*$. A common assumption for MDPs and POMDPs is that they do not contain traps. Formally, a (PO)MDP is ergodic iff for any policy $\pi$ and any two states $s_1, s_2 \in \mathcal{S}$, the expected number of time steps to reach $s_2$ from $s_1$ is $\mu^\pi$-almost surely finite. A (PO)MDP is weakly communicating iff for any two states $s_1, s_2 \in \mathcal{S}$ there is a policy $\pi$ such that the expected number of time steps to reach $s_2$ from $s_1$ is $\mu^\pi$-almost surely finite. Note that any ergodic (PO)MDP is also weakly communicating, but not vice versa.
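For a finite MDP in which the policy can see the state, the weakly communicating condition as defined above can be checked with a graph search: it holds when every state can reach every other state through transitions that some action takes with positive probability. The sketch below implements this check under that assumption. It does not cover the partially observable case or the ergodicity condition, and both example MDPs and all names are hypothetical.

```python
def reachable(transitions, start):
    """States reachable from `start` via positive-probability transitions."""
    seen, stack = {start}, [start]
    while stack:
        state = stack.pop()
        for (source, _action), distribution in transitions.items():
            if source != state:
                continue
            for next_state, prob in distribution.items():
                if prob > 0 and next_state not in seen:
                    seen.add(next_state)
                    stack.append(next_state)
    return seen


def every_state_reaches_every_other(transitions):
    """True iff each state can reach all states under some choice of actions."""
    states = {source for (source, _action) in transitions}
    states |= {s for dist in transitions.values() for s in dist}
    return all(reachable(transitions, s) == states for s in states)


# Hypothetical MDP without traps: the two states can always reach each other.
LOOP_MDP = {
    ("s0", "go"): {"s1": 1.0},
    ("s1", "go"): {"s0": 0.5, "s1": 0.5},
}

# Hypothetical MDP with a trap: once in "trap", no action leads back out.
TRAP_MDP = {
    ("start", "go"): {"trap": 1.0},
    ("trap", "go"): {"trap": 1.0},
}

loop_ok = every_state_reaches_every_other(LOOP_MDP)
trap_ok = every_state_reaches_every_other(TRAP_MDP)
```

Here `loop_ok` is `True` and `trap_ok` is `False`, matching the intuition that a trap is a state from which part of the state space can no longer be reached.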

In general, our environments are stochastic. Stochasticity can originate from noise in the environment, noise in the sensors, or modeling errors. Sometimes we also consider classes of deterministic environments. These are usually easier to deal with because they do not require as much mathematical machinery. For example, in a deterministic environment the next percept is certain; if a different percept is received, this environment is immediately falsified and can be discarded. In a stochastic environment, an unlikely percept reduces our posterior belief in this environment but does not rule it out completely.
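The difference between falsifying a deterministic environment and down-weighting a stochastic one can be sketched with a single Bayesian update over a small class of candidate environments. The three environments, their likelihoods, and the uniform prior below are hypothetical values chosen for illustration.

```python
def update_posterior(prior, likelihoods):
    """Bayes' rule: posterior weight is proportional to prior times likelihood."""
    unnormalized = {env: prior[env] * likelihoods[env] for env in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("every environment in the class was falsified")
    return {env: weight / total for env, weight in unnormalized.items()}


# Uniform prior over three hypothetical environments.
prior = {"det_heads": 1 / 3, "det_tails": 1 / 3, "noisy_heads": 1 / 3}

# Probability that each environment assigns to the received percept "tails".
# The deterministic environments assign probability 0 or 1; the stochastic
# one considers "tails" unlikely but possible.
likelihood_of_tails = {"det_heads": 0.0, "det_tails": 1.0, "noisy_heads": 0.1}

posterior = update_posterior(prior, likelihood_of_tails)
```

After the update, `det_heads` has posterior weight exactly zero and can be discarded, whereas `noisy_heads` keeps a positive weight that is smaller than its prior weight.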

In Chapter 6 and Chapter 7 we make the assumption that the environment is computable. This encompasses all finite-state POMDPs, and most if not all AI problems can be formulated in this setting. Moreover, the current theories of quantum mechanics and general relativity are computable and there is no evidence that suggests that our physical universe is incomputable. For any physical system of finite volume and finite (average) energy, the amount of information it can contain is finite (Bekenstein, 1981), and so is the number of state transitions per unit of time (Margolus and Levitin, 1998). This gives us reason to believe that even the environment that we humans currently face (and will ever face) falls under these assumptions.

2 The size of the state space is at most $256^{128}$ since the Atari 2600 has only 128 bytes of memory. However, the vast majority of these states are not reachable.

Formally, we define the set $\mathcal{M}^{\mathrm{CCS}}_{\mathrm{LSC}}$ as the set of environments that are lower semicomputable chronological contextual semimeasures and $\mathcal{M}^{\mathrm{CCM}}_{\mathrm{comp}}$ as the set of environments that are computable chronological contextual measures. Note that for chronological contextual semimeasures it makes a difference whether $\nu(\,\cdot \,\|\, a_{1:\infty})$ is lower semicomputable or the conditionals $\nu(\,\cdot \mid æ_{<t} a_t)$ are. The latter implies the former, but not vice versa.
