
Hierarchical abstract machines [Parr, 1998; Parr and Russell, 1998], or HAMs, provide a mechanism for specifying domain knowledge that constrains the search space of a learning problem. The idea is to capture this knowledge as a hierarchy of partially specified machines. So while options [Sutton et al., 1999] (Section 7.2.1) capture sub-tasks as fixed policies, HAMs specify them using non-deterministic finite state machines. HAMs may be conceptualised as a tiered system of connected finite state machines whose transitions may invoke lower-level machines, and where the layers represent different levels of abstraction or detail in the system. Further, HAMs cater for non-deterministic decision making in a Markovian process by providing what are termed choice states, where the optimal selection is to be decided by the learning process. This framework for restricting the set of policies to be considered, combined with the ability to specify such constraints at different levels of abstraction, allows HAMs to be applied to problems with much larger state spaces than is possible in traditional reinforcement learning.

[Figure: a room grid with the entrance marked ↑, the exit marked → and the eastern-wall cells marked ×; a top-level machine with the labels start, Choose, N-E, S-E and Stop; and the sub-machines N-E (North, East, Stop) and S-E (South, East, Stop), with transitions labelled d, s_r.]

Figure 7.3: A machine for the room navigation problem [Parr, 1998].

Specifically, a HAM is a program that, when executed by an agent in a given state, determines the set of actions that are allowed in that state. It is described by a set of states, a transition function that stochastically determines the next state, and a start function that specifies the initial state of the machine. States are of four types: action states that directly interact with the environment, call states that invoke other HAMs as subroutines, choice states that non-deterministically select the next state, and stop states that terminate execution of the machine (and optionally return control to a preceding call state).
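The components listed above can be written down as a small data structure. The following is a minimal sketch only: the names `Kind`, `MachineState`, `Machine`, `is_well_formed` and the toy machine `go_east` are hypothetical and are not taken from [Parr, 1998]; environment states are assumed to be plain hashable values.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Dict, Hashable, Iterable, Optional


class Kind(Enum):
    ACTION = auto()  # executes a primitive action in the environment
    CALL = auto()    # invokes another machine as a subroutine
    CHOICE = auto()  # next machine state is selected non-deterministically
    STOP = auto()    # terminates the machine, returning to the caller if any


@dataclass
class MachineState:
    name: str
    kind: Kind
    action: Optional[str] = None        # only meaningful for ACTION states
    callee: Optional["Machine"] = None  # only meaningful for CALL states


@dataclass
class Machine:
    name: str
    states: Dict[str, MachineState]
    # Start function: environment state -> name of the initial machine state.
    start: Callable[[Hashable], str]
    # Transition function: (machine state name, environment state)
    # -> probability distribution over the names of the next machine state.
    transition: Callable[[str, Hashable], Dict[str, float]]


def is_well_formed(m: Machine, env_states: Iterable[Hashable]) -> bool:
    """Check that start and transition only name known states and that each
    transition distribution sums to one, for the given environment states."""
    for env in env_states:
        if m.start(env) not in m.states:
            return False
        for name, state in m.states.items():
            if state.kind is Kind.STOP:
                continue  # stop states have no successor
            dist = m.transition(name, env)
            if any(nxt not in m.states for nxt in dist):
                return False
            if abs(sum(dist.values()) - 1.0) > 1e-9:
                return False
    return True


# Hypothetical toy machine: keep moving east until the wall is sensed.
go_east = Machine(
    name="go_east",
    states={
        "east": MachineState("east", Kind.ACTION, action="east"),
        "stop": MachineState("stop", Kind.STOP),
    },
    start=lambda env: "east",
    transition=lambda ms, env: {"stop": 1.0} if env == "at_wall" else {"east": 1.0},
)
```

Representing the transition function as a distribution over next states keeps the stochastic behaviour explicit; a choice state would differ only in that the learner, rather than the distribution, picks among the listed successors.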

Consider a grid world problem (Figure 7.3), taken from [Parr, 1998], where a robot is navigating a set of interconnected rooms in order to exit a building. The robot is equipped with sonar sensors that detect when it has reached an obstacle in either direction. Suppose that the robot enters a given room via a southern entrance (marked by ↑) and that the only exit is to the east (marked by →). Given that the exit is always to the right of the robot entering this room, the domain expert may encode this knowledge in a HAM-constrained policy that effectively directs the robot towards the right.

An example of such a machine is also shown in Figure 7.3. The idea is to try to locate the exit by moving in an easterly direction. The HAM specifies two strategies for this: sub-machine N-E, i.e., a “move north or east” strategy, and sub-machine S-E, i.e., a “move south or east” strategy. When invoked, they choose between moving east or north (respectively south) with equal probability, and terminate and return control to the parent machine when the door or the right wall is reached (i.e., d, s_r). The robot begins by adopting the N-E strategy for finding the door. If that does not work and it reaches the eastern wall instead (the cells marked ×, as indicated by the right sonar reading s_r), then it must adjust its strategy. The choice of which strategy to select next, however, is not specified by the machine and is left up to the robot to decide (denoted by the state Choose).
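The behaviour of this machine can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the machine of [Parr, 1998]: the room size, the door position, the function names and the `choose` callback are all hypothetical, walls simply block movement, and each sub-machine is assumed to take one step before testing its stop condition (so that it can still move along the eastern wall after being re-invoked there).

```python
import random

WIDTH, HEIGHT = 5, 5   # hypothetical room size (columns, rows)
DOOR = (WIDTH - 1, 1)  # hypothetical exit cell on the eastern wall


def move(pos, action):
    """Apply a primitive action; moving into a wall leaves the robot in place."""
    x, y = pos
    dx, dy = {"north": (0, 1), "south": (0, -1), "east": (1, 0)}[action]
    return (min(max(x + dx, 0), WIDTH - 1), min(max(y + dy, 0), HEIGHT - 1))


def run_submachine(pos, vertical, rng):
    """Sub-machine N-E (vertical='north') or S-E (vertical='south').

    Moves east or vertically with equal probability, and stops once the
    door (d) or the right wall (s_r) is sensed. Returns (position, signal).
    """
    while True:
        pos = move(pos, rng.choice([vertical, "east"]))
        if pos == DOOR:
            return pos, "d"
        if pos[0] == WIDTH - 1:
            return pos, "s_r"


def run_ham(start, choose, rng, max_calls=1000):
    """Top-level machine: try N-E first, then defer to the Choose state."""
    pos, signal = run_submachine(start, "north", rng)
    calls = 1
    while signal != "d" and calls < max_calls:
        strategy = choose(pos)  # choice state: left to the robot (the learner)
        vertical = "north" if strategy == "N-E" else "south"
        pos, signal = run_submachine(pos, vertical, rng)
        calls += 1
    return pos, calls


# A hand-written stand-in for what learning would have to discover:
# head south when above the door, north otherwise.
def informed_choice(pos):
    return "S-E" if pos[1] > DOOR[1] else "N-E"


final_pos, n_calls = run_ham((2, 0), informed_choice, random.Random(0))
```

The only decision not fixed by the designer is the `choose` callback, which is exactly the point at which a learning algorithm would be plugged in.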

As described in [Parr and Russell, 1998], HAMs offer two important properties. First, given an MDP and an expert-provided HAM, there exists a new MDP whose optimal policy is also optimal in the original MDP among the policies that satisfy the constraints specified by the HAM, and an algorithm exists to determine this optimal policy. Second, a reinforcement learning algorithm may be constructed to find an optimal policy that satisfies the constraints of that HAM without needing to construct a new MDP from it first (this is important since the environment model is not generally known a priori). The benefit is that HAM-constrained exploration during reinforcement learning allows the agent to focus on a significantly reduced state space while still ensuring that the optimal solution, within the constraints of the HAM, is found. This comes at the cost of shifting some of the onus of decision making to the designer in the construction of the machines.
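To make the second property more concrete, the sketch below shows a Q-learning-style update that is applied only at choice points, using the discounted reward accumulated between two consecutive choices. It is an illustration in the spirit of the learning approach described above, not a reproduction of the algorithm in [Parr and Russell, 1998]; the class name `ChoicePointQ`, its parameters, and the representation of a choice point as an (environment state, machine state) pair are assumptions made for this example.

```python
from collections import defaultdict


class ChoicePointQ:
    """Tabular Q-values indexed by (choice point, option)."""

    def __init__(self, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)  # unseen entries default to 0.0
        self.alpha = alpha           # learning rate
        self.gamma = gamma           # per-step discount factor

    def value(self, point, options):
        """Best estimated value over the options available at a choice point."""
        return max(self.q[(point, o)] for o in options)

    def update(self, point, option, rewards, next_point, next_options):
        """Update after reaching the next choice point (or terminating).

        `rewards` lists the primitive rewards received since `option` was
        selected at `point`; an empty `next_options` marks termination.
        """
        ret, discount = 0.0, 1.0
        for r in rewards:
            ret += discount * r
            discount *= self.gamma
        future = self.value(next_point, next_options) if next_options else 0.0
        target = ret + discount * future
        key = (point, option)
        self.q[key] += self.alpha * (target - self.q[key])


# Usage with hypothetical values: the robot is at the eastern wall above the
# door, selects S-E at the Choose state, and reaches the door three steps later.
learner = ChoicePointQ(alpha=0.5, gamma=0.9)
at_wall = ((4, 3), "Choose")
learner.update(at_wall, "S-E", [-1.0, -1.0, 10.0], None, [])
```

Because values are stored only for choice points, the table grows with the number of (environment state, choice state) pairs that are actually visited rather than with every state-action pair of the underlying MDP.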

HAMs allow procedural knowledge to be encoded in a hierarchical manner similar to the way a BDI plan library does. Learning constitutes optimising decisions at each choice point that may lie at different levels in the hierarchy, and is conceptually similar to learning plan selection at different levels in a BDI goal-plan hierarchy.

Of course, HAMs are tied to the theory of MDPs, and BDI systems to logics and programming, and so they use very different languages. Overall, HAM-constrained reinforcement learning is focussed on finding optimal solutions using MDPs (similar to the options framework [Sutton et al., 1999]), given a (hierarchical) model of the agent’s behaviour. In contrast, the focus of our work in BDI learning is on maintaining the existing structure and benefits of BDI programs while seamlessly integrating (existing) machine learning techniques. Our aim is to address the nuances of learning in BDI hierarchies for use in practical applications. To that end, our contribution is to agent programming rather than to machine learning research.
