Oates and Cohen [31] present an agent capable of learning in a world with probabilistic effects, context-dependent effects, and exogenous events (i.e., events caused by other agents). The learning scenario involves a robot with two arms painting blocks. The robot has a fixed number of input sensors (e.g., the agent can sense that block ‘X’ is painted) and actions (e.g., paint block ‘X’). The learning task is to create STRIPS-like operators for each action, where an operator describes the probable outcomes given the current state of the world. Each operator assigns an estimated probability to each possible outcome (see section 2.1.6 for details of the action representation).

The ‘MSDD’ algorithm finds operators by performing a best-first search through the space of all possible contexts and effects, starting with a most general operator (i.e., the operator specifies that ‘any sensor can take any value’, given the precondition that ‘any sensor has any value’). An observation trace of actions and world state transitions is used to evaluate operators. The operators are heuristically scored by counting how often the context and effect occur together, plus a bonus for a frequently occurring context. The search space is pruned by removing illegal and duplicate operators. High-scoring operators are output along with a probability of success (the probability is simply calculated from the number of successes in the observation trace). New search nodes are generated by adding new operators that are specializations of existing operators.
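The scoring and specialization steps described above can be illustrated with a minimal Python sketch. It is not the MSDD implementation: the names `matches`, `evaluate` and `specialize`, the dictionary encoding of sensor values, and the `context_bonus` weight are all assumptions made for illustration, and for brevity only the context (not the effect) is specialized.

```python
from typing import Dict, Iterator, List, Tuple

State = Dict[str, str]            # sensor name -> observed value
Step = Tuple[State, str, State]   # (state before, action, state after)


def matches(pattern: State, state: State) -> bool:
    """A pattern matches when every sensor it constrains has the required value."""
    return all(state.get(sensor) == value for sensor, value in pattern.items())


def evaluate(action: str, context: State, effect: State,
             trace: List[Step], context_bonus: float = 0.5) -> Tuple[float, float]:
    """Score one candidate operator against an observation trace.

    The score counts how often context and effect occur together and adds a
    bonus for a frequently occurring context. The bonus weight is hypothetical.
    The probability is the fraction of context matches followed by the effect.
    """
    context_hits = 0
    joint_hits = 0
    for before, act, after in trace:
        if act == action and matches(context, before):
            context_hits += 1
            if matches(effect, after):
                joint_hits += 1
    score = joint_hits + context_bonus * context_hits
    probability = joint_hits / context_hits if context_hits else 0.0
    return score, probability


def specialize(context: State, sensor_values: Dict[str, List[str]]) -> Iterator[State]:
    """Yield more specific contexts, each constraining one additional sensor."""
    for sensor, values in sensor_values.items():
        if sensor not in context:
            for value in values:
                yield {**context, sensor: value}


# Hypothetical trace: painting block X succeeds in two of three attempts.
trace: List[Step] = [
    ({"X": "unpainted"}, "paint_X", {"X": "painted"}),
    ({"X": "unpainted"}, "paint_X", {"X": "painted"}),
    ({"X": "unpainted"}, "paint_X", {"X": "unpainted"}),
    ({"X": "painted"}, "paint_X", {"X": "painted"}),
]
score, probability = evaluate("paint_X", {"X": "unpainted"}, {"X": "painted"}, trace)
```

A best-first search would keep the candidates in a priority queue ordered by `score`, expand the best one with `specialize`, and discard illegal or duplicate operators before re-inserting the children.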

The agent was able to learn a full set of operators for the simple block-painting world using the algorithm described, even when a number of exogenous actions were included in the observation trace. Exogenous actions are modelled as ‘noise’ streams in which random observations are added to the world state. These noise streams are not a realistic model of exogenous actions for two reasons: first, actions by other agents are not simply random (note that the search algorithm uses the frequency of pre/post state correlations to guide the search); and second, actions by other agents should be able to interact with the same objects as the learning agent (this too will affect the correlation heuristic). Both limitations make the learning domain a less realistic model of complex real-world environments, and they certainly simplify the learning task.

The agent’s fixed set of actions and sensors was very limited (fewer than 10 of each), so the domain can be considered a classic ‘micro-world’. The authors suggest that the search time grows linearly as the number of noise streams increases. The claim is supported empirically for up to 20 noise streams. There is no additional evidence to suggest that this will hold if the number of sensors is increased significantly, into the thousands (a real-world environment can easily generate this number of observables).

Other limitations of MSDD (as applied to learning actions) are the assumptions that effects occur immediately after an action and that the world is completely observable; these preclude the agent from learning about actions with delayed or hidden effects.

2.2.4 EXPO: detecting and refining incomplete actions

Gil [18] describes a proactive experimental approach to learning action operators. The problem addressed is that of refining incomplete STRIPS rules that have missing preconditions or effects; for example, an ‘open door’ action that lacks the precondition that the door must be unlocked for successful execution. The system, named ‘EXPO’, works in an online manner by selectively and continuously monitoring the world during plan execution (plans are created and executed within the PRODIGY [39] agent architecture). The agent monitors for either of two kinds of failure that indicate that an action rule is incorrect. The first is an unexpected outcome observed immediately after an action is executed, e.g., an open action was executed and the door did not open as expected. In this case EXPO attempts to find a new precondition for the action. The precondition is found by drawing up a list of candidates (derived from successful past executions) and experimenting with each to discover the correct one.
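The search for a missing precondition can be sketched as follows. This is an illustrative reading of the description above, not EXPO's actual procedure: the function names, the set-of-predicates state encoding, and the `experiment` callback (which stands in for making a candidate true and retrying the action in the world) are all hypothetical.

```python
from typing import Callable, FrozenSet, List, Optional, Set

Predicate = str
State = FrozenSet[Predicate]


def candidate_preconditions(known: Set[Predicate],
                            successes: List[State],
                            failed_state: State) -> Set[Predicate]:
    """Predicates true in every past successful execution, false in the
    failed state, and not already listed as preconditions."""
    if not successes:
        return set()
    common = set(successes[0]).intersection(*successes[1:])
    return common - set(failed_state) - set(known)


def find_missing_precondition(candidates: Set[Predicate],
                              failed_state: State,
                              experiment: Callable[[State], bool]) -> Optional[Predicate]:
    """Try each candidate in turn: make it true and see if the action now succeeds."""
    for predicate in sorted(candidates):
        trial_state = failed_state | {predicate}
        if experiment(trial_state):
            return predicate
    return None


# Hypothetical 'open door' example: the rule only knows 'door_closed'.
successes = [
    frozenset({"door_closed", "door_unlocked", "has_key"}),
    frozenset({"door_closed", "door_unlocked"}),
]
failed_state = frozenset({"door_closed"})
candidates = candidate_preconditions({"door_closed"}, successes, failed_state)
# The stand-in experiment succeeds only when the door is unlocked.
missing = find_missing_precondition(candidates, failed_state,
                                    lambda state: "door_unlocked" in state)
```

Deriving the candidates from past successes keeps the number of experiments small, since only predicates that held every time the action worked need to be tested.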

The second type of failure is observed when a precondition for the current action (as predicted in the plan) is false. In this case EXPO determines that a previously executed action has a missing effect involving the predicate specified in the required precondition. The action with the missing effect is found by experimenting with all the actions executed since the predicate was last observed. The incorrect action is then refined by adding the predicate as an effect. In this manner EXPO incrementally refines its action rules.
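The second kind of repair can be sketched in the same style. Again this is only an assumed illustration: the history format, the `changes_predicate` callback (standing in for a real-world experiment that executes an action and watches the predicate), and the representation of an effect as a plain string literal are hypothetical choices, not details taken from EXPO.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Each history entry is (action executed, predicates observed true afterwards).
History = List[Tuple[str, Set[str]]]


def actions_since_last_observed(history: History, predicate: str) -> List[str]:
    """Return the actions executed after the predicate was last observed true."""
    suspects: List[str] = []
    for action, observed in reversed(history):
        if predicate in observed:
            break
        suspects.append(action)
    suspects.reverse()  # restore execution order
    return suspects


def find_action_with_missing_effect(suspects: List[str],
                                    changes_predicate: Callable[[str], bool]) -> Optional[str]:
    """Experiment with each suspect until one is seen to change the predicate."""
    for action in suspects:
        if changes_predicate(action):
            return action
    return None


def add_effect(rules: Dict[str, Set[str]], action: str, effect: str) -> None:
    """Refine the incorrect rule by recording the newly discovered effect."""
    rules.setdefault(action, set()).add(effect)


# Hypothetical example: the plan expected 'holding_key' to be true, but it is false.
history: History = [
    ("pick_up_key", {"holding_key"}),
    ("open_bag", set()),
    ("put_in_bag", set()),
]
rules: Dict[str, Set[str]] = {"put_in_bag": {"bag_full"}}
suspects = actions_since_last_observed(history, "holding_key")
culprit = find_action_with_missing_effect(suspects, lambda a: a == "put_in_bag")
if culprit is not None:
    add_effect(rules, culprit, "not holding_key")
```

Restricting the experiments to actions executed since the predicate was last observed is what keeps the repair local: earlier actions cannot be responsible for the change.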

This technique was successfully applied to a small deterministic problem domain with many incorrect rules. One advantage of EXPO is that it only refines rules for situations that are actually encountered (because learning is triggered during plan execution). This allows EXPO to avoid having to create action schemas for every situation in which a rule might be used. A general-purpose action, such as put on, has a potentially unbounded number of rare and esoteric situations for which preconditions and effects must be specified in order to learn a ‘perfect’ model. An unguided learner may learn a number of rules that are correct but never actually used.

The concept of incremental refinement through experimentation that is guided by plan execution failure has some advantages over off-line and unguided approaches to learning in complex environments. Unfortunately, the EXPO system makes a number of assumptions that limit its particular implementation to micro-worlds. For example, introducing exogenous events would confuse EXPO (and cause it to learn incorrect rules) because it expects only its own actions to affect the world. Another weakness is that EXPO requires approximately correct rules to begin with; unlike some other action learning systems, it cannot learn from zero knowledge of actions.
