CHAPTER 3. CONSTRUCTION HEURISTICS
3.3. Insertion Heuristics
3.3.3. Parallel Insertion of Christofides, Mingozzi and Toth
Figure 5.2: The basic Q-AHC architecture. The circles represent the AHC modules which are selected when the action value linked to them by an arrow is selected. (Diagram node labels: A(x), Q(x,A).)
As with on-line Q-learning, it is the eligibilities $e_t$ that determine the extent to which the individual parameters $w_t$ are updated in response to the TD-error.

Of course, there is also a choice to be made of which update rule is used for the Q-learning part of the system. This can be any of those discussed in chapter 2. In the experiments presented later in this chapter, Modified Q-Learning (see section 2.2.2) is used.
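The role of the eligibilities can be illustrated with a minimal sketch of a single on-line update step. This is a generic eligibility-trace update rather than the Modified Q-Learning rule itself; the function name `td_lambda_step`, the step size `alpha`, the discount `gamma` and the trace decay `lam` are assumptions introduced for illustration only.

```python
import numpy as np

def td_lambda_step(w, e, grad, td_error, alpha=0.1, gamma=0.9, lam=0.8):
    """One on-line update (illustrative): refresh the traces, then move w."""
    # Decay the old eligibilities and add the gradient of the current output.
    e = gamma * lam * e + grad
    # Each parameter moves in proportion to its own eligibility and the TD-error.
    w = w + alpha * td_error * e
    return w, e

w = np.zeros(3)
e = np.zeros(3)
grad = np.array([1.0, 0.0, 0.5])   # hypothetical gradient of the output w.r.t. w
w, e = td_lambda_step(w, e, grad, td_error=2.0)
```

In this sketch the second parameter has zero eligibility and so is left unchanged, whatever the size of the TD-error.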
5.3 Vector Action Learning
The immediate problem faced in applying adaptive action function methods to a task like the Robot Problem (chapter 4) is the fact that the action is actually a vector. For example, in the Robot Problem, the action has two components, the steering angle and the speed. Thus methods are needed to determine how to alter each action component in response to the scalar internal payoff, $\varepsilon$. Previously, this problem was avoided because each action value was associated with a separate fixed action vector.

With an AHC system, the obvious choice is to learn an overall value function, $V(x)$, and use the TD-errors produced by this to update all of the action elements (Tham and Prager 1992, Cichosz 1994). This ignores the structural credit assignment problem, which is to take into account the contributions of the individual elements to the TD-error. For example, in the Robot Problem, the robot might be heading for a collision with an obstacle and the selected angle component is to turn sharply (good idea), whilst the selected speed component is to travel at top speed (bad idea). If in consequence the robot crashes, then it will really be the fault of the speed element. However, in the above formulation, both elements will see the same internal payoff. It is questionable how much of a problem this is: good action choices by individual elements will generally see higher average payoffs than bad choices, as they contribute positively to the overall quality of the action vector. However, the lack of explicit structural credit assignment may increase the convergence times.

5. Systems with Real-Valued Actions
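The effect of sharing one scalar payoff between the action components can be seen in a small sketch. It assumes a simple stochastic hill-climbing rule in which each component's mean is moved towards its selected action in proportion to the TD-error; the function `update_action_means`, the step size `beta` and the numerical values are hypothetical and are not taken from the experiments.

```python
import numpy as np

def update_action_means(means, actions, td_error, beta=0.5):
    """Move every component mean using the same scalar TD-error (illustrative)."""
    means = np.asarray(means, dtype=float)
    actions = np.asarray(actions, dtype=float)
    # A positive TD-error pulls each mean towards its selected action,
    # a negative one pushes it away; no per-component credit is assigned.
    return means + beta * td_error * (actions - means)

means = [0.0, 1.0]     # [steering angle mean, speed mean]
actions = [0.4, 2.0]   # sharp turn (good idea), top speed (bad idea)
new_means = update_action_means(means, actions, td_error=-1.0)
```

After the crash (negative TD-error) both means are pushed away from the actions just taken, so the steering element is penalised for a choice that was in fact a good one.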
5.3.1 Q-AHC with Vector Actions
If the action to be performed at each step is in fact a vector of individual scalar actions, then there are more choices to be made when it comes to implementing a Q-AHC system. Fig. 5.3 shows 3 possible architectures for a system with two components to its actions.
The first architecture, Separate Q-AHC, involves treating the components of the actions as completely independent. Therefore independent Q-AHC learning elements are used to select each component of the action. The difficulty is that each action component is selected taking no account of the values selected for the other components. This makes it harder for each Q-AHC element to predict the expected return for its action component, because the effect of the other action components cannot be taken into consideration when making the prediction.
The second architecture, Combined Q-AHC, allows each action value to correspond to a particular combination of the AHC elements. This is similar to the fixed action combinations used by the Q-learning systems for the Robot Problem in chapter 4, where the 6 actions were made up of all combinations of the 3 fixed angles and 2 speeds. The problem with the Combined Q-AHC architecture is that each AHC element will see internal payoffs generated from its inclusion in different action vectors. These payoffs may conflict: e.g. if one vector action is very useful and another is not, then any AHC element that contributes to both will receive conflicting signals depending on which action vector was selected. However, much of this problem should be absorbed by the Q-function, which will learn to assign the poor action vector a low action value and so not select it very often.
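As an illustration of how the Combined architecture shares AHC elements between action values, the sketch below enumerates the combinations for 3 angle elements and 2 speed elements, mirroring the 6 fixed actions of chapter 4. The element names are placeholders invented for this example.

```python
from itertools import product

# Hypothetical labels for the AHC elements of each action component.
angle_elements = ["angle_0", "angle_1", "angle_2"]
speed_elements = ["speed_0", "speed_1"]

# One action value per (angle element, speed element) combination.
combined_actions = list(product(angle_elements, speed_elements))

# Indices of the action values that all reuse the element "angle_0".
sharing_angle_0 = [i for i, (a, _) in enumerate(combined_actions) if a == "angle_0"]
```

Here `angle_0` takes part in two different action vectors, so it would receive the internal payoffs generated by both, which is the source of the possible conflict described above.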
The final architecture, Grouped Q-AHC, simply involves having a separate action vector of AHC elements associated with each action value. This would appear to be the most satisfactory architecture, as it reduces all action component interference problems. However, there still remains the same problem of structural credit assignment associated with training vector AHC learning systems (see section 5.3).
It should also be noted that the first architecture allows the greatest number of action combinations per element (both Q-function and AHC), whilst the last architecture allows the least. This is because the last system has a set of action vectors that are completely independent, whereas the other two rely on combinations of AHC elements.
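The comparison of combinations per element can be made concrete with a rough count. The sketch assumes one particular reading of the three architectures for two action components with `k` alternatives each, and counts one Q-function output per action value; the function names and the counting convention are assumptions made for this example rather than definitions from the text.

```python
def architecture_counts(k, n_components=2):
    """Rough element counts under the stated assumptions (illustrative only)."""
    return {
        # Independent Q-AHC per component: k action values and k AHC elements each.
        "separate": {"q_values": n_components * k,
                     "ahc_elements": n_components * k,
                     "action_vectors": k ** n_components},
        # One action value per combination of the shared AHC elements.
        "combined": {"q_values": k ** n_components,
                     "ahc_elements": n_components * k,
                     "action_vectors": k ** n_components},
        # k action values, each owning a private vector of AHC elements.
        "grouped": {"q_values": k,
                    "ahc_elements": n_components * k,
                    "action_vectors": k},
    }

def vectors_per_element(c):
    """Action vectors available per element (Q-function outputs plus AHC)."""
    return c["action_vectors"] / (c["q_values"] + c["ahc_elements"])

counts = architecture_counts(k=3)
ratios = {name: vectors_per_element(c) for name, c in counts.items()}
```

With `k = 3` this count gives the Separate architecture the highest ratio and the Grouped architecture the lowest, in line with the ordering stated above.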
5.4 Experiments using Real-Valued Methods
In this section, some of the real-valued action methods discussed in the previous sections are examined by testing them on the Robot Problem introduced in chapter 4.
The real-valued action systems used in these experiments are considered for the case where MLP neural networks are used as the function approximators (chapter 3). In addition, the updating method used throughout is the on-line temporal difference algorithm as discussed in section 3.3.1. The real-valued systems use MLPs not only for the prediction of the return, but also for the action functions. For Gaussian ASLA action functions using stochastic hill-climbing techniques, this means that one output is required for the mean $\mu(x)$ and one for the variance $\sigma(x)$. In the following experiments, separate single output networks were used, rather than a single network with two outputs, to avoid the weights of hidden units receiving conflicting updates from the two outputs.

In the first experiments, the robot task is restricted to single action component selection, where the reinforcement learning system attempts to select the optimum steering angle whilst the speed of the robot remains constant. Hence, the quality of the optimal