There has been some research on improving the performance of RL by using high-level domain knowledge. Marthi [Mar07] proposed abstract MDPs to find an approximation of the true shaping functions by solving a simpler abstract problem under the instructions of prior knowledge. Grzes and Kudenko [GK08] used high-level STRIPS [FN72] operator knowledge in reward shaping to search for an optimal policy, and showed that STRIPS-based reward shaping converges faster than the abstract MDP approach. However, these approaches place very restrictive requirements on the domain knowledge being used: in the abstract MDP approach, people are required to identify which states are 'similar', so as to merge these states into one state and thereby construct the abstract MDP; in Grzes and Kudenko's approach [GK08], people have to provide STRIPS-style domain knowledge, which cannot contain conflicts. Our argumentation-framework-based approach is able to handle more flexible domain knowledge: for example, the domain knowledge we provide in this chapter for the Keepaway and Takeaway games does not meet the requirements of their approaches, but can be used in ours. In addition, their approaches can only be applied to single-agent learning problems, while ours can also be used in multi-agent problems. To summarise, our approach is more generic and flexible than existing techniques for proposing heuristics for RL.
Our work is closely related to work in argumentation-based decision making because, in each state, our argumentation framework needs to decide which actions should be recommended to which agents. Amgoud [Amg09] proposed a two-phase argumentation-based model for decision making: in the first (inference) phase, the model uses a Dung-style system in which arguments in favour of or against each option (action) are built and then evaluated by using certain semantics; in the second (comparison) phase, pairs of alternative options are compared using a given criterion, which is generally based on the winning arguments computed in the first phase. Note that, in the first phase, two kinds of arguments are used: practical arguments, whose conclusions are actions, and epistemic arguments, whose conclusions are premises of practical arguments. By introducing the epistemic arguments, the selection of the applicable arguments can also be modelled in an argumentative way. This distinction between practical and epistemic arguments is also present in other approaches to argumentation-based decision making, e.g. [AP09, BCP06]. Note that SCAF and VSCAF only allow practical arguments, and the applicability of each argument is decided by comparing its premise with the current state, not by using epistemic arguments. Compared with Amgoud's approach, the argumentation framework we proposed has two immediate advantages: (a) by only allowing practical arguments, we reduce the overall number of arguments involved in each learning step, and thus reduce the domain expert's burden of proposing arguments; this property is especially useful when the domain experts do not have much expertise in argumentation; and (b) the computational overhead of selecting the applicable arguments in our approach is lower than that in Amgoud's approach: in our approach, the programme just needs to go through the premises of each argument, but in Amgoud's approach, the programme needs to compute the winning arguments in an argumentation framework that involves both practical and epistemic arguments, and this computation is generally very expensive (see Section 2.1.4). However, the epistemic arguments allow for more powerful knowledge representation and justification capability. We leave extending our argumentation frameworks with epistemic arguments as future work, which will be discussed in greater detail in Chapter 7.
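As an illustration of the applicability test described above (one pass over the premises, with no epistemic arguments), the following is a minimal sketch. The `PracticalArgument` class, the dictionary-based state, and the two Keepaway-style arguments are all hypothetical and introduced only for this example; they are not the representation used in the thesis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical state representation: feature name -> value.
State = Dict[str, float]


@dataclass(frozen=True)
class PracticalArgument:
    """If `premise` holds in the state, recommend `action` to `agent`."""
    name: str
    agent: int
    action: str
    premise: Callable[[State], bool]


def applicable_arguments(candidates: List[PracticalArgument],
                         state: State) -> List[PracticalArgument]:
    """Keep the arguments whose premise holds in `state` (a single linear pass)."""
    return [arg for arg in candidates if arg.premise(state)]


# Invented arguments and threshold, for illustration only.
candidates = [
    PracticalArgument("hold", 0, "HoldBall",
                      lambda s: s["nearest_taker_dist"] > 10.0),
    PracticalArgument("pass", 0, "PassToK2",
                      lambda s: s["nearest_taker_dist"] <= 10.0),
]
state = {"nearest_taker_dist": 4.0}
applicable = applicable_arguments(candidates, state)
print([arg.name for arg in applicable])
```

The cost of this check grows linearly with the number of candidate arguments, which is the point of advantage (b); computing winning arguments over a framework that also contains epistemic arguments would replace the list comprehension with a semantics computation.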
3.5 Conclusion
In this chapter, we first presented the challenges of deriving high-quality heuristics from conflicting domain knowledge, and then proposed a VAF-based argumentation framework to tackle this problem. We proved that the heuristics generated by this framework are suitable for cooperative RL algorithms, because each agent receives at most one recommended action (this property is desirable because we focused on problems where each agent can perform only one action at each time slot) and each action is recommended to at most one agent (this property is desirable because we focused on cooperative RL problems where multiple agents performing the same action is a waste of resources). In addition, we proposed Argumentation Accelerated RL (AARL) as an incorporation of our VAF-based argumentation frameworks into RL algorithms. In particular, we outlined the architecture of AARL and described the functionality of each of its modules.
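The two properties summarised above can be stated as a simple check on a recommendation assignment. The sketch below is only an illustration: the function name and the dictionary encoding (agent index mapped to an action name or `None`) are assumptions made for this example.

```python
from typing import Dict, Optional


def is_valid_recommendation(rec: Dict[int, Optional[str]]) -> bool:
    """Return True if no action is recommended to more than one agent.

    `rec` maps each agent index to its recommended action, or to None when
    the agent receives no recommendation. Using a dict already guarantees
    that each agent has at most one recommended action.
    """
    actions = [a for a in rec.values() if a is not None]
    return len(actions) == len(set(actions))
```

For instance, `{0: "MarkK2", 1: None, 2: "TackleBall"}` passes the check, whereas `{0: "TackleBall", 1: "TackleBall"}` does not (the action names here are invented).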
In the next chapter, we will instantiate the AARL framework on different RL algorithms so as to empirically test the effectiveness of AARL. We select SARSA(λ) and MAXQ-0 as representatives of flat and hierarchical RL algorithms, respectively, and implement AARL on these two algorithms. The resulting algorithms, SARSA(λ)-based AARL and MAXQ-based AARL, and their performance on some application domains will be presented in Chapter 4 and Chapter 6, respectively. In addition, in Chapter 5, we will give the PBRS-augmented MAXQ-0 algorithm, PBRS-MAXQ-0, which is essential for the construction of MAXQ-based AARL.
4 SARSA(λ)-based AARL
In Chapter 3, we proposed a generic argumentation framework for deriving high-quality heuristics from conflicting domain knowledge. In this chapter, we integrate the heuristics generated by this argumentation framework into SARSA(λ) (see Section 2.2.4), and propose the resulting algorithm: SARSA(λ)-based Argumentation Accelerated RL (SARSA(λ)-based AARL). We choose SARSA(λ) as the RL algorithm on which to implement AARL because it is a simple and widely used RL algorithm [SB98]; also, it has been integrated with mature PBRS techniques, e.g. look-ahead advice (LA, see Section 2.2.5), which have been proved sound and empirically effective.
This chapter is organised as follows: in Section 4.1, we present SARSA(λ)-based AARL; we then empirically test its effectiveness in the RoboCup Keepaway and Takeaway games in Section 4.2, and in a Wumpus World game in Section 4.3. Related work is reviewed in Section 4.4, and we conclude this chapter in Section 4.5.
4.1 SARSA(λ)-based AARL
To propose the SARSA(λ)-based AARL algorithm, let us first revisit the architecture of AARL illustrated in Figure 3.5 in Chapter 3. We discussed that AARL amounts to the combination of three modules: AF, Potential Generator and PBRS+RL. We choose LA-SARSA(λ), which has been presented in Section 2.2.5, as the algorithm used in the PBRS+RL module.
For the other two modules, we can see that the output of the AF module, i.e. the heuristics, is only used in the Potential Generator module; as a result, for efficiency purposes, we implement one single function with the combined functionality of these two modules. We name this function getPotential(s, i), and its output is a table containing all actions' potential values in state s. The pseudocode of this function is presented in Algorithm 7.
Algorithm 7 The combined AF module and Potential Generator module of SARSA(λ)-based AARL for a learning agent Agent_i.
1: function getPotential(State s, AgentIndex i)
2:     Obtain candidate argument set Arg∗, value set V, value promotion relation val, value ranking Valpref, the argumentation extension type Type, and the potential value given to recommended actions: c ∈ R, c > 0
3:     Obtain all agents' applicable argument set
4:     Build SCAF, VSCAF, and derive AF−
5:     E := getExtensions(AF−, Type)
6:     a_rec := getRecActFromExt(E, i)
7:     Build a table Table, whose keys are actions and whose entries are all 0
8:     if a_rec is not null then
9:         Table(a_rec) := c
10:    end if
11:    return Table
Function getPotential takes two inputs: the state s and the learning agent index i. In the beginning (line 2), getPotential first obtains all domain knowledge provided by the domain expert. As we have discussed in Section 3.3, this knowledge is 'upfront', i.e. provided before the learning starts and remaining the same throughout the learning. Given this knowledge, each agent can obtain the applicable arguments (line 3; note that this may need communication with other agents, see Section 3.3), and then build SCAF and VSCAF and derive AF− (line 4). Given AF−, the agent can compute the Type extensions of AF− and store these extensions in set E (line 5). Note that Type can be preferred or grounded, as discussed in Chapter 3. Given all extensions, the agent can choose the recommended action for itself by invoking function getRecActFromExt (line 6), as defined in Algorithm 6 in Chapter 3. Note that if this agent does not have any recommended action, function getRecActFromExt returns null. After obtaining the recommended action a_rec, the agent creates a table in which the entry corresponding to the recommended action a_rec is c, while the other actions' entries are 0 (lines 7 to 9; for why these potential values are given to each action, see Section 3.2.4). This table is the output of the function (line 11).
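The table-building part of getPotential can be sketched in Python as follows. This is a sketch under stated assumptions rather than the thesis implementation: the construction of SCAF, VSCAF and AF− and the extension computation (lines 3 to 5) are hidden behind an injected `get_extensions` callable, `get_rec_act_from_ext` stands in for Algorithm 6, and both stubs in the usage lines are invented placeholders.

```python
from typing import Callable, Dict, FrozenSet, List, Optional

# Assumed encoding: an extension is a set of argument names.
Extension = FrozenSet[str]


def get_potential(
    state: dict,
    agent_index: int,
    actions: List[str],
    c: float,
    get_extensions: Callable[[dict], List[Extension]],
    get_rec_act_from_ext: Callable[[List[Extension], int], Optional[str]],
) -> Dict[str, float]:
    """Return a table mapping every action to its potential value in `state`."""
    if c <= 0:
        raise ValueError("c must be a positive real number")
    # Lines 3-5: applicable arguments, SCAF/VSCAF/AF- and the extensions
    # are delegated to the injected callable in this sketch.
    extensions = get_extensions(state)
    # Line 6: this agent's recommended action, or None if there is none.
    a_rec = get_rec_act_from_ext(extensions, agent_index)
    # Lines 7-10: start from an all-zero table, then give c to a_rec.
    table = {action: 0.0 for action in actions}
    if a_rec is not None:
        table[a_rec] = c
    return table


# Placeholder stubs, for illustration only.
def stub_extensions(state: dict) -> List[Extension]:
    return [frozenset({"pass"})]


def stub_recommend(extensions: List[Extension], agent: int) -> Optional[str]:
    return "PassToK2" if agent == 0 else None


table_agent0 = get_potential({}, 0, ["HoldBall", "PassToK2"], 2.0,
                             stub_extensions, stub_recommend)
table_agent1 = get_potential({}, 1, ["HoldBall", "PassToK2"], 2.0,
                             stub_extensions, stub_recommend)
```

With these stubs, agent 0 gets a table in which only `PassToK2` has value 2.0, while agent 1, which has no recommended action, gets an all-zero table.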
Function getPotential is integrated into LA-SARSA(λ) to give SARSA(λ)-based AARL (Algorithm 8).

Algorithm 8 The SARSA(λ)-based AARL for Agent_i.
1: Initialise Q(s, a) arbitrarily for all states s and actions a
2: while the experiment does not terminate do
3:     Initialise e(s, a) := 0 for all s and a
4:     Initialise current state s_t
5:     CurTable := getPotential(s_t, i)
6:     Choose action a_t from s_t using the biased ε-greedy policy
7:     while s_t is not a terminal state do
8:         Execute action a_t, observe reward r_t and new state s_{t+1}
9:         NextTable := getPotential(s_{t+1}, i)
10:        Choose a_{t+1} from s_{t+1} using the biased ε-greedy policy
11:        δ := r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t) + γNextTable(s_{t+1}, a_{t+1}) − CurTable(s_t, a_t)
12:        e(s_t, a_t) := 1
13:        for all s, a do
14:            Q(s, a) := Q(s, a) + αδe(s, a)
15:            e(s, a) := γλe(s, a)
16:        end for
17:        CurTable := NextTable
18:        s_t := s_{t+1}; a_t := a_{t+1}
19:    end while
20: end while

Here we highlight some significant differences between this algorithm and LA-SARSA(λ) (see Section 2.2.5):
• Before entering the first learning step (see Sections 1.3 and 2.2.2 for 'learning step'), we compute the potential values in the initial state and store the results in table CurTable (line 5 in Algorithm 8). Note that the current state's potential values are computed before selecting the action to be performed in the current state (line 6), because in LA-SARSA(λ), actions are selected by using the biased ε-greedy policy (see Equation (2.8) in Section 2.2.5), which needs the potential values of all actions.
• After the next state s_{t+1} is observed, this algorithm computes the potential values for all actions in s_{t+1} and stores the results in table NextTable (line 9). This potential value computation is also performed before selecting the action to be performed in s_{t+1}, which likewise requires the potential values of all actions in s_{t+1}.
• In Q-value updating (line 11), according to Equation (2.8) in Section 2.2.5, potential values of both the current state-action pair (s_t, a_t) and the next state-action pair (s_{t+1}, a_{t+1}) are needed. We obtain these two potential values by visiting their corresponding table entries.
• Towards the end of each learning step (line 17), the algorithm updates table CurTable by replacing its entries with the entries in NextTable. By doing this, in the next learning step, the potential values for s_t do not need to be computed again.
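The learning steps discussed in the list above can be sketched for one episode as follows. Several parts are assumptions made for illustration: the greedy branch of the biased ε-greedy policy is taken to maximise Q(s, a) plus the action's potential (the usual look-ahead advice form; Equation (2.8) itself is not reproduced here), potential tables are keyed by action only, the eligibility traces are stored sparsely, and the four-state chain environment with its constant potential table is invented.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Table = Dict[str, float]


def biased_epsilon_greedy(q, state, actions: List[str], potential: Table,
                          epsilon: float, rng: random.Random) -> str:
    """Explore uniformly with probability epsilon; otherwise pick the action
    maximising Q(s, a) + potential(a) (assumed form of the biased policy)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)] + potential[a])


def run_episode(q, start, step: Callable[[int, str], Tuple[float, int]],
                is_terminal: Callable[[int], bool], actions: List[str],
                get_potential: Callable[[int], Table],
                alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1, rng=None):
    """One pass through lines 3-19 of Algorithm 8; `q` is updated in place."""
    rng = rng or random.Random(0)
    e = defaultdict(float)                      # line 3 (absent keys are 0)
    s = start                                   # line 4
    cur_table = get_potential(s)                # line 5
    a = biased_epsilon_greedy(q, s, actions, cur_table, epsilon, rng)  # line 6
    while not is_terminal(s):                   # line 7
        r, s_next = step(s, a)                  # line 8
        next_table = get_potential(s_next)      # line 9
        a_next = biased_epsilon_greedy(q, s_next, actions, next_table,
                                       epsilon, rng)                   # line 10
        delta = (r + gamma * q[(s_next, a_next)] - q[(s, a)]
                 + gamma * next_table[a_next] - cur_table[a])          # line 11
        e[(s, a)] = 1.0                         # line 12
        for key in list(e):                     # lines 13-16
            q[key] += alpha * delta * e[key]
            e[key] *= gamma * lam
        cur_table = next_table                  # line 17: reuse, no recomputation
        s, a = s_next, a_next                   # line 18
    return q


# Invented toy problem: states 0..3 on a chain, state 3 is terminal.
def chain_step(s: int, a: str) -> Tuple[float, int]:
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return (1.0 if s_next == 3 else 0.0), s_next


q_values = defaultdict(float)
run_episode(q_values, 0, chain_step, lambda s: s == 3, ["left", "right"],
            lambda s: {"left": 0.0, "right": 1.0}, epsilon=0.0)
```

Updating only the state-action pairs present in `e` is equivalent to the "for all s, a" loop, since every other pair has a zero trace. The sketch follows Algorithm 8 literally, so getPotential is also called on the terminal state.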