RoboCup Soccer is an international project that aims to provide an experimental framework in which various technologies can be integrated and evaluated¹. To facilitate RL research in this application domain, two simplified tasks have been developed: the Keepaway game [SSK05] and the Takeaway game [IE08]. The basic settings of these games are the same: N + 1 (N ∈ ℕ, N ≥ 1) keepers compete with N takers on a fixed-size field. Keepers attempt to keep possession of the ball within their team for as long as possible, whereas takers attempt to win possession of the ball as quickly as possible. A game scenario involving three keepers and two takers is shown in Figure 3.1. At the start of each episode, the keeper in the top-left corner holds the ball, while all the other keepers are on the right. All takers are initially in the bottom-left corner. An episode ends when the ball goes off the field or any taker gets the ball, and a new episode starts immediately with all the players reset. We call a Keepaway (Takeaway) game consisting of N + 1 keepers and N takers an N-Keepaway (N-Takeaway, respectively) game. In Keepaway, only the keeper holding the ball learns; all the other keepers and all takers act in accordance with hand-coded strategies. In Takeaway, by contrast, all takers learn independently while all keepers play in accordance with hand-coded strategies. Takeaway games are therefore cooperative multi-agent learning problems, whereas Keepaway games are single-agent learning problems taking place in multi-agent scenarios.
Most research on Keepaway and Takeaway games is performed on the RoboCup Soccer Simulation Platform², and agents are assumed to have perfect knowledge of the environment: they can observe the exact positions of the ball and of every agent. In the remainder of this chapter, we adopt the same assumption in all Keepaway and Takeaway examples: all learning agents have perfect information about the environment.
Figure 3.1: An example scenario in the RoboCup Soccer game. The ball is the white circle next to keeper K1.
¹ See http://www.robocup.org/ for more information.
² For more information about the simulation platform, please refer to http://wiki.robocup.org/wiki/Soccer_Simulation_League.
The simulation platform only provides primitive actions for each agent, e.g. changing the velocity of each wheel³ to some value. Using these primitive actions directly in RL results in poor performance. As a result, macro actions were first proposed for Keepaway games by Stone et al. [SSK05], and then adapted to Takeaway games by Iscen and Erogul [IE08]. To be specific, there are two macro actions for Keepaway:
HoldBall(): stay still while keeping the ball,
PassBall(p): kick the ball towards keeper Kp,
and two macro actions for Takeaway:
TackleBall(): move towards the ball/K1 to tackle the ball,
MarkKeeper(p): go to mark keeper Kp, p ≠ 1,
where Kp, p ∈ {1, ..., N + 1}, denotes the p-th closest keeper to the ball (so K1 is the keeper in possession of the ball). Takers are also indexed according to their distance to the ball: T1 is the closest taker to the ball, while TN is the farthest. When a taker marks a keeper, it positions itself between the ball holder and that keeper, in the hope of intercepting the ball if it is passed to that keeper. A taker is not allowed to mark the ball holder. Overall, in an N-Keepaway (N-Takeaway) game, there are N + 1 macro actions available to each learning agent in each state.
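The indexing convention and the action count above can be illustrated with a minimal Python sketch. The function names and the coordinates are hypothetical and are not part of the simulation platform. The sketch also assumes that PassBall(p) ranges over p ∈ {2, ..., N + 1}, which is what the stated total of N + 1 macro actions implies.

```python
import math

def index_by_distance(positions, ball, prefix):
    """Label players K1, K2, ... (or T1, T2, ...) from closest to farthest from the ball."""
    ordered = sorted(positions, key=lambda pos: math.dist(pos, ball))
    return {f"{prefix}{i}": pos for i, pos in enumerate(ordered, start=1)}

def keepaway_actions(n):
    """Macro actions of the learning keeper K1 in an N-Keepaway game (N + 1 keepers)."""
    # Assumption: passes go to K2, ..., K(N+1), giving 1 + N actions in total.
    return ["HoldBall()"] + [f"PassBall({p})" for p in range(2, n + 2)]

def takeaway_actions(n):
    """Macro actions of each learning taker in an N-Takeaway game (N takers)."""
    # The ball holder K1 cannot be marked, so p runs over 2, ..., N + 1.
    return ["TackleBall()"] + [f"MarkKeeper({p})" for p in range(2, n + 2)]

# Hypothetical coordinates on a 20 x 20 field; the ball is with the top-left keeper.
ball = (2, 18)
keepers = index_by_distance([(18, 2), (2, 18), (18, 18)], ball, "K")
```

With these illustrative positions, the keeper at the ball's location is labelled K1, and both `keepaway_actions(2)` and `takeaway_actions(2)` contain three actions, matching N + 1 for N = 2.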
SARSA(λ) is the most widely used algorithm for both Keepaway and Takeaway games (e.g. [SSK05, IE08, GTC12, GT14a]). Because these problems have continuous state spaces, tile coding function approximation [SB98] is used together with SARSA(λ) to discretise the state space. However, even after discretisation, the state space is still large⁴ and, therefore, standard SARSA(λ) usually takes a very long time to find the optimal policy, especially in Takeaway games, which involve multiple learning agents (the time that SARSA(λ) needs to find the optimal policies is given later in Chapter 4). Since the game itself is easy to understand, people can easily propose advice for the agents, and this advice can be turned into recommendations (heuristics) so as to accelerate RL. As a concrete example, consider the scenario shown in Figure 3.1. We focus on giving recommendations to takers (i.e. we view this scenario as a snapshot of a Takeaway game). We first propose the following domain knowledge based on our observation of the game, where q ∈ {1, 2}:
1. Tq should tackle the ball if it is closest to the ball holder, because it can tackle the ball most quickly;
2. Tq should mark a keeper if the angle between Tq and this keeper, with vertex at the ball holder, is the smallest among all takers, because then Tq can block the pass to that keeper most quickly;
3. Tq should mark a keeper if Tq is closest to this keeper, because then Tq can approach this keeper most quickly.
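The three items above can be read as geometric tests on the players' positions. The following Python sketch is one possible encoding under stated assumptions: the function names and the (taker, action) pair representation are hypothetical, and the coordinates are illustrative values on a 20 x 20 field rather than positions read off Figure 3.1.

```python
import math

def angle_at(vertex, a, b):
    """Unsigned angle (radians) between the rays vertex->a and vertex->b."""
    ang_a = math.atan2(a[1] - vertex[1], a[0] - vertex[0])
    ang_b = math.atan2(b[1] - vertex[1], b[0] - vertex[0])
    diff = abs(ang_a - ang_b) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)

def recommend(keepers, takers):
    """Return a set of (taker, action) recommendations; keepers['K1'] holds the ball."""
    holder = keepers["K1"]
    recs = set()
    # Item 1: the taker closest to the ball holder should tackle.
    closest = min(takers, key=lambda t: math.dist(takers[t], holder))
    recs.add((closest, "TackleBall()"))
    for name, pos in keepers.items():
        if name == "K1":
            continue  # the ball holder cannot be marked
        action = f"MarkKeeper({name[1:]})"
        # Item 2: smallest angle taker-holder-keeper, with vertex at the holder.
        by_angle = min(takers, key=lambda t: angle_at(holder, takers[t], pos))
        recs.add((by_angle, action))
        # Item 3: the taker closest to this keeper.
        by_dist = min(takers, key=lambda t: math.dist(takers[t], pos))
        recs.add((by_dist, action))
    return recs

# Hypothetical positions (not taken from Figure 3.1).
keepers = {"K1": (2, 18), "K2": (18, 18), "K3": (18, 2)}
takers = {"T1": (5, 15), "T2": (14, 14)}
recommendations = recommend(keepers, takers)
```

Because a set is returned, a recommendation supported by more than one item (such as a taker that is both nearest to a keeper and best placed by angle) appears only once. Ties are broken arbitrarily by `min`, which is a simplification.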
Given this domain knowledge and the current state (Figure 3.1), we can give several recommendations to each taker. For T1, we can recommend that it tackle the ball according to the first item of the domain knowledge, and that it mark K3 according to the second. Similarly, we can recommend that T2 mark K2 because of items 2 and 3, and that T2 mark K3 because of item 3. However, these recommendations conflict: for example, T1 is recommended to perform both TackleBall() and MarkKeeper(3), but it can perform at most one action at any moment. We call conflicts between recommendations that ask the same agent to perform different actions internal conflicts. In addition, both T1 and T2 are recommended to mark K3, but asking multiple takers to mark one keeper is a waste of resources, since one taker is able to do the job; we therefore consider these two recommendations to be in conflict as well. This conflict is between recommendations of the same action to different agents, and we call conflicts of this kind external conflicts. From this example, we can see that, to give useful heuristics, we need not only to provide domain knowledge but also to resolve the conflicts arising from it.
⁴ The exact number of states depends on the number of agents involved. In a 2-Keepaway game, there are roughly 300 states. More details can be found in [SSK05].
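The two kinds of conflict can be detected mechanically once recommendations are written as (agent, action) pairs. The sketch below is a minimal illustration, not a resolution procedure: the pair representation and the function name are hypothetical, and it applies the definition of external conflicts literally, flagging any action recommended to more than one agent.

```python
from collections import defaultdict

def find_conflicts(recs):
    """Split (agent, action) recommendations into internal and external conflicts."""
    by_agent = defaultdict(set)
    by_action = defaultdict(set)
    for agent, action in recs:
        by_agent[agent].add(action)
        by_action[action].add(agent)
    # Internal conflict: one agent is recommended several different actions.
    internal = {ag: sorted(acts) for ag, acts in by_agent.items() if len(acts) > 1}
    # External conflict: one action is recommended to several different agents.
    external = {act: sorted(ags) for act, ags in by_action.items() if len(ags) > 1}
    return internal, external

# The four recommendations discussed in the example.
recs = [
    ("T1", "TackleBall()"),
    ("T1", "MarkKeeper(3)"),
    ("T2", "MarkKeeper(2)"),
    ("T2", "MarkKeeper(3)"),
]
internal, external = find_conflicts(recs)
```

On these four recommendations, the sketch reports an internal conflict for T1 (tackle versus mark K3) and also for T2 (mark K2 versus mark K3), and one external conflict on MarkKeeper(3), shared by T1 and T2.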