
As discussed in this chapter and in Chapter 4, the Super Mario experiments performed in this thesis use two different feature sets. The first is the Littman implementation (Goschin et al., 2013), which has at least 359 state features, a number that increases with the number of enemies and items in the environment. The Littman feature set is very detailed, containing multiple continuous features as well as a full representation of the environment. The second implementation is named Brys+, and is a small extension of the original implementation used by Tim Brys in his research (Brys, 2016; Harutyunyan, Brys, et al., 2015) (see Table 4.2 in Chapter 4 for a list of modifications). The Brys+ feature set has a constant 31 features, all discrete. By comparison, the Brys+ implementation is much simpler than the Littman implementation, but contains much less information about the environment.

Figure 8.12 shows the benchmark performance of an unassisted Q-Learning agent for the Littman and Brys+ implementations of the Super Mario environment. These results show that while the agent using the Brys+ implementation initially learnt more quickly than the Littman agent, it did not reach as good a final solution. This is expected, as the Littman implementation contains more information, resulting in a slower learning speed but a better fitting solution.

Figure 8.12: Unassisted Q-Learning benchmark performance on Super Mario using Littman and Brys+ feature spaces. These results show that an unassisted Q-Learning agent will learn faster using the Brys+ implementation. However, when using the Littman implementation, the agent can learn a better solution despite taking longer to learn.

The Super Mario environment is complex and quite large, making it difficult for a user to provide a full and accurate advice model. Additionally, it is not easy to identify whether the strategy that maximises reward is to collect all the items and kill all the enemies, or to race to the end of the environment to collect the time bonus (see Chapter 4 for more information). The simulated user designed for the following experiments attempts to provide advice that considers both strategies. The user provides advice that encourages the agent to move right, towards the end of the environment, while also collecting items that are easily reachable along the way. To provide advice, the user uses the Brys+ feature set, as it is short and discrete, making the process of constructing and providing rules easier than for the Littman implementation.
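As a minimal sketch of how advice of this kind could be written against a small discrete feature set, the rules below are ordered condition-action pairs, and the first matching rule supplies the advised action. The feature names (`obstacle_right`, `item_above`), the action names, and the rules themselves are hypothetical illustrations; they are not the Brys+ features or the rules used by the simulated user in these experiments.

```python
from typing import Callable, Dict, List, NamedTuple, Optional

# Hypothetical discrete observation: feature name -> integer value.
Observation = Dict[str, int]


class Rule(NamedTuple):
    name: str
    condition: Callable[[Observation], bool]
    action: str


# Hypothetical rules, ordered from most to least specific.
RULES: List[Rule] = [
    Rule("jump over obstacle", lambda o: o.get("obstacle_right", 0) == 1, "jump_right"),
    Rule("collect reachable item", lambda o: o.get("item_above", 0) == 1, "jump"),
    Rule("default: head to the goal", lambda o: True, "right"),
]


def advise(obs: Observation, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the action of the first rule whose condition holds, else None."""
    for rule in rules:
        if rule.condition(obs):
            return rule.action
    return None
```

Because every condition reads only a handful of discrete features, each rule covers many states at once, which is what makes a short, discrete feature set convenient for the advisor.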

Figure 8.13 shows the results of a rule-assisted interactive reinforcement learning agent, where the agent and the advisor used the Brys+ feature set. These results show that the assisted agent initially learnt much faster than the unassisted Brys+ agent. However, the performance of the assisted agent dropped over time, as the agent began to ignore the advice it received and rely on its own policy and exploration strategy. The benefit of advice in this situation is debatable. The cumulative performance of the two agents is the same, within the error margin. However, the reward from the worst performing episodes of the unassisted agent is substantially lower than that from the worst episodes of the assisted agent.

Figure 8.13: Rule-Assisted Interactive Reinforcement Learning on Super Mario using a Brys+ advice and state feature set. The assisted agent initially has a better performance. However, as the agent begins to ignore the human advice and search for the optimal behaviour, the unassisted agent outperforms the assisted agent. The benefit of advice in this situation is debatable. While both agents have roughly equal total reward, the assisted agent had a higher minimum performance.

The results in Figure 8.12 show that an agent can learn a better policy when using the Littman feature set. However, it is easier for the human to provide advice using the Brys+ feature set. As discussed earlier in this chapter, all the information required to construct a Brys+ observation can be derived from a Littman observation. This allows the agent to learn a policy over Littman observations while accepting and using advice expressed in the Brys+ feature set. At each timestep, the agent interprets the Brys+ advice and applies it to the current Littman observation.
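The idea of learning over one feature set while taking advice over another can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the thesis: `project`, `advise`, `state_key`, the feature names, and the fixed `follow_prob` parameter are all hypothetical stand-ins, and the small dictionaries stand in for the much larger Littman and Brys+ observations.

```python
import random
from typing import Dict, List, Optional, Tuple

RichObs = Dict[str, float]   # stand-in for a detailed, partly continuous observation
SimpleObs = Dict[str, int]   # stand-in for a small discrete advice feature set
ACTIONS: List[str] = ["left", "right", "jump"]


def project(rich: RichObs) -> SimpleObs:
    """Derive the discrete advice features from the detailed observation (hypothetical)."""
    gap = rich["enemy_x"] - rich["mario_x"]
    return {"enemy_close_right": int(0.0 < gap <= 2.0)}


def advise(simple: SimpleObs) -> Optional[str]:
    """Hypothetical rule written against the simple feature set only."""
    return "jump" if simple["enemy_close_right"] == 1 else "right"


def state_key(rich: RichObs) -> Tuple[Tuple[str, float], ...]:
    """Hashable key for the Q-table; the policy is indexed by the detailed observation."""
    return tuple(sorted(rich.items()))


def select_action(rich: RichObs, q: Dict, follow_prob: float,
                  epsilon: float, rng: random.Random) -> str:
    """Follow advice with probability follow_prob, otherwise act epsilon-greedily."""
    advice = advise(project(rich))
    if advice is not None and rng.random() < follow_prob:
        return advice
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    key = state_key(rich)
    return max(ACTIONS, key=lambda a: q.get((key, a), 0.0))
```

The advisor never sees the detailed observation, and the Q-table never sees the simple one; the projection is the only point where the two representations meet.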

Figure 8.14: Rule-Assisted Interactive Reinforcement Learning on Super Mario using a Brys+ advice feature set and a Littman state feature set. The user provided advice in the context of the Brys+ implementation and the agent learnt a policy for the Littman implementation. The assisted agent performs better initially, but its performance degrades as it begins to disregard user advice in favour of its own exploration policy. Both agents end up learning the same behaviour, with the assisted agent slightly outperforming the unassisted agent overall.

Figure 8.14 shows the results of a rule-assisted agent that is learning a policy for the Littman feature set, but receives advice for the Brys+ feature set. These results show that the assisted agent had an immediate performance gain, but performance dropped as the agent incorporated the advice and began its own exploration. However, the assisted agent was able to recover and match the performance of the unassisted agent with little delay.

The number of interactions performed by the user in each of these experiments was equal to the number of rules the user could provide. The simulated user built for Mario has 4 rules, so each experiment recorded only 4 interactions.

8.7 Conclusion

This chapter introduced rule-based Interactive Reinforcement Learning, a method for users to assist agents through the use of rule-structured advice and retention. Three environments were tested to investigate the impact that rule-based advice has on performance and on the number of interactions performed to achieve the measured performance. Two of the environments also tested the use of rules that, while not optimal for the reward function, would still provide beneficial advice to the agent. These tests found that the agent can use this advice to improve learning speed, and can still learn to ignore the incorrect or non-optimal advice later to achieve the optimal behaviour. Compared to state-based advice for Interactive Reinforcement Learning, rule-based advice was able to achieve the same level of performance with substantially fewer interactions between the agent and the user.

This chapter did not investigate the time and cognitive requirements for users to construct state-based and rule-based advice. It is likely that rule-based advice will require more time and thought to construct. However, existing research has shown that decision trees built with Ripple-Down Rules are easier for users to construct (Gaines & Compton, 1995; Compton et al., 1991; Compton, Peters, Edwards, & Lavers, 2006). Future work is required to test whether the benefit that rules provide over state-based advice, in terms of the number of interactions, justifies this additional effort.

Chapter 9

Conclusion and Future Work

This chapter summarises the contributions made by this thesis, and highlights directions for future research.