In the first experimental setup, we tested the Q-learning algorithm in a grid-world. A grid-world is an environment in which a state is described only by a location in the world; the agent navigates the environment by choosing to move in one of the cardinal directions. The shape of the environment is very similar to that used in [93], where a square 13 × 13 world is divided into 4 areas by means of barriers. Figure 3.1 shows a representation of this environment. The barriers are interrupted to allow traveling from one “room” to another; these gaps are called “hallways”. The environment is non-deterministic: the probability of transitioning in the desired direction is 2/3; the agent moves in one of the other cardinal directions with probability 1/3, that is, 1/9 for each of them.
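The transition noise described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the names `MOVES`, `transition_probabilities` and `sample_direction` are hypothetical, and the handling of barriers and borders is deliberately left out because the text does not specify it here.

```python
import random

# Cardinal moves as (row, column) offsets; the labels are illustrative.
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def transition_probabilities(intended):
    """P(actual direction | intended direction): 2/3 intended, 1/9 for each other."""
    return {d: (2 / 3 if d == intended else 1 / 9) for d in MOVES}

def sample_direction(intended, rng=random):
    """Draw the direction actually taken when the agent intends `intended`."""
    probs = transition_probabilities(intended)
    directions = list(probs)
    return rng.choices(directions, weights=[probs[d] for d in directions])[0]
```

The three unintended directions together carry probability 3 × 1/9 = 1/3, so the distribution sums to one.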
Flavors
We used the grid-world topology described above in two different flavors.
• The black states separating the four areas are walls, meaning they are impenetrable. This is the same setting used in [93].
• The black states separating the four areas are ponds, which means they are not impassable, but have a low reward. The rationale behind this choice is to make the hallways part of the paths chosen by the experts rather than unavoidable steps. This is because our algorithm detects as subgoals only states that the experts explicitly chose among the others, as opposed to states that were simply unavoidable.
Figure 3.1 The grid-world used in the experiments. The yellow squares represent the destinations used in the different experiments: the one in the upper corner is labeled [0], the one in the hallway is labeled [45] and the other one is labeled [47]. These labels will be used to present the results of the experiments. The black squares represent unreachable states in the walls setting and low-reward states in the ponds setting.
1 http://www.numpy.org/
2 http://scikit-learn.org/
Specifically, while normal states have a reward of 0, ponds have a reward of −10. The final state has a reward of 1000. When the final state is reached, the execution terminates.
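As a small illustration, the reward values of the ponds flavor could be written as below. The function name `reward`, its arguments, and the convention that the reward is received on entering a state are assumptions made for this sketch; only the three numeric values come from the text.

```python
NORMAL_REWARD = 0    # ordinary states
POND_REWARD = -10    # black states in the ponds flavor
GOAL_REWARD = 1000   # final state; the episode terminates here

def reward(state, goal, ponds):
    """Reward for entering `state` (assumed convention) in the ponds flavor."""
    if state == goal:
        return GOAL_REWARD
    if state in ponds:
        return POND_REWARD
    return NORMAL_REWARD
```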
The key difference between the two flavors lies in where the obstacles are encoded. In the “walls” case, the unreachability of the blocks is encoded in the transition function, which our algorithm knows. As a consequence, the expert's decision to pass through the hallways is not surprising. In the “ponds” scenario, the penalty of these states is encoded in the reward function, which is unknown to our algorithm. As a consequence, the expert's decision to pass through a hallway rather than directly over a pond is unpredictable and is therefore picked up by our algorithm.
In each of the flavors, different sets of experiments are run: each set of experiments uses a different destination for the agent. The states we chose as destinations are shown in Figure 3.1. We selected these destinations for specific reasons: one of them is in a corner, away from most common paths; another one is in a hallway, which is one of the hand-crafted options in [93]; the third one is close to a hallway and, as such, is part of many common paths.
By choosing different destinations, it is possible to compare the performance of agents using different sets of options. If the options are close to the destination, the agent has an advantage: when it chooses a random, exploratory action, it is more likely to select an option that will lead it close to the destination. On the other hand, if the options are all far from the destination, they will put the agent at a disadvantage.
Q-learning details
The value of the discount factor of the MDP was set to γ = 0.97. The options are generated according to Equation 3.1 using c = 50. The exploration strategy used in the experiments is ε-greedy, with ε = 0.1.
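For reference, ε-greedy selection and the standard one-step Q-learning update with these parameters can be sketched as follows. The learning rate `ALPHA` is not given in the text and is purely an assumption here, as are the function names and the dictionary-based Q-table; the update shown is the one-step rule for primitive actions, whereas options that last several steps would need a discounted multi-step target.

```python
import random

GAMMA = 0.97    # discount factor, from the text
EPSILON = 0.1   # exploration rate, from the text
ALPHA = 0.1     # learning rate: an assumption, not stated in the text

def epsilon_greedy(q_values, epsilon=EPSILON, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one.

    `q_values` maps each available action (or option) to its current estimate.
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)
    best = max(q_values.values())
    # Break ties among equally good actions at random.
    return rng.choice([a for a in actions if q_values[a] == best])

def q_update(q, state, action, reward, next_state, done,
             alpha=ALPHA, gamma=GAMMA):
    """One-step Q-learning update for a primitive action, applied in place."""
    target = reward if done else reward + gamma * max(q[next_state].values())
    q[state][action] += alpha * (target - q[state][action])
```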
Baseline
The purpose of the experiments is to compare the quality of the hand-crafted options used in [93], aiming at the hallways, denoted by H, with that of options learned from demonstrations by our algorithm, denoted by L. These sets of options are also compared against the set of primitive actions only, denoted by A. The sets A ∪ H and A ∪ L are also tested.
We selected the hand-crafted options as the baseline for comparison, rather than a random selection of states, because the former are expected to perform better. Due to the structure of the environment, a hallway state necessarily lies on any path that travels between two different rooms. Given this, their contribution to the Q-learning algorithm is expected to be more useful than that of other random locations [93].
Goals number reduction
We reduced the number of detected goals using the graph-based technique described in Section 3.3.1. We then selected a representative from each of the top 4 clusters of goals. The number 4 is selected to provide a fair comparison with the baseline, which uses 4 options.
Experimental procedure
Each of the 5 setups (A, H, L, A ∪ H and A ∪ L) was repeated 200 times, each repetition allowing Q-learning to run for 3000 episodes. For the setups including learned options L, a new set of demonstrations was generated at every repetition as follows:
• a state was randomly selected as a goal;
• a reward function was generated as in Equation 3.1;
• Value Iteration was used to compute the optimal value function;
• the optimal policy was computed from the optimal value function;
• a random initial state was selected;
• the optimal policy was followed until the goal state was reached;
• every step of this interaction was recorded and made into a demonstration.
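The steps above can be sketched in a few lines of Python. This is only an illustration under several assumptions: it uses a small open 5 × 5 grid instead of the 13 × 13 world with barriers, it replaces the reward of Equation 3.1 (not reproduced in this section) with a stand-in reward of 1 on entering the goal, it assumes that bumping into the border leaves the agent in place, and it draws the initial state among non-goal states. All function names are hypothetical.

```python
import random

SIZE = 5        # small open grid for illustration only
GAMMA = 0.97    # discount factor, from the text
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]
STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]

def step(state, move):
    """Apply a move; leaving the grid keeps the agent in place (assumption)."""
    r, c = state[0] + move[0], state[1] + move[1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state

def outcomes(state, action):
    """(probability, next_state) pairs: 2/3 intended move, 1/9 each other move."""
    return [(2 / 3 if m == action else 1 / 9, step(state, m)) for m in MOVES]

def q_value(values, state, action, goal):
    """Expected return of `action`, with stand-in reward 1 on entering the goal."""
    return sum(p * ((1.0 if s2 == goal else 0.0) + GAMMA * values[s2])
               for p, s2 in outcomes(state, action))

def value_iteration(goal, tol=1e-8):
    """Compute the optimal value function; the goal is terminal with value 0."""
    values = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s == goal:
                continue
            best = max(q_value(values, s, a, goal) for a in MOVES)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values

def greedy_policy(values, goal):
    """Optimal policy: the greedy action with respect to the optimal values."""
    return {s: max(MOVES, key=lambda a: q_value(values, s, a, goal))
            for s in STATES if s != goal}

def generate_demonstration(rng):
    """Random goal, optimal policy, random start, rollout recorded step by step."""
    goal = rng.choice(STATES)
    policy = greedy_policy(value_iteration(goal), goal)
    state = rng.choice([s for s in STATES if s != goal])
    demo = []
    while state != goal:
        action = policy[state]
        probs, nexts = zip(*outcomes(state, action))
        next_state = rng.choices(nexts, weights=probs)[0]
        demo.append((state, action, next_state))
        state = next_state
    return goal, demo
```

Calling `generate_demonstration(random.Random(0))` returns the sampled goal together with a list of `(state, action, next_state)` triples ending in that goal.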