Given this generative model, the maximum likelihood estimate of the transition graph for a given set of transitions $T$ would be $G^* = \arg\max_G L_T(G)$. Unfortunately, solving this problem directly is hard; we propose the FIGE heuristic, which aims at finding close-to-optimal transition graphs iteratively and is computationally tractable. FIGE is an iterative algorithm whose update equations are derived from the maximum likelihood objective using two simplifying assumptions (see Appendix A.2): (A1) For each transition $(s, a, s') \in T$, assume $p(v' \mid v, a) = 1$ if $v = \mathrm{NN}_V(s) \wedge v' = \mathrm{NN}_V(s')$ and $0$ otherwise, where $\mathrm{NN}_V(s) = \arg\min_{v \in V} \|s - v\|_2$. This assumption implies that whenever action $a$ is executed in any state of the Voronoi cell $\mathrm{Vo}(v) = \{s \in \mathcal{S} \mid \mathrm{NN}_V(s) = v\}$, the successor state lies in $\mathrm{Vo}(v')$ with probability 1. (A2) Assume $p(T \mid V) = \prod_{v \in V} p(T \mid v)$. This assumption implies that the positions of the graph nodes $v \in V$ can be chosen independently. Both assumptions typically oversimplify: A1 is a stronger simplification for domains whose dynamics are less locally smooth, whereas A2 is a stronger simplification in strongly connected domains, where many transitions occur from the Voronoi cell of one node to the Voronoi cell of another node. To account for some of the errors introduced by A1 and A2, FIGE iteratively refines the graph node positions by applying the derived update equations several times. Note that FIGE is a heuristic and comes with no guarantee of converging to $G^*$.
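The nearest-neighbor mapping $\mathrm{NN}_V$ and the Voronoi cells used in A1 can be illustrated with a minimal Python sketch. The function names (`nearest_node`, `voronoi_cells`, `abstract_transition`) and the representation of states and nodes as tuples of floats are assumptions made for this illustration only; ties between equally close nodes are broken in favor of the lowest index, which the text does not specify.

```python
from collections import defaultdict

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def nearest_node(s, nodes):
    """Index of the node closest to state s, i.e. NN_V(s) (lowest index wins ties)."""
    return min(range(len(nodes)), key=lambda j: sq_dist(s, nodes[j]))

def voronoi_cells(states, nodes):
    """Group the observed states by the index of their nearest node."""
    cells = defaultdict(list)
    for s in states:
        cells[nearest_node(s, nodes)].append(s)
    return dict(cells)

def abstract_transition(transition, nodes):
    """Under A1, (s, a, s') is treated as the deterministic move (v, a, v')."""
    s, a, s_next = transition
    return nearest_node(s, nodes), a, nearest_node(s_next, nodes)
```

For two nodes at `(0.0, 0.0)` and `(1.0, 0.0)`, the transition from `(0.1, 0.0)` to `(0.8, 0.0)` is mapped to a move from node 0 to node 1.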
FIGE is summarized in Algorithm 5.1: The set of graph nodes $V$ is initialized such that it covers the set of states contained in $T$ uniformly, e.g., by maximizing the distance of the closest pair of graph nodes (line 3). Afterwards, for $K$ iterations, the graph nodes are moved according to two kinds of “forces” that act on them (see Figure 5.2): The “sample representation” force $F_S$ (lines 6-7) pulls each graph node $v$ to the mean of all states $S_v$ for which it is responsible, i.e., the states $s$ for which it is the nearest neighbor $\mathrm{NN}_V(s)$ in $V$. Thus, this force encourages node positions that capture the on-policy state distribution well and corresponds to an intrinsic k-means clustering. The “graph consistency” force $F_G$ (lines 8-10) pulls each graph node $v$ to a position where for all $(s, a, s') \in T$ with $\mathrm{NN}_V(s) = v$ there is a vertex $v'$ such that $v' - v$ is similar to $s' - s$, i.e., both vectors are close to parallel. Thus, this force encourages node positions which can represent the domain’s dynamics well. The nodes are then moved according to the
91 5.3 METHODS
Algorithm 5.1 Force-Based Iterative Graph Estimation (FIGE)
1: Input: Transitions $T = \{(s_i, a_i, s'_i)\}_{i=1}^{n}$, parameters $v_{num}$, $K$
2: # Choose initial node positions $V$ from states in $T$ s.t. the distance of the closest pair is maximized
3: $V = \mathrm{INITIALIZE}(T, v_{num})$  # $|V| = v_{num}$
4: for $i = 0$ to $K - 1$ do
5:     for all $v \in V$ do
6:         $S_V[v] = \{s \mid \exists (s, a, s') \in T : \mathrm{NN}_V(s) = v\}$  # Observed states in Voronoi cell $\mathrm{Vo}(v)$
7:         $F_S[v] = \mathrm{MEAN}(S_V[v]) - v$  # Sample representation force
8:         $T_{\rightarrow}(v) = \{\mathrm{NN}_V(s') - s' + s \mid \exists (s, a, s') \in T : \mathrm{NN}_V(s) = v\}$  # Transitions starting in $\mathrm{Vo}(v)$
9:         $T_{\leftarrow}(v) = \{\mathrm{NN}_V(s) - s + s' \mid \exists (s, a, s') \in T : \mathrm{NN}_V(s') = v\}$  # Transitions ending in $\mathrm{Vo}(v)$
10:        $F_G[v] = 0.5 \cdot [\mathrm{MEAN}(T_{\rightarrow}(v)) + \mathrm{MEAN}(T_{\leftarrow}(v))] - v$  # Graph consistency force
11:    $V = V + \alpha_i \cdot 0.5\,(F_S[V] + F_G[V])$  # Update node positions (vector notation)
12: # Count transitions from Voronoi cell $\mathrm{Vo}(v)$ to Voronoi cell $\mathrm{Vo}(v')$ under action $a$
13: $n^a_{vv'} = |\{(s, s') \mid \exists (s, a, s') \in T : \mathrm{NN}_V(s) = v \wedge \mathrm{NN}_V(s') = v'\}|$
14: $E = \{(v, v') \mid v, v' \in V \wedge \exists a \in A : n^a_{vv'} > 0\}$  # Edge between $v$ and $v'$
15: $w_{vv'} = \frac{1}{|A|} \sum_{a \in A} \frac{n^a_{vv'}}{\sum_{\tilde{v}} n^a_{v\tilde{v}}}$  # Off-policy edge weights from Section 4.2.1
16: return $(V, E, w)$
two forces (line 11), where the parameter $\alpha_i \in (0, 1]$ controls how greedily the node is moved to the position where the forces would become minimal. To ensure convergence of the graph nodes, $\alpha_i$ should go to 0 as $i$ approaches $K$. If not explicitly stated, we use $\alpha_i = \lceil i/5 \rceil^{-1}$ and $K = 15$. An edge is added between two nodes $v$ and $v'$ if there exists at least one transition $(s, a, s') \in T$ with $s$ in the Voronoi cell $\mathrm{Vo}(v)$ and $s'$ in $\mathrm{Vo}(v')$ (line 14). Moreover, the off-policy edge weights derived in Section 4.2.1 are used (line 15).
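A single force-based update (lines 5-11 of Algorithm 5.1) could be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the name `fige_step`, the tuple-based state representation, and the naive linear nearest-neighbor search are assumptions. In addition, the algorithm does not state what happens when a node has no observed states or no incoming transitions; the sketch assumes that an empty set contributes the node's current position, so that the corresponding force term vanishes.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(s, nodes):
    """Index of NN_V(s) by naive linear search."""
    return min(range(len(nodes)), key=lambda j: sq_dist(s, nodes[j]))

def mean_or(points, fallback):
    """Component-wise mean; `fallback` if there are no points (assumption)."""
    if not points:
        return fallback
    return tuple(sum(c) / len(points) for c in zip(*points))

def fige_step(nodes, transitions, alpha):
    """Return the node positions after one update with step size alpha."""
    # Nearest nodes of s and s' w.r.t. the positions before the update.
    nn = [(nearest(s, nodes), nearest(s2, nodes)) for s, _, s2 in transitions]
    updated = []
    for j, v in enumerate(nodes):
        cell, outgoing, incoming = [], [], []
        for (s, _, s2), (src, dst) in zip(transitions, nn):
            if src == j:
                cell.append(s)  # state observed in Vo(v)
                # NN_V(s') - s' + s: target position for the start node
                outgoing.append(tuple(n - b + a for n, a, b in zip(nodes[dst], s, s2)))
            if dst == j:
                # NN_V(s) - s + s': target position for the end node
                incoming.append(tuple(n - a + b for n, a, b in zip(nodes[src], s, s2)))
        f_s = tuple(m - x for m, x in zip(mean_or(cell, v), v))
        m_out, m_in = mean_or(outgoing, v), mean_or(incoming, v)
        f_g = tuple(0.5 * (o + i) - x for o, i, x in zip(m_out, m_in, v))
        updated.append(tuple(x + alpha * 0.5 * (a + b) for x, a, b in zip(v, f_s, f_g)))
    return updated
```

With nodes at `(0.0,)` and `(2.0,)` and the single transition from `(0.5,)` to `(1.5,)`, one step with `alpha = 1.0` moves the nodes towards each other until their offset better matches the transition vector $s' - s$.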
The derivation of FIGE from the maximum likelihood objective is given in Appendix A.2. FIGE’s property of first choosing the graph node positions $V$ and afterwards choosing $E$ and $w$ is a direct consequence of Assumption A1. Similarly, Assumption A2 allows FIGE to ignore node interactions within an iteration and to choose each graph node’s position greedily.
FIGE’s time complexity is dominated by the nearest-neighbor queries: in each of the $K$ iterations and for every $s$ and $s'$ occurring in $T$, the nearest neighbor in $V$ needs to be determined. Using $|T| = n$ and assuming that “naive”, linear nearest-neighbor search is used, this requires $O(n \, v_{num} \, n_s)$ per iteration, where $n_s$ is the dimensionality of the state space. Thus, the time complexity of FIGE is in $O(K \, n \, v_{num} \, n_s)$.
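As a rough worked example (a back-of-the-envelope count under stated assumptions, not a measurement), consider the mountain car setting of Section 5.3.2.1 with $n = 40000$, $v_{num} = 250$, $n_s = 2$, and $K = 25$ iterations. Assuming each transition triggers two queries (one for $s$ and one for $s'$) and each naive query compares against all $v_{num}$ nodes in $n_s$ coordinates, the number of coordinate comparisons is

$$K \cdot 2n \cdot v_{num} \cdot n_s = 25 \cdot 80000 \cdot 250 \cdot 2 = 10^9 .$$

The factor 2 for the two queries per transition is a constant and is absorbed by the $O$-notation.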
Note that FIGE uses the assumption of a discrete action space solely in lines 13-15 for the importance sampling required for computing the off-policy edge weights. FIGE could be extended to continuous action spaces by using the on-policy edge weights $w^{on} = n_{vv'}/n_v$ with $n_{vv'} = |\{(s, s') \mid \exists (s, a, s') \in T : \mathrm{NN}_V(s) = v \wedge \mathrm{NN}_V(s') = v'\}|$ and $n_v = \sum_{v'} n_{vv'}$. Alternatively, other means for estimating the sampling policy could be employed which extend to policies over continuous action spaces.
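The edge set and both kinds of edge weights can be computed from the transition counts, as the following Python sketch illustrates. The function name `edge_weights` and the dictionary-based output are assumptions for illustration. The sketch further assumes that every transition in $T$ is counted once, that the off-policy weight normalizes each action's count by the number of transitions leaving the same cell under that action, and that an action never observed in a cell contributes nothing to the sum; none of these details is fixed by the text above.

```python
from collections import Counter

def nearest(s, nodes):
    """Index of NN_V(s) by naive linear search."""
    return min(range(len(nodes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(s, nodes[j])))

def edge_weights(nodes, transitions, num_actions):
    """Return (edges, off-policy weights, on-policy weights)."""
    n_avv = Counter()  # (a, v, v') -> transitions from Vo(v) to Vo(v') under a
    for s, a, s_next in transitions:
        n_avv[(a, nearest(s, nodes), nearest(s_next, nodes))] += 1
    n_av = Counter()   # (a, v) -> transitions leaving Vo(v) under a
    n_vv = Counter()   # (v, v') -> transitions from Vo(v) to Vo(v'), any action
    n_v = Counter()    # v -> transitions leaving Vo(v), any action
    for (a, v, v2), c in n_avv.items():
        n_av[(a, v)] += c
        n_vv[(v, v2)] += c
        n_v[v] += c
    edges = set(n_vv)  # an edge exists iff some count is positive
    w_off = dict.fromkeys(edges, 0.0)
    for (a, v, v2), c in n_avv.items():
        w_off[(v, v2)] += c / n_av[(a, v)] / num_actions
    w_on = {(v, v2): c / n_v[v] for (v, v2), c in n_vv.items()}
    return edges, w_off, w_on
```

The on-policy weights only need the action-independent counts, which is why they carry over to continuous action spaces.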
5. LEARNING GRAPH-BASED REPRESENTATIONS 92
Figure 5.2 – Illustration of the forces in FIGE. Shown are the forces that act on graph nodes (depicted as stars) in FIGE. The left diagram depicts the sample representation force $F_S$ acting on node $v$. This force is exerted by the set of states $S_V$ (red dots) for which $v$ is the closest graph node and pulls $v$ to the mean of $S_V$. The right diagram shows the graph consistency force $F_G$ exerted by the transition from state $s$ to $s'$ onto the graph nodes $v = \mathrm{NN}_V(s)$ and $v' = \mathrm{NN}_V(s')$. The force $F_G$ pulls node $v$ to position $v' - (s' - s)$ and node $v'$ to position $v + (s' - s)$.
Figure 5.3 – Illustration of the mountain car domain. The car is denoted by a black dot; its movement is restricted to a one-dimensional surface. The objective of the car is to reach the top of the right hill. Since the car is underpowered, it cannot reach the goal directly but must first build sufficient energy by oscillating back and forth between the two hills. We refer to Sutton and Barto (1998, Chapter 8.4) for more details.
5.3.2.1 Illustration
We illustrate the different heuristics for transition graph generation in the mountain car domain (see Figure 5.3). In mountain car, the agent controls a car that is placed in a one-dimensional valley and must reach the top of the right hill. Since the car is underpowered, it cannot reach the goal directly but must first build sufficient energy by oscillating back and forth between the two hills. The agent observes two continuous state variables, the position $x$ and the velocity $v_x$ of the car, and chooses among the three
Figure 5.4 – Illustration of the transition graphs during FIGE’s iterations. Every point in the upper left plot corresponds to one state and the streamlines show the dynamics of the domain when no force is applied to the car. Background colors show the density of the on-policy state distribution for a policy that selects actions uniformly at random, and the gray area corresponds to the terminal region of the state space. The other plots visualize the graphs generated by FIGE after 0, 5, and 25 iterations for $v_{num} = 250$.
actions left, none, and right, which add $-0.001$, $0$, and $0.001$ to $v_x$, respectively. At each time step, $x$ is incremented by $v_x$ and, due to gravity, $-0.0025\cos(3x)$ is added to $v_x$. The velocity $v_x$ is constrained to a maximal absolute value of $0.07$ and set to $0$ if the top of the left hill is reached. The mountain car domain is well suited for illustration because of its two-dimensional state space.
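The dynamics described above can be written down as a short Python sketch. The position bounds (`X_MIN = -1.2` for the top of the left hill and `X_GOAL = 0.5` for the goal) are not stated in the text and are assumed here following the standard formulation in Sutton and Barto (1998); the order of the velocity and position updates and the function name `mountain_car_step` are likewise assumptions.

```python
import math

ACTION_EFFECT = {"left": -0.001, "none": 0.0, "right": 0.001}
V_MAX = 0.07                # maximal absolute velocity (from the text)
X_MIN, X_GOAL = -1.2, 0.5   # assumed bounds, not given in the text

def mountain_car_step(x, vx, action):
    """Return (x, vx, done) after applying one action."""
    # Action and gravity both act on the velocity.
    vx = vx + ACTION_EFFECT[action] - 0.0025 * math.cos(3 * x)
    vx = max(-V_MAX, min(V_MAX, vx))   # constrain |vx| <= 0.07
    x = x + vx
    if x <= X_MIN:                     # top of the left hill: stop the car
        x, vx = X_MIN, 0.0
    return x, vx, x >= X_GOAL
```

Repeatedly calling `mountain_car_step` with actions drawn uniformly at random yields transitions $(s, a, s')$ with $s = (x, v_x)$ of the kind used as input $T$ for the graph heuristics.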
The domain’s dynamics for the none action are visualized in the upper left plot of Figure 5.4. The other plots show the graphs generated by FIGE after 0, 5, and 25 iterations for $v_{num} = 250$ and $|T| = 40000$. Because of its initialization of graph node positions, FIGE covers the state space of the mountain car domain quite well even before the first iteration of the force-based updates. Unfortunately, the dynamics of the domain can hardly be represented by edges of the graph for this choice of graph node positions. However, because of the graph consistency forces that act on the graph node positions, the representability of the domain’s dynamics increases considerably with the number of iterations. At the same time, the sample representation forces ensure that all relevant parts of the state space remain covered by graph nodes. After 25 iterations, the graph structure nicely reflects the domain’s dynamics.
Figure 5.5 illustrates the graphs generated by FIGE after 25 iterations and by the three other heuristics discussed in Section 5.2.1. While the transition graphs generated by the grid and the ε-net heuristics cover the state space close to uniformly, the domain’s dynamics are hardly recognizable. Even worse, the on-policy heuristic generates graphs that do not cover the state space well because independent sampling does not take
Figure 5.5 – Illustration of four different heuristics for transition graph generation. Shown are the generated transition graphs in the mountain car domain (cf. Figure 5.4).
into account the Euclidean nature of the state space. In contrast, the transition graph generated by FIGE nicely reflects the domain’s dynamics.
5.3.3 Skill Prototype Generation
Plugging FIGE into the graph-based skill discovery approach shown in Figure 5.1 allows generating a transition graph $G = (V, E, w)$ and, by means of graph clustering, its partition $P_G$. For learning an option $o$ based on a newly identified bottleneck, we need to choose an appropriate skill prototype $\Psi_o = (I_o, \beta_o, R_o)$ based on the identified partition $P_G$. For this, the partition $P_G = \{p_1, \dots, p_n\}$ of the transition graph is generalized to a partition $P_S = \{\Pi_S(p_1), \dots, \Pi_S(p_n)\}$ of the entire state space $\mathcal{S}$ by a nearest-neighbor-based generalization $\Pi_S$: we set $\Pi_S(p_i) = \{s \in \mathcal{S} \mid \mathrm{NN}_V(s) \in p_i\}$, i.e., we assign each state $s$ to the cluster of its closest vertex $\mathrm{NN}_V(s)$.
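The nearest-neighbor-based generalization $\Pi_S$ amounts to a simple lookup, sketched below in Python. The name `state_cluster` and the encoding of the graph partition as a list `node_cluster` that maps each node index to its cluster index are assumptions made for this illustration.

```python
def nearest(s, nodes):
    """Index of NN_V(s) by naive linear search."""
    return min(range(len(nodes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(s, nodes[j])))

def state_cluster(s, nodes, node_cluster):
    """Cluster index i such that s lies in Pi_S(p_i).

    node_cluster[j] is the index of the cluster that contains node j.
    """
    return node_cluster[nearest(s, nodes)]
```

Because the lookup is defined for every state, not only for the graph nodes, membership in a cluster can be tested for arbitrary states of the original MDP.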
Similar to Section 4.2.4, we can now create skill prototypes for each pair of connected clusters $A, B \in P_G$: For each cluster $A$, one skill is generated for each adjacent¹ cluster $B$. The corresponding skill prototype $\Psi_{A \to B} = (I_{A \to B}, \beta_{A \to B}, R_{A \to B})$ is defined as:
$I_{A \to B} = \Pi_S(A)$
$\beta_{A \to B}(s) = 0$ if $s \in I_{A \to B}$, else $1$
$R_{A \to B}((s, a, r, s')) = -1$ if $s' \in \Pi_S(A \cup B)$, else $r_p - 1$.
The prototype $\Psi_{A \to B}$ corresponds to a skill that can be invoked anywhere in cluster $A$, terminates successfully everywhere in cluster $B$, and terminates unsuccessfully in all other clusters. The parameter $r_p$ 0 of the algorithm determines the penalty for such
¹ Two clusters $A$ and $B$ are adjacent in $G = (V, E)$ if there exists $v$
an unsuccessful termination. Otherwise a reward of −1 is given. Thus, the optimal policy corresponds to traversing the bottleneck area between A and B as fast as possible while not leaving the clusters A and B.
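The three components of a skill prototype can be expressed as small functions, as in the following Python sketch. The factory `make_skill_prototype` and the callable `cluster_of` (which is assumed to implement the nearest-neighbor generalization $\Pi_S$ by returning the cluster index of a state) are hypothetical names introduced for illustration; the value of `r_p` is left to the caller.

```python
def make_skill_prototype(cluster_of, a, b, r_p):
    """Return (initiation, termination, reward) for the skill from cluster a to b."""

    def initiation(s):
        # I_{A->B}: the skill can be invoked anywhere in cluster A.
        return cluster_of(s) == a

    def termination(s):
        # beta_{A->B}(s): 0 inside A, 1 everywhere else.
        return 0 if initiation(s) else 1

    def reward(transition):
        # R_{A->B}: -1 per step while staying in A or B, else r_p - 1.
        _s, _a, _r, s_next = transition
        return -1 if cluster_of(s_next) in (a, b) else r_p - 1

    return initiation, termination, reward
```

For instance, with a one-dimensional state space split into three clusters and an arbitrary illustrative penalty `r_p = -100`, a step from cluster 0 into cluster 2 under the prototype for clusters 0 and 1 yields a reward of `-101`, whereas a step into cluster 1 yields `-1`.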
Additionally, for each cluster that contains nodes in which an episode has terminated, a special skill $\Psi_A^{st}$ is created that can be invoked in any state of the cluster, terminates successfully when the episode terminates, and terminates unsuccessfully (i.e., obtains the penalty $r_p$) if the cluster is left (cf. Section 4.2.4). Note that in contrast to Mannor et al. (2004), the generalization of the graph partition to the entire state space allows performing the learning of skills and higher-level policies in the original MDP and not in a discretized version of it.
5.4 RESULTS
In this section, we present an empirical evaluation of the transition graphs generated by FIGE (see Section 5.4.1) and of a hierarchical RL architecture which uses skill discovery based on FIGE internally. Skill discovery based on FIGE is evaluated with regard to the quality of the obtained partitions in Section 5.4.2.2 and with regard to the learning performance of the whole hierarchical RL architecture in Section 5.4.2.3. We refer to Figure 4.1 for information on which parts of the overall architecture correspond to these evaluations.