The benefit of using looping STs comes from their ability to keep relevant past events in memory by “forgetting”, or looping over, irrelevant details. Holmes et al. (2006) restrict their discussion to the deterministic case without rewards. The above discussion of the relation to hPOMDP/Rs extends their results to a subclass of stochastic POMDP/Rs for which there exists a resolving sequence, but it is not a constructive proof. Unfortunately, the loopability criterion of Holmes et al. (2006) cannot work in the general stochastic case, since a loop can change not only the possible transitions but also the transition probabilities. More importantly, they use looping suffix trees to represent POMDPs completely, but this is not necessary. As long as we have an LST that predicts rewards well, there is no need to recover the original POMDP completely; in fact, it is beneficial to have a more compact map. The utility of the reward prediction ability of LSTs is demonstrated particularly well by the TMaze example (Figure 3.2). Here, a very simple LST can perfectly predict the rewards of a TMaze of any length, whereas an LST that exactly predicts every observation of the TMaze is as long as the corridor in the TMaze.
Figure 4.4: Example (observation-based) suffix and looping suffix trees
In this section we present an extension of the generic ΦMDP method to looping suffix trees that can learn LSTs useful for performing well in reinforcement learning domains where there is a need to “forget” or excise certain sequences of observations. The cost function of the ΦMDP framework (OCost_α) immediately gives us a well-motivated criterion for evaluating looping suffix trees. Using looping suffix trees as the map class in this framework allows us to extend them to stochastic environments. Experimental results show that ΦMDP works well in the space of looping suffix trees. The extension to stochastic tree sources is also useful in deterministic environments, where in some cases a smaller stochastic tree source can sufficiently capture a deterministic environment.
For the rest of this chapter we do not use action-observation looping suffix trees; rather, we restrict our trees to observations alone. This is primarily for practical reasons: it decreases the branching factor of the search space, the amount of space required to store the trees, and the computational time required to map a history sequence to a state sequence. We sacrifice some representational power by doing this; however, the environments that we test on are representable by observation looping suffix trees.
We also note that in this chapter, we do not give the agent any information about when an episode ends in order to be as general as possible.
Algorithm
The algorithm consists of a specification of CL(φ) and the neighbourhood method needed by the simulated annealing algorithm in the generic ΦMDP algorithm (Algorithm 3). We call our algorithm LSTΦMDP. A tree with k nodes can be coded in k bits (Veness et al., 2011, Sec. 5) and the starting and ending nodes of all s loops can be coded in 2s log(k) bits, so we define the model cost of the map as CL(φ) = k + 2s log k.
The getNeighbour() method (Algorithm 6) first selects a state at random and then, with equal probability subject to certain conditions, selects one of four operations. Note that the simulated annealing procedure that we use is a very simple generic method. It can, however, be extended to more sophisticated annealing schemes such as parallel tempering, as done by Nguyen et al. (2011). We observe that the simple scheme we use works extremely well in practice.
• merge: In order to merge a state, all of its sibling nodes must also be states. From the definition of a suffix set, we know that every state corresponds to a unique suffix. The merge simply shortens the context for those states. If s_i is the state being merged and s_i = o_j n′, where o_j ∈ O and n′ is the remainder of the suffix corresponding to that state, then the siblings of s_i are o_k n′ where k ≠ j. If these siblings are also states, then the merge operator removes o_k n′ for all o_k ∈ O from the suffix set and adds the new state n′.
• split: Analogously, we can split any state s_i by adding a depth-one context to it, i.e. by constructing |O| new states of the form o_j s_i for all o_j ∈ O and removing the state s_i.
• addLoop: The addLoop function has two cases. Either we add a loop from an existing state to its parent (thereby removing it from the state set and adding it to the loop set), or we extend an existing loop to the parent of the node that it currently loops to.
• removeLoop: The removeLoop function is simply the reverse of the addLoop function. It allows us to decrease the length of a loop or, if the loop has length one, to create a new state from the node.
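The merge and split operations, together with the model cost CL(φ) = k + 2s log k, can be illustrated with a minimal Python sketch. It is not the thesis implementation (Algorithm 6). The representation is an assumption: a state is a tuple of observations with the oldest observation first, matching the notation o_j n′, and the state set is a frozenset of such tuples. The function names, the base-2 logarithm, and the choice to return None for a disallowed merge are likewise illustrative. Loops are left out, so addLoop and removeLoop are not covered.

```python
import math
from typing import FrozenSet, Optional, Sequence, Tuple

Suffix = Tuple[str, ...]        # oldest observation first: o_j followed by n'
StateSet = FrozenSet[Suffix]    # hypothetical representation of a suffix set


def split(states: StateSet, s: Suffix, obs: Sequence[str]) -> StateSet:
    """Replace state s by the |O| longer contexts o_j s."""
    if s not in states:
        raise ValueError("can only split an existing state")
    return (states - {s}) | {(o,) + s for o in obs}


def merge(states: StateSet, s: Suffix, obs: Sequence[str]) -> Optional[StateSet]:
    """Replace s = o_j n' and all of its siblings o_k n' by the context n'.

    Returns None when the merge is not allowed.
    """
    if s not in states or len(s) < 2:
        # len(s) == 1 would yield the single-state tree, which is banned
        # as a neighbour in the text.
        return None
    rest = s[1:]
    siblings = frozenset((o,) + rest for o in obs)
    if not siblings <= states:
        return None  # some sibling is not a state (e.g. it was split further)
    return (states - siblings) | {rest}


def model_cost(num_nodes: int, num_loops: int) -> float:
    """CL(phi) = k + 2 s log k; the base-2 logarithm is an assumption."""
    return num_nodes + 2 * num_loops * math.log2(num_nodes)
```

Starting from the depth-one tree over O = {0, 1}, splitting the state 1 gives the states 0, 01 and 11, and merging 01 restores the depth-one tree.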
Definition 19. A history h is said to be consistent with respect to a particular looping suffix tree L if L maps every prefix of h to a non-empty state.
Loops introduce a few problems to the standard ΦMDP procedure. A looped tree can be inconsistent with the current history. This can be problematic if, for instance, the optimal tree is inconsistent with the current history. One solution is to always provide a reasonable initial history with which the optimal tree should be consistent. For example, in the TMaze case (see Section 4.5) we ensure that the first observation is in fact the start of an episode, which is a reasonable assumption. Any trees that are inconsistent can then be discarded. In fact, to make the search quicker, we can mark nodes where loops make the tree inconsistent and no longer add those loops. The initial map is always set to be the depth-one tree (i.e. one split). The single-state tree can have the lowest cost for very large amounts of data, and we explicitly ban it as a neighbour.
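The consistency check can be sketched as follows. This is only an illustration under stated assumptions. The tree is represented by a set of leaf states and a dictionary of loops, both keyed by suffix tuples with the oldest observation first. A loop is read as "on reaching this node, jump to the named ancestor and keep reading older observations", which is an assumed reading of the loop semantics. Only non-empty prefixes of the history are checked. The names map_to_state and is_consistent are hypothetical.

```python
from typing import Dict, Optional, Sequence, Set, Tuple

Suffix = Tuple[str, ...]  # oldest observation first


def map_to_state(history: Sequence[str],
                 states: Set[Suffix],
                 loops: Dict[Suffix, Suffix]) -> Optional[Suffix]:
    """Walk the tree from the most recent observation backwards.

    Returns the state reached, or None (the empty state) if the history
    runs out before a state is reached.
    """
    node: Suffix = ()
    for obs in reversed(history):
        node = (obs,) + node            # extend the context by one older symbol
        node = loops.get(node, node)    # assumed semantics: jump to the ancestor
        if node in states:
            return node
    return None


def is_consistent(history: Sequence[str],
                  states: Set[Suffix],
                  loops: Dict[Suffix, Suffix]) -> bool:
    """True if every non-empty prefix of the history maps to a state."""
    return all(
        map_to_state(history[:t], states, loops) is not None
        for t in range(1, len(history) + 1)
    )


# Toy tree over observations {a, b}: states "a" and "ab"; the node "bb"
# loops back to its parent "b", so runs of b are forgotten.
toy_states = {("a",), ("a", "b")}
toy_loops = {("b", "b"): ("b",)}
```

In this toy tree, a history that begins with a is consistent, while one that begins with b is not, because the prefix b reaches no state. This mirrors the remedy of fixing a reasonable first observation.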
For the reinforcement learning part of the algorithm, we use the model-based method specified by Hutter (2009b), which is based on Szita and Lörincz (2008b). This method adds an additional “garden of eden” state (s_e) to the estimated MDP, which is an absorbing state with a high reward. The agent is told that it has been to s_e once from every other state; however, the agent cannot actually transition to this state. We then simply perform value iteration on this augmented MDP. Initially the agent will explore in a systematic manner to try to visit s_e, but as it accumulates more transitions from a particular state, the estimated transition probability to s_e decreases, and the agent eventually settles on the optimal policy.
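A minimal sketch of value iteration on the augmented MDP is given below. It is not the exact procedure of Hutter (2009b) or Szita and Lörincz (2008b). Several details are assumptions made for illustration: the fictitious visit to s_e is added once per state-action pair, the transition to s_e earns the reward r_eden, the discount factor gamma is fixed, and the model is stored as nested lists of counts and summed rewards. All names are hypothetical.

```python
from typing import List, Tuple


def eden_value_iteration(counts: List[List[List[int]]],
                         reward_sums: List[List[List[float]]],
                         r_eden: float,
                         gamma: float = 0.9,
                         sweeps: int = 500) -> Tuple[List[float], List[List[float]]]:
    """Value iteration with one fictitious transition to s_e per (state, action).

    counts[s][a][s2]      observed number of transitions s --a--> s2
    reward_sums[s][a][s2] total reward collected on those transitions
    """
    n_states = len(counts)
    n_actions = len(counts[0])
    v_eden = r_eden / (1.0 - gamma)   # value of the absorbing state s_e
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for s in range(n_states):
            for a in range(n_actions):
                total = sum(counts[s][a]) + 1          # +1: pretend visit to s_e
                q = r_eden + gamma * v_eden            # contribution of s_e
                for s2 in range(n_states):
                    c = counts[s][a][s2]
                    q += reward_sums[s][a][s2] + c * gamma * V[s2]
                Q[s][a] = q / total
            V[s] = max(Q[s])
    return V, Q
```

With no real data for an action, its value equals that of s_e, so the greedy policy tries it. As real counts grow, the weight 1/total of the fictitious transition shrinks, which is the decreasing estimated probability described above.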
The space of looping STs includes the space of ordinary STs. Therefore, results from the non-looping case (Nguyen et al., 2011) should be reproducible, as long as the simulated annealing procedure is not (very) adversely affected by the enlargement of the search space. Experimental results show that some care must be taken in choosing α for this to be the case. This is further discussed in Section 4.5.