MCRT-CAS expands the look-ahead search from the current state s to a fixed depth d in each simulation. At s, it samples the best action a_s ∈ A(s). It then explores the actions at the successor state s′ = T(s, a_s) reached in the look-ahead search. At s′, it selects an action a′ ∈ A(s′) from a small list of actions A_c(s′) ⊂ A(s′), i.e. a′ ∈ A_c(s′). This small list of actions is called a corridor, and A_c(s′) is constructed around the best action at s′. Sampling continues in this way until the look-ahead search reaches depth d. For each state-action pair seen during the look-ahead search, MCRT-CAS computes the reward using the function R. The state reached at depth d is evaluated using an admissible heuristic, and this value is added to the sum of the rewards of all state-action pairs seen in that simulation. The accumulated value is used to update the expected long-term reward of the action sampled at the current state, i.e. Q(s, a_s): if the returned value is greater than the current Q(s, a_s), the estimate is replaced by it; otherwise it remains unchanged. If Q(s, a_s) remains unchanged for n_c > 0 consecutive simulations, MCRT-CAS selects the best action at s after excluding a_s from the action list. If s has converged, or if the time allotted to the simulations expires, MCRT-CAS selects the best action at s as the plan and returns it for execution.
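The update rule and the exclusion of a stagnant action can be illustrated with a minimal Python sketch. The class name ActionValues, the initial value of minus infinity, and the fallback used when every action is stagnant are assumptions made for illustration; the text above does not specify them.

```python
from collections import defaultdict


class ActionValues:
    """Q(s, a) table that keeps the best return seen and counts stagnant simulations."""

    def __init__(self, n_c):
        self.n_c = n_c                                # simulations tolerated without improvement
        self.q = defaultdict(lambda: float("-inf"))   # assumed initial value
        self.stale = defaultdict(int)                 # consecutive simulations with Q unchanged

    def update(self, s, a, ret):
        """Raise Q(s, a) to ret if ret is larger; otherwise record a stagnant simulation."""
        if ret > self.q[(s, a)]:
            self.q[(s, a)] = ret
            self.stale[(s, a)] = 0
        else:
            self.stale[(s, a)] += 1

    def choose(self, s, actions):
        """Best action at s, excluding actions unchanged for n_c or more simulations."""
        fresh = [a for a in actions if self.stale[(s, a)] < self.n_c]
        pool = fresh or list(actions)                 # assumed fallback: nothing left to exclude
        return max(pool, key=lambda a: self.q[(s, a)])
```

With n_c = 2, an action whose value was set once and then failed to improve in two consecutive simulations is skipped by choose, even if its Q value is still the largest, so a different action is sampled next.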
3.5.2.1 Objective
In Monte-Carlo simulations, the exploration of new actions is an essential part of the search, because the action values at the current state of the planning agent are estimated using local information. However, exploring new actions that are not useful for solving the current planning problem is computationally very expensive. It is therefore intuitive to limit exploration during the Monte-Carlo simulations to the vicinity (or corridor) of the effect(s) of the current best action of a state. The exploration scheme also gives more weight to actions that have been explored less by increasing their chance of selection, which ensures that all actions in a corridor are eventually explored. The chance of selection is computed using 1/n(s, a), where n(s, a) is the number of times an action a ∈ A(s) has been sampled at state s. Corridors are kept overlapping, i.e. two corridors (each built around a different action) can have one or more actions in common, so exploration can move from one corridor to another in two consecutive simulations. At the current state s, a new action is explored if the estimated value of the best action a_b ∈ A(s) does not change for some iterations.
3.5.2.2 Algorithmic Details
MCRT-CAS makes two main changes to the original MCRT [52]. First, it uses the estimated action values to draw a sample at the current state of the planning agent. Second, it uses a corridor-based exploration scheme to draw action samples at the successor states of the current state. To explore new actions, MCRT-CAS uses a small set of actions that are relevant to the best action in the current state, and explores only within that corridor. The corridor for an action is constructed automatically, is built before planning starts, and is stored in memory for online use. If the actions are directional, the construction is straightforward: we use the angle between the directions of two actions. For example, the corridor of the action “MOVE TO NORTH” is the set {“MOVE TO NORTH WEST”, “MOVE TO NORTH”, “MOVE TO NORTH EAST”}, where each member of the set makes an angle of 45° or less with “MOVE TO NORTH”.
The general overview of MCRT-CAS is given in Figure 3.8. At the current state s_c, if s_c has converged then MCRT-CAS returns the best action at s_c (line 2, Figure 3.8). The convergence of a state s_c is discussed in section 3.6. If s_c has not converged yet, MCRT-CAS runs several rollouts, as many as the time limit allows, to estimate the action values at s_c.
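The angle-based corridor construction described earlier in this section can be sketched as follows. The eight compass actions and their headings in degrees are a hypothetical action set chosen to match the “MOVE TO NORTH” example; the function names are illustrative and not taken from the original implementation.

```python
# Hypothetical action set: eight compass moves, each given by its heading in degrees.
HEADINGS = {
    "MOVE TO NORTH": 0, "MOVE TO NORTH EAST": 45, "MOVE TO EAST": 90,
    "MOVE TO SOUTH EAST": 135, "MOVE TO SOUTH": 180, "MOVE TO SOUTH WEST": 225,
    "MOVE TO WEST": 270, "MOVE TO NORTH WEST": 315,
}


def angle_between(h1, h2):
    """Smallest angle in degrees between two headings (between 0 and 180)."""
    diff = abs(h1 - h2) % 360
    return min(diff, 360 - diff)


def build_corridors(headings, max_angle=45):
    """Map every action to the set of actions within max_angle degrees of it."""
    return {
        a: {b for b, hb in headings.items() if angle_between(ha, hb) <= max_angle}
        for a, ha in headings.items()
    }


# Built once, before planning starts, and looked up during the rollouts.
CORRIDORS = build_corridors(HEADINGS)
```

Because each corridor contains the action itself and its two 45° neighbours, adjacent corridors share two actions, which is what lets exploration drift from one corridor to the next.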
In each rollout, MCRT-CAS chooses (line 6, Figure 3.8) the best action a ∈ A(s_c) at s_c as a sample. If s_c is seen for the first time, or if any of its actions have not been sampled yet, then ChooseAction randomly selects one of the unseen actions. The next state s_n of s_c, along with the reward r_n of the state-action pair (s_c, a), is returned by the function SimulateAction (line 7, Figure 3.8). The details of SimulateAction have been discussed in section 3.3. s_n is then expanded for depth − 1 further steps using corridor-based action sampling. The function ChooseActionFromCorridor (line 9, Figure 3.8) chooses the best action a_b at s_n and creates a corridor A_c(a_b). The function then randomly selects an action a′ from this corridor. The chance that an action in the corridor is selected as a sample depends on the number of times it has been sampled in previous searches: an action a′ ∈ A_c(a_b) has a higher chance of selection than the other members of A_c(a_b) if it is the least explored. The function SimulateAction (line 10) computes the immediate reward r_w and the next state s_next of the state-action pair (s_n, a′). r_w is added to r_n (line 11), s_n is replaced by s_next (line 12), and this continues for depth − 1 iterations.
Function MCRT-CAS(s_c, g)
Read access depth, timelimit, n_c;
1.  IF s_c is converged THEN
2.     a := ChooseBestAction(s_c);
3.     RETURN the action a;
4.  ELSE
5.     REPEAT
6.        a := ChooseAction(s_c);
7.        [s_n, r_n] := SimulateAction(s_c, a, g);
8.        FOR i = 1 TO depth − 1
9.           a′ := ChooseActionFromCorridor(s_n);
10.          [s_next, r_w] := SimulateAction(s_n, a′, g);
11.          r_n := r_n + r_w;
12.          s_n := s_next;
13.       END FOR
14.       r_n := r_n + 1/dist(s_n, g);
15.       IF Q(s_c, a) < r_n THEN
16.          Q(s_c, a) := r_n;
17.       timelimit := timelimit − 1;
18.    UNTIL (timelimit ≤ 0);
19.    a := ChooseBestAction(s_c);
20.    RETURN the action a
End MCRT-CAS
Figure 3.8: High Level Design of MCRT-CAS.
At depth depth of the look-ahead search, the leaf node s_n is evaluated using the distance heuristic dist, and the inverse of the heuristic value is added to r_n (line 14). If the current estimate r_n of the long-term reward of the state-action pair (s_c, a) is greater than the previous value, i.e. Q(s_c, a), then MCRT-CAS updates Q(s_c, a) (line 16). MCRT-CAS keeps running simulations, with s_c as the start node of each simulation, until the time runs out. At the end of the simulations, MCRT-CAS selects the best action at s_c (line 19) and returns it for execution.
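The rollout loop of Figure 3.8 can be sketched in Python under several stated assumptions: a hypothetical unbounded grid with eight compass moves, a deterministic stand-in for SimulateAction with a reward of 1 on reaching the goal, a rollout count in place of the time limit, selection weights of 1/(1 + n(s, a)) so that unsampled actions do not cause a division by zero, and a cap on the leaf value when the leaf is the goal itself. The convergence test of lines 1 to 3 is omitted. None of these choices is taken from the original implementation.

```python
import math
import random
from collections import defaultdict

# Hypothetical domain: eight compass moves listed clockwise, 45 degrees apart.
ACTIONS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]
CORRIDOR = {a: [ACTIONS[(i - 1) % 8], a, ACTIONS[(i + 1) % 8]]
            for i, a in enumerate(ACTIONS)}

Q = {}                 # Q(s, a); a missing entry means the pair was never evaluated
N = defaultdict(int)   # n(s, a): how often a was sampled at s


def dist(s, g):
    return math.hypot(s[0] - g[0], s[1] - g[1])


def simulate_action(s, a, g):
    """Deterministic stand-in for SimulateAction: returns (next state, reward)."""
    nxt = (s[0] + a[0], s[1] + a[1])
    return nxt, (1.0 if nxt == g else 0.0)


def choose_best_action(s):
    # Unevaluated pairs default to 0 (assumption); ties are broken at random.
    return max(ACTIONS, key=lambda a: (Q.get((s, a), 0.0), random.random()))


def choose_action(s):
    """Random unevaluated action if one exists, otherwise the best action at s."""
    unseen = [a for a in ACTIONS if (s, a) not in Q]
    a = random.choice(unseen) if unseen else choose_best_action(s)
    N[(s, a)] += 1
    return a


def choose_action_from_corridor(s):
    """Sample from the corridor of the best action, favouring rarely tried actions."""
    corridor = CORRIDOR[choose_best_action(s)]
    weights = [1.0 / (1 + N[(s, a)]) for a in corridor]
    a = random.choices(corridor, weights=weights)[0]
    N[(s, a)] += 1
    return a


def mcrt_cas(s_c, g, depth=6, rollouts=500):
    for _ in range(rollouts):                          # stands in for the time limit
        a = choose_action(s_c)                         # line 6
        s_n, r_n = simulate_action(s_c, a, g)          # line 7
        for _ in range(depth - 1):                     # lines 8 to 13
            a_cor = choose_action_from_corridor(s_n)
            s_n, r_w = simulate_action(s_n, a_cor, g)
            r_n += r_w
        r_n += 1.0 / max(dist(s_n, g), 0.5)            # line 14, capped when dist is 0
        if (s_c, a) not in Q or Q[(s_c, a)] < r_n:     # lines 15 and 16
            Q[(s_c, a)] = r_n
    return choose_best_action(s_c)                     # line 19
```

A call such as mcrt_cas((0, 0), (3, 3), depth=4, rollouts=200) returns one of the eight moves. Only Q(s_c, a) is written during the rollouts, as in Figure 3.8, so the corridors at successor states rely on values learned when those states were themselves the current state.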
3.5.2.3 Complexity Analysis
MCRT-CAS selects an action at the current state s_c using ChooseAction (line 6, Figure 3.8). It takes O(1) to select an action randomly at s_c if s_c is seen for the first time. If all actions at s_c have been sampled in previous searches, then it takes O(|A(s_c)|) to select the best action at s_c (line 1, Figure 3.8). The average time complexity of ChooseAction is O(1) and its worst-case time complexity is O(|A|). The worst-case time complexity of ChooseActionFromCorridor (line 9, Figure 3.8) is O(C), where C is the size of the corridor. The worst-case space complexity of MCRT-CAS per simulation is O(d + C|A|). Its worst-case time complexity per simulation is O(|A| + (d − 1)C), since a simulation makes one call to ChooseAction and d − 1 calls to ChooseActionFromCorridor. For K rollouts, the worst-case time complexity of MCRT-CAS is therefore O(K(|A| + (d − 1)C)). MCRT-CAS is better than MCRT and MCRT-RAS in terms of worst-case time complexity.
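To make the bound concrete, the count below instantiates it for assumed sizes. The values |A| = 8, C = 3, d = 10 and K = 1000 are illustrative only and are not taken from any experiment in this work.

```latex
\begin{align*}
T_{\text{rollout}} &= \underbrace{O(|A|)}_{\text{ChooseAction}}
  + \underbrace{(d-1)\,O(C)}_{\text{ChooseActionFromCorridor}}
  = O\bigl(|A| + (d-1)C\bigr), \\
T_{K} &= K \cdot T_{\text{rollout}} = O\bigl(K\,(|A| + (d-1)C)\bigr), \\
|A| = 8,\; C = 3,\; d = 10,\; K = 1000
  &\;\Rightarrow\; 1000 \cdot (8 + 9 \cdot 3) = 35{,}000 \text{ elementary selection steps.}
\end{align*}
```

For comparison, a hypothetical scheme that scanned the full action set at every depth would need K · d · |A| = 80,000 steps for the same sizes. This baseline is given only to show the effect of the corridor size C and is not a statement about the exact cost of MCRT or MCRT-RAS.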