
MCRT-CAS expands the look-ahead search from the current state s to a fixed depth d in each simulation. At s, it chooses the best action a_s ∈ A(s) as a sample. It then explores the actions at the successor state s′ = T(s, a_s) reached in the look-ahead search. At s′, it selects an action a′ from a small list of actions A_c(s′) ⊂ A(s′), i.e. a′ ∈ A_c(s′). This small list of actions is called a corridor, and A_c(s′) is constructed using the best action a at s′. The sampling continues until the look-ahead search reaches depth d. For each state-action pair seen during the look-ahead search, MCRT-CAS computes the reward using the R function. The state seen at depth d is evaluated using an admissible heuristic, and this value is added to the sum of the rewards of all state-action pairs seen in the look-ahead search in a simulation. The accumulated value is used to update the expected long-term reward of the action sampled at the current state, i.e. Q(s, a_s): if the returned value is greater than the current Q(s, a_s), the estimated action value is replaced by it; otherwise it remains unchanged. If Q(s, a_s) remains unchanged for nc > 0 consecutive simulations, MCRT-CAS selects the best action at s with a_s excluded from the action list. If s has converged or the time available for simulations expires, MCRT-CAS selects the best action at s as the plan and returns it for execution.
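The update rule at the current state can be illustrated with a minimal Python sketch. It is not the author's implementation: the names `RootStats`, `rollout_return` and `update_root` are hypothetical, and initialising an unseen Q(s, a_s) to negative infinity is an assumption, since the text does not state the initial value.

```python
from dataclasses import dataclass, field


@dataclass
class RootStats:
    """Hypothetical container for the estimates kept at the current state s."""
    q: dict = field(default_factory=dict)        # action -> best return seen so far
    stalled: dict = field(default_factory=dict)  # action -> simulations without improvement


def rollout_return(rewards, leaf_heuristic):
    """Sum of the rewards seen in one look-ahead plus the leaf evaluation."""
    return sum(rewards) + leaf_heuristic


def update_root(stats, action, value, nc):
    """Replace Q(s, action) only if value is larger.

    Returns True once Q(s, action) has stayed unchanged for nc consecutive
    simulations, i.e. when the action would be excluded from the action list.
    Assumption: an action with no estimate yet starts at -infinity.
    """
    if value > stats.q.get(action, float("-inf")):
        stats.q[action] = value
        stats.stalled[action] = 0
    else:
        stats.stalled[action] = stats.stalled.get(action, 0) + 1
    return stats.stalled[action] >= nc


stats = RootStats()
first = update_root(stats, "N", rollout_return([1.0, 0.5], 0.25), nc=2)
second = update_root(stats, "N", 1.0, nc=2)
third = update_root(stats, "N", 1.75, nc=2)
```

In this usage, the first simulation sets the estimate, and the next two fail to improve it, so the third call reports that the action has stalled for nc = 2 simulations.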

3.5.2.1 Objective

In Monte-Carlo simulations, the exploration of new actions is an essential part of the search, because the action values at the current state of the planning agent are estimated using local information. However, exploring new actions that are not useful for solving the current planning problem is computationally very expensive for the planning search algorithm. It is therefore intuitive to limit the exploration of the search space during the Monte-Carlo simulations to the vicinity (or corridor) of the effect(s) of the current best action of a state. The exploration scheme also gives importance to the actions that are less explored by increasing their chance of selection. This ensures that all actions in a corridor are explored. The chance of selection is computed using 1/n(s, a), where n(s, a) is the number of times an action a ∈ A(s) has been sampled at state s. The corridors are kept overlapping, i.e. two corridors (each built around a different action) can have one or more actions in common, so the exploration of new actions can move from one corridor to another in two consecutive simulations. At the current state s, a new action is explored if the estimated value of the best action a_b ∈ A(s) does not change for some iterations.

3.5.2.2 Algorithmic Details

MCRT-CAS makes two main changes to the original MCRT [52]. First, it uses the estimated action values to draw a sample at the current state of the planning agent. Second, it uses a corridor-based exploration scheme to draw action samples at the successor states of the current state. To explore new actions, MCRT-CAS uses a small set of actions that are relevant to the best action in the current state, and it explores new actions only within that corridor. The corridor for an action is constructed automatically, before the start of planning, and is stored in memory for online use. If the actions are directional, the construction is straightforward: the angle between the directions of two actions decides whether one belongs to the corridor of the other. For example, the corridor of the action “MOVE TO NORTH” is the set {“MOVE TO NORTH WEST”, “MOVE TO NORTH”, “MOVE TO NORTH EAST”}, where each member of the set has an angle of 45° or less with the action “MOVE TO NORTH”.

The general overview of MCRT-CAS is given in Figure 3.8. If the current state sc has converged, MCRT-CAS returns the best action at sc (line 2, Figure 3.8). The convergence of a state sc is discussed in section 3.6. If sc has not converged yet, MCRT-CAS runs as many rollouts as the time limit allows to estimate the action values at sc. In each rollout, MCRT-CAS chooses the best action a ∈ A(sc) at sc as a sample (line 6, Figure 3.8). If sc is seen for the first time, or some of its actions have not been sampled yet, ChooseAction randomly selects one of the unseen actions. The function SimulateAction (line 7, Figure 3.8) returns the next state sn of sc along with the reward rn of the state-action pair (sc, a). The details of SimulateAction are discussed in section 3.3. sn is then expanded for a length of depth − 1 using corridor-based action sampling. The function ChooseActionFromCorridor (line 9, Figure 3.8) chooses the best action a′ at sn and creates a corridor A_c(a′). It then selects an action a randomly from the corridor of a′. The chance of selecting an action from the corridor depends on the number of times the action has been sampled in previous searches: an action a ∈ A_c(a′) has a higher chance of selection than the other members of A_c(a′) if a is the least explored. The function SimulateAction (line 10) computes the immediate reward rw and the next state snext of the state-action pair (sn, a).

Function MCRT-CAS(sc, g)
Read access depth, timelimit, nc;
1. IF sc is converged THEN
2.     a := ChooseBestAction(sc);
3.     RETURN the action a;
4. ELSE
5.     REPEAT
6.         a := ChooseAction(sc);
7.         [sn, rn] := SimulateAction(sc, a, g);
8.         FOR i = 1 to depth − 1
9.             a := ChooseActionFromCorridor(sn);
10.            [snext, rw] := SimulateAction(sn, a, g);
11.            rn := rn + rw;
12.            sn := snext;
13.        END FOR
14.        rn := rn + 1/dist(sn, g);
15.        IF Q(sc, a) < rn THEN
16.            Q(sc, a) := rn;
17.        timelimit := timelimit − 1;
18.    UNTIL (timelimit <= 0);
19.    a := ChooseBestAction(sc);
20.    RETURN the action a
End MCRT-CAS

Figure 3.8: High Level Design of MCRT-CAS.

The reward rw is added to rn (line 11), and the expansion continues for depth − 1 iterations. At depth depth of the look-ahead search, the leaf node sn is evaluated using the distance heuristic dist, and the inverse of the heuristic value is added to rn (line 14). If the current estimate rn of the long-term reward of the state-action pair (sc, a) is greater than the previous value, i.e. Q(sc, a), then MCRT-CAS updates Q(sc, a) (line 16). MCRT-CAS keeps running simulations, each with sc as the start node, until the time runs out. At the end of the simulations, MCRT-CAS selects the best action at sc (line 19) and returns it for execution.
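The corridor construction and the corridor-based sampling described in this subsection can be sketched in Python as follows. This is an illustration under stated assumptions rather than the author's code: the eight-direction action set `HEADINGS`, the helper names, and the per-state dictionaries `q` and `counts` are hypothetical. The sketch also weights each action by 1/(n + 1) instead of 1/n so that an action that has never been sampled does not cause a division by zero; that smoothing is an assumption, not something the text specifies.

```python
import random

# Hypothetical directional actions: heading in degrees, clockwise from north.
HEADINGS = {
    "N": 0, "NE": 45, "E": 90, "SE": 135,
    "S": 180, "SW": 225, "W": 270, "NW": 315,
}


def angle_between(a, b):
    """Smallest angle in degrees between the directions of actions a and b."""
    diff = abs(HEADINGS[a] - HEADINGS[b]) % 360
    return min(diff, 360 - diff)


def build_corridors(max_angle=45):
    """Precompute, for every action, the actions within max_angle of it."""
    return {
        a: [b for b in HEADINGS if angle_between(a, b) <= max_angle]
        for a in HEADINGS
    }


# Built once before planning and kept in memory, as the text describes.
CORRIDORS = build_corridors()


def choose_action_from_corridor(q, counts, rng=random):
    """Sample from the corridor of the best action at one state.

    q maps action -> estimated value and counts maps action -> n(s, a),
    both for the same state. Less-explored actions get a larger weight.
    """
    best = max(HEADINGS, key=lambda a: q.get(a, float("-inf")))
    corridor = CORRIDORS[best]
    # Assumption: 1 / (n + 1) stands in for 1 / n to handle n = 0.
    weights = [1.0 / (counts.get(a, 0) + 1) for a in corridor]
    choice = rng.choices(corridor, weights=weights, k=1)[0]
    counts[choice] = counts.get(choice, 0) + 1
    return choice


rng = random.Random(0)
counts = {}
picks = [choose_action_from_corridor({"N": 1.0}, counts, rng) for _ in range(200)]
```

With a 45° threshold, the corridor of "N" contains "N", "NE" and "NW", matching the “MOVE TO NORTH” example, and it shares "N" and "NE" with the corridor of "NE", which is what lets exploration move from one corridor to a neighbouring one.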

3.5.2.3 Complexity Analysis

MCRT-CAS selects an action at the current state sc using ChooseAction (line 6, Figure 3.8). It takes O(1) to select an action randomly at sc if sc is seen for the first time. If all actions at sc have been sampled in previous searches, then it takes O(|A(sc)|) to select the best action at sc (line 1, Figure 3.8). The average time complexity of ChooseAction is O(1) and its worst-case time complexity is O(|A|). The worst-case time complexity of ChooseActionFromCorridor (line 9, Figure 3.8) is O(C), where C is the size of the corridor. The worst-case space complexity of MCRT-CAS per simulation is O(d + C|A|), and a single simulation has a worst-case time complexity of O(|A| + (d − 1)C). With K rollouts, MCRT-CAS therefore has a worst-case time complexity of O(K(|A| + (d − 1)C)). MCRT-CAS is better than MCRT and MCRT-RAS in terms of worst-case time complexity.
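To make the bound concrete, the following Python sketch evaluates the expression |A| + (d − 1)C for one rollout and multiplies it by K. The function names and the example figures (8 actions, corridor size 3, depth 10, 1000 rollouts) are illustrative assumptions, not values from the text, and the result counts only action-selection steps inside the big-O expression, not measured running time.

```python
def selection_steps_per_rollout(num_actions, corridor_size, depth):
    """Worst-case selection work in one rollout: |A| + (d - 1) * C."""
    return num_actions + (depth - 1) * corridor_size


def selection_steps_total(num_actions, corridor_size, depth, rollouts):
    """Worst-case selection work over K rollouts: K * (|A| + (d - 1) * C)."""
    return rollouts * selection_steps_per_rollout(num_actions, corridor_size, depth)


# Hypothetical figures: |A| = 8, C = 3, d = 10, K = 1000.
per_rollout = selection_steps_per_rollout(8, 3, 10)
total = selection_steps_total(8, 3, 10, 1000)
```

For these figures the root selection contributes 8 steps and the nine corridor selections contribute 27, so one rollout costs 35 steps and 1000 rollouts cost 35,000.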
