A Dynamic Programming (DP) approach computes Q(s, a) and V(s) using the Bellman equation [6]. A general form of the Bellman equation is shown in Equation 2.2, where α ∈ [0, 1] is the discount factor. The intuition behind using the discount factor in computing the expected reward is to give more importance to the reward of the current state s (i.e. R(s, a)) than to the rewards of future states.
$$Q(s, a) = R(s, a) + \alpha \sum_{s' \in S} P(s, a, s') \times V(s') \tag{2.2}$$
There are two main algorithms that use the Bellman equation to solve an MDP-based planning problem. One is the Policy Iteration (PI) algorithm and the other is Value Iteration (VI) [6]. Both algorithms search the whole state space to compute Q(s, a) and V(s) for every state s. Both run in time polynomial in the size of the state space.
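As an illustration of Equation 2.2, the following minimal sketch computes Q(s, a) on a tiny two-state MDP. The MDP itself, the dictionary layout of `P` and `R`, and the function names are assumptions made for this example. Taking V(s) as the maximum of Q(s, a) over the available actions is the standard reward-maximizing reading and is also assumed here.

```python
# Toy MDP (hypothetical): states 0 and 1, actions "stay" and "go".
# P[s][a] maps each successor state s' to the probability P(s, a, s').
P = {
    0: {"stay": {0: 1.0}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {1: 1.0}, "go": {0: 1.0}},
}
# R[s][a] is the immediate reward R(s, a).
R = {
    0: {"stay": 0.0, "go": 1.0},
    1: {"stay": 2.0, "go": 0.0},
}


def q_value(s, a, V, alpha):
    """Equation 2.2: Q(s, a) = R(s, a) + alpha * sum over s' of P(s, a, s') * V(s')."""
    return R[s][a] + alpha * sum(p * V[s2] for s2, p in P[s][a].items())


def state_value(s, V, alpha):
    """V(s) taken as the best Q(s, a) over the actions available in s (assumption)."""
    return max(q_value(s, a, V, alpha) for a in P[s])


V = {0: 0.0, 1: 10.0}  # some current estimate of the state values
print(q_value(0, "go", V, alpha=0.9))
print(state_value(1, V, alpha=0.9))
```

With a smaller α, the successor term shrinks and the immediate reward R(s, a) dominates, which is the intuition given for the discount factor.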
Both VI and PI perform backward search to update the policy values in an iteration.
The main difference between PI and VI is the focus of their search for the optimal
solution in the policy space. PI starts from a random policy and improves it in each iteration. VI starts from random state values V(s) for each state s ∈ S and updates them in each iteration. Computationally, an iteration of PI is more expensive than an iteration of VI, because PI sweeps through the entire state space twice per iteration whereas VI traverses it only once. These algorithms keep searching for the optimal values until there is no improvement in either the policy or the state values. Another stopping condition is proposed by Williams and Baird [76].
According to this definition, convergence of VI or PI is judged relative to a parameter called the Bellman residual or Bellman equation error (θ). If the absolute difference between the values of every state s ∈ S in two consecutive iterations is smaller than θ, the dynamic programming algorithm stops. This convergence definition guarantees a solution of near-optimal quality.
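The stopping rule can be sketched for Value Iteration as follows. This is a minimal sketch, assuming a reward-maximizing MDP stored in dictionaries and a discount factor α strictly below 1 so that the sweeps converge; the toy MDP and all names are hypothetical, and initial values are set to 0 rather than drawn at random.

```python
def value_iteration(P, R, alpha, theta):
    """Sweep the whole state space until the largest value change is below theta."""
    V = {s: 0.0 for s in P}  # arbitrary initial state values
    sweeps = 0
    while True:
        residual = 0.0  # largest |V_new(s) - V_old(s)| seen in this sweep
        new_V = {}
        for s in P:
            new_V[s] = max(
                R[s][a] + alpha * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in P[s]
            )
            residual = max(residual, abs(new_V[s] - V[s]))
        V = new_V
        sweeps += 1
        if residual < theta:  # Bellman residual test
            return V, sweeps


# Toy MDP (hypothetical): P[s][a] -> {s': probability}, R[s][a] -> reward.
P = {
    0: {"stay": {0: 1.0}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {1: 1.0}, "go": {0: 1.0}},
}
R = {
    0: {"stay": 0.0, "go": 1.0},
    1: {"stay": 2.0, "go": 0.0},
}

V, sweeps = value_iteration(P, R, alpha=0.9, theta=1e-6)
print(V, sweeps)
```

In this toy problem, staying in state 1 forever earns 2 per step, so V(1) should approach 2 / (1 - 0.9) = 20. A larger θ stops earlier at the price of a coarser approximation.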
VI and PI are not suitable for large-scale state spaces because they traverse the whole state space in every iteration. A less expensive kind of dynamic programming, which does not traverse the whole state space in an iteration, is called Asynchronous Dynamic Programming (ADP) [7]. Real-Time Dynamic Programming (RTDP) is an example of ADP. The details of RTDP and its variations are given in section 2.2.2.1.
2.2.2.1 Real-Time Dynamic Programming
Real-Time Dynamic Programming (RTDP) [5] is an asynchronous approximate dynamic programming algorithm. It uses a heuristic function to estimate the initial state values and then applies asynchronous value iteration to improve these values. It updates the values of only those states that are seen during the look-ahead search. RTDP repeatedly performs a look-ahead search from the current state s. The look-ahead search is kept focused on the goal state in all iterations. In each iteration, it finds actions to reach the goal location. At each state s′ in the look-ahead search, it selects the action a that has the highest value. RTDP updates the value of s′ using the value of the best action and then finds the next state using the stochastic state transition function, which samples a successor from the transition probability distribution. In an iteration, the algorithm stops expanding the look-ahead search when it reaches the goal state. The iterations are run until the termination condition is reached. The authors of [5] did not specify any bound on the number of iterations RTDP needs to tune the policy function for a given state. However, the authors of [13] use the Bellman residual (error) to set the termination condition in RTDP.
In that work [13], RTDP stops iterating when the error is less than ε. RTDP convergence (i.e. reaching an error smaller than ε) can be slow because it always selects the most likely next states in the look-ahead search and can ignore potentially useful parts of the search space that seem unlikely to follow the best actions.
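The trial loop described above can be sketched as follows. This is a simplified illustration, not the exact procedure of [5] or [13]: it assumes negative step rewards with a terminal goal of value 0, and it stops when the largest residual observed along the last trial is below ε. The chain MDP, the function name `rtdp`, and its parameters are hypothetical.

```python
import random


def rtdp(P, R, start, goal, h, alpha=1.0, epsilon=1e-3, max_trials=10_000, seed=0):
    """Run RTDP trials from start until the residual along a trial is below epsilon."""
    rng = random.Random(seed)
    V = {}  # state values, filled lazily from the heuristic h

    def value(s):
        if s == goal:
            return 0.0  # the goal is terminal
        if s not in V:
            V[s] = h(s)
        return V[s]

    def q(s, a):
        return R[s][a] + alpha * sum(p * value(s2) for s2, p in P[s][a].items())

    for trial in range(1, max_trials + 1):
        s, max_residual = start, 0.0
        while s != goal:
            best = max(P[s], key=lambda a: q(s, a))  # action with the highest value
            new_v = q(s, best)
            max_residual = max(max_residual, abs(new_v - value(s)))
            V[s] = new_v  # update only the state being visited
            successors = P[s][best]
            # Sample the next state from the transition distribution.
            s = rng.choices(list(successors), weights=list(successors.values()))[0]
        if max_residual < epsilon:
            return V, trial
    return V, max_trials


# Hypothetical chain 0 -> 1 -> 2 -> 3 -> 4 (goal); "fwd" succeeds with probability 0.8.
GOAL = 4
P = {s: {"fwd": {s + 1: 0.8, s: 0.2}, "wait": {s: 1.0}} for s in range(GOAL)}
R = {s: {"fwd": -1.0, "wait": -1.0} for s in range(GOAL)}

V, trials = rtdp(P, R, start=0, goal=GOAL, h=lambda s: 0.0, epsilon=1e-4)
print(V[0], trials)
```

In this chain each forward step takes 1 / 0.8 = 1.25 attempts on average, so V(0) should approach -5. Only states reached by the sampled trajectories are ever updated, which is what distinguishes the trial from a full VI sweep.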
To speed up RTDP convergence, a variation of RTDP called Labeled RTDP (LRTDP) [13] uses a labeling scheme. Under this scheme, a state s is labeled as solved if the residual of s is less than ε. LRTDP maintains two lists, OPEN and CLOSED. The OPEN list keeps all states that have been seen but not expanded.
The CLOSED list stores the states that have been expanded. If a state in the CLOSED list has a Bellman residual of less than ε, it is declared solved. In LRTDP, the root state (the initial state of the planning problem) is simulated for a certain number of iterations before it is declared solved. LRTDP is evaluated and compared against Value Iteration and RTDP in the Racetrack domain.
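One way to read the labeling scheme is the sketch below, which expands the states reachable from s under the greedy action using an OPEN and a CLOSED list. It is an illustrative reconstruction under stated assumptions, not a verbatim copy of the procedure in [13]: states are labeled only when every expanded state has a residual below ε, and otherwise the expanded states receive one more Bellman update. The chain MDP and all names are hypothetical.

```python
def check_solved(s, V, solved, P, R, goal, alpha=1.0, epsilon=1e-3):
    """Label s and its greedy descendants as solved if all their residuals are small."""

    def val(u):
        return 0.0 if u == goal else V[u]

    def q(u, a):
        return R[u][a] + alpha * sum(p * val(u2) for u2, p in P[u][a].items())

    def greedy(u):
        return max(P[u], key=lambda a: q(u, a))

    converged = True
    open_list, closed = [s], []  # seen but not expanded / already expanded
    while open_list:
        u = open_list.pop()
        closed.append(u)
        best = greedy(u)
        if abs(V[u] - q(u, best)) >= epsilon:  # residual too large
            converged = False
            continue  # do not expand below an unconverged state
        for u2 in P[u][best]:
            if u2 != goal and u2 not in solved and u2 not in open_list and u2 not in closed:
                open_list.append(u2)
    if converged:
        solved.update(closed)  # every expanded state passes the residual test
    else:
        for u in reversed(closed):  # otherwise improve the expanded states
            V[u] = q(u, greedy(u))
    return converged


# Hypothetical chain 0 -> 1 -> 2 -> 3 -> 4 (goal); "fwd" succeeds with probability 0.8.
GOAL = 4
P = {s: {"fwd": {s + 1: 0.8, s: 0.2}, "wait": {s: 1.0}} for s in range(GOAL)}
R = {s: {"fwd": -1.0, "wait": -1.0} for s in range(GOAL)}

V = {s: -1.25 * (GOAL - s) for s in range(GOAL)}  # already converged values
solved = set()
print(check_solved(0, V, solved, P, R, GOAL), sorted(solved))
```

Once a state is in `solved`, later trials can stop as soon as they reach it, which is where the speed-up over plain RTDP would come from.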
In Racetrack, a car is moved (by a planner) from its initial position to the goal location on a race track. The race track is a grid of cells (x, y), and some of the cells are slippery. If the car is on a slippery part of the track, its actions may not have their intended effects. The experiments are performed on nine different tracks, whose sizes range from 9312 to 239089 cells. Two initial heuristics are used: h(s) = 0 and h(s) = h_min, where h_min is the initial estimate of the policy value for each state s ∈ S and is always greater than 0.
Performance is measured in terms of convergence time. The results show that LRTDP converges faster than VI. RTDP is the slowest algorithm in these experiments: its convergence time always exceeded the ten-minute threshold that the authors used in all experiments. General Planning Tool (GPT) [12] [14] is a planning tool that provides implementations of both RTDP and LRTDP.
LRTDP and RTDP can be applied to path planning problems in RTS games if the state transition probabilities are available before the game starts. However, the probability distribution cannot be computed in a pre-processing phase because the game map is initially unknown to all players. We therefore designed a variation of RTDP as a suitable planning algorithm for comparison with our contributions. This variation makes two changes to the original RTDP. First, to guarantee a solution within a fixed time, RTDP stops a trial when the look-ahead search reaches depth d. Second, to supply a probability distribution for RTDP, we use an online mechanism to build P(s, a, s′). The details of this mechanism are given in section 3.2.1.3.
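The two changes can be illustrated with the sketch below. The depth bound d follows the description above. The transition estimate is only a stand-in: it uses relative frequencies of observed outcomes, which is an assumption made for this example and not necessarily the mechanism of section 3.2.1.3. The class `CountModel`, the function `bounded_trial`, and the callbacks `actions`, `reward`, and `step` are all hypothetical names.

```python
from collections import defaultdict


class CountModel:
    """Hypothetical online estimate of P(s, a, s') from observed transitions."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, a, s2):
        self.counts[(s, a)][s2] += 1

    def probs(self, s, a):
        seen = self.counts.get((s, a))
        if not seen:
            return {}  # nothing observed yet for this state-action pair
        total = sum(seen.values())
        return {s2: n / total for s2, n in seen.items()}


def bounded_trial(start, goal, actions, reward, model, step, V, d, alpha=1.0):
    """One RTDP-style trial that stops at the goal or at look-ahead depth d."""

    def q(s, a):
        # Unknown states default to 0.0, i.e. the heuristic h(s) = 0.
        return reward(s, a) + alpha * sum(
            p * V.get(s2, 0.0) for s2, p in model.probs(s, a).items()
        )

    s, depth = start, 0
    while s != goal and depth < d:
        best = max(actions(s), key=lambda a: q(s, a))
        V[s] = q(s, best)
        s2 = step(s, best)  # outcome observed in the environment
        model.observe(s, best, s2)  # refine the estimate of P(s, a, s')
        s, depth = s2, depth + 1
    return depth


# Hypothetical deterministic corridor: "fwd" moves one cell, "wait" stays in place.
model, V = CountModel(), {}
depth = bounded_trial(
    start=0, goal=10,
    actions=lambda s: ["fwd", "wait"],
    reward=lambda s, a: -1.0,
    model=model,
    step=lambda s, a: s + 1 if a == "fwd" else s,
    V=V, d=3,
)
print(depth, model.probs(0, "fwd"))
```

Because the trial length is capped at d, the time per trial is bounded regardless of how far away the goal is, at the cost of leaving states beyond depth d with their heuristic values.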