
In this section, we consider a case similar to the previous subsection but with spatial dependencies in the environmental variables. A maze-like environment, shown in Figure 5.2, was set up. A robot traverses an undirected graph $G = (X, E)$. The robot starts at vertex $x_0 \in X$ in the bottom left-hand corner of the environment and has the task of navigating to the goal vertex $g \in X$ in the upper right-hand side of the environment, as indicated by the markers in the figure. Each grid square corresponds to a graph vertex, with the dark squares corresponding to vertices that cannot be entered. At $x \in X$, the robot may choose to move to any of the at most eight neighbouring vertices in $N(x)$. Staying in place was not permitted in this experiment. The state transition model $T_x$ for the robot is deterministic, and its current location $x \in X$ is always fully observable.

Figure 5.2: The maze navigated by the robot, drawn on a grid with x and y coordinates from 1 to 15, where the black areas are not traversable. The upward- and downward-facing triangle markers denote the start and goal locations, respectively. The solid line (path with measurements) denotes the path taken by the robot optimising the operation of its measurement resources. A path computed by an A* search using the expected traversal costs based on information in the initial belief state is shown by the dashed line (A* path). The black crosses indicate locations where an obstacle was present at any time during the experiment. Figure adapted from Lauri and Ritala (2012).

Four moving obstacles $y_1$, $y_2$, $y_3$ and $y_4$ with random walk dynamics also traverse the graph. The state transition model $T_i$ for any obstacle $i$ is such that with probability $p_s = 0.9$ it remains at its current location, and with probability $1 - p_s = 0.1$ it moves to a neighbouring vertex chosen uniformly at random. The presence of obstacles at any vertex is partially observable. A separate belief over the location of each obstacle was maintained, and the beliefs over the individual obstacles were approximated as independent.
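The prediction step for a single obstacle's belief under this random walk can be sketched as follows. This is a minimal illustration, not the implementation used in the experiment: the names `neighbours` and `predict` are hypothetical, and it assumes that obstacles move on the same eight-connected grid as the robot and only between traversable vertices.

```python
from itertools import product


def neighbours(v, free):
    """Return the free vertices adjacent to v on an 8-connected grid (assumed)."""
    x, y = v
    return [(x + dx, y + dy)
            for dx, dy in product((-1, 0, 1), repeat=2)
            if (dx, dy) != (0, 0) and (x + dx, y + dy) in free]


def predict(belief, free, p_stay=0.9):
    """Propagate one obstacle's belief through one random walk step."""
    new_belief = {v: 0.0 for v in free}
    for v, p in belief.items():
        nbrs = neighbours(v, free)
        if not nbrs:
            # Assumption: an obstacle with no free neighbour stays in place.
            new_belief[v] += p
            continue
        new_belief[v] += p_stay * p
        share = (1.0 - p_stay) * p / len(nbrs)
        for n in nbrs:
            new_belief[n] += share
    return new_belief


# Usage: an obstacle known to be at the centre of an obstacle-free 3 x 3 grid.
free = {(x, y) for x in range(3) for y in range(3)}
after_one_step = predict({(1, 1): 1.0}, free)
```

After one step the centre keeps probability $p_s = 0.9$ and the remaining 0.1 is split evenly among the eight neighbours. Because the beliefs are treated as independent, the same step would be applied to each of the four obstacles separately.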

While traversing the environment, the robot may observe its surroundings by a noisy sensor with eight operational degrees of freedom. The degrees of freedom correspond to choosing two squares adjacent to the robot’s current vertex to observe. The robot first selects a single cardinal direction: either north, east, south, or west. This is the primary direction of observation where the robot focuses its attention. In addition, for any choice of cardinal direction the robot can choose another secondary focus of attention such that it is adjacent to both the robot and the primary focus of attention. The eight observation modalities are indicated in Figure 5.3. The robot observes one bit of information per vertex observed, i.e. $Z = \{0, 1\}^2$, with 0 corresponding to no obstacle observed and 1 corresponding to an obstacle observed. The observation model $O$ is such that there is a symmetric false positive and false negative probability $p_1 = 0.1$ for the primary focus of attention and $p_2 = 0.25$ for the secondary focus of attention.

Figure 5.3: The eight observation options available to the robot. The robot is located at the centre of each grid, and the grids are labelled consecutively from 0 to 7. The asterisk “*” indicates the cardinal direction corresponding to the primary focus of attention, and the hyphen “-” indicates the secondary focus of attention.
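The effect of the two error rates on a belief can be illustrated with a single-bit Bayes update. The sketch below is a simplification under stated assumptions: it updates the belief of one obstacle in isolation and ignores that the observed bit could also be caused by one of the other three obstacles. The function names `likelihood` and `update` are hypothetical.

```python
P_PRIMARY = 0.1     # symmetric error rate p_1 for the primary focus
P_SECONDARY = 0.25  # symmetric error rate p_2 for the secondary focus


def likelihood(z, occupied, p_err):
    """P(z | occupancy) with equal false positive and false negative rates."""
    return 1.0 - p_err if z == int(occupied) else p_err


def update(belief, observed, z, p_err):
    """Bayes update of one obstacle's belief after reading bit z at `observed`."""
    posterior = {v: likelihood(z, v == observed, p_err) * p
                 for v, p in belief.items()}
    norm = sum(posterior.values())
    return {v: p / norm for v, p in posterior.items()}


# Usage: a uniform belief over four vertices, with a positive reading at "a".
prior = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
post_primary = update(prior, "a", 1, P_PRIMARY)
post_secondary = update(prior, "a", 1, P_SECONDARY)
```

With this prior, a positive reading through the primary focus raises the probability of vertex "a" to 0.75, whereas the same reading through the noisier secondary focus raises it only to 0.5.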

As in the previous subsection, the reward function $R(s, a) = R_m(s, a) + R_c(s, a)$ consists of the cost of movement $R_m$ and the possible cost $R_c$ of entering a vertex with an obstacle. Noting that the state is $s = (x, y)$, we define

$$R_m(s, a) = -c_{\mathrm{terr}}(a) + \begin{cases} -c_{\mathrm{move}}(x, a) & \text{if } a \neq g \\ r_{\mathrm{goal}} & \text{if } a = g. \end{cases} \tag{5.3}$$

Here $c_{\mathrm{terr}}(a)$ is the terrain cost of entering vertex $x' = a$, sampled uniformly at random for each vertex from the range $[0, 1]$. The movement cost $c_{\mathrm{move}}(x, a)$ from $x$ to $a = x' \in N(x)$ is 0.1 times the Euclidean distance between the vertices. A one-time reward $r_{\mathrm{goal}} = 100$ is accumulated when the goal vertex $g$ is entered. For the other term,

$$R_c(s, a) = \begin{cases} 0 & \text{if } \nexists\, i \in \{1, 2, 3, 4\} : y_i = a \\ -c_{\mathrm{coll}} & \text{if } \exists\, i \in \{1, 2, 3, 4\} : y_i = a \end{cases} \tag{5.4}$$

which indicates that a collision cost of $c_{\mathrm{coll}} = 1500$ is accumulated if the agent enters a vertex $x' = a$ where there is at least one moving obstacle.

We solved the problem applying RTBSS with the optimistic upper bound and a lower bound obtained by the time expanded network approach. The solid line in Figure 5.2 illustrates the path taken by the robot while optimising the use of its observation resources to detect and avoid the moving obstacles. A look-ahead depth of $d = 2$ was applied in this case. Compared to the A* solution, a preference for non-diagonal movements is seen in several parts of the path. Moving non-diagonally allows the robot to observe more vertices possibly traversed in the near future, helping it avoid obstacles.
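The reward terms of Equations (5.3) and (5.4) can be sketched in a few lines. This is an illustrative reading of the equations rather than the code used in the experiment: vertices are assumed to be (x, y) coordinate tuples, `c_terr` is a hypothetical dictionary of sampled terrain costs, and `expected_collision_reward` is an added helper showing how the independence approximation over obstacle beliefs would enter the expected collision cost.

```python
import math

R_GOAL = 100.0   # one-time reward r_goal for entering the goal vertex
C_COLL = 1500.0  # collision cost c_coll


def movement_reward(x, a, goal, c_terr):
    """R_m of Eq. (5.3): terrain cost plus movement cost or goal reward."""
    if a == goal:
        return -c_terr[a] + R_GOAL
    return -c_terr[a] - 0.1 * math.dist(x, a)


def collision_reward(a, obstacles):
    """R_c of Eq. (5.4): penalty if any obstacle occupies the entered vertex."""
    return -C_COLL if a in obstacles else 0.0


def reward(x, obstacles, a, goal, c_terr):
    """R(s, a) = R_m(s, a) + R_c(s, a) for state s = (x, obstacles)."""
    return movement_reward(x, a, goal, c_terr) + collision_reward(a, obstacles)


def expected_collision_reward(a, beliefs):
    """Expected R_c when each obstacle has an independent belief (assumed)."""
    p_free = 1.0
    for b in beliefs:
        p_free *= 1.0 - b.get(a, 0.0)
    return -C_COLL * (1.0 - p_free)


# Usage with hypothetical terrain costs on two vertices.
c_terr = {(1, 1): 0.5, (2, 2): 0.2}
r_step = reward((0, 0), [(5, 5)], (1, 1), (2, 2), c_terr)
r_goal_hit = reward((1, 1), [(2, 2)], (2, 2), (2, 2), c_terr)
r_expected = expected_collision_reward((1, 1), [{(1, 1): 0.1}, {(1, 1): 0.2}])
```

Because the collision cost is fifteen times the goal reward, even a modest probability of an obstacle at the next vertex dominates the expected reward, which is consistent with the robot spending its observations on vertices it is about to enter.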

In this problem, the depth of the look-ahead is limited by the more complicated system dynamics compared to the previous subsection. For $d = 1$, the required planning times were less than 1 second, and for $d = 2$, they averaged 70 seconds with maximum times around 200 seconds. Experimenting with $d = 3$, we could not obtain solutions in a reasonable time.
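A rough count of the look-ahead tree size suggests why the planning time grows so quickly with depth. The calculation below rests on assumptions that are not stated in the text: that each decision pairs one of up to eight moves with one of the eight observation modalities, and that no branch is pruned. RTBSS does prune using its bounds, so the figures are upper limits on the number of belief nodes, not measured values.

```python
N_MOVES = 8         # at most eight neighbouring vertices
N_SENSING = 8       # eight observation modalities (Figure 5.3)
N_OBSERVATIONS = 4  # |Z| = |{0, 1}^2|


def max_belief_nodes(depth):
    """Upper limit on belief nodes at a given look-ahead depth, no pruning."""
    branching = N_MOVES * N_SENSING * N_OBSERVATIONS
    return branching ** depth


for d in (1, 2, 3):
    print(d, max_belief_nodes(d))
# prints:
# 1 256
# 2 65536
# 3 16777216
```

Under these assumptions each additional level multiplies the worst-case tree size by 256, so a depth that is affordable at $d = 2$ can become impractical at $d = 3$.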

The computational burden is the most severe barrier to the application of RTBSS to the problems presented in this section. As a non-valid upper bound via the time expanded network was applied for RTBSS, the solution it finds cannot be guaranteed to be optimal. Based on the experimental data, we conjecture that POMCP with an informative rollout policy is preferable to RTBSS in this problem type.

5.2 Environment monitoring

In the environment monitoring problem (Problem 3.5), a robot is traversing an undirected graph and observing sets of environmental variables related to the vertices in the graph. We determined in Section 4.3 the conditions under which the environment monitoring problem can be relaxed into a POMDP MAB. In this section, we empirically examine the effects of using the upper bounds obtained via the relaxation to solve the original, constrained monitoring problem. We consider two cases: one where the MAB relaxation conditions are fulfilled and one where some of the conditions are violated. In the first case, the upper bounds obtained via the MAB relaxation are valid, while in the second case the MAB value is applied as a heuristic value. We compare the RTBSS algorithm to the sampling-based POMCP algorithm.
