All of the simulations were created using the Social Forces model. Argil, a library designed for pedestrian simulation, was used to model, visualize, and process the scenarios. More information about Argil is presented in Chapter 7. Argil includes an implementation of the Social Forces model that was inspired by the implementation in Menge[10].
For debugging, two environments were constructed: a Hallway scene and a Fork scene. Sample frames from the Hallway scene are shown in Figure 5.1, and sample frames from the Fork scene are shown in Figure 5.2. In each figure, the largest circle marks the current location of an agent, while the trail of smaller circles shows its previous locations. In all scenarios, there are ten agents, each moving from one sink towards one of the other sinks. A sink is an entrance to or exit from the environment (the Hallway has two sinks and the Fork has three sinks).
Figure 5.1: Representative Sequence from Hallway Simulation
Data was obtained by simulating the progress of agents for 2000 timesteps. Each agent began its path after a random delay drawn from a uniform distribution with a lower bound of 0 and an upper bound of 400 timesteps. Each scene corresponds roughly to a 10 meter by 10 meter region in the real world. The positions of all agents were recorded at every tenth timestep.
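The sampling schedule described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not Argil's API: the function names are hypothetical, and treating the delays as integers with an inclusive upper bound is an assumption.

```python
import numpy as np

TOTAL_STEPS = 2000   # simulation length (from the text)
MAX_DELAY = 400      # upper bound on the start delay (from the text)
RECORD_EVERY = 10    # positions are stored every tenth timestep (from the text)
NUM_AGENTS = 10      # agents per scenario (from the text)


def sample_start_delays(num_agents, rng):
    """Draw one start delay per agent from a uniform distribution.

    Assumption: delays are whole timesteps and the bound of 400 is inclusive.
    """
    return rng.integers(0, MAX_DELAY + 1, size=num_agents)


def recorded_timesteps(total_steps=TOTAL_STEPS, stride=RECORD_EVERY):
    """Return the timesteps at which agent positions are written out."""
    return np.arange(0, total_steps, stride)


rng = np.random.default_rng(0)
delays = sample_start_delays(NUM_AGENTS, rng)
steps = recorded_timesteps()

# active[i, k] is True once agent i has started moving by recorded step k.
active = delays[:, None] <= steps[None, :]
```

Under these assumptions each simulation yields 200 recorded frames per agent, and the `active` mask marks the frames that precede an agent's start.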
For the Hallway scenario, agents were randomly assigned to a starting side, and each agent then moved horizontally towards the other side. The vertical start and end positions of each agent were drawn from a Normal distribution with a standard deviation of 2.0, centered at a y-coordinate of 7.5 for agents traveling right and at a y-coordinate of 2.5 for agents traveling left. This choice of y-coordinates ensures that the majority of the agents traveling right will be in the upper half of the hallway, while those traveling left will be in the lower half. Figure 5.1 shows how the tendency of agents to travel together in the same direction results in relatively few head-on collision-avoidance situations.
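A minimal sketch of the Hallway start and goal sampling follows. Several details are assumptions made for illustration: the function name is hypothetical, agents traveling right are assumed to start at x = 0, and samples are clipped to the 10 meter scene, which the text does not state.

```python
import numpy as np

SCENE_SIZE = 10.0                      # scene is roughly 10 m by 10 m (from the text)
Y_STD = 2.0                            # standard deviation (from the text)
Y_MEAN = {"right": 7.5, "left": 2.5}   # lane centers (from the text)


def sample_hallway_agent(rng):
    """Return (direction, start_xy, goal_xy) for one Hallway agent."""
    direction = "right" if rng.random() < 0.5 else "left"
    # Assumption: agents traveling right enter at x = 0 and exit at x = 10.
    if direction == "right":
        start_x, goal_x = 0.0, SCENE_SIZE
    else:
        start_x, goal_x = SCENE_SIZE, 0.0
    # Start and end heights are drawn independently from the same Normal.
    # Assumption: values falling outside the scene are clipped to its bounds.
    ys = np.clip(rng.normal(Y_MEAN[direction], Y_STD, size=2), 0.0, SCENE_SIZE)
    return direction, (start_x, float(ys[0])), (goal_x, float(ys[1]))
```

With a mean of 7.5 and a standard deviation of 2.0, the midline at y = 5 lies 1.25 standard deviations below the mean, so roughly 89% of agents traveling right start in the upper half (and symmetrically for agents traveling left).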
The Fork scenario (Figure 5.2) includes a much greater number of pedestrian interactions. Agents in the Fork scenario were randomly assigned to one of the three start sides, and a goal destination was then also selected at random. To make the scene even more interesting, each agent was also given a waypoint in the center, where all three pathways converge. An agent must reach its center waypoint before it can continue on to its final destination. The center waypoints were drawn from a Uniform distribution, so every position in the center is equally likely to be chosen.
Figure 5.2: Representative Sequence from Fork Simulation
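The Fork agent setup can be sketched as follows. The sink labels and the bounds of the center region are hypothetical placeholders, since the scene geometry is not specified here, and choosing the goal from the two sinks other than the start is an interpretation of "one of the other sinks".

```python
import numpy as np

# Hypothetical labels for the three sinks of the Fork scene.
SINKS = ("sink_a", "sink_b", "sink_c")
# Hypothetical bounds (x_lo, x_hi), (y_lo, y_hi) of the center region in meters.
CENTER_BOX = ((4.0, 6.0), (4.0, 6.0))


def sample_fork_agent(rng):
    """Return (start_sink, center_waypoint, goal_sink) for one Fork agent."""
    start = SINKS[rng.integers(len(SINKS))]
    # Assumption: the goal is chosen uniformly from the sinks other than the start.
    others = [s for s in SINKS if s != start]
    goal = others[rng.integers(len(others))]
    # Uniform waypoint: every position in the center box is equally likely.
    (x_lo, x_hi), (y_lo, y_hi) = CENTER_BOX
    waypoint = (float(rng.uniform(x_lo, x_hi)), float(rng.uniform(y_lo, y_hi)))
    return start, waypoint, goal
```

Each agent's route is then start sink, center waypoint, goal sink, which forces every path through the region where the three pathways meet.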
Evaluating architectures on multiple types of interactions provides a better estimate of how well the model may perform on real-world situations where there is a mix of complex and simple interactions. Since the Hallway simulation involves few pedestrian interactions, it was used to validate that all neural network architectures were capable of learning the linear patterns of movement. The models converged to reasonable solutions within 30 minutes to one hour on the Hallway simulations, so these simulations were used to evaluate general hyper-parameters. The Fork scenario involves many interactions between agents; the Fork simulations were used to validate the neighbor representations to ensure that the models were able to detect nearby agents.
5.1.1 Hardware and Software
All neural network models were trained on Amazon EC2 using memory-optimized instances. The r4.large instances (15.25 Gigabytes of RAM, 2 vCPUs) were used for small simulations with fewer than 10 agents; the r4.xlarge instances (30.5 Gigabytes of RAM, 4 vCPUs) and the r4.2xlarge instances (61 Gigabytes of RAM, 8 vCPUs) were used for scenarios with more than 10 agents. Training on GPUs was only marginally faster than training on CPUs, so the less expensive servers without GPUs were used. The bottleneck that prevented faster training on GPUs was the high memory requirements of the models during training. When the maximum number of agents in a scene was ten, most models used 4 to 8 Gigabytes of RAM. For models trained on environments where up to 40 agents were present in an individual scene, 10 to 35 Gigabytes of RAM were required. The SATTN model typically had the highest memory requirements, since each attention Gaussian needed to process all other agents in a loop.
All networks were implemented in TensorFlow[1] v1.0, and the networks were trained with an initial learning rate of 0.003 using the RMSProp[15] optimizer. After every epoch (one pass of training over all samples), the learning rate was annealed by 10%. Decaying the learning rate has been shown to help networks converge to better solutions. Gradient clipping was employed to prevent the error gradients from becoming too large in magnitude. When the absolute value of the gradient becomes large, the parameters can quickly diverge towards positive or negative infinity. To avoid this, each individual gradient value was clipped to lie between negative five and positive five.
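The training schedule above can be illustrated with a small NumPy sketch. This is not the TensorFlow code used for the experiments: the function names are hypothetical, the RMSProp decay rate `rho` and the constant `eps` are assumed values not given in the text, and "annealed by 10%" is interpreted as multiplying the learning rate by 0.9 after each epoch.

```python
import numpy as np

INITIAL_LR = 0.003   # initial learning rate (from the text)
LR_DECAY = 0.9       # assumption: "annealed by 10%" means lr <- 0.9 * lr per epoch
CLIP_VALUE = 5.0     # gradients are clipped to [-5, 5] (from the text)


def learning_rate(epoch):
    """Learning rate in effect during the given epoch (0-indexed)."""
    return INITIAL_LR * LR_DECAY ** epoch


def clip_gradient(grad):
    """Clip every gradient entry individually to [-CLIP_VALUE, CLIP_VALUE]."""
    return np.clip(grad, -CLIP_VALUE, CLIP_VALUE)


def rmsprop_step(param, grad, cache, lr, rho=0.9, eps=1e-8):
    """One RMSProp update with elementwise clipping applied first.

    rho and eps are illustrative defaults, not values reported in the text.
    """
    grad = clip_gradient(grad)
    cache = rho * cache + (1.0 - rho) * grad ** 2   # running mean of squared gradients
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache
```

For example, a raw gradient of 100 is clipped to 5 before the update, so a single exploding gradient cannot move a parameter arbitrarily far. In TensorFlow 1.x, the same elementwise clipping is available through `tf.clip_by_value`.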
In general, the training of the neural network models was chaotic. Small changes to hyper-parameters such as the learning rate or the Dropout probability could dramatically affect convergence. Using the Social Forces simulations was essential for quickly debugging the architectures. A logistical challenge of building the models was their high memory requirements; a custom system for queuing training jobs on AWS servers was developed to reduce the time spent configuring and setting up training jobs.