While simulations are useful for choosing hyper-parameters and selecting potential architectures, real-world datasets provide the strongest validation of neural network models. The authors of the Social-LSTM paper used data from the ETH[41] and UCY[32] datasets. Both datasets were constructed from an overhead camera looking at an area where many people were walking. The position of each agent at each timestep was recorded and logged, with 0.4 seconds between measurements in both datasets. The UCY datasets have significantly higher concentrations of people, so the interactions between pedestrians have a larger impact on the trajectories of individuals. Additionally, the videos for the ETH and UCY datasets were recorded from different camera heights, so it is not easy to train on both datasets unless accommodations are made for the difference in scale. Because of the difficulty of normalizing the scale across the two datasets, only the UCY datasets were used to validate the models presented in this thesis. UCY was chosen over ETH because it has more trajectories and denser crowds. Specifically, the three Zara datasets were used. A sample sequence from the first video is shown in Figure 6.3.

The Zara sequences of the UCY dataset include both videos and annotations. For this project, only the annotations were used. Each annotation file stores one spline per agent, with one entry for every timestep at which the agent was present. The entries include the x and y position of the agent (in pixels, with the center of the video being 0,0) along with the frame number and the gaze direction (not used). Before training, the coordinates were shifted and scaled to lie between 0 and 1, which makes the results easier to compare with those from the simulations, where the scale was also 0 to 1. The first two videos were used for training and the third was used for testing and validation. Together, the first two videos have 19527 frames. The first 2000 frames of the third video were used for validation and to perform early stopping. The remaining 5524 frames from the third video were used to evaluate the performance of each architecture.
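The preprocessing described above can be sketched as follows. This is a minimal illustration, not the thesis code: the function names are hypothetical, the frame size is passed in by the caller, and scaling each axis independently by the frame width and height is an assumption, since the text only states that coordinates were shifted and scaled to lie between 0 and 1.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def normalize_positions(points: List[Point], width: float, height: float) -> List[Point]:
    """Map centered pixel coordinates to the unit square.

    Assumes x lies in [-width/2, width/2] and y in [-height/2, height/2],
    i.e. the center of the video is (0, 0) as in the annotation files.
    Per-axis scaling is an assumption made for this sketch.
    """
    return [((x + width / 2.0) / width, (y + height / 2.0) / height)
            for x, y in points]


def split_third_video(frames: List[int], n_validation: int = 2000) -> Tuple[List[int], List[int]]:
    """Split the third video's frames into validation and test portions.

    The first n_validation frames are used for validation and early
    stopping; the remaining frames are held out for evaluation.
    """
    return frames[:n_validation], frames[n_validation:]


# Hypothetical usage: 2000 + 5524 = 7524 frames in the third video.
third_video_frames = list(range(7524))
validation_frames, test_frames = split_third_video(third_video_frames)

# Assumed frame size of 720 x 576 pixels, used here only for illustration.
unit_points = normalize_positions([(0.0, 0.0), (-360.0, -288.0)], 720.0, 576.0)
```

With these assumptions the center of the video maps to (0.5, 0.5) and the corner with the most negative coordinates maps to (0, 0).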

The parameters and systems were set up the same way as for the Intersection simulation, although this time the Social-LSTM used a scaled equivalent of a 32-pixel-wide grid, since that was the recommendation of the authors. As before, the models were trained to compute the parameters of a probability distribution. The mean log probability of all predictions on the last timestep (8) for the test data was collected and is summarized in Table 6.2. When only a single Gaussian was used, the Social-LSTM model was the best, while the Hierarchical LSTM exceeded the performance of all other architectures when outputting parameters for 20 Gaussians (Mixture). Surprisingly, this ordering is the reverse of the Intersection simulation, where the Hierarchical LSTM was the best when using a single Gaussian but the Social-LSTM architecture was the best for the mixture of Gaussians. These slightly different outcomes further support the idea that the Social-LSTM and Hierarchical LSTM achieve nearly equivalent results.
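To make the evaluation metric concrete, the following sketch computes a mean log probability under a mixture of bivariate Gaussians. It is an illustration under stated assumptions rather than the thesis implementation: the parameter layout (weight, means, standard deviations, correlation) and all function names are hypothetical, and a single Gaussian is treated as a mixture with one component of weight 1.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical parameter layout: (weight, mu_x, mu_y, sigma_x, sigma_y, rho)
Component = Tuple[float, float, float, float, float, float]


def bivariate_gaussian_log_pdf(x: float, y: float, mu_x: float, mu_y: float,
                               sigma_x: float, sigma_y: float, rho: float) -> float:
    """Log density of a bivariate Gaussian with correlation rho (|rho| < 1)."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho * rho
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / one_minus_rho2
    log_norm = math.log(2.0 * math.pi * sigma_x * sigma_y) + 0.5 * math.log(one_minus_rho2)
    return -log_norm - 0.5 * quad


def mixture_log_pdf(x: float, y: float, components: Sequence[Component]) -> float:
    """Log density of a Gaussian mixture, using log-sum-exp for stability.

    Assumes all weights are positive and sum to 1.
    """
    logs = [math.log(w) + bivariate_gaussian_log_pdf(x, y, mx, my, sx, sy, r)
            for w, mx, my, sx, sy, r in components]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def mean_log_probability(targets: List[Tuple[float, float]],
                         predictions: List[Sequence[Component]]) -> float:
    """Average log density of the true positions under the predicted mixtures."""
    total = sum(mixture_log_pdf(x, y, comps)
                for (x, y), comps in zip(targets, predictions))
    return total / len(targets)


# A tight Gaussian (sigma = 0.01 on unit-scaled coordinates) evaluated at its mean.
tight = [(1.0, 0.5, 0.5, 0.01, 0.01, 0.0)]
example_score = mean_log_probability([(0.5, 0.5)], [tight])
```

Because the coordinates are scaled to lie between 0 and 1, a confident prediction can have a density well above 1 and therefore a positive log probability. In the example, `example_score` is about 7.37. The thesis does not report the fitted variances, so this only illustrates how values of that magnitude can arise.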

Figure 6.4 is a visual summary of the data. One notable occurrence in the results is that the mixture of Gaussians actually hurt the performance of the two attention models even though it improved all of the other architectures. One explanation for this behavior is that the mixture of Gaussians makes the models more prone to overfitting. In future work, more rigorous regularization could be incorporated to test this hypothesis. More Gaussians make a model prone to overfitting because Mixture Density Networks then have more capacity to memorize specific training examples. In the worst case, for each set of similar trajectories, the mixture versions may output parameters that specify a separate mode for the next position of each trajectory in the set. Such a model would not learn the true underlying motivations for how pedestrians pick their next movements and thus would perform poorly on new examples. Since all other architectures performed better using many Gaussians, it does not seem that overfitting was a major issue overall.

Table 6.2: Results for UCY Dataset

Model     Single   Mixture
BASE      6.01     6.41
S-LSTM    7.98     8.07
OCC       6.94     7.53
HIER      7.79     8.14
SATTN     6.99     6.93
NATTN     7.90     7.45
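The effect of switching from a single Gaussian to a mixture can be tallied directly from Table 6.2. The short sketch below only restates the table values and subtracts them; the variable and function names are hypothetical.

```python
from typing import Dict, Tuple

# (single, mixture) mean log probabilities copied from Table 6.2
ucy_results: Dict[str, Tuple[float, float]] = {
    "BASE": (6.01, 6.41),
    "S-LSTM": (7.98, 8.07),
    "OCC": (6.94, 7.53),
    "HIER": (7.79, 8.14),
    "SATTN": (6.99, 6.93),
    "NATTN": (7.90, 7.45),
}


def mixture_gain(results: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Change in mean log probability when moving from single to mixture."""
    return {name: round(mixture - single, 2)
            for name, (single, mixture) in results.items()}


gains = mixture_gain(ucy_results)

# Architectures whose score dropped when a mixture was used.
hurt_by_mixture = [name for name, gain in gains.items() if gain < 0]

# Best architecture in each column (higher mean log probability is better).
best_single = max(ucy_results, key=lambda name: ucy_results[name][0])
best_mixture = max(ucy_results, key=lambda name: ucy_results[name][1])
```

Only the two attention models (SATTN and NATTN) show a negative change, while the gains for the remaining architectures range from 0.09 (S-LSTM) to 0.59 (OCC).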

The pattern of performance in the UCY data is similar to the results from the Intersection scenario, which suggests that the simulation was a useful analogue to real-world interactions. Although it never achieved the best result, the Neighbor Attention model achieved results close to those of the Hierarchical LSTM and Social-LSTM models, especially in the single Gaussian case, where its 7.90 fell between their 7.79 and 7.98. Once again, the Spatial Attention and Occupancy Grid architectures were the worst performing of the proposed models. For the single Gaussian case, the Occupancy Grid was the worst, while for the mixture case, the Spatial Attention was the worst. Remarkably, the Occupancy Grid achieved impressive results (7.53) when allowed to output parameters for a mixture distribution.

Figure 6.4: UCY Results

6.1.2.1 Limitations

A major limitation of this dataset is its size. Using fewer than 20,000 frames of video for training is meager in comparison to the hundreds of thousands of training samples that are available in other domains. Compounding the small dataset size, many frames do not include any pedestrian-pedestrian interactions, since sometimes there are only one or two pedestrians in the scene (although there are still many more interactions than in other datasets like ETH). The results certainly indicate that the Hierarchical LSTM and the Neighbor Attention models are comparable or nearly comparable to the well-configured Social-LSTM model, but larger datasets should be able to show the differences between them.

The scene is also fairly narrow in the geographic area that it covers. People can cross the environment in just over 15 timesteps, so the dataset does not adequately test whether the models can learn from long-range interactions. Since there are only two exits from the scene (on the left and on the right), there are few interactions between agents approaching each other from perpendicular directions. There is no noticeable mimicry behavior, and the scenes are composed solely of pedestrians who are moving at roughly the same speed. These limitations, however, are also beneficial to training: because the destinations are consistent (left or right) and the likely end destination is obvious, goal-oriented behavior is not a major contributing factor to the movements of people. The movements of pedestrians are thus more attributable to navigating around other people and obstacles, which better tests the models proposed in this thesis. The authors of Social-LSTM noted that the ETH datasets often had few pedestrians in the scenes, so many of the choices made by agents were related to their end goal rather than their interactions with others. This made it difficult for the authors to demonstrate the advantage of the Social-LSTM model.