
5.2.1.1 Dense Video Models

First we examine the results of the dense networks that were trained and tested using only the video data. Both the accuracy and loss results are shown in Figures 5.11 and 5.12.

CHAPTER 5. PERFORMANCE ANALYSIS

Figure 5.11: Video Accuracy Results

Figure 5.12: Video Loss Results

Among the 7 dense networks, Dense-4 performed the best during the training phase, finishing at 50.1% accuracy, whereas Dense-2 and Dense-6 performed the worst at about 47.6% training accuracy each. The notable difference between Dense-4 and these two configurations is the use of dropout and batch normalization in Dense-2 and Dense-6. The full list of configurations can be found in Table 3.1. This trend, however, is inverted during the validation phase. The highest performing model is Dense-6, which reached a validation accuracy of 46.8%. The worst performing model was Dense-4, which only reached 42.8% accuracy. So while dropout and batch normalization reduced performance in the training phase, the validation phase benefited from their addition. Dense-6 performed almost equivalently between the phases (47.6% versus 46.8%), whereas Dense-4 suffered a drop of more than 7 percentage points. This suggests that Dense-4 is overfitting to the data due to the lack of dropout and batch normalization. Comparing Dense-2 to Dense-6, the difference is that Dense-6 has 2 extra layers. While these models performed almost identically during training, the extra layers aided the validation, with Dense-2 only validating at 45.1%.
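The comparison above amounts to measuring the gap between training and validation accuracy for each configuration. The following is a minimal sketch of that arithmetic using only the final accuracies quoted in the text; the dictionary layout and the function name `generalization_gap` are illustrative assumptions, not part of the original experiments.

```python
# Final accuracies (percent) quoted in the text for the video-only models.
video_results = {
    "Dense-2": {"train": 47.6, "val": 45.1},
    "Dense-4": {"train": 50.1, "val": 42.8},
    "Dense-6": {"train": 47.6, "val": 46.8},
}


def generalization_gap(train_acc, val_acc):
    """Training accuracy minus validation accuracy, in percentage points."""
    return train_acc - val_acc


# Round to one decimal place because the inputs are only given to one decimal.
gaps = {
    name: round(generalization_gap(r["train"], r["val"]), 1)
    for name, r in video_results.items()
}

# List the models from smallest to largest gap.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.1f} percentage points")
```

A larger positive gap indicates a model that fits the training data noticeably better than the validation data, which is the pattern attributed to Dense-4 above.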

Looking at the loss data, these trends remain generally true. In the training phase, Dense-4 has the lowest loss among the configurations, whereas Dense-2 and Dense-6 have the highest loss. In validation, Dense-4 instead has the highest loss, while Dense-2 and Dense-6 have the lowest.

Figure 5.13: Video Average Results

Looking at the average among all 7 of the models, shown in Figure 5.13, the highest training accuracy was 48.8%, whereas validation reached 46.4%. Overall, the models tended to overfit the data, resulting in the lower validation accuracy.

5.2.1.2 Audio Models

Next, the same analysis can be conducted on the models trained using only the audio data. The accuracy and loss results for these models are shown in Figures 5.14 and 5.15.

Figure 5.14: Audio Accuracy Results

Figure 5.15: Audio Loss Results

Looking at the accuracy numbers first, the results look quite different from the video results. The audio training accuracies are much more closely grouped than the video accuracies were, with every model finishing within 0.003 (0.3 percentage points) of each other. Dense-4, Dense-5, and Dense-7 all finished within 0.0016 of each other at around 46.8% accuracy. Dense-2 and Dense-6 finished as the worst performing models at 46.5%. This is interesting, as it matches the video training results. With the models all being so close in accuracy, however, it is difficult to say that the configurations made a significant difference.

The validation data, however, shows stronger trends. Similarly to the video, the worst performing models from training, which were again Dense-2 and Dense-6, performed the best. These models actually had a higher validation accuracy than training accuracy, with both models validating at 47.2% compared to 46.5% during training. This suggests that these audio models were actually underfitting the dataset. The worst performing model was Dense-7, which fell from 46.8% during training down to 46.0% during validation.

Figure 5.16: Audio Average Results

The average results for the audio models are shown in Figure 5.16. Similarly to the video data, the models tended to overfit, although not as severely in this case. The average audio model reached 46.7% accuracy during training, while validation reached 46.4% accuracy. So while the average audio model performed worse than the average video model during training (46.7% versus 48.8%), both the video and the audio models finished the validation stage at around 46.4%.

5.2.1.3 Pre-Fusion Combined Models

Finally, the combination models utilizing both video and audio can be analyzed to determine how viable the classification method is when utilizing the TVSum dataset. The accuracy and loss numbers for these models are shown in Figures 5.17 and 5.18.

Figure 5.18: Combined Loss Results

From the graphs it can be seen that one model stands out compared to the others. Dense-3 performed significantly better than all other models in training, reaching an accuracy of 62.6%. This model was unique in that it had more nodes per layer than the other models. It also utilized both dropout and batch normalization. As can be seen in the validation data, however, this model heavily overfit the data and performed significantly worse than all other models, reaching only 38.0% at the end of the validation phase. The remaining models all performed similarly to the single-modality models. Dense-4 performed the best of the remaining models during training, reaching 50.4% accuracy, while the worst performing models were again Dense-2 and Dense-6 at 47.5% and 47.6% respectively. The validation data again supports the previous claims, with the worst performing models from training being the best performing models in validation. Dense-2 performed the best with 46.6% accuracy, while Dense-4 fell to 43.8%.

Based on the averages of the combination models, shown in Figure 5.19, the combination models were the most accurate of the three sets during training. During validation, they were the least accurate of the three sets. The average accuracy for the models reached 50.9% during training and 46.3% during validation. When the outlier model, Dense-3, is removed from the calculations, however, the new average during training is 48.9%, with validation reaching 46.6%. The numbers without Dense-3 are almost identical to those found during both the video and audio tests.
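Removing one model from an average can be done without the individual values, as long as the overall mean, the number of models, and the excluded value are known. The sketch below applies this to the training figures quoted above. It assumes that the reported averages are simple arithmetic means of the final per-model accuracies, which the text does not state explicitly; the helper name `mean_without` is hypothetical.

```python
def mean_without(mean_all, n_values, excluded_value):
    """Mean of the remaining n_values - 1 values after removing one value.

    Assumes mean_all is a plain arithmetic mean over n_values entries.
    """
    if n_values < 2:
        raise ValueError("need at least two values to exclude one")
    return (mean_all * n_values - excluded_value) / (n_values - 1)


# Training figures quoted in the text: mean of 7 models, and Dense-3 alone.
train_mean_without_dense3 = mean_without(50.9, 7, 62.6)
print(f"{train_mean_without_dense3:.2f}")
```

Under that assumption the result is close to 48.95, which agrees with the reported 48.9% once the rounding of the inputs is taken into account.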

5.2.1.4 TVSum Final Analysis

When comparing the 3 sets of models, the results all supported each other. In each case, Dense-2 and Dense-6 were the least likely to overfit the data. They also showed almost no change in accuracy between the training phase and the validation phase. The final step in the analysis of these models is comparing the F1 scores, the measurement typically used as the final quantifier of a video summarization model's accuracy.

One item to note is the shape of the graphs throughout. The training accuracy graphs tended to follow the expected trajectory of rising sharply at first and leveling off as time continued. The training losses reflected this with a steep drop in loss at first, smoothing out as time continued. The graphs during the validation stage, however, did not follow these trends. During the validation stages, the graphs were notably jagged, with many different points of rising and falling and less obvious trends. The average graphs reflect this, with the slopes of the lines not aligning.

These observations lead us to believe that the models had trouble learning the TVSum dataset. This could be for a number of reasons. One of these is that the videos in TVSum are grouped into categories. With categories that contain many similar objects, the classifications will be based upon similar situations. For example, one category within TVSum is vehicle repair, and a second category is dog show videos. There are very few classes that overlap between these two categories. Our theory behind performing summarization after classification is that the summarizer will begin to learn class relationships. With few overlapping classes between videos, there is less value in learning those videos. So when the videos are limited to categories, it is imperative that a sufficient number of classes overlap between categories. This could be a major reason that we see such large spikes and changes in the validation data. When the models begin to learn one category of videos better than another, the validation accuracy begins to reflect how that category compares to the others.
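The jaggedness described for the validation curves can be put into a single number, for example the mean absolute change between consecutive epochs. The sketch below is one possible way to do this and is not a measurement used in the original analysis; the function name `roughness` and both example curves are hypothetical values chosen only to illustrate the contrast.

```python
def roughness(curve):
    """Mean absolute change between consecutive points of a curve."""
    if len(curve) < 2:
        return 0.0
    steps = [abs(later - earlier) for earlier, later in zip(curve, curve[1:])]
    return sum(steps) / len(steps)


# Hypothetical accuracy curves (percent per epoch), not taken from the figures.
smooth_curve = [40.0, 44.0, 46.0, 47.0, 47.5]   # rises sharply, then levels off
jagged_curve = [40.0, 45.0, 41.0, 46.0, 42.0]   # repeatedly rises and falls

print(roughness(smooth_curve))
print(roughness(jagged_curve))
```

A curve that levels off produces progressively smaller steps and a low value, while a curve that keeps rising and falling produces a high value even if its start and end points are similar.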

The typical metric used to measure the accuracy of a video summarization model is the F-Score, rather than accuracy or loss. This is because it better represents how far the summary is from the ground truth than accuracy does. Table 5.2 displays the F-Scores for each category as well as the overall F-Score for the entire dataset. Both Dense-2 and Dense-4 are displayed to compare the models that performed the best and worst in each phase of learning.
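To illustrate why the F-Score can be more informative than accuracy, the following sketch compares the two on a toy example. It assumes a frame-level formulation in which both the predicted summary and the ground truth are binary selections (1 means the frame is in the summary); the exact evaluation protocol used for the tables in this chapter is not given here, and all names and values below are hypothetical.

```python
def frame_accuracy(predicted, ground_truth):
    """Fraction of frames on which the two binary selections agree."""
    matches = sum(1 for p, g in zip(predicted, ground_truth) if p == g)
    return matches / len(ground_truth)


def f_score(predicted, ground_truth):
    """Harmonic mean of precision and recall over selected frames."""
    if len(predicted) != len(ground_truth):
        raise ValueError("selections must cover the same frames")
    overlap = sum(1 for p, g in zip(predicted, ground_truth) if p and g)
    if overlap == 0:
        return 0.0  # also covers empty predictions or empty ground truth
    precision = overlap / sum(1 for p in predicted if p)
    recall = overlap / sum(1 for g in ground_truth if g)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 8-frame video with 3 important frames.
ground_truth = [1, 0, 0, 0, 1, 1, 0, 0]
empty_summary = [0, 0, 0, 0, 0, 0, 0, 0]      # selects nothing
partial_summary = [1, 1, 0, 0, 1, 0, 0, 0]    # 2 of 3 selected frames are correct

for name, summary in [("empty", empty_summary), ("partial", partial_summary)]:
    print(name, frame_accuracy(summary, ground_truth), f_score(summary, ground_truth))
```

In this toy case the empty summary still agrees with the ground truth on most frames, so its accuracy is well above zero, while its F-Score is zero because it captures none of the important frames.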

Table 5.2: F-Scores for Dense-2 and Dense-4 for TVSum

Dense-2 F1-Scores

Category  Video    Audio    Combined
VT        0.16742  0.21208  0.21864
VU        0.22346  0.22722  0.18476
GA        0.20202  0.20236  0.2243
MS        0.2224   0.24812  0.18538
PK        0.20738  0.2228   0.21016
PR        0.19776  0.23382  0.18802
FM        0.20928  0.2338   0.1732
BK        0.20642  0.2258   0.18
BT        0.23692  0.20798  0.18962
DS        0.1904   0.2305   0.18904
All       0.2089   0.2068   0.1795

Dense-4 F1-Scores

Category  Video     Audio     Combined
VT        0.079755  0.176866  0.160622
VU        0.145506  0.251812  0.084599
GA        0.112373  0.230657  0.097492
MS        0.08179   0.125552  0.080057
PK        0.110434  0.129719  0.141394
PR        0.117086  0.120964  0.142175
FM        0.165948  0.117753  0.085717
BK        0.113667  0.152308  0.056322
BT        0.226391  0.237184  0.110621
DS        0.086191  0.147375  0.105503
All       0.122383  0.164543  0.107607

When we analyze the F-Scores, we see results that differ from those we saw for accuracy and loss. Looking at the dataset as a whole, Dense-2 performed almost equally between video and audio (0.2089 versus 0.2068), while the combined F-Score was much lower (0.1795), suggesting that combining the data was actually not beneficial. For Dense-4, the story is different. Audio heavily outperforms the video (0.1645 versus 0.1224), with the combination resulting in an even lower F-Score (0.1076). There also seems to be no real bias towards any category. The model does not favor any of the categories, which is intended: one of the goals of this approach was to make a generalized model that is not application specific. With no category significantly outperforming the others across all 3 data sources, this seems to be the case.

To get another understanding of what is actually happening, the output summaries are useful. The first example is a tutorial video on how to clean a dog's ears. This video features a mix of text slides along with video, and music mixed with talking. In the graph, the video, audio, and combined predictions are shown along with the ground truth expected value. The x-axis is the frame number, and the y-axis is the predicted importance for that frame on the 1-5 importance scale, thus creating a timeline of importance as the video progresses. These examples were taken using the Dense-2 network specifically. The first example is shown in Figure 5.20.

Figure 5.20: TVSum Example Output 1

From this example we find that the video tends to take the lead with the prediction, with the audio acting almost as a filter. One leading example of this is shown at the 1500 frame mark. In the video prediction there are multiple spikes from 3 down to 0, whereas the audio has a constant line at 0 importance. The combined prediction took both data inputs, and the result is a flat line at a level 3 importance over that section. This is the filtering effect that was mentioned. Looking at the same frames in the expected values, the ground truth has a spike from 5 down to 2, with the average looking to be around 3. This filtering effect was a common theme among examples, where the combined prediction utilized the values of the video summary more often but applied the shape of the audio summary more often. A second example is shown in Figure 5.21.

Figure 5.21: TVSum Example Output 2

In the second example video, a similar sampling effect can be seen, specifically at the points where the audio prediction spikes. Although the audio prediction marks these as 0 importance, that is not reflected in the combined prediction. The shape, however, is reflected, with those spiking values showing up in the combined prediction. One example of where this helps the summary match the expected values is around the 800 frame mark. The inverse effect is also shown, with incorrect spikes in the video prediction being smoothed to lower values. One example happens near the 4500 frame mark. The F-Score for this summary was 10.53; however, looking at the key spikes, the combined prediction actually aligns fairly well. An important note is that video summarization is about capturing the most important parts of a video. This can be seen in Figure 5.22.

Figure 5.22: TVSum Example Output 2, Scores of 3 or above

When we zoom in on scores of 3 or above as shown in Figure 5.22, the similarity between the prediction and expectation becomes more evident. This continues the discussion of what the best metrics for video summarization are. We discuss the F-score metric more in section 6.2.
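Restricting the comparison to scores of 3 or above can be sketched as a simple threshold over the per-frame importance values. This is only an illustration of the idea behind Figure 5.22, not the code used to produce it; the helper name `keep_high_scores` and the two short score sequences are hypothetical.

```python
def keep_high_scores(scores, threshold=3):
    """Return (frame_index, score) pairs whose score is at or above threshold."""
    return [(i, s) for i, s in enumerate(scores) if s >= threshold]


# Hypothetical per-frame importance values, not taken from the TVSum videos.
expected = [1, 2, 4, 5, 3, 2, 1, 3]
predicted = [2, 2, 3, 4, 3, 1, 1, 2]

high_expected = keep_high_scores(expected)
high_predicted = keep_high_scores(predicted)

# Frames that both sequences rate as important.
shared_frames = sorted(
    {i for i, _ in high_expected} & {i for i, _ in high_predicted}
)
print(shared_frames)
```

Comparing only the frames that pass the threshold puts the focus on whether the important regions line up, rather than on how closely the low-importance frames match.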

This filtering effect, however, is not always beneficial to the summary. We also found multiple examples where it has a negative effect on the prediction. One of these examples is shown in Figure 5.23.

Figure 5.23: TVSum Example Output 3

As can be seen in Figure 5.23, the video prediction has many spikes, similar to the expectation. However, the combined prediction has many constant sections, like the audio prediction. This is detrimental to the combined summary. Looking at the area around frames 500-750, the video prediction almost matches the expectation. The combined summary smoothed these spikes to a constant prediction, which is worse than the video prediction. So although the combination helped in some cases, as shown in the previous examples, there have also been cases where the combination hurt the result.

Finally, we can also examine the Post-Fusion model. To create this, the video and audio outputs were averaged, and a summary was created from those results. The data averaged in this case was from Dense-2, as that was the best scoring Pre-Fusion model.
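The averaging step described above can be sketched as an element-wise mean of the two per-frame prediction sequences. The text does not specify how the summary was then created from the averaged values, so the selection step below (keeping a fixed fraction of the highest-scoring frames) is purely an assumption for illustration, as are the function names and the example scores.

```python
def post_fusion(video_scores, audio_scores):
    """Element-wise average of per-frame video and audio predictions."""
    if len(video_scores) != len(audio_scores):
        raise ValueError("both modalities must cover the same frames")
    return [(v + a) / 2 for v, a in zip(video_scores, audio_scores)]


def select_top_frames(scores, fraction=0.15):
    """Indices of the highest-scoring frames, in temporal order.

    The fraction of frames kept is an assumed parameter, not a value
    taken from the original experiments.
    """
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical per-frame importance predictions for a 10-frame clip.
video_scores = [3, 0, 3, 1, 4, 2, 0, 3, 5, 1]
audio_scores = [0, 0, 2, 1, 4, 0, 0, 1, 3, 1]

fused_scores = post_fusion(video_scores, audio_scores)
summary_frames = select_top_frames(fused_scores, fraction=0.3)
print(fused_scores)
print(summary_frames)
```

Because no additional training takes place in this step, the quality of the fused result depends entirely on the two single-modality predictions being averaged.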

Table 5.3: F-Scores for Dense-2 Post-Fusion on TVSum

Category  F-Score
VT        0.2201962
VU        0.1636618
GA        0.0949427
MS        0.1681878
PK        0.1280141
PR        0.1112792
FM        0.1898702
BK        0.0938838
BT        0.1093584
DS        0.1938902
All       0.149511185

Interestingly, the Post-Fusion model scored similarly to the best Pre-Fusion model, although the overall average was slightly lower. This is as expected, as the Pre-Fusion model is specifically trained on the problem, whereas the Post-Fusion model instead averages the outputs of trained models. So although the Post-Fusion model was not better, it still performed well compared to the other models.

5.2.2 SumMe Analysis

5.2.2.1 Video Models

The same analysis can be performed on the SumMe dataset as on the TVSum dataset. The models tested were exactly the same as those used for TVSum, with the configurations found in Table 3.1. We analyze the models trained using only the video data first. The accuracy and loss for these models are shown in Figures 5.24 and 5.25.

Figure 5.24: Video Accuracy Results

Figure 5.25: Video Loss Results

Based on the graphs of the 7 models, we can see that Dense-5 performed the best during training, reaching 84.6%. The worst performing model was Dense-6 at 72.7%. The data from the validation stage shows a different story, however. Similarly to the TVSum models, the best performing models from training perform worse during the validation stage as the models begin to overfit the dataset. Dense-5 fell from 84.6% all the way to just 55.2%. Dense-6 improved in performance, reaching 74.0% accuracy. The differences between these two models are the inclusion of dropout and 2 extra layers in Dense-6.

The average results for all of the models trained on the video data are shown in Figure 5.26.

Figure 5.26: Video Average Results

Similarly to the TVSum video models, the average model performed better during training than during validation. The average model reached 76.7% accuracy while training, while peaking at 74.0% accuracy during validation. The shapes of these graphs, however, differ from those of the TVSum models. Examining the loss graphs specifically, one notable feature is that the curves all follow a similar trajectory during both the training and validation stages. This is in contrast to the TVSum dataset, which had unstable loss graphs and less obvious trends.

5.2.2.2 Audio Models

The results from using only the audio from the SumMe Videos are shown in Figures 5.27 and 5.28.

Figure 5.28: Audio Loss Results

Once again it can be seen that the models followed the trend of previous experiments. During the training phase, Dense-5 performed the best of all the models at 74.8% accuracy. The worst performing model was Dense-2 at 72.6% accuracy. During the validation stage, Dense-5 fell to 69.6% accuracy and was the worst performing model, whereas Dense-2 improved to 74.0%. These results are consistent with the previous video model findings.

The averages for the audio models are shown in Figure 5.29.

Figure 5.29: Audio Average Results

From the figure it can be seen that the average audio model reached 73.3% accuracy during training while peaking at 74.0% during validation. Like the video models, the general curve shape is more aligned with the expectation than it was for the TVSum audio models. This is best seen in the validation loss graph, where the shape resembles the training data. The validation accuracy graph does not closely track the training curve; however, the results are much more stable and consistent compared to those seen in the TVSum results.

5.2.2.3 Pre-Fusion Combined Models

Finally, the results of using both the video and audio data from the SumMe dataset to train combined models are shown in Figures 5.30 and 5.31.

Figure 5.30: Combined Accuracy Results

Figure 5.31: Combined Loss Results

From the figures we see that the previous trends continue to present themselves in the combined models. Dense-5 is again the best performing model during training, reaching 86.7%. Dense-6 was the worst performing model at 73.6% in this case. The validation data is again a reversal of the training data, with Dense-6 having the highest accuracy at 74.0% and Dense-5 being the second lowest at 61.3%. The lowest in this case was Dense-1 at 60.1%.

The averages for the combined models are shown in Figure 5.32.

Figure 5.32: Combined Average Results

For the combined model average scores, the results are close to those of both the video and audio models, as expected. The combined accuracy reached a peak of 77.8% while training, which is higher than that of the audio or video models on their own. While validating, the final result is 74.0%, which matches the separate validation results.

5.2.2.4 SumMe Final Analysis

Overall, the SumMe models performed better than the TVSum models. All three sets of models were consistent and similar to each other, with the expected results of the combined model being slightly better while validating and the audio model being slightly worse. In all three cases, Dense-5 outperformed during the training stages while underperforming during the validation stages. Dense-2 and Dense-6 were consistent between training and validation in each case: although they performed the worst in the training phase, they performed the same or better during validation