5.3 Experiments
5.3.4 Discussion
| Category | Scene | PBAS | UMBS | DeepBS | 3D-LSTM | Vanilla | AtVE-Net |
|---|---|---|---|---|---|---|---|
| Baseline | Highway | 94.51 | 92.17 | 96.55 | 98.76 | 98.50 | 99.95 |
| | Office | 94.20 | 97.19 | 97.80 | 97.40 | 96.49 | 98.21 |
| | Pedestrians | 93.63 | 95.66 | 94.59 | 95.01 | 97.48 | 98.16 |
| | PETS2006 | 87.36 | 86.48 | 94.25 | 92.18 | 94.04 | 99.95 |
| Dynamic Background | Canoe | 71.96 | 93.45 | 97.94 | 75.36 | 88.88 | 99.92 |
| | Boats | 36.11 | 90.41 | 81.21 | 95.90 | 97.97 | 99.99 |
| | OverPass | 79.25 | 89.90 | 94.16 | 84.33 | 70.93 | 99.96 |
| | Fall | 87.14 | 56.68 | 82.94 | 34.05 | 38.55 | 98.72 |
| Camera Jitter | Boulevard | 66.02 | 86.72 | 86.23 | 90.23 | 95.13 | 99.86 |
| Shadow | CopyMachine | 87.27 | 87.11 | 95.34 | 77.71 | 94.56 | 97.40 |
| | PeopleInShade | 89.19 | 90.16 | 91.97 | 85.24 | 89.39 | 99.76 |
| | BusStation | 86.09 | 86.95 | 93.74 | 95.08 | 76.29 | 95.14 |
| PTZ Camera | TwoPosPTZCam | n/a | 79.59 | 87.04 | 88.70 | 94.21 | 97.40 |
| Low Framerate | Turnpike | n/a | 89.01 | 49.17 | 97.84 | 98.59 | 99.92 |
| Intermittent Object Motion | Sofa | 73.81 | 84.55 | 81.34 | 92.33 | 96.14 | 92.75 |
| Night Videos | TramStation | 82.43 | 88.56 | 47.54 | 85.61 | 87.63 | 90.46 |
| Overall | | 80.64 | 87.16 | 85.74 | 87.12 | 88.42 | 97.84 |

Table 5.2: F-measure performance comparison. In this table, we present the comparison on sixteen CDnet2014 scenes of our vanilla model and our AtVE-Net method against PBAS [Hofmann et al., 2012], UMBS [Sajid and Cheung, 2017], DeepBS [Babaee et al., 2018] and 3D-LSTM [Akilan et al., 2019]. n/a means that the result was not available. The scores of the PBAS, UMBS and DeepBS methods were acquired from [Akilan et al., 2019].

| Category | UMBS | DeepBS | 3D-LSTM | U-Net MDI | Vanilla | AtVE-Net |
|---|---|---|---|---|---|---|
| Baseline | 0.64 | 0.24 | 0.21 | 0.13 | 0.30 | 0.08 |
| Camera jitter | 0.30 | 0.89 | 3.03 | 3.54 | 0.28 | 0.03 |
| Dynamic Background | 1.17 | 0.20 | 0.35 | 0.28 | 1.28 | 0.01 |
| Intermittent Object Motion | 2.63 | 4.12 | 1.15 | 3.07 | 2.44 | 0.51 |
| Shadow | 1.78 | 0.74 | 1.10 | 0.64 | 1.10 | 0.19 |
| Low Frame Rate | 0.92 | 1.35 | 0.34 | 0.90 | 0.46 | 0.01 |
| Night Videos | 3.97 | 2.57 | 0.90 | 0.56 | 0.84 | 0.67 |
| PTZ | 0.51 | 7.72 | 0.98 | n/a | 0.77 | 0.1 |
| Overall | 1.49 | 1.99 | 1.01 | 1.30 | 0.93 | 0.2 |

Table 5.3: Performance comparison with PWC. In this table, we present the comparison on eight CDnet2014 categories of our vanilla model and our AtVE-Net method against UMBS [Sajid and Cheung, 2017], DeepBS [Babaee et al., 2018], 3D-LSTM [Akilan et al., 2019] and U-Net MDI [Kim and Ha, 2021]. n/a means that the result was not available. The scores of the PBAS, UMBS and DeepBS methods were acquired from [Akilan et al., 2019].

| Category | One module | Two modules | Three modules |
|---|---|---|---|
| Baseline | 94.18 | 95.16 | 99.07 |
| Camera jitter | 91.45 | 93.71 | 99.86 |
| Dynamic Background | 72.14 | 75.43 | 99.56 |
| Intermittent Object Motion | 92.46 | 94.83 | 92.75 |
| Shadow | 79.24 | 85.85 | 97.43 |
| Low Frame Rate | 98.62 | 99.01 | 99.92 |
| Night Videos | 89.47 | 91.9 | 90.50 |
| PTZ | 95.80 | 96.79 | 97.40 |
| Overall | 89.17 | 91.58 | 97.06 |

Table 5.4: F-measure performance comparison with one, two and three spatial attention modules in our AtVE-Net architecture. These experiments were performed on eight CDnet2014 categories using the spatial attention modules from the third skip-connection to the first skip-connection.

| Category | One module | Two modules | Three modules |
|---|---|---|---|
| Baseline | 0.34 | 0.27 | 0.08 |
| Camera jitter | 3.42 | 1.68 | 0.03 |
| Dynamic Background | 1.41 | 0.30 | 0.01 |
| Intermittent Object Motion | 3.25 | 0.40 | 0.51 |
| Shadow | 1.82 | 1.02 | 0.19 |
| Low Frame Rate | 1.32 | 0.14 | 0.01 |
| Night Videos | 1.89 | 0.61 | 0.67 |
| PTZ | 0.77 | 0.15 | 0.1 |
| Overall | 1.95 | 0.57 | 0.2 |

Table 5.5: PWC performance comparison with one, two and three spatial attention modules in our AtVE-Net architecture. These experiments were performed on eight CDnet2014 categories using the spatial attention modules from the third skip-connection to the first skip-connection.

We trained our models without any selection of the training and test frames, to stay closer to a real environment, unlike methods such as [Babaee et al., 2018, Lim and Keles, 2018], which use randomly selected frames. As a result, our vanilla model exhibits a drastic drop in performance in the Fall scene, with an F-measure of 38.55 (Table 5.2). This is because the foreground objects in this scene, such as people walking, are too small, which results in a poor output segmentation. AtVE-Net overcomes this problem by adding spatial attention modules on the last three skip-connections of the model and obtains an F-measure of 98.72, outperforming our benchmark models.
Vanilla AtVE-Net performs better than other models that use a background model methodology [Hofmann et al., 2012, Sajid and Cheung, 2017, Babaee et al., 2018] because those methods are sensitive to error propagation from the first background model estimations. Note that we use a video encoding as context, which is computed once and provides a high-level abstraction of the entire video sequence, leaving out small video changes or foreground objects that contain few pixels. In the feature attention module, this context provides the regularities, or features, of the static background in the video sequence; hence, the network has additional information to segment foreground objects.
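As a reminder of how the two metrics reported in Tables 5.2 to 5.5 are defined, the following minimal sketch computes the F-measure and the percentage of wrong classifications (PWC) from a pair of binary masks. It uses the usual definitions (F-measure as the harmonic mean of precision and recall, PWC as the share of misclassified pixels); the function names and the toy masks are illustrative assumptions and not part of our evaluation code. The tables appear to report the F-measure on a 0 to 100 scale, so the sketch multiplies it by 100 when printing.

```python
def confusion_counts(prediction, ground_truth):
    """Count TP, FP, FN, TN over two equally sized binary masks (1 = foreground)."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(prediction, ground_truth):
        for pred, truth in zip(pred_row, truth_row):
            if pred and truth:
                tp += 1
            elif pred and not truth:
                fp += 1
            elif truth:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, in [0, 1]."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pwc(tp, fp, fn, tn):
    """Percentage of wrong classifications: 100 * (FP + FN) / all pixels."""
    total = tp + fp + fn + tn
    return 100.0 * (fp + fn) / total if total else 0.0


# Toy 2x4 masks (hypothetical data, not taken from CDnet2014).
prediction = [[1, 1, 0, 0], [0, 1, 0, 0]]
ground_truth = [[1, 0, 0, 0], [0, 1, 1, 0]]

tp, fp, fn, tn = confusion_counts(prediction, ground_truth)
print(f"F-measure: {100.0 * f_measure(tp, fp, fn):.2f}")  # F-measure: 66.67
print(f"PWC: {pwc(tp, fp, fn, tn):.2f}")                  # PWC: 25.00
```

A higher F-measure is better, whereas a lower PWC is better, which is why the best entries in Table 5.3 are the smallest ones.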
Ablation study
The comparison of the number of spatial attention modules, shown in Tables 5.4 and 5.5, indicates that using three of these modules yields the best overall performance in terms of F-measure and PWC. Even one or two spatial attention modules present a significant improvement over our vanilla model, which only uses the feature attention module on the high-level features coming from the encoder. These evaluations validate the correct performance of spatial attention on the mid-low features coming from the skip-connections. Note that, due to memory limitations, we only tested up to three spatial attention modules, placed on the last three skip-connections of our model. However, we hypothesize that leaving the last skip-connection without a spatial attention module leads to better performance because it avoids degradation problems in the resulting segmentation.
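The per-category picture behind this comparison can be checked directly from the values in Table 5.4. The sketch below is only an illustrative reading of that table (the helper names are our own assumptions): it finds, for each category, the number of modules with the highest F-measure, and recomputes the unweighted mean of each column, which in this table coincides with the Overall row to within rounding.

```python
# F-measure per category, copied from Table 5.4: (one, two, three) modules.
F_MEASURE = {
    "Baseline": (94.18, 95.16, 99.07),
    "Camera jitter": (91.45, 93.71, 99.86),
    "Dynamic Background": (72.14, 75.43, 99.56),
    "Intermittent Object Motion": (92.46, 94.83, 92.75),
    "Shadow": (79.24, 85.85, 97.43),
    "Low Frame Rate": (98.62, 99.01, 99.92),
    "Night Videos": (89.47, 91.9, 90.50),
    "PTZ": (95.80, 96.79, 97.40),
}


def best_module_count(scores):
    """Return the number of modules (1-based) with the highest F-measure."""
    return max(range(len(scores)), key=lambda i: scores[i]) + 1


def category_means(table):
    """Unweighted mean of each column over all categories."""
    return [sum(column) / len(column) for column in zip(*table.values())]


best = {name: best_module_count(scores) for name, scores in F_MEASURE.items()}
means = category_means(F_MEASURE)
# Categories where three modules are not the best configuration.
exceptions = sorted(name for name, count in best.items() if count != 3)

print(exceptions)
print([round(value, 2) for value in means])
```

Three modules give the best mean, but the two categories listed in `exceptions` (intermittent object motion and night videos) score slightly higher with two modules.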
Fig. 5.2 presents the attention maps obtained by the spatial attention module in AtVE-Net. These attention maps delineate the foreground objects very closely to the ground-truth. The addition of spatial attention modules allows our network to distinguish large and small foreground objects, which is a significant advantage compared to using the feature attention module alone. The validation of the attention maps obtained from the spatial attention module demonstrates the relevance of these modules in our model.
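To make the role of an attention map on a skip-connection more concrete, the following sketch shows one common way such a map can reweight feature maps: element-wise gating, where a single-channel map with values in [0, 1] multiplies every channel. This gating rule, the function name and the toy values are assumptions made for illustration; the exact operation used inside the AtVE-Net spatial attention module may differ.

```python
def apply_spatial_attention(features, attention):
    """Gate every channel of `features` with a single-channel attention map.

    features:  list of C channels, each an H x W list of floats.
    attention: H x W list of weights in [0, 1] (assumed already normalized).
    """
    gated = []
    for channel in features:
        gated.append([
            [value * weight for value, weight in zip(feature_row, attention_row)]
            for feature_row, attention_row in zip(channel, attention)
        ])
    return gated


# Two 2x2 feature channels and an attention map that keeps the left column
# (fully in the first row, half in the second) and suppresses the right one.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [2.0, 2.0]],
]
attention = [[1.0, 0.0], [0.5, 0.0]]

gated = apply_spatial_attention(features, attention)
print(gated)
```

Locations with a weight close to zero are suppressed in every channel, so the decoder receives skip-connection features mostly from the regions the map highlights.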
Figure 5.1: Qualitative results of the proposed model. We divide our qualitative results into blocks of three images: the input frame (Input), the ground-truth (Ground-truth) and the foreground segmentation obtained from the AtVE-Net model (Output FG). Scenes shown: Highway, Office, Pedestrians, PETS2006, Canoe, Boats, Overpass, Fall, Boulevard, CopyMachine, PeopleInShade, BusStation, TwoPosPTZCam, Turnpike, Sofa and TramStation.
Figure 5.2: Attention maps obtained by the spatial attention module in AtVE-Net. For each scene we present the input frame, ground-truth, and the attention maps obtained from the spatial attention module in the third skip-connection of our AtVE-Net model. Note that we changed the black pixels to white pixels to have a better identification of the attention maps. Scenes shown: Boats, Sofa, TramStation, CopyMachine and Fall.
Chapter 6 Conclusions
We have proposed two methods based on deep learning for the foreground detection task: our vanilla and AtVE-Net models. Our vanilla model is based on a U-net architecture with an attention module that uses a video encoding of the entire video sequence as context for the foreground detection task. The attention module provides the U-net decoder with a representation of the common patterns in the video sequence. AtVE-Net enhances our vanilla model by adding spatial attention modules on the last three skip-connections. The spatial attention highlights mid-low encoding feature maps in order to attend to the irregularities of the scene. The novel combination of spatial and feature attention modules in AtVE-Net outperforms the evaluated models [Hofmann et al., 2012, Sajid and Cheung, 2017, Babaee et al., 2018, Akilan et al., 2019, Kim and Ha, 2021], obtaining the best overall F-measure and PWC scores in sixteen CDnet2014 scenes, where illumination changes, dynamic background, camera jitter and camouflage challenges are exhibited.
Our main contributions were: (i) A video encoding able to obtain the features from the video sequence, replacing the usual background model approach. (ii) A feature attention module to detect irregularities by comparing the encoding features of the current input and the video encoding. (iii) A spatial attention module that uses the attention maps obtained from the feature attention module to highlight irregularities in the mid-low layers of the network.
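The idea behind contribution (ii), comparing the encoding of the current input with the video encoding, can be illustrated with a simple stand-in. The sketch below scores each spatial location by one minus the cosine similarity between its feature vector and the context vector at the same location. Cosine similarity is a technique we swap in here purely for illustration; it is an assumption, not the comparison learned by our feature attention module, and all names and values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def irregularity_map(frame_features, video_encoding):
    """Score each location by how much it deviates from the video context.

    Both arguments are H x W grids of feature vectors of the same length.
    Scores near 0 mean "matches the context"; larger scores mean "irregular".
    """
    return [
        [1.0 - cosine_similarity(feature, context)
         for feature, context in zip(feature_row, context_row)]
        for feature_row, context_row in zip(frame_features, video_encoding)
    ]


# One row with two locations: the first matches the context, the second does not.
frame_features = [[[1.0, 0.0], [0.0, 1.0]]]
video_encoding = [[[1.0, 0.0], [1.0, 0.0]]]

scores = irregularity_map(frame_features, video_encoding)
print(scores)  # [[0.0, 1.0]]
```

In this toy case the second location receives the highest score because its features are orthogonal to the context, which is the kind of deviation a foreground object would be expected to produce.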
As mentioned in our hypothesis, the inclusion of attention modules and a video encoding as context improved our model; therefore, AtVE-Net outperforms our benchmark models in terms of F-measure and PWC. Our vanilla model demonstrates that our feature attention module, using a video encoding as context, can model the scene in a latent space, and that these features can be used to compare irregularities in the scene. In addition, the analysis performed on the spatial attention modules validates their correct operation and the relevance of the attention maps to the performance of AtVE-Net.
One of the main difficulties when training and testing our methods was the imbalance of the CDnet2014 scenes, which shows up in two ways: (i) CDnet2014 has many frames with a void label, which means that those frames are labeled neither foreground nor background, leading to poor training and/or testing performance. We overcome this by excluding the unlabeled pixels from both the training loss and the test metrics. (ii) There are many consecutive background-only frames, which leads to training or test sets that consist solely of background frames. We remove many of these background frames so that the training and test sets contain foreground objects.
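Both workarounds can be sketched in a few lines. The example below assumes a binary cross-entropy loss and a hypothetical `VOID` marker for unlabeled pixels; neither the loss function nor the marker value is taken from our implementation. It also drops every background-only frame, which is a simplification of the procedure described above, where many (not necessarily all) of those frames are removed.

```python
import math

VOID = -1  # hypothetical marker for unlabeled pixels (assumption)


def masked_bce(probabilities, labels, void=VOID, eps=1e-7):
    """Mean binary cross-entropy over labeled pixels only; void pixels are skipped."""
    total = 0.0
    count = 0
    for prob_row, label_row in zip(probabilities, labels):
        for prob, label in zip(prob_row, label_row):
            if label == void:
                continue  # unlabeled pixel: contributes nothing to the loss
            prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
            total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
            count += 1
    return total / count if count else 0.0


def has_foreground(labels):
    """True if at least one pixel is labeled foreground (1)."""
    return any(label == 1 for row in labels for label in row)


def drop_background_only(frames):
    """Keep only the (image, labels) pairs that contain foreground pixels."""
    return [(image, labels) for image, labels in frames if has_foreground(labels)]


labels = [[1, 0, VOID]]
probabilities = [[0.5, 0.5, 0.99]]
loss = masked_bce(probabilities, labels)  # the void pixel is ignored

frames = [("a", [[0, 0]]), ("b", [[0, 1]]), ("c", [[VOID, 0]])]
kept = drop_background_only(frames)
print(loss, [name for name, _ in kept])
```

The same mask can be applied when counting true and false positives and negatives, so that void pixels affect neither the loss nor the F-measure and PWC.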
AtVE-Net achieves a fairly high F-measure score; however, in Fig. 5.1, the frames with red borders, which do not contain foreground objects, present multiple false positives in the output segmentation. Furthermore, the intermittent object motion and night videos categories perform slightly worse in terms of PWC than the other categories. This is because our model has a drawback when there are no foreground objects: it detects false positives in dynamic background areas.
As future work, we are going to evaluate the generalization of AtVE-Net by training a model with all scenes, as in [Patil et al., 2020, Patil et al., 2021, Akilan and Wu, 2019]. Also, as in [Tompson et al., 2015], we are going to add more dropout layers to our network to improve generalization performance by preventing activations from becoming strongly correlated, which leads to overfitting. In addition, we are going to add a spatial-median filter to our model as a post-processing step, as in [Babaee et al., 2018], in order to eliminate false positives. A spatial-median filter returns, for each pixel in an image, the median over a neighborhood of a given size (the kernel size); as a consequence, the operation removes outliers in the segmentation map. Finally, we consider doing data augmentation as in [Tezcan et al., 2019], which improves their performance under varying illumination.
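The spatial-median post-processing step described above can be sketched as follows. This is a minimal pure-Python illustration, not the filter of [Babaee et al., 2018]: the function name, the default kernel size of 3 and the border handling (replicating the nearest edge pixel) are assumptions made for the example.

```python
from statistics import median


def median_filter(mask, kernel_size=3):
    """Replace each pixel by the median of its kernel_size x kernel_size neighborhood.

    Borders are handled by replicating the nearest edge pixel (assumption),
    so every window holds an odd number of values and the median is always
    one of the input values.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    radius = kernel_size // 2
    height = len(mask)
    width = len(mask[0]) if height else 0
    filtered = []
    for row in range(height):
        filtered_row = []
        for col in range(width):
            window = [
                mask[min(max(r, 0), height - 1)][min(max(c, 0), width - 1)]
                for r in range(row - radius, row + radius + 1)
                for c in range(col - radius, col + radius + 1)
            ]
            filtered_row.append(median(window))
        filtered.append(filtered_row)
    return filtered


# A single isolated foreground pixel (a typical false positive) is removed.
noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
cleaned = median_filter(noisy)

# The interior of a larger blob survives, although its corners are eroded.
blob = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
smoothed = median_filter(blob)
print(cleaned[1][1], smoothed[2][2], smoothed[1][1])  # 0 1 0
```

The second example shows the trade-off of this step: isolated false positives disappear, but the corners of genuine foreground regions are also rounded off, so the kernel size should stay small relative to the objects of interest.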
Bibliography
[Akilan and Wu, 2019] Akilan, T. and Wu, Q. J. (2019). sendec: An improved image to image cnn for foreground localization. IEEE Transactions on Intelligent Transportation Systems, 21(10):4435–4443.
[Akilan et al., 2019] Akilan, T., Wu, Q. J., Safaei, A., Huo, J., and Yang, Y. (2019). A 3d cnn-lstm-based image-to-image foreground segmentation. IEEE Transactions on Intelligent Transportation Systems.
[Ba et al., 2016] Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[Babaee et al., 2018] Babaee, M., Dinh, D. T., and Rigoll, G. (2018). A deep convolutional neural network for video sequence background subtraction. Pattern Recognition, 76:635–649.
[Benavides-Arce et al., 2022] Benavides-Arce, A. A., Flores-Benites, V., and Mora-Colque, R. (2022). Foreground detection using an attention module and a video encoding. In Sclaroff, S., Distante, C., Leo, M., Farinella, G. M., and Tombari, F., editors, Image Analysis and Processing – ICIAP 2022, pages 195–205, Cham. Springer International Publishing.
[Bennet et al., 2017] Bennet, M. A., Lokesh, S., SankaBabu, G., Lavanya, C., Deepa, D., and Srimarthiya, S. (2017). Performance analysis of foreground-adaptive background subtraction in grayscale video sequences. IIOAB JOURNAL, 8(2):99–104.
[Bishop, 1998] Bishop, C. (1998). Bayesian pca. Advances in neural information processing systems, 11.
[Bishop and Nasrabadi, 2006] Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recognition and machine learning, volume 4. Springer.
[Boufares et al., 2016] Boufares, O., Aloui, N., and Cherif, A. (2016). Adaptive threshold for background subtraction in moving object detection using stationary wavelet transforms 2d. Int J Adv Comput Sci Appl, 7(8):29–36.
[Chen et al., 2019] Chen, M., Li, Y., and Li, R. (2019). Research on neural machine translation model. In Journal of Physics: Conference Series, volume 1237, page 052020. IOP Publishing.
[Comaniciu, 2003] Comaniciu, D. (2003). An algorithm for data-driven bandwidth selection. IEEE Transactions on pattern analysis and machine intelligence, 25(2):281–288.
[Comaniciu and Meer, 2002] Comaniciu, D. and Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on pattern analysis and machine intelligence, 24(5):603–619.
[Cucchiara et al., 2003] Cucchiara, R., Grana, C., Piccardi, M., and Prati, A. (2003). Detecting moving objects, ghosts, and shadows in video streams. IEEE transactions on pattern analysis and machine intelligence, 25(10):1337–1342.
[Dosovitskiy et al., 2020] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[Elgammal et al., 2000] Elgammal, A., Harwood, D., and Davis, L. (2000). Non-parametric model for background subtraction. In European conference on computer vision, pages 751–767. Springer.
[Flores-Benites et al., 2021] Flores-Benites, V., Mugruza-Vassallo, C. A., and Mora-Colque, R. (2021). Tvanet: a spatial and feature-based attention model for self-driving car. In 2021 34th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pages 263–270. IEEE.
[Fratama et al., 2019] Fratama, R. R., Partiningsih, N. D. A., Rachmawanto, E. H., Sari, C. A., Andono, P. N., et al. (2019). Real-time multiple vehicle counter using background subtraction for traffic monitoring system. In 2019 International Seminar on Application for Technology of Information and Communication (iSemantic), pages 1–5. IEEE.
[Gao et al., 2018] Gao, Y., Cai, H., Zhang, X., Lan, L., and Luo, Z. (2018). Background subtraction via 3d convolutional neural networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 1271–1276. IEEE.
[Gers et al., 1999] Gers, F. A., Schmidhuber, J., and Cummins, F. (1999). Learning to forget: Continual prediction with lstm.
[Goodfellow et al., 2016] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
[Han et al., 2004] Han, B., Comaniciu, D., and Davis, L. (2004). Sequential kernel density approximation through mode propagation: Applications to background modeling. In proc. ACCV, volume 4, pages 818–823.
[Hanchinamani et al., 2016] Hanchinamani, S. R., Sarkar, S., and Bhairannawar, S. S. (2016). Design and implementation of high speed background subtraction algorithm for moving object detection. Procedia computer science, 93:367–374.
[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
[Hema, 2016] Hema, C. (2016). Hand gesture identification using preprocessing, background subtraction and segmentation techniques. International Journal of Applied Engineering Research, 11(5):3221–3228.
[Hofmann et al., 2012] Hofmann, M., Tiefenbacher, P., and Rigoll, G. (2012). Background segmentation with feedback: The pixel-based adaptive segmenter. In 2012 IEEE computer society conference on computer vision and pattern recognition workshops, pages 38–43. IEEE.
[Huynh-The et al., 2016] Huynh-The, T., Banos, O., Lee, S., Kang, B. H., Kim, E.-S., and Le-Tien, T. (2016). Nic: A robust background extraction algorithm for foreground detection in dynamic scenes. IEEE transactions on circuits and systems for video technology, 27(7):1478–1490.
[Kim and Ha, 2021] Kim, J.-Y. and Ha, J.-E. (2021). Foreground objects detection by u-net with multiple difference images. Applied Sciences, 11(4):1807.
[Ko et al., 2010] Ko, T., Soatto, S., and Estrin, D. (2010). Warping background subtraction. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1331–1338.
[Koller et al., 1994] Koller, D., Weber, J., Huang, T., Malik, J., Ogasawara, G., Rao, B., and Russell, S. (1994). Towards robust automatic traffic scene analysis in real-time. In Proceedings of 12th International Conference on Pattern Recognition, volume 1, pages 126–131. IEEE.
[Levin, 1990] Levin, E. (1990). A recurrent neural network: limitations and training. Neural Networks, 3(6):641–650.
[Liang and Liu, 2021] Liang, D. and Liu, X. (2021). Coarse-to-fine foreground segmentation based on co-occurrence pixel-block and spatio-temporal attention model. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 3807–3813. IEEE.
[Lim and Keles, 2018] Lim, L. A. and Keles, H. Y. (2018). Foreground segmentation using convolutional neural networks for multiscale feature encoding. Pattern Recognition Letters, 112:256–262.
[Lo and Velastin, 2001] Lo, B. P. L. and Velastin, S. (2001). Automatic congestion detection system for underground platforms. In Proceedings of 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing. ISIMP 2001 (IEEE Cat. No. 01EX489), pages 158–161. IEEE.
[Maddalena and Petrosino, 2008] Maddalena, L. and Petrosino, A. (2008). A self-organizing approach to background subtraction for visual surveillance applications. IEEE Transactions on image processing, 17(7):1168–1177.
[Mandal et al., 2018] Mandal, M., Saxena, P., Vipparthi, S. K., and Murala, S. (2018). Candid: Robust change dynamics and deterministic update policy for dynamic background subtraction. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 2468–2473. IEEE.
[McIvor, 2000] McIvor, A. M. (2000). Background subtraction techniques. Proc. of Image and Vision Computing, 4:3099–3104.
[McReynolds and Blythe, 2005] McReynolds, T. and Blythe, D. (2005). Advanced graphics programming using OpenGL. Elsevier.
[Moo Yi et al., 2013] Moo Yi, K., Yun, K., Wan Kim, S., Jin Chang, H., and Young Choi, J. (2013). Detection of moving objects with non-stationary cameras in 5.8 ms: Bringing motion detection to your mobile device. pages 27–34.
[Oktay et al., 2018] Oktay, O., Schlemper, J., Folgoc, L. L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N. Y., Kainz, B., et al. (2018). Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999.
[Oliver et al., 2000] Oliver, N. M., Rosario, B., and Pentland, A. P. (2000). A bayesian computer vision system for modeling human interactions. IEEE transactions on pattern analysis and machine intelligence, 22(8):831–843.
[Patil et al., 2020] Patil, P. W., Biradar, K. M., Dudhane, A., and Murala, S. (2020). An end-to-end edge aggregation network for moving object segmentation. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8149–8158.
[Patil et al., 2021] Patil, P. W., Dudhane, A., and Murala, S. (2021). Multi-frame recurrent adversarial network for moving object segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2302–2311.
[Perreault et al., 2020] Perreault, H., Bilodeau, G.-A., Saunier, N., and Héritier, M. (2020). Spotnet: Self-attention multi-task network for object detection. In 2020 17th Conference on Computer and Robot Vision (CRV), pages 230–237. IEEE.
[Piccardi, 2004] Piccardi, M. (2004). Background subtraction techniques: a review. In 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), volume 4, pages 3099–3104. IEEE.
[Power and Schoonees, 2002] Power, P. W. and Schoonees, J. A. (2002). Understanding background mixture models for foreground segmentation. In Proceedings image and vision computing New Zealand, volume 2002, pages 10–11.
[Ronneberger et al., 2015] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer.
[Russakovsky et al., 2015] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252.
[Sajid and Cheung, 2017] Sajid, H. and Cheung, S.-C. S. (2017). Universal multimode background subtraction. IEEE Transactions on Image Processing, 26(7):3249–3260.
[Sakkos et al., 2018] Sakkos, D., Liu, H., Han, J., and Shao, L. (2018). End-to-end video background subtraction with 3d convolutional neural networks. Multimedia Tools and Applications, 77(17):23023–23041.
[Sharma, 2015] Sharma, C. (2015). Analysis of percentage of wrong classification (pwc) and precision for different categories of videos on gpu. In 2015 IEEE International Advance Computing Conference (IACC), pages 95–99. IEEE.
[Sharma et al., 2017] Sharma, S., Sharma, S., and Athaiya, A. (2017). Activation functions in neural networks. towards data science, 6(12):310–316.
[Sheikh and Shah, 2005] Sheikh, Y. and Shah, M. (2005). Bayesian modeling of dynamic scenes for object detection. IEEE transactions on pattern analysis and machine intelligence, 27(11):1778–1792.
[Simonyan and Zisserman, 2014a] Simonyan, K. and Zisserman, A. (2014a). Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576.
[Simonyan and Zisserman, 2014b] Simonyan, K. and Zisserman, A. (2014b). Very deep convolutional networks for large-scale image recognition.
[Soydaner, 2022] Soydaner, D. (2022). Attention mechanism in neural networks: Where it comes and where it goes. arXiv preprint arXiv:2204.13154.
[Specht, 1990] Specht, D. F. (1990). Probabilistic neural networks. Neural networks, 3(1):109–118.
[St-Charles et al., 2014] St-Charles, P.-L., Bilodeau, G.-A., and Bergevin, R. (2014). Subsense: A universal change detection method with local adaptive sensitivity. IEEE Transactions on Image Processing, 24(1):359–373.
[Stauffer and Grimson, 1999] Stauffer, C. and Grimson, W. E. L. (1999). Adaptive background mixture models for real-time tracking. In Proceedings. 1999 IEEE computer society conference on computer vision and pattern recognition (Cat. No PR00149), volume 2, pages 246–252. IEEE.
[Tarafdar et al., 2019] Tarafdar, A., Roy, S., Mondal, A., Sen, R., and Adhikari, A. (2019). Image segmentation using background subtraction on colored images. In 2019 International Conference on Opto-Electronics and Applied Optics (Optronix), pages 1–4. IEEE.
[Tezcan et al., 2019] Tezcan, M. O., Konrad, J., and Ishwar, P. (2019). A fully-convolutional neural network for background subtraction of unseen videos.
[Tompson et al., 2015] Tompson, J., Goroshin, R., Jain, A., LeCun, Y., and Bregler, C. (2015). Efficient object localization using convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition, pages 648–656.
[Touvron et al., 2021] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR.
[Trinh et al., 2016] Trinh, T. T., Yoshihashi, R., Kawakami, R., Iida, M., and Naemura, T. (2016). Bird detection near wind turbines from high-resolution video using lstm networks. In World Wind Energy Conference (WWEC), volume 2, page 6.
[Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.