Movie recommender based on visual content analysis using deep learning techniques


Universidad Politécnica de Madrid
Máster Universitario en Ingeniería de Telecomunicación
Master's Thesis (Trabajo Fin de Máster)

Movie recommender based on visual content analysis using deep learning techniques

Lucía Castañeda González
2019


Máster Universitario en Ingeniería de Telecomunicación
Master's Thesis (Trabajo de Fin de Máster)

Title: Movie recommender based on visual content analysis using deep learning techniques.
Author: Lucía Castañeda González
Tutor: Alberto Belmonte Hernández
Supervisor (ponente): Federico Álvarez García
Department: Señales, Sistemas y Radiocomunicaciones (SSR)

Committee members
President:
Member:
Secretary:
Substitute:

The committee members named above agree to award the grade of: ..........

Madrid, ..... of ..... 2019.



Summary

Nowadays there is growing interest in the artificial intelligence sector and its varied applications, which make it possible to solve problems that are intuitive and nearly automatic for humans but very complicated for machines. One of these problems is the automatic recommendation of multimedia content. In this context, the proposed work tries to exploit Computer Vision and Deep Learning techniques for content analysis in video. Based on the intermediate information extracted, a recommendation engine will be developed that allows the inclusion of learning algorithms, using film trailers as base data.

This project is divided into two main parts. After obtaining the dataset of movie trailers, the first part consists of extracting features from the different trailers. For this purpose, computer vision techniques and deep learning architectures will be used. The set of algorithms ranges from computer vision tasks, such as the analysis of colour histograms and optical flow, to complex action analysis and object detectors based on Deep Learning algorithms.

The second part of the project is the recommender engine. For the recommender, different machine learning and deep learning methods will be put into practice in order to learn the correlations between data efficiently. This recommender will be trained using neural networks on the selected dataset. Three options will be implemented, each with a different architecture for the recommender engine: the first is a simple sequential neural network, the second an autoencoder and the third a double autoencoder. To compare the results of the three options, objective metrics (MSE, MAE, precision) and subjective metrics (polls) will be used.

The final output of the project is, given one input trailer, the ten best matches based only on the content analysis and the trained recommender.

Resumen

Today there is growing interest in the artificial intelligence sector and its varied applications, which make it possible to solve problems that are very intuitive and almost automatic for humans but very complicated for machines. One of these problems is the automatic recommendation of multimedia content. In this context, the proposed work seeks to exploit computer vision and deep learning techniques for the analysis of video content. Based on the extracted information, a recommendation engine will be developed that allows the inclusion of learning algorithms that use movie trailers as their data source.

This project is divided into two main parts. After obtaining the dataset of movie trailers, the first part of the project consists of extracting features from those trailers. For this purpose, computer vision techniques and deep learning architectures will be used. The set of algorithms ranges from image processing tasks, such as the analysis of colour histograms and optical flow, to complex analysis of actions or object detectors based on deep learning algorithms.

The second part of the project is the recommendation engine. For the recommender, different machine learning and deep learning methods will be put into practice to learn the correlations between the data efficiently. This recommender will be trained using neural networks on the selected dataset. Three options will be implemented with three different architectures for the recommendation engine. The first will be a simple sequential neural network, the second an autoencoder and the third a double autoencoder. To compare the results of the three options, objective metrics (MSE, MAE and precision) and subjective metrics (surveys) will be used.

The final result of the project provides, from an input trailer, the ten best matches based solely on the content analysis and the trained recommender.


Keywords

Machine learning, deep learning, recommender, neural network, autoencoder, image processing, computer vision, Python, TensorFlow, Keras, PyTorch.


Thanks to my family, for their unconditional support of a daughter who, when she told them about her master's thesis, seemed to be speaking Klingon. And to my tutor, for his inexhaustible help and for passing his enthusiasm on to me.


Index

1 Introduction and objectives
  1.1. Introduction
  1.2. Objectives
2 State of the art
  2.1. Recommendation systems
    2.1.1. Deep learning basics
    2.1.2. Deep learning and recommendation systems
  2.2. Deep learning and visual content based recommendation systems
    2.2.1. Computer vision
    2.2.2. Action recognition
    2.2.3. Object detector
3 Development
  3.1. Machine Learning and Deep Learning process chain
  3.2. Proposed architecture
  3.3. Feature extraction
    3.3.1. Dataset
    3.3.2. Features
  3.4. Embedding
  3.5. Distances
  3.6. Deep Learning Recommender System Architectures
    3.6.1. Deep Neural Network
    3.6.2. Autoencoder
    3.6.3. Double autoencoder
4 Results
  4.1. Feature extraction
    4.1.1. Action recognition
    4.1.2. RGB Histogram Feature
    4.1.3. Object detector
    4.1.4. Optical flow
    4.1.5. Joined Feature
  4.2. Embedding
    4.2.1. Embedding training
    4.2.2. Embedding prediction
    4.2.3. Comparison between using or not embedding
  4.3. Distances
    4.3.1. Euclidean distances
    4.3.2. Cosine distances
  4.4. Recommender Objective Evaluation Metrics
  4.5. Deep Neural Network Recommender
    4.5.1. Neuronal Network training
    4.5.2. Deep Neural Network prediction
  4.6. Autoencoder
    4.6.1. Autoencoder training
    4.6.2. Autoencoder prediction
  4.7. Double autoencoder
    4.7.1. Double autoencoder training
    4.7.2. Double autoencoder prediction
  4.8. Subjective comparison between solutions
    4.8.1. Surveys
5 Conclusions and future lines
  5.1. Conclusions
  5.2. Future lines
References
Appendices
A Ethical, social, economic and environmental aspects
  A.1 Introduction
  A.2 Description of relevant impacts related to the project
    A.2.1 Ethic impact
    A.2.2 Social impact
    A.2.3 Economic impact
    A.2.4 Environmental impact
  A.3 Conclusions
B Economic budget
C Survey results
  C.1 Euclidean distance recommendations
  C.2 Artificial Neural Network recommendations
  C.3 Autoencoder recommendations
  C.4 Double Autoencoder recommendations
D Survey template
E Detectable classes by object detector
F Detectable classes by the action recogniser

Index of figures

2.1. YouTube Machine
2.2. LRNC architecture [1]
2.3. 3D CNN example from [2]
2.4. Faster R-CNN architecture
2.5. YOLO working scheme [3]
2.6. SSD working scheme [4]
2.7. RetinaNet working scheme [5]
2.8. Mask R-CNN working scheme [6]
3.1. Proposed architecture
3.2. Gradient descent function
3.3. Classification overfitting
3.4. Classification underfitting
3.5. Classification compromise between underfitting and overfitting
3.6. Proposed architecture
3.7. Multi-genres distribution
3.8. Action recognition Training
3.9. ResNet50
3.10. Action recognition prediction
3.11. Histogram process chain
3.12. Action film colour histogram
3.13. Action film colour histogram
3.14. Object detector training
3.15. YOLO architecture [3]
3.16. Object detection architectures comparison
3.17. Object prediction
3.18. Object prediction example
3.19. Optical flow extraction process
3.20. Embedding training
3.21. Embedding prediction
3.22. Artificial neuronal network
3.23. Autoencoder
3.24. Double autoencoder
4.1. Example outside image
4.2. Example inside image
4.3. Outside example results
4.4. Inside example results
4.5. Example day image
4.6. Example night image
4.7. Day example results
4.8. Night example results
4.9. Example mountain image
4.10. Example sea image
4.11. Mountain example results
4.12. Sea example results
4.13. "Batman & Robin" histogram results
4.14. "Someone Marry Barry" histogram results
4.15. "17 again" histogram results
4.16. "Night at the Museum" histogram results
4.17. "A resurrection" histogram results
4.18. "Say It Is not So" histogram results
4.19. "It's Complicated" histogram results
4.20. Animals detection
4.21. Vehicle detection
4.22. Sport equipment detection
4.23. Weapon detection
4.24. Not all objects detected
4.25. Wrong detection
4.26. Dark place person
4.27. Cartoon person
4.28. Blurry image person
4.29. Semi-transparent person
4.30. Person detection
4.31. Not all objects detected
4.32. Human face detection
4.33. Burning car
4.34. Boat
4.35. Sci-Fi ship
4.36. "Harry Potter" clothes detection
4.37. "Star Wars" clothes detection
4.38. Object detections in cartoons
4.39. Dancing optical flow representation
4.40. Dancing optical flow HSV representation
4.41. Talking optical flow
4.42. Talking optical flow HSV representation
4.43. Fighting optical flow representation
4.44. Fighting optical flow HSV representation
4.45. Join feature with PCA scatter
4.46. Join feature with two dimensional TSNE
4.47. Embedding loss
4.48. Embedding accuracy
4.49. Embedding feature representations
4.50. Without embedding
4.51. Embedded
4.52. PCA action representation with 1 (red), 2 (yellow) and 3 (green) for action genre
4.53. Without embedding
4.54. Embedded
4.55. TSNE action representation with 1 (red), 2 (yellow) and 3 (green)
4.56. Without embedding
4.57. Embedded
4.58. PCA three genres representation action (red), science-fiction (yellow) and horror (green)
4.59. Without embedding
4.60. Embedded
4.61. TSNE three genres representation adventure (red), crime (yellow) and thriller (green)
4.62. Evolution along epochs
4.63. Evolution from epoch 100000
4.64. Neuronal Network RMSE
4.65. Evolution along epochs
4.66. 100000-300000 epochs
4.67. Autoencoder RMSE
4.68. Evolution along epochs
4.69. 100000-300000 epochs
4.70. Double autoencoder, first autoencoder RMSE
4.71. Evolution along epochs
4.72. 100000-300000 epochs
4.73. Double autoencoder, second autoencoder RMSE
B.1 TFM budget
D.1 Survey Template


Glossary

ML - Machine Learning
DL - Deep Learning
CV - Computer Vision
RGB - Red, Green, Blue
HSV - Hue, Saturation, Value
NMS - Non-Maximum Suppression
NN - Neural Network
ANN - Artificial Neural Network
CNN - Convolutional Neural Network
RPN - Region Proposal Network
LR - Learning Rate
SGD - Stochastic Gradient Descent
ReLU - Rectified Linear Unit
ResNet - Residual Neural Network
Faster R-CNN - Faster Region Based Convolutional Neural Network
YOLO - You Only Look Once
KNN - K-Nearest-Neighbours
GMM - Gaussian Mixture Models
PCA - Principal Component Analysis
TSNE - t-Distributed Stochastic Neighbour Embedding


Chapter 1. Introduction and objectives

1.1. Introduction

Nowadays, multimedia content recommenders are in high demand. Recommenders bring many advantages to on-demand multimedia services: they directly influence how users rate the service and, therefore, whether they stay with it or keep purchasing from it. The most innovative recommenders are based on deep learning, a technology that has boomed in recent years and is key to the future of Artificial Intelligence and Big Data. This project uses these technologies, together with image processing, computer vision and machine/deep learning techniques, to create a film recommender.

This work describes the development and results of a movie recommender based on visual content analysis using deep learning techniques. The work is divided into three large blocks: content extraction, an embedding and the artificial neural networks. Each block yields a solution that can be applied in different areas.

The content extraction block generates information from a movie. This information can be used for multiple purposes. In this case it is used to recommend, but it could also be used to classify the content or to find peculiarities of the films. In essence, feature extraction creates a database with visual information about the movies.

The embedding block trains a model that allows two things. The first, which is the one used in this project, is to project the extracted content into another subspace that makes the subsequent training of the recommender easier. The second is to act as a classifier of film genres.

Finally, the block of artificial neural networks generates the recommendation. In this block, three different deep learning architectures have been tested to find the best way to recommend a movie from a dataset. Each network can be trained on any movie dataset, provided the data first goes through the previous blocks. In addition, it not only generates a recommendation but also indicates how strongly each film in the dataset is recommended with respect to the film for which the recommendation is sought.

1.2. Objectives

The main objectives of this work are:

• Learn to use deep learning techniques and programming tools.
• Create a method for extracting visual content-based features from a set of movies.
• Use the concept of embedding to project the data into a different subspace.
• Build a movie recommender using different deep learning architectures with the same purpose.
• Evaluate the results with objective and subjective metrics.

The work covers different types of techniques, from computer vision tools to state-of-the-art deep learning techniques, to extract hidden knowledge from the films. Different deep learning architectures of increasing complexity have been tested. The results include both general metrics that measure the performance of the final trained algorithms and subjective tests carried out with real people to gauge how they perceive the proposed recommendations.
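As a rough illustration of the final output described in the introduction (a score for every film in the dataset relative to a query film, from which the best matches are taken), the following minimal sketch ranks films by Euclidean distance between feature vectors. The function names, film titles and three-dimensional vectors are hypothetical, and the plain distance is only a stand-in for the trained networks, which this sketch does not reproduce.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_films(query_title, features, top_k=10):
    """Return (title, distance) pairs for the other films, closest first."""
    query = features[query_title]
    scored = [
        (title, euclidean(query, vec))
        for title, vec in features.items()
        if title != query_title
    ]
    # A smaller distance means a stronger recommendation.
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


# Toy, made-up feature vectors; real ones would come from the extraction block.
toy_features = {
    "film_a": [0.9, 0.1, 0.3],
    "film_b": [0.8, 0.2, 0.4],
    "film_c": [0.1, 0.9, 0.7],
    "film_d": [0.2, 0.8, 0.6],
}

ranking = rank_films("film_a", toy_features, top_k=3)
for title, distance in ranking:
    print(f"{title}: {distance:.3f}")
```

With these toy vectors, `film_b` is ranked first for `film_a`, followed by `film_d` and `film_c`. Every film receives a distance, so the same list also expresses how close each film in the dataset is to the query.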

Chapter 2. State of the art

2.1. Recommendation systems

Nowadays, for several reasons, including the increase in broadband Internet access and the proliferation of smartphones, multimedia content is increasing rapidly. As a result, both traffic and multimedia consumption on the network have grown exponentially in recent years. This rise is the key to the success of multimedia platforms such as YouTube, Netflix or Spotify. However, this rapid growth of multimedia information in our daily lives has created an information overload and greater complexity in decision making.

Therefore, given the large amount of multimedia content that exists, it is very important to filter it, with two main objectives. The first is to provide users with a specialised service that lets them easily access the content that interests them and gives them a better user experience. The second follows from the first: offering users content that matches their interests leads to greater consumption and, therefore, increases the benefits for the company.

A recommender system is a technology that filters content in order to improve access and proactively recommends relevant items to users by considering the content information and/or the users' preferences and behaviours. Implementing a recommender with machine learning requires algorithm-based technology.

The algorithms used for recommendation are usually divided into two categories, content-based methods and collaborative filtering methods, or a combination of both. Content-based methods do not involve other users; they only need the user's likes to find a recommendation. They analyse the characteristics of the content using techniques such as NLP, computer vision or audio processing. Once the content has been analysed, the recommender suggests multimedia material whose content is similar to the items the user has indicated they like. Collaborative filtering bases its recommendations on users' past behaviours, on the idea that similar users will have similar interests.

Nowadays several companies work exclusively on machine learning recommenders, such as Think Analytics [7], Gravity R&D [8] and Recombee [9]. Many companies dedicated to broadcasting multimedia content are also improving their recommendations with machine learning algorithms, such as Netflix [10] or Spotify [11]. This recommendation technique has come to be in high demand in recent years, since it has proved to be a powerful tool for satisfying user expectations.

The best algorithms and methods have been implemented by private companies, so their code has not been released. There are, however, some public datasets and code that are also interesting. Regarding open code, a lot has already been developed for each of the datasets mentioned. The GitHub repository of the LMTD dataset [12] contains two notebooks with examples of how to use the dataset. Apart from these dataset examples, a lot of open-source code can also be found on GitHub and Kaggle.
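The two families of algorithms described earlier in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any of the companies cited: item features, user names and ratings are made up, the content-based profile is taken as the mean feature vector of the liked items, and the collaborative score is a similarity-weighted sum of other users' ratings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def content_based(liked, item_features):
    """Rank unseen items by similarity to the mean feature vector of liked items."""
    dims = len(next(iter(item_features.values())))
    profile = [
        sum(item_features[i][d] for i in liked) / len(liked) for d in range(dims)
    ]
    candidates = [i for i in item_features if i not in liked]
    return sorted(
        candidates, key=lambda i: cosine(profile, item_features[i]), reverse=True
    )


def collaborative(user, ratings):
    """Rank items the user has not rated, weighting other users by similarity."""
    items = sorted({i for r in ratings.values() for i in r})

    def as_vector(u):
        return [ratings[u].get(i, 0.0) for i in items]

    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = cosine(as_vector(user), as_vector(other))
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: two content features per item, and user ratings.
item_features = {
    "space_war": [1.0, 0.0],
    "alien_raid": [0.9, 0.1],
    "love_story": [0.0, 1.0],
}
ratings = {
    "ana": {"space_war": 5.0},
    "ben": {"space_war": 5.0, "alien_raid": 4.0},
    "eva": {"love_story": 5.0, "quiet_drama": 4.0},
}

print(content_based(["space_war"], item_features))
print(collaborative("ana", ratings))
```

The content-based function never looks at other users, only at item features and the user's own likes. The collaborative function never looks at item features, only at the rating behaviour of users similar to the target user.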
For movie recommendation, however, complete projects are harder to find than for music recommendation, where many projects are available. Although open-source code is not abundant, there are two examples of the schemes followed by major video broadcasting companies: the YouTube and the Netflix mechanisms. The YouTube system [13] consists of two neural networks: one for candidate generation and one for ranking.

The first network, candidate generation, takes events from the user's YouTube activity history as input and reduces the whole corpus of videos to a small set of candidate videos. This first neural network provides a personalised recommendation per user through collaborative filtering. The ranking network scores each video according to an objective function that considers a series of parameters; this score determines which videos are presented to the user as the best recommendations.

Figure 2.1: YouTube Machine.

In the Netflix recommendation system [10], the scheme starts from the very first moment: when a user creates a Netflix account, or adds a new profile to an account, Netflix asks the user to choose some titles they like. These titles are used to start the recommendations and connect with the user's preferences. If the user skips this step, the first recommendations will be content that is popular and relevant among most Netflix users, and more personalised content will follow later.

In the second step, the Netflix recommendation system observes the user's interactions with the service, other members with similar tastes and preferences (collaborative filtering), and information about the titles, such as genre, actors, etc. In addition to knowing what the user has watched on Netflix, to best personalise the recommendations it also looks at things like the time of day the user watches, how long they watch and the devices on which they watch Netflix.

The system chooses which titles to include in the rows of its homepage; it also ranks each title within its row and then ranks the rows themselves (neural networks).

2.1.1. Deep learning basics

These days, deep learning provides state-of-the-art solutions in a wide range of fields. Artificial neural networks (ANN) were the starting point: they changed the way machine learning techniques learn by introducing non-linear layers that break the linearity between data. Several types of deep learning networks have since appeared and surpassed the results obtained with traditional techniques in areas such as feature extraction and selection, computer vision and time series analysis. Artificial neural networks are able to learn complex behaviours from feature vectors in a different way from traditional machine learning techniques. Convolutional Neural Networks (CNN) are one of the most important advances in computer vision because they automatically extract feature vectors learned during training, avoiding the need to select them manually. Finally, Recurrent Neural Networks (RNN) can work with time-dependent data, learning complex patterns to predict future behaviour.

2.1.1.1. Deep learning advantages and disadvantages

The most significant advantages and disadvantages of deep learning algorithms are presented below. Owing to the intrinsic complexity of this type of algorithm, some tasks can be performed efficiently while others suffer from the computational requirements. Deep learning is only one more tool: traditional algorithms can solve some specific problems efficiently without adding this complexity to the system. The advantages and disadvantages listed next concern deep learning techniques in real-world applications, in contrast to traditional techniques (computer vision, well-known algorithms that perform specific operations, among others).
Deep learning advantages:

• Non-linearity: Unlike classical models, which are basically linear, deep learning models are non-linear. Using non-linear activations (ReLU, sigmoid, tanh, etc.), deep neural networks are able to model non-linearity in data. With this property, deep learning algorithms can find complex and intricate interaction patterns in the data.

• Representation learning: Deep neural networks are effective at learning helpful representations of data. In the case of recommendations, a large amount of data is usually available about the relationship between items and users. Making use of these data expands the knowledge we have about the items and users, which improves the recommender. For this reason, using deep neural networks to learn representations is a good choice to improve recommendations. Representation learning presents two main advantages:
  – The difficulty of hand-crafted feature design decreases. Deep neural networks allow feature engineering to be treated as an automatic activity, with supervised or unsupervised approaches.
  – Representation learning makes it possible to include different kinds of content such as text, audio, images or video. Deep learning has been shown to improve when representations are learned from different sources.

• Sequential modelling: Sequential models can deal with the temporal dynamics of user behaviour with good results. Both RNN and CNN are deep learning techniques that can be applied to sequential modelling.

• Flexibility: In general, not only in the recommendation field, deep learning techniques are highly flexible, especially when working with the most popular frameworks such as TensorFlow, Keras, PyTorch, Theano, etc. These frameworks work in a modular way and are well supported by a very active community of users and professionals. Their modularity makes development efficient. One example is how easily different neural networks can be combined into hybrid models, or how easily one module can be replaced. This ease reduces the complexity of capturing different characteristics and factors simultaneously.

Deep learning disadvantages:

• Interpretability: One of the main problems of deep learning is that it acts as a black box. The hidden weights and activations are not interpretable, so the model provides no explanation of its predictions, which is a serious disadvantage. Nevertheless, models are now starting to offer some interpretability, which makes explainable recommendations possible.

• Hyperparameter tuning: Good results require a correct choice of hyperparameters. This choice is very complex, since there is no exact way to calculate them, there are many hyperparameters, and the tuning range is very large. This problem is not exclusive to deep learning; it already appears in machine learning, although deep learning normally adds more hyperparameters. Much research has been done to find the ideal selection of hyperparameter values, but an optimal solution has not yet been found. Other investigations aim to work with a single hyperparameter instead of several, to make tuning easier.

• Data need: Deep learning in general, not only for recommendations, is data hungry. That is, it needs data sets that are large enough to work correctly. On the other hand, in the field of recommendations there is plenty of data, so this problem is less of a concern.

2.1.2. Deep learning and recommendation systems

At this moment deep learning (DL) enjoys great popularity. In the last few decades its success has increased considerably in many domains, such as speech recognition or computer vision. Both academia and industry are constantly searching for ways to improve deep learning techniques and investigating how to apply them to a wider range of applications, where this discipline can help thanks to its ability to solve complex tasks. Recommendation architectures have changed drastically in recent times since deep learning has been applied to them. Deep learning provides more opportunities to enhance recommendation efficiency.
Interest in the latest advances in recommendation systems based on deep learning has increased considerably because they have overcome obstacles that conventional models were not able to solve, obtaining recommendations of great quality. For industry, a recommendation system is very important to improve the user experience, which promotes sales. Some interesting examples are the recommenders of Netflix and YouTube. In the case of Netflix, 80 percent of the movies that users watch come from recommendations. For YouTube, 60 percent of the videos that are clicked come from recommendations. For YouTube, paper [13] explains how a recommendation algorithm based on deep neural networks has been used for video recommendation. Paper [14] presents the Google Play recommender system, which uses a wide and deep model. The last example is the Yahoo News recommender, which uses a recommender system based on RNN, as paper [15] explains. All these models have shown an important improvement over traditional models. Another example of the rise of deep learning in recommendation systems is that, since 2016, RecSys, the leading international conference on recommender systems, has held a regular workshop on deep learning for recommender systems.

Deep learning is a subfield of machine learning that makes use of artificial neural networks. Deep learning learns deep representations, that is, multiple levels of abstraction and representation from data. The different deep learning algorithms are based on techniques such as Convolutional Neural Network, Recurrent Neural Network, Multilayer Perceptron, Autoencoder, Restricted Boltzmann Machine, Neural Autoregressive Distribution Estimation, Adversarial Networks, Attentional Models and deep reinforcement learning, among others.

Deep learning has many advantages for recommendations. One of the most interesting properties for this field is that deep neural networks are end-to-end differentiable and provide adequate inductive biases for the class of data. That is, if the data has some kind of inherent structure, deep neural networks are well suited to exploit it. In addition, for content-based recommendation, deep neural networks have the advantage of being composite. This means that multiple neural building blocks can be combined into a single differentiable function and trained end-to-end. For example, CNNs and RNNs are practically indispensable neural building blocks when working with textual or image data.

2.2. Deep learning and visual content based recommendation systems

In order to build a content based recommender in the visual field, we analyse the classification of visual concepts that must be performed. This classification is complicated due to the complexity of the concepts and the variability of their appearance. Paper [16] proposes, for example, objects, sites, scenes, personalities, events, or activities as visual concepts to analyse. Article [17] proposes a standardisation of these concepts to avoid a semantic gap, which the authors call a lexicon. It consists of grouping a series of general concepts into five categories: "who", "what", "where", "when" and "how". For each category, it proposes a definition: "who" corresponds to the number of people or animals that appear in the scene, "what" indicates the actions or events, "where" the location or places, "when" indicates whether it is day or night, and finally "how" gives information about shot sizes, since they strongly correlate with specific actions. Article [16] also indicates the importance of defining a minimum number of positive samples per concept.

Classic image recognition has generally considered a single concept per image ("single-label"). Nowadays there are other proposals, such as [16], in which "multi-label" classification is used by extending the CNN architecture with a sigmoid layer.

Next, the state of the art of the most interesting features to analyse for a content based recommendation system is detailed. Two of them use deep learning and the other two apply computer vision techniques.

2.2.1. Computer vision

Computer vision is a field that acquires, processes, analyses and tries to understand images or sequences of images. This discipline seeks to quantify and produce information from images that a computer is able to understand and deal with. A huge variety of techniques exists to acquire such information. Computer vision is closely linked with artificial intelligence, as the computer must interpret what it sees and then perform appropriate analysis or act accordingly.

There are important challenges in computer vision. Initially, it was believed to be a trivially simple problem that could be solved by a student connecting a camera to a computer. After decades of research, computer vision remains unsolved, at least in terms of meeting the capabilities of human vision.
One reason is that we don't have a strong grasp of how human vision works. Studying biological vision requires an understanding of the organs of perception, like the eyes, as well as of the interpretation of perception within the brain. Much progress has been made, both in charting the process and in discovering the tricks and shortcuts used by the system, although, like any study that involves the brain, there is a long way to go.

Another reason why it is such a challenging problem is the complexity inherent in the visual world. A given object may be seen from any orientation, in any lighting conditions, with any type of occlusion from other objects, and so on. A true vision system must be able to "see" in any of an infinite number of scenes and still extract something meaningful. Computers work well for tightly constrained problems, not open unbounded problems like visual perception.

Some examples of techniques used in computer vision are colour histograms, background extraction, optical flow, surface and shape estimation, and depth maps. All these techniques have been very useful to extract knowledge from images in order to perform other tasks such as classification, regression, or detection and recognition.

One of the most important parts is feature extraction. This task consists of extracting vectors that accurately represent different situations, scenes or parts of the images. These features, or image descriptors, can be obtained by applying several different techniques. For example, the Histogram of Oriented Gradients (HOG) is used to extract knowledge about the size and shape of objects in images. Local Binary Patterns (LBP) is a descriptor that is very useful to detect different textures. Several other techniques, such as keypoint extractors, exist in the literature (Harris detector, Sobel mask, FAST, SURF, BRIEF, ORB, among others). Combining all these techniques with machine and deep learning algorithms, more complex solutions can be proposed. Convolutional Neural Networks are currently replacing these techniques because of their ability to automatically extract rich feature vectors learned during training. Nevertheless, computer vision techniques remain a good choice in many works due to their availability, easy implementation and low computational cost.

Many tasks rely on computer vision as a fundamental part of their development.
Here are just a handful of them:

• Face recognition: Face-detection algorithms are applied and, in combination with filters, make it possible to recognise people in pictures.
• Image retrieval: Content-based queries are used to search for relevant images. The algorithms analyse the content of the query image and return results based on the best-matched content.
• Gaming and controls: Successful commercial gaming products that use stereo vision or other types of cameras exist.
• Surveillance: Surveillance cameras are ubiquitous in public locations and are used to detect suspicious behaviours.
• Biometrics: Fingerprint, iris and face matching remain common methods of biometric identification.
• Smart cars: Vision remains the main source of information to detect traffic signs, traffic lights and other visual features.

It may be helpful to zoom in on some of the simpler computer vision tasks that are of interest, given the vast number of digital images and videos publicly available in datasets. Many popular computer vision applications involve trying to recognise things in images, for example:

• Object Classification: What broad category of object is in this image?
• Object Identification: Which type of a given object is in this image?
• Object Verification: Is the object in the image?
• Object Detection: Where are the objects in the image?
• Object Landmark Detection: What are the key points for the object in the image?
• Object Segmentation: What pixels belong to the object in the image?
• Object Recognition: What objects are in this image and where are they?

Other common examples are related to information retrieval, for example, finding images similar to a given image or images that contain a given object.
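As an illustration of one of the classical descriptors mentioned in this section, the colour histogram, the following minimal sketch builds a normalised per-channel histogram and compares two frames by histogram intersection. It is only a sketch under stated assumptions: the function names, the number of bins and the choice of histogram intersection as similarity measure are illustrative and are not taken from the implementation of this work, and frames are represented as plain lists of (r, g, b) tuples instead of real images.

```python
def colour_histogram(pixels, bins=4):
    """Normalised per-channel histogram of a list of (r, g, b) pixels (0-255)."""
    hist = [[0] * bins for _ in range(3)]
    width = 256 // bins  # range of intensity values covered by each bin
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Clamp the index so that the value 255 always falls in the last bin.
            hist[channel][min(value // width, bins - 1)] += 1
    n = len(pixels)
    return [[count / n for count in channel] for channel in hist]


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical colour distributions."""
    overlap = sum(
        min(a, b)
        for ch1, ch2 in zip(h1, h2)
        for a, b in zip(ch1, ch2)
    )
    return overlap / len(h1)  # average over the three channels


# Hypothetical toy frames: one completely red, one completely blue.
red_frame = [(255, 0, 0)] * 4
blue_frame = [(0, 0, 255)] * 4
similarity = histogram_intersection(
    colour_histogram(red_frame), colour_histogram(blue_frame)
)
```

The two toy frames only share the (empty) green channel, so their similarity is low, while a frame compared with itself gives the maximum value.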

2.2.2. Action recognition

Action recognition is a complex task. It requires identifying the different actions that happen in a video clip, where an action may or may not extend throughout the entire video. The video also needs to be analysed as a whole and in context, not just frame by frame. The biggest challenges that an action recogniser must overcome are the following:

• Computational cost: Large architectures and probable overfitting.
• Long context: In order to recognise actions, it is necessary to capture a certain spatiotemporal context across the frames. An additional problem is that the movement of the camera has to be compensated.
• High complexity architectures: The architectures needed to capture spatiotemporal information are highly complex. They involve a series of parameters that are complicated to select and expensive to evaluate.
• Non-standardised datasets: There is a lack of standardisation in action datasets.

Current action recognition is based on two studies, [18] and [2]. In [18], multiple ways of fusing the temporal information of consecutive frames are tried, using pre-trained 2D convolutions. In [2], instead of using a single network, the architecture is separated into two networks: one of them, pre-trained, for the spatial context, and the other for the motion context.

The most recent studies build on these two works. These newer studies are LRCN [1], C3D [19], Conv3D & Attention [20], TwoStreamFusion [21], TSN [22], ActionVLAD [23], HiddenTwoStream [24], I3D [25] and T3D [26].

LRCN [1] applies LSTM networks after the convolutions over the images, training the entire architecture end-to-end. The use of LSTM networks is interesting for this type of data since an LSTM is a recurrent neural network with feedback connections.
In this way it processes each input separately while also treating the inputs as a sequence. The network architecture is presented in Figure 2.2, which shows the initial convolutional part that extracts features and the recurrent network that learns over time.

Figure 2.2: LRCN architecture [1].

C3D [19], Conv3D & Attention [20], I3D [25] and T3D [26] all use 3D convolutions. The use of 3D convolutions in action recognition is very widespread since this technique allows finding patterns in three-dimensional data. In the case of an action recogniser, these dimensions are time, height and width. The architecture of this kind of network is drawn in Figure 2.3.

Figure 2.3: 3D CNN example from [2].

TwoStreamFusion [21], TSN [22] and ActionVLAD [23] are modifications of the two stream architecture. With this architecture the input is processed by two different streams. The first one analyses only the frame (spatial stream net), and the second one analyses the frame in the context of a sequence of frames (temporal stream net). As in the case of 3D convolutions, this technique is very successful at recognising actions, since both the image and the movement along several images are analysed.

HiddenTwoStream [24] analyses the optical flow of the video. The optical flow can be used to measure the quantity of movement. When recognising activities, optical flow is useful to relate the amount of movement to the different activities.

Table 2.1 shows a comparative summary of all the studies described above with different configurations.

Network              Score   Score note
LRCN                 82.92   With flow and RGB inputs
                     71.1    Only with RGB
C3D                  82.3    C3D (1 net) + linear SVM
                     85.2    C3D (3 nets) + linear SVM
                     90.4    C3D (3 nets) + iDT + linear SVM
Conv3D & Attention   -       For video description prediction
Two Stream Fusion    92.5    TwoStreamFusion
                     94.2    TwoStreamFusion + iDT
TSN                  94.0    TSN (input RGB + Flow)
                     94.2    TSN (input RGB + Flow + Warped flow)
Action VLAD          92.7    ActionVLAD
                     93.6    ActionVLAD + iDT
Hidden Two Stream    89.8    Hidden Two Stream
                     92.5    Hidden Two Stream + TSN
I3D                  93.4    Two Stream I3D
                     98.0    Imagenet + Kinetics pre-training
T3D                  90.3    T3D
                     91.7    T3D + Transfer
                     93.2    T3D + TSN

Table 2.1: Action recognition state of the art comparison
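The two stream idea described in this section, a spatial stream and a temporal stream whose outputs are combined, can be illustrated with a small late-fusion sketch. This is a minimal example under stated assumptions: the class scores of both streams are invented, the action labels are hypothetical, and averaging the softmax outputs is only one simple fusion strategy; the cited architectures use more elaborate fusion schemes.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(spatial_scores, temporal_scores, temporal_weight=0.5):
    """Late fusion: weighted average of the class probabilities of both streams."""
    p_spatial = softmax(spatial_scores)
    p_temporal = softmax(temporal_scores)
    return [
        (1.0 - temporal_weight) * ps + temporal_weight * pt
        for ps, pt in zip(p_spatial, p_temporal)
    ]


# Hypothetical labels and raw scores for a single clip.
labels = ["walk", "run", "jump"]
spatial = [2.0, 1.0, 0.1]    # scores from the RGB frame (appearance)
temporal = [0.5, 3.0, 0.2]   # scores from the optical flow (motion)

fused = fuse_two_streams(spatial, temporal)
predicted = labels[fused.index(max(fused))]
```

In this toy case the appearance stream alone would favour the first label, but the motion stream is more confident about the second one, and the fused prediction follows it.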

2.2.3. Object detector

Object detection is an area that is improving very quickly. The most important reason for this improvement is the application of deep learning to object detection. Each year new algorithms appear that considerably improve on the previous ones. There are many highly efficient object detection algorithms. In addition, many of them are available pre-trained on well-known datasets, so it is not necessary to train them to start detecting objects.

The most interesting of the best-known models are detailed below. They are listed in order of effectiveness, from the least to the most effective (usually the newest ones).

• R-CNN (Region-based Convolutional Neural Networks) [27]. The method starts from a given image, from which a series of regions of interest are generated. For each region a neural network extracts features, and each region is classified according to a series of classes. Among the disadvantages of R-CNN, the computational cost of training is worth mentioning.

• Fast R-CNN emerges as a direct improvement of R-CNN. In article [28] Ross Girshick describes the disadvantages of R-CNN and proposes a new methodology to reduce them. Fast R-CNN performs training in a single stage, improving detection rates. Its biggest disadvantage is the very high cost of generating regions of interest.

• Faster R-CNN was created to mitigate the cost of generating regions of interest in Fast R-CNN [29] [30]. It provides regions of interest and classification results simultaneously. The Faster R-CNN architecture uses the Region Proposal Network (RPN). The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The detection network shares full-image convolutional features with the RPN, which yields nearly cost-free region proposals. RPN networks are trained end-to-end to generate high quality region proposals, which are then used by Fast R-CNN for detection. RPN and Fast R-CNN can also be trained to share convolutional features. The architecture of Faster R-CNN can be found in Figure 2.4.

Figure 2.4: Faster R-CNN architecture

• OHEM (Online Hard Example Mining) is an algorithm for training region-based ConvNet detectors [31]. It emerged as a proposal to solve the great imbalance between the number of annotated objects and the number of background examples. Shrivastava proposes an online mining algorithm for the automatic selection of hard examples. This method increases the effectiveness and efficiency of training.

• YOLO v1 (You Only Look Once) is Redmon's proposal for an object detector [3]. Previous proposals repurpose classifiers to perform detection. YOLO instead frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from the complete image. Since only one neural network is used, detection can be optimised end-to-end. Figure 2.5 presents the working methodology of YOLO. The architecture of YOLO consists of 24 convolutional layers and 2 fully connected layers. Its performance improves on the previous proposals. Images can be processed at 45 frames per second, which means they can be processed in real time at the proposed image sizes. There is another version with a smaller network, called Fast YOLO, that can process up to 155 frames per second at the cost of accuracy in the predicted bounding boxes. YOLO has a further advantage: it produces fewer false positives in the background.

Figure 2.5: YOLO working scheme [3]

• SSD (Single Shot MultiBox Detector) emerged as an improvement of YOLO, since YOLO had certain problems when detecting small objects in a group. This is due to the strong spatial constraints imposed on bounding box predictions. SSD is proposed in [4] to solve this problem. For a given feature map, SSD uses a set of default anchor boxes with different aspect ratios and scales, which discretises the output space of bounding boxes. In order to detect objects of different sizes, the network fuses the predictions of several feature maps of different sizes. The architecture of an SSD network can be found in Figure 2.6.

Figure 2.6: SSD working scheme [4]

• R-FCN (Region-based Fully Convolutional Networks) [32] is a fully convolutional network that attempts to improve the accuracy and efficiency of earlier region-based object detectors, such as Fast R-CNN and Faster R-CNN. While those detectors apply a costly per-region sub-network hundreds of times, this approach is fully convolutional, with practically all computation shared over the entire image. It uses position-sensitive score maps to address the dilemma between translation invariance in image classification and translation variance in object detection. It can naturally adopt fully convolutional image classifier backbones such as ResNet. The results show that it takes 170 ms to process an image, which is 2.5-20x faster than Faster R-CNN.

• YOLO v2 [33] is an improvement of YOLO v1. It introduces new strategies such as batch normalisation (now used on all convolutional layers), convolution with anchor boxes (removing all fully connected layers and using anchor boxes to predict bounding boxes), dimension clusters, direct location prediction and multi-scale training. A more exhaustive comparison of the YOLOv2 improvements can be found in [34].

• FPN (Feature Pyramid Network) [35] exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to create feature pyramids at marginal extra cost. A top-down architecture with lateral connections is built to obtain high-level semantic feature maps at all scales. FPN can work together with other object detection architectures, improving their results. It processes images at a speed of 5 FPS on a GPU.

• RetinaNet [5] is a single unified network composed of a backbone network and two task-specific subnets. The backbone, which is an independent convolutional network, computes a convolutional feature map over the complete input image. The first subnet performs classification on the output of the backbone, and the second subnet performs convolutional bounding box regression.
The proposed loss function (focal loss) improves training and the final estimated bounding boxes. Figure 2.7 shows the RetinaNet architecture.

Figure 2.7: RetinaNet working scheme [5]

• YOLO v3 [36] is the latest version of YOLO, which presents many small improvements. The network is somewhat larger than its predecessor, but more accurate. For example, on a 320x320 image YOLOv3 runs in 22 ms at 28.2 mAP, the same accuracy that SSD achieves under the same conditions, while SSD is much slower.

• Mask R-CNN [6] detects objects in an image efficiently while generating a high-quality segmentation mask for each instance. Mask R-CNN is an extension of Faster R-CNN: a branch is added to Faster R-CNN that predicts an object mask in parallel to the branch that recognises the bounding boxes. Mask R-CNN only adds a small overhead to Faster R-CNN. The image processing speed is 5 fps. The new branch added to Faster R-CNN to create the Mask R-CNN network is shown in Figure 2.8.

Figure 2.8: Mask R-CNN working scheme [6]

• RefineDet [37] is made up of two interconnected modules: the anchor refinement module and the object detection module. The first module filters out negative anchors, so that the search space for the classifier is reduced, and coarsely adjusts the locations and sizes of the anchors. This provides a better initialisation for the subsequent regressor. The second module takes the refined anchors from the first module as input to improve the regression and predict multi-class labels. The whole network is trained end-to-end thanks to the multi-task loss function.

A comparison of the technologies explained above can be found in reference [38].
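Several of the detectors in this list (SSD, YOLO v2, Faster R-CNN, RefineDet) rely on anchor or default boxes with different scales and aspect ratios. The following minimal sketch shows how such boxes can be generated for a square feature map. It is an illustration under stated assumptions: the grid size, the scale and the aspect ratios are arbitrary example values, coordinates are normalised to the range [0, 1], and boxes that cross the image border are not clipped.

```python
import math


def default_boxes(cx, cy, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Boxes (x_min, y_min, x_max, y_max) centred on (cx, cy), one per aspect ratio.

    Every box has the same area (scale ** 2); only its shape changes.
    """
    boxes = []
    for ratio in aspect_ratios:
        width = scale * math.sqrt(ratio)
        height = scale / math.sqrt(ratio)
        boxes.append((cx - width / 2, cy - height / 2,
                      cx + width / 2, cy + height / 2))
    return boxes


def grid_default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default boxes for every cell of a grid_size x grid_size feature map."""
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            # Centre of the cell in normalised image coordinates.
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            boxes.extend(default_boxes(cx, cy, scale, aspect_ratios))
    return boxes


# Example: a coarse 4x4 feature map with boxes covering 20% of the image side.
anchors = grid_default_boxes(grid_size=4, scale=0.2)
```

A detector then predicts, for every default box, class scores and small offsets that move the box onto the object. Using several feature maps with different grid sizes and scales is what allows objects of different sizes to be covered.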

Chapter 3. Development

This chapter describes all the steps taken to obtain the recommendation engine, starting with the description of the architecture of this project and continuing with a detailed description of each of the blocks involved in the system. Each block consists of a series of algorithms: the first blocks extract visual features, and the middle and last blocks adapt to and learn from the data. The Deep Learning paradigm is used in order to learn complex representations of the data and provide a more accurate output. The aim of the project is to provide, from a set of input features (extracted from a movie trailer), the ten best recommendations that exist in the database.

3.1. Machine Learning and Deep Learning process chain

When implementing a 'Machine Learning' or 'Deep Learning' project, the chain shown in Figure 3.1 is usually followed. In this project Supervised Learning has been applied because the categories/labels are available in the data used. In the last part of the project another type of Supervised Learning is applied, using regression optimisation to fit label values during training.

First, it is necessary to acquire data for the creation of three datasets: one for training ('training set'), another for model validation ('validation set') and another to test the model ('test set').
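The creation of the three datasets can be sketched as follows. This is only a minimal illustration: the 70/15/15 proportions, the fixed random seed and the trailer identifiers are assumptions made for the example and are not the values used in this project.

```python
import random


def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the samples and split them into training, validation and test sets."""
    shuffled = list(samples)               # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains goes to the test set
    return train, val, test


# Hypothetical identifiers standing in for the trailers of the dataset.
trailers = ["trailer_%03d" % i for i in range(100)]
train_set, val_set, test_set = split_dataset(trailers)
```

Shuffling before splitting avoids that the order in which the trailers were collected biases any of the three sets, and keeping the three sets disjoint is what makes the validation and test results meaningful.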

Figure 3.1: Proposed architecture

Secondly, it is sometimes advisable to pre-process the data before feeding it to the algorithm for training. In this project, pre-processing is very important, since features are extracted from the trailers and these features are the input to the proposed neural networks. These features also need a standardisation process.

The third step is to train the model. For this, samples of the training set are fed in batches in order to adjust the parameters that define the 'Machine Learning' and 'Deep Learning' models. The parameters are adjusted to minimise a cost/loss function that measures how well the model predicts the category of the input data compared with its true label.

To adjust the parameters, two steps are carried out [39]: 'forward propagation' and 'backward propagation'. The first step feeds the training samples through the model to calculate the output and compares it with the true value of the label; the difference between both values is the error. The second step propagates the error in the opposite direction, applying the backpropagation algorithm to calculate the parameter values that minimise the loss function. To do this, an optimisation algorithm calculates the slope at each point and takes steps proportional to the negative gradient, as shown in Figure 3.2. The gradient descent update can be seen in equation 3.1, where x_t is the parameter value at iteration t, f is the loss function and α is the step size (learning rate).

$$x_{t+1} = x_t - \alpha \cdot \frac{\partial f(x_t)}{\partial x} \tag{3.1}$$
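The update rule of equation 3.1 can be tried on a one-dimensional example. The sketch below is an illustration under stated assumptions: the quadratic loss, the starting point, the step size α = 0.1 and the number of steps are arbitrary choices for the example, not values used in this project.

```python
def gradient_descent(grad, x0, alpha=0.1, steps=100):
    """Repeatedly move x against the gradient with step size alpha."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)  # step proportional to the negative gradient
    return x


def loss(x):
    """Toy loss with a single minimum at x = 3."""
    return (x - 3.0) ** 2


def loss_gradient(x):
    """Derivative of the toy loss."""
    return 2.0 * (x - 3.0)


x_min = gradient_descent(loss_gradient, x0=0.0)
```

With this step size every iteration reduces the distance to the minimum by a constant factor, so x_min ends up very close to 3. A step size that is too large would make the iterates overshoot and diverge, which is why the learning rate is one of the hyperparameters that has to be tuned.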

(49) 23. Figure 3.2: Gradient descent function. Through the validation set some interesting metrics [40, 41] of the behaviour of the models can be obtained: confusion matrix, precision,recall, F1 score, accuracy, loss, etc. The precision metric measures the success of the algorithm, that is, the number of samples of a class that have been identified well from the total number of samples that have been classified as belonging to that class. The equation to calculate the precision percentage is presented in equation 3.2.. precision =. T rue positive T rue positive + F alse positive. (3.2). The recall metric measures meticulousness, that is, the number of samples of a class that a certain algorithm has been able to identify from the total number of samples of that class. The equation to calculate the recall percentage can be found in equation 3.3. recall =. T rue positive T rue positive + F alse negative. (3.3). An algorithm with proper functioning is one that finds a balance between recall and precision, that is, it detects all the samples of a class, but it is not wrong with other classes. The F1 score metric is a harmonic mean between precision and recall, which aims with a single value to provide an intuition of the functioning of the algorithms showing that balance between recall and precision that must exist. As its formula shows in equation 3.4, a high.

(50) 24. value of precision is not desirable if it is linked to a low value of recall (and vice versa), ideally it is a value as high as possible of both metrics.. F1 =. 1 recall. 2 +. 1 recall. =2·. precision × recall precision + recall. (3.4). Finally, other metrics used to evaluate the model are the accuracy and the loss. The accuracy is the fraction of predictions that the model made correctly with respect to the total and the loss is the sum of the errors committed in the training and validation set. The formula of accuracy can be formulated as shown in equation 3.5. Where tp is true positive, tn is true negative, f p is false positive and f ’ is false negative.. Accuracy =. tp + tn tp + tn + f p + f n. (3.5). These metrics are very important in order to detect a very common phenomenon in MachineLearning and Deep-Learning, the overfitting, which occurs when the built model is excessively complex and captures the noise of the information instead of the trend of it. This causes that it is not sufficiently generalised and with new information it will behave in an inappropriate way and will classify the samples erroneously with high probability. An example of this behaviour can be seen in Figure 3.3.. Figure 3.3: Classification overfitting A very simple way to identify this phenomenon, as well as using the previous metrics, is to.

compare the behaviour of the model in terms of accuracy (in Machine / Deep Learning) and loss (Deep Learning) on the training set and on the validation set, hence the importance of differentiating between both datasets. If the loss on the training set is very low, that is, the model makes very few errors on known information, while the loss on the validation set is very high, that is, it makes many errors on new information, the model is experiencing overfitting. A compromise must be reached between both errors. The loss curves should decrease in a similar way in both sets over the epochs, while the accuracy curves should grow in a similar way in both sets.

To prevent overfitting, it is advisable to reduce the complexity of the model or to modify the value or number of its parameters. For example, in a Deep Learning neural network this would mean reducing the number of hidden layers or hidden neurons. Another aspect to take into account when avoiding overfitting is that the size of the dataset needed to train an algorithm grows exponentially with the size of the model. This means that more complex models require more samples to work correctly. As it is sometimes expensive to obtain that much training data, it is necessary to simplify the models.

The opposite phenomenon, underfitting, shown in Figure 3.4, can also occur. It arises when the model is too simple and is not able to capture the trend that the data follow; therefore, it will behave poorly on both the training set and the validation set.

Figure 3.4: Classification underfitting.

The ideal is to find a compromise such that the loss curves on the training set and the validation set decrease in a similar way, or such that the accuracy behaves similarly on both sets. An example of this good behaviour can be found in Figure 3.5.
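The metrics of equations 3.2 to 3.5 and the comparison between training and validation loss described above can be illustrated with a minimal sketch. The function names, the example counts and the thresholds `high_loss` and `max_gap` are assumptions introduced here for illustration; they are not values taken from this work.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    # Guard against empty denominators so the sketch never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def diagnose_fit(train_loss, val_loss, high_loss=0.5, max_gap=0.2):
    """Rough label from the final training and validation losses.

    high_loss and max_gap are hypothetical thresholds chosen for the example.
    """
    if train_loss > high_loss and val_loss > high_loss:
        return "underfitting"   # poor behaviour on both sets
    if val_loss - train_loss > max_gap:
        return "overfitting"    # good on known data, poor on new data
    return "balanced"


metrics = classification_metrics(tp=8, tn=5, fp=2, fn=1)
print(metrics)
print(diagnose_fit(train_loss=0.05, val_loss=0.90))
```

In practice the whole loss curves over the epochs would be inspected rather than only their final values; the single-value check is kept here only to make the comparison explicit.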

Figure 3.5: Classification compromise between underfitting and overfitting.

The selected model will depend on the algorithm used for training. In Machine Learning there are numerous classification algorithms, which grow in complexity or adapt better to certain conditions depending on the data used. Some examples are Logistic Regression, K-Nearest Neighbours, Decision Tree and Random Forest, among others. In this work, K-Means Clustering, Hierarchical Clustering and Gaussian Mixture Models have been used to obtain some intermediate results in the feature selection block. In the Deep Learning paradigm, the model depends on the creation of a network architecture using a series of available layers that perform different functions on the input data. Deep Learning algorithms are now improving on the results obtained by Machine Learning solutions and will be used in this work to create the final recommendation engine.

3.2. Proposed architecture

Figure 3.6: Proposed architecture.

The architecture proposed for the recommendation system is represented in Figure 3.6. It consists of four distinct blocks in which different image processing and machine/deep learning techniques are used.

The first block is the feature extraction (Section 4.1). In it, an analysis of the trailers from the selected dataset (Section 3.3.1) is carried out using both Computer Vision and Deep Learning techniques. Four different features have been extracted from each trailer. They were selected as the characteristics most relevant for describing each movie trailer successfully. The feature extractors considered in this work are a deep learning action recogniser, colour histograms, a deep learning object detector and optical flow. Each characteristic is processed to obtain a vector of values per feature. These four vectors are concatenated to form the final feature vector of each film, which is the input to the second block of the architecture.

The next block is an embedding of the feature vectors. The embedding maps the feature vectors to a different sub-space in which the vectors are better separated and their values are better suited to the final purpose. To perform the embedding, a neural network was used; this network takes the feature vectors as input and the labelled genres of the film trailers as labels. The output of this block is the prediction for all the films once the model is trained, giving one vector per film, with a dimension equal to the layer before

the classification layer of the network. The network was trained as a multi-label classification problem because each film is categorised by more than one genre.

The next block solves an optimisation problem in which a distance function is optimised to output the final distances between the embeddings. This block calculates a distance value between each film trailer and the rest of the films in the dataset. The distances have been calculated with different algorithms to check which one best fits the problem. The output of this block is one vector per film, with a length equal to the number of film trailers.

In the last block, a final training over all the trailers is carried out. Different network architectures were used in order to compare their performance. Firstly, an Artificial Neural Network (ANN) was used, with the embeddings as input and the distances as labels. This is solved as a regression problem in order to learn the distances between film trailers. The second approach takes advantage of deep learning autoencoder architectures. A first autoencoder was used to learn the distances between films by reproducing its input at the output. A second autoencoder takes the decoder part of the previous autoencoder and adds a new encoder part that takes the embedding vectors as input.

After training, a model is obtained that can make recommendations. When a prediction is performed, the output is a vector of length equal to the number of movies, whose values lie in the range 0 to 1, where 1 represents the most similar film trailer and 0 the least similar one. The positions with the 10 highest values are therefore the final recommendations of the proposed recommender engine. Through the complete architecture we obtain a final model that allows us to carry out a content-based recommendation for a movie trailer.

3.3. Feature extraction

As the previous section explains, the first step carried out in the proposed recommendation engine is the feature extraction process. To achieve this, it is first necessary to choose the dataset of movie trailers to be used (Section 3.3.1). Next, an analysis of the features to be extracted is carried out. Different feature extractors were used (Section 3.3.2.1) to obtain the input values, which were normalised in order to get a common representation for each movie trailer.
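The idea of normalising each per-feature vector and joining the results into a single representation per trailer can be sketched as follows. This is only an illustration under stated assumptions: min-max scaling is chosen here as one possible normalisation (the text does not specify which one was applied), and the feature names, vector lengths and values are hypothetical.

```python
# Hypothetical names for the four feature extractors, in a fixed order.
FEATURE_ORDER = ("action", "colour_histogram", "objects", "optical_flow")


def min_max_normalise(values):
    """Scale a list of numbers to the range [0, 1] (assumed normalisation)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant vector carries no contrast; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_trailer_vector(feature_vectors):
    """Normalise each feature vector and concatenate them in a fixed order."""
    combined = []
    for name in FEATURE_ORDER:
        combined.extend(min_max_normalise(feature_vectors[name]))
    return combined


# Toy values standing in for the real extractor outputs.
example = {
    "action": [0.1, 0.7, 0.2],
    "colour_histogram": [12, 40, 8, 20],
    "objects": [3, 0, 1],
    "optical_flow": [0.5, 2.5],
}
trailer_vector = build_trailer_vector(example)
print(len(trailer_vector), trailer_vector)
```

Keeping a fixed feature order means that the same position in the final vector always refers to the same feature for every trailer.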

Dataset | Domain | Content Feature | Number of items | Number of users | Number of ratings
LDOS-CoMoDa dataset [42] | movie | M+context | 1K | 1K | 2K
Million Song Dataset [43] | music | A,M | 1M (track) | 1M | 48M
Million Musical Tweets [44] | music | A,M | 134K (track), 25K (artist) | 214K | 1M
LFM-1b [45] | music | M | 32M (track), 3M (artist) | 120K | 1.1B
MovieLens 20M (ML-20M) [46] | movie | M,A,V | 26.7K | 138.5K | 20M
MMTF-14K [47] | movie | M,A,V | 13.6K | 138.5K | 12.4M
Labeled Movie Trailer Dataset [12] | movie | M,A,V | 4K | IMDB | 4K

Table 3.1: Comparison between datasets, based on [47].

3.3.1. Dataset

To begin building the recommendation system, the trailers and their information are needed. The dataset is an essential part of any machine learning and deep learning project. It is necessary to have a good set of information that represents the major part of the cases that can appear in the problem. It is vitally important that the data are appropriate for the problem and that the information they provide is reliable (the labels are correct, the information is of good quality, etc.).

To select the dataset, a thorough search of the openly available datasets was made. The comparison can be seen in Table 3.1. The type of content feature is denoted as M (metadata), V (video) and A (audio). The most relevant datasets are the last three (MovieLens, Multifaceted Movie Trailer Feature Dataset and Labeled Movie Trailer Dataset), since they have video information. What each one offers, the quality of its information and its dimensions are described below.

The first dataset is MovieLens [46]. It is a collection of datasets that differ in the number of movies. For each one, a list of movies and their YouTube ids is offered, which facilitates the download of the trailers. The dataset recommended for research is called MovieLens 20M. The 20M indicates the number of ratings it has in its metadata. The database contains 27,000 movies. The dataset is no longer updated, so most of the YouTube links are out of date.

Another dataset of movie trailers of great interest is the Multifaceted Movie Trailer Feature Dataset [47]. This dataset provides 14,000 movie trailers, in addition to a series of audio and video descriptors, metadata and ratings. The visual descriptors include aesthetic features and AlexNet features, and the audio descriptors include block-level features and i-vector

features.

Finally, another common dataset is the Labeled Movie Trailer Dataset (LMTD) [12]. It is oriented to multi-label movie genre classification, providing 9 different classes. This dataset offers 4021 trailers of tagged movies. In addition to the multiple genres of the films, one of its biggest advantages is its metadata, which offers all the data stored in IMDB for those films. That includes the genres indicated by IMDB, name, director, film awards, main actors, plot summary and image URL of the film cover, among much other information.

After this exploration, the dataset used in this work is the Labeled Movie Trailer Dataset. It has been selected taking into account the importance of the genre labels and the quality and quantity of the trailers available to download from YouTube, since in our work only the genres and the video trailers will be used out of all the provided data.

3.3.1.1. Genres

In a deep learning project, an exploratory data analysis of the dataset is usually carried out. In this case, however, the only metadata that must be checked are the genres. It has been verified that genre data exist for all the films. Other relevant information about the genres of the dataset is shown next. The genres per film offered by this database are of two different types. On the one hand, IMDB provides 24 different genres (Table 3.2). On the other hand, the genres of the LMTD dataset are divided into 9 classes (Table 3.3).
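Since each film can carry more than one genre, the genre labels are naturally represented as a multi-hot vector with one position per class. The sketch below shows that encoding for a 9-class label set. The genre names in `GENRES` are placeholders assumed for illustration (the actual class list is the one given in Table 3.3), and `multi_hot` is a hypothetical helper, not a function from this work.

```python
# Placeholder class names; the real 9 classes are those listed in Table 3.3.
GENRES = ("action", "adventure", "comedy", "crime", "drama",
          "horror", "romance", "sci-fi", "thriller")


def multi_hot(film_genres, classes=GENRES):
    """Encode a set of genres as a 0/1 vector with one entry per class."""
    unknown = set(film_genres) - set(classes)
    if unknown:
        raise ValueError(f"Unknown genres: {sorted(unknown)}")
    return [1 if genre in film_genres else 0 for genre in classes]


# A film tagged with two genres gets two active positions.
label = multi_hot({"comedy", "romance"})
print(label)
```

A vector of this form can serve as the target of a multi-label classifier, where each position is predicted independently of the others.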