Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations


Universidad Politécnica de Madrid
Escuela Técnica Superior de Ingenieros de Telecomunicación

Master's Degree in Telecommunication Engineering (Máster Universitario en Ingeniería de Telecomunicación)
Master's Thesis

Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations

Patricia Alonso de Apellániz
2020


Author: Patricia Alonso de Apellániz
Tutor: Dr. Alberto Belmonte Hernández
Department: Departamento de Señales, Sistemas y Radiocomunicaciones

Committee members:
President:
Member:
Secretary:
Substitute:

Following the reading and defense of the Master's Thesis, the committee agrees on the grade of:
Grade:
Madrid,




Summary

The generation of synthesized images, and the ability to animate or transform them, has evolved remarkably in recent years, thanks in part to the use of neural networks. In particular, transferring facial gestures and audio to an existing image has attracted attention both in research and in society at large because of its potential applications.

This Master's Thesis surveys the state of the art in techniques for transferring facial gestures, including lip movement, between audiovisual media. Specifically, it focuses on existing methods and research that generate talking faces from several features of the multimedia input. Building on this study, several systems are implemented, developed, and evaluated as follows.

First, given the importance of training deep neural networks on a large, well-processed dataset, VoxCeleb2 is downloaded and conditioned: image and audio information is extracted from the original videos to serve as the input of the networks. The features are those widely used in the state of the art for this kind of task, such as image key points and audio spectrograms.

Second, three convolutional networks, specifically Generative Adversarial Networks (GANs), are implemented based on the implementation in [1], with new configurations such as a network that handles the audio features and loss functions adapted to the new architecture and the network's behavior. In other words, the first implementation is the network from the cited paper; the second adds an encoder for audio features to it; and the third trains this last architecture with an additional loss computed for the audio learning.

Finally, both quantitative metrics and qualitative evaluations are used to compare and evaluate each network's results. Since the final output of these systems is a clear and realistic video of an arbitrary face to which the gestures of another face have been transferred, perceptual visual evaluation is key to assessing the results.

Keywords

Deep Learning, face transfer, image generation, synthesized frames, encoder, Convolutional Neural Networks (CNNs), autoencoder, Generative Adversarial Networks (GANs), Generator, Discriminator, data processing, dataset, Python, qualitative and quantitative evaluations.

Resumen

Generating synthesized images, and being able to animate or transform them in some way, has undergone a very significant evolution in recent years thanks, in part, to the use of neural networks. In particular, the attempt to transfer different facial gestures and audio to an existing image has attracted attention both in research and even socially, due to its possible applications.

Throughout this Master's Thesis, a study is made of the state of the art in the techniques that exist for this transfer of facial gestures between audiovisual media, including lip movement. Specifically, it focuses on the existing methods and research that generate talking faces based on several features of the multimedia information used. Based on this study, several systems are implemented, developed, and evaluated as follows.

First, given the importance of training deep neural networks with a large and well-processed dataset, VoxCeleb2 is downloaded and undergoes a process of conditioning and adaptation in which image and audio information is extracted from the original video to be used as the input of the networks. These features are the ones normally used in the state of the art for tasks like this one, such as image key points and audio spectrograms.

Second, three different convolutional networks, in particular Generative Adversarial Networks (GANs), are implemented based on the implementation in [1] but with some new configurations, such as the network that handles the audio features or the loss functions that depend on this new architecture and the network's behavior. In other words, the first implementation is the network from the cited paper; an encoder for the audio features is added to this implementation; and, finally, training is based on this last architecture but takes into account the loss computed for the audio learning.

Finally, both quantitative measurements and qualitative evaluations are carried out to compare and evaluate the results of each network. Since the final result of these systems is a clear and realistic video of an arbitrary face to which the gestures of another have been transferred, visual perception is key to solving this problem.

Palabras clave

Deep Learning, face transfer, image generation, synthesized images, encoder, Convolutional Neural Networks (CNNs), autoencoder, Generative Adversarial Networks (GANs), Generator, Discriminator, data processing, dataset, Python, qualitative and quantitative evaluations.

Acknowledgements

Thanks to the unconditional support of my tutor, Alberto, an all-rounder who is able to focus on and teach anyone who asks. It had been a long time since I had met someone so passionate about learning and sharing knowledge, and he has drawn me into a field in which I want to keep growing, so thank you very much. Thanks to my 'commune' for achieving what few can: putting up with me at my worst moments while trying to make me smile, even if by constantly teasing me, and making it possible for me to keep going with everything. Finally, thanks to my mother and father, who have never doubted me.


Index

1 Introduction and objectives
    1.1. Motivation
    1.2. Objectives
    1.3. Structure of this document

2 State of the art
    2.1. Image Synthesis
    2.2. Deep Learning basics
        2.2.1. Artificial Neural Networks
        2.2.2. Convolutional Neural Networks
        2.2.3. Recurrent Neural Networks
    2.3. DL and Image Synthesis

3 Development setup
    3.1. Federated Learning
    3.2. PyTorch
    3.3. Libraries
    3.4. Overview of the proposed DL process

4 Implementation
    4.1. Data collection
    4.2. Data preparation
        4.2.1. Visual features extraction
        4.2.2. Audio features extraction
    4.3. Modeling
        4.3.1. Embedders
        4.3.2. Generator
        4.3.3. Discriminator
    4.4. Training
        4.4.1. Meta-learning
        4.4.2. Fine-tuning
        4.4.3. Other hyper-parameters
    4.5. Evaluation
        4.5.1. Quantitative Evaluation
        4.5.2. PS
        4.5.3. NMI
        4.5.4. Qualitative Evaluation

5 Simulations and results
    5.1. Configuration of the evaluation
        5.1.1. Evaluation dataset
        5.1.2. Reference system
    5.2. Project experiments
        5.2.1. Reference system with this project's dataset
        5.2.2. Results with both image and audio features
        5.2.3. Results with audio loss
    5.3. Other experiments of possible interest
        5.3.1. Federated learning
        5.3.2. Angela Merkel video results
        5.3.3. Video to Image results
        5.3.4. Video to not human face

6 Conclusions and future lines
    6.1. Conclusions
    6.2. Future lines of work

References

Appendices

A Social, economic, environmental, ethical and professional impacts
    A.1 Introduction
    A.2 Description of impacts related to the project
        A.2.1 Ethical impact
        A.2.2 Social impact
        A.2.3 Economic impact
        A.2.4 Environmental impact
    A.3 Conclusions

B Economic budget

C Code available

D Other Results
    D.1 Reference system with this project's dataset
    D.2 Results with both image and audio features
    D.3 Results with audio loss

Index of figures

1.1. Example of deepfake in "The Shining" (1980). Jim Carrey replaces Jack Nicholson through DL techniques [2].
1.2. Example of Obama's talking video generation through DL techniques [3].

2.1. ANN architecture [9]
2.2. Object detection results using the Faster R-CNN system [10]
2.3. CNN architecture for classification [14]
2.4. GAN architecture [20]
2.5. GAN architecture [21]
2.6. LSTM architecture [24]
2.7. LRCN architecture [26]
2.8. Example results of Pix2Pix net on automatically detected edges compared to ground truth [28]
2.9. Face generation through the years. Faces on the left were created by Artificial Intelligence in 2014 and the ones on the right, in 2018. [31]
2.10. Dense alignment, including key points, and 3D reconstruction results for [38]
2.11. OpenFace behaviour analysis pipeline, including facial action unit recognition [41]
2.12. Results of the reenactment system Face2Face [42]
2.13. X2Face network during the initial training stage [43].
2.14. System to synthesize Obama's talking head [44]
2.15. Results comparison between Face2Face and Obama's talking head generation model [44]
2.16. Overall Speech2Vid model [45]
2.17. Proposed conditional recurrent adversarial video generation model [46]
2.18. Proposed Disentangled Audio-Visual System [47].
2.19. Few shot Model Architecture [1]
2.20. Few shot Model Results compared to other models seen [1]

3.1. Federated learning general representation [53]
3.2. Proposed federated learning system architecture
3.3. PyTorch vs. TensorFlow: Number of Unique Mentions. Conference legend: CVPR, ICCV, ECCV - computer vision conferences; NAACL, ACL, EMNLP - NLP conferences; ICML, ICLR, NeurIPS - general ML conferences. [53]
3.4. System architecture general training overview
3.5. System architecture final application overview

4.1. Deep learning process block diagram
4.2. VoxCeleb2 faces of speakers in the dataset.
4.3. VoxCeleb2 downloaded folders organization
4.4. VoxCeleb2 txt file with video information to download it
4.5. VoxCeleb2 video example from YouTube
4.6. Visual features extraction [41]
4.7. Visual feature extraction. Frame A from dataset video, bounding-box coordinates provided by dataset and landmarks extracted, respectively.
4.8. Final visual feature extraction. Input to the network consisting of frame and landmarks concatenated.
4.9. Audio feature extraction. First column: frame A talking from dataset video, audio waveform from frame A, MFCCs and Mel-spectrogram, respectively. Second column: the same for frame B not talking.
4.10. Project's network architecture
4.11. Image Embedder Architecture
4.12. Audio Embedder Architecture
4.13. Single Residual Block [79]
4.14. Residual Down Sampling Block
4.15. Generator Architecture Without Audio vector
4.16. Generator Architecture With Audio vector
4.17. Residual Up Sampling Block
4.18. Discriminator Architecture
4.19. VGG-19 Architecture
4.20. VGG-Face Architecture
4.21. Different image distortions to proceed with image evaluation

5.1. Frame of generated video of myself with noticeable face gestures and speaking.
5.2. Frames of downloaded video from Pedro Sánchez before and after cutting and cropping it.
5.3. Implementation inference after 5 epochs of training in small dataset [94]
5.4. Example of output of the reference system trained using the pre-trained model available. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated. The first column of pictures just trains the meta-learning stage and the second trains for 40 epochs the fine-tuning stage.
5.5. Training losses evolution graph in reference system with pre-trained weights.
5.6. Example of losses output during the fine-tuning stage using the reference system model available.
5.7. Example of output of the reference system trained just the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.8. Different FT T values applied. Example of output of the reference system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.9. Different FT epochs applied. Example of output of the base system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.10. Different paddings applied. Example of output of the base system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.11. Training losses evolution graph in reference system with own dataset.
5.12. Examples of Generator outputs during meta-learning.
5.13. Example of losses output during the fine-tuning stage using the base system model available with this project's dataset with different T values.
5.14. Example of losses output during the fine-tuning stage using the base system model available with this project's dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50).
5.15. Example of output of the Video-Audio system trained just the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.16. Different FT T values applied. Example of output of Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.17. Different FT epochs applied. Example of output of Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.18. Different paddings applied. Example of output of the Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.19. Training losses evolution graph in Video-Audio system with own dataset.
5.20. Example of losses output during the fine-tuning stage using the Video-Audio system model available with this project's dataset with different T values.
5.21. Example of losses output during the fine-tuning stage using the Video-Audio system model available with this project's dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50).
5.22. Example of output of the Video-Audio with audio loss system trained just the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.23. Different FT T values applied. Example of output of Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.24. Different FT epochs applied. Example of output of Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.25. Different paddings applied. Example of output of the Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.26. Training losses evolution graph in Video-Audio system using audio loss with own dataset.
5.27. Example of losses output during the fine-tuning stage using the Video-Audio system model with audio loss available with this project's dataset with different T values.
5.28. Example of losses output during the fine-tuning stage using the Video-Audio system with audio loss model available with this project's dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50).
5.29. Example of output of the Server 2 reference system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.30. Reference networks' generated image during meta-learning stage
5.31. Example of output of the Server 2 Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.32. Video-Audio networks' generated image during meta-learning stage
5.33. Example of output of the Server 2 Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the Sánchez's frame with the first video's face translated.
5.34. Video-Audio with audio loss networks' generated image during meta-learning stage
5.35. Angela Merkel video frame [97]
5.36. Different visual results for each experiment done using Sánchez video and using Merkel's.
5.37. Original image used for the application purpose
5.38. Example of output of the reference system trained both the meta-learning and fine-tuning steps. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated.
5.39. Example of output of the reference system trained both the meta-learning and fine-tuning steps with the project's dataset for the video to image application. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated.
5.40. Example of output of the Video-Audio system trained just for the meta-learning stage with the project's dataset for the video to image application. The picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated.
5.41. Example of output of the Video-Audio with audio loss system trained for the meta-learning step with the project's dataset for the video to image application. The picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated.
5.42. Example of non human face picture [98]
5.43. Example of output using a non human face to transfer the gestures from the video of myself
5.44. Example of video game face picture [100]
5.45. Example of output using a video game face to transfer the gestures from the video of myself
5.46. Example of output using the same two videos of myself for the network without Audio Embedder and for the one with it

B.1. Economic budget

D.1. Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part I.
D.2. Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part II.
D.3. Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part IV.
D.4. Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part V.
D.5. Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part I.
D.6. Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part II.
D.7. Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part IV.
D.8. Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part V.
D.9. Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part I.
D.10. Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part II.
D.11. Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part IV.
D.12. Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part V.

(27) Index of tables 2.1. DL and image synthesis summary . . . . . . . . . . . . . . . . . . . . .. 27. 3.1. Hardware used in the development of the Master’s thesis . . . . . . . .. 32. 3.2. Different existing DL frameworks . . . . . . . . . . . . . . . . . . . . .. 34. 4.1. VoxCeleb2 description . . . . . . . . . . . . . . . . . . . . . . . . . . .. 41. 4.2. Final number of samples from VoxCeleb2 used . . . . . . . . . . . . . .. 45. 4.3. Dataset storage used . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 45. 4.4. Information about spectrogram computation . . . . . . . . . . . . . . .. 49. 4.5. Metrics values obtained for the different distortions of the first image to show range of values. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 75. 5.1. Test videos information . . . . . . . . . . . . . . . . . . . . . . . . . . .. 79. 5.2. Metrics values obtained for the different configurations for the video results using the reference system . . . . . . . . . . . . . . . . . . . . .. 85. Experiments carried out in this project . . . . . . . . . . . . . . . . . .. 86. 5.3.

(28)
5.4. Metrics values obtained for the different configurations for the video results using the reference system with this project’s dataset. 94
5.5. Metrics values obtained for the different configurations for the results using the Video-Audio system with this project’s dataset. 101
5.6. Metrics values obtained for the different configurations for the results using the Video-Audio with audio loss system with this project’s dataset. 107
5.7. Best metric values obtained for each of the previous experiments. 114

(29)

(30)

(31) Glossary
DL - Deep Learning
MIT - Massachusetts Institute of Technology
ANN - Artificial Neural Network
CNN - Convolutional Neural Network
R-CNN - Region Convolutional Neural Network
SVM - Support Vector Machine
NLP - Natural Language Processing
ReLU - Rectified Linear Unit
NN - Neural Network
GAN - Generative Adversarial Network
DCGAN - Deep Convolutional Generative Adversarial Network
BN - Batch Normalization
RNN - Recurrent Neural Network
LSTM - Long Short-Term Memory
WLAS - Watch, Listen, Attend and Spell
PRN - Position map Regression Network
CPU - Central Processing Unit
GPU - Graphics Processing Unit
HG - HourGlass

(32)
MFCC - Mel-Frequency Cepstral Coefficients
ILSVRC - ImageNet Large Scale Visual Recognition Challenge
LAD - Least Absolute Deviations
PSNR - Peak Signal-to-Noise Ratio
SSIM - Structural Similarity
CSIM - Cosine Similarity
LMD - Landmark Distance Error
IS - Inception Score
FID - Fréchet Inception Distance
PS - Perceptual Similarity
NMI - Normalized Mutual Information
MI - Mutual Information
MSE - Mean Squared Error
WER - Word Error Rate

(33)

(34)

(35) 1. Chapter 1. Introduction and objectives

This chapter introduces the background that motivated the project, gives a brief overview of its main objectives, and outlines the chapters into which the document is divided.

1.1. Motivation

Transforming or manipulating photographs dates back to some of the first pictures captured during the 19th century, not long after the first one was taken in 1825. From paint retouching, airbrushing, or even manipulating negatives while still in the camera, techniques have moved to radically new approaches thanks to digitization in just one century. Since then, more and more advances have been made as digital image quality and equipment have improved. As a result of those developments, interest slowly moved from static pictures to moving ones. Being able to animate a still image, or to transform a video of a face in a controllable way, has had a huge impact on society, since it can be applied to many image editing applications, such as animating an on-screen person with another person’s expressions or even manipulating what this person is saying. These alterations are done, generally,

(36) 2. for entertainment purposes in advertising or social media, for example. However, as can readily be imagined, there are ethical issues and controversies around it, which will be discussed in “Social, economic, environmental, ethical and professional impacts”, involving, of course, the well-known deepfakes among other applications. This synthetic media has gathered widespread attention for its unethical use in fake news, financial fraud, or even adult-content videos. It basically consists of replacing a person’s face with another one, as can be seen in Figure 1.1. Nowadays, several online applications exist just for this purpose which, as the name deepfake suggests, is made possible by advances in Deep Learning (DL) algorithms.

Figure 1.1: Example of a deepfake in “The Shining” (1980). Jim Carrey’s face replaces Jack Nicholson’s through DL techniques [2].

While face replacement has gained a lot of interest in academic and social fields, recent research has focused on synthesizing talking faces, that is, taking a still face image and making it “talk” by moving its lips according to some audio, written sentences or even a video. This is the case of the well-known Barack Obama video (Figure 1.2) in which, given an audio track, a video of him was created depicting him mouthing the words of the track. It is possible to go further, as mentioned, by taking a video of a person, not a still image, and replacing what he or she is saying by generating lip and facial movement according to a given video different from the

(37) 3. original. There are many examples of techniques for generating videos of talking people that are being developed and improved thanks to DL algorithms.

Figure 1.2: Example of Obama’s talking video generation through DL techniques [3].

DL is eclipsing other techniques and gaining a lot of interest in a large number of computer vision fields and, in particular, in this one, because it allows models with high computational capacity to learn how to generate images from features extracted from other media. Nevertheless, there is still a long way to go, since generating a natural, realistic and personalized human head is really difficult because of its geometric and kinematic complexity, and also because our visual system is able to recognize even minor mistakes in a human’s appearance. As will be seen in the state of the art chapter of this report, researchers from all over the world have proposed several DL models to overcome these challenges, which use images from datasets to do so. But what about combining those image features with audio features to help a model learn and try to improve on the proposed ones? In the end, lips have to be synthesized, and their motion also depends on the audio features.

(38) 4.

1.2. Objectives

As stated before, the main objective of this research is to implement a DL model that, given two videos, transfers the facial expressions and poses from the first one to the second one. This main objective can be broken down into two secondary ones. The first is to test the suitability of adding audio features, in addition to the image features, as an input to the neural network model, and to assess whether this improves performance. The second is to extend the advances in this field, which is a new one and does not yet have as many examples available as other fields.

1.3. Structure of this document

This document is structured in six chapters, followed by the annexes:
1. Introduction and objectives: the motivation to carry out this project is explained.
2. State of the art: the theoretical and practical background of this field and similar ones is introduced.
3. Development setup: the software and hardware requirements are defined.
4. Implementation: the implemented algorithms are detailed.
5. Simulations and results: the final results of our experiments are presented.
6. Conclusions and future lines: the conclusions drawn from this project are presented and some future lines of work are proposed.
7. Annexes: additional information about the project, its impacts and the economic budget is provided.

(39) 5. Chapter 2. State of the art

2.1. Image Synthesis

Since the late 1960s, the quality and accuracy of computer-generated images have improved dramatically. The field has gone from simple ambient representations or the synthesis of a single object [4][5], with direct lighting as the only possibility, to the generation of complete scenes [6] with shadows and shading. This improvement has several causes, but advances in both hardware and software, such as the increased computational speed of the equipment, stand out among the rest, making it possible to generate mature representations with high spatial and color resolution. So, for some years now, image synthesis has been explored in depth and has become highly relevant in applications such as creating high-resolution images from low-resolution ones or generating facial images with different poses. This task [7] addresses the process of generating images that convey the information of a real scene from some sort of description. Despite this progress and its relevance in multiple fields, creating high-resolution images from a given input remains a challenge. This might be because traditional and newly developed techniques lack the high-level information that is required for generating images.

(40) 6. By making use of the advances of DL in Computer Vision, Image Synthesis has been receiving more and more attention, as it opens up new application fields while reducing the cost, improving the scalability, and shortening the time needed to generate images.

2.2. Deep Learning basics

Nowadays, DL is becoming the main solution in the discipline described above and in several others. DL techniques vary greatly and are found in fields as diverse as medicine, with tasks such as identifying skin cancer from patient photos, and advertising, offering clients the products that fit them best. DL can be defined as a type of machine learning technique in which the input information is processed in hierarchical layers so that the machine learns features at increasing levels of complexity. These techniques are based on neural networks, which share some properties: they consist of interconnected neurons organized in layers, and they differ in architecture and possibly in training. As a good summary, the official introductory DL course of the Massachusetts Institute of Technology (MIT) [8] defines it as a technique that extracts patterns from data using neural networks. As mentioned, there are different types of neural networks, depending on the architecture, the training technique, and, no less important, the type of raw input to the net. The basic types are described below.

2.2.1. Artificial Neural Networks

The basic type of neural network, the Artificial Neural Network (ANN), consists of processing elements organized in interconnected layers, as Figure 2.1 shows, where the flow of information through the network is unidirectional. This means that data travels only forward, from input to output, and each neuron is connected by a weighted link to every neuron in the following layer.
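The forward-only flow of information just described can be sketched in a few lines of plain Python. This is only an illustrative sketch: the 2-3-1 layer sizes, the hand-picked weights, and the sigmoid activation are assumptions made for the example and do not come from this thesis.

```python
import math


def sigmoid(z):
    """Logistic activation, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: every input feeds every output neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


def forward(inputs, layers):
    """Propagate data strictly forward, from input to output, layer by layer."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases, sigmoid)
    return activations


# Hypothetical 2-3-1 network with hand-picked weights (illustration only).
hidden = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])
output = ([[0.7, -0.5, 0.2]], [0.05])
prediction = forward([1.0, 2.0], [hidden, output])
print(prediction)
```

Each layer is stored as one weight row per neuron, so the nested loop makes the "every neuron to every neuron" connectivity explicit. Training, which adjusts these weights, is deliberately left out of the sketch.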
The input layer sends the information to the hidden layer, which processes it and may be connected to further hidden layers, and the output layer generates the result. ANN training starts with a random selection of weight coefficients, which are then modified

(41) 7. until the output matches the true values.

Figure 2.1: ANN architecture [9].

This architecture can be useful for solving some regression and classification problems. When the input is an image, it has to be converted from a 2D array to a 1D vector before training, which has two drawbacks. The first is that the number of trainable parameters grows rapidly with the size of the image; the second is that the ANN loses the spatial structure of the image, which is a huge source of information. Another drawback of this simple architecture and training is that an ANN does not capture sequential information. All these limitations are addressed by the more complex architectures and training techniques explained below.

2.2.2. Convolutional Neural Networks

In the area of computer vision, Convolutional Neural Networks (CNNs) have positioned themselves above the rest, becoming the core of most current systems, such as those for object detection and image classification.
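The parameter growth that penalizes an ANN fed with a flattened image, and that partly motivates CNNs, can be made concrete with simple arithmetic. The sketch below counts the weights and biases of one fully connected layer and of one convolutional layer; the image size, the number of hidden units, and the filter settings are illustrative assumptions, not values used in this thesis.

```python
def dense_params(height, width, channels, hidden_units):
    """Weights plus biases of one fully connected layer on a flattened image."""
    inputs = height * width * channels
    return inputs * hidden_units + hidden_units


def conv_params(kernel_size, in_channels, num_filters):
    """Weights plus biases of one 2D convolutional layer.

    The same filter weights are reused at every image position, so the
    count does not depend on the image size.
    """
    return kernel_size * kernel_size * in_channels * num_filters + num_filters


# Hypothetical example: 256x256 RGB image, 128 hidden units, 64 filters of 3x3.
dense = dense_params(256, 256, 3, 128)
conv = conv_params(3, 3, 64)
print(dense, conv)
```

Under these assumed sizes the fully connected layer needs 25,165,952 parameters, while the convolutional layer needs 1,792, and only the first figure grows when the image gets larger.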

(42) 8. Figure 2.2: Object detection results using the Faster R-CNN system [10].

Such is their effectiveness in object detection, for example, that many families of models have been implemented and tested over the years. In [11], an R-CNN (Region-CNN) system for object detection is developed, which applies the capacity of CNNs to bottom-up region proposals in order to localize and segment objects, and finally classifies the resulting feature vectors with a Support Vector Machine (SVM). This approach outperforms those previously implemented in the state of the art for this task. As a result, several R-CNN-based architectures for object detection have appeared: in [10], a Faster R-CNN is implemented in which the region proposals of the image are predicted separately, greatly improving speed compared to the previous ones and obtaining the results shown in Figure 2.2.

(43) 9. More recently, CNNs have also been applied to problems in Natural Language Processing (NLP), such as machine translation or text classification, with interesting results. In [12], a sentence classification system is implemented by representing the input text as an array of vectors, just as an image can be represented as an array of pixel values. So, in DL, a CNN is a deep neural network that processes data with a grid topology, using convolutional and pooling layers to extract features from it. Unlike several other networks, a CNN works with n-dimensional matrices and filters, taking into account the spatial dependency of pixel values. In a CNN, the connections between layers are restricted so that the nodes of a given filter share the same weights, which makes them detect the same characteristics in different areas of the image. As can be seen in Figure 2.3, only the last layers of the net are flattened and fully connected. A CNN can use this combination of convolutional and pooling layers to classify a dataset. In this example, it classifies the CIFAR-10 dataset [13], which consists of 60,000 color images divided into 10 classes.

Figure 2.3: CNN architecture for classification [14].

The basic operation of this architecture, a basic CNN being a combination of each of the layers defined, is as follows:
1. The convolution layers scan their input looking for patterns. They are characterized by the number of independent filters, which determines the number of output feature maps, by the kernel size, which gives the size of the sliding filter, and by the stride, which determines the number of pixels the filter moves at each step.

(44) 10.
2. For the rectifier or detector layer, the Rectified Linear Unit (ReLU) is usually chosen.
3. The pooling layers perform downsampling after the convolutional ones to reduce dimensionality, which improves computational efficiency. These layers are characterized by the size of the pooling window.
4. Finally, the output is flattened into a vector and classified by the fully connected layers.

As seen at the beginning of this section for object detection, several CNN architectures have been key in building DL algorithms that achieve high accuracy. In image recognition, a task with an extensive research history, some architectures are also worth mentioning. Examples are AlexNet [15], which contains 8 layers, 5 convolutional (some followed by max-pooling layers) and 3 fully connected, and VGGNet [16], which uses convolutional kernels of size 3x3 and max-pooling kernels of size 2x2 with stride 2. After the celebrated success of those two, the ResNet model [17] appeared, providing the DL field with a novel architecture with “skip connections”, which make it possible to train a Neural Network (NN) with a large number of layers while still having lower complexity than the VGGNet mentioned before. It also achieved an error rate that beat human-level performance on the dataset used.

2.2.2.1. Generative Adversarial Networks

As mentioned, multiple CNN architectures are available, but this state of the art focuses on a few of special interest. Since this thesis consists in synthesizing images from some inputs, which are discussed in the following sections, a generative model should be used. Generative modeling involves automatically learning patterns from the input data in such a way that the model outputs data that could plausibly have been drawn from the original set.
This is where Generative Adversarial Networks (GANs) come in. GANs are able to generate new content using DL architectures such as CNNs. Their architecture was first described in [18] in 2014, but a year later a standardized approach called Deep Convolutional Generative Adversarial Network (DCGAN)

(45) 11. [19] was developed, which finally led to more formalized models. It is important to highlight that DCGAN uses strided convolutions instead of pooling layers to increase and decrease the spatial dimensions of the features, and that it uses Batch Normalization (BN) to normalize layer inputs to zero mean and unit variance. The final aim of this formalized model is to stabilize learning and to cope with poor weight initialization.

Figure 2.4: GAN architecture [20].

As Figure 2.4 shows, a GAN involves two sub-models: a generator, which generates new data, and a discriminator, which decides (classifies) whether the data it receives is real or fake. The aim of the first is to maximize the probability that the discriminator mistakes its outputs for real data, while the second, by telling real from fake, guides the generator towards creating more realistic images. At first, the generator does not know how to produce images that resemble the real ones, that is, those from the training dataset, and the discriminator does not know how to classify images as real or fake. This is why the discriminator receives two different batches: one with real images and another with images that the generator produces from noise. During training, the generator learns to output images that resemble those of the training set. The complexity of this architecture lies in the fact that it needs two losses so that the

(46) 12. discriminator can output probabilities close to 0 for fake images and close to 1 for real images. One loss maximizes the probability assigned to real images and the other minimizes the probability assigned to fake ones. Thus, the total loss for this sub-model is the sum of those partial losses. So, as can be seen, GANs have the potential to expand the horizons of DL, and researchers know it, since they have been developing many techniques for training them. This architecture provides a path towards solving problems that require a generative solution, such as the target of this thesis.

2.2.2.2. Autoencoders

Among generative neural network models, autoencoders seek to “reconstruct” their input, that is, to output data identical to the input by learning an identity function. An autoencoder can be thought of as two sub-networks, which can be seen in Figure 2.5. The encoder accepts the input and compresses it into a latent-space representation, while the decoder takes this representation and reconstructs the data.

Figure 2.5: Autoencoder architecture [21].

While both GANs and autoencoders are generative models, GANs generate new and realistic data, whereas autoencoders simply compress their inputs into a latent-space representation and reconstruct them. Autoencoders can therefore be seen as neural networks for applications such as dimensionality reduction, denoising, or outlier detection. Outside the Computer

(47) 13. Vision area, autoencoders are used in NLP applications such as machine translation [22]. Combined with other networks, these models can lead to interesting architectures.

2.2.3. Recurrent Neural Networks

It is important to highlight that the input data of a neural network is not always static, since some data depends on its own past instances to predict future ones. Applications such as speech recognition or machine translation in NLP, stock price prediction, or spam detection, among others, process this kind of data. The neural networks that address such temporal or sequential data are Recurrent Neural Networks (RNNs). Basically, RNNs store the last computed output in their memory and use it to predict the new output. Many architectures are possible for RNNs; what they all have in common is that, as described, they feed their outputs from a previous time step back into the net as inputs. One of these architectures, the Long Short-Term Memory (LSTM) [23], is shown in Figure 2.6.

Figure 2.6: LSTM architecture [24].

An LSTM carries information through the network and learns from it to predict future data using a memory cell, which is represented in the diagram above. This cell’s inner

(48) 14. operations can be explained as follows:
1. The first gate decides which details have to be discarded from the block, using a sigmoid function that looks at the previous state.
2. The second gate decides which values from the input are used to modify the memory. A sigmoid function performs that task, while a tanh function weights the values passed according to their importance.
3. Finally, the output gate consists of a sigmoid function, which again decides which values to let through, and a tanh function, which weights them. The output therefore depends on the input and on the memory of the block.

When images are the input of an RNN, researchers have combined it with a CNN, so that the output of the CNN is the input of the RNN. In [25], the failure of multi-label image classification to fully exploit label dependencies in an image is addressed by proposing a CNN-RNN framework in which an image-label embedding is learned to characterize the semantic label dependency. Another example of this combination, this time with LSTMs, is [26], where temporal dynamics and convolutional perceptual representations are learned jointly for visual recognition tasks, with good results compared to the state of the art. Several possible architectures using this combination are proposed in that paper. Figure 2.7 depicts the proposed model, LRCN, and how it processes variable-length visual inputs with a CNN whose outputs feed the LSTM, with weights shared across time.
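The three gate operations of the LSTM memory cell listed earlier in this section can be sketched as a single time step. This is a minimal sketch of the standard LSTM formulation for one scalar unit, written in plain Python; the parameter dictionary `p`, its gate names, and the weight values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell.

    p maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # 1. what to discard from the memory
    i = sigmoid(pre("input"))         # 2. which new values may enter the memory
    g = math.tanh(pre("candidate"))   #    candidate values, weighted in [-1, 1]
    c = f * c_prev + i * g            # updated memory cell
    o = sigmoid(pre("output"))        # 3. which part of the memory is let through
    h = o * math.tanh(c)              # new output (hidden state)
    return h, c


# Hypothetical parameters, identical for every gate (illustration only).
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

The loop at the end shows the recurrence: the output and memory produced at one time step are fed back as inputs to the next one.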

(49) 15. Figure 2.7: LRCN architecture [26].

2.3. DL and Image Synthesis

DL is attracting a lot of attention and interest, is in constant development, and is achieving unprecedented levels of success. Its algorithms have even outperformed humans in many fields, such as Computer Vision. In the past few years there has been a sharp growth of research on GANs, which were defined above. Several fields are using these neural networks, or architectures based on them, to solve tasks such as translating an input image into an output one. Traditionally, this task has been approached with techniques such as stitching together small patches of images [27]. Nowadays, translating an image or an object into another image is tackled as in [28], where a conditional GAN (cGAN) is proposed to learn this mapping. Figure 2.8 shows an example output of the released software for image-to-image translation. The results of this network, Pix2Pix [29], suggest that the approach is effective, since many internet users have been posting their own results obtained with it. Such has been its impact that a Pix2PixHD net [30] has already been implemented for synthesizing high-resolution images, in particular 2048x1024, outperforming existing

(50) 16. methods and also generating different results from the same input, allowing a user to edit them interactively.

Figure 2.8: Example results of the Pix2Pix net on automatically detected edges compared to ground truth [28].

Another area in which progress is becoming scarily good is face generation. Figure 2.9 shows the progress of research in this field over just 4 years, which has made it possible to generate lifelike faces using neural networks. Generating realistic facial images with different facial expressions, or while preserving identity information, is an open research topic that is having a deep impact on face recognition, image augmentation, and even face aging and face to face translation. The first field mentioned, face recognition, needs a huge dataset for training. Many datasets created by companies are available to train researchers’ nets. [32] uses the VoxCeleb2 dataset [33], which contains over 1 million utterances from YouTube videos of over 6,000 celebrities, computing spectrograms from the raw audio and using them as the input of a CNN to recognize identities successfully. Another available dataset, LRW [35], has been used in [34] to recognize the words spoken by a person using only the video and not the audio. It consists of up to 1,000 samples of each of 500 words, spoken by hundreds of speakers. This dataset has evolved into a sentence version [36], which has been used as the input of a Watch, Listen, Attend and Spell (WLAS) network [37] that operates over visual input, audio input, or both to lip read, outperforming previous techniques.
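As a rough illustration of how a spectrogram such as the one used in [32] can be obtained from raw audio, the sketch below frames a signal with a sliding Hamming window and takes the magnitude of a naive discrete Fourier transform of each frame. The 25 ms window and 10 ms step are the values reported for [32] in Table 2.1; the 16 kHz sampling rate, the synthetic 1 kHz test tone, and all function names are assumptions made for this example, and the code is not the pipeline of [32] or of this thesis.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def frame_signal(signal, win_len, hop_len):
    """Split a signal into overlapping frames (an incomplete tail is dropped)."""
    if len(signal) < win_len:
        return []
    count = 1 + (len(signal) - win_len) // hop_len
    return [signal[i * hop_len:i * hop_len + win_len] for i in range(count)]


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins of a naive DFT."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectrogram(signal, sample_rate, win_ms=25, hop_ms=10):
    """One magnitude spectrum per windowed frame (time x frequency)."""
    win_len = sample_rate * win_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    window = hamming(win_len)
    return [magnitude_spectrum([s * w for s, w in zip(frame, window)])
            for frame in frame_signal(signal, win_len, hop_len)]


# Hypothetical input: 0.1 s of a 1 kHz tone sampled at 16 kHz.
rate = 16000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(rate // 10)]
spec = spectrogram(tone, rate)
print(len(spec), len(spec[0]))
```

With these assumed settings each frame has 400 samples and consecutive frames start 160 samples apart. A practical implementation would replace the naive DFT, which is quadratic in the frame length, with a fast Fourier transform from a numerical library.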

(51) 17. Figure 2.9: Face generation through the years. The faces on the left were created by Artificial Intelligence in 2014 and the ones on the right in 2018 [31].

In one of the previous papers, the audio spectrogram was used as the input of a network. Several other human features are used in face reconstruction or generation, such as key points of the facial structure or the pose, and it is worth mentioning them and knowing how to extract them. In [38], a simple CNN is trained to reconstruct a 3D face structure from a 2D image, using a representation in UV space that also predicts dense alignment. The proposed Position map Regression Network (PRN) is trained on the 300W-LP dataset [39], resulting in a method that is robust to illumination, pose, and occlusions. The code for this model is also available in [40].

To help with the production of faces under many circumstances and the extraction of features such as key points, researchers are developing several tools and frameworks. In [41], OpenFace, an open-source tool that detects landmarks, head pose, and eye gaze, among others, has been developed. Figure 2.11 shows its analysis with each of the extracted features. It is worth highlighting how important it is to implement a model that learns to extract these characteristics in order to later apply

(52) 18. them to some task as input data.

Figure 2.10: Dense alignment, including key points, and 3D reconstruction results for [38].

It has already been shown how to create photo-realistic faces, but what about creating photo-realistic talking heads? Producing a virtual person or animated being that sounds and looks real is a challenge for applications such as special effects. [42] is one of the first approaches to real-time facial reenactment of a target video. It animates the facial expressions of the target video from a source video recorded with a webcam, and renders the output realistically. Figure 2.12 shows how facial identity recovery is addressed with successful results. X2Face [43] is a neural network that controls pose and expression and takes two frames as input: a source frame, which is the input of the embedding sub-model, and a driving frame, which is the input of the driving sub-model. This is illustrated in Figure 2.13. The embedding network learns to map the source frame to a representation, and the driving network learns to transform the pixels of this representation into a generated frame. It is said to control pose, expression, or identity because it makes no assumptions about them, using the ones in the generated frame.

(53) 19. Figure 2.11: OpenFace behaviour analysis pipeline, including facial action unit recognition [41].

Figure 2.13: X2Face network during the initial training stage [43].

[44] is one of the best-known approaches to talking head generation. Its LSTM model takes Obama’s audio as input and converts it into a time-varying sparse mouth shape, from which a realistic mouth texture is generated and composited into the mouth region of a video, as shown in Figure 2.14.

(54) 20. Figure 2.12: Results of the reenactment system Face2Face [42]. Figure 2.14: System to synthesize Obama’s talking head [44].

(55) 21. Figure 2.15 presents a comparison between Face2Face and the previous network for four different samples of the same speech, using the same video. The second method synthesizes a more realistic mouth, with natural creases and clearer teeth.

Figure 2.15: Results comparison between Face2Face and Obama’s talking head generation model [44].

Papers similar to the Obama one have since been published, with researchers trying to improve on its results. Another neural network that uses both audio and still images as input is Speech2Vid [45], which generates a video of a talking face, this time using an encoder-decoder CNN model, and shows that video data can be generated from audio sources. Figure 2.16 shows the architecture of this model, with an emphasis on the deblurring block, which is used to refine the output frames.

(56) 22. Figure 2.16: Overall Speech2Vid model [45].

To try to outperform the Speech2Vid model, a new conditional adversarial network is presented in [46], also to generate a video of a talking face. In this case (Figure 2.17), a multi-task adversarial model is trained to treat the audio input as a condition for the recurrent adversarial network, in order to make the transitions of the lips and facial expression smoother. It is important to mention that, to reduce the size of the set without reducing quality, phoneme distribution information was extracted from the audio. The results show a superior and more accurate visual representation.

Figure 2.17: Proposed conditional recurrent adversarial video generation model [46].

(57) 23. Even though facial expression variation and speech semantics are coupled in the movement of a talking face, [47] learns disentangled audio-visual representations through a training process that generates a more realistic face with clear motion patterns. Figure 2.18 shows this model, in which three encoders take part: one for Person-ID information from a visual source and two for Word-ID, extracting speech information from visual and audio sources.

Figure 2.18: Proposed Disentangled Audio-Visual System [47].

To conclude this review, [1] implements a talking head model which, unlike most recent works, learns from a few images (few-shot learning) instead of just a single one. Figure 2.19 shows the architecture of this model, which takes the image key points as the input of the generator. It is also able to initialize the parameters of both sub-models, generator and discriminator, in a person-specific way. Since the landmarks used to drive the model come from a different person, the lack of landmark adaptation is usually a problem, but this system achieves a highly realistic solution (Figure 2.20).

(58) 24. Figure 2.19: Few shot Model Architecture [1]. Figure 2.20: Few shot Model Results compared to other models seen [1].

(59) 25. As stated before, huge achievements have been made in the field of face translation, or face to face translation. Being able to animate a still image of a face, whether real or not, is a challenging task that can produce mixed feelings because of ethical issues, as can also happen in some of the other fields mentioned. These ethical issues will be discussed later in this thesis. Finally, Table 2.1 briefly summarizes each of the state of the art algorithms studied and described above.

SUMMARY: DL and Image Synthesis

[32] VoxCeleb2: Deep Speaker Recognition (Jun 2018)
Input data: Audio spectrogram (Hamming window of 25 ms and 10 ms step).
Model: VGGVox: based on VGG-M and ResNet architectures.
Evaluation (best results): Cost function Cdet of 0.429 and Equal Error Rate (EER) of 3.95%.
Dataset: Train: VoxCeleb2. Test: VoxCeleb1.

[34] Lip Reading in the Wild (Nov 2016)
Input data: Mouth region images.
Model: Four different VGG-M models. They differ in architecture and in how they “ingest” input data.
Evaluation (best results): Top-1 accuracy of 65.4% and Top-10 accuracy of 92.3%.
Dataset: LRW.

[37] Lip Reading Sentences in the Wild (Jan 2017)
Input data: Lip region images (120x120) and MFCC features (25 ms windows at 100 Hz, time stride of 1).
Model: WLAS network: three-model net (Watch, Listen and Spell). All LSTM, with cell sizes 256, 256 and 512, respectively.
Evaluation (best results): Audio and lips: Character Error Rate (CER) of 7.9%, Word Error Rate (WER) of 13.9% and BLEU metric of 87.4.
Dataset: LRSW.

[38] Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network (Mar 2018)
Input data: 256x256 images and 3D Morphable Model (3DMM) parameters to generate the UV map.
Model: PRNet: lightweight CNN.
Evaluation (best results): Mean Normalized Mean Error (NME) for 3D face alignment: 3.62%. For 3D reconstruction, a mean NME of 3.7551%.
Dataset: Train: 300W-LP. Test: AFLW and Florence.

[41] OpenFace: an open source facial behavior analysis toolkit (Apr 2016)
Input data: Input image or sequence.
Model: Conditional Local Neural Fields (CLNF): based on the Constrained Local Model (CLM) [48]. Two components (Point Distribution Model and patch experts).
Evaluation (best results): Mean absolute degree error of 2.6 on the Biwi dataset.
Dataset: Among others: Train: MultiPIE, LFPW and Helen. Test: AFW, BU, SEMAINE and MPIIGaze.

(60) 26. [42]. Face2Face: Realtime Face Capture and Reenactment of RGB Videos (Jun 2016). Descriptors of a frame: landmarks, expression parameters, rotation and LBP, Local Binary Pattern. Dense, global nonrigid model-based bundling.. Evaluation based on visual comparison.. C. Cao & K. Zhou: blendshape and comparison data. V. Blanz, T. Vetter & O. Alexander: face data. A. Dai: voice. D. Ritchie: video reenactment.. [43]. X2Face: A network for controlling face generation by using images, audio, and pose codes (Jul 2018). Video frames (differed factors of variation) and audio features deatures. X2Face: embedding network (U-Net and pix2pix) and driving network (encoderdecorder).. Mean absolute error (MAE) in degrees for head pose regression of 9.36.. Train: VoxCeleb, AFLW and LRW. Test: VoxCeleb. [44]. Synthesizing Obama: learning lip sync from audio (Jul 2017). Audio MFCCs, 25ms sliding window with 10ms sampling interval. LSTM techniques (60 LSTM nodes, 20 step time delay). Evaluation based on visual comparison.. 14 hours of Barack Obama’s videos. You said that? (May 2017). 112x112x3 Images and Audio MFCC (0.35 second audio with 100Hz sampling rate). Speech2Vid: Encoderdecoder CNN model that uses joint embedding of face and audio(audio encoder, entity encoder and image encoder).. Evaluation based on visual comparison.. VoxCeleb and LRW. [46]. Talking Face Generation by Conditional Recurrent Adversarial Network (Apr 2018). Audio MFCC (350m s) and lip shade frames cropped (128x128). Audio encoder, image encoder, image discriminator, image decoder architectures. All of them constructed by convolutional or deconvolutional networks.. [47]. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (Jul 2018). Face from video frames (256x256) and Audio MFCC (sampling rate of 100Hz). DAVS system: Three encoders based on VGG-M, FAN [49] and [50], respectively. Decoder contains 10 convolution layers.. [45]. 
Peak signal-to-noise ratio (PSNR) of 27.43, Structural Similarity index (SSIM) of 0.918. Lipreading accuracy of 63% in Top5 and Landmark Distance Error (LMD) of 3.14 For audio approach, PSNR of 26.7 and SSIM of 0.883. For video approach, PSNR of 26.8 and SSIM of 0.884.. TCD-TIMIT, VoxCeleb and LRW. LRW and MSCeleb-1M.

(61) 27. [1]. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models (May 2019). K frames and face landmarks. Three networks: Image Embedder, Generator and Discriminator.. For K=32, a Frechetinception distance (FID) of 30.6, a SSIM of 0.72, a cosine similarity (CSIM) of 0.45 and a user accuracy of detecting fake ones of 0.33%.. Table 2.1: DL and image synthesis summary. VoxCeleb1 and VoxCeleb2.
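Several of the works summarized in Table 2.1 build their audio features from short overlapping frames, for example a 25 ms Hamming window moved in 10 ms steps. The following is a minimal sketch of that framing step only, written in plain Python; the 16 kHz sample rate, the synthetic tone and the function names are illustrative assumptions and are not taken from any of the cited papers. A spectrogram or MFCC pipeline would then apply a Fourier transform to each windowed frame.

```python
import math


def hamming_window(length):
    """Return the Hamming window w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(samples, sample_rate, window_ms=25.0, step_ms=10.0):
    """Split samples into overlapping frames and apply a Hamming window.

    Trailing samples that do not fill a complete window are dropped.
    """
    window_len = int(round(sample_rate * window_ms / 1000.0))
    step_len = int(round(sample_rate * step_ms / 1000.0))
    window = hamming_window(window_len)
    frames = []
    for start in range(0, len(samples) - window_len + 1, step_len):
        chunk = samples[start:start + window_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# Hypothetical input: one second of a 440 Hz tone sampled at 16 kHz.
SAMPLE_RATE = 16000  # assumed value, not taken from Table 2.1
signal = [math.sin(2.0 * math.pi * 440.0 * t / SAMPLE_RATE)
          for t in range(SAMPLE_RATE)]
frames = frame_signal(signal, SAMPLE_RATE)
print(len(frames), len(frames[0]))
```

With these assumed settings each frame holds 400 samples and consecutive frames overlap by 240 samples, which is why a 10 ms step corresponds to the 100 Hz feature rate quoted for several entries in the table.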


Chapter 3. Development setup

This chapter gives an overview of the proposed system architecture, covering the main development configurations, frameworks and tools used to carry out the project. Deep learning on large amounts of data requires massive processing power and the ability to handle different data layers, so in recent years researchers have implemented several solutions to make development easier. Some of these solutions have attracted great interest among developers all over the world, providing them with tools that enable both research and production of deep learning applications.

3.1. Federated Learning

As already noted, tasks carried out with deep learning algorithms need massive processing power and a huge amount of data. To free up Central Processing Unit (CPU) cycles for jobs that do not involve graphical and mathematical computations, a Graphics Processing Unit (GPU) with the Nvidia CUDA toolkit should take over that workload. Nvidia CUDA-X [51] is a software stack for developers that provides a way to build high-performance GPU-accelerated applications, taking advantage of optimizations such as mixed-precision computation on Tensor Cores and accelerating a set of models. One of its libraries worth mentioning is cuDNN [52], which provides highly tuned implementations of routines for deep neural networks, such as forward convolution and pooling.

Traditional deep learning tasks involve uploading data to a server and using it to train a model. In our case, the complexity of the chosen algorithms and of the task itself, which needs a large and representative collection of samples, means that traditional ways of processing and training are not enough. In recent years devices with enormous amounts of storage space have become available to researchers and developers, but it never seems to be enough. It is also common for data to be spread across different devices, so that a lot of time and power must be spent centralizing it on the single device that will train the model. Centralizing personally identifiable information obtained from different users can also become a privacy problem, which might be our case, since faces of different people from around the world are used to train the model. These problems of data quantity and quality cannot be solved with the traditional approach of training machine learning models centrally. This is where Federated Learning comes in.

Figure 3.1: Federated learning general representation [53].

Federated learning is a training technique that enables several devices to learn collaboratively from the same shared model. The model is first trained on a server using the data stored there, and every other device then downloads it and improves it using its own local data. The changes each device makes to the model are sent back to the main server, where they are averaged to obtain a combined model. Figure 3.1 shows a generalized representation of this process: the phone trains the model locally (A), many other devices create updates (B), and these updates are averaged to form a change to the shared model (C).

With several devices available, the proposed decentralized system architecture, shown in Figure 3.2, consists of the devices and configurations described below. It is based on the implementation developed in [54], which presents a method for federated learning of networks based on iterative model averaging and shows that it can be made practical with few rounds of communication between devices. This configuration allows faster deployment and testing of the project’s model while consuming less power and time.

Figure 3.2: Proposed federated learning system architecture.
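The averaging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: model parameters are represented as flat lists of floats, and each device's contribution is weighted by its number of local training samples, which is a common choice for iterative model averaging but is not specified in the text. The function name and the example values are hypothetical.

```python
def federated_average(updates, num_samples):
    """Average per-device parameter vectors, weighted by local sample count.

    updates: list of equal-length lists of floats, one per device.
    num_samples: number of local training samples behind each update.
    """
    if not updates or len(updates) != len(num_samples):
        raise ValueError("need one sample count per update")
    total = float(sum(num_samples))
    if total <= 0:
        raise ValueError("total sample count must be positive")
    size = len(updates[0])
    averaged = [0.0] * size
    for params, n in zip(updates, num_samples):
        if len(params) != size:
            raise ValueError("all updates must have the same length")
        for i, value in enumerate(params):
            # Each device contributes in proportion to its share of the data.
            averaged[i] += value * (n / total)
    return averaged


# Hypothetical updates from three devices for a two-parameter model.
device_updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
device_samples = [100, 100, 200]
global_params = federated_average(device_updates, device_samples)
print(global_params)
```

Passing equal sample counts reduces this to a plain arithmetic mean, which may be the more appropriate reading when all devices hold similar amounts of data.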

Since the database, which is described later, contains a huge amount of data and the time available to develop the project was limited, six different devices on the same network were used in the data preparation step. In the architecture representation, four devices equipped only with a CPU were used for this pre-processing and storage, sending the resulting data through a virtual link to both the main server and a second server. These two servers have two GPUs and one GPU, respectively, and each one trains the model and saves its local updates. At regular intervals, the second server sends its model update to the main server, which averages it with its own to create an update for the global model, which is also stored on the main server. Table 3.1 summarizes the hardware setup of the proposed architecture used for the development of this project.

| Device pseudonym | Processor | OS version | GPU |
|---|---|---|---|
| PC 1 | Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz | Ubuntu 18.04.1 LTS | - |
| PC 2 | Intel(R) Core(TM) 2 Quad CPU Q6600 @ 2.40GHz x 4 | Ubuntu 18.04.4 LTS | - |
| PC 3 | Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz x 8 | Ubuntu 18.04.1 LTS | - |
| PC 4 | Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz x 8 | Ubuntu 16.04.6 LTS | - |
| SERVER 1 | Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz x 20 | Ubuntu 18.04.4 LTS | 2 units: GeForce RTX 2080 Ti/PCIe/SSE2 |
| SERVER 2 | Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz x 8 | Ubuntu 18.04.3 LTS | GeForce GTX 1080 Ti/PCIe/SSE2 |

Table 3.1: Hardware used in the development of the Master’s thesis.

As mentioned, both training machines, SERVER 1 and SERVER 2, have GPUs, which are accessed through CUDA toolkit versions 10.1 and 10.2, respectively.
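The periodic synchronization between the two servers can be illustrated with a toy simulation. The sketch below is not the thesis implementation: it replaces the GAN with a one-parameter model y = w * x, uses a hypothetical synchronization interval, weights both servers equally, and assumes that both servers resume training from the averaged model after each synchronization, a detail the description above leaves open.

```python
def local_step(w, data, lr):
    """One full-batch gradient step for the model y = w * x, squared error."""
    grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def train_two_servers(data_main, data_second, steps, sync_every, lr=0.05):
    """Train on two servers and average their weights every `sync_every` steps."""
    w_main = w_second = w_global = 0.0
    for step in range(1, steps + 1):
        # Each server trains on its own local data.
        w_main = local_step(w_main, data_main, lr)
        w_second = local_step(w_second, data_second, lr)
        if step % sync_every == 0:
            # The second server sends its weight; the main server averages
            # it with its own and stores the result as the global model.
            w_global = 0.5 * (w_main + w_second)
            # Assumption: both servers resume from the averaged model.
            w_main = w_second = w_global
    return w_global


# Hypothetical local datasets, both generated from y = 3x.
main_data = [(1.0, 3.0), (2.0, 6.0)]
second_data = [(0.5, 1.5), (1.5, 4.5)]
w = train_two_servers(main_data, second_data, steps=200, sync_every=10)
print(round(w, 3))
```

A larger `sync_every` reduces communication between the servers at the cost of letting their local models drift further apart between averages, which is the trade-off that iterative model averaging is designed to manage.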