Pix2Pitch: Generating music from paintings by using Conditionals

Academic year: 2022

Escuela Técnica Superior de Ingenieros Informáticos

Universidad Politécnica de Madrid

Pix2Pitch: Generating music from paintings by using Conditionals GANs

Master Thesis

University Master in Artificial Intelligence

AUTHOR: Elena Rivas Ruzafa
SUPERVISOR: Francisco Serradilla García

July 2020

Abstract

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have been extensively used to transform and create images or sounds within their own domains, but translation between different modalities has been far less explored. This work proposes a method to generate sound from images, based on the Pix2Pix architecture (Isola et al., 2017), a conditional GAN designed for general-purpose image-to-image translation. A new implementation that allows music to be created from images has been developed.

The main goal is to create music that describes specific paintings and to answer the question: how does that image sound? The answer could be useful to blind people in several settings, such as museums. To this end, the work takes into account several theses that posit an interaction between visual art and music, as well as several works that study synesthetic experiments.

The process involves three steps: first, labeling and pairing images and sounds from different styles and periods; second, extracting common features from the data by exploring multiple methods for music feature extraction; and third, introducing multimodal layers into the GAN. Finally, a method to create novel pieces of music from the generated sound features has been implemented.

As presented in the state-of-the-art section, some advances in cross-modal generation have been achieved, but most of them focus on creating images from sound or images from text; only a few explore image-to-sound transformations.


Contents

Abstract

1 Introduction
1.1 Cross-modal learning
1.2 Music-Image paired dataset
1.3 Music Representation
1.4 Creativity
1.5 Control Generation
1.6 Motivations
1.7 Objectives
1.8 Structure of the document

2 State of the Art
2.1 Music Generation
2.1.1 Related Works
2.1.1.1 Recurrent Neural Network (RNN)
2.1.1.2 Long Short-Term Memory (LSTM)
2.1.1.3 Auto-Encoders (AE)
2.1.1.4 Convolutional Architectures
2.1.1.5 Conditional Architectures
2.1.1.6 Generative Adversarial Networks (GAN)
2.1.2 Audio Representation
2.1.2.1 Audio
2.1.2.2 Piano Roll
2.1.2.3 MIDI
2.1.2.4 Transformed Representation
2.2 Image Generation
2.2.1 Related Works
2.2.1.1 Style Transfer Painting Generation System
2.2.1.2 Creative Adversarial Networks (CAN)
2.2.1.3 Stacked Generative Adversarial Networks (SGAN)
2.2.1.4 Pix2Pix Generative Adversarial Network
2.2.1.5 Cyclic Generative Networks (CycleGANs)
2.2.1.6 Spectral Normalization GAN (SN-GAN)
2.3 Intersections between Image and Sound
2.3.1 Related Works
2.3.1.1 Audio-Visual Embedding Network (AVE-Net) and Audio-Visual Object Localization Network (AVOL-Net)
2.3.1.2 Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
2.3.1.3 Multimodal Deep Learning
2.3.1.4 SoundNet
2.3.1.5 Generating Images from Sounds using Multi-modal features and GANs
2.3.1.6 Objects that Sound
2.3.1.7 Discover Cross-Domain Relations with DiscoGANs
2.3.1.8 Others works not published
2.3.2 Cross-Modal similarities in Music and Visual Art

3 Application
3.1 Architecture
3.1.1 Generator
3.1.2 Discriminator
3.1.3 Training
3.1.3.1 Configurations and hyperparameters
3.2 Data Used
3.2.1 Images
3.2.2 Music
3.3 Results
3.3.1 Comparison between models with different hyperparameters
3.3.2 Variations due to resolution adjustments
3.3.3 Performance of the model according to the label
3.3.4 How the model behaves with images from other styles, photographs or sketches that are not included in the training
3.4 Validation Methodology
3.4.1 Quantitative Metrics
3.4.2 Qualitative Metrics

4 Conclusions and Future Work
4.1 Quality of the generated Audio
4.2 Other strategies with GANs
4.3 Wider explorations
4.4 Real world implementations

A Other Figures

List of Figures

2.1 Graphical representations of DeepBach’s neural network architecture. Source: Hadjeres et al., (2017).
2.2 DeepHear Stacked Auto-encoder Architecture. Source: Sun, (2017).
2.3 Difference between autoencoder (deterministic) and variational autoencoder (probabilistic).
2.4 Deep Autoencoder Models. Source: Ngiam et al., (2011).
2.5 Overview of the residual block and the entire architecture of WaveNet. Source: Van den Oord et al., (2016).
2.6 Conditional Adversarial Network (cGAN) Architecture.
2.7 System diagram of the proposed MidiNet model for symbolic-domain music generation. Source: Yang et al., (2017).
2.8 MelGAN model architecture. Source: Kumar et al., (2019).
2.9 Summary of architectures for music generation.
2.10 Audio Representations: on the top, a digital audio signal as a waveform and below, in order, STFT, mel-spectrogram, CQT, and a Chromagram. Source: Kenwoo et al., (2018).
2.11 Style transfer Painting Generation System. Source: Gatys et al., (2015).
2.12 Block diagram of the CAN system. Source: Elgammal et al., (2017).
2.13 Example applications developed with pix2pix codebase. Source: Isola et al., (2017).
2.14 CycleGAN examples. Source: Zhu et al., (2017).
2.15 A comparison of generated building images for label-to-building transformation. From left to right: the Input, GAN, Pix2Pix, DualGAN, CycleGAN, PS2GAN, CSGAN, CDGAN and original image. Source: Kancharagunta et al., (2020).
2.16 Conditional GAN Architecture in SN-GAN proposal. Source: Mufti et al., (2020).
2.17 AVE-Net and AVOL-Net architectures. Source: Arandjelovic and Zisserman, (2018).
2.18 Fused audio-visual network. Source: Owens and Efros, (2018).
2.19 DiscoGAN: Discovering relations of images from visually very different object classes. Source: Kim et al., (2017).
3.1 Visual representation of the complete original Pix2Pix architecture that transforms satellite images into their corresponding Google Maps pages. This architecture has had multiple different applications in addition to this.
3.2 Differences between the standard Encoder-Decoder structure and the U-Net. Source: Isola et al., (2017).
3.3 Visual representation of the Generator U-NET structure in this project.
3.4 Visual representation of the Discriminator architecture in this project.
3.5 Patch size variations showing how smaller patches have better results in image generation. Source: Isola et al., (2017).
3.6 Overall diagram of our model. Visual representation of the complete architecture of Pix2Pitch.
3.7 Errors for different iterations for the discriminator (d1, d2) and the Generator (g).
3.8 Left: Top 10 trainings ordered by decreasing generation (g) error for model A. Right: Top 10 trainings ordered by decreasing sumerror (d1+d2+g) for model A.
3.9 Left: Top 10 trainings ordered by decreasing generation (g) error for model A. Right: Top 10 trainings ordered by decreasing sumerror (d1+d2+g) for model E.
3.10 Left: Top 10 trainings ordered by decreasing generation (g) error for model F. Right: Top 10 trainings ordered by decreasing sumerror (d1+d2+g) for model F.
3.11 Summary of images and music pieces from five selected styles.
3.12 Comparison between the same WAV converted into spectrogram and mel-spectrogram.
3.13 Mel-spectrogram for the same clip of audio for 22050 Hz and 44000 Hz respectively.
3.14 Comparison for two different songs between waveform representation for original audio (left) and reverse audio (right) by using the Griffin-Lim algorithm.
3.15 Result from the generator: image source, generated spectrogram and expected spectrogram.
3.16 Differences between generation for images with contrast and edges and for blended images.
3.17 Differences between generated images across models. Those trained with 512 resolution (first and second) are able to generate creatively different pieces of music, while those trained with 256 resolution create only small variations over the same proposal. From left to right: image source, E1, E2, A1, A2, C1.
3.18 Differences between creations from models E1 and F1, where model F1, with a higher sample rate, is able to adapt better to more detailed songs with shorter notes.
3.19 Some tests with art styles not included in the training. Chosen artists: Andy Warhol, César Barrio, Sandra Gobet.
A.1 Summary of discriminator structure of Pix2Pitch based on Pix2Pix.
A.2 Summary of generator structure of Pix2Pitch based on Pix2Pix, part A.
A.3 Summary of generator structure of Pix2Pitch based on Pix2Pix, part B.
A.4 Process to transform music clips of 10 seconds into the data matrix that is input to the model. Creating a WAV from the output of the GAN is the result of reversing the process.
A.5 Three examples of images from WikiArt for each style after scaling to square format (Baroque, Classicism, Expressionism, Impressionism, Romanticism).
A.6 Results for images 105/510 for models A1, A2, C1 from Training Images (256x256).
A.7 Results for images 460/484 for models A1, A2, C1 from Training Images (256x256).
A.8 Results for images 010/900/400 for models E1, E2 from Training Images (512x512).
A.9 Results for images 003/700/800 for models F1, F2 from Training Images (512x512).

Chapter 1 Introduction

Generative Adversarial Networks (GANs) belong to the family of generative models. They produce new content through the competition between two networks with adversarial goals: a discriminator trained on images from a ‘real’ distribution, and a generator that, given random noise z, generates new images to try to trick the discriminator into classifying them as real. While the first GAN models (Goodfellow et al., 2014; Salimans et al., 2016) generated data from an input consisting of a latent vector, later developments have allowed GANs to learn how to translate from one domain to another by taking data from the source domain as conditional input (Mirza and Osindero, 2014).

1.1 Cross-modal learning

Although conditional GANs are very good at transforming between two domains, models that can transform between different modalities, such as image-to-sound or sound-to-image, are still in their infancy. Since the main goal of this work is to develop an audio-visual generative model able to create music linked to a specific type or class of images, we have to solve important issues in audio feature extraction, cross-modal conversion and conditional image/sound synthesis.

Recent work on cross-modal learning from images and audio has focused mainly on classification (Aytar et al., 2016; Arandjelović and Zisserman, 2017; Owens et al., 2016; Owens and Efros, 2018) rather than on generation (Wan et al., 2019; Lyu et al., 2019), which is our concern. Some of these examples are discussed in the following sections.


1.2 Music-Image paired dataset

To solve the problem of translating images into music, the architecture proposed in this project requires pairs of paintings and music pieces as input to the GAN. No such dataset exists in the literature or in other sources, so a process has been developed in this project to match music pieces and paintings based on style, period and location. Two different data sources were used: by analyzing the author (artist or composer) and their particularities, the pieces were first classified and labeled, and then paired. The pairing is not one-to-one, since multiple music pieces can be paired with multiple paintings.
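The many-to-many pairing described above can be sketched as follows. This is only an illustration, not the pipeline used in the thesis: the record layout, the file names and the function name `pair_by_style` are hypothetical, and only the style label is used as the matching key, whereas the text also mentions period and location.

```python
from collections import defaultdict

def pair_by_style(paintings, pieces):
    """Pair every painting with every music piece that shares its style label.

    Both arguments are lists of (identifier, style) tuples; this layout is
    an assumption made for the example.
    """
    music_by_style = defaultdict(list)
    for piece_id, style in pieces:
        music_by_style[style].append(piece_id)

    pairs = []
    for painting_id, style in paintings:
        # Many-to-many: one painting may be paired with several pieces.
        for piece_id in music_by_style.get(style, []):
            pairs.append((painting_id, piece_id))
    return pairs

# Hypothetical toy catalogue
paintings = [("p1.jpg", "Baroque"), ("p2.jpg", "Romanticism")]
pieces = [("m1.wav", "Baroque"), ("m2.wav", "Baroque"), ("m3.wav", "Impressionism")]
pairs = pair_by_style(paintings, pieces)
```

In this toy catalogue the Baroque painting is paired with both Baroque pieces, while the Romanticism painting and the Impressionism piece remain unpaired because no counterpart shares their label.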

1.3 Music Representation

Another issue to be solved is how to translate musical information into a representation the networks can process without losing too much information. In recent years two main strategies have been commonly used: waveforms and spectrograms. Waveforms have the advantage that the information is untransformed and keeps its full resolution. In fact, the original 1-dimensional representation has been proven to work for training neural networks (Oord et al., 2016a; Mehri et al., 2017) and is attractive because it needs no engineered feature representation. However, learning from raw audio signals requires a larger dataset, and such models are slow at generation time and highly demanding in terms of both memory and processing.

Thus most deep learning approaches take advantage of 2-dimensional representations (Choi et al., 2018). These representations are treated as images and have worked well with image recognition algorithms for audio tasks in the discriminative setting (Hershey et al., 2017). In generative models, however, some issues arise, since the most informative spectrograms are non-invertible and cannot be converted back to audio without loss (Griffin and Lim, 1984; Shen et al., 2018; Donahue et al., 2019). Audio feature extraction is a widely explored problem. In this project we use the log-mel-spectrogram to represent the music, and we define a method to convert the results into a WAV file.
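To make the "log-mel" part of this representation concrete, the sketch below shows the frequency warping and the log compression behind it. It assumes the commonly used HTK-style mel formula, m = 2595 log10(1 + f/700); the thesis does not state which variant it uses, and the function names, the number of bands and the frequency range are illustrative choices only. A full log-mel-spectrogram additionally requires an STFT and a triangular filterbank, which are omitted here.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centres(n_bands, f_min, f_max):
    """Centre frequencies (Hz) of n_bands bands equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_bands)]

def log_compress(power, eps=1e-10):
    """Log-compress a non-negative power value, expressed in decibels."""
    return 10.0 * math.log10(max(power, eps))

# Illustrative values: 8 bands up to the Nyquist frequency of 22050 Hz audio.
centres = mel_band_centres(n_bands=8, f_min=0.0, f_max=11025.0)
```

Bands that are equally spaced in mels are narrow at low frequencies and wide at high frequencies, which is why the mel representation keeps more detail in the range where pitch differences are most audible.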


1.4 Creativity

In recent years GANs have had great success at generating high-resolution images (Gulrajani et al., 2017; Arjovsky et al., 2017; Berthelot et al., 2017; Karras et al., 2018; Miyato et al., 2018). GANs have also unlocked interesting domain transformations for images that have analogous applications in audio (Zhu et al., 2017; Engel et al., 2019).

But the application of deep learning to music or image generation is limited when the generated results merely imitate the training set. Since machine-learning-based generation automatically learns the structure and style of a set of images or music pieces, its output mimics human creation without adding anything novel. There are, however, some works in music and image generation that explore possibilities beyond copying styles.

MidiNet (Yang et al., 2017), inspired by WaveNet (Oord et al., 2016a), is based on GANs and includes two methods to control creativity in music generation: first, inserting the conditioning data only in the intermediate convolution layers of the generator; and second, decreasing the values of the two control parameters of the feature matching regularization.

The Creative Adversarial Network (CAN) (Elgammal et al., 2017) is another interesting proposal in this regard. CAN is an extension of the GAN in which the generator receives not one but two signals from the discriminator. The first, as in a standard GAN, indicates whether the discriminator classifies the input as art or non-art; the second indicates how easily the discriminator can classify the generated item into established styles. To achieve this, the discriminator is trained on the WikiArt dataset and learns to discriminate between art and non-art, but it can also detect the style of an image through an art style classification function. When the generated image fits a particular style, a term called “style ambiguity” penalizes it, pushing the generator to produce works in styles different from those in the WikiArt dataset, that is, original pieces in new art styles.
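One way to read the style ambiguity signal is as a cross-entropy between the discriminator's predicted style distribution and a uniform distribution over the known styles. The sketch below illustrates that reading only; it is a simplified form and not necessarily the exact loss used by Elgammal et al., and the function name and the example probabilities are hypothetical.

```python
import math

def style_ambiguity(style_probs, eps=1e-12):
    """Cross-entropy between a uniform target and the predicted style posterior.

    style_probs: probabilities over K style classes (expected to sum to 1).
    The value is smallest, log(K), when the prediction is uniform.
    """
    k = len(style_probs)
    return -sum((1.0 / k) * math.log(max(p, eps)) for p in style_probs)

confident = style_ambiguity([0.97, 0.01, 0.01, 0.01])  # clearly one style
ambiguous = style_ambiguity([0.25, 0.25, 0.25, 0.25])  # no style stands out
```

A generator that minimizes this quantity is rewarded when the style classifier cannot commit to any single established style, which matches the behaviour described above.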


In addition, other approaches explore the creativity space through evolutionary processes, using genetic algorithm frameworks, but they are not studied in this Master’s thesis.

1.5 Control Generation

As opposed to Markov models, where constraints can be attached to the internal operational structure in order to control generation, standard neural networks are not designed to be controlled. This makes it difficult for artists to adapt their ideas to automatic modelling with specific goals. There are several strategies for controlling deep learning generation (Briot et al., 2019), but in this work we focus on conditioning, which consists of conditioning the architecture on some extra information such as a class label or data from other modalities. The conditioning information is usually fed into the architecture as an additional input layer.

There are some interesting examples in music that use this approach. The WaveNet architecture (Oord et al., 2016a) is a convolutional feedforward network that uses conditioning as a way to guide generation, for example by adding linguistic features for better speech generation or by setting tags that specify genre or instrument in music generation.

Anticipation-RNN (Hadjeres and Nielsen, 2017), applied to Bach melody generation, proposes a system for generating melodies that forces a given note at a given time position to have a given value. The idea is to condition the recurrent network (RNN) on information summarizing the set of constraints, so that it anticipates upcoming constraints and generates notes with a correct distribution.

In image generation there are also some interesting conditional architectures that allow the generation to be controlled. In this work we take advantage of conditional GANs, specifically the Pix2Pix model (Isola et al., 2017).


1.6 Motivations

Given the previous sections, the main motivations that drive this project are the following:

• Most research in music generation focuses on generating random audio from unlabeled training data. In this work we focus on generating audio from a paired image-sound dataset, in order to build a process for music composition based on historical knowledge of key composers from all styles.

• We take into account how style evolves over time and how events or social context in history can explain different qualities of art. Since the human creative process draws on prior experience, we believe that an important element of art-generating algorithms is to learn what humans have produced throughout time, as humans do in their own learning process.

• With this in mind, we propose a methodology that allows novel pieces of music to be created. Since the music is generated from historical images, if the machine sees a contemporary painting in a new style, the desired result is a piece of music in an unknown style that builds on the evolution of art creation.

• We base our work on conditional architectures to control this generation process.

1.7 Objectives

The objectives of this Master’s thesis follow directly from its motivations:

• To provide a comprehensive state of the art and present the current approaches that have faced similar problems in music and image generation.

• To create a specific process, based on the state of the art, over datasets of paired music pieces and paintings from 5 different styles.

• To demonstrate that conditional GANs produce reasonable results and to present a framework that can embed audio and visual inputs to achieve good results.

• To define an efficient audio synthesis process that allows network results to be inverted into music pieces.

1.8 Structure of the document

This document is organized as follows. First, a state of the art describes the latest techniques and approaches in music and image generation. That chapter introduces the general context of deep-learning-based music generation, identifies the main challenges and includes a comparison with related work. Next, a methodology applied to real datasets is presented together with its results. Finally, conclusions and future improvements are set out.


Chapter 2 State of the Art

Since we face a problem at the intersection of image and music generation, this state of the art needs to delve into previous work in both areas to understand the cutting-edge techniques that can help to solve the problem of generating music from images. In the current work we have taken inspiration from successes in image generation and adapted the methods to generate audio instead. In fact, our proposal is based on an architecture originally developed for image-to-image translation: the Pix2Pix conditional GAN. This requires substantial adaptation, which involves not only understanding this complex architecture but also defining specific methods to process, represent and generate music data, a task that differs enormously from image generation.

As will be presented, the main advances in cross-modal generation, which is the problem we have to solve, concern creating images from sound or images from text (Reed et al., 2016); only a few incipient works that create music from images have been found. This project proposes a methodology that results from combining solutions applied in different contexts to solve a very specific problem: creating a piece of music from an image.

To structure this state-of-the-art we are going to present the content in three main parts: Music Generation, Image Generation and Intersection between image and music.


2.1 Music Generation

We have found several surveys and analyses of deep learning techniques for generating musical content, in papers and books, which offer a taxonomy of music generation systems and AI-based methods (Carnovalini and Rodà, 2020; Briot et al., 2019; Herremans et al., 2018) and describe how music generation problems have been solved in recent years.

The algorithms and techniques applied to music generation can be grouped into the following seven categories:

• Markov Chains

• Formal Grammars

• Rule/Constraint-based systems

• Neural Networks/Deep Learning

• Evolutionary/Genetic algorithms

• Chaos/Self Similarity

• Agent-based systems

In this Master’s thesis we focus on the Neural Networks/Deep Learning category, but for future implementations it could be interesting to take some of the other approaches into account.

In this regard, multiple architectures have been created for automatic music generation, from feedforward neural networks to architectures that include convolutions, conditioning, adversarial networks or auto-encoders. Currently most solutions are compound, combining several of these architectures. In the following we present the main solutions in use.


2.1.1 Related Works

2.1.1.1 Recurrent Neural Network (RNN)

This solution is important to note, since most neural network models for music generation proposed up to 2017 included recurrent neural networks. A Recurrent Neural Network (RNN) is a feedforward neural network extended with recurrent connections. In addition to the usual feed-forward pass, the network feeds its hidden state back into itself at each step, so it has a sort of memory: past decisions can affect its response to new data, much as they do when composing music. During training, one element of a sequence is presented as the input and the actual next element as the target output, so that the recurrent network learns to predict the next element of the sequence. In this way the network learns not only from the current data but also from previous data, which allows it to learn time series.
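The memory effect described above can be illustrated with a single recurrent unit. This is a minimal sketch, not a trained model: the weights are arbitrary values chosen for the example, and the names `rnn_step` and `run_rnn` are hypothetical.

```python
import math

def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One step of a single-unit recurrent cell: h_t = tanh(w_in*x + w_rec*h_prev + b)."""
    return math.tanh(w_in * x + w_rec * h_prev + bias)

def run_rnn(sequence, w_in=0.5, w_rec=0.9, bias=0.0):
    """Run the cell over a sequence and return the hidden state after each step."""
    h = 0.0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states

# Both sequences end with the same inputs (0.0, 0.0), yet the final hidden
# states differ because the second sequence started with a non-zero input.
after_silence = run_rnn([0.0, 0.0, 0.0])[-1]
after_note = run_rnn([1.0, 0.0, 0.0])[-1]
```

The trace left by the first input shrinks at every step, which also hints at how such a network can gradually lose its knowledge of the past.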

The main problem with RNNs is that they can easily lose their knowledge of the past if the weights in the feedback loops are not well adjusted, which makes the generated music less interesting.

These RNN-based models and their variants differ in their model assumptions and in the way musical events are represented and predicted, but they all use information from previous events to condition the generation of the present one. Some examples are the Melody-RNN models for symbolic-domain generation (MIDI) and the Sample-RNN model (Mehri et al., 2017) for audio-domain generation (WAV). Another known example is Johnson’s Hexahedria architecture (Johnson, 2015), which combines two layers recurrent on the time dimension with two other layers recurrent on the pitch dimension, as an integrated alternative to the RNN-RBM architecture.

2.1.1.2 Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) was introduced more than two decades ago. It protects RNNs against memory loss by gating the results accumulated by the network and keeping them in a memory cell, as we do when creating music.
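The gating idea can be sketched with a single-unit LSTM cell in its commonly used form with forget, input and output gates. The weights below are hypothetical and hand-picked, not learned, so that the forget gate stays open and the input gate stays closed; the parameter layout is an assumption made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a single-unit LSTM cell.

    params maps each gate name ('f', 'i', 'o', 'g') to a tuple
    (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))      # forget gate: how much old memory to keep
    i = sigmoid(pre("i"))      # input gate: how much new content to write
    o = sigmoid(pre("o"))      # output gate: how much memory to expose
    g = math.tanh(pre("g"))    # candidate content
    c = f * c_prev + i * g     # updated memory cell
    h = o * math.tanh(c)       # updated hidden state
    return h, c

# Hypothetical weights: forget gate held open, input gate held closed,
# so the stored value survives many steps almost unchanged.
keep = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
        "o": (0.0, 0.0, 0.0), "g": (1.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_step(0.5, h, c, keep)
```

After 50 steps the memory cell still holds almost all of its initial value, whereas the trace in a plain recurrent unit would have faded.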


Figure 2.1: Graphical representations of DeepBach’s neural network architecture. Source: Hadjeres et al., (2017).

The DeepBach architecture (Hadjeres et al., 2017) combines two recurrent networks (LSTMs) and two feedforward networks, as can be seen in Figure 2.1. As opposed to the standard use of recurrent networks, where a single time direction is considered, DeepBach takes both directions into account, forward and backward in time. Therefore two LSTMs are used, one summing up past information and another summing up information coming from the future, together with a non-recurrent neural network for notes occurring at the same time.

2.1.1.3 Auto-Encoders (AE)

An auto-encoder is a neural network with one hidden layer in which the number of output nodes equals the number of input nodes, so that the output layer mirrors the input layer. It works by compressing the input into a latent-space representation and then reconstructing the output from this representation.

Figure 2.2: DeepHear Stacked Auto-encoder Architecture. Source: Sun, (2017).

A more complex solution, very useful for music generation, is the stacked auto-encoder architecture, in which successive auto-encoders are hierarchically nested with hidden layers of decreasing size. The chain of encoders increasingly compresses the data and extracts higher-level features, which is why stacked auto-encoders are used for feature extraction.

An example of this architecture was implemented in DeepHear (Sun, 2017), a system that applies a decoder feedforward strategy to a stacked auto-encoder architecture (4 stacked layers with a decreasing number of hidden units) and is aimed at generating ragtime jazz melodies (Figure 2.2). In addition to generating new melodies, DeepHear is used to harmonize a melody, using the same architecture and what has already been learnt.

The idea of encapsulating two identical recurrent networks (RNNs) into an auto-encoder was initially proposed by Cho et al. (2014) as the RNN Encoder-Decoder, a technique to encode a variable-length sequence learnt by one recurrent network into another variable-length sequence produced by another recurrent network. The motivation and target application is translation from one language to another, which results in sentences of possibly different lengths. One limitation of the RNN Encoder-Decoder approach is that the summary has difficulty memorizing very long sequences.

Figure 2.3: Difference between autoencoder (deterministic) and variational autoencoder (probabilistic).

Figure 2.4: Deep Autoencoder Models. Source: Ngiam et al., (2011).

The Variational Autoencoder (VAE) is a probabilistic version of the auto-encoder (Figure 2.3). This solution is getting increasing attention because of its built-in generative capacity. Deep auto-encoders have achieved improved image classification performance for data with additional sound input by learning cross-modal representations (Ngiam et al., 2011). The multimodal features generated by this bimodal deep auto-encoder can also provide significant information for noisy speech classification tasks.

These results indicate that such multimodal features can capture meaningful information about both sounds and images. However, the learned representations were only partly invariant to the input modality, suggesting that the shared repre- sentations could be improved.

Figure 2.5: Overview of the residual block and the entire architecture of WaveNet. Source: Van den Oord et al., (2016).

2.1.1.4 Convolutional Architectures

Convolutional architectures (Lecun and Bengio, 1998) are commonly used for image applications. The main idea is to learn not from individual elements (pixels) but from an area of nearby elements (the convolution window), while the same connection weights are shared as this area slides over the input.

Convolutional architectures are less frequent in music generation, although they have been included in various solutions in recent years. This is because, in music, motives are not invariant in all dimensions.

For musical applications it can be interesting to apply convolutions along the time dimension in order to model temporally invariant motives. This convolutional approach is actually the basis of time-delay neural networks. The convolution operation allows a network to share parameters across time, as RNNs do, but RNNs remain much more frequent than convolutional architectures in musical applications.

A milestone in sound generation is WaveNet (Oord et al., 2016a), an audio generative model based on the PixelCNN architecture (Oord et al., 2016b) that shows how an image approach can succeed in generating wideband raw audio waveforms. The main particularity of WaveNet is its use of causal convolutions (Figure 2.5), which ensure that the model cannot violate the order in which the data is generated: the prediction at a given time step cannot depend on future time steps.

For images, the equivalent of a causal convolution is a masked convolution. WaveNet was presented as a stack of causal convolutional layers, somewhat analogous to recurrent layers, forming a system for generating raw audio waveforms that performs much better than traditional parametric vocoders.

Because models with causal convolutions do not have recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences. But one of the problems of causal convolutions is that they require many layers or larger filters to increase the receptive field.
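The following sketch illustrates a causal convolution and the receptive-field limitation mentioned above. It is a plain-Python illustration rather than WaveNet itself; the function names are hypothetical. The `dilation` argument is included because WaveNet is known to widen its receptive field with dilated convolutions, a detail the paragraph above does not cover.

```python
def causal_conv1d(signal, kernel, dilation=1):
    """Causal 1-D convolution: output[t] depends only on signal[t], signal[t-d], ...

    kernel[0] weights the current sample and kernel[j] the sample j*dilation
    steps in the past. Missing past samples are treated as zero (left padding).
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input samples seen by one output of a stack of causal layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
response = causal_conv1d(impulse, kernel=[0.5, 0.5])
plain = receptive_field(2, [1, 1, 1, 1])     # four undilated layers
dilated = receptive_field(2, [1, 2, 4, 8])   # four layers with doubling dilation
```

The response to the impulse is zero before the impulse arrives, which is the causality property. With kernel size 2, four undilated layers see only 5 samples, while doubling the dilation at each layer reaches 16 samples with the same number of layers.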

2.1.1.5 Conditional Architectures

In these models the architecture is conditioned on some extra information, such as a class label or data from other modalities. The objective is to have some control over the data generation process. In practice, the conditioning information is usually fed into the architecture as an additional input layer. As we will see in the following sections, this approach is commonly used in architectures that combine it with other strategies already presented.

It is possible to further combine architectures that are already compound. For example, the VRASH architecture (Yamshchikov and Tikhonov, 2017) is a variational auto-encoder that encapsulates RNNs, with the decoder conditioned on history. VRASH thus combines a variational auto-encoder, recurrent networks and conditioning (on the decoder).
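A minimal sketch of feeding a class label as additional input is shown below, assuming the simplest scheme in which a one-hot encoding of the label is concatenated with the other input features. The function names and the noise values are hypothetical; the style names are the five styles listed in the thesis.

```python
def one_hot(label, classes):
    """Encode a class label as a one-hot list over the given classes."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    return [1.0 if c == label else 0.0 for c in classes]

def conditioned_input(features, label, classes):
    """Append the one-hot condition to the feature vector fed to the network."""
    return list(features) + one_hot(label, classes)

# The noise vector is a made-up example of an unconditioned input.
STYLES = ["Baroque", "Classicism", "Expressionism", "Impressionism", "Romanticism"]
noise = [0.12, -0.48, 0.93]
net_input = conditioned_input(noise, "Impressionism", STYLES)
```

The network then receives eight values instead of three, and the last five tell it which style the output should follow. Conditioning on data from another modality, such as an image, follows the same principle with a richer conditioning signal.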

2.1.1.6 Generative Adversarial Networks (GAN)

This architecture appears in most recent generative models for music and image generation. The main idea is to train two networks simultaneously: a generative model, whose objective is to transform random noise vectors into new samples, and a discriminative model, which estimates the probability that a sample came from the training data rather than from the generator. The generator learns to capture the data distribution, and the discriminator estimates the probability that a given sample comes from the data distribution rather than from the generator (Figure 2.6).

Figure 2.6: Conditional Adversarial Network (cGAN) Architecture.

In music generation, however, control over the results is required, as opposed to generating random results. Conditional GANs (Mirza and Osindero, 2014) are a good solution for controlling the produced music. They were introduced to make use of the additional information available in the form of labels. Instead of generating random data from the generator, the auxiliary information is passed to both the generator and the discriminator along with the input features, so that the generated data is conditioned on class labels.
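The way the auxiliary information reaches both networks can be sketched as follows. This is a minimal illustration, assuming a one-hot class label that is simply concatenated to the inputs; the single-layer "networks" with random weights, the sizes and all names are hypothetical stand-ins for real trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_DIM, NUM_CLASSES, SAMPLE_DIM = 16, 4, 32   # illustrative sizes only

def one_hot(label, num_classes=NUM_CLASSES):
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

# Hypothetical single-layer "networks" with random, untrained weights.
W_g = rng.normal(size=(NOISE_DIM + NUM_CLASSES, SAMPLE_DIM))
W_d = rng.normal(size=(SAMPLE_DIM + NUM_CLASSES,))

def generator(z, label):
    # The label is concatenated to the noise vector as an extra input.
    g_in = np.concatenate([z, one_hot(label)])
    return np.tanh(g_in @ W_g)

def discriminator(sample, label):
    # The discriminator receives the same label next to the sample it judges.
    d_in = np.concatenate([sample, one_hot(label)])
    return 1.0 / (1.0 + np.exp(-(d_in @ W_d)))   # probability of "real"

z = rng.normal(size=NOISE_DIM)
fake_a = generator(z, label=0)
fake_b = generator(z, label=3)          # same noise, different condition
p_real = discriminator(fake_a, label=0)
```

Because the label is part of the generator input, the same noise vector yields different samples for different labels, which is what gives control over the generated data.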

In recent years, multiple works have used GANs for music generation. They can be compared with autoregressive models such as WaveNet (Oord et al., 2016a), which model local structure at the expense of global latent structure and rely on slow iterative sampling. GANs, on the contrary, have global latent conditioning and efficient parallel sampling, but struggle to generate locally coherent audio waveforms. This is relevant because human perception is sensitive to both global and local structure and coherence. Most recent projects therefore try to find a solution that achieves both local and global coherence.

Figure 2.7: System diagram of the proposed MidiNet model for symbolic-domain music generation. Source: Yang et al., (2017).

MidiNet (Yang et al., 2017) is both an adversarial and a convolutional architecture for generating pop music melodies. This work is intended to provide a general, highly adaptive network structure for symbolic-domain music generation (Figure 2.7). The network takes random noise as input. Instead of creating a melody sequence continuously, it generates melodies one bar (measure) after another, in a successive manner. MidiNet can generate music with an arbitrary number of bars by concatenating them.

Regarding speech synthesis, two GAN models were recently presented: WaveGAN and SpecGAN (Donahue et al., 2019). WaveGAN works in the time domain and SpecGAN works in the frequency domain. WaveGAN, which is based on the DCGAN architecture (Salimans et al., 2016), can produce intelligible words from a small vocabulary of human speech, as well as synthesize audio from other domains such as bird vocalizations, drums, and piano. These sounds are recognizable by humans and obtain good scores, but the generated samples are completely random.

Hence, CWaveGAN (Lee et al., 2018) was developed. CWaveGAN focuses on making WaveGAN conditional and explores a way to generate audio samples conditioned on class labels. GANSynth (Engel et al., 2019) demonstrates that GANs can generate high-fidelity and locally coherent audio by modeling log magnitudes and instantaneous frequencies with sufficient frequency resolution in the spectral domain. In that work, GANs are able to outperform strong WaveNet baselines on automated and human evaluation metrics, and efficiently generate audio several orders of magnitude faster than their autoregressive counterparts.

Figure 2.8: MelGAN model architecture. Source: Kumar et al., (2019).

Building on previous work (Engel et al., 2019), MelGAN (Kumar et al., 2019) introduces a set of architectural changes and simple training techniques to generate high-quality coherent waveforms. This work successfully trains GANs for raw audio generation without additional distillation or perceptual loss functions, while still yielding a high-quality text-to-speech synthesis model (Figure 2.8). MelGAN is also substantially faster than other mel-spectrogram inversion alternatives: it is reported to be 10 times faster than the fastest available model, with little degradation in audio quality.

The most recently published work on music generation with GANs is Parallel WaveGAN (Yamamoto et al., 2020), an effective parallel waveform generation method based on a generative adversarial network (GAN). Unlike conventional distillation-based methods, Parallel WaveGAN does not require a two-stage training process. In this method, only a non-autoregressive WaveNet model is trained, by optimizing a combination of multi-resolution short-time Fourier transform (STFT) and adversarial loss functions that enables the model to effectively capture the time-frequency distribution of realistic speech waveforms. As a result, the entire training process becomes much easier than with conventional methods, and the model can produce natural-sounding speech waveforms with a small number of model parameters.
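A multi-resolution STFT loss of the kind mentioned above can be sketched in NumPy. This is an illustration under assumptions rather than the authors' implementation: each resolution combines a spectral-convergence term and a log-magnitude term, and the FFT sizes, hop sizes and function names are illustrative choices, not values taken from the paper.

```python
import numpy as np

def stft_magnitude(x, n_fft, hop):
    """Magnitude STFT with a Hann window; shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def stft_loss(real, fake, n_fft, hop, eps=1e-7):
    s_real = stft_magnitude(real, n_fft, hop)
    s_fake = stft_magnitude(fake, n_fft, hop)
    # Spectral convergence: relative Frobenius-norm error of the magnitudes.
    sc = np.linalg.norm(s_real - s_fake) / (np.linalg.norm(s_real) + eps)
    # Mean absolute difference between log-magnitudes.
    mag = np.mean(np.abs(np.log(s_real + eps) - np.log(s_fake + eps)))
    return sc + mag

def multi_resolution_stft_loss(real, fake,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    # Averaging over several (n_fft, hop) pairs trades off time and frequency detail.
    return float(np.mean([stft_loss(real, fake, n, h) for n, h in resolutions]))

rng = np.random.default_rng(0)
real = rng.normal(size=8000)                  # toy reference waveform
close = real + 0.01 * rng.normal(size=8000)   # nearly identical waveform
far = rng.normal(size=8000)                   # unrelated waveform

loss_same = multi_resolution_stft_loss(real, real)
loss_close = multi_resolution_stft_loss(real, close)
loss_far = multi_resolution_stft_loss(real, far)
```

In a training setup, a term of this kind would be added to the adversarial loss of the generator; here it is only evaluated on toy signals to show that it grows as the generated waveform departs from the reference.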

2.1.2 Audio Representation

In this section, we review several audio data representations in the context of machine learning methods. Representing image data is quite straightforward, but for music different representations can be used (the raw signal, a transformed signal such as a spectrum obtained via the Fourier transform, piano roll, MIDI, text...). The main differences between audio and images can be shown by analyzing their principal components. While the principal components of images generally capture intensity, gradient, and edge characteristics, those of audio form a periodic basis that decomposes the audio into constituent frequency bands.

2.1.2.1 Audio

The audio signal is often called raw audio, in contrast to other representations that are transformations of it. A digital audio signal consists of audio samples that specify the amplitude at each time step. Recently, one-dimensional convolutions have often been used to learn an alternative to existing time-frequency conversions (Lee and Nam, 2017). The advantage of using a waveform is that the raw material is considered untransformed, at its full initial resolution. The disadvantage is the computational load, since the low-level raw signal is demanding in terms of both memory and processing.

The audio signal has not been the most popular choice, because training a network directly on it requires an even larger dataset; instead, researchers have preferred 2D representations such as the STFT and mel-spectrograms.

2.1.2.2 Piano Roll

Figure 2.9: Summary of architectures for music generation.

The piano roll representation of a melody is inspired by automated pianos, which were driven by a continuous roll of paper with perforations in it. Each perforation represents a piece of note control information that triggers a given note. The length of the perforation corresponds to the duration of the note, and the position of the perforation corresponds to its pitch. The piano roll is one of the most commonly used representations, but it has limitations. Compared with the MIDI representation, it carries no note-off information. This means that there is no way to distinguish between a long note and a repeated short note.
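The ambiguity between a long note and a repeated short note can be made concrete with a small sketch. The note format (pitch, start step, duration in steps) and the helper name are hypothetical; pitch 60 is middle C in MIDI numbering.

```python
import numpy as np

def to_piano_roll(notes, n_steps, n_pitches=128):
    """notes: list of (pitch, start_step, duration_steps).
    Returns a binary matrix of shape (n_pitches, n_steps)."""
    roll = np.zeros((n_pitches, n_steps), dtype=np.uint8)
    for pitch, start, duration in notes:
        roll[pitch, start:start + duration] = 1
    return roll

one_long_note = [(60, 0, 4)]                  # middle C held for 4 steps
two_short_notes = [(60, 0, 2), (60, 2, 2)]    # middle C struck twice, 2 steps each

roll_long = to_piano_roll(one_long_note, n_steps=8)
roll_repeated = to_piano_roll(two_short_notes, n_steps=8)

# Both performances produce exactly the same matrix.
ambiguous = np.array_equal(roll_long, roll_repeated)
```

Since both matrices are identical, a model trained on this representation cannot recover which of the two performances was intended.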

2.1.2.3 MIDI

Musical Instrument Digital Interface (MIDI) is a technical standard that describes a protocol, a digital interface and connectors for interoperability between various electronic musical instruments, software and devices. MIDI carries event messages that specify real-time note performance data as well as control data. The main disadvantage of encoding MIDI messages directly is that this encoding does not preserve the notion of multiple notes being played at once through the use of multiple tracks.

2.1.2.4 Transformed Representation

Using transformed representations of the audio signal usually leads to data compression and higher-level information, but at the cost of losing some information and introducing some bias. These are the most common transformed representations used in generative models:

• Spectrogram. The fast Fourier transform (FFT) allows us to analyze the frequency content of a signal, but the frequency content of an audio signal varies over time. Such signals are known as non-periodic signals. To represent the spectrum of these signals as it varies over time, we compute several spectra by performing the FFT on several overlapping windowed segments of the signal. This is called the short-time Fourier transform (STFT), and its result is what is called the spectrogram. A spectrogram is a set of FFTs stacked on top of each other. It is a way to visually represent a signal’s loudness, or amplitude, as it varies over time at different frequencies. The y axis is converted to a log scale, and the color dimension is converted to decibels.

This is because humans can only perceive a very small and concentrated range of frequencies and amplitudes. The result is a visual representation of the spectrum in which the x axis represents time (in seconds), the y axis represents frequency (in kHz), and a third axis, shown as color, represents the intensity of the sound (in dBFS).

• Mel-spectrogram. This is a 2D representation that is optimized for human auditory perception. Studies have shown that humans do not perceive frequencies on a linear scale. We are better at detecting differences between lower frequencies than between higher frequencies: we can tell the difference between 500 and 1000 Hz, but we are hardly able to tell the difference between 10,000 and 10,500 Hz. In 1937, a unit of pitch was proposed such that equal distances in pitch sound equally distant to the listener; this is called the mel scale. The mel-spectrogram compresses the STFT along the frequency axis and can therefore be more compact while preserving the most perceptually important information. Mel-spectrograms provide an efficient and perceptually relevant representation compared to the STFT and have been shown to perform well in various tasks.

An STFT is closer to the original signal, and neural networks may be able to learn from it a representation that is better suited to the given task than the mel-spectrogram. This requires large amounts of training data, however, and mel-spectrograms have outperformed the STFT with smaller datasets. One of the merits of the STFT is that it is invertible to the audio signal. The mel-spectrogram, in contrast, only provides the magnitude of the time-frequency bins, which means that it is not directly invertible to an audio signal.

• Constant-Q Transform (CQT). The CQT provides a 2D representation with logarithmically spaced centre frequencies. This is well matched to the frequency distribution of pitch, hence the CQT has been predominantly used where the fundamental frequencies of notes must be precisely identified. The computation of a CQT is heavier than that of an STFT or a mel-spectrogram.

• Chromagram. This is a variation of the spectrogram, discretized onto the tempered scale and independent of the octave. It is restricted to pitch classes.

The x axis represents time (in seconds), the y axis represents the note or pitch class of the signal, and a third axis, shown as color, can represent the intensity. The chromagram, also often called the pitch class profile, provides the energy distribution over a set of pitch classes, often the 12 pitch classes of Western music.

One can consider a chromagram as a CQT representation folded along the frequency axis. The chromagram is more processed than the other representations and can be used as a feature by itself.
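The octave folding behind the chromagram can be sketched for a single spectrum. For simplicity, this illustration folds plain FFT bins rather than a CQT, which is an assumption of the sketch; the function name, the 440 Hz reference for A4 and the test signal are also illustrative choices.

```python
import numpy as np

def chroma_from_spectrum(magnitudes, freqs, a4=440.0):
    """Fold the energy of each frequency bin onto 12 pitch classes
    (0 = C, 1 = C#, ..., 9 = A, ..., 11 = B)."""
    chroma = np.zeros(12)
    for mag, f in zip(magnitudes, freqs):
        if f <= 0:
            continue                            # skip the DC bin
        midi = 69 + 12 * np.log2(f / a4)        # equal-tempered MIDI note number
        pitch_class = int(round(midi)) % 12     # the octave is discarded here
        chroma[pitch_class] += mag ** 2
    return chroma

sr = n = 8000                                   # 1 second of audio, 1 Hz per FFT bin
t = np.arange(n) / sr
# Two A notes one octave apart (220 Hz and 440 Hz) plus a quieter E (about 330 Hz).
signal = (np.sin(2 * np.pi * 220 * t)
          + np.sin(2 * np.pi * 440 * t)
          + 0.5 * np.sin(2 * np.pi * 330 * t))

magnitudes = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(n, d=1.0 / sr)
chroma = chroma_from_spectrum(magnitudes, freqs)

strongest = int(np.argmax(chroma))              # pitch class with the most energy
second = int(np.argsort(chroma)[-2])
```

The two A notes fall into the same pitch class even though they are an octave apart, so their energies add up, while the E is kept as a separate, weaker class.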

Figure 2.10: Audio representations: at the top, a digital audio signal as a waveform and below, in order, STFT, mel-spectrogram, CQT, and chromagram. Source: Kenwoo et al., (2018).

In this project, we will use the mel-spectrogram because of the advantages presented in this section.
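Since the mel-spectrogram is the representation retained here, the path from waveform to mel-spectrogram can be sketched with NumPy alone. This is a minimal sketch under assumptions: the Hz-to-mel formula below is one common approximation of the mel scale, and the window, hop and band counts as well as the function names are illustrative, not the settings used later in this work.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose edges are equally spaced on the mel scale.
    Shape: (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, centre, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (centre - lo)
        falling = (hi - fft_freqs) / (hi - centre)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mel_spectrogram_db(x, sr, n_fft=1024, hop=256, n_mels=64):
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # STFT power, (frames, bins)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T      # (frames, n_mels)
    return 10.0 * np.log10(np.maximum(mel, 1e-10)).T       # (n_mels, frames), in dB

sr = 16000
x = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)         # 1 second, 440 Hz tone
S = mel_spectrogram_db(x, sr)

# The same 500 Hz step is much larger in mels at low than at high frequencies.
low_gap = hz_to_mel(1000.0) - hz_to_mel(500.0)
high_gap = hz_to_mel(10500.0) - hz_to_mel(10000.0)
```

The two gaps illustrate the perceptual argument made above: the step from 500 to 1000 Hz spans far more mels than the step from 10,000 to 10,500 Hz.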

2.2 Image Generation

2.2.1 Related Works

Conditional GANs have been shown to be applicable to learning mapping functions from conditional information, as in image-to-image translation tasks. Typical examples are transforming a monochrome image into a colored image, or filling in objects based on an edge map. In addition, more advanced works have demonstrated that they can convert back and forth between two domains based on unpaired data, without needing pairs of elements, one from each domain.

A new idea that changed the common conception of what neural networks were capable of was DeepDream (Mordvintsev et al., 2015). Instead of mimicking what humans do, this project was one of the first approaches to explore neural network creativity. The solution was found while looking for ways to visualize the knowledge in a trained neural net. The proposal was a deep feedforward neural network architecture in which the potential occurrence of a specific visual feature was maximized. The result was the generation of psychedelic versions of standard images.

2.2.1.1 Style Transfer Painting Generation System

Since this milestone, many proposals have explored creativity in image creation. One example is the Style Transfer Painting Generation System (Gatys et al., 2015), an artificial system based on a deep neural network that creates artistic images of high quality. The system uses neural representations to separate and recombine the content and style of arbitrary images, providing a neural algorithm for the creation of artistic images (Figure 2.11). Transposing this style transfer technique to music was a natural direction, and it has been tried for audio (Huzaifah and Wyse, 2019; Foote et al., 2016), in both cases using a spectrogram rather than the direct wave signal as input. The result is effective, but not as interesting as in the case of painting style transfer: it is closer to a sound merging of the style and the content, probably because of the anisotropy of the global music content representation.

Figure 2.11: Style transfer Painting Generation System. Source: Gatys et al., (2015).

Figure 2.12: Block diagram of the CAN system. Source: Elgammal et al., (2017).

2.2.1.2 Creative Adversarial Networks (CAN)

Creative Adversarial Networks (CAN) (Elgammal et al., 2017) address the issue of creativity by extending the generative adversarial network (GAN) architecture to generate art by learning about styles and deviating from style norms. The main assumption is that in a standard GAN architecture the objective of the generator is to generate images that fool the discriminator, and as a consequence the generator is trained to be emulative but not creative (Figure 2.12). Instead, CAN proposes a modification of the GAN objective that makes it able to generate creative art by maximizing deviation from established styles while minimizing deviation from the art distribution. In this solution, the generator is designed to receive two signals from the discriminator, instead of one as in a GAN, which act as two contradictory forces that push it to generate novel works.
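The two signals can be pictured as two terms of a generator loss. The sketch below is an illustration under assumptions, not the exact formulation of Elgammal et al. (2017): the style-ambiguity term is written as a cross-entropy between the discriminator's style posterior and a uniform target, and the function name and example probabilities are hypothetical.

```python
import numpy as np

def can_generator_loss(p_art, p_style, eps=1e-12):
    """p_art: discriminator's probability that the image is art.
    p_style: discriminator's posterior over K established styles."""
    p_style = np.asarray(p_style, dtype=float)
    k = len(p_style)
    # Signal 1: stay inside the art distribution (usual adversarial term).
    art_term = -np.log(p_art + eps)
    # Signal 2: style ambiguity, a cross-entropy against the uniform target 1/K.
    target = 1.0 / k
    ambiguity_term = -np.sum(target * np.log(p_style + eps)
                             + (1.0 - target) * np.log(1.0 - p_style + eps))
    return float(art_term + ambiguity_term)

confident = [0.97, 0.01, 0.01, 0.01]   # image clearly belongs to one known style
ambiguous = [0.25, 0.25, 0.25, 0.25]   # image cannot be assigned to any style

loss_confident = can_generator_loss(0.9, confident)
loss_ambiguous = can_generator_loss(0.9, ambiguous)
```

For the same "is art" probability, the generator is penalized less when the style classifier is undecided, which is the pressure that pushes it away from established styles.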

2.2.1.3 Stacked Generative Adversarial Networks (SGAN)

Stacked Generative Adversarial Networks (SGAN) (Huang et al., 2017; Zhang et al., 2017) are an improved GAN solution for image generation. Specifically, SGAN is a generative model trained to invert the hierarchical representations of a bottom-up discriminative network. The model consists of a top-down stack of GANs that generate lower-level representations conditioned on higher-level representations. A representation discriminator is introduced at each feature hierarchy to encourage the representation manifold of the generator to align with that of the bottom-up discriminative network, leveraging the powerful discriminative representations to guide the generative model. The images generated by SGAN appear to be of much higher quality than those of GANs without stacking.

2.2.1.4 Pix2Pix Generative Adversarial Network

Pix2Pix Generative Adversarial Network (Isola et al., 2017) is another approach for image-to-image translation tasks, and specifically the architecture that we will explore in the next chapter to solve our main problem of creating music from images.

The Pix2Pix model is a type of conditional adversarial network (CGAN), in which the generation of the output image is conditioned on an input image (Figure 2.13). The discriminator is fed a source image and a target image and has to determine whether the target is a plausible transformation of the source image. The generator is trained via an adversarial loss, which leads it to generate plausible images in the target domain. The generator is also updated via an L1 loss measured between the generated image and the expected output image. This additional loss leads the generator to create plausible translations of the source image.
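The combination of the two generator losses can be sketched as follows. This is a minimal illustration with hypothetical names and toy arrays in place of real network outputs; the weight of 100 on the L1 term is the value usually associated with Isola et al. (2017) and is treated here as an assumption.

```python
import numpy as np

def pix2pix_generator_loss(d_fake_prob, fake, target, lambda_l1=100.0, eps=1e-12):
    """Generator objective: adversarial term + lambda_l1 * L1 term.
    d_fake_prob: discriminator outputs for (source, fake), values in (0, 1)."""
    # Adversarial term: the generator wants the discriminator to output 1.
    adversarial = -np.mean(np.log(d_fake_prob + eps))
    # L1 term: mean absolute error between generated and expected output.
    l1 = np.mean(np.abs(target - fake))
    return float(adversarial + lambda_l1 * l1), float(adversarial), float(l1)

rng = np.random.default_rng(0)
target = rng.uniform(-1.0, 1.0, size=(8, 8))   # toy expected output image
d_prob = np.full((4, 4), 0.5)                  # toy, undecided discriminator

total_close, adv_close, l1_close = pix2pix_generator_loss(d_prob, target + 0.01, target)
total_far, adv_far, l1_far = pix2pix_generator_loss(d_prob, target + 0.5, target)
```

With the discriminator output held fixed, the total loss is driven by the L1 term, which is what ties the generated output to the expected one instead of to any plausible sample of the target domain.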

The success of the Pix2Pix GAN has been proven on multiple image-to-image translation tasks, such as converting maps to satellite photographs, black and white photographs to color, or sketches of products to product photographs. As an image-conditional GAN, this type of architecture allows the generation of large images compared to prior GAN models. In addition, the CGAN reduces the number of training examples required, as the data are now split per attribute. Although Pix2Pix has demonstrated its capabilities in multiple examples, its main constraint is that it works only with paired image datasets. This issue has been addressed by later solutions.

Figure 2.13: Example applications developed with pix2pix codebase. Source: Isola et al., (2017).

2.2.1.5 Cyclic Generative Networks (CycleGANs)

Cyclic Generative Networks (CycleGANs) (Zhu et al., 2017) are a variation on traditional GANs designed to solve the image-to-image translation problem. This is a very interesting solution, since it removes the need for paired images (Figure 2.14).

One major difference between the Pix2Pix GAN and the CycleGAN is that, whereas the Pix2Pix GAN consists of only two networks (a discriminator and a generator), the CycleGAN consists of four networks (two discriminators and two generators). The approach used by CycleGANs to perform image-to-image translation is similar to that of the Pix2Pix GAN, except that unpaired images are used for training and the objective function has an extra criterion, the cycle consistency loss.

Figure 2.14: CycleGAN examples. Source: Zhu et al., (2017).

The main goal is to learn a mapping G: X → Y such that the distribution of images G(X) is indistinguishable from the distribution Y under an adversarial loss, and to couple it with an inverse mapping F: Y → X, introducing a cycle consistency loss that enforces F(G(X)) ≈ X and, conversely, G(F(Y)) ≈ Y. Unpaired datasets make the generator network difficult to train because of the discrepancies between the two domains. This leads to a problem where most generated images share common properties, so that different input images produce similar output images. To solve this problem, the cycle-consistency loss is introduced along with the adversarial loss. Many applications in which paired training data does not exist, such as style transfer, object transfiguration, season transfer or photo enhancement, were tested with high-quality results. After this solution appeared, others exploring a similar approach were proposed.
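The cycle consistency criterion can be illustrated with toy mappings. In this sketch the "generators" are hypothetical linear maps between two 2-dimensional domains, chosen only so that an exact inverse exists, and the loss is written as a mean absolute reconstruction error.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])            # toy invertible map standing in for G
A_inv = np.linalg.inv(A)

def G(x):                             # G: X -> Y
    return x @ A

def F_good(y):                        # F: Y -> X, exact inverse of G
    return y @ A_inv

def F_collapsed(y):                   # F that maps every input to the same output
    return np.zeros_like(y)

def cycle_consistency_loss(G, F, x, y):
    forward = np.mean(np.abs(F(G(x)) - x))    # x -> G(x) -> F(G(x)) should give x
    backward = np.mean(np.abs(G(F(y)) - y))   # y -> F(y) -> G(F(y)) should give y
    return float(forward + backward)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 2))          # unpaired samples from domain X
y = rng.normal(size=(16, 2))          # unpaired samples from domain Y

loss_good = cycle_consistency_loss(G, F_good, x, y)
loss_collapsed = cycle_consistency_loss(G, F_collapsed, x, y)
```

A mapping that returns the same output for every input cannot be undone, so it is heavily penalized, which is how this term counteracts the collapse to similar outputs described above.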

DualGAN (Yi et al., 2017) is a dual learning framework for unsupervised image-to-image transformation whose objective function consists of a reconstruction loss (similar to the cycle-consistency loss) and an adversarial loss. Photo-Sketch Synthesis with Multi-Adversarial Networks (PS2MAN) (Wang et al., 2018) focuses on generating high-quality realistic photos from sketches and sketches from photos. Since photo-sketch synthesis is a coupled/paired translation problem, the pairing information was handled within the CycleGAN framework. This work uses a synthesized loss in addition to the cycle-consistency loss and the adversarial loss as its objective function.

Figure 2.15: A comparison of generated building images for label-to-building transformation. From left to right: the input, GAN, Pix2Pix, DualGAN, CycleGAN, PS2GAN, CSGAN, CDGAN and the original image. Source: Kancharagunta et al., (2020).

Cyclic-Synthesized Generative Adversarial Networks (CSGAN) (Kancharagunta and Dubey, 2019) are one of the latest proposals in this regard, followed by Cyclic Discriminative Generative Adversarial Networks (CDGAN) (Kancharagunta and Dubey, 2020). These appear to be the latest improvements on CycleGANs; CDGAN seeks more realistic image generation by incorporating additional discriminator networks for cycled images into the CycleGAN architecture. It uses a Cyclic-Discriminative (CD) adversarial loss computed over the real images and the cycled images. This loss helps to increase the quality of the generated images and also reduces the artifacts in them.

Specifically, the CDGAN consists of two generators and two discriminators. The generator networks G_AB and G_BA are used to generate the synthesized images (Syn_B and Syn_A) from the real images (R_A and R_B), and the cycled images (Cyc_B and Cyc_A) from the synthesized images (Syn_A and Syn_B), in two different domains A and B. The two discriminator networks D_A and D_B are used to distinguish between the real images (R_A and R_B) and the synthesized images (Syn_A and Syn_B), and also between the real images (R_A and R_B) and the cycled images (Cyc_A and Cyc_B). The results on this specific task seem to outperform all the state-of-the-art methods.

2.2.1.6 Spectral Normalization GAN (SN-GAN)

Spectral Normalization GAN (SN-GAN) (Mufti et al., 2019), based on previous work (Miyato et al., 2018), is one of the most recent proposals. Its main goal is to train a GAN conditioned on different attributes of paintings in order to generate novel paintings that have specified attributes of our choosing. Spectral normalization seems to lead to a significant increase in image quality in each of the image categories (Figure 2.16).

2.3 Intersections between Image and Sound

2.3.1 Related Works

Cross-modal learning has been explored in depth, but mainly for images and text. The challenges of using audio and images are quite different from those of using text (Harwath et al., 2016), since text is much closer to a semantic annotation than audio. In the text-to-image task, for example, a machine can turn text descriptions into images (Reed et al., 2016). Audio and visual data, in contrast, are difficult to compare because of the huge dimensionality gap between them. Nevertheless, some works that explore the exchange between audio and image have been developed, with interesting results.

2.3.1.1 Audio-Visual Embedding Network (AVE-Net) and Audio-Visual Object Localization Network (AVOL-Net)

Figure 2.16: Conditional GAN Architecture in SN-GAN proposal. Source: Mufti et al., (2020).

The Audio-Visual Embedding Network (AVE-Net) (Arandjelović and Zisserman, 2018) is a network that can embed audio and visual inputs into a common space. AVE-Net was designed explicitly to facilitate cross-modal retrieval. Specifically, pairs of one image and one second of audio (represented as a log-spectrogram) are processed by vision and audio subnetworks, respectively, and the network has to decide whether the image and the audio correspond (Figure 2.17).
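How a common embedding space supports both the correspondence decision and cross-modal retrieval can be sketched as follows. This is an illustration under assumptions: the embeddings are random toy vectors instead of subnetwork outputs, the decision is based on the Euclidean distance between normalized embeddings, and the scale and bias that turn the distance into a probability are hypothetical values that a trained model would learn.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def correspondence_probability(img_emb, aud_emb, scale=-5.0, bias=5.0):
    """Map the distance between normalized embeddings to a probability
    that the image and the audio correspond (scale and bias are hypothetical)."""
    d = np.linalg.norm(l2_normalize(img_emb) - l2_normalize(aud_emb), axis=-1)
    return float(1.0 / (1.0 + np.exp(-(scale * d + bias))))

def retrieve(query_emb, gallery_embs):
    """Cross-modal retrieval: index of the gallery item closest to the query."""
    d = np.linalg.norm(l2_normalize(gallery_embs) - l2_normalize(query_emb), axis=-1)
    return int(np.argmin(d))

rng = np.random.default_rng(0)
image_emb = rng.normal(size=128)                         # toy image embedding
matching_audio = image_emb + 0.05 * rng.normal(size=128) # audio of the same scene
other_audio_1 = rng.normal(size=128)                     # unrelated audio clips
other_audio_2 = rng.normal(size=128)

p_match = correspondence_probability(image_emb, matching_audio)
p_other = correspondence_probability(image_emb, other_audio_1)

gallery = np.stack([other_audio_1, matching_audio, other_audio_2])
best = retrieve(image_emb, gallery)
```

Because both modalities live in the same space, retrieving the sound for an image reduces to a nearest-neighbour search over audio embeddings.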

An extension of AVE-Net, the Audio-Visual Object Localization Network (AVOL-Net) (Arandjelović and Zisserman, 2018), was developed with a similar but not identical purpose. It is a system that understands the audio-visual world and is able to associate the appearance of an object with the sound it makes, and thus to answer the question: where is the object that is making the sound? The paper was titled “Objects that Sound” and is based on a previous work, “Pixels that Sound” (Kidron et al., 2005), one of the first approaches to audio-visual dynamic localization able to localize visual events associated with sound sources.

Figure 2.17: AVE-Net and AVOL-Net architectures. Source: Arandjelović and Zisserman, (2018).

2.3.1.2 Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

When visual and audio events occur together, it suggests that there might be a common underlying event that produces both signals. Recent work (Owens and Efros, 2018) argues that the visual and audio components of a signal should be modeled jointly using a fused multisensory representation, and proposes to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. This proposal shows how to train the model without using any manually labeled data: rather than explicitly telling the model the relation between sound and image, audio-visual associations are discovered through self-supervised training.

Figure 2.18: Fused audio-visual network. Source: Owens and Efros, (2018).

Specifically, a neural network is trained on the task of detecting misalignment between audio and visual streams in synthetically shifted videos. The network observes raw audio and video streams, some of which are aligned while others have been randomly shifted, and it has to distinguish between the two. This is a challenging training task that forces the network to combine visual motion with audio information and to learn a useful audio-visual feature representation (Figure 2.18).
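The construction of such training examples needs no manual labels, because the label follows from whether the audio was shifted. The sketch below illustrates this under assumptions: the circular shift, the shift range, the toy data and all names are illustrative simplifications, not the procedure of Owens and Efros (2018).

```python
import numpy as np

def make_training_pair(frames, audio, rng, min_shift, max_shift):
    """Return (frames, audio, label): label 1 = aligned, 0 = audio shifted in time."""
    if rng.random() < 0.5:
        return frames, audio, 1
    shift = int(rng.integers(min_shift, max_shift + 1))
    if rng.random() < 0.5:
        shift = -shift
    # A circular shift keeps the clip length fixed (a simplification of this sketch).
    return frames, np.roll(audio, shift), 0

rng = np.random.default_rng(0)
sr = 16000
frames = np.arange(10)                 # toy stand-in for 10 video frames
audio = rng.normal(size=sr)            # toy stand-in for 1 second of audio

dataset = [make_training_pair(frames, audio, rng, min_shift=2000, max_shift=8000)
           for _ in range(100)]
labels = [label for _, _, label in dataset]

# Every label is consistent with whether the audio was actually modified.
consistent = all(np.array_equal(a, audio) == (label == 1) for _, a, label in dataset)
```

A classifier trained on pairs like these has to relate what happens in the frames to what is heard, since no other cue distinguishes the two classes.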

2.3.1.3 Multimodal Deep Learning

There are other cases where exploiting multiple modalities improves sound classification performance. Deep auto-encoders have achieved better image classification performance for data with additional sound input by learning cross-modal representations (Ngiam et al., 2011). These results indicate that such multimodal features can capture meaningful information about both sounds and images. However, the learned representations were only partly invariant to the input modality, suggesting that the shared representations could be improved.

2.3.1.4 SoundNet

SoundNet (Aytar et al., 2016), a deep convolutional neural network for natural sound recognition, is another example in which common high-level representations for images and sounds are extracted by matching sound and image features. Its performance in generating descriptions was improved by learning sound representations obtained via multimodal learning. Using 2 million videos, a neural network learnt to recognize scenes or objects based on what they sound like. Based only on the sound in a video, it could detect a coral reef, for example, or a plane taking off. Given that SoundNet’s features include image feature information, it may be possible to reconstruct images from them.

2.3.1.5 Generating Images from Sounds using Multi-modal features and GANs

Other works have proposed different methods for aligning visual and sound features (Lyu et al., 2019). This method transforms between images and sounds by using stacked GANs (Zhang et al., 2017). The work shows that multimodal layers can extract similar information and generate vectors with similar distributions from both input modalities. The method converts multimodal features derived from input sounds into image features, and subsequently into images, using conditional GANs. To validate the approach, the authors created a dataset based on the Flickr-SoundNet dataset, which contains 104K pairs of sounds and images with matching scene content.

Inspired by text-to-image GANs (Reed et al., 2016), Audio2Img (Wen and Su, 2018) is able to generate recognizable images based only on short clips of audio. This work shows that a GAN can be conditioned on audio features, a successful step in bridging the gap between audio and visual learning. The model is trained and evaluated on audio and images from a subset of AudioSet (Gemmeke et al., 2017), a dataset of 10-second clips collected from millions of YouTube videos.

2.3.1.6 Objects that Sound

A different method of converting between images and sounds has also been proposed. It is based on learning to find relationships between sounds and images in data that includes material properties and physical interactions. By incorporating changes due to object interactions in a convolutional neural network and long short-term memory (CNN-LSTM) structure, the model could predict the corresponding waveforms. However, this model can only be applied in a restricted domain.

Besides generating images from given conditions, some projects propose generating sounds from given videos, such as “Visually Indicated Sounds” (Owens et al., 2015). This research uses a recurrent neural network (RNN) with long short-term memory (LSTM) units that takes as input CNN features extracted from video frames. In particular, it proposes the task of predicting what sound an object makes as a way of studying physical interactions within a visual scene. It presents an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick.

Another related study was able to convert between images of musical instruments or poses and the corresponding sounds (Chen et al., 2017). This work posits that humans can imagine a scene from a sound and, on this basis, proposes to create machines able to do the same by using conditional GANs. The proposed system is trained with pairs of visual and audio signals, which are typically contained in videos, and is able to generate one modality (visual/audio) given observations from the other modality (audio/visual). However, this method is limited to converting between specific sound and image types that are commonly paired in the source and target domains, and is not applicable to general video data.

2.3.1.7 Discover Cross-Domain Relations with DiscoGANs

There are also more advanced methods of learning mapping functions between the source and target domains, such as DiscoGAN (Kim et al., 2017). DiscoGAN can learn the mappings needed to convert data between two domains that are not in one-to-one correspondence, and can exhibit higher performance than conventional GANs when learning the mapping functions between images and sounds with similar characteristics but different dimensions (Figure 2.19). It may therefore be possible to use DiscoGAN to achieve more realistic sound-to-image and/or image-to-sound reconstruction.

Figure 2.19: DiscoGAN: Discovering relations of images from visually very different object classes. Source: Kim et al., (2017).

2.3.1.8 Other Unpublished Works

Although the following work has not been formally published, the author of this Master’s thesis thinks it is important to mention it, since it is the most similar approach to the current project that has been found. In 2018, the artist Nao Tokui created “Imaginary Soundscapes”, a convolutional neural network that “hears sounds when it looks at images”. Based on a given image, the software chooses from 15,000 sound files to find the “soundscape” that fits. It was first applied to Google Street View images to create an audio tour of the world, so that viewers could “immerse themselves into the artificial soundscape imagined by our deep learning models”, and the application was later opened to all kinds of images.

Although the goal of that project is similar to ours, the approach is very different: their system selects existing sounds from a fixed list and assigns them to the image. In our case the produced music is new, never heard before, and the model is able to generate styles that do not exist.

2.3.2 Cross-Modal similarities in Music and Visual Art

This thesis is based on the hypothesis that even when visual art and music do not directly influence one another, they can share abstract qualities without having direct communication. Visual art and music are also associated through their styles or movements, such as Classical, Romanticism or Impressionism, since in general music and visual art movements with similar names share the same period in time.

The interaction between visual art and music has been an important part of art theory and history. Since music and visual art belong to a specific place and time, they share the same external influences, such as political movements, cultural innovation or technological developments, which also implies a certain uniformity in the materials and instruments used in the pieces.

There are also parallels in the terminology of music and visual art, such as texture, balance, form or harmony, which likewise refer to shared abstract qualities. While these shared qualities are similar in their abstract form, a methodology for measuring them could establish a more direct connection between the two arts and could help to understand how they influence one another. An interesting approach could therefore be based on a model able to extract and measure attributes that are linked through cross-modal abstraction, such as strength of color, lightness of value and height of pitch. These qualities are empirically measurable, and there is evidence that supports a shared cross-modal gradient between them. As far as the author knows, there are currently no publications that empirically measure any abstract similarities between music and visual art.
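As an illustration of how such attributes can be quantified, the following minimal Python sketch computes two color attributes and one pitch attribute. It assumes, purely for the sake of the example, that "strength of color" is approximated by HLS saturation, that "lightness of value" is approximated by HLS lightness, and that "height of pitch" is expressed in semitones relative to A4 (440 Hz). The function names are hypothetical and are not part of this thesis.

```python
import colorsys
import math


def color_attributes(r, g, b):
    """Return (saturation, lightness), each in [0, 1], for 8-bit RGB values.

    Assumption: saturation stands in for "strength of color" and
    HLS lightness stands in for "lightness of value".
    """
    _hue, lightness, saturation = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
    return saturation, lightness


def pitch_height(freq_hz, ref_hz=440.0):
    """Return the pitch height in semitones relative to ref_hz (A4 by default)."""
    return 12.0 * math.log2(freq_hz / ref_hz)
```

With these definitions, a pure red pixel is fully saturated at medium lightness, a grey pixel has zero saturation, and a tone one octave above A4 lies 12 semitones above the reference.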

Although visual art and music have not been measured directly against one another, there has long been evidence that people make connections between visual and auditory input. Wolfgang Kohler (Kohler, 1910) conducted a psychological experiment to determine whether humans are capable of mapping a connection between sounds and visual objects, specifically between speech and shapes. From the results of this study, it seemed clear that there exists a cross-sensory translation between visual and auditory information that is generally consistent within a population. In addition, papers and books have been written on the relationships and similarities between visual art and music (Wallen, 2012; Vergo, 2012; Janson, 1968). What has not been explored is whether there are measurable similarities between the musical and visual works of an entire artistic movement or period.


More recent related work addresses the creation of visual synthesizers. Collopy (2020) presents the insights of artists and musicians in the literature who have explored the relationships between painting and music, and sets out the problems of aligning visual art with music together with possible solutions. Related to this, synesthesia (a condition in which one sense is simultaneously perceived as if by one or more additional senses such as sight) and specifically chromesthesia (a kind of synesthesia that connects sounds and colors) have become an extensive field of research. Santini (2019) presents a synthesizer that evokes the experience of chromesthesia by linking color and sound: it generates music according to color detection. More precisely, RGB values are detected pixel by pixel and used to determine the behavior of five physical models of virtual instruments. This solution was built by using evolutionary algorithms.

The connection between visual art and music is not going to be studied extensively in this Master's thesis. We base our application on the theory explored by previous researchers and create the correspondence between pieces of music and paintings based on their styles. Finding this parallelism with quantitative models that automatically cluster or identify groups of both music and images would nevertheless be a large domain to explore.


Chapter 3 Application

In this chapter we describe our architecture, taking inspiration from successes in image generation. We adapt Pix2Pix (Isola et al., 2017), a generative model focused on image-to-image translation tasks, to generate audio instead. As in Pix2Pix (Figure 3.1), the networks are conditioned, but instead of generating images conditioned on images, we generate sounds conditioned on images. This way of applying image-generating GANs to audio operates on image-like spectrograms (time-frequency representations of audio), and specifically on mel-spectrograms, as in previous work.
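To make the notion of a mel-spectrogram more concrete, the sketch below shows how frequencies in Hz can be mapped to the mel scale and how band centres that are equally spaced in mel become progressively wider in Hz. It uses one common convention for the mel scale (the HTK-style formula); audio libraries may use a different variant, and this chapter does not state which one the implementation relies on. The function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (HTK-style formula, an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(n_mels, f_min, f_max):
    """Return the centre frequencies (Hz) of n_mels bands equally spaced in mel."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]
```

Each row of a mel-spectrogram corresponds to one such band, so low frequencies are represented with finer resolution than high frequencies, which is what makes the representation compact enough to be treated as an image.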

In the following sections we present the main components of the model and explain how audio is included in the network. We also show how music and image data are represented so that they can be fed into the model.

The code for this chapter is written in Python¹. The model is implemented with the Keras deep learning framework², based on the model described in the Pix2Pix paper, and is designed to take and process images of any size (resolutions of 256 and 512 have been tested). The piece of music is then obtained by inverting the generated mel-spectrogram.

¹ https://www.python.org/

² https://keras.io/


Figure 3.1: Visual representation of the complete original Pix2Pix architecture, which transforms satellite images into their corresponding Google Maps pages. This architecture has had many other applications in addition to this one.

3.1 Architecture

The architecture that we explore and implement is a type of image-conditional GAN that allows the generation of larger images than other GAN-based solutions. It is also an architecture that gives good results without much information in the input.

The structure is composed of a generator model, defined to create new images, and a discriminator model, which classifies images as real or fake. The discriminator model is updated directly, whereas the generator model is updated via the discriminator model. The two models are trained at the same time in an adversarial process in which the generator tries to fool the discriminator and the discriminator tries to identify the fake images.
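The difference between the direct and the indirect update can be illustrated with a deliberately tiny sketch. It is not the Keras implementation of this thesis: the generator and discriminator are replaced by hypothetical scalar models (a linear generator and a logistic discriminator), the gradients are written by hand, and the generator uses the common non-saturating loss -log D(x, G(x)). All class and function names are assumptions made for the example.

```python
import math


def sigmoid(t):
    """Numerically safe logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


class Discriminator:
    """Toy discriminator D(x, y) = sigmoid(wx*x + wy*y + c)."""

    def __init__(self, wx=0.1, wy=-0.2, c=0.0):
        self.wx, self.wy, self.c = wx, wy, c

    def prob_real(self, x, y):
        return sigmoid(self.wx * x + self.wy * y + self.c)

    def update(self, x, y_real, y_fake, lr):
        # Direct update: one gradient step on
        # -log D(x, y_real) - log(1 - D(x, y_fake)).
        g_real = self.prob_real(x, y_real) - 1.0  # gradient w.r.t. the real logit
        g_fake = self.prob_real(x, y_fake)        # gradient w.r.t. the fake logit
        self.wx -= lr * (g_real * x + g_fake * x)
        self.wy -= lr * (g_real * y_real + g_fake * y_fake)
        self.c -= lr * (g_real + g_fake)


class Generator:
    """Toy generator G(x) = a*x + b."""

    def __init__(self, a=0.0, b=0.0):
        self.a, self.b = a, b

    def generate(self, x):
        return self.a * x + self.b

    def update_via(self, disc, x, lr):
        # Indirect update: the loss -log D(x, G(x)) is evaluated by the
        # discriminator and its gradient flows back through D to a and b.
        # The discriminator weights are left unchanged in this step.
        y_fake = self.generate(x)
        dloss_dy = (disc.prob_real(x, y_fake) - 1.0) * disc.wy
        self.a -= lr * dloss_dy * x
        self.b -= lr * dloss_dy


def train_toy(steps=200, lr=0.05):
    """Alternate one discriminator step and one generator step per iteration."""
    disc, gen = Discriminator(), Generator()
    xs = [-1.0, -0.5, 0.5, 1.0]
    for step in range(steps):
        x = xs[step % len(xs)]
        y_real = 2.0 * x + 1.0  # hypothetical "true" pairing of x and y
        disc.update(x, y_real, gen.generate(x), lr)
        gen.update_via(disc, x, lr)
    return disc, gen
```

Note that the generator never sees the real target directly in this sketch: if the discriminator ignores its second input (wy = 0), the generator receives no gradient at all, which is what "updated via the discriminator" means in practice.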

This specific solution is a conditional GAN (cGAN), in which the generation of the output is conditional on a source image. In our proposal the target image is a mel-spectrogram that describes the music and the source image is a painting. We exploit the potential of this solution to pair image and sound by using a visual representation of the music, namely the spectrogram.
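For reference, the objective of the original Pix2Pix model (Isola et al., 2017) can be written as follows, where x is the source image (here, the painting), y is the target image (here, the mel-spectrogram), z is the noise input and \lambda weights the L1 reconstruction term. This is the formulation given in the Pix2Pix paper; that the adapted model keeps both terms unchanged is only an assumption made for this illustration.

```latex
\begin{align}
\mathcal{L}_{cGAN}(G, D) &= \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big] \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z) \rVert_1\big] \\
G^{*} &= \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)
\end{align}
```

The first term rewards a discriminator that recognises real (painting, spectrogram) pairs, the second penalises it when it accepts generated spectrograms, and the L1 term keeps the generated spectrogram close to the paired target.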
