As stated in the previous section, training GANs is more problematic than training most machine learning models, mainly because finding an equilibrium in a Nash game is more difficult than solving an optimization problem. This section analyzes three of the most important problems commonly encountered during training.
Instability
The instability of GANs is well documented in the literature [34, 36, 37], and it is caused by trying to find an equilibrium point with gradient descent. The equilibrium is a point $(\theta^{(D)}, \theta^{(G)})$ such that both the generator and the discriminator losses are at a minimum with respect to their own parameters. This has motivated the idea of using gradient descent to update the parameters of the two models. However, as discussed in [37], this approach can fail to converge, even for simple functions. When the discriminator changes its parameters to minimize its loss, the generator is likely to be pushed away from its optimum, and when the generator updates its parameters, the same happens to the discriminator. Equilibrium is not reached because each player undoes the progress of the other, causing the value of the GAN function to oscillate. This oscillation is hard to detect and to avoid, which makes the training of GANs unstable. For a complete analysis of the various sources of instability during the training of GANs, see [36].

Algorithm 3 A more detailed GAN training algorithm. For the discriminator cost function, the cross-entropy loss is used instead of the objective function in Equation 2.9, so that the loss is minimized rather than maximized.
Require: k = number of steps the discriminator is trained before each generator update.
Require: m = minibatch size.
1: for number of training iterations do
2:     for k steps do
3:         • Sample a minibatch $x$ of size $m$ from $p_{data}$.
4:         • Sample a minibatch $z$ of size $m$ from $p_z$ (latent space).
5:         • Set the discriminator loss as:
               $$L_D = -\frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log\left(1 - D(G(z^{(i)}))\right) \right].$$
6:         • Back-propagate the loss only through the discriminator.
7:         • Update the parameters of the discriminator.
8:     end for
9:     • Sample a minibatch $z$ of size $m$ from $p_z$ (latent space).
10:    • Set the generator loss as:
               $$L_G = \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D(G(z^{(i)}))\right).$$
11:    • Back-propagate the loss through the discriminator and the generator.
12:    • Update the parameters of the generator.
13: end for
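The steps of Algorithm 3 can be sketched on a toy problem. The following is a minimal illustration, not the implementation used in this work: it assumes a one-dimensional generator G(z) = a·z + b, a logistic discriminator D(x) = sigmoid(w·x + c), real data drawn from N(4, 0.5), and hand-derived gradients. The function name `train_toy_gan`, the parameter names, the learning rate and the constant `EPS` are all hypothetical choices made for this example.

```python
import math
import random

EPS = 1e-12  # guards against log(0); a numerical detail, not part of Algorithm 3


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def train_toy_gan(iterations=200, k=1, m=32, lr=0.05, seed=0):
    """Toy 1-D GAN (assumes k >= 1): G(z) = a*z + b, D(x) = sigmoid(w*x + c)."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.1, 0.0  # discriminator parameters
    d_losses, g_losses = [], []
    for _ in range(iterations):
        for _ in range(k):
            # Steps 3-4: sample real data (assumed N(4, 0.5)) and latent vectors.
            xs = [rng.gauss(4.0, 0.5) for _ in range(m)]
            zs = [rng.gauss(0.0, 1.0) for _ in range(m)]
            fakes = [a * z + b for z in zs]
            d_real = [sigmoid(w * x + c) for x in xs]
            d_fake = [sigmoid(w * g + c) for g in fakes]
            # Step 5: discriminator loss L_D (cross entropy).
            loss_d = -sum(math.log(dr + EPS) + math.log(1.0 - df + EPS)
                          for dr, df in zip(d_real, d_fake)) / m
            # Step 6: gradients with respect to the discriminator only
            # (EPS is ignored in the derivatives).
            grad_w = sum((dr - 1.0) * x + df * g
                         for dr, x, df, g in zip(d_real, xs, d_fake, fakes)) / m
            grad_c = sum((dr - 1.0) + df for dr, df in zip(d_real, d_fake)) / m
            # Step 7: update the discriminator.
            w -= lr * grad_w
            c -= lr * grad_c
        # Step 9: sample a fresh latent minibatch.
        zs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        d_fake = [sigmoid(w * (a * z + b) + c) for z in zs]
        # Step 10: generator loss L_G.
        loss_g = sum(math.log(1.0 - df + EPS) for df in d_fake) / m
        # Step 11: the gradient flows through the discriminator (factor w)
        # before it reaches the generator parameters.
        grad_a = sum(-df * w * z for df, z in zip(d_fake, zs)) / m
        grad_b = sum(-df * w for df in d_fake) / m
        # Step 12: update the generator.
        a -= lr * grad_a
        b -= lr * grad_b
        d_losses.append(loss_d)
        g_losses.append(loss_g)
    return (a, b), (w, c), d_losses, g_losses


(gen_a, gen_b), (dis_w, dis_c), d_hist, g_hist = train_toy_gan()
print("generator:", gen_a, gen_b)
print("last losses:", d_hist[-1], g_hist[-1])
```

Writing the gradients by hand keeps the sketch free of external dependencies. In practice an automatic differentiation framework would compute steps 6 and 11.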
Despite these problems, SGD is still predominantly used for training adversarial networks, owing to its popularity and its adoption by many machine learning frameworks.
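The way each player can undo the progress of the other can be reproduced on a very small game. The sketch below assumes the value function V(x, y) = x·y, where one player minimizes over x and the other maximizes over y, with equilibrium at (0, 0). This game is chosen here purely for illustration and is not necessarily the example used in [37]. The function name `simultaneous_gradient_steps` and the step size are hypothetical.

```python
import math


def simultaneous_gradient_steps(x=1.0, y=1.0, lr=0.1, steps=100):
    """Simultaneous gradient updates on V(x, y) = x * y.

    The x-player descends on V, the y-player ascends on V.
    """
    trajectory = [(x, y)]
    for _ in range(steps):
        grad_x = y  # dV/dx
        grad_y = x  # dV/dy
        # Both players move at the same time, using the old values.
        x, y = x - lr * grad_x, y + lr * grad_y
        trajectory.append((x, y))
    return trajectory


trajectory = simultaneous_gradient_steps()
distances = [math.hypot(px, py) for px, py in trajectory]
print("distance from equilibrium, first and last:", distances[0], distances[-1])
```

Each update multiplies the squared distance from the equilibrium by (1 + lr²), so the iterates rotate around (0, 0) and slowly spiral away from it instead of converging.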
Vanishing gradient
The vanishing gradient is a problem that is not limited to GANs, but common to many deep learning models. However, in generative adversarial networks, this problem is particularly difficult to overcome, for two main reasons.
The first reason is that the gradient of the loss has to flow through the discriminator before reaching the generator. As can be seen in subsection 1.3.4, the error on one neuron is calculated by summing the errors of the following layer and multiplying the result by the derivative of the activation function. For some activation functions, the derivative varies between zero and one, which reduces the propagated error. Thus, when it arrives at the generator, the error might be too small to change the parameters in a perceptible way, thereby stopping the learning.
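The shrinking of the propagated error can be illustrated numerically. The sketch below assumes a chain of scalar sigmoid units with unit weights, both simplifications made only for this example, and applies the chain rule to obtain the derivative of the output with respect to the input. The function name `chain_gradient` is hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def chain_gradient(depth, x=0.5, weight=1.0):
    """Return d(output)/d(input) for a chain of `depth` scalar sigmoid units."""
    h = x
    grad = 1.0
    for _ in range(depth):
        a = weight * h
        h = sigmoid(a)
        # Chain rule: multiply by weight * sigmoid'(a), where
        # sigmoid'(a) = h * (1 - h) is never larger than 0.25.
        grad *= weight * h * (1.0 - h)
    return grad


for depth in (1, 5, 10, 20):
    print(depth, chain_gradient(depth))
```

Because every factor is at most 0.25 under these assumptions, the gradient decays at least geometrically with depth. In a GAN, the layers of the discriminator add to the depth that the generator's error signal has to cross.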
Another cause of vanishing gradient is the loss function of the generator. In the classic model, the generator wants to minimize:
$$J(G) = \log(1 - D(G(z))). \tag{2.11}$$

This function saturates, yielding values close to zero and small gradients, when the discriminator is able to reject the output of the generator with high confidence. This means that, if the discriminator overwhelms the generator, the latter does not receive enough gradient to improve its parameters. This situation may seem unlikely, but [36] shows that, if the space in which $p_{data}$ lies is high-dimensional, even a trivial classifier is able to distinguish between real and fake data with an accuracy of nearly 100% in the early stages of learning. This has led to the adoption of the following loss function for the generator:

$$J(G) = -\log(D(G(z))). \tag{2.12}$$

This function has the advantage of providing a higher gradient during the initial steps of training. Moreover, its minimum is finite, whereas the minimum of Equation 2.11 is unbounded ($-\infty$).
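The difference between Equations 2.11 and 2.12 can be checked numerically. The sketch below assumes that the discriminator ends in a sigmoid, so that D(G(z)) = sigmoid(a) for some logit a, and compares the derivative of each loss with respect to a. The function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), Equation 2.11: equals -sigmoid(a)."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), Equation 2.12: equals sigmoid(a) - 1."""
    return sigmoid(a) - 1.0


# Negative logits mean the discriminator confidently rejects the sample.
for a in (-8.0, -4.0, 0.0, 4.0):
    print(a, sigmoid(a), grad_saturating(a), grad_non_saturating(a))
```

When the discriminator confidently rejects a generated sample (a strongly negative), the gradient of Equation 2.11 approaches zero, while the gradient of Equation 2.12 approaches -1 and still gives the generator a usable signal.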
Mode collapse
Mode collapse (also known as the Helvetica scenario) occurs when the generator maps many different vectors from the latent space to the same output. In practice, a complete mode collapse is rare, but partial mode collapse (the generator produces only similar images) is common. The problem arises when the generator finds a weakness in the discriminator and, since the loss is low for that output, keeps producing the same outcome. It is not easy for the discriminator to notice this problem, as it does not keep a history of the data.
Even if the discriminator detects the issue and starts to reject the generated data, the generator simply searches for another mode, and starts to generate data that is close to the newly discovered vulnerability. Thus, the training becomes a cat-and-mouse game, without converging to the optimal point [38]. Mode collapse is still an open problem, and it is considered by many authors to be one of the most important issues of GANs [31].
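What a collapsed generator looks like can be shown with a hand-built example. The sketch below does not train anything: it assumes a one-dimensional data distribution with three modes at hypothetical centres, and compares a hand-written generator that spreads the latent space over all modes with one that maps every latent vector close to a single mode. The names `MODES`, `covered_modes`, `healthy_generator` and `collapsed_generator` are illustrative only.

```python
import random

MODES = [-4.0, 0.0, 4.0]  # hypothetical mode centres of the data distribution


def covered_modes(samples, modes=MODES, radius=0.5):
    """Count the modes that have at least one sample within `radius`."""
    return sum(any(abs(s - mu) <= radius for s in samples) for mu in modes)


def healthy_generator(z):
    # Different regions of the latent space are mapped to different modes.
    index = 0 if z < -0.5 else (1 if z < 0.5 else 2)
    return MODES[index] + 0.1 * z


def collapsed_generator(z):
    # Every latent vector is mapped close to the same output.
    return MODES[2] + 0.01 * z


rng = random.Random(0)
latents = [rng.gauss(0.0, 1.0) for _ in range(1000)]
print("healthy:", covered_modes([healthy_generator(z) for z in latents]))
print("collapsed:", covered_modes([collapsed_generator(z) for z in latents]))
```

Counting covered modes is possible here only because the mode centres are known in advance. For real image data the modes are not known, which is one reason why partial mode collapse is hard to detect.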