Saddle-Free Newton Method

The observations made for the different optimization techniques suggest that, close to nondegenerate critical points, we want to rescale the gradient in each direction $e_{[i]}$ by $1/|\lambda_{[i]}|$. This achieves the same rescaling as the Newton method while preserving the sign of the gradient. This means that if the gradient says that we should move away from $\theta^*$, the rescaled step will still move away. Saddle points are therefore not attractors of the dynamics of this approach, as they are for the dynamics of the Newton method.
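As a minimal illustration of this rescaling, the sketch below compares the two steps on a toy saddle. The function $f(x, y) = x^2 - y^2$, the starting point, and all variable names are assumptions chosen for illustration; the Hessian is diagonal here, so the eigen-directions are simply the coordinate axes.

```python
import numpy as np

# Toy saddle (assumed for illustration): f(x, y) = x**2 - y**2.
# Its Hessian is diag(2, -2), so the eigen-directions are the coordinate axes.
eigenvalues = np.array([2.0, -2.0])


def gradient(theta):
    """Gradient of f at theta = (x, y)."""
    x, y = theta
    return np.array([2.0 * x, -2.0 * y])


theta = np.array([0.5, 0.1])  # a point close to the saddle at the origin
grad = gradient(theta)

# Newton: rescale each gradient component by 1 / lambda_i.
newton_step = -grad / eigenvalues
# Rescaling by 1 / |lambda_i| keeps the sign of each gradient component.
rescaled_step = -grad / np.abs(eigenvalues)

print("Newton lands at:  ", theta + newton_step)    # jumps onto the saddle
print("Rescaled lands at:", theta + rescaled_step)  # moves away along y
```

Along the positive-curvature direction both steps agree; along the negative-curvature direction the Newton step moves towards the saddle, while the rescaled step moves away from it.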

The idea of taking the absolute value of the eigenvalues has been briefly suggested before; see, for example, Nocedal and Wright (2006, Chapter 3.4) or Murray (2010, Chapter 4.1). However, we are not aware of any proper justification of this algorithm, or even of a detailed exploration (empirical or otherwise) of this idea.

The problem is that one cannot simply replace $H$ by $|H|$, where $|H|$ is the matrix obtained by taking the absolute value of each eigenvalue of $H$, without proper justification. For example, one obvious question is: are we still optimizing the same function? While we might be able to argue that we do the right thing close to critical points, can we do the same far away from these critical points? In what follows we provide such a justification for replacing $H$ with $|H|$ by employing the generalized trust region framework. Namely, we want to define $k$ and $d$ in the following equation such that, when solving this constrained optimization problem using Lagrange multipliers, we get back $\Delta\theta = -\nabla L\,|H|^{-1}$:

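\[
\begin{aligned}
\Delta\theta = \arg\min_{\Delta\theta}\ & T_k\{L, \theta, \Delta\theta\} \quad \text{with } k \in \{1, 2\} \\
\text{s.t. } & d(\theta, \theta + \Delta\theta) \le r
\end{aligned}
\tag{4.71}
\]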

We first note that k must be 1. If k is 2, then the step will be a function of H rather than |H|. Having k = 1 also makes sense because we know that a second order approximation is not reliable when we have negative curvature. Next, we need to design a distance measure d such that it will produce |H|.

We want, similar to the Squared Newton method proposed in Section 4.5.1, to define the radius of the trust region according to the curvature of the function.

To achieve this, the Squared Newton method looked at the change in the gradient of the loss $L$. The gradient is the ratio of the change in $L$ to the change in the parameter, as $\Delta\theta \to 0$. The nature of this ratio is different from that of the loss function. One can become aware of this by assigning units to the loss function. If we do so, the constraint ends up being expressed in different units compared to the first order Taylor approximation of the loss. This is a sign that a rescaling term, which is a function of $\Delta\theta$, is missing between the constraint and the function we need to minimize. We end up ignoring this rescaling factor (by assuming that it is constant with respect to $\Delta\theta$) when we apply the Lagrange multipliers method.

In other words, the change in the gradient does not tell us how far from $\theta$ we can assume $L$ to have approximately the same first order approximation, but rather how fast the first order approximation of $L$ would change per $\epsilon$ change in the parameter, which is not a reliable measure of trustworthiness for the trust region.

The proper question to ask is how far from $\theta$ we can trust our first order approximation of $L$. One measure of this trustworthiness is given by how much the second order term of the Taylor expansion of $L$ influences the value of the function at some point $\theta + \Delta\theta$. That is, we want the following constraint to hold:

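\[
\begin{aligned}
d(\theta, \theta + \Delta\theta)
&= \left| T_2\{L, \theta, \Delta\theta\} - T_1\{L, \theta, \Delta\theta\} \right| \\
&= \left| L(\theta) + \nabla L\,\Delta\theta + \tfrac{1}{2}\Delta\theta^T H \Delta\theta - L(\theta) - \nabla L\,\Delta\theta \right| \\
&= \tfrac{1}{2}\left| \Delta\theta^T H \Delta\theta \right| \le r
\end{aligned}
\tag{4.72}
\]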

where $\nabla L$ is the partial derivative of $L$ with respect to $\theta$ and $r \in \mathbb{R}$ is some small value that indicates how much discrepancy we are willing to accept between our first order approximation of $L$ and the second order approximation of $L$.

Note that the distance measure $d$ takes into account the curvature of the function. It uses the curvature to decide how far from $\theta$ the first order approximation is still reliable. If we have high curvature in some direction, we expect the corresponding radius of the trust region to be small, and if we have low curvature, the radius will be larger.
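To make this concrete, consider a step of length $t$ along a single unit eigenvector $e$ of $H$ with eigenvalue $\lambda$. This restriction to one eigen-direction, and the numerical values of $r$ and $\lambda$ used below, are illustrative assumptions rather than part of the original argument. The constraint of Equation (4.72) then gives an explicit radius:

```latex
\[
\Delta\theta = t\,e
\;\Rightarrow\;
d(\theta, \theta + \Delta\theta) = \tfrac{1}{2}\,|\lambda|\,t^2 \le r
\;\Longleftrightarrow\;
|t| \le \sqrt{\frac{2r}{|\lambda|}} .
\]
% Example with r = 0.02:
%   lambda = 100  gives |t| <= 0.02  (high curvature, small radius)
%   lambda = -1   gives |t| <= 0.2   (low curvature, larger radius)
```

Only the magnitude of the curvature enters the radius, so a direction of negative curvature is trusted exactly as far as a direction of positive curvature of the same size.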

The proposed distance, however, does not easily allow us to solve for $\Delta\theta$ in more than one dimension. If we square the expression to remove the absolute value, we get a function that is quartic in $\Delta\theta$ (the term is raised to the power 4). To address this problem we rely on the following Lemma:

Lemma 5. Let $A$ be a nonsingular square matrix in $\mathbb{R}^n \times \mathbb{R}^n$, and $x \in \mathbb{R}^n$ be some vector. Then it holds that $|x^T A x| \le x^T |A| x$, where $|A|$ is the matrix obtained by taking the absolute value of each of the eigenvalues of $A$.

Proof. Let $e_{[1]}, \ldots, e_{[n]}$ be the different eigenvectors of $A$ and $\lambda_{[1]}, \ldots, \lambda_{[n]}$ the corresponding eigenvalues. We now rewrite the left-hand side by expressing the vector $x$ in terms of these eigenvectors:

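\[
\begin{aligned}
|x^T A x|
&= \Big| \sum_i (x^T e_{[i]})\, e_{[i]}^T A x \Big| \\
&= \Big| \sum_i (x^T e_{[i]})\, \lambda_{[i]}\, (e_{[i]}^T x) \Big| \\
&= \Big| \sum_i \lambda_{[i]} (x^T e_{[i]})^2 \Big|
\end{aligned}
\]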

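We can now use the triangle inequality $\left|\sum_i x_i\right| \le \sum_i |x_i|$ and get that

\[
|x^T A x| \le \sum_i \left| (x^T e_{[i]})^2 \lambda_{[i]} \right|
= \sum_i (x^T e_{[i]})\, |\lambda_{[i]}|\, (e_{[i]}^T x)
= x^T |A| x
\]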
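The inequality of Lemma 5 can also be checked numerically. The sketch below is only a sanity check, not a proof; it assumes that $A$ is symmetric (as the Hessian is), so that `numpy.linalg.eigh` returns an orthonormal set of eigenvectors, and the helper name `abs_matrix` is introduced here purely for illustration.

```python
import numpy as np


def abs_matrix(A):
    """Return |A|: same eigenvectors as symmetric A, absolute eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(A)
    # Scale column j of eigvecs by |lambda_j|, then recombine.
    return (eigvecs * np.abs(eigvals)) @ eigvecs.T


rng = np.random.default_rng(0)
worst_gap = np.inf
for _ in range(1000):
    B = rng.standard_normal((5, 5))
    A = 0.5 * (B + B.T)          # random symmetric, generally indefinite
    x = rng.standard_normal(5)
    lhs = abs(x @ A @ x)         # |x^T A x|
    rhs = x @ abs_matrix(A) @ x  # x^T |A| x
    worst_gap = min(worst_gap, rhs - lhs)

# The lemma predicts that this gap is never negative (up to rounding).
print("smallest value of x^T|A|x - |x^T A x|:", worst_gap)
```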

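Lemma 5 shows that

\[
d(\theta, \theta + \Delta\theta) = \left| \Delta\theta^T H \Delta\theta \right| \le \Delta\theta^T |H| \Delta\theta
\]

where the constant factor $\tfrac{1}{2}$ of Equation (4.72) has been dropped, since it can be absorbed into $r$,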

so we enforce our constraint on this upper bound of the distance, instead of the distance directly, resulting in the following generalized trust region method:

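\[
\begin{aligned}
\Delta\theta = \arg\min_{\Delta\theta}\ & L(\theta) + \nabla L\,\Delta\theta \\
\text{s.t. } & \Delta\theta^T |H| \Delta\theta \le r
\end{aligned}
\tag{4.73}
\]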

Note that, as was discussed before when we introduced the natural gradient, the inequality constraint can be turned into an equality one, as the first order approximation of $L$ has a minimum at infinity, which means that the step will always be on the boundary of the trust region. We can use the Lagrange multipliers method, which gives us:

\[
\Delta\theta = -\nabla L\,|H|^{-1}
\tag{4.74}
\]
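For completeness, the intermediate steps can be written out as follows. This is a sketch of the standard Lagrange multiplier computation applied to Equation (4.73); the symbol $\mu$ for the multiplier is an assumption introduced here, and $\nabla L$ is treated as a row vector, as in the rest of the section.

```latex
\[
\begin{aligned}
\mathcal{J}(\Delta\theta, \mu)
  &= L(\theta) + \nabla L\,\Delta\theta
     + \mu\left(\Delta\theta^T |H| \Delta\theta - r\right), \\
\frac{\partial \mathcal{J}}{\partial \Delta\theta}
  &= \nabla L + 2\mu\,\Delta\theta^T |H| = 0
  \;\;\Longrightarrow\;\;
  \Delta\theta^T = -\frac{1}{2\mu}\,\nabla L\,|H|^{-1}.
\end{aligned}
\]
% |H| is symmetric, and it is invertible whenever H is nonsingular.
% For a minimum the multiplier mu is positive, so 1 / (2 mu) is a positive scalar.
```

This matches Equation (4.74) up to the positive scalar $1/(2\mu)$.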

As before, we do not solve for the Lagrange coefficient in terms of $r$, but rather fold it into the learning rate, for which we carry out a line search. The resulting algorithm has the desired behaviour around critical points, where it uses the right step size (as predicted by the Newton method) while also being able to escape saddle points. That is, if we go back to the approximation of the function near a critical point proposed in Equation (4.66), this method will move on each coordinate by $\Delta v_i$, which is the optimal speed according to the curvature of the function.

Far away from a critical point, the method also moves in the right direction because of its justification as a generalized trust region method. Namely, far away from the critical point, the method defines a neighbourhood in which the first order approximation of $L$ is reliable and minimizes this approximation within this neighbourhood. This means that we always follow a descent direction of $L$. We call this algorithm the Saddle-Free Newton method.
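The behaviour described above can be sketched on a small example where the full Hessian is available. Everything in the block is an illustrative assumption: the toy objective $f(x, y) = x^2 + y^4/4 - y^2/2$ (saddle at the origin, minima at $(0, \pm 1)$), the function names, the small floor on $|\lambda|$, and the fixed learning rate of 1 that stands in for the line search.

```python
import numpy as np


# Toy objective (assumed): f(x, y) = x**2 + y**4 / 4 - y**2 / 2,
# with a saddle at (0, 0) and minima at (0, 1) and (0, -1).
def grad(theta):
    x, y = theta
    return np.array([2.0 * x, y**3 - y])


def hess(theta):
    _, y = theta
    return np.array([[2.0, 0.0], [0.0, 3.0 * y**2 - 1.0]])


def newton_step(g, H):
    """Plain Newton step: -H^{-1} g."""
    return -np.linalg.solve(H, g)


def saddle_free_newton_step(g, H, floor=1e-8):
    """Step -|H|^{-1} g, with |H| built from an explicit eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(H)
    abs_vals = np.maximum(np.abs(eigvals), floor)  # guard against division by ~0
    return -eigvecs @ ((eigvecs.T @ g) / abs_vals)


def run(step_fn, theta0, n_steps=20, lr=1.0):
    """Iterate theta <- theta + lr * step (fixed lr instead of a line search)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta + lr * step_fn(grad(theta), hess(theta))
    return theta


start = [0.5, 0.1]  # close to the saddle, slightly off the y = 0 axis
theta_newton = run(newton_step, start)
theta_sfn = run(saddle_free_newton_step, start)
print("Newton:            ", theta_newton)
print("Saddle-Free Newton:", theta_sfn)
```

From this starting point the plain Newton iteration is attracted to the saddle, whereas the step built from $|H|$ leaves it along the negative-curvature direction and then behaves like Newton once the curvature becomes positive.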

The description of the algorithm suggests that it should behave very well in practice. In fact, if we do not have negative curvature, the algorithm reduces to the Newton method. This makes the algorithm ideal for compact models, where we can afford to compute the whole Hessian (or get close to it) and perform an eigendecomposition of this matrix. For example, the recurrent networks discussed in the next chapters are compact models.

In general, however, the difficulty of this proposed approach is in scaling it up. Specifically, the standard pipeline employed by HF cannot be directly applied because there is no efficient way of computing $|H|x$. The R and L operators cannot yield this computation.

One way of approximating this method is to rely on the Squared Newton method introduced previously. If we assume that all eigenvalues of the Hessian have magnitudes clustered around the same value, then we can use the following approximation:

\[
|H|x \approx \frac{1}{|\lambda_{[\max]}|} H^T H x
\tag{4.75}
\]

The value of $\lambda_{[\max]}$ can easily be approximated using the Power method. One can also view this approach as using a specific per-iteration scaling of the matrix $H^T H$. Normally this scaling would fold back into the learning rate, but if we damp this matrix by adding some other matrix after we have rescaled $H^T H$, then the rescaling becomes important.
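A minimal sketch of this approximation is given below. It assumes that only Hessian-vector products are available (here a dense matrix product stands in for them), estimates $|\lambda_{[\max]}|$ as the norm of $Hv$ during power iteration, and uses a hypothetical Hessian whose eigenvalues all have magnitude 2, the idealized case in which Equation (4.75) holds exactly. All function and variable names are illustrative.

```python
import numpy as np


def power_method_abs_eig(matvec, dim, n_iters=200, seed=0):
    """Estimate |lambda_max| using only matrix-vector products."""
    v = np.random.default_rng(seed).standard_normal(dim)
    v /= np.linalg.norm(v)
    estimate = 0.0
    for _ in range(n_iters):
        w = matvec(v)
        estimate = np.linalg.norm(w)  # converges to |lambda_max|
        v = w / estimate
    return estimate


def approx_abs_hessian_vec(matvec, x, abs_lambda_max):
    """Approximate |H| x by H^T H x / |lambda_max| (H symmetric, so H^T = H)."""
    return matvec(matvec(x)) / abs_lambda_max


# Hypothetical small Hessian whose eigenvalues all have magnitude 2.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
H = Q @ np.diag([2.0, -2.0, 2.0, -2.0]) @ Q.T


def hvp(v):
    """Stand-in for an efficient Hessian-vector product."""
    return H @ v


lam = power_method_abs_eig(hvp, dim=4)
x = rng.standard_normal(4)
approx = approx_abs_hessian_vec(hvp, x, lam)
exact = 2.0 * x  # |H| = 2 I for this particular H
print("estimated |lambda_max|:", lam)
print("max abs error of the approximation:", np.max(np.abs(approx - exact)))
```

When the eigenvalue magnitudes are spread out rather than clustered, the same code still runs, but directions with small $|\lambda|$ are scaled by $\lambda^2 / |\lambda_{[\max]}|$ instead of $|\lambda|$, so the approximation degrades.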

In particular, the Fisher Information Matrix (FIM) is believed to approximate the Hessian well while being positive definite. By relying on the justification in Section 4.3.8, we could use the Fisher Information Matrix to damp the Squared Newton method (where we rescale $H^T H$). This could further reduce the difference between the computed matrix and the matrix $|H|$ obtained by taking the absolute value of the eigenvalues of the Hessian. The additive term from the Squared Newton method should help natural gradient descent when the FIM becomes singular (due to negative curvature) in some direction while the Hessian is not. The advantage of such an approach is that it is efficient to compute in the framework introduced for Hessian-Free Optimization.

More principled approaches might also be possible. We regard the problem of scaling up Saddle-Free Newton as a future research direction. At this point, this thesis will only introduce the algorithm from a theoretical point of view, and argue that this algorithm takes an optimal step near a critical point. In the next section we will also demonstrate the effectiveness of the algorithm on a small scale experiment where we can afford to compute the full Hessian.