
The learning strategy proposed throughout this chapter, namely going from a learning problem in a binary setting to a stochastic optimization problem over a parametrized probability distribution, can be very effective in training feed-forward ANNs on real data. However, when dealing with K-label classification tasks, in order to define the log-likelihood it is necessary to give a proper definition of the underlying stochastic process that determines the output of the network and its probability $P(y \mid x, W)$. Let us first consider the case in which the network outputs a vector $\tau \in \{-1, 1\}^K$, producing an independent binary classification in correspondence of each of the possible labels. Consider a single pattern, whose correct label $k^\star \in \{1, ..., K\}$ is specified by the output vector $y_k^\mu = \delta_{k,k^\star}$:

1. One could simply ask the network to give an output $\tau_{k^\star} = 1$ in correspondence of the correct class, determining a log-likelihood with the shape of a cross-entropy loss:

$$L^\mu(m) = \sum_{k=1}^{K} y_k^\mu \log P(\tau_k = 1 \mid x^\mu, m) = y_{k^\star}^\mu \log P(\tau_{k^\star} = 1 \mid x^\mu, m) \tag{7.81}$$

2. Alternatively, one could also try to obtain $\tau_k = -1$ in all the wrong classes, with the log-likelihood:

$$L^\mu(m) = \sum_{k=1}^{K} y_k^\mu \log P(\tau_k = 2y_k^\mu - 1 \mid x^\mu, m) \tag{7.82}$$

It is possible, instead, to define the stochastic process differently, so that the loss function becomes more similar to the softmax function usually employed in Deep Learning. One can consider a stochastic model where the output of the network is accepted only if it is an indicator of one of the K classes; otherwise the trajectory is rejected and the extraction of the synapses is repeated. The unnormalized probability of obtaining the desired output is still:

$$P(k^\star \mid x^\mu, m) = P(\tau_{k^\star} = 1 \mid x^\mu, m) \prod_{k \neq k^\star} P(\tau_k = -1 \mid x^\mu, m) \tag{7.83}$$

However, if we also take into account the normalization, we obtain an expression for the likelihood of a single datapoint, which can be simplified into the form:

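$$L^\mu(m) = \frac{\rho_{k^\star} R(k^\star \mid x^\mu, m)}{\sum_{k=1}^{K} \rho_k R(k \mid x^\mu, m)} \tag{7.84}$$

where $R(k \mid x^\mu, m) = P(\tau_k = 1 \mid x^\mu, m) / P(\tau_k = -1 \mid x^\mu, m)$ (the ratio form follows from dividing both the probability of equation 7.83 and its normalization over the K indicator outputs by $\prod_{k} P(\tau_k = -1 \mid x^\mu, m)$), and where we introduced the weights $\rho_k$, with $\rho_k = 1$ $\forall k \neq k^\star$ and with $\rho_{k^\star}$ playing the role of a robustness parameter, to be fixed at a value $0 < \rho_{k^\star} \le 1$. This can encourage a higher probability in correspondence of the correct label.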
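The relation between equations 7.83 and 7.84 can be checked numerically. The following is a minimal sketch, assuming that the per-class probabilities P(τk = 1 | x, m) are already available as a vector; the function names and the example values are illustrative and do not come from the simulations described in this chapter.

```python
import numpy as np

def class_likelihood(p_plus, k_star, rho_star=1.0):
    """Likelihood of eq. 7.84, given p_plus[k] = P(tau_k = +1 | x, m)."""
    p_plus = np.asarray(p_plus, dtype=float)
    ratios = p_plus / (1.0 - p_plus)       # R(k) = P(tau_k=+1) / P(tau_k=-1)
    rho = np.ones_like(ratios)             # rho_k = 1 for the wrong classes
    rho[k_star] = rho_star                 # robustness weight on the correct class
    weighted = rho * ratios
    return weighted[k_star] / weighted.sum()

def normalized_indicator_probability(p_plus, k_star):
    """Eq. 7.83 divided by its sum over the K indicator outputs (rho = 1)."""
    p_plus = np.asarray(p_plus, dtype=float)
    K = len(p_plus)

    def indicator_prob(j):
        # tau_j = +1 and tau_k = -1 for every k != j, with independent outputs
        return np.where(np.arange(K) == j, p_plus, 1.0 - p_plus).prod()

    return indicator_prob(k_star) / sum(indicator_prob(j) for j in range(K))

p = np.array([0.9, 0.2, 0.1])              # hypothetical output probabilities
print(class_likelihood(p, 0), normalized_indicator_probability(p, 0))
print(class_likelihood(p, 0, rho_star=0.5))
```

With ρk⋆ = 1 the two functions agree, while a value ρk⋆ < 1 lowers the likelihood assigned to the same outputs, so that a larger R(k⋆) is needed to reach a given likelihood.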

The drawback of these definitions of the stochastic process is that the space of possible outputs is exponential, with $2^K$ possible output vectors. Therefore, the probability of actually obtaining an acceptable output $\tau_k = 2\delta_{k,k^\star} - 1$ is exponentially small and, for example, an MC sampling of $P(y \mid x, W)$, starting from a random configuration of the parameters θ, would have a very high rejection rate, approaching 1.
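The size of the rejection problem can be illustrated with a short sketch. It assumes independent outputs with P(τk = 1) = 0.5 for every class, as a stand-in for an untrained network; this choice, like the function names, is an assumption made only for illustration. In that case exactly K of the 2^K output vectors are indicators, so the acceptance probability is K / 2^K.

```python
import numpy as np

def acceptance_probability(p_plus):
    """Probability that independent binary outputs form an indicator vector."""
    p_plus = np.asarray(p_plus, dtype=float)
    K = len(p_plus)
    total = 0.0
    for j in range(K):
        # tau_j = +1 and all the other outputs equal to -1
        total += np.where(np.arange(K) == j, p_plus, 1.0 - p_plus).prod()
    return total

def sampled_acceptance(p_plus, n_samples, rng):
    """Monte Carlo estimate of the same quantity by direct sampling."""
    p_plus = np.asarray(p_plus, dtype=float)
    tau_is_plus = rng.random((n_samples, len(p_plus))) < p_plus
    return np.mean(tau_is_plus.sum(axis=1) == 1)

rng = np.random.default_rng(0)
for K in (2, 5, 10, 20):
    p = np.full(K, 0.5)                    # uninformative outputs (assumption)
    print(K, acceptance_probability(p), sampled_acceptance(p, 100_000, rng))
```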

The other problem concerns the way the probability $P(\tau_{k^\star} = 1 \mid x, m)$ is computed in practice: taking care of the potential correlations between the inputs poses serious technical problems and, on the other hand, even an MC sampling would become unfeasible with a growing number of hidden layers. Similarly to what we did in section 3.4, we can completely neglect the correlations and simply work in a factorized Gaussian approximation (see also [82]), where the standard back-propagation algorithm can be applied. In the simulations, we chose to employ the natural gradient (with $(1 - m_i^2)\,\partial_{m_i}$ instead of $\partial_{m_i}$), with a learning rate equal to 1; the loss function was set to be that of equation 7.84, with $\rho_{k^\star} = 0.5$. Moreover, we found it very important, in practice, to apply the Dropout heuristic, with η = 0.25 in the input layer and η = 0.3 in the intermediate layers. This effectiveness is probably due to two reasons: first, one of the properties of Dropout is that of decorrelating the hidden units, in accordance with the naive assumption we made in the factorized approximation; secondly, the error function can suffer from the vanishing gradient problem close to saturation [90], in the late stages of the training procedure, and Dropout can help enhance the error signal received by the small magnetizations.
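The natural-gradient update can be sketched on the simplest possible case, a single binary perceptron. The code below is a minimal illustration under stated assumptions, not the multi-layer implementation used in the simulations: the weights Wi = ±1 are taken as independent with mean mi, the pre-activation is treated as Gaussian, and the magnetizations are clipped to stay inside (−1, 1). The clipping value and all function names are hypothetical.

```python
import math
import numpy as np

def log_likelihood_and_grad(m, x, y):
    """log P(tau = y | x, m) and its gradient in m, Gaussian approximation.

    Assumption: sum_i W_i x_i is Gaussian with mean sum_i m_i x_i and
    variance sum_i (1 - m_i^2) x_i^2 (independent binary weights).
    """
    mean = m @ x
    sigma = math.sqrt(((1.0 - m**2) * x**2).sum() + 1e-12)
    z = y * mean / sigma
    prob = 0.5 * math.erfc(-z / math.sqrt(2.0))      # Gaussian CDF at z
    prob = max(prob, 1e-300)                         # guard against underflow
    density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    dz_dm = y * (x / sigma + mean * m * x**2 / sigma**3)
    return math.log(prob), (density / prob) * dz_dm

def natural_gradient_step(m, grad, lr=1.0, m_max=0.999):
    """Ascent step using (1 - m_i^2) d/dm_i in place of d/dm_i."""
    m_new = m + lr * (1.0 - m**2) * grad
    return np.clip(m_new, -m_max, m_max)             # clipping is an assumption

rng = np.random.default_rng(0)
N = 51
x = rng.choice([-1.0, 1.0], size=N)                  # one random binary pattern
m = np.zeros(N)                                      # unbiased initial synapses
before, grad = log_likelihood_and_grad(m, x, 1)
m = natural_gradient_step(m, grad)
after, _ = log_likelihood_and_grad(m, x, 1)
print(before, after)
```

The factor (1 − mi²) shrinks the update of magnetizations that are already close to ±1, which is also where the gradient of the log-likelihood tends to be small.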

On the MNIST digit recognition benchmark [71], we obtained the following generalization errors:

• ∼ 1.3% with a fully connected architecture, with two hidden layers with 801 units each.

• ∼ 1.2% with three hidden layers of size 801.

These results are very promising, given that we are training a network with binary weights and no convolutional layers [7].

In this PhD thesis, we approached the problem of learning in Artificial Neural Networks with discrete synapses, from both a theoretical and an algorithmic point of view. The relevance of this subject is rapidly growing in the Deep Learning community, as the impressive success of DNNs in a variety of complex recognition tasks is accompanied by growing memory and computational costs, calling for methods of obtaining more compact and robust representations of ANNs. This apparent simplification, going from continuous to discrete synaptic weights, might also be crucial for developing more realistic models of neural computation as well as hardware implementations of ANNs, but it entails a series of technical and theoretical complications.

The initial theoretical objective of our work was that of tracing back the effectiveness of a few heuristic solvers to the static properties of the loss landscape, and to resolve the clear discrepancy between the equilibrium analytical predictions and the dynamical properties of these learning processes, in the Binary Perceptron. The main mathematical tool we employed is the Replica Trick, borrowed from the Physics of Disordered Systems: the goal was that of obtaining a Large Deviation analysis able to enhance the statistical weight of configurations immersed in dense regions of solutions, since the solutions found by the algorithms exhibited this peculiar feature. The key idea was to introduce a local entropy potential, measuring the number of neighboring solutions, and to use it as a modifier of the standard energy-based Boltzmann-Gibbs measure: the dominating effect of the isolated solutions was thus canceled out, and a sub-dominant dense (“unfrozen”) cluster of solutions was discovered in the loss landscape. This novel structure was also found to break apart and disappear at a certain constraint density, very close to the measured algorithmic threshold [1].
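For very small systems, the local entropy can be computed by exhaustive enumeration, which gives a concrete picture of the reweighting. The sketch below assumes a teacher-student binary perceptron with N = 11 synapses and P = 6 random patterns, and counts the solutions within a Hamming radius d = 2 of each solution; these sizes, the variable names and the reweighting exponent `y_weight` are illustrative choices, and the enumeration is only a toy substitute for the replica calculation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, P, d = 11, 6, 2             # N odd, so the pre-activations never vanish
X = rng.choice([-1, 1], size=(P, N))
teacher = rng.choice([-1, 1], size=N)
y = np.sign(X @ teacher)       # teacher labels: at least one solution exists

# All 2^N binary weight vectors, and the subset classifying every pattern
configs = np.array(list(itertools.product([-1, 1], repeat=N)))
is_solution = np.all((configs @ X.T) * y > 0, axis=1)
solutions = configs[is_solution]

# Hamming distance between two solutions from their overlap: (N - W.W') / 2
hamming = (N - solutions @ solutions.T) // 2
neighbours = (hamming <= d).sum(axis=1)        # each solution counts itself
local_entropy = np.log(neighbours) / N

# Reweighted measure over solutions: weight proportional to exp(y * N * S_loc)
y_weight = 3.0
weights = neighbours.astype(float) ** y_weight
weights /= weights.sum()

print(len(solutions), local_entropy.min(), local_entropy.max())
```

Under the flat measure every solution has the same weight, whereas the reweighted measure concentrates on the solutions with the largest number of neighbouring solutions.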

Conceptually, the local-entropy-reweighting formalism can be seen as a generalization of the 1RSB formalism: compared to the ergodicity breaking scheme described by the Parisi Ansatz, in our scheme we keep an additional dependency on a distance parameter, that explicitly introduces a notion of locality in our model, and potentially allows the description of structures where the usual ultra-metric symmetry of the Gibbs states is broken [6].

We also extended our analysis to the case of the Generalized Perceptron, where a discrete set of possible values for the synaptic weights is allowed, and the training set input and output statistics can be biased. The same qualitative picture holds in this case as well, and we were able to show that the overall benefit of adding more synaptic states rapidly vanishes, highlighting the relevance of the problem of learning with discrete synapses [3].

Building on the theoretical understanding obtained through these Large Deviation analyses, we developed a series of algorithms that can target the sub-dominant dense regions of solutions explicitly:

• EdMC, an MCMC optimization scheme, was first introduced as a proof of concept: in this simple solver the objective function is the local entropy itself, estimated in the Bethe approximation through Belief Propagation. A simple Simulated Annealing procedure, both in the usual temperature β and in γ, a parameter controlling the radius inside which the local entropy is estimated, can focus the measure on smaller and denser regions, easily providing solutions of the Perceptron. The landscape explored by this solver is much smoother than the rugged energy landscape, and the process is able to avoid the exponentially numerous meta-stable states even in the greedy zero-temperature limit β → ∞. To prove the versatility of this strategy, we also applied it to the 4-SAT problem, obtaining good performance also in the hard region [2]. The main bottleneck in further generalizations remains the problem of computing the local entropy efficiently, as the validity of the cavity approximation has to be assessed for each problem at hand and BP may not be applicable.

• In order to avoid the two-level formulation, based on the employment of BP for the local entropy estimate, we proposed an alternative and more general strategy for obtaining solutions immersed in dense clusters: we defined the Robust Ensemble, where the original partition function is replicated and the replicas interact elastically with a central reference. Any optimization strategy (e.g., Simulated Annealing, Stochastic Gradient Descent, Belief Propagation) applied on this system, instead of the original one, is naturally attracted towards regions with high local entropy of low energy configurations. Our simple recipe can be easily adapted to any learning algorithm: one only needs to run a set of processes in parallel and couple them, to drive them towards high local entropy regions [4].

Similar to EdMC, this type of strategy was proven to be quite general, as it was shown to be very effective also in the K-SAT problem.

• Finally, we also showed that the introduction of a source of stochasticity at the level of the synapses can be exploited as a tool for directing the learning process into the dense cluster. The robustness required for learning in this noisy setting can in fact force the network to learn representations that are unaffected by small local perturbations, similarly to what would happen in a dense region of solutions. This stochastic framework naturally induces a Bayesian treatment of the neural network model, where the aim is that of learning a continuous parametrization of a probability distribution over the synaptic states: this allows one to employ a simple gradient descent procedure on these parameters, which would not be directly applicable in the discrete context. We were able to prove analytically that, in the Perceptron architecture, this learning procedure ends up in the same dense sub-dominant states found in the original Large Deviation analysis [5]. Moreover, this procedure can be easily generalized to deeper architectures [7] and different constraint satisfaction problems.
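The replica-coupling recipe of the Robust Ensemble can be sketched for a continuous toy problem. The code below is a minimal illustration, assuming plain gradient descent on the coupled energy Σa E(wa) + (γ/2) Σa |wa − w̄|², where w̄ is the central reference; the function names, the one-dimensional toy loss and all parameter values are assumptions made for the example, and no claim is made here about which minimum the toy run selects.

```python
import numpy as np

def replicated_gradient_descent(grad, replicas, gamma=1.0, lr=0.1, n_steps=500):
    """Gradient descent on sum_a E(w_a) + (gamma / 2) sum_a |w_a - center|^2."""
    replicas = np.array(replicas, dtype=float)   # shape (n_replicas, dim)
    center = replicas.mean(axis=0)
    for _ in range(n_steps):
        pull = gamma * (replicas - center)       # elastic term for each replica
        new_replicas = replicas - lr * (grad(replicas) + pull)
        center = center + lr * pull.sum(axis=0)  # center follows the replicas
        replicas = new_replicas
    return replicas, center

def toy_grad(w):
    """Gradient of a 1D loss with a narrow well at -2 and a wide well at +2."""
    narrow = (w + 2.0) / 0.3**2 * np.exp(-(w + 2.0)**2 / (2 * 0.3**2))
    wide = (w - 2.0) * np.exp(-(w - 2.0)**2 / 2.0)
    return narrow + wide

start = np.linspace(-3.0, 3.0, 5).reshape(-1, 1)     # five replicas, dim = 1
replicas, center = replicated_gradient_descent(toy_grad, start, gamma=0.5, lr=0.05)
print(replicas.ravel(), center)
```

Each replica only needs the gradient of its own loss and the current position of the center, which is why the same coupling can be wrapped around other optimization schemes.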

The idea of searching for high local entropy regions seems to be crucial in the generalization context: in the teacher-student scenario the solutions inside the cluster show remarkably smaller generalization errors with respect to the typical isolated solutions. Moreover, in the numerical tests performed on real-world data (e.g., the handwritten-digit image-recognition benchmark MNIST), we also observed that the well-performing learning algorithms invariably end up in dense regions of solutions, and that walking away from their core severely hampers the generalization performance [1]. The intuitive explanation of this property could be the following: the center of these wide, very robust regions can be interpreted as a Bayesian estimator for the whole extensive neighborhood. This is even more naturally understood when the stochastic synapses are considered, where the mode of the probability distribution, a configuration at the core of the dense cluster, is in fact the solution that carries the largest weight in the Bayesian integral.

It is becoming clear that a phenomenon quite similar to the one we first observed in simple discrete ANNs is also manifesting in the context of the complex deep neural network models currently employed in machine learning applications. In [79], for example, an algorithm mainly developed for obtaining efficient parallelization of the training process, EASGD, also exhibited a nice generalization performance boost, and its definition is actually equivalent to a simple SGD procedure in the Robust Ensemble. Moreover, it seems plausible that many effective heuristics, shaped and tuned in order to find solutions that generalize well, actually search for wide flat regions in the loss landscape, which are the transposition of high local entropy regions in the continuous setting.

Some progress towards designing more explicit and interpretable learning heuristics was presented in [74], where, building on our theoretical analysis and on the observation of a correlation between good generalization scores and the presence of wide valleys, the authors designed an algorithm akin to EdMC for deep continuous networks: Entropy-SGD achieves state-of-the-art performance by exploiting the geometric properties of the energy landscape, targeting regions with a high entropy, in this case estimated through a Langevin Dynamics. These findings run counter to the widespread belief that deep networks present multiple equivalent local minima with the same loss. Moreover, Parle, a hybrid algorithm inspired by the RE and EASGD, shows the potential of a parallel approach, where an explicit redirection towards the well-generalizing flat minima is also accompanied by a generous wall-clock time speedup, with infrequent communication requirements between the processes [92]. It might even be possible to exploit this parallel formulation for splitting the dataset instead of sharing it. All in all, these results seem to motivate a fundamental reconsideration of distributed machine learning in non-convex problems, such as DNNs.

Another research direction is that of finding a role for the local entropy also in the unsupervised learning scenario, both in attractor neural networks and in generative models like the Restricted Boltzmann Machine. The enhanced robustness to noise might in fact be relevant for modeling and memorizing real data, which are often fuzzy and ambiguous. In this direction, in [8] we propose a new learning rule, Delayed Correlation Matching, which shows that the learning process can be built on highly noisy measurements and very small signals. However, the link with the reweighted measure has not yet been established, and it probably requires a more general rethinking of the true objective of inference processes.
