2.5 OFFENCES DEFINED IN THE CÓDIGO ORGÁNICO INTEGRAL PENAL
2.5.1 Article 229: Illegal disclosure of a database
When the network is very large, Monte Carlo methods can become computationally very expensive. A simple and general alternative for finding minima of the energy is Gradient Descent (GD) or one of its many variants. In the neural networks (NN) context, all these algorithms are generically called back-propagation algorithms [61]. In particular, Stochastic GD (SGD) is the basis of most of the recently developed “deep learning” techniques employed in Machine Learning. In the following, we demonstrate that performing gradient descent over the RE energy defined in equation (6.6) leads to a noticeable improvement in the performance of the algorithm; moreover, the solutions found by the algorithm are indeed part of dense regions, as expected.
Implementation
Gradient Descent is defined only for differentiable systems, and thus it needs some adaptations in order to be applied to the case of systems with discrete variables.
One possible work-around is a generalization of the “Clipped Perceptron” (CP) algorithm [78] to a mini-batch learning scenario: we associate an auxiliary continuous variable $\mathcal{W}$ to each binary synaptic variable $W$, binding them through the relationship $W = \mathrm{sign}(\mathcal{W})$. The gradients are evaluated at the actual binary synapses $W$, but accumulated in the auxiliary variables:

\[ (\mathcal{W}^k_i)^{t+1} = (\mathcal{W}^k_i)^{t} - \eta \frac{1}{|m(t)|} \sum_{\mu \in m(t)} \frac{\partial}{\partial W^k_i} E^\mu(W^t) \tag{6.17} \]

\[ (W^k_i)^{t+1} = \mathrm{sign}\left( (\mathcal{W}^k_i)^{t+1} \right) \tag{6.18} \]

where $\eta$ is the learning rate and $m(t)$ is a set of pattern indices (the so-called minibatch). The CP algorithm is recovered in the case of a single-layer network without replication ($K = 1$, $y = 1$), with a fixed learning rate, and in the fully-online regime ($|m(t)| = 1$). In that case, since $E^\mu(W) = R\left(-\sum_j W_j \xi^\mu_j\right)$, with $R(x) = \frac{1}{2}(x + 1)\,\Theta(x)$, the gradient becomes $\partial_{W_i} E^\mu(W) = -\frac{1}{2} \xi^\mu_i \,\Theta\left(-\sum_j W_j \xi^\mu_j\right)$. The relation (6.18) is scale-invariant, so we can just set $\eta = 4$, which turns the step $-\eta\,\partial_{W_i} E^\mu$ into an integer increment, and obtain

\[ \mathcal{W}^{t+1}_i = \mathcal{W}^{t}_i + 2 \xi^\mu_i \,\Theta\left( -\sum_j W^t_j \xi^\mu_j \right) \tag{6.19} \]

where the auxiliary quantities $\mathcal{W}$ can be restricted to discrete values as well, if they are initialized as integers. We note that the CP rule by itself does not achieve an extensive capacity in the large $N$ limit; it is however possible to make it efficient, as in the CP+R heuristic algorithm (see section 3.2) or by adding the interaction term as in the RE.
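As an illustration of rule (6.19), the following is a minimal sketch of the online clipped-perceptron update in plain Python. It assumes that the labels are absorbed into the patterns (the desired output is always +1), that N is odd so that overlaps never vanish, and that the auxiliary variables start as odd integers; the function names `sign` and `clipped_perceptron` and the stopping criterion are illustrative choices, not taken from the text.

```python
import random

def sign(x):
    # Assumed convention: sign(0) = +1, so weights always stay in {-1, +1}.
    return 1 if x >= 0 else -1

def clipped_perceptron(patterns, max_epochs=1000, seed=0):
    """Online clipped-perceptron rule for a binary perceptron.

    Each pattern is a list of +/-1 entries with desired output +1.
    Returns the binary weights, the auxiliary variables, and the epoch
    at which a full error-free pass was observed (or max_epochs).
    """
    rng = random.Random(seed)
    n = len(patterns[0])
    # Odd integer initialization: updates of +/-2 keep every entry odd,
    # so the auxiliary variables remain integers and never hit zero.
    aux = [rng.choice((-1, 1)) for _ in range(n)]
    w = [sign(a) for a in aux]
    for epoch in range(max_epochs):
        errors = 0
        for xi in patterns:
            overlap = sum(wi * x for wi, x in zip(w, xi))
            if overlap < 0:  # Theta(-overlap) = 1: the pattern is misclassified
                errors += 1
                aux = [a + 2 * x for a, x in zip(aux, xi)]  # step on auxiliary variables
                w = [sign(a) for a in aux]                  # clip back to binary weights
        if errors == 0:
            return w, aux, epoch
    return w, aux, max_epochs
```

The gradient is computed from the binary weights `w`, while the step is accumulated in `aux`; this separation is the only ingredient the sketch is meant to show. Convergence is not guaranteed in general, consistent with the remark that the CP rule alone does not achieve an extensive capacity.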
In the two-layer case ($K > 1$) the energy associated to a wrong classification can be defined as the minimum number of spin flips needed to correct the output. The computation of the gradient becomes more involved, but it gives a non-zero contribution only in case of error, and only for those units $k$ which contribute to the energy computation. Also in this case, since setting $\eta = 4$ restricts the gradient to 3 possible integer values, we could use discretized variables for the $\mathcal{W}$. It is interesting to point out that a slight variation of this update rule, in which only the most easily fixable unit is affected, gives the extended CP+R rule described in section 3.2, which gave good results on a real-world learning task when the uniform reinforcement term was added. Note that the difference between the two rules becomes irrelevant in the later stages of learning, when the overall energy is low.
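The energy of the two-layer case can be sketched as follows. This is one possible reading of "minimum number of spin flips", under several assumptions that are not stated in the text: the machine is fully connected (all hidden units see the same input), the output is a majority vote over an odd number K of units, the desired output is +1, and N is odd. The names `unit_cost` and `committee_energy` are hypothetical.

```python
def unit_cost(w_k, xi):
    """Flips needed to make one hidden unit output +1.

    This equals R(-overlap) with R(x) = (x + 1) / 2 * Theta(x): each flip
    of a wrongly aligned weight raises the overlap by 2.
    """
    overlap = sum(w * x for w, x in zip(w_k, xi))
    return (1 - overlap) // 2 if overlap < 0 else 0

def committee_energy(weights, xi):
    """Return (energy, contributing_units) for a single pattern.

    weights: list of K weight vectors with +/-1 entries (K assumed odd).
    The energy is the cheapest total number of flips that restores a
    majority of hidden units with output +1.
    """
    costs = [unit_cost(w_k, xi) for w_k in weights]
    k = len(weights)
    n_correct = sum(1 for c in costs if c == 0)
    n_needed = (k + 1) // 2 - n_correct  # units still missing for a majority
    if n_needed <= 0:
        return 0, []                      # pattern already classified correctly
    # Fix the cheapest wrong units first; only these contribute to the energy.
    wrong = sorted((c, idx) for idx, c in enumerate(costs) if c > 0)
    chosen = wrong[:n_needed]
    return sum(c for c, _ in chosen), [idx for _, idx in chosen]
```

Under this reading, the gradient for a misclassified pattern is non-zero only for the units listed in `contributing_units`, and restricting the list to its first element corresponds to affecting only the most easily fixable unit.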
Once we have the gradient of $E(W)$ separately for each system, we can add the interaction of the RE (with the traced-out center), and obtain the full SGD update:

\[ (\mathcal{W}^a_i)^{t+1} = (\mathcal{W}^a_i)^{t} - \eta \frac{1}{|m(t)|} \sum_{\mu \in m(t)} \left. \frac{\partial E^\mu}{\partial W_i}(W) \right|_{W = (W^a)^t} + \eta' \left( \tanh\left( \gamma \sum_{b=1}^{y} (W^b_i)^t \right) - (W^a_i)^t \right) \tag{6.20} \]

where we used $\eta' = \frac{\gamma}{\beta}\eta$ as a control parameter, such that it remains finite in the limit $\beta, \gamma \to \infty$; in this limit the $\tanh$ reduces to a sign.
The update equation (6.20) can be implemented as follows. At each time step, we pick a replica $a$ uniformly at random and compute the gradient with respect to a mini-batch $m(t)$ of patterns; we partially update $\mathcal{W}^a$ and $W^a$; we compute the gradient with respect to the interaction term using the stored value of $\sum_{a=1}^{y} W^a_i$, and update it; finally, we complete the updates of $\mathcal{W}^a$ and $W^a$. This scheme can be easily parallelized, since it alternates standard learning periods, in which each replica acts independently, with brief interaction periods, similarly to what was done in [79]. In our tests, we kept the learning rates $\eta$ and $\eta'$ fixed during the training process, and we implemented the usual scoping procedure.
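The steps above can be sketched for replicated binary perceptrons in the limit β, γ → ∞, where the tanh in (6.20) becomes a sign. This is a hedged illustration and not the implementation used for the reported results: the desired output is assumed to be +1 for every pattern, N and y are assumed odd, the stored sum is kept in sync after each partial update (one possible reading of the ordering), and the names `refresh`, `replicated_sgd_step`, `eta_int` (standing for η′) and the default values are hypothetical.

```python
import random

def sign(x):
    # Assumed convention: sign(0) = +1, so weights always stay in {-1, +1}.
    return 1 if x >= 0 else -1

def refresh(a, aux, w, total):
    """Re-clip replica a and keep the stored sum over replicas in sync."""
    for i, value in enumerate(aux[a]):
        new = sign(value)
        total[i] += new - w[a][i]
        w[a][i] = new

def replicated_sgd_step(aux, w, total, patterns, rng,
                        eta=4.0, eta_int=0.5, batch_size=2):
    """One step of the replicated update, applied in place.

    aux[a][i]: auxiliary variables, w[a][i]: binary weights,
    total[i]: stored value of the sum of w[a][i] over replicas.
    """
    a = rng.randrange(len(w))  # pick a replica uniformly at random
    minibatch = rng.sample(patterns, min(batch_size, len(patterns)))
    # Gradient part: each misclassified pattern has gradient -xi / 2, and
    # w[a] is held fixed while the minibatch is processed.
    for xi in minibatch:
        if sum(wi * x for wi, x in zip(w[a], xi)) < 0:
            for i, x in enumerate(xi):
                aux[a][i] += eta * 0.5 * x / len(minibatch)
    refresh(a, aux, w, total)  # partial update of aux, w and the stored sum
    # Interaction part: pull replica a towards the sign of the stored sum.
    for i in range(len(total)):
        aux[a][i] += eta_int * (sign(total[i]) - w[a][i])
    refresh(a, aux, w, total)  # complete the update
    return a
```

Because each call touches a single replica and only reads the stored sum, the gradient parts of different replicas could run independently between interaction steps, which is the property that makes the scheme easy to parallelize.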
Numerical results
In figure 6.4 we can see the results obtained in the case of the fully-connected committee machine: the introduction of the interaction term greatly improves the capacity of the network (from 0.3 to almost 0.6), and generally reduces the number of required presentations of the dataset (epochs); moreover, when the algorithm fails to solve the instance, the configurations reached have a lower error rate than in the non-interacting version. We also observed the same qualitative results in the Perceptron, where a capacity exceeding 0.7 can be reached, suggesting that Replicated SGD is able to achieve near-optimal learning performance.
Relationship with EASGD
It is interesting to note that a very similar learning strategy (a replicated system in which each replica is attracted towards a reference configuration), called Elastic Averaging SGD (EASGD), was proposed in [79] (see also [80]). The context was that of deep convolutional networks with continuous variables, and EASGD was heuristically introduced to exploit parallel computing environments under communication constraints. In that work, the strategy of replicating the system and introducing the elastic interaction was employed together with the usual deep learning techniques (e.g. momentum), so it is difficult to fully disentangle the effect of the various heuristics. However, their results clearly demonstrate a benefit from introducing the replicas in terms of training error, test error and convergence time.

Fig. 6.4 Replicated Stochastic Gradient Descent on a fully-connected committee machine with N = 1605 synapses and K = 5 units in the second layer: comparison between the non-interacting (i.e. standard SGD) and interacting versions, using y = 7 replicas and a minibatch size of 80 patterns. Each point shows averages and standard deviations on 10 samples with optimal choice of the parameters, as a function of the training set size. Top: minimum training error rate achieved after $10^4$ epochs. Bottom: number of epochs required to find a solution. Only the cases with 100% success rate are shown (note that the interacting case at α = 0.6 has 50% success rate but an error rate of just 0.07%).
It is plausible that the general underlying reason for the effectiveness of the method is similar, and related to the possibility of accessing robust low-energy states in the space of configurations, although a conclusive assessment is difficult due to the great jump in complexity in the choice of the network architecture.