• No se han encontrado resultados

CAPITULO II: PRECEDENTES GENERALES Y ESPECÍFICOS

2.1 CONTEXTO URBANO

This section details the results of three sets of experiments where the robustness to the learning rate of IB is compared to that of EB.4 Since IB is equivalent to EB when the learning rate is small,

we expect little difference between the methods in the limit of small learning rates. However, we expect that IB will be more stable and have lower loss than EB for larger learning rates.

Classification, autoencoding and music prediction tasks. For the first set of experiments, we applied IB and EB to three different but common machine learning tasks. The first task was image classification on the MNIST dataset [LeCun, 1998] with an architecture consisting of two convolutional layers, an arctan layer and a relu layer. The second task also uses the MNIST dataset, but for an 8 layer relu autoencoding architecture. The third task involves music prediction on four music datasets, JSB Chorales, MuseData, Nottingham and Piano-midi.de [Boulanger-Lewandowski et al., 2012], for which a simple RNN architecture is used with an arctan activation function.

For each dataset-architecture pair we investigated the performance of EB and IB over a range of learning rates where EB performs well (see Appendix E.4.5 for more details on how these learning rates were chosen). Since IB and EB have similar asymptotic convergence rates, the difference between the methods will be most evident in the initial epochs of training. In our experiments we only focus on the performance after the first epoch of training for the MNIST datasets, and after the fifth epoch for the music datasets.5 Five random seeds are used for each configuration in order

to understand the variance of the performance of the methods.6

Figure 6.3 displays the results for the experiments. EB and IB have near-identical performance when the learning rate is small. However, as the learning rate increases, the performance of EB deteriorates far more quickly as compared to IB. Over the six datasets, the learning rate at which

4A more extensive description of the experimental setup and results are given in Appendices E.4 and E.5. 5The music datasets have fewer training examples and so more epochs are needed to see convergence.

6The same seeds are used for both EB and IB. Five seeds are used for all experiments, except MNIST-classification

CHAPTER 6. IMPLICIT BACKPROPAGATION 107 0.5 1.0 1.5 2.0 Learning rate 0.0 0.5 1.0 1.5 2.0 2.5 MNIST-classification explicit implicit exact ISGD 0 100 200 300 400 500 Learning rate 0.000 0.005 0.010 0.015 JSB Chorales-RNN explicit implicit 0 100 200 300 400 500 Learning rate 0.001 0.002 0.003 0.004 0.005 explicitMuseData-RNN implicit 0.000 0.005 0.010 0.015 0.020 Learning rate 2 4 6 MNIST-autoencoder explicit implicit 0 100 200 300 400 500 Learning rate 0.0000 0.0005 0.0010 0.0015 Nottingham-RNN explicit implicit 0 100 200 300 400 500 Learning rate 0.00025 0.00050 0.00075 0.00100 0.00125 Piano-midi.de-RNN explicit implicit

Figure 6.3: Training loss of EB and IB (lower is better). The plots display the mean performance and one standard deviation errors. The MNIST-classification plot also shows an “exact” version of ISGD with an inner gradient descent optimizer.

IB starts to diverge is at least 20% higher than that of EB.7 The benefit of IB over EB is most

noticeable for the music prediction problems where for high learning rates IB has much better mean performance and lower variance than EB. A potential explanation for this behaviour is that IB is better able to deal with exploding gradients, which are more prevalent in RNN training.

Exact ISGD. For MNIST-classification we also investigate the potential performance of exact

ISGD. Instead of using our IB approximation for the ISGD update, we directly optimize (6.1) using gradient descent. For each ISGD update we take a total of 100 gradient descent steps of (6.1) at a learning rate 10 times smaller than the “outer” learning rateηt. It is evident from Figure 6.3 that this method achieves the best performance and is remarkably robust to the learning rate. Since exact ISGD uses 100 extra gradient descent steps per iteration, it is 100 times slower than the other methods, and is thus impractically slow. However, its impressive performance indicates that ISGD-based methods have great potential for neural networks.

7The threshold for divergence that we use is when the mean loss exceeds the average of EB’s minimum and

Run times. According to the bounds derived in Section 6.4.5, the run time of IB should be no more than 12% longer per epoch than EB on any of our experiments. With our basic Pytorch implementation IB took between 16% and 216% longer in practice, depending on the architecture used (see Appendix E.5.1 for more details). With a more careful implementation of IB we expect these run times to decrease to at least the levels indicated by the bounds. Using activation functions with more efficient IB updates, like the smoothstep instead of arctan, would further reduce the run time.

UCI datasets. Our second set of experiments is on 121 classification datasets from the UCI database [Fern´andez-Delgado et al., 2014]. We consider a 4 layer feedforward neural network run for 10 epochs on each dataset. In contrast to the above experiments, we use the same coarse grid of 10 learning rates between 0.001 and 50 for all datasets. For each algorithm and dataset the best performing learning rate was found on the training set (measured by the performance on the training set). The neural network trained with this learning rate was then applied to the test set. Overall we found IB to have a 0.13% higher average accuracy on the test set. The similarity in performance of IB and EB is likely due to the small size of the datasets (some datasets have as few as 4 features) and relatively shallow architecture making the network relatively easy to train, as well as the coarseness of the learning rate grid.

Clipping. In our final set of experiments we investigated the effect of clipping on IB and EB. Both IB and clipping can be interpreted as approximate trust-region methods. Consequently, we expect IB to be less influenced by clipping than EB. This was indeed observed in our experiments. A total of 9 experiments were run with different clipping thresholds applied to RNNs on the music datasets (see Appendix E.4 for details). Clipping improved EB’s performance for higher learning rates in 7 out of the 9 experiments, whereas IB’s performance was only improved in 2. IB without clipping had an equal or lower loss than EB with clipping for all learning rates in all experiments except for one (Piano-midi.de with a clipping threshold of 0.1). This suggests that IB is a more effective method for training RNNs than EB with clipping.

The effect of clipping on IB and EB applied to MNIST-classification and MNIST-autoencoder is more complicated. In both cases clipping enabled IB and EB to have lower losses for higher learning rates. For MNIST-classification it is still the case that IB has uniformly superior performance to

CHAPTER 6. IMPLICIT BACKPROPAGATION 109 1 2 3 4 Learning rate 0.0 0.5 1.0 1.5 2.0

2.5 MNIST-classification, clip = 1.0explicit implicit exact ISGD 0.0 0.1 0.2 0.3 0.4 Learning rate 2 3 4 5 6 MNIST-autoencoder, clip = 1.0 explicit implicit 0 100 200 300 400 500 Learning rate 0.0005 0.0010 0.0015 Nottingham-RNN, clip = 8.0 explicit implicit

Figure 6.4: Training loss of EB with clipping and IB with clipping on three dataset-architecture pairs.

EB, but for MNIST-autoencoder this is reversed. It is not unsurprising that EB with clipping may outperform IB with clipping. If the clipping threshold is small enough then the clipping induced trust region will be smaller than that induced by IB. This makes EB with clipping and IB with clipping act the same for large gradients; however, below the clipping threshold EB’s unclipped steps may be able to make more progress than IB’s dampened steps. See Figure 6.4 for plots of EB and IB’s performance with clipping.

Summary. We ran a total of 17 experiments on the MNIST and music datasets8. IB outperformed

EB in 15 out of these. On the UCI datasets IB had slightly better performance than EB on average. The advantage of IB is most pronounced for RNNs, where for large learning rates IB has much lower losses and consistently outperforms EB with clipping. Although IB takes slightly longer to train, this is offset by its ability to get good performance with higher learning rates, which should enable it to get away with less hyperparameter tuning.

6.6

Conclusion

In this chapter we developed the first method for applying ISGD to neural networks. We showed that, through careful approximations, ISGD can be made to run nearly as quickly as standard backpropagation while still retaining the property of being more robust to high learning rates. The resulting method, which we call Implicit Backpropagation, consistently matches or outperforms

8MNIST-classification with and without clipping, MNIST-autoencoder with and without clipping, the four music

standard backpropagation on image recognition, autoencoding and music prediction tasks; and is particularly effective for robust RNN training.

The success of IB demonstrates the potential of ISGD methods to improve neural network training. It may be the case that there are better ways to approximate ISGD than IB, which could produce even better results. For example, the techniques behind Hessian-Free methods could be used to make a quadratic approximation of the higher layers in IB (opposed to the linear approximation currently used); or a second order approximation could be made directly to the ISGD formulation in (6.1). Developing and testing such methods is a ripe area for future research. The IB method, as presented in this chapter, also has potential for future research. Extending its updates to other activation functions and applying it to other architectures and datasets will lead to better understanding of IB’s properties and the conditions under which IB performs best.

111

Part III

Bibliography

Do lefties have an advantage in sports? It depends. https://www.nytimes.com/2017/11/21/

science/lefties-sports-advantage.html. Accessed: 2017-11-24.

UCI machine learning repository. University of California, Irvine, School of Information and Com- puter Sciences.

Tennis navigator. http://www.tennisnavigator.com/, 2004. Accessed: 2015-02-03.

About tennis Europe. http://www.tenniseurope.org/page.aspx?id=12173, 2015. Accessed: 2017-01-01.

The physical activity council’s annual study tracking sports, fitness, and recreation participation in the US. Report, The Physical Activity Council, 2016.

Daniel M Abrams and Mark J Panaggio. A model balancing cooperation and competition can explain our right-handed world and the dominance of left-handed athletes. Journal of the Royal Society Interface, 2012. ISSN 1742-5689.

David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169, 1985.

John P Aggleton and Charles J Wood. Is there a left-handed advantage in “ballistic” sports?

International Journal of Sport Psychology, 1990. ISSN 0047-0767.

AI Group at Arris Pharmaceutical Corporation. UCI machine learning repository, 1994. Aki Vehtari. MCMC diagnostics for Matlab.

BIBLIOGRAPHY 113

Selcuk Akpinar, Robert L Sainburg, Sadettin Kirazci, and Andrzej Przybyla. Motor asymmetry in elite fencers. Journal of motor behavior, 47(4):302–311, 2015.

Yoann Altmann, Steve McLaughlin, and Nicolas Dobigeon. Sampling from a multivariate Gaussian distribution truncated on a simplex: A review. InStatistical Signal Processing (SSP), 2014 IEEE Workshop on, pages 113–116. IEEE, 2014.

Jacob Andreas and Dan Klein. When and why are log-linear models self-normalizing? In HLT- NAACL, pages 244–249, 2015.

Boris Baˇci´c and Ali H Gazala. Left-handed representation in top 100 male professional tennis play- ers: Multi-disciplinary perspectives. http://tmg.aut.ac.nz/tmnz2016/papers/Boris2016. pdf. Accessed: 2017-06-15.

David Barber.Bayesian Reasoning and Machine Learning. Cambridge University Press, New York, NY, USA, 2012. ISBN 0521518148, 9780521518147.

Yoshua Bengio and Jean-S´ebastien Sen´ecal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.

Yoshua Bengio, Jean-S´ebastien Sen´ecal, et al. Quick training of probabilistic neural nets by impor- tance sampling. In AISTATS, 2003.

Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Advances in optimizing recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE Interna- tional Conference on, pages 8624–8628. IEEE, 2013.

Dimitri P Bertsekas. Incremental proximal methods for large scale convex optimization. Mathe- matical programming, 129(2):163, 2011.

Dimitri P Bertsekas. Incremental aggregated proximal and augmented Lagrangian algorithms.

arXiv preprint arXiv:1509.09257, 2015.

Sylvain Billiard, Charlotte Faurie, and Michel Raymond. Maintenance of handedness polymorphism in humans: A frequency-dependent selection model.Journal of Theoretical Biology, 235(1):85–93, 2005. ISSN 0022-5193.

Patrizia S Bisiacchi, H Ripoll, J F Stein, P Simonet, and G Azemar. Left-handedness in fencers: An attentional advantage? Perceptual and motor skills, 61(2):507–513, 1985. ISSN 0031-5125. David M Blei. Build, compute, critique, repeat: Data analysis with latent variable models. Annual

Review of Statistics and Its Application, 1:203–232, 2014.

L´eon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal depen- dencies in high-dimensional sequences: Application to polyphonic music generation and tran- scription. arXiv preprint arXiv:1206.6392, 2012.

George EP Box. Science and statistics. Journal of the American Statistical Association, 71(356): 791–799, 1976.

Ralph Allan Bradley and Milton E. Terry. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4):324–345, December 1952. ISSN 0006-3444. doi: 10.2307/2334029. URL http://www.jstor.org/stable/2334029.

Michael Braun and Andre Bonfrer. Scalable inference of customer similarities from interactions data using Dirichlet processes. Marketing Science, 30(3):513–531, 2011.

Kristijan Breznik. On the gender effects of handedness in professional tennis. Journal of sports science & medicine, 12(2):346, 2013.

Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 2015.

Supratik Chakraborty, Daniel J Fremont, Kuldeep S Meel, Sanjit A Seshia, and Moshe Y Vardi. Distribution-aware sampling and weighted model counting for SAT. arXiv preprint arXiv:1404.2984, 2014.

Venkat Chandrasekaran, Nathan Srebro, and Prahladh Harsha. Complexity of inference in graphical models. InProceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 70–78. AUAI Press, 2008.

BIBLIOGRAPHY 115

Mark Chavira and Adnan Darwiche. On probabilistic inference by weighted model counting. Ar- tificial Intelligence, 172(6):772–799, 2008.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling.

arXiv:1312.3005, 2013.

Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, and Svn Vishwanathan. Implicit online learning with kernels. In Advances in neural information processing systems, pages 249–256, 2007.

Nicolas Chopin and James Ridgway. Leave Pima Indians alone: Binary regression as a benchmark for Bayesian computation. arXiv preprint arXiv:1506.08640, 2015.

Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.

Botond Cseke and Tom Heskes. Approximate marginals in latent Gaussian models. The Journal of Machine Learning Research, 12:417–454, 2011.

Yixiong Cui, Miguel- ´Angel G´omez, Bruno Gon¸calves, Hongyou Liu, and Jaime Sampaio. Effects of experience and relative quality in tennis match performance during four grand slams. Inter- national Journal of Performance Analysis in Sport, 17(5):783–801, 2017a.

Yixiong Cui, Miguel- ´Angel G´omez, Bruno Gon¸calves, and Jaime Sampaio. Identifying different tennis player types: An exploratory approach to interpret performance based on player features. In Complex Systems in Sport, International Congress Linking Theory and Practice, page 97, 2017b.

John P Cunningham, Philipp Hennig, and Simon Lacoste-Julien. Gaussian probabilities and ex- pectation propagation. arXiv preprint arXiv:1111.6832, 2011.

Anja Daems and Karl Verfaillie. Viewpoint-dependent priming effects in the perception of human actions and body postures. Visual cognition, 6(6):665–693, 1999. ISSN 1350-6285.

Hal Daume III, Nikos Karampatziakis, John Langford, and Paul Mineiro. Logarithmic time one- against-some. arXiv:1606.04988, 2016.

Herbert A. David and H. N. Nagaraja. Order Statistics. Wiley-Interscience, Hoboken, N.J, 3rd edition, August 2003. ISBN 9780471389262.

Alexandre de Br´ebisson and Pascal Vincent. An exploration of softmax alternatives belonging to the spherical loss family. arXiv:1511.05042, 2015.

Nando De Freitas, Pedro Højen-Sørensen, Michael I Jordan, and Stuart Russell. Variational MCMC. InProceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 120– 127. Morgan Kaufmann Publishers Inc., 2001.

Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems, pages 676–684, 2016.

Guillaume P Dehaene and Simon Barthelm´e. Bounding errors of expectation-propagation. In

Advances in Neural Information Processing Systems, pages 244–252, 2015.

Marc Deisenroth and Shakir Mohamed. Expectation propagation in Gaussian process dynamical systems. In Advances in Neural Information Processing Systems, pages 2609–2617, 2012. Julio Del Corral and Juan Prieto-Rodr´ıguez. Are differences in ranks good predictors for grand

slam tennis matches? International Journal of Forecasting, 26(3):551–563, 2010.

Ned A Dochtermann, CM Gienger, and Shane Zappettini. Born to win? Maybe, but perhaps only against inferior competition. Animal Behaviour, 96:e1–e3, 2014. ISSN 0003-3472.

Randal Douc, Christian P Robert, et al. A vanilla Rao–Blackwellization of Metropolis–Hastings algorithms. The Annals of Statistics, 39(1):261–277, 2011.

John Duchi and Feng Ruan. Stochastic methods for composite optimization problems. arXiv preprint arXiv:1703.08570, 2017.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

BIBLIOGRAPHY 117

David Edwards and Havranek Toma. A fast procedure for model search in multidimensional con- tingency tables. Biometrika, 72(2):339–351, 1985.

A. E. Elo. The rating of chessplayers, past and present. Arco Pub., 1978.

Manuel Fern´andez-Delgado, Eva Cernadas, Sen´en Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems. J. Mach. Learn. Res, 15(1):3133–3181, 2014.

Gerhard H Fischer and Ivo W Molenaar. Rasch models: Foundations, recent developments, and applications. Springer Science & Business Media, 2012.

Adrian E. Flatt. Is being left-handed a handicap? The short and useless answer is “yes and no.”.

Proceedings of Baylor University Medical Center, 21(3):304–307, 2008.

Sergey Foss, Dmitry Korshunov, Stan Zachary, et al. An introduction to heavy-tailed and subexpo- nential distributions, volume 6. Springer, 2011.

Andrew Gelman and Donald B Rubin. Inference from iterative simulation using multiple sequences.

Statistical science, pages 457–472, 1992.

Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis, volume 2. Taylor & Francis, 2014.

Norman Geschwind and Albert M Galaburda. Cerebral lateralization: Biological mechanisms, associations, and pathology: I. A hypothesis and a program for research. Archives of neurology, 42(5):428–459, 1985. ISSN 0003-9942.

Stefano Ghirlanda, Elisa Frasnelli, and Giorgio Vallortigara. Intraspecific competition and coor- dination in the evolution of lateralization. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364(1519):861–866, 2009.

Mark E. Glickman. Parameter estimation in large dynamic paired comparison experiments.Applied Statistics, 48:377–394, 1999.

Stephen R Goldstein and Charlotte A Young. “Evolutionary” stable strategy of handedness in major league baseball. Journal of Comparative Psychology, 110(2):164, 1996. ISSN 1939-2087.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http:

//www.deeplearningbook.org.

Siddharth Gopal and Yiming Yang. Distributed training of large-scale logistic models. InInterna-

Documento similar