# On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

```bibtex
@article{Keskar2017OnLT,
  title   = {On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima},
  author  = {Nitish Shirish Keskar and Dheevatsa Mudigere and Jorge Nocedal and Mikhail Smelyanskiy and Ping Tak Peter Tang},
  journal = {ArXiv},
  year    = {2017},
  volume  = {abs/1609.04836}
}
```

The stochastic gradient descent (SGD) method and its variants are the algorithms of choice for many deep learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$–$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that using a larger batch degrades the quality of the model, as measured by its ability to generalize. We investigate the cause of this generalization drop…
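The small-batch regime the abstract describes can be illustrated with a minimal NumPy sketch (the toy regression data and variable names are illustrative, not from the paper): a mini-batch gradient is an unbiased but noisy estimate of the full-data gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression setup: loss L(w) = mean((X @ w - y)^2) over the full dataset.
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10_000)

def grad(w, Xb, yb):
    """Gradient of the mean squared error on the (mini-)batch (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(5)
full_grad = grad(w, X, y)                          # exact gradient over all data

idx = rng.choice(len(y), size=128, replace=False)  # small-batch sample (batch size 128)
batch_grad = grad(w, X[idx], y[idx])               # noisy, unbiased estimate

# The mini-batch gradient approximates the full gradient; the gap shrinks
# as the batch grows, at the cost of more computation per step.
print(np.linalg.norm(batch_grad - full_grad) / np.linalg.norm(full_grad))
```

The paper's question is what happens to generalization when the batch size in such an estimate is pushed far above this range.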

#### Supplemental Content

- **GitHub repo** (via Papers with Code): code to reproduce some of the figures in the paper.
- **Presentation slides** for the paper.


#### 1,489 Citations

Extrapolation for Large-batch Training in Deep Learning

- Computer Science, Mathematics
- ICML
- 2020

This work proposes to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima, and proves the convergence of this novel scheme and rigorously evaluates its empirical performance on ResNet, LSTM, and Transformer.

Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent

- Mathematics, Computer Science
- ArXiv
- 2018

It is proved that SGD tends to converge to flatter minima in the asymptotic regime (although it may take exponential time to converge) regardless of the batch size, and that SGD with a larger ratio of learning rate to batch size tends to converge to a flat minimum faster; however, its generalization performance could be worse.

Stochastic Gradient Descent with Large Learning Rate

- Mathematics, Computer Science
- ArXiv
- 2020

The main contribution of this work is the derivation of the stable distribution of discrete-time SGD on a quadratic loss function, with and without momentum.

An Empirical Study of Large-Batch Stochastic Gradient Descent with Structured Covariance Noise

- Computer Science, Mathematics
- 2019

Empirical studies with standard deep-learning architectures and datasets show that the proposed method of adding covariance noise to the gradients not only improves generalization performance in large-batch training, but does so while optimization performance remains desirable and training duration is not elongated.

The Impact of Local Geometry and Batch Size on the Convergence and Divergence of Stochastic Gradient Descent

- Mathematics
- 2017

Stochastic small-batch (SB) methods, such as mini-batch Stochastic Gradient Descent (SGD), have been extremely successful in training neural networks with strong generalization properties. In the…

SmoothOut: Smoothing Out Sharp Minima for Generalization in Large-Batch Deep Learning

- Computer Science
- ArXiv
- 2018

It is proved that Stochastic SmoothOut is an unbiased approximation of the original SmoothOut and can eliminate sharp minima in Deep Neural Networks (DNNs), thereby closing the generalization gap.

Stochastic Gradient Descent with Moderate Learning Rate

- 2021

Understanding the algorithmic bias of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most of the existing works, however, focus on…

Stochastic Normalized Gradient Descent with Momentum for Large Batch Training

- Computer Science, Mathematics
- ArXiv
- 2020

This paper theoretically proves that, compared to momentum SGD (MSGD), SNGM can adopt a larger batch size to converge to an $\epsilon$-stationary point with the same computational complexity (total number of gradient computations).

Gaussian Process Inference Using Mini-batch Stochastic Gradient Descent: Convergence Guarantees and Empirical Benefits

- Computer Science, Mathematics
- ArXiv
- 2021

Numerical studies on both simulated and real datasets demonstrate that mini-batch SGD has better generalization than state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data-size regime for GPs.

A closer look at batch size in mini-batch training of deep auto-encoders

- Computer Science
- 2017 3rd IEEE International Conference on Computer and Communications (ICCC)
- 2017

This paper tested the generalizability of deep auto-encoders trained with varying batch sizes and checked several well-known measures of model generalization, finding no obvious generalization gap in regression models such as auto-encoders.

#### References

Showing 1–10 of 61 references

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

- Computer Science
- ICML
- 2015

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
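The core transformation of Batch Normalization can be sketched in a few lines of NumPy (a minimal sketch of the per-feature normalization; the function name and toy inputs are illustrative): each feature is standardized over the batch, then rescaled by learnable parameters.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    x: (batch, features) activations; gamma, beta: learnable (features,) vectors.
    """
    mean = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # restore representational capacity

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0))  # ≈ 0 for every feature
print(out.std(axis=0))   # ≈ 1 for every feature
```

With `gamma = 1` and `beta = 0` the output is simply the standardized batch; training learns `gamma` and `beta` so the layer can undo the normalization where that helps.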

Optimization Methods for Large-Scale Machine Learning

- Computer Science, Mathematics
- SIAM Rev.
- 2018

A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.

Adam: A Method for Stochastic Optimization

- Computer Science, Mathematics
- ICLR
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
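The "adaptive estimates of lower-order moments" can be made concrete with a minimal NumPy sketch of a single Adam step (the quadratic test objective and function name are illustrative; the update rule and default hyperparameters follow the paper):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using exponential moving averages of g and g^2."""
    m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g**2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)             # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize the toy objective f(w) = ||w||^2 from a random start.
rng = np.random.default_rng(2)
w = rng.normal(size=3)
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    g = 2.0 * w                            # exact gradient of ||w||^2
    w, m, v = adam_step(w, g, m, v, t, lr=0.01)
print(np.linalg.norm(w))                   # small: w has been driven toward 0
```

Dividing the bias-corrected first moment by the square root of the second gives a per-coordinate step size, which is what makes the method robust to gradient scale.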

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

- Computer Science, Mathematics
- ICLR
- 2017

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.

On the importance of initialization and momentum in deep learning

- Computer Science
- ICML
- 2013

It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.

adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs

- Computer Science, Mathematics
- ECML/PKDD
- 2016

adaQN is presented, a stochastic quasi-Newton algorithm for training RNNs that retains a low per-iteration cost while allowing for non-diagonal scaling through a stochastic L-BFGS updating scheme, and that is judicious in storing and retaining L-BFGS curvature pairs.

Train faster, generalize better: Stability of stochastic gradient descent

- Computer Science, Mathematics
- ICML
- 2016

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically…

No bad local minima: Data independent training error guarantees for multilayer neural networks

- Mathematics, Computer Science
- ArXiv
- 2016

It is proved that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization, and the results are extended to the case of more than one hidden layer.

Sample size selection in optimization methods for machine learning

- Computer Science, Mathematics
- Math. Program.
- 2012

This paper presents a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient, and establishes an $O(1/\epsilon)$ complexity bound on the total cost of a gradient method.

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2011

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.