Adaptive Gradient Methods with Dynamic Bound of Learning Rate (AdaBound)
Credibility Rating
Good quality. Reputable source with community review or editorial standards, but less rigorous than peer-reviewed venues.
Rating inherited from publication venue: arXiv
This is an ML optimization paper tangentially relevant to AI safety; it addresses training efficiency and generalization of deep learning models but does not directly engage with alignment or safety concerns.
Metadata
Abstract
Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared with SGD or even fail to converge due to unstable and extreme learning rates. Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods. In our paper, we demonstrate that extreme learning rates can lead to poor performance. We provide new variants of Adam and AMSGrad, called AdaBound and AMSBound respectively, which employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD and give a theoretical proof of convergence. We further conduct experiments on various popular tasks and models, which is often insufficient in previous work. Experimental results show that new variants can eliminate the generalization gap between adaptive methods and SGD and maintain higher learning speed early in training at the same time. Moreover, they can bring significant improvement over their prototypes, especially on complex deep networks. The implementation of the algorithm can be found at https://github.com/Luolc/AdaBound .
Summary
This paper proposes AdaBound and AMSBound, optimization algorithms that apply dynamic bounds to learning rates to smoothly transition from adaptive methods (Adam, AMSGrad) to SGD during training. The approach combines fast early convergence of adaptive methods with SGD's superior generalization, supported by convergence proofs and empirical results across multiple deep learning tasks.
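The contrast the summary draws, per-coordinate adaptive scaling versus SGD's uniform scaling, can be illustrated with a minimal sketch (not the paper's code; function names and hyperparameters are ours, and the adaptive update follows the generic RMSprop form):

```python
import math

def sgd_step(p, g, lr=0.1):
    # SGD scales every coordinate by the same learning rate.
    return [pi - lr * gi for pi, gi in zip(p, g)]

def rmsprop_step(p, g, v, lr=0.1, beta=0.9, eps=1e-8):
    # Adaptive methods divide by a running RMS of past squared gradients,
    # so each coordinate gets its own effective learning rate.
    v = [beta * vi + (1 - beta) * gi * gi for vi, gi in zip(v, g)]
    p = [pi - lr * gi / (math.sqrt(vi) + eps)
         for pi, gi, vi in zip(p, g, v)]
    return p, v
```

On a gradient like `[1.0, 0.01]`, SGD's step in the second coordinate is 100x smaller, while the adaptive update normalizes both coordinates to nearly the same effective step, which is why adaptive methods train quickly on sparse data.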
Key Points
- Identifies extreme learning rates in adaptive optimizers (Adam, AdaGrad, RMSprop) as a root cause of poor generalization compared to SGD.
- Proposes AdaBound and AMSBound, which clip learning rates with dynamic bounds that tighten over training, creating a smooth transition to SGD.
- Provides theoretical convergence guarantees for the proposed methods in a convex optimization framework.
- Empirical results show AdaBound eliminates the generalization gap between adaptive methods and SGD while retaining faster early training speed.
- Addresses a practical ML engineering concern relevant to training large models efficiently and reliably.
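The clipping idea in the second point can be sketched for a single scalar parameter as follows. This is a minimal illustration, not the authors' implementation; the bound schedules mirror the shapes used in the paper (starting at (0, ∞) and converging to a final SGD-like rate), but the hyperparameter names (`final_lr`, `gamma`) and defaults are assumptions:

```python
import math

def adabound_step(p, g, m, v, t, lr=1e-3, final_lr=0.1,
                  beta1=0.9, beta2=0.999, gamma=1e-3, eps=1e-8):
    """One AdaBound-style update for a scalar parameter (sketch)."""
    m = beta1 * m + (1 - beta1) * g        # first moment, as in Adam
    v = beta2 * v + (1 - beta2) * g * g    # second moment
    bc1 = 1 - beta1 ** t                   # bias corrections
    bc2 = 1 - beta2 ** t
    step_size = lr * math.sqrt(bc2) / bc1
    # Dynamic bounds: loose early (Adam-like behavior), then both
    # converge to final_lr, so late training behaves like SGD.
    lower = final_lr * (1 - 1 / (gamma * t + 1))
    upper = final_lr * (1 + 1 / (gamma * t))
    eta = min(max(step_size / (math.sqrt(v) + eps), lower), upper)
    return p - eta * m, m, v
```

Because `lower` and `upper` squeeze toward `final_lr` as `t` grows, the per-coordinate adaptivity is gradually removed rather than switched off abruptly, which is the "smooth transition" the paper describes.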
Cited by 1 page
| Page | Type | Quality |
|---|---|---|
| Mesa-Optimization Risk Analysis | Analysis | 61.0 |
Cached Content Preview
# Adaptive Gradient Methods with Dynamic Bound of Learning Rate
Liangchen Luo†, Yuanhao Xiong‡,
Yan Liu§, Xu Sun†¶
†MOE Key Lab of Computational Linguistics, School of EECS, Peking University
‡College of Information Science and Electronic Engineering, Zhejiang University
§Department of Computer Science, University of Southern California
¶Center for Data Science, Beijing Institute of Big Data Research, Peking University
†{luolc,xusun}@pku.edu.cn ‡xiongyh@zju.edu.cn §yanliu.cs@usc.edu

Equal contribution. This work was done when the first and second authors were on an internship at DiDi AI Labs.
###### Abstract
Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates.
Though prevailing, they are observed to generalize poorly compared with SGD or even fail to converge due to unstable and extreme learning rates.
Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods.
In our paper, we demonstrate that extreme learning rates can lead to poor performance.
We provide new variants of Adam and AMSGrad, called AdaBound and AMSBound respectively, which employ dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD and give a theoretical proof of convergence.
We further conduct experiments on various popular tasks and models, which is often insufficient in previous work.
Experimental results show that new variants can eliminate the generalization gap between adaptive methods and SGD and maintain higher learning speed early in training at the same time.
Moreover, they can bring significant improvement over their prototypes, especially on complex deep networks.
The implementation of the algorithm can be found at [https://github.com/Luolc/AdaBound](https://github.com/Luolc/AdaBound "").
## 1 Introduction
There has been tremendous progress in first-order optimization algorithms for training deep neural networks.
One of the most dominant algorithms is stochastic gradient descent (SGD) (Robbins & Monro, [1951](https://ar5iv.labs.arxiv.org/html/1902.09843#bib.bib16 "")), which performs well across many applications in spite of its simplicity.
However, SGD has the disadvantage that it scales the gradient uniformly in all directions.
This may lead to poor performance as well as limited training speed when the training data are sparse.
To address this problem, recent work has proposed a variety of adaptive methods that scale the gradient by square roots of some form of the average of the squared values of past gradients.
Examples of such methods include Adam (Kingma & Lei Ba, [2015](https://ar5iv.labs.arxiv.org/html/1902.09843#bib.bib8 "")), AdaGrad (Duchi et al., [2011](https://ar5iv.labs.arxiv.org/html/1902.09843#bib.bib3 "")) and RMSprop (Tieleman & Hinton, [2012](https://ar5iv.l
... (truncated, 98 KB total)