Summary: This text is Sebastian Ruder's overview of gradient descent optimization algorithms. The author surveys the main variants of gradient descent and the adaptive optimization techniques built on them for machine learning and artificial intelligence. The content is available on arxiv.org for further reading.
Gradient descent is a way to minimize an objective function J(θ), parameterized by a model’s parameters θ ∈ R^d, by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ) w.r.t. the parameters. The learning rate η determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
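Note: a minimal sketch of the update rule θ = θ − η∇θJ(θ), assuming a toy quadratic objective of my own (not from the paper) so it actually runs:

```python
import numpy as np

# Vanilla gradient descent: theta = theta - eta * grad_J(theta).
# Toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is just theta.
def grad_J(theta):
    return theta

eta = 0.1                      # learning rate: step size toward the (local) minimum
theta = np.array([2.0, -3.0])  # initial parameters

for step in range(100):
    theta = theta - eta * grad_J(theta)  # step in the opposite direction of the gradient

print(theta)  # approaches the minimum at the origin
```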
Note: Not actually ‘intractable,’ as in principle you can just do this once for each record (i.e., it is linear in the number of observations). The problem is that it requires an explicit loop, and—especially if we need to load the records for the entire dataset one at a time from disk—this will take so long to make one update that using it until convergence would be totally unreasonable.
A better word might have been ‘infeasible.’
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
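Note: batch gradient descent as I understand it, one update per full pass over the data. The least-squares setup is my own stand-in, not the paper's:

```python
import numpy as np

# Batch gradient descent: compute the gradient over the ENTIRE dataset, then
# make a single parameter update. Synthetic least-squares data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w, eta = np.zeros(3), 0.1
for epoch in range(200):
    grad = X.T @ (X @ w - y) / len(y)  # gradient over the whole dataset (the per-update cost the note above complains about)
    w = w - eta * grad                 # one update per epoch

print(w)  # converges toward true_w: least squares is convex, so this is the global minimum
```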
SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily
when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent
we shuffle the training data at every epoch
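Note: SGD with the two details highlighted above, per-example updates plus an epoch-wise shuffle, and a simple decaying learning rate (my choice of schedule, not the paper's):

```python
import numpy as np

# Stochastic gradient descent: shuffle the data every epoch and update once per example.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w, eta0 = np.zeros(3), 0.05
for epoch in range(20):
    eta = eta0 / (1 + epoch)           # slowly decrease the learning rate
    for i in rng.permutation(len(y)):  # shuffle the training data at every epoch
        grad_i = X[i] * (X[i] @ w - y[i])  # noisy single-example gradient (high variance)
        w = w - eta * grad_i

print(w)
```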
it [mini-batch gradient descent] a) reduces the variance of the parameter updates, which can lead to more stable convergence; and b) can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries
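Note: and the mini-batch version, which is what the a)/b) trade-off above refers to; batch size 50 is arbitrary:

```python
import numpy as np

# Mini-batch gradient descent: one update per small batch, so the gradient is less
# noisy than SGD's and each step is a matrix operation instead of a per-example loop.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w, eta, batch_size = np.zeros(3), 0.05, 50
for epoch in range(20):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # averaged gradient over one mini-batch
        w = w - eta * grad

print(w)
```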
difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down.
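Note: the textbook example that makes this concrete for me (mine, not the paper's) is f(x, y) = x² − y²:

```latex
f(x, y) = x^2 - y^2, \qquad
\nabla f(x, y) = (2x, \; -2y) = (0, 0) \ \text{at the origin}, \qquad
H = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}.
```

The gradient vanishes at (0, 0), but the Hessian has one positive and one negative eigenvalue: the surface slopes up along x and down along y, so the origin is a saddle point rather than a minimum, and gradients near it are close to zero in every dimension, which is what makes such points hard for SGD to escape.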
adding a fraction γ of the update vector of the past time step to the current update vector
The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions.
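Note: momentum, as I read it: keep a decaying running sum v of past update steps and add a fraction γ of it each time. Toy elongated quadratic of my own, chosen so the two dimensions have very different curvature:

```python
import numpy as np

# SGD with momentum: v = gamma * v + eta * grad;  theta = theta - v.
# Objective J(x, y) = 0.05 * x^2 + 2 * y^2 (one flat dimension, one steep one).
def grad_J(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta, v = np.array([10.0, 1.0]), np.zeros(2)
eta, gamma = 0.1, 0.9
for step in range(200):
    v = gamma * v + eta * grad_J(theta)  # grows where gradients keep pointing the same way,
    theta = theta - v                    # shrinks where they keep flipping sign

print(theta)  # heads toward the minimum at the origin
```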
Computing θ − γv_{t−1} thus gives us an approximation of the next position of the parameters
Note: I didn’t follow this. I’ll decide at the end of the section if I need to.
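Note: trying to unpack it for myself: the only change from plain momentum is that the gradient is evaluated at θ − γv_{t−1}, i.e. roughly where the momentum step is about to carry the parameters, rather than at the current θ. A sketch on the same toy quadratic as above:

```python
import numpy as np

# Nesterov accelerated gradient: look ahead to (theta - gamma * v), take the
# gradient THERE, then do the momentum update. Same toy objective as the
# momentum sketch: J(x, y) = 0.05 * x^2 + 2 * y^2.
def grad_J(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta, v = np.array([10.0, 1.0]), np.zeros(2)
eta, gamma = 0.1, 0.9
for step in range(200):
    lookahead = theta - gamma * v            # approximation of the next position of the parameters
    v = gamma * v + eta * grad_J(lookahead)  # correct the step using the gradient at that point
    theta = theta - v

print(theta)
```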
It [Adagrad] adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters.
Pennington et al. [16] used Adagrad to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.
Adagrad modifies the general learning rate η at each time step t for every parameter θi based on the past gradients that have been computed for θi
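Note: the mechanism, as a sketch (toy objective again; η and ε are my picks):

```python
import numpy as np

# Adagrad: accumulate squared gradients per parameter and scale the learning
# rate by 1 / sqrt(accumulated sum). Parameters with small or rare gradients
# keep a large effective step; frequently updated ones get a small one.
def grad_J(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta = np.array([10.0, 1.0])
G = np.zeros(2)          # running sum of squared gradients, one entry per parameter
eta, eps = 0.5, 1e-8
for step in range(500):
    g = grad_J(theta)
    G = G + g ** 2                              # only ever grows (the weakness noted next)
    theta = theta - eta / np.sqrt(G + eps) * g  # per-parameter effective learning rate

print(theta)
```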
Adagrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.
Adadelta restricts the window of accumulated past gradients to some fixed size w.
With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.
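Note: a sketch of the Adadelta update as I understand it from the paper: the learning rate is replaced by the ratio of two decaying RMS averages, one of past parameter updates and one of past gradients (ρ and ε are typical values, my choice):

```python
import numpy as np

# Adadelta: E[g^2] and E[dx^2] are exponentially decaying averages, so the
# "window" of past squared gradients is effectively bounded, and the ratio
# RMS[dx] / RMS[g] plays the role of the learning rate; no eta anywhere.
def grad_J(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta = np.array([10.0, 1.0])
Eg2, Edx2 = np.zeros(2), np.zeros(2)
rho, eps = 0.95, 1e-6
for step in range(2000):
    g = grad_J(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2                 # decaying average of squared gradients
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # update, with no learning rate
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2              # decaying average of squared updates
    theta = theta + dx

print(theta)
```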
Adaptive Moment Estimation (Adam) [10] is another method that computes adaptive learning rates for each parameter.
They show empirically that Adam works well in practice and compares favorably to other adaptive learning-method algorithms.
If your input data is sparse, then you likely achieve the best results using one of the adaptive learning-rate methods.
Adam, finally, adds bias-correction and momentum to RMSprop.
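Note: which makes Adam easy to write down next to the sketches above: an RMSprop-style decaying average of squared gradients plus a momentum-style average of the gradients themselves, both bias-corrected because they start at zero. β1, β2, ε below are the commonly cited defaults; η is just tuned to my toy problem:

```python
import numpy as np

# Adam: m is the momentum-like first moment, v the RMSprop-like second moment;
# m_hat and v_hat undo the bias toward zero in the first few steps.
def grad_J(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta = np.array([10.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g         # decaying average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2    # decaying average of squared gradients (RMSprop)
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)
```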
RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances.