Summary: This text is Sebastian Ruder's overview of gradient descent optimization algorithms. The author surveys the main variants of gradient descent and the adaptive optimization techniques built on them for machine learning and artificial intelligence. The content is available on arxiv.org for further reading.
Gradient descent is a way to minimize an objective function J(θ), parameterized by a model’s parameters θ ∈ R^d, by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ) w.r.t. the parameters. The learning rate η determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
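Note: a minimal sketch of the update rule θ = θ − η∇θJ(θ), assuming a toy quadratic objective of my own (not from the paper) so it actually runs:

```python
import numpy as np

# Vanilla gradient descent: theta = theta - eta * grad_J(theta).
# Toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is just theta.
def grad_J(theta):
    return theta

eta = 0.1                      # learning rate: step size toward the (local) minimum
theta = np.array([2.0, -3.0])  # initial parameters

for step in range(100):
    theta = theta - eta * grad_J(theta)  # step in the opposite direction of the gradient

print(theta)  # approaches the minimum at the origin
```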
Note: Not actually ‘intractable,’ as in principle you can just do this once for each record (i.e., it is linear in the number of observations). The problem is that it requires an explicit loop, and—especially if we need to load the records for the entire dataset one at a time from disk—this will take so long to make one update that using it until convergence would be totally unreasonable.
A better word might have been ‘infeasible.’
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
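Note: batch gradient descent as I understand it, one update per full pass over the data. The least-squares setup is my own stand-in, not the paper's:

```python
import numpy as np

# Batch gradient descent: compute the gradient over the ENTIRE dataset, then
# make a single parameter update. Synthetic least-squares data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w, eta = np.zeros(3), 0.1
for epoch in range(200):
    grad = X.T @ (X @ w - y) / len(y)  # gradient over the whole dataset (the per-update cost the note above complains about)
    w = w - eta * grad                 # one update per epoch

print(w)  # converges toward true_w: least squares is convex, so this is the global minimum
```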
SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily
when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent
we shuffle the training data at every epoch
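Note: SGD with the two details highlighted above, per-example updates plus an epoch-wise shuffle, and a simple decaying learning rate (my choice of schedule, not the paper's):

```python
import numpy as np

# Stochastic gradient descent: shuffle the data every epoch and update once per example.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w, eta0 = np.zeros(3), 0.05
for epoch in range(20):
    eta = eta0 / (1 + epoch)           # slowly decrease the learning rate
    for i in rng.permutation(len(y)):  # shuffle the training data at every epoch
        grad_i = X[i] * (X[i] @ w - y[i])  # noisy single-example gradient (high variance)
        w = w - eta * grad_i

print(w)
```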
it [mini-batch gradient descent] a) reduces the variance of the parameter updates, which can lead to more stable convergence; and b) can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries
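Note: and the mini-batch version, which is what the a)/b) trade-off above refers to; batch size 50 is arbitrary:

```python
import numpy as np

# Mini-batch gradient descent: one update per small batch, so the gradient is less
# noisy than SGD's and each step is a matrix operation instead of a per-example loop.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w, eta, batch_size = np.zeros(3), 0.05, 50
for epoch in range(20):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)  # averaged gradient over one mini-batch
        w = w - eta * grad

print(w)
```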
difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down.
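Note: the textbook example that makes this concrete for me (mine, not the paper's) is f(x, y) = x² − y²:

```latex
f(x, y) = x^2 - y^2, \qquad
\nabla f(x, y) = (2x, \; -2y) = (0, 0) \ \text{at the origin}, \qquad
H = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}.
```

The gradient vanishes at (0, 0), but the Hessian has one positive and one negative eigenvalue: the surface slopes up along x and down along y, so the origin is a saddle point rather than a minimum, and gradients near it are close to zero in every dimension, which is what makes such points hard for SGD to escape.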
adding a fraction γ of the update vector of the past time step to the current update vector
The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions.
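Note: momentum, as I read it: keep a decaying running sum v of past update steps and add a fraction γ of it each time. Toy elongated quadratic of my own, chosen so the two dimensions have very different curvature:

```python
import numpy as np

# SGD with momentum: v = gamma * v + eta * grad;  theta = theta - v.
# Objective J(x, y) = 0.05 * x^2 + 2 * y^2 (one flat dimension, one steep one).
def grad_J(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta, v = np.array([10.0, 1.0]), np.zeros(2)
eta, gamma = 0.1, 0.9
for step in range(200):
    v = gamma * v + eta * grad_J(theta)  # grows where gradients keep pointing the same way,
    theta = theta - v                    # shrinks where they keep flipping sign

print(theta)  # heads toward the minimum at the origin
```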
Computing θ − γv_{t−1} thus gives us an approximation of the next position of the parameters
Note: I didn’t follow this. I’ll decide at the end of the section if I need to.
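Note: trying to unpack it for myself: the only change from plain momentum is that the gradient is evaluated at θ − γv_{t−1}, i.e. roughly where the momentum step is about to carry the parameters, rather than at the current θ. A sketch on the same toy quadratic as above:

```python
import numpy as np

# Nesterov accelerated gradient: look ahead to (theta - gamma * v), take the
# gradient THERE, then do the momentum update. Same toy objective as the
# momentum sketch: J(x, y) = 0.05 * x^2 + 2 * y^2.
def grad_J(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta, v = np.array([10.0, 1.0]), np.zeros(2)
eta, gamma = 0.1, 0.9
for step in range(200):
    lookahead = theta - gamma * v            # approximation of the next position of the parameters
    v = gamma * v + eta * grad_J(lookahead)  # correct the step using the gradient at that point
    theta = theta - v

print(theta)
```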
It [Adagrad] adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters.
Pennington et al. [16] used Adagrad to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.
Adagrad modifies the general learning rate η at each time step t for every parameter θi based on the past gradients that have been computed for θi
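Note: the mechanism, as a sketch (toy objective again; η and ε are my picks):

```python
import numpy as np

# Adagrad: accumulate squared gradients per parameter and scale the learning
# rate by 1 / sqrt(accumulated sum). Parameters with small or rare gradients
# keep a large effective step; frequently updated ones get a small one.
def grad_J(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta = np.array([10.0, 1.0])
G = np.zeros(2)          # running sum of squared gradients, one entry per parameter
eta, eps = 0.5, 1e-8
for step in range(500):
    g = grad_J(theta)
    G = G + g ** 2                              # only ever grows (the weakness noted next)
    theta = theta - eta / np.sqrt(G + eps) * g  # per-parameter effective learning rate

print(theta)
```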
Adagrad’s main weakness is its accumulation of the squared gradients in the denominator: Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge.
Adadelta restricts the window of accumulated past gradients to some fixed size w.
With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.
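Note: a sketch of the Adadelta update as I understand it from the paper: the learning rate is replaced by the ratio of two decaying RMS averages, one of past parameter updates and one of past gradients (ρ and ε are typical values, my choice):

```python
import numpy as np

# Adadelta: E[g^2] and E[dx^2] are exponentially decaying averages, so the
# "window" of past squared gradients is effectively bounded, and the ratio
# RMS[dx] / RMS[g] plays the role of the learning rate; no eta anywhere.
def grad_J(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta = np.array([10.0, 1.0])
Eg2, Edx2 = np.zeros(2), np.zeros(2)
rho, eps = 0.95, 1e-6
for step in range(2000):
    g = grad_J(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2                 # decaying average of squared gradients
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # update, with no learning rate
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2              # decaying average of squared updates
    theta = theta + dx

print(theta)
```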
Adaptive Moment Estimation (Adam) [10] is another method that computes adaptive learning rates for each parameter.
They show empirically that Adam works well in practice and compares favorably to other adaptive learning-method algorithms.
If your input data is sparse, then you likely achieve the best results using one of the adaptive learning-rate methods.
Adam, finally, adds bias-correction and momentum to RMSprop.
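Note: which makes Adam easy to write down next to the sketches above: an RMSprop-style decaying average of squared gradients plus a momentum-style average of the gradients themselves, both bias-corrected because they start at zero. β1, β2, ε below are the commonly cited defaults; η is just tuned to my toy problem:

```python
import numpy as np

# Adam: m is the momentum-like first moment, v the RMSprop-like second moment;
# m_hat and v_hat undo the bias toward zero in the first few steps.
def grad_J(theta):
    return np.array([0.1 * theta[0], 4.0 * theta[1]])

theta = np.array([10.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g         # decaying average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g ** 2    # decaying average of squared gradients (RMSprop)
    m_hat = m / (1 - beta1 ** t)            # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)
```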
RMSprop, Adadelta, and Adam are very similar algorithms that do well in similar circumstances.