Recall that the learning rate is a hyperparameter of gradient-based optimization methods, notably gradient descent and gradient boosting, that determines how large a step the optimizer takes in the direction opposite the gradient. Large steps can overshoot a minimum but escape local optima more easily; small steps are thorough but slow.
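
To make the role of the learning rate concrete, here is a minimal sketch of gradient descent on a toy quadratic; the function, starting point, and learning rate value are chosen purely for illustration.

```python
# Gradient descent on f(w) = (w - 3)^2; the learning rate scales each step
# taken opposite the gradient.
def grad(w):
    return 2 * (w - 3)                # derivative of (w - 3)^2

w = 0.0                               # arbitrary starting point
learning_rate = 0.1                   # illustrative value, not a recommendation
for _ in range(50):
    w -= learning_rate * grad(w)      # step in the direction opposite the gradient
print(round(w, 4))                    # approaches the minimizer w = 3
```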

When training a model that will require many iterations, such as a neural network, it can be helpful to start with large steps to get into the vicinity of a deep well in the loss manifold, then take ever smaller steps as the parameters approach the corresponding minimizer. This is the premise of learning rate scheduling.
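
As a sketch of how this looks in PyTorch, the snippet below attaches a StepLR schedule that halves the learning rate every 10 epochs; the model, data, and schedule parameters are placeholders rather than tuned values.

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                      # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Start with large steps, then halve the learning rate every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    for x, y in [(torch.randn(4, 10), torch.randn(4, 1))]:  # stand-in for a DataLoader
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # advance the schedule once per epoch
```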

By contrast, learning rate scheduling is not a good fit for gradient boosting, despite the shared theoretical foundation. Because each base learner is fit to the pseudo-residuals (the negative gradient of the loss), each one represents a step that already adapts to the local shape of the loss. Far fewer such steps are therefore required than when performing gradient descent on a network; instead of scheduling, training is simply halted (early stopping) once the ensemble stops improving on the validation set.
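
For comparison, a boosted ensemble is typically trained with a fixed learning rate and early stopping. The sketch below uses scikit-learn's GradientBoostingRegressor (an assumed library choice, not named above) with illustrative parameter values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)  # synthetic data

gbr = GradientBoostingRegressor(
    learning_rate=0.1,        # constant shrinkage applied to every base learner
    n_estimators=1000,        # upper bound; early stopping usually ends far sooner
    validation_fraction=0.2,  # held-out split used to monitor improvement
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=0,
)
gbr.fit(X, y)
print(gbr.n_estimators_)      # number of trees actually fitted before stopping
```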

There are infinitely many possible learning rate schedules, and a very large number have been proposed; 15 are implemented in PyTorch alone. However, we can roughly classify them into groups by their behavior over time:

  • Constant learning rate: this is the implicit default in vanilla SGD; no scheduler needs to be specified at all, although PyTorch does provide ConstantLR.
  • Monotonically decreasing learning rate: most schedules fall into this category. The most common are StepLR, MultiStepLR, and ExponentialLR.
  • Cyclical learning rate: for models with rough loss surfaces, such as sparse models, it can be helpful to periodically let the optimizer jump to a different part of the parameter space, escaping a local optimum after it has been explored; a sketch follows this list. See the main article for more.
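
As a rough sketch of the cyclical case, PyTorch's CyclicLR oscillates the learning rate between a lower and an upper bound; the bounds and cycle length below are illustrative placeholders, not tuned values.

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                      # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=0.001,      # floor of each cycle
    max_lr=0.1,         # peak reached mid-cycle, allowing an occasional large jump
    step_size_up=200,   # batches spent climbing from base_lr to max_lr
)

for step in range(1000):
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).pow(2).mean()  # stand-in loss
    loss.backward()
    optimizer.step()
    scheduler.step()    # CyclicLR is typically stepped per batch, not per epoch
```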