For models with rough loss landscapes, such as models trained on sparse data, it can be helpful to periodically raise the learning rate so the model can jump to a different part of the parameter space and escape a local optimum after exploring it. However, if training ends during a high-LR period, the benefits of the most recent fine-tuning will be lost. Hence it is necessary to ensure that training ends during a low-LR period.
Typical options include:
- Cosine annealing: shape the learning rate with a cosine curve so that it declines smoothly to a minimum, optionally restarting the cycle with a decaying envelope so that successive peaks are progressively lower. In PyTorch, CosineAnnealingLR implements a single cosine decay and CosineAnnealingWarmRestarts adds periodic restarts.
- Ending at the trough of a cosine cycle: a cyclical cosine schedule has iterations in which the learning rate is very small. Choosing the total number of training steps so that training stops at one of these troughs avoids destroying the most recent fine-tuning.
- One cycle: for moderately rough loss landscapes, a single learning-rate cycle (a warm-up to a peak followed by a long anneal toward zero) can reduce the risk of getting stuck in a local minimum while still ending at a low rate. All three options are sketched in code after this list.
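Below is a minimal sketch of how the three options might be set up with PyTorch's built-in schedulers. The model, dummy data, step counts, and eta_min values are placeholders; only the scheduler construction and the per-step `scheduler.step()` call matter here.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)                                   # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)

steps_per_epoch, epochs = 100, 30                          # illustrative sizes
total_steps = steps_per_epoch * epochs

schedule = "warm_restarts"   # one of: "cosine", "warm_restarts", "one_cycle"

if schedule == "cosine":
    # Single cosine decay from the initial LR down to eta_min.
    scheduler = optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_steps, eta_min=1e-5)
elif schedule == "warm_restarts":
    # Cosine cycles with restarts; choosing total_steps as a multiple of the
    # cycle length (T_0, with T_mult=1) makes training stop at a trough,
    # i.e. at a very small learning rate.
    scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=total_steps // 3, T_mult=1, eta_min=1e-5)
else:
    # One-cycle policy: the LR rises once to max_lr, then anneals to near zero.
    scheduler = optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=0.1, total_steps=total_steps)

x, y = torch.randn(32, 10), torch.randn(32, 1)             # dummy batch
for step in range(total_steps):
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()           # advance the schedule once per optimizer step
```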
In addition, if the parameters have settled into a relatively wide basin around a minimizer, early stopping can help to ensure that the model does not accidentally escape it. This can be combined with any of the schedules above.
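A minimal early-stopping sketch follows, assuming a patience-based rule on a validation loss; the model, data, and patience value are illustrative placeholders rather than part of any particular library API.

```python
import copy
import torch
from torch import nn, optim

model = nn.Linear(10, 1)                                   # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(64, 10), torch.randn(64, 1)             # stand-in train/val data

best_loss, best_state, patience, bad_epochs = float("inf"), None, 5, 0
for epoch in range(100):
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    val_loss = nn.functional.mse_loss(model(x), y).item()  # stand-in validation
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())      # snapshot the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break            # stop before the model drifts out of the wide basin

if best_state is not None:
    model.load_state_dict(best_state)                       # restore the best checkpoint
```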