Weight decay, which involves scaling all of the weights in a model by a constant factor slightly less than one at each update step, is equivalent to L2 regularization. To see why, let’s first explicitly write out the L2-regularized loss as a function of the parameters $\theta$:

$$\tilde{L}(\theta) = L(\theta) + \frac{\lambda}{2}\lVert\theta\rVert^2.$$
(We choose the factor of $\tfrac{1}{2}$ for convenience, so that the gradient comes out without a stray factor of two.) To optimize the loss function, we need the gradient of the regularized loss with respect to the weights. The gradient of the regularization term with respect to $\theta$ is

$$\nabla_\theta \left(\frac{\lambda}{2}\lVert\theta\rVert^2\right) = \lambda\theta.$$
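As a quick sanity check (an illustration, not part of the derivation), we can differentiate the penalty term with automatic differentiation and confirm that it matches $\lambda\theta$; the particular values of `lam` and `theta` below are arbitrary:

```python
import jax
import jax.numpy as jnp

lam = 0.1                                # arbitrary regularization strength
theta = jnp.array([0.5, -1.0, 2.0])      # arbitrary weight vector

# Penalty term (lam / 2) * ||theta||^2 and its autodiff gradient.
penalty = lambda t: 0.5 * lam * jnp.sum(t ** 2)
print(bool(jnp.allclose(jax.grad(penalty)(theta), lam * theta)))  # True
```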
This means that when we apply L2 regularization, the gradient for each weight $\theta_i$ is modified by an additional term $\lambda\theta_i$. Thus, the total gradient for the regularized loss is

$$\nabla_\theta \tilde{L}(\theta) = \nabla_\theta L(\theta) + \lambda\theta.$$
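To make the equivalence concrete, here is a minimal JAX sketch under some illustrative assumptions: a toy quadratic loss, plain SGD, and arbitrary values for the learning rate `lr` and regularization strength `lam` (none of these come from the text above). One SGD step on the regularized loss coincides with first shrinking the weights and then taking an ordinary gradient step:

```python
import jax
import jax.numpy as jnp

# Illustrative values only: toy quadratic loss L(theta) = 0.5 * ||theta - target||^2.
lam, lr = 0.1, 0.01
theta = jnp.array([0.5, -1.0, 2.0])
target = jnp.array([1.0, 0.0, -1.0])

loss = lambda t: 0.5 * jnp.sum((t - target) ** 2)
reg_loss = lambda t: loss(t) + 0.5 * lam * jnp.sum(t ** 2)

# Total gradient of the regularized loss equals grad L(theta) + lam * theta.
g_reg = jax.grad(reg_loss)(theta)
print(bool(jnp.allclose(g_reg, jax.grad(loss)(theta) + lam * theta)))  # True

# One SGD step on the regularized loss ...
step_l2 = theta - lr * g_reg
# ... matches shrinking the weights by (1 - lr * lam) and taking a plain gradient step.
step_wd = (1.0 - lr * lam) * theta - lr * jax.grad(loss)(theta)
print(bool(jnp.allclose(step_l2, step_wd)))  # True
```

The second check is just the gradient identity rearranged: subtracting $\eta(\nabla_\theta L(\theta) + \lambda\theta)$ from $\theta$ is the same as multiplying the weights by $(1 - \eta\lambda)$ before the ordinary gradient step.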