Weight decay, which involves scaling all of the weights in a model by a constant factor slightly less than $1$ at every update step, is equivalent to L2 regularization. To see why, let's first explicitly write out the L2-regularized loss as a function of the parameters $w$:

$$L_{\text{reg}}(w) = L(w) + \frac{\lambda}{2} \|w\|^2$$

Here $L(w)$ is the unregularized loss and $\lambda > 0$ controls the strength of the penalty.
(We choose the factor of $\tfrac{1}{2}$ for convenience, so that the gradient below comes out cleanly.) To optimize the loss function, we need the gradient of the regularized loss with respect to the weights. The gradient of the regularization term with respect to $w$ is

$$\nabla_w \left( \frac{\lambda}{2} \|w\|^2 \right) = \lambda w$$
This means that when we apply L2 regularization, the gradient for each weight picks up an additional term $\lambda w$. Thus, the total gradient of the regularized loss is

$$\nabla_w L_{\text{reg}}(w) = \nabla_w L(w) + \lambda w$$
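As a quick sanity check, here is a minimal NumPy sketch of this fact. The toy least-squares loss, the value of $\lambda$, and all variable names are illustrative choices of mine, not from the text: it numerically differentiates the regularized loss and confirms the result is the plain gradient plus $\lambda w$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # toy data for an illustrative least-squares loss
b = rng.normal(size=5)
w = rng.normal(size=3)        # example weight vector
lam = 0.1                     # regularization strength lambda (illustrative value)
eps = 1e-6                    # finite-difference step size

def loss_plain(w):
    # Unregularized toy loss: L(w) = 0.5 * ||A w - b||^2
    return 0.5 * np.sum((A @ w - b) ** 2)

def loss_reg(w):
    # L2-regularized loss: L(w) + (lambda / 2) * ||w||^2
    return loss_plain(w) + 0.5 * lam * np.dot(w, w)

def grad_plain(w):
    # Analytic gradient of the unregularized toy loss: A^T (A w - b)
    return A.T @ (A @ w - b)

# Numerical gradient of the regularized loss via central differences.
num_grad = np.array([
    (loss_reg(w + eps * e) - loss_reg(w - eps * e)) / (2 * eps)
    for e in np.eye(len(w))
])

# It matches the plain gradient plus lambda * w, as derived above.
assert np.allclose(num_grad, grad_plain(w) + lam * w, atol=1e-6)
```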
Now, the plain-vanilla gradient descent update rule, with learning rate $\eta$, is

$$w \leftarrow w - \eta \, \nabla_w L(w)$$
But now we're adding the regularization term, so the update becomes

$$w \leftarrow w - \eta \left( \nabla_w L(w) + \lambda w \right)$$
Notice that when you distribute $\eta$ across the terms, you end up subtracting a constant multiple of the weight $w$ itself:

$$w \leftarrow (1 - \eta \lambda)\, w - \eta \, \nabla_w L(w)$$
Hence we see that L2 regularization results in exactly the weight decay mechanism described at the start: every update scales the weights by the constant factor $(1 - \eta \lambda)$ before taking the usual gradient step.
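To make the equivalence concrete, here is a minimal NumPy sketch of a single update step. The values of $\eta$ and $\lambda$ and the stand-in gradient vector are illustrative choices of mine; the point is only that the two forms of the update produce the same weights.

```python
import numpy as np

rng = np.random.default_rng(1)
w0 = rng.normal(size=3)    # current weights
g = rng.normal(size=3)     # stand-in for the plain gradient grad L(w0) at those weights
eta, lam = 0.05, 0.1       # learning rate and regularization strength (illustrative values)

# One gradient step on the L2-regularized loss: w - eta * (grad L(w) + lambda * w)
w_l2 = w0 - eta * (g + lam * w0)

# Weight-decay form of the same step: shrink w by (1 - eta * lambda), then take the plain step
w_decay = (1 - eta * lam) * w0 - eta * g

assert np.allclose(w_l2, w_decay)   # the two updates coincide
```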