Stochastic gradient descent can have a regularizing effect on neural networks.

To see why, first consider why bootstrap aggregation (bagging) is a good strategy when individual learners have high variance. Each base learner is trained on a different bootstrap sample of the underlying data, and the ensemble averages across these learners to make a prediction. Any one observation therefore appears in only some of the samples and is unlikely to have a large effect on the prediction as a whole.
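
As a rough sketch of these mechanics, the hypothetical NumPy example below draws bootstrap samples, fits one base learner per sample (a simple least-squares line here, though in practice bagging is usually paired with high-variance learners such as deep decision trees), and averages their predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (hypothetical example).
x = rng.uniform(0.0, 1.0, size=100)
y = 2.0 * x + rng.normal(scale=0.2, size=100)

def fit_base_learner(xs, ys):
    """A deliberately simple base learner: a least-squares line y = a*x + b."""
    a, b = np.polyfit(xs, ys, deg=1)
    return a, b

# Bagging: each base learner is trained on its own bootstrap sample
# (drawn with replacement), so any single observation is absent from
# a sizeable share of the samples.
n_learners = 100
learners = []
for _ in range(n_learners):
    idx = rng.integers(0, len(x), size=len(x))
    learners.append(fit_base_learner(x[idx], y[idx]))

def bagged_predict(x_new):
    """The ensemble prediction is the average over all base learners."""
    return float(np.mean([a * x_new + b for a, b in learners]))

print(bagged_predict(0.5))
```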

When performing any form of gradient descent, we use the gradient as a local linear approximation of the loss landscape around the current parameter values, in effect fitting a tangent plane at that point. If we include all observations in every step, as in full-batch gradient descent, then outliers contribute to every update and may “tilt” this plane in a particular direction. Over time, these extra “tilts” can push the model parameters towards a minimum that is particular to the training data and its outliers.
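
The sketch below, assuming a hypothetical one-parameter linear model with squared-error loss, shows how a single outlier can dominate the averaged full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: a one-parameter linear model y_hat = w * x
# trained with squared-error loss.
x = rng.uniform(0.0, 1.0, size=32)
y = 3.0 * x + rng.normal(scale=0.1, size=32)
y[0] = 30.0  # a single extreme outlier

w = 0.0  # current parameter value

def per_example_grad(w, xi, yi):
    """d/dw of (w * xi - yi)^2 for one observation."""
    return 2.0 * (w * xi - yi) * xi

grads = np.array([per_example_grad(w, xi, yi) for xi, yi in zip(x, y)])

# Full-batch gradient descent averages every per-example gradient on
# every step, so the outlier's pull is present in every single update.
print("gradient from the outlier alone: ", grads[0])
print("median per-example gradient:     ", np.median(grads))
print("full-batch (averaged) gradient:  ", grads.mean())
```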

If we instead estimate the gradient from a small mini-batch of examples, we give less influence to outliers: each one can “tilt” the plane only during the single mini-batch per epoch in which it appears. During that batch it may “tilt” the plane quite a bit, but the many other batches quickly compensate. As a result, the final fit will be less reflective of these outliers.
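
A mini-batch version of the same hypothetical setup makes the mechanism concrete; the batch size and learning rate are illustrative choices, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Same hypothetical setup as above: y_hat = w * x, squared-error loss,
# one extreme outlier at index 0.
x = rng.uniform(0.0, 1.0, size=32)
y = 3.0 * x + rng.normal(scale=0.1, size=32)
y[0] = 30.0

w, lr, batch_size = 0.0, 0.1, 8  # illustrative values

def batch_grad(w, xb, yb):
    """Average gradient of (w * x - y)^2 over one mini-batch."""
    return float(np.mean(2.0 * (w * xb - yb) * xb))

for epoch in range(50):
    order = rng.permutation(len(x))  # reshuffle the data each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        # The outlier lands in exactly one of these mini-batches per
        # epoch; every other update in the epoch ignores it entirely.
        w -= lr * batch_grad(w, x[idx], y[idx])

print("slope after mini-batch SGD:", round(w, 3))
```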

It should be noted that the same logic applies when using row subsampling with gradient boosting, as can be enabled in XGBoost via its subsample parameter. Gradient boosting can be seen as a close cousin of gradient descent that operates in function space: each new base learner approximates the negative gradient of the loss with respect to the current ensemble’s predictions, and the additive ensemble itself acts as a ‘memory’ of every previous step. Hence subsampling has a similar regularizing effect, provided that proportionally more base learners are used.
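
As a sketch using the xgboost scikit-learn wrapper (subsample, n_estimators, and learning_rate are real XGBoost parameters, but the values here are illustrative assumptions, not recommendations):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Stochastic gradient boosting: each tree is fit on a random subset of
# the rows. With subsample < 1, more trees (and a smaller learning rate)
# are typically used so the ensemble still sees enough of the data.
model = XGBRegressor(
    n_estimators=500,   # proportionally more base learners
    learning_rate=0.05,
    subsample=0.5,      # each tree sees a random half of the rows
    max_depth=3,
)
model.fit(X, y)
print(model.predict(X[:3]))
```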