XGBoost and LightGBM expose a large number of hyperparameters. Given that hyperparameter tuning can incur a significant computational cost, it is often desirable to focus on the ones that are most impactful.

If you don’t have time to tune hyperparameters, you may be better off with a random forest; it often has better out-of-the-box performance.

Learning rate

Effect: Smaller learning rates help generalization but can require far more learning rounds.

Description: For gradient boosting, the term “learning rate” actually refers to the coefficient applied to each base learner. (Recall that the prediction is the sum of these scaled base learners.) Hence a lower learning rate implies a smaller contribution from each tree. Because of this, it is usually necessary to compensate for a reduction in learning rate by increasing the maximum number of base learners (and vice versa).

Typical range: 0.01 to 0.3
XGBoost: eta
LightGBM: learning_rate
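
As a rough illustration, here is a minimal sketch using XGBoost’s scikit-learn wrapper; the training arrays X_train and y_train are assumed to exist already, and the specific values are placeholders rather than recommendations.

```python
# A minimal sketch using XGBoost's scikit-learn wrapper.
# X_train and y_train are assumed to exist already.
import xgboost as xgb

# Higher learning rate, fewer rounds: each tree contributes more.
fast = xgb.XGBRegressor(learning_rate=0.3, n_estimators=200)

# Lower learning rate, more rounds: each tree contributes less, so more
# trees are needed to reach a comparable fit.
slow = xgb.XGBRegressor(learning_rate=0.03, n_estimators=2000)

for model in (fast, slow):
    model.fit(X_train, y_train)
```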

Maximum number of learning rounds

Effect: A fairly direct “knob” for the bias-variance tradeoff, complicated by its relationship with learning rate.

Description: Since each learning round fits the residuals of the previous predictions, each additional tree introduces finer details into the decision boundary. Hence, all other things being equal, more trees means higher variance, and fewer trees means higher bias.

But all other things are not equal: if the learning rate is low, more trees are necessary to achieve a sufficient response to variation in the input. The result is an interaction between the two that is difficult to reason about analytically and usually has to be tuned empirically.

Note that early stopping is critical to avoid overfitting when using a relatively high maximum.

Typical range: 100 to 1000, but be sure to use early stopping.
XGBoost: n_estimators
LightGBM: num_iterations or num_boost_round
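
As a hedged sketch, early stopping with LightGBM’s native training API might look like the following; the arrays X_train, y_train, X_val, and y_val are assumed to exist, and the callback-based style reflects recent LightGBM versions.

```python
# A minimal sketch of early stopping with LightGBM's native API.
# X_train, y_train, X_val, y_val are assumed to exist already.
import lightgbm as lgb

train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

booster = lgb.train(
    params={"objective": "regression", "learning_rate": 0.05},
    train_set=train_set,
    num_boost_round=1000,  # a generous maximum; early stopping decides the rest
    valid_sets=[val_set],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print("Best iteration:", booster.best_iteration)
```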

Maximum tree depth

Effect: Controls the number of nonlinearities that each tree can incorporate.

Description: Each split in the tree is a nonlinearity. More nonlinearity means more opportunity to learn variation in the training set, both signal and noise.

Hence, like the number of learning rounds, tree depth is a “knob” for bias and variance, though it is not as directly coupled to the learning rate. However, excessively shallow trees can greatly limit a model’s predictive power, especially on highly nonlinear data. As a result, shallower trees usually call for more boosting rounds, which in turn reintroduces an indirect sensitivity to the learning rate.

Typical range: 3 to 10
XGBoost: max_depth
LightGBM: max_depth
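
One common way to pick a depth empirically is a small cross-validated search, sketched below with scikit-learn’s GridSearchCV and XGBoost’s scikit-learn wrapper; X and y are placeholders for data already loaded, and the grid values are illustrative only.

```python
# A minimal sketch of a cross-validated search over tree depth.
# X and y are assumed to exist already.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    estimator=xgb.XGBRegressor(learning_rate=0.1, n_estimators=300),
    param_grid={"max_depth": [3, 4, 6, 8, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```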

Subsample fraction

Effect: Behaves somewhat like batch size in stochastic gradient descent.

Description: When subsampling is enabled (by changing the fraction from the default of 1.0), each base learner encounters a different subset of the data. Recall that using mini-batches in gradient descent reduces the influence of any particular example, but causes the parameters to “jump around” as the model responds to very different data. The same applies to subsample fraction, though typically the subsample is still a majority of the training data.

Like mini-batches, subsampling can cause training to diverge when combined with a high learning rate. Combined with a low learning rate, however, it can have a strong regularizing effect.

Note that, in most cases, you don’t need to tune both subsample fraction and feature fraction, as they often have similar effects. This is not a hard and fast rule, though.

Typical range: 0.5 to 0.9
XGBoost: subsample
LightGBM: subsample
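
As a hedged sketch, the fraction is simply set on the estimator; X_train and y_train are assumed to exist, and the commented-out line shows the related feature-fraction parameter rather than a recommended setting.

```python
# A minimal sketch of row subsampling with XGBoost's scikit-learn wrapper.
# X_train and y_train are assumed to exist already.
import xgboost as xgb

model = xgb.XGBRegressor(
    learning_rate=0.05,      # a low learning rate pairs well with subsampling
    n_estimators=500,
    subsample=0.8,           # each tree sees a random 80% of the rows
    # colsample_bytree=0.8,  # the feature-fraction counterpart; usually tune one or the other
)
model.fit(X_train, y_train)
```

Note that in LightGBM, row subsampling takes effect only when the bagging frequency (subsample_freq, a.k.a. bagging_freq) is set to a positive value.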