L1 regularization is a regularization technique applicable to many methods in statistical learning. It consists of adding a penalty to the loss function proportional to the sum of the absolute values of something. In parametric models, the “something” is the parameters. It is called “L1” regularization because it uses the Manhattan distance (i.e., the ℓ1 norm) as the measure of weight magnitude.
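
As a concrete illustration, here is a minimal sketch of what that penalty looks like, assuming a linear model with squared-error loss; the penalty strength `lam` is a hypothetical hyperparameter chosen only for illustration:

```python
import numpy as np

def l1_penalized_mse(X, y, w, lam):
    """Mean squared error plus an L1 penalty on the parameters.

    The penalty is lam * sum(|w_i|), i.e. lam times the Manhattan
    (L1) norm of the weight vector w.
    """
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    l1_penalty = lam * np.sum(np.abs(w))
    return mse + l1_penalty

# Toy usage: a sparse true weight vector and noisy observations.
X = np.random.randn(100, 5)
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ w_true + 0.1 * np.random.randn(100)
print(l1_penalized_mse(X, y, w_true, lam=0.1))
```

A learner that minimizes this penalized objective is rewarded for setting as many weights as possible exactly to zero, which is the mechanism behind the sparsity discussed next.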

L1 regularization promotes model sparsity by pushing parameter weights towards zero. It can help with collinearity, but may do so by completely eliminating a feature that carries meaningful information. It is most suitable when one suspects that a small set of features explains most of the variance in the response variable. L1 and L2 regularization can also be used together (a combination known as the elastic net); this is often a good approach in linear models.
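
For instance, in scikit-learn the same linear model can be fit with a pure L1 penalty (Lasso) or with a mix of L1 and L2 penalties (ElasticNet). A minimal sketch, with penalty strengths chosen arbitrarily for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features carry signal; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Pure L1 penalty: most noise coefficients are driven exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:     ", np.round(lasso.coef_, 2))

# L1 and L2 combined (elastic net): l1_ratio controls the mix.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("ElasticNet coefficients:", np.round(enet.coef_, 2))
```

Comparing the printed coefficient vectors shows the sparsity induced by the L1 term: the uninformative features receive weights of exactly zero rather than merely small values.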

In non-parametric models, L1 regularization has a case-by-case interpretation, and is not always applicable. For example, there is no straightforward way to use L1 regularization in a k-NN regression or a single decision tree. However, gradient boosting models do have a form of L1 regularization. The parameter, called alpha in XGBoost, applies a penalty to the absolute values of the weights of all leaf nodes in the base learners. This reduces the total amount of information that the meta-learner receives. Hence, it encourages sparsity in the latent features (“meta-features”) of the ensemble, but not in the inputs.
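
In code, this penalty is just another hyperparameter. A minimal sketch using XGBoost’s scikit-learn-style wrapper, where the wrapper exposes the penalty as reg_alpha (the native-API name is simply alpha); the penalty strength and other settings are arbitrary, chosen only for illustration:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=500)

# reg_alpha applies an L1 penalty to the leaf weights of each tree,
# shrinking some leaf outputs to exactly zero.
model = xgb.XGBRegressor(n_estimators=100, reg_alpha=1.0)
model.fit(X, y)
print(model.predict(X[:5]))
```

Note that, as described above, this regularizes the leaf weights of the base learners rather than the input features themselves.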

L1 regularization is often treated as synonymous with the Least Absolute Shrinkage and Selection Operator (LASSO), though technically they are not the same thing: LASSO refers to a specific estimator (L1-penalized least-squares regression), whereas “L1 regularization” refers to any regularizer based on the sum of the absolute values of something.