Most things are BaseEstimators

Most things, even preprocessing steps like StandardScaler, inherit from BaseEstimator. Exceptions include utilities for interfacing with external tools, such as DecisionBoundaryDisplay (which draws on a matplotlib “artist”).

More specialized behavior is achieved using mixins rather than a class hierarchy. This allows for greater flexibility in how components can be used, at the cost of giving up a strict taxonomy of components.

BaseEstimator provides hooks for hyperparameter tuning, serialization/deserialization, and validation, via its get_params and set_params methods.
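A quick sketch of what that looks like (LogisticRegression is just an arbitrary example estimator):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0)
print(clf.get_params()["C"])   # 1.0 -- read all hyperparameters as a dict
clf.set_params(C=0.1)          # rewrite them in place; this is how tuners like GridSearchCV work
print(clf.get_params()["C"])   # 0.1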

So operations are not functions

Let’s say you want to project something into its principal components. You might be surprised to learn that there is no PCA function; rather, sklearn.decomposition.PCA is an estimator that must be fit first:

from sklearn.decomposition import PCA

# X is your data matrix, shape (n_samples, n_features)
pca = PCA(n_components=3)        # configure the estimator; nothing is computed yet
pca.fit(X)                       # estimate the components from X
X_projected = pca.transform(X)   # project X onto them

Syntax notes

There are multiple syntaxes for the same thing

The following are all equivalent. That is, they all return the same projection, and the pca object ends up with new internal state as a side effect.

pca.fit(X)                              # option 1: fit, then...
X_projected = pca.transform(X)          # ...transform
X_projected = pca.fit(X).transform(X)   # option 2: chained (fit returns self)
X_projected = pca.fit_transform(X)      # option 3: combined method

Ending underscores mean estimated/derived

Some attributes end in an underscore. This indicates that an attribute is estimated from the data, rather than being directly or indirectly supplied. For example, when using sklearn.decomposition.PCA, you have the attributes components_ and explained_variance_.
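For example (toy data; the shapes are the point, not the values):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(20, 5)
pca = PCA(n_components=3).fit(X)
print(pca.components_.shape)          # (3, 5) -- estimated from the data, hence the underscore
print(pca.explained_variance_.shape)  # (3,)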

Important mixins

Although not explicitly required through an abstract method, the following mixins all expect a fit method: a stateful method that adjusts the estimator’s learned attributes as a side effect and, by convention, returns self.

ClassifierMixin

ClassifierMixin itself is fairly lightweight, mainly adding a score function that returns the accuracy of the prediction. The action mainly comes from classes expecting to interact with it. All instances of ClassifierMixin are assumed to have a predict method in addition to the fit method that all estimators are supposed to have.
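As a sketch of the contract (MajorityClassifier is a made-up toy, not part of scikit-learn), once fit and predict exist, the inherited score reports accuracy:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Always predicts the most frequent class seen during fit."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]  # trailing underscore: estimated from data
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

X, y = np.zeros((6, 2)), np.array([0, 0, 1, 1, 1, 1])
clf = MajorityClassifier().fit(X, y)
print(clf.score(X, y))  # ClassifierMixin.score -> accuracy, here 4/6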

Example instances:

  • sklearn.ensemble.GradientBoostingClassifier
  • sklearn.ensemble.StackingClassifier
  • sklearn.neighbors.KNeighborsClassifier
  • sklearn.neural_network.MLPClassifier
  • sklearn.svm.LinearSVC (via LinearClassifierMixin)
  • sklearn.tree.DecisionTreeClassifier

RegressorMixin

Like ClassifierMixin, the RegressorMixin adds a score function (this time for the coefficient of determination), but otherwise does little on its own. Again, instances are assumed to have a predict method.
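A quick sketch of what that score means in practice (toy data chosen so the fit is exact):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + 1                 # perfectly linear data
reg = LinearRegression().fit(X, y)
print(reg.score(X, y))                # RegressorMixin.score -> R^2, here 1.0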

Example instances:

  • sklearn.ensemble.RandomForestRegressor
  • sklearn.linear_model.LinearRegression
  • sklearn.neighbors.KNeighborsRegressor
  • sklearn.neural_network.MLPRegressor
  • sklearn.svm.LinearSVR

ClusterMixin

Like ClassifierMixin and RegressorMixin, ClusterMixin adds one main convenience method, in this case fit_predict. This interface is for consistency only! In practice, y is ignored, and predict is never called:

def fit_predict(self, X, y=None, **kwargs):
    self.fit(X, **kwargs)
    return self.labels_

The inclusion of this method allows ClusterMixin instances to be passed as part of a Pipeline, but it comes at the cost of clarity.
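A usage sketch with toy data:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [0.1], [10.0], [10.1]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # same as fit(X) then labels_
print(labels)  # two clusters, e.g. [1 1 0 0]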

Important instances:

  • sklearn.cluster.AgglomerativeClustering
  • sklearn.cluster.KMeans (via _BaseKMeans)

TransformerMixin

Not to be confused with the Transformer architecture, “transformers” in SKL refer to data transformation operations such as those used for preprocessing.

In addition to fit, the TransformerMixin assumes that the transform method is implemented. The mixin’s main hook is a method called fit_transform, and it simply delegates to the two named methods.
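A minimal sketch (MeanCenterer is a made-up toy, not part of scikit-learn): define fit and transform, and fit_transform comes for free:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(TransformerMixin, BaseEstimator):
    """Subtracts the per-column mean learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.mean_

X = np.array([[1.0, 10.0], [3.0, 30.0]])
print(MeanCenterer().fit_transform(X))  # delegates to fit(X).transform(X)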

Important instances:

  • sklearn.decomposition.PCA
  • sklearn.decomposition.LatentDirichletAllocation
  • sklearn.manifold.TSNE
  • sklearn.preprocessing.FunctionTransformer
  • sklearn.preprocessing.LabelEncoder

Important day-to-day packages

There is a LOT that’s built into SKL. The user guide is here and the API reference is here. The packages required for most daily tasks, particularly interview tasks, are as follows.

Utility

sklearn.datasets

Contains some canned datasets (like Iris) along with functions for generating ad-hoc benchmark datasets.
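For example:

from sklearn.datasets import load_iris, make_classification

X, y = load_iris(return_X_y=True)                                            # canned dataset
X2, y2 = make_classification(n_samples=200, n_features=10, random_state=0)   # ad-hoc benchmark
print(X.shape, X2.shape)                                                     # (150, 4) (200, 10)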

sklearn.pipeline

Mostly interesting for the Pipeline class, which lets you chain transformers and a final predictor.
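A sketch of a two-step chain (scale, then classify):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)           # fit_transform on each intermediate step, then fit on the final estimator
print(pipe.score(X, y))  # delegates to the final step's score (accuracy here)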

Feature engineering

sklearn.decomposition

Mainly interesting to me for PCA and LDA (the topic-modeling LDA, LatentDirichletAllocation; the other LDA has its own package).

sklearn.feature_extraction

Has feature extractors for text (TF-IDF, counts) and images (patch extraction).
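A TF-IDF sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the dog barked"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse document-term matrix
print(X.shape)                       # (3, 5)
print(vec.get_feature_names_out())   # ['barked' 'cat' 'dog' 'sat' 'the'] (sklearn >= 1.0)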

sklearn.impute

This package has four useful classes related to missing values. They are all estimators (transformers, specifically), meaning that you still call fit and then transform (though you can condense this to fit_transform(...)), which returns the updated version of your input. A short SimpleImputer sketch follows the descriptions below.

SimpleImputer: Your usual strategies: mean, median, most frequent (mode), or a constant. Usually the right choice for your baseline.

KNNImputer: Impute features for each column based on a similarity metric (which obviously can only take numerical values into account).

MissingIndicator: not actually an imputer; it just adds flags for missing values.

IterativeImputer (experimental; it must be enabled via from sklearn.experimental import enable_iterative_imputer): Iteratively imputes features for each column based on the values from the others.
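A baseline sketch with SimpleImputer:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
imp = SimpleImputer(strategy="mean")
print(imp.fit_transform(X))  # the NaN becomes the column mean, 3.0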

sklearn.preprocessing

Lots of data cleaning classes. Some of the most useful are:

  • OneHotEncoder and LabelEncoder (for categorical variables)
  • Column-wise standardization (StandardScaler) and row-wise normalization to unit norm (Normalizer)

SKL provides a user guide on preprocessing data.
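Quick sketches of both bullets:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X_num = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
print(StandardScaler().fit_transform(X_num).mean(axis=0))   # each column centered to mean 0

X_cat = np.array([["red"], ["green"], ["red"]])
print(OneHotEncoder().fit_transform(X_cat).toarray())        # sparse by default, hence toarray()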

Models

sklearn.cluster

Your usual clustering algorithms (K-means, hierarchical, biclustering, etc.).

sklearn.ensemble

The usual suspects (bagging, RF, boosting, extra trees, etc.), plus stacking for layered ensembles. For each of these, you have separate classes for regression and for classification.

sklearn.linear_model

More stuff than you’d expect. In addition to linear and logistic regression, you also have a linear single-layer perceptron class (Perceptron), ridge regression, etc. They also have classes that combine regressors with regularization, resulting in a combinatorial explosion of classes.

sklearn.neighbors

Classes for nearest-neighbor search (e.g. KDTree) and nearest-neighbor regression and classification.

sklearn.svm

Support vector machines for regression and classification.

sklearn.tree

Base decision tree classes. Outside of an interview, you’d rarely use these unless you were building a fancy custom ensemble for some reason.

WARNING: this module has ExtraTreeClassifier and ExtraTreeRegressor classes. These are single trees. You almost certainly want ExtraTreesClassifier and ExtraTreesRegressor (note the plural!) from sklearn.ensemble.

Evaluation

sklearn.metrics

All the performance metrics you could ever want for classification, regression, ranking, clustering, distance and more.
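For instance:

from sklearn.metrics import accuracy_score, mean_squared_error

print(accuracy_score([0, 1, 1], [0, 1, 0]))        # 2 of 3 correct -> 0.666...
print(mean_squared_error([1.0, 2.0], [1.5, 2.5]))  # 0.25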

sklearn.model_selection

This probably should have been split into a package called cross_validation and then this one, as there are dozens of members that deal with cross-validation of one kind or another, including CV for hyperparameter tuning.

The remaining members are a hodgepodge. You have classes for things like tuning the threshold for binary classifiers, other non-CV classes for hyperparameter tuning, and, for some reason, the function for train-test splitting (though you have to call it twice if you want a dev split). You even have a couple of classes that draw plots.
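A sketch of the two members you’ll reach for most often, train_test_split and GridSearchCV:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)  # CV-based hyperparameter tuning
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))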

What a mess. The moral of the story is that if you can’t find it, there’s a good chance they stuck it in here. (Though there’s also a utils package. How they chose which shit to dump into each is beyond me.)