Most things are BaseEstimators
Most things, even preprocessing steps like StandardScaler, inherit from BaseEstimator (api, code, guide). Exceptions include utilities for interfacing with external tools, such as DecisionBoundaryDisplay (which draws on a matplotlib “artist”).
More specialized behavior is achieved using mixins rather than a class hierarchy. This allows for greater flexibility in how components can be used, at the cost of a strict taxonomy of components.
BaseEstimator provides hooks for hyperparameter tuning, serialization/deserialization, and validation, via its get_params and set_params methods.
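A sketch of how those hooks look in practice, using Ridge as an arbitrary example:

import warnings
from sklearn.linear_model import Ridge

# Any BaseEstimator exposes its constructor arguments as "params".
ridge = Ridge(alpha=1.0)
print(ridge.get_params())   # {'alpha': 1.0, 'copy_X': True, ...}

# Tuning code (e.g. GridSearchCV) mutates hyperparameters generically:
ridge.set_params(alpha=10.0)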
So operations are not functions
Let’s say you want to project something into its principal components. You might be surprised to learn that there is no PCA function; rather, sklearn.decomposition.PCA is an estimator that must be fit first:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(X)
X_projected = pca.transform(X)

Syntax notes
There are multiple syntaxes for the same thing
The following are all equivalent. That is, they all return the same projection, and the pca object ends up with new internal state as a side effect.
pca.fit(X)
X_projected = pca.transform(X)

X_projected = pca.fit(X).transform(X)

X_projected = pca.fit_transform(X)

Ending underscores mean estimated/derived
Some attributes end in an underscore. This indicates that the attribute is estimated from the data, rather than being directly or indirectly supplied. For example, after fitting sklearn.decomposition.PCA, you have attributes like components_ and explained_variance_.
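A quick illustration of the convention:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)
pca = PCA(n_components=3).fit(X)

pca.n_components          # 3: supplied by you, so no underscore
pca.components_.shape     # (3, 5): estimated from the data
pca.explained_variance_   # also estimated, hence the underscore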
Important mixins
Although not explicitly required through an abstract method, the following mixins all expect a fit method, which is a stateful method that adjusts the estimator’s parameters as a side effect.
ClassifierMixin (occurrences)
ClassifierMixin itself is fairly lightweight, mainly adding a score function that returns the accuracy of the prediction. The action mainly comes from classes expecting to interact with it. All instances of ClassifierMixin are assumed to have a predict method in addition to the fit method that all estimators are supposed to have.
Example instances:
- sklearn.ensemble.GradientBoostingClassifier
- sklearn.ensemble.StackingClassifier
- sklearn.neighbors.KNeighborsClassifier
- sklearn.neural_network.MLPClassifier
- sklearn.svm.LinearSVC (via LinearClassifierMixin)
- sklearn.tree.DecisionTreeClassifier
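To make the score claim concrete, a small sketch on the iris data; ClassifierMixin.score should agree exactly with accuracy_score:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier().fit(X, y)

# ClassifierMixin.score is just accuracy over predict's output:
assert clf.score(X, y) == accuracy_score(y, clf.predict(X))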
RegressorMixin (occurrences)
Like ClassifierMixin, the RegressorMixin adds a score function (this time for the coefficient of determination), but otherwise does little on its own. Again, instances are assumed to have a predict method.
Example instances:
- sklearn.ensemble.RandomForestRegressor (via ForestRegressor)
- sklearn.linear_model.LinearRegression
- sklearn.neighbors.KNeighborsRegressor
- sklearn.neural_network.MLPRegressor
- sklearn.svm.LinearSVR
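The regression analogue, sketched with LinearRegression on synthetic data:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
reg = LinearRegression().fit(X, y)

# RegressorMixin.score is the coefficient of determination (R^2):
assert reg.score(X, y) == r2_score(y, reg.predict(X))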
ClusterMixin (occurrences)
Where ClassifierMixin and RegressorMixin assume a predict method, the ClusterMixin class has a fit_predict method. This interface is for consistency only! In practice, y is ignored, and predict is never called:
def fit_predict(self, X, y=None, **kwargs):
    self.fit(X, **kwargs)
    return self.labels_

The inclusion of this method allows ClusterMixin instances to be passed as part of a Pipeline, but it comes at the cost of clarity.
Important instances:
- sklearn.cluster.AgglomerativeClustering
- sklearn.cluster.KMeans (via _BaseKMeans)
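A small sketch of the resulting usage pattern, with KMeans on random points:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)
km = KMeans(n_clusters=3, n_init=10, random_state=0)

# One call fits the estimator and returns the training labels...
labels = km.fit_predict(X)

# ...which is the same thing as fitting and reading labels_:
assert (labels == km.labels_).all()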
TransformerMixin (occurrences)
Not to be confused with the Transformer architecture, “transformers” in SKL refer to data transformation operations such as those used for preprocessing.
In addition to fit, the TransformerMixin assumes that the transform method is implemented. The mixin’s main hook is a method called fit_transform, and it simply delegates to the two named methods.
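That makes the mixin convenient for rolling your own transformers: implement fit and transform, and fit_transform comes for free. A minimal, hypothetical example:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: subtracts the column means."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X).mean(axis=0)  # trailing underscore: learned
        return self  # fit must return self so calls can be chained

    def transform(self, X):
        return np.asarray(X) - self.mean_

# fit_transform is supplied by the mixin; we never defined it:
X_centered = MeanCenterer().fit_transform(np.random.rand(10, 3))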
Important instances:
- sklearn.decomposition.PCA
- sklearn.decomposition.LatentDirichletAllocation
- sklearn.manifold.TSNE
- sklearn.preprocessing.FunctionTransformer
- sklearn.preprocessing.LabelEncoder
Important day-to-day packages
There is a LOT that’s built into SKL. The user guide is here and the API reference is here. The packages required for most daily tasks, particularly interview tasks, are as follows.
Utility
sklearn.datasets
Contains some canned datasets (like Iris) along with classes for generating ad-hoc benchmark datasets.
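A quick sketch of both:

from sklearn.datasets import load_iris, make_classification

# Canned dataset:
X, y = load_iris(return_X_y=True)

# Ad-hoc benchmark dataset:
X_fake, y_fake = make_classification(n_samples=500, n_features=10, random_state=0)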
sklearn.pipeline
Mostly interesting for the Pipeline class, which lets you chain transformers and end with a predictor.
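A minimal sketch (assuming X_train, y_train, and X_test already exist):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Intermediate steps must be transformers; the final step can be a predictor.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)   # X_train, y_train assumed to exist
pipe.predict(X_test)         # the scaler is applied automatically first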
Feature engineering
sklearn.decomposition
Mainly interesting to me for PCA and LDA (LatentDirichletAllocation, i.e. topic modeling). The other LDA, LinearDiscriminantAnalysis, has its own package (sklearn.discriminant_analysis).
sklearn.feature_extraction
Has feature extractors for text (TF-IDF, counts) or images (patch extraction).
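For example, turning raw documents into a TF-IDF matrix (a quick sketch):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "cats and dogs"]
vec = TfidfVectorizer()
X_tfidf = vec.fit_transform(docs)    # sparse (3, n_vocab) matrix
print(vec.get_feature_names_out())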
sklearn.impute
This package has four useful classes related to missing values. They are all estimators, meaning that you still use fit and then transform (though you can condense this to fit_transform(...)), which will return the updated version of your input; see the sketch after this list.
SimpleImputer: Your usual strategies: mean, median, mode, or constant. Usually the right choice for your baseline.
KNNImputer: Impute missing values from each sample's nearest neighbors, using a similarity metric (which obviously can only take numerical values into account).
MissingIndicator: not actually an imputer; it just adds flags for missing values.
IterativeImputer (experimental): Iteratively impute features for each column based on the values from the others. It requires importing enable_iterative_imputer from sklearn.experimental first.
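A minimal sketch of that pattern with two of the classes:

import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Baseline: replace missing values with column means.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Optionally keep track of where the holes were:
mask = MissingIndicator().fit_transform(X)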
sklearn.preprocessing
Lots of data cleaning classes. Some of the most useful are:
- OneHotEncoder and LabelEncoder (for categorical variables)
- Column-wise normalization (StandardScaler) and row-wise normalization (Normalizer)
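A quick sketch of both kinds on made-up data:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column-wise: each feature ends up with zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(np.random.rand(10, 3))

# Categorical: one output column per observed category.
colors = np.array([["red"], ["green"], ["red"]])
onehot = OneHotEncoder().fit_transform(colors).toarray()  # (3, 2) dense matrix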
SKL provides a user guide on preprocessing data.
Models
sklearn.cluster
Your usual clustering algorithms (K-means, hierarchical, biclustering, etc.).
sklearn.ensemble
The usual suspects (bagging, RF, boosting, extra trees, etc.), plus stacking, which layers a meta-model on top of base estimators. For each of these, you have separate classes for regression and for classification.
sklearn.linear_model
More stuff than you'd expect. In addition to linear and logistic regression, you also have a linear SLP class (Perceptron), ridge regression, etc. There are also classes that combine regressors with regularization (Lasso, ElasticNet, and their CV variants), resulting in a combinatorial explosion of classes.
sklearn.neighbors
Classes for nearest-neighbor search (e.g. KDTree) and nearest-neighbor regression and classification.
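For instance, a KDTree query on random points returns distances and indices of the nearest neighbors:

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
tree = KDTree(rng.random((100, 3)))

# Distances and indices of the 5 nearest neighbors of one query point:
dist, ind = tree.query(rng.random((1, 3)), k=5)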
sklearn.svm
Support vector machines for regression and classification.
sklearn.tree
Base decision tree classes. Outside of an interview, you’d rarely use these unless you were building a fancy custom ensemble for some reason.
WARNING: this module has ExtraTreeClassifier and ExtraTreeRegressor classes. These are single trees. You almost certainly want ExtraTreesClassifier and ExtraTreesRegressor (note the plural!) from sklearn.ensemble.
Evaluation
sklearn.metrics
All the performance metrics you could ever want for classification, regression, ranking, clustering, distance and more.
sklearn.model_selection
This probably should have been split into two packages: one called cross_validation, and then this one. Dozens of its members deal with cross-validation of one kind or another, including CV for hyperparameter tuning.
The remaining members are a hodgepodge. On the one hand, you have classes for things like tuning the threshold for binary classifiers. Then you have non-CV classes dealing with hyperparameter tuning. For some reason, the train_test_split function is here too (though you have to call it twice if you want a dev split, as shown below). You even have a couple of classes that draw plots.
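Assuming X and y hold your full dataset, the two-call pattern looks like this (fractions chosen to yield 60/20/20):

from sklearn.model_selection import train_test_split

# First split off the test set, then carve a dev set out of what remains.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
# Net result: 60% train, 20% dev, 20% test.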
What a mess. The moral of the story is that if you can't find it, there's a good chance they stuck it in here. (Though there's also a utils package. How they chose which shit to dump into each is beyond me.)