Throughout the following, assume these conventions:
General
- $K$ is the number of classes
- $N$ is the total number of observations ($N = \sum_{k=1}^{K} N_k$)
- $N_k$ is the number of actual instances in class $k$
- $\hat{N}_k$ is the number of predicted instances in class $k$
Confusion Matrix
- $C_{ij}$ is the number of observations known to be in class $i$ but predicted to be in class $j$ (entries of the confusion matrix)
- $C_{i\cdot} = \sum_j C_{ij} = N_i$ (row sums of the confusion matrix, actual class counts)
- $C_{\cdot j} = \sum_i C_{ij} = \hat{N}_j$ (column sums of the confusion matrix, predicted class counts)
Derived Metrics
- $TP_k = C_{kk}$ (true positives for class $k$)
- $FP_k = C_{\cdot k} - C_{kk}$ (false positives for class $k$)
- $FN_k = C_{k\cdot} - C_{kk}$ (false negatives for class $k$)
- $TP = \sum_k TP_k$ (total true positives)
- $FP = \sum_k FP_k$ (total false positives)
- $FN = \sum_k FN_k$ (total false negatives)
- $TN = \sum_k TN_k$, where $TN_k = N - TP_k - FP_k - FN_k$ (total true negatives)
Agreement Metrics
- $p_o = \frac{1}{N} \sum_k C_{kk}$ (observed agreement)
- $p_e = \frac{1}{N^2} \sum_k C_{k\cdot}\, C_{\cdot k}$ (expected agreement by chance)
Functions and Parameters
- $\mathbb{1}[\cdot]$ is the indicator function (1 if the condition is true, 0 otherwise)
- $y_i$ is the true label for observation $i$
- $\hat{Y}_i^{(k)}$ is the set of top-$k$ predicted labels for observation $i$
- $\beta$ is the weight parameter in the $F_\beta$ score
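To make the notation concrete, here is a minimal NumPy sketch that computes these quantities from a small, made-up 3-class confusion matrix (the matrix values are purely illustrative):

```python
import numpy as np

# Hypothetical 3-class confusion matrix C: rows are actual classes, columns are predicted classes.
C = np.array([
    [50,  2,  3],
    [ 4, 35,  6],
    [ 1,  2, 45],
])

K = C.shape[0]              # number of classes
N = C.sum()                 # total number of observations
row_sums = C.sum(axis=1)    # C_{i.}: actual class counts
col_sums = C.sum(axis=0)    # C_{.j}: predicted class counts

# Per-class derived quantities.
TP_k = np.diag(C)                   # true positives for each class
FP_k = col_sums - TP_k              # false positives for each class
FN_k = row_sums - TP_k              # false negatives for each class
TN_k = N - TP_k - FP_k - FN_k       # true negatives for each class

# Totals across classes.
TP, FP, FN, TN = TP_k.sum(), FP_k.sum(), FN_k.sum(), TN_k.sum()

# Agreement quantities used by Cohen's kappa.
p_o = TP_k.sum() / N                        # observed agreement
p_e = (row_sums * col_sums).sum() / N**2    # expected agreement by chance

print(TP_k, FP_k, FN_k, TN_k)
print(p_o, p_e)
```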
Specific to multi-class metrics
| Metric | Expression | Class imbalance | Use when | Avoid when |
|---|---|---|---|---|
| Confusion matrix | $C_{ij}$ | Can be somewhat misleading in the presence of class imbalance | Classes are balanced and you want to understand which classes are confused with which | Classes are imbalanced |
| Normalized confusion matrix | $C_{ij} / C_{i\cdot}$ | Highlights performance on each class, especially rare ones | You are dealing with imbalanced datasets | Classes are balanced and you prefer raw counts |
| OvO ROC curve | TPR vs. FPR for each pair of classes $(i, j)$ | As a pairwise metric, it is insensitive to imbalance | You want to scrutinize a particular pairwise confusion | You have many classes |
| OvR ROC curves | TPR vs. FPR for each class $k$ vs. the rest | Sensitive to imbalance for large classes | You want to scrutinize confusion for a small class vs. all others | You are early in model development |
| Cohen’s Kappa | $\kappa = \frac{p_o - p_e}{1 - p_e}$ | Can be distorted by class imbalance | You want to know whether your classifier’s predictions are better than chance (i.e., mostly noise) | You already have reason to believe your classifier is mostly predicting noise |
| Top-k accuracy | $\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i \in \hat{Y}_i^{(k)}]$ | Deliberately tolerates misclassification of rare classes | There are several similar classes, and getting “close” is better than nothing, e.g. in computer vision applications with many classes | There is no meaningful notion of a “similar” class |
| Matthews correlation coefficient | $\frac{N \sum_k C_{kk} - \sum_k C_{k\cdot} C_{\cdot k}}{\sqrt{\left(N^2 - \sum_k C_{k\cdot}^2\right)\left(N^2 - \sum_k C_{\cdot k}^2\right)}}$ | Robust to moderate class imbalance, though it can still be distorted by extremely rare classes | You want arguably the most comprehensive single metric and can accept lower interpretability | Classes are extremely imbalanced and/or you need something that is easy to explain |
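As a rough sketch of how these metrics can be computed in practice, the following uses scikit-learn on made-up labels and class probabilities (the data and the 3-class setup are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import (
    cohen_kappa_score,
    confusion_matrix,
    matthews_corrcoef,
    roc_auc_score,
    top_k_accuracy_score,
)

# Hypothetical true labels and predicted class probabilities for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_score = np.array([                 # each row sums to 1; column j is the score for class j
    [0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.2, 0.6, 0.2], [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7], [0.2, 0.2, 0.6], [0.5, 0.3, 0.2], [0.2, 0.7, 0.1],
    [0.8, 0.1, 0.1], [0.1, 0.3, 0.6],
])
y_pred = y_score.argmax(axis=1)      # hard predictions from the scores

print(confusion_matrix(y_true, y_pred))                      # raw counts C_{ij}
print(confusion_matrix(y_true, y_pred, normalize="true"))    # row-normalized (per actual class)
print(cohen_kappa_score(y_true, y_pred))                     # agreement beyond chance
print(top_k_accuracy_score(y_true, y_score, k=2))            # true class within the top-2 scores
print(matthews_corrcoef(y_true, y_pred))                     # multi-class MCC
print(roc_auc_score(y_true, y_score, multi_class="ovr"))     # one-vs-rest AUC (macro-averaged)
print(roc_auc_score(y_true, y_score, multi_class="ovo"))     # one-vs-one AUC (macro-averaged)
```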
Generalizations of binary metrics
The following can each be used with micro, macro, and weighted averaging.
| Metric | Micro | Macro | Use when | Avoid when |
|---|---|---|---|---|
| Precision | $\frac{TP}{TP + FP}$ | $\frac{1}{K} \sum_k \frac{TP_k}{TP_k + FP_k}$ | You want to emphasize correct classifications (for macro, equally within each class) | Micro-precision and micro-recall are always identical for single-label multi-class classification |
| Recall | $\frac{TP}{TP + FN}$ | $\frac{1}{K} \sum_k \frac{TP_k}{TP_k + FN_k}$ | You want to emphasize the model’s ability to detect all instances (for macro, equally within each class) | Micro-precision and micro-recall are always identical for single-label multi-class classification |
| Accuracy | $\frac{1}{N} \sum_k C_{kk}$ | $\frac{1}{K} \sum_k \frac{TP_k + TN_k}{N}$ | Observations are balanced and you want a highly interpretable metric | You have imbalanced data, or one type of error is worse than another |
| F1 score | $\frac{2\,TP}{2\,TP + FP + FN}$ | $\frac{1}{K} \sum_k \frac{2\,TP_k}{2\,TP_k + FP_k + FN_k}$ | A good default metric for most situations | You are focused on one particular kind of error |
| F-beta score | $\frac{(1 + \beta^2)\,TP}{(1 + \beta^2)\,TP + FP + \beta^2\,FN}$ | $\frac{1}{K} \sum_k \frac{(1 + \beta^2)\,TP_k}{(1 + \beta^2)\,TP_k + FP_k + \beta^2\,FN_k}$ | You want to emphasize precision or recall, but still care about the other | You need a metric that is easy to interpret |
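A minimal scikit-learn sketch of the averaging options, again on made-up labels (the data is hypothetical, and `beta=2` is an arbitrary choice that weights recall more heavily):

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    fbeta_score,
    precision_score,
    recall_score,
)

# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1, 0, 2])

for average in ("micro", "macro", "weighted"):
    p = precision_score(y_true, y_pred, average=average)
    r = recall_score(y_true, y_pred, average=average)
    f1 = f1_score(y_true, y_pred, average=average)
    f2 = fbeta_score(y_true, y_pred, beta=2, average=average)  # beta > 1 weights recall more
    print(f"{average:>8}: P={p:.3f} R={r:.3f} F1={f1:.3f} F2={f2:.3f}")

# For single-label multi-class data, plain accuracy equals micro precision, recall, and F1.
print("accuracy:", accuracy_score(y_true, y_pred))
```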
Typical approach
The most common multi-class classification performance metric is the macro-averaged $F_1$ score.
In many situations, the multi-class MCC can actually be a more informative measure of model performance. However, it can be difficult to reason about, so it remains relatively niche.
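As a practical sketch of this typical workflow (labels below are made up), scikit-learn’s `classification_report` summarizes per-class precision, recall, and F1 together with their macro and weighted averages, and `matthews_corrcoef` gives the multi-class MCC:

```python
import numpy as np
from sklearn.metrics import classification_report, matthews_corrcoef

# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1, 0, 2])

# Per-class precision/recall/F1 plus macro and weighted averages;
# the macro-averaged F1 is the usual headline number.
print(classification_report(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```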