Throughout the following, assume these conventions:
General
- $K$ is the number of classes
- $N$ is the total number of observations ($N = \sum_{k=1}^{K} N_k$)
- $N_k$ is the number of actual instances in class $k$
- $\hat{N}_k$ is the number of predicted instances in class $k$
Confusion Matrix
- $C_{ij}$ is the number of observations known to be in class $i$ but predicted to be in class $j$ (entries of the confusion matrix)
- $C_{i\cdot} = \sum_j C_{ij} = N_i$ (row sums of the confusion matrix, actual class counts)
- $C_{\cdot j} = \sum_i C_{ij} = \hat{N}_j$ (column sums of the confusion matrix, predicted class counts)
Derived Metrics
- $TP_k = C_{kk}$ (true positives for class $k$)
- $FP_k = C_{\cdot k} - C_{kk}$ (false positives for class $k$)
- $FN_k = C_{k\cdot} - C_{kk}$ (false negatives for class $k$)
- $TP = \sum_k TP_k$ (total true positives)
- $FP = \sum_k FP_k$ (total false positives)
- $FN = \sum_k FN_k$ (total false negatives)
- $TN = \sum_k TN_k$, where $TN_k = N - TP_k - FP_k - FN_k$ (total true negatives)
Agreement Metrics
- $p_o = \frac{1}{N} \sum_k C_{kk}$ (observed agreement)
- $p_e = \frac{1}{N^2} \sum_k C_{k\cdot}\, C_{\cdot k}$ (expected agreement by chance)
Functions and Parameters
- $\mathbb{1}[\cdot]$ is the indicator function (1 if the condition is true, 0 otherwise)
- $y_i$ is the true label for observation $i$
- $\hat{Y}_i^{(k)}$ is the set of top-$k$ predicted labels for observation $i$
- $\beta$ is the weight parameter in the $F_\beta$ score
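To make the notation concrete, here is a minimal NumPy sketch that computes these quantities from a small, made-up 3-class confusion matrix (the matrix values are purely illustrative):

```python
import numpy as np

# Hypothetical 3-class confusion matrix C: rows are actual classes, columns are predicted classes.
C = np.array([
    [50,  2,  3],
    [ 4, 35,  6],
    [ 1,  2, 45],
])

K = C.shape[0]              # number of classes
N = C.sum()                 # total number of observations
row_sums = C.sum(axis=1)    # C_{i.}: actual class counts
col_sums = C.sum(axis=0)    # C_{.j}: predicted class counts

# Per-class derived quantities.
TP_k = np.diag(C)                   # true positives for each class
FP_k = col_sums - TP_k              # false positives for each class
FN_k = row_sums - TP_k              # false negatives for each class
TN_k = N - TP_k - FP_k - FN_k       # true negatives for each class

# Totals across classes.
TP, FP, FN, TN = TP_k.sum(), FP_k.sum(), FN_k.sum(), TN_k.sum()

# Agreement quantities used by Cohen's kappa.
p_o = TP_k.sum() / N                        # observed agreement
p_e = (row_sums * col_sums).sum() / N**2    # expected agreement by chance

print(TP_k, FP_k, FN_k, TN_k)
print(p_o, p_e)
```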
Specific to multi-class metrics
| Metric | Expression | Class imbalance | Use when | Avoid when |
|---|---|---|---|---|
| Confusion matrix | $C_{ij}$ | Can be somewhat misleading in the presence of class imbalance | Classes are balanced and you want to understand which classes are confused with which | Classes are imbalanced |
| Normalized confusion matrix | $C_{ij} / C_{i\cdot}$ | Highlights performance on each class, especially rare ones | You are dealing with imbalanced datasets | Classes are balanced and you prefer raw counts |
| OvO ROC curve | TPR vs. FPR for each pair of classes $(i, j)$ | As a pairwise metric, it is insensitive to imbalance | You want to scrutinize a particular pairwise confusion | You have many classes |
| OvR ROC curves | TPR vs. FPR for each class $k$ vs. the rest | Sensitive to imbalance for large classes | You want to scrutinize confusion for a small class vs. all others | You are early in model development |
| Cohen’s Kappa | $\kappa = \frac{p_o - p_e}{1 - p_e}$ | Can be distorted by class imbalance | You want to know whether your classifier’s predictions are better than chance (i.e., mostly noise) | You already have reason to believe your classifier is mostly predicting noise |
| Top-k accuracy | $\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i \in \hat{Y}_i^{(k)}]$ | Deliberately tolerates misclassification of rare classes | There are several similar classes, and getting “close” is better than nothing, e.g. in computer vision applications with many classes | There is no meaningful notion of a “similar” class |
| Matthews correlation coefficient | $\frac{N \sum_k C_{kk} - \sum_k C_{k\cdot} C_{\cdot k}}{\sqrt{\left(N^2 - \sum_k C_{k\cdot}^2\right)\left(N^2 - \sum_k C_{\cdot k}^2\right)}}$ | Robust to moderate class imbalance, though it can still be distorted by extremely rare classes | You want arguably the most comprehensive single metric and can accept lower interpretability | Classes are extremely imbalanced and/or you need something that is easy to explain |
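As a rough sketch of how these metrics can be computed in practice, the following uses scikit-learn on made-up labels and class probabilities (the data and the 3-class setup are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import (
    cohen_kappa_score,
    confusion_matrix,
    matthews_corrcoef,
    roc_auc_score,
    top_k_accuracy_score,
)

# Hypothetical true labels and predicted class probabilities for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_score = np.array([                 # each row sums to 1; column j is the score for class j
    [0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.2, 0.6, 0.2], [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7], [0.2, 0.2, 0.6], [0.5, 0.3, 0.2], [0.2, 0.7, 0.1],
    [0.8, 0.1, 0.1], [0.1, 0.3, 0.6],
])
y_pred = y_score.argmax(axis=1)      # hard predictions from the scores

print(confusion_matrix(y_true, y_pred))                      # raw counts C_{ij}
print(confusion_matrix(y_true, y_pred, normalize="true"))    # row-normalized (per actual class)
print(cohen_kappa_score(y_true, y_pred))                     # agreement beyond chance
print(top_k_accuracy_score(y_true, y_score, k=2))            # true class within the top-2 scores
print(matthews_corrcoef(y_true, y_pred))                     # multi-class MCC
print(roc_auc_score(y_true, y_score, multi_class="ovr"))     # one-vs-rest AUC (macro-averaged)
print(roc_auc_score(y_true, y_score, multi_class="ovo"))     # one-vs-one AUC (macro-averaged)
```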
Generalizations of binary metrics
The following can each be used with micro, macro, and weighted averaging.
| Metric | Micro | Macro | Use when | Avoid when |
|---|---|---|---|---|
| Precision | $\frac{TP}{TP + FP}$ | $\frac{1}{K} \sum_k \frac{TP_k}{TP_k + FP_k}$ | You want to emphasize correct classifications (for macro, equally within each class) | Micro-precision and micro-recall are always identical for single-label multi-class classification |
| Recall | $\frac{TP}{TP + FN}$ | $\frac{1}{K} \sum_k \frac{TP_k}{TP_k + FN_k}$ | You want to emphasize the model’s ability to detect all instances (for macro, equally within each class) | Micro-precision and micro-recall are always identical for single-label multi-class classification |
| Accuracy | $\frac{1}{N} \sum_k C_{kk}$ | $\frac{1}{K} \sum_k \frac{TP_k + TN_k}{N}$ | Observations are balanced and you want a highly interpretable metric | You have imbalanced data, or one type of error is worse than another |
| F1 score | $\frac{2\,TP}{2\,TP + FP + FN}$ | $\frac{1}{K} \sum_k \frac{2\,TP_k}{2\,TP_k + FP_k + FN_k}$ | A good default metric for most situations | You are focused on one particular kind of error |
| F-beta score | $\frac{(1 + \beta^2)\,TP}{(1 + \beta^2)\,TP + FP + \beta^2\,FN}$ | $\frac{1}{K} \sum_k \frac{(1 + \beta^2)\,TP_k}{(1 + \beta^2)\,TP_k + FP_k + \beta^2\,FN_k}$ | You want to emphasize precision or recall, but still care about the other | You need a metric that is easy to interpret |
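A minimal scikit-learn sketch of the averaging options, again on made-up labels (the data is hypothetical, and `beta=2` is an arbitrary choice that weights recall more heavily):

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    fbeta_score,
    precision_score,
    recall_score,
)

# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1, 0, 2])

for average in ("micro", "macro", "weighted"):
    p = precision_score(y_true, y_pred, average=average)
    r = recall_score(y_true, y_pred, average=average)
    f1 = f1_score(y_true, y_pred, average=average)
    f2 = fbeta_score(y_true, y_pred, beta=2, average=average)  # beta > 1 weights recall more
    print(f"{average:>8}: P={p:.3f} R={r:.3f} F1={f1:.3f} F2={f2:.3f}")

# For single-label multi-class data, plain accuracy equals micro precision, recall, and F1.
print("accuracy:", accuracy_score(y_true, y_pred))
```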
Typical approach
The most common multi-class classification performance metric is the macro-averaged $F_1$ score.
In many situations, the multi-class MCC can actually be a more informative measure of model performance. However, it can be difficult to reason about, so it remains relatively niche.
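As a practical sketch of this typical workflow (labels below are made up), scikit-learn’s `classification_report` summarizes per-class precision, recall, and F1 together with their macro and weighted averages, and `matthews_corrcoef` gives the multi-class MCC:

```python
import numpy as np
from sklearn.metrics import classification_report, matthews_corrcoef

# Hypothetical labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1, 0, 2])

# Per-class precision/recall/F1 plus macro and weighted averages;
# the macro-averaged F1 is the usual headline number.
print(classification_report(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```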