For the metrics below, assume the following conventions:

General

  • $K$ is the number of classes
  • $N$ is the total number of observations ($N = \sum_{k=1}^{K} n_k$)
  • $n_k$ is the number of actual instances in class $k$
  • $\hat{n}_k$ is the number of predicted instances in class $k$

Confusion Matrix

  • $C_{ij}$ is the number of observations known to be in class $i$ but predicted to be in class $j$ (the entries of the confusion matrix)
  • $n_i = \sum_{j=1}^{K} C_{ij}$ (row sums of the confusion matrix, actual class counts)
  • $\hat{n}_j = \sum_{i=1}^{K} C_{ij}$ (column sums of the confusion matrix, predicted class counts)
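
As a concrete illustration of this notation, here is a minimal sketch (with made-up labels) that builds a confusion matrix with scikit-learn and recovers the actual and predicted class counts as its row and column sums.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for a 3-class problem, purely for illustration.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 0, 2, 2, 2, 0, 2])

# C[i, j]: observations known to be in class i but predicted to be in class j.
C = confusion_matrix(y_true, y_pred)

n_actual = C.sum(axis=1)     # row sums: actual count of each class
n_predicted = C.sum(axis=0)  # column sums: predicted count of each class
N = C.sum()                  # total number of observations

print(C)
print(n_actual, n_predicted, N)
```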

Derived Metrics

  • $TP_k = C_{kk}$ (true positives for class $k$)
  • $FP_k = \hat{n}_k - C_{kk}$ (false positives for class $k$)
  • $FN_k = n_k - C_{kk}$ (false negatives for class $k$)
  • $TP = \sum_{k=1}^{K} TP_k$ (total true positives)
  • $FP = \sum_{k=1}^{K} FP_k$ (total false positives)
  • $FN = \sum_{k=1}^{K} FN_k$ (total false negatives)
  • $TN = \sum_{k=1}^{K} (N - TP_k - FP_k - FN_k)$ (total true negatives)
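
A short sketch of how these per-class quantities fall out of the confusion matrix; the matrix below is a made-up example, but in practice it would come from confusion_matrix as above.

```python
import numpy as np

# A made-up 3x3 confusion matrix (rows = actual class, columns = predicted class).
C = np.array([[2, 1, 0],
              [1, 1, 0],
              [1, 0, 4]])

N = C.sum()
TP_k = np.diag(C)              # true positives per class: C_kk
FP_k = C.sum(axis=0) - TP_k    # column sum minus the diagonal entry
FN_k = C.sum(axis=1) - TP_k    # row sum minus the diagonal entry
TN_k = N - TP_k - FP_k - FN_k  # everything not involving class k

TP, FP, FN, TN = TP_k.sum(), FP_k.sum(), FN_k.sum(), TN_k.sum()
print(TP_k, FP_k, FN_k, TN_k)
print(TP, FP, FN, TN)
```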

Agreement Metrics

  • $p_o = \frac{1}{N} \sum_{k=1}^{K} C_{kk}$ (observed agreement)
  • $p_e = \frac{1}{N^2} \sum_{k=1}^{K} n_k \hat{n}_k$ (expected agreement by chance)
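
A minimal sketch of how the observed and expected agreement combine into Cohen's kappa, cross-checked against scikit-learn's cohen_kappa_score; the labels are made up.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Made-up labels for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]

C = confusion_matrix(y_true, y_pred)
N = C.sum()

p_o = np.trace(C) / N                               # observed agreement
p_e = (C.sum(axis=1) * C.sum(axis=0)).sum() / N**2  # expected agreement by chance
kappa = (p_o - p_e) / (1 - p_e)

assert np.isclose(kappa, cohen_kappa_score(y_true, y_pred))
print(kappa)
```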

Functions and Parameters

  • $\mathbb{1}[\cdot]$ is the indicator function (1 if the condition is true, 0 otherwise)
  • $y_i$ is the true label for observation $i$
  • $\hat{Y}_i$ is the set of top-$k$ predicted labels for observation $i$
  • $\beta$ is the weight parameter in the $F_\beta$ score
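
The indicator notation is easiest to see in code. Below is a sketch of top-k accuracy computed directly from the definition and compared with scikit-learn's top_k_accuracy_score (available in recent versions); the predicted probabilities are made up.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Made-up class probabilities for a 4-class problem.
y_true = np.array([0, 1, 2, 3, 2])
y_prob = np.array([[0.50, 0.25, 0.15, 0.10],
                   [0.10, 0.30, 0.45, 0.15],
                   [0.20, 0.25, 0.45, 0.10],
                   [0.60, 0.25, 0.10, 0.05],
                   [0.05, 0.15, 0.30, 0.50]])

k = 2
top_k_sets = np.argsort(y_prob, axis=1)[:, -k:]  # top-k predicted labels per observation
indicator = np.array([y in row for y, row in zip(y_true, top_k_sets)])  # 1[y_i in set]
top_k_acc = indicator.mean()  # 4 of the 5 true labels fall inside the top-2 set here

assert np.isclose(top_k_acc, top_k_accuracy_score(y_true, y_prob, k=k))
print(top_k_acc)
```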

Specific to multi-class metrics

| Metric | Expression | Behavior under class imbalance | Use when | Avoid when |
| --- | --- | --- | --- | --- |
| Confusion matrix | $C_{ij}$ | Can be somewhat misleading in the presence of class imbalance | Classes are balanced and you want to understand confusion | Classes are imbalanced |
| Normalized confusion matrix | $C_{ij} / n_i$ | Highlights performance on each class, especially rare ones | You are dealing with imbalanced datasets | Classes are balanced and you prefer raw counts |
| OvO ROC curve | One ROC curve per pair of classes $(i, j)$ | As a pairwise metric, it is insensitive to imbalance | You want to scrutinize one particular confusion | You have many classes |
| OvR ROC curves | One ROC curve per class $k$ vs. all others | For a large class $k$, can conceal a tendency to predict it in place of smaller classes | You want to scrutinize confusion between a small class and all others | You are early in model development |
| Cohen’s kappa | $\kappa = \dfrac{p_o - p_e}{1 - p_e}$ | Can be distorted by class imbalance | You want to know whether your classifier does better than chance, i.e. whether its predictions are more than noise | You have reason to believe your classifier is based mostly on noise |
| Top-k accuracy | $\dfrac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i \in \hat{Y}_i]$ | Deliberately tolerates misclassification of rare classes | There are several similar classes and getting “close” is better than nothing, e.g. computer vision applications with many classes | There is no such thing as a “similar class” |
| Matthews correlation coefficient | $\dfrac{N \cdot TP - \sum_k n_k \hat{n}_k}{\sqrt{\left(N^2 - \sum_k \hat{n}_k^2\right)\left(N^2 - \sum_k n_k^2\right)}}$ | Robust to moderate class imbalance, though it can still be distorted by extremely rare classes | You want arguably the most comprehensive single metric and can tolerate that it is hard to interpret | Classes are extremely imbalanced and/or you need something that is easy to explain |
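
A brief sketch of the less common entries in this table using scikit-learn on a synthetic imbalanced problem; the dataset and classifier choice here are arbitrary placeholders, and the calls assume a reasonably recent scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced 4-class problem; any probabilistic classifier would do.
X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8,
                           weights=[0.55, 0.25, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)

# Row-normalized confusion matrix: each row shows how one actual class is spread
# across predicted classes, which highlights rare classes.
cm_norm = confusion_matrix(y_te, y_pred, normalize="true")

# One-vs-rest and one-vs-one ROC AUC summaries (require probability scores).
auc_ovr = roc_auc_score(y_te, y_prob, multi_class="ovr")
auc_ovo = roc_auc_score(y_te, y_prob, multi_class="ovo")

print(cm_norm.round(2), auc_ovr, auc_ovo, sep="\n")
```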

Generalizations of binary metrics

The following can each be used with micro, macro, and weighted averaging. (Weighted averaging replaces the uniform $1/K$ weights in the macro formulas with the class proportions $n_k / N$.)

| Metric | Micro | Macro | Use when | Avoid when |
| --- | --- | --- | --- | --- |
| Precision | $\dfrac{TP}{TP + FP}$ | $\dfrac{1}{K} \sum_k \dfrac{TP_k}{TP_k + FP_k}$ | You want to emphasize correct classifications (for macro, equally within each class) | You rely on the micro-averaged version to add information: micro-precision and micro-recall are always identical for single-label multi-class classification |
| Recall | $\dfrac{TP}{TP + FN}$ | $\dfrac{1}{K} \sum_k \dfrac{TP_k}{TP_k + FN_k}$ | You want to emphasize the model’s ability to detect all instances (for macro, equally within each class) | You rely on the micro-averaged version to add information: micro-precision and micro-recall are always identical for single-label multi-class classification |
| Accuracy | $\dfrac{TP}{N}$ | $\dfrac{1}{K} \sum_k \dfrac{TP_k + TN_k}{N}$ | Observations are balanced across classes and you want a highly interpretable metric | You have imbalanced data, or one type of error is worse than another |
| F1 score | $\dfrac{2\,TP}{2\,TP + FP + FN}$ | $\dfrac{1}{K} \sum_k \dfrac{2\,TP_k}{2\,TP_k + FP_k + FN_k}$ | A good default metric for most situations | You are focused on one particular kind of error (precision or recall) |
| F-beta score | $\dfrac{(1 + \beta^2)\,TP}{(1 + \beta^2)\,TP + \beta^2\,FN + FP}$ | $\dfrac{1}{K} \sum_k \dfrac{(1 + \beta^2)\,TP_k}{(1 + \beta^2)\,TP_k + \beta^2\,FN_k + FP_k}$ | You want to emphasize precision or recall but still care about the other | You need a metric that is easy to interpret |
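
A sketch of the three averaging modes applied to these metrics with scikit-learn; the labels are made up. Note how the micro-averaged precision and recall coincide.

```python
import numpy as np
from sklearn.metrics import (fbeta_score, precision_recall_fscore_support,
                             precision_score, recall_score)

# Made-up labels for a 3-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 2, 2, 2, 2, 0, 1]

for avg in ("micro", "macro", "weighted"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    f2 = fbeta_score(y_true, y_pred, beta=2, average=avg)  # beta > 1 emphasizes recall
    print(f"{avg:>8}: precision={p:.3f} recall={r:.3f} F2={f2:.3f}")

# For single-label multi-class data, micro-precision, micro-recall, and micro-F1
# are all the same number (and equal to accuracy).
p_micro, r_micro, f_micro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro")
assert np.isclose(p_micro, r_micro) and np.isclose(r_micro, f_micro)
```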

Typical approach

The most common multi-class classification performance metric is the macro-averaged F1 score, because it helps in two ways in the presence of class imbalance, which is common in multi-class problems. As a macro-averaged metric, it gives equal weight to large and small classes. As an F-metric, it gives equal weight to precision and recall. (Recall is often worst for rare classes.)

In many situations, the multi-class MCC can actually be a more informative measure of model performance. However, it can be difficult to reason about, so it remains relatively niche.
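
For reference, both of these are one-line calls in scikit-learn; the sketch below simply shows them side by side on made-up predictions for an imbalanced 3-class problem.

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Made-up predictions for an imbalanced 3-class problem (8 / 3 / 1 instances).
y_true = [0] * 8 + [1] * 3 + [2]
y_pred = [0] * 7 + [2] + [1, 1, 0] + [1]

macro_f1 = f1_score(y_true, y_pred, average="macro")
mcc = matthews_corrcoef(y_true, y_pred)  # multi-class generalization (R_K statistic)

print(f"macro F1 = {macro_f1:.3f}, MCC = {mcc:.3f}")
```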