At first glance, it can be hard to understand how the ideas of “precision” and “recall” can be meaningful in a multi-class classification situation, because a false positive in one class is a false negative in another. So let’s work through an example. Consider a system with three classes, where the rare first class has a single example and the two larger classes have $n_2$ and $n_3$ examples respectively, with $n_3 > 1 + n_2$. Our ideal confusion matrix across the three classes (rows are true classes, columns are predictions) is

$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & n_2 & 0 \\ 0 & 0 & n_3 \end{pmatrix}$$

Clearly, precision and recall are definitionally both 1.

In the following analyses, we’ll use these conventions from the table of multi-class metrics.

  • $TP_i$ (true positives for class $i$)
  • $FP_i$ (false positives for class $i$)
  • $FN_i$ (false negatives for class $i$)
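In code, these counts fall straight out of the confusion matrix. Here is a minimal Python sketch, assuming rows are true classes and columns are predictions (the function name and the illustrative class sizes 1, 10, and 100 are my own):

```python
def per_class_counts(cm):
    """Per-class TP, FP, FN from a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    n = len(cm)
    tp = [cm[i][i] for i in range(n)]
    # FP_i: everything predicted as class i that isn't really class i.
    fp = [sum(cm[r][i] for r in range(n)) - cm[i][i] for i in range(n)]
    # FN_i: everything really in class i that wasn't predicted as class i.
    fn = [sum(cm[i]) - cm[i][i] for i in range(n)]
    return tp, fp, fn

# The ideal matrix: all mass on the diagonal, so no FPs or FNs anywhere.
ideal = [[1, 0, 0],
         [0, 10, 0],
         [0, 0, 100]]
print(per_class_counts(ideal))  # ([1, 10, 100], [0, 0, 0], [0, 0, 0])
```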

One error

What if we misclassify the rare class? Say its single example is predicted as class 3, while the second and third classes (with $n_2$ and $n_3$ examples) are untouched. Then we might get

$$\begin{pmatrix} 0 & 0 & 1 \\ 0 & n_2 & 0 \\ 0 & 0 & n_3 \end{pmatrix}$$

Micro-averaging

Let’s first look at micro-averaging. The micro-averaged precision and recall are given as

$$P_{\text{micro}} = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}, \qquad R_{\text{micro}} = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)}$$

By inspection, we see that every misclassified example is simultaneously one false positive (for the predicted class) and one false negative (for the true class), so

$$\sum_i FP_i = \sum_i FN_i = 1$$

Hence micro-precision and micro-recall are the same:

$$P_{\text{micro}} = R_{\text{micro}} = \frac{N - 1}{N},$$

where $N$ is the total number of examples.

Notice that the error didn’t make much difference. Clobbering a rare class doesn’t show up strongly in micro-averaging, because it averages over examples, of which rare classes (by definition) have very few.
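To see this numerically, here is a sketch using illustrative class sizes $n_2 = 10$ and $n_3 = 100$ (the function name is my own):

```python
def micro_pr(cm):
    """Micro-averaged precision and recall from a confusion matrix
    (rows = true class, columns = predicted class)."""
    n = len(cm)
    tp = sum(cm[i][i] for i in range(n))
    total = sum(sum(row) for row in cm)
    # Every off-diagonal entry is one FP and one FN at the same time,
    # so both denominators come out to the total number of examples.
    return tp / total, tp / total

# One error: the lone class-1 example predicted as class 3.
one_error = [[0, 0, 1],
             [0, 10, 0],
             [0, 0, 100]]
print(micro_pr(one_error))  # both are 110/111, barely budged from 1
```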

Macro-averaging

Next, let’s look at macro-averaging. The macro-averaged precision and recall are given as

$$P_{\text{macro}} = \frac{1}{3}\sum_{i=1}^{3} \frac{TP_i}{TP_i + FP_i}, \qquad R_{\text{macro}} = \frac{1}{3}\sum_{i=1}^{3} \frac{TP_i}{TP_i + FN_i}$$

We immediately see that these are not only different, but that we will run into a problem with precision. This is because $TP_1 = 0$ and $FP_1 = 0$, so the first term is $0/0$. It is customary, in this situation, to treat such a term as just $0$. Proceeding thus, we see

$$P_{\text{macro}} = \frac{1}{3}\left(0 + 1 + \frac{n_3}{n_3 + 1}\right), \qquad R_{\text{macro}} = \frac{1}{3}\left(0 + 1 + 1\right) = \frac{2}{3},$$

where $n_3$ is the size of the large third class, which picked up the false positive.

The single misclassification now makes a very big difference. That’s because macro-averaging treats each class equally, and the misclassification caused us to get a whole class wrong. Hence recall is now $2/3$: two of the three classes recalled 100% of their cases, and one recalled 0%.

We see that recall is a bit higher: it’s only concerned with false negatives, and only one class has any of those. However, this is arguably only due to the construction of the problem, so we’ll look at another example below.
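The same calculation, including the customary $0/0 \to 0$ convention, can be sketched as follows (again with illustrative sizes $n_2 = 10$ and $n_3 = 100$; the function name is my own):

```python
def macro_pr(cm):
    """Macro-averaged precision and recall with the 0/0 -> 0 convention."""
    n = len(cm)

    def safe_div(a, b):
        return a / b if b else 0.0  # treat the 0/0 term as just 0

    cols = [sum(cm[r][i] for r in range(n)) for i in range(n)]  # TP_i + FP_i
    rows = [sum(cm[i]) for i in range(n)]                       # TP_i + FN_i
    precision = sum(safe_div(cm[i][i], cols[i]) for i in range(n)) / n
    recall = sum(safe_div(cm[i][i], rows[i]) for i in range(n)) / n
    return precision, recall

one_error = [[0, 0, 1],
             [0, 10, 0],
             [0, 0, 100]]
p, r = macro_pr(one_error)
print(p, r)  # p = (0 + 1 + 100/101)/3, slightly below r = 2/3
```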

Weighted averaging

Precision and recall will be different, but the smallest class will play a smaller role in both, bringing them both closer to 1.
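As a sketch of support-weighted averaging on the one-error matrix (weighting each class by its share of the true examples is my assumption for the scheme; sizes $n_2 = 10$, $n_3 = 100$ are illustrative):

```python
def weighted_pr(cm):
    """Precision and recall averaged with each class weighted by its
    support (its number of true examples); 0/0 terms are treated as 0."""
    n = len(cm)

    def safe_div(a, b):
        return a / b if b else 0.0

    supports = [sum(cm[i]) for i in range(n)]
    total = sum(supports)
    p = r = 0.0
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[row][i] for row in range(n)) - tp
        p += supports[i] / total * safe_div(tp, tp + fp)
        r += supports[i] / total * safe_div(tp, supports[i])
    return p, r

one_error = [[0, 0, 1],
             [0, 10, 0],
             [0, 0, 100]]
print(weighted_pr(one_error))  # both close to 1; the rare class barely counts
```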

Two errors

Now let’s look at the case where we misclassify an instance of the second-rarest class as well, say again as class 3. Then we have

$$\begin{pmatrix} 0 & 0 & 1 \\ 0 & n_2 - 1 & 1 \\ 0 & 0 & n_3 \end{pmatrix}$$

where $n_2$ and $n_3$ are the sizes of the second and third classes.

Micro-averaging

Again, we see that the false positives and false negatives are matched: each of the two misclassified examples is one false positive and one false negative at the same time, so

$$P_{\text{micro}} = R_{\text{micro}} = \frac{N - 2}{N},$$

where $N$ is the total number of examples. For single-label multi-class classification, micro-averaging always gives equal precision and recall.
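A quick numerical check with illustrative sizes $n_2 = 10$ and $n_3 = 100$:

```python
# Two errors: the lone class-1 example and one class-2 example are both
# predicted as class 3. Each error is one FP and one FN simultaneously.
two_errors = [[0, 0, 1],
              [0, 9, 1],
              [0, 0, 100]]
tp = sum(two_errors[i][i] for i in range(3))   # 109 correct predictions
total = sum(sum(row) for row in two_errors)    # 111 examples in all
print(tp / total)  # micro-precision == micro-recall == 109/111
```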

Macro-averaging

Now we see that recall is a bit lower than precision:

$$P_{\text{macro}} = \frac{1}{3}\left(0 + 1 + \frac{n_3}{n_3 + 2}\right), \qquad R_{\text{macro}} = \frac{1}{3}\left(0 + \frac{n_2 - 1}{n_2} + 1\right),$$

where $n_2$ and $n_3$ are the sizes of the second and third classes. That’s because the single misclassification greatly reduces the recall of the small second class, which contributes 1/3 of the average, while it barely dents the precision of the large third class. If we were using weighted averaging, the impact would have been significantly less, as the third class is bigger than the first two combined.
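Sketching the macro averages for this two-error matrix, with the $0/0 \to 0$ convention and illustrative sizes $n_2 = 10$, $n_3 = 100$ (the function name is my own):

```python
def macro_pr(cm):
    """Macro-averaged precision and recall with the 0/0 -> 0 convention."""
    n = len(cm)

    def safe_div(a, b):
        return a / b if b else 0.0

    precision = sum(safe_div(cm[i][i], sum(cm[r][i] for r in range(n)))
                    for i in range(n)) / n
    recall = sum(safe_div(cm[i][i], sum(cm[i])) for i in range(n)) / n
    return precision, recall

two_errors = [[0, 0, 1],
              [0, 9, 1],
              [0, 0, 100]]
p, r = macro_pr(two_errors)
print(p > r)  # True: the second class's lost example drags recall down harder
```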