At first glance, it can be hard to understand how the ideas of “precision” and “recall” can be meaningful in a multi-class classification setting, because a false positive for one class is a false negative for another. So let’s work through an example. Consider a system with three classes of very different sizes: for concreteness, say the first class has 1 example, the second has 2, and the third has 100. Our ideal confusion matrix across the three classes (rows are true classes, columns are predicted classes) is

$$C = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 100 \end{pmatrix}$$
Clearly, precision and recall are definitionally both 1.
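If you’d like to follow along in code, here’s a minimal sketch of this setup; scikit-learn is my choice here, not something the walkthrough itself depends on, and the class counts are just the illustrative ones from above.

```python
# A minimal check of the ideal case, using the illustrative class
# counts (1, 2, 100) assumed above.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0] * 1 + [1] * 2 + [2] * 100)  # supports: 1, 2, 100
y_pred = y_true.copy()                            # a perfect classifier

for avg in ("micro", "macro", "weighted"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    print(f"{avg:>8}: precision={p:.3f}  recall={r:.3f}")  # all 1.000
```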
In the following analyses, we’ll use these conventions from the table of multi-class metrics.
$\mathrm{TP}_i$ (true positives for class $i$), $\mathrm{FP}_i$ (false positives for class $i$), $\mathrm{FN}_i$ (false negatives for class $i$).
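These quantities can be read straight off the confusion matrix. A small sketch, assuming numpy and the row/column convention stated above:

```python
# Deriving TP_i, FP_i, FN_i from a confusion matrix C whose rows are
# true classes and whose columns are predicted classes.
import numpy as np

C = np.array([[1, 0, 0],
              [0, 2, 0],
              [0, 0, 100]])   # the ideal matrix from above

TP = np.diag(C)               # correct predictions per class
FP = C.sum(axis=0) - TP       # predicted as class i but actually something else
FN = C.sum(axis=1) - TP       # actually class i but predicted as something else
print(TP, FP, FN)             # [1 2 100] [0 0 0] [0 0 0]
```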
One error
What if we misclassify the rare class? Say the single example of the first class gets predicted as the third. Then we might get

$$C = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 2 & 0 \\ 0 & 0 & 100 \end{pmatrix}$$
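In code, continuing the same hypothetical setup:

```python
# Reproducing the one-error confusion matrix with scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0] * 1 + [1] * 2 + [2] * 100)
y_pred = y_true.copy()
y_pred[0] = 2   # the lone class-0 example is predicted as class 2
print(confusion_matrix(y_true, y_pred))
```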
Micro-averaging
Let’s first look at micro-averaging, which pools the counts across all classes before dividing. These are given as

$$P_{\text{micro}} = \frac{\sum_i \mathrm{TP}_i}{\sum_i \mathrm{TP}_i + \sum_i \mathrm{FP}_i}, \qquad R_{\text{micro}} = \frac{\sum_i \mathrm{TP}_i}{\sum_i \mathrm{TP}_i + \sum_i \mathrm{FN}_i}$$
By inspection, we see that the one misclassified example is simultaneously a false negative for the first class and a false positive for the third:

$$\sum_i \mathrm{TP}_i = 102, \qquad \sum_i \mathrm{FP}_i = \sum_i \mathrm{FN}_i = 1$$
Hence micro-precision and micro-recall are the same:

$$P_{\text{micro}} = R_{\text{micro}} = \frac{102}{103} \approx 0.99$$
Notice that the error didn’t make much difference. Clobbering a rare class doesn’t show up strongly in micro-averaging, because it averages over examples, of which rare classes (by definition) have very few.
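A quick numerical check of the micro-averaged scores, with the same hypothetical labels as before:

```python
# Micro-averaged precision and recall after the single error.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0] * 1 + [1] * 2 + [2] * 100)
y_pred = y_true.copy()
y_pred[0] = 2   # misclassify the rare class

p = precision_score(y_true, y_pred, average="micro")
r = recall_score(y_true, y_pred, average="micro")
print(p, r)     # both 102/103 ≈ 0.990
```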
Macro-averaging
Next, let’s look at macro-averaging, which averages the per-class scores with equal weight. These are given as

$$P_{\text{macro}} = \frac{1}{3}\sum_{i=1}^{3} \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FP}_i}, \qquad R_{\text{macro}} = \frac{1}{3}\sum_{i=1}^{3} \frac{\mathrm{TP}_i}{\mathrm{TP}_i + \mathrm{FN}_i}$$
We immediately see that these are not only different, but that we will run into a problem with precision. This is because nothing is predicted as the first class at all, so $\mathrm{TP}_1 = \mathrm{FP}_1 = 0$ and its precision is the undefined quantity $0/0$. The usual convention (scikit-learn’s zero_division=0, for instance) is to score such a class as 0, which gives

$$P_{\text{macro}} = \frac{1}{3}\left(0 + \frac{2}{2} + \frac{100}{101}\right) \approx 0.663$$
The single misclassification now makes a very big difference. That’s because macro-averaging treats each class equally, and the misclassification caused us to get a whole class wrong. Hence recall is now

$$R_{\text{macro}} = \frac{1}{3}\left(\frac{0}{1} + \frac{2}{2} + \frac{100}{100}\right) = \frac{2}{3} \approx 0.667$$
We see that recall is a bit higher: it’s only concerned with false negatives, and only one class has any of those. However, this is arguably only due to the construction of the problem, so we’ll look at another example below.
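To reproduce this with scikit-learn: it hits the same $0/0$ for the first class’s precision, scores it as 0 by default (with a warning), and zero_division=0 makes that convention explicit.

```python
# Macro-averaged scores after the single error. Class 0's precision is
# 0/0; zero_division=0 scores it as 0 and suppresses the
# UndefinedMetricWarning sklearn would otherwise emit.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0] * 1 + [1] * 2 + [2] * 100)
y_pred = y_true.copy()
y_pred[0] = 2

p = precision_score(y_true, y_pred, average="macro", zero_division=0)
r = recall_score(y_true, y_pred, average="macro", zero_division=0)
print(p, r)     # ≈ 0.663 vs ≈ 0.667: recall comes out a bit higher
```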
Weighted averaging
In weighted averaging, each class’s precision and recall are weighted by its support $n_i$ before averaging, e.g. $P_{\text{weighted}} = \frac{1}{N}\sum_i n_i P_i$ with $N = \sum_i n_i$. Precision and recall will still be different, but the smallest class will play a much smaller role in both, bringing them both back close to 1 (here roughly 0.981 and 0.990).
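Checking this numerically, under the same assumed setup:

```python
# Weighted-average scores after the single error: each class's score is
# weighted by its support, so the clobbered singleton class barely registers.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0] * 1 + [1] * 2 + [2] * 100)
y_pred = y_true.copy()
y_pred[0] = 2

p = precision_score(y_true, y_pred, average="weighted", zero_division=0)
r = recall_score(y_true, y_pred, average="weighted", zero_division=0)
print(p, r)     # ≈ 0.981 and ≈ 0.990
```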
Two errors
Now let’s look at the case where we misclassify an instance of the second-rarest class as well, again as the third class. Then we have

$$C = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 100 \end{pmatrix}$$
Micro-averaging
Again, we see that the false positives and false negatives are matched:

$$\sum_i \mathrm{FP}_i = \sum_i \mathrm{FN}_i = 2, \qquad P_{\text{micro}} = R_{\text{micro}} = \frac{101}{103} \approx 0.981$$

This is no accident: in single-label multi-class classification, every misclassified example is simultaneously a false positive for the predicted class and a false negative for the true class, so micro-averaging always gives equal precision and recall.
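The matching totals are easy to verify straight from the confusion matrix:

```python
# Total FP and total FN from the two-error confusion matrix: each error
# adds one to both totals, so they always match.
import numpy as np

C = np.array([[0, 0, 1],
              [0, 1, 1],
              [0, 0, 100]])

TP = np.diag(C)
total_fp = (C.sum(axis=0) - TP).sum()          # 2
total_fn = (C.sum(axis=1) - TP).sum()          # 2
print(total_fp, total_fn, TP.sum() / C.sum())  # 2 2 0.980... (= micro P = micro R)
```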
Macro-averaging
The per-class precisions are now $0$ (still $0/0$, taken as 0), $1/1$, and $100/102$, while the per-class recalls are $0$, $1/2$, and $1$:

$$P_{\text{macro}} \approx 0.660, \qquad R_{\text{macro}} = 0.5$$

Now we see that recall is lower than precision. That’s because the single misclassification halves the recall of the second class, and that class contributes 1/3 of the average. If we were using weighted averaging, the impact would have been significantly less, as the third class is bigger than the first two combined.
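As a final check, here are all three averaging schemes after both errors, still under the same hypothetical class counts:

```python
# All three averaging schemes after the second error (one class-1 example
# also predicted as class 2).
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0] * 1 + [1] * 2 + [2] * 100)
y_pred = y_true.copy()
y_pred[0] = 2   # first error: the rare class
y_pred[1] = 2   # second error: one of the two class-1 examples

for avg in ("micro", "macro", "weighted"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.3f}  recall={r:.3f}")
```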