See Precision and recall in multi-class classification for a worked example.

Micro-averaging

Start with the binary classification metric, but replace TP, TN, FP, and FN with the corresponding sums across classes. Micro-precision is therefore sum(TP) / (sum(TP) + sum(FP)). Use micro-averaging when you want to weight every example equally. With a large number of classes, micro-averaged scores can look artificially low, and because the most common classes dominate the sums, they can conceal issues with rare classes.

If you care about getting your most common cases right and rare classes don’t carry any particular importance, you want micro-averaging. A good example would be a product recommendation system.
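As a minimal sketch of the computation (plain Python; the function name micro_precision is just illustrative), pool the per-class counts before dividing:

    from collections import Counter

    def micro_precision(y_true, y_pred):
        # Pool TP and FP counts across all classes, then divide once.
        tp, fp = Counter(), Counter()
        for truth, pred in zip(y_true, y_pred):
            if pred == truth:
                tp[pred] += 1   # correct prediction: TP for that class
            else:
                fp[pred] += 1   # wrong prediction: FP for the predicted class
        total_tp, total_fp = sum(tp.values()), sum(fp.values())
        return total_tp / (total_tp + total_fp)

Note that in a single-label multi-class problem every prediction is either a TP or an FP for its predicted class, so micro-precision reduces to plain accuracy.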

Macro-averaging

Calculate the binary classification metric for each class, then take the unweighted mean across classes. Use this when you want to weight every class equally. It makes issues with rare classes easier to see (since they count as much as common ones), but harder to judge overall performance. If rare classes are of primary concern, you’ll want macro-averaging. Example: classifying medical images.
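A corresponding sketch (same assumptions as above; macro_precision is an illustrative name), averaging per-class precision with equal weight per class:

    def macro_precision(y_true, y_pred):
        # Compute precision per class, then take the unweighted mean.
        classes = set(y_true) | set(y_pred)
        per_class = []
        for c in classes:
            tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
            fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
            per_class.append(tp / (tp + fp) if tp + fp else 0.0)
        return sum(per_class) / len(per_class)

Treating a class the model never predicts as 0.0 precision is a convention, but it is one way the macro average surfaces problems with rare classes.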

Weighted averaging

Calculate the binary classification metric for each class, then average the results, weighting each class by its number of ground-truth positives (its support). This balances the importance of common and rare classes, typically yielding a value between the micro and macro averages. It’s useful when you care about both rare classes and overall performance, such as in a multi-class text classification model.
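And a sketch of the weighted variant (again with illustrative names), where each class’s precision is weighted by its share of the ground-truth labels:

    def weighted_precision(y_true, y_pred):
        # Weight each class's precision by its ground-truth support.
        classes = set(y_true) | set(y_pred)
        n = len(y_true)
        score = 0.0
        for c in classes:
            support = sum(1 for t in y_true if t == c)  # ground-truth positives
            tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
            fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
            precision_c = tp / (tp + fp) if tp + fp else 0.0
            score += (support / n) * precision_c
        return score

In practice you’d usually reach for a library: scikit-learn’s precision_score (and recall_score, f1_score) accept average='micro', 'macro', or 'weighted' and implement these three averaging schemes.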