1) General Idea
- In multi-class classification, we often want to combine per-class performance into one overall metric.
- Macro averaging means:
  - Compute the metric independently for each class (treating that class as “positive” and all other classes as “negative”).
  - Average the resulting values equally across classes.
Every class contributes equally, regardless of how many examples it has.
2) Example with Precision/Recall
Suppose you have 3 classes (A, B, C).
- Precision per class:
  - A: 0.90
  - B: 0.60
  - C: 0.30
- Macro Precision = (0.90 + 0.60 + 0.30) / 3 = 0.60
Even if class A has 10,000 samples and class C has 50 samples, they count equally.
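The arithmetic above is just an unweighted mean; a minimal sketch in Python, using the per-class values from the example (the class labels are illustrative):

```python
# Per-class precision values from the example above.
per_class_precision = {"A": 0.90, "B": 0.60, "C": 0.30}

# Macro precision: a plain arithmetic mean, so each class contributes
# exactly one number, no matter how many samples it has.
macro_precision = sum(per_class_precision.values()) / len(per_class_precision)
print(round(macro_precision, 2))  # 0.6
```

Note that class sample counts never enter the computation, which is the defining property of the macro average.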
3) Macro Recall / Macro F1
- Same procedure: compute recall or F1 for each class separately.
- Then take the arithmetic mean.
This helps evaluate balance across classes, instead of letting majority classes dominate.
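The same procedure can be sketched end-to-end from raw labels. The labels and predictions below are hypothetical; in practice scikit-learn exposes this behavior via `average="macro"` in `recall_score` and `f1_score`:

```python
def ovr_counts(y_true, y_pred, cls):
    """One-vs-rest TP/FP/FN counts for a single class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp, fp, fn

# Hypothetical labels and predictions for three classes.
y_true = ["A", "A", "A", "A", "B", "B", "C", "C", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "C", "C", "C", "A", "C"]
classes = sorted(set(y_true))

recalls, f1s = [], []
for cls in classes:
    tp, fp, fn = ovr_counts(y_true, y_pred, cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    recalls.append(recall)
    f1s.append(f1)

# Arithmetic mean across classes in both cases.
macro_recall = sum(recalls) / len(recalls)
macro_f1 = sum(f1s) / len(f1s)
```

Here class B's weak recall (0.5) pulls the macro averages down to about 0.67 even though 7 of 10 samples are classified correctly.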
4) Macro AUROC
- For AUROC in a multiclass setting:
  - Compute a one-vs-rest AUROC for each class (that class's examples as positives, all others as negatives).
  - Average the per-class AUROC values.
This is called macro-averaged AUROC.
It answers: “If I treat each class equally important, how good is my model’s ranking ability overall?”
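Each one-vs-rest AUROC can be computed from ranks (the Mann–Whitney view of AUC). A self-contained sketch with made-up predicted probabilities; `binary_auroc` is an illustrative helper, not a library function:

```python
def binary_auroc(labels, scores):
    """Rank-based AUC (Mann-Whitney) for one binary problem.

    labels: 0/1 per example; scores: predicted score for the positive class.
    Tied scores get average ranks, which gives tied pairs 0.5 credit.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, lab in zip(ranks, labels) if lab == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Made-up per-example probabilities for classes A, B, C (rows sum to 1).
y_true = ["A", "A", "B", "B", "C", "C"]
probs = [
    [0.70, 0.20, 0.10],
    [0.25, 0.60, 0.15],
    [0.30, 0.60, 0.10],
    [0.20, 0.30, 0.50],
    [0.10, 0.20, 0.70],
    [0.30, 0.30, 0.40],
]
classes = ["A", "B", "C"]
aucs = []
for col, cls in enumerate(classes):
    labels = [1 if t == cls else 0 for t in y_true]  # one-vs-rest labels
    scores = [row[col] for row in probs]
    aucs.append(binary_auroc(labels, scores))

macro_auroc = sum(aucs) / len(aucs)  # equal weight per class
```

scikit-learn's `roc_auc_score(..., multi_class="ovr", average="macro")` implements this same one-vs-rest averaging.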
5) Contrast with Other Averages
- Micro averaging:
  - Aggregate true positives, false positives, and false negatives across all classes first, then compute the metric from the pooled counts.
  - Heavily influenced by large classes.
- Weighted averaging:
  - Like macro, but each class’s metric is weighted by its sample count (support).
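The three averages diverge sharply under imbalance. A sketch with hypothetical one-vs-rest counts (class A large and easy, class C small and hard), using precision as the metric:

```python
# Hypothetical one-vs-rest counts for an imbalanced 3-class problem.
counts = {
    # class: (true positives, false positives, support)
    "A": (9000, 500, 10000),
    "B": (400, 200, 500),
    "C": (10, 30, 50),
}

# Per-class precision, computed independently (the macro ingredient).
per_class = {c: tp / (tp + fp) for c, (tp, fp, _) in counts.items()}

# Macro: plain mean of the per-class values.
macro = sum(per_class.values()) / len(per_class)

# Micro: pool the counts first, then compute the metric once.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
micro = total_tp / (total_tp + total_fp)

# Weighted: per-class values weighted by support.
total_support = sum(s for _, _, s in counts.values())
weighted = sum(per_class[c] * s for c, (_, _, s) in counts.items()) / total_support

print(round(macro, 3), round(micro, 3), round(weighted, 3))  # 0.621 0.928 0.931
```

Macro drops to about 0.62 because class C's poor precision (0.25) counts in full, while micro and weighted stay near 0.93 because class A dominates both.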
6) When to Use
- Macro averaging is best when:
  - All classes are equally important, regardless of frequency.
  - You want to detect poor performance on minority classes (a weak minority class pulls the macro average down in full).
- Micro averaging is best when:
  - You care about overall per-sample performance and accept that majority classes dominate the result.
Quick summary:
- Macro = equal weight per class
- Micro = equal weight per sample
- Weighted = per-class metrics weighted by class size (usually close to micro)
