1) General Idea

  • In multi-class classification, we often want to combine per-class performance into one overall metric.
  • Macro averaging means:
    1. Compute the metric independently for each class (treating that class as “positive” vs all others as “negative”).
    2. Average the values equally across classes.

Every class contributes equally, regardless of how many examples it has.
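The two steps above can be sketched as a small helper. This is a minimal plain-Python sketch; the names `macro_average` and `binary_recall` are illustrative, not from any particular library:

```python
def binary_recall(t, p):
    """Recall for binary 0/1 labels: TP / (TP + FN)."""
    tp = sum(1 for ti, pi in zip(t, p) if ti == 1 and pi == 1)
    positives = sum(t)
    return tp / positives if positives else 0.0

def macro_average(metric, y_true, y_pred, classes):
    """Apply a binary metric one-vs-rest per class, then average equally."""
    values = []
    for c in classes:
        # Step 1: binarize. Class c is "positive", everything else "negative".
        t = [1 if y == c else 0 for y in y_true]
        p = [1 if y == c else 0 for y in y_pred]
        values.append(metric(t, p))
    # Step 2: unweighted mean. Every class counts once, regardless of size.
    return sum(values) / len(values)

# Class A: recall 1/2; class B: recall 1/1; macro recall = 0.75
macro_recall = macro_average(binary_recall, ["A", "A", "B"], ["A", "B", "B"], ["A", "B"])
```

Any per-class binary metric (precision, recall, F1) can be plugged in as `metric`; the macro-averaging step itself never looks at class sizes.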


2) Example with Precision/Recall

Suppose you have 3 classes (A, B, C).

  • Precision per class:
    • A: 0.90
    • B: 0.60
    • C: 0.30
  • Macro Precision = (0.90 + 0.60 + 0.30) / 3 = 0.60

Even if class A has 10,000 samples and class C has 50 samples, they count equally.
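The arithmetic above is just an unweighted mean; in code, the class counts never appear:

```python
# Per-class precision values from the example (classes A, B, C).
precision = {"A": 0.90, "B": 0.60, "C": 0.30}

# Macro precision: a plain arithmetic mean. Sample counts play no role,
# so a 10,000-sample class and a 50-sample class contribute identically.
macro_precision = sum(precision.values()) / len(precision)
```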


3) Macro Recall / Macro F1

  • Same procedure: compute recall or F1 for each class separately.
  • Then take the arithmetic mean.

This helps evaluate balance across classes, instead of letting majority classes dominate.
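The same recipe for F1, written out from the raw confusion counts (a plain-Python sketch; the function name `macro_f1` and the toy labels are illustrative):

```python
def macro_f1(y_true, y_pred):
    """Macro F1: compute F1 one-vs-rest for each class, then average equally."""
    classes = sorted(set(y_true))
    f1_values = []
    for c in classes:
        # One-vs-rest confusion counts for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_values.append(f1)
    # Arithmetic mean: each class's F1 counts once.
    return sum(f1_values) / len(f1_values)

# Per-class F1: A = 0.8, B = 0.5, C = 2/3; macro F1 is their plain mean.
score = macro_f1(["A", "A", "A", "B", "B", "C"], ["A", "A", "B", "B", "C", "C"])
```

A class with few samples but poor F1 drags the macro score down just as hard as a large class would.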


4) Macro AUROC

  • For AUROC in multiclass:
    1. Treat each class as a one-vs-rest binary problem (that class vs all others).
    2. Compute AUROC for that class’s predicted scores.
    3. Take the unweighted mean of the per-class AUROC values.

This is called macro-averaged AUROC.
It answers: “If I treat each class equally important, how good is my model’s ranking ability overall?”


5) Contrast with Other Averages

  • Micro averaging:
    • Aggregate all true positives, false positives, false negatives across classes first, then compute the metric.
    • Heavily influenced by large classes.
  • Weighted averaging:
    • Like macro, but each class’s metric is weighted by its sample count.

6) When to Use

  • Macro average is best when:
    • All classes are equally important, regardless of frequency.
    • You want to check if the model is performing poorly on minority classes (macro penalizes poor minority-class performance).
  • Micro average is best when:
    • You care more about overall performance (dominated by majority classes).

Quick summary:

  • Macro = equal weight per class
  • Micro = equal weight per sample
  • Weighted = per-class like macro, but weighted by class frequency (in between)