1) General Idea

  • In multi-class classification, we often want to combine per-class performance into one overall metric.
  • Macro averaging means:
    1. Compute the metric independently for each class (treating that class as “positive” vs all others as “negative”).
    2. Average the values equally across classes.

Every class contributes equally, regardless of how many examples it has.
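The two steps above can be sketched as a small helper. This is a minimal plain-Python sketch; the names `macro_average` and `binary_recall` are illustrative, not from any particular library:

```python
def binary_recall(t, p):
    """Recall for binary 0/1 labels: TP / (TP + FN)."""
    tp = sum(1 for ti, pi in zip(t, p) if ti == 1 and pi == 1)
    positives = sum(t)
    return tp / positives if positives else 0.0

def macro_average(metric, y_true, y_pred, classes):
    """Apply a binary metric one-vs-rest per class, then average equally."""
    values = []
    for c in classes:
        # Step 1: binarize. Class c is "positive", everything else "negative".
        t = [1 if y == c else 0 for y in y_true]
        p = [1 if y == c else 0 for y in y_pred]
        values.append(metric(t, p))
    # Step 2: unweighted mean. Every class counts once, regardless of size.
    return sum(values) / len(values)

# Class A: recall 1/2; class B: recall 1/1; macro recall = 0.75
macro_recall = macro_average(binary_recall, ["A", "A", "B"], ["A", "B", "B"], ["A", "B"])
```

Any per-class binary metric (precision, recall, F1) can be plugged in as `metric`; the macro-averaging step itself never looks at class sizes.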


2) Example with Precision/Recall

Suppose you have 3 classes (A, B, C).

  • Precision per class:
    • A: 0.90
    • B: 0.60
    • C: 0.30
  • Macro Precision = (0.90 + 0.60 + 0.30) / 3 = 0.60

Even if class A has 10,000 samples and class C has 50 samples, they count equally.
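The arithmetic above is just an unweighted mean; in code, the class counts never appear:

```python
# Per-class precision values from the example (classes A, B, C).
precision = {"A": 0.90, "B": 0.60, "C": 0.30}

# Macro precision: a plain arithmetic mean. Sample counts play no role,
# so a 10,000-sample class and a 50-sample class contribute identically.
macro_precision = sum(precision.values()) / len(precision)
```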


3) Macro Recall / Macro F1

  • Same procedure: compute recall or F1 for each class separately.
  • Then take the arithmetic mean.

This helps evaluate balance across classes, instead of letting majority classes dominate.
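The same recipe for F1, written out from the raw confusion counts (a plain-Python sketch; the function name `macro_f1` and the toy labels are illustrative):

```python
def macro_f1(y_true, y_pred):
    """Macro F1: compute F1 one-vs-rest for each class, then average equally."""
    classes = sorted(set(y_true))
    f1_values = []
    for c in classes:
        # One-vs-rest confusion counts for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_values.append(f1)
    # Arithmetic mean: each class's F1 counts once.
    return sum(f1_values) / len(f1_values)

# Per-class F1: A = 0.8, B = 0.5, C = 2/3; macro F1 is their plain mean.
score = macro_f1(["A", "A", "A", "B", "B", "C"], ["A", "A", "B", "B", "C", "C"])
```

A class with few samples but poor F1 drags the macro score down just as hard as a large class would.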


4) Macro AUROC

  • For AUROC in multiclass:
    1. Treat each class as a one-vs-rest binary problem (that class vs all others).
    2. Compute AUROC for that class’s predicted scores.
    3. Take the unweighted mean of the per-class AUROC values.

This is called macro-averaged AUROC.
It answers: “If I treat each class equally important, how good is my model’s ranking ability overall?”


5) Contrast with Other Averages

  • Micro averaging:
    • Aggregate all true positives, false positives, false negatives across classes first, then compute the metric.
    • Heavily influenced by large classes.
  • Weighted averaging:
    • Like macro, but each class’s metric is weighted by its sample count.

6) When to Use

  • Macro average is best when:
    • All classes are equally important, regardless of frequency.
    • You want to check if the model is performing poorly on minority classes (macro penalizes poor minority-class performance).
  • Micro average is best when:
    • You care more about overall performance (dominated by majority classes).

Quick summary:

  • Macro = equal weight per class
  • Micro = equal weight per sample
  • Weighted = per-class like macro, but weighted by class frequency (in between)