F1-score

Date: August 20, 2025
Author: Ju Yeon Eum

1) What it is

F1-score is the harmonic mean of precision and recall.
It balances the two: F1 is high only when both precision and recall are high.

Formula:

$F1 = 2 \cdot \frac{(\text{Precision} \cdot \text{Recall})}{(\text{Precision} + \text{Recall})}$

Where (TP = true positives, FP = false positives, FN = false negatives):

Precision = $\frac{TP}{TP + FP}$
Recall = $\frac{TP}{TP + FN}$
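Substituting the definitions of precision and recall into the formula gives an equivalent expression in raw counts (assuming TP > 0, so both reciprocals exist). This is a worked expansion added for illustration:

$$
\begin{aligned}
F1 &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2}{\frac{1}{\text{Precision}} + \frac{1}{\text{Recall}}} \\
   &= \frac{2}{\frac{TP + FP}{TP} + \frac{TP + FN}{TP}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\end{aligned}
$$

True negatives (TN) do not appear anywhere in this expression, so F1 ignores correctly rejected negatives entirely.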

2) Why harmonic mean?

An arithmetic mean lets one high value hide a very low one.
The harmonic mean penalizes imbalance much more strongly, because it is pulled toward the smaller of the two values.
- Example: Precision = 1.0, Recall = 0.0 → F1 = 0, whereas the arithmetic mean would be 0.5.

This makes F1 a good metric when you want balance between precision and recall.
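To see the difference numerically, the sketch below compares the two means for a few precision/recall pairs. The function names are illustrative, and returning 0.0 when both inputs are zero is an assumed convention to avoid dividing by zero.

```python
def arithmetic_mean(p: float, r: float) -> float:
    """Plain average of precision and recall."""
    return (p + r) / 2


def harmonic_mean(p: float, r: float) -> float:
    """Harmonic mean of precision and recall, i.e. F1.

    Returns 0.0 when both inputs are zero (assumed convention).
    """
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Compare the two means as the gap between precision and recall grows.
pairs = [(0.8, 0.8), (0.9, 0.5), (1.0, 0.1), (1.0, 0.0)]
for p, r in pairs:
    print(
        f"P={p:.1f} R={r:.1f}  "
        f"arithmetic={arithmetic_mean(p, r):.3f}  harmonic={harmonic_mean(p, r):.3f}"
    )
```

As the gap widens, the harmonic mean falls much faster: for Precision = 1.0 and Recall = 0.1 the arithmetic mean is still 0.55, while the harmonic mean (F1) is about 0.182.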

3) Interpretation

1.0 → perfect precision and recall.
0.0 → precision or recall is zero (in count terms, there are no true positives).
A higher F1 means the model catches more of the positives without raising too many false alarms.

4) Example

Suppose a spam filter:

Predicted spam: 70 emails
- True spam: 50 (TP)
- Not spam but flagged: 20 (FP)
Missed spam: 10 (FN)
Precision = 50 / (50+20) = 0.714
Recall = 50 / (50+10) = 0.833
F1 = 2 × (0.714×0.833)/(0.714+0.833) ≈ 0.769
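The calculation can be checked with a few lines of Python. This is a minimal sketch: `precision_recall_f1` is a hypothetical helper, and it computes F1 with the count form 2·TP / (2·TP + FP + FN), which is algebraically equal to the harmonic-mean formula and avoids rounding the intermediate precision and recall values. Returning 0.0 for zero denominators is an assumed convention; libraries handle that edge case in different ways.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from raw counts.

    Zero denominators are mapped to 0.0 (assumed convention).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Count form of F1; equal to 2PR / (P + R) whenever TP > 0.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Spam-filter counts from the example above.
p, r, f1 = precision_recall_f1(tp=50, fp=20, fn=10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Run as-is, it prints `precision=0.714 recall=0.833 f1=0.769`, matching the hand calculation (F1 is exactly 100/130).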

5) Variants (for multi-class / multi-label)

Macro F1: compute F1 per class, average equally.
Micro F1: compute global TP, FP, FN first, then F1.
Weighted F1: per-class F1 weighted by support (the number of true instances of each class).

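The choice depends on what the score should reflect: macro treats every class equally, so weak performance on minority classes shows up clearly; micro treats every instance equally, so frequent classes dominate; weighted still averages per-class scores but lets frequent classes count for more.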
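The three schemes differ only in where the averaging happens. Below is a minimal pure-Python sketch for single-label multi-class data; the helper names (`per_class_counts`, `f1_from_counts`, `f1_variants`) and the toy labels are hypothetical, and treating an empty denominator as 0.0 is an assumed convention. scikit-learn's `sklearn.metrics.f1_score` offers the same three options through its `average` parameter (`"macro"`, `"micro"`, `"weighted"`).

```python
from collections import Counter


def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)}, treating each label one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 as 2TP / (2TP + FP + FN); 0.0 if the denominator is zero (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_variants(y_true, y_pred):
    """Return macro, micro and weighted F1 for single-label multi-class data."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = per_class_counts(y_true, y_pred, labels)
    per_class = {label: f1_from_counts(*counts[label]) for label in labels}

    # Macro: every class counts equally, regardless of its size.
    macro = sum(per_class.values()) / len(labels)

    # Micro: pool TP, FP and FN over all classes, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1_from_counts(tp, fp, fn)

    # Weighted: per-class F1 weighted by support (true instances of each class).
    support = Counter(y_true)
    weighted = sum(per_class[label] * support[label] for label in labels) / len(y_true)

    return {"macro": macro, "micro": micro, "weighted": weighted}


# Hypothetical toy data: class "a" is the majority, "c" the minority.
y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]
scores = f1_variants(y_true, y_pred)
print({name: round(value, 3) for name, value in scores.items()})
```

On this toy data macro F1 (about 0.675) is the lowest because the weaker minority classes count as much as class "a"; weighted F1 (about 0.728) leans toward the well-predicted majority class. Micro F1 (about 0.714) equals plain accuracy here (5 of 7 correct), which holds for any single-label multi-class problem, since every misclassification is exactly one FP for one class and one FN for another.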

6) When to use F1

Good choice when:
- Classes are imbalanced.
- Both false positives and false negatives are costly.
Not always best when:
- Only one error type matters (then prefer precision or recall directly).
- Probabilistic calibration matters (then use log loss or Brier score).
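The imbalance point is easiest to see with hypothetical data: 990 negatives, 10 positives, and two made-up models. Because F1 is computed only from TP, FP and FN, the large pool of true negatives cannot inflate it the way it inflates accuracy. `accuracy_and_f1` is an illustrative helper, not a library function.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 for the positive class) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # 0.0 on an empty denominator (assumed convention)
    return accuracy, f1


# Hypothetical imbalanced data: 990 negatives (0) followed by 10 positives (1).
y_true = [0] * 990 + [1] * 10

# Model A predicts "negative" for everything.
always_negative = [0] * 1000

# Model B raises 4 false alarms, finds 8 of the 10 positives and misses 2.
better = [1] * 4 + [0] * 986 + [1] * 8 + [0] * 2

for name, y_pred in [("always_negative", always_negative), ("better", better)]:
    acc, f1 = accuracy_and_f1(y_true, y_pred)
    print(f"{name}: accuracy={acc:.3f} f1={f1:.3f}")
```

The do-nothing model reaches 0.990 accuracy but an F1 of 0.000; the second model only nudges accuracy to 0.994, yet its F1 of about 0.727 makes the difference obvious.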

Summary

F1-score = harmonic mean of precision & recall.
Rewards balance, punishes extreme imbalance.
Useful for imbalanced classification tasks where both error types matter.
