1) What it is
- F1-score is the harmonic mean of precision and recall.
- It balances the two: high F1 only if both precision and recall are high.
Formula:
$F1 = 2 \cdot \frac{(\text{Precision} \cdot \text{Recall})}{(\text{Precision} + \text{Recall})}$
Where:
- Precision = $\frac{TP}{TP + FP}$
- Recall = $\frac{TP}{TP + FN}$
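The formulas above can be sketched as a small helper. This is a minimal illustration, not a reference implementation: the function name `precision_recall_f1` is hypothetical, and reporting an undefined ratio (zero denominator) as 0.0 is an assumption made here, not something the definitions prescribe.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        # Harmonic mean is undefined here; treat F1 as 0.0 by convention.
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts (made up for this sketch): 8 TP, 2 FP, 4 FN.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```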
2) Why harmonic mean?
- The arithmetic mean would allow one high value to hide a very low value.
- The harmonic mean is pulled toward the smaller of the two values, so it punishes imbalance more strongly.
- Example: Precision = 1.0, Recall = 0.1 → F1 ≈ 0.18 (the arithmetic mean would be 0.55).
This makes F1 a good metric when you want balance between precision and recall.
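To make the comparison concrete, the sketch below prints both means for a few precision/recall pairs. The pairs are made up for illustration, and the helper names `arithmetic_mean` and `harmonic_mean` are hypothetical.

```python
def arithmetic_mean(p: float, r: float) -> float:
    return (p + r) / 2


def harmonic_mean(p: float, r: float) -> float:
    # Assumption: return 0.0 when both inputs are 0 to avoid dividing by zero.
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Balanced, moderately imbalanced, and strongly imbalanced pairs.
for p, r in [(0.8, 0.8), (0.9, 0.5), (1.0, 0.1)]:
    print(
        f"P={p:.1f} R={r:.1f} "
        f"arithmetic={arithmetic_mean(p, r):.3f} "
        f"harmonic={harmonic_mean(p, r):.3f}"
    )
```

The two means agree when precision and recall are equal, and the gap widens as the values move apart.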
3) Interpretation
- 1.0 → perfect precision and recall.
- 0.0 → the model found no true positives, so precision or recall is zero.
- A higher F1 means the model catches more of the positives while raising fewer false alarms.
4) Example
Suppose a spam filter:
- Flagged as spam: 70 emails
- Flagged and actually spam: 50 (TP)
- Flagged but not spam: 20 (FP)
- Spam that was missed: 10 (FN)
- Precision = 50 / (50+20) = 0.714
- Recall = 50 / (50+10) = 0.833
- F1 = 2 × (0.714×0.833)/(0.714+0.833) ≈ 0.769
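The arithmetic can be checked with a few lines of Python. As a cross-check, the sketch also computes F1 directly from the counts as 2·TP / (2·TP + FP + FN). That form is not used in the text above, but it is algebraically equivalent to the harmonic-mean formula.

```python
tp, fp, fn = 50, 20, 10  # counts from the spam-filter example

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 as the harmonic mean of precision and recall.
f1_from_rates = 2 * precision * recall / (precision + recall)

# Equivalent form computed directly from the counts.
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(round(precision, 3), round(recall, 3), round(f1_from_rates, 3))
# prints: 0.714 0.833 0.769
```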
5) Variants (for multi-class / multi-label)
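- Macro F1: compute F1 per class, then take the unweighted average, so every class counts equally.
- Micro F1: sum TP, FP, and FN over all classes first, then compute one F1, so every instance counts equally.
- Weighted F1: average the per-class F1 scores, weighting each by its support (the number of true instances of that class).
Choice depends on what you care about: macro gives minority classes the same weight as majority classes, micro is dominated by the frequent classes (in single-label multi-class problems it equals accuracy), and weighted keeps per-class scores but lets frequent classes count for more.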
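The three averages can be sketched from per-class counts as follows. The counts below are hypothetical, chosen so that one class dominates, and the helper `f1_from_counts` is an illustrative name. It uses the count form 2·TP / (2·TP + FP + FN), which is algebraically the same as the harmonic-mean formula.

```python
# Hypothetical per-class counts for a 3-class problem: label -> (TP, FP, FN).
counts = {
    "a": (90, 10, 5),  # frequent class
    "b": (8, 4, 6),
    "c": (2, 1, 4),    # rare class
}


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    # Assumption: F1 is reported as 0.0 when the denominator is zero.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


per_class = {label: f1_from_counts(*c) for label, c in counts.items()}

# Macro: unweighted mean of the per-class scores.
macro_f1 = sum(per_class.values()) / len(per_class)

# Micro: pool the counts over all classes, then compute a single F1.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

# Weighted: per-class scores weighted by support (TP + FN = true instances).
support = {label: tp + fn for label, (tp, _, fn) in counts.items()}
weighted_f1 = sum(per_class[k] * support[k] for k in counts) / sum(support.values())

print(f"macro={macro_f1:.3f} micro={micro_f1:.3f} weighted={weighted_f1:.3f}")
```

With these counts the macro score comes out lowest, because the two poorly handled rare classes count as much as the frequent one.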
6) When to use F1
- Good choice when:
  - Classes are imbalanced.
  - Both false positives and false negatives are costly.
- Not always best when:
  - Only one error type matters (then prefer precision or recall directly).
  - Probabilistic calibration matters (then use log loss or Brier score).
Summary
- F1-score = harmonic mean of precision & recall.
- Rewards balance, punishes extreme imbalance.
- Useful for imbalanced classification tasks where both error types matter.
