1) What it is

  • F1-score is the harmonic mean of precision and recall.
  • It balances the two: high F1 only if both precision and recall are high.

Formula:

$F1 = 2 \cdot \frac{(\text{Precision} \cdot \text{Recall})}{(\text{Precision} + \text{Recall})}$

Where TP, FP, and FN are the counts of true positives, false positives, and false negatives:

  • Precision = $\frac{TP}{TP + FP}$
  • Recall = $\frac{TP}{TP + FN}$
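The definitions above translate directly into a few lines of code. The following is a minimal sketch; the function name `precision_recall_f1` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention (the formulas themselves are undefined in that case).

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall, F1) computed from raw counts.

    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.500 f1=0.615
```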

2) Why harmonic mean?

  • The arithmetic mean would let one high value hide a very low one.
  • The harmonic mean is pulled toward the smaller of the two values, so it punishes imbalance more strongly.
    • Example: Precision = 1.0, Recall = 0.0 → F1 = 0 (not 0.5).

This makes F1 a good metric when you want balance between precision and recall.
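To see the difference between the two means, the short sketch below evaluates both on a few hand-picked precision/recall pairs. The helper names and the sample pairs are illustrative assumptions, not part of any library.

```python
def arithmetic_mean(p: float, r: float) -> float:
    return (p + r) / 2


def harmonic_mean(p: float, r: float) -> float:
    # Treated as 0.0 when either input is 0, matching the usual F1 convention.
    if p == 0 or r == 0:
        return 0.0
    return 2 * p * r / (p + r)


# Illustrative (precision, recall) pairs, from balanced to extremely imbalanced.
for p, r in [(0.9, 0.9), (1.0, 0.5), (1.0, 0.1), (1.0, 0.0)]:
    print(
        f"P={p:.1f} R={r:.1f}  "
        f"arithmetic={arithmetic_mean(p, r):.3f}  "
        f"harmonic={harmonic_mean(p, r):.3f}"
    )
```

For the pair P = 1.0, R = 0.1 the arithmetic mean is still 0.55, while the harmonic mean drops to about 0.18.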


3) Interpretation

  • 1.0 → perfect precision and recall.
  • 0.0 → no true positives, so precision or recall is zero.
  • A higher F1 means the model is better at catching positives without raising too many false alarms.

4) Example

Suppose a spam filter:

  • Predicted spam: 70 emails
    • True spam: 50 (TP)
    • Not spam but flagged: 20 (FP)
  • Missed spam: 10 (FN)
  • Precision = 50 / (50+20) = 0.714
  • Recall = 50 / (50+10) = 0.833
  • F1 = 2 × (0.714×0.833)/(0.714+0.833) ≈ 0.769
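
As a cross-check, substituting the definitions of precision and recall into the F1 formula gives an equivalent form in terms of raw counts, which avoids rounding the intermediate values:

$F1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} = \frac{2 \cdot 50}{2 \cdot 50 + 20 + 10} = \frac{100}{130} \approx 0.769$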

5) Variants (for multi-class / multi-label)

  • Macro F1: compute F1 per class, average equally.
  • Micro F1: compute global TP, FP, FN first, then F1.
  • Weighted F1: per-class F1 averaged with weights equal to each class's support (its number of true instances).

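Choice depends on what should count equally. Macro treats every class equally, so minority classes matter as much as large ones. Micro treats every prediction equally, so frequent classes dominate (in single-label multi-class problems it equals accuracy). Weighted keeps per-class F1 scores but lets frequent classes contribute more to the average.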
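
The three averages can be compared on a small example. The sketch below uses made-up per-class counts for a hypothetical 3-class, single-label problem; the class names, the counts, and the helper `f1_from_counts` are all assumptions for illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    # Count-based form of F1; reported as 0.0 when the denominator is zero.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


# Hypothetical per-class counts as (TP, FP, FN): one large class, two small ones.
counts = {
    "a": (90, 10, 5),
    "b": (8, 4, 6),
    "c": (2, 1, 4),
}

per_class = {name: f1_from_counts(*c) for name, c in counts.items()}
# Support = number of true instances of the class = TP + FN.
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}

# Macro: plain average of per-class F1 scores.
macro_f1 = sum(per_class.values()) / len(per_class)

# Micro: pool the counts over all classes, then compute a single F1.
micro_f1 = f1_from_counts(
    sum(tp for tp, _, _ in counts.values()),
    sum(fp for _, fp, _ in counts.values()),
    sum(fn for _, _, fn in counts.values()),
)

# Weighted: per-class F1 averaged with support as the weight.
weighted_f1 = sum(per_class[n] * support[n] for n in counts) / sum(support.values())

print(f"macro={macro_f1:.3f} micro={micro_f1:.3f} weighted={weighted_f1:.3f}")
```

With these counts the macro average comes out lowest, because the two small classes score poorly and each counts as much as the large one.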


6) When to use F1

  • Good choice when:
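    • Classes are imbalanced and the positive class is the one of interest (F1 ignores true negatives, so it is not inflated by a large negative class the way accuracy is).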
    • Both false positives and false negatives are costly.
  • Not always best when:
    • Only one error type matters (then prefer precision or recall directly).
    • Probabilistic calibration matters (F1 only scores hard labels at a fixed threshold; use log loss or Brier score instead).
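
The point about imbalanced classes can be illustrated with a degenerate classifier. The sketch below builds a hypothetical dataset of 10 positives and 990 negatives and scores a model that always predicts the negative class; the data and the zero-denominator convention are assumptions for illustration.

```python
# Hypothetical imbalanced dataset: 10 positives (1) and 990 negatives (0).
y_true = [1] * 10 + [0] * 990
# A trivial classifier that always predicts the negative class.
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)

accuracy = (tp + tn) / len(y_true)
# Count-based F1; reported as 0.0 when the denominator is zero.
denom = 2 * tp + fp + fn
f1 = 2 * tp / denom if denom > 0 else 0.0

print(f"accuracy={accuracy:.2f} f1={f1:.2f}")
# accuracy=0.99 f1=0.00
```

Accuracy looks excellent because true negatives dominate, while F1 exposes that no positive was ever found.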

Summary

  • F1-score = harmonic mean of precision & recall.
  • Rewards balance, punishes extreme imbalance.
  • Useful for imbalanced classification tasks where both error types matter.