It decomposes a proper scoring rule (like the Brier score) into three components:

  1. Uncertainty
    • Measures the inherent difficulty of the prediction task (variance of the true outcomes).
    • High uncertainty = hard to predict (e.g., if events happen about 50% of the time).
  2. Resolution
    • Measures how well the forecast separates different outcome situations.
    • High resolution = model gives different probabilities for different situations (not always the same probability).
  3. Reliability (Calibration)
    • Measures how close predicted probabilities are to the true observed frequencies.
    • A perfectly calibrated model has high reliability.

Formula (for the Brier Score)

The Brier Score (BS) can be decomposed as:

$BS = \text{Uncertainty} – \text{Resolution} + \text{Reliability}$

  • Uncertainty = variance of actual outcomes
  • Resolution = how much forecasts differ across subsets of data
  • Reliability = calibration error

Example in ML

Suppose you train a binary classifier:

  • Model predicts 0.7 probability of “rain” on 100 days → it should rain ~70 days (calibration).
  • If model predicts always 0.5 → resolution is low.
  • If actual rain frequency is 50%, uncertainty is high (hard task).

Murphy’s decomposition lets you understand whether poor performance comes from:

  • the task being uncertain,
  • the model not distinguishing cases (low resolution),
  • or the model being miscalibrated (low reliability).

So in short:
Murphy’s decomposition = a way to break down forecast error into calibration (reliability), sharpness (resolution), and task difficulty (uncertainty).