It decomposes a proper scoring rule (like the Brier score) into three components:
- Uncertainty
- Measures the inherent difficulty of the prediction task (variance of the true outcomes).
- High uncertainty = hard to predict (e.g., if events happen about 50% of the time).
- Resolution
- Measures how well the forecast separates different outcome situations.
- High resolution = model gives different probabilities for different situations (not always the same probability).
- Reliability (Calibration)
- Measures how close predicted probabilities are to the true observed frequencies.
- A perfectly calibrated model has high reliability.
Formula (for the Brier Score)
The Brier Score (BS) can be decomposed as:
$BS = \text{Uncertainty} – \text{Resolution} + \text{Reliability}$
- Uncertainty = variance of actual outcomes
- Resolution = how much forecasts differ across subsets of data
- Reliability = calibration error
Example in ML
Suppose you train a binary classifier:
- Model predicts 0.7 probability of “rain” on 100 days → it should rain ~70 days (calibration).
- If model predicts always 0.5 → resolution is low.
- If actual rain frequency is 50%, uncertainty is high (hard task).
Murphy’s decomposition lets you understand whether poor performance comes from:
- the task being uncertain,
- the model not distinguishing cases (low resolution),
- or the model being miscalibrated (low reliability).
So in short:
Murphy’s decomposition = a way to break down forecast error into calibration (reliability), sharpness (resolution), and task difficulty (uncertainty).
