1. Definition

The Brier Score (BS) measures the mean squared error between predicted probabilities and actual outcomes.

$BS = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$

where:

  • $N$ = number of predictions
  • $\hat{p}_i$ = predicted probability for instance $i$
  • $y_i$ = true outcome (0 or 1)
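The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name `brier_score` and the sample data are illustrative assumptions, not part of any library.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one prediction is required")
    # (1/N) * sum over i of (p_i - y_i)^2
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)


# Hypothetical data: three predictions and their observed outcomes.
outcomes = [1, 0, 1]
probabilities = [0.9, 0.2, 0.6]

score = brier_score(outcomes, probabilities)
print(round(score, 4))  # 0.07
```

The three squared errors are 0.01, 0.04, and 0.16, so the mean is 0.21 / 3 = 0.07.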

2. Interpretation

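  • Range: 0 to 1 (for binary outcomes)
    • 0 = perfect predictions (every probability is exactly 0 or 1 and matches the outcome)
    • 1 = worst possible predictions (every prediction assigns probability 1 to the outcome that did not occur)
    • Reference point: always predicting 0.5 scores 0.25, whatever the outcomes are
  • Lower is better (unlike AUC, where higher is better).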

Example:

  • If a model predicts 0.8 probability for fraud, and the actual label is 1 → error = (0.8 – 1)² = 0.04.
  • If the actual label was 0 instead → error = (0.8 – 0)² = 0.64 (a much larger penalty).
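The arithmetic above, along with the two ends of the range, can be reproduced with a short sketch. The helper name `squared_error` is an assumption made for illustration.

```python
def squared_error(p, y):
    """Contribution of a single prediction to the Brier Score: (p - y)^2."""
    return (p - y) ** 2


# The fraud example: predicted probability 0.8.
print(round(squared_error(0.8, 1), 2))  # 0.04 (label was fraud)
print(round(squared_error(0.8, 0), 2))  # 0.64 (label was not fraud)

# The extremes of the range, plus a non-committal forecast.
print(squared_error(1.0, 1))  # 0.0  (certain and right)
print(squared_error(1.0, 0))  # 1.0  (certain and wrong)
print(squared_error(0.5, 1))  # 0.25 (a 0.5 forecast costs 0.25 either way)
```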

3. Why It’s Useful

  • Calibration + Accuracy: Brier Score incorporates both the correctness of the classification and the quality of the probability estimates.
  • Better than Accuracy for Probabilities:
    • Accuracy only considers the final decision (0/1).
    • Brier Score penalizes overconfident wrong predictions more heavily.
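To make the contrast with accuracy concrete, here is a small sketch with two hypothetical models that make the same 0/1 decisions at a 0.5 threshold but differ in how confident they are. All data and function names are assumptions chosen for illustration.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of correct 0/1 decisions after thresholding the probabilities."""
    decisions = [1 if p >= threshold else 0 for p in y_prob]
    return sum(d == y for d, y in zip(decisions, y_true)) / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared difference between probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)


y_true = [1, 0, 1, 0]

# Both models get the last instance wrong, but with different confidence.
cautious = [0.70, 0.30, 0.70, 0.60]
overconfident = [0.99, 0.01, 0.99, 0.99]

acc_cautious = accuracy(y_true, cautious)
acc_overconfident = accuracy(y_true, overconfident)
bs_cautious = brier_score(y_true, cautious)
bs_overconfident = brier_score(y_true, overconfident)

print(acc_cautious, acc_overconfident)  # 0.75 0.75
print(round(bs_cautious, 4))            # 0.1575
print(round(bs_overconfident, 4))       # 0.2451
```

Accuracy cannot tell the two models apart. The Brier Score is worse for the overconfident model because its single mistake, made at probability 0.99, costs 0.9801 on its own.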

4. Variants

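  • Decomposition (Murphy’s decomposition):
    The Brier Score can be broken into three parts, with $BS = \text{Reliability} - \text{Resolution} + \text{Uncertainty}$:
    • Reliability (Calibration): How close predicted probabilities are to the observed frequencies. Lower is better.
    • Resolution: How far the observed frequencies for different forecast values lie from the overall base rate. Higher is better.
    • Uncertainty: Variance of the outcomes, $\bar{y}(1 - \bar{y})$ where $\bar{y}$ is the base rate; it depends only on the data, not on the model.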
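The three terms can be computed by grouping instances that share the same forecast value. The sketch below assumes forecasts take a small number of distinct values, in which case reliability minus resolution plus uncertainty reproduces the Brier Score exactly; with continuous forecasts that are binned, the identity holds only approximately. The function name and data are hypothetical.

```python
from collections import defaultdict


def murphy_decomposition(y_true, y_prob):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(y_true)
    base_rate = sum(y_true) / n

    # Collect the outcomes observed for each distinct forecast value.
    groups = defaultdict(list)
    for p, y in zip(y_prob, y_true):
        groups[p].append(y)

    reliability = 0.0
    resolution = 0.0
    for p, ys in groups.items():
        observed_freq = sum(ys) / len(ys)
        reliability += len(ys) * (p - observed_freq) ** 2
        resolution += len(ys) * (observed_freq - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Hypothetical forecasts that take only three distinct values.
y_prob = [0.2, 0.2, 0.2, 0.2, 0.5, 0.5, 0.8, 0.8, 0.8, 0.8]
y_true = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]

rel, res, unc = murphy_decomposition(y_true, y_prob)
bs = sum((p - y) ** 2 for p, y in zip(y_prob, y_true)) / len(y_true)

print(round(rel, 4), round(res, 4), round(unc, 4))  # 0.002 0.05 0.25
print(round(bs, 4), round(rel - res + unc, 4))      # 0.202 0.202
```

Here the forecasts of 0.2 and 0.8 are each off by 0.05 from the observed frequencies (0.25 and 0.75), which yields the small reliability term, while most of the score comes from the uncertainty of a 50% base rate.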

5. Use Cases

  • Credit Fraud Detection:
    • Brier Score helps assess whether fraud probabilities (0.01, 0.3, 0.9, etc.) are realistic.
  • Medical Diagnosis:
    • Important because doctors rely on calibrated risk scores (e.g., “30% chance of disease”).
  • Weather Forecasting:
    • Classic use case: across all days with a forecast of “70% chance of rain”, it should rain on about 70% of them.

6. Example in Python

from sklearn.metrics import brier_score_loss

# true labels (0 = no fraud, 1 = fraud)
y_true = [0, 0, 1, 1]

# predicted probabilities for the positive class
y_prob = [0.1, 0.4, 0.8, 0.9]

bs = brier_score_loss(y_true, y_prob)
print(f"Brier Score: {bs:.3f}")

Output:

Brier Score: 0.055

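→ The score is low: the predicted probabilities sit close to the observed outcomes. A low Brier Score alone does not establish good calibration, because the score also reflects how well the predictions separate the classes, and four samples are far too few to assess calibration.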


Summary:

  • Brier Score = mean squared error of probability predictions.
  • Range = [0, 1], lower is better.
  • Captures both accuracy and calibration of probabilities.
  • Common in fraud detection, healthcare, weather forecasting.