1. Definition

ECE (Expected Calibration Error) measures the average gap between a model's predicted probabilities and the actual observed frequencies of the outcomes, aggregated over bins of predictions.

Formula (with $M$ bins):

$ECE = \sum_{m=1}^{M} \frac{|B_m|}{N} \; \Big| \text{acc}(B_m) - \text{conf}(B_m) \Big|$

where:

  • $N$ = total number of samples
  • $B_m$ = set of samples in bin $m$
  • $|B_m|$ = number of samples in bin $m$
  • $\text{acc}(B_m)$ = fraction of correct predictions in bin $m$ (for binary positive-class probabilities, the observed frequency of the positive outcome)
  • $\text{conf}(B_m)$ = average predicted probability in bin $m$
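
The formula translates almost line-for-line into NumPy. Below is a minimal sketch, not a library function: the name expected_calibration_error, the equal-width bins, and the right-inclusive bin edges are all illustrative choices.

import numpy as np

def expected_calibration_error(conf, outcome, n_bins=10):
    """Sketch of the ECE formula above (name and binning convention are illustrative).

    conf    : predicted probability/confidence per sample, in [0, 1]
    outcome : 1 if the prediction was correct (or the event occurred), else 0
    """
    conf = np.asarray(conf, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        # bin m is (edges[i], edges[i+1]]; the first bin also includes 0 exactly
        in_bin = (conf > edges[i]) & (conf <= edges[i + 1])
        if i == 0:
            in_bin |= conf == 0.0
        if in_bin.any():
            acc = outcome[in_bin].mean()    # acc(B_m)
            avg_conf = conf[in_bin].mean()  # conf(B_m)
            ece += in_bin.mean() * abs(acc - avg_conf)  # in_bin.mean() = |B_m| / N
    return ece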

2. Intuition

  • Partition predictions into bins (e.g., [0.0, 0.1), [0.1, 0.2), …).
  • For each bin, compare the average predicted confidence with the actual accuracy.
  • Weight each bin's difference by how many samples fall into it.
  • Final result = a single number summarizing how miscalibrated the model is; a worked example follows.
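
As a toy illustration with two bins (numbers invented for the example): say 4 of 10 predictions land in a bin with average confidence 0.62 and accuracy 0.75, and the other 6 land in a bin with average confidence 0.88 and accuracy 0.50. Then:

$ECE = \frac{4}{10}\,\big|0.75 - 0.62\big| + \frac{6}{10}\,\big|0.50 - 0.88\big| = 0.052 + 0.228 = 0.28$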

3. Interpretation

  • Range: 0 → 1
  • 0 = perfect calibration (predicted probability = observed frequency everywhere).
  • Higher = worse calibration.

Example:

  • If a model says “70% probability of fraud,” and among those cases 70% are fraud → perfectly calibrated (ECE = 0).
  • If only 40% are fraud, the bin error = $|0.7 - 0.4| = 0.3$, contributing to ECE.

4. Why It Matters

  • Many systems act on predicted probabilities rather than hard labels: fraud triage, risk scoring, medical decision support, expected-cost calculations.
  • A model can be accurate yet badly miscalibrated (typically overconfident), which misleads any downstream decision that trusts its probabilities.
  • ECE condenses calibration quality into a single number that is easy to track across models and versions.

5. Limitations

  • Sensitive to the binning scheme: the number of bins (and equal-width vs. equal-frequency binning) can change the estimate noticeably, as the sketch below shows.
  • With few samples per bin, the per-bin estimates are noisy, which tends to inflate the measured error.
  • A single aggregate number can hide where the miscalibration occurs (e.g., only at high confidences); a reliability curve shows the per-bin detail.
  • For multi-class models, the common formulation only checks the confidence of the predicted class, ignoring the rest of the probability vector.
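
To make the binning sensitivity concrete, the sketch below reuses the expected_calibration_error function from the Section 1 sketch on the same toy data as the example in Section 6; the bin counts tried are arbitrary.

import numpy as np
# assumes expected_calibration_error from the Section 1 sketch is in scope

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.05])

# for binary positive-class calibration: conf = y_prob, outcome = y_true
for n_bins in (2, 5, 10):
    ece = expected_calibration_error(y_prob, y_true, n_bins=n_bins)
    print(n_bins, "bins -> ECE:", round(ece, 3))

On this data the estimate moves from 0.04 (2 bins) to 0.16 (5 bins) to 0.21 (10 bins), so the bin count should be reported alongside the ECE value.
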
6. Python Example

The snippet below computes ECE for a binary classifier's positive-class probabilities with 5 equal-width bins. In each bin, "accuracy" is the observed frequency of the positive class, the same quantity calibration_curve reports as prob_true.

import numpy as np
from sklearn.calibration import calibration_curve

# true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.05])

# reliability curve (5 bins): per-bin observed frequency vs. average confidence
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)

# compute ECE by hand with the same 5 equal-width bins
n = len(y_true)
ece = 0.0
bin_boundaries = np.linspace(0, 1, 6)  # 5 bins
for i in range(len(bin_boundaries) - 1):
    lower, upper = bin_boundaries[i], bin_boundaries[i + 1]
    # bins are (lower, upper]; keep a probability of exactly 0 in the first bin
    mask = (y_prob > lower) & (y_prob <= upper)
    if i == 0:
        mask |= y_prob == lower
    if np.sum(mask) > 0:
        acc = np.mean(y_true[mask])    # observed frequency of the positive class
        conf = np.mean(y_prob[mask])   # average predicted probability
        ece += (np.sum(mask) / n) * abs(acc - conf)

print("ECE:", ece)

Output (up to floating-point rounding):

ECE: 0.16

→ On average, weighted by bin size, the model's predicted probabilities are about 16 percentage points away from the observed frequencies.


Summary

  • ECE = weighted average gap between predicted probability and observed accuracy.
  • 0 = perfectly calibrated.
  • Easy to interpret but depends on binning.
  • Often used together with reliability curves and Brier Score for a full calibration evaluation.
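
As a quick illustration of that pairing, the Brier score for the same toy data is one call to sklearn's brier_score_loss:

import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.05])

# Brier score = mean squared error between predicted probability and outcome;
# unlike ECE it needs no binning, but it mixes calibration with sharpness
print("Brier score:", brier_score_loss(y_true, y_prob))  # ≈ 0.1225 for this data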