1. Definition
ECE measures the average difference between predicted probabilities and actual observed frequencies, aggregated over bins of predictions.
Formula (with $M$ bins):
$ECE = \sum_{m=1}^{M} \frac{|B_m|}{N} \; \Big| \text{acc}(B_m) - \text{conf}(B_m) \Big|$
where:
- $N$ = total number of samples
- $B_m$ = set of samples in bin $m$
- $|B_m|$ = number of samples in bin $m$
- $\text{acc}(B_m)$ = observed accuracy in bin $m$ (for binary positive-class probabilities, the fraction of actual positives in the bin)
- $\text{conf}(B_m)$ = average predicted probability (confidence) of the samples in bin $m$
2. Intuition
- Partition predictions into bins (e.g., [0.0–0.1], [0.1–0.2], …).
- For each bin, compare predicted confidence vs. actual accuracy.
- Weight the difference by how many samples fall into that bin.
- Final result = single number summarizing how miscalibrated the model is.
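As a quick worked example (the numbers are made up for illustration): suppose 100 predictions land in two non-empty bins, 60 in a bin with $\text{conf} = 0.9$ and $\text{acc} = 0.8$, and 40 in a bin with $\text{conf} = 0.3$ and $\text{acc} = 0.35$. Then
$ECE = \frac{60}{100}\,|0.8 - 0.9| + \frac{40}{100}\,|0.35 - 0.3| = 0.6 \times 0.1 + 0.4 \times 0.05 = 0.08$
so the larger bin's miscalibration dominates the final score, which is exactly the weighting the formula encodes.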
3. Interpretation
- Range: 0 → 1
- 0 = perfect calibration (predicted probability = observed frequency everywhere).
- Higher = worse calibration.
Example:
- If a model says “70% probability of fraud,” and among those cases 70% actually are fraud → that bin is perfectly calibrated and contributes 0 to ECE.
- If only 40% are fraud, that bin’s gap is |0.7 - 0.4| = 0.3, which is added to ECE weighted by the bin’s share of samples.
4. Why It Matters
- Simple and interpretable: one number summarizing calibration quality.
- Complements reliability curves (visual check).
- Often reported alongside Brier Score and Log Loss.
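Since these metrics are usually reported together, here is a minimal sketch of computing Brier Score and Log Loss with scikit-learn, using the same toy labels and probabilities as the Python example in section 6:
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss
# same toy labels and predicted probabilities as in the example below
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.05])
print("Brier score:", brier_score_loss(y_true, y_prob))  # mean squared gap between probability and outcome
print("Log loss:", log_loss(y_true, y_prob))              # average negative log-likelihood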
5. Limitations
- Bin-dependence: Choice of number of bins $M$ affects results.
- Averaging hides extremes: ECE averages differences; some bins may have large miscalibration but small weight.
- Variants exist: Maximum Calibration Error (MCE), Adaptive ECE (adaptive binning).
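For instance, MCE reports the worst per-bin gap instead of the weighted average. A minimal sketch, assuming the same equal-width binning and positive-class formulation as the ECE example below (the function name is illustrative):
import numpy as np
def ece_and_mce(y_true, y_prob, n_bins=5):
    """Weighted-average gap (ECE) and worst per-bin gap (MCE) over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        mask = (y_prob > lower) & (y_prob <= upper)
        if mask.sum() == 0:
            continue
        acc = y_true[mask].mean()    # observed frequency of positives in the bin
        conf = y_prob[mask].mean()   # average predicted probability in the bin
        gap = abs(acc - conf)
        ece += (mask.sum() / len(y_true)) * gap
        mce = max(mce, gap)
    return ece, mce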
6. Python Example
import numpy as np
from sklearn.calibration import calibration_curve

# true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.05])

# reliability curve (5 bins): observed frequency vs. mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)

# compute ECE
n = len(y_true)
ece = 0.0
bin_boundaries = np.linspace(0, 1, 6)  # 5 equal-width bins
for i in range(len(bin_boundaries) - 1):
    lower, upper = bin_boundaries[i], bin_boundaries[i + 1]
    mask = (y_prob > lower) & (y_prob <= upper)  # note: a probability of exactly 0 falls outside the first bin
    if np.sum(mask) > 0:
        acc = np.mean(y_true[mask])    # observed frequency of positives in the bin
        conf = np.mean(y_prob[mask])   # average predicted probability in the bin
        ece += (np.sum(mask) / n) * abs(acc - conf)

print("ECE:", ece)
With these toy numbers, the output is approximately:
ECE: 0.16
→ Meaning the model’s average gap between predicted probability and observed frequency is about 16 percentage points on this toy dataset.
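As a partial remedy for the bin-dependence noted in the limitations, scikit-learn's calibration_curve also supports equal-frequency bins via strategy="quantile"; this is in the spirit of Adaptive ECE, though not a full implementation of it. A sketch, reusing the arrays and import from the example above:
# equal-frequency ("quantile") bins: each bin holds roughly the same number of samples
prob_true_q, prob_pred_q = calibration_curve(y_true, y_prob, n_bins=5, strategy="quantile")
print("per-bin observed frequencies:", prob_true_q)
print("per-bin mean confidences:", prob_pred_q)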
Summary
- ECE = weighted average gap between predicted probability and observed accuracy.
- 0 = perfectly calibrated.
- Easy to interpret but depends on binning.
- Often used together with reliability curves and Brier Score for a full calibration evaluation.
