1. Definition
ECE measures the average difference between predicted probabilities and actual observed frequencies, aggregated over bins of predictions.
Formula (with $M$ bins):
$ECE = \sum_{m=1}^{M} \frac{|B_m|}{N} \; \Big| \text{acc}(B_m) - \text{conf}(B_m) \Big|$
where:
- $N$ = total number of samples
- $B_m$ = set of samples in bin $m$
- $|B_m|$ = number of samples in bin $m$
- $\text{acc}(B_m)$ = observed accuracy in bin $m$ (for binary positive-class probabilities, the fraction of actual positives in the bin)
- $\text{conf}(B_m)$ = average predicted probability (confidence) of the samples in bin $m$
2. Intuition
- Partition predictions into bins (e.g., [0.0–0.1], [0.1–0.2], …).
- For each bin, compare predicted confidence vs. actual accuracy.
- Weight the difference by how many samples fall into that bin.
- Final result = single number summarizing how miscalibrated the model is.
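As a quick worked example (the numbers are made up for illustration): suppose 100 predictions land in two non-empty bins, 60 in a bin with $\text{conf} = 0.9$ and $\text{acc} = 0.8$, and 40 in a bin with $\text{conf} = 0.3$ and $\text{acc} = 0.35$. Then
$ECE = \frac{60}{100}\,|0.8 - 0.9| + \frac{40}{100}\,|0.35 - 0.3| = 0.6 \times 0.1 + 0.4 \times 0.05 = 0.08$
so the larger bin's miscalibration dominates the final score, which is exactly the weighting the formula encodes.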
3. Interpretation
- Range: 0 → 1
- 0 = perfect calibration (predicted probability = observed frequency everywhere).
- Higher = worse calibration.
Example:
- If a model says “70% probability of fraud,” and among those cases 70% actually are fraud → that bin is perfectly calibrated and contributes 0 to ECE.
- If only 40% are fraud, that bin’s gap is |0.7 - 0.4| = 0.3, which is added to ECE weighted by the bin’s share of samples.
4. Why It Matters
- Simple and interpretable: one number summarizing calibration quality.
- Complements reliability curves (visual check).
- Often reported alongside Brier Score and Log Loss.
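Since these metrics are usually reported together, here is a minimal sketch of computing Brier Score and Log Loss with scikit-learn, using the same toy labels and probabilities as the Python example in section 6:
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss
# same toy labels and predicted probabilities as in the example below
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.05])
print("Brier score:", brier_score_loss(y_true, y_prob))  # mean squared gap between probability and outcome
print("Log loss:", log_loss(y_true, y_prob))              # average negative log-likelihood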
5. Limitations
- Bin-dependence: Choice of number of bins $M$ affects results.
- Averaging hides extremes: ECE averages differences; some bins may have large miscalibration but small weight.
- Variants exist: Maximum Calibration Error (MCE), Adaptive ECE (adaptive binning).
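For instance, MCE reports the worst per-bin gap instead of the weighted average. A minimal sketch, assuming the same equal-width binning and positive-class formulation as the ECE example below (the function name is illustrative):
import numpy as np
def ece_and_mce(y_true, y_prob, n_bins=5):
    """Weighted-average gap (ECE) and worst per-bin gap (MCE) over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        mask = (y_prob > lower) & (y_prob <= upper)
        if mask.sum() == 0:
            continue
        acc = y_true[mask].mean()    # observed frequency of positives in the bin
        conf = y_prob[mask].mean()   # average predicted probability in the bin
        gap = abs(acc - conf)
        ece += (mask.sum() / len(y_true)) * gap
        mce = max(mce, gap)
    return ece, mce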
6. Python Example
import numpy as np
from sklearn.calibration import calibration_curve

# true labels and predicted probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9, 0.6, 0.05])

# reliability curve (5 bins): observed frequency vs. mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)

# compute ECE
n = len(y_true)
ece = 0.0
bin_boundaries = np.linspace(0, 1, 6)  # 5 equal-width bins
for i in range(len(bin_boundaries) - 1):
    lower, upper = bin_boundaries[i], bin_boundaries[i + 1]
    mask = (y_prob > lower) & (y_prob <= upper)  # note: a probability of exactly 0 falls outside the first bin
    if np.sum(mask) > 0:
        acc = np.mean(y_true[mask])    # observed frequency of positives in the bin
        conf = np.mean(y_prob[mask])   # average predicted probability in the bin
        ece += (np.sum(mask) / n) * abs(acc - conf)

print("ECE:", ece)
With these toy numbers, the output is approximately:
ECE: 0.16
→ Meaning the model’s average gap between predicted probability and observed frequency is about 16 percentage points on this toy dataset.
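As a partial remedy for the bin-dependence noted in the limitations, scikit-learn's calibration_curve also supports equal-frequency bins via strategy="quantile"; this is in the spirit of Adaptive ECE, though not a full implementation of it. A sketch, reusing the arrays and import from the example above:
# equal-frequency ("quantile") bins: each bin holds roughly the same number of samples
prob_true_q, prob_pred_q = calibration_curve(y_true, y_prob, n_bins=5, strategy="quantile")
print("per-bin observed frequencies:", prob_true_q)
print("per-bin mean confidences:", prob_pred_q)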
Summary
- ECE = weighted average gap between predicted probability and observed accuracy.
- 0 = perfectly calibrated.
- Easy to interpret but depends on binning.
- Often used together with reliability curves and Brier Score for a full calibration evaluation.
