Definition
Maximum Calibration Error (MCE) is a metric used to evaluate probability calibration of a predictive model (often a classifier).
- It measures the worst-case deviation between the predicted probabilities and the actual observed frequencies.
- Unlike Expected Calibration Error (ECE), which averages errors across bins, MCE focuses on the largest single discrepancy.
Formal Setup
- Suppose your classifier outputs predicted probabilities $\hat{p}_i \in [0,1]$ for each instance.
- Partition predictions into bins $B_1, B_2, \ldots, B_M$ based on their predicted probability (e.g., [0–0.1], [0.1–0.2], …).
- For each bin $B_m$:
- Confidence (average predicted probability in bin):
- $\text{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i$
- Accuracy (fraction of correct predictions in bin):
- $\text{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i)$
- Then: $\text{MCE} = \max_{m=1,\dots,M} \; \big| \text{acc}(B_m) - \text{conf}(B_m) \big|$
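The setup above translates directly into code. A minimal NumPy sketch (function name and equal-width binning are illustrative choices, not a standard API):

```python
import numpy as np

def max_calibration_error(confidences, correct, n_bins=10):
    """MCE over equal-width confidence bins.

    confidences: (N,) predicted probabilities (or max-class confidence).
    correct:     (N,) 0/1 indicators of whether each prediction was right.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index per prediction; clip so confidence == 1.0 lands in the top bin.
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    mce = 0.0
    for m in range(n_bins):
        mask = bins == m
        if mask.any():
            conf = confidences[mask].mean()  # conf(B_m)
            acc = correct[mask].mean()       # acc(B_m)
            mce = max(mce, abs(acc - conf))
    return mce
```

For instance, ten predictions all at 0.9 confidence with six correct give a single occupied bin with gap |0.6 − 0.9| = 0.3.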
Intuition
- ECE tells you on average how far off your probabilities are from reality.
- MCE tells you the worst-case bin, the one where your predicted probability was most misleading.
- Example: If a model predicts 90% confidence but the true frequency is only 60% → calibration error = 0.30. If this is the largest deviation among bins, then MCE = 0.30.
Properties
- Range: $0 \leq \text{MCE} \leq 1$.
- Perfect calibration: MCE = 0 (confidence always equals accuracy).
- Sensitive to binning: The choice of the number of bins $M$ matters. Too few bins → underestimation of error; too many → noisy estimate.
- Interpretation: A high MCE means at least one probability bucket is very misleading, which could be dangerous in high-stakes settings (medicine, finance, risk estimation).
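The binning sensitivity is easy to see empirically. In this sketch the labels are drawn with exactly the predicted probability (a hypothetical, perfectly calibrated model), yet the estimated MCE grows as bins shrink and per-bin accuracy becomes noisy:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, perfectly calibrated predictions: P(y=1) equals the stated probability.
p = rng.uniform(0.0, 1.0, 2000)
y = (rng.uniform(0.0, 1.0, 2000) < p).astype(float)

def mce(p, y, n_bins):
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    gaps = [abs(y[idx == m].mean() - p[idx == m].mean())
            for m in range(n_bins) if (idx == m).any()]
    return max(gaps)

for n in (5, 10, 50, 200):
    # More bins -> fewer samples per bin -> noisier max gap, despite calibration.
    print(n, round(mce(p, y, n), 3))
```

The true MCE here is near zero, so any sizable value reported at 200 bins is estimation noise, not miscalibration.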
Example
Suppose you divide predictions into 5 bins:
| Bin | Confidence | Accuracy | \|Accuracy − Confidence\| |
|---|---|---|---|
| [0.0–0.2] | 0.10 | 0.12 | 0.02 |
| [0.2–0.4] | 0.30 | 0.28 | 0.02 |
| [0.4–0.6] | 0.50 | 0.52 | 0.02 |
| [0.6–0.8] | 0.70 | 0.60 | 0.10 |
| [0.8–1.0] | 0.90 | 0.75 | 0.15 |
- ECE = average of these errors, weighted by each bin's share of samples.
- MCE = max = 0.15.
So the model’s worst calibration gap is 15%.
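The table's numbers can be checked in a few lines. ECE needs each bin's sample count, which the table does not give, so the equal-weight averaging below is an assumption for illustration:

```python
# Per-bin |acc - conf| gaps, read off the table above.
gaps = [0.02, 0.02, 0.02, 0.10, 0.15]

mce = max(gaps)              # worst single bin
# ECE weights each gap by the bin's share of samples; assuming
# equal-sized bins (not stated in the table), it is a plain mean.
ece = sum(gaps) / len(gaps)

print(f"MCE = {mce:.2f}, ECE = {ece:.3f}")  # MCE = 0.15, ECE = 0.062
```

Note how MCE (0.15) is much larger than the equal-weight ECE (0.062): the three well-calibrated bins dilute the average but cannot hide the worst bin.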
Use Cases
- Risk assessment (e.g., medical diagnosis, fraud detection).
- Any domain where confidence reliability matters.
- Often reported alongside ECE and Brier score.
Summary:
MCE = the largest absolute difference between predicted confidence and actual accuracy across probability bins. It highlights the worst-case calibration failure of your model, unlike ECE, which averages across bins.
