Definition

Maximum Calibration Error (MCE) is a metric used to evaluate probability calibration of a predictive model (often a classifier).

  • It measures the worst-case deviation between the predicted probabilities and the actual observed frequencies.
  • Unlike Expected Calibration Error (ECE), which averages errors across bins, MCE focuses on the largest single discrepancy.

Formal Setup

  1. Suppose your classifier outputs predicted probabilities $\hat{p}_i \in [0,1]$ for each instance.
  2. Partition predictions into bins $B_1, B_2, \ldots, B_M$ based on their predicted probability (e.g., [0–0.1], [0.1–0.2], …).
  3. For each bin $B_m$:
    • Confidence (average predicted probability in bin):
      • $\text{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i$
    • Accuracy (fraction of correct predictions in bin):
      • $\text{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i)$
  4. Then: $\text{MCE} = \max_{m=1,\dots,M} \; \big| \text{acc}(B_m) - \text{conf}(B_m) \big|$
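The steps above can be sketched as a short Python function (equal-width bins; the function name and argument layout are illustrative, not from any particular library):

```python
import numpy as np

def max_calibration_error(probs, preds, labels, n_bins=5):
    """MCE over equal-width confidence bins.

    probs  : confidence of the predicted class for each instance, in [0, 1]
    preds  : predicted labels
    labels : true labels
    """
    probs = np.asarray(probs, dtype=float)
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    # Assign each prediction to a bin; a confidence of exactly 1.0
    # is clipped into the last bin.
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    mce = 0.0
    for m in range(n_bins):
        mask = bins == m
        if not mask.any():
            continue  # empty bins contribute nothing
        conf = probs[mask].mean()                   # conf(B_m)
        acc = (preds[mask] == labels[mask]).mean()  # acc(B_m)
        mce = max(mce, abs(acc - conf))
    return mce
```

For example, four predictions at 90% confidence of which only two are correct give conf = 0.9, acc = 0.5, so MCE = 0.4.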

Intuition

  • ECE tells you on average how far off your probabilities are from reality.
  • MCE tells you about the single worst bin, where your predicted probability was most misleading.
  • Example: If a model predicts 90% confidence but the true frequency is only 60% → calibration error = 0.30. If this is the largest deviation among bins, then MCE = 0.30.

Properties

  • Range: $0 \leq \text{MCE} \leq 1$.
  • Perfect calibration: MCE = 0 (confidence always equals accuracy).
  • Sensitive to binning: The choice of the number of bins $M$ matters. Too few bins → underestimation of error; too many → a noisy estimate.
  • Interpretation: A high MCE means at least one probability bucket is very misleading, which could be dangerous in high-stakes settings (medicine, finance, risk estimation).
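The binning sensitivity can be demonstrated with a toy simulation of a perfectly calibrated model (a sketch with made-up data, not a benchmark): with many bins, each bin holds few samples, so the maximum gap is increasingly dominated by sampling noise.

```python
import numpy as np

def mce(confidences, correct, n_bins):
    # Equal-width bins over [0, 1]; skip empty bins
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    gaps = [abs(correct[bins == m].mean() - confidences[bins == m].mean())
            for m in range(n_bins) if (bins == m).any()]
    return max(gaps)

rng = np.random.default_rng(0)
n = 2000
conf = rng.uniform(0.5, 1.0, n)   # confidences of the predicted class
correct = rng.random(n) < conf    # simulate perfect calibration

for b in (5, 15, 50):
    print(b, "bins -> MCE =", round(mce(conf, correct, b), 3))
```

Even though the simulated model is calibrated by construction, the reported MCE tends to grow as the bin count rises, purely from noise.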

Example

Suppose you divide predictions into 5 bins:

| Bin | Confidence | Accuracy | \|Accuracy – Confidence\| |
| --- | --- | --- | --- |
| [0.0–0.2] | 0.10 | 0.12 | 0.02 |
| [0.2–0.4] | 0.30 | 0.28 | 0.02 |
| [0.4–0.6] | 0.50 | 0.52 | 0.02 |
| [0.6–0.8] | 0.70 | 0.60 | 0.10 |
| [0.8–1.0] | 0.90 | 0.75 | 0.15 |

  • ECE = average of these errors, weighted by the number of samples in each bin.
  • MCE = max = 0.15.

So the model’s worst calibration gap is 15%.
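Carrying the table through in code (assuming equal bin populations for the ECE weighting, which the table does not specify):

```python
# Per-bin gaps |accuracy - confidence| from the table above
gaps = [0.02, 0.02, 0.02, 0.10, 0.15]

# Illustrative only: with equal samples per bin, ECE's weighted
# average reduces to a plain mean.
ece = sum(gaps) / len(gaps)
mce = max(gaps)
print(round(ece, 3), mce)  # 0.062 0.15
```

Note how a single bad bin (0.15) dominates MCE while barely moving ECE.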


Use Cases

  • Risk assessment (e.g., medical diagnosis, fraud detection).
  • Any domain where confidence reliability matters.
  • Often reported alongside ECE and Brier score.

Summary:
MCE = the largest absolute difference between predicted confidence and actual accuracy across probability bins. It highlights the worst-case calibration failure of your model, unlike ECE, which averages across bins.