1. Definition
- Calibration = the alignment between predicted probabilities and actual observed frequencies.
- A well-calibrated model produces probabilities that can be directly interpreted as real-world likelihoods.
Example:
- If a fraud detection model outputs 70% probability, and among all such predictions about 70% are truly fraud, the model is well calibrated.
- If only 40% of those cases are actually fraud, the model is overconfident at that score.
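The example above can be checked numerically. The following is a minimal sketch, not a standard API: the helper name observed_rate and the toy labels are hypothetical and exist only to illustrate the 70% vs. 40% comparison.

```python
def observed_rate(probs, labels, lo, hi):
    """Fraction of positive labels among predictions with lo <= p < hi."""
    selected = [y for p, y in zip(probs, labels) if lo <= p < hi]
    if not selected:
        return None  # no predictions fell in this range
    return sum(selected) / len(selected)

# Hypothetical toy data: ten transactions, all scored 0.7 by the model.
probs = [0.7] * 10
well_calibrated = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # 7 of 10 are fraud
overconfident = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]    # 4 of 10 are fraud

print(observed_rate(probs, well_calibrated, 0.65, 0.75))
print(observed_rate(probs, overconfident, 0.65, 0.75))
```

In the first case the observed fraud rate matches the predicted 0.7. In the second it is 0.4, so the score overstates the risk.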
2. Why It Matters
- Decision-making: Probabilities can be used directly in risk-based decisions (e.g., pricing insurance, setting fraud thresholds).
- Trust: Stakeholders can interpret probabilities as actionable confidence levels.
- Evaluation: Accuracy alone does not reflect calibration. A model can classify accurately yet be poorly calibrated (overconfident or underconfident).
3. How to Assess Calibration
- Reliability Diagram (Calibration Plot):
- x-axis = predicted probability bins, y-axis = observed frequency.
- Perfect calibration lies along the diagonal (y = x).
- Expected Calibration Error (ECE):
- Weighted average, across bins, of the absolute difference between the mean predicted probability and the observed frequency, with each bin weighted by its share of the samples.
- Brier Score:
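- Mean squared error of predicted probabilities vs. binary (0/1) outcomes.
- Lower values = better, but the score reflects both calibration and how well the model separates the classes, so it is not a pure calibration measure.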
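The three assessment tools above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions: equal-width bins, binary 0/1 labels, and hypothetical function names (reliability_bins, expected_calibration_error, brier_score). The bin summaries returned by reliability_bins are the points one would plot in a reliability diagram.

```python
def reliability_bins(probs, labels, n_bins=10):
    """Group predictions into equal-width bins.

    Returns (mean_predicted, observed_frequency, count) per non-empty bin.
    """
    sums_p = [0.0] * n_bins
    sums_y = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums_p[idx] += p
        sums_y[idx] += y
        counts[idx] += 1
    return [(sums_p[i] / counts[i], sums_y[i] / counts[i], counts[i])
            for i in range(n_bins) if counts[i] > 0]

def expected_calibration_error(probs, labels, n_bins=10):
    """Sample-weighted mean of |mean predicted - observed frequency| over bins."""
    n = len(probs)
    return sum(count / n * abs(mean_p - freq)
               for mean_p, freq, count in reliability_bins(probs, labels, n_bins))

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

# Hypothetical toy data: predictions that are calibrated but not very sharp.
probs = [0.8] * 5 + [0.2] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

for mean_p, freq, count in reliability_bins(probs, labels):
    print(f"predicted={mean_p:.2f} observed={freq:.2f} n={count}")
print("ECE:", round(expected_calibration_error(probs, labels), 3))
print("Brier:", round(brier_score(probs, labels), 3))
```

On this toy data each bin's observed frequency matches its mean prediction, so ECE is close to 0 while the Brier score is still about 0.16. That gap illustrates why a low Brier score requires more than calibration alone.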
4. How to Improve Calibration
- Platt Scaling: Fits a logistic (sigmoid) function to the model's raw scores to map them to probabilities (commonly used for SVMs).
- Isotonic Regression: Non-parametric approach that learns a monotonic function to map scores to probabilities.
- Temperature Scaling: Divides a neural network's logits by a single learned scalar T before the softmax. T > 1 softens overconfident probabilities without changing the predicted class.
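As one illustration of these methods, the sketch below implements temperature scaling in plain Python. It rests on several assumptions: the temperature is chosen by a simple grid search that minimizes negative log-likelihood on held-out data (in practice a gradient-based optimizer is more common), the function names are hypothetical, and the validation logits are toy values. Platt scaling and isotonic regression follow the same pattern of fitting a mapping on held-out data and are not shown here.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true classes at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[y])
    return total / len(labels)

def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the candidate temperature with the lowest validation NLL (grid search)."""
    if candidates is None:
        candidates = [0.05 * k for k in range(1, 201)]  # 0.05 .. 10.0
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))

# Hypothetical validation set: the model always favors class 0 with a large
# margin (about 98% confidence) but is right only 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

T = fit_temperature(val_logits, val_labels)
print("fitted temperature:", round(T, 2))
print("confidence before:", round(softmax([4.0, 0.0])[0], 3))
print("confidence after: ", round(softmax([4.0, 0.0], T)[0], 3))
```

Because every logit is divided by the same positive T, the ranking of classes is unchanged. Only the confidence moves, here from roughly 98% toward the 75% accuracy observed on the validation set.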
5. Example Use Cases
- Credit Fraud Detection:
- Critical: calibrated probabilities directly guide risk thresholds and fraud investigation prioritization.
- Yelp Sentiment Analysis:
- Less critical: when only the predicted label is used, classification accuracy matters more than calibrated probabilities.
- Energy Forecasting:
Summary:
- Calibration ensures predicted probabilities reflect true outcome frequencies.
- Important for domains where probabilities drive decisions (fraud detection, medical diagnosis, insurance).
- Key methods: Platt Scaling, Isotonic Regression, Temperature Scaling.
