Definition

Temperature Scaling is a probability calibration technique that adjusts the “confidence” of a model’s predicted probabilities by dividing logits by a single parameter called the temperature ($T > 0$).

It is a post-processing step: it is applied to the logits of an already-trained model and leaves the classifier’s accuracy unchanged.


How It Works

  1. Start with logits (raw outputs before softmax) from a classifier:
    • $z = (z_1, z_2, \dots, z_K)$
  2. Apply Temperature Scaling:
    • $\text{Softmax}_i(z, T) = \frac{\exp(z_i / T)}{\sum_{j=1}^K \exp(z_j / T)}$
      • If $T = 1$: standard softmax (no change).
      • If $T > 1$: probabilities become softer / less confident.
      • If $T < 1$: probabilities become sharper / more confident.
  3. Choose $T$:
    • Fit $T$ on a validation set by minimizing Negative Log-Likelihood (NLL).
    • Only one parameter ($T$) needs to be learned.
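
The three steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (`softmax_with_temperature`, `mean_nll`, `fit_temperature`), the search bounds, and the use of a golden-section search over $\log T$ are all assumptions made for this example. In practice the same objective is often minimized with a gradient-based or library optimizer.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, T):
    """Average negative log-likelihood of the true labels at temperature T."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)
        # log-sum-exp gives the log of the softmax denominator without underflow
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search over log T (assumed bounds lo..hi) for the minimum NLL."""
    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5) - 1) / 2
    f = lambda u: mean_nll(logit_rows, labels, math.exp(u))
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:
            # minimum lies in [a, d]; reuse c as the new upper probe
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # minimum lies in [c, b]; reuse d as the new lower probe
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return math.exp((a + b) / 2)
```

As a usage example, if every validation row has logits `[4.0, 0.0]` and 8 of 10 labels are class 0, `fit_temperature` returns roughly 2.89, which is the temperature at which the predicted probability for class 0 equals 0.8.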

Intuition

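  • Deep networks are often overconfident (e.g., they assign 0.99 confidence to predictions that are correct only 80% of the time).
  • Increasing the temperature ($T > 1$) flattens the distribution, which pulls overconfident probabilities toward the observed frequencies.
  • It does not change the predicted class, only the confidence: since $T > 0$, dividing every logit by $T$ preserves their ordering, so the argmax stays the same.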
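
Overconfidence in this sense can be made concrete by comparing average confidence with accuracy. The snippet below is a small sketch under assumed inputs: the helper name `confidence_and_accuracy` and the five-row batch are hypothetical and chosen only to mirror the 0.99 versus 0.8 figures mentioned above.

```python
def confidence_and_accuracy(prob_rows, labels):
    """Return (mean top-class confidence, accuracy) for a batch of predictions."""
    confidences, hits = [], []
    for probs, y in zip(prob_rows, labels):
        pred = max(range(len(probs)), key=probs.__getitem__)  # predicted class
        confidences.append(probs[pred])
        hits.append(1.0 if pred == y else 0.0)
    n = len(labels)
    return sum(confidences) / n, sum(hits) / n

# Hypothetical batch: five predictions at 0.99 confidence, four of them correct.
prob_rows = [[0.99, 0.01]] * 5
labels = [0, 0, 0, 0, 1]
mean_conf, acc = confidence_and_accuracy(prob_rows, labels)
print(round(mean_conf, 2), acc)  # prints: 0.99 0.8
```

A well-calibrated model would show these two numbers close together; a gap like the one here is what a temperature above 1 is meant to shrink.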

Example

Suppose a model outputs logits for 3 classes: $z = [2.0, 1.0, 0.1]$

  • With $T = 1$: $\text{Softmax} = [0.66, 0.24, 0.10]$
  • With $T = 2$: $\text{Softmax} = [0.50, 0.30, 0.19]$

The prediction is still class 1, but the probabilities are less extreme, which brings them closer to the true frequencies when the model was overconfident.
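
The arithmetic in this example can be reproduced with a few lines of Python. The helper name `softmax_with_temperature` is an assumption for illustration; only the logits and the two temperatures come from the example.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, 0.1]
p1 = softmax_with_temperature(z, T=1.0)
p2 = softmax_with_temperature(z, T=2.0)
print([round(p, 3) for p in p1])  # [0.659, 0.242, 0.099]
print([round(p, 3) for p in p2])  # [0.502, 0.304, 0.194]
```

In both cases the largest probability belongs to the first class, so the prediction is the same while the top confidence drops from about 0.66 to about 0.50.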


Applications

  • Deep Learning Calibration (CNNs, Transformers, etc.)
  • Safety-critical ML (medical, autonomous driving, finance) → reliable probabilities matter.
  • Uncertainty estimation in classification tasks.

Advantages

  • Simple (only one parameter).
  • Effective for modern neural networks.
  • Doesn’t hurt classification accuracy (decision boundaries unchanged).
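
The claim that accuracy is unaffected can be checked empirically. The sketch below runs under assumed settings (random Gaussian logits, five classes, and a hand-picked list of temperatures, none of which come from the text) and verifies that the index of the largest logit is the same before and after dividing by $T$.

```python
import random

def argmax(values):
    """Index of the largest element (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

random.seed(0)
# Assumed temperatures; powers of two keep the floating-point division exact.
temperatures = [0.25, 0.5, 1.0, 2.0, 8.0]

unchanged = True
for _ in range(1000):
    logits = [random.gauss(0.0, 3.0) for _ in range(5)]
    base = argmax(logits)
    for T in temperatures:
        # Softmax is monotone in its inputs, so the argmax of the scaled
        # logits is also the argmax of the resulting probabilities.
        if argmax([z / T for z in logits]) != base:
            unchanged = False

print(unchanged)  # True
```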

Limitations

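  • Applies one global rescaling to all logits, so it cannot fix more complex miscalibration (for example, miscalibration that differs from class to class).
  • Requires a held-out validation set that is representative of the data seen at prediction time.
  • Less flexible than isotonic regression.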


Comparison to Other Methods

  • Platt Scaling → logistic regression on scores, assumes sigmoid relationship.
  • Isotonic Regression → non-parametric and more flexible, but at greater risk of overfitting.
  • Temperature Scaling → parametric, simple, works especially well in deep learning.

In short:
Temperature Scaling is a calibration method that adjusts softmax probabilities by dividing logits by a learned temperature parameter $T$. It softens or sharpens the predicted probabilities so that they better match observed frequencies, without changing accuracy.