Definition

The Softmax function is a generalization of the sigmoid to multiple classes.
It converts a vector of raw scores (logits) into a probability distribution over classes.

  • Formula (for class $i$ out of $K$ total classes):

$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$

  • Input: a vector of scores $z = (z_1, z_2, …, z_K)$.
  • Output: a probability vector $p = (p_1, p_2, …, p_K)$, where $p_i \in (0,1)$ and $\sum_i p_i = 1$.
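As a minimal sketch, the formula above translates directly into a few lines of NumPy (the `softmax` function name is ours, not from any particular library):

```python
import numpy as np

def softmax(z):
    """Map a vector of logits z to a probability distribution over K classes."""
    exp_z = np.exp(z)          # elementwise e^{z_i}
    return exp_z / exp_z.sum() # normalize so the outputs sum to 1

p = softmax(np.array([2.0, 1.0, 0.0]))
```

Note this naive version exponentiates the raw logits directly; a numerically safer variant is discussed under Limitations.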

Key Properties

  1. Range: (0, 1), like probabilities.
  2. Normalization: Outputs sum to 1.
  3. Relative scaling: Exponentials make higher scores dominate.
  4. Differentiable: Works well with gradient descent.
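The first three properties can be checked numerically with a small sketch (using a `softmax` helper defined straight from the formula; the test logits are arbitrary):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

p = softmax(np.array([3.0, 1.0, -2.0]))  # logits in decreasing order

assert np.all((p > 0) & (p < 1))   # 1. Range: each output lies in (0, 1)
assert np.isclose(p.sum(), 1.0)    # 2. Normalization: outputs sum to 1
assert np.all(np.diff(p) < 0)      # 3. Relative scaling: order of probabilities follows order of logits
```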

Why It’s Useful

  • Provides probabilistic interpretation of raw model outputs.
  • Used in multi-class classification, where each output neuron corresponds to a class.
  • The predicted class is the one with the highest softmax probability.

Example

Suppose a model outputs logits: $z = [2, 1, 0]$

Softmax:

$p_1 = \frac{e^2}{e^2+e^1+e^0}, \quad p_2 = \frac{e^1}{e^2+e^1+e^0}, \quad p_3 = \frac{e^0}{e^2+e^1+e^0}$

$= \Big[ \frac{7.39}{7.39+2.72+1}, \; \frac{2.72}{7.39+2.72+1}, \; \frac{1}{7.39+2.72+1} \Big]$

$\approx [0.67, \; 0.24, \; 0.09]$

So class 1 has the highest probability (about 67%).
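The arithmetic above can be reproduced directly, with no assumptions beyond NumPy:

```python
import numpy as np

z = np.array([2.0, 1.0, 0.0])
p = np.exp(z) / np.exp(z).sum()
# p ≈ [0.665, 0.245, 0.090] — class 1 dominates
```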


Applications

  1. Neural Networks: Output layer in multi-class classification (e.g., image recognition).
  2. Language Models: Predict next word in a sentence.
  3. Reinforcement Learning: Converts action scores into action probabilities.

Limitations

  • Numerical instability: Large values of $z_i$ can cause overflow (fixed by subtracting max logit before exponentiation).
  • Overconfidence: Can assign near-certain probability to one class even when the model is uncertain, because exponentiation amplifies small logit differences.
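The overflow fix mentioned above (the "max trick") is mathematically exact: subtracting a constant from every logit leaves the softmax output unchanged, since the factor $e^{-c}$ cancels in numerator and denominator. A sketch of the stable variant:

```python
import numpy as np

def softmax_stable(z):
    # Subtract the max logit before exponentiating; the result is
    # mathematically identical, but the largest exponent is now 0,
    # so np.exp can never overflow.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# A naive softmax would overflow here, since np.exp(1000.0) is inf:
p = softmax_stable(np.array([1000.0, 999.0, 998.0]))
```

The outputs here match softmax applied to logits `[2, 1, 0]`, since both vectors differ only by a constant shift.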

In short:
The Softmax function transforms raw model outputs into probabilities over multiple classes, making it the standard final layer in multi-class classification models.