Definition
The Softmax function is a generalization of the sigmoid to multiple classes.
It converts a vector of raw scores (logits) into a probability distribution over classes.
- Formula (for class $i$ out of $K$ total classes):
$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$
- Input: a vector of scores $z = (z_1, z_2, \dots, z_K)$.
- Output: a probability vector $p = (p_1, p_2, \dots, p_K)$, where $p_i \in (0,1)$ and $\sum_i p_i = 1$.
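As a quick sketch, the definition above translates directly into a few lines of NumPy (the function name `softmax` here is our own choice, not a library API):

```python
import numpy as np

def softmax(z):
    """Map a vector of logits z to a probability distribution."""
    exp_z = np.exp(z)            # elementwise e^{z_i}
    return exp_z / exp_z.sum()   # normalize so the outputs sum to 1

p = softmax(np.array([2.0, 1.0, 0.0]))
print(p)          # each entry lies in (0, 1)
print(p.sum())    # the entries sum to 1
```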
Key Properties
- Range: (0, 1), like probabilities.
- Normalization: Outputs sum to 1.
- Relative scaling: Exponentials make higher scores dominate.
- Differentiable: Works well with gradient descent.
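The differentiability claim can be made concrete: the partial derivatives of the outputs $p_i$ with respect to the logits $z_j$ have a compact closed form,

$\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$

where $\delta_{ij}$ is the Kronecker delta ($1$ if $i = j$, else $0$). This is why softmax pairs so cleanly with the cross-entropy loss: for a one-hot target $y$, the combined gradient with respect to the logits reduces to $p_i - y_i$.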
Why It’s Useful
- Provides probabilistic interpretation of raw model outputs.
- Used in multi-class classification, where each output neuron corresponds to a class.
- The predicted class is the one with the highest softmax probability (the argmax of the output vector).
Example
Suppose a model outputs logits: $z = [2, 1, 0]$
Softmax:
$p_1 = \frac{e^2}{e^2+e^1+e^0}, \quad p_2 = \frac{e^1}{e^2+e^1+e^0}, \quad p_3 = \frac{e^0}{e^2+e^1+e^0}$
$= \Big[ \frac{7.39}{7.39+2.72+1}, \; \frac{2.72}{7.39+2.72+1}, \; \frac{1}{7.39+2.72+1} \Big]$
$\approx [0.665, \; 0.245, \; 0.090]$
So class 1 has the highest probability (about 66.5%).
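The arithmetic above is easy to verify numerically; a minimal check with NumPy:

```python
import numpy as np

z = np.array([2.0, 1.0, 0.0])
exp_z = np.exp(z)              # approximately [7.389, 2.718, 1.0]
p = exp_z / exp_z.sum()

print(np.round(p, 3))          # ≈ [0.665, 0.245, 0.09]
print(int(np.argmax(p)))       # 0, i.e. class 1 is the prediction
```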
Applications
- Neural Networks: Output layer in multi-class classification (e.g., image recognition).
- Language Models: Predict next word in a sentence.
- Reinforcement Learning: Converts action scores into action probabilities.
Limitations
- Numerical instability: Large values of $z_i$ can cause overflow (fixed by subtracting max logit before exponentiation).
- Overconfidence: the exponential amplifies score differences, so softmax can assign a high probability to the top class even when the model has little real evidence for it.
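The overflow fix mentioned above works because softmax is invariant to adding a constant to every logit, so subtracting the maximum changes nothing mathematically while keeping every exponent at most $e^0 = 1$. A minimal sketch (the function name is our own):

```python
import numpy as np

def stable_softmax(z):
    """Softmax with the max-logit shift to avoid overflow in exp."""
    shifted = z - np.max(z)      # largest exponent becomes e^0 = 1
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# Naive np.exp(1002.0) overflows to inf; the shifted version is fine
# and gives the same result as softmax([0, 1, 2]).
p = stable_softmax(np.array([1000.0, 1001.0, 1002.0]))
print(p)
```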
In short:
The Softmax function transforms raw model outputs into probabilities over multiple classes, making it the standard final layer in multi-class classification models.
