1) Definition
- Kullback–Leibler (KL) divergence measures how one probability distribution differs from another.
- For two distributions $P$ (true) and $Q$ (approximation):
$D_{KL}(P \parallel Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
(or integral in the continuous case).
It’s often called the “relative entropy.”
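The discrete formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `kl_divergence` is hypothetical, it assumes `p` and `q` are already normalized probability vectors over the same outcomes, and it uses the natural log (so the result is in nats).

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats for probability vectors p and q (assumed normalized)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute 0 by convention
    if np.any(q[mask] == 0):
        return float("inf")  # Q gives zero probability to an outcome P can produce
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Identical distributions give zero divergence.
print(kl_divergence([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))  # 0.0
```

Masking out the entries where $P(x) = 0$ follows the usual convention $0 \log 0 = 0$.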
2) Intuition
- KL divergence is the expected number of extra bits per sample (with $\log_2$; nats with the natural log) needed to encode samples from $P$ using a code optimized for $Q$ rather than one optimized for $P$.
- Not symmetric: in general $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$.
- Non-negative: $D_{KL}(P \parallel Q) \geq 0$, with equality if and only if $P = Q$.
Think of it as a directed measure of difference, not a true distance.
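To make the asymmetry concrete, the sketch below evaluates both directions for one pair of distributions. The values of `p` and `q` are arbitrary illustrative choices, and the helper `kl` is a hypothetical name that assumes both vectors are normalized and strictly positive.

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) in nats; assumes p and q are normalized and strictly positive."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Illustrative distributions (not from any dataset).
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

forward = kl(p, q)   # D_KL(P || Q)
reverse = kl(q, p)   # D_KL(Q || P)

print(forward, reverse)
print("symmetric?", np.isclose(forward, reverse))
```

Both values are non-negative, but they are not equal, which is why the direction of the divergence always has to be stated.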
3) Examples
Discrete distributions
- $P = (0.5, 0.5)$, $Q = (0.9, 0.1)$.
$D_{KL}(P \parallel Q) = 0.5 \log \frac{0.5}{0.9} + 0.5 \log \frac{0.5}{0.1} \approx 0.51$
Interpretation: with the natural log, using $Q$ instead of $P$ costs ~0.51 extra nats per sample (about 0.74 bits).
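The arithmetic in this example can be checked with the standard library alone. The snippet is a small sanity check that assumes the natural log for nats and converts to bits by dividing by $\ln 2$.

```python
import math

p = [0.5, 0.5]
q = [0.9, 0.1]

# Sum of P(x) * ln(P(x) / Q(x)) over both outcomes, in nats.
kl_nats = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Convert nats to bits: 1 nat = 1 / ln(2) bits.
kl_bits = kl_nats / math.log(2)

print(round(kl_nats, 4))  # 0.5108
print(round(kl_bits, 4))  # 0.737
```

The first term is negative (because $P(x) < Q(x)$ there) and the second is positive; only the total is guaranteed to be non-negative.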
4) Applications in ML
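- Variational Inference (VI): minimize $D_{KL}(q(z) \parallel p(z|x))$ to approximate the posterior $p(z|x)$, typically by maximizing the evidence lower bound (ELBO).
- VAEs (Variational Autoencoders): a KL term regularizes the latent variables by pulling the approximate posterior toward the prior.
- GANs: the original GAN objective is related to the Jensen–Shannon (JS) divergence rather than to KL itself.
- Language models: evaluate how well the model's predicted distribution matches the reference (ground-truth) distribution; minimizing cross-entropy is equivalent to minimizing KL divergence from the data distribution.
- Drift detection: compute KL divergence between feature distributions in training and in production.
- Reinforcement learning: trust region policy optimization (TRPO) uses a KL constraint to limit how far each policy update moves from the previous policy.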
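As one concrete instance of the VAE item, here is a sketch of the KL regularizer under a common setup that is assumed here, not stated above: a diagonal Gaussian approximate posterior $q(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$. In that case the divergence has the closed form $\frac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. The function and parameter names (`gaussian_kl_to_standard_normal`, `mu`, `log_var`) are hypothetical.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over latent dimensions.

    Assumes a diagonal Gaussian posterior and a standard normal prior.
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var))

# Posterior equal to the prior: no penalty.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0

# Shifting one latent mean by 1 costs 0.5 nats.
print(gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

Parameterizing by the log-variance is a frequent design choice because it keeps the variance positive without constraints; other parameterizations work equally well.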
5) Limitations
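- Not symmetric, and it does not satisfy the triangle inequality, so it can’t be used as a true distance metric.
- Infinite if $Q(x) = 0$ for any $x$ where $P(x) > 0$.
- Sensitive to support mismatch: a single outcome that $P$ can produce but $Q$ rules out dominates the result.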
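The support problem shows up quickly when KL is estimated from histograms, for example in a drift check. The sketch below uses made-up bin counts (`train_counts`, `prod_counts`) and a hypothetical helper `kl_from_counts`; additive smoothing with `eps` is one common workaround, assumed here for illustration, and the resulting value depends on the choice of `eps`.

```python
import numpy as np

def kl_from_counts(p_counts, q_counts, eps=0.0):
    """D_KL(P || Q) in nats from histogram counts, with optional additive smoothing eps."""
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p = p / p.sum()  # normalize counts to probabilities
    q = q / q.sum()
    mask = p > 0
    if np.any(q[mask] == 0):
        return float("inf")  # Q has an empty bin where P has mass
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical bin counts for one feature; the last bin never occurs in training.
train_counts = [40, 30, 30, 0]
prod_counts = [25, 25, 25, 25]

unsmoothed = kl_from_counts(prod_counts, train_counts)         # infinite
smoothed = kl_from_counts(prod_counts, train_counts, eps=1.0)  # finite

print(unsmoothed, smoothed)
```

Here production plays the role of $P$ and training the role of $Q$; that direction is an assumption of this sketch, and swapping it changes the result.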
6) Related Divergences
- Jensen–Shannon (JS) divergence: symmetric, bounded.
- Cross-entropy: $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$.
- Total variation distance, Wasserstein distance: alternative measures.
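Two of the relations above can be checked numerically: the JS divergence, defined as $\frac{1}{2} D_{KL}(P \parallel M) + \frac{1}{2} D_{KL}(Q \parallel M)$ with $M = \frac{1}{2}(P + Q)$, and the cross-entropy identity. The helper names are hypothetical, the distributions are arbitrary, and the KL helper assumes $Q(x) > 0$ wherever $P(x) > 0$.

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence in nats, via the mixture M = (P + Q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def entropy(p):
    """Shannon entropy H(P) in nats."""
    p = np.asarray(p, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in nats; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]

# JS is symmetric and never exceeds ln(2) nats, even for disjoint supports.
print(np.isclose(js(p, q), js(q, p)))
print(np.isclose(js([1.0, 0.0], [0.0, 1.0]), np.log(2)))

# Cross-entropy decomposes as H(P, Q) = H(P) + D_KL(P || Q).
print(np.isclose(cross_entropy(p, q), entropy(p) + kl(p, q)))
```

Because the mixture $M$ is positive wherever either input is, the JS divergence stays finite in cases where KL itself would be infinite.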
Summary
- KL divergence = asymmetric measure of how one distribution $Q$ diverges from another $P$.
- $D_{KL}(P \parallel Q) \geq 0$, with equality if and only if $P$ and $Q$ are identical.
- Core tool in information theory, variational inference, generative models, drift detection.
- Limitations: not symmetric, and infinite when $Q$ assigns zero probability to an outcome that $P$ can produce.
