1) Definition

  • Kullback–Leibler (KL) divergence measures how one probability distribution differs from another.
  • For two distributions $P$ (true) and $Q$ (approximation):

$D_{KL}(P \parallel Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$

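(For continuous distributions, the sum becomes an integral over densities.) The base of the logarithm sets the unit: base 2 gives bits, the natural log gives nats.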

It’s often called the “relative entropy.”
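
As a minimal sketch of the discrete formula above, the function below sums $P(x) \log \frac{P(x)}{Q(x)}$ over outcomes using the natural log (so the result is in nats). The name `kl_divergence` and the example vectors are illustrative assumptions, not part of any library; terms with $P(x) = 0$ are skipped under the usual convention $0 \log 0 = 0$.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) contributes nothing
        if qi == 0.0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total

# Illustrative (made-up) distributions over three outcomes.
p = [0.2, 0.5, 0.3]
q = [0.1, 0.6, 0.3]
print(kl_divergence(p, q))  # small positive number
print(kl_divergence(p, p))  # 0.0, since the distributions are identical
```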


2) Intuition

  • KL divergence is the expected number of extra bits (with $\log_2$; nats with the natural log) needed to encode samples from $P$ using a code optimized for $Q$ instead of one optimized for $P$.
  • Not symmetric: in general, $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$.
  • Non-negative: $D_{KL}(P \parallel Q) \geq 0$, with equality if and only if $P = Q$.

Think of it as a directed measure of difference, not a true distance.
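
The coding interpretation can be checked numerically. The sketch below, with hypothetical helper names and made-up distributions, computes the extra cost as cross-entropy minus entropy in bits, and evaluates both directions to show that they differ.

```python
import math

def entropy_bits(p):
    """Average code length (bits) of an ideal code built for P."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    """Average code length (bits) when samples from P use a code built for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over two outcomes.
p = [0.5, 0.5]
q = [0.25, 0.75]

# Extra bits per sample = cross-entropy minus entropy = KL divergence.
extra_pq = cross_entropy_bits(p, q) - entropy_bits(p)  # D_KL(P || Q)
extra_qp = cross_entropy_bits(q, p) - entropy_bits(q)  # D_KL(Q || P)

print(f"D_KL(P || Q) = {extra_pq:.4f} bits")
print(f"D_KL(Q || P) = {extra_qp:.4f} bits")
```

The two printed values are both positive but not equal, which is why the order of the arguments matters.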


3) Examples

Discrete distributions

  • $P = (0.5, 0.5)$, $Q = (0.9, 0.1)$.

$D_{KL}(P \parallel Q) = 0.5 \log \frac{0.5}{0.9} + 0.5 \log \frac{0.5}{0.1} \approx 0.51$

Interpretation: with the natural log, encoding samples from $P$ with a code built for $Q$ costs about 0.51 extra nats per sample (about 0.74 bits).
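
The arithmetic for this example can be reproduced in a few lines. This is only a check of the numbers above; the variable names are arbitrary, and the reverse direction is included to show the asymmetry on the same pair.

```python
import math

p = [0.5, 0.5]
q = [0.9, 0.1]

# Natural log gives nats; dividing by ln(2) converts to bits.
kl_nats = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
kl_bits = kl_nats / math.log(2)

# Swapping the roles of P and Q gives a different value.
kl_reverse_nats = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))

print(f"D_KL(P || Q) = {kl_nats:.4f} nats = {kl_bits:.4f} bits")  # 0.5108 nats = 0.7370 bits
print(f"D_KL(Q || P) = {kl_reverse_nats:.4f} nats")               # 0.3681 nats
```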


4) Applications in ML

  • Variational Inference (VI):
    • minimize $D_{KL}(q(z) \parallel p(z|x))$ to approximate posterior distributions.
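  • VAEs (Variational Autoencoders): the KL term regularizes the latent variables by pulling the approximate posterior toward the prior.
  • GANs: the original GAN objective is related to the Jensen–Shannon (JS) divergence, a symmetrized relative of KL.
  • Language models: cross-entropy loss measures how well the model's predicted distribution matches the data; minimizing it is equivalent to minimizing the KL divergence from the data distribution to the model.
  • Drift detection: KL divergence between feature distributions (training vs production).
  • Reinforcement learning: trust region policy optimization (TRPO) uses a KL constraint between the old and new policies.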
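
To make the drift-detection use case concrete, here is a rough sketch under several assumptions: a single numeric feature, fixed histogram bins, add-one smoothing so that no bin has zero probability, and $P$ = training, $Q$ = production. The helper names, bin range, and synthetic Gaussian data are all hypothetical; a real pipeline would pick bins and an alert threshold from its own data.

```python
import math
import random

def histogram(values, lo, hi, n_bins):
    """Count values into n_bins equal-width bins; out-of-range values go to the end bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), n_bins - 1)
        counts[idx] += 1
    return counts

def smoothed_probs(counts, alpha=1.0):
    """Additive smoothing keeps every bin strictly positive."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q has no zero entries (guaranteed by smoothing)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def drift_score(train, prod, lo=-4.0, hi=5.0, n_bins=20):
    """KL divergence between binned training (P) and production (Q) features."""
    p = smoothed_probs(histogram(train, lo, hi, n_bins))
    q = smoothed_probs(histogram(prod, lo, hi, n_bins))
    return kl_divergence(p, q)

# Synthetic data: one production batch matches training, one has a shifted mean.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
prod_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
prod_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

score_same = drift_score(train, prod_same)
score_shifted = drift_score(train, prod_shifted)
print(f"no drift:   {score_same:.4f}")
print(f"mean shift: {score_shifted:.4f}")
```

The score for the shifted batch should come out much larger than the score for the matching batch; the absolute values depend on the binning and smoothing choices.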

5) Limitations

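  • Not symmetric, and it does not satisfy the triangle inequality, so it is not a true distance metric.
  • Infinite if $Q(x) = 0$ for any $x$ where $P(x) > 0$.
  • Sensitive to support mismatch: a single outcome that $P$ allows and $Q$ rules out is enough to make the divergence infinite.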
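
The support problem is easy to reproduce. In the sketch below, the helper names and the `eps` value are assumptions chosen for illustration: the raw divergence is infinite, and additive smoothing, one common workaround, makes it finite at the cost of a result that depends on `eps`.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; returns inf when Q is zero where P is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def smooth(dist, eps=1e-6):
    """Add a small constant to every entry and renormalize."""
    shifted = [x + eps for x in dist]
    total = sum(shifted)
    return [x / total for x in shifted]

p = [0.5, 0.5, 0.0]
q = [1.0, 0.0, 0.0]  # Q rules out an outcome that P allows

raw = kl_divergence(p, q)
smoothed = kl_divergence(smooth(p), smooth(q))

print(raw)       # inf
print(smoothed)  # finite, but grows as eps shrinks
```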


6) Related Divergences

  • Jensen–Shannon (JS) divergence: symmetric, bounded.
  • Cross-entropy: $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$.
  • Total variation distance, Wasserstein distance: alternative measures.
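
Two of these relationships can be checked directly. The sketch below uses hypothetical helper names and made-up distributions: it builds the JS divergence from KL via the mixture $M = \frac{1}{2}(P + Q)$, and verifies the cross-entropy identity numerically. With the natural log, JS is bounded above by $\log 2$.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of P and Q to their mixture M."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

# Symmetric: both orders give the same value.
print(js_divergence(p, q), js_divergence(q, p))

# Bounded: disjoint supports reach the maximum, log 2, instead of infinity.
print(js_divergence([1.0, 0.0], [0.0, 1.0]), math.log(2))

# Cross-entropy identity: H(P, Q) = H(P) + D_KL(P || Q).
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```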

7) Summary

  • KL divergence = asymmetric measure of how one distribution $Q$ diverges from another $P$.
  • $D_{KL}(P \parallel Q) \geq 0$, with equality if and only if the distributions are identical.
  • Core tool in information theory, variational inference, generative models, drift detection.
  • Limitations: not symmetric, and infinite whenever $Q$ assigns zero probability to an outcome that $P$ allows.