Kullback–Leibler (KL) Divergence

1) Definition

Kullback–Leibler (KL) divergence measures how one probability distribution differs from another.
For two distributions $P$ (true) and $Q$ (approximation):

$D_{KL}(P \parallel Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$

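For continuous distributions with densities $p$ and $q$, the sum becomes an integral: $D_{KL}(P \parallel Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$.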

It’s often called the “relative entropy.”
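
To make the discrete formula concrete, here is a minimal sketch in plain Python. The function name `kl_divergence` and the list-based representation are illustrative choices, not part of any library; the code assumes both inputs are valid probability vectors over the same outcomes and uses the natural log, so the result is in nats.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # convention: a term with P(x) = 0 contributes 0
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total

# Identical distributions have zero divergence.
print(kl_divergence([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))  # 0.0
```

Handling the zero cases explicitly avoids a division by zero and matches the usual conventions for the sum.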

2) Intuition

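KL divergence tells you the expected number of extra bits per sample needed if you encode samples from $P$ using a code optimized for $Q$ instead of one optimized for the true distribution $P$ (bits when the log is base 2, nats when it is the natural log).
Not symmetric: in general, $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$.
Never negative: $D_{KL}(P \parallel Q) \geq 0$, with equality if and only if $P = Q$.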

Think of it as a directed measure of difference, not a true distance.
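
The asymmetry is easy to see numerically. The sketch below is a hypothetical helper (the name `kl_divergence` and the `base` parameter are my own choices) that assumes $Q(x) > 0$ wherever $P(x) > 0$. It evaluates both directions for one arbitrary pair of two-outcome distributions, measured in bits.

```python
import math

def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) for discrete distributions; base=2 gives bits, base=e gives nats.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(
        p_x * math.log(p_x / q_x, base)
        for p_x, q_x in zip(p, q)
        if p_x > 0  # terms with P(x) = 0 contribute nothing
    )

p = [0.8, 0.2]  # a skewed distribution
q = [0.5, 0.5]  # a uniform distribution

forward = kl_divergence(p, q, base=2)   # about 0.278 bits
backward = kl_divergence(q, p, base=2)  # about 0.322 bits
print(forward, backward)
```

Swapping the arguments changes which distribution does the weighting inside the sum, which is why the two numbers differ.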

3) Examples

Discrete distributions

$P = (0.5, 0.5)$, $Q = (0.9, 0.1)$.

$D_{KL}(P \parallel Q) = 0.5 \ln \frac{0.5}{0.9} + 0.5 \ln \frac{0.5}{0.1} \approx -0.29 + 0.80 \approx 0.51$

Interpretation: The log here is the natural log, so using $Q$ instead of $P$ costs about 0.51 extra nats (about 0.74 bits) per sample.
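
The arithmetic can be checked term by term with a few lines of Python. This is only a verification sketch for the two distributions in this example; the variable names are illustrative.

```python
import math

p = [0.5, 0.5]
q = [0.9, 0.1]

# One term of the sum per outcome, using the natural log (nats).
terms = [p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q)]
kl_nats = sum(terms)
kl_bits = kl_nats / math.log(2)  # convert nats to bits

print([round(t, 4) for t in terms])  # [-0.2939, 0.8047]
print(round(kl_nats, 4))             # 0.5108
print(round(kl_bits, 4))             # 0.737
```

Note that an individual term can be negative (here the first one, where $Q$ overestimates the outcome), even though the total never is.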

4) Applications in ML

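Variational Inference (VI): minimize $D_{KL}(q(z) \parallel p(z|x))$ over a tractable family of distributions $q$ to approximate the posterior.
VAEs (Variational Autoencoders): a KL term between the approximate posterior and the prior regularizes the latent variables.
GANs: the original GAN objective is related to the Jensen–Shannon (JS) divergence, a symmetrized relative of KL.
Language models: evaluate how well the model distribution matches the ground-truth distribution. Minimizing cross-entropy is equivalent to minimizing KL divergence, because the two differ only by the entropy of the data.
Drift detection: compute KL divergence between feature distributions (training vs production).
Reinforcement learning: trust region policy optimization (TRPO) constrains the KL divergence between the old and updated policies.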
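
As an illustration of the VAE case, the KL term has a closed form when the approximate posterior is a diagonal Gaussian and the prior is a standard normal. That setup is the common one but is an assumption here, not something stated above; the function name and the `log_var` parameterization are also illustrative. Per dimension the term is $\frac{1}{2}(\mu^2 + \sigma^2 - 1 - \ln \sigma^2)$.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over dimensions.

    mu and log_var are equal-length sequences (per-dimension mean and log-variance).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

# A posterior equal to the prior incurs no penalty.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0

# Shifting one mean by 1 (unit variance) costs 0.5 nats.
print(gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

In a real VAE this quantity would be computed on tensors inside the training loss; the plain-Python version only shows the formula.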

5) Limitations

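Not symmetric, and it does not satisfy the triangle inequality → can’t be used as a true distance metric.
Can be infinite if $Q(x) = 0$ for any $x$ where $P(x) > 0$.
Sensitive to support mismatch: when estimating from samples, a single empty histogram bin in $Q$ is enough to make the estimate infinite.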
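
The support problem shows up quickly in practice, for example when comparing histograms. The sketch below demonstrates the infinite case and one common workaround, additive smoothing. The helper names and the value of `eps` are assumptions for illustration, and the smoothed result depends on that choice.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; returns inf if Q is zero where P is positive."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # a term with P(x) = 0 contributes 0
        if q_x == 0.0:
            return math.inf
        total += p_x * math.log(p_x / q_x)
    return total

def smooth(dist, eps=1e-6):
    """Add eps to every entry and renormalize (simple additive smoothing)."""
    shifted = [x + eps for x in dist]
    total = sum(shifted)
    return [x / total for x in shifted]

p = [0.5, 0.5, 0.0]  # e.g. a production histogram
q = [1.0, 0.0, 0.0]  # e.g. a training histogram with an empty second bin

print(kl_divergence(p, q))                  # inf
print(kl_divergence(smooth(p), smooth(q)))  # finite, but large and eps-dependent
```

Smoothing keeps the number finite, but the value is then partly an artifact of `eps`, so thresholds built on it should be treated with care.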

6) Related Divergences

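Jensen–Shannon (JS) divergence: a symmetrized version of KL that compares $P$ and $Q$ to their average $M = \frac{1}{2}(P + Q)$. It is symmetric and bounded above by $\log 2$.
Cross-entropy: $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$, where $H(P)$ is the entropy of $P$.
Total variation distance, Wasserstein distance: alternative measures that are true metrics and remain well-defined when supports do not overlap.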
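
Both relationships can be checked numerically. The sketch below uses illustrative helper names and natural logs, and it assumes the inputs are probability vectors over the same outcomes. It implements the JS divergence from its definition, $\frac{1}{2} D_{KL}(P \parallel M) + \frac{1}{2} D_{KL}(Q \parallel M)$ with $M = \frac{1}{2}(P + Q)$, and verifies the cross-entropy identity.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(p_x * math.log(p_x) for p_x in p if p_x > 0)

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q) if p_x > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of P and Q to their mixture M."""
    m = [0.5 * (p_x + q_x) for p_x, q_x in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.5, 0.5]
q = [0.9, 0.1]

# Cross-entropy identity: H(P, Q) = H(P) + D_KL(P || Q)
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))

# JS is symmetric, and reaches its maximum log(2) for disjoint supports.
print(js_divergence(p, q), js_divergence(q, p))
print(js_divergence([1.0, 0.0], [0.0, 1.0]), math.log(2))
```

Because the mixture $M$ is positive wherever either $P$ or $Q$ is, the JS divergence stays finite even when the two supports do not overlap.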

Summary

KL divergence = asymmetric measure of how one distribution $Q$ diverges from another $P$.
$D_{KL}(P \parallel Q) \geq 0$, with equality if and only if the two are identical.
Core tool in information theory, variational inference, generative models, drift detection.
Limitations: not symmetric, and infinite if $Q$ assigns zero probability to an outcome that $P$ can produce.
