1) Definition
- The Jensen-Shannon (JS) divergence is a symmetric, smoothed variant of the Kullback-Leibler (KL) divergence.
- For two probability distributions $P$ and $Q$:
$JS(P \parallel Q) = \tfrac{1}{2} D_{KL}(P \parallel M) + \tfrac{1}{2} D_{KL}(Q \parallel M)$
where $M = \tfrac{1}{2}(P + Q)$ is the midpoint (mixture) distribution.
In words, it is the average of the KL divergences from $P$ to $M$ and from $Q$ to $M$.
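The definition translates almost line by line into code. Below is a minimal sketch, assuming discrete distributions passed as equal-length sequences that each sum to 1; the function names `kl_divergence` and `js_divergence` are illustrative, not taken from any library.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for discrete distributions given as sequences."""
    # Terms with p_i == 0 contribute 0, by the convention 0 * log(0 / q_i) = 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats, following the definition above."""
    # Midpoint distribution M = (P + Q) / 2.
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    # M is positive wherever P or Q is, so neither KL term divides by zero.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

js_value = js_divergence([0.5, 0.5], [0.9, 0.1])
```

Using `math.log` gives the result in nats; swapping in `math.log2` would give bits instead.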
2) Properties
- Symmetric: $JS(P \parallel Q) = JS(Q \parallel P)$
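- Bounded: $0 \leq JS(P \parallel Q) \leq \log(2)$ with natural logarithms (about 0.693 nats), or at most 1 bit with base-2 logarithms. It is 0 exactly when $P = Q$ and reaches the maximum when the supports are disjoint.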
- Always finite, unlike KL divergence: $M$ is positive wherever $P$ or $Q$ is, so neither KL term can diverge.
- The square root of the JS divergence (the Jensen-Shannon distance) is a true metric: it obeys the triangle inequality, which the JS divergence itself does not.
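These properties can be spot-checked numerically. The sketch below is a sanity check on three hand-picked distributions, not a proof; `js_divergence` and the vectors `p`, `q`, `r` are illustrative choices.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats between two discrete distributions."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    def kl(x, y):
        # Zero-probability terms contribute nothing to the sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.5, 0.0]
q = [0.9, 0.1, 0.0]
r = [0.0, 0.0, 1.0]  # support disjoint from both p and q

# Symmetry: swapping the arguments does not change the value.
is_symmetric = math.isclose(js_divergence(p, q), js_divergence(q, p))

# Lower bound: identical distributions give 0.
zero_when_equal = math.isclose(js_divergence(p, p), 0.0, abs_tol=1e-12)

# Upper bound: disjoint supports give log(2) in nats.
max_when_disjoint = math.isclose(js_divergence(p, r), math.log(2))

# Triangle inequality for the square root, checked on this one triple.
def js_distance(a, b):
    return math.sqrt(js_divergence(a, b))

triangle_holds = js_distance(p, r) <= js_distance(p, q) + js_distance(q, r)
```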
3) Intuition
- JS divergence measures how dissimilar two distributions are (lower means more similar), with better mathematical behavior than KL.
- If $P$ and $Q$ are very different (disjoint support), JS divergence reaches its maximum.
- If $P = Q$, JS divergence = 0.
Think of it as a smoothed, symmetric measure of how far apart two distributions are.
4) Example
Suppose
- $P = (0.5, 0.5)$,
- $Q = (0.9, 0.1)$.
Then
- Midpoint $M = (0.7, 0.3)$.
- Compute: $JS(P \parallel Q) = 0.5 \, D_{KL}(P \parallel M) + 0.5 \, D_{KL}(Q \parallel M) \approx 0.10$ nats, much smaller than $D_{KL}(P \parallel Q) \approx 0.51$ nats.
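For reference, the arithmetic behind these figures, using natural logarithms and rounding to four decimals:

$D_{KL}(P \parallel M) = 0.5 \ln\tfrac{0.5}{0.7} + 0.5 \ln\tfrac{0.5}{0.3} \approx 0.0872$
$D_{KL}(Q \parallel M) = 0.9 \ln\tfrac{0.9}{0.7} + 0.1 \ln\tfrac{0.1}{0.3} \approx 0.1163$
$JS(P \parallel Q) \approx 0.5 \,(0.0872 + 0.1163) \approx 0.1017$

For comparison, $D_{KL}(P \parallel Q) = 0.5 \ln\tfrac{0.5}{0.9} + 0.5 \ln\tfrac{0.5}{0.1} \approx 0.5108$, while $D_{KL}(Q \parallel P) = 0.9 \ln\tfrac{0.9}{0.5} + 0.1 \ln\tfrac{0.1}{0.5} \approx 0.3681$. The two directions of KL disagree, whereas JS gives a single value.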
5) Applications
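- Generative Adversarial Networks (GANs): with an optimal discriminator, the original GAN generator objective reduces to $2\,JS(P_{data} \parallel P_g) - \log 4$, so training the generator minimizes the JS divergence between the real and generated distributions.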
- Clustering distributions: better behaved than KL for comparing probability distributions.
- NLP & Information retrieval: compare topic distributions, language models.
- Drift detection: compare the distribution of a feature in training data with its distribution in production data.
- Genomics, biology, ecology: measure diversity between distributions.
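As an illustration of the drift-detection use case, the sketch below bins a numeric feature into a shared histogram and compares training data against two production samples. Everything specific here is an assumption made for the example: the synthetic Gaussian data, the bin edges, the helper names `histogram` and `js_divergence`, and the `DRIFT_THRESHOLD` value, which in practice would need tuning per feature.

```python
import bisect
import math
import random

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats between two discrete distributions."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def histogram(samples, edges):
    """Normalized bin frequencies; out-of-range samples go to the end bins."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        i = bisect.bisect_right(edges, x) - 1
        i = min(max(i, 0), len(counts) - 1)  # clip into the first/last bin
        counts[i] += 1
    return [c / len(samples) for c in counts]

rng = random.Random(0)  # fixed seed so the run is reproducible
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
prod_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # no drift
prod_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # mean shifted by 1

# Both samples must use the same bin edges for the comparison to be valid.
edges = [-4.0 + 0.5 * i for i in range(17)]  # 16 bins covering [-4, 4]
p_train = histogram(train, edges)

js_same = js_divergence(p_train, histogram(prod_same, edges))
js_shifted = js_divergence(p_train, histogram(prod_shifted, edges))

DRIFT_THRESHOLD = 0.05  # hypothetical cutoff, chosen only for this example
drift_flags = {"same": js_same > DRIFT_THRESHOLD,
               "shifted": js_shifted > DRIFT_THRESHOLD}
```

Because JS stays finite when a bin is empty in one sample, no smoothing of the histograms is needed, which is one practical reason it is often preferred over KL for this kind of check.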
6) Relationship to Other Measures
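- KL divergence: asymmetric and unbounded; JS is symmetric and bounded.
- Cross-entropy: related via KL, since $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$, where $H(P)$ is the entropy of $P$.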
- Hellinger distance, Wasserstein distance: alternative symmetric distances.
Summary
- JS divergence = symmetric, bounded version of KL divergence.
- Measures dissimilarity between two distributions by comparing each to their midpoint.
- Always finite, interpretable, and useful in ML (GANs, drift detection, NLP).
- The square root of JS is a true metric.
