Jensen–Shannon (JS) Divergence

Date: August 20, 2025
Author: Ju Yeon Eum

1) Definition

JS divergence is a symmetric and smoothed version of KL divergence.
For two probability distributions $P$ and $Q$:

$JS(P \parallel Q) = \tfrac{1}{2} D_{KL}(P \parallel M) + \tfrac{1}{2} D_{KL}(Q \parallel M)$

where $M = \tfrac{1}{2}(P + Q)$ is the midpoint distribution.

It’s the average KL divergence of each distribution from the midpoint.
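The definition translates almost line for line into code. The following is a minimal sketch, assuming each distribution is an equal-length list of probabilities summing to 1; the names `kl_divergence` and `js_divergence` are illustrative and not from any particular library.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats. Terms with p_i == 0 contribute 0 by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """JS(p || q) in nats, computed through the midpoint distribution m."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    # m_i > 0 whenever p_i > 0 or q_i > 0, so neither KL term divides by zero.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Because the midpoint is positive wherever either input is, the sketch needs no extra smoothing constant.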

2) Properties

Symmetric: $JS(P \parallel Q) = JS(Q \parallel P)$
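Bounded: $0 \leq JS(P \parallel Q) \leq \log(2)$ with the natural log, i.e. at most about 0.693 nats (1 bit with log base 2); it is 0 if the distributions are identical and maximal when their supports are disjoint.
Always finite, unlike KL divergence, because $M$ is nonzero wherever $P$ or $Q$ is.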
The square root of JS divergence is a true metric (obeys triangle inequality).
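These properties can be spot-checked numerically. The sketch below is only an illustration on three hand-picked distributions (`p`, `q`, `r` are made-up examples, and `kl` and `js` are illustrative helper names); passing on one triple does not prove the triangle inequality in general.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats; infinite if q_i == 0 somewhere p_i > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def js(p, q):
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [1.0, 0.0, 0.0]
q = [0.0, 0.5, 0.5]   # support disjoint from p
r = [0.2, 0.3, 0.5]

symmetric = math.isclose(js(p, r), js(r, p))
kl_pq = kl(p, q)      # infinite: q is zero where p is not
js_pq = js(p, q)      # finite, and equal to the upper bound log(2)
at_upper_bound = math.isclose(js_pq, math.log(2))
triangle_ok = math.sqrt(js(p, q)) <= math.sqrt(js(p, r)) + math.sqrt(js(r, q))
```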

3) Intuition

JS divergence measures how dissimilar two distributions are, with better mathematical behavior than KL.
If $P$ and $Q$ have disjoint support, JS divergence reaches its maximum, $\log(2)$.
If $P = Q$, JS divergence = 0.

Think of it as a smoothed, symmetric measure of distance between distributions.

4) Example

Suppose

$P = (0.5, 0.5)$,
$Q = (0.9, 0.1)$.

Then

Midpoint $M = (0.7, 0.3)$.
Compute: $JS(P \parallel Q) = 0.5 \, D_{KL}(P \parallel M) + 0.5 \, D_{KL}(Q \parallel M) \approx 0.5 \,(0.087) + 0.5 \,(0.116) \approx 0.10$ nats, much smaller than $D_{KL}(P \parallel Q) \approx 0.51$ nats.
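The arithmetic for this example can be reproduced step by step. This is a small sketch using natural logarithms (so results are in nats); the helper name `kl` is illustrative.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
M = [0.5 * (p + q) for p, q in zip(P, Q)]   # midpoint, approximately (0.7, 0.3)

kl_p_m = kl(P, M)                  # about 0.087
kl_q_m = kl(Q, M)                  # about 0.116
js_pq = 0.5 * kl_p_m + 0.5 * kl_q_m
kl_pq = kl(P, Q)                   # the plain KL divergence, for comparison

print(round(js_pq, 3))   # 0.102
print(round(kl_pq, 3))   # 0.511
```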

5) Applications

Generative Adversarial Networks (GANs): with an optimal discriminator, the original GAN generator objective reduces (up to a constant) to minimizing the JS divergence between the real and generated distributions.
Clustering distributions: better behaved than KL for comparing probability distributions.
NLP & Information retrieval: compare topic distributions, language models.
Drift detection: JS divergence for comparing training vs production data.
Genomics, biology, ecology: measure diversity between distributions.
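As one illustration of the drift-detection use case, the sketch below bins a numeric feature from a reference sample and a current sample over shared bins, then compares the two histograms with JS divergence. Everything here is an assumption made for the example: the synthetic Gaussian data, the 20 equal-width bins, and the helper names `binned`, `js`, and `drift_score`. What counts as "too much drift" depends on the application and is not specified by the text above.

```python
import math
import random

def js(p, q):
    """JS divergence in nats between two discrete distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def binned(values, lo, hi, n_bins):
    """Relative frequencies of values over n_bins equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # clamp v == hi into last bin
        counts[i] += 1
    return [c / len(values) for c in counts]

def drift_score(reference, current, n_bins=20):
    # Shared bin edges so both histograms live on the same support.
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    return js(binned(reference, lo, hi, n_bins), binned(current, lo, hi, n_bins))

rng = random.Random(0)  # fixed seed for repeatability
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
prod_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # same distribution
prod_shifted = [rng.gauss(1.5, 1.0) for _ in range(5000)]  # mean has shifted

score_same = drift_score(train, prod_same)
score_shifted = drift_score(train, prod_shifted)
```

In this setup the shifted sample scores noticeably higher than the unshifted one, and both scores stay within the $[0, \log(2)]$ range, which makes them easy to compare across features.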

6) Relationship to Other Measures

KL divergence: asymmetric, unbounded; JS fixes this.
Cross-entropy: $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$, so it relates to JS through KL.
Hellinger distance, Wasserstein distance: alternative symmetric distances.
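
The link to entropy can be made explicit with a short derivation. Here $H(P) = -\sum_i p_i \log p_i$ is Shannon entropy and $H(P, M) = -\sum_i p_i \log m_i$ is cross-entropy; this notation is introduced for the derivation and is not used elsewhere in the post. Writing each KL term as cross-entropy minus entropy:

$JS(P \parallel Q) = \tfrac{1}{2} \left[ H(P, M) - H(P) \right] + \tfrac{1}{2} \left[ H(Q, M) - H(Q) \right]$

Since $\tfrac{1}{2} H(P, M) + \tfrac{1}{2} H(Q, M) = -\sum_i \tfrac{1}{2}(p_i + q_i) \log m_i = H(M)$, this simplifies to

$JS(P \parallel Q) = H(M) - \tfrac{1}{2} H(P) - \tfrac{1}{2} H(Q)$

So JS divergence is the entropy of the mixture minus the average entropy of its two components.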

Summary

JS divergence = symmetric, bounded version of KL divergence.
Measures how far apart two distributions are, using the midpoint distribution for smoothing.
Always finite, interpretable, and useful in ML (GANs, drift detection, NLP).
The square root of JS is a true metric.
