1) Definition

  • Jensen-Shannon (JS) divergence is a symmetric, smoothed version of Kullback-Leibler (KL) divergence.
  • For two probability distributions $P$ and $Q$:

$JS(P \parallel Q) = \tfrac{1}{2} D_{KL}(P \parallel M) + \tfrac{1}{2} D_{KL}(Q \parallel M)$

where $M = \tfrac{1}{2}(P + Q)$ is the midpoint (mixture) distribution.

In other words, it is the average of the KL divergences from $P$ to $M$ and from $Q$ to $M$.
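The definition translates almost directly into code. The following is a minimal sketch in pure Python, assuming $P$ and $Q$ are given as equal-length lists of probabilities that each sum to 1; the function names `kl_divergence` and `js_divergence` are illustrative, not taken from any library.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats. Terms with p_i == 0 contribute 0 by convention.

    Assumes q_i > 0 wherever p_i > 0; otherwise the true value is infinite
    and this sketch would raise ZeroDivisionError.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """JS divergence in nats: average KL of p and q from their midpoint m."""
    # m_i >= p_i / 2 and m_i >= q_i / 2, so m is positive wherever p or q is,
    # which keeps both KL terms finite.
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

print(js_divergence([0.5, 0.5], [0.9, 0.1]))  # roughly 0.10 nats
```

Using `math.log` gives the result in nats; dividing by `math.log(2)` would convert it to bits.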


2) Properties

  • Symmetric: $JS(P \parallel Q) = JS(Q \parallel P)$
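  • Bounded: $0 \leq JS(P \parallel Q) \leq \log(2)$ with natural logarithms (about 0.693 nats; the upper bound is 1 bit with base-2 logarithms). It is 0 exactly when $P = Q$ and maximal when the supports are disjoint.
  • Always finite, unlike KL divergence, which is infinite whenever $Q$ assigns zero probability to an outcome that $P$ allows. $M$ is positive wherever $P$ or $Q$ is, so neither KL term can blow up.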
  • The square root of JS divergence, often called the Jensen-Shannon distance, is a true metric (obeys the triangle inequality).
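These properties can be spot-checked numerically. The sketch below is only a sanity check on a few hand-picked distributions, not a proof; the helper `js_divergence` and the example vectors `p`, `q`, `r` are assumptions made for illustration.

```python
import math

def js_divergence(p, q):
    """JS divergence in nats for two discrete distributions given as lists."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]

    def kl_to_m(x):
        # Zero-probability terms contribute nothing to the sum.
        return sum(xi * math.log(xi / mi) for xi, mi in zip(x, m) if xi > 0)

    return 0.5 * kl_to_m(p) + 0.5 * kl_to_m(q)

def js_distance(p, q):
    """Square root of the JS divergence (clamped at 0 against rounding)."""
    return math.sqrt(max(js_divergence(p, q), 0.0))

p, q, r = [0.5, 0.5, 0.0], [0.9, 0.1, 0.0], [0.1, 0.2, 0.7]

# Symmetry: swapping the arguments gives the same value.
symmetric = math.isclose(js_divergence(p, q), js_divergence(q, p))

# Disjoint supports reach the upper bound log(2) and stay finite.
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
hits_upper_bound = math.isclose(disjoint, math.log(2))

# Identical distributions give 0.
identical_is_zero = math.isclose(js_divergence(p, p), 0.0, abs_tol=1e-12)

# Triangle inequality for the square root, checked on this one triple.
triangle_ok = js_distance(p, r) <= js_distance(p, q) + js_distance(q, r)

print(symmetric, hits_upper_bound, identical_is_zero, triangle_ok)
```

The disjoint case shows why the bound is $\log(2)$: wherever $P > 0$ the midpoint is $M = P/2$, so every term of $D_{KL}(P \parallel M)$ contributes $P \log 2$.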

3) Intuition

  • JS divergence measures how dissimilar two distributions are (larger means more different), with better mathematical behavior than KL.
  • If $P$ and $Q$ have disjoint support (no outcome has positive probability under both), JS divergence reaches its maximum, $\log(2)$.
  • If $P = Q$, JS divergence = 0.

Think of it as a smoothed, symmetric measure of distance between distributions.


4) Example

Suppose

  • $P = (0.5, 0.5)$,
  • $Q = (0.9, 0.1)$.

Then

  • Midpoint $M = (0.7, 0.3)$.
  • Compute: $JS(P \parallel Q) = 0.5 \, D_{KL}(P \parallel M) + 0.5 \, D_{KL}(Q \parallel M) \approx 0.10$ nats, much smaller than $D_{KL}(P \parallel Q) \approx 0.51$ nats.
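Spelling out the arithmetic for this example (natural logarithms, intermediate values rounded to four decimals):

$D_{KL}(P \parallel M) = 0.5 \ln\tfrac{0.5}{0.7} + 0.5 \ln\tfrac{0.5}{0.3} \approx -0.1682 + 0.2554 = 0.0872$

$D_{KL}(Q \parallel M) = 0.9 \ln\tfrac{0.9}{0.7} + 0.1 \ln\tfrac{0.1}{0.3} \approx 0.2262 - 0.1099 = 0.1163$

$JS(P \parallel Q) \approx 0.5 \times 0.0872 + 0.5 \times 0.1163 \approx 0.102$

For comparison, the direct KL divergence is

$D_{KL}(P \parallel Q) = 0.5 \ln\tfrac{0.5}{0.9} + 0.5 \ln\tfrac{0.5}{0.1} \approx -0.2939 + 0.8047 = 0.5108$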


5) Applications

  • Generative Adversarial Networks (GANs): with an optimal discriminator, the original GAN generator objective reduces to minimizing the JS divergence between the real and generated distributions (up to a constant).
  • Clustering distributions: better behaved than KL for comparing probability distributions.
  • NLP & Information retrieval: compare topic distributions, language models.
  • Drift detection: compare the distribution of a feature in training data against production data; a rising JS divergence signals drift.
  • Genomics, biology, ecology: measure diversity between distributions.
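As an illustration of the drift-detection use case, the sketch below bins a numeric feature from two samples over shared bin edges and compares the resulting histograms. Everything here is an assumption made for the example: the synthetic Gaussian data, the bin range and count, and the helper names `histogram` and `js_divergence`. A real pipeline would need its own binning strategy and alerting threshold.

```python
import math
import random

def js_divergence(p, q):
    """JS divergence in nats for two discrete distributions given as lists."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]

    def kl_to_m(x):
        return sum(xi * math.log(xi / mi) for xi, mi in zip(x, m) if xi > 0)

    return 0.5 * kl_to_m(p) + 0.5 * kl_to_m(q)

def histogram(values, lo, hi, n_bins):
    """Normalized bin frequencies; out-of-range values go to the end bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = math.floor((v - lo) / width)
        counts[min(max(idx, 0), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
prod_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # no drift
prod_shifted = [rng.gauss(1.5, 1.0) for _ in range(5000)]  # mean has moved

# Both samples must be binned with the same edges to be comparable.
lo, hi, n_bins = -5.0, 7.0, 24
p_train = histogram(train, lo, hi, n_bins)
score_same = js_divergence(p_train, histogram(prod_same, lo, hi, n_bins))
score_shifted = js_divergence(p_train, histogram(prod_shifted, lo, hi, n_bins))

print(f"no drift: {score_same:.4f}  shifted: {score_shifted:.4f}")
```

Because JS stays finite when a bin is empty in one sample, no smoothing of zero counts is needed, which is one practical reason it is preferred over KL for this task.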

6) Relationship to Other Measures

  • KL divergence: asymmetric and unbounded (it can be infinite); JS is symmetric and bounded.
  • Cross-entropy: related via KL.
  • Hellinger distance, Wasserstein distance: alternative symmetric distances.
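Two standard identities make these relationships concrete. Writing $H(P) = -\sum_x P(x) \log P(x)$ for the Shannon entropy and $H(P, Q) = -\sum_x P(x) \log Q(x)$ for the cross-entropy (notation introduced here for the illustration), cross-entropy and KL divergence differ only by the entropy of $P$:

$H(P, Q) = H(P) + D_{KL}(P \parallel Q)$

Substituting the definition of KL into the JS formula gives an equivalent entropy form, which is where the name "Jensen-Shannon" comes from:

$JS(P \parallel Q) = H(M) - \tfrac{1}{2} H(P) - \tfrac{1}{2} H(Q)$

with $M = \tfrac{1}{2}(P + Q)$ as before.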

Summary

  • JS divergence = symmetric, bounded version of KL divergence.
  • Measures how different two distributions are by comparing each to their midpoint.
  • Always finite, interpretable, and useful in ML (GANs, drift detection, NLP).
  • The square root of JS is a true metric.