Definition
MMD is a statistical test to measure the difference between two probability distributions $P$ and $Q$ using samples.
- If MMD is 0, the two distributions are identical (under the chosen kernel).
- A larger MMD means the distributions are more different.
It’s widely used in machine learning to detect distribution shift (drift), compare training vs. test data, or evaluate generative models (GANs, VAEs, diffusion models).
Mathematical Idea
MMD compares the mean embeddings of two distributions in a Reproducing Kernel Hilbert Space (RKHS).
$\text{MMD}^2(P, Q) = \left\| \, \mu_P – \mu_Q \, \right\|_{\mathcal{H}}^2$
Where:
- $\mu_P = \mathbb{E}_{x \sim P}[ \phi(x) ]$ → mean embedding of $P$ in RKHS
- $\mu_Q = \mathbb{E}_{y \sim Q}[ \phi(y) ]$ → mean embedding of $Q$
- $\phi(\cdot)$ = feature mapping defined by the kernel $k(x,y)$
Empirical Estimation (with kernel trick)
Given samples $\{x_i\}_{i=1}^m \sim P$ and $\{y_j\}_{j=1}^n \sim Q$:
$\text{MMD}^2(P, Q) \approx \frac{1}{m^2} \sum_{i=1}^m \sum_{i’=1}^m k(x_i, x_{i’}) + \frac{1}{n^2} \sum_{j=1}^n \sum_{j’=1}^n k(y_j, y_{j’}) – \frac{2}{mn} \sum_{i=1}^m \sum_{j=1}^n k(x_i, y_j)$
- $k(x,y)$ is a kernel (commonly Gaussian RBF).
- This formula avoids explicitly working in infinite-dimensional space.
Interpretation
- Small MMD → two samples are from similar distributions.
- Large MMD → two samples are from different distributions.
Applications
- Drift Detection in ML
- Compare training vs. production data distributions.
- Detect covariate drift (feature shift).
- Generative Model Evaluation
- Compare generated data vs. real data.
- MMD is often used to evaluate GAN quality.
- Domain Adaptation
- Used in Deep Domain Adaptation (e.g., DANN, MMD-regularized networks) to align feature distributions between source and target domains.
Example
- Suppose you train a fraud detection model on last year’s data (distribution $P$).
- In 2025, transaction patterns shift (distribution $Q$).
- If MMD between $P$ and $Q$ is high, you have covariate drift → the model may need retraining.
Summary:
- MMD = kernel-based distance between two distributions.
- Works well for high-dimensional data (unlike simpler tests like KS test).
- Heavily used for drift detection and GAN evaluation.
