Definition
A Classifier Two-Sample Test is a method for checking whether two datasets come from the same distribution by training a classifier to distinguish between them.
- If the classifier cannot distinguish → the two distributions are likely the same.
- If the classifier can distinguish well → the two distributions differ (distribution shift).
This is a modern alternative to classical statistical tests such as the Kolmogorov–Smirnov (KS) test, Maximum Mean Discrepancy (MMD), or Energy Distance, and is especially useful for high-dimensional data (e.g., images, text, embeddings).
How It Works
- Prepare Data
- Dataset A: label as 0 (e.g., training data).
- Dataset B: label as 1 (e.g., production data).
- Train a Classifier
- Use logistic regression, random forests, or neural nets.
- Task: predict whether a sample is from A or B.
- Evaluate Performance
- Always evaluate on a held-out test set; training accuracy overstates separability.
- If held-out accuracy ≈ 50% (chance level for balanced classes) → distributions are similar.
- If held-out accuracy >> 50% → distributions differ.
- AUC, precision-recall, etc. can also serve as test statistics.
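The three steps above can be sketched with scikit-learn. This is a minimal sketch on synthetic data (a mean-shifted Gaussian stands in for the "shifted" dataset; logistic regression is one of several reasonable classifier choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(1000, 5))  # dataset A → label 0
B = rng.normal(0.5, 1.0, size=(1000, 5))  # dataset B → label 1 (shifted mean)

# Step 1: pool the samples and label them by origin
X = np.vstack([A, B])
y = np.concatenate([np.zeros(len(A)), np.ones(len(B))])

# Step 2: train a classifier to predict the origin label
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Step 3: evaluate on held-out data; >> 0.5 signals a distribution difference
acc = accuracy_score(y_te, clf.predict(X_te))
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"held-out accuracy={acc:.3f}, AUC={auc:.3f}")
```

With no shift (identical distributions), both accuracy and AUC would hover near 0.5 instead.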
Formal Hypothesis Test
- Null Hypothesis $H_0$: $P = Q$ (distributions are the same).
- Alternative Hypothesis $H_1$: $P \neq Q$.
- Test statistic: classifier accuracy (or AUC).
- Use permutation testing / bootstrapping to get p-values.
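The permutation test above can be sketched as follows: under $H_0$, the origin labels carry no information, so reshuffling them simulates draws from the null distribution of the test statistic. This is a sketch using held-out accuracy as the statistic and logistic regression as the classifier (both are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def c2st_statistic(X, y, seed=0):
    """Held-out accuracy of a classifier distinguishing the two samples."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

def c2st_pvalue(X, y, n_perm=200, seed=0):
    """Permutation p-value for H0: P = Q (labels are exchangeable)."""
    rng = np.random.default_rng(seed)
    observed = c2st_statistic(X, y)
    # null distribution: recompute the statistic with shuffled labels
    null_stats = [c2st_statistic(X, rng.permutation(y)) for _ in range(n_perm)]
    # fraction of null statistics at least as extreme as the observed one
    p = (1 + sum(s >= observed for s in null_stats)) / (1 + n_perm)
    return observed, p
```

Under $H_0$ the shuffled-label accuracies center near 0.5, so a clearly larger observed accuracy yields a small p-value.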
Advantages
- Works in very high dimensions (unlike KS or Chi-square).
- Easy to implement with standard ML classifiers.
- Naturally adapts to complex feature structures (images, text).
Disadvantages
- Requires training a model (computationally heavier).
- Sensitive to classifier choice (a weak classifier may fail to detect a real shift).
- Needs enough data to avoid overfitting.
Applications in Data Science
- Data Drift Detection: Train vs. production data.
- GAN/Generative Model Evaluation: Generated vs. real data.
- Bias Detection: Compare subgroups (e.g., men vs. women distributions in features).
Example
- Suppose you have 10,000 training samples (2024 data) and 10,000 production samples (2025 data).
- Train a classifier (say XGBoost) to distinguish them.
- If the classifier achieves AUC = 0.90, the two datasets are easily distinguishable → strong distribution drift.
Summary:
Classifier Two-Sample Tests (C2STs) turn the distribution comparison problem into a supervised classification task. They are powerful for high-dimensional, real-world data (images, text, tabular), and often outperform traditional statistical distances in practice.
