Definition

A Classifier Two-Sample Test is a method for checking whether two datasets come from the same distribution by training a classifier to distinguish between them.

  • If the classifier cannot distinguish → the two distributions are likely the same.
  • If the classifier can distinguish well → the two distributions differ (distribution shift).

This is a modern alternative to classical two-sample tests such as the Kolmogorov–Smirnov (KS) test, Maximum Mean Discrepancy (MMD), and Energy Distance, and it is especially useful for high-dimensional data (e.g., images, text, embeddings).


How It Works

  1. Prepare Data
    • Dataset A: label as 0 (e.g., training data).
    • Dataset B: label as 1 (e.g., production data).
  2. Train a Classifier
    • Use logistic regression, random forests, or neural nets.
    • Task: predict whether a sample is from A or B.
  3. Evaluate Performance
    • Evaluate on held-out data (a train/test split or cross-validation), so the statistic is not inflated by overfitting.
    • If classifier accuracy ≈ 50% (random guessing) → the distributions are similar.
    • If classifier accuracy ≫ 50% → the distributions differ.
    • AUC, precision-recall, etc. can also serve as test statistics.
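The three steps above can be sketched in a few lines. This is a minimal illustration using scikit-learn's logistic regression on synthetic Gaussian data (the datasets, dimensions, and mean shift are invented for the example; any classifier and feature matrix would do):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(1000, 5))   # dataset A → label 0
B = rng.normal(0.5, 1.0, size=(1000, 5))   # dataset B → label 1 (mean-shifted)

# Step 1: pool the samples and label them by origin
X = np.vstack([A, B])
y = np.concatenate([np.zeros(len(A)), np.ones(len(B))])

# Step 3 prerequisite: hold out data so the test statistic
# is not inflated by overfitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

# Step 2: train a classifier to predict A vs. B
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Step 3: held-out accuracy and AUC are the test statistics
acc = clf.score(X_te, y_te)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"held-out accuracy = {acc:.3f}, AUC = {auc:.3f}")
```

With the mean shift above, both statistics land well above the 0.5 baseline; replacing B with another draw from A's distribution pulls them back toward 0.5.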

Formal Hypothesis Test

  • Null Hypothesis $H_0$: $P = Q$ (distributions are the same).
  • Alternative Hypothesis $H_1$: $P \neq Q$.
  • Test statistic: classifier accuracy (or AUC).
  • Use permutation testing / bootstrapping to get p-values.
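A permutation test makes the hypothesis test concrete: shuffling the origin labels destroys any real difference between $P$ and $Q$, so refitting under shuffled labels samples the test statistic's null distribution. A sketch, using cross-validated accuracy as the statistic (the function name, classifier choice, and permutation count are illustrative, not canonical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def c2st_pvalue(A, B, n_perm=100, seed=0):
    """Permutation-based p-value for H0: A and B share a distribution."""
    rng = np.random.default_rng(seed)
    X = np.vstack([A, B])
    y = np.concatenate([np.zeros(len(A)), np.ones(len(B))])
    clf = LogisticRegression(max_iter=1000)

    # Observed statistic: cross-validated accuracy on the true labels
    obs = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

    # Null distribution: shuffle labels to break any A/B difference, refit
    null = np.empty(n_perm)
    for i in range(n_perm):
        y_perm = rng.permutation(y)
        null[i] = cross_val_score(clf, X, y_perm, cv=5, scoring="accuracy").mean()

    # +1 correction keeps the p-value strictly positive
    p_value = (1 + np.sum(null >= obs)) / (n_perm + 1)
    return obs, p_value
```

A small p-value means the observed accuracy is rarely matched under shuffled labels, i.e., evidence against $H_0$.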

Advantages

  • Works in very high dimensions, where classical tests break down (KS is univariate; Chi-square requires binning).
  • Easy to implement with standard ML classifiers.
  • Naturally adapts to complex feature structures (images, text).

Disadvantages

  • Requires training a model (computationally heavier).
  • Sensitive to classifier choice (weak classifier may fail to detect shift).
  • Needs enough data to avoid overfitting.

Applications in Data Science

  • Data Drift Detection: Train vs. production data.
  • GAN/Generative Model Evaluation: Generated vs. real data.
  • Bias Detection: Compare subgroups (e.g., men vs. women distributions in features).

Example

  • Suppose you have 10,000 training samples (2024 data) and 10,000 production samples (2025 data).
  • Train a classifier (say XGBoost) to distinguish them.
  • If the classifier achieves AUC = 0.90, the two datasets are easily distinguishable → strong distribution drift.

Summary:
Classifier Two-Sample Tests (C2STs) turn the distribution comparison problem into a supervised classification task. They are powerful for high-dimensional, real-world data (images, text, tabular), and often outperform traditional statistical distances in practice.