1) What it is
- A non-parametric test that compares two cumulative distribution functions (CDFs).
- Measures the maximum difference between them.
- Used to test whether two samples come from the same distribution.
2) Test Statistic
For two empirical CDFs, $F_n(x)$ and $G_m(x)$:
$D_{n,m} = \sup_x | F_n(x) - G_m(x) |$
- $\sup$ = supremum (the least upper bound, i.e., the largest gap attained or approached).
- Intuition: it looks at the largest vertical gap between the two CDFs.
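As a sanity check, the statistic can be computed directly from this definition: evaluate both empirical CDFs at every pooled data point and take the largest absolute gap. The helper below is a minimal sketch (the function name and sample data are illustrative, not from the source).

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample KS statistic: largest vertical gap between the
    empirical CDFs of x and y, evaluated at all pooled data points."""
    x, y = np.sort(x), np.sort(y)
    pts = np.concatenate([x, y])
    # Empirical CDF at t = fraction of the sample that is <= t
    F = np.searchsorted(x, pts, side="right") / len(x)
    G = np.searchsorted(y, pts, side="right") / len(y)
    return np.max(np.abs(F - G))

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 200)
b = rng.normal(0.5, 1, 200)
print(ks_statistic(a, b))
```

For small hand-checkable inputs like x = [1, 2, 3] and y = [2, 3, 4], the maximum gap between the two step functions is 1/3.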
3) Types of KS Tests
- One-sample KS test
  - Compare a sample against a known, fully specified reference distribution (e.g., normal).
  - Example: “Does this data follow a normal distribution?”
- Two-sample KS test
  - Compare two samples to check whether they come from the same distribution.
  - Example: “Do training and production data have the same distribution?”
4) Properties
- Non-parametric → no parametric assumptions about the data-generating distribution (valid for any continuous distribution).
- Sensitive both to shifts in location (mean) and to changes in shape (variance, skewness).
- Less powerful than parametric alternatives at small sample sizes.
- Works best for continuous data; ties in discrete data complicate the distribution of the statistic.
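The sensitivity to shape changes (not just mean shifts) is easy to verify empirically. The sketch below (sample sizes and seed are illustrative) compares two samples with the same mean but different spread; the KS test still rejects.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
narrow = rng.normal(0, 1, 2000)  # mean 0, std 1
wide = rng.normal(0, 2, 2000)    # same mean, doubled std

# Identical means, but the CDFs differ in shape, so D is large
stat, p = ks_2samp(narrow, wide)
print(f"D={stat:.3f}, p={p:.2g}")
```

With a doubled standard deviation and n = m = 2000, the gap between the CDFs is roughly 0.15 near ±1, far above the rejection threshold at these sample sizes.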
5) Example
Suppose we test if production data matches training distribution.
- Training feature values (n=1000).
- Production feature values (m=1000).
- KS test yields statistic $D = 0.15$, p-value = 0.01.
Interpretation:
- Since p < 0.05, reject null hypothesis → distributions differ significantly → possible data drift.
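The rejection can also be cross-checked against the asymptotic (large-sample) critical value for the two-sample statistic, $D_{\mathrm{crit}} = c(\alpha)\sqrt{(n+m)/(nm)}$ with $c(\alpha) = \sqrt{-\ln(\alpha/2)/2}$. A quick sketch (helper name is illustrative):

```python
import numpy as np

def ks_critical(alpha, n, m):
    # Asymptotic two-sample critical value:
    # D_crit = c(alpha) * sqrt((n + m) / (n * m)),
    # where c(alpha) = sqrt(-ln(alpha / 2) / 2), e.g. c(0.05) ~ 1.358
    c = np.sqrt(-np.log(alpha / 2) / 2)
    return c * np.sqrt((n + m) / (n * m))

d_crit = ks_critical(0.05, 1000, 1000)
print(round(d_crit, 4))   # roughly 0.061 for n = m = 1000
print(0.15 > d_crit)      # observed D clearly exceeds the threshold
```

So the observed $D = 0.15$ is well above the ~0.061 threshold, consistent with the small p-value.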
6) Applications
- Data drift detection (training vs production features).
- Model validation: check if residuals follow assumed distribution.
- Goodness-of-fit testing: test if data fits normal, exponential, etc.
- Finance: compare empirical distributions of returns.
- Medicine / biology: compare patient/control distributions.
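For the drift-detection use case, a common pattern is to run a per-feature two-sample KS test between training and production data. The sketch below uses hypothetical feature matrices and an illustrative 0.05 threshold (in practice you may want a multiple-testing correction).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Hypothetical feature matrices: rows = samples, columns = features
train = rng.normal(0, 1, size=(500, 3))
prod = rng.normal(0, 1, size=(500, 3))
prod[:, 2] += 0.8  # inject a mean shift into the third feature

for j in range(train.shape[1]):
    stat, p = ks_2samp(train[:, j], prod[:, j])
    flag = "DRIFT" if p < 0.05 else "ok"
    print(f"feature {j}: D={stat:.3f}, p={p:.4f} -> {flag}")
```

Only the shifted feature should be reliably flagged; the un-drifted columns will occasionally trip a 0.05 threshold by chance, which is why corrections such as Bonferroni are common in production monitors.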
7) Python Example
from scipy.stats import ks_2samp, kstest
import numpy as np

rng = np.random.default_rng(0)  # seed for reproducibility

# Two-sample KS test: training vs production feature values
x1 = rng.normal(0, 1, 1000)    # training data
x2 = rng.normal(0.5, 1, 1000)  # production data (shifted mean)
stat, p = ks_2samp(x1, x2)
print("D-statistic:", stat, "p-value:", p)

# One-sample KS test: x1 against the standard normal N(0, 1).
# Note: passing 'norm' with no parameters tests against N(0, 1) exactly;
# if you first estimate mean/std from the data, the p-value is no longer valid.
stat, p = kstest(x1, 'norm')
print("D-statistic:", stat, "p-value:", p)
Summary
- KS test = compares distributions via max distance between their CDFs.
- Types: one-sample (vs reference distribution), two-sample (vs another dataset).
- Applications: drift detection, goodness-of-fit, validation.
- Pros: non-parametric, flexible. Cons: less powerful at small sample sizes, and the standard p-values are unreliable for discrete data with ties.
