1) What it is
- A non-parametric test that compares two cumulative distribution functions (CDFs).
- Measures the maximum difference between them.
- Used to test whether two samples come from the same distribution.
2) Test Statistic
For two empirical CDFs, $F_n(x)$ and $G_m(x)$:
$D_{n,m} = \sup_x | F_n(x) - G_m(x) |$
- $\sup$ = supremum (the least upper bound, i.e., the largest gap attained or approached).
- Intuition: it looks at the largest vertical gap between the two CDFs.
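As a sanity check, the statistic can be computed directly from this definition: evaluate both empirical CDFs at every pooled data point and take the largest absolute gap. The helper below is a minimal sketch (the function name and sample data are illustrative, not from the source).

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample KS statistic: largest vertical gap between the
    empirical CDFs of x and y, evaluated at all pooled data points."""
    x, y = np.sort(x), np.sort(y)
    pts = np.concatenate([x, y])
    # Empirical CDF at t = fraction of the sample that is <= t
    F = np.searchsorted(x, pts, side="right") / len(x)
    G = np.searchsorted(y, pts, side="right") / len(y)
    return np.max(np.abs(F - G))

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 200)
b = rng.normal(0.5, 1, 200)
print(ks_statistic(a, b))
```

For small hand-checkable inputs like x = [1, 2, 3] and y = [2, 3, 4], the maximum gap between the two step functions is 1/3.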
3) Types of KS Tests
- One-sample KS test
  - Compare a sample against a known, fully specified reference distribution (e.g., normal).
  - Example: “Does this data follow a normal distribution?”
- Two-sample KS test
  - Compare two samples to check whether they come from the same distribution.
  - Example: “Do training and production data have the same distribution?”
4) Properties
- Non-parametric → no parametric assumptions about the data-generating distribution (valid for any continuous distribution).
- Sensitive both to shifts in location (mean) and to changes in shape (variance, skewness).
- Less powerful than parametric alternatives at small sample sizes.
- Works best for continuous data; ties in discrete data complicate the distribution of the statistic.
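The sensitivity to shape changes (not just mean shifts) is easy to verify empirically. The sketch below (sample sizes and seed are illustrative) compares two samples with the same mean but different spread; the KS test still rejects.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
narrow = rng.normal(0, 1, 2000)  # mean 0, std 1
wide = rng.normal(0, 2, 2000)    # same mean, doubled std

# Identical means, but the CDFs differ in shape, so D is large
stat, p = ks_2samp(narrow, wide)
print(f"D={stat:.3f}, p={p:.2g}")
```

With a doubled standard deviation and n = m = 2000, the gap between the CDFs is roughly 0.15 near ±1, far above the rejection threshold at these sample sizes.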
5) Example
Suppose we test if production data matches training distribution.
- Training feature values (n=1000).
- Production feature values (m=1000).
- KS test yields statistic $D = 0.15$, p-value = 0.01.
Interpretation:
- Since p < 0.05, reject null hypothesis → distributions differ significantly → possible data drift.
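The rejection can also be cross-checked against the asymptotic (large-sample) critical value for the two-sample statistic, $D_{\mathrm{crit}} = c(\alpha)\sqrt{(n+m)/(nm)}$ with $c(\alpha) = \sqrt{-\ln(\alpha/2)/2}$. A quick sketch (helper name is illustrative):

```python
import numpy as np

def ks_critical(alpha, n, m):
    # Asymptotic two-sample critical value:
    # D_crit = c(alpha) * sqrt((n + m) / (n * m)),
    # where c(alpha) = sqrt(-ln(alpha / 2) / 2), e.g. c(0.05) ~ 1.358
    c = np.sqrt(-np.log(alpha / 2) / 2)
    return c * np.sqrt((n + m) / (n * m))

d_crit = ks_critical(0.05, 1000, 1000)
print(round(d_crit, 4))   # roughly 0.061 for n = m = 1000
print(0.15 > d_crit)      # observed D clearly exceeds the threshold
```

So the observed $D = 0.15$ is well above the ~0.061 threshold, consistent with the small p-value.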
6) Applications
- Data drift detection (training vs production features).
- Model validation: check if residuals follow assumed distribution.
- Goodness-of-fit testing: test if data fits normal, exponential, etc.
- Finance: compare empirical distributions of returns.
- Medicine / biology: compare patient/control distributions.
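For the drift-detection use case, a common pattern is to run a per-feature two-sample KS test between training and production data. The sketch below uses hypothetical feature matrices and an illustrative 0.05 threshold (in practice you may want a multiple-testing correction).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Hypothetical feature matrices: rows = samples, columns = features
train = rng.normal(0, 1, size=(500, 3))
prod = rng.normal(0, 1, size=(500, 3))
prod[:, 2] += 0.8  # inject a mean shift into the third feature

for j in range(train.shape[1]):
    stat, p = ks_2samp(train[:, j], prod[:, j])
    flag = "DRIFT" if p < 0.05 else "ok"
    print(f"feature {j}: D={stat:.3f}, p={p:.4f} -> {flag}")
```

Only the shifted feature should be reliably flagged; the un-drifted columns will occasionally trip a 0.05 threshold by chance, which is why corrections such as Bonferroni are common in production monitors.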
7) Python Example
from scipy.stats import ks_2samp, kstest
import numpy as np

rng = np.random.default_rng(0)  # seed for reproducibility

# Two-sample KS test: training vs production feature values
x1 = rng.normal(0, 1, 1000)    # training data
x2 = rng.normal(0.5, 1, 1000)  # production data (shifted mean)
stat, p = ks_2samp(x1, x2)
print("D-statistic:", stat, "p-value:", p)

# One-sample KS test: x1 against the standard normal N(0, 1).
# Note: passing 'norm' with no parameters tests against N(0, 1) exactly;
# if you first estimate mean/std from the data, the p-value is no longer valid.
stat, p = kstest(x1, 'norm')
print("D-statistic:", stat, "p-value:", p)
Summary
- KS test = compares distributions via max distance between their CDFs.
- Types: one-sample (vs reference distribution), two-sample (vs another dataset).
- Applications: drift detection, goodness-of-fit, validation.
- Pros: non-parametric, flexible. Cons: less powerful at small sample sizes, and the standard p-values are unreliable for discrete data with ties.
