1) Definition

  • Dataset shift = a mismatch between the training data distribution and the test/production data distribution.
  • Formally: $P_{train}(X, Y) \neq P_{test}(X, Y)$
  • Since models learn patterns from the training distribution, a shift typically degrades real-world performance.
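This effect is easy to reproduce with a toy setup. The sketch below (all distributions and the deliberately lazy majority-class "model" are made up for illustration) trains on one input distribution and evaluates on a shifted one:

```python
import random

random.seed(0)

# The true labeling rule P(Y|X): label is 1 when x > 5.
def label(x):
    return 1 if x > 5 else 0

# Training inputs centred at 3; production inputs centred at 6 (shifted P(X)).
train_x = [random.gauss(3, 1) for _ in range(5000)]
test_x = [random.gauss(6, 1) for _ in range(5000)]

# "Model": always predict the majority class seen in training.
majority = 1 if sum(label(x) for x in train_x) > len(train_x) / 2 else 0

def accuracy(xs):
    return sum(label(x) == majority for x in xs) / len(xs)

print(accuracy(train_x))  # high: almost all training inputs fall below the threshold
print(accuracy(test_x))   # low: most production inputs fall above it
```

The model's rule was fine on the data it saw; the shift, not the rule, causes the drop.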

2) Types of Dataset Shift

  1. Covariate Shift
    • $P(X)$ changes, but $P(Y|X)$ (the labeling function) stays the same.
    • Example: Spam detector trained on 2010 emails, tested on 2025 emails (different vocabulary, but spam rules are the same).
  2. Prior Probability Shift (Label Shift)
    • $P(Y)$ (class prevalence) changes, but $P(X|Y)$ stays the same.
    • Example: Fraud detection — fraud rate is 1% in training data, but 5% in deployment.
  3. Concept Shift (Concept Drift)
    • $P(Y|X)$ itself changes (the meaning of labels evolves).
    • Example: “spam” definition changes because spammers adapt to filters.
    • Hardest to detect and handle.

3) Why it matters

  • Causes performance degradation after deployment.
  • Leads to miscalibration of probabilities.
  • Can cause fairness issues (e.g., demographic shift in population).

4) Detecting Dataset Shift

  • Statistical tests: KS (Kolmogorov–Smirnov) test, Chi-square test, PSI (Population Stability Index), KL divergence.
  • Domain classifier: train a classifier to distinguish train from test samples (or their embeddings); if it succeeds (AUC well above 0.5), the distributions differ.
  • Performance monitoring: track accuracy, AUC, and calibration drift in production.
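The two-sample KS statistic, for instance, is just the largest gap between the empirical CDFs of two samples. A self-contained sketch (the reference/drifted samples are synthetic; in practice you would use `scipy.stats.ks_2samp`):

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1  # step a's empirical CDF up by 1/len(a)
        else:
            j += 1  # step b's empirical CDF up by 1/len(b)
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

random.seed(2)
ref = [random.gauss(0, 1) for _ in range(2000)]
same = [random.gauss(0, 1) for _ in range(2000)]
drift = [random.gauss(0.5, 1) for _ in range(2000)]

print(ks_statistic(ref, same))   # small: same distribution
print(ks_statistic(ref, drift))  # large: mean has shifted
```

In a monitoring pipeline the statistic (or its p-value) is computed per feature between a training reference window and a recent production window, and an alert fires when it crosses a threshold.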

5) Coping Strategies

  1. Reweighting
    • Estimate importance weights: $w(x) = \frac{P_{test}(x)}{P_{train}(x)}$
    • Train model with weighted samples (for covariate shift).
  2. Resampling / Rebalancing
    • Adjust dataset to match real-world label prevalence.
  3. Domain Adaptation
    • Train models explicitly to generalize across source and target domains.
  4. Continual Learning / Online Updates
    • Update model periodically with new data.
  5. Robust models
    • Use regularization, ensemble methods, or invariant feature learning.
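For strategy 1, the importance weights are simplest to estimate when the covariate is discrete, since $P(x)$ can be read off empirical frequencies. A minimal sketch (the two-value feature and its counts are made up for illustration):

```python
from collections import Counter

# Hypothetical discrete feature (e.g. a binned covariate).
train_x = ["a"] * 800 + ["b"] * 200  # P_train(a)=0.8, P_train(b)=0.2
test_x = ["a"] * 500 + ["b"] * 500   # P_test(a)=0.5,  P_test(b)=0.5

def importance_weights(train, test):
    """w(x) = P_test(x) / P_train(x), estimated from empirical frequencies."""
    c_train, c_test = Counter(train), Counter(test)
    n_tr, n_te = len(train), len(test)
    return {x: (c_test[x] / n_te) / (c_train[x] / n_tr) for x in c_train}

w = importance_weights(train_x, test_x)
print(w)  # 'a' is down-weighted (~0.62), 'b' is up-weighted (2.5)

# A weighted training loss then emphasises points the deployment
# distribution favours, e.g.:
#   loss = mean(w[x] * per_example_loss(x, y) for (x, y) in train_set)
```

For continuous features the ratio is usually estimated indirectly, e.g. with a domain classifier whose predicted odds approximate $P_{test}(x)/P_{train}(x)$.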

6) Example

  • Train a medical diagnosis model on hospital A’s data (patients mostly older, European).
  • Deploy it in hospital B (younger, more diverse population).
  • Symptom distributions differ → model accuracy drops = dataset shift.

Summary

  • Dataset shift = mismatch between training and test/production data.
  • Types: covariate shift, prior/label shift, concept shift.
  • Effects: degraded performance, miscalibration, fairness issues.
  • Fixes: reweighting, domain adaptation, continual retraining, robust modeling.