1) Definition
- Dataset shift = when the training data distribution differs from the test/production data distribution.
- Formally: $P_{train}(X, Y) \neq P_{test}(X, Y)$
- Since models learn patterns from training data, a shift makes them perform worse in the real world.
2) Types of Dataset Shift
- Covariate Shift
  - $P(X)$ changes, but $P(Y|X)$ (the labeling function) stays the same.
  - Example: Spam detector trained on 2010 emails, tested on 2025 emails (different vocabulary, but the same spam rules).
- Prior Probability Shift (Label Shift)
  - $P(Y)$ (class prevalence) changes, but $P(X|Y)$ stays the same.
  - Example: In fraud detection, the fraud rate is 1% in the training data but 5% in deployment.
- Concept Shift (Concept Drift)
  - $P(Y|X)$ itself changes (the meaning of the labels evolves).
  - Example: the definition of "spam" changes because spammers adapt to filters.
  - The hardest type to detect and handle.
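The distinction is easy to see on synthetic data. A minimal numpy sketch (the distributions here are illustrative, not from the notes): the labeling rule $P(Y|X)$ is held fixed while $P(X)$ moves, so this is a pure covariate shift, yet the class prevalence $P(Y)$ drifts as a side effect.

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x):
    # Fixed labeling rule P(Y|X): y = 1 whenever x > 0, in both environments.
    return (x > 0).astype(int)

# Covariate shift: P(X) moves between training and deployment.
x_train = rng.normal(loc=-1.0, scale=1.0, size=10_000)  # training inputs
x_test = rng.normal(loc=+1.0, scale=1.0, size=10_000)   # deployment inputs

y_train, y_test = label(x_train), label(x_test)

# P(X) differs (means -1 vs +1), and even though P(Y|X) is identical,
# the marginal P(Y) shifts too: far fewer positives in training.
train_prevalence = y_train.mean()
test_prevalence = y_test.mean()
```

Under concept shift, by contrast, `label` itself would change between the two samples even if `x_train` and `x_test` were drawn identically.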
3) Why it matters
- Causes performance degradation after deployment.
- Leads to miscalibration of probabilities.
- Can cause fairness issues (e.g., demographic shift in population).
4) Detecting Dataset Shift
- Statistical tests: KS-test, Chi-square test, PSI (Population Stability Index), KL-divergence.
- Classifier two-sample test: train a classifier to distinguish train vs. test rows (if it separates them well, e.g. AUC clearly above 0.5, the distributions differ).
- Monitoring metrics: monitor accuracy, AUC, calibration drift in production.
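The first two detection ideas can be sketched with scipy and scikit-learn on synthetic data (the sample sizes and shift magnitude below are illustrative assumptions, not from the notes):

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(2000, 3))  # "training" feature sample
X_prod = rng.normal(0.5, 1.0, size=(2000, 3))   # "production" sample with shifted mean

# 1) Per-feature two-sample KS test: a small p-value indicates
#    the marginal distribution of that feature has shifted.
stat, p_value = ks_2samp(X_train[:, 0], X_prod[:, 0])

# 2) Classifier two-sample test: label rows by origin (0 = train, 1 = prod)
#    and check whether a classifier can tell them apart.
#    AUC near 0.5 means no detectable shift; higher AUC means shift.
X = np.vstack([X_train, X_prod])
origin = np.r_[np.zeros(len(X_train)), np.ones(len(X_prod))]
auc = cross_val_score(LogisticRegression(), X, origin,
                      cv=5, scoring="roc_auc").mean()
```

PSI and KL-divergence work similarly but compare binned histograms of a feature rather than the raw samples.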
5) Coping Strategies
- Reweighting
  - Estimate importance weights: $w(x) = \frac{P_{test}(x)}{P_{train}(x)}$
  - Train the model on weighted samples (appropriate for covariate shift).
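A common way to estimate $w(x)$ without modeling either density directly is to fit a probabilistic domain classifier and use its odds as a density-ratio estimate. A hedged sketch on synthetic 1-D data (distributions and sample sizes are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(-0.5, 1.0, size=(3000, 1))  # labeled training inputs
X_test = rng.normal(+0.5, 1.0, size=(3000, 1))   # unlabeled deployment inputs

# Fit a classifier to distinguish train (0) from test (1) rows; by Bayes'
# rule its odds approximate the density ratio P_test(x) / P_train(x).
X = np.vstack([X_train, X_test])
d = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
clf = LogisticRegression().fit(X, d)

p = clf.predict_proba(X_train)[:, 1]     # estimated P(domain = test | x)
w = p / (1.0 - p)                        # density-ratio estimate w(x)
w *= len(X_train) / len(X_test)          # correct for the sample-size ratio
```

The weights can then be passed to any learner that accepts them, e.g. `model.fit(X_train, y_train, sample_weight=w)`, so training points that look like deployment data count more.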
- Resampling / Rebalancing
  - Adjust the dataset to match the real-world label prevalence (appropriate for prior/label shift).
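For the fraud example above (1% positives in training, 5% believed in deployment), rebalancing can be as simple as oversampling the minority class. A minimal numpy sketch with made-up dummy features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training set with ~1% positives; deployment prevalence is believed to be 5%.
y = (rng.random(100_000) < 0.01).astype(int)
X = rng.normal(size=(100_000, 2)) + y[:, None]   # dummy features

target_prevalence = 0.05
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)

# Keep all negatives and oversample positives (with replacement)
# until they make up the target share of the rebalanced set.
n_pos_needed = int(target_prevalence / (1 - target_prevalence) * len(neg))
pos_resampled = rng.choice(pos, size=n_pos_needed, replace=True)
idx = np.concatenate([neg, pos_resampled])

X_bal, y_bal = X[idx], y[idx]
```

Note that resampling changes $P(Y)$ seen by the learner; predicted probabilities may then need recalibrating if the true deployment prevalence differs from the target used here.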
- Domain Adaptation
  - Train models explicitly to generalize from the source domain to the target domain.
- Continual Learning / Online Updates
  - Update the model periodically (or incrementally) as new data arrives.
- Robust Models
  - Use regularization, ensemble methods, or invariant feature learning to reduce sensitivity to shift.
6) Example
- Train a medical diagnosis model on hospital A’s data (patients mostly older, European).
- Deploy it in hospital B (younger, more diverse population).
- The symptom distributions differ → model accuracy drops = dataset shift.
Summary
- Dataset shift = mismatch between training and test/production data.
- Types: covariate shift, prior/label shift, concept shift.
- Effects: degraded performance, miscalibration, fairness issues.
- Fixes: reweighting, domain adaptation, continual retraining, robust modeling.
