1) Definition

  • Dataset shift = a mismatch between the training data distribution and the test/production data distribution.
  • Formally: $P_{train}(X, Y) \neq P_{test}(X, Y)$
  • Since models learn patterns from the training distribution, a shift typically degrades real-world performance.
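This effect is easy to reproduce with a toy setup. The sketch below (all distributions and the deliberately lazy majority-class "model" are made up for illustration) trains on one input distribution and evaluates on a shifted one:

```python
import random

random.seed(0)

# The true labeling rule P(Y|X): label is 1 when x > 5.
def label(x):
    return 1 if x > 5 else 0

# Training inputs centred at 3; production inputs centred at 6 (shifted P(X)).
train_x = [random.gauss(3, 1) for _ in range(5000)]
test_x = [random.gauss(6, 1) for _ in range(5000)]

# "Model": always predict the majority class seen in training.
majority = 1 if sum(label(x) for x in train_x) > len(train_x) / 2 else 0

def accuracy(xs):
    return sum(label(x) == majority for x in xs) / len(xs)

print(accuracy(train_x))  # high: almost all training inputs fall below the threshold
print(accuracy(test_x))   # low: most production inputs fall above it
```

The model's rule was fine on the data it saw; the shift, not the rule, causes the drop.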

2) Types of Dataset Shift

  1. Covariate Shift
    • $P(X)$ changes, but $P(Y|X)$ (the labeling function) stays the same.
    • Example: Spam detector trained on 2010 emails, tested on 2025 emails (different vocabulary, but spam rules are the same).
  2. Prior Probability Shift (Label Shift)
    • $P(Y)$ (class prevalence) changes, but $P(X|Y)$ stays the same.
    • Example: Fraud detection — fraud rate is 1% in training data, but 5% in deployment.
  3. Concept Shift (Concept Drift)
    • $P(Y|X)$ itself changes (the meaning of labels evolves).
    • Example: “spam” definition changes because spammers adapt to filters.
    • Hardest to detect and handle.

3) Why it matters

  • Causes performance degradation after deployment.
  • Leads to miscalibration of probabilities.
  • Can cause fairness issues (e.g., demographic shift in population).

4) Detecting Dataset Shift

  • Statistical tests: KS (Kolmogorov–Smirnov) test, Chi-square test, PSI (Population Stability Index), KL divergence.
  • Domain classifier: train a classifier to distinguish train from test samples (or their embeddings); if it succeeds (AUC well above 0.5), the distributions differ.
  • Performance monitoring: track accuracy, AUC, and calibration drift in production.
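The two-sample KS statistic, for instance, is just the largest gap between the empirical CDFs of two samples. A self-contained sketch (the reference/drifted samples are synthetic; in practice you would use `scipy.stats.ks_2samp`):

```python
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1  # step a's empirical CDF up by 1/len(a)
        else:
            j += 1  # step b's empirical CDF up by 1/len(b)
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

random.seed(2)
ref = [random.gauss(0, 1) for _ in range(2000)]
same = [random.gauss(0, 1) for _ in range(2000)]
drift = [random.gauss(0.5, 1) for _ in range(2000)]

print(ks_statistic(ref, same))   # small: same distribution
print(ks_statistic(ref, drift))  # large: mean has shifted
```

In a monitoring pipeline the statistic (or its p-value) is computed per feature between a training reference window and a recent production window, and an alert fires when it crosses a threshold.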

5) Coping Strategies

  1. Reweighting
    • Estimate importance weights: $w(x) = \frac{P_{test}(x)}{P_{train}(x)}$
    • Train model with weighted samples (for covariate shift).
  2. Resampling / Rebalancing
    • Adjust dataset to match real-world label prevalence.
  3. Domain Adaptation
    • Train models explicitly to generalize across source and target domains.
  4. Continual Learning / Online Updates
    • Update model periodically with new data.
  5. Robust models
    • Use regularization, ensemble methods, or invariant feature learning.
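For strategy 1, the importance weights are simplest to estimate when the covariate is discrete, since $P(x)$ can be read off empirical frequencies. A minimal sketch (the two-value feature and its counts are made up for illustration):

```python
from collections import Counter

# Hypothetical discrete feature (e.g. a binned covariate).
train_x = ["a"] * 800 + ["b"] * 200  # P_train(a)=0.8, P_train(b)=0.2
test_x = ["a"] * 500 + ["b"] * 500   # P_test(a)=0.5,  P_test(b)=0.5

def importance_weights(train, test):
    """w(x) = P_test(x) / P_train(x), estimated from empirical frequencies."""
    c_train, c_test = Counter(train), Counter(test)
    n_tr, n_te = len(train), len(test)
    return {x: (c_test[x] / n_te) / (c_train[x] / n_tr) for x in c_train}

w = importance_weights(train_x, test_x)
print(w)  # 'a' is down-weighted (~0.62), 'b' is up-weighted (2.5)

# A weighted training loss then emphasises points the deployment
# distribution favours, e.g.:
#   loss = mean(w[x] * per_example_loss(x, y) for (x, y) in train_set)
```

For continuous features the ratio is usually estimated indirectly, e.g. with a domain classifier whose predicted odds approximate $P_{test}(x)/P_{train}(x)$.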

6) Example

  • Train a medical diagnosis model on hospital A’s data (patients mostly older, European).
  • Deploy it in hospital B (younger, more diverse population).
  • Symptom distributions differ → model accuracy drops = dataset shift.

Summary

  • Dataset shift = mismatch between training and test/production data.
  • Types: covariate shift, prior/label shift, concept shift.
  • Effects: degraded performance, miscalibration, fairness issues.
  • Fixes: reweighting, domain adaptation, continual retraining, robust modeling.