1) Definition
- Data drift = a change in the statistical properties of the data (input features, labels, or the relationship between them) between training and production.
- If the production data distribution is different from training, the model’s assumptions may no longer hold → performance degrades.
- Analogy: the model is being asked a question it was never trained to answer.
2) Types of Drift
- Covariate Drift (Feature Drift)
- Input feature distribution $P(X)$ changes.
- Example: A loan model trained on income range 30–80k sees new applicants with 150k income → outside training range.
- Label Drift (Prior Probability Shift)
- Distribution of target variable $P(Y)$ changes.
- Example: Fraud rate rises from 1% in training data to 4% in production.
- Concept Drift
- Relationship between inputs and outputs $P(Y|X)$ changes.
- Example: Spam detection: phrases like “free gift” once strongly indicated spam, but have since become common in legitimate marketing emails.
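One way to see how the three types relate is the standard factorization of the joint distribution of inputs and labels. This is a sketch added for orientation, not a formal taxonomy; in practice several of these terms often change at once.

$$
P(X, Y) = P(Y \mid X)\, P(X) = P(X \mid Y)\, P(Y)
$$

- Covariate drift: $P(X)$ changes while $P(Y \mid X)$ stays the same.
- Label drift: $P(Y)$ changes while $P(X \mid Y)$ stays the same.
- Concept drift: $P(Y \mid X)$ itself changes.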
3) Causes of Data Drift
- Seasonality (e.g., holiday shopping changes spending patterns).
- Market/economic shifts (e.g., COVID changed consumer behavior).
- User behavior changes (new trends, apps, slang in text).
- Data pipeline issues (feature encoding bugs, missing values).
4) How to Detect Drift
- Statistical tests and distance measures:
- KL divergence, Jensen–Shannon divergence.
- KS test (Kolmogorov–Smirnov).
- Chi-square test for categorical features.
- PSI (Population Stability Index):
- PSI < 0.1 → stable
- 0.1–0.2 → moderate drift
- PSI > 0.2 → significant drift
- Performance monitoring: drop in AUC or accuracy, or worsening calibration (needs ground-truth labels, which may arrive with a delay).
- Embedding similarity (for NLP, vision): compare embeddings of production inputs with those of training inputs.
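Two of the checks above, PSI and the KS statistic, can be sketched in plain Python for a single numeric feature. This is a minimal sketch under assumptions: the function names, the choice of 10 equal-frequency bins, the `eps` floor for empty bins, and the synthetic income samples are all illustrative and not taken from any particular library.

```python
import math
import random
from bisect import bisect_right


def quantile_edges(reference, n_bins=10):
    """Interior bin edges at roughly equal-frequency cut points of the reference sample."""
    ordered = sorted(reference)
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]
    # Heavily tied data can yield duplicate edges; keep each edge once.
    return sorted(set(edges))


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin, floored at eps to avoid log(0) and division by zero."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(reference, production, n_bins=10):
    """Population Stability Index: sum of (prod - ref) * ln(prod / ref) over bins."""
    edges = quantile_edges(reference, n_bins)
    ref = bin_fractions(reference, edges)
    prod = bin_fractions(production, edges)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


def ks_statistic(reference, production):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    ref, prod = sorted(reference), sorted(production)
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(prod, x) / len(prod))
        for x in ref + prod
    )


# Synthetic, hypothetical data: incomes in thousands, with production shifted upward.
rng = random.Random(0)
train_income = [rng.gauss(55, 10) for _ in range(5000)]
prod_income = [rng.gauss(70, 10) for _ in range(5000)]

print(f"PSI = {psi(train_income, prod_income):.3f}")
print(f"KS  = {ks_statistic(train_income, prod_income):.3f}")
```

Bin edges are taken from the training sample so that each reference bin holds about the same share of data; the production sample is then counted against those same edges. The KS function returns only the statistic; a p-value would need the KS distribution, for which a statistics library is the more practical route.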
5) Guardrails & Mitigation
- Drift Guardrails (monitor thresholds):
- Example: “Alert if PSI > 0.2 for income feature.”
- Model retraining: refresh model with new data distribution.
- Adaptive models: online learning, domain adaptation.
- Feature engineering: normalize or encode to reduce sensitivity.
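A threshold guardrail like the one above can be sketched as a small check over per-feature PSI values. This is a hedged illustration: the function names, the feature names, and the PSI numbers are hypothetical, and the thresholds simply reuse the 0.1 / 0.2 rule of thumb from the detection section.

```python
def drift_status(psi_value, moderate=0.1, significant=0.2):
    """Map a PSI value to a status using the rule-of-thumb thresholds."""
    if psi_value > significant:
        return "significant"
    if psi_value >= moderate:
        return "moderate"
    return "stable"


def check_guardrails(psi_by_feature):
    """Return only the features whose status is not 'stable'."""
    statuses = {name: drift_status(value) for name, value in psi_by_feature.items()}
    return {name: status for name, status in statuses.items() if status != "stable"}


# Hypothetical PSI values, as if computed by a monitoring job.
latest_psi = {"income": 0.27, "age": 0.04, "tenure": 0.13}

for feature, status in check_guardrails(latest_psi).items():
    action = "alert + consider retraining" if status == "significant" else "watch"
    print(f"{feature}: {status} drift -> {action}")
```

Keeping the thresholds as parameters makes it easy to tune them per feature, since a single cutoff rarely suits every feature.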
6) Example
- Fraud detection model:
- Training data: fraud rate = 1%.
- Production last month: fraud rate = 3% (label drift).
- Concept drift: Fraud patterns change (attackers use new methods).
- Guardrail triggers retraining + review.
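Before a guardrail fires on a change in fraud rate, it can help to check that the change is larger than sampling noise. The sketch below uses a two-proportion z-test, which is a technique swapped in here for illustration rather than something the example prescribes. The sample sizes are hypothetical; only the 1% and 3% rates come from the example.

```python
import math


def two_proportion_z_test(positives_a, n_a, positives_b, n_b):
    """Return (z, two-sided p-value) for H0: both samples share one underlying rate."""
    rate_a = positives_a / n_a
    rate_b = positives_b / n_b
    pooled = (positives_a + positives_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the normal approximation.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical sample sizes: 100,000 training rows, 10,000 production rows.
TRAIN_N, PROD_N = 100_000, 10_000
train_fraud = TRAIN_N // 100        # 1% fraud rate in training
prod_fraud = PROD_N * 3 // 100      # 3% fraud rate in production

z, p_value = two_proportion_z_test(train_fraud, TRAIN_N, prod_fraud, PROD_N)
if p_value < 0.05:
    print(f"Label drift guardrail fired (z = {z:.1f}): trigger retraining + review")
```

With samples this large the jump from 1% to 3% is far outside what chance would explain, so the guardrail fires. A test like this only covers the label-rate shift; the concept drift in the example (new attack methods) would still have to be caught through performance monitoring.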
7) Summary
- Data drift = mismatch between training and production data.
- Types: covariate drift, label drift, concept drift.
- Causes: seasonality, market shifts, pipeline errors, evolving behaviors.
- Detection: PSI, KS test, KL divergence, performance drop.
- Mitigation: monitoring, retraining, guardrails.
