Data Drift

Date: August 20, 2025
Author: Ju Yeon Eum

1) Definition

Data drift = a change in the statistical properties of the data (input features or target labels) between training and production.
If the production distribution differs from the training distribution, the model’s assumptions may no longer hold → performance degrades.

It’s like the model is answering a question it was never trained for.

2) Types of Drift

Covariate Drift (Feature Drift)
- Input feature distribution $P(X)$ changes.
- Example: A loan model trained on income range 30–80k sees new applicants with 150k income → outside training range.
Label Drift (Prior Probability Shift)
- Distribution of target variable $P(Y)$ changes.
- Example: Fraud rate rises from 1% in training data to 4% in production.
Concept Drift
- Relationship between inputs and outputs $P(Y|X)$ changes.
- Example: In spam detection, a phrase like “free gift” was once a strong spam signal but has become common in legitimate marketing emails.
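
One way to see how the three types relate is through the joint distribution of inputs and labels. The factorization below is a standard probability identity. The "held fixed" conditions are the usual textbook idealization and an assumption here, since real drift often mixes several types at once.

$$P(X, Y) = P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y)$$

- Covariate drift: $P(X)$ changes while $P(Y \mid X)$ is assumed fixed.
- Label drift: $P(Y)$ changes while $P(X \mid Y)$ is assumed fixed.
- Concept drift: $P(Y \mid X)$ itself changes, even if $P(X)$ looks the same.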

3) Causes of Data Drift

Seasonality (e.g., holiday shopping changes spending patterns).
Market/economic shifts (e.g., COVID changed consumer behavior).
User behavior changes (new trends, apps, slang in text).
Data pipeline issues (feature encoding bugs, missing values).

4) How to Detect Drift

Statistical tests and divergence measures:
- KL divergence, Jensen–Shannon divergence.
- KS test (Kolmogorov–Smirnov).
- Chi-square test for categorical features.
PSI (Population Stability Index):
- PSI < 0.1 → stable
- 0.1–0.2 → moderate drift
- PSI > 0.2 → significant drift
Performance monitoring: drop in AUC, accuracy, calibration.
Embedding similarity (for NLP, vision).
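
To make two of the checks above concrete, here is a minimal sketch of PSI and the two-sample KS statistic for a single numeric feature, using only the Python standard library. The function names `psi` and `ks_statistic`, the quantile-based binning, the default of 10 bins, and the `eps` floor for empty bins are all assumptions made for illustration. Other binning schemes are common and will give different PSI values.

```python
import math
from bisect import bisect_right


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample (expected)
    and a production sample (actual) of one numeric feature."""
    # Assumption: bin edges are quantiles of the training sample.
    ref = sorted(expected)
    edges = [ref[int(len(ref) * i / n_bins)] for i in range(1, n_bins)]

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps keeps empty bins from causing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The largest gap between two step functions occurs at an observed value.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


train_income = list(range(30_000, 80_000, 50))    # hypothetical training incomes
prod_income = [x + 40_000 for x in train_income]  # hypothetical shifted production data
print(psi(train_income, train_income))            # identical samples give 0.0
print(psi(train_income, prod_income) > 0.2)       # large shift exceeds the 0.2 threshold
print(ks_statistic(train_income, prod_income))
```

This sketch returns only the KS statistic. A library routine such as SciPy's `scipy.stats.ks_2samp` also reports a p-value.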

5) Guardrails & Mitigation

Drift Guardrails (monitor thresholds):
- Example: “Alert if PSI > 0.2 for income feature.”
Model retraining: refresh model with new data distribution.
Adaptive models: online learning, domain adaptation.
Feature engineering: normalize or encode to reduce sensitivity.
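
A threshold guardrail like the PSI alert above can be sketched as a small function that maps a PSI value to a status. The name `psi_guardrail`, the returned dictionary layout, and the choice to treat exactly 0.2 as "moderate" are hypothetical. A real monitoring setup would wire the result into whatever alerting system is in use.

```python
def psi_guardrail(feature, psi_value, warn_at=0.1, alert_at=0.2):
    """Classify a PSI value with the rule-of-thumb thresholds
    (< 0.1 stable, 0.1 to 0.2 moderate, > 0.2 significant)."""
    if psi_value > alert_at:
        status = "significant"
    elif psi_value >= warn_at:
        status = "moderate"
    else:
        status = "stable"
    return {
        "feature": feature,
        "psi": psi_value,
        "status": status,
        "alert": status == "significant",  # only significant drift raises an alert
    }


result = psi_guardrail("income", 0.27)  # 0.27 is a made-up PSI value
if result["alert"]:
    print(f"ALERT: PSI={result['psi']:.2f} for feature '{result['feature']}'")
```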

6) Example

Fraud detection model:
- Training data: fraud rate = 1%.
- Production last month: fraud rate = 3% (label drift).
- Concept drift: fraud patterns also change as attackers adopt new methods.
- Guardrail triggers retraining + review.
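
As a rough worked check, PSI can be applied to the label itself by treating fraud / not fraud as a two-bin distribution. This is an illustrative assumption, not necessarily how the guardrail in this example is defined. With $e_i$ the training shares and $a_i$ the production shares:

$$\mathrm{PSI} = \sum_i (a_i - e_i)\,\ln\frac{a_i}{e_i} = (0.03 - 0.01)\ln\frac{0.03}{0.01} + (0.97 - 0.99)\ln\frac{0.97}{0.99} \approx 0.0220 + 0.0004 \approx 0.022$$

That is below the 0.1 "stable" cutoff even though the fraud rate tripled. For a rare class like this, a guardrail would plausibly watch the fraud rate itself or model performance rather than label PSI alone.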

Summary

Data drift = mismatch between training and production data.
Types: covariate drift, label drift, concept drift.
Causes: seasonality, market shifts, pipeline errors, evolving behaviors.
Detection: PSI, KS test, KL divergence, performance drop.
Mitigation: monitoring, retraining, guardrails.
