1) Definition

  • Data drift = a change in the statistical properties of the data a model sees (features, labels, or the relationship between them) between training and production.
  • When the production distribution differs from the training distribution, the model’s assumptions may no longer hold → performance can degrade.

It’s like the model is answering a question it was never trained for.


2) Types of Drift

  1. Covariate Drift (Feature Drift)
    • Input feature distribution $P(X)$ changes.
    • Example: A loan model trained on incomes of 30–80k sees new applicants with incomes of 150k, well outside the training range.
  2. Label Drift (Prior Probability Shift)
    • Distribution of target variable $P(Y)$ changes.
    • Example: Fraud rate rises from 1% in training data to 4% in production.
  3. Concept Drift
    • Relationship between inputs and outputs $P(Y|X)$ changes.
    • Example: Spam detection: phrases like “free gift” once reliably indicated spam, but later become common in legitimate marketing emails, so the same input now maps to a different label.
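
One way to relate the three types is through the factorization of the joint distribution of inputs and labels. This is a standard probability identity rather than something stated in these notes, and the mapping below assumes the usual simplification that only one factor changes at a time:

$$
P(X, Y) = P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y)
$$

Under that assumption, covariate drift changes $P(X)$ while $P(Y|X)$ stays fixed, label drift changes $P(Y)$ while $P(X|Y)$ stays fixed, and concept drift changes $P(Y|X)$ itself. In practice several factors often move together.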

3) Causes of Data Drift

  • Seasonality (e.g., holiday shopping changes spending patterns).
  • Market/economic shifts (e.g., COVID changed consumer behavior).
  • User behavior changes (new trends, apps, slang in text).
  • Data pipeline issues (feature encoding bugs, missing values).

4) How to Detect Drift

  • PSI (Population Stability Index): compares the binned distribution of a feature in training vs. production.
  • KS test (Kolmogorov–Smirnov): compares the empirical CDFs of two samples of a continuous feature.
  • KL divergence: measures how far the production distribution diverges from the training distribution.
  • Performance drop: track model metrics on labeled production data; a sustained decline can signal drift.

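The summary lists PSI, the KS test, and KL divergence as detection methods. A minimal sketch of the three statistics follows, using only the Python standard library. The function names, the bin edges, the `eps` floor for empty bins, and the sample income data are all assumptions made for illustration; `ks_statistic` computes only the test statistic, not a p-value.

```python
import math
from bisect import bisect_right

def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin; `edges` are sorted inner cut points.

    Empty bins are floored at `eps` (an assumed constant) to avoid log(0),
    so the fractions may sum to slightly more than 1.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]

def psi(expected, actual, edges):
    """Population Stability Index: sum over bins of (a - e) * ln(a / e)."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def kl_divergence(p_values, q_values, edges):
    """KL(P || Q) over shared bins: sum of p * ln(p / q). Not symmetric."""
    p = bin_fractions(p_values, edges)
    q = bin_fractions(q_values, edges)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )

# Hypothetical data: incomes in thousands, production shifted upward by 20k.
train_income = list(range(30, 80))
prod_income = [v + 20 for v in train_income]
edges = [40, 50, 60, 70]  # assumed bin boundaries

print("PSI:", psi(train_income, prod_income, edges))
print("KL(prod || train):", kl_divergence(prod_income, train_income, edges))
print("KS statistic:", ks_statistic(train_income, prod_income))
```

PSI is the sum of the KL divergences taken in both directions, which is why it is symmetric while KL divergence is not. Both depend on the choice of bins, whereas the KS statistic works on the raw samples.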

5) Guardrails & Mitigation

  • Drift Guardrails (monitor thresholds):
    • Example: “Alert if PSI > 0.2 for income feature.”
  • Model retraining: retrain the model on recent data that reflects the new distribution.
  • Adaptive models: online learning, domain adaptation.
  • Feature engineering: normalize or encode features so the model is less sensitive to shifts in scale or range.
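
A threshold guardrail like the PSI rule above can be sketched as a small lookup. The 0.2 alert level comes from the example in these notes; the 0.1 "warn" level, the function names, and the per-feature PSI values are assumptions added for illustration.

```python
def drift_action(psi_value, alert_threshold=0.2, warn_threshold=0.1):
    """Map a PSI value to an action; thresholds are strict (> not >=)."""
    if psi_value > alert_threshold:
        return "alert"
    if psi_value > warn_threshold:  # assumed intermediate level
        return "warn"
    return "ok"

def check_features(psi_by_feature, alert_threshold=0.2, warn_threshold=0.1):
    """Apply the guardrail to every monitored feature."""
    return {
        name: drift_action(value, alert_threshold, warn_threshold)
        for name, value in psi_by_feature.items()
    }

# Hypothetical PSI values computed elsewhere for three features.
report = check_features({"income": 0.27, "age": 0.12, "tenure": 0.03})
print(report)
```

In a real pipeline, the "alert" outcome would typically page someone or open a retraining ticket; what each level triggers is a policy choice outside the scope of this sketch.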

6) Example

  • Fraud detection model:
    • Training data: fraud rate = 1%.
    • Production last month: fraud rate = 3% (label drift).
    • Concept drift: fraud patterns also change as attackers adopt new methods.
    • Guardrail triggers retraining + review.
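
Whether a rise from 1% to 3% should trip a guardrail depends on how many production cases were observed. A minimal sketch, assuming a one-sample z-score on the fraud proportion (a technique chosen here for illustration, not one named in these notes); the sample sizes and the `z_threshold` of 3.0 are hypothetical.

```python
import math

def proportion_z_score(baseline_rate, observed_rate, n_observed):
    """z-score of the observed rate against the training baseline rate."""
    standard_error = math.sqrt(baseline_rate * (1 - baseline_rate) / n_observed)
    return (observed_rate - baseline_rate) / standard_error

def should_trigger_review(baseline_rate, observed_rate, n_observed, z_threshold=3.0):
    """True when the observed rate is far above baseline for this sample size."""
    return proportion_z_score(baseline_rate, observed_rate, n_observed) > z_threshold

# Training fraud rate 1%, production fraud rate 3% (from the example above).
large_sample = should_trigger_review(0.01, 0.03, n_observed=10_000)
small_sample = should_trigger_review(0.01, 0.03, n_observed=50)
print(large_sample, small_sample)
```

With 10,000 production transactions the shift is far beyond the threshold, while with only 50 it is not, so a guardrail on the label rate usually needs a minimum sample size alongside the rate itself.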

Summary

  • Data drift = mismatch between training and production data.
  • Types: covariate drift, label drift, concept drift.
  • Causes: seasonality, market shifts, pipeline errors, evolving behaviors.
  • Detection: PSI, KS test, KL divergence, performance drop.
  • Mitigation: monitoring, retraining, guardrails.