1) Definition
- Label drift occurs when the distribution of the target variable (labels) changes between training and production (or across time).
- Formally:
- $p_{\text{train}}(y) \;\neq\; p_{\text{prod}}(y)$
- Importantly, the relationship $p(y \mid x)$ may still be the same; it’s the marginal distribution of $y$ that has shifted.
In contrast:
- Covariate drift: input features $x$ shift.
- Concept drift: the relationship $p(y \mid x)$ itself changes.
2) Why it matters
- Model evaluation bias: Offline test metrics (on training distribution) may not reflect real-world deployment.
- Thresholds / calibration: Class priors change → probability calibration becomes wrong.
- Resource planning: In ops, drifted labels mean reality is changing (e.g., more fraud, higher churn).
3) Examples
Binary classification
- Fraud detection: In training, 2% of transactions are fraud. In production, suddenly 5% are fraud.
- If your threshold was tuned for a 2% base rate, false positives/negatives will spike.
Multi-class classification
- Customer support tickets: Distribution shifts from 70% “billing issues” / 30% “technical issues” to 40% billing / 60% technical.
- A model optimized on past priors underperforms because it “expects” more billing cases.
Regression
- Energy demand forecasting: Average demand rises by 20% due to a cold winter.
- Even if the model’s conditional mapping works, it will systematically underpredict.
4) How to detect label drift
- Histogram / distribution tests: Compare class frequencies (Chi-square, KL divergence, Jensen–Shannon).
- PSI (Population Stability Index): Apply to the label itself.
- Time-series monitoring: Rolling label distribution over time windows.
- Calibration checks: Brier score, reliability plots → reveal mismatch between predicted probabilities and actual frequencies.
- Performance metrics drift: Sudden drop in precision/recall/F1 may indicate label shift.
In real-time production, you often don’t know labels immediately → label drift monitoring usually lags (delayed ground truth).
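Once ground truth does arrive, the distribution tests above are straightforward to run. The sketch below (assuming hypothetical count data for the fraud example) compares training and production label frequencies with a chi-square goodness-of-fit test and the Jensen–Shannon distance:

```python
import numpy as np
from scipy.stats import chisquare
from scipy.spatial.distance import jensenshannon

# Hypothetical label counts: 2% fraud in training, 5% in production
train_counts = np.array([9800, 200])   # [non-fraud, fraud]
prod_counts = np.array([9500, 500])

# Chi-square: do production counts match training proportions?
expected = train_counts / train_counts.sum() * prod_counts.sum()
stat, p_value = chisquare(prod_counts, f_exp=expected)

# Jensen-Shannon distance between the two label distributions
train_p = train_counts / train_counts.sum()
prod_p = prod_counts / prod_counts.sum()
js = jensenshannon(train_p, prod_p)
```

A very small `p_value` rejects the hypothesis that production labels follow the training priors; the JS distance gives a bounded, symmetric magnitude for alerting thresholds.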
5) Mitigation strategies
- Recalibration: Adjust predicted probabilities to match new priors (e.g., Platt scaling, Bayesian correction).
- Reweighting: Importance weights $w(y) = \frac{p_{\text{prod}}(y)}{p_{\text{train}}(y)}$ to adapt loss function.
- Continuous retraining: Periodically retrain with fresh labels.
- Monitoring pipelines: Alert when class frequency moves beyond thresholds.
- Active learning: If label drift is suspected, prioritize annotation of new data.
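The recalibration and reweighting bullets can be combined into one adjustment: rescale each predicted probability by the prior ratio $w(y)$ and renormalize (prior-correction in the style of Saerens et al.). A minimal sketch, assuming the fraud example's priors:

```python
import numpy as np

def adjust_for_new_priors(probs, train_priors, prod_priors):
    """Rescale predicted class probabilities to new label priors:
    p'(y|x) is proportional to p(y|x) * w(y), w(y) = p_prod(y) / p_train(y)."""
    w = np.asarray(prod_priors) / np.asarray(train_priors)
    adjusted = np.asarray(probs) * w
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Model trained at a 2% fraud prior; production now runs at 5%
probs = np.array([[0.9, 0.1]])          # [P(non-fraud), P(fraud)] for one transaction
adjusted = adjust_for_new_priors(probs, [0.98, 0.02], [0.95, 0.05])
```

With these numbers the fraud probability roughly doubles after adjustment, which is exactly the threshold-shifting effect described in section 3.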
6) Example Calculation
Training distribution (churn model):
- Churn = 20%
- Non-churn = 80%
Production distribution (last month):
- Churn = 30%
- Non-churn = 70%
Label drift (absolute change in priors):
$|0.30 - 0.20| = 0.10 \quad \Rightarrow \quad 10\%\text{ shift}$
That shift alone can cause model miscalibration and deteriorating metrics.
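The PSI from section 4 can be computed directly on these churn priors. A minimal sketch (the 0.1/0.25 alert bands below are a common rule of thumb, not part of the metric itself):

```python
import numpy as np

def psi(train_dist, prod_dist):
    """Population Stability Index applied to a label distribution."""
    train = np.asarray(train_dist, dtype=float)
    prod = np.asarray(prod_dist, dtype=float)
    return float(np.sum((prod - train) * np.log(prod / train)))

# Churn example above: [churn, non-churn] goes from 20%/80% to 30%/70%
value = psi([0.20, 0.80], [0.30, 0.70])
```

By the usual heuristic (PSI < 0.1 stable, 0.1–0.25 moderate, > 0.25 significant), this shift registers as modest on the PSI scale, which is why monitoring pipelines often pair PSI with calibration and performance checks rather than relying on it alone.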
Summary
- Label drift = shift in target variable distribution.
- Impacts calibration, metrics, business decisions.
- Detection often lags because ground-truth labels arrive late; mitigate via recalibration, reweighting, and retraining.
