Definition:
Categorical drift refers to a change over time in the distribution of categories (values of categorical variables) between training data and the data seen in production. It is a type of data drift that specifically affects categorical features rather than continuous ones.
Key Points:
- Distribution Shift:
- The frequency of categories (e.g., “Male/Female,” “Product A/Product B,” “Region 1/Region 2”) may change over time.
- Example: If 80% of customers were from Region A during training, but only 40% are from Region A in current production data, drift has occurred.
- Impact on Models:
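- Models trained on old category distributions may make biased or inaccurate predictions, because the category frequencies they learned from no longer match the incoming data.
- Rare categories may suddenly become common, and unseen categories (values that never appeared in the training data) can show up in production, where an encoder fitted on the training data may have no mapping for them.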
- Detection Methods:
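- Chi-square test: Tests whether the category counts observed in production are consistent with the counts expected under the training distribution; a small p-value signals drift. With very large samples it can flag even negligible shifts.
- Cramér’s V: An effect size derived from the chi-square statistic, ranging from 0 (identical distributions) to 1. Because it is normalized by sample size, it indicates how large the drift is rather than only whether it is statistically detectable.
- Population Stability Index (PSI): Quantifies how much a categorical distribution has shifted by summing (production share - training share) × ln(production share / training share) over all categories. A common rule of thumb treats values below about 0.1 as stable and values above about 0.25 as a significant shift.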
- Examples:
- E-commerce: A product recommendation model trained on last year’s sales data may perform poorly if new products (new categories it has never seen) come to dominate current sales.
- Healthcare: A diagnosis model may degrade if the frequency of certain disease codes changes in incoming hospital records.
- Finance: Fraud detection models may underperform if the types of transactions (online, POS, crypto, etc.) shift significantly.
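The three detection methods listed above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`chi_square_stat`, `cramers_v`, `psi`), the `eps` floor, and the sample counts are assumptions made for this example. The counts reuse the Region A figures from the Distribution Shift example (80% in training, 40% in production) and assume, for simplicity, that all remaining rows belong to a single other region.

```python
from math import log, sqrt

def chi_square_stat(ref_counts, cur_counts):
    """Pearson chi-square statistic for a 2 x k table (reference row, current row)."""
    n_ref, n_cur = sum(ref_counts), sum(cur_counts)
    total = n_ref + n_cur
    stat = 0.0
    for r, c in zip(ref_counts, cur_counts):
        col = r + c
        if col == 0:
            continue  # a category absent from both samples contributes nothing
        exp_r = n_ref * col / total  # expected count if both samples share one distribution
        exp_c = n_cur * col / total
        stat += (r - exp_r) ** 2 / exp_r + (c - exp_c) ** 2 / exp_c
    return stat

def cramers_v(chi2, n_total, n_rows, n_cols):
    """Cramér's V: the chi-square statistic rescaled to the range 0..1."""
    return sqrt(chi2 / (n_total * min(n_rows - 1, n_cols - 1)))

def psi(ref_counts, cur_counts, eps=1e-6):
    """Population Stability Index over category shares."""
    n_ref, n_cur = sum(ref_counts), sum(cur_counts)
    value = 0.0
    for r, c in zip(ref_counts, cur_counts):
        p = max(r / n_ref, eps)  # assumed floor to avoid log(0) for empty categories
        q = max(c / n_cur, eps)
        value += (q - p) * log(q / p)
    return value

# Hypothetical counts, 1,000 rows per sample: [Region A, other region]
ref = [800, 200]  # training: 80% Region A
cur = [400, 600]  # production: 40% Region A

chi2 = chi_square_stat(ref, cur)
v = cramers_v(chi2, sum(ref) + sum(cur), n_rows=2, n_cols=len(ref))
psi_value = psi(ref, cur)
print(f"chi2={chi2:.2f}  V={v:.3f}  PSI={psi_value:.3f}")
# prints: chi2=333.33  V=0.408  PSI=0.717
```

The sketch returns only the chi-square statistic. If SciPy is available, `scipy.stats.chi2_contingency` can compute the same statistic together with a p-value; note that it applies a continuity correction to 2 x 2 tables by default, so its statistic will differ slightly from the one above.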
Summary:
Categorical drift is a change in the relative frequency or presence of categories in a categorical variable between training and deployment. Monitoring and addressing it is critical to maintaining model reliability, especially in dynamic environments.
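As one way to monitor the "presence" side of drift, the sketch below flags categories that appear in production but not in training, and reports how each category's share has moved. The function name `category_report`, the returned keys, and the transaction-type data are all hypothetical, chosen only to mirror the finance example above.

```python
from collections import Counter

def category_report(train_values, prod_values):
    """Compare category presence and shares between two samples."""
    if not train_values or not prod_values:
        raise ValueError("both samples must be non-empty")
    train_freq, prod_freq = Counter(train_values), Counter(prod_values)
    n_train, n_prod = len(train_values), len(prod_values)

    unseen = sorted(set(prod_freq) - set(train_freq))   # new in production
    missing = sorted(set(train_freq) - set(prod_freq))  # vanished from production
    # Share of production rows that fall into categories the model never saw
    unseen_share = sum(prod_freq[c] for c in unseen) / n_prod
    # Change in share per category (production share minus training share)
    shifts = {
        c: prod_freq[c] / n_prod - train_freq[c] / n_train
        for c in set(train_freq) | set(prod_freq)
    }
    return {
        "unseen": unseen,
        "missing": missing,
        "unseen_share": unseen_share,
        "shifts": shifts,
    }

# Hypothetical transaction types
train = ["online"] * 70 + ["POS"] * 30
prod = ["online"] * 50 + ["POS"] * 20 + ["crypto"] * 30

report = category_report(train, prod)
print(report["unseen"], report["unseen_share"])
```

A check like this complements the statistical tests: a chi-square test or PSI summarizes the overall shift, while the list of unseen categories points to the specific values a fitted encoder would not recognize.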
