Definition:
Categorical drift is a change over time in the distribution of a categorical variable's values between the training data and the data seen in production. It is a form of data drift that affects categorical features rather than continuous ones.


Key Points:

  1. Distribution Shift:
    • The frequency of categories (e.g., “Male/Female,” “Product A/Product B,” “Region 1/Region 2”) may change over time.
    • Example: If 80% of customers were from Region A during training, but only 40% are from Region A in current production data, the Region A share has halved and the feature has drifted.
  2. Impact on Models:
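    • Models trained on old category distributions may make biased or inaccurate predictions, because what they learned reflects category frequencies that no longer hold.
    • Rare categories may suddenly become common, and unseen categories (values that never appeared in training) can appear in production, where an encoder fitted on the training data has no representation for them.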
  3. Detection Methods:
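    • Chi-square test: Tests whether the category counts observed in production differ from those expected under the training distribution by more than chance would explain.
    • Cramér’s V: An effect size derived from the chi-square statistic and normalized by sample size; it ranges from 0 (identical category proportions) to 1 and indicates how strong the drift is.
    • Population Stability Index (PSI): Sums, over all categories, the difference between the production and training proportions weighted by the log of their ratio; larger values indicate a larger shift.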
  4. Examples:
    • E-commerce: A product recommendation model trained on last year’s sales data may fail if new products (new categories) dominate.
    • Healthcare: A diagnosis model may degrade if the frequency of certain disease codes changes in incoming hospital records.
    • Finance: Fraud detection models may underperform if the types of transactions (online, POS, crypto, etc.) shift significantly.
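
The detection methods in point 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the function name `category_drift_report`, the `eps` floor used to keep PSI finite for categories missing from one sample, and the toy data (the 80% vs. 40% Region A example from point 1) are all assumptions made for this sketch. It computes the chi-square statistic on the 2 x k table of training vs. production counts, the corresponding Cramér's V, the PSI, and the list of categories unseen in training.

```python
from collections import Counter
import math


def category_drift_report(train, prod, eps=1e-6):
    """Compare category frequencies of two samples (hypothetical helper).

    `eps` is an assumed floor for zero proportions so the PSI log term
    stays finite when a category is missing from one of the samples.
    """
    if not train or not prod:
        raise ValueError("both samples must be non-empty")

    train_counts, prod_counts = Counter(train), Counter(prod)
    categories = sorted(set(train_counts) | set(prod_counts))
    n_train, n_prod = len(train), len(prod)
    n_total = n_train + n_prod

    # Categories present in production that never appeared in training.
    unseen = [c for c in categories if train_counts[c] == 0]

    chi2 = 0.0
    psi = 0.0
    for c in categories:
        col_total = train_counts[c] + prod_counts[c]
        # Chi-square statistic on the 2 x k table (rows: train, prod).
        for observed, row_total in ((train_counts[c], n_train),
                                    (prod_counts[c], n_prod)):
            expected = row_total * col_total / n_total
            chi2 += (observed - expected) ** 2 / expected
        # PSI term: (q - p) * ln(q / p), with p = training share
        # and q = production share of this category.
        p = max(train_counts[c] / n_train, eps)
        q = max(prod_counts[c] / n_prod, eps)
        psi += (q - p) * math.log(q / p)

    # Cramér's V = sqrt(chi2 / (N * min(rows - 1, cols - 1))).
    # With two rows the minimum is 1 whenever there are >= 2 categories.
    cramers_v = math.sqrt(chi2 / n_total) if len(categories) > 1 else 0.0

    return {
        "chi2": chi2,
        "cramers_v": cramers_v,
        "psi": psi,
        "unseen_categories": unseen,
    }


# Toy data mirroring the Region A example: 80% in training, 40% in production.
train_regions = ["A"] * 80 + ["B"] * 20
prod_regions = ["A"] * 40 + ["B"] * 60

report = category_drift_report(train_regions, prod_regions)
print(report)
```

The sketch returns the raw chi-square statistic only; turning it into a p-value needs the chi-square distribution, which a statistics library such as SciPy provides. A commonly cited rule of thumb reads a PSI below 0.1 as negligible and above 0.25 as a significant shift, but these cutoffs are conventions rather than guarantees and should be tuned per feature.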

Summary:
Categorical drift occurs when the relative frequency or presence of categories in a categorical variable changes between training and deployment. Monitoring and addressing it is critical for maintaining model reliability, especially in dynamic environments.