Cardinality in Categorical Data

Definition:

Cardinality means the number of unique categories (distinct values) in a categorical feature.
Example:
- Gender → 2 categories (low cardinality).
- Zip Code → thousands of categories (high cardinality).
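
As a minimal sketch, cardinality can be measured by counting the distinct values in a column. The `records` data and the `cardinality` helper below are made up for illustration and are not part of any library.

```python
# Hypothetical toy dataset; column names and values are illustrative only.
records = [
    {"gender": "F", "zip_code": "10001"},
    {"gender": "M", "zip_code": "94105"},
    {"gender": "F", "zip_code": "60614"},
    {"gender": "M", "zip_code": "73301"},
    {"gender": "F", "zip_code": "10001"},
]


def cardinality(rows, column):
    """Return the number of distinct values observed in `column`."""
    return len({row[column] for row in rows})


for column in ("gender", "zip_code"):
    print(column, cardinality(records, column))
```

If the data lives in a pandas DataFrame, `Series.nunique()` returns the same count for a column.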

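Cramér’s V is computed from the contingency table of the two variables as $V = \sqrt{\chi^2 / (n \, (k - 1))}$, where $n$ is the sample size and $k = \min(\text{\#rows}, \text{\#columns})$ is the smaller dimension of the table.
Because $k - 1$ sits in the denominator and the number of cells drives $\chi^2$, the cardinality of your categorical variables directly affects the result.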
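
The computation can be sketched in plain Python as follows. This is one possible implementation under stated assumptions, not a reference one: the function name `cramers_v` is chosen here for illustration, no bias correction is applied, and returning 0.0 when a variable has a single category is a convention assumed for this sketch.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V for two equal-length sequences of category labels."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    row_totals = Counter(x)       # marginal counts of the first variable
    col_totals = Counter(y)       # marginal counts of the second variable
    joint = Counter(zip(x, y))    # observed cell counts
    k = min(len(row_totals), len(col_totals))
    if n == 0 or k < 2:
        return 0.0  # assumption: no association measurable with one category

    # Pearson chi-square: sum over all cells of (observed - expected)^2 / expected
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    return sqrt(chi2 / (n * (k - 1)))


# Toy data: gender fully determines preference in the first case,
# and is unrelated to it in the second.
gender = ["M", "M", "F", "F"]
print(cramers_v(gender, ["A", "A", "B", "B"]))
print(cramers_v(gender, ["A", "B", "A", "B"]))
```

The first call describes a perfect association and the second a table where every observed count equals its expected count, so the two calls bracket the range of values V can take.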

Cases:

Low Cardinality Variables (e.g., Gender, Yes/No):
- Easy to compute, and the interpretation is clear.
- Example: Gender vs. Product Preference → V = 0.3 (moderate association).

High Cardinality Variables (e.g., Zip Codes, Product IDs):
- Cramér’s V becomes less reliable because many categories appear with very small frequencies.
- With many sparse cells, the Chi-square statistic is inflated relative to the sample size, which pushes V to misleadingly high values.
- Interpretation: the association might look strong, but it could be driven by sparsity rather than by a real relationship.

For high-cardinality categorical data:
1. Group categories (e.g., merge rare categories into “Other”).
2. Use target encoding or frequency encoding instead of raw categorical comparison.
3. If using Cramér’s V, ensure the sample size is large enough that expected cell frequencies are not too small (a common rule of thumb is at least 5 per cell).
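
Steps 1 and 3 above can be sketched as follows. The helper names `group_rare` and `min_expected_frequency`, the `min_count=5` threshold, and the toy city data are all assumptions made for illustration.

```python
from collections import Counter


def group_rare(values, min_count=5, other_label="Other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def min_expected_frequency(x, y):
    """Smallest expected cell count (row total * column total / n) in the table."""
    n = len(x)
    # The smallest product of marginals comes from the smallest row and column.
    return min(Counter(x).values()) * min(Counter(y).values()) / n


# Hypothetical data: two frequent cities and three rare ones, 74 rows in total.
city = ["Paris"] * 40 + ["Lyon"] * 30 + ["Nice"] * 2 + ["Metz"] + ["Pau"]
purchase = ["Yes", "No"] * 37

grouped = group_rare(city)

print("cardinality before:", len(set(city)), "after:", len(set(grouped)))
print("min expected before:", min_expected_frequency(city, purchase))
print("min expected after:", min_expected_frequency(grouped, purchase))
```

In this toy data, grouping raises the smallest expected frequency but still leaves it below the rule-of-thumb threshold, which suggests checking expected counts again after merging rather than assuming the merge was sufficient.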

Example:

If you compare City (100 categories) vs. Purchase (Yes/No) on a sample of only a few hundred rows, Cramér’s V might come out as 0.6 (suggesting a strong association).
In reality, this may simply be because each city has only a handful of samples, so chance variation inflates the statistic, not because city strongly determines purchase.
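
The effect can be reproduced with a small simulation, sketched below under stated assumptions: city and purchase are drawn independently at random, so any association measured is pure noise. The `cramers_v` and `simulate` functions, the seed, and the sample sizes are illustrative choices. Under independence the Chi-square statistic is on average close to its degrees of freedom, so with a binary outcome V is roughly $\sqrt{(\text{\#categories} - 1)/n}$.

```python
import random
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Uncorrected Cramér's V for two equal-length label sequences."""
    n = len(x)
    rows, cols, joint = Counter(x), Counter(y), Counter(zip(x, y))
    k = min(len(rows), len(cols))
    if n == 0 or k < 2:
        return 0.0  # assumption: report 0 when only one category is observed
    chi2 = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def simulate(n_categories, n_samples, seed=0):
    """V between a random category and an independent coin-flip outcome."""
    rng = random.Random(seed)
    city = [rng.randrange(n_categories) for _ in range(n_samples)]
    purchase = [rng.random() < 0.5 for _ in range(n_samples)]
    return cramers_v(city, purchase)


v_high_small_n = simulate(n_categories=100, n_samples=300)
v_low_small_n = simulate(n_categories=2, n_samples=300)
v_high_large_n = simulate(n_categories=100, n_samples=30_000)

print(f"100 categories, n=300:    V = {v_high_small_n:.2f}")
print(f"  2 categories, n=300:    V = {v_low_small_n:.2f}")
print(f"100 categories, n=30000:  V = {v_high_large_n:.2f}")
```

Although the variables are unrelated in all three runs, the high-cardinality, small-sample case yields a much larger V than the other two, and the inflation shrinks once each category has enough samples.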

Summary:

Cardinality = number of unique categories.
Cramér’s V works well with low or moderate cardinality.
With high cardinality, results can be misleading → need grouping or alternative encodings.
