Definition:
- Cardinality means the number of unique categories (distinct values) in a categorical feature.
- Example:
- Gender → 2 categories (low cardinality).
- Zip Code → thousands of categories (high cardinality).
Cramér’s V with Cardinality
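- Cramér’s V is defined as $V = \sqrt{\chi^2 / (n\,(k - 1))}$, where $n$ is the sample size and $k = \min(\text{\#rows}, \text{\#columns})$ is the smaller dimension of the contingency table.
- The cardinality of your categorical variables sets the table's dimensions, so it directly affects both $\chi^2$ and $k$, and therefore the result.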
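A minimal sketch of the computation, using only the Python standard library. The function name `cramers_v` and the toy data are illustrative assumptions, and no continuity or bias correction is applied:

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Cramér's V for two equal-length sequences of category labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))       # observed cell counts
    row_totals = Counter(xs)           # marginal counts of the first variable
    col_totals = Counter(ys)           # marginal counts of the second variable
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return float("nan")            # undefined when a variable has one category
    chi2 = 0.0
    for r, r_n in row_totals.items():
        for c, c_n in col_totals.items():
            expected = r_n * c_n / n   # expected count under independence
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))

# Toy data (made up for illustration): a 2x2 table with counts 3, 1, 1, 3.
gender = ["M", "M", "F", "F", "M", "F", "M", "F"]
pref = ["A", "A", "B", "B", "A", "B", "B", "A"]
print(round(cramers_v(gender, pref), 3))  # 0.5
```

Here $\chi^2 = 2$, $n = 8$ and $k = 2$, so $V = \sqrt{2 / 8} = 0.5$.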
Cases:
- Low Cardinality Variables (e.g., Gender, Yes/No):
- Easy to compute, and the interpretation is clear.
- Example: Gender vs. Product Preference → V = 0.3 (moderate association).
- High Cardinality Variables (e.g., Zip Codes, Product IDs):
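- Cramér’s V becomes less reliable because many categories appear with very small frequencies, so the expected cell counts are small.
- Even when the two variables are independent, the chi-square statistic grows with the number of cells (its expected value is roughly $(r-1)(c-1)$ for an $r \times c$ table), so V can come out misleadingly high unless the sample is large.
- Interpretation: the association might look strong, but it could be driven by sparsity rather than a real relationship.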
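The effect of cardinality can be checked with a small simulation. The sketch below draws variables that are independent by construction, so any association it reports is noise. The helper `cramers_v`, the sample size of 300, and the variable names are assumptions made for illustration:

```python
import random
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Compact, uncorrected Cramér's V (hypothetical helper for this sketch)."""
    n = len(xs)
    joint, rows, cols = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    k = min(len(rows), len(cols))
    chi2 = sum(
        (joint.get((r, c), 0) - rn * cn / n) ** 2 / (rn * cn / n)
        for r, rn in rows.items()
        for c, cn in cols.items()
    )
    return sqrt(chi2 / (n * (k - 1)))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 300

# All three variables are drawn independently of each other.
purchase = [rng.choice(["yes", "no"]) for _ in range(n)]
gender = [rng.choice(["F", "M"]) for _ in range(n)]      # 2 categories
zip_code = [rng.randrange(100) for _ in range(n)]        # up to 100 categories

v_low = cramers_v(gender, purchase)
v_high = cramers_v(zip_code, purchase)
print(f"2 categories:   V = {v_low:.2f}")
print(f"100 categories: V = {v_high:.2f}")
```

With about three observations per zip code, the high-cardinality V should land far above the low-cardinality one, even though neither variable carries any information about `purchase`.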
Practical Tips:
- For high-cardinality categorical data:
- Group categories (e.g., merge rare categories into “Other”).
- Use target encoding or frequency encoding instead of raw categorical comparison.
- If using Cramér’s V, ensure sample size is large enough so that expected frequencies are not too small.
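The grouping tip and the expected-frequency check can be sketched as follows. The helper names `group_rare` and `min_expected`, the threshold `min_count=5`, and the toy data are assumptions. An expected count of at least 5 per cell is a commonly used rule of thumb, not a hard requirement:

```python
from collections import Counter

def group_rare(values, min_count=5, other="Other"):
    """Replace categories seen fewer than min_count times with a single label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

def min_expected(xs, ys):
    """Smallest expected cell count under independence: min(row) * min(col) / n."""
    n = len(xs)
    rows, cols = Counter(xs), Counter(ys)
    return min(rows.values()) * min(cols.values()) / n

# Toy data (made up): two common cities and three rare ones, 80 rows in total.
city = ["A"] * 40 + ["B"] * 30 + ["C"] * 4 + ["D"] * 4 + ["E"] * 2
purchase = ["yes", "no"] * 40

grouped = group_rare(city, min_count=5)

print(min_expected(city, purchase))     # 1.0 (rarest city has 2 rows)
print(min_expected(grouped, purchase))  # 5.0 (C, D and E merged into "Other")
```

After grouping, the table shrinks from 5 × 2 to 3 × 2 and every expected count reaches the rule-of-thumb threshold.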
Example:
- If you compare City (100 categories) vs. Purchase (Yes/No), Cramér’s V might come out as 0.6 (suggesting a strong association).
- But in reality, this might just be because each city has too few samples, not because city strongly determines purchase.
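A rough back-of-the-envelope check on this example. It is an approximation that assumes City and Purchase are truly independent and that all 100 cities appear in the sample. Under independence the chi-square statistic has an expected value of about $(r-1)(c-1)$, so:

$$
\mathbb{E}[\chi^2] \approx (100 - 1)(2 - 1) = 99, \qquad k = \min(100, 2) = 2, \qquad V \approx \sqrt{\frac{99}{n\,(2 - 1)}} = \sqrt{\frac{99}{n}}.
$$

Setting $V = 0.6$ gives $n \approx 99 / 0.36 = 275$, which is fewer than three observations per city. With a sample that small, a value near 0.6 is about what pure noise would produce.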
Summary:
- Cardinality = number of unique categories.
- Cramér’s V works well with low or moderate cardinality.
- With high cardinality, results can be misleading → need grouping or alternative encodings.
