Chi-square (χ²) Test

1) What it is

A statistical hypothesis test used to determine whether there’s a significant difference between observed frequencies and expected frequencies in categorical data.
Under the null hypothesis, the test statistic approximately follows a χ² distribution, which is where the p-value comes from.

In plain terms: “Do the counts we see differ from what we’d expect by chance?”

2) Formula

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

$O_i$: observed frequency
$E_i$: expected frequency under the null hypothesis
Large χ² value (relative to the χ² distribution with the test's degrees of freedom) → observed data don't match the expected distribution → evidence against $H_0$.
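The formula translates almost directly into code. Below is a minimal sketch, assuming NumPy is available; the function name `chi_square_statistic` and the three-category counts are made up for illustration.

```python
import numpy as np

def chi_square_statistic(observed, expected):
    """Return the sum over categories of (O_i - E_i)^2 / E_i."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

# Hypothetical counts for three categories (illustrative only).
observed = [18, 22, 20]
expected = [20, 20, 20]

stat = chi_square_statistic(observed, expected)
print("Chi-square:", stat)  # (4 + 4 + 0) / 20 = 0.4
```

The statistic alone says nothing about significance; it has to be compared against a χ² distribution with the appropriate degrees of freedom.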

3) Types of Chi-square Tests

Goodness-of-fit test
- Tests if sample distribution matches a theoretical distribution.
- Example: “Do dice rolls follow a uniform distribution?”
Test of independence
- Tests if two categorical variables are independent.
- Example: “Are gender and loan approval independent?”
Homogeneity test
- Tests if different populations have the same distribution.
- Example: “Is customer preference for product color the same across regions?”
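As a rough illustration of the goodness-of-fit case, the dice question can be sketched with `scipy.stats.chisquare`. The roll counts below are invented for the example, not real data. The independence and homogeneity tests both work on a contingency table instead and are typically run with `scipy.stats.chi2_contingency`.

```python
from scipy.stats import chisquare

# Hypothetical counts of faces 1..6 from 120 rolls (made up for illustration).
rolls = [25, 17, 15, 23, 24, 16]

# With f_exp omitted, chisquare assumes equal expected counts,
# here 120 / 6 = 20 per face, i.e. a fair die.
result = chisquare(rolls)

print("Chi-square:", result.statistic)  # 100 / 20 = 5.0
print("p-value:", result.pvalue)
```

With 6 categories there are 5 degrees of freedom, and a statistic of 5.0 gives a p-value well above 0.05, so these made-up rolls would not lead to rejecting the fair-die hypothesis.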

4) Assumptions

Data are counts/frequencies (not percentages or continuous).
Categories are mutually exclusive.
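Expected frequency ≥ 5 in most cells (a common rule of thumb: at least 80% of cells with expected count ≥ 5 and none below 1); otherwise the χ² approximation becomes unreliable.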
Observations are independent.
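The expected-frequency assumption can be checked before running the test. A minimal sketch, assuming SciPy's `scipy.stats.contingency.expected_freq`; the helper name `check_expected_counts` and the threshold argument `minimum` are hypothetical choices for this example.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def check_expected_counts(table, minimum=5.0):
    """Return the expected counts under independence and the share of
    cells whose expected count is at least `minimum`."""
    expected = expected_freq(np.asarray(table))
    share_ok = float(np.mean(expected >= minimum))
    return expected, share_ok

# A 2x2 table of counts (rows: groups, columns: outcomes).
table = [[40, 60],
         [30, 70]]

expected, share_ok = check_expected_counts(table)
print("Expected counts:\n", expected)
print("Share of cells with expected count >= 5:", share_ok)
```

If the share is low, options include merging sparse categories or using an exact test instead.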

5) Example (Independence Test)

	Loan Approved	Loan Denied	Total
Male	40	60	100
Female	30	70	100
Total	70	130	200

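Null hypothesis $H_0$: loan approval is independent of gender.
Compute the expected frequencies (row total × column total / grand total for each cell), the χ² statistic, and the p-value, using (rows − 1) × (columns − 1) = 1 degree of freedom.
If p < 0.05 → reject $H_0$ → evidence that approval is associated with gender (a potential fairness concern, though not proof of causation).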
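The steps for this table can be worked through by hand. The sketch below assumes NumPy and SciPy and computes the plain (uncorrected) Pearson statistic; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts from the table (rows: Male, Female; cols: Approved, Denied).
observed = np.array([[40, 60],
                     [30, 70]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # [[100], [100]]
col_totals = observed.sum(axis=0, keepdims=True)  # [[70, 130]]
grand_total = observed.sum()                      # 200

# Expected count per cell under independence: row total * column total / N.
expected = row_totals @ col_totals / grand_total  # [[35, 65], [35, 65]]

statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(statistic, dof)  # upper-tail probability

print("Expected counts:\n", expected)
print("Chi-square:", statistic)
print("Degrees of freedom:", dof)
print("p-value:", p_value)
```

For these counts the statistic is 200/91 ≈ 2.20 with 1 degree of freedom, and the p-value is about 0.14, so $H_0$ is not rejected at the 0.05 level.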

6) Applications in ML

Feature selection: test dependence between a categorical feature and the target (e.g., chi2 in scikit-learn's sklearn.feature_selection module).
Fairness checks: test if outcomes differ by protected group.
Drift detection: compare categorical feature distributions (training vs production).
Survey analysis / A/B testing: categorical response comparisons.
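As one illustration, the drift-detection use case can be sketched as a homogeneity test on category counts from two datasets. The function name `categorical_drift_test`, the `alpha` threshold, and the counts are all assumptions made for this example, not a standard API.

```python
import numpy as np
from scipy.stats import chi2_contingency

def categorical_drift_test(train_counts, prod_counts, alpha=0.05):
    """Compare category counts from two datasets with a chi-square test.

    Returns the statistic, the p-value, and True if p < alpha.
    """
    table = np.array([train_counts, prod_counts])  # one row per dataset
    stat, p_value, dof, _ = chi2_contingency(table)
    return stat, p_value, bool(p_value < alpha)

# Hypothetical counts for categories A, B, C (illustrative only).
train = [500, 300, 200]
prod = [400, 300, 300]

stat, p_value, drifted = categorical_drift_test(train, prod)
print("Chi-square:", stat)
print("p-value:", p_value)
print("Drift flagged:", drifted)
```

With large samples even tiny shifts become statistically significant, so in practice it can help to pair the p-value with an effect-size threshold.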

7) Python Example

import numpy as np
from scipy.stats import chi2_contingency

# Contingency table
data = np.array([[40, 60],
                 [30, 70]])  # rows: gender, cols: loan outcome

chi2, p, dof, expected = chi2_contingency(data)
print("Chi-square:", chi2)
print("p-value:", p)
print("Expected counts:\n", expected)
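One detail worth knowing: for 2×2 tables, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so its statistic is smaller than the one given by the plain formula in section 2. The sketch below compares both settings on the same table.

```python
import numpy as np
from scipy.stats import chi2_contingency

data = np.array([[40, 60],
                 [30, 70]])  # rows: gender, cols: loan outcome

# Default behaviour: Yates' continuity correction for 2x2 tables.
chi2_corr, p_corr, _, _ = chi2_contingency(data)

# Plain Pearson statistic, matching sum((O - E)^2 / E).
chi2_plain, p_plain, _, _ = chi2_contingency(data, correction=False)

print("With correction:   ", chi2_corr, p_corr)
print("Without correction:", chi2_plain, p_plain)
```

For this table the corrected statistic is about 1.78 (p about 0.18) and the uncorrected one about 2.20 (p about 0.14); neither falls below 0.05.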

Summary

Chi-square test = checks if observed categorical frequencies differ from expected.
Types: goodness-of-fit, independence, homogeneity.
Widely used in feature selection, fairness testing, drift monitoring.
Pros: simple, interpretable. Cons: applies only to categorical count data and becomes unreliable when expected counts are small.
