Definition
Stratified Cross-Validation is a variant of cross-validation where the data is split into folds such that each fold preserves the original distribution of the target variable (labels).
Key difference from ordinary k-fold CV: instead of splitting purely at random, it ensures that each fold has class proportions similar to those of the full dataset.
Why It’s Important
- In imbalanced classification problems, naive CV may create folds with very few or even zero examples of the minority class → unreliable metrics.
- Stratification ensures that every fold has a representative mix of labels.
- Leads to more stable and realistic performance estimates.
Example
Suppose you have a binary classification dataset with 1,000 samples:
- Class 0 → 900 samples (90%)
- Class 1 → 100 samples (10%)
- Naive 5-fold CV: the number of Class 1 samples varies from fold to fold, and with rarer classes a fold can end up with very few or none at all.
- Stratified 5-fold CV: each fold gets exactly 180 Class 0 and 20 Class 1 samples, maintaining the 90/10 ratio.
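This contrast is easy to check with scikit-learn. A minimal sketch on synthetic labels matching the 900/100 split above (variable names are illustrative; the features are dummies since only the labels matter for splitting):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 900 samples of Class 0 and 100 of Class 1, shuffled (toy data)
y = np.array([0] * 900 + [1] * 100)
np.random.default_rng(42).shuffle(y)
X = np.zeros((len(y), 1))  # dummy features

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    # Count minority-class samples landing in each validation fold
    counts = [int((y[test] == 1).sum()) for _, test in cv.split(X, y)]
    print(f"{name}: Class 1 per fold = {counts}")
```

With `StratifiedKFold`, every fold contains exactly 20 Class 1 samples; with plain `KFold`, the per-fold counts fluctuate around 20.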
Formula / Process
For k-fold stratified CV:
- Partition the dataset into k subsets (folds).
- In each fold, keep class proportions similar to the overall dataset.
- For each iteration:
  - Use k-1 folds as training data.
  - Use the remaining fold as validation.
- Average performance metrics across k iterations.
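The partitioning step above can be sketched from scratch. `stratified_folds` below is a hypothetical helper (not a scikit-learn API) that shuffles each class's indices and deals them round-robin into k folds, which keeps every fold close to the overall class proportions:

```python
import numpy as np

def stratified_folds(y, k, seed=0):
    """Sketch: deal each class's shuffled indices round-robin into k folds."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)  # indices belonging to this class
        rng.shuffle(idx)
        for i, j in enumerate(idx):
            folds[i % k].append(j)      # round-robin assignment
    return [np.sort(np.array(f)) for f in folds]

# Toy data: 18 majority + 6 minority samples, 3 folds
y = np.array([0] * 18 + [1] * 6)
for f in stratified_folds(y, k=3):
    print(len(f), int((y[f] == 1).sum()))  # each fold: 8 samples, 2 minority
```

Real implementations handle class counts that do not divide evenly by k, but the core idea is the same: split per class, then recombine.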
Extensions
- Multiclass stratified CV: Preserves proportions across all classes.
- Stratified Shuffle Split: Random train/test splits while maintaining label distribution.
- Stratified Group K-Fold: Ensures stratification and keeps grouped samples together (e.g., multiple records per user).
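Of these, Stratified Shuffle Split is available in scikit-learn as `StratifiedShuffleSplit`. A minimal sketch, with numbers chosen to mirror a 90/10 imbalance:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # dummy features

# Three independent random 80/20 splits, each preserving the 90/10 label ratio
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    n0, n1 = int((y[test_idx] == 0).sum()), int((y[test_idx] == 1).sum())
    print(f"test set: {n0} Class 0, {n1} Class 1")  # 18 and 2 every time
```

Unlike k-fold, the splits are drawn independently, so a sample may appear in several test sets (or none).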
When to Use
Highly recommended when:
- Data is imbalanced.
- Labels are categorical (classification tasks).
Less useful for:
- Regression tasks (continuous labels, no class proportions).
- Very large datasets (random splits approximate stratification naturally).
In Python (scikit-learn)
```python
from sklearn.model_selection import StratifiedKFold

X, y = ...  # features and labels (e.g., NumPy arrays)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit the model on (X_train, y_train), evaluate on (X_test, y_test)
```
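Note that for classifiers with a categorical target, `cross_val_score` with an integer `cv` already uses stratified k-fold under the hood, so an explicit `StratifiedKFold` is mainly needed to control shuffling or the random seed. A quick sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 90/10 imbalanced dataset, purely for illustration
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# cv=5 with a classifier -> StratifiedKFold behind the scenes
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.round(3), scores.mean().round(3))
```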
Summary:
- Stratified CV = CV that preserves label proportions across folds.
- Essential for imbalanced classification → fairer, more stable estimates.
