Definition

Stratified Cross-Validation is a variant of cross-validation where the data is split into folds such that each fold preserves the original distribution of the target variable (labels).

Key difference from ordinary k-fold CV: instead of purely random splits, it ensures that each fold has class proportions similar to those of the full dataset.


Why It’s Important

  • In imbalanced classification problems, naive CV may create folds with very few or even zero examples of the minority class → unreliable metrics.
  • Stratification ensures that every fold has a representative mix of labels.
  • Leads to more stable and realistic performance estimates.

Example

Suppose you have a binary classification dataset with 1,000 samples:

  • Class 0 → 900 samples (90%)
  • Class 1 → 100 samples (10%)
  • Naive 5-fold CV: Fold-to-fold counts of Class 1 fluctuate by chance (and with rarer minority classes, a fold can end up with almost none).
  • Stratified 5-fold CV: Each fold contains 180 Class 0 and 20 Class 1 samples, maintaining the 90/10 ratio.
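
The example above can be checked directly. A minimal sketch using NumPy and scikit-learn, with a toy label array matching the 900/100 split (feature values are placeholders, since stratification only looks at the labels):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels matching the example: 900 samples of Class 0, 100 of Class 1.
y = np.array([0] * 900 + [1] * 100)
X = np.zeros((1000, 1))  # placeholder features; stratification uses only y

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (_, test_idx) in enumerate(skf.split(X, y)):
    counts = np.bincount(y[test_idx])
    print(f"fold {fold}: class 0 = {counts[0]}, class 1 = {counts[1]}")
```

Because 900 and 100 are both divisible by 5, every test fold here gets exactly 180 Class 0 and 20 Class 1 samples.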

Formula / Process

For k-fold stratified CV:

  1. Partition the dataset into k subsets (folds).
  2. In each fold, keep class proportions similar to the overall dataset.
  3. For each iteration:
    • Use k-1 folds as training data.
    • Use the remaining fold as validation.
  4. Average performance metrics across k iterations.
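
The fold-assignment step (1–2) can be sketched from scratch: shuffle the indices of each class separately, then deal them round-robin into k folds. This is only an illustration of the process, not a replacement for scikit-learn's `StratifiedKFold`, which handles more edge cases:

```python
import numpy as np

def stratified_folds(y, k, seed=0):
    """Assign each sample to one of k folds, preserving class proportions."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)  # indices of this class
        rng.shuffle(idx)                # random order within the class
        # Deal the class's indices round-robin into the k folds.
        fold_of[idx] = np.arange(len(idx)) % k
    return fold_of

y = np.array([0] * 90 + [1] * 10)
folds = stratified_folds(y, k=5)
for f in range(5):
    print(f, np.bincount(y[folds == f]))  # ~18 Class 0 and ~2 Class 1 per fold
```

Steps 3–4 then follow: for each fold index f, `folds != f` selects the k-1 training folds and `folds == f` the validation fold, and the per-fold metrics are averaged.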

Extensions

  • RepeatedStratifiedKFold → repeats stratified k-fold several times with different randomizations, for more stable estimates.
  • StratifiedShuffleSplit → stratified random train/test splits of arbitrary size.
  • StratifiedGroupKFold → stratification while keeping all samples from the same group in the same fold.

When to Use

Highly recommended when:

  • Data is imbalanced.
  • Labels are categorical (classification tasks).

Less useful for:

  • Regression tasks (continuous labels, no class proportions).
  • Very large datasets (random splits approximate stratification naturally).
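
That said, when a regression target is strongly skewed, a common workaround is to discretize it into quantile bins and stratify on the bin labels (scikit-learn's `StratifiedKFold` itself only accepts discrete labels). A sketch under that assumption, with synthetic data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
y_cont = rng.normal(size=200)        # continuous regression target
X = rng.normal(size=(200, 3))        # synthetic features

# Discretize the target into 4 quantile bins, then stratify on the bins.
edges = np.quantile(y_cont, [0.25, 0.5, 0.75])
y_binned = np.digitize(y_cont, edges)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y_binned):
    ...  # fit/evaluate on the continuous y_cont; bins were used only to split
```

Each test fold then covers the full range of the target rather than, by chance, mostly low or mostly high values.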

In Python (scikit-learn)

from sklearn.model_selection import StratifiedKFold

X, y = ...  # features and labels (NumPy arrays; use .iloc for pandas objects)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit model on (X_train, y_train), evaluate on (X_test, y_test)
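
When only the aggregate score is needed, the same splitter can be passed to `cross_val_score`, which performs the per-fold fitting and the averaging of step 4 for you. A self-contained sketch with hypothetical toy data (90/10 imbalance, as in the example above) and a logistic-regression model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical imbalanced toy data: ~90% Class 0, ~10% Class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print(scores.mean(), scores.std())  # mean and spread across the 5 folds
```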

Summary

  • Stratified CV = CV that preserves label proportions across folds.
  • Essential for imbalanced classification → fairer, more stable estimates.