Definition
Stratified Cross-Validation is a variant of cross-validation where the data is split into folds such that each fold preserves the original distribution of the target variable (labels).
Key difference from ordinary k-fold CV: instead of splitting purely at random, it ensures that each fold has class proportions similar to those of the full dataset.
Why It’s Important
- In imbalanced classification problems, naive CV may create folds with very few or even zero examples of the minority class → unreliable metrics.
- Stratification ensures that every fold has a representative mix of labels.
- Leads to more stable and realistic performance estimates.
Example
Suppose you have a binary classification dataset with 1,000 samples:
- Class 0 → 900 samples (90%)
- Class 1 → 100 samples (10%)
- Naive 5-fold CV: the number of Class 1 samples varies from fold to fold, and with rarer classes a fold can end up with very few or none at all.
- Stratified 5-fold CV: each fold gets exactly 180 Class 0 and 20 Class 1 samples, maintaining the 90/10 ratio.
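This contrast is easy to check with scikit-learn. A minimal sketch on synthetic labels matching the 900/100 split above (variable names are illustrative; the features are dummies since only the labels matter for splitting):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 900 samples of Class 0 and 100 of Class 1, shuffled (toy data)
y = np.array([0] * 900 + [1] * 100)
np.random.default_rng(42).shuffle(y)
X = np.zeros((len(y), 1))  # dummy features

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    # Count minority-class samples landing in each validation fold
    counts = [int((y[test] == 1).sum()) for _, test in cv.split(X, y)]
    print(f"{name}: Class 1 per fold = {counts}")
```

With `StratifiedKFold`, every fold contains exactly 20 Class 1 samples; with plain `KFold`, the per-fold counts fluctuate around 20.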
Formula / Process
For k-fold stratified CV:
- Partition the dataset into k subsets (folds).
- In each fold, keep class proportions similar to the overall dataset.
- For each iteration:
  - Use k-1 folds as training data.
  - Use the remaining fold as validation.
- Average performance metrics across k iterations.
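The partitioning step above can be sketched from scratch. `stratified_folds` below is a hypothetical helper (not a scikit-learn API) that shuffles each class's indices and deals them round-robin into k folds, which keeps every fold close to the overall class proportions:

```python
import numpy as np

def stratified_folds(y, k, seed=0):
    """Sketch: deal each class's shuffled indices round-robin into k folds."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)  # indices belonging to this class
        rng.shuffle(idx)
        for i, j in enumerate(idx):
            folds[i % k].append(j)      # round-robin assignment
    return [np.sort(np.array(f)) for f in folds]

# Toy data: 18 majority + 6 minority samples, 3 folds
y = np.array([0] * 18 + [1] * 6)
for f in stratified_folds(y, k=3):
    print(len(f), int((y[f] == 1).sum()))  # each fold: 8 samples, 2 minority
```

Real implementations handle class counts that do not divide evenly by k, but the core idea is the same: split per class, then recombine.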
Extensions
- Multiclass stratified CV: Preserves proportions across all classes.
- Stratified Shuffle Split: Random train/test splits while maintaining label distribution.
- Stratified Group K-Fold: Ensures stratification and keeps grouped samples together (e.g., multiple records per user).
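Of these, Stratified Shuffle Split is available in scikit-learn as `StratifiedShuffleSplit`. A minimal sketch, with numbers chosen to mirror a 90/10 imbalance:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # dummy features

# Three independent random 80/20 splits, each preserving the 90/10 label ratio
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    n0, n1 = int((y[test_idx] == 0).sum()), int((y[test_idx] == 1).sum())
    print(f"test set: {n0} Class 0, {n1} Class 1")  # 18 and 2 every time
```

Unlike k-fold, the splits are drawn independently, so a sample may appear in several test sets (or none).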
When to Use
Highly recommended when:
- Data is imbalanced.
- Labels are categorical (classification tasks).
Less useful for:
- Regression tasks (continuous labels, no class proportions).
- Very large datasets (random splits approximate stratification naturally).
In Python (scikit-learn)
```python
from sklearn.model_selection import StratifiedKFold

X, y = ...  # features and labels (e.g., NumPy arrays)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit the model on (X_train, y_train), evaluate on (X_test, y_test)
```
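Note that for classifiers with a categorical target, `cross_val_score` with an integer `cv` already uses stratified k-fold under the hood, so an explicit `StratifiedKFold` is mainly needed to control shuffling or the random seed. A quick sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 90/10 imbalanced dataset, purely for illustration
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# cv=5 with a classifier -> StratifiedKFold behind the scenes
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.round(3), scores.mean().round(3))
```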
Summary:
- Stratified CV = CV that preserves label proportions across folds.
- Essential for imbalanced classification → fairer, more stable estimates.
