Definition

Stratified Shuffle Split is a cross-validation method that repeatedly splits a dataset into train/test sets while preserving the class distribution (stratification).

  • Unlike k-fold, it doesn't partition the data into fixed, non-overlapping folds; each split is drawn independently, so test sets can overlap across splits.
  • Each train/test split has approximately the same class proportions as the original dataset (exact only when the class counts divide evenly).

How It Works

  1. Choose parameters:
    • n_splits → number of independent shuffle-and-split iterations.
    • train_size / test_size → fraction (float) or absolute number (int) of samples in each set.
  2. For each split:
    • Randomly shuffle data.
    • Partition into train/test sets, keeping class ratios the same.
  3. Repeat for all splits → evaluate model on each → average results.
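
The three steps above can be sketched as follows. This is a minimal illustration, not a recommended pipeline: the choice of LogisticRegression, the Iris data, and the parameter values are assumptions made only for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)

# Step 1: choose parameters (illustrative values).
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

scores = []
# Step 2: each iteration yields index arrays for one stratified split.
for train_idx, test_idx in sss.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# Step 3: average the per-split scores.
mean_score = float(np.mean(scores))
print("per-split accuracy:", np.round(scores, 3))
print(f"mean accuracy: {mean_score:.3f}")
```

The same loop can usually be shortened by passing the splitter as the cv argument of cross_val_score; the explicit version is shown here to mirror the steps.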

Example

  • Dataset: 1,000 samples (80% Class A, 20% Class B).
  • StratifiedShuffleSplit(n_splits=5, test_size=0.2).
  • Each split → Train = 800 (A=640, B=160), Test = 200 (A=160, B=40).
  • Class proportions (80/20) preserved in every split.
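
The counts in this example can be checked with a small script. The label array below is synthetic (class 0 stands in for Class A, class 1 for Class B), and the feature matrix is a placeholder, since only the labels affect the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels: 800 samples of class 0 ("A") and 200 of class 1 ("B").
y = np.array([0] * 800 + [1] * 200)
X = np.zeros((len(y), 1))  # placeholder features; irrelevant to the split

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

split_counts = []
for train_idx, test_idx in sss.split(X, y):
    train_counts = np.bincount(y[train_idx]).tolist()  # [n_A, n_B] in train
    test_counts = np.bincount(y[test_idx]).tolist()    # [n_A, n_B] in test
    split_counts.append((train_counts, test_counts))
    print("train A/B:", train_counts, "test A/B:", test_counts)
```

Because 80/20 divides evenly here, every split comes out at exactly 640/160 and 160/40. With less convenient sizes the per-class counts are rounded, so proportions are preserved only approximately.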

When to Use

  • Imbalanced datasets → prevents minority classes from disappearing in test sets.
  • When you want randomized train/test splits rather than fixed folds.
  • Small datasets → n_splits and test_size are set independently, so you can run many evaluations without shrinking the test set.
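
To illustrate the imbalanced-data point, the sketch below compares the unstratified ShuffleSplit with StratifiedShuffleSplit on a hypothetical 95/5 label array. The data and the helper minority_in_test are made up for this example.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Hypothetical imbalanced labels: 95 majority samples, 5 minority samples.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((len(y), 1))  # placeholder features

def minority_in_test(splitter):
    """Return the number of minority-class samples in each test set."""
    return [int((y[test_idx] == 1).sum()) for _, test_idx in splitter.split(X, y)]

plain = minority_in_test(ShuffleSplit(n_splits=10, test_size=0.2, random_state=0))
strat = minority_in_test(StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0))

print("ShuffleSplit           :", plain)
print("StratifiedShuffleSplit :", strat)
```

With stratification each 20-sample test set receives exactly one minority sample (5% of 20). Without it the count varies from split to split and can be zero, which would make minority-class metrics undefined for that split.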

Difference from Other CV

Method                    | Key Feature
--------------------------+--------------------------------------------------------
k-Fold CV                 | Splits into k equal folds; each fold used once as test.
Stratified k-Fold CV      | Like k-fold, but maintains class proportions.
Shuffle Split             | Random splits multiple times, but no class preservation.
Stratified Shuffle Split  | Random splits multiple times with class preservation.
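
One practical consequence of the fold-based vs. random-split distinction is how often each sample is tested. The sketch below counts test-set appearances per sample; the helper count_test_appearances and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)

def count_test_appearances(splitter):
    """Count how often each sample index lands in a test set."""
    counts = np.zeros(len(y), dtype=int)
    for _, test_idx in splitter.split(X, y):
        counts[test_idx] += 1
    return counts

skf_counts = count_test_appearances(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)
sss_counts = count_test_appearances(
    StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
)

print("StratifiedKFold        min/max:", skf_counts.min(), skf_counts.max())
print("StratifiedShuffleSplit min/max:", sss_counts.min(), sss_counts.max())
```

With StratifiedKFold every sample is tested exactly once. With StratifiedShuffleSplit the splits are independent, so some samples may appear in several test sets and others in none.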

Code Example (scikit-learn)

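from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit

# Iris: 150 samples, 3 classes with 50 samples each.
X, y = load_iris(return_X_y=True)

# 3 independent stratified splits, each holding out 30% of the data for testing.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=42)

# split() yields arrays of row indices, not the data itself.
for train_idx, test_idx in sss.split(X, y):
    print("Train:", train_idx.shape, "Test:", test_idx.shape)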

Summary
Stratified Shuffle Split = randomized cross-validation method that generates multiple train/test splits while preserving class distribution.

  • Useful for imbalanced datasets.
  • Different from k-fold → number of splits and test size are chosen independently, but test sets may overlap and some samples may never be tested.