Definition
Stratified Shuffle Split is a cross-validation method that repeatedly draws random train/test splits of a dataset while preserving the class distribution (stratification) in each one.
- Unlike k-fold, it does not partition the data into fixed folds; each split is an independent random draw, so test sets from different splits can overlap and a given sample may never land in any test set.
- Each train/test split keeps approximately the same class proportions as the original dataset (exactly the same only when the class counts divide evenly).
How It Works
- Choose parameters:
  - n_splits → number of re-shuffles.
  - train_size / test_size → proportion of data in each split.
- For each split:
  - Randomly shuffle data.
  - Partition into train/test sets, keeping class ratios the same.
- Repeat for all splits → evaluate model on each → average results.
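The steps above can be sketched by hand in NumPy. This is only an illustration of the idea (shuffle within each class, then cut each class at the same ratio); the helper name `stratified_shuffle_indices` and its `seed` parameter are hypothetical, and scikit-learn's own implementation allocates per-class counts differently.

```python
import numpy as np

def stratified_shuffle_indices(y, test_size=0.2, seed=0):
    """Hypothetical helper: one stratified train/test split of the labels y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)           # indices of this class
        rng.shuffle(idx)                           # shuffle within the class
        n_test = int(round(test_size * len(idx)))  # same ratio for every class
        test_parts.append(idx[:n_test])
        train_parts.append(idx[n_test:])
    # Mix the classes again so the index arrays are not grouped by label
    train_idx = rng.permutation(np.concatenate(train_parts))
    test_idx = rng.permutation(np.concatenate(test_parts))
    return train_idx, test_idx

# 80/20 toy labels; a different seed per split plays the role of n_splits
y = np.array([0] * 80 + [1] * 20)
for seed in range(3):
    train_idx, test_idx = stratified_shuffle_indices(y, test_size=0.2, seed=seed)
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))
```

Every split keeps 64/16 samples in train and 16/4 in test, while the actual indices change from seed to seed.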
Example
- Dataset: 1,000 samples (80% Class A, 20% Class B).
- StratifiedShuffleSplit(n_splits=5, test_size=0.2).
- Each split → Train = 800 (A=640, B=160), Test = 200 (A=160, B=40).
- Class proportions (80/20) preserved in every split.
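The arithmetic in this example can be checked directly with scikit-learn. The sketch below assumes synthetic labels (0 for Class A, 1 for Class B) and placeholder features, since only the labels matter for the split; `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([0] * 800 + [1] * 200)   # 0 = Class A (80%), 1 = Class B (20%)
X = np.zeros((len(y), 1))             # placeholder features, unused by the splitter

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
counts = []
for train_idx, test_idx in sss.split(X, y):
    train_counts = np.bincount(y[train_idx]).tolist()   # [A, B] in train
    test_counts = np.bincount(y[test_idx]).tolist()     # [A, B] in test
    counts.append((train_counts, test_counts))
    print("train A/B:", train_counts, "test A/B:", test_counts)
```

Each of the five splits reports 640/160 in train and 160/40 in test.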
When to Use
- Imbalanced classification datasets → keeps minority classes represented in every test set instead of leaving it to chance.
- When you want randomized train/test splits whose number and size you choose freely (not fixed folds).
- Small datasets → a single unstratified random split can easily produce a skewed class mix.
Difference from Other CV
| Method | Key Feature |
|---|---|
| k-Fold CV | Splits into k folds of (nearly) equal size; each fold is used exactly once as the test set. |
| Stratified k-Fold CV | Like k-fold, but each fold maintains the class proportions. |
| Shuffle Split | Repeated random train/test splits, without preserving class proportions. |
| Stratified Shuffle Split | Repeated random train/test splits that preserve class proportions. |
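The difference between the last two rows is easy to see on imbalanced labels. The sketch below assumes a synthetic 90/10 dataset and a hypothetical helper, `minority_test_counts`, that counts how many minority samples land in each test set.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

y = np.array([0] * 90 + [1] * 10)   # 10% minority class
X = np.zeros((len(y), 1))           # placeholder features

def minority_test_counts(splitter):
    """Hypothetical helper: minority-class count in each test set."""
    return [int((y[test_idx] == 1).sum()) for _, test_idx in splitter.split(X, y)]

plain = minority_test_counts(ShuffleSplit(n_splits=10, test_size=0.2, random_state=0))
strat = minority_test_counts(StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0))

print("ShuffleSplit:          ", plain)   # depends on the random draw
print("StratifiedShuffleSplit:", strat)   # 2 of 20 test samples in every split
```

With stratification every test set holds exactly 2 minority samples (10% of 20); with plain Shuffle Split the count is left to chance and can drop to zero.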
Code Example (scikit-learn)
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit

X, y = load_iris(return_X_y=True)

# Three independent stratified splits, 30% of the samples held out each time
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
for train_idx, test_idx in sss.split(X, y):
    # split() yields index arrays, not the data itself
    print("Train:", train_idx.shape, "Test:", test_idx.shape)
```
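To go from index arrays to the "evaluate on each split, then average" step, the splitter can be passed as the `cv` argument of `cross_val_score`. The sketch below is one possible setup; the choice of `LogisticRegression` and `max_iter=1000` is an assumption for illustration, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=42)

# Illustrative model choice; any scikit-learn classifier could be used here
model = LogisticRegression(max_iter=1000)

# One accuracy score per stratified split (the classifier's default scorer)
scores = cross_val_score(model, X, y, cv=sss)
print("Per-split accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

Fixing `random_state` makes the splits, and therefore the scores, reproducible between runs.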
Summary
Stratified Shuffle Split = randomized cross-validation method that generates multiple train/test splits while preserving class distribution.
- Useful for imbalanced datasets.
- Different from k-fold → more flexible: the number of splits and the test-set size are chosen independently, at the cost of test sets that may overlap.
