Definition

Stratified Group K-Fold is a cross-validation method that combines three ideas:

  1. k-Fold CV → split dataset into k folds.
  2. Stratification → keep the class distribution in each fold close to that of the full dataset.
  3. Grouping → ensure that samples from the same group (e.g., same patient, same user, same session) never appear in both train and validation sets.

This is especially useful for classification tasks with grouped and imbalanced data.


Why It’s Needed

  • Plain stratified k-fold → keeps class balance, but samples from the same group (e.g., the same patient) can land in both train and validation. This leaks information and inflates the performance estimate.
  • Group k-fold → prevents group overlap, but class proportions can drift between folds.
  • Stratified Group k-fold → addresses both: it respects group boundaries and keeps class proportions as close to the overall distribution as those boundaries allow.

How It Works

  1. Identify groups in your dataset (e.g., patient_id).
  2. Ensure each fold has:
    • Similar class distribution (stratification).
    • Entire groups assigned to either train or validation (grouping).
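  3. Rotate through the k folds as in standard k-fold CV: each fold serves once as the validation set while the remaining k-1 folds form the training set.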
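
One way to picture the assignment step is a greedy heuristic: place groups one at a time into whichever fold keeps the per-class counts most even. The sketch below is a simplified illustration under that assumption, not scikit-learn's actual implementation; the function name `greedy_stratified_group_folds` and the scoring rule are hypothetical choices made for this example.

```python
import numpy as np

def greedy_stratified_group_folds(y, groups, k):
    """Assign whole groups to k folds, greedily balancing class counts.

    Returns an array with the fold index of every sample.
    """
    y = np.asarray(y)
    groups = np.asarray(groups)
    classes, y_idx = np.unique(y, return_inverse=True)
    group_ids, g_idx = np.unique(groups, return_inverse=True)

    # Per-group class counts, shape (n_groups, n_classes).
    counts = np.zeros((len(group_ids), len(classes)), dtype=int)
    np.add.at(counts, (g_idx, y_idx), 1)
    class_totals = counts.sum(axis=0)

    fold_counts = np.zeros((k, len(classes)))
    group_fold = np.empty(len(group_ids), dtype=int)

    # Place the largest groups first; they are the hardest to balance later.
    for g in np.argsort(-counts.sum(axis=1), kind="stable"):
        best_fold, best_score = 0, np.inf
        for f in range(k):
            trial = fold_counts.copy()
            trial[f] += counts[g]
            # Spread of each class's share across folds (lower = more even).
            score = np.std(trial / class_totals, axis=0).mean()
            if score < best_score:
                best_fold, best_score = f, score
        fold_counts[best_fold] += counts[g]
        group_fold[g] = best_fold

    return group_fold[g_idx]

# Toy usage: 12 groups of 5 samples, roughly one third positives.
groups = np.repeat(np.arange(12), 5)
y = (np.arange(60) % 3 == 0).astype(int)
fold_of_sample = greedy_stratified_group_folds(y, groups, k=3)
print(fold_of_sample)
```

Because a group is assigned as a unit, all of its samples share one fold index, so using fold f as validation and the rest as training cannot split a group.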

Example

  • Dataset = 1,000 samples from 100 patients (groups).
  • Target labels = disease present (20%) vs. not present (80%).
  • Using Stratified Group k-fold (k=5):
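    • Each fold contains ~20 patients (about 20% of the groups), assuming patients contribute similar numbers of samples.
    • Within each fold, the class distribution stays close to 20/80; an exact match is not always possible because whole patients are assigned together.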
    • No patient appears in both train and validation sets.

When to Use

  • Medical data (multiple samples per patient).
  • User behavior data (multiple events per user).
  • Any dataset where samples within a group are not independent.
  • Imbalanced datasets where class distribution must be preserved.

Comparison

Method                    Stratified?   Group-aware?   Use Case
k-Fold                    No            No             Simple datasets
Stratified k-Fold         Yes           No             Imbalanced classification
Group k-Fold              No            Yes            Grouped data (but not imbalanced)
Stratified Group k-Fold   Yes           Yes            Grouped + imbalanced data

Code (scikit-learn)

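import numpy as np
from sklearn.model_selection import StratifiedGroupKFold  # requires scikit-learn >= 1.0

# Toy data for illustration only; replace with your own arrays.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(20), 5)      # e.g., patient IDs (20 patients x 5 samples)
X = rng.normal(size=(groups.size, 3))     # features
y = rng.integers(0, 2, size=groups.size)  # labels

cv = StratifiedGroupKFold(n_splits=5)

# groups must be passed to split(); each test fold then contains whole groups only.
for train_idx, test_idx in cv.split(X, y, groups):
    print("Train:", train_idx, "Test:", test_idx)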

Summary
Stratified Group K-Fold is a cross-validation method that keeps class proportions approximately balanced across folds and prevents group leakage between train and validation sets.

  • Ideal for datasets with groups + imbalanced classes.
  • Gives a more realistic performance estimate than standard CV when samples within a group are correlated.