1. Definition

  • Cluster-based undersampling = an advanced undersampling method for handling class imbalance.
  • Instead of randomly dropping samples from the majority class, it first clusters the majority samples (e.g., with K-means), then selects representative samples from each cluster.

Objective:

  • Keep diversity in the majority class (avoid losing too much information).
  • Avoid the sampling variance of random undersampling, where an unlucky draw can miss entire regions of the majority class.

2. How It Works

  1. Separate data into majority and minority classes.
  2. Apply a clustering algorithm (commonly K-means) on the majority class samples.
  3. For each cluster:
    • Pick representative samples (e.g., closest to the cluster centroid).
    • Or sample a proportion of points from the cluster.
  4. Combine these with the minority class → new balanced dataset.
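
The four steps above can be sketched with scikit-learn's KMeans. This is a minimal illustration, not a reference implementation: it uses the second option in step 3 (sampling from each cluster in proportion to its size), and the helper name cluster_undersample, its parameters, and the synthetic data are all assumptions made for the example. Because quotas are rounded, the result is only approximately balanced.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_undersample(X, y, majority_label, n_clusters=5, random_state=0):
    """Hypothetical helper: shrink the majority class to roughly the minority size."""
    rng = np.random.default_rng(random_state)

    # Step 1: separate majority and minority rows.
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    n_target = len(min_idx)

    # Step 2: cluster the majority class only.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X[maj_idx])

    # Step 3: draw from each cluster in proportion to its size.
    keep = []
    for c in range(n_clusters):
        members = maj_idx[labels == c]
        if len(members) == 0:
            continue
        quota = int(round(n_target * len(members) / len(maj_idx)))
        quota = min(max(quota, 1), len(members))  # at least 1, at most the cluster size
        keep.append(rng.choice(members, size=quota, replace=False))
    keep = np.concatenate(keep)

    # Step 4: combine the kept majority rows with all minority rows.
    idx = np.concatenate([keep, min_idx])
    return X[idx], y[idx]


# Synthetic data: 500 majority points (label 0), 50 minority points (label 1).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 500 + [1] * 50)

X_res, y_res = cluster_undersample(X, y, majority_label=0)
print(np.bincount(y_res))  # class counts after resampling
```

Proportional quotas keep dense regions better represented than sparse ones; picking a fixed number per cluster would instead weight every region equally.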

3. Advantages

  • Keeps representative structure of the majority class.
  • Reduces risk of throwing away informative samples (problem of random undersampling).
  • Can improve classification performance compared to random undersampling, though the gain depends on the dataset and on how well the clusters reflect the majority class.

4. Disadvantages

  • More computationally expensive (clustering step).
  • Quality depends on clustering algorithm & chosen number of clusters (K).
  • Still discards data → possible underfitting if majority class is reduced too much.

5. Example (K-means undersampling)

Suppose:

  • Majority class = 10,000 samples
  • Minority class = 1,000 samples

Steps:

  1. Cluster majority class into 1,000 clusters (same as minority size).
  2. Select 1 representative sample per cluster (e.g., nearest to centroid).
  3. Result: balanced dataset → 1,000 majority + 1,000 minority.
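
The example can be reproduced on synthetic data with the same class sizes. This is a sketch under assumptions: the two Gaussian blobs stand in for real data, and the variable names are illustrative only. Each cluster contributes the member closest to its centroid, so every kept point is an original majority sample.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, size=(10_000, 2))  # synthetic majority class
X_min = rng.normal(2.5, 0.5, size=(1_000, 2))   # synthetic minority class

# Step 1: one cluster per minority sample.
k = len(X_min)
km = KMeans(n_clusters=k, n_init=1, random_state=42).fit(X_maj)

# Step 2: in each cluster, keep the member closest to the centroid.
dist_to_centroid = np.linalg.norm(X_maj - km.cluster_centers_[km.labels_], axis=1)
selected = []
for c in range(k):
    members = np.flatnonzero(km.labels_ == c)
    if members.size:  # guard against an empty cluster
        selected.append(members[np.argmin(dist_to_centroid[members])])
selected = np.array(selected)

# Step 3: combine the representatives with the minority class.
X_bal = np.vstack([X_maj[selected], X_min])
y_bal = np.concatenate([np.zeros(len(selected), dtype=int), np.ones(k, dtype=int)])
print(np.bincount(y_bal))  # class counts after balancing
```

Fitting 1,000 clusters is the expensive part; n_init=1 is used here only to keep the sketch fast.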

6. Implementation (Python, imbalanced-learn)

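from collections import Counter

from imblearn.under_sampling import ClusterCentroids
from sklearn.datasets import make_classification

# Synthetic, roughly 9:1 imbalanced data as a stand-in; replace with your own features and labels.
X, y = make_classification(n_samples=1_100, weights=[0.9, 0.1], random_state=42)
print("Original distribution:", Counter(y))

# K-means based undersampling. With the default sampling strategy, the number
# of clusters equals the minority class size. For dense input, the default
# voting="auto" returns the centroids themselves (synthetic points), not
# original samples.
cc = ClusterCentroids(random_state=42)
X_res, y_res = cc.fit_resample(X, y)

print("Resampled distribution:", Counter(y_res))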

7. Comparison with Random Undersampling

| Method | How it works | Pros | Cons |
|---|---|---|---|
| Random Undersampling | Randomly drops majority samples | Simple, fast | May lose important information |
| Cluster-based Undersampling | Clusters majority, keeps representative points | Preserves diversity, less info loss | More complex, slower |
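
The "may lose important information" versus "preserves diversity" contrast can be illustrated on a toy majority class that contains a small, distant subgroup. This is a sketch on assumed synthetic data, not a benchmark: it only counts how many of the kept points come from the small subgroup under each method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Majority class: one large group plus a small, distant subgroup.
big = rng.normal(0.0, 1.0, size=(950, 2))
small = rng.normal(20.0, 1.0, size=(50, 2))
X_maj = np.vstack([big, small])
is_small = np.arange(len(X_maj)) >= 950
n_keep = 20

# Random undersampling: every point has the same chance of being kept,
# so the small subgroup can be missed entirely.
random_idx = rng.choice(len(X_maj), size=n_keep, replace=False)

# Cluster-based undersampling: keep the point nearest to each K-means centroid.
km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(X_maj)
dist = np.linalg.norm(X_maj - km.cluster_centers_[km.labels_], axis=1)
cluster_idx = np.array([
    np.flatnonzero(km.labels_ == c)[np.argmin(dist[km.labels_ == c])]
    for c in range(n_keep)
])

print("random kept from small subgroup: ", int(is_small[random_idx].sum()))
print("cluster kept from small subgroup:", int(is_small[cluster_idx].sum()))
```

Because K-means places at least one centroid in a well-separated subgroup, the cluster-based selection keeps a point from it, while the random draw offers no such guarantee.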

8. Summary

  • Cluster-based undersampling = undersampling majority class by clustering and selecting representatives.
  • Helps maintain majority class diversity → often a better choice than random undersampling.
  • Trade-off: higher computational cost.