1. Definition
- Cluster-based undersampling = an advanced undersampling method for handling class imbalance.
- Instead of randomly dropping samples from the majority class, it first clusters the majority samples (e.g., with K-means), then selects representative samples from each cluster.
Objective:
- Keep diversity in the majority class (avoid losing too much information).
- Avoid the sampling variability of random undersampling, where an unlucky draw can drop whole regions of the majority class.
2. How It Works
- Separate data into majority and minority classes.
- Apply a clustering algorithm (commonly K-means) on the majority class samples.
- For each cluster:
  - Pick representative samples (e.g., closest to the cluster centroid).
  - Or sample a proportion of points from the cluster.
- Combine these with the minority class → new balanced dataset.
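The steps above can be sketched with scikit-learn, here using the "sample a proportion from each cluster" option. This is a minimal illustration, not a reference implementation: the helper name `cluster_undersample`, the choice of 10 clusters, and the synthetic data from `make_classification` are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification


def cluster_undersample(X, y, majority_label, n_clusters=10, random_state=0):
    """Hypothetical helper: undersample the majority class cluster by cluster."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)

    # Cluster only the majority-class rows.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X[maj_idx])

    # Keep the same fraction of every cluster, so the number of majority
    # samples kept is close to the number of minority samples.
    frac = len(min_idx) / len(maj_idx)
    kept = []
    for c in range(n_clusters):
        members = maj_idx[labels == c]
        if len(members) == 0:
            continue  # defensive: skip a cluster that ended up empty
        n_keep = min(len(members), max(1, int(round(frac * len(members)))))
        kept.append(rng.choice(members, size=n_keep, replace=False))
    kept = np.concatenate(kept)

    # Combine the kept majority rows with all minority rows.
    idx = np.concatenate([kept, min_idx])
    return X[idx], y[idx]


# Synthetic imbalanced data (assumption: roughly 90% class 0, 10% class 1).
X, y = make_classification(
    n_samples=1100, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_res, y_res = cluster_undersample(X, y, majority_label=0)
n_majority_kept = int(np.sum(y_res == 0))
n_minority = int(np.sum(y_res == 1))
print("Majority kept:", n_majority_kept, "Minority:", n_minority)
```

Because each cluster's share is rounded separately, the two classes end up close in size rather than exactly equal.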
3. Advantages
- Keeps representative structure of the majority class.
- Reduces the risk of throwing away informative samples (a known weakness of random undersampling).
- Often improves classification performance compared to random undersampling.
4. Disadvantages
- More computationally expensive (clustering step).
- Quality depends on clustering algorithm & chosen number of clusters (K).
- Still discards data → possible underfitting if majority class is reduced too much.
5. Example (K-means undersampling)
Suppose:
- Majority class = 10,000 samples
- Minority class = 1,000 samples
Steps:
- Cluster majority class into 1,000 clusters (same as minority size).
- Select 1 representative sample per cluster (e.g., nearest to centroid).
- Result: balanced dataset → 1,000 majority + 1,000 minority.
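A scaled-down version of this example (1,000 majority and 100 minority samples, so it runs quickly) might look like the sketch below. The data is synthetic and the Gaussian parameters are arbitrary assumptions. The representative is picked from within each cluster, so no sample can be chosen twice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(1000, 2))  # majority class (synthetic)
X_min = rng.normal(3.0, 0.5, size=(100, 2))   # minority class (synthetic)

# One cluster per minority sample, so both classes end up about the same size.
km = KMeans(n_clusters=len(X_min), n_init=1, random_state=0).fit(X_maj)

# For each cluster, keep the member that lies closest to the centroid.
rep_idx = []
for c in range(km.n_clusters):
    members = np.flatnonzero(km.labels_ == c)
    if members.size == 0:
        continue  # defensive: skip a cluster that ended up empty
    dists = np.linalg.norm(X_maj[members] - km.cluster_centers_[c], axis=1)
    rep_idx.append(members[np.argmin(dists)])
rep_idx = np.array(rep_idx)

# Combine the representatives with the full minority class.
X_bal = np.vstack([X_maj[rep_idx], X_min])
y_bal = np.concatenate([
    np.zeros(len(rep_idx), dtype=int),  # label 0 = majority
    np.ones(len(X_min), dtype=int),     # label 1 = minority
])
print("Balanced shape:", X_bal.shape)
```

At the full 10,000 / 1,000 scale the same code applies, but fitting 1,000 clusters takes noticeably longer.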
6. Implementation (Python, imbalanced-learn)
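```python
from collections import Counter

from imblearn.under_sampling import ClusterCentroids

X, y = ...  # placeholder: replace with your feature matrix and label vector
print("Original distribution:", Counter(y))

# K-means based undersampling. With the default voting on dense input, the
# majority class is replaced by the K-means centroids (synthetic points);
# pass voting="hard" to keep the real sample nearest to each centroid.
cc = ClusterCentroids(random_state=42)
X_res, y_res = cc.fit_resample(X, y)
print("Resampled distribution:", Counter(y_res))
```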
7. Comparison with Random Undersampling
| Method | How it works | Pros | Cons |
|---|---|---|---|
| Random Undersampling | Randomly drops majority samples | Simple, fast | May lose important information |
| Cluster-based Undersampling | Clusters majority, keeps representative points | Preserves diversity, less info loss | More complex, slower |
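One way to make the "preserves diversity" column concrete is to measure how well the kept samples cover the original majority class. The sketch below is an illustration under assumptions: the `coverage` metric (mean distance from each original point to its nearest kept point), the synthetic two-group data, and the budget of 20 kept samples are all choices made for this example. A lower value means better coverage. Cluster-based selection would usually be expected to score lower than a random draw, but that is not guaranteed on every dataset or seed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)

# Synthetic majority class: one large group and one small, distant group.
X_maj = np.vstack([
    rng.normal(0.0, 1.0, size=(950, 2)),
    rng.normal(8.0, 1.0, size=(50, 2)),
])
n_keep = 20  # number of majority samples to keep (assumption)


def coverage(X_all, X_kept):
    """Mean distance from each original point to its nearest kept point."""
    _, dists = pairwise_distances_argmin_min(X_all, X_kept)
    return float(dists.mean())


# Random undersampling: draw n_keep rows uniformly at random.
random_idx = rng.choice(len(X_maj), size=n_keep, replace=False)

# Cluster-based undersampling: the real sample nearest to each centroid.
km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(X_maj)
cluster_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X_maj)

random_cov = coverage(X_maj, X_maj[random_idx])
cluster_cov = coverage(X_maj, X_maj[cluster_idx])
print(f"Random coverage:        {random_cov:.3f}")
print(f"Cluster-based coverage: {cluster_cov:.3f}")
```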
Summary
- Cluster-based undersampling = undersampling majority class by clustering and selecting representatives.
- Helps maintain majority class diversity → often performs better than random undersampling.
- Trade-off: higher computational cost.
