Cluster-based undersampling

1. Definition

Cluster-based undersampling = an advanced undersampling method for handling class imbalance.
Instead of randomly dropping samples from the majority class, it first clusters the majority samples (e.g., with K-means), then selects representative samples from each cluster.

Objectives:

Keep diversity in the majority class (avoid losing too much information).
Avoid the luck-of-the-draw problem of random undersampling, where whole regions of the majority class can be dropped by chance.

2. How It Works

Separate data into majority and minority classes.
Apply a clustering algorithm (commonly K-means) on the majority class samples.
For each cluster:
- Pick representative samples (e.g., closest to the cluster centroid).
- Or sample a proportion of points from the cluster.
Combine these with the minority class → new balanced dataset.
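
The "sample a proportion of points from the cluster" variant above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name proportional_cluster_undersample, the choice of 5 clusters, and the synthetic Gaussian data are all assumptions made for the example. It uses scikit-learn's KMeans and gives each cluster a quota proportional to its size.

```python
import numpy as np
from sklearn.cluster import KMeans


def proportional_cluster_undersample(X_maj, n_keep, n_clusters=5, seed=0):
    """Return sorted indices of n_keep majority rows, drawn at random from
    each K-means cluster in proportion to the cluster's size.
    (Hypothetical helper written for this example.)"""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_maj)
    sizes = np.bincount(labels, minlength=n_clusters)
    total = len(X_maj)

    # Integer quotas: floor of the proportional share, then hand the
    # leftover slots to the clusters with the largest remainders.
    quotas = (n_keep * sizes) // total
    remainders = (n_keep * sizes) % total
    leftover = int(n_keep - quotas.sum())
    for k in np.argsort(-remainders)[:leftover]:
        quotas[k] += 1

    # Draw each cluster's quota without replacement.
    keep = [
        rng.choice(np.flatnonzero(labels == k), size=int(quotas[k]), replace=False)
        for k in range(n_clusters)
    ]
    return np.sort(np.concatenate(keep))


# Synthetic data (assumed for illustration): 500 majority, 50 minority rows.
data_rng = np.random.default_rng(0)
X_maj = data_rng.normal(loc=0.0, size=(500, 2))
X_min = data_rng.normal(loc=3.0, size=(50, 2))

idx = proportional_cluster_undersample(X_maj, n_keep=len(X_min))

# Combine the retained majority rows with the full minority class.
X_bal = np.vstack([X_maj[idx], X_min])
y_bal = np.array([0] * len(idx) + [1] * len(X_min))
```

Proportional quotas keep large clusters from being under-represented; picking only the point nearest each centroid is the other common choice and trades that for a more "typical" sample per cluster.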

3. Advantages

Keeps representative structure of the majority class.
Reduces risk of throwing away informative samples (problem of random undersampling).
Often improves classification performance compared to random undersampling.

4. Disadvantages

More computationally expensive (clustering step).
Quality depends on clustering algorithm & chosen number of clusters (K).
Still discards data → the model may underfit the majority class if it is reduced too much.

5. Example (K-means undersampling)

Suppose:

Majority class = 10,000 samples
Minority class = 1,000 samples

Steps:

Cluster majority class into 1,000 clusters (same as minority size).
Select 1 representative sample per cluster (e.g., nearest to centroid).
Result: balanced dataset → 1,000 majority + 1,000 minority.
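
The same recipe, scaled down by a factor of 10 so it runs quickly (1,000 majority and 100 minority samples), might look like the sketch below. The synthetic Gaussian data, the variable names, and the KMeans settings are assumptions made for illustration. Each cluster contributes the one member that lies closest to its centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (assumed): 1,000 majority rows and 100 minority rows.
rng = np.random.default_rng(42)
X_maj = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
X_min = rng.normal(loc=3.0, scale=1.0, size=(100, 2))

# One cluster per minority sample, so the classes end up the same size.
n_clusters = len(X_min)
km = KMeans(n_clusters=n_clusters, n_init=1, random_state=42).fit(X_maj)

keep_idx = []
for k in range(n_clusters):
    members = np.flatnonzero(km.labels_ == k)
    if members.size == 0:  # defensive: skip a cluster with no members
        continue
    # Distance of every member of cluster k to that cluster's centroid.
    dists = np.linalg.norm(X_maj[members] - km.cluster_centers_[k], axis=1)
    keep_idx.append(members[np.argmin(dists)])
keep_idx = np.array(keep_idx)

# Balanced dataset: representatives of the majority plus all minority rows.
X_bal = np.vstack([X_maj[keep_idx], X_min])
y_bal = np.concatenate([np.zeros(len(keep_idx), dtype=int),
                        np.ones(len(X_min), dtype=int)])
print(np.bincount(y_bal))
```

Selecting within each cluster's own members (rather than the globally nearest point to each centroid) guarantees that no majority sample is picked twice.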

6. Implementation (Python, imbalanced-learn)

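from collections import Counter

from imblearn.under_sampling import ClusterCentroids

X, y = ...  # features and labels (placeholder: supply your own arrays)
print("Original distribution:", Counter(y))

# K-means based undersampling. Note: on dense input the default voting mode
# replaces the majority class with the K-means centroids themselves
# (synthetic points); pass voting="hard" to keep the nearest original samples.
cc = ClusterCentroids(random_state=42)
X_res, y_res = cc.fit_resample(X, y)

print("Resampled distribution:", Counter(y_res))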


7. Comparison with Random Undersampling

Method                      | How it works                                   | Pros                                 | Cons
Random Undersampling        | Randomly drops majority samples                | Simple, fast                         | May lose important information
Cluster-based Undersampling | Clusters majority, keeps representative points | Preserves diversity, less info loss  | More complex, slower

Summary

Cluster-based undersampling = undersampling majority class by clustering and selecting representatives.
Helps maintain majority class diversity → often performs better than random undersampling.
Trade-off: higher computational cost.

Your Gateway to Data Mastery

Learn, explore, and innovate with data science.
