1. Definition

  • Random Undersampling = a resampling technique for handling class imbalance in classification tasks.
  • Idea: reduce the number of samples in the majority class by randomly removing them until the class distribution is more balanced.

2. Why It’s Used

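  • In imbalanced datasets (e.g., fraud detection, churn prediction), the majority class supplies most of the training signal.
  • A classifier that minimizes overall error can then score well by predicting the majority class almost every time: with 10,000 non-fraud and 1,000 fraud samples, always predicting "non-fraud" is about 91% accurate yet catches no fraud.
  • Undersampling shrinks the majority class so that errors on the minority class carry more weight during training.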

3. How It Works

Example: Binary classification

  • Majority class (non-fraud): 10,000 samples
  • Minority class (fraud): 1,000 samples

Random undersampling:

  • Randomly select 1,000 samples from the majority class (without replacement) and discard the other 9,000
  • New dataset = 1,000 majority + 1,000 minority (2,000 samples, 1:1 ratio)
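
The steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not how any particular library implements it; the function name `random_undersample` and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class.

    Hypothetical helper for illustration; X and y are parallel lists.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())  # size of the smallest class

    keep = []
    for label in counts:
        idx = [i for i, lbl in enumerate(y) if lbl == label]
        # Sample without replacement; the smallest class is kept whole.
        keep.extend(rng.sample(idx, target))

    keep.sort()  # preserve the original sample order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data matching the example: 10,000 non-fraud (0) and 1,000 fraud (1).
y = [0] * 10_000 + [1] * 1_000
X = [[i] for i in range(len(y))]  # placeholder one-feature rows

X_res, y_res = random_undersample(X, y, seed=42)
print(Counter(y_res))
```

Fixing the seed makes the selection reproducible; a different seed keeps a different 1,000 majority samples.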

4. Advantages

  • Simple and fast
  • Balances the dataset, so the model is less likely to ignore the minority class
  • Reduces training time (fewer samples)

5. Disadvantages

  • Information loss: many potentially useful majority samples are discarded
  • Risk of underfitting, since the model trains on fewer data points
  • Works poorly if the minority class is extremely small: balancing to its size discards most of the data and leaves very little to train on
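
To make the information-loss point concrete, the arithmetic below computes how much of the data is thrown away when the majority class is cut down to the minority size. It is a rough sketch for the binary case; the helper name `discarded_fraction` and the second set of counts (10,000 vs. 50) are made up for illustration.

```python
def discarded_fraction(n_majority, n_minority):
    """Fraction of the full dataset dropped by 1:1 random undersampling."""
    dropped = n_majority - n_minority
    return dropped / (n_majority + n_minority)


# Counts from the fraud example: 9,000 of 11,000 samples are dropped.
print(f"{discarded_fraction(10_000, 1_000):.1%}")  # prints 81.8%

# Hypothetical, more extreme case: only 100 samples would remain for training.
print(f"{discarded_fraction(10_000, 50):.1%}")  # prints 99.0%
```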

6. Alternatives / Variants

  • Random Oversampling: duplicate minority samples
  • SMOTE (Synthetic Minority Over-sampling Technique): generate synthetic minority samples
  • Tomek Links / Edited Nearest Neighbors: targeted undersampling (removes borderline/overlapping majority samples rather than random ones)
  • Ensemble methods: combine undersampling with bagging/boosting (e.g., Balanced Random Forest, RUSBoost)
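
One way to see why ensembles pair well with undersampling: each ensemble member can get its own balanced subset, so different majority samples are discarded each time and far more of the majority class is seen overall. The sketch below only builds the index subsets (model training is left out) and is not the implementation of Balanced Random Forest or any specific library; `balanced_subsets` is a hypothetical helper.

```python
import random


def balanced_subsets(y, n_subsets=10, minority_label=1, seed=0):
    """Build index lists that pair all minority samples with a fresh
    random draw of equally many majority samples (illustrative only)."""
    rng = random.Random(seed)
    minority = [i for i, lbl in enumerate(y) if lbl == minority_label]
    majority = [i for i, lbl in enumerate(y) if lbl != minority_label]

    subsets = []
    for _ in range(n_subsets):
        drawn = rng.sample(majority, len(minority))  # without replacement
        subsets.append(sorted(minority + drawn))
    return subsets


y = [0] * 10_000 + [1] * 1_000
subsets = balanced_subsets(y, n_subsets=10, seed=42)

# Majority indices that appear in at least one subset.
used = {i for s in subsets for i in s if y[i] == 0}
print(len(subsets[0]), "samples per subset;", len(used), "distinct majority samples used")
```

Each index list would train one model, with the models' predictions then averaged or voted on.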

7. Example (Python, imbalanced-learn)

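from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Stand-in data (an assumption for illustration): roughly 90% class 0, 10% class 1.
# Replace with your own features X and labels y.
X, y = make_classification(n_samples=11_000, weights=[0.9, 0.1], random_state=42)
print("Original distribution:", Counter(y))

# By default, every class is reduced to the size of the smallest class.
# Resample the training split only; keep validation/test data at the original distribution.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print("Resampled distribution:", Counter(y_res))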

8. Summary

  • Random Undersampling = randomly drop majority class samples to balance the dataset.
  • Pros: simple, faster training, balances data.
  • Cons: loses information, risk of underfitting.
  • Often combined with oversampling or ensemble methods for better results.