1. Definition
- Random Undersampling = a resampling technique for handling class imbalance in classification tasks.
- Idea: shrink the majority class by randomly removing samples until the class distribution reaches a target ratio (often 1:1).
2. Why It’s Used
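- In imbalanced datasets (e.g., fraud detection, churn prediction), the majority class far outnumbers the minority class.
- A classifier that minimizes overall error can score well by predicting the majority class almost every time, so it learns little about the minority class.
- Undersampling shrinks the majority class so that both classes carry comparable weight during training.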
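To see why this bias matters, consider a baseline that always predicts the majority class. The sketch below uses illustrative counts (10,000 majority, 1,000 minority); the counts and the helper name `majority_baseline_metrics` are assumptions for demonstration only. Accuracy looks high even though no minority sample is ever detected.

```python
def majority_baseline_metrics(n_majority, n_minority):
    """Accuracy and minority recall for a model that always predicts the majority class."""
    total = n_majority + n_minority
    accuracy = n_majority / total      # every majority sample counts as correct
    minority_recall = 0 / n_minority   # no minority sample is ever predicted
    return accuracy, minority_recall


accuracy, recall = majority_baseline_metrics(10_000, 1_000)
print(f"accuracy={accuracy:.3f}, minority recall={recall:.3f}")
```

This is why accuracy alone is a poor guide on imbalanced data, and why rebalancing the training set can help.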
3. How It Works
Example: Binary classification
- Majority class (non-fraud): 10,000 samples
- Minority class (fraud): 1,000 samples
Random undersampling:
- Randomly select 1,000 samples from the majority class (without replacement)
- New dataset = 1,000 majority + 1,000 minority = 2,000 samples; the other 9,000 majority samples are discarded
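The steps above can be sketched with the standard library alone. This is a minimal illustration, not how imbalanced-learn implements it: the function name `random_undersample`, the toy data, and the choice to shrink every class to the smallest class size are all assumptions.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class matches the smallest class size."""
    rng = random.Random(seed)
    # Group row indices by class label
    indices_by_class = {}
    for i, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(i)
    target = min(len(idx) for idx in indices_by_class.values())
    # Draw `target` indices per class without replacement
    kept = []
    for idx in indices_by_class.values():
        kept.extend(rng.sample(idx, target))
    kept.sort()  # preserve the original row order
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy data mirroring the example: 10,000 non-fraud (0) and 1,000 fraud (1)
y = [0] * 10_000 + [1] * 1_000
X = [[i] for i in range(len(y))]  # placeholder single-feature rows
X_res, y_res = random_undersample(X, y, seed=42)
print(Counter(y_res))
```

Every minority row is kept, and only the majority class loses samples.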
4. Advantages
- Simple and fast
- Balances the dataset, so the model is less likely to ignore the minority class
- Reduces training time (fewer samples)
5. Disadvantages
- Information loss: many potentially useful majority samples are discarded
- Risk of underfitting, since the model trains on fewer data points
- Works poorly when the minority class is extremely small: matching its size means discarding almost all of the majority data
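The cost of discarding data grows quickly as the minority class shrinks. Here is a small arithmetic sketch, assuming undersampling to a 1:1 ratio and a fixed majority size of 10,000 (the helper name `discarded_fraction` and the minority sizes are hypothetical):

```python
def discarded_fraction(n_majority, n_minority):
    """Fraction of majority samples dropped when undersampling to a 1:1 ratio."""
    return (n_majority - n_minority) / n_majority


for n_minority in (1_000, 100, 10):
    frac = discarded_fraction(10_000, n_minority)
    print(f"minority={n_minority:>5}: discard {frac:.1%} of majority, "
          f"train on {2 * n_minority} samples")
```

With only 10 minority samples, the balanced training set would hold just 20 rows, which is where the underfitting risk becomes severe.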
6. Alternatives / Variants
- Random Oversampling: duplicate minority samples
- SMOTE (Synthetic Minority Over-sampling Technique): generate synthetic minority samples
- Tomek Links / Edited Nearest Neighbors: informed undersampling (removes borderline or overlapping majority samples rather than random ones)
- Ensemble methods: combine undersampling with bagging/boosting (e.g., Balanced Random Forest)
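The ensemble idea in the last bullet can be illustrated without any library: draw several independent balanced subsets, each of which would train one ensemble member. This is a simplified sketch of the bagging-style approach, not the Balanced Random Forest algorithm itself; `balanced_subsets` and the index ranges are hypothetical.

```python
import random


def balanced_subsets(majority_idx, minority_idx, n_subsets, seed=0):
    """Yield index lists pairing all minority rows with a fresh random majority draw."""
    rng = random.Random(seed)
    k = len(minority_idx)
    for _ in range(n_subsets):
        yield rng.sample(majority_idx, k) + list(minority_idx)


majority_idx = list(range(10_000))          # rows 0..9999 are majority
minority_idx = list(range(10_000, 11_000))  # rows 10000..10999 are minority

subsets = list(balanced_subsets(majority_idx, minority_idx, n_subsets=10, seed=42))

# Count how many distinct majority rows appear in at least one subset
minority_set = set(minority_idx)
seen_majority = set()
for subset in subsets:
    seen_majority.update(i for i in subset if i not in minority_set)

coverage = len(seen_majority) / len(majority_idx)
print(f"majority rows used by at least one member: {coverage:.1%}")
```

A single subset sees only 10% of the majority class, while the union across members covers considerably more, which offsets some of the information loss from plain random undersampling.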
7. Example (Python, imbalanced-learn)
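from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Placeholder data (assumption): a synthetic set with roughly 10:1 imbalance; substitute your own X, y
X, y = make_classification(n_samples=11_000, weights=[0.91], random_state=42)
print("Original distribution:", Counter(y))

# Default sampling_strategy="auto" shrinks every class except the minority to the minority's size
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print("Resampled distribution:", Counter(y_res))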
8. Summary
- Random Undersampling = randomly drop majority class samples to balance dataset.
- Pros: simple, faster training, balances data.
- Cons: loses information, risk of underfitting.
- Often combined with oversampling or ensemble methods for better results.
