1. Definition
- Oversampling = a resampling technique for handling class imbalance.
- It increases the number of minority-class samples, either by duplicating existing ones or by generating new ones, so that the class distribution becomes more balanced.
2. Why It’s Used
- In imbalanced datasets (e.g., fraud detection, medical diagnosis), the minority class has very few examples.
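- A classifier trained on imbalanced data tends to predict mostly the majority class, and can still score high accuracy while missing most minority cases.
- Oversampling gives minority examples more weight during training, so errors on that class cost the model more.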
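
To make the point about misleading accuracy concrete, here is a small sketch using made-up label counts (10,000 majority, 1,000 minority) and a degenerate "model" that always predicts the majority class. The counts and variable names are illustrative assumptions, not results from a real classifier.

```python
# Hypothetical label counts: 10,000 majority (0) and 1,000 minority (1).
y_true = [0] * 10_000 + [1] * 1_000
y_pred = [0] * len(y_true)  # a degenerate model that always predicts the majority class

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall = tp / (tp + fn) if tp + fn else 0.0
precision = tp / (tp + fp) if tp + fp else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"accuracy={accuracy:.3f} recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.909 recall=0.000 f1=0.000
```

Accuracy looks fine at about 91%, yet not a single minority case is found, which is why recall and F1 on the minority class are the metrics to watch.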
3. How It Works
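- Random Oversampling: randomly duplicate existing minority samples (sampling with replacement) until class counts are balanced.
- Synthetic Oversampling (e.g., SMOTE, ADASYN): generate new minority samples instead of copying. SMOTE interpolates between a minority sample and one of its nearest minority-class neighbors; ADASYN does the same but generates more samples around minority points that are harder to learn (those with many majority-class neighbors).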
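
The two mechanisms can be sketched in plain Python. This is a simplified illustration, not the imbalanced-learn implementation: the function names `random_oversample` and `smote_like`, the toy points, and the choice of `k` are all assumptions made for the example, and the interpolation step only captures the core idea of SMOTE.

```python
import random


def random_oversample(minority, target, rng):
    """Duplicate randomly chosen minority rows until there are `target` rows."""
    extra = rng.choices(minority, k=target - len(minority))  # sampling with replacement
    return minority + extra


def smote_like(minority, target, k, rng):
    """Add synthetic rows on the segment between a minority row and one of its
    k nearest minority neighbors (simplified; assumes at least 2 minority rows)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    while len(minority) + len(synthetic) < target:
        base = rng.choice(minority)
        # k nearest minority neighbors of `base`, excluding `base` itself
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: sq_dist(base, p))[:k]
        other = rng.choice(neighbors)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return minority + synthetic


rng = random.Random(42)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]  # toy 2-D minority points

duplicated = random_oversample(minority, target=10, rng=rng)
interpolated = smote_like(minority, target=10, k=2, rng=rng)

print(len(duplicated), len(set(duplicated)))  # 10 rows, but still only 4 distinct points
print(len(interpolated))                      # 10 rows, 6 of them synthetic
```

Random oversampling reaches the target count without adding any new points, while the interpolation variant fills in points that lie between existing minority samples.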
4. Advantages
- Reduces the model's tendency to ignore the minority class.
- Simple to implement (random oversampling).
- Can improve recall and F1 for the minority class, often at some cost in precision.
5. Disadvantages
- Random oversampling may cause overfitting, since the model sees exact duplicates of the same minority samples.
- Increases dataset size → longer training time.
- Synthetic methods (like SMOTE) may generate noisy or unrealistic samples, especially where the classes overlap.
6. Example
Dataset:
- Majority (non-fraud) = 10,000
- Minority (fraud) = 1,000
Random oversampling: duplicate fraud samples (9,000 copies added) until there are 10,000 → classes balanced at 10,000 each.
SMOTE: generate 9,000 synthetic fraud samples, for 10,000 in total → same balance, without exact duplicates.
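
The arithmetic behind this example, written out as a short sketch (variable names are illustrative):

```python
majority, minority = 10_000, 1_000

to_add = majority - minority                  # samples to duplicate or synthesize
ratio_before = majority / minority            # imbalance ratio before resampling
total_before = majority + minority
total_after = majority + (minority + to_add)
growth = total_after / total_before           # how much larger the dataset becomes

print(to_add, ratio_before, total_before, total_after, round(growth, 2))
# 9000 10.0 11000 20000 1.82
```

So balancing a 10:1 dataset adds 9,000 rows and makes the dataset about 1.8 times larger; with random oversampling, each original fraud sample ends up appearing 10 times on average.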
7. Python Example (imbalanced-learn)
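from collections import Counter

from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.datasets import make_classification

# Stand-in data (assumption): replace with your own features X and labels y.
X, y = make_classification(n_samples=11_000, weights=[0.91], random_state=42)
print("Original distribution:", Counter(y))

# Random oversampling: duplicate minority rows until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print("Random Oversampling:", Counter(y_res))

# SMOTE oversampling: interpolate new minority rows between nearest neighbors
smote = SMOTE(random_state=42)
X_res_smote, y_res_smote = smote.fit_resample(X, y)
print("SMOTE Oversampling:", Counter(y_res_smote))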
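
One practical caveat when a model is evaluated afterwards: resampling is normally applied to the training split only, so that copies (or interpolations) of a test sample cannot leak into training. The library-free sketch below shows that ordering on hypothetical toy data; the row IDs, the 80/20 split, and all variable names are assumptions for illustration.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical toy data: (row_id, label) pairs, 90 majority and 10 minority.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]

# 1) Split first (stratified 80/20), so the test rows are set aside untouched.
train, test = [], []
for label in (0, 1):
    rows = [r for r in data if r[1] == label]
    rng.shuffle(rows)
    cut = len(rows) * 4 // 5
    train += rows[:cut]
    test += rows[cut:]

# 2) Oversample the minority class inside the training split only.
train_min = [r for r in train if r[1] == 1]
train_maj = [r for r in train if r[1] == 0]
train_bal = train + rng.choices(train_min, k=len(train_maj) - len(train_min))

print(Counter(label for _, label in train_bal))  # Counter({0: 72, 1: 72})
print(Counter(label for _, label in test))       # Counter({0: 18, 1: 2})
```

With imbalanced-learn, the same ordering is usually achieved by calling `fit_resample` on the training split only, or by placing the sampler inside an `imblearn.pipeline.Pipeline` so that it runs only during fitting.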
8. Summary
- Oversampling = increase minority samples to balance dataset.
- Methods: Random Oversampling (duplicate) and Synthetic Oversampling (e.g., SMOTE, ADASYN).
- Pros: can improve minority-class detection (recall, F1).
- Cons: may cause overfitting or unrealistic synthetic samples.
