1. Definition

  • Oversampling = a resampling technique to handle class imbalance.
  • It increases the number of samples in the minority class by duplicating or generating new samples, so that the dataset becomes more balanced.

2. Why It’s Used

  • In imbalanced datasets (e.g., fraud detection, medical diagnosis), the minority class has very few examples.
  • A classifier trained on imbalanced data tends to predict mostly the majority class, because doing so already yields high overall accuracy.
  • Oversampling helps ensure the model pays attention to the minority class.

3. How It Works

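  • Random Oversampling: randomly duplicate existing minority samples (sampling with replacement) until the class counts are balanced.
  • Synthetic Oversampling (e.g., SMOTE, ADASYN): generate new, synthetic minority samples. SMOTE interpolates between a minority sample and one of its nearest minority-class neighbors; ADASYN adaptively generates more samples around minority points that are harder to learn.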
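
The two approaches above can be sketched with NumPy alone. This is a minimal illustration, not the imbalanced-learn implementation: the function names `random_oversample` and `smote_like_sample`, the toy data, and the choice of k = 5 neighbors are all assumptions made for the example. ADASYN's adaptive weighting is not shown.

```python
import numpy as np


def random_oversample(X_min, n_target, rng):
    """Duplicate rows of X_min (sampling with replacement) until n_target rows exist."""
    n_extra = n_target - len(X_min)
    extra_idx = rng.integers(0, len(X_min), size=n_extra)
    return np.vstack([X_min, X_min[extra_idx]])


def smote_like_sample(X_min, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a minority point
    and one of its k nearest minority-class neighbors (SMOTE-style sketch)."""
    # Pairwise Euclidean distances between minority points.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)        # random base points
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))                           # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))   # toy minority class: 20 points, 2 features

X_dup = random_oversample(X_min, n_target=100, rng=rng)
X_syn = smote_like_sample(X_min, n_new=80, k=5, rng=rng)

print("Random oversampling:", X_dup.shape, "unique rows:", len(np.unique(X_dup, axis=0)))
print("SMOTE-style samples:", X_syn.shape)
```

Note that random oversampling adds rows but no new distinct points, while the interpolated samples are new points that lie on line segments between existing minority samples.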

4. Advantages

  • Reduces the model's tendency to ignore the minority class.
  • Simple to implement (random oversampling).
  • Can improve recall and F1 for the minority class, often at some cost in precision.
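
For reference, recall and F1 for the minority class can be computed from confusion counts as below. The counts are made-up numbers chosen only to illustrate the formulas; they are not measurements from any model, and `minority_metrics` is a hypothetical helper name.

```python
def minority_metrics(tp, fp, fn):
    """Return (precision, recall, F1) for the minority class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: a model that rarely predicts the minority class...
before = minority_metrics(tp=30, fp=10, fn=70)
# ...versus one that predicts it more often (more hits, but more false alarms too).
after = minority_metrics(tp=70, fp=40, fn=30)

for name, (p, r, f1) in [("before", before), ("after", after)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In this illustration recall and F1 rise while precision falls, which is the trade-off to watch for when evaluating an oversampled model.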

5. Disadvantages

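  • Random oversampling may cause overfitting, because the model can memorize exact duplicates of minority samples.
  • Increases dataset size → longer training time.
  • Synthetic methods (like SMOTE) may generate noisy or unrealistic samples, for example when interpolating from outliers or in regions where the classes overlap.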

6. Example

Dataset:

  • Majority (non-fraud) = 10,000
  • Minority (fraud) = 1,000

Random oversampling: add 9,000 duplicated fraud samples so that there are 10,000 in total → now balanced (20,000 samples overall).
SMOTE: generate 9,000 synthetic fraud samples so that there are 10,000 in total.
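
The arithmetic of this example can be checked with a few lines of NumPy. This is a sketch that works on labels only (0 = non-fraud, 1 = fraud); the variable names are illustrative assumptions.

```python
import numpy as np

n_majority, n_minority = 10_000, 1_000
y = np.array([0] * n_majority + [1] * n_minority)  # 0 = non-fraud, 1 = fraud

# Number of extra minority samples needed to reach a 1:1 ratio.
n_needed = n_majority - n_minority

# Random oversampling: draw that many minority indices with replacement.
rng = np.random.default_rng(42)
minority_idx = np.flatnonzero(y == 1)
extra_idx = rng.choice(minority_idx, size=n_needed, replace=True)
y_res = np.concatenate([y, y[extra_idx]])

counts = np.bincount(y_res)
print("Imbalance ratio before:", n_majority / n_minority)
print("Extra minority samples needed:", n_needed)
print("Class counts after:", counts.tolist())
print("Total samples after:", len(y_res))
```

After resampling, each original fraud sample appears about ten times on average, which is why exact duplication can encourage overfitting.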


7. Python Example (imbalanced-learn)

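from collections import Counter

from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.datasets import make_classification

# Placeholder data: a synthetic, roughly 10:1 imbalanced dataset.
# Replace with your own features X and labels y.
X, y = make_classification(n_samples=11_000, weights=[10 / 11, 1 / 11], random_state=42)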
print("Original distribution:", Counter(y))

# Random oversampling
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print("Random Oversampling:", Counter(y_res))

# SMOTE oversampling
smote = SMOTE(random_state=42)
X_res_smote, y_res_smote = smote.fit_resample(X, y)
print("SMOTE Oversampling:", Counter(y_res_smote))

8. Summary

  • Oversampling = increase minority samples to balance dataset.
  • Methods: Random Oversampling (duplicate) and Synthetic Oversampling (e.g., SMOTE, ADASYN).
  • Pros: can improve minority-class recall and F1.
  • Cons: may cause overfitting, longer training time, or unrealistic synthetic samples.