1. Definition

  • SMOTE = Synthetic Minority Over-sampling Technique.
  • A method to handle class imbalance by generating synthetic samples of the minority class, instead of simply duplicating existing ones.
  • Proposed by Chawla et al., 2002.

2. Why SMOTE?

  • Random oversampling duplicates minority samples → risk of overfitting.
  • SMOTE creates new, synthetic data points along the line segments between existing minority samples.
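  • Because the new points fill in the space between minority samples rather than repeating them, the classifier tends to learn a broader minority region and is less biased toward the majority class.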
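
A minimal sketch of the contrast, using made-up 2-D points. The interpolation here picks random pairs of minority points rather than nearest neighbors, so it is a simplification and not full SMOTE; all names and values are illustrative assumptions.

```python
import random

rng = random.Random(0)

# Hypothetical 2-D minority samples (illustrative values only).
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.5)]

# Random oversampling: draw existing points with replacement.
oversampled = minority + [rng.choice(minority) for _ in range(6)]

# Interpolation: place each new point between two distinct existing points.
interpolated = list(minority)
for _ in range(6):
    a, b = rng.sample(minority, 2)
    t = rng.random()
    interpolated.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))

# Oversampling only adds copies, so the number of distinct points does not grow.
n_unique_oversampled = len(set(oversampled))
# Interpolation adds points that did not exist before.
n_unique_interpolated = len(set(interpolated))
```

Both lists hold 9 points, but the oversampled one still contains only the 3 original locations, which is why a classifier can memorize them.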

3. How It Works (Algorithm)

  1. For each minority class sample $x$, find its k nearest neighbors (other minority samples).
  2. Randomly select one of these neighbors $x_{nn}$.
  3. Generate a synthetic sample:

$x_{new} = x + \delta \cdot (x_{nn} - x), \quad \delta \sim U(0,1)$

  • This means $x_{new}$ lies somewhere on the line segment between $x$ and $x_{nn}$.
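
The three steps can be sketched in plain Python as follows. This is an illustrative sketch, not the imbalanced-learn implementation: the function name `smote_sketch` and the parameter `n_per_sample` are assumptions, Euclidean distance is assumed, and the neighbor search is brute force.

```python
import math
import random


def smote_sketch(minority, n_per_sample=1, k=5, seed=0):
    """Return synthetic points interpolated between each minority point and its neighbors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # a point cannot have more neighbors than other points
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: k nearest minority neighbors of x (excluding x itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_per_sample):
            # Step 2: pick one of those neighbors at random.
            x_nn = rng.choice(neighbors)
            # Step 3: x_new = x + delta * (x_nn - x), delta drawn uniformly from [0, 1).
            delta = rng.random()
            synthetic.append(tuple(a + delta * (b - a) for a, b in zip(x, x_nn)))
    return synthetic


# Hypothetical minority points at the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_per_sample=3, k=2)
```

With `k=2`, each corner's neighbors are the two adjacent corners, so every synthetic point falls on an edge of the square and never outside it.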

4. Variants

  • Borderline-SMOTE: generates synthetic points only from minority samples near the decision boundary (borderline minority samples).
  • SMOTE + Tomek links / SMOTE + ENN (Edited Nearest Neighbours): combine SMOTE with a cleaning step that removes overlapping or noisy points.
  • ADASYN (Adaptive Synthetic Sampling): generates more synthetic samples for minority points that are harder to learn (those with more majority-class neighbors).
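
To make "borderline" concrete, here is a rough sketch of how minority samples might be sorted into safe, borderline ("danger"), and noisy groups by counting majority points among their m nearest neighbors. The thresholds follow the usual description of Borderline-SMOTE, but the function name, the labels, and the example points are assumptions made for illustration.

```python
import math


def borderline_labels(minority, majority, m=5):
    """Label each minority point 'safe', 'danger', or 'noise' from its m nearest neighbors."""
    # Minority points come first, so index i in `minority` is also index i in `points`.
    points = [(p, True) for p in minority] + [(p, False) for p in majority]
    labels = []
    for i, x in enumerate(minority):
        others = [pt for j, pt in enumerate(points) if j != i]
        nearest = sorted(others, key=lambda pt: math.dist(x, pt[0]))[:m]
        n_majority = sum(1 for _, is_minority in nearest if not is_minority)
        if n_majority == m:
            labels.append("noise")    # surrounded entirely by majority points
        elif n_majority >= m / 2:
            labels.append("danger")   # borderline: at least half the neighbors are majority
        else:
            labels.append("safe")
    return labels


# Hypothetical data: a minority cluster far from the majority, plus two minority points next to it.
minority = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (2.0, 0.0), (2.1, 0.1)]
majority = [(2.2, 0.0), (2.0, 0.2), (3.0, 0.0), (3.0, 0.2)]
labels = borderline_labels(minority, majority, m=3)
# labels == ['safe', 'safe', 'safe', 'danger', 'danger']
```

In this scheme only the "danger" points would be used as starting points for interpolation.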

5. Advantages

  • Reduces overfitting compared to random oversampling, since no exact duplicates are added.
  • Often improves classifier performance on the minority class for imbalanced datasets.
  • Tends to produce smoother, more general decision boundaries.

6. Disadvantages

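  • May generate unrealistic samples if the minority distribution is complex (e.g. several separate clusters), because interpolation assumes the space between neighbors also belongs to the minority class.
  • Risk of introducing class overlap (synthetic points may invade the majority region).
  • More computationally expensive than random oversampling, since it requires a nearest-neighbor search.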
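
The overlap risk is easy to reproduce by hand. The sketch below uses a made-up layout in which two minority points sit on either side of a majority cluster; the coordinates and the fixed `delta` are assumptions chosen to make the effect obvious.

```python
import math

# Hypothetical layout: two minority points on either side of a majority cluster.
x = (0.0, 0.0)
x_nn = (4.0, 0.0)
majority = [(2.0, 0.1), (2.0, -0.1), (1.9, 0.0), (2.1, 0.0)]

delta = 0.5  # fixed here for reproducibility; SMOTE draws it at random
x_new = tuple(a + delta * (b - a) for a, b in zip(x, x_nn))

# Distance from the synthetic point to its closest majority and minority points.
d_majority = min(math.dist(x_new, p) for p in majority)
d_minority = min(math.dist(x_new, p) for p in (x, x_nn))
```

Here `x_new` is (2.0, 0.0), which sits in the middle of the majority cluster: it is far closer to majority points than to either minority point it was built from.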

7. Example

Suppose:

  • Minority class = 100 samples
  • Majority class = 1,000 samples
  • Goal: balance dataset

SMOTE generates 900 synthetic minority samples, giving 1,000 minority vs 1,000 majority samples.
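
The arithmetic generalizes to other target ratios. A small sketch, where the helper `n_synthetic_needed` and its `target_ratio` parameter are hypothetical names; `target_ratio` is the desired minority/majority ratio after resampling, which is also how a float `sampling_strategy` is interpreted in imbalanced-learn.

```python
def n_synthetic_needed(n_minority, n_majority, target_ratio=1.0):
    """Synthetic minority samples needed so that minority / majority reaches target_ratio."""
    target_minority = round(target_ratio * n_majority)
    return max(0, target_minority - n_minority)


n_new = n_synthetic_needed(100, 1000)           # 900 for full balance
per_sample = n_new / 100                        # 9.0 synthetic points per original minority sample
n_half = n_synthetic_needed(100, 1000, 0.5)     # 400 if a 1:2 ratio is enough
```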


8. Python Example (imbalanced-learn)

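from collections import Counter

from imblearn.over_sampling import SMOTE

X, y = ...  # features and labels
print("Original distribution:", Counter(y))

# k_neighbors: number of minority neighbors considered per sample (the k in step 1)
smote = SMOTE(random_state=42, k_neighbors=5)
# fit_resample returns the original data plus the synthetic minority samples
X_res, y_res = smote.fit_resample(X, y)

print("After SMOTE:", Counter(y_res))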

9. Summary

  • SMOTE = synthetic oversampling technique for minority class.
  • Works by interpolating between minority samples and their neighbors.
  • Pros: less overfitting than random oversampling, more general decision boundaries.
  • Cons: may create unrealistic or overlapping samples.
  • Widely used in fraud detection, medical diagnosis, and rare-event prediction.