1. Definition
- SMOTE = Synthetic Minority Over-sampling Technique.
- A method to handle class imbalance by generating synthetic samples of the minority class, instead of simply duplicating existing ones.
- Proposed by Chawla et al., 2002.
2. Why SMOTE?
- Random oversampling duplicates minority samples → risk of overfitting.
- SMOTE creates new, synthetic data points along the line segments between existing minority samples.
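- Because the new points fill in the space between minority samples rather than repeating them, the classifier learns a broader minority region, and the decision boundary is less tied to individual points and less biased toward the majority class.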
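To make the contrast concrete, here is a minimal sketch on a made-up three-point minority set (the data and variable names are illustrative assumptions, not taken from any dataset). Random oversampling can only repeat existing points, while interpolating between two of them produces a point that lies between them.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical 2-D minority samples (illustrative values only).
minority = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0)]

# Random oversampling: draw with replacement, so every draw is an existing point.
duplicated = [rng.choice(minority) for _ in range(6)]
oversampled = minority + duplicated

# SMOTE-style interpolation between two minority points.
a, b = minority[0], minority[1]
delta = rng.random()  # uniform in [0, 1)
synthetic = tuple(ai + delta * (bi - ai) for ai, bi in zip(a, b))

print(len(oversampled), len(set(oversampled)))  # 9 samples, but still only 3 distinct points
print(synthetic)                                # lies on the segment from a to b
```

The oversampled set is larger but carries no new locations in feature space, which is why a model can end up memorizing the repeated points.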
3. How It Works (Algorithm)
- For each minority class sample $x$, find its $k$ nearest neighbors among the other minority samples (commonly $k = 5$).
- Randomly select one of these neighbors $x_{nn}$.
- Generate a synthetic sample:
$x_{new} = x + \delta \cdot (x_{nn} - x), \quad \delta \sim U(0,1)$
- This means $x_{new}$ lies somewhere on the line segment between $x$ and $x_{nn}$.
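The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the imbalanced-learn implementation: the function name `smote_sketch`, the brute-force neighbor search, and the choice to pick the base sample $x$ at random for each synthetic point are all assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric tuples (minority class only).
    Hypothetical helper for illustration; not a library function.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples

    # Step 1: indices of the k nearest minority neighbors of each sample (brute force).
    neighbors = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbors.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))           # base sample x (chosen at random here)
        x = minority[i]
        x_nn = minority[rng.choice(neighbors[i])]  # step 2: one of its k neighbors
        delta = rng.random()                       # step 3: delta ~ U(0, 1)
        synthetic.append(tuple(a + delta * (b - a) for a, b in zip(x, x_nn)))
    return synthetic


# Made-up minority samples, for illustration only.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=6, k=2)
print(new_points)
```

Every generated point is a convex combination of two minority samples, so it stays inside the bounding box of the minority class.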
4. Variants
- Borderline-SMOTE: generates synthetic points only from borderline minority samples, i.e. those lying near the decision boundary.
- SMOTE-Tomek / SMOTE-ENN: apply SMOTE, then a cleaning step (Tomek links or Edited Nearest Neighbours) to remove overlapping or noisy points.
- ADASYN (Adaptive Synthetic Sampling): generates more synthetic samples for minority samples that are harder to learn (those with more majority-class neighbors).
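Borderline-SMOTE and ADASYN both start from the same quantity: how many majority points sit among a minority sample's $k$ nearest neighbors (searched over both classes). The sketch below shows how that count could drive each variant. The helper names, the toy data, and the rounding of the ADASYN quota are assumptions for illustration; library implementations differ in details.

```python
import math


def majority_neighbor_counts(minority, majority, k=5):
    """Count majority points among each minority sample's k nearest neighbors."""
    points = majority + minority
    labels = [0] * len(majority) + [1] * len(minority)  # 0 = majority, 1 = minority
    counts = []
    for i, x in enumerate(minority):
        self_idx = len(majority) + i  # skip the sample itself
        order = sorted(
            (j for j in range(len(points)) if j != self_idx),
            key=lambda j: math.dist(x, points[j]),
        )
        counts.append(sum(1 for j in order[:k] if labels[j] == 0))
    return counts


def borderline_category(n_majority_neighbors, k):
    """Borderline-SMOTE labelling: only 'danger' samples are used for interpolation."""
    if n_majority_neighbors == k:
        return "noise"    # surrounded by majority points
    if n_majority_neighbors >= k / 2:
        return "danger"   # near the boundary
    return "safe"


def adasyn_allocation(counts, k, n_synthetic):
    """ADASYN-style quota: synthetic samples proportional to the majority share."""
    ratios = [c / k for c in counts]
    total = sum(ratios)
    if total == 0:
        return [0] * len(counts)  # no hard samples; this sketch allocates nothing
    # Rounded quotas may not sum exactly to n_synthetic in general.
    return [round(r / total * n_synthetic) for r in ratios]


# Made-up data: a majority cluster near the origin and a minority cluster near (3, 3).
majority = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.0, 0.5)]
minority = [(3.0, 3.0), (3.2, 3.0), (3.0, 3.2), (2.8, 2.9),
            (1.5, 1.0), (1.8, 1.4), (0.25, 0.25)]

k = 3
counts = majority_neighbor_counts(minority, majority, k)
categories = [borderline_category(c, k) for c in counts]
quota = adasyn_allocation(counts, k, n_synthetic=14)
print(counts)      # [0, 0, 0, 0, 2, 2, 3]
print(categories)  # four 'safe', two 'danger', one 'noise'
print(quota)       # [0, 0, 0, 0, 4, 4, 6]
```

Note the difference on the last sample, which sits inside the majority cluster: Borderline-SMOTE labels it noise and skips it, whereas this ADASYN-style weighting gives it the largest quota.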
5. Advantages
- Tends to overfit less than random oversampling, because the new points are not exact copies.
- Often improves minority-class performance (e.g. recall) on imbalanced datasets.
- Creates smoother, more general decision boundaries.
6. Disadvantages
- May generate unrealistic samples if the minority distribution is complex (e.g. several separate clusters).
- Risk of introducing class overlap (synthetic points may invade the majority region).
- More computationally expensive than random oversampling.
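The overlap risk can be seen in a tiny constructed case. In this sketch (all coordinates are made up, and `delta` is fixed at 0.5 for repeatability rather than drawn from $U(0,1)$), two minority points sit on either side of a majority cluster, so a point interpolated between them lands among the majority samples.

```python
import math

# Hypothetical layout: two minority points flanking a small majority cluster.
minority_a = (0.0, 0.0)
minority_b = (4.0, 0.0)
majority = [(1.8, 0.1), (2.0, -0.1), (2.2, 0.0), (2.0, 0.2)]

delta = 0.5  # fixed for the example; SMOTE would draw it at random
synthetic = tuple(a + delta * (b - a) for a, b in zip(minority_a, minority_b))

# Compare the distance to the closest majority point and the closest minority point.
nearest_majority = min(math.dist(synthetic, p) for p in majority)
nearest_minority = min(math.dist(synthetic, p) for p in (minority_a, minority_b))

print(synthetic)                            # (2.0, 0.0)
print(nearest_majority < nearest_minority)  # True: the point sits in the majority region
```

This situation is most likely when the minority class is small or split into separate clusters, since a sample's nearest minority neighbor can then be far away.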
7. Example
Suppose:
- Minority class = 100 samples
- Majority class = 1,000 samples
- Goal: balance dataset
SMOTE generates 900 synthetic minority samples → total = 1,000 vs 1,000.
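The arithmetic behind this example, as a short check. The even split across minority samples is an assumption for illustration; implementations may instead pick base samples at random.

```python
n_minority = 100
n_majority = 1_000

# Number of synthetic samples needed to reach a 1:1 ratio.
n_synthetic = n_majority - n_minority

# If the work were spread evenly, each minority sample would seed this many new points.
per_sample, remainder = divmod(n_synthetic, n_minority)

print(n_synthetic)               # 900
print(per_sample, remainder)     # 9 0
print(n_minority + n_synthetic)  # 1000
```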
8. Python Example (imbalanced-learn)
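from collections import Counter
from imblearn.over_sampling import SMOTE
X, y = ... # features and labels (placeholder: substitute your own arrays)
print("Original distribution:", Counter(y))
# k_neighbors is the k from section 3; random_state makes the resampling reproducible
smote = SMOTE(random_state=42, k_neighbors=5)
# fit_resample returns the original samples plus the synthetic minority samples
X_res, y_res = smote.fit_resample(X, y)
print("After SMOTE:", Counter(y_res))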
Summary
- SMOTE = synthetic oversampling technique for minority class.
- Works by interpolating between minority samples and their neighbors.
- Pros: less overfitting than random oversampling, smoother decision boundaries.
- Cons: may create unrealistic or overlapping samples.
- Widely used in fraud detection, medical diagnosis, rare event prediction.
