SMOTE (Synthetic Minority Over-sampling Technique)

1. Definition

SMOTE (Synthetic Minority Over-sampling Technique) is a method for handling class imbalance.
It generates synthetic samples of the minority class instead of simply duplicating existing ones.
Proposed by Chawla et al., 2002.

2. Why SMOTE?

Random oversampling duplicates minority samples → risk of overfitting.
SMOTE creates new, synthetic data points along the line segments between existing minority samples.
Because the new points are not exact copies, the classifier learns a broader minority region instead of memorizing individual samples.

3. How It Works (Algorithm)

For each minority class sample $x$, find its $k$ nearest neighbors among the other minority samples.
Randomly select one of these neighbors $x_{nn}$.
Generate a synthetic sample:

$x_{new} = x + \delta \cdot (x_{nn} - x), \quad \delta \sim U(0,1)$

This means $x_{new}$ lies somewhere on the line segment between $x$ and $x_{nn}$.
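
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the imbalanced-learn implementation: the names `k_nearest` and `smote_sketch` are hypothetical, distances are Euclidean, and base samples are visited in round-robin order.

```python
import math
import random


def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda p: math.dist(point, p))[:k]


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples.

    `minority` is a list of equal-length numeric tuples. Illustrative only.
    """
    rng = random.Random(seed)
    synthetic = []
    for i in range(n_new):
        idx = i % len(minority)  # visit the minority samples in turn
        x = minority[idx]
        # Neighbors are searched among the other minority samples only.
        others = minority[:idx] + minority[idx + 1:]
        x_nn = rng.choice(k_nearest(x, others, k))
        delta = rng.random()  # delta ~ U(0, 1)
        synthetic.append(tuple(a + delta * (b - a) for a, b in zip(x, x_nn)))
    return synthetic


# Made-up 2-D minority samples for demonstration.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Every generated point is a convex combination of two existing minority samples, so it cannot fall outside the range those samples already span.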

4. Variants

Borderline-SMOTE: generates synthetic points only from minority samples that lie near the decision boundary (borderline minority samples).
SMOTE-Tomek / SMOTE-ENN: combine SMOTE with a cleaning step (Tomek links or Edited Nearest Neighbours) to remove overlapping or noisy points.
ADASYN (Adaptive Synthetic Sampling): generates more synthetic samples for minority points that are harder to learn (those with more majority-class neighbors).
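
To make the ADASYN idea concrete, the sketch below weights each minority point by the fraction of majority samples among its $k$ nearest neighbors and splits a budget of synthetic samples in proportion. The function name `adasyn_allocation` and the toy data are assumptions for illustration. Counts are left fractional here, whereas a real implementation would round them.

```python
import math


def adasyn_allocation(minority, majority, n_new, k=5):
    """Split n_new synthetic samples across minority points, ADASYN-style.

    Each minority point is weighted by the fraction of majority points among
    its k nearest neighbors, so harder points receive more synthetic samples.
    """
    labeled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    ratios = []
    for x in minority:
        neighbors = sorted(
            (item for item in labeled if item[0] is not x),
            key=lambda item: math.dist(x, item[0]),
        )[:k]
        ratios.append(sum(1 for _, label in neighbors if label == 0) / k)
    total = sum(ratios)
    if total == 0:
        # No minority point has majority neighbors: fall back to an even split.
        return [n_new / len(minority)] * len(minority)
    return [n_new * r / total for r in ratios]


# Made-up data: three minority points far from the majority, one surrounded by it.
minority = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0)]
majority = [(5.0, 5.2), (5.2, 5.0), (9.0, 9.0)]
allocation = adasyn_allocation(minority, majority, n_new=10, k=2)
```

With this toy data the whole budget goes to the point at (5.0, 5.0), because it is the only minority sample whose nearest neighbors belong to the majority class.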

5. Advantages

Reduces overfitting compared to random oversampling.
Often improves minority-class recall on imbalanced datasets.
Creates smoother, more general decision boundaries.

6. Disadvantages

May generate unrealistic samples if the minority distribution is complex (for example, several separate clusters).
Risk of introducing class overlap (synthetic points may land in the majority region).
More computationally expensive than random oversampling, since it requires a nearest-neighbor search.
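
The overlap risk is easy to reproduce with a toy case. In the sketch below, two minority samples sit on opposite sides of a majority cluster, and interpolating between them puts the synthetic point inside that cluster. The coordinates and the helper name `interpolate` are made up for illustration.

```python
import math


def interpolate(x, x_nn, delta):
    """SMOTE interpolation: x + delta * (x_nn - x)."""
    return tuple(a + delta * (b - a) for a, b in zip(x, x_nn))


# Two minority samples on opposite sides of a majority cluster (made-up data).
minority_a = (0.0, 0.0)
minority_b = (10.0, 0.0)
majority_cluster = [(4.5, 0.0), (5.0, 0.5), (5.5, 0.0), (5.0, -0.5)]

# Fix delta at 0.5 to place the synthetic point at the midpoint.
synthetic = interpolate(minority_a, minority_b, delta=0.5)

# Distance from the synthetic point to the closest sample of each class.
nearest_majority = min(math.dist(synthetic, p) for p in majority_cluster)
nearest_minority = min(math.dist(synthetic, p) for p in (minority_a, minority_b))
```

Here the synthetic point is labeled minority but sits much closer to majority samples than to any real minority sample, which is the kind of overlap cleaning variants such as SMOTE-ENN try to remove.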

7. Example

Suppose:

Minority class = 100 samples
Majority class = 1,000 samples
Goal: balance dataset

SMOTE generates 900 synthetic minority samples → total = 1,000 vs 1,000.
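
The arithmetic behind this example, written out as a short check (the variable names are illustrative):

```python
n_minority = 100
n_majority = 1_000

# Synthetic samples needed to reach a 1:1 class ratio.
n_synthetic = n_majority - n_minority
# Average number of synthetic samples generated per original minority sample.
per_sample = n_synthetic / n_minority
# Share of the balanced minority class that is synthetic.
synthetic_share = n_synthetic / (n_minority + n_synthetic)
```

So each original minority sample contributes nine synthetic points on average, and 90% of the balanced minority class is synthetic.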

8. Python Example (imbalanced-learn)

from imblearn.over_sampling import SMOTE
from collections import Counter

X, y = ...  # features and labels
print("Original distribution:", Counter(y))

smote = SMOTE(random_state=42, k_neighbors=5)
X_res, y_res = smote.fit_resample(X, y)

print("After SMOTE:", Counter(y_res))

Summary

SMOTE = synthetic oversampling technique for minority class.
Works by interpolating between minority samples and their neighbors.
Pros: reduces overfitting, better boundaries.
Cons: may create unrealistic or overlapping samples.
Widely used in fraud detection, medical diagnosis, rare event prediction.
