NearMiss (Distance-based Undersampling)

1. Definition

NearMiss is a family of undersampling methods for imbalanced classification.
Unlike random undersampling, which drops majority samples at random, NearMiss selects majority samples based on their distance to minority samples.
Goal: keep the majority examples that lie close to the minority class, which tend to be the hardest to classify, while discarding “easy” ones far from it.

2. How It Works

NearMiss comes in three versions: NearMiss-1, NearMiss-2, and NearMiss-3.

NearMiss-1

For each majority class sample, compute its average distance to the k closest minority samples.
Select the majority samples with the smallest average distance.
Keeps majority samples that sit close to minority samples, so the classifier is trained on the region around the class boundary.
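The NearMiss-1 rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the imbalanced-learn implementation: the function name nearmiss1_indices, its parameters, and the toy points are all made up for the example, and Euclidean distance is assumed.

```python
import numpy as np

def nearmiss1_indices(X_maj, X_min, n_keep, k=3):
    """Indices (into X_maj) kept by a NearMiss-1 style rule (hypothetical helper)."""
    # Pairwise Euclidean distances, shape (n_majority, n_minority).
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Average distance from each majority sample to its k closest minority samples.
    avg_closest = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    # Keep the n_keep majority samples with the smallest average distance.
    return np.argsort(avg_closest, kind="stable")[:n_keep]

# Toy data: two minority points near the origin, four majority points.
X_min = np.array([[0.0, 0.0], [1.0, 0.0]])
X_maj = np.array([[0.5, 1.0], [5.0, 5.0], [0.0, 2.0], [9.0, 9.0]])

kept = nearmiss1_indices(X_maj, X_min, n_keep=2, k=2)
print(sorted(kept.tolist()))  # → [0, 2]
```

The two majority points nearest the minority pair are kept; the two distant ones are dropped.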

NearMiss-2

For each majority class sample, compute its average distance to the k farthest minority samples.
Select the majority samples with the smallest average distance to those farthest minority samples.
Favors majority samples that are close to the minority class as a whole, not just to a few nearby minority points.
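A small sketch may help show how NearMiss-2 differs from picking the nearest points. The helper nearmiss2_indices and the toy data are hypothetical, and Euclidean distance is assumed; the only change from a NearMiss-1 style rule is that the average is taken over the k farthest minority samples.

```python
import numpy as np

def nearmiss2_indices(X_maj, X_min, n_keep, k=3):
    """Indices (into X_maj) kept by a NearMiss-2 style rule (hypothetical helper)."""
    # Pairwise Euclidean distances, shape (n_majority, n_minority).
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # After an ascending sort, the last k columns are the k farthest minority samples.
    avg_farthest = np.sort(dists, axis=1)[:, -k:].mean(axis=1)
    # Keep the majority samples whose farthest minority samples are nearest.
    return np.argsort(avg_farthest, kind="stable")[:n_keep]

# Minority class: a small cluster near the origin plus one point far to the right.
X_min = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0]])
# Majority: one point hugging the cluster, one central point, one far away.
X_maj = np.array([[0.5, 0.5], [5.0, 0.0], [20.0, 0.0]])

kept = nearmiss2_indices(X_maj, X_min, n_keep=1, k=1)
print(kept.tolist())  # → [1]
```

The central point at (5, 0) wins because it is reasonably close to every minority sample, whereas the point at (0.5, 0.5) is very close to the cluster but far from the minority sample at (10, 0).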

NearMiss-3

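For each minority sample, select a fixed number of its nearest majority samples.
Ensures every minority sample has nearby majority samples included.
In imbalanced-learn this is a two-step procedure: the nearest majority samples of each minority sample form a shortlist (size set by n_neighbors_ver3), and from that shortlist the majority samples with the largest average distance to their k nearest minority samples are kept.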
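The sketch below follows a two-step reading of NearMiss-3, as described in the imbalanced-learn documentation: shortlist the nearest majority samples of each minority sample, then keep the shortlisted samples whose average distance to their k nearest minority samples is largest. The function nearmiss3_indices, the parameter m (shortlist size per minority sample), and the toy data are assumptions made for illustration.

```python
import numpy as np

def nearmiss3_indices(X_maj, X_min, n_keep, k=3, m=3):
    """Indices (into X_maj) kept by a two-step NearMiss-3 style rule (hypothetical helper)."""
    # Pairwise Euclidean distances, shape (n_majority, n_minority).
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Step 1: for each minority sample (column), shortlist its m nearest majority samples.
    nearest_maj = np.argsort(dists, axis=0, kind="stable")[:m, :]
    shortlist = np.unique(nearest_maj)
    # Step 2: within the shortlist, average distance to the k nearest minority samples.
    avg_closest = np.sort(dists[shortlist], axis=1)[:, :k].mean(axis=1)
    # Keep the shortlisted samples with the largest average distance.
    order = np.argsort(-avg_closest, kind="stable")[:n_keep]
    return shortlist[order]

X_min = np.array([[0.0, 0.0], [10.0, 0.0]])
X_maj = np.array([[0.0, 1.0], [0.0, 3.0], [10.0, 2.0], [50.0, 50.0]])

kept = nearmiss3_indices(X_maj, X_min, n_keep=2, k=1, m=2)
print(sorted(kept.tolist()))  # → [1, 2]
```

The outlier at (50, 50) never reaches the shortlist because it is not among the nearest majority neighbors of any minority sample, so step 2 only chooses among points that are already near the minority class.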

3. Advantages

Keeps informative boundary points rather than discarding data randomly.
Reduces bias toward the majority class.
Can generalize better than random undersampling on some datasets, though this is not guaranteed.

4. Disadvantages

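Computationally expensive: it needs distances between majority and minority samples, so cost grows with the size of both classes.
May remove “easy” samples too aggressively, causing overfitting to difficult regions.
Choice of k (the number of neighbors, n_neighbors in imbalanced-learn) strongly affects results.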
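To make the sensitivity to k concrete, here is a small sketch in which the same NearMiss-1 style rule keeps a different majority sample depending on k. The helper nearmiss1_indices and the toy points are hypothetical and only meant to illustrate the effect; Euclidean distance is assumed.

```python
import numpy as np

def nearmiss1_indices(X_maj, X_min, n_keep, k):
    """NearMiss-1 style selection (hypothetical helper, Euclidean distance)."""
    dists = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Average distance to the k closest minority samples, per majority sample.
    avg_closest = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    return np.argsort(avg_closest, kind="stable")[:n_keep]

# One isolated minority point on the left, two minority points on the right.
X_min = np.array([[0.0, 0.0], [4.0, 0.0], [4.0, 1.0]])
# Majority point 0 hugs the isolated minority point; point 1 sits near the pair.
X_maj = np.array([[0.0, 0.5], [3.0, 0.5]])

kept_k1 = nearmiss1_indices(X_maj, X_min, n_keep=1, k=1)
kept_k3 = nearmiss1_indices(X_maj, X_min, n_keep=1, k=3)
print(kept_k1.tolist(), kept_k3.tolist())  # → [0] [1]
```

With k=1 only the single nearest minority sample counts, so the point next to the isolated minority sample is kept. With k=3 all three minority samples count, and the point that is moderately close to all of them is kept instead.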

5. Example (Python, imbalanced-learn)

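from collections import Counter

from imblearn.under_sampling import NearMiss
from sklearn.datasets import make_classification

# Illustrative synthetic data (roughly 90% / 10%); replace with your own features and labels.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Original distribution:", Counter(y))

# version selects the variant (1, 2, or 3); n_neighbors is k (default 3).
nm = NearMiss(version=1)
X_res, y_res = nm.fit_resample(X, y)

print("Resampled distribution:", Counter(y_res))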

6. Comparison

Method	Selection Criterion	Effect
Random Undersampling	Randomly remove majority samples	Fast but may lose important info
Cluster-based Undersampling	Use clustering to keep representatives	Preserves diversity
NearMiss	Keep majority samples closest/farthest to minority (distance-based)	Focuses on class boundaries

Summary

NearMiss = distance-based undersampling strategy for imbalanced datasets.
Versions differ in how they pick majority samples: by distance to the closest minority samples, by distance to the farthest minority samples, or per minority sample.
Helps refine decision boundary but is computationally heavier than random methods.
