1. Definition

  • NearMiss = a family of undersampling methods for imbalanced classification.
  • Unlike random undersampling, which drops majority samples randomly, NearMiss selects majority samples based on their distance to minority samples.
  • Goal: keep informative majority examples that are hardest to classify, while discarding “easy” ones.

2. How It Works

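NearMiss comes in several versions; the three commonly used are NearMiss-1, NearMiss-2, and NearMiss-3. All of them rank or pick majority samples by their distances to minority samples and differ in which distances they use.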

NearMiss-1

  • For each majority class sample, compute its average distance to the k closest minority samples.
  • Select the majority samples with the smallest average distance.
  • Keeps majority samples that lie close to minority samples, so the retained data concentrates around the class boundary.
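
The NearMiss-1 steps above can be sketched in plain Python. This is a minimal illustration, not the imbalanced-learn implementation: the function name near_miss_1, the parameter n_keep (how many majority samples to retain), and the toy points are all assumptions made for the example.

```python
import math

def near_miss_1(majority, minority, k, n_keep):
    """Return indices of the n_keep majority points whose mean distance
    to their k nearest minority points is smallest.
    Assumes k <= len(minority) and n_keep <= len(majority)."""
    scored = []
    for idx, point in enumerate(majority):
        # Distances from this majority point to every minority point, ascending.
        dists = sorted(math.dist(point, m) for m in minority)
        mean_nearest = sum(dists[:k]) / k
        scored.append((mean_nearest, idx))
    scored.sort()  # smallest mean distance first
    return sorted(idx for _, idx in scored[:n_keep])

# Toy data (hypothetical): three minority points near the origin.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (2.0, 2.0), (5.0, 5.0), (1.0, 1.0), (8.0, 0.0)]

kept = near_miss_1(majority, minority, k=2, n_keep=3)
print(kept)  # [0, 1, 3]
```

The two distant points, (5, 5) and (8, 0), are dropped; the three majority points nearest the minority cluster are kept.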

NearMiss-2

  • For each majority class sample, compute its average distance to the k farthest minority samples.
  • Select majority samples with smallest average distance to those farthest minority samples.
  • Favors majority samples that are reasonably close to the whole minority class, rather than to a single local group of minority samples.
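
A matching sketch for NearMiss-2 changes only which distances are averaged: the k largest instead of the k smallest. As before, near_miss_2, n_keep, and the toy points are illustrative assumptions, not library API.

```python
import math

def near_miss_2(majority, minority, k, n_keep):
    """Return indices of the n_keep majority points whose mean distance
    to their k farthest minority points is smallest.
    Assumes k <= len(minority) and n_keep <= len(majority)."""
    scored = []
    for idx, point in enumerate(majority):
        # Distances to every minority point, largest first.
        dists = sorted((math.dist(point, m) for m in minority), reverse=True)
        mean_farthest = sum(dists[:k]) / k
        scored.append((mean_farthest, idx))
    scored.sort()  # smallest mean distance first
    return sorted(idx for _, idx in scored[:n_keep])

# Toy data (hypothetical): minority points form two groups, near x=0 and x=10.
minority = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
majority = [(0.5, 1.0), (5.0, 0.0), (11.0, 1.0), (20.0, 0.0)]

kept = near_miss_2(majority, minority, k=2, n_keep=2)
print(kept)  # [0, 1]
```

Here the point (5, 0) gets the lowest score (5.0) because it is moderately close to every minority point, while (11, 1) is penalized for being far from the group near the origin even though it sits right next to one minority point.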

NearMiss-3

  • For each minority sample, select a fixed number of its nearest majority samples.
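  • Ensures every minority sample has nearby majority samples included.
  • Note: imbalanced-learn documents its version=3 as a two-step procedure, in which this per-minority selection is followed by a second filter that keeps the shortlisted majority samples with the largest average distance to their nearest minority neighbors.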
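
The per-minority selection described above can be sketched as follows. This covers only that selection step and makes no attempt to mirror any library's full NearMiss-3 behavior; the name near_miss_3_shortlist, the parameter m (majority neighbors kept per minority sample), and the toy points are assumptions for illustration.

```python
import math

def near_miss_3_shortlist(majority, minority, m):
    """For each minority point, keep the indices of its m nearest
    majority points; return the union of those indices, sorted."""
    kept = set()
    for q in minority:
        # Majority indices ordered by distance to this minority point.
        ranked = sorted(range(len(majority)),
                        key=lambda i: math.dist(majority[i], q))
        kept.update(ranked[:m])
    return sorted(kept)

# Toy data (hypothetical): two minority points far apart.
minority = [(0.0, 0.0), (10.0, 0.0)]
majority = [(1.0, 0.0), (2.0, 0.0), (5.0, 1.0),
            (9.0, 0.0), (12.0, 0.0), (30.0, 0.0)]

kept = near_miss_3_shortlist(majority, minority, m=2)
print(kept)  # [0, 1, 3, 4]
```

Each minority point contributes its own two neighbors, so both regions stay represented. Because the result is a union, its size depends on how much the neighborhoods overlap and is not fixed in advance.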

3. Advantages

  • Keeps informative boundary points rather than discarding data randomly.
  • Reduces bias toward the majority class.
  • Can generalize better than random undersampling, provided the retained boundary samples are not dominated by noise or outliers.

4. Disadvantages

  • Computationally expensive: it needs distances between majority and minority samples, so a brute-force search scales with the product of the two class sizes.
  • May remove “easy” samples too aggressively, causing overfitting to difficult regions.
  • Choice of k (neighbors) strongly affects results.
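
To make the sensitivity to k concrete, the sketch below reruns a NearMiss-1 style selection with k=1 and k=3 on the same toy data. The function near_miss_1, the parameter n_keep, and the points are hypothetical and chosen so that one minority point is an isolated outlier.

```python
import math

def near_miss_1(majority, minority, k, n_keep):
    """Indices of the n_keep majority points with the smallest mean
    distance to their k nearest minority points (illustrative sketch)."""
    scored = []
    for idx, point in enumerate(majority):
        dists = sorted(math.dist(point, m) for m in minority)
        scored.append((sum(dists[:k]) / k, idx))
    scored.sort()
    return sorted(idx for _, idx in scored[:n_keep])

# Three minority points near the origin plus one outlier at (10, 10).
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
majority = [(10.0, 10.5), (1.5, 1.5), (2.0, 2.0), (6.0, 6.0)]

kept_k1 = near_miss_1(majority, minority, k=1, n_keep=2)
kept_k3 = near_miss_1(majority, minority, k=3, n_keep=2)
print(kept_k1)  # [0, 1]
print(kept_k3)  # [1, 2]
```

With k=1, the majority point sitting next to the minority outlier ranks first. With k=3, that point's score is dominated by its distances to the main minority cluster, so it drops out and a point near the cluster replaces it.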

5. Example (Python, imbalanced-learn)

from imblearn.under_sampling import NearMiss
from collections import Counter

X, y = ...  # features and labels
print("Original distribution:", Counter(y))

nm = NearMiss(version=1)  # version may be 1, 2, or 3; n_neighbors (k) defaults to 3
X_res, y_res = nm.fit_resample(X, y)

print("Resampled distribution:", Counter(y_res))

6. Comparison

Method | Selection Criterion | Effect
Random Undersampling | Randomly remove majority samples | Fast but may lose important info
Cluster-based Undersampling | Use clustering to keep representatives | Preserves diversity
NearMiss | Keep majority samples closest/farthest to minority (distance-based) | Focuses on class boundaries

7. Summary

  • NearMiss = distance-based undersampling strategy for imbalanced datasets.
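  • Versions differ in how they pick majority samples: by distance to the nearest minority samples (NearMiss-1), by distance to the farthest minority samples (NearMiss-2), or as the nearest neighbors of each minority sample (NearMiss-3).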
  • Helps refine decision boundary but is computationally heavier than random methods.