1. Definition
- NearMiss = a family of undersampling methods for imbalanced classification.
- Unlike random undersampling, which drops majority samples randomly, NearMiss selects majority samples based on their distance to minority samples.
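- Goal: keep the majority examples that lie near the minority class, since these are the hardest to classify and the most informative for the decision boundary, while discarding “easy” ones far from it.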
2. How It Works
NearMiss comes in three versions (NearMiss-1, NearMiss-2, NearMiss-3), which differ in how they score majority samples.
NearMiss-1
- For each majority class sample, compute its average distance to the k closest minority samples.
- Select the majority samples with the smallest average distance.
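- Keeps majority samples that sit close to at least a few minority samples, so the retained data concentrates around the class boundary. Sensitive to noisy or outlying minority points, because one stray minority sample can pull in its majority neighbors.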
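The NearMiss-1 rule can be sketched in plain Python. This is a minimal illustration, not the imbalanced-learn implementation: the function name `nearmiss1`, its parameters, and the toy points are all assumptions made for the example.

```python
from math import dist


def nearmiss1(majority, minority, n_keep, k=3):
    """Return indices of the n_keep majority points with the smallest
    mean distance to their k nearest minority points (hypothetical helper)."""
    scores = []
    for i, m in enumerate(majority):
        nearest = sorted(dist(m, p) for p in minority)[:k]  # k smallest distances
        scores.append((sum(nearest) / len(nearest), i))
    scores.sort()  # smallest mean distance first
    return sorted(i for _, i in scores[:n_keep])


# Toy data (made up): the minority class sits near the origin.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0), (2.0, 0.0)]

kept = nearmiss1(majority, minority, n_keep=3, k=2)
print(kept)  # [0, 2, 4]
```

The two distant majority points (indices 1 and 3) are dropped, and the three points closest to the minority cluster are kept.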
NearMiss-2
- For each majority class sample, compute its average distance to the k farthest minority samples.
- Select majority samples with smallest average distance to those farthest minority samples.
- Favors majority samples that are close to the minority class as a whole (even its most distant members), rather than to just a few nearby minority points.
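A small sketch of the NearMiss-2 rule shows how it differs from picking by nearest neighbors. The helper name `nearmiss2` and the toy data are assumptions for illustration only.

```python
from math import dist


def nearmiss2(majority, minority, n_keep, k=3):
    """Return indices of the n_keep majority points with the smallest
    mean distance to their k FARTHEST minority points (hypothetical helper)."""
    scores = []
    for i, m in enumerate(majority):
        farthest = sorted((dist(m, p) for p in minority), reverse=True)[:k]
        scores.append((sum(farthest) / len(farthest), i))
    scores.sort()  # smallest mean distance to the farthest points first
    return sorted(i for _, i in scores[:n_keep])


# Toy data (made up): two minority clusters, left and right.
minority = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
majority = [(0.5, 0.5), (5.0, 0.5), (20.0, 0.5), (-3.0, 0.5)]

kept = nearmiss2(majority, minority, n_keep=1, k=2)
print(kept)  # [1]
```

Point 0 is the majority sample nearest to any minority point, but it is far from the right-hand cluster. NearMiss-2 instead keeps point 1, which sits between the two clusters and is moderately close to all of them.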
NearMiss-3
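- For each minority sample, select a fixed number of its nearest majority samples.
- Ensures every minority sample has nearby majority samples included.
- In imbalanced-learn this is a two-step procedure: the per-minority selection above builds a shortlist (size set by n_neighbors_ver3), then the shortlisted majority samples with the largest average distance to their k nearest minority samples are kept.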
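The per-minority selection can be sketched as follows. This covers only the step described above; library implementations may apply a further filter to reach an exact target count. The name `nearmiss3_candidates`, the parameter `n_per_minority`, and the toy data are assumptions.

```python
from math import dist


def nearmiss3_candidates(majority, minority, n_per_minority=1):
    """Union of each minority point's n_per_minority nearest majority
    points, returned as sorted indices (hypothetical helper)."""
    keep = set()
    for p in minority:
        # Rank majority indices by distance to this minority point.
        ranked = sorted(range(len(majority)), key=lambda i: dist(majority[i], p))
        keep.update(ranked[:n_per_minority])
    return sorted(keep)


# Toy data (made up): two minority points far apart on a line.
minority = [(0.0, 0.0), (10.0, 0.0)]
majority = [(1.0, 0.0), (2.0, 0.0), (9.0, 0.0), (5.0, 5.0), (20.0, 0.0)]

kept = nearmiss3_candidates(majority, minority, n_per_minority=1)
print(kept)  # [0, 2]
```

Because the selection is driven from the minority side, each minority point contributes its own neighbors, so no minority region is left without nearby majority samples.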
3. Advantages
- Keeps informative boundary points rather than discarding data randomly.
- Reduces bias toward the majority class.
- Can generalize better than random undersampling on some datasets, though this depends on the data and is not guaranteed.
4. Disadvantages
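- Computationally expensive: needs distances between majority and minority samples, so cost grows with (number of majority samples) × (number of minority samples).
- May remove “easy” samples too aggressively, so the model sees little of the bulk majority distribution and can overfit to difficult boundary regions.
- Choice of k (number of neighbors) strongly affects which samples are kept, and therefore the results.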
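The sensitivity to k can be seen on a tiny example. In this sketch (helper name and data are made up), the same NearMiss-1 style score picks a different majority point depending on whether k is 1 or 3.

```python
from math import dist


def mean_knn_distance(point, minority, k):
    """Mean distance from `point` to its k nearest minority points."""
    nearest = sorted(dist(point, p) for p in minority)[:k]
    return sum(nearest) / len(nearest)


# Toy data (made up): one isolated minority point and a cluster of three.
minority = [(0.0, 0.0), (6.0, 0.0), (6.0, 1.0), (7.0, 0.0)]
majority = [(0.5, 0.0), (6.5, 0.5)]

# For each k, find the index of the majority point with the smallest score.
best_by_k = {
    k: min(range(len(majority)),
           key=lambda i: mean_knn_distance(majority[i], minority, k))
    for k in (1, 3)
}
print(best_by_k)  # {1: 0, 3: 1}
```

With k=1 the point next to the isolated minority sample wins. With k=3 the point inside the minority cluster wins, because the isolated sample no longer dominates the average.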
5. Example (Python, imbalanced-learn)
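from collections import Counter

from imblearn.under_sampling import NearMiss
from sklearn.datasets import make_classification

# Synthetic imbalanced data (about 90% / 10%) standing in for your own features and labels
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Original distribution:", Counter(y))

# version can be 1, 2, or 3; n_neighbors sets k (default 3)
nm = NearMiss(version=1, n_neighbors=3)
X_res, y_res = nm.fit_resample(X, y)
print("Resampled distribution:", Counter(y_res))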
6. Comparison
| Method | Selection Criterion | Effect |
|---|---|---|
| Random Undersampling | Randomly remove majority samples | Fast but may lose important info |
| Cluster-based Undersampling | Use clustering to keep representatives | Preserves diversity |
| NearMiss | Keep majority samples with the smallest average distance to minority samples (nearest k, farthest k, or per minority sample) | Focuses on class boundaries |
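For contrast with the distance-based rule, the random undersampling row of the table amounts to the following baseline. This is a sketch with a hypothetical helper name, not a library API. It ignores where the points lie, which is why it is fast but may discard informative samples.

```python
import random


def random_undersample(majority, n_keep, seed=0):
    """Pick n_keep majority indices uniformly at random, without
    looking at distances to the minority class (hypothetical helper)."""
    rng = random.Random(seed)  # seeded for reproducibility
    return sorted(rng.sample(range(len(majority)), n_keep))


# Toy data (made up).
majority = [(0.5, 0.5), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0), (2.0, 0.0)]
kept = random_undersample(majority, n_keep=3)
print(kept)
```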
7. Summary
- NearMiss = distance-based undersampling strategy for imbalanced datasets.
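- Versions differ in how they score majority samples: by distance to the nearest minority samples (NearMiss-1), to the farthest minority samples (NearMiss-2), or by selecting neighbors per minority sample (NearMiss-3).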
- Helps refine decision boundary but is computationally heavier than random methods.
