1) General Meaning
- Downsampling means reducing the number of observations in a dataset.
- The idea is to make the dataset smaller or more balanced, depending on context.
It appears in two main areas:
- Imbalanced Classification (resampling strategy)
- Time Series or Signal Processing (reducing frequency)
2) Downsampling in Imbalanced Classification
- In classification, one class (usually “negative”) can dominate the dataset.
- Example: Fraud detection dataset → 99% non-fraud, 1% fraud.
- If we train directly on such data, the model may learn to ignore the minority class.
Downsampling strategy:
- Randomly remove samples from the majority class until the dataset is more balanced.
- Example: 10,000 negatives + 100 positives → downsample negatives to 200 → now 200 vs 100.
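The example above can be sketched with the standard library alone; `downsample` is a hypothetical helper, and the 2:1 ratio matches the 200-vs-100 figure:

```python
import random

def downsample(majority, minority, ratio=2.0, seed=42):
    """Randomly keep only ratio * len(minority) majority samples."""
    rng = random.Random(seed)
    k = min(len(majority), int(ratio * len(minority)))
    return rng.sample(majority, k) + minority

majority = [("neg", i) for i in range(10_000)]
minority = [("pos", i) for i in range(100)]

balanced = downsample(majority, minority, ratio=2.0)
print(len(balanced))  # 300: 200 negatives + 100 positives
```

Fixing the seed makes the random removal reproducible, which matters when comparing models trained on the downsampled set.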
Benefit:
- Forces the model to “see” the minority class more clearly.
Risks:
- Throwing away data → loss of information.
- If dataset is small, this can hurt performance.
Variants:
- Random undersampling: choose majority samples randomly.
- Cluster-based undersampling: keep representative samples from clusters.
- NearMiss (distance-based): keep majority samples that are closest to minority examples.
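The NearMiss idea can be illustrated with a simplified, brute-force sketch on 1-D data: score each majority sample by its average distance to the minority points and keep the closest ones (the real algorithm uses nearest-neighbour subsets, not all minority points):

```python
def nearmiss_select(majority, minority, n_keep):
    """Keep the n_keep majority samples with the smallest
    average distance to the minority samples (simplified NearMiss)."""
    def avg_dist(x):
        return sum(abs(x - m) for m in minority) / len(minority)
    return sorted(majority, key=avg_dist)[:n_keep]

majority = [0.1, 0.5, 2.0, 5.0, 9.0, 10.0]
minority = [9.5, 10.5]
print(nearmiss_select(majority, minority, n_keep=2))  # [10.0, 9.0]
```

Keeping the majority points near the class boundary preserves exactly the region where the classifier's decision is hardest.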
3) Downsampling in Time Series / Signals
- Means reducing the sampling rate.
- Example: you have sensor readings every 1 ms → downsample to every 10 ms.
Why?
- To reduce storage or computational cost.
- To remove noise / smooth data.
- To match another dataset’s frequency (e.g., align weather data hourly with energy usage hourly).
How?
- Pick every k-th sample (simple subsampling).
- Aggregate within bins (e.g., average temperature per hour).
- Often requires low-pass filtering first (anti-aliasing) to avoid distortions.
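The first two approaches can be sketched with plain Python; the 1 ms → 10 ms example above corresponds to k = 10 (no anti-aliasing filter is applied here):

```python
def subsample(signal, k):
    """Keep every k-th sample (simple subsampling, no low-pass filter)."""
    return signal[::k]

def bin_average(signal, k):
    """Aggregate: average each consecutive bin of k samples."""
    return [sum(signal[i:i + k]) / len(signal[i:i + k])
            for i in range(0, len(signal), k)]

readings = list(range(100))          # e.g. one sensor reading per ms
print(len(subsample(readings, 10)))  # 10 samples left
print(bin_average(readings, 10)[0])  # 4.5, mean of the first 10 readings
```

Bin averaging already acts as a crude low-pass filter, which is why it tends to alias less than bare subsampling.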
4) Pros and Cons
Pros:
- In classification: restores balance → the minority class carries real weight during training.
- In time series: reduces noise, smaller files, faster processing.
Cons:
- Classification: may discard important majority information.
- Time series: may lose details or introduce aliasing.
5) Related Concepts
- Oversampling: add more minority samples (duplicate or synthetic, e.g., SMOTE).
- Class weighting: instead of changing data, adjust loss function.
- Subsampling vs Downsampling: often used interchangeably, but subsampling usually means random selection.
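Class weighting can be sketched by weighting each class inversely to its frequency; the formula below follows the "balanced" heuristic that scikit-learn uses for `class_weight="balanced"`, shown here in plain Python:

```python
from collections import Counter

def balanced_weights(labels):
    """weight(c) = n_samples / (n_classes * count(c)),
    so rarer classes get proportionally larger loss weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

labels = ["neg"] * 10_000 + ["pos"] * 100
w = balanced_weights(labels)
print(w["pos"] / w["neg"])  # 100.0: each positive counts 100x in the loss
```

Unlike downsampling, this keeps every sample, so no majority-class information is thrown away.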
Summary:
- Downsampling (classification) = cut down majority class samples to fix imbalance.
- Downsampling (time series) = reduce data frequency to save resources or smooth noise.
- Always a trade-off: balance vs. information loss.
