1. General Definition

  • Subsampling = selecting a subset of the original dataset (or signal) for analysis or training.
  • It can be done for efficiency, class balancing, or signal processing (reducing sample rate).

2. In Machine Learning / Data Science

  • Often used when datasets are too large or imbalanced.
  • Random subsampling: randomly pick a subset of data (like bootstrapping but without replacement).
  • Undersampling (subsampling majority class): reduce the size of the majority class to balance with the minority class.
  • Cross-validation subsampling: hold out a different subset of the data in each fold (or in each random split) and validate the model on it.
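
As an illustration of the undersampling bullet, here is a minimal NumPy sketch. The helper name undersample_majority and the toy 950/50 class split are assumptions made for this example, not part of any library. It trims every class down to the size of the rarest one.

```python
import numpy as np

def undersample_majority(X, y, seed=0):
    """Randomly drop rows so every class has as many rows as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the rarest class
    keep = np.concatenate([
        # pick n_keep row indices per class, without replacement
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid returning rows grouped by class
    return X[keep], y[keep]

# Toy imbalanced data (made-up numbers): 950 negatives, 50 positives
X = np.arange(2000).reshape(1000, 2)
y = np.array([0] * 950 + [1] * 50)

X_bal, y_bal = undersample_majority(X, y)
print(np.bincount(y_bal))  # class counts after balancing
```

Equalizing to the rarest class is only one possible target; a milder ratio (for example 2:1) discards less data and may be a better trade-off.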

Example:

  • Dataset: 1,000,000 samples
  • You take a subsample of 100,000 to train faster.
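
The example above could be sketched as follows, assuming the data fits in memory as NumPy arrays. The arrays here are random stand-ins, and the names X, y and idx are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)
n_total, n_sub = 1_000_000, 100_000

# Stand-in data; a real dataset would be loaded here instead
X = rng.normal(size=(n_total, 5)).astype(np.float32)
y = rng.integers(0, 2, size=n_total)

# Draw row indices without replacement so no sample appears twice
idx = rng.choice(n_total, size=n_sub, replace=False)
X_sub, y_sub = X[idx], y[idx]

print(X_sub.shape, y_sub.shape)
```

Sampling indices rather than rows keeps X and y aligned and makes the subsample reproducible from the seed alone.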

3. In Signal Processing / Time Series

  • Subsampling = reducing the sampling rate of a signal (a form of downsampling).
  • Example:
    • Original: audio sampled at 44.1 kHz
    • Subsampled: reduce to 22.05 kHz
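  • Must apply a low-pass filter first, with a cutoff below the new Nyquist frequency (half the new sample rate, 11.025 kHz in this example), to avoid aliasing (distortion caused by high frequencies folding into lower ones).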
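
To make the aliasing point concrete, the sketch below halves the rate of a synthetic test signal two ways: by keeping every second sample with no filtering, and with scipy.signal.decimate, which low-pass filters first. The tone frequencies and the helper magnitude_at are assumptions chosen for this illustration.

```python
import numpy as np
import scipy.signal as sps

fs = 44_100                      # original sample rate in Hz
t = np.arange(fs) / fs           # one second of sample times
# A 1 kHz tone to keep, plus a 15 kHz tone above the new Nyquist (11.025 kHz)
x = np.sin(2 * np.pi * 1_000 * t) + np.sin(2 * np.pi * 15_000 * t)

naive = x[::2]                   # keeps every 2nd sample, no filtering
filtered = sps.decimate(x, 2)    # low-pass filters, then keeps every 2nd sample

def magnitude_at(sig, freq_hz, rate_hz):
    """Normalized magnitude of the FFT bin closest to freq_hz."""
    spectrum = np.abs(np.fft.rfft(sig)) / len(sig)
    return spectrum[int(round(freq_hz * len(sig) / rate_hz))]

new_fs = fs // 2                 # 22,050 Hz after subsampling
alias_hz = new_fs - 15_000       # the 15 kHz tone folds down to 7,050 Hz

naive_alias = magnitude_at(naive, alias_hz, new_fs)
filtered_alias = magnitude_at(filtered, alias_hz, new_fs)
print("alias magnitude without filtering:", naive_alias)
print("alias magnitude with decimate:    ", filtered_alias)
```

Without filtering, the 15 kHz tone reappears at 7,050 Hz at close to full strength; with the filter it is strongly attenuated while the 1 kHz tone is kept.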

4. Advantages

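  • Faster training (less data); inference is also faster for models whose prediction cost grows with the training set size, such as k-nearest neighbors.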
  • Reduces storage and computation cost.
  • In imbalanced datasets, helps balance class proportions (if applied to the majority class).

5. Disadvantages

  • Information loss: discards data, which may reduce accuracy.
  • If subsampling isn’t stratified, it may change class distribution unintentionally.
  • In signals, subsampling without low-pass filtering introduces aliasing artifacts that cannot be removed afterward.
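
The stratification point can be sketched with scikit-learn's train_test_split, used here only to draw a class-preserving subsample (the held-out remainder is discarded). The 90/10 toy labels are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy labels with a 90/10 class split (made-up data)
y = np.array([0] * 9_000 + [1] * 1_000)
X = rng.normal(size=(len(y), 3))

# Keep 1,000 rows; stratify=y preserves the class ratio in the subsample
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=1_000, stratify=y, random_state=0
)

print(np.bincount(y_sub))  # per-class counts in the subsample
```

A plain random draw of the same size would match the 90/10 ratio only on average, and the deviation matters most for rare classes.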

6. Examples

In ML (Python, scikit-learn):

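from sklearn.utils import resample

# Subsample dataset: X and y are the existing feature and label arrays.
# replace=False is required; the default (replace=True) draws a bootstrap sample with duplicates.
X_sub, y_sub = resample(X, y, replace=False, n_samples=10000, random_state=42)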

In Signal Processing (Python, scipy):

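import scipy.signal as sps

# Downsample signal by factor of 2.
# sps.resample uses an FFT method, which band-limits the signal but assumes it is periodic;
# sps.decimate(signal, 2) is an alternative that applies an anti-aliasing filter first.
signal_sub = sps.resample(signal, len(signal)//2)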

7. Comparison

Term               Context                     Meaning
Undersampling      Imbalanced classification   Reduce majority class samples
Oversampling       Imbalanced classification   Increase minority class samples
Subsampling        General ML                  Take subset of data for efficiency or balance
Subsampling (DSP)  Signals                     Reduce sampling rate (downsampling)

Summary

  • Subsampling = selecting a smaller subset of data or reducing signal sampling rate.
  • In ML: improves efficiency or balances classes.
  • In DSP: reduces sample rate → must use low-pass filtering to avoid aliasing.
  • Pros: faster, cheaper. Cons: possible loss of information.