1) General Idea
- The bootstrap is a resampling method used to estimate uncertainty (variance, confidence intervals) of a statistic when the underlying distribution is unknown.
- Idea: instead of relying on analytical formulas, use the data itself as a proxy for the population. You "simulate" drawing new datasets by sampling with replacement from the observed data.
2) How it works (step by step)
Suppose you have a dataset of size $n$:
$\{x_1, x_2, …, x_n\}$.
- Draw a bootstrap sample: pick $n$ points with replacement from the original data.
- Some observations appear multiple times, some not at all.
- Compute the statistic of interest (e.g., mean, median, regression coefficient) on this resampled dataset.
- Repeat the resampling and computation many times (e.g., 1,000 or 10,000).
- Look at the distribution of bootstrap statistics.
This distribution approximates the sampling distribution of the statistic.
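The procedure above can be sketched in a few lines of stdlib Python; `bootstrap` is a hypothetical helper name, and the resample count and seed are arbitrary choices for illustration:

```python
import random
import statistics

def bootstrap(data, statistic, n_resamples=10_000, seed=0):
    """Approximate the sampling distribution of `statistic` by
    resampling `data` with replacement `n_resamples` times."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic([rng.choice(data) for _ in range(n)])
            for _ in range(n_resamples)]

# Bootstrap distribution of the mean for a small dataset
means = bootstrap([5, 6, 7, 8, 9], statistics.mean, n_resamples=1000)
```

Each element of `means` is the mean of one resampled dataset; the spread of these values approximates the sampling variability of the mean.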
3) Example
Dataset = [5, 6, 7, 8, 9]
- Original sample mean = 7.0
- Generate bootstrap samples (size 5, with replacement):
- [6, 9, 7, 7, 8] → mean = 7.4
- [5, 5, 6, 7, 9] → mean = 6.4
- [7, 8, 8, 9, 9] → mean = 8.2
- Repeat 1,000 times → get distribution of means.
From that, you can compute:
- Bootstrap SE (standard error) = SD of bootstrap means.
- 95% CI = e.g., 2.5th percentile to 97.5th percentile of bootstrap means.
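Both quantities can be read directly off the bootstrap distribution. A minimal sketch for the dataset above (resample count and seed are illustrative choices):

```python
import random
import statistics

rng = random.Random(42)
data = [5, 6, 7, 8, 9]

# Bootstrap distribution of the mean
boot_means = [statistics.mean(rng.choices(data, k=len(data)))
              for _ in range(10_000)]

se = statistics.stdev(boot_means)               # bootstrap SE = SD of bootstrap means
boot_means.sort()
lo = boot_means[int(0.025 * len(boot_means))]   # 2.5th percentile
hi = boot_means[int(0.975 * len(boot_means))]   # 97.5th percentile -> 95% percentile CI
```

With $n = 5$ the interval is wide; the point is the mechanics, not the precision.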
4) Why it’s powerful
- No need for strong parametric assumptions (like normality).
- Works for complicated statistics (median, correlation, regression coefficients, AUC, etc.).
- Easy to implement with modern computing.
5) Types of Bootstrap
- Nonparametric bootstrap: sample directly from the data (most common).
- Parametric bootstrap: assume a model (e.g., normal distribution), generate new samples from that model, and repeat.
- Block bootstrap: used in time series (sample contiguous blocks instead of independent points, to preserve autocorrelation).
- Bayesian bootstrap: resample by assigning random weights to observations instead of replicating them.
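To make the parametric variant concrete, here is a sketch that fits a normal model to the data and resamples from the fitted model instead of the data itself; `parametric_bootstrap_means` is a hypothetical helper name:

```python
import random
import statistics

def parametric_bootstrap_means(data, n_resamples=5_000, seed=0):
    """Parametric bootstrap: fit a normal model (mean, SD) to the data,
    then draw new size-n samples from the fitted model and record
    the mean of each simulated sample."""
    rng = random.Random(seed)
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    n = len(data)
    return [statistics.mean([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(n_resamples)]

pboot = parametric_bootstrap_means([5, 6, 7, 8, 9])
```

The trade-off: if the normal model is a good fit, the parametric version can be more efficient; if it is misspecified, the nonparametric version is safer.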
6) Common Applications
- Confidence intervals (CIs) for statistics.
- Standard errors of estimators.
- Bias correction (compare mean of bootstrap statistics vs original statistic).
- Model evaluation (e.g., bootstrap resampling instead of cross-validation).
- ROC/PR-AUC confidence intervals in classification tasks.
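Bias correction in particular is easy to demonstrate. The sketch below uses the maximum-likelihood variance (dividing by $n$, which is biased downward) as the statistic; the resample count and seed are illustrative:

```python
import random
import statistics

rng = random.Random(0)
data = [5, 6, 7, 8, 9]

def var_mle(xs):
    """Biased variance estimator: divides by n rather than n - 1."""
    m = statistics.mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

theta_hat = var_mle(data)
boot = [var_mle(rng.choices(data, k=len(data))) for _ in range(10_000)]

bias = statistics.mean(boot) - theta_hat   # mean of bootstrap stats vs original
theta_corrected = theta_hat - bias         # bias-corrected estimate
```

The bootstrap bias estimate is negative here (resampled variances run low, mirroring how the MLE variance runs low), so the corrected estimate is pushed upward, toward the unbiased value.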
7) Limitations
- Computationally heavy (thousands of resamples).
- Assumes your sample is a good approximation of the population.
- Doesn’t fully solve issues with small sample sizes or biased sampling.
Summary
- Bootstrap = resampling with replacement to approximate the sampling distribution.
- Gives estimates of variance, confidence intervals, and bias for almost any statistic.
- Especially useful when parametric formulas are hard or unreliable.
