1. Setup: Nonparametric Regression with Many Basis Functions
Consider a nonparametric regression model:

$y_i = \sum_{h=1}^{H} \beta_h b_h(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n,$

where
- $b_1, \ldots, b_H$ are prespecified basis functions (for example, B-splines, Gaussian RBFs),
- $\beta_1, \ldots, \beta_H$ are the basis coefficients,
- $\sigma^2$ is the residual variance.
In practice, not all basis functions are needed. Some can be dropped without losing accuracy. The problem is to decide:
- which basis functions to include, and
- how strongly to regularize their coefficients.
Two main strategies are:
- Variable selection (some $\beta_h$ exactly zero),
- Shrinkage (many $\beta_h$ near zero but not forced to be exactly zero).
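As a concrete sketch of this setup, the following builds a Gaussian RBF design matrix and fits it by unregularized least squares; the data, number of basis functions, and bandwidth are all hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data from a smooth curve with noise.
n = 100
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)

# H Gaussian RBF basis functions b_h(x) = exp(-(x - c_h)^2 / (2 s^2)).
H = 15
centers = np.linspace(0.0, 1.0, H)
s = 0.08
B = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s**2))  # n x H design matrix

# Unregularized least squares over all H coefficients: this is the
# over-parameterized fit that selection or shrinkage is meant to tame.
beta_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
resid_sd = np.std(y - B @ beta_hat)
```

With many basis functions the least-squares fit tracks noise; the two strategies above differ only in how they discipline the $\beta_h$.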
2. Bayesian Variable Selection via a Mixture Prior
Introduce a model indicator vector $\gamma = (\gamma_1, \ldots, \gamma_H)$, with $\gamma_h \in \{0, 1\}$ and
- $\gamma_h = 1$: basis function $b_h$ is included,
- $\gamma_h = 0$: basis function $b_h$ is excluded ($\beta_h = 0$).
The model space contains all $2^H$ possible inclusion patterns.
A convenient way to encode this is to work in the full model and give each coefficient a spike–slab prior:

$\beta_h \sim (1 - \pi_h)\, \delta_0 + \pi_h\, N(0, \tau^2),$ independently for $h = 1, \ldots, H$.

Interpretation:
- With probability $1 - \pi_h$, $\beta_h = 0$ (the spike at zero).
- With probability $\pi_h$, $\beta_h$ is drawn from a normal distribution with variance $\tau^2$ (the slab).
This induces:
- $\gamma_h \sim \text{Bernoulli}(\pi_h)$, independently for $h = 1, \ldots, H$.
- For a given model $\gamma$, the nonzero coefficients have a multivariate normal prior $\beta_\gamma \sim N_{H_\gamma}(0, \tau^2 I_{H_\gamma})$, where $H_\gamma = \sum_h \gamma_h$ is the number of included basis functions.
This is a variable selection mixture prior: each basis function is either off ($\beta_h = 0$) or on with a Gaussian prior.
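A single prior draw makes the spike–slab structure concrete; the inclusion probability and slab variance below are illustrative values, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(1)

H = 21
pi_incl = 0.3  # illustrative prior inclusion probability
tau2 = 4.0     # illustrative slab variance

# gamma_h ~ Bernoulli(pi); beta_h sits at the spike (exact zero) when
# gamma_h = 0, and is drawn from the N(0, tau2) slab when gamma_h = 1.
gamma = rng.random(H) < pi_incl
beta = np.where(gamma, rng.normal(0.0, np.sqrt(tau2), size=H), 0.0)
n_active = int(gamma.sum())  # H_gamma: number of included basis functions
```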
2.1 Hyperprior on inclusion probability and Cauchy slab
If there is no strong prior reason to treat basis functions differently, set
- $\pi_h = \pi$ for all $h$,
- and put a prior $\pi \sim \text{Beta}(a, b)$.
Given $\gamma$, the full conditional posterior for $\pi$ is

$\pi \mid \gamma \sim \text{Beta}\!\left(a + \sum_h \gamma_h,\; b + H - \sum_h \gamma_h\right).$

This induces an automatic multiplicity adjustment: as more unnecessary basis functions are added, posterior mass shifts toward smaller $\pi$ (more exclusion).
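The adjustment can be read off the Beta full conditional directly; a small sketch, assuming a uniform Beta(1, 1) hyperprior:

```python
# pi ~ Beta(a, b); given the indicators, pi | gamma ~ Beta(a + S, b + H - S),
# where S is the number of included basis functions.
a, b = 1.0, 1.0  # uniform hyperprior on pi

def pi_posterior_mean(S: int, H: int) -> float:
    # Posterior mean of the inclusion probability with S of H indicators on.
    return (a + S) / (a + b + H)

# Keep the number of truly useful basis functions fixed (S = 5) and grow the
# candidate set: the posterior mean of pi shrinks automatically.
means = [pi_posterior_mean(5, H) for H in (10, 50, 200)]
```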
To give nonzero coefficients heavy-tailed priors (to avoid overshrinkage of real signals), place

$\tau_h^2 \sim \text{Inv-Gamma}(1/2,\, 1/2)$

for each $h$. This choice makes the slab effectively Cauchy-like (using a normal–inverse-gamma scale mixture representation of the Student-$t$).
- An improper prior on the nonzero $\beta_h$ should not be used; it tends to put very high posterior probability on the null model (all $\gamma_h = 0$).
- An improper prior on $\sigma^2$ is acceptable, because $\sigma^2$ is shared across all models. For example, set both Inv-Gamma parameters to zero, giving $p(\sigma^2) \propto \sigma^{-2}$.
3. Posterior over Models and Analytic Marginal Likelihood
Under fixed $\pi$ and $\tau^2$ for simplicity, the posterior model probability has the form

$\Pr(\gamma \mid y) \propto \Pr(\gamma)\, p(y \mid \gamma),$

where $p(y \mid \gamma)$ is the marginal likelihood under model $\gamma$:

$p(y \mid \gamma) = \int p(y \mid \beta_\gamma, \sigma^2, \gamma)\, p(\beta_\gamma \mid \gamma)\, p(\sigma^2)\, d\beta_\gamma\, d\sigma^2.$

Here the likelihood uses $b_{\gamma i}$, the reduced design row containing only the included basis functions evaluated at $x_i$.
This is exactly the marginal likelihood for a normal linear regression with a conjugate normal–inverse-gamma prior, so a closed-form expression is available.
The posterior for $(\beta_\gamma, \sigma^2)$ conditional on $\gamma$ is multivariate normal–inverse-gamma, again with standard formulas.
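The closed form is easiest to see in the simplified case where $\sigma^2$ is held fixed as well: marginalizing $\beta_\gamma$ leaves a single Gaussian density for $y$. A sketch under that simplification (synthetic data; the full normal–inverse-gamma case differs only in replacing the fixed $\sigma^2$ by an integrated one):

```python
import numpy as np

rng = np.random.default_rng(2)

def log_marglik(y, B_g, sigma2, tau2):
    # With sigma^2 fixed and beta_gamma ~ N(0, tau2 I), integrating beta out gives
    # y | gamma ~ N(0, sigma2 I + tau2 B_g B_g^T): a closed-form Gaussian density.
    n = len(y)
    C = sigma2 * np.eye(n) + tau2 * (B_g @ B_g.T)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(C, y))

# Toy comparison: the data depend on the first candidate column only.
n = 50
B = rng.normal(size=(n, 3))
y = B[:, 0] * 2.0 + rng.normal(0.0, 1.0, size=n)

ml_true = log_marglik(y, B[:, :1], sigma2=1.0, tau2=4.0)  # model with the signal column
ml_null = log_marglik(y, B[:, :0], sigma2=1.0, tau2=4.0)  # empty model
```

The marginal likelihood of the model containing the true column dominates the null model, which is what drives the posterior over $\gamma$.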
3.1 Computational difficulty when H is large
The main obstacle is the size of model space:
- There are $2^H$ models.
- For $H = 21$ (as in the example below), this is already about $2 \times 10^6$ models.
Exact summation over all $2^H$ models is impossible unless $H$ is very small.
Therefore, approximations are needed:
- Stochastic search over models, using MCMC to find high posterior probability models and averaging over them.
- Gibbs sampling for $\gamma$, updating one $\gamma_h$ at a time.
4. Gibbs Sampling for Model Indicators
For Gibbs sampling, update each $\gamma_h$ from its Bernoulli full conditional:

$\Pr(\gamma_h = 1 \mid \gamma_{-h}, y) = \dfrac{\pi\, p(y \mid \gamma_h = 1, \gamma_{-h})}{\pi\, p(y \mid \gamma_h = 1, \gamma_{-h}) + (1 - \pi)\, p(y \mid \gamma_h = 0, \gamma_{-h})},$

where $\gamma_{-h}$ denotes all indicators except $\gamma_h$.
One Gibbs cycle:
- iterate over $h = 1, \ldots, H$,
- update $\gamma_h$ using the above probability.
After warm-up, successive draws of $\gamma$ approximate samples from the posterior over models.
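A minimal sketch of this sampler on synthetic data, with $\sigma^2$, $\tau^2$, and $\pi$ held fixed so the marginal likelihood is the closed-form Gaussian discussed above (all settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: only the first 2 of H candidate basis columns matter.
n, H = 60, 8
B = rng.normal(size=(n, H))
beta_true = np.zeros(H)
beta_true[:2] = [2.0, -1.5]
sigma2, tau2, pi_incl = 1.0, 4.0, 0.5
y = B @ beta_true + rng.normal(0.0, np.sqrt(sigma2), size=n)

def log_marglik(gamma):
    # log p(y | gamma), up to a constant, with fixed sigma^2 and a N(0, tau2 I) slab.
    B_g = B[:, gamma.astype(bool)]
    C = sigma2 * np.eye(n) + tau2 * (B_g @ B_g.T)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y))

gamma = np.ones(H)  # start from the full model
incl_counts = np.zeros(H)
n_iter, burn = 200, 50
for it in range(n_iter):
    for h in range(H):  # one-at-a-time Bernoulli full conditionals
        g1, g0 = gamma.copy(), gamma.copy()
        g1[h], g0[h] = 1.0, 0.0
        log_odds = (np.log(pi_incl) - np.log(1.0 - pi_incl)
                    + log_marglik(g1) - log_marglik(g0))
        p_incl = 1.0 / (1.0 + np.exp(-np.clip(log_odds, -30.0, 30.0)))
        gamma[h] = 1.0 if rng.random() < p_incl else 0.0
    if it >= burn:
        incl_counts += gamma

incl_prob = incl_counts / (n_iter - burn)  # marginal inclusion probabilities
```

The two signal columns end up with inclusion probabilities near one, while the noise columns hover well below; averaging the indicator draws after warm-up is exactly the Monte Carlo estimate of $\Pr(\gamma_h = 1 \mid y)$.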
4.1 Model selection vs model averaging
From the posterior sample of , one can:
- compute marginal inclusion probabilities $\Pr(\gamma_h = 1 \mid y)$ for each basis function,
- identify a maximum posterior model (the single $\gamma$ with largest posterior probability),
- or average predictions and curve estimates over the entire posterior over $\gamma$.
In high-dimensional spaces, many models can have similar posterior probabilities, so relying on a single “best model” can be misleading. Model averaging better reflects uncertainty.
If a single model must be chosen, a good choice is the median probability model:
- include every basis function with marginal inclusion probability $\Pr(\gamma_h = 1 \mid y) \geq 0.5$.
For orthogonal basis functions, this model has optimality properties as the best single-model approximation to full Bayesian model averaging.
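Given the posterior indicator draws, both summaries are one line each; the draws below are hypothetical, standing in for MCMC output:

```python
import numpy as np

# Hypothetical posterior draws of gamma: rows = MCMC iterations, columns = basis functions.
gamma_draws = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
])

incl_prob = gamma_draws.mean(axis=0)  # marginal inclusion probabilities Pr(gamma_h = 1 | y)
median_model = incl_prob >= 0.5       # median probability model: keep h iff prob >= 0.5
```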
5. Example: Chloride Concentration with Basis Selection
In the chloride concentration example:
- There are 54 observations,
- and 21 spline basis functions.
If all 21 basis functions are included with a weakly informative Gaussian prior, such as
- $\beta_h \sim N(0, \tau^2)$ with a fixed, fairly large $\tau^2$,
the posterior mean curve fits poorly:
- the fit drifts downward toward zero,
- because the prior is centered at 0 and not sufficiently regularized relative to the number of parameters.
Instead, treat the 21 splines as a full candidate basis set and perform Bayesian variable selection:
- prior inclusion probability for each basis: $\pi_h = \pi$, with $\pi \sim \text{Beta}(1, 1)$,
- nonzero coefficients: $\beta_h \sim N(0, \tau_h^2)$ with $\tau_h^2 \sim \text{Inv-Gamma}(1/2, 1/2)$ (the Cauchy-like slab),
- residual variance: $p(\sigma^2) \propto \sigma^{-2}$.
A Gibbs sampler is run:
- at each iteration, some subset of the $\beta_h$ is active and updated; the others are set to 0 via $\gamma_h = 0$.
- convergence is fast, and computation is quick.
Results:
- Posterior mean number of active basis functions: 12.0
- 95% posterior interval for model size: [8.0, 16.0]
- 95% posterior interval for the residual standard deviation $\sigma$: [0.23, 0.33]
This indicates:
- a moderately complex but smooth curve (about 8–16 basis functions),
- relatively low measurement error.
A limitation: results can be sensitive to the initial choice of basis family and number of basis functions. For example, choosing 21 smooth cubic splines implicitly assumes that the true curve is smooth and does not have extremely sharp peaks. If sharp spikes are plausible, a different basis (for example wavelets) might be more appropriate, or a mix of basis families can be considered and selected via variable selection.
6. Shrinkage Priors as a Continuous Alternative to Hard Selection
Variable selection with exact zeros is conceptually attractive but computationally demanding:
- Model space is enormous ($2^H$ models).
- MCMC can explore only a tiny fraction of models.
- One-at-a-time updates of $\gamma_h$ can mix slowly.
- Nonconjugate priors make computation even harder.
An alternative is to avoid exact zeros and instead use shrinkage priors on $\beta_1, \ldots, \beta_H$ that:
- concentrate probability mass near zero, so many coefficients are effectively negligible,
- have heavy tails to avoid over-shrinking large, important coefficients.
6.1 Scale mixtures of normals
Most useful shrinkage priors can be written as:

$p(\beta_h) = \int N(\beta_h;\, 0, \tau_h^2)\, g(\tau_h^2)\, d\tau_h^2,$

where $g$ is a mixing distribution over variances.
Examples:
- A $t$-distribution with $\nu$ degrees of freedom arises if $\tau_h^2 \sim \text{Inv-Gamma}(\nu/2, \nu/2)$.
- As $\nu \to 0$, one gets something like a normal–Jeffreys prior, which can produce posterior modes at exactly zero but leads to improper posteriors and no valid Bayesian uncertainty.
- A common practical choice is $\nu = 1$ (that is, $\tau_h^2 \sim \text{Inv-Gamma}(1/2, 1/2)$), giving a Cauchy prior:
- very heavy tails,
- strong shrinkage near zero,
- good empirical behavior in many regression and nonparametric settings.
The Laplace (double-exponential) prior is also a normal scale mixture and underlies the lasso:
- It creates exact zeros in the posterior mode,
- but posterior draws of $\beta_h$ are not exactly zero,
- and tails are not as heavy as Cauchy, so large signals can be overshrunk.
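Both scale-mixture representations are easy to check by simulation. A sketch comparing Inv-Gamma(1/2, 1/2) mixing (Cauchy marginal) and exponential mixing (Laplace marginal), using the known quartiles of each: $\pm 1$ for the standard Cauchy and $\pm \ln 2$ for the Laplace with rate 1.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 200_000

# Cauchy: beta | tau2 ~ N(0, tau2), tau2 ~ Inv-Gamma(1/2, 1/2)
# (equivalently 1/tau2 ~ Gamma(shape=1/2, rate=1/2), i.e. scale=2).
tau2 = 1.0 / rng.gamma(shape=0.5, scale=2.0, size=m)
beta_cauchy = rng.normal(0.0, np.sqrt(tau2))

# Laplace: beta | tau ~ N(0, tau), tau ~ Exp(rate = lam^2 / 2); here lam = 1.
lam = 1.0
tau = rng.exponential(scale=2.0 / lam**2, size=m)
beta_laplace = rng.normal(0.0, np.sqrt(tau))

# Empirical upper quartiles should match the known values.
q_cauchy = np.quantile(beta_cauchy, 0.75)    # approx 1
q_laplace = np.quantile(beta_laplace, 0.75)  # approx ln 2
```

The much larger extreme quantiles of the Cauchy draws are the "heavy tails" that protect large coefficients from overshrinkage.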
7. Generalized Double Pareto (GDP) Shrinkage Prior
The generalized double Pareto prior offers:
- strong peak at zero (like Laplace),
- arbitrarily heavy tails (like Cauchy or heavier).
Its density is

$f(\beta_h \mid \xi, \alpha) = \dfrac{1}{2\xi}\left(1 + \dfrac{|\beta_h|}{\alpha \xi}\right)^{-(\alpha + 1)},$

where
- $\xi > 0$: scale parameter,
- $\alpha > 0$: shape parameter.
This prior can be represented as a scale mixture of normals:

$\beta_h \mid \tau_h \sim N(0, \tau_h), \qquad \tau_h \mid \lambda_h \sim \text{Exp}(\lambda_h^2 / 2),$

with $\lambda_h \sim \text{Gamma}(\alpha, \eta)$ and $\xi = \eta / \alpha$.
For nonparametric regression with basis functions, apply this prior independently to all $h$:

$\beta_h \sim \text{GDP}(\xi, \alpha)$, independently for $h = 1, \ldots, H$,

equivalently, via the hierarchical scale-mixture form:

$\beta_h \mid \tau_h \sim N(0, \tau_h), \qquad \tau_h \mid \lambda_h \sim \text{Exp}(\lambda_h^2 / 2), \qquad \lambda_h \sim \text{Gamma}(\alpha, \eta),$

and put a prior on the error standard deviation.
A typical default choice:
- $\alpha = 1$, $\eta = 1$ → Cauchy-like tails.
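The hierarchical form gives a direct way to simulate from the prior. A sketch with the default $\alpha = \eta = 1$, for which GDP(1, 1) has density $(1/2)(1 + |\beta|)^{-2}$ and hence quartiles at $\pm 1$:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 200_000
alpha, eta = 1.0, 1.0  # default choice: Cauchy-like tails

# Hierarchy: lambda_h ~ Gamma(alpha, rate=eta), tau_h | lambda_h ~ Exp(rate=lambda_h^2/2),
# beta_h | tau_h ~ N(0, tau_h); marginally beta_h ~ GDP(xi = eta/alpha, alpha).
lam = rng.gamma(shape=alpha, scale=1.0 / eta, size=m)
tau = rng.exponential(scale=2.0 / lam**2)
beta = rng.normal(0.0, np.sqrt(tau))

# GDP(1, 1) has cdf 1 - (1/2)(1 + b)^(-1) for b > 0, so the upper quartile is 1.
q3 = np.quantile(beta, 0.75)
```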
8. Gibbs Sampler for GDP Shrinkage
Define:
- $B$: the $n \times H$ design matrix with rows $b(x_i)^\top = (b_1(x_i), \ldots, b_H(x_i))$,
- $\beta = (\beta_1, \ldots, \beta_H)^\top$, $\quad T = \text{diag}(\tau_1, \ldots, \tau_H)$.
The conditional posterior distributions are:
- Coefficients:

  $\beta \mid y, \sigma^2, \tau \sim N(\mu, \Sigma), \qquad \Sigma = \left(B^\top B / \sigma^2 + T^{-1}\right)^{-1}, \qquad \mu = \Sigma B^\top y / \sigma^2.$

- Error variance (under $p(\sigma^2) \propto \sigma^{-2}$):

  $\sigma^2 \mid y, \beta \sim \text{Inv-Gamma}\!\left(n/2,\; \|y - B\beta\|^2 / 2\right).$

- Hyperparameters for each coefficient $h$:
  - Update $\lambda_h$ (with $\tau_h$ marginalized out): $\lambda_h \mid \beta_h \sim \text{Gamma}(\alpha + 1,\; |\beta_h| + \eta).$
  - Update $\tau_h$ via an inverse-Gaussian draw for its reciprocal: $\tau_h^{-1} \mid \beta_h, \lambda_h \sim \text{Inv-Gaussian}(\lambda_h / |\beta_h|,\; \lambda_h^2).$
This yields a block Gibbs sampler:
- update the full coefficient vector $\beta$ at once,
- then $\sigma^2$,
- then the local scale parameters $\lambda_h$ and $\tau_h$.
Block updating of $\beta$ often leads to good mixing.
After convergence, the posterior draws of $\beta$ induce a posterior over the regression function

$f(x) = \sum_{h=1}^{H} \beta_h b_h(x).$
In high-dimensional situations with many candidate basis functions:
- many $\beta_h$ are shrunk close to zero,
- important basis functions retain non-negligible coefficients,
- specific active basis functions may change across iterations (especially with nonorthogonal bases), but the overall function estimate is stable.
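Putting the conditional updates together, the block Gibbs sampler can be sketched as follows; the data are synthetic, the improper $p(\sigma^2) \propto \sigma^{-2}$ prior is assumed, and all run-length settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic basis regression: only the first 3 of H coefficients are nonzero.
n, H = 80, 30
B = rng.normal(size=(n, H))
beta_true = np.zeros(H)
beta_true[:3] = [3.0, -2.0, 1.5]
y = B @ beta_true + rng.normal(0.0, 0.5, size=n)

alpha, eta = 1.0, 1.0  # GDP(1, 1) default
beta = np.zeros(H)
tau = np.ones(H)
sigma2 = 1.0
keep = []

for it in range(600):
    # 1) Block update: beta | y, sigma2, tau ~ N(mu, Sigma).
    prec = B.T @ B / sigma2 + np.diag(1.0 / tau)
    Sigma = np.linalg.inv(prec)
    Sigma = (Sigma + Sigma.T) / 2.0  # symmetrize against round-off
    mu = Sigma @ (B.T @ y) / sigma2
    beta = rng.multivariate_normal(mu, Sigma)
    # 2) sigma2 | y, beta ~ Inv-Gamma(n/2, RSS/2) under p(sigma2) ∝ 1/sigma2.
    rss = float(np.sum((y - B @ beta) ** 2))
    sigma2 = 1.0 / rng.gamma(n / 2.0, 2.0 / rss)
    # 3) Local scales: lambda_h | beta_h ~ Gamma(alpha + 1, rate = |beta_h| + eta),
    #    then 1/tau_h | beta_h, lambda_h ~ Inverse-Gaussian(lambda_h/|beta_h|, lambda_h^2).
    abs_b = np.abs(beta).clip(min=1e-10)  # guard: the Wald mean blows up at beta = 0
    lam = rng.gamma(alpha + 1.0, 1.0 / (abs_b + eta))
    tau = 1.0 / rng.wald(lam / abs_b, lam**2)
    if it >= 100:
        keep.append(beta)

post_mean = np.mean(keep, axis=0)  # posterior mean coefficients after warm-up
```

On this toy problem the three true signals are recovered with little shrinkage, while the remaining coefficients are pulled close to zero, matching the qualitative behavior described above.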
