1. Setup: Nonparametric Regression with Many Basis Functions

Consider a nonparametric regression model:

$$y_i \sim N(w_i \beta,\, \sigma^2), \quad w_i = (b_1(x_i), \dots, b_H(x_i)),$$

where

  • $b_1, \dots, b_H$ are prespecified basis functions (for example, B-splines or Gaussian RBFs),
  • $\beta = (\beta_1, \dots, \beta_H)$ are the basis coefficients,
  • $\sigma^2$ is the residual variance.

In practice, not all basis functions are needed. Some can be dropped without losing accuracy. The problem is to decide:

  • which basis functions to include, and
  • how strongly to regularize their coefficients.

Two main strategies are:

  1. Variable selection (some $\beta_h$ exactly zero),
  2. Shrinkage (many $\beta_h$ near zero but not forced to be exactly zero).
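To make the setup concrete, here is a minimal sketch of building the design matrix $W$ from a Gaussian RBF basis. The function name `rbf_design` and all numeric values (centers, bandwidth) are illustrative, not from the text:

```python
import numpy as np

def rbf_design(x, centers, scale):
    """Gaussian RBF design matrix: W[i, h] = exp(-(x_i - c_h)^2 / (2 * scale^2))."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * scale ** 2))

# Hypothetical example: n = 54 points, H = 21 equally spaced centers.
x = np.linspace(0.0, 1.0, 54)
centers = np.linspace(0.0, 1.0, 21)
W = rbf_design(x, centers, scale=0.1)
print(W.shape)  # (54, 21)
```

Each row of `W` plays the role of $w_i$; the modeling question is then which of the 21 columns to keep.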

2. Bayesian Variable Selection via a Mixture Prior

Introduce a model indicator vector

$$\gamma = (\gamma_1, \dots, \gamma_H) \in \{0,1\}^H,$$

with

  • $\gamma_h = 1$: basis function $b_h$ is included,
  • $\gamma_h = 0$: basis function $b_h$ is excluded ($\beta_h = 0$).

The model space $\Gamma$ contains all $2^H$ possible inclusion patterns.

A convenient way to encode this is to work in the full model and give each coefficient $\beta_h$ a spike–slab prior:

$$\beta_h \sim \pi_h \, \delta_0 + (1 - \pi_h)\, N\!\big(0,\, \kappa_h^{-1}\sigma^2\big), \quad \sigma^2 \sim \text{Inv-Gamma}(a,b). \tag{20.3}$$

Interpretation:

  • With probability $\pi_h$, $\beta_h = 0$ (the spike at zero).
  • With probability $1 - \pi_h$, $\beta_h$ is drawn from a normal distribution with variance $\kappa_h^{-1}\sigma^2$ (the slab).

This induces:

  • $\gamma_h \sim \text{Bernoulli}(1 - \pi_h)$, independently for $h = 1, \dots, H$.
  • For a given model $\gamma$, the nonzero coefficients $\beta_\gamma = \{\beta_h : \gamma_h = 1\}$ have a multivariate normal prior
    $$\beta_\gamma \sim N_{p_\gamma}(0, V_\gamma \sigma^2),$$
    where $p_\gamma = \sum_h \gamma_h$ is the number of included basis functions and $V_\gamma = \text{diag}(\kappa_h^{-1} : \gamma_h = 1)$, matching the slab variances in (20.3).

This is a variable selection mixture prior: each basis function is either off ($\beta_h = 0$) or on with a Gaussian prior.
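A short simulation makes the on/off structure of the prior concrete. This is an illustrative sketch (the function name and parameter values are mine, not from the text): each draw is exactly zero with probability $\pi$ and Gaussian otherwise:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_spike_slab(pi, kappa, sigma2, size, rng):
    """Draw coefficients from the spike-slab prior (20.3):
    beta_h = 0 with probability pi, else N(0, sigma2 / kappa)."""
    slab = rng.normal(0.0, np.sqrt(sigma2 / kappa), size=size)
    spike = rng.random(size) < pi          # True -> beta_h lands in the spike
    return np.where(spike, 0.0, slab)

beta = draw_spike_slab(pi=0.8, kappa=1.0, sigma2=1.0, size=10_000, rng=rng)
print(np.mean(beta == 0.0))  # roughly 0.8: the prior exclusion rate
```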

2.1 Hyperprior on inclusion probability and Cauchy slab

If there is no strong prior reason to treat basis functions differently, set

  • $\pi_h = \pi$ for all $h$,
  • and put a prior $\pi \sim \text{Beta}(a_\pi, b_\pi)$.

Given $\gamma$, the full conditional posterior for $\pi$ is

$$\pi \mid - \sim \text{Beta}\left(a_\pi + \sum_h (1 - \gamma_h),\; b_\pi + \sum_h \gamma_h\right).$$

This induces an automatic multiplicity adjustment: as more unnecessary basis functions are added, most of them are excluded, so posterior mass shifts toward larger $\pi$ (a higher exclusion probability).
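A small numeric sketch of this Beta update (the hyperparameter values and counts are illustrative): with a uniform $\text{Beta}(1,1)$ hyperprior and only 5 of 50 candidate bases currently included, the full conditional concentrates on high exclusion probabilities:

```python
import numpy as np

def pi_posterior(a_pi, b_pi, gamma):
    """Full conditional for pi (the exclusion probability) given gamma:
    Beta(a_pi + #excluded, b_pi + #included)."""
    gamma = np.asarray(gamma)
    n_incl = int(gamma.sum())
    n_excl = int(gamma.size - n_incl)
    return a_pi + n_excl, b_pi + n_incl

# 5 of 50 candidate bases included, uniform Beta(1, 1) hyperprior.
a, b = pi_posterior(1.0, 1.0, np.r_[np.ones(5), np.zeros(45)])
print(a / (a + b))  # posterior mean of pi: 46/52, about 0.885
```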

To give nonzero coefficients heavy-tailed priors (to avoid overshrinkage of real signals), place

$$\kappa_h \sim \text{Gamma}(0.5, 0.5)$$

for each $h$. This makes the slab marginally Cauchy (a Student-t with one degree of freedom), via the normal scale-mixture representation of the Student-t: a $\text{Gamma}(0.5, 0.5)$ prior on the precision $\kappa_h$ is an inverse-gamma prior on the slab variance.

  • An improper prior on nonzero $\beta_h$ should not be used; it tends to put very high posterior probability on the null model (all $\beta_h = 0$).
  • An improper prior on $\sigma$ is acceptable, because $\sigma$ is shared across all models. For example, set $a = b = 0$ in the Inv-Gamma.

3. Posterior over Models and Analytic Marginal Likelihood

Under fixed $\pi$ and $\kappa_h = \kappa$ for simplicity, the posterior model probability has the form

$$\Pr(\gamma \mid y, X) = \frac{\pi^{H - p_\gamma} (1 - \pi)^{p_\gamma} \, p(y \mid X, \gamma)}{\sum_{\gamma^* \in \Gamma} \pi^{H - p_{\gamma^*}} (1 - \pi)^{p_{\gamma^*}} \, p(y \mid X, \gamma^*)}, \tag{20.4}$$

where $p(y \mid X, \gamma)$ is the marginal likelihood under model $\gamma$:

$$p(y \mid X, \gamma) = \iint \left[ \prod_{i=1}^n N(y_i \mid w_{i,\gamma} \beta_\gamma, \sigma^2) \right] N(\beta_\gamma \mid 0, V_\gamma \sigma^2) \, \text{Inv-Gamma}(\sigma^2 \mid a, b) \, d\beta_\gamma \, d\sigma^2.$$

Here $w_{i,\gamma} = (w_{ih} : \gamma_h = 1)$ is the reduced design row.

This is exactly the marginal likelihood for a normal linear regression with a conjugate normal–inverse-gamma prior, so a closed-form expression is available.

The posterior for (βγ,σ2)(\beta_\gamma, \sigma^2) conditional on γ\gamma is multivariate normal–inverse-gamma, again with standard formulas.
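Because the marginal likelihood is available in closed form, it can be evaluated directly: marginalizing $\beta_\gamma$ gives $y \sim N(0, \sigma^2(I + X V X^\top))$, and integrating out $\sigma^2$ yields a multivariate t. The sketch below (function name and toy data are mine) computes $\log p(y \mid X, \gamma)$ this way:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, X, V, a, b):
    """log p(y | X) for y ~ N(X beta, sigma^2 I), beta ~ N(0, sigma^2 V),
    sigma^2 ~ Inv-Gamma(a, b).  Marginally, y is multivariate t."""
    n = len(y)
    Sigma = np.eye(n) + X @ V @ X.T            # covariance of y, up to sigma^2
    _, logdet = np.linalg.slogdet(Sigma)
    Q = y @ np.linalg.solve(Sigma, y)          # y' Sigma^{-1} y
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
            + a * np.log(b) + gammaln(a + 0.5 * n)
            - gammaln(a) - (a + 0.5 * n) * np.log(b + 0.5 * Q))

# Toy reduced design W_gamma with 3 included columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = X[:, 0] + 0.5 * rng.normal(size=20)
print(log_marginal_likelihood(y, X, np.eye(3), a=1.0, b=1.0))
```

With $V = 0$ this reduces to the null-model marginal likelihood, which gives a quick correctness check.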

3.1 Computational difficulty when H is large

The main obstacle is the size of model space:

  • There are $2^H$ models.
  • For $H = 50$, this is about $1.1 \times 10^{15}$ models.

Exact summation over all $\gamma$ is computationally infeasible unless $H$ is very small.

Therefore, approximations are needed:

  1. Stochastic search over models, using MCMC to find high posterior probability models and averaging over them.
  2. Gibbs sampling for $\gamma$, updating one $\gamma_h$ at a time.

4. Gibbs Sampling for Model Indicators

For Gibbs sampling, update each $\gamma_h$ from its Bernoulli full conditional:

$$\Pr(\gamma_h = 1 \mid \gamma_{(-h)}, \pi, y, X) = \left[ 1 + \frac{\pi}{1 - \pi} \cdot \frac{p(y \mid X, \gamma_h = 0, \gamma_{(-h)})}{p(y \mid X, \gamma_h = 1, \gamma_{(-h)})} \right]^{-1},$$

where $\gamma_{(-h)}$ denotes all indicators except $\gamma_h$.

One Gibbs cycle:

  • iterate over $h = 1, \dots, H$,
  • update $\gamma_h$ using the above probability.

After warm-up, successive draws of $\gamma$ approximate samples from the posterior over $\Gamma$.

4.1 Model selection vs model averaging

From the posterior sample of $\gamma$, one can:

  • compute marginal inclusion probabilities $\Pr(\gamma_h = 1 \mid \text{data})$ for each basis function,
  • identify a maximum posterior model (the single $\gamma$ with largest posterior probability),
  • or average predictions and curve estimates over the entire posterior over $\gamma$.

In high-dimensional spaces, many models can have similar posterior probabilities, so relying on a single “best model” can be misleading. Model averaging better reflects uncertainty.

If a single model must be chosen, a good choice is the median probability model:

  • include every basis function with marginal inclusion probability $\Pr(\gamma_h = 1 \mid \text{data}) > 0.5$.

For orthogonal basis functions, this model has optimality properties as the best single-model approximation to full Bayesian model averaging.
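Extracting the median probability model from posterior samples of $\gamma$ is a simple thresholding of the inclusion frequencies. A sketch (the helper name and the four toy draws are hypothetical):

```python
import numpy as np

def median_probability_model(gamma_draws):
    """Keep basis function h iff its marginal posterior inclusion
    probability exceeds 0.5 (the median probability model)."""
    incl = np.asarray(gamma_draws, dtype=float).mean(axis=0)
    return incl, incl > 0.5

# Hypothetical posterior draws of gamma for H = 4 basis functions.
draws = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 1],
                  [1, 1, 0, 0],
                  [1, 1, 1, 0]])
incl, keep = median_probability_model(draws)
print(incl)  # [1.   0.75 0.25 0.25]
print(keep)  # [ True  True False False]
```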


5. Example: Chloride Concentration with Basis Selection

In the chloride concentration example:

  • There are 54 observations,
  • and 21 spline basis functions.

If all 21 basis functions are included with a weakly informative Gaussian prior, such as

  • $\beta \sim N(0, I)$ or $\beta \sim N(0, 2^2 I)$,

the posterior mean curve fits poorly:

  • the fit drifts downward toward zero,
  • because the prior is centered at 0 and, with 21 coefficients informed by only 54 observations, it pulls the posterior toward that prior mean.

Instead, treat the 21 splines as a full candidate basis set and perform Bayesian variable selection:

  • prior inclusion probability for each basis: $0.5$,
  • nonzero coefficients: $\beta_h \sim N(0, 2^2)$,
  • residual variance: $\sigma^2 \sim \text{Inv-Gamma}(1, 1)$.

A Gibbs sampler is run:

  • at each iteration, some subset of the $\beta_h$ is active and updated; the others are set to 0 via $\gamma$.
  • convergence is fast, and computation is quick.

Results:

  • Posterior mean number of active basis functions: 12.0
  • 95% posterior interval for model size: [8.0, 16.0]
  • Posterior mean of the residual standard deviation: $\hat{\sigma} = 0.27$
  • 95% interval for $\sigma$: [0.23, 0.33]

This indicates:

  • a moderately complex but smooth curve (about 8–16 basis functions),
  • relatively low measurement error.

A limitation: results can be sensitive to the initial choice of basis family and number of basis functions. For example, choosing 21 smooth cubic splines implicitly assumes that the true curve is smooth and does not have extremely sharp peaks. If sharp spikes are plausible, a different basis (for example wavelets) might be more appropriate, or a mix of basis families can be considered and selected via variable selection.


6. Shrinkage Priors as a Continuous Alternative to Hard Selection

Variable selection with exact zeros is conceptually attractive but computationally demanding:

  • Model space is enormous ($2^H$ models).
  • MCMC can explore only a tiny fraction of the models.
  • One-at-a-time updates of $\gamma_h$ can mix slowly.
  • Nonconjugate priors make computation even harder.

An alternative is to avoid exact zeros and instead use shrinkage priors on $\beta_h$ that:

  • concentrate probability mass near zero, so many coefficients are effectively negligible,
  • have heavy tails to avoid over-shrinking large, important coefficients.

6.1 Scale mixtures of normals

Most useful shrinkage priors can be written as:

$$\beta_h \sim N(0, \sigma_h^2), \quad \sigma_h^2 \sim G,$$

where $G$ is a mixing distribution over variances.

Examples:

  • A t-distribution with $\nu$ degrees of freedom arises if $\sigma_h^2 \sim \text{Inv-Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right)$.
  • As $\nu \to 0$, one obtains something like a normal-Jeffreys prior, which can produce posterior modes at exactly zero but leads to improper posteriors and no valid Bayesian uncertainty quantification.
  • A common practical choice is $\nu = 1$, giving a Cauchy prior:
    • very heavy tails,
    • strong shrinkage near zero,
    • good empirical behavior in many regression and nonparametric settings.
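The scale-mixture representation can be checked numerically (a sketch using SciPy; the function name is mine): integrating the normal density over an $\text{Inv-Gamma}(\nu/2, \nu/2)$ mixing distribution recovers the Student-t density with $\nu$ degrees of freedom.

```python
import numpy as np
from scipy import integrate, stats

def t_density_via_mixture(beta, nu):
    """Integrate N(beta | 0, s) against s ~ Inv-Gamma(nu/2, nu/2);
    the result should match the Student-t density with nu d.o.f."""
    integrand = lambda s: (stats.norm.pdf(beta, scale=np.sqrt(s))
                           * stats.invgamma.pdf(s, a=nu / 2, scale=nu / 2))
    val, _ = integrate.quad(integrand, 0.0, np.inf)
    return val

beta, nu = 1.3, 1.0                       # nu = 1: the Cauchy case
print(t_density_via_mixture(beta, nu))    # should agree with stats.t.pdf below
print(stats.t.pdf(beta, df=nu))
```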

The Laplace (double-exponential) prior is also a normal scale mixture and underlies the lasso:

  • It creates exact zeros in the posterior mode,
  • but posterior draws of $\beta_h$ are not exactly zero,
  • and its tails are not as heavy as the Cauchy's, so large signals can be overshrunk.

7. Generalized Double Pareto (GDP) Shrinkage Prior

The generalized double Pareto prior offers:

  • strong peak at zero (like Laplace),
  • arbitrarily heavy tails (like Cauchy or heavier).

Its density is

$$g_{\text{dP}}(\beta \mid \xi, \alpha) = \frac{1}{2\xi} \left(1 + \frac{|\beta|}{\alpha \xi}\right)^{-(\alpha + 1)},$$

where

  • $\xi > 0$: scale parameter,
  • $\alpha > 0$: shape parameter.

This prior can be represented as a scale mixture of normals:

$$\beta \sim N(0, \sigma^2), \quad \sigma^2 \sim \text{Exponential}(\lambda^2/2), \quad \lambda \sim \text{Gamma}(\alpha, \eta),$$

with $\xi = \eta/\alpha$.
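This representation can be verified numerically (a sketch; function names are mine): the normal-exponential stage integrates to a Laplace density with rate $\lambda$, and mixing that Laplace over $\lambda \sim \text{Gamma}(\alpha, \eta)$ (rate $\eta$) recovers the GDP density with $\xi = \eta/\alpha$:

```python
import numpy as np
from scipy import integrate, stats

def gdp_pdf(beta, xi, alpha):
    """Generalized double Pareto density."""
    return 1.0 / (2.0 * xi) * (1.0 + abs(beta) / (alpha * xi)) ** (-(alpha + 1.0))

def gdp_via_mixture(beta, alpha, eta):
    """Mix the Laplace density (lam/2) exp(-lam |beta|), itself a
    normal-exponential mixture, over lam ~ Gamma(alpha, rate=eta)."""
    integrand = lambda lam: (0.5 * lam * np.exp(-lam * abs(beta))
                             * stats.gamma.pdf(lam, a=alpha, scale=1.0 / eta))
    val, _ = integrate.quad(integrand, 0.0, np.inf)
    return val

beta, alpha, eta = 0.7, 1.0, 1.0
print(gdp_via_mixture(beta, alpha, eta))
print(gdp_pdf(beta, xi=eta / alpha, alpha=alpha))  # should agree
```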

For nonparametric regression with basis functions, apply this prior independently to all $\beta_h$:

$$p(\beta \mid \sigma) = \prod_{h=1}^H \frac{\alpha}{2\sigma\eta} \left(1 + \frac{|\beta_h|}{\sigma\eta}\right)^{-(\alpha + 1)},$$

equivalently:

$$\beta_h \sim N(0, \sigma^2 \tau_h), \quad \tau_h \sim \text{Exponential}(\lambda_h^2/2), \quad \lambda_h \sim \text{Gamma}(\alpha, \eta),$$

and put a prior $p(\sigma) \propto 1/\sigma$ on the error standard deviation.

A typical default choice:

  • $\alpha = 1$, $\eta = 1$ → Cauchy-like tails.

8. Gibbs Sampler for GDP Shrinkage

Define:

  • $W$: design matrix with rows $w_i$,
  • $T = \text{diag}(\tau_1, \dots, \tau_H)$.

The conditional posterior distributions are:

  1. Coefficients:
     $$\beta \mid - \sim N\Big( (W^\top W + T^{-1})^{-1} W^\top y,\; \sigma^2 (W^\top W + T^{-1})^{-1} \Big).$$
  2. Error variance:
     $$\sigma^2 \mid - \sim \text{Inv-Gamma}\left( \frac{n + H}{2},\; \frac{(y - W\beta)^\top (y - W\beta)}{2} + \frac{\beta^\top T^{-1} \beta}{2} \right).$$
  3. Hyperparameters for each coefficient:
    • Update $\lambda_h$:
      $$\lambda_h \mid - \sim \text{Gamma}\left(\alpha + 1,\; \frac{|\beta_h|}{\sigma} + \eta\right).$$
    • Update $\tau_h^{-1}$ via an inverse-Gaussian:
      $$\tau_h^{-1} \mid - \sim \text{Inv-Gaussian}\left(\mu = \left|\frac{\lambda_h \sigma}{\beta_h}\right|,\; \rho = \lambda_h^2\right).$$

This yields a block Gibbs sampler:

  • update all of $\beta$ at once,
  • then $\sigma^2$,
  • then the local scale parameters $\lambda_h$ and $\tau_h$.

Block updating of $\beta$ often leads to good mixing.
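A compact sketch of the three-block sampler above (the toy data, defaults, and small numerical guard are my assumptions; NumPy's `wald` generator supplies the inverse-Gaussian draws, with `mean` playing the role of $\mu$ and `scale` the role of $\rho$):

```python
import numpy as np

def gdp_gibbs(y, W, alpha=1.0, eta=1.0, iters=1000, rng=None):
    """Block Gibbs sampler for the GDP shrinkage prior (sketch)."""
    rng = rng or np.random.default_rng(0)
    n, H = W.shape
    tau, lam, sigma2 = np.ones(H), np.ones(H), 1.0
    WtW, Wty = W.T @ W, W.T @ y
    keep = []
    for t in range(iters):
        # 1. beta | - ~ N(A^{-1} W'y, sigma^2 A^{-1}),  A = W'W + T^{-1}
        A = WtW + np.diag(1.0 / tau)
        L = np.linalg.cholesky(np.linalg.inv(A))
        beta = np.linalg.solve(A, Wty) + np.sqrt(sigma2) * L @ rng.normal(size=H)
        # 2. sigma^2 | - ~ Inv-Gamma((n+H)/2, (SSR + beta' T^{-1} beta)/2)
        resid = y - W @ beta
        rate = 0.5 * (resid @ resid + beta @ (beta / tau))
        sigma2 = 1.0 / rng.gamma(0.5 * (n + H), 1.0 / rate)
        # 3. local scales: lam_h ~ Gamma(alpha+1, |beta_h|/sigma + eta),
        #    then tau_h^{-1} ~ Inv-Gaussian(|lam_h sigma / beta_h|, lam_h^2)
        ab = np.abs(beta)
        lam = rng.gamma(alpha + 1.0, 1.0 / (ab / np.sqrt(sigma2) + eta))
        mu = lam * np.sqrt(sigma2) / np.maximum(ab, 1e-10)  # guard beta_h ~ 0
        tau = 1.0 / rng.wald(mu, lam ** 2)
        if t >= iters // 2:
            keep.append(beta.copy())
    return np.array(keep)

# Toy example: 2 real signals among 10 candidate basis columns.
rng = np.random.default_rng(3)
W = rng.normal(size=(80, 10))
y = 3.0 * W[:, 0] - 2.0 * W[:, 1] + 0.5 * rng.normal(size=80)
draws = gdp_gibbs(y, W, rng=rng)
print(draws.mean(axis=0).round(2))  # large for columns 0 and 1, near 0 elsewhere
```

The posterior means of the two signal coefficients stay close to their true values, while the eight spurious coefficients are shrunk toward zero.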

After convergence, the posterior draws of $\beta$ induce a posterior over the regression function:

$$\mu(x) = \sum_{h=1}^H \beta_h b_h(x).$$

In high-dimensional situations with many candidate basis functions:

  • many $\beta_h$ are shrunk close to zero,
  • important basis functions retain non-negligible coefficients,
  • the specific active basis functions may change across iterations (especially with nonorthogonal bases), but the overall function estimate $\mu(x)$ is stable.