1. Setup: Nonparametric Regression with Many Basis Functions

Consider a nonparametric regression model:

$$y_i \sim N(w_i \beta,\, \sigma^2), \quad w_i = (b_1(x_i), \dots, b_H(x_i)),$$

where

  • $b_1, \dots, b_H$ are prespecified basis functions (for example, B-splines or Gaussian RBFs),
  • $\beta = (\beta_1, \dots, \beta_H)$ are the basis coefficients,
  • $\sigma^2$ is the residual variance.

In practice, not all basis functions are needed. Some can be dropped without losing accuracy. The problem is to decide:

  • which basis functions to include, and
  • how strongly to regularize their coefficients.

Two main strategies are:

  1. Variable selection (some $\beta_h$ exactly zero),
  2. Shrinkage (many $\beta_h$ near zero but not forced to be exactly zero).
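To make the setup concrete, here is a minimal sketch of building the design matrix $W$ from a Gaussian RBF basis. The function name `rbf_design` and all numeric values (centers, bandwidth) are illustrative, not from the text:

```python
import numpy as np

def rbf_design(x, centers, scale):
    """Gaussian RBF design matrix: W[i, h] = exp(-(x_i - c_h)^2 / (2 * scale^2))."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * scale ** 2))

# Hypothetical example: n = 54 points, H = 21 equally spaced centers.
x = np.linspace(0.0, 1.0, 54)
centers = np.linspace(0.0, 1.0, 21)
W = rbf_design(x, centers, scale=0.1)
print(W.shape)  # (54, 21)
```

Each row of `W` plays the role of $w_i$; the modeling question is then which of the 21 columns to keep.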

2. Bayesian Variable Selection via a Mixture Prior

Introduce a model indicator vector

$$\gamma = (\gamma_1, \dots, \gamma_H) \in \{0,1\}^H,$$

with

  • $\gamma_h = 1$: basis function $b_h$ is included,
  • $\gamma_h = 0$: basis function $b_h$ is excluded ($\beta_h = 0$).

The model space $\Gamma$ contains all $2^H$ possible inclusion patterns.

A convenient way to encode this is to work in the full model and give each coefficient $\beta_h$ a spike–slab prior:

$$\beta_h \sim \pi_h \, \delta_0 + (1 - \pi_h)\, N\!\big(0,\, \kappa_h^{-1}\sigma^2\big), \quad \sigma^2 \sim \text{Inv-Gamma}(a,b). \tag{20.3}$$

Interpretation:

  • With probability $\pi_h$, $\beta_h = 0$ (the spike at zero).
  • With probability $1 - \pi_h$, $\beta_h$ is drawn from a normal distribution with variance $\kappa_h^{-1}\sigma^2$ (the slab).

This induces:

  • $\gamma_h \sim \text{Bernoulli}(1 - \pi_h)$, independently for $h = 1, \dots, H$.
  • For a given model $\gamma$, the nonzero coefficients $\beta_\gamma = \{\beta_h : \gamma_h = 1\}$ have a multivariate normal prior
    $$\beta_\gamma \sim N_{p_\gamma}(0, V_\gamma \sigma^2),$$
    where $p_\gamma = \sum_h \gamma_h$ is the number of included basis functions and $V_\gamma = \text{diag}(\kappa_h^{-1} : \gamma_h = 1)$, matching the slab variances in (20.3).

This is a variable selection mixture prior: each basis function is either off ($\beta_h = 0$) or on with a Gaussian prior.
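A short simulation makes the on/off structure of the prior concrete. This is an illustrative sketch (the function name and parameter values are mine, not from the text): each draw is exactly zero with probability $\pi$ and Gaussian otherwise:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_spike_slab(pi, kappa, sigma2, size, rng):
    """Draw coefficients from the spike-slab prior (20.3):
    beta_h = 0 with probability pi, else N(0, sigma2 / kappa)."""
    slab = rng.normal(0.0, np.sqrt(sigma2 / kappa), size=size)
    spike = rng.random(size) < pi          # True -> beta_h lands in the spike
    return np.where(spike, 0.0, slab)

beta = draw_spike_slab(pi=0.8, kappa=1.0, sigma2=1.0, size=10_000, rng=rng)
print(np.mean(beta == 0.0))  # roughly 0.8: the prior exclusion rate
```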

2.1 Hyperprior on inclusion probability and Cauchy slab

If there is no strong prior reason to treat basis functions differently, set

  • $\pi_h = \pi$ for all $h$,
  • and put a prior $\pi \sim \text{Beta}(a_\pi, b_\pi)$.

Given $\gamma$, the full conditional posterior for $\pi$ is

$$\pi \mid - \sim \text{Beta}\left(a_\pi + \sum_h (1 - \gamma_h),\; b_\pi + \sum_h \gamma_h\right).$$

This induces an automatic multiplicity adjustment: as more unnecessary basis functions are added, most of them are excluded, so posterior mass shifts toward larger $\pi$ (a higher exclusion probability).
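A small numeric sketch of this Beta update (the hyperparameter values and counts are illustrative): with a uniform $\text{Beta}(1,1)$ hyperprior and only 5 of 50 candidate bases currently included, the full conditional concentrates on high exclusion probabilities:

```python
import numpy as np

def pi_posterior(a_pi, b_pi, gamma):
    """Full conditional for pi (the exclusion probability) given gamma:
    Beta(a_pi + #excluded, b_pi + #included)."""
    gamma = np.asarray(gamma)
    n_incl = int(gamma.sum())
    n_excl = int(gamma.size - n_incl)
    return a_pi + n_excl, b_pi + n_incl

# 5 of 50 candidate bases included, uniform Beta(1, 1) hyperprior.
a, b = pi_posterior(1.0, 1.0, np.r_[np.ones(5), np.zeros(45)])
print(a / (a + b))  # posterior mean of pi: 46/52, about 0.885
```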

To give nonzero coefficients heavy-tailed priors (to avoid overshrinkage of real signals), place

$$\kappa_h \sim \text{Gamma}(0.5, 0.5)$$

for each $h$. This makes the slab marginally Cauchy (a Student-t with one degree of freedom), via the normal scale-mixture representation of the Student-t: a $\text{Gamma}(0.5, 0.5)$ prior on the precision $\kappa_h$ is an inverse-gamma prior on the slab variance.

  • An improper prior on nonzero $\beta_h$ should not be used; it tends to put very high posterior probability on the null model (all $\beta_h = 0$).
  • An improper prior on $\sigma$ is acceptable, because $\sigma$ is shared across all models. For example, set $a = b = 0$ in the Inv-Gamma.

3. Posterior over Models and Analytic Marginal Likelihood

Under fixed $\pi$ and $\kappa_h = \kappa$ for simplicity, the posterior model probability has the form

$$\Pr(\gamma \mid y, X) = \frac{\pi^{H - p_\gamma} (1 - \pi)^{p_\gamma} \, p(y \mid X, \gamma)}{\sum_{\gamma^* \in \Gamma} \pi^{H - p_{\gamma^*}} (1 - \pi)^{p_{\gamma^*}} \, p(y \mid X, \gamma^*)}, \tag{20.4}$$

where $p(y \mid X, \gamma)$ is the marginal likelihood under model $\gamma$:

$$p(y \mid X, \gamma) = \iint \left[ \prod_{i=1}^n N(y_i \mid w_{i,\gamma} \beta_\gamma, \sigma^2) \right] N(\beta_\gamma \mid 0, V_\gamma \sigma^2) \, \text{Inv-Gamma}(\sigma^2 \mid a, b) \, d\beta_\gamma \, d\sigma^2.$$

Here $w_{i,\gamma} = (w_{ih} : \gamma_h = 1)$ is the reduced design row.

This is exactly the marginal likelihood for a normal linear regression with a conjugate normal–inverse-gamma prior, so a closed-form expression is available.

The posterior for (βγ,σ2)(\beta_\gamma, \sigma^2) conditional on γ\gamma is multivariate normal–inverse-gamma, again with standard formulas.
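Because the marginal likelihood is available in closed form, it can be evaluated directly: marginalizing $\beta_\gamma$ gives $y \sim N(0, \sigma^2(I + X V X^\top))$, and integrating out $\sigma^2$ yields a multivariate t. The sketch below (function name and toy data are mine) computes $\log p(y \mid X, \gamma)$ this way:

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, X, V, a, b):
    """log p(y | X) for y ~ N(X beta, sigma^2 I), beta ~ N(0, sigma^2 V),
    sigma^2 ~ Inv-Gamma(a, b).  Marginally, y is multivariate t."""
    n = len(y)
    Sigma = np.eye(n) + X @ V @ X.T            # covariance of y, up to sigma^2
    _, logdet = np.linalg.slogdet(Sigma)
    Q = y @ np.linalg.solve(Sigma, y)          # y' Sigma^{-1} y
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
            + a * np.log(b) + gammaln(a + 0.5 * n)
            - gammaln(a) - (a + 0.5 * n) * np.log(b + 0.5 * Q))

# Toy reduced design W_gamma with 3 included columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = X[:, 0] + 0.5 * rng.normal(size=20)
print(log_marginal_likelihood(y, X, np.eye(3), a=1.0, b=1.0))
```

With $V = 0$ this reduces to the null-model marginal likelihood, which gives a quick correctness check.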

3.1 Computational difficulty when H is large

The main obstacle is the size of model space:

  • There are $2^H$ models.
  • For $H = 50$, this is about $1.1 \times 10^{15}$ models.

Exact summation over all $\gamma$ is computationally infeasible unless $H$ is very small.

Therefore, approximations are needed:

  1. Stochastic search over models, using MCMC to find high posterior probability models and averaging over them.
  2. Gibbs sampling for $\gamma$, updating one $\gamma_h$ at a time.

4. Gibbs Sampling for Model Indicators

For Gibbs sampling, update each $\gamma_h$ from its Bernoulli full conditional:

$$\Pr(\gamma_h = 1 \mid \gamma_{(-h)}, \pi, y, X) = \left[ 1 + \frac{\pi}{1 - \pi} \cdot \frac{p(y \mid X, \gamma_h = 0, \gamma_{(-h)})}{p(y \mid X, \gamma_h = 1, \gamma_{(-h)})} \right]^{-1},$$

where $\gamma_{(-h)}$ denotes all indicators except $\gamma_h$.

One Gibbs cycle:

  • iterate over $h = 1, \dots, H$,
  • update $\gamma_h$ using the above probability.

After warm-up, successive draws of $\gamma$ approximate samples from the posterior over $\Gamma$.

4.1 Model selection vs model averaging

From the posterior sample of $\gamma$, one can:

  • compute marginal inclusion probabilities $\Pr(\gamma_h = 1 \mid \text{data})$ for each basis function,
  • identify a maximum posterior model (the single $\gamma$ with largest posterior probability),
  • or average predictions and curve estimates over the entire posterior over $\gamma$.

In high-dimensional spaces, many models can have similar posterior probabilities, so relying on a single “best model” can be misleading. Model averaging better reflects uncertainty.

If a single model must be chosen, a good choice is the median probability model:

  • include every basis function with marginal inclusion probability $\Pr(\gamma_h = 1 \mid \text{data}) > 0.5$.

For orthogonal basis functions, this model has optimality properties as the best single-model approximation to full Bayesian model averaging.
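Extracting the median probability model from posterior samples of $\gamma$ is a simple thresholding of the inclusion frequencies. A sketch (the helper name and the four toy draws are hypothetical):

```python
import numpy as np

def median_probability_model(gamma_draws):
    """Keep basis function h iff its marginal posterior inclusion
    probability exceeds 0.5 (the median probability model)."""
    incl = np.asarray(gamma_draws, dtype=float).mean(axis=0)
    return incl, incl > 0.5

# Hypothetical posterior draws of gamma for H = 4 basis functions.
draws = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 1],
                  [1, 1, 0, 0],
                  [1, 1, 1, 0]])
incl, keep = median_probability_model(draws)
print(incl)  # [1.   0.75 0.25 0.25]
print(keep)  # [ True  True False False]
```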


5. Example: Chloride Concentration with Basis Selection

In the chloride concentration example:

  • There are 54 observations,
  • and 21 spline basis functions.

If all 21 basis functions are included with a weakly informative Gaussian prior, such as

  • $\beta \sim N(0, I)$ or $\beta \sim N(0, 2^2 I)$,

the posterior mean curve fits poorly:

  • the fit drifts downward toward zero,
  • because the prior is centered at 0 and, with 21 coefficients informed by only 54 observations, it pulls the posterior toward that prior mean.

Instead, treat the 21 splines as a full candidate basis set and perform Bayesian variable selection:

  • prior inclusion probability for each basis: $0.5$,
  • nonzero coefficients: $\beta_h \sim N(0, 2^2)$,
  • residual variance: $\sigma^2 \sim \text{Inv-Gamma}(1, 1)$.

A Gibbs sampler is run:

  • at each iteration, some subset of the $\beta_h$ is active and updated; the others are set to 0 via $\gamma$.
  • convergence is fast, and computation is quick.

Results:

  • Posterior mean number of active basis functions: 12.0
  • 95% posterior interval for model size: [8.0, 16.0]
  • Posterior mean of the residual standard deviation: $\hat{\sigma} = 0.27$
  • 95% interval for $\sigma$: [0.23, 0.33]

This indicates:

  • a moderately complex but smooth curve (about 8–16 basis functions),
  • relatively low measurement error.

A limitation: results can be sensitive to the initial choice of basis family and number of basis functions. For example, choosing 21 smooth cubic splines implicitly assumes that the true curve is smooth and does not have extremely sharp peaks. If sharp spikes are plausible, a different basis (for example wavelets) might be more appropriate, or a mix of basis families can be considered and selected via variable selection.


6. Shrinkage Priors as a Continuous Alternative to Hard Selection

Variable selection with exact zeros is conceptually attractive but computationally demanding:

  • Model space is enormous ($2^H$ models).
  • MCMC can explore only a tiny fraction of the models.
  • One-at-a-time updates of $\gamma_h$ can mix slowly.
  • Nonconjugate priors make computation even harder.

An alternative is to avoid exact zeros and instead use shrinkage priors on $\beta_h$ that:

  • concentrate probability mass near zero, so many coefficients are effectively negligible,
  • have heavy tails to avoid over-shrinking large, important coefficients.

6.1 Scale mixtures of normals

Most useful shrinkage priors can be written as:

$$\beta_h \sim N(0, \sigma_h^2), \quad \sigma_h^2 \sim G,$$

where $G$ is a mixing distribution over variances.

Examples:

  • A t-distribution with $\nu$ degrees of freedom arises if $\sigma_h^2 \sim \text{Inv-Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right)$.
  • As $\nu \to 0$, one obtains something like a normal-Jeffreys prior, which can produce posterior modes at exactly zero but leads to improper posteriors and no valid Bayesian uncertainty quantification.
  • A common practical choice is $\nu = 1$, giving a Cauchy prior:
    • very heavy tails,
    • strong shrinkage near zero,
    • good empirical behavior in many regression and nonparametric settings.
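The scale-mixture representation can be checked numerically (a sketch using SciPy; the function name is mine): integrating the normal density over an $\text{Inv-Gamma}(\nu/2, \nu/2)$ mixing distribution recovers the Student-t density with $\nu$ degrees of freedom.

```python
import numpy as np
from scipy import integrate, stats

def t_density_via_mixture(beta, nu):
    """Integrate N(beta | 0, s) against s ~ Inv-Gamma(nu/2, nu/2);
    the result should match the Student-t density with nu d.o.f."""
    integrand = lambda s: (stats.norm.pdf(beta, scale=np.sqrt(s))
                           * stats.invgamma.pdf(s, a=nu / 2, scale=nu / 2))
    val, _ = integrate.quad(integrand, 0.0, np.inf)
    return val

beta, nu = 1.3, 1.0                       # nu = 1: the Cauchy case
print(t_density_via_mixture(beta, nu))    # should agree with stats.t.pdf below
print(stats.t.pdf(beta, df=nu))
```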

The Laplace (double-exponential) prior is also a normal scale mixture and underlies the lasso:

  • It creates exact zeros in the posterior mode,
  • but posterior draws of $\beta_h$ are not exactly zero,
  • and its tails are not as heavy as the Cauchy's, so large signals can be overshrunk.

7. Generalized Double Pareto (GDP) Shrinkage Prior

The generalized double Pareto prior offers:

  • strong peak at zero (like Laplace),
  • arbitrarily heavy tails (like Cauchy or heavier).

Its density is

$$g_{\text{dP}}(\beta \mid \xi, \alpha) = \frac{1}{2\xi} \left(1 + \frac{|\beta|}{\alpha \xi}\right)^{-(\alpha + 1)},$$

where

  • $\xi > 0$: scale parameter,
  • $\alpha > 0$: shape parameter.

This prior can be represented as a scale mixture of normals:

$$\beta \sim N(0, \sigma^2), \quad \sigma^2 \sim \text{Exponential}(\lambda^2/2), \quad \lambda \sim \text{Gamma}(\alpha, \eta),$$

with $\xi = \eta/\alpha$.
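This representation can be verified numerically (a sketch; function names are mine): the normal-exponential stage integrates to a Laplace density with rate $\lambda$, and mixing that Laplace over $\lambda \sim \text{Gamma}(\alpha, \eta)$ (rate $\eta$) recovers the GDP density with $\xi = \eta/\alpha$:

```python
import numpy as np
from scipy import integrate, stats

def gdp_pdf(beta, xi, alpha):
    """Generalized double Pareto density."""
    return 1.0 / (2.0 * xi) * (1.0 + abs(beta) / (alpha * xi)) ** (-(alpha + 1.0))

def gdp_via_mixture(beta, alpha, eta):
    """Mix the Laplace density (lam/2) exp(-lam |beta|), itself a
    normal-exponential mixture, over lam ~ Gamma(alpha, rate=eta)."""
    integrand = lambda lam: (0.5 * lam * np.exp(-lam * abs(beta))
                             * stats.gamma.pdf(lam, a=alpha, scale=1.0 / eta))
    val, _ = integrate.quad(integrand, 0.0, np.inf)
    return val

beta, alpha, eta = 0.7, 1.0, 1.0
print(gdp_via_mixture(beta, alpha, eta))
print(gdp_pdf(beta, xi=eta / alpha, alpha=alpha))  # should agree
```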

For nonparametric regression with basis functions, apply this prior independently to all $\beta_h$:

$$p(\beta \mid \sigma) = \prod_{h=1}^H \frac{\alpha}{2\sigma\eta} \left(1 + \frac{|\beta_h|}{\sigma\eta}\right)^{-(\alpha + 1)},$$

equivalently:

$$\beta_h \sim N(0, \sigma^2 \tau_h), \quad \tau_h \sim \text{Exponential}(\lambda_h^2/2), \quad \lambda_h \sim \text{Gamma}(\alpha, \eta),$$

and put a prior $p(\sigma) \propto 1/\sigma$ on the error standard deviation.

A typical default choice:

  • $\alpha = 1$, $\eta = 1$ → Cauchy-like tails.

8. Gibbs Sampler for GDP Shrinkage

Define:

  • $W$: design matrix with rows $w_i$,
  • $T = \text{diag}(\tau_1, \dots, \tau_H)$.

The conditional posterior distributions are:

  1. Coefficients:
     $$\beta \mid - \sim N\Big( (W^\top W + T^{-1})^{-1} W^\top y,\; \sigma^2 (W^\top W + T^{-1})^{-1} \Big).$$
  2. Error variance:
     $$\sigma^2 \mid - \sim \text{Inv-Gamma}\left( \frac{n + H}{2},\; \frac{(y - W\beta)^\top (y - W\beta)}{2} + \frac{\beta^\top T^{-1} \beta}{2} \right).$$
  3. Hyperparameters for each coefficient:
    • Update $\lambda_h$:
      $$\lambda_h \mid - \sim \text{Gamma}\left(\alpha + 1,\; \frac{|\beta_h|}{\sigma} + \eta\right).$$
    • Update $\tau_h^{-1}$ via an inverse-Gaussian:
      $$\tau_h^{-1} \mid - \sim \text{Inv-Gaussian}\left(\mu = \left|\frac{\lambda_h \sigma}{\beta_h}\right|,\; \rho = \lambda_h^2\right).$$

This yields a block Gibbs sampler:

  • update all of $\beta$ at once,
  • then $\sigma^2$,
  • then the local scale parameters $\lambda_h$ and $\tau_h$.

Block updating of $\beta$ often leads to good mixing.
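A compact sketch of the three-block sampler above (the toy data, defaults, and small numerical guard are my assumptions; NumPy's `wald` generator supplies the inverse-Gaussian draws, with `mean` playing the role of $\mu$ and `scale` the role of $\rho$):

```python
import numpy as np

def gdp_gibbs(y, W, alpha=1.0, eta=1.0, iters=1000, rng=None):
    """Block Gibbs sampler for the GDP shrinkage prior (sketch)."""
    rng = rng or np.random.default_rng(0)
    n, H = W.shape
    tau, lam, sigma2 = np.ones(H), np.ones(H), 1.0
    WtW, Wty = W.T @ W, W.T @ y
    keep = []
    for t in range(iters):
        # 1. beta | - ~ N(A^{-1} W'y, sigma^2 A^{-1}),  A = W'W + T^{-1}
        A = WtW + np.diag(1.0 / tau)
        L = np.linalg.cholesky(np.linalg.inv(A))
        beta = np.linalg.solve(A, Wty) + np.sqrt(sigma2) * L @ rng.normal(size=H)
        # 2. sigma^2 | - ~ Inv-Gamma((n+H)/2, (SSR + beta' T^{-1} beta)/2)
        resid = y - W @ beta
        rate = 0.5 * (resid @ resid + beta @ (beta / tau))
        sigma2 = 1.0 / rng.gamma(0.5 * (n + H), 1.0 / rate)
        # 3. local scales: lam_h ~ Gamma(alpha+1, |beta_h|/sigma + eta),
        #    then tau_h^{-1} ~ Inv-Gaussian(|lam_h sigma / beta_h|, lam_h^2)
        ab = np.abs(beta)
        lam = rng.gamma(alpha + 1.0, 1.0 / (ab / np.sqrt(sigma2) + eta))
        mu = lam * np.sqrt(sigma2) / np.maximum(ab, 1e-10)  # guard beta_h ~ 0
        tau = 1.0 / rng.wald(mu, lam ** 2)
        if t >= iters // 2:
            keep.append(beta.copy())
    return np.array(keep)

# Toy example: 2 real signals among 10 candidate basis columns.
rng = np.random.default_rng(3)
W = rng.normal(size=(80, 10))
y = 3.0 * W[:, 0] - 2.0 * W[:, 1] + 0.5 * rng.normal(size=80)
draws = gdp_gibbs(y, W, rng=rng)
print(draws.mean(axis=0).round(2))  # large for columns 0 and 1, near 0 elsewhere
```

The posterior means of the two signal coefficients stay close to their true values, while the eight spurious coefficients are shrunk toward zero.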

After convergence, the posterior draws of $\beta$ induce a posterior over the regression function:

$$\mu(x) = \sum_{h=1}^H \beta_h b_h(x).$$

In high-dimensional situations with many candidate basis functions:

  • many $\beta_h$ are shrunk close to zero,
  • important basis functions retain non-negligible coefficients,
  • the specific active basis functions may change across iterations (especially with nonorthogonal bases), but the overall function estimate $\mu(x)$ is stable.