1. Goal: Make the Mean Function Nonlinear

Standard regression uses a linear predictor:XiβX_i \beta

To allow a nonlinear relationship between predictors and outcome, replace this with a general mean function:μ(Xi)\mu(X_i)

The task is to model μ()\mu(\cdot) flexibly. One approach is to represent μ(x)\mu(x) as a weighted sum of basis functions.


2. Basis Function Representation

Consider one-dimensional regression with predictor xx. The mean function is written asμ(x)=h=1Hβhbh(x),\mu(x) = \sum_{h=1}^{H} \beta_h\, b_h(x),

where

  • bh(x)b_h(x) is the h-th basis function,
  • βh\beta_h​ is its weight (coefficient),
  • HH is the number of basis functions.

This includes familiar representations:

  • Taylor series: basis functions = polynomials 1,x,x2,x3,1, x, x^2, x^3, \dots

However, high-order polynomials:

  • may require many terms for good fit over a wide domain
  • often behave poorly near boundaries (large oscillations).

For practical modeling, local basis functions are usually better.


3. Local Basis Functions

Local basis functions are centered at different locations xhx_h​ and decay away from those centers. The function value is significant only near xhx_h​ and small far away.

3.1 Gaussian radial basis functions

A common choice is the Gaussian radial basis function:bh(x)=exp ⁣(xxh2l2),b_h(x) = \exp\!\left(-\frac{|x – x_h|^2}{l^2}\right),

  • xhx_h​: center of the h-th basis function
  • ll: width (length-scale) parameter.

Properties:

  • The number of basis functions HH and the width ll control how rapidly μ(x)\mu(x) can change.
  • Smaller ll → more local, wiggly functions; larger ll → smoother, more global behavior.

3.2 B-splines

Another widely used family is B-splines, especially cubic B-splines.

Assume uniform knots with spacing δ\delta:xh+k=xh+δk.x_{h+k} = x_h + \delta k.

The cubic B-spline basis function bh(x)b_h(x)bh​(x) is a piecewise cubic polynomial, defined over four intervals:bh(x)={16u3,x(xh,xh+1),16(1+3u+3u23u3),x(xh+1,xh+2),16(46u2+3u3),x(xh+2,xh+3),16(13u+3u2u3),x(xh+3,xh+4),0,otherwise,b_h(x) = \begin{cases} \frac{1}{6}u^3, & x \in (x_h, x_{h+1}), \\[4pt] \frac{1}{6}(1 + 3u + 3u^2 – 3u^3), & x \in (x_{h+1}, x_{h+2}), \\[4pt] \frac{1}{6}(4 – 6u^2 + 3u^3), & x \in (x_{h+2}, x_{h+3}), \\[4pt] \frac{1}{6}(1 – 3u + 3u^2 – u^3), & x \in (x_{h+3}, x_{h+4}), \\[4pt] 0, & \text{otherwise}, \end{cases}

with

  • u=(xxh)/δu = (x – x_h)/\delta, or similar shifts on later intervals.

Key points:

  • The width of each B-spline is determined by the knot spacing δ\delta.
  • The flexibility of the model depends on the number of knots (and thus the number of basis functions).
  • Compact support: each B-spline is nonzero only on a small interval → the design matrix is sparse, which is computationally efficient.

4. Comparison: Gaussian vs B-spline Basis

Both Gaussian and cubic B-spline basis functions:

  • have smooth, bell-shaped profiles
  • can be combined in weighted sums to approximate smooth functions.

Differences:

  • Gaussian basis functions are infinitely differentiable, leading to very smooth approximations.
  • Cubic B-splines are three-times differentiable, so smooth but with less smoothness than Gaussians.
  • B-splines have compact support; Gaussians decay but never reach exactly zero.

If the true mean function μ(x)\mu(x) has a very sharp spike narrower than the basis function width, the model with fixed-width basis functions will oversmooth and fail to capture that spike.


5. Linear Model in Basis Coefficients

Once the basis functions are chosen, the model is linear in the parameters β\beta.

For data (xi,yi)(x_i, y_i), writeyi=μ(xi)+εi=h=1Hβhbh(xi)+εi.y_i = \mu(x_i) + \varepsilon_i = \sum_{h=1}^{H} \beta_h b_h(x_i) + \varepsilon_i.

Definewi=(b1(xi),,bH(xi)),w_i = (b_1(x_i), \dots, b_H(x_i)),

soyi=wiβ+εi.y_i = w_i \beta + \varepsilon_i.

This is just a standard linear regression in the transformed predictors wiw_i​.

Consequences:

  • All familiar linear regression tools apply.
  • With a conjugate prior (multivariate normal–inverse–χ2\chi^2 or normal–inverse-gamma) for (β,σ2)(\beta, \sigma^2), the posterior remains in the same family.
  • Although the model is linear in β\beta, the function μ(x)\mu(x) can be highly nonlinear.

6. Centering the Model on a Linear Trend

Often it is useful to decompose μ(x)\mu(x) into:

  • a global linear trend, plus
  • a flexible deviation captured by basis functions.

Writeμ(x)=β1+β2x+h=3H+2βhbh(x).\mu(x) = \beta_1 + \beta_2 x + \sum_{h=3}^{H+2} \beta_h b_h(x).

Here:

  • β1+β2x\beta_1 + \beta_2 x: linear component
  • h=3H+2βhbh(x)\sum_{h=3}^{H+2} \beta_h b_h(x): nonlinear correction.

This structure encourages the spline to capture local deviations while the linear part captures the overall trend.


7. Example: Chloride Concentration Data

A small dataset with 54 measurements of chloride concentration over time illustrates the method.

  • A simple linear regression fits the data roughly.
  • Visual inspection shows small but clear local deviations from linearity.

A B-spline model is used:

  • Number of coefficients: 21 (β1,,β21\beta_1, \dots, \beta_{21}​).
  • Sample size: 54.
  • Potential issue: too many parameters relative to data → risk of overfitting without regularization.

To address this, introduce a prior on β\beta.

7.1 Prior centered on a linear function

Letβσ2N(β0,σ2λ1IH),σ2Inv-Gamma(a0,b0).\beta \mid \sigma^2 \sim N(\beta_0, \sigma^2 \lambda^{-1} I_H), \quad \sigma^2 \sim \text{Inv-Gamma}(a_0, b_0).

This implies that the prior mean function isμ0(x)=E[μ(x)]=h=1Hβ0hbh(x).\mu_0(x) = E[\mu(x)] = \sum_{h=1}^{H} \beta_{0h} b_h(x).

Choose β0\beta_0 so that μ0(x)\mu_0(x) is approximately linear, sayμ0(x)α+ψx.\mu_0(x) \approx \alpha + \psi x.

Procedure:

  1. Fit a simple linear regression to get least squares estimates α^,ψ^\hat{\alpha}, \hat{\psi}​.
  2. Find β0\beta_0​ such that the spline combination μ0(x)\mu_0(x) matches the linear function α^+ψ^x\hat{\alpha} + \hat{\psi} x as closely as possible (using least squares over the input range).
  3. Use that β0\beta_0​ as the prior mean.

For H=21H = 21 in this example, the spline approximation to the linear function is essentially indistinguishable from an exact straight line.

7.2 Posterior mean

The posterior mean of μ(x)\mu(x) has a closed form:μ^(x)=E[μ(x)(xi,yi)i=1n]=(WW+λIH)1(Wy+λμ^0(x)),\hat{\mu}(x) = E[\mu(x) \mid (x_i, y_i)_{i=1}^n] = \Big(W^\top W + \lambda I_H\Big)^{-1} \Big(W^\top y + \lambda \hat{\mu}_0(x)\Big),

where

  • WW: matrix with row wiw_i
  • yy: vector of responses
  • μ^0(x)\hat{\mu}_0(x): the linear regression function used as prior mean.

Interpretation:

  • μ^(x)\hat{\mu}(x) is shrunk toward the linear regression fit,
  • while still allowing smooth nonlinear deviations controlled by the basis functions and λ\lambda.

To extend the model, hyperpriors can be put on α\alpha, ψ\psi, or one can impose smoothing priors that encourage neighboring coefficients βh\beta_h​ to be similar (for example, an AR(1) prior). Such models are often called Bayesian penalized splines (P-splines).


8. Choosing Knots and Handling Uncertainty

Two important design choices:

  1. Number of knots / basis functions
  2. Locations of knots

A common practical strategy:

  • Use “enough” knots (for example, H=21H = 21 in the example)
  • Control flexibility through priors on β\beta so that overfitting is avoided.

Several Bayesian strategies handle uncertainty in the basis specification:

8.1 Free-knot models with reversible jump MCMC

  • Put a prior on both the number of knots and their locations.
  • Use reversible jump MCMC to move between models with different knot numbers and positions.
  • Posterior for (μ,σ)(\mu, \sigma) averages over knot configurations.

This approach is conceptually attractive but often computationally challenging due to the difficulty of designing efficient reversible jump proposals.

8.2 Shrinkage priors instead of exact selection

An alternative is to keep a rich set of basis functions but use shrinkage priors for βh\beta_h​:

  • Priors with strong mass near 0, but heavy tails
  • Many coefficients are shrunk close to zero
  • Large coefficients are allowed for important basis functions.

This acts as a continuous analogue of variable selection:

  • No coefficient is forced exactly to zero
  • Basis functions with negligible effect are heavily shrunk
  • Important basis functions remain relatively unshrunk.

Shrinkage priors simplify computation and avoid discrete model jumps.