1. Goal: Make the Mean Function Nonlinear
Standard regression uses a linear predictor:
$$E(y_i \mid x_i, \beta) = x_i \beta.$$
To allow a nonlinear relationship between predictors and outcome, replace this with a general mean function:
$$E(y_i \mid x_i) = \mu(x_i).$$
The task is to model $\mu(x)$ flexibly. One approach is to represent $\mu(x)$ as a weighted sum of basis functions.
2. Basis Function Representation
Consider one-dimensional regression with predictor $x$. The mean function is written as
$$\mu(x) = \sum_{h=1}^{H} \beta_h b_h(x),$$
where
- $b_h(x)$ is the h-th basis function,
- $\beta_h$ is its weight (coefficient),
- $H$ is the number of basis functions.
This includes familiar representations:
- Taylor series: the basis functions are polynomials, for example the powers $b_h(x) = x^{h-1}$.
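As a minimal sketch of the weighted-sum form, the snippet below evaluates $\mu(x) = \sum_h \beta_h b_h(x)$ with a polynomial (monomial) basis. The function names `polynomial_basis` and `mean_function` and the example weights are illustrative assumptions, not part of the notes.

```python
import numpy as np

def polynomial_basis(x, H):
    """Return the n x H matrix whose h-th column is x**(h-1) (monomial basis)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=H, increasing=True)

def mean_function(x, beta, basis=polynomial_basis):
    """Evaluate mu(x) = sum_h beta_h * b_h(x) as a matrix-vector product."""
    beta = np.asarray(beta, dtype=float)
    return basis(x, len(beta)) @ beta

# Illustrative weights: mu(x) = 1 + 2x + 3x^2
beta = [1.0, 2.0, 3.0]
x = np.array([0.0, 1.0, 2.0])
print(mean_function(x, beta))  # mu at x = 0, 1, 2 is 1, 6, 17
```

Any other basis can be swapped in by passing a different `basis` callable that returns an n x H matrix.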
However, high-order polynomials:
- may require many terms for good fit over a wide domain
- often behave poorly near boundaries (large oscillations).
For practical modeling, local basis functions are usually better.
3. Local Basis Functions
Local basis functions are centered at different locations and decay away from those centers. The value $b_h(x)$ is non-negligible only near the center $x_h$ and is small far away from it.
3.1 Gaussian radial basis functions
A common choice is the Gaussian radial basis function:
$$b_h(x) = \exp\left(-\frac{|x - x_h|^2}{\ell^2}\right)$$
- $x_h$: center of the h-th basis function
- $\ell$: width (length-scale) parameter.
Properties:
- The number of basis functions $H$ and the width $\ell$ control how rapidly $\mu(x)$ can change.
- Smaller $\ell$ → more local, wiggly functions; larger $\ell$ → smoother, more global behavior.
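The effect of the width can be sketched by building the matrix of basis function values for two length-scales. This is a minimal illustration: the function name `gaussian_basis`, the equally spaced centers, and the specific widths are assumptions chosen for the example.

```python
import numpy as np

def gaussian_basis(x, centers, ell):
    """n x H matrix with entries exp(-(x_i - x_h)^2 / ell^2)."""
    x = np.asarray(x, dtype=float)[:, None]              # shape (n, 1)
    centers = np.asarray(centers, dtype=float)[None, :]  # shape (1, H)
    return np.exp(-((x - centers) ** 2) / ell ** 2)

x = np.linspace(0.0, 1.0, 101)
centers = np.linspace(0.0, 1.0, 11)   # H = 11 equally spaced centers (assumed)

W_narrow = gaussian_basis(x, centers, ell=0.05)  # local bumps, allow wiggly fits
W_wide = gaussian_basis(x, centers, ell=0.5)     # broad bumps, give smoother fits

# Each basis function equals 1 at its own center and decays away from it.
print(W_narrow.shape)
# Summing one column over the grid shows how much of the range the bump covers.
print(W_narrow[:, 5].sum(), W_wide[:, 5].sum())
```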
3.2 B-splines
Another widely used family is B-splines, especially cubic B-splines.
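Assume uniform knots with spacing $\delta$:
$$x_{h+1} = x_h + \delta \quad \text{for all } h.$$
The cubic B-spline basis function $b_h(x)$ is a piecewise cubic polynomial, defined over four intervals:
$$
b_h(x) = \begin{cases}
\frac{1}{6} u^3, & x \in [x_h, x_{h+1}), \\
\frac{1}{6}(1 + 3u + 3u^2 - 3u^3), & x \in [x_{h+1}, x_{h+2}), \\
\frac{1}{6}(4 - 6u^2 + 3u^3), & x \in [x_{h+2}, x_{h+3}), \\
\frac{1}{6}(1 - 3u + 3u^2 - u^3), & x \in [x_{h+3}, x_{h+4}), \\
0, & \text{otherwise,}
\end{cases}
$$
with
- $u = (x - x_h)/\delta$ on the first interval, or similar shifts on later intervals: $u = (x - x_{h+k})/\delta$ on $[x_{h+k}, x_{h+k+1})$ for $k = 1, 2, 3$.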
Key points:
- The width of each B-spline is determined by the knot spacing $\delta$: each one spans $4\delta$.
- The flexibility of the model depends on the number of knots (and thus the number of basis functions).
- Compact support: each B-spline is nonzero only on four adjacent knot intervals → the design matrix is sparse, which is computationally efficient.
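The following sketch evaluates the uniform cubic B-spline from its standard piecewise form and checks the sparsity claim numerically. The helper names `cubic_bspline` and `bspline_design`, and the knot placement (first knot three spacings to the left of the range so that every point is covered by four overlapping splines), are assumptions made for the example.

```python
import numpy as np

def cubic_bspline(t):
    """Uniform cubic B-spline with knots 0, 1, 2, 3, 4, at t = (x - x_h) / delta."""
    t = np.asarray(t, dtype=float)
    k = np.floor(t)   # index of the knot interval containing t
    u = t - k         # position within that interval, in [0, 1)
    out = np.zeros_like(t)
    out = np.where(k == 0, u**3, out)
    out = np.where(k == 1, 1 + 3*u + 3*u**2 - 3*u**3, out)
    out = np.where(k == 2, 4 - 6*u**2 + 3*u**3, out)
    out = np.where(k == 3, (1 - u)**3, out)
    return out / 6.0  # stays zero outside [0, 4): compact support

def bspline_design(x, first_knot, delta, H):
    """n x H matrix; column h is the B-spline whose first knot is first_knot + h*delta."""
    x = np.asarray(x, dtype=float)
    starts = first_knot + delta * np.arange(H)
    return cubic_bspline((x[:, None] - starts[None, :]) / delta)

# Hypothetical setup: H = 21 basis functions covering [0, 1].
H, a, b = 21, 0.0, 1.0
delta = (b - a) / (H - 3)
x = np.linspace(a, b, 200)
W = bspline_design(x, first_knot=a - 3 * delta, delta=delta, H=H)

print(W.shape)
print((W > 1e-12).sum(axis=1).max())  # number of non-negligible entries per row
print(W.sum(axis=1).min(), W.sum(axis=1).max())  # rows sum to 1 on [a, b]
```

Because each row has at most four non-negligible entries out of 21, the matrix could be stored in a sparse format when the number of basis functions is large.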
4. Comparison: Gaussian vs B-spline Basis
Both Gaussian and cubic B-spline basis functions:
- have smooth, bell-shaped profiles
- can be combined in weighted sums to approximate smooth functions.
Differences:
- Gaussian basis functions are infinitely differentiable, leading to very smooth approximations.
- Cubic B-splines are twice continuously differentiable (the third derivative jumps at the knots), so they are smooth but less so than Gaussians.
- B-splines have compact support; Gaussians decay but never reach exactly zero.
If the true mean function has a very sharp spike narrower than the basis function width, the model with fixed-width basis functions will oversmooth and fail to capture that spike.
5. Linear Model in Basis Coefficients
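Once the basis functions are chosen, the model is linear in the parameters $\beta = (\beta_1, \dots, \beta_H)^\top$.
For data $(x_i, y_i)$, $i = 1, \dots, n$, write
$$y_i = \sum_{h=1}^{H} \beta_h b_h(x_i) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2).$$
Define
$$w_i = (b_1(x_i), \dots, b_H(x_i)),$$
so
$$y_i = w_i \beta + \epsilon_i.$$
This is just a standard linear regression in the transformed predictors $w_i$.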
Consequences:
- All familiar linear regression tools apply.
- With a conjugate prior (multivariate normal–inverse-$\chi^2$ or normal–inverse-gamma) for $(\beta, \sigma^2)$, the posterior remains in the same family.
- Although the model is linear in $\beta$, the function $\mu(x)$ can be highly nonlinear in $x$.
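To make the point concrete, the sketch below fits a nonlinear curve with ordinary least squares on the transformed predictors. It uses Gaussian basis functions and synthetic sine data; the function names, the number of centers, and the width are assumptions for illustration only.

```python
import numpy as np

def gaussian_basis(x, centers, ell):
    """n x H matrix of Gaussian basis function values."""
    x = np.asarray(x, dtype=float)[:, None]
    centers = np.asarray(centers, dtype=float)[None, :]
    return np.exp(-((x - centers) ** 2) / ell ** 2)

def fit_basis_regression(W, y):
    """Ordinary least squares for y = W beta + noise; returns beta_hat."""
    beta_hat, *_ = np.linalg.lstsq(W, y, rcond=None)
    return beta_hat

# Synthetic data: a sine curve plus noise (not from the notes).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=80))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

centers = np.linspace(0.0, 1.0, 10)
W = gaussian_basis(x, centers, ell=0.15)   # row i is w_i
beta_hat = fit_basis_regression(W, y)      # linear in beta

# The fitted mean function is nonlinear in x even though the fit was linear in beta.
x_grid = np.linspace(0.0, 1.0, 200)
mu_hat = gaussian_basis(x_grid, centers, ell=0.15) @ beta_hat
print(beta_hat.shape, mu_hat.shape)
```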
6. Centering the Model on a Linear Trend
Often it is useful to decompose $\mu(x)$ into:
- a global linear trend, plus
- a flexible deviation captured by basis functions.
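Write
$$\mu(x) = \alpha_0 + \alpha_1 x + \sum_{h=1}^{H} \beta_h b_h(x).$$
Here:
- $\alpha_0 + \alpha_1 x$: linear component
- $\sum_{h=1}^{H} \beta_h b_h(x)$: nonlinear correction.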
This structure encourages the spline to capture local deviations while the linear part captures the overall trend.
7. Example: Chloride Concentration Data
A small dataset with 54 measurements of chloride concentration over time illustrates the method.
- A simple linear regression fits the data roughly.
- Visual inspection shows small but clear local deviations from linearity.
A B-spline model is used:
- Number of coefficients: 21 ($H = 21$).
- Sample size: 54.
- Potential issue: too many parameters relative to data → risk of overfitting without regularization.
To address this, introduce a prior on the coefficient vector $\beta$.
7.1 Prior centered on a linear function
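Let
$$\beta \mid \sigma^2 \sim N(\beta_0, \sigma^2 \lambda^{-1} I_H).$$
This implies that the prior mean function is
$$\mu_0(x) = \sum_{h=1}^{H} \beta_{0h} b_h(x).$$
Choose $\beta_0$ so that $\mu_0(x)$ is approximately linear, say
$$\mu_0(x) \approx \hat\alpha_0 + \hat\alpha_1 x.$$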
Procedure:
- Fit a simple linear regression to get least squares estimates $(\hat\alpha_0, \hat\alpha_1)$ of the intercept and slope.
- Find $\beta_0$ such that the spline combination $\sum_h \beta_{0h} b_h(x)$ matches the linear function $\hat\alpha_0 + \hat\alpha_1 x$ as closely as possible (using least squares over the input range).
- Use that $\beta_0$ as the prior mean.
For $H = 21$ in this example, the spline approximation to the linear function is essentially indistinguishable from an exact straight line.
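The three steps can be sketched as follows. The data here are a synthetic stand-in (the chloride measurements are not reproduced), and the helper names, the uniform knot placement, and the evaluation grid are assumptions for the example.

```python
import numpy as np

def cubic_bspline(t):
    """Uniform cubic B-spline with knots 0, 1, 2, 3, 4, at t = (x - x_h) / delta."""
    t = np.asarray(t, dtype=float)
    k = np.floor(t)
    u = t - k
    out = np.zeros_like(t)
    out = np.where(k == 0, u**3, out)
    out = np.where(k == 1, 1 + 3*u + 3*u**2 - 3*u**3, out)
    out = np.where(k == 2, 4 - 6*u**2 + 3*u**3, out)
    out = np.where(k == 3, (1 - u)**3, out)
    return out / 6.0

def bspline_design(x, first_knot, delta, H):
    """n x H matrix of B-spline values on uniform knots."""
    x = np.asarray(x, dtype=float)
    starts = first_knot + delta * np.arange(H)
    return cubic_bspline((x[:, None] - starts[None, :]) / delta)

# Synthetic stand-in data with n = 54 (NOT the chloride data).
rng = np.random.default_rng(1)
n, H = 54, 21
x = np.sort(rng.uniform(0.0, 10.0, size=n))
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=n)

# Step 1: least-squares straight line (polyfit returns the slope first).
slope, intercept = np.polyfit(x, y, deg=1)

# Step 2: coefficients beta0 whose spline combination best matches the line on a grid.
a, b = x.min(), x.max()
delta = (b - a) / (H - 3)          # assumed knot spacing covering [a, b]
grid = np.linspace(a, b, 400)
W_grid = bspline_design(grid, a - 3 * delta, delta, H)
line = intercept + slope * grid
beta0, *_ = np.linalg.lstsq(W_grid, line, rcond=None)

# Step 3: beta0 serves as the prior mean; check how well it reproduces the line.
print(np.max(np.abs(W_grid @ beta0 - line)))
```

With this knot placement the maximum discrepancy is at the level of floating-point rounding, since cubic B-splines can represent a straight line exactly wherever four of them overlap.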
7.2 Posterior mean
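The posterior mean of $\beta$ has a closed form:
$$\hat\beta = (W^\top W + \lambda I_H)^{-1} (W^\top y + \lambda \beta_0),$$
where
- $W$: $n \times H$ matrix with $i$-th row $w_i = (b_1(x_i), \dots, b_H(x_i))$
- $y$: vector of responses
- $\beta_0$: coefficients representing the linear regression function used as prior mean
- $\lambda$: prior precision multiplier, so that the prior covariance of $\beta$ is $\sigma^2 \lambda^{-1} I_H$.
Interpretation:
- $\hat\beta$ is shrunk toward the linear regression fit,
- while still allowing smooth nonlinear deviations controlled by the basis functions and $\lambda$.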
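A minimal numerical sketch of this shrinkage, assuming the prior $\beta \mid \sigma^2 \sim N(\beta_0, \sigma^2 \lambda^{-1} I_H)$, under which the posterior mean is $(W^\top W + \lambda I_H)^{-1}(W^\top y + \lambda \beta_0)$. The function name `posterior_mean_beta` and the random design matrix are illustrative assumptions, not the chloride analysis.

```python
import numpy as np

def posterior_mean_beta(W, y, beta0, lam):
    """Posterior mean of beta under the assumed prior N(beta0, sigma^2 / lam * I)."""
    H = W.shape[1]
    A = W.T @ W + lam * np.eye(H)
    rhs = W.T @ y + lam * beta0
    return np.linalg.solve(A, rhs)

# Synthetic check with a random design matrix (sizes mirror n = 54, H = 21).
rng = np.random.default_rng(2)
W = rng.normal(size=(54, 21))
y = rng.normal(size=54)
beta0 = rng.normal(size=21)

beta_ols, *_ = np.linalg.lstsq(W, y, rcond=None)
weak = posterior_mean_beta(W, y, beta0, lam=1e-10)    # nearly the least-squares fit
strong = posterior_mean_beta(W, y, beta0, lam=1e10)   # nearly the prior mean

# Both printed distances should be close to zero.
print(np.max(np.abs(weak - beta_ols)), np.max(np.abs(strong - beta0)))
```

Intermediate values of `lam` interpolate between the two extremes, which is the sense in which the estimate is pulled toward the linear prior mean.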
To extend the model, one can put hyperpriors on the prior parameters (such as the prior mean $\beta_0$ and the prior precision $\lambda$), or impose smoothing priors that encourage neighboring coefficients to be similar (for example, an AR(1) prior). Such models are often called Bayesian penalized splines (P-splines).
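As one illustration of a smoothing prior, the sketch below builds the precision matrix of a stationary AR(1) prior on the coefficients and confirms that neighboring coefficients are correlated a priori. The function name `ar1_precision` and the value $\rho = 0.9$ are assumptions; other smoothing priors are possible.

```python
import numpy as np

def ar1_precision(H, rho, innov_var=1.0):
    """Tridiagonal precision matrix of a stationary AR(1) prior on beta_1..beta_H."""
    Q = np.zeros((H, H))
    idx = np.arange(H)
    Q[idx, idx] = 1.0 + rho ** 2      # interior diagonal entries
    Q[0, 0] = Q[-1, -1] = 1.0         # first and last coefficients
    Q[idx[:-1], idx[1:]] = -rho       # superdiagonal
    Q[idx[1:], idx[:-1]] = -rho       # subdiagonal
    return Q / innov_var

H, rho = 21, 0.9
Q = ar1_precision(H, rho)
cov = np.linalg.inv(Q)

# Prior correlation between the first two coefficients equals rho.
corr_neighbors = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print(corr_neighbors)
```

The precision matrix is tridiagonal, so the prior penalizes differences between adjacent coefficients while keeping computations sparse.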
8. Choosing Knots and Handling Uncertainty
Two important design choices:
- Number of knots / basis functions
- Locations of knots
A common practical strategy:
- Use “enough” knots (for example, $H = 21$ basis functions in the chloride example)
- Control flexibility through priors on $\beta$ so that overfitting is avoided.
Several Bayesian strategies handle uncertainty in the basis specification:
8.1 Free-knot models with reversible jump MCMC
- Put a prior on both the number of knots and their locations.
- Use reversible jump MCMC to move between models with different knot numbers and positions.
- The posterior for $\mu(x)$ then averages over knot configurations.
This approach is conceptually attractive but often computationally challenging due to the difficulty of designing efficient reversible jump proposals.
8.2 Shrinkage priors instead of exact selection
An alternative is to keep a rich set of basis functions but use shrinkage priors for the coefficients $\beta_h$:
- Priors with strong mass near 0, but heavy tails
- Many coefficients are shrunk close to zero
- Large coefficients are allowed for important basis functions.
This acts as a continuous analogue of variable selection:
- No coefficient is forced exactly to zero
- Basis functions with negligible effect are heavily shrunk
- Important basis functions remain relatively unshrunk.
Shrinkage priors simplify computation and avoid discrete model jumps.
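To illustrate what "strong mass near 0, but heavy tails" means, the sketch below compares draws from a horseshoe-type prior with draws from a standard normal. The horseshoe is used here only as one example of a shrinkage prior; the notes do not name a specific prior, and the thresholds 0.1 and 5 are arbitrary choices for the comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
n_draws, tau = 100_000, 1.0

# Horseshoe-type draw: beta = tau * lambda * z, lambda ~ half-Cauchy(0, 1), z ~ N(0, 1).
local_scale = np.abs(rng.standard_cauchy(n_draws))
beta_shrink = tau * local_scale * rng.standard_normal(n_draws)
beta_normal = rng.standard_normal(n_draws)   # reference: plain N(0, 1) prior

def summarize(draws):
    """Fraction of draws very close to zero and fraction far out in the tails."""
    return np.mean(np.abs(draws) < 0.1), np.mean(np.abs(draws) > 5.0)

near_zero_hs, tail_hs = summarize(beta_shrink)
near_zero_n, tail_n = summarize(beta_normal)

# The shrinkage prior puts more mass both near zero and in the far tails.
print(near_zero_hs, near_zero_n)
print(tail_hs, tail_n)
```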
