Splines and weighted sums of basis functions

1. Goal: Make the Mean Function Nonlinear

Standard regression uses a linear predictor: $X_i \beta$

To allow a nonlinear relationship between predictors and outcome, replace this with a general mean function: $\mu(X_i)$

The task is to model $\mu(\cdot)$ flexibly. One approach is to represent $\mu(x)$ as a weighted sum of basis functions.

2. Basis Function Representation

Consider one-dimensional regression with predictor $x$ . The mean function is written as $\mu(x) = \sum_{h=1}^{H} \beta_h\, b_h(x),$

where

$b_h(x)$ is the h-th basis function,
$\beta_h$ is its weight (coefficient),
$H$ is the number of basis functions.

This includes familiar representations:

Taylor series: basis functions = polynomials $1, x, x^2, x^3, \dots$

However, high-order polynomials:

may require many terms for good fit over a wide domain
often behave poorly near boundaries (large oscillations).

For practical modeling, local basis functions are usually better.

3. Local Basis Functions

Local basis functions are centered at different locations $x_h$ and decay away from those centers. The function value is significant only near $x_h$ and small far away.

3.1 Gaussian radial basis functions

A common choice is the Gaussian radial basis function: $b_h(x) = \exp\!\left(-\frac{|x – x_h|^2}{l^2}\right),$

$x_h$ : center of the h-th basis function
$l$ : width (length-scale) parameter.

Properties:

The number of basis functions $H$ and the width $l$ control how rapidly $\mu(x)$ can change.
Smaller $l$ → more local, wiggly functions; larger $l$ → smoother, more global behavior.

3.2 B-splines

Another widely used family is B-splines, especially cubic B-splines.

Assume uniform knots with spacing $\delta$ : $x_{h+k} = x_h + \delta k.$

The cubic B-spline basis function $b_h(x)$ bh(x) is a piecewise cubic polynomial, defined over four intervals: $b_h(x) = \begin{cases} \frac{1}{6}u^3, & x \in (x_h, x_{h+1}), \\[4pt] \frac{1}{6}(1 + 3u + 3u^2 – 3u^3), & x \in (x_{h+1}, x_{h+2}), \\[4pt] \frac{1}{6}(4 – 6u^2 + 3u^3), & x \in (x_{h+2}, x_{h+3}), \\[4pt] \frac{1}{6}(1 – 3u + 3u^2 – u^3), & x \in (x_{h+3}, x_{h+4}), \\[4pt] 0, & \text{otherwise}, \end{cases}$

with

$u = (x – x_h)/\delta$ , or similar shifts on later intervals.

Key points:

The width of each B-spline is determined by the knot spacing $\delta$ .
The flexibility of the model depends on the number of knots (and thus the number of basis functions).
Compact support: each B-spline is nonzero only on a small interval → the design matrix is sparse, which is computationally efficient.

4. Comparison: Gaussian vs B-spline Basis

Both Gaussian and cubic B-spline basis functions:

have smooth, bell-shaped profiles
can be combined in weighted sums to approximate smooth functions.

Differences:

Gaussian basis functions are infinitely differentiable, leading to very smooth approximations.
Cubic B-splines are three-times differentiable, so smooth but with less smoothness than Gaussians.
B-splines have compact support; Gaussians decay but never reach exactly zero.

If the true mean function $\mu(x)$ has a very sharp spike narrower than the basis function width, the model with fixed-width basis functions will oversmooth and fail to capture that spike.

5. Linear Model in Basis Coefficients

Once the basis functions are chosen, the model is linear in the parameters $\beta$ .

For data $(x_i, y_i)$ , write $y_i = \mu(x_i) + \varepsilon_i = \sum_{h=1}^{H} \beta_h b_h(x_i) + \varepsilon_i.$

Define $w_i = (b_1(x_i), \dots, b_H(x_i)),$

so $y_i = w_i \beta + \varepsilon_i.$

This is just a standard linear regression in the transformed predictors $w_i$ .

Consequences:

All familiar linear regression tools apply.
With a conjugate prior (multivariate normal–inverse– $\chi^2$ or normal–inverse-gamma) for $(\beta, \sigma^2)$ , the posterior remains in the same family.
Although the model is linear in $\beta$ , the function $\mu(x)$ can be highly nonlinear.

6. Centering the Model on a Linear Trend

Often it is useful to decompose $\mu(x)$ into:

a global linear trend, plus
a flexible deviation captured by basis functions.

Write $\mu(x) = \beta_1 + \beta_2 x + \sum_{h=3}^{H+2} \beta_h b_h(x).$

Here:

$\beta_1 + \beta_2 x$ : linear component
$\sum_{h=3}^{H+2} \beta_h b_h(x)$ : nonlinear correction.

This structure encourages the spline to capture local deviations while the linear part captures the overall trend.

7. Example: Chloride Concentration Data

A small dataset with 54 measurements of chloride concentration over time illustrates the method.

A simple linear regression fits the data roughly.
Visual inspection shows small but clear local deviations from linearity.

A B-spline model is used:

Number of coefficients: 21 ( $\beta_1, \dots, \beta_{21}$ ).
Sample size: 54.
Potential issue: too many parameters relative to data → risk of overfitting without regularization.

To address this, introduce a prior on $\beta$ .

7.1 Prior centered on a linear function

Let $\beta \mid \sigma^2 \sim N(\beta_0, \sigma^2 \lambda^{-1} I_H), \quad \sigma^2 \sim \text{Inv-Gamma}(a_0, b_0).$

This implies that the prior mean function is $\mu_0(x) = E[\mu(x)] = \sum_{h=1}^{H} \beta_{0h} b_h(x).$

Choose $\beta_0$ so that $\mu_0(x)$ is approximately linear, say $\mu_0(x) \approx \alpha + \psi x.$

Procedure:

Fit a simple linear regression to get least squares estimates $\hat{\alpha}, \hat{\psi}$ .
Find $\beta_0$ such that the spline combination $\mu_0(x)$ matches the linear function $\hat{\alpha} + \hat{\psi} x$ as closely as possible (using least squares over the input range).
Use that $\beta_0$ as the prior mean.

For $H = 21$ in this example, the spline approximation to the linear function is essentially indistinguishable from an exact straight line.

7.2 Posterior mean

The posterior mean of $\mu(x)$ has a closed form: $\hat{\mu}(x) = E[\mu(x) \mid (x_i, y_i)_{i=1}^n] = \Big(W^\top W + \lambda I_H\Big)^{-1} \Big(W^\top y + \lambda \hat{\mu}_0(x)\Big),$

where

$W$ : matrix with row $w_i$
$y$ : vector of responses
$\hat{\mu}_0(x)$ : the linear regression function used as prior mean.

Interpretation:

$\hat{\mu}(x)$ is shrunk toward the linear regression fit,
while still allowing smooth nonlinear deviations controlled by the basis functions and $\lambda$ .

To extend the model, hyperpriors can be put on $\alpha$ , $\psi$ , or one can impose smoothing priors that encourage neighboring coefficients $\beta_h$ to be similar (for example, an AR(1) prior). Such models are often called Bayesian penalized splines (P-splines).

8. Choosing Knots and Handling Uncertainty

Two important design choices:

Number of knots / basis functions
Locations of knots

A common practical strategy:

Use “enough” knots (for example, $H = 21$ in the example)
Control flexibility through priors on $\beta$ so that overfitting is avoided.

Several Bayesian strategies handle uncertainty in the basis specification:

8.1 Free-knot models with reversible jump MCMC

Put a prior on both the number of knots and their locations.
Use reversible jump MCMC to move between models with different knot numbers and positions.
Posterior for $(\mu, \sigma)$ averages over knot configurations.

This approach is conceptually attractive but often computationally challenging due to the difficulty of designing efficient reversible jump proposals.

8.2 Shrinkage priors instead of exact selection

An alternative is to keep a rich set of basis functions but use shrinkage priors for $\beta_h$ :

Priors with strong mass near 0, but heavy tails
Many coefficients are shrunk close to zero
Large coefficients are allowed for important basis functions.

This acts as a continuous analogue of variable selection:

No coefficient is forced exactly to zero
Basis functions with negligible effect are heavily shrunk
Important basis functions remain relatively unshrunk.

Shrinkage priors simplify computation and avoid discrete model jumps.

Your Gateway to Data Mastery

Learn, explore, and innovate with data science.

Splines and weighted sums of basis functions

1. Goal: Make the Mean Function Nonlinear

2. Basis Function Representation

3. Local Basis Functions

3.1 Gaussian radial basis functions

3.2 B-splines

4. Comparison: Gaussian vs B-spline Basis

5. Linear Model in Basis Coefficients

6. Centering the Model on a Linear Trend

7. Example: Chloride Concentration Data

7.1 Prior centered on a linear function

7.2 Posterior mean

8. Choosing Knots and Handling Uncertainty

8.1 Free-knot models with reversible jump MCMC

8.2 Shrinkage priors instead of exact selection

Like this:

Related

Leave a ReplyCancel reply

1. Goal: Make the Mean Function Nonlinear

2. Basis Function Representation

3. Local Basis Functions

3.1 Gaussian radial basis functions

3.2 B-splines

4. Comparison: Gaussian vs B-spline Basis

5. Linear Model in Basis Coefficients

6. Centering the Model on a Linear Trend

7. Example: Chloride Concentration Data

7.1 Prior centered on a linear function

7.2 Posterior mean

8. Choosing Knots and Handling Uncertainty

8.1 Free-knot models with reversible jump MCMC

8.2 Shrinkage priors instead of exact selection

Share this:

Like this:

Related

Leave a ReplyCancel reply

Discover more from Your Gateway to Data Mastery