1. Gaussian process as a prior over functions

A Gaussian process (GP) is a probability distribution over functions. It is a natural prior for an unknown regression function μ(x), especially when:

  • predictors are multivariate, and
  • interactions and nonlinearities are important,
  • but explicit basis expansions are inconvenient.

Write the prior asμGP(m, k),\mu \sim \text{GP}(m,\ k),

where:

  • m(x)m(x) is the mean function,
  • k(x,x)k(x, x’) is the covariance function.

For any finite collection of inputs x1,,xnx_1, \dots, x_n​, the random vector of function values(μ(x1),,μ(xn))(\mu(x_1), \dots, \mu(x_n))

is multivariate normal with mean vector (m(x1),,m(xn))(m(x_1), \dots, m(x_n)) and covariance matrix KK, where the (p, q) entry isKpq=k(xp,xq).K_{pq} = k(x_p, x_q).

The GP is nonparametric: the function μ(x)\mu(x) evaluated over all possible x is characterized by infinitely many degrees of freedom. The mean function can encode a parametric “base model,” often a linear regression:m(x)=xTβ,m(x) = x^T \beta,

with a prior on β (for example, a Gaussian prior). This centers the GP around a linear model but allows smooth deviations from linearity. In regions where the data are sparse, the GP naturally shrinks back toward the mean function, which helps control the curse of dimensionality.

When needed, the collection of GP hyperparameters (for example, τ, l, σ) can be viewed collectively as a parameter vector θ.


2. Covariance function and the squared exponential kernel

The covariance function k(x,x)k(x, x’) determines:

  • how smooth the sample paths of μ(x) are,
  • how strongly values at different inputs are correlated,
  • how strongly the function is shrunk towards the mean.

A common choice is the squared exponential (Gaussian / exponentiated quadratic) kernel:k(x,x)=τ2exp ⁣(xx22l2),k(x, x’) = \tau^2 \exp\!\left(- \frac{\|x – x’\|^2}{2 l^2}\right),

where:

  • τ>0\tau > 0 controls the overall vertical scale (amplitude) of the function,
  • l>0l > 0 is a length-scale parameter controlling smoothness,
  • xx2\|x – x’\|^2 is the squared Euclidean distance between x and x′.

Large l produces very smooth, slowly varying functions. Small l allows rapid changes. Larger τ produces functions with larger variability around the mean.

Random draws from a GP with a squared exponential kernel produce smooth, “wiggly” curves whose roughness is governed by τ and l.


3. Relationship to basis expansions

Consider a basis expansion model:μ(x)=h=1Hβhbh(x),\mu(x) = \sum_{h=1}^H \beta_h\, b_h(x),

with coefficient vectorβ=(β1,,βH)TN(β0, Σβ).\beta = (\beta_1, \dots, \beta_H)^T \sim N(\beta_0,\ \Sigma_\beta).

Defineb(x)=(b1(x),,bH(x))T.b(x) = (b_1(x), \dots, b_H(x))^T.

Then the induced distribution for (μ(x1),,μ(xn))(\mu(x_1), \dots, \mu(x_n)) is multivariate normal, with

  • mean function m(x)=b(x)Tβ0,m(x) = b(x)^T \beta_0,
  • covariance function k(x,x)=b(x)TΣβb(x).k(x, x’) = b(x)^T \Sigma_\beta b(x’).

Thus any linear basis-function model with a Gaussian prior on β defines a GP prior on μ.

Conversely, many standard GP covariance functions (including the squared exponential) admit an equivalent representation as an infinite basis expansion with Gaussian weights. Truncating these expansions yields approximate finite-dimensional models that can speed up computation.


4. Isotropic vs anisotropic covariance

With multivariate predictors x=(x1,,xp)x = (x_1, \dots, x_p), using a single length scale l for all directions implies isotropy: the function is assumed to vary at the same rate in every coordinate. This is often unrealistic and inefficient if:

  • some predictors have stronger influence than others,
  • the function changes quickly in some directions and slowly in others,
  • some predictors are essentially irrelevant.

An anisotropic squared exponential kernel allows separate length-scales ljl_j for each predictor:k(x,x)=τ2exp(j=1p(xjxj)22lj2).k(x, x’) = \tau^2 \exp\left( – \sum_{j=1}^p \frac{(x_j – x’_j)^2}{2 l_j^2} \right).

Each ljl_j​ controls smoothness in the j-th coordinate. Very large ljl_j​ effectively suppresses variation along predictor j, which can be interpreted as a form of nonparametric variable selection. Priors on the vector (l1,,lp)(l_1, \dots, l_p) allow the data to drive the degree of relevance of each predictor.


5. Posterior for μ in Gaussian noise

Assume a regression model with Gaussian observation noise:yiN(μ(xi), σ2),i=1,,n.y_i \sim N(\mu(x_i),\ \sigma^2), \quad i = 1, \dots, n.

Stack the data into vectors:

  • y=(y1,,yn)Ty = (y_1,\dots,y_n)^T,
  • x=(x1,,xn)x = (x_1,\dots,x_n).

Given a GP prior μGP(0,k)\mu \sim \text{GP}(0, k) and fixed hyperparameters θ (for example, τ, l, σ), the joint distribution of y and the function at prediction points x~\tilde{x} is multivariate normal:(yμ~)N ⁣((00),(K(x,x)+σ2IK(x,x~)K(x~,x)K(x~,x~))),\begin{pmatrix} y \\ \tilde{\mu} \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K(x, x) + \sigma^2 I & K(x, \tilde{x}) \\ K(\tilde{x}, x) & K(\tilde{x}, \tilde{x}) \end{pmatrix} \right),

where:

  • K(x,x)K(x, x) is the n×n covariance matrix with entries k(xi,xj)k(x_i, x_j),
  • K(x~,x)K(\tilde{x}, x) is the cross-covariance between prediction points and training points,
  • K(x~,x~)K(\tilde{x}, \tilde{x}) is the covariance among prediction points.

Conditioning on the observed data y yields the posterior distribution of μ~=μ(x~)\tilde{\mu} = \mu(\tilde{x}):μ~x,y,θN(E(μ~), cov(μ~)),\tilde{\mu} \mid x, y, \theta \sim N\big( E(\tilde{\mu}),\ \text{cov}(\tilde{\mu}) \big),

withE(μ~)=K(x~,x)[K(x,x)+σ2I]1y,E(\tilde{\mu}) = K(\tilde{x}, x)\, [K(x,x) + \sigma^2 I]^{-1} y,cov(μ~)=K(x~,x~)K(x~,x)[K(x,x)+σ2I]1K(x,x~).\text{cov}(\tilde{\mu}) = K(\tilde{x}, \tilde{x}) – K(\tilde{x}, x)\, [K(x,x) + \sigma^2 I]^{-1} K(x,\tilde{x}).

The posterior mean is a linear smoother of the observed data y. The weight each $y_i$ receives depends on the covariance structure: points $x_i$ near x~\tilde{x} in input space carry more weight.


6. Computational cost

The main bottleneck in GP regression is the inversion (or factorization) of the n×n matrix K(x,x)+σ2IK(x,x) + \sigma^2 I, which costs O(n3)O(n^3) operations. This must be repeated whenever the covariance hyperparameters θ change (for example, within MCMC). As a result:

  • GP regression is straightforward for small or moderate n (up to a few thousand),
  • but becomes computationally challenging for large n or many hyperparameters.

These limitations motivate approximate methods.


7. Covariance approximations and scalable GPs

Several strategies reduce the computational cost:

  1. Markov random field representations
    Some GPs can be represented as Gaussian Markov random fields (GMRFs) with sparse precision matrices. In low dimensions (e.g., time series or low-dimensional spatial problems), this leads to algorithms that scale much better than O(n3)O(n^3). For example:
    • in 1D, state-space methods can achieve O(n)O(n) complexity,
    • in 2D spatial problems, specialized algorithms can achieve around O(n3/2)O(n^{3/2}).
  2. Basis-function approximations
    When the latent function is smooth and the dimension is modest, μ(x) can be approximated by a lower-dimensional basis expansion with m basis functions, where mnm \ll n. In that case, matrix operations involve m instead of n, greatly reducing cost. This is closely related to truncated Karhunen–Loève expansions or other low-rank approximations.
  3. Compact-support covariance functions
    If the covariance k(x,x)k(x, x’) is chosen to have compact support and the length scale is short, distant points become exactly uncorrelated, making the covariance matrix sparse. Sparse linear algebra can then be used to speed computation, even with many predictors.
  4. Reduced-rank approximations
    When the function is very smooth (long length scales), low-rank approximations to the covariance matrix can be effective. The covariance matrix is approximated as K(x,x)QQT,K(x,x) \approx Q Q^T,, where Q is n×m with mnm \ll n. This reduces matrix operations to O(mn2)O(m n^2). Careful construction is needed to keep approximation error small.
  5. Additive GP structures
    In high dimensions, additive decompositions of μ(x) into low-dimensional components (for example, univariate GPs and low-order interactions) can combine the above approximations effectively. Each low-dimensional component is cheaper to represent and compute with.

Different approximations can be combined, for example, one approximation for short-range structure and another for long-range structure in an additive model.


8. Marginal likelihood and posterior for hyperparameters θ

For Gaussian observation noise, integration over μ yields an analytic marginal likelihood for y given θ:logp(yθ)=n2log(2π)12logK(x,x)+σ2I12yT[K(x,x)+σ2I]1y.\log p(y \mid \theta) = -\frac{n}{2}\log(2\pi) – \frac{1}{2}\log \left|K(x,x) + \sigma^2 I\right| – \frac{1}{2} y^T [K(x,x) + \sigma^2 I]^{-1} y.

Here θ typically includes τ, the length-scale parameters (l or ljl_j), and σ. Combining this marginal likelihood with a prior p(θ)p(\theta) gives the posterior p(θy)p(\theta \mid y). Inference for θ can proceed via:

  • optimization (type-II maximum likelihood / empirical Bayes),
  • or full Bayesian sampling (e.g., MCMC methods such as slice sampling, Metropolis–Hastings, or Hamiltonian Monte Carlo).

Posterior samples of θ can be combined with the conditional GP posterior formulas to obtain posterior draws of μ(x) and predictive distributions at new points.