Latent Gaussian process (GP) models extend standard GP regression to cases where the data likelihood is non-Gaussian. Instead of modeling $y$ directly with a Gaussian likelihood, the model introduces a latent function $f$ that has a GP prior. A link function (as in generalized linear models) connects $f$ to the likelihood.


1. Structure of Latent GP Models

Latent Function

  • A GP prior is placed on the latent function:
    fGP(0,k)f \sim \text{GP}(0, k)
  • The likelihood depends on $f$ through a link function:
    p(yf,ϕ)p(y \mid f, \phi)
    where $\phi$ is a likelihood shape parameter.

Optional: Separate GPs for Location and Shape

  • In more flexible models, one can use:
    • GP for the mean (location) $f$
    • GP for the shape parameter $\phi$
  • This allows scale or shape to depend on predictors.

2. Posterior Structure

The conditional posterior of latent $f$ is:p(fx,y,θ,ϕ)p(yf,ϕ)p(fx,θ)p(f \mid x, y, \theta, \phi) \propto p(y \mid f, \phi)\, p(f \mid x, \theta)

Sampling from this posterior is challenging due to:

  • High correlation between $f$, $\theta$, and $\phi$
  • High-dimensional GP priors

Thus mixing in MCMC is slow, especially for large datasets.


3. GP-Specific MCMC Samplers

To achieve more efficient sampling, special samplers exploit the multivariate Gaussian prior structure:

  • Elliptical Slice Sampling
  • Scaled Metropolis–Hastings
  • Scaled HMC or NUTS

These samplers alternate between:

  1. Sampling latent values $f$
  2. Sampling covariance (kernel) parameters $\theta$ and likelihood parameters $\phi$

However, because of strong coupling between parameters, mixing can remain slow.


4. Gaussian Posterior Approximation (Laplace Approximation)

Since the latent prior is Gaussian and the posterior is often close to Gaussian, a normal approximation is used:p(fx,y,θ,ϕ)N(f^,Σ)p(f \mid x, y, \theta, \phi) \approx N(\hat f, \Sigma)

where:

  • $\hat f$ is the posterior mode
  • The covariance satisfies:
    Σ1=K(x,x)1+W\Sigma^{-1} = K(x,x)^{-1} + W

Here:

  • $K(x,x)$ = GP prior covariance matrix
  • $W$ = diagonal matrix with elements:
    Wii=d2dfi2logp(yifi,ϕ)fi=f^iW_{ii} = -\frac{d^2}{df_i^2} \log p(y_i \mid f_i, \phi)\Big|_{f_i = \hat f_i}
    (negative second derivative of log-likelihood)

Predictive Distribution

Approximate predictive distribution for a new point $\tilde y_i$ is:p(y~if~i,ϕ)p(y~if~i,ϕ)N(f~ix,y,θ,ϕ)df~ip(\tilde y_i \mid \tilde f_i, \phi) \approx \int p(\tilde y_i \mid \tilde f_i, \phi)\, N(\tilde f_i \mid x, y, \theta, \phi)\, d\tilde f_i

This integral is usually evaluated by quadrature.


5. Approximate Marginal Likelihood via Laplace’s Method

To estimate hyperparameters $(\theta, \phi)$, the model approximates:logp(yx,θ,ϕ)logp(yf^,ϕ)12f^K1f^12logB\log p(y \mid x, \theta, \phi) \approx \log p(y \mid \hat f, \phi) – \frac{1}{2}\hat f^\top K^{-1}\hat f – \frac{1}{2}\log |B|

where:B=I+W1/2KW1/2B = I + W^{1/2} K W^{1/2}

This allows efficient hyperparameter optimization using gradients.


6. Expectation Propagation (EP) and Variational Inference

When the likelihood is highly skewed (e.g., logistic likelihood), Laplace approximation can be inaccurate.
Alternatives:

Expectation Propagation (EP)

  • More accurate for skewed likelihoods.
  • Often slower than Laplace.

Variational Approximation

  • Sometimes used
  • Often slower and less accurate than EP for GP models.

Once the latent posterior is approximated, one can compute an approximate marginal posterior:q(θ,ϕx,y)q(\theta, \phi \mid x, y)

This can be used for:

  • Hyperparameter mode finding
  • Grid-based integration
  • MCMC over hyperparameters
  • CCD (Central Composite Design) integration

7. Example: Leukemia Survival Times (AML dataset)

GPs are used to model nonlinear predictor effects and interactions in a real survival dataset from 1043 adult AML cases (1982–1998, UK).
Approximately 16% are right-censored.

Predictors

  • Age
  • Sex
  • White Blood Cell Count (WBC) at diagnosis
    • Highly skewed and positive → log-transformed
  • Townsend deprivation index (measure of socioeconomic deprivation)

All continuous predictors and log-survival times are normalized.


7.1 Likelihood: Log-Logistic Model

The log-logistic model performed best among tested survival models (proportional hazards, Weibull, log-Gaussian).

Likelihood for each observation:p(yi)=yiryi1exp(f(Xi))1zi[1+yirexp(f(Xi))]zi2p(y_i \mid – ) = y_i^{r y_i – 1} \, \exp(f(X_i))^{1-z_i} \left[1 + y_i^{r}\exp(f(X_i))\right]^{-z_i – 2}

  • $r$ = shape parameter
  • $z_i$ = censoring indicator (0 = observed, 1 = censored)

This model is robust compared to log-Gaussian, especially with skewed data.


7.2 Latent Function Model

The latent function combines linear effects and a GP:fi=α+Xiβ+μ(Xi)f_i = \alpha + X_i\beta + \mu(X_i)

where:

  • $\mu \sim \text{GP}(0, k)$
  • Kernel: squared exponential k(x,x)=σg2exp(jxjxj22lj2)k(x, x’) = \sigma_g^2 \exp\left( -\sum_j \frac{|x_j – x_j’|^2}{2l_j^2} \right)

7.3 Priors

  • $\log r$ ~ uniform
  • $\alpha \sim N(0, \sigma_\alpha^2)$
  • $\beta \sim N(0, \sigma_\beta^2)$
  • $\sigma_\alpha^2, \sigma_\beta^2, \sigma_g^2 \sim \text{Inv-}\chi^2(1,1)$
  • Length-scales $l_j \sim \text{half-Cauchy}(0,1)$

7.4 Inference Approach

  • Laplace approximation for latent values $f$
  • CCD integration for hyperparameters
  • EP gave similar results but was much slower

Cross-Validation Performance

Leave-one-out predictive accuracy:

  • Linear-only model:
    $ \text{lppd}_{\text{loo-cv}} = -1662 $
  • Model with nonlinear GP components:
    $ \text{lppd}_{\text{loo-cv}} = -1629 $

Lower negative value indicates better predictive accuracy, showing clear improvement from nonlinear structure.


7.5 Discovered Patterns

Figure 21.6 showed:

  • Strong nonlinear effects for predictors
  • An interaction between WBC and deprivation index (Townsend)
  • Previous studies did not detect these patterns because:
    • They did not log-transform WBC
    • They only used additive models (no interactions)
    • Spatial component was not helpful once GP nonlinearities were included

The GP model automatically discovered:

  • Complex nonlinear shapes
  • Interactions without explicitly specifying them

This demonstrates the flexibility and modeling power of latent GP models.


Summary

Latent Gaussian process models extend GP regression to non-Gaussian likelihoods by introducing a latent function with a GP prior. Because sampling from the posterior is difficult due to high dimensionality and strong correlations, specialized GP samplers or Gaussian approximations (Laplace, EP, variational) are used. These approximations allow efficient hyperparameter inference and prediction.

The leukemia survival example illustrates the power of GPs in uncovering nonlinear predictors and interactions without pre-specification. Nonlinear effects of WBC and its interaction with socioeconomic deprivation were discovered automatically, improving predictive performance over simpler additive models.