Motivation and Setting
Previous sections focused on finite collections of random probability measures, typically corresponding to discrete groups that are either exchangeable or follow a simple ordering. In many real-world applications, however, predictors vary continuously, and it is more natural to consider an uncountable collection of random probability measures indexed by predictors:

$$\mathcal{P}_{\mathcal{X}} = \{ P_x : x \in \mathcal{X} \},$$

where $x$ is a vector of predictors, $P_x$ is the conditional distribution of the response given $x$, and $\mathcal{X}$ denotes the predictor space.
The goal of density regression is to model the entire conditional density flexibly, allowing its shape, skewness, and tail behavior to vary smoothly with predictors, rather than modeling only conditional means or quantiles.
Mixture-of-Experts Models
A classical approach to density regression is the hierarchical mixture-of-experts (MoE) model:

$$f(y \mid x) = \sum_{h=1}^{k} \pi_h(x)\, \mathcal{N}\!\left(y;\; x'\beta_h,\; \tau_h^{-1}\right),$$

where:
- each mixture component $h$ corresponds to a linear regression with mean $x'\beta_h$,
- the regression parameters $\beta_h$ and precisions $\tau_h$ are component-specific,
- the mixture weights $\pi_h(x)$ vary with predictors.
In machine learning, the predictor-dependent weights are often modeled using tree-based structures or logistic regressions. While flexible, these approaches require specifying a finite number of components and may struggle to adapt to complex conditional density shapes.
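As a concrete illustration, the finite MoE density above can be evaluated directly. The sketch below is a minimal, hypothetical two-component example: the logistic weight function and all parameter values are assumptions chosen only to show the moving parts, not part of any particular fitted model.

```python
import numpy as np

def moe_density(y, x, beta, tau, weight_fn):
    """Evaluate the finite mixture-of-experts density f(y | x).

    beta:      (k, p) component-specific regression coefficients
    tau:       (k,)   component-specific precisions
    weight_fn: maps x -> (k,) simplex of predictor-dependent weights
    """
    w = weight_fn(x)          # pi_h(x), h = 1..k
    mu = beta @ x             # component means x' beta_h
    # Gaussian density with precision tau_h at each component mean
    dens = np.sqrt(tau / (2 * np.pi)) * np.exp(-0.5 * tau * (y - mu) ** 2)
    return float(w @ dens)

# Hypothetical two-expert setup with logistic (two-class softmax) weights.
beta = np.array([[0.0, 1.0],      # expert 1: intercept 0, slope 1
                 [2.0, -1.0]])    # expert 2: intercept 2, slope -1
tau = np.array([4.0, 1.0])

def logistic_weights(x, gamma=np.array([0.5, -1.0])):
    p = 1.0 / (1.0 + np.exp(-(gamma @ x)))
    return np.array([p, 1.0 - p])

x = np.array([1.0, 0.3])          # first entry is the intercept term
fy = moe_density(0.5, x, beta, tau, logistic_weights)
```

Because the weights lie on the simplex and each expert is a proper density, $f(y \mid x)$ integrates to one in $y$ for any fixed $x$.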
Nonparametric Bayesian Density Regression
A fully nonparametric Bayesian alternative models the conditional density as a predictor-dependent mixture:

$$f(y \mid x) = \int \mathcal{N}\!\left(y;\; x'\beta,\; \tau^{-1}\right) dP_x(\beta, \tau),$$

where $\mathcal{P}_{\mathcal{X}} = \{ P_x : x \in \mathcal{X} \}$ is a collection of random mixing measures. The prior on this collection, denoted $\mathcal{P}_{\mathcal{X}} \sim \Pi$, must induce dependence across predictor values while allowing sufficient flexibility.
A desirable prior should:
- be centered on a reasonable parametric model,
- allow local deviations from that model,
- use only a small number of mixture components where appropriate,
- adapt automatically as more data are observed.
Predictor-Dependent Stick-Breaking Processes
A natural construction uses predictor-dependent stick-breaking representations:

$$P_x = \sum_{h=1}^{\infty} \pi_h(x)\, \delta_{\theta_h}, \qquad \pi_h(x) = V_h(x) \prod_{l < h} \{ 1 - V_l(x) \},$$

where:
- $\theta_h = (\beta_h, \tau_h)$ are global mixture atoms,
- $\pi_h(x)$ are predictor-dependent stick-breaking weights, built from stick-breaking fractions $V_h(x) \in [0, 1]$.
For computational and conceptual simplicity, the atoms $\theta_h$ are often assumed predictor-independent, with only the weights depending on $x$. This yields the infinite mixture regression model

$$f(y \mid x) = \sum_{h=1}^{\infty} \pi_h(x)\, \mathcal{N}\!\left(y;\; x'\beta_h,\; \tau_h^{-1}\right),$$

which generalizes finite mixture-of-experts models to an infinite setting.
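The stick-breaking map from fractions $V_h(x)$ to weights $\pi_h(x)$ is the common core of all the constructions that follow. A minimal sketch, truncated at $H$ components (the truncation is a standard computational device, not part of the infinite-dimensional definition):

```python
import numpy as np

def stick_break(V):
    """Turn stick-breaking fractions V_1, ..., V_H into weights
    pi_h = V_h * prod_{l<h} (1 - V_l), truncated at H terms."""
    V = np.asarray(V, dtype=float)
    # length of stick remaining before each break: prod_{l<h} (1 - V_l)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return V * remaining
```

Setting the final fraction to one (a common truncation convention) forces the weights to sum exactly to one; otherwise they sum to $1 - \prod_h (1 - V_h)$, which approaches one as $H$ grows.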
Kernel Stick-Breaking Processes
One effective construction is the kernel stick-breaking process (KSBP):

$$V_h(x) = V_h\, K(x, \Gamma_h;\, \psi_h), \qquad V_h \sim \mathrm{Beta}(1, \lambda),$$

where:
- $\Gamma_h$ is a random location in predictor space,
- $\psi_h$ is a bandwidth parameter,
- $K$ is a kernel function attaining its maximum at $x = \Gamma_h$.
This construction implies:
- mixture component $h$ has the largest influence near $x = \Gamma_h$,
- weights decay smoothly as $x$ moves away from $\Gamma_h$,
- the number of effectively active components is small in any local region of predictor space.
When the kernels are flat ($K \equiv 1$, so the weights no longer depend on $x$), the model reduces to a standard Dirichlet process mixture as a limiting case.
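These properties are easy to verify numerically. The sketch below computes truncated KSBP weights for a scalar predictor, assuming a Gaussian kernel $K(x, \Gamma; \psi) = \exp\{-\psi (x - \Gamma)^2\}$; the kernel choice, the common bandwidth, and the truncation level are illustrative assumptions, not part of the general definition.

```python
import numpy as np

def ksbp_weights(x, V, Gamma, psi):
    """Truncated kernel stick-breaking weights at predictor value x:
    V_h(x) = V_h * K(x, Gamma_h; psi_h), with a Gaussian kernel (assumed),
    pi_h(x) = V_h(x) * prod_{l<h} (1 - V_l(x)).
    """
    K = np.exp(-psi * (x - Gamma) ** 2)   # kernel peaks at x = Gamma_h
    Vx = V * K
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - Vx)[:-1]))
    return Vx * remaining

rng = np.random.default_rng(0)
H = 20                                     # truncation level (assumption)
V = rng.beta(1.0, 1.0, size=H)             # stick-breaking fractions
Gamma = rng.uniform(0.0, 1.0, size=H)      # random locations in [0, 1]
psi = np.full(H, 10.0)                     # common bandwidth (assumption)

pi_near = ksbp_weights(Gamma[0], V, Gamma, psi)  # weights at component 1's location
```

Evaluating the weights at $x = \Gamma_1$ versus a distant $x$ shows component 1's influence peaking at its own location, and setting $\psi_h = 0$ (a flat kernel) makes the weights identical at every $x$, recovering the predictor-independent limiting case.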
Probit Stick-Breaking Processes (PSBP)
An alternative with strong computational advantages is the probit stick-breaking process:

$$V_h(x) = \Phi\{ \alpha_h + f_h(x) \}, \qquad \alpha_h \sim \mathcal{N}(\mu, 1),$$

where:
- $\Phi$ is the standard normal cumulative distribution function,
- $\alpha_h \sim \mathcal{N}(\mu, 1)$,
- $f_h$ is a regression function over predictors.
This approach has several key properties:
- when predictors are absent and $\mu = 0$, the model reduces to a Dirichlet process with precision 1,
- the parameter $\mu$ controls how quickly weights decay with the component index $h$, analogous to the DP concentration parameter,
- predictor dependence is introduced naturally through $f_h(x)$,
- Gaussian latent variable representations enable efficient Gibbs sampling.
Logistic stick-breaking processes are similar but use a logistic link instead of a probit link.
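A minimal sketch of truncated PSBP weights, taking $f_h(x) = \gamma_h x$ as a hypothetical linear regression function purely for illustration. It also checks the first property above by Monte Carlo: with $\mu = 0$ and no predictor effect, $V_h = \Phi(\alpha_h)$ with $\alpha_h \sim \mathcal{N}(0, 1)$ is $\mathrm{Uniform}(0,1) = \mathrm{Beta}(1,1)$, the stick-breaking fractions of a DP with precision 1.

```python
import numpy as np
from scipy.stats import norm

def psbp_weights(x, alpha, gamma):
    """Truncated probit stick-breaking weights at scalar predictor x:
    V_h(x) = Phi(alpha_h + gamma_h * x)  (linear f_h is an assumption),
    pi_h(x) = V_h(x) * prod_{l<h} (1 - V_l(x)).
    """
    V = norm.cdf(alpha + gamma * x)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return V * remaining

rng = np.random.default_rng(0)
H = 25                                    # truncation level (assumption)
mu = 0.0
alpha = rng.normal(mu, 1.0, size=H)       # alpha_h ~ N(mu, 1)
gamma = rng.normal(0.0, 1.0, size=H)      # slopes of the illustrative f_h

pi = psbp_weights(0.5, alpha, gamma)

# Beta(1, 1) reduction: Phi applied to standard normal draws is Uniform(0, 1).
V_draws = norm.cdf(rng.normal(0.0, 1.0, size=200_000))
```

The probit link is what makes Gibbs sampling convenient: conditioning on Gaussian latent variables whose sign determines component membership turns each update into a standard normal-linear step.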
Example: Glucose Tolerance Prediction
The probit stick-breaking process was applied to an epidemiological study of glucose tolerance in patients. The response variable was 2-hour plasma glucose level, with predictors including insulin sensitivity, age, waist-to-hip ratio, body mass index, and blood pressure measures.
Key findings include:
- the glucose distribution is strongly right-skewed,
- the shape of the distribution changes markedly with insulin sensitivity,
- mean or median regression models fail to capture these changes.
Bayesian density regression revealed that:
- insulin sensitivity and age have posterior inclusion probabilities close to 1,
- other predictors have very low inclusion probabilities and can be discarded,
- the right tail of the glucose distribution disappears as insulin sensitivity increases,
- aging amplifies the heavy right tail, especially for individuals with low insulin sensitivity.
These results demonstrate the ability of density regression to uncover predictor-dependent changes in distributional shape, not merely changes in central tendency.
Summary
Density regression using predictor-dependent stick-breaking processes provides:
- flexible modeling of full conditional densities,
- automatic adaptation of the number of mixture components,
- local modeling behavior in predictor space,
- principled Bayesian uncertainty quantification.
Compared with classical regression and finite mixture models, these approaches are especially valuable when distributional features such as skewness, multimodality, or tail behavior change with predictors, as commonly observed in biomedical and epidemiological data.
