Motivation

While Dirichlet process (DP) and Dirichlet process mixture (DPM) models are often introduced through density estimation, their primary value lies in relaxing parametric assumptions within hierarchical models. These models allow probability distributions themselves to be treated as random objects, enabling flexible borrowing of information across subjects, groups, and time while maintaining Bayesian coherence.


Nonparametric Error Distributions in Regression

Consider a linear regression model

y_i = X_i \beta + \varepsilon_i, \qquad \varepsilon_i \sim f,

where the error distribution f is unknown. Classical approaches assume a Gaussian form, while robust alternatives replace this with a Student-t distribution via a scale mixture of normals.

A more flexible approach models the error distribution nonparametrically using a Dirichlet process mixture (DPM). For example, a scale mixture formulation is

\varepsilon_i \mid \phi_i \sim \mathcal{N}(0, \phi_i^{-1}), \qquad \phi_i \sim P, \qquad P \sim \mathrm{DP}(\alpha P_0),

where P_0 is chosen to center the prior on a Student-t distribution. This preserves robustness while allowing deviations from the parametric form. However, the resulting distribution remains unimodal and symmetric.
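To make the construction concrete, here is a minimal simulation sketch of this scale mixture, assuming a truncated stick-breaking approximation to the DP (truncation level H) and the Gamma(ν/2, ν/2) base measure that centers the prior on a t_ν error law; all names and settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_stick_breaking(alpha, base_sampler, H=50):
    """Truncated stick-breaking draw from P ~ DP(alpha * P0)."""
    v = rng.beta(1.0, alpha, size=H)                            # stick fractions
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # mixture weights
    w /= w.sum()                                                # absorb truncated mass
    return w, base_sampler(H)                                   # atoms i.i.d. from P0

# Base measure P0 = Gamma(nu/2, rate nu/2) on the precisions phi,
# so the prior on the error density is centered on a Student-t_nu.
alpha, nu = 1.0, 4.0
w, phi = dp_stick_breaking(alpha, lambda H: rng.gamma(nu / 2, 2 / nu, size=H))

# Sample errors: pick an atom by its weight, then draw N(0, 1/phi).
idx = rng.choice(len(w), size=1000, p=w)
eps = rng.normal(0.0, 1.0 / np.sqrt(phi[idx]))
```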

Greater flexibility is obtained by using a location mixture,

\varepsilon_i \mid \mu_i, \tau \sim \mathcal{N}(\mu_i, \tau^{-1}), \qquad \mu_i \sim P, \qquad P \sim \mathrm{DP}(\alpha P_0),

with P_0 typically Gaussian. This removes unimodality and symmetry constraints entirely.
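Under the same truncation assumptions, the location-mixture variant only changes what the atoms are; a brief sketch reusing dp_stick_breaking from above (τ and s_0 are illustrative hyperparameters):

```python
# Atoms are now component means mu_h ~ P0 = N(0, s0^2); the residual
# precision tau is shared. The induced error density can be multimodal
# and asymmetric, unlike the scale mixture.
tau, s0 = 4.0, 2.0
w, mu_atoms = dp_stick_breaking(1.0, lambda H: rng.normal(0.0, s0, size=H))
idx = rng.choice(len(w), size=1000, p=w)
eps = rng.normal(mu_atoms[idx], 1.0 / np.sqrt(tau))
```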


Nonparametric Distributions for Group-Varying Parameters

In hierarchical models with subject-specific parameters, uncertainty about the distribution of those parameters can be handled nonparametrically. For example, in a one-way ANOVA model,

y_{ij} = \mu_i + \varepsilon_{ij},

placing a Dirichlet process prior on the distribution of μ_i,

\mu_i \sim P, \qquad P \sim \mathrm{DP}(\alpha P_0),

induces a latent clustering structure:

\mu_i = \mu^{*}_{S_i}, \qquad \Pr(S_i = c) = \pi_c,

with cluster-specific parameters μ*_c drawn from P_0. Subjects are probabilistically grouped into an unknown number of clusters, allowing the data to determine how many distinct latent parameter values are needed.
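The induced clustering can be seen directly from the Pólya-urn (Chinese restaurant process) representation of the DP. A small sketch with illustrative settings, reusing the numpy setup above:

```python
def crp_assignments(n, alpha):
    """Sequential cluster assignments S_i from the Chinese restaurant process."""
    counts, S = [], []
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)  # existing clusters, then a new one
        probs /= probs.sum()
        c = rng.choice(len(probs), p=probs)
        if c == len(counts):
            counts.append(1)          # open a new cluster
        else:
            counts[c] += 1            # join an existing cluster
        S.append(c)
    return np.array(S)

S = crp_assignments(n=30, alpha=1.0)
mu_star = rng.normal(0.0, 1.0, size=S.max() + 1)  # cluster means mu*_c ~ P0 = N(0, 1)
mu = mu_star[S]                                   # subject-specific means mu_i = mu*_{S_i}
```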

This approach raises identifiability issues when the number of observations per subject is small, since variability in μ_i and residual variability may be confounded. These issues can be mitigated by also modeling the residual distribution nonparametrically.


Functional Data Analysis via Dirichlet Processes

Functional observations are modeled as noisy realizations of smooth subject-specific trajectories:

y_{ij} \sim \mathcal{N}(f_i(t_{ij}), \sigma^2).

Each function is represented using a basis expansion,

f_i(t) = \sum_{h=1}^{H} \theta_{ih} b_h(t),

with subject-specific coefficient vectors θ_i. Placing a Dirichlet process prior on the distribution of coefficients,

\theta_i \sim P, \qquad P \sim \mathrm{DP}(\alpha P_0),

induces functional clustering:

f_i(t) = f^{*}_{S_i}(t), \qquad f^{*}_c(t) = b(t) \theta^{*}_c, \qquad \theta^{*}_c \sim P_0.

All subjects assigned to cluster c share the same underlying function. By choosing P_0 appropriately, cluster-specific basis selection is enabled.
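A sketch of the induced functional clustering, reusing crp_assignments from above and assuming an illustrative Gaussian-bump basis:

```python
H, n = 8, 20
grid = np.linspace(0.0, 1.0, 100)
centers = np.linspace(0.0, 1.0, H)
B = np.exp(-0.5 * ((grid[:, None] - centers[None, :]) / 0.15) ** 2)  # 100 x H basis matrix

S = crp_assignments(n=n, alpha=1.0)                       # functional cluster labels
theta_star = rng.normal(0.0, 1.0, size=(S.max() + 1, H))  # theta*_c ~ P0 = N(0, I)
F = B @ theta_star[S].T   # column i is f_i(t) = f*_{S_i}(t) on the grid
```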

Two common choices for P_0 are:

  1. Spike-and-slab priors, which allow exact basis selection via point masses at zero (see the sketch after this list).
  2. Heavy-tailed shrinkage priors (e.g., normal-gamma or Cauchy-type), which encourage small coefficients without forcing exact zeros and allow efficient block updates in MCMC.
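For the first choice, a spike-and-slab base measure can be sketched as follows (p and slab_sd are illustrative); plugging it in for P_0 zeroes out unused basis functions within each cluster:

```python
def spike_slab(H, p=0.5, slab_sd=1.0):
    """Draw theta*_c from a spike-and-slab P0: exact zeros with prob 1 - p."""
    keep = rng.random(H) < p
    return np.where(keep, rng.normal(0.0, slab_sd, size=H), 0.0)

# Replace the Gaussian P0 in the functional sketch above with:
theta_star = np.stack([spike_slab(H) for _ in range(S.max() + 1)])
```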

Hierarchical Dependence Across Random Probability Measures

Nested Dirichlet Process (NDP)

In the nested Dirichlet process, group-specific distributions P_j are themselves clustered. Each P_j is drawn from a DP whose base measure is the law of a Dirichlet process:

P_j \sim Q, \qquad Q \sim \mathrm{DP}\bigl(\alpha \, \mathrm{DP}(\beta P_0)\bigr),

so the atoms of Q are entire distributions, each drawn from DP(β P_0).

This induces clustering of entire distributions, so that for distinct groups j ≠ j′,

\Pr(P_j = P_{j'}) = \frac{1}{1+\alpha}.

Groups assigned to different clusters have completely independent atoms.
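A simulation sketch of the NDP, reusing dp_stick_breaking, with the outer DP truncated at K atoms (all settings illustrative):

```python
def ndp_groups(J, alpha, beta, K=50):
    """Each group picks an entire distribution P*_k ~ DP(beta * P0)."""
    q, _ = dp_stick_breaking(alpha, lambda H: np.zeros(H), H=K)   # outer weights only
    dists = [dp_stick_breaking(beta, lambda H: rng.normal(size=H))
             for _ in range(K)]                                   # inner (weights, atoms) pairs
    labels = rng.choice(K, size=J, p=q)   # group j uses dists[labels[j]]
    return labels, dists

labels, dists = ndp_groups(J=5, alpha=1.0, beta=1.0)
# Groups with equal labels share an identical distribution; with alpha = 1,
# any two groups coincide with probability 1 / (1 + 1) = 1/2.
```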


Hierarchical Dirichlet Process (HDP)

In contrast, the HDP draws each group-specific distribution from a DP whose base measure is a single shared draw from another DP,

P_j \sim \mathrm{DP}(\alpha P_{00}), \qquad P_{00} \sim \mathrm{DP}(\beta P_0).

Since P_{00} is almost surely discrete, all groups share the same set of global atoms but with group-specific weights. As a result, for j ≠ j′,

\Pr(P_j = P_{j'}) = 0,

even though distributions are related through shared support. The HDP is therefore appropriate when groups should share mixture components but not identical distributions.
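A truncated sketch of this, reusing dp_stick_breaking: when the shared draw P_00 has finitely many atoms with weights w00, each group's weight vector is exactly Dirichlet(α · w00) distributed (this is the DP with an atomic base measure, a truncation-level device rather than the full HDP):

```python
beta_conc, alpha_h, J = 1.0, 5.0, 4
w00, atoms00 = dp_stick_breaking(beta_conc, lambda H: rng.normal(size=H))  # shared atoms/weights
pi = rng.dirichlet(alpha_h * w00 + 1e-10, size=J)  # group weights pi_j ~ Dirichlet(alpha * w00)

# Every row of pi puts its mass on the same atoms00, but the rows differ
# almost surely, so groups share support while Pr(P_j = P_j') = 0.
```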


Convex Mixtures of Random Probability Measures

An alternative to DP-based dependence is to use convex combinations of random probability measures. For example,

P_c = \pi G_0 + (1-\pi) G_c, \qquad G_c \sim \mathrm{DP}(\alpha G_0), \qquad \pi \sim \mathrm{Beta}(a, b).

This formulation decomposes group-level variability into a global component and a group-specific deviation, analogous to random-effects models but defined in the space of probability measures. Marginally, P_c does not follow a DP, so this construction falls outside the dependent Dirichlet process (DDP) class.
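A sampling sketch for the convex mixture, again with truncated DPs; assuming for illustration that G_0 is itself a truncated DP draw with a standard normal base:

```python
a, b, alpha = 2.0, 2.0, 1.0
pi_mix = rng.beta(a, b)

def sample_from(w, th, n):
    """Draw n points from a discrete measure with weights w and atoms th."""
    return th[rng.choice(len(w), size=n, p=w)]

w0, th0 = dp_stick_breaking(alpha, lambda H: rng.normal(size=H))       # global G0 (illustrative)
wc, thc = dp_stick_breaking(alpha, lambda H: sample_from(w0, th0, H))  # G_c ~ DP(alpha * G0)

def sample_pc(n):
    """Draw from P_c = pi * G0 + (1 - pi) * G_c by flipping a pi-coin per draw."""
    use_global = rng.random(n) < pi_mix
    out = np.empty(n)
    out[use_global] = sample_from(w0, th0, use_global.sum())
    out[~use_global] = sample_from(wc, thc, (~use_global).sum())
    return out
```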


Dynamic Models for Random Probability Measures

Temporal dependence can be introduced via measure-valued autoregressive models:

P_t = (1-\pi) P_{t-1} + \pi G_t, \qquad G_t \sim \mathrm{DP}(\alpha P_0).

This represents a random walk in the space of probability measures. A limitation is that atoms introduced early persist indefinitely, though with decreasing weights. This can be addressed by placing an HDP prior on P_0, allowing atoms to reappear and disappear over time.
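A sketch of the recursion under truncation, reusing dp_stick_breaking; it makes the stated limitation visible, since old atoms keep nonzero (if geometrically shrinking) weight:

```python
pi_t, T = 0.3, 5
w, atoms = dp_stick_breaking(1.0, lambda H: rng.normal(size=H))   # initial measure P_0-step
for t in range(T):
    w_new, atoms_new = dp_stick_breaking(1.0, lambda H: rng.normal(size=H))  # innovation G_t
    w = np.concatenate(((1.0 - pi_t) * w, pi_t * w_new))  # old weights shrink geometrically
    atoms = np.concatenate((atoms, atoms_new))            # early atoms never disappear
# After T steps the support has grown to (T + 1) * H atoms; an atom born
# at step s still carries weight of order (1 - pi_t)^(T - s).
```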


Summary

  • Dirichlet processes enable nonparametric modeling of distributions within hierarchical models.
  • DPMs generalize finite mixtures by allowing the number of components to grow with the data.
  • NDPs cluster entire distributions; HDPs share atoms across distributions.
  • Convex mixtures provide a flexible alternative that does not enforce DP marginals.
  • Functional data and regression models benefit substantially from DP-based priors.
  • Dynamic extensions allow dependence across time while preserving Bayesian coherence.