Why a Dirichlet process mixture is needed

A Dirichlet process prior $P \sim \mathrm{DP}(\alpha P_0)$ is not ideal as a direct prior on a continuous data distribution because draws from $P$ are almost surely discrete/atomic, so the induced “density” is not smooth. That limitation does not make the DP useless; instead, the DP is better used as a prior on an unknown mixing distribution inside a kernel mixture model, which can produce smooth densities even when the mixing measure is discrete.

Kernel mixture specification

A general kernel mixture model is

$f(y \mid P) = \int K(y \mid \theta)\, dP(\theta),$

where $K(\cdot \mid \theta)$ is a kernel (often with location and possibly scale parameters inside $\theta$), and $P$ is a mixing measure over $\theta$.

  • If $P$ has finitely many atoms, the model becomes a finite mixture.
  • If $P$ is given a DP prior, the model becomes a Dirichlet process mixture (DPM).

With $P \sim \mathrm{DP}(\alpha P_0)$ and the stick-breaking representation of $P$, the density can be written as an infinite mixture

$f(y) = \sum_{h=1}^{\infty} \pi_h K(y \mid \theta_h^*),$

where the weights $\pi = (\pi_h)$ come from the stick-breaking process (shorthand: $\pi \sim \mathrm{stick}(\alpha)$) and the atoms satisfy $\theta_h^* \sim P_0$ independently.

The phrase “infinite mixture” means the model allows unboundedly many potential components, but in any finite dataset only a finite number of components will typically be occupied.
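The truncated stick-breaking construction is easy to simulate. The sketch below (numpy; the function name `stick_breaking` and the truncation level $N=50$ are illustrative choices, not from the source) draws the weights $\pi_h = V_h \prod_{l<h}(1-V_l)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, N):
    """Draw the first N stick-breaking weights of pi ~ stick(alpha)."""
    V = rng.beta(1.0, alpha, size=N)                  # V_h ~ Beta(1, alpha)
    V[-1] = 1.0                                       # truncate so the weights sum to 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining                              # pi_h = V_h * prod_{l<h}(1 - V_l)

pi = stick_breaking(alpha=1.0, N=50)
```

Setting the last stick variable to 1 forces the truncated weights to sum to one, which is the same device used by the blocked Gibbs sampler later on.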

Hierarchical formulation equivalent to the infinite mixture

A common hierarchical representation is

$y_i \sim K(\cdot \mid \theta_i), \quad \theta_i \sim P, \quad P \sim \mathrm{DP}(\alpha P_0).$

This is equivalent to sampling $y_i$ from the infinite mixture in $f(y)=\sum_{h=1}^{\infty}\pi_h K(y\mid \theta_h^*)$.

The computational challenge

Even though $P$ can be expressed via stick-breaking, $P$ involves infinitely many parameters, and the posterior is no longer jointly conjugate in a simple closed form. A key strategy is to integrate out $P$ to obtain a predictive distribution directly for the subject-specific parameters $\theta_1,\dots,\theta_n$.


Polya urn predictive rule and Chinese restaurant interpretation

Polya urn predictive rule

Marginalizing out $P$ yields the Polya urn (Blackwell–MacQueen) predictive distribution. For $i \ge 2$,

$p(\theta_i \mid \theta_1,\dots,\theta_{i-1}) = \frac{\alpha}{\alpha+i-1} P_0(\theta_i) + \sum_{j=1}^{i-1} \frac{1}{\alpha+i-1}\,\delta_{\theta_j}(\theta_i).$

Interpretation:

  • With probability $\frac{\alpha}{\alpha+i-1}$, draw a new value from the base measure $P_0$.
  • Otherwise, copy one of the previously seen values $\theta_j$ (creating ties among $\theta_i$’s).

This implies clustering: multiple subjects share the same parameter value.

Chinese restaurant process (CRP) metaphor

An equivalent description imagines a restaurant with infinitely many tables.

  • Customer 1 sits at a new table and receives parameter $\theta_1$.
  • Customer $i$ sits at an existing table $t$ with probability proportional to the number of customers already there, or starts a new table with probability proportional to $\alpha$.

If table $t$ has count $c_t$, then

$\Pr(\text{sit at table } t) = \frac{c_t}{\alpha+i-1}, \qquad \Pr(\text{new table}) = \frac{\alpha}{\alpha+i-1}.$

As $n$ grows, the number of occupied clusters increases slowly, approximately on the order of $\alpha \log n$ (in an asymptotic sense). This behavior matches the practical idea of adding mixture components only as needed.
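The seating rule can be simulated directly; this sketch (the helper name `crp_table_counts` is ours) makes the rich-get-richer dynamics and the slow growth in the number of tables concrete:

```python
import numpy as np

rng = np.random.default_rng(1)

def crp_table_counts(n, alpha):
    """Seat n customers by the Chinese restaurant rule; return table counts."""
    counts = [1]                                  # customer 1 opens the first table
    for i in range(2, n + 1):
        probs = np.array(counts + [alpha]) / (alpha + i - 1)
        t = rng.choice(len(probs), p=probs)       # existing table t, or a new one
        if t == len(counts):
            counts.append(1)                      # new table: prob alpha/(alpha+i-1)
        else:
            counts[t] += 1                        # table t: prob c_t/(alpha+i-1)
    return counts

counts = crp_table_counts(n=1000, alpha=1.0)
# len(counts) grows slowly with n, on the order of alpha * log(n).
```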


Full conditional prior for a single subject parameter

Exchangeability implies a conditional prior for $\theta_i$ given all other parameters $\theta_{-i}=(\theta_j, j\ne i)$:

$\theta_i \mid \theta_{-i} \sim \frac{\alpha}{\alpha+n-1}P_0(\theta_i) + \sum_{h=1}^{k_{-i}} \frac{n_{h,-i}}{\alpha+n-1}\,\delta_{\theta_h^*}(\theta_i),$

where

  • $\theta_1^{*}, \dots, \theta_{k_{-i}}^{*}$ are the unique values among $\theta_{-i}$,
  • $n_{h,-i}$ counts how many $j\ne i$ satisfy $\theta_j=\theta_h^*$,
  • $k_{-i}$ is the number of distinct clusters excluding subject $i$.

If $P_0$ is conjugate to the kernel $K$, then updating this conditional prior with the likelihood from $y_i$ yields a tractable conditional posterior for $\theta_i$. However, directly updating each $\theta_i$ in this way often mixes poorly.


Marginal (collapsed) Gibbs sampler using cluster allocations

A more effective approach updates cluster assignments and cluster parameters separately.

Let $\theta^*=(\theta_1^*,\dots,\theta_k^*)$ be the unique cluster parameters and define an allocation variable $S_i$ such that $S_i=c$ if $\theta_i=\theta_c^{*}$.

Step 1: update allocations

For subject $i$, the conditional probabilities are

$\Pr(S_i=c\mid -)\propto n_{c,-i}\,K(y_i\mid \theta_c^*), \quad c=1,\dots,k_{-i},$

and for creating a new singleton cluster,

$\Pr(S_i=k_{-i}+1\mid -)\propto \alpha \int K(y_i\mid \theta)\,dP_0(\theta).$

If $S_i=k_{-i}+1$, subject $i$ starts a new cluster.

Step 2: update cluster-specific parameters

For each occupied cluster $c$,

$p(\theta_c^*\mid -)\propto P_0(\theta_c^*)\prod_{i:S_i=c} K(y_i\mid \theta_c^*).$

Under conjugacy, this has the same parametric form as $P_0$ but with updated hyperparameters.
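To make the two steps concrete, here is a minimal collapsed-Gibbs sketch for a Gaussian kernel with known variance and a conjugate normal base measure. All data, hyperparameter values, and helper names (`post_draw`, `norm_pdf`) are illustrative assumptions, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data; kernel N(theta, sigma2) with known variance, P0 = N(mu0, tau2).
y = np.concatenate([rng.normal(-2, 0.5, 30), rng.normal(2, 0.5, 30)])
n = len(y)
alpha, sigma2 = 1.0, 0.25
mu0, tau2 = 0.0, 4.0

def norm_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def post_draw(ysum, m):
    """Draw theta from its conjugate posterior given m observations summing to ysum."""
    v = 1.0 / (1.0 / tau2 + m / sigma2)
    return rng.normal(v * (mu0 / tau2 + ysum / sigma2), np.sqrt(v))

S = np.zeros(n, dtype=int)                # start with everyone in one cluster
theta = [post_draw(y.sum(), n)]

for sweep in range(20):
    # Step 1: reallocate each subject using the Polya urn conditional.
    for i in range(n):
        counts = np.bincount(np.delete(S, i), minlength=len(theta)).astype(float)
        probs = counts * norm_pdf(y[i], np.array(theta), sigma2)
        # New cluster: alpha times the marginal of K(y_i | theta) under P0.
        probs = np.append(probs, alpha * norm_pdf(y[i], mu0, sigma2 + tau2))
        probs /= probs.sum()
        c = rng.choice(len(probs), p=probs)
        if c == len(theta):
            theta.append(post_draw(y[i], 1))   # open a singleton cluster
        S[i] = c
    # Drop empty clusters and relabel 0..k-1.
    occupied = np.unique(S)
    theta = [theta[c] for c in occupied]
    S = np.searchsorted(occupied, S)
    # Step 2: redraw each occupied cluster's parameter from its conjugate posterior.
    for c in range(len(theta)):
        theta[c] = post_draw(y[S == c].sum(), (S == c).sum())
```

For a normal kernel and normal base, the new-cluster marginal $\int K(y_i\mid\theta)\,dP_0(\theta)$ is itself normal, $N(y_i\mid\mu_0,\sigma^2+\tau^2)$, which is what the sketch exploits.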

Roles of $\alpha$ and $P_0$ in clustering

  • $\alpha$ controls the tendency to create new clusters. As $\alpha \to 0$, the model collapses toward a single-cluster parametric model $y_i \sim K(\cdot\mid \theta)$ with common $\theta$.
  • $P_0$ strongly affects cluster formation via the marginal likelihood term $\int K(y_i\mid \theta)\,dP_0(\theta)$. A naive “very diffuse” $P_0$ can unintentionally discourage new clusters because it makes the marginal likelihood small; in the limit of extremely diffuse $P_0$, behavior can mimic a single-cluster model. Practical guidance is to choose $P_0$ concentrated near plausible values, often aided by standardizing the data.

The key difference from finite mixtures is that the number of occupied clusters $k$ varies across MCMC iterations, so posterior uncertainty in $k$ is obtained automatically without reversible-jump MCMC.

A posterior predictive density estimate for a new observation can be expressed as

$p(y)=\sum_{c=1}^{k}\frac{n_c}{n+\alpha}K(y\mid \theta_c^*)+\frac{\alpha}{n+\alpha}\int K(y\mid \theta)\,dP_0(\theta),$

and then averaged over posterior draws.
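For a single posterior draw of the allocations and cluster parameters, the predictive density formula can be evaluated as follows (a Gaussian kernel with known variance $\sigma^2$ and a normal base measure are assumed for illustration; all numeric inputs are made up):

```python
import numpy as np

def norm_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def predictive_density(ygrid, counts, thetas, alpha, mu0, tau2, sigma2):
    """p(y) for one posterior draw of a Gaussian-kernel DPM: occupied clusters
    plus the alpha/(n+alpha) term integrating K against P0 = N(mu0, tau2)."""
    n = sum(counts)
    dens = np.zeros_like(ygrid, dtype=float)
    for n_c, th in zip(counts, thetas):
        dens += n_c / (n + alpha) * norm_pdf(ygrid, th, sigma2)
    # The integral of N(y | theta, sigma2) against P0 is N(y | mu0, sigma2 + tau2).
    dens += alpha / (n + alpha) * norm_pdf(ygrid, mu0, sigma2 + tau2)
    return dens

grid = np.linspace(-20.0, 20.0, 4001)
dens = predictive_density(grid, counts=[3, 2], thetas=[-1.0, 1.0],
                          alpha=1.0, mu0=0.0, tau2=4.0, sigma2=0.25)
```

Averaging this curve over MCMC draws of $(n_c, \theta_c^*)$ gives the posterior mean density estimate.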


Finite mixture approximation to a DP mixture

A finite mixture with weights

$\pi \sim \mathrm{Dirichlet}\!\left(\frac{\alpha}{k},\dots,\frac{\alpha}{k}\right)$

and component parameters i.i.d. from $P_0$ approximates a DPM. The approximation improves as $k$ increases, and practical settings like $k=25$ or $k=50$ can serve as conservative upper bounds on the number of occupied clusters.
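A quick simulation shows why this sparse Dirichlet behaves like a DP: with concentration $\alpha/k$ per component, most of the $k$ components receive negligible weight and stay empty (the values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

alpha, k, n = 1.0, 50, 200                 # illustrative settings
pi = rng.dirichlet(np.full(k, alpha / k))  # symmetric Dirichlet(alpha/k, ..., alpha/k)
S = rng.choice(k, size=n, p=pi)            # allocate n observations to components
occupied = len(np.unique(S))               # components that actually receive data
# occupied typically stays far below k, mimicking DP clustering.
```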


Blocked Gibbs sampler via truncated stick-breaking

Motivation for blocked sampling

If $P$ is marginalized out, inference on $P$ itself is not directly available. Moreover, for generalizations of the DP the marginalization is not always tractable. An alternative keeps the stick-breaking representation but truncates it.

Truncation idea

Because stick-breaking weights decrease stochastically with index, for a sufficiently large truncation level $N$ the remaining tail probability $\sum_{h=N+1}^{\infty}\pi_h$ is near zero. An approximation sets $V_N=1$, forcing $\pi_h=0$ for $h>N$. Typical defaults are $N=25$ or $N=50$.

Blocked Gibbs steps (truncated DP mixture)

  1. Allocation update:
    $\Pr(S_i=c\mid -)=\frac{\pi_c K(y_i\mid \theta_c^*)}{\sum_{l=1}^{N}\pi_l K(y_i\mid \theta_l^*)}, \quad c=1,\dots,N.$
  2. Stick-breaking updates for $c=1,\dots,N-1$:
    $V_c \sim \mathrm{Beta}\!\left(1+n_c,\ \alpha+\sum_{l=c+1}^{N} n_l\right).$
  3. Cluster-parameter updates:
    $p(\theta_c^* \mid -) \propto P_0(\theta_c^*)\prod_{i:S_i=c}K(y_i\mid \theta_c^*),$
    and if $n_c=0$, draw $\theta_c^*\sim P_0$.

For density estimation, monitor

$f(y)=\sum_{c=1}^{N}\pi_c K(y\mid \theta_c^*)$

over a dense grid of $y$ values across iterations. A useful diagnostic is $S_{\max}=\max(S_1,\dots,S_n)$; if $S_{\max}$ often approaches $N$, increase $N$.

Convergence assessment should focus on $f(y)$ rather than component-specific parameters because label ambiguity can cause poor mixing in $\theta_c^*$ even when the induced density mixes well.
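Putting the three steps together, here is a minimal blocked-Gibbs sketch for a Gaussian kernel with known variance, including the grid-based density monitoring; the data, hyperparameters, and truncation level are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative two-component data; kernel N(theta, sigma2), base P0 = N(mu0, tau2).
y = np.concatenate([rng.normal(-2, 0.5, 40), rng.normal(2, 0.5, 40)])
n, N = len(y), 25                    # truncation level N
alpha, sigma2 = 1.0, 0.25
mu0, tau2 = 0.0, 4.0

def norm_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

theta = rng.normal(mu0, np.sqrt(tau2), size=N)
pi = np.full(N, 1.0 / N)
grid = np.linspace(-5.0, 5.0, 201)
f_draws = []

for sweep in range(50):
    # Step 1: allocations, Pr(S_i = c) proportional to pi_c K(y_i | theta_c).
    w = pi * norm_pdf(y[:, None], theta[None, :], sigma2)
    w /= w.sum(axis=1, keepdims=True)
    S = np.array([rng.choice(N, p=w[i]) for i in range(n)])
    # Step 2: stick-breaking, V_c ~ Beta(1 + n_c, alpha + sum_{l>c} n_l), V_N = 1.
    nc = np.bincount(S, minlength=N)
    tail = np.cumsum(nc[::-1])[::-1] - nc       # counts with index greater than c
    V = rng.beta(1.0 + nc, alpha + tail)
    V[-1] = 1.0
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    # Step 3: conjugate normal update per cluster; empty clusters revert to P0.
    for c in range(N):
        v = 1.0 / (1.0 / tau2 + nc[c] / sigma2)
        m = v * (mu0 / tau2 + y[S == c].sum() / sigma2)
        theta[c] = rng.normal(m, np.sqrt(v))
    # Monitor the induced density f(y) on the grid.
    f_draws.append((pi * norm_pdf(grid[:, None], theta[None, :], sigma2)).sum(axis=1))

f_hat = np.mean(f_draws, axis=0)     # posterior mean density estimate
```

Note that when $n_c=0$ the conjugate posterior collapses to the prior, so empty clusters are automatically refreshed from $P_0$ without a separate branch.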


Hyperprior for $\alpha$ and Bayesian learning of cluster complexity

Fixing vs learning $\alpha$

A common default is $\alpha=1$, which implies that two randomly selected individuals have roughly a 50–50 chance of sharing a cluster under the prior (in the DP clustering sense). Alternatively, place a gamma hyperprior

$\alpha \sim \mathrm{Gamma}(a_\alpha,b_\alpha)$

and update $\alpha$ within MCMC. Under the blocked sampler, the conditional posterior is

$\alpha\mid - \sim \mathrm{Gamma}\!\left(a_\alpha+N-1,\ b_\alpha-\sum_{h=1}^{N-1}\log(1-V_h)\right).$
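Given the current stick-breaking variables, this conditional draw is a one-liner (hyperparameter values are illustrative; note numpy's gamma sampler is parameterized by scale, so the rate must be inverted):

```python
import numpy as np

rng = np.random.default_rng(5)

a_alpha, b_alpha, N = 1.0, 1.0, 25       # illustrative hyperprior and truncation
V = rng.beta(1.0, 1.0, size=N)           # stand-in for the current stick draws
V[-1] = 1.0                              # V_N is fixed at 1 under truncation

# alpha | - ~ Gamma(a_alpha + N - 1, b_alpha - sum_{h<N} log(1 - V_h));
# numpy's gamma takes a scale parameter, hence the reciprocal of the rate.
rate = b_alpha - np.log(1.0 - V[:-1]).sum()
alpha = rng.gamma(a_alpha + N - 1, 1.0 / rate)
```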

Even though adding clusters increases the maximized likelihood in frequentist estimation, the Bayesian approach uses marginal likelihood, integrating over parameters. Allocating observations to many clusters increases dimensionality and integration volume, inducing an intrinsic complexity penalty that discourages overfitting.

Subtlety: choice of $P_0$

The base measure $P_0$ governs the prior for cluster locations and scales. If $P_0$ is too diffuse relative to the data scale, introducing new clusters is heavily penalized through the marginal likelihood, pushing toward overly few clusters. Standardizing data and choosing $P_0$ centered near $0$ with variance near $1$ (on the standardized scale) is a common strategy to make $P_0$ meaningful.


Toxicology example: counts, DP vs DPM, and kernel choice

DP prior directly on a discrete distribution (no MCMC)

For count data, using a DP directly on the data distribution can be computationally easy because of conjugacy:

$P\mid y_1,\dots,y_n \sim \mathrm{DP}\!\left(\alpha P_0+\sum_{i=1}^{n}\delta_{y_i}\right).$

The posterior mean probability at $y=j$ is

$\Pr(y=j\mid y_1,\dots,y_n)=\frac{\alpha}{\alpha+n}P_0(j)+\frac{1}{\alpha+n}\sum_{i=1}^{n}1_{y_i=j}.$

This produces a spike-at-observed-values behavior and limited smoothing, especially at small $n$.
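The posterior mean pmf can be computed in closed form with no MCMC. In this sketch the base measure is taken, purely for illustration, as a Poisson centered at the sample mean; the data vector is also made up:

```python
import math
import numpy as np

y = np.array([3, 3, 4, 5, 5, 5, 7])      # illustrative count data
n, alpha = len(y), 1.0
lam0 = y.mean()                          # illustrative choice: P0 = Poisson(lam0)

def post_mean_pmf(j):
    """Pr(y = j | data) = alpha/(alpha+n) P0(j) + (1/(alpha+n)) * #{y_i = j}."""
    P0j = math.exp(-lam0) * lam0 ** j / math.factorial(j)
    return alpha / (alpha + n) * P0j + (y == j).sum() / (alpha + n)
```

Evaluating `post_mean_pmf` at observed versus unobserved values makes the spike-at-observed-values behavior visible: probability mass concentrates on the exact counts seen in the sample.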

Poisson DPM for counts and a key limitation

A natural DPM for counts is

$y_i \sim \mathrm{Poisson}(\theta_i),\quad \theta_i \sim P,\quad P\sim \mathrm{DP}(\alpha P_0),\quad P_0=\mathrm{Gamma}(a,b).$

Blocked Gibbs steps:

  • $S_i$ update uses $\Pr(S_i=c\mid -)\propto \pi_c\,\mathrm{Poisson}(y_i\mid \theta_c^*)$.
  • $V_c$ update uses $V_c \sim \mathrm{Beta}(1+n_c,\ \alpha+\sum_{l>c}n_l)$.
  • $\theta_c^{*}$ update uses $\theta_c^* \sim \mathrm{Gamma}\!\left(a+\sum_{i:S_i=c}y_i,\ b+n_c\right)$.

However, Poisson kernels force $\mathrm{Var}(y)=E(y)$. A mixture of Poissons can only increase variance (overdispersion) and cannot model underdispersed counts well. In the toxicology implantation data, variance is smaller than the mean (underdispersion), leading the Poisson DPM to overestimate variance and fit poorly.
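One blocked-Gibbs sweep for the Poisson DPM can be written compactly. Data and hyperparameters below are illustrative; the $\log y_i!$ term is dropped from the allocation step since it is constant across clusters:

```python
import numpy as np

rng = np.random.default_rng(6)

y = rng.poisson(4.0, size=60)              # illustrative count data
n, N = len(y), 25                          # truncation level
alpha, a, b = 1.0, 2.0, 0.5               # concentration; Gamma(a, b) base (rate b)

theta = rng.gamma(a, 1.0 / b, size=N)
pi = np.full(N, 1.0 / N)

# S_i update: Pr(S_i = c) proportional to pi_c Poisson(y_i | theta_c);
# the log y_i! term is constant across c and omitted.
logw = np.log(pi)[None, :] + y[:, None] * np.log(theta)[None, :] - theta[None, :]
w = np.exp(logw - logw.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
S = np.array([rng.choice(N, p=w[i]) for i in range(n)])

# V_c ~ Beta(1 + n_c, alpha + sum_{l>c} n_l), then the conjugate gamma update
# theta_c ~ Gamma(a + sum_{i: S_i = c} y_i, b + n_c).
nc = np.bincount(S, minlength=N)
tail = np.cumsum(nc[::-1])[::-1] - nc
V = rng.beta(1.0 + nc, alpha + tail)
V[-1] = 1.0
pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
sums = np.bincount(S, weights=y.astype(float), minlength=N)
theta = rng.gamma(a + sums, 1.0 / (b + nc))
```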

Rounded Gaussian DPM as a remedy

A more flexible approach introduces latent continuous values and rounds:

$y_i = h(y_i^*),\quad y_i^* \sim N(\mu_i,\tau_i^{-1}),\quad (\mu_i,\tau_i)\sim P,\quad P\sim \mathrm{DP}(\alpha P_0),$

with rounding rule $h(y^*)=j$ if $y^*\in(a_j,a_{j+1}]$, where $a_0=-\infty$ and $a_j=j-1$ for $j\ge 1$. Choose

$P_0(\mu,\tau)=N(\mu\mid \mu_0,\kappa \tau^{-1})\,\mathrm{Gamma}(\tau\mid a_\tau,b_\tau).$

Blocked Gibbs sampler includes:

  1. Allocation update using the rounded likelihood
    $p(y_i\mid \mu_c^*,\tau_c^*)=\Phi(a_{y_i+1}\mid \mu_c^*,\tau_c^*)-\Phi(a_{y_i}\mid \mu_c^*,\tau_c^*),$
    and
    $\Pr(S_i=c\mid -)\propto \pi_c\,p(y_i\mid \mu_c^*,\tau_c^*).$
  2. Stick-breaking update
    $V_c \sim \mathrm{Beta}(1+n_c,\ \alpha+\sum_{l>c}n_l).$
  3. Latent augmentation:
    $u_i \sim \mathrm{Uniform}\!\left(\Phi(a_{y_i}\mid \mu_{S_i}^*,\tau_{S_i}^*),\ \Phi(a_{y_i+1}\mid \mu_{S_i}^*,\tau_{S_i}^*)\right),$
    $y_i^*=\Phi^{-1}(u_i\mid \mu_{S_i}^*,\tau_{S_i}^*).$
  4. Update $(\mu_c^*,\tau_c^*)$ from the conjugate normal-gamma posterior
    $N(\mu_c^*\mid \hat\mu_c,\hat\kappa_c \tau_c^{*-1})\,\mathrm{Gamma}(\tau_c^*\mid \hat a_{\tau c},\hat b_{\tau c})$
    with
    $\hat a_{\tau c}=a_\tau+\frac{n_c}{2},\quad \hat\kappa_c=(\kappa^{-1}+n_c)^{-1},\quad \hat\mu_c=\hat\kappa_c\left(\kappa^{-1}\mu_0+n_c\bar y_c^*\right),$
    and
    $\hat b_{\tau c}=b_\tau+\frac{1}{2}\left(\sum_{i:S_i=c}\left(y_i^*-\bar y_c^{*}\right)^2+\frac{n_c}{1+\kappa n_c}\left(\bar y_c^{*}-\mu_0\right)^2\right),$
    where $\bar y_c^*$ is the mean of the latent $y_i^*$ in cluster $c$.

This rounded Gaussian DPM can fit underdispersed count distributions better and yielded excellent fit in the toxicology example, enabling posterior inference on dose-related distributional shifts via changes in percentiles across dose groups.
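The rounding rule and the latent augmentation in step 3 can be sketched as follows. For brevity this sketch draws the truncated normal by simple rejection rather than the inverse-CDF step in the text, and all numeric values are illustrative:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(7)

def Phi(x, mu, tau):
    """Normal CDF at x for mean mu and precision tau."""
    if x == -np.inf:
        return 0.0
    return 0.5 * (1.0 + erf((x - mu) * sqrt(tau / 2.0)))

def cell_prob(y_i, mu, tau):
    """p(y_i | mu, tau) = Phi(a_{y_i+1}) - Phi(a_{y_i}), with a_0 = -inf, a_j = j - 1."""
    a_lo = -np.inf if y_i == 0 else y_i - 1.0
    return Phi(float(y_i), mu, tau) - Phi(a_lo, mu, tau)

def sample_latent(y_i, mu, tau):
    """Draw the latent y*_i ~ N(mu, 1/tau) restricted to (a_{y_i}, a_{y_i+1}].
    Rejection sampling here, in place of the inverse-CDF step, for brevity."""
    a_lo = -np.inf if y_i == 0 else y_i - 1.0
    while True:
        z = rng.normal(mu, 1.0 / np.sqrt(tau))
        if a_lo < z <= float(y_i):
            return z

ystar = sample_latent(3, mu=2.5, tau=4.0)   # lands in (2, 3]
```

Because the intervals $(a_j, a_{j+1}]$ partition the real line, the cell probabilities sum to one over $j$, which is what makes the rounded likelihood a proper pmf.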