1. Motivation and basic idea

The Bayesian histogram model with a Dirichlet prior is attractive because of its simplicity and conjugacy. However, that approach requires an explicit choice of bins. The need to predefine bins becomes problematic in multivariate settings, where the number of bins grows exponentially and sensitivity to bin choice becomes severe.

The Dirichlet process (DP) addresses this limitation by defining a prior directly on the space of probability measures, eliminating the need to specify bins explicitly while retaining many of the intuitive properties of the Dirichlet distribution.

2. Probability measures and partitions

Let the sample space be $\Omega$, equipped with a sigma-algebra $\mathcal{B}$. Let $P$ denote an unknown probability measure on $(\Omega,\mathcal{B})$. For a measurable partition $B_1,\dots,B_k$ of $\Omega$, the probabilities assigned by $P$ are

$(P(B_1),\dots,P(B_k)) = \left(\int_{B_1} f(y)\,dy,\dots,\int_{B_k} f(y)\,dy\right),$

where $f$ denotes the density of $P$, assuming one exists.

If $P$ is treated as a random probability measure, then these quantities become random variables.

A natural conjugate prior for probabilities over a finite partition is the Dirichlet distribution. This motivates specifying

$(P(B_1),\dots,P(B_k)) \sim \text{Dirichlet}(\alpha P_0(B_1),\dots,\alpha P_0(B_k)),$

where $P_0$ is a fixed base probability measure representing a prior guess for $P$, and $\alpha>0$ is a concentration parameter controlling the degree of shrinkage toward $P_0$.
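
As a rough illustration of this finite-partition prior, the sketch below draws the bin probabilities for one fixed partition of the real line. It assumes, purely for illustration, that $P_0$ is the standard normal distribution and that the bins are intervals defined by a few cut points; the names `normal_cdf` and `partition_prior_draws` are hypothetical helpers, not part of the notes.

```python
import math
import numpy as np

def normal_cdf(x):
    """Standard normal CDF (assumed base measure P0); math.erf handles +/-inf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def partition_prior_draws(cuts, alpha, n_draws, rng):
    """Draw (P(B_1), ..., P(B_k)) ~ Dirichlet(alpha * P0(B_1), ..., alpha * P0(B_k)).

    The bins B_h are the intervals between consecutive cut points,
    extended to -inf on the left and +inf on the right.
    """
    edges = [-math.inf, *cuts, math.inf]
    p0 = np.diff([normal_cdf(e) for e in edges])      # P0(B_h) for each bin
    draws = rng.dirichlet(alpha * p0, size=n_draws)   # one row per prior draw
    return p0, draws

rng = np.random.default_rng(0)
p0, draws = partition_prior_draws(cuts=[-1.0, 0.0, 1.0], alpha=5.0,
                                  n_draws=20000, rng=rng)
print("P0 of each bin:       ", np.round(p0, 3))
print("average of the draws: ", np.round(draws.mean(axis=0), 3))
```

Each row of `draws` is one random assignment of probability to the four bins. Averaged over many draws, the bin probabilities should sit close to $P_0(B_h)$, reflecting that the prior is centered on the base measure.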

3. From finite partitions to the Dirichlet process

The specification above resembles a Bayesian histogram, but it only determines the total probability assigned to each bin $B_h$, not how mass is distributed inside each bin. Moreover, the specification depends on the chosen partition.

The key idea of the Dirichlet process is to require that this Dirichlet specification hold for every possible finite measurable partition of $\Omega$. If a random probability measure $P$ satisfies

$(P(B_1),\dots,P(B_k)) \sim \text{Dirichlet}(\alpha P_0(B_1),\dots,\alpha P_0(B_k))$

for all finite partitions $B_1,\dots,B_k$, then $P$ is said to follow a Dirichlet process prior, written as

$P \sim \text{DP}(\alpha P_0).$

Here, $P_0$ is the baseline (or base) distribution, and $\alpha$ is the precision parameter, the same quantity called the concentration parameter above.

4. Marginal properties of the Dirichlet process

From properties of the Dirichlet distribution, for any measurable set $B \in \mathcal{B}$,

$P(B) \sim \text{Beta}(\alpha P_0(B), \alpha(1-P_0(B))).$

This leads to simple expressions for the prior mean and variance:

$E[P(B)] = P_0(B),$

$\text{Var}[P(B)] = \dfrac{P_0(B)(1-P_0(B))}{1+\alpha}.$

Thus, the Dirichlet process prior is centered at $P_0$, and $\alpha$ controls the variability around this baseline. Larger values of $\alpha$ imply stronger concentration around $P_0$, while smaller values allow more deviation.
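
The mean and variance formulas can be checked numerically from the standard Beta moments. The following is a minimal sketch; the function name `dp_marginal_moments` and the example values $P_0(B)=0.3$, $\alpha=4$ are illustrative assumptions.

```python
import numpy as np

def dp_marginal_moments(p0_b, alpha):
    """Prior mean and variance of P(B) ~ Beta(alpha*P0(B), alpha*(1 - P0(B)))."""
    a = alpha * p0_b
    b = alpha * (1.0 - p0_b)
    mean = a / (a + b)                                  # Beta mean
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))        # Beta variance
    return mean, var

mean, var = dp_marginal_moments(p0_b=0.3, alpha=4.0)
closed_form_var = 0.3 * (1.0 - 0.3) / (1.0 + 4.0)       # P0(B)(1 - P0(B)) / (1 + alpha)
print(mean, var, closed_form_var)
```

For these values the mean is $0.3$ and both variance expressions give $0.042$; increasing `alpha` shrinks the variance toward zero, in line with the interpretation of $\alpha$ as a precision.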

Under the weak topology, the support of the DP consists of all probability measures whose support is contained in the support of $P_0$.

5. Conjugacy and posterior updating

Assume observations $y_1,\dots,y_n$ are independent draws from $P$, written as $y_i \sim P$, and the prior is $P \sim \text{DP}(\alpha P_0)$. For any finite partition $B_1,\dots,B_k$, the posterior distribution satisfies

$(P(B_1),\dots,P(B_k)\mid y_1,\dots,y_n) \sim \text{Dirichlet}\!\left(\alpha P_0(B_1)+\sum_{i=1}^n 1_{y_i\in B_1},\dots,\alpha P_0(B_k)+\sum_{i=1}^n 1_{y_i\in B_k}\right).$

This implies a simple posterior update for the random measure itself:

$P \mid y_1,\dots,y_n \sim \text{DP}\!\left(\alpha P_0 + \sum_{i=1}^n \delta_{y_i}\right),$

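where $\delta_{y_i}$ denotes a point mass at $y_i$. The total mass of the updated parameter is $\alpha+n$, so the updated precision parameter is $\alpha+n$ and the updated base measure is $\left(\alpha P_0 + \sum_{i=1}^n \delta_{y_i}\right)/(\alpha+n)$. This gives $\alpha$ the interpretation of a prior sample size.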

The posterior mean is

$E[P(B)\mid y_1,\dots,y_n] = \frac{\alpha}{\alpha+n}P_0(B) + \frac{1}{\alpha+n}\sum_{i=1}^n 1_{y_i\in B}.$

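This expression shows that the Bayes estimator of $P$ under squared error loss is a convex combination of the base measure $P_0$, with weight $\alpha/(\alpha+n)$, and the empirical distribution, with weight $n/(\alpha+n)$. As $n$ grows with $\alpha$ fixed, the weight on $P_0$ vanishes and the estimator approaches the empirical distribution.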
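
The posterior mean of $P(B)$ is straightforward to compute for a given set $B$. The sketch below assumes, for illustration only, that $B=(0,\infty)$ and that $P_0$ is symmetric about zero so that $P_0(B)=0.5$; the function name `dp_posterior_mean` and the data values are made up for the example.

```python
import numpy as np

def dp_posterior_mean(p0_b, in_b, y, alpha):
    """E[P(B) | y] = (alpha * P0(B) + #{i : y_i in B}) / (alpha + n).

    p0_b : prior probability P0(B)
    in_b : vectorised indicator of membership in B
    """
    y = np.asarray(y, dtype=float)
    count = np.sum(in_b(y))            # number of observations falling in B
    return (alpha * p0_b + count) / (alpha + y.size)

y = [0.3, -1.2, 2.1, 0.8]              # illustrative data, 3 of 4 values in B
post_mean = dp_posterior_mean(p0_b=0.5, in_b=lambda v: v > 0, y=y, alpha=2.0)
print(post_mean)
```

With $\alpha=2$ and $n=4$, the prior guess $0.5$ receives weight $1/3$ and the empirical proportion $0.75$ receives weight $2/3$, giving a posterior mean of $2/3$.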

6. Bayesian bootstrap as a limiting case

In the limit $\alpha \to 0$, the posterior becomes

$P \mid y_1,\dots,y_n \sim \text{DP}\!\left(\sum_{i=1}^n \delta_{y_i}\right).$

This limiting case is known as the Bayesian bootstrap. Samples from this posterior are discrete distributions supported on the observed data points, with weights distributed as $\text{Dirichlet}(1,\dots,1)$ when the observations are distinct. This provides a Bayesian analogue of the classical bootstrap in which the weights vary continuously rather than in multiples of $1/n$.
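
A minimal sketch of the Bayesian bootstrap for a functional of $P$, here the mean, might look as follows. It assumes distinct observations, so that each posterior draw places $\text{Dirichlet}(1,\dots,1)$ weights on the data; the function name `bayesian_bootstrap_means` and the data are illustrative.

```python
import numpy as np

def bayesian_bootstrap_means(y, n_draws, rng):
    """Posterior draws of the mean of P under the Bayesian bootstrap.

    Each draw places Dirichlet(1, ..., 1) weights on the observed points
    and returns the corresponding weighted average.
    """
    y = np.asarray(y, dtype=float)
    weights = rng.dirichlet(np.ones(y.size), size=n_draws)  # shape (n_draws, n)
    return weights @ y

rng = np.random.default_rng(1)
y = np.array([1.2, 0.4, 2.5, 1.9, 0.7])
bb_means = bayesian_bootstrap_means(y, n_draws=5000, rng=rng)
print("sample mean:          ", y.mean())
print("posterior mean of mean:", bb_means.mean())
print("95% interval:          ", np.quantile(bb_means, [0.025, 0.975]))
```

Because every draw is a weighted average of the observations, the posterior for the mean never leaves the range of the data, which is one practical consequence of dropping $P_0$ from the posterior.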

7. Limitations of the Dirichlet process

Despite its appealing properties, the Dirichlet process has notable drawbacks for density estimation:

  • The posterior mean lacks smoothness, as it is a weighted average of $P_0$ and point masses at the observations.
  • The DP induces negative dependence between probabilities assigned to disjoint sets, regardless of their proximity.
  • Realizations from a DP are almost surely discrete, meaning $P$ is atomic and has no density with respect to Lebesgue measure on $\mathbb{R}$, even when $P_0$ is continuous.

These properties make the DP unsuitable as a direct prior for continuous densities without modification.

8. Stick-breaking representation

A constructive characterization of the Dirichlet process is provided by the stick-breaking representation. A draw $P \sim \text{DP}(\alpha P_0)$ can be written as

$P(\cdot) = \sum_{h=1}^\infty \pi_h \delta_{\theta_h}(\cdot),$

where the atoms $\theta_h$ are drawn independently from $P_0$ and, independently of the atoms, the weights are defined by

$\pi_h = V_h \prod_{l<h}(1-V_l), \quad V_h \overset{\text{iid}}{\sim} \text{Beta}(1,\alpha).$

This construction ensures that $\sum_{h=1}^\infty \pi_h = 1$ almost surely. The interpretation is sequential: a unit-length stick is broken repeatedly, with each break allocating a fraction $V_h$ of the remaining stick to a new atom. Since $E[V_h] = \frac{1}{1+\alpha}$ and the $V_h$ are independent, $E[\pi_h] = \frac{1}{1+\alpha}\left(\frac{\alpha}{1+\alpha}\right)^{h-1}$, so the expected weights decay geometrically and small $\alpha$ concentrates most of the mass on the first few atoms.
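
The stick-breaking construction can be sketched in code by truncating the infinite sum at a finite number of atoms. The truncation level `n_atoms`, the standard normal choice for $P_0$, and the function name `stick_breaking` are assumptions made for illustration; the truncated weights sum to slightly less than one.

```python
import numpy as np

def stick_breaking(alpha, n_atoms, rng):
    """Truncated stick-breaking draw from DP(alpha * P0), with P0 = N(0, 1) assumed.

    Returns the first n_atoms weights pi_h and atoms theta_h.
    """
    v = rng.beta(1.0, alpha, size=n_atoms)                        # V_h ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1])) # prod_{l<h} (1 - V_l)
    weights = v * remaining                                       # pi_h
    atoms = rng.standard_normal(n_atoms)                          # theta_h ~ P0
    return weights, atoms

rng = np.random.default_rng(2)
weights, atoms = stick_breaking(alpha=1.0, n_atoms=200, rng=rng)
print("mass captured by the truncation:", weights.sum())
print("five largest weights:", np.sort(weights)[::-1][:5])
```

The leftover mass after $H$ breaks is $\prod_{h \le H}(1-V_h)$, whose expectation is $\left(\frac{\alpha}{1+\alpha}\right)^H$, so a modest truncation level is usually adequate for small $\alpha$, while larger $\alpha$ calls for more atoms.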

9. Implications for modeling

The stick-breaking representation clarifies that the DP generates discrete distributions. For continuous data, a large value of $\alpha$ is required to avoid excessive ties, but as $\alpha \to \infty$, the model degenerates to $y_i \sim P_0$, recovering the parametric base distribution.
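
The effect of $\alpha$ on ties can be seen by sampling observations from a single realization of $P$ and counting distinct values. The sketch below reuses a truncated stick-breaking draw, with a standard normal $P_0$ and a truncation level chosen for illustration; renormalizing the truncated weights is an approximation, and `count_distinct_draws` is a hypothetical helper name.

```python
import numpy as np

def count_distinct_draws(alpha, n_obs, n_atoms, rng):
    """Sample n_obs values from one (truncated) DP realization and count distinct ones."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    weights = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    weights = weights / weights.sum()          # renormalise the truncated weights
    atoms = rng.standard_normal(n_atoms)       # theta_h ~ P0 (assumed N(0, 1))
    y = rng.choice(atoms, size=n_obs, p=weights)
    return np.unique(y).size

rng = np.random.default_rng(3)
few_distinct = count_distinct_draws(alpha=0.5, n_obs=200, n_atoms=5000, rng=rng)
many_distinct = count_distinct_draws(alpha=500.0, n_obs=200, n_atoms=5000, rng=rng)
print("distinct values with alpha = 0.5:", few_distinct)
print("distinct values with alpha = 500:", many_distinct)
```

With small $\alpha$ the 200 observations collapse onto a handful of atoms, whereas with large $\alpha$ most observations are distinct and the sample looks increasingly like independent draws from $P_0$.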

This insight motivates Dirichlet process mixture models, where the DP is placed on latent parameters rather than directly on the data distribution, allowing smooth densities while retaining the flexibility of nonparametric Bayesian inference.