1. Nonparametric residual distributions in regression
Density estimation has primarily served as a pedagogical entry point. The main strength of Dirichlet process mixture (DPM) models lies in relaxing parametric assumptions inside hierarchical models, especially for error distributions.
Consider the linear regression model
$y_i = X_i\beta + \varepsilon_i$, with $\varepsilon_i \sim f$.
Standard regression assumes a parametric form for $f$, often Gaussian.
A common robust alternative is the $t$ distribution, represented as a scale mixture:
$\varepsilon_i \sim N(0, \phi_i^{-1}\sigma^2)$ with $\phi_i \sim \text{Gamma}(\nu/2, \nu/2)$.
Although this handles heavy tails, the shape remains restrictive.
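The scale-mixture representation can be checked by simulation. The sketch below is illustrative only: the function name `sample_t_errors`, the choice $\nu = 5$, and the sample size are assumptions made for the example. It draws $\phi_i$ from the Gamma mixing distribution, then $\varepsilon_i$ given $\phi_i$, and compares the result with direct $t_\nu$ draws.

```python
import numpy as np

def sample_t_errors(n, nu, sigma2=1.0, rng=None):
    """Draw n errors from the Gamma scale mixture of normals."""
    rng = np.random.default_rng() if rng is None else rng
    # phi_i ~ Gamma(shape=nu/2, rate=nu/2); NumPy is parameterized by scale = 1/rate.
    phi = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    # eps_i | phi_i ~ N(0, sigma2 / phi_i)
    return rng.normal(0.0, np.sqrt(sigma2 / phi))

rng = np.random.default_rng(0)
nu = 5.0                                   # illustrative degrees of freedom
eps = sample_t_errors(200_000, nu, rng=rng)
direct = rng.standard_t(nu, size=200_000)  # reference draws from t_nu

# With sigma2 = 1 both samples should have variance near nu / (nu - 2).
print(eps.var(), direct.var(), nu / (nu - 2))
```

Matching variances and quantiles between the two samples is consistent with the mixture being a $t_\nu$ distribution scaled by $\sigma$.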
A more flexible approach models the error distribution nonparametrically using a Dirichlet process scale mixture:
$\varepsilon_i \sim N(0, \phi_i^{-1})$, $\phi_i \sim P$, $P \sim DP(\alpha P_0)$,
with $P_0$ chosen as $\text{Gamma}(\nu/2, \nu/2)$ so the prior centers on a $t$ distribution.
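Because every component is centered at zero, the resulting error density remains symmetric and unimodal, so it cannot capture skewed or multimodal errors.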
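One way to get a feel for the Dirichlet process scale mixture is to simulate from its prior. The sketch below integrates out $P$ and draws the precisions $\phi_i$ sequentially with the Pólya urn scheme, which is a standard representation of draws from a DP but is not prescribed by the text above. The function name and the values of $\alpha$, $\nu$, and $n$ are assumptions made for illustration.

```python
import numpy as np

def polya_urn_precisions(n, alpha, nu, rng):
    """Draw phi_1..phi_n from P ~ DP(alpha * P0) with P integrated out."""
    phi = np.empty(n)
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            # New value from the base measure P0 = Gamma(nu/2, rate nu/2).
            phi[i] = rng.gamma(nu / 2.0, 2.0 / nu)
        else:
            # Otherwise reuse one of the i earlier values, chosen uniformly.
            phi[i] = phi[rng.integers(i)]
    return phi

rng = np.random.default_rng(0)
phi = polya_urn_precisions(n=500, alpha=1.0, nu=4.0, rng=rng)
eps = rng.normal(0.0, 1.0 / np.sqrt(phi))   # eps_i | phi_i ~ N(0, 1/phi_i)

n_distinct = len(np.unique(phi))
print(n_distinct, "distinct precisions among", phi.size, "errors")
```

Ties among the $\phi_i$ show the discreteness of $P$: the errors fall into a small number of groups, each with its own precision, and all groups are centered at zero.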
To relax both the symmetry and unimodality constraints, a location mixture of Gaussians is used:
$\varepsilon_i \sim N(\mu_i, \tau^{-1})$, $\mu_i \sim P$, $P \sim DP(\alpha P_0)$, $\tau \sim \text{Gamma}(a_\tau, b_\tau)$,
with $P_0 = N(0, \tau^{-1})$.
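In posterior computation, the mixture components are updated using the current residuals $y_i - X_i\beta$ in place of the data, and the regression coefficients $\beta$ are updated in a separate step.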
2. Nonparametric distributions for group-level parameters
Hierarchical models often assume normally distributed random effects. For example, in a one-factor ANOVA:
$y_{ij} = \mu_i + \varepsilon_{ij}$, with $\mu_i \sim f$ and $\varepsilon_{ij} \sim g$.
Instead of assuming $f$ is Gaussian, one can place a Dirichlet process prior:
$\mu_i \sim P$, $P \sim DP(\alpha P_0)$.
This induces a latent clustering structure:
$\mu_i = \mu^*_{S_i}$ with $\Pr(S_i = h) = \pi_h$,
where the weights $\{\pi_h\}$ follow a stick-breaking construction.
Subjects sharing the same cluster index $S_i$ share the same underlying parameter $\mu^*_h$.
Because cluster allocations are uncertain, the posterior mean of each $\mu_i$ averages over possible allocations and therefore remains subject-specific.
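The clustering structure can be illustrated with a prior simulation. The sketch below uses a truncated stick-breaking construction: the truncation level `H`, the Gaussian base measure $P_0 = N(0, 2^2)$, and the values of $\alpha$ and $n$ are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, H, n = 1.0, 25, 12      # H truncates the infinite stick-breaking sum

# Stick-breaking: v_h ~ Beta(1, alpha), pi_h = v_h * prod_{l<h} (1 - v_l).
v = rng.beta(1.0, alpha, size=H)
v[-1] = 1.0                    # close the stick so the weights sum to one
pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

mu_star = rng.normal(0.0, 2.0, size=H)   # atoms mu*_h ~ P0 (assumed N(0, 2^2))
S = rng.choice(H, size=n, p=pi)          # Pr(S_i = h) = pi_h
mu = mu_star[S]                          # mu_i = mu*_{S_i}

print("cluster labels:", S)
print("distinct subject-level means:", len(np.unique(mu)))
```

Subjects with the same label in `S` have exactly equal entries in `mu`, which is the tie structure described above.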
Identifiability depends on the number of observations per subject.
If $n_i = 1$, subject-level and residual variability cannot be separated.
If $n_i$ is large, the distribution $P$ primarily reflects between-subject heterogeneity.
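A short worked case may make the $n_i = 1$ statement concrete. With one observation per subject, $y_{i1} = \mu_i + \varepsilon_{i1}$, so the marginal density is the convolution
$p(y_{i1}) = \int g(y_{i1} - \mu)\, f(\mu)\, d\mu$,
and the data inform only this convolution. For instance, if one assumes $f = N(m, s_f^2)$ and $g = N(0, s_g^2)$ (an illustrative choice, not part of the model above), then $y_{i1} \sim N(m, s_f^2 + s_g^2)$, and every split of the total variance between $s_f^2$ and $s_g^2$ gives the same likelihood.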
To reduce confounding, the residual distribution $g$ can also be modeled nonparametrically.
In such fully nonparametric settings, identifiability of the mean requires post-processing.
3. Functional data analysis with Dirichlet processes
Functional data analysis models observations as noisy realizations of subject-specific functions.
Let $y_{ij} \sim N(f_i(t_{ij}), \sigma^2)$.
Each function is expressed via basis expansion:
$f_i(t) = \sum_{h=1}^H \theta_{ih} b_h(t)$, with coefficient vector $\theta_i$.
Instead of assuming $\theta_i$ follows a multivariate normal distribution, a Dirichlet process prior is placed:
$\theta_i \sim P$, $P \sim DP(\alpha P_0)$.
This induces functional clustering:
$f_i(t) = f^*_{S_i}(t)$, where $f_c^{*}(t) = b(t)\theta_c^{*}$, $b(t) = (b_1(t), \dots, b_H(t))$, and $\theta_c^{*} \sim P_0$.
All subjects within a cluster share the same functional form.
Flexibility arises through the choice of the base measure $P_0$.
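The following sketch shows how shared coefficient vectors translate into shared curves. It is a simulation under stated assumptions, not a fitting procedure: the Gaussian bump basis, the standard normal base measure, the fixed label vector `S`, and the noise level are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_basis(t, centers, width):
    """Evaluate H Gaussian bump basis functions b_h(t) on the points t."""
    t = np.asarray(t, dtype=float)
    return np.exp(-0.5 * ((t[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(2)
H, n_clusters = 8, 3
centers = np.linspace(0.0, 1.0, H)
t_grid = np.linspace(0.0, 1.0, 50)
B = gaussian_basis(t_grid, centers, width=0.15)           # shape (50, H)

# Cluster-specific coefficients theta*_c ~ P0 (assumed N(0, I) here).
theta_star = rng.normal(0.0, 1.0, size=(n_clusters, H))

S = np.array([0, 0, 1, 2, 1, 0])       # hypothetical cluster labels S_i
f = B @ theta_star[S].T                # column i holds f_i on t_grid

sigma = 0.1                            # assumed residual standard deviation
y = f + rng.normal(0.0, sigma, size=f.shape)   # y_ij ~ N(f_i(t_ij), sigma^2)
print(f.shape, y.shape)
```

Subjects 0, 1, and 5 share label 0, so their columns of `f` are identical, while their noisy observations in `y` differ.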
4. Basis selection through the base measure
Two effective strategies exist for handling basis selection:
(a) Spike-and-slab prior
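Each coefficient $\theta^*_{ch}$ is drawn from its own component of the base measure, a mixture of a point mass at zero and a Gaussian slab: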
$P_{0h}(\cdot) = \pi_{0h}\delta_0(\cdot) + (1-\pi_{0h})N(0, \psi_h^{-1})$,
with $\pi_{0h} \sim \text{Beta}(a,b)$ and possibly $\psi_h \sim \text{Gamma}(\nu/2,\nu/2)$.
This yields exact zeros in $\theta^*_{ch}$, enabling cluster-specific basis selection.
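A minimal sketch of drawing cluster-specific coefficients from the spike-and-slab base measure follows. The function name `draw_spike_slab` and the hyperparameter values ($a = b = 1$, $\nu = 4$, ten basis functions, four clusters) are assumptions made for the example.

```python
import numpy as np

def draw_spike_slab(pi0, psi, rng):
    """Draw one coefficient vector with independent spike-and-slab entries."""
    pi0 = np.asarray(pi0, dtype=float)
    psi = np.asarray(psi, dtype=float)
    is_zero = rng.random(pi0.shape) < pi0          # spike with probability pi_0h
    slab = rng.normal(0.0, 1.0 / np.sqrt(psi))     # slab: N(0, 1/psi_h)
    return np.where(is_zero, 0.0, slab)

rng = np.random.default_rng(3)
H, nu = 10, 4.0
pi0 = rng.beta(1.0, 1.0, size=H)                   # pi_0h ~ Beta(a, b), a = b = 1 assumed
psi = rng.gamma(nu / 2.0, 2.0 / nu, size=H)        # psi_h ~ Gamma(nu/2, rate nu/2)

# Four clusters, each with its own pattern of exact zeros.
theta_star = np.array([draw_spike_slab(pi0, psi, rng) for _ in range(4)])
print((theta_star == 0.0).astype(int))
```

Each row of the printed indicator matrix marks which basis functions are switched off for that cluster, and the pattern can differ from row to row.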
(b) Heavy-tailed shrinkage prior
$\theta^*_{ch} \sim N(0, \psi_{ch}^{-1})$, $\psi_{ch} \sim \text{Gamma}(\nu/2,\nu/2)$,
with small $\nu$, producing heavy-tailed $t_\nu$ marginal priors (Cauchy when $\nu = 1$).
This approach avoids exact zeros but strongly shrinks irrelevant coefficients toward zero.
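The heavy-tailed marginal can be checked numerically. The sketch below, with an assumed sample size and seed, draws from the hierarchy with $\nu = 1$ and compares the sample quartiles with those of a standard Cauchy distribution, which are $-1$ and $+1$.

```python
import numpy as np

rng = np.random.default_rng(4)
nu, n = 1.0, 200_000

# psi ~ Gamma(nu/2, rate nu/2), then theta | psi ~ N(0, 1/psi).
psi = rng.gamma(nu / 2.0, 2.0 / nu, size=n)
theta = rng.normal(0.0, 1.0 / np.sqrt(psi))

q25, q75 = np.quantile(theta, [0.25, 0.75])
print(q25, q75)                                   # compare with -1 and +1
print("share with |theta| > 10:", np.mean(np.abs(theta) > 10))
```

Quantiles are used instead of moments because the Cauchy distribution has no finite mean or variance. No draw is exactly zero, yet the prior still places substantial mass near zero while leaving room for occasional large coefficients.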
Block updates of $\theta^*_c$ improve computational efficiency and mixing.
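Under the shrinkage prior, the full conditional of $\theta^*_c$ is Gaussian when the $\psi_{ch}$ and the cluster labels are held fixed, so the whole vector can be drawn in one step. The sketch below is one possible implementation and makes simplifying assumptions: all curves in the cluster are observed on a common grid, $\sigma^2$ is known, and a cosine basis is used. The function name `draw_theta_block` and all numerical settings are hypothetical.

```python
import numpy as np

def draw_theta_block(Y, B, psi, sigma2, rng):
    """Jointly draw theta*_c given the m curves currently in cluster c.

    Y: (m, T) observations on a common grid; B: (T, H) basis matrix;
    psi: (H,) prior precisions; sigma2: residual variance.
    """
    m = Y.shape[0]
    # Conditional precision: prior precision plus m copies of B'B / sigma2.
    precision = np.diag(psi) + m * (B.T @ B) / sigma2
    rhs = B.T @ Y.sum(axis=0) / sigma2
    mean = np.linalg.solve(precision, rhs)
    # With precision = L L', solving L' x = z gives x with covariance precision^{-1}.
    L = np.linalg.cholesky(precision)
    z = rng.standard_normal(psi.size)
    return mean + np.linalg.solve(L.T, z)

rng = np.random.default_rng(5)
t = np.linspace(0.0, 1.0, 50)
B = np.cos(np.pi * np.outer(t, np.arange(5)))     # assumed cosine basis, shape (50, 5)
theta_true = np.array([1.0, -0.5, 0.0, 0.8, 0.0])
sigma2 = 0.01
Y = B @ theta_true + rng.normal(0.0, np.sqrt(sigma2), size=(20, 50))

theta_draw = draw_theta_block(Y, B, psi=np.ones(5), sigma2=sigma2, rng=rng)
print(np.round(theta_draw, 3))
```

Drawing the vector jointly respects the posterior correlation among coefficients that overlapping basis functions induce, which coordinate-wise updates would have to work through one element at a time.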
5. Key takeaways
- Dirichlet process mixtures provide a principled way to relax parametric assumptions.
- Nonparametric modeling applies to residuals, random effects, and functional coefficients.
- Clustering emerges naturally, with uncertainty in the number of clusters handled automatically.
- The base measure $P_0$ plays a critical role in controlling flexibility and computational behavior.
- Heavy-tailed shrinkage priors often provide a practical balance between flexibility and efficiency.
