1. Finite Mixtures: Basic Setup

We want to model the distribution of:

  • $y = (y_1, \dots, y_n)$, or
  • the conditional distribution $y \mid x$

as a finite mixture of $H$ components.

1.1 Mixture structure

For each component $h = 1, \dots, H$:

  • Component distribution: $f_h(y_i \mid \theta_h)$, with parameter vector $\theta_h$
  • Mixture weight: $\lambda_h$, with $\sum_{h=1}^H \lambda_h = 1$

Typically all components share the same parametric family (e.g., normal) but with different parameter values.

The sampling distribution for a single observation is:

$$p(y_i \mid \theta, \lambda) = \lambda_1 f(y_i \mid \theta_1) + \lambda_2 f(y_i \mid \theta_2) + \dots + \lambda_H f(y_i \mid \theta_H). \tag{22.1}$$

Here, $\theta = (\theta_1, \dots, \theta_H)$.
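
As a concrete illustration of (22.1), the following Python sketch evaluates a mixture density with normal components. The function names (`normal_pdf`, `mixture_pdf`) and all numerical values are assumptions chosen for illustration; they do not come from the text.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    z = (y - mean) / sd
    return np.exp(-0.5 * z**2) / (sd * np.sqrt(2.0 * np.pi))

def mixture_pdf(y, weights, means, sds):
    """Evaluate sum_h lambda_h * f(y | theta_h) for normal components."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("mixture weights must sum to 1")
    # Shape (n, H): one column per component, then weight and sum over h.
    comp = normal_pdf(y[:, None], np.asarray(means)[None, :], np.asarray(sds)[None, :])
    return comp @ weights

# Illustrative (made-up) values for H = 2.
weights = [0.3, 0.7]
means = [-1.0, 2.0]
sds = [0.5, 1.0]
grid = np.linspace(-10.0, 12.0, 4401)
density = mixture_pdf(grid, weights, means, sds)
area = np.sum(density) * (grid[1] - grid[0])  # Riemann sum approximating the integral
```

Because the weights sum to 1 and each component integrates to 1, the numerical integral `area` should be very close to 1.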


1.2 Mixture as a hierarchical model

The mixture form in (22.1) might look like a discrete prior on parameter values $\theta_h$, but it is better interpreted as:

  • $\lambda$ describes how $\theta$ varies across a population (a mixing distribution),
  • so the model is closer in spirit to a hierarchical model.

To make the hierarchy explicit, we introduce latent indicator variables:

  • For each observation $i$ and component $h$, $z_{ih} = \begin{cases} 1 & \text{if observation } i \text{ comes from component } h \\ 0 & \text{otherwise} \end{cases}$
  • Each vector $z_i = (z_{i1}, \dots, z_{iH})$ is distributed as $z_i \mid \lambda \sim \text{Multinomial}(1; \lambda_1, \dots, \lambda_H)$.

Given $(\theta, \lambda)$, the joint distribution of data and latent indicators is:

$$p(y, z \mid \theta, \lambda) = p(z \mid \lambda) \, p(y \mid z, \theta) = \prod_{i=1}^n \prod_{h=1}^H \bigl[\lambda_h f(y_i \mid \theta_h)\bigr]^{z_{ih}}. \tag{22.2}$$

Exactly one $z_{ih}$ equals 1 for each $i$.
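
The hierarchy can be read as a two-stage recipe: draw the indicator vector, then draw the observation from the selected component. The sketch below simulates data this way and evaluates the log of (22.2). It assumes normal components; the parameter values and the helper name `complete_data_loglik` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (made-up) settings: H = 3 normal components.
lam = np.array([0.2, 0.5, 0.3])      # mixture weights, sum to 1
mu = np.array([-3.0, 0.0, 4.0])      # component means
sigma = np.array([1.0, 0.5, 1.5])    # component standard deviations
n = 1000

# z_i | lambda ~ Multinomial(1; lambda): each row is a one-hot vector.
z = rng.multinomial(1, lam, size=n)  # shape (n, H)
labels = z.argmax(axis=1)            # index h with z_ih = 1

# y_i | z_i, theta ~ f(. | theta_h) for the selected component.
y = rng.normal(mu[labels], sigma[labels])

def complete_data_loglik(y, z, lam, mu, sigma):
    """Log of (22.2): sum_i sum_h z_ih * [log lambda_h + log f(y_i | theta_h)]."""
    log_f = (-0.5 * np.log(2 * np.pi) - np.log(sigma)[None, :]
             - 0.5 * ((y[:, None] - mu[None, :]) / sigma[None, :]) ** 2)
    return np.sum(z * (np.log(lam)[None, :] + log_f))

ll = complete_data_loglik(y, z, lam, mu, sigma)
```

Since exactly one $z_{ih}$ is 1 in each row, only the term for the component that generated $y_i$ contributes to the sum.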

If for some observations the component label is known (e.g., male/female height where sex is observed), then their $z_i$ are fixed rather than latent, and the likelihood is modified accordingly.

For now, the number of components $H$ is assumed known and fixed; later, we consider model checking and how to handle uncertainty in $H$.


2. Continuous Mixtures

Finite mixtures are a special case of a more general continuous mixture:

$$p(y_i) = \int p(y_i \mid \theta) \, \lambda(\theta) \, d\theta.$$

Here:

  • Each $y_i$ depends on its own parameter $\theta_i$,
  • The distribution of $\theta_i$ in the population is given by a mixing distribution $\lambda(\theta)$.

This is exactly how hierarchical models can be interpreted.

Examples:

  • t distribution as a continuous mixture on the scale of a normal:
    • Model: $y_i \mid \mu, \sigma^2, z_i \sim N(\mu, \sigma^2 z_i)$
    • Latent scale: $z_i \sim \text{scaled Inv-}\chi^2(\nu, 1)$
    • Marginally: $y_i \sim t_\nu(\mu, \sigma^2)$
    • Here, $\nu$ (degrees of freedom) describes the mixing distribution, analogous to $\lambda$ in finite mixtures.
  • Negative binomial and beta-binomial distributions:
    • The negative binomial arises as a gamma mixture of Poisson distributions, and the beta-binomial as a beta mixture of binomial distributions.
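
The $t$ example can be checked by simulation. The sketch below draws the latent scales, then the observations, and compares quantiles with direct draws from the $t$ distribution. The values of $\nu$, $\mu$, $\sigma$ and the sample size are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

nu, mu, sigma = 5.0, 2.0, 1.5   # illustrative values, not from the text
n = 500_000

# z_i ~ scaled Inv-chi^2(nu, 1) can be drawn as nu / X with X ~ chi^2_nu.
z = nu / rng.chisquare(nu, size=n)

# y_i | z_i ~ N(mu, sigma^2 * z_i)
y = rng.normal(mu, sigma * np.sqrt(z))

# Direct draws from t_nu(mu, sigma^2) for comparison.
t_direct = mu + sigma * rng.standard_t(nu, size=n)

# Compare a few quantiles; the two sets should agree up to Monte Carlo error.
probs = [0.05, 0.25, 0.5, 0.75, 0.95]
q_mixture = np.quantile(y, probs)
q_direct = np.quantile(t_direct, probs)
```

Observations paired with large draws of $z_i$ are the ones that land far in the tails.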

The posterior of the latent $z_i$ in such continuous mixtures can be used to detect possible outliers (e.g., observations associated with extreme scales).

Computationally, continuous mixtures are handled similarly to finite mixtures; changes are mainly notational and minor.


3. Identifiability of Mixture Likelihood

A model is not identifiable if different parameter values give the same likelihood.

All finite mixtures are non-identifiable in at least one sense:

  • The distribution is unchanged if we permute component labels (label switching).
  • For example, in a 2-component mixture, it is arbitrary which is component 1 vs component 2.

To help with identifiability:

  • Impose ordering constraints on parameters, such as:
    • Component means ordered non-decreasingly,
    • Mixture weights ordered, e.g., $\lambda_1 \le \dots \le \lambda_H$.
  • Use informative priors that link components to specific subpopulations.

These strategies break label symmetry and make interpretation more stable.
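
One way to apply an ordering constraint in practice is to relabel posterior draws after the fact so that the component means are non-decreasing within each draw. This post-processing approach is an assumption of the sketch below, not something the text prescribes; the function name `relabel_by_mean` and the draws are hypothetical.

```python
import numpy as np

def relabel_by_mean(mu_draws, lam_draws):
    """Reorder components within each draw so that means are non-decreasing.

    mu_draws, lam_draws: arrays of shape (S, H) holding S posterior draws.
    """
    order = np.argsort(mu_draws, axis=1)                 # per-draw permutation
    mu_sorted = np.take_along_axis(mu_draws, order, axis=1)
    lam_sorted = np.take_along_axis(lam_draws, order, axis=1)
    return mu_sorted, lam_sorted

# Hypothetical draws in which the labels are swapped in the second row.
mu_draws = np.array([[0.1, 3.0], [2.9, -0.1], [0.0, 3.1]])
lam_draws = np.array([[0.3, 0.7], [0.68, 0.32], [0.31, 0.69]])
mu_fixed, lam_fixed = relabel_by_mean(mu_draws, lam_draws)
```

The same permutation is applied to the weights so that each $\lambda_h$ stays attached to its own component.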


4. Prior Distributions

We have mixture parameters $(\theta, \lambda)$.

Typically, the prior factorizes:

$$p(\theta, \lambda) = p(\theta)\, p(\lambda).$$

4.1 Prior for mixture weights $\lambda$

Given that $z_i \mid \lambda \sim \text{Multinomial}(1; \lambda)$, the conjugate prior is:

$$\lambda \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_H).$$

Interpretation:

  • Relative sizes of $\alpha_h$ → prior mean of $\lambda_h$
  • Sum $\sum_h \alpha_h$ → prior sample size, i.e., strength of prior
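
A small numerical sketch of this interpretation, and of the conjugate update given the indicators, follows. The values of $\alpha$ and the component assignments are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha = np.array([2.0, 2.0, 6.0])          # illustrative prior parameters
prior_mean = alpha / alpha.sum()           # E[lambda_h] = alpha_h / sum(alpha)
prior_sample_size = alpha.sum()            # strength of the prior

# Hypothetical component assignments for n = 8 observations (values in 0..H-1).
labels = np.array([0, 2, 2, 1, 2, 2, 0, 2])
counts = np.bincount(labels, minlength=alpha.size)   # n_h = sum_i z_ih

# Conjugacy: lambda | z ~ Dirichlet(alpha_1 + n_1, ..., alpha_H + n_H).
alpha_post = alpha + counts
post_mean = alpha_post / alpha_post.sum()
lam_draw = rng.dirichlet(alpha_post)       # one draw from the conditional posterior
```

Here the prior acts like 10 previously seen observations split 2, 2, 6 across the components, which are then pooled with the 8 assigned observations.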

4.2 Prior for component parameters $\theta$

  • Let $\theta = (\theta_1, \dots, \theta_H)$.
  • Some parameters may be shared among components; others may be component-specific.
    • Example: Mixture of normals where each component has its own mean but shares a common variance.

At this stage, $p(\theta)$ is left general.

In continuous mixtures, additional parameters defining the mixing distribution (e.g., $\nu$ in a t-model) need hyperpriors.


5. Ensuring a Proper Posterior

Improper priors can easily lead to improper posteriors in mixture models.

Two common issues:

  1. Improper prior on $\lambda$
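    • For instance, a Dirichlet with all $\alpha_h = 0$ is improper, although it is sometimes described as “uninformative.”
    • If the data do not support all $H$ components, this can make the posterior improper, because a component with no observations assigned to it receives no information from the data.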
  2. Improper priors on component parameters
    • Example: mixture of two normals with unknown variances $\sigma_1^2, \sigma_2^2$.
    • If we use a joint uniform prior on $(\log \sigma_1, \log \sigma_2)$, the posterior can develop degenerate modes, where:
      • One component shrinks to a single data point with near-zero variance,
      • Producing an extremely high likelihood (“spike” components).
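
The degenerate-mode problem is easy to reproduce numerically. In the sketch below, one component is centred exactly on a single data point and its standard deviation is shrunk toward zero; the log-likelihood keeps growing. The data, the weights and the function name `mixture_loglik` are assumptions made for illustration.

```python
import numpy as np

def mixture_loglik(y, lam, mu, sigma):
    """Log-likelihood of a normal mixture, computed stably with logaddexp."""
    log_f = (-0.5 * np.log(2 * np.pi) - np.log(sigma)[None, :]
             - 0.5 * ((y[:, None] - mu[None, :]) / sigma[None, :]) ** 2)
    log_terms = np.log(lam)[None, :] + log_f          # shape (n, H)
    return np.sum(np.logaddexp.reduce(log_terms, axis=1))

y = np.array([-1.2, -0.4, 0.1, 0.3, 0.9, 1.7, 2.5])   # made-up data
lam = np.array([0.5, 0.5])

# Centre component 1 exactly on y[0] and let its standard deviation shrink.
logliks = []
for s1 in [1.0, 1e-2, 1e-4, 1e-8]:
    mu = np.array([y[0], np.mean(y)])
    sigma = np.array([s1, np.std(y)])
    logliks.append(mixture_loglik(y, lam, mu, sigma))
```

The contribution of `y[0]` grows like $-\log \sigma_1$ without bound, and a prior that is uniform on $\log \sigma_1$ does nothing to counteract it.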

A workaround:

  • Fix the ratio of the variances or assign a proper prior to it.
  • In general, ensure that priors on variance parameters are proper.

Conclusion: you must check that the combination of model + improper priors yields a well-defined (proper) posterior before trusting results.


6. Number of Components $H$

In practice, $H$ is often unknown.

Strategy 1: Start small and check adequacy

  • Begin with a small $H$ for scientific and computational reasons.
  • Fit the model and assess whether it can reproduce key data features.
  • Use posterior predictive checks with test statistics that:
    • Are not sufficient statistics,
    • Are sensitive to aspects like multimodality, heavy tails, etc.

If the current $H$ fails to capture important patterns, increase $H$.

Strategy 2: Treat $H$ as a random parameter

  • Let $H \in \{1, 2, 3, \dots\}$ and assign a prior on $H$.
  • Average inferences over the posterior on $H$.
  • This aligns with trans-dimensional approaches (e.g., reversible jump MCMC) and connects to nonparametric approaches discussed later.

7. General Latent Class Formulation

Finite mixtures can be described by latent class membership:

  • Each item $i = 1, \dots, n$ belongs to one of $H$ latent subpopulations (latent classes).
  • Let $z_i \in \{1, \dots, H\}$ denote the class of observation $i$ (a single label here, rather than the indicator vector of Section 1.2).

Conditionally on $z_i$:

$$y_i \mid z_i \sim f(\theta_{z_i}, \phi), \tag{22.3}$$

where:

  • $f(\cdot)$ is a parametric family,
  • $\phi$ are parameters common to all classes,
  • $\theta_h$ are class-specific parameters for class $h$.

Assume:

$$\Pr(z_i = h) = \pi_h.$$

Then, marginalizing $z_i$:

$$g(y \mid \pi, \theta, \phi) = \sum_{h=1}^H \pi_h f(y \mid \theta_h, \phi),$$

which is exactly a finite mixture with weights $\pi_h$.

This structure allows $g$ to approximate a very broad variety of distributions, even if $f$ is simple.

Example: Location mixture of normals

Let

  • $f(y \mid \theta, \phi) = N(y \mid \theta, \phi^2)$, with common variance $\phi^2$.
  • Then $y_i \mid z_i = h \sim N(\theta_h, \phi^2)$.

Marginally:

$$g(y \mid \pi, \theta, \phi) = \sum_{h=1}^H \pi_h N(y \mid \theta_h, \phi^2). \tag{22.4}$$

With enough components (and a small enough common scale $\phi$), this mixture can approximate essentially any smooth density.

If variances are also component-specific:

  • $N(y \mid \theta_h, \phi_h^2)$

then we obtain location–scale mixtures, which:

  • Can approximate complicated distributions with fewer components,
  • Allow different variances in different subpopulations, often more realistic in practice.

8. Interpreting Mixture Models: Two Viewpoints

There are two major philosophical views on finite mixture models.

8.1 Mixtures as “true” latent subpopulations

  • Components correspond to real, underlying groups (e.g., disease subtypes).
  • We might want:
    • Inference on subpopulation-specific parameters,
    • Clustering of individuals into latent classes.

This leads to:

  • Latent class models,
  • Model-based clustering literature.

However:

  • Cluster structure depends heavily on the within-component model $f$.
  • Changing $f$ (e.g., from normal to t) can dramatically change clusters.
  • In multivariate mixtures, restricting components to diagonal covariance can artificially increase the number of clusters (one correlated, elliptical cluster ends up approximated by several axis-aligned subclusters).

Therefore, cluster results are fragile when the parametric form is uncertain.

8.2 Mixtures as flexible approximating distributions

Here, we do not insist subpopulations “really exist.”

Instead, mixtures are used as:

  • Flexible tools to model complex distributions,
  • Building blocks for hierarchical models that:
    • Allow more realistic random-effect distributions,
    • Capture uncertainty about parametric assumptions,
    • Serve in density estimation, classification, nonparametric regression, etc.

In this second view, the emphasis is on good predictive modeling and robustness, not on literal interpretation of components.


9. Computation for Mixture Models

Mixture models are handled computationally like hierarchical models:

  • Latent indicators $z$ act as missing data or nuisance parameters.
  • Inference averages over $z$.

We use:

  1. Crude initial estimates
  2. EM algorithm and variational Bayes
  3. Gibbs sampling

9.1 Crude estimates

Initial parameter and proportion estimates can be obtained by:

  • Graphical methods (histograms, density plots),
  • Clustering (e.g., k-means).

Once observations are tentatively assigned to components, estimating each component’s parameters is straightforward.

However:

  • This ignores uncertainty in $z$,
  • Tends to overestimate differences between components.

Crude $z$ estimates can be useful as starting values, but not for final inference.


9.2 EM Algorithm and Variational Bayes

Finding the joint mode of $(\theta, \lambda, z)$ is generally not useful in mixtures: it commits each observation to a single component and so ignores the uncertainty in $z$.

Instead, we use EM to approximate posterior modes / MLE of $(\theta, \lambda)$ while averaging over $z$.

EM structure:

  • E-step:
    Compute expected sufficient statistics for the complete-data model $(y, z)$, using the log of (22.2) and current parameter guesses.
    Often this boils down to computing expected indicators: $\mathbb{E}[z_{ih} \mid y, \theta, \lambda]$.
  • M-step:
    Update $(\theta, \lambda)$ given these expectations.
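
The two steps can be sketched for a univariate normal mixture as follows. This is a minimal maximum-likelihood version under stated assumptions: no prior is included, the function name `em_normal_mixture` is hypothetical, and the data are simulated for illustration.

```python
import numpy as np

def em_normal_mixture(y, lam, mu, sigma, n_iter=200):
    """EM for a univariate normal mixture (maximum likelihood, no prior)."""
    y = np.asarray(y, dtype=float)
    lam, mu, sigma = (np.array(a, dtype=float) for a in (lam, mu, sigma))
    for _ in range(n_iter):
        # E-step: r[i, h] = E[z_ih | y, theta, lambda],
        # proportional to lambda_h * N(y_i | mu_h, sigma_h^2).
        log_w = (np.log(lam) - np.log(sigma)
                 - 0.5 * ((y[:, None] - mu) / sigma) ** 2)
        log_w -= log_w.max(axis=1, keepdims=True)   # guard against underflow
        r = np.exp(log_w)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted versions of the usual normal estimates.
        n_h = r.sum(axis=0)
        lam = n_h / y.size
        mu = (r * y[:, None]).sum(axis=0) / n_h
        sigma = np.sqrt((r * (y[:, None] - mu) ** 2).sum(axis=0) / n_h)
    return lam, mu, sigma, r   # r is from the final E-step

# Simulated data with two well-separated components (made-up values).
rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
lam_hat, mu_hat, sigma_hat, resp = em_normal_mixture(
    y, lam=[0.5, 0.5], mu=[-1.0, 1.0], sigma=[1.0, 1.0])
```

With free component variances and no prior, this version is exposed to the spike modes described in Section 5, so in real use the starting values and the resulting modes need checking.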

This also applies to continuous mixtures, where latent continuous mixture variables take the place of $z$.

In practice:

  • There can be multiple modes, so we must find all major modes and compare them.
  • Use many starting points (e.g., 50–100), obtained by:
    • Adding randomness to crude estimates,
    • Fitting a simplified model first (e.g., without random effects) and using its estimates to start the full model.

Variational Bayes:

  • Uses similar structure but optimizes a lower bound on the marginal likelihood.
  • Produces approximate marginal posteriors for parameters, again integrating out $z$.

9.3 Gibbs Sampling

For full Bayesian inference, we can use a Gibbs sampler.

Starting values can be taken from:

  • Importance resampling from an approximation to the posterior, e.g., a mixture of $t_4$ distributions placed at EM modes.

Gibbs steps:

  1. Sample indicators $z$ given $(\theta, \lambda, y)$
    • In finite mixtures, this is straightforward: $p(z_i = h \mid y_i, \theta, \lambda) \propto \lambda_h f(y_i \mid \theta_h)$.
    • These are multinomial or categorical draws.
  2. Sample parameters $(\theta, \lambda)$ given $z$ and $y$
    • Given $z$, the model becomes a collection of standard parametric models (possibly hierarchical).
    • Conjugate priors make sampling easy (e.g., normal–inverse-gamma for normal components, Dirichlet for $\lambda$).
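
The two Gibbs steps can be sketched for a location mixture of normals with a known common standard deviation. Everything specific here is an assumption made to keep the example short: the known scale `phi`, the independent $N(m_0, s_0^2)$ priors on the means, the symmetric Dirichlet prior, the quantile-based starting values, and the function name `gibbs_location_mixture`.

```python
import numpy as np

def gibbs_location_mixture(y, H, phi=1.0, alpha=1.0, m0=0.0, s0=10.0,
                           n_iter=2000, seed=0):
    """Gibbs sampler for y_i | z_i = h ~ N(theta_h, phi^2) with phi known.

    Assumed priors (illustrative): theta_h ~ N(m0, s0^2) independently,
    lambda ~ Dirichlet(alpha, ..., alpha).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size
    theta = np.quantile(y, (np.arange(H) + 0.5) / H)   # crude spread-out start
    lam = np.full(H, 1.0 / H)
    theta_draws = np.empty((n_iter, H))
    lam_draws = np.empty((n_iter, H))
    for t in range(n_iter):
        # Step 1: p(z_i = h | ...) proportional to lambda_h N(y_i | theta_h, phi^2).
        log_p = np.log(lam) - 0.5 * ((y[:, None] - theta) / phi) ** 2
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        p /= p.sum(axis=1, keepdims=True)
        # Inverse-CDF draw of one categorical label per observation.
        u = rng.random(n)
        z = (p.cumsum(axis=1) < u[:, None]).sum(axis=1)
        z = np.minimum(z, H - 1)   # guard against rounding in cumsum
        # Step 2a: lambda | z ~ Dirichlet(alpha + n_h).
        n_h = np.bincount(z, minlength=H)
        lam = rng.dirichlet(alpha + n_h)
        # Step 2b: theta_h | z, y is normal by conjugacy.
        sum_h = np.bincount(z, weights=y, minlength=H)
        prec = 1.0 / s0**2 + n_h / phi**2
        mean = (m0 / s0**2 + sum_h / phi**2) / prec
        theta = rng.normal(mean, 1.0 / np.sqrt(prec))
        theta_draws[t], lam_draws[t] = theta, lam
    return theta_draws, lam_draws

# Simulated data (made-up values) and a short run.
rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 350)])
theta_draws, lam_draws = gibbs_location_mixture(y, H=2, n_iter=1000)
theta_mean = theta_draws[500:].mean(axis=0)   # discard first half as warm-up
```

Both priors in this sketch are proper, so an empty component simply reverts to its prior in Step 2 rather than producing an undefined update.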

Issues that can surface during Gibbs simulation:

  • Improper prior problems:
    • Chains may collapse into zero-variance modes and never escape.
  • Identifiability (label-switching):
    • Chains might not appear to converge because they wander among label permutations.

These problems often reveal modeling issues that must be addressed.


10. Posterior Inference and Model Checking

Once the Gibbs sampler has approximately converged:

  • Posterior inferences on $(\theta, \lambda)$ are obtained by marginalizing over $z$ (i.e., ignoring the sampled $z$ and summarizing $\theta, \lambda$).
  • Posterior on $z$ gives:
    • Probabilities for component membership of each observation,
    • Useful for soft clustering or subpopulation assignment probabilities.

Model fit:

  • Assessed using posterior predictive checks:
    • Simulate replicated data from the fitted model,
    • Compare summary statistics or discrepancy measures to observed data.
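
A posterior predictive check of this kind can be sketched as below, using sample skewness as a test statistic that is not a sufficient statistic for a normal mixture. The "observed" data and the "posterior draws" are both made-up stand-ins, and the helper names `simulate_replicate` and `skewness` are hypothetical; in practice the draws would come from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_replicate(rng, n, lam, mu, sigma):
    """Draw one replicated data set y_rep of size n from a normal mixture."""
    labels = rng.choice(lam.size, size=n, p=lam)
    return rng.normal(mu[labels], sigma[labels])

def skewness(x):
    """Sample skewness, used here as a test statistic T(y)."""
    d = x - x.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

# Stand-ins for observed data and S posterior draws (all made up here).
y_obs = np.concatenate([rng.normal(-2.0, 1.0, 60), rng.normal(3.0, 1.0, 140)])
S = 500
lam_draws = rng.dirichlet([61.0, 141.0], size=S)
mu_draws = np.array([-2.0, 3.0]) + 0.1 * rng.standard_normal((S, 2))
sigma_draws = np.full((S, 2), 1.0)

t_obs = skewness(y_obs)
t_rep = np.array([
    skewness(simulate_replicate(rng, y_obs.size,
                                lam_draws[s], mu_draws[s], sigma_draws[s]))
    for s in range(S)
])
p_value = np.mean(t_rep >= t_obs)   # posterior predictive p-value
```

A p-value near 0 or 1 would suggest that the fitted mixture fails to reproduce the asymmetry seen in the data.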

Robustness:

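  • If robustness is a concern, evaluate how sensitive the inferences are to the assumed parametric family $f$, for example by refitting with an alternative family (such as $t$ instead of normal components) and comparing the results.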