Unspecified number of mixture components

1. Goal: allow an unknown number of mixture components

So far, the finite mixture model assumed a fixed number of components $H$. This section discusses how to allow the number of components to be unknown and inferred from the data.

There are two main strategies:

A fully Bayesian approach that puts a prior on $H$ itself and uses reversible jump MCMC.
A more practical approach that sets $H$ as a large upper bound, and lets the data effectively decide how many components are actually used.

The text emphasizes that the second strategy is often simpler and works well in practice.

2. “Full” approach: prior on $H$ and reversible jump MCMC

One theoretical option:

Put a prior distribution on the number of components $H$, for example a truncated Poisson prior.
Use reversible jump MCMC (RJMCMC) to move between models with different $H$.

However:

RJMCMC is computationally intensive and technically challenging.
In practice, people rarely use it for standard mixture density estimation.

Instead, practitioners more often:

Fit models for several fixed values of $H$,
And then choose a “best” $H$ using some goodness-of-fit criterion penalized for model complexity (e.g., marginal likelihood, WAIC).

3. Model selection over fixed $H$: EM, marginal likelihood, and predictive criteria

A common computational workflow is:

For each candidate $H$ (e.g., $H = 1,2,\dots,H_{\max}$):
- Use EM (or ECM) to compute a maximum likelihood estimate or posterior mode of the parameters.
For each fitted model, compute:
- An estimate of marginal posterior probability for that model, or
- A predictive performance criterion (e.g., WAIC).

Then:

Select the $H$ with the best marginal posterior probability or best predictive performance.

But each criterion has issues:

Marginal posterior probability can be sensitive to prior choices.
DIC is not theoretically well-justified for mixture models.
WAIC is theoretically justified for mixture models, but:
- If you pick a single $H$ and proceed as if $H$ were known, you are ignoring uncertainty in $H$ in the final inferences.

So even WAIC-based model selection does not propagate uncertainty about $H$ into later inference.

4. Practical alternative: fix a large $H$, use a Dirichlet prior designed for sparsity

A simpler and more elegant approach keeps $H$ technically fixed but allows the effective number of occupied components to vary:

Choose a fixed large $H$ as an upper bound on the number of mixture components.
Use the standard Dirichlet prior over weights: $(\pi_1,\dots,\pi_H) \sim \text{Dirichlet}(a,\dots,a),$ where $a$ is a hyperparameter.

In this setup:

$H$ is the maximum number of components.
Not all components will necessarily be used.
Some components can be empty, i.e., no observations assigned to them.

Define:

$H_n = \sum_{h=1}^H 1_{n_h > 0}$,

where $n_h$ is the number of observations currently allocated to component $h$ (in the Gibbs sampler). Thus, $H_n$ is the number of occupied components in a given iteration.

The Gibbs sampler described earlier for fixed $H$ automatically produces posterior samples of $H_n$ after convergence. So you get a posterior distribution over the effective number of clusters (occupied components) without trans-dimensional MCMC.

5. How the Dirichlet hyperparameter $a$ affects the number of occupied components

The behavior of $H_n$ depends crucially on the Dirichlet hyperparameter $a$.

If you set $a = 1$, then:
- $(\pi_1,\dots,\pi_H) \sim \text{Dirichlet}(1,\dots,1)$ (uniform over the simplex).
- As $H$ becomes large, the prior tends to spread mass over many components.
- The Gibbs sampler will tend to allocate items across many components.
- In the limit, you could see a tendency for each observation to fall in its own component, so $H_n$ could become close to $n$.

This is not desirable if you want a finite number of meaningful clusters.

To avoid this, the authors propose instead:

Set $a = \frac{n_0}{H},$ where $n_0$ is a fixed prior “sample size”, independent of $H$ (e.g., $n_0 = 1$).

Then:

The total concentration parameter $\sum_{h=1}^H a = H \cdot \frac{n_0}{H} = n_0$ stays constant as $H$ changes.

Interpretation:

$n_0$ controls the overall prior strength in the Dirichlet.
Choosing $n_0 = 1$ gives a minimally informative default that tends to favor a small number of dominant components.

Equivalent representation:

You can view the Dirichlet prior as arising from independent Gamma variables:
- $\lambda_h \sim \text{Gamma}!\left(\frac{n_0}{H}, 1\right)$ independently,
- Then set $\pi_h = \frac{\lambda_h}{\sum_{l=1}^H \lambda_l}.$

If $H$ is large and $n_0$ is small (e.g. $n_0 = 1$):

Most $\lambda_h$ will be very close to 0,
A small number of $\lambda_h$ will be in the right tail,
After normalization, those few right-tail $\lambda_h$ values produce a few high-probability components, and the rest have probability near 0.

This prior structure naturally encourages:

A sparse mixture with a few active components and many nearly empty ones, even though $H$ is large.

6. Using $H_n$ for component-specific inference

If you want to make inferences about mixture component-specific parameters $(\pi_h,\theta_h)$, for $h=1,\dots,H$:

You should condition on an estimate of the effective number of clusters $H_n$.

A practical workflow:

Initial run:
- Set a large upper bound $H$ (e.g., $H = 20$),
- Run the Gibbs sampler with the Dirichlet prior $a = n_0/H$ (e.g., $n_0=1$),
- From the posterior draws, compute $H_n$ in each iteration.
Estimate $H_n$:
- Obtain the posterior distribution of $H_n$,
- Choose an estimate of $H_n$ such as the posterior mode.
Second run conditioned on $H_n$:
- Fix the number of components to this estimated value $H_n$ (i.e., treat this as the effective $H$),
- Run a new MCMC or Gibbs sampler with $H = H_n$,
- Apply postprocessing for label switching (as discussed in the label switching section) to get interpretable component-specific summaries.

This two-stage approach:

Uses the first run to infer how many clusters are actually needed by the data,
Then conditions on that effective number in a second, more detailed analysis where you care about component parameters.

7. Examples: galaxy data, acidity data, and iris data

The methodology is illustrated on three classic datasets:

Galaxy data: 82 measurements of galaxy speeds (1D).
Acidity data: 155 lake acidity measurements (1D).
Iris data: 150 observations from three iris species, each with 4 continuous variables (4D).

For each dataset, they fit a finite Gaussian location–scale mixture with an upper bound on the number of clusters $H$, and a Dirichlet prior with $a = 1/H$ (so $n_0 = 1$).

7.1 Prior setup

To simplify prior specification and avoid unrealistic component locations:

They standardize the data:
- Subtract the mean,
- Divide by the standard deviation.

Then, for all examples:

Dirichlet prior on weights: $(\pi_1,\dots,\pi_H) \sim \text{Dirichlet}\left(\frac{1}{H},\dots,\frac{1}{H}\right).$
Component means: $\mu_h \sim \text{Normal}(0, \kappa \tau_h^2), \quad \text{with } \mu_0 = 0,\ \kappa = 1.$
Component precisions (or variances):
- $\tau_h^2$ (or $\sigma_h^2$) have an inverse-gamma prior: $\tau_h^2 \sim \text{Inv-Gamma}(3, 1).$
- This is a weakly informative choice given the normalization; it avoids extremely concentrated or extremely diffuse components.

For the multivariate iris data:

They use a diagonal covariance matrix for each Gaussian component (independent dimensions within each component),
Full covariance could be used but would require many more parameters.

7.2 Postprocessing and label switching

For all these applications, they perform postprocessing of the Gibbs draws to correct for label switching.
They use a procedure based on minimizing a Kullback–Leibler loss to align labels across MCMC draws.

They observe:

As expected, mixing in the component-specific parameters is poor due to identifiability issues,
But the density estimates and cluster structure are still useful.

7.3 Galaxy data (1D)

They set an upper bound $H = 5$.
The top row of shows:
- The histogram of galaxy velocities,
- A kernel density estimate,
- And the fitted finite mixture density.

Qualitative results:

The posterior suggests three non-overlapping clusters.
One large cluster near the origin dominates; the other two are smaller.
While the largest cluster might itself be interpreted as composed of two highly overlapping components, it is well approximated by a single normal.

Table gives:

Posterior means and standard deviations for:
- Weights $\pi_h$,
- Locations $\mu_h$,
- Scales $\sigma_h$ (i.e., standard deviations),
And shows that most posterior mass is concentrated in a single component centered near zero.

7.4 Acidity data (1D)

Upper bound again $H = 5$.
The bottom row of Figure shows:
- Histogram of acidity,
- Kernel density estimate,
- Fitted mixture density and 95% pointwise credible intervals.

Visual conclusions:

The data suggest two main clusters,
One cluster appears skewed, which a single normal cannot capture well.
Thus, three Gaussian components may be needed to approximate the skewed cluster.

Key point:

If the component shape (Gaussian) does not match the true cluster shape (e.g., skewed), then:
- The number of mixture components does not necessarily equal the number of “real” clusters”.
- For example, a single true cluster may require multiple Gaussian components.

Table shows:

Posterior summaries indicating evidence for two or three components contributing meaningfully.
The finite mixture density estimate, along with credible intervals, provides a good fit when compared to the empirical histogram.

7.5 Iris data (4D)

Each observation has 4 attributes.
They use an upper bound $H$ larger than 5 (they mention up to 6 components).
The data contain three biological species, so one might expect three clusters.

However:

The fitted finite mixture sometimes uses more than three components.
Table presents posterior means and standard deviations for:
- Weights,
- 4-dimensional locations (means for each feature),
- Scales.

Interpretation:

The extra components indicate that single multivariate normals may not fully capture the within-species variation.
Multiple components may be used to approximate the distribution of a single species.

Advantage of the Bayesian approach:

The posterior automatically encodes uncertainty in the number of clusters.
You can summarize this uncertainty by:
- Ordering components by their weights $\pi_h$,
- Examining how many components carry nontrivial probability mass.

Even though there are three true species, the posterior still finds that normal components need to be split to approximate the distribution of at least one species well.

8. Summary of the approach

Instead of running reversible jump MCMC with an explicit prior on $H$, you can:
- Fix a large upper bound $H$,
- Choose a Dirichlet prior with $a = n_0/H$ (e.g., $n_0=1$),
- Run the Gibbs sampler,
- Let the posterior decide which components are effectively used (through $H_n$).
This yields:
- A posterior distribution for the effective number of occupied components $H_n$,
- A flexible Bayesian density estimator with good performance,
- And, if desired, a way to condition on $H_n$ and postprocess labels to conduct component-wise inference.

Your Gateway to Data Mastery