1. Overview: why use mixture models for classification and regression?
The section explains how finite mixture models can be used not only for density estimation, but also for:
- Classification (predicting a categorical label $y$ from predictors $x$), including:
- Fully supervised settings (all $y_i$ observed),
- Semi-supervised settings (some $y_i$ missing).
- Regression (predicting a continuous $y$ from predictors $x$),
- Allowing flexible, nonlinear relationships and heteroskedasticity.
The key idea is:
- Model either:
- the joint distribution of $(y, x)$ with a mixture, or
- the conditional distribution of $x$ given $y$ (for discriminant analysis),
- Then derive $\Pr(y \mid x)$ from Bayes’ rule.
Mixtures make the conditional mean, variance, and shape very flexible.
2. Mixture models for classification
2.1 Basic setup
We have:
- A categorical response $y \in \{1,\dots,C\}$.
- A real-valued predictor vector $x = (x_1,\dots,x_p)$.
Data:
- Fully supervised: for $i = 1,\dots,n$, both $(y_i, x_i)$ observed.
- Semi-supervised: $x_i$ observed for all $i$, but $y_i$ observed only for a subset of $n_0 < n$ items.
Goal:
- Predict the response $y$ from $x$ by modeling $\Pr(y = c \mid x)$.
2.2 Modeling via Bayes’ rule
Instead of modeling $\Pr(y_i = c \mid x_i)$ directly, we model:
- The marginal class probabilities $\Pr(y_i = c) = \psi_c$,
- The class-conditional predictor densities $f_c(x) = f(x \mid y = c)$.
Then, by Bayes’ rule:
$$\Pr(y = c \mid x) = \frac{\psi_c\, f_c(x)}{\sum_{c'=1}^{C} \psi_{c'}\, f_{c'}(x)}.$$
If we make $f_c$ flexible enough (e.g., via mixtures), this representation can approximate very complex classification boundaries (nonlinearities, interactions, etc.). Algebraically it looks similar to the allocation formula in a $C$–component finite mixture.
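As a concrete illustration, the Bayes’-rule computation above can be sketched in a few lines. This is a minimal sketch assuming univariate Gaussian class-conditionals $f_c$ (a simple stand-in for the flexible mixtures developed below); the function name `class_posterior` is illustrative:

```python
import numpy as np

def class_posterior(x, psi, means, sds):
    """Pr(y = c | x) via Bayes' rule, with univariate Gaussian
    class-conditionals f_c standing in for a more flexible mixture."""
    # Gaussian densities f_c(x) for each class c
    f = np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    unnorm = psi * f               # psi_c * f_c(x)
    return unnorm / unnorm.sum()   # normalize over classes c

# two classes with equal marginal probability, class means 0 and 3
post = class_posterior(x=0.0, psi=np.array([0.5, 0.5]),
                       means=np.array([0.0, 3.0]), sds=np.array([1.0, 1.0]))
```

An observation at $x = 0$ sits at the mean of class 1's density, so almost all posterior mass lands on that class.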
2.3 Prior for class probabilities
Let $\psi = (\psi_1,\dots,\psi_C)$ be the vector of class probabilities.
A simple conjugate prior is the Dirichlet:
$$\psi \sim \text{Dirichlet}(a\psi_{01},\dots,a\psi_{0C}),$$
where:
- $\psi_0 = (\psi_{01},\dots,\psi_{0C})$ is the prior mean,
- $a$ is a prior “sample size”, controlling how concentrated the prior is.
Given labeled data $(y_i, x_i)$ for $i = 1,\dots,n$ in the fully supervised case, the posterior is:
$$\psi \mid y \sim \text{Dirichlet}(a\psi_{01} + n_1,\dots,a\psi_{0C} + n_C), \qquad n_c = \sum_{i=1}^{n} 1_{y_i = c}.$$
So the prior counts $a\psi_{0c}$ get updated by the observed counts of each class.
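The count update can be sketched directly; `dirichlet_posterior` is a hypothetical helper name, and the prior sample size and labels below are illustrative:

```python
import numpy as np

def dirichlet_posterior(a, psi0, y, C):
    """Posterior Dirichlet parameters a*psi0_c + n_c from observed labels y
    (labels coded 0, ..., C-1)."""
    counts = np.bincount(y, minlength=C)       # n_c = #{i : y_i = c}
    return a * np.asarray(psi0) + counts

# prior sample size a = 2 with uniform prior mean over C = 3 classes
params = dirichlet_posterior(a=2.0, psi0=[1/3, 1/3, 1/3],
                             y=np.array([0, 0, 1, 2, 2, 2]), C=3)
```

The posterior mean for class $c$ is then `params[c] / params.sum()`, shrinking the empirical class frequencies toward the prior mean $\psi_0$.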
2.4 Modeling the class-conditional density $f_c(x)$
We need a flexible model for $f_c(x_i)$ for each class $c$.
2.4.1 Simple parametric option: Gaussian per class
One option is a single multivariate Gaussian per class:
$$f_c(x) = N_p(x \mid \mu_c, \Sigma),$$
where:
- All classes share a common covariance $\Sigma$,
- Means $\mu_c$ differ across classes.
This is a standard discriminant analysis setup (like LDA), but may be:
- Too restrictive,
- Sensitive to non-normality,
- Especially problematic in semi-supervised settings or when class differences are subtle and not captured well by a simple mean shift.
2.4.2 Flexible option: finite mixture per class
More flexibly, model each class density as a finite Gaussian mixture:
$$f_c(x) = \sum_{h=1}^{H} \pi_{ch}\, N_p(x \mid \mu_{ch}, \Sigma_{ch}),$$
with:
- Class-specific mixture weights $\pi_c = (\pi_{c1},\dots,\pi_{cH})$,
- Component-specific parameters $(\mu_{ch}, \Sigma_{ch})$,
- A prior $(\mu_{ch}, \Sigma_{ch}) \sim P_0$ (often chosen as normal–inverse-Wishart).
This is very flexible but high-dimensional (many parameters).
To simplify, the authors recommend sharing the mixture weights across classes:
- Let $\pi_c = \pi = (\pi_1,\dots,\pi_H)$ for all $c$,
- Put a prior $\pi \sim \text{Dirichlet}\big(\tfrac{1}{H},\dots,\tfrac{1}{H}\big)$.
Then:
- $f_c(x)$ differs across classes via $(\mu_{ch}, \Sigma_{ch})$,
- The mixture structure and component weights are shared, which stabilizes estimation.
Each $(\mu_{ch}, \Sigma_{ch})$ can have a conjugate normal–inverse-Wishart prior $P_0$.
2.5 Latent component representation and Gibbs sampling
Introduce component indicators:
- $z_i \in \{1,\dots,H\}$ is the mixture component index for observation $i$.
- If $y_i = c$ and $z_i = h$, then $x_i \sim N_p(\mu_{ch}, \Sigma_{ch})$.
The data model:
- For item $i$ with known class $y_i = c$:
- Choose component $z_i$ with probability $\Pr(z_i = h) = \pi_h$,
- Then draw $x_i$ from the corresponding Gaussian for that class-component pair.
Gibbs sampling steps (fully supervised case):
- Update component indices $z_i$: For an item with $y_i = c$, the full conditional for $z_i$ is:
$$\Pr(z_i = h \mid \cdots) \propto \pi_h\, N_p(x_i \mid \mu_{ch}, \Sigma_{ch}), \qquad h = 1,\dots,H.$$
- Update class probability vector $\psi$: Using the Dirichlet conditional posterior (as above) based on the labeled $y_i$’s.
- Update mixture weights $\pi$: Given the current allocation counts $n_h = \sum_i 1_{z_i = h}$, the conditional posterior is:
$$\pi \mid z \sim \text{Dirichlet}\big(\tfrac{1}{H} + n_1,\dots,\tfrac{1}{H} + n_H\big).$$
- Update $(\mu_{ch}, \Sigma_{ch})$: For each class $c$ and component $h$:
- Collect all $x_i$ such that $y_i = c$ and $z_i = h$,
- Update $(\mu_{ch}, \Sigma_{ch})$ from their multivariate normal–inverse-Wishart full conditional.
This step can be adapted if you use a different prior than normal–inverse-Wishart.
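The supervised steps above can be sketched as a single Gibbs sweep. This is a minimal one-dimensional sketch under simplifying assumptions: the component variances are held fixed and shared, each $\mu_{ch}$ gets a conjugate $N(0, \tau^2)$ prior (a normal–inverse-Wishart prior would also let us draw the variances), and the $\psi$ update is omitted since it does not affect the other steps when all $y_i$ are observed:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(x, y, z, mu, sigma2, pi, C, H, tau2=100.0):
    """One Gibbs sweep for the supervised mixture classifier in one dimension.
    Variances sigma2 are fixed for brevity; mu[c, h] has a N(0, tau2) prior."""
    n = len(x)
    # 1) update component indicators: Pr(z_i = h) propto pi_h N(x_i | mu_{ch}, sigma2)
    for i in range(n):
        c = y[i]
        logp = np.log(pi) - 0.5 * (x[i] - mu[c]) ** 2 / sigma2
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(H, p=p / p.sum())
    # 2) update shared weights: pi | z ~ Dirichlet(1/H + n_1, ..., 1/H + n_H)
    n_h = np.bincount(z, minlength=H)
    pi = rng.dirichlet(1.0 / H + n_h)
    # 3) update each mu[c, h] from its conjugate normal full conditional
    for c in range(C):
        for h in range(H):
            xs = x[(y == c) & (z == h)]
            prec = 1.0 / tau2 + len(xs) / sigma2
            mu[c, h] = rng.normal(xs.sum() / sigma2 / prec, np.sqrt(1.0 / prec))
    return z, mu, pi

# tiny synthetic data set: two classes, 20 points each
x = np.concatenate([rng.normal(-2, 1, 20), rng.normal(2, 1, 20)])
y = np.repeat([0, 1], 20)
z = rng.integers(0, 2, 40)             # random initial allocations
mu = np.zeros((2, 2))                  # mu[c, h]
z, mu, pi = gibbs_sweep(x, y, z, mu, sigma2=1.0,
                        pi=np.array([0.5, 0.5]), C=2, H=2)
```

In practice many sweeps are run and draws after burn-in are saved for prediction.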
Semi-supervised classification
When some $y_i$ are missing:
- Add a step to impute each missing label $y_i$ from its full conditional
$$\Pr(y_i = c \mid x_i, \cdots) \propto \psi_c \sum_{h=1}^{H} \pi_h\, N_p(x_i \mid \mu_{ch}, \Sigma_{ch}),$$
which can be computed using the current mixture parameter draws.
- Then proceed with the same steps as in the fully supervised case.
Missing predictors
If some entries of $x_i$ are missing:
- Use the conditional normal distribution to impute missing components given:
- The observed components of $x_i$,
- The current class/cluster parameters $(\mu_{ch}, \Sigma_{ch})$,
- And the current $y_i, z_i$.
This is done in an additional Gibbs step.
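The imputation step relies on the standard partitioned-Gaussian formulas for the conditional distribution of the missing coordinates given the observed ones. A sketch (function name illustrative):

```python
import numpy as np

def conditional_normal(mu, Sigma, obs_idx, mis_idx, x_obs):
    """Mean and covariance of x_mis | x_obs under x ~ N(mu, Sigma):
    mu_m + S_mo S_oo^{-1} (x_obs - mu_o)  and  S_mm - S_mo S_oo^{-1} S_om."""
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    A = S_mo @ np.linalg.inv(S_oo)
    cond_mean = mu[mis_idx] + A @ (x_obs - mu[obs_idx])
    cond_cov = Sigma[np.ix_(mis_idx, mis_idx)] - A @ S_mo.T
    return cond_mean, cond_cov

# bivariate example: impute coordinate 1 given coordinate 0 = 1.0
m, V = conditional_normal(np.zeros(2), np.array([[1.0, 0.8], [0.8, 1.0]]),
                          obs_idx=[0], mis_idx=[1], x_obs=np.array([1.0]))
```

In the Gibbs step, `mu` and `Sigma` would be the current $(\mu_{ch}, \Sigma_{ch})$ for the observation's class-component pair, and the missing entries would then be drawn from $N(\texttt{m}, \texttt{V})$.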
2.6 Prediction for a new observation
For a new item $(x_{n+1}, y_{n+1})$ with unknown $y_{n+1}$:
- For each saved Gibbs iteration $s$:
- Compute $\Pr(y_{n+1} = c \mid x_{n+1}, y, X)$ under the current parameters.
- Average over iterations:
$$\Pr(y_{n+1} = c \mid x_{n+1}, y, X) \approx \frac{1}{S} \sum_{s=1}^{S} \Pr^{(s)}(y_{n+1} = c \mid x_{n+1}),$$
where $\Pr^{(s)}$ is the predictive probability at draw $s$.
For 0–1 loss (classification loss = 1 when the prediction is wrong, 0 when correct), the Bayes-optimal classifier picks the most probable class:
$$\hat{y}_{n+1} = \arg\max_{c} \Pr(y_{n+1} = c \mid x_{n+1}, y, X).$$
The full vector of predictive probabilities can also be reported to show uncertainty.
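Averaging the saved draws and taking the argmax can be sketched as follows (the probability draws below are illustrative numbers, not output of a real sampler):

```python
import numpy as np

def predict_label(prob_draws):
    """Average per-draw predictive probabilities Pr^(s)(y_{n+1} = c | x_{n+1})
    over saved Gibbs iterations s, then take the argmax for 0-1 loss."""
    probs = np.asarray(prob_draws).mean(axis=0)   # Monte Carlo average over s
    return probs, int(np.argmax(probs))

# three saved draws of class probabilities over C = 2 classes
probs, yhat = predict_label([[0.7, 0.3], [0.6, 0.4], [0.8, 0.2]])
```

Reporting `probs` alongside `yhat` conveys the predictive uncertainty mentioned above.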
3. Product-kernel mixture models for mixed-type predictors and joint modeling
The previous discriminant analysis approach mainly considered continuous predictors. The text then generalizes to mixed-scale predictors and a joint model for both $y$ and $x$.
3.1 Product kernel for mixed-type $x$
Suppose $x_i$ has $p$ components of possibly different types (continuous, binary, count, etc.). We want a flexible model for the joint density of $x_i$ alone:
- For each observation, use a product kernel:
$$f(x_i \mid \theta_i) = \prod_{j=1}^{p} K_j(x_{ij} \mid \theta_{ij}),$$
where:
- $K_j$ is a kernel appropriate for the $j$th predictor:
- Gaussian for continuous,
- Bernoulli for binary,
- Poisson for counts,
- etc.
- $\theta_i = (\theta_{i1},\dots,\theta_{ip})$ is a parameter vector.
This assumes conditional independence of components given local parameter $\theta_i$.
To induce dependence across components, we put a mixture prior on $\theta_i$:
- Let
$$\theta_i \sim P, \qquad P = \sum_{h=1}^{H} \pi_h\, \delta_{\Theta_h},$$
where:
- $\Theta_h = (\Theta_{h1},\dots,\Theta_{hp})$ is a component-specific parameter vector,
- $\pi_h$ are mixture weights,
- $\delta_{\Theta_h}$ is a point mass at $\Theta_h$.
Thus, each observation belongs to a latent class $z_i = h$ with probability $\pi_h$, and then $x_{ij} \sim K_j(\cdot \mid \Theta_{hj})$ independently across $j = 1,\dots,p$.
Each component of $\Theta_h$ has its own prior:
- $\Theta_{hj} \sim P_{0j}$, with $P_{0j}$ chosen appropriately for kernel $K_j$ (e.g., a conjugate prior).
This is a latent class model:
- Given $z_i$, the coordinates $x_{ij}$ are independent,
- But marginally (after integrating out $z_i$), the coordinates of $x_i$ become dependent.
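To see the induced dependence, here is a sketch of the marginal density for one continuous and one binary coordinate under two hypothetical latent classes (all parameter values illustrative). Because class 0 pairs low continuous values with $x_{\text{bin}} = 0$ and class 1 pairs high values with $x_{\text{bin}} = 1$, the two coordinates are marginally dependent even though they are independent within each class:

```python
import numpy as np

def mixed_density(x_cont, x_bin, pi, mu, sd, p_bin):
    """Marginal density of one (continuous, binary) pair under a latent
    class model: sum_h pi_h * N(x_cont | mu_h, sd_h^2) * Bern(x_bin | p_h)."""
    gauss = np.exp(-0.5 * ((x_cont - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    bern = p_bin ** x_bin * (1.0 - p_bin) ** (1 - x_bin)
    return float(np.sum(pi * gauss * bern))

# two hypothetical latent classes with opposite pairings
pars = dict(pi=np.array([0.5, 0.5]), mu=np.array([-2.0, 2.0]),
            sd=np.array([1.0, 1.0]), p_bin=np.array([0.1, 0.9]))
d1 = mixed_density(-2.0, 0, **pars)   # class-consistent pair: high density
d2 = mixed_density(-2.0, 1, **pars)   # class-inconsistent pair: low density
```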
3.2 Joint modeling of $(y, x)$
We can use the same mixture structure to model both $y$ and $x$ jointly.
First, consider $x_i$ continuous with $x_i \in \mathbb{R}^p$. Conditionally on the latent class $z_i = h$:
- For the response:
$$\Pr(y_i = c \mid z_i = h) = \psi_{hc},$$
so each class $h$ has its own categorical distribution over outcomes $c$.
- For the predictors:
$$f(x_i \mid z_i = h) = N_p(x_i \mid \mu_h, \Sigma_h).$$
So within each latent class $h$:
- $y_i$ follows a multinomial distribution with parameters $\psi_{h\cdot}$,
- $x_i$ follows a multivariate normal with parameters $(\mu_h,\Sigma_h)$,
- And given $z_i$, $y_i$ and $x_i$ are conditionally independent.
Bayesian computation is similar to the earlier mixture setting:
- Update class labels $z_i$ using a multinomial full conditional involving both $y_i$ and $x_i$,
- Update $\psi_{hc}$, $\mu_h$, $\Sigma_h$, and $\pi_h$ based on their full conditionals.
Marginalizing out $z_i$ yields a very flexible conditional distribution:
$$\Pr(y = c \mid x) = \frac{\sum_{h=1}^{H} \pi_h\, \psi_{hc}\, N_p(x \mid \mu_h, \Sigma_h)}{\sum_{h=1}^{H} \pi_h\, N_p(x \mid \mu_h, \Sigma_h)}.$$
The same idea can be extended to mixed-type predictors by choosing suitable kernels $K_j$ for each $x_{ij}$ and then modeling $y_i$ and $x_i$ together via a product kernel and mixture over latent classes.
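For the continuous-predictor case, the marginalized conditional $\Pr(y = c \mid x)$ can be computed directly from the latent-class parameters. A sketch with one-dimensional $x$ for brevity (parameter values illustrative):

```python
import numpy as np

def class_prob_joint(x, pi, psi, mu, sd):
    """Pr(y = c | x) after marginalizing the latent class z:
    sum_h pi_h psi_hc N(x | mu_h, sd_h^2) / sum_h pi_h N(x | mu_h, sd_h^2)."""
    gauss = np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    w = pi * gauss                               # pi_h N(x | mu_h, sd_h^2)
    return (w[:, None] * psi).sum(axis=0) / w.sum()

psi = np.array([[0.9, 0.1],   # outcome probabilities psi_{h.} for latent class h = 0
                [0.2, 0.8]])  # and for latent class h = 1
p = class_prob_joint(x=-2.0, pi=np.array([0.5, 0.5]), psi=psi,
                     mu=np.array([-2.0, 2.0]), sd=np.array([1.0, 1.0]))
```

An observation near $\mu_0 = -2$ gets latent-class weight concentrated on $h = 0$, so its outcome probabilities are close to $\psi_{0\cdot}$.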
4. Mixture models for regression
Now we move to regression: continuous outcome, continuous predictors.
4.1 Joint mixture model for $(y, x)$
Suppose:
- $y_i \in \mathbb{R}$,
- $x_i \in \mathbb{R}^p$,
- Observed data: $(y_i, x_i)$ for $i = 1,\dots,n$.
Goal: predict $y_{n+1}$ from $x_{n+1}$.
One approach:
- Define $w_i = (y_i, x_i)$ (a $(p+1)$-dimensional vector),
- Model the joint density $f(w_i)$ via a finite Gaussian mixture:
$$f(w_i) = \sum_{h=1}^{H} \pi_h\, N_{p+1}(w_i \mid \mu_h, \Sigma_h). \tag{22.13}$$
Here:
- $\mu_h$ is the $(p+1)$–dimensional mean,
- $\Sigma_h$ is the $(p+1) \times (p+1)$ covariance matrix.
This induces a flexible conditional density for $y_i$ given $x_i$.
4.2 Induced mixture-of-regressions form
From the joint Gaussian structure, the conditional distribution $f(y_i \mid x_i)$ becomes a mixture of linear regressions:
$$f(y_i \mid x_i) = \sum_{h=1}^{H} \pi_h(x_i)\, N\!\big(y_i \mid \beta_{0h} + x_i^{\top}\beta_{1h},\, \sigma_h^2\big), \tag{22.14}$$
where:
- Each component $h$ has its own linear regression:
- Intercept $\beta_{0h}$,
- Slope vector $\beta_{1h}$,
- Residual variance $\sigma_h^2$.
- The mixture weights depend on $x_i$:
$$\pi_h(x) = \frac{\pi_h\, N_p\big(x \mid \mu_h^{(x)}, \Sigma_h^{(x)}\big)}{\sum_{l=1}^{H} \pi_l\, N_p\big(x \mid \mu_l^{(x)}, \Sigma_l^{(x)}\big)},$$
where:
- $(\mu_h^{(x)}, \Sigma_h^{(x)})$ are the $x$–marginal mean and covariance from the joint $(y,x)$ Gaussian for component $h$.
Interpretation:
- At a given predictor value $x_i = x$, the conditional density $f(y \mid x)$ is a univariate Gaussian mixture in $y$.
- As $x$ changes, both:
- The means of the component regressions, and
- The weights $\pi_h(x)$
change smoothly.
Thus, the model allows:
- Nonlinear relationship between the conditional mean $E(y \mid x)$ and $x$,
- Heteroskedasticity: $\text{Var}(y \mid x)$ can change with $x$,
- Non-Gaussian residual shapes (skewness, heavy tails) through mixing.
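The component regression parameters follow from the standard conditional-Gaussian identities: $\beta_{1h} = \Sigma_{h,xx}^{-1}\Sigma_{h,xy}$, $\beta_{0h} = \mu_{h,y} - \beta_{1h}^{\top}\mu_{h,x}$, and $\sigma_h^2 = \Sigma_{h,yy} - \Sigma_{h,yx}\Sigma_{h,xx}^{-1}\Sigma_{h,xy}$. A sketch for a single component, with $y$ stored as the first coordinate of $w = (y, x)$ (function name illustrative):

```python
import numpy as np

def conditional_regression(mu, Sigma):
    """Turn one joint-Gaussian component over w = (y, x) into the induced
    regression y | x: intercept beta0, slope beta1, residual variance s2."""
    mu_y, mu_x = mu[0], mu[1:]
    S_yy = Sigma[0, 0]
    S_yx = Sigma[0, 1:]
    S_xx = Sigma[1:, 1:]
    beta1 = np.linalg.solve(S_xx, S_yx)    # Sigma_xx^{-1} Sigma_xy
    beta0 = mu_y - beta1 @ mu_x
    s2 = S_yy - beta1 @ S_yx               # residual variance
    return beta0, beta1, s2

# a single (p + 1 = 2)-dimensional component with correlation 0.8
b0, b1, s2 = conditional_regression(np.array([1.0, 2.0]),
                                    np.array([[1.0, 0.8], [0.8, 1.0]]))
```

Applying this to each component of a joint-mixture draw, together with the $x$-dependent weights $\pi_h(x)$, reproduces the mixture-of-regressions form.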
4.3 Fitting via Gibbs sampling
A major advantage of this construction:
- We only need to fit the simple joint model in (22.13) using standard mixture-of-Gaussians Gibbs sampling,
- Then the conditional distribution in (22.14) is automatically derived from the joint draws.
So the workflow:
- Fit the mixture:
- Use a Gibbs sampler for the finite Gaussian mixture on $w_i = (y_i, x_i)$.
- At each iteration, get draws for $\{\pi_h, \mu_h, \Sigma_h\}_{h=1}^{H}$ and latent $z_i$.
- For any $x$ you care about:
- Compute the induced mixture-of-regressions form in (22.14) using the current draws,
- Average over iterations to get the posterior for $f(y \mid x)$, $E(y \mid x)$, predictive intervals, etc.
So you get a highly flexible regression model without explicitly programming a separate conditional model.
4.4 Limitations of joint modeling of $(y, x)$
Despite its elegance, joint modeling has some drawbacks:
- Predictors fixed by design:
- In some experiments, $x$ is controlled by the experimenter,
- Treating $x$ as random in a full joint model may feel unnatural or conceptually awkward, even if done primarily as a modeling device.
- Mixed-type or complex predictors:
- If predictors are not all continuous (categorical, counts, etc.), defining a joint multivariate mixture for $(y, x)$ is more complicated.
- You need product kernels and more complex priors.
- High-dimensional predictors:
- For moderate or high $p$, fitting a mixture for the marginal density $f(x)$ is computationally heavy,
- This is hard to justify when interest is only in $f(y \mid x)$, not $f(x)$ itself.
- Overly complex joint models when conditional is simple:
- Sometimes the conditional $y \mid x$ is relatively simple (e.g., a single regression component suffices),
- But $f(x)$ may be highly complex and multi-modal,
- Joint modeling forces a complex mixture for $(y, x)$, which may hurt efficiency and predictive performance for $y \mid x$.
These considerations motivate models that directly target the conditional distribution $f(y \mid x)$ using mixture-based ideas, without fully modeling $f(x)$. (Later sections of the book develop such conditional mixture ideas directly.)
