1. Overview: why use mixture models for classification and regression?

The section explains how finite mixture models can be used not only for density estimation, but also for:

  • Classification (predicting a categorical label $y$ from predictors $x$), including:
    • Fully supervised settings (all $y_i$ observed),
    • Semi-supervised settings (some $y_i$ missing).
  • Regression (predicting a continuous $y$ from predictors $x$),
    • Allowing flexible, nonlinear relationships and heteroskedasticity.

The key idea is:

  • Model either:
    • the joint distribution of $(y, x)$ with a mixture, or
    • the conditional distribution of $x$ given $y$ (for discriminant analysis),
  • Then derive $\Pr(y \mid x)$ from Bayes’ rule.

Mixtures make the conditional mean, variance, and shape very flexible.


2. Mixture models for classification

2.1 Basic setup

We have:

  • A categorical response $y \in \{1,\dots,C\}$.
  • A real-valued predictor vector $x = (x_1,\dots,x_p)$.

Data:

  • Fully supervised: for $i = 1,\dots,n$, both $(y_i, x_i)$ observed.
  • Semi-supervised: $x_i$ observed for all $i$, but $y_i$ observed only for a subset of $n_0 < n$ items.

Goal:

  • Predict the response $y$ from $x$ by modeling $\Pr(y = c \mid x)$.

2.2 Modeling via Bayes’ rule

Instead of modeling $\Pr(y_i = c \mid x_i)$ directly, we model:

  • The marginal class probabilities: $\Pr(y_i = c) = \psi_c$,
  • The class-conditional predictor density: $f_c(x_i) = f(x_i \mid y_i = c)$.

Then, by Bayes’ rule:

$$\Pr(y_i = c \mid x_i) = \frac{\Pr(y_i = c)\, f(x_i \mid y_i = c)}{\sum_{c'=1}^C \Pr(y_i = c')\, f(x_i \mid y_i = c')} = \frac{\psi_c\, f_c(x_i)}{\sum_{c'=1}^C \psi_{c'}\, f_{c'}(x_i)}.$$

If we make $f_c$ flexible enough (e.g., via mixtures), this representation can approximate very complex classification boundaries (nonlinearities, interactions, etc.). Algebraically it looks similar to the allocation formula in a $C$–component finite mixture.
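
As a minimal sketch of this Bayes’-rule classifier: the class priors and the class-conditional parameters below are made-up illustrative values, and each $f_c$ is a single Gaussian for brevity (in the text, $f_c$ would itself be a mixture).

```python
import numpy as np

# Illustrative values: C = 2 classes, p = 2 continuous predictors.
psi = np.array([0.6, 0.4])             # Pr(y = c)
means = [np.zeros(2), np.ones(2)]      # mean of f_c for each class
covs = [np.eye(2), np.eye(2)]          # covariance of f_c for each class

def mvn_pdf(x, mean, cov):
    """Multivariate normal density (small helper; avoids a scipy dependency)."""
    d = x - mean
    norm = 1.0 / np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def class_posterior(x):
    """Pr(y = c | x) = psi_c f_c(x) / sum_{c'} psi_{c'} f_{c'}(x)."""
    dens = np.array([mvn_pdf(x, m, S) for m, S in zip(means, covs)])
    unnorm = psi * dens
    return unnorm / unnorm.sum()
```

Replacing each `mvn_pdf` call with a per-class mixture density recovers the flexible version described next.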

2.3 Prior for class probabilities

Let $\psi = (\psi_1,\dots,\psi_C)$ be the vector of class probabilities.

A simple conjugate prior:

$$\psi \sim \text{Dirichlet}(a \psi_{01}, a \psi_{02}, \dots, a \psi_{0C}),$$

where:

  • $\psi_0 = (\psi_{01},\dots,\psi_{0C})$ is the prior mean,
  • $a$ is a prior “sample size”, controlling how concentrated the prior is.

Given labeled data $(y_i, x_i)$ for $i = 1,\dots,n$ in the fully supervised case, the posterior is:

$$\psi \mid y, X \sim \text{Dirichlet}\Big( a \psi_{01} + \sum_{i=1}^n 1_{y_i = 1},\ \dots,\ a \psi_{0C} + \sum_{i=1}^n 1_{y_i = C} \Big).$$

So the prior counts $a\psi_{0c}$ get updated by the observed counts of each class.
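
The conjugate update is a few lines in code; the prior settings and labels below are illustrative, not from the text.

```python
import numpy as np

# Conjugate Dirichlet update for the class probabilities psi:
# a * psi0 are the prior pseudo-counts; y holds observed labels in {0,...,C-1}.
a, psi0 = 4.0, np.array([0.5, 0.3, 0.2])      # prior sample size, prior mean
y = np.array([0, 0, 1, 2, 2, 2])              # observed class labels
counts = np.bincount(y, minlength=len(psi0))  # class counts n_c
post_alpha = a * psi0 + counts                # Dirichlet(a psi_0c + n_c)
post_mean = post_alpha / post_alpha.sum()     # posterior mean of psi
```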

2.4 Modeling the class-conditional density $f_c(x)$

We need a flexible model for $f_c(x_i)$ for each class $c$.

2.4.1 Simple parametric option: Gaussian per class

One option:

$$f_c(x_i) = N_p(x_i \mid \mu_c, \Sigma),$$

where:

  • All classes share a common covariance $\Sigma$,
  • Means $\mu_c$ differ across classes.

This is a standard discriminant analysis setup (like LDA), but may be:

  • Too restrictive,
  • Sensitive to non-normality,
  • Especially problematic in semi-supervised settings or when class differences are subtle and not captured well by a simple mean shift.

2.4.2 Flexible option: finite mixture per class

More flexibly, for each class $c$:

$$f_c(x_i) = \sum_{h=1}^H \pi_{ch}\, N_p(x_i \mid \mu^{*}_{ch}, \Sigma^{*}_{ch}),$$

with:

  • Class-specific mixture weights $\pi_c = (\pi_{c1},\dots,\pi_{cH})$,
  • Component-specific parameters $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$,
  • A prior $(\mu^{*}_{ch}, \Sigma^{*}_{ch}) \sim P_0$ (often chosen as normal–inverse-Wishart).

This is very flexible but high-dimensional (many parameters).

To simplify, the authors recommend sharing the mixture weights across classes:

  • Let $\pi_c = \pi = (\pi_1,\dots,\pi_H)$ for all $c$,
  • Put a prior $\pi \sim \text{Dirichlet}\big(\tfrac{1}{H},\dots,\tfrac{1}{H}\big)$.

Then:

  • $f_c(x)$ differs across classes via $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$,
  • The mixture structure and component weights are shared, which stabilizes estimation.

Each $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$ can have a conjugate normal–inverse-Wishart prior $P_0$.

2.5 Latent component representation and Gibbs sampling

Introduce component indicators:

  • $z_i \in \{1,\dots,H\}$ is the mixture component index for observation $i$.
  • If $y_i = c$ and $z_i = h$, then $x_i \sim N_p(\mu^{*}_{ch}, \Sigma^{*}_{ch})$.

The data model:

  • For item $i$ with known class $y_i = c$:
    • Choose component $z_i$ with probability $\Pr(z_i = h) = \pi_h$,
    • Then draw $x_i$ from the corresponding Gaussian for that class-component pair.

Gibbs sampling steps (fully supervised case):

  1. Update component indices $z_i$: For an item with $y_i = c$, the full conditional for $z_i$ is:
     $$\Pr(z_i = h \mid y_i = c, -) = \frac{\pi_h\, N_p(x_i \mid \mu^{*}_{ch}, \Sigma^{*}_{ch})}{\sum_{l=1}^H \pi_l\, N_p(x_i \mid \mu^{*}_{cl}, \Sigma^{*}_{cl})}, \quad h = 1,\dots,H.$$
  2. Update class probability vector $\psi$: Using the Dirichlet conditional posterior (as above) based on the labeled $y_i$’s.
  3. Update mixture weights $\pi$: Given the current allocation counts $n_h = \sum_i 1_{z_i = h}$, the conditional posterior is:
     $$\pi \mid z \sim \text{Dirichlet}\Big( \frac{1}{H} + n_1, \dots, \frac{1}{H} + n_H \Big).$$
  4. Update $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$: For each class $c$ and component $h$:
    • Collect all $x_i$ such that $y_i = c$ and $z_i = h$,
    • Update $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$ from their multivariate normal–inverse-Wishart full conditional.

This step can be adapted if you use a different prior than normal–inverse-Wishart.
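
Step 1 of the sampler (the categorical full conditional for $z_i$) can be sketched as follows; the array shapes chosen for the shared weights and the class-component parameters are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mvn_pdf(x, mean, cov):
    """Multivariate normal density (small helper; avoids a scipy dependency)."""
    d = x - mean
    norm = 1.0 / np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def sample_z(x_i, c, pi, mu, Sigma):
    """Draw z_i from its full conditional, given y_i = c.

    Assumed shapes: pi is (H,) shared weights; mu is (C, H, p) component
    means; Sigma is (C, H, p, p) component covariances.
    """
    w = np.array([pi[h] * mvn_pdf(x_i, mu[c, h], Sigma[c, h])
                  for h in range(len(pi))])
    w = w / w.sum()                       # normalize over components h
    return rng.choice(len(pi), p=w), w    # sampled index and its weights
```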

Semi-supervised classification

When some $y_i$ are missing:

  • Add a step to impute missing labels $y_i$ from:
    $$\Pr(y_i = c \mid x_i, -) \propto \psi_c\, f_c(x_i),$$
    which can be computed using the current mixture parameter estimates.
  • Then proceed with the same steps as in the fully supervised case.

Missing predictors

If some entries of $x_i$ are missing:

  • Use the conditional normal distribution to impute missing components given:
    • The observed components of $x_i$,
    • The current class/cluster parameters $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$,
    • And the current $y_i, z_i$.

This is done in an additional Gibbs step.
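
The conditional-normal imputation step can be sketched as below; the boolean-mask interface and variable names are assumptions of this sketch, and $(\mu, \Sigma)$ stand for the parameters of whichever class-component pair $(y_i, z_i)$ currently points to.

```python
import numpy as np

def impute_missing(x, miss, mu, Sigma, rng):
    """Draw the missing entries of x from N(mu, Sigma) given the observed ones.

    miss is a boolean mask over the coordinates of x; the standard Gaussian
    conditioning formulas give the conditional mean and covariance.
    """
    obs = ~miss
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mo = Sigma[np.ix_(miss, obs)]
    S_mm = Sigma[np.ix_(miss, miss)]
    w = np.linalg.solve(S_oo, x[obs] - mu[obs])
    cond_mean = mu[miss] + S_mo @ w
    cond_cov = S_mm - S_mo @ np.linalg.solve(S_oo, S_mo.T)
    x_new = x.copy()
    x_new[miss] = rng.multivariate_normal(cond_mean, cond_cov)
    return x_new
```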

2.6 Prediction for a new observation

For a new item $(x_{n+1}, y_{n+1})$ with unknown $y_{n+1}$:

  • For each saved Gibbs iteration $s$:
    • Compute $\Pr(y_{n+1} = c \mid x_{n+1}, y, X)$ under the current parameters.
  • Average over iterations:
    $$\widehat{\Pr}(y_{n+1} = c \mid x_{n+1}, y, X) = \frac{1}{S} \sum_{s=1}^S \Pr^{(s)}(y_{n+1} = c \mid x_{n+1}),$$
    where $\Pr^{(s)}$ is the predictive probability at draw $s$.

For 0–1 loss (loss 1 when the prediction is wrong, 0 when it is correct), the Bayes-optimal classifier is:

$$\hat{y}_{n+1} = \arg\max_{c \in \{1,\dots,C\}} \widehat{\Pr}(y_{n+1} = c \mid x_{n+1}, y, X).$$

The full vector of predictive probabilities can also be reported to show uncertainty.
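
The averaging and 0–1-loss classification amount to a mean and an argmax; the per-draw probabilities below are purely illustrative.

```python
import numpy as np

# Per-draw predictive probabilities Pr^{(s)}(y_{n+1} = c | x_{n+1}),
# one row per saved Gibbs iteration s = 1..S (values made up).
per_draw = np.array([[0.7, 0.3],
                     [0.6, 0.4],
                     [0.8, 0.2]])
p_hat = per_draw.mean(axis=0)     # averaged predictive probabilities
y_hat = int(np.argmax(p_hat))     # Bayes classifier under 0-1 loss
```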


3. Product-kernel mixture models for mixed-type predictors and joint modeling

The previous discriminant analysis approach mainly considered continuous predictors. The text then generalizes to mixed-scale predictors and a joint model for both $y$ and $x$.

3.1 Product kernel for mixed-type $x$

Suppose $x_i$ has $p$ components of possibly different types (continuous, binary, count, etc.). We want a flexible model for the joint density of $x_i$ alone:

  • For each observation,
    $$f(x_i \mid \theta_i) = \prod_{j=1}^p K_j(x_{ij} \mid \theta_{ij}),$$
    where:
    • $K_j$ is a kernel appropriate for the $j$th predictor:
      • Gaussian for continuous,
      • Bernoulli for binary,
      • Poisson for counts,
      • etc.
    • $\theta_i = (\theta_{i1},\dots,\theta_{ip})$ is a parameter vector.

This assumes conditional independence of components given local parameter $\theta_i$.

To induce dependence across components, we put a mixture prior on $\theta_i$:

  • Let
    $$\theta_i \sim P = \sum_{h=1}^H \pi_h\, \delta_{\Theta_h},$$
    where:
    • $\Theta_h = (\Theta_{h1},\dots,\Theta_{hp})$ is a component-specific parameter vector,
    • $\pi_h$ are mixture weights,
    • $\delta_{\Theta_h}$ is a point mass at $\Theta_h$.

Thus, each observation belongs to a latent class $z_i = h$ with probability $\pi_h$, and then $\theta_i = \Theta_h$.

Each component of $\Theta_h$ has its own prior:

  • $\Theta_{hj} \sim P_{0j}$, with $P_{0j}$ chosen appropriately for kernel $K_j$ (e.g., a conjugate prior).

This is a latent class model:

  • Given $z_i$, the coordinates $x_{ij}$ are independent,
  • But marginally (after integrating out $z_i$), the coordinates of $x_i$ become dependent.
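
As a small sketch, evaluating the product-kernel mixture density for one continuous coordinate (Gaussian kernel) and one binary coordinate (Bernoulli kernel); all parameter values and names are illustrative.

```python
import numpy as np

# Illustrative values: H = 2 latent classes, p = 2 mixed-type coordinates.
pi = np.array([0.5, 0.5])            # mixture weights pi_h
Theta_mean = np.array([0.0, 3.0])    # Gaussian kernel means, Theta_h1
Theta_prob = np.array([0.9, 0.2])    # Bernoulli kernel probs, Theta_h2

def gauss(x, m, s=1.0):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def density(x_cont, x_bin):
    """f(x) = sum_h pi_h * K_1(x_1 | Theta_h1) * K_2(x_2 | Theta_h2)."""
    k1 = gauss(x_cont, Theta_mean)
    k2 = Theta_prob ** x_bin * (1 - Theta_prob) ** (1 - x_bin)
    return float(np.sum(pi * k1 * k2))
```

Marginal dependence shows up because the two coordinates share the latent class: near $x_1 = 0$ the first class dominates, which also pushes the binary coordinate toward 1.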

3.2 Joint modeling of $(y, x)$

We can use the same mixture structure to model both $y$ and $x$ jointly.

First, consider continuous predictors $x_i \in \mathbb{R}^p$. Conditionally on the latent class $z_i = h$:

  • For the response:
    $$\Pr(y_i = c \mid z_i = h) = \psi_{hc},$$
    so each latent class $h$ has its own categorical distribution over outcomes $c$.
  • For the predictors:
    $$f(x_i \mid z_i = h) = N_p(x_i \mid \mu_h, \Sigma_h).$$

So within each latent class $h$:

  • $y_i$ follows a multinomial distribution with parameters $\psi_{h\cdot}$,
  • $x_i$ follows a multivariate normal with parameters $(\mu_h,\Sigma_h)$,
  • And given $z_i$, $y_i$ and $x_i$ are conditionally independent.

Bayesian computation is similar to the earlier mixture setting:

  • Update class labels $z_i$ using a multinomial full conditional involving both $y_i$ and $x_i$,
  • Update $\psi_{hc}$, $\mu_h$, $\Sigma_h$, and $\pi_h$ based on their full conditionals.

Marginalizing out $z_i$ yields a very flexible conditional distribution:

$$\Pr(y_i = c \mid x_i) = \frac{\sum_{h=1}^H \pi_h\, \psi_{hc}\, N_p(x_i \mid \mu_h, \Sigma_h)}{\sum_{h=1}^H \pi_h\, N_p(x_i \mid \mu_h, \Sigma_h)}.$$

The same idea can be extended to mixed-type predictors by choosing suitable kernels $K_j$ for each $x_{ij}$ and then modeling $y_i$ and $x_i$ together via a product kernel and mixture over latent classes.
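
The marginalized conditional can be evaluated directly from one set of parameter draws; a sketch, with the array shapes as assumptions:

```python
import numpy as np

def mvn_pdf(x, mean, cov):
    """Multivariate normal density (small helper; avoids a scipy dependency)."""
    d = x - mean
    norm = 1.0 / np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def conditional_class_probs(x, pi, psi, mu, Sigma):
    """Pr(y = c | x) = sum_h pi_h psi_hc N(x|mu_h,Sigma_h) / sum_h pi_h N(x|mu_h,Sigma_h).

    Assumed shapes: pi (H,), psi (H, C), mu (H, p), Sigma (H, p, p).
    """
    dens = np.array([pi[h] * mvn_pdf(x, mu[h], Sigma[h])
                     for h in range(len(pi))])
    num = dens @ psi          # sum_h pi_h N_h(x) psi_hc, for each c
    return num / dens.sum()
```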


4. Mixture models for regression

Now we move to regression: continuous outcome, continuous predictors.

4.1 Joint mixture model for $(y, x)$

Suppose:

  • $y_i \in \mathbb{R}$,
  • $x_i \in \mathbb{R}^p$,
  • Observed data: $(y_i, x_i)$ for $i = 1,\dots,n$.

Goal: predict $y_{n+1}$ from $x_{n+1}$.

One approach:

  • Define $w_i = (y_i, x_i)$ (a $(p+1)$-dimensional vector),
  • Model the joint density $f(w_i)$ via a finite Gaussian mixture:
    $$f(w_i) = \sum_{h=1}^H \pi_h\, N_{p+1}(w_i \mid \mu_h, \Sigma_h). \tag{22.13}$$

Here:

  • $\mu_h$ is the $(p+1)$–dimensional mean,
  • $\Sigma_h$ is the $(p+1) \times (p+1)$ covariance matrix.

This induces a flexible conditional density for $y_i$ given $x_i$.

4.2 Induced mixture-of-regressions form

From the joint Gaussian structure, the conditional distribution $f(y_i \mid x_i)$ becomes a mixture of linear regressions:

$$f(y_i \mid x_i) = \sum_{h=1}^H \pi_h(x_i)\, N\big(y_i \mid \beta_{0h} + x_i \beta_{1h}, \sigma_h^2\big), \tag{22.14}$$

where:

  • Each component $h$ has its own linear regression:
    • Intercept $\beta_{0h}$,
    • Slope vector $\beta_{1h}$,
    • Residual variance $\sigma_h^2$.
  • The mixture weights depend on $x_i$:
    $$\pi_h(x_i) = \frac{\pi_h\, N_p(x_i \mid \mu_h^{(x)}, \Sigma_h^{(x)})}{\sum_{l=1}^H \pi_l\, N_p(x_i \mid \mu_l^{(x)}, \Sigma_l^{(x)})}, \tag{22.15}$$
    where:
    • $(\mu_h^{(x)}, \Sigma_h^{(x)})$ are the $x$–marginal mean and covariance from the joint $(y,x)$ Gaussian for component $h$.
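
For reference, the component regression parameters in (22.14) follow from the standard Gaussian conditioning identities applied to each component's partitioned mean and covariance (partitioning $(y, x)$ with $y$ first; this derivation is a standard identity, restated here to connect (22.13) to (22.14)):

```latex
% Partition the component-h joint parameters over (y, x):
\mu_h = \begin{pmatrix} \mu_h^{(y)} \\ \mu_h^{(x)} \end{pmatrix},
\qquad
\Sigma_h = \begin{pmatrix}
\sigma_h^{(yy)} & \Sigma_h^{(yx)} \\
\Sigma_h^{(xy)} & \Sigma_h^{(x)}
\end{pmatrix}.

% Gaussian conditioning gives the component regression in (22.14):
\beta_{1h} = \big(\Sigma_h^{(x)}\big)^{-1} \Sigma_h^{(xy)}, \qquad
\beta_{0h} = \mu_h^{(y)} - \mu_h^{(x)\top} \beta_{1h}, \qquad
\sigma_h^2 = \sigma_h^{(yy)} - \Sigma_h^{(yx)} \big(\Sigma_h^{(x)}\big)^{-1} \Sigma_h^{(xy)}.
```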

Interpretation:

  • At a given predictor value $x_i = x$, the conditional density $f(y \mid x)$ is a univariate Gaussian mixture in $y$.
  • As $x$ changes, both the means of the component regressions and the weights $\pi_h(x)$ change smoothly.

Thus, the model allows:

  • Nonlinear relationship between the conditional mean $E(y \mid x)$ and $x$,
  • Heteroskedasticity: $\text{Var}(y \mid x)$ can change with $x$,
  • Non-Gaussian residual shapes (skewness, heavy tails) through mixing.

4.3 Fitting via Gibbs sampling

A major advantage of this construction:

  • We only need to fit the simple joint model in (22.13) using standard mixture-of-Gaussians Gibbs sampling,
  • Then the conditional distribution in (22.14) is automatically derived from the joint draws.

So the workflow:

  1. Fit the mixture:
    • Use a Gibbs sampler for the finite Gaussian mixture on $w_i = (y_i, x_i)$.
    • At each iteration, get draws for $\{\pi_h, \mu_h, \Sigma_h\}_{h=1}^H$ and latent $z_i$.
  2. For any $x$ you care about:
    • Compute the induced mixture-of-regressions form in (22.14) using the current draws,
    • Average over iterations to get the posterior for $f(y \mid x)$, $E(y \mid x)$, predictive intervals, etc.

So you get a highly flexible regression model without explicitly programming a separate conditional model.
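
Step 2 of the workflow can be sketched for one saved draw: computing $E(y \mid x)$ from the joint-mixture parameters, with $y$ stored as the first coordinate of $w$ (a layout assumption of this sketch).

```python
import numpy as np

def mvn_pdf(x, mean, cov):
    """Multivariate normal density (small helper; avoids a scipy dependency)."""
    d = x - mean
    norm = 1.0 / np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def conditional_mean(x, pi, mu, Sigma):
    """E(y | x) induced by one draw of the joint mixture (22.13).

    Each mu[h] stacks (mu_h^{(y)}, mu_h^{(x)}) with y first; Sigma[h] is the
    corresponding (p+1) x (p+1) matrix partitioned the same way.
    """
    H = len(pi)
    # x-dependent weights pi_h(x), eq. (22.15)
    w = np.array([pi[h] * mvn_pdf(x, mu[h][1:], Sigma[h][1:, 1:])
                  for h in range(H)])
    w = w / w.sum()
    # component conditional means beta_0h + x beta_1h, via Gaussian conditioning
    m = np.array([mu[h][0] + Sigma[h][0, 1:]
                  @ np.linalg.solve(Sigma[h][1:, 1:], x - mu[h][1:])
                  for h in range(H)])
    return float(w @ m)
```

Averaging this quantity over saved draws gives the posterior mean regression curve; the same weights and component densities yield $f(y \mid x)$ and predictive intervals.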

4.4 Limitations of joint modeling of $(y, x)$

Despite its elegance, joint modeling has some drawbacks:

  1. Predictors fixed by design:
    • In some experiments, $x$ is controlled by the experimenter,
    • Treating $x$ as random in a full joint model may feel unnatural or conceptually awkward, even if done primarily as a modeling device.
  2. Mixed-type or complex predictors:
    • If predictors are not all continuous (categorical, counts, etc.), defining a joint multivariate mixture for $(y, x)$ is more complicated.
    • You need product kernels and more complex priors.
  3. High-dimensional predictors:
    • For moderate or high $p$, fitting a mixture for the marginal density $f(x)$ is computationally heavy,
    • This is hard to justify when interest is only in $f(y \mid x)$, not $f(x)$ itself.
  4. Overly complex joint models when conditional is simple:
    • Sometimes the conditional $y \mid x$ is relatively simple (e.g., a single regression component suffices),
    • But $f(x)$ may be highly complex and multi-modal,
    • Joint modeling forces a complex mixture for $(y, x)$, which may hurt efficiency and predictive performance for $y \mid x$.

These considerations motivate models that directly target the conditional distribution $f(y \mid x)$ using mixture-based ideas, without fully modeling $f(x)$. (Later sections of the book develop such conditional mixture ideas directly.)