1. Overview: why use mixture models for classification and regression?

The section explains how finite mixture models can be used not only for density estimation, but also for:

  • Classification (predicting a categorical label $y$ from predictors $x$), including:
    • Fully supervised settings (all $y_i$ observed),
    • Semi-supervised settings (some $y_i$ missing).
  • Regression (predicting a continuous $y$ from predictors $x$),
    • Allowing flexible, nonlinear relationships and heteroskedasticity.

The key idea is:

  • Model either:
    • the joint distribution of $(y, x)$ with a mixture, or
    • the conditional distribution of $x$ given $y$ (for discriminant analysis),
  • Then derive $\Pr(y \mid x)$ from Bayes’ rule.

Mixtures make the conditional mean, variance, and shape very flexible.


2. Mixture models for classification

2.1 Basic setup

We have:

  • A categorical response $y \in \{1,\dots,C\}$.
  • A real-valued predictor vector $x = (x_1,\dots,x_p)$.

Data:

  • Fully supervised: for $i = 1,\dots,n$, both $(y_i, x_i)$ observed.
  • Semi-supervised: $x_i$ observed for all $i$, but $y_i$ observed only for a subset of $n_0 < n$ items.

Goal:

  • Predict the response $y$ from $x$ by modeling $\Pr(y = c \mid x)$.

2.2 Modeling via Bayes’ rule

Instead of modeling $\Pr(y_i = c \mid x_i)$ directly, we model:

  • The marginal class probabilities: $\Pr(y_i = c) = \psi_c$,
  • The class-conditional predictor density: $f_c(x_i) = f(x_i \mid y_i = c)$.

Then, by Bayes’ rule:

$$\Pr(y_i = c \mid x_i) = \frac{\Pr(y_i = c)\, f(x_i \mid y_i = c)}{\sum_{c'=1}^C \Pr(y_i = c')\, f(x_i \mid y_i = c')} = \frac{\psi_c\, f_c(x_i)}{\sum_{c'=1}^C \psi_{c'}\, f_{c'}(x_i)}.$$

If we make $f_c$ flexible enough (e.g., via mixtures), this representation can approximate very complex classification boundaries (nonlinearities, interactions, etc.). Algebraically it looks similar to the allocation formula in a $C$–component finite mixture.
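
As a minimal sketch of this Bayes’-rule classifier: the class priors and the class-conditional parameters below are made-up illustrative values, and each $f_c$ is a single Gaussian for brevity (in the text, $f_c$ would itself be a mixture).

```python
import numpy as np

# Illustrative values: C = 2 classes, p = 2 continuous predictors.
psi = np.array([0.6, 0.4])             # Pr(y = c)
means = [np.zeros(2), np.ones(2)]      # mean of f_c for each class
covs = [np.eye(2), np.eye(2)]          # covariance of f_c for each class

def mvn_pdf(x, mean, cov):
    """Multivariate normal density (small helper; avoids a scipy dependency)."""
    d = x - mean
    norm = 1.0 / np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def class_posterior(x):
    """Pr(y = c | x) = psi_c f_c(x) / sum_{c'} psi_{c'} f_{c'}(x)."""
    dens = np.array([mvn_pdf(x, m, S) for m, S in zip(means, covs)])
    unnorm = psi * dens
    return unnorm / unnorm.sum()
```

Replacing each `mvn_pdf` call with a per-class mixture density recovers the flexible version described next.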

2.3 Prior for class probabilities

Let $\psi = (\psi_1,\dots,\psi_C)$ be the vector of class probabilities.

A simple conjugate prior:

$$\psi \sim \text{Dirichlet}(a \psi_{01}, a \psi_{02}, \dots, a \psi_{0C}),$$

where:

  • $\psi_0 = (\psi_{01},\dots,\psi_{0C})$ is the prior mean,
  • $a$ is a prior “sample size”, controlling how concentrated the prior is.

Given labeled data $(y_i, x_i)$ for $i = 1,\dots,n$ in the fully supervised case, the posterior is:

$$\psi \mid y, X \sim \text{Dirichlet}\Big( a \psi_{01} + \sum_{i=1}^n 1_{y_i = 1},\ \dots,\ a \psi_{0C} + \sum_{i=1}^n 1_{y_i = C} \Big).$$

So the prior counts $a\psi_{0c}$ get updated by the observed counts of each class.
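
The conjugate update is a few lines in code; the prior settings and labels below are illustrative, not from the text.

```python
import numpy as np

# Conjugate Dirichlet update for the class probabilities psi:
# a * psi0 are the prior pseudo-counts; y holds observed labels in {0,...,C-1}.
a, psi0 = 4.0, np.array([0.5, 0.3, 0.2])      # prior sample size, prior mean
y = np.array([0, 0, 1, 2, 2, 2])              # observed class labels
counts = np.bincount(y, minlength=len(psi0))  # class counts n_c
post_alpha = a * psi0 + counts                # Dirichlet(a psi_0c + n_c)
post_mean = post_alpha / post_alpha.sum()     # posterior mean of psi
```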

2.4 Modeling the class-conditional density $f_c(x)$

We need a flexible model for $f_c(x_i)$ for each class $c$.

2.4.1 Simple parametric option: Gaussian per class

One option:

$$f_c(x_i) = N_p(x_i \mid \mu_c, \Sigma),$$

where:

  • All classes share a common covariance $\Sigma$,
  • Means $\mu_c$ differ across classes.

This is a standard discriminant analysis setup (like LDA), but may be:

  • Too restrictive,
  • Sensitive to non-normality,
  • Especially problematic in semi-supervised settings or when class differences are subtle and not captured well by a simple mean shift.

2.4.2 Flexible option: finite mixture per class

More flexibly, for each class $c$:

$$f_c(x_i) = \sum_{h=1}^H \pi_{ch}\, N_p(x_i \mid \mu^{*}_{ch}, \Sigma^{*}_{ch}),$$

with:

  • Class-specific mixture weights $\pi_c = (\pi_{c1},\dots,\pi_{cH})$,
  • Component-specific parameters $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$,
  • A prior $(\mu^{*}_{ch}, \Sigma^{*}_{ch}) \sim P_0$ (often chosen as normal–inverse-Wishart).

This is very flexible but high-dimensional (many parameters).

To simplify, the authors recommend sharing the mixture weights across classes:

  • Let $\pi_c = \pi = (\pi_1,\dots,\pi_H)$ for all $c$,
  • Put a prior $\pi \sim \text{Dirichlet}\big(\tfrac{1}{H},\dots,\tfrac{1}{H}\big)$.

Then:

  • $f_c(x)$ differs across classes via $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$,
  • The mixture structure and component weights are shared, which stabilizes estimation.

Each $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$ can have a conjugate normal–inverse-Wishart prior $P_0$.

2.5 Latent component representation and Gibbs sampling

Introduce component indicators:

  • $z_i \in \{1,\dots,H\}$ is the mixture component index for observation $i$.
  • If $y_i = c$ and $z_i = h$, then $x_i \sim N_p(\mu^{*}_{ch}, \Sigma^{*}_{ch})$.

The data model:

  • For item $i$ with known class $y_i = c$:
    • Choose component $z_i$ with probability $\Pr(z_i = h) = \pi_h$,
    • Then draw $x_i$ from the corresponding Gaussian for that class-component pair.

Gibbs sampling steps (fully supervised case):

  1. Update component indices $z_i$: For an item with $y_i = c$, the full conditional for $z_i$ is:
     $$\Pr(z_i = h \mid y_i = c, -) = \frac{\pi_h\, N_p(x_i \mid \mu^{*}_{ch}, \Sigma^{*}_{ch})}{\sum_{l=1}^H \pi_l\, N_p(x_i \mid \mu^{*}_{cl}, \Sigma^{*}_{cl})}, \quad h = 1,\dots,H.$$
  2. Update class probability vector $\psi$: Using the Dirichlet conditional posterior (as above) based on the labeled $y_i$’s.
  3. Update mixture weights $\pi$: Given the current allocation counts $n_h = \sum_i 1_{z_i = h}$, the conditional posterior is:
     $$\pi \mid z \sim \text{Dirichlet}\Big( \frac{1}{H} + n_1, \dots, \frac{1}{H} + n_H \Big).$$
  4. Update $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$: For each class $c$ and component $h$:
    • Collect all $x_i$ such that $y_i = c$ and $z_i = h$,
    • Update $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$ from their multivariate normal–inverse-Wishart full conditional.

This step can be adapted if you use a different prior than normal–inverse-Wishart.
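
Step 1 of the sampler (the categorical full conditional for $z_i$) can be sketched as follows; the array shapes chosen for the shared weights and the class-component parameters are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def mvn_pdf(x, mean, cov):
    """Multivariate normal density (small helper; avoids a scipy dependency)."""
    d = x - mean
    norm = 1.0 / np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def sample_z(x_i, c, pi, mu, Sigma):
    """Draw z_i from its full conditional, given y_i = c.

    Assumed shapes: pi is (H,) shared weights; mu is (C, H, p) component
    means; Sigma is (C, H, p, p) component covariances.
    """
    w = np.array([pi[h] * mvn_pdf(x_i, mu[c, h], Sigma[c, h])
                  for h in range(len(pi))])
    w = w / w.sum()                       # normalize over components h
    return rng.choice(len(pi), p=w), w    # sampled index and its weights
```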

Semi-supervised classification

When some $y_i$ are missing:

  • Add a step to impute missing labels $y_i$ from:
    $$\Pr(y_i = c \mid x_i, -) \propto \psi_c\, f_c(x_i),$$
    which can be computed using the current mixture parameter estimates.
  • Then proceed with the same steps as in the fully supervised case.

Missing predictors

If some entries of $x_i$ are missing:

  • Use the conditional normal distribution to impute missing components given:
    • The observed components of $x_i$,
    • The current class/cluster parameters $(\mu^{*}_{ch}, \Sigma^{*}_{ch})$,
    • And the current $y_i, z_i$.

This is done in an additional Gibbs step.
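
The conditional-normal imputation step can be sketched as below; the boolean-mask interface and variable names are assumptions of this sketch, and $(\mu, \Sigma)$ stand for the parameters of whichever class-component pair $(y_i, z_i)$ currently points to.

```python
import numpy as np

def impute_missing(x, miss, mu, Sigma, rng):
    """Draw the missing entries of x from N(mu, Sigma) given the observed ones.

    miss is a boolean mask over the coordinates of x; the standard Gaussian
    conditioning formulas give the conditional mean and covariance.
    """
    obs = ~miss
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mo = Sigma[np.ix_(miss, obs)]
    S_mm = Sigma[np.ix_(miss, miss)]
    w = np.linalg.solve(S_oo, x[obs] - mu[obs])
    cond_mean = mu[miss] + S_mo @ w
    cond_cov = S_mm - S_mo @ np.linalg.solve(S_oo, S_mo.T)
    x_new = x.copy()
    x_new[miss] = rng.multivariate_normal(cond_mean, cond_cov)
    return x_new
```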

2.6 Prediction for a new observation

For a new item $(x_{n+1}, y_{n+1})$ with unknown $y_{n+1}$:

  • For each saved Gibbs iteration $s$:
    • Compute $\Pr(y_{n+1} = c \mid x_{n+1}, y, X)$ under the current parameters.
  • Average over iterations:
    $$\widehat{\Pr}(y_{n+1} = c \mid x_{n+1}, y, X) = \frac{1}{S} \sum_{s=1}^S \Pr^{(s)}(y_{n+1} = c \mid x_{n+1}),$$
    where $\Pr^{(s)}$ is the predictive probability at draw $s$.

For 0–1 loss (loss 1 when the prediction is wrong, 0 when it is correct), the Bayes-optimal classifier is:

$$\hat{y}_{n+1} = \arg\max_{c \in \{1,\dots,C\}} \widehat{\Pr}(y_{n+1} = c \mid x_{n+1}, y, X).$$

The full vector of predictive probabilities can also be reported to show uncertainty.
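
The averaging and 0–1-loss classification amount to a mean and an argmax; the per-draw probabilities below are purely illustrative.

```python
import numpy as np

# Per-draw predictive probabilities Pr^{(s)}(y_{n+1} = c | x_{n+1}),
# one row per saved Gibbs iteration s = 1..S (values made up).
per_draw = np.array([[0.7, 0.3],
                     [0.6, 0.4],
                     [0.8, 0.2]])
p_hat = per_draw.mean(axis=0)     # averaged predictive probabilities
y_hat = int(np.argmax(p_hat))     # Bayes classifier under 0-1 loss
```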


3. Product-kernel mixture models for mixed-type predictors and joint modeling

The previous discriminant analysis approach mainly considered continuous predictors. The text then generalizes to mixed-scale predictors and a joint model for both $y$ and $x$.

3.1 Product kernel for mixed-type $x$

Suppose $x_i$ has $p$ components of possibly different types (continuous, binary, count, etc.). We want a flexible model for the joint density of $x_i$ alone:

  • For each observation,
    $$f(x_i \mid \theta_i) = \prod_{j=1}^p K_j(x_{ij} \mid \theta_{ij}),$$
    where:
    • $K_j$ is a kernel appropriate for the $j$th predictor:
      • Gaussian for continuous,
      • Bernoulli for binary,
      • Poisson for counts,
      • etc.
    • $\theta_i = (\theta_{i1},\dots,\theta_{ip})$ is a parameter vector.

This assumes conditional independence of components given local parameter $\theta_i$.

To induce dependence across components, we put a mixture prior on $\theta_i$:

  • Let
    $$\theta_i \sim P = \sum_{h=1}^H \pi_h\, \delta_{\Theta_h},$$
    where:
    • $\Theta_h = (\Theta_{h1},\dots,\Theta_{hp})$ is a component-specific parameter vector,
    • $\pi_h$ are mixture weights,
    • $\delta_{\Theta_h}$ is a point mass at $\Theta_h$.

Thus, each observation belongs to a latent class $z_i = h$ with probability $\pi_h$, and then $\theta_i = \Theta_h$.

Each component of $\Theta_h$ has its own prior:

  • $\Theta_{hj} \sim P_{0j}$, with $P_{0j}$ chosen appropriately for kernel $K_j$ (e.g., a conjugate prior).

This is a latent class model:

  • Given $z_i$, the coordinates $x_{ij}$ are independent,
  • But marginally (after integrating out $z_i$), the coordinates of $x_i$ become dependent.
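
As a small sketch, evaluating the product-kernel mixture density for one continuous coordinate (Gaussian kernel) and one binary coordinate (Bernoulli kernel); all parameter values and names are illustrative.

```python
import numpy as np

# Illustrative values: H = 2 latent classes, p = 2 mixed-type coordinates.
pi = np.array([0.5, 0.5])            # mixture weights pi_h
Theta_mean = np.array([0.0, 3.0])    # Gaussian kernel means, Theta_h1
Theta_prob = np.array([0.9, 0.2])    # Bernoulli kernel probs, Theta_h2

def gauss(x, m, s=1.0):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def density(x_cont, x_bin):
    """f(x) = sum_h pi_h * K_1(x_1 | Theta_h1) * K_2(x_2 | Theta_h2)."""
    k1 = gauss(x_cont, Theta_mean)
    k2 = Theta_prob ** x_bin * (1 - Theta_prob) ** (1 - x_bin)
    return float(np.sum(pi * k1 * k2))
```

Marginal dependence shows up because the two coordinates share the latent class: near $x_1 = 0$ the first class dominates, which also pushes the binary coordinate toward 1.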

3.2 Joint modeling of $(y, x)$

We can use the same mixture structure to model both $y$ and $x$ jointly.

First, consider continuous predictors $x_i \in \mathbb{R}^p$. Conditionally on the latent class $z_i = h$:

  • For the response:
    $$\Pr(y_i = c \mid z_i = h) = \psi_{hc},$$
    so each latent class $h$ has its own categorical distribution over outcomes $c$.
  • For the predictors:
    $$f(x_i \mid z_i = h) = N_p(x_i \mid \mu_h, \Sigma_h).$$

So within each latent class $h$:

  • $y_i$ follows a multinomial distribution with parameters $\psi_{h\cdot}$,
  • $x_i$ follows a multivariate normal with parameters $(\mu_h,\Sigma_h)$,
  • And given $z_i$, $y_i$ and $x_i$ are conditionally independent.

Bayesian computation is similar to the earlier mixture setting:

  • Update class labels $z_i$ using a multinomial full conditional involving both $y_i$ and $x_i$,
  • Update $\psi_{hc}$, $\mu_h$, $\Sigma_h$, and $\pi_h$ based on their full conditionals.

Marginalizing out $z_i$ yields a very flexible conditional distribution:

$$\Pr(y_i = c \mid x_i) = \frac{\sum_{h=1}^H \pi_h\, \psi_{hc}\, N_p(x_i \mid \mu_h, \Sigma_h)}{\sum_{h=1}^H \pi_h\, N_p(x_i \mid \mu_h, \Sigma_h)}.$$

The same idea can be extended to mixed-type predictors by choosing suitable kernels $K_j$ for each $x_{ij}$ and then modeling $y_i$ and $x_i$ together via a product kernel and mixture over latent classes.
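
The marginalized conditional can be evaluated directly from one set of parameter draws; a sketch, with the array shapes as assumptions:

```python
import numpy as np

def mvn_pdf(x, mean, cov):
    """Multivariate normal density (small helper; avoids a scipy dependency)."""
    d = x - mean
    norm = 1.0 / np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def conditional_class_probs(x, pi, psi, mu, Sigma):
    """Pr(y = c | x) = sum_h pi_h psi_hc N(x|mu_h,Sigma_h) / sum_h pi_h N(x|mu_h,Sigma_h).

    Assumed shapes: pi (H,), psi (H, C), mu (H, p), Sigma (H, p, p).
    """
    dens = np.array([pi[h] * mvn_pdf(x, mu[h], Sigma[h])
                     for h in range(len(pi))])
    num = dens @ psi          # sum_h pi_h N_h(x) psi_hc, for each c
    return num / dens.sum()
```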


4. Mixture models for regression

Now we move to regression: continuous outcome, continuous predictors.

4.1 Joint mixture model for $(y, x)$

Suppose:

  • $y_i \in \mathbb{R}$,
  • $x_i \in \mathbb{R}^p$,
  • Observed data: $(y_i, x_i)$ for $i = 1,\dots,n$.

Goal: predict $y_{n+1}$ from $x_{n+1}$.

One approach:

  • Define $w_i = (y_i, x_i)$ (a $(p+1)$-dimensional vector),
  • Model the joint density $f(w_i)$ via a finite Gaussian mixture:
    $$f(w_i) = \sum_{h=1}^H \pi_h\, N_{p+1}(w_i \mid \mu_h, \Sigma_h). \tag{22.13}$$

Here:

  • $\mu_h$ is the $(p+1)$–dimensional mean,
  • $\Sigma_h$ is the $(p+1) \times (p+1)$ covariance matrix.

This induces a flexible conditional density for $y_i$ given $x_i$.

4.2 Induced mixture-of-regressions form

From the joint Gaussian structure, the conditional distribution $f(y_i \mid x_i)$ becomes a mixture of linear regressions:

$$f(y_i \mid x_i) = \sum_{h=1}^H \pi_h(x_i)\, N\big(y_i \mid \beta_{0h} + x_i \beta_{1h}, \sigma_h^2\big), \tag{22.14}$$

where:

  • Each component $h$ has its own linear regression:
    • Intercept $\beta_{0h}$,
    • Slope vector $\beta_{1h}$,
    • Residual variance $\sigma_h^2$.
  • The mixture weights depend on $x_i$:
    $$\pi_h(x_i) = \frac{\pi_h\, N_p(x_i \mid \mu_h^{(x)}, \Sigma_h^{(x)})}{\sum_{l=1}^H \pi_l\, N_p(x_i \mid \mu_l^{(x)}, \Sigma_l^{(x)})}, \tag{22.15}$$
    where:
    • $(\mu_h^{(x)}, \Sigma_h^{(x)})$ are the $x$–marginal mean and covariance from the joint $(y,x)$ Gaussian for component $h$.
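
For reference, the component regression parameters in (22.14) follow from the standard Gaussian conditioning identities applied to each component's partitioned mean and covariance (partitioning $(y, x)$ with $y$ first; this derivation is a standard identity, restated here to connect (22.13) to (22.14)):

```latex
% Partition the component-h joint parameters over (y, x):
\mu_h = \begin{pmatrix} \mu_h^{(y)} \\ \mu_h^{(x)} \end{pmatrix},
\qquad
\Sigma_h = \begin{pmatrix}
\sigma_h^{(yy)} & \Sigma_h^{(yx)} \\
\Sigma_h^{(xy)} & \Sigma_h^{(x)}
\end{pmatrix}.

% Gaussian conditioning gives the component regression in (22.14):
\beta_{1h} = \big(\Sigma_h^{(x)}\big)^{-1} \Sigma_h^{(xy)}, \qquad
\beta_{0h} = \mu_h^{(y)} - \mu_h^{(x)\top} \beta_{1h}, \qquad
\sigma_h^2 = \sigma_h^{(yy)} - \Sigma_h^{(yx)} \big(\Sigma_h^{(x)}\big)^{-1} \Sigma_h^{(xy)}.
```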

Interpretation:

  • At a given predictor value $x_i = x$, the conditional density $f(y \mid x)$ is a univariate Gaussian mixture in $y$.
  • As $x$ changes, both the means of the component regressions and the weights $\pi_h(x)$ change smoothly.

Thus, the model allows:

  • Nonlinear relationship between the conditional mean $E(y \mid x)$ and $x$,
  • Heteroskedasticity: $\text{Var}(y \mid x)$ can change with $x$,
  • Non-Gaussian residual shapes (skewness, heavy tails) through mixing.

4.3 Fitting via Gibbs sampling

A major advantage of this construction:

  • We only need to fit the simple joint model in (22.13) using standard mixture-of-Gaussians Gibbs sampling,
  • Then the conditional distribution in (22.14) is automatically derived from the joint draws.

So the workflow:

  1. Fit the mixture:
    • Use a Gibbs sampler for the finite Gaussian mixture on $w_i = (y_i, x_i)$.
    • At each iteration, get draws for $\{\pi_h, \mu_h, \Sigma_h\}_{h=1}^H$ and latent $z_i$.
  2. For any $x$ you care about:
    • Compute the induced mixture-of-regressions form in (22.14) using the current draws,
    • Average over iterations to get the posterior for $f(y \mid x)$, $E(y \mid x)$, predictive intervals, etc.

So you get a highly flexible regression model without explicitly programming a separate conditional model.
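
Step 2 of the workflow can be sketched for one saved draw: computing $E(y \mid x)$ from the joint-mixture parameters, with $y$ stored as the first coordinate of $w$ (a layout assumption of this sketch).

```python
import numpy as np

def mvn_pdf(x, mean, cov):
    """Multivariate normal density (small helper; avoids a scipy dependency)."""
    d = x - mean
    norm = 1.0 / np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def conditional_mean(x, pi, mu, Sigma):
    """E(y | x) induced by one draw of the joint mixture (22.13).

    Each mu[h] stacks (mu_h^{(y)}, mu_h^{(x)}) with y first; Sigma[h] is the
    corresponding (p+1) x (p+1) matrix partitioned the same way.
    """
    H = len(pi)
    # x-dependent weights pi_h(x), eq. (22.15)
    w = np.array([pi[h] * mvn_pdf(x, mu[h][1:], Sigma[h][1:, 1:])
                  for h in range(H)])
    w = w / w.sum()
    # component conditional means beta_0h + x beta_1h, via Gaussian conditioning
    m = np.array([mu[h][0] + Sigma[h][0, 1:]
                  @ np.linalg.solve(Sigma[h][1:, 1:], x - mu[h][1:])
                  for h in range(H)])
    return float(w @ m)
```

Averaging this quantity over saved draws gives the posterior mean regression curve; the same weights and component densities yield $f(y \mid x)$ and predictive intervals.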

4.4 Limitations of joint modeling of $(y, x)$

Despite its elegance, joint modeling has some drawbacks:

  1. Predictors fixed by design:
    • In some experiments, $x$ is controlled by the experimenter,
    • Treating $x$ as random in a full joint model may feel unnatural or conceptually awkward, even if done primarily as a modeling device.
  2. Mixed-type or complex predictors:
    • If predictors are not all continuous (categorical, counts, etc.), defining a joint multivariate mixture for $(y, x)$ is more complicated.
    • You need product kernels and more complex priors.
  3. High-dimensional predictors:
    • For moderate or high $p$, fitting a mixture for the marginal density $f(x)$ is computationally heavy,
    • This is hard to justify when interest is only in $f(y \mid x)$, not $f(x)$ itself.
  4. Overly complex joint models when conditional is simple:
    • Sometimes the conditional $y \mid x$ is relatively simple (e.g., a single regression component suffices),
    • But $f(x)$ may be highly complex and multi-modal,
    • Joint modeling forces a complex mixture for $(y, x)$, which may hurt efficiency and predictive performance for $y \mid x$.

These considerations motivate models that directly target the conditional distribution $f(y \mid x)$ using mixture-based ideas, without fully modeling $f(x)$. (Later sections of the book develop such conditional mixture ideas directly.)