Bayesian inference expresses conclusions about unknown parameters $θ$ or unobserved data $\tilde{y}$ as probability statements conditional on the observed data $y$. These statements are written as conditional distributions such as $p(θ|y)$ or $p(\tilde{y}|y)$, and they are also implicitly conditioned on any known covariates $x$.
This conditional view—updating beliefs about parameters after observing data—differs fundamentally from the classical (frequentist) approach, which evaluates estimation procedures over hypothetical repetitions of data given a fixed true θ.
Although both approaches can produce similar numerical results in simple cases, the Bayesian framework naturally extends to more complex problems because it directly models uncertainty through probability.
1. Probability Notation
- $p(\cdot|\cdot)$: Conditional probability density or mass function.
- $p(\cdot)$: Marginal distribution (same notation for continuous and discrete cases).
- The same symbol $p(\cdot)$ may refer to different distributions in a single equation; this is a compact and standard abuse of notation.
- For the probability of an event, $Pr(\cdot)$ is used instead of $p(\cdot)$; e.g., $Pr(θ > 2) = \int_{θ>2} p(θ)dθ$.
- Standard distribution notation:
  - If $θ \sim N(μ, σ^2)$, then $p(θ) = N(θ|μ, σ^2)$ or $p(θ|μ, σ^2) = N(θ|μ, σ^2)$.
  - $N(μ, σ^2)$: random variable; $N(θ|μ, σ^2)$: probability density function.
- Other expressions:
  - Coefficient of variation = $\text{sd}(θ) / E(θ)$
  - Geometric mean = $\exp(E[\log(θ)])$
  - Geometric standard deviation = $\exp(\text{sd}[\log(θ)])$
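The summaries above can be estimated from simulation draws of θ. The following is a minimal Python sketch, assuming hypothetical draws from a lognormal distribution; the distribution, its parameters, and the helper name `summarize` are illustrative choices only. It also evaluates $Pr(θ > 2)$ for an assumed $θ \sim N(1, 2^2)$ with the standard library's `NormalDist`, which takes the standard deviation rather than the variance.

```python
import math
import random
import statistics
from statistics import NormalDist


def summarize(draws):
    """Coefficient of variation, geometric mean and geometric sd of positive draws."""
    logs = [math.log(d) for d in draws]
    return {
        "cv": statistics.stdev(draws) / statistics.mean(draws),
        "geo_mean": math.exp(statistics.mean(logs)),
        "geo_sd": math.exp(statistics.stdev(logs)),
    }


rng = random.Random(0)
# Hypothetical draws: log(theta) ~ N(1.0, 0.5^2), so theta is lognormal.
draws = [rng.lognormvariate(1.0, 0.5) for _ in range(20000)]
summary = summarize(draws)

# Pr(theta > 2) for an assumed theta ~ N(mu=1, sigma^2=4); sigma=2.0 is the sd.
prob_gt_2 = 1.0 - NormalDist(mu=1.0, sigma=2.0).cdf(2.0)
```

For lognormal draws the geometric mean and geometric standard deviation should land near $\exp(1.0)$ and $\exp(0.5)$, which makes this a convenient sanity check.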
2. Bayes’ Rule
To make probabilistic statements about θ after observing y, we start with a joint probability model for θ and y:
$p(θ, y) = p(θ)p(y|θ)$
where:
- $p(θ)$: prior distribution (belief about θ before seeing data)
- $p(y|θ)$: sampling distribution or likelihood (data model given θ)
Conditioning on the observed data y and applying Bayes’ rule gives the posterior distribution:
$p(θ|y) = \frac{p(θ)p(y|θ)}{p(y)}$
where $p(y) = \int p(θ)p(y|θ)dθ$ is the marginal likelihood or evidence.
Often, we write the unnormalized form (dropping the constant $p(y)$, which does not depend on θ):
$p(θ|y) \propto p(θ)p(y|θ)$
This compact expression summarizes the technical essence of Bayesian inference:
Posterior ∝ Prior × Likelihood.
Every Bayesian analysis revolves around specifying $p(θ, y)$, computing $p(θ|y)$, and then summarizing or interpreting this posterior.
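One way to see Posterior ∝ Prior × Likelihood in action is a grid approximation. The sketch below is an illustration under stated assumptions, not a general-purpose method: it assumes a binomial data model with 7 successes in 10 trials, a flat prior on θ, and a 200-point grid, all of which are hypothetical choices. The helper name `grid_posterior` is likewise made up for this example.

```python
def grid_posterior(grid, prior, likelihood):
    """Normalize prior * likelihood over a grid of theta values."""
    unnormalized = [p * likelihood(t) for t, p in zip(grid, prior)]
    total = sum(unnormalized)  # plays the role of p(y) on the grid
    return [u / total for u in unnormalized]


# Hypothetical data: 7 successes in 10 Bernoulli trials.
n_trials, n_success = 10, 7

# Midpoint grid on (0, 1) with a flat prior (assumed for illustration).
grid = [(i + 0.5) / 200 for i in range(200)]
prior = [1.0 / len(grid)] * len(grid)


def likelihood(theta):
    # Binomial likelihood up to a constant that does not depend on theta.
    return theta**n_success * (1.0 - theta) ** (n_trials - n_success)


posterior = grid_posterior(grid, prior, likelihood)
posterior_mean = sum(t * w for t, w in zip(grid, posterior))
```

Under these assumptions the exact posterior is Beta(8, 4), whose mean is 2/3, so `posterior_mean` should be close to that value. Dropping the binomial coefficient from the likelihood changes nothing, because it cancels in the normalization just as $p(y)$ does.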
3. Prediction
Bayesian methods also handle predictions of unknown but observable quantities.
(a) Prior predictive distribution
Before observing any data, the predictive distribution for an unobserved $y$ is:
$p(y) = \int p(θ)p(y|θ)dθ$
This is called the prior predictive distribution because it predicts potential data outcomes before any observations are made.
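The integral can be approximated by simulation: draw θ from the prior, then draw y from the data model given that θ. The sketch below assumes a Uniform(0, 1) prior and a binomial data model with 10 trials; both are hypothetical choices, as is the function name `prior_predictive_draws`.

```python
import random
from collections import Counter


def prior_predictive_draws(n_trials, n_draws, rng):
    """Simulate y from p(y) by first drawing theta from the prior."""
    draws = []
    for _ in range(n_draws):
        theta = rng.random()  # theta ~ Uniform(0, 1), the assumed prior
        # y | theta ~ Binomial(n_trials, theta), built from Bernoulli trials
        y = sum(rng.random() < theta for _ in range(n_trials))
        draws.append(y)
    return draws


rng = random.Random(1)
n_trials = 10
draws = prior_predictive_draws(n_trials, 20000, rng)
counts = Counter(draws)
freqs = [counts[k] / len(draws) for k in range(n_trials + 1)]
```

With a uniform prior, the prior predictive distribution of a binomial count is uniform on 0, 1, ..., n, so each entry of `freqs` should be roughly 1/11.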
(b) Posterior predictive distribution
After observing data $y$, we can predict new or future data $\tilde{y}$:
$p(\tilde{y}|y) = \int p(\tilde{y}|θ)p(θ|y)dθ$
This is the posterior predictive distribution, which averages predictions $p(\tilde{y}|θ)$ over uncertainty in θ as captured by the posterior.
It provides a distribution for future or unobserved outcomes conditional on the observed data.
Example:
- $y = (y_1, \dots, y_n)$: observed weights of an object.
- $θ = (μ, σ^2)$: true weight and measurement variance.
- $\tilde{y}$: future weight measurement.
Then $p(\tilde{y}|y)$ gives the probability distribution for the next recorded weight, integrating over uncertainty in both μ and σ².
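The weighing example can be sketched by simulation. This is a minimal illustration under assumptions that are not stated in the text above: a normal measurement model, the noninformative prior $p(μ, σ^2) \propto 1/σ^2$, and eight made-up weights. Under those assumptions, $σ^2|y$ is a scaled inverse chi-square and $μ|σ^2, y \sim N(\bar{y}, σ^2/n)$, so each predictive draw takes three steps.

```python
import math
import random
import statistics


def posterior_predictive_draws(y, n_draws, rng):
    """Draw y_tilde from p(y_tilde | y) for a normal model with unknown mu, sigma^2.

    Assumes the noninformative prior p(mu, sigma^2) proportional to 1/sigma^2.
    """
    n = len(y)
    ybar = statistics.mean(y)
    s2 = statistics.variance(y)  # sample variance with n - 1 denominator
    draws = []
    for _ in range(n_draws):
        # chi-square with n - 1 df is Gamma(shape=(n - 1)/2, scale=2)
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2                   # sigma^2 | y
        mu = rng.gauss(ybar, math.sqrt(sigma2 / n))    # mu | sigma^2, y
        draws.append(rng.gauss(mu, math.sqrt(sigma2))) # y_tilde | mu, sigma^2
    return draws


# Hypothetical repeated weighings of one object (units arbitrary).
weights = [10.2, 9.9, 10.1, 10.4, 9.8, 10.0, 10.3, 10.1]
rng = random.Random(2)
pred = posterior_predictive_draws(weights, 20000, rng)
pred_mean = statistics.mean(pred)
pred_sd = statistics.stdev(pred)
```

The predictive draws center on the sample mean, and their spread exceeds the sample standard deviation because they carry uncertainty in both μ and σ² in addition to measurement noise.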
4. Likelihood and the Likelihood Principle
In Bayes’ rule, the data affect the posterior only through $p(y|θ)$, known as the likelihood function (viewed as a function of θ for fixed y).
Thus, Bayesian inference obeys the likelihood principle: for a given data set, any two probability models that yield the same likelihood function for θ lead to the same inference about θ.
However, this principle holds only within the assumed model.
Since the model may be misspecified, repeated-sampling reasoning (frequentist logic) can still be useful for checking model adequacy.
In practice, an applied Bayesian analyst should be willing to apply Bayes’ rule under several alternative plausible models to assess robustness.
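The dependence on the data only through the likelihood can be checked numerically. The sketch below assumes two hypothetical sampling designs that produce proportional likelihoods: a fixed 12 trials with 3 successes (binomial), and sampling until the 3rd success, which arrives on trial 12 (negative binomial). The flat prior and grid size are also assumptions made for illustration.

```python
import math


def normalize(weights):
    """Scale nonnegative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


grid = [(i + 0.5) / 500 for i in range(500)]
prior = [1.0] * len(grid)  # flat prior on the grid (assumed)

# Design A: n = 12 trials fixed in advance, 3 successes observed.
binom_lik = [math.comb(12, 3) * t**3 * (1 - t) ** 9 for t in grid]

# Design B: sample until the 3rd success, which occurs on trial 12.
negbin_lik = [math.comb(11, 2) * t**3 * (1 - t) ** 9 for t in grid]

post_a = normalize([p * l for p, l in zip(prior, binom_lik)])
post_b = normalize([p * l for p, l in zip(prior, negbin_lik)])
max_diff = max(abs(a - b) for a, b in zip(post_a, post_b))
```

The two likelihoods differ only by a constant factor, so the normalized posteriors agree up to floating-point error, even though the two sampling distributions differ.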
5. Likelihood and Odds Ratios
The ratio of posterior densities evaluated at two parameter values $θ_1$ and $θ_2$ is called the posterior odds in favor of $θ_1$ over $θ_2$.
For discrete cases, $θ_2$ often represents the complement of $θ_1$.
Bayes’ rule in terms of odds can be expressed as:
$\frac{p(θ_1|y)}{p(θ_2|y)} = \frac{p(θ_1)}{p(θ_2)} \times \frac{p(y|θ_1)}{p(y|θ_2)}$
This means:
Posterior Odds = Prior Odds × Likelihood Ratio.
This formulation is useful for comparing competing hypotheses or models. The likelihood ratio quantifies how strongly the observed data support one hypothesis relative to another.
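As a small worked case, the odds form can be evaluated directly and checked against the normalized posterior. The numbers below are hypothetical: two candidate values for a coin's heads probability, equal prior weight on each, and 7 heads in 10 flips. The function names are illustrative.

```python
def posterior_odds(prior_1, prior_2, lik_1, lik_2):
    """Posterior odds of theta_1 versus theta_2: prior odds times likelihood ratio."""
    return (prior_1 / prior_2) * (lik_1 / lik_2)


def coin_likelihood(theta, heads, flips):
    # The binomial coefficient is omitted because it cancels in the ratio.
    return theta**heads * (1.0 - theta) ** (flips - heads)


# Hypothetical setup: theta is either 0.5 or 0.8, with equal prior probability.
theta_1, theta_2 = 0.5, 0.8
prior_1, prior_2 = 0.5, 0.5
heads, flips = 7, 10

lik_1 = coin_likelihood(theta_1, heads, flips)
lik_2 = coin_likelihood(theta_2, heads, flips)
odds = posterior_odds(prior_1, prior_2, lik_1, lik_2)

# Cross-check against the normalized posterior on the two-point parameter space.
post_1 = prior_1 * lik_1 / (prior_1 * lik_1 + prior_2 * lik_2)
post_2 = 1.0 - post_1
```

Here the prior odds equal 1, so the posterior odds equal the likelihood ratio. The result is below 1, meaning these data favor $θ_2 = 0.8$ over $θ_1 = 0.5$, and `odds` matches `post_1 / post_2`.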
6. Core Ideas Summarized
- Bayesian inference expresses uncertainty about θ and future data ỹ using conditional probabilities given observed data y.
- Bayes’ rule: $p(θ|y) \propto p(θ)p(y|θ)$.
- Prediction: future or missing observations follow $p(\tilde{y}|y) = \int p(\tilde{y}|θ)p(θ|y)dθ$.
- Likelihood principle: data influence inference only through the likelihood function.
- Odds formulation: posterior odds = prior odds × likelihood ratio.
Together, these provide the mathematical foundation of Bayesian reasoning—updating prior beliefs with data to form posterior beliefs, and extending those beliefs to predict future or unobserved outcomes.
