This section shows how Bayes’ rule works when the unknown is discrete (a small number of possible values) rather than a continuous parameter. With only a few states, each step from prior to likelihood to posterior can be computed and inspected directly.

Two examples are used:

(1) whether a woman is a carrier for hemophilia, and

(2) which word someone intended when they typed “radom”.

In both, we treat the unknown as a discrete variable θ.


1. Genetics example: “Is she a carrier?”

Setup

  • Hemophilia is X-linked recessive.
    • Men: XY → if they get the bad X, they are affected.
    • Women: XX → if they have only one bad X, usually not affected (carrier).
  • Consider a woman:
    • She has an affected brother → so her mother must have been a carrier (one good X, one bad X).
    • Her father is not affected → so he gave her a good X.
    • So this woman has a 50% chance of having inherited the bad X from her mother.
  • Define the unknown:
    • θ = 1 → woman is a carrier
    • θ = 0 → woman is not a carrier
  • Prior: $Pr(θ=1) = Pr(θ=0) = \frac{1}{2}$ because we know she had a 50–50 chance from her mother.

Data and likelihood

  • Now look at her sons.
  • If the woman is a carrier (θ=1), each son has a 50% chance of being affected.
  • If the woman is not a carrier (θ=0), each son is almost certainly unaffected (we ignore the rare possibility of a new mutation).
  • Suppose she has two sons, both unaffected: $y_1=0, y_2=0$.
  • Likelihoods:
    • If θ=1 (carrier): $Pr(y_1=0, y_2=0 \mid θ=1) = 0.5 \times 0.5 = 0.25$
    • If θ=0 (not carrier): $Pr(y_1=0, y_2=0 \mid θ=0) = 1 \times 1 = 1$

Posterior

Apply Bayes’ rule:

$Pr(θ=1 \mid y) = \frac{Pr(y \mid θ=1)Pr(θ=1)}{Pr(y \mid θ=1)Pr(θ=1) + Pr(y \mid θ=0)Pr(θ=0)} = \frac{0.25 \times 0.5}{0.25 \times 0.5 + 1 \times 0.5} = \frac{0.125}{0.625} = 0.20$

So, after seeing two unaffected sons, the chance she’s a carrier drops from 50% to 20%.
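The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not code from the source; the helper name `discrete_posterior` and the dictionary keys are assumptions chosen for readability.

```python
def discrete_posterior(prior, likelihood):
    """Multiply prior by likelihood for each state, then normalize to sum to 1."""
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    return {state: value / total for state, value in unnormalized.items()}


# Prior from the family history: 50-50 between carrier and non-carrier.
prior = {"carrier": 0.5, "not_carrier": 0.5}

# Likelihood of two unaffected sons under each state (mutation ignored).
likelihood = {"carrier": 0.5 * 0.5, "not_carrier": 1.0 * 1.0}

posterior = discrete_posterior(prior, likelihood)
print(posterior)  # carrier is about 0.20, not_carrier about 0.80
```

The same helper works for any finite set of states, because the denominator of Bayes’ rule is just the sum of the prior × likelihood products.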

You can also see it with odds:

  • prior odds = 0.5 / 0.5 = 1
  • likelihood ratio = 0.25 / 1 = 0.25
  • posterior odds = 1 × 0.25 = 0.25
  • convert odds 0.25 → probability = 0.25 / (1+0.25) = 0.2 → same result.
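The odds version of the update can be checked the same way. The function names `prob_to_odds` and `odds_to_prob` below are illustrative assumptions, not part of the source.

```python
def prob_to_odds(p):
    """Convert a probability to odds in favor."""
    return p / (1.0 - p)


def odds_to_prob(odds):
    """Convert odds in favor back to a probability."""
    return odds / (1.0 + odds)


prior_odds = prob_to_odds(0.5)                 # carrier vs. non-carrier
likelihood_ratio = (0.5 * 0.5) / (1.0 * 1.0)   # two unaffected sons
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = odds_to_prob(posterior_odds)

print(posterior_odds, posterior_prob)  # about 0.25 and 0.20
```

Working in odds avoids computing the normalizing constant: only the ratio of the two likelihoods matters.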

Adding more data (sequential updating)

A nice feature of Bayesian inference is that you can keep updating.

  • After 2 unaffected sons, posterior was:
    • $Pr(θ=1 \mid y_1,y_2) = 0.20,\quad Pr(θ=0 \mid y_1,y_2) = 0.80$
  • Now suppose a third son is also unaffected. Given θ=1, an unaffected son has probability 0.5; given θ=0, probability 1.
  • New posterior:
    • $Pr(θ=1 \mid y_1,y_2,y_3) = \frac{0.5 \times 0.20}{0.5 \times 0.20 + 1 \times 0.80} = \frac{0.10}{0.90} \approx 0.111$
  • So the probability that she is a carrier drops to about 11.1%.
  • If instead the third son were affected, the likelihood under θ=0 would be 0 (ignoring mutation), so the posterior probability of θ=1 (carrier) would jump to 1.
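Sequential updating can be sketched as a loop in which each posterior becomes the next prior. The function `update_carrier` is a hypothetical helper written for this illustration; it assumes the same simplified model as the text (no mutation, sons independent given θ).

```python
def update_carrier(p_carrier, son_affected):
    """One Bayes update of Pr(carrier) after observing a single son."""
    # A carrier's son is affected or unaffected with probability 0.5 each.
    like_carrier = 0.5
    # A non-carrier's son is never affected in this simplified model.
    like_not_carrier = 0.0 if son_affected else 1.0
    numerator = like_carrier * p_carrier
    return numerator / (numerator + like_not_carrier * (1.0 - p_carrier))


p = 0.5            # prior from the family history
history = [p]
for son_affected in [False, False, False]:
    p = update_carrier(p, son_affected)   # posterior becomes the new prior
    history.append(p)

print(history)  # about 0.5, 0.333, 0.2, 0.111

# An affected third son after two unaffected ones pushes the posterior to 1.
print(update_carrier(0.2, True))
```

Updating one son at a time gives the same answer as processing all three sons in a single step, $\frac{0.5^3 \times 0.5}{0.5^3 \times 0.5 + 1 \times 0.5} = \frac{1}{9}$, because the sons are conditionally independent given θ.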

So this example shows:

  1. start with a prior from family information,
  2. update with the children’s outcomes,
  3. repeat as new children are born.

2. Spell-checking example: “radom”

Goal: given a typed word y = “radom”, what was the intended word θ?

Let θ be one of three discrete possibilities:

  • θ = “random”
  • θ = “radon”
  • θ = “radom” (actually typed correctly)

Bayes’ rule in proportional form:

$Pr(θ \mid y = \text{“radom”}) \propto p(θ)\,p(y=\text{“radom”} \mid θ)$

So we need:

  1. a prior for each possible intended word (how common that word is),
  2. a likelihood for each word (how likely it is to type “radom” when you meant that word).

Prior

Relative word frequencies from a corpus (supplied by Google researchers) were approximately:

  • random: $7.60 \times 10^{-5}$
  • radon: $6.05 \times 10^{-6}$
  • radom: $3.12 \times 10^{-7}$

These serve as $p(θ)$. We could renormalize them to sum to 1, but we don’t have to, because Bayes’ rule with “∝” will normalize at the end.

Likelihood

From a spelling/typing error model:

  • $p(\text{“radom”} \mid θ=\text{“random”}) = 0.00193$
  • $p(\text{“radom”} \mid θ=\text{“radon”}) = 0.000143$
  • $p(\text{“radom”} \mid θ=\text{“radom”}) = 0.975$

Interpretation:

  • If the true word is “radom,” people type it correctly 97.5% of the time.
  • If the true word is “random,” there’s a small chance (about 0.2%) to drop a letter and get “radom.”
  • If the true word is “radon,” the chance to mistype it as “radom” is even smaller.

Posterior

Multiply prior × likelihood for each candidate:

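| θ | prior $p(θ)$ | likelihood $p(y \mid θ)$ | product $p(θ)\,p(y \mid θ)$ | posterior $p(θ \mid y)$ |
|---|---|---|---|---|
| random | $7.60 \times 10^{-5}$ | 0.00193 | ≈ $1.47 \times 10^{-7}$ | 0.325 |
| radon | $6.05 \times 10^{-6}$ | 0.000143 | ≈ $8.65 \times 10^{-10}$ | 0.002 |
| radom | $3.12 \times 10^{-7}$ | 0.975 | ≈ $3.04 \times 10^{-7}$ | 0.673 |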

After normalizing, the largest posterior is for θ = “radom” (about 0.673), then “random” (about 0.325), and “radon” is negligible.

So, given this model, the typed word “radom” is about twice as likely to be correct as to be a typo for “random.”
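The posterior column can be reproduced with a short script. This is a sketch using the prior and likelihood values quoted in these notes; the variable names are illustrative assumptions. It also checks that rescaling the prior by a constant leaves the posterior unchanged, which is why the raw frequencies do not need to sum to 1.

```python
# Prior: relative word frequencies quoted in the notes.
prior = {"random": 7.60e-5, "radon": 6.05e-6, "radom": 3.12e-7}

# Likelihood: probability of typing "radom" given each intended word.
likelihood = {"random": 0.00193, "radon": 0.000143, "radom": 0.975}

unnormalized = {w: prior[w] * likelihood[w] for w in prior}
total = sum(unnormalized.values())
posterior = {w: unnormalized[w] / total for w in prior}

# Print candidates from most to least probable.
for w in sorted(posterior, key=posterior.get, reverse=True):
    print(f"{w:7s} product={unnormalized[w]:.2e} posterior={posterior[w]:.3f}")

# Rescaling the prior by any positive constant gives the same posterior.
scaled = {w: 1000.0 * prior[w] * likelihood[w] for w in prior}
scaled_total = sum(scaled.values())
posterior_scaled = {w: scaled[w] / scaled_total for w in prior}
```

The ratio `posterior["radom"] / posterior["random"]` comes out slightly above 2, matching the "about twice as likely" statement.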

But… model matters

The authors immediately point out: in their context (statistics writing), “random” is way more plausible than “radom,” so the prior from Google’s general corpus is not a good match. That means:

  • if we have extra contextual information (for example, the document is about statistics),
  • we should change the prior to make “random” more likely than “radom.”

Formally:

$p(θ \mid x, y) \propto p(θ \mid x)\,p(y \mid θ, x)$

where x = context (topic, domain, user, corpus). Often we still take $p(y \mid θ, x) \approx p(y \mid θ)$ to keep things simple.
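To get a feel for how much the context has to shift the prior, the sketch below multiplies the prior weight of “random” by a factor `boost`. The factor is a purely hypothetical stand-in for $p(θ \mid x)$; the source gives no context-specific prior, so the values 3 and 10 are illustrative only. The likelihood is left unchanged, following the simplification $p(y \mid θ, x) \approx p(y \mid θ)$.

```python
prior = {"random": 7.60e-5, "radon": 6.05e-6, "radom": 3.12e-7}
likelihood = {"random": 0.00193, "radon": 0.000143, "radom": 0.975}


def posterior_with_boost(boost):
    """Posterior after scaling the prior weight of 'random' by a hypothetical factor."""
    adjusted = dict(prior)
    adjusted["random"] *= boost           # assumed context effect, not from the source
    unnormalized = {w: adjusted[w] * likelihood[w] for w in adjusted}
    total = sum(unnormalized.values())
    return {w: value / total for w, value in unnormalized.items()}


# Factor at which "random" and "radom" have equal posterior probability.
break_even = (prior["radom"] * likelihood["radom"]) / (
    prior["random"] * likelihood["random"]
)
print(f"break-even boost: {break_even:.2f}")

for boost in (1.0, 3.0, 10.0):
    post = posterior_with_boost(boost)
    print(boost, {w: round(p, 3) for w, p in post.items()})
```

Under these numbers, the break-even factor is a little above 2, so even a modest contextual preference for “random” is enough to reverse the ranking.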

This shows a very important Bayesian point: if the posterior looks wrong, that’s a sign the model (prior or likelihood) didn’t include all the information you actually have. You don’t throw away Bayes’ rule—you improve the model.


3. What these two examples show

  1. Discrete θ is easy to update: just prior × likelihood for each possible value, then normalize.
  2. Sequential updating is natural: posterior → new prior → new data → new posterior.
  3. Context matters: in spell checking, a corpus prior may not match your actual writing; change the prior.
  4. Same Bayes’ rule, different problems: genetics (causal/biological) and spell checking (classification/NLP) both use exactly $p(θ \mid y) \propto p(θ)\,p(y \mid θ)$.

That’s the whole point of this section: Bayes’ theorem works cleanly and visibly when the unknown is discrete, and it becomes very clear how prior information and data-based likelihood combine to produce the final inference.