This section shows how Bayes’ rule works when the unknown is discrete (a small number of possible values) rather than a continuous parameter. With only a few states, each step from prior to likelihood to posterior can be computed and inspected directly.
Two examples are used:
(1) whether a woman is a carrier for hemophilia, and
(2) what word someone intended when they typed “radom”.
In both, we treat the unknown as a discrete variable θ.
1. Genetics example: “Is she a carrier?”
Setup
- Hemophilia is X-linked recessive.
- Men: XY → if they get the bad X, they are affected.
- Women: XX → if they have only one bad X, usually not affected (carrier).
- Consider a woman:
- She has an affected brother → so her mother must have been a carrier (one good X, one bad X).
- Her father is not affected → so he gave her a good X.
- So this woman has a 50% chance of having inherited the bad X from her mother.
- Define the unknown:
- θ = 1 → woman is a carrier
- θ = 0 → woman is not a carrier
- Prior: $Pr(θ=1) = Pr(θ=0) = \frac{1}{2}$ because we know she had a 50–50 chance from her mother.
Data and likelihood
- Now look at her sons.
- If the woman is a carrier (θ=1), each son has a 50% chance of being affected.
- If the woman is not a carrier (θ=0), each son is almost certainly unaffected (ignore rare mutation).
- Suppose she has two sons, both unaffected: $y_1=0, y_2=0$.
- Likelihoods:
- If θ=1 (carrier): $Pr(y_1=0, y_2=0 \mid θ=1) = 0.5 \times 0.5 = 0.25$
- If θ=0 (not carrier): $Pr(y_1=0, y_2=0 \mid θ=0) = 1 \times 1 = 1$
Posterior
Apply Bayes’ rule:
$Pr(θ=1 \mid y) = \frac{Pr(y \mid θ=1)Pr(θ=1)}{Pr(y \mid θ=1)Pr(θ=1) + Pr(y \mid θ=0)Pr(θ=0)} = \frac{0.25 \times 0.5}{0.25 \times 0.5 + 1 \times 0.5} = \frac{0.125}{0.625} = 0.20$
So, after seeing two unaffected sons, the chance she’s a carrier drops from 50% to 20%.
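The calculation above can be checked numerically. The following is a minimal sketch in Python; the function name `discrete_posterior` and the dictionary layout are illustrative choices, not something taken from the source.

```python
def discrete_posterior(prior, likelihood):
    """Normalize prior * likelihood over a finite set of states.

    prior, likelihood: dicts keyed by the possible values of theta.
    """
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    return {state: value / total for state, value in unnormalized.items()}


# theta = 1: carrier, theta = 0: not a carrier
prior = {1: 0.5, 0: 0.5}
# Probability of two unaffected sons under each state (mutation ignored)
likelihood = {1: 0.5 * 0.5, 0: 1.0 * 1.0}

posterior = discrete_posterior(prior, likelihood)
print(f"Pr(carrier | two unaffected sons) = {posterior[1]:.2f}")
```

The same helper works for any discrete unknown, since it only needs one prior weight and one likelihood value per candidate state.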
You can also see it with odds:
- prior odds = 0.5 / 0.5 = 1
- likelihood ratio = 0.25 / 1 = 0.25
- posterior odds = 1 × 0.25 = 0.25
- convert odds 0.25 → probability = 0.25 / (1+0.25) = 0.2 → same result.
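The odds version is short enough to sketch in a few lines of Python. The helper name `odds_to_probability` is hypothetical and used only for illustration.

```python
def odds_to_probability(odds):
    """Convert odds in favor of an event into a probability."""
    return odds / (1.0 + odds)


prior_odds = 0.5 / 0.5                        # carrier vs. not carrier
likelihood_ratio = (0.5 * 0.5) / (1.0 * 1.0)  # two unaffected sons
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = odds_to_probability(posterior_odds)

print(f"posterior odds = {posterior_odds:.2f}, probability = {posterior_prob:.2f}")
```

Working in odds avoids computing the normalizing denominator: with only two states, the posterior odds are just the prior odds times the likelihood ratio.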
Adding more data (sequential updating)
A nice feature of Bayesian inference is that you can keep updating.
- After 2 unaffected sons, posterior was:
- $Pr(θ=1 \mid y_1,y_2) = 0.20,\quad Pr(θ=0 \mid y_1,y_2) = 0.80$
- Now suppose a third son is also unaffected. Given θ=1, an unaffected son has probability 0.5; given θ=0, probability 1.
- New posterior:
- $Pr(θ=1 \mid y_1,y_2,y_3) = \frac{0.5 \times 0.20}{0.5 \times 0.20 + 1 \times 0.80} = \frac{0.10}{0.90} \approx 0.111$ So it drops to about 11.1%.
- If instead the third son were affected, the likelihood under θ=0 would be 0 (ignoring mutation), so the posterior probability that she is a carrier would jump to 1.
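Sequential updating can be sketched as a loop that feeds each posterior back in as the next prior. This is a minimal illustration: the function name `update_carrier_prob` is hypothetical, and the closed form $1/(1+2^n)$ used as a cross-check is derived here from the same likelihoods rather than stated in the source.

```python
def update_carrier_prob(p_carrier, son_affected):
    """One Bayes update of Pr(theta = 1) after observing a single son."""
    # Likelihood of the observation under each state (mutation ignored)
    like_carrier = 0.5  # a carrier's son is affected or unaffected with prob 0.5
    like_noncarrier = 0.0 if son_affected else 1.0
    numerator = like_carrier * p_carrier
    return numerator / (numerator + like_noncarrier * (1.0 - p_carrier))


p = 0.5  # prior from family history
trajectory = []
for n in range(1, 4):
    p = update_carrier_prob(p, son_affected=False)
    trajectory.append(p)
    # Sequential result next to the closed form 1 / (1 + 2**n)
    print(n, round(p, 3), round(1 / (1 + 2**n), 3))

# Alternative scenario: third son affected, starting from the
# posterior after two unaffected sons
p_third_affected = update_carrier_prob(trajectory[1], son_affected=True)
print(p_third_affected)
```

Updating one son at a time gives the same answer as processing all sons in one batch, because the sons' outcomes are treated as independent given θ.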
So this example shows:
(1) prior from family info → (2) update with children’s outcomes → (3) repeat as new children are born.
2. Spell-checking example: “radom”
Goal: given a typed word y = “radom”, what was the intended word θ?
Let θ be one of three discrete possibilities:
- θ = “random”
- θ = “radon”
- θ = “radom” (actually typed correctly)
Bayes’ rule in proportional form:
$Pr(θ \mid y = \text{“radom”}) \propto p(θ)\,p(y=\text{“radom”} \mid θ)$
So we need:
- a prior for each possible intended word (how common that word is),
- a likelihood for each word (how likely it is to type “radom” when you meant that word).
Prior
From a large text corpus (figures obtained from Google researchers), the relative frequencies of the three words were roughly:
- random: $7.60 \times 10^{-5}$
- radon: $6.05 \times 10^{-6}$
- radom: $3.12 \times 10^{-7}$
These serve as $p(θ)$. We could renormalize them to sum to 1, but we don’t have to, because Bayes’ rule with “∝” will normalize at the end.
Likelihood
From a spelling/typing error model:
- $p(\text{“radom”} \mid θ=\text{“random”}) = 0.00193$
- $p(\text{“radom”} \mid θ=\text{“radon”}) = 0.000143$
- $p(\text{“radom”} \mid θ=\text{“radom”}) = 0.975$
Interpretation:
- If the true word is “radom,” people type it correctly 97.5% of the time.
- If the true word is “random,” there’s a small chance (about 0.2%) to drop a letter and get “radom.”
- If the true word is “radon,” the chance to mistype it as “radom” is even smaller.
Posterior
Multiply prior × likelihood for each candidate:
| θ | prior $p(θ)$ | likelihood $p(y \mid θ)$ | product $p(θ)\,p(y \mid θ)$ | posterior $p(θ \mid y)$ |
|---|---|---|---|---|
| random | $7.60 \times 10^{-5}$ | 0.00193 | ≈ $1.47 \times 10^{-7}$ | 0.325 |
| radon | $6.05 \times 10^{-6}$ | 0.000143 | ≈ $8.65 \times 10^{-10}$ | 0.002 |
| radom | $3.12 \times 10^{-7}$ | 0.975 | ≈ $3.04 \times 10^{-7}$ | 0.673 |
After normalizing, the largest posterior is for θ = “radom” (about 0.673), then “random” (about 0.325), and “radon” is negligible.
So, given this model, the typed word “radom” is about twice as likely to be correct as to be a typo for “random.”
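The posterior values for the three candidate words can be reproduced with a short Python sketch. The prior and likelihood numbers are the ones quoted in this section; the variable names are illustrative.

```python
# Unnormalized prior: corpus frequency of each candidate word
prior = {"random": 7.60e-5, "radon": 6.05e-6, "radom": 3.12e-7}
# Likelihood: probability of typing "radom" given the intended word
likelihood = {"random": 0.00193, "radon": 0.000143, "radom": 0.975}

unnormalized = {word: prior[word] * likelihood[word] for word in prior}
total = sum(unnormalized.values())
posterior = {word: value / total for word, value in unnormalized.items()}

# Print candidates from most to least probable
for word in sorted(posterior, key=posterior.get, reverse=True):
    print(f"{word:>6}: {posterior[word]:.3f}")

# Ratio behind the "about twice as likely" statement
ratio = posterior["radom"] / posterior["random"]
print(f"radom vs. random: {ratio:.2f}")
```

Because the final step divides by the sum of the products, the prior never has to be rescaled to sum to 1 beforehand.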
But… model matters
The authors immediately point out: in their context (statistics writing), “random” is way more plausible than “radom,” so the prior from Google’s general corpus is not a good match. That means:
- if we have extra contextual info (document is about statistics),
- we should change the prior to make “random” more likely than “radom.”
Formally:
$p(θ \mid x, y) \propto p(θ \mid x)\,p(y \mid θ, x)$
where x = context (topic, domain, user, corpus). Often we still take $p(y \mid θ, x) \approx p(y \mid θ)$ to keep things simple.
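To get a feel for how much the context has to shift the prior, here is a hedged Python sketch. The `boost` multiplier is a purely hypothetical stand-in for $p(θ \mid x) / p(θ)$ applied to “random”; the value 10 is invented for illustration and does not come from the source. Only the break-even value is derived from the numbers quoted earlier.

```python
prior = {"random": 7.60e-5, "radon": 6.05e-6, "radom": 3.12e-7}
likelihood = {"random": 0.00193, "radon": 0.000143, "radom": 0.975}


def posterior_with_boost(boost):
    """Posterior when the prior weight of 'random' is multiplied by `boost`.

    `boost` is a hypothetical context adjustment, not an estimated quantity.
    The likelihood is left unchanged, i.e. p(y | theta, x) = p(y | theta).
    """
    adjusted = dict(prior)
    adjusted["random"] *= boost
    unnormalized = {w: adjusted[w] * likelihood[w] for w in adjusted}
    total = sum(unnormalized.values())
    return {w: value / total for w, value in unnormalized.items()}


# Smallest boost at which "random" ties with "radom"
break_even = (prior["radom"] * likelihood["radom"]) / (
    prior["random"] * likelihood["random"]
)
print(f"break-even boost: {break_even:.2f}")

# Illustrative only: a context in which "random" is 10x more common
boosted = posterior_with_boost(10.0)
print({w: round(p, 3) for w, p in boosted.items()})
```

Under these assumptions, any context that makes “random” a bit more than twice as common relative to “radom” as in the general corpus is enough to flip the most probable word.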
This shows a very important Bayesian point: if the posterior looks wrong, that’s a sign the model (prior or likelihood) didn’t include all the information you actually have. You don’t throw away Bayes’ rule—you improve the model.
3. What these two examples show
- Discrete θ is easy to update: just prior × likelihood for each possible value, then normalize.
- Sequential updating is natural: posterior → new prior → new data → new posterior.
- Context matters: in spell checking, a corpus prior may not match your actual writing; change the prior.
- Same Bayes’ rule, different problems: genetics (causal/biological) and spell checking (classification/NLP) both use exactly $p(θ \mid y) \propto p(θ)\,p(y \mid θ)$.
That’s the whole point of this section: Bayes’ theorem works cleanly and visibly when the unknown is discrete, and it becomes very clear how prior information and data-based likelihood combine to produce the final inference.
