This example demonstrates how probabilities can be empirically estimated and calibrated from data, rather than being treated as purely subjective or personal. The context is record linkage — the process of identifying and linking records from different databases that refer to the same individual.
This is a practical and data-driven example of probability estimation, where empirical modeling (specifically mixture modeling) is used to calibrate and improve the accuracy of match probabilities.
1. Record Linkage: Background and Motivation
Record linkage algorithms are widely used to merge or match records across large databases (e.g., matching census records with post-enumeration surveys).
In the U.S. Census Bureau’s application, record linkage was a crucial step in evaluating census coverage for various subpopulations.
- Goal: Automatically identify as many “true matches” as possible without producing too many false matches.
→ Reducing manual verification effort while keeping error rates acceptably low.
Each candidate pair of records receives a composite matching score (y) based on how similar their identifying fields are (name, birth date, address, etc.).
The decision rule:
- If the score y is above a certain threshold, the pair is declared a match.
- Otherwise, it is sent to follow-up (manual checking).
The false-match rate is defined as:
$\text{False-match rate} = \frac{\text{Number of false matches}}{\text{Total declared matches}}$
2. The Problem with Existing Methods
Many systems attempt to convert matching scores directly into probabilities of being a true match.
However, these naïve methods often give grossly overconfident (optimistic) estimates of false-match probabilities.
For instance:
- Records with nominal (claimed) false-match probabilities of 10⁻³–10⁻⁷ (supposedly almost certain matches) were found, in manual checks, to have actual false-match rates around 1%.
- Records with nominal 1% false-match probability actually had ~5% false matches.
Clearly, the old probability estimates were poorly calibrated.
3. Goal: Empirical Calibration of Match Probabilities
The aim is to recalibrate these scores using data-driven methods — so that a score y corresponds to an empirically accurate probability of being a true match.
This is analogous to the football point-spread example, where probabilities of winning were estimated empirically from historical data.
Here, we estimate Pr(match | y) — the probability that a record pair is a true match given its score y.
4. Mixture Modeling for Empirical Probability Estimation
The key idea is to treat the distribution of matching scores as a mixture of two overlapping populations:
- True matches: scores following distribution $p(y \mid \text{match})$
- Non-matches: scores following distribution $p(y \mid \text{non-match})$
Thus:
$p(y) = Pr(\text{match}) \, p(y \mid \text{match}) + Pr(\text{non-match}) \, p(y \mid \text{non-match})$
This is a two-component mixture model.
The unknown parameters — the two component distributions and the mixture proportion $Pr(\text{match})$ — are estimated from the observed data.
- When training data are available (with known match/non-match labels), the model can be fitted directly.
- When match status is unknown (the usual case), the mixture model can still infer these components empirically from the observed combined histogram of scores.
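As an illustrative sketch (not the Census Bureau's actual procedure), a two-component mixture can be fitted to unlabeled scores with a few lines of expectation-maximization (EM). Everything below — the simulated scores, the Gaussian component shapes, and the starting values — is an assumption made for the example:

```python
import numpy as np

def normal_pdf(y, m, s):
    return np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
# Simulated, unlabeled composite scores: 30% non-matches (low scores)
# mixed with 70% true matches (high scores).
scores = np.concatenate([rng.normal(2.0, 1.0, 300),
                         rng.normal(8.0, 1.0, 700)])

# EM for the mixture p(y) = Pr(match) p(y|match) + Pr(non-match) p(y|non-match)
pi, mu, sd = 0.5, [1.0, 9.0], [1.0, 1.0]  # initial guesses: [non-match, match]
for _ in range(100):
    # E-step: posterior responsibility of the match component for each score
    pm = pi * normal_pdf(scores, mu[1], sd[1])
    pn = (1 - pi) * normal_pdf(scores, mu[0], sd[0])
    r = pm / (pm + pn)
    # M-step: re-estimate Pr(match) and the component means/SDs
    pi = r.mean()
    mu = [np.average(scores, weights=1 - r), np.average(scores, weights=r)]
    sd = [np.sqrt(np.average((scores - mu[0]) ** 2, weights=1 - r)),
          np.sqrt(np.average((scores - mu[1]) ** 2, weights=r))]

# pi should recover the true mixing proportion (0.7 in this simulation)
print(round(pi, 2), [round(m, 1) for m in mu])
```

The point of the sketch is that the mixing proportion and both component distributions are recovered from the combined histogram alone, with no match/non-match labels.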
Once the model is fitted, it allows us to compute:
- The posterior probability that a given score y corresponds to a true match:
- $Pr(\text{match} \mid y) = \frac{Pr(\text{match}) \, p(y \mid \text{match})}{p(y)}$
- And equivalently, the false-match probability:
- $Pr(\text{non-match} \mid y) = \frac{Pr(\text{non-match}) \, p(y \mid \text{non-match})}{p(y)}$
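Given fitted components, the posterior follows directly from Bayes' rule. A minimal sketch, assuming hypothetical Gaussian components and Pr(match) = 0.7 (made-up values, not fitted from real linkage data):

```python
import math

# Hypothetical fitted mixture parameters (illustration only)
PI_MATCH = 0.7                       # Pr(match)
MU = {"non": 2.0, "match": 8.0}      # component means
SD = {"non": 1.0, "match": 1.0}      # component SDs

def normal_pdf(y, m, s):
    return math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def pr_match_given_y(y):
    """Posterior Pr(match | y) via Bayes' rule on the two-component mixture."""
    num = PI_MATCH * normal_pdf(y, MU["match"], SD["match"])
    den = num + (1 - PI_MATCH) * normal_pdf(y, MU["non"], SD["non"])
    return num / den

for y in (3.0, 5.0, 7.0):
    print(y, round(pr_match_given_y(y), 3))
```

The false-match probability is just the complement, `1 - pr_match_given_y(y)`. Notice that at the point where the two component densities are equal (y = 5 here), the posterior equals the prior Pr(match).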
5. Empirical Evidence: Separation Between Match and Non-Match Scores
In practice, the distributions of scores for true matches and non-matches are mostly separated, but they overlap slightly.

- Most candidate pairs come from the “match” distribution because of prior filtering (only likely pairs were considered).
- However, some overlap remains — which is exactly where probability calibration matters.
The goal of mixture modeling is to capture these two overlapping score distributions and use them to compute empirical false-match rates at any decision threshold.
6. Using the Model to Choose Decision Thresholds
Once the mixture model is estimated, we can compute for any threshold t:
$\text{False-match rate at threshold } t = Pr(\text{non-match} \mid y > t)$
Lowering the threshold → more pairs declared as matches → higher false-match rate.
Raising the threshold → fewer matches declared → lower false-match rate but more manual reviews.
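This trade-off can be sketched numerically. Assuming hypothetical Gaussian components (the same kind of made-up parameters as above, not an actual fitted model), the model-implied false-match rate at any threshold comes from the two components' tail probabilities:

```python
import math

# Hypothetical fitted mixture parameters (illustration only)
PI = 0.7                       # Pr(match)
MU_NON, MU_MATCH = 2.0, 8.0    # component means
SD = 1.0                       # common SD, for simplicity

def surv(t, mu, sd):
    """Normal survival function Pr(y > t)."""
    return 0.5 * math.erfc((t - mu) / (sd * math.sqrt(2)))

def false_match_rate(t):
    """Pr(non-match | y > t): expected error rate among declared matches."""
    non = (1 - PI) * surv(t, MU_NON, SD)
    mat = PI * surv(t, MU_MATCH, SD)
    return non / (non + mat)

for t in (3.0, 5.0, 7.0):
    print(t, round(false_match_rate(t), 6))
```

Sweeping the threshold downward makes the rate rise monotonically, tracing out exactly the calibration curve described next.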
The figure below shows the expected false-match rate (with 95% posterior bounds) as a function of the proportion of declared matches.
Each point (dot) represents the actual false-match proportion from validation data.
The model’s predictions track the observed rates very closely.

7. External Validation
The method was validated using data from three sites of the 1988 test Census, where the true match status of each pair was known.
Steps:
- Fit the mixture model using scores from all candidate pairs.
- Use the model to predict expected false-match rates across varying thresholds.
- Compare predicted vs actual false-match proportions.
Results:
- The model’s predicted false-match curve (with 95% posterior intervals) closely matched the actual observed points.
- As the decision threshold decreased, the false-match rate increased smoothly — consistent with intuition.
- The model correctly identified the region (~88–90% matches) beyond which errors rise rapidly.
The figure below zooms in on that region:
- Both predicted and observed false-match rates show a sharp upward bend at the same point, confirming that the model accurately captures the transition from “mostly correct matches” to “too many errors.”

8. Interpretation and Significance
- The calibration method produces well-calibrated, empirically grounded probabilities, not just arbitrary or subjective ones.
- It gives decision-makers an objective tool to balance:
- Reducing clerical workload (by declaring more matches automatically), and
- Maintaining accuracy (keeping false matches within acceptable limits).
- Compared to older heuristic methods (which multiplied field weights without calibration), this model:
- Provides accurate match probabilities,
- Gives reliable uncertainty bounds, and
- Performs well when validated against real data.
9. Key Takeaways
- Purpose: Estimate realistic probabilities for record linkage decisions using empirical calibration.
- Method: Use a mixture model combining $p(y \mid \text{match})$ and $p(y \mid \text{non-match})$.
- Data-driven: Model parameters and mixture proportions are estimated from data, not subjective beliefs.
- Output: Empirical calibration curve → False-match rate vs. decision threshold.
- Validation: Predictions align closely with observed false-match behavior.
- Practical insight:
- You can safely declare ~88–90% of records as matches automatically.
- Beyond that, the error rate increases sharply.
- Conclusion: Empirical probability calibration — based on mixture modeling — provides a reliable, objective, and generalizable framework for uncertainty quantification in record linkage and similar classification problems.
