What it is (and why it’s useful)
- Precision = TP / (TP + FP): among items you predicted positive, how many are truly positive.
- Recall (a.k.a. TPR, sensitivity) = TP / (TP + FN): among truly positive items, how many you found.
- A Precision–Recall (PR) curve plots precision (y-axis) versus recall (x-axis) as you sweep a decision threshold over model scores.
- PR-AUC is the area under that PR curve; it summarizes the quality of the ranking your model produces, emphasizing performance on the positive class—especially valuable under class imbalance.
Interpreting PR-AUC
- Range and baseline: for a random ranking, expected PR-AUC ≈ positive class prevalence $\pi = \frac{P}{P+N}$.
- Example: if only 5% of examples are positive, a no-skill PR-AUC is ~0.05.
- Values below $\pi$ indicate worse-than-random ranking; values above $\pi$ indicate useful ranking.
- Higher is better; 1.0 is perfect (precision = 1 at all recalls).
- Not comparable across datasets with different prevalence. Always report $\pi$ alongside PR-AUC.
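As a sanity check on the baseline claim above, a short simulation (pure Python; `average_precision` is a hypothetical helper written here, not a library API) shows a random ranking scoring near the prevalence:

```python
# Sketch: AP of a no-skill (random) ranking concentrates near prevalence pi.
import random

def average_precision(labels_by_rank):
    """AP over a list of 0/1 labels ordered by descending score:
    mean of precision@k taken at each new true positive."""
    tp, ap, total_pos = 0, 0.0, sum(labels_by_rank)
    for k, y in enumerate(labels_by_rank, start=1):
        if y:
            tp += 1
            ap += tp / k
    return ap / total_pos

random.seed(0)
labels = [1] * 500 + [0] * 9500   # prevalence pi = 0.05
random.shuffle(labels)            # simulate a no-skill random ranking
print(average_precision(labels))  # close to 0.05, the no-skill baseline
```

The exact value fluctuates with the seed, but it stays near $\pi = 0.05$, illustrating why PR-AUC must always be read against prevalence.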
PR-AUC vs ROC-AUC (when to prefer which)
- ROC-AUC = area under TPR vs FPR; it treats positives and negatives symmetrically.
- In highly imbalanced settings, ROC-AUC can look optimistic even when the model yields many false positives: FPR = FP / (FP + TN) divides by the large negative count, so even a flood of false positives barely moves the ROC curve.
- PR-AUC focuses on the positive class: it penalizes false positives directly through precision. Prefer PR-AUC when positives are rare and costs are asymmetric.
How PR curves relate to ROC curves
Given prevalence $\pi$,
$\text{precision} \;=\; \frac{\pi\cdot \text{TPR}}{\pi\cdot \text{TPR} + (1-\pi)\cdot \text{FPR}}.$
Thus the same ROC point maps to different PR points when $\pi$ changes—another reason PR-AUC depends on class balance.
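The identity above is easy to verify numerically (the function name here is illustrative, not a library API): holding one ROC operating point fixed and varying prevalence changes the implied precision dramatically.

```python
# Sketch: precision implied by a single ROC point (TPR, FPR) at prevalence pi.
def precision_from_roc(tpr, fpr, pi):
    """precision = pi*TPR / (pi*TPR + (1-pi)*FPR)"""
    return (pi * tpr) / (pi * tpr + (1 - pi) * fpr)

# Same ROC point (TPR = 0.8, FPR = 0.1) at two prevalences:
print(precision_from_roc(0.8, 0.1, 0.50))  # balanced data  -> ~0.889
print(precision_from_roc(0.8, 0.1, 0.05))  # rare positives -> ~0.296
```

An operating point that looks excellent on balanced data (precision ≈ 0.89) delivers precision below 0.3 when only 5% of examples are positive.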
Computing PR-AUC (practical)
- Sort examples by predicted score (descending).
- Sweep a threshold over the ranked list and compute $(\text{recall}_k, \text{precision}_k)$ at each step $k$.
- Integrate precision over recall. Two common summaries:
- Area under PR curve using step-wise interpolation (common in libraries).
- Average Precision (AP): a ranking metric that takes a weighted mean of the precision at each threshold where a new true positive is encountered, with the increase in recall as the weight. AP and trapezoidal PR-AUC are closely related but not identical: AP is a step-wise (non-interpolated) sum, whereas trapezoidal integration linearly interpolates between sampled points and can be optimistic.
Tip: scikit-learn’s `average_precision_score` reports AP; `auc(recall, precision)` computes the trapezoidal area under your sampled PR points. Report which one you use.
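The sweep-and-summarize procedure above can be sketched in pure Python (this mirrors, but does not reproduce, library implementations; the `(recall=0, precision=1)` starting point for the trapezoid is a common convention, assumed here):

```python
# Sketch: PR points from a ranked list, summarized two ways (AP vs trapezoid).
def pr_points(scores, labels):
    """(recall, precision) after each example, ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    P = sum(labels)
    tp = fp = 0
    points = []
    for _, y in ranked:
        tp += y
        fp += 1 - y
        points.append((tp / P, tp / (tp + fp)))
    return points

def average_precision(points):
    """Step-wise sum: sum over points of (delta recall) * precision."""
    ap, prev_r = 0.0, 0.0
    for r, p in points:
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def trapezoid_auc(points):
    """Trapezoidal area, like auc(recall, precision) on sampled points."""
    area, prev_r, prev_p = 0.0, 0.0, 1.0  # conventional start (assumption)
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1,   0,   1,   1,   0,   0]
pts = pr_points(scores, labels)
print(round(average_precision(pts), 3))  # 0.806
print(round(trapezoid_auc(pts), 3))      # 0.764
```

Note the two summaries disagree even on this tiny example, which is why the reporting checklist below asks you to state which one you used.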
Properties and practical notes
- Ranking-invariance: Any monotonic transformation of scores (e.g., logits → probabilities) leaves PR-AUC unchanged; it depends on ordering, not calibration.
- Non-convex curves: PR curves can zig-zag; libraries often apply an envelope (monotone precision w.r.t. recall) before integrating.
- Edge cases:
  - No positives → PR curve undefined (precision is undefined); most libraries return `nan` and may warn.
  - Extremely few positives → high variance; use confidence intervals or repeated resampling.
- Micro vs macro averaging (multiclass/multilabel):
- Micro-averaged PR-AUC: pool all decisions across classes then compute one PR curve; dominated by common classes.
- Macro-averaged PR-AUC: compute per-class PR-AUC and average; treats classes equally. Report both if class supports differ.
- Sampling effects: Down/upsampling negatives/positives changes prevalence and thus PR-AUC. If you sample for training, compute PR-AUC on an evaluation set with natural class balance.
- Operational view: Choose thresholds by inspecting the PR curve where precision meets business constraints (e.g., “precision ≥ 0.9”) and read off the achievable recall (coverage).
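The operational step above can be sketched directly (pure Python; `operating_point` is an illustrative helper, not a library function): since recall is non-decreasing along the ranking, the last point meeting the precision floor has the highest achievable recall.

```python
# Sketch: find the operating point with max recall subject to a precision floor.
def operating_point(scores, labels, min_precision):
    """Return (threshold, precision, recall) with the highest recall
    among ranked cutoffs whose precision meets min_precision, else None."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    P = sum(labels)
    tp = fp = 0
    best = None
    for s, y in ranked:
        tp += y
        fp += 1 - y
        prec, rec = tp / (tp + fp), tp / P
        if prec >= min_precision:
            best = (s, prec, rec)  # recall only grows down the ranking
    return best

scores = [0.99, 0.95, 0.90, 0.70, 0.60, 0.40, 0.30]
labels = [1,    1,    1,    0,    1,    0,    0]
print(operating_point(scores, labels, min_precision=0.9))
# (0.9, 1.0, 0.75): threshold 0.9 keeps precision at 1.0 with 75% coverage
```

In production you would run this on a held-out set with natural class balance, per the sampling note above.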
Reporting checklist (quick)
- PR-AUC value and positive class prevalence $\pi$.
- Which summary you used (AP vs AUC under PR), and interpolation details.
- Confidence interval or variability estimate (e.g., bootstrap).
- (If multiclass) micro/macro choice.
- A precision@k or precision at target recall to connect to real decisions.
Tiny worked example (conceptual)
- Dataset: 1,000 samples, 50 positives (prevalence $\pi=0.05$).
- A model with PR-AUC = 0.42 is strong relative to the 0.05 no-skill baseline (roughly 8× better than random ranking).
- If operations require precision ≥ 0.9, the PR curve may show recall ≈ 0.25 at that precision → you’ll capture ~25% of all positives while keeping false positives low.
