What it is (and why it’s useful)

  • Precision = TP / (TP + FP): among items you predicted positive, how many are truly positive.
  • Recall (a.k.a. TPR, sensitivity) = TP / (TP + FN): among truly positive items, how many you found.
  • A Precision–Recall (PR) curve plots precision (y-axis) versus recall (x-axis) as you sweep a decision threshold over model scores.
  • PR-AUC is the area under that PR curve; it summarizes the quality of the ranking your model produces, emphasizing performance on the positive class—especially valuable under class imbalance.
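The definitions above can be checked in a few lines. A minimal sketch with made-up labels and thresholded predictions (not data from this article):

```python
# Toy data (hypothetical): ground-truth labels and already-thresholded predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # among predicted positives, fraction truly positive
recall = tp / (tp + fn)     # among true positives, fraction found
print(precision, recall)    # 0.75 0.75
```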

Interpreting PR-AUC

  • Range and baseline: for a random ranking, expected PR-AUC ≈ positive class prevalence $\pi = \frac{P}{P+N}$.
    • Example: if only 5% of examples are positive, the no-skill PR-AUC is ~0.05.
    • Values below $\pi$ indicate worse-than-random ranking; values above $\pi$ indicate useful ranking.
  • Higher is better; 1.0 is perfect (precision = 1 at all recalls).
  • Not comparable across datasets with different prevalence. Always report $\pi$ alongside PR-AUC.
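The prevalence baseline is easy to verify empirically. A sketch with synthetic data (sample size, seed, and 5% prevalence are arbitrary choices): random, uninformative scores land near $\pi$, not near 0.5.

```python
# Sanity check: a random ranking scores near the prevalence baseline, not 0.5.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n, pi = 20000, 0.05                        # toy sample size and prevalence
y_true = (rng.random(n) < pi).astype(int)  # ~5% positives
scores = rng.random(n)                     # uninformative scores

ap = average_precision_score(y_true, scores)
print(f"AP ≈ {ap:.3f}, prevalence ≈ {y_true.mean():.3f}")  # both near 0.05
```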

PR-AUC vs ROC-AUC (when to prefer which)

  • ROC-AUC = area under TPR vs FPR; it treats positives and negatives symmetrically.
  • In highly imbalanced settings, ROC-AUC can look optimistic even when the model yields many false positives.
  • PR-AUC focuses on the positive class: it penalizes false positives directly through precision. Prefer PR-AUC when positives are rare and costs are asymmetric.
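The divergence between the two metrics can be reproduced with synthetic scores (the class counts and Gaussian score model below are illustrative assumptions): at 1% prevalence, a model with visible score separation gets a respectable ROC-AUC while its PR-AUC stays close to the baseline.

```python
# Sketch: under heavy imbalance, ROC-AUC can look healthy while PR-AUC stays low.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)
n_pos, n_neg = 200, 20000                              # ~1% prevalence (toy numbers)
scores = np.concatenate([rng.normal(1.0, 1.0, n_pos),  # positives score higher on average
                         rng.normal(0.0, 1.0, n_neg)])
y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

roc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)
print(f"ROC-AUC = {roc:.2f}, PR-AUC (AP) = {ap:.2f}")  # ROC-AUC far above AP
```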

How PR curves relate to ROC curves

Given prevalence $\pi$,

$\text{precision} \;=\; \frac{\pi\cdot \text{TPR}}{\pi\cdot \text{TPR} + (1-\pi)\cdot \text{FPR}}.$

Thus the same ROC point maps to different PR points when $\pi$ changes—another reason PR-AUC depends on class balance.
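The mapping is direct to evaluate. A sketch with a hypothetical operating point (TPR = 0.8, FPR = 0.1): the same ROC point yields very different precision at balanced vs. rare-positive prevalence.

```python
# Same ROC operating point, different prevalence -> very different precision.
def precision_from_roc(tpr: float, fpr: float, pi: float) -> float:
    """Precision implied by a ROC point (TPR, FPR) at prevalence pi."""
    return pi * tpr / (pi * tpr + (1 - pi) * fpr)

point = (0.8, 0.1)                             # hypothetical operating point
balanced = precision_from_roc(*point, pi=0.5)  # ≈ 0.889
rare = precision_from_roc(*point, pi=0.01)     # ≈ 0.075
print(balanced, rare)
```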

Computing PR-AUC (practical)

  1. Sort examples by predicted score (descending).
  2. Sweep a threshold over the ranked list and compute $(\text{recall}_k, \text{precision}_k)$ at each step $k$.
  3. Integrate precision over recall. Two common summaries:
    • Area under PR curve using step-wise interpolation (common in libraries).
    • Average Precision (AP): a weighted mean of the precision at each threshold, where the weight is the increase in recall from the previous threshold: $\text{AP} = \sum_k (R_k - R_{k-1})\, P_k$. AP and trapezoidal PR-AUC are closely related but not identical; interpolated AP variants (e.g., PASCAL VOC–style) instead integrate a monotone precision “envelope.”

Tip: scikit-learn’s average_precision_score reports (non-interpolated) AP; sklearn.metrics.auc(recall, precision) computes the trapezoidal area under your sampled PR points. Report which one you use.
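The two summaries from the tip above, side by side on toy scores (labels and scores are made up for illustration); the two numbers are close but not identical:

```python
# Comparing AP with trapezoidal area under the sampled PR points.
from sklearn.metrics import precision_recall_curve, average_precision_score, auc

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)  # step-wise sum (no interpolation)
trap = auc(recall, precision)                 # trapezoidal area over sampled points

print(f"AP = {ap:.3f}, trapezoidal = {trap:.3f}")  # close, but not identical
```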

Properties and practical notes

  • Ranking-invariance: Any monotonic transformation of scores (e.g., logits → probabilities) leaves PR-AUC unchanged; it depends on ordering, not calibration.
  • Non-convex curves: PR curves can zig-zag as false positives land between true positives in the ranking; some protocols (e.g., interpolated AP) take the monotone precision envelope before integrating, while others (e.g., scikit-learn’s AP) integrate the raw steps.
  • Edge cases:
    • No positives → PR curve undefined (precision is undefined); most libraries return nan and may warn.
    • Extremely few positives → high variance; use confidence intervals or repeated resampling.
  • Micro vs macro averaging (multiclass/multilabel):
    • Micro-averaged PR-AUC: pool all decisions across classes then compute one PR curve; dominated by common classes.
    • Macro-averaged PR-AUC: compute per-class PR-AUC and average; treats classes equally. Report both if class supports differ.
  • Sampling effects: Down/upsampling negatives/positives changes prevalence and thus PR-AUC. If you sample for training, compute PR-AUC on an evaluation set with natural class balance.
  • Operational view: Choose thresholds by inspecting the PR curve where precision meets business constraints (e.g., “precision ≥ 0.9”) and read off the achievable recall (coverage).
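The ranking-invariance property above is straightforward to demonstrate. A sketch with arbitrary logits: applying a sigmoid (strictly monotone) changes every score but not the ordering, so AP is unchanged.

```python
# Ranking invariance: a monotone transform (sigmoid) leaves AP unchanged.
import math
from sklearn.metrics import average_precision_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1]
logits = [-2.0, 1.5, -0.5, 0.7, 2.2, -1.0, 0.1, 0.9]
probs = [1 / (1 + math.exp(-z)) for z in logits]  # same ordering, different scale

ap_logits = average_precision_score(y_true, logits)
ap_probs = average_precision_score(y_true, probs)
print(ap_logits, ap_probs)  # identical: only the ordering matters
```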

Reporting checklist (quick)

  • PR-AUC value and positive class prevalence $\pi$.
  • Which summary you used (AP vs AUC under PR), and interpolation details.
  • Confidence interval or variability estimate (e.g., bootstrap).
  • (If multiclass) micro/macro choice.
  • A precision@k or precision at target recall to connect to real decisions.

Tiny worked example (conceptual)

  • Dataset: 1,000 samples, 50 positives (prevalence $\pi=0.05$).
  • A model with PR-AUC = 0.42 is strong relative to the 0.05 baseline.
  • If operations require precision ≥ 0.9, the PR curve may show recall ≈ 0.25 at that precision → you’ll capture ~25% of all positives while keeping false positives low.
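The threshold-selection step in this example can be sketched in code. The dataset below is synthetic (the 2.0-vs-0.0 Gaussian score model is a hypothetical stand-in for a real model), so the achievable recall will differ from the ≈ 0.25 quoted above; the point is the mechanics of reading recall off the curve at a precision constraint.

```python
# Operational sketch: find the best recall achievable at precision >= 0.9.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)
n, pi = 1000, 0.05                                 # mirrors the example: 50/1000 positive
y_true = (np.arange(n) < int(n * pi)).astype(int)
scores = np.where(y_true == 1,
                  rng.normal(2.0, 1.0, n),         # hypothetical model: positives higher
                  rng.normal(0.0, 1.0, n))

precision, recall, thresholds = precision_recall_curve(y_true, scores)
ok = precision[:-1] >= 0.9                         # drop the final (recall=0, precision=1) point
best_recall = recall[:-1][ok].max() if ok.any() else 0.0
print(f"max recall at precision >= 0.9: {best_recall:.2f}")
```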