The ROC curve visualizes the trade‑off between a binary classifier’s True Positive Rate (TPR) and False Positive Rate (FPR) as you vary the decision threshold on its scores/probabilities.
- TPR (Sensitivity/Recall) = TP / (TP + FN)
- FPR (Fall‑out) = FP / (FP + TN)
Each threshold $\tau$ (predict positive if score ≥ $\tau$; “≥” is inclusive) produces one ROC point: $(\text{FPR}(\tau), \text{TPR}(\tau))$. Sweeping $\tau$ from +∞ → −∞ traces the curve from (0,0) to (1,1).
How the curve is constructed (and why it’s stable)
- Sort by score (highest → lowest), then move the threshold down through that ranked list.
- Each time you “include” an additional example as positive, update TP/FP and recompute TPR/FPR.
- The result is a step curve: including an additional positive moves the point up (TP increases), including an additional negative moves it right (FP increases).
- Monotone invariance: Any strictly increasing transform of the scores (e.g., logits → probabilities via the sigmoid) leaves the ROC curve unchanged, because only the ranking matters.
Ties are typically handled by stepping through all tied items; exact AUC treats ties as half‑wins (see below).
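The sweep above can be sketched in plain Python. The scores and labels here are a toy illustration (not from the text); tied scores are consumed as one group, producing a single diagonal step:

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR) from highest to lowest threshold."""
    P = sum(labels)          # total positives
    N = len(labels) - P      # total negatives
    # Sort indices by score, highest first.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Include every item tied at the current score before emitting a point.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            tp += labels[order[j]]
            fp += 1 - labels[order[j]]
            j += 1
        points.append((fp / N, tp / P))
        i = j
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
print(roc_points(scores, labels))
```

Each appended point is one threshold's $(\text{FPR}, \text{TPR})$; connecting them gives the step curve from (0,0) to (1,1).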
AUC (Area Under the ROC Curve)
- Definition:
- $\text{ROC-AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})$, i.e., the area under the ROC curve (a single scalar in $[0,1]$).
- Ranking interpretation (very useful):
- $\text{AUC} = \Pr\big(s^+ > s^-\big) + \tfrac{1}{2}\Pr\big(s^+ = s^-\big),$
- the probability that a random positive gets a higher score than a random negative (ties count half).
- Benchmarks:
- Random guessing ≈ 0.5 (diagonal); perfect separation = 1.0; < 0.5 suggests label inversion or a perverse scorer.
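The ranking interpretation can be computed directly by comparing every (positive, negative) pair, counting ties as half-wins. The toy data matches the construction sketch above; this brute-force version is $O(|P|\cdot|N|)$ and meant for illustration only:

```python
def pairwise_auc(scores, labels):
    """AUC = Pr(s+ > s-) + 0.5 * Pr(s+ = s-), estimated over all pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
print(pairwise_auc(scores, labels))  # → 0.8125
```

The same value falls out of trapezoidal integration of the step curve, which is how libraries typically compute it.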
ROC vs ROC‑AUC: “ROC” is the curve; “ROC‑AUC” is its summary number. Don’t conflate them.
Interpreting the curve
- Top‑left is better. You want high TPR at low FPR.
- Diagonal line (TPR=FPR) is no‑skill; points below it indicate worse‑than‑random (you could flip the prediction).
- ROC Convex Hull (ROCCH): The upper‑left convex envelope represents the best achievable operating points (including mixtures of models).
Picking a threshold (operating point)
There is no universal “best” threshold—it depends on costs and prevalence.
- Youden’s $J$ (cost‑agnostic): maximize $J = \text{TPR} - \text{FPR}$. This finds the point farthest above the diagonal.
- Cost‑sensitive choice: Suppose prevalence is $\pi=P(\text{positive})$, false‑positive cost $c_{FP}$, false‑negative cost $c_{FN}$.
- Expected cost at a point is proportional to $(1-\pi)c_{FP}\cdot \text{FPR} + \pi c_{FN}\cdot (1-\text{TPR}).$
- Iso‑cost lines in ROC space have slope $\frac{(1-\pi)c_{FP}}{\pi c_{FN}}$. Choose the ROC point tangent to the lowest such line.
After deployment, if class prevalence shifts, your ROC curve remains the same (ranking unchanged) but the optimal threshold can change—re‑tune it.
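Both selection rules can be sketched over a list of ROC points. The ROC points, prevalence, and costs below are illustrative assumptions, not values from the text; note how a high false-negative cost pushes the chosen point toward higher TPR than Youden's $J$ does:

```python
def pick_thresholds(points, pi=0.05, c_fp=1.0, c_fn=50.0):
    """Return (Youden point, min-expected-cost point) from ROC points."""
    # Youden's J: maximize TPR - FPR (farthest above the diagonal).
    j_point = max(points, key=lambda p: p[1] - p[0])
    # Cost-sensitive: minimize (1-pi)*c_fp*FPR + pi*c_fn*(1-TPR).
    cost = lambda p: (1 - pi) * c_fp * p[0] + pi * c_fn * (1 - p[1])
    c_point = min(points, key=cost)
    return j_point, c_point

points = [(0.0, 0.0), (0.05, 0.40), (0.10, 0.45), (0.25, 0.80),
          (0.50, 0.90), (1.0, 1.0)]
j_pt, c_pt = pick_thresholds(points)
print(j_pt, c_pt)  # → (0.25, 0.8) (0.5, 0.9)
```

With `c_fn=50` and 5% prevalence, missing a positive is expensive enough that the cost-optimal point accepts a much higher FPR than the Youden point.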
Strengths and caveats
Strengths
- Ranking‑based & threshold‑free: Great for comparing scorers independent of any single cutoff.
- Invariant to class priors: Shape doesn’t change if the positive rate changes (again, threshold may).
Caveats
- Imbalanced data: ROC can look optimistic when positives are rare because FPR averages over a huge number of TNs. In such cases, PR curves (Precision–Recall) are often more informative about performance on the positive class.
- Calibration blindness: ROC says nothing about probability calibration (whether 0.9 really means ~90%). Use reliability plots/Brier score for that.
- Data leakage / overfitting: Always compute ROC on held‑out data or via proper cross‑validation.
Relationship to other metrics
- PR vs ROC:
- ROC answers: “How well do you rank positives above negatives across thresholds?”
- PR answers: “When I call items positive, how many are actually positive (precision) for a given recall?”
- With rare positives, PR is usually the more decision‑relevant view.
- Accuracy vs ROC: Accuracy depends on a single threshold and class balance; ROC summarizes all thresholds and ignores prevalence.
Multiclass / multilabel ROC
- One‑vs‑Rest (OvR): Compute a ROC curve/AUC for each class vs all others; then macro‑average (unweighted mean) or weighted‑macro (by class frequency).
- Micro‑average: Pool the binarized (score, label) pairs from all classes into one big binary problem first, then compute one ROC/AUC.
- Micro‑AUC emphasizes performance on common classes; macro‑AUC treats classes equally.
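A minimal OvR sketch, assuming a toy class-probability matrix and labels (illustrative, not from the text); the pairwise AUC helper re-implements the ranking definition from earlier:

```python
def pairwise_auc(scores, labels):
    """AUC via all (positive, negative) pairs; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Class-probability matrix: rows = samples, columns = classes 0..2.
proba = [[0.7, 0.2, 0.1],
         [0.1, 0.6, 0.3],
         [0.2, 0.5, 0.3],
         [0.5, 0.4, 0.1],
         [0.1, 0.2, 0.7]]
y = [0, 1, 2, 1, 2]

aucs, counts = [], []
for k in range(3):
    ovr_labels = [1 if yi == k else 0 for yi in y]   # class k vs rest
    ovr_scores = [row[k] for row in proba]
    aucs.append(pairwise_auc(ovr_scores, ovr_labels))
    counts.append(sum(ovr_labels))

macro = sum(aucs) / len(aucs)                               # unweighted mean
weighted = sum(a * c for a, c in zip(aucs, counts)) / sum(counts)
print(aucs, macro, weighted)
```

In practice you would reach for a library implementation (e.g., scikit-learn's `roc_auc_score` with `multi_class="ovr"` and an `average` argument) rather than hand-rolling this.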
Practical tips
- Report both the ROC curve (or AUC) and a chosen operating point with TPR/FPR (or sensitivity/specificity) tied to your use‑case.
- For imbalanced problems, add PR curves and baseline prevalence.
- State how you computed AUC (e.g., trapezoidal integration) and how you handled ties.
- If you expose probabilities, also report calibration (e.g., Brier, reliability curves).
Mini worked example (one point)
Assume prevalence 10% (100 positives out of 1,000), and at a chosen threshold you measure on a test set:
- TP=45, FN=55, FP=90, TN=810.
Then:
- TPR = 45 / (45+55) = 0.45 (recall)
- FPR = 90 / (90+810) = 0.10
This (0.10, 0.45) is one point on your ROC curve; moving the threshold produces the rest.
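The same arithmetic, checked in code (counts taken from the example above):

```python
# Confusion-matrix counts from the worked example.
tp, fn, fp, tn = 45, 55, 90, 810

tpr = tp / (tp + fn)                        # sensitivity / recall
fpr = fp / (fp + tn)                        # fall-out
prevalence = (tp + fn) / (tp + fn + fp + tn)

print(tpr, fpr, prevalence)  # → 0.45 0.1 0.1
```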
Bottom line:
The ROC curve is a robust, threshold‑agnostic view of ranking quality. Use ROC‑AUC for overall separability, pick thresholds with costs and prevalence in mind, and complement with PR and calibration when positives are rare or well‑calibrated probabilities matter.
