1) What it is
- A non-parametric statistical test for comparing the ROC-AUCs of two models whose AUC estimates are correlated.
- Proposed by DeLong, DeLong & Clarke-Pearson (1988).
- Useful when you want to know: “Is model A’s AUC significantly better than model B’s AUC?”
- Especially important because two models are often evaluated on the same dataset, so their AUC estimates are correlated.
2) Why we need it
- An AUC computed on a finite sample is a random variable: it changes from sample to sample.
- You can’t just look at AUC_A – AUC_B; you need a confidence interval or p-value to decide if the difference is statistically significant.
- DeLong’s test provides:
  - Standard error (SE) of each AUC
  - SE of the difference between AUCs
  - A z-statistic and p-value
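To make the role of the correlation explicit, the SE of the difference can be written out with the standard variance-of-a-difference identity. The symbols $\mathrm{Var}$ and $\mathrm{Cov}$ below refer to the estimated variances and covariance of the two AUC estimates; the 1.96 multiplier assumes a 95% interval under the normal approximation.

$\mathrm{Var}(AUC_A - AUC_B) = \mathrm{Var}(AUC_A) + \mathrm{Var}(AUC_B) - 2\,\mathrm{Cov}(AUC_A, AUC_B)$

$SE(AUC_A - AUC_B) = \sqrt{\mathrm{Var}(AUC_A - AUC_B)}$

$95\%\ \text{CI}: \ (AUC_A - AUC_B) \pm 1.96 \cdot SE(AUC_A - AUC_B)$

When both models are scored on the same cases the covariance is typically positive, so treating the AUCs as independent (dropping the covariance term) would usually overstate the SE of the difference.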
3) How it works (conceptually)
- Uses U-statistics to estimate the covariance of the AUCs.
- AUC can be expressed as the probability that a random positive is ranked higher than a random negative.
- DeLong computes variances and covariances of these pairwise comparisons, then uses them to derive:
$z = \frac{AUC_1 - AUC_2}{SE(AUC_1 - AUC_2)}$
- From $z$, we compute a p-value under the standard normal distribution.
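The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `delong_test` is made up for this example, it uses the direct pairwise computation (memory grows with positives × negatives) rather than a faster rank-based variant, and it assumes binary labels with at least two positives and two negatives.

```python
import numpy as np
from scipy.stats import norm


def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong test for two AUCs measured on the same cases.

    Hypothetical helper for illustration. Returns (auc_a, auc_b, z, p_value).
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.vstack([scores_a, scores_b]).astype(float)  # shape (2, N)
    pos = scores[:, y_true]     # (2, m): scores of the positive cases
    neg = scores[:, ~y_true]    # (2, n): scores of the negative cases
    m, n = pos.shape[1], neg.shape[1]

    # psi[k, i, j] = 1 if positive i outranks negative j under model k,
    # 0.5 if they tie, 0 otherwise.
    diff = pos[:, :, None] - neg[:, None, :]
    psi = (diff > 0) + 0.5 * (diff == 0)

    aucs = psi.mean(axis=(1, 2))   # AUC of each model
    v10 = psi.mean(axis=2)         # (2, m): one component per positive case
    v01 = psi.mean(axis=1)         # (2, n): one component per negative case

    # 2x2 covariance matrix of the two AUC estimates.
    cov = np.cov(v10) / m + np.cov(v01) / n

    var_diff = cov[0, 0] + cov[1, 1] - 2 * cov[0, 1]
    # If both models give identical rankings, var_diff is 0 and z is undefined.
    z = (aucs[0] - aucs[1]) / np.sqrt(var_diff)
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value under N(0, 1)
    return aucs[0], aucs[1], z, p_value


# Usage on synthetic data: model A's scores are less noisy than model B's.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
a = y + rng.normal(0, 1.0, size=200)
b = y + rng.normal(0, 2.0, size=200)
print(delong_test(y, a, b))
```

The covariance term is what distinguishes this from comparing two independent AUCs: both `v10` and `v01` are computed per case for both models, so shared difficulty of individual cases shows up in `cov[0, 1]`.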
4) Example
Suppose you evaluate two classifiers on the same dataset:
- Model A: AUC = 0.88
- Model B: AUC = 0.84
- DeLong’s test:
  - Difference = 0.04
  - SE of the difference = 0.015
  - z = 0.04 / 0.015 ≈ 2.67
  - Two-sided p-value ≈ 0.0076
Interpretation: Model A’s AUC is significantly higher than Model B’s (p < 0.01).
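The arithmetic in this example can be checked with SciPy's normal distribution. The snippet below only reuses the difference and SE quoted above; the 95% confidence interval is an addition based on the usual normal approximation. Because it uses the unrounded z, the printed p-value may differ from the rounded figure above in the fourth decimal.

```python
from scipy.stats import norm

diff = 0.04   # AUC_A - AUC_B from the example
se = 0.015    # SE of the difference from the example

z = diff / se
p_value = 2 * norm.sf(abs(z))   # two-sided p-value under N(0, 1)

# Approximate 95% CI for the difference (normal approximation).
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"z = {z:.2f}, p = {p_value:.4f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

A confidence interval that excludes 0 tells the same story as the p-value while also showing how large the difference plausibly is.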
5) When to use
- Comparing two classifiers’ ROC-AUC on the same test set.
- Comparing model versions (baseline vs improved).
- Medical/clinical trials where AUC differences must be statistically validated.
6) Alternatives
- Bootstrap CIs: resample data, recompute AUC difference distribution.
- Hanley–McNeil test: earlier but less accurate approximation.
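- Permutation test: for paired models, randomly swap the two models’ scores within each case (after putting them on a common scale such as ranks) to build the null distribution of the AUC difference. Shuffling the labels instead tests whether a single AUC differs from 0.5.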
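As an illustration of the bootstrap alternative, here is a minimal sketch of a paired percentile bootstrap for the AUC difference. The function name `bootstrap_auc_diff_ci` and its defaults (`n_boot`, `alpha`, `seed`) are choices made for this example, not part of any library. The key point is that each resample draws the same cases for both models, which preserves their correlation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_diff_ci(y_true, scores_a, scores_b,
                          n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC_A - AUC_B, resampling cases jointly."""
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    if np.unique(y_true).size < 2:
        raise ValueError("y_true must contain both classes")

    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)     # same rows for both models
        if np.unique(y_true[idx]).size < 2:  # AUC is undefined with one class
            continue
        diffs.append(roc_auc_score(y_true[idx], scores_a[idx])
                     - roc_auc_score(y_true[idx], scores_b[idx]))

    low, high = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return low, high


# Usage on synthetic data: model A's scores are less noisy than model B's.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=300)
a = y + rng.normal(0, 1.0, size=300)
b = y + rng.normal(0, 2.0, size=300)
print(bootstrap_auc_diff_ci(y, a, b, n_boot=500))
```

If the resulting interval excludes 0, that points the same way as a significant DeLong test, without relying on the normal approximation.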
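7) In Python (scikit-learn + a separate DeLong implementation)
from sklearn.metrics import roc_auc_score
# NOTE: neither scikit-learn nor statsmodels ships DeLong's test. `delong` stands for a
# third-party or local module that exposes delong_roc_test; check what your implementation
# returns (some return log10 of the p-value rather than the p-value itself).
from delong import delong_roc_test
# y_true = true binary labels, y_pred1 = model 1 scores, y_pred2 = model 2 scores (define these first)
print("AUC 1:", roc_auc_score(y_true, y_pred1))
print("AUC 2:", roc_auc_score(y_true, y_pred2))
p_value = delong_roc_test(y_true, y_pred1, y_pred2)
print("p-value:", p_value)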
Summary
- DeLong’s test = standard statistical test for comparing two ROC-AUCs.
- Accounts for correlation (same dataset).
- Returns p-value and CI for difference.
- More reliable than a naive comparison of the two point estimates or a test that treats the two AUCs as independent.
