1) What it is

  • A non-parametric statistical test for comparing the AUCs (areas under the ROC curve) of two correlated ROC curves, typically two models scored on the same cases.
  • Proposed by DeLong, DeLong & Clarke-Pearson (1988).
  • Useful when you want to know: “Is model A’s AUC significantly better than model B’s AUC?”

This matters because two models are usually evaluated on the same dataset, so their AUC estimates are correlated.


2) Why we need it

  • An estimated AUC is a random variable: it changes with the sample it is computed on.
  • Looking at AUC_A - AUC_B alone is not enough; you need a confidence interval or p-value to decide whether the difference is statistically significant.
  • DeLong’s test provides:
    1. Standard error (SE) of each AUC
    2. SE of the difference between AUCs
    3. A z-statistic and p-value
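
To see why the correlation matters, the SE of the difference can be written out using the standard variance-of-a-difference identity:

$SE(AUC_A - AUC_B) = \sqrt{Var(AUC_A) + Var(AUC_B) - 2\,Cov(AUC_A, AUC_B)}$

When both models are scored on the same cases, the covariance term is usually positive, so the SE of the difference is smaller than it would be if the two AUCs were treated as independent. Dropping that term generally makes the comparison too conservative.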

3) How it works (conceptually)

  • Uses U-statistics to estimate the covariance of the AUCs.
  • AUC can be expressed as the probability that a randomly chosen positive is scored higher than a randomly chosen negative (ties count as one half).
  • DeLong computes variances and covariances of these pairwise comparisons, then uses them to derive:

$z = \frac{AUC_1 - AUC_2}{SE(AUC_1 - AUC_2)}$

  • From $z$, we compute a two-sided p-value under the standard normal distribution, which $z$ follows approximately when the two true AUCs are equal.
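
The steps above can be sketched in plain Python. This is a minimal, unoptimized illustration (it loops over every positive/negative pair), not a reference implementation. The names `delong_test`, `_psi`, `_components`, and `_cov` are hypothetical, and labels are assumed to be coded 1 (positive) and 0 (negative).

```python
import math


def _psi(x, y):
    """Pairwise kernel: 1 if the positive outranks the negative, 0.5 on a tie, else 0."""
    if x > y:
        return 1.0
    if x == y:
        return 0.5
    return 0.0


def _components(pos, neg):
    """Return (AUC, V10, V01) for one model's positive and negative scores."""
    m, n = len(pos), len(neg)
    v10 = [sum(_psi(x, y) for y in neg) / n for x in pos]  # one value per positive
    v01 = [sum(_psi(x, y) for x in pos) / m for y in neg]  # one value per negative
    return sum(v10) / m, v10, v01


def _cov(a, b):
    """Sample covariance with an (n - 1) denominator."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


def delong_test(y_true, scores_1, scores_2):
    """Return (auc_1, auc_2, z, two-sided p) for two models scored on the same cases."""
    pos_idx = [i for i, y in enumerate(y_true) if y == 1]
    neg_idx = [i for i, y in enumerate(y_true) if y == 0]
    m, n = len(pos_idx), len(neg_idx)

    auc1, v10_1, v01_1 = _components([scores_1[i] for i in pos_idx],
                                     [scores_1[i] for i in neg_idx])
    auc2, v10_2, v01_2 = _components([scores_2[i] for i in pos_idx],
                                     [scores_2[i] for i in neg_idx])

    # Variance of each AUC and the covariance between them
    var1 = _cov(v10_1, v10_1) / m + _cov(v01_1, v01_1) / n
    var2 = _cov(v10_2, v10_2) / m + _cov(v01_2, v01_2) / n
    cov12 = _cov(v10_1, v10_2) / m + _cov(v01_1, v01_2) / n

    var_diff = var1 + var2 - 2.0 * cov12
    if var_diff <= 0.0:
        # Happens, for example, when both models rank every case identically
        raise ValueError("variance of the AUC difference is not positive")

    z = (auc1 - auc2) / math.sqrt(var_diff)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return auc1, auc2, z, p


# Toy data, made up purely for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores_1 = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
scores_2 = [0.6, 0.9, 0.4, 0.5, 0.7, 0.3, 0.2, 0.8]
auc1, auc2, z, p = delong_test(y_true, scores_1, scores_2)
print(f"AUC 1 = {auc1:.3f}, AUC 2 = {auc2:.3f}, z = {z:.3f}, p = {p:.3f}")
```

With only eight cases the normal approximation is rough; the toy data is there to make the arithmetic easy to follow, not to suggest a realistic sample size.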

4) Example

Suppose you evaluate two classifiers on the same dataset:

  • Model A: AUC = 0.88
  • Model B: AUC = 0.84
  • DeLong’s test:
    • Difference = 0.04
    • SE of the difference = 0.015
    • z = 0.04 / 0.015 ≈ 2.67
    • Two-sided p-value ≈ 0.0077

Interpretation: Model A’s AUC is significantly higher than Model B’s (p < 0.01).
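
The arithmetic in this example can be reproduced with the standard library alone. The helper name `z_test_from_difference` is hypothetical. The p-value is two-sided and uses the unrounded z, so its last digit may differ slightly from a value computed from z rounded to 2.67.

```python
import math


def z_test_from_difference(diff, se):
    """Return (z, two-sided p-value) for an estimate and its standard error."""
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Numbers taken from the example: AUC difference 0.04, SE of the difference 0.015
z, p = z_test_from_difference(0.88 - 0.84, 0.015)
print(f"z = {z:.2f}, p = {p:.4f}")
```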


5) When to use

  • Comparing two classifiers’ ROC-AUC on the same test set.
  • Comparing model versions (baseline vs improved).
  • Medical/clinical trials where AUC differences must be statistically validated.

6) Alternatives

  • Bootstrap CIs: resample cases with replacement (keeping both models' scores paired with each case), then recompute the AUC difference to build its distribution.
  • Hanley–McNeil test: earlier but less accurate approximation.
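  • Permutation test: randomly swap the two models' scores (or ranks) within each case to simulate the null hypothesis of equal AUCs. Shuffling the labels instead tests whether a single AUC differs from chance, which is a different question.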
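
As an illustration of the bootstrap alternative, here is a minimal sketch of a paired percentile bootstrap for the AUC difference. The function names `auc` and `bootstrap_auc_diff_ci` and the defaults (`n_boot=2000`, `alpha=0.05`, `seed=0`) are assumptions made for this example, and labels are assumed to be coded 1 and 0 with both classes present.

```python
import random


def auc(y_true, scores):
    """Mann-Whitney estimate of ROC-AUC; ties count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_diff_ci(y_true, scores_1, scores_2, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for AUC_1 - AUC_2 from a paired bootstrap over cases."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        # Resample case indices so each case keeps its label and both scores
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUC is undefined without both classes; draw again
        s1 = [scores_1[i] for i in idx]
        s2 = [scores_2[i] for i in idx]
        diffs.append(auc(y, s1) - auc(y, s2))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data, made up purely for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores_1 = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
scores_2 = [0.6, 0.9, 0.4, 0.5, 0.7, 0.3, 0.2, 0.8]
lower, upper = bootstrap_auc_diff_ci(y_true, scores_1, scores_2)
print(f"95% CI for AUC_1 - AUC_2: [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes 0, the difference would be called significant at the chosen level. Resampling whole cases, rather than each model's scores separately, is what preserves the correlation between the two AUCs.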

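7) In Python

Neither scikit-learn nor statsmodels ships DeLong's test. The `delong` module below is assumed to be a third-party or local helper, so check what it returns before interpreting the result (some circulating implementations return log10 of the p-value rather than the p-value itself).

from sklearn.metrics import roc_auc_score
from delong import delong_roc_test  # assumed third-party/local helper, not part of scikit-learn or statsmodels

# y_true = true labels, y_pred1 = model 1 scores, y_pred2 = model 2 scores
print("AUC 1:", roc_auc_score(y_true, y_pred1))
print("AUC 2:", roc_auc_score(y_true, y_pred2))

# Assumes the helper returns a plain p-value; verify this in its documentation
p_value = delong_roc_test(y_true, y_pred1, y_pred2)
print("p-value:", p_value)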

Summary

  • DeLong’s test = standard statistical test for comparing two ROC-AUCs.
  • Accounts for correlation (same dataset).
  • Yields a z-statistic and p-value; a CI for the difference follows from the same SE.
  • More reliable than a naive comparison of the two AUCs or a test that treats them as independent.