DeLong’s Test

1) What it is

A non-parametric statistical test for comparing the AUCs (ROC-AUCs) of two correlated ROC curves, typically two models scored on the same cases.
Proposed by DeLong, DeLong & Clarke-Pearson (1988).
Useful when you want to know: “Is model A’s AUC significantly better than model B’s AUC?”

Especially important because two models are often evaluated on the same dataset → their AUCs are correlated.

2) Why we need it

An estimated AUC is a random variable: it changes with the test sample.
The raw difference AUC_A - AUC_B is therefore not enough; you need a confidence interval or p-value to judge whether it is statistically significant.
DeLong’s test provides:
1. Standard error (SE) of each AUC
2. SE of the difference between AUCs
3. A z-statistic and p-value
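
To make the sampling variability concrete, here is a small simulation sketch. It assumes (purely for illustration) that positive scores follow N(1, 1) and negative scores follow N(0, 1), so the population AUC is about 0.76; the function names `auc` and `simulate_aucs` are hypothetical helpers, not part of any library.

```python
import random
import statistics


def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def simulate_aucs(n_reps=200, n_pos=50, n_neg=50, seed=0):
    """Draw n_reps test sets from the same score distributions; return each AUC."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_reps):
        pos = [rng.gauss(1.0, 1.0) for _ in range(n_pos)]  # assumed positive scores
        neg = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]  # assumed negative scores
        aucs.append(auc(pos, neg))
    return aucs


aucs = simulate_aucs()
print(f"mean AUC = {statistics.mean(aucs):.3f}")
print(f"min AUC  = {min(aucs):.3f}, max AUC = {max(aucs):.3f}")
print(f"std dev  = {statistics.stdev(aucs):.3f}")
```

The same model, evaluated on different samples of 100 cases, produces noticeably different AUCs, which is why a difference between two estimated AUCs needs a standard error before it can be interpreted.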

3) How it works (conceptually)

The AUC equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative (ties count as 1/2), which makes it a U-statistic (the Mann-Whitney statistic).
U-statistic theory then gives the variance of each AUC and the covariance between the two AUCs.
DeLong’s method estimates these from the pairwise comparisons and uses them to form:

$z = \frac{AUC_1 - AUC_2}{SE(AUC_1 - AUC_2)}$

Under the null hypothesis of equal AUCs, $z$ is approximately standard normal, so the two-sided p-value is $2\,(1 - \Phi(|z|))$.
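
The computation can be sketched in plain Python. This is a minimal, unoptimized illustration rather than a reference implementation: it assumes binary labels coded 0/1, uses the identity $Var(AUC_1 - AUC_2) = Var(AUC_1) + Var(AUC_2) - 2\,Cov(AUC_1, AUC_2)$, and the names `delong_test`, `_components` and `_cov` are hypothetical.

```python
import math
from statistics import NormalDist


def _kernel(pos_score, neg_score):
    """1 if the positive outranks the negative, 0.5 on a tie, else 0."""
    if pos_score > neg_score:
        return 1.0
    return 0.5 if pos_score == neg_score else 0.0


def _components(y_true, scores):
    """Return (auc, v10, v01): the AUC and its per-case structural components."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # v10[i]: share of negatives that positive i outranks
    v10 = [sum(_kernel(p, q) for q in neg) / len(neg) for p in pos]
    # v01[j]: share of positives that outrank negative j
    v01 = [sum(_kernel(p, q) for p in pos) / len(pos) for q in neg]
    return sum(v10) / len(pos), v10, v01


def _cov(a, b):
    """Unbiased sample covariance of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def delong_test(y_true, scores_a, scores_b):
    """Two-sided DeLong-style z-test for two AUCs measured on the same cases."""
    auc_a, v10_a, v01_a = _components(y_true, scores_a)
    auc_b, v10_b, v01_b = _components(y_true, scores_b)
    m, n = len(v10_a), len(v01_a)  # number of positives, negatives
    var_a = _cov(v10_a, v10_a) / m + _cov(v01_a, v01_a) / n
    var_b = _cov(v10_b, v10_b) / m + _cov(v01_b, v01_b) / n
    cov_ab = _cov(v10_a, v10_b) / m + _cov(v01_a, v01_b) / n
    var_diff = var_a + var_b - 2.0 * cov_ab
    if var_diff <= 0.0:
        raise ValueError("zero variance: the two score sets rank cases identically")
    se_diff = math.sqrt(var_diff)
    z = (auc_a - auc_b) / se_diff
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return {"auc_a": auc_a, "auc_b": auc_b, "se_diff": se_diff,
            "z": z, "p_value": p_value}


# Toy data only, to show the mechanics; six cases are far too few in practice.
y_true = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
scores_b = [0.7, 0.4, 0.6, 0.5, 0.3, 0.8]
print(delong_test(y_true, scores_a, scores_b))
```

The pairwise loops cost O(m·n) per model, which is fine for small test sets; faster rank-based formulations exist for large ones.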

4) Example

Suppose you evaluate two classifiers on the same dataset:

Model A: AUC = 0.88
Model B: AUC = 0.84
DeLong’s test:
- Difference = 0.04
- SE of the difference = 0.015
- z = 0.04 / 0.015 ≈ 2.67
- two-sided p-value ≈ 0.0076

Interpretation: Model A’s AUC is significantly higher than Model B’s (p < 0.01).
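
The arithmetic of this example can be reproduced with the standard library. The AUCs and the SE are the illustrative values from the example, not results from real data; the 95% confidence interval is an addition that follows from the same normal approximation.

```python
from statistics import NormalDist

auc_a, auc_b = 0.88, 0.84
se_diff = 0.015  # SE of the difference, taken from the example above

diff = auc_a - auc_b
z = diff / se_diff
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval for the difference (normal approximation)
z_crit = NormalDist().inv_cdf(0.975)
ci_low = diff - z_crit * se_diff
ci_high = diff + z_crit * se_diff

print(f"z = {z:.2f}")
print(f"two-sided p-value = {p_two_sided:.4f}")
print(f"95% CI for the difference = [{ci_low:.4f}, {ci_high:.4f}]")
```

With the unrounded z the p-value comes out near 0.0077; rounding z to 2.67 first gives the 0.0076 quoted in the example. The interval excludes zero, which agrees with the significant result.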

5) When to use

Comparing two classifiers’ ROC-AUC on the same test set.
Comparing model versions (baseline vs improved).
Medical/clinical trials where AUC differences must be statistically validated.

6) Alternatives

Bootstrap CIs: resample data, recompute AUC difference distribution.
Hanley–McNeil test: earlier but less accurate approximation.
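Permutation test: build the null distribution by random shuffling, for example by swapping the two models’ scores within each case, and compare the observed AUC difference against it.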
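
As an illustration of the bootstrap alternative, the sketch below resamples cases with replacement and keeps both models' scores paired, so the correlation between the two AUCs is preserved. It returns a simple percentile interval; `bootstrap_auc_diff_ci` is a hypothetical helper, and the synthetic data at the end is made up for demonstration.

```python
import random


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_diff_ci(y_true, scores_a, scores_b,
                          n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for AUC_A - AUC_B from a paired bootstrap over cases."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUC is undefined unless both classes are present
        a = [scores_a[i] for i in idx]  # same indices for both models
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Synthetic demonstration data (assumed): model A carries a stronger signal.
rng = random.Random(42)
y_true = [i % 2 for i in range(60)]
scores_a = [y + rng.gauss(0, 1.0) for y in y_true]
scores_b = [0.5 * y + rng.gauss(0, 1.0) for y in y_true]
low, high = bootstrap_auc_diff_ci(y_true, scores_a, scores_b)
print(f"95% bootstrap CI for AUC_A - AUC_B: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the difference is significant at roughly the chosen level. The bootstrap is slower than DeLong's closed-form variance but makes fewer distributional assumptions.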

7) In Python (scikit-learn + statsmodels)

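from sklearn.metrics import roc_auc_score
# Neither scikit-learn nor statsmodels ships DeLong's test. `delong` is assumed
# to be a local module or third-party helper that exposes `delong_roc_test`.
from delong import delong_roc_test

# y_true = true labels, y_pred1 = model 1 scores, y_pred2 = model 2 scores
print("AUC 1:", roc_auc_score(y_true, y_pred1))
print("AUC 2:", roc_auc_score(y_true, y_pred2))

# Check the helper's documentation for what it returns: some circulating
# implementations return log10(p-value) rather than the p-value itself.
p_value = delong_roc_test(y_true, y_pred1, y_pred2)
print("p-value:", p_value)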

Summary

DeLong’s test = standard statistical test for comparing two ROC-AUCs.
Accounts for correlation (same dataset).
Returns p-value and CI for difference.
More reliable than comparing point estimates alone or using a test that treats the two AUCs as independent.

Your Gateway to Data Mastery
