Binary Classification Model Evaluation and Threshold Optimization

Receiver Operating Characteristics (ROC) Curve

Motivation: Why Consider All Thresholds?

Binary classification relies on a threshold applied to predicted Event probabilities $\hat p_{i1}$ to determine class membership. However, any single threshold represents only one operating point of the model.

If a model fits the data well, observations that are truly Events tend to receive higher predicted Event probabilities than Non-Events. Even then, whether a given observation is classified as an Event depends entirely on the chosen threshold.

The ROC curve evaluates model performance across all possible thresholds, rather than fixing one arbitrarily.

Constructing the ROC Curve

The ROC curve is built using the following procedure:

Create a set of distinct predicted Event probabilities $\hat p_{i1}$.
Use each distinct probability value as a threshold $t$.
For each threshold:
- Compute the True Positive Rate (TPR), also called Sensitivity:
  $TPR = \frac{TP}{TP + FN}$
- Compute the False Positive Rate (FPR):
  $FPR = 1 - \text{Specificity} = \frac{FP}{FP + TN}$
Plot:
- Sensitivity (TPR) on the vertical axis,
- False Positive Rate (FPR) on the horizontal axis.
As the threshold varies, Sensitivity and False Positive Rate move together:
- Lower thresholds increase both,
- Higher thresholds decrease both.

The ROC curve therefore shows the trade-off between detecting Events and avoiding false alarms.
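The construction above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `roc_points`, the toy data, and the convention that an observation is classified as an Event when its probability is greater than or equal to the threshold are all assumptions, not part of the original procedure.

```python
def roc_points(y_true, p_event):
    """Return (threshold, TPR, FPR) for each distinct predicted probability.

    Assumption: an observation is classified as an Event when p >= threshold.
    """
    n_event = sum(y_true)
    n_nonevent = len(y_true) - n_event
    points = []
    for t in sorted(set(p_event)):
        tp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 0)
        points.append((t, tp / n_event, fp / n_nonevent))
    return points


# Hypothetical toy data: 1 = Event, 0 = Non-Event.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
p_event = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

for t, tpr, fpr in roc_points(y_true, p_event):
    print(f"t={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Plotting the FPR values against the TPR values from this list would trace the ROC curve for the toy data.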

Kolmogorov–Smirnov (KS) Chart

Purpose of the KS Chart

The KS chart addresses a practical question:

Can the True Positive Rate be maximized while keeping the False Positive Rate under control?
If so, which threshold achieves this balance?

Constructing the KS Chart

The KS chart reuses statistics computed for the ROC curve:

Use the same set of thresholds derived from predicted Event probabilities.
Plot True Positive Rate versus threshold.
Plot False Positive Rate versus threshold.
Both rates decrease (or stay flat) as the threshold increases.
Identify the threshold where the vertical distance between the two curves is maximized.

This threshold is called the KS threshold, representing the point of maximum separation between Events and Non-Events.
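A possible way to locate the KS threshold is sketched below. The function name `ks_threshold`, the toy data, and the `p >= threshold` classification rule are illustrative assumptions. The vertical distance is taken as the absolute difference between TPR and FPR.

```python
def ks_threshold(y_true, p_event):
    """Return (threshold, KS statistic) where |TPR - FPR| is largest.

    Assumption: an observation is classified as an Event when p >= threshold.
    """
    n_event = sum(y_true)
    n_nonevent = len(y_true) - n_event
    best_t, best_ks = None, -1.0
    for t in sorted(set(p_event)):
        tp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 0)
        ks = abs(tp / n_event - fp / n_nonevent)
        if ks > best_ks:  # keeps the lowest threshold if several tie
            best_t, best_ks = t, ks
    return best_t, best_ks


# Hypothetical toy data: 1 = Event, 0 = Non-Event.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
p_event = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

t_ks, ks_value = ks_threshold(y_true, p_event)
print(t_ks, ks_value)
```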

Precision–Recall Curve

Motivation

In many applications, especially marketing and customer behavior modeling:

An Event represents a positive action (e.g., a purchase).
The goal is to:
- Capture as many true buyers as possible,
- Avoid misclassifying non-buyers as buyers.

These objectives are often in tension, which motivates the Precision–Recall framework.

Precision and Recall Definitions

From the confusion matrix:

Recall (Sensitivity):
$Recall = \frac{TP}{TP + FN}$
Measures how many true Events are captured.
Precision:
$Precision = \frac{TP}{TP + FP}$
Measures how reliable Event predictions are.

Recall emphasizes coverage, while Precision emphasizes correctness.

Constructing the Precision–Recall Curve

Extract all distinct predicted Event probabilities.
Use each value as a threshold.
For each threshold:
- Compute Precision,
- Compute Recall.
Plot:
- Precision on the vertical axis,
- Recall on the horizontal axis.
Typically:
- Precision decreases as Recall increases,
- Recall decreases as Precision increases.

This curve visualizes the trade-off between capturing Events and avoiding false positives.
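The same thresholding loop can produce the points of a Precision–Recall curve. The sketch below uses a hypothetical helper `precision_recall_points` and toy data, and again assumes an observation is classified as an Event when its probability is at least the threshold. Because every threshold is itself one of the predicted probabilities, at least one observation is always predicted as an Event, so the Precision denominator is never zero.

```python
def precision_recall_points(y_true, p_event):
    """Return (threshold, Precision, Recall) for each distinct probability.

    Assumption: an observation is classified as an Event when p >= threshold.
    """
    n_event = sum(y_true)
    points = []
    for t in sorted(set(p_event)):
        tp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1 for these thresholds
        recall = tp / n_event
        points.append((t, precision, recall))
    return points


# Hypothetical toy data: 1 = Event, 0 = Non-Event.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
p_event = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

for t, precision, recall in precision_recall_points(y_true, p_event):
    print(f"t={t:.1f}  Precision={precision:.2f}  Recall={recall:.2f}")
```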

F1 Score

Definition

The F1 Score summarizes Precision and Recall into a single metric using the harmonic mean:

$F1 = \frac{1}{\left(\frac{1}{Precision} + \frac{1}{Recall}\right)/2}$

The harmonic mean penalizes extreme imbalance between Precision and Recall.
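
Assuming Precision and Recall are both positive, the harmonic-mean definition above simplifies to a more familiar form:

$F1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$

As an illustration with made-up values, Precision = 0.9 and Recall = 0.1 give an arithmetic mean of 0.5 but an F1 Score of $\frac{2 \cdot 0.9 \cdot 0.1}{0.9 + 0.1} = 0.18$, which shows how the harmonic mean is pulled toward the smaller of the two values.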

Using F1 Score for Threshold Selection

The procedure is:

Compute Precision and Recall at each threshold.
Compute the F1 Score at each threshold.
Plot F1 Score versus threshold.
Identify:
- The maximum F1 Score,
- The threshold at which it occurs.

This threshold is called the F1 Score threshold.
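
The procedure above might be implemented roughly as follows. The name `f1_threshold`, the toy data, and the `p >= threshold` classification rule are assumptions for illustration. The F1 Score is set to 0 when no true Events are captured, which is a common convention rather than something stated in the text.

```python
def f1_threshold(y_true, p_event):
    """Return (threshold, F1) at the maximum F1 Score.

    Assumption: an observation is classified as an Event when p >= threshold.
    """
    n_event = sum(y_true)
    best_t, best_f1 = None, -1.0
    for t in sorted(set(p_event)):
        tp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 0)
        if tp == 0:
            f1 = 0.0  # convention: no captured Events means F1 = 0
        else:
            precision = tp / (tp + fp)
            recall = tp / n_event
            f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:  # keeps the lowest threshold if several tie
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical toy data: 1 = Event, 0 = Non-Event.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
p_event = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

t_f1, f1_value = f1_threshold(y_true, p_event)
print(t_f1, f1_value)
```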

Logistic Regression Example: Interpretation

ROC Interpretation

If a False Positive Rate of 10% is acceptable, the model achieves approximately 30% True Positive Rate.
Achieving 80% True Positive Rate requires tolerating at least 50% False Positive Rate.
The maximum KS difference is 0.3724, occurring at threshold 0.29709371.

F1 and KS Threshold Alignment

The highest F1 Score is 0.5816, occurring at threshold 0.29709371.
At this threshold:
- Precision = 0.5234
- Recall = 0.6545
The KS chart identifies the same threshold, reinforcing its selection.
The misclassification rate at this threshold is 0.3024.

Although this misclassification rate is slightly higher than the one obtained with the default, data-agnostic threshold of 0.5, the selected threshold is derived from the model's own predictions and tuned for Event detection, which justifies the trade-off.
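
As a quick sanity check, the reported F1 Score can be recomputed from the reported Precision and Recall. The snippet uses only the rounded figures quoted in this section, so agreement is expected only to about three decimal places.

```python
# Rounded values quoted in this section.
precision = 0.5234
recall = 0.6545
reported_f1 = 0.5816

# Harmonic mean of Precision and Recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recomputed F1 = {f1:.4f}, reported F1 = {reported_f1:.4f}")
print(f"absolute difference = {abs(f1 - reported_f1):.5f}")
```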

Lift Curve and Marketing Analytics

Business Context

In marketing applications:

A binary classification model predicts a customer’s likelihood of response.
Due to limited resources:
- Not all customers can be contacted.
Key questions include:
- Which customers should be contacted?
- What proportion should be targeted?
- What response rate can be expected?

Gain and Lift Strategy

If predictions are accurate:

Customers with higher predicted Event probabilities are more likely to respond.
Customers are sorted into groups based on decreasing predicted probabilities.
Ideally, early groups contain the most responsive customers.

Constructing Gain and Lift Tables

Step-by-Step Procedure

Sort predicted Event probabilities in descending order.
Divide observations into ten equal-count deciles:
- Decile 1: top 10% probabilities,
- Decile 10: bottom 10%.
For each decile, compute:
- Number of observations,
- Number of Events,
- Response rate,
- Gain,
- Lift.

Gain and Lift Metrics

For each decile:

Decile N: number of observations in the decile.
Decile %: percentage of all observations.
Gain N: number of Event observations in the decile.
Gain %: percentage of all Events captured.
Response %: Event rate within the decile.
Lift:
$Lift = \frac{\text{Response \%}}{\text{Overall Event Rate}}$
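
The table described above could be built along these lines. This is a sketch under assumptions: the function name `gain_lift_table`, the dictionary keys, and the 20-observation toy data are hypothetical, and tied probabilities are simply left in their sorted order.

```python
def gain_lift_table(y_true, p_event, n_groups=10):
    """Build one row per decile with the Gain and Lift metrics."""
    # Sort observations by decreasing predicted Event probability.
    order = sorted(range(len(p_event)), key=lambda i: p_event[i], reverse=True)
    y_sorted = [y_true[i] for i in order]
    n = len(y_sorted)
    total_events = sum(y_sorted)
    overall_rate = total_events / n

    rows = []
    for g in range(n_groups):
        # Integer boundaries keep the group sizes as equal as possible.
        group = y_sorted[g * n // n_groups:(g + 1) * n // n_groups]
        decile_n = len(group)
        gain_n = sum(group)
        response = gain_n / decile_n
        rows.append({
            "decile": g + 1,
            "decile_n": decile_n,
            "decile_pct": 100 * decile_n / n,
            "gain_n": gain_n,
            "gain_pct": 100 * gain_n / total_events,
            "response_pct": 100 * response,
            "lift": response / overall_rate,
        })
    return rows


# Hypothetical toy data: 20 customers, 1 = responded, 0 = did not respond.
p_event = [i / 20 for i in range(20, 0, -1)]
y_true = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

for row in gain_lift_table(y_true, p_event):
    print(row["decile"], row["gain_n"],
          round(row["response_pct"], 1), round(row["lift"], 2))
```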

Interpretation of Gain and Lift

From the sample table:

Contacting the top 10% of customers yields:
- Response rate = 52.47%,
- Lift = 1.97 (nearly double the baseline).
Contacting the next 10% yields:
- Response rate = 43.52%,
- Lift = 1.64.

This indicates that the model's ranking concentrates responders in the top deciles: the first two deciles respond at well above the overall Event rate.
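
Rearranging the Lift definition gives a quick consistency check on the figures quoted above. The overall Event rate is not stated in this section, so the value below is only inferred from the rounded numbers:

$\text{Overall Event Rate} = \frac{\text{Response \%}}{Lift} \approx \frac{52.47\%}{1.97} \approx 26.6\%$

The second decile gives a similar value, $\frac{43.52\%}{1.64} \approx 26.5\%$. The small gap is consistent with rounding in the reported Lift values.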

Cumulative Gain and Lift

Cumulative metrics show performance when contacting multiple top deciles together.

Key interpretations:

Contacting the top 20% yields:
- Response rate = 48.43%,
- Lift = 1.82.
Contacting the top 30% yields:
- Response rate = 43.72%,
- Lift = 1.64.

Cumulative tables directly support resource allocation decisions.
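
Cumulative metrics can be derived from per-decile counts with a running total, as in the sketch below. The function name `cumulative_gain_lift`, the dictionary keys, and the counts are hypothetical and do not correspond to the sample table discussed above.

```python
def cumulative_gain_lift(decile_counts):
    """Compute cumulative metrics from (decile_n, gain_n) pairs.

    The pairs must be ordered from Decile 1 (highest probabilities) down.
    """
    total_n = sum(n for n, _ in decile_counts)
    total_events = sum(e for _, e in decile_counts)
    overall_rate = total_events / total_n

    rows = []
    cum_n = 0
    cum_events = 0
    for n, e in decile_counts:
        cum_n += n
        cum_events += e
        cum_response = cum_events / cum_n
        rows.append({
            "cum_decile_pct": 100 * cum_n / total_n,
            "cum_gain_pct": 100 * cum_events / total_events,
            "cum_response_pct": 100 * cum_response,
            "cum_lift": cum_response / overall_rate,
        })
    return rows


# Hypothetical counts: 10 deciles of 2 customers each, 7 responders overall.
counts = [(2, 2), (2, 1), (2, 1), (2, 1), (2, 1),
          (2, 0), (2, 1), (2, 0), (2, 0), (2, 0)]

cum_rows = cumulative_gain_lift(counts)
for row in cum_rows:
    print(round(row["cum_decile_pct"]), round(row["cum_gain_pct"], 1),
          round(row["cum_response_pct"], 1), round(row["cum_lift"], 2))
```

By construction, the cumulative Lift in the last row equals 1, since contacting everyone yields exactly the overall Event rate.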

Final Conceptual Summary

Binary Classification Models – Part II extends evaluation beyond accuracy by introducing:

ROC Curve: evaluates sensitivity–specificity trade-offs across all thresholds.
KS Chart: identifies the threshold with maximum class separation.
Precision–Recall Curve: balances coverage and reliability.
F1 Score: optimizes Precision and Recall jointly.
Gain and Lift Analysis: translates model performance into actionable business strategy.

Together, these tools allow binary classification models to be evaluated, optimized, and deployed in decision-critical environments.
