Receiver Operating Characteristic (ROC) Curve
Motivation: Why Consider All Thresholds?
Binary classification relies on a threshold applied to predicted Event probabilities $\hat p_{i1}$ to determine class membership. However, any single threshold represents only one operating point of the model.
Two observations motivate looking beyond a single threshold:
- If a model fits the data well, observations that are truly Events tend to receive higher predicted Event probabilities than Non-Events.
- Whether a given observation is classified as an Event still depends entirely on the chosen threshold.
The ROC curve evaluates model performance across all possible thresholds, rather than fixing one arbitrarily.
Constructing the ROC Curve
The ROC curve is built using the following procedure:
- Create a set of distinct predicted Event probabilities $\hat p_{i1}$.
- Use each distinct probability value as a threshold $t$.
- For each threshold:
- Compute the True Positive Rate (TPR), also called Sensitivity:
$TPR = \frac{TP}{TP + FN}$
- Compute the False Positive Rate (FPR):
$FPR = 1 - \text{Specificity} = \frac{FP}{FP + TN}$
- Plot:
- Sensitivity (TPR) on the vertical axis,
- False Positive Rate (FPR) on the horizontal axis.
- As the threshold varies, Sensitivity and False Positive Rate move together:
- Lower thresholds increase both,
- Higher thresholds decrease both.
The ROC curve therefore shows the trade-off between detecting Events and avoiding false alarms.
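The construction above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `roc_points`, the toy data, and the convention that an observation is classified as an Event when its probability is greater than or equal to the threshold are all assumptions made for this example.

```python
def roc_points(y_true, p_event):
    """Return (threshold, FPR, TPR) for every distinct predicted probability.

    Assumption: an observation is classified as an Event when p >= threshold.
    """
    n_pos = sum(y_true)               # number of true Events
    n_neg = len(y_true) - n_pos       # number of true Non-Events
    points = []
    for t in sorted(set(p_event)):    # each distinct probability is a threshold
        tp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 0)
        points.append((t, fp / n_neg, tp / n_pos))
    return points


# Hypothetical toy data: 1 = Event, 0 = Non-Event
y = [1, 0, 1, 1, 0, 0, 1, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

points = roc_points(y, p)
for t, fpr, tpr in points:
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting the FPR values against the TPR values (horizontal and vertical axes respectively) gives the ROC curve for the toy data.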
Kolmogorov–Smirnov (KS) Chart
Purpose of the KS Chart
The KS chart addresses a practical question:
- Can the True Positive Rate be maximized while keeping the False Positive Rate under control?
- If so, which threshold achieves this balance?
Constructing the KS Chart
The KS chart reuses statistics computed for the ROC curve:
- Use the same set of thresholds derived from predicted Event probabilities.
- Plot True Positive Rate versus threshold.
- Plot False Positive Rate versus threshold.
- Both rates decrease as the threshold increases.
- Identify the threshold where the vertical distance between the two curves is maximized.
This threshold is called the KS threshold, representing the point of maximum separation between Events and Non-Events.
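As a rough sketch of the search for the KS threshold, the snippet below scans every distinct predicted probability and keeps the one with the largest TPR minus FPR. The function name `ks_threshold`, the toy data, and the "Event when p >= threshold" rule are assumptions for illustration only.

```python
def ks_threshold(y_true, p_event):
    """Return (threshold, KS statistic), where KS = max over thresholds of TPR - FPR.

    Assumption: an observation is classified as an Event when p >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_ks = None, -1.0
    for t in sorted(set(p_event)):
        tpr = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 1) / n_pos
        fpr = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 0) / n_neg
        if tpr - fpr > best_ks:       # ties keep the lowest such threshold
            best_t, best_ks = t, tpr - fpr
    return best_t, best_ks


# Hypothetical toy data: 1 = Event, 0 = Non-Event
y = [1, 1, 0, 1, 1, 0, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

t_ks, ks = ks_threshold(y, p)
print(f"KS threshold = {t_ks}, KS statistic = {ks:.2f}")
```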
Precision–Recall Curve
Motivation
In many applications, especially marketing and customer behavior modeling:
- An Event represents a positive action (e.g., a purchase).
- The goal is to:
- Capture as many true buyers as possible,
- Avoid misclassifying non-buyers as buyers.
These objectives are often in tension, which motivates the Precision–Recall framework.
Precision and Recall Definitions
From the confusion matrix:
- Recall (Sensitivity):
$Recall = \frac{TP}{TP + FN}$
Measures how many true Events are captured.
- Precision:
$Precision = \frac{TP}{TP + FP}$
Measures how reliable Event predictions are.
Recall emphasizes coverage, while Precision emphasizes correctness.
Constructing the Precision–Recall Curve
- Extract all distinct predicted Event probabilities.
- Use each value as a threshold.
- For each threshold:
- Compute Precision,
- Compute Recall.
- Plot:
- Precision on the vertical axis,
- Recall on the horizontal axis.
- Typically:
- Lowering the threshold raises Recall but lowers Precision,
- Raising the threshold raises Precision but lowers Recall.
This curve visualizes the trade-off between capturing Events and avoiding false positives.
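The same loop over thresholds yields the points of the Precision–Recall curve. The sketch below is illustrative only; `pr_points`, the toy data, and the "Event when p >= threshold" rule are assumptions rather than anything prescribed by the notes.

```python
def pr_points(y_true, p_event):
    """Return (threshold, Precision, Recall) for every distinct predicted probability.

    Assumption: an observation is classified as an Event when p >= threshold.
    """
    n_pos = sum(y_true)
    points = []
    for t in sorted(set(p_event)):
        tp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 0)
        # tp + fp >= 1 here because t is itself one of the predicted probabilities
        points.append((t, tp / (tp + fp), tp / n_pos))
    return points


# Hypothetical toy data: 1 = Event, 0 = Non-Event
y = [1, 1, 0, 1, 1, 0, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

pr = pr_points(y, p)
for t, precision, recall in pr:
    print(f"threshold={t:.1f}  Precision={precision:.2f}  Recall={recall:.2f}")
```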
F1 Score
Definition
The F1 Score summarizes Precision and Recall into a single metric using the harmonic mean:
$F1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
The harmonic mean penalizes extreme imbalance between Precision and Recall.
Using F1 Score for Threshold Selection
The procedure is:
- Compute Precision and Recall at each threshold.
- Compute the F1 Score at each threshold.
- Plot F1 Score versus threshold.
- Identify:
- The maximum F1 Score,
- The threshold at which it occurs.
This threshold is called the F1 Score threshold.
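The selection procedure can be sketched as a single pass over the candidate thresholds. The names and toy data below are hypothetical, and the snippet assumes an observation is classified as an Event when its probability is greater than or equal to the threshold; F1 is set to 0 when no true Events are captured, which is a convention chosen here to avoid dividing by zero.

```python
def f1_threshold(y_true, p_event):
    """Return (threshold, F1) for the threshold that maximizes the F1 Score.

    Assumption: an observation is classified as an Event when p >= threshold.
    """
    n_pos = sum(y_true)
    best_t, best_f1 = None, -1.0
    for t in sorted(set(p_event)):
        tp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, p_event) if p >= t and y == 0)
        if tp == 0:
            f1 = 0.0                  # convention: no captured Events means F1 = 0
        else:
            precision = tp / (tp + fp)
            recall = tp / n_pos
            f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:              # ties keep the lowest such threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical toy data: 1 = Event, 0 = Non-Event
y = [1, 1, 0, 1, 1, 0, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

t_f1, f1 = f1_threshold(y, p)
print(f"F1 Score threshold = {t_f1}, maximum F1 = {f1:.4f}")
```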
Logistic Regression Example: Interpretation
ROC Interpretation
- If a False Positive Rate of 10% is acceptable, the model achieves approximately 30% True Positive Rate.
- Achieving 80% True Positive Rate requires tolerating at least 50% False Positive Rate.
- The maximum KS difference is 0.3724, occurring at threshold 0.29709371.
F1 and KS Threshold Alignment
- The highest F1 Score is 0.5816, occurring at threshold 0.29709371.
- At this threshold:
- Precision = 0.5234
- Recall = 0.6545
- The KS chart identifies the same threshold, reinforcing its selection.
- The misclassification rate at this threshold is 0.3024.
Although this misclassification rate is slightly higher than the one obtained with the uninformative default threshold of 0.5, the selected threshold is derived from the model's own performance and chosen to balance Precision and Recall, which justifies the trade-off.
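As a quick consistency check, the reported figures can be recombined by hand. Substituting the reported Precision and Recall into the F1 formula gives:

$F1 = \frac{2 \times 0.5234 \times 0.6545}{0.5234 + 0.6545} = \frac{0.6851}{1.1779} \approx 0.5817$

This agrees with the reported 0.5816 up to rounding of the inputs. Likewise, because Recall equals the True Positive Rate and the KS difference is TPR minus FPR at the same threshold, the implied False Positive Rate is:

$FPR = TPR - KS = 0.6545 - 0.3724 = 0.2821$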
Lift Curve and Marketing Analytics
Business Context
In marketing applications:
- A binary classification model predicts a customer’s likelihood of response.
- Due to limited resources:
- Not all customers can be contacted.
- Key questions include:
- Which customers should be contacted?
- What proportion should be targeted?
- What response rate can be expected?
Gain and Lift Strategy
If predictions are accurate:
- Customers with higher predicted Event probabilities are more likely to respond.
- Customers are sorted into groups based on decreasing predicted probabilities.
- Ideally, early groups contain the most responsive customers.
Constructing Gain and Lift Tables
Step-by-Step Procedure
- Sort predicted Event probabilities in descending order.
- Divide observations into ten equal-count deciles:
- Decile 1: top 10% probabilities,
- Decile 10: bottom 10%.
- For each decile, compute:
- Number of observations,
- Number of Events,
- Response rate,
- Gain,
- Lift.
Gain and Lift Metrics
For each decile:
- Decile N: number of observations in the decile.
- Decile %: percentage of all observations.
- Gain N: number of Event observations in the decile.
- Gain %: percentage of all Events captured.
- Response %: Event rate within the decile.
- Lift:
$Lift = \frac{\text{Response \%}}{\text{Overall Event Rate}}$
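A minimal sketch of how such a table might be computed is shown below. The function name `gain_lift_table`, the dictionary keys, and the toy data are assumptions for illustration; the snippet also assumes there are at least as many observations as groups, so that no decile is empty.

```python
def gain_lift_table(y_true, p_event, n_groups=10):
    """Build a per-decile Gain and Lift table as a list of dictionaries.

    Assumption: len(y_true) >= n_groups, so every group has at least one row.
    """
    # Sort observations by predicted Event probability, highest first
    ranked = sorted(zip(p_event, y_true), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_events = sum(y_true)
    overall_rate = total_events / n
    rows = []
    for g in range(n_groups):
        # Integer boundaries give groups whose sizes differ by at most one
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        decile_n = len(group)
        gain_n = sum(y for _, y in group)
        response = gain_n / decile_n
        rows.append({
            "decile": g + 1,
            "decile_n": decile_n,
            "decile_pct": 100 * decile_n / n,
            "gain_n": gain_n,
            "gain_pct": 100 * gain_n / total_events,
            "response_pct": 100 * response,
            "lift": response / overall_rate,
        })
    return rows


# Hypothetical toy data: 20 customers, 7 responders (1 = Event)
p = [i / 20 for i in range(20, 0, -1)]
y = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

table = gain_lift_table(y, p)
for row in table:
    print(row)
```

In the sketch, Lift is computed from proportions rather than percentages; the ratio is the same either way as long as both quantities use the same units.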
Interpretation of Gain and Lift
From the sample table:
- Contacting the top 10% of customers yields:
- Response rate = 52.47%,
- Lift = 1.97 (nearly double the baseline).
- Contacting the next 10% yields:
- Response rate = 43.52%,
- Lift = 1.64.
This indicates that the model's ranking concentrates responders in the top deciles.
Cumulative Gain and Lift
Cumulative metrics show performance when contacting multiple top deciles together.
Key interpretations:
- Contacting the top 20% yields:
- Response rate = 48.43%,
- Lift = 1.82.
- Contacting the top 30% yields:
- Response rate = 43.72%,
- Lift = 1.64.
Cumulative tables directly support resource allocation decisions.
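The cumulative figures follow from running totals over the per-decile counts. The sketch below assumes the input is a list of (Decile N, Gain N) pairs ordered from the top decile downward; the function name, the output keys, and the counts are hypothetical and are not taken from the sample table.

```python
def cumulative_gain_lift(decile_counts):
    """Accumulate (decile_n, gain_n) pairs, ordered from Decile 1 downward."""
    total_n = sum(n for n, _ in decile_counts)
    total_events = sum(e for _, e in decile_counts)
    overall_rate = total_events / total_n
    rows, cum_n, cum_events = [], 0, 0
    for n, e in decile_counts:
        cum_n += n                    # customers contacted so far
        cum_events += e               # responders captured so far
        cum_response = cum_events / cum_n
        rows.append({
            "cum_pct_contacted": 100 * cum_n / total_n,
            "cum_gain_pct": 100 * cum_events / total_events,
            "cum_response_pct": 100 * cum_response,
            "cum_lift": cum_response / overall_rate,
        })
    return rows


# Hypothetical counts for five equal-sized groups (not the sample table)
counts = [(100, 50), (100, 40), (100, 30), (100, 20), (100, 10)]

cumulative = cumulative_gain_lift(counts)
for row in cumulative:
    print(row)
```

By construction, the cumulative Lift in the last row is 1, since contacting everyone reproduces the overall Event rate.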
Final Conceptual Summary
Binary Classification Models – Part II extends evaluation beyond accuracy by introducing:
- ROC Curve: evaluates sensitivity–specificity trade-offs across all thresholds.
- KS Chart: identifies the threshold with maximum class separation.
- Precision–Recall Curve: balances coverage and reliability.
- F1 Score: combines Precision and Recall into a single metric that can be maximized to select a threshold.
- Gain and Lift Analysis: translates model performance into actionable business strategy.
Together, these tools allow binary classification models to be evaluated, optimized, and deployed in decision-critical environments.
