Two commonly used measures for identifying outliers in regression analysis are:
- Ordinary residuals
- Studentized residuals (also called internally studentized residuals, or standardized residuals in Minitab)
Both are reviewed below, with additional detail to clarify their roles and limitations.
1. Ordinary Residuals
Definition
For each observation $i$, the ordinary residual is defined as:
$$e_i = y_i - \hat{y}_i$$
where:
- $y_i$ is the observed response
- $\hat{y}_i$ is the predicted (fitted) response
Example
Consider the following small data set with four observations:
| x | y | FITS | RESI |
|---|---|---|---|
| 1 | 2 | 2.2 | -0.2 |
| 2 | 5 | 4.4 | 0.6 |
| 3 | 6 | 6.6 | -0.6 |
| 4 | 9 | 8.8 | 0.2 |
Each residual is computed by subtracting the fitted value from the observed value. For example:
- First residual: $2 - 2.2 = -0.2$
- Second residual: $5 - 4.4 = 0.6$
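The residual column can be reproduced with a short script. This is a minimal sketch in plain Python, assuming the fitted values come from an ordinary least-squares line with an intercept (the table does not state the model explicitly). The helper name `fit_line` is illustrative.

```python
# Four-point data set from the table above.
x = [1, 2, 3, 4]
y = [2, 5, 6, 9]


def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(x, y)

# Fitted values, then residual = observed - fitted.
fits = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fits)]

for xi, yi, fi, ei in zip(x, y, fits, residuals):
    print(f"x={xi}  y={yi}  fit={fi:.1f}  resid={ei:+.1f}")
```

Under that assumption the fitted slope is 2.2 and the intercept is essentially zero, which matches the FITS and RESI columns.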
Limitation of Ordinary Residuals
The major drawback of ordinary residuals is that their magnitude depends on the units of measurement of the response variable. As a result:
- A residual of size 10 may be large in one context but small in another.
- This makes it difficult to use ordinary residuals directly to detect outliers.
2. Studentized (Internally Studentized) Residuals
Motivation
To remove the effect of measurement units, residuals are scaled by an estimate of their standard deviation. This produces studentized residuals, which are unit-free and directly comparable across observations.
Definition
The internally studentized residual for observation $i$ is:
$$r_i = \frac{e_i}{\sqrt{\mathrm{MSE}\,(1 - h_{ii})}}$$
where:
- $e_i$ is the ordinary residual
- MSE is the mean square error
- $h_{ii}$ is the leverage of observation $i$
This shows that studentized residuals depend on:
- the size of the residual,
- the overall variability of the model (MSE),
- how extreme the predictor value is (leverage).
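The reason leverage enters the scaling is a standard result for least-squares fits with constant error variance $\sigma^2$, stated here without proof. The closed form for the leverage assumes a model with a single predictor and an intercept, as in the four-point example; with several predictors the leverage is the corresponding diagonal element of the hat matrix.

```latex
\operatorname{Var}(e_i) = \sigma^2 \,(1 - h_{ii}),
\qquad
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}
```

Replacing the unknown $\sigma^2$ with MSE gives the estimated standard deviation $\sqrt{\mathrm{MSE}\,(1 - h_{ii})}$ of the residual. Observations far from $\bar{x}$ have larger leverage, so their residuals have smaller variance and are scaled up more.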
Example with Leverage
Using the same four-point data set:
| x | y | FITS | RESI | HI | SRES |
|---|---|---|---|---|---|
| 1 | 2 | 2.2 | -0.2 | 0.7 | -0.57735 |
| 2 | 5 | 4.4 | 0.6 | 0.3 | 1.13389 |
| 3 | 6 | 6.6 | -0.6 | 0.3 | -1.13389 |
| 4 | 9 | 8.8 | 0.2 | 0.7 | 0.57735 |
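Given:
$$\mathrm{MSE} = 0.4$$
(the sum of squared residuals, 0.8, divided by $n - 2 = 2$ degrees of freedom), the first studentized residual is:
$$\frac{-0.2}{\sqrt{0.4\,(1 - 0.7)}} = -0.57735$$
Each studentized residual measures how many estimated standard deviations the residual lies from zero, accounting for leverage.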
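The HI and SRES columns can be checked numerically. The following sketch uses only the standard library and assumes a simple linear regression with an intercept, so that two coefficients are estimated and the leverage has the closed form used in the code. The variable names are illustrative.

```python
import math

x = [1, 2, 3, 4]
residuals = [-0.2, 0.6, -0.6, 0.2]  # RESI column from the table

n = len(x)
p = 2  # assumed number of estimated coefficients: intercept and slope

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage for simple linear regression with an intercept.
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Mean square error: residual sum of squares over residual degrees of freedom.
mse = sum(e ** 2 for e in residuals) / (n - p)

# Internally studentized residuals: residual / sqrt(MSE * (1 - leverage)).
sres = [e / math.sqrt(mse * (1 - h)) for e, h in zip(residuals, leverage)]

for xi, h, r in zip(x, leverage, sres):
    print(f"x={xi}  leverage={h:.1f}  sres={r:.5f}")
```

The two end points (x = 1 and x = 4) have the highest leverage, 0.7, because they lie farthest from the mean of x.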
3. Using Studentized Residuals to Detect Outliers
Interpretation Guidelines
- A studentized residual with absolute value greater than 3 is generally considered evidence of an outlier.
- Some software (e.g., Minitab) uses a stricter cutoff of 2, which flags more observations for review.
- These thresholds should not be treated rigidly; instead, they serve as warning signals prompting further investigation.
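As a rough illustration of how such cutoffs might be applied, the sketch below lists the observations whose studentized residual exceeds a chosen threshold in absolute value. The function name `flag_large_residuals` and the second list of values are hypothetical and do not come from any data set in this section.

```python
def flag_large_residuals(sres, cutoff=2.0):
    """Return 1-based observation numbers with |studentized residual| > cutoff."""
    return [i for i, r in enumerate(sres, start=1) if abs(r) > cutoff]


# SRES column of the four-point example: nothing exceeds 2 in absolute value.
four_point = [-0.57735, 1.13389, -1.13389, 0.57735]
print(flag_large_residuals(four_point, cutoff=2.0))

# Hypothetical values, for illustration only.
made_up = [0.4, -2.3, 3.5, 1.1]
print(flag_large_residuals(made_up, cutoff=2.0))  # observations 2 and 3
print(flag_large_residuals(made_up, cutoff=3.0))  # observation 3 only
```

A flagged observation is a prompt to look at the data point more closely, not an automatic reason to delete it.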
4. Example: Influence2 Data Set Revisited
In the Influence2 data set, one observation visually appears to deviate strongly from the general trend.
Minitab’s diagnostic output for that observation is:
| Obs | y | Fit | Resid | Std Resid |
|---|---|---|---|---|
| 21 | 40.00 | 23.11 | 16.89 | 3.68 |
Because the internally studentized residual is 3.68, Minitab flags this observation as having a large residual, confirming its outlier status.
5. Why Do Outliers Matter?
Outliers matter because they can substantially affect certain aspects of a regression analysis. One way to see this is to compare results with and without the outlier.
Regression Without the Outlier
- Mean Square Error (MSE): 6.72
- $R^2$: 97.32%
- Standard error S: 2.59
Regression With the Outlier Included
- Mean Square Error (MSE): 22.19
- $R^2$: 91.01%
- Standard error S: 4.71
6. Key Insight: What Changes and What Does Not
The most substantial change caused by the outlier is the inflation of MSE, from 6.72 to 22.19.
This is important because:
- MSE appears in all confidence interval and prediction interval formulas.
- A larger MSE leads to wider intervals, reducing precision.
However:
- The estimated regression coefficients,
- the predicted values,
- and hypothesis test conclusions
remain largely unchanged.
Therefore, in this case, the outlier is not influential in terms of coefficient estimates, but it is influential with respect to model uncertainty, as reflected by MSE.
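To get a feel for the size of this effect, note that interval half-widths are proportional to $\sqrt{\mathrm{MSE}}$ when everything else in the formula is held fixed. The sketch below makes that simplifying assumption; in practice the t multiplier and the leverage terms also shift slightly when an observation is added or removed, so the ratio is only approximate.

```python
import math

mse_without = 6.72  # outlier excluded
mse_with = 22.19    # outlier included

# The standard error S is the square root of MSE.
s_without = math.sqrt(mse_without)
s_with = math.sqrt(mse_with)

# Approximate factor by which interval widths grow, other terms held fixed.
width_ratio = s_with / s_without

print(f"S without outlier: {s_without:.2f}")
print(f"S with outlier:    {s_with:.2f}")
print(f"Approximate interval width ratio: {width_ratio:.2f}")
```

The two values of S agree with those reported above (2.59 and 4.71), and the ratio suggests intervals roughly 1.8 times as wide when the outlier is included.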
Final Takeaway
- Ordinary residuals are simple but scale-dependent.
- Studentized residuals standardize residuals using MSE and leverage, making them effective for outlier detection.
- Outliers can dramatically inflate MSE, harming interval estimation even when coefficient estimates remain stable.
- Identifying and understanding outliers is essential for reliable regression inference, not merely for improving model fit.
