1) Why Stepwise Regression Is Needed

When a regression model includes many variables, especially irrelevant ones, the result is often an unnecessarily complex model that is difficult to interpret and may generalize poorly to new data.
Stepwise regression provides a systematic way to select a subset of important variables, balancing explanatory power and model simplicity.

The two most common approaches are forward stepwise selection and backward stepwise elimination.


2) Forward Stepwise Selection

Basic Idea

Forward stepwise selection starts from the simplest possible model and builds complexity gradually:

  • Begin with the null model, which includes only the intercept.
  • Add predictors one at a time.
  • At each step, choose the variable that improves the model the most.
  • Stop when a predefined criterion is met.

This approach answers the question:

“Which variable should be added next to improve the model?”


How the Next Variable Is Chosen

At each step, each remaining candidate variable is temporarily added to the current model. The variable selected is the one that, when added:

  • Has the smallest p-value, or
  • Produces the largest increase in $R^2$, or
  • Produces the largest reduction in RSS (residual sum of squares).

All three criteria are equivalent in spirit: they measure how much explanatory power the variable adds beyond what is already in the model.
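
In fact, for candidates with a single degree of freedom compared against the same current model, the three criteria give exactly the same ranking. Writing $TSS$ for the total sum of squares and $p$ for the number of predictors already in the model:

$$
\Delta R^2 = \frac{RSS_{\text{old}} - RSS_{\text{new}}}{TSS},
\qquad
F = \frac{RSS_{\text{old}} - RSS_{\text{new}}}{RSS_{\text{new}} / (n - p - 2)}.
$$

Both quantities grow as $RSS_{\text{new}}$ shrinks, and the p-value of the partial $F$-test (equivalently, of the $t$-test on the new coefficient, since $t^2 = F$ here) falls as $F$ grows, so all three rules pick the same variable.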


Stopping Rule

Forward selection stops when none of the remaining variables satisfy the entry criterion.

This criterion is usually expressed as a p-value threshold. Common choices are:

  • A fixed threshold, such as $0.05$, $0.2$, or $0.5$.
  • A threshold determined automatically by AIC (Akaike Information Criterion).
  • A threshold determined automatically by BIC (Bayesian Information Criterion).

If no remaining variable has a p-value below the threshold when added, the algorithm terminates.
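
As a concrete illustration, here is a minimal sketch of the whole procedure in Python using statsmodels, with the smallest-p-value entry rule and a fixed threshold (`forward_select` and its arguments are illustrative names, not a standard API):

```python
import statsmodels.api as sm

def forward_select(X, y, threshold=0.05):
    """Forward stepwise selection by p-value (a minimal sketch).

    X: pandas DataFrame of candidate predictors; y: response.
    """
    selected, remaining = [], list(X.columns)
    while remaining:
        # Temporarily add each remaining candidate and record its p-value.
        pvals = {}
        for var in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            pvals[var] = fit.pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= threshold:
            break  # no candidate meets the entry criterion: stop
        selected.append(best)
        remaining.remove(best)
    return selected
```

Using AIC or BIC instead amounts to comparing `fit.aic` or `fit.bic` across candidates and stopping when no addition lowers the criterion.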


How AIC and BIC Set Thresholds (Intuition)

  • AIC adjusts the threshold based on the variable’s degrees of freedom.
    For example, a binary predictor (1 degree of freedom) must typically have a p-value below about $0.157$ to enter the model under AIC.
  • BIC adjusts the threshold based on the sample size $n$.
    For example, when $n = 20$, a variable typically needs a p-value below about $0.083$.
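
These cutoffs follow from a likelihood-ratio view of the two criteria: adding a term with $d$ degrees of freedom lowers AIC exactly when its likelihood-ratio statistic exceeds $2d$, and lowers BIC when it exceeds $d \ln n$; under the null hypothesis that statistic is approximately $\chi^2_d$. The quoted thresholds can be checked directly (a quick verification using scipy):

```python
import math
from scipy.stats import chi2

# AIC: a 1-df term enters when its likelihood-ratio statistic exceeds 2.
print(chi2.sf(2.0, df=1))           # ~ 0.157

# BIC: the cutoff is ln(n); with n = 20, ln(20) ~ 3.0.
print(chi2.sf(math.log(20), df=1))  # ~ 0.083
```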

Because BIC's penalty per added parameter is $\ln n$ while AIC's is a fixed $2$, BIC is more restrictive than AIC whenever $n > e^2 \approx 7.4$, and increasingly so as $n$ grows; it therefore tends to produce smaller models. For this reason, BIC is generally recommended only when the sample size is large relative to the number of predictors.


3) Backward Stepwise Elimination

Basic Idea

Backward stepwise elimination works in the opposite direction:

  • Begin with the full model, which includes all candidate variables.
  • Remove predictors one at a time.
  • At each step, remove the least important variable.
  • Stop when all remaining variables meet the retention criterion.

This approach answers the question:

“Which variable can be removed with the least damage to the model?”


How the Variable to Remove Is Chosen

At each step, the least significant variable is identified as the one that:

  • Has the largest p-value, or
  • Causes the smallest decrease in $R^2$ when removed, or
  • Causes the smallest increase in RSS when removed.

Stopping Rule

Backward elimination stops when all remaining variables have p-values below the specified threshold.
The same threshold choices apply as in forward selection: fixed values, AIC, or BIC.
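
A matching sketch of backward elimination with the p-value rule, in the same style as the forward sketch above (`backward_eliminate` is an illustrative name):

```python
import statsmodels.api as sm

def backward_eliminate(X, y, threshold=0.05):
    """Backward stepwise elimination by p-value (a minimal sketch)."""
    selected = list(X.columns)
    while selected:
        fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
        pvals = fit.pvalues.drop("const")  # ignore the intercept
        worst = pvals.idxmax()             # least significant variable
        if pvals[worst] < threshold:
            break  # every remaining variable meets the retention criterion
        selected.remove(worst)
    return selected
```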


4) Forward vs Backward: When to Use Which

When Forward Selection Is Preferable

Forward selection has a crucial advantage:

  • It can be used even when the number of candidate variables exceeds the sample size.

This is because forward selection never fits the full model. It only considers models with a limited number of predictors, always fewer than:

  • the sample size (in linear regression), or
  • the number of events (in logistic regression).

When Backward Elimination Is Preferable

Backward elimination starts with the full model, which allows it to:

  • Consider the joint effects of all variables simultaneously.
  • Handle collinearity more naturally: a group of correlated variables that is only jointly significant is in the model from the start, whereas forward selection may never add any single member of such a group.

As a general guideline:

Unless the number of candidate variables exceeds the sample size (or number of events), backward stepwise elimination is usually preferred.


5) Advantages of Stepwise Selection

(1) Easy to Apply

Stepwise regression is automated and available in most statistical software. This makes it easy to apply, especially in exploratory analyses.

One important practical note: missing values should be handled beforehand, because most implementations apply listwise deletion, dropping every row that is missing any candidate variable, which can shrink the effective sample size dramatically.
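
A small illustration of how quickly listwise deletion can bite (the data frame and its values are invented for the example):

```python
import numpy as np
import pandas as pd

# Invented data: each row is missing at most one value.
df = pd.DataFrame({
    "age":     [34, 51, np.nan, 46],
    "bmi":     [22.1, np.nan, 27.5, 30.2],
    "outcome": [1.2, 0.7, 1.9, np.nan],
})

# Listwise deletion keeps only rows that are complete in every column.
print(f"{len(df.dropna())} of {len(df)} rows survive")  # 1 of 4
```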


(2) Improves Generalizability

When the number of predictors is large relative to the sample size, a full model often overfits the data.
By reducing the number of predictors, stepwise selection can improve out-of-sample performance.


(3) Produces Simple, Interpretable Models

Smaller models are easier to interpret, explain, and deploy. Stepwise selection naturally favors such models.


(4) Objective and Reproducible

Compared to purely subjective variable selection, stepwise methods provide a transparent and reproducible procedure.

That said, automated selection should not replace domain knowledge. Variables known to be important for causal or confounding reasons should be included even if they are not statistically significant.


6) Limitations of Stepwise Selection

(1) Does Not Explore All Possible Models

Stepwise selection examines only a subset of all possible variable combinations.
This gives it a computational advantage, but also means it is not guaranteed to find the optimal model.


(2) Produces Biased Inference

Stepwise regression introduces systematic bias:

  • Regression coefficients appear too large.
  • Confidence intervals appear too narrow.
  • p-values are too small and statistically invalid.
  • $R^2$ appears too optimistic.

As the documentation for several statistical packages warns, p-values from stepwise procedures should generally not be trusted for inference.


(3) Unstable Variable Selection

Stepwise selection can be highly unstable, especially when the sample size is small relative to the number of predictors.

Small changes in the data can lead to very different selected models. This instability decreases only when the sample size is very large (often more than 50 observations per candidate variable).


(4) Ignores Causal Structure

Stepwise methods focus purely on statistical associations. They do not account for causality or confounding.
As a result, important control variables may be excluded if selection is applied blindly.


7) How to Mitigate These Limitations

Several strategies can reduce the risks:

  • Sample splitting:
    Use one part of the data to select variables and another part to estimate coefficients and $R^2$.
  • Bootstrap selection:
    Run stepwise selection on many resampled datasets and examine how often each variable is selected (a sketch follows this list).
  • Alternative methods:
    Consider prior knowledge, shrinkage methods such as LASSO, or dimensionality reduction techniques like PCA.
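
The bootstrap idea can be sketched in a few lines; the fragment below reruns forward selection on resampled rows and reports how often each variable is chosen (it assumes the `forward_select` sketch from Section 2 is in scope, and the replicate count is an arbitrary illustrative choice):

```python
from collections import Counter

import numpy as np

def selection_frequencies(X, y, n_boot=200, threshold=0.05, seed=0):
    """Run forward selection on bootstrap resamples and count selections."""
    rng = np.random.default_rng(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Resample rows with replacement.
        idx = rng.integers(0, len(X), size=len(X))
        Xb = X.iloc[idx].reset_index(drop=True)
        yb = y.iloc[idx].reset_index(drop=True)
        counts.update(forward_select(Xb, yb, threshold))
    return {var: counts[var] / n_boot for var in X.columns}
```

A variable selected in, say, fewer than half of the replicates is a weak basis for any substantive claim.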

8) Final Perspective

Stepwise regression can be useful, particularly for exploratory analysis, but it comes at a high statistical cost. Results should be reported cautiously, and claims such as “the best predictors” or “the best model” should be avoided.

In practice, stepwise regression should be treated as a tool for exploration, not as a definitive method for inference or causal conclusions.