1) The core problem: fit quality versus model complexity

When choosing a model order (for example, deciding the AR and MA orders $p$ and $q$), you are balancing two competing goals:

  1. Fit the data well
    You want small residuals. In many time series models this is closely tied to a small estimated shock variance $\hat{\sigma}^2$ (the estimated variance of the underlying white-noise innovations).
  2. Avoid overfitting
    Adding parameters almost always reduces in-sample residual error, but it can damage:
    • forecasting performance,
    • interpolation stability,
    • interpretability,
      because the model begins fitting random noise rather than real structure.

An extreme example is fitting a high-degree polynomial through a small number of points: it can match the observed points perfectly, yet forecast disastrously because the fitted function is too flexible and is not capturing a stable underlying mechanism.

The key insight is:

A model that minimizes in-sample error is not necessarily the best model for prediction or generalization.
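
The polynomial example above can be made concrete with a small sketch. The data are hypothetical: six points on the line y = x with small fixed perturbations standing in for noise. The function names `lagrange_predict` and `line_predict` are illustrative only. The degree-5 interpolating polynomial reproduces every point, while a least-squares straight line does not, yet the line extrapolates far more sensibly.

```python
# Hypothetical data: the trend y = x plus small fixed "noise" values.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.3, -0.4, 0.5, -0.3, 0.4, -0.5]
ys = [x + e for x, e in zip(xs, noise)]


def lagrange_predict(x_new):
    """Evaluate the degree-5 polynomial that passes through every (x, y) point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x_new - xj) / (xi - xj)
        total += yi * weight
    return total


def line_predict(x_new):
    """Evaluate the ordinary least-squares straight line fitted to the same points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return y_bar + (sxy / sxx) * (x_new - x_bar)


# In-sample: largest absolute error at the observed points.
poly_in_sample = max(abs(lagrange_predict(x) - y) for x, y in zip(xs, ys))
line_in_sample = max(abs(line_predict(x) - y) for x, y in zip(xs, ys))

# Out-of-sample: forecast at x = 8, where the noise-free trend equals 8.
poly_forecast_error = abs(lagrange_predict(8.0) - 8.0)
line_forecast_error = abs(line_predict(8.0) - 8.0)

print("in-sample max error  (poly, line):", poly_in_sample, line_in_sample)
print("forecast error at x=8 (poly, line):", poly_forecast_error, line_forecast_error)
```

The flexible model wins in-sample and loses badly out-of-sample, which is the pattern that order-selection criteria are meant to guard against.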


2) Akaike Information Criterion (AIC): what it is trying to do

2.1 What the likelihood contributes

For a chosen ARMA(p,q) structure, maximum likelihood estimation produces fitted parameters. Conceptually, the likelihood rewards models that make the observed data “probable” under the fitted model.

  • Adding parameters (increasing $p$ and/or $q$) usually increases the maximized likelihood because the model can adapt more closely to the data.
  • If you only maximize likelihood, you will tend to choose overly complex models.

So AIC modifies the likelihood-based score by adding an explicit penalty for complexity.

2.2 AIC definition in this context

Let $\beta$ represent the fitted ARMA coefficients collectively (the fitted AR coefficients and MA coefficients). With maximum likelihood, you can compute:

  • the maximized log-likelihood $\log L(\beta)$,
  • a residual sum-of-squares-type quantity $S(\beta)$ (the sum of squared non-standardized prediction errors, adjusted by internal scaling terms in the likelihood formulation),
  • and the number of estimated parameters.

The AIC is presented in the form:

$$\text{AIC}(\beta) = -2\log L(\beta) + 2(p+q+1)$$

Interpretation of the two terms:

  • $-2\log L(\beta)$: measures lack-of-fit (smaller is better; larger likelihood means better fit).
  • $2(p+q+1)$: penalty for model size (more parameters increases AIC).

The model selection rule is:

Prefer the model with the smallest AIC among the candidates being compared.
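
A minimal sketch of this rule follows. The log-likelihood values are hypothetical numbers chosen for illustration, not output from any fitted model, and the helper name `aic` is an assumption of this sketch.

```python
def aic(log_likelihood, p, q):
    """AIC = -2 log L + 2 (p + q + 1); the +1 counts the innovation variance."""
    return -2.0 * log_likelihood + 2.0 * (p + q + 1)


# Hypothetical maximized log-likelihoods, keyed by the order (p, q).
candidates = {
    (1, 0): -152.4,
    (2, 0): -147.9,
    (3, 0): -147.6,
}

scores = {order: aic(loglik, *order) for order, loglik in candidates.items()}
best_order = min(scores, key=scores.get)

for order, score in scores.items():
    print(order, round(score, 2))
print("smallest AIC:", best_order)
```

In these made-up numbers, moving from AR(2) to AR(3) raises the log-likelihood by only 0.3, which lowers $-2\log L$ by 0.6, less than the extra penalty of 2, so AR(2) has the smaller AIC.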


3) AICc: the small-sample correction

AICc is a bias-corrected version of AIC that penalizes complexity more strongly when the sample size $n$ is not large relative to the number of parameters.

$$\text{AICc}(\beta) = -2\log L(\beta) + 2(p+q+1)\cdot\frac{n}{n-p-q-2}$$

Key behavior:

  • If $n$ is large, $\frac{n}{n-p-q-2}\approx 1$, so AICc $\approx$ AIC.
  • If $n$ is modest, AICc imposes a noticeably larger penalty for additional parameters.
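
The size of the correction can be made explicit by subtracting the AIC formula from the AICc formula. The shorthand $k = p+q+1$ is introduced here only for compactness:

$$\text{AICc}(\beta) - \text{AIC}(\beta) = 2k\left(\frac{n}{n-k-1} - 1\right) = \frac{2k(k+1)}{n-k-1}, \qquad k = p+q+1.$$

For an AR(2) model ($k=3$) the gap is $24/96 = 0.25$ when $n=100$ and $24/50 = 0.48$ when $n=54$. The gap grows roughly quadratically in the number of parameters and shrinks as $n$ grows.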

Practical preference:

AICc is generally preferred over AIC when the sample size is not huge.


4) How these criteria behave as you add parameters

A very important empirical pattern appears repeatedly:

  • The estimated shock variance $\hat{\sigma}^2$ (or residual sum of squares) tends to decrease monotonically as model complexity increases.
  • AIC/AICc typically decreases at first (when you add parameters that capture real structure), then eventually increases (when extra parameters mostly fit noise and are not worth the penalty).

So AIC/AICc often selects the point where:

  • “additional parameters stop paying for themselves.”

5) Demonstration with simulated AR(2) data: why AR(2) is often selected

5.1 The experiment setup

Simulate data from a true AR(2) process, then fit AR(p) models for $p=1,2,3,4$. For each fitted model, compute:

  • estimated white-noise variance $\hat{\sigma}^2$,
  • AIC,
  • AICc.

5.2 What the outputs show (typical pattern)

In the provided run with $n=100$, the results show:

  • $\hat{\sigma}^2$ decreases as $p$ increases (this is expected).
  • The biggest improvement in $\hat{\sigma}^2$ happens when moving from $p=1$ to $p=2$, because the true process is AR(2).
  • After $p=2$, the improvement in fit is small, and the complexity penalty dominates.
  • Therefore AIC and AICc both choose $p=2$.

This illustrates a key principle:

Information criteria reward adding parameters only when they materially improve the likelihood.
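
An experiment of this kind can be sketched in plain Python. This is not the run described above, and several choices are assumptions of the sketch: the AR(2) coefficients (0.5 and -0.4), the seed, Yule-Walker estimation via the Levinson-Durbin recursion in place of full maximum likelihood, and the Gaussian approximation $-2\log L \approx n\log(2\pi\hat{\sigma}^2) + n$. Which order is selected depends on the simulated draw.

```python
import math
import random


def simulate_ar2(n, phi1, phi2, seed, burn_in=200):
    """Simulate an AR(2) series with standard normal innovations."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n + burn_in):
        x.append(phi1 * x[-1] + phi2 * x[-2] + rng.gauss(0.0, 1.0))
    return x[-n:]  # drop the burn-in so start-up effects fade


def sample_autocovariances(x, max_lag):
    """Biased sample autocovariances (divisor n) at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    return [sum(d[t] * d[t + h] for t in range(n - h)) / n
            for h in range(max_lag + 1)]


def yule_walker_variances(gamma, max_order):
    """Levinson-Durbin recursion: innovation variance for AR(1..max_order)."""
    phi, sigma2, out = [], gamma[0], {}
    for k in range(1, max_order + 1):
        acc = gamma[k] - sum(phi[j] * gamma[k - 1 - j] for j in range(k - 1))
        reflection = acc / sigma2
        phi = [phi[j] - reflection * phi[k - 2 - j] for j in range(k - 1)]
        phi.append(reflection)
        sigma2 *= 1.0 - reflection ** 2
        out[k] = sigma2
    return out


def criteria(n, p, sigma2):
    """Return (AIC, AICc) for an AR(p) fit under the Gaussian approximation."""
    neg2_loglik = n * math.log(2.0 * math.pi * sigma2) + n
    k = p + 1  # p AR coefficients plus the innovation variance
    return neg2_loglik + 2.0 * k, neg2_loglik + 2.0 * k * n / (n - p - 2)


n = 100
series = simulate_ar2(n, phi1=0.5, phi2=-0.4, seed=1)  # hypothetical settings
variances = yule_walker_variances(sample_autocovariances(series, 4), 4)
results = {p: (variances[p],) + criteria(n, p, variances[p]) for p in range(1, 5)}

for p, (s2, a, ac) in results.items():
    print(f"AR({p}): sigma2={s2:.4f}  AIC={a:.2f}  AICc={ac:.2f}")
```

With this estimator the variance estimate can never increase as $p$ grows, because each step multiplies it by one minus a squared reflection coefficient. Only the penalty term can stop the criteria from always preferring the largest model.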

5.3 Variability across simulations: AIC/AICc can disagree or be close

When you repeat the simulation with a different random seed:

  • Sometimes AIC/AICc prefer AR(3) over AR(2), even though the truth is AR(2).
  • Often this happens by a small margin.

This does not mean the criterion is “wrong” in a trivial sense. It reflects that:

  • the realized dataset is only one random draw,
  • the criteria estimate an expected predictive performance tradeoff,
  • and the difference between AR(2) and AR(3) may be too small to be practically meaningful.

A practical decision rule used by many analysts is:

  • If AICc is only slightly smaller for the larger model, you may still choose the simpler model for interpretability and stability.

5.4 AIC vs AICc divergence when $n$ is smaller

With a smaller sample size (for example, $n=54$):

  • AIC may pick a slightly larger model (e.g., AR(3))
  • AICc may pick the smaller model (e.g., AR(2))

This reflects exactly what AICc is designed for: penalize extra parameters more strongly when data is limited.
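
The mechanism can be illustrated with a small arithmetic sketch. The $-2\log L$ values below are hypothetical, chosen so that AR(3) improves the fit by slightly more than the AIC penalty step of 2 but by less than the AICc penalty step at $n=54$. They are not taken from the run described above.

```python
def aic(neg2_loglik, p, q):
    """AIC from a precomputed -2 log L value."""
    return neg2_loglik + 2.0 * (p + q + 1)


def aicc(neg2_loglik, p, q, n):
    """AICc from a precomputed -2 log L value and sample size n."""
    return neg2_loglik + 2.0 * (p + q + 1) * n / (n - p - q - 2)


n = 54
# Hypothetical -2 log L values for AR(2) and AR(3), keyed by p.
neg2_loglik = {2: 150.0, 3: 147.8}

aic_scores = {p: aic(v, p, 0) for p, v in neg2_loglik.items()}
aicc_scores = {p: aicc(v, p, 0, n) for p, v in neg2_loglik.items()}

aic_choice = min(aic_scores, key=aic_scores.get)
aicc_choice = min(aicc_scores, key=aicc_scores.get)

print("AIC  scores:", aic_scores, "-> AR(%d)" % aic_choice)
print("AICc scores:", aicc_scores, "-> AR(%d)" % aicc_choice)
```

Here the fit improves by 2.2, which beats the AIC penalty increase of 2 but not the AICc penalty increase of roughly 2.34, so the two criteria disagree.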


6) Extending beyond pure AR: you usually compare a range of AR, MA, and ARMA models

The demonstrations focus on pure AR models to keep the number of cases manageable, but in general you should consider plausible candidates across:

  • AR(p),
  • MA(q),
  • ARMA(p,q),

within reasonable ranges suggested by:

  • the sample ACF and PACF patterns,
  • and the diagnostic behavior of fitted residuals.

7) Automated search with auto.arima: how to use it responsibly

7.1 What auto.arima does

auto.arima searches across a set of candidate models, evaluates an information criterion (AIC, AICc, or BIC), and chooses the model that minimizes it.

By default (in the forecast package for R), it uses AICc.

7.2 Why trace = TRUE is important

With trace = TRUE, you can see which models were tried and their scores. This is important because:

  • you can verify the search space is sensible,
  • you can see if multiple models are close competitors,
  • you can avoid blindly accepting the single “best” model when the difference is trivial.

7.3 Interpreting the LakeHuron output

The output shows many candidate ARMA models tested, and the best according to AICc is:

  • ARMA(1,1) with a non-zero mean.

But the AR(2) model (ARMA(2,0)) has an AICc only slightly larger. This implies:

  • both models are plausible and worth deeper checking,
  • order selection should not be decided solely by “minimum AICc,” especially when differences are small.

A responsible workflow is:

  1. Use AICc to shortlist a few close models.
  2. Fit each shortlisted model.
  3. Run residual diagnostics (autocorrelation tests, normality checks, standardized residual plots).
  4. Prefer the model that both:
    • has good diagnostics, and
    • is not unnecessarily complex.

8) A critical practical warning: do not compare AIC values computed using different conventions

Different software functions and packages sometimes report AIC in different forms:

  • some report the raw AIC,
  • some report AIC divided by $n$,
  • some differ in constant terms or parameter counting conventions.

The example described shows exactly this issue:

  • one routine prints a normalized AIC value (AIC/n),
  • but internally stores the standard AIC that matches other packages.

Therefore:

When comparing models, compute AIC/AICc from the same function/package using the same convention for every candidate model.

This avoids accidentally comparing incompatible scores.
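
A small sketch of how a mixed-convention comparison goes wrong. All numbers and variable names are hypothetical. Rescaling by $n$ only undoes the normalization; differences in constant terms or parameter counting cannot be repaired this way and require recomputing the criterion with one routine.

```python
n = 100  # sample size shared by both fits

# Hypothetical reports for two candidate models fitted to the same series.
raw_aic_model_a = 301.8         # routine 1 reports the raw AIC
normalized_aic_model_b = 3.032  # routine 2 reports AIC / n

# Naive comparison: the numbers are on different scales, so B "wins" trivially.
naive_choice = "B" if normalized_aic_model_b < raw_aic_model_a else "A"

# Fair comparison: put both values on the raw scale first.
raw_aic_model_b = normalized_aic_model_b * n
fair_choice = "B" if raw_aic_model_b < raw_aic_model_a else "A"

print("naive choice:", naive_choice)
print("fair choice: ", fair_choice)
```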


9) Practical decision logic you can apply directly

A strong, general-purpose approach looks like this:

  1. Choose a plausible candidate set (based on ACF/PACF and domain knowledge).
  2. Fit all candidates using the same estimation routine (to keep AIC/AICc comparable).
  3. Rank by AICc and identify:
    • the best model,
    • and any models with AICc within a small margin (often differences under ~2 are considered “close” in practice, though the exact threshold depends on context).
  4. Run residual diagnostics on the close contenders.
  5. Choose the simplest model that passes diagnostics well, unless there is a clear advantage to complexity.
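
Steps 3 and 5 can be sketched as follows. The AICc values are hypothetical, the helper names `shortlist` and `simplest` are illustrative, and the margin of 2 is the rule of thumb mentioned above, not a fixed standard. Step 4 (residual diagnostics) is not modelled here; in practice it filters the shortlist before the final choice.

```python
def shortlist(aicc_by_model, margin=2.0):
    """Return models whose AICc is within `margin` of the minimum, best first."""
    best = min(aicc_by_model.values())
    close = [m for m, score in aicc_by_model.items() if score - best <= margin]
    return sorted(close, key=aicc_by_model.get)


def simplest(models):
    """Pick the model with the fewest ARMA coefficients (p + q)."""
    return min(models, key=lambda order: order[0] + order[1])


# Hypothetical AICc values keyed by (p, q), all from one estimation routine.
aicc_by_model = {
    (1, 0): 213.9,
    (2, 0): 213.1,
    (1, 1): 212.6,
    (3, 1): 215.9,
}

contenders = shortlist(aicc_by_model)
choice = simplest(contenders)

print("close contenders (best first):", contenders)
print("simplest contender:", choice)
```

In this made-up example the minimum-AICc model is ARMA(1,1), but AR(1) is within the margin and has fewer coefficients, so it would be preferred provided its residual diagnostics are acceptable.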