1) Why ARIMA exists

So far, most modeling tools you have used assume stationarity (stable mean/variance and time-invariant dependence). Many real series are not stationary because they have trend (and later you will also address seasonality). The main strategy is:

  1. Transform the observed series into something plausibly stationary (most often by differencing).
  2. Fit a stationary ARMA model to the transformed series.
  3. Translate that stationary model back into a model for the original nonstationary series by “undoing” the differencing.

ARIMA formalizes exactly this workflow.


2) The definition: ARIMA(p,d,q)

2.1 Differencing operator

Let BB be the backshift operator:

BXt=Xt1B X_t = X_{t-1}​.

Define the first-difference operator:

Xt=(1B)Xt=XtXt1.\nabla X_t = (1-B)X_t = X_t – X_{t-1}.

Higher-order differencing is repeated application:

dXt=(1B)dXt.\nabla^d X_t = (1-B)^d X_t.

2.2 ARIMA definition

A process (Xt)(X_t) is ARIMA(p,d,q) if the differenced series

Yt=(1B)dXt=dXtY_t = (1-B)^d X_t = \nabla^d X_t

is a causal ARMA(p,q) process.

Equivalently, the model satisfies

ϕ(B)(1B)dXt=θ(B)Wt,\phi(B)(1-B)^d X_t = \theta(B)W_t,

where

  • ϕ(B)=1ϕ1BϕpBp\phi(B) = 1 – \phi_1 B – \cdots – \phi_p B^p (AR polynomial),
  • θ(B)=1+θ1B++θqBq\theta(B) = 1 + \theta_1 B + \cdots + \theta_q B^q (MA polynomial),
  • (Wt)(W_t) is white noise (often assumed Gaussian for MLE-based fitting).

Interpretation: ARIMA is an ARMA model applied to the ddd-times differenced series.


3) What simulations are illustrating (intuition)

3.1 Differencing makes a trending series look stationary

When you simulate an ARIMA(1,1,2) process, the original XtX_t​ typically wanders and exhibits nonstationary behavior (its level drifts). If you compute Yt=XtY_t=\nabla X_t​, that differenced series looks like a stationary ARMA(1,2) series.

This is the key operational idea:
If the original plot looks like it drifts, $\nabla X_t$ may remove that drift and reveal stationary dependence.

3.2 “Un-differencing” (integrating) makes the series smoother

If XtX_t​ is already a once-differenced process, then cumulative summation is the discrete analogue of integration:

  • If Yt=XtY_t = \nabla X_t, then XtX_t​ is essentially a cumulative sum of YtY_t​ (plus an initial value).

When you move from an ARIMA(1,1,2) to ARIMA(1,2,2), you are integrating one more time, and the resulting series becomes even smoother (because you are summing something that is already a cumulative behavior).

A useful heuristic (not a proof, but a mental model):

  • differencing \approx derivative (less smooth),
  • cumulative sum (un-differencing) \approx integral (more smooth).

4) Why the sample ACF “decays slowly” for nonstationary series

The true theoretical ACF is defined only for stationary processes. However, you can compute a sample ACF for any observed series, even nonstationary ones.

If the observed series is nonstationary (especially integrated, d1d\ge 1), the sample ACF typically shows:

  • very high correlation at many lags,
  • slow decay toward zero.

This is a practical red flag that differencing may be needed.

A key point emphasized by the example ARIMA(0,1,2):
even with no AR terms, the integrated structure (d=1d=1) alone can create very slow ACF decay in the raw series. The slow decay is not “because AR exists”; it is “because the series is integrated.”


5) The practical ambiguity: (1B)(1-B) versus (1ϕB)(1-\phi B) when ϕ1\phi\approx 1

A random walk satisfies:

(1B)Xt=Wt(unit root).(1-B)X_t = W_t \quad \text{(unit root)}.

An AR(1) satisfies:

(1ϕB)Xt=Wtwith ϕ<1(stationary).(1-\phi B)X_t = W_t \quad \text{with } |\phi|<1 \quad \text{(stationary)}.

If ϕ\phi is very close to 1 (e.g., 0.99), the behavior over moderate sample sizes can look extremely similar to a random walk. This creates a real modeling decision:

  • Is it truly nonstationary (unit root; needs differencing)?
  • Or is it stationary but highly persistent (AR coefficient near 1)?

This is why you need a statistical unit-root test rather than relying only on visual inspection.


6) Augmented Dickey–Fuller (ADF): deciding whether differencing is needed

6.1 Hypotheses and what “unit root” means

For a simple AR(1)-type representation:

Xt=ϕXt1+Wt,X_t = \phi X_{t-1} + W_t,

the null hypothesis of a unit root is:

H0:ϕ=1H_0: \phi = 1

which corresponds to a random walk-like behavior (nonstationary).

The alternative is:

H1:ϕ<1H_1: \phi < 1

which supports stationarity (so differencing is not required purely for stationarity).

Rewriting using differences:

Xt=(ϕ1)Xt1+Wt.\nabla X_t = (\phi – 1)X_{t-1} + W_t.

Let γ=ϕ1\gamma^* = \phi-1. Then:

  • H0:γ=0H_0: \gamma^* = 0
  • H1:γ<0H_1: \gamma^* < 0

ADF estimates γ\gamma (via OLS in a regression form) and then compares a standardized statistic to a nonstandard reference distribution (not normal). That is why typical t-test critical values do not apply.

6.2 Why “augmented”

If the errors are autocorrelated, the basic test is invalid. The “augmented” version includes lagged differences to soak up autocorrelation (conceptually: it tries to make the regression residuals closer to white noise).

So you must choose a lag order.

6.3 The role of deterministic terms: no constant / constant / constant+trend

When running the test in software, you choose among:

  • nc: no constant (appropriate only if mean is truly zero),
  • c: constant allowed,
  • ct: constant and deterministic trend allowed.

This choice matters a lot because the critical values change substantially. The listed quantiles show that the threshold for rejection becomes more negative when you allow more deterministic structure (especially trend). Practically:

  • If you include trend, it is harder to reject the unit root.
  • If you omit trend when it exists, you can get misleading results.

7) Applying ADF to LakeHuron: why results depend on lag and trend choice

For LakeHuron, you must include a constant (mean is not zero). Then:

  • With lag = 4, p-values are not small (cannot reject unit root).
  • With lag = 2, p-values become smaller, and whether you reject depends on whether you allow trend.

This is a realistic outcome: near-unit-root behavior plus limited sample size can yield borderline decisions. The important modeling implication is not “ADF gives a single final truth,” but:

  • You use ADF as evidence, alongside ACF/PACF patterns and diagnostic behavior of fitted models.
  • If stationary ARMA fits (like AR(2) or ARMA(1,1)) behave reasonably and diagnostics look good, they remain viable candidates even if unit-root evidence is not perfectly decisive.

8) Full ARIMA fitting workflow using BJsales (end-to-end pattern)

8.1 Decide differencing order dd

  • The raw BJsales plot clearly trends → nonstationary.
  • ADF on raw series yields a very large p-value (supports unit root / nonstationary).
  • After one difference, the series looks stationary.
  • ADF on the differenced series yields a very small p-value (reject unit root), so do not difference again.

Conclusion:

choose d=1d=1.

This is the logic:
Keep differencing until the differenced series is plausibly stationary, but avoid unnecessary extra differencing.

8.2 Choose ARMA(p,q) for the differenced series

Now fit ARMA models to Yt=XtY_t=\nabla X_t​. Use an information criterion (often AICc) to compare candidate ARMA(p,q) models.

In the output shown, the best AICc model for the differenced series is:

  • ARMA(1,1) with zero mean (for the differenced series).

So:

Yt=Xt follows ARMA(1,1)Xt is ARIMA(1,1,1).Y_t = \nabla X_t \ \text{follows ARMA(1,1)} \quad \Rightarrow \quad X_t \ \text{is ARIMA(1,1,1)}.

Written in operator form, the ARIMA(1,1,1) statement is:

(1ϕ1B)(1B)Xt=(1+θ1B)Wt,(1-\phi_1 B)(1-B)X_t = (1+\theta_1 B)W_t,

with fitted coefficients from the selected ARMA(1,1) model on the differenced series.

8.3 Diagnostics are mandatory

After selecting a candidate ARIMA model, you check standardized residuals:

  • residual time plot: no obvious nonstationary structure,
  • residual ACF: no significant spikes,
  • normal Q–Q: close to a line (minor tail deviations usually acceptable),
  • Ljung–Box p-values: not small across a range of lags (no evidence of residual autocorrelation).

If these are all acceptable, the fitted model is considered adequate.


9) ARIMA(0,1,1) and simple exponential smoothing: why they are connected

9.1 The model form

ARIMA(0,1,1) (also called IMA(1,1) in the material) is:

(1B)Xt=(1+θB)Wt.(1-B)X_t = (1+\theta B)W_t.

To emphasize invertibility, rewrite with c=θc=-\theta:

(1B)Xt=(1cB)Wt,c<1.(1-B)X_t = (1-cB)W_t,\quad |c|<1.

Invertibility (c<1|c|<1) lets you expand:

(1cB)1=j=0cjBj.(1-cB)^{-1} = \sum_{j=0}^{\infty} c^j B^j.

This leads to a representation where the optimal one-step predictor can be written recursively as a weighted average of:

  • the current observation,
  • the previous forecast.

9.2 The forecasting recursion becomes exponential smoothing

Define α=1c\alpha = 1-c. Then the one-step forecast update has the form:

X^t+1=αXt+(1α)X^t,\hat{X}_{t+1} = \alpha X_t + (1-\alpha)\hat{X}_t,

and (in the finite-data version):

X^t+1t=αXt+(1α)X^tt1.\hat{X}_{t+1|t} = \alpha X_t + (1-\alpha)\hat{X}_{t|t-1}.

If α(0,1]\alpha \in (0,1], each new forecast is a convex combination (weighted average) of:

  • the newest data point XtX_t,
  • the previous forecast.

Iterating the recursion shows that older observations receive weights proportional to α(1α)j1\alpha(1-\alpha)^{j-1}, which decay geometrically—hence “exponential” smoothing.

9.3 Why HoltWinters estimates α\alpha

HoltWinters with beta = FALSE, gamma = FALSE is fitting the simple exponential smoothing form, choosing α\alpha (often by minimizing one-step-ahead squared prediction errors).

In the example, the estimated α\alpha is close to the theoretical mapping α=1c=1+θ\alpha = 1-c = 1+\theta when the data truly comes from an ARIMA(0,1,1) process.

The caution in the material is important:

  • You can force α\alpha to be arbitrary, but then the procedure no longer corresponds to the justified ARIMA(0,1,1) probabilistic model unless that α\alpha is consistent with the fitted θ\theta.

10) What you should retain operationally

  1. ARIMA(p,d,q) means: after differencing dd times, you can fit a stationary ARMA(p,q).
  2. Nonstationary series often show: slow-decaying sample ACF.
  3. Near-unit-root AR(1) and random walk can look alike; use ADF for evidence, not as an infallible oracle.
  4. A disciplined workflow:
    • pick dd by differencing + ADF + visual stationarity checks,
    • pick (p,q) for the differenced series via AICc + shortlisting,
    • validate with residual diagnostics.
  5. ARIMA(0,1,1) connects directly to simple exponential smoothing, where α\alphaα is the smoothing parameter and has a model-based interpretation.