1) Why ARIMA exists
So far, most modeling tools you have used assume stationarity (stable mean/variance and time-invariant dependence). Many real series are not stationary because they have trend (and later you will also address seasonality). The main strategy is:
- Transform the observed series into something plausibly stationary (most often by differencing).
- Fit a stationary ARMA model to the transformed series.
- Translate that stationary model back into a model for the original nonstationary series by “undoing” the differencing.
ARIMA formalizes exactly this workflow.
2) The definition: ARIMA(p,d,q)
2.1 Differencing operator
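Let $B$ be the backshift operator:
$B X_t = X_{t-1}$.
Define the first-difference operator:
$\nabla = 1 - B$, so that $\nabla X_t = X_t - X_{t-1}$.
Higher-order differencing is repeated application:
$\nabla^d = (1 - B)^d$; for example, $\nabla^2 X_t = X_t - 2X_{t-1} + X_{t-2}$.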
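As a minimal sketch of repeated differencing (the helper name `difference` is illustrative, not from the material), each pass replaces the series by $X_t - X_{t-1}$ and shortens it by one value:

```python
def difference(x, d=1):
    """Apply the first-difference operator d times to a list of numbers."""
    out = list(x)
    for _ in range(d):
        # Each pass replaces the series by x[t] - x[t-1], shortening it by one.
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out


squares = [t * t for t in range(1, 7)]   # 1, 4, 9, 16, 25, 36
print(difference(squares, 1))            # [3, 5, 7, 9, 11]
print(difference(squares, 2))            # [2, 2, 2, 2]
```

A quadratic sequence becomes constant after two differences, which is the discrete counterpart of a second derivative.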
2.2 ARIMA definition
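A process $\{X_t\}$ is ARIMA(p,d,q) if the differenced series
$Y_t = \nabla^d X_t = (1 - B)^d X_t$
is a causal ARMA(p,q) process.
Equivalently, the model satisfies
$\phi(B)(1 - B)^d X_t = \theta(B) Z_t$,
where
- $\phi(z) = 1 - \phi_1 z - \dots - \phi_p z^p$ (AR polynomial),
- $\theta(z) = 1 + \theta_1 z + \dots + \theta_q z^q$ (MA polynomial),
- $\{Z_t\}$ is white noise (often assumed Gaussian for MLE-based fitting).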
Interpretation: ARIMA is an ARMA model applied to the d-times differenced series.
3) What simulations are illustrating (intuition)
3.1 Differencing makes a trending series look stationary
When you simulate an ARIMA(1,1,2) process, the original series typically wanders and exhibits nonstationary behavior (its level drifts). If you compute $\nabla X_t = X_t - X_{t-1}$, that differenced series looks like a stationary ARMA(1,2) series.
This is the key operational idea:
If the original plot looks like it drifts, $\nabla X_t$ may remove that drift and reveal stationary dependence.
3.2 “Un-differencing” (integrating) makes the series smoother
If $\nabla X_t$ is the once-differenced process, then cumulative summation is the discrete analogue of integration:
- If $Y_t = X_t - X_{t-1}$, then $X_t = X_0 + \sum_{j=1}^{t} Y_j$ is essentially a cumulative sum of $Y_t$ (plus an initial value).
When you move from an ARIMA(1,1,2) to ARIMA(1,2,2), you are integrating one more time, and the resulting series becomes even smoother (because you are summing something that is already a cumulative behavior).
A useful heuristic (not a proof, but a mental model):
- differencing ≈ derivative (less smooth),
- cumulative sum (un-differencing) ≈ integral (more smooth).
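The simulation idea in this section can be sketched with the standard library alone. The coefficients (0.5, 0.4, 0.3), the seed, the burn-in length, and the function names are illustrative assumptions, not values from the material. The sketch simulates a stationary ARMA(1,2) series, integrates it once and twice, and checks that differencing undoes one integration.

```python
import random


def simulate_arma(n, phi, thetas, rng):
    """Simulate n values of a zero-mean ARMA(1, q) process with Gaussian noise."""
    q = len(thetas)
    burn = 200                                   # discard start-up transients
    z = [rng.gauss(0.0, 1.0) for _ in range(n + burn + q)]
    y, prev = [], 0.0
    for t in range(q, n + burn + q):
        ma_part = z[t] + sum(th * z[t - j - 1] for j, th in enumerate(thetas))
        prev = phi * prev + ma_part
        y.append(prev)
    return y[burn:]


def integrate(y, x0=0.0):
    """Undo one difference: cumulative sum of y starting from the level x0."""
    x, level = [x0], x0
    for value in y:
        level += value
        x.append(level)
    return x


rng = random.Random(1)                           # fixed seed, arbitrary choice
y = simulate_arma(300, phi=0.5, thetas=[0.4, 0.3], rng=rng)   # ARMA(1,2)
x1 = integrate(y)                                # ARIMA(1,1,2) path
x2 = integrate(x1)                               # ARIMA(1,2,2) path

# Differencing the integrated path recovers the ARMA series (up to rounding).
recovered = [x1[t] - x1[t - 1] for t in range(1, len(x1))]
print(max(abs(a - b) for a, b in zip(recovered, y)))
```

Plotting `y`, `x1`, and `x2` side by side should show the progression from rough to smooth that the heuristic describes.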
4) Why the sample ACF “decays slowly” for nonstationary series
The true theoretical ACF is defined only for stationary processes. However, you can compute a sample ACF for any observed series, even nonstationary ones.
If the observed series is nonstationary (especially integrated, $d \ge 1$), the sample ACF typically shows:
- very high correlation at many lags,
- slow decay toward zero.
This is a practical red flag that differencing may be needed.
A key point emphasized by the example ARIMA(0,1,2):
even with no AR terms, the integrated structure ($d = 1$) alone can create very slow ACF decay in the raw series. The slow decay is not “because AR exists”; it is “because the series is integrated.”
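A rough way to see this numerically is to compute the sample ACF of a simulated ARIMA(0,1,2) path and of its increments. The MA coefficients (0.5, 0.3), the seed, and the series length are illustrative assumptions, and `sample_acf` is a hand-written helper rather than a library routine.

```python
import random


def sample_acf(x, max_lag):
    """Sample autocorrelations at lags 0..max_lag (divisor n at every lag)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev) / n
    return [sum(dev[t] * dev[t + h] for t in range(n - h)) / n / c0
            for h in range(max_lag + 1)]


rng = random.Random(7)                   # fixed seed, arbitrary choice
z = [rng.gauss(0.0, 1.0) for _ in range(2002)]
# MA(2) increments with illustrative coefficients; these are the differenced series.
ma2 = [z[t] + 0.5 * z[t - 1] + 0.3 * z[t - 2] for t in range(2, len(z))]
x, level = [], 0.0
for inc in ma2:
    level += inc
    x.append(level)                      # x is an ARIMA(0,1,2) path with no AR terms

raw_acf = sample_acf(x, 20)
diff_acf = sample_acf(ma2, 20)
print([round(r, 2) for r in raw_acf[:6]])    # expected to stay high at every lag
print([round(r, 2) for r in diff_acf[:6]])   # expected to be small beyond lag 2
```

The raw series has no AR component at all, yet its sample ACF should decay very slowly, while the increments should show the short memory of an MA(2).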
5) The practical ambiguity: random walk versus AR(1) when $\phi \approx 1$
A random walk satisfies:
$X_t = X_{t-1} + Z_t$.
An AR(1) satisfies:
$X_t = \phi X_{t-1} + Z_t$, with $|\phi| < 1$.
If $\phi$ is very close to 1 (e.g., 0.99), the behavior over moderate sample sizes can look extremely similar to a random walk. This creates a real modeling decision:
- Is it truly nonstationary (unit root; needs differencing)?
- Or is it stationary but highly persistent (AR coefficient near 1)?
This is why you need a statistical unit-root test rather than relying only on visual inspection.
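To get a feel for how alike the two cases are, one can drive a random walk and an AR(1) with coefficient 0.99 using the same shocks. This is only an illustrative sketch: the seed, the length of 200, and the helper name `simulate_ar1` are assumptions.

```python
import random


def simulate_ar1(shocks, phi, x0=0.0):
    """Return X_t = phi * X_{t-1} + shock_t; phi = 1 gives a random walk."""
    path, level = [], x0
    for z in shocks:
        level = phi * level + z
        path.append(level)
    return path


rng = random.Random(3)                       # fixed seed, arbitrary choice
shocks = [rng.gauss(0.0, 1.0) for _ in range(200)]
rw = simulate_ar1(shocks, 1.0)               # unit root
near = simulate_ar1(shocks, 0.99)            # stationary but highly persistent

# Compare the largest gap between the two paths with the range of the walk.
gap = max(abs(a - b) for a, b in zip(rw, near))
spread = max(rw) - min(rw)
print(round(gap, 2), round(spread, 2))
```

Over a short sample the two paths share most of their visible shape, which is why a plot alone rarely settles the question.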
6) Augmented Dickey–Fuller (ADF): deciding whether differencing is needed
6.1 Hypotheses and what “unit root” means
For a simple AR(1)-type representation:
$X_t = \phi X_{t-1} + Z_t$,
the null hypothesis of a unit root is:
$H_0: \phi = 1$,
which corresponds to a random walk-like behavior (nonstationary).
The alternative is:
$H_1: |\phi| < 1$,
which supports stationarity (so differencing is not required purely for stationarity).
Rewriting using differences:
Let $\gamma = \phi - 1$. Then:
$\nabla X_t = \gamma X_{t-1} + Z_t$, so the unit-root null becomes $H_0: \gamma = 0$.
ADF estimates $\gamma$ (via OLS in a regression form) and then compares a standardized statistic to a nonstandard reference distribution (not normal). That is why typical t-test critical values do not apply.
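The regression behind the basic (non-augmented) statistic can be sketched directly: regress the differences on the lagged level, with no constant, and form the usual t-ratio for the slope (called `gamma_hat` here, an assumed name). The function name and the toy series are illustrative; the toy series only checks the arithmetic and is not a meaningful test case.

```python
import math


def dickey_fuller_t(x):
    """t-ratio for gamma in  diff(x)_t = gamma * x_{t-1} + error  (no constant)."""
    dx = [x[t] - x[t - 1] for t in range(1, len(x))]
    lagged = x[:-1]
    sxx = sum(v * v for v in lagged)
    gamma_hat = sum(l * d for l, d in zip(lagged, dx)) / sxx     # OLS slope
    resid = [d - gamma_hat * l for l, d in zip(lagged, dx)]
    s2 = sum(r * r for r in resid) / (len(dx) - 1)   # one estimated coefficient
    se = math.sqrt(s2 / sxx)
    return gamma_hat, gamma_hat / se


gamma_hat, t_stat = dickey_fuller_t([1.0, 2.0, 1.0, 2.0, 1.0])
print(gamma_hat, t_stat)
```

The t-ratio is computed exactly as in ordinary regression; what changes under the unit-root null is the distribution it must be compared against.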
6.2 Why “augmented”
If the errors in this regression are autocorrelated, the reference distribution of the basic Dickey–Fuller statistic no longer applies, so the basic test is invalid. The “augmented” version includes lagged differences to soak up autocorrelation (conceptually: it tries to make the regression residuals closer to white noise).
So you must choose a lag order.
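One common way to write the augmented regression is shown below. The symbols $\gamma$ (coefficient on the lagged level), $\psi_j$ (coefficients on the lagged differences), $k$ (lag order), and $Z_t$ (white noise) are notation assumed for this sketch.

```latex
\nabla X_t = \gamma X_{t-1} + \sum_{j=1}^{k} \psi_j \nabla X_{t-j} + Z_t,
\qquad H_0: \gamma = 0 \quad \text{versus} \quad H_1: \gamma < 0 .
```

With $k = 0$ this reduces to the basic regression; a larger $k$ absorbs more autocorrelation at the cost of estimating more coefficients from the same data.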
6.3 The role of deterministic terms: no constant / constant / constant+trend
When running the test in software, you choose among:
- nc: no constant (appropriate only if mean is truly zero),
- c: constant allowed,
- ct: constant and deterministic trend allowed.
This choice matters a lot because the critical values change substantially. The listed quantiles show that the threshold for rejection becomes more negative when you allow more deterministic structure (especially trend). Practically:
- If you include trend, it is harder to reject the unit root.
- If you omit trend when it exists, you can get misleading results.
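A small sketch makes the effect of the deterministic terms concrete. The numbers below are the commonly tabulated large-sample 5% Dickey-Fuller critical values, not the quantiles listed in the material; finite-sample values differ somewhat, and the helper name is illustrative.

```python
# Approximate large-sample 5% Dickey-Fuller critical values by deterministic terms.
DF_CRITICAL_5PCT = {"nc": -1.95, "c": -2.86, "ct": -3.41}


def rejects_unit_root(t_stat, spec):
    """True when the statistic is more negative than the 5% critical value."""
    return t_stat < DF_CRITICAL_5PCT[spec]


for spec in ("nc", "c", "ct"):
    print(spec, rejects_unit_root(-3.0, spec))
```

The same statistic of -3.0 rejects the unit root under "nc" and "c" but not under "ct", which is the sense in which allowing a trend makes rejection harder.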
7) Applying ADF to LakeHuron: why results depend on lag and trend choice
For LakeHuron, you must include a constant (mean is not zero). Then:
- With lag = 4, p-values are not small (cannot reject unit root).
- With lag = 2, p-values become smaller, and whether you reject depends on whether you allow trend.
This is a realistic outcome: near-unit-root behavior plus limited sample size can yield borderline decisions. The important modeling implication is not “ADF gives a single final truth,” but:
- You use ADF as evidence, alongside ACF/PACF patterns and diagnostic behavior of fitted models.
- If stationary ARMA fits (like AR(2) or ARMA(1,1)) behave reasonably and diagnostics look good, they remain viable candidates even if unit-root evidence is not perfectly decisive.
8) Full ARIMA fitting workflow using BJsales (end-to-end pattern)
8.1 Decide differencing order
- The raw BJsales plot clearly trends → nonstationary.
- ADF on raw series yields a very large p-value (supports unit root / nonstationary).
- After one difference, the series looks stationary.
- ADF on the differenced series yields a very small p-value (reject unit root), so do not difference again.
Conclusion:
choose $d = 1$.
This is the logic:
Keep differencing until the differenced series is plausibly stationary, but avoid unnecessary extra differencing.
8.2 Choose ARMA(p,q) for the differenced series
Now fit ARMA models to $\nabla X_t$. Use an information criterion (often AICc) to compare candidate ARMA(p,q) models.
In the output shown, the best AICc model for the differenced series is:
- ARMA(1,1) with zero mean (for the differenced series).
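So:
the original BJsales series is modeled as ARIMA(1,1,1).
Written in operator form, the ARIMA(1,1,1) statement is:
$(1 - \phi B)(1 - B) X_t = (1 + \theta B) Z_t$,
with fitted coefficients $\phi$ and $\theta$ from the selected ARMA(1,1) model on the differenced series.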
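Multiplying out the operators shows what the ARIMA(1,1,1) statement means for the original series. Here $\phi$ and $\theta$ denote the AR and MA coefficients and $Z_t$ the white noise, notation assumed for this sketch.

```latex
\begin{aligned}
(1 - \phi B)(1 - B) X_t &= (1 + \theta B) Z_t \\
\bigl(1 - (1 + \phi) B + \phi B^2\bigr) X_t &= Z_t + \theta Z_{t-1} \\
X_t &= (1 + \phi) X_{t-1} - \phi X_{t-2} + Z_t + \theta Z_{t-1}
\end{aligned}
```

The AR side of the expanded form has a root at $z = 1$, which is exactly the unit root that one difference removes.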
8.3 Diagnostics are mandatory
After selecting a candidate ARIMA model, you check standardized residuals:
- residual time plot: no obvious nonstationary structure,
- residual ACF: no significant spikes,
- normal Q–Q: close to a line (minor tail deviations usually acceptable),
- Ljung–Box p-values: not small across a range of lags (no evidence of residual autocorrelation).
If these are all acceptable, the fitted model is considered adequate.
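The Ljung-Box statistic used in the last check can be sketched as follows. The function name is illustrative, and the toy residuals only verify the arithmetic. For residuals from a fitted ARMA(p,q), the statistic is usually compared with a chi-square distribution on `max_lag - p - q` degrees of freedom; the p-value step is omitted because it needs a chi-square CDF that the standard library does not provide.

```python
def ljung_box_q(resid, max_lag):
    """Ljung-Box Q = n (n + 2) * sum_{k=1..max_lag} rho_k^2 / (n - k)."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [r - mean for r in resid]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation of the residuals at lag k.
        rho_k = sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        total += rho_k * rho_k / (n - k)
    return n * (n + 2) * total


# Strongly alternating toy residuals give a large Q at lag 1.
print(ljung_box_q([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], 1))
```

A large Q relative to the chi-square reference corresponds to a small p-value, which signals leftover autocorrelation in the residuals.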
9) ARIMA(0,1,1) and simple exponential smoothing: why they are connected
9.1 The model form
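ARIMA(0,1,1) (also called IMA(1,1) in the material) is:
$X_t = X_{t-1} + Z_t + \theta Z_{t-1}$.
To emphasize invertibility, rewrite with $\lambda = -\theta$:
$X_t = X_{t-1} + Z_t - \lambda Z_{t-1}$.
Invertibility ($|\lambda| < 1$) lets you expand:
$X_t = \sum_{j=1}^{\infty} (1 - \lambda) \lambda^{j-1} X_{t-j} + Z_t$.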
This leads to a representation where the optimal one-step predictor can be written recursively as a weighted average of:
- the current observation,
- the previous forecast.
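The step from the infinite expansion to the recursion can be written out as follows. The notation is assumed for this sketch: $\lambda$ is the negated MA coefficient with $|\lambda| < 1$, and $\tilde{X}_{n+1}$ is the one-step predictor based on the infinite past. Because the future noise term cannot be predicted, the predictor is the weighted sum of past values; splitting off the first term gives

```latex
\begin{aligned}
\tilde{X}_{n+1} &= \sum_{j=1}^{\infty} (1 - \lambda) \lambda^{j-1} X_{n+1-j} \\
&= (1 - \lambda) X_n + \lambda \sum_{j=1}^{\infty} (1 - \lambda) \lambda^{j-1} X_{n-j} \\
&= (1 - \lambda) X_n + \lambda \tilde{X}_n .
\end{aligned}
```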
9.2 The forecasting recursion becomes exponential smoothing
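Define $\alpha = 1 - \lambda$. Then the one-step forecast update has the form:
$\tilde{X}_{n+1} = (1 - \lambda) X_n + \lambda \tilde{X}_n = \alpha X_n + (1 - \alpha) \tilde{X}_n$,
and (in the finite-data version):
$\tilde{X}^{n}_{n+1} = \alpha X_n + (1 - \alpha) \tilde{X}^{n-1}_{n}$ for $n \ge 1$, with a starting value such as $\tilde{X}^{0}_{1} = X_1$.
If $0 < \alpha < 1$, each new forecast is a convex combination (weighted average) of:
- the newest data point $X_n$,
- the previous forecast.
Iterating the recursion shows that an observation $j$ steps back receives a weight proportional to $(1 - \alpha)^{j-1}$. These weights decay geometrically, hence “exponential” smoothing.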
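A minimal sketch of the recursion and its implied weights follows. It assumes the start-up convention that the first forecast equals the first observation; the function names are illustrative.

```python
def ses_forecasts(x, alpha):
    """One-step forecasts for times 2..n+1 from simple exponential smoothing."""
    forecast = x[0]                 # start-up convention: first forecast = first value
    out = []
    for obs in x:
        # New forecast = weighted average of newest observation and old forecast.
        forecast = alpha * obs + (1 - alpha) * forecast
        out.append(forecast)
    return out


def ses_weights(alpha, k):
    """Weights on the k most recent observations, newest first."""
    return [alpha * (1 - alpha) ** j for j in range(k)]


print(ses_forecasts([1.0, 2.0, 3.0], 0.5))   # [1.0, 1.5, 2.25]
print(ses_weights(0.5, 3))                   # [0.5, 0.25, 0.125]
```

Each weight is the previous one multiplied by $1 - \alpha$, so a larger smoothing parameter discounts old observations faster.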
9.3 Why HoltWinters estimates $\alpha$
HoltWinters with beta = FALSE, gamma = FALSE is fitting the simple exponential smoothing form, choosing $\alpha$ (often by minimizing one-step-ahead squared prediction errors).
In the example, the estimated $\alpha$ is close to the theoretical mapping $\alpha = 1 - \lambda$ when the data truly comes from an ARIMA(0,1,1) process.
The caution in the material is important:
- You can force $\alpha$ to be arbitrary, but then the procedure no longer corresponds to the justified ARIMA(0,1,1) probabilistic model unless that $\alpha$ is consistent with the fitted $\lambda$.
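The link between the smoothing parameter and the MA coefficient can be checked on simulated data. The sketch below is not the HoltWinters routine itself: it uses a crude grid search over the smoothing parameter, and the true value 0.6 for the negated MA coefficient (`lam`), the seed, and the series length are illustrative assumptions.

```python
import random


def one_step_sse(x, alpha):
    """Sum of squared one-step errors of simple exponential smoothing."""
    forecast, sse = x[0], 0.0       # the first value serves as the forecast of the second
    for obs in x[1:]:
        err = obs - forecast
        sse += err * err
        forecast = alpha * obs + (1 - alpha) * forecast
    return sse


def fit_alpha(x, grid_size=99):
    """Grid search over alpha in (0, 1); a stand-in for a numerical optimiser."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return min(grid, key=lambda a: one_step_sse(x, a))


lam = 0.6                                    # assumed true value, so alpha = 0.4
rng = random.Random(11)                      # fixed seed, arbitrary choice
z = [rng.gauss(0.0, 1.0) for _ in range(2001)]
x, level = [], 0.0
for t in range(1, len(z)):
    level += z[t] - lam * z[t - 1]           # ARIMA(0,1,1) increments
    x.append(level)

alpha_hat = fit_alpha(x)
print(alpha_hat, 1 - lam)
```

With a long simulated series the fitted value should land near 0.4, in line with the model-based interpretation of the smoothing parameter.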
10) What you should retain operationally
- ARIMA(p,d,q) means: after differencing $d$ times, you can fit a stationary ARMA(p,q).
- Nonstationary series often show: slow-decaying sample ACF.
- Near-unit-root AR(1) and random walk can look alike; use ADF for evidence, not as an infallible oracle.
- A disciplined workflow:
- pick $d$ by differencing + ADF + visual stationarity checks,
- pick (p,q) for the differenced series via AICc + shortlisting,
- validate with residual diagnostics.
- ARIMA(0,1,1) connects directly to simple exponential smoothing, where α is the smoothing parameter and has a model-based interpretation.
