This section develops the theoretical backbone of linear forecasting for stationary time series. It introduces:
- the best linear predictor (BLP) and its defining “orthogonality” conditions,
- how the one-step and m-step predictors are computed from the autocovariance / autocorrelation structure,
- the partial autocorrelation function (PACF) and why it identifies AR order, and
- the Durbin–Levinson algorithm, a fast recursive way to compute predictor coefficients and PACF without repeatedly inverting large matrices.
1. Best Linear Predictor (BLP)
1.1 Definition (MSE-minimizing linear combination)
Let $Y$ and $X_1,\dots,X_n$ be random variables with finite second moments. Assume we know:
- the means $\mu_Y = E[Y]$ and $\mu_i = E[X_i]$,
- all covariances $\mathrm{Cov}(X_i, X_j)$ and $\mathrm{Cov}(Y, X_i)$.
We want to predict $Y$ using the observed values of $X_1,\dots,X_n$. We restrict ourselves to linear predictors of the form
$\hat{Y} = a_0 + a_1 X_1 + \cdots + a_n X_n.$
The best linear predictor is the one that minimizes mean squared error:
$\min_{a_0,\dots,a_n} E\big[(Y - \hat{Y})^2\big].$
We denote it by
$P(Y \mid X_1,\dots,X_n).$
Key point: $P(Y \mid X_1,\dots,X_n)$ is a random variable built from $X_1,\dots,X_n$ and the second-moment structure; it does not use the realized value of $Y$. It is the optimal linear forecast given what you know.
Notation convention: $P(Y \mid \emptyset)$ means “predict with no predictors,” which is simply the constant $\mu_Y$. This is handy when writing expressions that should also make sense for $n = 0$.
1.2 Relationship to regression / projection
The BLP is the same object as:
- the linear regression of $Y$ on $X_1,\dots,X_n$ (with intercept),
- the orthogonal projection of $Y$ onto the span of $\{1, X_1,\dots,X_n\}$ in an inner-product space where $\langle U, V \rangle = E[UV]$.
If $(Y, X_1,\dots,X_n)$ are jointly multivariate normal, then the BLP equals the conditional expectation $E[Y \mid X_1,\dots,X_n]$. In general, the BLP is defined purely by second moments and linearity.
2. Core Properties: the “Prediction Equations” (Orthogonality)
Assume $\hat{Y} = a_0 + a_1 X_1 + \cdots + a_n X_n$ is the BLP.
Minimizing MSE implies first-order optimality conditions (taking partial derivatives w.r.t. the coefficients $a_0, a_1, \dots, a_n$). These yield:
2.1 Mean is preserved
$E[\hat{Y}] = E[Y] = \mu_Y.$
So the predictor has the same mean as the target.
2.2 Residual is orthogonal (uncorrelated) to all predictors
Define the residual (prediction error)
$e = Y - \hat{Y}.$
Then
$\mathrm{Cov}(e, X_i) = 0 \quad \text{for every } i = 1,\dots,n.$
Interpretation: once you subtract the best linear predictor, what remains has no linear correlation with any predictor used. The predictor “extracts all linear information” from $X_1,\dots,X_n$ about $Y$.
Also, since $\hat{Y}$ itself is a linear combination of $X_1,\dots,X_n$, it follows that $e$ is also uncorrelated with the predictor itself. Hence:
$Y = \hat{Y} + e$
is a decomposition into two uncorrelated components (useful for variance/MSE calculations).
3. Special Case $n = 1$: explicit coefficient formula
For a single predictor $X$,
$\hat{Y} = a_0 + a_1 X.$
The optimal slope is
$a_1 = \dfrac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)},$
and the intercept is
$a_0 = \mu_Y - a_1 \mu_X.$
This mirrors the vector projection formula in linear algebra.
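As a quick numerical check of these formulas (a minimal sketch with simulated data; the linear relationship and noise level below are arbitrary choices), the moment-based slope and intercept coincide with the coefficients of an ordinary least-squares fit:

```r
# Verify the n = 1 BLP formulas against R's built-in regression.
set.seed(1)
x <- rnorm(500)
y <- 2 + 0.6 * x + rnorm(500)   # arbitrary linear relationship plus noise

a1 <- cov(y, x) / var(x)        # optimal slope: Cov(Y, X) / Var(X)
a0 <- mean(y) - a1 * mean(x)    # intercept: mu_Y - a1 * mu_X

c(a0, a1)
coef(lm(y ~ x))                 # same numbers, up to floating point
```

The agreement is exact (not just approximate) because OLS solves the same orthogonality conditions in the empirical second moments.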
4. Specialization to Stationary Time Series Forecasting
The key forecasting setting is:
- $\{X_t\}$ is a mean-zero stationary process with autocovariance $\gamma(h)$ and autocorrelation $\rho(h)$,
- you observe $X_1, \dots, X_n$,
- you want to predict $X_{n+1}$ (one-step ahead) or $X_{n+m}$ (m-step ahead).
Because we assume mean zero, we can write predictors with no intercept term.
5. One-Step-Ahead Best Linear Predictor
5.1 Form of the predictor and coefficients
Define
$\hat{X}_{n+1} = \phi_{n1} X_n + \phi_{n2} X_{n-1} + \cdots + \phi_{nn} X_1.$
Here $\phi_{nj}$ is the coefficient on $X_{n+1-j}$, the value that is $j$ lags behind $X_{n+1}$.
5.2 Stationarity-based invariances
Stationarity implies:
- Time shifts don’t change coefficients: the same coefficients apply when you move the time window forward.
- Forecast/backcast symmetry: using evenness of the autocovariance, $\gamma(h) = \gamma(-h)$, the coefficients for predicting forward and “predicting backward” align (with appropriate ordering).
6. The Prediction Equations in Time Series Form (Toeplitz System)
Apply orthogonality with predictors $X_1,\dots,X_n$. For each $k = 1,\dots,n$,
$E\big[(X_{n+1} - \hat{X}_{n+1})\, X_{n+1-k}\big] = 0.$
Using stationarity, these become the Yule–Walker-type equations:
$\sum_{j=1}^{n} \phi_{nj}\, \gamma(k - j) = \gamma(k), \qquad k = 1,\dots,n.$
6.1 Matrix form
Let
- $\phi_n = (\phi_{n1}, \dots, \phi_{nn})'$,
- $\gamma_n = (\gamma(1), \dots, \gamma(n))'$,
- $\Gamma_n$ be the covariance matrix of $(X_1,\dots,X_n)$ with entries $[\Gamma_n]_{ij} = \gamma(i - j)$
(Toeplitz structure).
Then
$\Gamma_n \phi_n = \gamma_n.$
Equivalently, dividing by $\gamma(0)$ yields a correlation version:
$R_n \phi_n = \rho_n,$
where $[R_n]_{ij} = \rho(i - j)$ and $\rho_n = (\rho(1), \dots, \rho(n))'$.
6.2 Existence/uniqueness
Under mild regularity (e.g., $\gamma(0) > 0$ and $\gamma(h) \to 0$ as $h \to \infty$), $\Gamma_n$ is invertible, so:
$\phi_n = \Gamma_n^{-1} \gamma_n.$
This is a complete “in principle” solution, but direct inversion becomes expensive as $n$ grows.
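The correlation form can be solved directly in R with `toeplitz()` and `solve()`. A minimal sketch, using an AR(1) with $\phi = 0.5$ (so $\rho(h) = 0.5^h$) because the answer is known in advance: only the lag-1 coefficient should be nonzero.

```r
# Solve R_n phi_n = rho_n directly for an AR(1) with phi = 0.5.
n <- 5
rho <- 0.5^(1:n)                     # theoretical rho(1), ..., rho(n)
R <- toeplitz(c(1, rho[1:(n - 1)]))  # R_n has entries rho(i - j)
phi_n <- solve(R, rho)               # phi_n = R_n^{-1} rho_n
round(phi_n, 8)                      # (0.5, 0, 0, 0, 0): the BLP is 0.5 * X_n
```

This is exactly the pattern Section 7 predicts for a pure AR model: the BLP coefficients are the model coefficients, padded with zeros.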
7. One-Step Prediction for a Pure AR(p): Why the model coefficients appear
For a causal AR(p),
$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + W_t,$
with white noise $W_t$ uncorrelated with past $X$'s.
If $n \ge p$, then the optimal one-step predictor given $X_1,\dots,X_n$ is:
$\hat{X}_{n+1} = \phi_1 X_n + \phi_2 X_{n-1} + \cdots + \phi_p X_{n+1-p}.$
Reason: the residual under that predictor is exactly $W_{n+1}$, which is uncorrelated with all predictors, satisfying the defining orthogonality conditions.
If $n < p$, you do not have enough lags available, so the predictor does not simply reduce to the AR equation.
8. Mean-Squared Error (MSE) of the One-Step Predictor
Let the one-step prediction error be
$e_{n+1} = X_{n+1} - \hat{X}_{n+1}.$
Because $e_{n+1}$ is uncorrelated with the predictor, the MSE is:
$P_{n+1} = E\big[e_{n+1}^2\big] = \gamma(0) - \phi_n' \gamma_n = \gamma(0) - \sum_{j=1}^{n} \phi_{nj}\, \gamma(j).$
Interpretation: you start with the variance $\gamma(0)$ and subtract the part explained linearly by the past.
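A worked AR(1) instance of this formula (a sketch, using the standard AR(1) facts $\gamma(0) = \sigma_w^2/(1-\phi^2)$ and $\gamma(1) = \phi\,\gamma(0)$): the one-step predictor is $\hat{X}_{n+1} = \phi X_n$, so

$$
P_{n+1} = \gamma(0) - \phi\,\gamma(1) = \gamma(0)\,(1 - \phi^2) = \sigma_w^2,
$$

i.e. the one-step MSE is exactly the innovation variance, consistent with Section 7: the one-step residual is $W_{n+1}$.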
9. m-Step-Ahead Best Linear Predictor
For $m \ge 1$, write
$\hat{X}_{n+m} = \phi_{n1}^{(m)} X_n + \phi_{n2}^{(m)} X_{n-1} + \cdots + \phi_{nn}^{(m)} X_1.$
The orthogonality conditions become:
$E\big[(X_{n+m} - \hat{X}_{n+m})\, X_{n+1-k}\big] = 0, \qquad k = 1,\dots,n.$
This yields:
$\sum_{j=1}^{n} \phi_{nj}^{(m)}\, \gamma(k - j) = \gamma(m + k - 1), \qquad k = 1,\dots,n.$
In matrix form:
$\Gamma_n \phi_n^{(m)} = \gamma_n^{(m)},$
where $\gamma_n^{(m)} = (\gamma(m), \gamma(m+1), \dots, \gamma(m+n-1))'$.
The m-step MSE has the same structure:
$E\big[(X_{n+m} - \hat{X}_{n+m})^2\big] = \gamma(0) - \phi_n^{(m)\prime} \gamma_n^{(m)}.$
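Again a worked AR(1) sketch (using $\gamma(h) = \phi^{h}\gamma(0)$ for $h \ge 0$): the system is solved by $\phi_{n1}^{(m)} = \phi^m$ with all other coefficients zero, since $\phi^m \gamma(k-1) = \gamma(m+k-1)$ for every $k$. Hence

$$
\hat{X}_{n+m} = \phi^m X_n, \qquad E\big[(X_{n+m} - \hat{X}_{n+m})^2\big] = \gamma(0)\,\big(1 - \phi^{2m}\big),
$$

which grows toward the unconditional variance $\gamma(0)$ as $m \to \infty$: long-horizon forecasts revert to the mean and the MSE saturates.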
10. The Partial Autocorrelation Function (PACF)
10.1 Why PACF is needed
- ACF is excellent at identifying MA(q) because the MA(q) ACF cuts off after lag $q$.
- ACF does not cut off for AR(p) or ARMA(p,q); it decays.
- Distinguishing AR(p) from ARMA(p,q) can be difficult using ACF alone.
PACF provides an AR analog: for AR(p), the PACF cuts off after lag $p$.
10.2 Two equivalent definitions
Definition A (coefficient definition):
Let $\phi_{h1}, \dots, \phi_{hh}$ be the coefficients of the one-step predictor
$\hat{X}_{h+1} = \phi_{h1} X_h + \cdots + \phi_{hh} X_1.$
Then the PACF at lag $h$ is:
$\alpha(h) = \phi_{hh}.$
So it is the coefficient on the oldest term $X_1$ when predicting $X_{h+1}$ from $X_1, \dots, X_h$.
Definition B (residual correlation definition):
$\alpha(h) = \mathrm{Corr}\big(X_{h+1} - P(X_{h+1} \mid X_2,\dots,X_h),\; X_1 - P(X_1 \mid X_2,\dots,X_h)\big).$
Interpretation: it measures the direct linear dependence between $X_{h+1}$ and $X_1$ after removing the linear effects of the intervening values $X_2, \dots, X_h$.
10.3 “Cutoff after p” for AR(p)
For an AR(p), for $h > p$,
$P(X_{h+1} \mid X_1,\dots,X_h) = \phi_1 X_h + \cdots + \phi_p X_{h+1-p},$
which has no $X_1$ term, so the coefficient on $X_1$ is zero:
$\alpha(h) = \phi_{hh} = 0 \quad \text{for } h > p.$
Thus the PACF cuts off at lag $p$.
In contrast, PACF for MA(q) typically decays rather than cutting off.
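The cutoff-versus-decay contrast can be seen numerically from the theoretical PACF (a sketch; the AR(2) and MA(1) coefficients below are arbitrary illustrative choices):

```r
# Theoretical PACF: cuts off after lag 2 for AR(2), decays for MA(1).
pacf_ar2 <- ARMAacf(ar = c(0.3, 0.4), lag.max = 6, pacf = TRUE)
pacf_ma1 <- ARMAacf(ma = 0.7, lag.max = 6, pacf = TRUE)
round(pacf_ar2, 4)   # exactly zero beyond lag 2
round(pacf_ma1, 4)   # nonzero at every lag, shrinking in magnitude
```

This is the mirror image of the ACF story: the ACF cuts off for MA(q) and decays for AR(p), while the PACF cuts off for AR(p) and decays for MA(q).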
11. Durbin–Levinson Algorithm: Fast recursive computation of predictors and PACF
11.1 Why not just invert $\Gamma_n$?
The “closed-form” solution
$\phi_n = \Gamma_n^{-1} \gamma_n$
requires inverting an $n \times n$ matrix for each $n$, which is expensive for large $n$.
Durbin–Levinson exploits the Toeplitz structure to compute:
- all predictor coefficients recursively,
- and therefore the PACF values as the diagonal entries.
11.2 Core recursion (as given in your notes)
- Step 1: $\phi_{11} = \rho(1)$.
- Step $n\ge 2$: first compute the PACF at lag $n$:
$\phi_{nn} = \dfrac{\rho(n) - \sum_{k=1}^{n-1} \phi_{n-1,k}\, \rho(n-k)}{1 - \sum_{k=1}^{n-1} \phi_{n-1,k}\, \rho(k)}.$
Then update the earlier coefficients for $k = 1, \dots, n-1$:
$\phi_{nk} = \phi_{n-1,k} - \phi_{nn}\, \phi_{n-1,n-k}.$
So each new lag adds a new “reflection” coefficient (which is the PACF) and adjusts the previous coefficients symmetrically.
11.3 MSE recursion via PACF
The one-step prediction error variance satisfies:
$P_{n+1} = P_n \big(1 - \phi_{nn}^2\big), \qquad P_1 = \gamma(0).$
So once you have the PACF values, you can compute the MSEs efficiently.
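A minimal sketch checking this recursion on an AR(1) with $\phi = 0.5$ and $\sigma_w^2 = 1$ (so $\gamma(0) = 1/(1 - 0.25)$): once one observation is available, the one-step MSE should drop to $\sigma_w^2 = 1$ and stay there, since only $\phi_{11} = 0.5$ is nonzero.

```r
# MSE recursion P_{n+1} = P_n * (1 - phi_nn^2), started at P_1 = gamma(0).
g0 <- 1 / (1 - 0.5^2)                          # gamma(0) for AR(1), sigma^2 = 1
pacf_vals <- ARMAacf(ar = 0.5, lag.max = 5, pacf = TRUE)
P <- g0 * cumprod(1 - pacf_vals^2)             # P_2, P_3, ..., P_6
round(P, 8)                                    # all equal to 1 = sigma_w^2
```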
12. R code (high-level)
- Define the ARMA model and compute its theoretical ACF
```r
n <- 1000
phi <- c(0.3, 0.4)
th <- 0.7
ACF <- ARMAacf(ar = phi, ma = th, lag.max = n)[-1]
```
`phi <- c(0.3, 0.4)` and `th <- 0.7` specify an ARMA(2,1) model:
- AR part: coefficients 0.3 and 0.4
- MA part: coefficient 0.7
`ARMAacf(...)` returns the theoretical autocorrelation function (ACF) implied by that model. `lag.max = n` computes lags up to 1000. `[-1]` drops lag 0 (which is always 1 for the ACF), so `ACF[1]` corresponds to lag 1, etc.
- Implement Durbin–Levinson to compute one-step-ahead predictor coefficients for every order $j$
```r
phi.matrix <- matrix(NA, n, n)
phi.matrix[1, 1] <- ACF[1]
for (j in 2:n) {
  # Durbin-Levinson: the new diagonal entry is the PACF at lag j
  num <- ACF[j] - sum(phi.matrix[j - 1, (j - 1):1] * ACF[1:(j - 1)])
  denom <- 1 - sum(phi.matrix[j - 1, 1:(j - 1)] * ACF[1:(j - 1)])
  phi.matrix[j, j] <- num / denom
  # Update the earlier coefficients of the row
  for (k in 1:(j - 1)) {
    phi.matrix[j, k] <- phi.matrix[j - 1, k] - phi.matrix[j, j] * phi.matrix[j - 1, j - k]
  }
}
```
Note the parentheses in `1:(j - 1)`: in R, `1:j-1` parses as `(1:j) - 1`, which silently includes a zero index.
What this block is doing:
- `phi.matrix` is built so that row `j` contains the coefficients $\phi_{j1}, \dots, \phi_{jj}$ used in the best linear one-step predictor $\hat{X}_{j+1} = \phi_{j1} X_j + \cdots + \phi_{jj} X_1$.
- The recursion has two key parts for each `j`:
  - Compute the diagonal term `phi.matrix[j,j]` as `num / denom` (this is the PACF at lag $j$). This is the Durbin–Levinson formula for $\phi_{jj}$, using the ACF values and the previous row's coefficients.
  - Update the remaining coefficients in the row: `phi.matrix[j,k] <- phi.matrix[j-1,k] - phi.matrix[j,j] * phi.matrix[j-1,j-k]`. This computes $\phi_{jk}$ for $k = 1, \dots, j-1$ from the previous-order coefficients.
- Extract the PACF directly from the diagonal and verify it using built-in PACF
```r
m <- 5
PACF <- numeric(n)
for (h in 1:n) {
  PACF[h] <- phi.matrix[h, h]
}
PACF[1:m]
ARMAacf(ar = phi, ma = th, lag.max = m, pacf = TRUE)
```
`PACF[h] <- phi.matrix[h, h]` extracts the diagonal entries, which are the PACF values: $\alpha(h) = \phi_{hh}$. The last line recomputes the PACF using R's built-in routine: `pacf = TRUE` tells `ARMAacf` to compute the PACF instead of the ACF. The point of this comparison is that the Durbin–Levinson implementation matches the trusted built-in PACF.
- Show what the predictor coefficients look like for a specific step (example $j=3$)
```r
j <- 3
phi.matrix[j, 1:j]
```
- This prints $(\phi_{31}, \phi_{32}, \phi_{33})$, meaning the code has constructed the predictor $\hat{X}_4 = \phi_{31} X_3 + \phi_{32} X_2 + \phi_{33} X_1$ (note the ordering: the row is aligned so the most recent value gets $\phi_{31}$).
- Simulate an ARMA time series and generate one-step-ahead predictions using the computed coefficients
```r
set.seed(1)
X <- arima.sim(model = list(ar = phi, ma = th), n)
Y <- numeric(n)
Y[1] <- 0                     # no past data yet: predict the (zero) mean
for (j in 1:(n - 1)) {
  Y[j + 1] <- sum(phi.matrix[j, 1:j] * X[j:1])
}
Z <- X - Y                    # one-step prediction errors
```
`arima.sim(...)` generates a synthetic time series `X` from the same ARMA(2,1) model. `Y[j+1]` computes the one-step-ahead prediction of `X[j+1]` using the coefficients in row `j`:
- `phi.matrix[j, 1:j]` are the predictor weights for predicting step `j+1`,
- `X[j:1]` supplies the past observations in reverse order so they align with the coefficients.
`Z <- X - Y` computes the prediction errors (actual minus predicted).
- Plot the simulated series vs predictions, and plot prediction errors
```r
par(mfrow = c(1, 2))
plot(X, xlim = c(0, 20), lwd = 1.5, main = 'Series and One-Step Predictions')
lines(ts(Y), col = 'blue', type = 'b')
plot(Z, xlim = c(0, 100), main = 'One-Step Prediction Errors')
abline(h = 0)
```
- Left plot: shows how closely the one-step predictions `Y` track the simulated series `X`.
- Right plot: shows the prediction errors `Z`; ideally they behave roughly like noise centered at 0.
In summary, the code demonstrates (a) how to compute the PACF and one-step predictor coefficients from an ARMA model using Durbin–Levinson, and (b) how to apply those coefficients to simulated data to produce forecasts and analyze forecast errors.
The big conceptual takeaway
- Forecasting in stationary linear time series is fundamentally a projection problem: choose coefficients so that the residual is orthogonal to the information set.
- The required coefficients are determined entirely by the autocovariance/autocorrelation structure (Toeplitz linear systems).
- PACF measures “direct” dependence at lag $h$ after removing intervening lags; for AR(p) it cuts off after lag $p$.
- Durbin–Levinson makes practical computation feasible by avoiding repeated matrix inversions and simultaneously producing PACF.
