This section develops the theoretical backbone of linear forecasting for stationary time series. It introduces:

  1. the best linear predictor (BLP) and its defining “orthogonality” conditions,
  2. how the one-step and m-step predictors are computed from the autocovariance / autocorrelation structure,
  3. the partial autocorrelation function (PACF) and why it identifies AR order, and
  4. the Durbin–Levinson algorithm, a fast recursive way to compute predictor coefficients and PACF without repeatedly inverting large matrices.

1. Best Linear Predictor (BLP)

1.1 Definition (MSE-minimizing linear combination)

Let $Y$ and $X_1,\dots,X_n$ be random variables with finite second moments. Assume we know:

  • the means $E[Y]$, $E[X_j]$,
  • all covariances $\mathrm{Cov}(X_j,X_k)$ and $\mathrm{Cov}(X_j,Y)$.

We want to predict $Y$ using the observed values of $X_1,\dots,X_n$. We restrict ourselves to linear predictors of the form

$$\widehat Y = c_0 + c_1 X_1 + \cdots + c_n X_n .$$

The best linear predictor is the one that minimizes mean squared error:

$$E\big( (Y-\widehat Y)^2 \big) = \min_{c_0,\dots,c_n} E\Big( \big(Y - c_0 - \sum_{j=1}^n c_j X_j\big)^2 \Big).$$

We denote it by

$$P(Y \mid X_1,\dots,X_n).$$

Key point: $P(Y \mid X_1,\dots,X_n)$ is a random variable built from $X_1,\dots,X_n$ and the second-moment structure; it does not use the realized value of $Y$. It is the optimal linear forecast given what you know.

Notation convention: $P(Y \mid\,)$ means “predict $Y$ with no predictors,” which is simply the constant $E[Y]$. This is handy when writing expressions that should also make sense for $n=0$.

1.2 Relationship to regression / projection

The BLP is the same object as:

  • the linear regression of $Y$ on $X_1,\dots,X_n$ (with intercept),
  • the orthogonal projection of $Y$ onto the span of $\{1,X_1,\dots,X_n\}$ in an inner-product space where $\langle U,V\rangle=E(UV)$.

If $(Y,X_1,\dots,X_n)$ are jointly multivariate normal, then the BLP equals the conditional expectation $E(Y\mid X_1,\dots,X_n)$. In general, the BLP is defined purely by second moments and linearity.


2. Core Properties: the “Prediction Equations” (Orthogonality)

Assume

$$P(Y\mid X_1,\dots,X_n)=\alpha_0+\alpha_1X_1+\cdots+\alpha_nX_n.$$

Minimizing MSE implies first-order optimality conditions (taking partial derivatives w.r.t. coefficients). These yield:

2.1 Mean is preserved

$$E\big[P(Y\mid X_1,\dots,X_n)\big]=E[Y].$$

So the predictor has the same mean as the target.

2.2 Residual is orthogonal (uncorrelated) to all predictors

Define the residual (prediction error)

$$e = Y - P(Y\mid X_1,\dots,X_n).$$

Then

$$E[e]=0,\qquad E[eX_j]=0\quad(j=1,\dots,n).$$

Interpretation: once you subtract the best linear predictor, what remains has no linear correlation with any predictor used. The predictor “extracts all linear information” from $X_1,\dots,X_n$ about $Y$.

Also, since $P(Y\mid\cdot)$ itself is a linear combination of $1,X_1,\dots,X_n$, it follows that $e$ is also uncorrelated with the predictor itself. Hence:

$$Y = P(Y\mid X_1,\dots,X_n) + e$$

is a decomposition into two uncorrelated components (useful for variance/MSE calculations).
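These properties can be checked numerically from second moments alone. A quick sketch in Python/NumPy (the document's own code is in R; the covariance values below are made up purely for illustration):

```python
import numpy as np

# Illustrative second-moment structure, ordered (Y, X1, X2)
mu = np.array([1.0, 0.5, -0.2])        # E[Y], E[X1], E[X2]
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.0, 0.4],
                  [0.3, 0.4, 1.5]])

Sxx = Sigma[1:, 1:]                    # Cov(X, X)
sxy = Sigma[1:, 0]                     # Cov(X, Y)

alpha = np.linalg.solve(Sxx, sxy)      # slopes from the prediction equations
alpha0 = mu[0] - alpha @ mu[1:]        # intercept chosen to match the means

# Mean preservation: E[alpha0 + alpha'X] = E[Y]
assert np.isclose(alpha0 + alpha @ mu[1:], mu[0])

# Orthogonality: Cov(e, X_j) = Cov(Y, X_j) - (Sigma_XX alpha)_j = 0 for all j
assert np.allclose(sxy - Sxx @ alpha, 0.0)
```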


3. Special Case $n=1$: explicit coefficient formula

For a single predictor $X$,

$$P(Y\mid X)=\alpha_0+\alpha_1X.$$

The optimal slope is

$$\alpha_1=\frac{\mathrm{Cov}(Y,X)}{\mathrm{Var}(X)},$$

and the intercept is

$$\alpha_0=E[Y]-\alpha_1E[X].$$

This mirrors the vector projection formula in linear algebra.
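As a quick numerical sanity check of these two formulas (Python, with arbitrary illustrative moments):

```python
# Illustrative second moments for one predictor X and target Y
EY, EX = 2.0, 1.0
var_x, cov_xy = 4.0, 1.2

a1 = cov_xy / var_x      # optimal slope Cov(Y, X) / Var(X) -> 0.3
a0 = EY - a1 * EX        # intercept E[Y] - a1 * E[X]

# Residual uncorrelated with X: Cov(Y - a0 - a1*X, X) = Cov(Y,X) - a1*Var(X) = 0
assert abs(cov_xy - a1 * var_x) < 1e-12
# Predictor mean equals E[Y]
assert abs((a0 + a1 * EX) - EY) < 1e-12
```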


4. Specialization to Stationary Time Series Forecasting

The key forecasting setting is:

  • $(X_t)$ is a mean-zero stationary process,
  • you observe $X_1,\dots,X_n$,
  • you want to predict $X_{n+1}$ (one-step ahead) or $X_{n+m}$ ($m$-step ahead).

Because we assume mean zero, we can write predictors with no intercept term.


5. One-Step-Ahead Best Linear Predictor

5.1 Form of the predictor and coefficients

Define

$$P(X_{n+1}\mid X_1,\dots,X_n)=\phi_{n,1}X_n+\phi_{n,2}X_{n-1}+\cdots+\phi_{n,n}X_1.$$

Here $\phi_{n,j}$ is the coefficient on the value that is $j$ lags behind $X_{n+1}$.

5.2 Stationarity-based invariances

Stationarity implies:

  • Time shifts don’t change coefficients: the same coefficients apply when you move the time window forward.
  • Forecast/backcast symmetry: using evenness of the autocovariance, $\gamma(h)=\gamma(-h)$, the coefficients for predicting forward and “predicting backward” align (with appropriate ordering).

6. The Prediction Equations in Time Series Form (Toeplitz System)

Apply orthogonality with predictors $X_n,\dots,X_1$. For each $k=1,\dots,n$,

$$E\left[\left(X_{n+1}-\sum_{j=1}^n\phi_{n,j}X_{n+1-j}\right)X_{n+1-k}\right]=0.$$

Using stationarity, these become the Yule–Walker-type equations:

$$\gamma(k)=\phi_{n,1}\gamma(k-1)+\phi_{n,2}\gamma(k-2)+\cdots+\phi_{n,n}\gamma(k-n), \quad k=1,\dots,n.$$

6.1 Matrix form

Let

  • $\gamma_n=(\gamma(1),\dots,\gamma(n))^\top$,
  • $\phi_n=(\phi_{n,1},\dots,\phi_{n,n})^\top$,
  • $\Gamma_n$ be the $n\times n$ covariance matrix of $(X_1,\dots,X_n)$ with entries
    $(\Gamma_n)_{ij}=\gamma(i-j)$ (Toeplitz structure).

Then

$$\gamma_n=\Gamma_n\phi_n.$$

Equivalently, dividing by $\gamma(0)$ yields a correlation version:

$$\rho_n=R_n\phi_n,$$

where $(R_n)_{ij}=\rho(i-j)$ and $\rho_n=(\rho(1),\dots,\rho(n))^\top$.

6.2 Existence/uniqueness

Under mild regularity (e.g., $\gamma(0)>0$ and $\gamma(h)\to 0$), $\Gamma_n$ is invertible, so:

$$\phi_n=\Gamma_n^{-1}\gamma_n = R_n^{-1}\rho_n.$$

This is a complete “in principle” solution, but direct inversion becomes expensive as $n$ grows.
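To make the Toeplitz system concrete, here is a small Python/NumPy sketch for a causal AR(1) with $\varphi=0.6$ and $\sigma_W^2=1$, whose autocovariance is $\gamma(h)=\varphi^{|h|}/(1-\varphi^2)$ (parameter values chosen for illustration):

```python
import numpy as np

phi, n = 0.6, 8
gamma = lambda h: phi**abs(h) / (1 - phi**2)   # AR(1) autocovariance, sigma_W^2 = 1

# Toeplitz covariance matrix Gamma_n and right-hand side gamma_n
Gamma = np.array([[gamma(i - j) for j in range(n)] for i in range(n)])
gamma_n = np.array([gamma(k) for k in range(1, n + 1)])

phi_n = np.linalg.solve(Gamma, gamma_n)        # phi_n = Gamma_n^{-1} gamma_n

# For an AR(1), only the most recent lag gets weight: phi_n = (phi, 0, ..., 0)
assert np.allclose(phi_n, [phi] + [0.0] * (n - 1))
```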


7. One-Step Prediction for a Pure AR(p): Why the model coefficients appear

For a causal AR(p),

$$X_{t}=\varphi_1X_{t-1}+\cdots+\varphi_pX_{t-p}+W_t,$$

with white noise $W_t$ uncorrelated with past $X$’s.

If $n\ge p$, then the optimal one-step predictor given $X_1,\dots,X_n$ is:

$$P(X_{n+1}\mid X_1,\dots,X_n)=\varphi_1X_n+\cdots+\varphi_pX_{n+1-p}.$$

Reason: the residual under that predictor is exactly $W_{n+1}$, which is uncorrelated with all the predictors, satisfying the defining orthogonality conditions.

If $n<p$, you do not have enough lags available, so the predictor does not simply reduce to the AR equation.
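This reduction can be verified numerically. A Python sketch for an AR(2) with illustrative coefficients $\varphi_1=0.5$, $\varphi_2=0.3$ (working with autocorrelations, since $R_n\phi_n=\rho_n$ has the same solution):

```python
import numpy as np

p1, p2, n = 0.5, 0.3, 6    # AR(2) coefficients (illustrative), n >= p lags

# Autocorrelations via the Yule-Walker recursion rho(h) = p1*rho(h-1) + p2*rho(h-2),
# started from rho(0) = 1 and rho(1) = p1 / (1 - p2)
rho = [1.0, p1 / (1 - p2)]
for h in range(2, n + 1):
    rho.append(p1 * rho[h - 1] + p2 * rho[h - 2])

R = np.array([[rho[abs(i - j)] for j in range(n)] for i in range(n)])
phi_n = np.linalg.solve(R, np.array(rho[1:n + 1]))

# With n >= p, the one-step BLP coefficients are exactly the AR coefficients
assert np.allclose(phi_n, [p1, p2, 0, 0, 0, 0])
```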


8. Mean-Squared Error (MSE) of the One-Step Predictor

Let the one-step prediction error be

$$e_{n+1}=X_{n+1}-P(X_{n+1}\mid X_1,\dots,X_n).$$

Because $e_{n+1}$ is uncorrelated with the predictor, the MSE is:

$$E[e_{n+1}^2] = \gamma(0)-\phi_{n,1}\gamma(1)-\cdots-\phi_{n,n}\gamma(n) = \gamma(0)-\phi_n^\top\gamma_n.$$

Interpretation: you start with the variance $\gamma(0)$ and subtract the part explained linearly by the past.


9. m-Step-Ahead Best Linear Predictor

For $m\ge 1$,

$$P(X_{n+m}\mid X_1,\dots,X_n)=\phi^{(m)}_{n,1}X_n+\cdots+\phi^{(m)}_{n,n}X_1.$$

The orthogonality conditions become:

$$E\left[\left(X_{n+m}-\sum_{j=1}^n\phi^{(m)}_{n,j}X_{n+1-j}\right)X_{n+1-k}\right]=0, \quad k=1,\dots,n.$$

This yields:

$$\gamma(m-1+k)=\phi^{(m)}_{n,1}\gamma(k-1)+\cdots+\phi^{(m)}_{n,n}\gamma(k-n), \quad k=1,\dots,n.$$

In matrix form:

$$\gamma^{(m)}_n=\Gamma_n\phi^{(m)}_n,$$

where $\gamma^{(m)}_n=(\gamma(m),\gamma(m+1),\dots,\gamma(m+n-1))^\top$.

The $m$-step MSE has the same structure:

$$E\big[(X_{n+m}-P(X_{n+m}\mid X_1,\dots,X_n))^2\big] =\gamma(0)-(\phi^{(m)}_n)^\top \gamma^{(m)}_n.$$
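For a concrete case, an AR(1) with $\varphi=0.6$ and $\sigma_W^2=1$ has $\gamma(h)=\varphi^{|h|}/(1-\varphi^2)$, and its $m$-step predictor should collapse to $\varphi^m X_n$. A Python sketch (parameter values illustrative):

```python
import numpy as np

phi, n, m = 0.6, 6, 3
gamma = lambda h: phi**abs(h) / (1 - phi**2)   # AR(1) autocovariance, sigma_W^2 = 1

Gamma = np.array([[gamma(i - j) for j in range(n)] for i in range(n)])
gamma_m = np.array([gamma(m + k) for k in range(n)])   # (gamma(m),...,gamma(m+n-1))

phi_m = np.linalg.solve(Gamma, gamma_m)

# The m-step AR(1) predictor is phi^m * X_n
assert np.allclose(phi_m, [phi**m] + [0.0] * (n - 1))

# m-step MSE gamma(0) - phi_m' gamma_m = (1 - phi^(2m)) / (1 - phi^2)
mse = gamma(0) - phi_m @ gamma_m
assert np.isclose(mse, (1 - phi**(2 * m)) / (1 - phi**2))
```

Note how the MSE grows with $m$: as the horizon lengthens, less of the variance $\gamma(0)$ is explained by the observed past.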


10. The Partial Autocorrelation Function (PACF)

10.1 Why PACF is needed

  • The ACF is excellent at identifying MA(q) because the MA(q) ACF cuts off after lag $q$.
  • The ACF does not cut off for AR(p) or ARMA(p,q); it decays.
  • Distinguishing AR(p) from ARMA(p,q) can be difficult using the ACF alone.

The PACF provides an AR analog: for AR(p), the PACF cuts off after lag $p$.

10.2 Two equivalent definitions

Definition A (coefficient definition):
Let $\phi_{n,j}$ be the coefficients of the one-step predictor $P(X_{n+1}\mid X_1,\dots,X_n)$.
Then the PACF at lag $h$ is:

$$\mathrm{PACF}(h)=\phi_{h,h}.$$

So it is the coefficient on the oldest term $X_1$ when predicting $X_{h+1}$ from $X_1,\dots,X_h$.

Definition B (residual correlation definition):

$$\phi_{h,h}= \mathrm{corr}\Big( X_{h+1}-P(X_{h+1}\mid X_2,\dots,X_h), \; X_1-P(X_1\mid X_2,\dots,X_h) \Big).$$

Interpretation: it measures the direct linear dependence between $X_1$ and $X_{h+1}$ after removing the linear effects of the intervening values $X_2,\dots,X_h$.

10.3 “Cutoff after p” for AR(p)

For an AR(p) and $h>p$,

$$P(X_{h+1}\mid X_1,\dots,X_h)=\varphi_1X_h+\cdots+\varphi_pX_{h+1-p},$$

which has no $X_1$ term, so the coefficient on $X_1$ is zero:

$$\phi_{h,h}=0 \quad \text{for } h>p.$$

Thus the PACF cuts off at lag $p$.
In contrast, PACF for MA(q) typically decays rather than cutting off.
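The cutoff is easy to see numerically by solving the order-$h$ prediction equations for increasing $h$ and reading off the last coefficient. A Python sketch for an AR(2) with illustrative coefficients $\varphi_1=0.5$, $\varphi_2=0.3$:

```python
import numpy as np

p1, p2 = 0.5, 0.3          # AR(2) coefficients (illustrative)

# Autocorrelations from rho(h) = p1*rho(h-1) + p2*rho(h-2), rho(1) = p1/(1-p2)
rho = [1.0, p1 / (1 - p2)]
for h in range(2, 9):
    rho.append(p1 * rho[h - 1] + p2 * rho[h - 2])

def pacf(h):
    # phi_{h,h}: last coefficient of the order-h one-step predictor
    R = np.array([[rho[abs(i - j)] for j in range(h)] for i in range(h)])
    return np.linalg.solve(R, np.array(rho[1:h + 1]))[-1]

vals = [pacf(h) for h in range(1, 7)]
assert abs(vals[1] - p2) < 1e-10               # phi_{2,2} equals varphi_2
assert all(abs(v) < 1e-10 for v in vals[2:])   # cutoff: zero beyond lag p = 2
```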


11. Durbin–Levinson Algorithm: Fast recursive computation of predictors and PACF

11.1 Why not just invert $R_n$?

The “closed-form” solution

$$\phi_n=R_n^{-1}\rho_n$$

requires inverting an $n\times n$ matrix for each $n$, which is expensive for large $n$.

Durbin–Levinson exploits the Toeplitz structure to compute:

  • all predictor coefficients $\phi_{n,1},\dots,\phi_{n,n}$ recursively,
  • and therefore the PACF values $\phi_{n,n}$ as the diagonal entries.

11.2 Core recursion (as given in your notes)

  • Step 1:

$$\phi_{1,1}=\rho(1).$$

  • Step $n\ge 2$: first compute the PACF at lag $n$:

$$\phi_{n,n}= \frac{ \rho(n)-\sum_{k=1}^{n-1}\phi_{n-1,n-k}\,\rho(k) }{ 1-\sum_{k=1}^{n-1}\phi_{n-1,k}\,\rho(k) }.$$

Then update the earlier coefficients for $k=1,\dots,n-1$:

$$\phi_{n,k}=\phi_{n-1,k}-\phi_{n,n}\phi_{n-1,n-k}.$$

So each new lag $n$ adds a new “reflection” coefficient $\phi_{n,n}$ (which is the PACF) and adjusts the previous coefficients symmetrically.

11.3 MSE recursion via PACF

The one-step prediction error variance satisfies:

$$v_n=E\Big[\big(X_{n+1}-P(X_{n+1}\mid X_1,\dots,X_n)\big)^2\Big] =\gamma(0)\prod_{j=1}^n \big(1-\phi_{j,j}^2\big).$$

So once you have PACF values, you can compute MSEs efficiently.
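The recursion translates directly into code; a compact Python version (the document's own implementation, in R, follows in the next section), checked on an AR(1) where $\rho(h)=\varphi^h$ and the PACF should be $(\varphi,0,0,\dots)$:

```python
import numpy as np

def durbin_levinson(rho):
    """Return phi with phi[n, k] = phi_{n,k}; input rho[h-1] holds rho(h)."""
    N = len(rho)
    phi = np.zeros((N + 1, N + 1))
    phi[1, 1] = rho[0]
    for n in range(2, N + 1):
        num = rho[n - 1] - sum(phi[n - 1, n - k] * rho[k - 1] for k in range(1, n))
        den = 1.0 - sum(phi[n - 1, k] * rho[k - 1] for k in range(1, n))
        phi[n, n] = num / den                      # PACF at lag n
        for k in range(1, n):                      # update earlier coefficients
            phi[n, k] = phi[n - 1, k] - phi[n, n] * phi[n - 1, n - k]
    return phi

a = 0.6
rho = [a**h for h in range(1, 7)]                  # AR(1) autocorrelations
phi = durbin_levinson(rho)
pacf = [phi[n, n] for n in range(1, 7)]
assert np.allclose(pacf, [a, 0, 0, 0, 0, 0])

# MSE product formula: v_n / gamma(0) = prod_j (1 - phi_{j,j}^2) = 1 - a^2 here
assert np.isclose(np.prod([1 - p**2 for p in pacf]), 1 - a**2)
```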


12. R code (high-level)

  1. Define the ARMA model and compute its theoretical ACF
n <- 1000
phi <- c(0.3, 0.4)
th <- 0.7
ACF <- ARMAacf(ar = phi, ma = th, lag.max = n)[-1]
  • phi <- c(0.3, 0.4) and th <- 0.7 specify an ARMA(2,1) model:
    • AR part: coefficients 0.3 and 0.4
    • MA part: coefficient 0.7
  • ARMAacf(...) returns the theoretical autocorrelation function (ACF) implied by that model.
  • lag.max = n computes lags up to 1000.
  • [-1] drops lag 0 (which is always 1 for the ACF), so ACF[1] corresponds to lag 1, etc.
  2. Implement Durbin–Levinson to compute one-step-ahead predictor coefficients for every order $j$
phi.matrix <- matrix(NA, n, n)
phi.matrix[1,1] <- ACF[1]

for (j in 2:n) {
  num   <- ACF[j] - sum(phi.matrix[j-1,(j-1):1] * ACF[1:(j-1)])
  denom <- 1 - sum(phi.matrix[j-1,1:(j-1)] * ACF[1:(j-1)])
  phi.matrix[j,j] <- num/denom

  for (k in 1:(j-1)) {
    term.1 <- phi.matrix[j-1,k]
    term.2 <- phi.matrix[j,j] * phi.matrix[j-1,j-k]
    phi.matrix[j,k] <- term.1 - term.2
  }
}

(Note the parentheses in 1:(j-1): in R, 1:j-1 parses as (1:j)-1, which silently includes index 0.)

What this block is doing:

  • phi.matrix is built so that row j contains the coefficients $\phi_{j,1}, \ldots, \phi_{j,j}$ used in the best linear one-step predictor $P(X_{j+1}\mid X_1,\ldots,X_j)$.
  • The recursion has two key parts for each j:
    • Compute the diagonal term phi.matrix[j,j] (this is the PACF at lag $j$):
num   <- ACF[j] - sum(phi.matrix[j-1,(j-1):1] * ACF[1:(j-1)])
denom <- 1 - sum(phi.matrix[j-1,1:(j-1)] * ACF[1:(j-1)])
phi.matrix[j,j] <- num/denom

This is the Durbin–Levinson formula for $\phi_{j,j}$, using the ACF values and the previous row’s coefficients.

  • Update the remaining coefficients in the row:
phi.matrix[j,k] <- phi.matrix[j-1,k] - phi.matrix[j,j] * phi.matrix[j-1,j-k]

This computes $\phi_{j,k}$ for $k = 1,\ldots,j-1$ from the previous-order coefficients.

  3. Extract the PACF directly from the diagonal and verify it against the built-in PACF
m <- 5; PACF <- NULL
for (h in 1:n) {
  PACF[h] <- phi.matrix[h,h]
}
PACF[1:m]

ARMAacf(ar = phi, ma = th, lag.max = m, pacf = TRUE)
  • PACF[h] <- phi.matrix[h,h] extracts the diagonal entries, which are the PACF values: $\mathrm{PACF}(h) = \phi_{h,h}$.
  • The second line recomputes the PACF using R’s built-in routine:
    • pacf = TRUE tells ARMAacf to compute the PACF instead of the ACF.
  • The point of this comparison: your Durbin–Levinson implementation should match the trusted built-in PACF.
  4. Show what the predictor coefficients look like for a specific order (example $j=3$)
j <- 3
phi.matrix[j,1:j]
  • This prints $\phi_{3,1}, \phi_{3,2}, \phi_{3,3}$, meaning the code has constructed the predictor $P(X_4 \mid X_1, X_2, X_3) = \phi_{3,1}X_3 + \phi_{3,2}X_2 + \phi_{3,3}X_1$ (note the ordering: the row is aligned so the most recent value gets $\phi_{3,1}$).
  5. Simulate an ARMA time series and generate one-step-ahead predictions using the computed coefficients
set.seed(1)
X <- arima.sim(model = list(ar = phi, ma = th), n)

Y <- NULL; Y[1] <- 0
for (j in 1:(n-1)) {
  Y[j+1] <- sum(phi.matrix[j,1:j] * X[j:1])
}
Z <- X - Y
  • arima.sim(...) generates a synthetic time series X from the same ARMA(2,1) model.
  • Y[j+1] computes the one-step-ahead prediction of X[j+1] using the coefficients in row j:
    • phi.matrix[j,1:j] are the predictor weights for predicting step j+1,
    • X[j:1] supplies the past observations in reverse order so they align with the coefficients.
  • Z <- X - Y computes the prediction errors (actual minus predicted).
  6. Plot the simulated series vs predictions, and plot prediction errors
par(mfrow = c(1,2))
plot(X, xlim = c(0,20), lwd = 1.5, main = 'Series and One-Step Predictions')
lines(ts(Y), col = 'blue', type = 'b')

plot(Z, xlim = c(0,100), main = 'One-Step Prediction Errors')
abline(h = 0)
  • Left plot: shows how close the one-step predictions Y track the simulated series X.
  • Right plot: shows the prediction errors Z; ideally they behave roughly like noise centered at 0.

In summary, the code demonstrates (a) how to compute the PACF and one-step predictor coefficients from an ARMA model using Durbin–Levinson, and (b) how to apply those coefficients to simulated data to produce forecasts and analyze forecast errors.


The big conceptual takeaway

  • Forecasting in stationary linear time series is fundamentally a projection problem: choose coefficients so that the residual is orthogonal to the information set.
  • The required coefficients are determined entirely by the autocovariance/autocorrelation structure (Toeplitz linear systems).
  • PACF measures “direct” dependence at lag $h$ after removing intervening lags; for AR(p) it cuts off after $p$.
  • Durbin–Levinson makes practical computation feasible by avoiding repeated matrix inversions and simultaneously producing PACF.