This section develops the theoretical backbone of linear forecasting for stationary time series. It introduces:

  1. the best linear predictor (BLP) and its defining “orthogonality” conditions,
  2. how the one-step and m-step predictors are computed from the autocovariance / autocorrelation structure,
  3. the partial autocorrelation function (PACF) and why it identifies AR order, and
  4. the Durbin–Levinson algorithm, a fast recursive way to compute predictor coefficients and PACF without repeatedly inverting large matrices.

1. Best Linear Predictor (BLP)

1.1 Definition (MSE-minimizing linear combination)

Let $Y$ and $X_1,\dots,X_n$ be random variables with finite second moments. Assume we know:

  • the means $E[Y]$, $E[X_j]$,
  • all covariances $\mathrm{Cov}(X_j,X_k)$ and $\mathrm{Cov}(X_j,Y)$.

We want to predict $Y$ using the observed values of $X_1,\dots,X_n$. We restrict ourselves to linear predictors of the form

$$\widehat Y = c_0 + c_1 X_1 + \cdots + c_n X_n .$$

The best linear predictor is the one that minimizes mean squared error:

$$E\big( (Y-\widehat Y)^2 \big) = \min_{c_0,\dots,c_n} E\Big( \big(Y - c_0 - \sum_{j=1}^n c_j X_j\big)^2 \Big).$$

We denote it by

$$P(Y \mid X_1,\dots,X_n).$$

Key point: $P(Y \mid X_1,\dots,X_n)$ is a random variable built from $X_1,\dots,X_n$ and the second-moment structure; it does not use the realized value of $Y$. It is the optimal linear forecast given what you know.

Notation convention: $P(Y \mid\,)$ means “predict $Y$ with no predictors,” which is simply the constant $E[Y]$. This is handy when writing expressions that should also make sense for $n=0$.

1.2 Relationship to regression / projection

The BLP is the same object as:

  • the linear regression of $Y$ on $X_1,\dots,X_n$ (with intercept),
  • the orthogonal projection of $Y$ onto the span of $\{1,X_1,\dots,X_n\}$ in an inner-product space where $\langle U,V\rangle=E(UV)$.

If $(Y,X_1,\dots,X_n)$ are jointly multivariate normal, then the BLP equals the conditional expectation $E(Y\mid X_1,\dots,X_n)$. In general, the BLP is defined purely by second moments and linearity.


2. Core Properties: the “Prediction Equations” (Orthogonality)

Assume

$$P(Y\mid X_1,\dots,X_n)=\alpha_0+\alpha_1X_1+\cdots+\alpha_nX_n.$$

Minimizing MSE implies first-order optimality conditions (taking partial derivatives w.r.t. coefficients). These yield:

2.1 Mean is preserved

$$E\big[P(Y\mid X_1,\dots,X_n)\big]=E[Y].$$

So the predictor has the same mean as the target.

2.2 Residual is orthogonal (uncorrelated) to all predictors

Define the residual (prediction error)

$$e = Y - P(Y\mid X_1,\dots,X_n).$$

Then

$$E[e]=0,\qquad E[eX_j]=0\quad(j=1,\dots,n).$$

Interpretation: once you subtract the best linear predictor, what remains has no linear correlation with any predictor used. The predictor “extracts all linear information” from $X_1,\dots,X_n$ about $Y$.

Also, since $P(Y\mid\cdot)$ itself is a linear combination of $1,X_1,\dots,X_n$, it follows that $e$ is also uncorrelated with the predictor itself. Hence:

$$Y = P(Y\mid X_1,\dots,X_n) + e$$

is a decomposition into two uncorrelated components (useful for variance/MSE calculations).
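These properties can be checked numerically from second moments alone. A quick sketch in Python/NumPy (the document's own code is in R; the covariance values below are made up purely for illustration):

```python
import numpy as np

# Illustrative second-moment structure, ordered (Y, X1, X2)
mu = np.array([1.0, 0.5, -0.2])        # E[Y], E[X1], E[X2]
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.0, 0.4],
                  [0.3, 0.4, 1.5]])

Sxx = Sigma[1:, 1:]                    # Cov(X, X)
sxy = Sigma[1:, 0]                     # Cov(X, Y)

alpha = np.linalg.solve(Sxx, sxy)      # slopes from the prediction equations
alpha0 = mu[0] - alpha @ mu[1:]        # intercept chosen to match the means

# Mean preservation: E[alpha0 + alpha'X] = E[Y]
assert np.isclose(alpha0 + alpha @ mu[1:], mu[0])

# Orthogonality: Cov(e, X_j) = Cov(Y, X_j) - (Sigma_XX alpha)_j = 0 for all j
assert np.allclose(sxy - Sxx @ alpha, 0.0)
```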


3. Special Case $n=1$: explicit coefficient formula

For a single predictor $X$,

$$P(Y\mid X)=\alpha_0+\alpha_1X.$$

The optimal slope is

$$\alpha_1=\frac{\mathrm{Cov}(Y,X)}{\mathrm{Var}(X)},$$

and the intercept is

$$\alpha_0=E[Y]-\alpha_1E[X].$$

This mirrors the vector projection formula in linear algebra.
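As a quick numerical sanity check of these two formulas (Python, with arbitrary illustrative moments):

```python
# Illustrative second moments for one predictor X and target Y
EY, EX = 2.0, 1.0
var_x, cov_xy = 4.0, 1.2

a1 = cov_xy / var_x      # optimal slope Cov(Y, X) / Var(X) -> 0.3
a0 = EY - a1 * EX        # intercept E[Y] - a1 * E[X]

# Residual uncorrelated with X: Cov(Y - a0 - a1*X, X) = Cov(Y,X) - a1*Var(X) = 0
assert abs(cov_xy - a1 * var_x) < 1e-12
# Predictor mean equals E[Y]
assert abs((a0 + a1 * EX) - EY) < 1e-12
```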


4. Specialization to Stationary Time Series Forecasting

The key forecasting setting is:

  • $(X_t)$ is a mean-zero stationary process,
  • you observe $X_1,\dots,X_n$,
  • you want to predict $X_{n+1}$ (one-step ahead) or $X_{n+m}$ ($m$-step ahead).

Because we assume mean zero, we can write predictors with no intercept term.


5. One-Step-Ahead Best Linear Predictor

5.1 Form of the predictor and coefficients

Define

$$P(X_{n+1}\mid X_1,\dots,X_n)=\phi_{n,1}X_n+\phi_{n,2}X_{n-1}+\cdots+\phi_{n,n}X_1.$$

Here $\phi_{n,j}$ is the coefficient on the value that is $j$ lags behind $X_{n+1}$.

5.2 Stationarity-based invariances

Stationarity implies:

  • Time shifts don’t change coefficients: the same coefficients apply when you move the time window forward.
  • Forecast/backcast symmetry: using evenness of the autocovariance, $\gamma(h)=\gamma(-h)$, the coefficients for predicting forward and “predicting backward” align (with appropriate ordering).

6. The Prediction Equations in Time Series Form (Toeplitz System)

Apply orthogonality with predictors $X_n,\dots,X_1$. For each $k=1,\dots,n$,

$$E\left[\left(X_{n+1}-\sum_{j=1}^n\phi_{n,j}X_{n+1-j}\right)X_{n+1-k}\right]=0.$$

Using stationarity, these become the Yule–Walker-type equations:

$$\gamma(k)=\phi_{n,1}\gamma(k-1)+\phi_{n,2}\gamma(k-2)+\cdots+\phi_{n,n}\gamma(k-n), \quad k=1,\dots,n.$$

6.1 Matrix form

Let

  • $\gamma_n=(\gamma(1),\dots,\gamma(n))^\top$,
  • $\phi_n=(\phi_{n,1},\dots,\phi_{n,n})^\top$,
  • $\Gamma_n$ be the $n\times n$ covariance matrix of $(X_1,\dots,X_n)$ with entries
    $(\Gamma_n)_{ij}=\gamma(i-j)$ (Toeplitz structure).

Then

$$\gamma_n=\Gamma_n\phi_n.$$

Equivalently, dividing by $\gamma(0)$ yields a correlation version:

$$\rho_n=R_n\phi_n,$$

where $(R_n)_{ij}=\rho(i-j)$ and $\rho_n=(\rho(1),\dots,\rho(n))^\top$.

6.2 Existence/uniqueness

Under mild regularity (e.g., $\gamma(0)>0$ and $\gamma(h)\to 0$), $\Gamma_n$ is invertible, so:

$$\phi_n=\Gamma_n^{-1}\gamma_n = R_n^{-1}\rho_n.$$

This is a complete “in principle” solution, but direct inversion becomes expensive as $n$ grows.
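To make the Toeplitz system concrete, here is a small Python/NumPy sketch for a causal AR(1) with $\varphi=0.6$ and $\sigma_W^2=1$, whose autocovariance is $\gamma(h)=\varphi^{|h|}/(1-\varphi^2)$ (parameter values chosen for illustration):

```python
import numpy as np

phi, n = 0.6, 8
gamma = lambda h: phi**abs(h) / (1 - phi**2)   # AR(1) autocovariance, sigma_W^2 = 1

# Toeplitz covariance matrix Gamma_n and right-hand side gamma_n
Gamma = np.array([[gamma(i - j) for j in range(n)] for i in range(n)])
gamma_n = np.array([gamma(k) for k in range(1, n + 1)])

phi_n = np.linalg.solve(Gamma, gamma_n)        # phi_n = Gamma_n^{-1} gamma_n

# For an AR(1), only the most recent lag gets weight: phi_n = (phi, 0, ..., 0)
assert np.allclose(phi_n, [phi] + [0.0] * (n - 1))
```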


7. One-Step Prediction for a Pure AR(p): Why the model coefficients appear

For a causal AR(p),

$$X_{t}=\varphi_1X_{t-1}+\cdots+\varphi_pX_{t-p}+W_t,$$

with white noise $W_t$ uncorrelated with past $X$’s.

If $n\ge p$, then the optimal one-step predictor given $X_1,\dots,X_n$ is:

$$P(X_{n+1}\mid X_1,\dots,X_n)=\varphi_1X_n+\cdots+\varphi_pX_{n+1-p}.$$

Reason: the residual under that predictor is exactly $W_{n+1}$, which is uncorrelated with all the predictors, satisfying the defining orthogonality conditions.

If $n<p$, you do not have enough lags available, so the predictor does not simply reduce to the AR equation.
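This reduction can be verified numerically. A Python sketch for an AR(2) with illustrative coefficients $\varphi_1=0.5$, $\varphi_2=0.3$ (working with autocorrelations, since $R_n\phi_n=\rho_n$ has the same solution):

```python
import numpy as np

p1, p2, n = 0.5, 0.3, 6    # AR(2) coefficients (illustrative), n >= p lags

# Autocorrelations via the Yule-Walker recursion rho(h) = p1*rho(h-1) + p2*rho(h-2),
# started from rho(0) = 1 and rho(1) = p1 / (1 - p2)
rho = [1.0, p1 / (1 - p2)]
for h in range(2, n + 1):
    rho.append(p1 * rho[h - 1] + p2 * rho[h - 2])

R = np.array([[rho[abs(i - j)] for j in range(n)] for i in range(n)])
phi_n = np.linalg.solve(R, np.array(rho[1:n + 1]))

# With n >= p, the one-step BLP coefficients are exactly the AR coefficients
assert np.allclose(phi_n, [p1, p2, 0, 0, 0, 0])
```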


8. Mean-Squared Error (MSE) of the One-Step Predictor

Let the one-step prediction error be

$$e_{n+1}=X_{n+1}-P(X_{n+1}\mid X_1,\dots,X_n).$$

Because $e_{n+1}$ is uncorrelated with the predictor, the MSE is:

$$E[e_{n+1}^2] = \gamma(0)-\phi_{n,1}\gamma(1)-\cdots-\phi_{n,n}\gamma(n) = \gamma(0)-\phi_n^\top\gamma_n.$$

Interpretation: you start with the variance $\gamma(0)$ and subtract the part explained linearly by the past.


9. m-Step-Ahead Best Linear Predictor

For $m\ge 1$,

$$P(X_{n+m}\mid X_1,\dots,X_n)=\phi^{(m)}_{n,1}X_n+\cdots+\phi^{(m)}_{n,n}X_1.$$

The orthogonality conditions become:

$$E\left[\left(X_{n+m}-\sum_{j=1}^n\phi^{(m)}_{n,j}X_{n+1-j}\right)X_{n+1-k}\right]=0, \quad k=1,\dots,n.$$

This yields:

$$\gamma(m-1+k)=\phi^{(m)}_{n,1}\gamma(k-1)+\cdots+\phi^{(m)}_{n,n}\gamma(k-n), \quad k=1,\dots,n.$$

In matrix form:

$$\gamma^{(m)}_n=\Gamma_n\phi^{(m)}_n,$$

where $\gamma^{(m)}_n=(\gamma(m),\gamma(m+1),\dots,\gamma(m+n-1))^\top$.

The $m$-step MSE has the same structure:

$$E\big[(X_{n+m}-P(X_{n+m}\mid X_1,\dots,X_n))^2\big] =\gamma(0)-(\phi^{(m)}_n)^\top \gamma^{(m)}_n.$$
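For a concrete case, an AR(1) with $\varphi=0.6$ and $\sigma_W^2=1$ has $\gamma(h)=\varphi^{|h|}/(1-\varphi^2)$, and its $m$-step predictor should collapse to $\varphi^m X_n$. A Python sketch (parameter values illustrative):

```python
import numpy as np

phi, n, m = 0.6, 6, 3
gamma = lambda h: phi**abs(h) / (1 - phi**2)   # AR(1) autocovariance, sigma_W^2 = 1

Gamma = np.array([[gamma(i - j) for j in range(n)] for i in range(n)])
gamma_m = np.array([gamma(m + k) for k in range(n)])   # (gamma(m),...,gamma(m+n-1))

phi_m = np.linalg.solve(Gamma, gamma_m)

# The m-step AR(1) predictor is phi^m * X_n
assert np.allclose(phi_m, [phi**m] + [0.0] * (n - 1))

# m-step MSE gamma(0) - phi_m' gamma_m = (1 - phi^(2m)) / (1 - phi^2)
mse = gamma(0) - phi_m @ gamma_m
assert np.isclose(mse, (1 - phi**(2 * m)) / (1 - phi**2))
```

Note how the MSE grows with $m$: as the horizon lengthens, less of the variance $\gamma(0)$ is explained by the observed past.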


10. The Partial Autocorrelation Function (PACF)

10.1 Why PACF is needed

  • The ACF is excellent at identifying MA(q) because the MA(q) ACF cuts off after lag $q$.
  • The ACF does not cut off for AR(p) or ARMA(p,q); it decays.
  • Distinguishing AR(p) from ARMA(p,q) can be difficult using the ACF alone.

The PACF provides an AR analog: for AR(p), the PACF cuts off after lag $p$.

10.2 Two equivalent definitions

Definition A (coefficient definition):
Let $\phi_{n,j}$ be the coefficients of the one-step predictor $P(X_{n+1}\mid X_1,\dots,X_n)$.
Then the PACF at lag $h$ is:

$$\mathrm{PACF}(h)=\phi_{h,h}.$$

So it is the coefficient on the oldest term $X_1$ when predicting $X_{h+1}$ from $X_1,\dots,X_h$.

Definition B (residual correlation definition):

$$\phi_{h,h}= \mathrm{corr}\Big( X_{h+1}-P(X_{h+1}\mid X_2,\dots,X_h), \; X_1-P(X_1\mid X_2,\dots,X_h) \Big).$$

Interpretation: it measures the direct linear dependence between $X_1$ and $X_{h+1}$ after removing the linear effects of the intervening values $X_2,\dots,X_h$.

10.3 “Cutoff after p” for AR(p)

For an AR(p) and $h>p$,

$$P(X_{h+1}\mid X_1,\dots,X_h)=\varphi_1X_h+\cdots+\varphi_pX_{h+1-p},$$

which has no $X_1$ term, so the coefficient on $X_1$ is zero:

$$\phi_{h,h}=0 \quad \text{for } h>p.$$

Thus the PACF cuts off at lag $p$.
In contrast, PACF for MA(q) typically decays rather than cutting off.
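The cutoff is easy to see numerically by solving the order-$h$ prediction equations for increasing $h$ and reading off the last coefficient. A Python sketch for an AR(2) with illustrative coefficients $\varphi_1=0.5$, $\varphi_2=0.3$:

```python
import numpy as np

p1, p2 = 0.5, 0.3          # AR(2) coefficients (illustrative)

# Autocorrelations from rho(h) = p1*rho(h-1) + p2*rho(h-2), rho(1) = p1/(1-p2)
rho = [1.0, p1 / (1 - p2)]
for h in range(2, 9):
    rho.append(p1 * rho[h - 1] + p2 * rho[h - 2])

def pacf(h):
    # phi_{h,h}: last coefficient of the order-h one-step predictor
    R = np.array([[rho[abs(i - j)] for j in range(h)] for i in range(h)])
    return np.linalg.solve(R, np.array(rho[1:h + 1]))[-1]

vals = [pacf(h) for h in range(1, 7)]
assert abs(vals[1] - p2) < 1e-10               # phi_{2,2} equals varphi_2
assert all(abs(v) < 1e-10 for v in vals[2:])   # cutoff: zero beyond lag p = 2
```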


11. Durbin–Levinson Algorithm: Fast recursive computation of predictors and PACF

11.1 Why not just invert $R_n$?

The “closed-form” solution

$$\phi_n=R_n^{-1}\rho_n$$

requires inverting an $n\times n$ matrix for each $n$, which is expensive for large $n$.

Durbin–Levinson exploits the Toeplitz structure to compute:

  • all predictor coefficients $\phi_{n,1},\dots,\phi_{n,n}$ recursively,
  • and therefore the PACF values $\phi_{n,n}$ as the diagonal entries.

11.2 Core recursion (as given in your notes)

  • Step 1:

$$\phi_{1,1}=\rho(1).$$

  • Step $n\ge 2$: first compute the PACF at lag $n$:

$$\phi_{n,n}= \frac{ \rho(n)-\sum_{k=1}^{n-1}\phi_{n-1,n-k}\,\rho(k) }{ 1-\sum_{k=1}^{n-1}\phi_{n-1,k}\,\rho(k) }.$$

Then update the earlier coefficients for $k=1,\dots,n-1$:

$$\phi_{n,k}=\phi_{n-1,k}-\phi_{n,n}\phi_{n-1,n-k}.$$

So each new lag $n$ adds a new “reflection” coefficient $\phi_{n,n}$ (which is the PACF) and adjusts the previous coefficients symmetrically.

11.3 MSE recursion via PACF

The one-step prediction error variance satisfies:

$$v_n=E\Big[\big(X_{n+1}-P(X_{n+1}\mid X_1,\dots,X_n)\big)^2\Big] =\gamma(0)\prod_{j=1}^n \big(1-\phi_{j,j}^2\big).$$

So once you have PACF values, you can compute MSEs efficiently.
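The recursion translates directly into code; a compact Python version (the document's own implementation, in R, follows in the next section), checked on an AR(1) where $\rho(h)=\varphi^h$ and the PACF should be $(\varphi,0,0,\dots)$:

```python
import numpy as np

def durbin_levinson(rho):
    """Return phi with phi[n, k] = phi_{n,k}; input rho[h-1] holds rho(h)."""
    N = len(rho)
    phi = np.zeros((N + 1, N + 1))
    phi[1, 1] = rho[0]
    for n in range(2, N + 1):
        num = rho[n - 1] - sum(phi[n - 1, n - k] * rho[k - 1] for k in range(1, n))
        den = 1.0 - sum(phi[n - 1, k] * rho[k - 1] for k in range(1, n))
        phi[n, n] = num / den                      # PACF at lag n
        for k in range(1, n):                      # update earlier coefficients
            phi[n, k] = phi[n - 1, k] - phi[n, n] * phi[n - 1, n - k]
    return phi

a = 0.6
rho = [a**h for h in range(1, 7)]                  # AR(1) autocorrelations
phi = durbin_levinson(rho)
pacf = [phi[n, n] for n in range(1, 7)]
assert np.allclose(pacf, [a, 0, 0, 0, 0, 0])

# MSE product formula: v_n / gamma(0) = prod_j (1 - phi_{j,j}^2) = 1 - a^2 here
assert np.isclose(np.prod([1 - p**2 for p in pacf]), 1 - a**2)
```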


12. R code (high-level)

  1. Define the ARMA model and compute its theoretical ACF
n <- 1000
phi <- c(0.3, 0.4)
th <- 0.7
ACF <- ARMAacf(ar = phi, ma = th, lag.max = n)[-1]
  • phi <- c(0.3, 0.4) and th <- 0.7 specify an ARMA(2,1) model:
    • AR part: coefficients 0.3 and 0.4
    • MA part: coefficient 0.7
  • ARMAacf(...) returns the theoretical autocorrelation function (ACF) implied by that model.
  • lag.max = n computes lags up to 1000.
  • [-1] drops lag 0 (which is always 1 for the ACF), so ACF[1] corresponds to lag 1, etc.
  2. Implement Durbin–Levinson to compute one-step-ahead predictor coefficients for every order $j$
phi.matrix <- matrix(NA, n, n)
phi.matrix[1,1] <- ACF[1]

for (j in 2:n) {
  num   <- ACF[j] - sum(phi.matrix[j-1,(j-1):1] * ACF[1:(j-1)])
  denom <- 1 - sum(phi.matrix[j-1,1:(j-1)] * ACF[1:(j-1)])
  phi.matrix[j,j] <- num/denom

  for (k in 1:(j-1)) {
    term.1 <- phi.matrix[j-1,k]
    term.2 <- phi.matrix[j,j] * phi.matrix[j-1,j-k]
    phi.matrix[j,k] <- term.1 - term.2
  }
}

(Note the parentheses in 1:(j-1): in R, 1:j-1 parses as (1:j)-1, which silently includes index 0.)

What this block is doing:

  • phi.matrix is built so that row j contains the coefficients $\phi_{j,1}, \ldots, \phi_{j,j}$ used in the best linear one-step predictor $P(X_{j+1}\mid X_1,\ldots,X_j)$.
  • The recursion has two key parts for each j:
    • Compute the diagonal term phi.matrix[j,j] (this is the PACF at lag $j$):
num   <- ACF[j] - sum(phi.matrix[j-1,(j-1):1] * ACF[1:(j-1)])
denom <- 1 - sum(phi.matrix[j-1,1:(j-1)] * ACF[1:(j-1)])
phi.matrix[j,j] <- num/denom

This is the Durbin–Levinson formula for $\phi_{j,j}$, using the ACF values and the previous row’s coefficients.

  • Update the remaining coefficients in the row:
phi.matrix[j,k] <- phi.matrix[j-1,k] - phi.matrix[j,j] * phi.matrix[j-1,j-k]

This computes $\phi_{j,k}$ for $k = 1,\ldots,j-1$ from the previous-order coefficients.

  3. Extract the PACF directly from the diagonal and verify it against the built-in PACF
m <- 5; PACF <- NULL
for (h in 1:n) {
  PACF[h] <- phi.matrix[h,h]
}
PACF[1:m]

ARMAacf(ar = phi, ma = th, lag.max = m, pacf = TRUE)
  • PACF[h] <- phi.matrix[h,h] extracts the diagonal entries, which are the PACF values: $\mathrm{PACF}(h) = \phi_{h,h}$.
  • The second line recomputes the PACF using R’s built-in routine:
    • pacf = TRUE tells ARMAacf to compute the PACF instead of the ACF.
  • The point of this comparison: your Durbin–Levinson implementation should match the trusted built-in PACF.
  4. Show what the predictor coefficients look like for a specific order (example $j=3$)
j <- 3
phi.matrix[j,1:j]
  • This prints $\phi_{3,1}, \phi_{3,2}, \phi_{3,3}$, meaning the code has constructed the predictor $P(X_4 \mid X_1, X_2, X_3) = \phi_{3,1}X_3 + \phi_{3,2}X_2 + \phi_{3,3}X_1$ (note the ordering: the row is aligned so the most recent value gets $\phi_{3,1}$).
  5. Simulate an ARMA time series and generate one-step-ahead predictions using the computed coefficients
set.seed(1)
X <- arima.sim(model = list(ar = phi, ma = th), n)

Y <- NULL; Y[1] <- 0
for (j in 1:(n-1)) {
  Y[j+1] <- sum(phi.matrix[j,1:j] * X[j:1])
}
Z <- X - Y
  • arima.sim(...) generates a synthetic time series X from the same ARMA(2,1) model.
  • Y[j+1] computes the one-step-ahead prediction of X[j+1] using the coefficients in row j:
    • phi.matrix[j,1:j] are the predictor weights for predicting step j+1,
    • X[j:1] supplies the past observations in reverse order so they align with the coefficients.
  • Z <- X - Y computes the prediction errors (actual minus predicted).
  6. Plot the simulated series vs predictions, and plot prediction errors
par(mfrow = c(1,2))
plot(X, xlim = c(0,20), lwd = 1.5, main = 'Series and One-Step Predictions')
lines(ts(Y), col = 'blue', type = 'b')

plot(Z, xlim = c(0,100), main = 'One-Step Prediction Errors')
abline(h = 0)
  • Left plot: shows how close the one-step predictions Y track the simulated series X.
  • Right plot: shows the prediction errors Z; ideally they behave roughly like noise centered at 0.

In summary, the code demonstrates (a) how to compute the PACF and one-step predictor coefficients from an ARMA model using Durbin–Levinson, and (b) how to apply those coefficients to simulated data to produce forecasts and analyze forecast errors.


The big conceptual takeaway

  • Forecasting in stationary linear time series is fundamentally a projection problem: choose coefficients so that the residual is orthogonal to the information set.
  • The required coefficients are determined entirely by the autocovariance/autocorrelation structure (Toeplitz linear systems).
  • PACF measures “direct” dependence at lag $h$ after removing intervening lags; for AR(p) it cuts off after $p$.
  • Durbin–Levinson makes practical computation feasible by avoiding repeated matrix inversions and simultaneously producing PACF.