1) The practical problem being solved

After looking at a time series and deciding that an AR model might be a reasonable approximation, you still need to choose the numerical coefficients of that model.

For example, if you decide to try an AR(2) model, you are proposing that the value today depends linearly on the previous two values, plus an unpredictable shock:

$X_t - \phi_1 X_{t-1} - \phi_2 X_{t-2} = W_t$

  • $X_t$: the time series being modeled
  • $\phi_1,\phi_2$: the coefficients you must estimate from data
  • $W_t$: “new information” or shock at time $t$ (often called white noise), with variance $\sigma_W^2$

The key question is: How can we estimate $\phi_1,\dots,\phi_p$ (and also $\sigma_W^2$) using only the observed data $x_1,\dots,x_n$?

The method described here produces initial estimates that are often good enough to get started and are also useful as inputs to more refined estimation procedures.


2) The method-of-moments idea (why “moments” show up)

A “moment” is a statistical quantity like a mean, variance, or covariance. Covariances are sometimes called “mixed moments” because they involve products of two variables.

The core strategy is:

  1. Compute sample autocovariances $\hat{\gamma}(h)$ (or sample autocorrelations $\hat{\rho}(h)$) from the observed dataset.
  2. Pretend—temporarily—that these sample quantities are equal to the model’s true autocovariances (or autocorrelations) for the first few lags.
  3. Use the mathematical relationships that must hold in an AR(p) model to solve for the unknown coefficients.

This is reasonable because, when the dataset is long enough, sample autocovariances tend to be close to the true autocovariances of the underlying process with high probability.
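Step 1 can be carried out directly from the definition of the sample autocovariance. A minimal sketch in Python (NumPy only; `sample_acov` and `sample_acor` are hypothetical helper names, not from the original text):

```python
import numpy as np

def sample_acov(x, h):
    """Sample autocovariance at lag h: mean-centered products of values
    h steps apart, averaged with divisor n (the usual time-series choice)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.sum(xc[h:] * xc[:n - h]) / n

def sample_acor(x, h):
    """Sample autocorrelation: autocovariance normalized by its lag-0 value."""
    return sample_acov(x, h) / sample_acov(x, 0)
```

For example, `sample_acor(x, 0)` is always 1, and for the toy series `[1, 2, 3, 4, 5]` the lag-1 autocorrelation works out to 0.4.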


3) Where the Yule–Walker equations come from (the basic derivation idea)

Assume the series follows an AR(p) model:

$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + W_t$

To connect the unknown coefficients $\phi_1,\dots,\phi_p$ to observable correlation structure:

  • Multiply both sides by $X_{t-h}$ for $h=1,2,\dots,p$
  • Take expectations (long-run averages)

Because $W_t$ is an unpredictable shock, it is uncorrelated with past values $X_{t-h}$ for $h\ge 1$. That makes the equations clean.

This produces a system of equations that ties the autocovariances of the process to the AR coefficients:

For $h=1$: $\gamma(1) = \phi_1\gamma(0) + \phi_2\gamma(1) + \cdots + \phi_p\gamma(p-1)$

For $h=2$: $\gamma(2) = \phi_1\gamma(1) + \phi_2\gamma(0) + \cdots + \phi_p\gamma(p-2)$

$\vdots$

For $h=p$: $\gamma(p) = \phi_1\gamma(p-1) + \phi_2\gamma(p-2) + \cdots + \phi_p\gamma(0)$

This family of equations is called the Yule–Walker equations.


4) The matrix form (why it matters)

Instead of writing many equations line by line, the system can be written compactly:

$\gamma_p = \Gamma_p \phi$

Where:

  • $\phi = (\phi_1,\dots,\phi_p)^\top$ is the coefficient vector.
  • $\gamma_p = (\gamma(1),\dots,\gamma(p))^\top$ is the vector of autocovariances at lags $1$ through $p$.
  • $\Gamma_p$ is a $p\times p$ matrix built from autocovariances:

$\Gamma_p = \begin{pmatrix} \gamma(0) & \gamma(1) & \cdots & \gamma(p-1) \\ \gamma(1) & \gamma(0) & \cdots & \gamma(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(p-1) & \gamma(p-2) & \cdots & \gamma(0) \end{pmatrix}$

This matrix has constant diagonals, which makes it a Toeplitz matrix, a standard object in time series.

You can also divide everything by $\gamma(0)$ and express the same relationship in terms of correlations:

$\rho_p = R_p \phi$

Where $R_p$ is the same kind of matrix but built from autocorrelations $\rho(h)$ instead of autocovariances $\gamma(h)$.
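Because of the constant-diagonal structure, $\Gamma_p$ and $R_p$ can be assembled in one call with SciPy's Toeplitz constructor. A sketch with hypothetical values $\gamma(0)=2.0$, $\gamma(1)=1.2$, $\gamma(2)=0.5$:

```python
import numpy as np
from scipy.linalg import toeplitz  # constant-diagonal matrix constructor

gamma = np.array([2.0, 1.2, 0.5])  # hypothetical gamma(0), gamma(1), gamma(2)
Gamma_p = toeplitz(gamma)          # symmetric 3x3 autocovariance matrix Gamma_3
R_p = Gamma_p / gamma[0]           # correlation version: divide through by gamma(0)
```

The first row and first column of `Gamma_p` are both `gamma`, and every descending diagonal repeats a single value.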


5) Solving “backwards”: from ACF to AR coefficients

Often you already have estimates of autocorrelation from data, and you want coefficients.

AR(2) example logic

For an AR(2) model, the correlation-based system is:

$\begin{pmatrix} \rho(1) \\ \rho(2) \end{pmatrix} = \begin{pmatrix} 1 & \rho(1) \\ \rho(1) & 1 \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \end{pmatrix}$

So if someone tells you $\rho(1)=0.6$ and $\rho(2)=0.8$, you solve:

$\begin{pmatrix} 0.6 \\ 0.8 \end{pmatrix} = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \end{pmatrix}$

Solving yields $\phi_1=0.1875$, $\phi_2=0.6875$.
This demonstrates the key point:

  • Knowing a few early-lag correlations is enough to determine AR coefficients, assuming the AR order is known and the model assumptions apply.
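This small system can be checked numerically. A sketch with NumPy, reusing the $\rho(1)=0.6$, $\rho(2)=0.8$ values above:

```python
import numpy as np

rho = np.array([0.6, 0.8])         # rho(1), rho(2) from the example
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])         # R_2 built from rho(1)
phi = np.linalg.solve(R, rho)      # solve R_2 * phi = rho_2
# phi comes out as approximately [0.1875, 0.6875]
```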

6) Estimating the shock variance $\sigma_W^2$

An AR model does not just have coefficients $\phi_1,\dots,\phi_p$; it also needs the variance of the noise term $W_t$.

A standard identity for an AR(p) model is:

$\gamma(0)\left(1-\phi_1\rho(1)-\cdots-\phi_p\rho(p)\right)=\sigma_W^2$

This is extremely useful because it converts:

  • the series variance $\gamma(0)$,
  • the correlations $\rho(1),\dots,\rho(p)$,
  • and the AR coefficients $\phi_1,\dots,\phi_p$

into the noise variance $\sigma_W^2$.

So once you estimate $\phi$ and $\rho$ from data, you can estimate $\sigma_W^2$ as well.
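The identity maps directly to one line of code. A sketch using the rounded AR(2) estimates from the simulated example in Section 8, so the result is only approximate:

```python
import numpy as np

gamma0 = 1.603                           # estimate of gamma(0) from Section 8
rho = np.array([-0.484, 0.546])          # rho_hat(1), rho_hat(2)
phi = np.array([-0.287, 0.407])          # AR(2) coefficient estimates
sigma2_W = gamma0 * (1.0 - phi @ rho)    # gamma(0) * (1 - sum_k phi_k rho(k))
# sigma2_W comes out near 1.024
```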


7) Turning the theory into a data-based estimator (Yule–Walker estimators)

In real data, you do not know $\gamma(h)$ or $\rho(h)$. You estimate them from the sample:

  • sample autocovariance $\hat{\gamma}(h)$
  • sample autocorrelation $\hat{\rho}(h)$

Then you mimic the theoretical equations:

$\hat{\gamma}_p = \hat{\Gamma}_p \hat{\phi} \quad\text{or}\quad \hat{\rho}_p = \hat{R}_p \hat{\phi}$

Solving gives the Yule–Walker estimator:

$\hat{\phi} = \hat{\Gamma}_p^{-1}\hat{\gamma}_p = \hat{R}_p^{-1}\hat{\rho}_p$

Then the noise variance estimate is:

$\hat{\sigma}_W^2 = \hat{\gamma}(0)\left(1-\hat{\phi}_1\hat{\rho}(1)-\cdots-\hat{\phi}_p\hat{\rho}(p)\right)$

These are “method-of-moments” estimators because they match a finite number of covariance/correlation moments.
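Putting the pieces together, the whole procedure fits in a short function. A minimal sketch (`yule_walker_fit` is a hypothetical name; statsmodels ships a production implementation as `statsmodels.regression.linear_model.yule_walker`):

```python
import numpy as np

def yule_walker_fit(x, p):
    """Method-of-moments AR(p) fit: solve R_p * phi = rho_p using
    sample autocorrelations, then back out the noise variance."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    # sample autocovariances gamma_hat(0), ..., gamma_hat(p), divisor n
    g = np.array([np.sum(xc[h:] * xc[:n - h]) / n for h in range(p + 1)])
    r = g / g[0]                                  # sample autocorrelations
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(R, r[1:])               # Yule-Walker coefficients
    sigma2 = g[0] * (1.0 - phi @ r[1:])           # noise variance identity
    return phi, sigma2
```

Applied to white noise, the fitted coefficients should be near zero and the variance estimate near the true noise variance.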


8) Example 1: Recovering coefficients from a simulated AR(2)

A simulated series is generated from:

$X_t = -0.3X_{t-1} + 0.4X_{t-2} + W_t$

A long length $n=10000$ is used so that sample correlations are stable.

From the data, the sample autocorrelations are computed:

  • $\hat{\rho}(1)\approx -0.484$
  • $\hat{\rho}(2)\approx 0.546$

Then the AR(2) Yule–Walker system is built:

$\begin{pmatrix} \hat{\rho}(1) \\ \hat{\rho}(2) \end{pmatrix} = \begin{pmatrix} 1 & \hat{\rho}(1) \\ \hat{\rho}(1) & 1 \end{pmatrix} \begin{pmatrix} \hat{\phi}_1 \\ \hat{\phi}_2 \end{pmatrix}$

Solving yields:

  • $\hat{\phi}_1\approx -0.287$
  • $\hat{\phi}_2\approx 0.407$

These are close to the true values $-0.3$ and $0.4$. The difference is normal: even with large $n$, estimates are not exact.

Then the variance of the series is estimated as $\hat{\gamma}(0)\approx 1.603$, and the noise variance estimate is computed:

  • $\hat{\sigma}_W^2 \approx 1.024$

This is close to 1, which was the simulation default.

What this demonstrates:

  • With enough data, the Yule–Walker method can recover AR coefficients reasonably well.
  • The noise variance can also be estimated from the same ingredients.
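A simulation along these lines can be written in a few lines of Python. This is a sketch, not the original code: the seed is arbitrary, so the estimates will differ slightly from the numbers quoted above:

```python
import numpy as np

rng = np.random.default_rng(42)          # arbitrary seed; estimates vary slightly
n, burn = 10_000, 500
w = rng.standard_normal(n + burn)        # unit-variance white noise W_t
x = np.zeros(n + burn)
for t in range(2, n + burn):
    # X_t = -0.3 X_{t-1} + 0.4 X_{t-2} + W_t
    x[t] = -0.3 * x[t - 1] + 0.4 * x[t - 2] + w[t]
x = x[burn:]                             # discard burn-in transient

xc = x - x.mean()
g = np.array([np.sum(xc[h:] * xc[:n - h]) / n for h in range(3)])
r = g / g[0]                             # 1, rho_hat(1), rho_hat(2)
phi = np.linalg.solve(np.array([[1, r[1]], [r[1], 1]]), r[1:])
sigma2 = g[0] * (1 - phi @ r[1:])
# phi should land near (-0.3, 0.4) and sigma2 near 1
```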

9) Example 2: Fitting an AR(2) model to a real dataset (LakeHuron)

A real dataset is treated as a candidate for an AR(2) approximation.

From the data:

  • $\hat{\rho}(1)\approx 0.832$
  • $\hat{\rho}(2)\approx 0.610$

Solving the same AR(2) Yule–Walker system gives:

  • $\hat{\phi}_1\approx 1.054$
  • $\hat{\phi}_2\approx -0.267$

And estimated series variance:

  • $\hat{\gamma}(0)\approx 1.720$

Then the noise variance estimate:

  • $\hat{\sigma}_W^2\approx 0.492$

A critical practical correction: the mean is not zero

An AR(2) model written as:

$X_t = 1.054X_{t-1} - 0.267X_{t-2} + W_t$

is naturally a mean-zero model when written in this form (if the noise has mean zero and the process is stationary). But the dataset consists of large positive numbers, with sample mean:

$\bar{y}\approx 579.004$

So it is not reasonable to model the raw series $y_t$ as mean-zero.

Why the coefficient estimation still works

Autocovariance and autocorrelation are computed after subtracting the mean (explicitly or implicitly). Adding a constant shifts the entire series but does not change how values co-move around the mean.

So the estimated $\hat{\phi}_1,\hat{\phi}_2$ remain valid for the mean-adjusted series:

$x_t = y_t - \bar{y}$

Meaning: the AR(2) structure is being fit to the fluctuations around the mean, not to the absolute level.
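This invariance is easy to verify numerically: adding a constant to a series leaves its sample autocorrelations unchanged, up to floating-point noise. A quick sketch (`acor` is a hypothetical helper):

```python
import numpy as np

def acor(x, h):
    """Sample autocorrelation at lag h (mean-centered)."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    return np.sum(xc[h:] * xc[:len(xc) - h]) / np.sum(xc * xc)

rng = np.random.default_rng(1)
x = rng.standard_normal(200)     # a mean-zero series
y = x + 579.004                  # same series shifted by a large constant
# acor(x, 1) and acor(y, 1) agree up to floating-point noise
```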

Converting back to a model for the original series

Define a model $Y_t$ for the original scale by:

$Y_t = X_t + \bar{y}$

Then $E(Y_t)=\bar{y}$, as desired.

Algebraically, this leads to an AR(2) model with an intercept (constant term):

$Y_t = c + 1.054Y_{t-1} - 0.267Y_{t-2} + W_t$

where the intercept is computed by:

$c = \bar{y}\left(1 - (\hat{\phi}_1+\hat{\phi}_2)\right)$

Using the provided numbers:

  • $\bar{y}\approx 579.004$
  • $\hat{\phi}_1+\hat{\phi}_2 \approx 1.054-0.267=0.787$

$c \approx 579.004\,(1-0.787) \approx 123.3$; carrying unrounded coefficient estimates through the computation gives $123.286$.

So the final fitted model is:

$Y_t = 123.286 + 1.054Y_{t-1} - 0.267Y_{t-2} + W_t$

with $\operatorname{Var}(W_t)\approx 0.492$.
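The intercept arithmetic can be checked directly. A sketch using the rounded values quoted above (with these rounded inputs the result is about $123.33$, slightly off the $123.286$ obtained from unrounded estimates):

```python
ybar = 579.004                  # sample mean of the observed series
phi1, phi2 = 1.054, -0.267      # Yule-Walker AR(2) estimates (rounded)
c = ybar * (1 - (phi1 + phi2))  # intercept: c = ybar * (1 - (phi1 + phi2))
# c comes out near 123.33 with these rounded inputs
```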

Interpretation of this fitted model:

  • The series has strong persistence because the coefficient on $Y_{t-1}$ is large.
  • The negative coefficient on $Y_{t-2}$ introduces a corrective effect that can produce oscillation or damping behavior, depending on the combination of coefficients.
  • The noise variance indicates how much unpredictable variation remains after accounting for the AR structure.

10) What “preliminary” means here

This approach is intentionally positioned as an initial, fast way to obtain parameters because:

  • it relies only on estimated correlations,
  • it is computationally simple,
  • it often provides reasonable starting values,

but it is not always the best final estimator under all conditions. More systematic methods can refine parameter estimates and compare competing models more rigorously.


11) Key takeaways

  1. Once you pick an AR order $p$, the early autocovariances/correlations determine the AR coefficients via a linear system.
  2. Replacing unknown true correlations with sample correlations produces the Yule–Walker estimators.
  3. You can estimate the noise variance from the same ingredients using a standard identity.
  4. If the observed series has a nonzero mean, you should include an intercept or model mean-adjusted data, because AR equations in their simplest form are centered around zero.
  5. These estimates are often used as a first approximation and as inputs to more advanced model-fitting procedures.