1) The practical problem being solved

After looking at a time series and deciding that an AR model might be a reasonable approximation, you still need to choose the numerical coefficients of that model.

For example, if you decide to try an AR(2) model, you are proposing that the value today depends linearly on the previous two values, plus an unpredictable shock:

$X_t - \phi_1 X_{t-1} - \phi_2 X_{t-2} = W_t$

  • $X_t$: the time series being modeled
  • $\phi_1,\phi_2$: the coefficients you must estimate from data
  • $W_t$: “new information” or shock at time $t$ (often called white noise), with variance $\sigma_W^2$

The key question is: How can we estimate $\phi_1,\dots,\phi_p$ (and also $\sigma_W^2$) using only the observed data $x_1,\dots,x_n$?

The method described here produces initial estimates that are often good enough to get started and are also useful as inputs to more refined estimation procedures.


2) The method-of-moments idea (why “moments” show up)

A “moment” is a statistical quantity like a mean, variance, or covariance. Covariances are sometimes called “mixed moments” because they involve products of two variables.

The core strategy is:

  1. Compute sample autocovariances $\hat{\gamma}(h)$ (or sample autocorrelations $\hat{\rho}(h)$) from the observed dataset.
  2. Pretend—temporarily—that these sample quantities are equal to the model’s true autocovariances (or autocorrelations) for the first few lags.
  3. Use the mathematical relationships that must hold in an AR(p) model to solve for the unknown coefficients.

This is reasonable because, when the dataset is long enough, sample autocovariances tend to be close to the true autocovariances of the underlying process with high probability.
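Step 1 can be carried out directly from the definition of the sample autocovariance. A minimal sketch in Python (NumPy only; `sample_acov` and `sample_acor` are hypothetical helper names, not from the original text):

```python
import numpy as np

def sample_acov(x, h):
    """Sample autocovariance at lag h: mean-centered products of values
    h steps apart, averaged with divisor n (the usual time-series choice)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.sum(xc[h:] * xc[:n - h]) / n

def sample_acor(x, h):
    """Sample autocorrelation: autocovariance normalized by its lag-0 value."""
    return sample_acov(x, h) / sample_acov(x, 0)
```

For example, `sample_acor(x, 0)` is always 1, and for the toy series `[1, 2, 3, 4, 5]` the lag-1 autocorrelation works out to 0.4.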


3) Where the Yule–Walker equations come from (the basic derivation idea)

Assume the series follows an AR(p) model:

$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + W_t$

To connect the unknown coefficients $\phi_1,\dots,\phi_p$ to observable correlation structure:

  • Multiply both sides by $X_{t-h}$ for $h=1,2,\dots,p$
  • Take expectations (long-run averages)

Because $W_t$ is an unpredictable shock, it is uncorrelated with past values $X_{t-h}$ for $h\ge 1$. That makes the equations clean.

This produces a system of equations that ties the autocovariances of the process to the AR coefficients:

For $h=1$: $\gamma(1) = \phi_1\gamma(0) + \phi_2\gamma(1) + \cdots + \phi_p\gamma(p-1)$

For $h=2$: $\gamma(2) = \phi_1\gamma(1) + \phi_2\gamma(0) + \cdots + \phi_p\gamma(p-2)$

$\vdots$

For $h=p$: $\gamma(p) = \phi_1\gamma(p-1) + \phi_2\gamma(p-2) + \cdots + \phi_p\gamma(0)$

This family of equations is called the Yule–Walker equations.


4) The matrix form (why it matters)

Instead of writing many equations line by line, the system can be written compactly:

$\gamma_p = \Gamma_p \phi$

Where:

  • $\phi = (\phi_1,\dots,\phi_p)^\top$ is the coefficient vector.
  • $\gamma_p = (\gamma(1),\dots,\gamma(p))^\top$ is the vector of autocovariances at lags $1$ through $p$.
  • $\Gamma_p$ is a $p\times p$ matrix built from autocovariances:

$\Gamma_p = \begin{pmatrix} \gamma(0) & \gamma(1) & \cdots & \gamma(p-1) \\ \gamma(1) & \gamma(0) & \cdots & \gamma(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(p-1) & \gamma(p-2) & \cdots & \gamma(0) \end{pmatrix}$

This matrix has constant diagonals, which makes it a Toeplitz matrix, a standard object in time series.

You can also divide everything by $\gamma(0)$ and express the same relationship in terms of correlations:

$\rho_p = R_p \phi$

Where $R_p$ is the same kind of matrix but built from autocorrelations $\rho(h)$ instead of autocovariances $\gamma(h)$.
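Because of the constant-diagonal structure, $\Gamma_p$ and $R_p$ can be assembled in one call with SciPy's Toeplitz constructor. A sketch with hypothetical values $\gamma(0)=2.0$, $\gamma(1)=1.2$, $\gamma(2)=0.5$:

```python
import numpy as np
from scipy.linalg import toeplitz  # constant-diagonal matrix constructor

gamma = np.array([2.0, 1.2, 0.5])  # hypothetical gamma(0), gamma(1), gamma(2)
Gamma_p = toeplitz(gamma)          # symmetric 3x3 autocovariance matrix Gamma_3
R_p = Gamma_p / gamma[0]           # correlation version: divide through by gamma(0)
```

The first row and first column of `Gamma_p` are both `gamma`, and every descending diagonal repeats a single value.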


5) Solving “backwards”: from ACF to AR coefficients

Often you already have estimates of autocorrelation from data, and you want coefficients.

AR(2) example logic

For an AR(2) model, the correlation-based system is:

$\begin{pmatrix} \rho(1) \\ \rho(2) \end{pmatrix} = \begin{pmatrix} 1 & \rho(1) \\ \rho(1) & 1 \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \end{pmatrix}$

So if someone tells you $\rho(1)=0.6$ and $\rho(2)=0.8$, you solve:

$\begin{pmatrix} 0.6 \\ 0.8 \end{pmatrix} = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \end{pmatrix}$

Solving yields $\phi_1=0.1875$, $\phi_2=0.6875$.
This demonstrates the key point:

  • Knowing a few early-lag correlations is enough to determine AR coefficients, assuming the AR order is known and the model assumptions apply.
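This small system can be checked numerically. A sketch with NumPy, reusing the $\rho(1)=0.6$, $\rho(2)=0.8$ values above:

```python
import numpy as np

rho = np.array([0.6, 0.8])         # rho(1), rho(2) from the example
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])         # R_2 built from rho(1)
phi = np.linalg.solve(R, rho)      # solve R_2 * phi = rho_2
# phi comes out as approximately [0.1875, 0.6875]
```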

6) Estimating the shock variance $\sigma_W^2$

An AR model does not just have coefficients $\phi_1,\dots,\phi_p$; it also needs the variance of the noise term $W_t$.

A standard identity for an AR(p) model is:

$\gamma(0)\left(1-\phi_1\rho(1)-\cdots-\phi_p\rho(p)\right)=\sigma_W^2$

This is extremely useful because it converts:

  • the series variance $\gamma(0)$,
  • the correlations $\rho(1),\dots,\rho(p)$,
  • and the AR coefficients $\phi_1,\dots,\phi_p$

into the noise variance $\sigma_W^2$.

So once you estimate $\phi$ and $\rho$ from data, you can estimate $\sigma_W^2$ as well.
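The identity maps directly to one line of code. A sketch using the rounded AR(2) estimates from the simulated example in Section 8, so the result is only approximate:

```python
import numpy as np

gamma0 = 1.603                           # estimate of gamma(0) from Section 8
rho = np.array([-0.484, 0.546])          # rho_hat(1), rho_hat(2)
phi = np.array([-0.287, 0.407])          # AR(2) coefficient estimates
sigma2_W = gamma0 * (1.0 - phi @ rho)    # gamma(0) * (1 - sum_k phi_k rho(k))
# sigma2_W comes out near 1.024
```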


7) Turning the theory into a data-based estimator (Yule–Walker estimators)

In real data, you do not know $\gamma(h)$ or $\rho(h)$. You estimate them from the sample:

  • sample autocovariance $\hat{\gamma}(h)$
  • sample autocorrelation $\hat{\rho}(h)$

Then you mimic the theoretical equations:

$\hat{\gamma}_p = \hat{\Gamma}_p \hat{\phi} \quad\text{or}\quad \hat{\rho}_p = \hat{R}_p \hat{\phi}$

Solving gives the Yule–Walker estimator:

$\hat{\phi} = \hat{\Gamma}_p^{-1}\hat{\gamma}_p = \hat{R}_p^{-1}\hat{\rho}_p$

Then the noise variance estimate is:

$\hat{\sigma}_W^2 = \hat{\gamma}(0)\left(1-\hat{\phi}_1\hat{\rho}(1)-\cdots-\hat{\phi}_p\hat{\rho}(p)\right)$

These are “method-of-moments” estimators because they match a finite number of covariance/correlation moments.
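Putting the pieces together, the whole procedure fits in a short function. A minimal sketch (`yule_walker_fit` is a hypothetical name; statsmodels ships a production implementation as `statsmodels.regression.linear_model.yule_walker`):

```python
import numpy as np

def yule_walker_fit(x, p):
    """Method-of-moments AR(p) fit: solve R_p * phi = rho_p using
    sample autocorrelations, then back out the noise variance."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    # sample autocovariances gamma_hat(0), ..., gamma_hat(p), divisor n
    g = np.array([np.sum(xc[h:] * xc[:n - h]) / n for h in range(p + 1)])
    r = g / g[0]                                  # sample autocorrelations
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(R, r[1:])               # Yule-Walker coefficients
    sigma2 = g[0] * (1.0 - phi @ r[1:])           # noise variance identity
    return phi, sigma2
```

Applied to white noise, the fitted coefficients should be near zero and the variance estimate near the true noise variance.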


8) Example 1: Recovering coefficients from a simulated AR(2)

A simulated series is generated from:

$X_t = -0.3X_{t-1} + 0.4X_{t-2} + W_t$

A long length $n=10000$ is used so that sample correlations are stable.

From the data, the sample autocorrelations are computed:

  • $\hat{\rho}(1)\approx -0.484$
  • $\hat{\rho}(2)\approx 0.546$

Then the AR(2) Yule–Walker system is built:

$\begin{pmatrix} \hat{\rho}(1) \\ \hat{\rho}(2) \end{pmatrix} = \begin{pmatrix} 1 & \hat{\rho}(1) \\ \hat{\rho}(1) & 1 \end{pmatrix} \begin{pmatrix} \hat{\phi}_1 \\ \hat{\phi}_2 \end{pmatrix}$

Solving yields:

  • $\hat{\phi}_1\approx -0.287$
  • $\hat{\phi}_2\approx 0.407$

These are close to the true values $-0.3$ and $0.4$. The difference is normal: even with large $n$, estimates are not exact.

Then the variance of the series is estimated as $\hat{\gamma}(0)\approx 1.603$, and the noise variance estimate is computed:

  • $\hat{\sigma}_W^2 \approx 1.024$

This is close to 1, which was the simulation default.

What this demonstrates:

  • With enough data, the Yule–Walker method can recover AR coefficients reasonably well.
  • The noise variance can also be estimated from the same ingredients.
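A simulation along these lines can be written in a few lines of Python. This is a sketch, not the original code: the seed is arbitrary, so the estimates will differ slightly from the numbers quoted above:

```python
import numpy as np

rng = np.random.default_rng(42)          # arbitrary seed; estimates vary slightly
n, burn = 10_000, 500
w = rng.standard_normal(n + burn)        # unit-variance white noise W_t
x = np.zeros(n + burn)
for t in range(2, n + burn):
    # X_t = -0.3 X_{t-1} + 0.4 X_{t-2} + W_t
    x[t] = -0.3 * x[t - 1] + 0.4 * x[t - 2] + w[t]
x = x[burn:]                             # discard burn-in transient

xc = x - x.mean()
g = np.array([np.sum(xc[h:] * xc[:n - h]) / n for h in range(3)])
r = g / g[0]                             # 1, rho_hat(1), rho_hat(2)
phi = np.linalg.solve(np.array([[1, r[1]], [r[1], 1]]), r[1:])
sigma2 = g[0] * (1 - phi @ r[1:])
# phi should land near (-0.3, 0.4) and sigma2 near 1
```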

9) Example 2: Fitting an AR(2) model to a real dataset (LakeHuron)

A real dataset is treated as a candidate for an AR(2) approximation.

From the data:

  • $\hat{\rho}(1)\approx 0.832$
  • $\hat{\rho}(2)\approx 0.610$

Solving the same AR(2) Yule–Walker system gives:

  • $\hat{\phi}_1\approx 1.054$
  • $\hat{\phi}_2\approx -0.267$

And estimated series variance:

  • $\hat{\gamma}(0)\approx 1.720$

Then the noise variance estimate:

  • $\hat{\sigma}_W^2\approx 0.492$

A critical practical correction: the mean is not zero

An AR(2) model written as:

$X_t = 1.054X_{t-1} - 0.267X_{t-2} + W_t$

is naturally a mean-zero model when written in this form (if the noise has mean zero and the process is stationary). But the dataset consists of large positive numbers, with sample mean:

$\bar{y}\approx 579.004$

So it is not reasonable to model the raw series $y_t$ as mean-zero.

Why the coefficient estimation still works

Autocovariance and autocorrelation are computed after subtracting the mean (explicitly or implicitly). Adding a constant shifts the entire series but does not change how values co-move around the mean.

So the estimated $\hat{\phi}_1,\hat{\phi}_2$ remain valid for the mean-adjusted series:

$x_t = y_t - \bar{y}$

Meaning: the AR(2) structure is being fit to the fluctuations around the mean, not to the absolute level.
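This invariance is easy to verify numerically: adding a constant to a series leaves its sample autocorrelations unchanged, up to floating-point noise. A quick sketch (`acor` is a hypothetical helper):

```python
import numpy as np

def acor(x, h):
    """Sample autocorrelation at lag h (mean-centered)."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    return np.sum(xc[h:] * xc[:len(xc) - h]) / np.sum(xc * xc)

rng = np.random.default_rng(1)
x = rng.standard_normal(200)     # a mean-zero series
y = x + 579.004                  # same series shifted by a large constant
# acor(x, 1) and acor(y, 1) agree up to floating-point noise
```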

Converting back to a model for the original series

Define a model $Y_t$ for the original scale by:

$Y_t = X_t + \bar{y}$

Then $E(Y_t)=\bar{y}$, as desired.

Algebraically, this leads to an AR(2) model with an intercept (constant term):

$Y_t = c + 1.054Y_{t-1} - 0.267Y_{t-2} + W_t$

where the intercept is computed by:

$c = \bar{y}\left(1 - (\hat{\phi}_1+\hat{\phi}_2)\right)$

Using the provided numbers:

  • $\bar{y}\approx 579.004$
  • $\hat{\phi}_1+\hat{\phi}_2 \approx 1.054-0.267=0.787$

$c \approx 579.004\,(1-0.787) \approx 123.3$; carrying unrounded coefficient estimates through the computation gives $123.286$.

So the final fitted model is:

$Y_t = 123.286 + 1.054Y_{t-1} - 0.267Y_{t-2} + W_t$

with $\operatorname{Var}(W_t)\approx 0.492$.
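The intercept arithmetic can be checked directly. A sketch using the rounded values quoted above (with these rounded inputs the result is about $123.33$, slightly off the $123.286$ obtained from unrounded estimates):

```python
ybar = 579.004                  # sample mean of the observed series
phi1, phi2 = 1.054, -0.267      # Yule-Walker AR(2) estimates (rounded)
c = ybar * (1 - (phi1 + phi2))  # intercept: c = ybar * (1 - (phi1 + phi2))
# c comes out near 123.33 with these rounded inputs
```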

Interpretation of this fitted model:

  • The series has strong persistence because the coefficient on $Y_{t-1}$ is large.
  • The negative coefficient on $Y_{t-2}$ introduces a corrective effect that can produce oscillation or damping behavior, depending on the combination of coefficients.
  • The noise variance indicates how much unpredictable variation remains after accounting for the AR structure.

10) What “preliminary” means here

This approach is intentionally positioned as an initial, fast way to obtain parameters because:

  • it relies only on estimated correlations,
  • it is computationally simple,
  • it often provides reasonable starting values,

but it is not always the best final estimator under all conditions. More systematic methods can refine parameter estimates and compare competing models more rigorously.


11) Key takeaways

  1. Once you pick an AR order $p$, the early autocovariances/correlations determine the AR coefficients via a linear system.
  2. Replacing unknown true correlations with sample correlations produces the Yule–Walker estimators.
  3. You can estimate the noise variance from the same ingredients using a standard identity.
  4. If the observed series has a nonzero mean, you should include an intercept or model mean-adjusted data, because AR equations in their simplest form are centered around zero.
  5. These estimates are often used as a first approximation and as inputs to more advanced model-fitting procedures.