1. Formal Definition

Mean Squared Error (MSE) measures the average squared difference between the true values ($y_i$) and the model’s predictions ($\hat{y}_i$).

$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

  • $y_i$: Actual observed values (ground truth)
  • $\hat{y}_i$: Predicted values from the model
  • $n$: Number of samples
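
The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name `mse` and the input checks are illustrative choices, not part of any particular library.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    n = len(y_true)
    # Sum of (y_i - y_hat_i)^2 over all samples, divided by n.
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical data: only the last prediction is off, by 2 units.
example = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # (0 + 0 + 4) / 3
```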

2. Why Square the Errors?

  • No cancellation: If we simply sum the raw errors ($y_i - \hat{y}_i$), positive and negative values cancel each other out. Squaring makes every term non-negative, so nothing cancels.
  • Penalizes large errors more strongly: The penalty grows quadratically, so an error of 10 units contributes 100 to the sum, while an error of 1 unit contributes only 1.
  • This property makes MSE very sensitive to outliers.
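
Both effects are easy to see on a tiny example. The numbers below are hypothetical and chosen only to make the arithmetic obvious.

```python
# Hypothetical data: two errors of equal size and opposite sign.
y_true = [10.0, 10.0]
y_pred = [7.0, 13.0]

errors = [y - y_hat for y, y_hat in zip(y_true, y_pred)]  # [3.0, -3.0]

# The raw errors cancel, hiding the fact that both predictions are off by 3.
mean_raw_error = sum(errors) / len(errors)  # 0.0

# The squared errors do not cancel.
mean_squared_error = sum(e ** 2 for e in errors) / len(errors)  # 9.0

# Quadratic growth: a 10x larger error contributes 100x more to the sum.
small, large = 1.0, 10.0
ratio = large ** 2 / small ** 2  # 100.0
```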

3. Relation to Other Metrics

  • MAE (Mean Absolute Error): Uses absolute differences instead of squares. Less sensitive to outliers.
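  • RMSE (Root Mean Squared Error): The square root of MSE, which brings the error back to the same units as the target variable.
  • MSE vs. RMSE:
    • MSE avoids the square root, which keeps its gradient simple, so it is often used as a loss function in optimization. Because the square root is monotonic, both metrics are minimized by the same model.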
    • RMSE is easier to interpret (same units as data).
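
To compare the three metrics side by side, here is a small sketch in plain Python. The helper names are illustrative, and the `with_outlier` predictions are a hypothetical variation in which one prediction is off by 10 units.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [3.0, 5.0, 2.0, 7.0]
clean = [2.0, 5.0, 4.0, 6.0]
with_outlier = [2.0, 5.0, 4.0, 17.0]  # hypothetical: last prediction off by 10

for name, preds in [("clean", clean), ("with outlier", with_outlier)]:
    print(name, mae(y_true, preds), mse(y_true, preds), rmse(y_true, preds))
```

With these numbers the single bad prediction raises MAE from 1.0 to 3.25 but raises MSE from 1.5 to 26.25, which illustrates how much more strongly the squared metrics react to an outlier.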

4. Statistical Meaning

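  • Expected MSE = Bias² + Variance + irreducible noise (Bias-Variance Decomposition). For an estimator of a fixed quantity there is no noise term, and MSE = Variance + Bias².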
    • Bias²: Systematic error from wrong assumptions (e.g., using a linear model for curved data).
    • Variance: Error from sensitivity to training data (overfitting).
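  • This decomposition explains the trade-off between underfitting and overfitting: a model that is too simple has high bias, a model that is too flexible has high variance, and the lowest expected MSE is reached by balancing the two.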
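
The decomposition can be checked numerically. The sketch below assumes a fixed, noise-free target, so only the bias and variance terms appear; the five predictions are hypothetical and stand in for models trained on five different training sets.

```python
from statistics import fmean, pvariance

true_value = 5.0
# Hypothetical predictions for the same input from five separately trained models.
predictions = [4.0, 6.5, 5.5, 7.0, 6.0]

# Average squared error of the predictions against the true value.
mse = fmean([(p - true_value) ** 2 for p in predictions])

# Bias: how far the average prediction is from the truth.
bias_sq = (fmean(predictions) - true_value) ** 2

# Variance: how much the predictions scatter around their own average
# (population variance, i.e. divided by the number of predictions).
variance = pvariance(predictions)

print(mse, bias_sq + variance)  # the two values agree up to rounding
```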

5. Properties

  • Non-negative: $MSE \geq 0$.
  • Consistent estimator: When computed on independent samples that were not used for training, MSE converges to the true expected squared error as the sample size grows (law of large numbers).
  • Differentiable: Important for gradient-based optimization (e.g., in neural networks).
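
Differentiating the definition with respect to one prediction gives $\frac{\partial MSE}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$. The sketch below compares that expression with a finite-difference approximation; the function names and the step size `eps` are illustrative choices.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


def mse_grad(y_true, y_pred):
    """Analytic gradient of MSE with respect to each prediction."""
    n = len(y_true)
    return [2.0 * (y_hat - y) / n for y, y_hat in zip(y_true, y_pred)]


def numeric_grad(y_true, y_pred, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grads = []
    for i in range(len(y_pred)):
        up = list(y_pred)
        down = list(y_pred)
        up[i] += eps
        down[i] -= eps
        grads.append((mse(y_true, up) - mse(y_true, down)) / (2 * eps))
    return grads


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 6.0]

analytic = mse_grad(y_true, y_pred)      # [-0.5, 0.0, 1.0, -0.5]
numeric = numeric_grad(y_true, y_pred)   # approximately the same values
```

The gradient is linear in the error, which is one reason MSE is convenient for gradient-based training.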

6. Example Calculation

Suppose we have 4 observations:

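| Observation | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|---|---|---|---|---|
| 1 | 3 | 2 | 1 | 1 |
| 2 | 5 | 5 | 0 | 0 |
| 3 | 2 | 4 | -2 | 4 |
| 4 | 7 | 6 | 1 | 1 |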

$MSE = \frac{1}{4}(1+0+4+1) = \frac{6}{4} = 1.5$

So the model’s average squared error is 1.5.
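
The same calculation can be reproduced step by step in a few lines of Python, using the four observations from the example above.

```python
y_true = [3, 5, 2, 7]
y_pred = [2, 5, 4, 6]

# Per-observation error and squared error, matching the columns of the example.
errors = [y - y_hat for y, y_hat in zip(y_true, y_pred)]  # [1, 0, -2, 1]
squared = [e ** 2 for e in errors]                        # [1, 0, 4, 1]

# Average of the squared errors.
mse = sum(squared) / len(squared)                         # 6 / 4 = 1.5

for i, (y, y_hat, e, s) in enumerate(zip(y_true, y_pred, errors, squared), start=1):
    print(i, y, y_hat, e, s)
print("MSE =", mse)
```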


7. Applications of MSE

  1. Model Training:
    • Linear regression minimizes MSE to find the best-fit line.
    • Neural networks use MSE (or RMSE) as a loss function for regression tasks.
  2. Forecasting Accuracy:
    • Time series models (ARIMA, LSTM, etc.) are often evaluated by MSE.
  3. Signal Processing:
    • Used to measure reconstruction accuracy (e.g., audio/image compression).
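
As an illustration of the model-training use case, the sketch below fits a straight line by gradient descent on the MSE. It is a minimal example under stated assumptions: the data are hypothetical and lie exactly on $y = 2x + 1$, and the function name `fit_line`, the learning rate, and the step count are arbitrary illustrative choices. In practice linear regression is usually solved in closed form or with a library.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals: prediction minus target for every sample.
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of MSE with respect to the slope and the intercept.
        grad_w = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(w, b)  # close to 2 and 1
```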

8. Advantages & Disadvantages

Advantages:

  • Simple to compute and widely used.
  • Differentiable (good for gradient descent).
  • Penalizes larger errors more (useful if large deviations are unacceptable).

Disadvantages:

  • Units are squared (harder to interpret).
  • Sensitive to outliers (a single bad prediction can dominate the error).

9. When to Use MSE vs. Others

  • Use MSE: When large errors must be penalized heavily.
  • Use MAE: When robustness to outliers is more important.
  • Use RMSE: When interpretability (same units as target) matters.

Summary in one line:
MSE is the average of squared prediction errors, widely used in regression because it is mathematically convenient and strongly penalizes large errors, though it can be distorted by outliers.