1. Formal Definition
Mean Squared Error (MSE) measures the average squared difference between the true values ($y_i$) and the model’s predictions ($\hat{y}_i$).
$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
- $y_i$: Actual observed values (ground truth)
- $\hat{y}_i$: Predicted values from the model
- $n$: Number of samples
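The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name `mse` and the input checks are illustrative choices, not part of the definition.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Return the mean of squared differences between y_true and y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    # Sum of (y_i - y_hat_i)^2, divided by n.
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Only the last prediction is wrong here (by 2 units), so the result is 4 divided by the 3 samples.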
2. Why Square the Errors?
- No cancellation: If we simply summed the raw errors ($y_i - \hat{y}_i$), positive and negative values would cancel each other out. Squaring makes every term non-negative, so this cannot happen.
- Penalizes large errors more strongly: A deviation of 10 units contributes 100 after squaring, while a deviation of 1 unit contributes only 1.
- This property makes MSE very sensitive to outliers.
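The cancellation problem is easy to see with a few made-up numbers. In this small sketch the error values are invented for illustration: the raw errors average to zero even though every prediction is wrong, while the squared errors do not.

```python
# Hypothetical errors (y_i - y_hat_i): two large misses and two small ones.
errors = [5.0, -5.0, 1.0, -1.0]

# Raw errors cancel: the mean is 0.0 although no prediction is correct.
mean_raw_error = sum(errors) / len(errors)

# Squared errors cannot cancel: (25 + 25 + 1 + 1) / 4 = 13.0
mean_squared_error = sum(e ** 2 for e in errors) / len(errors)

print(mean_raw_error, mean_squared_error)

# Squaring grows faster than the error itself: a 10x larger error
# contributes 100x more to the sum.
for e in (1.0, 2.0, 10.0):
    print(e, e ** 2)
```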
3. Relation to Other Metrics
- MAE (Mean Absolute Error): Uses absolute differences instead of squares. Less sensitive to outliers.
- RMSE (Root Mean Squared Error): The square root of MSE, which brings the error back to the same unit as the target variable.
- MSE vs. RMSE:
  - MSE needs no square root and has a simpler gradient, so it is often used as the loss function in optimization. Because the square root is monotonic, both are minimized by the same model.
  - RMSE is easier to interpret (same units as the data).
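To compare the three metrics side by side, the sketch below computes each from a list of errors, once without and once with a single large miss. The error values and helper names are made up for illustration.

```python
import math


def mse(errors):
    """Mean of squared errors."""
    return sum(e * e for e in errors) / len(errors)


def mae(errors):
    """Mean of absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Square root of MSE, in the same units as the target."""
    return math.sqrt(mse(errors))


clean = [1.0, -1.0, 1.0, -1.0]
with_outlier = [1.0, -1.0, 1.0, -10.0]  # one prediction is 10 units off

for name, errs in (("clean", clean), ("with outlier", with_outlier)):
    print(name, "MSE:", mse(errs), "MAE:", mae(errs), "RMSE:", rmse(errs))
```

On the clean errors all three metrics equal 1. With the single outlier, MAE rises to 3.25 while MSE jumps to 25.75, which shows how much more strongly MSE reacts to one bad prediction.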
4. Statistical Meaning
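- Expected MSE = Bias² + Variance + irreducible noise (bias-variance decomposition). When estimating a fixed quantity with no observation noise, the last term vanishes and MSE = Bias² + Variance.
- Bias²: Systematic error from wrong assumptions (e.g., using a linear model for curved data).
- Variance: Error from sensitivity to the particular training set (a symptom of overfitting).
- This decomposition explains why lowering MSE on unseen data means balancing underfitting (high bias) against overfitting (high variance).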
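A sketch of where the decomposition comes from, under assumptions the text above does not state: the data follow $y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, the fitted model $\hat{f}$ depends on a random training set, and $\varepsilon$ is independent of $\hat{f}$. At a fixed $x$, taking the expectation over training sets and noise, the cross term with $\varepsilon$ vanishes:

$E[(y - \hat{f}(x))^2] = E[(f(x) - \hat{f}(x))^2] + \sigma^2$

Adding and subtracting $E[\hat{f}(x)]$ inside the remaining square, and using $E[\hat{f}(x) - E[\hat{f}(x)]] = 0$, gives:

$E[(f(x) - \hat{f}(x))^2] = (f(x) - E[\hat{f}(x)])^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2]$

The first term on the right is Bias², the second is Variance, and $\sigma^2$ is noise that no model can remove.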
5. Properties
- Non-negative: $MSE \geq 0$.
- Consistent estimator: When computed on independent samples that were not used for training, the sample MSE converges to the true expected squared error as the sample size grows.
- Differentiable: Important for gradient-based optimization (e.g., in neural networks).
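Differentiating the formula with respect to one prediction gives $\frac{\partial MSE}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$. The sketch below checks that expression against a finite-difference approximation; the function names and sample values are illustrative assumptions.

```python
def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mse_grad(y_true, y_pred):
    """Analytic gradient of MSE with respect to each prediction."""
    n = len(y_true)
    return [2.0 * (p - y) / n for y, p in zip(y_true, y_pred)]


def numeric_grad(y_true, y_pred, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grads = []
    for i in range(len(y_pred)):
        up = list(y_pred)
        down = list(y_pred)
        up[i] += eps
        down[i] -= eps
        grads.append((mse(y_true, up) - mse(y_true, down)) / (2 * eps))
    return grads


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 6.0]

analytic = mse_grad(y_true, y_pred)
numeric = numeric_grad(y_true, y_pred)
print(analytic)
print(numeric)
```

The gradient is zero for the prediction that is already exact and grows linearly with the size of the miss, which is what gradient-based optimizers rely on.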
6. Example Calculation
Suppose we have 4 observations:
| Observation | True Value ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared Error |
|---|---|---|---|---|
| 1 | 3 | 2 | 1 | 1 |
| 2 | 5 | 5 | 0 | 0 |
| 3 | 2 | 4 | -2 | 4 |
| 4 | 7 | 6 | 1 | 1 |
$MSE = \frac{1}{4}(1+0+4+1) = \frac{6}{4} = 1.5$
So the model’s average squared error is 1.5.
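The table can be reproduced column by column in a few lines of Python. This is only a check of the arithmetic above; the variable names are arbitrary.

```python
import math

y_true = [3, 5, 2, 7]
y_pred = [2, 5, 4, 6]

# Error column: y - y_hat for each observation.
errors = [y - p for y, p in zip(y_true, y_pred)]

# Squared Error column.
squared_errors = [e ** 2 for e in errors]

# Average of the squared errors.
mse_value = sum(squared_errors) / len(squared_errors)

# Square root, in the same units as y.
rmse_value = math.sqrt(mse_value)

print(errors)          # [1, 0, -2, 1]
print(squared_errors)  # [1, 0, 4, 1]
print(mse_value)       # 1.5
print(rmse_value)
```

The square root of 1.5 is roughly 1.22, so in the units of $y$ the typical miss is a little over one unit.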
7. Applications of MSE
- Model Training:
  - Linear regression minimizes MSE to find the best-fit line.
  - Neural networks commonly use MSE (or RMSE) as a loss function for regression tasks.
- Forecasting Accuracy:
  - Time series models (ARIMA, LSTM, etc.) are often evaluated by MSE.
- Signal Processing:
  - Used to measure reconstruction accuracy (e.g., audio/image compression).
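As an illustration of the model-training case, the sketch below fits a straight line to a handful of invented points using the standard closed-form least-squares solution for one feature, then confirms that nudging the slope makes the MSE worse. The data and the helper names `fit_line` and `mse_of_line` are assumptions made for this example.

```python
def mse_of_line(slope, intercept, xs, ys):
    """MSE of the predictions slope * x + intercept against ys."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys):
    """Closed-form least-squares fit for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying roughly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

slope, intercept = fit_line(xs, ys)
best = mse_of_line(slope, intercept, xs, ys)
nudged = mse_of_line(slope + 0.1, intercept, xs, ys)

print(slope, intercept)
print(best, nudged)
```

Any other slope or intercept gives a larger MSE on these points, which is what "best-fit" means in this setting.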
8. Advantages & Disadvantages
Advantages:
- Simple to compute and widely used.
- Differentiable (good for gradient descent).
- Penalizes larger errors more (useful if large deviations are unacceptable).
Disadvantages:
- Units are squared (harder to interpret).
- Sensitive to outliers (a single bad prediction can dominate the error).
9. When to Use MSE vs Others
- Use MSE: When large errors must be penalized heavily.
- Use MAE: When robustness to outliers is more important.
- Use RMSE: When interpretability (same units as target) matters.
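One way to see the robustness difference: if a model may only output a single constant, the constant that minimizes MSE is the mean of the data, while the constant that minimizes MAE is the median. The sketch below checks this on invented data containing one extreme value; the helper names and the grid of candidate constants are illustrative.

```python
from statistics import mean, median

# Made-up targets with one extreme value.
data = [1.0, 2.0, 3.0, 4.0, 100.0]


def mse_const(c, ys):
    """MSE when every prediction equals the constant c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)


def mae_const(c, ys):
    """MAE when every prediction equals the constant c."""
    return sum(abs(y - c) for y in ys) / len(ys)


# Candidate constants from 0.0 to 100.0 in steps of 0.1.
candidates = [i / 10 for i in range(0, 1001)]

best_for_mse = min(candidates, key=lambda c: mse_const(c, data))
best_for_mae = min(candidates, key=lambda c: mae_const(c, data))

print(mean(data), best_for_mse)    # the mean is pulled toward the outlier
print(median(data), best_for_mae)  # the median stays with the bulk of the data
```

The MSE-optimal constant lands at 22, far from four of the five points, while the MAE-optimal constant stays at 3.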
Summary in one line:
MSE is the average of squared prediction errors. It is widely used in regression because it is mathematically convenient and strongly penalizes large errors, but it can be distorted by outliers.
