R² (Coefficient of Determination)
1. Definition
- R² measures how much of the variance in the dependent variable (y) is explained by the independent variables (X) in a regression model.
- It tells us the goodness of fit: how well the model captures the variability in the data.
2. Formula
Let:
- $y_i$ = actual values
- $\hat{y}_i$ = predicted values
- $\bar{y}$ = mean of actual values
- Total Sum of Squares (TSS):
$TSS = \sum_{i}(y_i - \bar{y})^2$
= total variation of $y$ around its mean
- Residual Sum of Squares (RSS):
$RSS = \sum_{i}(y_i - \hat{y}_i)^2$
= variation not explained by the model
- Explained Sum of Squares (ESS):
$ESS = \sum_{i}(\hat{y}_i - \bar{y})^2$
= variation explained by the model
- R² definition:
$R^2 = 1 - \frac{RSS}{TSS}$
For ordinary least squares with an intercept, $TSS = RSS + ESS$, so equivalently $R^2 = \frac{ESS}{TSS}$.
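The sums of squares above translate almost directly into code. The following is a minimal sketch in plain Python; the function names `sums_of_squares` and `r_squared` are illustrative choices, not taken from any library, and the sample predictions are assumed to come from the least-squares line $\hat{y} = 0.5 + 0.6x$ fitted to $x = [1, 2, 3, 4]$.

```python
def sums_of_squares(y, y_hat):
    """Return (TSS, RSS, ESS) for actual values y and predictions y_hat."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ess = sum((fi - y_bar) ** 2 for fi in y_hat)
    return tss, rss, ess


def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS; undefined when every y value is the same (TSS = 0)."""
    tss, rss, _ = sums_of_squares(y, y_hat)
    if tss == 0:
        raise ValueError("R^2 is undefined when all y values are equal")
    return 1 - rss / tss


# Illustrative data: y_hat is the least-squares line 0.5 + 0.6*x at x = 1..4.
y = [1, 2, 2, 3]
y_hat = [1.1, 1.7, 2.3, 2.9]

tss, rss, ess = sums_of_squares(y, y_hat)
print(round(tss, 6), round(rss, 6), round(ess, 6))  # TSS equals RSS + ESS here
print(round(r_squared(y, y_hat), 6))
```

Because these predictions come from a least-squares fit with an intercept, RSS and ESS add up to TSS, so both forms of the definition give the same value. For arbitrary predictions that identity need not hold, which is why the sketch computes $1 - RSS/TSS$.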
3. Range & Interpretation
- R² = 1 → perfect fit (model explains all variance).
- R² = 0 → model explains no variance (same as predicting the mean).
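- R² < 0 → model fits worse than just predicting the mean. This can happen when $R^2 = 1 - \frac{RSS}{TSS}$ is evaluated on held-out data or for a model fitted without an intercept.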
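The three cases can be checked numerically. This is a small sketch using the $1 - RSS/TSS$ form; the helper name `r_squared` and the prediction lists are made up for illustration.

```python
def r_squared(y, y_hat):
    """R^2 computed as 1 - RSS/TSS."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - rss / tss


y = [3, 4, 5]

perfect = r_squared(y, [3, 4, 5])   # predictions equal the actual values
baseline = r_squared(y, [4, 4, 4])  # always predict the mean of y
worse = r_squared(y, [5, 4, 3])     # predictions in reversed order

print(perfect, baseline, worse)  # prints: 1.0 0.0 -3.0
```

The reversed predictions have RSS = 8 against TSS = 2, so $R^2 = 1 - 8/2 = -3$: there is no lower bound on how negative the score can get.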
4. Example
Suppose:
- Actual values: $y = [3, 4, 5]$
- Predictions: $\hat{y} = [2.8, 4.2, 5.0]$
- Mean: $\bar{y} = 4$
$TSS = (3-4)^2 + (4-4)^2 + (5-4)^2 = 1 + 0 + 1 = 2$
$RSS = (3-2.8)^2 + (4-4.2)^2 + (5-5.0)^2 = 0.04 + 0.04 + 0 = 0.08$
$R^2 = 1 - \frac{0.08}{2} = 0.96$
The model explains 96% of the variance in $y$, which is a very good fit on this data.
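The same arithmetic can be reproduced in a few lines of Python as a sanity check. This is only a sketch of the calculation above; results are rounded because floating-point subtraction such as `3 - 2.8` is not exact.

```python
y = [3, 4, 5]
y_hat = [2.8, 4.2, 5.0]

y_bar = sum(y) / len(y)                                # mean of actual values
tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
r2 = 1 - rss / tss

print(round(y_bar, 4), round(tss, 4), round(rss, 4), round(r2, 4))
# prints: 4.0 2.0 0.08 0.96
```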
5. Limitations
- High R² ≠ good model: A model can overfit (memorize data) and get high R² but perform poorly on new data.
- Not for all tasks: R² is useful for regression, not classification.
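- Doesn’t show bias: Two models can have the same R² on the same data while their errors look very different, for example one systematically over-predicting and the other scattering evenly around the true values.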
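To illustrate the last point, here is a small sketch with two made-up sets of predictions that score the same R² even though one of them is systematically too high. The names `r_squared`, `mean_error`, `shifted`, and `balanced` are hypothetical, chosen only for this example.

```python
def r_squared(y, y_hat):
    """R^2 computed as 1 - RSS/TSS."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - rss / tss


def mean_error(y, y_hat):
    """Average signed error (prediction minus actual); nonzero suggests bias."""
    return sum(fi - yi for yi, fi in zip(y, y_hat)) / len(y)


y = [1, 2, 3, 4]
shifted = [1.5, 2.5, 3.5, 4.5]   # every prediction is 0.5 too high
balanced = [1.5, 1.5, 3.5, 3.5]  # errors of +0.5 and -0.5 cancel out

for name, y_hat in [("shifted", shifted), ("balanced", balanced)]:
    print(name, round(r_squared(y, y_hat), 4), mean_error(y, y_hat))
```

Both prediction sets have RSS = 1 against TSS = 5, so R² is identical, yet only the signed mean error reveals that `shifted` over-predicts every point. Inspecting residuals alongside R² catches this kind of difference.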
6. Variants
- Adjusted R²: Penalizes adding irrelevant predictors (avoids artificially inflated R² in multiple regression).
$R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$
where $n$ = number of observations, $p$ = number of predictors.
- Pseudo-R²: Likelihood-based analogues (for example McFadden’s) used in logistic regression, where the sum-of-squares R² doesn’t apply.
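The adjusted R² formula above is simple enough to sketch directly. The function name `adjusted_r_squared` and the sample values ($R^2 = 0.90$, $n = 50$) are assumptions made for illustration.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors.

    Requires n > p + 1, otherwise the denominator n - p - 1 is not positive.
    """
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Same R^2 and sample size, but an increasing number of predictors.
for p in (1, 5, 20):
    print(p, round(adjusted_r_squared(0.90, n=50, p=p), 4))
```

With R² held fixed, the adjusted value falls as $p$ grows, which is the penalty described above: an extra predictor has to raise R² enough to offset the lost degree of freedom.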
Summary:
R² (coefficient of determination) = proportion of variance in the dependent variable explained by the regression model.
- $R^2 = 1$: perfect fit
- $R^2 = 0$: no fit
- $R^2 < 0$: worse than baseline
