R² (Coefficient of Determination)

1. Definition

  • R² measures how much of the variance in the dependent variable (y) is explained by the independent variables (X) in a regression model.
  • It tells us the goodness of fit: how well the model captures the variability in the data.

2. Formula

Let:

  • $y_i$ = actual values
  • $\hat{y}_i$ = predicted values
  • $\bar{y}$ = mean of actual values
  1. Total Sum of Squares (TSS):

$TSS = \sum_{i}(y_i - \bar{y})^2$

= total variance in data

  2. Residual Sum of Squares (RSS):

$RSS = \sum_{i}(y_i - \hat{y}_i)^2$

= variance not explained by the model

  3. Explained Sum of Squares (ESS):

$ESS = \sum_{i}(\hat{y}_i - \bar{y})^2$

= variance explained by the model

  4. R² definition:

$R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}$
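As a quick sketch, the three sums can be computed directly (the values below are made up for illustration):

```python
import numpy as np

# Hypothetical actual values and model predictions:
y     = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

tss = np.sum((y - y.mean()) ** 2)      # total variance in data
rss = np.sum((y - y_hat) ** 2)         # variance not explained
ess = np.sum((y_hat - y.mean()) ** 2)  # variance explained

# R^2 via the 1 - RSS/TSS form (the ESS/TSS form matches this
# only for least-squares fits with an intercept):
print(1 - rss / tss)  # ≈ 0.98
```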


3. Range & Interpretation

  • R² = 1 → perfect fit (model explains all variance).
  • R² = 0 → model explains no variance (same as predicting the mean).
  • R² < 0 → model is worse than just predicting the mean (bad model).
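The three cases above can be checked numerically with a tiny helper (the input values are made up):

```python
def r2(y, y_hat):
    """R^2 = 1 - RSS/TSS for lists of actuals and predictions."""
    y_mean = sum(y) / len(y)
    tss = sum((yi - y_mean) ** 2 for yi in y)
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return 1 - rss / tss

y = [1, 2, 3]
print(r2(y, y))          # 1.0  -> perfect fit
print(r2(y, [2, 2, 2]))  # 0.0  -> same as predicting the mean
print(r2(y, [3, 2, 1]))  # -3.0 -> worse than predicting the mean
```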

4. Example

Suppose:

  • Actual values: $y = [3, 4, 5]$
  • Predictions: $\hat{y} = [2.8, 4.2, 5.0]$
  • Mean: $\bar{y} = 4$

$TSS = (3-4)^2 + (4-4)^2 + (5-4)^2 = 1 + 0 + 1 = 2$

$RSS = (3-2.8)^2 + (4-4.2)^2 + (5-5.0)^2 = 0.04 + 0.04 + 0 = 0.08$

$R^2 = 1 - \frac{0.08}{2} = 0.96$

The model explains 96% of the variance, a very good fit.
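The arithmetic above can be verified in a few lines of Python:

```python
y     = [3, 4, 5]
y_hat = [2.8, 4.2, 5.0]
y_bar = sum(y) / len(y)                                # 4.0

tss = sum((yi - y_bar) ** 2 for yi in y)               # 2.0
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # ≈ 0.08
r2  = 1 - rss / tss
print(round(r2, 2))  # 0.96
```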


5. Limitations

  • High R² ≠ good model: A model can overfit (memorize data) and get high R² but perform poorly on new data.
  • Not for all tasks: R² is useful for regression, not classification.
  • Doesn’t show bias: Two models can have the same R² yet systematically over- or under-predict in different regions of the data.

6. Variants

  • Adjusted R²: Penalizes adding irrelevant predictors (avoids artificially inflated R² in multiple regression).

$R^2_{adj} = 1 - \left( \frac{(1-R^2)(n-1)}{n-p-1} \right)$

where $n$ = number of observations, $p$ = number of predictors.
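A direct translation of the adjusted-R² formula (the example numbers below are hypothetical):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2: penalizes adding predictors.
    n = number of observations, p = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# e.g. R^2 = 0.96 with n = 50 observations and p = 3 predictors:
print(adjusted_r2(0.96, 50, 3))  # ≈ 0.9574, slightly below the raw R^2
```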

  • Pseudo-R²: Used in logistic regression since regular R² doesn’t apply.
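One common pseudo-R² is McFadden's, which compares the model's log-likelihood to that of a null model predicting the base rate. A minimal sketch for binary outcomes (the labels and probabilities below are made up):

```python
import math

def mcfadden_pseudo_r2(y, p_model):
    """McFadden's pseudo-R^2 = 1 - logL(model) / logL(null),
    where the null model predicts the base rate everywhere."""
    def log_likelihood(y, p):
        return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                   for yi, pi in zip(y, p))
    base_rate = sum(y) / len(y)
    ll_model = log_likelihood(y, p_model)
    ll_null = log_likelihood(y, [base_rate] * len(y))
    return 1 - ll_model / ll_null

# Hypothetical labels and predicted probabilities:
y = [0, 0, 1, 1]
p = [0.1, 0.3, 0.7, 0.9]
print(mcfadden_pseudo_r2(y, p))  # ≈ 0.67
```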

Summary:
R² (coefficient of determination) = proportion of variance in the dependent variable explained by the regression model.

  • $R^2 = 1$: perfect fit
  • $R^2 = 0$: no fit
  • $R^2 < 0$: worse than baseline