1) Best Subset Selection
Let $p$ be the number of predictors. Best subset selection fits all $2^p$ possible models and picks the best one. The algorithm is as follows:
- Let $M_0$ be the model with no predictors (i.e., the model that always predicts the sample mean).
- For $k=1$ to $p$, fit all ${p \choose k}$ models with $k$ predictors, and label $M_k$ the model with lowest training error.
- Consider models $M_0, \dots, M_p$, and pick the one with the lowest estimated test error.
Training error can be measured via RSS, MSE, the $R^2$ statistic, or some other measure of deviance. We discuss four measures of test error below.
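As a minimal sketch of the procedure (assuming an $n \times p$ NumPy array `X` of predictors and a response vector `y`, with OLS fits and training RSS as the error measure; the helper names `fit_rss` and `best_subset` are hypothetical):

```python
import itertools
import numpy as np

def fit_rss(X, y, subset):
    """Fit OLS (with intercept) on the given predictor columns; return training RSS."""
    A = np.column_stack([np.ones(len(y)), X[:, list(subset)]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """Return {k: best predictor subset of size k}, i.e. the subsets behind M_0, ..., M_p."""
    p = X.shape[1]
    models = {0: ()}  # M_0: intercept only, always predicts the sample mean
    for k in range(1, p + 1):
        # step 2: fit all (p choose k) models with k predictors, keep the lowest training RSS
        models[k] = min(itertools.combinations(range(p), k),
                        key=lambda s: fit_rss(X, y, s))
    return models  # step 3: compare M_0, ..., M_p using an estimated test error
```

The loop over `itertools.combinations` is exactly where the $2^p$ cost comes from.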
2) Stepwise Selection
Best subset selection is computationally expensive, since it requires fitting $2^p$ models. Stepwise selection remedies this by performing a guided search for the best model: at each step (step 2 above), instead of fitting all models with $k$ predictors, it only considers the models obtained by adding one predictor to, or removing one predictor from, the previous model.
There are three types of stepwise selection: forward, backward, and mixed selection. The algorithm for forward selection is as follows:
- Let $M_0$ be the model with no predictors.
- For $k=0$ to $p-1$, fit all $p-k$ models (with $k+1$ predictors) that add one predictor to $M_k$, and label $M_{k+1}$ the model with lowest training error.
- Consider models $M_0, \dots, M_p$, and pick the one with the lowest estimated test error.
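A sketch of this greedy forward pass, reusing the hypothetical `fit_rss` helper from the best-subset sketch above:

```python
def forward_selection(X, y):
    """Greedy forward pass: build M_0, ..., M_p by adding one predictor at a time."""
    p = X.shape[1]
    current, models = [], {0: []}
    for k in range(p):
        remaining = [j for j in range(p) if j not in current]
        # among the p - k candidate additions, keep the one with the lowest training RSS
        best = min(remaining, key=lambda j: fit_rss(X, y, current + [j]))
        current = current + [best]
        models[k + 1] = current
    return models  # as before, choose among M_0, ..., M_p by estimated test error
```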
Backward selection is similar to forward selection, but it starts with all predictors and removes them one at a time:
- Let $M_p$ be the model with all $p$ predictors.
- For $k=p$ to $1$, fit all $k$ models (with $k-1$ predictors) that remove one predictor from $M_k$, and label $M_{k-1}$ the model with lowest training error.
- Consider models $M_0, \dots, M_p$, and pick the one with the lowest estimated test error.
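And the corresponding backward pass (again reusing the hypothetical `fit_rss` helper; note that fitting the full model $M_p$ requires $n > p$):

```python
def backward_selection(X, y):
    """Greedy backward pass: build M_p, ..., M_0 by dropping one predictor at a time."""
    p = X.shape[1]
    current = list(range(p))
    models = {p: current[:]}
    for k in range(p, 0, -1):
        # among the k candidate deletions, keep the one whose removal increases RSS the least
        drop = min(current, key=lambda j: fit_rss(X, y, [i for i in current if i != j]))
        current = [i for i in current if i != drop]
        models[k - 1] = current[:]
    return models
```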
Finally, we can mix forward and backward selection so that at each iteration we either add a predictor to or remove a predictor from the model. This allows us to better approximate best subset selection, while keeping the computational cost relatively low.
Now we look at ways to estimate the test error. The first three adjust the training error to account for overfitting, while the fourth estimates the test error directly.
1) Mallows's $C_p$
Suppose the model has $d$ predictors, and let $\hat{\sigma}^2$ be an estimate of the variance of the error $\epsilon$. Then
$$
C_p = \frac{1}{n}(\text{RSS} + 2d\hat{\sigma}^2).
$$
If $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$, then $C_p$ is an unbiased estimate of the test MSE, so small values of $C_p$ are desired.
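In code, this is a direct transcription of the formula (hypothetical argument names; `rss` and `d` come from the candidate model, and `sigma2_hat` must be supplied):

```python
def mallows_cp(rss, n, d, sigma2_hat):
    """C_p = (RSS + 2 * d * sigma2_hat) / n; smaller is better."""
    return (rss + 2 * d * sigma2_hat) / n

# A common (though not mandatory) choice for sigma2_hat is the MSE of the full model, e.g.
#   sigma2_hat = fit_rss(X, y, range(p)) / (n - p - 1)
```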
2) BIC
The Bayesian information criterion is, up to constants,
$$
\text{BIC} = \frac{1}{n}(\text{RSS} + \log(n)\, d\hat{\sigma}^2).
$$
Since $\log(n)>2$ for $n>7$, BIC penalizes models with extra predictors more heavily than $C_p$ does, so it tends to select models with fewer predictors.
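The corresponding one-liner, with the same hypothetical arguments as for $C_p$:

```python
import numpy as np

def bic(rss, n, d, sigma2_hat):
    """BIC = (RSS + log(n) * d * sigma2_hat) / n; the penalty factor exceeds 2 once n > 7."""
    return (rss + np.log(n) * d * sigma2_hat) / n
```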
3) Adjusted $R^2$
Recall that the $R^2$ statistic is $1-\text{RSS}/\text{TSS}$, where RSS is the residual sum of squares, and TSS is the total sum of squares. The adjusted $R^2$ statistic is
$$
\text{Adjusted }R^2 = 1 - \frac{\text{RSS}/(n-d-1)}{\text{TSS}/(n-1)}.
$$
Here a large adjusted $R^2$ indicates a low estimated test error. The idea is that adding irrelevant predictors to the model decreases the RSS only slightly while $d$ increases, so $\text{RSS}/(n-d-1)$ increases, leading to a smaller adjusted $R^2$.
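And in code (hypothetical argument names; `tss` is the total sum of squares of the response):

```python
def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 = 1 - [RSS / (n - d - 1)] / [TSS / (n - 1)]; larger is better."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```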
4) Cross-Validation
Cross-validation requires fewer assumptions about the data and can be applied more broadly. See this post.
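For completeness, a minimal sketch of scoring each $M_k$ by 5-fold cross-validated MSE, assuming scikit-learn is available and reusing the hypothetical subset dictionaries returned by the sketches above:

```python
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_mse(X, y, subset, folds=5):
    """Cross-validated estimate of the test MSE for the model on the given predictor subset."""
    subset = list(subset)
    # an intercept-only model is just the sample mean; DummyRegressor ignores its features
    model = LinearRegression() if subset else DummyRegressor(strategy="mean")
    Xs = X[:, subset] if subset else X[:, :1]
    scores = cross_val_score(model, Xs, y, cv=folds, scoring="neg_mean_squared_error")
    return -scores.mean()

# e.g. with models = best_subset(X, y):
# best_k = min(models, key=lambda k: cv_mse(X, y, models[k]))
```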