
December 12, 2014

Variable Selection

Most classical statistical methods were designed for situations where the number of observations is much larger than the number of predictors. When that is not the case, we might want to fit the data using only a subset of the available predictors. This can give us a more accurate model, because too many predictors can lead to overfitting and high variance. Using fewer predictors also improves the interpretability of the model.

1) Best Subset Selection
Let $p$ be the number of predictors. Best subset selection fits all $2^p$ possible models and picks the best one. The algorithm is as follows:
  1. Let $M_0$ be the model with no predictors (the model always predicts the sample mean).
  2. For $k=1$ to $p$, fit all ${p \choose k}$ models with $k$ predictors, and label $M_k$ the model with the lowest training error.
  3. Consider models $M_0, \dots, M_p$, and pick the one with the lowest estimated test error.
Note that in step 2 we select the best model according to training error, while in step 3 we use a measure of test error. This distinction matters because the training error decreases monotonically as the number of predictors increases, even when the extra predictors only cause overfitting. Thus we need an estimate of test error to compare models with different numbers of predictors.
Training error can be measured via the RSS, the MSE, the $R^2$ statistic, or some other measure of deviance. We discuss four measures of test error below. A sketch of the algorithm follows.
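
To make the algorithm concrete, here is a minimal Python sketch of best subset selection for linear regression, using training RSS as the criterion in step 2. The function name and the use of scikit-learn's LinearRegression are my own choices for illustration, not something prescribed above.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset(X, y):
    """Return, for each size k, the subset of predictors with the lowest training RSS."""
    n, p = X.shape
    models = {0: ((), float(np.sum((y - y.mean()) ** 2)))}  # M_0: intercept-only model
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in combinations(range(p), k):
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        models[k] = (best_vars, best_rss)  # M_k
    # The final choice among M_0, ..., M_p should use an estimate of test error
    # (C_p, BIC, adjusted R^2, or cross-validation), not these training RSS values.
    return models
```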


2) Stepwise Selection
Best subset selection is computationally expensive, since it requires fitting $2^p$ models. Stepwise selection remedies this by performing a guided search for the best model: for a given number of predictors (step 2), instead of fitting all possible models, stepwise selection only considers the models obtained by adding one predictor to, or removing one from, the previous model.
There are three types of stepwise selection: forward, backward, and mixed selection. The algorithm for forward selection is as follows:
  1. Let $M_0$ be the model with no predictors.
  2. For $k=0$ to $p-1$, fit all $p-k$ models (with $k+1$ predictors) that add one predictor to $M_k$, and label $M_{k+1}$ the model with the lowest training error.
  3. Consider models $M_0, \dots, M_p$, and pick the one with the lowest estimated test error.
Forward selection requires far less computation than best subset selection, since it fits only $1+\frac{p(p+1)}{2}$ models in total, but it is not guaranteed to find the same model as best subset selection. Also note that forward selection can be used when the number of predictors exceeds the number of observations; it suffices to consider models only up to $M_{n-1}$, where $n$ is the number of observations.
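
Under the same illustrative assumptions as the sketch above (linear regression, training RSS as the step-2 criterion), forward selection can be sketched as follows; the names are again mine.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def forward_selection(X, y):
    """Greedy forward stepwise selection: at each step, add the single predictor
    that yields the largest drop in training RSS."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    models = {0: ((), float(np.sum((y - y.mean()) ** 2)))}  # M_0
    for k in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        models[k + 1] = (tuple(selected), best_rss)  # M_{k+1}
    return models
```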

Backward selection is similar to forward selection, but it starts with all predictors and removes them one at a time:
  1. Let $M_p$ be the model with all $p$ predictors.
  2. For $k=p$ to $1$, fit all $k$ models (with $k-1$ predictors) that remove one predictor from $M_k$, and label $M_{k-1}$ the model with the lowest training error.
  3. Consider models $M_0, \dots, M_p$, and pick the one with the lowest estimated test error.
Backward selection fits the same number of models as forward selection, but it cannot be used when $p>n$, since it starts by fitting the full model with all $p$ predictors.
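
A corresponding sketch of backward selection, under the same illustrative assumptions (and requiring $n > p$ so the full model can be fit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def backward_selection(X, y):
    """Greedy backward stepwise selection: start from the full model and, at each
    step, drop the predictor whose removal increases the training RSS the least."""
    n, p = X.shape
    selected = list(range(p))
    full = LinearRegression().fit(X, y)
    models = {p: (tuple(selected), float(np.sum((y - full.predict(X)) ** 2)))}  # M_p
    for k in range(p, 0, -1):
        best_rss, best_j = np.inf, None
        for j in selected:
            cols = [c for c in selected if c != j]
            if cols:
                fit = LinearRegression().fit(X[:, cols], y)
                rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            else:
                rss = float(np.sum((y - y.mean()) ** 2))  # intercept-only model M_0
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.remove(best_j)
        models[k - 1] = (tuple(selected), best_rss)  # M_{k-1}
    return models
```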

Finally, we can mix forward and backward selection so that at each iteration we either add a predictor to the model or remove one from it. This allows us to better approximate best subset selection, while keeping the computational cost relatively low.


Now we look at ways to estimate the test error. The first three adjust the training error to account for overfitting, while the fourth estimates the test error directly.

1) Mallows's $C_p$
Suppose the model contains $d$ predictors, and let $\hat{\sigma}^2$ be an estimate of the variance of the error $\epsilon$; then
$$
C_p = \frac{1}{n}(\text{RSS} + 2d\hat{\sigma}^2).
$$
If $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$, then $C_p$ is an unbiased estimate of the test MSE, so small values of $C_p$ are desired.
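
A direct transcription of the formula as a Python function (the name is mine; $\hat{\sigma}^2$ would typically be supplied from a fit of the full model, for example $\text{RSS}_{\text{full}}/(n-p-1)$):

```python
def mallows_cp(rss, n, d, sigma2_hat):
    """Mallows's C_p: (RSS + 2 * d * sigma2_hat) / n -- lower is better."""
    return (rss + 2 * d * sigma2_hat) / n
```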

2) BIC
The Bayesian information criterion is, up to constants,
$$
\text{BIC} = \frac{1}{n}(\text{RSS} + \log(n)\, d\hat{\sigma}^2).
$$
Since $\log(n)>2$ for $n>7$, BIC penalizes models with extra predictors more heavily than $C_p$, so it tends to pick models with fewer predictors.
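
The BIC formula above, transcribed the same way:

```python
import math

def bic(rss, n, d, sigma2_hat):
    """BIC (up to constants): (RSS + log(n) * d * sigma2_hat) / n -- lower is better."""
    return (rss + math.log(n) * d * sigma2_hat) / n
```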

3) Adjusted $R^2$
Recall that the $R^2$ statistic is $1-\text{RSS}/\text{TSS}$, where RSS is the residual sum of squares, and TSS is the total sum of squares. The adjusted $R^2$ statistic is
$$
\text{Adjusted }R^2 = 1 - \frac{\text{RSS}/(n-d-1)}{\text{TSS}/(n-1)}.
$$
Here, unlike with $C_p$ and BIC, a large adjusted $R^2$ indicates a model with a low estimated test error. The idea is that if we add irrelevant predictors to the model, the RSS decreases only slightly while $d$ increases, so $\text{RSS}/(n-d-1)$ increases, leading to a smaller adjusted $R^2$.
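
And the adjusted $R^2$, again as a plain transcription of the formula:

```python
def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2: 1 - (RSS / (n - d - 1)) / (TSS / (n - 1)) -- higher is better."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```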

4) Cross-Validation
CV requires fewer assumptions about the data and can be used more broadly. See the cross-validation post below (December 5, 2014).

December 5, 2014

Cross-Validation

Cross-validation (CV) is a technique that helps us assess the quality of a model. To quantify how well a model fits the data, we look at its error: the extent to which its predictions are wrong. If the target variable is numerical, we can use the mean squared error:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2,$$
and if it is categorical, we can use the error rate:
$$\text{Err} = \frac{1}{n}\sum_{i=1}^{n}I(y_i \neq \hat{y_i}),$$
where $\hat{y_i}$ is the predicted value of the target variable corresponding to $x_i$, and $y_i$ is its actual value.
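
Both quantities are one-liners in Python with numpy (the function names are mine):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error for a numerical target."""
    return float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def error_rate(y, y_hat):
    """Misclassification rate for a categorical target."""
    return float(np.mean(np.asarray(y) != np.asarray(y_hat)))
```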

We can compute these statistics on the training data that was used to fit the model, which gives the training error rate, but that is not what interests us most: we care more about how well the model performs on new data that was not used to fit it, which gives the test error rate. Moreover, the training error rate tends to underestimate the test error rate, because the model was optimized to fit those very observations. Of course, we only know the actual values of the target variable for observations in our training data, so unless we have a separate data set reserved for testing the model, we need to use cross-validation on the training data to simulate test data, called validation data or hold-out data. There are three main ways to do this.

1) Validation Set
The most basic approach is to split the available data into two parts: a training set and a validation set. We fit the model on the training set and estimate the test error rate on the validation set.
There are two concerns with this method: the estimated error rate can be highly variable depending on how the data are split, and we have to leave out many observations when fitting the model (which tends to make the model less accurate, so the test error rate may be overestimated).
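
As an illustration, here is a minimal validation set sketch with scikit-learn on simulated data; the data-generating step and the 50/50 split are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Simulated data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + rng.normal(size=100)

# Split the data in two, fit on the training half, estimate the test MSE on the other half.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
model = LinearRegression().fit(X_train, y_train)
validation_mse = mean_squared_error(y_val, model.predict(X_val))
```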

2) Leave-One-Out CV
LOOCV improves on the validation set method by repeating it $n$ times, where $n$ is the number of observations in the original data set. In each iteration, we hold out one distinct observation to serve as the validation set, fit the model on the other $n-1$ observations, and compute the MSE (or the error rate) for the single left-out observation. Once all iterations are complete, we average the $n$ MSEs (or error rates), and this average is our estimate of the test error.
LOOCV addresses both problems of the validation set approach since there is no randomness in splitting the original data, and it fits the models on $n-1$ observations. However, it can be computationally expensive since it requires fitting $n$ models.
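
Reusing the simulated X and y from the previous sketch, LOOCV looks like this with scikit-learn's LeaveOneOut splitter:

```python
from sklearn.model_selection import LeaveOneOut

# One model per observation: train on n-1 points, test on the single point left out.
squared_errors = []
for train_idx, val_idx in LeaveOneOut().split(X):
    fit = LinearRegression().fit(X[train_idx], y[train_idx])
    squared_errors.append(float((y[val_idx][0] - fit.predict(X[val_idx])[0]) ** 2))
loocv_mse = np.mean(squared_errors)
```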

3) $k$-Fold CV
$k$-fold cross-validation is a compromise between the other two methods. It involves splitting the original data set into $k$ folds, and performing $k$ iterations of the validation set method. In each iteration, the validation set is one of the $k$ folds while the remaining folds are used to fit the model. Once again, we average out the error rates from each iteration to get the estimate for the overall test error. Note that LOOCV is the special case of $k$-fold CV where $k=n$.
$k$-fold CV is usually performed with $k=5$ or $10$, which makes it much more manageable than LOOCV. It does have some variability, since the split into folds is random, but it is much more stable than the validation set approach.
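
And 10-fold CV on the same simulated data; scikit-learn's cross_val_score reports a negated MSE as its score, so the sign is flipped at the end.

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), X, y, cv=10,
                         scoring="neg_mean_squared_error")
kfold_mse = -scores.mean()  # average validation MSE over the 10 folds
```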