Cross-validation (CV) is a technique that helps us assess the quality of our model. To quantify how well a model fits the data, we look at its error rate: the extent to which the model makes false predictions. If the target variable is numerical, we can use the mean squared error:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2,$$
and if it is categorical, we can use the error rate:
$$\text{Err} = \frac{1}{n}\sum_{i=1}^{n}I(y_i \neq \hat{y_i}),$$
where $\hat{y_i}$ is the predicted value of the target variable corresponding to $x_i$, and $y_i$ is its actual value.
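As a concrete illustration, here is a minimal sketch of both statistics using NumPy; the arrays below are hypothetical placeholders for actual and predicted values, not data from any real model.

```python
import numpy as np

# Hypothetical actual and predicted values for a numerical target
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
mse = np.mean((y - y_hat) ** 2)  # mean squared error

# Hypothetical actual and predicted labels for a categorical target
y_cls = np.array(["a", "b", "a", "a"])
y_hat_cls = np.array(["a", "a", "a", "b"])
err = np.mean(y_cls != y_hat_cls)  # proportion of misclassified observations

print(mse, err)
```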
We can compute these statistics on the same data that was used to fit the model, which gives the training error rate, but that is not what interests us most: we care far more about how well the model performs on new data that was not used to fit it, which is the test error rate. Moreover, the training error rate tends to underestimate the test error rate, because the model was optimized to fit exactly those observations. Of course, we only know the actual values of the target variable for the observations in our training data, so unless we have a separate data set reserved for testing the model, we need to use cross-validation on our training data to simulate test data, called validation data or hold-out data. There are three main ways to do this.
1) Validation Set
The most basic way to do this is to split the available data into two parts: a training set and a validation set. We fit the model on the training set, and we estimate the test error rate on the validation set.
There are two concerns with this method: the error rate can be highly variable depending on the split of the data, and we need to omit a lot of observations while fitting the model (which means a less accurate model, so possibly an overestimated test error rate).
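A minimal sketch of the validation set approach, assuming scikit-learn and a synthetic regression data set; the 70%/30% split and the random seed are arbitrary choices for illustration, not prescriptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: 200 observations, 3 predictors, noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Split into a training set and a validation set (here 70% / 30%)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit on the training set, estimate the test MSE on the validation set
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(val_mse)
```

Rerunning this with a different random_state can give a noticeably different estimate, which is exactly the variability concern mentioned above.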
2) Leave-One-Out CV
LOOCV improves the validation set method by repeating it $n$ times, where $n$ is the number of observations in the original data set. For each iteration, we take out one distinct observation which serves as the validation set, fit the model on the other $n-1$ observations, and calculate the MSE (or the error rate) for the single left-out observation. After we've completed all iterations, we average all iterations' MSEs (or the error rates), and this gives us the overall estimated test error rate.
LOOCV addresses both problems of the validation set approach since there is no randomness in splitting the original data, and it fits the models on $n-1$ observations. However, it can be computationally expensive since it requires fitting $n$ models.
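A sketch of LOOCV on the same kind of synthetic data, again assuming scikit-learn; cross_val_score with LeaveOneOut fits one model per observation and scores it on the single held-out point.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

# Synthetic data, kept small since LOOCV fits n models
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# One fit per observation: each left-out point contributes one squared error
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
loocv_mse = -scores.mean()  # average the per-observation squared errors
print(loocv_mse)
```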
3) $k$-Fold CV
$k$-fold cross-validation is a compromise between the other two methods. It involves splitting the original data set into $k$ folds, and performing $k$ iterations of the validation set method. In each iteration, the validation set is one of the $k$ folds while the remaining folds are used to fit the model. Once again, we average out the error rates from each iteration to get the estimate for the overall test error. Note that LOOCV is the special case of $k$-fold CV where $k=n$.
$k$-fold CV is usually performed with $k=5$ or $10$, which makes it much more manageable than LOOCV. It does have some variability, since the assignment of observations to folds is random, but it is much more stable than the validation set approach.
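A sketch of 5-fold CV under the same assumptions as the earlier snippets; KFold with shuffle=True provides the random assignment of observations to folds.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Synthetic data as in the earlier sketches
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# 5 folds: each iteration trains on 4 folds and validates on the remaining one
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring="neg_mean_squared_error")
kfold_mse = -scores.mean()  # average the per-fold MSEs
print(kfold_mse)
```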