
January 7, 2015

Decision Trees and Random Forests

A decision tree can be used for both regression and classification problems. The idea is to partition the predictor space into non-overlapping regions. Then, for a given test observation, we make a prediction based on the training observations in the region in which that observation falls: for regression problems, we predict the mean of those observations' responses, while for classification problems, we predict whichever class is most common among them. This comes down to fitting the following model:
$$
f(x) = \sum_{j=1}^{J}c_j \cdot 1_{(x \in R_j)},
$$
where $R_1, \dots, R_J$ are the regions, and $c_j$ is the average response in region $R_j$.

Ideally, we would look at all possible partitions $R_1, \dots, R_J$ and pick the one that minimizes training error, but that is computationally infeasible. Instead, we can use a stepwise method such as recursive binary splitting, which successively splits the predictor space, each time using the split that reduces training error the most, until we reach a predetermined stopping criterion (typically until all regions have fewer than a certain number of training observations).

Now, if we build a decision tree that is too complex (i.e. if we create too many regions), then the tree is prone to overfitting the training data. To correct this, we can "prune" a tree by selecting a subtree with smaller test error. Looking at all possible subtrees and estimating their test error is infeasible, so we look at a subset of subtrees instead. With cost complexity pruning, we consider varying values of a tuning parameter $\alpha$ and assign to each value the subtree (with $J$ regions) that minimizes
$$
\sum_{j=1}^{J}\sum_{i: x_i \in R_j}(y_i - \bar{y}_{R_j})^2 + \alpha J.
$$
As $\alpha$ increases from 0, branches are pruned in a nested, predictable way, so the whole sequence of subtrees as a function of $\alpha$ is easy to compute. Thus, we can choose $\alpha$ using cross-validation, and then build the decision tree with the corresponding number of regions using the entire training data.
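As a rough sketch of how this could look in practice, here is cost complexity pruning with scikit-learn's `DecisionTreeRegressor`, whose `ccp_alpha` parameter (available since version 0.22) implements this penalty; `X` and `y` stand for a training feature matrix and response vector that are assumed to be already in memory.

```python
# Sketch: choosing alpha by cross-validation, then refitting on all training data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

tree = DecisionTreeRegressor(random_state=0)

# Effective alphas of the nested sequence of subtrees.
path = tree.cost_complexity_pruning_path(X, y)

# Estimate test MSE for each alpha with 5-fold cross-validation.
cv_mse = []
for alpha in path.ccp_alphas:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    scores = cross_val_score(pruned, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())

# Refit on the entire training data with the chosen alpha.
best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]
final_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```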

To measure training error, we can use the usual residual sum of squares (RSS) for regression:
$$
\text{RSS} =  \sum_{j=1}^{J}\sum_{i \in R_j}(y_i - \bar{y}_{R_j})^2,
$$
where $\bar{y}_{R_j}$ is the mean of the training observations' responses in region $R_j$.
In the classification setting, we use different measures when growing and pruning the tree. The Gini index (G) reflects purity (whether a split results in regions whose observations mostly come from the same class), and is therefore well suited to growing a decision tree:
$$
\text{G} = \sum_{j=1}^{J}\sum_{k=1}^{K}\hat{p}_{jk}(1 - \hat{p}_{jk}),
$$
where $\hat{p}_{jk}$ is the proportion of training observations of class $k$ in region $R_j$. On the other hand, the classification error rate (E) reflects prediction accuracy, and thus is good when pruning a tree:
$$
\text{E} = \sum_{j=1}^{J}\left(1 - \underset{k}{\text{max}}\, \hat{p}_{jk}\right).
$$
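To make these measures concrete, here is a minimal sketch of the per-region versions (the totals above are just these values summed over the regions), where `p_hat` is assumed to be the vector of class proportions in one region:

```python
import numpy as np

def gini(p_hat):
    """Gini index of one region; small values indicate a pure region."""
    p_hat = np.asarray(p_hat, dtype=float)
    return float(np.sum(p_hat * (1.0 - p_hat)))

def class_error(p_hat):
    """Classification error rate of one region."""
    return 1.0 - float(np.max(p_hat))

# Example: a region where 90% of the training observations belong to class 1.
print(gini([0.9, 0.1]))         # 0.18
print(class_error([0.9, 0.1]))  # ~0.1
```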

Note that decision trees do not need dummy variables for categorical predictors, since a split simply assigns some of the predictor's categories to one region and the remaining categories to the other region.

Simple decision trees are easy to interpret and visualize, but they do not have as much predictive power as other methods and they suffer from high variance. To improve performance, we can use bagging and random forests. For a description of bagging, read this.
Random forests improve upon bagging via a small modification: when building a tree, each split considers only a random sample of the available predictors, and a fresh random sample is drawn for every split. The idea is to give every predictor a chance instead of always using the strongest ones early in the tree. This decorrelates the trees, which results in a bigger reduction in variance. Using small samples (usually of size $\sqrt{p}$, where $p$ is the number of predictors) is especially helpful when we have a large number of correlated predictors.
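A quick sketch of the comparison with scikit-learn (again assuming `X` and `y` are given): since bagging is a random forest that considers all $p$ predictors at every split, both can be expressed with `RandomForestRegressor` via `max_features`.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

bagging = RandomForestRegressor(n_estimators=500, max_features=None, random_state=0)
forest = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)

for name, model in [("bagging", bagging), ("random forest", forest)]:
    # Cross-validated MSE as a rough estimate of test error.
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(name, round(mse, 3))
```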

December 12, 2014

Variable Selection

Most classical statistical methods were designed for situations where the number of observations is much larger than the number of predictors. If we are in a different situation, we might want to fit the data using a subset of the available predictors. This can help us obtain a more accurate model, because too many predictors can lead to overfitting and high variance. Also, using fewer predictors improves the interpretability of our model.

1) Best Subset Selection
Let $p$ be the number of predictors. Best subset selection fits all $2^p$ possible models and picks the best one. The algorithm is as follows:
  1. Let $M_0$ be the model with no predictors (the model always predicts the sample mean).
  2. For $k=1$ to $p$, fit all ${p \choose k}$ models with $k$ predictors, and label $M_k$ the model with lowest training error.
  3. Consider models $M_0, \dots, M_p$, and pick the one with the lowest estimated test error.
Note that in step 2, we select the best model according to training error, while in step 3, we use a measure of test error. This distinction matters: as the number of predictors increases, training error decreases monotonically even when the extra predictors only cause overfitting, so we need a measure of test error to compare models with different numbers of predictors.
Training error can be measured via RSS, MSE, the $R^2$ statistic, or some other measure of deviance. We discuss four estimates of test error below.
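Here is a minimal sketch of best subset selection for linear regression, assuming `X` is a pandas DataFrame of predictors and `y` the response:

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(model, X_sub, y):
    resid = y - model.predict(X_sub)
    return float(np.sum(resid ** 2))

best_per_size = {}  # k -> (training RSS, tuple of predictor names) for M_k
predictors = list(X.columns)

for k in range(1, len(predictors) + 1):
    candidates = []
    for subset in itertools.combinations(predictors, k):
        X_sub = X[list(subset)]
        model = LinearRegression().fit(X_sub, y)
        candidates.append((rss(model, X_sub, y), subset))
    best_per_size[k] = min(candidates)  # M_k: lowest training RSS among size-k models

# best_per_size now holds M_1, ..., M_p; compare them with an estimate of test
# error (C_p, BIC, adjusted R^2, or cross-validation) to pick the final model.
```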


2) Stepwise Selection
Best subset selection is computationally expensive, since it requires fitting $2^p$ models. Stepwise selection remedies this by performing a guided search for the best model: for a given number of predictors (step 2), instead of fitting all models, stepwise selection only considers the models that add one predictor to, or remove one predictor from, the previous model.
There are three types of stepwise selection: forward, backward, and mixed selection. The algorithm for forward selection is as follows:
  1. Let $M_0$ be the model with no predictors.
  2. For $k=0$ to $p-1$, fit all $p-k$ models (with $k+1$ predictors) that add one predictor to $M_k$, and label $M_{k+1}$ the model with lowest training error.
  3. Consider models $M_0, \dots, M_p$, and pick the one with the lowest estimated test error.
Forward selection requires fewer model fits than best subset selection, since it fits a total of only $1+\frac{p(p+1)}{2}$ models, but it is not guaranteed to find the same model as best subset selection. Also note that forward selection can be used when the number of predictors exceeds the number of observations; it suffices to consider models up to $M_{n-1}$ only, where $n$ is the number of observations.
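A rough sketch of forward stepwise selection, under the same assumptions as the best subset sketch above (`X` a pandas DataFrame of predictors, `y` the response):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(cols):
    if not cols:
        return float(np.sum((y - y.mean()) ** 2))  # null model predicts the mean
    model = LinearRegression().fit(X[cols], y)
    return float(np.sum((y - model.predict(X[cols])) ** 2))

selected = []                    # M_0: no predictors
models = [(rss(selected), [])]   # (training RSS, predictors) for each M_k

remaining = list(X.columns)
while remaining:
    # Try adding each remaining predictor and keep the one with lowest RSS.
    best_rss, best_p = min((rss(selected + [p]), p) for p in remaining)
    selected.append(best_p)
    remaining.remove(best_p)
    models.append((best_rss, list(selected)))

# models[k] is M_k; choose among M_0, ..., M_p with an estimate of test error.
```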

Backward selection is similar to forward selection, but it starts with all predictors and removes them one at a time:
  1. Let $M_p$ be the model with all $p$ predictors.
  2. For $k=p$ to $1$, fit all $k$ models (with $k-1$ predictors) that remove one predictor from $M_k$, and label $M_{k-1}$ the model with lowest training error.
  3. Consider models $M_0, \dots, M_p$, and pick the one with the lowest estimated test error.
Backward selection fits the same number of models as forward selection, but it cannot be used when $p>n$, since it starts by fitting a model with all $p$ predictors.

Finally, we can mix forward and backward selection so that at each iteration we either add a predictor to the model or remove one. This better approximates best subset selection while keeping the computational cost relatively low.


Now we look at ways to estimate test error. The first three adjust the training error to account for overfitting, while the fourth estimates test error directly.

1) Mallows's $C_p$
Say there are $d$ predictors, and the estimated variance of the error $\epsilon$ is $\hat{\sigma}^2$, then
$$
C_p = \frac{1}{n}(\text{RSS} + 2d\hat{\sigma}^2).
$$
If $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$, then $C_p$ is an unbiased estimate of the test MSE, so small values of $C_p$ are desired.

2) BIC
The Bayesian information criterion is, up to constants,
$$
\text{BIC} = \frac{1}{n}(\text{RSS} + \text{log}(n)d\hat{\sigma}^2).
$$
Since $\text{log}(n)>2$ for $n>7$, BIC gives a bigger penalty than $C_p$ to models with extra predictors, so it tends to pick models with fewer predictors.

3) Adjusted $R^2$
Recall that the $R^2$ statistic is $1-\text{RSS}/\text{TSS}$, where RSS is the residual sum of squares, and TSS is the total sum of squares. The adjusted $R^2$ statistic is
$$
\text{Adjusted }R^2 = 1 - \frac{\text{RSS}/(n-d-1)}{\text{TSS}/(n-1)}.
$$
Here a large adjusted $R^2$ indicates a low estimated test error. The idea is that if we add irrelevant predictors to our model, the RSS will decrease only slightly while $d$ increases, so $\text{RSS}/(n-d-1)$ will increase, leading to a smaller adjusted $R^2$.
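A minimal sketch of these three criteria, computed directly from the quantities defined above (the RSS and TSS of a fitted model, $n$ observations, $d$ predictors, and an estimate `sigma2_hat` of the error variance):

```python
import numpy as np

def c_p(rss, n, d, sigma2_hat):
    return (rss + 2 * d * sigma2_hat) / n

def bic(rss, n, d, sigma2_hat):
    return (rss + np.log(n) * d * sigma2_hat) / n

def adjusted_r2(rss, tss, n, d):
    return 1.0 - (rss / (n - d - 1)) / (tss / (n - 1))

# Lower C_p and BIC are better; a higher adjusted R^2 is better.
```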

4) Cross-Validation
Cross-validation makes fewer assumptions about the data and can be used more broadly. See this post.

December 5, 2014

Cross-Validation

Cross-validation (CV) is a technique that helps us assess the quality of our model. To quantify how well a model fits the data, we look at its error: how far its predictions are from the actual values. If the target variable is numerical, we can use the mean squared error:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2,$$
and if it is categorical, we can use the error rate:
$$\text{Err} = \frac{1}{n}\sum_{i=1}^{n}I(y_i \neq \hat{y_i}),$$
where $\hat{y_i}$ is the predicted value of the target variable corresponding to $x_i$, and $y_i$ is its actual value.

We can compute these statistics on the data that was used to fit the model, which gives the training error rate. But that's not what interests us most: we care more about how well the model performs on new data that wasn't used to fit it, which is the test error rate. Moreover, the training error rate tends to underestimate the test error rate, because the model was optimized to fit those very observations. Of course, we only know the actual values of the target variable for observations in our training data, so unless we have a separate data set reserved for testing our model, we need to use cross-validation: we set aside part of the training data, called validation data or hold-out data, to simulate test data. There are three main ways to do this.

1) Validation Set
The most basic way to do this is to split the available data in two, one training set and one validation set. We fit the model on the training set, and we estimate the test error rate on the validation set.
There are two concerns with this method: the error rate can be highly variable depending on the split of the data, and we need to omit a lot of observations while fitting the model (which means a less accurate model, so possibly an overestimated test error rate).
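A minimal sketch of the validation set approach with scikit-learn, assuming `X` and `y` hold the available data and using linear regression as a placeholder model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split the data in two: half for fitting, half for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)
val_mse = np.mean((y_val - model.predict(X_val)) ** 2)  # estimated test MSE
print(val_mse)
```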

2) Leave-One-Out CV
LOOCV improves the validation set method by repeating it $n$ times, where $n$ is the number of observations in the original data set. For each iteration, we take out one distinct observation which serves as the validation set, fit the model on the other $n-1$ observations, and calculate the MSE (or the error rate) for the single left-out observation. After we've completed all iterations, we average all iterations' MSEs (or the error rates), and this gives us the overall estimated test error rate.
LOOCV addresses both problems of the validation set approach since there is no randomness in splitting the original data, and it fits the models on $n-1$ observations. However, it can be computationally expensive since it requires fitting $n$ models.

3) $k$-Fold CV
$k$-fold cross-validation is a compromise between the other two methods. It involves splitting the original data set into $k$ folds, and performing $k$ iterations of the validation set method. In each iteration, the validation set is one of the $k$ folds while the remaining folds are used to fit the model. Once again, we average out the error rates from each iteration to get the estimate for the overall test error. Note that LOOCV is the special case of $k$-fold CV where $k=n$.
$k$-fold CV is usually performed with $k=5$ or $10$, which makes it much more manageable than LOOCV. It does have some variability since it introduces randomness, but it is much more stable than the validation set approach.
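Here is a quick sketch of both LOOCV and 5-fold CV with scikit-learn (again assuming `X`, `y`, and a placeholder linear regression model):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

model = LinearRegression()

# 5-fold CV estimate of the test MSE.
kfold_mse = -cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error").mean()
# LOOCV is just k-fold CV with k = n.
loocv_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error").mean()
print(kfold_mse, loocv_mse)
```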

November 18, 2014

LDA and QDA

LDA (linear discriminant analysis) and QDA (quadratic discriminant analysis) are both classifiers that deal with a categorical target variable. Instead of directly modeling the conditional probability that the target belongs to a certain class given the values of the predictors (as logistic regression does), LDA and QDA model the reverse conditional probability: the probability that the predictors take on a specific set of values, given that the corresponding target variable belongs to a certain class. They then rely on Bayes' theorem to recover the probability of interest.

If $X$ represents the predictors, $Y$ the target variable, and $K$ the number of different classes, this means that LDA and QDA estimate $\text{Pr}(X = x | Y = k)$. Then they use the following form of Bayes' theorem to compute what we are ultimately interested in, $\text{Pr}(Y = k | X = x)$:
$$
\text{Pr}(Y = k | X = x) = \frac{\pi_{k}\text{Pr}(X = x | Y = k)}{\sum_{j=1}^{K} \pi_{j}\text{Pr}(X = x | Y = j)},
$$
where $\pi_{k}$ represents the probability that a randomly chosen observation belongs to the $k$th class. $\pi_{k}$ is called the prior probability and $\text{Pr}(Y = k | X = x)$ the posterior probability. $\pi_{k}$ can easily be estimated using the fraction of the training observations in the $k$th class:
$$\hat{\pi}_{k} = \frac{n_k}{n},$$
where $n_k$ is the number of training observations in class $k$ and $n$ is the total number of training observations.

To model $\text{Pr}(X = x | Y = k)$, LDA and QDA assume that it has a multivariate normal distribution for which the $k$th class has mean $\mu_k$ and covariance matrix $\Sigma_k$. This means that, with $p$ predictors, we get:
$$
\text{Pr}(X = x | Y = k) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \text{exp}\left( -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) \right).
$$
The difference between the two classifiers is that LDA makes a simplifying assumption that each class has the same covariance matrix $\Sigma$. Note that if there is only one predictor, then we assume a normal distribution, which is just a special case of the multivariate normal distribution, so the following formulas still hold.

With LDA, we need to estimate $k$ class-specific means $\hat{\mu}_k$ and one common covariance matrix $\hat{\Sigma}$, which is really just the weighted average of the class-specific covariance matrices:
$$
\begin{split}
\hat{\mu}_k & = \frac{1}{n_k}\sum_{i: y_i = k}x_i\\
\hat{\Sigma} & = \frac{1}{n - K}\sum_{k = 1}^{K}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T
\end{split}
$$
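These estimates translate almost line by line into numpy; here is a minimal sketch, assuming `X` is an $(n, p)$ array and `y` an array of class labels:

```python
import numpy as np

def lda_estimates(X, y):
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])        # pi_hat_k
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_hat_k
    cov = np.zeros((p, p))
    for k, mu in zip(classes, means):
        diffs = X[y == k] - mu
        cov += diffs.T @ diffs          # within-class scatter
    cov /= (n - K)                      # pooled Sigma_hat
    return priors, means, cov
```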

Now, assigning an observation $x$ to the class with the largest posterior probability $\text{Pr}(Y = k | X = x)$ is equivalent to maximizing, with respect to $k$, the following discriminant function:
$$
\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \text{log}(\pi_k).
$$
Note that this function is linear in $x$, hence the name linear discriminant analysis.
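As a small sketch, classifying an observation `x` with this linear discriminant, using the estimates returned by `lda_estimates` from the sketch above:

```python
import numpy as np

def lda_predict(x, priors, means, cov):
    cov_inv = np.linalg.inv(cov)
    # delta_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log(pi_k)
    deltas = [x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(pi)
              for pi, mu in zip(priors, means)]
    return int(np.argmax(deltas))  # index of the class with the largest delta_k(x)
```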


If we do assume class-specific covariance matrices and use QDA instead, then we can use the same estimates for $\mu_k$, but the estimates for $\Sigma_k$ are:
$$
\hat{\Sigma}_k = \frac{1}{n_k}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T,
$$
and the discriminant function is:
$$
\delta_k(x) = - \frac{1}{2}x^T\Sigma_k^{-1}x + x^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\text{log}|\Sigma_k| + \text{log}(\pi_k),
$$
which is quadratic in $x$, hence the name quadratic discriminant analysis.


LDA has more model bias but less variance than QDA, since it makes more assumptions and estimates fewer parameters. Thus, LDA is probably a better choice if we have a relatively small training set, but QDA might do better if we have a very large training set, or if a common covariance matrix is an unrealistic assumption.
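In practice, both classifiers can be compared directly; here is a quick sketch with scikit-learn (assuming `X` and `y` are the training data), using cross-validated accuracy as the yardstick:

```python
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    acc = cross_val_score(clf, X, y, cv=5).mean()  # cross-validated accuracy
    print(name, round(acc, 3))
```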

November 17, 2014

Logistic Regression

Logistic regression deals with a target variable that is categorical. Its goal is to predict the conditional probability that the target variable belongs to a specific class, given the values of the predictors.

Here we only consider the case with two classes. There exists an extension of logistic regression that handles more than two classes, but linear discriminant analysis is usually the preferred method, so the extension is not discussed here.

Suppose the target variable $Y$ can belong to only one of two categories. $Y$ can then be coded as follows:
$$Y =
\begin{cases}
0 & \text{if in class } 1\\
1 & \text{if in class } 2
\end{cases}$$
The conditional probability that we are interested in is $\text{Pr}(Y = 1 | X)$, where $X$ represents the predictors. A nice property of logistic regression is that this conditional probability will always be greater than 0 and less than 1.

If we abbreviate $\text{Pr}(Y = 1 | X)$ as $p(X)$, the model that we are fitting is the following:
$$p(X) = \frac{e^{\beta_0 + \beta_{1}X_1 + \cdots + \beta_{p}X_p}}{1 + e^{\beta_0 + \beta_{1}X_1 + \cdots + \beta_{p}X_p}}$$
where $p$ is the number of predictors. Since this is equivalent to
$$\text{log}\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_{1}X_1 + \cdots + \beta_{p}X_p,$$
we see that we are really modeling the log-odds of $p(X)$. The expression on the left-hand side is called the logit function (which explains the name of this method).

To estimate the coefficients $\beta_i$'s, we maximize the likelihood function
$$l(\beta_0, \beta_1, \cdots, \beta_p) = \prod_{i: y_i = 1}p(x_i) \prod_{j: y_j = 0}(1 - p(x_j)).$$

To assess the significance of each predictor in the model, we can perform z-tests. The null hypothesis is $H_0: \beta_i = 0$ and the z-statistic is $\frac{\hat{\beta_i}}{\text{SE}(\hat{\beta_i})}$.
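As a sketch, statsmodels fits this model by maximum likelihood and reports the z-tests directly (assuming `X` is a DataFrame of predictors and `y` a 0/1 response):

```python
import statsmodels.api as sm

X_const = sm.add_constant(X)        # adds the intercept column for beta_0
model = sm.Logit(y, X_const).fit()  # maximizes the likelihood above
print(model.summary())              # coefficients, standard errors, z-statistics, p-values
```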

November 12, 2014

K-Nearest Neighbors

KNN classifier
Here we want to estimate the conditional probability that an observation with predictor vector $x_0$ belongs to class $j$. To do so with the KNN classifier, we first identify the $K$ training observations closest to $x_0$, which we denote by $N_0$. Then we take the proportion of points in $N_0$ whose class is $j$:
$$\text{Pr}(Y = j | X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j).$$
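A minimal sketch with scikit-learn, assuming `X` and `y` are the training data and `x0` is a single predictor vector:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict_proba([x0]))  # class proportions among the 5 nearest neighbours
print(knn.predict([x0]))        # class with the largest estimated probability
```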


KNN regression
Here we want to predict the target value for a predictor vector $x_0$. To do so with KNN regression, we take the $K$ training observations closest to $x_0$ (which we denote by $N_0$) and simply average their target values:
$$\hat{f}(x_0) = \frac{1}{K} \sum_{i \in N_0} y_i.$$
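The regression version looks almost identical in scikit-learn (same assumptions about `X`, `y`, and `x0` as above):

```python
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(knn.predict([x0]))  # f_hat(x0): mean response over the 5 nearest neighbours
```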

November 11, 2014

Least Squares Linear Regression

The goal of least squares regression is to fit the following model, where $p$ is the number of predictors, and $n$ is the number of observations:
$$Y = \beta_{0} + \beta_{1}X_{1} + \cdots + \beta_{p}X_{p} + \epsilon .$$

To do so, we minimize the residual sum of squares: $\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^2$ where $y_{i}$ is the observed value of the $i$th target variable and $\hat{y}_{i}$ is its predicted value.
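A minimal sketch of the least squares fit in numpy, assuming `X` is an $(n, p)$ array of predictors and `y` a length-$n$ response vector:

```python
import numpy as np

X_design = np.column_stack([np.ones(len(X)), X])         # prepend the intercept column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)  # coefficients minimizing the RSS
y_hat = X_design @ beta_hat                              # fitted values
rss = float(np.sum((y - y_hat) ** 2))                    # residual sum of squares
```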


There are several ways to assess the accuracy of the model:

1) Residual Standard Error (RSE)
$$\text{RSE} =  \sqrt{\frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^2}{n-p-1}}$$
RSE measures the average amount that the prediction deviates from the truth. Generally, a high RSE means that the model is not a good fit for the data. However, it is not always clear what "high" is because RSE is measured in the units of the target variable. One way to deal with this is to divide the RSE by the mean of the target variable to report the percentage error.

2) $R^2$
$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}},$$
where $\text{TSS} = \sum_{i = 1}^{n} (y_{i} - \bar{y})^2$ is the total sum of squares ($\bar{y}$ being the mean of the observed responses) and $\text{RSS} = \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^2$ is the residual sum of squares.
The $R^2$ statistic reports the proportion of variability in $Y$ explained by the regression. $R^2$ is a number between 0 and 1 and is independent of the scale of $Y$. Generally, the closer to 1, the better the model. However, if we expect $\epsilon$ to be large (if we expect our model to explain only a small part of the change in the target value), then a small $R^2$ might be sufficient.
Note that with a single predictor, $R^2 = \text{Cor}(X, Y)^2$, while with multiple predictors, $R^2 = \text{Cor}(Y, \hat{Y})^2$.
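Continuing the numpy sketch above (with `y`, `y_hat`, and `rss` already computed, and `p` assumed to be the number of predictors), RSE and $R^2$ are one line each:

```python
import numpy as np

n = len(y)
tss = float(np.sum((y - np.mean(y)) ** 2))  # total sum of squares
rse = np.sqrt(rss / (n - p - 1))            # residual standard error
r2 = (tss - rss) / tss                      # proportion of variability explained
print(rse, r2)
```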

3) F-test
To check whether there is a relationship between the target variable and the predictors, we can perform an F-test. Here we are asking whether at least one of the predictors' coefficients is non-zero:
$$\begin{split}
& H_0: \beta_{1} = \beta_{2} = \cdots= \beta_{p} = 0\\
& H_1: \beta_{i} \neq 0 \text{  for at least one } i
\end{split}$$
The F-statistic is as follows:
$$F = \frac{(\text{TSS}-\text{RSS})/p}{\text{RSS}/(n-p-1)}.$$
When $H_0$ is true, the F-statistic has an F-distribution and we can compute the p-value. If the p-value is small, then there is evidence to reject $H_0$, which suggests that there is a relationship between at least one of the predictors and the target variable.
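Using the quantities from the sketches above (`rss`, `tss`, `n`, `p`), the F-statistic and its p-value can be computed with scipy:

```python
from scipy.stats import f

F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = f.sf(F, p, n - p - 1)  # Pr(F_{p, n-p-1} > F) under H0
print(F, p_value)
```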

4) t-tests
We can also perform individual tests on each predictor to ask whether one predictor is related to the target variable. If we want to know whether the $i$th predictor is relevant, we can perform the following t-test:
$$\begin{split}
& H_0: \beta_{i} = 0\\
& H_1: \beta_{i} \neq 0
\end{split}$$
with the following t-statistic:
$$t = \frac{\hat{\beta_i}-0}{\text{SE}(\hat{\beta}_i)},$$
where $\text{SE}(\hat{\beta}_i)$ is the standard error of $\hat{\beta}_i$. If $H_0$ is true, then the t-statistic has a $t$ distribution with $n - p - 1$ degrees of freedom (which reduces to $n - 2$ in simple linear regression), and we can compute the p-value.
Note that the square of this t-statistic is equivalent to an F-test that checks whether all coefficients except $\beta_i$ are zero. This F-test can be computed using the residual sum of squares of the regression without the $i$th predictor instead of TSS. This means that the t-statistic reports the partial effect of adding $\beta_i$ to the regression.
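In practice, these tests don't need to be computed by hand; as a sketch, statsmodels reports them all at once (assuming `X` is a DataFrame of predictors and `y` the response):

```python
import statsmodels.api as sm

ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())  # includes the overall F-statistic and per-coefficient t-tests
```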


We can extend linear regression to include non-linear relationships. The model is still linear, we just create new predictors that are non-linear functions of the original predictors.

1) Interaction Terms
The interaction effect occurs when the effect of one predictor on the target variable depends on the value of another predictor. This is also called the synergy effect in marketing. If we have two predictors $X_{1}$ and $X_{2}$, then we can add an interaction term $X_{1}X_{2}$ to our model, like so:
$$Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{1}X_{2} + \epsilon .$$
This can be rewritten so that the coefficient of $X_{1}$ depends on the value of $X_2$, making the interaction clear:
$$Y = \beta_{0} + (\beta_{1} + \beta_{3}X_{2})X_{1} + \beta_{2}X_{2} + \epsilon.$$
If we do include an interaction term in our model, then it is safer to also include the main terms regardless of their significance. This helps the interpretability of the model and is referred to as the hierarchical principle.

2) Polynomial Regression
With polynomial regression, we add higher powers of existing predictors. It can also be useful to consider other transformations, such as the logarithm or the square root.
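A short sketch covering both cases above: we simply build the interaction, polynomial, and transformed terms as new columns and fit the usual linear model. Here `X` is assumed to be a DataFrame with (hypothetical) columns `"X1"` and `"X2"`, with `"X2"` positive so the logarithm is defined.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_ext = X.copy()
X_ext["X1:X2"] = X["X1"] * X["X2"]  # interaction term
X_ext["X1^2"] = X["X1"] ** 2        # quadratic term
X_ext["log_X2"] = np.log(X["X2"])   # another non-linear transformation

# Still a linear model, just in the new predictors.
model = LinearRegression().fit(X_ext, y)
```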

November 7, 2014

CRISP-DM

The Cross-Industry Standard Process for Data Mining is a framework that outlines the main tasks underlying the data-mining process. It is mostly from the business point of view, and it doesn't specify technical details.


Here are the six phases, with the corresponding tasks:

1) Business Understanding:
     - defining business goals
     - assessing the situation
     - defining data-mining goals
     - creating the project plan

2) Data Understanding:
     - gathering data
     - describing data (overall)
     - exploring data (variable by variable)
     - verifying data quality

3) Data Preparation:
     - selecting data
     - cleaning data
     - constructing data (e.g. deriving new variables)
     - integrating data (merging data into one dataset)
     - formatting data

4) Modeling:
     - selecting modeling techniques
     - designing tests
     - building models
     - assessing models

5) Evaluation:
     - evaluating results
     - reviewing the process
     - determining the next step

6) Deployment:
     - planning deployment
     - planning monitoring and maintenance
     - reporting final results
     - reviewing final results

November 4, 2014

9 Laws of Data Mining

These laws were established by Thomas Khabaza, a pioneer of data mining.

1) Business Goals
business objectives are at the origin of every data-mining solution
You should focus on business goals throughout the data-mining process.

2) Business Knowledge
business knowledge is central to every step of the data-mining process
Data mining is only valuable if you have the business understanding to give context to your data.

3) Data Preparation
data preparation is more than half of every data-mining process
Data usually isn't collected specifically for data mining.

4) Right Model (no free lunch)
the right model for a given application can only be discovered by experiment
Models are selected through trial and error rather than by studying theory.

5) Pattern
there are always patterns
Data always has something to say, whether it's what you wanted to hear or not.

6) Amplification (insight)
data mining amplifies perception in the business domain
Data mining finds information that wouldn't appear in ordinary reports.

7) Prediction
prediction increases information locally by generalization
Data mining uses what we know to predict what we don't know.

8) Value
the value of data-mining results is not determined by the accuracy or stability of predictive models
Priority is not given to theoretical properties, but rather to business applications.

9) Change
all patterns are subject to change
Models you create today might be useless tomorrow.