
November 18, 2014

LDA and QDA

LDA (linear discriminant analysis) and QDA (quadratic discriminant analysis) are both classifiers that deal with a categorical target variable. Instead of directly modeling the conditional probability that the target belongs to a certain class given the values of the predictors (as logistic regression does), LDA and QDA model the reverse conditional probability: the probability that the predictors take on a specific set of values, given that the corresponding target variable belongs to a certain class. They then rely on Bayes' theorem to convert this back into the probability we actually want.

If $X$ represents the predictors, $Y$ the target variable, and $K$ the number of different classes, this means that LDA and QDA estimate $\text{Pr}(X = x | Y = k)$. Then they use the following form of Bayes' theorem to compute what we are ultimately interested in, $\text{Pr}(Y = k | X = x)$:
$$
\text{Pr}(Y = k | X = x) = \frac{\pi_{k}\text{Pr}(X = x | Y = k)}{\sum_{j=1}^{K} \pi_{j}\text{Pr}(X = x | Y = j)},
$$
where $\pi_{k}$ represents the probability that a randomly chosen observation belongs to the $k$th class. $\pi_{k}$ is called the prior probability and $\text{Pr}(Y = k | X = x)$ the posterior probability. $\pi_{k}$ can easily be estimated using the fraction of the training observations in the $k$th class:
$$\hat{\pi}_{k} = \frac{n_k}{n},$$
where $n_k$ is the number of training observations in class $k$ and $n$ is the total number of training observations.

To model $\text{Pr}(X = x | Y = k)$, LDA and QDA assume that it has a multivariate normal distribution for which the $k$th class has mean $\mu_k$ and covariance matrix $\Sigma_k$. This means that, with $p$ predictors, we get:
$$
\text{Pr}(X = x | Y = k) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \text{exp}\left( -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) \right).
$$
The difference between the two classifiers is that LDA makes a simplifying assumption that each class has the same covariance matrix $\Sigma$. Note that if there is only one predictor, then we assume a normal distribution, which is just a special case of the multivariate normal distribution, so the following formulas still hold.

With LDA, we need to estimate $K$ class-specific means $\hat{\mu}_k$ and one common covariance matrix $\hat{\Sigma}$, which is really just a weighted average of the class-specific covariance matrices:
$$
\begin{split}
\hat{\mu}_k & = \frac{1}{n_k}\sum_{i: y_i = k}x_i\\
\hat{\Sigma} & = \frac{1}{n - K}\sum_{k = 1}^{K}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T
\end{split}
$$

Now, assigning an observation to the class $k$ with the largest posterior probability $\text{Pr}(Y = k | X = x)$ is equivalent to maximizing, with respect to $k$, the following discriminant function:
$$
\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \text{log}(\pi_k).
$$
Note that this function is linear in $x$, hence the name linear discriminant analysis.
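To make the estimates and the discriminant function concrete, here is a minimal NumPy sketch on made-up two-class data (the dataset and variable names are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data: two classes sharing the same covariance structure.
X1 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], size=50)
X2 = rng.multivariate_normal([2, 1], [[1, 0.3], [0.3, 1]], size=70)
X = np.vstack([X1, X2])
y = np.array([0] * 50 + [1] * 70)

n, p = X.shape
classes = np.unique(y)
K = len(classes)

# Estimate the priors, the class means, and the pooled covariance matrix.
pi_hat = np.array([np.mean(y == k) for k in classes])            # n_k / n
mu_hat = np.array([X[y == k].mean(axis=0) for k in classes])     # class-specific means
Sigma_hat = sum(
    (X[y == k] - mu_hat[i]).T @ (X[y == k] - mu_hat[i])
    for i, k in enumerate(classes)
) / (n - K)                                                      # pooled covariance

# Linear discriminant scores delta_k(x) for every training point.
Sigma_inv = np.linalg.inv(Sigma_hat)
scores = (
    X @ Sigma_inv @ mu_hat.T
    - 0.5 * np.sum(mu_hat @ Sigma_inv * mu_hat, axis=1)
    + np.log(pi_hat)
)

# Predict the class whose discriminant score is largest.
y_pred = classes[np.argmax(scores, axis=1)]
print("training accuracy:", np.mean(y_pred == y))
```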


If we do assume class-specific covariance matrices and use QDA instead, then we can use the same estimates for $\mu_k$, but the estimates for $\Sigma_k$ are:
$$
\hat{\Sigma}_k = \frac{1}{n_k}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T,
$$
and the discriminant function is:
$$
\delta_k(x) = - \frac{1}{2}x^T\Sigma_k^{-1}x + x^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\text{log}|\Sigma_k| + \text{log}(\pi_k),
$$
which is quadratic in $x$, hence the name quadratic discriminant analysis.


LDA has more model bias but less variance than QDA since it makes more assumptions and estimates fewer parameters. Thus, LDA is probably a better choice if we have a relatively small training set, but QDA might do better if we have a very large training set, or if a common covariance matrix is an unrealistic assumption.
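In practice we would not implement these by hand. For reference, here is a minimal sketch using scikit-learn's implementations (current class names; the toy dataset is made up):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

# Toy dataset just to exercise the two estimators.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

print("LDA test accuracy:", lda.score(X_test, y_test))
print("QDA test accuracy:", qda.score(X_test, y_test))
```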

November 17, 2014

Logistic Regression

Logistic regression deals with a target variable that is categorical. Its goal is to predict the conditional probability that the target variable belongs to a specific class, given the values of the predictors.

Here we only consider the case with two classes. There exists an extension of logistic regression that handles more than two classes, but linear discriminant analysis is usually preferred in that setting, so the extension is not discussed here.

Consider the case where the target variable $Y$ can only belong to one of two classes. $Y$ can then be coded as follows:
$$Y =
\begin{cases}
0 & \text{if in class } 1\\
1 & \text{if in class } 2
\end{cases}$$
The conditional probability that we are interested in is $\text{Pr}(Y = 1 | X)$, where $X$ represents the predictors. A nice property of logistic regression is that this conditional probability will always be greater than 0 and less than 1.

If we abbreviate $\text{Pr}(Y = 1 | X)$ as $p(X)$, the model that we are fitting is the following:
$$p(X) = \frac{e^{\beta_0 + \beta_{1}X_1 + \cdots + \beta_{p}X_p}}{1 + e^{\beta_0 + \beta_{1}X_1 + \cdots + \beta_{p}X_p}}$$
where $p$ is the number of predictors. Since this is equivalent to
$$\text{log}\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_{1}X_1 + \cdots + \beta_{p}X_p,$$
we see that we are really modeling the log-odds of $p(X)$. The expression on the left-hand side is called the logit function (which explains the name of this method).

To estimate the coefficients $\beta_i$, we maximize the likelihood function
$$l(\beta_0, \beta_1, \cdots, \beta_p) = \prod_{i: y_i = 1}p(x_i) \prod_{j: y_j = 0}(1 - p(x_j)).$$
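The maximization is usually carried out with iteratively reweighted least squares, but as a rough illustration of what maximizing the likelihood means, here is a sketch using plain gradient ascent on the log-likelihood (the data, step size, and iteration count are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data generated from a known logistic model.
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 predictors
beta_true = np.array([-0.5, 1.0, -2.0])
p_true = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, p_true)

# Gradient ascent on the log-likelihood
#   l(beta) = sum_i [ y_i log p(x_i) + (1 - y_i) log(1 - p(x_i)) ],
# whose gradient is X^T (y - p).
beta = np.zeros(X.shape[1])
step = 0.5
for _ in range(10000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += step * X.T @ (y - p) / n    # average gradient of the log-likelihood

print("estimated coefficients:", beta)  # should be close to beta_true
```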

To assess the significance of each predictor in the model, we can perform z-tests. The null hypothesis is $H_0: \beta_i = 0$ and the z-statistic is $\frac{\hat{\beta_i}}{\text{SE}(\hat{\beta_i})}$.
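These z-statistics are reported by standard software. For example, a minimal sketch with statsmodels (the data here are made up; only the first predictor actually matters):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Made-up data: two predictors, binary target.
X = rng.normal(size=(500, 2))
p = 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 0])))   # only the first predictor matters
y = rng.binomial(1, p)

X_design = sm.add_constant(X)                  # add the intercept column
result = sm.Logit(y, X_design).fit()
print(result.summary())                        # coefficients, SEs, z-statistics, p-values
```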

November 12, 2014

K-Nearest Neighbors

KNN classifier
Here we want to estimate the conditional probability that an observation with predictor vector $x_0$ belongs to class $j$. To do so with the KNN classifier, first we identify the $K$ observations closest to $x_0$ and represent them by $N_0$. Then we take the proportion of points in $N_0$ with target class $j$:
$$\text{Pr}(Y = j | X = x_0) = \frac{1}{K} \sum_{i \in N_0} I(y_i = j).$$
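A minimal NumPy sketch of the KNN classifier (the helper function and the tiny dataset are made up for illustration):

```python
import numpy as np

def knn_class_probs(X_train, y_train, x0, K):
    """Estimated Pr(Y = j | X = x0) for each class, using the K nearest neighbors."""
    dists = np.linalg.norm(X_train - x0, axis=1)         # Euclidean distances to x0
    neighbors = np.argsort(dists)[:K]                    # indices of the K closest points
    classes = np.unique(y_train)
    return classes, np.array([np.mean(y_train[neighbors] == j) for j in classes])

# Tiny made-up example.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1, 1])
classes, probs = knn_class_probs(X_train, y_train, x0=np.array([1.0, 0.9]), K=3)
print(dict(zip(classes.tolist(), probs.tolist())))       # {0: 0.0, 1: 1.0}
```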


KNN regression
Here we want to predict the target value for predictor vector $x_0$. To do so with KNN regression, we take the $K$ observations closest to $x_0$ (which we represent by $N_0$) and simply take the average of the target values:
$$\hat{f}(x_0) = \frac{1}{K} \sum_{i \in N_0} y_i.$$
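And the analogous sketch for KNN regression (again, the helper and data are made up):

```python
import numpy as np

def knn_regress(X_train, y_train, x0, K):
    """Average of the targets of the K training points closest to x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    neighbors = np.argsort(dists)[:K]
    return y_train[neighbors].mean()

# Tiny made-up example.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 2.0, 2.5, 10.0])
print(knn_regress(X_train, y_train, x0=np.array([1.2]), K=2))   # -> 2.25 (average of y at x=1 and x=2)
```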

November 11, 2014

Least Squares Linear Regression

The goal of least squares regression is to fit the following model, where $p$ is the number of predictors, and $n$ is the number of observations:
$$Y = \beta_{0} + \beta_{1}X_{1} + \cdots + \beta_{p}X_{p} + \epsilon .$$

To do so, we minimize the residual sum of squares: $\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^2$ where $y_{i}$ is the observed value of the $i$th target variable and $\hat{y}_{i}$ is its predicted value.
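A minimal NumPy sketch of fitting such a model by minimizing the residual sum of squares (made-up data; `np.linalg.lstsq` does the minimization):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data: Y = 1 + 2*X1 - 3*X2 + noise.
n = 200
X = rng.normal(size=(n, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column; lstsq minimizes the RSS.
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

y_hat = X_design @ beta_hat
rss = np.sum((y - y_hat) ** 2)
print("coefficients:", beta_hat)   # should be close to [1, 2, -3]
print("RSS:", rss)
```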


There are several ways to assess the accuracy of the model:

1) Residual Standard Error (RSE)
$$\text{RSE} =  \sqrt{\frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^2}{n-p-1}}$$
RSE measures the average amount that the prediction deviates from the truth. Generally, a high RSE means that the model is not a good fit for the data. However, it is not always clear what "high" is because RSE is measured in the units of the target variable. One way to deal with this is to divide the RSE by the mean of the target variable to report the percentage error.

2) $R^2$
$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}},$$
where $\text{TSS} = \sum_{i = 1}^{n} (y_{i} - \bar{y})^2$ is the total sum of squares and $\text{RSS} = \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^2$ is the residual sum of squares.
The $R^2$ statistic reports the proportion of variability in $Y$ explained by the regression. $R^2$ is a number between 0 and 1 and is independent of the scale of $Y$. Generally, the closer to 1, the better the model. However, if we expect $\epsilon$ to be large (if we expect our model to explain only a small part of the change in the target value), then a small $R^2$ might be sufficient.
Note that when there is a single predictor, $R^2 = \text{Cor}(X, Y)^2$, but that when there are multiple predictors, $R^2 = \text{Cor}(Y, \hat{Y})^2$.
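A quick sketch computing both the RSE and the $R^2$ statistic for a least squares fit (made-up data, continuing the style of the sketch above):

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up data and a least-squares fit, as in the earlier sketch.
n, p = 200, 2
X = rng.normal(size=(n, p))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=n)
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ beta_hat

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

rse = np.sqrt(rss / (n - p - 1))
r_squared = (tss - rss) / tss

print("RSE:", rse)          # roughly the noise standard deviation used above (0.5)
print("R^2:", r_squared)    # close to 1 for this strongly signalled data
```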

3) F-test
To check whether there is a relationship between the target variable and the predictors, we can perform an F-test. Here we are asking whether at least one of the predictors' coefficients is non-zero:
$$\begin{split}
& H_0: \beta_{1} = \beta_{2} = \cdots= \beta_{p} = 0\\
& H_1: \beta_{i} \neq 0 \text{  for at least one } i
\end{split}$$
The F-statistic is as follows:
$$F = \frac{(\text{TSS}-\text{RSS})/p}{\text{RSS}/(n-p-1)}.$$
When $H_0$ is true, the F-statistic has an F-distribution and we can compute the p-value. If the p-value is small, then there is evidence to reject $H_0$, which suggests that there is a relationship between at least one of the predictors and the target variable.

4) t-tests
We can also perform individual tests on each predictor to ask whether one predictor is related to the target variable. If we want to know whether the $i$th predictor is relevant, we can perform the following t-test:
$$\begin{split}
& H_0: \beta_{i} = 0\\
& H_1: \beta_{i} \neq 0
\end{split}$$
with the following t-statistic:
$$t = \frac{\hat{\beta}_i - 0}{\text{SE}(\hat{\beta}_i)},$$
where $\text{SE}(\hat{\beta}_i)$ is the standard error of $\hat{\beta}_i$. If $H_0$ is true, then the t-statistic has a $t$ distribution with $n - p - 1$ degrees of freedom ($n - 2$ when there is a single predictor), and we can compute the p-value.
Note that the square of this t-statistic is equivalent to the F-statistic of a partial F-test that compares the full model to the model fit without the $i$th predictor. That F-statistic is computed using the residual sum of squares of the regression without the $i$th predictor in place of TSS. This means that the t-statistic reports the partial effect of adding the $i$th predictor to the regression.
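Both the overall F-test and the individual t-tests are reported by standard regression software. Here is a minimal sketch using statsmodels (made-up data in which only the first of three predictors matters):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Made-up data: only the first of three predictors is truly related to y.
n = 200
X = rng.normal(size=(n, 3))
y = 1 + 2 * X[:, 0] + rng.normal(size=n)

result = sm.OLS(y, sm.add_constant(X)).fit()
print(result.summary())                  # R^2, overall F-statistic, per-coefficient t-tests
print(result.fvalue, result.f_pvalue)    # the F-statistic and its p-value
print(result.tvalues)                    # the individual t-statistics
```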


We can extend linear regression to include non-linear relationships. The model is still linear in its coefficients; we simply create new predictors that are non-linear functions of the original predictors.

1) Interaction Terms
The interaction effect occurs when the effect of one predictor on the target variable depends on the value of another predictor. This is also called the synergy effect in marketing. If we have two predictors $X_{1}$ and $X_{2}$, then we can add an interaction term $X_{1}X_{2}$ to our model, like so:
$$Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{1}X_{2} + \epsilon .$$
This can be rewritten so that the coefficient of $X_{1}$ depends on the value of $X_2$, making the interaction clear:
$$Y = \beta_{0} + (\beta_{1} + \beta_{3}X_{2})X_{1} + \beta_{2}X_{2} + \epsilon.$$
If we do include an interaction term in our model, then it is safer to also include the main effects $X_1$ and $X_2$, regardless of their significance. This helps the interpretability of the model and is referred to as the hierarchical principle.
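Because the model is still linear in its coefficients, the interaction term is just one more column in the design matrix. A minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up data with a true interaction between X1 and X2.
n = 300
X1, X2 = rng.normal(size=(2, n))
y = 1 + 2 * X1 - 1 * X2 + 3 * X1 * X2 + rng.normal(scale=0.5, size=n)

# The interaction term X1*X2 is simply an extra column.
X_design = np.column_stack([np.ones(n), X1, X2, X1 * X2])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)   # should be close to [1, 2, -1, 3]
```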

2) Polynomial Regression
With polynomial regression, we add higher powers of existing predictors. It can also be useful to consider other transformations, such as the logarithm or the square root.
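A minimal sketch of fitting a quadratic by adding $x^2$ as an extra predictor (made-up data):

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up data with a quadratic relationship.
n = 300
x = rng.uniform(-3, 3, size=n)
y = 0.5 - 1.0 * x + 2.0 * x**2 + rng.normal(scale=0.5, size=n)

# Polynomial regression: add x^2 as an extra predictor.
X_design = np.column_stack([np.ones(n), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)   # should be close to [0.5, -1.0, 2.0]
```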

November 7, 2014

CRISP-DM

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a framework that outlines the main tasks underlying the data-mining process. It is described mostly from the business point of view and does not specify technical details.


Here are the six phases, with the corresponding tasks:

1) Business Understanding:
     - defining business goals
     - assessing the situation
     - defining data-mining goals
     - creating the project plan

2) Data Understanding:
     - gathering data
     - describing data (overall)
     - exploring data (variable by variable)
     - verifying data quality

3) Data Preparation:
     - selecting data
     - cleaning data
     - constructing data (e.g., deriving new variables)
     - integrating data (merging data into one dataset)
     - formatting data

4) Modeling:
     - selecting modeling techniques
     - designing tests
     - building models
     - assessing models

5) Evaluation:
     - evaluating results
     - reviewing the process
     - determining the next step

6) Deployment:
     - planning deployment
     - planning monitoring and maintenance
     - reporting final results
     - reviewing final results

November 4, 2014

9 Laws of Data Mining

These laws were formulated by Thomas Khabaza, a pioneer of data mining.

1) Business Goals
business objectives are at the origin of every data-mining solution
You should focus on business goals throughout the data-mining process.

2) Business Knowledge
business knowledge is central to every step of the data-mining process
Data mining is only valuable if you have the business understanding to give context to your data.

3) Data Preparation
data preparation is more than half of every data-mining process
Data usually isn't collected specifically for data mining.

4) Right Model (no free lunch)
the right model for a given application can only be discovered by experiment
Models are selected through trial and error rather than by studying theory.

5) Pattern
there are always patterns
Data always has something to say, whether it's what you wanted to hear or not.

6) Amplification (insight)
data mining amplifies perception in the business domain
Data mining finds information that wouldn't appear in ordinary reports.

7) Prediction
prediction increases information locally by generalization
Data mining uses what we know to predict what we don't know.

8) Value
the value of data-mining results is not determined by the accuracy or stability of predictive models
Priority is not given to theoretical properties, but rather to business applications.

9) Change
all patterns are subject to change
Models you create today might be useless tomorrow.