LDA (linear discriminant analysis) and QDA (quadratic discriminant analysis) are both classifiers for a categorical target variable. Instead of directly modeling the conditional probability that the target belongs to a certain class given the values of the predictors (as logistic regression does), LDA and QDA model the reverse conditional probability: the probability that the predictors take on a specific set of values, given that the corresponding target belongs to a certain class. Bayes' theorem then converts this back into the probability we are actually interested in.
If $X$ represents the predictors, $Y$ the target variable, and $K$ the number of different classes, this means that LDA and QDA estimate $\text{Pr}(X = x | Y = k)$. Then they use the following form of Bayes' theorem to compute what we are ultimately interested in, $\text{Pr}(Y = k | X = x)$:
$$
\text{Pr}(Y = k | X = x) = \frac{\pi_{k}\text{Pr}(X = x | Y = k)}{\sum_{j=1}^{K} \pi_{j}\text{Pr}(X = x | Y = j)},
$$
where $\pi_{k}$ represents the probability that a randomly chosen observation belongs to the $k$th class. $\pi_{k}$ is called the prior probability and $\text{Pr}(Y = k | X = x)$ the posterior probability. $\pi_{k}$ can easily be estimated using the fraction of the training observations in the $k$th class:
$$\hat{\pi}_{k} = \frac{n_k}{n},$$
where $n_k$ is the number of training observations in class $k$ and $n$ is the total number of training observations.
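As a minimal sketch of this step, here is how the priors and posteriors could be computed; the class-conditional densities $\text{Pr}(X = x | Y = k)$ are simply made-up numbers at this point (how to model them is discussed below):

```python
import numpy as np

def estimate_priors(y, classes):
    """Estimate pi_k as the fraction of training observations in class k."""
    n = len(y)
    return np.array([np.sum(y == k) / n for k in classes])

def posteriors(densities, priors):
    """Bayes' theorem: turn class-conditional densities Pr(X=x|Y=k)
    and priors pi_k into posteriors Pr(Y=k|X=x)."""
    unnormalized = priors * densities
    return unnormalized / unnormalized.sum()

# Hypothetical example with three classes at some point x.
priors = np.array([0.5, 0.3, 0.2])
densities = np.array([0.02, 0.10, 0.05])   # Pr(X=x|Y=k) for k = 1, 2, 3
print(posteriors(densities, priors))        # sums to 1
```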
To model $\text{Pr}(X = x | Y = k)$, LDA and QDA assume that, within the $k$th class, $X$ follows a multivariate normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$. This means that, with $p$ predictors, we get:
$$
\text{Pr}(X = x | Y = k) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \text{exp}\left( -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) \right).
$$
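For illustration, this density can be evaluated with SciPy and checked against the formula above; the class mean and covariance used here are made up:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical class-k parameters for p = 2 predictors.
mu_k = np.array([1.0, 2.0])
sigma_k = np.array([[1.0, 0.3],
                    [0.3, 2.0]])

x = np.array([1.5, 1.0])

# Pr(X = x | Y = k) under the multivariate normal assumption.
density = multivariate_normal(mean=mu_k, cov=sigma_k).pdf(x)

# The same value computed directly from the formula, as a sanity check.
p = len(x)
diff = x - mu_k
density_manual = (
    np.exp(-0.5 * diff @ np.linalg.inv(sigma_k) @ diff)
    / ((2 * np.pi) ** (p / 2) * np.linalg.det(sigma_k) ** 0.5)
)
print(density, density_manual)  # the two values should agree
```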
The difference between the two classifiers is that LDA makes the simplifying assumption that each class has the same covariance matrix $\Sigma$. Note that if there is only one predictor, then we assume a normal distribution, which is just a special case of the multivariate normal distribution, so the following formulas still hold.
With LDA, we need to estimate $K$ class-specific means $\hat{\mu}_k$ and one common covariance matrix $\hat{\Sigma}$, which is really just a weighted average of the class-specific covariance matrices:
$$
\begin{split}
\hat{\mu}_k & = \frac{1}{n_k}\sum_{i: y_i = k}x_i\\
\hat{\Sigma} & = \frac{1}{n - K}\sum_{k = 1}^{K}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T
\end{split}
$$
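A small sketch of these estimates, assuming `X` is an `(n, p)` array of predictors and `y` an array of class labels:

```python
import numpy as np

def lda_estimates(X, y):
    """Estimate the class means mu_k and the pooled covariance matrix Sigma."""
    classes = np.unique(y)
    n, p = X.shape
    K = len(classes)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    pooled = np.zeros((p, p))
    for k in classes:
        centered = X[y == k] - means[k]
        pooled += centered.T @ centered   # sum of (x_i - mu_k)(x_i - mu_k)^T
    pooled /= (n - K)
    return means, pooled
```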
Now, assigning an observation to the class with the largest posterior probability $\text{Pr}(Y = k | X = x)$ is equivalent to assigning it to the class that maximizes, with respect to $k$, the following discriminant function:
$$
\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \text{log}(\pi_k).
$$
Note that this function is linear in $x$, hence the name linear discriminant analysis.
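Continuing the sketch, classification then amounts to evaluating $\delta_k(x)$ for every class and picking the largest value. This builds on the hypothetical `lda_estimates` helper above, plus a mapping `priors` from class label to estimated prior:

```python
import numpy as np

def lda_discriminant(x, mu_k, sigma_inv, pi_k):
    """Linear discriminant function delta_k(x)."""
    return x @ sigma_inv @ mu_k - 0.5 * mu_k @ sigma_inv @ mu_k + np.log(pi_k)

def lda_predict(x, means, sigma, priors):
    """Assign x to the class with the largest discriminant value."""
    sigma_inv = np.linalg.inv(sigma)
    scores = {k: lda_discriminant(x, means[k], sigma_inv, priors[k])
              for k in means}
    return max(scores, key=scores.get)
```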
If we instead allow class-specific covariance matrices and use QDA, we can use the same estimates for $\mu_k$, but the estimates for $\Sigma_k$ become:
$$
\hat{\Sigma}_k = \frac{1}{n_k}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T,
$$
and the discriminant function is:
$$
\delta_k(x) = - \frac{1}{2}x^T\Sigma_k^{-1}x + x^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\text{log}|\Sigma_k| + \text{log}(\pi_k),
$$
which is quadratic in $x$, hence the name quadratic discriminant analysis.
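The QDA version of the sketch differs only in using a per-class covariance estimate and the quadratic discriminant function (again purely illustrative):

```python
import numpy as np

def qda_estimates(X, y):
    """Estimate class means mu_k and class-specific covariances Sigma_k."""
    classes = np.unique(y)
    means, covs = {}, {}
    for k in classes:
        X_k = X[y == k]
        means[k] = X_k.mean(axis=0)
        centered = X_k - means[k]
        covs[k] = centered.T @ centered / len(X_k)   # 1/n_k normalization, as above
    return means, covs

def qda_discriminant(x, mu_k, sigma_k, pi_k):
    """Quadratic discriminant function delta_k(x)."""
    sigma_inv = np.linalg.inv(sigma_k)
    return (-0.5 * x @ sigma_inv @ x
            + x @ sigma_inv @ mu_k
            - 0.5 * mu_k @ sigma_inv @ mu_k
            - 0.5 * np.log(np.linalg.det(sigma_k))
            + np.log(pi_k))
```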
LDA has more model bias but less variance than QDA, since it makes stronger assumptions and estimates fewer parameters. Thus, LDA is probably the better choice if we have a relatively small training set, but QDA might do better if we have a very large training set, or if a common covariance matrix is an unrealistic assumption.
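In practice both classifiers are available in scikit-learn, so a quick cross-validated comparison is one way to see which side of this trade-off wins on a given dataset; the data below is synthetic and only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

# Synthetic three-class dataset, purely for illustration.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis())]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```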