
November 11, 2014

Least Squares Linear Regression

The goal of least squares regression is to fit the following model, where $p$ is the number of predictors, and $n$ is the number of observations:
$$Y = \beta_{0} + \beta_{1}X_{1} + \cdots + \beta_{p}X_{p} + \epsilon .$$

To do so, we minimize the residual sum of squares: $\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^2$ where $y_{i}$ is the observed value of the $i$th target variable and $\hat{y}_{i}$ is its predicted value.
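As a concrete illustration (not part of the original text), here is a minimal NumPy sketch that fits this model on simulated data by minimizing the residual sum of squares; all of the names (`X`, `y`, `beta_hat`, and so on) are placeholders for the example.

```python
# Minimal least squares fit on simulated data (illustrative names only).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 2                            # observations, predictors
X = rng.normal(size=(n, p))              # predictor matrix
beta_true = np.array([1.0, 2.0, -1.5])   # intercept plus p coefficients
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=n)

# Add an intercept column and solve for the coefficients that
# minimize the residual sum of squares.
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

y_hat = X_design @ beta_hat              # fitted values
rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
print(beta_hat, rss)
```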


There are several ways to assess the accuracy of the model:

1) Residual Standard Error (RSE)
$$\text{RSE} =  \sqrt{\frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^2}{n-p-1}}$$
RSE is an estimate of the standard deviation of the error term $\epsilon$: roughly, the average amount by which the response deviates from the true regression line. Generally, a high RSE means that the model is not a good fit for the data. However, it is not always clear what "high" is because RSE is measured in the units of the target variable. One way to deal with this is to divide the RSE by the mean of the target variable to report the percentage error.
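Assuming the `y`, `y_hat`, and `p` from the sketch above, RSE and the percentage error can be computed like this:

```python
# RSE and percentage error; continues the simulated-data sketch above.
import numpy as np

def residual_standard_error(y, y_hat, p):
    """RSE = sqrt(RSS / (n - p - 1)), where p is the number of predictors."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return np.sqrt(rss / (n - p - 1))

# Percentage error: RSE divided by the mean of the target variable.
# pct_error = residual_standard_error(y, y_hat, p) / np.mean(y)
```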

2) $R^2$
$$R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}},$$
where $\text{TSS} = \sum_{i = 1}^{n} (y_{i} - \bar{y})^2$ is the total sum of squares ($\bar{y}$ being the mean of the observed target values) and $\text{RSS} = \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^2$ is the residual sum of squares.
The $R^2$ statistic reports the proportion of variability in $Y$ explained by the regression. $R^2$ is a number between 0 and 1 and is independent of the scale of $Y$. Generally, the closer to 1, the better the model. However, if we expect $\epsilon$ to be large (that is, if we expect our model to explain only a small part of the variability in the target variable), then a small $R^2$ might still be acceptable.
Note that when there is a single predictor, $R^2 = \text{Cor}(X, Y)^2$, but that when there are multiple predictors, $R^2 = \text{Cor}(Y, \hat{Y})^2$.
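A small helper for $R^2$, again assuming the `y` and `y_hat` arrays from the earlier sketch:

```python
# R^2 from observed and fitted values.
import numpy as np

def r_squared(y, y_hat):
    """R^2 = (TSS - RSS) / TSS, the proportion of variance explained."""
    tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    return (tss - rss) / tss
```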

3) F-test
To check whether there is a relationship between the target variable and the predictors, we can perform an F-test. Here we are asking whether at least one of the predictors' coefficients is non-zero:
$$\begin{split}
& H_0: \beta_{1} = \beta_{2} = \cdots= \beta_{p} = 0\\
& H_1: \beta_{i} \neq 0 \text{  for at least one } i
\end{split}$$
The F-statistic is as follows:
$$F = \frac{(\text{TSS}-\text{RSS})/p}{\text{RSS}/(n-p-1)}.$$
When $H_0$ is true, the F-statistic follows an F-distribution with $p$ and $n - p - 1$ degrees of freedom, and we can compute the p-value. If the p-value is small, then there is evidence to reject $H_0$, which suggests that there is a relationship between at least one of the predictors and the target variable.
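Here is a sketch of the overall F-test using SciPy for the p-value (the function name and arguments are my own; they assume the `y`, `y_hat`, and `p` defined earlier):

```python
# Overall F-test: H0 is that all p slope coefficients are zero.
import numpy as np
from scipy import stats

def f_test(y, y_hat, p):
    n = len(y)
    tss = np.sum((y - np.mean(y)) ** 2)
    rss = np.sum((y - y_hat) ** 2)
    f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
    p_value = stats.f.sf(f_stat, p, n - p - 1)   # sf = 1 - CDF
    return f_stat, p_value
```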

4) t-tests
We can also perform individual tests on each predictor to ask whether one predictor is related to the target variable. If we want to know whether the $i$th predictor is relevant, we can perform the following t-test:
$$\begin{split}
& H_0: \beta_{i} = 0\\
& H_1: \beta_{i} \neq 0
\end{split}$$
with the following t-statistic:
$$t = \frac{\hat{\beta}_i - 0}{\text{SE}(\hat{\beta}_i)},$$
where $\text{SE}(\hat{\beta}_i)$ is the standard error of $\hat{\beta}_i$. If $H_0$ is true, then the t-statistic has a $t$ distribution with $n - p - 1$ degrees of freedom ($n - 2$ in simple regression with a single predictor), and we can compute the p-value.
Note that the square of this t-statistic equals the partial F-statistic for dropping just the $i$th predictor from the full model: it is computed using the residual sum of squares of the regression fit without the $i$th predictor in place of TSS. In this sense, the t-statistic reports the partial effect of adding the $i$th predictor to a model that already contains all the others.
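A sketch of the per-coefficient t-tests, computed from the design matrix `X_design` and fitted coefficients `beta_hat` of the earlier example (the helper name is my own):

```python
# t-statistics and two-sided p-values for each coefficient (intercept included).
import numpy as np
from scipy import stats

def coefficient_t_tests(X_design, y, beta_hat):
    n, k = X_design.shape                        # k = p + 1 columns
    y_hat = X_design @ beta_hat
    sigma2 = np.sum((y - y_hat) ** 2) / (n - k)  # RSS / (n - p - 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X_design.T @ X_design)))
    t_stats = beta_hat / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k)
    return t_stats, p_values
```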


We can extend linear regression to include non-linear relationships. The model is still linear in the coefficients; we simply create new predictors that are non-linear functions of the original predictors.

1) Interaction Terms
The interaction effect occurs when the effect of one predictor on the target variable depends on the value of another predictor. This is also called the synergy effect in marketing. If we have two predictors $X_{1}$ and $X_{2}$, then we can add an interaction term $X_{1}X_{2}$ to our model, like so:
$$Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \beta_{3}X_{1}X_{2} + \epsilon .$$
This can be rewritten so that the coefficient of $X_{1}$ depends on the value of $X_2$, making the interaction clear:
$$Y = \beta_{0} + (\beta_{1} + \beta_{3}X_{2})X_{1} + \beta_{2}X_{2} + \epsilon.$$
If we do include an interaction term in our model, then it is safer to also include the main effects $X_1$ and $X_2$ regardless of their significance. This helps the interpretability of the model and is referred to as the hierarchical principle.
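In code, adding an interaction term just means appending the product of the two predictor columns before fitting; a sketch using the simulated `X` and `y` from earlier:

```python
# Append the interaction X1 * X2 as an extra column, then refit as before.
import numpy as np

def add_interaction(X):
    """Return X augmented with the product of its first two columns."""
    return np.column_stack([X, X[:, 0] * X[:, 1]])

# X_int = add_interaction(X)
# X_design = np.column_stack([np.ones(len(X_int)), X_int])
# beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
```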

2) Polynomial Regression
With polynomial regression, we add higher powers of existing predictors. It can also be useful to consider other transformations, such as the logarithm or the square root.
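Similarly, polynomial or transformed predictors are just extra columns built from the originals; a small sketch for a single 1-D predictor `x` (the function name is illustrative):

```python
# Build x, x^2, ..., x^degree as columns for a 1-D predictor x.
import numpy as np

def polynomial_features(x, degree=3):
    return np.column_stack([x ** d for d in range(1, degree + 1)])

# Other transformations can be added the same way, e.g.
# np.log(x) or np.sqrt(x) for strictly positive x.
```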
