Multiple Linear Regression — ISLR Series: Chapter 3 Part II
In the previous blog, we talked about Simple Linear Regression (SLR): predicting the response using only one predictor. But in the real world we rarely have just one variable; we usually have several. In these common situations we apply Multiple Linear Regression (MLR).
MLR makes the same assumption as SLR: that the data can be represented with a linear form. The only difference is that there are more predictors to consider, so the model becomes Y = β0 + β1X1 + β2X2 + … + βpXp + ε.
Each additional predictor gets its own coefficient. (Remember that the purpose of a coefficient is to quantify the association between that predictor and the response.)
With more predictors comes more responsibility
There are four questions to ask when applying an MLR model:
Question 1: Is at least one predictor (X1, X2, … Xp) useful in predicting the response?
To answer this question, we bring back the idea of hypothesis testing. With SLR, our null hypothesis is β1 = 0. With MLR, the null hypothesis covers all the predictors at once: β1 = β2 = … = βp = 0.
Instead of the t-statistic, we use the F-statistic to help us test our hypothesis.
The larger the F-statistic is, the more evidence we have to reject the null hypothesis. If the F-statistic is small (close to 1), then the null hypothesis may stand. How large does the F-statistic need to be to reject the null hypothesis? That depends on n, the number of observations: the more observations we have, the smaller the F-statistic can be and still justify rejecting the null hypothesis.
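To make the F-statistic concrete, here is a minimal sketch in Python (the book's labs use R) on synthetic data, using the formula F = ((TSS − RSS)/p) / (RSS/(n − p − 1)); the data and coefficients are invented for illustration:

```python
import numpy as np

# Synthetic data: n = 100 observations, p = 2 predictors (purely illustrative).
rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)

# Fit MLR by least squares (add an intercept column, then solve).
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
rss = np.sum((y - X1 @ beta) ** 2)   # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares

# F-statistic for H0: beta_1 = beta_2 = ... = beta_p = 0
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
print(f_stat)  # far above 1 here, so H0 is rejected
```

Since the true coefficients are non-zero and the noise is small, the F-statistic comes out far above 1, which is exactly the evidence against the null hypothesis described above.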
Question 2: Which ones are the most important predictors?
Not all predictors are heavily associated with the response, and unnecessary predictors can harm the performance of the model. So it is crucial to determine which predictors are useful and which are not. This process is called variable selection. There are many techniques for variable selection; three common ones are Forward Selection (add variables one by one), Backward Selection (fit all the predictors and remove the least useful ones one by one), and Mixed Selection (add predictors one by one while also looking to remove predictors). These methods will be discussed more in Chapter 6. After applying these techniques, we can use Mallow's Cp, AIC, BIC and/or adjusted R2 to determine which technique produced the best model.
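As a rough illustration of forward selection, here is a hedged Python sketch (the `rss` and `forward_selection` helpers are my own, not from the book): at each step it adds whichever remaining predictor most reduces the RSS.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

def forward_selection(X, y, k):
    """Greedily add, k times, the predictor that most reduces the RSS."""
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda j: rss(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: y truly depends on columns 0 and 2; column 1 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 4 * X[:, 0] + 2 * X[:, 2] + rng.normal(size=200)
print(forward_selection(X, y, 2))  # columns 0 and 2 should be picked first
```

In a full implementation you would stop adding variables when a criterion such as adjusted R2 or BIC stops improving, rather than fixing k in advance.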
Question 3: How well does the model fit the data?
The two most common measures of model fit are the RSE and R2.
Question 4: How accurate are our predictions?
Our coefficient estimates are only estimates, so they carry reducible error, which we want to reduce as much as possible. One way to gauge it is to compute confidence intervals, which show how close our estimated function is likely to be to the actual function. And since we are assuming the data fits a linear model, there is an additional source of reducible error called model bias. Prediction intervals go a step further: they also account for the irreducible error, telling us how far a prediction may be from the actual value of Y.
Extensions to Multiple Linear Regression
Linear Regression is limited to data that falls close to a linear structure, but there are some techniques that extend it. The first technique is to remove the additive assumption. The additive assumption states that the effect of a change in a predictor Xj on the response Y is independent of the values of the other predictors: in the MLR equation, each term (β1X1, β2X2, …, βpXp) contributes on its own, without affecting the others. To relax this assumption, we create a new term by multiplying two predictors we think interact. This new term is called an interaction term.
With the interaction term in the model, changing X2 no longer shifts the response by a fixed amount on its own: it also changes the impact X1 has on Y.
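A small Python sketch of the idea, on made-up data (the coefficient values are invented): adding the product column x1*x2 to the design matrix makes the slope of y with respect to x1 equal to b1 + b3*x2, which depends on x2.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# True model has an interaction: the effect of x1 on y grows with x2.
y = 1 + 2 * x1 + 3 * x2 + 1.5 * (x1 * x2) + rng.normal(size=n)

# Fit y ~ b0 + b1*x1 + b2*x2 + b3*(x1*x2) by least squares.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

# The slope of y with respect to x1 is now (b1 + b3 * x2): it depends on
# x2, so the additive assumption no longer holds.
print(b1, b3)  # should land near the true values 2 and 1.5
```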
The Fine Print for Multiple Linear Regression
There are some potential problems that can occur when fitting an MLR model, so we will go over them. The chapter lists six:
- Non-linearity of the response-predictor relationships
- Correlation of error terms
- Non-constant variance of error terms
- Outliers
- High-leverage points
- Collinearity
Non-linearity of the response-predictor relationships
Recall from the previous blog that the major assumption of Linear Regression is that the relationship between the predictors and the response is approximately linear. If this assumption does not hold, the model will perform poorly. To check it, we can examine the residual plots: if there is a clear pattern in the residual plot, the relationship is likely non-linear.
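Here is a quick Python sketch of what a residual "pattern" means, using synthetic quadratic data of my own invention: fitting a straight line to it leaves residuals that are strongly correlated with x², the numeric trace of the U shape a residual plot would show.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 200)
y = x ** 2 + rng.normal(scale=0.5, size=x.size)  # truly quadratic response

# Fit a straight line and inspect the residuals.
X1 = np.column_stack([np.ones(x.size), x])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ beta

# The U shape a residual plot would reveal appears here as residuals that
# are strongly correlated with x**2; for a truly linear relationship this
# correlation would be near zero.
print(np.corrcoef(resid, x ** 2)[0, 1])
```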
Correlation of Error Terms
Another assumption we make when applying a linear model is that the error terms are uncorrelated. If the error terms are correlated, the estimated standard errors will tend to underestimate the true standard errors. To check for correlation, we again look at the residual plot: if there is a pattern (for example, adjacent residuals tracking each other over time), the error terms are correlated, and plain MLR won't be the right model.
Non-constant Variance of Error Terms
Another assumption linear regression makes is that the error terms have constant variance. Once again we look at the residual plot: if the residuals form a funnel shape, the variance of the error terms is non-constant, a condition called heteroscedasticity. To reduce heteroscedasticity, we can transform the response with a concave function such as log Y or the square root of Y.
heteroscedasticity: when the variance of the error terms is not constant
Outliers
Outliers can definitely skew the data and reduce the accuracy of our model. One way to determine whether an observation is an outlier is to look at the studentized residuals: each residual divided by its estimated standard error, which puts the residuals on a comparable scale. A rule of thumb for interpreting them: observations whose studentized residuals are greater than 3 or less than -3 are possible outliers.
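A hedged Python sketch of the rule of thumb, on invented data with one outlier planted by hand (this computes internally studentized residuals; R's `rstudent` uses a leave-one-out variant, but the idea is the same):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
y[10] += 8  # plant an obvious outlier at index 10

X1 = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ beta

# Leverages h_i are the diagonal of the hat matrix H = X (X'X)^{-1} X'.
h = np.diag(X1 @ np.linalg.inv(X1.T @ X1) @ X1.T)

# (Internally) studentized residuals: residual / estimated standard error.
sigma = np.sqrt(np.sum(resid ** 2) / (n - 2))
student = resid / (sigma * np.sqrt(1 - h))

print(np.where(np.abs(student) > 3)[0])  # index 10 should be flagged
```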
High Leverage Points
High leverage points are observations with an unusual value of xi; a single such observation can pull the fitted model toward itself. This is a no-no because it is detrimental if one observation has the power to alter the results. There is a statistic called the leverage statistic that can be used to detect this. The average leverage is always (p+1)/n, so observations whose leverage greatly exceeds (p+1)/n are high-leverage points.
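A small Python sketch on synthetic data (the planted predictor values are invented): the leverage statistic is the diagonal of the hat matrix, and a point with an unusual x value sits far above the (p+1)/n average.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 2
X = rng.normal(size=(n, p))
X[0] = [8.0, -8.0]  # plant one observation with an unusual predictor value

# Leverage statistic: diagonal of the hat matrix H = X (X'X)^{-1} X'.
X1 = np.column_stack([np.ones(n), X])
h = np.diag(X1 @ np.linalg.inv(X1.T @ X1) @ X1.T)

avg_leverage = (p + 1) / n  # the leverages always average to (p + 1) / n
print(h[0], avg_leverage)   # h[0] sits far above the average leverage
```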
Collinearity
Collinearity is when two or more predictors are closely related to one another. This might not sound like a problem, but it is: it becomes difficult to separate out how each of those predictors is associated with the response. Collinearity reduces the accuracy of the coefficient estimates, which inflates their standard errors, which in turn shrinks the t-statistics, making it harder to detect a non-zero coefficient. To diagnose this issue, we can calculate the variance inflation factor (VIF).
The VIF for βj is the ratio of the variance of βj when fitting the full model to the variance of βj in a model containing only Xj. A VIF of 1 indicates a complete absence of collinearity; the rule of thumb is that a VIF above 5 or 10 is problematic. The remedy is to drop one of the collinear variables or to combine them into a single predictor (for example, their average).
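A minimal Python sketch (the `vif` helper is my own, using the equivalent formula VIF_j = 1/(1 − R²_j), where R²_j comes from regressing X_j on the other predictors) on invented data with two nearly identical columns:

```python
import numpy as np

def vif(X, j):
    """VIF of predictor j: 1 / (1 - R^2) from regressing X_j on the rest."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    X1 = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return tss / rss  # algebraically equal to 1 / (1 - R^2)

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # x2 is nearly a copy of x1 -> collinear
x3 = rng.normal(size=n)             # an unrelated predictor
X = np.column_stack([x1, x2, x3])

print([round(vif(X, j), 1) for j in range(3)])
# VIF for x1 and x2 should be well above 5; for x3 it should be near 1
```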
That completes Chapter 3 of Introduction to Statistical Learning with R. We will continue with Chapter 4 next: Classification. Feel free to comment on anything we missed or if any explanation is unclear. Walkthroughs in R that complement each chapter are coming soon.
Collaborators: Michael Mellinger