Multiple Linear Regression — ISLR Series: Chapter 3 Part II

Taraqur Rahman
The Biased Outliers
7 min read · Mar 10, 2021


In the previous blog, we talked about Simple Linear Regression (SLR), which predicts the response using only one predictor. But in the real world we rarely have just one variable; we usually have several. In these common situations we apply Multiple Linear Regression (MLR).

Photo by Mingwei Lim on Unsplash

The MLR assumption is the same as for SLR: it assumes the data can be represented by a linear model. The only difference in MLR is that there are more predictors to consider.

Simple Linear Regression has one predictor X, an intercept, a slope coefficient, and an error term: Y = β0 + β1X + ε.
Multiple Linear Regression includes multiple predictors (X1, …, Xp), each with its own coefficient, plus an intercept and an error term: Y = β0 + β1X1 + β2X2 + … + βpXp + ε.
With two predictors, X1 and X2, the Multiple Linear Regression model is Y = β0 + β1X1 + β2X2 + ε. (src: ISLR)

For each additional predictor, there is a separate coefficient associated with it. (Remember that the purpose of the coefficient is to quantify the association between that predictor and the response.)
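
As a minimal sketch of what this looks like in R, here we simulate a small data set loosely inspired by ISLR's Advertising example (the TV, radio and sales columns and their coefficients are invented for illustration) and fit an MLR model. The later sketches in this post reuse the ads data frame and fit object created here.

    # Simulated stand-in for an Advertising-style data set
    set.seed(1)
    ads <- data.frame(TV = runif(100, 0, 300), radio = runif(100, 0, 50))
    ads$sales <- 3 + 0.05 * ads$TV + 0.20 * ads$radio + rnorm(100)

    # Fit sales = b0 + b1*TV + b2*radio + error; each predictor gets its own coefficient
    fit <- lm(sales ~ TV + radio, data = ads)
    coef(fit)       # intercept plus one slope per predictor
    summary(fit)    # coefficients with standard errors, t-statistics and p-values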

With more predictors comes more responsibility

There are four questions to ask when applying an MLR model:

Question 1: Is at least one predictor (X1, X2, … Xp) useful in predicting the response?

To answer this question, we bring back the idea of hypothesis testing. With SLR, our null hypothesis is β1 = 0. With MLR, the null hypothesis involves all of the coefficients at once.

Null Hypothesis. The null hypothesis states that none of the predictors has a relationship with the response, i.e. H0: β1 = β2 = … = βp = 0.
Alternative Hypothesis. The alternative hypothesis states that at least one predictor is associated with the response, i.e. at least one βj is non-zero.

Instead of the t-statistic, we use the F-statistic to help us test our hypothesis.

Equation for the F-statistic: F = ((TSS − RSS) / p) / (RSS / (n − p − 1)), where TSS is the total sum of squares, RSS is the residual sum of squares, p is the number of predictors and n is the number of observations. See the previous blog for a detailed explanation of TSS and RSS.

The larger the F-statistic is, the more evidence we have to reject the null hypothesis. If the F-statistic is small (close to 1), the null hypothesis may stand. How large does the F-statistic need to be in order to reject the null hypothesis? It depends on n, the number of observations: the more observations we have, the smaller the F-statistic can be and still justify rejecting the null hypothesis.
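
Continuing with the fit object from the earlier sketch, the summary of an lm fit already carries the F-statistic and its degrees of freedom:

    # Overall F-test: H0 says every slope coefficient is zero
    f <- summary(fit)$fstatistic              # F value, numerator df (p), denominator df (n - p - 1)
    f
    pf(f[1], f[2], f[3], lower.tail = FALSE)  # p-value for the overall F-test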

Question 2: Which ones are the most important predictors?

Not all predictors are heavily associated with the response, and unnecessary predictors can harm the performance of the model, so it is crucial to determine which predictors are useful and which are not. This process is called variable selection. There are many variable selection techniques; three common ones are Forward Selection (add variables one by one), Backward Selection (start with all the predictors and remove useless ones one by one) and Mixed Selection (add predictors one by one while also looking to remove predictors). These methods are discussed in more detail in Chapter 6 of ISLR. After applying these techniques, we can use Mallow's Cp, AIC, BIC and/or adjusted R2 to decide which candidate model is best.
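
As a rough sketch, base R's step() function performs forward or backward selection using AIC; it is a stand-in for the subset-selection methods above, reusing the simulated ads data:

    # Backward selection: start from the full model and drop predictors while AIC improves
    full <- lm(sales ~ TV + radio, data = ads)
    step(full, direction = "backward")

    # Forward selection: start from the intercept-only model and add predictors one by one
    null <- lm(sales ~ 1, data = ads)
    step(null, scope = formula(full), direction = "forward")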

Question 3: How well does the model fit the data?

The most common measures of model fit are the residual standard error (RSE) and R2.
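
Both measures can be read straight off the fitted model from the earlier sketch:

    summary(fit)$sigma          # residual standard error (RSE)
    summary(fit)$r.squared      # R^2: proportion of variance in the response explained
    summary(fit)$adj.r.squared  # adjusted R^2 penalises unnecessary predictors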

Question 4: How accurate are our predictions?

The coefficients are only estimates, and the error they introduce is part of the reducible error, which we want to shrink as much as possible. One way to quantify it is to compute confidence intervals, which show how close our estimate is likely to be to the true function. Also, since we are assuming the data follow a linear model, there is an additional source of reducible error called model bias. Prediction intervals go one step further: they also account for the irreducible error, so they are always wider than confidence intervals.
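
Continuing with the earlier fit, predict() returns either kind of interval; the new observation below is made up for illustration:

    new_obs <- data.frame(TV = 150, radio = 25)                # hypothetical new observation
    predict(fit, newdata = new_obs, interval = "confidence")   # uncertainty in estimating f(X) (reducible error)
    predict(fit, newdata = new_obs, interval = "prediction")   # also includes irreducible error, so it is wider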

Extensions to Multiple Linear Regression

Linear Regression is limited to data that falls close to a linear structure, but there are some techniques we can apply to improve the models. The first technique is to remove the additive assumption. The additive assumption states that the effect of a change in a predictor Xj on the response Y is independent of the values of the other predictors: in the MLR equation, each term (β1X1, β2X2, …, βpXp) contributes on its own, with no effect on the others. To relax this assumption, we create a new term by multiplying together two predictors that we think interact with each other. This new term is called the interaction term.

With two predictors the model becomes Y = β0 + β1X1 + β2X2 + β3X1X2 + ε. The interaction term, β3X1X2, multiplies X1 and X2 and gets its own coefficient. Adjusting X2 now affects not only how X2 relates to Y but also how X1 relates to Y.

We can see this by rewriting the model as Y = β0 + (β1 + β3X2)X1 + β2X2 + ε: increasing X2 changes the slope on X1, i.e. the impact X1 has on Y.
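
A minimal sketch with the simulated ads data from above; in R's formula notation, TV * radio adds the interaction term automatically:

    # TV * radio expands to TV + radio + TV:radio, so the interaction gets its own coefficient (the beta3 above)
    fit_int <- lm(sales ~ TV * radio, data = ads)
    coef(fit_int)   # the TV:radio coefficient: the effect of TV now depends on the level of radio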

The Fine Print for Multiple Linear Regression

There are some potential problems that can occur when implementing MLR, so we will go over them. The chapter lists six:

  • Non-linearity of the response-predictor relationships
  • Correlation of error terms
  • Non-constant variance of error terms
  • Outliers
  • High-leverage points
  • Collinearity

Non-linearity of the response-predictor relationships

Recall from the previous blog that the major assumption of Linear Regression is that the relationship between the predictors and the response is approximately linear. If this assumption does not hold, the model will perform poorly. To judge how linear the relationship is, we can look at residual plots: if there is a clear pattern in the residual plot, the relationship is non-linear.

The left panel shows a pattern in the residuals: they go down, then back up, in a U-shape. This indicates that the relationship between the predictors and the response is NOT linear. In the right panel there is no obvious pattern, which means the linear assumption holds. (src: ISLR)
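
A quick way to draw this residual plot for the fit from the earlier sketch:

    # Residuals vs fitted values; a clear pattern (e.g. a U-shape) suggests non-linearity
    plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)

    # Built-in equivalent with a smoothed trend line
    plot(fit, which = 1)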

Correlation of Error Terms

Another assumption we make when applying a linear model is that the error terms are uncorrelated. If the error terms are correlated, the estimated standard errors will tend to underestimate the true standard errors. To check for correlation we again look at the residuals: if adjacent residuals track each other (for example when plotted in time order), the error terms are likely correlated and plain MLR won't be the right model.
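
A rough way to eyeball this with the earlier fit, assuming the rows happen to be in time order:

    # Long runs of residuals with the same sign (tracking) hint at correlated errors
    plot(resid(fit), type = "l", xlab = "Observation order", ylab = "Residual")
    abline(h = 0, lty = 2)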

Non-constant Variance of Error Terms

Another assumption linear regression makes is that the error terms have constant variance. Once again we look at the residual plot to check this. If there is a funnel shape, the variance of the error terms is non-constant, a situation known as heteroscedasticity. To reduce heteroscedasticity, we can transform the response Y with a concave function such as the log or the square root.

heteroscedasticity: when the variance of the error terms is not constant

The left panel shows the variance of the residuals increasing as the fitted values grow (heteroscedasticity), which is something we need to address before applying MLR. On the right, the variance looks constant, which is what MLR assumes. (src: ISLR)
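
A minimal sketch of the log transformation, reusing the simulated ads data and assuming the response is strictly positive so the log is defined:

    # Refit on a concave transformation of the response to damp a funnel-shaped spread
    fit_log <- lm(log(sales) ~ TV + radio, data = ads)
    plot(fitted(fit_log), resid(fit_log), xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)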

Outliers

Outliers can definitely skew the data and reduce the accuracy of our model. One way to determine whether an observation is an outlier is to look at the studentized residuals. A studentized residual takes the residual and divides it by its estimated standard error, putting the residuals on a comparable scale. A rule of thumb for interpreting them: observations whose studentized residuals are greater than 3 or less than -3 are possible outliers.

The graph plots the fitted values against the studentized residuals. The red point, observation 20, is flagged as an outlier because its studentized residual is above 3. (src: ISLR)
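
With the fit from the earlier sketch, base R's rstudent() returns the studentized residuals directly:

    stud <- rstudent(fit)          # each residual divided by its estimated standard error
    which(abs(stud) > 3)           # rule-of-thumb flag for possible outliers
    plot(fitted(fit), stud, xlab = "Fitted values", ylab = "Studentized residuals")
    abline(h = c(-3, 3), lty = 2)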

High Leverage Points

High leverage points are observations with an unusual value of xi, and a single such observation can have a large impact on the fitted model. That is a problem, because it is detrimental if one observation has the power to alter the results. There is a statistic called the leverage statistic that can be used to check for this: if an observation's leverage statistic greatly exceeds the average leverage, (p + 1)/n, we may suspect it is a high leverage point.

On the left, observation 41 has the highest leverage. The red line is fit to all the data, while the blue line is fit with observation 41 removed; there is a visible difference between the two fits. On the right, the red point does not fall within the cluster of data and therefore has high leverage. (src: ISLR)
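
Base R's hatvalues() returns the leverage statistics for the earlier fit; the doubling in the last line is a common rule of thumb rather than something from ISLR:

    lev <- hatvalues(fit)                    # leverage statistic h_i for each observation
    avg <- length(coef(fit)) / nrow(ads)     # (p + 1) / n, the average leverage
    head(sort(lev, decreasing = TRUE))       # inspect the highest-leverage observations
    which(lev > 2 * avg)                     # flag points well above the average leverage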

Collinearity

Collinearity is when two or more predictors are closely related to one another. This might not sound like a problem, but it is: it becomes difficult to separate out how each of those predictors is individually associated with the response. Collinearity reduces the accuracy of the coefficient estimates, which makes the standard errors grow; that shrinks the t-statistic and makes it harder to conclude that a coefficient is non-zero. To detect this issue, we can calculate the variance inflation factor (VIF).

The VIF is the ratio of the variance of the estimate of βj when fitting the full model to its variance when the model is fit with that predictor alone. A VIF of 1 indicates a complete absence of collinearity; the rule of thumb is that a VIF above 5 or 10 signals a problematic amount of collinearity. The usual fixes are to drop one of the offending variables or to combine the collinear variables into a single predictor.
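
A minimal sketch of the VIF computation with the simulated ads data; since there are only two predictors, "the other predictors" for TV is just radio:

    # VIF by hand for TV: regress TV on the remaining predictors and use 1 / (1 - R^2_j)
    r2_tv <- summary(lm(TV ~ radio, data = ads))$r.squared
    1 / (1 - r2_tv)

    # The (separate) car package computes this for every predictor at once: car::vif(fit)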

That completes Chapter 3 of Introduction to Statistical Learning with R. We will continue with Chapter 4 next: Classification. Feel free to comment on anything we missed or on any explanation that is not clear. There will be some walkthroughs in R that complement each chapter soon.

Collaborators: Michael Mellinger

Github
