Linear Model Selection and Regularization — ISLR Series Chapter 6

Taraqur Rahman
The Biased Outliers
4 min read · Jul 8, 2021


In Chapter 3, we talked about Linear Regression: how we can assume the data fits a linear model and predict using that linear model. The linear model falls short when there are a lot of features. It is good practice to collect as much data as possible, but the drawback is that some of it will not be relevant to the target, which slows down training and can even be detrimental to the model's accuracy. So what should we do? This chapter talks about methods to beef up a linear model so that it puts more emphasis on the most important features. These methods are subset selection, shrinkage, and dimension reduction. (Dimension reduction is a big topic, so it will be discussed in a different blog.)

Subset Selection

There are two main subset selection approaches: best subset selection (BSS) and stepwise selection. Best subset selection fits a least squares regression model on every possible combination of predictors. If there are p predictors, BSS fits every 1-predictor model, every 2-predictor model, and so on up to the full p-predictor model, and then chooses the model with the best results.
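
To make this concrete, here is a rough sketch of best subset selection in Python (my own illustration, not code from the book): it uses a small synthetic dataset and ranks the models of each size by their training R².

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, purely for illustration: 100 samples, 4 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)

best = {}  # best subset found for each model size
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        model = LinearRegression().fit(X[:, cols], y)
        score = model.score(X[:, cols], y)  # training R^2
        if k not in best or score > best[k][1]:
            best[k] = (subset, score)

for k, (subset, score) in best.items():
    print(f"{k} predictor(s): columns {subset}, training R^2 = {score:.3f}")
```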

Stepwise selection determines the best model by adding or removing predictors one at a time. Two common variants are forward stepwise and backward stepwise selection. Forward stepwise starts with a model containing no predictors and, at each step, adds the predictor that improves the model the most. Backward stepwise starts with all the predictors and removes one at each step. The stepwise methods are far more computationally efficient than best subset selection, since BSS has to fit a model for every possible combination of predictors.
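
A matching sketch of forward stepwise selection, under the same assumptions (synthetic data, training R² used to rank the candidate predictor at each step):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, purely for illustration: 100 samples, 6 predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = 2 * X[:, 1] + X[:, 4] + rng.normal(size=100)

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    # Try adding each remaining predictor and keep the one that helps the most
    scores = {}
    for j in remaining:
        cols = selected + [j]
        scores[j] = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
    best_j = max(scores, key=scores.get)
    selected.append(best_j)
    remaining.remove(best_j)
    print(f"step {len(selected)}: added column {best_j}, training R^2 = {scores[best_j]:.3f}")

# In practice you would decide where to stop along this path
# using Cp, AIC, BIC, adjusted R^2, or cross-validation.
```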

To choose the best model, we cannot rely on R² or RSS, because these metrics are computed on the training data and always improve as more features are added, no matter how important those features are to the target. We want a criterion that takes the number of features into account and tells us whether the model is genuinely improving. Metrics commonly used to determine the best model are Cp, AIC, BIC, and adjusted R².
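
Adjusted R² is the easiest of these to write down: it rescales R² by the number of observations n and predictors d, so adding an unhelpful predictor can lower it. A tiny helper (my own sketch, not from the book):

```python
def adjusted_r2(r2: float, n: int, d: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - d - 1),
    where n is the number of observations and d the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)

# The same training R^2 of 0.80 looks worse as more predictors are used
print(adjusted_r2(0.80, n=100, d=2))   # ~0.796
print(adjusted_r2(0.80, n=100, d=30))  # ~0.713
```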

Shrinkage Methods

An alternative to subset selection is to use shrinkage methods. Shrinkage methods use all the available predictors but put a constraint on their coefficients, i.e., regularize them, shrinking the estimates toward zero. The two main shrinkage methods are ridge and lasso regression.

Ridge Regression

Ridge regression adds a penalty term to the residual sum of squares (RSS). In Chapter 3, we learned that the least squares method minimizes the RSS.

Equation for the residual sum of squares (src: ISLR)
The equation used in ridge regression. The only difference from the RSS equation is an additional term called the shrinkage penalty. (src: ISLR)
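
Written out in plain notation (my rendering of the standard ISLR formulas), least squares minimizes the RSS, while ridge regression minimizes the RSS plus the shrinkage penalty:

RSS = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2

Ridge objective: RSS + \lambda \sum_{j=1}^{p} \beta_j^2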

The additional term in ridge regression is called a shrinkage penalty, and it is the last term in the figure above. The β's are the coefficients, and the penalty sums their squares; this is known as the (squared) L2 norm. Lambda is the tuning parameter, the parameter we tweak to control how much the coefficients are shrunk. If lambda is zero, there is no shrinkage penalty and the objective reduces to the ordinary RSS. As lambda approaches infinity, the penalty dominates and the coefficient estimates are shrunk toward zero. To choose the best lambda, we can try a range of values and pick the best one using cross-validation.
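
A minimal sketch of that tuning loop with scikit-learn's RidgeCV (note that scikit-learn calls lambda "alpha"; the grid of values and the synthetic data here are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem; ridge is sensitive to scale, so standardize first
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Cross-validate over a grid of candidate lambdas (alphas)
alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("chosen lambda:", ridge.alpha_)
# Ridge shrinks coefficients but typically leaves none exactly at zero
print("coefficients exactly zero:", int(np.sum(ridge.coef_ == 0)))
```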

This graph shows the result of ridge regression: as lambda increases, ridge shrinks the coefficients toward 0, but never exactly to 0. (src: ISLR)

We would use ridge regression when least squares has trouble generalizing to unseen data, commonly referred to as high variance. Ridge regression penalizes large coefficients, making the model less flexible and therefore lower in variance. And, as usual, reducing the variance comes at the cost of increasing the bias (how inaccurate our predictions are on average).

Lasso Regression

Lasso regression, just like ridge, adds a shrinkage penalty to the RSS.

Equation used in lasso regression. Just like ridge regression, it adds a shrinkage penalty; the difference is that lasso penalizes the absolute values of the coefficients instead of their squares (as ridge does). (src: ISLR)
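
In the same notation as before (again my rendering), the lasso minimizes the RSS plus an L1 penalty:

Lasso objective: RSS + \lambda \sum_{j=1}^{p} |\beta_j|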

The difference is that instead of squaring the coefficients, lasso takes their absolute values. This shrinkage penalty is referred to as the L1 norm. Just like ridge, lambda is the tuning parameter and has the same effect: if lambda is 0, no shrinkage penalty is applied, and as lambda grows toward infinity, the coefficients are penalized more heavily. The key difference is that lasso can force some coefficients to be exactly zero, which means some predictors or variables will not be used in the model at all. Lasso can therefore be considered a variable selection method, selecting only a handful of variables to keep.
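
A sketch of that behaviour with scikit-learn's LassoCV on synthetic data where only a few of the predictors actually matter (all of this is illustrative, not code from the book):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Only 5 of the 50 features are truly informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

# LassoCV picks lambda (alpha) by cross-validation along its regularization path
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

print("chosen lambda:", lasso.alpha_)
print("non-zero coefficients:", int(np.sum(lasso.coef_ != 0)), "of", X.shape[1])
```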

This graph shows that as lambda (the penalty) gets bigger, more coefficients are forced to exactly zero. (src: ISLR)

Just like ridge, lasso can be used to decrease variance at the cost of increased bias. However, lasso comes in handy when there are a lot of predictors in the data: ridge keeps all of the variables in the model, while lasso looks at all of them and selects only a handful.

Collaborators: Michael Mellinger (GitHub)
