# Packages needed
library(tidyverse)
library(glmnet)
library(knitr)
library(kableExtra)
library(broom)
library(patchwork)
library(leaps) # install if not on your machine
library(ISLR2) # data used from book package
Model Selection & Regularization
1 Regularization
Now that we know how to use model validation to check for overfitting, we might want to fix overfitting. We do this with regularization, which takes a model that sits on the right-hand side of our bias-variance graph and moves it toward the center by making the model simpler.
One way we can do that is by penalizing the coefficients of our model if they’re too large in magnitude (far away from 0 in either direction).
1.1 LASSO
LASSO penalizes the sum of the absolute value of the coefficients by adding a penalty term to the loss function:
\[ \underbrace{\text{RSS}}_{\sum (y_i - \hat{y}_i)^2} + \lambda \sum | \beta_i |\]
LASSO has the benefit of tending to drag coefficients that don’t “pull their weight” to exactly 0, thus removing them from the model.
1.2 Ridge
Ridge penalizes the sum of the squared coefficients by adding a penalty term to the loss function:
\[ \underbrace{\text{RSS}}_{\sum (y_i - \hat{y}_i)^2} + \lambda \sum \beta_i^2\]
Unlike LASSO, Ridge tends to drag coefficients that don’t “pull their weight” to near 0, thus NOT removing them from the model.
1.3 Penalties in General
Both models create a tug-of-war where coefficients (\(\beta_i\)) need to “pull their weight” by reducing the Sum of Squared Residuals (\(\text{RSS}\)) in order to be “worth” having a large value in the penalty.
The hyperparameter \(\lambda\) controls how much coefficients are penalized.
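To see what \(\lambda\) does in practice, here is a minimal sketch using the glmnet package loaded above (the simulated data and the names X_sim, y_sim, and fit_sim are made up purely for illustration):
# Simulate a small data set where only 3 of the 5 predictors truly matter
set.seed(1)
X_sim <- matrix(rnorm(100 * 5), ncol = 5)
y_sim <- drop(X_sim %*% c(3, -2, 0, 0, 1) + rnorm(100))
# Fit a LASSO path and compare coefficients at a small vs. a large penalty
fit_sim <- glmnet(X_sim, y_sim, alpha = 1)
coef(fit_sim, s = 0.01) # small lambda: little shrinkage
coef(fit_sim, s = 1)    # large lambda: heavy shrinkage, weak coefficients forced to exactly 0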
1.3.1 Question
If \(\lambda = 0\), what happens to our model?
1.4 Alternative Regularization Explanation
You can also think of Ridge and LASSO as having a “budget” for how big the sum of your (squared or absolute valued) coefficients can be.
This graph shows the concept of LASSO and Ridge in a simple 2-parameter situation. The teal/blue circle (right) and square (left) represent the values for \(\beta_1\) and \(\beta_2\) that satisfy the “budget” for our coefficients. The dot in the middle of the red rings represents what the coefficients would be if we did not add a penalty at all. As you move from the inner to the outer red rings, the Residual Sum of Squares (RSS) goes up, meaning that our model is worse at accurately predicting data in our sample (remember that we’re giving up accuracy in the current sample to hopefully gain accuracy out-of-sample).
You can see that the coefficients chosen by LASSO/Ridge occur at the point of the teal/blue area that meets the red rings. This is because 1) when we penalize our model, it HAS to fit within the constraints we give it (the teal/blue area) but 2) we still want the RSS to be as small as possible (a model that’s bad on ALL data sets isn’t useful).
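Formally, this “budget” view corresponds to rewriting each penalized problem in constrained form, where the budget \(s\) plays the role of \(\lambda\) (a larger \(\lambda\) corresponds to a smaller budget \(s\)):
\[ \text{LASSO:} \quad \min_{\beta} \sum (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum |\beta_i| \le s \]
\[ \text{Ridge:} \quad \min_{\beta} \sum (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \sum \beta_i^2 \le s \]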
1.5 How to choose between LASSO and Ridge?
In general: if you want to do variable selection (i.e. completely remove some variables), choose LASSO. But in the real world, people often use something called Elastic Net, which adds both LASSO (L1) and Ridge (L2) penalties to the loss function (see here).
2 Ridge and LASSO in R
We apply both Ridge and LASSO to the Amazon book data set from before.
Note that when using glmnet, the alpha argument determines the penalty term:
- If alpha = 0: pure ridge
- If alpha = 1: pure lasso
- If alpha = 0.5: half ridge + half lasso
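For completeness, a minimal elastic-net sketch, reusing the simulated X_sim and y_sim from the illustration in Section 1.3 (cv_enet is a made-up name):
# Elastic net: an equal mix of the lasso (L1) and ridge (L2) penalties,
# with lambda chosen by cross-validation
cv_enet <- cv.glmnet(X_sim, y_sim, alpha = 0.5)
coef(cv_enet, s = "lambda.min")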
2.1 LASSO
# ---- Load & clean data ----
ama <- read_tsv("07-data/amazon-books.txt") |>
  drop_na()
Rows: 325 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): Title, Author, Hard/ Paper, Publisher, ISBN-10
dbl (8): List Price, Amazon Price, NumPages, Pub year, Height, Width, Thick,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(ama) <- c(
"Title", "Author", "List.Price", "Amazon.Price", "Hard.Paper",
"NumPages", "Publisher", "Pub.year", "ISBN.10",
"Height", "Width", "Thick", "Weight.oz"
)
# ---- Set up X and y ----
predictors <- c("List.Price", "NumPages", "Weight.oz", "Thick", "Height", "Width")
X <- ama[, predictors]
y <- ama$Amazon.Price
# ---- Train/Test Split (80/20) ----
set.seed(12345)
n <- nrow(ama)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
X_train <- X[train_idx, ]
X_test <- X[-train_idx, ]
y_train <- y[train_idx]
y_test <- y[-train_idx]
# Model matrices (glmnet will standardize internally)
X_train_mm <- model.matrix(~ . - 1, data = X_train)
X_test_mm <- model.matrix(~ . - 1, data = X_test)
# ---- Lasso with Cross-Validation (alpha = 1) ----
cv_fit <- cv.glmnet(
x = X_train_mm,
y = y_train,
alpha = 1, # Lasso
standardize = TRUE # default; explicit for clarity
)
# ---- Predict ----
y_pred_train <- as.numeric(predict(cv_fit, newx = X_train_mm, s = "lambda.min"))
y_pred_test <- as.numeric(predict(cv_fit, newx = X_test_mm, s = "lambda.min"))
# ---- Metrics ----
mse <- function(y, yhat) mean((y - yhat)^2)
mae <- function(y, yhat) mean(abs(y - yhat))
mape <- function(y, yhat) mean(abs((y - yhat) / y)) * 100
r2 <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
# ---- Summary ----
resultsL <- data.frame(
Dataset = c("Train", "Test"),
MSE = c(mse(y_train, y_pred_train),
mse(y_test, y_pred_test)),
MAE = c(mae(y_train, y_pred_train),
mae(y_test, y_pred_test)),
MAPE = c(mape(y_train, y_pred_train),
mape(y_test, y_pred_test)),
R2 = c(r2(y_train, y_pred_train),
r2(y_test, y_pred_test))
)
# ---- Pretty Table ----
resultsL |>
kable(digits = 4, align = "c",
caption = "LASSO Regression Performance Summary") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Dataset | MSE | MAE | MAPE | R2 |
|---|---|---|---|---|
| Train | 12.0562 | 2.3216 | 20.4430 | 0.9259 |
| Test | 5.3945 | 1.7597 | 18.1162 | 0.6238 |
2.2 Ridge
# ---- Train/Test Split (80/20) ----
set.seed(12345) # use same seed as LASSO if you want identical split
n <- nrow(ama)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
X_train <- X[train_idx, ]
X_test <- X[-train_idx, ]
y_train <- y[train_idx]
y_test <- y[-train_idx]
# Model matrices (glmnet will standardize internally)
X_train_mm <- model.matrix(~ . - 1, data = X_train)
X_test_mm <- model.matrix(~ . - 1, data = X_test)
# ---- Ridge with Cross-Validation (alpha = 0) ----
cv_ridge <- cv.glmnet(
x = X_train_mm,
y = y_train,
alpha = 0, # Ridge
standardize = TRUE # default; explicit for clarity
)
# ---- Predict ----
y_pred_train <- as.numeric(predict(cv_ridge, newx = X_train_mm, s = "lambda.min"))
y_pred_test <- as.numeric(predict(cv_ridge, newx = X_test_mm, s = "lambda.min"))
# ---- Metrics ----
mse <- function(y, yhat) mean((y - yhat)^2)
mae <- function(y, yhat) mean(abs(y - yhat))
mape <- function(y, yhat) mean(abs((y - yhat) / y)) * 100
r2 <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
resultsR <- data.frame(
Dataset = c("Train", "Test"),
MSE = c(mse(y_train, y_pred_train),
mse(y_test, y_pred_test)),
MAE = c(mae(y_train, y_pred_train),
mae(y_test, y_pred_test)),
MAPE = c(mape(y_train, y_pred_train),
mape(y_test, y_pred_test)),
R2 = c(r2(y_train, y_pred_train),
r2(y_test, y_pred_test))
)
# ---- Pretty Table ----
resultsR |>
kable(digits = 4, align = "c",
caption = "Ridge Regression Performance Summary") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Dataset | MSE | MAE | MAPE | R2 |
|---|---|---|---|---|
| Train | 13.5612 | 2.2791 | 19.8514 | 0.9166 |
| Test | 4.9377 | 1.7572 | 20.6326 | 0.6557 |
2.2.1 Question
Interpret the results from LASSO and Ridge Regression.
- Ridge Regression shows better test performance than LASSO, with a lower Test MSE (4.94 vs 5.39) and higher Test R² (0.656 vs 0.624).
- LASSO fits the training data slightly better, but this advantage does not translate into improved test accuracy.
- The similarity in MAE and MAPE across models indicates both methods capture overall price patterns comparably.
- The stronger generalization of Ridge suggests that, for this dataset, Ridge is the preferred regularization method, producing more stable estimates without over-shrinking important predictors.
2.3 Plotting Regularization Paths
# Fit models
ridge_fit <- glmnet(as.matrix(X), y, alpha = 0) # coerce to a numeric matrix for glmnet
lasso_fit <- glmnet(as.matrix(X), y, alpha = 1)
# Convert glmnet output to long data frame
extract_paths <- function(fit, model_name) {
coefs <- as.matrix(fit$beta) # p x K (rows = features, cols = lambdas)
lambdas <- fit$lambda # length K
# transpose so rows = lambdas, cols = features
df <- as.data.frame(t(coefs)) # now nrow = length(lambda), ncol = p
df$lambda <- lambdas
df$model <- model_name
df_long <- df %>%
tidyr::pivot_longer(
cols = -c(lambda, model),
names_to = "term",
values_to = "estimate"
) %>%
dplyr::mutate(log_lambda = log(lambda))
df_long
}
ridge_df <- extract_paths(ridge_fit, "Ridge")
lasso_df <- extract_paths(lasso_fit, "Lasso")
paths <- bind_rows(ridge_df, lasso_df)
# Ridge plot
g_ridge <- paths %>%
filter(model == "Ridge") %>%
ggplot(aes(x = log_lambda, y = estimate, color = term)) +
geom_line(linewidth = 1) +
labs(title = "Ridge Regularization Paths",
x = "log(lambda)", y = "Coefficient") +
theme_minimal(base_size = 14) +
theme(legend.position = "none")
# Lasso plot
g_lasso <- paths %>%
filter(model == "Lasso") %>%
ggplot(aes(x = log_lambda, y = estimate, color = term)) +
geom_line(linewidth = 1) +
labs(title = "Lasso Regularization Paths",
x = "log(lambda)", y = "Coefficient") +
theme_minimal(base_size = 14) +
theme(legend.position = "none")
# now let's put norms as x axes
paths_norm <- paths %>%
group_by(model, lambda) %>%
mutate(
l1_norm = sum(abs(estimate)), # For Lasso
l2_norm = sqrt(sum(estimate^2)) # For Ridge
) %>%
ungroup()
g_ridgenorms <- paths_norm %>%
filter(model == "Ridge") %>%
ggplot(aes(x = l2_norm, y = estimate, color = term)) +
geom_line(linewidth = 1) +
labs(title = "Ridge Regularization Paths",
x = "L2 Norm of Coefficients", y = "Coefficient") +
theme_minimal(base_size = 14) +
theme(legend.position = "none")
g_lassonorms <- paths_norm %>%
filter(model == "Lasso") %>%
ggplot(aes(x = l1_norm, y = estimate, color = term)) +
geom_line(linewidth = 1) +
labs(title = "Lasso Regularization Paths",
x = "L1 Norm of Coefficients", y = "Coefficient") +
theme_minimal(base_size = 14) +
theme(legend.position = "bottom")
# Combine all
(g_ridge + g_lasso) / (g_ridgenorms + g_lassonorms)
Note that we can plot these regularization paths directly from a glmnet() fit; we do this in the next part. However, the x-axis will then look slightly different, as glmnet() uses norms on the x-axis by default, making the plot independent of the specific \(\lambda\) scale.
Also note that in the above plot, each model is using its own \(\lambda\) sequence generated by glmnet(), so:
- Ridge might use \(\lambda\) from \(0.1 \rightarrow 10^4\)
- Lasso might use \(\lambda\) from \(10^{-4} \rightarrow 10^2\)
This is because ridge and lasso get different grids, leading to different x-axis ranges. You can of course set the same grid for both, which we do in the next part.
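As a quick sketch of forcing a common grid here (lambda_grid, ridge_same, and lasso_same are made-up names; as.matrix() makes sure glmnet() gets a numeric matrix):
# Fit ridge and lasso over the same user-supplied lambda grid so their
# regularization paths cover an identical lambda range
lambda_grid <- 10^seq(3, -2, length.out = 100)
ridge_same <- glmnet(as.matrix(X), y, alpha = 0, lambda = lambda_grid)
lasso_same <- glmnet(as.matrix(X), y, alpha = 1, lambda = lambda_grid)
range(ridge_same$lambda) # identical ranges for both models
range(lasso_same$lambda)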
3 Classwork: Linear Models and Regularization Methods (from ISLR2)
3.1 Subset Selection Methods
3.1.1 Best Subset Selection
Here we apply the best subset selection approach to the Hitters data from the ISLR2 package. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.
First of all, we note that the Salary variable is missing for some of the players. The is.na() function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a TRUE for any elements that are missing, and a FALSE for non-missing elements. The sum() function can then be used to count all of the missing elements.
names(Hitters) [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
dim(Hitters)[1] 322 20
sum(is.na(Hitters$Salary))[1] 59
Hence we see that Salary is missing for \(59\) players. The na.omit() function removes all of the rows that have missing values in any variable.
Hitters <- na.omit(Hitters)
dim(Hitters)[1] 263 20
sum(is.na(Hitters))[1] 0
The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for lm(). The summary() command outputs the best set of variables for each model size.
regfit.full <- regsubsets(Salary ~ ., Hitters)
summary(regfit.full)
Subset selection object
Call: regsubsets.formula(Salary ~ ., Hitters)
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " "*" "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) " " " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI. By default, regsubsets() only reports results up to the best eight-variable model. But the nvmax option can be used in order to return as many variables as are desired. Here we fit up to a 19-variable model.
regfit.full <- regsubsets(Salary ~ ., data = Hitters,
nvmax = 19)
reg.summary <- summary(regfit.full)
The summary() function also returns \(R^2\), RSS, adjusted \(R^2\), \(C_p\), and BIC. We can examine these to try to select the best overall model.
names(reg.summary)[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
For instance, we see that the \(R^2\) statistic increases from \(32 \%\), when only one variable is included in the model, to almost \(55 \%\), when all variables are included. As expected, the \(R^2\) statistic increases monotonically as more variables are included.
reg.summary$rsq [1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
[8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
[15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
Plotting RSS, adjusted \(R^2\), \(C_p\), and BIC for all of the models at once will help us decide which model to select.
df = data.frame(adjR2 = reg.summary$adjr2,
rss = reg.summary$rss,
Cp = reg.summary$cp,
BIC = reg.summary$bic,
Variables = seq(1:19))
df_long = df %>%
pivot_longer(cols = c(adjR2, rss, Cp, BIC),
names_to = "Metric",
values_to = "Value")
df_long %>%
ggplot(aes(x = Variables, y = Value)) +
geom_line() +
facet_wrap(~Metric, scales = "free_y") +
labs(x = "Number of Variables", y = "Value") +
theme_minimal()
The which.max() function can be used to identify the location of the maximum point of a vector. We will now plot a red dot to indicate the model with the largest adjusted \(R^2\) statistic.
max_index <- which.max(df$adjR2)
df %>%
ggplot(aes(x = Variables, y = adjR2)) +
geom_line() +
geom_point(aes(x = max_index, y = adjR2[max_index]),
color = "red", size = 4, shape = 20) +
labs(x = "Number of Variables", y = "Adjusted R²") +
theme_minimal()
In a similar fashion we can plot the \(C_p\) and BIC statistics, and indicate the models with the smallest statistic using which.min().
min_cp <- which.min(df$Cp)
min_bic <- which.min(df$BIC)
df %>%
ggplot(aes(x = Variables, y = Cp)) +
geom_line() +
geom_point(aes(x = min_cp, y = Cp[min_cp]),
color = "red", size = 4, shape = 20) +
labs(x = "Number of Variables", y = "Cp") +
theme_minimal()
df %>%
ggplot(aes(x = Variables, y = BIC)) +
geom_line() +
geom_point(aes(x = min_bic, y = BIC[min_bic]),
color = "red", size = 4, shape = 20) +
labs(x = "Number of Variables", y = "BIC") +
theme_minimal()
The regsubsets() function has a built-in plot() command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to the BIC, \(C_p\), adjusted \(R^2\), or AIC. To find out more about this function, type ?plot.regsubsets.
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "bic")
The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic. For instance, we see that several models share a BIC close to \(-150\). However, the model with the lowest BIC is the six-variable model that contains only AtBat, Hits, Walks, CRBI, DivisionW, and PutOuts. We can use the coef() function to see the coefficient estimates associated with this model.
coef(regfit.full, 6) (Intercept) AtBat Hits Walks CRBI DivisionW
91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
PutOuts
0.2643076
3.1.2 Forward and Backward Stepwise Selection
We can also use the regsubsets() function to perform forward stepwise or backward stepwise selection, using the argument method = "forward" or method = "backward".
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters,
nvmax = 19, method = "forward")
summary(regfit.fwd)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: forward
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " "*" "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
9 ( 1 ) "*" " " "*" "*" " " " " " "
10 ( 1 ) "*" " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
regfit.bwd <- regsubsets(Salary ~ ., data = Hitters,
nvmax = 19, method = "backward")
summary(regfit.bwd)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: backward
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*" " "
5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " " " "*" " " " " " "
5 ( 1 ) " " " " " " "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
9 ( 1 ) "*" " " "*" "*" " " " " " "
10 ( 1 ) "*" " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
For instance, we see that using forward stepwise selection, the best one-variable model contains only CRBI, and the best two-variable model additionally includes Hits. For this data, the best one-variable through six-variable models are each identical for best subset and forward selection. However, the best seven-variable models identified by forward stepwise selection, backward stepwise selection, and best subset selection are different.
coef(regfit.full, 7) (Intercept) Hits Walks CAtBat CHits CHmRun
79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538
DivisionW PutOuts
-129.9866432 0.2366813
coef(regfit.fwd, 7) (Intercept) AtBat Hits Walks CRBI CWalks
109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622 -0.3053070
DivisionW PutOuts
-127.1223928 0.2533404
coef(regfit.bwd, 7) (Intercept) AtBat Hits Walks CRuns CWalks
105.6487488 -1.9762838 6.7574914 6.0558691 1.1293095 -0.7163346
DivisionW PutOuts
-116.1692169 0.3028847
3.1.3 Choosing Among Models Using the Validation-Set Approach and Cross-Validation
We just saw that it is possible to choose among a set of models of different sizes using \(C_p\), BIC, and adjusted \(R^2\). We will now consider how to do this using the validation set and cross-validation approaches.
In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of model-fitting—including variable selection. Therefore, the determination of which model of a given size is best must be made using only the training observations. This point is subtle but important. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.
set.seed(1)
train <- sample(c(TRUE, FALSE),
nrow(Hitters),
replace = TRUE)
test <- (!train)
Now, we apply regsubsets() to the training set in order to perform best subset selection.
regfit.best <- regsubsets(Salary ~ .,
data = Hitters[train, ],
nvmax = 19)
Notice that we subset the Hitters data frame directly in the call in order to access only the training subset of the data, using the expression Hitters[train, ]. We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data.
test.mat <- model.matrix(Salary ~ ., data = Hitters[test, ])
The model.matrix() function is used in many regression packages for building an “X” matrix from data. Now we run a loop, and for each size i, we extract the coefficients from regfit.best for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.
val.errors <- rep(NA, 19)
for (i in 1:19) {
coefi <- coef(regfit.best, id = i)
pred <- test.mat[, names(coefi)] %*% coefi
val.errors[i] <- mean((Hitters$Salary[test] - pred)^2)
}
We find that the best model is the one that contains seven variables.
val.errors [1] 164377.3 144405.5 152175.7 145198.4 137902.1 139175.7 126849.0 136191.4
[9] 132889.6 135434.9 136963.3 140694.9 140690.9 141951.2 141508.2 142164.4
[17] 141767.4 142339.6 142238.2
which.min(val.errors)[1] 7
coef(regfit.best, 7) (Intercept) AtBat Hits Walks CRuns CWalks
67.1085369 -2.1462987 7.0149547 8.0716640 1.2425113 -0.8337844
DivisionW PutOuts
-118.4364998 0.2526925
This was a little tedious, partly because there is no predict() method for regsubsets(). Since we will be using this function again, we can capture our steps above and write our own predict method.
predict.regsubsets <- function(object, newdata, id, ...) {
form <- as.formula(object$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(object, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
Our function pretty much mimics what we did above. The only complex part is how we extracted the formula used in the call to regsubsets(). We demonstrate how we use this function below, when we do cross-validation.
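For instance, a quick sanity check with the training-set fit from above (pred7 is a made-up name); this should reproduce the seven-variable validation MSE we computed by hand:
# Predictions from the best 7-variable model on the held-out rows,
# via the new predict() method, followed by the corresponding test MSE
pred7 <- predict(regfit.best, newdata = Hitters[test, ], id = 7)
mean((Hitters$Salary[test] - pred7)^2)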
Finally, we perform best subset selection on the full data set, and select the best seven-variable model. It is important that we make use of the full data set in order to obtain more accurate coefficient estimates.
Note that we perform best subset selection on the full data set and select the best seven-variable model, rather than simply using the variables that were obtained from the training set, because the best seven-variable model on the full data set may differ from the corresponding model on the training set.
regfit.best <- regsubsets(Salary ~ ., data = Hitters,
nvmax = 19)
coef(regfit.best, 7) (Intercept) Hits Walks CAtBat CHits CHmRun
79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538
DivisionW PutOuts
-129.9866432 0.2366813
In fact, we see that the best seven-variable model on the full data set has a different set of variables than the best seven-variable model on the training set.
We now try to choose among the models of different sizes using cross-validation. This approach is somewhat involved, as we must perform best subset selection within each of the \(k\) training sets. Despite this, we see that with its clever subsetting syntax, R makes this job quite easy. First, we create a vector that allocates each observation to one of \(k=10\) folds, and we create a matrix in which we will store the results.
k <- 10
n <- nrow(Hitters)
set.seed(1)
folds <- sample(rep(1:k, length = n))
cv.errors <- matrix(NA, k, 19,
dimnames = list(NULL, paste(1:19)))
Now we write a for loop that performs cross-validation. In the \(j\)th fold, the elements of folds that equal j are in the test set, and the remainder are in the training set. We make our predictions for each model size (using our new predict() method), compute the test errors on the appropriate subset, and store them in the appropriate slot in the matrix cv.errors. Note that in the following code R will automatically use our predict.regsubsets() function when we call predict() because the best.fit object has class regsubsets.
for (j in 1:k) {
best.fit <- regsubsets(Salary ~ .,
data = Hitters[folds != j, ],
nvmax = 19)
for (i in 1:19) {
pred <- predict(best.fit, Hitters[folds == j, ], id = i)
cv.errors[j, i] <-
mean((Hitters$Salary[folds == j] - pred)^2)
}
}
This has given us a \(10 \times 19\) matrix, of which the \((j,i)\)th element corresponds to the test MSE for the \(j\)th cross-validation fold for the best \(i\)-variable model. We use the apply() function to average over the columns of this matrix in order to obtain a vector for which the \(i\)th element is the cross-validation error for the \(i\)-variable model.
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors 1 2 3 4 5 6 7 8
143439.8 126817.0 134214.2 131782.9 130765.6 120382.9 121443.1 114363.7
9 10 11 12 13 14 15 16
115163.1 109366.0 112738.5 113616.5 115557.6 115853.3 115630.6 116050.0
17 18 19
116117.0 116419.3 116299.1
par(mfrow = c(1, 1))
plot(mean.cv.errors, type = "b")
We see that cross-validation selects a 10-variable model. We now perform best subset selection on the full data set in order to obtain the 10-variable model.
reg.best <- regsubsets(Salary ~ ., data = Hitters,
nvmax = 19)
coef(reg.best, 10) (Intercept) AtBat Hits Walks CAtBat CRuns
162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
CRBI CWalks DivisionW PutOuts Assists
0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
3.2 Ridge Regression and the Lasso
We will use the glmnet package in order to perform ridge regression and the lasso. The main function in this package is glmnet(), which can be used to fit ridge regression models, lasso models, and more. This function has slightly different syntax from the other model-fitting functions that we have encountered so far. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y ~ x syntax. We will now perform ridge regression and the lasso in order to predict Salary on the Hitters data. Before proceeding, ensure that the missing values have been removed from the data, as described in the best subset selection section above.
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary
The model.matrix() function is particularly useful for creating x; not only does it produce a matrix corresponding to the \(19\) predictors but it also automatically transforms any qualitative variables into dummy variables. The latter property is important because glmnet() can only take numerical, quantitative inputs.
3.2.1 Ridge Regression
The glmnet() function has an alpha argument that determines what type of model is fit. If alpha=0 then a ridge regression model is fit, and if alpha=1 then a lasso model is fit. We first fit a ridge regression model.
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
By default the glmnet() function performs ridge regression for an automatically selected range of \(\lambda\) values. However, here we have chosen to implement the function over a grid of values ranging from \(\lambda=10^{10}\) to \(\lambda=10^{-2}\), essentially covering the full range of scenarios from the null model containing only the intercept, to the least squares fit. As we will see, we can also compute model fits for a particular value of \(\lambda\) that is not one of the original grid values. Note that by default, the glmnet() function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument standardize = FALSE.
Associated with each value of \(\lambda\) is a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef(). In this case, it is a \(20 \times 100\) matrix, with \(20\) rows (one for each predictor, plus an intercept) and \(100\) columns (one for each value of \(\lambda\)).
dim(coef(ridge.mod))[1] 20 100
We expect the coefficient estimates to be much smaller, in terms of \(\ell_2\) norm, when a large value of \(\lambda\) is used, as compared to when a small value of \(\lambda\) is used. These are the coefficients when \(\lambda=11{,}498\), along with their \(\ell_2\) norm:
ridge.mod$lambda[50][1] 11497.57
coef(ridge.mod)[, 50] (Intercept) AtBat Hits HmRun Runs
407.356050200 0.036957182 0.138180344 0.524629976 0.230701523
RBI Walks Years CAtBat CHits
0.239841459 0.289618741 1.107702929 0.003131815 0.011653637
CHmRun CRuns CRBI CWalks LeagueN
0.087545670 0.023379882 0.024138320 0.025015421 0.085028114
DivisionW PutOuts Assists Errors NewLeagueN
-6.215440973 0.016482577 0.002612988 -0.020502690 0.301433531
sqrt(sum(coef(ridge.mod)[-1, 50]^2))[1] 6.360612
In contrast, here are the coefficients when \(\lambda=705\), along with their \(\ell_2\) norm. Note the much larger \(\ell_2\) norm of the coefficients associated with this smaller value of \(\lambda\).
ridge.mod$lambda[60][1] 705.4802
coef(ridge.mod)[, 60] (Intercept) AtBat Hits HmRun Runs RBI
54.32519950 0.11211115 0.65622409 1.17980910 0.93769713 0.84718546
Walks Years CAtBat CHits CHmRun CRuns
1.31987948 2.59640425 0.01083413 0.04674557 0.33777318 0.09355528
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.09780402 0.07189612 13.68370191 -54.65877750 0.11852289 0.01606037
Errors NewLeagueN
-0.70358655 8.61181213
sqrt(sum(coef(ridge.mod)[-1, 60]^2))[1] 57.11001
We can use the predict() function for a number of purposes. For instance, we can obtain the ridge regression coefficients for a new value of \(\lambda\), say \(50\):
predict(ridge.mod, s = 50, type = "coefficients")[1:20, ] (Intercept) AtBat Hits HmRun Runs
4.876610e+01 -3.580999e-01 1.969359e+00 -1.278248e+00 1.145892e+00
RBI Walks Years CAtBat CHits
8.038292e-01 2.716186e+00 -6.218319e+00 5.447837e-03 1.064895e-01
CHmRun CRuns CRBI CWalks LeagueN
6.244860e-01 2.214985e-01 2.186914e-01 -1.500245e-01 4.592589e+01
DivisionW PutOuts Assists Errors NewLeagueN
-1.182011e+02 2.502322e-01 1.215665e-01 -3.278600e+00 -9.496680e+00
We now split the samples into a training set and a test set in order to estimate the test error of ridge regression and the lasso. There are two common ways to randomly split a data set. The first is to produce a random vector of TRUE, FALSE elements and select the observations corresponding to TRUE for the training data. The second is to randomly choose a subset of numbers between \(1\) and \(n\); these can then be used as the indices for the training observations. The two approaches work equally well. We used the former method in the subset selection section above. Here we demonstrate the latter approach.
We first set a random seed so that the results obtained will be reproducible.
set.seed(1)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- (-train)
y.test <- y[test]
Next we fit a ridge regression model on the training set, and evaluate its MSE on the test set, using \(\lambda=4\). Note the use of the predict() function again. This time we get predictions for a test set, by replacing type="coefficients" with the newx argument.
ridge.mod <- glmnet(x[train, ], y[train], alpha = 0,
lambda = grid, thresh = 1e-12)
ridge.pred <- predict(ridge.mod, s = 4, newx = x[test, ])
mean((ridge.pred - y.test)^2)[1] 142199.2
The test MSE is \(142{,}199\). Note that if we had instead simply fit a model with just an intercept, we would have predicted each test observation using the mean of the training observations. In that case, we could compute the test set MSE like this:
mean((mean(y[train]) - y.test)^2)[1] 224669.9
We could also get the same result by fitting a ridge regression model with a very large value of \(\lambda\). Note that 1e10 means \(10^{10}\).
ridge.pred <- predict(ridge.mod, s = 1e10, newx = x[test, ])
mean((ridge.pred - y.test)^2)[1] 224669.8
So fitting a ridge regression model with \(\lambda=4\) leads to a much lower test MSE than fitting a model with just an intercept. We now check whether there is any benefit to performing ridge regression with \(\lambda=4\) instead of just performing least squares regression. Recall that least squares is simply ridge regression with \(\lambda=0\). (In order for glmnet() to yield the exact least squares coefficients when \(\lambda=0\), we use the argument exact = T when calling the predict() function. Otherwise, the predict() function will interpolate over the grid of \(\lambda\) values used in fitting the glmnet() model, yielding approximate results. When we use exact = T, there remains a slight discrepancy in the third decimal place between the output of glmnet() when \(\lambda = 0\) and the output of lm(); this is due to numerical approximation on the part of glmnet().)
ridge.pred <- predict(ridge.mod, s = 0, newx = x[test, ],
exact = T, x = x[train, ], y = y[train])
mean((ridge.pred - y.test)^2)[1] 168588.6
lm(y ~ x, subset = train)
Call:
lm(formula = y ~ x, subset = train)
Coefficients:
(Intercept) xAtBat xHits xHmRun xRuns xRBI
274.0145 -0.3521 -1.6377 5.8145 1.5424 1.1243
xWalks xYears xCAtBat xCHits xCHmRun xCRuns
3.7287 -16.3773 -0.6412 3.1632 3.4008 -0.9739
xCRBI xCWalks xLeagueN xDivisionW xPutOuts xAssists
-0.6005 0.3379 119.1486 -144.0831 0.1976 0.6804
xErrors xNewLeagueN
-4.7128 -71.0951
predict(ridge.mod, s = 0, exact = T, type = "coefficients",
x = x[train, ], y = y[train])[1:20, ] (Intercept) AtBat Hits HmRun Runs RBI
274.0200994 -0.3521900 -1.6371383 5.8146692 1.5423361 1.1241837
Walks Years CAtBat CHits CHmRun CRuns
3.7288406 -16.3795195 -0.6411235 3.1629444 3.4005281 -0.9739405
CRBI CWalks LeagueN DivisionW PutOuts Assists
-0.6003976 0.3378422 119.1434637 -144.0853061 0.1976300 0.6804200
Errors NewLeagueN
-4.7127879 -71.0898914
In general, if we want to fit an (unpenalized) least squares model, then we should use the lm() function, since that function provides more useful outputs, such as standard errors and p-values for the coefficients.
In general, instead of arbitrarily choosing \(\lambda=4\), it would be better to use cross-validation to choose the tuning parameter \(\lambda\). We can do this using the built-in cross-validation function, cv.glmnet(). By default, the function performs ten-fold cross-validation, though this can be changed using the argument nfolds. Note that we set a random seed first so our results will be reproducible, since the choice of the cross-validation folds is random.
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam[1] 326.0828
Therefore, we see that the value of \(\lambda\) that results in the smallest cross-validation error is \(326\). What is the test MSE associated with this value of \(\lambda\)?
ridge.pred <- predict(ridge.mod, s = bestlam,
newx = x[test, ])
mean((ridge.pred - y.test)^2)[1] 139856.6
This represents a further improvement over the test MSE that we got using \(\lambda=4\). Finally, we refit our ridge regression model on the full data set, using the value of \(\lambda\) chosen by cross-validation, and examine the coefficient estimates.
out <- glmnet(x, y, alpha = 0)
predict(out, type = "coefficients", s = bestlam)[1:20, ] (Intercept) AtBat Hits HmRun Runs RBI
15.44383120 0.07715547 0.85911582 0.60103106 1.06369007 0.87936105
Walks Years CAtBat CHits CHmRun CRuns
1.62444617 1.35254778 0.01134999 0.05746654 0.40680157 0.11456224
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.12116504 0.05299202 22.09143197 -79.04032656 0.16619903 0.02941950
Errors NewLeagueN
-1.36092945 9.12487765
As expected, none of the coefficients are exactly zero: ridge regression does not perform variable selection!
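As a quick check (ridge.coef is a made-up name), counting exact zeros among the non-intercept ridge estimates should return 0:
# Ridge shrinks coefficients toward zero but does not set any exactly to zero
ridge.coef <- predict(out, type = "coefficients", s = bestlam)[1:20, ]
sum(ridge.coef[-1] == 0)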
3.2.2 The Lasso
We saw that ridge regression with a wise choice of \(\lambda\) can outperform least squares as well as the null model on the Hitters data set. We now ask whether the lasso can yield either a more accurate or a more interpretable model than ridge regression. In order to fit a lasso model, we once again use the glmnet() function; however, this time we use the argument alpha=1. Other than that change, we proceed just as we did in fitting a ridge model.
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1,
lambda = grid)
plot(lasso.mod)
We can see from the coefficient plot that depending on the choice of tuning parameter, some of the coefficients will be exactly equal to zero. We now perform cross-validation and compute the associated test error.
set.seed(1)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 1)
plot(cv.out)
bestlam <- cv.out$lambda.min
lasso.pred <- predict(lasso.mod, s = bestlam,
newx = x[test, ])
mean((lasso.pred - y.test)^2)[1] 143673.6
This is substantially lower than the test set MSE of the null model and of least squares, and very similar to the test MSE of ridge regression with \(\lambda\) chosen by cross-validation.
However, the lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse. Here we see that 8 of the 19 coefficient estimates are exactly zero. So the lasso model with \(\lambda\) chosen by cross-validation contains only eleven variables.
out <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef <- predict(out, type = "coefficients",
s = bestlam)[1:20, ]
lasso.coef (Intercept) AtBat Hits HmRun Runs
1.27479059 -0.05497143 2.18034583 0.00000000 0.00000000
RBI Walks Years CAtBat CHits
0.00000000 2.29192406 -0.33806109 0.00000000 0.00000000
CHmRun CRuns CRBI CWalks LeagueN
0.02825013 0.21628385 0.41712537 0.00000000 20.28615023
DivisionW PutOuts Assists Errors NewLeagueN
-116.16755870 0.23752385 0.00000000 -0.85629148 0.00000000
lasso.coef[lasso.coef != 0] (Intercept) AtBat Hits Walks Years
1.27479059 -0.05497143 2.18034583 2.29192406 -0.33806109
CHmRun CRuns CRBI LeagueN DivisionW
0.02825013 0.21628385 0.41712537 20.28615023 -116.16755870
PutOuts Errors
0.23752385 -0.85629148