Linear Regression II
1 Evaluate Model Performance
Our goal is to fit a few models of different complexity and evaluate their performance based on a test-train split of the data.
1.1 Question
Can you understand why we do not use the full data to test the different models?
We will start by creating a function that allows us to assess model performance:

# root mean squared error (RMSE)
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
We will also use a function that gives the complexity of a linear model, defined as the number of independent variables (predictor terms) in the model.
get_complexity = function(model) {
  length(coef(model)) - 1
}

1.2 Question
Can you understand why we subtract 1 in the function above?
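As an illustration (using the built-in mtcars data, which is not part of this lab), coef() returns the intercept as well as the slopes, so subtracting 1 leaves only the number of predictors:

example_fit = lm(mpg ~ wt + hp, data = mtcars)  # illustrative model with two predictors
coef(example_fit)             # (Intercept), wt, hp -> three coefficients
get_complexity(example_fit)   # 2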
1.3 Data
We will use the Advertising data from ISLR2, which is available in a compressed folder for download. Make sure you have set the correct working directory for reading the data. You can also load the data from the ISLR2 package. The Advertising data set contains the sales of a product in 200 different markets, along with the advertising budget for the product in each of those markets for three different media: TV, radio, and newspaper.
library(readr) # install if not in your library
Advertising = read_csv("03-data/Advertising.csv")
knitr::kable(head(Advertising, 10)) # look at the first 10 rows of the data

| TV | Radio | Newspaper | Sales |
|---|---|---|---|
| 230.1 | 37.8 | 69.2 | 22.1 |
| 44.5 | 39.3 | 45.1 | 10.4 |
| 17.2 | 45.9 | 69.3 | 9.3 |
| 151.5 | 41.3 | 58.5 | 18.5 |
| 180.8 | 10.8 | 58.4 | 12.9 |
| 8.7 | 48.9 | 75.0 | 7.2 |
| 57.5 | 32.8 | 23.5 | 11.8 |
| 120.2 | 19.6 | 11.6 | 13.2 |
| 8.6 | 2.1 | 1.0 | 4.8 |
| 199.8 | 2.6 | 21.2 | 10.6 |
Explore the data by making some plots; an example is shown below.
plot(Sales ~ TV, data = Advertising, col = "dodgerblue", pch = 20, cex = 1.5,
     main = "Sales vs Television Advertising")

We can also explore the data using a scatterplot matrix:

pairs(Advertising)

1.4 Question
What are the main patterns you detect?
1.5 Test-Train Split
We will now split the data in two. One part of the data set will be used to fit (train) a model; we will call this the training data. The remainder of the original data will be used to assess how well the model predicts; we will call this the test data.
We use the sample() function to obtain a random sample of the row numbers of the original data. We then use those row numbers (and the remaining row numbers) to split the data accordingly. Notice that we use the set.seed() function to reproduce the same random split each time we perform this analysis.
set.seed(1)
num_obs = nrow(Advertising)
train_index = sample(num_obs, size = trunc(0.50 * num_obs))
train_data = Advertising[train_index, ]
test_data = Advertising[-train_index, ]

1.6 The Model
We start by fitting the simplest linear model, a model with no predictors:
fit_0 = lm(Sales ~ 1, data = train_data)
get_complexity(fit_0)

[1] 0
As seen, the complexity of this model is 0. We compute the train and test RMSE for this model, both via the function defined above and via the formula directly:
# train RMSE
rmse(actual = train_data$Sales, predicted = predict(fit_0, train_data))

[1] 5.297333

sqrt(mean((train_data$Sales - predict(fit_0, train_data)) ^ 2))

[1] 5.297333

# test RMSE
rmse(actual = test_data$Sales, predicted = predict(fit_0, test_data))

[1] 5.112073

sqrt(mean((test_data$Sales - predict(fit_0, test_data)) ^ 2))

[1] 5.112073
Let’s make the RMSE computations even easier by using the function below:
# compute RMSE for a fitted model on a given data set and response column
get_rmse = function(model, data, response) {
  rmse(actual = subset(data, select = response, drop = TRUE),
       predicted = predict(model, data))
}

Can you figure out how to use the above function?
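For example, applied to the intercept-only model fit_0 from above, the wrapper should reproduce the train and test RMSE values computed directly earlier:

get_rmse(model = fit_0, data = train_data, response = "Sales") # train RMSE
get_rmse(model = fit_0, data = test_data, response = "Sales")  # test RMSE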
We will fit five models and compute the train and test RMSE for each to evaluate it, and we will visualize how these errors relate to the complexity (i.e., model size in terms of the number of parameters) of each model. Furthermore, we will interpret the results in terms of underfitting or overfitting. We will conclude by determining which model to choose. The five models to fit are listed below:
- Sales ~ Radio + Newspaper + TV
- Sales ~ Radio * Newspaper * TV
- Sales ~ Radio * Newspaper * TV + I(TV ^ 2)
- Sales ~ Radio * Newspaper * TV + I(TV ^ 2) + I(Radio ^ 2) + I(Newspaper ^ 2)
- Sales ~ Radio * Newspaper * TV + I(TV ^ 2) * I(Radio ^ 2) * I(Newspaper ^ 2)
Note that the specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
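As a quick illustrative check (fit_a and fit_b are just illustrative names, using the training data from above), the two specifications produce identical coefficient estimates:

fit_a = lm(Sales ~ Radio * TV, data = train_data)             # crossed specification
fit_b = lm(Sales ~ Radio + TV + Radio:TV, data = train_data)  # expanded specification
all.equal(coef(fit_a), coef(fit_b))                           # TRUE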
We fit all five models and collect them in a list of model fits. We then obtain the train RMSE, test RMSE, and model complexity for each. Finally, we plot the results. The train RMSE is shown in blue, while the test RMSE is shown in red.
fit_1 = lm(Sales ~ Radio + Newspaper + TV, data = train_data)
fit_2 = lm(Sales ~ Radio * Newspaper * TV, data = train_data)
fit_3 = lm(Sales ~ Radio * Newspaper * TV + I(TV ^ 2), data = train_data)
fit_4 = lm(Sales ~ Radio * Newspaper * TV +
I(TV ^ 2) + I(Radio ^ 2) + I(Newspaper ^ 2), data = train_data)
fit_5 = lm(Sales ~ Radio * Newspaper * TV +
I(TV ^ 2) * I(Radio ^ 2) * I(Newspaper ^ 2), data = train_data)
model_list = list(fit_1, fit_2, fit_3, fit_4, fit_5)
#model_list #uncomment if you wish to see this list object
train_rmse = sapply(model_list, get_rmse, data = train_data, response = "Sales")
test_rmse = sapply(model_list, get_rmse, data = test_data, response = "Sales")
model_complexity = sapply(model_list, get_complexity)
plot(model_complexity, train_rmse, type = "b",
ylim = c(min(c(train_rmse, test_rmse)) - 0.02,
max(c(train_rmse, test_rmse)) + 0.02),
col = "blue",
lwd = 2,
xlab = "Complexity (Model Size)",
ylab = "RMSE")
lines(model_complexity, test_rmse, type = "b", col = "red", lwd = 2)
grid()

1.7 Question
What do you conclude?
- Underfitting models: in general, high train RMSE and high test RMSE. Seen in fit_1 and fit_2.
- Overfitting models: in general, low train RMSE and high test RMSE. Seen in fit_4 and fit_5.
We see that the Test RMSE is smallest for fit_3, and it is the model we believe will perform the best on future data not used to train the model.
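The same conclusion can be checked programmatically from the objects computed above; which.min() returns the position of the smallest test RMSE among the five fits:

which.min(test_rmse)  # should point to the third model, fit_3
test_rmse             # inspect all five test RMSE values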
1.8 Exercise
Write functions for some of the other loss functions we covered in the lecture (e.g., MAE, MAPE). Apply them to assess the same five models above. Do you see similar or different results in terms of overfitting/underfitting?
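As a starting point, here is a minimal sketch of MAE and MAPE helpers in the same style as rmse(); the exact definitions used in the lecture may differ slightly (e.g., whether MAPE is reported as a percentage).

# mean absolute error
mae = function(actual, predicted) {
  mean(abs(actual - predicted))
}

# mean absolute percentage error, reported as a percentage
mape = function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

These can then be wrapped like get_rmse() and applied with sapply() over model_list, exactly as above.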
2 Extrapolation
- What is extrapolation and why is it dangerous?
2.1 Real Life Extrapolation
2.2 The Bias Variance Tradeoff
\[ \underbrace{E\big[(y_0 - \hat{f}(x_0))^2\big]}_\text{expected MSE at $x_0$} = \overbrace{\operatorname{Var}(\hat{f}(x_0))}^\text{variance of the model when fit on different samples} + \underbrace{[\operatorname{Bias}(\hat{f}(x_0))]^2}_\text{ability of the model to "get it right" on average} + \overbrace{\operatorname{Var}(\epsilon)}^\text{irreducible error}\]
What are we looking for when trying to see if there is overfitting?
What is Overfitting and how is it different from extrapolation?
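To make the decomposition concrete, here is a minimal simulation sketch (not from the lecture; the true mean function, noise level, and evaluation point x0 are assumptions chosen for illustration) that approximates the variance and squared bias of a deliberately simple degree-1 fit at a single point:

set.seed(42)
true_mean = function(x) 1 - 2 * x - 3 * x ^ 2 + 5 * x ^ 3  # assumed true mean function
x0 = 0.9                                                   # evaluation point
preds = replicate(500, {
  x = runif(100, min = -1, max = 1)
  y = true_mean(x) + rnorm(100)          # noise with variance 1 (irreducible error)
  fit = lm(y ~ x)                        # intentionally simple: degree-1 model
  predict(fit, newdata = data.frame(x = x0))
})
c(variance    = var(preds),                       # Var of the fit at x0 across samples
  bias_sq     = (mean(preds) - true_mean(x0))^2,  # squared bias at x0
  irreducible = 1)                                # Var(epsilon)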
3 Simulation from Lecture
library(tibble)
library(dplyr)
We begin by defining two functions: one that represents the true mean function and another that generates data from it by adding normal noise.
# true mean function: a cubic polynomial in x
cubic_mean = function(x) {
  1 - 2 * x - 3 * x ^ 2 + 5 * x ^ 3
}

# generate simple regression data: x uniform on [-1, 1], y = mu(x) + standard normal noise
gen_slr_data = function(sample_size = 100, mu) {
  x = round(runif(n = sample_size, min = -1, max = 1), 2)
  y = round(mu(x) + rnorm(n = sample_size), 2)
  tibble(x, y)
}

Let’s generate some data from this cubic model.
set.seed(1)
sim_slr_data = gen_slr_data(sample_size = 10, mu = cubic_mean)
sim_slr_data

# A tibble: 10 × 2
x y
<dbl> <dbl>
1 -0.47 -0.06
2 -0.26 1.72
3 0.15 1.39
4 0.82 0.68
5 -0.6 -0.27
6 0.8 1.55
7 0.89 0.76
8 0.32 -0.4
9 0.26 -1.85
10 -0.88 -1.85
Next, we fit three polynomial regression models of degrees 1, 3, and 9.
poly_mod_1 = lm(y ~ x, data = sim_slr_data)
poly_mod_3 = lm(y ~ poly(x, degree = 3), data = sim_slr_data)
poly_mod_9 = lm(y ~ poly(x, degree = 9), data = sim_slr_data)

The following plots show how each model fits the data. The dashed line represents the true mean function, while the solid line represents the fitted model.
par(mfrow = c(1, 3))
plot(sim_slr_data, pch = 20, col = "grey", cex = 2,
main = "Degree 1 Polynomial",
xlim = c(-1.1, 1.1), ylim = c(-2, 2))
curve(cubic_mean(x), add = TRUE, lwd = 2, lty = 2)
curve(predict(poly_mod_1, tibble(x = x)),
col = "#fc8d62", lwd = 2, lty = 1, add = TRUE, n = 10000)
grid()
plot(sim_slr_data, pch = 20, col = "grey", cex = 2,
main = "Degree 3 Polynomial",
xlim = c(-1.1, 1.1), ylim = c(-2, 2))
curve(cubic_mean(x), add = TRUE, lwd = 2, lty = 2)
curve(predict(poly_mod_3, tibble(x = x)),
col = "#beaed4", lwd = 2, lty = 1, add = TRUE, n = 10000)
grid()
plot(sim_slr_data, pch = 20, col = "grey", cex = 2,
main = "Degree 9 Polynomial",
xlim = c(-1.1, 1.1), ylim = c(-2, 2))
curve(cubic_mean(x), add = TRUE, lwd = 2, lty = 2)
curve(predict(poly_mod_9, tibble(x = x)),
col = "#91bfdb", lwd = 2, lty = 1, add = TRUE, n = 10000)
grid()

Which of these models is best?
To illustrate why it’s important to fit the true regression mean rather than overfitting the sample data, we generate a new dataset and apply the same models.
set.seed(7)
new_sim_slr_data = gen_slr_data(sample_size = 20, mu = cubic_mean)
par(mfrow = c(1, 3))
plot(new_sim_slr_data, pch = 20, col = "grey", cex = 2,
main = "Degree 1 Polynomial | New Data",
xlim = c(-1.1, 1.1), ylim = c(-2, 2))
curve(cubic_mean(x), add = TRUE, lwd = 2, lty = 2)
curve(predict(poly_mod_1, tibble(x = x)),
col = "#fc8d62", lwd = 2, lty = 1, add = TRUE, n = 10000)
grid()
plot(new_sim_slr_data, pch = 20, col = "grey", cex = 2,
main = "Degree 3 Polynomial | New Data",
xlim = c(-1.1, 1.1), ylim = c(-2, 2))
curve(cubic_mean(x), add = TRUE, lwd = 2, lty = 2)
curve(predict(poly_mod_3, tibble(x = x)),
col = "#beaed4", lwd = 2, lty = 1, add = TRUE, n = 10000)
grid()
plot(new_sim_slr_data, pch = 20, col = "grey", cex = 2,
main = "Degree 9 Polynomial | New Data",
xlim = c(-1.1, 1.1), ylim = c(-2, 2))
curve(cubic_mean(x), add = TRUE, lwd = 2, lty = 2)
curve(predict(poly_mod_9, tibble(x = x)),
col = "#91bfdb", lwd = 2, lty = 1, add = TRUE, n = 10000)
grid()

This experiment demonstrates how different polynomial degrees affect model fit:
- Degree 1 underfits: it misses curvature in the data.
- Degree 3 closely matches the true cubic relationship.
- Degree 9 overfits: it captures noise and performs poorly on new data.
The example highlights the dangers of overfitting: complex models can appear accurate on limited data but fail to generalize.
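A minimal sketch (reusing rmse() and the objects fitted above; model_list_sim is just an illustrative name) quantifies what the plots show: the degree-9 model fits the original sample almost perfectly but is expected to do worse than the cubic on the new sample.

model_list_sim = list(deg1 = poly_mod_1, deg3 = poly_mod_3, deg9 = poly_mod_9)
sapply(model_list_sim, function(m) {
  c(original = rmse(sim_slr_data$y,     predict(m, sim_slr_data)),
    new      = rmse(new_sim_slr_data$y, predict(m, new_sim_slr_data)))
})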
3.1 Exercise
Replicate the simulation example, but try other models based on polynomials of different orders.
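For instance, lower- and intermediate-degree fits could be added in the same style (these particular degrees are just one possible choice):

poly_mod_2 = lm(y ~ poly(x, degree = 2), data = sim_slr_data)  # a simpler candidate
poly_mod_5 = lm(y ~ poly(x, degree = 5), data = sim_slr_data)  # a more flexible candidate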