‘Non-Linear’ Linear Regression

Author

Termeh Shafie

1 Nonlinearity

In order to build better models, we sometimes create new features in our data. This can be as simple as creating a column like BMI or bill_ratio, as we have done in past classworks, or extracting the day of the week from a date string, but it can also mean creating polynomial features, step functions, splines, or interactions.

Such features can help our model perform better. Polynomial features, step functions, and splines also allow us to add non-linearity to our predictions even though we are using linear regression.
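
As a small illustration (the data frame and column names below are made up for this example, not part of any dataset used here), such features are usually just new columns in a data frame or extra terms in a model formula:

toy = data.frame(bill = c(20, 35, 50), people = c(2, 3, 5),
                 date = as.Date(c("2024-01-05", "2024-01-06", "2024-01-07")))
toy$bill_ratio = toy$bill / toy$people   # a simple derived feature
toy$weekday = weekdays(toy$date)         # day of the week extracted from a date
# polynomial terms and interactions can be added directly in a model formula, e.g.
# lm(y ~ poly(x, 2) + x1:x2, data = ...)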

We will use several approaches for modelling non-linearity and apply them to the Boston dataset included in the R package MASS. This dataset consists of 506 samples. The response variable is the median value of owner-occupied homes in Boston (medv, in $1000s), and there are 13 associated predictor variables.

library(MASS)
help(Boston)
names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"   
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7

We will analyse medv with respect to the predictor lstat (percentage of lower status population).

head(cbind(Boston$medv, Boston$lstat))
     [,1] [,2]
[1,] 24.0 4.98
[2,] 21.6 9.14
[3,] 34.7 4.03
[4,] 33.4 2.94
[5,] 36.2 5.33
[6,] 28.7 5.21

For convenience we can name the response as y and the predictor x. We will also pre-define the labels for the x and y-axes that we will use repeatedly in figures throughout this practical.

y = Boston$medv
x = Boston$lstat
y.lab = 'Median Property Value'
x.lab = 'Lower Status (%)'
plot( x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
      main = "", bty = 'l' )

1.1 Polynomial Regression

Start by fitting to the data a degree-2 polynomial using the command lm() and summarizing the results using summary().

poly2 = lm(y ~ poly(x,  2,  raw = TRUE))
summary(poly2)

Call:
lm(formula = y ~ poly(x, 2, raw = TRUE))

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2834  -3.8313  -0.5295   2.3095  25.4148 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             42.862007   0.872084   49.15   <2e-16 ***
poly(x, 2, raw = TRUE)1 -2.332821   0.123803  -18.84   <2e-16 ***
poly(x, 2, raw = TRUE)2  0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,    Adjusted R-squared:  0.6393 
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

Note the argument raw = TRUE: in terms of the fitted curve, poly(x, 2, raw = TRUE) and poly(x, 2) give exactly the same result. They simply use different bases for the same model (the default is an orthogonal polynomial basis), and with polynomial regression we are rarely interested in the individual regression coefficients anyway.
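
A quick way to convince ourselves of this is to compare the fitted values from the two parameterisations:

# same column space, hence identical fitted values (up to numerical error)
all.equal(fitted(lm(y ~ poly(x, 2, raw = TRUE))), fitted(lm(y ~ poly(x, 2))))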

For plotting the results, we need to create an object, which we name sort.x, containing the values of the predictor x sorted in ascending order. Without sort.x we would not be able to draw the fitted curves properly, since lines() joins points in the order they are given. We then use predict() with sort.x as input in order to proceed to the next steps.

sort.x = sort(x)
sort.x[1:10]     # the first 10 sorted values of x 
 [1] 1.73 1.92 1.98 2.47 2.87 2.88 2.94 2.96 2.97 2.98
pred2 = predict(poly2, newdata = list(x = sort.x), se = TRUE)
names(pred2)
[1] "fit"            "se.fit"         "df"             "residual.scale"

The object pred2 contains fit, the fitted values, and se.fit, the standard errors of the mean prediction, which we need in order to construct approximate 95% confidence intervals (for the mean prediction). With this information we can construct the confidence bands using cbind(). Let's see what the first 10 fitted values and confidence intervals look like.

pred2$fit[1:10]  # the first 10 fitted values of the curve
       1        2        3        4        5        6        7        8 
38.95656 38.54352 38.41374 37.36561 36.52550 36.50468 36.37992 36.33840 
       9       10 
36.31765 36.29691 
se.bands2 = cbind( pred2$fit - 2 * pred2$se.fit, 
                   pred2$fit + 2 * pred2$se.fit )
se.bands2[1:10,] # the first 10 confidence intervals of the curve
       [,1]     [,2]
1  37.58243 40.33069
2  37.20668 39.88036
3  37.08853 39.73895
4  36.13278 38.59845
5  35.36453 37.68647
6  35.34546 37.66390
7  35.23118 37.52865
8  35.19314 37.48365
9  35.17413 37.46117
10 35.15513 37.43870

Now we can plot the results:

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
     main = "Degree-2 polynomial", bty = 'l')
lines(sort.x, pred2$fit, lwd = 2, col = "firebrick")
matlines(sort.x, se.bands2, lwd = 1.4, col = "firebrick", lty = 3)

Note: We use lines() for pred2$fit because this is a vector, but for se.bands2, which is a matrix, we have to use matlines().

We then follow similar steps to produce plots of the degree-3 up to degree-5 polynomial fits.

poly3 = lm(y ~ poly(x,  3))
poly4 = lm(y ~ poly(x,  4))
poly5 = lm(y ~ poly(x, 5))

pred3 = predict(poly3, newdata = list(x = sort.x), se = TRUE)
pred4 = predict(poly4, newdata = list(x = sort.x), se = TRUE)
pred5 = predict(poly5, newdata = list(x = sort.x), se = TRUE)

se.bands3 = cbind(pred3$fit + 2*pred3$se.fit, pred3$fit-2*pred3$se.fit)
se.bands4 = cbind(pred4$fit + 2*pred4$se.fit, pred4$fit-2*pred4$se.fit)
se.bands5 = cbind(pred5$fit + 2*pred5$se.fit, pred5$fit-2*pred5$se.fit)


par(mfrow = c(2,2))
# Degree-2
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
     main = "Degree-2 polynomial", bty = 'l')
lines(sort.x, pred2$fit, lwd = 2, col = "firebrick")
matlines(sort.x, se.bands2, lwd = 2, col = "firebrick", lty = 3)

# Degree-3
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
     main = "Degree-3 polynomial", bty = 'l')
lines(sort.x, pred3$fit, lwd = 2, col = "darkviolet")
matlines(sort.x, se.bands3, lwd = 2, col = "darkviolet", lty = 3)

# Degree-4
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
     main = "Degree-4 polynomial", bty = 'l')
lines(sort.x, pred4$fit, lwd = 2, col = "royalblue")
matlines(sort.x, se.bands4, lwd = 2, col = "royalblue", lty = 3)

# Degree-5
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
     main = "Degree-5 polynomial", bty = 'l')
lines(sort.x, pred5$fit, lwd = 2, col = "darkgreen")
matlines(sort.x, se.bands5, lwd = 2, col = "darkgreen", lty = 3)

All four curves look reasonable given the data available. We may choose the degree-2 polynomial since it is simpler and seems to do about as well as the others. However, if we want to base our decision on a more formal procedure, we can use analysis of variance (ANOVA). Specifically, we perform sequential comparisons based on the F-test, comparing first the linear model vs. the quadratic model (degree-2 polynomial), then the quadratic model vs. the cubic model (degree-3 polynomial), and so on. We therefore have to fit the simple linear model, and we also fit the degree-6 polynomial to investigate the effect of one further polynomial term. We can perform this analysis in R using the command anova() as displayed below.

poly1 = lm(y ~ x)
poly6 = lm(y ~ poly(x, 6))
anova(poly1, poly2, poly3, poly4, poly5, poly6)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ poly(x, 2, raw = TRUE)
Model 3: y ~ poly(x, 3)
Model 4: y ~ poly(x, 4)
Model 5: y ~ poly(x, 5)
Model 6: y ~ poly(x, 6)
  Res.Df   RSS Df Sum of Sq        F    Pr(>F)    
1    504 19472                                    
2    503 15347  1    4125.1 151.8623 < 2.2e-16 ***
3    502 14616  1     731.8  26.9390 3.061e-07 ***
4    501 13968  1     647.8  23.8477 1.406e-06 ***
5    500 13597  1     370.7  13.6453 0.0002452 ***
6    499 13555  1      42.4   1.5596 0.2123125    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which model would you choose?
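
To see what anova() is doing, we can reproduce one of the F-statistics by hand: for the comparison of the linear and quadratic fits, F is the drop in RSS (on 1 degree of freedom) divided by the residual mean square of the largest model in the table. This should give roughly 151.9, matching the table above:

# F-statistic for linear vs. quadratic, computed from the residual sums of squares
(sum(resid(poly1)^2) - sum(resid(poly2)^2)) / (sum(resid(poly6)^2) / poly6$df.residual)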

1.2 Step Functions

For step function regression we can make use of the command cut(), which automatically assigns samples to intervals given a specific number of intervals. We can check how this works by executing the following syntax:

table(cut(x, 2))

(1.69,19.9]   (19.9,38] 
        430          76 

What we see is that cut(x, 2) automatically created a factor with two levels, corresponding to the intervals \((1.69,19.9]\) and \((19.9,38]\), and assigned each entry in x to one of these levels depending on which interval it falls in. The command table() tells us that 430 values of x fall within the first interval and 76 within the second. Note that cut(x, 2) generated 2 intervals, which means there is only 1 cutpoint (at 19.9). The number of cutpoints is always one less than the number of intervals, but be aware that cut() expects the number of intervals, not the number of cutpoints.
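
When such a factor is used in a regression, the first interval is absorbed into the intercept and each remaining interval gets its own 0/1 indicator column; we can see this from the design matrix:

# design matrix for a step function with one cutpoint: an intercept plus an
# indicator for the second interval (the first interval acts as the baseline)
head(model.matrix(~ cut(x, 2)))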

So, we can use cut() within lm() to easily fit regression models with step functions. Below we consider 4 models with 1, 2, 3 and 4 cutpoints (2, 3, 4 and 5 intervals) respectively.

step2 = lm(y ~ cut(x, 2))
step3 = lm(y ~ cut(x, 3))
step4 = lm(y ~ cut(x, 4))
step5 = lm(y ~ cut(x, 5))

The analysis then is essentially the same as previously. We plot the fitted lines of the four models, along with approximate 95% confidence intervals for the mean predictions.

pred2 = predict(step2, newdata = list(x = sort(x)), se = TRUE)
pred3 = predict(step3, newdata = list(x = sort(x)), se = TRUE)
pred4 = predict(step4, newdata = list(x = sort(x)), se = TRUE)
pred5 = predict(step5, newdata = list(x = sort(x)), se = TRUE)

se.bands2 = cbind(pred2$fit + 2*pred2$se.fit, pred2$fit-2*pred2$se.fit)
se.bands3 = cbind(pred3$fit + 2*pred3$se.fit, pred3$fit-2*pred3$se.fit)
se.bands4 = cbind(pred4$fit + 2*pred4$se.fit, pred4$fit-2*pred4$se.fit)
se.bands5 = cbind(pred5$fit + 2*pred5$se.fit, pred5$fit-2*pred5$se.fit)

par(mfrow = c(2,2))

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
     main = "1 cutpoint", bty = 'l')
lines(sort(x), pred2$fit, lwd = 2, col = "firebrick")
matlines(sort(x), se.bands2, lwd = 1.4, col = "firebrick", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
     main = "2 cutpoints", bty = 'l')
lines(sort(x), pred3$fit, lwd = 2, col = "darkviolet")
matlines(sort(x), se.bands3, lwd = 1.4, col = "darkviolet", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
     main = "3 cutpoints", bty = 'l')
lines(sort(x), pred4$fit, lwd = 2, col = "royalblue")
matlines(sort(x), se.bands4, lwd = 1.4, col = "royalblue", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab,
     main = "4 cutpoints", bty = 'l')
lines(sort(x), pred5$fit, lwd = 2, col = "darkgreen")
matlines(sort(x), se.bands5, lwd = 1.4, col = "darkgreen", lty = 3)

Note that we do not need to rely on the automatic choice of cutpoints made by cut(); we can define the intervals ourselves. For instance, if we want cutpoints at 10, 20 and 30 we can do the following:

breaks4 = c(min(x), 10, 20, 30, max(x))
table(cut(x, breaks = breaks4))

(1.73,10]   (10,20]   (20,30]   (30,38] 
      218       213        62        12 
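
Note that the counts above sum to 505, not 506: intervals created by cut() are open on the left, so the observation equal to min(x) falls outside the first interval and is coded as NA. A quick check, together with the include.lowest argument that avoids this:

sum(is.na(cut(x, breaks = breaks4)))                    # the minimum value of x is lost
table(cut(x, breaks = breaks4, include.lowest = TRUE))  # closing the first interval keeps it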

By including min(x) and max(x) at the start and end, we ensure the intervals cover the entire range of x (although, as noted above, the minimum value itself is excluded unless include.lowest = TRUE is used, which explains the one observation deleted due to missingness in the summary below). Our model is then

step.new4 = lm(y ~ cut(x, breaks = breaks4))
summary(step.new4)

Call:
lm(formula = y ~ cut(x, breaks = breaks4))

Residuals:
     Min       1Q   Median       3Q      Max 
-17.4803  -4.6239  -0.4239   2.8968  20.6197 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      29.3803     0.4415  66.540   <2e-16 ***
cut(x, breaks = breaks4)(10,20] -10.4563     0.6281 -16.648   <2e-16 ***
cut(x, breaks = breaks4)(20,30] -16.6770     0.9383 -17.773   <2e-16 ***
cut(x, breaks = breaks4)(30,38] -18.6886     1.9331  -9.668   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.519 on 501 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.4925,    Adjusted R-squared:  0.4895 
F-statistic: 162.1 on 3 and 501 DF,  p-value: < 2.2e-16

We can now make predictions at new data points using the constructed linear model as usual.

newx <- c(10.56, 5.89)
preds = predict(step.new4, newdata = list(x = newx), se = TRUE)
preds
$fit
       1        2 
18.92394 29.38028 

$se.fit
        1         2 
0.4466955 0.4415432 

$df
[1] 501

$residual.scale
[1] 6.519307

1.3 Regression Splines

For this analysis we will require package splines.

library(splines)

Initially, let's fit regression splines by specifying the knots ourselves. From the previous plots it is not clear where exactly we should place the knots, so we will use the command summary() to find the 25th, 50th and 75th percentiles of x, which is where we will place them. We also sort the variable x before fitting the splines.

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.73    6.95   11.36   12.65   16.95   37.97 
cuts = summary(x)[c(2, 3, 5)] 
cuts
1st Qu.  Median 3rd Qu. 
  6.950  11.360  16.955 
sort.x = sort(x)

To start, let's fit a linear spline using our selected placement of knots. For this we use the command lm() and, inside it, the command bs(), in which we specify degree = 1 for a linear spline and knots = cuts for the placement of the knots at the three percentiles. We also calculate the corresponding fitted values and confidence intervals in exactly the same way as in the previous sections.

spline1 = lm(y ~  splines::bs(x, degree = 1, knots = cuts))
pred1 = predict(spline1, newdata = list(x = sort.x), se = TRUE)
se.bands1 = cbind(pred1$fit + 2 * pred1$se.fit, 
                  pred1$fit - 2 * pred1$se.fit)

Let’s plot the results:

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Linear Spline", bty = 'l')
lines(sort.x, pred1$fit, lwd = 2, col = "firebrick")
matlines(sort.x, se.bands1, lwd = 2, col = "firebrick", lty = 3)

Using ?bs we see that instead of the argument knots we can use the argument df, the degrees of freedom. A degree-\(d\) spline with \(K\) knots uses \((d+1)+K\) degrees of freedom in total (including the intercept), so the linear spline above uses 1+1+3 = 5. Note, however, that the df argument of bs() counts only the spline basis functions, excluding the intercept, so it equals \(d+K\); when df is supplied, bs() places the resulting (df - degree) knots automatically at quantiles of x. Below we fit a linear spline with df = 5 (which therefore uses four automatically placed knots) and compare it with the spline fitted above; as the plots show, the two fits are very similar.
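
We can check where bs() actually places the knots when only df is supplied (the "knots" attribute of the basis stores the interior knots):

# with degree = 1 and df = 5, bs() chooses df - degree = 4 interior knots,
# placed at quantiles of x (here the 20th, 40th, 60th and 80th percentiles)
attr(splines::bs(x, degree = 1, df = 5), "knots")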

spline1df = lm(y ~ splines::bs(x, degree = 1, df = 5))
pred1df = predict(spline1df, newdata = list(x = sort.x), se = TRUE)
se.bands1df = cbind( pred1df$fit + 2 * pred1df$se.fit, 
                     pred1df$fit - 2 * pred1df$se.fit )

par(mfrow = c(1, 2))
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Linear Spline (with knots)", bty = 'l')
lines(sort.x, pred1$fit, lwd = 2, col = "firebrick")
matlines(sort.x, se.bands1, lwd = 2, col = "firebrick", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Linear Spline (with df)", bty = 'l')
lines(sort.x, pred1df$fit, lwd = 2, col = "firebrick")
matlines(sort.x, se.bands1df, lwd = 2, col = "firebrick", lty = 3)
matlines(sort.x, se.bands1, lwd = 2, col = "firebrick", lty = 3)  # overlay the bands from the knot-based fit for comparison

Having seen how this works, we can also fit a degree-2 (quadratic) and a degree-3 (cubic) spline to the data; all we have to do is change degree = 1 to degree = 2 and degree = 3 respectively. We also increase the degrees of freedom from df = 5 to df = 6 and df = 7 so that the number (and position) of the automatically placed knots stays the same across the linear, quadratic and cubic spline models.

spline2 = lm(y ~ splines::bs(x, degree = 2, df = 6))
pred2 = predict(spline2, newdata = list(x = sort.x), se = TRUE)
se.bands2 = cbind(pred2$fit + 2 * pred2$se.fit, pred2$fit - 2 * pred2$se.fit)

spline3 = lm(y ~ splines::bs(x, degree = 3, df = 7))
pred3 = predict(spline3, newdata = list(x = sort.x), se = TRUE)
se.bands3 = cbind(pred3$fit + 2 * pred3$se.fit, pred3$fit - 2 * pred3$se.fit)

par(mfrow = c(1,3))
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Linear Spline", bty = 'l')
lines(sort.x, pred1$fit, lwd = 2, col = "firebrick")
matlines(sort.x, se.bands1, lwd = 2, col = "firebrick", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Quadratic Spline", bty = 'l')
lines(sort.x, pred2$fit, lwd = 2, col = "darkgreen")
matlines(sort.x, se.bands2, lwd = 2, col = "darkgreen", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Cubic Spline", bty = 'l')
lines(sort.x, pred3$fit, lwd = 2, col = "royalblue")
matlines(sort.x, se.bands3, lwd = 2, col = "royalblue", lty = 3)

1.4 Natural Splines

For natural splines we can use the command ns(). As with the command bs() previously, we again have the option either to specify the knots manually (via the argument knots) or simply to pre-specify the degrees of freedom (via the argument df). Below we use the latter option to fit four natural splines with 1, 2, 3 and 4 degrees of freedom. As we will see, using 1 degree of freedom results in just a linear fit.

spline.ns1 = lm(y ~ splines::ns(x, df = 1))
pred.ns1 = predict(spline.ns1, newdata = list(x = sort.x), se = TRUE)
se.bands.ns1 = cbind(pred.ns1$fit + 2 * pred.ns1$se.fit, 
                     pred.ns1$fit - 2 * pred.ns1$se.fit)

spline.ns2 = lm(y ~ splines::ns(x, df = 2))
pred.ns2 = predict(spline.ns2, newdata = list(x = sort.x), se = TRUE)
se.bands.ns2 = cbind(pred.ns2$fit + 2 * pred.ns2$se.fit, 
                     pred.ns2$fit - 2 * pred.ns2$se.fit)

spline.ns3 = lm(y ~ splines::ns(x, df = 3))
pred.ns3 = predict(spline.ns3, newdata = list(x = sort.x), se = TRUE)
se.bands.ns3 = cbind(pred.ns3$fit + 2 * pred.ns3$se.fit, 
                     pred.ns3$fit - 2 * pred.ns3$se.fit)

spline.ns4 = lm(y ~ splines::ns(x, df = 4))
pred.ns4 = predict(spline.ns4, newdata = list(x = sort.x), se = TRUE)
se.bands.ns4 = cbind(pred.ns4$fit + 2 * pred.ns4$se.fit, 
                     pred.ns4$fit - 2 * pred.ns4$se.fit)

par(mfrow = c(2, 2))

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (1 df)", bty = 'l')
lines(sort.x, pred.ns1$fit, lwd = 2, col = "firebrick")
matlines(sort.x, se.bands.ns1, lwd = 2, col = "firebrick", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (2 df)", bty = 'l')
lines(sort.x, pred.ns2$fit, lwd = 2, col = "darkviolet")
matlines(sort.x, se.bands.ns2, lwd = 2, col = "darkviolet", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (3 df)", bty = 'l')
lines(sort.x, pred.ns3$fit, lwd = 2, col = "royalblue")
matlines(sort.x, se.bands.ns3, lwd = 2, col = "royalblue", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (4 df)", bty = 'l')
lines(sort.x, pred.ns4$fit, lwd = 2, col = "darkgreen")
matlines(sort.x, se.bands.ns4, lwd = 2, col = "darkgreen", lty = 3)

Below we plot the cubic spline next to the natural cubic spline for comparison. As we can see, the natural cubic spline is generally smoother and closer to linear on the right boundary of the predictor space, where it has, additionally, narrower confidence intervals in comparison to the cubic spline.

par(mfrow = c(1,2))
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Cubic Spline", bty = 'l')
lines(sort.x, pred3$fit, lwd = 2, col = "darkblue")
matlines(sort.x, se.bands3, lwd = 2, col = "darkblue", lty = 3)

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Natural Spline (3 df)", bty = 'l')
lines(sort.x, pred.ns3$fit, lwd = 2, col = "royalblue")
matlines(sort.x, se.bands.ns3, lwd = 2, col = "royalblue", lty = 3)
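
We can also back up the statement about the narrower bands numerically: since each band is the fit ± 2 standard errors, the difference between the two columns of the band matrices gives the band width at each value of sort.x.

# band widths (4 * se.fit) at the largest values of x, i.e. the right boundary
tail(se.bands3[, 1] - se.bands3[, 2])        # cubic spline
tail(se.bands.ns3[, 1] - se.bands.ns3[, 2])  # natural spline (3 df)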

1.5 Smoothing Splines

For fitting smoothing splines we use the command smooth.spline() instead of lm(). With smoothing splines there are no knots to specify; the only tuning parameter is \(\lambda\). This can be chosen by cross-validation, by setting cv = TRUE inside smooth.spline(). Alternatively, we can specify the effective degrees of freedom, which correspond to some value of \(\lambda\). Below we first fit a smoothing spline with 3 effective degrees of freedom (via the argument df = 3), and then one with \(\lambda\) tuned via cross-validation. In this case we see that tuning \(\lambda\) through cross-validation results in a curve that is slightly wiggly on the right boundary of the predictor space.

smooth1 = smooth.spline(x, y, df = 3)
smooth2 = smooth.spline(x, y, cv = TRUE)
Warning in smooth.spline(x, y, cv = TRUE): cross-validation with non-unique 'x'
values seems doubtful
par(mfrow = c(1,2))
plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Smoothing Spline (3 df)", bty = 'l')
lines(smooth1, lwd = 2, col = "brown")

plot(x, y, cex.lab = 1.1, col="darkgrey", xlab = x.lab, ylab = y.lab, 
     main = "Smoothing Spline (CV)", bty = 'l')
lines(smooth2, lwd = 2, col = "darkorange")

Note: the effective degrees of freedom of a smoothing spline are similar to the degrees of freedom in standard spline models and can be used as an alternative to cross-validation as a way to fix \(\lambda\).
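
We can also inspect what the cross-validation actually selected; the fitted smooth.spline object stores the chosen effective degrees of freedom and the corresponding smoothing parameter:

smooth2$df      # effective degrees of freedom chosen by cross-validation
smooth2$lambda  # the corresponding value of lambda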

1.6 GAMs

In this final part we will fit a generalized additive model (GAM) using more than one predictor from the Boston dataset. We first use the command names() to check once again the available predictor variables.

names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
 [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"   

Let’s say that we want to use predictors lstat, indus and chas for the analysis (use ?Boston again to check what these refer to).

For GAMs we will make use of the library gam, so the first thing that we have to do is to install this package by executing install.packages("gam") once. Then we load the library.

library(gam)

The main function is gam(). Inside this function we can use any combination of non-linear and linear terms for the various predictors. For example, below we use a cubic spline with 5 degrees of freedom for lstat, a smoothing spline with 5 degrees of freedom for indus and a simple linear term for chas. We then plot the contribution of each predictor using the command plot(). As we can see, GAMs are very useful because they estimate the separate contribution of each predictor to the response.

gam0 = gam( medv ~ bs(lstat, degree = 3, df = 5) + s(indus, df = 5) + chas, 
            data = Boston )
par( mfrow = c(1,3) )
plot( gam0,  se = TRUE, col = "blue" )

Note that simply entering chas inside gam() fits a linear term for this variable. However, chas is a binary variable, taking only the values 0 and 1, as we can see from the x-axis of the chas panel on the right above. It would therefore be preferable to treat it as categorical, i.e. fit a step function for this variable. To do this we have to convert chas to a factor. We first create a copy called Boston1 (so as not to change the original dataset Boston) and then use the command factor() to convert chas. We then refit the same model. As we can see below, gam() now fits a step function for chas, which is more appropriate.

Boston1 = Boston
Boston1$chas = factor(Boston1$chas)

gam1 = gam( medv ~ bs(lstat, degree = 3, df = 5) + s(indus, df = 5) + chas, 
            data = Boston1 )
par(mfrow = c(1,3))
plot(gam1,  se = TRUE, col = "blue")

We can make predictions from gam objects, just like from lm objects, using the predict() method for the class gam. Here we make a prediction at some new data. Note that when assigning the value 0 to chas we enclose it in quotes ("0"), since we have told R to treat chas as a categorical factor with two levels: "0" and "1".

preds <- predict( gam1, 
                  newdata = data.frame( chas = "0", indus = 3, lstat = 5 )  )
preds
       1 
32.10065 

1.7 Classwork Exercises

1.7.1 Polynomial and Step Function Regression

  1. Load libraries MASS and faraway (the latter contains the required dataset seatpos, which includes data on car seat positioning depending on driver size).

  2. We will analyse the effects of predictor variable Ht on the response variable hipcenter. Assign the hipcenter values to a vector y, and the Ht variable values to a vector x.

  3. Plot variable Ht and hipcenter against each other. Remember to include suitable axis labels. From visual inspection of this plot, what sort of polynomial might be appropriate for this data?

  4. Fit a first- and a second-order polynomial to the data using the commands lm() and poly(). Look at the corresponding summary objects. Do these back up your answer above?

  5. Plot the first and second polynomial fits to the data, along with ±2 standard deviation confidence intervals. What do you notice about the degree-2 polynomial plot?

  6. Perform an analysis of variance to confirm whether higher order degrees of polynomial are useful for modelling hipcenter based on Ht.

  7. Use step function regression with 5 cut-points to model hipcenter based on Ht. Plot the results.

1.7.2 Splines

Here we will again analyse the effect of the predictor variable Ht on the response variable hipcenter.

  1. Load the necessary packages for the dataset and spline functions. Assign the hipcenter values to a vector y, and the Ht variable values to a vector x.

  2. Find the 25th, 50th and 75th percentiles of x, storing them in a vector cuts.

  3. Use a linear spline to model hipcenter as a function of Ht, putting knots at the 25th, 50th and 75th percentiles of x.

  4. Plot the fitted linear spline from part (c) over the data, along with ±2 standard deviation confidence intervals.

  5. Use a smoothing spline to model hipcenter as a function of Ht, selecting
    \(\lambda\) with cross-validation, and generate a relevant plot. What do you notice?

1.7.3 GAMs

  1. Fit a GAM for hipcenter that consists of three terms:
  • a natural spline with 5 degrees of freedom for Age,
  • a smoothing spline with 3 degrees of freedom for Thigh, and
  • a simple linear term for Ht.

  2. Plot the resulting contributions of each term to the GAM, and compare them with plots of hipcenter against each of the three variables Age, Thigh and Ht.

  3. Does the contribution of each term of the GAM make sense in light of these pairwise plots? Is the GAM fitting the data well?