Linear Regression I

Author

Termeh Shafie

1 Linear Regression

Linear regression models relationships using straight lines (or flat planes/hyperplanes when there are multiple predictors), following the equation \(Y = mx + b\).

If the true relationship between the predictor and the outcome isn’t actually linear, the model’s performance can suffer. More importantly, this poor performance often isn’t uniform: it may be especially bad for specific ranges of predictor values.

\[\underbrace{Y_i}_\text{Observed Value for Person i} = \underbrace{\beta_0}_\text{intercept} + \underbrace{\beta_1}_\text{Coefficient for X1} * \underbrace{X_{i1}}_\text{Value for person i on Variable X1} + ... + \beta_p * X_{ip}\]

1.1 Question

Can you think of a situation where it would be particularly problematic if our model consistently under- or over-predicted within certain ranges of the predictor?

Imagine we’re predicting house prices (\(Y\)) using square footage (\(X\)).
If we fit a linear model, it might do a decent job for mid-sized homes but fail at the extremes.

  • For very small homes, the model might over-predict price (since tiny houses often sell for disproportionately less).
  • For very large homes, it might under-predict (since luxury properties tend to have higher prices per square foot).

This would be especially problematic if those extreme cases are important to decision-making, for example, a real estate investor estimating renovation costs or an appraiser evaluating high-end homes.
In such cases, a nonlinear model (like polynomial regression or a tree-based model) might capture those curved patterns better.

1.2 How to Choose Model Parameters

Linear regression models include two types of parameters: coefficients, which describe how changes in the predictors affect the predicted outcome, and an intercept, which represents the predicted value when all predictors are zero.

To build an effective model, we need to estimate these parameters in a way that best fits the data. We covered two main approaches for doing this: Least Squares and Maximum Likelihood Estimation (MLE).

1.2.1 Least Squares

As the name implies, Least Squares aims to choose parameter values that minimize the sum of squared errors.

\[ \text{SSE} = \sum_{i = 1}^n(y_i - \hat{y}_i)^2 \]

Here, \(\hat{y}_i\) represents our model’s predicted value of \(y_i\). For a simple linear regression with only one predictor, we get our prediction using the formula:

\[ \hat{y}_i = \beta_0 + \beta_1*x_i \]

So let’s plug that in for \(\hat{y_i}\):

\[ \text{SSE} = \sum_{i = 1}^n(y_i - \beta_0 - \beta_1*x_i)^2 \]

Now all we need to do is set the partial derivatives of the \(\text{SSE}\) to 0 and solve. The formula above has two parameters that we’re interested in: \(\beta_0\) and \(\beta_1\), so we’ll take the partial derivatives of \(\text{SSE}\) with respect to each of them:

\[\frac{\partial SSE}{\partial \beta_0} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-1)\] \[\frac{\partial SSE}{\partial \beta_1} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-x_i)\]

and set them equal to 0.

\[\frac{\partial SSE}{\partial \beta_0} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-1) = 0\] \[\frac{\partial SSE}{\partial \beta_1} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-x_i) = 0\]

then we solve for \(\beta_0\) and \(\beta_1\) and we get:

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1* \bar{x}\]

and

\[ \hat{\beta}_1 = \frac{Cov(x,y)}{Var(x)} = Corr(x,y) * \frac{sd(y)}{sd(x)} \]

These values for \(\beta_0\) and \(\beta_1\) are the ones that minimize our Sum of Squared Errors (\(\text{SSE}\)) and therefore give us the line that best fits the training data in the least-squares sense.
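
As a quick sanity check (a sketch, not part of the original notes; the data below are simulated purely for illustration), we can compute these closed-form estimates directly in R and compare them with what lm() returns:

# Verify the closed-form least squares estimates against lm() on toy data
set.seed(1)
x_toy <- rnorm(50, mean = 10, sd = 2)          # simulated predictor
y_toy <- 3 + 0.7 * x_toy + rnorm(50, sd = 1)   # simulated outcome with known slope and intercept

b1 <- cov(x_toy, y_toy) / var(x_toy)                  # slope: Cov(x, y) / Var(x)
b1_alt <- cor(x_toy, y_toy) * sd(y_toy) / sd(x_toy)   # equivalent: Corr(x, y) * sd(y) / sd(x)
b0 <- mean(y_toy) - b1 * mean(x_toy)                  # intercept: y-bar minus slope times x-bar

c(b0 = b0, b1 = b1, b1_alt = b1_alt)
coef(lm(y_toy ~ x_toy))                        # should match (up to rounding)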

1.2.2 Maximum Likelihood Estimation

Another way to estimate the parameters (coefficients and intercept) of a linear regression model is through Maximum Likelihood Estimation (MLE), a method we’ll revisit several times in this course. MLE selects parameter values that make the observed training data as likely as possible under the model.

Remember, a model is our mathematical description of the world. If a model assigns very low probability to data that looks like what we actually observed, it’s probably not a good description of reality.

The likelihood of an individual data point in our model is:

\[ p(y_i | x_i; \beta_0, \beta_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - (\beta_0 + \beta_1 * x_i))^2}{2\sigma^2}}\]

\(^\text{(notice that the numerator in the exponent is just the squared error for that data point)}\)

If we have multiple data points in our training data, we’ll multiply their likelihoods.

\[ \prod_{i = 1}^{n} p(y_i | x_i; \beta_0, \beta_1, \sigma^2) = \prod_{i = 1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - (\beta_0 + \beta_1 * x_i))^2}{2\sigma^2}}\]

this gives us the overall likelihood of our training data given the values of \(\beta_0\) and \(\beta_1\). We want to choose values of \(\beta_0\) and \(\beta_1\) that maximize the likelihood from the equation above. To do so, we typically take the log of the likelihood (remember logs turn multiplications into sums, which makes the math easier) and maximize that by setting its partial derivatives (w.r.t. \(\beta_0\) and \(\beta_1\)) equal to 0.

When we do that, it turns out we get the exact same estimates as from least squares!

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1* \bar{x}\]

and

\[ \hat{\beta}_1 = \frac{Cov(x,y)}{Var(x)} = Corr(x,y) * \frac{sd(y)}{sd(x)} \]

These values for \(\beta_0\) and \(\beta_1\) are the ones that maximize the likelihood of our data and, as shown above, coincide with the least squares estimates.
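
As an illustration (again a sketch, not part of the original notes, using simulated data), we can maximize the log-likelihood numerically with optim() and confirm that it reproduces the least squares estimates:

# Numerically maximize the log-likelihood and compare with lm()
set.seed(1)
x_toy <- rnorm(50)
y_toy <- 2 + 0.5 * x_toy + rnorm(50)

# Negative log-likelihood of the normal linear model
negloglik <- function(par) {
  beta0 <- par[1]; beta1 <- par[2]; sigma <- exp(par[3])  # exp() keeps sigma positive
  -sum(dnorm(y_toy, mean = beta0 + beta1 * x_toy, sd = sigma, log = TRUE))
}

fit <- optim(c(0, 0, 0), negloglik)  # optim() minimizes, hence the negative log-likelihood
fit$par[1:2]                         # MLE of beta0 and beta1
coef(lm(y_toy ~ x_toy))              # least squares estimates -- essentially identical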

2 Linear Regression in R

R has a clear and consistent workflow for linear modeling:

  1. Specify your model using a formula interface (e.g., y ~ x1 + x2).
  2. Fit the model on the training data using a fitting function like lm(), glm(), or another modeling function.
  3. Assess model performance by comparing predictions to the observed outcomes.

2.1 Packages needed

# We use base R mostly but sometimes explore the tidyverse option, so load tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2.2 Read Amazon Book Data

The dataset contains information about a collection of books sold on Amazon, including their list price, physical dimensions (such as height, width, and thickness), weight, and number of pages. It also includes the Amazon price, which serves as the outcome variable we aim to predict.

We’ll use this dataset to explore how different characteristics of a book relate to its selling price, applying linear regression to model and interpret these relationships.

# Load the readr package (comes with tidyverse)
# library(readr)

# Read the Amazon data (tab-separated)
ama <- read_tsv("02-data/amazon-books.txt")
Rows: 325 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): Title, Author, Hard/ Paper, Publisher, ISBN-10
dbl (8): List Price, Amazon Price, NumPages, Pub year, Height, Width, Thick,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# if you don't want to use tidyverse you can use:
# ama <- read.delim(
#  "02-data/amazon-books.txt",
#  header = TRUE,
#  stringsAsFactors = FALSE)

# look at the 10 first rows of the data
knitr::kable(head(ama,10))
| Title | Author | List Price | Amazon Price | Hard/ Paper | NumPages | Publisher | Pub year | ISBN-10 | Height | Width | Thick | Weight (oz) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1,001 Facts that Will Scare the S#*t Out of You: The Ultimate Bathroom Reader | Cary McNeal | 12.95 | 5.18 | P | 304 | Adams Media | 2010 | 1605506249 | 7.8 | 5.5 | 0.8 | 11.2 |
| 21: Bringing Down the House - Movie Tie-In: The Inside Story of Six M.I.T. Students Who Took Vegas for Millions | Ben Mezrich | 15.00 | 10.20 | P | 273 | Free Press | 2008 | 1416564195 | 8.4 | 5.5 | 0.7 | 7.2 |
| 100 Best-Loved Poems (Dover Thrift Editions) | Smith | 1.50 | 1.50 | P | 96 | Dover Publications | 1995 | 486285537 | 8.3 | 5.2 | 0.3 | 4.0 |
| 1421: The Year China Discovered America | Gavin Menzies | 15.99 | 10.87 | P | 672 | Harper Perennial | 2008 | 0061564893 | 8.8 | 6.0 | 1.6 | 28.8 |
| 1493: Uncovering the New World Columbus Created | Charles C. Mann | 30.50 | 16.77 | P | 720 | Knopf | 2011 | 0307265722 | 8.0 | 5.2 | 1.4 | 22.4 |
| 1861: The Civil War Awakening | Adam Goodheart | 28.95 | 16.44 | H | 460 | Knopf | 2011 | 1400040159 | 8.9 | 6.3 | 1.7 | 32.0 |
| A Christmas Carol and Other Christmas Writings (Penguin Classics) | Dickens | 20.00 | 13.46 | H | 336 | Penguin Classics Hardcover | 2010 | 141195851 | 7.8 | 5.3 | 1.2 | 15.5 |
| A Confederacy of Dunces | John Kennedy Toole | 15.00 | 8.44 | P | 405 | Grove Weidenfeld | 1987 | 802130208 | 8.2 | 5.3 | 0.8 | 11.2 |
| A Dance With Dragons | Martin RR, George | 35.00 | 18.81 | H | NA | Bantam | 2011 | 553801473 | 9.6 | 6.5 | 2.1 | NA |
| A Farewell To Arms | Ernest Hemingway | 30.00 | 19.80 | H | 304 | Scribner | 1997 | 0684837889 | 9.6 | 6.4 | 1.1 | 19.2 |

2.3 Clean the data

Let’s remove missing values…

# Check for missing (NA) values in each column
colSums(is.na(ama))
       Title       Author   List Price Amazon Price  Hard/ Paper     NumPages 
           0            1            1            0            0            2 
   Publisher     Pub year      ISBN-10       Height        Width        Thick 
           1            1            0            4            5            1 
 Weight (oz) 
           9 
dim(ama)
[1] 325  13
# Drop rows with any missing values
ama <- na.omit(ama)

# Reset the row index (rownames)
rownames(ama) <- NULL

# final dimensions of data
dim(ama)
[1] 310  13

2.4 Separate data into X (predictors) and y (outcome)

We want to prepare our data for modeling by separating the predictors (\(X\)) from the outcome variable (\(y\)). The predictors are the input features used to explain or predict the outcome — in this case, various physical attributes of books. The outcome (Amazon Price) is the numeric value we’re trying to predict.

# Define predictor column names
predictors <- c("List Price", "NumPages", "Weight (oz)", "Thick", "Height", "Width")

# Subset predictors (X)
X <- ama[, predictors]

# Subset outcome (y)
y <- ama[["Amazon Price"]]  # or ama$`Amazon Price`

2.5 Fitting the model

Before fitting a regression model, it’s often useful to standardize the predictor variables so they’re on the same scale. Here, we transform each predictor into a z-score, meaning we subtract its mean and divide by its standard deviation.
This ensures that all predictors contribute comparably to the model, especially when their original units differ (for example, weight in ounces vs. height in inches).
In R, this can be done easily using the scale() function.

# Standardize (z-score) the predictor variables
X_z <- as.data.frame(scale(X))

Now we fit a linear model with lm() using the standardized predictors, so the model is trained on consistently scaled data.

#Fit a linear regression model using the standardized data
model <- lm(y ~ ., data = X_z)

# View model summary
summary(model)

Call:
lm(formula = y ~ ., data = X_z)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.6291  -1.6496  -0.3563   1.2966  22.9981 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   12.58765    0.18715  67.261  < 2e-16 ***
`List Price`  11.42032    0.23158  49.314  < 2e-16 ***
NumPages       0.23189    0.34195   0.678  0.49820    
`Weight (oz)` -0.42037    0.33361  -1.260  0.20862    
Thick         -1.16151    0.35601  -3.263  0.00123 ** 
Height        -0.09905    0.24145  -0.410  0.68194    
Width         -0.19750    0.24447  -0.808  0.41981    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.295 on 303 degrees of freedom
Multiple R-squared:  0.9206,    Adjusted R-squared:  0.919 
F-statistic: 585.4 on 6 and 303 DF,  p-value: < 2.2e-16
## Fitted (predicted) values
y_pred <- predict(model)

# view some of the observed and predicted values
head(cbind(y,y_pred))
      y     y_pred
1  5.18  8.6869171
2 10.20 10.9158750
3  1.50  0.6459632
4 10.87  7.8749115
5 16.77 21.7538857
6 16.44 18.0935043

2.6 Evaluate model performance

After fitting and predicting, we assess how well our model performs by comparing the predicted values (y_pred) to the actual outcomes (y).
We compute several common metrics:

  • MSE (Mean Squared Error): the average of the squared prediction errors; it penalizes large errors more heavily.
  • MAE (Mean Absolute Error): the average of the absolute prediction errors; it gives an intuitive sense of the typical deviation.
  • MAPE (Mean Absolute Percentage Error): expresses prediction errors as a percentage of the actual values.
  • R² (R-squared): the proportion of variance in the outcome explained by the model (higher is better).
# Evaluate model performance
MSE  <- mean((y - y_pred)^2)
MAE  <- mean(abs(y - y_pred))
MAPE <- mean(abs((y - y_pred) / y)) * 100
R2   <- 1 - sum((y - y_pred)^2) / sum((y - mean(y))^2)

# alternatively for R2: summary(model)$r.squared

cbind(MSE,MAE,MAPE,R2)
          MSE     MAE     MAPE        R2
[1,] 10.61234 2.16044 19.62124 0.9205886

2.7 Question

What do these metrics tell us about the model’s performance?
Discuss whether this model seems to fit the data well, and what potential issues (if any) might still exist.

These results suggest that the model performs quite well:

  • MSE (10.61) and MAE (2.16) indicate that, on average, predictions are only a few units away from the actual values.
    MSE penalizes larger errors more heavily, while MAE provides the average absolute deviation.
  • MAPE (19.6%) means predictions are, on average, about 20% off from the true values, which is a reasonably good level of accuracy depending on the context.
  • R² = 0.92 indicates that 92% of the variation in the outcome variable is explained by the predictors, which is a strong fit.

Overall, this model explains most of the variability in the data and makes relatively accurate predictions.
However, we’d still want to check residual plots for patterns or bias (e.g., under- or over-prediction at certain ranges) to confirm the model’s assumptions hold true.

2.8 Checking Assumptions

Remember there are 3 main assumptions\(^*\) of Linear Regression:

  • Linearity
  • Homoskedasticity
  • Normality of Errors

\(^*\) There’s also an assumption of Independence, aka that the value of one data point does not affect the value of another data point.

2.8.1 Linearity

We assess linearity by either:

  • plotting one predictor at a time against the outcome and checking for any clear non-linear patterns, or
  • plotting the residuals against the predicted values and looking for systematic curvature (shown further below).

We start with the predictor-by-predictor plots, first in base R and then as a faceted ggplot:

# The base R version:

# Set up a 2x3 layout for the plots
par(mfrow = c(2, 3))
# Loop through each predictor and plot Amazon Price vs that predictor
for (c in predictors) {
  plot(
    ama[[c]], ama[["Amazon Price"]],
    main = paste("Amazon Price vs.", c),
    xlab = c,
    ylab = "Amazon Price",
    pch = 19, col = "steelblue"
  )
}

# predictors vector from earlier:
# predictors <- c("List Price", "NumPages", "Weight (oz)", "Thick", "Height", "Width")

# Reshape to long format: one column for predictor name, one for its values
ama_long <- ama |>
  pivot_longer(
    cols = all_of(predictors),
    names_to = "predictor",
    values_to = "value"
  )

# Faceted scatterplot: one panel per predictor
ggplot(ama_long, aes(x = value, y = `Amazon Price`)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ predictor, scales = "free_x") +
  labs(
    title = "Amazon Price vs. Predictors",
    x = NULL,
    y = "Amazon Price"
  ) +
  theme_minimal()

Next we plot the predicted values (x-axis) against the residuals (y-axis) and look for clear non-linear patterns:

# The base R version:

# Create a data frame of residuals and predictions
fitdata <- data.frame(
  errors = y - y_pred,
  pred = y_pred
)

# Base R residual plot
plot(
  fitdata$pred, fitdata$errors,
  main = "Residuals vs. Predicted Values",
  xlab = "Predicted Values",
  ylab = "Residuals",
  pch = 19, col = "steelblue"
)

# Add horizontal reference line at 0
abline(h = 0, col = "red", lwd = 2, lty = 2)

# Residual plot using ggplot
ggplot(fitdata, aes(x = pred, y = errors)) +
  geom_point(color = "black", alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red", linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Residuals vs. Predicted Values",
    x = "Predicted Values",
    y = "Residuals"
  )

Note: from now on I will only use ggplot.

2.8.2 Homoskedasticity

We assess homoskedasticity (constant variance of residuals) using the same residual plot as above. Recall that “homo” means same and “hetero” means different, while “skedasticity” refers to the spread or variance of the errors.

When examining the plot, look for whether the residuals appear to have a consistent spread across the range of predicted values.
Do some areas along the x-axis show much larger or smaller errors than others? The variance doesn’t need to be perfectly uniform, but noticeable patterns or changes in spread could suggest heteroskedasticity, meaning the model’s error variance isn’t constant.
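
One additional visual check (a sketch, not part of the original workflow) is to plot the absolute residuals against the predicted values; if the typical size of the errors trends up or down with the predictions, that points toward heteroskedasticity:

# Absolute residuals vs. predicted values (uses the fitdata frame from above)
ggplot(fitdata, aes(x = pred, y = abs(errors))) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE, color = "red") +  # smoothed trend in the error magnitude
  theme_minimal() +
  labs(
    title = "Absolute Residuals vs. Predicted Values",
    x = "Predicted Values",
    y = "|Residuals|"
  )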

2.8.3 Normality

Normality doesn’t really impact prediction; in fact, many argue it doesn’t even impact inference (\(p\)-values/confidence intervals). So we won’t dwell on it here. Here’s code to check it using a Q–Q (quantile–quantile) plot, which compares the distribution of the model’s residuals to a theoretical normal distribution.
If the residuals are approximately normal, the points should fall along the dashed red line.

# Q–Q plot of residuals (normality check)
ggplot(fitdata, aes(sample = errors)) +
  stat_qq(color = "black", alpha = 0.5) +
  stat_qq_line(color = "red", linetype = "dashed", linewidth = 1) +
  theme_minimal() +
  labs(
    title = "Q–Q Plot of Residuals",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )

2.9 Summary

Overall, we can observe that:

  • Linearity:

    • From the individual predictor plots, List Price displays a strong and roughly linear relationship with Amazon Price: as the list price increases, the Amazon price tends to increase in a nearly proportional way, so it fits well within a linear framework.

    • The remaining predictors (Height, Width, Thick, Weight (oz), and NumPages) show weak or nonlinear relationships, often forming visible clusters or plateaus that likely reflect standard book formats and physical constraints (most books share similar dimensions).

    • These patterns suggest that while List Price is likely a strong predictor, the other features may require nonlinear transformations or may contribute less explanatory power in a linear model.

  • Homoskedasticity (Constant Variance):

    • In several cases, there appears to be heteroskedasticity, where the variability in Amazon Price increases with the predictor (especially noticeable for List Price and Weight (oz)).
    • The residuals vs. predicted values plot shows that residuals are mostly centered around zero, but the spread is not perfectly uniform. There appears to be slightly greater variability at lower predicted values, which suggests mild heteroskedasticity.
      However, no severe funneling pattern is observed, so the assumption is only moderately violated.

In summary, the data indicate that a simple linear regression may fit well for List Price, but more complex relationships could exist for the other predictors.

2.10 Model Coefficients

After fitting a linear regression model, we can inspect the estimated coefficients to understand how each predictor contributes to the outcome. Each coefficient represents the expected change in the response variable (Amazon Price) for a one-unit increase in that predictor, holding all other variables constant. Because we standardized the predictors before fitting, a one-unit increase here corresponds to a one-standard-deviation increase in the original variable.

In R, the coef() function returns both the intercept and the slope coefficients for each predictor.
By converting these values into a data frame, we can easily view and interpret them in a tabular format.
The intercept represents the model’s predicted value when all predictors are zero, while each coefficient shows the direction and strength of the predictor’s relationship with the outcome.

# Extract model coefficients and intercept
# (assuming you already have a fitted model called 'model' from lm())

# Create a data frame of coefficients
coefficients_df <- data.frame(
  Name = names(coef(model)),
  Coef = as.numeric(coef(model))
)

# Display the table of coefficients
coefficients_df
           Name        Coef
1   (Intercept) 12.58764516
2  `List Price` 11.42032143
3      NumPages  0.23189316
4 `Weight (oz)` -0.42036500
5         Thick -1.16150737
6        Height -0.09904558
7         Width -0.19749780

The model’s coefficients show how each predictor relates to the Amazon Price, holding all other variables constant.

  • The Intercept (12.59) represents the predicted Amazon Price when all standardized predictors are zero, that is, for a book at the mean of every predictor; it serves as a baseline for the model.
  • List Price (11.42) has by far the largest positive effect: a one-standard-deviation increase in list price is associated with a predicted increase of roughly $11.42 in Amazon Price, all else equal.
  • NumPages (0.23) also shows a small positive relationship, suggesting that books with more pages tend to cost slightly more.
  • The remaining variables, Weight (oz), Thick, Height, and Width, have negative coefficients, indicating that when holding the other factors constant, increases in these dimensions are associated with slightly lower predicted prices. These effects are relatively small and may reflect overlapping information among the physical features of books.

Overall, the model suggests that List Price is the dominant predictor of Amazon Price, while the other physical attributes contribute modestly or redundantly to the price prediction.

2.11 Question

Looking at the coefficients, most physical attributes (Weight (oz), Thick, Height, Width) have negative values.
What might explain why these coefficients are negative, even though we might expect larger or heavier books to cost more?

The negative coefficients likely arise from multicollinearity; the physical attributes of a book (height, width, thickness, and weight) are strongly correlated with one another and with NumPages and List Price.
When these correlated predictors are included together in the same linear model, their individual coefficients can shift direction or appear negative, even if the overall relationship with price is positive.

In other words, once List Price and NumPages are accounted for, the remaining variation explained by size and weight might actually correspond to lower prices (e.g., large but inexpensive books such as textbooks or coffee-table books).
This doesn’t mean heavier or thicker books are truly cheaper, it means their unique contribution to predicting price, after controlling for other variables, happens to be negative.
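
A quick way to see this overlap in our data (a sketch, not part of the original analysis) is to inspect the correlations among the standardized predictors from earlier; values close to 1 in absolute terms signal strongly shared information:

# Correlations among the standardized predictors (X_z from earlier)
round(cor(X_z), 2)

# If the car package is installed, variance inflation factors give a
# model-based view of the same issue (larger VIFs = more collinearity):
# car::vif(model)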

3 Classwork

3.1 Modeling Beyoncé Song Popularity

This dataset contains song-level data scraped from Spotify’s API and includes a variety of numerical features that describe Beyoncé’s songs, such as danceability, energy, loudness, and speechiness. Choose your own continuous response variable to model and predict.

Think about what interests you most:

  • Do you want to predict how danceable a song is based on its acoustic and rhythmic qualities?
  • Or are you more curious about what makes a song sound energetic, acoustic, or happy (valence)?

Choose any continuous numeric variable as your response, then use the other features as predictors to build a linear regression model.
Your goal is to see whether the musical characteristics of Beyoncé’s songs can meaningfully explain or predict that chosen outcome.

# Load data
b <- read.csv("02-data/Beyonce_data.csv")

# Preview the data

# Define predictor variables

# Fit a linear regression model

# Generate predictions

# Assess model performance

# Check residuals (assumptions)

3.2 Simulation

Sometimes, we’ll run a simulation to see how our models perform in different scenarios. This helps us learn more about how the model works. Below is a custom function linear_simulation() that

  1. Generates fake data about cats’ weights and lengths
  2. Fits a Linear Regression Model
  3. Grabs the coefficients
  4. Returns the data and Coefficients for us to look at
linear_simulation <- function(n = 100, trueCoef = 0.04, intercept = 0.2, error_sd = 1) {
  
  # 1. Generate fake data for cat length and weight -----------------------
  # Simulate "cat length" from a standard normal distribution
  length <- rnorm(n, mean = 0, sd = 1)
  
  # Simulate "weight" using a linear model with some random error
  weight <- intercept + length * trueCoef + rnorm(n, mean = 0, sd = error_sd)
  
  cats <- data.frame(length = length, weight = weight)
  
  
  # 2. Fit a Linear Regression Model -------------------------------------
  model <- lm(weight ~ length, data = cats)
  
  
  # 3. Extract Coefficients ----------------------------------------------
  coef_table <- data.frame(
    Names = names(coef(model)),
    Coef = as.numeric(coef(model))
  )
  
  
  # 4. Return data and coefficients
  return(list(coef = coef_table, data = cats))
}

The data are created using a known linear equation:

\[ \text{weight} = \text{intercept} + (\text{trueCoef} \times \text{length}) + \text{random error} \]

We can run this simulation hundreds of times.

  • n is the number of samples in each fake dataset
  • trueCoef is the true coefficient for cat length
  • intercept is the true intercept for the model
  • error_sd tells us how spread out the data is around the regression line

After generating the data, the function fits a linear regression model using lm() and returns both the estimated coefficients and the simulated data. This allows us to explore how well the model recovers the true coefficient values (trueCoef and intercept) as we vary the number of observations or the noise level.

To explore how sampling variability affects our regression estimates, we run the linear_simulation() function 500 times.
Each run generates a new random dataset of cat lengths and weights, fits a linear regression, and stores the estimated coefficients.

We then combine the results into two data frames:

  • coef_df contains the estimated coefficients (intercept and length) from each simulation.
  • data_df contains all the simulated data points, labeled by which simulation they came from.

By analyzing or visualizing coef_df, we can see how much the estimated slope and intercept vary across repeated samples, an empirical demonstration of sampling variability and the distribution of regression estimates.

# Play around with these numbers 
n <- 100
trueCoef <- 0.45   # don't change
intercept <- 6     # don't change
error_sd <- 1

# Run regression simulation 500 times 
set.seed(123)  # for reproducibility

# Run regression simulation 500 times 
simulations <- replicate(
  500,
  linear_simulation(n = n, trueCoef = trueCoef, intercept = intercept, error_sd = error_sd),
  simplify = FALSE
)

# Extract coefficients from 500 simulations
coef_df <- do.call(rbind, lapply(simulations, function(x) x$coef))
coef_df$simulation_no <- rep(1:500, each = nrow(simulations[[1]]$coef))

# Extract data from 500 simulations
data_df <- do.call(rbind, lapply(simulations, function(x) x$data))
data_df$simulation_no <- rep(1:500, each = n)

Now that we’ve run a bunch of simulations with the SAME true coefficient and intercept (but different random samples), let’s look at the results of our 500 regression models.

First, let’s just make some scatter plots to see some of the simulations. Notice how similar or different the simulations are from each other.

# Choose number of simulations to visualize
n_plot <- 9

# Subset data for the first 9 simulations
chosen_datasets <- subset(data_df, simulation_no <= n_plot)

# Faceted scatterplots for each simulation
ggplot(chosen_datasets, aes(x = length, y = weight, color = factor(simulation_no))) +
  geom_point() +
  facet_wrap(~ simulation_no) +
  theme_minimal() +
  labs(
    title = "Simulated Datasets of Cat Length vs Weight",
    x = "Length (standardized units)",
    y = "Weight",
    color = "Simulation Number"
  )

Let’s look at the coefficient values from all the linear regressions we ran. This histogram shows the estimated length coefficient from 500 repeated regression simulations. Although each dataset was generated from the same true underlying relationship, random variation in the data causes the estimated slope to differ slightly across runs.

The red dashed line marks the mean estimated coefficient, which should be close to the true value (trueCoef = 0.45).
This plot illustrates the sampling distribution of the slope estimate, that is how regression coefficients vary across repeated samples due to random noise, even when the underlying relationship remains constant.

# Filter to include only the 'length' coefficients
coef_only <- subset(coef_df, Names == "length")

# Calculate mean coefficient value
mean_coef <- mean(coef_only$Coef)

# Plot histogram of coefficient estimates
ggplot(coef_only, aes(x = Coef)) +
  geom_histogram(color = "black", fill = "steelblue", bins = 30) +
  geom_vline(xintercept = mean_coef, color = "red", linetype = "dashed", linewidth = 1.5) +
  theme_minimal() +
  labs(
    title = "Distribution of Length Coefficients Across 500 Simulations",
    x = "Estimated Coefficient for Length",
    y = "Frequency"
  )

3.3 Question

Look at the different values you got for the coefficient of length. We set the TRUE coefficient value to be 0.45; think about and describe how spread out the estimates from our 500 regression models are. Does seeing how different our coefficient estimates can be change how you think about the coefficient estimates you get in regression models on real data?

Across the 500 simulated regressions, the estimated length coefficients cluster around the true value of 0.45, but they vary from sample to sample.
This spread reflects sampling variability; even though each dataset comes from the same true relationship, random noise in the data produces slightly different estimates each time.

Most of the slopes are fairly close to 0.45, but some are noticeably higher or lower.
This shows that our coefficient estimates aren’t fixed truths, they’re estimates with uncertainty.
If we repeatedly sampled new data from the same population, our slope estimate would fluctuate around the true value, forming a distribution of possible outcomes.

Seeing this variation highlights why we report confidence intervals and \(p\)-values in regression: they describe the uncertainty around our estimated coefficients.
In real-world analyses, a single regression result is just one possible estimate from a range of plausible values.
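
To connect this back to the output of a single regression (a sketch, not part of the original notes): the standard deviation of the 500 simulated slope estimates should roughly match the standard error that lm() reports for any one fit.

# Spread of the simulated slopes vs. the standard error from one fit
sd(coef_only$Coef)

one_fit <- lm(weight ~ length, data = subset(data_df, simulation_no == 1))
summary(one_fit)$coefficients["length", "Std. Error"]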

3.4 Play with n and error_sd

Here are some suggestions (a sketch showing how to re-run the simulation with new settings follows the list):

  • Change n, the number of data points in each sample, to be very small (say 10), how does this change the results you saw?
  • Change n, the number of data points in each sample, to be very large (say 1,000), how does this change the results you saw?
  • Change the error_sd term, which is a measure of how much error is in the model. Less error means the data are scattered tightly around the regression line; more error means the data are scattered very loosely around it. How does changing error_sd change the results you originally saw?
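
To get you started, here is a minimal sketch of how to re-run the simulation with a different sample size and summarize the spread of the slope estimates (n = 10 is just one of the suggestions above; swap in whatever values you want to explore):

# Re-run the simulation with a smaller sample size per dataset
sims_small_n <- replicate(
  500,
  linear_simulation(n = 10, trueCoef = trueCoef, intercept = intercept, error_sd = error_sd),
  simplify = FALSE
)

# Collect the slope estimates and summarize how spread out they are
coef_small_n <- do.call(rbind, lapply(sims_small_n, function(x) x$coef))
sd(subset(coef_small_n, Names == "length")$Coef)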

3.5 Violations of Linearity

Here, you’ll compare how linear regression performs on data that does and does not meet the assumption of linearity.

  • In the first code chunk (#nonLinReg), fit a linear regression model on data that is nonlinear, that is, data where the relationship between x and y bends or curves.
  • In the second code chunk (#LinReg), fit a model on linear data, that is data that satisfies the assumption of linearity.
  • For both models, create a residual plot showing the predicted values on the x-axis and the residuals (errors) on the y-axis.
# Load nonlinear data
df_nonlin <- read.csv("02-data/xy-nonlin.csv")

# Separate X and y
X <- data.frame(x = df_nonlin$x)
y <- df_nonlin$y

# Z-score the predictor (standardize x)
x_z <- as.numeric(scale(X$x))

# Fit linear regression on standardized x

# Predict on training data

# Prediction + error (residuals) data frame

# Residual plot 
# Load linear data
df_lin <- read.csv("02-data/xy-lin.csv")

# Separate X and y
X <- data.frame(x = df_lin$x)
y <- df_lin$y

# Z-score the predictor (standardize x)
x_z <- as.numeric(scale(X$x))

# Fit linear regression on standardized x

# Predict on training data

# Prediction + error (residuals) data frame

# Residual plot 

3.6 Questions

  • How do the residual patterns differ between the linear and nonlinear datasets? What visual cues suggest that the linearity assumption is being violated?

  • What are some possible consequences of violating the assumption of linearity?

  • If we used a linear regression model on the nonlinear data, are there specific ranges of x where the model would consistently over-predict or under-predict the outcome? How can you tell from the residual plot?