If the true relationship between the predictor and the outcome isn’t actually linear, the model’s performance can suffer. More importantly, this poor performance often isn’t uniform: it may be especially bad for specific ranges of predictor values.
How to Choose Model Parameters
Linear regression models include two types of parameters: coefficients, which describe how changes in the predictors affect the predicted outcome, and an intercept, which represents the predicted value when all predictors are zero.
To build an effective model, we need to estimate these parameters in a way that best fits the data. We covered two main approaches for doing this: Least Squares and Maximum Likelihood Estimation (MLE).
Least Squares
As the name implies, Least Squares aims to choose parameter values that minimize the sum of squared errors.
\[ \text{SSE} = \sum_{i = 1}^n(y_i - \hat{y}_i)^2 \]
where \(\hat{y}_i\) represents our model’s predicted value of \(y_i\). For a simple linear regression with a single predictor, we get our prediction using the formula:
\[ \hat{y}_i = \beta_0 + \beta_1*x_i \]
So let’s plug that in for \(\hat{y_i}\):
\[ \text{SSE} = \sum_{i = 1}^n(y_i - \beta_0 - \beta_1*x_i)^2 \]
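Before minimizing it, it can help to see the \(\text{SSE}\) as a function we can actually evaluate. Here’s a minimal sketch (in Python with NumPy, which these notes don’t otherwise use; the data values are made up for illustration):

```python
import numpy as np

def sse(beta0, beta1, x, y):
    """Sum of squared errors for a candidate intercept and slope."""
    y_hat = beta0 + beta1 * x          # model predictions
    return np.sum((y - y_hat) ** 2)    # squared residuals, summed

# toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

print(sse(0.0, 2.0, x, y))  # SSE for the guess beta0 = 0, beta1 = 2
```

Least Squares picks the pair \((\beta_0, \beta_1)\) that makes this number as small as possible.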
Now all we need to do is set the partial derivatives of the \(\text{SSE}\) to 0 and solve. The formula above has two parameters that we’re interested in: \(\beta_0\) and \(\beta_1\), so we’ll take the partial derivatives of \(\text{SSE}\) with respect to each of them:
\[\frac{\partial SSE}{\partial \beta_0} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-1)\] \[\frac{\partial SSE}{\partial \beta_1} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-x_i)\]
and set them equal to 0.
\[\frac{\partial SSE}{\partial \beta_0} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-1) = 0\] \[\frac{\partial SSE}{\partial \beta_1} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-x_i) = 0\]
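Dividing each equation by \(-2\) and rearranging gives the so-called normal equations:

\[\sum_{i = 1}^n y_i = n\beta_0 + \beta_1\sum_{i = 1}^n x_i\] \[\sum_{i = 1}^n x_i y_i = \beta_0\sum_{i = 1}^n x_i + \beta_1\sum_{i = 1}^n x_i^2\]

Dividing the first equation by \(n\) shows that the fitted line passes through the point \((\bar{x}, \bar{y})\), which is where the formula for the intercept below comes from.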
Solving these for \(\beta_0\) and \(\beta_1\), we get:
\[\hat{\beta_0} = \bar{y} - \hat{\beta_1}* \bar{x}\]
and
\[ \hat{\beta_1} = \frac{Cov(x,y)}{Var(x)} = Corr(x,y) * \frac{sd(y)}{sd(x)} \]
These values for \(\beta_0\) and \(\beta_1\) are the ones that minimize our Sum of Squared Errors (\(\text{SSE}\)), and therefore give us the line that fits the training data best in the least squares sense.
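As a quick check, here’s a short sketch (again Python with NumPy, not part of the original notes) that computes these closed-form estimates directly and compares them against NumPy’s own least squares fit:

```python
import numpy as np

# toy data, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 3.8, 6.1, 7.9, 10.2])

# closed-form least squares estimates
beta1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # Cov(x, y) / Var(x)
beta0_hat = y.mean() - beta1_hat * x.mean()                  # ybar - beta1 * xbar

# compare with NumPy's own degree-1 polynomial (i.e., linear) fit
beta1_np, beta0_np = np.polyfit(x, y, deg=1)
print(beta0_hat, beta1_hat)   # should match beta0_np, beta1_np
print(beta0_np, beta1_np)
```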
Maximum Likelihood Estimation
Another way to estimate the parameters (coefficients and intercept) of a linear regression model is through Maximum Likelihood Estimation (MLE), a method we’ll revisit several times in this course. MLE selects parameter values that make the observed training data as likely as possible under the model.
Remember, a model is our mathematical description of the world. If a model assigns very low probability to data that looks like what we actually observed, it’s probably not a good description of reality.
If we assume the errors are normally distributed with variance \(\sigma^2\) (so that \(y_i \sim N(\beta_0 + \beta_1*x_i, \sigma^2)\)), the likelihood of an individual data point in our model is:
\[ p(y_i | x_i; \beta_0, \beta_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - (\beta_0 + \beta_1 * x_i))^2}{2\sigma^2}}\]
\(^\text{(notice that the numerator in the exponent is just the squared error for that data point)}\)
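As a small sketch (Python with SciPy, an assumption on my part; the numbers are made up), evaluating this density is just a normal pdf centered at the model’s prediction:

```python
from scipy.stats import norm

def point_likelihood(y_i, x_i, beta0, beta1, sigma):
    """Gaussian likelihood of a single observation under the linear model."""
    mean = beta0 + beta1 * x_i          # the model's prediction for this point
    return norm.pdf(y_i, loc=mean, scale=sigma)

print(point_likelihood(y_i=4.1, x_i=2.0, beta0=0.0, beta1=2.0, sigma=1.0))
```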
If we have multiple data points in our training data, we multiply their likelihoods (assuming the observations are independent):
\[ \prod_{i = 1}^{n} p(y_i | x_i; \beta_0, \beta_1, \sigma^2) = \prod_{i = 1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - (\beta_0 + \beta_1 * x_i))^2}{2\sigma^2}}\]
This gives us the overall likelihood of our training data given the values of \(\beta_0\) and \(\beta_1\). We want to choose values of \(\beta_0\) and \(\beta_1\) that maximize the likelihood from the equation above. To do so, we typically take the log of the likelihood (remember, logs turn multiplications into sums, which makes the math easier) and maximize that by setting its partial derivatives (with respect to \(\beta_0\) and \(\beta_1\)) equal to 0.
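Concretely, taking the log of the product gives the log-likelihood:

\[ \ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i = 1}^{n}\big(y_i - (\beta_0 + \beta_1 * x_i)\big)^2 \]

The only part of this expression that depends on \(\beta_0\) and \(\beta_1\) is the sum of squared errors (with a minus sign in front), so maximizing the log-likelihood over \(\beta_0\) and \(\beta_1\) is the same as minimizing the \(\text{SSE}\).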
When we do that, it turns out we get the exact same estimates as from least squares!
\[\hat{\beta_0} = \bar{y} - \hat{\beta_1}* \bar{x}\]
and
\[ \hat{\beta_1} = \frac{Cov(x,y)}{Var(x)} = Corr(x,y) * \frac{sd(y)}{sd(x)} \]
These values for \(\beta_0\) and \(\beta_1\) are the ones that maximize the likelihood of our data, and they match the least squares estimates exactly.
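To see this numerically, here’s a minimal sketch (Python with NumPy and SciPy, not part of the original notes; the simulated data are made up) that maximizes the log-likelihood directly and recovers essentially the same estimates as the closed-form least squares formulas:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# simulated toy data, made up for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.5 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

def neg_log_likelihood(params):
    """Negative Gaussian log-likelihood of the data under a linear model."""
    beta0, beta1, log_sigma = params
    sigma = np.exp(log_sigma)            # parameterize on the log scale to keep sigma positive
    mean = beta0 + beta1 * x
    return -np.sum(norm.logpdf(y, loc=mean, scale=sigma))

# maximize the likelihood by minimizing its negative
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0, 0.0]))
beta0_mle, beta1_mle, _ = result.x

# closed-form least squares estimates for comparison
beta1_ls = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0_ls = y.mean() - beta1_ls * x.mean()

print(beta0_mle, beta1_mle)  # should closely match...
print(beta0_ls, beta1_ls)    # ...the least squares estimates
```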