Classification I
Here is a quick preview of the full logistic regression workflow we will build up in this chapter (using a generic dataset df with a binary 0/1 outcome):
# Split data (example dataset)
set.seed(123)
idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train <- df[idx, ]
test <- df[-idx, ]
# 1. Define model formula
formula <- outcome ~ predictor1 + predictor2
# 2 & 3. Fit logistic regression model
log_model <- glm(formula, data = train, family = binomial)
# 4. Predict probabilities
pred_probs <- predict(log_model, newdata = test, type = "response")
# Convert to class predictions (optional threshold = 0.5)
pred_class <- ifelse(pred_probs > 0.5, 1, 0)
# 5. Assess model
table(Predicted = pred_class, Actual = test$outcome)
1 Logistic Regression
In our Linear Regression lectures, we talked about adding non-linearity through Feature Engineering, but that’s not the only way! We can also use link functions to add non-linearity.
Link functions are just algebra we do to the linear prediction (\(\mathbf{X}\beta\)) in order to get the predicted value we actually want (e.g. a probability).
\[\underbrace{y = \mathbf{X}\beta}_\text{Linear Model}\] \[\underbrace{y = g^{-1}(\mathbf{X}\beta)}_\text{Generalized Linear Model}\]
Oddly, we often specify our link function using its inverse, hence the \(g^{-1}()\) instead of \(g()\). \(g^{-1}()\) takes the linear prediction and transforms it into our desired predicted value. \(g()\) takes our desired predicted value and transforms it back into our linear prediction.
In logistic regression, our goal is to predict a probability that a data point is in group 1. We talked about using:
- Linear Probability Models \(g^{-1}: y = x\)
- Odds Models \(g^{-1}: y = e^x\)
- Logistic Regression: \(g^{-1}: y = \frac{e^x}{1 + e^x}\)
Logistic Regression, using the link function \(g(x) = \log\left(\frac{x}{1-x}\right)\) and inverse link \(g^{-1}(x) = \frac{e^x}{1 + e^x}\), gives us a great sigmoid shape that takes linear predictions (\(y = \mathbf{X}\beta\)) and turns them into predicted probabilities (\(p = \frac{e^{\mathbf{X}\beta}}{1 + e^{\mathbf{X}\beta}}\)).
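To see the link and inverse link in action, here is a minimal R sketch (the helper names inv_logit and logit are just for illustration; base R's plogis() and qlogis() compute the same quantities):
# Inverse link: linear prediction (log odds) -> probability
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))
# Link: probability -> linear prediction (log odds)
logit <- function(p) log(p / (1 - p))
eta <- c(-3, 0, 2)        # example linear predictions X*beta
inv_logit(eta)            # probabilities squeezed between 0 and 1
logit(inv_logit(eta))     # recovers the original linear predictions
# Base R equivalents: plogis(eta) and qlogis(p)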
1.1 Maximum Likelihood Estimation
Just like with Linear Regression, we can use Maximum Likelihood Estimation to choose the parameters (intercept and coefficients) of the model. But we have a different likelihood.
In linear regression, we assumed that our errors are normally distributed around the regression line. For logistic regression, we assume that our outcomes are Bernoulli distributed. The Bernoulli distribution is a discrete distribution (since our outcome is discrete, a.k.a. categorical) that tells you the probability of being 0 or 1.
1.2 Bernoulli Likelihood
The formula for a Bernoulli distribution for a single data point \(x\) is:
\[ f(y;p(x)) = p(x)^{y} * (1-p(x))^{1-y}\]
where \(y\) is the group the data point belongs to (either 0 or 1), and \(p(x)\) is the predicted probability of that data point being a 1.
For example, let’s say we’re looking at the probability that it’s sunny tomorrow. The predicted probability, according to the weather channel, is \(p(x) = 0.8\). The likelihood of it being sunny (\(y = 1\)) is:
\[ f(1;0.8) = 0.8^1 * (1-0.8)^{1-1} = 0.8\]
The likelihood of it not being sunny (\(y = 0\)) is: \[ f(0;0.8) = 0.8^0 * (1-0.8)^{1-0} = 0.2\]
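As a quick check, we can compute these two numbers in R by hand and with dbinom() (which, with size = 1, is exactly the Bernoulli distribution):
# Bernoulli likelihoods for p(x) = 0.8
p <- 0.8
p^1 * (1 - p)^(1 - 1)                # y = 1 (sunny): 0.8
p^0 * (1 - p)^(1 - 0)                # y = 0 (not sunny): 0.2
dbinom(c(1, 0), size = 1, prob = p)  # same values: 0.8 0.2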
1.3 Likelihood Function
But we don’t just have a SINGLE data point when fitting a logistic regression, we have MANY. So, we multiply the likelihood of each data point together to get the likelihood of the dataset:
\[\prod_{i = 1}^n p(x_i)^{y_i} * (1-p(x_i))^{1-y_i}\]
We want to choose parameters (e.g. \(\beta_0\), or \(\beta_1\)) that maximize this likelihood function. And how do we maximize it? We take its (partial) derivatives and set them equal to zero!
However, it turns out that it’s much easier to work with the log of this likelihood function, so we’re often working with the log likelihood and taking its derivatives (this will still find the optimal parameters for the model, as the values that maximize the log likelihood also maximize the likelihood):
\[\sum_{i = 1}^n \left[\, y_i * \log(p(x_i)) + (1-y_i) * \log(1-p(x_i)) \,\right]\]
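As a small illustration (with made-up probabilities, not from a real model), here is the log likelihood of a four-point dataset computed in R:
# Log likelihood of a tiny toy dataset
y <- c(1, 0, 1, 1)            # observed groups
p <- c(0.9, 0.2, 0.6, 0.8)    # predicted probabilities p(x_i) from some model
sum(y * log(p) + (1 - y) * log(1 - p))
# Closer to 0 is better; MLE picks the coefficients that maximize this sum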
1.4 Loss Function
Now it turns out that if we multiply this log likelihood by \(-\frac{1}{n}\), we get a really great loss function for logistic regression. Loss functions are metrics that
- measure the performance of your model, and
- assign lower scores to better-performing models
\[-\frac{1}{n} \sum_{i = 1}^n \left[\, y_i * \log(p(x_i)) + (1-y_i) * \log(1-p(x_i)) \,\right]\]
Log-Loss (also called Binary Cross Entropy) does just that! Thus we often use it as a loss function for Logistic Regression.
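Using the same made-up numbers as before, the log loss is just the negative average of that sum:
y <- c(1, 0, 1, 1)
p <- c(0.9, 0.2, 0.6, 0.8)
-mean(y * log(p) + (1 - y) * log(1 - p))   # log loss: -(1/n) * log likelihood
# The same quantity is implemented as a reusable helper later in this chapter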
1.5 Logistic Regression in R
Let’s build a Logistic Regression model in R. We’ll follow a similar workflow to what we used for linear models:
- Separate your data into predictors (X) and outcome (y), and optionally set up a train/test split.
- Create a model formula and initialize the logistic regression model using glm() with family = binomial.
- Fit the model to the training data.
- Use predict() on new data to obtain predicted probabilities or class predictions.
- Assess the model’s performance (e.g., accuracy, confusion matrix, ROC curve).
(The generic code sketch at the top of this chapter walks through these same steps.)
1.6 Breast Cancer Data
Let’s do an example with logistic regression to classify cancer diagnosis. We will:
1. Load and lightly clean the dataset.
2. Select predictors whose names end with "mean".
3. Split the data into training and testing sets (80/20).
4. Fit a logistic regression (glm, binomial family).
5. Predict class probabilities on the test set.
6. Evaluate performance using binary cross-entropy (log loss).
We import the Breast Cancer dataset and drop any rows with missing values to ensure the model can be fit without errors.
bc <- read.csv(
"04-Data/BreastCancer.csv",
stringsAsFactors = FALSE
)
bc <- na.omit(bc)
nrow(bc)
[1] 569
The outcome is diagnosis (Benign B vs Malignant M). As predictors we only use columns whose names end in “mean”.
# columns ending with "mean"
predictors <- grep("mean$", names(bc), value = TRUE)
# modeling frame: outcome + predictors
df <- data.frame(
diagnosis = factor(bc$diagnosis, levels = c("B","M")), # B=0, M=1
bc[, predictors]
)
str(df[, c("diagnosis", predictors[1:5])])
'data.frame': 569 obs. of 6 variables:
$ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
$ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
$ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
$ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
$ area_mean : num 1001 1326 1203 386 1297 ...
$ smoothness_mean: num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
We split once and keep a fixed seed for reproducibility. The model will train on train and be evaluated on test.
set.seed(123)
n <- nrow(df)
idx_train <- sample.int(n, size = floor(0.8 * n))
train <- df[idx_train, ]
test <- df[-idx_train, ]
c(n_train = nrow(train), n_test = nrow(test))
n_train  n_test
    455     114
We specify a formula that uses all “mean” predictors and fit a logistic regression using the binomial family.
formula <- as.formula(paste("diagnosis ~", paste(predictors, collapse = " + ")))
log_model <- glm(formula, data = train, family = binomial)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(log_model)$coefficients[1:6, , drop = FALSE]  # peek at a few coefficients
                   Estimate  Std. Error    z value     Pr(>|z|)
(Intercept) -8.33989439 14.53963337 -0.5735973 5.662403e-01
radius_mean -0.74101892 3.99281153 -0.1855883 8.527677e-01
texture_mean 0.38855965 0.07352802 5.2845112 1.260408e-07
perimeter_mean -0.27927112 0.54267867 -0.5146160 6.068214e-01
area_mean 0.04017311 0.01952026 2.0580216 3.958806e-02
smoothness_mean 72.05651889 34.28104247 2.1019349 3.555898e-02
We obtain predicted probabilities of malignancy for each test observation.
p_test <- predict(log_model, newdata = test, type = "response")
head(p_test)
        1         9        15        17        18        28
0.9999327 0.9878218 0.9184778 0.7153792 0.9987491 0.9998869
Lower log loss indicates better calibrated probability predictions. We map B -> 0 and M -> 1 and compute the average cross-entropy.
# Binary cross-entropy / log loss helper
log_loss <- function(y, p, eps = 1e-15) {
p <- pmin(pmax(p, eps), 1 - eps) # avoid log(0)
-mean(y * log(p) + (1 - y) * log(1 - p))
}
y_test <- ifelse(test$diagnosis == "M", 1, 0)
loss <- log_loss(y_test, p_test)
loss
[1] 0.1193837
A log loss of 0.119 indicates strong model performance. Log loss measures how well the model’s predicted probabilities match the true outcomes, where lower is better. A value close to 0 means the model is making accurate and well-calibrated predictions, showing high confidence when it is correct and low confidence when uncertain.
- Diagnosis is encoded with levels c(“B”,“M”) so that M corresponds to the positive class (1) for log-loss computation.
- No feature scaling is required for logistic regression to work, but standardization can sometimes help convergence or interpretability.
- Log loss evaluates the quality of the predicted probabilities, not just the final class labels (see the short demo below).
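To illustrate that last point, the log_loss() helper defined above rewards confident, well-calibrated probabilities, not just correct labels (the probability values here are made up):
y <- c(1, 1, 0)
log_loss(y, c(0.95, 0.90, 0.05))   # correct and confident -> small loss
log_loss(y, c(0.55, 0.60, 0.45))   # same predicted labels, less confident -> larger loss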
Logistic Regression coefficients are, by default, in terms of log odds, meaning that they tell you how much the predicted log odds of being in group 1 will change when the predictor increases by 1 unit. We grab the coefficients from the model above:
# ---- Extract Coefficients ----
coefs <- summary(log_model)$coefficients # from glm()
# Convert to a data frame with names
coef_df <- data.frame(
Name = rownames(coefs),
Coef = coefs[, "Estimate"],
row.names = NULL
)
# ---- Add Odds Ratios ----
coef_df$Odds <- exp(coef_df$Coef)
coef_df
                     Name        Coef         Odds
1 (Intercept) -8.33989439 2.387976e-04
2 radius_mean -0.74101892 4.766280e-01
3 texture_mean 0.38855965 1.474855e+00
4 perimeter_mean -0.27927112 7.563348e-01
5 area_mean 0.04017311 1.040991e+00
6 smoothness_mean 72.05651889 1.966747e+31
7 compactness_mean -1.35444224 2.580912e-01
8 concavity_mean 6.66078793 7.811662e+02
9 concave.points_mean 74.10069490 1.518878e+32
10 symmetry_mean 14.00627785 1.210178e+06
11 fractal_dimension_mean -39.60760668 6.289773e-18
1.7 Question
How do you interpret the results?
- Positive coefficients (Odds > 1) increase the likelihood of a malignant diagnosis. Variables like texture_mean, area_mean, concavity_mean, concave.points_mean, and symmetry_mean strongly raise the probability of cancer, with very large odds ratios indicating powerful predictors.
- Negative coefficients (Odds < 1) decrease the likelihood of malignancy. Higher radius_mean, perimeter_mean, compactness_mean, and fractal_dimension_mean values point more toward benign tumors.
- The intercept represents the baseline log-odds when predictors are zero (not directly interpretable on its own).
Overall, features related to concavity, smoothness, and symmetry strongly increase cancer risk, while higher compactness, radius, and fractal dimension values are associated with benign masses.
1.8 The Problem with Logistic Regression Coefficients
When you’re presenting your Logistic Regression Models to non-data people, you might want to be able to tell them which variables have the biggest impact on the predicted value. Typically, we might use coefficients for this because they give us a single number that summarizes the relationship between our predictors and our predicted value.
However, log odds are difficult to understand intuitively, especially if you’re not a data person. Thus, we might want a different way to present our results. Luckily, if we exponentiate our log odds coefficients, we get odds coefficients. These are easier to understand, as most people understand intuitively what odds are.
Remember, for odds the important threshold value is \(1\). So any odds coefficient \(>1\) has a direct/positive relationship with the outcome and anything with an odds coefficient \(< 1\) has an inverse/negative relationship with the outcome.
You can also use the odds coefs to give people an intuitive understanding of the relationship. If the odds coef is \(2\), then increasing the predictor by 1 unit causes your predicted odds to double. Similarly, if the odds coef is \(0.5\), then increasing the predictor by 1 unit causes your predicted odds to halve. If the odds coef is \(1.25\), then increasing the predictor by 1 unit causes your predicted odds to increase by \(25\%\).
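For example, using the texture_mean coefficient from the fitted model above (log-odds coefficient of about 0.39, odds coefficient of about 1.47), we can make this concrete:
# Odds interpretation of a single coefficient from log_model
exp(coef(log_model)["texture_mean"])   # about 1.47
# A 1-unit increase in texture_mean multiplies the predicted odds of malignancy
# by about 1.47 (a ~47% increase), holding the other predictors fixed.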
2 KNN
KNN is a simple, distance-based algorithm that lets us CLASSIFY data points based on what class the data points around them are. Birds of a feather…
Despite being distance-based, KNN is a classification algorithm. In other words, it is supervised machine learning, as it requires truth labels (the actual class/group). However, it does share characteristics with the clustering algorithms we will see later.
KNN can work with binary/categorical variables, but not without some tweaking which we do not cover here.
2.1 Hyperparameters
Hyperparameters are parameters in our model that are NOT chosen by the algorithm (we must supply them). We can either choose them:
- based on domain expertise (knowledge about the data)
- based on the data (hyperparameter tuning)
Why do we have to use a validation set when hyperparameter tuning?
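Here is a minimal sketch of the mechanics, using synthetic data and an assumed 60/40 train/validation split: we fit KNN for several candidate k values on the training rows, score each on the validation rows, and keep the k with the best validation performance, leaving any test set untouched.
# A minimal sketch of tuning k on a validation set (synthetic data for illustration)
library(class)
set.seed(42)
n <- 300
X <- data.frame(X1 = c(rnorm(n/2, mean = -2), rnorm(n/2, mean = 2)),
                X2 = c(rnorm(n/2, mean = -2), rnorm(n/2, mean = 2)))
y <- factor(rep(c(0, 1), each = n/2))
idx_train <- sample.int(n, size = floor(0.6 * n))   # training rows
idx_valid <- setdiff(seq_len(n), idx_train)         # validation rows
for (k in c(1, 3, 5, 15, 51)) {
  pred <- knn(train = X[idx_train, ], test = X[idx_valid, ],
              cl = y[idx_train], k = k)
  cat("k =", k, " validation accuracy =", round(mean(pred == y[idx_valid]), 3), "\n")
}
# Pick the k with the best validation accuracy; a held-out test set is never used here,
# so it can still give an honest estimate of the chosen model's performance afterwards.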
In this classwork we’ll use ggplot2 to plot the decision boundaries of KNN, and see how the size, shape, and overlap of clusters affect these boundaries.
Note: this will only work with 2D data (exactly two features); if you wanted to use it for your own data, you’d need to change the code accordingly.
plotKNN2D <- function(Xdf, y, k = 5) {
# Xdf: data frame with exactly 2 numeric features
# y: factor labels
if (ncol(Xdf) != 2) stop("Xdf must have exactly 2 columns (2D only)")
if (!is.factor(y)) y <- factor(y)
library(class)
library(ggplot2)
# Feature names
f1 <- colnames(Xdf)[1]
f2 <- colnames(Xdf)[2]
# Create grid range
x0_range <- seq(min(Xdf[[f1]]) - sd(Xdf[[f1]]),
max(Xdf[[f1]]) + sd(Xdf[[f1]]),
length.out = 100)
x1_range <- seq(min(Xdf[[f2]]) - sd(Xdf[[f2]]),
max(Xdf[[f2]]) + sd(Xdf[[f2]]),
length.out = 100)
grid <- expand.grid(
f1 = x0_range,
f2 = x1_range
)
colnames(grid) <- c(f1, f2)
# Predict using KNN
pred <- knn(train = Xdf, test = grid, cl = y, k = k)
grid$pred <- pred
# Plot using tidy eval with .data
p <- ggplot() +
geom_point(
data = grid,
aes(x = .data[[f1]], y = .data[[f2]], color = pred),
alpha = 0.25, size = 0.6
) +
geom_point(
data = Xdf,
aes(x = .data[[f1]], y = .data[[f2]], color = y),
size = 2
) +
theme_minimal() +
labs(color = "Class",
title = paste("KNN Decision Boundary (k =", k, ")")) +
scale_color_manual(values = c("#E69F00", "#0072B2"))
p
}
2.2 Let’s Explore
Let’s test this function with some fake data:
# --- Generate Fake Data (two blobs) ---
set.seed(1)
n <- 200
n_per <- n / 2
# centers: (-5, -5) and (5, 5); cluster_std = 1
X1 <- cbind(rnorm(n_per, mean = -5, sd = 1),
rnorm(n_per, mean = -5, sd = 1))
X2 <- cbind(rnorm(n_per, mean = 5, sd = 1),
rnorm(n_per, mean = 5, sd = 1))
X <- rbind(X1, X2)
colnames(X) <- c("X1", "X2")
X <- as.data.frame(X)
# labels 0/1 as a factor
y <- factor(c(rep(0, n_per), rep(1, n_per)))
# --- Plot KNN decision boundary (k = 1) ---
# assumes plotKNN2D(Xdf, y, k) is already defined
plotKNN2D(X, y, k = 1)
Using the dataset KNNclasswork.csv and the plotKNN2D() function, build KNN models with K = 1, 3, 5, 20, 50, 100.
How does the decision boundary change as K changes?
dd <- read.csv(
"04-Data/KNNclasswork.csv"
)
# k = 1
plotKNN2D(dd[, c("X1", "X2")], dd$y, 1)
# k = 3
# k = 5
# k = 20
# k = 50
# k = 100
2.3 How does changing k affect the decision boundary (imbalanced classes)?
Now let’s see how changing k affects the boundary when the groups have different numbers of samples. Using the plotKNN2D() function and the data loaded below (dd2), examine what happens to the decision boundaries as you try different k’s (try 1, 3, 5, 10, 25, 50, and 100).
How does changing k affect the decision boundary when the groups are imbalanced?
dd2 <- read.csv(
"04-Data/KNNclasswork2.csv"
)
# k = 1
plotKNN2D(dd2[, c("X1", "X2")], dd2$y, 1)
# k = 3
# k = 5
# k = 20
# k = 50
# k = 100
In balanced data, increasing k shifts the boundary from very wiggly and noise-sensitive to smooth and generalized.
- Too small k = overfitting
- Too large k = underfitting
In the imbalanced case, as k increases, the decision boundary becomes smoother and shifts toward the majority class, eventually overwhelming the minority class and reducing its predicted region, especially at very high k.
- Small k = preserves minority class but noisy
- Large k = smooth but biased toward majority
3 Exercises
3.1 Logistic Regression
Practice building Logistic Regression models. Using the purchases dataset, build a Logistic Regression model that predicts whether or not customers signed up for a rewards program based on their age, income, and whether they had made a previous purchase. Use an 80/20 train-test split. Note that logistic regression is not scale-based (it’s a linear model, not a distance-based one), so it doesn’t need standardization to function correctly. However, standardizing can improve training stability and interpretation consistency, especially when variables differ wildly in scale. Interpret the coefficients in terms of log odds, odds, and probability.
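Here is a hedged starter sketch, assuming a file called Purchases.csv in 04-Data/ with columns named rewards, age, income, and previous_purchase; adjust the path and names to match the actual dataset.
# Starter sketch for the purchases exercise (file and column names are assumptions)
purchases <- read.csv("04-Data/Purchases.csv", stringsAsFactors = FALSE)
purchases$rewards <- factor(purchases$rewards)   # outcome: signed up or not
set.seed(123)
idx <- sample(seq_len(nrow(purchases)), size = floor(0.8 * nrow(purchases)))
train <- purchases[idx, ]
test  <- purchases[-idx, ]
reward_model <- glm(rewards ~ age + income + previous_purchase,
                    data = train, family = binomial)
summary(reward_model)     # coefficients in log odds
exp(coef(reward_model))   # odds coefficients
p_test <- predict(reward_model, newdata = test, type = "response")  # probabilities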
3.2 Recommendation Systems (KNN)
“If you like _________________ you should listen to ___________________ by Taylor Swift.”
We’re going to build a Recommendation System to recommend Taylor Swift songs for people by letting users select a song, and then recommending the most similar songs (according to danceability, energy, instrumentalness, valence, loudness, liveness, speechiness, acousticness).
To do this, we’re going to load in our training data called TaylorSwiftSpotify.csv, fit a nearest-neighbors model (e.g., with FNN::get.knnx()), and then for each song in our new data called KNNCompareSpotify.csv we’ll find the 10 most similar songs and recommend them!
Below you have some code to get you started; note that you will need to install the package FNN. Fill in the missing parts!
# --- Read data ---
training_data <- read.csv(
"04-Data/TaylorSwiftSpotify.csv",
stringsAsFactors = FALSE
)
new_data <- read.csv(
"04-Data//KNNCompareSpotify.csv",
stringsAsFactors = FALSE
)
# --- Features ---
feat <- c("danceability", "energy", "instrumentalness", "valence",
"loudness", "liveness", "speechiness", "acousticness")
# --- Z-score using TRAINING stats only (avoid division by zero) ---
# --- Nearest Neighbors (k = 10) ---
# --- Attach neighbors to new_data ---
# Store neighbor indices as a semicolon-separated string for each row (CSV-friendly)
# --- Write as CSV file