Let’s load the penguin data set, and plot the bill length and bill depth for our three species:
# Load libraries
library(readr)
library(ggplot2)

# Read the data
pengwing <- read_csv("09-data/penguins.csv")

# View first few rows (equivalent to .head())
head(pengwing)
# A tibble: 6 × 9
...1 species island bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 0 Adelie Torgersen 39.1 18.7 181
2 1 Adelie Torgersen 39.5 17.4 186
3 2 Adelie Torgersen 40.3 18 195
4 3 Adelie Torgersen NA NA NA
5 4 Adelie Torgersen 36.7 19.3 193
6 5 Adelie Torgersen 39.3 20.6 190
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
# Create the plot
ggplot(pengwing, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Bill Length (in mm)",
    y = "Bill Depth (in mm)",
    title = "Bill Length vs. Bill Depth by Species"
  )
We could use a decision tree based on bill length and bill depth to classify penguins as different species. First, we could split on Bill Depth and decide that any penguin with a bill depth less than 16.5 mm should be classified as a Gentoo penguin.
# Choose a split value
split1 <- 16.5

ggplot(pengwing, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Bill Length (in mm)",
    y = "Bill Depth (in mm)",
    title = "Bill Length vs. Bill Depth by Species"
  ) +
  geom_hline(yintercept = split1, linewidth = 1, linetype = "dashed")
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
That bottom group looks GREAT. Now let’s look at the top group. Most of the Chinstrap penguins have longer bill lengths. Let’s say that if a penguin has a bill depth > 16.5 mm, then we will split on bill length at 44 mm to separate the Adelie and Chinstrap penguins.
# Choose a split value
split2 <- 44

ggplot(pengwing, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Bill Length (in mm)",
    y = "Bill Depth (in mm)",
    title = "Bill Length vs. Bill Depth by Species"
  ) +
  geom_hline(yintercept = split1, linewidth = 1, linetype = "dashed") +
  geom_segment(
    x = split2, xend = split2,
    y = split1, yend = 22,
    linewidth = 0.6, linetype = "dashed", color = "black"
  )
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Voila! We’ve built a (very short) decision tree! It would look like this:
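The rendered tree diagram isn’t included in this text version, so here is a minimal sketch of those two rules written out as plain R code (assuming the thresholds chosen above; this hand-built classifier, and the helper name classify_penguin, are just for illustration and are not how rpart actually fits a tree):

# Hypothetical hand-built classifier using the two splits above
classify_penguin <- function(bill_length_mm, bill_depth_mm) {
  ifelse(bill_depth_mm < 16.5, "Gentoo",                      # first split: bill depth
         ifelse(bill_length_mm < 44, "Adelie", "Chinstrap"))  # second split: bill length
}

# Check how often the two rules agree with the true species
table(predicted = classify_penguin(pengwing$bill_length_mm, pengwing$bill_depth_mm),
      actual    = pengwing$species)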
Entropy is a measure of disorder/chaos. We want ordered and organized data in the leaf nodes of our decision trees. So we want LOW entropy. Entropy is defined as:
\[ E = -\sum_{i=1}^{N} p_i \log_2(p_i) \]
Where \(N\) is the number of categories or labels in our outcome variable and \(p_i\) is the proportion of data points in the node that belong to category \(i\).
Compare this to gini impurity, which is defined as:
\[ GI = 1 - \sum_{i=1}^{N} p_i^2 \]
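As a quick worked example with hypothetical class proportions (not taken from the penguin data): a 50/50 node is maximally disordered, while a pure node scores 0 on both measures.

# Entropy and Gini impurity for a vector of class proportions (hypothetical values)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
gini    <- function(p) 1 - sum(p^2)

entropy(c(0.5, 0.5))   # 1    (maximum disorder for two classes)
gini(c(0.5, 0.5))      # 0.5
entropy(c(1, 0))       # 0    (perfectly pure node)
gini(c(1, 0))          # 0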
1.3.1 Question
WHY do we want the leaf nodes of our tree to be ordered (i.e., to have low entropy or impurity)?
1.4 Measures of Chaos for a Split
When we split a node, we end up with two new nodes. To calculate the chaos (entropy or gini impurity) of the split, we have to calculate the chaos (entropy or gini impurity) for EACH of the new nodes and then take their weighted average.
The reason we weight each node differently in this calculation is that a node with more data in it has more impact, and therefore its measure of chaos (entropy or gini impurity) should count more.
In general, once you’ve calculated the chaos (entropy or gini impurity) for each of the new nodes, you’ll use this formula to calculate the weighted average:
\[ C_{split} = \frac{N_L}{Total} \times C_L + \frac{N_R}{Total} \times C_R \]
Where \(N_L\) is the number of data points in the Left Node, \(N_R\) is the number of data points in the Right Node, and \(Total\) is the total number of data points in that split. \(C_L\) and \(C_R\) are the chaos measures (entropy or gini impurity) for the left and right nodes, respectively.
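For example, with made-up numbers: if a split sends 80 points to the left node and 20 to the right, the left node’s chaos counts four times as much as the right node’s.

# Weighted average chaos for a split (hypothetical node sizes and chaos values)
n_left  <- 80;  c_left  <- 0.30   # chaos (e.g., entropy) of the left node
n_right <- 20;  c_right <- 0.90   # chaos of the right node
total   <- n_left + n_right

weighted_chaos <- (n_left / total) * c_left + (n_right / total) * c_right
weighted_chaos   # 0.8 * 0.30 + 0.2 * 0.90 = 0.42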
1.5 Decision Trees
Let’s first build a Decision Tree to classify patients as diabetic or not diabetic.
Gini impurity is the probability of misclassifying a randomly chosen data point from that node, if it were labeled at random according to the node’s class distribution.
# Packages
library(readr)
library(caret)
Loading required package: lattice
library(rpart)

# Read data + peek
d <- read_csv("09-data/diabetes2.csv")
Rows: 768 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, D...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Predictors / outcome
predictors <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")
X <- d[, predictors]
y <- d$Outcome  # should be 0/1

# Train/test split (80/20), like random_state=1234
set.seed(1234)
idx_train <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[idx_train, , drop = FALSE]
X_test  <- X[-idx_train, , drop = FALSE]
y_train <- y[idx_train]
y_test  <- y[-idx_train]

# Standardize using TRAIN stats only, then apply to both
pp <- preProcess(X_train, method = c("center", "scale"))
X_train_sc <- predict(pp, X_train)
X_test_sc  <- predict(pp, X_test)

# Fit decision tree (classification)
train_df <- data.frame(X_train_sc, Outcome = factor(y_train))
test_df  <- data.frame(X_test_sc, Outcome = factor(y_test))

set.seed(1234)
tree <- rpart(Outcome ~ ., data = train_df, method = "class")

# Predict classes
pred_test  <- predict(tree, newdata = test_df, type = "class")
pred_train <- predict(tree, newdata = train_df, type = "class")

# Confusion matrices (test + train), like ConfusionMatrixDisplay
confusionMatrix(pred_test, test_df$Outcome)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 76 26
1 15 36
Accuracy : 0.732
95% CI : (0.6545, 0.8003)
No Information Rate : 0.5948
P-Value [Acc > NIR] : 0.0002754
Kappa : 0.4279
Mcnemar's Test P-Value : 0.1183498
Sensitivity : 0.8352
Specificity : 0.5806
Pos Pred Value : 0.7451
Neg Pred Value : 0.7059
Prevalence : 0.5948
Detection Rate : 0.4967
Detection Prevalence : 0.6667
Balanced Accuracy : 0.7079
'Positive' Class : 0
confusionMatrix(pred_train, train_df$Outcome)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 368 60
1 41 146
Accuracy : 0.8358
95% CI : (0.8041, 0.8642)
No Information Rate : 0.665
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6227
Mcnemar's Test P-Value : 0.07328
Sensitivity : 0.8998
Specificity : 0.7087
Pos Pred Value : 0.8598
Neg Pred Value : 0.7807
Prevalence : 0.6650
Detection Rate : 0.5984
Detection Prevalence : 0.6959
Balanced Accuracy : 0.8042
'Positive' Class : 0
1.5.1 Limiting Max_Depth
We talked about different ways to “prune” or limit the depth of a tree, both directly and indirectly, via max_depth and min_samples_leaf (in rpart, the corresponding controls are maxdepth and minbucket). Run the following code. What does it tell you?
# Function to run repeated train/test evaluation for a given maxdepth
eval_depth <- function(data, depth = NULL, reps = 50, test_prop = 0.2, seed = 1234) {
  set.seed(seed)
  n <- nrow(data)
  out <- data.frame(rep = integer(0), split = character(0), acc = numeric(0))

  for (r in 1:reps) {
    idx_test <- sample.int(n, size = floor(test_prop * n))
    train_df <- data[-idx_test, c(predictors, "Outcome")]
    test_df  <- data[idx_test, c(predictors, "Outcome")]

    ctrl <- rpart.control(cp = 0, xval = 0)
    if (!is.null(depth)) ctrl$maxdepth <- depth

    fit <- rpart(Outcome ~ ., data = train_df, method = "class", control = ctrl)

    pred_train <- predict(fit, newdata = train_df, type = "class")
    pred_test  <- predict(fit, newdata = test_df, type = "class")
    acc_train <- mean(pred_train == train_df$Outcome)
    acc_test  <- mean(pred_test == test_df$Outcome)

    out <- rbind(out,
                 data.frame(rep = r, split = "Test", acc = acc_test),
                 data.frame(rep = r, split = "Train", acc = acc_train))
  }
  out
}

# Run across depths 2-9 and "none"
depths <- 2:9
all_res <- data.frame()
for (dep in depths) {
  tmp <- eval_depth(d, depth = dep, reps = 60, test_prop = 0.2, seed = 1234)
  tmp$depth <- as.character(dep)
  all_res <- rbind(all_res, tmp)
}

# "none" = no maxdepth cap (use rpart default maxdepth)
tmp_none <- eval_depth(d, depth = NULL, reps = 60, test_prop = 0.2, seed = 1234)
tmp_none$depth <- "none"
all_res <- rbind(all_res, tmp_none)

all_res$depth <- factor(all_res$depth, levels = c(as.character(2:9), "none"))
all_res$split <- factor(all_res$split, levels = c("Test", "Train"))

# Plot
ggplot(all_res, aes(x = depth, y = acc, fill = split)) +
  geom_boxplot(width = 0.7, position = position_dodge(width = 0.8)) +
  scale_fill_manual(values = c("Test" = "#D76B63", "Train" = "#78D7E6")) +
  labs(
    x = "Restriction of Depth of Tree",
    y = "Accuracy"
  ) +
  theme_minimal()
1.5.2 Random Forests and Gradient Boosting Trees
Now let’s copy and paste the code from above and build a Random Forest to predict diabetes instead of a single tree, and then do the same with a Gradient Boosting Tree.
library(randomForest)

# Predictors and outcome
predictors <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")
X <- d[, predictors]
y <- d$Outcome

# Train/test split (80/20)
set.seed(1234)
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_idx, , drop = FALSE]
X_test  <- X[-train_idx, , drop = FALSE]
y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Standardize predictors (z-score using TRAIN stats)
pp <- preProcess(X_train, method = c("center", "scale"))
X_train_sc <- predict(pp, X_train)
X_test_sc  <- predict(pp, X_test)
# z-scoring isn't important here: tree splits look at one variable at a time,
# so the variables' scales don't matter. But z-scoring won't hurt.

# Combine into data frames for modeling
train_df <- data.frame(X_train_sc, Outcome = factor(y_train))
test_df  <- data.frame(X_test_sc, Outcome = factor(y_test))

# Fit Random Forest
set.seed(1234)
rf_model <- randomForest(
  Outcome ~ .,
  data = train_df,
  ntree = 500,       # number of trees
  mtry = 3,          # variables tried at each split (default ~ sqrt(p))
  importance = TRUE
)

# Predictions
pred_train <- predict(rf_model, train_df)
pred_test  <- predict(rf_model, test_df)

# Confusion matrices
confusionMatrix(pred_train, train_df$Outcome)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 409 0
1 0 206
Accuracy : 1
95% CI : (0.994, 1)
No Information Rate : 0.665
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.000
Specificity : 1.000
Pos Pred Value : 1.000
Neg Pred Value : 1.000
Prevalence : 0.665
Detection Rate : 0.665
Detection Prevalence : 0.665
Balanced Accuracy : 1.000
'Positive' Class : 0
confusionMatrix(pred_test, test_df$Outcome)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 77 29
1 14 33
Accuracy : 0.719
95% CI : (0.6407, 0.7886)
No Information Rate : 0.5948
P-Value [Acc > NIR] : 0.0009444
Kappa : 0.3936
Mcnemar's Test P-Value : 0.0327626
Sensitivity : 0.8462
Specificity : 0.5323
Pos Pred Value : 0.7264
Neg Pred Value : 0.7021
Prevalence : 0.5948
Detection Rate : 0.5033
Detection Prevalence : 0.6928
Balanced Accuracy : 0.6892
'Positive' Class : 0
library(gbm)

# Predictors and outcome
predictors <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")
X <- d[, predictors]
y <- d$Outcome  # must be numeric 0/1, not a factor as earlier

# Train/test split (80/20)
set.seed(1234)
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_idx, , drop = FALSE]
X_test  <- X[-train_idx, , drop = FALSE]
y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Standardize predictors (z-score using TRAIN stats)
pp <- preProcess(X_train, method = c("center", "scale"))
X_train_sc <- predict(pp, X_train)
X_test_sc  <- predict(pp, X_test)
# z-scoring isn't important here: tree splits look at one variable at a time,
# so the variables' scales don't matter. But z-scoring won't hurt.

# Combine into data frames for modeling
train_df <- data.frame(X_train_sc, Outcome = y_train)
test_df  <- data.frame(X_test_sc, Outcome = y_test)

# Fit Gradient Boosting model
set.seed(1234)
gb_model <- gbm(
  Outcome ~ .,
  data = train_df,
  distribution = "bernoulli",  # binary classification
  n.trees = 300,
  interaction.depth = 3,       # tree depth
  shrinkage = 0.05,            # learning rate
  n.minobsinnode = 10,
  bag.fraction = 0.8,
  verbose = FALSE
)

# Predict probabilities
prob_train <- predict(gb_model, train_df, n.trees = 300, type = "response")
prob_test  <- predict(gb_model, test_df, n.trees = 300, type = "response")

# Convert probabilities to class labels (0.5 threshold)
pred_train <- factor(ifelse(prob_train > 0.5, 1, 0))
pred_test  <- factor(ifelse(prob_test > 0.5, 1, 0))

# Convert truth to factors with matching levels
train_truth <- factor(train_df$Outcome, levels = c(0, 1))
test_truth  <- factor(test_df$Outcome, levels = c(0, 1))
pred_train  <- factor(pred_train, levels = c(0, 1))
pred_test   <- factor(pred_test, levels = c(0, 1))

# Confusion matrices
confusionMatrix(pred_train, train_truth)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 392 43
1 17 163
Accuracy : 0.9024
95% CI : (0.8762, 0.9247)
No Information Rate : 0.665
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7739
Mcnemar's Test P-Value : 0.001249
Sensitivity : 0.9584
Specificity : 0.7913
Pos Pred Value : 0.9011
Neg Pred Value : 0.9056
Prevalence : 0.6650
Detection Rate : 0.6374
Detection Prevalence : 0.7073
Balanced Accuracy : 0.8748
'Positive' Class : 0
confusionMatrix(pred_test, test_truth)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 78 29
1 13 33
Accuracy : 0.7255
95% CI : (0.6476, 0.7945)
No Information Rate : 0.5948
P-Value [Acc > NIR] : 0.0005179
Kappa : 0.4061
Mcnemar's Test P-Value : 0.0206376
Sensitivity : 0.8571
Specificity : 0.5323
Pos Pred Value : 0.7290
Neg Pred Value : 0.7174
Prevalence : 0.5948
Detection Rate : 0.5098
Detection Prevalence : 0.6993
Balanced Accuracy : 0.6947
'Positive' Class : 0
1.5.3 Using Trees for Regression
Lastly, let’s take a quick look at what we’d need to change if we wanted to predict a continuous value instead of a categorical one. Let’s look at this data set that measures risk propensity. We’re going to predict BART Scores (a score where higher values mean you’re riskier), based on a bunch of different measures.
These are the variables in the data set:
BART: Balloon Analogue Risk Task - Measures risk-taking behavior - Higher scores mean more willingness to take risks - This is the outcome (target variable)
BIS/BAS: Behavioral Inhibition / Behavioral Activation Scales - BIS means sensitivity to punishment / avoidance - BAS means sensitivity to reward / approach behavior
Female - Binary variable (0/1)
The goal of the model is to use psychological traits (BIS/BAS + gender) to predict risk-taking behavior (BART score) using linear regression, evaluated with 5-fold cross-validation; a sketch of one possible implementation appears below, after the data is loaded.
# read and clean data ----
bart <- read_csv("09-data/bart.csv")
Rows: 1000 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): BIS.Score, BAS.Drive.Score, BAS.Fun.Seeking.Score, BAS.Reward.Respo...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
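The chunk that produced the metrics table below isn’t included in the rendered output. Here is a minimal sketch of one way those numbers might have been computed, assuming the outcome column is named BART, an 80/20 train/test split, and caret (loaded earlier) for the 5-fold cross-validated linear model; the helper function metrics() and the object names are hypothetical:

set.seed(1234)
idx <- createDataPartition(bart$BART, p = 0.8, list = FALSE)
train_df <- bart[idx, ]
test_df  <- bart[-idx, ]

# Linear model with 5-fold cross-validation (via caret)
lm_fit <- train(BART ~ ., data = train_df, method = "lm",
                trControl = trainControl(method = "cv", number = 5))

# Helper for the four metrics reported below
metrics <- function(obs, pred) {
  c(MSE  = mean((obs - pred)^2),
    MAE  = mean(abs(obs - pred)),
    MAPE = mean(abs(obs - pred) / abs(obs)),
    R2   = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2))
}

rbind(Train = metrics(train_df$BART, predict(lm_fit, train_df)),
      Test  = metrics(test_df$BART,  predict(lm_fit, test_df)))

The table that follows is the output of the original (unshown) chunk: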
Metric Train Test
1 MSE 143.01960222 146.17530363
2 MAE 10.49211192 10.59655360
3 MAPE 0.87809521 0.88646195
4 R² 0.06309417 0.05407431
We can interpret the results as follows: The linear regression model shows poor predictive performance, explaining only 5–6% of the variance in BART scores. Train and test errors were nearly identical, indicating no overfitting but substantial underfitting. These results suggest that BIS/BAS traits and gender alone are insufficient predictors of risk-taking behavior as measured by the BART.
The next step would be to try non-linear models (Random Forest, Gradient Boosting).
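The chunk that fits these models isn’t shown either; here is a minimal sketch of what it might look like (assumptions: the outcome column is BART, the same 80/20 split idea as above, caret already loaded for createDataPartition, and the model objects named rf_fit and gb_fit to match the feature-importance code further down):

library(randomForest)
library(gbm)

set.seed(1234)
idx <- createDataPartition(bart$BART, p = 0.8, list = FALSE)
train_df <- bart[idx, ]
test_df  <- bart[-idx, ]

# Random Forest regression (hypothetical settings)
set.seed(1234)
rf_fit <- randomForest(BART ~ ., data = train_df, ntree = 500, importance = TRUE)

# Gradient Boosting regression (hypothetical settings)
set.seed(1234)
gb_fit <- gbm(BART ~ ., data = train_df, distribution = "gaussian",
              n.trees = 300, interaction.depth = 3, shrinkage = 0.05)

# Test-set predictions (the result tables below come from the original, unshown chunk)
pred_rf  <- predict(rf_fit, newdata = test_df)
pred_gbm <- predict(gb_fit, newdata = test_df, n.trees = 300)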
Model Metric Train Test
1 Random Forest MSE 49.4503557 133.4441967
2 Random Forest MAE 5.7125156 9.5465513
3 Random Forest MAPE 0.4700485 0.7912291
4 Random Forest R^2 0.7497229 0.1468698
5 Gradient Boosting MSE 100.5563924 131.6130762
6 Gradient Boosting MAE 8.2969512 9.5293227
7 Gradient Boosting MAPE 0.6858956 0.7808945
8 Gradient Boosting R^2 0.3510419 0.1564954
What do you conclude?
Interpretation
While Random Forest achieved excellent training performance, it substantially overfit and failed to generalize. Gradient Boosting produced slightly better test performance and more stable behavior, though overall predictive power remained modest. These results suggest that nonlinear relationships exist but that BIS/BAS measures and gender alone account for only a small portion of variance in BART risk-taking behavior.
1.5.4 Feature Importance
# Assuming rf_fit was trained on train_df
rf_imp <- importance(rf_fit)

# Convert to data frame
rf_imp_df <- data.frame(
  Feature = rownames(rf_imp),
  Importance = rf_imp[, "IncNodePurity"]
)

# Sort by importance
rf_imp_df <- rf_imp_df[order(rf_imp_df$Importance, decreasing = TRUE), ]
rf_imp_df
Feature Importance
BAS.Drive.Score BAS.Drive.Score 17676.021
BAS.Fun.Seeking.Score BAS.Fun.Seeking.Score 16728.337
Age Age 15398.407
BIS.Score BIS.Score 14628.120
BAS.Reward.Responsiveness.Score BAS.Reward.Responsiveness.Score 14473.869
Female Female 3253.537
# Relative influence from gbm
gb_imp <- summary(gb_fit, plotit = FALSE)
gb_imp
var rel.inf
BAS.Fun.Seeking.Score BAS.Fun.Seeking.Score 25.462993
BAS.Drive.Score BAS.Drive.Score 25.375267
BAS.Reward.Responsiveness.Score BAS.Reward.Responsiveness.Score 21.312970
BIS.Score BIS.Score 13.361662
Age Age 12.722097
Female Female 1.765011
Feature importance analyses from both Random Forest and Gradient Boosting consistently identified BAS-related traits, particularly Drive and Fun Seeking, as the most influential predictors of BART risk-taking behavior. In contrast, BIS, age, and gender showed relatively weak influence. Although these findings suggest that reward sensitivity plays a central role in risk-taking, the overall predictive power of the models remained modest, indicating substantial unexplained variability.
2 Classwork
Using the bmd.csv, fit a tree to predict bone mineral density (BMD) based on AGE.
# libraries that we will need
set.seed(1234)        # fix the random generator seed
library(rpart)        # library for CART
library(rpart.plot)   # plotting

# read the dataset
bmd.data <- read.csv("09-data/bmd.csv", stringsAsFactors = TRUE)
t1 <- rpart(bmd ~ age,
            data = bmd.data,
            method = "anova",              # indicates the outcome is continuous
            control = rpart.control(
              minsplit = 1,   # min number of observations for a split
              minbucket = 1,  # min number of observations in terminal nodes
              cp = 0          # decrease in complexity required for a split
            ))

# the rpart.plot() may take a long time to plot the complete tree
# if you cannot run it, just try plot(t1); text(t1, pretty = 1)
# and you will see just the structure of the tree
rpart.plot(t1)
Warning: labs do not fit even at cex 0.15, there may be some overplotting
We can now prune the tree using a limit on the complexity parameter (CP): only splits whose CP is higher than the limit are considered worth keeping.
The printcp() function is used to examine the complexity parameter table and identify an appropriate pruning threshold. The tree is then pruned using cp = 0.02, which removes splits that do not sufficiently reduce prediction error, resulting in a simpler and more generalizable model, as sketched below.
In summary, CP controls tree size:
It penalizes adding splits that don’t improve the model enough
A split is kept only if it reduces error by at least CP
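The pruning chunk itself isn’t shown in this output, but given the cp = 0.02 threshold mentioned above and the prune.t1 object used below, it likely looked something like this sketch:

# Examine the complexity parameter table to pick a threshold
printcp(t1)

# Keep only the splits whose improvement exceeds the chosen CP
prune.t1 <- prune(t1, cp = 0.02)
rpart.plot(prune.t1)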
Let’s use the pruned tree from above to predict the BMD for a 70-year-old individual and compare it with the prediction from a linear model. We plot the predictions from the tree for ages 40 through 90 together with the predictions from the linear model:
lm1 <- lm(bmd ~ age, data = bmd.data)

predict(prune.t1,  # prediction using the tree
        newdata = data.frame(age = 70))
1
0.8070857
predict(lm1,  # prediction using the linear model
        newdata = data.frame(age = 70))
1
0.7567922
# Create prediction grid
age_grid <- data.frame(age = seq(40, 90, 1))

# Get predictions
age_grid$tree <- predict(prune.t1, newdata = age_grid)
age_grid$linear <- predict(lm1, newdata = age_grid)

# Convert to long format for ggplot
plot_df <- data.frame(
  age = rep(age_grid$age, 2),
  bmd = c(age_grid$tree, age_grid$linear),
  model = factor(rep(c("Tree", "Linear Model"), each = nrow(age_grid)))
)

# Plot
ggplot(plot_df, aes(x = age, y = bmd, color = model)) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c(
    "Tree" = "blue",
    "Linear Model" = "red"
  )) +
  labs(
    x = "Age",
    y = "BMD",
    color = "Model"
  ) +
  theme_minimal()
Let’s now fit a tree to predict bone mineral density (BMD) based on AGE, SEX and BMI (BMI has to be computed) and compute the MSE.
Notice that caret will prune the tree based on the cross-validated CP.
# compute BMI
bmd.data$bmi <- bmd.data$weight_kg / (bmd.data$height_cm / 100)^2

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

t2.caret <- train(bmd ~ age + sex + bmi,
                  data = bmd.data,
                  method = "rpart",
                  trControl = trctrl,
                  tuneGrid = expand.grid(cp = seq(0.001, 0.1, 0.001)))

# Plot the RMSE versus the CP
plot(t2.caret)
# Plot the tree with the optimal CP
rpart.plot(t2.caret$finalModel)
We can compare the RMSE and R2 of the tree above with the linear model:
trctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

lm2.caret <- train(bmd ~ age + sex + bmi,
                   data = bmd.data,
                   method = "lm",
                   trControl = trctrl)
lm2.caret$results
# extracts the row with the RMSE and R2 from the table of results
# corresponding to the cp with lowest RMSE (best tune)
t2.caret$results[t2.caret$results$cp == t2.caret$bestTune[1, 1], ]