Let’s load the penguin data set, and plot the bill length and bill depth for our three species:
# Load libraries
library(readr)
library(ggplot2)

# Read the data
pengwing <- read_csv("09-data/penguins.csv")

# View first few rows (equivalent to .head())
head(pengwing)
# A tibble: 6 × 9
...1 species island bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 0 Adelie Torgersen 39.1 18.7 181
2 1 Adelie Torgersen 39.5 17.4 186
3 2 Adelie Torgersen 40.3 18 195
4 3 Adelie Torgersen NA NA NA
5 4 Adelie Torgersen 36.7 19.3 193
6 5 Adelie Torgersen 39.3 20.6 190
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
# Create the plot
ggplot(pengwing, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Bill Length (in mm)",
    y = "Bill Depth (in mm)",
    title = "Bill Length vs. Bill Depth by Species"
  )
We could use a decision tree based on bill length and bill depth to classify penguins as different species. First, we could split on Bill Depth and decide that any penguin with a bill depth less than 16.5 mm should be classified as a Gentoo penguin.
# Choose a split value
split1 <- 16.5

ggplot(pengwing, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Bill Length (in mm)",
    y = "Bill Depth (in mm)",
    title = "Bill Length vs. Bill Depth by Species"
  ) +
  geom_hline(yintercept = split1, linewidth = 1, linetype = "dashed")
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
That bottom group looks GREAT. Now let’s look at the top group. Most of the Chinstrap penguins have longer bill lengths. Let’s say that if a penguin has a bill depth > 16.5 mm, then we will split on bill length at 44 mm to separate the Adelie and Chinstrap penguins.
# Choose a split value
split2 <- 44

ggplot(pengwing, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Bill Length (in mm)",
    y = "Bill Depth (in mm)",
    title = "Bill Length vs. Bill Depth by Species"
  ) +
  geom_hline(yintercept = split1, linewidth = 1, linetype = "dashed") +
  geom_segment(
    x = split2, xend = split2,
    y = split1, yend = 22,
    linewidth = 0.6, linetype = "dashed", color = "black"
  )
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Voila! We’ve built a (very short) decision tree! It would look like this:
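The rendered tree diagram isn’t included in this text version, so here is a minimal sketch of those two rules written out as plain R code (assuming the thresholds chosen above; this hand-built classifier, and the helper name classify_penguin, are just for illustration and are not how rpart actually fits a tree):

# Hypothetical hand-built classifier using the two splits above
classify_penguin <- function(bill_length_mm, bill_depth_mm) {
  ifelse(bill_depth_mm < 16.5, "Gentoo",                      # first split: bill depth
         ifelse(bill_length_mm < 44, "Adelie", "Chinstrap"))  # second split: bill length
}

# Check how often the two rules agree with the true species
table(predicted = classify_penguin(pengwing$bill_length_mm, pengwing$bill_depth_mm),
      actual    = pengwing$species)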
Entropy is a measure of disorder/chaos. We want ordered and organized data in the leaf nodes of our decision trees. So we want LOW entropy. Entropy is defined as:
\[ E = -\sum_{i=1}^{N} p_i \log_2(p_i) \]
Where \(N\) is the number of categories or labels in our outcome variable and \(p_i\) is the proportion of data points in the node that belong to category \(i\).
Compare this to gini impurity, which is defined as:
\[ GI = 1 - \sum_{i=1}^{N} p_i^2 \]
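As a quick worked example with hypothetical class proportions (not taken from the penguin data): a 50/50 node is maximally disordered, while a pure node scores 0 on both measures.

# Entropy and Gini impurity for a vector of class proportions (hypothetical values)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
gini    <- function(p) 1 - sum(p^2)

entropy(c(0.5, 0.5))   # 1    (maximum disorder for two classes)
gini(c(0.5, 0.5))      # 0.5
entropy(c(1, 0))       # 0    (perfectly pure node)
gini(c(1, 0))          # 0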
1.3.1 Question
WHY do we want the leaf nodes of our tree to be ordered (i.e., to have low entropy or impurity)?
1.4 Measures of Chaos for a Split
When we split a node, we end up with two new nodes. To calculate the chaos (entropy or gini impurity) of the split, we have to calculate the chaos (entropy or gini impurity) for EACH of the new nodes and then take their weighted average.
The reason we weight each node differently in this calculation is that a node with more data in it has more impact, and therefore its measure of chaos (entropy or gini impurity) should count more.
In general, once you’ve calculated the chaos (entropy or gini impurity) for each of the new nodes, you’ll use this formula to calculate the weighted average:
\[ C_{split} = \frac{N_L}{Total} \times C_L + \frac{N_R}{Total} \times C_R \]
Where \(N_L\) is the number of data points in the Left Node, \(N_R\) is the number of data points in the Right Node, and \(Total\) is the total number of data points in that split. \(C_L\) and \(C_R\) are the chaos measures (entropy or gini impurity) for the left and right nodes, respectively.
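For example, with made-up numbers: if a split sends 80 points to the left node and 20 to the right, the left node’s chaos counts four times as much as the right node’s.

# Weighted average chaos for a split (hypothetical node sizes and chaos values)
n_left  <- 80;  c_left  <- 0.30   # chaos (e.g., entropy) of the left node
n_right <- 20;  c_right <- 0.90   # chaos of the right node
total   <- n_left + n_right

weighted_chaos <- (n_left / total) * c_left + (n_right / total) * c_right
weighted_chaos   # 0.8 * 0.30 + 0.2 * 0.90 = 0.42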
1.5 Decision Trees
Let’s first build a Decision Tree to classify patients as diabetic or not diabetic.
Gini impurity is the probability of misclassifying a randomly chosen data point from that node, if it were labeled at random according to the node’s class distribution.
# Packages
library(readr)
library(caret)
Loading required package: lattice
library(rpart)

# Read data + peek
d <- read_csv("09-data/diabetes2.csv")
Rows: 768 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, D...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Predictors / outcome
predictors <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")
X <- d[, predictors]
y <- d$Outcome  # should be 0/1

# Train/test split (80/20), like random_state=1234
set.seed(1234)
idx_train <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[idx_train, , drop = FALSE]
X_test  <- X[-idx_train, , drop = FALSE]
y_train <- y[idx_train]
y_test  <- y[-idx_train]

# Standardize using TRAIN stats only, then apply to both
pp <- preProcess(X_train, method = c("center", "scale"))
X_train_sc <- predict(pp, X_train)
X_test_sc  <- predict(pp, X_test)

# Fit decision tree (classification)
train_df <- data.frame(X_train_sc, Outcome = factor(y_train))
test_df  <- data.frame(X_test_sc, Outcome = factor(y_test))

set.seed(1234)
tree <- rpart(Outcome ~ ., data = train_df, method = "class")

# Predict classes
pred_test  <- predict(tree, newdata = test_df, type = "class")
pred_train <- predict(tree, newdata = train_df, type = "class")

# Confusion matrices (test + train), like ConfusionMatrixDisplay
confusionMatrix(pred_test, test_df$Outcome)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 76 26
1 15 36
Accuracy : 0.732
95% CI : (0.6545, 0.8003)
No Information Rate : 0.5948
P-Value [Acc > NIR] : 0.0002754
Kappa : 0.4279
Mcnemar's Test P-Value : 0.1183498
Sensitivity : 0.8352
Specificity : 0.5806
Pos Pred Value : 0.7451
Neg Pred Value : 0.7059
Prevalence : 0.5948
Detection Rate : 0.4967
Detection Prevalence : 0.6667
Balanced Accuracy : 0.7079
'Positive' Class : 0
confusionMatrix(pred_train, train_df$Outcome)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 368 60
1 41 146
Accuracy : 0.8358
95% CI : (0.8041, 0.8642)
No Information Rate : 0.665
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.6227
Mcnemar's Test P-Value : 0.07328
Sensitivity : 0.8998
Specificity : 0.7087
Pos Pred Value : 0.8598
Neg Pred Value : 0.7807
Prevalence : 0.6650
Detection Rate : 0.5984
Detection Prevalence : 0.6959
Balanced Accuracy : 0.8042
'Positive' Class : 0
1.5.1 Limiting Max_Depth
We talked about different ways to “prune” or limit the depth of a tree, both directly and indirectly, via max_depth and min_samples_leaf (in rpart, the corresponding controls are maxdepth and minbucket). Run the following code. What does it tell you?
# Function to run repeated train/test evaluation for a given maxdepth
eval_depth <- function(data, depth = NULL, reps = 50, test_prop = 0.2, seed = 1234) {
  set.seed(seed)
  n <- nrow(data)
  out <- data.frame(rep = integer(0), split = character(0), acc = numeric(0))

  for (r in 1:reps) {
    idx_test <- sample.int(n, size = floor(test_prop * n))
    train_df <- data[-idx_test, c(predictors, "Outcome")]
    test_df  <- data[idx_test, c(predictors, "Outcome")]

    ctrl <- rpart.control(cp = 0, xval = 0)
    if (!is.null(depth)) ctrl$maxdepth <- depth

    fit <- rpart(Outcome ~ ., data = train_df, method = "class", control = ctrl)

    pred_train <- predict(fit, newdata = train_df, type = "class")
    pred_test  <- predict(fit, newdata = test_df, type = "class")
    acc_train <- mean(pred_train == train_df$Outcome)
    acc_test  <- mean(pred_test == test_df$Outcome)

    out <- rbind(out,
                 data.frame(rep = r, split = "Test", acc = acc_test),
                 data.frame(rep = r, split = "Train", acc = acc_train))
  }
  out
}

# Run across depths 2-9 and "none"
depths <- 2:9
all_res <- data.frame()
for (dep in depths) {
  tmp <- eval_depth(d, depth = dep, reps = 60, test_prop = 0.2, seed = 1234)
  tmp$depth <- as.character(dep)
  all_res <- rbind(all_res, tmp)
}

# "none" = no maxdepth cap (use rpart default maxdepth)
tmp_none <- eval_depth(d, depth = NULL, reps = 60, test_prop = 0.2, seed = 1234)
tmp_none$depth <- "none"
all_res <- rbind(all_res, tmp_none)

all_res$depth <- factor(all_res$depth, levels = c(as.character(2:9), "none"))
all_res$split <- factor(all_res$split, levels = c("Test", "Train"))

# Plot
ggplot(all_res, aes(x = depth, y = acc, fill = split)) +
  geom_boxplot(width = 0.7, position = position_dodge(width = 0.8)) +
  scale_fill_manual(values = c("Test" = "#D76B63", "Train" = "#78D7E6")) +
  labs(
    x = "Restriction of Depth of Tree",
    y = "Accuracy"
  ) +
  theme_minimal()
1.5.2 Random Forests and Gradient Boosting Trees
Now let’s copy and paste the code from above and build a Random Forest to predict diabetes instead of a single tree, and then do the same with a Gradient Boosting Tree.
library(randomForest)

# Predictors and outcome
predictors <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")
X <- d[, predictors]
y <- d$Outcome

# Train/test split (80/20)
set.seed(1234)
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_idx, , drop = FALSE]
X_test  <- X[-train_idx, , drop = FALSE]
y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Standardize predictors (z-score using TRAIN stats)
pp <- preProcess(X_train, method = c("center", "scale"))
X_train_sc <- predict(pp, X_train)
X_test_sc  <- predict(pp, X_test)
# z-scoring isn't important here: tree splits look at one variable at a time,
# so the variables' scales don't matter. But z-scoring won't hurt.

# Combine into data frames for modeling
train_df <- data.frame(X_train_sc, Outcome = factor(y_train))
test_df  <- data.frame(X_test_sc, Outcome = factor(y_test))

# Fit Random Forest
set.seed(1234)
rf_model <- randomForest(
  Outcome ~ .,
  data = train_df,
  ntree = 500,       # number of trees
  mtry = 3,          # variables tried at each split (default ~ sqrt(p))
  importance = TRUE
)

# Predictions
pred_train <- predict(rf_model, train_df)
pred_test  <- predict(rf_model, test_df)

# Confusion matrices
confusionMatrix(pred_train, train_df$Outcome)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 409 0
1 0 206
Accuracy : 1
95% CI : (0.994, 1)
No Information Rate : 0.665
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.000
Specificity : 1.000
Pos Pred Value : 1.000
Neg Pred Value : 1.000
Prevalence : 0.665
Detection Rate : 0.665
Detection Prevalence : 0.665
Balanced Accuracy : 1.000
'Positive' Class : 0
confusionMatrix(pred_test, test_df$Outcome)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 77 29
1 14 33
Accuracy : 0.719
95% CI : (0.6407, 0.7886)
No Information Rate : 0.5948
P-Value [Acc > NIR] : 0.0009444
Kappa : 0.3936
Mcnemar's Test P-Value : 0.0327626
Sensitivity : 0.8462
Specificity : 0.5323
Pos Pred Value : 0.7264
Neg Pred Value : 0.7021
Prevalence : 0.5948
Detection Rate : 0.5033
Detection Prevalence : 0.6928
Balanced Accuracy : 0.6892
'Positive' Class : 0
library(gbm)

# Predictors and outcome
predictors <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age")
X <- d[, predictors]
y <- d$Outcome  # must be numeric 0/1, not a factor as earlier

# Train/test split (80/20)
set.seed(1234)
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_idx, , drop = FALSE]
X_test  <- X[-train_idx, , drop = FALSE]
y_train <- y[train_idx]
y_test  <- y[-train_idx]

# Standardize predictors (z-score using TRAIN stats)
pp <- preProcess(X_train, method = c("center", "scale"))
X_train_sc <- predict(pp, X_train)
X_test_sc  <- predict(pp, X_test)
# z-scoring isn't important here: tree splits look at one variable at a time,
# so the variables' scales don't matter. But z-scoring won't hurt.

# Combine into data frames for modeling
train_df <- data.frame(X_train_sc, Outcome = y_train)
test_df  <- data.frame(X_test_sc, Outcome = y_test)

# Fit Gradient Boosting model
set.seed(1234)
gb_model <- gbm(
  Outcome ~ .,
  data = train_df,
  distribution = "bernoulli",  # binary classification
  n.trees = 300,
  interaction.depth = 3,       # tree depth
  shrinkage = 0.05,            # learning rate
  n.minobsinnode = 10,
  bag.fraction = 0.8,
  verbose = FALSE
)

# Predict probabilities
prob_train <- predict(gb_model, train_df, n.trees = 300, type = "response")
prob_test  <- predict(gb_model, test_df, n.trees = 300, type = "response")

# Convert probabilities to class labels (0.5 threshold)
pred_train <- factor(ifelse(prob_train > 0.5, 1, 0))
pred_test  <- factor(ifelse(prob_test > 0.5, 1, 0))

# Convert truth to factors with matching levels
train_truth <- factor(train_df$Outcome, levels = c(0, 1))
test_truth  <- factor(test_df$Outcome, levels = c(0, 1))
pred_train  <- factor(pred_train, levels = c(0, 1))
pred_test   <- factor(pred_test, levels = c(0, 1))

# Confusion matrices
confusionMatrix(pred_train, train_truth)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 392 43
1 17 163
Accuracy : 0.9024
95% CI : (0.8762, 0.9247)
No Information Rate : 0.665
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7739
Mcnemar's Test P-Value : 0.001249
Sensitivity : 0.9584
Specificity : 0.7913
Pos Pred Value : 0.9011
Neg Pred Value : 0.9056
Prevalence : 0.6650
Detection Rate : 0.6374
Detection Prevalence : 0.7073
Balanced Accuracy : 0.8748
'Positive' Class : 0
confusionMatrix(pred_test, test_truth)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 78 29
1 13 33
Accuracy : 0.7255
95% CI : (0.6476, 0.7945)
No Information Rate : 0.5948
P-Value [Acc > NIR] : 0.0005179
Kappa : 0.4061
Mcnemar's Test P-Value : 0.0206376
Sensitivity : 0.8571
Specificity : 0.5323
Pos Pred Value : 0.7290
Neg Pred Value : 0.7174
Prevalence : 0.5948
Detection Rate : 0.5098
Detection Prevalence : 0.6993
Balanced Accuracy : 0.6947
'Positive' Class : 0
1.5.3 Using Trees for Regression
Lastly, let’s take a quick look at what we’d need to change if we wanted to predict a continuous value instead of a categorical one. Let’s look at this data set that measures risk propensity. We’re going to predict BART Scores (a score where higher values mean you’re riskier), based on a bunch of different measures.
These are the variables in the data set:
BART: Balloon Analogue Risk Task - Measures risk-taking behavior - Higher scores mean more willingness to take risks - This is the outcome (target variable)
BIS/BAS: Behavioral Inhibition / Behavioral Activation Scales - BIS means sensitivity to punishment / avoidance - BAS means sensitivity to reward / approach behavior
Female - Binary variable (0/1)
The goal of the model is to use psychological traits (BIS/BAS + gender) to predict risk-taking behavior (BART score) using linear regression, evaluated with 5-fold cross-validation; a sketch of one possible implementation appears below, after the data is loaded.
# read and clean data ----
bart <- read_csv("09-data/bart.csv")
Rows: 1000 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): BIS.Score, BAS.Drive.Score, BAS.Fun.Seeking.Score, BAS.Reward.Respo...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
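The chunk that produced the metrics table below isn’t included in the rendered output. Here is a minimal sketch of one way those numbers might have been computed, assuming the outcome column is named BART, an 80/20 train/test split, and caret (loaded earlier) for the 5-fold cross-validated linear model; the helper function metrics() and the object names are hypothetical:

set.seed(1234)
idx <- createDataPartition(bart$BART, p = 0.8, list = FALSE)
train_df <- bart[idx, ]
test_df  <- bart[-idx, ]

# Linear model with 5-fold cross-validation (via caret)
lm_fit <- train(BART ~ ., data = train_df, method = "lm",
                trControl = trainControl(method = "cv", number = 5))

# Helper for the four metrics reported below
metrics <- function(obs, pred) {
  c(MSE  = mean((obs - pred)^2),
    MAE  = mean(abs(obs - pred)),
    MAPE = mean(abs(obs - pred) / abs(obs)),
    R2   = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2))
}

rbind(Train = metrics(train_df$BART, predict(lm_fit, train_df)),
      Test  = metrics(test_df$BART,  predict(lm_fit, test_df)))

The table that follows is the output of the original (unshown) chunk: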
Metric Train Test
1 MSE 143.01960222 146.17530363
2 MAE 10.49211192 10.59655360
3 MAPE 0.87809521 0.88646195
4 R² 0.06309417 0.05407431
We can interpret the results as follows: The linear regression model shows poor predictive performance, explaining only 5–6% of the variance in BART scores. Train and test errors were nearly identical, indicating no overfitting but substantial underfitting. These results suggest that BIS/BAS traits and gender alone are insufficient predictors of risk-taking behavior as measured by the BART.
The next step would be to try non-linear models (Random Forest, Gradient Boosting).
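The chunk that fits these models isn’t shown either; here is a minimal sketch of what it might look like (assumptions: the outcome column is BART, the same 80/20 split idea as above, caret already loaded for createDataPartition, and the model objects named rf_fit and gb_fit to match the feature-importance code further down):

library(randomForest)
library(gbm)

set.seed(1234)
idx <- createDataPartition(bart$BART, p = 0.8, list = FALSE)
train_df <- bart[idx, ]
test_df  <- bart[-idx, ]

# Random Forest regression (hypothetical settings)
set.seed(1234)
rf_fit <- randomForest(BART ~ ., data = train_df, ntree = 500, importance = TRUE)

# Gradient Boosting regression (hypothetical settings)
set.seed(1234)
gb_fit <- gbm(BART ~ ., data = train_df, distribution = "gaussian",
              n.trees = 300, interaction.depth = 3, shrinkage = 0.05)

# Test-set predictions (the result tables below come from the original, unshown chunk)
pred_rf  <- predict(rf_fit, newdata = test_df)
pred_gbm <- predict(gb_fit, newdata = test_df, n.trees = 300)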
Model Metric Train Test
1 Random Forest MSE 49.4503557 133.4441967
2 Random Forest MAE 5.7125156 9.5465513
3 Random Forest MAPE 0.4700485 0.7912291
4 Random Forest R^2 0.7497229 0.1468698
5 Gradient Boosting MSE 100.5563924 131.6130762
6 Gradient Boosting MAE 8.2969512 9.5293227
7 Gradient Boosting MAPE 0.6858956 0.7808945
8 Gradient Boosting R^2 0.3510419 0.1564954
What do you conclude?
Interpretation
While Random Forest achieved excellent training performance, it substantially overfit and failed to generalize. Gradient Boosting produced slightly better test performance and more stable behavior, though overall predictive power remained modest. These results suggest that nonlinear relationships exist but that BIS/BAS measures and gender alone account for only a small portion of variance in BART risk-taking behavior.
1.5.4 Feature Importance
# Assuming rf_fit was trained on train_df
rf_imp <- importance(rf_fit)

# Convert to data frame
rf_imp_df <- data.frame(
  Feature = rownames(rf_imp),
  Importance = rf_imp[, "IncNodePurity"]
)

# Sort by importance
rf_imp_df <- rf_imp_df[order(rf_imp_df$Importance, decreasing = TRUE), ]
rf_imp_df
Feature Importance
BAS.Drive.Score BAS.Drive.Score 17676.021
BAS.Fun.Seeking.Score BAS.Fun.Seeking.Score 16728.337
Age Age 15398.407
BIS.Score BIS.Score 14628.120
BAS.Reward.Responsiveness.Score BAS.Reward.Responsiveness.Score 14473.869
Female Female 3253.537
# Relative influence from gbm
gb_imp <- summary(gb_fit, plotit = FALSE)
gb_imp
var rel.inf
BAS.Fun.Seeking.Score BAS.Fun.Seeking.Score 25.462993
BAS.Drive.Score BAS.Drive.Score 25.375267
BAS.Reward.Responsiveness.Score BAS.Reward.Responsiveness.Score 21.312970
BIS.Score BIS.Score 13.361662
Age Age 12.722097
Female Female 1.765011
Feature importance analyses from both Random Forest and Gradient Boosting consistently identified BAS-related traits, particularly Drive and Fun Seeking, as the most influential predictors of BART risk-taking behavior. In contrast, BIS, age, and gender showed relatively weak influence. Although these findings suggest that reward sensitivity plays a central role in risk-taking, the overall predictive power of the models remained modest, indicating substantial unexplained variability.
2 Classwork
Using the bmd.csv, fit a tree to predict bone mineral density (BMD) based on AGE.
# libraries that we will need
set.seed(1234)        # fix the random generator seed
library(rpart)        # library for CART
library(rpart.plot)   # plotting

# read the dataset
bmd.data <- read.csv("09-data/bmd.csv", stringsAsFactors = TRUE)
t1 <- rpart(bmd ~ age,
            data = bmd.data,
            method = "anova",              # indicates the outcome is continuous
            control = rpart.control(
              minsplit = 1,   # min number of observations for a split
              minbucket = 1,  # min number of observations in terminal nodes
              cp = 0          # decrease in complexity required for a split
            ))

# the rpart.plot() may take a long time to plot the complete tree
# if you cannot run it, just try plot(t1); text(t1, pretty = 1)
# and you will see just the structure of the tree
rpart.plot(t1)
Warning: labs do not fit even at cex 0.15, there may be some overplotting
We can now prune the tree using a limit on the complexity parameter (CP): only splits whose CP is higher than the limit are considered worth keeping.
The printcp() function is used to examine the complexity parameter table and identify an appropriate pruning threshold. The tree is then pruned using cp = 0.02, which removes splits that do not sufficiently reduce prediction error, resulting in a simpler and more generalizable model, as sketched below.
In summary, CP controls tree size:
It penalizes adding splits that don’t improve the model enough
A split is kept only if it reduces error by at least CP
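The pruning chunk itself isn’t shown in this output, but given the cp = 0.02 threshold mentioned above and the prune.t1 object used below, it likely looked something like this sketch:

# Examine the complexity parameter table to pick a threshold
printcp(t1)

# Keep only the splits whose improvement exceeds the chosen CP
prune.t1 <- prune(t1, cp = 0.02)
rpart.plot(prune.t1)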
Let’s use the pruned tree from above to predict the BMD for a 70-year-old individual and compare it with the prediction from a linear model. We plot the predictions from the tree for ages 40 through 90 together with the predictions from the linear model:
lm1 <- lm(bmd ~ age, data = bmd.data)

predict(prune.t1,  # prediction using the tree
        newdata = data.frame(age = 70))
1
0.8070857
predict(lm1,  # prediction using the linear model
        newdata = data.frame(age = 70))
1
0.7567922
# Create prediction grid
age_grid <- data.frame(age = seq(40, 90, 1))

# Get predictions
age_grid$tree <- predict(prune.t1, newdata = age_grid)
age_grid$linear <- predict(lm1, newdata = age_grid)

# Convert to long format for ggplot
plot_df <- data.frame(
  age = rep(age_grid$age, 2),
  bmd = c(age_grid$tree, age_grid$linear),
  model = factor(rep(c("Tree", "Linear Model"), each = nrow(age_grid)))
)

# Plot
ggplot(plot_df, aes(x = age, y = bmd, color = model)) +
  geom_line(linewidth = 1) +
  scale_color_manual(values = c(
    "Tree" = "blue",
    "Linear Model" = "red"
  )) +
  labs(
    x = "Age",
    y = "BMD",
    color = "Model"
  ) +
  theme_minimal()
Let’s now fit a tree to predict bone mineral density (BMD) based on AGE, SEX and BMI (BMI has to be computed) and compute the MSE.
Notice that caret will prune the tree based on the cross-validated CP.
# compute BMI
bmd.data$bmi <- bmd.data$weight_kg / (bmd.data$height_cm / 100)^2

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

t2.caret <- train(bmd ~ age + sex + bmi,
                  data = bmd.data,
                  method = "rpart",
                  trControl = trctrl,
                  tuneGrid = expand.grid(cp = seq(0.001, 0.1, 0.001)))

# Plot the RMSE versus the CP
plot(t2.caret)
# Plot the tree with the optimal CP
rpart.plot(t2.caret$finalModel)
We can compare the RMSE and R2 of the tree above with the linear model:
trctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

lm2.caret <- train(bmd ~ age + sex + bmi,
                   data = bmd.data,
                   method = "lm",
                   trControl = trctrl)
lm2.caret$results
# extracts the row with the RMSE and R2 from the table of results
# corresponding to the cp with lowest RMSE (best tune)
t2.caret$results[t2.caret$results$cp == t2.caret$bestTune[1, 1], ]