library(tidyverse)
library(flextable) # install if not on your machine
library(tm) # install if not on your machine
library(e1071) # install if not on your machine
library(caret) # install if not on your machine
library(pROC) # install if not on your machine
Classification II
1 Metrics
Once we build a classification model, it’s important to make sure that we know how it performs! We talked about a few metrics you can calculate to help you assess how well your model is doing.
- Accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\)
- Confusion Matrix Patterns
|  | Actually 1 | Actually 0 |
|---|---|---|
| Predicted 1 | True Positive (TP) | False Positive (FP) |
| Predicted 0 | False Negative (FN) | True Negative (TN) |
- Precision: \(\frac{TP}{TP + FP}\), how many of the predicted positives are true positives?
- Recall/Sensitivity: \(\frac{TP}{TP + FN}\), how many of the actual positives did we accurately predict?
- Specificity: \(\frac{TN}{TN + FP}\), how many of the actual negatives did we accurately predict?
- F1 Score: \(\frac{2 * Precision * Recall}{Precision + Recall}\), the harmonic mean of precision and recall.
- ROC AUC: The area under the ROC curve, which puts the False Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) on the y-axis.
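As a quick sketch of these formulas in action, we can compute everything except ROC AUC straight from the four confusion matrix cells (the counts below are made up for illustration, not output from any model in this document):
#hypothetical confusion matrix counts, for illustration only
TP <- 40; FP <- 10; FN <- 5; TN <- 45
(TP + TN) / (TP + TN + FP + FN) #accuracy: 0.85
prec <- TP / (TP + FP)          #precision: 0.8
rec  <- TP / (TP + FN)          #recall/sensitivity: ~0.889
TN / (TN + FP)                  #specificity: ~0.818
2 * prec * rec / (prec + rec)   #F1: ~0.842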
1.1 Question
If you were designing a Flu test, which of these metrics would be most important to you (there’s no one right answer) and why?
2 Naive Bayes
2.1 Naive
Naive Bayes is Naive because it makes the assumption of conditional independence. This means that within classes (the groups we’re trying to predict), we assume the predictors are independent. However, we know that’s not true, so it’s a Naive assumption to make.
However, it simplifies our computation. When events are independent we can calculate their joint probability just by multiplying them:
\[\underbrace{P(A,B,C)}_\text{probability of A, B, and C} = P(A) * P(B) * P(C)\]
This is much simpler than calculating the joint probability \(P(A,B,C)\) in a way that takes into account the relationships between the predictors.
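For example, with three independent events and made-up marginal probabilities:
#made-up marginal probabilities for three independent events
p_a <- 0.5; p_b <- 0.2; p_c <- 0.1
p_a * p_b * p_c #joint probability under independence: 0.01
#without independence we would need P(A) * P(B|A) * P(C|A,B),
#which requires modeling the relationships between the events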
2.2 Bayes
When classifying a data point, we use Bayes’ Theorem (at least the numerator of it) to calculate the score for each potential category. Then we choose the category with the highest score. Because we’re comparing scores that have the same denominator, we can ignore it (which is nice, as it’s difficult to calculate).
\[ \overbrace{P(\text{group}_i | \mathbf{X}) = \frac{P(\mathbf{X} | \text{group}_i) * P(\text{group}_i)}{P(\mathbf{X})}}^\text{Bayes' Theorem}\]
- \(P(\text{group}_i | \mathbf{X})\) is the probability of our data point being in group \(i\) based on their predictor values \(\mathbf{X}\)
- \(P(\mathbf{X} | \text{group}_i)\) is the likelihood of seeing features like \(\mathbf{X}\) in group \(i\) (if our features are commonly seen in group \(i\), we’re more likely to predict you’re in group \(i\))
- \(P(\text{group}_i)\) is the probability of being in group \(i\) overall (if a group is very rare, we don’t want to predict it often)
- \(P(\mathbf{X})\) is the probability of seeing features like \(\mathbf{X}\) overall, in any group (but we ignore this term)
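To make the scoring step concrete, here is a minimal sketch with two groups and one data point described by two binary word features. Every probability below is made up for illustration; a real model estimates them from the training data.
#made-up prior probabilities P(group)
prior <- c(spam = 0.13, ham = 0.87)
#made-up likelihoods P(word appears | group) for two words
p_free  <- c(spam = 0.60, ham = 0.05)
p_later <- c(spam = 0.20, ham = 0.30)
#score a message containing "free" but not "later":
#the numerator of Bayes' Theorem, multiplying the per-word
#probabilities thanks to conditional independence
score <- p_free * (1 - p_later) * prior
score
   spam     ham
0.06240 0.03045
The Spam score is higher, so this message would be classified as Spam.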
2.3 Naive Bayes in R
From a training collection of SMS text messages that are labeled Spam or Ham, a set of used words can be extracted. After a couple of data cleaning operations (see below), we count how often words occur in the text messages in the training set. Words with a frequency equal to or higher than a chosen threshold, e.g. five, are collected in a dictionary. Only these words are used to distinguish Spam from Ham. As the next step, each word is converted into a binary variable of length equal to the number of messages in the training set, with value 1 in cell \(j\) if the word occurs one or more times in the \(j\)th text message and 0 otherwise. These binary variables are mutated into factor variables, which can be used to generate a Naive Bayes model that distinguishes Spam from Ham.
2.3.1 Packages needed
The library() calls at the top of this page load everything used here: tidyverse, flextable, tm (text mining), e1071 (naiveBayes()), caret (confusionMatrix()), and pROC (ROC curves).
2.3.2 Analysis
df <- read.csv("05-data/sms_spam.csv")
#transform df$type into a factor variable
df$type <- factor(df$type, levels = c("ham", "spam"),
labels=c("ham", "spam"))
flextable(head(df)) %>% autofit()
| type | text |
|---|---|
| ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| ham | Ok lar... Joking wif u oni... |
| spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's |
| ham | U dun say so early hor... U c already then say... |
| ham | Nah I don't think he goes to usf, he lives around here though |
| spam | FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv |
The R package tm (text mining) comes with a couple of helpful functions for text mining. In order to use them, the data to be investigated must be in a so-called Corpus of text documents. The first step in this analysis is to convert the set of SMS messages into such a corpus. First the text messages are transformed into a vector source: an R vector that interprets every element as a text document.
Below are the first five elements of the SMS messages in the vector source. The next step is to create a Corpus with all the SMS messages as text documents. This is in fact a list of lists: every list contains the text of one document plus metadata about that document.
sms_vector_source <- VectorSource(df$text)
sms_vector_source[1:5]
[1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
[2] "Ok lar... Joking wif u oni..."
[3] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
[4] "U dun say so early hor... U c already then say..."
[5] "Nah I don't think he goes to usf, he lives around here though"
sms_corpus <- Corpus(sms_vector_source)
Now it is time to perform a couple of text cleaning preparation steps. The tm package has functions for these actions:
- Because it is assumed that capitalised letters will not be used to distinguish Ham from Spam, replace all uppercase letters by lowercase letters.
- Remove numbers from the text messages. (If it is assumed that numbers in the messages can help differentiate between Spam and Ham, a more advanced way to deal with numbers is required.)
- Lots of short words, like and, or, if, on, to, etc. (so-called stopwords) are not useful to distinguish Ham from Spam; they can be removed from the texts before generating a model.
- Remove punctuation.
- Remove unnecessary spaces.
#transform to lower case text messages
#the tm::tm_map function applies a function on a corpus object
corpus_clean <- tm_map(sms_corpus, tolower) %>%
#remove numbers
tm_map(removeNumbers) %>%
#remove stopwords
tm_map(removeWords, stopwords()) %>%
#remove punctuation
tm_map(removePunctuation) %>%
#remove additional spaces
tm_map(stripWhitespace)
corpus_clean[1]$content
[1] "go jurong point crazy available bugis n great world la e buffet cine got amore wat"
Assuming a train/test split will be used to assess the Naive Bayes model, the next step is splitting the data, e.g. 70% into the training set and 30% into the test set.
set.seed(123)
train <- sample(1:length(corpus_clean),
size = .7*length(corpus_clean),
replace=FALSE)
sms_train <- df[train,]
sms_test <- df[-train,]
corpus_train <- corpus_clean[train]
corpus_test <- corpus_clean[-train]
#check: the sum of the sampled row indices
sum(train)
[1] 10780878
Let’s now create a document term matrix:
dtm_train <- DocumentTermMatrix(corpus_train)
inspect(dtm_train)
<<DocumentTermMatrix (documents: 3901, terms: 6476)>>
Non-/sparse entries: 30139/25232737
Sparsity : 100%
Maximal term length: 40
Weighting : term frequency (tf)
Sample :
Terms
Docs call can free get just know like ltgt now will
17 0 0 1 0 0 0 1 0 0 0
2155 0 3 1 1 0 0 0 6 0 0
2252 0 1 0 0 0 0 0 1 0 0
277 0 0 1 0 0 0 1 0 0 0
2933 0 0 0 0 0 0 0 2 0 0
3313 0 0 0 1 0 0 1 0 0 11
3368 0 0 0 0 0 0 0 0 0 0
3652 0 0 0 0 0 0 0 0 0 0
489 0 0 0 0 0 1 1 0 0 0
66 0 0 0 0 0 0 0 1 0 0
To distinguish Ham from Spam, not every word in the corpus is useful: a word must appear in at least a couple of messages to carry signal. A threshold must be chosen for the number of messages in which a word has to appear for it to be used in the model, e.g. five.
First construct a vector with words with a frequency of at least 5:
frequent_terms_5 <- findFreqTerms(dtm_train, lowfreq=5)
frequent_terms_5[1:10]
 [1] "anything" "lar"      "already"  "ard"      "can"      "check"
[7] "dat" "later" "like" "picking"
The Naive Bayes model uses as features not the number of times a term appears in a message, but only whether a term appears in a message.
It is possible to construct a Binary DTM in which the cells indicate whether a document contains the term (cell value = 1) or not (cell value = 0). It is this Binary DTM that is used in the Naive Bayes model. The Binary DTM is constructed for the words with frequency at least five (the frequent_terms_5 vector above).
#Binary DTM for training data
dtm_train_bin <- DocumentTermMatrix(
corpus_train,
control=list(weighting=weightBin,
    dictionary=frequent_terms_5))
Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
ignored
#Binary DTM for test data; needed to assess the model
dtm_test_bin <- DocumentTermMatrix(
corpus_test,
control=list(weighting=weightBin,
    dictionary=frequent_terms_5))
Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
ignored
inspect(dtm_train_bin)
<<DocumentTermMatrix (documents: 3901, terms: 1179)>>
Non-/sparse entries: 22345/4576934
Sparsity : 100%
Maximal term length: 19
Weighting : binary (bin)
Sample :
Terms
Docs call can free get got just know like now will
2229 0 0 0 0 0 0 0 0 0 0
2252 0 1 0 0 0 0 0 0 0 0
327 1 1 0 1 0 1 0 0 1 0
3313 0 0 0 1 0 0 0 1 0 1
3368 0 0 0 0 0 0 0 0 0 0
3602 0 0 0 0 0 0 0 0 0 0
3652 0 0 0 0 0 0 0 0 0 0
489 0 0 0 0 0 0 1 1 0 0
514 0 1 0 0 0 0 0 0 0 0
66 0 0 0 0 0 0 0 0 0 0
The columns in the Binary DTM must be transformed into factor variables before they can be used to generate a Naive Bayes model.
#first use as.matrix() to convert the DTM from a list into a matrix
dtm_train_bin_matrix <- as.matrix(dtm_train_bin)
dtm_test_bin_matrix <- as.matrix(dtm_test_bin)
#convert the columns into factor
dtm_train_bin_matrix <- apply(dtm_train_bin_matrix, 2, factor)
dtm_test_bin_matrix <- apply(dtm_test_bin_matrix, 2, factor)
#generate model
nb_model <- naiveBayes(x=dtm_train_bin_matrix,
y=sms_train$type)
summary(nb_model)
          Length Class Mode
apriori 2 table numeric
tables 1179 -none- list
levels 2 -none- character
isnumeric 1179 -none- logical
call 3 -none- call
Next we assess the model:
- Use the model to make predictions on the test data
- Assess the model using a confusion matrix
preds <- predict(nb_model, dtm_test_bin_matrix)
cf <- table(preds, sms_test$type)
cf
preds ham spam
ham 1448 24
spam 3 198
As can be seen in the Confusion Matrix, the model makes a good distinction between Spam and Ham. Only 3 of the 1451 Ham messages are classified as Spam, while 24 of the 222 Spam messages are classified as Ham.
The caret::confusionMatrix() function gives a lot of metrics which can be used to assess a classification model. Which metric is most applicable depends on the context of the problem in question.
confusionMatrix(preds, sms_test$type)
Confusion Matrix and Statistics
Reference
Prediction ham spam
ham 1448 24
spam 3 198
Accuracy : 0.9839
95% CI : (0.9766, 0.9893)
No Information Rate : 0.8673
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.927
Mcnemar's Test P-Value : 0.0001186
Sensitivity : 0.9979
Specificity : 0.8919
Pos Pred Value : 0.9837
Neg Pred Value : 0.9851
Prevalence : 0.8673
Detection Rate : 0.8655
Detection Prevalence : 0.8799
Balanced Accuracy : 0.9449
'Positive' Class : ham
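Note that caret treated ham as the positive class (the first factor level), so the Sensitivity above is the Ham detection rate. If Spam is the class of interest, the positive class can be switched with caret's positive argument; Sensitivity/Specificity (and the two predictive values) simply trade places:
#same confusion matrix, but with spam as the positive class
confusionMatrix(preds, sms_test$type, positive = "spam")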
2.3.3 Question
Which metric would you choose to evaluate the model here?
3 Classwork
3.1 Lizzo
Build a Logistic Regression Model to predict the mode of different Lizzo songs. Then check the performance of the model. Fill in the missing parts in the code below:
# Read data
d <- read.csv("05-Data/Lizzo_data.csv")
# Convert boolean/logical TRUE/FALSE to 1/0
# (works even if column is already logical)
d$explicit <- as.integer(d$explicit)
# View first rows
flextable(head(d)) %>% autofit()
X | artist_name | artist_id | album_id | album_type | album_release_date | album_release_year | album_release_date_precision | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | track_id | analysis_url | time_signature | disc_number | duration_ms | explicit | track_href | is_local | track_name | track_preview_url | track_number | type | track_uri | external_urls.spotify | album_name | key_name | mode_name | key_mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Lizzo | 56oDRnqbIiwx4mymNEv7dS | 6dFFcYQ8VhifgdKgYY5LYL | album | 2019-04-19 | 2,019 | day | 0.566 | 0.660 | 1 | -4.039 | 1 | 0.0479 | 0.01560 | 0.00000 | 0.2340 | 0.520 | 162.159 | 6YdQgWSpsxhVeX6Xmv3IFJ | https://api.spotify.com/v1/audio-analysis/6YdQgWSpsxhVeX6Xmv3IFJ | 3 | 1 | 179,978 | 1 | https://api.spotify.com/v1/tracks/6YdQgWSpsxhVeX6Xmv3IFJ | false | Cuz I Love You | https://p.scdn.co/mp3-preview/04472d81bc5a9669b8de9bd000085ac6456abc98?cid=841bc9c803cd42d7ae6230b2f464867f | 1 | track | spotify:track:6YdQgWSpsxhVeX6Xmv3IFJ | https://open.spotify.com/track/6YdQgWSpsxhVeX6Xmv3IFJ | Cuz I Love You | C# | major | C# major |
2 | Lizzo | 56oDRnqbIiwx4mymNEv7dS | 6dFFcYQ8VhifgdKgYY5LYL | album | 2019-04-19 | 2,019 | day | 0.694 | 0.807 | 2 | -3.183 | 1 | 0.0888 | 0.01670 | 0.00000 | 0.5470 | 0.797 | 144.031 | 5I7sGubUsKo4mVJpBoSVUr | https://api.spotify.com/v1/audio-analysis/5I7sGubUsKo4mVJpBoSVUr | 4 | 1 | 184,857 | 1 | https://api.spotify.com/v1/tracks/5I7sGubUsKo4mVJpBoSVUr | false | Like a Girl | https://p.scdn.co/mp3-preview/4f917112d7f75e2ca776c61b6dc8ab72fb9f8d06?cid=841bc9c803cd42d7ae6230b2f464867f | 2 | track | spotify:track:5I7sGubUsKo4mVJpBoSVUr | https://open.spotify.com/track/5I7sGubUsKo4mVJpBoSVUr | Cuz I Love You | D | major | D major |
3 | Lizzo | 56oDRnqbIiwx4mymNEv7dS | 6dFFcYQ8VhifgdKgYY5LYL | album | 2019-04-19 | 2,019 | day | 0.767 | 0.889 | 7 | -2.988 | 1 | 0.0963 | 0.00611 | 0.00000 | 0.4020 | 0.843 | 119.930 | 0k664IuFwVP557Gnx7RhIl | https://api.spotify.com/v1/audio-analysis/0k664IuFwVP557Gnx7RhIl | 4 | 1 | 195,144 | 1 | https://api.spotify.com/v1/tracks/0k664IuFwVP557Gnx7RhIl | false | Juice | https://p.scdn.co/mp3-preview/7ca1598cec36c49118562b02491b9a030e233d88?cid=841bc9c803cd42d7ae6230b2f464867f | 3 | track | spotify:track:0k664IuFwVP557Gnx7RhIl | https://open.spotify.com/track/0k664IuFwVP557Gnx7RhIl | Cuz I Love You | G | major | G major |
4 | Lizzo | 56oDRnqbIiwx4mymNEv7dS | 6dFFcYQ8VhifgdKgYY5LYL | album | 2019-04-19 | 2,019 | day | 0.693 | 0.849 | 7 | -4.517 | 1 | 0.0892 | 0.00466 | 0.00127 | 0.5030 | 0.767 | 99.021 | 6h2wpo2pshM8QnAvRySEO0 | https://api.spotify.com/v1/audio-analysis/6h2wpo2pshM8QnAvRySEO0 | 4 | 1 | 175,234 | 1 | https://api.spotify.com/v1/tracks/6h2wpo2pshM8QnAvRySEO0 | false | Soulmate | https://p.scdn.co/mp3-preview/467bc6a734bfb66e77f069ed62c40598c057c083?cid=841bc9c803cd42d7ae6230b2f464867f | 4 | track | spotify:track:6h2wpo2pshM8QnAvRySEO0 | https://open.spotify.com/track/6h2wpo2pshM8QnAvRySEO0 | Cuz I Love You | G | major | G major |
5 | Lizzo | 56oDRnqbIiwx4mymNEv7dS | 6dFFcYQ8VhifgdKgYY5LYL | album | 2019-04-19 | 2,019 | day | 0.674 | 0.542 | 8 | -6.983 | 1 | 0.2530 | 0.04760 | 0.00000 | 0.0682 | 0.534 | 150.979 | 3kxsEF30mM0TZWfkOv4XsS | https://api.spotify.com/v1/audio-analysis/3kxsEF30mM0TZWfkOv4XsS | 3 | 1 | 231,570 | 0 | https://api.spotify.com/v1/tracks/3kxsEF30mM0TZWfkOv4XsS | false | Jerome | https://p.scdn.co/mp3-preview/8f662de51ad3f613063baba18babe9c508e84be2?cid=841bc9c803cd42d7ae6230b2f464867f | 5 | track | spotify:track:3kxsEF30mM0TZWfkOv4XsS | https://open.spotify.com/track/3kxsEF30mM0TZWfkOv4XsS | Cuz I Love You | G# | major | G# major |
6 | Lizzo | 56oDRnqbIiwx4mymNEv7dS | 6dFFcYQ8VhifgdKgYY5LYL | album | 2019-04-19 | 2,019 | day | 0.623 | 0.719 | 11 | -3.331 | 0 | 0.0403 | 0.03830 | 0.00000 | 0.2470 | 0.722 | 131.027 | 6pRLJSrcskYSKYuKgJtDgD | https://api.spotify.com/v1/audio-analysis/6pRLJSrcskYSKYuKgJtDgD | 4 | 1 | 175,952 | 0 | https://api.spotify.com/v1/tracks/6pRLJSrcskYSKYuKgJtDgD | false | Crybaby | https://p.scdn.co/mp3-preview/d9ef0675e835123cbb253a6a96bc37036306990e?cid=841bc9c803cd42d7ae6230b2f464867f | 6 | track | spotify:track:6pRLJSrcskYSKYuKgJtDgD | https://open.spotify.com/track/6pRLJSrcskYSKYuKgJtDgD | Cuz I Love You | B | minor | B minor |
Fit the model:
# --- 1. Define predictors and outcome ---
predictors <- c("danceability", "energy", "instrumentalness", "explicit")
contin <- c("danceability", "energy", "instrumentalness")
# Make sure outcome is a factor (binary classification)
d$mode <- as.factor(d$mode)
# --- 2. Train / test split (80/20, seed = 123) ---
set.seed(123)
train_index <- createDataPartition(d$mode, p = 0.8, list = FALSE) # from caret
train <-  #fill in: the rows of d selected by train_index
test <-   #fill in: the remaining rows of d
# --- 3. Scale only continuous predictors ---
# fit preprocessing on training data only
preproc <- preProcess(train[, contin], method = c("center", "scale"))
# apply to train and test
train_scaled <- train
train_scaled[, contin] <- predict(preproc, train[, contin])
test_scaled <- test
test_scaled[, contin] <- predict(preproc, test[, contin])
# --- 4. Fit logistic regression model ---
# formula: mode ~ danceability + energy + instrumentalness + explicit
model <- glm(
  #fill in: mode ~ danceability + energy + instrumentalness + explicit,
  #with data = train_scaled and family = binomial
)
# --- 5. Predictions: class and probabilities ---
# Training set
train_prob <- predict(model, newdata = train_scaled, type = "response")
# glm with factor response: prob is for the *second* level of mode
# assume "1" is the positive class:
train_pred <- ifelse(train_prob > 0.5, "1", "0")
train_pred <- factor(train_pred, levels = levels(train$mode))
# Test set
test_prob <- predict(model, newdata = test_scaled, type = "response")
test_pred <- ifelse(test_prob > 0.5, "1", "0")
test_pred <- factor(test_pred, levels = levels(test$mode))
# --- 6. Metrics: accuracy, precision, recall, F1, ROC AUC ---
# helper to compute metrics
metrics_fun <- function(y_true, y_pred, y_prob) {
cm <- table(truth = y_true, pred = y_pred)
TP <- cm["1", "1"]
TN <- cm["0", "0"]
FP <- cm["0", "1"]
FN <- cm["1", "0"]
acc <- (TP + TN) / sum(cm)
prec <- TP / (TP + FP)
rec <- TP / (TP + FN)
f1 <- 2 * prec * rec / (prec + rec)
# ROC AUC
roc_obj <- roc(response = y_true, predictor = y_prob, levels = c("0", "1"))
auc_val <- auc(roc_obj)
c(Accuracy = acc,
Precision = prec,
Recall = rec,
F1 = f1,
ROC_AUC = as.numeric(auc_val))
}
# Train metrics
train_metrics <-  #fill in: metrics_fun() on the training predictions
test_metrics <-   #fill in: metrics_fun() on the test predictions
Visualize a confusion matrix:
# Build confusion matrix table to visualize
cm <- table(Predicted = train_pred, Actual = train$mode)
cm_df <- as.data.frame(cm) #ggplot needs a long data frame with a Freq column
ggplot(cm_df, aes(x = Predicted, y = Actual, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 6) +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Confusion Matrix (Train Set)", fill = "Count") +
theme_minimal(base_size = 14)
Or create a direct table:
# table
table(Actual = train$mode, Predicted = train_pred)
Finally, plot a ROC curve:
roc_train <- roc(response = train$mode,
predictor = train_prob,
levels = c("0", "1")) # "0" = negative, "1" = positive
# Basic ROC plot with AUC
plot(roc_train, print.auc = TRUE, main = "ROC Curve (Train)")
3.1.1 Question
Grab and display the coefficients from the above model. How do you interpret them?