Classification II

Author

Termeh Shafie

1 Metrics

Once we build a classification model, it's important to make sure we know how it performs! We talked about a few metrics you can calculate to help you assess how well your model is doing.

  • Accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\)
  • Confusion Matrix Patterns
                   Actually 1            Actually 0
      Predicted 1  True Positive (TP)    False Positive (FP)
      Predicted 0  False Negative (FN)   True Negative (TN)
  • Precision: \(\frac{TP}{TP + FP}\), how many of the predicted positives are true positives?

  • Recall/Sensitivity: \(\frac{TP}{TP + FN}\), how many of the actual positives did we accurately predict?

  • Specificity: \(\frac{TN}{TN + FP}\), how many of the actual negatives did we accurately predict?

  • F1 Score: \(\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\), the harmonic mean of precision and recall.

  • ROC AUC: The area under the ROC curve, which puts the False Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) on the y-axis.
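
As a minimal sketch of these formulas in R, assuming we already have the four cell counts of a confusion matrix (the counts below are made up for illustration):

# hypothetical confusion matrix counts
TP <- 90; TN <- 50; FP <- 10; FN <- 20

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)        # sensitivity
specificity <- TN / (TN + FP)
f1          <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall,
  specificity = specificity, F1 = f1)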

1.1 Question

If you were designing a Flu test, which of these metrics would be most important to you (there’s no one right answer) and why?

2 Naive Bayes

2.1 Naive

Naive Bayes is Naive because it makes the assumption of conditional independence. This means that within classes (the groups we're trying to predict), we assume the predictors are independent. However, we know that's usually not true, so it's a Naive assumption to make.

However, it simplifies our computation. When events are independent we can calculate their joint probability just by multiplying them:

\[\underbrace{P(A,B,C)}_\text{probability of A, B, and C} = P(A) * P(B) * P(C)\]

This is much simpler than calculating the joint probability \(P(A,B,C)\) while taking into account any relationships between the predictors.
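
As a quick numeric illustration with made-up probabilities:

# made-up marginal probabilities of three independent events
p_A <- 0.5
p_B <- 0.2
p_C <- 0.1

# joint probability under (conditional) independence
p_A * p_B * p_C   # 0.5 * 0.2 * 0.1 = 0.01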

2.2 Bayes

When classifying a data point, we use Bayes’ Theorem (at least the numerator of it) to calculate the score for each potential category. Then we choose the category with the highest score. Because we’re comparing scores that have the same denominator, we can ignore it (which is nice, as it’s difficult to calculate).

\[ \overbrace{P(\text{group}_i | \mathbf{X}) = \frac{P(\mathbf{X} | \text{group}_i) * P(\text{group}_i)}{P(\mathbf{X})}}^\text{Bayes' Theorem}\]

  • \(P(\text{group}_i | \mathbf{X})\) is the probability of our data point being in group \(i\) based on its predictor values \(\mathbf{X}\)
  • \(P(\mathbf{X} | \text{group}_i)\) is the likelihood of seeing features like \(\mathbf{X}\) in group \(i\) (if our features are commonly seen in group \(i\), we’re more likely to predict you’re in group \(i\))
  • \(P(\text{group}_i)\) is the probability of being in group \(i\) overall (if a group is very rare, we don’t want to predict it often)
  • \(P(\mathbf{X})\) is the probability of seeing features like \(\mathbf{X}\) overall, in any group (but we ignore this term)
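
To make the scoring concrete, here is a minimal sketch with hypothetical priors and likelihoods (not taken from any fitted model); we compute the Bayes numerator for each group and pick the group with the highest score:

# hypothetical prior probabilities P(group_i)
prior <- c(ham = 0.87, spam = 0.13)

# hypothetical likelihoods P(X | group_i) of the observed features
likelihood <- c(ham = 0.002, spam = 0.020)

# numerator of Bayes' Theorem for each group
score <- likelihood * prior
score

# predict the group with the highest score
names(which.max(score))   # "spam" in this made-up example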

2.3 Naive Bayes in R

From a training collection of SMS text messages labeled Spam or Ham, a set of used words can be extracted. After a couple of data cleaning operations (see below), we count how often each word occurs in the text messages of the training set. Words with a frequency equal to or higher than a chosen threshold, e.g. five, are collected in a dictionary. Only these words are used to distinguish Spam from Ham. As the next step, each word is converted into a binary variable of length equal to the number of messages in the training set, with value 1 in cell \(j\) if the word occurs one or more times in the \(j\)th text message and 0 otherwise. These binary variables are then converted into factor variables, which can be used to fit a Naive Bayes model that distinguishes Spam from Ham.

2.3.1 Packages needed

library(tidyverse)
library(flextable) # install if not on your machine
library(tm) # install if not on your machine
library(e1071) # install if not on your machine
library(caret) # install if not on your machine
library(pROC) # install if not on your machine

2.3.2 Analysis

df <- read.csv("05-data/sms_spam.csv")

#transform df$type into a factor variable
df$type <- factor(df$type, levels = c("ham", "spam"),
                       labels=c("ham", "spam"))

flextable(head(df)) %>% autofit()

type  text
ham   Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham   Ok lar... Joking wif u oni...
spam  Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham   U dun say so early hor... U c already then say...
ham   Nah I don't think he goes to usf, he lives around here though
spam  FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

The R package tm (text mining) comes with a couple of helpful functions for text mining. In order to use them, the data to be investigated must be in a so-called Corpus of text documents. The first step in this analysis is therefore to convert the set of SMS messages into such a corpus. First, the text messages are transformed into a vector source; this is an R vector that interprets every element as a text document.

Below, the first five elements of the vector source are printed. Then a Corpus is created with all the SMS messages as text documents. This is in fact a list of lists: every list contains the text of a document and metadata about that document.

sms_vector_source <- VectorSource(df$text)
sms_vector_source[1:5]
[1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."                                            
[2] "Ok lar... Joking wif u oni..."                                                                                                                              
[3] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
[4] "U dun say so early hor... U c already then say..."                                                                                                          
[5] "Nah I don't think he goes to usf, he lives around here though"                                                                                              
sms_corpus <- Corpus(sms_vector_source)

Now it is time to perform a couple of text cleaning preparation steps. The tm package has functions for these actions:

  1. Because it is assumed that capitalised letters will not be used to distinguish Ham from Spam, replace all uppercase letters by lowercase letters.

  2. Remove numbers from the text messages. If it is assumed that numbers in the messages can help differentiate between Spam and Ham, a more advanced way to deal with numbers is required.

  3. Lots of short words, like and, or, if, on, to etc. are not useful for distinguishing Ham from Spam; they can be removed from the texts before generating a model.

  4. Remove punctuation.

  5. Remove unnecessary spaces.

#transform to lower case text messages
#the tm::tm_map function applies a function on a corpus object
corpus_clean <- tm_map(sms_corpus, tolower) %>% 
#remove numbers
  tm_map(removeNumbers) %>% 
#remove stopwords
  tm_map(removeWords, stopwords()) %>% 
#remove punctuation
  tm_map(removePunctuation) %>% 
#remove additional spaces
  tm_map(stripWhitespace)

corpus_clean[1]$content
[1] "go jurong point crazy available bugis n great world la e buffet cine got amore wat"

To assess the Naive Bayes model via a train/test split, the next step is splitting the data, e.g. 70% into the training set and 30% into the test set.

set.seed(123)
train <- sample(1:length(corpus_clean),
                size = .7*length(corpus_clean),
                replace=FALSE)

sms_train <- df[train,]
sms_test <- df[-train,]

corpus_train <- corpus_clean[train]
corpus_test <- corpus_clean[-train]

# number of messages in the training set
length(train)
[1] 3901

Let’s now create a document term matrix:

dtm_train <- DocumentTermMatrix(corpus_train)
inspect(dtm_train)
<<DocumentTermMatrix (documents: 3901, terms: 6476)>>
Non-/sparse entries: 30139/25232737
Sparsity           : 100%
Maximal term length: 40
Weighting          : term frequency (tf)
Sample             :
      Terms
Docs   call can free get just know like ltgt now will
  17      0   0    1   0    0    0    1    0   0    0
  2155    0   3    1   1    0    0    0    6   0    0
  2252    0   1    0   0    0    0    0    1   0    0
  277     0   0    1   0    0    0    1    0   0    0
  2933    0   0    0   0    0    0    0    2   0    0
  3313    0   0    0   1    0    0    1    0   0   11
  3368    0   0    0   0    0    0    0    0   0    0
  3652    0   0    0   0    0    0    0    0   0    0
  489     0   0    0   0    0    1    1    0   0    0
  66      0   0    0   0    0    0    0    1   0    0

To distinguish Ham from Spam, not every word in the corpus is useful. Words must occur at least a couple of times to be useful. A threshold must be chosen for how often a word has to occur in the training messages to be used in the model, e.g. five times.

First construct a vector with words with a frequency of at least 5:

frequent_terms_5 <- findFreqTerms(dtm_train, lowfreq=5)
frequent_terms_5[1:10]
 [1] "anything" "lar"      "already"  "ard"      "can"      "check"   
 [7] "dat"      "later"    "like"     "picking" 

The Naive Bayes model uses as features not the number of times a term appears in a message, but only whether a term appears in a message.

It is possible to construct a Binary DTM in which the cells indicate whether a document contains the term (cell value = 1) or not (cell value = 0). It is this Binary DTM that is used in the Naive Bayes model. The Binary DTM is constructed for the words with a frequency of at least five.

#Binary DTM for training data
dtm_train_bin <- DocumentTermMatrix(
                      corpus_train,
                      control=list(weighting=weightBin,
                                   dictionary=frequent_terms_5))
Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
ignored
#Binary DTM for test data; needed to assess the model
dtm_test_bin <- DocumentTermMatrix(
                      corpus_test,
                      control=list(weighting=weightBin,
                                   dictionary=frequent_terms_5))
Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
ignored
inspect(dtm_train_bin)
<<DocumentTermMatrix (documents: 3901, terms: 1179)>>
Non-/sparse entries: 22345/4576934
Sparsity           : 100%
Maximal term length: 19
Weighting          : binary (bin)
Sample             :
      Terms
Docs   call can free get got just know like now will
  2229    0   0    0   0   0    0    0    0   0    0
  2252    0   1    0   0   0    0    0    0   0    0
  327     1   1    0   1   0    1    0    0   1    0
  3313    0   0    0   1   0    0    0    1   0    1
  3368    0   0    0   0   0    0    0    0   0    0
  3602    0   0    0   0   0    0    0    0   0    0
  3652    0   0    0   0   0    0    0    0   0    0
  489     0   0    0   0   0    0    1    1   0    0
  514     0   1    0   0   0    0    0    0   0    0
  66      0   0    0   0   0    0    0    0   0    0

The columns in the Binary DTM must be transformed into factor variables to use them in a Naive Bayes model. Then the Binary DTM is ready to generate a Naive Bayes model.

#first use as.matrix() to convert the sparse DTM into an ordinary matrix
dtm_train_bin_matrix <- as.matrix(dtm_train_bin)
dtm_test_bin_matrix <- as.matrix(dtm_test_bin)

#convert the columns into factor
dtm_train_bin_matrix <- apply(dtm_train_bin_matrix, 2, factor)
dtm_test_bin_matrix <- apply(dtm_test_bin_matrix, 2, factor)

#generate model
nb_model <- naiveBayes(x=dtm_train_bin_matrix,
                       y=sms_train$type)

summary(nb_model)
          Length Class  Mode     
apriori      2   table  numeric  
tables    1179   -none- list     
levels       2   -none- character
isnumeric 1179   -none- logical  
call         3   -none- call     
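
As the summary shows, the fitted object stores the a-priori class distribution and one conditional table per word in the dictionary. These can be inspected directly, e.g. for the word "free", which appears among the sampled terms of the binary DTM above:

# class distribution in the training data (counts of ham and spam)
nb_model$apriori

# conditional distribution of the 0/1 indicator for the word "free"
nb_model$tables[["free"]]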

Next, we assess the model:

  1. Use the model to make predictions on the test data

  2. Assess the model using a confusion matrix

preds <- predict(nb_model, dtm_test_bin_matrix)

cf <- table(preds, sms_test$type)

cf
      
preds   ham spam
  ham  1448   24
  spam    3  198

As can be seen in the confusion matrix, the model makes a good distinction between Spam and Ham. Only 3 of the 1451 Ham messages are classified as Spam, while 24 of the 222 Spam messages are classified as Ham.

The caret::confusionMatrix() function gives a lot of metrics which can be used to assess a classification model. Which metric is most applicable depends on the context of the problem in question.

confusionMatrix(preds, sms_test$type)
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1448   24
      spam    3  198
                                          
               Accuracy : 0.9839          
                 95% CI : (0.9766, 0.9893)
    No Information Rate : 0.8673          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.927           
                                          
 Mcnemar's Test P-Value : 0.0001186       
                                          
            Sensitivity : 0.9979          
            Specificity : 0.8919          
         Pos Pred Value : 0.9837          
         Neg Pred Value : 0.9851          
             Prevalence : 0.8673          
         Detection Rate : 0.8655          
   Detection Prevalence : 0.8799          
      Balanced Accuracy : 0.9449          
                                          
       'Positive' Class : ham             
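
Note that caret treated "ham" (the first factor level) as the positive class above. If Spam is the class of interest, the positive class can be set explicitly. And since naiveBayes() can also return posterior probabilities, an ROC AUC can be computed with pROC. A sketch, reusing the objects defined above:

# same confusion matrix, but with spam as the positive class
confusionMatrix(preds, sms_test$type, positive = "spam")

# posterior probabilities instead of predicted classes
probs <- predict(nb_model, dtm_test_bin_matrix, type = "raw")

# ROC curve and AUC, treating spam as the positive class
roc_nb <- roc(response = sms_test$type,
              predictor = probs[, "spam"],
              levels = c("ham", "spam"))
auc(roc_nb)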
                                          

2.3.3 Question

Which metric would you choose to evaluate the model here?

3 Classwork

3.1 Lizzo

Build a Logistic Regression Model to predict the mode of different Lizzo songs. Then check the performance of the model. Fill in the missing parts in the code below:

# Read data
d <- read.csv("05-Data/Lizzo_data.csv")

# Convert boolean/logical TRUE/FALSE to 1/0
# (works even if column is already logical)
d$explicit <- as.integer(d$explicit)

# View first rows
flextable(head(d)) %>% autofit()

The data frame has 37 columns: X, artist_name, artist_id, album_id, album_type, album_release_date, album_release_year, album_release_date_precision, danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, track_id, analysis_url, time_signature, disc_number, duration_ms, explicit, track_href, is_local, track_name, track_preview_url, track_number, type, track_uri, external_urls.spotify, album_name, key_name, mode_name, key_mode. Restricted to the variables used in the model below, the first six rows (all tracks from the album Cuz I Love You) are:

track_name      danceability  energy  instrumentalness  explicit  mode  mode_name
Cuz I Love You  0.566         0.660   0.00000           1         1     major
Like a Girl     0.694         0.807   0.00000           1         1     major
Juice           0.767         0.889   0.00000           1         1     major
Soulmate        0.693         0.849   0.00127           1         1     major
Jerome          0.674         0.542   0.00000           0         1     major
Crybaby         0.623         0.719   0.00000           0         0     minor

Fit the model:

# --- 1. Define predictors and outcome ---

predictors <- c("danceability", "energy", "instrumentalness", "explicit")
contin     <- c("danceability", "energy", "instrumentalness")

# Make sure outcome is a factor (binary classification)
d$mode <- as.factor(d$mode)

# --- 2. Train / test split (80/20, seed = 123) ---
set.seed(123)
train_index <- createDataPartition(d$mode, p = 0.8, list = FALSE) # from caret

train <- 
test  <- 


# --- 3. Scale only continuous predictors ---
# fit preprocessing on training data only
preproc <- preProcess(train[, contin], method = c("center", "scale"))

# apply to train and test
train_scaled <- train
train_scaled[, contin] <- predict(preproc, train[, contin])

test_scaled <- test
test_scaled[, contin] <- predict(preproc, test[, contin])

# --- 4. Fit logistic regression model  ---

# formula: mode ~ danceability + energy + instrumentalness + explicit
model <- glm(  
                
              )

  
# --- 5. Predictions: class and probabilities ---

# Training set
train_prob <- predict(model, newdata = train_scaled, type = "response")
# glm with factor response: prob is for the *second* level of mode
# assume "1" is the positive class:
train_pred <- ifelse(train_prob > 0.5, "1", "0")
train_pred <- factor(train_pred, levels = levels(train$mode))

# Test set
test_prob <- predict(model, newdata = test_scaled, type = "response")
test_pred <- ifelse(test_prob > 0.5, "1", "0")
test_pred <- factor(test_pred, levels = levels(test$mode))

  
# --- 6. Metrics: accuracy, precision, recall, F1, ROC AUC ---

# helper to compute metrics
metrics_fun <- function(y_true, y_pred, y_prob) {
  cm <- table(truth = y_true, pred = y_pred)
  
  TP <- cm["1", "1"]
  TN <- cm["0", "0"]
  FP <- cm["0", "1"]
  FN <- cm["1", "0"]
  
  acc  <- (TP + TN) / sum(cm)
  prec <- TP / (TP + FP)
  rec  <- TP / (TP + FN)
  f1   <- 2 * prec * rec / (prec + rec)
  
  # ROC AUC
  roc_obj <- roc(response = y_true, predictor = y_prob, levels = c("0", "1"))
  auc_val <- auc(roc_obj)
  
  c(Accuracy = acc,
    Precision = prec,
    Recall = rec,
    F1 = f1,
    ROC_AUC = as.numeric(auc_val))
}

# Train metrics
train_metrics <- 
test_metrics  <- 

Visualize a confusion matrix:

# Build confusion matrix table to visualize
cm <- table(Predicted = train_pred, Actual = train$mode)
cm_df <- as.data.frame(cm)  # ggplot() needs a data frame, not a table

ggplot(cm_df, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), size = 6) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Confusion Matrix (Train Set)", fill = "Count") +
  theme_minimal(base_size = 14)

or create a direct table:

# table
table(Actual = train$mode, Predicted = train_pred)

Finally plot a ROC curve:

roc_train <- roc(response = train$mode,
                 predictor = train_prob,
                 levels   = c("0", "1"))   # "0" = negative, "1" = positive

# Basic ROC plot with AUC
plot(roc_train, print.auc = TRUE, main = "ROC Curve (Train)")

3.1.1 Question

Grab and display the coefficients from the above model. How do you interpret them?