Ch. 1 - k-Nearest Neighbors (kNN)

Classification with Nearest Neighbors

[Video]

Applying nearest neighbors in R

# library(class)
# pred <- knn(train = training_data, test = testing_data, cl = training_labels)

Recognizing a road sign with kNN

# Load the 'class' package
library(class)

# Create a vector of labels
sign_types <- signs$sign_type

# Classify the next sign observed
knn(train = signs[-1], test = next_sign, cl = sign_types)
## [1] stop
## Levels: pedestrian speed stop

Thinking like kNN

With your help, the test car successfully identified the sign and stopped safely at the intersection.

How did the knn() function correctly classify the stop sign?

  • It learned that stop signs are red
  • [*] The sign was in some way similar to another stop sign
  • Stop signs have eight sides
  • The other types of signs were less likely
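
That similarity is quantitative: knn() computes the Euclidean distance between the new sign's 48 color values and every labeled example, then lets the nearest neighbors vote. A minimal sketch of the k = 1 case, assuming next_sign holds only the feature columns (as in the knn() call above):

# A sketch: Euclidean distance from next_sign to each training sign
next_vec <- unlist(next_sign)  # flatten the one-row data frame to a numeric vector
dists <- apply(signs[-1], 1, function(row) sqrt(sum((row - next_vec)^2)))

# With k = 1, the prediction is the label of the single nearest neighbor
sign_types[which.min(dists)]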

Exploring the traffic sign dataset

# Examine the structure of the signs dataset
str(signs)
## 'data.frame':    146 obs. of  49 variables:
##  $ sign_type: Factor w/ 3 levels "pedestrian","speed",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ r1       : int  155 142 57 22 169 75 136 149 13 123 ...
##  $ g1       : int  228 217 54 35 179 67 149 225 34 124 ...
##  $ b1       : int  251 242 50 41 170 60 157 241 28 107 ...
##  $ r2       : int  135 166 187 171 231 131 200 34 5 83 ...
##  $ g2       : int  188 204 201 178 254 89 203 45 21 61 ...
##  $ b2       : int  101 44 68 26 27 53 107 1 11 26 ...
##  $ r3       : int  156 142 51 19 97 214 150 155 123 116 ...
##  $ g3       : int  227 217 51 27 107 144 167 226 154 124 ...
##  $ b3       : int  245 242 45 29 99 75 134 238 140 115 ...
##  $ r4       : int  145 147 59 19 123 156 171 147 21 67 ...
##  $ g4       : int  211 219 62 27 147 169 218 222 46 67 ...
##  $ b4       : int  228 242 65 29 152 190 252 242 41 52 ...
##  $ r5       : int  166 164 156 42 221 67 171 170 36 70 ...
##  $ g5       : int  233 228 171 37 236 50 158 191 60 53 ...
##  $ b5       : int  245 229 50 3 117 36 108 113 26 26 ...
##  $ r6       : int  212 84 254 217 205 37 157 26 75 26 ...
##  $ g6       : int  254 116 255 228 225 36 186 37 108 26 ...
##  $ b6       : int  52 17 36 19 80 42 11 12 44 21 ...
##  $ r7       : int  212 217 211 221 235 44 26 34 13 52 ...
##  $ g7       : int  254 254 226 235 254 42 35 45 27 45 ...
##  $ b7       : int  11 26 70 20 60 44 10 19 25 27 ...
##  $ r8       : int  188 155 78 181 90 192 180 221 133 117 ...
##  $ g8       : int  229 203 73 183 110 131 211 249 163 109 ...
##  $ b8       : int  117 128 64 73 9 73 236 184 126 83 ...
##  $ r9       : int  170 213 220 237 216 123 129 226 83 110 ...
##  $ g9       : int  216 253 234 234 236 74 109 246 125 74 ...
##  $ b9       : int  120 51 59 44 66 22 73 59 19 12 ...
##  $ r10      : int  211 217 254 251 229 36 161 30 13 98 ...
##  $ g10      : int  254 255 255 254 255 34 190 40 27 70 ...
##  $ b10      : int  3 21 51 2 12 37 10 34 25 26 ...
##  $ r11      : int  212 217 253 235 235 44 161 34 9 20 ...
##  $ g11      : int  254 255 255 243 254 42 190 44 23 21 ...
##  $ b11      : int  19 21 44 12 60 44 6 35 18 20 ...
##  $ r12      : int  172 158 66 19 163 197 187 241 85 113 ...
##  $ g12      : int  235 225 68 27 168 114 215 255 128 76 ...
##  $ b12      : int  244 237 68 29 152 21 236 54 21 14 ...
##  $ r13      : int  172 164 69 20 124 171 141 205 83 106 ...
##  $ g13      : int  235 227 65 29 117 102 142 229 125 69 ...
##  $ b13      : int  244 237 59 34 91 26 140 46 19 9 ...
##  $ r14      : int  172 182 76 64 188 197 189 226 85 102 ...
##  $ g14      : int  228 228 84 61 205 114 171 246 128 67 ...
##  $ b14      : int  235 143 22 4 78 21 140 59 21 6 ...
##  $ r15      : int  177 171 82 211 125 123 214 235 85 106 ...
##  $ g15      : int  235 228 93 222 147 74 221 252 128 69 ...
##  $ b15      : int  244 196 17 78 20 22 201 67 21 9 ...
##  $ r16      : int  22 164 58 19 160 180 188 237 83 43 ...
##  $ g16      : int  52 227 60 27 183 107 211 254 125 29 ...
##  $ b16      : int  53 237 60 29 187 26 227 53 19 11 ...
# Count the number of signs of each type
table(signs$sign_type)
## 
## pedestrian      speed       stop 
##         46         49         51
# Check r10's average red level by sign type
aggregate(r10 ~ sign_type, data = signs, mean)
##    sign_type       r10
## 1 pedestrian 113.71739
## 2      speed  80.63265
## 3       stop 132.39216

Classifying a collection of road signs

# Use kNN to identify the test road signs
sign_types <- signs$sign_type
signs_pred <- knn(train = signs[-1], test = test_signs[-1], cl = sign_types)

# Create a confusion matrix of the predicted versus actual values
signs_actual <- test_signs$sign_type
table(signs_pred, signs_actual)
##             signs_actual
## signs_pred   pedestrian speed stop
##   pedestrian         19     2    0
##   speed               0    17    0
##   stop                0     2   19
# Compute the accuracy
mean(signs_pred == signs_actual)
## [1] 0.9322034

What about the ‘k’ in kNN?

[Video]

Understanding the impact of ‘k’

There is a complex relationship between k and classification accuracy. Bigger is not always better.

Which of these is a valid reason for keeping k as small as possible (but no smaller)?

  • A smaller k requires less processing power
  • A smaller k reduces the impact of noisy data
  • A smaller k minimizes the chance of a tie vote
  • [*] A smaller k may utilize more subtle patterns

Testing other ‘k’ values

# Compute the accuracy of the baseline model (default k = 1)
k_1 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types)
mean(k_1 == signs_actual)
## [1] 0.9322034
# Modify the above to set k = 7
k_7 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 7)
mean(k_7 == signs_actual)
## [1] 0.9491525
# Set k = 15 and compare to the above
k_15 <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 15)
mean(k_15 == signs_actual)
## [1] 0.8644068

Seeing how the neighbors voted

# Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- knn(train = signs[-1], test = signs_test[-1], cl = sign_types, k = 7, prob = TRUE)

# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")

# Examine the first several predictions
head(sign_pred)
## [1] pedestrian pedestrian pedestrian stop       pedestrian pedestrian
## Levels: pedestrian speed stop
# Examine the proportion of votes for the winning class
head(sign_prob)
## [1] 0.5714286 0.5714286 0.8571429 0.5714286 0.8571429 0.5714286
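
Note that with k = 7 these probabilities are simply vote fractions: 0.5714286 means 4 of the 7 neighbors agreed on the winning class, and 0.8571429 means 6 of 7 did.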

Data preparation for kNN

[Video]

Normalizing data in R

# Define a min-max normalize() function
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
# Normalized version of r1
summary(normalize(signs$r1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1894  0.3571  0.4237  0.6483  1.0000
# Un-normalized version of r1
summary(signs$r1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   46.75   85.50  100.87  152.75  234.00

Why normalize data?

Before applying kNN to a classification task, it is common practice to rescale the data using a technique like min-max normalization. What is the purpose of this step?

  • [*] To ensure all data elements may contribute equal shares to distance.
  • To help the kNN algorithm converge on a solution faster.
  • To convert all of the data elements to numbers.
  • To redistribute the data as a normal bell curve.
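
In practice, you would apply normalize() to every feature column before calling knn(), so that all 48 color values contribute on the same 0 to 1 scale. A sketch, assuming the signs data frame and the normalize() function defined above:

# Rescale all feature columns (everything except sign_type) to the 0-1 range
signs_n <- as.data.frame(lapply(signs[-1], normalize))

# Each column should now have min 0 and max 1
summary(signs_n$r1)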

Ch. 2 - Naive Bayes

Understanding Bayesian methods

[Video]

Making predictions with Naive Bayes

# # Building a Naive Bayes model
# library(naivebayes)
# m <- naive_bayes(location ~ time_of_day, data = location_history)
# # Making predictions with Naive Bayes
# future_location <- predict(m, future_conditions)

Computing probabilities

# Compute P(A) 
p_A <- nrow(subset(where9am, location == "office")) / nrow(where9am)

# Compute P(B)
p_B <- nrow(subset(where9am, daytype == "weekday")) / nrow(where9am)

# Compute the observed P(A and B)
p_AB <- nrow(subset(where9am, location == "office" & daytype == "weekday")) / nrow(where9am)

# Compute P(A | B) and print its value
p_A_given_B <- p_AB / p_B
p_A_given_B
## [1] 0.6
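
This is just the definition of conditional probability: P(A | B) = P(A and B) / P(B).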

Understanding dependent events

In the previous exercise, you found that there is a 60% chance Brett is in the office at 9am given that it is a weekday. On the other hand, if Brett is never in the office on a weekend, which of the following is/are true?

  • P(office and weekend) = 0.
  • P(office | weekend) = 0.
  • Brett’s location is dependent on the day of the week.
  • [*] All of the above.
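
You could verify the first two statements with the same subset()/nrow() counting used above. A sketch, assuming weekends are coded as daytype == "weekend" in where9am:

# P(office and weekend) -- expected to be 0 if Brett never works weekends
p_office_weekend <- nrow(subset(where9am, location == "office" & daytype == "weekend")) / nrow(where9am)

# P(office | weekend) = P(office and weekend) / P(weekend) -- also 0
p_weekend <- nrow(subset(where9am, daytype == "weekend")) / nrow(where9am)
p_office_weekend / p_weekend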

A simple Naive Bayes location model

# Load the naivebayes package
library(naivebayes)
## naivebayes 0.9.7 loaded
# Build the location prediction model
locmodel <- naive_bayes(location ~ daytype, data = where9am)
## Warning: naive_bayes(): Feature daytype - zero probabilities are present.
## Consider Laplace smoothing.
# Predict Thursday's 9am location
predict(locmodel, newdata = thursday9am)
## [1] <NA> <NA>
## Levels: appointment campus home office restaurant store theater
# Predict Saturdays's 9am location
predict(locmodel, newdata = saturday9am)
## [1] <NA> <NA>
## Levels: appointment campus home office restaurant store theater

Examining “raw” probabilities

# The 'naivebayes' package is loaded into the workspace
# and the Naive Bayes 'locmodel' has been built

# Examine the location prediction model
locmodel
## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = location ~ daytype, data = where9am)
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
## appointment      campus        home      office  restaurant       store 
##  0.01098901  0.10989011  0.45054945  0.42857143  0.00000000  0.00000000 
##     theater 
##  0.00000000 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: daytype (Bernoulli) 
## --------------------------------------------------------------------------------- 
##          
## daytype   appointment    campus      home    office restaurant store theater
##   weekday   1.0000000 1.0000000 0.3658537 1.0000000                         
##   weekend   0.0000000 0.0000000 0.6341463 0.0000000                         
## 
## ---------------------------------------------------------------------------------
# Obtain the predicted probabilities for Thursday at 9am
predict(locmodel, newdata = thursday9am , type = "prob")
##      appointment campus home office restaurant store theater
## [1,]         NaN    NaN  NaN    NaN        NaN   NaN     NaN
## [2,]         NaN    NaN  NaN    NaN        NaN   NaN     NaN
# Obtain the predicted probabilities for Saturday at 9am
predict(locmodel, newdata = saturday9am , type = "prob")
##      appointment campus home office restaurant store theater
## [1,]         NaN    NaN  NaN    NaN        NaN   NaN     NaN
## [2,]         NaN    NaN  NaN    NaN        NaN   NaN     NaN
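
The <NA> and NaN results above are tied to the zero-probability warning raised when the model was built: without smoothing, outcomes never observed for a given feature value are treated as impossible, which can leave the model unable to compute a valid posterior. The Laplace correction covered later in this chapter is the standard remedy.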

Understanding independence

Understanding the idea of event independence will become important as you learn more about how “naive” Bayes got its name. Which of the following is true about independent events?

  • The events cannot occur at the same time.
  • A Venn diagram will always show no intersection.
  • [*] Knowing the outcome of one event does not help predict the other.
  • At least one of the events is completely random.

Understanding NB’s “naivety”

[Video]

Who are you calling naive?

The Naive Bayes algorithm got its name because it makes a “naive” assumption about event independence.

What is the purpose of making this assumption?

  • Independent events can never have a joint probability of zero.
  • [*] The joint probability calculation is simpler for independent events.
  • Conditional probability is undefined for dependent events.
  • Dependent events cannot be used to make predictions.
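
Concretely, independence lets Naive Bayes replace a hard-to-estimate joint probability with a simple product: P(A and B) = P(A) * P(B). A toy illustration with hypothetical marginal probabilities:

# Hypothetical marginal probabilities (for illustration only)
p_weekday <- 5 / 7
p_office  <- 0.43

# Under the naive independence assumption, the joint is just the product
p_weekday * p_office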

A more sophisticated location model

# The 'naivebayes' package is loaded into the workspace already

# Build a NB model of location
locmodel <- naive_bayes(location ~ daytype + hourtype, data = locations)
## Warning: naive_bayes(): Feature daytype - zero probabilities are present.
## Consider Laplace smoothing.
## Warning: naive_bayes(): Feature hourtype - zero probabilities are present.
## Consider Laplace smoothing.
# Predict Brett's location on a weekday afternoon
predict(locmodel, newdata = weekday_afternoon)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] office
## Levels: appointment campus home office restaurant store theater
# Predict Brett's location on a weekday evening
predict(locmodel, newdata = weekday_evening)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
## [1] home
## Levels: appointment campus home office restaurant store theater

Preparing for unforeseen circumstances

# The 'naivebayes' package is loaded into the workspace already
# The Naive Bayes location model (locmodel) has already been built

# Observe the predicted probabilities for a weekend afternoon
predict(locmodel, newdata = weekend_afternoon, type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
##      appointment       campus      home      office restaurant      store
## [1,]  0.02462883 0.0004802622 0.8439145 0.003349521  0.1111338 0.01641922
##          theater
## [1,] 7.38865e-05
# Build a new model using the Laplace correction
locmodel2 <- naive_bayes(location ~ daytype + hourtype, data = locations, laplace = 1)

# Observe the new predicted probabilities for a weekend afternoon
predict(locmodel2, newdata = weekend_afternoon, type = "prob")
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
##      appointment      campus      home      office restaurant      store
## [1,]  0.02013872 0.006187715 0.8308154 0.007929249  0.1098743 0.01871085
##          theater
## [1,] 0.006343697

Understanding the Laplace correction

By default, the naive_bayes() function in the naivebayes package does not use the Laplace correction. What is the risk of leaving this parameter unset?

  • [*] Some potential outcomes may be predicted to be impossible.
  • The algorithm may have a divide by zero error.
  • Naive Bayes will ignore features with zero values.
  • The model may not estimate probabilities for some cases.

Applying Naive Bayes to other problems

[Video]

Handling numeric predictors

Numeric data is often binned before it is used with Naive Bayes. Which of these is not an example of binning?

  • age values recoded as ‘child’ or ‘adult’ categories
  • geographic coordinates recoded into geographic regions (West, East, etc.)
  • test scores divided into four groups by percentile
  • [*] income values standardized to follow a normal bell curve
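
As an aside, binning like the first three examples is commonly done in R with cut(). A quick sketch using made-up age values:

# A sketch of binning a numeric variable into categories (hypothetical data)
age <- c(8, 15, 22, 41, 67)
age_group <- cut(age, breaks = c(0, 17, Inf), labels = c("child", "adult"))
table(age_group)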

Ch. 3 - Logistic Regression

Making binary predictions with regression

[Video]

Building simple logistic regression models

# Examine the dataset to identify potential independent variables
str(donors)
## 'data.frame':    93462 obs. of  13 variables:
##  $ donated          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ veteran          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ bad_address      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age              : int  60 46 NA 70 78 NA 38 NA NA 65 ...
##  $ has_children     : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ wealth_rating    : int  0 3 1 2 1 0 2 3 1 0 ...
##  $ interest_veterans: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ interest_religion: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ pet_owner        : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ catalog_shopper  : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ recency          : Factor w/ 2 levels "CURRENT","LAPSED": 1 1 1 1 1 1 1 1 1 1 ...
##  $ frequency        : Factor w/ 2 levels "FREQUENT","INFREQUENT": 1 1 1 1 1 2 2 1 2 2 ...
##  $ money            : Factor w/ 2 levels "HIGH","MEDIUM": 2 1 2 2 2 2 2 2 2 2 ...
# Explore the dependent variable
table(donors$donated)
## 
##     0     1 
## 88751  4711
# Build the donation model
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans, 
                      data = donors, family = "binomial")

# Summarize the model results
summary(donation_model)
## 
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans, 
##     family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3480  -0.3192  -0.3192  -0.3192   2.5678  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -2.95139    0.01652 -178.664   <2e-16 ***
## bad_address       -0.30780    0.14348   -2.145   0.0319 *  
## interest_religion  0.06724    0.05069    1.327   0.1847    
## interest_veterans  0.11009    0.04676    2.354   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37316  on 93458  degrees of freedom
## AIC: 37324
## 
## Number of Fisher Scoring iterations: 5

Making a binary prediction

# Estimate the donation probability
donors$donation_prob <- predict(donation_model, type = "response")

# Find the donation probability of the average prospect
mean(donors$donated)
## [1] 0.05040551
# Predict a donation if probability of donation is greater than average
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)

# Calculate the model's accuracy
mean(donors$donated == donors$donation_pred)
## [1] 0.794815

The limitations of accuracy

In the previous exercise, you found that the logistic regression model made a correct prediction nearly 80% of the time. Despite this relatively high accuracy, the result is misleading due to the rarity of the outcome being predicted.

The donors dataset is available in your workspace. What would the accuracy have been if a model had simply predicted “no donation” for each person?

  • 80%
  • 85%
  • 90%
  • [*] 95%
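
You can confirm this from the table() output earlier in this chapter: always predicting "no donation" is correct for all 88,751 non-donors out of 93,462 records.

# Accuracy of a model that always predicts 'no donation', from the counts above
88751 / (88751 + 4711)  # roughly 0.95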

Model performance tradeoffs

[Video]

Calculating ROC Curves and AUC

# Load the pROC package
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Create a ROC curve
ROC <- roc(donors$donated, donors$donation_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Plot the ROC curve
plot(ROC, col = "blue")

# Calculate the area under the curve (AUC)
auc(ROC)
## Area under the curve: 0.5102

Comparing ROC curves

Which of the following ROC curves illustrates the best model?

  • AUC 0.55
  • AUC 0.59
  • AUC 0.62
  • [*] I need more information!

Dummy variables, missing data, and interactions

[Video]

Coding categorical features

# Convert the wealth rating to a factor
donors$wealth_levels <- factor(donors$wealth_rating, levels = c(0, 1, 2, 3), labels = c("Unknown", "Low", "Medium", "High"))

# Use relevel() to change reference category
donors$wealth_levels <- relevel(donors$wealth_levels, ref = "Medium")

# See how our factor coding impacts the model
summary(glm(donated ~ wealth_levels, data = donors, family = "binomial"))
## 
## Call:
## glm(formula = donated ~ wealth_levels, family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3320  -0.3243  -0.3175  -0.3175   2.4582  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.91894    0.03614 -80.772   <2e-16 ***
## wealth_levelsUnknown -0.04373    0.04243  -1.031    0.303    
## wealth_levelsLow     -0.05245    0.05332  -0.984    0.325    
## wealth_levelsHigh     0.04804    0.04768   1.008    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37323  on 93458  degrees of freedom
## AIC: 37331
## 
## Number of Fisher Scoring iterations: 5

Handling missing data

# Find the average age among non-missing values
summary(donors$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   48.00   62.00   61.65   75.00   98.00   22546
# Impute missing age values with the mean age
donors$imputed_age <- ifelse(is.na(donors$age), round(mean(donors$age, na.rm = TRUE), 2), donors$age)

# Create missing value indicator for age
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)

Understanding missing value indicators

A missing value indicator provides a reminder that, before imputation, there was a missing value present on the record.

Why is it often useful to include this indicator as a predictor in the model?

  • A missing value may represent a unique category by itself
  • There may be an important difference between records with and without missing data
  • Whatever caused the missing value may also be related to the outcome
  • [*] All of the above
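
Both columns created in the previous exercise can then enter the model side by side, letting the coefficient on missing_age capture any systematic difference for records where age was missing. A sketch:

# Include the imputed value and its missingness indicator together
summary(glm(donated ~ imputed_age + missing_age, data = donors, family = "binomial"))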

Building a more sophisticated model

# Build a recency, frequency, and money (RFM) model
rfm_model <- glm(donated ~ money + recency * frequency, data = donors, family = "binomial")

# Summarize the RFM model to see how the parameters were coded
summary(rfm_model)
## 
## Call:
## glm(formula = donated ~ money + recency * frequency, family = "binomial", 
##     data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3696  -0.3696  -0.2895  -0.2895   2.7924  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -3.01142    0.04279 -70.375   <2e-16 ***
## moneyMEDIUM                        0.36186    0.04300   8.415   <2e-16 ***
## recencyLAPSED                     -0.86677    0.41434  -2.092   0.0364 *  
## frequencyINFREQUENT               -0.50148    0.03107 -16.143   <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT  1.01787    0.51713   1.968   0.0490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 36938  on 93457  degrees of freedom
## AIC: 36948
## 
## Number of Fisher Scoring iterations: 6
# Compute predicted probabilities for the RFM model
rfm_prob <- predict(rfm_model, type = "response")

# Plot the ROC curve and find AUC for the new model
library(pROC)
ROC <- roc(donors$donated, rfm_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col = "red")

auc(ROC)
## Area under the curve: 0.5785

Automatic feature selection

[Video]

The dangers of stepwise regression

In spite of its utility for feature selection, stepwise regression is not frequently used in disciplines outside of machine learning due to some important caveats. Which of these is NOT one of these concerns?

  • It is not guaranteed to find the best possible model
  • [*] A stepwise model’s predictions cannot be trusted
  • The stepwise regression procedure violates some statistical assumptions
  • It can result in a model that makes little sense in the real world

Building a stepwise regression model

# Specify a null model with no predictors
null_model <- glm(donated ~ 1, data = donors, family = "binomial")

# Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data = donors, family = "binomial")

# Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
## Start:  AIC=37332.13
## donated ~ 1
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + frequency          1    28502 37122
## + money              1    28621 37241
## + wealth_rating      1    28705 37326
## + has_children       1    28705 37326
## + age                1    28707 37328
## + imputed_age        1    28707 37328
## + wealth_levels      3    28704 37328
## + interest_veterans  1    28709 37330
## + donation_prob      1    28710 37330
## + donation_pred      1    28710 37330
## + catalog_shopper    1    28710 37330
## + pet_owner          1    28711 37331
## <none>                    28714 37332
## + interest_religion  1    28712 37333
## + recency            1    28713 37333
## + bad_address        1    28714 37334
## + veteran            1    28714 37334
## 
## Step:  AIC=37024.77
## donated ~ frequency
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + money              1    28441 36966
## + wealth_rating      1    28493 37018
## + wealth_levels      3    28490 37019
## + has_children       1    28494 37019
## + donation_prob      1    28498 37023
## + interest_veterans  1    28498 37023
## + catalog_shopper    1    28499 37024
## + donation_pred      1    28499 37024
## + age                1    28499 37024
## + imputed_age        1    28499 37024
## + pet_owner          1    28499 37024
## <none>                    28502 37025
## + interest_religion  1    28501 37026
## + recency            1    28501 37026
## + bad_address        1    28502 37026
## + veteran            1    28502 37027
## 
## Step:  AIC=36949.71
## donated ~ frequency + money
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + wealth_levels      3    28427 36942
## + wealth_rating      1    28431 36942
## + has_children       1    28432 36943
## + interest_veterans  1    28438 36948
## + donation_prob      1    28438 36949
## + catalog_shopper    1    28438 36949
## + donation_pred      1    28438 36949
## + age                1    28438 36949
## + imputed_age        1    28438 36949
## + pet_owner          1    28439 36949
## <none>                    28441 36950
## + interest_religion  1    28440 36951
## + recency            1    28440 36951
## + bad_address        1    28441 36951
## + veteran            1    28441 36952
## 
## Step:  AIC=36945.48
## donated ~ frequency + money + wealth_levels
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + has_children       1    28416 36937
## + age                1    28424 36944
## + imputed_age        1    28424 36944
## + interest_veterans  1    28424 36945
## + donation_prob      1    28424 36945
## + catalog_shopper    1    28424 36945
## + donation_pred      1    28425 36945
## <none>                    28427 36945
## + pet_owner          1    28425 36946
## + interest_religion  1    28426 36947
## + recency            1    28426 36947
## + bad_address        1    28427 36947
## + veteran            1    28427 36947
## 
## Step:  AIC=36938.4
## donated ~ frequency + money + wealth_levels + has_children
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## + pet_owner          1    28413 36937
## + donation_prob      1    28413 36937
## + catalog_shopper    1    28413 36937
## + interest_veterans  1    28413 36937
## + donation_pred      1    28414 36938
## <none>                    28416 36938
## + interest_religion  1    28415 36939
## + age                1    28416 36940
## + imputed_age        1    28416 36940
## + recency            1    28416 36940
## + bad_address        1    28416 36940
## + veteran            1    28416 36940
## 
## Step:  AIC=36932.25
## donated ~ frequency + money + wealth_levels + has_children + 
##     pet_owner
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, : using
## the 70916/93462 rows from a combined fit
##                     Df Deviance   AIC
## <none>                    28413 36932
## + donation_prob      1    28411 36932
## + interest_veterans  1    28411 36932
## + catalog_shopper    1    28412 36933
## + donation_pred      1    28412 36933
## + age                1    28412 36933
## + imputed_age        1    28412 36933
## + recency            1    28413 36934
## + interest_religion  1    28413 36934
## + bad_address        1    28413 36934
## + veteran            1    28413 36934
# Estimate the stepwise donation probability
step_prob <- predict(step_model, type = "response")

# Plot the ROC of the stepwise model
library(pROC)
ROC <- roc(donors$donated, step_prob)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(ROC, col = "red")

auc(ROC)
## Area under the curve: 0.5849

Ch. 4 - Classification Trees

Making decisions with trees

[Video]

Building a simple decision tree

# Load the rpart package
library(rpart)

# Build a lending model predicting loan outcome versus loan amount and credit score
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans, method = "class", control = rpart.control(cp = 0))

# Make a prediction for someone with good credit
predict(loan_model, newdata = good_credit, type = "class")
##      1 
## repaid 
## Levels: default repaid
# Make a prediction for someone with bad credit
predict(loan_model, newdata = bad_credit, type = "class")
##       1 
## default 
## Levels: default repaid

Visualizing classification trees

# Examine the loan_model object
loan_model
## n= 11312 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 11312 5654 repaid (0.4998232 0.5001768)  
##    2) credit_score=AVERAGE,LOW 9490 4437 default (0.5324552 0.4675448)  
##      4) credit_score=LOW 1667  631 default (0.6214757 0.3785243) *
##      5) credit_score=AVERAGE 7823 3806 default (0.5134859 0.4865141)  
##       10) loan_amount=HIGH 2472 1079 default (0.5635113 0.4364887) *
##       11) loan_amount=LOW,MEDIUM 5351 2624 repaid (0.4903756 0.5096244)  
##         22) loan_amount=LOW 1810  874 default (0.5171271 0.4828729) *
##         23) loan_amount=MEDIUM 3541 1688 repaid (0.4767015 0.5232985) *
##    3) credit_score=HIGH 1822  601 repaid (0.3298573 0.6701427) *
# Load the rpart.plot package
library(rpart.plot)

# Plot the loan_model with default settings
rpart.plot(loan_model)

# Plot the loan_model with customized settings
rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)

Understanding the tree’s decisions

The following image shows the structure of a classification tree predicting loan outcome from the applicant’s credit score and requested loan amount. Based on this tree, which of the following applicants would be predicted to repay the loan?

  • Someone with an average credit score and a low requested loan amount.
  • Someone with a low credit score and a medium requested loan amount.
  • Someone with a high requested loan amount and average credit.
  • [*] Someone with a low requested loan amount and high credit.

Growing larger classification trees

[Video]

Why do some branches split?

A classification tree grows using a divide-and-conquer process. Each time the tree grows larger, it splits groups of data into smaller subgroups, creating new branches in the tree.

Given a dataset to divide and conquer, which group would the algorithm prioritize splitting first?

  • The group with the largest number of examples.
  • The group creating branches that improve the model’s prediction accuracy.
  • [*] The group it can split to create the greatest improvement in subgroup homogeneity.
  • The group that has not been split already.
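
By default, rpart measures subgroup homogeneity with the Gini index, which is 0 for a perfectly pure group. A small sketch of that calculation for a two-class group:

# Gini impurity for a two-class group with proportion p in one class
gini <- function(p) 1 - (p^2 + (1 - p)^2)

gini(0.5)  # maximally mixed group: 0.5
gini(0.9)  # mostly one class: 0.18
gini(1.0)  # perfectly pure group: 0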

Creating random test datasets

# Determine the number of rows for training
nrow(loans) * 0.75
## [1] 8484
# Create a random sample of row IDs
sample_rows <- sample(nrow(loans), nrow(loans) * 0.75)

# Create the training dataset
loans_train <- loans[sample_rows, ]

# Create the test dataset
loans_test <- loans[-sample_rows, ]
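
One caveat: sample() draws a different random split on each run, so the accuracy figures that follow may vary slightly. For a reproducible split, fix the RNG seed first (the seed value below is arbitrary):

# Make the random train/test split reproducible
set.seed(123)
sample_rows <- sample(nrow(loans), nrow(loans) * 0.75)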

Building and evaluating a larger tree

# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Make predictions on the test dataset
loans_test$pred <- predict(loan_model, loans_test, type = "class")

# Examine the confusion matrix
table(loans_test$pred, loans_test$outcome)
##          
##           default repaid
##   default     764    597
##   repaid      666    801
# Compute the accuracy on the test dataset
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5533946

Conducting a fair performance evaluation

Holding out test data reduces the amount of data available for growing the decision tree. In spite of this, it is very important to evaluate decision trees on data they have not seen before.

Which of these is NOT true about the evaluation of decision tree performance?

  • Decision trees sometimes overfit the training data.
  • [*] The model’s accuracy is unaffected by the rarity of the outcome.
  • Performance on the training dataset can overestimate performance on future data.
  • Creating a test dataset simulates the model’s performance on unseen data.

Tending to classification trees

[Video]

Preventing overgrown trees

# Grow a tree with maxdepth of 6
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0, maxdepth = 6))

# Make a class prediction on the test set
loans_test$pred <- predict(loan_model, newdata = loans_test, type = "class")

# Compute the accuracy of the simpler tree
mean(loans_test$pred == loans_test$outcome)
## [1] 0.582744
# Swap maxdepth for a minimum split of 500 
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(minsplit = 500, cp = 0))

# Run this. How does the accuracy change?
loans_test$pred <- predict(loan_model, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.582744

Creating a nicely pruned tree

# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))

# Examine the complexity plot
plotcp(loan_model)

# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)

# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, newdata = loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5884017

Why do trees benefit from pruning?

Classification trees can grow indefinitely, until they are told to stop or run out of data to divide-and-conquer.

Just like trees in nature, classification trees that grow overly large can require pruning to reduce the excess growth. However, this generally results in a tree that classifies fewer training examples correctly.

Why, then, are pre-pruning and post-pruning almost always used?

  • Simpler trees are easier to interpret
  • Simpler trees using early stopping are faster to train
  • Simpler trees may perform better on the testing data
  • [*] All of the above

Seeing the forest from the trees

[Video]

Understanding random forests

Groups of classification trees can be combined into an ensemble that generates a single prediction by allowing the trees to “vote” on the outcome.

Why might someone think that this could result in more accurate predictions than a single tree?

  • Each tree in the forest is larger and more complex than a typical single tree.
  • Every tree in a random forest uses the complete set of predictors.
  • [*] The diversity among the trees may lead the ensemble to discover more subtle patterns.
  • The random forest is not affected by noisy data.

Building a random forest model

# Load the randomForest package
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Build a random forest model (randomForest() has no method argument;
# classification is automatic when the outcome is a factor)
loan_model <- randomForest(outcome ~ ., data = loans_train)

# Compute the accuracy of the random forest
loans_test$pred <- predict(loan_model, newdata = loans_test)
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5887553

About Michael Mallari

Michael is a hybrid thinker and doer—a byproduct of being a StrengthsFinder “Learner” over time. With 20+ years of engineering, design, and product experience, he helps organizations identify market needs, mobilize internal and external resources, and deliver delightful digital customer experiences that align with business goals. He has been entrusted with problem-solving for brands—ranging from Fortune 500 companies to early-stage startups to not-for-profit organizations.

Michael earned his BS in Computer Science from New York Institute of Technology and his MBA from the University of Maryland, College Park. He is also a candidate to receive his MS in Applied Analytics from Columbia University.

LinkedIn | Twitter | www.michaelmallari.com/data | www.columbia.edu/~mm5470