In this learning path, we'll learn how to create machine learning models using R 😊. Machine learning is the foundation for predictive modeling and artificial intelligence. We'll cover the core principles of machine learning and how to use common tools and frameworks to train, evaluate, and use machine learning models.
Modules that will be covered in this learning path include:
Explore and analyze data with R
Train and evaluate regression models
Train and evaluate classification models
Train and evaluate clustering models (under development)
Train and evaluate deep learning models (under development)
This learning path assumes knowledge of basic mathematical concepts. Some experience with R and the tidyverse is also beneficial, though we'll try as much as possible to walk through the core concepts. To get started with R and the tidyverse, the best place is R for Data Science, an O'Reilly book written by Hadley Wickham and Garrett Grolemund. It's designed to take you from knowing nothing about R or the tidyverse to having all the basic tools of data science at your fingertips.
The Python edition can be found at this learning path.
Why R?
First, while not the only good option, R has been shown to be popular and effective in modern data analysis. Second, R is free and open-source. You can install it anywhere, modify the code, and have the ability to see exactly how computations are performed. Third, it has excellent community support via the canonical R mailing lists and, more importantly, with Twitter, StackOverflow, and RStudio Community. Anyone who asks a reasonable, reproducible question has a pretty good chance of getting an answer.
- Feature Engineering and Selection: A Practical Approach for Predictive Models, Max Kuhn and Kjell Johnson
Now, let’s get started!
Artwork by @allison_horst
Classification is a form of machine learning in which you train a model to predict which category an item belongs to. Categorical data has distinct ‘classes’, rather than numeric values.
For example, a health clinic might use diagnostic data such as a patient's height, weight, blood pressure, and blood-glucose level to predict whether or not the patient is diabetic.
Classification is an example of a supervised machine learning technique, which means it relies on data that includes known feature values (for example, diagnostic measurements for patients) as well as known label
values (for example, a classification of non-diabetic or diabetic). A
classification algorithm is used to fit a subset of the data to a
function that can calculate the probability
for each class
label from the feature values. The remaining data is used to evaluate
the model by comparing the predictions it generates from the features to
the known class labels.
The simplest form of classification is binary classification, in which the label is 0 or 1, representing one of two classes; for example, “True” or “False”; “Internal” or “External”; “Profitable” or “Non-Profitable”; and so on.
The class prediction is made by determining the probability for each possible class as a value between 0 (impossible) and 1 (certain). The total probability for all classes is 1, as the patient is definitely either diabetic or non-diabetic. So, if the predicted probability of a patient being diabetic is 0.3, then there is a corresponding probability of 0.7 that the patient is non-diabetic.
A threshold value, usually 0.5, is used to determine the predicted class - so if the positive class (in this case, diabetic) has a predicted probability greater than the threshold, then a classification of diabetic is predicted.
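To make the threshold idea concrete, here is a minimal sketch (using made-up probabilities rather than real patient data) of how a 0.5 cutoff turns predicted probabilities into class labels:

# Made-up predicted probabilities of the positive (diabetic) class
predicted_prob <- c(0.30, 0.75, 0.51, 0.05)

# The probability of the negative class is the complement, and the class
# label comes from comparing the positive probability against 0.5
tibble::tibble(
  prob_diabetic = predicted_prob,
  prob_nondiabetic = 1 - predicted_prob,
  predicted_class = ifelse(predicted_prob > 0.5, "diabetic", "non-diabetic")
)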
The best way to learn about classification is to try it for yourself, so that’s what you’ll do in this exercise.
We'll require some packages to work through this module. You can install them as follows:
install.packages(c('tidyverse', 'tidymodels', 'ranger', 'vip', 'palmerpenguins', 'skimr', 'paletteer', 'nnet', 'here'))
Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case some are missing.
suppressWarnings(if(!require("pacman")) install.packages("pacman"))
## Loading required package: pacman
pacman::p_load('tidyverse', 'tidymodels', 'ranger',
               'vip', 'skimr', 'here', 'palmerpenguins', 'kernlab',
               'janitor', 'paletteer', 'nnet')
Let’s start by looking at an example of binary classification, where the model must predict a label that belongs to one of two classes. In this exercise, we’ll train a binary classifier to predict whether or not a patient should be tested for diabetes based on some medical data.
The first step in any machine learning project is to explore the data
that you will use to train a model. And before we can explore the data, we must first have it in our environment, right?
So, let's begin by importing a CSV file of patient data into a tibble
(a modern reimagining of the data frame):
Citation: The diabetes dataset used in this exercise is based on data originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases.
# Load the core tidyverse and make it available in your current R session
library(tidyverse)
# Read the csv file into a tibble
diabetes <- read_csv(file = "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/diabetes.csv")

# Print the first 10 rows of the data
diabetes %>%
  slice_head(n = 10)
PatientID <dbl> | Pregnancies <dbl> | PlasmaGlucose <dbl> | DiastolicBloodPressure <dbl> | TricepsThickness <dbl> | |
---|---|---|---|---|---|
1354778 | 0 | 171 | 80 | 34 | |
1147438 | 8 | 92 | 93 | 47 | |
1640031 | 7 | 115 | 47 | 52 | |
1883350 | 9 | 103 | 78 | 25 | |
1424119 | 1 | 85 | 59 | 27 | |
1619297 | 0 | 82 | 92 | 9 | |
1660149 | 0 | 133 | 47 | 19 | |
1458769 | 0 | 67 | 87 | 43 | |
1201647 | 8 | 80 | 95 | 33 | |
1403912 | 1 | 72 | 31 | 40 |
Sometimes, we may want a little more information about our data. We can take a look at the data, its structure, and the data types of its features by using the glimpse() function as shown below:
# Take a quick glance at the data
diabetes %>%
  glimpse()
## Rows: 15,000
## Columns: 10
## $ PatientID <dbl> 1354778, 1147438, 1640031, 1883350, 1424119, 16~
## $ Pregnancies <dbl> 0, 8, 7, 9, 1, 0, 0, 0, 8, 1, 1, 3, 5, 7, 0, 3,~
## $ PlasmaGlucose <dbl> 171, 92, 115, 103, 85, 82, 133, 67, 80, 72, 88,~
## $ DiastolicBloodPressure <dbl> 80, 93, 47, 78, 59, 92, 47, 87, 95, 31, 86, 96,~
## $ TricepsThickness <dbl> 34, 47, 52, 25, 27, 9, 19, 43, 33, 40, 11, 31, ~
## $ SerumInsulin <dbl> 23, 36, 35, 304, 35, 253, 227, 36, 24, 42, 58, ~
## $ BMI <dbl> 43.50973, 21.24058, 41.51152, 29.58219, 42.6045~
## $ DiabetesPedigree <dbl> 1.21319135, 0.15836498, 0.07901857, 1.28286985,~
## $ Age <dbl> 21, 23, 23, 43, 22, 26, 21, 26, 53, 26, 22, 23,~
## $ Diabetic <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,~
This data consists of diagnostic information about some patients who
have been tested for diabetes. Note that the final column in the dataset
(Diabetic
) contains the value 0
for patients who tested negative
for diabetes, and 1
for patients who tested positive. This is the label that we will train our model to predict; most of the other columns (Pregnancies, PlasmaGlucose, DiastolicBloodPressure, BMI and so on) are the features we will use to predict the Diabetic label.
Let’s kick off our adventure by reformatting the data to make it easier for a model to use effectively. For now, let’s drop the PatientID column, encode the Diabetic column as a categorical variable, and make the column names a bit friend_lieR:
# Load the janitor package for cleaning data
library(janitor)
# Clean data a bit
diabetes_select <- diabetes %>%
  # Encode Diabetic as category
  mutate(Diabetic = factor(Diabetic, levels = c("1", "0"))) %>%
  # Drop PatientID column
  select(-PatientID) %>%
  # Clean column names
  clean_names()

# View data set
diabetes_select %>%
  slice_head(n = 10)
pregnancies <dbl> | plasma_glucose <dbl> | diastolic_blood_pressure <dbl> | triceps_thickness <dbl> | |
---|---|---|---|---|
0 | 171 | 80 | 34 | |
8 | 92 | 93 | 47 | |
7 | 115 | 47 | 52 | |
9 | 103 | 78 | 25 | |
1 | 85 | 59 | 27 | |
0 | 82 | 92 | 9 | |
0 | 133 | 47 | 19 | |
0 | 67 | 87 | 43 | |
8 | 80 | 95 | 33 | |
1 | 72 | 31 | 40 |
The goal of this exploration is to try to understand the relationships
between its attributes; in particular, any apparent correlation between the features and the label your model will try to predict. One way of doing this is by using data visualization.
Now let’s compare the feature distributions for each label value. We’ll begin by formatting the data to a long format to make it somewhat easier to make multiple facets.
# Pivot data to a long format
diabetes_select_long <- diabetes_select %>%
  pivot_longer(!diabetic, names_to = "features", values_to = "values")

# Print the first 10 rows
diabetes_select_long %>%
  slice_head(n = 10)
diabetic <fct> | features <chr> | values <dbl> | ||
---|---|---|---|---|
0 | pregnancies | 0.000000 | ||
0 | plasma_glucose | 171.000000 | ||
0 | diastolic_blood_pressure | 80.000000 | ||
0 | triceps_thickness | 34.000000 | ||
0 | serum_insulin | 23.000000 | ||
0 | bmi | 43.509726 | ||
0 | diabetes_pedigree | 1.213191 | ||
0 | age | 21.000000 | ||
0 | pregnancies | 8.000000 | ||
0 | plasma_glucose | 92.000000 |
Perfect! Now, let’s make some plots.
theme_set(theme_light())
# Make a box plot for each predictor feature
diabetes_select_long %>%
  ggplot(mapping = aes(x = diabetic, y = values, fill = features)) +
geom_boxplot() +
facet_wrap(~ features, scales = "free", ncol = 4) +
scale_color_viridis_d(option = "plasma", end = .7) +
theme(legend.position = "none")
Amazing🤩! For some of the features, there’s a noticeable difference in the distribution for each label value. In particular, Pregnancies
and Age
show markedly different distributions for diabetic patients than for
non-diabetic patients. These features may help predict whether or not a
patient is diabetic.
Our dataset includes known values for the label, so we can use this to train a classifier so that it finds a statistical relationship between the features and the label value; but how will we know if our model is any good? How do we know it will predict correctly when we use it with new data that it wasn’t trained with?
It is best practice to hold out some of your data for testing in order to get a better estimate of how your models will perform on new data.
Well, we can take advantage of the fact that we have a large dataset with known label values, use only some of it to train the model, and hold back some to test the trained model, enabling us to compare the predicted labels with the already known labels in the test set.
In R, the amazing Tidymodels framework provides a collection of packages for modeling and machine learning using tidyverse principles. For instance, rsample, a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:
initial_split(): specifies how data will be split into a training and testing set
training() and testing(): extract the data in each split
Use ?initial_split() for more details.
Here is a great place to get started with Tidymodels: Get Started
# Load the Tidymodels packages
library(tidymodels)
# Split data into 70% for training and 30% for testing
set.seed(2056)
diabetes_split <- diabetes_select %>%
  initial_split(prop = 0.70)

# Extract the data in each split
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)
# Print the number of cases in each split
cat("Training cases: ", nrow(diabetes_train), "\n",
"Test cases: ", nrow(diabetes_test), sep = "")
## Training cases: 10500
## Test cases: 4500
# Print out the first 5 rows of the training set
diabetes_train %>%
  slice_head(n = 5)
pregnancies <dbl> | plasma_glucose <dbl> | diastolic_blood_pressure <dbl> | triceps_thickness <dbl> | |
---|---|---|---|---|
0 | 84 | 101 | 9 | |
1 | 157 | 56 | 11 | |
0 | 102 | 56 | 15 | |
1 | 81 | 90 | 10 | |
8 | 97 | 65 | 24 |
OK, now we’re ready to train our model by fitting the training features to the training labels (diabetic
). There are various algorithms we can use to train the model. In this example, we’ll use Logistic Regression
,
which (despite its name) is a well-established algorithm for
classification. Logistic regression is a binary classification
algorithm, meaning it predicts one of two categories.
There are quite a number of ways to fit a logistic regression model in Tidymodels; see ?logistic_reg(). For now, let's fit a logistic regression model via the default stats::glm() engine.
# Make a model specification
logreg_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Print the model specification
logreg_spec
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
After a model has been specified, the model can be estimated
or trained
using the fit()
function, typically using a symbolic description of the model (a formula) and some data.
# Train a logistic regression model
logreg_fit <- logreg_spec %>%
  fit(diabetic ~ ., data = diabetes_train)
# Print the model object
logreg_fit
## parsnip model object
##
## Fit time: 51ms
##
## Call: stats::glm(formula = diabetic ~ ., family = stats::binomial,
## data = data)
##
## Coefficients:
## (Intercept) pregnancies plasma_glucose
## 8.624243 -0.266296 -0.009615
## diastolic_blood_pressure triceps_thickness serum_insulin
## -0.012297 -0.022807 -0.003932
## bmi diabetes_pedigree age
## -0.049028 -0.923262 -0.056876
##
## Degrees of Freedom: 10499 Total (i.e. Null); 10491 Residual
## Null Deviance: 13290
## Residual Deviance: 9221 AIC: 9239
The model print out shows the coefficients learned during training.
Now we’ve trained the model using the training data, we can use the test data we held back to evaluate how well it predicts using parsnip::predict(). Let’s start by using the model to predict labels for our test set, and compare the predicted labels to the known labels:
# Make predictions then bind them to the test set
results <- diabetes_test %>% select(diabetic) %>%
  bind_cols(logreg_fit %>% predict(new_data = diabetes_test))

# Compare predictions
results %>%
  slice_head(n = 10)
diabetic <fct> | .pred_class <fct> | |||
---|---|---|---|---|
0 | 0 | |||
0 | 0 | |||
0 | 0 | |||
0 | 0 | |||
1 | 1 | |||
0 | 0 | |||
0 | 0 | |||
1 | 0 | |||
0 | 0 | |||
0 | 0 |
Comparing each prediction with its corresponding “ground truth”
actual value isn’t a very efficient way to determine how well the model
is predicting. Fortunately, Tidymodels has a few more tricks up its
sleeve: yardstick
- a package used to measure the effectiveness of models using performance metrics.
The most obvious thing you might want to do is to check the accuracy of the predictions - in simple terms, what proportion of the labels did the model predict correctly?
yardstick::accuracy()
does just that!
# Calculate accuracy: proportion of data predicted correctly
accuracy(data = results, truth = diabetic, estimate = .pred_class)
.metric <chr> | .estimator <chr> | .estimate <dbl> | ||
---|---|---|---|---|
accuracy | binary | 0.7888889 |
The accuracy is returned as a decimal value - a value of 1.0 would mean that the model got 100% of the predictions right; while an accuracy of 0.0 is, well, pretty useless 😐!
Accuracy seems like a sensible metric to evaluate (and to a certain extent it is), but you need to be careful about drawing too many conclusions from the accuracy of a classifier. Remember that it’s simply a measure of how many cases were predicted correctly. Suppose only 3% of the population is diabetic. You could create a classifier that always just predicts 0, and it would be 97% accurate - but not terribly helpful in identifying patients with diabetes!
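Here's a quick sketch of that pitfall using simulated labels (about 3% positive), not the diabetes data: a classifier that always predicts 0 still scores roughly 97% accuracy while catching no positive cases at all.

set.seed(123)
# Simulate 10,000 labels where only ~3% are positive (1)
truth <- sample(c(1, 0), size = 10000, replace = TRUE, prob = c(0.03, 0.97))

# A "classifier" that always predicts the negative class
always_zero <- rep(0, length(truth))

# Accuracy is high (~0.97) even though no positives were identified
mean(always_zero == truth)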
Fortunately, there are some other metrics that reveal a little more about how our classification model is performing.
One performance metric associated with classification problems is the confusion matrix
.
A confusion matrix describes how well a classification model performs
by tabulating how many examples in each class were correctly classified
by a model. In our case, it will show you how many cases were classified
as negative (0) and how many as positive (1); the confusion matrix also
shows you how many were classified into the wrong categories.
The conf_mat()
function from yardstick calculates this cross-tabulation of observed and predicted classes.
# Confusion matrix for prediction results
conf_mat(data = results, truth = diabetic, estimate = .pred_class)
## Truth
## Prediction 1 0
## 1 897 293
## 0 657 2653
Awesome!
Let’s interpret the confusion matrix. Our model is asked to classify cases between two binary categories, category 1
for patients who tested positive for diabetes and category 0
for patients who tested negative.
If your model predicts a patient as 1
(positive) and they belong to category 1
(positive) in reality we call this a true positive
, shown by the top left number 897
.
If your model predicts a patient as 0
(negative) and they belong to category 1
(positive) in reality we call this a false negative
, shown by the bottom left number 657
.
If your model predicts a patient as 1
(positive) and they belong to category 0
(negative) in reality we call this a false positive
, shown by the top right number 293
.
If your model predicts a patient as 0
(negative) and they belong to category 0
(negative) in reality we call this a true negative
, shown by the bottom right number 2653
.
Our confusion matrix can thus be expressed in the following form:
Prediction | Truth: 1 | Truth: 0
---|---|---
1 | 897 (TP) | 293 (FP)
0 | 657 (FN) | 2653 (TN)
Note that the correct (true
) predictions form a diagonal line from top left to bottom right - these figures should be significantly higher than the false predictions if the model is any good.
As you might have guessed, it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.
The confusion matrix is helpful since it gives rise to other metrics that can help us better evaluate the performance of a classification model. Let’s go through some of them:
🎓 Precision: TP/(TP + FP), defined as the proportion of predicted positives that are actually positive. Also called positive predictive value.
🎓 Recall: TP/(TP + FN), defined as the proportion of positive results out of the number of samples which were actually positive. Also known as sensitivity.
🎓 Specificity: TN/(TN + FP), defined as the proportion of negative results out of the number of samples which were actually negative.
🎓 Accuracy: (TP + TN)/(TP + TN + FP + FN), the proportion of labels that were predicted correctly for a sample.
🎓 F Measure: a weighted average of the precision and recall, with 1 being the best and 0 being the worst.
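To see how these formulas relate to the confusion matrix we just produced, here's a small sketch that recomputes them by hand from the counts above (TP = 897, FN = 657, FP = 293, TN = 2653); the values should match the yardstick output that follows.

# Counts taken from the confusion matrix printed earlier
tp <- 897
fn <- 657
fp <- 293
tn <- 2653

precision   <- tp / (tp + fp)                                 # ~0.754
recall      <- tp / (tp + fn)                                 # ~0.577
specificity <- tn / (tn + fp)                                 # ~0.901
accuracy    <- (tp + tn) / (tp + tn + fp + fn)                # ~0.789
f_measure   <- 2 * precision * recall / (precision + recall)  # ~0.654

c(precision = precision, recall = recall, specificity = specificity,
  accuracy = accuracy, f_measure = f_measure)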
Tidymodels provides yet another succinct way of evaluating all these metrics. Using yardstick::metric_set()
, you can combine multiple metrics together into a new function that calculates all of them at once.
# Combine metrics and evaluate them all at once
eval_metrics <- metric_set(ppv, recall, accuracy, f_meas)
eval_metrics(data = results, truth = diabetic, estimate = .pred_class)
.metric <chr> | .estimator <chr> | .estimate <dbl> | ||
---|---|---|---|---|
ppv | binary | 0.7537815 | ||
recall | binary | 0.5772201 | ||
accuracy | binary | 0.7888889 | ||
f_meas | binary | 0.6537901 |
Using the precision (ppv) metric, we are able to answer the question: of all the patients the model predicted to be diabetic, how many actually are diabetic?
Using the recall metric, we are able to answer the question: of all the patients who actually are diabetic, how many did the model correctly identify?
Great job, we just made predictions and evaluated them using a number of metrics.
Until now, we've considered the predictions from the model as being either 1 or 0 class labels. Actually, things are a little more complex than that. Statistical machine learning algorithms, like logistic regression, are based on probability; so what actually gets predicted by a binary classifier is the probability that the label is true (P(y = 1)) and the probability that the label is false (P(y = 0)). A threshold value of 0.5 is used to decide whether the predicted label is a 1 (P(y = 1) > 0.5) or a 0 (P(y = 1) <= 0.5). Let's see the probability pairs for each case:
# Predict class probabilities and bind them to results
results <- results %>%
  bind_cols(logreg_fit %>%
              predict(new_data = diabetes_test, type = "prob"))

# Print out the results
results %>%
  slice_head(n = 10)
diabetic <fct> | .pred_class <fct> | .pred_1 <dbl> | .pred_0 <dbl> | |
---|---|---|---|---|
0 | 0 | 0.41666247 | 0.5833375 | |
0 | 0 | 0.09848140 | 0.9015186 | |
0 | 0 | 0.04694784 | 0.9530522 | |
0 | 0 | 0.05606136 | 0.9439386 | |
1 | 1 | 0.58055760 | 0.4194424 | |
0 | 0 | 0.33066116 | 0.6693388 | |
0 | 0 | 0.28826501 | 0.7117350 | |
1 | 0 | 0.26991877 | 0.7300812 | |
0 | 0 | 0.27501792 | 0.7249821 | |
0 | 0 | 0.13127572 | 0.8687243 |
The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilities are compared. If we were to change the threshold, it would affect the predictions and therefore change the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the true positive rate (which is another name for recall) and the false positive rate (1 - specificity) for a range of possible thresholds. These rates are then plotted against all possible thresholds to form a chart known as a receiver operating characteristic (ROC) chart, like this:
# Make a roc_chart
results %>%
  roc_curve(truth = diabetic, .pred_1) %>%
  autoplot()
The ROC chart shows the curve of the true and false positive rates for different threshold values between 0 and 1. A perfect classifier would have a curve that goes straight up the left side and straight across the top. The diagonal line across the chart represents the probability of predicting correctly with a 50/50 random prediction; so you obviously want the curve to be higher than that (or your model is no better than simply guessing!).
The area under the curve (AUC) is a value between 0 and 1 that
quantifies the overall performance of the model. One way of interpreting
AUC is as the probability that the model ranks a random positive
example more highly than a random negative example. The closer to 1 this
value is, the better the model. Once again, Tidymodels includes a
function to calculate this metric: yardstick::roc_auc()
# Compute the AUC
results %>%
  roc_auc(diabetic, .pred_1)
.metric <chr> | .estimator <chr> | .estimate <dbl> | ||
---|---|---|---|---|
roc_auc | binary | 0.8602241 |
In this case, the ROC curve and its AUC indicate that the model performs better than a random guess, which is not bad considering we performed very little preprocessing of the data.
In practice, it’s common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There’s a huge range of preprocessing transformations you can perform to get your data ready for modeling, but we’ll limit ourselves to a few common techniques:
Scaling numeric features so they’re on the same scale. This prevents features with large values from producing coefficients that disproportionately affect the predictions.
Encoding categorical variables. For example, by using a one hot encoding technique you can create “dummy” or indicator variables which replace the original categorical feature with numeric columns whose values are either 1 or 0.
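As a quick illustration of one hot encoding (using a toy colour factor rather than the diabetes data), base R's model.matrix() shows how a single categorical column becomes a set of 0/1 indicator columns:

# A toy factor with three levels
colour <- factor(c("red", "green", "blue", "green"))

# One indicator (dummy) column per level; each row has a single 1
model.matrix(~ colour - 1)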
Tidymodels provides yet another neat package: recipes, a package for preprocessing data. Let's specify a recipe that encodes the age column as a categorical variable and then normalizes the rest of the predictor features.
# Preprocess the data for modelling
diabetes_recipe <- recipe(diabetic ~ ., data = diabetes_train) %>%
  step_mutate(age = factor(age)) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(age)

# Print the recipe
diabetes_recipe
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 8
##
## Operations:
##
## Variable mutation for age
## Centering and scaling for all_numeric_predictors()
## Dummy variables from age
We just created a recipe containing an outcome and its corresponding predictors, specifying that the age variable should be converted to a categorical variable (factor), that all the numeric predictors should be normalized, and that dummy variables should be created for the nominal predictor (age) 🙌.
Now that we have a recipe and the model specification we defined previously, we need a way of bundling them together into an object that will first preprocess the data, then fit the model on the preprocessed data, and also allow for potential post-processing activities.
In Tidymodels, this convenient object is called a workflow
and conveniently holds your modeling components.
The workflows package allows the user to bind modeling and preprocessing objects together. You can then fit the entire workflow to the data, such that the model encapsulates all of the preprocessing steps as well as the algorithm.
# Redefine the model specification
logreg_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Bundle the recipe and model specification
lr_wf <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(logreg_spec)

# Print the workflow
lr_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_mutate()
## * step_normalize()
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
After a workflow has been specified, a model can be trained
using the fit()
function.
# Fit a workflow object
lr_wf_fit <- lr_wf %>%
  fit(data = diabetes_train)
# Print wf object
lr_wf_fit
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: logistic_reg()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_mutate()
## * step_normalize()
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) pregnancies plasma_glucose
## 0.8687 -0.9424 -0.3204
## diastolic_blood_pressure triceps_thickness serum_insulin
## -0.1974 -0.3358 -0.5528
## bmi diabetes_pedigree age_X22
## -0.4849 -0.3669 0.1369
## age_X23 age_X24 age_X25
## 1.0163 18.9613 18.7737
## age_X26 age_X28 age_X29
## 18.8089 -19.4492 -19.9201
## age_X30 age_X31 age_X32
## 0.5942 0.8685 0.9317
## age_X33 age_X34 age_X35
## 0.4930 0.5297 0.7340
## age_X36 age_X37 age_X38
## -1.3050 -1.7518 -2.5862
## age_X39 age_X40 age_X41
## -1.9703 -2.1639 -2.2658
## age_X42 age_X43 age_X44
## -0.8612 -1.9764 -1.9533
## age_X45 age_X46 age_X47
## -1.8894 -2.0556 -1.9258
## age_X48 age_X49 age_X50
## 0.7884 2.9357 1.2163
## age_X51 age_X52 age_X53
## 0.2695 0.3483 -1.0652
## age_X54 age_X55 age_X56
## -2.2427 -2.6624 -2.4732
## age_X57 age_X58 age_X59
## -3.0821 -2.2541 -2.0212
## age_X60 age_X61 age_X62
## -3.1082 -2.1049 -1.9737
## age_X63 age_X64 age_X65
## -1.6991 -0.6576 -1.8492
## age_X66 age_X67 age_X68
## 1.1912 0.8770 18.3957
## age_X69 age_X70 age_X71
## 18.6030 20.1025 18.7479
## age_X72 age_X73 age_X74
## 18.4269 19.3626 19.3518
## age_X75 age_X76 age_X77
## 18.6483 19.0067 18.0447
##
## Degrees of Freedom: 10499 Total (i.e. Null); 10437 Residual
## Null Deviance: 13290
## Residual Deviance: 6838 AIC: 6964
##
## ...
## and 0 more lines.
Good job👏! We now have a trained workflow. The workflow print out shows the coefficients learned during training.
This allows us to use the model trained by this workflow to predict labels for our test set, and compare the performance metrics with the basic model we created previously.
# Make predictions on the test set
results <- diabetes_test %>% select(diabetic) %>%
  bind_cols(lr_wf_fit %>%
              predict(new_data = diabetes_test)) %>%
  bind_cols(lr_wf_fit %>%
              predict(new_data = diabetes_test, type = "prob"))

# Print the results
results %>%
  slice_head(n = 10)
diabetic <fct> | .pred_class <fct> | .pred_1 <dbl> | .pred_0 <dbl> | |
---|---|---|---|---|
0 | 0 | 3.181359e-01 | 0.68186413 | |
0 | 0 | 1.456229e-01 | 0.85437708 | |
0 | 0 | 8.747333e-02 | 0.91252667 | |
0 | 0 | 5.700564e-10 | 1.00000000 | |
1 | 1 | 9.310216e-01 | 0.06897839 | |
0 | 0 | 5.451350e-09 | 0.99999999 | |
0 | 1 | 5.697916e-01 | 0.43020837 | |
1 | 1 | 6.938903e-01 | 0.30610969 | |
0 | 0 | 4.401889e-09 | 1.00000000 | |
0 | 0 | 1.913352e-01 | 0.80866481 |
Let’s take a look at the confusion matrix:
# Confusion matrix for prediction results
results %>%
  conf_mat(truth = diabetic, estimate = .pred_class)
## Truth
## Prediction 1 0
## 1 1116 245
## 0 438 2701
🤩🤩 Look at those metrics!
Can we visualize this? Of course, nothing is impaRsible!
# Visualize conf mat
update_geom_defaults(geom = "rect", new = list(fill = "midnightblue", alpha = 0.7))
results %>%
  conf_mat(diabetic, .pred_class) %>%
  autoplot()
What about our other metrics, such as ppv, sensitivity, and so on?
# Evaluate other desired metrics
eval_metrics(data = results, truth = diabetic, estimate = .pred_class)
.metric <chr> | .estimator <chr> | .estimate <dbl> | ||
---|---|---|---|---|
ppv | binary | 0.8199853 | ||
recall | binary | 0.7181467 | ||
accuracy | binary | 0.8482222 | ||
f_meas | binary | 0.7656947 |
# Evaluate ROC_AUC metrics
results %>%
  roc_auc(diabetic, .pred_1)
.metric <chr> | .estimator <chr> | .estimate <dbl> | ||
---|---|---|---|---|
roc_auc | binary | 0.9248788 |
# Plot ROC_CURVE
results %>%
  roc_curve(diabetic, .pred_1) %>%
  autoplot()
Comparing with previous predictions, the metrics look better, so clearly preprocessing the data has made a difference.
Now let’s try a different algorithm. Previously we used a logistic regression algorithm, which is a linear algorithm. There are many kinds of classification algorithm we could try, including:
Support Vector Machine algorithms: Algorithms that define a hyperplane that separates classes.
Tree-based algorithms: Algorithms that build a decision tree to reach a prediction
Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to improve generalizability.
This time, we'll train the model using an ensemble algorithm named Random Forest that averages the outputs of multiple random decision trees. Random forests help to reduce tree correlation by injecting more randomness into the tree-growing process. More specifically, instead of considering all predictors when calculating a given split, random forests pick a random sample of predictors to be considered for that split.
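As an aside, parsnip exposes this behaviour through the mtry argument of rand_forest(), which sets how many predictors are sampled at each split (trees sets the number of trees). The workflow below leaves these at the ranger engine defaults, but a sketch of setting them explicitly, with values chosen arbitrarily for illustration, would look like this:

# Illustrative only: sample 3 predictors per split and grow 1000 trees
rand_forest(mtry = 3, trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")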
For further reading on tree-based models, please see the references at the end of this module (for example, Hands-On Machine Learning with R and An Introduction to Statistical Learning).
As we already have a gist of how to perform classification using Tidymodels, let’s get right into specifying and fitting a random forest algorithm.
# Preprocess the data for modelling
diabetes_recipe <- recipe(diabetic ~ ., data = diabetes_train) %>%
  step_mutate(age = factor(age)) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(age)

# Build a random forest model specification
rf_spec <- rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# Bundle recipe and model spec into a workflow
rf_wf <- workflow() %>%
  add_recipe(diabetes_recipe) %>%
  add_model(rf_spec)

# Fit a model
rf_wf_fit <- rf_wf %>%
  fit(data = diabetes_train)

# Make predictions on test data
results <- diabetes_test %>% select(diabetic) %>%
  bind_cols(rf_wf_fit %>%
              predict(new_data = diabetes_test)) %>%
  bind_cols(rf_wf_fit %>%
              predict(new_data = diabetes_test, type = "prob"))

# Print out predictions
results %>%
  slice_head(n = 10)
diabetic <fct> | .pred_class <fct> | .pred_1 <dbl> | .pred_0 <dbl> | |
---|---|---|---|---|
0 | 0 | 0.284647131 | 0.7153529 | |
0 | 0 | 0.012075218 | 0.9879248 | |
0 | 0 | 0.009816002 | 0.9901840 | |
0 | 0 | 0.010763838 | 0.9892362 | |
1 | 1 | 0.892012626 | 0.1079874 | |
0 | 0 | 0.113799606 | 0.8862004 | |
0 | 0 | 0.229868076 | 0.7701319 | |
1 | 0 | 0.266318709 | 0.7336813 | |
0 | 0 | 0.052735319 | 0.9472647 | |
0 | 0 | 0.038482162 | 0.9615178 |
💪 There goes our random_forest model. Is it any good? Let’s evaluate its metrics!
# Confusion metrics for rf_predictions
results %>%
  conf_mat(diabetic, .pred_class)
## Truth
## Prediction 1 0
## 1 1371 100
## 0 183 2846
# Confusion matrix plot
results %>%
  conf_mat(diabetic, .pred_class) %>%
  autoplot()
There is a considerable increase in the number of True Positives
and True Negatives
, which is a step in the right direction.
Let’s take a look at other evaluation metrics
# Evaluate other intuitive classification metrics
rf_met <- results %>%
  eval_metrics(truth = diabetic, estimate = .pred_class)

# Evaluate ROC_AUC
auc <- results %>%
  roc_auc(diabetic, .pred_1)

# Plot ROC_CURVE
curve <- results %>%
  roc_curve(diabetic, .pred_1) %>%
  autoplot()

# Return metrics
list(rf_met, auc, curve)
## [[1]]
## # A tibble: 4 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 ppv binary 0.932
## 2 recall binary 0.882
## 3 accuracy binary 0.937
## 4 f_meas binary 0.906
##
## [[2]]
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.985
##
## [[3]]
For the sheer sake of adventure, let’s make a Variable Importance Plot to see which predictor variables have the most impact in our model.
# Load vip
library(vip)
# Extract the fitted model from the workflow
rf_wf_fit %>%
  extract_fit_parsnip() %>%
  # Make VIP plot
  vip()
Just as we had anticipated from our data exploration 😊! This goes to show the importance of data exploration.
As revealed by the performance metrics, the random forest model seemed to have done a great job in increasing the True Positives/Negatives and reducing the False Positives/Negatives.
Now that we have a reasonably useful trained model, we can save it for use later to predict labels for new data:
# Save trained workflow
saveRDS(rf_wf_fit, "diabetes_rf_model.rds")
Now, we can load it whenever we need it, and use it to predict labels for new data. This is often called scoring
or inferencing
.
For example, let's create a simulated data set by picking a random value for each column in our test set, then make predictions using the saved model.
# Load the model into the current R session
loaded_model <- readRDS("diabetes_rf_model.rds")

# Create new simulated data
new_data <- lapply(diabetes_test, function(x){sample(x, size = 2)}) %>%
  as_tibble()

new_data
pregnancies <dbl> | plasma_glucose <dbl> | diastolic_blood_pressure <dbl> | triceps_thickness <dbl> | |
---|---|---|---|---|
6 | 175 | 90 | 22 | |
1 | 84 | 55 | 9 |
# Make predictions
predictions <- new_data %>%
  bind_cols(loaded_model %>% predict(new_data))

predictions
pregnancies <dbl> | plasma_glucose <dbl> | diastolic_blood_pressure <dbl> | triceps_thickness <dbl> | |
---|---|---|---|---|
6 | 175 | 90 | 22 | |
1 | 84 | 55 | 9 |
Binary classification techniques work well when the data observations belong to one of two classes or categories, such as True
or False
. When the data can be categorized into more than two classes
, you must use a multiclass classification algorithm
.
Multiclass classification can be thought of as a combination
of multiple binary classifiers
. There are two ways in which you approach the problem:
One vs Rest (OVR), in which a classifier is created for each possible class value, with a positive outcome for cases where the prediction is this class, and negative predictions for cases where the prediction is any other class. A classification problem with four possible shape classes (square, circle, triangle, hexagon) would require four classifiers that predict:
square or not
circle or not
triangle or not
hexagon or not
One vs One (OVO), in which a classifier for each possible pair of classes is created. The classification problem with four shape classes would require the following binary classifiers:
square or circle
square or triangle
square or hexagon
circle or triangle
circle or hexagon
triangle or hexagon
In both approaches, the overall model that combines the classifiers generates a vector of predictions in which the probabilities generated from the individual binary classifiers are used to determine which class to predict.
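A quick sketch of how many binary classifiers each approach needs for a k-class problem (one per class for OVR, one per pair for OVO), using the four shape classes above:

# Number of binary classifiers needed for k = 4 classes
k <- 4
c(one_vs_rest = k,             # one classifier per class
  one_vs_one  = choose(k, 2))  # one classifier per pair of classes
## one_vs_rest  one_vs_one
##           4           6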
Fortunately, in most machine learning frameworks, including Tidymodels, implementing a multiclass classification model is not significantly more complex than binary classification.
The Palmer Archipelago penguins. Artwork by @allison_horst.
In this section, we'll build a multiclass classifier for classifying penguins!
The palmerpenguins
data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.
In R, the package palmerpenguins
by Allison Marie Horst and Alison Presmanes Hill and Kristen B Gorman provides us with data related to these adorable creatures.
The corresponding csv data used in the Python tutorial can be found here.
Care for a pun?
What is a penguin’s favorite movie?
Frozen ❄️👧.
With that, let’s get started exploring some penguins 🐧🐧!
# Load the dataset
library(palmerpenguins)
# Take a peek into the data
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex <fct> male, female, female, NA, female, male, female, male~
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~
The data contains the following columns:
species: a factor denoting the penguin species (Adelie
, Chinstrap
, or Gentoo
)
island: a factor denoting the island (in Palmer Archipelago, Antarctica) where observed
bill_length_mm (aka culmen_length): a number denoting length of the dorsal ridge of penguin bill (millimeters)
bill_depth_mm (aka culmen_depth): a number denoting the depth of the penguin bill (millimeters)
flipper_length_mm: an integer denoting penguin flipper length (millimeters)
body_mass_g: an integer denoting penguin body mass (grams)
sex: a factor denoting penguin sex (male, female)
year: an integer denoting the study year (2007, 2008, or 2009)
The species column containing penguin species Adelie
, Chinstrap
, or Gentoo
, is the label we want to train a model to predict.
The corresponding Python learning module used a data set with the following variables: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, species.
Let’s narrow down to those and make some summary statistics while at it. The skimr package provides a strong set of summary statistics that are generated for a variety of different data types.
# Select desired columns
penguins_select <- penguins %>%
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm,
           body_mass_g, species))

# Do some summary statistics
penguins_select %>%
  skim()
Name | Piped data |
Number of rows | 344 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
From the neat summary provided by skimr, we can see that each of our predictor columns contains 2 missing values, while our label/outcome column contains none.
Let’s dig a little deeper and filter the rows that contain missing values.
penguins_select %>%
  filter(if_any(everything(), is.na))
bill_length_mm <dbl> | bill_depth_mm <dbl> | flipper_length_mm <int> | body_mass_g <int> | species <fct> |
---|---|---|---|---|
NA | NA | NA | NA | Adelie |
NA | NA | NA | NA | Gentoo |
There are two rows that contain no feature values at all (NA
stands for Not Available ), so these won’t be useful in training a model. Let’s discard them from the dataset.
# Drop rows containing missing values
penguins_select <- penguins_select %>%
  drop_na()

# Confirm there are no missing values
penguins_select %>%
  anyNA()
## [1] FALSE
# Proportion of each species in the data
penguins_select %>%
  count(species)
species <fct> | n <int> | |||
---|---|---|---|---|
Adelie | 151 | |||
Chinstrap | 68 | |||
Gentoo | 123 |
Now that we’ve dealt with the missing values, let’s explore how the features relate to the label by creating some box charts.
# Pivot data to a long format
penguins_select_long <- penguins_select %>%
  pivot_longer(!species, names_to = "predictors", values_to = "values")

# Make box plots
penguins_select_long %>%
  ggplot(mapping = aes(x = species, y = values, fill = predictors)) +
geom_boxplot() +
facet_wrap(~predictors, scales = "free") +
scale_fill_paletteer_d("nbapalettes::supersonics_holiday") +
theme(legend.position = "none")
From the box plots, it looks like species Adelie
and Chinstrap
have similar data profiles for bill_depth, flipper_length, and body_mass, but Chinstraps tend to have longer bill_length. Gentoo
tends to have fairly clearly differentiated features from the others, which should help us train a good classification model.
Just as for binary classification, before training the model, we need
to split the data into subsets for training and validation. We’ll also
apply a stratification
technique when splitting the data to maintain the proportion of each label value
in the training and validation datasets.
set.seed(2056)
# Split data 70%-30% into training set and test set
penguins_split <- penguins_select %>%
  initial_split(prop = 0.70, strata = species)

# Extract data in each split
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)
# Print the number of observations in each split
cat("Training cases: ", nrow(penguins_train), "\n",
"Test cases: ", nrow(penguins_test), sep = "")
## Training cases: 238
## Test cases: 104
Now that we have a set of training features and corresponding training labels, we can fit a multiclass classification algorithm to the data to create a model.
parsnip::multinom_reg()
defines a model that uses linear predictors to predict multiclass data using the multinomial distribution.
Let’s fit Multinomial regression via nnet package. This model usually has 1 tuning hyperparameter, penalty
,
which describes the amount of regularization. This is used to
counteract any bias in the sample, and help the model generalize well by
avoiding overfitting the model to the training data. We can of
course tune this parameter, like we will later on in this lesson, but
for now, let's choose an arbitrary value of 1.
# Specify a multinomial regression via nnet
multireg_spec <- multinom_reg(penalty = 1) %>%
  set_engine("nnet") %>%
  set_mode("classification")

# Train a multinomial regression model without any preprocessing
set.seed(2056)
multireg_fit <- multireg_spec %>%
  fit(species ~ ., data = penguins_train)

# Print the model
multireg_fit
## parsnip model object
##
## Fit time: 20ms
## Call:
## nnet::multinom(formula = species ~ ., data = data, decay = ~1,
## trace = FALSE)
##
## Coefficients:
## (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap -0.09689103 1.3965690 -0.7974207 -0.153233777
## Gentoo -0.03813310 0.2345437 -1.7038451 0.006228115
## body_mass_g
## Chinstrap -0.004706975
## Gentoo 0.003833670
##
## Residual Deviance: 31.38783
## AIC: 51.38783
Now we can use the trained model to predict the labels for the test features, and evaluate performance:
# Make predictions for the test set
penguins_results <- penguins_test %>% select(species) %>%
  bind_cols(multireg_fit %>%
              predict(new_data = penguins_test)) %>%
  bind_cols(multireg_fit %>%
              predict(new_data = penguins_test, type = "prob"))

# Print predictions
penguins_results %>%
  slice_head(n = 5)
species <fct> | .pred_class <fct> | .pred_Adelie <dbl> | .pred_Chinstrap <dbl> | .pred_Gentoo <dbl> |
---|---|---|---|---|
Adelie | Adelie | 0.9999058 | 4.334555e-05 | 5.085677e-05 |
Adelie | Adelie | 0.9996200 | 3.584026e-04 | 2.157623e-05 |
Adelie | Adelie | 0.9895737 | 8.454760e-03 | 1.971491e-03 |
Adelie | Adelie | 0.9916696 | 6.837429e-03 | 1.493014e-03 |
Adelie | Adelie | 0.9236931 | 7.613199e-02 | 1.748642e-04 |
Now, let’s look at the confusion matrix for our model
# Confusion matrix
penguins_results %>%
  conf_mat(species, .pred_class)
## Truth
## Prediction Adelie Chinstrap Gentoo
## Adelie 46 1 0
## Chinstrap 0 20 0
## Gentoo 0 0 37
The confusion matrix shows the intersection of predicted and actual label values for each class - in simple terms, the diagonal intersections from top-left to bottom-right indicate the number of correct predictions.
When dealing with multiple classes, it’s generally more intuitive to visualize this as a heat map, like this:
update_geom_defaults(geom = "tile", new = list(color = "black", alpha = 0.7))
# Visualize confusion matrix
penguins_results %>%
  conf_mat(species, .pred_class) %>%
  autoplot(type = "heatmap")
The darker squares in the confusion matrix plot indicate high numbers of cases, and you can hopefully see a diagonal line of darker squares indicating cases where the predicted and actual label are the same.
Let’s now calculate summary statistics for the confusion matrix.
# Statistical summaries for the confusion matrix
conf_mat(data = penguins_results, truth = species, estimate = .pred_class) %>%
summary()
.metric <chr> | .estimator <chr> | .estimate <dbl> | ||
---|---|---|---|---|
accuracy | multiclass | 0.9903846 | ||
kap | multiclass | 0.9848507 | ||
sens | macro | 0.9841270 | ||
spec | macro | 0.9942529 | ||
ppv | macro | 0.9929078 | ||
npv | macro | 0.9960317 | ||
mcc | multiclass | 0.9850012 | ||
j_index | macro | 0.9783799 | ||
bal_accuracy | macro | 0.9891899 | ||
detection_prevalence | macro | 0.3333333 |
The tibble shows the overall metrics of how well the model performs across all three classes.
Let’s evaluate the ROC metrics. In the case of a multiclass classification model, a single ROC curve showing true positive rate vs false positive rate is not possible. However, you can use the rates for each class in a One vs Rest (OVR) comparison to create a ROC chart for each class.
# Make a ROC_CURVE
penguins_results %>%
  roc_curve(species, c(.pred_Adelie, .pred_Chinstrap, .pred_Gentoo)) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = .level)) +
geom_abline(lty = 2, color = "gray80", size = 0.9) +
geom_path(show.legend = T, alpha = 0.6, size = 1.2) +
coord_equal()
To quantify the ROC performance, you can calculate an aggregate area under the curve score that is averaged across all of the OVR curves.
# Calculate ROC_AUC
penguins_results %>%
  roc_auc(species, c(.pred_Adelie, .pred_Chinstrap, .pred_Gentoo))
.metric <chr> | .estimator <chr> | .estimate <dbl> | ||
---|---|---|---|---|
roc_auc | hand_till | 1 |
That went down well! The model did a great job in classifying the penguins. What kind of adventure would it be, if we didn’t preprocess the data?
Again, just like with binary classification, you can use a workflow to apply preprocessing steps to the data before fitting it to an algorithm to train a model. Let’s scale the numeric features in a transformation step before training, try a different algorithm (a support vector machine) and tune some model hyperparameters, just to show that we can!
Support Vector Machines
try to find a hyperplane in some feature space that “best” separates the classes. Please see:
Support Vector Machines, Hands-on Machine Learning with R
Support Vector Machines, An Introduction to Statistical Learning with Applications in R
SVM kernels, Scikit learn
We'll fit a radial basis function support vector machine to these data and tune the SVM cost parameter and the σ parameter in the kernel function (the margin parameter does not apply to classification models).
A cost argument allows us to specify the cost of a violation to the margin. When the cost argument is small, then the margins will be wide and many support vectors will be on the margin or will violate the margin. This could make the model more robust and lead to better classification. When the cost argument is large, then the margins will be narrow and there will be few support vectors on the margin or violating the margin.
As σ
decreases, the fit becomes more non-linear, and the model becomes more flexible.
Both parameters can have a profound effect on the model complexity and performance.
The radial basis kernel is extremely flexible and as a rule of thumb, we generally start with this kernel when fitting SVMs in practice.
Parameters are marked for tuning by assigning them a value of tune()
. Also, let’s try out a new succinct way of creating workflows that minimizes a lot of piping steps as suggested by David’s blog post (winner of sliced!!)
# Create a model specification
svm_spec <- svm_rbf(mode = "classification", engine = "kernlab",
                    cost = tune(), rbf_sigma = tune())

# Create a workflow that encapsulates a recipe and a model
svm_wflow <- recipe(species ~ ., data = penguins_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  workflow(svm_spec)
# Print out workflow
svm_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: svm_rbf()
##
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
##
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Radial Basis Function Support Vector Machine Specification (classification)
##
## Main Arguments:
## cost = tune()
## rbf_sigma = tune()
##
## Computational engine: kernlab
Pretty neat, right ✨?
Now that we have specified what parameter to tune, we’ll need to figure out a set of possible values to try out then choose the best.
To do this, we’ll create a grid! In this example, we’ll work through a regular grid of hyperparameter values, try them out, and see what pair results in the best model performance.
set.seed(2056)
# Create regular grid of 6 values for each tuning parameters
svm_grid <- grid_regular(parameters(svm_spec), levels = 6)

# Print out some parameters in our grid
svm_grid %>%
  slice_head(n = 10)
cost <dbl> | rbf_sigma <dbl> | |||
---|---|---|---|---|
9.765625e-04 | 1e-10 | |||
7.812500e-03 | 1e-10 | |||
6.250000e-02 | 1e-10 | |||
5.000000e-01 | 1e-10 | |||
4.000000e+00 | 1e-10 | |||
3.200000e+01 | 1e-10 | |||
9.765625e-04 | 1e-08 | |||
7.812500e-03 | 1e-08 | |||
6.250000e-02 | 1e-08 | |||
5.000000e-01 | 1e-08 |
Awesome! One thing about hyperparameters is that they are not learned
directly from the training set. Instead, they are estimated using simulated data sets
created from a process called resampling
. In a previous module, we used the cross-validation resampling method. Let's try out another resampling technique: bootstrap resampling.
Bootstrap resampling means drawing with replacement from our original dataset, then fitting a model on that new set (which contains some duplicates) and evaluating the model on the data points that were not included.
Then we do that again (the default behaviour is 25 bootstraps, but this can be changed). Okay, let's create some simulated data sets.
set.seed(2056)
# Bootstrap resampling
penguins_bs <- bootstraps(penguins_train, times = 10)

penguins_bs
splits <list> | id <chr> | |||
---|---|---|---|---|
<S3: boot_split> | Bootstrap01 | |||
<S3: boot_split> | Bootstrap02 | |||
<S3: boot_split> | Bootstrap03 | |||
<S3: boot_split> | Bootstrap04 | |||
<S3: boot_split> | Bootstrap05 | |||
<S3: boot_split> | Bootstrap06 | |||
<S3: boot_split> | Bootstrap07 | |||
<S3: boot_split> | Bootstrap08 | |||
<S3: boot_split> | Bootstrap09 | |||
<S3: boot_split> | Bootstrap10 |
We are ready to tune! Let’s use tune_grid()
to fit models at all the different values we chose for each tuned hyperparameter.
doParallel::registerDoParallel()

# Model tuning via a grid search
set.seed(2056)
svm_res <- tune_grid(
  object = svm_wflow,
  resamples = penguins_bs,
  grid = svm_grid
)
Now that we have our tuning results, we can extract the performance metrics using collect_metrics()
:
# Obtain performance metrics
svm_res %>%
  collect_metrics() %>%
  slice_head(n = 7)
cost <dbl> | rbf_sigma <dbl> | .metric <chr> | .estimator <chr> | mean <dbl> | n <int> | std_err <dbl> | |
---|---|---|---|---|---|---|---|
0.0009765625 | 1e-10 | accuracy | multiclass | 0.4279212 | 10 | 0.010790273 | |
0.0009765625 | 1e-10 | roc_auc | hand_till | 0.9809805 | 10 | 0.003231725 | |
0.0078125000 | 1e-10 | accuracy | multiclass | 0.4279212 | 10 | 0.010790273 | |
0.0078125000 | 1e-10 | roc_auc | hand_till | 0.9809669 | 10 | 0.003253624 | |
0.0625000000 | 1e-10 | accuracy | multiclass | 0.4279212 | 10 | 0.010790273 | |
0.0625000000 | 1e-10 | roc_auc | hand_till | 0.9803196 | 10 | 0.003260246 | |
0.5000000000 | 1e-10 | accuracy | multiclass | 0.4279212 | 10 | 0.010790273 |
🤩🤩 Let’s see if we could get more by visualizing the results:
# Visualize tuning metrics
svm_res %>%
  collect_metrics() %>%
  mutate(rbf_sigma = factor(rbf_sigma)) %>%
  ggplot(mapping = aes(x = cost, y = mean, color = rbf_sigma)) +
geom_line(size = 1.5, alpha = 0.7) +
geom_point(size = 2) +
facet_wrap(~.metric, scales = "free", nrow = 2) +
scale_x_log10(labels = scales::label_number()) +
scale_color_viridis_d(option = "viridis", begin = .1)
It seems that an SVM with an rbf_sigma of 1 and 0.01 really performed well across all candidate values of cost. The show_best()
function can help us make a clearer distinction:
# Show best submodel
svm_res %>%
  show_best("accuracy")
cost <dbl> | rbf_sigma <dbl> | .metric <chr> | .estimator <chr> | mean <dbl> | n <int> | std_err <dbl> | |
---|---|---|---|---|---|---|---|
32.0 | 1.00 | accuracy | multiclass | 0.9751752 | 10 | 0.005042863 | |
32.0 | 0.01 | accuracy | multiclass | 0.9749459 | 10 | 0.004247809 | |
4.0 | 1.00 | accuracy | multiclass | 0.9739020 | 10 | 0.005514761 | |
4.0 | 0.01 | accuracy | multiclass | 0.9706938 | 10 | 0.004727220 | |
0.5 | 1.00 | accuracy | multiclass | 0.9694269 | 10 | 0.004825592 |
Much better! Let’s now use the select_best()
function to pull out the single set of hyperparameter values in the best sub-model:
# Select best model hyperparameters
best_svm <- svm_res %>%
  select_best("accuracy")

best_svm
cost <dbl> | rbf_sigma <dbl> | .config <chr> | ||
---|---|---|---|---|
32 | 1 | Preprocessor1_Model36 |
Perfect! These are the values for rbf_sigma
and cost
that maximize accuracy for our penguins!
We can now finalize
our workflow such that the parameters we had marked for tuning by assigning them a value of tune()
can get updated with the values from best_svm.
# Finalize workflow
final_wflow <- svm_wflow %>%
  finalize_workflow(best_svm)

final_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: svm_rbf()
##
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
##
## * step_normalize()
##
## -- Model -----------------------------------------------------------------------
## Radial Basis Function Support Vector Machine Specification (classification)
##
## Main Arguments:
## cost = 32
## rbf_sigma = 1
##
## Computational engine: kernlab
That marks the end of tuning 💃!
Finally, let’s fit this final model to the training data and evaluate
how it would perform on new data using our test data. We can use the
function last_fit()
with our finalized model; this function fits the finalized model on the full training data set and evaluates it on the testing data.
# The last fit
final_fit <- last_fit(object = final_wflow, split = penguins_split)

# Collect metrics
final_fit %>%
  collect_metrics()
.metric <chr> | .estimator <chr> | .estimate <dbl> | .config <chr> | |
---|---|---|---|---|
accuracy | multiclass | 0.9711538 | Preprocessor1_Model1 | |
roc_auc | hand_till | 0.9981021 | Preprocessor1_Model1 |
Much better 🤩! You can of course go ahead and obtain the hard class and probability predictions using collect_predictions()
and you will be well on your way to computing the confusion matrix and other summaries that come with it.
# Collect predictions and make confusion matrix
final_fit %>%
  collect_predictions() %>%
  conf_mat(truth = species, estimate = .pred_class)
## Truth
## Prediction Adelie Chinstrap Gentoo
## Adelie 45 2 0
## Chinstrap 1 19 0
## Gentoo 0 0 37
Now let's save our trained model so we can use it again later. Begin by extracting the trained workflow object from the final_fit object.
# Extract trained workflow
penguins_svm_model <- final_fit %>%
  extract_workflow()

# Save workflow
saveRDS(penguins_svm_model, "penguins_svm_model.rds")
OK, so now we have a trained model. Let’s use it to predict the class of a new penguin observation:
# Load model
loaded_psvm_model <- readRDS("penguins_svm_model.rds")

# Create new tibble of observations
new_obs <- tibble(
  bill_length_mm = c(49.5, 38.2),
  bill_depth_mm = c(18.4, 20.1),
  flipper_length_mm = c(195, 190),
  body_mass_g = c(3600, 3900))

# Make predictions
new_results <- new_obs %>%
  bind_cols(loaded_psvm_model %>%
              predict(new_data = new_obs))

# Show predictions
new_results
bill_length_mm <dbl> | bill_depth_mm <dbl> | flipper_length_mm <dbl> | body_mass_g <dbl> | .pred_class <fct> |
---|---|---|---|---|
49.5 | 18.4 | 195 | 3600 | Chinstrap |
38.2 | 20.1 | 190 | 3900 | Adelie |
Good job! A working model 🐧🐧!
We need to chill, right? 😅 We hope you had a flippin’ good time!
In this module, you learned how classification can be used to create a machine learning model that predicts categories, or classes. You then used the amazing Tidymodels framework in R to train and evaluate classification models using different algorithms, performed some data preprocessing, tuned some hyperparameters, and made better predictions.
While Tidymodels and scikit-learn (Python) are popular frameworks for writing code to train classification models, you can also create machine learning solutions for classification using the graphical tools in Microsoft Azure Machine Learning. You can learn more about no-code development of classification models using Azure Machine Learning in the Create a classification model with Azure Machine Learning designer module.
Feel like challenging yourself to train a classification model? Try the challenge in the /challenges/03 - Wine Classification Challenge.ipynb notebook to see if you can classify wines into their grape varietals! Find the data here.
Allison Horst
for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her gallery.
Bethany
, Gold Microsoft Learn Student Ambassador, for her valuable feedback and suggestions.
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An introduction to statistical learning : with applications in R. Corresponding Tidymodels labs by Emil Hvitfeldt can be found here.
Max Kuhn and Julia Silge, Tidy Modeling with R.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Bradley Boehmke & Brandon Greenwell, Hands-On Machine Learning with R.
Jorge Cimentada, Machine Learning for Social Scientists.
Tidymodels reference website.
H. Wickham and G. Grolemund, R for Data Science: Visualize, Model, Transform, Tidy, and Import Data.
Happy leaRning,
Eric (R_ic), Gold Microsoft Learn Student Ambassador.