• Buckle up 🚀
    • Prerequisites
  • A gentle introduction to classification
  • 1. Binary classification
    • Explore the data
    • Split the data
    • Train and Evaluate a Binary Classification Model
  • 2. Recipes and workflows
  • 3. Try a different algorithm
  • 4. Multiclass Classification
  • 5. Summary

Buckle up 🚀

In this learning path, we’ll learn how to create machine learning models using R 😊. Machine learning is the foundation for predictive modeling and artificial intelligence. We’ll learn the core principles of machine learning and how to use common tools and frameworks to train, evaluate, and use machine learning models.

Modules that will be covered in this learning path include:

  • Explore and analyze data with R

  • Train and evaluate regression models

  • Train and evaluate classification models

  • Train and evaluate clustering models (under development)

  • Train and evaluate deep learning models (under development)

Prerequisites

This learning path assumes knowledge of basic mathematical concepts. Some experience with R and the tidyverse is also beneficial, though we’ll do our best to review the core concepts as we go. To get started with R and the tidyverse, the best place to start is R for Data Science, an O’Reilly book written by Hadley Wickham and Garrett Grolemund. It’s designed to take you from knowing nothing about R or the tidyverse to having all the basic tools of data science at your fingertips.

A Python edition of these modules is also available at this learning path.

Why R?

First, while not the only good option, R has been shown to be popular and effective in modern data analysis. Second, R is free and open-source. You can install it anywhere, modify the code, and have the ability to see exactly how computations are performed. Third, it has excellent community support via the canonical R mailing lists and, more importantly, with Twitter, StackOverflow, and RStudio Community. Anyone who asks a reasonable, reproducible question has a pretty good chance of getting an answer. - Feature Engineering and Selection: A Practical Approach for Predictive Models, Max Kuhn and Kjell Johnson

Now, let’s get started!

Artwork by @allison_horst

A gentle introduction to classification

Classification is a form of machine learning in which you train a model to predict which category an item belongs to. Categorical data has distinct ‘classes’, rather than numeric values.

For example, a health clinic might use diagnostic data such as a patient’s height, weight, blood pressure, and blood-glucose level to predict whether or not the patient is diabetic.

Classification is an example of a supervised machine learning technique, which means it relies on data that includes known feature values (for example, diagnostic measurements for patients) as well as known label values (for example, a classification of non-diabetic or diabetic). A classification algorithm is used to fit a subset of the data to a function that can calculate the probability for each class label from the feature values. The remaining data is used to evaluate the model by comparing the predictions it generates from the features to the known class labels.

The simplest form of classification is binary classification, in which the label is 0 or 1, representing one of two classes; for example, “True” or “False”; “Internal” or “External”; “Profitable” or “Non-Profitable”; and so on.

The class prediction is made by determining the probability for each possible class as a value between 0 (impossible) and 1 (certain). The total probability for all classes is 1, as the patient is definitely either diabetic or non-diabetic. So, if the predicted probability of a patient being diabetic is 0.3, then there is a corresponding probability of 0.7 that the patient is non-diabetic.

A threshold value, usually 0.5, is used to determine the predicted class - so if the positive class (in this case, diabetic) has a predicted probability greater than the threshold, then a classification of diabetic is predicted.
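
As a quick illustrative sketch (the probabilities here are toy values invented for this example, not real model output), applying such a threshold in R looks like this:

```r
# Hypothetical predicted probabilities of being diabetic for five patients
probs <- c(0.30, 0.85, 0.51, 0.49, 0.97)

# Apply the usual 0.5 threshold to obtain class predictions
ifelse(probs > 0.5, "diabetic", "nondiabetic")
## [1] "nondiabetic" "diabetic"    "diabetic"    "nondiabetic" "diabetic"
```

Note that 0.51 and 0.49 land on opposite sides of the threshold even though the model is barely more confident in one direction than the other.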

The best way to learn about classification is to try it for yourself, so that’s what you’ll do in this exercise.

We’ll need a few packages to complete this module. You can install them all at once with: install.packages(c('tidyverse', 'tidymodels', 'ranger', 'vip', 'palmerpenguins', 'skimr', 'paletteer', 'nnet', 'here'))

Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case some are missing.

suppressWarnings(if(!require("pacman")) install.packages("pacman"))
## Loading required package: pacman
pacman::p_load('tidyverse', 'tidymodels', 'ranger',
               'vip', 'skimr', 'here','palmerpenguins', 'kernlab',
               'janitor', 'paletteer', 'nnet')

1. Binary classification

Let’s start by looking at an example of binary classification, where the model must predict a label that belongs to one of two classes. In this exercise, we’ll train a binary classifier to predict whether or not a patient should be tested for diabetes based on some medical data.

Explore the data

The first step in any machine learning project is to explore the data that you will use to train a model. And before we can explore the data, we must first have it in our environment, right?

So, let’s begin by importing a CSV file of patient data into a tibble (a modern reimagining of the data frame):

Citation: The diabetes dataset used in this exercise is based on data originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases.

# Load the core tidyverse and make it available in your current R session
library(tidyverse)


# Read the csv file into a tibble
diabetes <- read_csv(file = "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/diabetes.csv")


# Print the first 10 rows of the data
diabetes %>% 
  slice_head(n = 10)
| PatientID | Pregnancies | PlasmaGlucose | DiastolicBloodPressure | TricepsThickness |
|-----------|-------------|---------------|------------------------|------------------|
| 1354778   | 0           | 171           | 80                     | 34               |
| 1147438   | 8           | 92            | 93                     | 47               |
| 1640031   | 7           | 115           | 47                     | 52               |
| 1883350   | 9           | 103           | 78                     | 25               |
| 1424119   | 1           | 85            | 59                     | 27               |
| 1619297   | 0           | 82            | 92                     | 9                |
| 1660149   | 0           | 133           | 47                     | 19               |
| 1458769   | 0           | 67            | 87                     | 43               |
| 1201647   | 8           | 80            | 95                     | 33               |
| 1403912   | 1           | 72            | 31                     | 40               |

(first 5 of 10 columns shown)

Sometimes, we may want a little more information about our data. We can have a look at the data, its structure, and the data types of its features by using the glimpse() function as below:

# Take a quick glance at the data
diabetes %>% 
  glimpse()
## Rows: 15,000
## Columns: 10
## $ PatientID              <dbl> 1354778, 1147438, 1640031, 1883350, 1424119, 16~
## $ Pregnancies            <dbl> 0, 8, 7, 9, 1, 0, 0, 0, 8, 1, 1, 3, 5, 7, 0, 3,~
## $ PlasmaGlucose          <dbl> 171, 92, 115, 103, 85, 82, 133, 67, 80, 72, 88,~
## $ DiastolicBloodPressure <dbl> 80, 93, 47, 78, 59, 92, 47, 87, 95, 31, 86, 96,~
## $ TricepsThickness       <dbl> 34, 47, 52, 25, 27, 9, 19, 43, 33, 40, 11, 31, ~
## $ SerumInsulin           <dbl> 23, 36, 35, 304, 35, 253, 227, 36, 24, 42, 58, ~
## $ BMI                    <dbl> 43.50973, 21.24058, 41.51152, 29.58219, 42.6045~
## $ DiabetesPedigree       <dbl> 1.21319135, 0.15836498, 0.07901857, 1.28286985,~
## $ Age                    <dbl> 21, 23, 23, 43, 22, 26, 21, 26, 53, 26, 22, 23,~
## $ Diabetic               <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,~

This data consists of diagnostic information about some patients who have been tested for diabetes. Note that the final column in the dataset (Diabetic) contains the value 0 for patients who tested negative for diabetes, and 1 for patients who tested positive. This is the label that we will train our model to predict; most of the other columns (Pregnancies, PlasmaGlucose, DiastolicBloodPressure, BMI and so on) are the features we will use to predict the Diabetic label.

Let’s kick off our adventure by reformatting the data to make it easier for a model to use effectively. For now, let’s drop the PatientID column, encode the Diabetic column as a categorical variable, and make the column names a bit friend_lieR:

# Load the janitor package for cleaning data
library(janitor)

# Clean data a bit
diabetes_select <- diabetes %>%
  # Encode Diabetic as category
  mutate(Diabetic = factor(Diabetic, levels = c("1","0"))) %>% 
  # Drop PatientID column
  select(-PatientID) %>% 
  # Clean column names
  clean_names()


# View data set
diabetes_select %>% 
  slice_head(n = 10)
| pregnancies | plasma_glucose | diastolic_blood_pressure | triceps_thickness |
|-------------|----------------|--------------------------|-------------------|
| 0           | 171            | 80                       | 34                |
| 8           | 92             | 93                       | 47                |
| 7           | 115            | 47                       | 52                |
| 9           | 103            | 78                       | 25                |
| 1           | 85             | 59                       | 27                |
| 0           | 82             | 92                       | 9                 |
| 0           | 133            | 47                       | 19                |
| 0           | 67             | 87                       | 43                |
| 8           | 80             | 95                       | 33                |
| 1           | 72             | 31                       | 40                |

(first 4 of 9 columns shown)

The goal of this exploration is to understand the relationships between the attributes of the data; in particular, any apparent correlation between the features and the label your model will try to predict. One way of doing this is by using data visualization.

Now let’s compare the feature distributions for each label value. We’ll begin by formatting the data to a long format to make it somewhat easier to make multiple facets.

# Pivot data to a long format
diabetes_select_long <- diabetes_select %>% 
    pivot_longer(!diabetic, names_to = "features", values_to = "values")


# Print the first 10 rows
diabetes_select_long %>% 
  slice_head(n = 10)
| diabetic | features                 | values     |
|----------|--------------------------|------------|
| 0        | pregnancies              | 0.000000   |
| 0        | plasma_glucose           | 171.000000 |
| 0        | diastolic_blood_pressure | 80.000000  |
| 0        | triceps_thickness        | 34.000000  |
| 0        | serum_insulin            | 23.000000  |
| 0        | bmi                      | 43.509726  |
| 0        | diabetes_pedigree        | 1.213191   |
| 0        | age                      | 21.000000  |
| 0        | pregnancies              | 8.000000   |
| 0        | plasma_glucose           | 92.000000  |

Perfect! Now, let’s make some plots.

theme_set(theme_light())
# Make a box plot for each predictor feature
diabetes_select_long %>% 
  ggplot(mapping = aes(x = diabetic, y = values, fill = features)) +
  geom_boxplot() + 
  facet_wrap(~ features, scales = "free", ncol = 4) +
  scale_fill_viridis_d(option = "plasma", end = .7) +
  theme(legend.position = "none")

Amazing🤩! For some of the features, there’s a noticeable difference in the distribution for each label value. In particular, Pregnancies and Age show markedly different distributions for diabetic patients than for non-diabetic patients. These features may help predict whether or not a patient is diabetic.

Split the data

Our dataset includes known values for the label, so we can use this to train a classifier so that it finds a statistical relationship between the features and the label value; but how will we know if our model is any good? How do we know it will predict correctly when we use it with new data that it wasn’t trained with?

It is best practice to hold out some of your data for testing. We can take advantage of the fact that we have a large dataset with known label values: use part of it to train the model, and hold back the rest to test the trained model. This lets us compare the predicted labels with the already known labels in the test set, giving us a better estimate of how the model will perform on new data.

In R, the amazing Tidymodels framework provides a collection of packages for modeling and machine learning using tidyverse principles. For instance, rsample, a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:

  • initial_split(): specifies how data will be split into a training and testing set

  • training() and testing() functions extract the data in each split

Use ?initial_split for more details.

Here is a great place to get started with Tidymodels: Get Started

# Load the Tidymodels packages
library(tidymodels)



# Split data into 70% for training and 30% for testing
set.seed(2056)
diabetes_split <- diabetes_select %>% 
  initial_split(prop = 0.70)


# Extract the data in each split
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)


# Print the number of cases in each split
cat("Training cases: ", nrow(diabetes_train), "\n",
    "Test cases: ", nrow(diabetes_test), sep = "")
## Training cases: 10500
## Test cases: 4500
# Print out the first 5 rows of the training set
diabetes_train %>% 
  slice_head(n = 5)
| pregnancies | plasma_glucose | diastolic_blood_pressure | triceps_thickness |
|-------------|----------------|--------------------------|-------------------|
| 0           | 84             | 101                      | 9                 |
| 1           | 157            | 56                       | 11                |
| 0           | 102            | 56                       | 15                |
| 1           | 81             | 90                       | 10                |
| 8           | 97             | 65                       | 24                |

(first 4 of 9 columns shown)

Train and Evaluate a Binary Classification Model

OK, now we’re ready to train our model by fitting the training features to the training labels (diabetic). There are various algorithms we can use to train the model. In this example, we’ll use Logistic Regression, which (despite its name) is a well-established algorithm for classification. Logistic regression is a binary classification algorithm, meaning it predicts 2 categories.

There are quite a number of ways to fit a logistic regression model in Tidymodels; see ?logistic_reg for the available engines. For now, let’s fit a logistic regression model via the default stats::glm() engine.

# Make a model specification
logreg_spec <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")


# Print the model specification
logreg_spec
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

After a model has been specified, the model can be estimated or trained using the fit() function, typically using a symbolic description of the model (a formula) and some data.

# Train a logistic regression model
logreg_fit <- logreg_spec %>% 
  fit(diabetic ~ ., data = diabetes_train)


# Print the model object
logreg_fit
## parsnip model object
## 
## Fit time:  51ms 
## 
## Call:  stats::glm(formula = diabetic ~ ., family = stats::binomial, 
##     data = data)
## 
## Coefficients:
##              (Intercept)               pregnancies            plasma_glucose  
##                 8.624243                 -0.266296                 -0.009615  
## diastolic_blood_pressure         triceps_thickness             serum_insulin  
##                -0.012297                 -0.022807                 -0.003932  
##                      bmi         diabetes_pedigree                       age  
##                -0.049028                 -0.923262                 -0.056876  
## 
## Degrees of Freedom: 10499 Total (i.e. Null);  10491 Residual
## Null Deviance:       13290 
## Residual Deviance: 9221  AIC: 9239

The model print out shows the coefficients learned during training.

Now that we’ve trained the model using the training data, we can use the test data we held back to evaluate how well it predicts, using predict(). Let’s start by using the model to predict labels for our test set, and compare the predicted labels to the known labels:

# Make predictions then bind them to the test set
results <- diabetes_test %>% select(diabetic) %>% 
  bind_cols(logreg_fit %>% predict(new_data = diabetes_test))


# Compare predictions
results %>% 
  slice_head(n = 10)
| diabetic | .pred_class |
|----------|-------------|
| 0        | 0           |
| 0        | 0           |
| 0        | 0           |
| 0        | 0           |
| 1        | 1           |
| 0        | 0           |
| 0        | 0           |
| 1        | 0           |
| 0        | 0           |
| 0        | 0           |

Comparing each prediction with its corresponding “ground truth” actual value isn’t a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: yardstick - a package used to measure the effectiveness of models using performance metrics.

The most obvious thing you might want to do is to check the accuracy of the predictions - in simple terms, what proportion of the labels did the model predict correctly?

yardstick::accuracy() does just that!

# Calculate accuracy: proportion of data predicted correctly
accuracy(data = results, truth = diabetic, estimate = .pred_class)
| .metric  | .estimator | .estimate |
|----------|------------|-----------|
| accuracy | binary     | 0.7888889 |

The accuracy is returned as a decimal value - a value of 1.0 would mean that the model got 100% of the predictions right; while an accuracy of 0.0 is, well, pretty useless 😐!

Accuracy seems like a sensible metric to evaluate (and to a certain extent it is), but you need to be careful about drawing too many conclusions from the accuracy of a classifier. Remember that it’s simply a measure of how many cases were predicted correctly. Suppose only 3% of the population is diabetic. You could create a classifier that always just predicts 0, and it would be 97% accurate - but not terribly helpful in identifying patients with diabetes!

Fortunately, there are some other metrics that reveal a little more about how our classification model is performing.

One performance metric associated with classification problems is the confusion matrix. A confusion matrix describes how well a classification model performs by tabulating how many examples in each class were correctly classified by a model. In our case, it will show you how many cases were classified as negative (0) and how many as positive (1); the confusion matrix also shows you how many were classified into the wrong categories.

The conf_mat() function from yardstick calculates this cross-tabulation of observed and predicted classes.

# Confusion matrix for prediction results
conf_mat(data = results, truth = diabetic, estimate = .pred_class)
##           Truth
## Prediction    1    0
##          1  897  293
##          0  657 2653

Awesome!

Let’s interpret the confusion matrix. Our model is asked to classify cases between two binary categories, category 1 for patients who tested positive for diabetes and category 0 for patients who tested negative.

  • If your model predicts a patient as 1 (positive) and they belong to category 1 (positive) in reality we call this a true positive, shown by the top left number 897.

  • If your model predicts a patient as 0 (negative) and they belong to category 1 (positive) in reality we call this a false negative, shown by the bottom left number 657.

  • If your model predicts a patient as 1 (positive) and they belong to category 0 (negative) in reality we call this a false positive, shown by the top right number 293.

  • If your model predicts a patient as 0 (negative) and they belong to category 0 (negative) in reality we call this a true negative, shown by the bottom right number 2653.

Our confusion matrix can thus be expressed in the following form:

|             | Truth: 1 | Truth: 0  |
|-------------|----------|-----------|
| Predicted 1 | 897 (TP) | 293 (FP)  |
| Predicted 0 | 657 (FN) | 2653 (TN) |

Note that the correct (true) predictions form a diagonal line from top left to bottom right - these figures should be significantly higher than the false predictions if the model is any good.

As you might have guessed it’s preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.

The confusion matrix is helpful since it gives rise to other metrics that can help us better evaluate the performance of a classification model. Let’s go through some of them:

🎓 Precision: TP/(TP + FP), the proportion of predicted positives that are actually positive. Also called the positive predictive value.

🎓 Recall: TP/(TP + FN), the proportion of actual positives that were correctly identified. Also known as sensitivity.

🎓 Specificity: TN/(TN + FP), the proportion of actual negatives that were correctly identified.

🎓 Accuracy: (TP + TN)/(TP + TN + FP + FN), the proportion of labels predicted correctly for a sample.

🎓 F Measure: The harmonic mean of precision and recall, with 1 being the best score and 0 the worst.
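
To make these formulas concrete, we can plug in the confusion-matrix counts obtained above (TP = 897, FP = 293, FN = 657, TN = 2653) and compute each metric by hand; the results should agree with what yardstick reports:

```r
# Confusion-matrix counts from the logistic regression model above
tp <- 897; fp <- 293; fn <- 657; tn <- 2653

precision <- tp / (tp + fp)                   # 897/1190  ~ 0.754
recall    <- tp / (tp + fn)                   # 897/1554  ~ 0.577
accuracy  <- (tp + tn) / (tp + tn + fp + fn)  # 3550/4500 ~ 0.789
f_measure <- 2 * precision * recall / (precision + recall)  # ~ 0.654
```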

Tidymodels provides yet another succinct way of evaluating all these metrics. Using yardstick::metric_set(), you can combine multiple metrics together into a new function that calculates all of them at once.

# Combine metrics and evaluate them all at once
eval_metrics <- metric_set(ppv, recall, accuracy, f_meas)
eval_metrics(data = results, truth = diabetic, estimate = .pred_class)
| .metric  | .estimator | .estimate |
|----------|------------|-----------|
| ppv      | binary     | 0.7537815 |
| recall   | binary     | 0.5772201 |
| accuracy | binary     | 0.7888889 |
| f_meas   | binary     | 0.6537901 |

Using the precision (ppv) metric, we are able to answer the question:

  • Of all the patients the model predicted are diabetic, how many are actually diabetic?

Using the recall metric, we are able to answer the question:

  • Of all the patients that are actually diabetic, how many did the model identify?

Great job, we just made predictions and evaluated them using a number of metrics.

Until now, we’ve considered the predictions from the model as being either 1 or 0 class labels. Actually, things are a little more complex than that. Statistical machine learning algorithms, like logistic regression, are based on probability; so what actually gets predicted by a binary classifier is the probability that the label is true (P(y)) and the probability that the label is false (1 − P(y)). A threshold value of 0.5 is used to decide whether the predicted label is a 1 (P(y) > 0.5) or a 0 (P(y) <= 0.5). Let’s see the probability pairs for each case:

# Predict class probabilities and bind them to results
results <- results %>% 
  bind_cols(logreg_fit %>% 
              predict(new_data = diabetes_test, type = "prob"))

  


# Print out the results
results %>% 
  slice_head(n = 10)
| diabetic | .pred_class | .pred_1    | .pred_0   |
|----------|-------------|------------|-----------|
| 0        | 0           | 0.41666247 | 0.5833375 |
| 0        | 0           | 0.09848140 | 0.9015186 |
| 0        | 0           | 0.04694784 | 0.9530522 |
| 0        | 0           | 0.05606136 | 0.9439386 |
| 1        | 1           | 0.58055760 | 0.4194424 |
| 0        | 0           | 0.33066116 | 0.6693388 |
| 0        | 0           | 0.28826501 | 0.7117350 |
| 1        | 0           | 0.26991877 | 0.7300812 |
| 0        | 0           | 0.27501792 | 0.7249821 |
| 0        | 0           | 0.13127572 | 0.8687243 |

The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilities are compared. If we were to change the threshold, it would affect the predictions, and therefore change the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the true positive rate (which is another name for recall) and the false positive rate (1 − specificity) for a range of possible thresholds. These two rates are then plotted against each other to form a chart known as a receiver operating characteristic (ROC) chart, like this:

# Make a roc_chart
results %>% 
  roc_curve(truth = diabetic, .pred_1) %>% 
  autoplot()

The ROC chart shows the curve of the true and false positive rates for different threshold values between 0 and 1. A perfect classifier would have a curve that goes straight up the left side and straight across the top. The diagonal line across the chart represents the probability of predicting correctly with a 50/50 random prediction; so you obviously want the curve to be higher than that (or your model is no better than simply guessing!).

The area under the curve (AUC) is a value between 0 and 1 that quantifies the overall performance of the model. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. The closer to 1 this value is, the better the model. Once again, Tidymodels includes a function to calculate this metric: yardstick::roc_auc()

# Compute the AUC
results %>% 
  roc_auc(diabetic, .pred_1)
| .metric | .estimator | .estimate |
|---------|------------|-----------|
| roc_auc | binary     | 0.8602241 |

2. Recipes and workflows

Data preprocessing with recipes

In this case, the ROC curve and its AUC indicate that the model performs better than a random guess which is not bad considering we performed very little preprocessing of the data.

In practice, it’s common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There’s a huge range of preprocessing transformations you can perform to get your data ready for modeling, but we’ll limit ourselves to a few common techniques:

  • Scaling numeric features so they’re on the same scale. This prevents features with large values from producing coefficients that disproportionately affect the predictions.

  • Encoding categorical variables. For example, by using a one hot encoding technique you can create “dummy” or indicator variables which replace the original categorical feature with numeric columns whose values are either 1 or 0.
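
As a quick base-R illustration of what these two transformations do (the values here are toy data invented for this example, not the diabetes data):

```r
# A toy numeric feature and a toy categorical feature
weight <- c(60, 75, 90)
smoker <- factor(c("yes", "no", "yes"))

# Scale weight to mean 0 and standard deviation 1
scale(weight)
## -1, 0, 1 (returned as a matrix with scaling attributes)

# One hot encode smoker into 0/1 indicator columns, one per level
model.matrix(~ smoker - 1)
##   smokerno smokeryes
## 1        0         1
## 2        1         0
## 3        0         1
```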

Tidymodels provides yet another neat package: recipes, a package for preprocessing data. Let’s specify a recipe that encodes the age column and then normalizes the rest of the predictor features.

# Preprocess the data for modelling
diabetes_recipe <- recipe(diabetic ~ ., data = diabetes_train) %>% 
  step_mutate(age = factor(age)) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_dummy(age)

# Print the recipe
diabetes_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          8
## 
## Operations:
## 
## Variable mutation for age
## Centering and scaling for all_numeric_predictors()
## Dummy variables from age

We just created a recipe containing an outcome and its corresponding predictors, specifying that the age variable should be converted to a categorical variable (factor), that all the numeric predictors should be normalized, and that dummy variables should be created for the nominal predictor (age) 🙌.
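
Note that a recipe is just a specification: nothing is computed until it is estimated. If you’d like to inspect what the preprocessed training data will look like, one optional check is to prep() the recipe (estimate the required statistics, such as means and standard deviations, from the training data) and then bake() it (apply the steps):

```r
# Estimate the preprocessing statistics, apply the steps,
# and peek at the first rows of the transformed training data
diabetes_recipe %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  slice_head(n = 5)
```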

Bundling it all together using a workflow

Now that we have a recipe and a model specification we defined previously, we need to find a way of bundling them together into an object that will first preprocess the data, fit the model on the preprocessed data and also allow for potential post-processing activities.

In Tidymodels, this convenient object is called a workflow and conveniently holds your modeling components.

The workflows package allows the user to bind modeling and preprocessing objects together. You can then fit the entire workflow to the data, such that the model encapsulates all of the preprocessing steps as well as the algorithm.

# Redefine the model specification
logreg_spec <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

# Bundle the recipe and model specification
lr_wf <- workflow() %>% 
  add_recipe(diabetes_recipe) %>% 
  add_model(logreg_spec)

# Print the workflow
lr_wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_mutate()
## * step_normalize()
## * step_dummy()
## 
## -- Model -----------------------------------------------------------------------
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

After a workflow has been specified, a model can be trained using the fit() function.

# Fit a workflow object
lr_wf_fit <- lr_wf %>% 
  fit(data = diabetes_train)

# Print wf object
lr_wf_fit
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_mutate()
## * step_normalize()
## * step_dummy()
## 
## -- Model -----------------------------------------------------------------------
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##              (Intercept)               pregnancies            plasma_glucose  
##                   0.8687                   -0.9424                   -0.3204  
## diastolic_blood_pressure         triceps_thickness             serum_insulin  
##                  -0.1974                   -0.3358                   -0.5528  
##                      bmi         diabetes_pedigree                   age_X22  
##                  -0.4849                   -0.3669                    0.1369  
##                  age_X23                   age_X24                   age_X25  
##                   1.0163                   18.9613                   18.7737  
##                  age_X26                   age_X28                   age_X29  
##                  18.8089                  -19.4492                  -19.9201  
##                  age_X30                   age_X31                   age_X32  
##                   0.5942                    0.8685                    0.9317  
##                  age_X33                   age_X34                   age_X35  
##                   0.4930                    0.5297                    0.7340  
##                  age_X36                   age_X37                   age_X38  
##                  -1.3050                   -1.7518                   -2.5862  
##                  age_X39                   age_X40                   age_X41  
##                  -1.9703                   -2.1639                   -2.2658  
##                  age_X42                   age_X43                   age_X44  
##                  -0.8612                   -1.9764                   -1.9533  
##                  age_X45                   age_X46                   age_X47  
##                  -1.8894                   -2.0556                   -1.9258  
##                  age_X48                   age_X49                   age_X50  
##                   0.7884                    2.9357                    1.2163  
##                  age_X51                   age_X52                   age_X53  
##                   0.2695                    0.3483                   -1.0652  
##                  age_X54                   age_X55                   age_X56  
##                  -2.2427                   -2.6624                   -2.4732  
##                  age_X57                   age_X58                   age_X59  
##                  -3.0821                   -2.2541                   -2.0212  
##                  age_X60                   age_X61                   age_X62  
##                  -3.1082                   -2.1049                   -1.9737  
##                  age_X63                   age_X64                   age_X65  
##                  -1.6991                   -0.6576                   -1.8492  
##                  age_X66                   age_X67                   age_X68  
##                   1.1912                    0.8770                   18.3957  
##                  age_X69                   age_X70                   age_X71  
##                  18.6030                   20.1025                   18.7479  
##                  age_X72                   age_X73                   age_X74  
##                  18.4269                   19.3626                   19.3518  
##                  age_X75                   age_X76                   age_X77  
##                  18.6483                   19.0067                   18.0447  
## 
## Degrees of Freedom: 10499 Total (i.e. Null);  10437 Residual
## Null Deviance:       13290 
## Residual Deviance: 6838  AIC: 6964
## 
## ...
## and 0 more lines.

Good job👏! We now have a trained workflow. The workflow print out shows the coefficients learned during training.

This allows us to use the model trained by this workflow to predict labels for our test set, and compare the performance metrics with the basic model we created previously.

# Make predictions on the test set
results <- diabetes_test %>% select(diabetic) %>% 
  bind_cols(lr_wf_fit %>% 
              predict(new_data = diabetes_test)) %>% 
  bind_cols(lr_wf_fit %>% 
              predict(new_data = diabetes_test, type = "prob"))

# Print the results
results %>% 
  slice_head(n = 10)
| diabetic | .pred_class | .pred_1      | .pred_0    |
|----------|-------------|--------------|------------|
| 0        | 0           | 3.181359e-01 | 0.68186413 |
| 0        | 0           | 1.456229e-01 | 0.85437708 |
| 0        | 0           | 8.747333e-02 | 0.91252667 |
| 0        | 0           | 5.700564e-10 | 1.00000000 |
| 1        | 1           | 9.310216e-01 | 0.06897839 |
| 0        | 0           | 5.451350e-09 | 0.99999999 |
| 0        | 1           | 5.697916e-01 | 0.43020837 |
| 1        | 1           | 6.938903e-01 | 0.30610969 |
| 0        | 0           | 4.401889e-09 | 1.00000000 |
| 0        | 0           | 1.913352e-01 | 0.80866481 |

Let’s take a look at the confusion matrix:

# Confusion matrix for prediction results
results %>% 
  conf_mat(truth = diabetic, estimate = .pred_class)
##           Truth
## Prediction    1    0
##          1 1116  245
##          0  438 2701

🤩🤩 Look at those metrics!

Can we visualize this? Of course, nothing is impaRsible!

# Visualize conf mat
update_geom_defaults(geom = "rect", new = list(fill = "midnightblue", alpha = 0.7))

results %>% 
  conf_mat(diabetic, .pred_class) %>% 
  autoplot()

What about our other metrics, such as ppv, sensitivity, etc.?

# Evaluate other desired metrics
eval_metrics(data = results, truth = diabetic, estimate = .pred_class)
## # A tibble: 4 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 ppv      binary     0.8199853
## 2 recall   binary     0.7181467
## 3 accuracy binary     0.8482222
## 4 f_meas   binary     0.7656947
# Evaluate ROC_AUC metrics
results %>% 
  roc_auc(diabetic, .pred_1)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary     0.9248788
# Plot ROC_CURVE
results %>% 
  roc_curve(diabetic, .pred_1) %>% 
  autoplot()

Comparing with previous predictions, the metrics look better, so clearly preprocessing the data has made a difference.

3. Try a different algorithm

Now let’s try a different algorithm. Previously we used a logistic regression algorithm, which is a linear algorithm. There are many kinds of classification algorithms we could try, including:

  • Support Vector Machine algorithms: Algorithms that define a hyperplane that separates classes.

  • Tree-based algorithms: Algorithms that build a decision tree to reach a prediction

  • Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to improve generalizability.

This time, we’ll train the model using an ensemble algorithm named Random Forest that averages the outputs of multiple random decision trees. Random forests help to reduce tree correlation by injecting more randomness into the tree-growing process. More specifically, instead of considering all predictors when searching for a given split, random forests pick a random sample of predictors to consider for that split.
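To make that idea concrete, here is a toy base-R sketch (purely illustrative; the actual fitting below is done by the `ranger` engine) of the two sources of randomness in a random forest: each tree is grown on a bootstrap sample of the rows, and each split considers only a random subset of the predictors. The predictor names are just hypothetical examples.

```r
set.seed(123)

# Hypothetical predictor names and training-set size, for illustration only
predictors <- c("pregnancies", "plasma_glucose", "diastolic_blood_pressure",
                "triceps_thickness", "serum_insulin", "bmi")
n_rows <- 100

# Randomness 1: each tree is grown on a bootstrap sample of the rows
tree_rows <- sample(n_rows, size = n_rows, replace = TRUE)

# Randomness 2: each split considers only `mtry` randomly chosen predictors
# (a common default for classification is floor(sqrt(p)))
mtry <- floor(sqrt(length(predictors)))
split_candidates <- sample(predictors, size = mtry)

mtry  # only 2 of the 6 predictors compete at this particular split
```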

For further reading on tree-based models, please see:

Machine Learning for Social Scientists

As we already have the gist of how to perform classification using Tidymodels, let’s get right into specifying and fitting a random forest algorithm.

# Preprocess the data for modelling
diabetes_recipe <- recipe(diabetic ~ ., data = diabetes_train) %>% 
  step_mutate(age = factor(age)) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_dummy(age)

# Build a random forest model specification
rf_spec <- rand_forest() %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

# Bundle recipe and model spec into a workflow
rf_wf <- workflow() %>% 
  add_recipe(diabetes_recipe) %>% 
  add_model(rf_spec)

# Fit a model
rf_wf_fit <- rf_wf %>% 
  fit(data = diabetes_train)

# Make predictions on test data
results <- diabetes_test %>% select(diabetic) %>% 
  bind_cols(rf_wf_fit %>% 
              predict(new_data = diabetes_test)) %>% 
  bind_cols(rf_wf_fit %>% 
              predict(new_data = diabetes_test, type = "prob"))

# Print out predictions
results %>% 
  slice_head(n = 10)
## # A tibble: 10 x 4
##    diabetic .pred_class     .pred_1   .pred_0
##    <fct>    <fct>             <dbl>     <dbl>
##  1 0        0           0.284647131 0.7153529
##  2 0        0           0.012075218 0.9879248
##  3 0        0           0.009816002 0.9901840
##  4 0        0           0.010763838 0.9892362
##  5 1        1           0.892012626 0.1079874
##  6 0        0           0.113799606 0.8862004
##  7 0        0           0.229868076 0.7701319
##  8 1        0           0.266318709 0.7336813
##  9 0        0           0.052735319 0.9472647
## 10 0        0           0.038482162 0.9615178

💪 There goes our random_forest model. Is it any good? Let’s evaluate its metrics!

# Confusion metrics for rf_predictions
results %>% 
  conf_mat(diabetic, .pred_class)
##           Truth
## Prediction    1    0
##          1 1371  100
##          0  183 2846
# Confusion matrix plot
results %>% 
  conf_mat(diabetic, .pred_class) %>% 
  autoplot()

There is a considerable increase in the number of True Positives and True Negatives, which is a step in the right direction.

Let’s take a look at other evaluation metrics

# Evaluate other intuitive classification metrics
rf_met <- results %>% 
  eval_metrics(truth = diabetic, estimate = .pred_class)

# Evaluate ROC_AUC
auc <- results %>% 
  roc_auc(diabetic, .pred_1)

# Plot ROC_CURVE
curve <- results %>% 
  roc_curve(diabetic, .pred_1) %>% 
  autoplot

# Return metrics
list(rf_met, auc, curve)
## [[1]]
## # A tibble: 4 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 ppv      binary         0.932
## 2 recall   binary         0.882
## 3 accuracy binary         0.937
## 4 f_meas   binary         0.906
## 
## [[2]]
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.985
## 
## [[3]]

For the sheer sake of adventure, let’s make a Variable Importance Plot to see which predictor variables have the most impact on our model.

# Load vip
library(vip)

# Extract the fitted model from the workflow
rf_wf_fit %>% 
  extract_fit_parsnip() %>% 
# Make VIP plot
  vip()

Just as we had anticipated from our data exploration 😊! This goes to show the importance of data exploration.

As revealed by the performance metrics, the random forest model seemed to have done a great job in increasing the True Positives/Negatives and reducing the False Positives/Negatives.

Use the model for inferencing

Now that we have a reasonably useful trained model, we can save it for use later to predict labels for new data:

# Save trained workflow
saveRDS(rf_wf_fit, "diabetes_rf_model.rds")

Now, we can load it whenever we need it, and use it to predict labels for new data. This is often called scoring or inferencing.

For example, let’s create a simulated data set by picking a random value for each column in our test set, then make predictions using the saved model.

# Load the model into the current R session
loaded_model <- readRDS("diabetes_rf_model.rds")

# Create new simulated data
new_data <- lapply(diabetes_test, function(x){sample(x, size = 2)}) %>% 
  as_tibble()

new_data
##   pregnancies plasma_glucose diastolic_blood_pressure triceps_thickness
##         <dbl>          <dbl>                    <dbl>             <dbl>
## 1           6            175                       90                22
## 2           1             84                       55                 9
## # ... with more variables not shown
# Make predictions
predictions <- new_data %>% 
  bind_cols(loaded_model %>% predict(new_data))

predictions
##   pregnancies plasma_glucose diastolic_blood_pressure triceps_thickness
##         <dbl>          <dbl>                    <dbl>             <dbl>
## 1           6            175                       90                22
## 2           1             84                       55                 9
## # ... with more variables not shown, including .pred_class

4. Multiclass Classification

Binary classification techniques work well when the data observations belong to one of two classes or categories, such as True or False. When the data can be categorized into more than two classes, you must use a multiclass classification algorithm.

Multiclass classification can be thought of as a combination of multiple binary classifiers. There are two ways in which you can approach the problem:

  • One vs Rest (OVR), in which a classifier is created for each possible class value, with a positive outcome for cases where the prediction is this class, and negative predictions for cases where the prediction is any other class. A classification problem with four possible shape classes (square, circle, triangle, hexagon) would require four classifiers that predict:

    • square or not

    • circle or not

    • triangle or not

    • hexagon or not

  • One vs One (OVO), in which a classifier for each possible pair of classes is created. The classification problem with four shape classes would require the following binary classifiers:

    • square or circle

    • square or triangle

    • square or hexagon

    • circle or triangle

    • circle or hexagon

    • triangle or hexagon

In both approaches, the overall model that combines the classifiers generates a vector of predictions in which the probabilities generated from the individual binary classifiers are used to determine which class to predict.
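As a small base-R illustration (not what Tidymodels does internally), suppose the four OVR shape classifiers each return a “this class vs the rest” probability for a single observation; the combined model simply predicts the class whose classifier is most confident. The scores below are hypothetical.

```r
# Hypothetical probabilities from four one-vs-rest binary classifiers
ovr_scores <- c(square = 0.10, circle = 0.05, triangle = 0.75, hexagon = 0.10)

# The combined prediction is the class with the highest OVR probability
predicted_class <- names(which.max(ovr_scores))
predicted_class  # "triangle"
```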

Fortunately, in most machine learning frameworks, including Tidymodels, implementing a multiclass classification model is not significantly more complex than binary classification.

Meet the data

The Palmer Archipelago penguins. Artwork by @allison_horst.

In this section, we’ll build a multiclass classifier for classifying penguins!

The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use under the CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.

In R, the package palmerpenguins by Allison Marie Horst, Alison Presmanes Hill, and Kristen B. Gorman provides us with data related to these adorable creatures.

The corresponding csv data used in the Python tutorial can be found here.

Care for a pun?

What is a penguin’s favorite movie?

Frozen ❄️👧.

With that, let’s get started exploring some penguins 🐧🐧!

Explore the data

# Load the dataset
library(palmerpenguins)

# Take a peek into the data
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
## $ sex               <fct> male, female, female, NA, female, male, female, male~
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~

The data contains the following columns:

  • species: a factor denoting the penguin species (Adelie, Chinstrap, or Gentoo)

  • island: a factor denoting the island (in Palmer Archipelago, Antarctica) where observed

  • bill_length_mm (aka culmen_length): a number denoting length of the dorsal ridge of penguin bill (millimeters)

  • bill_depth_mm (aka culmen_depth): a number denoting the depth of the penguin bill (millimeters)

  • flipper_length_mm: an integer denoting penguin flipper length (millimeters)

  • body_mass_g: an integer denoting penguin body mass (grams)

  • sex: a factor denoting penguin sex (male, female)

  • year: an integer denoting the study year (2007, 2008, or 2009)

The species column containing penguin species Adelie, Chinstrap, or Gentoo, is the label we want to train a model to predict.

The corresponding Python learning module used a data set with the following variables: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, species.

Let’s narrow down to those and make some summary statistics while at it. The skimr package provides a strong set of summary statistics that are generated for a variety of different data types.

# Select desired columns
penguins_select <- penguins %>% 
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm,
           body_mass_g, species))

# Do some summary statistics
penguins_select %>% 
  skim()
Data summary
Name Piped data
Number of rows 344
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1 FALSE 3 Ade: 152, Gen: 124, Chi: 68

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂

From the neat summary provided by skimr, we can see that each of our predictor columns contains 2 missing values, while our label/outcome column contains none.

Let’s dig a little deeper and filter the rows that contain missing values.

penguins_select %>% 
  filter(if_any(everything(), is.na))
## # A tibble: 2 x 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g species
##            <dbl>         <dbl>             <int>       <int> <fct>
## 1             NA            NA                NA          NA Adelie
## 2             NA            NA                NA          NA Gentoo

There are two rows that contain no feature values at all (NA stands for Not Available), so these won’t be useful in training a model. Let’s discard them from the dataset.

# Drop rows containing missing values
penguins_select <- penguins_select %>% 
  drop_na()

# Confirm there are no missing values
penguins_select %>% 
  anyNA()
## [1] FALSE
# Proportion of each species in the data
penguins_select %>% 
  count(species)
## # A tibble: 3 x 2
##   species       n
##   <fct>     <int>
## 1 Adelie      151
## 2 Chinstrap    68
## 3 Gentoo      123

Now that we’ve dealt with the missing values, let’s explore how the features relate to the label by creating some box charts.

# Pivot data to a long format
penguins_select_long <- penguins_select %>% 
  pivot_longer(!species, names_to = "predictors", values_to = "values")

# Make box plots
penguins_select_long %>%
  ggplot(mapping = aes(x = species, y = values, fill = predictors)) +
  geom_boxplot() +
  facet_wrap(~predictors, scales = "free") +
  scale_fill_paletteer_d("nbapalettes::supersonics_holiday") +
  theme(legend.position = "none")

From the box plots, it looks like species Adelie and Chinstrap have similar data profiles for bill_depth, flipper_length, and body_mass, but Chinstraps tend to have longer bill_length. Gentoo tends to have fairly clearly differentiated features from the others; which should help us train a good classification model.

Prepare the data

Just as for binary classification, before training the model, we need to split the data into subsets for training and validation. We’ll also apply a stratification technique when splitting the data to maintain the proportion of each label value in the training and validation datasets.

set.seed(2056)
# Split data 70%-30% into training set and test set
penguins_split <- penguins_select %>% 
  initial_split(prop = 0.70, strata = species)

# Extract data in each split
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)

# Print the number of observations in each split
cat("Training cases: ", nrow(penguins_train), "\n",
    "Test cases: ", nrow(penguins_test), sep = "")
## Training cases: 238
## Test cases: 104

Train and evaluate a multiclass classifier

Now that we have a set of training features and corresponding training labels, we can fit a multiclass classification algorithm to the data to create a model.

parsnip::multinom_reg() defines a model that uses linear predictors to predict multiclass data using the multinomial distribution.

Let’s fit a multinomial regression via the nnet package. This model has one tuning hyperparameter, penalty, which describes the amount of regularization. Regularization counteracts any bias in the sample and helps the model generalize by avoiding overfitting to the training data. We could of course tune this parameter, as we will later in this lesson, but for now let’s choose an arbitrary value of 1.

# Specify a multinomial regression via nnet
multireg_spec <- multinom_reg(penalty = 1) %>% 
  set_engine("nnet") %>% 
  set_mode("classification")

# Train a multinomial regression model without any preprocessing
set.seed(2056)
multireg_fit <- multireg_spec %>% 
  fit(species ~ ., data = penguins_train)

# Print the model
multireg_fit
## parsnip model object
## 
## Fit time:  20ms 
## Call:
## nnet::multinom(formula = species ~ ., data = data, decay = ~1, 
##     trace = FALSE)
## 
## Coefficients:
##           (Intercept) bill_length_mm bill_depth_mm flipper_length_mm
## Chinstrap -0.09689103      1.3965690    -0.7974207      -0.153233777
## Gentoo    -0.03813310      0.2345437    -1.7038451       0.006228115
##            body_mass_g
## Chinstrap -0.004706975
## Gentoo     0.003833670
## 
## Residual Deviance: 31.38783 
## AIC: 51.38783

Now we can use the trained model to predict the labels for the test features, and evaluate performance:

# Make predictions for the test set
penguins_results <- penguins_test %>% select(species) %>% 
  bind_cols(multireg_fit %>% 
              predict(new_data = penguins_test)) %>% 
  bind_cols(multireg_fit %>% 
              predict(new_data = penguins_test, type = "prob"))

# Print predictions
penguins_results %>% 
  slice_head(n = 5)
## # A tibble: 5 x 5
##   species .pred_class .pred_Adelie .pred_Chinstrap .pred_Gentoo
##   <fct>   <fct>              <dbl>           <dbl>        <dbl>
## 1 Adelie  Adelie         0.9999058    4.334555e-05 5.085677e-05
## 2 Adelie  Adelie         0.9996200    3.584026e-04 2.157623e-05
## 3 Adelie  Adelie         0.9895737    8.454760e-03 1.971491e-03
## 4 Adelie  Adelie         0.9916696    6.837429e-03 1.493014e-03
## 5 Adelie  Adelie         0.9236931    7.613199e-02 1.748642e-04

Now, let’s look at the confusion matrix for our model

# Confusion matrix
penguins_results %>% 
  conf_mat(species, .pred_class)
##            Truth
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        46         1      0
##   Chinstrap      0        20      0
##   Gentoo         0         0     37

The confusion matrix shows the intersection of predicted and actual label values for each class - in simple terms, the diagonal intersections from top-left to bottom-right indicate the number of correct predictions.
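For example, rebuilding those counts in a base R matrix, the overall accuracy is just the diagonal total divided by the total number of test cases:

```r
# The confusion matrix printed above, as a plain matrix
cm <- matrix(c(46,  1,  0,
                0, 20,  0,
                0,  0, 37),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("Adelie", "Chinstrap", "Gentoo"),
                             Truth      = c("Adelie", "Chinstrap", "Gentoo")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 103 / 104 correct
```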

When dealing with multiple classes, it’s generally more intuitive to visualize this as a heat map, like this:

update_geom_defaults(geom = "tile", new = list(color = "black", alpha = 0.7))
# Visualize confusion matrix
penguins_results %>% 
  conf_mat(species, .pred_class) %>% 
  autoplot(type = "heatmap")

The darker squares in the confusion matrix plot indicate high numbers of cases, and you can hopefully see a diagonal line of darker squares indicating cases where the predicted and actual label are the same.

Let’s now calculate summary statistics for the confusion matrix.

# Statistical summaries for the confusion matrix
conf_mat(data = penguins_results, truth = species, estimate = .pred_class) %>% 
  summary()
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             multiclass 0.9903846
##  2 kap                  multiclass 0.9848507
##  3 sens                 macro      0.9841270
##  4 spec                 macro      0.9942529
##  5 ppv                  macro      0.9929078
##  6 npv                  macro      0.9960317
##  7 mcc                  multiclass 0.9850012
##  8 j_index              macro      0.9783799
##  9 bal_accuracy         macro      0.9891899
## 10 detection_prevalence macro      0.3333333
## # ... with more rows not shown

The tibble shows the overall metrics of how well the model performs across all three classes.

Let’s evaluate the ROC metrics. In the case of a multiclass classification model, a single ROC curve showing true positive rate vs false positive rate is not possible. However, you can use the rates for each class in a One vs Rest (OVR) comparison to create a ROC chart for each class.

# Make a ROC_CURVE
penguins_results %>% 
  roc_curve(species, c(.pred_Adelie, .pred_Chinstrap, .pred_Gentoo)) %>% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = .level)) +
  geom_abline(lty = 2, color = "gray80", size = 0.9) +
  geom_path(show.legend = T, alpha = 0.6, size = 1.2) +
  coord_equal()

To quantify the ROC performance, you can calculate an aggregate area under the curve score that is averaged across all of the OVR curves.

# Calculate ROC_AUC
penguins_results %>% 
  roc_auc(species, c(.pred_Adelie, .pred_Chinstrap, .pred_Gentoo))
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till          1

That went down well! The model did a great job in classifying the penguins. What kind of adventure would it be, if we didn’t preprocess the data?

Workflows + A different algorithm

Again, just like with binary classification, you can use a workflow to apply preprocessing steps to the data before fitting it to an algorithm to train a model. Let’s scale the numeric features in a transformation step before training, try a different algorithm (a support vector machine) and tune some model hyperparameters, just to show that we can!

Support Vector Machines try to find a hyperplane in some feature space that “best” separates the classes. Please see:

We’ll fit a radial basis function support vector machine to these data and tune the SVM cost parameter and the σ parameter in the kernel function (The margin parameter does not apply to classification models)

  • A cost argument allows us to specify the cost of a violation to the margin. When the cost argument is small, then the margins will be wide and many support vectors will be on the margin or will violate the margin. This could make the model more robust and lead to better classification. When the cost argument is large, then the margins will be narrow and there will be few support vectors on the margin or violating the margin.

  • As σ decreases, the fit becomes more non-linear, and the model becomes more flexible.

Both parameters can have a profound effect on the model complexity and performance.

The radial basis kernel is extremely flexible and as a rule of thumb, we generally start with this kernel when fitting SVMs in practice.
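To build some intuition for what the σ parameter does, here is a minimal base-R version of the radial basis kernel in the parameterization kernlab uses, k(x, x') = exp(-sigma * ||x - x'||^2). This is only a sketch of the kernel function itself, not of how the SVM uses it.

```r
# Radial basis function kernel (kernlab's parameterization)
rbf_kernel <- function(x1, x2, sigma = 1) {
  exp(-sigma * sum((x1 - x2)^2))
}

x <- c(0, 0)
y <- c(1, 1)  # squared distance from x is 2

rbf_kernel(x, x)                # 1: a point is perfectly similar to itself
rbf_kernel(x, y, sigma = 0.01)  # ~0.98: y still looks very similar to x
rbf_kernel(x, y, sigma = 1)     # ~0.14: a larger sigma makes the kernel more local
```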

Parameters are marked for tuning by assigning them a value of tune(). Also, let’s try out a new, succinct way of creating workflows that minimizes a lot of piping steps, as suggested in David’s blog post (winner of SLICED!).

# Create a model specification
svm_spec <- svm_rbf(mode = "classification", engine = "kernlab",
            cost = tune(), rbf_sigma = tune())


# Create a workflow that encapsulates a recipe and a model
svm_wflow <- recipe(species ~ ., data = penguins_train) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  workflow(svm_spec)

# Print out workflow
svm_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: svm_rbf()
## 
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
## 
## * step_normalize()
## 
## -- Model -----------------------------------------------------------------------
## Radial Basis Function Support Vector Machine Specification (classification)
## 
## Main Arguments:
##   cost = tune()
##   rbf_sigma = tune()
## 
## Computational engine: kernlab

Pretty neat, right ✨?

Now that we have specified which parameters to tune, we’ll need to figure out a set of possible values to try out and then choose the best.

To do this, we’ll create a grid! In this example, we’ll work through a regular grid of hyperparameter values, try them out, and see what pair results in the best model performance.
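Conceptually, a regular grid is just the cross-product of a handful of candidate values per parameter, something you could hand-roll with base R’s expand.grid(). The ranges below are chosen to mirror dials’ defaults for these parameters (cost on a log2 scale, rbf_sigma on a log10 scale); treat them as an illustration rather than a guarantee of what grid_regular() produces.

```r
# Six candidate costs on a log2 scale and six sigmas on a log10 scale
cost_values  <- 2^seq(-10, 5, length.out = 6)   # 0.0009765625 ... 32
sigma_values <- 10^seq(-10, 0, length.out = 6)  # 1e-10 ... 1

# Every combination of the two sets of candidates
hand_grid <- expand.grid(cost = cost_values, rbf_sigma = sigma_values)

nrow(hand_grid)  # 36: one candidate model per pair, fitted on every resample
```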

set.seed(2056)
# Create regular grid of 6 values for each tuning parameters
svm_grid <- grid_regular(parameters(svm_spec), levels = 6)

# Print out some parameters in our grid
svm_grid %>% 
  slice_head(n = 10)
## # A tibble: 10 x 2
##            cost rbf_sigma
##           <dbl>     <dbl>
##  1 9.765625e-04     1e-10
##  2 7.812500e-03     1e-10
##  3 6.250000e-02     1e-10
##  4 5.000000e-01     1e-10
##  5 4.000000e+00     1e-10
##  6 3.200000e+01     1e-10
##  7 9.765625e-04     1e-08
##  8 7.812500e-03     1e-08
##  9 6.250000e-02     1e-08
## 10 5.000000e-01     1e-08

Awesome! One thing about hyperparameters is that they are not learned directly from the training set. Instead, they are estimated using simulated data sets created by a process called resampling. Previously, we used the cross-validation resampling method. Let’s try out another resampling technique: bootstrap resampling.

Bootstrap resampling means drawing with replacement from our original dataset, fitting a model on that new set (which contains some duplicates), and evaluating the model on the data points that were not included.

Then we do that again (the default behaviour is 25 bootstraps, but this can be changed). Okay, let’s create some simulated data sets.
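Stripped of the rsample machinery, a single bootstrap resample looks like this in base R (a toy sketch with 10 rows, just to show the mechanics):

```r
set.seed(2056)
n <- 10  # pretend the training set has 10 rows

# Draw n rows with replacement: the "analysis" set (contains duplicates)
analysis_idx <- sample(n, size = n, replace = TRUE)

# Rows never drawn are "out-of-bag": the "assessment" set used for evaluation
assessment_idx <- setdiff(seq_len(n), analysis_idx)

analysis_idx    # same length as the data, with some rows repeated
assessment_idx  # the left-out rows the model is evaluated on
```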

set.seed(2056)
# Bootstrap resampling
penguins_bs <- bootstraps(penguins_train, times = 10)

penguins_bs
## # Bootstrap sampling 
## # A tibble: 10 x 2
##    splits           id         
##    <list>           <chr>      
##  1 <S3: boot_split> Bootstrap01
##  2 <S3: boot_split> Bootstrap02
##  3 <S3: boot_split> Bootstrap03
##  4 <S3: boot_split> Bootstrap04
##  5 <S3: boot_split> Bootstrap05
##  6 <S3: boot_split> Bootstrap06
##  7 <S3: boot_split> Bootstrap07
##  8 <S3: boot_split> Bootstrap08
##  9 <S3: boot_split> Bootstrap09
## 10 <S3: boot_split> Bootstrap10

Model tuning via grid search

We are ready to tune! Let’s use tune_grid() to fit models at all the different values we chose for each tuned hyperparameter.

doParallel::registerDoParallel()

# Model tuning via a grid search
set.seed(2056)
svm_res <- tune_grid(
  object = svm_wflow,
  resamples = penguins_bs,
  grid = svm_grid
)

Now that we have our tuning results, we can extract the performance metrics using collect_metrics():

# Obtain performance metrics
svm_res %>% 
  collect_metrics() %>% 
  slice_head(n = 7)
## # A tibble: 7 x 7
##           cost rbf_sigma .metric  .estimator      mean     n     std_err
##          <dbl>     <dbl> <chr>    <chr>          <dbl> <int>       <dbl>
## 1 0.0009765625     1e-10 accuracy multiclass 0.4279212    10 0.010790273
## 2 0.0009765625     1e-10 roc_auc  hand_till  0.9809805    10 0.003231725
## 3 0.0078125000     1e-10 accuracy multiclass 0.4279212    10 0.010790273
## 4 0.0078125000     1e-10 roc_auc  hand_till  0.9809669    10 0.003253624
## 5 0.0625000000     1e-10 accuracy multiclass 0.4279212    10 0.010790273
## 6 0.0625000000     1e-10 roc_auc  hand_till  0.9803196    10 0.003260246
## 7 0.5000000000     1e-10 accuracy multiclass 0.4279212    10 0.010790273

🤩🤩 Let’s see if we could get more by visualizing the results:

# Visualize tuning metrics
svm_res %>% 
  collect_metrics() %>% 
  mutate(rbf_sigma = factor(rbf_sigma)) %>% 
  ggplot(mapping = aes(x = cost, y = mean, color = rbf_sigma)) +
  geom_line(size = 1.5, alpha = 0.7) +
  geom_point(size = 2) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number()) +
  scale_color_viridis_d(option = "viridis", begin = .1)

It seems that an SVM with an rbf_sigma of 1 or 0.01 performed really well across all candidate values of cost. The show_best() function can help us make a clearer distinction:

# Show best submodel 
svm_res %>% 
  show_best("accuracy")
## # A tibble: 5 x 7
##    cost rbf_sigma .metric  .estimator      mean     n     std_err
##   <dbl>     <dbl> <chr>    <chr>          <dbl> <int>       <dbl>
## 1  32.0      1.00 accuracy multiclass 0.9751752    10 0.005042863
## 2  32.0      0.01 accuracy multiclass 0.9749459    10 0.004247809
## 3   4.0      1.00 accuracy multiclass 0.9739020    10 0.005514761
## 4   4.0      0.01 accuracy multiclass 0.9706938    10 0.004727220
## 5   0.5      1.00 accuracy multiclass 0.9694269    10 0.004825592

Much better! Let’s now use the select_best() function to pull out the single set of hyperparameter values in the best sub-model:

# Select best model hyperparameters
best_svm <- svm_res %>% 
  select_best("accuracy")

best_svm
## # A tibble: 1 x 3
##    cost rbf_sigma .config              
##   <dbl>     <dbl> <chr>                
## 1    32         1 Preprocessor1_Model36

Perfect! These are the values for rbf_sigma and cost that maximize accuracy for our penguins!

We can now finalize our workflow so that the parameters we had marked for tuning with tune() are updated with the values from best_svm.

# Finalize workflow
final_wflow <- svm_wflow %>% 
  finalize_workflow(best_svm)

final_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: svm_rbf()
## 
## -- Preprocessor ----------------------------------------------------------------
## 1 Recipe Step
## 
## * step_normalize()
## 
## -- Model -----------------------------------------------------------------------
## Radial Basis Function Support Vector Machine Specification (classification)
## 
## Main Arguments:
##   cost = 32
##   rbf_sigma = 1
## 
## Computational engine: kernlab

That marks the end of tuning 💃!

The last fit

Finally, let’s fit this final model to the training data and evaluate how it would perform on new data using our test data. We can use the function last_fit() with our finalized model; this function fits the finalized model on the full training data set and evaluates it on the testing data.

# The last fit
final_fit <- last_fit(object = final_wflow, split = penguins_split)

# Collect metrics
final_fit %>% 
  collect_metrics()
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy multiclass 0.9711538 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.9981021 Preprocessor1_Model1

Much better 🤩! You can of course go ahead and obtain the hard class and probability predictions using collect_predictions(), and you will be well on your way to computing the confusion matrix and other summaries that come with it.

# Collect predictions and make confusion matrix
final_fit %>% 
  collect_predictions() %>% 
  conf_mat(truth = species, estimate = .pred_class)
##            Truth
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        45         2      0
##   Chinstrap      1        19      0
##   Gentoo         0         0     37

Use the model with new data observations

Now let’s save our trained model so we can use it again later. Begin by extracting the trained workflow object from final_fit object.

# Extract trained workflow
penguins_svm_model <- final_fit %>% 
  extract_workflow()

# Save workflow
saveRDS(penguins_svm_model, "penguins_svm_model.rds")

OK, so now we have a trained model. Let’s use it to predict the class of a new penguin observation:

# Load model
loaded_psvm_model <- readRDS("penguins_svm_model.rds") 

# Create new tibble of observations
new_obs <- tibble(
  bill_length_mm = c(49.5, 38.2),
  bill_depth_mm = c(18.4, 20.1),
  flipper_length_mm = c(195, 190),
  body_mass_g = c(3600, 3900))

# Make predictions
new_results <- new_obs %>% 
  bind_cols(loaded_psvm_model %>% 
              predict(new_data = new_obs))

# Show predictions
new_results
## # A tibble: 2 x 5
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g .pred_class
##            <dbl>         <dbl>             <dbl>       <dbl> <fct>
## 1           49.5          18.4               195        3600 Chinstrap
## 2           38.2          20.1               190        3900 Adelie

Good job! A working model 🐧🐧!

7. Summary

We need to chill, right? 😅 We hope you had a flippin’ good time!

In this module, you learned how classification can be used to create a machine learning model that predicts categories, or classes. You then used the amazing Tidymodels framework in R to train and evaluate a classification model using different algorithms, do some data preprocessing, tuned some hyperparameters and made better predictions.

While Tidymodels and scikit-learn (Python) are popular frameworks for writing code to train classification models, you can also create machine learning solutions for classification using the graphical tools in Microsoft Azure Machine Learning. You can learn more about no-code development of classification models using Azure Machine Learning in the Create a classification model with Azure Machine Learning designer module.

Challenge: Predict Real Estate Prices

Feel like challenging yourself to train a classification model? Try the challenge in the /challenges/03 - Wine Classification Challenge.ipynb notebook to see if you can classify wines into their grape varietals! Find the data here.

THANK YOU TO:

Allison Horst for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her gallery.

Bethany, Gold Microsoft Learn Student Ambassador, for her valuable feedback and suggestions.

FURTHER READING

Happy leaRning,

Eric (R_ic), Gold Microsoft Learn Student Ambassador.

---
title: 'Train and Evaluate Classification Models using Tidymodels'
output:
  html_document:
    css: style_7.css
    df_print: paged
    theme: flatly
    highlight: breezedark
    toc: yes
    toc_float: yes
    code_download: yes
---

```{r setup, include=FALSE, eval=T}
library(tidyverse)
library(tidymodels)
library(ranger)
library(vip)
library(palmerpenguins)
library(paletteer)
library(nnet)
library(skimr)
library(here)
library(kernlab)
# slice <- dplyr::slice
# eval_metrics <- metric_set(rmse, rsq)

```

## Buckle up 🚀

In this learning path, we'll learn how to create Machine learning models using `R` 😊. Machine learning is the foundation for predictive modeling and artificial intelligence. We'll learn the core principles of machine learning and how to use common tools and frameworks to train, evaluate, and use machine learning models.

Modules that will be covered in this learning path include:

-   Explore and analyze data with R

-   Train and evaluate regression models

-   Train and evaluate classification models

-   *Train and evaluate clustering models (under development)*

-   *Train and evaluate deep learning models (under development)*

### **Prerequisites**

This learning path assumes knowledge of basic mathematical concepts. Some experience with `R and the tidyverse` is also beneficial, though we'll try as much as possible to skim through the core concepts. To get started with R and the tidyverse, the best place to start is [R for Data Science](http://r4ds.had.co.nz/), an O'Reilly book written by Hadley Wickham and Garrett Grolemund. It's designed to take you from knowing nothing about R or the tidyverse to having all the basic tools of data science at your fingertips.

The `Python` edition of the learning path can be found at [this learning path](https://docs.microsoft.com/en-us/learn/paths/create-machine-learn-models/).

**Why R?**

> First, while not the only good option, R has been shown to be popular and effective in modern data analysis. Second, R is free and open-source. You can install it anywhere, modify the code, and have the ability to see exactly how computations are performed. Third, it has excellent community support via the canonical R mailing lists and, more importantly, with Twitter, StackOverflow, and RStudio Community. Anyone who asks a reasonable, reproducible question has a pretty good chance of getting an answer. - [`Feature Engineering and Selection: A Practical Approach for Predictive Models, Max Kuhn and Kjell Johnson`](https://bookdown.org/max/FES/)

Now, let's get started!

![Artwork by \@allison_horst](images/encouRage.jpg){width="630"}

## A gentle introduction to classification

*Classification* is a form of machine learning in which you train a model to predict which category an item belongs to. *Categorical* data has distinct 'classes', rather than numeric values.

For example, a health clinic might use diagnostic data such as a patient's height, weight, blood pressure, and blood-glucose level to predict whether or not the patient is diabetic.

![](images/diabetes-classification.png)

Classification is an example of a *supervised* machine learning technique, which means it relies on data that includes known *feature* values (for example, diagnostic measurements for patients) as well as known *label* values (for example, a classification of non-diabetic or diabetic). A classification algorithm is used to fit a subset of the data to a function that can calculate the `probability` for each class label from the feature values. The remaining data is used to evaluate the model by comparing the predictions it generates from the features to the known class labels.

The simplest form of classification is *binary* classification, in which the label is 0 or 1, representing one of two classes; for example, "True" or "False"; "Internal" or "External"; "Profitable" or "Non-Profitable"; and so on.

The class prediction is made by determining the *probability* for each possible class as a value between 0 (impossible) and 1 (certain). The total probability for all classes is 1, as the patient is definitely either diabetic or non-diabetic. So, if the predicted probability of a patient being diabetic is 0.3, then there is a corresponding probability of 0.7 that the patient is non-diabetic.

A threshold value, usually 0.5, is used to determine the predicted class - so if the *positive* class (in this case, diabetic) has a predicted probability greater than the threshold, then a classification of diabetic is predicted.
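Sticking with the toy numbers above, the thresholding logic can be written in a couple of lines of R:

```{r}
# Predicted probability that the patient is diabetic (the toy example above)
p_diabetic <- 0.3
p_non_diabetic <- 1 - p_diabetic   # 0.7: the two probabilities sum to 1

# Apply the usual 0.5 threshold to pick the predicted class
predicted_class <- ifelse(p_diabetic > 0.5, "diabetic", "non-diabetic")
predicted_class   # "non-diabetic"
```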

The best way to learn about classification is to try it for yourself, so that's what you'll do in this exercise.

> We'll require some packages to knock off this module. You can install them with: `install.packages(c('tidyverse', 'tidymodels', 'ranger', 'vip', 'palmerpenguins', 'skimr', 'paletteer', 'nnet', 'here'))`

Alternatively, the script below checks whether you have the packages required to complete this module and installs them for you in case some are missing.

```{r}
suppressWarnings(if(!require("pacman")) install.packages("pacman"))

pacman::p_load('tidyverse', 'tidymodels', 'ranger',
               'vip', 'skimr', 'here','palmerpenguins', 'kernlab',
               'janitor', 'paletteer', 'nnet')
```

## 1. Binary classification

Let's start by looking at an example of *binary classification*, where the model must predict a label that belongs to one of two classes. In this exercise, we'll train a binary classifier to predict whether or not a patient should be tested for diabetes based on some medical data.

### Explore the data

The first step in any machine learning project is to `explore the data` that you will use to train a model. And before we can explore the data, we must first have it in our environment, right?

So, let's begin by importing a CSV file of patient data into a `tibble` (a modern reimagining of the data frame):

> **Citation**: The diabetes dataset used in this exercise is based on data originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases.

```{r read_url, message=F, warning=F, exercise.setup = "setupA"}
# Load the core tidyverse and make it available in your current R session
library(tidyverse)


# Read the csv file into a tibble
diabetes <- read_csv(file = "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/diabetes.csv")


# Print the first 10 rows of the data
diabetes %>% 
  slice_head(n = 10)



```

Sometimes, we may want a little more information about our data. We can have a look at the `data`, `its structure` and the `data type` of its features by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function as below:

```{r glimpse, message=F, warning=F}
# Take a quick glance at the data
diabetes %>% 
  glimpse()
```

This data consists of diagnostic information about some patients who have been tested for diabetes. Note that the final column in the dataset (`Diabetic`) contains the value *`0`* for patients who tested `negative` for diabetes, and *`1`* for patients who tested positive. This is the label that we will train our model to predict; most of the other columns (**Pregnancies**, **PlasmaGlucose**, **DiastolicBloodPressure**, **BMI** and so on) are the features we will use to predict the **Diabetic** label.

Let's kick off our adventure by reformatting the data to make it easier for a model to use effectively. For now, let's drop the PatientID column, encode the Diabetic column as a categorical variable, and make the column names a bit friend_lieR:

```{r reformat}
# Load the janitor package for cleaning data
library(janitor)

# Clean data a bit
diabetes_select <- diabetes %>%
  # Encode Diabetic as category
  mutate(Diabetic = factor(Diabetic, levels = c("1","0"))) %>% 
  # Drop PatientID column
  select(-PatientID) %>% 
  # Clean column names
  clean_names()


# View data set
diabetes_select %>% 
  slice_head(n = 10)
```

The goal of this exploration is to understand the `relationships` between the data's attributes; in particular, any apparent correlation between the *features* and the *label* your model will try to predict. One way of doing this is by using data visualization.

Now let's compare the feature distributions for each label value. We'll begin by formatting the data to a *long* format to make it somewhat easier to make multiple facets.

```{r long_format, message=F, warning=F}
# Pivot data to a long format
diabetes_select_long <- diabetes_select %>% 
    pivot_longer(!diabetic, names_to = "features", values_to = "values")


# Print the first 10 rows
diabetes_select_long %>% 
  slice_head(n = 10)
```

Perfect! Now, let's make some plots.

```{r plot_long, message=F, warning=F}
theme_set(theme_light())
# Make a box plot for each predictor feature
diabetes_select_long %>% 
  ggplot(mapping = aes(x = diabetic, y = values, fill = features)) +
  geom_boxplot() + 
  facet_wrap(~ features, scales = "free", ncol = 4) +
  scale_fill_viridis_d(option = "plasma", end = .7) +
  theme(legend.position = "none")

```

Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each label value. In particular, `Pregnancies` and `Age` show markedly different distributions for diabetic patients than for non-diabetic patients. These features may help predict whether or not a patient is diabetic.

### Split the data

Our dataset includes known values for the label, so we can use this to train a classifier so that it finds a statistical relationship between the features and the label value; but how will we know if our model is any good? How do we know it will predict correctly when we use it with new data that it wasn't trained with?

It is best practice to hold out some of your data for **testing** in order to get a better estimate of how your models will perform on new data.

Well, we can take advantage of the fact that we have a large dataset with known label values, use only some of it to train the model, and hold back some to test the trained model - enabling us to compare the predicted labels with the already known labels in the test set.

In R, the amazing Tidymodels framework provides a collection of packages for modeling and machine learning using **tidyverse** principles. For instance, [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:

-   `initial_split()`: specifies how data will be split into a training and testing set

-   `training()` and `testing()` functions extract the data in each split

Use `?initial_split` for more details.

> Here is a great place to get started with Tidymodels: [Get Started](https://www.tidymodels.org/start/)

```{r plot, message=F, warning=F}
# Load the Tidymodels packages
library(tidymodels)



# Split data into 70% for training and 30% for testing
set.seed(2056)
diabetes_split <- diabetes_select %>% 
  initial_split(prop = 0.70)


# Extract the data in each split
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)


# Print the number of cases in each split
cat("Training cases: ", nrow(diabetes_train), "\n",
    "Test cases: ", nrow(diabetes_test), sep = "")


# Print out the first 5 rows of the training set
diabetes_train %>% 
  slice_head(n = 5)

```

### Train and Evaluate a Binary Classification Model

OK, now we're ready to train our model by fitting the training features to the training labels (`diabetic`). There are various algorithms we can use to train the model. In this example, we'll use *`Logistic Regression`*, which (despite its name) is a well-established algorithm for classification. Logistic regression is a binary classification algorithm, meaning it predicts 2 categories.

There are quite a number of ways to fit a logistic regression model in Tidymodels; see `?logistic_reg()` for the available engines. For now, let's fit a logistic regression model via the default `stats::glm()` engine.

```{r log_glm, message=F, warning=F}
# Make a model specification
logreg_spec <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")


# Print the model specification
logreg_spec
```

After a model has been *specified*, the model can be `estimated` or `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function, typically using a symbolic description of the model (a formula) and some data.

```{r log_glm_fit, message=F, warning=F}
# Train a logistic regression model
logreg_fit <- logreg_spec %>% 
  fit(diabetic ~ ., data = diabetes_train)


# Print the model object
logreg_fit


```

The model print out shows the coefficients learned during training.

Now that we've trained the model using the training data, we can use the test data we held back to evaluate how well it predicts, using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for our test set, and compare the predicted labels to the known labels:

```{r model_eval,message=F,warning=F}
# Make predictions then bind them to the test set
results <- diabetes_test %>% select(diabetic) %>% 
  bind_cols(logreg_fit %>% predict(new_data = diabetes_test))


# Compare predictions
results %>% 
  slice_head(n = 10)
```

Comparing each prediction with its corresponding "ground truth" actual value isn't a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics.

The most obvious thing you might want to do is to check the *accuracy* of the predictions - in simple terms, what proportion of the labels did the model predict correctly?

`yardstick::accuracy()` does just that!

```{r acc,message=F,warning=F}
# Calculate accuracy: proportion of data predicted correctly
accuracy(data = results, truth = diabetic, estimate = .pred_class)
```

The accuracy is returned as a decimal value - a value of 1.0 would mean that the model got 100% of the predictions right; while an accuracy of 0.0 is, well, pretty useless 😐!

Accuracy seems like a sensible metric to evaluate (and to a certain extent it is), but you need to be careful about drawing too many conclusions from the accuracy of a classifier. Remember that it's simply a measure of how many cases were predicted correctly. Suppose only 3% of the population is diabetic. You could create a classifier that always just predicts 0, and it would be 97% accurate - but not terribly helpful in identifying patients with diabetes!
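To see this concretely, here's a quick simulation of that 3% scenario using yardstick's `accuracy_vec()` (the vector version of the `accuracy()` metric above):

```{r}
# 100 imaginary patients: only 3 are actually diabetic (class "1")
truth <- factor(c(rep("1", 3), rep("0", 97)), levels = c("1", "0"))

# A lazy classifier that always predicts "0" (non-diabetic)
always_negative <- factor(rep("0", 100), levels = c("1", "0"))

# 97% accurate, yet it never identifies a single diabetic patient!
accuracy_vec(truth, estimate = always_negative)
```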

Fortunately, there are some other metrics that reveal a little more about how our classification model is performing.

One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs by tabulating how many examples in each class were correctly classified by a model. In our case, it will show you how many cases were classified as negative (0) and how many as positive (1); the confusion matrix also shows you how many were classified into the *wrong* categories.

The [`conf_mat()`](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick calculates this cross-tabulation of observed and predicted classes.

```{r conf_mat}
# Confusion matrix for prediction results
conf_mat(data = results, truth = diabetic, estimate = .pred_class)

```

Awesome!

Let's interpret the confusion matrix. Our model is asked to classify cases between two binary categories, category `1` for patients who tested positive for diabetes and category `0` for patients who tested negative.

-   If your model predicts a patient as `1` (positive) and they belong to category `1` (positive) in reality we call this a `true positive`, shown by the top left number `897`.

-   If your model predicts a patient as `0` (negative) and they belong to category `1` (positive) in reality we call this a `false negative`, shown by the bottom left number `657`.

-   If your model predicts a patient as `1` (positive) and they belong to category `0` (negative) in reality we call this a `false positive`, shown by the top right number `293`.

-   If your model predicts a patient as `0` (negative) and they belong to category `0` (negative) in reality we call this a `true negative`, shown by the bottom right number `2653`.

Our confusion matrix can thus be expressed in the following form:

|               |    **Truth**     |                   |
|:-------------:|:----------------:|:-----------------:|
| **Predicted** |        1         |         0         |
|       1       | $897_{\ \ \ TP}$ | $293_{\ \ \ FP}$  |
|       0       | $657_{\ \ \ FN}$ | $2653_{\ \ \ TN}$ |

Note that the correct (*`true`*) predictions form a diagonal line from top left to bottom right - these figures should be significantly higher than the *false* predictions if the model is any good.

As you might have guessed, it's preferable to have larger numbers of true positives and true negatives and lower numbers of false positives and false negatives, which implies that the model performs better.

The confusion matrix is helpful since it gives rise to other metrics that can help us better evaluate the performance of a classification model. Let's go through some of them:

🎓 Precision: `TP/(TP + FP)` defined as the proportion of predicted positives that are actually positive. Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value "Positive predictive value")

🎓 Recall: `TP/(TP + FN)` defined as the proportion of positive results out of the number of samples which were actually positive. Also known as `sensitivity`.

🎓 Specificity: `TN/(TN + FP)` defined as the proportion of negative results out of the number of samples which were actually negative.

🎓 Accuracy: `(TP + TN)/(TP + TN + FP + FN)` The proportion of labels predicted correctly for a sample.

🎓 F Measure: A weighted harmonic mean of precision and recall, with best being 1 and worst being 0.
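As a sanity check, we can plug the counts from the confusion matrix above into these formulas by hand (your exact values will differ if your data split differs):

```{r}
# Counts from the confusion matrix above
tp <- 897; fp <- 293; fn <- 657; tn <- 2653

tp / (tp + fp)                    # precision (ppv): ~0.754
tp / (tp + fn)                    # recall (sensitivity): ~0.577
tn / (tn + fp)                    # specificity: ~0.901
(tp + tn) / (tp + tn + fp + fn)   # accuracy: ~0.789
```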

Tidymodels provides yet another succinct way of evaluating all these metrics. Using `yardstick::metric_set()`, you can combine multiple metrics together into a new function that calculates all of them at once.

```{r metric_set}
# Combine metrics and evaluate them all at once
eval_metrics <- metric_set(ppv, recall, accuracy, f_meas)
eval_metrics(data = results, truth = diabetic, estimate = .pred_class)

```

Using the precision (ppv) metric, we are able to answer the question:

-   Of all the patients the model predicted are diabetic, how many are actually diabetic?

Using the recall metric, we are able to answer the question:

-   Of all the patients that are actually diabetic, how many did the model identify?

Great job, we just made predictions and evaluated them using a number of metrics.

Until now, we've considered the predictions from the model as being either 1 or 0 class labels. Actually, things are a little more complex than that. Statistical machine learning algorithms, like logistic regression, are based on `probability`; so what actually gets predicted by a binary classifier is the probability that the label is true ($P(y)$) and the probability that the label is false ($1-P(y)$). A threshold value of 0.5 is used to decide whether the predicted label is a `1` ($P(y)>0.5$) or a `0` ($P(y)<=0.5$). Let's see the probability pairs for each case:

```{r prob}
# Predict class probabilities and bind them to results
results <- results %>% 
  bind_cols(logreg_fit %>% 
              predict(new_data = diabetes_test, type = "prob"))

  


# Print out the results
results %>% 
  slice_head(n = 10)


```
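As a quick sanity check, we can apply the 0.5 threshold to `.pred_1` ourselves and confirm that it reproduces the hard class predictions in `.pred_class`:

```{r}
# Manually threshold the predicted probabilities at 0.5
results %>% 
  mutate(manual_class = factor(if_else(.pred_1 > 0.5, "1", "0"),
                               levels = c("1", "0"))) %>% 
  # Count how often the manual rule agrees with .pred_class
  count(agree = manual_class == .pred_class)
```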

The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilities are compared. If we were to change the threshold, it would affect the predictions; and therefore change the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the *true positive rate* (which is another name for recall) and the *false positive rate* (1 - specificity) for a range of possible thresholds. These rates are then plotted against all possible thresholds to form a chart known as a *receiver operating characteristic (ROC) chart*, like this:

```{r roc_curve}
# Make a roc_chart
results %>% 
  roc_curve(truth = diabetic, .pred_1) %>% 
  autoplot()

```

The ROC chart shows the curve of the true and false positive rates for different threshold values between 0 and 1. A perfect classifier would have a curve that goes straight up the left side and straight across the top. The diagonal line across the chart represents the probability of predicting correctly with a 50/50 random prediction; so you obviously want the curve to be higher than that (or your model is no better than simply guessing!).

The area under the curve (AUC) is a value between 0 and 1 that quantifies the overall performance of the model. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. The closer to 1 this value is, the better the model. Once again, Tidymodels includes a function to calculate this metric: `yardstick::roc_auc()`

```{r auc}
# Compute the AUC
results %>% 
  roc_auc(diabetic, .pred_1)

```

## 2. Recipes and workflows

#### Data preprocessing with recipes

In this case, the ROC curve and its AUC indicate that the model performs better than a random guess, which is not bad considering we performed very little preprocessing of the data.

In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There's a huge range of preprocessing transformations you can perform to get your data ready for modeling, but we'll limit ourselves to a few common techniques:

-   Scaling numeric features so they're on the same scale. This prevents features with large values from producing coefficients that disproportionately affect the predictions.

-   Encoding categorical variables. For example, by using a *one hot encoding* technique you can create "*dummy*" or *indicator variables* which replace the original categorical feature with numeric columns whose values are either 1 or 0.

Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/), a package for preprocessing data. Let's specify a recipe that encodes the age column and then normalizes the rest of the predictor features.

```{r recipes}
# Preprocess the data for modelling
diabetes_recipe <- recipe(diabetic ~ ., data = diabetes_train) %>% 
  step_mutate(age = factor(age)) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_dummy(age)

# Print the recipe
diabetes_recipe

```

We just created a recipe containing an outcome and its corresponding predictors, specifying that the age variable should be converted to a categorical variable (a factor), that all the numeric predictors should be normalized, and that dummy variables should be created for the nominal predictor (age) 🙌.

#### Bundling it all together using a workflow

Now that we have a recipe and a model specification we defined previously, we need to find a way of bundling them together into an object that will first preprocess the data, fit the model on the preprocessed data and also allow for potential post-processing activities.

In Tidymodels, this convenient object is called a [`workflow`](https://workflows.tidymodels.org/) and conveniently holds your modeling components.

The [**workflows**](https://workflows.tidymodels.org/) package allows the user to bind modeling and preprocessing objects together. You can then fit the entire workflow to the data, such that the model encapsulates all of the preprocessing steps as well as the algorithm.

```{r workflow}
# Redefine the model specification
logreg_spec <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

# Bundle the recipe and model specification
lr_wf <- workflow() %>% 
  add_recipe(diabetes_recipe) %>% 
  add_model(logreg_spec)

# Print the workflow
lr_wf

```

After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function.

```{r fit_wf}
# Fit a workflow object
lr_wf_fit <- lr_wf %>% 
  fit(data = diabetes_train)

# Print wf object
lr_wf_fit

```

Good job👏! We now have a trained workflow. The workflow print out shows the coefficients learned during training.

This allows us to use the model trained by this workflow to predict labels for our test set, and compare the performance metrics with the basic model we created previously.

```{r eval_wf}
# Make predictions on the test set
results <- diabetes_test %>% select(diabetic) %>% 
  bind_cols(lr_wf_fit %>% 
              predict(new_data = diabetes_test)) %>% 
  bind_cols(lr_wf_fit %>% 
              predict(new_data = diabetes_test, type = "prob"))

# Print the results
results %>% 
  slice_head(n = 10)

```

Let's take a look at the confusion matrix:

```{r conf_mat2}
# Confusion matrix for prediction results
results %>% 
  conf_mat(truth = diabetic, estimate = .pred_class)


```

🤩🤩 Look at those metrics!

Can we visualize this? Of course, nothing is impaRsible!

```{r conf_mat_viz}
# Visualize conf mat
update_geom_defaults(geom = "rect", new = list(fill = "midnightblue", alpha = 0.7))

results %>% 
  conf_mat(diabetic, .pred_class) %>% 
  autoplot()

```

What about our other metrics such as ppv, sensitivity etc?

```{r eval_met}
# Evaluate other desired metrics
eval_metrics(data = results, truth = diabetic, estimate = .pred_class)

# Evaluate ROC_AUC metrics
results %>% 
  roc_auc(diabetic, .pred_1)

# Plot ROC_CURVE
results %>% 
  roc_curve(diabetic, .pred_1) %>% 
  autoplot()
```

Comparing with previous predictions, the metrics look better, so clearly preprocessing the data has made a difference.

## 3. Try a different algorithm

Now let's try a different algorithm. Previously we used a logistic regression algorithm, which is a *linear* algorithm. There are many kinds of classification algorithms we could try, including:

-   **Support Vector Machine algorithms**: Algorithms that define a *hyperplane* that separates classes.

-   **Tree-based algorithms**: Algorithms that build a decision tree to reach a prediction

-   **Ensemble algorithms**: Algorithms that combine the outputs of multiple base algorithms to improve generalizability.

This time, we'll train the model using an *ensemble* algorithm named *Random Forest* that averages the outputs of multiple random decision trees. Random forests help to reduce tree correlation by injecting more randomness into the tree-growing process. More specifically, instead of considering all predictors in the data when calculating a given split, random forests pick a random sample of predictors to be considered for that split.

> For further reading on Tree based models, please see:
>
> [Machine Learning for Social Scientists](https://cimentadaj.github.io/ml_socsci/tree-based-methods.html#random-forests)

As we already have a gist of how to perform classification using Tidymodels, let's get right into specifying and fitting a random forest algorithm.

```{r rand_forest}
# Preprocess the data for modelling
diabetes_recipe <- recipe(diabetic ~ ., data = diabetes_train) %>% 
  step_mutate(age = factor(age)) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_dummy(age)

# Build a random forest model specification
rf_spec <- rand_forest() %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

# Bundle recipe and model spec into a workflow
rf_wf <- workflow() %>% 
  add_recipe(diabetes_recipe) %>% 
  add_model(rf_spec)

# Fit a model
rf_wf_fit <- rf_wf %>% 
  fit(data = diabetes_train)

# Make predictions on test data
results <- diabetes_test %>% select(diabetic) %>% 
  bind_cols(rf_wf_fit %>% 
              predict(new_data = diabetes_test)) %>% 
  bind_cols(rf_wf_fit %>% 
              predict(new_data = diabetes_test, type = "prob"))

# Print out predictions
results %>% 
  slice_head(n = 10)

```

💪 There goes our random_forest model. Is it any good? Let's evaluate its metrics!

```{r eval_rf}
# Confusion metrics for rf_predictions
results %>% 
  conf_mat(diabetic, .pred_class)

# Confusion matrix plot
results %>% 
  conf_mat(diabetic, .pred_class) %>% 
  autoplot()

```

There is a considerable increase in the number of `True Positives` and `True Negatives`, which is a step in the right direction.

Let's take a look at other evaluation metrics

```{r other_met}
# Evaluate other intuitive classification metrics
rf_met <- results %>% 
  eval_metrics(truth = diabetic, estimate = .pred_class)

# Evaluate ROC_AOC
auc <- results %>% 
  roc_auc(diabetic, .pred_1)

# Plot ROC_CURVE
curve <- results %>% 
  roc_curve(diabetic, .pred_1) %>% 
  autoplot()

# Return metrics
list(rf_met, auc, curve)


```

For the sheer sake of adventure, let's make a Variable Importance Plot to see which predictor variables have the most impact in our model.

```{r}
# Load vip
library(vip)

# Extract the fitted model from the workflow
rf_wf_fit %>% 
  extract_fit_parsnip() %>% 
# Make VIP plot
  vip()
```

Just as we had anticipated from our data exploration 😊! This goes to show the importance of data exploration.

As revealed by the performance metrics, the random forest model seemed to have done a great job in increasing the True Positives/Negatives and reducing the False Positives/Negatives.

#### Use the model for inferencing

Now that we have a reasonably useful trained model, we can save it for use later to predict labels for new data:

```{r rf_save}
# Save trained workflow
saveRDS(rf_wf_fit, "diabetes_rf_model.rds")

```

Now, we can load it whenever we need it, and use it to predict labels for new data. This is often called *`scoring`* or *`inferencing`*.

For example, let's create a simulated data set by picking a random value for each column in our test set, then make predictions using the saved model.

```{r rf_inference}
# Load the model into the current R session
loaded_model <- readRDS("diabetes_rf_model.rds")

# Create new simulated data
new_data <- lapply(diabetes_test, function(x){sample(x, size = 2)}) %>% 
  as_tibble()

new_data

# Make predictions
predictions <- new_data %>% 
  bind_cols(loaded_model %>% predict(new_data))

predictions
```

## 4. Multiclass Classification

Binary classification techniques work well when the data observations belong to one of two classes or categories, such as `True` or `False`. When the data can be categorized into `more than two classes`, you must use a `multiclass classification algorithm`.

Multiclass classification can be thought of as a `combination` of `multiple binary classifiers`. There are two ways in which you approach the problem:

-   **One vs Rest (OVR)**, in which a classifier is created for each possible class value, with a positive outcome for cases where the prediction is *this* class, and negative predictions for cases where the prediction is any other class. A classification problem with four possible shape classes (*square*, *circle*, *triangle*, *hexagon*) would require four classifiers that predict:

    -   *square* or not

    -   *circle* or not

    -   *triangle* or not

    -   *hexagon* or not

-   **One vs One (OVO)**, in which a classifier for each possible pair of classes is created. The classification problem with four shape classes would require the following binary classifiers:

    -   *square* or *circle*

    -   *square* or *triangle*

    -   *square* or *hexagon*

    -   *circle* or *triangle*

    -   *circle* or *hexagon*

    -   *triangle* or *hexagon*

In both approaches, the overall model that combines the classifiers generates a vector of predictions in which the probabilities generated from the individual binary classifiers are used to determine which class to predict.

Fortunately, in most machine learning frameworks, including Tidymodels, implementing a multiclass classification model is not significantly more complex than binary classification.
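For instance, a multinomial regression specification (one algorithm that handles more than two outcome classes directly) looks just like the binary specifications we wrote earlier. This is only a sketch of the specification step; fitting and evaluating a multiclass model follows shortly:

```{r}
# A minimal sketch: specifying a multiclass model in parsnip
# (multinom_reg() with the nnet engine handles 3+ outcome classes)
multinom_spec <- multinom_reg() %>% 
  set_engine("nnet") %>% 
  set_mode("classification")

# Print the model specification
multinom_spec
```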

#### Meet the data

![The Palmer Archipelago penguins. Artwork by \@allison_horst.](images/lter_penguins.png){width="600"}

In this section, we'll build a multiclass classifier for classifying penguins!

The `palmerpenguins` data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

> These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the [Palmer Station Long Term Ecological Research Program](https://pal.lternet.edu/), part of the [US Long Term Ecological Research Network](https://lternet.edu/). The data were imported directly from the [Environmental Data Initiative](https://environmentaldatainitiative.org/) (EDI) Data Portal, and are available for use by CC0 license ("No Rights Reserved") in accordance with the [Palmer Station Data Policy](https://pal.lternet.edu/data/policies).

In R, the package `palmerpenguins` by [Allison Marie Horst and Alison Presmanes Hill and Kristen B Gorman](https://allisonhorst.github.io/palmerpenguins/articles/intro.html) provides us with data related to these adorable creatures.

The corresponding csv data used in the Python tutorial can be found [here](https://github.com/MicrosoftDocs/ml-basics/tree/master/data).

Care for a pun?

    What is a penguin’s favorite movie?

    Frozen ❄️👧.

With that, let's get started exploring some penguins 🐧🐧!

#### Explore the data

```{r penguins}
# Load the dataset
library(palmerpenguins)

# Take a peek into the data
glimpse(penguins)


```

The data contains the following columns:

-   **species:** a factor denoting the penguin species (`Adelie`, `Chinstrap`, or `Gentoo`)

-   **island:** a factor denoting the island (in Palmer Archipelago, Antarctica) where observed

-   **bill_length_mm (aka culmen_length):** a number denoting the length of the dorsal ridge of a penguin's bill (millimeters)

-   **bill_depth_mm (aka culmen_depth):** a number denoting the depth of the penguin bill (millimeters)

-   **flipper_length_mm:** an integer denoting penguin flipper length (millimeters)

-   **body_mass_g:** an integer denoting penguin body mass (grams)

-   **sex:** a factor denoting penguin sex (male, female)

-   **year:** an integer denoting the study year (2007, 2008, or 2009)

The **species** column containing penguin species `Adelie`, `Chinstrap`, or `Gentoo`, is the label we want to train a model to predict.

The corresponding [Python learning module](https://docs.microsoft.com/en-us/learn/modules/train-evaluate-classification-models/) used a data set with the following variables: **bill_length_mm**, **bill_depth_mm**, **flipper_length_mm**, **body_mass_g**, **species**.

Let's narrow down to those and make some summary statistics while at it. The [skimr package](https://cran.r-project.org/web/packages/skimr/vignettes/skimr.html) provides a strong set of summary statistics that are generated for a variety of different data types.

```{r skimr}
# Select desired columns
penguins_select <- penguins %>% 
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm,
           body_mass_g, species))

# Do some summary statistics
penguins_select %>% 
  skim()

```

From the neat summary provided by skimr, we can see that each of our predictor columns contains 2 missing values, while our label/outcome column contains none.

Let's dig a little deeper and filter the rows that contain missing values.

```{r missing_rows}
penguins_select %>% 
  filter(if_any(everything(), is.na))

```

There are two rows that contain no feature values at all (`NA` stands for Not Available), so these won't be useful in training a model. Let's discard them from the dataset.

```{r drop_na}
# Drop rows containing missing values
penguins_select <- penguins_select %>% 
  drop_na()

# Confirm there are no missing values
penguins_select %>% 
  anyNA()

# Proportion of each species in the data
penguins_select %>% 
  count(species)

```

Now that we've dealt with the missing values, let's explore how the features relate to the label by creating some box charts.

```{r box_plt_penguins}
# Pivot data to a long format
penguins_select_long <- penguins_select %>% 
  pivot_longer(!species, names_to = "predictors", values_to = "values")

# Make box plots
penguins_select_long %>%
  ggplot(mapping = aes(x = species, y = values, fill = predictors)) +
  geom_boxplot() +
  facet_wrap(~predictors, scales = "free") +
  scale_fill_paletteer_d("nbapalettes::supersonics_holiday") +
  theme(legend.position = "none")
  

```

From the box plots, it looks like species `Adelie` and `Chinstrap` have similar data profiles for bill_depth, flipper_length, and body_mass, but Chinstraps tend to have longer bills. `Gentoo` tends to have fairly clearly differentiated features from the others, which should help us train a good classification model.

#### Prepare the data

Just as for binary classification, before training the model, we need to split the data into subsets for training and validation. We'll also apply a `stratification` technique when splitting the data to `maintain the proportion of each label value` in the training and validation datasets.

```{r penguin_split}
set.seed(2056)
# Split data 70%-30% into training set and test set
penguins_split <- penguins_select %>% 
  initial_split(prop = 0.70, strata = species)

# Extract data in each split
penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)

# Print the number of observations in each split
cat("Training cases: ", nrow(penguins_train), "\n",
    "Test cases: ", nrow(penguins_test), sep = "")
```

#### Train and evaluate a multiclass classifier

Now that we have a set of training features and corresponding training labels, we can fit a multiclass classification algorithm to the data to create a model.

`parsnip::multinom_reg()` defines a model that uses linear predictors to predict multiclass data using the multinomial distribution.

Let's fit a multinomial regression via the [nnet](https://cran.r-project.org/web/packages/nnet/nnet.pdf) package. This model has one tuning hyperparameter, `penalty`, which describes the amount of regularization. Regularization helps the model generalize well by discouraging it from *overfitting* the training data. We can of course tune this parameter, as we will later on in this lesson, but for now, let's choose an arbitrary value of `1`.

```{r mr_spec}
# Specify a multinomial regression via nnet
multireg_spec <- multinom_reg(penalty = 1) %>% 
  set_engine("nnet") %>% 
  set_mode("classification")

# Train a multinomial regression model without any preprocessing
set.seed(2056)
multireg_fit <- multireg_spec %>% 
  fit(species ~ ., data = penguins_train)

# Print the model
multireg_fit


```
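Under the hood, a multinomial regression model computes one linear predictor per class and converts them to class probabilities with the softmax function. Here's a minimal base-R sketch with hypothetical linear predictors for a single penguin:

```{r softmax_sketch}
# Hypothetical linear predictors (one per class) for a single penguin
eta <- c(Adelie = 2.1, Chinstrap = 0.4, Gentoo = -1.3)

# Softmax maps the linear predictors to probabilities that sum to 1
softmax <- function(z) exp(z) / sum(exp(z))

probs <- softmax(eta)
probs

# The predicted class is the one with the largest probability
names(which.max(probs))
```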

Now we can use the trained model to predict the labels for the test features, and evaluate performance:

```{r penguins_eval}
# Make predictions for the test set
penguins_results <- penguins_test %>% select(species) %>% 
  bind_cols(multireg_fit %>% 
              predict(new_data = penguins_test)) %>% 
  bind_cols(multireg_fit %>% 
              predict(new_data = penguins_test, type = "prob"))

# Print predictions
penguins_results %>% 
  slice_head(n = 5)

```

Now, let's look at the confusion matrix for our model

```{r mr_conf}
# Confusion matrix
penguins_results %>% 
  conf_mat(species, .pred_class)


```

The confusion matrix shows the intersection of predicted and actual label values for each class - in simple terms, the diagonal intersections from top-left to bottom-right indicate the number of correct predictions.
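In essence, a confusion matrix is just a cross-tabulation of predicted against actual classes, which base R can produce with `table()`. A tiny sketch with made-up labels:

```{r conf_mat_sketch}
# Hypothetical truth/prediction vectors for a 3-class problem
truth     <- factor(c("Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo", "Gentoo"))
predicted <- factor(c("Adelie", "Chinstrap", "Gentoo", "Chinstrap", "Gentoo", "Gentoo"))

# Cross-tabulate predictions against the truth
cm <- table(Prediction = predicted, Truth = truth)
cm

# Overall accuracy = sum of the diagonal / total cases
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```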

When dealing with multiple classes, it's generally more intuitive to visualize this as a heat map, like this:

```{r mr_conf_viz}
update_geom_defaults(geom = "tile", new = list(color = "black", alpha = 0.7))
# Visualize confusion matrix
penguins_results %>% 
  conf_mat(species, .pred_class) %>% 
  autoplot(type = "heatmap")
```

The darker squares in the confusion matrix plot indicate high numbers of cases, and you can hopefully see a diagonal line of darker squares indicating cases where the predicted and actual label are the same.

Let's now calculate summary statistics for the confusion matrix.

```{r penguins_summ}
# Statistical summaries for the confusion matrix
conf_mat(data = penguins_results, truth = species, estimate = .pred_class) %>% 
  summary()

```

The tibble shows the overall metrics of how well the model performs across all three classes.

Let's evaluate the ROC metrics. In the case of a multiclass classification model, a single ROC curve showing true positive rate vs false positive rate is not possible. However, you can use the rates for each class in a One vs Rest (OVR) comparison to create a ROC chart for each class.

```{r penguins_roc}
# Make a ROC_CURVE
penguins_results %>% 
  roc_curve(species, c(.pred_Adelie, .pred_Chinstrap, .pred_Gentoo)) %>% 
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = .level)) +
  geom_abline(lty = 2, color = "gray80", size = 0.9) +
  geom_path(show.legend = T, alpha = 0.6, size = 1.2) +
  coord_equal()

```

To quantify the ROC performance, you can calculate an aggregate area under the curve score that is averaged across all of the OVR curves.
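Macro-averaging, one common way to aggregate, is simply the mean of the per-class one-vs-rest AUCs (hypothetical values in the sketch below); note that yardstick also offers other multiclass estimators, such as Hand-Till.

```{r macro_auc_sketch}
# Hypothetical one-vs-rest AUCs for each class
ovr_aucs <- c(Adelie = 0.99, Chinstrap = 0.97, Gentoo = 1.00)

# The macro-averaged AUC is just their mean
macro_auc <- mean(ovr_aucs)
macro_auc
```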

```{r penguins_AOC}
# Calculate ROC_AOC
penguins_results %>% 
  roc_auc(species, c(.pred_Adelie, .pred_Chinstrap, .pred_Gentoo))

```

That went down well! The model did a great job of classifying the penguins. But what kind of adventure would it be if we didn't preprocess the data?

#### Workflows + A different algorithm

Again, just like with binary classification, you can use a workflow to apply preprocessing steps to the data before fitting it to an algorithm to train a model. Let's scale the numeric features in a transformation step before training, try a different algorithm (a support vector machine) and tune some model hyperparameters, just to show that we can!

`Support Vector Machines` try to find a *hyperplane* in some feature space that "best" separates the classes. Please see:

-   [*Support Vector Machines*](https://bradleyboehmke.github.io/HOML/svm.html), Hands-on Machine Learning with R

-   *Support Vector Machines*, [An Introduction to Statistical Learning with Applications in R](https://www.statlearning.com/)

-   [Support Vector Machines under the hood](https://towardsdatascience.com/support-vector-machines-under-the-hood-c609e57a4b09)

-   [SVM kernels](https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html), Scikit learn

We'll fit a *radial basis function* support vector machine to these data and tune the SVM *cost* parameter and the *σ* parameter in the kernel function (the margin parameter does not apply to classification models):

-   A cost argument allows us to specify the cost of a violation to the margin. When the cost argument is small, then the margins will be wide and many support vectors will be on the margin or will violate the margin. This *could* make the model more robust and lead to better classification. When the cost argument is large, then the margins will be narrow and there will be few support vectors on the margin or violating the margin.

-   As `σ` increases, the kernel becomes more local, so the fit becomes more non-linear and the model more flexible. (With kernlab's parameterization, K(x, x′) = exp(−σ‖x − x′‖²), larger σ means similarity decays faster with distance.)

Both parameters can have a profound effect on the model complexity and performance.

> The radial basis kernel is extremely flexible and as a rule of thumb, we generally start with this kernel when fitting SVMs in practice.
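For intuition, here's a small base-R sketch of the radial basis kernel under kernlab's parameterization, K(x, x′) = exp(−σ‖x − x′‖²), showing how `σ` controls how quickly similarity decays with distance (the points and values are made up):

```{r rbf_kernel_sketch}
# Radial basis kernel, kernlab-style: exp(-sigma * squared distance)
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

x <- c(1, 2)
y <- c(2, 4)

# Small sigma: distant points still look similar (smoother boundary)
rbf_kernel(x, y, sigma = 0.01)

# Large sigma: similarity drops off quickly (more flexible, local boundary)
rbf_kernel(x, y, sigma = 10)
```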

Parameters are marked for tuning by assigning them a value of `tune()`. Also, let's try out a new, succinct way of creating workflows that minimizes piping steps, as suggested by [David Robinson's blog post](http://varianceexplained.org/r/sliced-ml/) (winner of [Sliced](https://mobile.twitter.com/meganrisdal/status/1428039365008060424)!).

```{r tune_svm}
# Create a model specification
svm_spec <- svm_rbf(mode = "classification", engine = "kernlab",
            cost = tune(), rbf_sigma = tune())


# Create a workflow that encapsulates a recipe and a model
svm_wflow <- recipe(species ~ ., data = penguins_train) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  workflow(svm_spec)

# Print out workflow
svm_wflow
```

Pretty neat, right ✨?

Now that we have specified which parameters to tune, we'll need to figure out a set of possible values to try out and then choose the best.

To do this, we'll create a grid! In this example, we'll work through a regular grid of hyperparameter values, try them out, and see which pair results in the best model performance.

```{r}
set.seed(2056)
# Create a regular grid of 6 values for each tuning parameter
svm_grid <- grid_regular(parameters(svm_spec), levels = 6)

# Print out some parameters in our grid
svm_grid %>% 
  slice_head(n = 10)

```
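Under the hood, a regular grid is just the Cartesian product of the candidate values for each parameter, much like base R's `expand.grid()`. A minimal sketch with made-up candidate values:

```{r manual_grid_sketch}
# Hypothetical candidate values for each hyperparameter
candidate_cost  <- c(0.001, 1, 1000)
candidate_sigma <- c(1e-5, 1e-2, 1)

# A regular grid tries every combination of the candidates
manual_grid <- expand.grid(cost = candidate_cost, rbf_sigma = candidate_sigma)
manual_grid

# 3 levels x 3 levels = 9 candidate pairs
nrow(manual_grid)
```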

Awesome! One thing about hyperparameters is that they are not learned directly from the training set. Instead, they are estimated using simulated data sets created through a process called *resampling*. Earlier in this module, we used the cross-validation resampling method. Let's try out another resampling technique: *bootstrap resampling*.

Bootstrap resampling means drawing with *replacement* from our original dataset, fitting a model on that new set (which *contains some duplicates*), and evaluating the model on the *data points that were not included*.

Then we do that again (the default behaviour is 25 bootstraps, but this can be changed). Okay, let's create some simulated data sets.
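For intuition, here's a minimal base-R sketch of a single bootstrap draw: sample row indices with replacement, and treat the rows that were never drawn (the "out-of-bag" points) as the assessment set.

```{r bootstrap_draw_sketch}
set.seed(2056)
# Draw 10 row indices from a 10-row dataset, WITH replacement
n <- 10
boot_idx <- sample(seq_len(n), size = n, replace = TRUE)
boot_idx  # some indices typically appear more than once

# The rows never drawn are the out-of-bag assessment points
oob_idx <- setdiff(seq_len(n), boot_idx)
oob_idx
```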

```{r bootstraps}
set.seed(2056)
# Bootstrap resampling
penguins_bs <- bootstraps(penguins_train, times = 10)

penguins_bs

```

#### Model tuning via grid search

We are ready to tune! Let's use [`tune_grid()`](https://tune.tidymodels.org/reference/tune_grid.html) to fit models at all the different values we chose for each tuned hyperparameter.

```{r grid_search}
doParallel::registerDoParallel()

# Model tuning via a grid search
set.seed(2056)
svm_res <- tune_grid(
  object = svm_wflow,
  resamples = penguins_bs,
  grid = svm_grid
)
```

Now that we have our tuning results, we can extract the performance metrics using `collect_metrics()`:

```{r collect_metrics}
# Obtain performance metrics
svm_res %>% 
  collect_metrics() %>% 
  slice_head(n = 7)

```

🤩🤩 Let's see if we can learn more by visualizing the results:

```{r}
# Visualize tuning metrics
svm_res %>% 
  collect_metrics() %>% 
  mutate(rbf_sigma = factor(rbf_sigma)) %>% 
  ggplot(mapping = aes(x = cost, y = mean, color = rbf_sigma)) +
  geom_line(size = 1.5, alpha = 0.7) +
  geom_point(size = 2) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10(labels = scales::label_number()) +
  scale_color_viridis_d(option = "viridis", begin = .1)

```

It seems that an SVM with an rbf_sigma of 1 or 0.01 performed well across all candidate values of cost. The [`show_best()`](https://tune.tidymodels.org/reference/show_best.html) function can help us make a clearer distinction:

```{r show_best}
# Show best submodel 
svm_res %>% 
  show_best("accuracy")


```

Much better! Let's now use the [`select_best()`](https://tune.tidymodels.org/reference/show_best.html) function to pull out the single set of hyperparameter values in the best sub-model:

```{r select_best}
# Select best model hyperparameters
best_svm <- svm_res %>% 
  select_best("accuracy")

best_svm

```

Perfect! These are the values for `rbf_sigma` and `cost` that maximize accuracy for our penguins!

We can now *finalize* our workflow so that the parameters we had marked for tuning with `tune()` get updated with the values from `best_svm`.

```{r}
# Finalize workflow
final_wflow <- svm_wflow %>% 
  finalize_workflow(best_svm)

final_wflow
  
```

That marks the end of tuning 💃!

#### The last fit

Finally, let's fit this final model to the training data and evaluate how it would perform on new data using our test data. We can use the function [`last_fit()`](https://tune.tidymodels.org/reference/last_fit.html) with our finalized model; this function *fits* the finalized model on the full training data set and *evaluates* it on the testing data.

```{r last_fit}
# The last fit
final_fit <- last_fit(object = final_wflow, split = penguins_split)

# Collect metrics
final_fit %>% 
  collect_metrics()
```

Much better 🤩! You can of course go ahead and obtain the hard class and probability predictions using `collect_predictions()`, and you will be well on your way to computing the confusion matrix and other summaries that come with it.

```{r collect_predictions}
# Collect predictions and make confusion matrix
final_fit %>% 
  collect_predictions() %>% 
  conf_mat(truth = species, estimate = .pred_class)
```

#### Use the model with new data observations

Now let's save our trained model so we can use it again later. Begin by extracting the *trained workflow* object from the `final_fit` object.

```{r save_wf}
# Extract trained workflow
penguins_svm_model <- final_fit %>% 
  extract_workflow()

# Save workflow
saveRDS(penguins_svm_model, "penguins_svm_model.rds")
```

OK, so now we have a trained model. Let's use it to predict the class of a new penguin observation:

```{r load_model}
# Load model
loaded_psvm_model <- readRDS("penguins_svm_model.rds") 

# Create new tibble of observations
new_obs <- tibble(
  bill_length_mm = c(49.5, 38.2),
  bill_depth_mm = c(18.4, 20.1),
  flipper_length_mm = c(195, 190),
  body_mass_g = c(3600, 3900))

# Make predictions
new_results <- new_obs %>% 
  bind_cols(loaded_psvm_model %>% 
              predict(new_data = new_obs))

# Show predictions
new_results
```

Good job! A working model 🐧🐧!

## 7. Summary

We need to *chill*, right? 😅 We hope you had a flippin' good time!

In this module, you learned how classification can be used to create a machine learning model that predicts categories, or *classes*. You then used the amazing **Tidymodels** framework in `R` to train and evaluate classification models with different algorithms, do some data preprocessing, tune hyperparameters, and make better predictions.

While `Tidymodels` and `scikit-learn` (Python) are popular frameworks for writing code to train classification models, you can also create machine learning solutions for classification using the graphical tools in Microsoft Azure Machine Learning. You can learn more about no-code development of classification models using Azure Machine Learning in the [Create a classification model with Azure Machine Learning designer](https://docs.microsoft.com/en-us/learn/modules/create-classification-model-azure-machine-learning-designer/) module.

#### **Challenge: Wine Classification**

Feel like challenging yourself to train a classification model? Try the challenge in the [/challenges/03 - Wine Classification Challenge.ipynb](https://github.com/MicrosoftDocs/ml-basics/blob/master/challenges/03%20-%20Wine%20Classification%20Challenge.ipynb) notebook to see if you can classify wines into their grape varietals! Find the data [here](https://github.com/MicrosoftDocs/ml-basics/tree/master/challenges/data).

#### THANK YOU TO:

`Allison Horst` for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://github.com/allisonhorst/stats-illustrations).

`Bethany`, *Gold Microsoft Learn Student Ambassador*, for her valuable feedback and suggestions.

#### FURTHER READING

-   Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. [*An introduction to statistical learning : with applications in R*](https://www.statlearning.com/)*.* Corresponding Tidymodels labs by Emil Hvitfeldt can be found [*here*](https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/index.html)*.*

-   Max Kuhn and Julia Silge, [*Tidy Modeling with R*](https://www.tmwr.org/)*.*

-   Kuhn, M, and K Johnson. 2013. *Applied Predictive Modeling*. Springer.

-   Bradley Boehmke & Brandon Greenwell, [*Hands-On Machine Learning with R*](https://bradleyboehmke.github.io/HOML/)*.*

-   Jorge Cimentada, [*Machine Learning for Social Scientists*](https://cimentadaj.github.io/ml_socsci/)*.*

-   Tidymodels [reference website](https://www.tidymodels.org/start/).

-   H. Wickham and G. Grolemund, [*R for Data Science: Visualize, Model, Transform, Tidy, and Import Data*](https://r4ds.had.co.nz/).

Happy leaRning,

[Eric (R_ic)](https://twitter.com/ericntay), *Gold Microsoft Learn Student Ambassador*.

