• Introduction
  • Machine Learning with tidymodels
  • Classification Models
  • Feature Engineering
  • Workflows and Hyperparameter Tuning
    • Machine learning workflows
    • Estimating performance with cross validation
    • Hyperparameter tuning
    • Selecting the best model
      • Finalizing a workflow
      • Training a finalized workflow
    • Congratulations

Introduction

Course notes from the Modeling with tidymodels in R course on DataCamp taught by David Svancer

Course Description

Tidymodels is a powerful suite of R packages designed to streamline machine learning workflows. Learn to split datasets for cross-validation, preprocess data with tidymodels’ recipe package, and fine-tune machine learning algorithms. You’ll learn key concepts such as defining model objects and creating modeling workflows. Then, you’ll apply your skills to predict home prices and classify employees by their risk of leaving a company.

What is Covered

  1. Machine Learning with tidymodels
  2. Classification Models
  3. Feature Engineering
  4. Workflows and Hyperparameter Tuning

Libraries and Data

library(tidyverse)
library(knitr)
library(tidymodels)

home_sales <- read_rds("data/home_sales.rds")
telecom_df <- read_rds("data/telecom_df.rds")
loans_df <- read_rds("data/loan_df.rds")

Machine Learning with tidymodels

In this chapter, you’ll explore the rich ecosystem of R packages that power tidymodels and learn how they can streamline your machine learning workflows. You’ll then put your tidymodels skills to the test by predicting house sale prices in Seattle, Washington.

The tidymodels ecosystem

Creating training and test datasets

The rsample package is designed to create training and test datasets. Creating a test dataset is important for estimating how a trained model will likely perform on new data. It also guards against overfitting, where a model memorizes patterns that exist only in the training data and performs poorly on new data.

In this exercise, you will create training and test datasets from the home_sales data. This data contains information on homes sold in the Seattle, Washington area between 2015 and 2016.

The outcome variable in this data is selling_price.

The tidymodels package will be pre-loaded in every exercise in the course. The home_sales tibble has also been loaded for you.

# Create a data split object
home_split <- initial_split(home_sales, 
                            prop = 0.7, 
                            strata = selling_price)

# Create the training data
home_training <- home_split %>%
 training()

# Create the test data
home_test <- home_split %>% 
  testing()
# Check number of rows in each dataset
nrow(home_training)
## [1] 1042
nrow(home_test)
## [1] 450

Great job! Since the home_sales data has nearly 1,500 rows, it is appropriate to allocate a larger share of rows, here 30%, to the test set. This will provide more data for the model evaluation step.
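
Note that initial_split() assigns rows at random, so the exact rows in each set will differ from run to run. If you are following along locally, you would typically fix the random seed before splitting so the result is reproducible. A minimal sketch (the seed value is arbitrary):

# Make the random split reproducible (seed value is arbitrary)
set.seed(1234)
home_split <- initial_split(home_sales, 
                            prop = 0.7, 
                            strata = selling_price)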

Distribution of outcome variable values

Stratifying by the outcome variable when generating training and test datasets ensures that the outcome variable values have a similar range in both datasets.

Since the original data is split at random, stratification avoids placing all the expensive homes in home_sales into the test dataset, for example. In this case, your model would most likely perform poorly because it was trained on less expensive homes.

In this exercise, you will calculate summary statistics for the selling_price variable in the training and test datasets. The home_training and home_test tibbles have been loaded from the previous exercise.

# Distribution of selling_price in training data
home_training %>% 
  summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))
## # A tibble: 1 x 4
##   min_sell_price max_sell_price mean_sell_price sd_sell_price
##            <dbl>          <dbl>           <dbl>         <dbl>
## 1         350000         650000         479780.        81542.
# Distribution of selling_price in test data
home_test %>% 
  summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))
## # A tibble: 1 x 4
##   min_sell_price max_sell_price mean_sell_price sd_sell_price
##            <dbl>          <dbl>           <dbl>         <dbl>
## 1         350000         650000         477475.        79725.

Excellent work! The minimum and maximum selling prices in both datasets are the same. The mean and standard deviation are also similar. Stratifying by the outcome variable ensures the model fitting process is performed on a representative sample of the original data.

Linear regression with tidymodels

Fitting a linear regression model

The parsnip package provides a unified syntax for the model fitting process in R.

With parsnip, it is easy to define models using the various packages, or engines, that exist in the R ecosystem.

In this exercise, you will define a parsnip linear regression object and train your model to predict selling_price using home_age and sqft_living as predictor variables from the home_sales data.

The home_training and home_test tibbles that you created in the previous lesson have been loaded into this session.

# Initialize a linear regression object, linear_model
linear_model <- linear_reg() %>% 
  # Set the model engine
  set_engine('lm') %>% 
  # Set the model mode
  set_mode('regression')

# Train the model with the training data
lm_fit <- linear_model %>% 
  fit(selling_price ~ home_age + sqft_living,
      data = home_training)

# Print lm_fit to view model information
lm_fit
## parsnip model object
## 
## Fit time:  0ms 
## 
## Call:
## stats::lm(formula = selling_price ~ home_age + sqft_living, data = data)
## 
## Coefficients:
## (Intercept)     home_age  sqft_living  
##    293527.1      -1658.4        103.4

Excellent work! You have defined your model with linear_reg() and trained it to predict selling_price using home_age and sqft_living. Printing a parsnip model fit object displays useful model information, such as the training time, model formula used during training, and the estimated model parameters.

Exploring estimated model parameters

In the previous exercise, you trained a linear regression model to predict selling_price using home_age and sqft_living as predictor variables.

Pass your trained model object into the appropriate function to explore the estimated model parameters and select the true statement.

Your trained model, lm_fit, has been loaded into your session.

tidy(lm_fit)
## # A tibble: 3 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  293527.   7400.       39.7  3.18e-210
## 2 home_age      -1658.    175.       -9.49 1.46e- 20
## 3 sqft_living     103.      2.68     38.6  9.26e-203

The true statement is: the estimated parameter for the sqft_living predictor variable is 103.417. The remaining statements are false, as the table above shows.

Great job! The tidy() function automatically creates a tibble of estimated model parameters. Since sqft_living has a positive estimated parameter, the selling price of homes increases with the square footage. Conversely, since home_age has a negative estimated parameter, older homes tend to have lower selling prices.
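
As a quick sanity check (not part of the course exercise), you can reproduce a single prediction by hand from the rounded coefficients above. Using the first home in the test set shown later (home_age = 10, sqft_living = 2540), the result lands close to the model's .pred value; the small difference comes from rounding the printed coefficients.

# Reproduce one prediction from the rounded coefficients
intercept  <- 293527.1
b_home_age <- -1658.4
b_sqft     <- 103.4

intercept + b_home_age * 10 + b_sqft * 2540
## [1] 539579.1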

Predicting home selling prices

After fitting a model using the training data, the next step is to use it to make predictions on the test dataset. The test dataset acts as a new source of data for the model and will allow you to evaluate how well it performs.

Before you can evaluate model performance, you must add your predictions to the test dataset.

In this exercise, you will use your trained model, lm_fit, to predict selling_price in the home_test dataset.

Your trained model, lm_fit, as well as the test dataset, home_test have been loaded into your session.

# Predict selling_price
home_predictions <- predict(lm_fit,
                        new_data = home_test)

# View predicted selling prices
home_predictions
## # A tibble: 450 x 1
##      .pred
##      <dbl>
##  1 539624.
##  2 380538.
##  3 633342.
##  4 401613.
##  5 509865.
##  6 484170.
##  7 448133.
##  8 531387.
##  9 535915.
## 10 500466.
## # ... with 440 more rows
# Combine test data with predictions
home_test_results <- home_test %>% 
  select(selling_price, home_age, sqft_living) %>% 
  bind_cols(home_predictions)

# View results
home_test_results
## # A tibble: 450 x 4
##    selling_price home_age sqft_living   .pred
##            <dbl>    <dbl>       <dbl>   <dbl>
##  1        487000       10        2540 539624.
##  2        411000       18        1130 380538.
##  3        635000        4        3350 633342.
##  4        356000       24        1430 401613.
##  5        495000        3        2140 509865.
##  6        525000       16        2100 484170.
##  7        552321       29        1960 448133.
##  8        475000        0        2300 531387.
##  9        485000        6        2440 535915.
## 10        525000       28        2450 500466.
## # ... with 440 more rows

Congratulations! You have trained a linear regression model and used it to predict the selling prices of homes in the test dataset! The model only used two predictor variables, but the predicted values in the .pred column seem reasonable!

Evaluating model performance

Model performance metrics

Evaluating model results is an important step in the modeling process. Model evaluation should be done on the test dataset in order to see how well a model will generalize to new datasets.

In the previous exercise, you trained a linear regression model to predict selling_price using home_age and sqft_living as predictor variables. You then created the home_test_results tibble using your trained model on the home_test data.

In this exercise, you will calculate the RMSE and R squared metrics using your results in home_test_results.

The home_test_results tibble has been loaded into your session.

# Print home_test_results
home_test_results
## # A tibble: 450 x 4
##    selling_price home_age sqft_living   .pred
##            <dbl>    <dbl>       <dbl>   <dbl>
##  1        487000       10        2540 539624.
##  2        411000       18        1130 380538.
##  3        635000        4        3350 633342.
##  4        356000       24        1430 401613.
##  5        495000        3        2140 509865.
##  6        525000       16        2100 484170.
##  7        552321       29        1960 448133.
##  8        475000        0        2300 531387.
##  9        485000        6        2440 535915.
## 10        525000       28        2450 500466.
## # ... with 440 more rows
# Calculate the RMSE metric
rmse <- home_test_results %>% 
  rmse(truth = selling_price, estimate = .pred)
rmse
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      48505.
# Calculate the R squared metric
rsq <- home_test_results %>% 
  rsq(truth = selling_price, estimate = .pred)
rsq
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.630

Great job! The RMSE metric indicates that the average prediction error for home selling prices is about $48,000. Not bad considering you only used home_age and sqft_living as predictor variables!
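
If you want to see what these yardstick functions compute under the hood, here is a minimal sketch using dplyr on the home_test_results columns shown above. Note that yardstick's rsq() is the squared correlation between the truth and the estimate (rsq_trad() implements the traditional sum-of-squares formula).

# RMSE by hand: square root of the mean squared prediction error
home_test_results %>% 
  summarize(rmse_manual = sqrt(mean((selling_price - .pred)^2)))  # ~ 48505

# R squared by hand: squared correlation between truth and prediction
home_test_results %>% 
  summarize(rsq_manual = cor(selling_price, .pred)^2)  # ~ 0.630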

R squared plot

In the previous exercise, you got an R squared value of 0.630. The R squared metric ranges from 0 to 1, with 0 being the worst and 1 the best.

Calculating the R squared value is only the first step in studying your model’s predictions.

Making an R squared plot is extremely important because it will uncover potential problems with your model, such as non-linear patterns or regions where your model is either over or under-predicting the outcome variable.

In this exercise, you will create an R squared plot of your model’s performance.

The home_test_results tibble has been loaded into your session.

# Create an R squared plot of model performance
ggplot(home_test_results, aes(x = selling_price, y = .pred)) +
  geom_point(alpha = 0.5) + 
  geom_abline(color = 'blue', linetype = 2) +
  coord_obs_pred() +
  labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')

Good work! From the plot, you can see that your model tends to over-predict selling prices for homes that sold for less than $400,000, and under-predict for homes that sold for $600,000 or more. This indicates that you will have to add more predictors to your model or that linear regression may not be able to model the relationship as well as more advanced modeling techniques!

Complete model fitting process with last_fit()

In this exercise, you will train and evaluate the performance of a linear regression model that predicts selling_price using all the predictors available in the home_sales tibble.

This exercise will give you a chance to perform the entire model fitting process with tidymodels, from defining your model object to evaluating its performance on the test data.

Earlier in the chapter, you created an rsample object called home_split by passing the home_sales tibble into initial_split(). The home_split object contains the instructions for randomly splitting home_sales into training and test sets.

The home_sales tibble, and home_split object have been loaded into this session.

# Define a linear regression model
linear_model <- linear_reg() %>% 
  set_engine('lm') %>% 
  set_mode('regression')

# Train linear_model with last_fit()
linear_fit <- linear_model %>% 
  last_fit(selling_price ~ ., split = home_split)

# Collect predictions and view results
predictions_df <- linear_fit %>% collect_predictions()
predictions_df
## # A tibble: 450 x 5
##    id                 .pred  .row selling_price .config             
##    <chr>              <dbl> <int>         <dbl> <chr>               
##  1 train/test split 529274.     1        487000 Preprocessor1_Model1
##  2 train/test split 398602.     3        411000 Preprocessor1_Model1
##  3 train/test split 693173.     4        635000 Preprocessor1_Model1
##  4 train/test split 435176.    11        356000 Preprocessor1_Model1
##  5 train/test split 479692.    12        495000 Preprocessor1_Model1
##  6 train/test split 503484.    13        525000 Preprocessor1_Model1
##  7 train/test split 468670.    15        552321 Preprocessor1_Model1
##  8 train/test split 465154.    16        475000 Preprocessor1_Model1
##  9 train/test split 564799.    17        485000 Preprocessor1_Model1
## 10 train/test split 457006.    19        525000 Preprocessor1_Model1
## # ... with 440 more rows
# Make an R squared plot using predictions_df
ggplot(predictions_df, aes(x = selling_price, y = .pred)) + 
  geom_point(alpha = 0.5) + 
  geom_abline(color = 'blue', linetype = 2) +
  coord_obs_pred() +
  labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')

Great work! You have created your first machine learning pipeline and visualized the performance of your model. From the R squared plot, the model still tends to over-predict selling prices for homes that sold for less than $400,000 and under-predict for homes at $600,000 or more, but it is a slight improvement over your previous model with only two predictor variables.

Classification Models

Learn how to predict categorical outcomes by training classification models. Using the skills you’ve gained so far, you’ll predict the likelihood of customers canceling their service with a telecommunications company.

Classification models

Data resampling

The first step in a machine learning project is to create training and test datasets for model fitting and evaluation. The test dataset provides an estimate of how your model will perform on new data and helps to guard against overfitting.

You will be working with the telecom_df dataset which contains information on customers of a telecommunications company. The outcome variable is canceled_service and it records whether a customer canceled their contract with the company. The predictor variables contain information about customers’ cell phone and internet usage as well as their contract type and monthly charges.

The telecom_df tibble has been loaded into your session.

# Create data split object
telecom_split <- initial_split(telecom_df, prop = 0.75,
                               strata = canceled_service)

# Create the training data
telecom_training <- telecom_split %>% 
  training()

# Create the test data
telecom_test <- telecom_split %>%
  testing()
 
# Check the number of rows
nrow(telecom_training)
## [1] 731
nrow(telecom_test)
## [1] 244

Good job! You have 731 rows in your training data and 244 rows in your test dataset. Now you can begin the model fitting process using telecom_training.

Fitting a logistic regression model

In addition to regression models, the parsnip package also provides a general interface to classification models in R.

In this exercise, you will define a parsnip logistic regression object and train your model to predict canceled_service using avg_call_mins, avg_intl_mins, and monthly_charges as predictor variables from the telecom_df data.

The telecom_training and telecom_test tibbles that you created in the previous lesson have been loaded into this session.

# Specify a logistic regression model
logistic_model <- logistic_reg() %>%
  # Set the engine
  set_engine("glm") %>% 
  # Set the mode
  set_mode("classification")

# Print the model specification
logistic_model
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
# Fit to training data
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges, 
      data = telecom_training )

# Print model fit object
logistic_fit 
## parsnip model object
## 
## Fit time:  10ms 
## 
## Call:  stats::glm(formula = canceled_service ~ avg_call_mins + avg_intl_mins + 
##     monthly_charges, family = stats::binomial, data = data)
## 
## Coefficients:
##     (Intercept)    avg_call_mins    avg_intl_mins  monthly_charges  
##        1.725750        -0.010457         0.023631         0.002039  
## 
## Degrees of Freedom: 730 Total (i.e. Null);  727 Residual
## Null Deviance:       932.4 
## Residual Deviance: 801.3     AIC: 809.3

Great job! You have defined your model with logistic_reg() and trained it to predict canceled_service using avg_call_mins, avg_intl_mins, and monthly_charges. Printing a parsnip model specification object displays useful model information, such as the model type, computational engine, and mode. Printing a model fit object will display the estimated model coefficients.
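
To see how those coefficients map to predicted probabilities, here is a rough sketch. With the glm engine, the linear predictor is on the log-odds scale; assuming the factor levels of canceled_service are ordered yes, no (as the .pred_yes and .pred_no column order in the next exercise suggests), stats::glm models the log-odds of the second level (no), so the estimated probability of canceling is one minus the inverse logit. The customer values below are made up for illustration.

# Hypothetical customer: 300 avg call mins, 100 avg intl mins, $70 monthly charges
eta <- 1.725750 + (-0.010457 * 300) + (0.023631 * 100) + (0.002039 * 70)

p_no  <- 1 / (1 + exp(-eta))  # probability of the modeled (second) level, "no"
p_yes <- 1 - p_no             # estimated probability of canceling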

Combining test dataset results

Evaluating your model’s performance on the test dataset gives insights into how well your model predicts on new data sources. These insights will help you communicate your model’s value in solving problems or improving decision making.

Before you can calculate classification metrics such as sensitivity or specificity, you must create a results tibble with the required columns for yardstick metric functions.

In this exercise, you will use your trained model to predict the outcome variable in the telecom_test dataset and combine it with the true outcome values in the canceled_service column.

Your trained model, logistic_fit, and test dataset, telecom_test, have been loaded from the previous exercise.

# Predict outcome categories
class_preds <- predict(logistic_fit, new_data = telecom_test,
                       type = 'class')

# Obtain estimated probabilities for each outcome value
prob_preds <- predict(logistic_fit, new_data = telecom_test, 
                      type = 'prob')

# Combine test set results
telecom_results <- telecom_test %>% 
  select(canceled_service) %>% 
  bind_cols(class_preds, prob_preds)

# View results tibble
telecom_results
## # A tibble: 244 x 4
##    canceled_service .pred_class .pred_yes .pred_no
##    <fct>            <fct>           <dbl>    <dbl>
##  1 yes              no             0.381     0.619
##  2 no               no             0.209     0.791
##  3 no               no             0.230     0.770
##  4 no               no             0.168     0.832
##  5 no               no             0.422     0.578
##  6 no               yes            0.629     0.371
##  7 no               no             0.107     0.893
##  8 yes              yes            0.541     0.459
##  9 no               no             0.0192    0.981
## 10 no               no             0.0939    0.906
## # ... with 234 more rows

Good job! You have created a tibble of model results using the test dataset. Your results tibble contains all the necessary columns for calculating classification metrics. Next, you’ll use this tibble and the yardstick package to evaluate your model’s performance.

Assessing model fit

Calculating metrics from the confusion matrix

The confusion matrix of a binary classification model lists the number of correct and incorrect predictions obtained on the test dataset and is useful for evaluating the performance of your model.

Suppose you have trained a classification model that predicts whether customers will cancel their service at a telecommunications company and obtained the following confusion matrix on your test dataset. Here yes represents the positive class, while no represents the negative class.

Choose the true statement from the options below.

That’s correct. Sensitivity is the proportion of actual positives that the model predicted correctly. Your model was able to correctly classify 75% of all customers who actually canceled their service.

Evaluating performance with yardstick

In the previous exercise, you calculated classification metrics from a sample confusion matrix. The yardstick package was designed to automate this process.

For classification models, yardstick functions require a tibble of model results as the first argument. This should include the actual outcome values, predicted outcome values, and estimated probabilities for each value of the outcome variable.

In this exercise, you will use the results from your logistic regression model, telecom_results, to calculate performance metrics.

The telecom_results tibble has been loaded into your session.

# Calculate the confusion matrix
conf_mat(telecom_results, truth = canceled_service,
    estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  38  32
##        no   44 130
# Calculate the accuracy
acc <- accuracy(telecom_results,
                    truth = canceled_service,
                    estimate = .pred_class)
acc
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.689
# Calculate the sensitivity
sens_cl <- sensitivity(telecom_results,
                    truth = canceled_service,
                    estimate = .pred_class)
sens_cl
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    binary         0.463
# Calculate the specificity
spec_cl <- specificity(telecom_results,
                        truth = canceled_service,
                        estimate = .pred_class)
spec_cl
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 spec    binary         0.802

Excellent work! The specificity of your logistic regression model is 0.802, which is more than double the sensitivity of 0.463. This indicates that your model is much better at detecting customers who will not cancel their telecommunications service versus the ones who will.
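
You can verify these metrics directly from the confusion matrix printed above, which has 38 true positives, 32 false positives, 44 false negatives, and 130 true negatives:

# Classification metrics by hand from the confusion matrix counts
tp <- 38; fp <- 32; fn <- 44; tn <- 130

(tp + tn) / (tp + tn + fp + fn)  # accuracy    ~ 0.689
tp / (tp + fn)                   # sensitivity ~ 0.463
tn / (tn + fp)                   # specificity ~ 0.802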

Creating custom metric sets

The yardstick package also provides the ability to create custom sets of model metrics. In cases where the cost of false negative errors differs from the cost of false positive errors, it may be important to examine a specific set of performance metrics.

Instead of calculating accuracy, sensitivity, and specificity separately, you can create your own metric function that calculates all three at the same time.

In this exercise, you will use the results from your logistic regression model, telecom_results, to calculate a custom set of performance metrics. You will also use a confusion matrix to calculate all available binary classification metrics in tidymodels all at once.

The telecom_results tibble has been loaded into your session.

# Create a custom metric function
telecom_metrics <- metric_set(accuracy, sens, spec)

# Calculate metrics using model results tibble
telecom_metrics(telecom_results, truth = canceled_service,
                estimate = .pred_class)
## # A tibble: 3 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.689
## 2 sens     binary         0.463
## 3 spec     binary         0.802
# Create a confusion matrix
conf_mat(telecom_results,
                      truth = canceled_service,
                      estimate = .pred_class) %>% 
  # Pass to the summary() function
  summary()
## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.689
##  2 kap                  binary         0.276
##  3 sens                 binary         0.463
##  4 spec                 binary         0.802
##  5 ppv                  binary         0.543
##  6 npv                  binary         0.747
##  7 mcc                  binary         0.278
##  8 j_index              binary         0.266
##  9 bal_accuracy         binary         0.633
## 10 detection_prevalence binary         0.287
## 11 precision            binary         0.543
## 12 recall               binary         0.463
## 13 f_meas               binary         0.5

Nice work! You created a custom metric function to calculate accuracy, sensitivity, and specificity. Oftentimes, you will be interested in tracking certain performance metrics for a given modeling problem, but passing a confusion matrix to the summary() function will calculate all available binary classification metrics in tidymodels at once!

Visualizing model performance

Plotting the confusion matrix

Calculating performance metrics with the yardstick package provides insight into how well a classification model is performing on the test dataset. Most yardstick functions return a single number that summarizes classification performance.

Many times, it is helpful to create visualizations of the confusion matrix to more easily communicate your results.

In this exercise, you will make a heat map and mosaic plot of the confusion matrix from your logistic regression model on the telecom_df dataset.

Your model results tibble, telecom_results, has been loaded into your session.

# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Create a heat map
  autoplot(type = "heatmap")

# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Create a mosaic plot
  autoplot(type = "mosaic")

Great job! The mosaic plot clearly shows that your logistic regression model performs much better in terms of specificity than sensitivity. You can see that in the yes column, a large proportion of outcomes were incorrectly predicted as no.

ROC curves and area under the ROC curve

ROC curves are used to visualize the performance of a classification model across a range of probability thresholds. An ROC curve with the majority of points near the upper left corner of the plot indicates that a classification model is able to correctly predict both the positive and negative outcomes across a wide range of probability thresholds.

The area under this curve provides a letter grade summary of model performance.

In this exercise, you will create an ROC curve from your logistic regression model results and calculate the area under the ROC curve with yardstick.

Your model results tibble, telecom_results has been loaded into your session.

# Calculate metrics across thresholds
threshold_df <- telecom_results %>% 
  roc_curve(truth = canceled_service, .pred_yes)

# View results
threshold_df
## # A tibble: 246 x 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1  -Inf          0                 1
##  2     0.0192     0                 1
##  3     0.0397     0.00617           1
##  4     0.0477     0.0123            1
##  5     0.0589     0.0185            1
##  6     0.0639     0.0247            1
##  7     0.0684     0.0309            1
##  8     0.0706     0.0370            1
##  9     0.0793     0.0432            1
## 10     0.0797     0.0494            1
## # ... with 236 more rows
# Plot ROC curve
threshold_df %>% 
  autoplot()

# Calculate ROC AUC
telecom_roc_auc <- telecom_results %>% 
  roc_auc(truth = canceled_service, .pred_yes)
telecom_roc_auc
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.739

Nice work! The area under the ROC curve is 0.739. This indicates that your model gets a C in terms of overall performance. This is mainly due to the low sensitivity of the model.
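
The letter-grade framing follows a rough school-style rubric for ROC AUC. The cutoffs in the sketch below are an assumption based on that analogy (they are consistent with 0.739 earning a C), not an official yardstick function.

# Rough AUC letter-grade rubric (assumed cutoffs, for intuition only)
auc_grade <- function(auc) {
  dplyr::case_when(
    auc >= 0.9 ~ "A",
    auc >= 0.8 ~ "B",
    auc >= 0.7 ~ "C",
    auc >= 0.6 ~ "D",
    TRUE       ~ "F"
  )
}

auc_grade(0.739)
## [1] "C"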

Automating the modeling workflow

Streamlining the modeling process

The last_fit() function is designed to streamline the modeling workflow in tidymodels. Instead of training your model on the training data and building a results tibble using the test data, last_fit() accomplishes this with one function.

In this exercise, you will train the same logistic regression model as you fit in the previous exercises, except with the last_fit() function.

Your data split object, telecom_split, and model specification, logistic_model, have been loaded into your session.

# Train model with last_fit()
telecom_last_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
           split = telecom_split)

# View test set metrics
mets_stream <- telecom_last_fit %>% 
  collect_metrics()
mets_stream
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.689 Preprocessor1_Model1
## 2 roc_auc  binary         0.739 Preprocessor1_Model1

Excellent work! Notice that you got the same area under the ROC curve as before, just with a lot less effort!

Collecting predictions and creating custom metrics

Using the last_fit() modeling workflow also saves time in collecting model predictions. Instead of manually creating a tibble of model results, there are helper functions that extract this information automatically.

In this exercise, you will use your trained model, telecom_last_fit, to create a tibble of model results on the test dataset as well as calculate custom performance metrics.

Your trained model, telecom_last_fit, has been loaded into this session.

# Collect predictions
last_fit_results <- telecom_last_fit %>% 
  collect_predictions()

# View results
last_fit_results
## # A tibble: 244 x 7
##    id         .pred_yes .pred_no  .row .pred_class canceled_service .config     
##    <chr>          <dbl>    <dbl> <int> <fct>       <fct>            <chr>       
##  1 train/tes~    0.381     0.619     2 no          yes              Preprocesso~
##  2 train/tes~    0.209     0.791     5 no          no               Preprocesso~
##  3 train/tes~    0.230     0.770     6 no          no               Preprocesso~
##  4 train/tes~    0.168     0.832    10 no          no               Preprocesso~
##  5 train/tes~    0.422     0.578    15 no          no               Preprocesso~
##  6 train/tes~    0.629     0.371    18 yes         no               Preprocesso~
##  7 train/tes~    0.107     0.893    28 no          no               Preprocesso~
##  8 train/tes~    0.541     0.459    31 yes         yes              Preprocesso~
##  9 train/tes~    0.0192    0.981    37 no          no               Preprocesso~
## 10 train/tes~    0.0939    0.906    40 no          no               Preprocesso~
## # ... with 234 more rows
# Custom metrics function
last_fit_metrics <- metric_set(accuracy, sens, spec, roc_auc)

# Calculate metrics
last_fit_metrics(last_fit_results,
                 truth = canceled_service,
                 estimate = .pred_class,
                 .pred_yes)
## # A tibble: 4 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.689
## 2 sens     binary         0.463
## 3 spec     binary         0.802
## 4 roc_auc  binary         0.739

Great job! You were able to train and evaluate your logistic regression model in half the time! Notice that all performance metrics match the results you obtained in previous exercises.

Complete modeling workflow

In this exercise, you will use the last_fit() function to train a logistic regression model and evaluate its performance on the test data by assessing the ROC curve and the area under the ROC curve.

Similar to previous exercises, you will predict canceled_service in the telecom_df data, but with an additional predictor variable to see if you can improve model performance.

The telecom_df tibble, telecom_split, and logistic_model objects from the previous exercises have been loaded into your workspace. The telecom_split object contains the instructions for randomly splitting the telecom_df tibble into training and test sets. The logistic_model object is a parsnip specification of a logistic regression model.

# Train a logistic regression model
logistic_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges + months_with_company, 
           split = telecom_split)

# Collect metrics
mets_comp <- logistic_fit %>% 
  collect_metrics()
mets_comp
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.746 Preprocessor1_Model1
## 2 roc_auc  binary         0.802 Preprocessor1_Model1
# Collect model predictions
logistic_fit %>% 
  collect_predictions() %>% 
  # Plot ROC curve
  roc_curve(truth = canceled_service, .pred_yes) %>% 
  autoplot()

Excellent work! The ROC curve shows that the logistic regression model performs better than a model that guesses at random (the dashed line in the plot). Adding the months_with_company predictor variable increased your area under the ROC curve from 0.739 in your previous model to 0.802!

Feature Engineering

Find out how to bake feature engineering pipelines with the recipes package. You’ll prepare numeric and categorical data to help machine learning algorithms optimize your predictions.

Feature engineering

Exploring recipe objects

The first step in feature engineering is to specify a recipe object with the recipe() function and add data preprocessing steps with one or more step_*() functions. Storing all of this information in a single recipe object makes it easier to manage complex feature engineering pipelines and transform new data sources.

Use the R console to explore a recipe object named telecom_rec, which was specified using the telecom_training data from the previous chapter and the code below.

telecom_rec <- recipe(canceled_service ~ ., data = telecom_df) %>% 
  step_log(avg_call_mins, base = 10)

Both telecom_training and telecom_rec have been loaded into your session.

How many numeric and nominal predictor variables are encoded in the telecom_rec object?

You got it! Based on the results from passing telecom_rec to the summary() function, you can see that 5 predictor variables were labeled as numeric and 3 as nominal by the recipe() function.
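
The call referenced above is simply summary() on the recipe object. Counting predictor types with dplyr gives the answer directly; this sketch assumes the recipes version used in these notes, where type is a character column as in the summaries shown later in this chapter.

# Count the numeric and nominal predictors assigned by the recipe
telecom_rec %>% 
  summary() %>% 
  filter(role == "predictor") %>% 
  count(type)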

Creating recipe objects

In the previous chapter, you fit a logistic regression model using a subset of the predictor variables from the telecom_df data. This dataset contains information on customers of a telecommunications company, and the goal is to predict whether they will cancel their service.

In this exercise, you will use the recipes package to apply a log transformation to the avg_call_mins and avg_intl_mins variables in the telecommunications data. This will reduce the range of these variables and potentially make their distributions more symmetric, which may increase the accuracy of your logistic regression model.

# Specify feature engineering recipe
telecom_log_rec <- recipe(canceled_service ~ . , 
                          data = telecom_training) %>%
  # Add log transformation step for numeric predictors
  step_log(avg_call_mins, avg_intl_mins, base = 10)

# Print recipe object
telecom_log_rec
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          8
## 
## Operations:
## 
## Log transformation on avg_call_mins, avg_intl_mins
# View variable roles and data types
telecom_log_rec %>%
  summary()
## # A tibble: 9 x 4
##   variable            type    role      source  
##   <chr>               <chr>   <chr>     <chr>   
## 1 cellular_service    nominal predictor original
## 2 avg_data_gb         numeric predictor original
## 3 avg_call_mins       numeric predictor original
## 4 avg_intl_mins       numeric predictor original
## 5 internet_service    nominal predictor original
## 6 contract            nominal predictor original
## 7 months_with_company numeric predictor original
## 8 monthly_charges     numeric predictor original
## 9 canceled_service    nominal outcome   original

Great job! You have created a recipe object that assigned variable roles and data types to the outcome and predictor variables in the telecom_training dataset. You also added instructions for applying a log transformation to the avg_call_mins and avg_intl_mins variables. Now it’s time to train your recipe and apply it to new data!

Training a recipe object

In the previous exercise, you created a recipe object with instructions to apply a log transformation to the avg_call_mins and avg_intl_mins predictor variables in the telecommunications data.

The next step in the feature engineering process is to train your recipe object using the training data. Then you will be able to apply your trained recipe to both the training and test datasets in order to prepare them for use in model fitting and model evaluation.

Your recipe object, telecom_log_rec, and the telecom_training and telecom_test datasets have been loaded into your session.

# Train the telecom_log_rec object
telecom_log_rec_prep <- telecom_log_rec %>% 
  prep(training = telecom_training)

# View results
telecom_log_rec_prep
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          8
## 
## Training data contained 731 data points and no missing data.
## 
## Operations:
## 
## Log transformation on avg_call_mins, avg_intl_mins [trained]
# Apply to training data
telecom_log_rec_prep %>% 
  bake(new_data = NULL)
## # A tibble: 731 x 9
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line            10.3           2.42          1.74 fiber_optic     
##  2 multiple_lines          9.4           2.49          2.17 fiber_optic     
##  3 multiple_lines         10.2           2.60          2.06 fiber_optic     
##  4 single_line             9.37          2.58          1.94 fiber_optic     
##  5 multiple_lines         10.6           2.45          2.17 fiber_optic     
##  6 multiple_lines          5.17          2.53          2.08 digital         
##  7 multiple_lines          7.86          2.58          2.21 digital         
##  8 single_line             8.67          1.97          2.12 fiber_optic     
##  9 multiple_lines          9.24          2.59          2.15 fiber_optic     
## 10 multiple_lines         11.0           2.59          1.89 fiber_optic     
## # ... with 721 more rows, and 4 more variables: contract <fct>,
## #   months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>
# Apply to test data
telecom_log_rec_prep %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 x 9
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line             9.04          2.53          1.94 fiber_optic     
##  2 multiple_lines          8.05          2.52          2.09 digital         
##  3 single_line             9.3           2.51          2.06 fiber_optic     
##  4 multiple_lines          9.96          2.53          2.13 fiber_optic     
##  5 single_line             6.69          2.55          1.96 digital         
##  6 multiple_lines          4.11          2.57          1.81 digital         
##  7 single_line             7.31          2.39          2.08 digital         
##  8 multiple_lines          6.79          2.71          2.15 digital         
##  9 single_line             6.64          2.04          2.13 digital         
## 10 multiple_lines          9.67          2.55          2.23 fiber_optic     
## # ... with 234 more rows, and 4 more variables: contract <fct>,
## #   months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>

Great work! You successfully trained your recipe to be able to transform new data sources and applied it to the training and test datasets. Notice that the avg_call_mins and avg_intl_mins variables have been log transformed in the test dataset!

Numeric predictors

Discovering correlated predictors

Correlated predictor variables provide redundant information and can negatively impact the model fitting process. When two variables are highly correlated, their values change linearly with each other and hence provide the same information to your machine learning algorithms. This phenomenon is known as multicollinearity.

Before beginning the model fitting process, it’s important to explore your dataset to uncover these relationships and remove them in your feature engineering steps.

In this exercise, you will explore the telecom_training dataset by creating a correlation matrix of all the numeric predictor variables.

The telecom_training data has been loaded into your session.

telecom_training %>% 
  # Select numeric columns
  select_if(is.numeric) %>% 
  # Calculate correlation matrix
  cor()
##                     avg_data_gb avg_call_mins avg_intl_mins months_with_company
## avg_data_gb           1.0000000    0.17888971    0.14953256          0.43235407
## avg_call_mins         0.1788897    1.00000000    0.07173488          0.02821505
## avg_intl_mins         0.1495326    0.07173488    1.00000000          0.22521437
## months_with_company   0.4323541    0.02821505    0.22521437          1.00000000
## monthly_charges       0.9562531    0.18089932    0.16016788          0.45749019
##                     monthly_charges
## avg_data_gb               0.9562531
## avg_call_mins             0.1808993
## avg_intl_mins             0.1601679
## months_with_company       0.4574902
## monthly_charges           1.0000000
# Plot correlated predictors
ggplot(telecom_training, aes(x = avg_data_gb, y = monthly_charges)) + 
  # Add points
   geom_point()  + 
  # Add title
  labs(title = 'Monthly Charges vs. Average Data Usage',
       y = 'Monthly Charges ($)', x = 'Average Data Usage (GB)') 

Great job! You explored the telecom_training data and discovered that monthly_charges and avg_data_gb have a correlation of 0.96. From the scatter plot, you can see that the more data customers use, the more they are charged every month. You will have to remove this redundant information with your feature engineering steps.

Removing correlated predictors with recipes

Removing correlated predictor variables from your training and test datasets is an important feature engineering step to ensure your model fitting runs as smoothly as possible.

Now that you have discovered that monthly_charges and avg_data_gb are highly correlated, you must add a correlation filter with step_corr() to your feature engineering pipeline for the telecommunications data.

In this exercise, you will create a recipe object that removes correlated predictors from the telecommunications data.

The telecom_training and telecom_test datasets have been loaded into your session.

# Specify a recipe object
telecom_cor_rec <- recipe(canceled_service ~ .,
                          data = telecom_training) %>% 
  # Remove correlated variables
  step_corr(all_numeric(), threshold = 0.8)

# Train the recipe
telecom_cor_rec_prep <- telecom_cor_rec %>% 
  prep(training = telecom_training)

# Apply to training data
telecom_cor_rec_prep %>% 
  bake(new_data = NULL)
## # A tibble: 731 x 8
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line            10.3            262            55 fiber_optic     
##  2 multiple_lines          9.4            312           147 fiber_optic     
##  3 multiple_lines         10.2            402           116 fiber_optic     
##  4 single_line             9.37           382            87 fiber_optic     
##  5 multiple_lines         10.6            281           147 fiber_optic     
##  6 multiple_lines          5.17           341           119 digital         
##  7 multiple_lines          7.86           378           164 digital         
##  8 single_line             8.67            93           131 fiber_optic     
##  9 multiple_lines          9.24           392           142 fiber_optic     
## 10 multiple_lines         11.0            390            78 fiber_optic     
## # ... with 721 more rows, and 3 more variables: contract <fct>,
## #   months_with_company <dbl>, canceled_service <fct>
# Apply to test data
telecom_cor_rec_prep %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 x 8
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line             9.04           336            88 fiber_optic     
##  2 multiple_lines          8.05           328           122 digital         
##  3 single_line             9.3            326           114 fiber_optic     
##  4 multiple_lines          9.96           340           136 fiber_optic     
##  5 single_line             6.69           352            91 digital         
##  6 multiple_lines          4.11           371            64 digital         
##  7 single_line             7.31           246           119 digital         
##  8 multiple_lines          6.79           514           141 digital         
##  9 single_line             6.64           110           136 digital         
## 10 multiple_lines          9.67           352           170 fiber_optic     
## # ... with 234 more rows, and 3 more variables: contract <fct>,
## #   months_with_company <dbl>, canceled_service <fct>

Excellent! You have trained your recipe to remove all correlated predictors that exceed the 0.8 correlation threshold. Notice that your recipe found the high correlation between monthly_charges and avg_data_gb in the training data and when applied to the telecom_test data, it removed the monthly_charges column.

Multiple feature engineering steps

The power of the recipes package is that you can include multiple preprocessing steps in a single recipe object. These steps will be carried out in the order they are entered with the step_*() functions.

In this exercise, you will build upon your feature engineering from the last exercise. In addition to removing correlated predictors, you will create a recipe object that also normalizes all numeric predictors in the telecommunications data.

The telecom_training and telecom_test datasets have been loaded into your session.

# Specify a recipe object
telecom_norm_rec <- recipe(canceled_service ~ .,
                          data = telecom_training) %>% 
  # Remove correlated variables
  step_corr(all_numeric(), threshold = 0.8) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric())

# Train the recipe
telecom_norm_rec_prep <- telecom_norm_rec %>% 
  prep(training = telecom_training)

#  Apply to test data
telecom_norm_rec_prep %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 x 8
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line            0.405       -0.132         -0.643 fiber_optic     
##  2 multiple_lines        -0.112       -0.239          0.430 digital         
##  3 single_line            0.541       -0.265          0.177 fiber_optic     
##  4 multiple_lines         0.885       -0.0789         0.871 fiber_optic     
##  5 single_line           -0.822        0.0807        -0.548 digital         
##  6 multiple_lines        -2.17         0.333         -1.40  digital         
##  7 single_line           -0.499       -1.33           0.335 digital         
##  8 multiple_lines        -0.770        2.24           1.03  digital         
##  9 single_line           -0.848       -3.14           0.871 digital         
## 10 multiple_lines         0.734        0.0807         1.94  fiber_optic     
## # ... with 234 more rows, and 3 more variables: contract <fct>,
## #   months_with_company <dbl>, canceled_service <fct>

Great job! When you applied your trained recipe to the telecom_test data, it removed the monthly_charges column, due to its large correlation with avg_data_gb, and normalized the numeric predictor variables!

Nominal predictors

Applying step_dummy() to predictors

You are using the telecom_training data to predict canceled_service using avg_data_gb and contract as predictor variables.

## # A tibble: 4 x 3
##   canceled_service avg_data_gb contract      
##   <chr>                  <dbl> <chr>         
## 1 yes                     7.78 month_to_month
## 2 yes                     9.04 month_to_month
## 3 yes                     5.08 one_year      
## 4 no                      8.05 two_year

In your feature engineering pipeline, you would like to create dummy variables from the contract column and leave avg_data_gb and canceled_service as is.

Which step_*() function from the options will correctly encode your recipe object?

Congratulations! The special selector functions are helpful for specifying feature engineering steps without having to type out all individual variables for processing.
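
For reference, one specification that dummy-encodes contract while leaving avg_data_gb and the outcome untouched looks like the sketch below. The object name is arbitrary, and it uses selector functions rather than naming the column explicitly, consistent with the hint above; it is not necessarily the exact option from the exercise.

# Dummy-encode all nominal predictors, excluding the outcome
telecom_dummy_rec <- recipe(canceled_service ~ avg_data_gb + contract,
                            data = telecom_training) %>% 
  step_dummy(all_nominal(), -all_outcomes())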

Ordering of step_*() functions

The step_*() functions within a recipe are carried out in sequential order. It’s important to keep this in mind so that you avoid unexpected results in your feature engineering pipeline!

In this exercise, you will combine different step_*() functions into a single recipe and see what effect the ordering of step_*() functions has on the final result.

The telecom_training and telecom_test datasets have been loaded into this session.

telecom_recipe_1 <- 
  recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training)  %>% 
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

# Train and apply telecom_recipe_1 on the test data
telecom_recipe_1 %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 x 4
##    avg_data_gb canceled_service contract_one_year contract_two_year
##          <dbl> <fct>                        <dbl>             <dbl>
##  1       0.405 yes                              0                 0
##  2      -0.112 no                               0                 1
##  3       0.541 no                               0                 0
##  4       0.885 no                               0                 0
##  5      -0.822 no                               0                 0
##  6      -2.17  no                               0                 1
##  7      -0.499 no                               1                 0
##  8      -0.770 yes                              1                 0
##  9      -0.848 no                               0                 1
## 10       0.734 no                               0                 0
## # ... with 234 more rows
telecom_recipe_2 <- 
  recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training)  %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_normalize(all_numeric(), -all_outcomes())

# Train and apply telecom_recipe_2 on the test data
telecom_recipe_2 %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 x 4
##    avg_data_gb canceled_service contract_one_year contract_two_year
##          <dbl> <fct>                        <dbl>             <dbl>
##  1       0.405 yes                         -0.510            -0.467
##  2      -0.112 no                          -0.510             2.14 
##  3       0.541 no                          -0.510            -0.467
##  4       0.885 no                          -0.510            -0.467
##  5      -0.822 no                          -0.510            -0.467
##  6      -2.17  no                          -0.510             2.14 
##  7      -0.499 no                           1.96             -0.467
##  8      -0.770 yes                          1.96             -0.467
##  9      -0.848 no                          -0.510             2.14 
## 10       0.734 no                          -0.510            -0.467
## # ... with 234 more rows

Great job! Notice that telecom_recipe_1 produced [0, 1] values in the dummy variable columns while telecom_recipe_2 produced dummy variables which were then normalized! For example, the predictor contract_two_year created by telecom_recipe_2 is -0.467 instead of 0 and 2.14 instead of 1 due to normalization. For model interpretation, it’s best to normalize variables before creating dummy variables. Also notice that since you only specified two predictor variables in your model formula, the rest of the columns are ignored by your recipe objects when transforming new data sources.

Complete feature engineering pipeline

The recipes package is designed to encode multiple feature engineering steps into one object, making it easier to maintain data transformations in a machine learning workflow.

In this exercise, you will train a feature engineering pipeline to prepare the telecommunications data for modeling.

The telecom_df tibble, as well as your telecom_training and telecom_test datasets from the previous exercises, have been loaded into your workspace.

# Create a recipe that predicts canceled_service using the training data
telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  # Remove correlated predictors
  step_corr(all_numeric(), threshold = 0.8) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric(), -all_outcomes()) %>% 
  # Create dummy variables
  step_dummy(all_nominal(), -all_outcomes())

# Train your recipe and apply it to the test data
telecom_recipe %>% 
  prep(training = telecom_training) %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 x 9
##    avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
##          <dbl>         <dbl>         <dbl>               <dbl> <fct>           
##  1       0.405       -0.132         -0.643              -0.930 yes             
##  2      -0.112       -0.239          0.430               0.656 no              
##  3       0.541       -0.265          0.177              -0.335 no              
##  4       0.885       -0.0789         0.871               1.09  no              
##  5      -0.822        0.0807        -0.548              -1.09  no              
##  6      -2.17         0.333         -1.40                1.53  no              
##  7      -0.499       -1.33           0.335               0.973 no              
##  8      -0.770        2.24           1.03               -0.850 yes             
##  9      -0.848       -3.14           0.871               0.815 no              
## 10       0.734        0.0807         1.94               -0.295 no              
## # ... with 234 more rows, and 4 more variables:
## #   cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## #   contract_one_year <dbl>, contract_two_year <dbl>

Great job! You are now a feature engineering ninja! Transforming your training data for modeling is an important part of the machine learning process. In the next section, you will incorporate your feature engineering skills into the entire model fitting process for the telecommunications data.

Complete modeling workflow

Feature engineering process

To incorporate feature engineering into the modeling process, the training and test datasets must be preprocessed before the model fitting stage. With the new skills you have learned in this chapter, you will be able to use all of the available predictor variables in the telecommunications data to train your logistic regression model.

In this exercise, you will create a feature engineering pipeline on the telecommunications data and use it to transform the training and test datasets.

The telecom_training and telecom_test datasets as well as your logistic regression model specification, logistic_model, have been loaded into your session.

telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>% 
  # Remove correlated predictors
  step_corr(all_numeric(), threshold = 0.8) %>% 
  # Log transform numeric predictors
  step_log(all_numeric(), base = 10) %>%
  # Normalize numeric predictors
  step_normalize(all_numeric()) %>% 
  # Create dummy variables
  step_dummy(all_nominal(), -all_outcomes())

# Train recipe
telecom_recipe_prep <- telecom_recipe %>% 
  prep(training = telecom_training)

# Transform training data
telecom_training_prep <- telecom_recipe_prep %>% 
  bake(new_data = NULL)

# Transform test data
telecom_test_prep <- telecom_recipe_prep %>% 
  bake(new_data = telecom_test)

telecom_test_prep
## # A tibble: 244 x 9
##    avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
##          <dbl>         <dbl>         <dbl>               <dbl> <fct>           
##  1      0.472       -0.00903        -0.497              -0.457 yes             
##  2      0.0145      -0.108           0.513               0.722 no              
##  3      0.583       -0.133           0.303               0.214 no              
##  4      0.853        0.0395          0.849               0.868 no              
##  5     -0.715        0.182          -0.393              -0.832 no              
##  6     -2.63         0.397          -1.48                0.990 no              
##  7     -0.365       -1.29            0.436               0.831 no              
##  8     -0.656        1.73            0.960              -0.324 yes             
##  9     -0.744       -4.59            0.849               0.779 no              
## 10      0.737        0.182           1.54                0.243 no              
## # ... with 234 more rows, and 4 more variables:
## #   cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## #   contract_one_year <dbl>, contract_two_year <dbl>

Excellent work! You have preprocessed your training and test datasets with your recipe object and are now ready to use them for the model fitting and evaluation steps. Looking at the transformed test dataset, you can see that your feature engineering steps have been applied correctly.

Model training and prediction

You have preprocessed your training and test datasets in the previous exercise. Since you incorporated feature engineering into your modeling workflow, you are able to use all of the predictor variables available in the telecommunications data!

The next step is training your logistic regression model and using it to obtain predictions on your new preprocessed test dataset.

Your preprocessed training and test datasets, telecom_training_prep and telecom_test_prep, as well as your model object, logistic_model, have been loaded into your session.

# Train logistic model
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ ., data = telecom_training_prep)

# Obtain class predictions
class_preds <- predict(logistic_fit, new_data = telecom_test_prep,
                       type = 'class')

# Obtain estimated probabilities
prob_preds <- predict(logistic_fit, new_data = telecom_test_prep, 
                      type = 'prob')

# Combine test set results
telecom_results <- telecom_test_prep %>% 
  select(canceled_service) %>% 
  bind_cols(class_preds, prob_preds)

telecom_results
## # A tibble: 244 x 4
##    canceled_service .pred_class .pred_yes .pred_no
##    <fct>            <fct>           <dbl>    <dbl>
##  1 yes              yes          0.625       0.375
##  2 no               no           0.00893     0.991
##  3 no               no           0.330       0.670
##  4 no               no           0.222       0.778
##  5 no               no           0.380       0.620
##  6 no               no           0.0232      0.977
##  7 no               no           0.0109      0.989
##  8 yes              no           0.207       0.793
##  9 no               no           0.000135    1.00 
## 10 no               no           0.249       0.751
## # ... with 234 more rows

Good job! You have created a tibble of model results on the test dataset with the actual outcome variable value, predicted outcome values, and estimated probabilities of the positive and negative classes. Now you can evaluate the performance of your model with yardstick.

Model performance metrics

In this exercise, you will use yardstick metric functions to evaluate your model’s performance on the test dataset.

When you fit a logistic regression model to the telecommunications data in Chapter 2, you predicted canceled_service using avg_call_mins, avg_intl_mins, and monthly_charges. The sensitivity of your model was 0.42 while the specificity was 0.895.

Now that you have incorporated all available predictor variables using feature engineering, you can compare your new model’s performance to your previous results.

Your model results, telecom_results, have been loaded into your session.

# Create a confusion matrix
telecom_results %>% 
  conf_mat(truth = canceled_service, estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  52  25
##        no   30 137
# Calculate sensitivity
sens_tel <- telecom_results %>% 
  sens(truth = canceled_service, estimate = .pred_class)
sens_tel
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    binary         0.634
# Calculate specificity
spec_tel <- telecom_results %>% 
  spec(truth = canceled_service, estimate = .pred_class)
spec_tel
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 spec    binary         0.846
# Plot ROC curve
telecom_results %>%
  roc_curve(truth = canceled_service, .pred_yes) %>% 
  autoplot()

Fantastic work! You have really come a long way in developing your modeling skills with tidymodels! From the results of your metric calculations, using feature engineering to incorporate all predictor variables raised your model's sensitivity to 0.634, a large improvement over the 0.42 from Chapter 2, while specificity came in at 0.846, just below the earlier 0.895.
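If you would rather compute these metrics in a single call, yardstick can bundle them into a metric set. A brief sketch, assuming the telecom_results tibble from above (the metric set name is just illustrative):

# Bundle class and probability metrics into one callable metric set
telecom_metric_set <- metric_set(sens, spec, roc_auc)

# Class predictions go to `estimate`; probability columns are passed after `truth`
telecom_results %>% 
  telecom_metric_set(truth = canceled_service,
                     estimate = .pred_class,
                     .pred_yes)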

Workflows and Hyperparameter Tuning

Now it’s time to streamline the modeling process using workflows and fine-tune models with cross-validation and hyperparameter tuning. You’ll learn how to tune a decision tree classification model to predict whether a bank’s customers are likely to default on their loan.

Machine learning workflows

Exploring the loans dataset

The workflows package provides the ability to bundle parsnip models and recipe objects into a single modeling workflow object. This makes managing a machine learning project much easier and removes the need to keep track of multiple modeling objects.

In this exercise, you will be working with the loans_df dataset, which contains financial information on consumer loans at a bank. The outcome variable in this data is loan_default.

You will create a decision tree model object and specify a feature engineering pipeline for the loan data. The loans_df tibble has been loaded into your session.

# Create data split object
loans_split <- initial_split(loans_df, 
                             strata = loan_default)

# Build training data
loans_training <- loans_split %>% 
  training()

# Build test data
loans_test <- loans_split %>% 
  testing()

# Check for correlated predictors
loans_training %>% 
  # Select numeric columns
  select_if(is.numeric) %>% 
  # Calculate correlation matrix
  cor()
##                loan_amount interest_rate installment annual_income
## loan_amount     1.00000000    0.03184287  0.93260718    0.35677596
## interest_rate   0.03184287    1.00000000  0.07582908   -0.06730054
## installment     0.93260718    0.07582908  1.00000000    0.29437125
## annual_income   0.35677596   -0.06730054  0.29437125    1.00000000
## debt_to_income  0.11574276    0.18579065  0.17944233   -0.20516588
##                debt_to_income
## loan_amount         0.1157428
## interest_rate       0.1857907
## installment         0.1794423
## annual_income      -0.2051659
## debt_to_income      1.0000000

Great work! You have created your training and test datasets and discovered that loan_amount and installment are highly correlated predictor variables. To remove one of these predictors, you will have to incorporate step_corr() into your feature engineering pipeline for this data.

Specifying a model and recipe

Now that you have created your training and test datasets, the next step is to specify your model and feature engineering pipeline. These are the two components that are needed to create a workflow object for the model training process.

In this exercise, you will define a decision tree model object with decision_tree() and a recipe specification with the recipe() function.

Your loans_training data has been loaded into this session.

dt_model <- decision_tree() %>% 
  # Specify the engine
  set_engine('rpart') %>% 
  # Specify the mode
  set_mode('classification')

# Build feature engineering pipeline
loans_recipe <- recipe(loan_default ~ .,
                        data = loans_training) %>% 
  # Correlation filter
  step_corr(all_numeric(), threshold = 0.85) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric()) %>% 
  # Create dummy variables
  step_dummy(all_nominal(), -all_outcomes())

loans_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          7
## 
## Operations:
## 
## Correlation filter on all_numeric()
## Centering and scaling for all_numeric()
## Dummy variables from all_nominal(), -all_outcomes()

Nice work! Now that you have your model and feature engineering steps specified, you can create a workflow object for model training.
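As an optional sanity check (not part of the original exercise), you could prep the recipe on the training data and inspect which columns survive the correlation filter, since loan_amount and installment were correlated at 0.93:

# Prep the recipe and list the columns remaining after the correlation filter
loans_recipe %>% 
  prep(training = loans_training) %>% 
  bake(new_data = NULL) %>% 
  names()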

Creating workflows

Workflow objects simplify the modeling process in tidymodels. With workflows, it's possible to train a parsnip model and a recipe object at the same time.

In this exercise, you will combine your decision tree model and feature engineering recipe into a single workflow object and perform model fitting and evaluation.

Your model object, dt_model, recipe object, loans_recipe, and data split, loans_split, have been loaded into this session.

# Create a workflow
loans_dt_wkfl <- workflow() %>% 
  # Include the model object
  add_model(dt_model) %>% 
  # Include the recipe object
  add_recipe(loans_recipe)

# Train the workflow
loans_dt_wkfl_fit <- loans_dt_wkfl %>% 
  last_fit(split = loans_split)

# Calculate performance metrics on test data
pmetrics_loans <- loans_dt_wkfl_fit %>% 
  collect_metrics()
pmetrics_loans
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.785 Preprocessor1_Model1
## 2 roc_auc  binary         0.817 Preprocessor1_Model1

Good job! You trained a workflow with last_fit(), which used your data split to create training and test datasets, trained and applied your recipe, fit your decision tree model to the training data, and calculated performance metrics on the test data, all with just a few lines of code! The model performed really well, with an area under the ROC curve of 0.817.
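If you want to dig into the test-set predictions or reuse the fitted model, the object returned by last_fit() carries both. A short sketch, assuming loans_dt_wkfl_fit from above (extract_workflow() requires a reasonably recent version of the tune package):

# Test-set predictions generated during last_fit()
loans_dt_wkfl_fit %>% 
  collect_predictions()

# The workflow fitted on the training data, ready for predict() on new data
loans_dt_wkfl_fit %>% 
  extract_workflow()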

Estimating performance with cross validation

Measuring performance with cross validation

Cross validation is a method that uses training data to provide multiple estimates of model performance. When trying different model types on your data, it is important to study their performance profile to help decide which model type performs consistently well.

In this exercise, you will perform cross validation with your decision tree model workflow to explore its performance.

The training data, loans_training, and your workflow object, loans_dt_wkfl, have been loaded into your session.

# Create cross validation folds
set.seed(290)
loans_folds <- vfold_cv(loans_training, v = 5,
                        strata = loan_default)

# Create custom metrics function
loans_metrics <- metric_set(roc_auc, sensitivity, specificity)

# Fit resamples
loans_dt_rs <- loans_dt_wkfl %>% 
  fit_resamples(resamples = loans_folds,
                metrics = loans_metrics)

# View performance metrics
loans_dt_metrics <- loans_dt_rs %>% 
  collect_metrics()
loans_dt_metrics
## # A tibble: 3 x 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.799     5  0.0256 Preprocessor1_Model1
## 2 sens    binary     0.683     5  0.0522 Preprocessor1_Model1
## 3 spec    binary     0.838     5  0.0157 Preprocessor1_Model1

Excellent work! You have used cross validation to evaluate the performance of your decision tree workflow. Across the 5 cross validation folds, the average area under the ROC curve was 0.799. The average sensitivity and specificity were 0.683 and 0.838, respectively.

Cross validation with logistic regression

Cross validation provides the ability to compare the performance profile of multiple model types. This is helpful in the early stages of modeling, when you are trying to determine which model type will perform best with your data.

In this exercise, you will perform cross validation on the loans_training data using logistic regression and compare the results to your decision tree model.

The loans_folds and loans_metrics objects from the previous exercise have been loaded into your session. Your feature engineering recipe from the previous section, loans_recipe, has also been loaded.

logistic_model <- logistic_reg() %>% 
  # Specify the engine
  set_engine('glm') %>% 
  # Specify the mode
  set_mode('classification')

# Create workflow
loans_logistic_wkfl <- workflow() %>% 
  # Add model
  add_model(logistic_model) %>% 
  # Add recipe
  add_recipe(loans_recipe)

# Fit resamples
loans_logistic_rs <- loans_logistic_wkfl %>% 
  fit_resamples(resamples = loans_folds,
                metrics = loans_metrics)

# View performance metrics
loans_logistic_metrics <- loans_logistic_rs %>% 
  collect_metrics()
loans_logistic_metrics 
## # A tibble: 3 x 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.851     5  0.0288 Preprocessor1_Model1
## 2 sens    binary     0.667     5  0.0575 Preprocessor1_Model1
## 3 spec    binary     0.875     5  0.0262 Preprocessor1_Model1

Great job! For logistic regression, across the 5 cross validation folds, the average area under the ROC curve was 0.851, and the average sensitivity and specificity were 0.667 and 0.875, respectively. Compared with the decision tree results, logistic regression has a somewhat higher ROC AUC (0.851 vs 0.799) and specificity (0.875 vs 0.838), while the decision tree performed slightly better on sensitivity, with an average of 0.683 vs 0.667.

Comparing model performance profiles

The benefit of the collect_metrics() function is that it returns a tibble of cross validation results. This makes it easy to calculate custom summary statistics with the dplyr package.

In this exercise, you will use dplyr to explore the cross validation results of your decision tree and logistic regression models.

Your cross validation results, loans_dt_rs and loans_logistic_rs have been loaded into your session.

# Detailed cross validation results
dt_rs_results <- loans_dt_rs %>% 
  collect_metrics(summarize = FALSE)

# Explore model performance for decision tree
dt_rs_results %>% 
  group_by(.metric) %>% 
  summarize(min = min(.estimate),
            median = median(.estimate),
            max = max(.estimate))
## # A tibble: 3 x 4
##   .metric   min median   max
##   <chr>   <dbl>  <dbl> <dbl>
## 1 roc_auc 0.740  0.784 0.867
## 2 sens    0.569  0.667 0.84 
## 3 spec    0.8    0.827 0.875
# Detailed cross validation results
logistic_rs_results <- loans_logistic_rs %>% 
  collect_metrics(summarize = FALSE)

# Explore model performance for logistic regression
logistic_rs_results %>% 
  group_by(.metric) %>% 
  summarize(min = min(.estimate),
            median = median(.estimate),
            max = max(.estimate))
## # A tibble: 3 x 4
##   .metric   min median   max
##   <chr>   <dbl>  <dbl> <dbl>
## 1 roc_auc 0.783  0.868 0.935
## 2 sens    0.569  0.588 0.86 
## 3 spec    0.802  0.9   0.938

Great job! The two models post comparable summary values, but logistic regression shows a wider range of values on every metric. This provides evidence that the decision tree model may produce more stable prediction accuracy on the loans dataset. One way to quantify that spread is sketched below.
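A quick sketch, assuming the dt_rs_results and logistic_rs_results tibbles created above:

# Compare per-fold variability of each metric across the two models
bind_rows(
  dt_rs_results %>% mutate(model = 'decision_tree'),
  logistic_rs_results %>% mutate(model = 'logistic_regression')
) %>% 
  group_by(model, .metric) %>% 
  summarize(sd = sd(.estimate),
            range = max(.estimate) - min(.estimate),
            .groups = 'drop')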

Hyperparameter tuning

Setting model hyperparameters

Hyperparameter tuning is a method for fine-tuning the performance of your models. In most cases, the default hyperparameter values of parsnip model objects will not be the optimal values for maximizing model performance.

In this exercise, you will define a decision tree model with hyperparameters for tuning and create a tuning workflow object.

Your decision tree workflow object, loans_dt_wkfl, has been loaded into your session.

# Set tuning hyperparameters
dt_tune_model <- decision_tree(cost_complexity = tune(),
                               tree_depth = tune(),
                               min_n = tune()) %>% 
  # Specify engine
  set_engine('rpart') %>% 
  # Specify mode
  set_mode('classification')

# Create a tuning workflow
loans_tune_wkfl <- loans_dt_wkfl %>% 
  # Replace model
  update_model(dt_tune_model)

loans_tune_wkfl
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: decision_tree()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_corr()
## * step_normalize()
## * step_dummy()
## 
## -- Model -----------------------------------------------------------------------
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = tune()
##   tree_depth = tune()
##   min_n = tune()
## 
## Computational engine: rpart

Good job! When you print your new workflow object, the decision tree hyperparameters now appear under the main arguments section.

Exploring tuning results

The collect_metrics() function is able to produce a detailed tibble of tuning results from a tuning object. Since this function returns a tibble, it works well with the dplyr package for further data exploration and analysis.

In this exercise, you will explore your tuning results, dt_tuning, to gain further insights into your hyperparameter tuning.

Your dt_tuning object has been loaded into this session.
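The notes do not show the call that produced dt_tuning. A plausible sketch of that step, assuming a random grid of five candidate combinations and the loans_folds and loans_metrics objects from earlier (the seed, grid size, and grid type are assumptions, since the actual tuning run is not reproduced here):

# Pull the parameter set declared with tune() in the model specification
dt_param_set <- extract_parameter_set_dials(dt_tune_model)

# Generate a random grid of candidate hyperparameter combinations
set.seed(214)
dt_grid <- grid_random(dt_param_set, size = 5)

# Evaluate every candidate across the cross validation folds
dt_tuning <- loans_tune_wkfl %>% 
  tune_grid(resamples = loans_folds,
            grid = dt_grid,
            metrics = loans_metrics)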

# Collect detailed tuning results
dt_tuning_results <- dt_tuning %>% 
  collect_metrics(summarize = FALSE)

# Explore detailed ROC AUC results for each fold
dt_tuning_results %>% 
  filter(.metric == "roc_auc") %>% 
  group_by(id) %>% 
  summarize(min_roc_auc = min(.estimate),
            median_roc_auc = median(.estimate),
            max_roc_auc = max(.estimate))
## # A tibble: 5 x 4
##   id    min_roc_auc median_roc_auc max_roc_auc
##   <chr>       <dbl>          <dbl>       <dbl>
## 1 Fold1       0.694          0.756       0.789
## 2 Fold2       0.725          0.740       0.799
## 3 Fold3       0.732          0.792       0.813
## 4 Fold4       0.869          0.891       0.901
## 5 Fold5       0.85           0.878       0.912

Excellent work! You have now had the chance to explore the detailed results of your decision tree hyperparameter tuning. The next step will be selecting the best combination and finalizing your workflow object!

Selecting the best model

Finalizing a workflow

To incorporate hyperparameter tuning into your modeling process, an optimal hyperparameter combination must be selected based on the average value of a performance metric. Then you will be able to finalize your tuning workflow and fit your final model.

In this exercise, you will explore the best performing models from your hyperparameter tuning and finalize your tuning workflow object.

The dt_tuning and loans_tune_wkfl objects from your previous session have been loaded into your environment.

# Display 5 best performing models
dt_tuning %>% 
  show_best(metric = 'roc_auc', n = 5)
## # A tibble: 5 x 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000758         14    39 roc_auc binary     0.839     5  0.0263
## 2    0.00380               5    36 roc_auc binary     0.828     5  0.0268
## 3    0.00000443           11     8 roc_auc binary     0.816     5  0.0276
## 4    0.000000600           3     5 roc_auc binary     0.787     5  0.0374
## 5    0.0243                5    34 roc_auc binary     0.775     5  0.0361
## # ... with 1 more variable: .config <chr>
# Select based on best performance
best_dt_model <- dt_tuning %>% 
  # Choose the best model based on roc_auc
  select_best(metric = 'roc_auc')

# Finalize your workflow
final_loans_wkfl <- loans_tune_wkfl %>% 
  finalize_workflow(best_dt_model)

final_loans_wkfl
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: decision_tree()
## 
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
## 
## * step_corr()
## * step_normalize()
## * step_dummy()
## 
## -- Model -----------------------------------------------------------------------
## Decision Tree Model Specification (classification)
## 
## Main Arguments:
##   cost_complexity = 7.58290839567418e-08
##   tree_depth = 14
##   min_n = 39
## 
## Computational engine: rpart

Good job! When you print your finalized workflow object, the optimal hyperparameter combination appears in the main arguments section of the output. Your workflow is now ready for model fitting and prediction on new data sources!
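As an alternative to last_fit(), a finalized workflow can also be fit directly to the training data and then used with predict(). A brief sketch, assuming the loans_training and loans_test datasets from earlier:

# Fit the finalized workflow on the full training dataset
final_loans_fit <- final_loans_wkfl %>% 
  fit(data = loans_training)

# Predict classes and estimated probabilities for new data
predict(final_loans_fit, new_data = loans_test, type = 'class')
predict(final_loans_fit, new_data = loans_test, type = 'prob')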

Training a finalized workflow

Congratulations on successfully tuning your decision tree model and finalizing your workflow! Your final_loans_wkfl object can now be used for model training and prediction on new data sources.

In this last exercise, you will train your finalized workflow on the entire loans_training dataset and evaluate its performance on the loans_test data.

The final_loans_wkfl and loans_split objects have been loaded into your session.

# Train finalized decision tree workflow
loans_final_fit <- final_loans_wkfl %>% 
  last_fit(split = loans_split)

# View performance metrics
loans_final_fit_metrics <- loans_final_fit %>% 
  collect_metrics()
loans_final_fit_metrics
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.772 Preprocessor1_Model1
## 2 roc_auc  binary         0.838 Preprocessor1_Model1
# Create an ROC curve
loans_final_fit %>% 
  # Collect predictions
  collect_predictions() %>%
  # Calculate ROC curve metrics
  roc_curve(truth = loan_default, .pred_yes) %>%
  # Plot the ROC curve
  autoplot()

Great job! You were able to train your finalized workflow with last_fit() and generate predictions on the test data. The tuned decision tree model produced an area under the ROC curve of 0.838. That’s a great model! The ROC curve shows that the sensitivity and specificity remain high across a wide range of probability threshold values.

Congratulations