Course notes from the Modeling with tidymodels in R course on DataCamp taught by David Svancer
Tidymodels is a powerful suite of R packages designed to streamline machine learning workflows. Learn to split datasets for cross-validation, preprocess data with tidymodels’ recipes package, and fine-tune machine learning algorithms. You’ll learn key concepts such as defining model objects and creating modeling workflows. Then, you’ll apply your skills to predict home prices and classify employees by their risk of leaving a company.
library(tidyverse)
library(knitr)
library(tidymodels)
home_sales <- read_rds("data/home_sales.rds")
telecom_df <- read_rds("data/telecom_df.rds")
loans_df <- read_rds("data/loan_df.rds")
In this chapter, you’ll explore the rich ecosystem of R packages that power tidymodels and learn how they can streamline your machine learning workflows. You’ll then put your tidymodels skills to the test by predicting house sale prices in Seattle, Washington.
The rsample
package is designed to create training and
test datasets. Creating a test dataset is important for estimating how a
trained model will likely perform on new data. It also guards against
overfitting, where a model memorizes patterns that exist only in the
training data and performs poorly on new data.
In this exercise, you will create training and test datasets from the home_sales
data. This data contains information on homes sold in the Seattle, Washington area between 2015 and 2016.
The outcome variable in this data is selling_price.
The tidymodels
package will be pre-loaded in every exercise in the course. The home_sales
tibble has also been loaded for you.
# Create a data split object
home_split <- initial_split(home_sales,
prop = 0.7,
strata = selling_price)
# Create the training data
home_training <- home_split %>%
training()
# Create the test data
home_test <- home_split %>%
testing()
# Check number of rows in each dataset
nrow(home_training)
## [1] 1042
nrow(home_test)
## [1] 450
Great job! Since the home_sales data has nearly 1,500 rows, it is appropriate to allocate a relatively large share of the rows (30%) to the test set. This provides more data for the model evaluation step.
Stratifying by the outcome variable when generating training and test datasets ensures that the outcome variable values have a similar range in both datasets.
Since the original data is split at random, stratification avoids placing all the expensive homes in home_sales
into the test dataset, for example. In this case, your model would most
likely perform poorly because it was trained on less expensive homes.
In this exercise, you will calculate summary statistics for the selling_price
variable in the training and test datasets. The home_training
and home_test
tibbles have been loaded from the previous exercise.
# Distribution of selling_price in training data
home_training %>%
summarize(min_sell_price = min(selling_price),
max_sell_price = max(selling_price),
mean_sell_price = mean(selling_price),
sd_sell_price = sd(selling_price))
## # A tibble: 1 x 4
## min_sell_price max_sell_price mean_sell_price sd_sell_price
## <dbl> <dbl> <dbl> <dbl>
## 1 350000 650000 479780. 81542.
# Distribution of selling_price in test data
home_test %>%
summarize(min_sell_price = min(selling_price),
max_sell_price = max(selling_price),
mean_sell_price = mean(selling_price),
sd_sell_price = sd(selling_price))
## # A tibble: 1 x 4
## min_sell_price max_sell_price mean_sell_price sd_sell_price
## <dbl> <dbl> <dbl> <dbl>
## 1 350000 650000 477475. 79725.
Excellent work! The minimum and maximum selling prices in both datasets are the same. The mean and standard deviation are also similar. Stratifying by the outcome variable ensures the model fitting process is performed on a representative sample of the original data.
The parsnip
package provides a unified syntax for the model fitting process in R.
With parsnip
, it is easy to define models using the various packages, or engines, that exist in the R ecosystem.
In this exercise, you will define a parsnip
linear regression object and train your model to predict selling_price
using home_age
and sqft_living
as predictor variables from the home_sales
data.
The home_training
and home_test
tibbles that you created in the previous lesson have been loaded into this session.
# Initialize a linear regression object, linear_model
linear_model <- linear_reg() %>%
# Set the model engine
set_engine('lm') %>%
# Set the model mode
set_mode('regression')
# Train the model with the training data
lm_fit <- linear_model %>%
fit(selling_price ~ home_age + sqft_living,
data = home_training)
# Print lm_fit to view model information
lm_fit
## parsnip model object
##
## Fit time: 0ms
##
## Call:
## stats::lm(formula = selling_price ~ home_age + sqft_living, data = data)
##
## Coefficients:
## (Intercept) home_age sqft_living
## 293527.1 -1658.4 103.4
Excellent work! You have defined your model with linear_reg() and
trained it to predict selling_price using home_age and sqft_living.
Printing a parsnip model fit object displays useful model information,
such as the training time, model formula used during training, and the
estimated model parameters.
### Exploring estimated model parameters
In
the previous exercise, you trained a linear regression model to predict selling_price
using home_age
and sqft_living
as predictor variables.
Pass your trained model object into the appropriate function to explore the estimated model parameters and select the true statement.
Your trained model, lm_fit
, has been loaded into your session.
tidy(lm_fit)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 293527. 7400. 39.7 3.18e-210
## 2 home_age -1658. 175. -9.49 1.46e- 20
## 3 sqft_living 103. 2.68 38.6 9.26e-203
The true statement is: The estimated parameter for the sqft_living
predictor variable is 103.417. The remaining statements are false, as shown in the table above.
Great job! The tidy() function automatically creates a tibble of estimated model parameters. Since sqft_living has a positive estimated parameter, the selling price of homes increases with the square footage. Conversely, since home_age has a negative estimated parameter, older homes tend to have lower selling prices.
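As a quick check of how these estimates combine into a prediction, the sketch below plugs a hypothetical home into the fitted equation by hand. The coefficient values are taken from the lm_fit output above; the home_age and sqft_living values are made up for illustration.
# Manually reproduce a prediction from the printed coefficients
# (hypothetical home: 10 years old, 2,000 sqft)
intercept <- 293527.1
b_home_age <- -1658.4
b_sqft_living <- 103.4
pred_price <- intercept + b_home_age * 10 + b_sqft_living * 2000
pred_price
## [1] 483743.1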
After fitting a model using the training data, the next step is to use it to make predictions on the test dataset. The test dataset acts as a new source of data for the model and will allow you to evaluate how well it performs.
Before you can evaluate model performance, you must add your predictions to the test dataset.
In this exercise, you will use your trained model, lm_fit
, to predict selling_price
in the home_test
dataset.
Your trained model, lm_fit
, as well as the test dataset, home_test
have been loaded into your session.
# Predict selling_price
home_predictions <- predict(lm_fit,
new_data = home_test)
# View predicted selling prices
home_predictions
## # A tibble: 450 x 1
## .pred
## <dbl>
## 1 539624.
## 2 380538.
## 3 633342.
## 4 401613.
## 5 509865.
## 6 484170.
## 7 448133.
## 8 531387.
## 9 535915.
## 10 500466.
## # ... with 440 more rows
# Combine test data with predictions
home_test_results <- home_test %>%
select(selling_price, home_age, sqft_living) %>%
bind_cols(home_predictions)
# View results
home_test_results
## # A tibble: 450 x 4
## selling_price home_age sqft_living .pred
## <dbl> <dbl> <dbl> <dbl>
## 1 487000 10 2540 539624.
## 2 411000 18 1130 380538.
## 3 635000 4 3350 633342.
## 4 356000 24 1430 401613.
## 5 495000 3 2140 509865.
## 6 525000 16 2100 484170.
## 7 552321 29 1960 448133.
## 8 475000 0 2300 531387.
## 9 485000 6 2440 535915.
## 10 525000 28 2450 500466.
## # ... with 440 more rows
Congratulations! You have trained a linear regression model and used it to predict the selling prices of homes in the test dataset! The model only used two predictor variables, but the predicted values in the .pred column seem reasonable!
Evaluating model results is an important step in the modeling process. Model evaluation should be done on the test dataset in order to see how well a model will generalize to new datasets.
In the previous exercise, you trained a linear regression model to predict selling_price
using home_age
and sqft_living
as predictor variables. You then created the home_test_results
tibble using your trained model on the home_test
data.
In this exercise, you will calculate the RMSE and R squared metrics using your results in home_test_results
.
The home_test_results
tibble has been loaded into your session.
# Print home_test_results
home_test_results
## # A tibble: 450 x 4
## selling_price home_age sqft_living .pred
## <dbl> <dbl> <dbl> <dbl>
## 1 487000 10 2540 539624.
## 2 411000 18 1130 380538.
## 3 635000 4 3350 633342.
## 4 356000 24 1430 401613.
## 5 495000 3 2140 509865.
## 6 525000 16 2100 484170.
## 7 552321 29 1960 448133.
## 8 475000 0 2300 531387.
## 9 485000 6 2440 535915.
## 10 525000 28 2450 500466.
## # ... with 440 more rows
# Calculate the RMSE metric
rmse <- home_test_results %>%
rmse(truth = selling_price, estimate = .pred)
rmse
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 48505.
# Calculate the R squared metric
rsq <- home_test_results %>%
rsq(truth = selling_price, estimate = .pred)
rsq
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.630
Great job! The RMSE metric indicates that the average prediction error for home selling prices is about $48,000. Not bad considering you only used home_age and sqft_living as predictor variables!
In the previous exercise, you got an R squared value of 0.630. The R squared metric ranges from 0 to 1, with 0 being the worst and 1 the best.
Calculating the R squared value is only the first step in studying your model’s predictions.
Making an R squared plot is extremely important because it will uncover potential problems with your model, such as non-linear patterns or regions where your model is either over or under-predicting the outcome variable.
In this exercise, you will create an R squared plot of your model’s performance.
The home_test_results tibble has been loaded into your session.
# Create an R squared plot of model performance
ggplot(home_test_results, aes(x = selling_price, y = .pred)) +
geom_point(alpha = 0.5) +
geom_abline(color = 'blue', linetype = 2) +
coord_obs_pred() +
labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')
Good work! From the plot, you can see that your model tends to
over-predict selling prices for homes that sold for less than $400,000,
and under-predict for homes that sold for $600,000 or more. This
indicates that you will have to add more predictors to your model or
that linear regression may not be able to model the relationship as well
as more advanced modeling techniques!
In this exercise, you will train and evaluate the performance of a linear regression model that predicts selling_price
using all the predictors available in the home_sales
tibble.
This exercise will give you a chance to perform the entire model fitting process with tidymodels
, from defining your model object to evaluating its performance on the test data.
Earlier in the chapter, you created an rsample
object called home_split
by passing the home_sales
tibble into initial_split()
. The home_split
object contains the instructions for randomly splitting home_sales
into training and test sets.
The home_sales
tibble, and home_split
object have been loaded into this session.
# Define a linear regression model
linear_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
# Train linear_model with last_fit()
linear_fit <- linear_model %>%
last_fit(selling_price ~ ., split = home_split)
# Collect predictions and view results
predictions_df <- linear_fit %>% collect_predictions()
predictions_df
## # A tibble: 450 x 5
## id .pred .row selling_price .config
## <chr> <dbl> <int> <dbl> <chr>
## 1 train/test split 529274. 1 487000 Preprocessor1_Model1
## 2 train/test split 398602. 3 411000 Preprocessor1_Model1
## 3 train/test split 693173. 4 635000 Preprocessor1_Model1
## 4 train/test split 435176. 11 356000 Preprocessor1_Model1
## 5 train/test split 479692. 12 495000 Preprocessor1_Model1
## 6 train/test split 503484. 13 525000 Preprocessor1_Model1
## 7 train/test split 468670. 15 552321 Preprocessor1_Model1
## 8 train/test split 465154. 16 475000 Preprocessor1_Model1
## 9 train/test split 564799. 17 485000 Preprocessor1_Model1
## 10 train/test split 457006. 19 525000 Preprocessor1_Model1
## # ... with 440 more rows
# Make an R squared plot using predictions_df
ggplot(predictions_df, aes(x = selling_price, y = .pred)) +
geom_point(alpha = 0.5) +
geom_abline(color = 'blue', linetype = 2) +
coord_obs_pred() +
labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')
Great work! You have created your first machine learning pipeline and
visualized the performance of your model. From the R squared plot, the
model still tends to over-predict selling prices for homes that sold for
less than $400,000 and under-predict for homes at $600,000 or more, but
it is a slight improvement over your previous model with only two
predictor variables.
Learn how to predict categorical outcomes by training classification models. Using the skills you’ve gained so far, you’ll predict the likelihood of customers canceling their service with a telecommunications company.
The first step in a machine learning project is to create training and test datasets for model fitting and evaluation. The test dataset provides an estimate of how your model will perform on new data and helps to guard against overfitting.
You will be working with the telecom_df
dataset which contains information on customers of a telecommunications company. The outcome variable is canceled_service
and it records whether a customer canceled their contract with the
company. The predictor variables contain information about customers’
cell phone and internet usage as well as their contract type and monthly
charges.
The telecom_df
tibble has been loaded into your session.
# Create data split object
telecom_split <- initial_split(telecom_df, prop = 0.75,
strata = canceled_service)
# Create the training data
telecom_training <- telecom_split %>%
training()
# Create the test data
telecom_test <- telecom_split %>%
testing()
# Check the number of rows
nrow(telecom_training)
## [1] 731
nrow(telecom_test)
## [1] 244
Good job! You have 731 rows in your training data and 244 rows in
your test dataset. Now you can begin the model fitting process using telecom_training.
In addition to regression models, the parsnip
package also provides a general interface to classification models in R.
In this exercise, you will define a parsnip
logistic regression object and train your model to predict canceled_service
using avg_call_mins
, avg_intl_mins
, and monthly_charges
as predictor variables from the telecom_df
data.
The telecom_training
and telecom_test
tibbles that you created in the previous lesson have been loaded into this session.
# Specify a logistic regression model
logistic_model <- logistic_reg() %>%
# Set the engine
set_engine("glm") %>%
# Set the mode
set_mode("classification")
# Print the model specification
logistic_model
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
# Fit to training data
logistic_fit <- logistic_model %>%
fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
data = telecom_training )
# Print model fit object
logistic_fit
## parsnip model object
##
## Fit time: 10ms
##
## Call: stats::glm(formula = canceled_service ~ avg_call_mins + avg_intl_mins +
## monthly_charges, family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) avg_call_mins avg_intl_mins monthly_charges
## 1.725750 -0.010457 0.023631 0.002039
##
## Degrees of Freedom: 730 Total (i.e. Null); 727 Residual
## Null Deviance: 932.4
## Residual Deviance: 801.3 AIC: 809.3
Great job! You have defined your model with logistic_reg()
and trained it to predict canceled_service
using avg_call_mins
, avg_intl_mins
, and monthly_charges.
Printing a parsnip
model specification object displays useful model information, such as
the model type, computational engine, and mode. Printing a model fit
object will display the estimated model coefficients.
Evaluating your model’s performance on the test dataset gives insights into how well your model predicts on new data sources. These insights will help you communicate your model’s value in solving problems or improving decision making.
Before you can calculate classification metrics such as sensitivity
or specificity, you must create a results tibble with the required
columns for yardstick
metric functions.
In this exercise, you will use your trained model to predict the outcome variable in the telecom_test
dataset and combine it with the true outcome values in the canceled_service
column.
Your trained model, logistic_fit
, and test dataset, telecom_test
, have been loaded from the previous exercise.
# Predict outcome categories
class_preds <- predict(logistic_fit, new_data = telecom_test,
type = 'class')
# Obtain estimated probabilities for each outcome value
prob_preds <- predict(logistic_fit, new_data = telecom_test,
type = 'prob')
# Combine test set results
telecom_results <- telecom_test %>%
select(canceled_service) %>%
bind_cols(class_preds, prob_preds)
# View results tibble
telecom_results
## # A tibble: 244 x 4
## canceled_service .pred_class .pred_yes .pred_no
## <fct> <fct> <dbl> <dbl>
## 1 yes no 0.381 0.619
## 2 no no 0.209 0.791
## 3 no no 0.230 0.770
## 4 no no 0.168 0.832
## 5 no no 0.422 0.578
## 6 no yes 0.629 0.371
## 7 no no 0.107 0.893
## 8 yes yes 0.541 0.459
## 9 no no 0.0192 0.981
## 10 no no 0.0939 0.906
## # ... with 234 more rows
Good job! You have created a tibble of model results using the test
dataset. Your results tibble contains all the necessary columns for
calculating classification metrics. Next, you’ll use this tibble and the
yardstick
package to evaluate your model’s performance.
The confusion matrix of a binary classification model lists the number of correct and incorrect predictions obtained on the test dataset and is useful for evaluating the performance of your model.
Suppose you have trained a classification model that predicts whether
customers will cancel their service at a telecommunications company and
obtained the following confusion matrix on your test dataset. Here yes
represents the positive class, while no
represents the negative class.
Choose the true statement from the options below.
That’s correct. The sensitivity is calculated by taking the proportion of true positives among all actual positives. Your model was able to correctly classify 75% of all customers who actually canceled their service.
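The confusion matrix from the exercise is not reproduced in these notes, but a minimal sketch of the calculation, using hypothetical counts chosen to be consistent with the 75% figure above, looks like this:
# Hypothetical confusion-matrix counts (not the actual exercise values)
tp <- 75  # canceled and predicted to cancel
fn <- 25  # canceled but predicted not to cancel
# Sensitivity: true positives divided by all actual positives
tp / (tp + fn)
## [1] 0.75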
In the previous exercise, you calculated classification metrics from a sample confusion matrix. The yardstick
package was designed to automate this process.
For classification models, yardstick
functions require a
tibble of model results as the first argument. This should include the
actual outcome values, predicted outcome values, and estimated
probabilities for each value of the outcome variable.
In this exercise, you will use the results from your logistic regression model, telecom_results
, to calculate performance metrics.
The telecom_results
tibble has been loaded into your session.
# Calculate the confusion matrix
conf_mat(telecom_results, truth = canceled_service,
estimate = .pred_class)
## Truth
## Prediction yes no
## yes 38 32
## no 44 130
# Calculate the accuracy
acc <- accuracy(telecom_results,
truth = canceled_service,
estimate = .pred_class)
acc
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.689
# Calculate the sensitivity
sens_cl <- sensitivity(telecom_results,
truth = canceled_service,
estimate = .pred_class)
sens_cl
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sens binary 0.463
# Calculate the specificity
spec_cl <- specificity(telecom_results,
truth = canceled_service,
estimate = .pred_class)
spec_cl
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 spec binary 0.802
Excellent work! The specificity of your logistic regression model is 0.802, which is more than double the sensitivity of 0.463. This indicates that your model is much better at detecting customers who will not cancel their telecommunications service versus the ones who will.
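To tie these yardstick results back to the confusion matrix printed above, the short sketch below recomputes the same metrics by hand from the cell counts (the counts come from the conf_mat() output; the manual_* names are just for illustration).
# Cell counts from the confusion matrix above
tp <- 38   # predicted yes, truly yes
fp <- 32   # predicted yes, truly no
fn <- 44   # predicted no, truly yes
tn <- 130  # predicted no, truly no
# Recompute accuracy, sensitivity, and specificity by hand
manual_accuracy <- (tp + tn) / (tp + fp + fn + tn)  # 0.689
manual_sens <- tp / (tp + fn)                       # 0.463
manual_spec <- tn / (tn + fp)                       # 0.802
c(accuracy = manual_accuracy, sens = manual_sens, spec = manual_spec)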
The yardstick
package also provides the ability to
create custom sets of model metrics. In cases where the cost of
obtaining false negative errors is different from the cost of false
positive errors, it may be important to examine a specific set of
performance metrics.
Instead of calculating accuracy, sensitivity, and specificity separately, you can create your own metric function that calculates all three at the same time.
In this exercise, you will use the results from your logistic regression model, telecom_results
,
to calculate a custom set of performance metrics. You will also use a
confusion matrix to calculate all available binary classification
metrics in tidymodels
all at once.
The telecom_results
tibble has been loaded into your session.
# Create a custom metric function
telecom_metrics <- metric_set(accuracy, sens, spec)
# Calculate metrics using model results tibble
telecom_metrics(telecom_results, truth = canceled_service,
estimate = .pred_class)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.689
## 2 sens binary 0.463
## 3 spec binary 0.802
# Create a confusion matrix
conf_mat(telecom_results,
truth = canceled_service,
estimate = .pred_class) %>%
# Pass to the summary() function
summary()
## # A tibble: 13 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.689
## 2 kap binary 0.276
## 3 sens binary 0.463
## 4 spec binary 0.802
## 5 ppv binary 0.543
## 6 npv binary 0.747
## 7 mcc binary 0.278
## 8 j_index binary 0.266
## 9 bal_accuracy binary 0.633
## 10 detection_prevalence binary 0.287
## 11 precision binary 0.543
## 12 recall binary 0.463
## 13 f_meas binary 0.5
Nice work! You created a custom metric function to calculate
accuracy, sensitivity, and specificity. Oftentimes, you will be
interested in tracking certain performance metrics for a given modeling
problem, but passing a confusion matrix to the summary()
function will calculate all available binary classification metrics in tidymodels
at once!
Calculating performance metrics with the yardstick
package provides insight into how well a classification model is performing on the test dataset. Most yardstick
functions return a single number that summarizes classification performance.
Many times, it is helpful to create visualizations of the confusion matrix to more easily communicate your results.
In this exercise, you will make a heat map and mosaic plot of the confusion matrix from your logistic regression model on the telecom_df
dataset.
Your model results tibble, telecom_results
, has been loaded into your session.
# Create a confusion matrix
conf_mat(telecom_results,
truth = canceled_service,
estimate = .pred_class) %>%
# Create a heat map
autoplot(type = "heatmap")
# Create a confusion matrix
conf_mat(telecom_results,
truth = canceled_service,
estimate = .pred_class) %>%
# Create a heat map
autoplot(type = "mosaic")
Great job! The mosaic plot clearly shows that your logistic regression
model performs much better in terms of specificity than sensitivity. You
can see that in the
yes
column, a large proportion of outcomes were incorrectly predicted as no
.
ROC curves are used to visualize the performance of a classification model across a range of probability thresholds. An ROC curve with the majority of points near the upper left corner of the plot indicates that the model correctly predicts both the positive and negative outcomes across a wide range of probability thresholds.
The area under this curve provides a letter grade summary of model performance.
In this exercise, you will create an ROC curve from your logistic
regression model results and calculate the area under the ROC curve with
yardstick.
Your model results tibble, telecom_results
has been loaded into your session.
# Calculate metrics across thresholds
threshold_df <- telecom_results %>%
roc_curve(truth = canceled_service, .pred_yes)
# View results
threshold_df
## # A tibble: 246 x 3
## .threshold specificity sensitivity
## <dbl> <dbl> <dbl>
## 1 -Inf 0 1
## 2 0.0192 0 1
## 3 0.0397 0.00617 1
## 4 0.0477 0.0123 1
## 5 0.0589 0.0185 1
## 6 0.0639 0.0247 1
## 7 0.0684 0.0309 1
## 8 0.0706 0.0370 1
## 9 0.0793 0.0432 1
## 10 0.0797 0.0494 1
## # ... with 236 more rows
# Plot ROC curve
threshold_df %>%
autoplot()
# Calculate ROC AUC
telecom_roc_auc <- telecom_results %>%
roc_auc(
truth = canceled_service, .pred_yes)
telecom_roc_auc
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.739
Nice work! The area under the ROC curve is 0.739. This indicates that your model gets a C in terms of overall performance. This is mainly due to the low sensitivity of the model.
The last_fit()
function is designed to streamline the modeling workflow in tidymodels
. Instead of training your model on the training data and building a results tibble using the test data, last_fit()
accomplishes this with one function.
In this exercise, you will train the same logistic regression model as you fit in the previous exercises, except with the last_fit()
function.
Your data split object, telecom_split
, and model specification, logistic_model
, have been loaded into your session.
# Train model with last_fit()
telecom_last_fit <- logistic_model %>%
last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
split = telecom_split)
# View test set metrics
mets_stream <- telecom_last_fit %>%
collect_metrics()
mets_stream
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.689 Preprocessor1_Model1
## 2 roc_auc binary 0.739 Preprocessor1_Model1
Excellent work! Notice that you got the same area under the ROC curve as before, just with a lot less effort!
Using the last_fit()
modeling workflow also saves time
in collecting model predictions. Instead of manually creating a tibble
of model results, there are helper functions that extract this
information automatically.
In this exercise, you will use your trained model, telecom_last_fit
, to create a tibble of model results on the test dataset as well as calculate custom performance metrics.
Your trained model, telecom_last_fit
, has been loaded into this session.
# Collect predictions
last_fit_results <- telecom_last_fit %>%
collect_predictions()
# View results
last_fit_results
## # A tibble: 244 x 7
## id .pred_yes .pred_no .row .pred_class canceled_service .config
## <chr> <dbl> <dbl> <int> <fct> <fct> <chr>
## 1 train/tes~ 0.381 0.619 2 no yes Preprocesso~
## 2 train/tes~ 0.209 0.791 5 no no Preprocesso~
## 3 train/tes~ 0.230 0.770 6 no no Preprocesso~
## 4 train/tes~ 0.168 0.832 10 no no Preprocesso~
## 5 train/tes~ 0.422 0.578 15 no no Preprocesso~
## 6 train/tes~ 0.629 0.371 18 yes no Preprocesso~
## 7 train/tes~ 0.107 0.893 28 no no Preprocesso~
## 8 train/tes~ 0.541 0.459 31 yes yes Preprocesso~
## 9 train/tes~ 0.0192 0.981 37 no no Preprocesso~
## 10 train/tes~ 0.0939 0.906 40 no no Preprocesso~
## # ... with 234 more rows
# Custom metrics function
last_fit_metrics <- metric_set(accuracy, sens, spec, roc_auc)
# Calculate metrics
last_fit_metrics(last_fit_results,
truth = canceled_service,
estimate = .pred_class,
.pred_yes)
## # A tibble: 4 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.689
## 2 sens binary 0.463
## 3 spec binary 0.802
## 4 roc_auc binary 0.739
Great job! You were able to train and evaluate your logistic regression model in half the time! Notice that all performance metrics match the results you obtained in previous exercises.
In this exercise, you will use the last_fit()
function
to train a logistic regression model and evaluate its performance on the
test data by assessing the ROC curve and the area under the ROC curve.
Similar to previous exercises, you will predict canceled_service
in the telecom_df
data, but with an additional predictor variable to see if you can improve model performance.
The telecom_df
tibble, telecom_split
, and logistic_model
objects from the previous exercises have been loaded into your workspace. The telecom_split
object contains the instructions for randomly splitting the telecom_df
tibble into training and test sets. The logistic_model
object is a parsnip
specification of a logistic regression model.
# Train a logistic regression model
logistic_fit <- logistic_model %>%
last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges + months_with_company,
split = telecom_split)
# Collect metrics
mets_comp <- logistic_fit %>%
collect_metrics()
mets_comp
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.746 Preprocessor1_Model1
## 2 roc_auc binary 0.802 Preprocessor1_Model1
# Collect model predictions
logistic_fit %>%
collect_predictions() %>%
# Plot ROC curve
roc_curve(truth = canceled_service, .pred_yes) %>%
autoplot()
Excellent work! The ROC curve shows that the logistic regression model performs better than a model that guesses at random (the dashed line in the plot). Adding the months_with_company predictor variable increased your area under the ROC curve from 0.739 in your previous model to 0.802!
Find out how to bake feature engineering pipelines with the recipes package. You’ll prepare numeric and categorical data to help machine learning algorithms optimize your predictions.
The first step in feature engineering is to specify a recipe
object with the recipe()
function and add data preprocessing steps with one or more step_*()
functions. Storing all of this information in a single recipe
object makes it easier to manage complex feature engineering pipelines and transform new data sources.
Use the R console to explore a recipe
object named telecom_rec
, which was specified using the telecom_training
data from the previous chapter and the code below.
telecom_rec <- recipe(canceled_service ~ ., data = telecom_df) %>% step_log(avg_call_mins, base = 10)
Both telecom_training
and telecom_rec
have been loaded into your session.
How many numeric and nominal predictor variables are encoded in the telecom_rec
object?
You got it! Based on the results from passing telecom_rec
to the summary()
function, you can see that 5 predictor variables were labeled as numeric and 3 as nominal by the recipe()
function.
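For reference, the check described above is a single call, mirroring the summary() usage that appears later in this chapter:
# View variable roles and data types encoded in the recipe
telecom_rec %>%
summary()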
In the previous chapter, you fit a logistic regression model using a subset of the predictor variables from the telecom_df
data. This dataset contains information on customers of a
telecommunications company and the goal is to predict whether they will
cancel their service.
In this exercise, you will use the recipes
package to apply a log transformation to the avg_call_mins
and avg_intl_mins
variables in the telecommunications data. This will reduce the range of
these variables and potentially make their distributions more
symmetric, which may increase the accuracy of your logistic regression
model.
# Specify feature engineering recipe
telecom_log_rec <- recipe(canceled_service ~ . ,
data = telecom_training) %>%
# Add log transformation step for numeric predictors
step_log(avg_call_mins, avg_intl_mins, base = 10)
# Print recipe object
telecom_log_rec
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 8
##
## Operations:
##
## Log transformation on avg_call_mins, avg_intl_mins
# View variable roles and data types
telecom_log_rec %>%
summary()
## # A tibble: 9 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 cellular_service nominal predictor original
## 2 avg_data_gb numeric predictor original
## 3 avg_call_mins numeric predictor original
## 4 avg_intl_mins numeric predictor original
## 5 internet_service nominal predictor original
## 6 contract nominal predictor original
## 7 months_with_company numeric predictor original
## 8 monthly_charges numeric predictor original
## 9 canceled_service nominal outcome original
Great job! You have created a recipe
object that assigned variable roles and data types to the outcome and predictor variables in the telecom_training
dataset. You also added instructions for applying a log transformation to the avg_call_mins
and avg_intl_mins
variables. Now it’s time to train your recipe and apply it to new data!
In the previous exercise, you created a recipe
object with instructions to apply a log transformation to the avg_call_mins
and avg_intl_mins
predictor variables in the telecommunications data.
The next step in the feature engineering process is to train your recipe
object using the training data. Then you will be able to apply your trained recipe
to both the training and test datasets in order to prepare them for use in model fitting and model evaluation.
Your recipe
object, telecom_log_rec
, and the telecom_training
and telecom_test
datasets have been loaded into your session.
# Train the telecom_log_rec object
telecom_log_rec_prep <- telecom_log_rec %>%
prep(training = telecom_training)
# View results
telecom_log_rec_prep
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 8
##
## Training data contained 731 data points and no missing data.
##
## Operations:
##
## Log transformation on avg_call_mins, avg_intl_mins [trained]
# Apply to training data
telecom_log_rec_prep %>%
bake(new_data = NULL)
## # A tibble: 731 x 9
## cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 single_line 10.3 2.42 1.74 fiber_optic
## 2 multiple_lines 9.4 2.49 2.17 fiber_optic
## 3 multiple_lines 10.2 2.60 2.06 fiber_optic
## 4 single_line 9.37 2.58 1.94 fiber_optic
## 5 multiple_lines 10.6 2.45 2.17 fiber_optic
## 6 multiple_lines 5.17 2.53 2.08 digital
## 7 multiple_lines 7.86 2.58 2.21 digital
## 8 single_line 8.67 1.97 2.12 fiber_optic
## 9 multiple_lines 9.24 2.59 2.15 fiber_optic
## 10 multiple_lines 11.0 2.59 1.89 fiber_optic
## # ... with 721 more rows, and 4 more variables: contract <fct>,
## # months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>
# Apply to test data
telecom_log_rec_prep %>%
bake(new_data = telecom_test)
## # A tibble: 244 x 9
## cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 single_line 9.04 2.53 1.94 fiber_optic
## 2 multiple_lines 8.05 2.52 2.09 digital
## 3 single_line 9.3 2.51 2.06 fiber_optic
## 4 multiple_lines 9.96 2.53 2.13 fiber_optic
## 5 single_line 6.69 2.55 1.96 digital
## 6 multiple_lines 4.11 2.57 1.81 digital
## 7 single_line 7.31 2.39 2.08 digital
## 8 multiple_lines 6.79 2.71 2.15 digital
## 9 single_line 6.64 2.04 2.13 digital
## 10 multiple_lines 9.67 2.55 2.23 fiber_optic
## # ... with 234 more rows, and 4 more variables: contract <fct>,
## # months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>
Great work! You successfully trained your recipe
to be able to transform new data sources and applied it to the training and test datasets. Notice that the avg_call_mins
and avg_intl_mins
variables have been log transformed in the test dataset!
The power of the recipes
package is that you can include multiple preprocessing steps in a single recipe
object. These steps will be carried out in the order they are entered with the step_*()
functions.
In this exercise, you will build upon your feature engineering from
the last exercise. In addition to removing correlated predictors, you
will create a recipe
object that also normalizes all numeric predictors in the telecommunications data.
The telecom_training
and telecom_test
datasets have been loaded into your session.
# Specify a recipe object
telecom_norm_rec <- recipe(canceled_service ~ .,
data = telecom_training) %>%
# Remove correlated variables
step_corr(all_numeric(), threshold = 0.8) %>%
# Normalize numeric predictors
step_normalize(all_numeric())
# Train the recipe
telecom_norm_rec_prep <- telecom_norm_rec %>%
prep(training = telecom_training)
# Apply to test data
telecom_norm_rec_prep %>%
bake(new_data = telecom_test)
## # A tibble: 244 x 8
## cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
## <fct> <dbl> <dbl> <dbl> <fct>
## 1 single_line 0.405 -0.132 -0.643 fiber_optic
## 2 multiple_lines -0.112 -0.239 0.430 digital
## 3 single_line 0.541 -0.265 0.177 fiber_optic
## 4 multiple_lines 0.885 -0.0789 0.871 fiber_optic
## 5 single_line -0.822 0.0807 -0.548 digital
## 6 multiple_lines -2.17 0.333 -1.40 digital
## 7 single_line -0.499 -1.33 0.335 digital
## 8 multiple_lines -0.770 2.24 1.03 digital
## 9 single_line -0.848 -3.14 0.871 digital
## 10 multiple_lines 0.734 0.0807 1.94 fiber_optic
## # ... with 234 more rows, and 3 more variables: contract <fct>,
## # months_with_company <dbl>, canceled_service <fct>
Great job! When you applied your trained recipe
to the telecom_test
data, it removed the monthly_charges
column, due to its large correlation with avg_data_gb
, and normalized the numeric predictor variables!
You are using the telecom_training
data to predict canceled_service
using avg_data_gb
and contract
as predictor variables.
## # A tibble: 4 x 3
## canceled_service avg_data_gb contract
## <chr> <dbl> <chr>
## 1 yes 7.78 month_to_month
## 2 yes 9.04 month_to_month
## 3 yes 5.08 one_year
## 4 no 8.05 two_year
In your feature engineering pipeline, you would like to create dummy variables from the contract
column and leave avg_data_gb
and canceled_service
as is.
Which step_*()
function from the options will correctly encode your recipe object?
Determine whether each step_*() specification from the exercise options would correctly encode your recipe object; a brief sketch of correct specifications follows the answer below.
Congratulations! The special selector functions are helpful for specifying feature engineering steps without having to type out all individual variables for processing.
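As an illustration of the idea (not the literal options from the exercise), a step that dummy-encodes only the contract column while leaving avg_data_gb and canceled_service untouched could be written by naming the column directly or with the selector functions used later in this chapter:
# Name the column directly
recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training) %>%
step_dummy(contract)
# Or use selectors: all nominal variables except the outcome
recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training) %>%
step_dummy(all_nominal(), -all_outcomes())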
The step_*()
functions within a recipe are carried out
in sequential order. It’s important to keep this in mind so that you
avoid unexpected results in your feature engineering pipeline!
In this exercise, you will combine different step_*()
functions into a single recipe
and see what effect the ordering of step_*()
functions has on the final result.
The telecom_training
and telecom_test
datasets have been loaded into this session.
telecom_recipe_1 <-
recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
# Train and apply telecom_recipe_1 on the test data
telecom_recipe_1 %>%
prep(training = telecom_training) %>%
bake(new_data = telecom_test)
## # A tibble: 244 x 4
## avg_data_gb canceled_service contract_one_year contract_two_year
## <dbl> <fct> <dbl> <dbl>
## 1 0.405 yes 0 0
## 2 -0.112 no 0 1
## 3 0.541 no 0 0
## 4 0.885 no 0 0
## 5 -0.822 no 0 0
## 6 -2.17 no 0 1
## 7 -0.499 no 1 0
## 8 -0.770 yes 1 0
## 9 -0.848 no 0 1
## 10 0.734 no 0 0
## # ... with 234 more rows
telecom_recipe_2 <-
recipe(canceled_service ~ avg_data_gb + contract, data = telecom_training) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes())
# Train and apply telecom_recipe_2 on the test data
telecom_recipe_2 %>%
prep(training = telecom_training) %>%
bake(new_data = telecom_test)
## # A tibble: 244 x 4
## avg_data_gb canceled_service contract_one_year contract_two_year
## <dbl> <fct> <dbl> <dbl>
## 1 0.405 yes -0.510 -0.467
## 2 -0.112 no -0.510 2.14
## 3 0.541 no -0.510 -0.467
## 4 0.885 no -0.510 -0.467
## 5 -0.822 no -0.510 -0.467
## 6 -2.17 no -0.510 2.14
## 7 -0.499 no 1.96 -0.467
## 8 -0.770 yes 1.96 -0.467
## 9 -0.848 no -0.510 2.14
## 10 0.734 no -0.510 -0.467
## # ... with 234 more rows
Great job! Notice that telecom_recipe_1
produced [0, 1] values in the dummy variable columns while telecom_recipe_2
produced dummy variables which were then normalized! The predictor contract_two_year
created by telecom_recipe_2
is -0.467 instead of 0 and 2.14 instead of 1 due to normalization. For
model interpretation, it’s best to normalize variables before creating
dummy variables. Also notice that since you only specified two predictor
variables in your model formula, the rest of the columns are ignored by
your recipe objects when transforming new data sources.
The recipes
package is designed to encode multiple
feature engineering steps into one object, making it easier to maintain
data transformations in a machine learning workflow.
In this exercise, you will train a feature engineering pipeline to prepare the telecommunications data for modeling.
The telecom_df
tibble, as well as your telecom_training
and telecom_test
datasets from the previous exercises, have been loaded into your workspace.
# Create a recipe that predicts canceled_service using the training data
telecom_recipe <- recipe(canceled_service ~ . , data = telecom_training) %>%
# Remove correlated predictors
step_corr(all_numeric(), threshold = 0.8) %>%
# Normalize numeric predictors
step_normalize(all_numeric(), - all_outcomes()) %>%
# Create dummy variables
step_dummy(all_nominal(), -all_outcomes())
# Train your recipe and apply it to the test data
telecom_recipe %>%
prep(training = telecom_training) %>%
bake(new_data = telecom_test)
## # A tibble: 244 x 9
## avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 0.405 -0.132 -0.643 -0.930 yes
## 2 -0.112 -0.239 0.430 0.656 no
## 3 0.541 -0.265 0.177 -0.335 no
## 4 0.885 -0.0789 0.871 1.09 no
## 5 -0.822 0.0807 -0.548 -1.09 no
## 6 -2.17 0.333 -1.40 1.53 no
## 7 -0.499 -1.33 0.335 0.973 no
## 8 -0.770 2.24 1.03 -0.850 yes
## 9 -0.848 -3.14 0.871 0.815 no
## 10 0.734 0.0807 1.94 -0.295 no
## # ... with 234 more rows, and 4 more variables:
## # cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## # contract_one_year <dbl>, contract_two_year <dbl>
Great job! You are now a feature engineering ninja! Transforming your training data for modeling is an important part of the machine learning process. In the next section, we will incorporate your feature engineering skills to the entire model fitting process for the telecommunications data.
To incorporate feature engineering into the modeling process, the training and test datasets must be preprocessed before the model fitting stage. With the new skills you have learned in this chapter, you will be able to use all of the available predictor variables in the telecommunications data to train your logistic regression model.
In this exercise, you will create a feature engineering pipeline on the telecommunications data and use it to transform the training and test datasets.
The telecom_training
and telecom_test
datasets as well as your logistic regression model specification, logistic_model
, have been loaded into your session.
telecom_recipe <- recipe(canceled_service ~ ., data = telecom_training) %>%
# Remove correlated predictors
step_corr(all_numeric(), threshold = 0.8) %>%
# Log transform numeric predictors
step_log(all_numeric(), base = 10) %>%
# Normalize numeric predictors
step_normalize(all_numeric()) %>%
# Create dummy variables
step_dummy(all_nominal(), -all_outcomes())
# Train recipe
telecom_recipe_prep <- telecom_recipe %>%
prep(training = telecom_training)
# Transform training data
telecom_training_prep <- telecom_recipe_prep %>%
bake(new_data = NULL)
# Transform test data
telecom_test_prep <- telecom_recipe_prep %>%
bake(new_data = telecom_test)
telecom_test_prep
## # A tibble: 244 x 9
## avg_data_gb avg_call_mins avg_intl_mins months_with_company canceled_service
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 0.472 -0.00903 -0.497 -0.457 yes
## 2 0.0145 -0.108 0.513 0.722 no
## 3 0.583 -0.133 0.303 0.214 no
## 4 0.853 0.0395 0.849 0.868 no
## 5 -0.715 0.182 -0.393 -0.832 no
## 6 -2.63 0.397 -1.48 0.990 no
## 7 -0.365 -1.29 0.436 0.831 no
## 8 -0.656 1.73 0.960 -0.324 yes
## 9 -0.744 -4.59 0.849 0.779 no
## 10 0.737 0.182 1.54 0.243 no
## # ... with 234 more rows, and 4 more variables:
## # cellular_service_single_line <dbl>, internet_service_digital <dbl>,
## # contract_one_year <dbl>, contract_two_year <dbl>
Excellent work! You have preprocessed your training and test datasets with your recipe
object and are now ready to use them for the model fitting and
evaluation steps. Looking at the transformed test dataset, you can see
that your feature engineering steps have been applied correctly.
You have preprocessed your training and test datasets in the previous exercise. Since you incorporated feature engineering into your modeling workflow, you are able to use all of the predictor variables available in the telecommunications data!
The next step is training your logistic regression model and using it to obtain predictions on your new preprocessed test dataset.
Your preprocessed training and test datasets, telecom_training_prep
and telecom_test_prep
, as well as your model object, logistic_model
, have been loaded into your session.
# Train logistic model
logistic_fit <- logistic_model %>%
fit(canceled_service ~ ., data = telecom_training_prep)
# Obtain class predictions
class_preds <- predict(logistic_fit, new_data = telecom_test_prep,
type = 'class')
# Obtain estimated probabilities
prob_preds <- predict(logistic_fit, new_data = telecom_test_prep,
type = 'prob')
# Combine test set results
telecom_results <- telecom_test_prep %>%
select(canceled_service) %>%
bind_cols(class_preds, prob_preds)
telecom_results
## # A tibble: 244 x 4
## canceled_service .pred_class .pred_yes .pred_no
## <fct> <fct> <dbl> <dbl>
## 1 yes yes 0.625 0.375
## 2 no no 0.00893 0.991
## 3 no no 0.330 0.670
## 4 no no 0.222 0.778
## 5 no no 0.380 0.620
## 6 no no 0.0232 0.977
## 7 no no 0.0109 0.989
## 8 yes no 0.207 0.793
## 9 no no 0.000135 1.00
## 10 no no 0.249 0.751
## # ... with 234 more rows
Good job! You have created a tibble of model results on the test
dataset with the actual outcome variable value, predicted outcome
values, and estimated probabilities of the positive and negative
classes. Now you can evaluate the performance of your model with yardstick.
In this exercise, you will use yardstick
metric functions to evaluate your model’s performance on the test dataset.
When you fit a logistic regression model to the telecommunications data in Chapter 2, you predicted canceled_service
using avg_call_mins
, avg_intl_mins
, and monthly_charges.
The sensitivity of your model was 0.463 while the specificity was 0.802.
Now that you have incorporated all available predictor variables using feature engineering, you can compare your new model’s performance to your previous results.
Your model results, telecom_results
, have been loaded into your session.
# Create a confusion matrix
telecom_results %>%
conf_mat(truth = canceled_service, estimate = .pred_class)
## Truth
## Prediction yes no
## yes 52 25
## no 30 137
# Calculate sensitivity
sens_tel <- telecom_results %>%
sens(truth = canceled_service, estimate = .pred_class)
sens_tel
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 sens binary 0.634
# Calculate specificity
spec_tel <- telecom_results %>%
spec(truth = canceled_service, estimate = .pred_class)
spec_tel
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 spec binary 0.846
# Plot ROC curve
telecom_results %>%
roc_curve(truth = canceled_service, .pred_yes) %>%
autoplot()
Fantastic work! You have really come a long way in developing your modeling skills with tidymodels
!
From the results of your metric calculations, using feature engineering
and incorporating all predictor variables increased your model’s
sensitivity to 0.634, up from 0.463, and specificity to 0.846, up from
0.802!
Now it’s time to streamline the modeling process using workflows and fine-tune models with cross-validation and hyperparameter tuning. You’ll learn how to tune a decision tree classification model to predict whether a bank’s customers are likely to default on their loan.
The workflows
package provides the ability to bundle parsnip
models and recipe
objects into a single modeling workflow
object. This makes managing a machine learning project much easier and
removes the need to keep track of multiple modeling objects.
In this exercise, you will be working with the loans_df
dataset, which contains financial information on consumer loans at a bank. The outcome variable in this data is loan_default.
You will create a decision tree model object and specify a feature engineering pipeline for the loan data. The loans_df
tibble has been loaded into your session.
# Create data split object
loans_split <- initial_split(loans_df,
strata = loan_default)
# Build training data
loans_training <- loans_split %>%
training()
# Build test data
loans_test <- loans_split %>%
testing()
# Check for correlated predictors
loans_training %>%
# Select numeric columns
select_if(is.numeric) %>%
# Calculate correlation matrix
cor()
## loan_amount interest_rate installment annual_income
## loan_amount 1.00000000 0.03184287 0.93260718 0.35677596
## interest_rate 0.03184287 1.00000000 0.07582908 -0.06730054
## installment 0.93260718 0.07582908 1.00000000 0.29437125
## annual_income 0.35677596 -0.06730054 0.29437125 1.00000000
## debt_to_income 0.11574276 0.18579065 0.17944233 -0.20516588
## debt_to_income
## loan_amount 0.1157428
## interest_rate 0.1857907
## installment 0.1794423
## annual_income -0.2051659
## debt_to_income 1.0000000
Great work! You have created your training and test datasets and discovered that loan_amount
and installment
are highly correlated predictor variables. To remove one of these predictors, you will have to incorporate step_corr()
into your feature engineering pipeline for this data.
Now that you have created your training and test datasets, the next
step is to specify your model and feature engineering pipeline. These
are the two components that are needed to create a workflow
object for the model training process.
In this exercise, you will define a decision tree model object with decision_tree()
and a recipe
specification with the recipe()
function.
Your loans_training
data has been loaded into this session.
dt_model <- decision_tree() %>%
# Specify the engine
set_engine('rpart') %>%
# Specify the mode
set_mode('classification')
# Build feature engineering pipeline
loans_recipe <- recipe(loan_default ~ .,
data = loans_training) %>%
# Correlation filter
step_corr(all_numeric(), threshold = 0.85) %>%
# Normalize numeric predictors
step_normalize(all_numeric()) %>%
# Create dummy variables
step_dummy(all_nominal(), -all_outcomes())
loans_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 7
##
## Operations:
##
## Correlation filter on all_numeric()
## Centering and scaling for all_numeric()
## Dummy variables from all_nominal(), -all_outcomes()
Nice work! Now that you have your model and feature engineering steps specified, you can create a workflow
object for model training.
workflow
objects simplify the modeling process in tidymodels.
With workflows
, it’s possible to train a parsnip
model and recipe
object at the same time.
In this exercise, you will combine your decision tree model and feature engineering recipe
into a single workflow
object and perform model fitting and evaluation.
Your model object, dt_model
, recipe
object, loans_recipe
, and data split, loans_split
have been loaded into this session.
# Create a workflow
loans_dt_wkfl <- workflow() %>%
# Include the model object
add_model(dt_model) %>%
# Include the recipe object
add_recipe(loans_recipe)
# Train the workflow
loans_dt_wkfl_fit <- loans_dt_wkfl %>%
last_fit(split = loans_split)
# Calculate performance metrics on test data
pmetrics_loans <- loans_dt_wkfl_fit %>%
collect_metrics()
pmetrics_loans
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.785 Preprocessor1_Model1
## 2 roc_auc binary 0.817 Preprocessor1_Model1
Good job! You have trained a workflow
with last_fit()
that created training and test datasets, trained and applied your
recipe, fit your decision tree model to the training data and calculated
performance metrics on the test data all with just a few lines of code!
The model performed really well, with an area under the ROC curve of
0.817.
Cross validation is a method that uses training data to provide multiple estimates of model performance. When trying different model types on your data, it is important to study their performance profile to help decide which model type performs consistently well.
In this exercise, you will perform cross validation with your decision tree model workflow
to explore its performance.
The training data, loans_training
, and your workflow object, loans_dt_wkfl
, have been loaded into your session.
# Create cross validation folds
set.seed(290)
loans_folds <- vfold_cv(loans_training, v = 5,
strata = loan_default)
# Create custom metrics function
loans_metrics <- metric_set(roc_auc, sensitivity, specificity)
# Fit resamples
loans_dt_rs <- loans_dt_wkfl %>%
fit_resamples(resamples = loans_folds,
metrics = loans_metrics)
# View performance metrics
loans_dt_metrics <- loans_dt_rs %>%
collect_metrics()
loans_dt_metrics
## # A tibble: 3 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.799 5 0.0256 Preprocessor1_Model1
## 2 sens binary 0.683 5 0.0522 Preprocessor1_Model1
## 3 spec binary 0.838 5 0.0157 Preprocessor1_Model1
Excellent work! You have used cross validation to evaluate the performance of your decision tree workflow. Across the 5 cross validation folds, the average area under the ROC curve was 0.799. The average sensitivity and specificity were 0.683 and 0.838, respectively.
Cross validation provides the ability to compare the performance profile of multiple model types. This is helpful in the early stages of modeling, when you are trying to determine which model type will perform best with your data.
In this exercise, you will perform cross validation on the loans_training
data using logistic regression and compare the results to your decision tree model.
The loans_folds
and loans_metrics
objects from the previous exercise have been loaded into your session. Your feature engineering recipe
from the previous section, loans_recipe
, has also been loaded.
logistic_model <- logistic_reg() %>%
# Specify the engine
set_engine('glm') %>%
# Specify the mode
set_mode('classification')
# Create workflow
loans_logistic_wkfl <- workflow() %>%
# Add model
add_model(logistic_model) %>%
# Add recipe
add_recipe(loans_recipe)
# Fit resamples
loans_logistic_rs <- loans_logistic_wkfl %>%
fit_resamples(resamples = loans_folds,
metrics = loans_metrics)
# View performance metrics
loans_logistic_metrics <- loans_logistic_rs %>%
collect_metrics()
loans_logistic_metrics
## # A tibble: 3 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.851 5 0.0288 Preprocessor1_Model1
## 2 sens binary 0.667 5 0.0575 Preprocessor1_Model1
## 3 spec binary 0.875 5 0.0262 Preprocessor1_Model1
Great job! For logistic regression, across the 5 cross validation folds, the average area under the ROC curve was 0.851. The average sensitivity and specificity were 0.667 and 0.875, respectively. ROC AUC and specificity are very close to the decision tree cross validation results. However, the decision tree model performed slightly better on sensitivity, with an average value of 0.683.
The benefit of the collect_metrics()
function is that it
returns a tibble of cross validation results. This makes it easy to
calculate custom summary statistics with the dplyr
package.
In this exercise, you will use dplyr
to explore the cross validation results of your decision tree and logistic regression models.
Your cross validation results, loans_dt_rs
and loans_logistic_rs
have been loaded into your session.
# Detailed cross validation results
dt_rs_results <- loans_dt_rs %>%
collect_metrics(summarize = FALSE)
# Explore model performance for decision tree
dt_rs_results %>%
group_by(.metric) %>%
summarize(min = min(.estimate),
median = median(.estimate),
max = max(.estimate))
## # A tibble: 3 x 4
## .metric min median max
## <chr> <dbl> <dbl> <dbl>
## 1 roc_auc 0.740 0.784 0.867
## 2 sens 0.569 0.667 0.84
## 3 spec 0.8 0.827 0.875
# Detailed cross validation results
logistic_rs_results <- loans_logistic_rs %>%
collect_metrics(summarize = FALSE)
# Explore model performance for logistic regression
logistic_rs_results %>%
group_by(.metric) %>%
summarize(min = min(.estimate),
median = median(.estimate),
max = max(.estimate))
## # A tibble: 3 x 4
## .metric min median max
## <chr> <dbl> <dbl> <dbl>
## 1 roc_auc 0.783 0.868 0.935
## 2 sens 0.569 0.588 0.86
## 3 spec 0.802 0.9 0.938
Great job! Both models have similar average values across all metrics. However, logistic regression tends to have a wider range of values on every metric. This provides evidence that a decision tree model may produce more stable prediction accuracy on the loans dataset.
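One way to quantify that stability claim is to compute the spread of the fold-level estimates for each model. A minimal sketch, assuming the dt_rs_results and logistic_rs_results tibbles from above are still available:
# Range (max - min) of the fold-level estimates for each model and metric
bind_rows(dt_rs_results %>% mutate(model = 'decision_tree'),
          logistic_rs_results %>% mutate(model = 'logistic_reg')) %>%
  group_by(model, .metric) %>%
  summarize(range = max(.estimate) - min(.estimate),
            .groups = 'drop')
A smaller range across folds suggests more stable performance on new data.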
Hyperparameter tuning is a method for fine-tuning the performance of your models. In most cases, the default hyperparameter values of parsnip model objects will not be the optimal values for maximizing model performance.
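If you are unsure what those defaults are, translate() shows how a parsnip specification maps to the underlying engine call; any hyperparameter you do not set falls back to the engine's own default. A small sketch (output omitted):
# Print how the decision tree specification maps to the rpart engine call
decision_tree() %>%
  set_engine('rpart') %>%
  set_mode('classification') %>%
  translate()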
In this exercise, you will define a decision tree model with hyperparameters for tuning and create a tuning workflow object. Your decision tree workflow object, loans_dt_wkfl, has been loaded into your session.
# Set tuning hyperparameters
dt_tune_model <- decision_tree(cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()) %>%
# Specify engine
set_engine('rpart') %>%
# Specify mode
set_mode('classification')
# Create a tuning workflow
loans_tune_wkfl <- loans_dt_wkfl %>%
# Replace model
update_model(dt_tune_model)
loans_tune_wkfl
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: decision_tree()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_corr()
## * step_normalize()
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = tune()
## tree_depth = tune()
## min_n = tune()
##
## Computational engine: rpart
Good job! When you print your new workflow object, the decision tree hyperparameters now appear under the main arguments section.
The most common method of hyperparameter tuning is grid search. This method creates a tuning grid with unique combinations of hyperparameter values and uses cross validation to evaluate their performance. The goal of hyperparameter tuning is to find the optimal combination of values for maximizing model performance.
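Besides random grids, the dials package can also build a regular grid that crosses a fixed number of levels per hyperparameter. A minimal sketch using the dt_tune_model specification defined above:
# A regular grid is an alternative to grid_random():
# levels = 3 crosses 3 values of each tuned hyperparameter (27 rows here)
dt_regular_grid <- grid_regular(parameters(dt_tune_model),
                                levels = 3)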
In this exercise, you will create a random hyperparameter grid and tune your loans data decision tree model. Your cross validation folds, loans_folds, workflow object, loans_tune_wkfl, custom metrics function, loans_metrics, and the dt_tune_model specification have been loaded into your session.
# Hyperparameter tuning with grid search
set.seed(214)
dt_grid <- grid_random(parameters(dt_tune_model),
size = 5)
# Hyperparameter tuning
dt_tuning <- loans_tune_wkfl %>%
tune_grid(resamples = loans_folds,
grid = dt_grid,
metrics = loans_metrics)
# View results
dt_tuning %>%
collect_metrics()
## # A tibble: 15 x 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000758 14 39 roc_auc binary 0.839 5 0.0263
## 2 0.0000000758 14 39 sens binary 0.719 5 0.0465
## 3 0.0000000758 14 39 spec binary 0.808 5 0.0272
## 4 0.0243 5 34 roc_auc binary 0.775 5 0.0361
## 5 0.0243 5 34 sens binary 0.640 5 0.0577
## 6 0.0243 5 34 spec binary 0.910 5 0.0196
## 7 0.00000443 11 8 roc_auc binary 0.816 5 0.0276
## 8 0.00000443 11 8 sens binary 0.683 5 0.0491
## 9 0.00000443 11 8 spec binary 0.793 5 0.0123
## 10 0.000000600 3 5 roc_auc binary 0.787 5 0.0374
## 11 0.000000600 3 5 sens binary 0.640 5 0.0634
## 12 0.000000600 3 5 spec binary 0.893 5 0.0296
## 13 0.00380 5 36 roc_auc binary 0.828 5 0.0268
## 14 0.00380 5 36 sens binary 0.699 5 0.0545
## 15 0.00380 5 36 spec binary 0.836 5 0.0271
## # ... with 1 more variable: .config <chr>
Good work! Since you have 5 random hyperparameter combinations and 3 performance metrics, there are 15 rows in your summarized tuning results. Each row shows the average of the 5 cross validation estimates for one metric and hyperparameter combination.
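Because collect_metrics() returns a tibble, you can also narrow the summary to a single metric and rank the candidate models with dplyr. A quick sketch:
# Rank the hyperparameter combinations by their average ROC AUC
dt_tuning %>%
  collect_metrics() %>%
  filter(.metric == 'roc_auc') %>%
  arrange(desc(mean))
The show_best() function used later in this chapter produces a similar ranking for you.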
The collect_metrics() function is able to produce a detailed tibble of tuning results from a tuning object. Since this function returns a tibble, it works well with the dplyr package for further data exploration and analysis. In this exercise, you will explore your tuning results, dt_tuning, to gain further insights into your hyperparameter tuning. Your dt_tuning object has been loaded into this session.
# Collect detailed tuning results
dt_tuning_results <- dt_tuning %>%
collect_metrics(summarize = FALSE)
# Explore detailed ROC AUC results for each fold
dt_tuning_results %>%
filter(.metric == "roc_auc") %>%
group_by(id) %>%
summarize(min_roc_auc = min(.estimate),
median_roc_auc = median(.estimate),
max_roc_auc = max(.estimate))
## # A tibble: 5 x 4
## id min_roc_auc median_roc_auc max_roc_auc
## <chr> <dbl> <dbl> <dbl>
## 1 Fold1 0.694 0.756 0.789
## 2 Fold2 0.725 0.740 0.799
## 3 Fold3 0.732 0.792 0.813
## 4 Fold4 0.869 0.891 0.901
## 5 Fold5 0.85 0.878 0.912
Excellent work! You have now had the chance to explore the detailed results of your decision tree hyperparameter tuning. The next step will be selecting the best combination and finalizing your workflow object!
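If you prefer a visual summary before selecting, tuning results also have an autoplot() method that plots each performance metric against the tuned hyperparameter values. A one-line sketch:
# Plot each performance metric against the tuned hyperparameter values
autoplot(dt_tuning)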
To incorporate hyperparameter tuning into your modeling process, an optimal hyperparameter combination must be selected based on the average value of a performance metric. Then you will be able to finalize your tuning workflow and fit your final model.
In this exercise, you will explore the best performing models from your hyperparameter tuning and finalize your tuning workflow object. The dt_tuning and loans_tune_wkfl objects from your previous session have been loaded into your environment.
# Display 5 best performing models
dt_tuning %>%
show_best(metric = 'roc_auc', n = 5)
## # A tibble: 5 x 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000758 14 39 roc_auc binary 0.839 5 0.0263
## 2 0.00380 5 36 roc_auc binary 0.828 5 0.0268
## 3 0.00000443 11 8 roc_auc binary 0.816 5 0.0276
## 4 0.000000600 3 5 roc_auc binary 0.787 5 0.0374
## 5 0.0243 5 34 roc_auc binary 0.775 5 0.0361
## # ... with 1 more variable: .config <chr>
# Select based on best performance
best_dt_model <- dt_tuning %>%
# Choose the best model based on roc_auc
select_best(metric = 'roc_auc')
# Finalize your workflow
final_loans_wkfl <- loans_tune_wkfl %>%
finalize_workflow(best_dt_model)
final_loans_wkfl
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: decision_tree()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_corr()
## * step_normalize()
## * step_dummy()
##
## -- Model -----------------------------------------------------------------------
## Decision Tree Model Specification (classification)
##
## Main Arguments:
## cost_complexity = 7.58290839567418e-08
## tree_depth = 14
## min_n = 39
##
## Computational engine: rpart
Good job! When you print your finalized workflow object, the optimal hyperparameter combination is displayed in the main arguments section of the output. Your workflow is now ready for model fitting and prediction on new data sources!
Congratulations on successfully tuning your decision tree model and finalizing your workflow! Your final_loans_wkfl object can now be used for model training and prediction on new data sources. In this last exercise, you will train your finalized workflow on the entire loans_training dataset and evaluate its performance on the loans_test data. The final_loans_wkfl and loans_split objects have been loaded into your session.
# Train finalized decision tree workflow
loans_final_fit <- final_loans_wkfl %>%
last_fit(split = loans_split)
# View performance metrics
loans_final_fit_metrics <- loans_final_fit %>%
collect_metrics()
loans_final_fit_metrics
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.772 Preprocessor1_Model1
## 2 roc_auc binary 0.838 Preprocessor1_Model1
# Create an ROC curve
loans_final_fit %>%
# Collect predictions
collect_predictions() %>%
# Calculate ROC curve metrics
roc_curve(truth = loan_default, .pred_yes) %>%
# Plot the ROC curve
autoplot()
Great job! You were able to train your finalized workflow with last_fit() and generate predictions on the test data. The tuned decision tree model produced an area under the ROC curve of 0.838. That’s a great model! The ROC curve shows that the sensitivity and specificity remain high across a wide range of probability threshold values.
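To score genuinely new data after last_fit(), you can pull the trained workflow out of the results and call predict() on it. A minimal sketch, assuming a recent enough version of tune that provides extract_workflow(); testing(loans_split) stands in here for any new dataset:
# Extract the workflow that last_fit() trained on the full training data
trained_loans_wkfl <- loans_final_fit %>%
  extract_workflow()
# Generate class probability predictions for new data
trained_loans_wkfl %>%
  predict(new_data = testing(loans_split), type = 'prob')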