Machine Learning in the Tidyverse
Dmitriy Gorenshteyn - DataCamp
Course Description
This course will teach you to leverage the tools in the “tidyverse” to generate, explore, and evaluate machine learning models. Using a combination of the tidyr and purrr packages, you will build a foundation for working with complex model objects in a “tidy” way. You will also learn how to leverage the broom package to explore your resulting models. You will then be introduced to the tools in the test-train-validate workflow, which will empower you to evaluate the performance of both classification and regression models as well as provide the necessary information to optimize model performance via hyperparameter tuning.
1 Foundations of “tidy” Machine Learning
This chapter will introduce you to the backbone of machine learning in
the tidyverse, the List Column Workflow (LCW). The LCW will empower you
to work with many models in one dataframe.
This chapter will also introduce you to the fundamentals of the broom package for exploring your models.
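As a quick preview, here is a minimal sketch of that workflow on the built-in mtcars data (a toy illustration, not one of the course exercises): nest the data by a grouping variable, map a model over the nested column, and tidy the results with broom.
library(tidyverse)
library(broom)

mtcars %>%
  group_by(cyl) %>%
  nest() %>%                                           # one row per cylinder group, data held in a list column
  mutate(model = map(data, ~lm(mpg ~ wt, data = .x)),  # fit one model per group
         coef  = map(model, tidy)) %>%                 # tidy each model into a small data frame
  unnest(coef)                                         # expand back out to one row per coefficient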
1.1 "Tidy" machine learning
1.1.1 Nesting your data
In this course, you will work with a collection of economic and social
indicators for 77 countries over a period of 52 years. This data is
stored in the gapminder
data frame.
In this exercise, you will transform your gapminder
data into a nested data frame by using the first tool needed to build the foundation of tidy machine learning skills: nest()
.
Note: This is a more granular version than the dataset available from the gapminder package. This version is available in the dslabs package.
Explore the gapminder data frame using head().
library(dslabs)
data(gapminder)
gapminder = gapminder %>% mutate(gdpPercap = gdp/population) %>% select(-gdp)
gapminder = readRDS("/Users/apple/Documents/Rstudio/DataCamp/MachineLearningintheTidyverse/gapminder.rds")
# Explore gapminder
head(gapminder)
## # A tibble: 6 × 7
## country year infant_mortality life_expectancy fertility population gdpPercap
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Algeria 1960 148. 47.5 7.65 11124892 1242
## 2 Algeria 1961 148. 48.0 7.65 11404859 1047
## 3 Algeria 1962 148. 48.6 7.65 11690152 820
## 4 Algeria 1963 148. 49.1 7.65 11985130 1075
## 5 Algeria 1964 149. 49.6 7.65 12295973 1109
## 6 Algeria 1965 149. 50.1 7.66 12626953 1147
Use group_by() and nest() to nest your data frames by country, and save this as gap_nested.
# Prepare the nested data frame gap_nested
library(tidyverse)
gap_nested <- gapminder %>%
  select(country, year, infant_mortality, life_expectancy, fertility, population, gdpPercap) %>%
  group_by(country) %>%
  nest()
Explore gap_nested and note the new complex column, data, containing tibbles.
# Explore gap_nested
head(gap_nested)
## # A tibble: 6 × 2
## # Groups: country [6]
## country data
## <fct> <list>
## 1 Algeria <tibble [52 × 6]>
## 2 Argentina <tibble [52 × 6]>
## 3 Australia <tibble [52 × 6]>
## 4 Austria <tibble [52 × 6]>
## 5 Bangladesh <tibble [52 × 6]>
## 6 Belgium <tibble [52 × 6]>
You’re off to a great start! Notice that each row in gap_nested
contains a tibble.
1.1.2 Unnesting your data
As you’ve seen in the previous exercise, a nested data frame is simply a way to reshape your data: it essentially takes each group_by() group and packages it into its own row.
In the same way you can use the nest()
function to break your data into nested chunks, you can use the unnest()
function to expand the data frames that are nested in these chunks.
Apply unnest() to the gap_nested data frame to expand the nested column back into a new data frame, and save this as gap_unnested.
# Create the unnested data frame called gap_unnested
gap_unnested <- gap_nested %>%
  unnest()
## Warning: `cols` is now required when using unnest().
## Please use `cols = c(data)`
Confirm that gapminder and gap_unnested are identical by using the identical() function.
# Confirm that your data was not modified
identical(gapminder, gap_unnested)
## [1] FALSE
Great work! Notice that this transformation only reshaped your data, it did not modify it. (identical() returns FALSE here most likely because gap_unnested still carries the group_by(country) grouping; the underlying values are unchanged.)
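A minimal sketch to check this idea (it assumes the grouping attribute is the only difference between the two data frames):
# Drop the grouping that nest()/unnest() preserved, then compare again
identical(gapminder, ungroup(gap_unnested))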
1.1.3 Explore a nested cell
In the first exercise, you successfully created a nested data frame gap_nested
. The data
column contains tibbles for each country. In this exercise, you will explore one of these nested chunks.
Extract the data for Algeria (the first element of the data column) and save it as algeria_df.
# Extract the data of Algeria
algeria_df <- gap_nested$data[[1]]
Calculate the minimum, maximum and mean of the population vector in algeria_df using min(), max() and mean().
# Calculate the minimum of the population vector
min(algeria_df$population)
## [1] 11124892
# Calculate the maximum of the population vector
max(algeria_df$population)
## [1] 36717132
# Calculate the mean of the population vector
mean(algeria_df$population)
## [1] 23129438
Well done! You can see that working with a single chunk in a nested data
frame is identical to working with regular data frames. In the next
section you will learn how to scale this approach to work on a vector of
nested data frames using the map
family of functions.
1.2 The map family of functions
1.2.1 Mapping your data
In combination with mutate(), you can use map() to append the results of your calculation to a data frame. Since map() always returns a list, you must use unnest() to extract this information into a numeric vector.
Here you will explore this functionality by calculating the mean population of each country in the gapminder
dataset.
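Before applying this to gapminder, here is a tiny illustration (a toy example, not part of the exercise) of the difference in return types:
library(purrr)

# map() always returns a list, even when every result is a single number
map(1:3, ~.x^2)

# map_dbl() returns a plain numeric vector instead: 1 4 9
map_dbl(1:3, ~.x^2)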
Use map() to apply the mean() function to calculate the population mean for each country, appending the result as a new list column called mean_pop with mutate().
# Calculate the mean population for each country
pop_nested <- gap_nested %>%
  mutate(mean_pop = map(data, ~mean(.x$population)))
Take a look at pop_nested.
# Take a look at pop_nested
head(pop_nested)
## # A tibble: 6 × 3
## # Groups: country [6]
## country data mean_pop
## <fct> <list> <list>
## 1 Algeria <tibble [52 × 6]> <dbl [1]>
## 2 Argentina <tibble [52 × 6]> <dbl [1]>
## 3 Australia <tibble [52 × 6]> <dbl [1]>
## 4 Austria <tibble [52 × 6]> <dbl [1]>
## 5 Bangladesh <tibble [52 × 6]> <dbl [1]>
## 6 Belgium <tibble [52 × 6]> <dbl [1]>
Use unnest() to convert the mean_pop list into a numeric column and save this as the pop_mean data frame.
# Extract the mean_pop value by using unnest
pop_mean <- pop_nested %>%
  unnest(mean_pop)
Take a look at pop_mean using head().
# Take a look at pop_mean
head(pop_mean)
## # A tibble: 6 × 3
## # Groups: country [6]
## country data mean_pop
## <fct> <list> <dbl>
## 1 Algeria <tibble [52 × 6]> 23129438.
## 2 Argentina <tibble [52 × 6]> 30783053.
## 3 Australia <tibble [52 × 6]> 16074837.
## 4 Austria <tibble [52 × 6]> 7746272.
## 5 Bangladesh <tibble [52 × 6]> 97649407.
## 6 Belgium <tibble [52 × 6]> 9983596.
Excellent job! Here you can see how to leverage the map()
function to apply a desired function and store it. In the next exercise you will see how this can be done more concisely using map_dbl()
.
1.2.2 Expecting mapped output
When you know that the output of your mapped function is an expected type (here it is a numeric vector) you can leverage the map_*()
family of functions to explicitly try to return that object type instead of a list.
Here you will again calculate the mean population of each country, but instead, you will use map_dbl()
to explicitly append the numeric vector returned by mean()
to your data frame.
Build the pop_mean data frame using the map_dbl() function to calculate the population mean for each nested data frame.
# Calculate mean population and store result as a double
pop_mean <- gap_nested %>%
  mutate(mean_pop = map_dbl(data, ~mean(.x$population)))
Take a look at the pop_mean data frame using head().
# Take a look at pop_mean
head(pop_mean)
## # A tibble: 6 × 3
## # Groups: country [6]
## country data mean_pop
## <fct> <list> <dbl>
## 1 Algeria <tibble [52 × 6]> 23129438.
## 2 Argentina <tibble [52 × 6]> 30783053.
## 3 Australia <tibble [52 × 6]> 16074837.
## 4 Austria <tibble [52 × 6]> 7746272.
## 5 Bangladesh <tibble [52 × 6]> 97649407.
## 6 Belgium <tibble [52 × 6]> 9983596.
You’re doing great! With the nest()
and map_*()
functions in hand you now have the foundation for building multiple models.
1.2.3 Mapping many models
The gap_nested
data frame available in your workspace contains the gapminder dataset nested by country.
You will use this data to build a linear model for each country to predict life expectancy using the year feature.
Note: The term feature is synonymous with the terms variable or predictor. It refers to an attribute of your data that can be used to build a machine learning model.
Build a linear model for each country predicting life_expectancy using the year feature. Use the lm() function for this and save the new data frame containing the models as gap_models.
# Build a linear model for each country
gap_models <- gap_nested %>%
  mutate(model = map(data, ~lm(formula = life_expectancy~year, data = .x)))
Extract the first model (Algeria) from gap_models and save it as algeria_model.
# Extract the model for Algeria
algeria_model <- gap_models$model[[1]]
View the summary of algeria_model using summary().
# View the summary for the Algeria model
summary(algeria_model)
##
## Call:
## lm(formula = life_expectancy ~ year, data = .x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.044 -1.577 -0.543 1.700 3.843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.197e+03 3.994e+01 -29.96 <2e-16 ***
## year 6.349e-01 2.011e-02 31.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.177 on 50 degrees of freedom
## Multiple R-squared: 0.9522, Adjusted R-squared: 0.9513
## F-statistic: 996.2 on 1 and 50 DF, p-value: < 2.2e-16
Fantastic work! You’ve just built 77 models for 77 countries with just a few lines of code.
In the next series of exercises, you will learn how you can extract information from summary()
in a tidy fashion.
1.3 Tidy your models with broom
1.3.1 The three ways to tidy your model
Below are the descriptions of the three functions in the broom package. Which ones are correct?
- A) tidy() returns the statistical findings of the model (such as coefficients)
- B) glance() returns a concise one-row summary of the model
- C) augment() adds prediction columns to the data being modeled
-
Only A
-
A and C
-
None are correct
-
All are correct
Great job! These are the three main functions that broom
provides for tidying the output of models.
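For reference, here is a minimal sketch of all three functions applied to a simple model (a toy example on the built-in mtcars data, not the course's gapminder models):
library(broom)

fit <- lm(mpg ~ wt, data = mtcars)

tidy(fit)    # one row per coefficient: estimate, std.error, statistic, p.value
glance(fit)  # a single row of model-level statistics: r.squared, AIC, BIC, ...
augment(fit) # the modeled data with .fitted, .resid and other per-observation columns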
1.3.2 Extracting model statistics tidily
In this exercise, you will use the tidy()
and glance()
functions to extract information from algeria_model
in a tidy manner.
For a linear model, tidy() extracts the model coefficients while glance() returns the model statistics such as the R².
Extract the coefficients of algeria_model using tidy().
library(broom)
# Extract the coefficients of the algeria_model as a data frame
tidy(algeria_model)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1197. 39.9 -30.0 1.32e-33
## 2 year 0.635 0.0201 31.6 1.11e-34
Extract the statistics of algeria_model using glance().
# Extract the statistics of the algeria_model as a data frame
glance(algeria_model)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.952 0.951 2.18 996. 1.11e-34 1 -113. 232. 238.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Great job! As you can see, tidy() and glance() both return data frames; this feature can be very useful for managing the results of many models in one data frame.
1.3.3 Augmenting your data
From the results of glance(), you learned that the linear model fits the Algeria data well, with an adjusted R² of about 0.95. The augment() function can help you explore this fit by appending the predictions to the original data.
Here you will leverage this to compare the predicted values of life_expectancy
with the original ones based on the year
feature.
Build the augmented data frame algeria_fitted using augment().
# Build the augmented data frame
algeria_fitted <- augment(algeria_model)
Visualize the fit of the model with respect to year by plotting life_expectancy as points and .fitted as a line.
# Compare the predicted values with the actual values of life expectancy
algeria_fitted %>%
  ggplot(aes(x = year)) +
geom_point(aes(y = life_expectancy)) +
geom_line(aes(y = .fitted), color = "red")
Congratulations!
You’ve successfully completed Chapter 1. In the
next chapter you will see how you can leverage the tools you learned to
build, evaluate and explore the models you created for each country.
2 Multiple Models with broom
This chapter leverages the List Column Workflow to build and explore the attributes of 77 models. You will use the tools from the broom package to gain a multidimensional understanding of all of these models.
2.1 Coefficients across models
2.1.1 Tidy up the coefficients of your models
In this exercise you will leverage the list column workflow along with the tidy()
function from broom
to extract and explore the coefficients for the 77 models you built.
Remember the gap_models
data frame contains a model predicting life expectancy by year for 77 countries.
Use tidy() to append a column (coef) containing coefficient statistics for each model to the gap_models data frame and save it as model_coef_nested.
# Extract the coefficient statistics of each model into nested data frames
model_coef_nested <- gap_models %>%
  mutate(coef = map(model, ~tidy(.x)))
Use unnest() to extract these coefficients into your data frame and save the result as model_coef.
# Simplify the coef data frames for each model
model_coef <- model_coef_nested %>%
  unnest(coef)
# Plot a histogram of the coefficient estimates for year
model_coef %>%
  filter(term == "year") %>%
ggplot(aes(x = estimate)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Great job! Now that you have the slope for each model let’s explore their distribution.
2.1.2 What can we learn about these 77 countries?
Explore the model_coef
data frame you just created to answer the following question:
Which of the following conclusions can we make from the coefficients of our models?
-
About 10% of the 77 countries had a decrease of life expectancy between 1960 and 2011.
-
The United States experienced the fastest growth in life expectancy.
-
The majority of the 77 countries experienced a growth in average life expectancy between 1960 and 2011.
-
All of these conclusions are correct.
-
None of these conclusions are correct.
You got it! Based on these models we can conclude that 73 of the 77 countries experienced a growth in life expectancy during this time period.
2.2 The fit of many models
2.2.1 Glance at the fit of your models
In this exercise you will use glance()
to calculate how well the linear models fit the data for each country.
Append a column (fit) containing the fit statistics for each model to the gap_models data frame and save it as model_perf_nested.
# Extract the fit statistics of each model into data frames
model_perf_nested <- gap_models %>%
  mutate(fit = map(model, ~glance(.x)))
Use unnest() to extract the fit statistics of each model and save them as model_perf.
# Simplify the fit data frames for each model
model_perf <- model_perf_nested %>%
  unnest(fit)
Use head() to take a peek at model_perf.
# Look at the first six rows of model_perf
head(model_perf)
## # A tibble: 6 × 15
## # Groups: country [6]
## country data model r.squared adj.r.squared sigma statistic p.value df
## <fct> <list> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Algeria <tibble> <lm> 0.952 0.951 2.18 996. 1.11e-34 1
## 2 Argenti… <tibble> <lm> 0.984 0.984 0.431 3137. 8.78e-47 1
## 3 Austral… <tibble> <lm> 0.983 0.983 0.511 2905. 5.83e-46 1
## 4 Austria <tibble> <lm> 0.987 0.986 0.438 3702. 1.48e-48 1
## 5 Banglad… <tibble> <lm> 0.949 0.947 1.83 921. 7.10e-34 1
## 6 Belgium <tibble> <lm> 0.990 0.990 0.331 5094. 5.54e-52 1
## # … with 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
## # df.residual <int>, nobs <int>
Great job! You have successfully calculated the fit statistics for all 77 of your models. Next, we’ll explore these results.
2.2.2 Best and worst fitting models
In this exercise you will answer the following questions:
- Overall, how well do your models fit your data?
- Which are the best fitting models?
- Which models do not fit the data well?
# Plot a histogram of rsquared for the 77 models
model_perf %>%
  ggplot(aes(x = r.squared)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Extract the 4 best fitting models into best_fit.
# Extract the 4 best fitting models
best_fit <- model_perf %>%
  top_n(n = 4, wt = r.squared)
Extract the 4 models with the worst fit into worst_fit.
# Extract the 4 models with the worst fit
worst_fit <- model_perf %>%
  top_n(n = 4, wt = -r.squared)
Excellent work! You have now prepared two data frames, one containing
the four best fitting models and another the four worst fitting models.
In the next section we will use the augment()
function to explore these fits visually.
2.3 Inspect many models
2.3.1 Augment the fitted values of each model
In this exercise you will prepare your four best and worst fitting
models for further exploration by augmenting your model data with augment()
.
Build the best_augmented data frame by augmenting each model with augment() and simplifying the result with unnest(), starting from the best_fit data frame.
best_augmented <- best_fit %>%
  # Build the augmented data frame for each country model
mutate(augmented = map(model, ~augment(.x))) %>%
# Expand the augmented data frames
unnest(augmented)
Build the worst_augmented data frame in the same way, starting from the worst_fit data frame.
worst_augmented <- worst_fit %>%
  # Build the augmented data frame for each country model
mutate(augmented = map(model, ~augment(.x))) %>%
# Expand the augmented data frames
unnest(augmented)
You’re doing great! You now have the pieces necessary to visually explore the fits of these 8 models.
2.3.2 Explore your best and worst fitting models
Let’s explore your four best and worst fitting models by comparing the fitted lines with the actual values.
Visualize the fit of your four best fitting models with respect to year
by plotting both life_expectancy
as points and .fitted
as a line.
# Compare the predicted values with the actual values of life expectancy
# for the top 4 best fitting models
best_augmented %>%
  ggplot(aes(x = year)) +
geom_point(aes(y = life_expectancy)) +
geom_line(aes(y = .fitted), color = "red") +
facet_wrap(~country, scales = "free_y")
Visualize the fit of your four worst fitting models with respect to year
by plotting both life_expectancy
as points and .fitted
as a line.
# Compare the predicted values with the actual values of life expectancy
# for the top 4 worst fitting models
worst_augmented %>%
  ggplot(aes(x = year)) +
geom_point(aes(y = life_expectancy)) +
geom_line(aes(y = .fitted), color = "red") +
facet_wrap(~country, scales = "free_y")
Cool plots! You can see that a linear model does a great job for the best 4 fitting models but the worst 4 fitting models do not seem to have a linear relationship. You will work to improve this fit in the next series of exercises by incorporating additional features.
2.4 Improve the fit of your models
2.4.1 Build better models
Earlier you built a collection of simple models to fit life expectancy using the year feature. Your previous analysis showed that some of these models didn’t fit very well.
In this exercise you will build multiple regression models for each country using all available features. You may be interested in comparing the performance of the four worst fitting models, so their adjusted R² values are provided below:

Country | Adjusted R² |
---|---|
Botswana | -0.0060772 |
Lesotho | -0.0169851 |
Zambia | 0.1668999 |
Zimbabwe | 0.2083979 |
Build a linear model for each country predicting life_expectancy using every feature in the dataset.
# Build a linear model for each country using all features
gap_fullmodel <- gap_nested %>%
  mutate(model = map(data, ~lm(life_expectancy~., data = .x)))
Append a column (fit) containing fit statistics for each model and simplify this data frame; save it as fullmodel_perf.
fullmodel_perf <- gap_fullmodel %>%
  # Extract the fit statistics of each model into data frames
mutate(fit = map(model, ~glance(.x))) %>%
# Simplify the fit data frames for each model
unnest(fit)
View the adjusted R² in fullmodel_perf for the four countries from the worst_fit data frame.
# View the performance for the four countries with the worst fitting
# four simple models you looked at before
fullmodel_perf %>%
  filter(country %in% worst_fit$country) %>%
select(country, adj.r.squared)
## # A tibble: 77 × 2
## # Groups: country [77]
## country adj.r.squared
## <fct> <dbl>
## 1 Algeria 0.999
## 2 Argentina 0.999
## 3 Australia 0.994
## 4 Austria 0.996
## 5 Bangladesh 0.981
## 6 Belgium 0.996
## 7 Benin 0.996
## 8 Bolivia 0.999
## 9 Botswana 0.844
## 10 Brazil 0.999
## # … with 67 more rows
That was a tough one and you did great! You can see that the adjusted R² of each of the four worst performing simple models improved dramatically once all available features were used. (Note: the output above lists all 77 countries rather than just four because model_perf kept its group_by(country) grouping, so top_n() picked the "top 4" within each one-country group; calling ungroup() before top_n() would restrict best_fit and worst_fit to four countries each.)
Which of these four models do you expect to perform the best for future years?

Country | Adjusted R² |
---|---|
Brazil | 0.9994261 |
Greece | 0.9994407 |
Mexico | 0.9995427 |
Morocco | 0.9997960 |
-
Brazil
-
Greece
-
Mexico
-
Morocco
-
Can not be determined using this information
2.4.2 Predicting the future
3 Build, Tune & Evaluate Regression
In this chapter you will learn how to use the List Column Workflow to build, tune and evaluate regression models. You will have the chance to work with two types of models: linear models and random forest models.
3.1 Validation splits
3.1.1 The test-train split
In a disciplined machine learning workflow it is crucial to withhold a portion of your data (testing data) from any decision-making process. This allows you to independently assess the performance of your model when it is finalized. The remaining data, the training data, is used to build and select the best model.
In this exercise, you will use the rsample
package to split your data to perform the initial train-test split of your gapminder
data.
Note: Since this is a random split of the data it is good practice to set a seed before splitting it.
Prepare the initial 75-25 split object using the initial_split() function and assign it to gap_split.
set.seed(42)
library(rsample)
# Prepare the initial split object
gap_split <- initial_split(gapminder, prop = 0.75)
Extract the training data frame from gap_split using the training() function.
# Extract the training data frame
training_data <- training(gap_split)
Extract the testing data frame from gap_split using the testing() function.
# Extract the testing data frame
testing_data <- testing(gap_split)
Use the dim() function on training_data and testing_data to check their dimensions.
# Calculate the dimensions of both training_data and testing_data
dim(training_data)
## [1] 3003 7
dim(testing_data)
## [1] 1001 7
Great work! You have withheld a portion of your data for a final, unbiased, evaluation of your model. Throughout the rest of this chapter you will take the steps necessary to identify the best performing model using only the training data. At the end of the chapter you will select the best performing model and measure its performance using the testing data that you created here.
3.1.2 Cross-validation data frames
Now that you have withheld a portion of your data as testing data, you can use the remaining portion to find the best performing model.
In this exercise, you will split the training data into a series of 5 train-validate sets using the vfold_cv()
function from the rsample
package.
Split training_data into 5 cross-validation folds using vfold_cv() and assign the result to cv_split.
set.seed(42)
# Prepare the data frame containing the cross validation partitions
cv_split <- vfold_cv(training_data, v = 5)
Prepare the cv_data data frame by appending two new columns to cv_split:
- train: containing the train data frames, created by mapping training() across the splits column.
- validate: containing the validate data frames, created by mapping testing() across the splits column.
cv_data <- cv_split %>%
  mutate(
# Extract the train data frame for each split
train = map(splits, ~training(.x)),
# Extract the validate data frame for each split
validate = map(splits, ~testing(.x))
)
# Use head() to preview cv_data
head(cv_data)
## # 5-fold cross-validation
## # A tibble: 5 × 4
## splits id train validate
## <list> <chr> <list> <list>
## 1 <split [2402/601]> Fold1 <tibble [2,402 × 7]> <tibble [601 × 7]>
## 2 <split [2402/601]> Fold2 <tibble [2,402 × 7]> <tibble [601 × 7]>
## 3 <split [2402/601]> Fold3 <tibble [2,402 × 7]> <tibble [601 × 7]>
## 4 <split [2403/600]> Fold4 <tibble [2,403 × 7]> <tibble [600 × 7]>
## 5 <split [2403/600]> Fold5 <tibble [2,403 × 7]> <tibble [600 × 7]>
Excellent job! Now you’re back to the same list column workflow you’ve used in the previous chapters. You will use this data frame throughout this chapter to measure and compare the performance of the models you create.
3.2 Cross-validation performance
3.2.1 Build cross-validated models
In this exercise, you will build a linear model predicting life_expectancy
using all available features. You will do this for the train data of each cross-validation fold.
Build models predicting life_expectancy
using all available features with the train
data for each fold of the cross validation.
# Build a model using the train data for each fold of the cross validation
cv_models_lm <- cv_data %>%
  mutate(model = map(train, ~lm(formula = life_expectancy~., data = .x)))
You’re doing great! Now that you have the models built, let’s prepare the parts we need to evaluate their performance.
3.2.2 Preparing for evaluation
In order to measure the validate performance of your models, you need to compare the predicted values of life_expectancy for the observations in each validate set to the actual recorded values.
Here you will prepare both of these vectors for each partition.
- Extract the actual life_expectancy values from the validate data frames and store them in the column validate_actual.
- Predict life_expectancy for each validate partition using the map2() and predict() functions, storing the predictions in the column validate_predicted.
cv_prep_lm <- cv_models_lm %>%
  mutate(
# Extract the recorded life expectancy for the records in the validate data frames
validate_actual = map(validate, ~.x$life_expectancy),
# Predict life expectancy for each validate set using its corresponding model
validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y))
)
Great work! In the next exercise you will compare validate_actual
to validate_predicted
to measure the performance of all 5 models.
3.2.3 Evaluate model performance
Now that you have both the actual and predicted values of each fold you can compare them to measure performance.
For this regression model, you will measure the Mean Absolute Error (MAE) between these two vectors. This value tells you the average difference between the actual and predicted values.
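As a reminder of what mae() computes, here is a minimal sketch on made-up vectors (the numbers are purely illustrative):
library(Metrics)

actual    <- c(70, 72, 75)
predicted <- c(69, 74, 73)

# mae() is the mean of the absolute differences between the two vectors
mae(actual, predicted)
mean(abs(actual - predicted))  # same result: (1 + 2 + 2) / 3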
Calculate the MAE between validate_actual and validate_predicted for each fold and store it in the validate_mae column.
library(Metrics)
# Calculate the mean absolute error for each validate fold
cv_eval_lm <- cv_prep_lm %>%
  mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y)))
Print the validate_mae column (note how the values vary), then calculate its mean.
# Print the validate_mae column
cv_eval_lm$validate_mae
## [1] 1.564897 1.481837 1.395386 1.589818 1.460490
# Calculate the mean of validate_mae column
mean(cv_eval_lm$validate_mae)
## [1] 1.498485
Excellent! You now know that based on 5 train-validate splits, the predictions of the models are on average off by 1.5 years. Can we improve this performance by using a more complex model? Let’s find out!
3.3 Random forest model
3.3.1 Build a random forest model
Here you will use the same cross-validation data to build (using train
) and evaluate (using validate
)
random forests for each partition. Since you are using the same
cross-validation partitions as your regression models, you are able to
directly compare the performance of the two models.
Note: We will limit our random forests to contain
100 trees to ensure they finish fitting in a reasonable time. The
default number of trees for ranger()
is 500.
Use ranger() to build a random forest predicting life_expectancy using all features in train for each cross validation partition.
library(ranger)
# Build a random forest model for each fold
cv_models_rf <- cv_data %>%
  mutate(model = map(train, ~ranger(formula = life_expectancy~., data = .x,
                                    num.trees = 100, seed = 42)))
Add a column validate_predicted with the predicted life_expectancy for the observations in validate, using the random forest models you just created.
# Generate predictions using the random forest model
cv_prep_rf <- cv_models_rf %>%
  mutate(validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y)$predictions))
Great job! In the next exercise, you will evaluate the predictions from this model.
3.3.2 Evaluate a random forest model
Similar to the linear regression model, you will use the MAE metric to evaluate the performance of the random forest model.
Calculate the MAE for each validate fold and store it in the validate_mae column.
library(ranger)
cv_prep_rf = cv_prep_rf %>% mutate(validate_actual = map(validate, ~.x$life_expectancy))

# Calculate validate MAE for each fold
cv_eval_rf <- cv_prep_rf %>%
  mutate(validate_mae = map2_dbl(validate_actual, validate_predicted, ~mae(actual = .x, predicted = .y)))
Print the validate_mae column (note how the values vary), then calculate its mean.
# Print the validate_mae column
cv_eval_rf$validate_mae
## [1] 0.8694553 0.7990644 0.7832452 0.8853153 0.7828388
# Calculate the mean of validate_mae column
mean(cv_eval_rf$validate_mae)
## [1] 0.8239838
Note: The actual values of the validate folds (validate_actual) were added to the data frame above, before calculating the MAE.
Impressive! You’ve dropped the average error of your predictions from 1.5 to 0.83. That’s quite an improvement! In the next exercise you’ll see if you can squeeze a bit more performance out by tuning a parameter of the random forest model.
3.3.3 Fine tune your model
Wow! That was a significant improvement over a regression model. Now
let’s see if you can further improve this performance by fine tuning
your random forest models. To do this you will vary the mtry
parameter when building your random forest models on your train
data.
The default value of mtry
for ranger is the rounded down square root of the total number of features (6). This results in a value of 2.
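Two quick illustrations of the ideas above (toy values, not the actual cv_data object):
# Default mtry: the rounded-down square root of the number of features
floor(sqrt(6))
## [1] 2

# crossing() creates every combination of its inputs,
# so 5 folds x 4 mtry values yields 20 fold/mtry pairs to fit
library(tidyr)
crossing(fold = paste0("Fold", 1:5), mtry = 2:5)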
Use crossing() to expand the cross validation data for values of mtry ranging from 2 through 5.
# Prepare for tuning your cross validation folds by varying mtry
cv_tune <- cv_data %>%
  crossing(mtry = 2:5)
# Build a model for each fold & mtry combination
cv_model_tunerf <- cv_tune %>%
  mutate(model = map2(.x = train, .y = mtry, ~ranger(formula = life_expectancy~.,
                                                     data = .x, mtry = .y,
                                                     num.trees = 100, seed = 42)))
Great work! You’ve built a model for each fold/mtry combination. Next,
you’ll measure the performance of each to find the best performing value
of mtry
.
3.3.4 The best performing parameter
You’ve now built models where you’ve varied the random forest-specific hyperparameter mtry
in the hopes of improving your model further. Now you will measure the performance of each mtry
value across the 5 cross validation partitions to see if you can improve the model.
Remember that the validate MAE of 0.824 you calculated two exercises ago was for the default mtry value of 2.
# Generate validate predictions for each model
cv_prep_tunerf <- cv_model_tunerf %>%
  mutate(validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y)$predictions))
cv_prep_tunerf = cv_prep_tunerf %>% mutate(validate_actual = map(validate, ~.x$life_expectancy))

# Calculate validate MAE for each fold and mtry combination
cv_eval_tunerf <- cv_prep_tunerf %>%
  mutate(validate_mae = map2_dbl(.x = validate_actual, .y = validate_predicted, ~mae(actual = .x, predicted = .y)))
Calculate the mean validate_mae for each value of mtry used.
# Calculate the mean validate_mae for each mtry used
cv_eval_tunerf %>%
  group_by(mtry) %>%
summarise(mean_mae = mean(validate_mae))
## # A tibble: 4 × 2
## mtry mean_mae
## <int> <dbl>
## 1 2 0.824
## 2 3 0.819
## 3 4 0.812
## 4 5 0.820
Excellent job! Looks like parameter tuning was able to eke out another slight boost in performance, dropping the MAE from 0.824 (mtry = 2) to 0.812 (mtry = 4). Assuming that you've finished your model selection you
can conclude that your final (best performing) model will be the random
forest model built using ranger
with an mtry = 4
and num.trees = 100
.
In the next exercise you will build this model using all training data
and evaluate its expected future performance using the testing data.
3.4 Test performance
Build & evaluate the best model
Using cross-validation you were able to identify the best model for predicting life_expectancy
using all the features in gapminder
. Now that you’ve selected your model, you can use the independent set of data (testing_data
) that you’ve held out to estimate the performance of this model on new data.
You will build this model using all training_data
and evaluate using testing_data
.
Use ranger() to build the best performing model (mtry = 4) using all of the training data. Assign this to best_model.
# Build the model using all training data and the best performing parameter
best_model <- ranger(formula = life_expectancy~., data = training_data,
                     mtry = 4, num.trees = 100, seed = 42)
Extract the life_expectancy column from testing_data and assign it to test_actual.
# Prepare the test_actual vector
test_actual <- testing_data$life_expectancy
Predict life_expectancy using best_model on the testing_data and assign the result to test_predicted.
# Predict life_expectancy for the testing_data
test_predicted <- predict(best_model, testing_data)$predictions
Calculate the MAE between the test_actual and test_predicted vectors.
# Calculate the test MAE
mae(test_actual, test_predicted)
## [1] 0.6679158
Fantastic work! You have successfully leveraged the list column workflow to identify and build a model to predict life expectancy. Based on the test holdout you can expect that your predictions on new data will be off by only about 0.67 years on average.
4 Build, Tune & Evaluate Classification
In this chapter you will shift gears to build, tune and evaluate classification models.
4.1 Logistic regression models
4.1.1 Prepare train-test-validate parts
In this exercise, you will leverage the tools that you have learned thus far to build a classification model to predict employee attrition.
You will work with the attrition
dataset, which contains 30 features about employees which you will use to predict if they have left the company.
You will first prepare the training & testing data sets, then you will further split the training data using cross-validation so that you can search for the best performing model for this task.
Prepare the initial 75-25 split of the attrition data using the initial_split() function.
set.seed(42)
attrition = readRDS("/Users/apple/Documents/Rstudio/DataCamp/MachineLearningintheTidyverse/attrition.rds")

# Prepare the initial split object
data_split <- initial_split(attrition, prop = 0.75)
Extract the training and testing data frames from data_split using training() and testing(), respectively.
# Extract the training data frame
training_data <- training(data_split)

# Extract the testing data frame
testing_data <- testing(data_split)
- Build a data frame for 5-fold cross validation from the training_data using vfold_cv().
- Prepare the cv_data data frame by extracting the train and validate data frames for each fold.
set.seed(42)
cv_split <- vfold_cv(training_data, v = 5)
cv_data <- cv_split %>%
  mutate(
    # Extract the train data frame for each split
    train = map(splits, ~training(.x)),
    # Extract the validate data frame for each split
    validate = map(splits, ~testing(.x))
  )
cv_data$validate %>% map(~head(.x))
## [[1]]
## Age Attrition BusinessTravel DailyRate Department
## 775 34 No Travel_Rarely 167 Research_Development
## 437 27 No Travel_Rarely 1377 Sales
## 878 36 No Travel_Rarely 1278 Human_Resources
## 476 26 No Travel_Rarely 933 Sales
## 547 42 No Travel_Frequently 532 Research_Development
## 830 32 No Travel_Rarely 859 Research_Development
## DistanceFromHome Education EducationField EnvironmentSatisfaction Gender
## 775 8 Doctor Life_Sciences Medium Female
## 437 2 Bachelor Life_Sciences Very_High Male
## 878 8 Bachelor Life_Sciences Low Male
## 476 1 Bachelor Life_Sciences High Male
## 547 29 College Life_Sciences Low Female
## 830 4 Bachelor Life_Sciences High Female
## HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction
## 775 32 High 2 Manufacturing_Director Low
## 437 74 High 2 Sales_Executive High
## 878 77 Medium 1 Human_Resources Low
## 476 57 High 2 Sales_Executive High
## 547 92 High 2 Research_Scientist High
## 830 98 Medium 2 Manufacturing_Director High
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime
## 775 Divorced 5121 4187 3 No
## 437 Single 4478 5242 1 Yes
## 878 Married 2342 8635 0 No
## 476 Married 5296 20156 1 No
## 547 Divorced 4556 12932 2 No
## 830 Married 6162 19124 1 No
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 775 14 Excellent High
## 437 11 Excellent Low
## 878 21 Outstanding High
## 476 17 Excellent Medium
## 547 11 Excellent Medium
## 830 12 Excellent High
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 775 1 7 3 Better
## 437 0 5 3 Better
## 878 0 6 3 Better
## 476 1 8 3 Better
## 547 1 19 3 Better
## 830 1 14 3 Better
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 775 0 0 0
## 437 5 4 0
## 878 5 4 0
## 476 8 7 7
## 547 5 4 0
## 830 14 13 6
## YearsWithCurrManager
## 775 0
## 437 4
## 878 3
## 476 7
## 547 2
## 830 8
##
## [[2]]
## Age Attrition BusinessTravel DailyRate Department
## 1622 28 No Travel_Rarely 580 Research_Development
## 224 27 No Non-Travel 1450 Research_Development
## 857 36 No Travel_Rarely 928 Sales
## 1833 31 No Travel_Frequently 1125 Sales
## 1202 29 No Travel_Rarely 942 Research_Development
## 1875 28 No Travel_Rarely 1172 Sales
## DistanceFromHome Education EducationField EnvironmentSatisfaction
## 1622 27 Bachelor Medical Medium
## 224 3 Bachelor Medical High
## 857 1 College Life_Sciences Medium
## 1833 7 Master Marketing Low
## 1202 15 Below_College Life_Sciences Medium
## 1875 3 Bachelor Medical Medium
## Gender HourlyRate JobInvolvement JobLevel JobRole
## 1622 Female 39 Low 2 Manufacturing_Director
## 224 Male 79 Medium 1 Research_Scientist
## 857 Male 56 High 2 Sales_Executive
## 1833 Female 68 High 3 Sales_Executive
## 1202 Female 69 Low 1 Research_Scientist
## 1875 Female 78 High 1 Sales_Representative
## JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked
## 1622 Low Divorced 4877 20460 0
## 224 High Divorced 2566 25326 1
## 857 Very_High Married 6201 2823 1
## 1833 Low Married 9637 8277 2
## 1202 Very_High Married 2168 26933 0
## 1875 Medium Married 2856 3692 1
## OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 1622 No 21 Outstanding Medium
## 224 Yes 15 Excellent Very_High
## 857 Yes 14 Excellent Very_High
## 1833 No 14 Excellent Very_High
## 1202 Yes 18 Excellent Low
## 1875 No 19 Excellent Very_High
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1622 1 6 5 Good
## 224 1 1 2 Good
## 857 1 18 1 Good
## 1833 2 9 3 Better
## 1202 1 6 2 Good
## 1875 1 1 3 Better
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1622 5 3 0
## 224 1 1 0
## 857 18 14 4
## 1833 3 2 2
## 1202 5 4 1
## 1875 1 0 0
## YearsWithCurrManager
## 1622 0
## 224 1
## 857 11
## 1833 2
## 1202 3
## 1875 0
##
## [[3]]
## Age Attrition BusinessTravel DailyRate Department
## 1655 49 No Travel_Rarely 301 Research_Development
## 683 34 No Travel_Rarely 1397 Research_Development
## 1315 28 No Non-Travel 1476 Research_Development
## 1286 28 No Travel_Frequently 791 Research_Development
## 1967 31 Yes Travel_Frequently 754 Sales
## 1068 26 No Travel_Frequently 921 Research_Development
## DistanceFromHome Education EducationField EnvironmentSatisfaction
## 1655 22 Master Other Low
## 683 1 Doctor Life_Sciences Medium
## 1315 1 Bachelor Life_Sciences High
## 1286 1 Master Medical Very_High
## 1967 26 Master Marketing Low
## 1068 1 Below_College Medical Low
## Gender HourlyRate JobInvolvement JobLevel JobRole
## 1655 Female 72 High 4 Research_Director
## 683 Male 42 High 1 Research_Scientist
## 1315 Female 55 Low 2 Laboratory_Technician
## 1286 Male 44 High 1 Laboratory_Technician
## 1967 Male 63 High 2 Sales_Executive
## 1068 Female 66 Medium 1 Research_Scientist
## JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked
## 1655 Medium Married 16413 3498 3
## 683 Very_High Married 2691 7660 1
## 1315 Very_High Married 6674 16392 0
## 1286 High Single 2154 6842 0
## 1967 Very_High Married 5617 21075 1
## 1068 High Divorced 2007 25265 1
## OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 1655 No 16 Excellent Medium
## 683 No 12 Excellent Very_High
## 1315 No 11 Excellent Low
## 1286 Yes 11 Excellent High
## 1967 Yes 11 Excellent High
## 1068 No 13 Excellent High
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1655 2 27 2 Better
## 683 1 10 4 Good
## 1315 3 10 6 Better
## 1286 0 5 2 Good
## 1967 0 10 4 Better
## 1068 2 5 5 Better
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1655 4 2 1
## 683 10 9 8
## 1315 9 8 7
## 1286 4 2 0
## 1967 10 7 0
## 1068 5 3 1
## YearsWithCurrManager
## 1655 2
## 683 8
## 1315 5
## 1286 2
## 1967 8
## 1068 3
##
## [[4]]
## Age Attrition BusinessTravel DailyRate Department
## 1551 24 No Travel_Rarely 350 Research_Development
## 1754 30 No Travel_Rarely 979 Sales
## 1647 27 No Travel_Rarely 486 Research_Development
## 30 21 No Travel_Rarely 391 Research_Development
## 1854 42 No Non-Travel 355 Research_Development
## 1298 30 No Non-Travel 879 Research_Development
## DistanceFromHome Education EducationField EnvironmentSatisfaction Gender
## 1551 21 College Technical_Degree High Male
## 1754 15 College Marketing High Male
## 1647 8 Bachelor Medical Medium Female
## 30 15 College Life_Sciences High Male
## 1854 10 Master Technical_Degree High Male
## 1298 9 College Medical High Female
## HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction
## 1551 57 Medium 1 Laboratory_Technician Low
## 1754 94 Medium 3 Sales_Executive Low
## 1647 86 Very_High 1 Research_Scientist High
## 30 96 High 1 Research_Scientist Very_High
## 1854 38 High 1 Research_Scientist High
## 1298 72 High 2 Manufacturing_Director High
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime
## 1551 Divorced 2296 10036 0 No
## 1754 Divorced 7140 3088 2 No
## 1647 Married 3517 22490 7 No
## 30 Single 1232 19281 1 No
## 1854 Married 2936 6161 3 No
## 1298 Single 4695 12858 7 Yes
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 1551 14 Excellent Medium
## 1754 11 Excellent Low
## 1647 17 Excellent Low
## 30 14 Excellent Very_High
## 1854 22 Outstanding Medium
## 1298 18 Excellent High
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1551 3 2 3 Better
## 1754 1 12 2 Better
## 1647 0 5 0 Better
## 30 0 0 6 Better
## 1854 2 10 1 Good
## 1298 0 10 3 Better
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1551 1 1 0
## 1754 7 7 1
## 1647 3 2 0
## 30 0 0 0
## 1854 6 3 3
## 1298 8 4 1
## YearsWithCurrManager
## 1551 0
## 1754 7
## 1647 2
## 30 0
## 1854 3
## 1298 7
##
## [[5]]
## Age Attrition BusinessTravel DailyRate Department
## 62 46 No Travel_Frequently 1211 Sales
## 1862 32 Yes Travel_Rarely 414 Sales
## 1255 43 No Travel_Rarely 920 Research_Development
## 407 36 No Travel_Frequently 566 Research_Development
## 1164 35 No Travel_Rarely 528 Human_Resources
## 1380 35 Yes Travel_Frequently 662 Sales
## DistanceFromHome Education EducationField EnvironmentSatisfaction Gender
## 62 5 Master Marketing Low Male
## 1862 2 Master Marketing High Male
## 1255 3 Bachelor Life_Sciences High Male
## 407 18 Master Life_Sciences High Male
## 1164 8 Master Technical_Degree High Male
## 1380 18 Master Marketing Very_High Female
## HourlyRate JobInvolvement JobLevel JobRole JobSatisfaction
## 62 98 High 2 Sales_Executive Very_High
## 1862 82 Medium 2 Sales_Executive Medium
## 1255 96 Low 5 Research_Director Very_High
## 407 81 Very_High 1 Laboratory_Technician Very_High
## 1164 100 High 1 Human_Resources High
## 1380 67 High 2 Sales_Executive High
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked OverTime
## 62 Single 5772 20445 4 Yes
## 1862 Single 9907 26186 7 Yes
## 1255 Married 19740 18625 3 No
## 407 Married 3688 7122 4 No
## 1164 Single 4323 7108 1 No
## 1380 Married 4614 23288 0 Yes
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 62 21 Outstanding High
## 1862 12 Excellent High
## 1255 14 Excellent Medium
## 407 18 Excellent Very_High
## 1164 17 Excellent Medium
## 1380 18 Excellent High
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 62 0 14 4 Better
## 1862 0 7 3 Good
## 1255 1 25 2 Better
## 407 2 4 2 Better
## 1164 0 6 2 Bad
## 1380 1 5 0 Good
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 62 9 6 0
## 1862 2 2 2
## 1255 8 7 0
## 407 1 0 0
## 1164 5 4 1
## 1380 4 2 3
## YearsWithCurrManager
## 62 8
## 1862 2
## 1255 7
## 407 0
## 1164 4
## 1380 2
Great work!! Now you have the parts necessary to build & tune your classification models.
4.1.2 Build cross-validated models
In this exercise, you will build logistic regression models for each fold in your cross-validation.
You will build this using the glm()
function and by setting the family argument to “binomial”
.
Build models predicting Attrition
using all available features with the train
data for each fold of the cross validation.
# Build a model using the train data for each fold of the cross validation
cv_models_lr <- cv_data %>%
  mutate(model = map(train, ~glm(formula = Attrition~.,
                                 data = .x, family = "binomial")))
Excellent work! Now let’s learn how to evaluate these models.
4.2 Classification models
4.2.1 Predictions of a single model
To calculate the performance of a classification model you need to compare the actual values of Attrition
to those predicted by the model.
When calculating metrics for binary classification tasks (such as
precision and recall), the actual and predicted vectors must be
converted to binary values.
In this exercise, you will learn how to prepare these vectors using the model and validate data frames from the first cross-validation fold as an example.
Extract the model and the validate data frame from the first fold of the cross-validation.
# Extract the first model and validate
model <- cv_models_lr$model[[1]]
validate <- cv_models_lr$validate[[1]]
Extract the Attrition column from the validate data frame and convert the values to binary (TRUE/FALSE).
# Prepare binary vector of actual Attrition values in validate
validate_actual <- validate$Attrition == "Yes"
Use model to predict the probabilities of attrition for the validate data frame.
# Predict the probabilities for the observations in validate
validate_prob <- predict(model, validate, type = "response")
Convert the predicted probabilities to a binary vector where probabilities greater than 0.5 are TRUE.
# Prepare binary vector of predicted Attrition values for validate
validate_predicted <- validate_prob > 0.5
Fantastic! Now you have the actual and predicted vectors. In the next exercise you’ll use these vectors to calculate some metrics to check the performance of the model.
4.2.2 Performance of a single model
Now that you have the binary vectors for the actual and predicted values of the model, you can calculate many commonly used binary classification metrics. In this exercise you will focus on:
- accuracy: the proportion of all predictions that the model got right.
- precision: the proportion of TRUE predictions that were actually TRUE.
- recall: the proportion of actual TRUE values that the model correctly recovered.
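To make these concrete, here is a minimal sketch on made-up binary vectors (the values are purely illustrative):
library(Metrics)

actual    <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
predicted <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

accuracy(actual, predicted)   # 3 of 5 predictions correct: 0.6
precision(actual, predicted)  # 2 of the 3 TRUE predictions were actually TRUE: ~0.67
recall(actual, predicted)     # 2 of the 3 actual TRUE values were recovered: ~0.67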
Use table() to compare the validate_actual and validate_predicted values for the example model and validate data frame.
library(Metrics)
# Compare the actual & predicted performance visually using a table
table(validate_actual, validate_predicted)
## validate_predicted
## validate_actual FALSE TRUE
## FALSE 187 5
## TRUE 14 15
# Calculate the accuracy
accuracy(validate_actual, validate_predicted)
## [1] 0.9140271
# Calculate the precision
precision(validate_actual, validate_predicted)
## [1] 0.75
# Calculate the recall
recall(validate_actual, validate_predicted)
## [1] 0.5172414
Great work! The type of metric you use should be informed by the application of your model. In the next exercise you will expand on this example to calculate the recall metric for each of your cross validation folds.
4.2.3 Prepare for cross-validated performance
Now that you know how to calculate the performance metrics for a single model, you are now ready to expand this for all the folds in the cross-validation data frame.
- Add the validate_actual binary column for each cross-validation fold by converting all "Yes" values to TRUE.
- Use model to predict the probabilities of attrition for each cross-validation fold of validate. Convert the predicted probabilities to a binary vector, treating all probabilities greater than 0.5 as TRUE. Name this column validate_predicted.
cv_prep_lr <- cv_models_lr %>%
  mutate(
# Prepare binary vector of actual Attrition values in validate
validate_actual = map(validate, ~.x$Attrition == "Yes"),
# Prepare binary vector of predicted Attrition values for validate
validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y, type = "response") > 0.5)
)
Great work! Next, you’ll calculate the recall of these cross validated models.
4.2.4 Calculate cross-validated performance
It is crucial to optimize models using a carefully selected metric aimed at achieving the goal of the model.
Imagine that in this case you want to use this model to identify employees that are predicted to leave the company. Ideally, you want a model that can capture as many of the ready-to-leave employees as possible so that you can intervene. The corresponding metric that captures this is the recall metric. As such, you will exclusively use recall to optimize and select your models.
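To see why the choice of metric matters here, consider a toy attrition example (made-up vectors, not the attrition data): a model that never predicts attrition can look accurate while recovering none of the leavers.
library(Metrics)

actual       <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
never_leaves <- rep(FALSE, 10)  # a "model" that predicts no one will leave

accuracy(actual, never_leaves)  # 0.8 -- looks respectable
recall(actual, never_leaves)    # 0   -- but it identifies none of the employees who left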
Calculate the recall for each cross-validation fold and store it in the validate_recall column.
# Calculate the validate recall for each cross validation fold
cv_perf_recall <- cv_prep_lr %>%
  mutate(validate_recall = map2_dbl(validate_actual, validate_predicted,
                                    ~recall(actual = .x, predicted = .y)))
Print the validate_recall column and calculate its average.
# Print the validate_recall column
cv_perf_recall$validate_recall
## [1] 0.5172414 0.4705882 0.4242424 0.4722222 0.4222222
# Calculate the average of the validate_recall column
mean(cv_perf_recall$validate_recall)
## [1] 0.4613033
Excellent! As you can see, the validate recall of the model is 0.46. Can you beat this using a more complex model? In the next series of exercises you will find out.
4.3 RF for classification
4.3.1 Tune random forest models
Now that you have a working logistic regression model you will prepare a random forest model to compare it with.
Use crossing() to expand the cross-validation data for mtry values of 2, 4, 8, and 16.
# Prepare for tuning your cross validation folds by varying mtry
cv_tune <- cv_data %>%
  crossing(mtry = c(2, 4, 8, 16))
# Build a cross validation model for each fold & mtry combination
cv_models_rf <- cv_tune %>%
  mutate(model = map2(train, mtry, ~ranger(formula = Attrition~.,
                                           data = .x, mtry = .y,
                                           num.trees = 100, seed = 42)))
Fantastic work! Next you will evaluate the validation performance of these random forest models.
4.3.2 Random forest performance
It is now time to see whether the random forests models you built in the previous exercise are able to outperform the logistic regression model.
Remember that the validate recall for the logistic regression model was 0.46.
Prepare the validate_actual and validate_predicted columns for each mtry/fold combination.
cv_prep_rf <- cv_models_rf %>%
  mutate(
# Prepare binary vector of actual Attrition values in validate
validate_actual = map(validate, ~.x$Attrition == "Yes"),
# Prepare binary vector of predicted Attrition values for validate
validate_predicted = map2(.x = model, .y = validate, ~predict(.x, .y, type = "response")$predictions == "Yes")
)
# Calculate the validate recall for each cross validation fold
cv_perf_recall <- cv_prep_rf %>%
  mutate(recall = map2_dbl(.x = validate_actual, .y = validate_predicted, ~recall(actual = .x, predicted = .y)))
Calculate the mean recall for each value of mtry used.
# Calculate the mean recall for each mtry used
cv_perf_recall %>%
  group_by(mtry) %>%
summarise(mean_recall = mean(recall))
## # A tibble: 4 × 2
## mtry mean_recall
## <dbl> <dbl>
## 1 2 0.0888
## 2 4 0.121
## 3 8 0.185
## 4 16 0.208
Great work! This time you can see that none of the random forest models were able to outperform the logistic regression model with respect to recall.
4.3.3 Build final classification model
Comparing the recall performance between the logistic regression model (0.46) and the best performing random forest model (0.21), you've learned that the model with the best performance is the logistic regression model. In this exercise, you will build the logistic regression model using all of the train data and you will prepare the necessary vectors for evaluating this model's test performance.
Build a logistic regression model predicting Attrition using all available features in the training_data.
# Build the logistic regression model using all training data
best_model <- glm(formula = Attrition~.,
                  data = training_data, family = "binomial")
Prepare a binary vector of the actual Attrition values in testing_data and store it as test_actual.
# Prepare binary vector of actual Attrition values for testing_data
test_actual <- testing_data$Attrition == "Yes"
Predict attrition probabilities for testing_data using best_model, treat probabilities greater than 0.5 as TRUE, and store the result as test_predicted.
# Prepare binary vector of predicted Attrition values for testing_data
test_predicted <- predict(best_model, testing_data, type = "response") > 0.5
Almost at the finish line. You’ve now selected & built your best performing model and have prepared the necessary parts to evaluate its performance.
4.3.4 Measure final model performance
Now it's time to calculate the test performance of your final model (logistic regression). Here you will use the held out testing data to characterize the performance you would expect from this model when it is applied to new data.
Use table() to compare the test_actual and test_predicted vectors.
# Compare the actual & predicted performance visually using a table
table(test_actual, test_predicted)
## test_predicted
## test_actual FALSE TRUE
## FALSE 301 7
## TRUE 33 27
# Calculate the test accuracy
accuracy(test_actual, test_predicted)
## [1] 0.8913043
# Calculate the test precision
precision(test_actual, test_predicted)
## [1] 0.7941176
# Calculate the test recall
recall(test_actual, test_predicted)
## [1] 0.45
Well done! You now have a model that you can expect to identify 45% of employees that are at risk to leave the organization.