Split out the train and test sets

The first step of training a model is dividing the data into train and test sets. The tidymodels package makes this easy. Setting aside a test data set allows you to evaluate the trained model on data the model has never seen.

You will use the employee healthcare attrition data, which describes the employees of a healthcare company and whether or not they left the company. It is available in attrition_df. The target variable is Attrition. The tidyverse and tidymodels packages have been loaded for you.

* Initialize a split of the data with 80% for training and stratify based on Attrition, the target variable.
* Extract the training data set and store it in train.
* Extract the testing data set and store it in test.

> glimpse(attrition_df)
Rows: 1,676
Columns: 35
$ EmployeeID               1313919, 1200302, 1060315, 1272912, 1414939, …
$ Age                      41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
$ Attrition                No, No, Yes, No, No, No, No, No, No, No, No, …
$ BusinessTravel           "Travel_Rarely", "Travel_Frequently", "Travel…
$ DailyRate                1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
$ Department               "Cardiology", "Maternity", "Maternity", "Mate…
$ DistanceFromHome         1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
$ Education                2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
$ EducationField           "Life Sciences", "Life Sciences", "Other", "L…
$ EmployeeCount            1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ EnvironmentSatisfaction  2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
$ Gender                   "Female", "Male", "Male", "Female", "Male", "…
$ HourlyRate               94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
$ JobInvolvement           3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
$ JobLevel                 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
$ JobRole                  "Nurse", "Other", "Nurse", "Other", "Nurse", …
$ JobSatisfaction          4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
$ MaritalStatus            "Single", "Married", "Single", "Married", "Ma…
$ MonthlyIncome            5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
$ MonthlyRate              19479, 24907, 2396, 23159, 16632, 11864, 9964…
$ NumCompaniesWorked       8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
$ Over18                   "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
$ OverTime                 "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
$ PercentSalaryHike        11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
$ PerformanceRating        3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
$ RelationshipSatisfaction 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
$ StandardHours            80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
$ Shift                    0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
$ TotalWorkingYears        8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
$ TrainingTimesLastYear    0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
$ WorkLifeBalance          1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
$ YearsAtCompany           6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
$ YearsInCurrentRole       4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
$ YearsSinceLastPromotion  0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
$ YearsWithCurrManager     5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

# Initialize the split
split <- initial_split(attrition_df, prop = 0.8, strata = Attrition)

# Extract training set
train <- split %>% training()

# Extract testing set
test <- split %>% testing()

The strata argument helps ensure the same proportion of target variable values appears in both the train and test data sets.
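If you want to confirm what stratification did, a quick check like this (a sketch, assuming train and test were created as above) compares the class proportions in the two sets:

# Compare the proportion of each Attrition class across the splits
train %>% count(Attrition) %>% mutate(prop = n / sum(n))
test %>% count(Attrition) %>% mutate(prop = n / sum(n))

The proportions of "Yes" and "No" should be nearly identical in both data frames.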
-----------------------------------------------------------------------------------------------------

Create a recipe-model workflow

The tidymodels package can combine recipes and models into workflows. Workflows make it easy to create a pipeline of steps that prepares data and trains a model. A workflow can then be applied to new data without redefining all of the preprocessing and model-building steps. Conveniently, workflows have a fit() function that fits both the recipe and the model to the data.

In this exercise, you will practice creating a recipe and a model and adding them to a workflow, so they are ready to be fit to the data. The train and test sets of the employee healthcare attrition data are available for your use. The target variable is Attrition. The tidyverse and tidymodels packages have been loaded for you.

* Define a recipe using the train data with a step_filter_missing(), step_scale(), and step_nzv() to remove features with too many NAs, scale the numeric features, and remove low-variance features, respectively. Use a threshold of 0.5 for step_filter_missing().
* Define a logistic regression model using the "glm" engine.
* Add feature_selection_recipe and lr_model to a workflow named attrition_wflow.

# Create recipe (left untrained; the workflow will prep it during fit())
feature_selection_recipe <- recipe(Attrition ~ ., data = train) %>%
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors())

# Create model
lr_model <- logistic_reg() %>%
  set_engine("glm")

# Add recipe and model to a workflow
attrition_wflow <- workflow() %>%
  add_recipe(feature_selection_recipe) %>%
  add_model(lr_model)

At this point, the workflow with the recipe and the model has not been fit to the data. You have just defined it.

-----------------------------------------------------------------------------------------------------

Fit, explore, and evaluate the model

Once you have defined a workflow with a recipe and a model, you can fit the workflow to the data. This is done with the training data set. The trained model is then evaluated using the test set. In this example, the target variable is categorical and you are using a logistic regression model, so you will evaluate the test predictions using the F measure.

feature_selection_recipe, lr_model, attrition_wflow, train, and test from the previous exercise are available for your use. The tidyverse and tidymodels packages have been loaded for you.

* Fit attrition_wflow using the training data.
* Combine the test predictions with the original Attrition values from the test data.
* Use f_meas() to evaluate the model's performance on the test data.
* Display the model estimates of attrition_fit.

# Fit workflow to train data
attrition_fit <- attrition_wflow %>% fit(data = train)

# Add the test predictions to the test data
attrition_pred_df <- predict(attrition_fit, test) %>%
  bind_cols(test %>% select(Attrition))

# Evaluate F score
f_meas(attrition_pred_df, Attrition, .pred_class)

# A tibble: 1 × 3
  .metric .estimator .estimate
1 f_meas  binary         0.948

# Display model estimates
tidy(attrition_fit)

# A tibble: 42 × 5
   term                            estimate std.error statistic     p.value
 1 (Intercept)                      0.348     663.     0.000524 1.00
 2 EmployeeID                      -0.192       0.142 -1.35     0.176
 3 Age                             -1.06        0.217 -4.90     0.000000941
 4 BusinessTravelTravel_Frequently  1.79        0.582  3.07     0.00214
 5 BusinessTravelTravel_Rarely     -0.0102      0.508 -0.0201   0.984
 6 DailyRate                       -0.252       0.142 -1.78     0.0758
 7 DepartmentMaternity             -0.756       0.366 -2.06     0.0390
 8 DepartmentNeurology             -1.13        0.513 -2.21     0.0274
 9 DistanceFromHome                 0.644       0.141  4.55     0.00000528
10 Education                        0.122       0.140  0.867    0.386
# … with 32 more rows
# Use `print(n = ...)` to see more rows

You may want to explore how model performance changes as you remove more or fewer features in the recipe steps. For example, if you adjust the threshold of step_filter_missing(), how does it affect the F1 measure of the model?
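As a sketch of that exploration (assuming the objects above are still in your session, and using a hypothetical threshold of 0.3), you could rebuild the recipe, refit the workflow, and compare F1 scores:

# Hypothetical stricter recipe: drop predictors with more than 30% missing values
stricter_recipe <- recipe(Attrition ~ ., data = train) %>%
  step_filter_missing(all_predictors(), threshold = 0.3) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors())

# Refit a workflow with the same model but the new recipe
stricter_fit <- workflow() %>%
  add_recipe(stricter_recipe) %>%
  add_model(lr_model) %>%
  fit(data = train)

# Score the new model on the test set
predict(stricter_fit, test) %>%
  bind_cols(test %>% select(Attrition)) %>%
  f_meas(Attrition, .pred_class)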
-----------------------------------------------------------------------------------------------------

Scale the data for lasso regression

To prepare to fit a lasso regression model, it is important to scale the data so that all features are on a comparable scale. The full set of King County, Washington house sales data is available in house_sales_df.

In this exercise, you will scale the target variable, price, separately, before you split the data into training and testing sets. This is because of the way tidymodels recipes work: recipe steps are reapplied to new data at prediction time, when the outcome may not be available, so we don't include target variable transformations in the recipe. The tidyverse and tidymodels packages have been loaded for you.

* Scale the target variable price in house_sales_df using scale().
* Create the training and testing sets with 80% in the training set.
* Create the recipe using the training data to scale all numeric predictors.

> glimpse(house_sales_df)
Rows: 21,613
Columns: 16
$ price         221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
$ bedrooms      3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
$ bathrooms     1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
$ sqft_living   1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
$ sqft_lot      5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
$ floors        1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
$ waterfront    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ view          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
$ condition     3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
$ grade         7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
$ sqft_above    1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
$ sqft_basement 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
$ yr_built      1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
$ yr_renovated  0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ sqft_living15 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
$ sqft_lot15    5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …

# Scale the target variable
house_sales_df <- house_sales_df %>%
  mutate(price = as.vector(scale(price)))

# Create the training and testing sets
split <- initial_split(house_sales_df, prop = 0.8)
train <- split %>% training()
test <- split %>% testing()

# Create recipe to scale the predictors
lasso_recipe <- recipe(price ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

You've written the necessary code to prepare the data to fit a lasso regression model. Scaled data is important for lasso regression to operate correctly.
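One practical note, as a sketch rather than part of the exercise: scale() stores the center and standard deviation it used as attributes, and capturing them (before overwriting price) lets you translate scaled predictions back into dollars. The names price_center, price_scale, and unscale_price here are hypothetical helpers:

# Run before mutating price: capture the scaling parameters
scaled_price <- scale(house_sales_df$price)
price_center <- attr(scaled_price, "scaled:center")
price_scale  <- attr(scaled_price, "scaled:scale")

# Invert the scaling to report predictions in dollars
unscale_price <- function(p) p * price_scale + price_center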
-----------------------------------------------------------------------------------------------------

Explore lasso regression penalty values

In the previous exercise, you completed all the code to scale the target and predictor variables. You will use the train data and lasso_recipe to build a workflow to train a lasso regression model and explore the effects of different penalty values. As you adjust the penalty and retrain the model, pay attention to the number of non-zero variables that are left in the model. You will be observing how lasso regression performs feature selection. The tidyverse and tidymodels packages have been loaded for you.

* Train a lasso regression workflow with a penalty of 0.001 and display the model coefficients that are greater than zero.
* Re-train the lasso regression workflow with a penalty of 0.01 and display the model coefficients that are greater than zero.
* Re-train the lasso regression workflow with a penalty of 0.1 and display the model coefficients that are greater than zero.

# Train workflow model with penalty = 0.001 and view model variables
lasso_model <- linear_reg(penalty = 0.001, mixture = 1, engine = "glmnet")
lasso_workflow <- workflow(preprocessor = lasso_recipe, spec = lasso_model)
tidy(lasso_workflow %>% fit(train)) %>% filter(estimate > 0)

# A tibble: 11 × 3
   term          estimate penalty
 1 (Intercept)    0.00280   0.001
 2 bathrooms      0.0987    0.001
 3 sqft_living    0.389     0.001
 4 floors         0.0332    0.001
 5 waterfront     0.131     0.001
 6 view           0.0965    0.001
 7 condition      0.0367    0.001
 8 grade          0.394     0.001
 9 sqft_basement  0.00411   0.001
10 yr_renovated   0.00959   0.001
11 sqft_living15  0.0416    0.001

# Train the workflow model with penalty = 0.01 and view model variables
lasso_model <- linear_reg(penalty = 0.01, mixture = 1, engine = "glmnet")
lasso_workflow <- workflow(preprocessor = lasso_recipe, spec = lasso_model)
tidy(lasso_workflow %>% fit(train)) %>% filter(estimate > 0)

# A tibble: 10 × 3
   term          estimate penalty
 1 (Intercept)    0.00280    0.01
 2 bathrooms      0.0842     0.01
 3 sqft_living    0.384      0.01
 4 floors         0.0206     0.01
 5 waterfront     0.126      0.01
 6 view           0.0977     0.01
 7 condition      0.0281     0.01
 8 grade          0.393      0.01
 9 yr_renovated   0.00578    0.01
10 sqft_living15  0.0322     0.01

# Train the workflow model with penalty = 0.1 and view model variables
lasso_model <- linear_reg(penalty = 0.1, mixture = 1, engine = "glmnet")
lasso_workflow <- workflow(preprocessor = lasso_recipe, spec = lasso_model)
tidy(lasso_workflow %>% fit(train)) %>% filter(estimate > 0)

# A tibble: 5 × 3
  term        estimate penalty
1 (Intercept)  0.00280     0.1
2 sqft_living  0.364       0.1
3 waterfront   0.0621      0.1
4 view         0.0920      0.1
5 grade        0.306       0.1

Notice how the number of non-zero coefficients decreased as you increased the lasso regression penalty. A stronger penalty forces the coefficients of less important features to zero, naturally performing feature selection.
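Rather than eyeballing three tables, you could count the surviving coefficients across a range of penalties. One caveat: filter(estimate != 0) also catches negative coefficients, which the filter(estimate > 0) used above would miss. A sketch, assuming lasso_recipe and train are still available (count_nonzero is a hypothetical helper):

# Count non-zero coefficients (including the intercept) for a given penalty
count_nonzero <- function(pen) {
  model <- linear_reg(penalty = pen, mixture = 1, engine = "glmnet")
  fit <- workflow(preprocessor = lasso_recipe, spec = model) %>% fit(train)
  tibble(penalty = pen, n_nonzero = sum(tidy(fit)$estimate != 0))
}

# Tabulate how the penalty shrinks the model
map_dfr(c(0.001, 0.01, 0.1), count_nonzero)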
-----------------------------------------------------------------------------------------------------

Tune the penalty hyperparameter

Now that you've seen how the penalty parameter affects lasso regression's selection of features, you might be wondering, "What's the best value for penalty?" tidymodels provides functions to explore the best values for hyperparameters like penalty. In this exercise, you will find the best value of penalty based on the RMSE of the model, then fit a final model with that penalty value. This will optimize the feature selection of lasso regression for model performance.

lasso_recipe has been created for you and train is also available. The tidyverse and tidymodels packages have also been loaded for you.

* Define a linear_reg() workflow that will tune penalty.
* Create a 3-fold cross-validation sample from train and a sequence of 20 penalty values ranging from 0.001 to 0.1.
* Create lasso models with the different penalty values.
* Plot the model performance (RMSE) against the penalty value.

# Create tune-able model
lasso_model <- linear_reg(penalty = tune(), mixture = 1, engine = "glmnet")
lasso_workflow <- workflow(preprocessor = lasso_recipe, spec = lasso_model)

# Create a cross validation sample and sequence of penalty values
train_cv <- vfold_cv(train, v = 3)
# penalty() takes its range on the log10 scale: 10^-3 to 10^-1
penalty_grid <- grid_regular(penalty(range = c(-3, -1)), levels = 20)

# Create lasso models with different penalty values
lasso_grid <- tune_grid(
  lasso_workflow,
  resamples = train_cv,
  grid = penalty_grid)

# Plot RMSE vs. penalty values
autoplot(lasso_grid, metric = "rmse")

Notice the point along the x-axis (around 0.005) where the RMSE begins to rise sharply. That is roughly the value you want to set penalty to when you train your lasso regression model.

-----------------------------------------------------------------------------------------------------

Fit the best model

lasso_grid contains the results of 20 different model specs, one for each of the 20 penalty values in penalty_grid. In this exercise, you will find and fit the model with the optimal penalty value. In doing so, you will end up with a lasso regression model that optimizes feature selection for best model performance.

lasso_workflow and train are available for your use. The tidyverse and tidymodels packages have also been loaded for you.

* Retrieve the best fitted model based on RMSE.
* Use finalize_workflow() to fit a model based on best_rmse.
* Display the model coefficients of final_lasso.

# Retrieve the best RMSE
best_rmse <- lasso_grid %>% select_best(metric = "rmse")

# Refit the model with the best RMSE
final_lasso <- finalize_workflow(lasso_workflow, best_rmse) %>%
  fit(train)

# Display the non-zero model coefficients
tidy(final_lasso) %>% filter(estimate > 0)

# A tibble: 10 × 3
   term          estimate penalty
 1 bathrooms      0.0871    0.001
 2 sqft_living    0.392     0.001
 3 floors         0.0358    0.001
 4 waterfront     0.132     0.001
 5 view           0.0913    0.001
 6 condition      0.0306    0.001
 7 grade          0.379     0.001
 8 sqft_basement  0.00990   0.001
 9 yr_renovated   0.0113    0.001
10 sqft_living15  0.0472    0.001

final_lasso is now trained with optimal feature selection and ready to predict new data.
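The exercise stops at inspecting coefficients, but you could also sanity-check final_lasso on held-out data. A sketch, assuming the test split created earlier is still available (both the predictions and price are on the scaled scale):

# Predict on the test set and score with RMSE
final_lasso %>%
  predict(test) %>%
  bind_cols(test %>% select(price)) %>%
  rmse(truth = price, estimate = .pred)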
-----------------------------------------------------------------------------------------------------

Create full random forest model

Random forest models naturally perform feature selection as they build many subtrees from random subsets of the features. One way to understand feature importances is to build a model and then extract the importances from it. So, in this exercise, you will use the healthcare job attrition data to train a rand_forest() classification model from which you can extract feature importances. To make feature importances available, be sure to create the model with importance = "impurity".

The train and test sets are available to you. The tidyverse, tidymodels, and vip packages have been loaded for you.

* Define a random forest classification model with 200 trees that you can use to extract feature importances.
* Fit the random forest model with all predictors.
* Bind the predictions to the test set.
* Calculate the F1 metric.

# Specify the random forest model
rf_spec <- rand_forest(mode = "classification", trees = 200) %>%
  set_engine("ranger", importance = "impurity")

# Fit the random forest model with all predictors
rf_fit <- rf_spec %>%
  fit(Attrition ~ ., data = train)

# Create the test set prediction data frame
predict_df <- test %>%
  bind_cols(predict = predict(rf_fit, test))

# Calculate F1 performance
f_meas(predict_df, Attrition, .pred_class)

# A tibble: 1 × 3
  .metric .estimator .estimate
1 f_meas  binary         0.946

Notice that the random forest model with all of the features performs well based on the F1 measure. But are all those features needed to achieve that performance?

-----------------------------------------------------------------------------------------------------

Reduce data using feature importances

Now that you have created a full random forest model, you will explore feature importance. Even though random forest models naturally - but implicitly - perform feature selection, it is often advantageous to build a reduced model. A reduced model trains faster, computes predictions faster, and is easier to understand and manage. Of course, there is always a trade-off between model simplicity and model performance.

In this exercise, you will reduce the data set. In the next exercise, you will fit a reduced model and compare its performance to the full model. rf_fit, train, and test are provided for you. The tidyverse, tidymodels, and vip packages have been loaded for you.

* Use vi() with the rank parameter to extract the ten most important features.
* Add the target variable back to the top feature list.
* Apply the top feature mask to reduce the data sets.

# Extract the top ten features (rank 1 is the most important)
top_features <- rf_fit %>%
  vi(rank = TRUE) %>%
  filter(Importance <= 10) %>%
  pull(Variable)

# Add the target variable to the feature list
top_features <- c(top_features, "Attrition")

# Reduce and print the data sets
train_reduced <- train[top_features]
test_reduced <- test[top_features]
train_reduced %>% head(5)
test_reduced %>% head(5)

You just created a mask of the most important features and reduced the dimensionality of the data. How well do you think a model with the top ten features will perform?
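Before trusting the mask, it can help to look at the importances themselves. A sketch using the already-loaded vip package, which accepts the fitted parsnip model directly:

# Plot the ten largest impurity importances
rf_fit %>% vip(num_features = 10)

# Or inspect the raw scores as a tibble
rf_fit %>% vi() %>% arrange(desc(Importance)) %>% head(10)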
-----------------------------------------------------------------------------------------------------

Create reduced random forest

Now, it's time to fit a reduced model using train_reduced and evaluate it using test_reduced. rf_spec is available for you to fit the reduced model. The full model had an F1 value of 0.946. As you fit and evaluate the reduced model, keep in mind there is always a trade-off between model simplicity and model performance. You have to make a judgment call about whether the benefits of the model reduction are worth any decrease in model performance, if there is one.

The tidyverse, tidymodels, and vip packages have been loaded for you.

* Use rf_spec to fit the reduced random forest model.
* Bind the reduced model predictions to test_reduced.
* Calculate the F1 metric for the reduced model.

# Fit a reduced model
rf_reduced_fit <- rf_spec %>%
  fit(Attrition ~ ., data = train_reduced)

# Create test set prediction data frame
predict_reduced_df <- test_reduced %>%
  bind_cols(predict = predict(rf_reduced_fit, test_reduced))

# Calculate F1 performance
f_meas(predict_reduced_df, Attrition, .pred_class)

# A tibble: 1 × 3
  .metric .estimator .estimate
1 f_meas  binary         0.943

You produced a reduced model that performs only slightly worse than the full model, even though you removed more than half of the features. A reduced model can mean decreased training time, decreased prediction time in production, and a model that is easier to understand.
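If you want to quantify one of those benefits, a rough sketch with base R's system.time() compares prediction time for the full and reduced models; elapsed times will vary by machine and are only indicative at this data size:

# Rough timing comparison of full vs. reduced prediction
system.time(predict(rf_fit, test))
system.time(predict(rf_reduced_fit, test_reduced))

-----------------------------------------------------------------------------------------------------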