Create a zero-variance filter

house_sales_df contains ten continuous variables describing house sales in King County, Washington. Examples of these variables include square footage, number of rooms, and sales price. You will need to reduce the dimensionality to make the dataset easier to work with and to reduce training time when creating models. Let's get started by creating a zero-variance filter.

The tidyverse package has been loaded for you.

* Create a zero-variance filter using summarize() and filter(), and store it in zero_var_filter.

> glimpse(house_sales_df)
Rows: 21,613
Columns: 10
$ price              221900, 538000, 180000, 604000, 510000, 1225000, 25…
$ bedrooms           3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, …
$ bathrooms          1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.0…
$ sqft_living        1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780…
$ sqft_living_near15 NA, 1690, 2720, 1360, 1800, NA, NA, NA, 1780, 2390,…
$ sqft_lot           5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, …
$ sqft_lot_near15    NA, 7639, 8062, NA, 7503, 101930, 6819, NA, NA, NA,…
$ yr_renovated       NA, 1991, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ num_garages        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ num_hvac_units     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

# Create zero-variance filter
zero_var_filter <- house_sales_df %>%
  summarize(across(everything(), ~ var(., na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "variance") %>%
  filter(variance == 0) %>%
  pull(feature)

zero_var_filter
[1] "num_garages"    "num_hvac_units"

-----------------------------------------------------------------------------------------------------

Create a missing values filter

The zero-variance filter only removes some of the low-information features. Features may also contain little to no information because they have a high number of missing values. In this exercise, you'll create a missing values filter.
You'll take an extreme approach and remove any feature with at least one missing value, which means you could remove features with significant information.

house_sales_df is available in the console, and the tidyverse package has been loaded for you.

* Create a missing values filter using summarize(), across(), sum(), and is.na() to remove features with one or more missing values, and store it in na_filter.

# Create a missing values filter
na_filter <- house_sales_df %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "NA_count") %>%
  filter(NA_count > 0) %>%
  pull(feature)

na_filter
[1] "sqft_living_near15" "sqft_lot_near15"    "yr_renovated"

-----------------------------------------------------------------------------------------------------

Feature selection with the combined filter

Now that you've created the zero-variance and missing values filters, put them to work to reduce the dimensionality of house_sales_df. You'll combine the filters and then use the combined filter to remove the low-information features from house_sales_df.

The zero_var_filter and na_filter objects are available for your use, and the tidyverse package has been loaded for you.

* Combine zero_var_filter and na_filter into low_info_filter.
* Apply low_info_filter to reduce the dimensionality of house_sales_df.
* Display five rows of the reduced house_sales_df data set.

# Combine the two filters
low_info_filter <- c(zero_var_filter, na_filter)

# Apply the filter
house_sales_filtered_df <- house_sales_df %>%
  select(-all_of(low_info_filter))

# Display five rows of the reduced data set
house_sales_filtered_df %>%
  head(5)

# A tibble: 5 × 5
   price bedrooms bathrooms sqft_living sqft_lot
1 221900        3      1           1180     5650
2 538000        3      2.25        2570     7242
3 180000        2      1            770    10000
4 604000        4      3           1960     5000
5 510000        3      2           1680     8080

You just performed feature selection on house_sales_df by identifying and removing features with zero variance or missing values.
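Under the hood, these two filters are simply "which columns have variance 0?" and "which columns contain any NAs?". Here is a minimal base-R sketch of the same combined-filter logic on a small toy data frame (the data frame and its values are invented for illustration; this is not the real house_sales_df):

```r
# Toy data frame standing in for house_sales_df (values are invented)
toy_df <- data.frame(
  price       = c(221900, 538000, 180000, 604000),
  sqft_living = c(1180, 2570, 770, 1960),
  yr_renov    = c(NA, 1991, NA, 2004),   # contains missing values
  num_garages = c(1, 1, 1, 1)            # constant, so zero variance
)

# Zero-variance filter: columns whose variance (ignoring NAs) is exactly 0
zero_var <- names(toy_df)[sapply(toy_df, function(x) isTRUE(var(x, na.rm = TRUE) == 0))]

# Missing-values filter: columns with one or more NAs
nas <- names(toy_df)[sapply(toy_df, anyNA)]

# Combine the filters and drop the low-information columns
low_info <- union(zero_var, nas)
toy_filtered <- toy_df[, setdiff(names(toy_df), low_info), drop = FALSE]

names(toy_filtered)  # "price" "sqft_living"
```

The tidyverse version in the exercise does the same thing, but the pivot_longer() / filter() / pull() pipeline scales more readably to many columns and fits the recipe-based workflow used later.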
-----------------------------------------------------------------------------------------------------

Create a missing value ratio filter

The house_sales_df data frame contains a target variable, price, and a variety of predictors that describe individual houses and determine their selling prices. Several of the features have varying numbers of missing values. If the missing value ratio is too high, the feature will not be very informative in predicting the price of the house, and it can be removed.

In this exercise, you will calculate the missing value ratio for each column. This will help you think about an appropriate threshold for each column.

The tidyverse package has been loaded for you.

* Store the total number of rows in house_sales_df in n.
* Calculate the missing value ratios for each column in house_sales_df and store them in missing_vals_df.

# Calculate total rows
n <- nrow(house_sales_df)

# Calculate missing value ratios
missing_vals_df <- house_sales_df %>%
  summarize(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "num_missing_values") %>%
  mutate(missing_val_ratio = num_missing_values / n)

# Display missing value ratios
missing_vals_df

# A tibble: 18 × 3
   feature            num_missing_values missing_val_ratio
 1 date                            12019             0.556
 2 price                               0             0
 3 bedrooms                            0             0
 4 bathrooms                           0             0
 5 sqft_living                         0             0
 6 sqft_lot                            0             0
 7 floors                          13541             0.627
 8 waterfront                          0             0
 9 view                            18368             0.850
10 condition                        3700             0.171
11 sqft_above                          0             0
12 sqft_basement                       0             0
13 yr_built                            0             0
14 yr_renovated                    20699             0.958
15 num_garages                         0             0
16 num_hvac_units                      0             0
17 sqft_living_near15              12917             0.598
18 sqft_lot_near15                 10908             0.505

Notice how the missing value ratios among the columns with missing values range from 17% to 96%.
While it is obvious that yr_renovated, with a 96% missing value ratio, should be removed and condition, with a 17% missing value ratio, should be kept, it is not so obvious whether columns like date, floors, and sqft_lot_near15, with ratios around 50-60%, should be kept or removed. In part, it will depend on how important each column is for predicting the target variable, price.

-----------------------------------------------------------------------------------------------------

Apply a missing value ratio filter

Now that you have calculated the missing value ratios, you can create a filter using a missing value threshold. In this exercise, we will select an arbitrary, but reasonable, missing value ratio threshold and apply it to all the columns. In the real world, you will think critically and customize the threshold for each feature.

missing_vals_df, which contains the ratios you calculated in the last exercise, and the house_sales_df data frame are both available for your use. The tidyverse package has also been loaded for you.

* Use missing_vals_df and a threshold of 0.5 to create a missing value ratio filter and store it in missing_vals_filter.
* Apply missing_vals_filter to house_sales_df to reduce its dimensionality and store the new data frame in filtered_house_sales_df.

# Create the missing values filter
missing_vals_filter <- missing_vals_df %>%
  filter(missing_val_ratio <= 0.5) %>%
  pull(feature)

# Apply the missing values filter
filtered_house_sales_df <- house_sales_df %>%
  select(all_of(missing_vals_filter))

# Display the first five rows of data
filtered_house_sales_df %>%
  head(5)

Now you have a reduced housing data set. Remember, in this lesson we chose an arbitrary threshold for the missing value ratio; in practice, you will understand the data and explore the importance of each feature to tailor the cutoff ratio for each feature.
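The same threshold logic can be sketched in base R. This toy example (invented values, not the real house_sales_df) keeps only the columns whose missing value ratio is at or below 0.5:

```r
# Toy data frame (invented values) illustrating the ratio threshold
toy_df <- data.frame(
  price     = c(221900, 538000, 180000, 604000),
  view      = c(NA, NA, NA, 2),   # 3 of 4 missing -> ratio 0.75, removed
  condition = c(3, NA, 4, 5)      # 1 of 4 missing -> ratio 0.25, kept
)

n <- nrow(toy_df)

# Missing value ratio per column
missing_ratio <- sapply(toy_df, function(x) sum(is.na(x)) / n)

# Keep columns at or below the 0.5 threshold
keep <- names(toy_df)[missing_ratio <= 0.5]
toy_filtered <- toy_df[, keep, drop = FALSE]

names(toy_filtered)  # "price" "condition"
```

Note that the direction of the comparison matters: `missing_val_ratio <= 0.5` selects the features to keep, which is why the exercise uses a plain select() rather than the negated select(-all_of(...)) pattern from the earlier filters.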
-----------------------------------------------------------------------------------------------------

Create a missing values recipe

In the previous exercises, you manually calculated the missing value ratio and created a filter to reduce the dimensionality of house_sales_df. The tidymodels package contains a recipe step that applies a missing value ratio filter automatically: step_filter_missing(). The advantage of the tidymodels approach is that it allows you to reuse the recipe on other data sets and simplifies the move to a production environment.

In this exercise, you will use the step_filter_missing() function to perform dimensionality reduction on house_sales_df based on missing values.

The tidyverse and tidymodels packages have been loaded for you.

* Use recipe() to create a missing values filter with a threshold of 0.5.
* Apply missing_vals_recipe to house_sales_df.

# Create missing values recipe
missing_vals_recipe <- recipe(price ~ ., data = house_sales_df) %>%
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  prep()

# Apply recipe to data
filtered_house_sales_df <- bake(missing_vals_recipe, new_data = NULL)

# Display the first five rows of data
filtered_house_sales_df %>%
  head(5)

While you could perform missing values feature selection manually, the tidymodels package provides recipe steps that automate the process and make it easier to apply the recipe to different data sets and to integrate the process into a production environment.

-----------------------------------------------------------------------------------------------------

Create a low-variance filter

In this exercise, you are given house_sales_df, which contains seventeen continuous features. Some of those features have no variance at all, and some have very little. You will explore the variances and establish a filter using an appropriate variance threshold. This approach is useful for removing dimensions with little to no information, but as you'll see, it has a few drawbacks.
The tidyverse and tidymodels packages have been loaded for you.

* Calculate the feature variances in house_sales_df.
* Identify an appropriate variance threshold and create the low-variance filter.
* Apply low_var_filter to house_sales_df.

# Calculate feature variances
houses_sales_variances <- house_sales_df %>%
  summarize(across(everything(), ~ var(scale(., center = FALSE), na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "variance") %>%
  arrange(desc(variance))

houses_sales_variances

# A tibble: 17 × 2
   feature            variance[,1]
 1 waterfront            0.992
 2 view                  0.915
 3 sqft_lot              0.883
 4 sqft_lot_near15       0.811
 5 sqft_basement         0.697
 6 price                 0.316
 7 sqft_above            0.177
 8 sqft_living           0.163
 9 bathrooms             0.117
10 floors                0.115
11 sqft_living_near15    0.106
12 bedrooms              0.0707
13 condition             0.0351
14 yr_built              0.000222
15 yr_renovated          0.0000604
16 num_garages           0
17 num_hvac_units        0

# Set variance threshold and create filter
low_var_filter <- houses_sales_variances %>%
  filter(variance < 0.1) %>%
  pull(feature)

> low_var_filter
[1] "bedrooms"       "condition"      "yr_built"       "yr_renovated"
[5] "num_garages"    "num_hvac_units"

# Apply the filter
filtered_house_sales_df <- house_sales_df %>%
  select(-all_of(low_var_filter))

filtered_house_sales_df %>%
  head(5)

Notice how a threshold of < 0.1 filters out yr_built and yr_renovated. Those are probably important variables. They were filtered out because their ranges are relatively narrow compared to the other variables. In the next exercise, you'll explore a better way to filter low-variance features.

-----------------------------------------------------------------------------------------------------

Create a low-variance recipe

The tidymodels package provides a better way to filter zero- and near-zero-variance features with its step_zv() and step_nzv() functions, respectively.
These recipe steps identify low-variance features by examining the number of unique values in each feature and the ratio of the frequencies of the two most common values. This approach is more robust than the simple variance cutoff used previously. In addition, you will use the step_scale() recipe step to normalize the variance of the features. Remember, it's always a good idea to normalize the data so that variances are comparable across features.

house_sales_df is available for you to use. The target variable is price. The tidyverse and tidymodels packages have also been loaded for you.

* Define a recipe for a low-variance filter and prepare it using house_sales_df.
* Apply the recipe to house_sales_df and store the filtered data in filtered_house_sales_df.
* Display the features that the recipe filtered out in the step_nzv() step.

# Prepare recipe
low_variance_recipe <- recipe(price ~ ., data = house_sales_df) %>%
  step_zv(all_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors()) %>%
  prep()

# Apply recipe
filtered_house_sales_df <- bake(low_variance_recipe, new_data = NULL)

# View list of features removed by the near-zero variance step
tidy(low_variance_recipe, number = 3)

# A tibble: 3 × 2
  terms         id
1 waterfront    nzv_8b5T1
2 view          nzv_8b5T1
3 sqft_basement nzv_8b5T1

The magic of a recipe is that you can reuse it, applying it to a new sample of house sales data with the same columns.

-----------------------------------------------------------------------------------------------------

Identify highly correlated features

Using the data in house_sales_df, you will practice identifying features that have high correlation. High correlation among features indicates redundant information and can cause problems in modeling, such as multicollinearity in regression models. You will determine which of the highly correlated features to remove. A correlation matrix will help you identify highly correlated features.
The tidyverse and corrr packages have been loaded for you.

* Create a correlation plot with the correlations printed on the plot.

# Create a correlation plot of the house sales
house_sales_df %>%
  correlate() %>%
  shave() %>%
  rplot(print_cor = TRUE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Nice correlation plot! Sometimes you need to think about which of the correlated features contains the most important information. For instance, look at sqft_living and sqft_above. It'd be best to remove sqft_living because the information it contributes is largely covered by other variables like sqft_living_near15, bedrooms, and bathrooms.

-----------------------------------------------------------------------------------------------------

Create a high-correlation recipe

Once you have identified highly correlated features, instead of removing them manually, you can use the step_corr() recipe step in tidymodels. step_corr() does not remove all features that are correlated with other features; it attempts to remove as few features as possible. Conceptually, as you saw in the multiple choice exercise, it removes the feature that has the most overlap with any combination of other features. The idea is that the other features contain the same information, so the overlapping information of the removed feature is still represented in those features.

The tidyverse and tidymodels packages have been loaded for you.

* Create a recipe that uses step_corr() with a threshold of 0.7, applying the step to numeric predictors only.
* Apply the recipe to house_sales_df and store the filtered data in filtered_house_sales_df.
* Use tidy() to identify the column or columns that the step_corr() filter removed.
# Create a recipe using step_corr to remove numeric predictors correlated > 0.7
corr_recipe <- recipe(price ~ ., data = house_sales_df) %>%
  step_corr(all_numeric_predictors(), threshold = 0.7) %>%
  prep()

# Apply the recipe to the data
filtered_house_sales_df <- corr_recipe %>%
  bake(new_data = NULL)

# Identify the features that were removed
tidy(corr_recipe, number = 1)

# A tibble: 3 × 2
  terms           id
1 sqft_living     corr_Sew2a
2 sqft_lot_near15 corr_Sew2a
3 sqft_above      corr_Sew2a

Notice that step_corr() removes the minimal number of correlated features, not all features that are correlated above the threshold.
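To make the "remove as few features as possible" idea concrete, here is a simplified greedy sketch in base R. The drop_correlated() helper is hypothetical and is not the exact algorithm behind step_corr(); it just captures the spirit: while any pair of features is correlated above the threshold, drop the feature with the highest average absolute correlation to the rest, then re-check.

```r
# Hypothetical helper: a simplified greedy sketch of correlation-based
# removal, NOT the exact algorithm used by step_corr()
drop_correlated <- function(df, threshold = 0.7) {
  removed <- character(0)
  repeat {
    keep <- setdiff(names(df), removed)
    if (length(keep) < 2) break
    cm <- abs(cor(df[, keep, drop = FALSE]))
    diag(cm) <- 0                       # ignore self-correlation
    if (max(cm) < threshold) break      # no pair left above the threshold
    # Drop the feature most correlated, on average, with the rest
    removed <- c(removed, names(which.max(rowMeans(cm))))
  }
  removed
}

# Toy data: b nearly duplicates a, while c is independent noise
set.seed(42)
a <- rnorm(100)
toy_df <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))

drop_correlated(toy_df, threshold = 0.7)  # removes just one of "a" or "b"
```

Because a and b carry nearly identical information, removing either one is enough to bring every remaining pairwise correlation under the threshold, which is why only a single feature is dropped rather than both.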