Dimensionality and feature information

Imagine you work for a bank and have collected information about loans made to different customers. Your boss wants you to begin exploring whether this data can be used to classify customers into different credit score categories. A sample of the available data is loaded into credit_df. You are curious about how many features the data has. You also want to identify features that will not be useful for classifying customers into different credit categories. The tidyverse package has been loaded for you.

* Find the number of features in credit_df.
* Compute the variance of each feature in credit_df.
* Identify the feature with zero variance and assign it to column_to_remove.

> glimpse(credit_df)
Rows: 137
Columns: 7
$ annual_income            87629.70, 16574.41, 24931.39, 136680.12, 7685…
$ num_bank_accounts        5, 7, 4, 0, 5, 8, 7, 6, 5, 5, 5, 3, 8, 6, 2, …
$ num_credit_card          5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
$ interest_rate            17, 11, 13, 3, 6, 11, 17, 20, 10, 16, 14, 16,…
$ outstanding_debt         526.26, NA, NA, NA, 1112.24, NA, NA, NA, 722.…
$ credit_utilization_ratio 26.67978, 29.56575, 41.18152, 31.44831, 24.35…
$ credit_history_months    286, 122, 351, 216, 272, 200, 134, 113, 347, …

# Find the number of features
credit_df %>%
  ncol()

# Compute the variance of each column
credit_df %>%
  summarize(across(everything(), ~ var(., na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "variance")

# A tibble: 7 × 2
  feature                  variance
1 annual_income             8.05e12
2 num_bank_accounts         1.16e 4
3 num_credit_card           0
4 interest_rate             2.57e 5
5 outstanding_debt          2.80e 5
6 credit_utilization_ratio  2.71e 1
7 credit_history_months     7.64e 3

# Assign the zero-variance column
column_to_remove <- "num_credit_card"

There are seven features in credit_df. Remember, features, columns, and dimensions are used interchangeably. Also, features like num_credit_card, whose values do not vary, provide no information and are not useful to us.
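If you want to go one step further and actually drop the zero-variance feature, a minimal sketch could look like this (credit_reduced_df is an illustrative name, not an object defined in the exercise):

# Drop the zero-variance feature identified above (sketch)
credit_reduced_df <- credit_df %>%
  select(-all_of(column_to_remove))

# credit_reduced_df keeps the six informative features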
-----------------------------------------------------------------------------------------------------

Mutual information features

The credit_df data frame contains a number of continuous features. When two continuous features are correlated, they carry overlapping information, something called mutual information. Highly correlated features are not just redundant; they can also cause problems in modeling. For instance, in regression, highly correlated features (multicollinearity) can produce nonsensical results. To get a sense of this, you will create a correlation plot to identify features that share information. The tidyverse and corrr packages have been loaded for you.

* Use correlate() and rplot() to create a correlation plot of the numeric features of credit_df.

> glimpse(credit_df)
Rows: 500
Columns: 23
$ month                    "April", "February", "April", "April", "Augus…
$ age                      49, 18, 49, 34, 24, 39, 23, 26, 52, 29, 49, 1…
$ occupation               "_______", "Writer", "Developer", "Developer"…
$ annual_income            27525.020, 67104.120, 20817.455, 68690.080, 4…
$ monthly_inhand_salary    2239.7517, 5829.0100, 1657.7879, 5588.1733, 3…
$ num_bank_accounts        6, 9, 5, 7, 7, 8, 8, 7, 8, 5, 3, 5, 9, 2, 1, …
$ num_credit_card          6, 9, 3, 8, 5, 8, 4, 9, 3, 5, 4, 10, 10, 6, 3…
$ interest_rate            6, 19, 10, 19, 19, 30, 504, 26, 9, 32, 5, 16,…
$ num_of_loan              0, 5, -100, 9, 6, 5, 3, 2, 4, 2, 1, 5, 5, 1, …
$ delay_from_due_date      24, 26, 1, 45, 48, 23, 15, 37, 29, 14, 12, 16…
$ num_of_delayed_payment   19, 25, -1, 27, 13, 17, 9, 24, 11, 20, 1, 16,…
$ changed_credit_limit     17.92, 21.54, 1.52, 3.63, 15.83, 15.72, 4.34,…
$ num_credit_inquiries     2, 6, 3, 7, 12, 9, 0, 12, 3, 12, 6, 8, 10, 2,…
$ credit_mix               "Standard", NA, "Good", NA, "Standard", NA, N…
$ outstanding_debt         999.12, 4635.75, 691.90, 2042.65, 1264.71, 40…
$ credit_utilization_ratio 33.84099, 31.31489, 32.72462, 38.94719, 33.08…
$ payment_of_min_amount    "NM", "Yes", "No", "Yes", "NM", "Yes", "No", …
$ total_emi_per_month      0.000000, 265.022160, 44.948595, 380.788074, …
$ amount_invested_monthly  86.53620, 198.33534, 150.53721, 46.06771, 85.…
$ payment_behaviour        "Low_spent_Large_value_payments", "Low_spent_…
$ monthly_balance          407.4390, 389.5435, 240.2930, 371.9615, 387.8…
$ credit_history_months    280, 15, 344, 236, 230, 20, 342, 100, 373, 16…
$ credit_score             "Standard", "Standard", "Standard", "Poor", "…

# Create a correlation plot
credit_df %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  shave() %>%
  rplot(print_cor = TRUE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Features with a strong positive or negative correlation share information, so you would consider removing one feature from each highly correlated pair.
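The plot is good for a visual scan; if you also want the strongly correlated pairs as a table, one possible sketch uses corrr's stretch() to reshape the correlation matrix into long format (the 0.9 cutoff below is an arbitrary assumption, not part of the exercise):

# List feature pairs with a strong correlation (sketch; 0.9 cutoff is arbitrary)
credit_df %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  shave() %>%                          # keep one copy of each pair
  stretch() %>%                        # long format: columns x, y, r
  filter(!is.na(r), abs(r) > 0.9) %>%
  arrange(desc(abs(r)))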
-----------------------------------------------------------------------------------------------------

Calculating root entropy

This exercise continues with the loan default example from the slides. The image shows the root node of a decision tree split by color. In these next three exercises, you will calculate the entropy of the root node, calculate the entropy of the child nodes, and determine the information gain that color provides for determining loan default status. In this exercise, the task is to calculate the entropy of the root node in this decision tree.

[Image: root node of the decision tree, split by color]

* Calculate the class probabilities of the "yes" and "no" classes in the root node for defaulting on a loan.
* Use the class probabilities to calculate the entropy of the root node.

# Calculate the class probabilities
p_yes <- 4 / 12
p_no <- 8 / 12

# Calculate the entropy
entropy_root <- -(p_yes * log2(p_yes)) - (p_no * log2(p_no))
entropy_root
[1] 0.9182958

You just measured the amount of disorder, or lack of purity, in the root node. Remember, the goal of models like decision trees is to use the information in the predictors to better understand the target variable.

-----------------------------------------------------------------------------------------------------

Calculating child entropies

You have completed the first step for measuring the information gain that color provides: you computed the disorder of the root node. Now you need to measure the entropy of the child nodes so that you can assess whether the collective disorder of the child nodes is less than the disorder of the parent node. If it is, then you have gained some information about loan default status from the information found in color.

[Image: decision tree split by color]

* Calculate the class probabilities for the left split (blue side).
* Calculate the entropy of the left split using the class probabilities.
* Calculate the class probabilities for the right split (green side).
* Calculate the entropy of the right split using the class probabilities.

# Calculate the class probabilities in the left split
p_left_yes <- 3 / 5
p_left_no <- 2 / 5

# Calculate the entropy of the left split
entropy_left <- -(p_left_yes * log2(p_left_yes)) - (p_left_no * log2(p_left_no))

# Calculate the class probabilities in the right split
p_right_yes <- 1 / 7
p_right_no <- 6 / 7

# Calculate the entropy of the right split
entropy_right <- -(p_right_yes * log2(p_right_yes)) - (p_right_no * log2(p_right_no))

Now you know the disorder in the root and the child nodes. Information gain is the amount by which the disorder decreases after the split on color.

-----------------------------------------------------------------------------------------------------

Calculating information gain of color

Now that you know the entropies of the root and child nodes, you can calculate the information gain that color provides. In the prior exercises, you calculated entropy_root, entropy_left, and entropy_right. They are available in the console. Remember that you will take the weighted average of the child node entropies. So, you will need to calculate what proportion of the original observations ended up on the left and right side of the split. Store those in p_left and p_right, respectively.

[Image: decision tree split by color]

* Calculate the split weights, that is, the proportion of observations on each side of the split.
* Calculate the information gain using the weights and the entropies.

# Calculate the split weights
p_left <- 5 / 12
p_right <- 7 / 12

# Calculate the information gain
info_gain <- entropy_root - (p_left * entropy_left + p_right * entropy_right)
info_gain
[1] 0.1685906

To judge how much information color provides relative to the other predictors (shape, outline, and texture), you would compare its information gain to theirs. Remember that information gain is one way to measure feature importance that is specific to decision trees. There are other ways to determine feature importance as well.
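The same arithmetic generalizes to any number of classes and child nodes. Below is a minimal sketch of reusable helpers; the function names entropy() and information_gain() and the vectors of class counts are illustrative assumptions, not objects defined in the exercises:

# Entropy of a node from its class counts (sketch)
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                        # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Information gain of a split, given parent counts and a list of child counts (sketch)
information_gain <- function(parent_counts, child_counts) {
  n <- sum(parent_counts)
  weighted_children <- sum(sapply(child_counts, function(x) sum(x) / n * entropy(x)))
  entropy(parent_counts) - weighted_children
}

# Reproduces the value computed above for the color split
information_gain(c(yes = 4, no = 8), list(c(3, 2), c(1, 6)))
# [1] 0.1685906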
-----------------------------------------------------------------------------------------------------

Calculate possible combinations

The healthcare_cat_df data frame contains categorical variables about employees in a healthcare company and whether they left the company or not. You will use this dataset to determine the number of possible combinations of the feature values in the dataset. When training a machine learning model, you would want your data to contain many observations of each combination. So, the number of combinations creates a benchmark for the minimum number of observations you would need to collect to help avoid bias in your model. The tidyverse package has been loaded for you.

* Calculate the minimum number of observations needed to represent all combinations of the feature values in healthcare_cat_df.

> glimpse(healthcare_cat_df)
Rows: 1,676
Columns: 8
$ BusinessTravel Travel_Rarely, Travel_Frequently, Travel_Rarely, Travel…
$ Department     Cardiology, Maternity, Maternity, Maternity, Maternity,…
$ EducationField Life Sciences, Life Sciences, Other, Life Sciences, Med…
$ Gender         Female, Male, Male, Female, Male, Male, Female, Male, M…
$ JobRole        Nurse, Other, Nurse, Other, Nurse, Nurse, Nurse, Nurse,…
$ MaritalStatus  Single, Married, Single, Married, Married, Single, Marr…
$ OverTime       Yes, No, Yes, Yes, No, No, Yes, No, No, No, No, Yes, No…
$ Attrition      No, No, Yes, No, No, No, No, No, No, No, No, No, No, No…

# Calculate the minimum number of value combinations
healthcare_cat_df %>%
  summarize(across(everything(), ~ length(unique(.)))) %>%
  prod()
[1] 6480

When training a machine learning model, you would want a sample that includes each combination several times, so that every combination appears at least once in both the training and testing data sets. In this example, healthcare_cat_df has eight dimensions and needs a bare minimum of 6,480 observations. Bias and data sparsity can be thought of as two manifestations of the curse of dimensionality: as dimensionality increases, it becomes harder to represent all possible combinations of feature values.
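To see that sparsity concretely, you could compare the theoretical minimum against how many distinct combinations actually appear in the data. This is a sketch, not part of the exercise; n_possible and n_observed are illustrative names, and n_distinct() is dplyr's shorthand for length(unique(.)):

# Number of possible combinations (same calculation as above)
n_possible <- healthcare_cat_df %>%
  summarize(across(everything(), n_distinct)) %>%
  prod()

# Number of combinations actually present in the 1,676 rows
n_observed <- healthcare_cat_df %>%
  distinct() %>%
  nrow()

# With fewer rows than possible combinations, many combinations are missing entirely
c(possible = n_possible, observed = n_observed)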