Understanding principal components

Principal component analysis (PCA) reduces dimensionality by combining features that carry overlapping (correlated) information. PCA extracts new features, called principal components, that are uncorrelated with each other. One way to understand PCA is to plot the two major principal components along the x- and y-axes and display the feature vectors. This lets you see which features contribute to each principal component. Though it is not always easy, it is good practice to name the principal components based on the features that contribute to them. Even so, as a feature extraction method, PCA is often difficult to interpret.

A subset of the credit data is contained in credit_df. The target variable is credit_score. The tidyverse and ggfortify packages have also been loaded for you.

* Perform principal component analysis on credit_df.
* Use autoplot() to display the first two principal components, the feature vectors and labels, and encode credit_score in color.

> glimpse(credit_df)
Rows: 18,965
Columns: 5
$ monthly_inhand_salary    2922.7992, 2501.7883, 2293.3483, 1301.3958, 1…
$ credit_utilization_ratio 39.41268, 25.26184, 25.24044, 31.18439, 23.13…
$ delay_from_due_date      5, 5, 5, 26, 40, 23, 62, 18, 13, 51, 10, 15, …
$ credit_history_months    344, 398, 257, 145, 161, 199, 31, 121, 244, 1…
$ credit_score             "Standard", "Good", "Standard", "Poor", "Poor…

# Perform PCA on the predictors (drop the target, standardize the features)
pca_res <- prcomp(credit_df %>% select(-credit_score), scale. = TRUE)

# Plot the first two principal components and the feature vectors
autoplot(pca_res, data = credit_df, colour = "credit_score", alpha = 0.3,
         loadings = TRUE, loadings.label = TRUE,
         loadings.colour = "black", loadings.label.colour = "black")

Notice that the first two principal components together explain roughly 72% of the variance in the original data (46.99% + 25.43%). One of the disadvantages of feature extraction techniques like PCA is that they are often difficult to interpret. Can you create descriptive names for PC1 and PC2?

-----------------------------------------------------------------------------------------------------

PCA in tidymodels

From a model-building perspective, PCA lets you create models with fewer features that still capture most of the information in the original data. However, as you've seen, a disadvantage of PCA is the difficulty of interpreting the model.

In this exercise, you will build a linear regression model using a subset of the house sales data. The target variable is price. A model built directly from the data, without extracting principal components, has an RMSE of $236,461.4. You will apply PCA with tidymodels and compare the new RMSE. Remember, lower RMSEs are better.

The tidyverse and tidymodels packages have been loaded for you.

* Build a PCA recipe using train to extract five principal components.
* Fit a workflow with a default linear_reg() model spec.
* Create a test prediction data frame using test that contains the actual and predicted values.
* Calculate the RMSE for the PCA-reduced linear regression model.
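A side note before the solution: instead of fixing the number of components, step_pca() can also keep however many components are needed to cover a chosen share of the total variance, through its threshold argument. The sketch below is illustrative only and not part of the exercise; the 0.9 value and the object name pca_recipe_var are assumptions made for the example, and the exercise itself asks for exactly five components.

# Illustrative alternative (not the exercise solution): keep enough components
# to cover roughly 90% of the total variance instead of a fixed num_comp
pca_recipe_var <- recipe(price ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = 0.9)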
> glimpse(house_sales_df)
Rows: 21,613
Columns: 9
$ bedrooms      3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
$ bathrooms     1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
$ sqft_living   1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
$ sqft_above    1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
$ view          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
$ condition     3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
$ sqft_basement 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
$ yr_built      1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
$ price         221900, 538000, 180000, 604000, 510000, 1225000, 257500,…

# Build a PCA recipe: normalize the predictors, then extract five components
pca_recipe <- recipe(price ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 5)

# Fit a workflow with a default linear_reg() model spec
house_sales_fit <- workflow(preprocessor = pca_recipe, spec = linear_reg()) %>%
  fit(train)

# Create a prediction data frame for the test set
house_sales_pred_df <- predict(house_sales_fit, test) %>%
  bind_cols(test %>% select(price))

# Calculate the RMSE
rmse(house_sales_pred_df, price, .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
1 rmse    standard     255895.

Notice that the RMSE increased from $236,461.4 to roughly $255,895, an increase of about $19,000, so we lost some information by applying PCA. However, the initial model already had an average error of $236,461.4 in predicting house values in California - that is not very good performance. So the original data set must be missing some important information for predicting house prices.

-----------------------------------------------------------------------------------------------------

Separating house prices with PCA

PCA and t-SNE are both feature extraction techniques, but PCA can only capture the linear structure of the data. In this exercise, you will create a PCA plot of the full house_sales_df so you can compare its result with the t-SNE output.

Remember that price is the target variable in house_sales_df. It is important to remove it before fitting PCA to the data.

The tidyverse and ggfortify packages have been loaded for you.

* Fit a PCA to the predictors of house_sales_df.
* Use autoplot() to plot the first two PCs and encode price in color.

# Fit PCA to only the predictors
pca <- prcomp(house_sales_df %>% select(-price))

# Plot the first two PCs and color-code the target variable
autoplot(pca, data = house_sales_df, colour = "price", alpha = 0.7) +
  scale_color_gradient(low = "gray", high = "blue")

Notice that PCA doesn't do a very good job of differentiating house prices. PC1 accounts for almost all of the variance, but it does not explain why some houses are more expensive than others.

(slika3.png)

-----------------------------------------------------------------------------------------------------

Separating house prices with t-SNE

t-SNE is a non-linear dimensionality reduction technique. It embeds high-dimensional data into a lower-dimensional space, and as it does so, it strives to keep points next to their original neighbors. You will create a t-SNE plot that you can compare with the PCA plot from the last exercise.

PCA preserves the global structure of the data, but not the local structure. t-SNE preserves the local structure by keeping neighbors in the higher-dimensional space close to each other in the lower-dimensional space. You will see this in the plots.

You will apply t-SNE to reduce house_sales_df. The target variable of house_sales_df is price.
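How large a neighborhood t-SNE tries to preserve is governed by the perplexity argument of Rtsne(). The sketch below is purely illustrative and not part of the exercise; the perplexity value of 50 and the object name tsne_alt are assumptions made for the example.

# Illustrative only: a larger perplexity makes t-SNE consider bigger
# neighborhoods, which emphasizes more of the global structure
set.seed(1234)
tsne_alt <- Rtsne(house_sales_df %>% select(-price),
                  perplexity = 50,
                  check_duplicates = FALSE)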
The tidyverse and Rtsne packages have been loaded for you.

* Fit t-SNE to house_sales_df using Rtsne().
* Bind the t-SNE X and Y coordinates to house_sales_df.
* Plot the t-SNE results using ggplot(), encoding the target variable in color.

# Fit t-SNE to the predictors
set.seed(1234)
tsne <- Rtsne(house_sales_df %>% select(-price), check_duplicates = FALSE)

# Bind the t-SNE coordinates to the data frame
tsne_df <- house_sales_df %>%
  bind_cols(tsne_x = tsne$Y[, 1], tsne_y = tsne$Y[, 2])

# Plot t-SNE, coloring by price
tsne_df %>%
  ggplot(aes(x = tsne_x, y = tsne_y, color = price)) +
  geom_point() +
  scale_color_gradient(low = "gray", high = "blue")

The x and y axes are the extracted features. Remember, a goal of supervised dimensionality reduction is to gain more information about the target variable. t-SNE does a little better than PCA at separating more and less expensive houses, but it is still not great.

(slika4.png)

-----------------------------------------------------------------------------------------------------

Separating house prices with UMAP

You have reduced the dimensionality of the California house sales data (house_sales_df) using PCA and t-SNE. You will now use UMAP. The end result of UMAP is very similar to that of t-SNE; however, UMAP tends to be more computationally efficient, and it strives to retain more of the global structure. In practical terms, this means you can interpret the distance between clusters as a measure of similarity - something you couldn't do with t-SNE.

Remember, the target variable of house_sales_df is price. Set num_comp = 2.

The tidyverse and embed packages have been loaded for you.

* Fit UMAP to all the predictors in house_sales_df using step_umap() in a recipe and store the transformed data in umap_df.
* Plot the UMAP dimensions using ggplot(), encoding the target variable price in color.

# Fit UMAP: normalize the predictors, extract two UMAP components,
# then prep and juice the recipe to get the transformed data
set.seed(1234)
umap_df <- recipe(price ~ ., data = house_sales_df) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), num_comp = 2) %>%
  prep() %>%
  juice()

# Plot the UMAP components, coloring by price
umap_df %>%
  ggplot(aes(x = UMAP1, y = UMAP2, color = price)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient(low = "gray", high = "blue")

UMAP still struggles to completely separate the more and less expensive houses, but that's to be expected - house pricing in California is complicated.

(slika5.png)

-----------------------------------------------------------------------------------------------------

UMAP reduction in a decision tree model

Now that you have visualized a UMAP reduction, let's put UMAP to work in model building. In this exercise, you will build a workflow that applies UMAP in a preprocessing recipe to the credit data and then uses the extracted components to build a decision tree model.

The credit data train and test sets are provided for you. The embed library has already been loaded.

* Create a recipe to apply a UMAP reduction to the data, resulting in four extracted components.
* Create a decision_tree model for classification.
* Add the UMAP recipe and the decision tree model to a workflow.
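One detail worth flagging before the solution: step_umap() can use the outcome while computing the reduction. Passing the target through its outcome argument, as the recipe in the solution does with outcome = vars(credit_score), makes the embedding supervised, so the components are steered toward separating the credit score classes rather than only preserving the overall structure of the predictors. A minimal sketch of the two variants, reusing the same train data as the exercise:

# Unsupervised UMAP: components reflect only the structure of the predictors
recipe(credit_score ~ ., data = train) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), num_comp = 4)

# Supervised UMAP: the outcome guides the embedding (what the solution below uses)
recipe(credit_score ~ ., data = train) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), outcome = vars(credit_score), num_comp = 4)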
> glimpse(credit_df)
Rows: 18,965
Columns: 18
$ credit_score             Standard, Good, Standard, Poor, Poor, Poor, S…
$ age                      44, 19, 39, 43, 22, 52, 32, 45, 37, 22, 26, 2…
$ annual_income            32625.590, 31041.460, 25876.180, 15928.750, 1…
$ monthly_inhand_salary    2922.7992, 2501.7883, 2293.3483, 1301.3958, 1…
$ num_bank_accounts        0, 5, 4, 9, 7, 5, 7, 9, 1, 7, 5, 7, 9, 4, 4, …
$ num_credit_card          5, 2, 6, 9, 9, 5, 8, 8, 5, 7, 5, 5, 5, 5, 7, …
$ interest_rate            8, 5, 7, 24, 18, 5, 26, 21, 10, 25, 3, 23, 32…
$ num_of_loan              4, 1, 3, 4, 2, 4, 9, 2, 2, 7, 1, 3, 6, 1, 2, …
$ delay_from_due_date      5, 5, 5, 26, 40, 23, 62, 18, 13, 51, 10, 15, …
$ num_of_delayed_payment   10, 11, 15, 22, 17, 1, 18, 18, 3, 16, 7, 21, …
$ changed_credit_limit     2.55, 5.13, 10.29, 1.26, 6.09, 0.75, 22.13, 8…
$ num_credit_inquiries     3, 0, 4, 12, 12, 1, 12, 13, 3, 8, 0, 9, 12, 1…
$ outstanding_debt         177.90, 291.77, 71.54, 2240.56, 2063.45, 636.…
$ credit_utilization_ratio 39.41268, 25.26184, 25.24044, 31.18439, 23.13…
$ total_emi_per_month      73.125008, 21.021180, 37.374350, 31.846679, 1…
$ amount_invested_monthly  139.43374, 185.31083, 117.93599, 161.38822, 1…
$ monthly_balance          369.7212, 333.8468, 354.0245, 226.9047, 343.9…
$ credit_history_months    344, 398, 257, 145, 161, 199, 31, 121, 244, 1…

# Create a recipe that normalizes the predictors and applies a supervised
# UMAP reduction to four components
umap_recipe <- recipe(credit_score ~ ., data = train) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), outcome = vars(credit_score), num_comp = 4)

# Specify a decision tree model for classification
umap_dt_model <- decision_tree(mode = "classification")

# Add the recipe and model to a workflow
umap_dt_workflow <- workflow() %>%
  add_recipe(umap_recipe) %>%
  add_model(umap_dt_model)

umap_dt_workflow

The workflow will extract four features that, hopefully, contain the majority of the information in the original data. Then it will fit a decision tree model on the reduced data. Now you need to fit the workflow and evaluate it. How do you think the performance of the UMAP-reduced model will compare to that of the original model?

-----------------------------------------------------------------------------------------------------

Evaluate the UMAP decision tree model

In the previous exercise, you created a workflow to apply UMAP and build a decision tree model. Now it's time to fit that workflow on the training data and compare its performance to the unreduced decision tree model. Because the target variable credit_score is categorical, you will use f_meas() to evaluate the models' performance.

The unreduced model and its test predictions are stored in dt_fit and predict_df, respectively. The UMAP workflow you created is in umap_dt_workflow. The train and test sets are also provided for you.

The tidyverse, tidymodels, and embed packages have been loaded for you.

* Use f_meas() to evaluate the performance of the unreduced dt_fit.
* Fit the UMAP-reduced model using umap_dt_workflow.
* Create the test set prediction data frame for the reduced UMAP model.
* Use f_meas() to evaluate the performance of the reduced umap_dt_fit.
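A quick note on the metric before the solution: because credit_score has three classes, f_meas() reports a macro-averaged F1 score (you will see .estimator = "macro" in the output). The same prediction data frame also works with other yardstick metrics if you want a fuller picture; a small illustrative sketch, not required by the exercise:

# Confusion matrix and accuracy for the unreduced model (illustrative)
conf_mat(predict_df, credit_score, .pred_class)
accuracy(predict_df, credit_score, .pred_class)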
> glimpse(predict_df)
Rows: 4,742
Columns: 19
$ credit_score             Standard, Good, Poor, Poor, Standard, Standar…
$ age                      44, 19, 52, 45, 37, 26, 26, 22, 29, 32, 39, 3…
$ annual_income            32625.590, 31041.460, 128938.950, 63501.720, …
$ monthly_inhand_salary    2922.7992, 2501.7883, 10722.9125, 5471.8100, …
$ num_bank_accounts        0, 5, 5, 9, 1, 5, 7, 9, 2, 7, 8, 6, 6, 6, 6, …
$ num_credit_card          5, 2, 5, 8, 5, 5, 5, 5, 3, 10, 7, 5, 5, 4, 5,…
$ interest_rate            8, 5, 5, 21, 10, 3, 23, 32, 6, 34, 26, 33, 13…
$ num_of_loan              4, 1, 4, 2, 2, 1, 3, 6, 0, 8, 7, 7, 4, 2, 6, …
$ delay_from_due_date      5, 5, 23, 18, 13, 10, 15, 60, 11, 15, 43, 27,…
$ num_of_delayed_payment   10, 11, 1, 18, 3, 7, 21, 18, 6, 21, 26, 19, 2…
$ changed_credit_limit     2.55, 5.13, 0.75, 8.27, 11.59, 7.01, 2.36, 19…
$ num_credit_inquiries     3, 0, 1, 13, 3, 0, 9, 12, 1, 15, 12, 11, 6, 2…
$ outstanding_debt         177.90, 291.77, 636.95, 2056.37, 1023.93, 286…
$ credit_utilization_ratio 39.41268, 25.26184, 29.91489, 38.42301, 40.50…
$ total_emi_per_month      73.12501, 21.02118, 384.58011, 105.23869, 100…
$ amount_invested_monthly  139.43374, 185.31083, 10000.00000, 83.19815, …
$ monthly_balance          369.7212, 333.8468, 269.1067, 608.7442, 684.2…
$ credit_history_months    344, 398, 199, 121, 244, 267, 80, 150, 214, 3…
$ .pred_class              Good, Good, Poor, Poor, Good, Good, Poor, Sta…

> dt_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
n= 14223

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 14223 8225 Standard (0.421711313 0.256345356 0.321943331)
   2) interest_rate< 14.5 8030 4573 Good (0.417808219 0.430510585 0.151681196)
     4) num_credit_card>=2.5 7088 3929 Standard (0.445682844 0.382759594 0.171557562)
       8) delay_from_due_date< 15.5 5318 2981 Standard (0.439450921 0.436066190 0.124482888)
        16) num_credit_card>=5.5 1706 838 Standard (0.508792497 0.332356389 0.158851114) *
        17) num_credit_card< 5.5 3612 1860 Good (0.406699889 0.485049834 0.108250277) *
       9) delay_from_due_date>=15.5 1770 948 Standard (0.464406780 0.222598870 0.312994350)
        18) num_of_delayed_payment>=7.5 1358 659 Standard (0.514727541 0.262886598 0.222385862) *
        19) num_of_delayed_payment< 7.5 412 160 Poor (0.298543689 0.089805825 0.611650485) *
     5) num_credit_card< 2.5 942 198 Good (0.208067941 0.789808917 0.002123142) *
   3) interest_rate>=14.5 6193 2832 Poor (0.426772162 0.030518327 0.542709511)
     6) outstanding_debt< 1249.945 699 138 Standard (0.802575107 0.127324750 0.070100143) *
     7) outstanding_debt>=1249.945 5494 2182 Poor (0.378958864 0.018201675 0.602839461)
      14) changed_credit_limit>=14.975 2662 1257 Standard (0.527798648 0.029676935 0.442524418)
        28) outstanding_debt>=2515.37 2300 993 Standard (0.568260870 0.033043478 0.398695652) *
        29) outstanding_debt< 2515.37 362 101 Poor (0.270718232 0.008287293 0.720994475) *
      15) changed_credit_limit< 14.975 2832 698 Poor (0.239053672 0.007415254 0.753531073) *

# Evaluate the unreduced decision tree model's performance
f_meas(predict_df, credit_score, .pred_class)

# A tibble: 1 × 3
  .metric .estimator .estimate
1 f_meas  macro          0.610

# Fit the UMAP decision tree workflow on the training data
umap_dt_fit <- umap_dt_workflow %>%
  fit(data = train)

# Create the test set prediction data frame for the UMAP model
predict_umap_df <- test %>%
  bind_cols(predict = predict(umap_dt_fit, test))

# Calculate F1 performance of the UMAP model
f_meas(predict_umap_df, credit_score, .pred_class)

# A tibble: 1 × 3
  .metric .estimator .estimate
1 f_meas  macro          0.539

Nice job fitting a UMAP-reduced decision tree model. That was a significant performance drop, huh? Macro F1 fell from 0.610 to 0.539. Sometimes increasing the number of components UMAP extracts can retain more of the original information and improve model performance.
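If you want to test that idea, here is a minimal sketch of the follow-up experiment. The choice of eight components and the object names umap_recipe_wide and umap_dt_fit_wide are illustrative assumptions, not part of the exercise.

# Illustrative follow-up: retain more UMAP components and refit to see
# whether the macro F1 score recovers (num_comp = 8 is an arbitrary choice)
umap_recipe_wide <- recipe(credit_score ~ ., data = train) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), outcome = vars(credit_score), num_comp = 8)

umap_dt_fit_wide <- workflow() %>%
  add_recipe(umap_recipe_wide) %>%
  add_model(decision_tree(mode = "classification")) %>%
  fit(data = train)

# Evaluate on the held-out test set, as before
test %>%
  bind_cols(predict(umap_dt_fit_wide, test)) %>%
  f_meas(credit_score, .pred_class)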