Understanding principal components

Principal component analysis (PCA) reduces dimensionality by combining features that carry overlapping (correlated) information. PCA extracts new features, called principal components, that are uncorrelated with each other. One way to understand PCA is to plot the two major principal components along the x- and y-axes and display the feature vectors. This lets you see which features contribute to each principal component. Though it is not always easy, it is good practice to name the principal components based on the features that contribute to them. Even so, as a feature extraction method, PCA is often difficult to interpret.

A subset of the credit data is contained in credit_df. The target variable is credit_score. The tidyverse and ggfortify packages have also been loaded for you.

* Perform principal component analysis on credit_df.
* Use autoplot() to display the first two principal components, the feature vectors and labels, and encode credit_score in color.

> glimpse(credit_df)
Rows: 18,965
Columns: 5
$ monthly_inhand_salary    2922.7992, 2501.7883, 2293.3483, 1301.3958, 1…
$ credit_utilization_ratio 39.41268, 25.26184, 25.24044, 31.18439, 23.13…
$ delay_from_due_date      5, 5, 5, 26, 40, 23, 62, 18, 13, 51, 10, 15, …
$ credit_history_months    344, 398, 257, 145, 161, 199, 31, 121, 244, 1…
$ credit_score             "Standard", "Good", "Standard", "Poor", "Poor…

# Perform PCA on the predictors (drop the target, standardize the features)
pca_res <- prcomp(credit_df %>% select(-credit_score), scale. = TRUE)

# Plot the first two principal components and the feature vectors
autoplot(pca_res, data = credit_df, colour = "credit_score", alpha = 0.3,
         loadings = TRUE, loadings.label = TRUE,
         loadings.colour = "black", loadings.label.colour = "black")

Notice that the first two principal components together explain roughly 72% of the variance in the original data (46.99% + 25.43%). One of the disadvantages of feature extraction techniques like PCA is that they are often difficult to interpret. Can you create descriptive names for PC1 and PC2?

-----------------------------------------------------------------------------------------------------

PCA in tidymodels

From a model-building perspective, PCA lets you create models with fewer features that still capture most of the information in the original data. However, as you've seen, a disadvantage of PCA is the difficulty of interpreting the model.

In this exercise, you will build a linear regression model using a subset of the house sales data. The target variable is price. A model built directly from the data, without extracting principal components, has an RMSE of $236,461.4. You will apply PCA with tidymodels and compare the new RMSE. Remember, lower RMSEs are better.

The tidyverse and tidymodels packages have been loaded for you.

* Build a PCA recipe using train to extract five principal components.
* Fit a workflow with a default linear_reg() model spec.
* Create a test prediction data frame using test that contains the actual and predicted values.
* Calculate the RMSE for the PCA-reduced linear regression model.
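A side note before the solution: instead of fixing the number of components, step_pca() can also keep however many components are needed to cover a chosen share of the total variance, through its threshold argument. The sketch below is illustrative only and not part of the exercise; the 0.9 value and the object name pca_recipe_var are assumptions made for the example, and the exercise itself asks for exactly five components.

# Illustrative alternative (not the exercise solution): keep enough components
# to cover roughly 90% of the total variance instead of a fixed num_comp
pca_recipe_var <- recipe(price ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = 0.9)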
> glimpse(house_sales_df)
Rows: 21,613
Columns: 9
$ bedrooms      3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
$ bathrooms     1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
$ sqft_living   1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
$ sqft_above    1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
$ view          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
$ condition     3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
$ sqft_basement 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
$ yr_built      1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
$ price         221900, 538000, 180000, 604000, 510000, 1225000, 257500,…

# Build a PCA recipe: normalize the predictors, then extract five components
pca_recipe <- recipe(price ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 5)

# Fit a workflow with a default linear_reg() model spec
house_sales_fit <- workflow(preprocessor = pca_recipe, spec = linear_reg()) %>%
  fit(train)

# Create a prediction data frame for the test set
house_sales_pred_df <- predict(house_sales_fit, test) %>%
  bind_cols(test %>% select(price))

# Calculate the RMSE
rmse(house_sales_pred_df, price, .pred)

# A tibble: 1 × 3
  .metric .estimator .estimate
1 rmse    standard     255895.

Notice that the RMSE increased from $236,461.4 to roughly $255,895, an increase of about $19,000, so we lost some information by applying PCA. However, the initial model already had an average error of $236,461.4 in predicting house values in California - that is not very good performance. So the original data set must be missing some important information for predicting house prices.

-----------------------------------------------------------------------------------------------------

Separating house prices with PCA

PCA and t-SNE are both feature extraction techniques, but PCA can only capture the linear structure of the data. In this exercise, you will create a PCA plot of the full house_sales_df so you can compare its result with the t-SNE output.

Remember that price is the target variable in house_sales_df. It is important to remove it before fitting PCA to the data.

The tidyverse and ggfortify packages have been loaded for you.

* Fit a PCA to the predictors of house_sales_df.
* Use autoplot() to plot the first two PCs and encode price in color.

# Fit PCA to only the predictors
pca <- prcomp(house_sales_df %>% select(-price))

# Plot the first two PCs and color-code the target variable
autoplot(pca, data = house_sales_df, colour = "price", alpha = 0.7) +
  scale_color_gradient(low = "gray", high = "blue")

Notice that PCA doesn't do a very good job of differentiating house prices. PC1 accounts for almost all of the variance, but it does not explain why some houses are more expensive than others.

(slika3.png)

-----------------------------------------------------------------------------------------------------

Separating house prices with t-SNE

t-SNE is a non-linear dimensionality reduction technique. It embeds high-dimensional data into a lower-dimensional space, and as it does so, it strives to keep points next to their original neighbors. You will create a t-SNE plot that you can compare with the PCA plot from the last exercise.

PCA preserves the global structure of the data, but not the local structure. t-SNE preserves the local structure by keeping neighbors in the higher-dimensional space close to each other in the lower-dimensional space. You will see this in the plots.

You will apply t-SNE to reduce house_sales_df. The target variable of house_sales_df is price.
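How large a neighborhood t-SNE tries to preserve is governed by the perplexity argument of Rtsne(). The sketch below is purely illustrative and not part of the exercise; the perplexity value of 50 and the object name tsne_alt are assumptions made for the example.

# Illustrative only: a larger perplexity makes t-SNE consider bigger
# neighborhoods, which emphasizes more of the global structure
set.seed(1234)
tsne_alt <- Rtsne(house_sales_df %>% select(-price),
                  perplexity = 50,
                  check_duplicates = FALSE)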
The tidyverse and Rtsne packages have been loaded for you.

* Fit t-SNE to house_sales_df using Rtsne().
* Bind the t-SNE X and Y coordinates to house_sales_df.
* Plot the t-SNE results using ggplot(), encoding the target variable in color.

# Fit t-SNE to the predictors
set.seed(1234)
tsne <- Rtsne(house_sales_df %>% select(-price), check_duplicates = FALSE)

# Bind the t-SNE coordinates to the data frame
tsne_df <- house_sales_df %>%
  bind_cols(tsne_x = tsne$Y[, 1], tsne_y = tsne$Y[, 2])

# Plot t-SNE, coloring by price
tsne_df %>%
  ggplot(aes(x = tsne_x, y = tsne_y, color = price)) +
  geom_point() +
  scale_color_gradient(low = "gray", high = "blue")

The x and y axes are the extracted features. Remember, a goal of supervised dimensionality reduction is to gain more information about the target variable. t-SNE does a little better than PCA at separating more and less expensive houses, but it is still not great.

(slika4.png)

-----------------------------------------------------------------------------------------------------

Separating house prices with UMAP

You have reduced the dimensionality of the California house sales data (house_sales_df) using PCA and t-SNE. You will now use UMAP. The end result of UMAP is very similar to that of t-SNE; however, UMAP tends to be more computationally efficient, and it strives to retain more of the global structure. In practical terms, this means you can interpret the distance between clusters as a measure of similarity - something you couldn't do with t-SNE.

Remember, the target variable of house_sales_df is price. Set num_comp = 2.

The tidyverse and embed packages have been loaded for you.

* Fit UMAP to all the predictors in house_sales_df using step_umap() in a recipe and store the transformed data in umap_df.
* Plot the UMAP dimensions using ggplot(), encoding the target variable price in color.

# Fit UMAP: normalize the predictors, extract two UMAP components,
# then prep and juice the recipe to get the transformed data
set.seed(1234)
umap_df <- recipe(price ~ ., data = house_sales_df) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), num_comp = 2) %>%
  prep() %>%
  juice()

# Plot the UMAP components, coloring by price
umap_df %>%
  ggplot(aes(x = UMAP1, y = UMAP2, color = price)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient(low = "gray", high = "blue")

UMAP still struggles to completely separate the more and less expensive houses, but that's to be expected - house pricing in California is complicated.

(slika5.png)

-----------------------------------------------------------------------------------------------------

UMAP reduction in a decision tree model

Now that you have visualized a UMAP reduction, let's put UMAP to work in model building. In this exercise, you will build a workflow that applies UMAP in a preprocessing recipe to the credit data and then uses the extracted components to build a decision tree model.

The credit data train and test sets are provided for you. The embed library has already been loaded.

* Create a recipe to apply a UMAP reduction to the data, resulting in four extracted components.
* Create a decision_tree model for classification.
* Add the UMAP recipe and the decision tree model to a workflow.
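One detail worth flagging before the solution: step_umap() can use the outcome while computing the reduction. Passing the target through its outcome argument, as the recipe in the solution does with outcome = vars(credit_score), makes the embedding supervised, so the components are steered toward separating the credit score classes rather than only preserving the overall structure of the predictors. A minimal sketch of the two variants, reusing the same train data as the exercise:

# Unsupervised UMAP: components reflect only the structure of the predictors
recipe(credit_score ~ ., data = train) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), num_comp = 4)

# Supervised UMAP: the outcome guides the embedding (what the solution below uses)
recipe(credit_score ~ ., data = train) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), outcome = vars(credit_score), num_comp = 4)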
> glimpse(credit_df)
Rows: 18,965
Columns: 18
$ credit_score             Standard, Good, Standard, Poor, Poor, Poor, S…
$ age                      44, 19, 39, 43, 22, 52, 32, 45, 37, 22, 26, 2…
$ annual_income            32625.590, 31041.460, 25876.180, 15928.750, 1…
$ monthly_inhand_salary    2922.7992, 2501.7883, 2293.3483, 1301.3958, 1…
$ num_bank_accounts        0, 5, 4, 9, 7, 5, 7, 9, 1, 7, 5, 7, 9, 4, 4, …
$ num_credit_card          5, 2, 6, 9, 9, 5, 8, 8, 5, 7, 5, 5, 5, 5, 7, …
$ interest_rate            8, 5, 7, 24, 18, 5, 26, 21, 10, 25, 3, 23, 32…
$ num_of_loan              4, 1, 3, 4, 2, 4, 9, 2, 2, 7, 1, 3, 6, 1, 2, …
$ delay_from_due_date      5, 5, 5, 26, 40, 23, 62, 18, 13, 51, 10, 15, …
$ num_of_delayed_payment   10, 11, 15, 22, 17, 1, 18, 18, 3, 16, 7, 21, …
$ changed_credit_limit     2.55, 5.13, 10.29, 1.26, 6.09, 0.75, 22.13, 8…
$ num_credit_inquiries     3, 0, 4, 12, 12, 1, 12, 13, 3, 8, 0, 9, 12, 1…
$ outstanding_debt         177.90, 291.77, 71.54, 2240.56, 2063.45, 636.…
$ credit_utilization_ratio 39.41268, 25.26184, 25.24044, 31.18439, 23.13…
$ total_emi_per_month      73.125008, 21.021180, 37.374350, 31.846679, 1…
$ amount_invested_monthly  139.43374, 185.31083, 117.93599, 161.38822, 1…
$ monthly_balance          369.7212, 333.8468, 354.0245, 226.9047, 343.9…
$ credit_history_months    344, 398, 257, 145, 161, 199, 31, 121, 244, 1…

# Create a recipe that normalizes the predictors and applies a supervised
# UMAP reduction to four components
umap_recipe <- recipe(credit_score ~ ., data = train) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), outcome = vars(credit_score), num_comp = 4)

# Specify a decision tree model for classification
umap_dt_model <- decision_tree(mode = "classification")

# Add the recipe and model to a workflow
umap_dt_workflow <- workflow() %>%
  add_recipe(umap_recipe) %>%
  add_model(umap_dt_model)

umap_dt_workflow

The workflow will extract four features that, hopefully, contain the majority of the information in the original data. Then it will fit a decision tree model on the reduced data. Now you need to fit the workflow and evaluate it. How do you think the performance of the UMAP-reduced model will compare to that of the original model?

-----------------------------------------------------------------------------------------------------

Evaluate the UMAP decision tree model

In the previous exercise, you created a workflow to apply UMAP and build a decision tree model. Now it's time to fit that workflow on the training data and compare its performance to the unreduced decision tree model. Because the target variable credit_score is categorical, you will use f_meas() to evaluate the models' performance.

The unreduced model and its test predictions are stored in dt_fit and predict_df, respectively. The UMAP workflow you created is in umap_dt_workflow. The train and test sets are also provided for you.

The tidyverse, tidymodels, and embed packages have been loaded for you.

* Use f_meas() to evaluate the performance of the unreduced dt_fit.
* Fit the UMAP-reduced model using umap_dt_workflow.
* Create the test set prediction data frame for the reduced UMAP model.
* Use f_meas() to evaluate the performance of the reduced umap_dt_fit.
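A quick note on the metric before the solution: because credit_score has three classes, f_meas() reports a macro-averaged F1 score (you will see .estimator = "macro" in the output). The same prediction data frame also works with other yardstick metrics if you want a fuller picture; a small illustrative sketch, not required by the exercise:

# Confusion matrix and accuracy for the unreduced model (illustrative)
conf_mat(predict_df, credit_score, .pred_class)
accuracy(predict_df, credit_score, .pred_class)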
> glimpse(predict_df)
Rows: 4,742
Columns: 19
$ credit_score             Standard, Good, Poor, Poor, Standard, Standar…
$ age                      44, 19, 52, 45, 37, 26, 26, 22, 29, 32, 39, 3…
$ annual_income            32625.590, 31041.460, 128938.950, 63501.720, …
$ monthly_inhand_salary    2922.7992, 2501.7883, 10722.9125, 5471.8100, …
$ num_bank_accounts        0, 5, 5, 9, 1, 5, 7, 9, 2, 7, 8, 6, 6, 6, 6, …
$ num_credit_card          5, 2, 5, 8, 5, 5, 5, 5, 3, 10, 7, 5, 5, 4, 5,…
$ interest_rate            8, 5, 5, 21, 10, 3, 23, 32, 6, 34, 26, 33, 13…
$ num_of_loan              4, 1, 4, 2, 2, 1, 3, 6, 0, 8, 7, 7, 4, 2, 6, …
$ delay_from_due_date      5, 5, 23, 18, 13, 10, 15, 60, 11, 15, 43, 27,…
$ num_of_delayed_payment   10, 11, 1, 18, 3, 7, 21, 18, 6, 21, 26, 19, 2…
$ changed_credit_limit     2.55, 5.13, 0.75, 8.27, 11.59, 7.01, 2.36, 19…
$ num_credit_inquiries     3, 0, 1, 13, 3, 0, 9, 12, 1, 15, 12, 11, 6, 2…
$ outstanding_debt         177.90, 291.77, 636.95, 2056.37, 1023.93, 286…
$ credit_utilization_ratio 39.41268, 25.26184, 29.91489, 38.42301, 40.50…
$ total_emi_per_month      73.12501, 21.02118, 384.58011, 105.23869, 100…
$ amount_invested_monthly  139.43374, 185.31083, 10000.00000, 83.19815, …
$ monthly_balance          369.7212, 333.8468, 269.1067, 608.7442, 684.2…
$ credit_history_months    344, 398, 199, 121, 244, 267, 80, 150, 214, 3…
$ .pred_class              Good, Good, Poor, Poor, Good, Good, Poor, Sta…

> dt_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────
n= 14223

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 14223 8225 Standard (0.421711313 0.256345356 0.321943331)
   2) interest_rate< 14.5 8030 4573 Good (0.417808219 0.430510585 0.151681196)
     4) num_credit_card>=2.5 7088 3929 Standard (0.445682844 0.382759594 0.171557562)
       8) delay_from_due_date< 15.5 5318 2981 Standard (0.439450921 0.436066190 0.124482888)
        16) num_credit_card>=5.5 1706 838 Standard (0.508792497 0.332356389 0.158851114) *
        17) num_credit_card< 5.5 3612 1860 Good (0.406699889 0.485049834 0.108250277) *
       9) delay_from_due_date>=15.5 1770 948 Standard (0.464406780 0.222598870 0.312994350)
        18) num_of_delayed_payment>=7.5 1358 659 Standard (0.514727541 0.262886598 0.222385862) *
        19) num_of_delayed_payment< 7.5 412 160 Poor (0.298543689 0.089805825 0.611650485) *
     5) num_credit_card< 2.5 942 198 Good (0.208067941 0.789808917 0.002123142) *
   3) interest_rate>=14.5 6193 2832 Poor (0.426772162 0.030518327 0.542709511)
     6) outstanding_debt< 1249.945 699 138 Standard (0.802575107 0.127324750 0.070100143) *
     7) outstanding_debt>=1249.945 5494 2182 Poor (0.378958864 0.018201675 0.602839461)
      14) changed_credit_limit>=14.975 2662 1257 Standard (0.527798648 0.029676935 0.442524418)
        28) outstanding_debt>=2515.37 2300 993 Standard (0.568260870 0.033043478 0.398695652) *
        29) outstanding_debt< 2515.37 362 101 Poor (0.270718232 0.008287293 0.720994475) *
      15) changed_credit_limit< 14.975 2832 698 Poor (0.239053672 0.007415254 0.753531073) *

# Evaluate the unreduced decision tree model's performance
f_meas(predict_df, credit_score, .pred_class)

# A tibble: 1 × 3
  .metric .estimator .estimate
1 f_meas  macro          0.610

# Fit the UMAP decision tree workflow on the training data
umap_dt_fit <- umap_dt_workflow %>%
  fit(data = train)

# Create the test set prediction data frame for the UMAP model
predict_umap_df <- test %>%
  bind_cols(predict = predict(umap_dt_fit, test))

# Calculate F1 performance of the UMAP model
f_meas(predict_umap_df, credit_score, .pred_class)

# A tibble: 1 × 3
  .metric .estimator .estimate
1 f_meas  macro          0.539

Nice job fitting a UMAP-reduced decision tree model. That was a significant performance drop, huh? Macro F1 fell from 0.610 to 0.539. Sometimes increasing the number of components UMAP extracts can retain more of the original information and improve model performance.
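If you want to test that idea, here is a minimal sketch of the follow-up experiment. The choice of eight components and the object names umap_recipe_wide and umap_dt_fit_wide are illustrative assumptions, not part of the exercise.

# Illustrative follow-up: retain more UMAP components and refit to see
# whether the macro F1 score recovers (num_comp = 8 is an arbitrary choice)
umap_recipe_wide <- recipe(credit_score ~ ., data = train) %>%
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), outcome = vars(credit_score), num_comp = 8)

umap_dt_fit_wide <- workflow() %>%
  add_recipe(umap_recipe_wide) %>%
  add_model(decision_tree(mode = "classification")) %>%
  fit(data = train)

# Evaluate on the held-out test set, as before
test %>%
  bind_cols(predict(umap_dt_fit_wide, test)) %>%
  f_meas(credit_score, .pred_class)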