6.1 What is Feature Engineering?

“Feature engineering encompasses activities that reformat predictor values to make them easier for a model to use effectively.”

Process of reformatting predictor variables
Includes, but certainly not limited to: Transformations (i.e. log, box-cox), encodings, proportions, change stats (i.e. Xt2 - Xt1 = Xtc) etc
Feature extraction/education i.e. PCA, UMAP (non-linear)
Impute missing Values
For much more see http://www.feat.engineering/.

What is this chapter (tidymodels chp 6) going to teach us about?

Basically how to build recipes our recipes for baking… In other words to prepare our data, by specifying treatments required to engineer our features, and to build our models which can be used for prediction.

Why use tidymodels?

Preprocessing steps focuses on making data suitable for modeling by using data transformations. While transformations can be accomplished with dplyr and other tidyverse packages it is suggested that we should consider using tidymodels packages when model development is more heavy and complex.

Code and Data so far…

At this point we have split the data (the code fore this can be found in chapter 5 of tidy models)
- This has given us a test and training set which we can investigate as below.
- This data has been stratified relative to the sales price, which we know is not normally distributed.
- Nothing has been specified as to the functional nature or shape of this data.

data(ames)
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <-  testing(ames_split)
ames %>% slice_head(n = 5)

## # A tibble: 5 x 74
##   MS_SubClass      MS_Zoning     Lot_Frontage Lot_Area Street Alley   Lot_Shape 
##   <fct>            <fct>                <dbl>    <int> <fct>  <fct>   <fct>     
## 1 One_Story_1946_~ Residential_~          141    31770 Pave   No_All~ Slightly_~
## 2 One_Story_1946_~ Residential_~           80    11622 Pave   No_All~ Regular   
## 3 One_Story_1946_~ Residential_~           81    14267 Pave   No_All~ Slightly_~
## 4 One_Story_1946_~ Residential_~           93    11160 Pave   No_All~ Regular   
## 5 Two_Story_1946_~ Residential_~           74    13830 Pave   No_All~ Slightly_~
## # ... with 67 more variables: Land_Contour <fct>, Utilities <fct>,
## #   Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>,
## #   Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>,
## #   Year_Built <int>, Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
## #   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
## #   Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>,
## #   Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>,
## #   BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
## #   Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>, Central_Air <fct>,
## #   Electrical <fct>, First_Flr_SF <int>, Second_Flr_SF <int>,
## #   Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
## #   Full_Bath <int>, Half_Bath <int>, Bedroom_AbvGr <int>, Kitchen_AbvGr <int>,
## #   TotRms_AbvGrd <int>, Functional <fct>, Fireplaces <int>, Garage_Type <fct>,
## #   Garage_Finish <fct>, Garage_Cars <dbl>, Garage_Area <dbl>,
## #   Garage_Cond <fct>, Paved_Drive <fct>, Wood_Deck_SF <int>,
## #   Open_Porch_SF <int>, Enclosed_Porch <int>, Three_season_porch <int>,
## #   Screen_Porch <int>, Pool_Area <int>, Pool_QC <fct>, Fence <fct>,
## #   Misc_Feature <fct>, Misc_Val <int>, Mo_Sold <int>, Year_Sold <int>,
## #   Sale_Type <fct>, Sale_Condition <fct>, Sale_Price <dbl>, Longitude <dbl>,
## #   Latitude <dbl>

ames %>% skim(Sale_Price, Neighborhood, Gr_Liv_Area, 
              Year_Built, Bldg_Type)

Data summary

Name	Piped data
Number of rows	2930
Number of columns	74
_______________________
Column type frequency:
factor	2
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
Neighborhood	0	1	FALSE	28	Nor: 443, Col: 267, Old: 239, Edw: 194
Bldg_Type	0	1	FALSE	5	One: 2425, Twn: 233, Dup: 109, Twn: 101

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Sale_Price	1	5.22	0.18	4.11	5.11	5.2	5.33	5.88	<U+2581><U+2581><U+2586><U+2587><U+2581>
Gr_Liv_Area	1	1499.69	505.51	334.00	1126.00	1442.0	1742.75	5642.00	<U+2587><U+2587><U+2581><U+2581><U+2581>
Year_Built	1	1971.36	30.25	1872.00	1954.00	1973.0	2001.00	2010.00	<U+2581><U+2582><U+2583><U+2586><U+2587>

We can also see (above) that we have a lot of variables (n = 74), and in this particular skim we have speicifed a focus on the five variables that are used later in the chapter for modelling.
Just by skiming we can see in the numeric section that Gr_Liv_Area is non-gausian in distribution. Which will need to be addressed before modeling.
As mentioned earlier we can see the same in what will be our dependent variable.

ames %>%
        ggplot(mapping = aes( x = Sale_Price)) +
        geom_histogram(bins = 50)

Base R modelling

In base R we model using a function such as lm() or glm() for a ‘simple’ linear model, within the formula (as below) we can transform our data (in this case natural log with base 10). We can do a variety of other things such as adding in weights, specify a variable to be a factor (dummy coding), interactions etc.

model1 <- lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Year_Built + Bldg_Type, data = ames)
summary(model1)

## 
## Call:
## lm(formula = Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + 
##     Year_Built + Bldg_Type, data = ames)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83789 -0.04051  0.00115  0.04245  0.39186 
## 
## Coefficients:
##                                                       Estimate Std. Error
## (Intercept)                                         -0.8551717  0.2066005
## NeighborhoodCollege_Creek                            0.0135485  0.0073316
## NeighborhoodOld_Town                                -0.0289607  0.0074746
## NeighborhoodEdwards                                 -0.0493174  0.0068010
## NeighborhoodSomerset                                 0.0499653  0.0086486
## NeighborhoodNorthridge_Heights                       0.1335758  0.0090287
## NeighborhoodGilbert                                 -0.0337334  0.0083591
## NeighborhoodSawyer                                  -0.0042779  0.0074500
## NeighborhoodNorthwest_Ames                           0.0004589  0.0081490
## NeighborhoodSawyer_West                             -0.0174582  0.0086299
## NeighborhoodMitchell                                 0.0004695  0.0085823
## NeighborhoodBrookside                               -0.0110205  0.0089257
## NeighborhoodCrawford                                 0.0914254  0.0089320
## NeighborhoodIowa_DOT_and_Rail_Road                  -0.0839821  0.0095969
## NeighborhoodTimberland                               0.0604062  0.0108034
## NeighborhoodNorthridge                               0.0845868  0.0113346
## NeighborhoodStone_Brook                              0.1459657  0.0128573
## NeighborhoodSouth_and_West_of_Iowa_State_University -0.0282535  0.0124617
## NeighborhoodClear_Creek                              0.0480071  0.0126039
## NeighborhoodMeadow_Village                          -0.0899124  0.0153282
## NeighborhoodBriardale                               -0.0465821  0.0175020
## NeighborhoodBloomington_Heights                      0.0402528  0.0169594
## NeighborhoodVeenker                                  0.0885538  0.0168252
## NeighborhoodNorthpark_Villa                          0.0262051  0.0183321
## NeighborhoodBlueste                                  0.0322372  0.0262466
## NeighborhoodGreens                                   0.1751507  0.0290318
## NeighborhoodGreen_Hills                              0.2229230  0.0562585
## NeighborhoodLandmark                                -0.0119925  0.0796196
## log10(Gr_Liv_Area)                                   0.6343996  0.0126012
## Year_Built                                           0.0020678  0.0001044
## Bldg_TypeTwoFmCon                                   -0.0312306  0.0104405
## Bldg_TypeDuplex                                     -0.1038443  0.0079704
## Bldg_TypeTwnhs                                      -0.0968859  0.0110076
## Bldg_TypeTwnhsE                                     -0.0414929  0.0068666
##                                                     t value Pr(>|t|)    
## (Intercept)                                          -4.139 3.58e-05 ***
## NeighborhoodCollege_Creek                             1.848 0.064710 .  
## NeighborhoodOld_Town                                 -3.875 0.000109 ***
## NeighborhoodEdwards                                  -7.251 5.27e-13 ***
## NeighborhoodSomerset                                  5.777 8.40e-09 ***
## NeighborhoodNorthridge_Heights                       14.795  < 2e-16 ***
## NeighborhoodGilbert                                  -4.036 5.59e-05 ***
## NeighborhoodSawyer                                   -0.574 0.565865    
## NeighborhoodNorthwest_Ames                            0.056 0.955100    
## NeighborhoodSawyer_West                              -2.023 0.043165 *  
## NeighborhoodMitchell                                  0.055 0.956374    
## NeighborhoodBrookside                                -1.235 0.217043    
## NeighborhoodCrawford                                 10.236  < 2e-16 ***
## NeighborhoodIowa_DOT_and_Rail_Road                   -8.751  < 2e-16 ***
## NeighborhoodTimberland                                5.591 2.46e-08 ***
## NeighborhoodNorthridge                                7.463 1.11e-13 ***
## NeighborhoodStone_Brook                              11.353  < 2e-16 ***
## NeighborhoodSouth_and_West_of_Iowa_State_University  -2.267 0.023450 *  
## NeighborhoodClear_Creek                               3.809 0.000142 ***
## NeighborhoodMeadow_Village                           -5.866 4.98e-09 ***
## NeighborhoodBriardale                                -2.662 0.007822 ** 
## NeighborhoodBloomington_Heights                       2.373 0.017686 *  
## NeighborhoodVeenker                                   5.263 1.52e-07 ***
## NeighborhoodNorthpark_Villa                           1.429 0.152979    
## NeighborhoodBlueste                                   1.228 0.219455    
## NeighborhoodGreens                                    6.033 1.81e-09 ***
## NeighborhoodGreen_Hills                               3.962 7.60e-05 ***
## NeighborhoodLandmark                                 -0.151 0.880284    
## log10(Gr_Liv_Area)                                   50.344  < 2e-16 ***
## Year_Built                                           19.805  < 2e-16 ***
## Bldg_TypeTwoFmCon                                    -2.991 0.002801 ** 
## Bldg_TypeDuplex                                     -13.029  < 2e-16 ***
## Bldg_TypeTwnhs                                       -8.802  < 2e-16 ***
## Bldg_TypeTwnhsE                                      -6.043 1.71e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07872 on 2896 degrees of freedom
## Multiple R-squared:  0.8045, Adjusted R-squared:  0.8022 
## F-statistic:   361 on 33 and 2896 DF,  p-value: < 2.2e-16

What happens when base R model is run?

As soon as the code is run the model is executed from the dataframe into a model matrix and then the least squares method is used to estimate parameters. What does this mean?

Sale price is outcome
Neighborhood, general living area, the year built, and building type variables are defined as predictors.
Log (base 10) transformation is applied to the general living area.
Neighborhood and building type predictors are converted with contrast/dummy coding (0 and 1), not one-hot https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn#:~:text=One%2Dhot%20encoding%20converts%20it,up%20with%20kn%2Dk%20variables..

More models mo’ problems

Such a strategy creates several problems when we need to build a lot of models with varying data as we have to redefine the model. Also more by Kuhn (2017) explaining why base R isn’t great by preprocessing and using data matricies https://rviews.rstudio.com/2017/03/01/the-r-formula-method-the-bad-parts/.

###Recipes This is where Recipes() come in, instead of using the base R apporach we can create a recipe() using the tidymodels package. What is a recipe?

The recipes package is an alternative method for creating and preprocessing design matrices that can be used for modeling or visualization. tidymodels team https://recipes.tidymodels.org/index.html A data matrix can be thought of as a grid that represents each explanitory variable of a model in its rows, and columns the methods i.e. transformations applied to each variable.

As the tidymodels team state “The idea of the recipes package is to define a recipe or blueprint that can be used to sequentially define the encodings and preprocessing of the data (i.e. “feature engineering”)." Which allows us prespecify the data tranformations we wish to apply, before executing them on our training data. Each data transformation in a tidymodels recipe is a step. The functions correspond to specific types of steps, each of which has a prefix of step_. There are several step_ functions one may want to conduct, but we should consider the importance of orderhttps://recipes.tidymodels.org/articles/Ordering.html

Tidymodels Workflow for recipes

Some general advice…

If using a Box-Cox transformation, don’t center the data first or do any operations that might make the data non-positive, or use Yeo-Johnson transformation so you don’t have to worry about this.
Recipes do not automatically apply contrast coding/dummy variables
Centering, scaling, and operations applied to predictors should be run after step_dummy, so that numeric columns are in the data set instead of factors.
Step_interact should also be applied after dummy variables.
We sometimes lumping infrequently used categories together with step_other this should be called before step_dummy.

Ordering of workflow

Impute
Individual transformations for skewness and other issues
Discretize (if needed and if you have no other choice)
Create dummy variables
Create interactions
Normalization steps (center, scale, range, etc)
Multivariate transformation (e.g. PCA, spatial sign, etc)

Applying a recipe to our example…

simple_ames <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_dummy(all_nominal())
simple_ames

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          4
## 
## Operations:
## 
## Log transformation on Gr_Liv_Area
## Dummy variables from all_nominal()

Whats going on?

Recipe captures and information for quoting (see advanced R metaprogramming for a full explanation), that essentially means it captures the syntax rather than executing it.
Each step captures the desired steps to be applied to the data before a model is run.
Step_log() applies log base 10 to Gr_Liv_Area, had we have put all_numeric it would have changed all numeric variables. Another option would be to capture the desired column names in a vector instead of specifying all_numeric or individual vectors.
step_dummy() converts the factors/categorical data into contrast/dummy coding.

Optional understanding - some alternatives to dummy coding

We are going to briefly go down a rabbit hole here, don’t worry we’ll be back soon.

Embedding

Dummy coding can dramtically expand your data set for example if we apply contrast coding to the ames neighborhood (see below) variable only we can see we expand our data considerably. This makes modeling a lot more resource demand when modeling commences which throttles productivity, steals the most precious thing we have (time), and can lead to us crashing our systems, or costing a lot of money (cloud computing is cheap until we start scaling processing demands).

as_tibble(contrasts(ames$Neighborhood))

## # A tibble: 29 x 28
##    College_Creek Old_Town Edwards Somerset Northridge_Heights Gilbert Sawyer
##            <dbl>    <dbl>   <dbl>    <dbl>              <dbl>   <dbl>  <dbl>
##  1             0        0       0        0                  0       0      0
##  2             1        0       0        0                  0       0      0
##  3             0        1       0        0                  0       0      0
##  4             0        0       1        0                  0       0      0
##  5             0        0       0        1                  0       0      0
##  6             0        0       0        0                  1       0      0
##  7             0        0       0        0                  0       1      0
##  8             0        0       0        0                  0       0      1
##  9             0        0       0        0                  0       0      0
## 10             0        0       0        0                  0       0      0
## # ... with 19 more rows, and 21 more variables: Northwest_Ames <dbl>,
## #   Sawyer_West <dbl>, Mitchell <dbl>, Brookside <dbl>, Crawford <dbl>,
## #   Iowa_DOT_and_Rail_Road <dbl>, Timberland <dbl>, Northridge <dbl>,
## #   Stone_Brook <dbl>, South_and_West_of_Iowa_State_University <dbl>,
## #   Clear_Creek <dbl>, Meadow_Village <dbl>, Briardale <dbl>,
## #   Bloomington_Heights <dbl>, Veenker <dbl>, Northpark_Villa <dbl>,
## #   Blueste <dbl>, Greens <dbl>, Green_Hills <dbl>, Landmark <dbl>,
## #   Hayden_Lake <dbl>

As you can see above dummy coding replaces every category with variable for each level of data, not so bad with 28 variables, but imagine adding post/zip codes? In the uk there are 1.7 million. This width will throttle processing. Luckily there are solutions to that can allow us to reduce this information into a smaller number of column: embedding and hashing are two methods that are currently popular.

Embedding - Encodes categorical data as multiple numeric variables using a word embedding approach. It was initially used to take a large number of word identifiers and represent them in a smaller dimension

Feature Hashing - Another way to combine categories is to use a hashing function (or hashes). Hashes are used to map one set of values to another set of values and are typically used in databases and cryptography (Preneel 2010). The original values are called the keys and are typically mapped to a smaller set of artificial hash values.* Kuhn and Johnson (2019) In tidymodel we can use step_feature_hash() https://embed.tidymodels.org/reference/step_feature_hash.html to do this. Hashing like embedding compresses wide data into specified smaller number of columns/variables, using the actual value of the level which is calculated with cryptographic methods to assign a value to a number of columns we specify. One problem is with these methods is that we can’t easily map back to the original data.

EXAMPLE - EMBEDDING (not used in main tidymodels recipe example)

Word of warning: You need to have tensorflow or keras installed, I found this a bit of a pain. If you do attempt this, it seemed that initialising tf in conda prompt and also running the rstudio packages in anaconda helped build the connection.

Carrying on with our ames data set example in coding neighborhoods we can reduce the 28 columns (representing 29 levels, 1 left of for degrees of freedom) down to

### Taken from: https://embed.tidymodels.org/articles/Applications/Tensorflow.html
library(embed)
tf_embed <- 
  recipe(Sale_Price ~ ., data = ames) %>%
  step_log(Sale_Price, base = 10) %>%
  # Add some other predictors that can be used by the network. We
  # preprocess them first
  step_YeoJohnson(Lot_Area, Full_Bath, Gr_Liv_Area)  %>%
  step_range(Lot_Area, Full_Bath, Gr_Liv_Area)  %>%
  step_embed(
    Neighborhood, # variable to embed
    outcome = vars(Sale_Price), # we need to specify the outcome/dependent variable the embedings will be used for
    predictors = vars(Lot_Area, Full_Bath, Gr_Liv_Area), # encoding columns 
    num_terms = 5, 
    hidden_units = 10, 
    options = embed_control(epochs = 75, validation_split = 0.2) #trained on 75 epochs
  ) %>% 
  prep(training = ames)
#you can see the graph for this in the source code for this mark down

Lets look at our new columns…

hood_coef <- 
  tidy(tf_embed, number = 4) %>%
  dplyr::select(-terms, -id)  %>%
  dplyr::rename(Neighborhood = level) %>%
  # Make names smaller
  rename_at(vars(contains("emb")), funs(gsub("Neighborhood_", "", ., fixed = TRUE)))

## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

hood_coef

## # A tibble: 30 x 6
##    embed_1  embed_2 embed_3  embed_4  embed_5 Neighborhood      
##      <dbl>    <dbl>   <dbl>    <dbl>    <dbl> <chr>             
##  1 -0.0218 -0.0272  -0.0402  0.0457  -0.0214  ..new             
##  2  0.0158  0.0366   0.0391 -0.0639  -0.0150  North_Ames        
##  3 -0.0654  0.0716  -0.0384  0.0117   0.0399  College_Creek     
##  4  0.0357 -0.0332  -0.0273  0.0103  -0.0389  Old_Town          
##  5  0.0434  0.00264 -0.0144  0.0484   0.00978 Edwards           
##  6 -0.0607  0.0964  -0.0257 -0.0258   0.0801  Somerset          
##  7 -0.121   0.101    0.0326 -0.0535   0.0697  Northridge_Heights
##  8 -0.0503  0.0278   0.0400  0.00560 -0.00665 Gilbert           
##  9 -0.0167  0.00569  0.0261 -0.0286  -0.0137  Sawyer            
## 10 -0.0547 -0.0271   0.0195 -0.0450  -0.0150  Northwest_Ames    
## # ... with 20 more rows

Not bad! We have changed the number of columns to approximately 1/5 of the size. There are a few more considerations to make here as we can introduce correlations between variables.

tf_plot

hood_coef %>% 
  dplyr::select(contains("emb")) %>% 
  cor() %>%
  round(2)

##         embed_1 embed_2 embed_3 embed_4 embed_5
## embed_1    1.00   -0.59   -0.44    0.31   -0.54
## embed_2   -0.59    1.00   -0.01   -0.29    0.53
## embed_3   -0.44   -0.01    1.00   -0.28    0.19
## embed_4    0.31   -0.29   -0.28    1.00   -0.05
## embed_5   -0.54    0.53    0.19   -0.05    1.00

Don’t you just love statistics never be bored! :D Unfortunatley Alice needs to go home so out the rabbit hole we go.

Alice climbs out of rabbit hole - what an adventure!

Back to our main recipe

so far we have not altered any data. We have just created a recipes for data processing to take place.

Summary - Why build recipes?

Computations can be recycled across models, possibly programmitically i.e. based on conditional parameters i.e. Skew > 1.5, NCOLS > x etc.
A recipe enables a broader set of data processing than building in individual models.
The syntax compact, compared to formulas that require each element to be explicitly listed.
All data processing can be captured in a single R object instead of in scripts that are repeated.

Preparing data for modelling

For a recipe with at least one preprocessing operation, estimate the required parameters from a training set that can be later applied to other data sets. https://recipes.tidymodels.org/reference/prep.html

Prep

The second phase for using a recipe is to estimate any quantities required by the steps using the prep() function. When called the function executes our recipe on the data i.e. ames_training. Not all steps are applied.

simple_ames <- prep(simple_ames, training = ames_train)
simple_ames

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          4
## 
## Training data contained 2346 data points and no missing data.
## 
## Operations:
## 
## Log transformation on Gr_Liv_Area [trained]
## Dummy variables from Neighborhood, Bldg_Type [trained]

In the above we have the original data, but if we do not want this we could change the default setting of retain = TRUE to FALSE, the prepared version of the training set still available. While this is the default we may want to drop this if we have large sets of data.

simple_ames <- prep(simple_ames, training = ames_train, retain = TRUE) 
simple_ames

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          4
## 
## Training data contained 2346 data points and no missing data.
## 
## Operations:
## 
## Log transformation on Gr_Liv_Area [trained]
## Dummy variables from Neighborhood, Bldg_Type [trained]

We can see using broom which data sets have already been applied and in what order.

simple_ames %>% tidy()

## # A tibble: 2 x 6
##   number operation type  trained skip  id         
##    <int> <chr>     <chr> <lgl>   <lgl> <chr>      
## 1      1 step      log   TRUE    FALSE log_kjCwe  
## 2      2 step      dummy TRUE    FALSE dummy_KxLdc

On your marks, get set, bake.

The third phase of recipe usage is to apply the preprocessing operations to a data set using the bake() function. The bake() function can apply the recipe to any data set, in this case the test set, if we want to look at the orginal we can set data to NULL

# bake(simple_ames, new_data = NULL) # to investigate orginal dataset
test_ex <- bake(simple_ames, new_data = ames_test)
names(test_ex) %>% head()

## [1] "Gr_Liv_Area"                "Year_Built"                
## [3] "Sale_Price"                 "Neighborhood_College_Creek"
## [5] "Neighborhood_Old_Town"      "Neighborhood_Edwards"

Note the dummy variable columns starting with “Neighborhood_” or whatever the variabke column it orginally came from was called. The bake() function can also take selectors so that, if we want to investigate the neighborhood results, we can use:

#select neighboorhood results
bake(simple_ames, ames_test, starts_with("Neighborhood_"))

## # A tibble: 584 x 28
##    Neighborhood_College~ Neighborhood_Old_~ Neighborhood_Edw~ Neighborhood_Some~
##                    <dbl>              <dbl>             <dbl>              <dbl>
##  1                     0                  0                 0                  0
##  2                     0                  0                 0                  0
##  3                     0                  0                 0                  0
##  4                     0                  0                 0                  0
##  5                     0                  0                 0                  0
##  6                     0                  0                 0                  0
##  7                     0                  0                 0                  0
##  8                     0                  0                 0                  0
##  9                     0                  0                 0                  0
## 10                     0                  0                 0                  0
## # ... with 574 more rows, and 24 more variables:
## #   Neighborhood_Northridge_Heights <dbl>, Neighborhood_Gilbert <dbl>,
## #   Neighborhood_Sawyer <dbl>, Neighborhood_Northwest_Ames <dbl>,
## #   Neighborhood_Sawyer_West <dbl>, Neighborhood_Mitchell <dbl>,
## #   Neighborhood_Brookside <dbl>, Neighborhood_Crawford <dbl>,
## #   Neighborhood_Iowa_DOT_and_Rail_Road <dbl>, Neighborhood_Timberland <dbl>,
## #   Neighborhood_Northridge <dbl>, Neighborhood_Stone_Brook <dbl>,
## #   Neighborhood_South_and_West_of_Iowa_State_University <dbl>,
## #   Neighborhood_Clear_Creek <dbl>, Neighborhood_Meadow_Village <dbl>,
## #   Neighborhood_Briardale <dbl>, Neighborhood_Bloomington_Heights <dbl>,
## #   Neighborhood_Veenker <dbl>, Neighborhood_Northpark_Villa <dbl>,
## #   Neighborhood_Blueste <dbl>, Neighborhood_Greens <dbl>,
## #   Neighborhood_Green_Hills <dbl>, Neighborhood_Landmark <dbl>,
## #   Neighborhood_Hayden_Lake <dbl>

Looking at the output of the model we can the impact of the model coeffs on sale price

test_ex %>% tidy()

## Warning: Data frame tidiers are deprecated and will be removed in an upcoming
## release of broom.

## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.

## # A tibble: 35 x 13
##    column          n     mean     sd median trimmed     mad    min    max  range
##    <chr>       <dbl>    <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1 Gr_Liv_Area   584  3.15e+0  0.145 3.16e0    3.15  0.0933 2.64e0 3.75e0   1.11
##  2 Year_Built    584  1.97e+3 29.3   1.97e3 1975.   25      1.88e3 2.01e3 130   
##  3 Sale_Price    584  5.22e+0  0.180 5.20e0    5.22  0.103  4.11e0 5.80e0   1.69
##  4 Neighborho~   584  9.08e-2  0.288 0.        0     0      0.     1.00e0   1   
##  5 Neighborho~   584  8.22e-2  0.275 0.        0     0      0.     1.00e0   1   
##  6 Neighborho~   584  5.99e-2  0.238 0.        0     0      0.     1.00e0   1   
##  7 Neighborho~   584  5.31e-2  0.224 0.        0     0      0.     1.00e0   1   
##  8 Neighborho~   584  5.99e-2  0.238 0.        0     0      0.     1.00e0   1   
##  9 Neighborho~   584  6.34e-2  0.244 0.        0     0      0.     1.00e0   1   
## 10 Neighborho~   584  6.16e-2  0.241 0.        0     0      0.     1.00e0   1   
## # ... with 25 more rows, and 3 more variables: skew <dbl>, kurtosis <dbl>,
## #   se <dbl>

Another way to deal with categorical variables is to clump them together. This is best done when the levels makes up less than 1% of the data set and we have many levels. Of course we need to be careful for this new level not to become to big.

Encoding categoricals

ggplot(ames_train, aes(y = Neighborhood)) + 
        geom_bar() + 
        labs(y = NULL)

Why? For some models, it may be problematic to have dummy variables with a single non-zero entry in the column. At a minimum, it is highly improbable that these features would be important to a model, we can therefore collapse some of these factors into a single factor level by using step_other()

simple_ames <- 
        recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
               data = ames_train) %>%
        step_log(Gr_Liv_Area, base = 10) %>% 
        step_other(Neighborhood, threshold = 0.01) %>% 
        step_dummy(all_nominal())

Interactions

Below we see there are differences in the price and living area by building type, meaning type acts interactively with area to influence price.

ggplot(ames_train, aes(x = Gr_Liv_Area, y = 10^Sale_Price)) + 
        geom_point(alpha = .2) + 
        facet_wrap(~ Bldg_Type) + 
        geom_smooth(method = lm, formula = y ~ x, se = FALSE, col = "red") + 
        scale_x_log10() + 
        scale_y_log10() + 
        labs(x = "General Living Area", y = "Sale Price (USD)")

In base we deal with this by using either a colon between variables(:), or specify the interaction between variables with the multiplier symbol (*).

#baser interaction 
model1_int1 <-lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Bldg_Type + log10(Gr_Liv_Area):Bldg_Type, ames_train)
# or
model1_int2 <- lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) * Bldg_Type, ames_train)

summary(model1_int1)

## 
## Call:
## lm(formula = Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + 
##     Bldg_Type + log10(Gr_Liv_Area):Bldg_Type, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74055 -0.04323  0.00374  0.04568  0.31484 
## 
## Coefficients:
##                                                      Estimate Std. Error
## (Intercept)                                          3.083548   0.048350
## NeighborhoodCollege_Creek                            0.087784   0.007191
## NeighborhoodOld_Town                                -0.108763   0.007428
## NeighborhoodEdwards                                 -0.062454   0.007776
## NeighborhoodSomerset                                 0.136708   0.008521
## NeighborhoodNorthridge_Heights                       0.216400   0.009180
## NeighborhoodGilbert                                  0.038009   0.008545
## NeighborhoodSawyer                                   0.001126   0.008717
## NeighborhoodNorthwest_Ames                           0.027896   0.009029
## NeighborhoodSawyer_West                              0.043398   0.009131
## NeighborhoodMitchell                                 0.048527   0.009826
## NeighborhoodBrookside                               -0.060020   0.009407
## NeighborhoodCrawford                                 0.053116   0.010038
## NeighborhoodIowa_DOT_and_Rail_Road                  -0.148367   0.010383
## NeighborhoodTimberland                               0.142505   0.011732
## NeighborhoodNorthridge                               0.142782   0.012597
## NeighborhoodStone_Brook                              0.220550   0.013842
## NeighborhoodSouth_and_West_of_Iowa_State_University -0.092577   0.014238
## NeighborhoodClear_Creek                              0.060219   0.014887
## NeighborhoodMeadow_Village                          -0.116303   0.018555
## NeighborhoodBriardale                               -0.071170   0.022886
## NeighborhoodBloomington_Heights                      0.107678   0.020567
## NeighborhoodVeenker                                  0.121970   0.018021
## NeighborhoodNorthpark_Villa                          0.030740   0.020618
## NeighborhoodBlueste                                  0.055099   0.028989
## NeighborhoodGreens                                   0.188299   0.032413
## NeighborhoodGreen_Hills                              0.261556   0.057788
## NeighborhoodLandmark                                 0.044185   0.081825
## log10(Gr_Liv_Area)                                   0.671054   0.015581
## Bldg_TypeTwoFmCon                                    0.770292   0.251279
## Bldg_TypeDuplex                                      0.946443   0.226165
## Bldg_TypeTwnhs                                       1.236988   0.291237
## Bldg_TypeTwnhsE                                      0.071620   0.198107
## log10(Gr_Liv_Area):Bldg_TypeTwoFmCon                -0.253760   0.079460
## log10(Gr_Liv_Area):Bldg_TypeDuplex                  -0.325283   0.070530
## log10(Gr_Liv_Area):Bldg_TypeTwnhs                   -0.423664   0.093219
## log10(Gr_Liv_Area):Bldg_TypeTwnhsE                  -0.027576   0.063413
##                                                     t value Pr(>|t|)    
## (Intercept)                                          63.775  < 2e-16 ***
## NeighborhoodCollege_Creek                            12.208  < 2e-16 ***
## NeighborhoodOld_Town                                -14.643  < 2e-16 ***
## NeighborhoodEdwards                                  -8.032 1.52e-15 ***
## NeighborhoodSomerset                                 16.044  < 2e-16 ***
## NeighborhoodNorthridge_Heights                       23.573  < 2e-16 ***
## NeighborhoodGilbert                                   4.448 9.07e-06 ***
## NeighborhoodSawyer                                    0.129  0.89723    
## NeighborhoodNorthwest_Ames                            3.090  0.00203 ** 
## NeighborhoodSawyer_West                               4.753 2.13e-06 ***
## NeighborhoodMitchell                                  4.939 8.43e-07 ***
## NeighborhoodBrookside                                -6.380 2.13e-10 ***
## NeighborhoodCrawford                                  5.292 1.33e-07 ***
## NeighborhoodIowa_DOT_and_Rail_Road                  -14.289  < 2e-16 ***
## NeighborhoodTimberland                               12.146  < 2e-16 ***
## NeighborhoodNorthridge                               11.334  < 2e-16 ***
## NeighborhoodStone_Brook                              15.934  < 2e-16 ***
## NeighborhoodSouth_and_West_of_Iowa_State_University  -6.502 9.67e-11 ***
## NeighborhoodClear_Creek                               4.045 5.40e-05 ***
## NeighborhoodMeadow_Village                           -6.268 4.35e-10 ***
## NeighborhoodBriardale                                -3.110  0.00190 ** 
## NeighborhoodBloomington_Heights                       5.236 1.79e-07 ***
## NeighborhoodVeenker                                   6.768 1.65e-11 ***
## NeighborhoodNorthpark_Villa                           1.491  0.13611    
## NeighborhoodBlueste                                   1.901  0.05747 .  
## NeighborhoodGreens                                    5.809 7.14e-09 ***
## NeighborhoodGreen_Hills                               4.526 6.31e-06 ***
## NeighborhoodLandmark                                  0.540  0.58925    
## log10(Gr_Liv_Area)                                   43.070  < 2e-16 ***
## Bldg_TypeTwoFmCon                                     3.065  0.00220 ** 
## Bldg_TypeDuplex                                       4.185 2.96e-05 ***
## Bldg_TypeTwnhs                                        4.247 2.25e-05 ***
## Bldg_TypeTwnhsE                                       0.362  0.71774    
## log10(Gr_Liv_Area):Bldg_TypeTwoFmCon                 -3.194  0.00142 ** 
## log10(Gr_Liv_Area):Bldg_TypeDuplex                   -4.612 4.21e-06 ***
## log10(Gr_Liv_Area):Bldg_TypeTwnhs                    -4.545 5.78e-06 ***
## log10(Gr_Liv_Area):Bldg_TypeTwnhsE                   -0.435  0.66370    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08067 on 2309 degrees of freedom
## Multiple R-squared:  0.7939, Adjusted R-squared:  0.7907 
## F-statistic: 247.1 on 36 and 2309 DF,  p-value: < 2.2e-16

In tidymodels we can simply change the recipe with step_interact(), the tilde (~) means as a function of. So we can say in this example the interaction is a function of living area by buolding type. As building type is contrast coded we use the starts_with function.

simple_ames <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal()) %>% 
  # Gr_Liv_Area is on the log scale from a previous step
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_")) #don't use step_interact( ~ Gr_Liv_Area:Bldg_Type)

Splines/degrees of freedom - They deserve your love!

Splines are key for when there is nonlinear relationships between predictors and outcome variables (so often)
The smooth function in ggplot is an example of spline application, ploys/degrees (check it out below)

library(patchwork)
library(splines)

plot_smoother <- function(deg_free) {
  ggplot(ames_train, aes(x = Latitude, y = Sale_Price)) + 
    geom_point(alpha = .2) + 
    scale_y_log10() +
    geom_smooth(
      method = lm,
      formula = y ~ ns(x, df = deg_free),
      col = "red",
      se = FALSE
    ) +
    ggtitle(paste(deg_free, "Spline Terms"))
}

( plot_smoother(2) + plot_smoother(5) ) / ( plot_smoother(20) + plot_smoother(100) )

Some panels clearly fit poorly; two terms under-fits the data while 100 terms over-fits, if you wonder which is worse its both https://stats.stackexchange.com/questions/249493/which-model-is-better-one-that-overfits-or-one-that-underfits. Why? because overfitting The panels with five and 20 terms seem like reasonably smooth fits that catch the main patterns of the data. This indicates that the proper amount of “non-linearness” matters.How do you select the optimal number of splines/degrees of freedom? I’m so glad you asked, see herehttps://cran.r-project.org/web/packages/santaR/vignettes/selecting-optimal-df.html.

In recipes to add a natural spline we use step_ns(), selecting the feature and the number of degrees. Not sure how to automate spline selection it at time of writing in the smooth.spline function you can select cv = TRUE, seems to be that you can use spline_degree(range = c(1L, 10L), trans = NULL) from the dials package (more later).

recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, deg_free = 20)

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          5
## 
## Operations:
## 
## Log transformation on Gr_Liv_Area
## Collapsing factor levels for Neighborhood
## Dummy variables from all_nominal()
## Interactions with Gr_Liv_Area:starts_with("Bldg_Type_")
## Natural Splines on Latitude

###Feature Extraction Finally (for now), I’ll briefly cover how to reduce down the number of numeric predictors typically you would use. Essentially when numeric predictors relate (this can be non-linear) we can reduce them down. A good example of what we can do with feature extraction is psychometrics, if you have ever done a personality or IQ you will know there are a number of questions that we combine together to measure latent variables such as spatial intelligence or introvertion/extraversion. While this is generally done with factor analysis, PCAs are often interpreted in the same way but is more focused on feature reduction than identification than latent variables.

Before going further please note that Julia Silge has an awesome blog on UMAP and PCA https://juliasilge.com/blog/cocktail-recipes-umap/ which has a great video.

PCA - this has been covered in tidymodels really well here https://allisonhorst.github.io/palmerpenguins/articles/articles/pca.html
UMAP - UMAP is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis. Learn more here https://umap-learn.readthedocs.io/en/latest/how_umap_works.html

EXAMPLE - PCA

See correlations in palmer dataset

library(corrr)
penguins_corr <- penguins %>%
  dplyr::select(where(is.numeric)) %>%
  correlate() %>%
  rearrange()

## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'

## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy

penguins_corr

## # A tibble: 5 x 6
##   term         flipper_length_~ body_mass_g bill_length_mm    year bill_depth_mm
##   <chr>                   <dbl>       <dbl>          <dbl>   <dbl>         <dbl>
## 1 flipper_len~           NA          0.871          0.656   0.170        -0.584 
## 2 body_mass_g             0.871     NA              0.595   0.0422       -0.472 
## 3 bill_length~            0.656      0.595         NA       0.0545       -0.235 
## 4 year                    0.170      0.0422         0.0545 NA            -0.0604
## 5 bill_depth_~           -0.584     -0.472         -0.235  -0.0604       NA

Build and visualise PCA in tidymodels

library(recipes)
penguin_recipe <-
  recipe(~., data = penguins) %>% 
  update_role(species, island, sex, new_role = "id") %>% 
  step_naomit(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% # we'd want to use all numeric on ames 
  step_pca(all_predictors(), id = "pca") %>% 
  prep()

penguin_pca <- 
  penguin_recipe %>% 
  tidy(id = "pca") 

penguin_pca

## # A tibble: 25 x 4
##    terms               value component id   
##    <chr>               <dbl> <chr>     <chr>
##  1 bill_length_mm     0.452  PC1       pca  
##  2 bill_depth_mm     -0.398  PC1       pca  
##  3 flipper_length_mm  0.576  PC1       pca  
##  4 body_mass_g        0.544  PC1       pca  
##  5 year               0.0957 PC1       pca  
##  6 bill_length_mm    -0.0935 PC2       pca  
##  7 bill_depth_mm      0.0164 PC2       pca  
##  8 flipper_length_mm  0.0256 PC2       pca  
##  9 body_mass_g       -0.111  PC2       pca  
## 10 year               0.989  PC2       pca  
## # ... with 15 more rows

This output shows how each measure contribute the components, in this case there are 4, which we can see how much variance each accounts for.

penguin_recipe %>% 
  tidy(id = "pca", type = "variance") %>% 
  dplyr::filter(terms == "percent variance") %>% 
  ggplot(aes(x = component, y = value)) + 
  geom_col(fill = "#b6dfe2") + 
  xlim(c(0, 5)) + 
  ylab("% of total variance")

## Warning: Removed 1 rows containing missing values (geom_col).

library(ggplot2)
penguin_pca %>%
  mutate(terms = tidytext::reorder_within(terms, 
                                          abs(value), 
                                          component)) %>%
  ggplot(aes(abs(value), terms, fill = value > 0)) +
  geom_col() +
  facet_wrap(~component, scales = "free_y") +
  tidytext::scale_y_reordered() +
  scale_fill_manual(values = c("#b6dfe2", "#0A537D")) +
  labs(
    x = "Absolute value of contribution",
    y = NULL, fill = "Positive?"
  )

How do we select the optimal number of PCs? There is no robust method for determining how many components to use. As the number of observations, the number of variables, and the application vary, a different level of accuracy and variable reduction are desirable. The most common technique for determining how many principal components is to eyeball scree plots and look for the “elbow point”, where the variance by the PCs significantly drops off.