Note to the knowledge creators: Our thanks go to Max Kuhn, Julia Silge (I highly recommend checking out Julia's awesome YouTube channel https://www.youtube.com/c/JuliaSilge/videos and blog https://juliasilge.com/), and Kjell Johnson, along with the tidymodels and RStudio teams, for allowing us to use these materials. Much of this information was taken from http://www.feat.engineering/ and https://www.tidymodels.org/. I would also like to point people towards Andrew Couch https://www.youtube.com/user/asianlongboarder and Matt Dancho https://www.business-science.io/, who also demonstrate masterful use of the tidymodels framework on YouTube and business-science.io.
“Feature engineering encompasses activities that reformat predictor values to make them easier for a model to use effectively.”
Basically, how to build our recipes for baking… In other words, how to prepare our data by specifying the treatments required to engineer our features, and how to build models that can be used for prediction.
Preprocessing focuses on making data suitable for modeling through data transformations. While transformations can be accomplished with dplyr and other tidyverse packages, it is suggested that we consider the tidymodels packages when model development becomes heavier and more complex.
library(tidymodels) # loads recipes, rsample, dplyr, ggplot2 and modeldata (which contains the ames data)
data(ames)
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))
set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
ames %>% slice_head(n = 5)
## # A tibble: 5 x 74
## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
## <fct> <fct> <dbl> <int> <fct> <fct> <fct>
## 1 One_Story_1946_~ Residential_~ 141 31770 Pave No_All~ Slightly_~
## 2 One_Story_1946_~ Residential_~ 80 11622 Pave No_All~ Regular
## 3 One_Story_1946_~ Residential_~ 81 14267 Pave No_All~ Slightly_~
## 4 One_Story_1946_~ Residential_~ 93 11160 Pave No_All~ Regular
## 5 Two_Story_1946_~ Residential_~ 74 13830 Pave No_All~ Slightly_~
## # ... with 67 more variables: Land_Contour <fct>, Utilities <fct>,
## # Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>, Condition_1 <fct>,
## # Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>, Overall_Cond <fct>,
## # Year_Built <int>, Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
## # Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
## # Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>, Bsmt_Cond <fct>,
## # Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>,
## # BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
## # Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>, Central_Air <fct>,
## # Electrical <fct>, First_Flr_SF <int>, Second_Flr_SF <int>,
## # Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
## # Full_Bath <int>, Half_Bath <int>, Bedroom_AbvGr <int>, Kitchen_AbvGr <int>,
## # TotRms_AbvGrd <int>, Functional <fct>, Fireplaces <int>, Garage_Type <fct>,
## # Garage_Finish <fct>, Garage_Cars <dbl>, Garage_Area <dbl>,
## # Garage_Cond <fct>, Paved_Drive <fct>, Wood_Deck_SF <int>,
## # Open_Porch_SF <int>, Enclosed_Porch <int>, Three_season_porch <int>,
## # Screen_Porch <int>, Pool_Area <int>, Pool_QC <fct>, Fence <fct>,
## # Misc_Feature <fct>, Misc_Val <int>, Mo_Sold <int>, Year_Sold <int>,
## # Sale_Type <fct>, Sale_Condition <fct>, Sale_Price <dbl>, Longitude <dbl>,
## # Latitude <dbl>
library(skimr) # for the skim() summary below
ames %>% skim(Sale_Price, Neighborhood, Gr_Liv_Area,
              Year_Built, Bldg_Type)
| Data summary | |
|---|---|
| Name | Piped data |
| Number of rows | 2930 |
| Number of columns | 74 |
| Column type frequency: factor | 2 |
| Column type frequency: numeric | 3 |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Neighborhood | 0 | 1 | FALSE | 28 | Nor: 443, Col: 267, Old: 239, Edw: 194 |
| Bldg_Type | 0 | 1 | FALSE | 5 | One: 2425, Twn: 233, Dup: 109, Twn: 101 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Sale_Price | 0 | 1 | 5.22 | 0.18 | 4.11 | 5.11 | 5.2 | 5.33 | 5.88 | ▁▁▆▇▁ |
| Gr_Liv_Area | 0 | 1 | 1499.69 | 505.51 | 334.00 | 1126.00 | 1442.0 | 1742.75 | 5642.00 | ▇▇▁▁▁ |
| Year_Built | 0 | 1 | 1971.36 | 30.25 | 1872.00 | 1954.00 | 1973.0 | 2001.00 | 2010.00 | ▁▂▃▆▇ |
ames %>%
ggplot(mapping = aes( x = Sale_Price)) +
geom_histogram(bins = 50)
In base R we model using a function such as lm() or glm() for a 'simple' linear model. Within the formula we can transform our data on the fly (in the model fitted just below, a base-10 log of the living area). We can also do a variety of other things, such as adding weights, declaring a variable as a factor (dummy coding), or specifying interactions; a small illustrative sketch of those extras comes first, followed by the model we will actually use.
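This sketch is not part of the Ames analysis: the choice of Mo_Sold as an on-the-fly factor and of Lot_Area as weights is arbitrary, made up purely to show the formula and lm() arguments.
# illustrative only: weights, an on-the-fly factor, and an interaction in one formula
toy_fit <- lm(
  Sale_Price ~ factor(Mo_Sold) + log10(Gr_Liv_Area) * Bldg_Type,
  data    = ames,
  weights = Lot_Area  # arbitrary example weights, not a modeling recommendation
)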
model1 <- lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Year_Built + Bldg_Type, data = ames)
summary(model1)
##
## Call:
## lm(formula = Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) +
## Year_Built + Bldg_Type, data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.83789 -0.04051 0.00115 0.04245 0.39186
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.8551717 0.2066005
## NeighborhoodCollege_Creek 0.0135485 0.0073316
## NeighborhoodOld_Town -0.0289607 0.0074746
## NeighborhoodEdwards -0.0493174 0.0068010
## NeighborhoodSomerset 0.0499653 0.0086486
## NeighborhoodNorthridge_Heights 0.1335758 0.0090287
## NeighborhoodGilbert -0.0337334 0.0083591
## NeighborhoodSawyer -0.0042779 0.0074500
## NeighborhoodNorthwest_Ames 0.0004589 0.0081490
## NeighborhoodSawyer_West -0.0174582 0.0086299
## NeighborhoodMitchell 0.0004695 0.0085823
## NeighborhoodBrookside -0.0110205 0.0089257
## NeighborhoodCrawford 0.0914254 0.0089320
## NeighborhoodIowa_DOT_and_Rail_Road -0.0839821 0.0095969
## NeighborhoodTimberland 0.0604062 0.0108034
## NeighborhoodNorthridge 0.0845868 0.0113346
## NeighborhoodStone_Brook 0.1459657 0.0128573
## NeighborhoodSouth_and_West_of_Iowa_State_University -0.0282535 0.0124617
## NeighborhoodClear_Creek 0.0480071 0.0126039
## NeighborhoodMeadow_Village -0.0899124 0.0153282
## NeighborhoodBriardale -0.0465821 0.0175020
## NeighborhoodBloomington_Heights 0.0402528 0.0169594
## NeighborhoodVeenker 0.0885538 0.0168252
## NeighborhoodNorthpark_Villa 0.0262051 0.0183321
## NeighborhoodBlueste 0.0322372 0.0262466
## NeighborhoodGreens 0.1751507 0.0290318
## NeighborhoodGreen_Hills 0.2229230 0.0562585
## NeighborhoodLandmark -0.0119925 0.0796196
## log10(Gr_Liv_Area) 0.6343996 0.0126012
## Year_Built 0.0020678 0.0001044
## Bldg_TypeTwoFmCon -0.0312306 0.0104405
## Bldg_TypeDuplex -0.1038443 0.0079704
## Bldg_TypeTwnhs -0.0968859 0.0110076
## Bldg_TypeTwnhsE -0.0414929 0.0068666
## t value Pr(>|t|)
## (Intercept) -4.139 3.58e-05 ***
## NeighborhoodCollege_Creek 1.848 0.064710 .
## NeighborhoodOld_Town -3.875 0.000109 ***
## NeighborhoodEdwards -7.251 5.27e-13 ***
## NeighborhoodSomerset 5.777 8.40e-09 ***
## NeighborhoodNorthridge_Heights 14.795 < 2e-16 ***
## NeighborhoodGilbert -4.036 5.59e-05 ***
## NeighborhoodSawyer -0.574 0.565865
## NeighborhoodNorthwest_Ames 0.056 0.955100
## NeighborhoodSawyer_West -2.023 0.043165 *
## NeighborhoodMitchell 0.055 0.956374
## NeighborhoodBrookside -1.235 0.217043
## NeighborhoodCrawford 10.236 < 2e-16 ***
## NeighborhoodIowa_DOT_and_Rail_Road -8.751 < 2e-16 ***
## NeighborhoodTimberland 5.591 2.46e-08 ***
## NeighborhoodNorthridge 7.463 1.11e-13 ***
## NeighborhoodStone_Brook 11.353 < 2e-16 ***
## NeighborhoodSouth_and_West_of_Iowa_State_University -2.267 0.023450 *
## NeighborhoodClear_Creek 3.809 0.000142 ***
## NeighborhoodMeadow_Village -5.866 4.98e-09 ***
## NeighborhoodBriardale -2.662 0.007822 **
## NeighborhoodBloomington_Heights 2.373 0.017686 *
## NeighborhoodVeenker 5.263 1.52e-07 ***
## NeighborhoodNorthpark_Villa 1.429 0.152979
## NeighborhoodBlueste 1.228 0.219455
## NeighborhoodGreens 6.033 1.81e-09 ***
## NeighborhoodGreen_Hills 3.962 7.60e-05 ***
## NeighborhoodLandmark -0.151 0.880284
## log10(Gr_Liv_Area) 50.344 < 2e-16 ***
## Year_Built 19.805 < 2e-16 ***
## Bldg_TypeTwoFmCon -2.991 0.002801 **
## Bldg_TypeDuplex -13.029 < 2e-16 ***
## Bldg_TypeTwnhs -8.802 < 2e-16 ***
## Bldg_TypeTwnhsE -6.043 1.71e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07872 on 2896 degrees of freedom
## Multiple R-squared: 0.8045, Adjusted R-squared: 0.8022
## F-statistic: 361 on 33 and 2896 DF, p-value: < 2.2e-16
As soon as the code is run, the formula method converts the data frame into a model (design) matrix and the least squares method is used to estimate the parameters. What does this mean?
Such a strategy creates several problems when we need to build a lot of models with varying data, as we have to redefine the model each time. There is also more by Kuhn (2017) explaining why the base R formula/model-matrix approach is not great for preprocessing: https://rviews.rstudio.com/2017/03/01/the-r-formula-method-the-bad-parts/.
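To make the "model matrix" idea concrete, we can peek at the design matrix lm() builds behind the scenes; the row/column selection below is arbitrary and only meant to show its shape.
# the first few rows and columns of the design matrix base R constructs internally
model.matrix(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Year_Built + Bldg_Type,
             data = ames)[1:3, 1:4]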
### Recipes

This is where recipes come in: instead of using the base R approach we can create a recipe() with the tidymodels recipes package. What is a recipe?
"The recipes package is an alternative method for creating and preprocessing design matrices that can be used for modeling or visualization." tidymodels team, https://recipes.tidymodels.org/index.html A design matrix can be thought of as a grid with one row per observation and one column for each predictor term of the model, after the chosen encodings and transformations have been applied to the original variables.
As the tidymodels team state, "The idea of the recipes package is to define a recipe or blueprint that can be used to sequentially define the encodings and preprocessing of the data (i.e. 'feature engineering')." This allows us to prespecify the data transformations we wish to apply before executing them on our training data. Each data transformation in a tidymodels recipe is a step. The functions correspond to specific types of steps, each of which has a prefix of step_. There are many step_ functions one may want to use, but we should consider the importance of their order: https://recipes.tidymodels.org/articles/Ordering.html
Some general advice on ordering, summarised from the article linked above: impute missing data first, then handle factor levels, apply individual transformations (e.g. logs), discretize only if you must, create dummy variables, then interactions, then normalization (centering/scaling), and finally multivariate transformations such as PCA. A skeleton illustrating that order is sketched below.
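The skeleton below is illustrative only: the Ames training data have no missing values and several of these steps are unnecessary here, so it exists purely to show the recommended ordering (older-style selectors such as all_nominal() are used to match the rest of this document).
recipe(Sale_Price ~ ., data = ames_train) %>%
  step_medianimpute(all_numeric(), -all_outcomes()) %>%        # 1. impute (nothing missing here)
  step_other(Neighborhood, threshold = 0.01) %>%               # 2. handle factor levels
  step_log(Gr_Liv_Area, base = 10) %>%                         # 3. individual transformations
  step_dummy(all_nominal()) %>%                                # 4. dummy variables (discretizing skipped)
  step_interact(~ Gr_Liv_Area:starts_with("Bldg_Type_")) %>%   # 5. interactions
  step_normalize(all_numeric(), -all_outcomes()) %>%           # 6. normalize
  step_pca(all_numeric(), -all_outcomes())                     # 7. multivariate transformations
Back to the simpler recipe we will actually use: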
simple_ames <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_dummy(all_nominal())
simple_ames
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 4
##
## Operations:
##
## Log transformation on Gr_Liv_Area
## Dummy variables from all_nominal()
We are going to briefly go down a rabbit hole here, don’t worry we’ll be back soon.
Dummy coding can dramatically expand your data set. For example, if we apply contrast coding to the Ames Neighborhood variable alone (see below) we expand our data considerably. This makes modeling far more resource demanding once it commences, which throttles productivity, steals the most precious thing we have (time), and can lead to us crashing our systems or spending a lot of money (cloud computing is cheap until we start scaling processing demands).
as_tibble(contrasts(ames$Neighborhood))
## # A tibble: 29 x 28
## College_Creek Old_Town Edwards Somerset Northridge_Heights Gilbert Sawyer
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0
## 3 0 1 0 0 0 0 0
## 4 0 0 1 0 0 0 0
## 5 0 0 0 1 0 0 0
## 6 0 0 0 0 1 0 0
## 7 0 0 0 0 0 1 0
## 8 0 0 0 0 0 0 1
## 9 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0
## # ... with 19 more rows, and 21 more variables: Northwest_Ames <dbl>,
## # Sawyer_West <dbl>, Mitchell <dbl>, Brookside <dbl>, Crawford <dbl>,
## # Iowa_DOT_and_Rail_Road <dbl>, Timberland <dbl>, Northridge <dbl>,
## # Stone_Brook <dbl>, South_and_West_of_Iowa_State_University <dbl>,
## # Clear_Creek <dbl>, Meadow_Village <dbl>, Briardale <dbl>,
## # Bloomington_Heights <dbl>, Veenker <dbl>, Northpark_Villa <dbl>,
## # Blueste <dbl>, Greens <dbl>, Green_Hills <dbl>, Landmark <dbl>,
## # Hayden_Lake <dbl>
As you can see above, dummy coding replaces a category with a variable for each level of the data. That is not so bad with 28 variables, but imagine adding post/zip codes? In the UK there are 1.7 million. This width will throttle processing. Luckily there are solutions that allow us to reduce this information into a smaller number of columns: embedding and hashing are two methods that are currently popular.
Embedding - Encodes categorical data as multiple numeric variables using a word embedding approach. It was initially used to take a large number of word identifiers and represent them in a smaller dimension.
Feature Hashing - Another way to combine categories is to use a hashing function (or hashes). "Hashes are used to map one set of values to another set of values and are typically used in databases and cryptography (Preneel 2010). The original values are called the keys and are typically mapped to a smaller set of artificial hash values." Kuhn and Johnson (2019). In tidymodels we can use step_feature_hash() https://embed.tidymodels.org/reference/step_feature_hash.html to do this. Hashing, like embedding, compresses wide data into a specified, smaller number of columns/variables; a hash of each level's value determines which of the new columns it maps to. One problem with these methods is that we cannot easily map back to the original data.
Word of warning: you need to have TensorFlow/Keras installed, which I found a bit of a pain. If you do attempt this, it seemed that initialising TensorFlow at the conda prompt and running the RStudio packages from Anaconda helped build the connection.
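Before the embedding example, here is a minimal sketch of what a hashing recipe might look like for the same column. It assumes embed::step_feature_hash() with its num_hash argument and a working TensorFlow install, and is not run here.
library(embed)

hash_rec <-
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  # map the many Neighborhood levels onto a fixed number of hashed indicator columns
  step_feature_hash(Neighborhood, num_hash = 16)

# prep(hash_rec, training = ames_train) would then estimate the step as usual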
Carrying on with our Ames data set example: by encoding the neighborhoods we can reduce the 28 dummy columns (representing 29 levels, with one left out for degrees of freedom) down to five embedding columns (num_terms = 5 below).
### Taken from: https://embed.tidymodels.org/articles/Applications/Tensorflow.html
library(embed)
tf_embed <-
recipe(Sale_Price ~ ., data = ames) %>%
step_log(Sale_Price, base = 10) %>%
# Add some other predictors that can be used by the network. We
# preprocess them first
step_YeoJohnson(Lot_Area, Full_Bath, Gr_Liv_Area) %>%
step_range(Lot_Area, Full_Bath, Gr_Liv_Area) %>%
step_embed(
Neighborhood, # variable to embed
outcome = vars(Sale_Price), # we need to specify the outcome/dependent variable the embedings will be used for
predictors = vars(Lot_Area, Full_Bath, Gr_Liv_Area), # encoding columns
num_terms = 5,
hidden_units = 10,
options = embed_control(epochs = 75, validation_split = 0.2) #trained on 75 epochs
) %>%
prep(training = ames)
#you can see the graph for this in the source code for this mark down
Let's look at our new columns…
hood_coef <-
  tidy(tf_embed, number = 4) %>%
  dplyr::select(-terms, -id) %>%
  dplyr::rename(Neighborhood = level) %>%
  # Make the names smaller (a lambda instead of the deprecated funs())
  rename_at(vars(contains("emb")), ~ gsub("Neighborhood_", "", ., fixed = TRUE))
hood_coef
## # A tibble: 30 x 6
## embed_1 embed_2 embed_3 embed_4 embed_5 Neighborhood
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 -0.0218 -0.0272 -0.0402 0.0457 -0.0214 ..new
## 2 0.0158 0.0366 0.0391 -0.0639 -0.0150 North_Ames
## 3 -0.0654 0.0716 -0.0384 0.0117 0.0399 College_Creek
## 4 0.0357 -0.0332 -0.0273 0.0103 -0.0389 Old_Town
## 5 0.0434 0.00264 -0.0144 0.0484 0.00978 Edwards
## 6 -0.0607 0.0964 -0.0257 -0.0258 0.0801 Somerset
## 7 -0.121 0.101 0.0326 -0.0535 0.0697 Northridge_Heights
## 8 -0.0503 0.0278 0.0400 0.00560 -0.00665 Gilbert
## 9 -0.0167 0.00569 0.0261 -0.0286 -0.0137 Sawyer
## 10 -0.0547 -0.0271 0.0195 -0.0450 -0.0150 Northwest_Ames
## # ... with 20 more rows
Not bad! We have reduced the number of columns to roughly a fifth of what dummy coding would need. There are a few more considerations to make here, as the embedding can introduce correlations between the new variables.
# tf_plot  (a plot of the embeddings, built in the source R Markdown for this document)
hood_coef %>%
dplyr::select(contains("emb")) %>%
cor() %>%
round(2)
## embed_1 embed_2 embed_3 embed_4 embed_5
## embed_1 1.00 -0.59 -0.44 0.31 -0.54
## embed_2 -0.59 1.00 -0.01 -0.29 0.53
## embed_3 -0.44 -0.01 1.00 -0.28 0.19
## embed_4 0.31 -0.29 -0.28 1.00 -0.05
## embed_5 -0.54 0.53 0.19 -0.05 1.00
Don’t you just love statistics? Never be bored! :D Unfortunately Alice needs to go home, so out of the rabbit hole we go.
"For a recipe with at least one preprocessing operation, estimate the required parameters from a training set that can be later applied to other data sets." https://recipes.tidymodels.org/reference/prep.html
The second phase of using a recipe is to estimate any quantities required by the steps using the prep() function. When called, the function trains our recipe on the supplied data, i.e. ames_train; note that nothing is applied to new data at this stage (that is what bake(), below, is for) and some steps have nothing to estimate.
simple_ames <- prep(simple_ames, training = ames_train)
simple_ames
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 4
##
## Training data contained 2346 data points and no missing data.
##
## Operations:
##
## Log transformation on Gr_Liv_Area [trained]
## Dummy variables from Neighborhood, Bldg_Type [trained]
By default prep() retains the prepared version of the training set inside the recipe object (retain = TRUE). This is convenient, but with large data sets we may want to set retain = FALSE to drop it and save memory.
simple_ames <- prep(simple_ames, training = ames_train, retain = TRUE)
simple_ames
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 4
##
## Training data contained 2346 data points and no missing data.
##
## Operations:
##
## Log transformation on Gr_Liv_Area [trained]
## Dummy variables from Neighborhood, Bldg_Type [trained]
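For very large training sets, a minimal sketch of dropping the retained data (with retain = FALSE the recipe can still bake new data, but baking with new_data = NULL to recover the processed training set no longer works):
lean_ames <- prep(simple_ames, training = ames_train, retain = FALSE)
# bake(lean_ames, new_data = ames_test) still works; bake(lean_ames, new_data = NULL) will not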
Using the tidy() method we can see which steps the recipe contains, whether they have been trained, and in what order they will be applied.
simple_ames %>% tidy()
## # A tibble: 2 x 6
## number operation type trained skip id
## <int> <chr> <chr> <lgl> <lgl> <chr>
## 1 1 step log TRUE FALSE log_kjCwe
## 2 2 step dummy TRUE FALSE dummy_KxLdc
The third phase of recipe usage is to apply the preprocessing operations to a data set using the bake() function. The bake() function can apply the recipe to any data set, in this case the test set; if we want the retained, processed training set we can set new_data = NULL.
# bake(simple_ames, new_data = NULL) # to get the processed training set back
test_ex <- bake(simple_ames, new_data = ames_test)
names(test_ex) %>% head()
## [1] "Gr_Liv_Area" "Year_Built"
## [3] "Sale_Price" "Neighborhood_College_Creek"
## [5] "Neighborhood_Old_Town" "Neighborhood_Edwards"
Note the dummy variable columns starting with “Neighborhood_”, or whatever the variable column they originally came from was called. The bake() function can also take selectors so that, if we only want the neighborhood results, we can use:
# select the neighborhood columns only
bake(simple_ames, ames_test, starts_with("Neighborhood_"))
## # A tibble: 584 x 28
## Neighborhood_College~ Neighborhood_Old_~ Neighborhood_Edw~ Neighborhood_Some~
## <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## 10 0 0 0 0
## # ... with 574 more rows, and 24 more variables:
## # Neighborhood_Northridge_Heights <dbl>, Neighborhood_Gilbert <dbl>,
## # Neighborhood_Sawyer <dbl>, Neighborhood_Northwest_Ames <dbl>,
## # Neighborhood_Sawyer_West <dbl>, Neighborhood_Mitchell <dbl>,
## # Neighborhood_Brookside <dbl>, Neighborhood_Crawford <dbl>,
## # Neighborhood_Iowa_DOT_and_Rail_Road <dbl>, Neighborhood_Timberland <dbl>,
## # Neighborhood_Northridge <dbl>, Neighborhood_Stone_Brook <dbl>,
## # Neighborhood_South_and_West_of_Iowa_State_University <dbl>,
## # Neighborhood_Clear_Creek <dbl>, Neighborhood_Meadow_Village <dbl>,
## # Neighborhood_Briardale <dbl>, Neighborhood_Bloomington_Heights <dbl>,
## # Neighborhood_Veenker <dbl>, Neighborhood_Northpark_Villa <dbl>,
## # Neighborhood_Blueste <dbl>, Neighborhood_Greens <dbl>,
## # Neighborhood_Green_Hills <dbl>, Neighborhood_Landmark <dbl>,
## # Neighborhood_Hayden_Lake <dbl>
We can also pull quick summary statistics for every baked column (note that the data frame tidier used below is deprecated in broom, as the warnings point out):
test_ex %>% tidy()
## Warning: Data frame tidiers are deprecated and will be removed in an upcoming
## release of broom.
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## # A tibble: 35 x 13
## column n mean sd median trimmed mad min max range
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Gr_Liv_Area 584 3.15e+0 0.145 3.16e0 3.15 0.0933 2.64e0 3.75e0 1.11
## 2 Year_Built 584 1.97e+3 29.3 1.97e3 1975. 25 1.88e3 2.01e3 130
## 3 Sale_Price 584 5.22e+0 0.180 5.20e0 5.22 0.103 4.11e0 5.80e0 1.69
## 4 Neighborho~ 584 9.08e-2 0.288 0. 0 0 0. 1.00e0 1
## 5 Neighborho~ 584 8.22e-2 0.275 0. 0 0 0. 1.00e0 1
## 6 Neighborho~ 584 5.99e-2 0.238 0. 0 0 0. 1.00e0 1
## 7 Neighborho~ 584 5.31e-2 0.224 0. 0 0 0. 1.00e0 1
## 8 Neighborho~ 584 5.99e-2 0.238 0. 0 0 0. 1.00e0 1
## 9 Neighborho~ 584 6.34e-2 0.244 0. 0 0 0. 1.00e0 1
## 10 Neighborho~ 584 6.16e-2 0.241 0. 0 0 0. 1.00e0 1
## # ... with 25 more rows, and 3 more variables: skew <dbl>, kurtosis <dbl>,
## # se <dbl>
Another way to deal with categorical variables is to clump rare levels together. This is best done when we have many levels and some of them each make up less than about 1% of the data set. Of course we need to be careful that this new pooled level does not itself become too big. A quick numeric check is sketched below, alongside the bar chart that follows.
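As a rough check of which neighborhoods would be pooled, a small dplyr sketch (the 0.01 cut-off matches the step_other() threshold used further down):
ames_train %>%
  count(Neighborhood, sort = TRUE) %>%
  mutate(prop = n / sum(n)) %>%
  filter(prop < 0.01)  # levels that step_other(threshold = 0.01) would collapse into "other"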
ggplot(ames_train, aes(y = Neighborhood)) +
geom_bar() +
labs(y = NULL)
Why? For some models it may be problematic to have dummy variables with a single non-zero entry in the column. At a minimum, it is highly improbable that these features would be important to a model; we can therefore collapse some of these rare factor levels into a single level by using step_other().
simple_ames <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal())
Below we see that the relationship between living area and sale price differs by building type; in other words, building type interacts with living area to influence price.
ggplot(ames_train, aes(x = Gr_Liv_Area, y = 10^Sale_Price)) +
geom_point(alpha = .2) +
facet_wrap(~ Bldg_Type) +
geom_smooth(method = lm, formula = y ~ x, se = FALSE, col = "red") +
scale_x_log10() +
scale_y_log10() +
labs(x = "General Living Area", y = "Sale Price (USD)")
In base R we deal with this by either putting a colon between the variables (:) to request only the interaction term, or using the multiplication symbol (*), which adds the main effects as well as the interaction.
# base R interactions
model1_int1 <-lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Bldg_Type + log10(Gr_Liv_Area):Bldg_Type, ames_train)
# or
model1_int2 <- lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) * Bldg_Type, ames_train)
summary(model1_int1)
##
## Call:
## lm(formula = Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) +
## Bldg_Type + log10(Gr_Liv_Area):Bldg_Type, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74055 -0.04323 0.00374 0.04568 0.31484
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 3.083548 0.048350
## NeighborhoodCollege_Creek 0.087784 0.007191
## NeighborhoodOld_Town -0.108763 0.007428
## NeighborhoodEdwards -0.062454 0.007776
## NeighborhoodSomerset 0.136708 0.008521
## NeighborhoodNorthridge_Heights 0.216400 0.009180
## NeighborhoodGilbert 0.038009 0.008545
## NeighborhoodSawyer 0.001126 0.008717
## NeighborhoodNorthwest_Ames 0.027896 0.009029
## NeighborhoodSawyer_West 0.043398 0.009131
## NeighborhoodMitchell 0.048527 0.009826
## NeighborhoodBrookside -0.060020 0.009407
## NeighborhoodCrawford 0.053116 0.010038
## NeighborhoodIowa_DOT_and_Rail_Road -0.148367 0.010383
## NeighborhoodTimberland 0.142505 0.011732
## NeighborhoodNorthridge 0.142782 0.012597
## NeighborhoodStone_Brook 0.220550 0.013842
## NeighborhoodSouth_and_West_of_Iowa_State_University -0.092577 0.014238
## NeighborhoodClear_Creek 0.060219 0.014887
## NeighborhoodMeadow_Village -0.116303 0.018555
## NeighborhoodBriardale -0.071170 0.022886
## NeighborhoodBloomington_Heights 0.107678 0.020567
## NeighborhoodVeenker 0.121970 0.018021
## NeighborhoodNorthpark_Villa 0.030740 0.020618
## NeighborhoodBlueste 0.055099 0.028989
## NeighborhoodGreens 0.188299 0.032413
## NeighborhoodGreen_Hills 0.261556 0.057788
## NeighborhoodLandmark 0.044185 0.081825
## log10(Gr_Liv_Area) 0.671054 0.015581
## Bldg_TypeTwoFmCon 0.770292 0.251279
## Bldg_TypeDuplex 0.946443 0.226165
## Bldg_TypeTwnhs 1.236988 0.291237
## Bldg_TypeTwnhsE 0.071620 0.198107
## log10(Gr_Liv_Area):Bldg_TypeTwoFmCon -0.253760 0.079460
## log10(Gr_Liv_Area):Bldg_TypeDuplex -0.325283 0.070530
## log10(Gr_Liv_Area):Bldg_TypeTwnhs -0.423664 0.093219
## log10(Gr_Liv_Area):Bldg_TypeTwnhsE -0.027576 0.063413
## t value Pr(>|t|)
## (Intercept) 63.775 < 2e-16 ***
## NeighborhoodCollege_Creek 12.208 < 2e-16 ***
## NeighborhoodOld_Town -14.643 < 2e-16 ***
## NeighborhoodEdwards -8.032 1.52e-15 ***
## NeighborhoodSomerset 16.044 < 2e-16 ***
## NeighborhoodNorthridge_Heights 23.573 < 2e-16 ***
## NeighborhoodGilbert 4.448 9.07e-06 ***
## NeighborhoodSawyer 0.129 0.89723
## NeighborhoodNorthwest_Ames 3.090 0.00203 **
## NeighborhoodSawyer_West 4.753 2.13e-06 ***
## NeighborhoodMitchell 4.939 8.43e-07 ***
## NeighborhoodBrookside -6.380 2.13e-10 ***
## NeighborhoodCrawford 5.292 1.33e-07 ***
## NeighborhoodIowa_DOT_and_Rail_Road -14.289 < 2e-16 ***
## NeighborhoodTimberland 12.146 < 2e-16 ***
## NeighborhoodNorthridge 11.334 < 2e-16 ***
## NeighborhoodStone_Brook 15.934 < 2e-16 ***
## NeighborhoodSouth_and_West_of_Iowa_State_University -6.502 9.67e-11 ***
## NeighborhoodClear_Creek 4.045 5.40e-05 ***
## NeighborhoodMeadow_Village -6.268 4.35e-10 ***
## NeighborhoodBriardale -3.110 0.00190 **
## NeighborhoodBloomington_Heights 5.236 1.79e-07 ***
## NeighborhoodVeenker 6.768 1.65e-11 ***
## NeighborhoodNorthpark_Villa 1.491 0.13611
## NeighborhoodBlueste 1.901 0.05747 .
## NeighborhoodGreens 5.809 7.14e-09 ***
## NeighborhoodGreen_Hills 4.526 6.31e-06 ***
## NeighborhoodLandmark 0.540 0.58925
## log10(Gr_Liv_Area) 43.070 < 2e-16 ***
## Bldg_TypeTwoFmCon 3.065 0.00220 **
## Bldg_TypeDuplex 4.185 2.96e-05 ***
## Bldg_TypeTwnhs 4.247 2.25e-05 ***
## Bldg_TypeTwnhsE 0.362 0.71774
## log10(Gr_Liv_Area):Bldg_TypeTwoFmCon -3.194 0.00142 **
## log10(Gr_Liv_Area):Bldg_TypeDuplex -4.612 4.21e-06 ***
## log10(Gr_Liv_Area):Bldg_TypeTwnhs -4.545 5.78e-06 ***
## log10(Gr_Liv_Area):Bldg_TypeTwnhsE -0.435 0.66370
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08067 on 2309 degrees of freedom
## Multiple R-squared: 0.7939, Adjusted R-squared: 0.7907
## F-statistic: 247.1 on 36 and 2309 DF, p-value: < 2.2e-16
In tidymodels we can simply extend the recipe with step_interact(); the tilde (~) means “as a function of”. So in this example the interaction is living area by building type. Because building type has already been dummy coded, we select its columns with the starts_with() helper.
simple_ames <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal()) %>%
# Gr_Liv_Area is on the log scale from a previous step
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_")) #don't use step_interact( ~ Gr_Liv_Area:Bldg_Type)
library(patchwork)
library(splines)
plot_smoother <- function(deg_free) {
ggplot(ames_train, aes(x = Latitude, y = Sale_Price)) +
geom_point(alpha = .2) +
scale_y_log10() +
geom_smooth(
method = lm,
formula = y ~ ns(x, df = deg_free),
col = "red",
se = FALSE
) +
ggtitle(paste(deg_free, "Spline Terms"))
}
( plot_smoother(2) + plot_smoother(5) ) / ( plot_smoother(20) + plot_smoother(100) )
Some panels clearly fit poorly; two terms under-fits the data while 100 terms over-fits. If you wonder which is worse, it’s both: https://stats.stackexchange.com/questions/249493/which-model-is-better-one-that-overfits-or-one-that-underfits.
The panels with five and 20 terms seem like reasonably smooth fits that catch the main patterns of the data; two terms misses real structure, while 100 terms chases noise. This indicates that the proper amount of “non-linearness” matters. How do you select the optimal number of spline terms/degrees of freedom? I’m so glad you asked, see here: https://cran.r-project.org/web/packages/santaR/vignettes/selecting-optimal-df.html.
In recipes, to add a natural spline we use step_ns(), selecting the feature and the degrees of freedom. I’m not sure how to fully automate spline selection at the time of writing; base R’s smooth.spline() can choose by cross-validation (cv = TRUE), and it seems you can use spline_degree(range = c(1L, 10L), trans = NULL) from the dials package to tune deg_free (more later; a sketch follows the recipe below).
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude,
data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, deg_free = 20)
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 5
##
## Operations:
##
## Log transformation on Gr_Liv_Area
## Collapsing factor levels for Neighborhood
## Dummy variables from all_nominal()
## Interactions with Gr_Liv_Area:starts_with("Bldg_Type_")
## Natural Splines on Latitude
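If we wanted to tune deg_free rather than fixing it at 20, a rough sketch of how that might look (this assumes the tune and dials packages, which are covered later; tunable_rec is a made-up name):
library(tune)
library(dials)

tunable_rec <-
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_other(Neighborhood, threshold = 0.01) %>%
  step_dummy(all_nominal()) %>%
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
  step_ns(Latitude, deg_free = tune())  # placeholder to be filled in during tuning

# the search range could then be supplied via dials::spline_degree(range = c(1L, 10L))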
### Feature Extraction

Finally (for now), I’ll briefly cover how to reduce the number of numeric predictors you use. Essentially, when numeric predictors are related (this can be non-linear) we can reduce them down to fewer variables. A good example of feature extraction is psychometrics: if you have ever done a personality or IQ test you will know there are a number of questions that get combined to measure latent variables such as spatial intelligence or introversion/extraversion. While that is generally done with factor analysis, PCA is often interpreted in the same way but is more focused on feature reduction than on identifying latent variables.
Before going further please note that Julia Silge has an awesome blog on UMAP and PCA https://juliasilge.com/blog/cocktail-recipes-umap/ which has a great video.
See the correlations in the Palmer penguins data set:
library(corrr)
library(palmerpenguins) # assumed source of the penguins data used below
penguins_corr <- penguins %>%
  dplyr::select(where(is.numeric)) %>%
  correlate() %>%
  rearrange()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## Registered S3 methods overwritten by 'registry':
## method from
## print.registry_field proxy
## print.registry_entry proxy
penguins_corr
## # A tibble: 5 x 6
## term flipper_length_~ body_mass_g bill_length_mm year bill_depth_mm
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 flipper_len~ NA 0.871 0.656 0.170 -0.584
## 2 body_mass_g 0.871 NA 0.595 0.0422 -0.472
## 3 bill_length~ 0.656 0.595 NA 0.0545 -0.235
## 4 year 0.170 0.0422 0.0545 NA -0.0604
## 5 bill_depth_~ -0.584 -0.472 -0.235 -0.0604 NA
Build and visualise PCA in tidymodels
library(recipes)
penguin_recipe <-
recipe(~., data = penguins) %>%
update_role(species, island, sex, new_role = "id") %>%
step_naomit(all_predictors()) %>%
step_normalize(all_predictors()) %>% # we'd want to use all numeric on ames
step_pca(all_predictors(), id = "pca") %>%
prep()
penguin_pca <-
penguin_recipe %>%
tidy(id = "pca")
penguin_pca
## # A tibble: 25 x 4
## terms value component id
## <chr> <dbl> <chr> <chr>
## 1 bill_length_mm 0.452 PC1 pca
## 2 bill_depth_mm -0.398 PC1 pca
## 3 flipper_length_mm 0.576 PC1 pca
## 4 body_mass_g 0.544 PC1 pca
## 5 year 0.0957 PC1 pca
## 6 bill_length_mm -0.0935 PC2 pca
## 7 bill_depth_mm 0.0164 PC2 pca
## 8 flipper_length_mm 0.0256 PC2 pca
## 9 body_mass_g -0.111 PC2 pca
## 10 year 0.989 PC2 pca
## # ... with 15 more rows
This output shows how much each measured variable contributes to each component; five components are created here, one per numeric predictor. Below we can see how much variance each component accounts for (the x-axis limit drops one component from the plot, hence the warning).
penguin_recipe %>%
tidy(id = "pca", type = "variance") %>%
dplyr::filter(terms == "percent variance") %>%
ggplot(aes(x = component, y = value)) +
geom_col(fill = "#b6dfe2") +
xlim(c(0, 5)) +
ylab("% of total variance")
## Warning: Removed 1 rows containing missing values (geom_col).
library(ggplot2)
penguin_pca %>%
mutate(terms = tidytext::reorder_within(terms,
abs(value),
component)) %>%
ggplot(aes(abs(value), terms, fill = value > 0)) +
geom_col() +
facet_wrap(~component, scales = "free_y") +
tidytext::scale_y_reordered() +
scale_fill_manual(values = c("#b6dfe2", "#0A537D")) +
labs(
x = "Absolute value of contribution",
y = NULL, fill = "Positive?"
)
How do we select the optimal number of PCs? There is no single robust method for determining how many components to use; as the number of observations, the number of variables, and the application vary, different levels of accuracy and variable reduction are desirable. The most common technique is to eyeball the scree plot and look for the “elbow point”, where the additional variance explained by each new PC drops off sharply.
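To see the numbers behind such a plot, the variance summary can be pulled straight from the prepped recipe; a small sketch, assuming "cumulative percent variance" is among the terms returned by the pca tidier:
penguin_recipe %>%
  tidy(id = "pca", type = "variance") %>%
  dplyr::filter(terms == "cumulative percent variance")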
library(recipes)
library(dplyr)
library(ggplot2)
split <- seq.int(1, 150, by = 9)
tr <- iris[-split, ]
te <- iris[ split, ]
Now let’s build a supervised UMAP recipe in tidymodels (step_umap() also comes from the embed package), using the iris split created above.
set.seed(11)
supervised <-
recipe(Species ~ ., data = tr) %>% # formula
step_center(all_predictors()) %>% # centering
step_scale(all_predictors()) %>% # scaling
step_umap(all_predictors(), outcome = vars(Species), num_comp = 2) %>% # as with hashing we select our components
prep(training = tr)
theme_set(theme_bw())
bake(supervised, new_data = te, Species, starts_with("umap")) %>%
ggplot(aes(x = umap_1, y = umap_2, col = Species)) +
geom_point(alpha = .5)
Hope that helps!
August