  • 1 What is Regression?
    • 1.1 Introduction
    • 1.2 Linear regression - the fundamental method
      • 1.2.1 Code a simple one-variable regression
      • 1.2.2 Examining a model
    • 1.3 Predicting once you fit a model
      • 1.3.1 Predicting from the unemployment model
      • 1.3.2 Multivariate linear regression
    • 1.4 Wrapping up linear regression
      • 1.4.1 Pros and Cons of Linear Regression
      • 1.4.2 Collinearity
  • 2 Training and Evaluating Regression Models
    • 2.1 Evaluating a model graphically
      • 2.1.1 Graphically evaluate the unemployment model
      • 2.1.2 The gain curve to evaluate the unemployment model
    • 2.2 Root Mean Squared Error (RMSE)
      • 2.2.1 Calculate RMSE
    • 2.3 R-Squared
      • 2.3.1 Calculate R-Squared
      • 2.3.2 Correlation and R-squared
    • 2.4 Properly Training a Model
      • 2.4.1 Generating a random test/train split
      • 2.4.2 Train a model using test/train split
      • 2.4.3 Evaluate a model using test/train split
      • 2.4.4 Create a cross validation plan
      • 2.4.5 Evaluate a modeling procedure using n-fold cross-validation
  • 3 Issues to Consider
    • 3.1 Categorical inputs
      • 3.1.1 Examining the structure of categorical inputs
      • 3.1.2 Modeling with categorical inputs
    • 3.2 Interactions
      • 3.2.1 Modeling an interaction
    • 3.3 Transforming the response before modeling
      • 3.3.1 Relative error
      • 3.3.2 Modeling log-transformed monetary output
      • 3.3.3 Comparing RMSE and root-mean-squared Relative Error
    • 3.4 Transforming inputs before modeling
      • 3.4.1 Input transforms: the “hockey stick”
  • 4 Dealing with Non-Linear Responses
    • 4.1 Logistic regression to predict probabilities
      • 4.1.1 Fit a model of sparrow survival probability
      • 4.1.2 Predict sparrow survival
    • 4.2 Poisson and quasipoisson regression to predict counts
      • 4.2.1 Poisson or quasipoisson
      • 4.2.2 Fit a model to predict bike rental counts
      • 4.2.3 Predict bike rentals on new data
      • 4.2.4 Visualize the Bike Rental Predictions
    • 4.3 GAM to learn non-linear transforms
      • 4.3.1 Writing formulas for GAM models
      • 4.3.2 Model soybean growth with GAM
      • 4.3.3 Predict with the soybean model on test data
  • 5 Tree-Based Methods
    • 5.1 The intuition behind tree-based methods
    • 5.2 Random forests
      • 5.2.1 Build a random forest model for bike rentals
      • 5.2.2 Predict bike rentals with the random forest model
      • 5.2.3 Visualize random forest bike model predictions
    • 5.3 One-Hot-Encoding Categorical Variables
      • 5.3.1 vtreat on a small example
      • 5.3.2 Novel levels
      • 5.3.3 vtreat the bike rental data
    • 5.4 Gradient boosting machines
      • 5.4.1 Find the right number of trees for a gradient boosting machine
      • 5.4.2 Fit an xgboost bike rental model and predict
      • 5.4.3 Evaluate the xgboost bike rental model
      • 5.4.4 Visualize the xgboost bike rental model

1 What is Regression?

1.1 Introduction

From a machine learning perspective, the term regression generally encompasses the prediction of continuous values. Statistically, these predictions are the expected value, or the average value one would observe for the given input values.
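To make "expected value" concrete, here is a minimal sketch on simulated data (not part of the course material): with enough data, a regression model's prediction at a given input approximates the true conditional mean of the outcome there.

# Minimal sketch (simulated data): the true conditional mean of y
# given x is 10 + 3*x, so the model's prediction at x = 5 should be
# close to 10 + 3*5 = 25
set.seed(42)
x <- runif(1000, 0, 10)
y <- 10 + 3 * x + rnorm(1000, sd = 2)  # noisy observations around the mean
sim_model <- lm(y ~ x)
predict(sim_model, newdata = data.frame(x = 5))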

1.2 Linear regression - the fundamental method

1.2.1 Code a simple one-variable regression

We’ll create a formula to define a one-variable modeling task, and then fit a linear model to the data. We are given the rates of male and female unemployment in the United States over several years (Source).

unemployment <- readRDS("unemployment.rds")

The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.

The sign of the variable coefficient tells you whether the outcome increases (+) or decreases (-) as the variable increases.

The calling interface for lm() is:

lm(formula, data = ___)
# unemployment is loaded in the workspace
summary(unemployment)
 male_unemployment female_unemployment
 Min.   :2.900     Min.   :4.000      
 1st Qu.:4.900     1st Qu.:4.400      
 Median :6.000     Median :5.200      
 Mean   :5.954     Mean   :5.569      
 3rd Qu.:6.700     3rd Qu.:6.100      
 Max.   :9.800     Max.   :7.900      
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment

# Print it
fmla
female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)

# Print it
unemployment_model

Call:
lm(formula = fmla, data = unemployment)

Coefficients:
      (Intercept)  male_unemployment  
           1.4341             0.6945  

The coefficient for male unemployment is positive, so female unemployment increases as male unemployment does. Linear regression is the most basic of regression approaches. You can think of this course as ways to address its limitations.

1.2.2 Examining a model

Let’s look at the model unemployment_model that we have just created. There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().

# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model

Call:
lm(formula = fmla, data = unemployment)

Coefficients:
      (Intercept)  male_unemployment  
           1.4341             0.6945  
# Call summary() on unemployment_model to get more details
summary(unemployment_model)

Call:
lm(formula = fmla, data = unemployment)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77621 -0.34050 -0.09004  0.27911  1.31254 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.43411    0.60340   2.377   0.0367 *  
male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5803 on 11 degrees of freedom
Multiple R-squared:  0.8213,    Adjusted R-squared:  0.8051 
F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
library(broom)
glance(unemployment_model)
  r.squared adj.r.squared     sigma statistic      p.value    df    logLik      AIC
1 0.8213157     0.8050716 0.5802596  50.56108 1.965985e-05     2 -10.28471 26.56943

# Call wrapFTest() on unemployment_model to see the most relevant details
library(sigr)
wrapFTest(unemployment_model)
[1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."

There are several different ways to get diagnostics for our model. Use the one that suits your needs or preferences the best.

1.3 Predicting once you fit a model

1.3.1 Predicting from the unemployment model

We will use our unemployment model unemployment_model to make predictions from the unemployment data, and compare predicted female unemployment rates to the actual observed female unemployment rates on the training data, unemployment. We will also use our model to predict on the new data in newrates, which consists of only one observation, where male unemployment is 5%.

The predict() interface for lm models takes the form

predict(model, newdata)

We will use the ggplot2 package to make the plots, so we will add the prediction column to the unemployment data frame. We will plot outcome versus prediction, and compare them to the line that represents perfect predictions (that is when the outcome is equal to the predicted value).

The ggplot2 command to plot a scatterplot of dframe$outcome versus dframe$pred (pred on the x axis, outcome on the y axis), along with a blue line where outcome == pred is as follows:

ggplot(dframe, aes(x = pred, y = outcome)) + 
       geom_point() +  
       geom_abline(color = "blue")
newrates <- data.frame(male_unemployment = 5)
newrates
  male_unemployment
1                 5
# unemployment is in your workspace
summary(unemployment)
 male_unemployment female_unemployment
 Min.   :2.900     Min.   :4.000      
 1st Qu.:4.900     1st Qu.:4.400      
 Median :6.000     Median :5.200      
 Mean   :5.954     Mean   :5.569      
 3rd Qu.:6.700     3rd Qu.:6.100      
 Max.   :9.800     Max.   :7.900      
# newrates is in your workspace
newrates
  male_unemployment
1                 5

# Predict female unemployment in the unemployment data set
unemployment$prediction <-  predict(unemployment_model, unemployment)

# load the ggplot2 package
library(ggplot2)

# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point() +
  geom_abline(color = "blue")


# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newrates)
# Print it
pred
       1 
4.906757 

While all the modeling algorithms in R implement the predict() method, the call may be a little different for each one.
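For instance (a sketch using R's built-in cars and mtcars datasets, not the course data), predict() on an lm object returns predictions on the outcome scale directly, while a logistic regression fit with glm() needs type = "response" to return probabilities:

# lm: predict() returns the expected outcome directly
lm_mod <- lm(dist ~ speed, data = cars)
predict(lm_mod, newdata = data.frame(speed = 15))

# glm (logistic regression): ask for type = "response" to get
# probabilities rather than log-odds
glm_mod <- glm(am ~ hp, data = mtcars, family = binomial)
predict(glm_mod, newdata = data.frame(hp = 150), type = "response")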

1.3.2 Multivariate linear regression

We will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.

bloodpressure <- readRDS("bloodpressure.rds")
# bloodpressure is in the workspace
summary(bloodpressure)
 blood_pressure       age            weight   
 Min.   :128.0   Min.   :46.00   Min.   :167  
 1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
 Median :153.0   Median :64.00   Median :194  
 Mean   :150.1   Mean   :62.45   Mean   :195  
 3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
 Max.   :168.0   Max.   :74.00   Max.   :220  
# Create the formula and print it
fmla <- blood_pressure ~ weight + age
fmla
blood_pressure ~ weight + age
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)

# Print bloodpressure_model and call summary() 
bloodpressure_model

Call:
lm(formula = fmla, data = bloodpressure)

Coefficients:
(Intercept)       weight          age  
    30.9941       0.3349       0.8614  
summary(bloodpressure_model)

Call:
lm(formula = fmla, data = bloodpressure)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4640 -1.1949 -0.4078  1.8511  2.6981 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  30.9941    11.9438   2.595  0.03186 * 
weight        0.3349     0.1307   2.563  0.03351 * 
age           0.8614     0.2482   3.470  0.00844 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.318 on 8 degrees of freedom
Multiple R-squared:  0.9768,    Adjusted R-squared:  0.9711 
F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07

One of the advantages of linear regression is that you can interpret the effect of each input variable on the outcome, at least to a certain extent. In this case the coefficients for both age and weight are positive, which indicates that blood pressure tends to increase as both age and weight increase. A quick check of this interpretation is sketched below.
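In a linear model, raising age by one year while holding weight fixed changes the prediction by exactly the age coefficient. The input values here (weight 190, age 60) are made up for illustration, though they lie within the data's range:

# Difference in predicted blood pressure for a one-year increase in
# age, holding weight fixed; equals the age coefficient (~0.86)
predict(bloodpressure_model, data.frame(weight = 190, age = 61)) -
  predict(bloodpressure_model, data.frame(weight = 190, age = 60))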

Now we will make predictions using the blood pressure model bloodpressure_model.

# Predict blood pressure using bloodpressure_model: prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)

# plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure)) + 
    geom_point() +
    geom_abline(color = "blue")

The results stay fairly close to the line of perfect prediction, indicating that the model fits the training data well. From a prediction perspective, multivariate linear regression behaves much as simple (one-variable) linear regression does.

1.4 Wrapping up linear regression

1.4.1 Pros and Cons of Linear Regression

  • Pros
    • Easy to fit and to apply
    • Concise
    • Less prone to overfitting
    • Interpretable
  • Cons
    • Can only express linear and additive relationships

1.4.2 Collinearity

  • Collinearity – when input variables are partially correlated.
  • Coefficients might change sign
  • High collinearity:
    • Coefficients (or standard errors) look too large
    • Model may be unstable (see the sketch below)
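A small simulated sketch (not from the course) of these symptoms: when one input is nearly a copy of another, the standard errors blow up and the individual coefficients become unreliable.

set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)  # x2 is nearly identical to x1
y  <- 3 * x1 + rnorm(n)

# The x1 and x2 coefficients are individually unstable, with huge
# standard errors, even though the overall fit is fine
summary(lm(y ~ x1 + x2))$coefficients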

2 Training and Evaluating Regression Models

2.1 Evaluating a model graphically

2.1.1 Graphically evaluate the unemployment model

We will graphically evaluate the unemployment model, unemployment_model, that we fit to the unemployment data.

We will plot the model’s predictions against the actual female_unemployment. Then we will calculate the residuals:

residuals <- actual outcome - predicted outcome

and plot predictions against residuals. The residual graph takes a slightly different form: we compare the residuals to the horizontal line y = 0 (using geom_hline()) rather than to the line y = x. The command will be provided.

# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, unemployment)

# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) + 
  geom_point() + 
  geom_abline()

# From previous step
unemployment$predictions <- predict(unemployment_model)

# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment - unemployment$predictions

# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")

We have now evaluated model predictions by comparing them to ground truth, and by examining prediction error.

2.1.2 The gain curve to evaluate the unemployment model

We made predictions about female_unemployment and visualized the predictions and the residuals. Now, we will also plot the gain curve of the unemployment_model’s predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.

For situations where order is more important than exact values, the gain curve helps you check if the model’s predictions sort in the same order as the true outcome.

Calls to the function GainCurvePlot() look like:

GainCurvePlot(frame, xvar, truthvar, title)

where

  • frame is a data frame
  • xvar and truthvar are strings naming the prediction and actual outcome columns of frame
  • title is the title of the plot

When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.

# Load the package WVPlots
library(WVPlots)

# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")

A relative Gini coefficient close to one shows that the model correctly sorts high-unemployment situations from lower ones.

2.2 Root Mean Squared Error (RMSE)

2.2.1 Calculate RMSE

We will calculate the RMSE of the unemployment model. Earlier we added two columns to the unemployment dataset:

  • the model’s predictions (predictions column)
  • the residuals between the predictions and the outcome (residuals column)

We can calculate the RMSE from a vector of residuals, res, as:

RMSE = sqrt(mean(res^2))

We want RMSE to be small. How small is “small”? One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.

# For convenience put the residuals in the variable res
res <- unemployment$residuals

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res**2)))
[1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
[1] 1.314271

An RMSE much smaller than the outcome’s standard deviation suggests a model that predicts well.

2.3 R-Squared

2.3.1 Calculate R-Squared

We will examine how well the model fits the data: that is, how much variance it explains. You can do this using R².

Suppose y is the true outcome, p is the prediction from the model, and res = y - p are the residuals of the predictions.

Then the total sum of squares tss (“total variance”) of the data is:

tss = sum((y - ybar)^2)

where ybar is the mean value of y.

The residual sum of squared errors of the model, rss, is:

rss = sum(res^2)

R² (R-Squared), the “variance explained” by the model, is then:

R² = 1 - rss/tss

After we calculate R², we will compare it with the R² reported by glance(). glance() returns a one-row data frame; for a linear regression model, one of the columns returned is the R² of the model on the training data.

# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
[1] 5.569231
# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))
[1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum((unemployment$residuals)^2))
[1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - rss/tss)
[1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
[1] 0.8213157

An R-squared close to one suggests a model that predicts well.

2.3.2 Correlation and R-squared

The linear correlation of two variables, x and y, measures the strength of the linear relationship between them. When x and y are respectively:

  • the predictions of a regression model that minimizes squared-error (like linear regression) and
  • the true outcomes of the training data,

then the square of the correlation is the same as R². We will verify that in this exercise.

# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))
[1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho^2)
[1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
[1] 0.8213157

Remember this equivalence is only true for the training data, and only for models that minimize squared error.
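A short simulated sketch of why the caveat matters (assumed setup, not course data): on held-out data, the squared correlation and 1 - rss/tss no longer coincide exactly.

set.seed(2)
train <- data.frame(x = runif(50))
train$y <- 2 * train$x + rnorm(50, sd = 0.2)
test <- data.frame(x = runif(50))
test$y <- 2 * test$x + rnorm(50, sd = 0.2)

m <- lm(y ~ x, data = train)
p <- predict(m, newdata = test)

# On the test data these two quantities are close but not identical
cor(p, test$y)^2
1 - sum((test$y - p)^2) / sum((test$y - mean(test$y))^2)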

2.4 Properly Training a Model

2.4.1 Generating a random test/train split

We will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.

We will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().

If you have a data set dframe of size N, and you want a random subset of approximately 100*X% of the rows (where X is between 0 and 1), then:

  1. Generate a vector of uniform random numbers: gp = runif(N).
  2. dframe[gp < X,] will be about the right size.
  3. dframe[gp >= X,] will be the complement.
# mpg is in the workspace
summary(mpg)
 manufacturer          model               displ            year     
 Length:234         Length:234         Min.   :1.600   Min.   :1999  
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
 Mode  :character   Mode  :character   Median :3.300   Median :2004  
                                       Mean   :3.472   Mean   :2004  
                                       3rd Qu.:4.600   3rd Qu.:2008  
                                       Max.   :7.000   Max.   :2008  
      cyl           trans               drv                 cty       
 Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
 1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
 Median :6.000   Mode  :character   Mode  :character   Median :17.00  
 Mean   :5.889                                         Mean   :16.86  
 3rd Qu.:8.000                                         3rd Qu.:19.00  
 Max.   :8.000                                         Max.   :35.00  
      hwy             fl               class              pred.cv      
 Min.   :12.00   Length:234         Length:234         Min.   : 8.942  
 1st Qu.:18.00   Class :character   Class :character   1st Qu.:13.160  
 Median :24.00   Mode  :character   Mode  :character   Median :17.262  
 Mean   :23.44                                         Mean   :16.859  
 3rd Qu.:27.00                                         3rd Qu.:19.308  
 Max.   :44.00                                         Max.   :30.975  
      pred       
 Min.   : 9.043  
 1st Qu.:13.142  
 Median :17.241  
 Mean   :16.859  
 3rd Qu.:19.291  
 Max.   :30.906  
dim(mpg)
[1] 234  13
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
[1] 234
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(N * 0.75))
[1] 176
# Create the vector of N uniform random variables: gp
gp <- runif(N)

# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp < 0.75,]
mpg_test <- mpg[gp >= 0.75,]

# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
[1] 174
nrow(mpg_test)
[1] 60

A random split won’t always produce sets of exactly X% and (100-X)% of the data, but it should be close.

2.4.2 Train a model using test/train split

We will use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).

# mpg_train is in the workspace
summary(mpg_train)
 manufacturer          model               displ            year     
 Length:174         Length:174         Min.   :1.600   Min.   :1999  
 Class :character   Class :character   1st Qu.:2.425   1st Qu.:1999  
 Mode  :character   Mode  :character   Median :3.300   Median :1999  
                                       Mean   :3.472   Mean   :2003  
                                       3rd Qu.:4.600   3rd Qu.:2008  
                                       Max.   :7.000   Max.   :2008  
      cyl           trans               drv                 cty       
 Min.   :4.000   Length:174         Length:174         Min.   : 9.00  
 1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
 Median :6.000   Mode  :character   Mode  :character   Median :17.00  
 Mean   :5.879                                         Mean   :16.86  
 3rd Qu.:8.000                                         3rd Qu.:19.00  
 Max.   :8.000                                         Max.   :35.00  
      hwy             fl               class              pred.cv      
 Min.   :12.00   Length:174         Length:174         Min.   : 9.062  
 1st Qu.:18.00   Class :character   Class :character   1st Qu.:13.190  
 Median :25.00   Mode  :character   Mode  :character   Median :17.893  
 Mean   :23.53                                         Mean   :16.922  
 3rd Qu.:27.00                                         3rd Qu.:19.308  
 Max.   :44.00                                         Max.   :30.975  
      pred       
 Min.   : 9.043  
 1st Qu.:13.142  
 Median :17.925  
 Mean   :16.919  
 3rd Qu.:19.291  
 Max.   :30.906  
# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)
cty ~ hwy
# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla, mpg_train)

# Use summary() to examine the model
summary(mpg_model)

Call:
lm(formula = fmla, data = mpg_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8629 -0.8119 -0.0603  0.7256  4.2120 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.85627    0.40174   2.131   0.0345 *  
hwy          0.68027    0.01658  41.041   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.272 on 172 degrees of freedom
Multiple R-squared:  0.9073,    Adjusted R-squared:  0.9068 
F-statistic:  1684 on 1 and 172 DF,  p-value: < 2.2e-16

2.4.3 Evaluate a model using test/train split

Now we will test the model mpg_model on the test data, mpg_test. The functions rmse() and r_squared(), which calculate RMSE and R-squared, have been provided for convenience:

rmse(predcol, ycol)
r_squared(predcol, ycol)
# rmse function
rmse <- function (predcol, ycol) 
{
    res = predcol - ycol
    sqrt(mean(res^2))
}

# r_squared function
r_squared <- function (predcol, ycol) 
{
    # R-squared: 1 - (residual sum of squares / total sum of squares)
    res = predcol - ycol
    1 - sum(res^2)/sum((ycol - mean(ycol))^2)
}

where:

  • predcol: The predicted values
  • ycol: The actual outcome

We will also plot the predictions vs. the outcome.

Generally, model performance is better on the training data than the test data (though sometimes the test set “gets lucky”). A slight difference in performance is okay; if the performance on training is significantly better, there is a problem.
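To see what such a problem looks like, here is a deliberately overfit model on simulated data (a sketch, not the course exercise); note the large gap between training and test RMSE.

set.seed(3)
df <- data.frame(x = runif(120))
df$y <- sin(4 * df$x) + rnorm(120, sd = 0.3)
in_train <- runif(120) < 0.75

# A 20-degree polynomial badly overfits this data
m_over <- lm(y ~ poly(x, 20), data = df[in_train, ])

rmse(predict(m_over), df$y[in_train])                              # training RMSE: small
rmse(predict(m_over, newdata = df[!in_train, ]), df$y[!in_train])  # test RMSE: much larger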

# Examine the objects in the workspace
ls.str()
alcohol : 'data.frame': 32 obs. of  7 variables:
 $ Subject         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Metabol         : num  0.6 0.6 1.5 0.4 0.1 0.2 0.3 0.3 0.4 1 ...
 $ Gastric         : num  1 1.6 1.5 2.2 1.1 1.2 0.9 0.8 1.5 0.9 ...
 $ Sex             : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
 $ Alcohol         : Factor w/ 2 levels "Alcoholic","Non-alcoholic": 1 1 1 2 2 2 2 2 2 2 ...
 $ pred_add        : num  0.105 1.358 0.703 2.61 0.412 ...
 $ pred_interaction: num  0.475 1.259 0.674 2.044 0.721 ...
bike_model : List of 30
 $ coefficients     : Named num [1:32] 5.935 -0.58 -0.892 -1.662 -2.35 ...
 $ residuals        : Named num [1:744] 0.724 0.973 1.52 1.085 -0.502 ...
 $ fitted.values    : Named num [1:744] 86.43 47.13 35.71 15.82 8.03 ...
 $ effects          : Named num [1:744] -2648.1 77 77.2 70.8 61.7 ...
 $ R                : num [1:32, 1:32] -451 0 0 0 0 ...
 $ rank             : int 32
 $ qr               :List of 5
 $ family           :List of 11
 $ linear.predictors: Named num [1:744] 4.46 3.85 3.58 2.76 2.08 ...
 $ deviance         : num 28775
 $ aic              : num NA
 $ null.deviance    : num 133365
 $ iter             : int 5
 $ weights          : Named num [1:744] 86.43 47.13 35.71 15.82 8.03 ...
 $ prior.weights    : Named num [1:744] 1 1 1 1 1 1 1 1 1 1 ...
 $ df.residual      : int 712
 $ df.null          : int 743
 $ y                : Named int [1:744] 149 93 90 33 4 10 27 50 142 219 ...
 $ converged        : logi TRUE
 $ boundary         : logi FALSE
 $ model            :'data.frame':  744 obs. of  9 variables:
 $ call             : language glm(formula = fmla, family = quasipoisson, data = bikesJuly)
 $ formula          : chr "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
 $ terms            :Classes 'terms', 'formula'  language cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum +      windspeed
 $ data             :'data.frame':  744 obs. of  12 variables:
 $ offset           : NULL
 $ control          :List of 3
 $ method           : chr "glm.fit"
 $ contrasts        :List of 4
 $ xlevels          :List of 2
bike_model_rf : List of 14
 $ predictions              : num [1:744] 122.9 78.1 95.1 70.2 44.9 ...
 $ num.trees                : num 500
 $ num.independent.variables: num 8
 $ mtry                     : num 2
 $ min.node.size            : num 5
 $ prediction.error         : num 8231
 $ forest                   :List of 8
 $ splitrule                : chr "variance"
 $ treetype                 : chr "Regression"
 $ r.squared                : num 0.821
 $ call                     : language ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed)
 $ importance.mode          : chr "none"
 $ num.samples              : int 744
 $ replace                  : logi TRUE
bikesAugust : 'data.frame': 744 obs. of  13 variables:
 $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
 $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
 $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
 $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
 $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
 $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
 $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
 $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
 $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pred      : num  77.4 36.2 38.3 28.2 42.2 ...
bikesJuly : 'data.frame':   744 obs. of  12 variables:
 $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
 $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
 $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
 $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
 $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
 $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
 $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
 $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
 $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
bloodpressure : 'data.frame':   11 obs. of  4 variables:
 $ blood_pressure: int  132 143 153 162 154 168 137 149 159 128 ...
 $ age           : int  52 59 67 73 64 74 54 61 65 46 ...
 $ weight        : int  173 184 194 211 196 220 188 188 207 167 ...
 $ prediction    : num  134 143 154 165 152 ...
bloodpressure_model : List of 12
 $ coefficients : Named num [1:3] 30.994 0.335 0.861
 $ residuals    : Named num [1:11] -1.718 -0.432 -0.672 -2.533 2.243 ...
 $ effects      : Named num [1:11] -497.8 41.82 -8.04 -2.09 2.66 ...
 $ rank         : int 3
 $ fitted.values: Named num [1:11] 134 143 154 165 152 ...
 $ assign       : int [1:3] 0 1 2
 $ qr           :List of 5
 $ df.residual  : int 8
 $ xlevels      : Named list()
 $ call         : language lm(formula = fmla, data = bloodpressure)
 $ terms        :Classes 'terms', 'formula'  language blood_pressure ~ weight + age
 $ model        :'data.frame':  11 obs. of  3 variables:
fdata : 'data.frame':   100 obs. of  3 variables:
 $ y    : num  9.15 1.9 -3.86 2.39 1.54 ...
 $ pred : num  6.43 3.47 1.59 3.76 9.51 ...
 $ label: Factor w/ 2 levels "small purchases",..: 1 1 1 1 1 1 1 1 1 1 ...
fdata2 : tibble [100 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
fe_mean :  num 5.57
flower_model : List of 13
 $ coefficients : Named num [1:3] 83.4642 -0.0405 -12.1583
 $ residuals    : Named num [1:24] -2.94 12.16 -3.86 -4.96 -3.49 ...
 $ effects      : Named num [1:24] -275.02 -50.79 29.78 -5.16 -2.41 ...
 $ rank         : int 3
 $ fitted.values: Named num [1:24] 65.2 65.2 59.2 59.2 53.1 ...
 $ assign       : int [1:3] 0 1 2
 $ qr           :List of 5
 $ df.residual  : int 21
 $ contrasts    :List of 1
 $ xlevels      :List of 1
 $ call         : language lm(formula = fmla, data = flowers)
 $ terms        :Classes 'terms', 'formula'  language Flowers ~ Intensity + Time
 $ model        :'data.frame':  24 obs. of  3 variables:
flowers : 'data.frame': 24 obs. of  4 variables:
 $ Flowers    : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
 $ Time       : chr  "Late" "Late" "Late" "Late" ...
 $ Intensity  : int  150 150 300 300 450 450 600 600 750 750 ...
 $ predictions: num  65.2 65.2 59.2 59.2 53.1 ...
fmla : Class 'formula'  language cty ~ hwy
fmla_add : Class 'formula'  language Metabol ~ Gastric + Sex
fmla_interaction : Class 'formula'  language Metabol ~ Gastric + Gastric:Sex
fmla_main_and_interaction : Class 'formula'  language Metabol ~ Gastric * Sex
fmla_sqr : Class 'formula'  language price ~ I(size^2)
fmla.abs : Class 'formula'  language Income2005 ~ Arith + Word + Parag + Math + AFQT
fmla.gam : Class 'formula'  language weight ~ s(Time)
fmla.lin : Class 'formula'  language weight ~ Time
fmla.log : Class 'formula'  language log(Income2005) ~ Arith + Word + Parag + Math + AFQT
gp :  num [1:234] 0.931 0.424 0.39 0.13 0.647 ...
houseprice : 'data.frame':  40 obs. of  4 variables:
 $ size    : int  72 98 92 90 44 46 90 150 94 90 ...
 $ price   : int  156 153 230 152 42 157 113 573 202 261 ...
 $ pred_lin: num  159.7 259.4 244.8 227.1 41.1 ...
 $ pred_sqr: num  150.6 243.3 222.8 211.4 80.5 ...
houseprice_long : 'data.frame': 80 obs. of  5 variables:
 $ size     : int  72 98 92 90 44 46 90 150 94 90 ...
 $ price    : int  156 153 230 152 42 157 113 573 202 261 ...
 $ modeltype: chr  "pred_lin" "pred_lin" "pred_lin" "pred_lin" ...
 $ pred     : num  159.7 259.4 244.8 227.1 41.1 ...
 $ residuals: num  3.705 106.419 14.825 75.08 -0.866 ...
i :  int 3
income_long : 'data.frame': 1030 obs. of  13 variables:
 $ Subject    : int  7 13 47 62 73 78 89 105 106 107 ...
 $ Arith      : int  14 30 26 12 18 25 24 28 30 19 ...
 $ Word       : int  27 29 33 25 34 35 34 32 35 31 ...
 $ Parag      : int  8 13 13 10 13 14 15 14 12 11 ...
 $ Math       : int  11 24 16 10 18 21 17 20 23 16 ...
 $ AFQT       : num  47.4 72.3 75.5 36.4 81.5 ...
 $ Income2005 : int  19000 8000 66309 30000 186135 14657 62000 40000 105000 43000 ...
 $ logpred    : num  10.3 10.9 10.7 10.1 10.5 ...
 $ pred.income: num  28644 56646 44468 25462 36356 ...
 $ modeltype  : chr  "pred.absmodel" "pred.absmodel" "pred.absmodel" "pred.absmodel" ...
 $ pred       : num  42844 75499 63519 35008 53495 ...
 $ residual   : num  23844 67499 -2790 5008 -132640 ...
 $ relerr     : num  1.255 8.4374 -0.0421 0.1669 -0.7126 ...
income_test : 'data.frame': 515 obs. of  11 variables:
 $ Subject      : int  7 13 47 62 73 78 89 105 106 107 ...
 $ Arith        : int  14 30 26 12 18 25 24 28 30 19 ...
 $ Word         : int  27 29 33 25 34 35 34 32 35 31 ...
 $ Parag        : int  8 13 13 10 13 14 15 14 12 11 ...
 $ Math         : int  11 24 16 10 18 21 17 20 23 16 ...
 $ AFQT         : num  47.4 72.3 75.5 36.4 81.5 ...
 $ Income2005   : int  19000 8000 66309 30000 186135 14657 62000 40000 105000 43000 ...
 $ logpred      : num  10.3 10.9 10.7 10.1 10.5 ...
 $ pred.income  : num  28644 56646 44468 25462 36356 ...
 $ pred.absmodel: Named num  42844 75499 63519 35008 53495 ...
 $ pred.logmodel: Named num  28644 56646 44468 25462 36356 ...
income_train : 'data.frame':    2069 obs. of  7 variables:
 $ Subject   : int  2 6 8 9 16 17 18 20 21 22 ...
 $ Arith     : int  8 30 13 21 17 29 30 17 29 27 ...
 $ Word      : int  15 35 35 28 30 33 35 28 33 31 ...
 $ Parag     : int  6 15 12 10 12 13 14 14 13 14 ...
 $ Math      : int  6 23 4 13 17 21 23 20 25 22 ...
 $ AFQT      : num  6.84 99.39 44.02 59.68 50.28 ...
 $ Income2005: int  5500 65000 36000 65000 71000 43000 120000 64000 253043 45300 ...
k :  num 3
mean_bikes :  num 274
mmat :  num [1:24, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
model : List of 12
 $ coefficients : Named num [1:2] 0.865 0.683
 $ residuals    : Named num [1:156] -0.35705 -2.62476 -1.30783 0.00909 -0.30783 ...
 $ effects      : Named num [1:156] -214.412 -49.087 -1.187 0.166 -0.187 ...
 $ rank         : int 2
 $ fitted.values: Named num [1:156] 21.4 18.6 19.3 20 19.3 ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
 $ df.residual  : int 154
 $ xlevels      : Named list()
 $ call         : language lm(formula = cty ~ hwy, data = mpg[split$train, ])
 $ terms        :Classes 'terms', 'formula'  language cty ~ hwy
 $ model        :'data.frame':  156 obs. of  2 variables:
model_add : List of 13
 $ coefficients : Named num [1:3] -1.4 1.65 1.81
 $ residuals    : Named num [1:21] 0.352 -0.635 0.43 -1.822 -0.377 ...
 $ effects      : Named num [1:21] -10.213 9.007 3.55 -2.059 -0.399 ...
 $ rank         : int 3
 $ fitted.values: Named num [1:21] 0.248 1.235 1.07 2.222 0.577 ...
 $ assign       : int [1:3] 0 1 2
 $ qr           :List of 5
 $ df.residual  : int 18
 $ contrasts    :List of 1
 $ xlevels      :List of 1
 $ call         : language lm(formula = fmla_add, data = alcohol[split$train, ])
 $ terms        :Classes 'terms', 'formula'  language Metabol ~ Gastric + Sex
 $ model        :'data.frame':  21 obs. of  3 variables:
model_interaction : List of 13
 $ coefficients : Named num [1:3] -0.0572 0.7072 1.1427
 $ residuals    : Named num [1:21] -0.05 -0.474 0.496 -1.099 -0.591 ...
 $ effects      : Named num [1:21] -10.213 9.007 3.99 -1.341 -0.552 ...
 $ rank         : int 3
 $ fitted.values: Named num [1:21] 0.65 1.074 1.004 1.499 0.791 ...
 $ assign       : int [1:3] 0 1 2
 $ qr           :List of 5
 $ df.residual  : int 18
 $ contrasts    :List of 1
 $ xlevels      :List of 1
 $ call         : language lm(formula = fmla_interaction, data = alcohol[split$train, ])
 $ terms        :Classes 'terms', 'formula'  language Metabol ~ Gastric + Gastric:Sex
 $ model        :'data.frame':  21 obs. of  3 variables:
model_lin : List of 12
 $ coefficients : Named num [1:2] -136.73 4.04
 $ residuals    : Named num [1:27] 1.68 -5.16 107.78 -114.08 -41.25 ...
 $ effects      : Named num [1:27] -1287.3 -581.3 105.8 -114.4 -41.4 ...
 $ rank         : int 2
 $ fitted.values: Named num [1:27] 154.3 235.2 49.2 227.1 243.2 ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
 $ df.residual  : int 25
 $ xlevels      : Named list()
 $ call         : language lm(formula = price ~ size, data = houseprice[split$train, ])
 $ terms        :Classes 'terms', 'formula'  language price ~ size
 $ model        :'data.frame':  27 obs. of  2 variables:
model_main_and_interaction : List of 13
 $ coefficients : Named num [1:4] -0.197 0.837 -0.988 1.507
 $ residuals    : Named num [1:32] -0.0397 -0.5418 0.4418 -1.244 -0.6234 ...
 $ effects      : Named num [1:32] -13.7 12.246 4.211 3.254 -0.734 ...
 $ rank         : int 4
 $ fitted.values: Named num [1:32] 0.64 1.142 1.058 1.644 0.723 ...
 $ assign       : int [1:4] 0 1 2 3
 $ qr           :List of 5
 $ df.residual  : int 28
 $ contrasts    :List of 1
 $ xlevels      :List of 1
 $ call         : language lm(formula = fmla_main_and_interaction, data = alcohol)
 $ terms        :Classes 'terms', 'formula'  language Metabol ~ Gastric * Sex
 $ model        :'data.frame':  32 obs. of  3 variables:
model_sqr : List of 12
 $ coefficients : Named num [1:2] 39.416 0.0212
 $ residuals    : Named num [1:27] 6.52 10.88 72.66 -98.39 -25.02 ...
 $ effects      : Named num [1:27] -1287.3 610.5 69.3 -99.8 -26.1 ...
 $ rank         : int 2
 $ fitted.values: Named num [1:27] 149.5 219.1 84.3 211.4 227 ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
 $ df.residual  : int 25
 $ xlevels      : Named list()
 $ call         : language lm(formula = fmla_sqr, data = houseprice[split$train, ])
 $ terms        :Classes 'terms', 'formula'  language price ~ I(size^2)
 $ model        :'data.frame':  27 obs. of  2 variables:
model.abs : List of 12
 $ coefficients : Named num [1:6] 17517 1552 -132 -1155 725 ...
 $ residuals    : Named num [1:2069] -21089 -11484 6068 10099 23653 ...
 $ effects      : Named num [1:2069] -2269500 -729259 4672 -47914 -160517 ...
 $ rank         : int 6
 $ fitted.values: Named num [1:2069] 26589 76484 29932 54901 47347 ...
 $ assign       : int [1:6] 0 1 2 3 4 5
 $ qr           :List of 5
 $ df.residual  : int 2063
 $ xlevels      : Named list()
 $ call         : language lm(formula = fmla.abs, data = income_train)
 $ terms        :Classes 'terms', 'formula'  language Income2005 ~ Arith + Word + Parag + Math + AFQT
 $ model        :'data.frame':  2069 obs. of  6 variables:
model.gam : List of 46
 $ coefficients     : Named num [1:10] 6.16 -10.95 18.33 -3.09 11.91 ...
 $ residuals        : num [1:330] -0.024 0.047 -0.305 0.424 0.244 ...
 $ fitted.values    : num [1:330] 0.13 0.214 0.971 1.686 5.986 ...
 $ family           :List of 11
 $ linear.predictors: num [1:330] 0.13 0.214 0.971 1.686 5.986 ...
 $ deviance         : num 1382
 $ null.deviance    : num 14416
 $ iter             : int 1
 $ weights          : num [1:330] 1 1 1 1 1 1 1 1 1 1 ...
 $ prior.weights    : num [1:330] 1 1 1 1 1 1 1 1 1 1 ...
 $ df.null          : int 329
 $ y                : num [1:330] 0.106 0.261 0.666 2.11 6.23 ...
 $ converged        : logi TRUE
 $ sig2             : num 4.31
 $ edf              : Named num [1:10] 1 1 1.03 0.98 0.932 ...
 $ edf1             : Named num [1:10] 1 1 1.006 1 0.998 ...
 $ hat              : num [1:330] 0.0311 0.0275 0.0235 0.0244 0.0259 ...
 $ R                : num [1:10, 1:10] -18.2 0 0 0 0 ...
 $ boundary         : logi FALSE
 $ sp               : Named num 0.00597
 $ nsdf             : int 1
 $ Ve               : num [1:10, 1:10] 1.31e-02 7.10e-16 -5.34e-15 1.32e-16 -4.24e-15 ...
 $ Vp               : num [1:10, 1:10] 0.0131 0 0 0 0 ...
 $ rV               : num [1:10, 1:10] 0 -0.011195 -0.001636 -0.011576 0.000107 ...
 $ mgcv.conv        :List of 7
 $ gcv.ubre         : Named num 4.44
 $ aic              : num 1430
 $ rank             : int 10
 $ gcv.ubre.dev     : num 4.44
 $ scale.estimated  : logi TRUE
 $ method           : chr "GCV"
 $ smooth           :List of 1
 $ formula          :Class 'formula'  language weight ~ s(Time)
 $ var.summary      :List of 1
 $ cmX              : Named num [1:10] 1.00 1.46e-17 -1.49e-16 3.69e-18 -8.68e-17 ...
 $ model            :'data.frame':  330 obs. of  2 variables:
 $ control          :List of 19
 $ terms            :Classes 'terms', 'formula'  language weight ~ 1 + Time
 $ pred.formula     :Class 'formula'  language ~Time
 $ pterms           :Classes 'terms', 'formula'  language weight ~ 1
 $ assign           : int 0
 $ offset           : num [1:330] 0 0 0 0 0 0 0 0 0 0 ...
 $ df.residual      : num 321
 $ min.edf          : num 2
 $ optimizer        : chr "magic"
 $ call             : language gam(formula = fmla.gam, family = gaussian, data = soybean_train)
model.lin : List of 12
 $ coefficients : Named num [1:2] -6.559 0.292
 $ residuals    : Named num [1:330] 2.576 0.686 -0.953 -1.554 -1.523 ...
 $ effects      : Named num [1:330] -111.98 109.01 -1.11 -1.7 -1.65 ...
 $ rank         : int 2
 $ fitted.values: Named num [1:330] -2.47 -0.425 1.619 3.664 7.753 ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
 $ df.residual  : int 328
 $ xlevels      : Named list()
 $ call         : language lm(formula = fmla.lin, data = soybean_train)
 $ terms        :Classes 'terms', 'formula'  language weight ~ Time
 $ model        :'data.frame':  330 obs. of  2 variables:
model.log : List of 12
 $ coefficients : Named num [1:6] 9.675486 0.030436 -0.000552 -0.010679 0.015518 ...
 $ residuals    : Named num [1:2069] -1.34 0.126 0.421 0.574 0.762 ...
 $ effects      : Named num [1:2069] -474.504 -14.454 -0.552 -0.229 -2.946 ...
 $ rank         : int 6
 $ fitted.values: Named num [1:2069] 9.95 10.96 10.07 10.51 10.41 ...
 $ assign       : int [1:6] 0 1 2 3 4 5
 $ qr           :List of 5
 $ df.residual  : int 2063
 $ xlevels      : Named list()
 $ call         : language lm(formula = fmla.log, data = income_train)
 $ terms        :Classes 'terms', 'formula'  language log(Income2005) ~ Arith + Word + Parag + Math + AFQT
 $ model        :'data.frame':  2069 obs. of  6 variables:
mpg : tibble [234 × 13] (S3: tbl_df/tbl/data.frame)
mpg_model : List of 12
 $ coefficients : Named num [1:2] 0.856 0.68
 $ residuals    : Named num [1:174] 0.416 -1.945 -0.264 -2.543 -0.543 ...
 $ effects      : Named num [1:174] -222.426 -52.214 -0.15 -2.523 -0.523 ...
 $ rank         : int 2
 $ fitted.values: Named num [1:174] 20.6 21.9 21.3 18.5 18.5 ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
 $ df.residual  : int 172
 $ xlevels      : Named list()
 $ call         : language lm(formula = fmla, data = mpg_train)
 $ terms        :Classes 'terms', 'formula'  language cty ~ hwy
 $ model        :'data.frame':  174 obs. of  2 variables:
mpg_test : tibble [60 × 13] (S3: tbl_df/tbl/data.frame)
mpg_train : tibble [174 × 13] (S3: tbl_df/tbl/data.frame)
N :  int 234
newrates : 'data.frame':    1 obs. of  1 variable:
 $ male_unemployment: num 5
nRows :  int 234
outcome :  chr "cnt"
perf : tibble [1 × 7] (S3: tbl_df/tbl/data.frame)
pred :  Named num 4.91
pseudoR2 :  num 0.784
r_squared : function (predcol, ycol)  
res :  num [1:13] 0.552 1.313 0.163 0.279 -0.34 ...
rho :  num 0.906
rho2 :  num 0.821
rmse : function (predcol, ycol)  
rmse_test :  num 1.21
rmse_train :  num 1.26
rsq :  num 0.821
rsq_glance :  num 0.821
rsq_test :  num 0.919
rsq_train :  num 0.907
rss :  num 3.7
sd_unemployment :  num 1.31
seed :  num 423563
soybean_long : 'data.frame':    164 obs. of  7 variables:
 $ Plot     : Ord.factor w/ 41 levels "1988F4"<"1988F2"<..: 3 3 3 2 2 2 6 6 1 1 ...
 $ Variety  : Factor w/ 2 levels "F","P": 1 1 1 1 1 1 1 1 1 1 ...
 $ Year     : Factor w/ 3 levels "1988","1989",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Time     : num  42 56 70 14 42 77 49 63 14 35 ...
 $ weight   : num  3.56 8.71 16.342 0.104 2.93 ...
 $ modeltype: chr  "pred.lin" "pred.lin" "pred.lin" "pred.lin" ...
 $ pred     : num  5.71 9.8 13.89 -2.47 5.71 ...
soybean_test : Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame':   82 obs. of  7 variables:
 $ Plot    : Ord.factor w/ 41 levels "1988F4"<"1988F2"<..: 3 3 3 2 2 2 6 6 1 1 ...
 $ Variety : Factor w/ 2 levels "F","P": 1 1 1 1 1 1 1 1 1 1 ...
 $ Year    : Factor w/ 3 levels "1988","1989",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Time    : num  42 56 70 14 42 77 49 63 14 35 ...
 $ weight  : num  3.56 8.71 16.342 0.104 2.93 ...
 $ pred.lin: num  5.71 9.8 13.89 -2.47 5.71 ...
 $ pred.gam: num  3.94 9.96 16.55 0.13 3.94 ...
soybean_train : Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame':  330 obs. of  5 variables:
 $ Plot   : Ord.factor w/ 48 levels "1988F4"<"1988F2"<..: 3 3 3 3 3 3 3 2 2 2 ...
 $ Variety: Factor w/ 2 levels "F","P": 1 1 1 1 1 1 1 1 1 1 ...
 $ Year   : Factor w/ 3 levels "1988","1989",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Time   : num  14 21 28 35 49 63 77 21 28 35 ...
 $ weight : num  0.106 0.261 0.666 2.11 6.23 ...
sparrow : 'data.frame': 87 obs. of  13 variables:
 $ status      : Factor w/ 2 levels "Perished","Survived": 2 2 2 2 2 2 2 2 2 2 ...
 $ age         : chr  "adult" "adult" "adult" "adult" ...
 $ total_length: int  154 160 155 154 156 161 157 159 158 158 ...
 $ wingspan    : int  241 252 243 245 247 253 251 247 247 252 ...
 $ weight      : num  24.5 26.9 26.9 24.3 24.1 26.5 24.6 24.2 23.6 26.2 ...
 $ beak_head   : num  31.2 30.8 30.6 31.7 31.5 31.8 31.1 31.4 29.8 32 ...
 $ humerus     : num  0.69 0.74 0.73 0.74 0.71 0.78 0.74 0.73 0.7 0.75 ...
 $ femur       : num  0.67 0.71 0.7 0.69 0.71 0.74 0.74 0.72 0.67 0.74 ...
 $ legbone     : num  1.02 1.18 1.15 1.15 1.13 1.14 1.15 1.13 1.08 1.15 ...
 $ skull       : num  0.59 0.6 0.6 0.58 0.57 0.61 0.61 0.61 0.6 0.61 ...
 $ sternum     : num  0.83 0.84 0.85 0.84 0.82 0.89 0.86 0.79 0.82 0.86 ...
 $ survived    : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ pred        : num  0.789 0.614 0.919 0.995 0.878 ...
sparrow_model : List of 30
 $ coefficients     : Named num [1:4] 46.881 -0.543 -0.569 75.461
 $ residuals        : Named num [1:87] 1.27 1.63 1.09 1.01 1.14 ...
 $ fitted.values    : Named num [1:87] 0.789 0.614 0.919 0.995 0.878 ...
 $ effects          : Named num [1:87] -1.065 2.49 0.351 -3.939 0.437 ...
 $ R                : num [1:4, 1:4] -3.48 0 0 0 -559.05 ...
 $ rank             : int 4
 $ qr               :List of 5
 $ family           :List of 12
 $ linear.predictors: Named num [1:87] 1.319 0.466 2.429 5.206 1.969 ...
 $ deviance         : num 75.1
 $ aic              : num 83.1
 $ null.deviance    : num 118
 $ iter             : int 5
 $ weights          : Named num [1:87] 0.16646 0.2369 0.07444 0.00542 0.10749 ...
 $ prior.weights    : Named num [1:87] 1 1 1 1 1 1 1 1 1 1 ...
 $ df.residual      : int 83
 $ df.null          : int 86
 $ y                : Named num [1:87] 1 1 1 1 1 1 1 1 1 1 ...
 $ converged        : logi TRUE
 $ boundary         : logi FALSE
 $ model            :'data.frame':  87 obs. of  4 variables:
 $ call             : language glm(formula = fmla, family = binomial, data = sparrow)
 $ formula          :Class 'formula'  language survived ~ total_length + weight + humerus
 $ terms            :Classes 'terms', 'formula'  language survived ~ total_length + weight + humerus
 $ data             :'data.frame':  87 obs. of  12 variables:
 $ offset           : NULL
 $ control          :List of 3
 $ method           : chr "glm.fit"
 $ contrasts        : NULL
 $ xlevels          : Named list()
split : List of 2
 $ train: int [1:27] 1 3 6 7 9 10 11 12 13 14 ...
 $ app  : int [1:13] 5 36 28 16 2 4 17 31 20 34 ...
splitPlan : List of 3
 $ :List of 2
 $ :List of 2
 $ :List of 2
target :  num 176
tss :  num 20.7
unemployment : 'data.frame':    13 obs. of  5 variables:
 $ male_unemployment  : num  2.9 6.7 4.9 7.9 9.8 ...
 $ female_unemployment: num  4 7.4 5 7.2 7.9 ...
 $ prediction         : num  3.45 6.09 4.84 6.92 8.24 ...
 $ predictions        : num  3.45 6.09 4.84 6.92 8.24 ...
 $ residuals          : num  0.552 1.313 0.163 0.279 -0.34 ...
unemployment_model : List of 12
 $ coefficients : Named num [1:2] 1.434 0.695
 $ residuals    : Named num [1:13] 0.552 1.313 0.163 0.279 -0.34 ...
 $ effects      : Named num [1:13] -20.08 -4.126 0.106 -0.264 -1.192 ...
 $ rank         : int 2
 $ fitted.values: Named num [1:13] 3.45 6.09 4.84 6.92 8.24 ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
 $ df.residual  : int 11
 $ xlevels      : Named list()
 $ call         : language lm(formula = fmla, data = unemployment)
 $ terms        :Classes 'terms', 'formula'  language female_unemployment ~ male_unemployment
 $ model        :'data.frame':  13 obs. of  2 variables:
var_bikes :  num 45864
vars :  chr [1:8] "hr" "holiday" "workingday" "weathersit" "temp" "atemp" ...
# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model, mpg_train)

# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, mpg_test)

# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
[1] 1.264899
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
[1] 1.199585
# Evaluate the r-squared on both training and test data and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
[1] 0.9073
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
[1] 0.919
# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) + 
  geom_point() + 
  geom_abline()

Good performance on the test data is more confirmation that the model works as expected.

2.4.4 Create a cross validation plan

There are several ways to implement an n-fold cross validation plan. We will create such a plan using vtreat::kWayCrossValidation(), and examine it.

kWayCrossValidation() creates a cross validation plan with the following call:

splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)

where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. We can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

  • train: the indices of dframe that will form the training set
  • app: the indices of dframe that will form the test (or application) set

We will create a 3-fold cross-validation plan for the data set mpg.

# Load the package vtreat
library(vtreat)

# mpg is in the workspace
summary(mpg)
 manufacturer          model               displ            year     
 Length:234         Length:234         Min.   :1.600   Min.   :1999  
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
 Mode  :character   Mode  :character   Median :3.300   Median :2004  
                                       Mean   :3.472   Mean   :2004  
                                       3rd Qu.:4.600   3rd Qu.:2008  
                                       Max.   :7.000   Max.   :2008  
      cyl           trans               drv                 cty       
 Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
 1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
 Median :6.000   Mode  :character   Mode  :character   Median :17.00  
 Mean   :5.889                                         Mean   :16.86  
 3rd Qu.:8.000                                         3rd Qu.:19.00  
 Max.   :8.000                                         Max.   :35.00  
      hwy             fl               class              pred.cv      
 Min.   :12.00   Length:234         Length:234         Min.   : 8.942  
 1st Qu.:18.00   Class :character   Class :character   1st Qu.:13.160  
 Median :24.00   Mode  :character   Mode  :character   Median :17.262  
 Mean   :23.44                                         Mean   :16.859  
 3rd Qu.:27.00                                         3rd Qu.:19.308  
 Max.   :44.00                                         Max.   :30.975  
      pred       
 Min.   : 9.043  
 1st Qu.:13.142  
 Median :17.241  
 Mean   :16.859  
 3rd Qu.:19.291  
 Max.   :30.906  
# Get the number of rows in mpg
nRows <- nrow(mpg)

# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, dframe = NULL, y = NULL)

# Examine the split plan
str(splitPlan)
List of 3
 $ :List of 2
  ..$ train: int [1:156] 1 2 3 4 5 6 7 8 10 11 ...
  ..$ app  : int [1:78] 20 161 98 230 9 179 73 124 126 83 ...
 $ :List of 2
  ..$ train: int [1:156] 1 2 3 5 8 9 10 13 14 15 ...
  ..$ app  : int [1:78] 138 88 53 210 113 80 190 189 24 112 ...
 $ :List of 2
  ..$ train: int [1:156] 4 6 7 9 11 12 13 14 15 16 ...
  ..$ app  : int [1:78] 30 64 133 209 75 197 91 28 194 34 ...
 - attr(*, "splitmethod")= chr "kwaycross"

2.4.5 Evaluate a modeling procedure using n-fold cross-validation

We will use splitPlan to make predictions from a model that predicts mpg$cty from mpg$hwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

# Initialize a column of the appropriate length
dframe$pred.cv <- 0 

# k is the number of folds
# splitPlan is the cross validation plan

for(i in 1:k) {
  # Get the ith split
  split <- splitPlan[[i]]

  # Build a model on the training data 
  # from this split 
  # (lm, in this case)
  model <- lm(fmla, data = dframe[split$train,])

  # make predictions on the 
  # application data from this split
  dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}

Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.

# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train,])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app,])
}

# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))

# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
[1] 1.247045
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
[1] 1.249645

We have successfully estimated a model’s out-of-sample error via cross-validation. Remember, cross-validation validates the modeling process, not an actual model.

3 Issues to Consider

3.1 Categorical inputs

3.1.1 Examining the structure of categorical inputs

For this exercise we will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling. The dataset flowers (derived from the Sleuth3 package) is loaded into the workspace. It has the following columns:

  • Flowers: the average number of flowers on a meadowfoam plant
  • Intensity: the intensity of a light treatment applied to the plant
  • Time: A categorical variable - when (Late or Early) in the lifecycle the light treatment occurred

The ultimate goal is to predict Flowers as a function of Time and Intensity.

flowers <- readRDS("flowers.rds")
# Call str on flowers to see the types of each column
str(flowers)
'data.frame':   24 obs. of  3 variables:
 $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
 $ Time     : chr  "Late" "Late" "Late" "Late" ...
 $ Intensity: int  150 150 300 300 450 450 600 600 750 750 ...
# Use unique() to see how many possible values Time takes
unique(flowers$Time)
[1] "Late"  "Early"
# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))
Flowers ~ Intensity + Time
# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, data = flowers)

# Examine the first 20 lines of flowers
head(flowers, 20)
   Flowers Time Intensity
1     62.3 Late       150
2     77.4 Late       150
3     55.3 Late       300
4     54.2 Late       300
5     49.6 Late       450
6     61.9 Late       450
7     39.4 Late       600
8     45.7 Late       600
9     31.3 Late       750
10    44.9 Late       750

# Examine the first 20 lines of mmat
head(mmat, 20)
   (Intercept) Intensity TimeLate
1            1       150        1
2            1       150        1
3            1       300        1
4            1       300        1
5            1       450        1
6            1       450        1
7            1       600        1
8            1       600        1
9            1       750        1
10           1       750        1
11           1       900        1
12           1       900        1
13           1       150        0
14           1       150        0
15           1       300        0
16           1       300        0
17           1       450        0
18           1       450        0
19           1       600        0
20           1       600        0

Now we have seen how most R modeling functions represent categorical variables internally.

3.1.2 Modeling with categorical inputs

You will fit a linear model to the flowers data to predict Flowers as a function of Time and Intensity.

# flowers is in the workspace
str(flowers)
'data.frame':   24 obs. of  3 variables:
 $ Flowers  : num  62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
 $ Time     : chr  "Late" "Late" "Late" "Late" ...
 $ Intensity: int  150 150 300 300 450 450 600 600 750 750 ...
# fmla is in the workspace
fmla
Flowers ~ Intensity + Time
# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla, data = flowers)

# Use summary on mmat to remind yourself of its structure
summary(mmat)
  (Intercept)   Intensity      TimeLate  
 Min.   :1    Min.   :150   Min.   :0.0  
 1st Qu.:1    1st Qu.:300   1st Qu.:0.0  
 Median :1    Median :525   Median :0.5  
 Mean   :1    Mean   :525   Mean   :0.5  
 3rd Qu.:1    3rd Qu.:750   3rd Qu.:1.0  
 Max.   :1    Max.   :900   Max.   :1.0  
# Use summary to examine flower_model 
summary(flower_model)

Call:
lm(formula = fmla, data = flowers)

Residuals:
   Min     1Q Median     3Q    Max 
-9.652 -4.139 -1.558  5.632 12.165 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  83.464167   3.273772  25.495  < 2e-16 ***
Intensity    -0.040471   0.005132  -7.886 1.04e-07 ***
TimeLate    -12.158333   2.629557  -4.624 0.000146 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.441 on 21 degrees of freedom
Multiple R-squared:  0.7992,    Adjusted R-squared:   0.78 
F-statistic: 41.78 on 2 and 21 DF,  p-value: 4.786e-08
# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model, flowers)

# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) + 
  geom_point() +
  geom_abline(color = "blue") 

We’ve successfully fit a model with a categorical variable, and seen how categorical variables are represented in that model.

3.2 Interactions

3.2.1 Modeling an interaction

We will use interactions to model the effect of gender and gastric activity on alcohol metabolism.

The data frame alcohol has columns:

  • Metabol: the alcohol metabolism rate
  • Gastric: the rate of gastric alcohol dehydrogenase activity
  • Sex: the sex of the drinker (Male or Female)
alcohol <- readRDS("alcohol.rds")

We can fit three models to the alcohol data:

  • one with only additive (main effect) terms : Metabol ~ Gastric + Sex
  • two models, each with interactions between gastric activity and sex

We saw that one of the models with interaction terms had a better R-squared than the additive model, suggesting that using interaction terms gives a better fit. In this exercise we will compare the R-squared of one of the interaction models to the main-effects-only model.

Recall that the operator : designates the interaction between two variables. The operator * designates the interaction between the two variables, plus the main effects.

x*y = x + y + x:y
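As a quick check, here is a minimal sketch (on a hypothetical toy data frame) showing that the two formulas expand to the same design matrix:

# Minimal sketch (toy data): x*y expands to x + y + x:y
d <- data.frame(x = 1:6, y = rep(c("a", "b"), 3), z = rnorm(6))
all.equal(model.matrix(z ~ x * y, d),
          model.matrix(z ~ x + y + x:y, d))  # returns TRUE: identical design matrices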

# alcohol is in the workspace
summary(alcohol)
    Subject         Metabol          Gastric          Sex    
 Min.   : 1.00   Min.   : 0.100   Min.   :0.800   Female:18  
 1st Qu.: 8.75   1st Qu.: 0.600   1st Qu.:1.200   Male  :14  
 Median :16.50   Median : 1.700   Median :1.600              
 Mean   :16.50   Mean   : 2.422   Mean   :1.863              
 3rd Qu.:24.25   3rd Qu.: 2.925   3rd Qu.:2.200              
 Max.   :32.00   Max.   :12.300   Max.   :5.200              
          Alcohol  
 Alcoholic    : 8  
 Non-alcoholic:24  
                   
                   
                   
                   
# Create the formula with main effects only
(fmla_add <- Metabol ~ Gastric + Sex)
Metabol ~ Gastric + Sex
# Create the formula with main effect and interactions
(fmla_main_and_interaction <- Metabol ~ Gastric * Sex)
Metabol ~ Gastric * Sex
# Create the formula with interactions
(fmla_interaction <- Metabol ~ Gastric + Gastric:Sex)
Metabol ~ Gastric + Gastric:Sex
# Fit the main effects only model
model_add <- lm(fmla_add, alcohol)

# Fit the main effects and interactions
model_main_and_interaction <- lm(fmla_main_and_interaction, alcohol)

# Fit the interaction model
model_interaction <- lm(fmla_interaction, alcohol)

# Call summary on both models and compare
summary(model_add)

Call:
lm(formula = fmla_add, data = alcohol)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2779 -0.6328 -0.0966  0.5783  4.5703 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.9466     0.5198  -3.745 0.000796 ***
Gastric       1.9656     0.2674   7.352 4.24e-08 ***
SexMale       1.6174     0.5114   3.163 0.003649 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.331 on 29 degrees of freedom
Multiple R-squared:  0.7654,    Adjusted R-squared:  0.7492 
F-statistic: 47.31 on 2 and 29 DF,  p-value: 7.41e-10
summary(model_main_and_interaction)

Call:
lm(formula = fmla_main_and_interaction, data = alcohol)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4427 -0.6111 -0.0326  0.5436  3.8759 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
(Intercept)      -0.1973     0.8022  -0.246   0.8075  
Gastric           0.8369     0.4839   1.730   0.0947 .
SexMale          -0.9885     1.0724  -0.922   0.3645  
Gastric:SexMale   1.5069     0.5591   2.695   0.0118 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.207 on 28 degrees of freedom
Multiple R-squared:  0.8137,    Adjusted R-squared:  0.7938 
F-statistic: 40.77 on 3 and 28 DF,  p-value: 2.386e-10
summary(model_interaction)

Call:
lm(formula = fmla_interaction, data = alcohol)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4656 -0.5091  0.0143  0.5660  4.0668 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -0.7504     0.5310  -1.413 0.168236    
Gastric           1.1489     0.3450   3.331 0.002372 ** 
Gastric:SexMale   1.0422     0.2412   4.321 0.000166 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.204 on 29 degrees of freedom
Multiple R-squared:  0.8081,    Adjusted R-squared:  0.7948 
F-statistic: 61.05 on 2 and 29 DF,  p-value: 4.033e-11

We will compare the performance of the interaction model you fit in the previous exercise to the performance of a main-effects only model. Because this data set is small, we will use cross-validation to simulate making predictions on out-of-sample data.

We will use the dplyr package to do calculations.

  • mutate() adds new columns to a tbl (a type of data frame)
  • group_by() specifies how rows are grouped in a tbl
  • summarize() computes summary statistics of a column

You will also use tidyr’s gather() which takes multiple columns and collapses them into key-value pairs.
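For instance, a minimal sketch (on a hypothetical toy frame) of what gather() does:

# Minimal sketch (toy data): gather two prediction columns into key-value pairs
library(tidyr)
wide <- data.frame(id = 1:2, pred_a = c(1.1, 2.2), pred_b = c(3.3, 4.4))
gather(wide, key = modeltype, value = pred, pred_a, pred_b)
#   id modeltype pred
# 1  1    pred_a  1.1
# 2  2    pred_a  2.2
# 3  1    pred_b  3.3
# 4  2    pred_b  4.4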

# alcohol is in the workspace
summary(alcohol)
    Subject         Metabol          Gastric          Sex    
 Min.   : 1.00   Min.   : 0.100   Min.   :0.800   Female:18  
 1st Qu.: 8.75   1st Qu.: 0.600   1st Qu.:1.200   Male  :14  
 Median :16.50   Median : 1.700   Median :1.600              
 Mean   :16.50   Mean   : 2.422   Mean   :1.863              
 3rd Qu.:24.25   3rd Qu.: 2.925   3rd Qu.:2.200              
 Max.   :32.00   Max.   :12.300   Max.   :5.200              
          Alcohol  
 Alcoholic    : 8  
 Non-alcoholic:24  
                   
                   
                   
                   
# Both the formulae are in the workspace
fmla_add
Metabol ~ Gastric + Sex
fmla_interaction
Metabol ~ Gastric + Gastric:Sex
# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)

# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}

# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
  alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}

# Get RMSE
alcohol %>% 
  gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
  mutate(residuals = Metabol - pred) %>%      
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
  modeltype            rmse
1 pred_add         1.375565
2 pred_interaction 1.303637

Cross-validation confirms that a model with interaction will likely give better predictions.

3.3 Transforming the response before modeling

3.3.1 Relative error

We will compare relative error to absolute error. For the purposes of modeling, we will define relative error as:

rel = (y - pred) / y

that is, the error is relative to the true outcome. You will measure the overall relative error of a model using root mean squared relative error:

rmse_rel = sqrt(mean(rel^2))

where mean(rel^2) is the mean of the squared relative error.
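As a quick illustration (toy numbers, not the fdata set below): an absolute error of 5 matters much more for a small outcome than for a large one.

# Minimal sketch (toy values): same absolute error, very different relative error
y    <- c(10, 1000)
pred <- c(15, 1005)
(rel <- (y - pred) / y)
# [1] -0.500 -0.005
sqrt(mean((y - pred)^2))  # RMSE of 5 across both points
sqrt(mean(rel^2))         # root mean squared relative error, about 0.354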

The example (toy) dataset fdata is loaded in your workspace. It includes the columns:

  • y: the true output to be predicted by some model; imagine it is the amount of money a customer will spend on a visit to your store.
  • pred: the predictions of a model that predicts y.
  • label: categorical: whether y comes from a population that makes small purchases, or large ones.

You want to know which model does “better”: the one predicting the small purchases, or the one predicting large ones.

fdata <- readRDS("fdata.rds")
# fdata is in the workspace
summary(fdata)
       y                 pred                      label   
 Min.   :  -5.894   Min.   :   1.072   small purchases:50  
 1st Qu.:   5.407   1st Qu.:   6.373   large purchases:50  
 Median :  57.374   Median :  55.693                       
 Mean   : 306.204   Mean   : 305.905                       
 3rd Qu.: 550.903   3rd Qu.: 547.886                       
 Max.   :1101.619   Max.   :1098.896                       
# Examine the data: generate the summaries for the groups large and small:
fdata %>% 
    group_by(label) %>%     # group by small/large purchases
    summarize(min  = min(y),   # min of y
              mean = mean(y),   # mean of y
              max  = max(y))   # max of y
  label                 min       mean        max
1 small purchases -5.893499   6.478254   18.62829
2 large purchases 96.119814 605.928673 1101.61864

# Fill in the blanks to add error columns
fdata2 <- fdata %>% 
         group_by(label) %>%       # group by label
           mutate(residual = y - pred,  # Residual
                  relerr   = residual / y)  # Relative error

# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>% 
  group_by(label) %>% 
  summarize(rmse     = sqrt(mean(residual^2)),   # RMSE
            rmse.rel = sqrt(mean(relerr^2)))   # Root mean squared relative error
  label               rmse   rmse.rel
1 small purchases 4.014969 1.24965674
2 large purchases 5.544439 0.01473322
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) + 
  geom_point() + 
  geom_abline() + 
  facet_wrap(~ label, ncol = 1, scales = "free") + 
  ggtitle("Outcome vs prediction")

Notice from this example how a model with larger RMSE might still be better, if relative errors are more important than absolute errors.

3.3.2 Modeling log-transformed monetary output

You will practice modeling on log-transformed monetary output, and then transforming the “log-money” predictions back into monetary units.

The data records subjects’ incomes in 2005 (Income2005), as well as the results of several aptitude tests taken by the subjects in 1981:

  • Arith
  • Word
  • Parag
  • Math
  • AFQT (Percentile on the Armed Forces Qualifying Test)

The data have already been split into training and test sets (income_train and income_test respectively).

income_train <- readRDS("income_train.rds")
income_test <- readRDS("income_test.rds")

We will build a model of log(income) from the inputs, and then convert log(income) back into income.

# Examine Income2005 in the training set
summary(income_train$Income2005)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     63   23000   39000   49894   61500  703637 
# Write the formula for log income as a function of the tests and print it
(fmla.log <- log(Income2005) ~ Arith + Word + Parag + Math + AFQT)
log(Income2005) ~ Arith + Word + Parag + Math + AFQT
# Fit the linear model
model.log <-  lm(fmla.log, data = income_train)

# Make predictions on income_test
income_test$logpred <- predict(model.log, income_test)
summary(income_test$logpred)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  9.766  10.133  10.423  10.419  10.705  11.006 
# Convert the predictions to monetary units
income_test$pred.income <- exp(income_test$logpred)
summary(income_test$pred.income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  17432   25167   33615   35363   44566   60217 
#  Plot predicted income (x axis) vs income
ggplot(income_test, aes(x = pred.income, y = Income2005)) + 
  geom_point() + 
  geom_abline(color = "blue")

Remember that when we transform the output before modeling, we have to ‘reverse transform’ the resulting predictions after applying the model.

3.3.3 Comparing RMSE and root-mean-squared Relative Error

We will show that log-transforming a monetary output before modeling improves mean relative error (but increases RMSE) compared to modeling the monetary output directly. You will compare the results of model.log from the previous exercise to a model (model.abs) that directly fits income.

# Write the formula for income as a function of the tests and print it
(fmla.abs <- Income2005 ~ Arith + Word + Parag + Math + AFQT)
Income2005 ~ Arith + Word + Parag + Math + AFQT
# Fit the linear model
model.abs <-  lm(fmla.abs, data = income_train)

# model.abs is in the workspace
summary(model.abs)

Call:
lm(formula = fmla.abs, data = income_train)

Residuals:
   Min     1Q Median     3Q    Max 
-78728 -24137  -6979  11964 648573 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  17516.7     6420.1   2.728  0.00642 ** 
Arith         1552.3      303.4   5.116 3.41e-07 ***
Word          -132.3      265.0  -0.499  0.61754    
Parag        -1155.1      618.3  -1.868  0.06189 .  
Math           725.5      372.0   1.950  0.05127 .  
AFQT           177.8      144.1   1.234  0.21734    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 45500 on 2063 degrees of freedom
Multiple R-squared:  0.1165,    Adjusted R-squared:  0.1144 
F-statistic:  54.4 on 5 and 2063 DF,  p-value: < 2.2e-16
# Add predictions to the test set
income_test <- income_test %>%
  mutate(pred.absmodel = predict(model.abs, income_test),        # predictions from model.abs
         pred.logmodel = exp(predict(model.log, income_test)))   # predictions from model.log

# Gather the predictions and calculate residuals and relative error
income_long <- income_test %>% 
  gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
  mutate(residual = pred - Income2005,   # residuals
         relerr   = residual / Income2005)   # relative error

# Calculate RMSE and relative RMSE and compare
income_long %>% 
  group_by(modeltype) %>%      # group by modeltype
  summarize(rmse     = sqrt(mean(residual^2)),    # RMSE
            rmse.rel = sqrt(mean(relerr^2)))    # Root mean squared relative error
  modeltype         rmse rmse.rel
1 pred.absmodel 37448.37 3.183705
2 pred.logmodel 39234.90 2.218663

We’ve seen how modeling log(income) can reduce the relative error of the fit, at the cost of increased RMSE. Which tradeoff to make depends on the goals of our project.

3.4 Transforming inputs before modeling

3.4.1 Input transforms: the “hockey stick”

We will build a model to predict price from a measure of the house’s size (surface area).

houseprice <- readRDS("houseprice.rds")

The data set houseprice has the columns:

  • price : house price in units of $1000
  • size: surface area

A scatterplot of the data shows that the data is quite non-linear: a sort of “hockey stick” where price is fairly flat for smaller houses, but rises steeply as the house gets larger. Quadratics and cubics are often good functional forms to express hockey-stick-like relationships. Note that there may not be a “physical” reason that price is related to the square of the size; a quadratic is simply a closed-form approximation of the observed relationship.

ggplot(houseprice, aes(size, price))+
  geom_point()

We will fit a model to predict price as a function of the squared size, and look at its fit on the training data.

Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself.

exampleFormula = y ~ I(x^2)
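A minimal sketch (hypothetical toy data) of the difference: without I(), x^2 is read as formula crossing, which collapses to just x.

# Minimal sketch (toy data): ^ vs I() in a formula
d <- data.frame(x = 1:5, y = (1:5)^2)
colnames(model.matrix(y ~ x^2, d))     # "(Intercept)" "x"      -- crossing x with itself
colnames(model.matrix(y ~ I(x^2), d))  # "(Intercept)" "I(x^2)" -- the literal square
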
# houseprice is in the workspace
summary(houseprice)
      size           price      
 Min.   : 44.0   Min.   : 42.0  
 1st Qu.: 73.5   1st Qu.:164.5  
 Median : 91.0   Median :203.5  
 Mean   : 94.3   Mean   :249.2  
 3rd Qu.:118.5   3rd Qu.:287.8  
 Max.   :150.0   Max.   :573.0  
# Create the formula for price as a function of squared size
(fmla_sqr <- price ~ I(size^2))
price ~ I(size^2)
# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, houseprice)

# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, houseprice)

# Make predictions and compare
houseprice %>% 
    mutate(pred_lin = predict(model_lin, houseprice),       # predictions from linear model
           pred_sqr = predict(model_sqr, houseprice)) %>%   # predictions from quadratic model 
    gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
    ggplot(aes(x = size)) + 
       geom_point(aes(y = price)) +                   # actual prices
       geom_line(aes(y = pred, color = modeltype)) + # the predictions
       scale_color_brewer(palette = "Dark2")

A quadratic model seems to fit the houseprice data better than a linear model. We will confirm whether the quadratic model would perform better on out-of-sample data. Since this data set is small, we will use cross-validation.

# houseprice is in the workspace
summary(houseprice)
      size           price      
 Min.   : 44.0   Min.   : 42.0  
 1st Qu.: 73.5   1st Qu.:164.5  
 Median : 91.0   Median :203.5  
 Mean   : 94.3   Mean   :249.2  
 3rd Qu.:118.5   3rd Qu.:287.8  
 Max.   :150.0   Max.   :573.0  
# fmla_sqr is in the workspace
fmla_sqr
price ~ I(size^2)
# Create a splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(houseprice) ,3 ,NULL, NULL)

# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_lin <- lm(price ~ size, data = houseprice[split$train,])
  houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app,])
}

# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
  houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}

# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
  gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
  mutate(residuals = pred - price)

# Compare the cross-validated RMSE for the two models
houseprice_long %>% 
  group_by(modeltype) %>% # group by modeltype
  summarize(rmse = sqrt(mean(residuals^2)))
  modeltype     rmse
1 pred_lin  73.14852
2 pred_sqr  60.27335

We’ve confirmed that the quadratic input transformation improved the model.

4 Dealing with Non-Linear Responses

4.1 Logistic regression to predict probabilities

4.1.1 Fit a model of sparrow survival probability

You will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow.

The outcome to be predicted is status (“Survived”, “Perished”). The variables we will consider are:

  • total_length: length of the bird from tip of beak to tip of tail (mm)
  • weight: in grams
  • humerus : length of humerus (“upper arm bone” that connects the wing to the body) (inches)

When using glm() to create a logistic regression model, we must explicitly specify that family = binomial:

glm(formula, data = data, family = binomial)

We will call summary() and broom::glance() to examine the logistic regression model in different ways. One of the diagnostics you will look at is the analog to R2, called the pseudo-R2:

pseudoR2 = 1 - deviance / null.deviance

You can think of deviance as analogous to variance: it is a measure of the variation in categorical data. The pseudo-R2 is analogous to R2 for standard regression: R2 is a measure of the “variance explained” of a regression model. The pseudo-R2 is a measure of the “deviance explained”.

library(broom)
sparrow <- readRDS("sparrow.rds")
# sparrow is in the workspace
summary(sparrow)
      status       age             total_length      wingspan    
 Perished:36   Length:87          Min.   :153.0   Min.   :236.0  
 Survived:51   Class :character   1st Qu.:158.0   1st Qu.:245.0  
               Mode  :character   Median :160.0   Median :247.0  
                                  Mean   :160.4   Mean   :247.5  
                                  3rd Qu.:162.5   3rd Qu.:251.0  
                                  Max.   :167.0   Max.   :256.0  
     weight       beak_head        humerus           femur       
 Min.   :23.2   Min.   :29.80   Min.   :0.6600   Min.   :0.6500  
 1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250   1st Qu.:0.7000  
 Median :25.8   Median :31.70   Median :0.7400   Median :0.7100  
 Mean   :25.8   Mean   :31.64   Mean   :0.7353   Mean   :0.7134  
 3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500   3rd Qu.:0.7300  
 Max.   :31.0   Max.   :33.00   Max.   :0.7800   Max.   :0.7600  
    legbone          skull           sternum      
 Min.   :1.010   Min.   :0.5600   Min.   :0.7700  
 1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300  
 Median :1.130   Median :0.6000   Median :0.8500  
 Mean   :1.131   Mean   :0.6032   Mean   :0.8511  
 3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800  
 Max.   :1.230   Max.   :0.6400   Max.   :0.9300  
# Create the survived column
sparrow$survived <- ifelse(sparrow$status == "Survived", TRUE, FALSE)

# Create the formula
(fmla <- survived ~ total_length + weight + humerus)
survived ~ total_length + weight + humerus
# Fit the logistic regression model
sparrow_model <- glm(fmla, data = sparrow, family = binomial)

# Call summary
summary(sparrow_model)

Call:
glm(formula = fmla, family = binomial, data = sparrow)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1117  -0.6026   0.2871   0.6577   1.7082  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   46.8813    16.9631   2.764 0.005715 ** 
total_length  -0.5435     0.1409  -3.858 0.000115 ***
weight        -0.5689     0.2771  -2.053 0.040060 *  
humerus       75.4610    19.1586   3.939 8.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 118.008  on 86  degrees of freedom
Residual deviance:  75.094  on 83  degrees of freedom
AIC: 83.094

Number of Fisher Scoring iterations: 5
# Call glance
(perf <- glance(sparrow_model))
  null.deviance df.null    logLik      AIC      BIC deviance df.residual
1      118.0084      86 -37.54718 83.09436 92.95799 75.09436          83

# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
[1] 0.3636526

We’ve fit a logistic regression model to predict probabilities! When looking at the pseudo-R2 of a logistic regression model, we should hope to see a value close to 1.

4.1.2 Predict sparrow survival

You will predict the probability of survival. When calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:

predict(model, type = "response")

Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.
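A minimal sketch of the relationship between the two prediction types (using sparrow_model, fit above): the response is just the inverse logit (plogis()) of the log-odds.

# Sketch: the default type = "link" returns log-odds;
# plogis() (the inverse logit) recovers the probabilities
logodds <- predict(sparrow_model)
probs   <- predict(sparrow_model, type = "response")
all.equal(plogis(logodds), probs)  # returns TRUE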

We will also use the GainCurvePlot() function to plot the gain curve from the model predictions. If the model’s gain curve is close to the ideal (“wizard”) gain curve, then the model sorted the sparrows well: that is, the model predicted that sparrows that actually survived would have a higher probability of survival. The inputs to the GainCurvePlot() function are:

  • frame: data frame with prediction column and ground truth column
  • xvar: the name of the column of predictions (as a string)
  • truthVar: the name of the column with actual outcome (as a string)
  • title: a title for the plot (as a string)

    GainCurvePlot(frame, xvar, truthVar, title)

# sparrow is in the workspace
summary(sparrow)
      status       age             total_length      wingspan    
 Perished:36   Length:87          Min.   :153.0   Min.   :236.0  
 Survived:51   Class :character   1st Qu.:158.0   1st Qu.:245.0  
               Mode  :character   Median :160.0   Median :247.0  
                                  Mean   :160.4   Mean   :247.5  
                                  3rd Qu.:162.5   3rd Qu.:251.0  
                                  Max.   :167.0   Max.   :256.0  
     weight       beak_head        humerus           femur       
 Min.   :23.2   Min.   :29.80   Min.   :0.6600   Min.   :0.6500  
 1st Qu.:24.7   1st Qu.:31.40   1st Qu.:0.7250   1st Qu.:0.7000  
 Median :25.8   Median :31.70   Median :0.7400   Median :0.7100  
 Mean   :25.8   Mean   :31.64   Mean   :0.7353   Mean   :0.7134  
 3rd Qu.:26.7   3rd Qu.:32.10   3rd Qu.:0.7500   3rd Qu.:0.7300  
 Max.   :31.0   Max.   :33.00   Max.   :0.7800   Max.   :0.7600  
    legbone          skull           sternum        survived      
 Min.   :1.010   Min.   :0.5600   Min.   :0.7700   Mode :logical  
 1st Qu.:1.110   1st Qu.:0.5900   1st Qu.:0.8300   FALSE:36       
 Median :1.130   Median :0.6000   Median :0.8500   TRUE :51       
 Mean   :1.131   Mean   :0.6032   Mean   :0.8511                  
 3rd Qu.:1.160   3rd Qu.:0.6100   3rd Qu.:0.8800                  
 Max.   :1.230   Max.   :0.6400   Max.   :0.9300                  
# sparrow_model is in the workspace
summary(sparrow_model)

Call:
glm(formula = fmla, family = binomial, data = sparrow)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1117  -0.6026   0.2871   0.6577   1.7082  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   46.8813    16.9631   2.764 0.005715 ** 
total_length  -0.5435     0.1409  -3.858 0.000115 ***
weight        -0.5689     0.2771  -2.053 0.040060 *  
humerus       75.4610    19.1586   3.939 8.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 118.008  on 86  degrees of freedom
Residual deviance:  75.094  on 83  degrees of freedom
AIC: 83.094

Number of Fisher Scoring iterations: 5
# Make predictions
sparrow$pred <- predict(sparrow_model, type = "response")

# Look at gain curve
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")

We see from the gain curve that the model follows the wizard curve for about the first 30% of the data, identifying about 45% of the surviving sparrows with only a few false positives.

4.2 Poisson and quasipoisson regression to predict counts

4.2.1 Poisson or quasipoisson

One of the assumptions of Poisson regression to predict counts is that the event you are counting is Poisson distributed: the average count per unit time is the same as the variance of the count. In practice, “the same” means that the mean and the variance should be of a similar order of magnitude.

When the variance is much larger than the mean, the Poisson assumption doesn’t apply, and one solution is to use quasipoisson regression, which does not assume that variance=mean.
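A minimal sketch of that check (toy counts, not the bike data):

# Minimal sketch (toy counts): compare mean and variance of the outcome
counts <- c(1, 2, 2, 3, 50, 80, 120, 2, 4, 200)
mean(counts)  # 46.4
var(counts)   # about 4645 -- much larger than the mean
# variance >> mean, so prefer family = quasipoisson over family = poisson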

4.2.2 Fit a model to predict bike rental counts

We will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. We will train the model on data from the month of July.

The data frame has the columns:

  • cnt: the number of bikes rented in that hour (the outcome)
  • hr: the hour of the day (0-23, as a factor)
  • holiday: TRUE/FALSE
  • workingday: TRUE if neither a holiday nor a weekend, else FALSE
  • weathersit: categorical, “Clear to partly cloudy”/“Light Precipitation”/“Misty”
  • temp: normalized temperature in Celsius
  • atemp: normalized “feeling” temperature in Celsius
  • hum: normalized humidity
  • windspeed: normalized windspeed
  • instant: the time index – number of hours since beginning of data set (not a variable)
  • mnth and yr: month and year indices (not variables)

Remember that we must specify family = poisson or family = quasipoisson when using glm() to fit a count model.

bikesAugust <- miceadds::load.Rdata2(filename = "Bikes.RData")
bikesJuly <- readRDS("bikesJuly.rds")
# bikesJuly is in the workspace
str(bikesJuly)
'data.frame':   744 obs. of  12 variables:
 $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
 $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
 $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
 $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
 $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
 $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
 $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
 $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
 $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# The outcome column
(outcome <- "cnt")
[1] "cnt"
# The inputs to use
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
[1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
[6] "atemp"      "hum"        "windspeed" 
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
[1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
[1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
[1] 45863.84
# Fit the model
bike_model <- glm(fmla, data = bikesJuly, family = quasipoisson)

# Call glance
(perf <- glance(bike_model))
  null.deviance df.null logLik AIC BIC deviance df.residual
1      133364.9     743     NA  NA  NA  28774.9         712

# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
[1] 0.7842393

We’ve fit a (quasi)poisson model to predict counts! As with a logistic model, you hope for a pseudo-R2 near 1.

4.2.3 Predict bike rentals on new data

We will use the model to make predictions for the month of August.

# bikesAugust is in the workspace
str(bikesAugust)
'data.frame':   744 obs. of  12 variables:
 $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
 $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
 $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
 $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
 $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
 $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
 $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
 $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
 $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# bike_model is in the workspace
summary(bike_model)

Call:
glm(formula = fmla, family = quasipoisson, data = bikesJuly)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-21.6117   -4.3121   -0.7223    3.5507   16.5079  

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    5.934986   0.439027  13.519  < 2e-16 ***
hr1                           -0.580055   0.193354  -3.000 0.002794 ** 
hr2                           -0.892314   0.215452  -4.142 3.86e-05 ***
hr3                           -1.662342   0.290658  -5.719 1.58e-08 ***
hr4                           -2.350204   0.393560  -5.972 3.71e-09 ***
hr5                           -1.084289   0.230130  -4.712 2.96e-06 ***
hr6                            0.211945   0.156476   1.354 0.176012    
hr7                            1.211135   0.132332   9.152  < 2e-16 ***
hr8                            1.648361   0.127177  12.961  < 2e-16 ***
hr9                            1.155669   0.133927   8.629  < 2e-16 ***
hr10                           0.993913   0.137096   7.250 1.09e-12 ***
hr11                           1.116547   0.136300   8.192 1.19e-15 ***
hr12                           1.282685   0.134769   9.518  < 2e-16 ***
hr13                           1.273010   0.135872   9.369  < 2e-16 ***
hr14                           1.237721   0.136386   9.075  < 2e-16 ***
hr15                           1.260647   0.136144   9.260  < 2e-16 ***
hr16                           1.515893   0.132727  11.421  < 2e-16 ***
hr17                           1.948404   0.128080  15.212  < 2e-16 ***
hr18                           1.893915   0.127812  14.818  < 2e-16 ***
hr19                           1.669277   0.128471  12.993  < 2e-16 ***
hr20                           1.420732   0.131004  10.845  < 2e-16 ***
hr21                           1.146763   0.134042   8.555  < 2e-16 ***
hr22                           0.856182   0.138982   6.160 1.21e-09 ***
hr23                           0.479197   0.148051   3.237 0.001265 ** 
holidayTRUE                    0.201598   0.079039   2.551 0.010961 *  
workingdayTRUE                 0.116798   0.033510   3.485 0.000521 ***
weathersitLight Precipitation -0.214801   0.072699  -2.955 0.003233 ** 
weathersitMisty               -0.010757   0.038600  -0.279 0.780572    
temp                          -3.246001   1.148270  -2.827 0.004833 ** 
atemp                          2.042314   0.953772   2.141 0.032589 *  
hum                           -0.748557   0.236015  -3.172 0.001581 ** 
windspeed                      0.003277   0.148814   0.022 0.982439    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 38.98949)

    Null deviance: 133365  on 743  degrees of freedom
Residual deviance:  28775  on 712  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
# Make predictions on August data
bikesAugust$pred  <- predict(bike_model, newdata = bikesAugust, type = "response")

# Calculate the RMSE
bikesAugust %>% 
  mutate(residual = pred - cnt) %>%
  summarize(rmse  = sqrt(mean(residual^2)))
      rmse
1 112.5815

# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
  geom_point() + 
  geom_abline(color = "darkblue")

(Quasi)poisson models predict non-negative rates, making them useful for count or frequency data.

4.2.4 Visualize the Bike Rental Predictions

We visualized the bike model’s predictions using the standard “outcome vs. prediction” scatter plot. Since the bike rental data is time series data, we might be interested in how the model performs as a function of time. In this exercise, we will compare the predictions and actual rentals on an hourly basis, for the first 14 days of August.

To create the plot you will use the function tidyr::gather() to consolidate the predicted and actual values from bikesAugust in a single column. gather() takes as arguments:

  • The “wide” data frame to be gathered (implicit in a pipe)
  • The name of the key column to be created - contains the names of the gathered columns.
  • The name of the value column to be created - contains the values of the gathered columns.
  • The names of the columns to be gathered into a single column.

We’ll use the gathered data frame to compare the actual and predicted rental counts as a function of time. The time index, instant counts the number of observations since the beginning of data collection. The sample code converts the instants to daily units, starting from 0.

# Plot predictions and cnt by date/time
(quasipoisson_plot <- bikesAugust %>% 
  # set start to 0, convert unit to days
  mutate(instant = (instant - min(instant))/24) %>%  
  # gather cnt and pred into a value column
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # restrict to first 14 days
  # plot value by instant
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Quasipoisson model"))

This model mostly identifies the slow and busy hours of the day, although it often underestimates peak demand.

4.3 GAM to learn non-linear transforms

4.3.1 Writing formulas for GAM models

When using gam() to model outcome as an additive function of the inputs, you can use the s() function inside formulas to designate that you want to use a spline to model the non-linear relationship of a continuous variable to the outcome.

Suppose that you want to predict how much weight (Wtloss) a dieter will lose over a 2-year diet plan as a function of:

  • Diet type (categorical)
  • Sex (categorical)
  • Age at beginning of diet (continuous)
  • BMI (body mass index) at beginning of diet (continuous)

    Wtloss ~ Diet + Sex + s(Age) + s(BMI)

This formula says that Age and BMI will both be modeled non-linearly.

Suppose that in the diet problem from the previous exercise, you now also want to take into account

  • the dieter’s resting metabolic rate (BMR – continuous) and
  • the dieter’s average number hours of aerobic exercise per day (E – continuous) at the beginning of the study.

You have reason to believe that the relationship between BMR and weight loss is linear (and you want to model it that way), but not necessarily the relationship between aerobic exercise and weight loss.

Wtloss ~ Diet + Sex + s(Age) + s(BMI) + BMR + s(E)

This formula says to model Age, BMI and E as non-linear, but model BMR as linear.

4.3.2 Model soybean growth with GAM

We will model the average leaf weight on a soybean plant as a function of time (after planting). As we will see, the soybean plant doesn’t grow at a steady rate, but rather has a “growth spurt” that eventually tapers off. Hence, leaf weight is not well described by a linear model.

We can designate which variable you want to model non-linearly in a formula with the s() function:

y ~ s(x)

gam() from the package mgcv has the calling interface:

gam(formula, family, data)

For standard regression, use family = gaussian (the default).

soybean_train <- readRDS("soybean_train.rds")
# soybean_train is in the workspace
summary(soybean_train)
      Plot     Variety   Year          Time           weight       
 1988F6 : 10   F:161   1988:124   Min.   :14.00   Min.   : 0.0290  
 1988F7 :  9   P:169   1989:102   1st Qu.:27.00   1st Qu.: 0.6663  
 1988P1 :  9           1990:104   Median :42.00   Median : 3.5233  
 1988P8 :  9                      Mean   :43.56   Mean   : 6.1645  
 1988P2 :  9                      3rd Qu.:56.00   3rd Qu.:10.3808  
 1988F3 :  8                      Max.   :84.00   Max.   :27.3700  
 (Other):276                                                       
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) + 
  geom_point() 

# Load the package mgcv
library(mgcv)

# Create the formula 
(fmla.gam <- weight ~ s(Time) )
weight ~ s(Time)
# Fit the GAM Model
model.gam <- gam(fmla.gam, family = gaussian, soybean_train)
# Create the formula 
(fmla.lin <- weight ~ Time )
weight ~ Time
# Fit the linear model
model.lin <- lm(fmla.lin, soybean_train)
# From previous step
library(mgcv)
fmla.gam <- weight ~ s(Time)
model.gam <- gam(fmla.gam, data = soybean_train, family = gaussian)

# Call summary() on model.lin and look for R-squared
summary(model.lin)

Call:
lm(formula = fmla.lin, data = soybean_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3933 -1.7100 -0.3909  1.9056 11.4381 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.559283   0.358527  -18.30   <2e-16 ***
Time         0.292094   0.007444   39.24   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.778 on 328 degrees of freedom
Multiple R-squared:  0.8244,    Adjusted R-squared:  0.8238 
F-statistic:  1540 on 1 and 328 DF,  p-value: < 2.2e-16
# Call summary() on model.gam and look for R-squared
summary(model.gam)

Family: gaussian 
Link function: identity 

Formula:
weight ~ s(Time)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.1645     0.1143   53.93   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
          edf Ref.df     F p-value    
s(Time) 8.495   8.93 338.2  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.902   Deviance explained = 90.4%
GCV = 4.4395  Scale est. = 4.3117    n = 330
# Call plot() on model.gam
plot(model.gam)

We have fit a generalized additive model. For this data, the GAM appears to fit the data better than a linear model, as measured by the R-squared.

4.3.3 Predict with the soybean model on test data

We will apply the soybean models to new data: soybean_test

soybean_test <- readRDS("soybean_test.rds")
# soybean_test is in the workspace
summary(soybean_test)
      Plot    Variety   Year         Time           weight       
 1988F8 : 4   F:43    1988:32   Min.   :14.00   Min.   : 0.0380  
 1988P7 : 4   P:39    1989:26   1st Qu.:23.00   1st Qu.: 0.4248  
 1989F8 : 4           1990:24   Median :41.00   Median : 3.0025  
 1990F8 : 4                     Mean   :44.09   Mean   : 7.1576  
 1988F4 : 3                     3rd Qu.:69.00   3rd Qu.:15.0113  
 1988F2 : 3                     Max.   :84.00   Max.   :30.2717  
 (Other):60                                                      
# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)

# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))

# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
  gather(key = modeltype, value = pred, pred.lin, pred.gam)

# Calculate the rmse
soybean_long %>%
  mutate(residual = weight - pred) %>%     # residuals
  group_by(modeltype) %>%                  # group by modeltype
  summarize(rmse = sqrt(mean(residual^2))) # calculate the RMSE
  modeltype     rmse
1 pred.gam  2.286451
2 pred.lin  3.190995

# Compare the predictions against actual weights on the test data
soybean_long %>%
  ggplot(aes(x = Time)) +                          # the column for the x axis
  geom_point(aes(y = weight)) +                    # the y-column for the scatterplot
  geom_point(aes(y = pred, color = modeltype)) +   # the y-column for the point-and-line plot
  geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the y-column for the point-and-line plot
  scale_color_brewer(palette = "Dark2")

The GAM learns the non-linear growth function of the soybean plants, including the fact that weight is never negative.

5 Tree-Based Methods

5.1 The intuition behind tree-based methods

The tree predicts the expected intelligence (humans have an intelligence of 1) of several mammals, as a function of gestation time (in days) and average litter size for that species.

The leaf nodes show the expected intelligence for the data points in that node, as well as what percentage of the data points fall into that node.

5.2 Random forests

5.2.1 Build a random forest model for bike rentals

We will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. We will train the model on data from the month of July.

We will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger() call are:

  • formula
  • data
  • num.trees: the number of trees in the forest.
  • respect.unordered.factors : Specifies how to treat unordered factor variables. We recommend setting this to “order” for regression.
  • seed: because this is a random algorithm, you will set the seed to get reproducible results

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.

# bikesJuly is in the workspace
str(bikesJuly)
'data.frame':   744 obs. of  12 variables:
 $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
 $ temp      : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
 $ atemp     : num  0.727 0.697 0.697 0.712 0.667 ...
 $ hum       : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
 $ windspeed : num  0 0.1343 0.0896 0.1343 0.194 ...
 $ cnt       : int  149 93 90 33 4 10 27 50 142 219 ...
 $ instant   : int  13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
 $ mnth      : int  7 7 7 7 7 7 7 7 7 7 ...
 $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
# Random seed to reproduce results
seed <- 423563

# The outcome column
(outcome <- "cnt")
[1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
[1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
[6] "atemp"      "hum"        "windspeed" 
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
[1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Load the package ranger
library(ranger)

# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula 
                         bikesJuly, # data
                         num.trees = 500, 
                         respect.unordered.factors = "order", 
                         seed = seed))
Ranger result

Call:
 ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      744 
Number of independent variables:  8 
Mtry:                             2 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       8230.568 
R squared (OOB):                  0.8205434 

We have fit a model to the data with a respectable R-squared.

5.2.2 Predict bike rentals with the random forest model

We will use the model to predict bike rentals for the month of August.

The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. You can access predictions with the $ notation for accessing named elements of a list:

predict(model, data)$predictions
# bikesAugust is in the workspace
str(bikesAugust)
'data.frame':   744 obs. of  13 variables:
 $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
 $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
 $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
 $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
 $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
 $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
 $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
 $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
 $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pred      : num  94.96 51.74 37.98 17.58 9.36 ...
# bike_model_rf is in the workspace
bike_model_rf
Ranger result

Call:
 ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order",      seed = seed) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      744 
Number of independent variables:  8 
Mtry:                             2 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       8230.568 
R squared (OOB):                  0.8205434 
# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions

# Calculate the RMSE of the predictions
bikesAugust %>% 
  mutate(residual = cnt - pred)  %>% # calculate the residual
  summarize(rmse  = sqrt(mean(residual^2)))      # calculate rmse
      rmse
1 97.18347

# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

This random forest model outperforms the poisson count model on the same data; it is discovering more complex non-linear or non-additive relationships in the data.

5.2.3 Visualize random forest bike model predictions

We saw that the random forest bike model did better on the August data than the quasipoisson model, in terms of RMSE.

You will visualize the random forest model’s August predictions as a function of time.

Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but it somewhat underestimated peak demands. You would like to see how the random forest model compares.

The plot quasipoisson_plot of quasipoisson model predictions as a function of time is shown.

quasipoisson_plot

first_two_weeks <- bikesAugust %>% 
  # Set start to 0, convert unit to days
  mutate(instant = (instant - min(instant)) / 24) %>% 
  # Gather cnt and pred into a column named value with key valuetype
  gather(key = valuetype, value = value, cnt, pred) %>%
  # Filter for rows in the first two weeks
  filter(instant < 14)

# Plot predictions and cnt by date/time 
(randomforest_plot <- ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Random Forest plot"))

The random forest model captured the day-to-day variations in peak demand better than the quasipoisson model, but it still underestimates peak demand, and also overestimates minimum demand. So there is still room for improvement.

5.3 One-Hot-Encoding Categorical Variables

5.3.1 vtreat on a small example

We will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded “lev”), and to clean bad values out of numerical variables (coded “clean”).

To design a treatment plan use the function designTreatmentsZ()

treatplan <- designTreatmentsZ(data, varlist)
  • data: the original training data frame
  • varlist: a vector of input variables to be treated (as strings).

designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:

scoreFrame <- treatplan %>% 
            magrittr::use_series(scoreFrame) %>% 
            select(varName, origName, code)
  • varName: the name of the new treated variable
  • origName: the name of the original variable that the treated variable comes from
  • code: the type of the new variable.
    • “clean”: a numerical variable with no NAs or NaNs
    • “lev”: an indicator variable for a specific level of the original categorical variable.

(magrittr::use_series() is an alias for $ that you can use in pipes.)
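For example, a quick sketch (toy frame) showing the equivalence:

# Sketch: use_series() is the pipe-friendly form of $
library(magrittr)
d <- data.frame(a = 1:3)
identical(d %>% use_series(a), d$a)  # returns TRUE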

For these exercises, we want varName where code is either “clean” or “lev”:

newvarlist <- scoreFrame %>% 
             filter(code %in% c("clean", "lev")) %>%
             magrittr::use_series(varName)

To transform the data set into all numerical and one-hot-encoded variables, use prepare():

data.treat <- prepare(treatplan, data, varRestriction = newvarlist)
  • treatplan: the treatment plan
  • data: the data frame to be treated
  • varRestriction: the variables desired in the treated data
dframe <- readRDS("dframe.rds")
# dframe is in the workspace
dframe
   color size popularity
1      b   13  1.0785088
2      r   11  1.3956245
3      r   15  0.9217988
4      r   14  1.2025453
5      r   13  1.0838662
6      b   11  0.8043527
7      r    9  1.1035440
8      g   12  0.8746332
9      b    7  0.6947058
10     b   12  0.8832502

# Create and print a vector of variable names
(vars <- c("color", "size"))
[1] "color" "size" 
# Load the package vtreat
library(vtreat)

# Create the treatment plan
treatplan <- designTreatmentsZ(dframe, vars)
[1] "vtreat 1.6.0 inspecting inputs Wed Jun 10 17:55:13 2020"
[1] "designing treatments Wed Jun 10 17:55:13 2020"
[1] " have initial level statistics Wed Jun 10 17:55:13 2020"
[1] " scoring treatments Wed Jun 10 17:55:13 2020"
[1] "have treatment plan Wed Jun 10 17:55:13 2020"
# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    magrittr::use_series(scoreFrame) %>%
    select(varName, origName, code))
        varName origName  code
1    color_catP    color  catP
2          size     size clean
3 color_lev_x_b    color   lev
4 color_lev_x_g    color   lev
5 color_lev_x_r    color   lev

# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    magrittr::use_series(varName))
[1] "size"          "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"
# Create the treated training data
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
size color_lev_x_b color_lev_x_g color_lev_x_r
  13             1             0             0
  11             0             0             1
  15             0             0             1
  14             0             0             1
  13             0             0             1
  11             1             0             0
   9             0             0             1
  12             0             1             0
   7             1             0             0
  12             1             0             0

We have successfully one-hot-encoded categorical data. The new indicator variables have ‘lev’ in their names; the cleaned numerical variables are coded ‘clean’ in the scoreFrame and keep their original names (here, size). The treated data is all numerical, with no missing values, and is suitable for use with xgboost and other R modeling functions.

5.3.2 Novel levels

When a level of a categorical variable is rare, sometimes it will fail to show up in training data. If that rare level then appears in future data, downstream models may not know what to do with it. When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.

vtreat is a “safer” alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely. vtreat also manages missing values in the data (both categorical and continuous).
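
To see the problem concretely, here is a minimal sketch (on hypothetical toy frames echoing dframe and testframe) of a novel level breaking predict() for a model fit with the standard factor encoding:

# Hypothetical toy frames; the color "y" never appears in training
train <- data.frame(color = c("b", "r", "g"), popularity = c(1.1, 0.9, 1.0))
test  <- data.frame(color = c("y", "g"))        # "y" (yellow) is a novel level

model <- lm(popularity ~ color, data = train)   # uses model.matrix() internally
# predict(model, newdata = test)
# Error: factor color has new levels y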

We will see how vtreat handles categorical values that did not appear in the training set.

testframe <- readRDS("testframe.rds")
# treatplan is in the workspace
summary(treatplan)
              Length Class           Mode     
treatments    3      -none-          list     
scoreFrame    8      data.frame      list     
outcomename   1      -none-          character
vtreatVersion 1      package_version list     
outcomeType   1      -none-          character
outcomeTarget 1      -none-          character
meanY         1      -none-          logical  
splitmethod   1      -none-          character
# newvars is in the workspace
newvars
[1] "size"          "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"
# Print dframe and testframe
dframe
color size popularity
    b   13  1.0785088
    r   11  1.3956245
    r   15  0.9217988
    r   14  1.2025453
    r   13  1.0838662
    b   11  0.8043527
    r    9  1.1035440
    g   12  0.8746332
    b    7  0.6947058
    b   12  0.8832502
testframe
color size popularity
    g    7  0.9733920
    g    8  0.9122529
    y   10  1.4217153
    g   12  1.1905828
    g    6  0.9866464
    y    8  1.3697515
    b   12  1.0959387
    g   12  0.9161547
    g   12  1.0000460
    r    8  1.3137360

# Use prepare() to one-hot-encode testframe
(testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))
size color_lev_x_b color_lev_x_g color_lev_x_r
   7             0             1             0
   8             0             1             0
  10             0             0             0
  12             0             1             0
   6             0             1             0
   8             0             0             0
  12             1             0             0
  12             0             1             0
  12             0             1             0
   8             0             0             1

As you saw, vtreat encodes novel colors like yellow, which were not present in the training data, as all zeros: ‘none of the known colors’. This allows downstream models to accept these novel values without crashing.

5.3.3 vtreat the bike rental data

We will create one-hot-encoded data frames of the July/August bike data, for use with xgboost later on.

For your convenience, we have defined the variable vars with the list of variable columns for the model.

library(magrittr)

Attaching package: ‘magrittr’

The following object is masked from ‘package:tidyr’:

    extract
# The outcome column
(outcome <- "cnt")
[1] "cnt"
# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
[1] "hr"         "holiday"    "workingday" "weathersit" "temp"      
[6] "atemp"      "hum"        "windspeed" 
# Load the package vtreat
library(vtreat)

# Create the treatment plan from bikesJuly (the training data)
treatplan <- designTreatmentsZ(bikesJuly, vars, verbose = FALSE)

# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
  use_series(scoreFrame) %>%        
  filter(code %in% c("lev", "clean")) %>%  # get the rows you care about
  use_series(varName))           # get the varName column
 [1] "holiday"                                
 [2] "workingday"                             
 [3] "temp"                                   
 [4] "atemp"                                  
 [5] "hum"                                    
 [6] "windspeed"                              
 [7] "hr_lev_x_0"                             
 [8] "hr_lev_x_1"                             
 [9] "hr_lev_x_10"                            
[10] "hr_lev_x_11"                            
[11] "hr_lev_x_12"                            
[12] "hr_lev_x_13"                            
[13] "hr_lev_x_14"                            
[14] "hr_lev_x_15"                            
[15] "hr_lev_x_16"                            
[16] "hr_lev_x_17"                            
[17] "hr_lev_x_18"                            
[18] "hr_lev_x_19"                            
[19] "hr_lev_x_2"                             
[20] "hr_lev_x_20"                            
[21] "hr_lev_x_21"                            
[22] "hr_lev_x_22"                            
[23] "hr_lev_x_23"                            
[24] "hr_lev_x_3"                             
[25] "hr_lev_x_4"                             
[26] "hr_lev_x_5"                             
[27] "hr_lev_x_6"                             
[28] "hr_lev_x_7"                             
[29] "hr_lev_x_8"                             
[30] "hr_lev_x_9"                             
[31] "weathersit_lev_x_Clear_to_partly_cloudy"
[32] "weathersit_lev_x_Light_Precipitation"   
[33] "weathersit_lev_x_Misty"                 
# Prepare the training data
bikesJuly.treat <- prepare(treatplan, bikesJuly,  varRestriction = newvars)

# Prepare the test data
bikesAugust.treat <- prepare(treatplan, bikesAugust,  varRestriction = newvars)

# Call str() on the treated data
str(bikesJuly.treat)
'data.frame':   744 obs. of  33 variables:
 $ holiday                                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ workingday                             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ temp                                   : num  0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
 $ atemp                                  : num  0.727 0.697 0.697 0.712 0.667 ...
 $ hum                                    : num  0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
 $ windspeed                              : num  0 0.1343 0.0896 0.1343 0.194 ...
 $ hr_lev_x_0                             : num  1 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_1                             : num  0 1 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_10                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_11                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_12                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_13                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_14                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_15                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_16                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_17                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_18                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_19                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_2                             : num  0 0 1 0 0 0 0 0 0 0 ...
 $ hr_lev_x_20                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_21                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_22                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_23                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_3                             : num  0 0 0 1 0 0 0 0 0 0 ...
 $ hr_lev_x_4                             : num  0 0 0 0 1 0 0 0 0 0 ...
 $ hr_lev_x_5                             : num  0 0 0 0 0 1 0 0 0 0 ...
 $ hr_lev_x_6                             : num  0 0 0 0 0 0 1 0 0 0 ...
 $ hr_lev_x_7                             : num  0 0 0 0 0 0 0 1 0 0 ...
 $ hr_lev_x_8                             : num  0 0 0 0 0 0 0 0 1 0 ...
 $ hr_lev_x_9                             : num  0 0 0 0 0 0 0 0 0 1 ...
 $ weathersit_lev_x_Clear_to_partly_cloudy: num  1 1 1 1 1 1 1 1 1 1 ...
 $ weathersit_lev_x_Light_Precipitation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ weathersit_lev_x_Misty                 : num  0 0 0 0 0 0 0 0 0 0 ...
str(bikesAugust.treat)
'data.frame':   744 obs. of  33 variables:
 $ holiday                                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ workingday                             : num  1 1 1 1 1 1 1 1 1 1 ...
 $ temp                                   : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
 $ atemp                                  : num  0.636 0.606 0.576 0.576 0.591 ...
 $ hum                                    : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
 $ windspeed                              : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
 $ hr_lev_x_0                             : num  1 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_1                             : num  0 1 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_10                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_11                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_12                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_13                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_14                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_15                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_16                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_17                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_18                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_19                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_2                             : num  0 0 1 0 0 0 0 0 0 0 ...
 $ hr_lev_x_20                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_21                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_22                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_23                            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ hr_lev_x_3                             : num  0 0 0 1 0 0 0 0 0 0 ...
 $ hr_lev_x_4                             : num  0 0 0 0 1 0 0 0 0 0 ...
 $ hr_lev_x_5                             : num  0 0 0 0 0 1 0 0 0 0 ...
 $ hr_lev_x_6                             : num  0 0 0 0 0 0 1 0 0 0 ...
 $ hr_lev_x_7                             : num  0 0 0 0 0 0 0 1 0 0 ...
 $ hr_lev_x_8                             : num  0 0 0 0 0 0 0 0 1 0 ...
 $ hr_lev_x_9                             : num  0 0 0 0 0 0 0 0 0 1 ...
 $ weathersit_lev_x_Clear_to_partly_cloudy: num  1 1 1 1 0 0 1 0 0 0 ...
 $ weathersit_lev_x_Light_Precipitation   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ weathersit_lev_x_Misty                 : num  0 0 0 0 1 1 0 1 1 1 ...

The bike data is now in completely numeric form, ready to use with xgboost. Note that the treated data does not include the outcome column.

5.4 Gradient boosting machines

5.4.1 Find the right number of trees for a gradient boosting machine

We will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. We will train the model on data from the month of July.

Remember that bikesJuly.treat no longer has the outcome column, so we must get it from the untreated data: bikesJuly$cnt.

We will use the xgboost package to fit the gradient boosting model. The function xgb.cv() uses cross-validation to estimate the out-of-sample learning error as each new tree is added to the model. The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE.

For this example, the key arguments to the xgb.cv() call are:

  • data: a numeric matrix.
  • label: vector of outcomes (also numeric).
  • nrounds: the maximum number of rounds (trees to build).
  • nfold: the number of folds for the cross-validation. 5 is a good number.
  • objective: “reg:linear” for continuous outcomes.
  • eta: the learning rate.
  • max_depth: depth of trees.
  • early_stopping_rounds: after this many rounds without improvement, stop.
  • verbose: 0 to stay silent.
# Load the package xgboost
library(xgboost)

# Run xgb.cv
cv <- xgb.cv(data = as.matrix(bikesJuly.treat), 
            label = bikesJuly$cnt,
            nrounds = 100,
            nfold = 5,
            objective = "reg:linear",
            eta = 0.3,
            max_depth = 6,
            early_stopping_rounds = 10,
            verbose = 0    # silent
)

# Get the evaluation log 
elog <- cv$evaluation_log

# Determine and print how many trees minimize training and test error
elog %>% 
   summarize(ntrees.train = which.min(train_rmse_mean),   # find the index of min(train_rmse_mean)
             ntrees.test  = which.min(test_rmse_mean))   # find the index of min(test_rmse_mean)
ntrees.train ntrees.test
          65          55

In most cases, ntrees.test is less than ntrees.train. The training error keeps decreasing even after the test error starts to increase. It’s important to use cross-validation to find the right number of trees (as determined by ntrees.test) and avoid an overfit model.
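
One way to see that pattern is to plot the mean cross-validated training and test RMSE against the number of trees (a sketch reusing elog from the xgb.cv call above):

# Plot mean train and test RMSE as each tree is added
elog %>% 
  gather(key = error_type, value = rmse, train_rmse_mean, test_rmse_mean) %>%
  ggplot(aes(x = iter, y = rmse, color = error_type)) + 
  geom_line() + 
  ggtitle("Cross-validated RMSE by number of trees")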

5.4.2 Fit an xgboost bike rental model and predict

We will fit a gradient boosting model using xgboost() to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. We will train the model on data from the month of July and predict on data for the month of August.

Remember the vtreat-ed data no longer has the outcome column, so you must get it from the original data (the cnt column).

ntrees <- 84
# The number of trees to use, as determined by xgb.cv
ntrees
[1] 84
# Run xgboost
bike_model_xgb <- xgboost(data = as.matrix(bikesJuly.treat), # training data as matrix
                   label = bikesJuly$cnt,  # column of outcomes
                   nrounds = ntrees,       # number of trees to build
                   objective = "reg:linear", # objective
                   eta = 0.3,
                   max_depth = 6,
                   verbose = 0  # silent
)

# Make predictions
bikesAugust$pred <- predict(bike_model_xgb, as.matrix(bikesAugust.treat))

# Plot predictions (on x axis) vs actual bike rental count
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()

Overall, the scatterplot looks pretty good, but did you notice that the model made some negative predictions?

5.4.3 Evaluate the xgboost bike rental model

We will evaluate the gradient boosting model bike_model_xgb using data from the month of August. We’ll compare this model’s RMSE for August to the RMSE of previous models that we’ve built.

# bikesAugust is in the workspace
str(bikesAugust)
'data.frame':   744 obs. of  13 variables:
 $ hr        : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ workingday: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ weathersit: chr  "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
 $ temp      : num  0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
 $ atemp     : num  0.636 0.606 0.576 0.576 0.591 ...
 $ hum       : num  0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
 $ windspeed : num  0.1642 0.0896 0.1045 0.1045 0.1343 ...
 $ cnt       : int  47 33 13 7 4 49 185 487 681 350 ...
 $ instant   : int  13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
 $ mnth      : int  8 8 8 8 8 8 8 8 8 8 ...
 $ yr        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ pred      : num  48.548 35.349 0.625 -6.652 3.92 ...
# Calculate RMSE
bikesAugust %>%
  mutate(residuals = cnt - pred) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
     rmse
 76.73242

Even though this gradient boosting model made some negative predictions, overall it makes smaller errors than the previous two models. Perhaps rounding negative predictions up to zero is a reasonable tradeoff.
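
As a quick sketch of that tradeoff, we could clip the predictions at zero with pmax() and recompute the RMSE:

# Round negative predictions up to zero, then recompute RMSE
bikesAugust %>%
  mutate(pred_clipped = pmax(pred, 0),
         residuals = cnt - pred_clipped) %>%
  summarize(rmse = sqrt(mean(residuals^2)))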

5.4.4 Visualize the xgboost bike rental model

You’ve now seen three different ways to model the bike rental data. For this example, you’ve seen that the gradient boosting model had the smallest RMSE. Let’s compare the gradient boosting model’s predictions to the other two models as a function of time.

# Print quasipoisson_plot
quasipoisson_plot


# Print randomforest_plot
randomforest_plot


# Plot predictions and actual bike rentals as a function of time (days)
bikesAugust %>% 
  mutate(instant = (instant - min(instant))/24) %>%  # set start to 0, convert unit to days
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # first two weeks
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Gradient Boosting model")

The gradient boosting pattern captures rental variations due to time of day and other factors better than the previous models.

---
title: "Supervised Learning in R: Regression"
output:
  html_notebook:
    toc: true
    toc_float: true
    toc_collapsed: false
    number_sections: true
    toc_depth: 3
---
# What is Regression?

## Introduction

From a machine learning perspective, the term regression generally encompasses the prediction of continuous values. Statistically, these predictions are the expected value, or the average value one would observe for the given input values.

## Linear regression - the fundamental method

### Code a simple one-variable regression

We'll create a formula to define a one-variable modeling task, and then fit a linear model to the data. We are given the rates of male and female unemployment in the United States over several years ([Source](http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/slr/frames/slr04.html)).
```{r}
unemployment <- readRDS("unemployment.rds")
```
The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.

The sign of the variable coefficient tells you whether the outcome increases (+) or decreases (-) as the variable increases.

The calling interface for lm() is:

    lm(formula, data = ___)

```{r}
# unemployment is loaded in the workspace
summary(unemployment)

# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment

# Print it
fmla

# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)

# Print it
unemployment_model
```
The coefficient for male unemployment is positive, so female unemployment increases as male unemployment does. Linear regression is the most basic of regression approaches. You can think of this course as ways to address its limitations.

### Examining a model

Let's look at the model unemployment_model that we have just created. There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().
```{r}
# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model

# Call summary() on unemployment_model to get more details
summary(unemployment_model)

# Call glance() on unemployment_model to see the details in a tidier form
library(broom)
glance(unemployment_model)

# Call wrapFTest() on unemployment_model to see the most relevant details
library(sigr)
wrapFTest(unemployment_model)
```
There are several different ways to get diagnostics for our model. Use the one that suits your needs or preferences the best.

## Predicting once you fit a model

### Predicting from the unemployment model

We will use our unemployment model unemployment_model to make predictions from the unemployment data, and compare predicted female unemployment rates to the actual observed female unemployment rates on the training data, unemployment. We will also use our model to predict on the new data in newrates, which consists of only one observation, where male unemployment is 5%.

The predict() interface for lm models takes the form

    predict(model, newdata)
    
We will use the ggplot2 package to make the plots, so we will add the prediction column to the unemployment data frame. We will plot outcome versus prediction, and compare them to the line that represents perfect predictions (that is when the outcome is equal to the predicted value).

The ggplot2 command to plot a scatterplot of dframe\$outcome versus dframe\$pred (pred on the x axis, outcome on the y axis), along with a blue line where outcome == pred is as follows:

    ggplot(dframe, aes(x = pred, y = outcome)) + 
           geom_point() +  
           geom_abline(color = "blue")

```{r}
newrates <- data.frame(male_unemployment = 5)
newrates
```
```{r}
# unemployment is in your workspace
summary(unemployment)

# newrates is in your workspace
newrates

# Predict female unemployment in the unemployment data set
unemployment$prediction <-  predict(unemployment_model, unemployment)

# load the ggplot2 package
library(ggplot2)

# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point() +
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newrates)
# Print it
pred
```
While all the modeling algorithms in R implement the predict() method, the call may be a little different for each one.
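
For example (a sketch with hypothetical fitted models lm_model and glm_model, and new data newdata):

    # lm: predict() returns predictions on the scale of the outcome
    predict(lm_model, newdata = newdata)

    # logistic glm: type = "response" is needed to get probabilities
    predict(glm_model, newdata = newdata, type = "response")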

### Multivariate linear regression

We will work with the blood pressure dataset ([Source](http://college.cengage.com/mathematics/brase/understandable_statistics/7e/students/datasets/mlr/frames/frame.html)), and model blood_pressure as a function of weight and age.

```{r}
bloodpressure <- readRDS("bloodpressure.rds")
```
```{r}
# bloodpressure is in the workspace
summary(bloodpressure)

# Create the formula and print it
fmla <- blood_pressure ~ weight + age
fmla

# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)

# Print bloodpressure_model and call summary() 
bloodpressure_model
summary(bloodpressure_model)
```
One of the advantages of linear regression is that you can interpret the effects of each variable on the outcome – to a certain extent. In this case the coefficients for both age and weight are positive, which indicates that blood pressure tends to increase as both age and weight increase.

Now we will make predictions using the blood pressure model bloodpressure_model.
```{r}
# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)

# plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure)) + 
    geom_point() +
    geom_abline(color = "blue")
```
The results stay fairly close to the line of perfect prediction, indicating that the model fits the training data well. From a prediction perspective, multivariate linear regression behaves much as simple (one-variable) linear regression does.

## Wrapping up linear regression

### Pros and Cons of Linear Regression

* Pros
  * Easy to fit and to apply
  * Concise
  * Less prone to overfitting
  * Interpretable
* Cons
  * Can only express linear and additive relationships

### Collinearity

* Collinearity -- when variables are partially correlated.
* Coefficients might change sign
* High collinearity:
  * Coefficients (or standard errors) look too large
  * Model may be unstable (see the sketch below)
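
A small simulated sketch of that instability (hypothetical data, where x2 is nearly a copy of x1):

    set.seed(123)
    N  <- 100
    x1 <- runif(N)
    x2 <- x1 + rnorm(N, sd = 0.01)    # x2 is almost identical to x1
    y  <- 2 * x1 + rnorm(N, sd = 0.1)
    summary(lm(y ~ x1 + x2))          # note the inflated standard errors on x1 and x2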

# Training and Evaluating Regression Models

## Evaluating a model graphically

### Graphically evaluate the unemployment model

We will graphically evaluate the unemployment model, unemployment_model, that we fit to the unemployment data.

We will plot the model's predictions against the actual female_unemployment. Then we will calculate the residuals:

    residuals <- actual outcome - predicted outcome

and plot predictions against residuals. The residual graph will take a slightly different form: we compare the residuals to the horizontal line $y=0$ (using geom_hline()) rather than to the line $x=y$. The command will be provided.

```{r}
# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, unemployment)

# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) + 
  geom_point() + 
  geom_abline()
```
```{r}
# From previous step
unemployment$predictions <- predict(unemployment_model)

# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment - unemployment$predictions

# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")
```
We have now evaluated model predictions by comparing them to ground truth, and by examining prediction error.

### The gain curve to evaluate the unemployment model

We made predictions about female_unemployment and visualized the predictions and the residuals. Now, we will also plot the gain curve of the unemployment_model's predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.

For situations where order is more important than exact values, the gain curve helps you check if the model's predictions sort in the same order as the true outcome.

Calls to the function GainCurvePlot() look like:

    GainCurvePlot(frame, xvar, truthvar, title)
    
where

* frame is a data frame
* xvar and truthvar are strings naming the prediction and actual outcome columns of frame
* title is the title of the plot

When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.
```{r}
# Load the package WVPlots
library(WVPlots)

# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")
```
A relative Gini coefficient close to one shows that the model correctly sorts high unemployment situations from lower ones.

## Root Mean Squared Error (RMSE)

### Calculate RMSE

We will calculate the RMSE of your unemployment model. We added two columns to the unemployment dataset:

* the model's predictions (predictions column)
* the residuals between the predictions and the outcome (residuals column)

We can calculate the RMSE from a vector of residuals, $res$, as:

$RMSE = \sqrt{mean(res^2)}$

We want RMSE to be small. How small is "small"? One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.

```{r}
# For convenience put the residuals in the variable res
res <- unemployment$residuals

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res**2)))

# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
```
An RMSE much smaller than the outcome's standard deviation suggests a model that predicts well.

## R-Squared

### Calculate R-Squared

We will examine how well the model fits the data: that is, how much variance it explains. We can do this using $R^2$.

Suppose $y$ is the true outcome, $p$ is the prediction from the model, and $res=y−p$ are the residuals of the predictions.

Then the total sum of squares $tss$ ("total variance") of the data is:

$tss = \sum{(y - \overline{y})^2}$

where $\overline{y}$ is the mean value of $y$.

The residual sum of squared errors of the model, $rss$ is:

$rss = \sum{res^2}$

$R^2$ (R-Squared), the "variance explained" by the model, is then:

$1 - \frac{rss}{tss}$

After we calculate $R^2$, we will compare what we computed with the $R^2$ reported by glance(). glance() returns a one-row data frame; for a linear regression model, one of the columns returned is the $R^2$ of the model on the training data.
```{r}
# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))

# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))

# Calculate residual sum of squares: rss. Print it
(rss <- sum((unemployment$residuals)^2))

# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - rss/tss)

# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
```
An R-squared close to one suggests a model that predicts well.

### Correlation and R-squared

The linear correlation of two variables, $x$ and $y$, measures the strength of the linear relationship between them. When $x$ and $y$ are respectively:

* the predictions of a regression model that minimizes squared-error (like linear regression) and
* the true outcomes of the training data,

then the square of the correlation is the same as $R^2$. We will verify that in this exercise.

```{r}
# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))

# Square rho: rho2 and print it
(rho2 <- rho^2)

# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
```
Remember this equivalence is only true for the training data, and only for models that minimize squared error.

## Properly Training a Model

### Generating a random test/train split

We will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.

We will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().

If you have a data set dframe of size $N$, and you want a random subset of approximately size $100 \times X\%$ of $N$ (where $X$ is between 0 and 1), then:

1. Generate a vector of uniform random numbers: gp = runif(N).
2. dframe[gp < X,] will be about the right size.
3. dframe[gp >= X,] will be the complement.
```{r}
# mpg is in the workspace
summary(mpg)
dim(mpg)
```
```{r}
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))

# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(N * 0.75))

# Create the vector of N uniform random variables: gp
gp <- runif(N)

# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp < 0.75,]
mpg_test <- mpg[gp >= 0.75,]

# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
nrow(mpg_test)
```
A random split won't always produce sets of exactly X% and (100-X)% of the data, but it should be close.

### Train a model using test/train split

We will use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).
```{r}
# mpg_train is in the workspace
summary(mpg_train)

# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)

# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy 
mpg_model <- lm(fmla, mpg_train)

# Use summary() to examine the model
summary(mpg_model)
```
### Evaluate a model using test/train split

Now we will test the model mpg_model on the test data, mpg_test. Functions rmse() and r_squared() to calculate RMSE and R-squared have been provided for convenience:

    rmse(predcol, ycol)
    r_squared(predcol, ycol)
    
```{r}
# rmse function
rmse <- function (predcol, ycol) 
{
    res = predcol - ycol
    sqrt(mean(res^2))
}

# r_squared function
r_squared <- function (predcol, ycol) 
{
    tss = sum((ycol - mean(ycol))^2)
    rss = sum((predcol - ycol)^2)
    1 - rss/tss
}
```
where:

- predcol: The predicted values
- ycol: The actual outcome

We will also plot the predictions vs. the outcome.

Generally, model performance is better on the training data than the test data (though sometimes the test set "gets lucky"). A slight difference in performance is okay; if the performance on training is significantly better, there is a problem.
```{r}
# Examine the objects in the workspace
ls.str()

# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model, mpg_train)

# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, mpg_test)

# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))


# Evaluate the r-squared on both training and test data and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))

# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) + 
  geom_point() + 
  geom_abline()
```
Good performance on the test data is more confirmation that the model works as expected.

### Create a cross validation plan

There are several ways to implement an n-fold cross validation plan. We will create such a plan using vtreat::kWayCrossValidation(), and examine it.

kWayCrossValidation() creates a cross validation plan with the following call:

    splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)

where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren't used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. We can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

* train: the indices of dframe that will form the training set
* app: the indices of dframe that will form the test (or application) set

We will create a 3-fold cross-validation plan for the data set mpg.
```{r}
# Load the package vtreat
library(vtreat)

# mpg is in the workspace
summary(mpg)

# Get the number of rows in mpg
nRows <- nrow(mpg)

# Implement the 3-fold cross-fold plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, dframe = NULL, y = NULL)

# Examine the split plan
str(splitPlan)
```
### Evaluate a modeling procedure using n-fold cross-validation

We will use splitPlan to make predictions from a model that predicts mpg\$cty from mpg\$hwy.

If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:

    # Initialize a column of the appropriate length
    dframe$pred.cv <- 0 
    
    # k is the number of folds
    # splitPlan is the cross validation plan
    
    for(i in 1:k) {
      # Get the ith split
      split <- splitPlan[[i]]
    
      # Build a model on the training data 
      # from this split 
      # (lm, in this case)
      model <- lm(fmla, data = dframe[split$train,])
    
      # make predictions on the 
      # application data from this split
      dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
    }

Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.
```{r}
# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train,])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app,])
}

# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))

# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)

# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
```
We have successfully estimated a model's out-of-sample error via cross-validation. Remember, cross-validation validates the modeling process, not an actual model.

# Issues to Consider

## Categorical inputs

### Examining the structure of categorical inputs

For this exercise you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling. The dataset flowers (derived from the Sleuth3 package) is loaded into your workspace. It has the following columns:

* Flowers: the average number of flowers on a meadowfoam plant
* Intensity: the intensity of a light treatment applied to the plant
* Time: A categorical variable - when (Late or Early) in the lifecycle the light treatment occurred

The ultimate goal is to predict Flowers as a function of Time and Intensity.
```{r}
flowers <- readRDS("flowers.rds")
```
```{r}
# Call str on flowers to see the types of each column
str(flowers)

# Use unique() to see how many possible values Time takes
unique(flowers$Time)

# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))

# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, data = flowers)

# Examine the first 20 lines of flowers
head(flowers, 20)

# Examine the first 20 lines of mmat
head(mmat, 20)
```
Now we have seen how most R modeling functions represent categorical variables internally.

### Modeling with categorical inputs

We will fit a linear model to the flowers data, to predict Flowers as a function of Time and Intensity.
```{r}
# flowers is in the workspace
str(flowers)

# fmla is in the workspace
fmla

# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla, data = flowers)

# Use summary on mmat to remind yourself of its structure
summary(mmat)

# Use summary to examine flower_model 
summary(flower_model)

# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model, flowers)

# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) + 
  geom_point() +
  geom_abline(color = "blue") 
```
We've successfully fit a model with a categorical variable, and seen how categorical variables are represented in that model.

## Interactions

### Modeling an interaction

We will use interactions to model the effect of gender and gastric activity on alcohol metabolism.

The data frame alcohol has columns:

- Metabol: the alcohol metabolism rate
- Gastric: the rate of gastric alcohol dehydrogenase activity
- Sex: the sex of the drinker (Male or Female)
```{r}
alcohol <- readRDS("alcohol.rds")
```
We can fit three models to the alcohol data:

- one with only additive (main effect) terms : Metabol ~ Gastric + Sex
- two models, each with interactions between gastric activity and sex

We saw that one of the models with interaction terms had a better R-squared than the additive model, suggesting that using interaction terms gives a better fit. In this exercise we will compare the R-squared of one of the interaction models to the main-effects-only model.

Recall that the operator : designates the interaction between two variables. The operator * designates the interaction between the two variables, plus the main effects.

$x*y = x + y + x:y$
```{r}
# alcohol is in the workspace
summary(alcohol)

# Create the formula with main effects only
(fmla_add <- Metabol ~ Gastric + Sex)

# Create the formula with main effect and interactions
(fmla_main_and_interaction <- Metabol ~ Gastric * Sex)

# Create the formula with interactions
(fmla_interaction <- Metabol ~ Gastric + Gastric:Sex)

# Fit the main effects only model
model_add <- lm(fmla_add, alcohol)

# Fit the main effects and interactions
model_main_and_interaction <- lm(fmla_main_and_interaction, alcohol)

# Fit the interaction model
model_interaction <- lm(fmla_interaction, alcohol)

# Call summary on both models and compare
summary(model_add)
summary(model_main_and_interaction)
summary(model_interaction)
```
We will compare the performance of the interaction model you fit in the previous exercise to the performance of a main-effects only model. Because this data set is small, we will use cross-validation to simulate making predictions on out-of-sample data.

We will use the dplyr package to do calculations.

- mutate() adds new columns to a tbl (a type of data frame)
- group_by() specifies how rows are grouped in a tbl
- summarize() computes summary statistics of a column

You will also use tidyr's gather() which takes multiple columns and collapses them into key-value pairs.
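
As a tiny sketch of gather() on a hypothetical frame:

    df <- data.frame(id = 1:2, pred_a = c(1, 2), pred_b = c(3, 4))
    tidyr::gather(df, key = modeltype, value = pred, pred_a, pred_b)
    #   id modeltype pred
    # 1  1    pred_a    1
    # 2  2    pred_a    2
    # 3  1    pred_b    3
    # 4  2    pred_b    4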
```{r}
# alcohol is in the workspace
summary(alcohol)

# Both the formulae are in the workspace
fmla_add
fmla_interaction

# Create the splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)

# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_add <- lm(fmla_add, data = alcohol[split$train, ])
  alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}

# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
  alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}

# Get RMSE
alcohol %>% 
  gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
  mutate(residuals = Metabol - pred) %>%      
  group_by(modeltype) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
```
Cross-validation confirms that a model with interaction will likely give better predictions.

## Transforming the response before modeling

### Relative error

We will compare relative error to absolute error. For the purposes of modeling, we will define relative error as:

$rel = \frac{(y-pred)}{y}$

that is, the error is relative to the true outcome. You will measure the overall relative error of a model using root mean squared relative error:

$rmse_{rel} = \sqrt{\overline{rel^2}}$

where $\overline{rel^2}$ is the mean of $rel^2$.

The example (toy) dataset fdata is loaded in your workspace. It includes the columns:

- y: the true output to be predicted by some model; imagine it is the amount of money a customer will spend on a visit to your store.
- pred: the predictions of a model that predicts y.
- label: categorical: whether y comes from a population that makes small purchases, or large ones.

You want to know which model does "better": the one predicting the small purchases, or the one predicting large ones.
```{r}
fdata <- readRDS("fdata.rds")
```
```{r}
# fdata is in the workspace
summary(fdata)

# Examine the data: generate the summaries for the groups large and small:
fdata %>% 
    group_by(label) %>%     # group by small/large purchases
    summarize(min  = min(y),   # min of y
              mean = mean(y),   # mean of y
              max  = max(y))   # max of y

# Fill in the blanks to add error columns
fdata2 <- fdata %>% 
         group_by(label) %>%       # group by label
           mutate(residual = y - pred,  # Residual
                  relerr   = residual / y)  # Relative error

# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>% 
  group_by(label) %>% 
  summarize(rmse     = sqrt(mean(residual^2)),   # RMSE
            rmse.rel = sqrt(mean(relerr^2)))   # Root mean squared relative error
            
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) + 
  geom_point() + 
  geom_abline() + 
  facet_wrap(~ label, ncol = 1, scales = "free") + 
  ggtitle("Outcome vs prediction")
```
Notice from this example how a model with larger RMSE might still be better, if relative errors are more important than absolute errors.

### Modeling log-transformed monetary output

We will practice modeling on log-transformed monetary output, and then transforming the "log-money" predictions back into monetary units.

The data records subjects' incomes in 2005 (Income2005), as well as the results of several aptitude tests taken by the subjects in 1981:

- Arith
- Word
- Parag
- Math
- AFQT (Percentile on the Armed Forces Qualifying Test)

The data have already been split into training and test sets (income_train and income_test respectively).
```{r}
income_train <- readRDS("income_train.rds")
income_test <- readRDS("income_test.rds")
```
We will build a model of log(income) from the inputs, and then convert log(income) back into income.
```{r}
# Examine Income2005 in the training set
summary(income_train$Income2005)

# Write the formula for log income as a function of the tests and print it
(fmla.log <- log(Income2005) ~ Arith + Word + Parag + Math + AFQT)

# Fit the linear model
model.log <-  lm(fmla.log, data = income_train)

# Make predictions on income_test
income_test$logpred <- predict(model.log, income_test)
summary(income_test$logpred)

# Convert the predictions to monetary units
income_test$pred.income <- exp(income_test$logpred)
summary(income_test$pred.income)

#  Plot predicted income (x axis) vs income
ggplot(income_test, aes(x = pred.income, y = Income2005)) + 
  geom_point() + 
  geom_abline(color = "blue")
```
Remember that when we transform the output before modeling, we have to 'reverse transform' the resulting predictions after applying the model.

### Comparing RMSE and root-mean-squared Relative Error

We will show that log-transforming a monetary output before modeling improves mean relative error (but increases RMSE) compared to modeling the monetary output directly. You will compare the results of model.log from the previous exercise to a model (model.abs) that directly fits income.

```{r}
# Write the formula for income as a function of the tests and print it
(fmla.abs <- Income2005 ~ Arith + Word + Parag + Math + AFQT)

# Fit the linear model
model.abs <-  lm(fmla.abs, data = income_train)

# model.abs is in the workspace
summary(model.abs)
```
```{r}
# Add predictions to the test set
income_test <- income_test %>%
  mutate(pred.absmodel = predict(model.abs, income_test),        # predictions from model.abs
         pred.logmodel = exp(predict(model.log, income_test)))   # predictions from model.log

# Gather the predictions and calculate residuals and relative error
income_long <- income_test %>% 
  gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
  mutate(residual = pred - Income2005,   # residuals
         relerr   = residual / Income2005)   # relative error

# Calculate RMSE and relative RMSE and compare
income_long %>% 
  group_by(modeltype) %>%      # group by modeltype
  summarize(rmse     = sqrt(mean(residual^2)),    # RMSE
            rmse.rel = sqrt(mean(relerr^2)))    # Root mean squared relative error
```
We've seen how modeling log(income) can reduce the relative error of the fit, at the cost of increased RMSE. Which tradeoff to make depends on the goals of our project.

## Transforming inputs before modeling

### Input transforms: the "hockey stick"

We will build a model to predict price from a measure of the house's size (surface area). 
```{r}
houseprice <- readRDS("houseprice.rds")
```
The data set houseprice has the columns:

- price : house price in units of $1000
- size: surface area

A scatterplot of the data shows that the data is quite non-linear: a sort of "hockey stick", where price is fairly flat for smaller houses but rises steeply as the house gets larger. Quadratics and cubics are often good functional forms to express hockey-stick-like relationships. Note that there may not be a "physical" reason that price is related to the square of the size; a quadratic is simply a closed-form approximation of the observed relationship.
```{r}
ggplot(houseprice, aes(size, price))+
  geom_point()
```
We will fit a model to predict price as a function of the squared size, and look at its fit on the training data.

Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself.

    exampleFormula = y ~ I(x^2)
    
```{r}
# houseprice is in the workspace
summary(houseprice)

# Create the formula for price as a function of squared size
(fmla_sqr <- price ~ I(size^2))

# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, houseprice)

# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, houseprice)

# Make predictions and compare
houseprice %>% 
    mutate(pred_lin = predict(model_lin, houseprice),       # predictions from linear model
           pred_sqr = predict(model_sqr, houseprice)) %>%   # predictions from quadratic model 
    gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
    ggplot(aes(x = size)) + 
       geom_point(aes(y = price)) +                   # actual prices
       geom_line(aes(y = pred, color = modeltype)) + # the predictions
       scale_color_brewer(palette = "Dark2")
```
A quadratic model seems to fit the houseprice data better than a linear model. We will confirm whether the quadratic model would perform better on out-of-sample data. Since this data set is small, we will use cross-validation. 
```{r}
# houseprice is in the workspace
summary(houseprice)

# fmla_sqr is in the workspace
fmla_sqr

# Create a splitting plan for 3-fold cross validation
set.seed(34245)  # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(houseprice), 3, NULL, NULL)

# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0  # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_lin <- lm(price ~ size, data = houseprice[split$train,])
  houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app,])
}

# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
  split <- splitPlan[[i]]
  model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
  houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}

# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
  gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
  mutate(residuals = pred - price)

# Compare the cross-validated RMSE for the two models
houseprice_long %>% 
  group_by(modeltype) %>% # group by modeltype
  summarize(rmse = sqrt(mean(residuals^2)))
```
We've confirmed that the quadratic input transformation improved the model.

# Dealing with Non-Linear Responses

## Logistic regression to predict probabilities

### Fit a model of sparrow survival probability

We will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow.

The outcome to be predicted is status ("Survived", "Perished"). The variables we will consider are:

- total_length: length of the bird from tip of beak to tip of tail (mm)
- weight: in grams
- humerus : length of humerus ("upper arm bone" that connects the wing to the body) (inches)

When using glm() to create a logistic regression model, we must explicitly specify that family = binomial:

    glm(formula, data = data, family = binomial)
    
We will call summary() and broom::glance() to see different functions for examining a logistic regression model. One of the diagnostics that we will look at is the analog to $R^2$, called pseudo-$R^2$.

$pseudoR^2 = 1 - \frac{deviance}{null.deviance}$

You can think of deviance as analogous to variance: it is a measure of the variation in categorical data. The pseudo-$R^2$ is analogous to $R^2$ for standard regression: $R^2$ is a measure of the "variance explained" of a regression model. The pseudo-$R^2$ is a measure of the "deviance explained".

```{r}
library(broom)
sparrow <- readRDS("sparrow.rds")
```
```{r}
# sparrow is in the workspace
summary(sparrow)

# Create the survived column
sparrow$survived <- ifelse(sparrow$status == "Survived", TRUE, FALSE)

# Create the formula
(fmla <- survived ~ total_length + weight + humerus)

# Fit the logistic regression model
sparrow_model <- glm(fmla, data = sparrow, family = binomial)

# Call summary
summary(sparrow_model)

# Call glance
(perf <- glance(sparrow_model))

# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
```
We've fit a logistic regression model to predict probabilities! When looking at the pseudo-$R^2$ of a logistic regression model, we should hope to see a value close to 1.

### Predict sparrow survival

We will predict the probability of survival. When calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:

    predict(model, type = "response")
    
Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.
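
As a sketch of the relationship (reusing sparrow_model from the previous exercise):

    # The default predict() is on the link scale: log-odds
    logodds <- predict(sparrow_model)
    # The sigmoid converts log-odds back to probabilities
    probs <- 1 / (1 + exp(-logodds))
    # probs matches predict(sparrow_model, type = "response") up to floating point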

We will also use the GainCurvePlot() function to plot the gain curve from the model predictions. If the model's gain curve is close to the ideal ("wizard") gain curve, then the model sorted the sparrows well: that is, the model predicted that sparrows that actually survived would have a higher probability of survival. The inputs to the GainCurvePlot() function are:

- frame: data frame with prediction column and ground truth column
- xvar: the name of the column of predictions (as a string)
- truthVar: the name of the column with actual outcome (as a string)
- title: a title for the plot (as a string)

    GainCurvePlot(frame, xvar, truthVar, title)
    
```{r}
# sparrow is in the workspace
summary(sparrow)

# sparrow_model is in the workspace
summary(sparrow_model)

# Make predictions
sparrow$pred <- predict(sparrow_model, type = "response")

# Look at gain curve
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")
```
We see from the gain curve that the model follows the wizard curve for about the first 30% of the data, identifying about 45% of the surviving sparrows with only a few false positives.

## Poisson and quasipoisson regression to predict counts

### Poisson or quasipoisson

One of the assumptions of Poisson regression to predict counts is that the event you are counting is Poisson distributed: the average count per unit time is the same as the variance of the count. In practice, "the same" means that the mean and the variance should be of a similar order of magnitude.

When the variance is much larger than the mean, the Poisson assumption doesn't apply, and one solution is to use quasipoisson regression, which does not assume that variance=mean.
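
A quick sketch of the Poisson property with simulated counts:

    # For Poisson-distributed counts, the variance tracks the mean
    set.seed(42)
    counts <- rpois(10000, lambda = 5)
    mean(counts)   # close to 5
    var(counts)    # also close to 5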

### Fit a model to predict bike rental counts

We will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. We will train the model on data from the month of July.

The data frame has the columns:

- cnt: the number of bikes rented in that hour (the outcome)
- hr: the hour of the day (0-23, as a factor)
- holiday: TRUE/FALSE
- workingday: TRUE if neither a holiday nor a weekend, else FALSE
- weathersit: categorical, "Clear to partly cloudy"/"Light Precipitation"/"Misty"
- temp: normalized temperature in Celsius
- atemp: normalized "feeling" temperature in Celsius
- hum: normalized humidity
- windspeed: normalized windspeed
- instant: the time index -- number of hours since beginning of data set (not a variable)
- mnth and yr: month and year indices (not variables)

Remember that we must specify family = poisson or family = quasipoisson when using glm() to fit a count model.
```{r}
bikesAugust <- miceadds::load.Rdata2( filename="Bikes.RData")
bikesJuly <- readRDS("bikesJuly.rds")
```
```{r}
# bikesJuly is in the workspace
str(bikesJuly)

# The outcome column
(outcome <- "cnt")

# The inputs to use
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
```
```{r}
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))

# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
(var_bikes <- var(bikesJuly$cnt))

# Fit the model
bike_model <- glm(fmla, data = bikesJuly, family = quasipoisson)

# Call glance
(perf <- glance(bike_model))

# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
```
We've fit a (quasi)poisson model to predict counts! As with the logistic model, we hope for a pseudo-R2 near 1.

### Predict bike rentals on new data

We will use the model to make predictions for the month of August. 
```{r}
# bikesAugust is in the workspace
str(bikesAugust)

# bike_model is in the workspace
summary(bike_model)

# Make predictions on August data
bikesAugust$pred  <- predict(bike_model, newdata = bikesAugust, type = "response")

# Calculate the RMSE
bikesAugust %>% 
  mutate(residual = pred - cnt) %>%
  summarize(rmse  = sqrt(mean(residual^2)))

# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
  geom_point() + 
  geom_abline(color = "darkblue")
```
(Quasi)poisson models predict non-negative rates, making them useful for count or frequency data.

### Visualize the Bike Rental Predictions

We visualized the bike model's predictions using the standard "outcome vs. prediction" scatter plot. Since the bike rental data is time series data, we might be interested in how the model performs as a function of time. In this exercise, we will compare the predictions and actual rentals on an hourly basis, for the first 14 days of August.

To create the plot, we will use the function tidyr::gather() to consolidate the predicted and actual values from bikesAugust into a single column. gather() takes as arguments:

- The "wide" data frame to be gathered (implicit in a pipe)
- The name of the key column to be created - contains the names of the gathered columns.
- The name of the value column to be created - contains the values of the gathered columns.
- The names of the columns to be gathered into a single column.

We'll use the gathered data frame to compare the actual and predicted rental counts as a function of time. The time index, instant, counts the number of observations since the beginning of data collection. The code below converts the instants to daily units, starting from 0.
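
A tiny self-contained sketch of what gather() does, on a hypothetical two-row frame:
```{r}
library(tidyr)

# Stack cnt and pred into one value column; valuetype records the source column
wide <- data.frame(instant = 1:2, cnt = c(100, 250), pred = c(120, 230))
gather(wide, key = valuetype, value = value, cnt, pred)
```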
```{r}
# Plot predictions and cnt by date/time
(quasipoisson_plot <- bikesAugust %>% 
  # set start to 0, convert unit to days
  mutate(instant = (instant - min(instant))/24) %>%  
  # gather cnt and pred into a value column
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # restrict to first 14 days
  # plot value by instant
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Quasipoisson model"))
```
This model mostly identifies the slow and busy hours of the day, although it often underestimates peak demand.

## GAM to learn non-linear transforms

### Writing formulas for GAM models

When using gam() to model outcome as an additive function of the inputs, you can use the s() function inside formulas to designate that you want to use a spline to model the non-linear relationship of a continuous variable to the outcome.

Suppose that you want to predict how much weight (Wtloss) a dieter will lose over a 2-year diet plan as a function of:

- Diet type (categorical)
- Sex (categorical)
- Age at beginning of diet (continuous)
- BMI (body mass index) at beginning of diet (continuous)

    Wtloss ~ Diet + Sex + s(Age) + s(BMI)
    
This formula says that Age and BMI will both be modeled non-linearly.

Suppose that in the diet problem from the previous exercise, you now also want to take into account

- the dieter's resting metabolic rate (BMR -- continuous) and
- the dieter's average number of hours of aerobic exercise per day (E -- continuous) at the beginning of the study.

You have reason to believe that the relationship between BMR and weight loss is linear (and you want to model it that way), but not necessarily the relationship between aerobic exercise and weight loss.

    Wtloss ~ Diet + Sex + s(Age) + s(BMI) + BMR + s(E)

This formula says to model Age, BMI and E as non-linear, but model BMR as linear.
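
A hypothetical sketch of fitting this formula with mgcv::gam() (the diet data frame and its columns are assumed for illustration, not supplied here):

    library(mgcv)
    # Age, BMI, and E get splines; Diet, Sex, and BMR enter linearly
    model.diet <- gam(Wtloss ~ Diet + Sex + s(Age) + s(BMI) + BMR + s(E),
                      family = gaussian, data = diet)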

### Model soybean growth with GAM

We will model the average leaf weight on a soybean plant as a function of time (after planting). As we will see, the soybean plant doesn't grow at a steady rate, but rather has a "growth spurt" that eventually tapers off. Hence, leaf weight is not well described by a linear model.

We can designate which variable you want to model non-linearly in a formula with the s() function:

    y ~ s(x)
    
gam() from the package mgcv has the calling interface:

    gam(formula, family, data)

For standard regression, use family = gaussian (the default).
```{r}
soybean_train <- readRDS("soybean_train.rds")
```
```{r}
# soybean_train is in the workspace
summary(soybean_train)

# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) + 
  geom_point() 
```
```{r}
# Load the package mgcv
library(mgcv)

# Create the formula 
(fmla.gam <- weight ~ s(Time) )

# Fit the GAM Model
model.gam <- gam(fmla.gam, family = gaussian, data = soybean_train)
```
```{r}
# Create the formula 
(fmla.lin <- weight ~ Time )

# Fit the linear model
model.lin <- lm(fmla.lin, data = soybean_train)
```
```{r}
# From previous step
library(mgcv)
fmla.gam <- weight ~ s(Time)
model.gam <- gam(fmla.gam, data = soybean_train, family = gaussian)

# Call summary() on model.lin and look for R-squared
summary(model.lin)

# Call summary() on model.gam and look for R-squared
summary(model.gam)

# Call plot() on model.gam
plot(model.gam)
```
We have fit a generalized additive model. For this data, the GAM appears to fit the data better than a linear model, as measured by the R-squared.

### Predict with the soybean model on test data

We will apply the soybean models to the new data in soybean_test.

```{r}
soybean_test <- readRDS("soybean_test.rds")
```
```{r}
# soybean_test is in the workspace
summary(soybean_test)

# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)

# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))

# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
  gather(key = modeltype, value = pred, pred.lin, pred.gam)

# Calculate the rmse
soybean_long %>%
  mutate(residual = weight - pred) %>%     # residuals
  group_by(modeltype) %>%                  # group by modeltype
  summarize(rmse = sqrt(mean(residual^2))) # calculate the RMSE

# Compare the predictions against actual weights on the test data
soybean_long %>%
  ggplot(aes(x = Time)) +                          # the column for the x axis
  geom_point(aes(y = weight)) +                    # the y-column for the scatterplot
  geom_point(aes(y = pred, color = modeltype)) +   # the y-column for the point-and-line plot
  geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the y-column for the point-and-line plot
  scale_color_brewer(palette = "Dark2")
```
The GAM learns the non-linear growth function of the soybean plants, including the fact that weight is never negative.

# Tree-Based Methods

## The intuition behind tree-based methods

![](http://s3.amazonaws.com/assets.datacamp.com/production/course_3851/datasets/decisiontree_intelligence.png)
The tree predicts the expected intelligence (humans have an intelligence of 1) of several mammals, as a function of gestation time (in days) and average litter size for that species.

The leaf nodes show the expected intelligence of the data points in that node, as well as the percentage of the data that falls into that node.
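
A tree like this could be fit with the rpart package; here is a hypothetical sketch (a mammals data frame with intelligence, gestation, and litter columns is assumed, not supplied):

    library(rpart)
    library(rpart.plot)

    # Regression tree: predict intelligence from gestation time and litter size
    tree_model <- rpart(intelligence ~ gestation + litter, data = mammals)
    rpart.plot(tree_model)  # leaves show the mean outcome and % of data per node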

## Random forests

### Build a random forest model for bike rentals

We will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. We will train the model on data from the month of July.

We will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger() call are:

- formula
- data
- num.trees: the number of trees in the forest.
- respect.unordered.factors: specifies how to treat unordered factor variables. We recommend setting this to "order" for regression.
- seed: because this is a random algorithm, set the seed to get reproducible results

Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.
```{r}
# bikesJuly is in the workspace
str(bikesJuly)

# Random seed to reproduce results
seed <- 423563

# The outcome column
(outcome <- "cnt")

# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))

# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))

# Load the package ranger
library(ranger)

# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula 
                         bikesJuly, # data
                         num.trees = 500, 
                         respect.unordered.factors = "order", 
                         seed = seed))
```
We have fit a model to the data with a respectable R-squared.

### Predict bike rentals with the random forest model

We will use the model to predict bike rentals for the month of August.

The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. You can access predictions with the $ notation for accessing named elements of a list:

    predict(model, data)$predictions
    
```{r}
# bikesAugust is in the workspace
str(bikesAugust)

# bike_model_rf is in the workspace
bike_model_rf

# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions

# Calculate the RMSE of the predictions
bikesAugust %>% 
  mutate(residual = cnt - pred)  %>% # calculate the residual
  summarize(rmse  = sqrt(mean(residual^2)))      # calculate rmse

# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()
```
This random forest model outperforms the quasipoisson count model on the same data; it is discovering more complex non-linear or non-additive relationships in the data.

### Visualize random forest bike model predictions

We saw that the random forest bike model did better on the August data than the quasiposson model, in terms of RMSE.

We will visualize the random forest model's August predictions as a function of time.

Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but it somewhat underestimated peak demands. You would like to see how the random forest model compares.

The plot quasipoisson_plot of quasipoisson model predictions as a function of time is shown.
```{r}
quasipoisson_plot
```

```{r}
first_two_weeks <- bikesAugust %>% 
  # Set start to 0, convert unit to days
  mutate(instant = (instant - min(instant)) / 24) %>% 
  # Gather cnt and pred into a column named value with key valuetype
  gather(key = valuetype, value = value, cnt, pred) %>%
  # Filter for rows in the first two weeks
  filter(instant < 14)

# Plot predictions and cnt by date/time 
(randomforest_plot <- ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Random Forest plot"))
```
The random forest model captured the day-to-day variations in peak demand better than the quasipoisson model, but it still underestimates peak demand and overestimates minimum demand. So there is still room for improvement.

## One-Hot-Encoding Categorical Variables

### vtreat on a small example

We will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded "lev"), and to clean bad values out of numerical variables (coded "clean").

To design a treatment plan use the function designTreatmentsZ()

    treatplan <- designTreatmentsZ(data, varlist)

- data: the original training data frame
- varlist: a vector of input variables to be treated (as strings).

designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:

    scoreFrame <- treatplan %>% 
                magrittr::use_series(scoreFrame) %>% 
                select(varName, origName, code)

- varName: the name of the new treated variable
- origName: the name of the original variable that the treated variable comes from
- code: the type of the new variable.
    - "clean": a numerical variable with no NAs or NaNs
    - "lev": an indicator variable for a specific level of the original categorical variable.
    
(magrittr::use_series() is an alias for $ that you can use in pipes.)

For these exercises, we want varName where code is either "clean" or "lev":

    newvarlist <- scoreFrame %>% 
                 filter(code %in% c("clean", "lev")) %>%
                 magrittr::use_series(varName)

To transform the data set into all numerical and one-hot-encoded variables, use prepare():

    data.treat <- prepare(treatplan, data, varRestriction = newvarlist)
    
- treatplan: the treatment plan
- data: the data frame to be treated
- varRestriction: the variables desired in the treated data

```{r}
dframe <- readRDS("dframe.rds")
```
```{r}
# dframe is in the workspace
dframe

# Create and print a vector of variable names
(vars <- c("color", "size"))

# Load the package vtreat
library(vtreat)

# Create the treatment plan
treatplan <- designTreatmentsZ(dframe, vars)

# Examine the scoreFrame
(scoreFrame <- treatplan %>%
    magrittr::use_series(scoreFrame) %>%
    select(varName, origName, code))

# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
    filter(code %in% c("clean", "lev")) %>%
    magrittr::use_series(varName))

# Create the treated training data
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
```
We have successfully one-hot-encoded categorical data. The new indicator variables have '_lev_' in their names, and the new cleaned continuous variables have '_clean' in their names. The treated data is all numerical, with no missing values, and is suitable for use with xgboost and other R modeling functions.

### Novel levels

When a level of a categorical variable is rare, sometimes it will fail to show up in training data. If that rare level then appears in future data, downstream models may not know what to do with it. When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.

vtreat is a "safer" alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely. vtreat also manages missing values in the data (both categorical and continuous).

We will see how vtreat handles categorical values that did not appear in the training set. 

```{r}
testframe <- readRDS("testframe.rds")
```
```{r}
# treatplan is in the workspace
summary(treatplan)

# newvars is in the workspace
newvars

# Print dframe and testframe
dframe
testframe

# Use prepare() to one-hot-encode testframe
(testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))
```
As you saw, vtreat encodes novel colors like yellow, which were not present in the training data, as all zeros: 'none of the known colors'. This allows downstream models to accept these novel values without crashing.

### vtreat the bike rental data

We will create one-hot-encoded data frames of the July/August bike data, for use with xgboost later on.

For your convenience, we have defined the variable vars with the list of variable columns for the model.
```{r}
library(magrittr)
```
```{r}
# The outcome column
(outcome <- "cnt")

# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))

# Load the package vtreat
library(vtreat)

# Create the treatment plan from bikesJuly (the training data)
treatplan <- designTreatmentsZ(bikesJuly, vars, verbose = FALSE)

# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
  use_series(scoreFrame) %>%        
  filter(code %in% c("lev", "clean")) %>%  # get the rows you care about
  use_series(varName))           # get the varName column

# Prepare the training data
bikesJuly.treat <- prepare(treatplan, bikesJuly,  varRestriction = newvars)

# Prepare the test data
bikesAugust.treat <- prepare(treatplan, bikesAugust,  varRestriction = newvars)

# Call str() on the treated data
str(bikesJuly.treat)
str(bikesAugust.treat)
```
The bike data is now in completely numeric form, ready to use with xgboost. Note that the treated data does not include the outcome column.

## Gradient boosting machines

### Find the right number of trees for a gradient boosting machine

We will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. We will train the model on data from the month of July.

Remember that bikesJuly.treat no longer has the outcome column, so we must get it from the untreated data: bikesJuly$cnt.

We will use the xgboost package to fit the gradient boosting model. The function xgb.cv() uses cross-validation to estimate the out-of-sample learning error as each new tree is added to the model. The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE.

For this example, the key arguments to the xgb.cv() call are:

- data: a numeric matrix.
- label: vector of outcomes (also numeric).
- nrounds: the maximum number of rounds (trees to build).
- nfold: the number of folds for the cross-validation. 5 is a good number.
- objective: "reg:linear" for continuous outcomes (newer versions of xgboost call this "reg:squarederror").
- eta: the learning rate.
- max_depth: depth of trees.
- early_stopping_rounds: after this many rounds without improvement, stop.
- verbose: 0 to stay silent.

```{r}
# Load the package xgboost
library(xgboost)

# Run xgb.cv
cv <- xgb.cv(data = as.matrix(bikesJuly.treat), 
            label = bikesJuly$cnt,
            nrounds = 100,
            nfold = 5,
            objective = "reg:linear",
            eta = 0.3,
            max_depth = 6,
            early_stopping_rounds = 10,
            verbose = 0    # silent
)

# Get the evaluation log 
elog <- cv$evaluation_log

# Determine and print how many trees minimize training and test error
elog %>% 
   summarize(ntrees.train = which.min(train_rmse_mean),   # find the index of min(train_rmse_mean)
             ntrees.test  = which.min(test_rmse_mean))   # find the index of min(test_rmse_mean)
```
In most cases, ntrees.test is less than ntrees.train. The training error keeps decreasing even after the test error starts to increase. It's important to use cross-validation to find the right number of trees (as determined by ntrees.test) and avoid an overfit model.
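
To see this pattern directly, we can plot the train and test RMSE curves from the evaluation log (a quick sketch, using the elog computed above):
```{r}
# Train RMSE keeps falling; test RMSE bottoms out, then starts to rise
elog %>%
  gather(key = error_type, value = rmse, train_rmse_mean, test_rmse_mean) %>%
  ggplot(aes(x = iter, y = rmse, color = error_type)) +
  geom_line() +
  ggtitle("Cross-validated RMSE by number of trees")
```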

### Fit an xgboost bike rental model and predict

We will fit a gradient boosting model using xgboost() to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. We will train the model on data from the month of July and predict on data for the month of August.

Remember the vtreat-ed data no longer has the outcome column, so you must get it from the original data (the cnt column).
```{r}
ntrees <- 84
```
```{r}
# The number of trees to use, as determined by xgb.cv
ntrees

# Run xgboost
bike_model_xgb <- xgboost(data = as.matrix(bikesJuly.treat), # training data as matrix
                   label = bikesJuly$cnt,  # column of outcomes
                   nrounds = ntrees,       # number of trees to build
                   objective = "reg:linear", # objective
                   eta = 0.3,
                   max_depth = 6,          # depth of the trees
                   verbose = 0  # silent
)

# Make predictions
bikesAugust$pred <- predict(bike_model_xgb, as.matrix(bikesAugust.treat))

# Plot predictions (on x axis) vs actual bike rental count
ggplot(bikesAugust, aes(x = pred, y = cnt)) + 
  geom_point() + 
  geom_abline()
```
Overall, the scatterplot looks pretty good, but notice that the model made some negative predictions.

### Evaluate the xgboost bike rental model

We will evaluate the gradient boosting model bike_model_xgb using data from the month of August. We'll compare this model's RMSE for August to the RMSE of previous models that we've built.
```{r}
# bikesAugust is in the workspace
str(bikesAugust)

# Calculate RMSE
bikesAugust %>%
  mutate(residuals = cnt - pred) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
```
Even though this gradient boosting model made some negative predictions, overall it makes smaller errors than the previous two models. Perhaps rounding negative predictions up to zero is a reasonable tradeoff.
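
A quick sketch of that tradeoff: clip the negative predictions to zero with pmax() and recompute the RMSE.
```{r}
# Round negative predictions up to zero, then recompute the RMSE
bikesAugust %>%
  mutate(pred_clipped = pmax(pred, 0),
         residuals = cnt - pred_clipped) %>%
  summarize(rmse = sqrt(mean(residuals^2)))
```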

### Visualize the xgboost bike rental model

You've now seen three different ways to model the bike rental data. For this example, you've seen that the gradient boosting model had the smallest RMSE. Let's compare the gradient boosting model's predictions to the other two models as a function of time.
```{r}
# Print quasipoisson_plot
quasipoisson_plot

# Print randomforest_plot
randomforest_plot

# Plot predictions and actual bike rentals as a function of time (days)
bikesAugust %>% 
  mutate(instant = (instant - min(instant))/24) %>%  # set start to 0, convert unit to days
  gather(key = valuetype, value = value, cnt, pred) %>%
  filter(instant < 14) %>% # first two weeks
  ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) + 
  geom_point() + 
  geom_line() + 
  scale_x_continuous("Day", breaks = 0:14, labels = 0:14) + 
  scale_color_brewer(palette = "Dark2") + 
  ggtitle("Predicted August bike rentals, Gradient Boosting model")
```
The gradient boosting pattern captures rental variations due to time of day and other factors better than the previous models.
