From a machine learning perspective, the term regression generally encompasses the prediction of continuous values. Statistically, these predictions are the expected value, or the average value one would observe for the given input values.
We’ll create a formula to define a one-variable modeling task, and then fit a linear model to the data. We are given the rates of male and female unemployment in the United States over several years (Source).
unemployment <- readRDS("unemployment.rds")
The task is to predict the rate of female unemployment from the observed rate of male unemployment. The outcome is female_unemployment, and the input is male_unemployment.
The sign of the variable coefficient tells you whether the outcome increases (+) or decreases (-) as the variable increases.
The calling interface for lm() is:
lm(formula, data = ___)
# unemployment is loaded in the workspace
summary(unemployment)
male_unemployment female_unemployment
Min. :2.900 Min. :4.000
1st Qu.:4.900 1st Qu.:4.400
Median :6.000 Median :5.200
Mean :5.954 Mean :5.569
3rd Qu.:6.700 3rd Qu.:6.100
Max. :9.800 Max. :7.900
# Define a formula to express female_unemployment as a function of male_unemployment
fmla <- female_unemployment ~ male_unemployment
# Print it
fmla
female_unemployment ~ male_unemployment
# Use the formula to fit a model: unemployment_model
unemployment_model <- lm(fmla, data = unemployment)
# Print it
unemployment_model
Call:
lm(formula = fmla, data = unemployment)
Coefficients:
(Intercept) male_unemployment
1.4341 0.6945
The coefficient for male unemployment is positive, so female unemployment increases as male unemployment does. Linear regression is the most basic of regression approaches. You can think of this course as ways to address its limitations.
Let’s look at the model unemployment_model that we have just created. There are a variety of different ways to examine a model; each way provides different information. We will use summary(), broom::glance(), and sigr::wrapFTest().
# broom and sigr are already loaded in your workspace
# Print unemployment_model
unemployment_model
Call:
lm(formula = fmla, data = unemployment)
Coefficients:
(Intercept) male_unemployment
1.4341 0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)
Call:
lm(formula = fmla, data = unemployment)
Residuals:
Min 1Q Median 3Q Max
-0.77621 -0.34050 -0.09004 0.27911 1.31254
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.43411 0.60340 2.377 0.0367 *
male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5803 on 11 degrees of freedom
Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
library(broom)
glance(unemployment_model)
  r.squared adj.r.squared     sigma statistic      p.value    df    logLik      AIC
1 0.8213157     0.8050716 0.5802596  50.56108 1.965985e-05     2 -10.28471 26.56943
# Call wrapFTest() on unemployment_model to see the most relevant details
library(sigr)
wrapFTest(unemployment_model)
[1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."
There are several different ways to get diagnostics for our model. Use the one that suits your needs or preferences the best.
We will use our unemployment model unemployment_model to make predictions from the unemployment data, and compare predicted female unemployment rates to the actual observed female unemployment rates on the training data, unemployment. We will also use our model to predict on the new data in newrates, which consists of only one observation, where male unemployment is 5%.
The predict() interface for lm models takes the form
predict(model, newdata)
We will use the ggplot2 package to make the plots, so we will add the prediction column to the unemployment data frame. We will plot outcome versus prediction, and compare them to the line that represents perfect predictions (that is when the outcome is equal to the predicted value).
The ggplot2 command to plot a scatterplot of dframe$outcome versus dframe$pred (pred on the x axis, outcome on the y axis), along with a blue line where outcome == pred is as follows:
ggplot(dframe, aes(x = pred, y = outcome)) +
geom_point() +
geom_abline(color = "blue")
newrates <- data.frame(male_unemployment = 5)
newrates
  male_unemployment
1                 5
# unemployment is in your workspace
summary(unemployment)
male_unemployment female_unemployment
Min. :2.900 Min. :4.000
1st Qu.:4.900 1st Qu.:4.400
Median :6.000 Median :5.200
Mean :5.954 Mean :5.569
3rd Qu.:6.700 3rd Qu.:6.100
Max. :9.800 Max. :7.900
# newrates is in your workspace
newrates
  male_unemployment
1                 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model, unemployment)
# load the ggplot2 package
library(ggplot2)
# Make a plot to compare predictions to actual (prediction on x axis).
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) +
geom_point() +
geom_abline(color = "blue")
# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newrates)
# Print it
pred
1
4.906757
While all the modeling algorithms in R implement the predict() method, the call may be a little different for each one.
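For instance, lm objects return predictions on the outcome scale directly, while glm objects default to the link scale and need a type argument to return response-scale predictions. A minimal sketch using R's built-in cars and mtcars data:

```r
# lm: predict() returns predictions on the outcome scale directly
lin_mod <- lm(dist ~ speed, data = cars)
head(predict(lin_mod, newdata = cars))

# glm: predict() defaults to the link scale; use type = "response"
# to get predictions as probabilities
logit_mod <- glm(am ~ wt, data = mtcars, family = binomial)
head(predict(logit_mod, newdata = mtcars, type = "response"))
```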
We will work with the blood pressure dataset (Source), and model blood_pressure as a function of weight and age.
bloodpressure <- readRDS("bloodpressure.rds")
# bloodpressure is in the workspace
summary(bloodpressure)
blood_pressure age weight
Min. :128.0 Min. :46.00 Min. :167
1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
Median :153.0 Median :64.00 Median :194
Mean :150.1 Mean :62.45 Mean :195
3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
Max. :168.0 Max. :74.00 Max. :220
# Create the formula and print it
fmla <- blood_pressure ~ weight + age
fmla
blood_pressure ~ weight + age
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)
# Print bloodpressure_model and call summary()
bloodpressure_model
Call:
lm(formula = fmla, data = bloodpressure)
Coefficients:
(Intercept) weight age
30.9941 0.3349 0.8614
summary(bloodpressure_model)
Call:
lm(formula = fmla, data = bloodpressure)
Residuals:
Min 1Q Median 3Q Max
-3.4640 -1.1949 -0.4078 1.8511 2.6981
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.9941 11.9438 2.595 0.03186 *
weight 0.3349 0.1307 2.563 0.03351 *
age 0.8614 0.2482 3.470 0.00844 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.318 on 8 degrees of freedom
Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
One of the advantages of linear regression is that you can interpret the effects of each variable on the outcome, to a certain extent. In this case the coefficients for both age and weight are positive, which indicates that blood pressure tends to increase as both age and weight increase.
Now we will make predictions using the blood pressure model bloodpressure_model.
# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)
# plot the results
ggplot(bloodpressure, aes(x = prediction, y = blood_pressure)) +
geom_point() +
geom_abline(color = "blue")
The results stay fairly close to the line of perfect prediction, indicating that the model fits the training data well. From a prediction perspective, multivariate linear regression behaves much as simple (one-variable) linear regression does.
We will graphically evaluate the unemployment model, unemployment_model, that we fit to the unemployment data.
We will plot the model’s predictions against the actual female_unemployment. Then we will calculate the residuals:
residuals <- actual outcome - predicted outcome
and plot predictions against residuals. The residual graph will take a slightly different form: we compare the residuals to the horizontal line at zero (using geom_hline()) rather than to the line outcome == prediction. The command will be provided.
# Make predictions from the model
unemployment$predictions <- predict(unemployment_model, unemployment)
# Fill in the blanks to plot predictions (on x-axis) versus the female_unemployment rates
ggplot(unemployment, aes(x = predictions, y = female_unemployment)) +
geom_point() +
geom_abline()
# From previous step
unemployment$predictions <- predict(unemployment_model)
# Calculate residuals
unemployment$residuals <- unemployment$female_unemployment - unemployment$predictions
# Fill in the blanks to plot predictions (on x-axis) versus the residuals
ggplot(unemployment, aes(x = predictions, y = residuals)) +
geom_pointrange(aes(ymin = 0, ymax = residuals)) +
geom_hline(yintercept = 0, linetype = 3) +
ggtitle("residuals vs. linear model prediction")
We have now evaluated model predictions by comparing them to ground truth, and by examining prediction error.
We made predictions about female_unemployment and visualized the predictions and the residuals. Now, we will also plot the gain curve of the unemployment_model’s predictions against actual female_unemployment using the WVPlots::GainCurvePlot() function.
For situations where order is more important than exact values, the gain curve helps you check if the model’s predictions sort in the same order as the true outcome.
Calls to the function GainCurvePlot() look like:
GainCurvePlot(frame, xvar, truthvar, title)
where frame is the data frame holding the data, xvar is the name of the prediction column, truthvar is the name of the actual-outcome column, and title is the title of the plot.
When the predictions sort in exactly the same order, the relative Gini coefficient is 1. When the model sorts poorly, the relative Gini coefficient is close to zero, or even negative.
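The relative Gini coefficient itself is easy to sketch by hand: compare the area under the model-sorted cumulative-gain curve to the area under the ideally sorted curve (sorting by the true outcome). This toy computation illustrates the idea; it is not WVPlots' exact implementation:

```r
# Area between a cumulative-gain curve and the diagonal, sorting by `sortby`
gain_area <- function(outcome, sortby) {
  ord  <- order(sortby, decreasing = TRUE)
  gain <- cumsum(outcome[ord]) / sum(outcome)   # fraction of total outcome seen
  frac <- seq_along(outcome) / length(outcome)  # fraction of items seen
  mean(gain - frac)
}

outcome <- c(7.9, 6.1, 5.2, 4.4, 4.0)
pred    <- c(8.2, 5.9, 5.5, 4.1, 4.3)  # sorts almost like the outcome

# Relative Gini: model's area relative to the ideal sort
(rel_gini <- gain_area(outcome, pred) / gain_area(outcome, outcome))
# close to 1: the predictions sort nearly as well as the truth itself
```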
# Load the package WVPlots
library(WVPlots)
# Plot the Gain Curve
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")
A relative Gini coefficient close to one shows that the model correctly sorts high unemployment situations from lower ones.
We will calculate the RMSE of our unemployment model. Two columns have been added to the unemployment dataset: the model's predictions (predictions) and the residuals (residuals).
We can calculate the RMSE from a vector of residuals, res, as:
rmse <- sqrt(mean(res^2))
We want RMSE to be small. How small is “small”? One heuristic is to compare the RMSE to the standard deviation of the outcome. With a good model, the RMSE should be smaller.
# For convenience put the residuals in the variable res
res <- unemployment$residuals
# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res**2)))
[1] 0.5337612
# Calculate the standard deviation of female_unemployment and print it
(sd_unemployment <- sd(unemployment$female_unemployment))
[1] 1.314271
An RMSE much smaller than the outcome’s standard deviation suggests a model that predicts well.
We will examine how well the model fits the data: that is, how much variance it explains. You can do this using R-squared.
Suppose y is the true outcome, p is the prediction from the model, and res = y - p are the residuals of the predictions.
Then the total sum of squares tss ("total variance") of the data is:
tss <- sum((y - mean(y))^2)
where mean(y) is the mean value of y.
The residual sum of squared errors of the model, rss, is:
rss <- sum(res^2)
R-squared, the "variance explained" by the model, is then:
rsq <- 1 - rss/tss
After we calculate R-squared, we will compare it with the R-squared reported by glance(). glance() returns a one-row data frame; for a linear regression model, one of the columns returned is the R-squared of the model on the training data.
# Calculate mean female_unemployment: fe_mean. Print it
(fe_mean <- mean(unemployment$female_unemployment))
[1] 5.569231
# Calculate total sum of squares: tss. Print it
(tss <- sum((unemployment$female_unemployment - fe_mean)^2))
[1] 20.72769
# Calculate residual sum of squares: rss. Print it
(rss <- sum((unemployment$residuals)^2))
[1] 3.703714
# Calculate R-squared: rsq. Print it. Is it a good fit?
(rsq <- 1 - rss/tss)
[1] 0.8213157
# Get R-squared from glance. Print it
(rsq_glance <- glance(unemployment_model)$r.squared)
[1] 0.8213157
An R-squared close to one suggests a model that predicts well.
The linear correlation of two variables, x and y, measures the strength of the linear relationship between them. When x and y are, respectively, the model's predictions and the true outcome, the square of the correlation is the same as R-squared. We will verify that in this exercise.
# Get the correlation between the prediction and true outcome: rho and print it
(rho <- cor(unemployment$predictions, unemployment$female_unemployment))
[1] 0.9062647
# Square rho: rho2 and print it
(rho2 <- rho^2)
[1] 0.8213157
# Get R-squared from glance and print it
(rsq_glance <- glance(unemployment_model)$r.squared)
[1] 0.8213157
Remember this equivalence is only true for the training data, and only for models that minimize squared error.
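A quick way to see why the equivalence can break: correlation is invariant to shifts, while R-squared is not. On held-out data with a systematic bias, the two disagree. A small sketch on simulated data (all names here are illustrative):

```r
set.seed(123)
train <- data.frame(x = runif(50))
train$y <- 2 * train$x + rnorm(50, sd = 0.2)
test <- data.frame(x = runif(50))
test$y <- 2 * test$x + rnorm(50, sd = 0.2) + 0.5  # test outcomes shifted up

m    <- lm(y ~ x, data = train)
pred <- predict(m, newdata = test)

rho2 <- cor(pred, test$y)^2                                  # shift-invariant
rsq  <- 1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2)
c(rho2 = rho2, rsq = rsq)  # rho2 stays high; rsq is dragged down by the bias
```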
We will use the mpg data from the package ggplot2. The data describes the characteristics of several makes and models of cars from different years. The goal is to predict city fuel efficiency from highway fuel efficiency.
We will split mpg into a training set mpg_train (75% of the data) and a test set mpg_test (25% of the data). One way to do this is to generate a column of uniform random numbers between 0 and 1, using the function runif().
If you have a data set dframe with N rows, and you want a random subset of approximately X*N rows (where X is between 0 and 1), then: generate a vector gp of N uniform random numbers with runif(N), and take the rows of dframe where gp < X.
# mpg is in the workspace
summary(mpg)
manufacturer model displ year
Length:234 Length:234 Min. :1.600 Min. :1999
Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
Mode :character Mode :character Median :3.300 Median :2004
Mean :3.472 Mean :2004
3rd Qu.:4.600 3rd Qu.:2008
Max. :7.000 Max. :2008
cyl trans drv cty
Min. :4.000 Length:234 Length:234 Min. : 9.00
1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
Median :6.000 Mode :character Mode :character Median :17.00
Mean :5.889 Mean :16.86
3rd Qu.:8.000 3rd Qu.:19.00
Max. :8.000 Max. :35.00
hwy fl class pred.cv
Min. :12.00 Length:234 Length:234 Min. : 8.942
1st Qu.:18.00 Class :character Class :character 1st Qu.:13.160
Median :24.00 Mode :character Mode :character Median :17.262
Mean :23.44 Mean :16.859
3rd Qu.:27.00 3rd Qu.:19.308
Max. :44.00 Max. :30.975
pred
Min. : 9.043
1st Qu.:13.142
Median :17.241
Mean :16.859
3rd Qu.:19.291
Max. :30.906
dim(mpg)
[1] 234 13
# Use nrow to get the number of rows in mpg (N) and print it
(N <- nrow(mpg))
[1] 234
# Calculate how many rows 75% of N should be and print it
# Hint: use round() to get an integer
(target <- round(N * 0.75))
[1] 176
# Create the vector of N uniform random variables: gp
gp <- runif(N)
# Use gp to create the training set: mpg_train (75% of data) and mpg_test (25% of data)
mpg_train <- mpg[gp < 0.75,]
mpg_test <- mpg[gp >= 0.75,]
# Use nrow() to examine mpg_train and mpg_test
nrow(mpg_train)
[1] 174
nrow(mpg_test)
[1] 60
A random split won’t always produce sets of exactly X% and (100-X)% of the data, but it should be close.
We will use mpg_train to train a model to predict city fuel efficiency (cty) from highway fuel efficiency (hwy).
# mpg_train is in the workspace
summary(mpg_train)
manufacturer model displ year
Length:174 Length:174 Min. :1.600 Min. :1999
Class :character Class :character 1st Qu.:2.425 1st Qu.:1999
Mode :character Mode :character Median :3.300 Median :1999
Mean :3.472 Mean :2003
3rd Qu.:4.600 3rd Qu.:2008
Max. :7.000 Max. :2008
cyl trans drv cty
Min. :4.000 Length:174 Length:174 Min. : 9.00
1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
Median :6.000 Mode :character Mode :character Median :17.00
Mean :5.879 Mean :16.86
3rd Qu.:8.000 3rd Qu.:19.00
Max. :8.000 Max. :35.00
hwy fl class pred.cv
Min. :12.00 Length:174 Length:174 Min. : 9.062
1st Qu.:18.00 Class :character Class :character 1st Qu.:13.190
Median :25.00 Mode :character Mode :character Median :17.893
Mean :23.53 Mean :16.922
3rd Qu.:27.00 3rd Qu.:19.308
Max. :44.00 Max. :30.975
pred
Min. : 9.043
1st Qu.:13.142
Median :17.925
Mean :16.919
3rd Qu.:19.291
Max. :30.906
# Create a formula to express cty as a function of hwy: fmla and print it.
(fmla <- cty ~ hwy)
cty ~ hwy
# Now use lm() to build a model mpg_model from mpg_train that predicts cty from hwy
mpg_model <- lm(fmla, mpg_train)
# Use summary() to examine the model
summary(mpg_model)
Call:
lm(formula = fmla, data = mpg_train)
Residuals:
Min 1Q Median 3Q Max
-2.8629 -0.8119 -0.0603 0.7256 4.2120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.85627 0.40174 2.131 0.0345 *
hwy 0.68027 0.01658 41.041 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.272 on 172 degrees of freedom
Multiple R-squared: 0.9073, Adjusted R-squared: 0.9068
F-statistic: 1684 on 1 and 172 DF, p-value: < 2.2e-16
Now we will test the model mpg_model on the test data, mpg_test. The functions rmse() and r_squared(), which calculate RMSE and R-squared, have been provided for convenience:
rmse(predcol, ycol)
r_squared(predcol, ycol)
# rmse function
rmse <- function (predcol, ycol)
{
res = predcol - ycol
sqrt(mean(res^2))
}
# r_squared function
r_squared <- function (predcol, ycol)
{
tss = sum((ycol - mean(ycol))^2)
rss = sum((ycol - predcol)^2)
1 - rss/tss
}
where predcol is the vector of predictions and ycol is the vector of actual outcomes.
We will also plot the predictions vs. the outcome.
Generally, model performance is better on the training data than the test data (though sometimes the test set “gets lucky”). A slight difference in performance is okay; if the performance on training is significantly better, there is a problem.
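The train/test comparison can be sketched end-to-end on R's built-in mtcars data (standing in here for mpg); the rmse() helper is redefined inline so the snippet is self-contained:

```r
rmse <- function(predcol, ycol) sqrt(mean((predcol - ycol)^2))

set.seed(42)
gp    <- runif(nrow(mtcars))
train <- mtcars[gp < 0.75, ]
test  <- mtcars[gp >= 0.75, ]

m <- lm(mpg ~ wt, data = train)
rmse_train <- rmse(predict(m, newdata = train), train$mpg)
rmse_test  <- rmse(predict(m, newdata = test), test$mpg)
c(train = rmse_train, test = rmse_test)  # a modest gap is expected
```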
# Examine the objects in the workspace
ls.str()
alcohol : 'data.frame': 32 obs. of 7 variables:
$ Subject : int 1 2 3 4 5 6 7 8 9 10 ...
$ Metabol : num 0.6 0.6 1.5 0.4 0.1 0.2 0.3 0.3 0.4 1 ...
$ Gastric : num 1 1.6 1.5 2.2 1.1 1.2 0.9 0.8 1.5 0.9 ...
$ Sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 1 1 1 1 1 ...
$ Alcohol : Factor w/ 2 levels "Alcoholic","Non-alcoholic": 1 1 1 2 2 2 2 2 2 2 ...
$ pred_add : num 0.105 1.358 0.703 2.61 0.412 ...
$ pred_interaction: num 0.475 1.259 0.674 2.044 0.721 ...
bike_model : List of 30
$ coefficients : Named num [1:32] 5.935 -0.58 -0.892 -1.662 -2.35 ...
$ residuals : Named num [1:744] 0.724 0.973 1.52 1.085 -0.502 ...
$ fitted.values : Named num [1:744] 86.43 47.13 35.71 15.82 8.03 ...
$ effects : Named num [1:744] -2648.1 77 77.2 70.8 61.7 ...
$ R : num [1:32, 1:32] -451 0 0 0 0 ...
$ rank : int 32
$ qr :List of 5
$ family :List of 11
$ linear.predictors: Named num [1:744] 4.46 3.85 3.58 2.76 2.08 ...
$ deviance : num 28775
$ aic : num NA
$ null.deviance : num 133365
$ iter : int 5
$ weights : Named num [1:744] 86.43 47.13 35.71 15.82 8.03 ...
$ prior.weights : Named num [1:744] 1 1 1 1 1 1 1 1 1 1 ...
$ df.residual : int 712
$ df.null : int 743
$ y : Named int [1:744] 149 93 90 33 4 10 27 50 142 219 ...
$ converged : logi TRUE
$ boundary : logi FALSE
$ model :'data.frame': 744 obs. of 9 variables:
$ call : language glm(formula = fmla, family = quasipoisson, data = bikesJuly)
$ formula : chr "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
$ terms :Classes 'terms', 'formula' language cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed
$ data :'data.frame': 744 obs. of 12 variables:
$ offset : NULL
$ control :List of 3
$ method : chr "glm.fit"
$ contrasts :List of 4
$ xlevels :List of 2
bike_model_rf : List of 14
$ predictions : num [1:744] 122.9 78.1 95.1 70.2 44.9 ...
$ num.trees : num 500
$ num.independent.variables: num 8
$ mtry : num 2
$ min.node.size : num 5
$ prediction.error : num 8231
$ forest :List of 8
$ splitrule : chr "variance"
$ treetype : chr "Regression"
$ r.squared : num 0.821
$ call : language ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order", seed = seed)
$ importance.mode : chr "none"
$ num.samples : int 744
$ replace : logi TRUE
bikesAugust : 'data.frame': 744 obs. of 13 variables:
$ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
$ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
$ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
$ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
$ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
$ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
$ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
$ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
$ yr : int 1 1 1 1 1 1 1 1 1 1 ...
$ pred : num 77.4 36.2 38.3 28.2 42.2 ...
bikesJuly : 'data.frame': 744 obs. of 12 variables:
$ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
$ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
$ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
$ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
$ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
$ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
$ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
$ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
$ yr : int 1 1 1 1 1 1 1 1 1 1 ...
bloodpressure : 'data.frame': 11 obs. of 4 variables:
$ blood_pressure: int 132 143 153 162 154 168 137 149 159 128 ...
$ age : int 52 59 67 73 64 74 54 61 65 46 ...
$ weight : int 173 184 194 211 196 220 188 188 207 167 ...
$ prediction : num 134 143 154 165 152 ...
bloodpressure_model : List of 12
$ coefficients : Named num [1:3] 30.994 0.335 0.861
$ residuals : Named num [1:11] -1.718 -0.432 -0.672 -2.533 2.243 ...
$ effects : Named num [1:11] -497.8 41.82 -8.04 -2.09 2.66 ...
$ rank : int 3
$ fitted.values: Named num [1:11] 134 143 154 165 152 ...
$ assign : int [1:3] 0 1 2
$ qr :List of 5
$ df.residual : int 8
$ xlevels : Named list()
$ call : language lm(formula = fmla, data = bloodpressure)
$ terms :Classes 'terms', 'formula' language blood_pressure ~ weight + age
$ model :'data.frame': 11 obs. of 3 variables:
fdata : 'data.frame': 100 obs. of 3 variables:
$ y : num 9.15 1.9 -3.86 2.39 1.54 ...
$ pred : num 6.43 3.47 1.59 3.76 9.51 ...
$ label: Factor w/ 2 levels "small purchases",..: 1 1 1 1 1 1 1 1 1 1 ...
fdata2 : tibble [100 × 5] (S3: grouped_df/tbl_df/tbl/data.frame)
fe_mean : num 5.57
flower_model : List of 13
$ coefficients : Named num [1:3] 83.4642 -0.0405 -12.1583
$ residuals : Named num [1:24] -2.94 12.16 -3.86 -4.96 -3.49 ...
$ effects : Named num [1:24] -275.02 -50.79 29.78 -5.16 -2.41 ...
$ rank : int 3
$ fitted.values: Named num [1:24] 65.2 65.2 59.2 59.2 53.1 ...
$ assign : int [1:3] 0 1 2
$ qr :List of 5
$ df.residual : int 21
$ contrasts :List of 1
$ xlevels :List of 1
$ call : language lm(formula = fmla, data = flowers)
$ terms :Classes 'terms', 'formula' language Flowers ~ Intensity + Time
$ model :'data.frame': 24 obs. of 3 variables:
flowers : 'data.frame': 24 obs. of 4 variables:
$ Flowers : num 62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
$ Time : chr "Late" "Late" "Late" "Late" ...
$ Intensity : int 150 150 300 300 450 450 600 600 750 750 ...
$ predictions: num 65.2 65.2 59.2 59.2 53.1 ...
fmla : Class 'formula' language cty ~ hwy
fmla_add : Class 'formula' language Metabol ~ Gastric + Sex
fmla_interaction : Class 'formula' language Metabol ~ Gastric + Gastric:Sex
fmla_main_and_interaction : Class 'formula' language Metabol ~ Gastric * Sex
fmla_sqr : Class 'formula' language price ~ I(size^2)
fmla.abs : Class 'formula' language Income2005 ~ Arith + Word + Parag + Math + AFQT
fmla.gam : Class 'formula' language weight ~ s(Time)
fmla.lin : Class 'formula' language weight ~ Time
fmla.log : Class 'formula' language log(Income2005) ~ Arith + Word + Parag + Math + AFQT
gp : num [1:234] 0.931 0.424 0.39 0.13 0.647 ...
houseprice : 'data.frame': 40 obs. of 4 variables:
$ size : int 72 98 92 90 44 46 90 150 94 90 ...
$ price : int 156 153 230 152 42 157 113 573 202 261 ...
$ pred_lin: num 159.7 259.4 244.8 227.1 41.1 ...
$ pred_sqr: num 150.6 243.3 222.8 211.4 80.5 ...
houseprice_long : 'data.frame': 80 obs. of 5 variables:
$ size : int 72 98 92 90 44 46 90 150 94 90 ...
$ price : int 156 153 230 152 42 157 113 573 202 261 ...
$ modeltype: chr "pred_lin" "pred_lin" "pred_lin" "pred_lin" ...
$ pred : num 159.7 259.4 244.8 227.1 41.1 ...
$ residuals: num 3.705 106.419 14.825 75.08 -0.866 ...
i : int 3
income_long : 'data.frame': 1030 obs. of 13 variables:
$ Subject : int 7 13 47 62 73 78 89 105 106 107 ...
$ Arith : int 14 30 26 12 18 25 24 28 30 19 ...
$ Word : int 27 29 33 25 34 35 34 32 35 31 ...
$ Parag : int 8 13 13 10 13 14 15 14 12 11 ...
$ Math : int 11 24 16 10 18 21 17 20 23 16 ...
$ AFQT : num 47.4 72.3 75.5 36.4 81.5 ...
$ Income2005 : int 19000 8000 66309 30000 186135 14657 62000 40000 105000 43000 ...
$ logpred : num 10.3 10.9 10.7 10.1 10.5 ...
$ pred.income: num 28644 56646 44468 25462 36356 ...
$ modeltype : chr "pred.absmodel" "pred.absmodel" "pred.absmodel" "pred.absmodel" ...
$ pred : num 42844 75499 63519 35008 53495 ...
$ residual : num 23844 67499 -2790 5008 -132640 ...
$ relerr : num 1.255 8.4374 -0.0421 0.1669 -0.7126 ...
income_test : 'data.frame': 515 obs. of 11 variables:
$ Subject : int 7 13 47 62 73 78 89 105 106 107 ...
$ Arith : int 14 30 26 12 18 25 24 28 30 19 ...
$ Word : int 27 29 33 25 34 35 34 32 35 31 ...
$ Parag : int 8 13 13 10 13 14 15 14 12 11 ...
$ Math : int 11 24 16 10 18 21 17 20 23 16 ...
$ AFQT : num 47.4 72.3 75.5 36.4 81.5 ...
$ Income2005 : int 19000 8000 66309 30000 186135 14657 62000 40000 105000 43000 ...
$ logpred : num 10.3 10.9 10.7 10.1 10.5 ...
$ pred.income : num 28644 56646 44468 25462 36356 ...
$ pred.absmodel: Named num 42844 75499 63519 35008 53495 ...
$ pred.logmodel: Named num 28644 56646 44468 25462 36356 ...
income_train : 'data.frame': 2069 obs. of 7 variables:
$ Subject : int 2 6 8 9 16 17 18 20 21 22 ...
$ Arith : int 8 30 13 21 17 29 30 17 29 27 ...
$ Word : int 15 35 35 28 30 33 35 28 33 31 ...
$ Parag : int 6 15 12 10 12 13 14 14 13 14 ...
$ Math : int 6 23 4 13 17 21 23 20 25 22 ...
$ AFQT : num 6.84 99.39 44.02 59.68 50.28 ...
$ Income2005: int 5500 65000 36000 65000 71000 43000 120000 64000 253043 45300 ...
k : num 3
mean_bikes : num 274
mmat : num [1:24, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
model : List of 12
$ coefficients : Named num [1:2] 0.865 0.683
$ residuals : Named num [1:156] -0.35705 -2.62476 -1.30783 0.00909 -0.30783 ...
$ effects : Named num [1:156] -214.412 -49.087 -1.187 0.166 -0.187 ...
$ rank : int 2
$ fitted.values: Named num [1:156] 21.4 18.6 19.3 20 19.3 ...
$ assign : int [1:2] 0 1
$ qr :List of 5
$ df.residual : int 154
$ xlevels : Named list()
$ call : language lm(formula = cty ~ hwy, data = mpg[split$train, ])
$ terms :Classes 'terms', 'formula' language cty ~ hwy
$ model :'data.frame': 156 obs. of 2 variables:
model_add : List of 13
$ coefficients : Named num [1:3] -1.4 1.65 1.81
$ residuals : Named num [1:21] 0.352 -0.635 0.43 -1.822 -0.377 ...
$ effects : Named num [1:21] -10.213 9.007 3.55 -2.059 -0.399 ...
$ rank : int 3
$ fitted.values: Named num [1:21] 0.248 1.235 1.07 2.222 0.577 ...
$ assign : int [1:3] 0 1 2
$ qr :List of 5
$ df.residual : int 18
$ contrasts :List of 1
$ xlevels :List of 1
$ call : language lm(formula = fmla_add, data = alcohol[split$train, ])
$ terms :Classes 'terms', 'formula' language Metabol ~ Gastric + Sex
$ model :'data.frame': 21 obs. of 3 variables:
model_interaction : List of 13
$ coefficients : Named num [1:3] -0.0572 0.7072 1.1427
$ residuals : Named num [1:21] -0.05 -0.474 0.496 -1.099 -0.591 ...
$ effects : Named num [1:21] -10.213 9.007 3.99 -1.341 -0.552 ...
$ rank : int 3
$ fitted.values: Named num [1:21] 0.65 1.074 1.004 1.499 0.791 ...
$ assign : int [1:3] 0 1 2
$ qr :List of 5
$ df.residual : int 18
$ contrasts :List of 1
$ xlevels :List of 1
$ call : language lm(formula = fmla_interaction, data = alcohol[split$train, ])
$ terms :Classes 'terms', 'formula' language Metabol ~ Gastric + Gastric:Sex
$ model :'data.frame': 21 obs. of 3 variables:
model_lin : List of 12
$ coefficients : Named num [1:2] -136.73 4.04
$ residuals : Named num [1:27] 1.68 -5.16 107.78 -114.08 -41.25 ...
$ effects : Named num [1:27] -1287.3 -581.3 105.8 -114.4 -41.4 ...
$ rank : int 2
$ fitted.values: Named num [1:27] 154.3 235.2 49.2 227.1 243.2 ...
$ assign : int [1:2] 0 1
$ qr :List of 5
$ df.residual : int 25
$ xlevels : Named list()
$ call : language lm(formula = price ~ size, data = houseprice[split$train, ])
$ terms :Classes 'terms', 'formula' language price ~ size
$ model :'data.frame': 27 obs. of 2 variables:
model_main_and_interaction : List of 13
$ coefficients : Named num [1:4] -0.197 0.837 -0.988 1.507
$ residuals : Named num [1:32] -0.0397 -0.5418 0.4418 -1.244 -0.6234 ...
$ effects : Named num [1:32] -13.7 12.246 4.211 3.254 -0.734 ...
$ rank : int 4
$ fitted.values: Named num [1:32] 0.64 1.142 1.058 1.644 0.723 ...
$ assign : int [1:4] 0 1 2 3
$ qr :List of 5
$ df.residual : int 28
$ contrasts :List of 1
$ xlevels :List of 1
$ call : language lm(formula = fmla_main_and_interaction, data = alcohol)
$ terms :Classes 'terms', 'formula' language Metabol ~ Gastric * Sex
$ model :'data.frame': 32 obs. of 3 variables:
model_sqr : List of 12
$ coefficients : Named num [1:2] 39.416 0.0212
$ residuals : Named num [1:27] 6.52 10.88 72.66 -98.39 -25.02 ...
$ effects : Named num [1:27] -1287.3 610.5 69.3 -99.8 -26.1 ...
$ rank : int 2
$ fitted.values: Named num [1:27] 149.5 219.1 84.3 211.4 227 ...
$ assign : int [1:2] 0 1
$ qr :List of 5
$ df.residual : int 25
$ xlevels : Named list()
$ call : language lm(formula = fmla_sqr, data = houseprice[split$train, ])
# predict cty from hwy for the training set
mpg_train$pred <- predict(mpg_model, mpg_train)
# predict cty from hwy for the test set
mpg_test$pred <- predict(mpg_model, mpg_test)
# Evaluate the rmse on both training and test data and print them
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))
[1] 1.264899
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))
[1] 1.199585
# Evaluate the r-squared on both training and test data and print them
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))
# Plot the predictions (on the x-axis) against the outcome (cty) on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) +
geom_point() +
geom_abline()
Good performance on the test data is more confirmation that the model works as expected.
There are several ways to implement an n-fold cross validation plan. We will create such a plan using vtreat::kWayCrossValidation(), and examine it.
kWayCrossValidation() creates a cross validation plan with the following call:
splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.
Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. We can set them both to NULL.
The resulting splitPlan is a list of nSplits elements; each element contains two vectors: train, the indices of the rows used to fit the model for that fold, and app, the indices of the held-out rows the model is applied to.
We will create a 3-fold cross-validation plan for the data set mpg.
# Load the package vtreat
library(vtreat)
# mpg is in the workspace
summary(mpg)
manufacturer model displ year
Length:234 Length:234 Min. :1.600 Min. :1999
Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
Mode :character Mode :character Median :3.300 Median :2004
Mean :3.472 Mean :2004
3rd Qu.:4.600 3rd Qu.:2008
Max. :7.000 Max. :2008
cyl trans drv cty
Min. :4.000 Length:234 Length:234 Min. : 9.00
1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
Median :6.000 Mode :character Mode :character Median :17.00
Mean :5.889 Mean :16.86
3rd Qu.:8.000 3rd Qu.:19.00
Max. :8.000 Max. :35.00
hwy fl class pred.cv
Min. :12.00 Length:234 Length:234 Min. : 8.942
1st Qu.:18.00 Class :character Class :character 1st Qu.:13.160
Median :24.00 Mode :character Mode :character Median :17.262
Mean :23.44 Mean :16.859
3rd Qu.:27.00 3rd Qu.:19.308
Max. :44.00 Max. :30.975
pred
Min. : 9.043
1st Qu.:13.142
Median :17.241
Mean :16.859
3rd Qu.:19.291
Max. :30.906
# Get the number of rows in mpg
nRows <- nrow(mpg)
# Implement the 3-fold cross-validation plan with vtreat
splitPlan <- kWayCrossValidation(nRows, 3, dframe = NULL, y = NULL)
# Examine the split plan
str(splitPlan)
List of 3
$ :List of 2
..$ train: int [1:156] 1 2 3 4 5 6 7 8 10 11 ...
..$ app : int [1:78] 20 161 98 230 9 179 73 124 126 83 ...
$ :List of 2
..$ train: int [1:156] 1 2 3 5 8 9 10 13 14 15 ...
..$ app : int [1:78] 138 88 53 210 113 80 190 189 24 112 ...
$ :List of 2
..$ train: int [1:156] 4 6 7 9 11 12 13 14 15 16 ...
..$ app : int [1:78] 30 64 133 209 75 197 91 28 194 34 ...
- attr(*, "splitmethod")= chr "kwaycross"
We will use splitPlan to make predictions from a model that predicts mpg$cty from mpg$hwy.
If dframe is the training data, then one way to add a column of cross-validation predictions to the frame is as follows:
# Initialize a column of the appropriate length
dframe$pred.cv <- 0
# k is the number of folds
# splitPlan is the cross validation plan
for(i in 1:k) {
# Get the ith split
split <- splitPlan[[i]]
# Build a model on the training data
# from this split
# (lm, in this case)
model <- lm(fmla, data = dframe[split$train,])
# make predictions on the
# application data from this split
dframe$pred.cv[split$app] <- predict(model, newdata = dframe[split$app,])
}
Cross-validation predicts how well a model built from all the data will perform on new data. As with the test/train split, for a good modeling procedure, cross-validation performance and training performance should be close.
# Run the 3-fold cross validation plan from splitPlan
k <- 3 # Number of folds
mpg$pred.cv <- 0
for(i in 1:k) {
split <- splitPlan[[i]]
model <- lm(cty ~ hwy, data = mpg[split$train,])
mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app,])
}
# Predict from a full model
mpg$pred <- predict(lm(cty ~ hwy, data = mpg))
# Get the rmse of the full model's predictions
rmse(mpg$pred, mpg$cty)
[1] 1.247045
# Get the rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)
[1] 1.249645
We have successfully estimated a model’s out-of-sample error via cross-validation. Remember, cross-validation validates the modeling process, not an actual model.
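The vtreat plan is a list of train/app index pairs. As a self-contained illustration (a base-R sketch with the same output shape, not vtreat's actual implementation), a k-way plan can be built by dealing shuffled row indices into folds:

```r
# Sketch (base R, not vtreat's implementation): build a k-way split plan
# shaped like kWayCrossValidation()'s output -- a list of folds, each
# holding training indices ("train") and held-out indices ("app").
k_way_plan <- function(nRows, nSplits) {
  # shuffle the row indices, then deal them out into nSplits groups
  groups <- split(sample(nRows), rep_len(seq_len(nSplits), nRows))
  lapply(groups, function(app) {
    list(train = setdiff(seq_len(nRows), app), app = app)
  })
}

set.seed(1)
plan <- k_way_plan(10, 3)
length(plan)    # 3 folds
str(plan[[1]])  # $train and $app index vectors, as in splitPlan
```

Every row appears in exactly one fold's app vector, so each row gets exactly one cross-validation prediction.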
For this exercise we will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling. The dataset flowers (derived from the Sleuth3 package) is loaded into the workspace. It has the following columns:
- Flowers: the number of flowers on a plant (numeric, the outcome)
- Time: when the treatment was applied, "Early" or "Late" (categorical)
- Intensity: the treatment intensity (numeric)
The ultimate goal is to predict Flowers as a function of Time and Intensity.
flowers <- readRDS("flowers.rds")
# Call str on flowers to see the types of each column
str(flowers)
'data.frame': 24 obs. of 3 variables:
$ Flowers : num 62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
$ Time : chr "Late" "Late" "Late" "Late" ...
$ Intensity: int 150 150 300 300 450 450 600 600 750 750 ...
# Use unique() to see how many possible values Time takes
unique(flowers$Time)
[1] "Late" "Early"
# Build a formula to express Flowers as a function of Intensity and Time: fmla. Print it
(fmla <- as.formula("Flowers ~ Intensity + Time"))
Flowers ~ Intensity + Time
# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- model.matrix(fmla, data = flowers)
# Examine the first 20 lines of flowers
head(flowers, 20)
   Flowers  Time  Intensity
1     62.3  Late        150
2     77.4  Late        150
3     55.3  Late        300
4     54.2  Late        300
5     49.6  Late        450
6     61.9  Late        450
7     39.4  Late        600
8     45.7  Late        600
9     31.3  Late        750
10    44.9  Late        750
# Examine the first 20 lines of mmat
head(mmat, 20)
(Intercept) Intensity TimeLate
1 1 150 1
2 1 150 1
3 1 300 1
4 1 300 1
5 1 450 1
6 1 450 1
7 1 600 1
8 1 600 1
9 1 750 1
10 1 750 1
11 1 900 1
12 1 900 1
13 1 150 0
14 1 150 0
15 1 300 0
16 1 300 0
17 1 450 0
18 1 450 0
19 1 600 0
20 1 600 0
Now we have seen how most R modeling functions represent categorical variables internally.
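To see this in isolation, here is a tiny self-contained example (toy data, not the flowers set): a two-level factor becomes a single 0/1 indicator column for the non-reference level, just as Time became TimeLate above.

```r
# Toy data: one numeric input and one two-level categorical input
d <- data.frame(y   = c(1, 2, 3, 4),
                x   = c(10, 20, 30, 40),
                grp = factor(c("a", "b", "a", "b")))

# model.matrix() expands the factor using treatment contrasts:
# level "a" is the reference, so only an indicator for "b" appears
mm <- model.matrix(y ~ x + grp, data = d)
colnames(mm)  # "(Intercept)" "x" "grpb"
mm[, "grpb"]  # 0 1 0 1 -- indicator for the non-reference level "b"
```

With an n-level factor you would get n - 1 indicator columns; the reference level is absorbed into the intercept.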
Next, we will fit a linear model to the flowers data to predict Flowers as a function of Time and Intensity.
# flowers is in the workspace
str(flowers)
'data.frame': 24 obs. of 3 variables:
$ Flowers : num 62.3 77.4 55.3 54.2 49.6 61.9 39.4 45.7 31.3 44.9 ...
$ Time : chr "Late" "Late" "Late" "Late" ...
$ Intensity: int 150 150 300 300 450 450 600 600 750 750 ...
# fmla is in the workspace
fmla
Flowers ~ Intensity + Time
# Fit a model to predict Flowers from Intensity and Time : flower_model
flower_model <- lm(fmla, data = flowers)
# Use summary on mmat to remind yourself of its structure
summary(mmat)
(Intercept) Intensity TimeLate
Min. :1 Min. :150 Min. :0.0
1st Qu.:1 1st Qu.:300 1st Qu.:0.0
Median :1 Median :525 Median :0.5
Mean :1 Mean :525 Mean :0.5
3rd Qu.:1 3rd Qu.:750 3rd Qu.:1.0
Max. :1 Max. :900 Max. :1.0
# Use summary to examine flower_model
summary(flower_model)
Call:
lm(formula = fmla, data = flowers)
Residuals:
Min 1Q Median 3Q Max
-9.652 -4.139 -1.558 5.632 12.165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 83.464167 3.273772 25.495 < 2e-16 ***
Intensity -0.040471 0.005132 -7.886 1.04e-07 ***
TimeLate -12.158333 2.629557 -4.624 0.000146 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.441 on 21 degrees of freedom
Multiple R-squared: 0.7992, Adjusted R-squared: 0.78
F-statistic: 41.78 on 2 and 21 DF, p-value: 4.786e-08
# Predict the number of flowers on each plant
flowers$predictions <- predict(flower_model, flowers)
# Plot predictions vs actual flowers (predictions on x-axis)
ggplot(flowers, aes(x = predictions, y = Flowers)) +
geom_point() +
geom_abline(color = "blue")
We’ve successfully fit a model with a categorical variable, and seen how categorical variables are represented in that model.
We will use interactions to model the effect of gender and gastric activity on alcohol metabolism.
The data frame alcohol has columns:
- Subject: the subject identifier
- Metabol: the rate of alcohol metabolism (the outcome)
- Gastric: the rate of gastric alcohol dehydrogenase activity
- Sex: Male or Female
- Alcohol: whether the subject is alcoholic or non-alcoholic
alcohol <- readRDS("alcohol.rds")
We can fit three models to the alcohol data: a main-effects-only model (Metabol ~ Gastric + Sex), a model with main effects and an interaction (Metabol ~ Gastric * Sex), and a model with the interaction but no main effect for Sex (Metabol ~ Gastric + Gastric:Sex).
We saw that one of the models with interaction terms had a better R-squared than the additive model, suggesting that using interaction terms gives a better fit. In this exercise we will compare the R-squared of one of the interaction models to the main-effects-only model.
Recall that the operator : designates the interaction between two variables. The operator * designates the interaction between the two variables, plus the main effects.
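We can verify how the two operators expand without fitting anything, by inspecting the term labels of toy formulas (a, b, and y here are placeholder names):

```r
# '*' expands to both main effects plus the interaction ...
attr(terms(y ~ a * b), "term.labels")
# [1] "a"   "b"   "a:b"

# ... while ':' adds only the interaction term itself
attr(terms(y ~ a + a:b), "term.labels")
# [1] "a"   "a:b"
```

So y ~ a * b is shorthand for y ~ a + b + a:b.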
# alcohol is in the workspace
summary(alcohol)
Subject Metabol Gastric Sex
Min. : 1.00 Min. : 0.100 Min. :0.800 Female:18
1st Qu.: 8.75 1st Qu.: 0.600 1st Qu.:1.200 Male :14
Median :16.50 Median : 1.700 Median :1.600
Mean :16.50 Mean : 2.422 Mean :1.863
3rd Qu.:24.25 3rd Qu.: 2.925 3rd Qu.:2.200
Max. :32.00 Max. :12.300 Max. :5.200
Alcohol
Alcoholic : 8
Non-alcoholic:24
# Create the formula with main effects only
(fmla_add <- Metabol ~ Gastric + Sex)
Metabol ~ Gastric + Sex
# Create the formula with main effect and interactions
(fmla_main_and_interaction <- Metabol ~ Gastric * Sex)
Metabol ~ Gastric * Sex
# Create the formula with interactions
(fmla_interaction <- Metabol ~ Gastric + Gastric:Sex)
Metabol ~ Gastric + Gastric:Sex
# Fit the main effects only model
model_add <- lm(fmla_add, alcohol)
# Fit the main effects and interactions
model_main_and_interaction <- lm(fmla_main_and_interaction, alcohol)
# Fit the interaction model
model_interaction <- lm(fmla_interaction, alcohol)
# Call summary on both models and compare
summary(model_add)
Call:
lm(formula = fmla_add, data = alcohol)
Residuals:
Min 1Q Median 3Q Max
-2.2779 -0.6328 -0.0966 0.5783 4.5703
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9466 0.5198 -3.745 0.000796 ***
Gastric 1.9656 0.2674 7.352 4.24e-08 ***
SexMale 1.6174 0.5114 3.163 0.003649 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.331 on 29 degrees of freedom
Multiple R-squared: 0.7654, Adjusted R-squared: 0.7492
F-statistic: 47.31 on 2 and 29 DF, p-value: 7.41e-10
summary(model_main_and_interaction)
Call:
lm(formula = fmla_main_and_interaction, data = alcohol)
Residuals:
Min 1Q Median 3Q Max
-2.4427 -0.6111 -0.0326 0.5436 3.8759
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1973 0.8022 -0.246 0.8075
Gastric 0.8369 0.4839 1.730 0.0947 .
SexMale -0.9885 1.0724 -0.922 0.3645
Gastric:SexMale 1.5069 0.5591 2.695 0.0118 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.207 on 28 degrees of freedom
Multiple R-squared: 0.8137, Adjusted R-squared: 0.7938
F-statistic: 40.77 on 3 and 28 DF, p-value: 2.386e-10
summary(model_interaction)
Call:
lm(formula = fmla_interaction, data = alcohol)
Residuals:
Min 1Q Median 3Q Max
-2.4656 -0.5091 0.0143 0.5660 4.0668
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7504 0.5310 -1.413 0.168236
Gastric 1.1489 0.3450 3.331 0.002372 **
Gastric:SexMale 1.0422 0.2412 4.321 0.000166 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.204 on 29 degrees of freedom
Multiple R-squared: 0.8081, Adjusted R-squared: 0.7948
F-statistic: 61.05 on 2 and 29 DF, p-value: 4.033e-11
We will compare the performance of the interaction model you fit in the previous exercise to the performance of a main-effects only model. Because this data set is small, we will use cross-validation to simulate making predictions on out-of-sample data.
We will use the dplyr package to do the calculations.
We will also use tidyr’s gather(), which takes multiple columns and collapses them into key-value pairs.
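To make gather()'s reshaping concrete before applying it to the alcohol data, here is a minimal sketch on toy data using base R's stack(), which plays a similar role (tidyr's gather() additionally lets you name the key and value columns directly):

```r
# Toy frame with two prediction columns, mirroring the exercise
d <- data.frame(y = c(1, 2),
                pred_add = c(1.1, 2.2),
                pred_interaction = c(0.9, 1.8))

# stack() collapses the chosen columns into a value/key pair of columns
long <- stack(d, select = c("pred_add", "pred_interaction"))
names(long) <- c("pred", "modeltype")
long$y <- rep(d$y, times = 2)  # carry the outcome along by hand

nrow(long)  # 4 rows: two per model type
```

In long form, one group_by(modeltype) computes the RMSE of both models at once.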
# alcohol is in the workspace
summary(alcohol)
Subject Metabol Gastric Sex
Min. : 1.00 Min. : 0.100 Min. :0.800 Female:18
1st Qu.: 8.75 1st Qu.: 0.600 1st Qu.:1.200 Male :14
Median :16.50 Median : 1.700 Median :1.600
Mean :16.50 Mean : 2.422 Mean :1.863
3rd Qu.:24.25 3rd Qu.: 2.925 3rd Qu.:2.200
Max. :32.00 Max. :12.300 Max. :5.200
Alcohol
Alcoholic : 8
Non-alcoholic:24
# Both the formulae are in the workspace
fmla_add
Metabol ~ Gastric + Sex
fmla_interaction
Metabol ~ Gastric + Gastric:Sex
# Create the splitting plan for 3-fold cross validation
set.seed(34245) # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(alcohol), 3, NULL, NULL)
# Sample code: Get cross-val predictions for main-effects only model
alcohol$pred_add <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_add <- lm(fmla_add, data = alcohol[split$train, ])
alcohol$pred_add[split$app] <- predict(model_add, newdata = alcohol[split$app, ])
}
# Get the cross-val predictions for the model with interactions
alcohol$pred_interaction <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_interaction <- lm(fmla_interaction, data = alcohol[split$train, ])
alcohol$pred_interaction[split$app] <- predict(model_interaction, newdata = alcohol[split$app, ])
}
# Get RMSE
alcohol %>%
gather(key = modeltype, value = pred, pred_add, pred_interaction) %>%
mutate(residuals = Metabol - pred) %>%
group_by(modeltype) %>%
summarize(rmse = sqrt(mean(residuals^2)))
modeltype            rmse
pred_add         1.375565
pred_interaction 1.303637
Cross-validation confirms that a model with interaction will likely give better predictions.
We will compare relative error to absolute error. For the purposes of modeling, we will define the relative error as:

    rel = (pred - y) / y

that is, the error is relative to the true outcome. We will measure the overall relative error of a model using the root mean squared relative error:

    rmse_rel = sqrt(mean(rel^2))

where mean(rel^2) is the mean of the squared relative errors.
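As a self-contained sketch of the two measures (toy vectors, not fdata; the helper names mirror the rmse() function used earlier):

```r
# Root mean squared error, and its relative counterpart
rmse     <- function(pred, y) sqrt(mean((pred - y)^2))
rmse_rel <- function(pred, y) sqrt(mean(((pred - y) / y)^2))

y    <- c(10, 100, 1000)
pred <- c(12, 102, 1002)  # every prediction is off by exactly 2
rmse(pred, y)             # 2 -- absolute error ignores the scale of y
rmse_rel(pred, y)         # dominated by the 2/10 = 20% miss on the smallest y
```

The same absolute error of 2 is a 20% miss on y = 10 but only a 0.2% miss on y = 1000, which is exactly the distinction rmse_rel captures.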
The example (toy) dataset fdata is loaded in your workspace. It includes the columns:
- y: the true outcome
- pred: the model's prediction
- label: whether the row belongs to the "small purchases" or "large purchases" group
You want to know which model does “better”: the one predicting the small purchases, or the one predicting large ones.
fdata <- readRDS("fdata.rds")
# fdata is in the workspace
summary(fdata)
y pred label
Min. : -5.894 Min. : 1.072 small purchases:50
1st Qu.: 5.407 1st Qu.: 6.373 large purchases:50
Median : 57.374 Median : 55.693
Mean : 306.204 Mean : 305.905
3rd Qu.: 550.903 3rd Qu.: 547.886
Max. :1101.619 Max. :1098.896
# Examine the data: generate the summaries for the groups large and small:
fdata %>%
group_by(label) %>% # group by small/large purchases
summarize(min = min(y), # min of y
mean = mean(y), # mean of y
max = max(y)) # max of y
label                  min        mean        max
small purchases  -5.893499    6.478254   18.62829
large purchases  96.119814  605.928673 1101.61864
# Fill in the blanks to add error columns
fdata2 <- fdata %>%
group_by(label) %>% # group by label
mutate(residual = y - pred, # Residual
relerr = residual / y) # Relative error
# Compare the rmse and rmse.rel of the large and small groups:
fdata2 %>%
group_by(label) %>%
summarize(rmse = sqrt(mean(residual^2)), # RMSE
rmse.rel = sqrt(mean(relerr^2))) # Root mean squared relative error
label                rmse    rmse.rel
small purchases  4.014969  1.24965674
large purchases  5.544439  0.01473322
# Plot the predictions for both groups of purchases
ggplot(fdata2, aes(x = pred, y = y, color = label)) +
geom_point() +
geom_abline() +
facet_wrap(~ label, ncol = 1, scales = "free") +
ggtitle("Outcome vs prediction")
Notice from this example how a model with larger RMSE might still be better, if relative errors are more important than absolute errors.
Next, we will practice modeling on log-transformed monetary output, and then transforming the “log-money” predictions back into monetary units.
The data records subjects’ incomes in 2005 (Income2005), as well as the scores from several aptitude tests taken by the subjects in 1981: Arith, Word, Parag, Math, and AFQT.
The data have already been split into training and test sets (income_train and income_test respectively).
income_train <- readRDS("income_train.rds")
income_test <- readRDS("income_test.rds")
We will build a model of log(income) from the inputs, and then convert log(income) back into income.
# Examine Income2005 in the training set
summary(income_train$Income2005)
Min. 1st Qu. Median Mean 3rd Qu. Max.
63 23000 39000 49894 61500 703637
# Write the formula for log income as a function of the tests and print it
(fmla.log <- log(Income2005) ~ Arith + Word + Parag + Math + AFQT)
log(Income2005) ~ Arith + Word + Parag + Math + AFQT
# Fit the linear model
model.log <- lm(fmla.log, data = income_train)
# Make predictions on income_test
income_test$logpred <- predict(model.log, income_test)
summary(income_test$logpred)
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.766 10.133 10.423 10.419 10.705 11.006
# Convert the predictions to monetary units
income_test$pred.income <- exp(income_test$logpred)
summary(income_test$pred.income)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17432 25167 33615 35363 44566 60217
# Plot predicted income (x axis) vs income
ggplot(income_test, aes(x = pred.income, y = Income2005)) +
geom_point() +
geom_abline(color = "blue")
Remember that when we transform the output before modeling, we have to ‘reverse transform’ the resulting predictions after applying the model.
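The transform/reverse-transform pattern can be seen end to end on synthetic data (a hypothetical frame d, not the income data):

```r
# Synthetic positive outcome with multiplicative noise
set.seed(42)
d <- data.frame(x = runif(100, 1, 10))
d$y <- exp(1 + 0.3 * d$x + rnorm(100, sd = 0.1))

model   <- lm(log(y) ~ x, data = d)  # fit on the log scale
logpred <- predict(model, d)         # predictions come back in log units ...
d$pred  <- exp(logpred)              # ... so exponentiate to return to y's units

summary(d$pred / d$y)  # ratios near 1: predictions are back on y's scale
```

Forgetting the exp() step is a common bug: the "predictions" would then be off by orders of magnitude.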
We will show that log-transforming a monetary output before modeling improves mean relative error (but increases RMSE) compared to modeling the monetary output directly. You will compare the results of model.log from the previous exercise to a model (model.abs) that directly fits income.
# Write the formula for income as a function of the tests and print it
(fmla.abs <- Income2005 ~ Arith + Word + Parag + Math + AFQT)
Income2005 ~ Arith + Word + Parag + Math + AFQT
# Fit the linear model
model.abs <- lm(fmla.abs, data = income_train)
# model.abs is in the workspace
summary(model.abs)
Call:
lm(formula = fmla.abs, data = income_train)
Residuals:
Min 1Q Median 3Q Max
-78728 -24137 -6979 11964 648573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17516.7 6420.1 2.728 0.00642 **
Arith 1552.3 303.4 5.116 3.41e-07 ***
Word -132.3 265.0 -0.499 0.61754
Parag -1155.1 618.3 -1.868 0.06189 .
Math 725.5 372.0 1.950 0.05127 .
AFQT 177.8 144.1 1.234 0.21734
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 45500 on 2063 degrees of freedom
Multiple R-squared: 0.1165, Adjusted R-squared: 0.1144
F-statistic: 54.4 on 5 and 2063 DF, p-value: < 2.2e-16
# Add predictions to the test set
income_test <- income_test %>%
mutate(pred.absmodel = predict(model.abs, income_test), # predictions from model.abs
pred.logmodel = exp(predict(model.log, income_test))) # predictions from model.log
# Gather the predictions and calculate residuals and relative error
income_long <- income_test %>%
gather(key = modeltype, value = pred, pred.absmodel, pred.logmodel) %>%
mutate(residual = pred - Income2005, # residuals
relerr = residual / Income2005) # relative error
# Calculate RMSE and relative RMSE and compare
income_long %>%
group_by(modeltype) %>% # group by modeltype
summarize(rmse = sqrt(mean(residual^2)), # RMSE
rmse.rel = sqrt(mean(relerr^2))) # Root mean squared relative error
modeltype          rmse  rmse.rel
pred.absmodel  37448.37  3.183705
pred.logmodel  39234.90  2.218663
We’ve seen how modeling log(income) can reduce the relative error of the fit, at the cost of increased RMSE. Which tradeoff to make depends on the goals of our project.
We will build a model to predict price from a measure of the house’s size (surface area).
houseprice <- readRDS("houseprice.rds")
The data set houseprice has the columns:
- size: the size of the house
- price: the price of the house
A scatterplot of the data shows that the relationship is quite non-linear: a sort of “hockey stick”, where price is fairly flat for smaller houses but rises steeply as the house gets larger. Quadratics and cubics are often good functional forms for expressing hockey-stick-like relationships. Note that there may not be a “physical” reason that price is related to the square of the size; a quadratic is simply a closed-form approximation of the observed relationship.
ggplot(houseprice, aes(size, price))+
geom_point()
We will fit a model to predict price as a function of the squared size, and look at its fit on the training data.
Because ^ is also a symbol to express interactions, use the function I() to treat the expression x^2 “as is”: that is, as the square of x rather than the interaction of x with itself.
exampleFormula = y ~ I(x^2)
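We can confirm the difference without fitting anything, by inspecting the term labels of toy formulas (x and y are placeholder names):

```r
# In a formula, '^' is formula syntax, not arithmetic:
# y ~ x^2 reduces to just the main effect x ...
attr(terms(y ~ x^2), "term.labels")
# [1] "x"

# ... while I() protects the expression, so x really gets squared
attr(terms(y ~ I(x^2)), "term.labels")
# [1] "I(x^2)"
```

Without I(), the model fit from y ~ x^2 would be silently identical to y ~ x.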
# houseprice is in the workspace
summary(houseprice)
size price
Min. : 44.0 Min. : 42.0
1st Qu.: 73.5 1st Qu.:164.5
Median : 91.0 Median :203.5
Mean : 94.3 Mean :249.2
3rd Qu.:118.5 3rd Qu.:287.8
Max. :150.0 Max. :573.0
# Create the formula for price as a function of squared size
(fmla_sqr <- price ~ I(size^2))
price ~ I(size^2)
# Fit a model of price as a function of squared size (use fmla_sqr)
model_sqr <- lm(fmla_sqr, houseprice)
# Fit a model of price as a linear function of size
model_lin <- lm(price ~ size, houseprice)
# Make predictions and compare
houseprice %>%
mutate(pred_lin = predict(model_lin, houseprice), # predictions from linear model
pred_sqr = predict(model_sqr, houseprice)) %>% # predictions from quadratic model
gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>% # gather the predictions
ggplot(aes(x = size)) +
geom_point(aes(y = price)) + # actual prices
geom_line(aes(y = pred, color = modeltype)) + # the predictions
scale_color_brewer(palette = "Dark2")
A quadratic model seems to fit the houseprice data better than a linear model. We will confirm whether the quadratic model would perform better on out-of-sample data. Since this data set is small, we will use cross-validation.
# houseprice is in the workspace
summary(houseprice)
size price
Min. : 44.0 Min. : 42.0
1st Qu.: 73.5 1st Qu.:164.5
Median : 91.0 Median :203.5
Mean : 94.3 Mean :249.2
3rd Qu.:118.5 3rd Qu.:287.8
Max. :150.0 Max. :573.0
# fmla_sqr is in the workspace
fmla_sqr
price ~ I(size^2)
# Create a splitting plan for 3-fold cross validation
set.seed(34245) # set the seed for reproducibility
splitPlan <- kWayCrossValidation(nrow(houseprice) ,3 ,NULL, NULL)
# Sample code: get cross-val predictions for price ~ size
houseprice$pred_lin <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_lin <- lm(price ~ size, data = houseprice[split$train,])
houseprice$pred_lin[split$app] <- predict(model_lin, newdata = houseprice[split$app,])
}
# Get cross-val predictions for price as a function of size^2 (use fmla_sqr)
houseprice$pred_sqr <- 0 # initialize the prediction vector
for(i in 1:3) {
split <- splitPlan[[i]]
model_sqr <- lm(fmla_sqr, data = houseprice[split$train, ])
houseprice$pred_sqr[split$app] <- predict(model_sqr, newdata = houseprice[split$app, ])
}
# Gather the predictions and calculate the residuals
houseprice_long <- houseprice %>%
gather(key = modeltype, value = pred, pred_lin, pred_sqr) %>%
mutate(residuals = pred - price)
# Compare the cross-validated RMSE for the two models
houseprice_long %>%
group_by(modeltype) %>% # group by modeltype
summarize(rmse = sqrt(mean(residuals^2)))
modeltype <chr> | rmse <dbl> | |||
---|---|---|---|---|
pred_lin | 73.14852 | |||
pred_sqr | 60.27335 |
We’ve confirmed that the quadratic input transformation improved the model.
Next, we will estimate the probability that a sparrow survives a severe winter storm, based on physical characteristics of the sparrow.
The outcome to be predicted is status (“Survived”, “Perished”). The variables we will consider are total_length, weight, and humerus.
When using glm() to create a logistic regression model, we must explicitly specify that family = binomial:
glm(formula, data = data, family = binomial)
We will call summary() and broom::glance() to see different ways of examining a logistic regression model. One of the diagnostics that we will look at is the analog to R-squared, called pseudo R-squared:

    pseudoR2 = 1 - deviance / null.deviance

You can think of deviance as analogous to variance: it is a measure of the variation in categorical data. The pseudo R-squared is analogous to R-squared for standard regression: R-squared is a measure of the “variance explained” by a regression model, and the pseudo R-squared is a measure of the “deviance explained”.
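As a self-contained sketch (synthetic data, not the sparrows), the pseudo R-squared can be computed directly from a fitted glm's deviances:

```r
# Synthetic binary outcome whose probability depends on x
set.seed(101)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, size = 1, prob = plogis(2 * d$x))

model <- glm(y ~ x, data = d, family = binomial)

# Deviance "explained", by analogy with R-squared's variance explained
(pseudoR2 <- 1 - model$deviance / model$null.deviance)
# between 0 and 1; closer to 1 means the model explains more deviance
```

Because the fitted model always contains the intercept-only (null) model, its deviance can never exceed the null deviance, so pseudoR2 is never negative.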
library(broom)
sparrow <- readRDS("sparrow.rds")
# sparrow is in the workspace
summary(sparrow)
status age total_length wingspan
Perished:36 Length:87 Min. :153.0 Min. :236.0
Survived:51 Class :character 1st Qu.:158.0 1st Qu.:245.0
Mode :character Median :160.0 Median :247.0
Mean :160.4 Mean :247.5
3rd Qu.:162.5 3rd Qu.:251.0
Max. :167.0 Max. :256.0
weight beak_head humerus femur
Min. :23.2 Min. :29.80 Min. :0.6600 Min. :0.6500
1st Qu.:24.7 1st Qu.:31.40 1st Qu.:0.7250 1st Qu.:0.7000
Median :25.8 Median :31.70 Median :0.7400 Median :0.7100
Mean :25.8 Mean :31.64 Mean :0.7353 Mean :0.7134
3rd Qu.:26.7 3rd Qu.:32.10 3rd Qu.:0.7500 3rd Qu.:0.7300
Max. :31.0 Max. :33.00 Max. :0.7800 Max. :0.7600
legbone skull sternum
Min. :1.010 Min. :0.5600 Min. :0.7700
1st Qu.:1.110 1st Qu.:0.5900 1st Qu.:0.8300
Median :1.130 Median :0.6000 Median :0.8500
Mean :1.131 Mean :0.6032 Mean :0.8511
3rd Qu.:1.160 3rd Qu.:0.6100 3rd Qu.:0.8800
Max. :1.230 Max. :0.6400 Max. :0.9300
# Create the survived column
sparrow$survived <- ifelse(sparrow$status == "Survived", TRUE, FALSE)
# Create the formula
(fmla <- survived ~ total_length + weight + humerus)
survived ~ total_length + weight + humerus
# Fit the logistic regression model
sparrow_model <- glm(fmla, data = sparrow, family = binomial)
# Call summary
summary(sparrow_model)
Call:
glm(formula = fmla, family = binomial, data = sparrow)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1117 -0.6026 0.2871 0.6577 1.7082
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 46.8813 16.9631 2.764 0.005715 **
total_length -0.5435 0.1409 -3.858 0.000115 ***
weight -0.5689 0.2771 -2.053 0.040060 *
humerus 75.4610 19.1586 3.939 8.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 118.008 on 86 degrees of freedom
Residual deviance: 75.094 on 83 degrees of freedom
AIC: 83.094
Number of Fisher Scoring iterations: 5
# Call glance
(perf <- glance(sparrow_model))
  null.deviance df.null    logLik      AIC      BIC deviance df.residual
       118.0084      86 -37.54718 83.09436 92.95799 75.09436          83
# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
[1] 0.3636526
We’ve fit a logistic regression model to predict probabilities! When looking at the pseudo-R-squared of a logistic regression model, we hope to see a value close to 1.
We will now predict the probability of survival. When calling predict() to get the predicted probabilities from a glm() model, you must specify that you want the response:
predict(model, type = "response")
Otherwise, predict() on a logistic regression model returns the predicted log-odds of the event, not the probability.
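The link between the two scales is the logistic (sigmoid) function; converting a log-odds value to a probability by hand makes the distinction concrete:

```r
# predict() on a glm without type = "response" returns log-odds (link scale).
# The logistic function converts log-odds to a probability:
logodds <- 0.5
prob <- 1 / (1 + exp(-logodds))
prob   # approximately 0.622
```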
We will also use the GainCurvePlot() function to plot the gain curve from the model predictions. If the model’s gain curve is close to the ideal (“wizard”) gain curve, then the model sorted the sparrows well: that is, the model predicted that sparrows that actually survived would have a higher probability of survival. The inputs to the GainCurvePlot() function are:
frame: the data frame holding the predictions and the outcome
xvar: the name of the prediction column (as a string)
truthVar: the name of the outcome column (as a string)
title: a title for the plot (as a string)
GainCurvePlot(frame, xvar, truthVar, title)
# sparrow is in the workspace
summary(sparrow)
status age total_length wingspan
Perished:36 Length:87 Min. :153.0 Min. :236.0
Survived:51 Class :character 1st Qu.:158.0 1st Qu.:245.0
Mode :character Median :160.0 Median :247.0
Mean :160.4 Mean :247.5
3rd Qu.:162.5 3rd Qu.:251.0
Max. :167.0 Max. :256.0
weight beak_head humerus femur
Min. :23.2 Min. :29.80 Min. :0.6600 Min. :0.6500
1st Qu.:24.7 1st Qu.:31.40 1st Qu.:0.7250 1st Qu.:0.7000
Median :25.8 Median :31.70 Median :0.7400 Median :0.7100
Mean :25.8 Mean :31.64 Mean :0.7353 Mean :0.7134
3rd Qu.:26.7 3rd Qu.:32.10 3rd Qu.:0.7500 3rd Qu.:0.7300
Max. :31.0 Max. :33.00 Max. :0.7800 Max. :0.7600
legbone skull sternum survived
Min. :1.010 Min. :0.5600 Min. :0.7700 Mode :logical
1st Qu.:1.110 1st Qu.:0.5900 1st Qu.:0.8300 FALSE:36
Median :1.130 Median :0.6000 Median :0.8500 TRUE :51
Mean :1.131 Mean :0.6032 Mean :0.8511
3rd Qu.:1.160 3rd Qu.:0.6100 3rd Qu.:0.8800
Max. :1.230 Max. :0.6400 Max. :0.9300
# sparrow_model is in the workspace
summary(sparrow_model)
Call:
glm(formula = fmla, family = binomial, data = sparrow)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1117 -0.6026 0.2871 0.6577 1.7082
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 46.8813 16.9631 2.764 0.005715 **
total_length -0.5435 0.1409 -3.858 0.000115 ***
weight -0.5689 0.2771 -2.053 0.040060 *
humerus 75.4610 19.1586 3.939 8.19e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 118.008 on 86 degrees of freedom
Residual deviance: 75.094 on 83 degrees of freedom
AIC: 83.094
Number of Fisher Scoring iterations: 5
# Make predictions
sparrow$pred <- predict(sparrow_model, type = "response")
# Look at gain curve
GainCurvePlot(sparrow, "pred", "survived", "sparrow survival model")
We see from the gain curve that the model follows the wizard curve for about the first 30% of the data, identifying about 45% of the surviving sparrows with only a few false positives.
One of the assumptions of Poisson regression to predict counts is that the event you are counting is Poisson distributed: the average count per unit time is the same as the variance of the count. In practice, “the same” means that the mean and the variance should be of a similar order of magnitude.
When the variance is much larger than the mean, the Poisson assumption doesn’t apply, and one solution is to use quasipoisson regression, which does not assume that variance = mean.
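The mean-variance check is easy to illustrate on simulated data (the lambda value below is arbitrary, chosen only for the demonstration):

```r
# For genuinely Poisson-distributed counts, the mean and variance are close:
set.seed(1234)                      # arbitrary seed for reproducibility
counts <- rpois(10000, lambda = 5)  # simulated Poisson counts
mean(counts)                        # near 5
var(counts)                         # also near 5: Poisson regression is fine
```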
We will build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. We will train the model on data from the month of July.
The data frame bikesJuly has the columns shown in the str() output below.
Remember that we must specify family = poisson or family = quasipoisson when using glm() to fit a count model.
bikesAugust <- miceadds::load.Rdata2(filename = "Bikes.RData")
bikesJuly <- readRDS("bikesJuly.rds")
# bikesJuly is in the workspace
str(bikesJuly)
'data.frame': 744 obs. of 12 variables:
$ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
$ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
$ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
$ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
$ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
$ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
$ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
$ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
$ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# The outcome column
(outcome <- "cnt")
[1] "cnt"
# The inputs to use
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
[1] "hr" "holiday" "workingday" "weathersit" "temp"
[6] "atemp" "hum" "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
[1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Calculate the mean and variance of the outcome
(mean_bikes <- mean(bikesJuly$cnt))
[1] 273.6653
(var_bikes <- var(bikesJuly$cnt))
[1] 45863.84
# Fit the model
bike_model <- glm(fmla, data = bikesJuly, family = quasipoisson)
# Call glance
(perf <- glance(bike_model))
  null.deviance df.null logLik AIC BIC deviance df.residual
       133364.9     743     NA  NA  NA  28774.9         712
# Calculate pseudo-R-squared
(pseudoR2 <- 1 - perf$deviance / perf$null.deviance)
[1] 0.7842393
We’ve fit a (quasi)poisson model to predict counts! As with a logistic model, we hope for a pseudo-R-squared near 1.
We will use the model to make predictions for the month of August.
# bikesAugust is in the workspace
str(bikesAugust)
'data.frame': 744 obs. of 12 variables:
$ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
$ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
$ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
$ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
$ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
$ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
$ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
$ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
$ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# bike_model is in the workspace
summary(bike_model)
Call:
glm(formula = fmla, family = quasipoisson, data = bikesJuly)
Deviance Residuals:
Min 1Q Median 3Q Max
-21.6117 -4.3121 -0.7223 3.5507 16.5079
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.934986 0.439027 13.519 < 2e-16 ***
hr1 -0.580055 0.193354 -3.000 0.002794 **
hr2 -0.892314 0.215452 -4.142 3.86e-05 ***
hr3 -1.662342 0.290658 -5.719 1.58e-08 ***
hr4 -2.350204 0.393560 -5.972 3.71e-09 ***
hr5 -1.084289 0.230130 -4.712 2.96e-06 ***
hr6 0.211945 0.156476 1.354 0.176012
hr7 1.211135 0.132332 9.152 < 2e-16 ***
hr8 1.648361 0.127177 12.961 < 2e-16 ***
hr9 1.155669 0.133927 8.629 < 2e-16 ***
hr10 0.993913 0.137096 7.250 1.09e-12 ***
hr11 1.116547 0.136300 8.192 1.19e-15 ***
hr12 1.282685 0.134769 9.518 < 2e-16 ***
hr13 1.273010 0.135872 9.369 < 2e-16 ***
hr14 1.237721 0.136386 9.075 < 2e-16 ***
hr15 1.260647 0.136144 9.260 < 2e-16 ***
hr16 1.515893 0.132727 11.421 < 2e-16 ***
hr17 1.948404 0.128080 15.212 < 2e-16 ***
hr18 1.893915 0.127812 14.818 < 2e-16 ***
hr19 1.669277 0.128471 12.993 < 2e-16 ***
hr20 1.420732 0.131004 10.845 < 2e-16 ***
hr21 1.146763 0.134042 8.555 < 2e-16 ***
hr22 0.856182 0.138982 6.160 1.21e-09 ***
hr23 0.479197 0.148051 3.237 0.001265 **
holidayTRUE 0.201598 0.079039 2.551 0.010961 *
workingdayTRUE 0.116798 0.033510 3.485 0.000521 ***
weathersitLight Precipitation -0.214801 0.072699 -2.955 0.003233 **
weathersitMisty -0.010757 0.038600 -0.279 0.780572
temp -3.246001 1.148270 -2.827 0.004833 **
atemp 2.042314 0.953772 2.141 0.032589 *
hum -0.748557 0.236015 -3.172 0.001581 **
windspeed 0.003277 0.148814 0.022 0.982439
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for quasipoisson family taken to be 38.98949)
Null deviance: 133365 on 743 degrees of freedom
Residual deviance: 28775 on 712 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
# Make predictions on August data
bikesAugust$pred <- predict(bike_model, newdata = bikesAugust, type = "response")
# Calculate the RMSE
bikesAugust %>%
mutate(residual = pred - cnt) %>%
summarize(rmse = sqrt(mean(residual^2)))
      rmse
1 112.5815
# Plot predictions vs cnt (pred on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
geom_point() +
geom_abline(color = "darkblue")
(Quasi)poisson models predict non-negative rates, making them useful for count or frequency data.
We visualized the bike model’s predictions using the standard “outcome vs. prediction” scatter plot. Since the bike rental data is time series data, we might be interested in how the model performs as a function of time. In this exercise, we will compare the predictions and actual rentals on an hourly basis, for the first 14 days of August.
To create the plot we will use the function tidyr::gather() to consolidate the predicted and actual values from bikesAugust into a single column. gather() takes as arguments:
key: the name of the new key column (here, valuetype)
value: the name of the new value column
the names of the columns to gather (here, cnt and pred)
We’ll use the gathered data frame to compare the actual and predicted rental counts as a function of time. The time index, instant, counts the number of hourly observations since the beginning of data collection. The sample code converts the instants to daily units, starting from 0.
# Plot predictions and cnt by date/time
(quasipoisson_plot <- bikesAugust %>%
# set start to 0, convert unit to days
mutate(instant = (instant - min(instant))/24) %>%
# gather cnt and pred into a value column
gather(key = valuetype, value = value, cnt, pred) %>%
filter(instant < 14) %>% # restrict to first 14 days
# plot value by instant
ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
geom_point() +
geom_line() +
scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
scale_color_brewer(palette = "Dark2") +
ggtitle("Predicted August bike rentals, Quasipoisson model"))
This model mostly identifies the slow and busy hours of the day, although it often underestimates peak demand.
When using gam() to model outcome as an additive function of the inputs, you can use the s() function inside formulas to designate that you want to use a spline to model the non-linear relationship of a continuous variable to the outcome.
Suppose that you want to predict how much weight (Wtloss) a dieter will lose over a 2-year diet plan as a function of:
Diet type (categorical)
Sex (categorical)
Age at beginning of diet (continuous)
BMI (body mass index) at beginning of diet (continuous)
Wtloss ~ Diet + Sex + s(Age) + s(BMI)
This formula says that Age and BMI will both be modeled non-linearly.
Suppose that in the diet problem from the previous exercise, you now also want to take into account the dieter’s resting metabolic rate (BMR) and amount of aerobic exercise (E).
You have reason to believe that the relationship between BMR and weight loss is linear (and you want to model it that way), but not necessarily the relationship between aerobic exercise and weight loss.
Wtloss ~ Diet + Sex + s(Age) + s(BMI) + BMR + s(E)
This formula says to model Age, BMI, and E as non-linear, but to model BMR as linear.
We will model the average leaf weight on a soybean plant as a function of time (after planting). As we will see, the soybean plant doesn’t grow at a steady rate, but rather has a “growth spurt” that eventually tapers off. Hence, leaf weight is not well described by a linear model.
We can designate the variable we want to model non-linearly in a formula with the s() function:
y ~ s(x)
gam() from the package mgcv has the calling interface:
gam(formula, family, data)
For standard regression, use family = gaussian (the default).
soybean_train <- readRDS("soybean_train.rds")
# soybean_train is in the workspace
summary(soybean_train)
Plot Variety Year Time weight
1988F6 : 10 F:161 1988:124 Min. :14.00 Min. : 0.0290
1988F7 : 9 P:169 1989:102 1st Qu.:27.00 1st Qu.: 0.6663
1988P1 : 9 1990:104 Median :42.00 Median : 3.5233
1988P8 : 9 Mean :43.56 Mean : 6.1645
1988P2 : 9 3rd Qu.:56.00 3rd Qu.:10.3808
1988F3 : 8 Max. :84.00 Max. :27.3700
(Other):276
# Plot weight vs Time (Time on x axis)
ggplot(soybean_train, aes(x = Time, y = weight)) +
geom_point()
# Load the package mgcv
library(mgcv)
# Create the formula
(fmla.gam <- weight ~ s(Time) )
weight ~ s(Time)
# Fit the GAM model
model.gam <- gam(fmla.gam, family = gaussian, data = soybean_train)
# Create the formula
(fmla.lin <- weight ~ Time)
weight ~ Time
# Fit the linear model
model.lin <- lm(fmla.lin, data = soybean_train)
# From previous step
library(mgcv)
fmla.gam <- weight ~ s(Time)
model.gam <- gam(fmla.gam, data = soybean_train, family = gaussian)
# Call summary() on model.lin and look for R-squared
summary(model.lin)
Call:
lm(formula = fmla.lin, data = soybean_train)
Residuals:
Min 1Q Median 3Q Max
-9.3933 -1.7100 -0.3909 1.9056 11.4381
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.559283 0.358527 -18.30 <2e-16 ***
Time 0.292094 0.007444 39.24 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.778 on 328 degrees of freedom
Multiple R-squared: 0.8244, Adjusted R-squared: 0.8238
F-statistic: 1540 on 1 and 328 DF, p-value: < 2.2e-16
# Call summary() on model.gam and look for R-squared
summary(model.gam)
Family: gaussian
Link function: identity
Formula:
weight ~ s(Time)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.1645 0.1143 53.93 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Time) 8.495 8.93 338.2 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.902 Deviance explained = 90.4%
GCV = 4.4395 Scale est. = 4.3117 n = 330
# Call plot() on model.gam
plot(model.gam)
We have fit a generalized additive model. For this data, the GAM appears to fit the data better than a linear model, as measured by the R-squared.
We will apply the soybean models to new data: soybean_test
soybean_test <- readRDS("soybean_test.rds")
# soybean_test is in the workspace
summary(soybean_test)
Plot Variety Year Time weight
1988F8 : 4 F:43 1988:32 Min. :14.00 Min. : 0.0380
1988P7 : 4 P:39 1989:26 1st Qu.:23.00 1st Qu.: 0.4248
1989F8 : 4 1990:24 Median :41.00 Median : 3.0025
1990F8 : 4 Mean :44.09 Mean : 7.1576
1988F4 : 3 3rd Qu.:69.00 3rd Qu.:15.0113
1988F2 : 3 Max. :84.00 Max. :30.2717
(Other):60
# Get predictions from linear model
soybean_test$pred.lin <- predict(model.lin, newdata = soybean_test)
# Get predictions from gam model
soybean_test$pred.gam <- as.numeric(predict(model.gam, newdata = soybean_test))
# Gather the predictions into a "long" dataset
soybean_long <- soybean_test %>%
gather(key = modeltype, value = pred, pred.lin, pred.gam)
# Calculate the rmse
soybean_long %>%
mutate(residual = weight - pred) %>% # residuals
group_by(modeltype) %>% # group by modeltype
summarize(rmse = sqrt(mean(residual^2))) # calculate the RMSE
  modeltype     rmse
  pred.gam  2.286451
  pred.lin  3.190995
# Compare the predictions against actual weights on the test data
soybean_long %>%
ggplot(aes(x = Time)) + # the column for the x axis
geom_point(aes(y = weight)) + # the y-column for the scatterplot
geom_point(aes(y = pred, color = modeltype)) + # the y-column for the point-and-line plot
geom_line(aes(y = pred, color = modeltype, linetype = modeltype)) + # the y-column for the point-and-line plot
scale_color_brewer(palette = "Dark2")
The GAM learns the non-linear growth function of the soybean plants, including the fact that weight is never negative.
The tree predicts the expected intelligence (humans have an intelligence of 1) of several mammals, as a function of gestation time (in days) and average litter size for that species.
The leaf nodes show the expected intelligence for the data points in that node, as well as the percentage of the data points that fall into the node.
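As a sketch, such a regression tree could be fit with rpart (the toy mammals data frame below is invented for illustration; the real mammal dataset is not included in this excerpt):

```r
# Sketch: fit a regression tree like the one described, using rpart.
# The data frame here is invented toy data, not the actual mammal dataset.
library(rpart)
mammals <- data.frame(
  intelligence = c(0.10, 0.25, 0.40, 0.70, 1.00, 0.15, 0.55, 0.85),
  gestation    = c(30, 45, 120, 210, 270, 35, 150, 240),
  litter       = c(6, 5, 3, 1, 1, 6, 2, 1)
)
tree_model <- rpart(intelligence ~ gestation + litter, data = mammals,
                    control = rpart.control(minsplit = 2))  # allow splits on tiny data
tree_model  # printed nodes show the mean outcome and n in each leaf
```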
We will again build a model to predict the number of bikes rented in an hour as a function of the weather, the type of day (holiday, working day, or weekend), and the time of day. We will train the model on data from the month of July.
We will use the ranger package to fit the random forest model. For this exercise, the key arguments to the ranger() call are:
Since there are a lot of input variables, for convenience we will specify the outcome and the inputs in the variables outcome and vars, and use paste() to assemble a string representing the model formula.
# bikesJuly is in the workspace
str(bikesJuly)
'data.frame': 744 obs. of 12 variables:
$ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ workingday: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
$ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
$ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
$ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
$ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
$ cnt : int 149 93 90 33 4 10 27 50 142 219 ...
$ instant : int 13004 13005 13006 13007 13008 13009 13010 13011 13012 13013 ...
$ mnth : int 7 7 7 7 7 7 7 7 7 7 ...
$ yr : int 1 1 1 1 1 1 1 1 1 1 ...
# Random seed to reproduce results
seed <- 423563
# The outcome column
(outcome <- "cnt")
[1] "cnt"
# The input variables
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
[1] "hr" "holiday" "workingday" "weathersit" "temp"
[6] "atemp" "hum" "windspeed"
# Create the formula string for bikes rented as a function of the inputs
(fmla <- paste(outcome, "~", paste(vars, collapse = " + ")))
[1] "cnt ~ hr + holiday + workingday + weathersit + temp + atemp + hum + windspeed"
# Load the package ranger
library(ranger)
# Fit and print the random forest model
(bike_model_rf <- ranger(fmla, # formula
bikesJuly, # data
num.trees = 500,
respect.unordered.factors = "order",
seed = seed))
Ranger result
Call:
ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order", seed = seed)
Type: Regression
Number of trees: 500
Sample size: 744
Number of independent variables: 8
Mtry: 2
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 8230.568
R squared (OOB): 0.8205434
We have fit a model to the data with a respectable R-squared.
We will use the model to predict bike rentals for the month of August.
The predict() function for a ranger model produces a list. One of the elements of this list is predictions, a vector of predicted values. You can access predictions with the $ notation for accessing named elements of a list:
predict(model, data)$predictions
# bikesAugust is in the workspace
str(bikesAugust)
'data.frame': 744 obs. of 13 variables:
$ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
$ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
$ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
$ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
$ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
$ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
$ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
$ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
$ yr : int 1 1 1 1 1 1 1 1 1 1 ...
$ pred : num 94.96 51.74 37.98 17.58 9.36 ...
# bike_model_rf is in the workspace
bike_model_rf
Ranger result
Call:
ranger(fmla, bikesJuly, num.trees = 500, respect.unordered.factors = "order", seed = seed)
Type: Regression
Number of trees: 500
Sample size: 744
Number of independent variables: 8
Mtry: 2
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 8230.568
R squared (OOB): 0.8205434
# Make predictions on the August data
bikesAugust$pred <- predict(bike_model_rf, bikesAugust)$predictions
# Calculate the RMSE of the predictions
bikesAugust %>%
mutate(residual = cnt - pred) %>% # calculate the residual
summarize(rmse = sqrt(mean(residual^2))) # calculate rmse
      rmse
1 97.18347
# Plot actual outcome vs predictions (predictions on x-axis)
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
geom_point() +
geom_abline()
This random forest model outperforms the poisson count model on the same data; it is discovering more complex non-linear or non-additive relationships in the data.
We saw that the random forest bike model did better than the quasipoisson model on the August data, in terms of RMSE.
We will now visualize the random forest model’s August predictions as a function of time.
Recall that the quasipoisson model mostly identified the pattern of slow and busy hours in the day, but somewhat underestimated peak demand. We would like to see how the random forest model compares.
The plot quasipoisson_plot of quasipoisson model predictions as a function of time is shown.
quasipoisson_plot
first_two_weeks <- bikesAugust %>%
# Set start to 0, convert unit to days
mutate(instant = (instant - min(instant)) / 24) %>%
# Gather cnt and pred into a column named value with key valuetype
gather(key = valuetype, value = value, cnt, pred) %>%
# Filter for rows in the first two weeks
filter(instant < 14)
# Plot predictions and cnt by date/time
(randomforest_plot <- ggplot(first_two_weeks, aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
geom_point() +
geom_line() +
scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
scale_color_brewer(palette = "Dark2") +
ggtitle("Predicted August bike rentals, Random Forest plot"))
The random forest model captured the day-to-day variations in peak demand better than the quasipoisson model, but it still underestimates peak demand and overestimates minimum demand. So there is still room for improvement.
We will use vtreat to one-hot-encode a categorical variable on a small example. vtreat creates a treatment plan to transform categorical variables into indicator variables (coded “lev”), and to clean bad values out of numerical variables (coded “clean”).
To design a treatment plan use the function designTreatmentsZ()
treatplan <- designTreatmentsZ(data, varlist)
designTreatmentsZ() returns a list with an element scoreFrame: a data frame that includes the names and types of the new variables:
scoreFrame <- treatplan %>%
magrittr::use_series(scoreFrame) %>%
select(varName, origName, code)
(magrittr::use_series() is an alias for $ that you can use in pipes.)
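Since use_series() is just $-extraction in pipe-friendly form, these forms are equivalent:

```r
# magrittr::use_series() extracts a single column by name inside a pipe;
# all three expressions below return the same vector
library(magrittr)
d <- data.frame(varName = c("size", "color_lev_x_b"),
                code    = c("clean", "lev"))
d %>% use_series(varName)
d[["varName"]]
d$varName
```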
For these exercises, we want the varName values where code is either “clean” or “lev”:
newvarlist <- scoreFrame %>%
  filter(code %in% c("clean", "lev")) %>%
  magrittr::use_series(varName)
To transform the data set into all numerical and one-hot-encoded variables, use prepare():
data.treat <- prepare(treatplan, data, varRestriction = newvarlist)
dframe <- readRDS("dframe.rds")
# dframe is in the workspace
dframe
   color size popularity
1      b   13  1.0785088
2      r   11  1.3956245
3      r   15  0.9217988
4      r   14  1.2025453
5      r   13  1.0838662
6      b   11  0.8043527
7      r    9  1.1035440
8      g   12  0.8746332
9      b    7  0.6947058
10     b   12  0.8832502
# Create and print a vector of variable names
(vars <- c("color", "size"))
[1] "color" "size"
# Load the package vtreat
library(vtreat)
# Create the treatment plan
treatplan <- designTreatmentsZ(dframe, vars)
[1] "vtreat 1.6.0 inspecting inputs Wed Jun 10 17:55:13 2020"
[1] "designing treatments Wed Jun 10 17:55:13 2020"
[1] " have initial level statistics Wed Jun 10 17:55:13 2020"
[1] " scoring treatments Wed Jun 10 17:55:13 2020"
[1] "have treatment plan Wed Jun 10 17:55:13 2020"
# Examine the scoreFrame
(scoreFrame <- treatplan %>%
magrittr::use_series(scoreFrame) %>%
select(varName, origName, code))
        varName origName  code
1    color_catP    color  catP
2          size     size clean
3 color_lev_x_b    color   lev
4 color_lev_x_g    color   lev
5 color_lev_x_r    color   lev
# We only want the rows with codes "clean" or "lev"
(newvars <- scoreFrame %>%
filter(code %in% c("clean", "lev")) %>%
magrittr::use_series(varName))
[1] "size" "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"
# Create the treated training data
(dframe.treat <- prepare(treatplan, dframe, varRestriction = newvars))
   size color_lev_x_b color_lev_x_g color_lev_x_r
1    13             1             0             0
2    11             0             0             1
3    15             0             0             1
4    14             0             0             1
5    13             0             0             1
6    11             1             0             0
7     9             0             0             1
8    12             0             1             0
9     7             1             0             0
10   12             1             0             0
We have successfully one-hot-encoded the categorical data. The new indicator variables have “lev” in their names, and the cleaned continuous variables are coded “clean” in the scoreFrame (here, size keeps its original name). The treated data is all numerical, with no missing values, and is suitable for use with xgboost and other R modeling functions.
When a level of a categorical variable is rare, sometimes it will fail to show up in training data. If that rare level then appears in future data, downstream models may not know what to do with it. When such novel levels appear, using model.matrix or caret::dummyVars to one-hot-encode will not work correctly.
vtreat is a “safer” alternative to model.matrix for one-hot-encoding, because it can manage novel levels safely. vtreat also manages missing values in the data (both categorical and continuous).
We will see how vtreat handles categorical values that did not appear in the training set.
testframe <- readRDS("testframe.rds")
# treatplan is in the workspace
summary(treatplan)
Length Class Mode
treatments 3 -none- list
scoreFrame 8 data.frame list
outcomename 1 -none- character
vtreatVersion 1 package_version list
outcomeType 1 -none- character
outcomeTarget 1 -none- character
meanY 1 -none- logical
splitmethod 1 -none- character
# newvars is in the workspace
newvars
[1] "size" "color_lev_x_b" "color_lev_x_g" "color_lev_x_r"
# Print dframe and testframe
dframe
   color size popularity
1      b   13  1.0785088
2      r   11  1.3956245
3      r   15  0.9217988
4      r   14  1.2025453
5      r   13  1.0838662
6      b   11  0.8043527
7      r    9  1.1035440
8      g   12  0.8746332
9      b    7  0.6947058
10     b   12  0.8832502
testframe
   color size popularity
1      g    7  0.9733920
2      g    8  0.9122529
3      y   10  1.4217153
4      g   12  1.1905828
5      g    6  0.9866464
6      y    8  1.3697515
7      b   12  1.0959387
8      g   12  0.9161547
9      g   12  1.0000460
10     r    8  1.3137360
# Use prepare() to one-hot-encode testframe
(testframe.treat <- prepare(treatplan, testframe, varRestriction = newvars))
   size color_lev_x_b color_lev_x_g color_lev_x_r
1     7             0             1             0
2     8             0             1             0
3    10             0             0             0
4    12             0             1             0
5     6             0             1             0
6     8             0             0             0
7    12             1             0             0
8    12             0             1             0
9    12             0             1             0
10    8             0             0             1
As we saw, vtreat encodes novel levels (like the color yellow, which was not present in the training data) as all zeros: “none of the known colors”. This allows downstream models to accept novel values without crashing.
We will create one-hot-encoded data frames of the July/August bike data, for use with xgboost later on.
For your convenience, we have defined the variable vars with the list of variable columns for the model.
library(magrittr)
Attaching package: ‘magrittr’
The following object is masked from ‘package:tidyr’:
extract
# The outcome column
(outcome <- "cnt")
[1] "cnt"
# The input columns
(vars <- c("hr", "holiday", "workingday", "weathersit", "temp", "atemp", "hum", "windspeed"))
[1] "hr" "holiday" "workingday" "weathersit" "temp"
[6] "atemp" "hum" "windspeed"
# Load the package vtreat
library(vtreat)
# Create the treatment plan from bikesJuly (the training data)
treatplan <- designTreatmentsZ(bikesJuly, vars, verbose = FALSE)
# Get the "clean" and "lev" variables from the scoreFrame
(newvars <- treatplan %>%
use_series(scoreFrame) %>%
filter(code %in% c("lev", "clean")) %>% # get the rows you care about
use_series(varName)) # get the varName column
[1] "holiday"
[2] "workingday"
[3] "temp"
[4] "atemp"
[5] "hum"
[6] "windspeed"
[7] "hr_lev_x_0"
[8] "hr_lev_x_1"
[9] "hr_lev_x_10"
[10] "hr_lev_x_11"
[11] "hr_lev_x_12"
[12] "hr_lev_x_13"
[13] "hr_lev_x_14"
[14] "hr_lev_x_15"
[15] "hr_lev_x_16"
[16] "hr_lev_x_17"
[17] "hr_lev_x_18"
[18] "hr_lev_x_19"
[19] "hr_lev_x_2"
[20] "hr_lev_x_20"
[21] "hr_lev_x_21"
[22] "hr_lev_x_22"
[23] "hr_lev_x_23"
[24] "hr_lev_x_3"
[25] "hr_lev_x_4"
[26] "hr_lev_x_5"
[27] "hr_lev_x_6"
[28] "hr_lev_x_7"
[29] "hr_lev_x_8"
[30] "hr_lev_x_9"
[31] "weathersit_lev_x_Clear_to_partly_cloudy"
[32] "weathersit_lev_x_Light_Precipitation"
[33] "weathersit_lev_x_Misty"
# Prepare the training data
bikesJuly.treat <- prepare(treatplan, bikesJuly, varRestriction = newvars)
# Prepare the test data
bikesAugust.treat <- prepare(treatplan, bikesAugust, varRestriction = newvars)
# Call str() on the treated data
str(bikesJuly.treat)
'data.frame': 744 obs. of 33 variables:
$ holiday : num 0 0 0 0 0 0 0 0 0 0 ...
$ workingday : num 0 0 0 0 0 0 0 0 0 0 ...
$ temp : num 0.76 0.74 0.72 0.72 0.7 0.68 0.7 0.74 0.78 0.82 ...
$ atemp : num 0.727 0.697 0.697 0.712 0.667 ...
$ hum : num 0.66 0.7 0.74 0.84 0.79 0.79 0.79 0.7 0.62 0.56 ...
$ windspeed : num 0 0.1343 0.0896 0.1343 0.194 ...
$ hr_lev_x_0 : num 1 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_1 : num 0 1 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_10 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_11 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_12 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_13 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_14 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_15 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_16 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_17 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_18 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_19 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_2 : num 0 0 1 0 0 0 0 0 0 0 ...
$ hr_lev_x_20 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_21 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_22 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_23 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_3 : num 0 0 0 1 0 0 0 0 0 0 ...
$ hr_lev_x_4 : num 0 0 0 0 1 0 0 0 0 0 ...
$ hr_lev_x_5 : num 0 0 0 0 0 1 0 0 0 0 ...
$ hr_lev_x_6 : num 0 0 0 0 0 0 1 0 0 0 ...
$ hr_lev_x_7 : num 0 0 0 0 0 0 0 1 0 0 ...
$ hr_lev_x_8 : num 0 0 0 0 0 0 0 0 1 0 ...
$ hr_lev_x_9 : num 0 0 0 0 0 0 0 0 0 1 ...
$ weathersit_lev_x_Clear_to_partly_cloudy: num 1 1 1 1 1 1 1 1 1 1 ...
$ weathersit_lev_x_Light_Precipitation : num 0 0 0 0 0 0 0 0 0 0 ...
$ weathersit_lev_x_Misty : num 0 0 0 0 0 0 0 0 0 0 ...
str(bikesAugust.treat)
'data.frame': 744 obs. of 33 variables:
$ holiday : num 0 0 0 0 0 0 0 0 0 0 ...
$ workingday : num 1 1 1 1 1 1 1 1 1 1 ...
$ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
$ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
$ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
$ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
$ hr_lev_x_0 : num 1 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_1 : num 0 1 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_10 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_11 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_12 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_13 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_14 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_15 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_16 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_17 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_18 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_19 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_2 : num 0 0 1 0 0 0 0 0 0 0 ...
$ hr_lev_x_20 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_21 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_22 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_23 : num 0 0 0 0 0 0 0 0 0 0 ...
$ hr_lev_x_3 : num 0 0 0 1 0 0 0 0 0 0 ...
$ hr_lev_x_4 : num 0 0 0 0 1 0 0 0 0 0 ...
$ hr_lev_x_5 : num 0 0 0 0 0 1 0 0 0 0 ...
$ hr_lev_x_6 : num 0 0 0 0 0 0 1 0 0 0 ...
$ hr_lev_x_7 : num 0 0 0 0 0 0 0 1 0 0 ...
$ hr_lev_x_8 : num 0 0 0 0 0 0 0 0 1 0 ...
$ hr_lev_x_9 : num 0 0 0 0 0 0 0 0 0 1 ...
$ weathersit_lev_x_Clear_to_partly_cloudy: num 1 1 1 1 0 0 1 0 0 0 ...
$ weathersit_lev_x_Light_Precipitation : num 0 0 0 0 0 0 0 0 0 0 ...
$ weathersit_lev_x_Misty : num 0 0 0 0 1 1 0 1 1 1 ...
The bike data is now in completely numeric form, ready to use with xgboost. Note that the treated data does not include the outcome column.
Now we will get ready to build a gradient boosting model to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. We will train the model on data from the month of July.
Remember that bikesJuly.treat no longer has the outcome column, so we must get it from the untreated data: bikesJuly$cnt.
We will use the xgboost package to fit the gradient boosting model. The function xgb.cv() uses cross-validation to estimate the out-of-sample learning error as each new tree is added to the model. The appropriate number of trees to use in the final model is the number that minimizes the holdout RMSE.
For this example, the key arguments to the xgb.cv() call are:
- data: the training data, as a numeric matrix
- label: the vector of outcomes (here, cnt)
- nrounds: the maximum number of trees to build
- nfold: the number of cross-validation folds
- objective: "reg:linear" for a continuous outcome
- eta: the learning rate
- max_depth: the maximum depth of each tree
- early_stopping_rounds: stop early if performance stops improving
# Load the package xgboost
library(xgboost)
# Run xgb.cv
cv <- xgb.cv(data = as.matrix(bikesJuly.treat), # training data as numeric matrix
             label = bikesJuly$cnt,             # outcome column, from the untreated data
             nrounds = 100,                     # maximum number of trees to build
             nfold = 5,                         # number of cross-validation folds
             objective = "reg:linear",          # squared-error objective for regression
             eta = 0.3,                         # learning rate
             max_depth = 6,                     # maximum tree depth
             early_stopping_rounds = 10,        # stop if no improvement for 10 rounds
             verbose = 0                        # silent
)
# Get the evaluation log
elog <- cv$evaluation_log
# Determine and print how many trees minimize training and test error
elog %>%
summarize(ntrees.train = which.min(train_rmse_mean), # find the index of min(train_rmse_mean)
ntrees.test = which.min(test_rmse_mean)) # find the index of min(test_rmse_mean)
ntrees.train | ntrees.test
---|---
65 | 55
In most cases, ntrees.test is less than ntrees.train. The training error keeps decreasing even after the test error starts to increase. It’s important to use cross-validation to find the right number of trees (as determined by ntrees.test) and avoid an overfit model.
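The typical shape of these two error curves can be illustrated with synthetic RMSE sequences (the numbers below are invented, chosen only to mimic the pattern, not taken from the bike data):

```r
# Synthetic illustration (invented numbers): training RMSE keeps decreasing,
# while test RMSE bottoms out and then rises as the model starts to overfit.
train_rmse <- c(90, 70, 55, 45, 38, 33, 30, 28)   # falls monotonically
test_rmse  <- c(95, 78, 66, 60, 58, 59, 62, 66)   # minimum at round 5

which.min(train_rmse)  # 8: training error is smallest with the most trees
which.min(test_rmse)   # 5: but out-of-sample error is smallest earlier
```

Picking the tree count at which.min(test_rmse) rather than which.min(train_rmse) is what protects against overfitting.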
We will fit a gradient boosting model using xgboost() to predict the number of bikes rented in an hour as a function of the weather and the type and time of day. We will train the model on data from the month of July and predict on data for the month of August.
Remember the vtreat-ed data no longer has the outcome column, so you must get it from the original data (the cnt column).
# The number of trees to use, as determined by xgb.cv
ntrees <- 84
ntrees
[1] 84
# Run xgboost
bike_model_xgb <- xgboost(data = as.matrix(bikesJuly.treat), # training data as matrix
label = bikesJuly$cnt, # column of outcomes
nrounds = ntrees, # number of trees to build
objective = "reg:linear", # objective
eta = 0.3, # learning rate
                          max_depth = 6, # maximum tree depth
                          verbose = 0 # silent
)
# Make predictions
bikesAugust$pred <- predict(bike_model_xgb, as.matrix(bikesAugust.treat))
# Plot predictions (on x axis) vs actual bike rental count
ggplot(bikesAugust, aes(x = pred, y = cnt)) +
geom_point() +
geom_abline()
Overall, the scatterplot looks pretty good, but did you notice that the model made some negative predictions?
We will evaluate the gradient boosting model bike_model_xgb using data from the month of August. We’ll compare this model’s RMSE for August to the RMSE of previous models that we’ve built.
# bikesAugust is in the workspace
str(bikesAugust)
'data.frame': 744 obs. of 13 variables:
$ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ holiday : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ workingday: logi TRUE TRUE TRUE TRUE TRUE TRUE ...
$ weathersit: chr "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" "Clear to partly cloudy" ...
$ temp : num 0.68 0.66 0.64 0.64 0.64 0.64 0.64 0.64 0.66 0.68 ...
$ atemp : num 0.636 0.606 0.576 0.576 0.591 ...
$ hum : num 0.79 0.83 0.83 0.83 0.78 0.78 0.78 0.83 0.78 0.74 ...
$ windspeed : num 0.1642 0.0896 0.1045 0.1045 0.1343 ...
$ cnt : int 47 33 13 7 4 49 185 487 681 350 ...
$ instant : int 13748 13749 13750 13751 13752 13753 13754 13755 13756 13757 ...
$ mnth : int 8 8 8 8 8 8 8 8 8 8 ...
$ yr : int 1 1 1 1 1 1 1 1 1 1 ...
$ pred : num 48.548 35.349 0.625 -6.652 3.92 ...
# Calculate RMSE
bikesAugust %>%
mutate(residuals = cnt - pred) %>%
summarize(rmse = sqrt(mean(residuals^2)))
rmse
---
76.73242
Even though this gradient boosting model made some negative predictions, overall it makes smaller errors than the previous two models. Perhaps rounding negative predictions up to zero is a reasonable tradeoff.
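That tradeoff can be sketched with pmax(). Using the first five August rows from the str() output above, clamping the one negative prediction to zero can only shrink that residual, since actual counts are never negative. (The rmse() helper below is our own, not part of any package.)

```r
# First five August rows from the str() output above
actual <- c(47, 33, 13, 7, 4)
pred   <- c(48.548, 35.349, 0.625, -6.652, 3.920)

# Root mean squared error (our own helper)
rmse <- function(truth, est) sqrt(mean((truth - est)^2))

rmse(actual, pred)           # RMSE with the raw predictions
rmse(actual, pmax(pred, 0))  # clamping the negative prediction reduces RMSE
```

When the outcome is non-negative, replacing a negative prediction with zero never moves it farther from the truth, so the clamped RMSE is never larger than the raw one.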
You’ve now seen three different ways to model the bike rental data. For this example, you’ve seen that the gradient boosting model had the smallest RMSE. Let’s compare the gradient boosting model’s predictions to the other two models as a function of time.
# Print quasipoisson_plot
quasipoisson_plot
# Print randomforest_plot
randomforest_plot
# Plot predictions and actual bike rentals as a function of time (days)
bikesAugust %>%
mutate(instant = (instant - min(instant))/24) %>% # set start to 0, convert unit to days
gather(key = valuetype, value = value, cnt, pred) %>%
filter(instant < 14) %>% # first two weeks
ggplot(aes(x = instant, y = value, color = valuetype, linetype = valuetype)) +
geom_point() +
geom_line() +
scale_x_continuous("Day", breaks = 0:14, labels = 0:14) +
scale_color_brewer(palette = "Dark2") +
ggtitle("Predicted August bike rentals, Gradient Boosting model")
The gradient boosting model captures rental variations due to time of day and other factors better than the previous models did.