Fitting a parallel slopes model
We use the lm() function to fit linear models to data. In this case, we want to understand how the price of MarioKart games sold at auction varies as a function of not only the number of wheels included in the package, but also whether the item is new or used. Obviously, you might expect to pay a premium to buy these new. But how much is that premium? Can we estimate its value after controlling for the number of wheels?

We will fit a parallel slopes model using lm(). In addition to the data argument, lm() needs to know which variables you want to include in your regression model, and how you want to include them. It accomplishes this using a formula argument. A simple linear regression formula looks like y ~ x, where y is the name of the response variable and x is the name of the explanatory variable. Here, we will simply extend this formula to include multiple explanatory variables. A parallel slopes model has the form y ~ x + z, where z is a categorical explanatory variable and x is a numerical explanatory variable.

The output from lm() is a model object, which, when printed, will show the fitted coefficients.
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mario_kart <- mariokart %>% filter(total_pr < 100)
# Explore the data
glimpse(mario_kart)
## Rows: 141
## Columns: 12
## $ id <dbl> 150377422259, 260483376854, 320432342985, 280405224677,...
## $ duration <int> 3, 7, 3, 3, 1, 3, 1, 1, 3, 7, 1, 1, 1, 1, 7, 7, 3, 3, 1...
## $ n_bids <int> 20, 13, 16, 18, 20, 19, 13, 15, 29, 8, 15, 15, 13, 16, ...
## $ cond <fct> new, used, new, new, new, new, used, new, used, used, n...
## $ start_pr <dbl> 0.99, 0.99, 0.99, 0.99, 0.01, 0.99, 0.01, 1.00, 0.99, 1...
## $ ship_pr <dbl> 4.00, 3.99, 3.50, 0.00, 0.00, 4.00, 0.00, 2.99, 4.00, 4...
## $ total_pr <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99,...
## $ ship_sp <fct> standard, firstClass, firstClass, standard, media, stan...
## $ seller_rate <int> 1580, 365, 998, 7, 820, 270144, 7284, 4858, 27, 201, 48...
## $ stock_photo <fct> yes, yes, no, yes, yes, yes, yes, yes, yes, no, yes, ye...
## $ wheels <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2...
## $ title <fct> "~~ Wii MARIO KART & WHEEL ~ NINTENDO Wii ~ BRAND N...
# fit parallel slopes
lm(total_pr ~ wheels + cond, data = mario_kart)
##
## Call:
## lm(formula = total_pr ~ wheels + cond, data = mario_kart)
##
## Coefficients:
## (Intercept) wheels condused
## 42.370 7.233 -5.585
Awesome! On average, how much more does a new game cost compared to a used game?
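If you want to extract that premium programmatically, one option (a quick sketch, refitting and saving the same model) is to pull the condused coefficient with coef():

# sketch: extract the new-vs-used difference from the fitted model
mod <- lm(total_pr ~ wheels + cond, data = mario_kart)
coef(mod)["condused"]   # about -5.58: used games sell for roughly $5.58 less, i.e. a premium for new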
Using geom_line() and augment()
Parallel slopes models are so-named because we can visualize these models in the data space as not one line, but two parallel lines. To do this, we'll draw two things: the scatterplot of the original data, and a line for each level of the categorical variable.
Our plotting strategy is to compute the fitted values, plot these, and connect the points to form a line. The augment() function from the broom package provides an easy way to add the fitted values to our data frame, and the geom_line() function can then use that data frame to plot the points and connect them. Note that this approach has the added benefit of automatically coloring the lines appropriately to match the data.

You already know how to use ggplot() and geom_point() to make the scatterplot. The only twist is that now you'll pass your augment()-ed model as the data argument in your ggplot() call. When you add your geom_line(), instead of letting the y aesthetic inherit its values from the ggplot() call, you can set it to the .fitted column of the augment()-ed model. This has the advantage of automatically coloring the lines for you.
library(broom)
library(ggplot2)
mod <- lm(total_pr ~ wheels + cond, data = mario_kart)
# Augment the model
augmented_mod <- augment(mod)
glimpse(augmented_mod)
## Rows: 141
## Columns: 9
## $ total_pr <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99, ...
## $ wheels <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2,...
## $ cond <fct> new, used, new, new, new, new, used, new, used, used, ne...
## $ .fitted <dbl> 49.60260, 44.01777, 49.60260, 49.60260, 56.83544, 42.369...
## $ .resid <dbl> 1.9473995, -6.9777674, -4.1026005, -5.6026005, 14.164559...
## $ .std.resid <dbl> 0.40270893, -1.43671086, -0.84838977, -1.15857953, 2.926...
## $ .hat <dbl> 0.02103158, 0.01250410, 0.02103158, 0.02103158, 0.019156...
## $ .sigma <dbl> 4.902339, 4.868399, 4.892414, 4.881308, 4.750591, 4.8998...
## $ .cooksd <dbl> 1.161354e-03, 8.712334e-03, 5.154337e-03, 9.612441e-03, ...
# scatterplot, with color
data_space <- ggplot(augmented_mod, aes(x = wheels, y = total_pr, color = cond)) +
geom_point()
# single call to geom_line()
data_space +
geom_line(aes(y = .fitted))
Great work! Thinking geometrically is a great way to get your head around how linear models work.
Syntax from math
The babies data set contains observations about the birthweight and other characteristics of children born in the San Francisco Bay area from 1960-1967.

We would like to build a model for birthweight as a function of the mother's age and whether this child was her first (parity == 0). Use the mathematical specification below to code the model in R.

$$bwt = \beta_0 + \beta_1 \cdot age + \beta_2 \cdot parity + \epsilon$$
# build model
lm(bwt ~ age + parity, data = babies)
##
## Call:
## lm(formula = bwt ~ age + parity, data = babies)
##
## Coefficients:
## (Intercept) age parity
## 118.27782 0.06315 -1.65248
You're a parallel slopes pro! How big is the effect of parity compared to age?
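To make the model concrete, here is a hedged sketch (the mod_bwt name and the age value are made up for illustration) that predicts birthweight for a hypothetical 28-year-old mother, comparing a first child (parity = 0) with a later child (parity = 1):

# sketch: predicted birthweight (ounces) for a hypothetical 28-year-old mother
mod_bwt <- lm(bwt ~ age + parity, data = babies)
predict(mod_bwt, newdata = data.frame(age = 28, parity = c(0, 1)))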
Syntax from plot
This time, we’d like to build a model for birthweight as a function of the length of gestation and the mother’s smoking status. Use the plot to inform your model specification.
# build model
lm(bwt ~ gestation + smoke, data = babies)
##
## Call:
## lm(formula = bwt ~ gestation + smoke, data = babies)
##
## Coefficients:
## (Intercept) gestation smoke
## -0.9317 0.4429 -8.0883
Great work! Visually examining your data is always a great place to start when you’re building models.
R-squared vs. adjusted R-squared
Two common measures of how well a model fits to data are $R^2$ (the coefficient of determination) and the adjusted $R^2$. The former measures the percentage of the variability in the response variable that is explained by the model. To compute this, we define

$$R^2 = 1 - \frac{SSE}{SST},$$

where $SSE$ and $SST$ are the sum of the squared residuals and the total sum of squares, respectively. One issue with this measure is that $SSE$ can only decrease as new variables are added to the model, while $SST$ depends only on the response variable and therefore is not affected by changes to the model. This means that you can increase $R^2$ by adding any additional variables to your model, even random noise.

The adjusted $R^2$ includes a term that penalizes a model for each additional explanatory variable (where $p$ is the number of explanatory variables):

$$R^2_{adj} = 1 - \frac{SSE}{SST} \cdot \frac{n-1}{n-p-1}.$$

We can see both measures in the output of the summary() function on our model object.
# R^2 and adjusted R^2
summary(mod)
##
## Call:
## lm(formula = total_pr ~ wheels + cond, data = mario_kart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0078 -3.0754 -0.8254 2.9822 14.1646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.3698 1.0651 39.780 < 2e-16 ***
## wheels 7.2328 0.5419 13.347 < 2e-16 ***
## condused -5.5848 0.9245 -6.041 1.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.887 on 138 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.7124
## F-statistic: 174.4 on 2 and 138 DF, p-value: < 2.2e-16
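As a sanity check, we can reproduce both measures by hand from the formulas above; this is a quick sketch using the mod object fit earlier (with p = 2 explanatory terms):

# compute R^2 and adjusted R^2 manually from the residuals
sse <- sum(residuals(mod)^2)
sst <- sum((mario_kart$total_pr - mean(mario_kart$total_pr))^2)
n <- nrow(mario_kart)
p <- 2
1 - sse / sst                             # multiple R-squared, ~0.7165
1 - (sse / sst) * (n - 1) / (n - p - 1)   # adjusted R-squared, ~0.7124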
# add random noise
mario_kart_noisy <- mario_kart %>%
mutate(noise = rnorm(nrow(mario_kart)))
# compute new model
mod2 <- lm(total_pr ~ wheels + cond + noise, data = mario_kart_noisy)
# new R^2 and adjusted R^2
summary(mod2)
##
## Call:
## lm(formula = total_pr ~ wheels + cond + noise, data = mario_kart_noisy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7992 -3.0877 -0.6089 2.9021 14.7793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.6295 1.0394 41.015 < 2e-16 ***
## wheels 7.1277 0.5281 13.496 < 2e-16 ***
## condused -5.6320 0.8991 -6.264 4.54e-09 ***
## noise 1.2212 0.4084 2.990 0.00331 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.753 on 137 degrees of freedom
## Multiple R-squared: 0.7339, Adjusted R-squared: 0.7281
## F-statistic: 125.9 on 3 and 137 DF, p-value: < 2.2e-16
Perfect! How did adding random noise change the value of $R^2$? And the adjusted $R^2$?
Prediction
Once we have fit a regression model, we can use it to make predictions for unseen observations or retrieve the fitted values. Here, we explore two methods for doing the latter.

A traditional way to return the fitted values (i.e. the $\hat{y}$'s) is to run the predict() function on the model object. This will return a vector of the fitted values. Note that predict() will take an optional newdata argument that will allow you to make predictions for observations that are not in the original data.

A newer alternative is the augment() function from the broom package, which returns a data.frame with the response variable ($y$), the relevant explanatory variables (the $x$'s), the fitted value ($\hat{y}$), and some information about the residuals. augment() will also take a newdata argument that allows you to make predictions.
# return a vector
predict(mod)
## 1 2 3 4 5 6 7 8
## 49.60260 44.01777 49.60260 49.60260 56.83544 42.36976 36.78493 56.83544
## 9 10 11 12 13 14 15 16
## 44.01777 44.01777 56.83544 56.83544 56.83544 56.83544 44.01777 36.78493
## 17 18 19 20 21 22 23 24
## 49.60260 49.60260 56.83544 36.78493 56.83544 56.83544 56.83544 44.01777
## 25 26 27 28 29 30 31 32
## 56.83544 36.78493 36.78493 36.78493 49.60260 36.78493 36.78493 44.01777
## 33 34 35 36 37 38 39 40
## 51.25061 44.01777 44.01777 36.78493 44.01777 56.83544 56.83544 49.60260
## 41 42 43 44 45 46 47 48
## 44.01777 51.25061 56.83544 56.83544 44.01777 56.83544 36.78493 36.78493
## 49 50 51 52 53 54 55 56
## 44.01777 56.83544 36.78493 44.01777 42.36976 36.78493 36.78493 44.01777
## 57 58 59 60 61 62 63 64
## 44.01777 36.78493 36.78493 56.83544 36.78493 56.83544 36.78493 51.25061
## 65 66 67 68 69 70 71 72
## 56.83544 44.01777 58.48345 51.25061 49.60260 44.01777 49.60260 56.83544
## 73 74 75 76 77 78 79 80
## 56.83544 51.25061 44.01777 36.78493 36.78493 36.78493 44.01777 56.83544
## 81 82 83 84 85 86 87 88
## 44.01777 65.71629 44.01777 56.83544 36.78493 49.60260 49.60260 36.78493
## 89 90 91 92 93 94 95 96
## 44.01777 36.78493 51.25061 44.01777 36.78493 51.25061 42.36976 56.83544
## 97 98 99 100 101 102 103 104
## 51.25061 44.01777 51.25061 56.83544 56.83544 56.83544 36.78493 49.60260
## 105 106 107 108 109 110 111 112
## 51.25061 44.01777 56.83544 49.60260 36.78493 44.01777 51.25061 56.83544
## 113 114 115 116 117 118 119 120
## 64.06828 44.01777 49.60260 44.01777 49.60260 51.25061 42.36976 44.01777
## 121 122 123 124 125 126 127 128
## 56.83544 44.01777 49.60260 44.01777 51.25061 56.83544 56.83544 49.60260
## 129 130 131 132 133 134 135 136
## 56.83544 36.78493 44.01777 44.01777 36.78493 56.83544 36.78493 44.01777
## 137 138 139 140 141
## 36.78493 51.25061 49.60260 36.78493 56.83544
# return a data frame
augment(mod)
## # A tibble: 141 x 9
## total_pr wheels cond .fitted .resid .std.resid .hat .sigma .cooksd
## <dbl> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 51.6 1 new 49.6 1.95 0.403 0.0210 4.90 0.00116
## 2 37.0 1 used 44.0 -6.98 -1.44 0.0125 4.87 0.00871
## 3 45.5 1 new 49.6 -4.10 -0.848 0.0210 4.89 0.00515
## 4 44 1 new 49.6 -5.60 -1.16 0.0210 4.88 0.00961
## 5 71 2 new 56.8 14.2 2.93 0.0192 4.75 0.0557
## 6 45 0 new 42.4 2.63 0.551 0.0475 4.90 0.00505
## 7 37.0 0 used 36.8 0.235 0.0486 0.0209 4.91 0.0000168
## 8 54.0 2 new 56.8 -2.85 -0.588 0.0192 4.90 0.00225
## 9 47 1 used 44.0 2.98 0.614 0.0125 4.90 0.00159
## 10 50 1 used 44.0 5.98 1.23 0.0125 4.88 0.00640
## # ... with 131 more rows
Perfect! Can you see the difference between the functions predict() and augment()?
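As a quick illustration of the newdata argument mentioned above, here is a sketch that predicts the price of a hypothetical used auction with two wheels (the new_auction data frame is made up):

# sketch: out-of-sample prediction for a made-up observation
new_auction <- data.frame(wheels = 2, cond = factor("used", levels = c("new", "used")))
predict(mod, newdata = new_auction)    # a vector: about $51.25
augment(mod, newdata = new_auction)    # a data frame with a .fitted column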
As a motivating example for what comes next: the rate at which gas mileage changes with engine displacement may vary by year of manufacture, but perhaps not by a significant amount. Allowing such a slope to differ between groups requires an interaction term.
Fitting a model with interaction
Including an interaction term in a model is easy: we just have to tell lm() that we want to include that new variable. An expression of the form

lm(y ~ x + z + x:z, data = mydata)

will do the trick. The use of the colon (:) here means that the interaction between x and z will be a third term in the model.
# include interaction
lm(total_pr ~ duration + cond + cond:duration, data = mario_kart)
##
## Call:
## lm(formula = total_pr ~ duration + cond + cond:duration, data = mario_kart)
##
## Coefficients:
## (Intercept) duration condused duration:condused
## 58.268 -1.966 -17.122 2.325
Great work! How does adding an interaction term affect the other coefficients in the model?
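As a side note, R offers a shorthand for the same specification: y ~ x * z expands to y ~ x + z + x:z, so this sketch should reproduce the coefficients above.

# equivalent shorthand for main effects plus interaction
lm(total_pr ~ duration * cond, data = mario_kart)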
Visualizing interaction models
Interaction allows the slope of the regression line in each group to vary. In this case, this means that the relationship between the final price and the length of the auction is moderated by the condition of each item.

Interaction models are easy to visualize in the data space with ggplot2 because they have the same coefficients as if the models were fit independently to each group defined by the level of the categorical variable. In this case, new and used MarioKarts each get their own regression line. To see this, we can set an aesthetic (e.g. color) to the categorical variable, and then add a geom_smooth() layer to overlay the regression line for each color.
# interaction plot
ggplot(mario_kart, aes(y = total_pr, x = duration, color = cond)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Great work! How does the interaction model differ from the parallel slopes model?
A classic example: adding one additional variable for "SAT rate" (the percentage of students taking the SAT) reverses the relationship between teacher salary and SAT scores from negative to positive. Why?

In this case, for every additional thousand dollars of salary for teachers in a particular state, the expected SAT score for a student from that state is about 2 points higher, after controlling for the percentage of students taking the SAT.
Simpson’s paradox in action
A mild version of Simpson’s paradox can be observed in the MarioKart auction data. Consider the relationship between the final auction price and the length of the auction. It seems reasonable to assume that longer auctions would result in higher prices, since—other things being equal—a longer auction gives more bidders more time to see the auction and bid on the item.
However, a simple linear regression model reveals the opposite: longer auctions are associated with lower final prices. The problem is that all other things are not equal. In this case, the new MarioKarts—which people pay a premium for—were mostly sold in one-day auctions, while a plurality of the used MarioKarts were sold in the standard seven-day auctions.
Our simple linear regression model is misleading, in that it suggests a negative relationship between final auction price and duration. However, for the used MarioKarts, the relationship is positive.
slr <- ggplot(mario_kart, aes(y = total_pr, x = duration)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
# model with one slope
lm(total_pr ~ duration, data = mario_kart)
##
## Call:
## lm(formula = total_pr ~ duration, data = mario_kart)
##
## Coefficients:
## (Intercept) duration
## 52.374 -1.317
# plot with two slopes
slr + aes(color = cond)
## `geom_smooth()` using formula 'y ~ x'
Great work! Which of the two groups is showing signs of Simpson’s paradox?
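To see the sign flip numerically, one option (a sketch) is to fit the simple regression separately within each condition:

# slope of price on duration within each group
coef(lm(total_pr ~ duration, data = filter(mario_kart, cond == "new")))   # negative slope
coef(lm(total_pr ~ duration, data = filter(mario_kart, cond == "used")))  # slightly positive slope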
Fitting a MLR model
In terms of the R code, fitting a multiple linear regression model is easy: simply add variables to the model formula you specify in the lm() command.
In a parallel slopes model, we had two explanatory variables: one was numeric and one was categorical. Here, we will allow both explanatory variables to be numeric.
# Fit the model using duration and start_pr
(mod <- lm(total_pr ~ duration + start_pr, data = mario_kart))
##
## Call:
## lm(formula = total_pr ~ duration + start_pr, data = mario_kart)
##
## Coefficients:
## (Intercept) duration start_pr
## 51.030 -1.508 0.233
Awesome work! Using R to fit multiple regression models is just as simple as fitting simpler models.
Tiling the plane
One method for visualizing a multiple linear regression model is to create a heatmap of the fitted values in the plane defined by the two explanatory variables. This heatmap will illustrate how the model output changes over different combinations of the explanatory variables.
This is a multistep process:

1. Use data_grid() from the modelr package to create a grid of values of the explanatory variables, saved as grid.
2. Use augment() with the newdata argument to find the $\hat{y}$'s corresponding to the values in grid.
3. Add these predictions to the data_space plot by using the fill aesthetic and geom_tile().

library(modelr)
##
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
##
## bootstrap
data_space <- ggplot(mario_kart, aes(x = duration, y = start_pr)) +
geom_point(aes(color = total_pr))
grid <- mario_kart %>%
data_grid(
duration = seq_range(duration, by = 1),
start_pr = seq_range(start_pr, by = 1)
)
# add predictions to grid
price_hats <- augment(mod, newdata = grid)
# tile the plane
data_space +
geom_tile(data = price_hats, aes(fill = .fitted), alpha = 0.5)
Great work! Visualizing higher dimensional models takes some creativity.
Models in 3D
An alternative way to visualize a multiple regression model with two numeric explanatory variables is as a plane in three dimensions. This is possible in R using the plotly package.
We have created three objects that you will need:

- x: a vector of unique values of duration
- y: a vector of unique values of start_pr
- plane: a matrix of the fitted values across all combinations of x and y
Much like ggplot(), the plot_ly() function will allow you to create a plot object with variables mapped to x, y, and z aesthetics. The add_markers() function is similar to geom_point() in that it allows you to add points to your 3D plot.

Note that plot_ly uses the pipe (%>%) operator to chain commands together.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
x <- seq(from = 1, to = 10, by = 9/69)
y <- seq(from = 0.01, to = 69.95, by = 69.94/69)
plane <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})
# draw the 3D scatterplot
p <- plot_ly(data = mario_kart, z = ~total_pr, x = ~duration, y = ~start_pr, opacity = 0.6) %>%
add_markers()
# draw the plane
p %>%
add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
What a plot! You’ve added a whole new dimension to regression!
Recall how coefficients in a multiple regression model are interpreted: "while holding age constant ... the effect of x on y", and "while holding gestational length constant ... the effect of x on y".
Coefficient magnitude
The coefficients from our model for the total auction price of MarioKarts as a function of auction duration and starting price are shown below.
Call:
lm(formula = totalPr ~ duration + startPr, data = mario_kart)
Coefficients:
(Intercept) duration startPr
51.030 -1.508 0.233
A colleague claims that these results imply that the duration of the auction is a more important determinant of final price than starting price, because the coefficient is larger. This interpretation is false because:
The coefficients have different units (dollars per day and dollars per dollar, respectively) and so they are not directly comparable.
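If you do want to compare the relative importance of the two predictors, one common (hedged) approach is to standardize them first, so that each coefficient reflects the effect of a one standard deviation change:

# sketch: standardized coefficients for duration and start_pr
lm(total_pr ~ scale(duration) + scale(start_pr), data = mario_kart)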
Practicing interpretation
Fit a multiple regression model for the total auction price of an item in the mario_kart data set as a function of the starting price and the duration of the auction. Compute the coefficients and choose the correct interpretation of the duration variable.
For each additional day the auction lasts, the expected final price declines by $1.51, after controlling for starting price.
Visualizing parallel planes
By including the duration, starting price, and condition variables in our model, we now have two numeric explanatory variables and one categorical variable. Our model now takes the geometric form of two parallel planes!

The first plane corresponds to the model output when the condition of the item is new, while the second plane corresponds to the model output when the condition of the item is used. The planes have the same slopes along both the duration and starting price axes; it is the intercept along the price axis that is different.
Once again we have stored the x and y vectors for you. Since we now have two planes, there are matrix objects plane0 and plane1 stored for you as well.
(mod <- lm(total_pr ~ duration + start_pr + cond, data = mario_kart))
##
## Call:
## lm(formula = total_pr ~ duration + start_pr + cond, data = mario_kart)
##
## Coefficients:
## (Intercept) duration start_pr condused
## 53.3448 -0.6560 0.1982 -8.9493
plane0 <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})
plane1 <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[4]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})
# draw the 3D scatterplot
p <- plot_ly(data = mario_kart, z = ~total_pr, x = ~duration, y = ~start_pr, opacity = 0.6) %>%
add_markers(color = ~cond)
# draw two planes
p %>%
add_surface(x = ~x, y = ~y, z = ~plane0, showscale = FALSE) %>%
add_surface(x = ~x, y = ~y, z = ~plane1, showscale = FALSE)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Awesome! Unfortunately, a parallel planes model is about as complicated a model as we’re able to visualize.
The expected premium for new (relative to used) MarioKarts is $8.95, after controlling for the duration and starting price of the auction.
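That premium is just the negative of the condused coefficient, which you can pull directly from the model object (a quick sketch using mod from above):

# new-vs-used premium, after controlling for duration and starting price
-coef(mod)["condused"]   # about 8.95 dollars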
Interpretation of coefficient in a big model
This time we have thrown even more variables into our model, including the number of bids in each auction (nBids) and the number of wheels. Unfortunately this makes a full visualization of our model impossible, but we can still interpret the coefficients.
Call:
lm(formula = totalPr ~ duration + startPr + cond + wheels + nBids,
data = mario_kart)
Coefficients:
(Intercept) duration startPr condused wheels
39.3741 -0.2752 0.1796 -4.7720 6.7216
nBids
0.1909
Choose the correct interpretation of the coefficient on the number of wheels:
Each additional wheel is associated with an increase in the expected auction price of $6.72, after controlling for auction duration, starting price, number of bids, and the condition of the item.
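Note that this output uses the course's original column names (totalPr, startPr, nBids). To reproduce the model with the openintro names used elsewhere in this document, a sketch would be:

# same model with the mariokart column names (total_pr, start_pr, n_bids)
lm(total_pr ~ duration + start_pr + cond + wheels + n_bids, data = mario_kart)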
Fitting a line to a binary response
When our response variable is binary, a regression model has several limitations. Among the more obvious (and logically incongruous) is that the regression line extends infinitely in either direction. This means that even though our response variable only takes on the values 0 and 1, our fitted values can range anywhere from $-\infty$ to $\infty$. This doesn't make sense.

To see this in action, we'll fit a linear regression model to data about 55 students who applied to medical school. We want to understand how their undergraduate GPA relates to the probability that they will be accepted by a particular school (Acceptance = 1).
library(Stat2Data)
data("MedGPA")
# scatterplot with jitter
data_space <- ggplot(data = MedGPA, aes(y = Acceptance, x = GPA)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# linear regression line
data_space +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Great work! How well does the linear model fit the data?
Fitting a line to a binary response (2)
In the previous exercise, we identified a major limitation to fitting a linear regression model when we have a binary response variable. However, it is not always inappropriate to do so. Note that our regression line only makes illogical predictions (i.e. $\hat{y} < 0$ or $\hat{y} > 1$) for students with very high or very low GPAs. For GPAs closer to average, the predictions seem fine.

Moreover, the alternative logistic regression model, which we will fit next, is very similar to the linear regression model for observations near the average of the explanatory variable. It just so happens that the logistic curve is very straight near its middle. Thus, in these cases a linear regression model may still be acceptable, even for a binary response.
# filter
MedGPA_middle <- MedGPA %>%
filter(GPA >= 3.375, GPA <= 3.770)
# scatterplot with jitter
data_space <- ggplot(data = MedGPA_middle, aes(y = Acceptance, x = GPA)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# linear regression line
data_space +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Perfect plotting! In cases like this, it can be easier to use a simpler model.
Fitting a model
Logistic regression is a special case of a broader class of generalized linear models, often known as GLMs. Specifying a logistic regression model is very similar to specifying a linear regression model, with two important differences:

- We use the glm() function instead of lm().
- We have to add a family argument and set it to binomial. This tells the GLM function that we want to fit a logistic regression model to our binary response. (The terminology stems from the assumption that our binary response follows a binomial distribution.)

We still use the formula and data arguments with glm().

Note that the mathematical model is now:

$$\log\left(\frac{y}{1-y}\right) = \beta_0 + \beta_1 \cdot x + \epsilon,$$

where $\epsilon$ is the error term.
# fit model
(mod <- glm(Acceptance ~ GPA, data = MedGPA, family = binomial))
##
## Call: glm(formula = Acceptance ~ GPA, family = binomial, data = MedGPA)
##
## Coefficients:
## (Intercept) GPA
## -19.207 5.454
##
## Degrees of Freedom: 54 Total (i.e. Null); 53 Residual
## Null Deviance: 75.79
## Residual Deviance: 56.84 AIC: 60.84
Awesome! Fitting a GLM isn’t much harder than a regular linear model.
Using geom_smooth()
Our logistic regression model can be visualized in the data space by overlaying the appropriate logistic curve. We can use the geom_smooth() function to do this. Recall that geom_smooth() takes a method argument that allows you to specify what type of smoother you want to see. In our case, we need to specify that we want to use the glm() function to do the smoothing.

However, we also need to tell the glm() function which member of the GLM family we want to use. To do this, we will pass the family argument to glm() as a list using the method.args argument to geom_smooth(). This mechanism is common in R, and allows one function to pass a list of arguments to another function.
# scatterplot with jitter
data_space <- ggplot(data = MedGPA, aes(y = Acceptance, x = GPA)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# add logistic curve
data_space +
geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"))
## `geom_smooth()` using formula 'y ~ x'
What a plot! What differences can you spot between this curve and the linear model you plotted earlier?
Using bins
One of the difficulties in working with a binary response variable is understanding how it "changes." The response itself ($y$) is either 0 or 1, while the fitted values ($\hat{y}$), which are interpreted as probabilities, are between 0 and 1. But if every medical school applicant is either admitted or not, what does it mean to talk about the probability of being accepted?

What we'd like is a larger sample of students, so that for each GPA value (e.g. 3.54) we had many observations, and we could then take the average of those observations to arrive at the estimated probability of acceptance. Unfortunately, since the explanatory variable is continuous, this is hopeless: it would take an infinite amount of data to make these estimates robust.

Instead, what we can do is put the observations into bins based on their GPA value. Within each bin, we can compute the proportion of accepted students, and we can visualize our model as a smooth logistic curve through those binned values.

We have created a data.frame called MedGPA_binned that aggregates the original data into six bins defined by the sextiles of GPA (see the quantile() call below). It also contains the fitted values from the logistic regression model.

Here we are plotting $\hat{y}$ as a function of $x$, where that function is

$$\hat{y} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 \cdot x)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 \cdot x)}.$$

Note that the left-hand side is the expected probability of being accepted to medical school.
(gpa_bins <- quantile(MedGPA$GPA, probs = seq(0, 1, 1/6)))
## 0% 16.66667% 33.33333% 50% 66.66667% 83.33333% 100%
## 2.72 3.30 3.44 3.58 3.70 3.87 3.97
MedGPA$bins <- cut(MedGPA$GPA, breaks = gpa_bins, include.lowest = TRUE)
head(MedGPA)
## Accept Acceptance Sex BCPM GPA VR PS WS BS MCAT Apps bins
## 1 D 0 F 3.59 3.62 11 9 9 9 38 5 (3.58,3.7]
## 2 A 1 M 3.75 3.84 12 13 8 12 45 3 (3.7,3.87]
## 3 A 1 F 3.24 3.23 9 10 5 9 33 19 [2.72,3.3]
## 4 A 1 F 3.74 3.69 12 11 7 10 40 5 (3.58,3.7]
## 5 A 1 F 3.53 3.38 9 11 4 11 35 11 (3.3,3.44]
## 6 A 1 M 3.59 3.72 10 9 7 10 36 5 (3.7,3.87]
(MedGPA_binned <- MedGPA %>%
group_by(bins) %>%
summarize(mean_GPA = mean(GPA), acceptance_rate = mean(Acceptance)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 3
## bins mean_GPA acceptance_rate
## <fct> <dbl> <dbl>
## 1 [2.72,3.3] 3.11 0.2
## 2 (3.3,3.44] 3.39 0.2
## 3 (3.44,3.58] 3.54 0.75
## 4 (3.58,3.7] 3.65 0.333
## 5 (3.7,3.87] 3.79 0.889
## 6 (3.87,3.97] 3.91 1
# binned points and line
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = acceptance_rate)) +
geom_point() + geom_line()
# augmented model
(MedGPA_plus <- mod %>%
augment(type.predict = "response"))
## # A tibble: 55 x 8
## Acceptance GPA .fitted .resid .std.resid .hat .sigma .cooksd
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 3.62 0.631 -1.41 -1.43 0.0268 1.03 0.0242
## 2 1 3.84 0.850 0.569 0.581 0.0386 1.04 0.00368
## 3 1 3.23 0.169 1.88 1.94 0.0527 1.01 0.144
## 4 1 3.69 0.715 0.819 0.832 0.0304 1.04 0.00644
## 5 1 3.38 0.316 1.52 1.55 0.0400 1.02 0.0470
## 6 1 3.72 0.747 0.764 0.776 0.0323 1.04 0.00584
## 7 1 3.89 0.882 0.501 0.512 0.0396 1.04 0.00287
## 8 0 3.34 0.271 -0.795 -0.813 0.0444 1.04 0.00904
## 9 1 3.71 0.737 0.782 0.795 0.0316 1.04 0.00603
## 10 1 3.89 0.882 0.501 0.512 0.0396 1.04 0.00287
## # ... with 45 more rows
# logistic model on probability scale
data_space +
geom_line(data = MedGPA_plus, aes(x = GPA, y = .fitted), color = "red")
Great work! The logistic predictions seem to follow the binned values pretty well.
Odds scale
For most people, the idea that we could estimate the probability of being admitted to medical school based on undergraduate GPA is fairly intuitive. However, thinking about how the probability changes as a function of GPA is complicated by the non-linear logistic curve. By translating the response from the probability scale to the odds scale, we make the right-hand side of our equation easier to understand.

If the probability of getting accepted is $y$, then the odds are $y / (1 - y)$. Expressions of probabilities in terms of odds are common in many situations, perhaps most notably gambling.

Here we are plotting $\hat{y} / (1 - \hat{y})$ as a function of $x$, where that function is

$$\text{odds}(\hat{y}) = \frac{\hat{y}}{1 - \hat{y}} = \exp(\hat{\beta}_0 + \hat{\beta}_1 \cdot x).$$

Note that the left-hand side is the expected odds of being accepted to medical school. The right-hand side is now a familiar exponential function of $x$.
The MedGPA_binned data frame contains the data for each GPA bin, while the MedGPA_plus data frame records the original observations after being augment()-ed by mod.
# compute odds for bins
MedGPA_binned <- MedGPA_binned %>%
mutate(odds = acceptance_rate / (1 - acceptance_rate))
# plot binned odds
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = odds)) +
geom_point() + geom_line()
# compute odds for observations
MedGPA_plus <- MedGPA_plus %>%
mutate(odds_hat = .fitted / (1 - .fitted))
# logistic model on odds scale
data_space +
geom_line(data = MedGPA_plus, aes(x = GPA, y = odds_hat), color = "red")
Fantastic job! Being able to move back and forth between different scales is essential to interpreting logistic regression models.
Log-odds scale
Previously, we considered two formulations of logistic regression models:

- on the probability scale, the units are easy to interpret, but the function is non-linear, which makes it hard to understand
- on the odds scale, the units are harder to interpret, and the function is exponential, which makes it harder (but not impossible) to interpret

We'll now add a third formulation:

- on the log-odds scale, the units are nearly impossible to interpret, but the function is linear, which makes it easy to understand

As you can see, none of these three is uniformly superior. Most people tend to interpret the fitted values on the probability scale and the function on the log-odds scale. The interpretation of the coefficients is most commonly done on the odds scale. Recall that we interpreted our slope coefficient in linear regression as the expected change in $y$ given a one unit change in $x$. On the probability scale, the function is non-linear and so this approach won't work. On the log-odds scale, the function is linear, but the units are not interpretable (what does the log of the odds mean?). However, on the odds scale, a one unit change in $x$ leads to the odds being multiplied by a factor of $\exp(\beta_1)$. To see why, we form the odds ratio:

$$OR = \frac{\text{odds}(\hat{y} \mid x + 1)}{\text{odds}(\hat{y} \mid x)} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 (x + 1))}{\exp(\hat{\beta}_0 + \hat{\beta}_1 x)} = \exp(\hat{\beta}_1).$$

Thus, the exponentiated coefficient $\exp(\hat{\beta}_1)$ tells us how the expected odds change for a one unit increase in the explanatory variable. It is tempting to interpret this as a change in the expected probability, but this is wrong and can lead to nonsensical predictions (e.g. expected probabilities greater than 1).
# compute log odds for bins
MedGPA_binned <- MedGPA_binned %>%
mutate(log_odds = log(acceptance_rate / (1 - acceptance_rate)))
# plot binned log odds
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = log_odds)) +
geom_point() + geom_line()
# compute log odds for observations
MedGPA_plus <- MedGPA_plus %>%
mutate(log_odds_hat = log(.fitted / (1 - .fitted)))
# logistic model on log odds scale
data_space +
geom_line(data = MedGPA_plus, aes(x = GPA, y = log_odds_hat), color = "red")
Amazing work! When you’re on the log-odds scale, your model is a simple linear function.
Interpretation of logistic regression
The fitted coefficient from the medical school logistic regression model is 5.45. The exponential of this is 233.73.
Donald’s GPA is 2.9, and thus the model predicts that the probability of him getting into medical school is 3.26%. The odds of Donald getting into medical school are 0.0337, or—phrased in gambling terms — 29.6:1. If Donald hacks the school’s registrar and changes his GPA to 3.9, then which of the following statements are TRUE:
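You can check these numbers directly from the fitted model; here is a sketch, assuming mod is still the Acceptance ~ GPA logistic model:

# predicted probability, odds, and the odds ratio for a one-point GPA increase
p_hat <- predict(mod, newdata = data.frame(GPA = c(2.9, 3.9)), type = "response")
p_hat                    # ~0.033 for a 2.9 GPA
p_hat / (1 - p_hat)      # odds: ~0.034 for a 2.9 GPA
exp(coef(mod)["GPA"])    # odds ratio, ~233.7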
As an example of assessing binary predictions: suppose a model predicted that 91 patients would die and 12 would survive; 71 of those predicted to die actually did die, and 8 of those predicted to survive actually did survive. The accuracy, the proportion of predictions that matched the actual values, was therefore 79/103, or about 77%.
Making probabilistic predictions
Just as we did with linear regression, we can use our logistic regression model to make predictions about new observations. In this exercise, we will use the newdata argument to the augment() function from the broom package to make predictions about students who were not in our original data set. These predictions are sometimes called out-of-sample.

Following our previous discussion about scales, with logistic regression it is important that we specify on which scale we want the predicted values. Although the default is the scale of the linear predictor (the log-odds scale), we want our predictions on the probability scale, which is the scale of the response variable. The type.predict argument to augment() controls this behavior.

A logistic regression model object, mod, has been defined for you.
# create new data frame
new_data <- data.frame(GPA = 3.51)
# make predictions
augment(mod, newdata = new_data, type.predict = "response")
## # A tibble: 1 x 2
## GPA .fitted
## <dbl> <dbl>
## 1 3.51 0.484
Perfect predicting! By framing your prediction as a probability you can show how likely it is that this student will get admitted to medical school.
Making binary predictions
Naturally, we want to know how well our model works. Did it predict acceptance for the students who were actually accepted to medical school? Did it predict rejections for the students who were not admitted? These types of predictions are called in-sample. One common way to evaluate models with a binary response is with a confusion matrix. [Yes, that is actually what it is called!]

However, note that while our response variable is binary, our fitted values are probabilities. Thus, we have to round them somehow into binary predictions. While the probabilities convey more information, we might ultimately have to make a decision, and so this rounding is common in practice. There are many different ways to round, but for simplicity we will predict admission if the fitted probability is greater than 0.5, and rejection otherwise.

First, we'll use augment() to make the predictions, and then mutate() and round() to convert these probabilities into binary decisions. Then we will form the confusion matrix using the table() function. table() will compute a 2-way table when given a data frame with two categorical variables, so we will first use select() to grab only those variables.

You will find that this model made only 15 mistakes on these 55 observations, so it is nearly 73% accurate.
# data frame with binary predictions
tidy_mod <- augment(mod, type.predict = "response") %>%
mutate(Acceptance_hat = round(.fitted))
# confusion matrix
tidy_mod %>%
select(Acceptance, Acceptance_hat) %>%
table()
## Acceptance_hat
## Acceptance 0 1
## 0 16 9
## 1 6 24
Great work! This model doesn’t seem like it’s too bad!
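To turn the confusion matrix into a single accuracy number, a quick sketch:

# overall accuracy: proportion of predictions that match the observed response
mean(tidy_mod$Acceptance == tidy_mod$Acceptance_hat)   # (16 + 24) / 55, about 0.73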
Exploratory data analysis
Multiple regression can be an effective technique for understanding how a response variable changes as a result of changes to more than one explanatory variable. But it is not magic – understanding the relationships among the explanatory variables is also necessary, and will help us build a better model. This process is often called exploratory data analysis (EDA) and is covered in another DataCamp course.
One quick technique for jump-starting EDA is to examine all of the pairwise scatterplots in your data. This can be achieved using the pairs() function. Look for variables in the nyc data set that are strongly correlated, as those relationships will help us check for multicollinearity later on.

Which pairs of variables appear to be strongly correlated?
nyc <- read.csv("_data/nyc.csv", stringsAsFactors = TRUE)
## Scatter Plot Matrix with histogram diagonals and r/CI95 on the top panels (Run top to bottom from here)
## Histograms on the diagonal (Thanks to Ane Handles)
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col="lavender", ...)
}
## Correlations & 95% CIs on the upper panels (Thanks to Ane Handles) + p-values
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- cor(x, y,use="complete.obs")
txt <- format(c(r, 0.123456789), digits=digits)[1]
prefix <- "r = "
rc <- cor.test(x,y)
rci <- rc$conf.int
p <- cor.test(x, y)$p.value
txt4 <- format(c(p, 0.123456789), digits = digits)[1]
txt4 <- paste(",\np= ", txt4, sep = "")
if(p<0.01) txt4 <- paste(",\np= ", "<0.01", sep = "")
txt2 <- format(c(rci, 0.123456789), digits=digits)[1]
txt3 <- format(c(rci, 0.123456789), digits=digits)[2]
prefix2 <- "\nCI = "
txt <- paste(prefix, txt, prefix2, txt2, ", ", txt3, sep="", txt4)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = 1)
}
## Actual Scatter Plot Matrix
pairs(nyc[2:7],
lower.panel=panel.smooth, cex = .8, pch = 21, bg="steelblue",
diag.panel=panel.hist, cex.labels = 1.2, font.labels=2,
upper.panel=panel.cor)
pairs(nyc[2:7])
SLR models
Based on your knowledge of the restaurant industry, do you think that the quality of the food in a restaurant is an important determinant of the price of a meal at that restaurant? It would be hard to imagine that it wasn't. We'll start our modeling process by plotting and fitting a model for Price as a function of Food.

On your own, interpret these coefficients and examine the fit of the model. What does the coefficient of Food mean in plain English? "Each additional rating point of food quality is associated with a…"
# Price by Food plot
ggplot(data = nyc, aes(x = Food, y = Price)) +
geom_point()
# Price by Food model
lm(Price ~ Food, data = nyc)
##
## Call:
## lm(formula = Price ~ Food, data = nyc)
##
## Coefficients:
## (Intercept) Food
## -17.832 2.939
Great work! What does the simple linear model say about how food quality affects price?
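To see what the fitted line implies in dollars, here is a sketch predicting the price at a hypothetical food rating of 20 (the mod_food name is made up):

# sketch: predicted meal price for a restaurant with Food = 20
mod_food <- lm(Price ~ Food, data = nyc)
predict(mod_food, newdata = data.frame(Food = 20))   # roughly -17.8 + 2.94 * 20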
Parallel lines with location
In real estate, a common mantra is that the three most important factors in determining the price of a property are “location, location, and location.” If location drives up property values and rents, then we might imagine that location would increase a restaurant’s costs, which would result in them having higher prices. In many parts of New York, the east side (east of 5th Avenue) is more developed and perhaps more expensive. [This is increasingly less true, but was more true at the time these data were collected.]
Let's expand our model into a parallel slopes model by including the East variable in addition to Food.

Use lm() to fit a parallel slopes model for Price as a function of Food and East. Interpret the coefficients and the fit of the model. Can you explain the meaning of the coefficient on East in simple terms? Did the coefficient on Food change from the previous model? If so, why? Did it change by a lot or just a little?
lm(Price ~ Food + East, data = nyc)
##
## Call:
## lm(formula = Price ~ Food + East, data = nyc)
##
## Coefficients:
## (Intercept) Food East
## -17.430 2.875 1.459
The statements that are TRUE:
Each additional rating point of food quality is associated with a $2.88 increase in the expected price of meal, after controlling for location.
The premium for an Italian restaurant in NYC associated with being on the east side of 5th Avenue is $1.46, after controlling for the quality of the food.
A plane in 3D
One reason that many people go to a restaurant, apart from the food, is that they don't have to cook or clean up. Many people appreciate the experience of being waited upon, and we can all agree that the quality of the service at restaurants varies widely. Are people willing to pay more for better restaurant Service? More interestingly, are they willing to pay more for better service, after controlling for the quality of the food?

Multiple regression gives us a way to reason about these questions. Fit the model with Food and Service and interpret the coefficients and fit. Did the coefficient on Food change from the previous model? What do the coefficients on Food and Service tell you about how these restaurants set prices?

Next, let's visually assess our model using plotly. The x and y vectors, as well as the plane matrix, have been created for you.
# fit model
(mod <- lm(Price ~ Food + Service, data = nyc))
##
## Call:
## lm(formula = Price ~ Food + Service, data = nyc)
##
## Coefficients:
## (Intercept) Food Service
## -21.159 1.495 1.704
x <- seq(from = 16, to = 25, by = 9/49)
y <- seq(from = 14, to = 24, by = 10/49)
plane <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})
# draw 3D scatterplot
p <- plot_ly(data = nyc, z = ~Price, x = ~Food, y = ~Service, opacity = 0.6) %>%
add_markers()
# draw a plane
p %>%
add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
Fantastic! Is it surprising how service affects the price of a meal?
Parallel planes with location
We have explored models that included the quality of both food and service, as well as location, but we haven’t put these variables all into the same model. Let’s now build a parallel planes model that incorporates all three variables.
Examine the coefficients closely. Do they make sense based on what you understand about these data so far? How did the coefficients change from the previous models that you fit?
# Price by Food and Service and East
(mod <- lm(Price ~ Food + Service + East, data = nyc))
##
## Call:
## lm(formula = Price ~ Food + Service + East, data = nyc)
##
## Coefficients:
## (Intercept) Food Service East
## -20.8155 1.4863 1.6647 0.9649
Great job! Does it seem like location has a big impact on price?
The following statements are TRUE about the magnitude of the East coefficient.
Impact of location
The impact of location brings us to a modeling question: should we keep this variable in our model? In a later course, you will learn how we can conduct formal hypothesis tests to help us answer that question. In this course, we will focus on the size of the effect. Is the impact of location big or small?
One way to think about this would be in terms of practical significance. Is the value of the coefficient large enough to make a difference to your average person? The units are in dollars, so in this case the question is not hard to grasp.
Another way is to examine the impact of location in the context of the variability of the other variables. We can do this by building our parallel planes in 3D and seeing how far apart they are. Are the planes close together or far apart? Does the East variable clearly separate the data into two distinct groups? Or are the points all mixed up together?
plane0 <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})
plane1 <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[4]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})
# draw 3D scatterplot
p <- plot_ly(data = nyc, z = ~Price, x = ~Food, y = ~Service, opacity = 0.6) %>%
add_markers(color = ~factor(East))
# draw two planes
p %>%
add_surface(x = ~x, y = ~y, z = ~plane0, showscale = FALSE) %>%
add_surface(x = ~x, y = ~y, z = ~plane1, showscale = FALSE)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Awesome work! How does this visualization relate to the model coefficients you found in the last exercise?
Full model
One variable we haven't considered is Decor. Do people, on average, pay more for a meal in a restaurant with nicer decor? If so, does it still matter after controlling for the quality of food, service, and location?

By adding a third numeric explanatory variable to our model, we lose the ability to visualize the model in even three dimensions. Our model is now a hyperplane, or rather, parallel hyperplanes, and while we won't go any further with the geometry, know that we can continue to add as many variables to our model as we want. As humans, our spatial visualization ability taps out after three numeric variables (maybe you could argue for four, but certainly no further), but neither the mathematical equation for the regression model, nor the formula specification for the model in R, is bothered by the higher dimensionality.

Use lm() to fit a parallel planes model for Price as a function of Food, Service, Decor, and East.

Notice the dramatic change in the value of the Service coefficient.
lm(Price ~ Food + Service + East, data = nyc)
##
## Call:
## lm(formula = Price ~ Food + Service + East, data = nyc)
##
## Coefficients:
## (Intercept) Food Service East
## -20.8155 1.4863 1.6647 0.9649
(mod <- lm(Price ~ Food + Service + Decor + East, data = nyc))
##
## Call:
## lm(formula = Price ~ Food + Service + Decor + East, data = nyc)
##
## Coefficients:
## (Intercept) Food Service Decor East
## -24.023800 1.538120 -0.002727 1.910087 2.068050
The following interpretations are valid: