What if you have two groups?

Fitting a parallel slopes model

We use the lm() function to fit linear models to data. In this case, we want to understand how the price of MarioKart games sold at auction varies as a function of not only the number of wheels included in the package, but also whether the item is new or used. You would expect to pay a premium for a new copy, but how much is that premium? Can we estimate it after controlling for the number of wheels?

We will fit a parallel slopes model using lm(). In addition to the data argument, lm() needs to know which variables you want to include in your regression model, and how you want to include them. It accomplishes this using a formula argument. A simple linear regression formula looks like y ~ x, where y is the name of the response variable, and x is the name of the explanatory variable. Here, we will simply extend this formula to include multiple explanatory variables. A parallel slopes model has the form y ~ x + z, where z is a categorical explanatory variable, and x is a numerical explanatory variable.
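For example, using the built-in mtcars data (just an illustration of the formula syntax, separate from the auction data), a parallel slopes model for fuel efficiency could be specified as:

# mpg modeled by weight (numeric) and cylinder count (treated as categorical)
lm(mpg ~ wt + factor(cyl), data = mtcars)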

The output from lm() is a model object, which when printed, will show the fitted coefficients.

library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
mario_kart <- mariokart %>% filter(total_pr < 100)

# Explore the data
glimpse(mario_kart)
## Rows: 141
## Columns: 12
## $ id          <dbl> 150377422259, 260483376854, 320432342985, 280405224677,...
## $ duration    <int> 3, 7, 3, 3, 1, 3, 1, 1, 3, 7, 1, 1, 1, 1, 7, 7, 3, 3, 1...
## $ n_bids      <int> 20, 13, 16, 18, 20, 19, 13, 15, 29, 8, 15, 15, 13, 16, ...
## $ cond        <fct> new, used, new, new, new, new, used, new, used, used, n...
## $ start_pr    <dbl> 0.99, 0.99, 0.99, 0.99, 0.01, 0.99, 0.01, 1.00, 0.99, 1...
## $ ship_pr     <dbl> 4.00, 3.99, 3.50, 0.00, 0.00, 4.00, 0.00, 2.99, 4.00, 4...
## $ total_pr    <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99,...
## $ ship_sp     <fct> standard, firstClass, firstClass, standard, media, stan...
## $ seller_rate <int> 1580, 365, 998, 7, 820, 270144, 7284, 4858, 27, 201, 48...
## $ stock_photo <fct> yes, yes, no, yes, yes, yes, yes, yes, yes, no, yes, ye...
## $ wheels      <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2...
## $ title       <fct> "~~ Wii MARIO KART &amp; WHEEL ~ NINTENDO Wii ~ BRAND N...
# fit parallel slopes
lm(total_pr ~ wheels + cond, data = mario_kart)
## 
## Call:
## lm(formula = total_pr ~ wheels + cond, data = mario_kart)
## 
## Coefficients:
## (Intercept)       wheels     condused  
##      42.370        7.233       -5.585

Awesome! On average, how much more does a new game cost compared to a used game?
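You can read the answer off the condused coefficient (a quick sketch; we store the fitted model as mod, just as in the next exercise):

mod <- lm(total_pr ~ wheels + cond, data = mario_kart)

# condused is the price difference for used relative to new,
# so its negative is the premium for a new game (roughly $5.58)
-coef(mod)["condused"]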

Visualizing parallel slopes models

Using geom_line() and augment()

Parallel slopes models are so-named because we can visualize these models in the data space as not one line, but two parallel lines. To do this, we’ll draw two things: a scatterplot of the data, with the points colored by group, and a line of fitted values for each level of the categorical variable.

Our plotting strategy is to compute the fitted values, plot these, and connect the points to form a line. The augment() function from the broom package provides an easy way to add the fitted values to our data frame, and the geom_line() function can then use that data frame to plot the points and connect them.

Note that this approach has the added benefit of automatically coloring the lines appropriately to match the data.

You already know how to use ggplot() and geom_point() to make the scatterplot. The only twist is that now you’ll pass your augment()-ed model as the data argument in your ggplot() call. When you add your geom_line(), instead of letting the y aesthetic inherit its values from the ggplot() call, you can set it to the .fitted column of the augment()-ed model. This has the advantage of automatically coloring the lines for you.

library(broom)
library(ggplot2)

mod <- lm(total_pr ~ wheels + cond, data = mario_kart)

# Augment the model
augmented_mod <- augment(mod)
glimpse(augmented_mod)
## Rows: 141
## Columns: 9
## $ total_pr   <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99, ...
## $ wheels     <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2,...
## $ cond       <fct> new, used, new, new, new, new, used, new, used, used, ne...
## $ .fitted    <dbl> 49.60260, 44.01777, 49.60260, 49.60260, 56.83544, 42.369...
## $ .resid     <dbl> 1.9473995, -6.9777674, -4.1026005, -5.6026005, 14.164559...
## $ .std.resid <dbl> 0.40270893, -1.43671086, -0.84838977, -1.15857953, 2.926...
## $ .hat       <dbl> 0.02103158, 0.01250410, 0.02103158, 0.02103158, 0.019156...
## $ .sigma     <dbl> 4.902339, 4.868399, 4.892414, 4.881308, 4.750591, 4.8998...
## $ .cooksd    <dbl> 1.161354e-03, 8.712334e-03, 5.154337e-03, 9.612441e-03, ...
# scatterplot, with color
data_space <- ggplot(augmented_mod, aes(x = wheels, y = total_pr, color = cond)) + 
  geom_point()
  
# single call to geom_line()
data_space + 
  geom_line(aes(y = .fitted))

Great work! Thinking geometrically is a great way to get your head around how linear models work.

Interpreting parallel slopes coefficients

Three ways to describe a model

Mathematical

Syntax from math

The babies data set contains observations about the birthweight and other characteristics of children born in the San Francisco Bay area from 1960–1967.

We would like to build a model for birthweight as a function of the mother’s age and whether this child was her first (parity == 0). Use the mathematical specification below to code the model in R.

$$\text{birthweight} = \beta_0 + \beta_1 \cdot \text{age} + \beta_2 \cdot \text{parity} + \epsilon$$

# build model
lm(bwt ~ age + parity, data = babies)
## 
## Call:
## lm(formula = bwt ~ age + parity, data = babies)
## 
## Coefficients:
## (Intercept)          age       parity  
##   118.27782      0.06315     -1.65248

You’re a parallel slopes pro! How big is the effect of parity compared to age?

Syntax from plot

This time, we’d like to build a model for birthweight as a function of the length of gestation and the mother’s smoking status. Use the plot to inform your model specification.

# build model
lm(bwt ~ gestation + smoke, data = babies)
## 
## Call:
## lm(formula = bwt ~ gestation + smoke, data = babies)
## 
## Coefficients:
## (Intercept)    gestation        smoke  
##     -0.9317       0.4429      -8.0883

Great work! Visually examining your data is always a great place to start when you’re building models.

Model fit, residuals, and prediction

R-squared vs. adjusted R-squared

Two common measures of how well a model fits to data are $R^2$ (the coefficient of determination) and the adjusted $R^2$. The former measures the percentage of the variability in the response variable that is explained by the model. To compute this, we define

$$R^2 = 1 - \frac{SSE}{SST},$$

where $SSE$ and $SST$ are the sum of the squared residuals and the total sum of the squares, respectively. One issue with this measure is that the $SSE$ can only decrease as new variables are added to the model, while the $SST$ depends only on the response variable and therefore is not affected by changes to the model. This means that you can increase $R^2$ by adding any additional variables to your model — even random noise.

The adjusted $R^2$ includes a term that penalizes a model for each additional explanatory variable (where $p$ is the number of explanatory variables):

$$R^2_{adj} = 1 - \frac{SSE}{SST} \cdot \frac{n-1}{n-p-1}$$
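As a sanity check, both measures can be computed by hand from the residuals of mod, the parallel slopes model fit above (a quick sketch; the values should match the summary() output below):

# compute R-squared and adjusted R-squared manually
sse <- sum(residuals(mod)^2)
sst <- sum((mario_kart$total_pr - mean(mario_kart$total_pr))^2)
n <- nrow(mario_kart)
p <- 2  # number of explanatory variables: wheels and cond

1 - sse / sst                              # R-squared, about 0.7165
1 - (sse / sst) * ((n - 1) / (n - p - 1))  # adjusted R-squared, about 0.7124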

We can see both measures in the output of the summary() function on our model object.

# R^2 and adjusted R^2
summary(mod)
## 
## Call:
## lm(formula = total_pr ~ wheels + cond, data = mario_kart)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0078  -3.0754  -0.8254   2.9822  14.1646 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.3698     1.0651  39.780  < 2e-16 ***
## wheels        7.2328     0.5419  13.347  < 2e-16 ***
## condused     -5.5848     0.9245  -6.041 1.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.887 on 138 degrees of freedom
## Multiple R-squared:  0.7165, Adjusted R-squared:  0.7124 
## F-statistic: 174.4 on 2 and 138 DF,  p-value: < 2.2e-16
# add random noise
mario_kart_noisy <- mario_kart %>%
  mutate(noise = rnorm(nrow(mario_kart)))
  
# compute new model
mod2 <- lm(total_pr ~ wheels + cond + noise, data = mario_kart_noisy)

# new R^2 and adjusted R^2
summary(mod2)
## 
## Call:
## lm(formula = total_pr ~ wheels + cond + noise, data = mario_kart_noisy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7992  -3.0877  -0.6089   2.9021  14.7793 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.6295     1.0394  41.015  < 2e-16 ***
## wheels        7.1277     0.5281  13.496  < 2e-16 ***
## condused     -5.6320     0.8991  -6.264 4.54e-09 ***
## noise         1.2212     0.4084   2.990  0.00331 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.753 on 137 degrees of freedom
## Multiple R-squared:  0.7339, Adjusted R-squared:  0.7281 
## F-statistic: 125.9 on 3 and 137 DF,  p-value: < 2.2e-16

Perfect! How did adding random noise change the value of $R^2$? And the adjusted $R^2$?
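To see the comparison at a glance, you can pull both values straight out of the model summaries:

# R-squared can only go up when a variable is added; adjusted R-squared may not
c(summary(mod)$r.squared, summary(mod2)$r.squared)
c(summary(mod)$adj.r.squared, summary(mod2)$adj.r.squared)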

Prediction

Once we have fit a regression model, we can use it to make predictions for unseen observations or retrieve the fitted values. Here, we explore two methods for doing the latter.

A traditional way to return the fitted values (i.e. the $\hat{y}$’s) is to run the predict() function on the model object. This will return a vector of the fitted values. Note that predict() will take an optional newdata argument that will allow you to make predictions for observations that are not in the original data.

A newer alternative is the augment() function from the broom package, which returns a data.frame with the response variable ($y$), the relevant explanatory variables (the $x$’s), the fitted values ($\hat{y}$), and some information about the residuals ($e$). augment() will also take a newdata argument that allows you to make predictions.
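Since the code below only retrieves the in-sample fitted values, here is a quick sketch of the newdata usage with a made-up auction (two wheels, used condition):

# out-of-sample prediction for a hypothetical auction
new_auction <- data.frame(wheels = 2, cond = "used")
predict(mod, newdata = new_auction)   # a single predicted price
augment(mod, newdata = new_auction)   # a one-row data frame with .fitted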

# return a vector
predict(mod)
##        1        2        3        4        5        6        7        8 
## 49.60260 44.01777 49.60260 49.60260 56.83544 42.36976 36.78493 56.83544 
##        9       10       11       12       13       14       15       16 
## 44.01777 44.01777 56.83544 56.83544 56.83544 56.83544 44.01777 36.78493 
##       17       18       19       20       21       22       23       24 
## 49.60260 49.60260 56.83544 36.78493 56.83544 56.83544 56.83544 44.01777 
##       25       26       27       28       29       30       31       32 
## 56.83544 36.78493 36.78493 36.78493 49.60260 36.78493 36.78493 44.01777 
##       33       34       35       36       37       38       39       40 
## 51.25061 44.01777 44.01777 36.78493 44.01777 56.83544 56.83544 49.60260 
##       41       42       43       44       45       46       47       48 
## 44.01777 51.25061 56.83544 56.83544 44.01777 56.83544 36.78493 36.78493 
##       49       50       51       52       53       54       55       56 
## 44.01777 56.83544 36.78493 44.01777 42.36976 36.78493 36.78493 44.01777 
##       57       58       59       60       61       62       63       64 
## 44.01777 36.78493 36.78493 56.83544 36.78493 56.83544 36.78493 51.25061 
##       65       66       67       68       69       70       71       72 
## 56.83544 44.01777 58.48345 51.25061 49.60260 44.01777 49.60260 56.83544 
##       73       74       75       76       77       78       79       80 
## 56.83544 51.25061 44.01777 36.78493 36.78493 36.78493 44.01777 56.83544 
##       81       82       83       84       85       86       87       88 
## 44.01777 65.71629 44.01777 56.83544 36.78493 49.60260 49.60260 36.78493 
##       89       90       91       92       93       94       95       96 
## 44.01777 36.78493 51.25061 44.01777 36.78493 51.25061 42.36976 56.83544 
##       97       98       99      100      101      102      103      104 
## 51.25061 44.01777 51.25061 56.83544 56.83544 56.83544 36.78493 49.60260 
##      105      106      107      108      109      110      111      112 
## 51.25061 44.01777 56.83544 49.60260 36.78493 44.01777 51.25061 56.83544 
##      113      114      115      116      117      118      119      120 
## 64.06828 44.01777 49.60260 44.01777 49.60260 51.25061 42.36976 44.01777 
##      121      122      123      124      125      126      127      128 
## 56.83544 44.01777 49.60260 44.01777 51.25061 56.83544 56.83544 49.60260 
##      129      130      131      132      133      134      135      136 
## 56.83544 36.78493 44.01777 44.01777 36.78493 56.83544 36.78493 44.01777 
##      137      138      139      140      141 
## 36.78493 51.25061 49.60260 36.78493 56.83544
# return a data frame
augment(mod)
## # A tibble: 141 x 9
##    total_pr wheels cond  .fitted .resid .std.resid   .hat .sigma   .cooksd
##       <dbl>  <int> <fct>   <dbl>  <dbl>      <dbl>  <dbl>  <dbl>     <dbl>
##  1     51.6      1 new      49.6  1.95      0.403  0.0210   4.90 0.00116  
##  2     37.0      1 used     44.0 -6.98     -1.44   0.0125   4.87 0.00871  
##  3     45.5      1 new      49.6 -4.10     -0.848  0.0210   4.89 0.00515  
##  4     44        1 new      49.6 -5.60     -1.16   0.0210   4.88 0.00961  
##  5     71        2 new      56.8 14.2       2.93   0.0192   4.75 0.0557   
##  6     45        0 new      42.4  2.63      0.551  0.0475   4.90 0.00505  
##  7     37.0      0 used     36.8  0.235     0.0486 0.0209   4.91 0.0000168
##  8     54.0      2 new      56.8 -2.85     -0.588  0.0192   4.90 0.00225  
##  9     47        1 used     44.0  2.98      0.614  0.0125   4.90 0.00159  
## 10     50        1 used     44.0  5.98      1.23   0.0125   4.88 0.00640  
## # ... with 131 more rows

Perfect! Can you see the difference between the functions predict() and augment()?

Understanding interaction

The rate at which gas mileage changes with engine displacement varies by year of manufacture, but not by a significant amount.

Fitting a model with interaction

Including an interaction term in a model is easy — we just have to tell lm() that we want to include that new variable. An expression of the form

lm(y ~ x + z + x:z, data = mydata)

will do the trick. The use of the colon (:) here means that the interaction between x and z will be a third term in the model.
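As a side note, R also accepts the shorthand y ~ x * z, which expands to the same three terms (both main effects plus their interaction). For the auction data, this call fits the same model as the one spelled out below:

# duration * cond is shorthand for duration + cond + duration:cond
lm(total_pr ~ duration * cond, data = mario_kart)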

# include interaction
lm(total_pr ~ duration + cond + cond:duration, data = mario_kart)
## 
## Call:
## lm(formula = total_pr ~ duration + cond + cond:duration, data = mario_kart)
## 
## Coefficients:
##       (Intercept)           duration           condused  duration:condused  
##            58.268             -1.966            -17.122              2.325

Great work! How does adding an interaction term affect the other coefficients in the model?

Visualizing interaction models

Interaction allows the slope of the regression line in each group to vary. In this case, this means that the relationship between the final price and the length of the auction is moderated by the condition of each item.

Interaction models are easy to visualize in the data space with ggplot2 because they have the same coefficients as if the models were fit independently to each group defined by the level of the categorical variable. In this case, new and used MarioKarts each get their own regression line. To see this, we can set an aesthetic (e.g. color) to the categorical variable, and then add a geom_smooth() layer to overlay the regression line for each color.

# interaction plot
ggplot(mario_kart, aes(y = total_pr, x = duration, color = cond)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Great work! How does the interaction model differ from the parallel slopes model?

Simpson’s Paradox

Adding one additional variable for “SAT rate” reverses the relationship from negative to positive. Why?

In this case, for every additional thousand dollars of salary for teachers in a particular state, the expected SAT score for a student from that state is about 2 points higher, after controlling for the percentage of students taking the SAT.

Simpson’s paradox in action

A mild version of Simpson’s paradox can be observed in the MarioKart auction data. Consider the relationship between the final auction price and the length of the auction. It seems reasonable to assume that longer auctions would result in higher prices, since—other things being equal—a longer auction gives more bidders more time to see the auction and bid on the item.

However, a simple linear regression model reveals the opposite: longer auctions are associated with lower final prices. The problem is that all other things are not equal. In this case, the new MarioKarts—which people pay a premium for—were mostly sold in one-day auctions, while a plurality of the used MarioKarts were sold in the standard seven-day auctions.

Our simple linear regression model is misleading, in that it suggests a negative relationship between final auction price and duration. However, for the used MarioKarts, the relationship is positive.

slr <- ggplot(mario_kart, aes(y = total_pr, x = duration)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

# model with one slope
lm(total_pr ~ duration, data = mario_kart)
## 
## Call:
## lm(formula = total_pr ~ duration, data = mario_kart)
## 
## Coefficients:
## (Intercept)     duration  
##      52.374       -1.317
# plot with two slopes
slr + aes(color = cond)
## `geom_smooth()` using formula 'y ~ x'

Great work! Which of the two groups is showing signs of Simpson’s paradox?

Adding a numerical explanatory variable

Fitting a MLR model

In terms of the R code, fitting a multiple linear regression model is easy: simply add variables to the model formula you specify in the lm() command.

In a parallel slopes model, we had two explanatory variables: one was numeric and one was categorical. Here, we will allow both explanatory variables to be numeric.

# Fit the model using duration and start_pr
(mod <- lm(total_pr ~ duration + start_pr, data = mario_kart))
## 
## Call:
## lm(formula = total_pr ~ duration + start_pr, data = mario_kart)
## 
## Coefficients:
## (Intercept)     duration     start_pr  
##      51.030       -1.508        0.233

Awesome work! Using R to fit multiple regression models is just as simple as fitting simpler models.

Tiling the plane

One method for visualizing a multiple linear regression model is to create a heatmap of the fitted values in the plane defined by the two explanatory variables. This heatmap will illustrate how the model output changes over different combinations of the explanatory variables.

This is a multistep process: first, build a grid of values of the two explanatory variables with data_grid(); then use augment() with the newdata argument to compute the fitted values over that grid; and finally layer those fitted values onto the scatterplot with geom_tile().

library(modelr)
## 
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
## 
##     bootstrap
data_space <- ggplot(mario_kart, aes(x = duration, y = start_pr)) +
  geom_point(aes(color = total_pr))
  
grid <- mario_kart %>%
  data_grid(
    duration = seq_range(duration, by = 1),
    start_pr = seq_range(start_pr, by = 1)
  )

# add predictions to grid
price_hats <- augment(mod, newdata = grid)

# tile the plane
data_space + 
  geom_tile(data = price_hats, aes(fill = .fitted), alpha = 0.5)

Great work! Visualizing higher dimensional models takes some creativity.

Models in 3D

An alternative way to visualize a multiple regression model with two numeric explanatory variables is as a plane in three dimensions. This is possible in R using the plotly package.

We have created three objects that you will need: the vectors x and y, which span the observed values of the two explanatory variables, and the matrix plane, which contains the corresponding fitted values.

Much like ggplot(), the plot_ly() function will allow you to create a plot object with variables mapped to x, y, and z aesthetics. The add_markers() function is similar to geom_point() in that it allows you to add points to your 3D plot.

Note that plot_ly uses the pipe (%>%) operator to chain commands together.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
x <- seq(from = 1, to = 10, by = 9/69)
y <- seq(from = 0.01, to = 69.95, by = 69.94/69)
plane <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})


# draw the 3D scatterplot
p <- plot_ly(data = mario_kart, z = ~total_pr, x = ~duration, y = ~start_pr, opacity = 0.6) %>%
  add_markers() 
  
# draw the plane
p %>%
  add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

What a plot! You’ve added a whole new dimension to regression!

Conditional interpretation of coefficients

“While holding age constant … the effect of x on y”

“While holding gestational length constant … the effect of x on y”

Coefficient magnitude

The coefficients from our model for the total auction price of MarioKarts as a function of auction duration and starting price are shown below.

Call:
lm(formula = totalPr ~ duration + startPr, data = mario_kart)

Coefficients:
(Intercept)     duration      startPr  
     51.030       -1.508        0.233  

A colleague claims that these results imply that the duration of the auction is a more important determinant of final price than starting price, because the coefficient is larger. This interpretation is false because:

The coefficients have different units (dollars per day and dollars per dollar, respectively) and so they are not directly comparable.
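If you did want to compare the two effects on a common footing, one informal option (not part of the original exercise) is to standardize the explanatory variables before fitting, so that each coefficient measures dollars per standard deviation:

# refit with standardized explanatory variables (a rough sketch)
lm(total_pr ~ scale(duration) + scale(start_pr), data = mario_kart)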

Practicing interpretation

Fit a multiple regression model for the total auction price of an item in the mario_kart data set as a function of the starting price and the duration of the auction. Compute the coefficients and choose the correct interpretation of the duration variable.

For each additional day the auction lasts, the expected final price declines by $1.51, after controlling for starting price.

Adding a third (categorical) variable

Visualizing parallel planes

By including the duration, starting price, and condition variables in our model, we now have two numeric explanatory variables and one categorical explanatory variable. Our model now takes the geometric form of two parallel planes!

The first plane corresponds to the model output when the condition of the item is new, while the second plane corresponds to the model output when the condition of the item is used. The planes have the same slopes along both the duration and starting price axes—it is the z-intercept that is different.

Once again we have stored the x and y vectors for you. Since we now have two planes, there are matrix objects plane0 and plane1 stored for you as well.

(mod <- lm(total_pr ~ duration + start_pr + cond, data = mario_kart))
## 
## Call:
## lm(formula = total_pr ~ duration + start_pr + cond, data = mario_kart)
## 
## Coefficients:
## (Intercept)     duration     start_pr     condused  
##     53.3448      -0.6560       0.1982      -8.9493
plane0 <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})
plane1 <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[4]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})

# draw the 3D scatterplot
p <- plot_ly(data = mario_kart, z = ~total_pr, x = ~duration, y = ~start_pr, opacity = 0.6) %>%
  add_markers(color = ~cond) 
  
# draw two planes
p %>%
  add_surface(x = ~x, y = ~y, z = ~plane0, showscale = FALSE) %>%
  add_surface(x = ~x, y = ~y, z = ~plane1, showscale = FALSE)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Awesome! Unfortunately, a parallel planes model is about as complicated a model as we’re able to visualize.

The expected premium for new (relative to used) MarioKarts is $8.95, after controlling for the duration and starting price of the auction.

Higher dimensions

Interpretation of coefficient in a big model

This time we have thrown even more variables into our model, including the number of bids in each auction (nBids) and the number of wheels. Unfortunately this makes a full visualization of our model impossible, but we can still interpret the coefficients.

Call:
lm(formula = totalPr ~ duration + startPr + cond + wheels + nBids, 
    data = mario_kart)

Coefficients:
(Intercept)     duration      startPr     condused       wheels  
    39.3741      -0.2752       0.1796      -4.7720       6.7216  
      nBids  
     0.1909  

Choose the correct interpretation of the coefficient on the number of wheels:

Each additional wheel is associated with an increase in the expected auction price of $6.72, after controlling for auction duration, starting price, number of bids, and the condition of the item.

What is logistic regression?

Fitting a line to a binary response

When our response variable is binary, a regression model has several limitations. Among the more obvious—and logically incongruous—is that the regression line extends infinitely in either direction. This means that even though our response variable $y$ only takes on the values 0 and 1, our fitted values $\hat{y}$ can range anywhere from $-\infty$ to $+\infty$. This doesn’t make sense.

To see this in action, we’ll fit a linear regression model to data about 55 students who applied to medical school. We want to understand how their undergraduate GPA relates to the probability they will be accepted by a particular school (Acceptance).

library(Stat2Data)
data("MedGPA")

# scatterplot with jitter
data_space <- ggplot(data = MedGPA, aes(y = Acceptance, x = GPA)) + 
  geom_jitter(width = 0, height = 0.05, alpha = 0.5)

# linear regression line
data_space + 
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Great work! How well does the linear model fit the data?

Fitting a line to a binary response (2)

In the previous exercise, we identified a major limitation to fitting a linear regression model when we have a binary response variable. However, it is not always inappropriate to do so. Note that our regression line only makes illogical predictions (i.e. $\hat{y} < 0$ or $\hat{y} > 1$) for students with very high or very low GPAs. For GPAs closer to average, the predictions seem fine.

Moreover, the alternative logistic regression model — which we will fit next — is very similar to the linear regression model for observations near the average of the explanatory variable. It just so happens that the logistic curve is very straight near its middle. Thus, in these cases a linear regression model may still be acceptable, even for a binary response.

# filter
MedGPA_middle <- MedGPA %>%
  filter(GPA >= 3.375, GPA <= 3.770)

# scatterplot with jitter
data_space <- ggplot(data = MedGPA_middle, aes(y = Acceptance, x = GPA)) + 
  geom_jitter(width = 0, height = 0.05, alpha = 0.5)

# linear regression line
data_space + 
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Perfect plotting! In cases like this, it can be easier to use a simpler model.

Fitting a model

Logistic regression is a special case of a broader class of generalized linear models, often known as GLMs. Specifying a logistic regression model is very similar to specifying a linear regression model, with two important differences: we use the glm() function instead of lm(), and we have to specify the family argument (here, binomial) so that glm() knows to fit a logistic regression.

We still use the formula and data arguments with glm().

Note that the mathematical model is now:

$$\log\left(\frac{y}{1-y}\right) = \beta_0 + \beta_1 x + \epsilon,$$

where ϵ is the error term.

# fit model
(mod <- glm(Acceptance ~ GPA, data = MedGPA, family = binomial))
## 
## Call:  glm(formula = Acceptance ~ GPA, family = binomial, data = MedGPA)
## 
## Coefficients:
## (Intercept)          GPA  
##     -19.207        5.454  
## 
## Degrees of Freedom: 54 Total (i.e. Null);  53 Residual
## Null Deviance:       75.79 
## Residual Deviance: 56.84     AIC: 60.84

Awesome! Fitting a GLM isn’t much harder than a regular linear model.

Visualizing logistic regression

Using geom_smooth()

Our logistic regression model can be visualized in the data space by overlaying the appropriate logistic curve. We can use the geom_smooth() function to do this. Recall that geom_smooth() takes a method argument that allows you to specify what type of smoother you want to see. In our case, we need to specify that we want to use the glm() function to do the smoothing.

However we also need to tell the glm() function which member of the GLM family we want to use. To do this, we will pass the family argument to glm() as a list using the method.args argument to geom_smooth(). This mechanism is common in R, and allows one function to pass a list of arguments to another function.

# scatterplot with jitter
data_space <- ggplot(data = MedGPA, aes(y = Acceptance, x = GPA)) + 
  geom_jitter(width = 0, height = 0.05, alpha = 0.5)

# add logistic curve
data_space +
  geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"))
## `geom_smooth()` using formula 'y ~ x'

What a plot! What differences can you spot between this curve and the linear model you plotted earlier?

Using bins

One of the difficulties in working with a binary response variable is understanding how it “changes.” The response itself ($y$) is either 0 or 1, while the fitted values ($\hat{y}$) — which are interpreted as probabilities — are between 0 and 1. But if every medical school applicant is either admitted or not, what does it mean to talk about the probability of being accepted?

What we’d like is a larger sample of students, so that for each GPA value (e.g. 3.54) we had many observations (say n), and we could then take the average of those n observations to arrive at the estimated probability of acceptance. Unfortunately, since the explanatory variable is continuous, this is hopeless — it would take an infinite amount of data to make these estimates robust.

Instead, what we can do is put the observations into bins based on their GPA value. Within each bin, we can compute the proportion of accepted students, and we can visualize our model as a smooth logistic curve through those binned values.

Below, we create a data.frame called MedGPA_binned that aggregates the original data into six bins based on GPA, recording the mean GPA and the acceptance rate within each bin. We also augment() the model to get the fitted values for each observation.

Here we are plotting $\hat{y}$ as a function of $x$, where that function is

$$\hat{y} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x)}$$

Note that the left hand side is the expected probability $\hat{y}$ of being accepted to medical school.

(gpa_bins <- quantile(MedGPA$GPA, probs = seq(0, 1, 1/6)))
##        0% 16.66667% 33.33333%       50% 66.66667% 83.33333%      100% 
##      2.72      3.30      3.44      3.58      3.70      3.87      3.97
MedGPA$bins <- cut(MedGPA$GPA, breaks = gpa_bins, include.lowest = TRUE)
head(MedGPA)
##   Accept Acceptance Sex BCPM  GPA VR PS WS BS MCAT Apps       bins
## 1      D          0   F 3.59 3.62 11  9  9  9   38    5 (3.58,3.7]
## 2      A          1   M 3.75 3.84 12 13  8 12   45    3 (3.7,3.87]
## 3      A          1   F 3.24 3.23  9 10  5  9   33   19 [2.72,3.3]
## 4      A          1   F 3.74 3.69 12 11  7 10   40    5 (3.58,3.7]
## 5      A          1   F 3.53 3.38  9 11  4 11   35   11 (3.3,3.44]
## 6      A          1   M 3.59 3.72 10  9  7 10   36    5 (3.7,3.87]
(MedGPA_binned <- MedGPA %>%
group_by(bins) %>%
summarize(mean_GPA = mean(GPA), acceptance_rate = mean(Acceptance)))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 3
##   bins        mean_GPA acceptance_rate
##   <fct>          <dbl>           <dbl>
## 1 [2.72,3.3]      3.11           0.2  
## 2 (3.3,3.44]      3.39           0.2  
## 3 (3.44,3.58]     3.54           0.75 
## 4 (3.58,3.7]      3.65           0.333
## 5 (3.7,3.87]      3.79           0.889
## 6 (3.87,3.97]     3.91           1
# binned points and line
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = acceptance_rate)) + 
  geom_point() + geom_line()

# augmented model
(MedGPA_plus <- mod %>%
  augment(type.predict = "response"))
## # A tibble: 55 x 8
##    Acceptance   GPA .fitted .resid .std.resid   .hat .sigma .cooksd
##         <int> <dbl>   <dbl>  <dbl>      <dbl>  <dbl>  <dbl>   <dbl>
##  1          0  3.62   0.631 -1.41      -1.43  0.0268   1.03 0.0242 
##  2          1  3.84   0.850  0.569      0.581 0.0386   1.04 0.00368
##  3          1  3.23   0.169  1.88       1.94  0.0527   1.01 0.144  
##  4          1  3.69   0.715  0.819      0.832 0.0304   1.04 0.00644
##  5          1  3.38   0.316  1.52       1.55  0.0400   1.02 0.0470 
##  6          1  3.72   0.747  0.764      0.776 0.0323   1.04 0.00584
##  7          1  3.89   0.882  0.501      0.512 0.0396   1.04 0.00287
##  8          0  3.34   0.271 -0.795     -0.813 0.0444   1.04 0.00904
##  9          1  3.71   0.737  0.782      0.795 0.0316   1.04 0.00603
## 10          1  3.89   0.882  0.501      0.512 0.0396   1.04 0.00287
## # ... with 45 more rows
# logistic model on probability scale
data_space +
  geom_line(data = MedGPA_plus, aes(x = GPA, y = .fitted), color = "red")

Great work! The logistic predictions seem to follow the binned values pretty well.

Three scales approach to interpretation

Odds scale

For most people, the idea that we could estimate the probability of being admitted to medical school based on undergraduate GPA is fairly intuitive. However, thinking about how the probability changes as a function of GPA is complicated by the non-linear logistic curve. By translating the response from the probability scale to the odds scale, we make the right hand side of our equation easier to understand.

If the probability of getting accepted is $y$, then the odds are $\frac{y}{1-y}$. Expressions of probabilities in terms of odds are common in many situations, perhaps most notably gambling.
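For example, a probability of 0.75 corresponds to odds of 3, or 3:1 in favor:

p <- 0.75
odds <- p / (1 - p)   # 3, i.e. 3:1 in favor
odds / (1 + odds)     # converts back to the probability, 0.75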

Here we are plotting $\frac{\hat{y}}{1-\hat{y}}$ as a function of $x$, where that function is

$$\text{odds}(\hat{y}) = \frac{\hat{y}}{1-\hat{y}} = \exp(\hat{\beta}_0 + \hat{\beta}_1 x)$$

Note that the left hand side is the expected odds of being accepted to medical school. The right hand side is now a familiar exponential function of $x$.

The MedGPA_binned data frame contains the data for each GPA bin, while the MedGPA_plus data frame records the original observations after being augment()-ed by mod.

# compute odds for bins
MedGPA_binned <- MedGPA_binned %>%
  mutate(odds = acceptance_rate / (1 - acceptance_rate))

# plot binned odds
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = odds)) + 
  geom_point() + geom_line()

# compute odds for observations
MedGPA_plus <- MedGPA_plus %>%
  mutate(odds_hat = .fitted / (1 - .fitted))

# logistic model on odds scale
data_space +
  geom_line(data = MedGPA_plus, aes(x = GPA, y = odds_hat), color = "red")

Fantastic job! Being able to move back and forth between different scales is essential to interpreting logistic regression models.

Log-odds scale

Previously, we considered two formulations of logistic regression models:

- on the probability scale, the units are easy to interpret, but the function is non-linear, which makes it hard to understand
- on the odds scale, the units are harder to interpret, and the function is exponential, which makes it harder to interpret

We’ll now add a third formulation:

- on the log-odds scale, the units are nearly impossible to interpret, but the function is linear, which makes it easy to understand

As you can see, none of these three is uniformly superior. Most people tend to interpret the fitted values on the probability scale and the function on the log-odds scale. The interpretation of the coefficients is most commonly done on the odds scale. Recall that we interpreted our slope coefficient $\beta_1$ in linear regression as the expected change in $y$ given a one unit change in $x$. On the probability scale, the function is non-linear and so this approach won’t work. On the log-odds scale, the function is linear, but the units are not interpretable (what does the log of the odds mean??). However, on the odds scale, a one unit change in $x$ leads to the odds being multiplied by a factor of $\exp(\beta_1)$. To see why, we form the odds ratio:

$$OR = \frac{\text{odds}(\hat{y} \mid x + 1)}{\text{odds}(\hat{y} \mid x)} = \exp(\beta_1)$$

Thus, the exponentiated coefficient $\exp(\beta_1)$ tells us how the expected odds change for a one unit increase in the explanatory variable. It is tempting to interpret this as a change in the expected probability, but this is wrong and can lead to nonsensical predictions (e.g. expected probabilities greater than 1).
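As a quick check with the medical school model (assuming mod is still the logistic model fit earlier):

# odds ratio for a one-point increase in GPA: exp(5.454), roughly 234
exp(coef(mod)["GPA"])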

# compute log odds for bins
MedGPA_binned <- MedGPA_binned %>%
  mutate(log_odds = log(acceptance_rate / (1 - acceptance_rate)))

# plot binned log odds
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = log_odds)) + 
  geom_point() + geom_line()

# compute log odds for observations
MedGPA_plus <- MedGPA_plus %>%
  mutate(log_odds_hat = log(.fitted / (1 - .fitted)))

# logistic model on log odds scale
data_space +
  geom_line(data = MedGPA_plus, aes(x = GPA, y = log_odds_hat), color = "red")

Amazing work! When you’re on the log-odds scale, your model is a simple linear function.

Interpretation of logistic regression

The fitted coefficient $\hat{\beta}_1$ from the medical school logistic regression model is 5.45. The exponential of this is 233.73.

Donald’s GPA is 2.9, and thus the model predicts that the probability of him getting into medical school is 3.26%. The odds of Donald getting into medical school are 0.0337, or—phrased in gambling terms — 29.6:1. If Donald hacks the school’s registrar and changes his GPA to 3.9, then which of the following statements are TRUE:
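You can verify both scenarios directly from the fitted coefficients; plogis() applies the inverse of the log-odds transformation:

# predicted probabilities of acceptance at GPAs of 2.9 and 3.9
plogis(coef(mod)[1] + coef(mod)[2] * c(2.9, 3.9))   # about 0.03 and 0.89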

Using a logistic model

The model predicted that 91 patients would die and that 12 would survive. Of those predicted to die, 71 actually did die; of those predicted to survive, 8 actually did survive.

Accuracy was 79/103 (predictions that matched the actual values, divided by the total number of observations), or about 77%.

Making probabilistic predictions

Just as we did with linear regression, we can use our logistic regression model to make predictions about new observations. In this exercise, we will use the newdata argument to the augment() function from the broom package to make predictions about students who were not in our original data set. These predictions are sometimes called out-of-sample.

Following our previous discussion about scales, with logistic regression it is important that we specify on which scale we want the predicted values. Although the default is the scale of the linear predictor – which for logistic regression is the log-odds scale – we want our predictions on the probability scale, which is the scale of the response variable. The type.predict argument to augment() controls this behavior.

A logistic regression model object, mod, has been defined for you.

# create new data frame
new_data <- data.frame(GPA = 3.51)

# make predictions
augment(mod, newdata = new_data, type.predict = "response")
## # A tibble: 1 x 2
##     GPA .fitted
##   <dbl>   <dbl>
## 1  3.51   0.484

Perfect predicting! By framing your prediction as a probability you can show how likely it is that this student will get admitted to medical school.

Making binary predictions

Naturally, we want to know how well our model works. Did it predict acceptance for the students who were actually accepted to medical school? Did it predict rejections for the students who were not admitted? These types of predictions are called in-sample. One common way to evaluate models with a binary response is with a confusion matrix. [Yes, that is actually what it is called!]

However, note that while our response variable is binary, our fitted values are probabilities. Thus, we have to round them somehow into binary predictions. While the probabilities convey more information, we might ultimately have to make a decision, and so this rounding is common in practice. There are many different ways to round, but for simplicity we will predict admission if the fitted probability is greater than 0.5, and rejection otherwise.

First, we’ll use augment() to make the predictions, and then mutate() and round() to convert these probabilities into binary decisions. Then we will form the confusion matrix using the table() function. table() will compute a 2-way table when given a data frame with two categorical variables, so we will first use select() to grab only those variables.

You will find that this model made only 15 mistakes on these 55 observations, so it is nearly 73% accurate.

# data frame with binary predictions
tidy_mod <- augment(mod, type.predict = "response") %>% 
  mutate(Acceptance_hat = round(.fitted)) 
  
# confusion matrix
tidy_mod %>% 
  select(Acceptance, Acceptance_hat) %>%
  table()
##           Acceptance_hat
## Acceptance  0  1
##          0 16  9
##          1  6 24
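As a follow-up, the accuracy quoted above can be computed directly from the same data frame:

# proportion of correct binary predictions: (16 + 24) / 55, about 0.73
mean(tidy_mod$Acceptance == tidy_mod$Acceptance_hat)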

Great work! This model doesn’t seem like it’s too bad!

Italian restaurants in NYC

Exploratory data analysis

Multiple regression can be an effective technique for understanding how a response variable changes as a result of changes to more than one explanatory variable. But it is not magic – understanding the relationships among the explanatory variables is also necessary, and will help us build a better model. This process is often called exploratory data analysis (EDA) and is covered in another DataCamp course.

One quick technique for jump-starting EDA is to examine all of the pairwise scatterplots in your data. This can be achieved using the pairs() function. Look for variables in the nyc data set that are strongly correlated, as those relationships will help us check for multicollinearity later on.

Which pairs of variables appear to be strongly correlated?

nyc <- read.csv("_data/nyc.csv", stringsAsFactors = TRUE)

## Scatter Plot Matrix with histogram diagonals and r/CI95 on the top panels (Run top to bottom from here)
## Histograms on the diagonal (Thanks to Ane Handles)
panel.hist <- function(x, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, col="lavender", ...)
}
## Correlations & 95% CIs on the upper panels (Thanks to Ane Handles) + p-values 
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  # correlation, its 95% CI, and the p-value
  rc <- cor.test(x, y)
  r <- cor(x, y, use = "complete.obs")
  rci <- rc$conf.int
  p <- rc$p.value
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  prefix <- "r = "
  txt4 <- format(c(p, 0.123456789), digits = digits)[1]
  txt4 <- paste(",\np= ", txt4, sep = "")
  if (p < 0.01) txt4 <- paste(",\np= ", "<0.01", sep = "")
  txt2 <- format(c(rci, 0.123456789), digits = digits)[1]
  txt3 <- format(c(rci, 0.123456789), digits = digits)[2]
  prefix2 <- "\nCI = "
  txt <- paste(prefix, txt, prefix2, txt2, ", ", txt3, sep = "", txt4)
  if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = 1)
}
## Actual Scatter Plot Matrix
pairs(nyc[2:7], 
      lower.panel=panel.smooth, cex = .8, pch = 21, bg="steelblue",
      diag.panel=panel.hist, cex.labels = 1.2, font.labels=2, 
      upper.panel=panel.cor)

pairs(nyc[2:7])

SLR models

Based on your knowledge of the restaurant industry, do you think that the quality of the food in a restaurant is an important determinant of the price of a meal at that restaurant? It would be hard to imagine that it wasn’t. We’ll start our modeling process by plotting and fitting a model for Price as a function of Food.

On your own, interpret these coefficients and examine the fit of the model. What does the coefficient of Food mean in plain English? “Each additional rating point of food quality is associated with a…”

# Price by Food plot
ggplot(data = nyc, aes(x = Food, y = Price)) +
  geom_point()

# Price by Food model
lm(Price ~ Food, data = nyc)
## 
## Call:
## lm(formula = Price ~ Food, data = nyc)
## 
## Coefficients:
## (Intercept)         Food  
##     -17.832        2.939

Great work! What does the simple linear model say about how food quality affects price?

Incorporating another variable

Parallel lines with location

In real estate, a common mantra is that the three most important factors in determining the price of a property are “location, location, and location.” If location drives up property values and rents, then we might imagine that location would increase a restaurant’s costs, which would result in them having higher prices. In many parts of New York, the east side (east of 5th Avenue) is more developed and perhaps more expensive. [This is increasingly less true, but was more true at the time these data were collected.]

Let’s expand our model into a parallel slopes model by including the East variable in addition to Food.

Use lm() to fit a parallel slopes model for Price as a function of Food and East. Interpret the coefficients and the fit of the model. Can you explain the meaning of the coefficient on East in simple terms? Did the coefficient on Food change from the previous model? If so, why? Did it change by a lot or just a little?

lm(Price ~ Food + East, data = nyc)
## 
## Call:
## lm(formula = Price ~ Food + East, data = nyc)
## 
## Coefficients:
## (Intercept)         Food         East  
##     -17.430        2.875        1.459

The statements that are TRUE:

A plane in 3D

One reason that many people go to a restaurant — apart from the food — is that they don’t have to cook or clean up. Many people appreciate the experience of being waited upon, and we can all agree that the quality of the service at restaurants varies widely. Are people willing to pay more for better restaurant Service? More interestingly, are they willing to pay more for better service, after controlling for the quality of the food?

Multiple regression gives us a way to reason about these questions. Fit the model with Food and Service and interpret the coefficients and fit. Did the coefficient on Food change from the previous model? What do the coefficients on Food and Service tell you about how these restaurants set prices?

Next, let’s visually assess our model using plotly. The x and y vectors, as well as the plane matrix, have been created for you.

# fit model
(mod <- lm(Price ~ Food + Service, data = nyc))
## 
## Call:
## lm(formula = Price ~ Food + Service, data = nyc)
## 
## Coefficients:
## (Intercept)         Food      Service  
##     -21.159        1.495        1.704
x <- seq(from = 16, to = 25, by = 9/49)
y <- seq(from = 14, to = 24, by = 10/49)
plane <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})

# draw 3D scatterplot
p <- plot_ly(data = nyc, z = ~Price, x = ~Food, y = ~Service, opacity = 0.6) %>%
  add_markers() 

# draw a plane
p %>%
  add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE) 

Fantastic! Is it surprising how service affects the price of a meal?

Higher dimensions

Parallel planes with location

We have explored models that included the quality of both food and service, as well as location, but we haven’t put these variables all into the same model. Let’s now build a parallel planes model that incorporates all three variables.

Examine the coefficients closely. Do they make sense based on what you understand about these data so far? How did the coefficients change from the previous models that you fit?

# Price by Food and Service and East
(mod <- lm(Price ~ Food + Service + East, data = nyc))
## 
## Call:
## lm(formula = Price ~ Food + Service + East, data = nyc)
## 
## Coefficients:
## (Intercept)         Food      Service         East  
##    -20.8155       1.4863       1.6647       0.9649

Great job! Does it seem like location has a big impact on price?

The following statements are TRUE about the magnitude of the East coefficient.

Impact of location

The impact of location brings us to a modeling question: should we keep this variable in our model? In a later course, you will learn how we can conduct formal hypothesis tests to help us answer that question. In this course, we will focus on the size of the effect. Is the impact of location big or small?

One way to think about this would be in terms of the practical significance. Is the value of the coefficient large enough to make a difference to your average person? The units are in dollars so in this case this question is not hard to grasp.

Another way is to examine the impact of location in the context of the variability of the other variables. We can do this by building our parallel planes in 3D and seeing how far apart they are. Are the planes close together or far apart? Does the East variable clearly separate the data into two distinct groups? Or are the points all mixed up together?

plane0 <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})
plane1 <- outer(x, y, function(a, b){mod$coefficients[[1]] + mod$coefficients[[4]] + mod$coefficients[[2]]*a + mod$coefficients[[3]]*b})


# draw 3D scatterplot
p <- plot_ly(data = nyc, z = ~Price, x = ~Food, y = ~Service, opacity = 0.6) %>%
  add_markers(color = ~factor(East)) 

# draw two planes
p %>%
  add_surface(x = ~x, y = ~y, z = ~plane0, showscale = FALSE) %>%
  add_surface(x = ~x, y = ~y, z = ~plane1, showscale = FALSE)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Awesome work! How does this visualization relate to the model coefficients you found in the last exercise?

Full model

One variable we haven’t considered is Decor. Do people, on average, pay more for a meal in a restaurant with nicer decor? If so, does it still matter after controlling for the quality of food, service, and location?

By adding a third numeric explanatory variable to our model, we lose the ability to visualize the model in even three dimensions. Our model is now a hyperplane – or rather, parallel hyperplanes – and while we won’t go any further with the geometry, know that we can continue to add as many variables to our model as we want. As humans, our spatial visualization ability taps out after three numeric variables (maybe you could argue for four, but certainly no further), but neither the mathematical equation for the regression model, nor the formula specification for the model in R, is bothered by the higher dimensionality.

Use lm() to fit a parallel planes model for Price as a function of Food, Service, Decor, and East.

Notice the dramatic change in the value of the Service coefficient.

lm(Price ~ Food + Service + East, data = nyc)
## 
## Call:
## lm(formula = Price ~ Food + Service + East, data = nyc)
## 
## Coefficients:
## (Intercept)         Food      Service         East  
##    -20.8155       1.4863       1.6647       0.9649
(mod <- lm(Price ~ Food + Service + Decor + East, data = nyc))
## 
## Call:
## lm(formula = Price ~ Food + Service + Decor + East, data = nyc)
## 
## Coefficients:
## (Intercept)         Food      Service        Decor         East  
##  -24.023800     1.538120    -0.002727     1.910087     2.068050

The following interpretations are valid:

Wrap-up