• 0. Course Description
  • 1. Simple Linear Regression
  • 2. Predictions and model objects
  • 3. Assessing model fit
  • 4. Simple logistic regression
  • 5. Parallel Slopes
  • 6. Interactions
    • Predicting with interactions
  • 7. Multiple Linear Regression
  • 8. Multiple Logistic Regression
    • 8.1 Visualizing multiple explanatory variables
    • 8.2 Logistic regression with 2 explanatory variables
    • 8.3 Logistic regression prediction
    • 8.4 Confusion matrix
    • 8.5 Cumulative distribution function
    • 8.6 Inverse cumulative distribution function
    • 8.7 Logistic regression algorithm

Libraries

library(tidyverse)
library(anytime)
library(lubridate)
library(assertive)
library(fst)
library(broom)
library(ggfortify)
library(yardstick)
library(moderndive)
library(tidyr)
library(plot3D)
library(magrittr)
library(infer)
library(ggridges)
library(zeallot)

Data

taiwan_real_estate <- read.fst("taiwan_real_estate.fst")
ad_conversion <- read.fst("ad_conversion.fst")
churn <- read.fst("churn.fst")
auctions <- read_csv("auction.csv")

0. Course Description

Linear regression and logistic regression are the two most widely used statistical models and act like master keys, unlocking the secrets hidden in datasets. In this course, you’ll gain the skills you need to fit simple linear and logistic regressions. Through hands-on exercises, you’ll explore the relationships between variables in real-world datasets, including motor insurance claims, Taiwan house prices, fish sizes, and more. By the end of this course, you’ll know how to make predictions from your data, quantify model performance, and diagnose problems with model fit.

Linear regression and logistic regression are the two most widely used statistical models and act like master keys, unlocking the secrets hidden in datasets. This course builds on the skills you gained in “Introduction to Regression in R”, covering linear and logistic regression with multiple explanatory variables.

1. Simple Linear Regression

You’ll learn the basics of this popular statistical model, what regression is, and how linear and logistic regressions differ. You’ll then learn how to fit simple linear regression models with numeric and categorical explanatory variables, and how to describe the relationship between the response and explanatory variables using model coefficients.

1.1 Visualizing two variables and Linear regression with lm()

Before you can run any statistical models, it’s usually a good idea to visualize your dataset. Here, we’ll look at the relationship between house price per area and the number of nearby convenience stores, using the Taiwan real estate dataset.

One challenge in this dataset is that the number of convenience stores contains integer data, causing points to overlap. To solve this, you will make the points transparent.

# Draw a scatter plot of n_convenience vs. price_twd_msq
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) + 
  geom_point()

# Make points 50% transparent
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
  geom_point(alpha = 0.5)

# Add a linear trend line without a confidence ribbon
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
  geom_point(alpha = 0.5) + geom_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

 # Run a linear regression of price_twd_msq vs. n_convenience
mdl_price_vs_conv <- lm(price_twd_msq ~ n_convenience, taiwan_real_estate)
mdl_price_vs_conv
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
## 
## Coefficients:
##   (Intercept)  n_convenience  
##        8.2242         0.7981

1.2 Visualizing numeric vs. categorical

If the explanatory variable is categorical, the scatter plot that you used before to visualize the data doesn’t make sense. Instead, a good option is to draw a histogram for each category.

The Taiwan real estate dataset has a categorical variable in the form of the age of each house. The ages have been split into 3 groups: 0 to 15 years, 15 to 30 years, and 30 to 45 years.

# Using taiwan_real_estate, plot price_twd_msq
ggplot(taiwan_real_estate, aes(price_twd_msq)) +
  # Make it a histogram with 10 bins
  geom_histogram(bins=10) +
  # Facet the plot so each house age group gets its own panel
  facet_wrap(vars(house_age_years))

1.3 Calculating means by category

A good way to explore categorical variables is to calculate summary statistics such as the mean for each category. Here, you’ll look at grouped means for the house prices in the Taiwan real estate dataset.

summary_stats <- taiwan_real_estate %>% 
  # Group by house age
  group_by(house_age_years) %>% 
  # Summarize to calculate the mean house price/area
  summarize(mean_by_group = mean(price_twd_msq))

# See the result
summary_stats
## # A tibble: 3 × 2
##   house_age_years mean_by_group
##   <ord>                   <dbl>
## 1 0 to 15                 12.6 
## 2 15 to 30                 9.88
## 3 30 to 45                11.4

1.4 lm() with a categorical explanatory variable

Linear regressions also work with categorical explanatory variables. The code to run the model is the same, but the coefficients returned by the model are different. Because house_age_years is an ordered factor, R reports linear (.L) and quadratic (.Q) polynomial contrasts by default; refitting the model without an intercept (adding + 0 to the formula) gives one coefficient per category, which is easier to read.

# Run a linear regression of price_twd_msq vs. house_age_years
mdl_price_vs_age <- lm(price_twd_msq ~ house_age_years, taiwan_real_estate)
# See the result
mdl_price_vs_age
## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years, data = taiwan_real_estate)
## 
## Coefficients:
##       (Intercept)  house_age_years.L  house_age_years.Q  
##           11.3025            -0.8798             1.7462
# Update the model formula to remove the intercept
mdl_price_vs_age_no_intercept <- lm(
  price_twd_msq ~ house_age_years + 0, 
  data = taiwan_real_estate)
# See the result
mdl_price_vs_age_no_intercept
## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years + 0, data = taiwan_real_estate)
## 
## Coefficients:
##  house_age_years0 to 15  house_age_years15 to 30  house_age_years30 to 45  
##                  12.637                    9.877                   11.393

2. Predictions and model objects

In this chapter, you’ll discover how to use linear regression models to make predictions on Taiwanese house prices and Facebook advert clicks. You’ll also grow your regression skills as you get hands-on with model objects, understand the concept of “regression to the mean”, and learn how to transform variables in a dataset.

2.1 Predicting house prices

Perhaps the most useful feature of statistical models like linear regression is that you can make predictions. That is, you specify values for each of the explanatory variables, feed them to the model, and you get a prediction for the corresponding response variable.

# Create a tibble with n_convenience column from zero to ten
explanatory_data <- tibble(n_convenience = 0:10)

# Use mdl_price_vs_conv to predict with explanatory_data
predict(mdl_price_vs_conv, explanatory_data)
##         1         2         3         4         5         6         7         8 
##  8.224237  9.022317  9.820397 10.618477 11.416556 12.214636 13.012716 13.810795 
##         9        10        11 
## 14.608875 15.406955 16.205035
# Edit this, so predictions are stored in prediction_data
prediction_data <- explanatory_data %>% 
  mutate(price_twd_msq = predict(mdl_price_vs_conv, explanatory_data))

# See the result
prediction_data
## # A tibble: 11 × 2
##    n_convenience price_twd_msq
##            <int>         <dbl>
##  1             0          8.22
##  2             1          9.02
##  3             2          9.82
##  4             3         10.6 
##  5             4         11.4 
##  6             5         12.2 
##  7             6         13.0 
##  8             7         13.8 
##  9             8         14.6 
## 10             9         15.4 
## 11            10         16.2

The prediction data you calculated contains a column of explanatory variable values and a column of response variable values. That means you can plot it on the same scatter plot of response versus explanatory data values.

# Add to the plot
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE) +
  # Add a point layer of prediction data, colored yellow
  geom_point(data = prediction_data, color = "yellow")
## `geom_smooth()` using formula 'y ~ x'

2.2 Extracting model elements

The variable returned by lm() that contains the model object has many elements. In order to perform further analysis on the model results, you need to extract the useful bits of it. The model coefficients, the fitted values, and the residuals are perhaps the most important bits of the linear model object.

# Get the model coefficients of mdl_price_vs_conv
coefficients(mdl_price_vs_conv)
##   (Intercept) n_convenience 
##     8.2242375     0.7980797
# Get the fitted values of mdl_price_vs_conv
head(fitted(mdl_price_vs_conv))
##        1        2        3        4        5        6 
## 16.20503 15.40695 12.21464 12.21464 12.21464 10.61848
# Get the residuals of mdl_price_vs_conv
head(residuals(mdl_price_vs_conv))
##          1          2          3          4          5          6 
## -4.7375611 -2.6384224  2.0970130  4.3663019  0.8262112 -0.9059199
# Print a summary of mdl_price_vs_conv
summary(mdl_price_vs_conv)
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7132  -2.2213  -0.5409   1.8105  26.5299 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.22424    0.28500   28.86   <2e-16 ***
## n_convenience  0.79808    0.05653   14.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.384 on 412 degrees of freedom
## Multiple R-squared:  0.326,  Adjusted R-squared:  0.3244 
## F-statistic: 199.3 on 1 and 412 DF,  p-value: < 2.2e-16

2.3 Manually predicting house prices

You can manually calculate the predictions from the model coefficients. When making predictions in real life, it is better to use predict(), but doing this manually is helpful to reassure yourself that predictions aren’t magic—they are simply arithmetic.

# Get the coefficients of mdl_price_vs_conv
coeffs <- coefficients(mdl_price_vs_conv)

# Get the intercept
intercept <- coeffs[1]

# Get the slope
slope <- coeffs[2]
explanatory_data %>% 
  mutate(price_twd_msq = intercept + slope * n_convenience)
## # A tibble: 11 × 2
##    n_convenience price_twd_msq
##            <int>         <dbl>
##  1             0          8.22
##  2             1          9.02
##  3             2          9.82
##  4             3         10.6 
##  5             4         11.4 
##  6             5         12.2 
##  7             6         13.0 
##  8             7         13.8 
##  9             8         14.6 
## 10             9         15.4 
## 11            10         16.2
# Compare to the results from predict()
predict(mdl_price_vs_conv, explanatory_data)
##         1         2         3         4         5         6         7         8 
##  8.224237  9.022317  9.820397 10.618477 11.416556 12.214636 13.012716 13.810795 
##         9        10        11 
## 14.608875 15.406955 16.205035

2.4 Using broom

Many programming tasks are easier if you keep all your data inside data frames. This is particularly true if you are a tidyverse fan, where dplyr and ggplot2 require you to use data frames. The broom package contains functions that decompose a model into three data frames: one for the coefficient-level elements (the coefficients themselves, plus a p-value for each), one for the observation-level elements (like fitted values and residuals), and one for the model-level elements (mostly performance metrics).

# Get the coefficient-level elements of the model
tidy(mdl_price_vs_conv)
## # A tibble: 2 × 5
##   term          estimate std.error statistic   p.value
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      8.22     0.285       28.9 5.81e-101
## 2 n_convenience    0.798    0.0565      14.1 3.41e- 37
# Get the observation-level elements of the model
augment(mdl_price_vs_conv)
## # A tibble: 414 × 8
##    price_twd_msq n_convenience .fitted .resid    .hat .sigma  .cooksd .std.resid
##            <dbl>         <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>      <dbl>
##  1         11.5             10   16.2  -4.74  0.0121    3.38  1.22e-2     -1.41 
##  2         12.8              9   15.4  -2.64  0.00913   3.39  2.83e-3     -0.783
##  3         14.3              5   12.2   2.10  0.00264   3.39  5.10e-4      0.621
##  4         16.6              5   12.2   4.37  0.00264   3.38  2.21e-3      1.29 
##  5         13.0              5   12.2   0.826 0.00264   3.39  7.92e-5      0.244
##  6          9.71             3   10.6  -0.906 0.00275   3.39  9.91e-5     -0.268
##  7         12.2              7   13.8  -1.62  0.00477   3.39  5.50e-4     -0.479
##  8         14.1              6   13.0   1.12  0.00343   3.39  1.88e-4      0.331
##  9          5.69             1    9.02 -3.33  0.00509   3.38  2.49e-3     -0.988
## 10          6.69             3   10.6  -3.93  0.00275   3.38  1.87e-3     -1.16 
## # … with 404 more rows
# Get the model-level elements of the model
glance(mdl_price_vs_conv)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.326         0.324  3.38      199. 3.41e-37     1 -1091. 2188. 2200.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

2.5 Transforming the explanatory variable

If there is no straight line relationship between the response variable and the explanatory variable, it is sometimes possible to create one by transforming one or both of the variables. Here, you’ll look at transforming the explanatory variable.

You’ll take another look at the Taiwan real estate dataset, this time using the distance to the nearest MRT (metro) station as the explanatory variable. You’ll use code to make every commuter’s dream come true: shortening the distance to the metro station by taking the square root. Take that, geography!

# Run a linear regression of price_twd_msq vs. 
# square root of dist_to_mrt_m using taiwan_real_estate
mdl_price_vs_dist <- lm(price_twd_msq ~ sqrt(dist_to_mrt_m), taiwan_real_estate)
# See the result
mdl_price_vs_dist
## 
## Call:
## lm(formula = price_twd_msq ~ sqrt(dist_to_mrt_m), data = taiwan_real_estate)
## 
## Coefficients:
##         (Intercept)  sqrt(dist_to_mrt_m)  
##             16.7098              -0.1828
# Use this explanatory data
explanatory_data <- tibble(
  dist_to_mrt_m = seq(0, 80, 10) ^ 2)
# Use mdl_price_vs_dist to predict explanatory_data
prediction_data <-explanatory_data %>%  
  mutate(price_twd_msq=predict(mdl_price_vs_dist, explanatory_data))
# See the result
prediction_data
## # A tibble: 9 × 2
##   dist_to_mrt_m price_twd_msq
##           <dbl>         <dbl>
## 1             0         16.7 
## 2           100         14.9 
## 3           400         13.1 
## 4           900         11.2 
## 5          1600          9.40
## 6          2500          7.57
## 7          3600          5.74
## 8          4900          3.91
## 9          6400          2.08
ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), price_twd_msq)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Add points from prediction_data, colored green, size 5
  geom_point(data = prediction_data, color = "green", size = 5)
## `geom_smooth()` using formula 'y ~ x'

2.6 Transforming the response variable too

The response variable can be transformed too, but this means you need an extra step at the end to undo that transformation. That is, you “back transform” the predictions.

# Run the code to see the plot
# Edit to raise x, y aesthetics to power 0.25
ggplot(ad_conversion, aes(n_impressions^0.25, n_clicks^0.25)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

# Run a linear regression of n_clicks to the power 0.25 vs. 
# n_impressions to the power 0.25 using ad_conversion
mdl_click_vs_impression <- lm(
  I(n_clicks ^ 0.25) ~ I(n_impressions ^ 0.25),
  data = ad_conversion)

# Use this explanatory data
explanatory_data <- tibble(
  n_impressions = seq(0, 3e6, 5e5))

prediction_data <- explanatory_data %>% 
  mutate(
    # Use mdl_click_vs_impression to predict n_clicks ^ 0.25
    n_clicks_025 = predict(mdl_click_vs_impression, explanatory_data),
    # Back transform to get n_clicks
    n_clicks =n_clicks_025^4)

ggplot(ad_conversion, aes(n_impressions ^ 0.25, n_clicks ^ 0.25)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Add points from prediction_data, colored green
  geom_point(data = prediction_data, color = "green")
## `geom_smooth()` using formula 'y ~ x'

3. Assessing model fit

In this chapter, you’ll learn how to ask questions of your model to assess fit. You’ll learn how to quantify how well a linear regression model fits, diagnose model problems using visualizations, and understand the leverage and influence of each observation used to create the model.

3.1 Coefficient of determination

The coefficient of determination is a measure of how well the linear regression line fits the observed values. For simple linear regression, it is equal to the square of the correlation between the explanatory and response variables.

mdl_click_vs_impression_orig <- lm(n_clicks ~ n_impressions, ad_conversion)
mdl_click_vs_impression_trans <- lm(I(n_clicks^0.25) ~ I(n_impressions^0.25), ad_conversion)

# Print a summary of mdl_click_vs_impression_orig
summary(mdl_click_vs_impression_orig)
## 
## Call:
## lm(formula = n_clicks ~ n_impressions, data = ad_conversion)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -186.099   -5.392   -1.422    2.070  119.876 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.683e+00  7.888e-01   2.133   0.0331 *  
## n_impressions 1.718e-04  1.960e-06  87.654   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.91 on 934 degrees of freedom
## Multiple R-squared:  0.8916, Adjusted R-squared:  0.8915 
## F-statistic:  7683 on 1 and 934 DF,  p-value: < 2.2e-16
# Print a summary of mdl_click_vs_impression_trans
summary(mdl_click_vs_impression_trans)
## 
## Call:
## lm(formula = I(n_clicks^0.25) ~ I(n_impressions^0.25), data = ad_conversion)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57061 -0.13229  0.00582  0.14494  0.46888 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.0717479  0.0172019   4.171 3.32e-05 ***
## I(n_impressions^0.25) 0.1115330  0.0008844 126.108  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1969 on 934 degrees of freedom
## Multiple R-squared:  0.9445, Adjusted R-squared:  0.9445 
## F-statistic: 1.59e+04 on 1 and 934 DF,  p-value: < 2.2e-16
# Get coeff of determination for mdl_click_vs_impression_orig
mdl_click_vs_impression_orig %>% 
  # Get the model-level details
  glance() %>% 
  # Pull out r.squared
  pull(r.squared)
## [1] 0.8916135
# Do the same for the transformed model
mdl_click_vs_impression_trans %>% glance() %>% pull(r.squared)
## [1] 0.9445273
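
As a quick check of the claim above, for simple linear regression the coefficient of determination equals the squared correlation between the response and explanatory variables. Here is a minimal sketch using the Taiwan model from earlier (not one of the course exercises):

# r.squared from the model...
mdl_price_vs_conv %>% 
  glance() %>% 
  pull(r.squared)

# ...equals the squared correlation of the two variables (about 0.326)
taiwan_real_estate %>% 
  summarize(cor_squared = cor(price_twd_msq, n_convenience) ^ 2)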

3.2 Residual standard error

Residual standard error (RSE) is a measure of the typical size of the residuals. Equivalently, it’s a measure of how badly wrong you can expect predictions to be. Smaller numbers are better, with zero being a perfect fit to the data.

# Get RSE for mdl_click_vs_impression_orig
mdl_click_vs_impression_orig %>% 
  # Get the model-level details
  glance() %>% 
  # Pull out sigma
  pull(sigma)
## [1] 19.90584
# Do the same for the transformed model
mdl_click_vs_impression_trans %>% glance() %>% pull(sigma)
## [1] 0.1969064
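
To see where sigma comes from, you can compute the RSE by hand. A minimal sketch for the original model (not one of the course exercises):

# RSE: square root of the sum of squared residuals divided by
# the residual degrees of freedom
resids <- residuals(mdl_click_vs_impression_orig)
sqrt(sum(resids ^ 2) / df.residual(mdl_click_vs_impression_orig))
# matches sigma above, about 19.9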

3.3 Visualizing model fit

It’s time for you to draw these diagnostic plots yourself. Let’s go back to the Taiwan real estate dataset and the model of house prices versus number of convenience stores.

# Plot the three diagnostics for mdl_price_vs_conv:
# residuals vs. fitted, normal Q-Q, and scale-location
autoplot(mdl_price_vs_conv, which = 1:3, nrow = 3, ncol = 1)

3.4 Outliers, leverage, and influence

Leverage measures how unusual or extreme the explanatory variables are for each observation. Very roughly, high leverage means that the explanatory variable has values that are different from those of the other points in the dataset. In simple linear regression, where there is only one explanatory variable, this typically means observations with a very high or very low value of that variable.

Influence measures how much the model would change if each observation were left out of the model calculations, one at a time. That is, it measures how different the prediction line would look if you ran a linear regression on all data points except that one, compared to running it on the whole dataset. The standard metric for influence is Cook's distance, which combines the size of the residual with the leverage of the point.

mdl_price_vs_dist %>% 
  # Augment the model
  augment() %>% 
  # Arrange rows by descending leverage
  arrange(desc(.hat)) %>% 
  # Get the head of the dataset
  head()
## # A tibble: 6 × 7
##   price_twd_msq `sqrt(dist_to_mrt_m)` .fitted   .hat .sigma   .cooksd .std.resid
##           <dbl>                 <dbl>   <dbl>  <dbl>  <dbl>     <dbl>      <dbl>
## 1          3.39                  80.5    1.98 0.0267   2.82 0.00351       0.506 
## 2          3.69                  80.0    2.09 0.0261   2.82 0.00447       0.577 
## 3          4.54                  79.4    2.19 0.0256   2.82 0.00937       0.844 
## 4          5.69                  74.2    3.13 0.0211   2.82 0.00906       0.916 
## 5          5.26                  74.2    3.13 0.0211   2.82 0.00630       0.764 
## 6          4.05                  67.9    4.30 0.0163   2.82 0.0000644    -0.0882
mdl_price_vs_dist %>% 
  # Augment the model
  augment() %>% 
  # Arrange rows by descending Cook's distance
  arrange(desc(.cooksd)) %>% 
  # Get the head of the dataset
  head()
## # A tibble: 6 × 7
##   price_twd_msq `sqrt(dist_to_mrt_m)` .fitted    .hat .sigma .cooksd .std.resid
##           <dbl>                 <dbl>   <dbl>   <dbl>  <dbl>   <dbl>      <dbl>
## 1         35.6                   15.9   13.8  0.00385   2.61  0.116        7.73
## 2         13.6                   61.5    5.47 0.0121    2.79  0.0524       2.92
## 3         14.1                   56.3    6.41 0.00933   2.80  0.0354       2.74
## 4         23.7                   13.7   14.2  0.00440   2.78  0.0251       3.37
## 5          2.30                  19.8   13.1  0.00310   2.77  0.0228      -3.83
## 6         23.6                   17.8   13.4  0.00344   2.78  0.0225       3.61
# Plot the three outlier diagnostics for mdl_price_vs_dist
autoplot(mdl_price_vs_dist, which = 4:6, nrow = 3, ncol = 1)
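
The leverage and Cook's distance values that augment() returns as .hat and .cooksd are also available directly from base R. A minimal sketch (not one of the course exercises):

# Base R equivalents of the augment() columns
leverage <- hatvalues(mdl_price_vs_dist)         # same values as .hat
cooks_dist <- cooks.distance(mdl_price_vs_dist)  # same values as .cooksd

# The observations with the highest leverage and influence
head(sort(leverage, decreasing = TRUE))
head(sort(cooks_dist, decreasing = TRUE))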

4. Simple logistic regression

Logistic regression with glm(): linear regression and logistic regression are special cases of a broader class of models called generalized linear models (“GLMs”). A linear regression assumes that the residuals follow a Gaussian (normal) distribution. By contrast, a logistic regression assumes that the response follows a binomial distribution.

4.1 Exploring the explanatory variables

When the response variable is logical, all the points lie on the y equals zero and y equals one lines, making it difficult to see what is happening. Until you add a trend line, it isn't clear how the explanatory variable is distributed on each line. This can be solved with a histogram of the explanatory variable, faceted on the response.

# Using churn, plot time_since_last_purchase
ggplot(churn, aes(time_since_last_purchase)) +
  # as a histogram with binwidth 0.25
  geom_histogram(binwidth = 0.25) +
  # faceted in a grid with has_churned on each row
  facet_grid(rows = vars(has_churned))

# Redraw the plot with time_since_first_purchase
ggplot(churn, aes(time_since_first_purchase)) +
  # as a histogram with binwidth 0.25
  geom_histogram(binwidth = 0.25) +
  # faceted in a grid with has_churned on each row
  facet_grid(rows = vars(has_churned))

4.2 Visualizing linear and logistic models

As with linear regressions, ggplot2 will draw model predictions for a logistic regression without you having to worry about the modeling code yourself. To see how the predictions differ for linear and logistic regressions, try drawing both trend lines side by side. Spoiler: you should see a linear (straight line) trend from the linear model, and a logistic (S-shaped) trend from the logistic model.

# Using churn, plot has_churned vs. time_since_first_purchase
plt_churn_vs_relationship <- ggplot(churn, aes(time_since_first_purchase, has_churned)) +
  # Make it a scatter plot
  geom_point() +
  # Add an lm trend line, no std error ribbon, colored red
  geom_smooth(method = "lm", se = FALSE, color = "red")
plt_churn_vs_relationship
## `geom_smooth()` using formula 'y ~ x'

ggplot(churn, aes(time_since_first_purchase, has_churned)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  # Add a glm trend line, no std error ribbon, binomial family
  geom_smooth(method = "glm", se = FALSE, method.args= list(family = binomial))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

4.3 Logistic regression with glm()

Linear regression and logistic regression are special cases of a broader class of models called generalized linear models (“GLMs”). A linear regression assumes that the residuals follow a Gaussian (normal) distribution. By contrast, a logistic regression assumes that the response follows a binomial distribution.

# Fit a logistic regression of churn vs. 
# length of relationship using the churn dataset
mdl_churn_vs_relationship <- glm(has_churned ~ time_since_first_purchase, data = churn, family = binomial)

# See the result
mdl_churn_vs_relationship
## 
## Call:  glm(formula = has_churned ~ time_since_first_purchase, family = binomial, 
##     data = churn)
## 
## Coefficients:
##               (Intercept)  time_since_first_purchase  
##                  -0.01518                   -0.35479  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  398 Residual
## Null Deviance:       554.5 
## Residual Deviance: 543.7     AIC: 547.7
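
To see how these coefficients translate into probabilities, you can push the linear predictor through the inverse logit (logistic) function. A minimal sketch using base R's plogis() (not one of the course exercises):

# Probability of churn for a customer whose first purchase was one year ago,
# calculated from the coefficients via the inverse logit function
coeffs <- coefficients(mdl_churn_vs_relationship)
plogis(coeffs[1] + coeffs[2] * 1)
# roughly 0.41, matching the predictions in the next exercise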

4.4 Predictions and odds ratios

There are four main ways of expressing the prediction from a logistic regression model – we’ll look at each of them over the next four exercises. Firstly, since the response variable is either “yes” or “no”, you can make a prediction of the probability of a “yes”. Here, you’ll calculate and visualize these probabilities.

# Make a data frame of predicted probabilities
explanatory_data <- tibble(time_since_first_purchase = seq(-1, 6, 0.25))
prediction_data <- explanatory_data %>% 
  mutate(   
    has_churned = predict(
      mdl_churn_vs_relationship, 
      explanatory_data, 
      type = "response"
    )
  )
# See the result
prediction_data
## # A tibble: 29 × 2
##    time_since_first_purchase has_churned
##                        <dbl>       <dbl>
##  1                     -1          0.584
##  2                     -0.75       0.562
##  3                     -0.5        0.540
##  4                     -0.25       0.518
##  5                      0          0.496
##  6                      0.25       0.474
##  7                      0.5        0.452
##  8                      0.75       0.430
##  9                      1          0.409
## 10                      1.25       0.387
## # … with 19 more rows
# Update the plot
plt_churn_vs_relationship +
  # Add points from prediction_data, colored yellow, size 2
  geom_point(data = prediction_data, color = "yellow", size = 2)
## `geom_smooth()` using formula 'y ~ x'

4.5 Most likely outcome

When explaining your results to a non-technical audience, you may wish to side-step talking about probabilities and simply explain the most likely outcome. That is, rather than saying there is a 60% chance of a customer churning, you say that the most likely outcome is that the customer will churn. The tradeoff here is easier interpretation at the cost of nuance.

# Update the data frame
prediction_data <- explanatory_data %>% 
  mutate(   
    has_churned = predict(mdl_churn_vs_relationship, explanatory_data, type = "response"),
    # Add the most likely churn outcome
    most_likely_outcome = round(has_churned)
  )
# See the result
prediction_data
## # A tibble: 29 × 3
##    time_since_first_purchase has_churned most_likely_outcome
##                        <dbl>       <dbl>               <dbl>
##  1                     -1          0.584                   1
##  2                     -0.75       0.562                   1
##  3                     -0.5        0.540                   1
##  4                     -0.25       0.518                   1
##  5                      0          0.496                   0
##  6                      0.25       0.474                   0
##  7                      0.5        0.452                   0
##  8                      0.75       0.430                   0
##  9                      1          0.409                   0
## 10                      1.25       0.387                   0
## # … with 19 more rows
# Update the plot
plt_churn_vs_relationship +
  # Add most likely outcome points from prediction_data, 
  # colored yellow, size 2
  geom_point(
    aes(y = most_likely_outcome), 
    data = prediction_data, 
    color = "yellow",
    size = 2
  )
## `geom_smooth()` using formula 'y ~ x'

4.6 Odds ratio

Odds ratios compare the probability of something happening with the probability of it not happening. This is sometimes easier to reason about than probabilities, particularly when you want to make decisions about choices. For example, if a customer has a 20% chance of churning, it may be more intuitive to say “the chance of them not churning is four times higher than the chance of them churning”.
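
Here is that example worked through as a quick illustration (not one of the course exercises):

# A 20% chance of churning corresponds to churn odds of 0.2 / 0.8 = 0.25,
# so not churning is four times as likely as churning
p_churn <- 0.2
odds_churn <- p_churn / (1 - p_churn)
odds_churn      # 0.25
1 / odds_churn  # 4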

# Update the data frame
prediction_data <- explanatory_data %>% 
  mutate(   
    has_churned = predict(
      mdl_churn_vs_relationship, explanatory_data, 
      type = "response"
    ),
    # Add the odds ratio
    odds_ratio = has_churned / (1 - has_churned)
  )
# See the result
prediction_data
## # A tibble: 29 × 3
##    time_since_first_purchase has_churned odds_ratio
##                        <dbl>       <dbl>      <dbl>
##  1                     -1          0.584      1.40 
##  2                     -0.75       0.562      1.29 
##  3                     -0.5        0.540      1.18 
##  4                     -0.25       0.518      1.08 
##  5                      0          0.496      0.985
##  6                      0.25       0.474      0.901
##  7                      0.5        0.452      0.825
##  8                      0.75       0.430      0.755
##  9                      1          0.409      0.691
## 10                      1.25       0.387      0.632
## # … with 19 more rows
# Using prediction_data, plot odds_ratio vs. time_since_first_purchase
ggplot(prediction_data, aes(time_since_first_purchase, odds_ratio)) +
  # Make it a line plot
  geom_line() +
  # Add a dotted horizontal line at y = 1
  geom_hline(yintercept = 1, linetype = "dotted")

4.7 Log odds ratio

One downside to probabilities and odds ratios for logistic regression predictions is that the prediction lines for each are curved. This makes it harder to reason about what happens to the prediction when you make a change to the explanatory variable. The logarithm of the odds ratio (the “log odds ratio”) does have a linear relationship between predicted response and explanatory variable. That means that as the explanatory variable changes, you don’t see dramatic changes in the response metric - only linear changes.

# Update the data frame
prediction_data <- explanatory_data %>% 
  mutate(   
    has_churned = predict(mdl_churn_vs_relationship, explanatory_data, type = "response"),
    odds_ratio = has_churned / (1 - has_churned),
    # Add the log odds ratio from odds_ratio
    log_odds_ratio = log(odds_ratio),
    # Add the log odds ratio using predict()
    log_odds_ratio2 = predict(mdl_churn_vs_relationship, explanatory_data)
  )
# See the result
prediction_data
## # A tibble: 29 × 5
##    time_since_first_purch… has_churned odds_ratio log_odds_ratio log_odds_ratio2
##                      <dbl>       <dbl>      <dbl>          <dbl>           <dbl>
##  1                   -1          0.584      1.40          0.340           0.340 
##  2                   -0.75       0.562      1.29          0.251           0.251 
##  3                   -0.5        0.540      1.18          0.162           0.162 
##  4                   -0.25       0.518      1.08          0.0735          0.0735
##  5                    0          0.496      0.985        -0.0152         -0.0152
##  6                    0.25       0.474      0.901        -0.104          -0.104 
##  7                    0.5        0.452      0.825        -0.193          -0.193 
##  8                    0.75       0.430      0.755        -0.281          -0.281 
##  9                    1          0.409      0.691        -0.370          -0.370 
## 10                    1.25       0.387      0.632        -0.459          -0.459 
## # … with 19 more rows
# Update the plot
ggplot(prediction_data, aes(time_since_first_purchase, odds_ratio)) +
  geom_line() +
  geom_hline(yintercept = 1, linetype = "dotted") +
  # Use a logarithmic y-scale
scale_y_log10()
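
You can verify the linearity claim directly from the model coefficients. A minimal sketch (not one of the course exercises):

# The log odds ratio is just the linear predictor:
# intercept + slope * time_since_first_purchase
coeffs <- coefficients(mdl_churn_vs_relationship)
explanatory_data %>% 
  mutate(log_odds_ratio_manual = coeffs[1] + coeffs[2] * time_since_first_purchase) %>% 
  head()
# matches the log_odds_ratio columns calculated above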

4.8 Quantifying logistic regression

A confusion matrix (occasionally called a confusion table) is the basis of all performance metrics for models with a categorical response (such as a logistic regression). It contains the counts of each actual response/predicted response pair. In this case, where there are two possible responses (churn or not churn), there are four overall outcomes:

1. The customer churned and the model predicted that.
2. The customer churned but the model didn't predict that.
3. The customer didn't churn but the model predicted they did.
4. The customer didn't churn and the model predicted that.

# Get the actual responses from the dataset
actual_response <- churn$has_churned

# Get the "most likely" predicted responses from the model
predicted_response <- round(fitted(mdl_churn_vs_relationship))

# Create a table of counts
outcomes <- table(predicted_response, actual_response)

# See the result
outcomes
##                   actual_response
## predicted_response   0   1
##                  0 112  76
##                  1  88 124
# Convert outcomes to a yardstick confusion matrix
confusion <- conf_mat(outcomes)

# Plot the confusion matrix
autoplot(confusion)

# Get performance metrics for the confusion matrix
summary(confusion, event_level = "second")
## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.59 
##  2 kap                  binary         0.18 
##  3 sens                 binary         0.62 
##  4 spec                 binary         0.56 
##  5 ppv                  binary         0.585
##  6 npv                  binary         0.596
##  7 mcc                  binary         0.180
##  8 j_index              binary         0.180
##  9 bal_accuracy         binary         0.59 
## 10 detection_prevalence binary         0.53 
## 11 precision            binary         0.585
## 12 recall               binary         0.62 
## 13 f_meas               binary         0.602
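
To see where these numbers come from, you can recompute the headline metrics directly from the counts in the confusion matrix. A minimal sketch, treating churn (1) as the event of interest (not one of the course exercises):

# Recreate accuracy, sensitivity, and specificity from the outcome counts
TN <- outcomes["0", "0"]  # 112: didn't churn, predicted no churn
FN <- outcomes["0", "1"]  #  76: churned, predicted no churn
FP <- outcomes["1", "0"]  #  88: didn't churn, predicted churn
TP <- outcomes["1", "1"]  # 124: churned, predicted churn

c(
  accuracy    = (TP + TN) / (TP + TN + FP + FN),  # 0.59
  sensitivity = TP / (TP + FN),                   # 0.62
  specificity = TN / (TN + FP)                    # 0.56
)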

5. Parallel Slopes

Extend your linear regression skills to “parallel slopes” regression, with one numeric and one categorical explanatory variable. This is the first step towards conquering multiple linear regression.

5.1 Fitting a parallel slopes linear regression

In Introduction to Regression in R, you learned to fit linear regression models with a single explanatory variable. In many cases, using only one explanatory variable limits the accuracy of predictions. That means that to truly master linear regression, you need to be able to include multiple explanatory variables.

The case when there is one numeric explanatory variable and one categorical explanatory variable is sometimes called a “parallel slopes” linear regression due to the shape of the predictions—more on that in the next exercise.

# Fit a linear regr'n of price_twd_msq vs. n_convenience
mdl_price_vs_conv <- lm(price_twd_msq ~ n_convenience, data =taiwan_real_estate)
# See the result
mdl_price_vs_conv
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
## 
## Coefficients:
##   (Intercept)  n_convenience  
##        8.2242         0.7981
# Fit a linear regr'n of price_twd_msq vs. house_age_years, no intercept
mdl_price_vs_age <- lm(price_twd_msq ~ house_age_years + 0, data = taiwan_real_estate)
# See the result
mdl_price_vs_age
## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years + 0, data = taiwan_real_estate)
## 
## Coefficients:
##  house_age_years0 to 15  house_age_years15 to 30  house_age_years30 to 45  
##                  12.637                    9.877                   11.393
# Fit a linear regr'n of price_twd_msq vs. n_convenience 
# plus house_age_years, no intercept
mdl_price_vs_both <- lm(price_twd_msq ~ n_convenience + house_age_years + 0, data = taiwan_real_estate)
# See the result
mdl_price_vs_both
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience + house_age_years + 
##     0, data = taiwan_real_estate)
## 
## Coefficients:
##           n_convenience   house_age_years0 to 15  house_age_years15 to 30  
##                  0.7915                   9.4133                   7.0852  
## house_age_years30 to 45  
##                  7.5110

5.2 Visualizing each explanatory variable

Being able to see the predictions made by a model makes it easier to understand. In the case where there is only one explanatory variable, ggplot lets you do this without any manual calculation or messing about.

To visualize the relationship between a numeric explanatory variable and the numeric response, you can draw a scatter plot with a linear trend line.

To visualize the relationship between a categorical explanatory variable and the numeric response, you can draw a box plot.

# Using taiwan_real_estate, plot price_twd_msq vs. n_convenience
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
  # Add a point layer
  geom_point() +
  # Add a smooth trend line using linear regr'n, no ribbon
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

# Using taiwan_real_estate, plot price_twd_msq vs. house_age_years
ggplot(taiwan_real_estate, aes(house_age_years, price_twd_msq)) +
  # Add a box plot layer
  geom_boxplot() + coord_flip()

5.3 Visualizing parallel slopes

The two plots in the previous exercise gave very different predictions: one gave a predicted response that increased linearly with a numeric variable; the other gave a fixed response for each category. The only sensible way to reconcile these two conflicting predictions is to incorporate both explanatory variables in the model at once.

# Using taiwan_real_estate, plot price_twd_msq vs. n_convenience
# colored by house_age_years
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq, color = house_age_years)) +
  # Add a point layer
  geom_point() +
  # Add parallel slopes, no ribbon
  geom_parallel_slopes(se = FALSE)

5.4 Predicting parallel slopes

While ggplot can automatically show you model predictions, in order to get those values to program with, you’ll need to do the calculations yourself.

Just as with the case of a single explanatory variable, the workflow has two steps: create a data frame of explanatory variables, then add a column of predictions. To make sure you’ve got the right answer, you can add your predictions to the ggplot with the geom_parallel_slopes() lines.

# Make a grid of explanatory data
explanatory_data <- expand_grid(
  # Set n_convenience to zero to ten
  n_convenience = 0:10,
  # Set house_age_years to the unique values of that variable
  house_age_years = unique(taiwan_real_estate$house_age_years))

# See the result
explanatory_data
## # A tibble: 33 × 2
##    n_convenience house_age_years
##            <int> <ord>          
##  1             0 30 to 45       
##  2             0 15 to 30       
##  3             0 0 to 15        
##  4             1 30 to 45       
##  5             1 15 to 30       
##  6             1 0 to 15        
##  7             2 30 to 45       
##  8             2 15 to 30       
##  9             2 0 to 15        
## 10             3 30 to 45       
## # … with 23 more rows
# Add predictions to the data frame
prediction_data <- explanatory_data %>% 
  mutate(
    price_twd_msq = predict(
      mdl_price_vs_both, explanatory_data
    )
  )

# See the result
prediction_data
## # A tibble: 33 × 3
##    n_convenience house_age_years price_twd_msq
##            <int> <ord>                   <dbl>
##  1             0 30 to 45                 7.51
##  2             0 15 to 30                 7.09
##  3             0 0 to 15                  9.41
##  4             1 30 to 45                 8.30
##  5             1 15 to 30                 7.88
##  6             1 0 to 15                 10.2 
##  7             2 30 to 45                 9.09
##  8             2 15 to 30                 8.67
##  9             2 0 to 15                 11.0 
## 10             3 30 to 45                 9.89
## # … with 23 more rows
taiwan_real_estate %>% 
  ggplot(aes(n_convenience, price_twd_msq, color = house_age_years)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE) +
  # Add points using prediction_data, with size 5 and shape 15
  geom_point(data= prediction_data, size = 5, shape = 15)

5.5 Manually calculating predictions

As with simple linear regression, you can manually calculate the predictions from the model coefficients. The only change for the parallel slopes case is that the intercept is different for each category of the categorical explanatory variable. That means you need to consider each category's case separately.

# Get the coefficients from mdl_price_vs_both
coeffs <-coefficients(mdl_price_vs_both)
# Extract the slope coefficient
slope <- coeffs[1]
# Extract the intercept coefficient for 0 to 15
intercept_0_15 <- coeffs[2]
# Extract the intercept coefficient for 15 to 30
intercept_15_30 <- coeffs[3]
# Extract the intercept coefficient for 30 to 45
intercept_30_45 <- coeffs[4]


prediction_data <- explanatory_data %>% 
  mutate(
    # Consider the 3 cases to choose the intercept
    intercept = case_when(
      house_age_years == "0 to 15" ~ intercept_0_15,
      house_age_years == "15 to 30" ~ intercept_15_30,
      house_age_years == "30 to 45" ~ intercept_30_45 
    ),
    # Manually calculate the predictions
    price_twd_msq = intercept + slope * n_convenience
  )
# See the results
prediction_data
## # A tibble: 33 × 4
##    n_convenience house_age_years intercept price_twd_msq
##            <int> <ord>               <dbl>         <dbl>
##  1             0 30 to 45             7.51          7.51
##  2             0 15 to 30             7.09          7.09
##  3             0 0 to 15              9.41          9.41
##  4             1 30 to 45             7.51          8.30
##  5             1 15 to 30             7.09          7.88
##  6             1 0 to 15              9.41         10.2 
##  7             2 30 to 45             7.51          9.09
##  8             2 15 to 30             7.09          8.67
##  9             2 0 to 15              9.41         11.0 
## 10             3 30 to 45             7.51          9.89
## # … with 23 more rows

5.6 Assessing model performance

Recall that the coefficient of determination is a measure of how well the linear regression line fits the observed values. An important motivation for including several explanatory variables in a linear regression is that you can improve the fit compared to considering only a single explanatory variable. The other common metric for assessing model fit is the residual standard error (RSE), which measures the typical size of the residuals.

mdl_price_vs_conv %>% 
  # Get the model-level coefficients
 glance() %>% 
  # Select the coeffs of determination
  select(r.squared, adj.r.squared)
## # A tibble: 1 × 2
##   r.squared adj.r.squared
##       <dbl>         <dbl>
## 1     0.326         0.324
# Get the coeffs of determination for mdl_price_vs_age
mdl_price_vs_age %>% 
  glance() %>% 
  select(r.squared, adj.r.squared)
## # A tibble: 1 × 2
##   r.squared adj.r.squared
##       <dbl>         <dbl>
## 1     0.896         0.895
# Get the coeffs of determination for mdl_price_vs_both
mdl_price_vs_both %>% 
 glance() %>% 
  select(r.squared, adj.r.squared)
## # A tibble: 1 × 2
##   r.squared adj.r.squared
##       <dbl>         <dbl>
## 1     0.931         0.931
mdl_price_vs_conv %>% 
  # Get the model-level coefficients
 glance() %>% 
  # Pull out the RSE
  pull(sigma)
## [1] 3.383888
# Get the RSE for mdl_price_vs_age
mdl_price_vs_age %>% 
  glance() %>% pull(sigma)
## [1] 3.950184
# Get the RSE for mdl_price_vs_both
mdl_price_vs_both %>% 
  glance() %>% pull(sigma)
## [1] 3.21346
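
For reference, adjusted R-squared applies a penalty for each extra explanatory variable, which is why it is the fairer metric when comparing models with different numbers of variables. A minimal sketch of the calculation for the single-variable model with an intercept (not one of the course exercises):

# adj_r_sq = 1 - (1 - r_sq) * (n - 1) / (n - k - 1),
# where k is the number of explanatory variables (here k = 1)
mdl_price_vs_conv %>% 
  glance() %>% 
  mutate(adj_r_sq_manual = 1 - (1 - r.squared) * (nobs - 1) / (nobs - 1 - 1)) %>% 
  select(r.squared, adj.r.squared, adj_r_sq_manual)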

6. Interactions

Explore the effect of interactions between explanatory variables. Considering interactions allows for more realistic models that can have better predictive power. You’ll also deal with Simpson’s Paradox: a non-intuitive result that arises when you have multiple explanatory variables.

6.1 One model per category

The model you ran on the whole dataset fits some parts of the data better than others. It’s worth taking a look at what happens when you run a linear model on different parts of the dataset separately, to see if each model agrees or disagrees with the others.

# Filter for rows where house age is 0 to 15 years
taiwan_0_to_15 <- taiwan_real_estate %>% 
filter(house_age_years == "0 to 15")

# Filter for rows where house age is 15 to 30 years
taiwan_15_to_30 <- taiwan_real_estate %>% 
filter(house_age_years == "15 to 30")

# Filter for rows where house age is 30 to 45 years
taiwan_30_to_45 <- taiwan_real_estate %>% 
filter(house_age_years == "30 to 45")

# Model price vs. no. convenience stores using 0 to 15 data
mdl_0_to_15 <- lm(price_twd_msq ~ n_convenience, taiwan_0_to_15)

# Model price vs. no. convenience stores using 15 to 30 data
mdl_15_to_30 <- lm(price_twd_msq ~ n_convenience, taiwan_15_to_30)

# Model price vs. no. convenience stores using 30 to 45 data
mdl_30_to_45 <- lm(price_twd_msq ~ n_convenience, taiwan_30_to_45)

# See the results
mdl_0_to_15
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_0_to_15)
## 
## Coefficients:
##   (Intercept)  n_convenience  
##        9.2417         0.8336
mdl_15_to_30
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_15_to_30)
## 
## Coefficients:
##   (Intercept)  n_convenience  
##        6.8719         0.8519
mdl_30_to_45
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_30_to_45)
## 
## Coefficients:
##   (Intercept)  n_convenience  
##        8.1131         0.6687

6.2 Predicting multiple models

In order to see what each of the models for individual categories are doing, it’s helpful to make predictions from them. The flow is exactly the same as the flow for making predictions on the whole model, though remember that you only have a single explanatory variable in these models (so expand_grid() isn’t needed.)

# Create a tibble of explanatory data, setting
# no. of conv stores to 0 to 10
explanatory_data <- tibble(n_convenience = 0:10)

# Add column of predictions using "0 to 15" model and explanatory data 
prediction_data_0_to_15 <- explanatory_data %>% 
 mutate(price_twd_msq = predict(mdl_0_to_15, explanatory_data))

# Same again, with "15 to 30"
prediction_data_15_to_30 <- explanatory_data %>% 
 mutate(price_twd_msq = predict(mdl_15_to_30, explanatory_data))

# Same again, with "30 to 45"
prediction_data_30_to_45 <- explanatory_data %>% 
 mutate(price_twd_msq = predict(mdl_30_to_45, explanatory_data))

6.3 Visualizing multiple models

When you use geom_smooth() in a ggplot with an aesthetic that splits the dataset into groups and draws a line for each group (like the color aesthetic), you get multiple trend lines. This is the same as running a model on each group separately, so we get a chance to test our predictions against ggplot’s.

# Using taiwan_real_estate, plot price vs. no. of conv. stores
# colored by house age
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq, color = house_age_years)) +
  # Make it a scatter plot
  geom_point() +
  # Add smooth linear regression trend lines, no ribbon
  geom_smooth(method = "lm", se =FALSE) +
 # Add points using prediction_data_0_to_15, colored red, size 3, shape 15
  geom_point(data = prediction_data_0_to_15, color = "red", size = 3, shape = 15) +
  # Add points using prediction_data_15_to_30, colored green, size 3, shape 15
  geom_point(data = prediction_data_15_to_30, color = "green", size = 3, shape = 15) +
  # Add points using prediction_data_30_to_45, colored blue, size 3, shape 15
  geom_point(data = prediction_data_30_to_45, color = "blue", size = 3, shape = 15)
## `geom_smooth()` using formula 'y ~ x'

6.4 Assessing model performance

To test which approach is best—the whole dataset model or the models for each house age category—you need to calculate some metrics. Here, you'll compare the coefficient of determination and the residual standard error for each model.

mdl_all_ages <- lm(price_twd_msq ~ n_convenience, taiwan_real_estate)

# Get the coeff. of determination for mdl_all_ages
mdl_all_ages %>% 
  glance() %>% 
  pull(r.squared)
## [1] 0.3260466
# Get the coeff. of determination for mdl_0_to_15
mdl_0_to_15 %>% 
  glance() %>% 
  pull(r.squared)
## [1] 0.3120536
# Get the coeff. of determination for mdl_15_to_30
mdl_15_to_30 %>% 
  glance() %>% 
  pull(r.squared)
## [1] 0.4424605
# Get the coeff. of determination for mdl_30_to_45
mdl_30_to_45 %>% 
  glance() %>% 
  pull(r.squared)
## [1] 0.3125713
# Get the RSE for mdl_all
mdl_all_ages %>% 
  glance() %>% 
  pull(sigma)
## [1] 3.383888
# Get the RSE for mdl_0_to_15
mdl_0_to_15 %>% 
  glance() %>% 
  pull(sigma)
## [1] 3.564127
# Get the RSE for mdl_15_to_30
mdl_15_to_30 %>% 
  glance() %>% 
  pull(sigma)
## [1] 2.585273
# Get the RSE for mdl_30_to_45
mdl_30_to_45 %>% 
  glance() %>% 
  pull(sigma)
## [1] 3.239037

6.5 Specifying an interaction

So far you have used a single parallel slopes model, which gave an OK fit for the whole dataset, and then three separate models, one for each house age category, which gave a better fit for each individual category but were clunky because you had three separate models to work with and explain. Ideally, you'd have a single model that has all the predictive power of the individual models.

# Model price vs both with an interaction using "times" syntax
lm(price_twd_msq ~ n_convenience*house_age_years, taiwan_real_estate)
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience * house_age_years, 
##     data = taiwan_real_estate)
## 
## Coefficients:
##                     (Intercept)                    n_convenience  
##                         8.07558                          0.78473  
##               house_age_years.L                house_age_years.Q  
##                        -0.79803                          1.47418  
## n_convenience:house_age_years.L  n_convenience:house_age_years.Q  
##                        -0.11659                         -0.08228
# Model price vs both with an interaction using "colon" syntax
lm(formula = price_twd_msq ~ n_convenience + house_age_years +n_convenience:house_age_years, data = taiwan_real_estate)
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience + house_age_years + 
##     n_convenience:house_age_years, data = taiwan_real_estate)
## 
## Coefficients:
##                     (Intercept)                    n_convenience  
##                         8.07558                          0.78473  
##               house_age_years.L                house_age_years.Q  
##                        -0.79803                          1.47418  
## n_convenience:house_age_years.L  n_convenience:house_age_years.Q  
##                        -0.11659                         -0.08228
# Model price vs. house age plus an interaction, no intercept
mdl_readable_inter <- lm(formula = price_twd_msq ~ house_age_years + n_convenience:house_age_years + 0, data = taiwan_real_estate)

# See the result
mdl_readable_inter
## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years + n_convenience:house_age_years + 
##     0, data = taiwan_real_estate)
## 
## Coefficients:
##                house_age_years0 to 15                house_age_years15 to 30  
##                                9.2417                                 6.8719  
##               house_age_years30 to 45   house_age_years0 to 15:n_convenience  
##                                8.1131                                 0.8336  
## house_age_years15 to 30:n_convenience  house_age_years30 to 45:n_convenience  
##                                0.8519                                 0.6687
# Get all the coefficients from the interaction model
coeffs_inter <- coefficients(mdl_readable_inter)
coeffs_inter
##                house_age_years0 to 15               house_age_years15 to 30 
##                             9.2417022                             6.8719186 
##               house_age_years30 to 45  house_age_years0 to 15:n_convenience 
##                             8.1131235                             0.8335867 
## house_age_years15 to 30:n_convenience house_age_years30 to 45:n_convenience 
##                             0.8519172                             0.6686981
# Intercept and slope for the "0 to 15" group (matches mdl_0_to_15)
coeffs_inter[c(1, 4)]
# Intercept and slope for the "15 to 30" group (matches mdl_15_to_30)
coeffs_inter[c(2, 5)]
# Intercept and slope for the "30 to 45" group (matches mdl_30_to_45)
coeffs_inter[c(3, 6)]

Predicting with interactions

As with every other regression model you’ve created, the fun part is making predictions. Fortunately, the code flow for this case is the same as the one without interactions – R can handle calculating the interactions without any extra prompting from you. The only thing you need to remember is the trick for getting combinations of explanatory variables.

# Make a grid of explanatory data
explanatory_data <- expand_grid(
  # Set n_convenience to zero to ten
  n_convenience = 0:10,
  # Set house_age_years to the unique values of that variable
  house_age_years = unique(taiwan_real_estate$house_age_years))

# See the result
explanatory_data
## # A tibble: 33 × 2
##    n_convenience house_age_years
##            <int> <ord>          
##  1             0 30 to 45       
##  2             0 15 to 30       
##  3             0 0 to 15        
##  4             1 30 to 45       
##  5             1 15 to 30       
##  6             1 0 to 15        
##  7             2 30 to 45       
##  8             2 15 to 30       
##  9             2 0 to 15        
## 10             3 30 to 45       
## # … with 23 more rows
mdl_price_vs_both_inter <- lm(formula = price_twd_msq ~ house_age_years + n_convenience:house_age_years + 
    0, data = taiwan_real_estate)

# Add predictions to the data frame
prediction_data <- explanatory_data %>% mutate(predict(mdl_price_vs_both_inter, explanatory_data))

# See the result
prediction_data
## # A tibble: 33 × 3
##    n_convenience house_age_years `predict(mdl_price_vs_both_inter, explanatory_…
##            <int> <ord>                                                     <dbl>
##  1             0 30 to 45                                                   8.11
##  2             0 15 to 30                                                   6.87
##  3             0 0 to 15                                                    9.24
##  4             1 30 to 45                                                   8.78
##  5             1 15 to 30                                                   7.72
##  6             1 0 to 15                                                   10.1 
##  7             2 30 to 45                                                   9.45
##  8             2 15 to 30                                                   8.58
##  9             2 0 to 15                                                   10.9 
## 10             3 30 to 45                                                  10.1 
## # … with 23 more rows
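To see the three prediction lines at once, you could overlay the predictions on the scatter plot, one color per house age group. A sketch (not part of the original output) that names the predicted column explicitly so it maps onto the y aesthetic:

# Recompute the predictions with an explicit column name
prediction_data <- explanatory_data %>% 
  mutate(price_twd_msq = predict(mdl_price_vs_both_inter, explanatory_data))

# Plot the data and overlay the predictions, colored by house age
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq, color = house_age_years)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(data = prediction_data, size = 3)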

6.6 Manually calculating predictions with interactions

In order to understand how predict() works, it’s time to calculate the predictions manually again. For this model there are three separate lines to calculate, one for each house age group, and in each one the prediction is an intercept plus a slope times the numeric explanatory value. The tricky part is picking the right intercept and the right slope for each case.

# Get the coefficients from mdl_price_vs_both_inter
coeffs <- coefficients(mdl_price_vs_both_inter)

# Get the intercept for 0 to 15 year age group
intercept_0_15 <- coeffs[1]

# Get the intercept for 15 to 30 year age group
intercept_15_30 <- coeffs[2]

# Get the intercept for 30 to 45 year age group
intercept_30_45 <- coeffs[3]

# Get the slope for 0 to 15 year age group
slope_0_15 <- coeffs[4]

# Get the slope for 15 to 30 year age group
slope_15_30 <- coeffs[5]

# Get the slope for 30 to 45 year age group
slope_30_45 <- coeffs[6]


prediction_data <- explanatory_data %>% 
  mutate(
    # Consider the 3 cases to choose the price
    price_twd_msq = case_when(
      house_age_years == "0 to 15" ~ intercept_0_15 + slope_0_15 * n_convenience,
      house_age_years == "15 to 30" ~ intercept_15_30 + slope_15_30 * n_convenience,
      house_age_years == "30 to 45" ~ intercept_30_45 + slope_30_45 * n_convenience ))

# See the result
prediction_data
## # A tibble: 33 × 3
##    n_convenience house_age_years price_twd_msq
##            <int> <ord>                   <dbl>
##  1             0 30 to 45                 8.11
##  2             0 15 to 30                 6.87
##  3             0 0 to 15                  9.24
##  4             1 30 to 45                 8.78
##  5             1 15 to 30                 7.72
##  6             1 0 to 15                 10.1 
##  7             2 30 to 45                 9.45
##  8             2 15 to 30                 8.58
##  9             2 0 to 15                 10.9 
## 10             3 30 to 45                10.1 
## # … with 23 more rows

6.7 Modeling eBay auctions

Sometimes modeling a whole dataset suggests trends that disagree with models fit to subsets of that dataset. This is known as Simpson’s paradox. In the most extreme case, you may see a positive slope on the whole dataset and negative slopes on every subset of that dataset (or the other way around).

# Take a glimpse at the dataset
glimpse(auctions)
## Rows: 10,681
## Columns: 9
## $ auctionid    <dbl> 1638893549, 1638893549, 1638893549, 1638893549, 163889354…
## $ bid          <dbl> 175.00, 100.00, 120.00, 150.00, 177.50, 1.00, 1.25, 1.50,…
## $ bidtime      <dbl> 2.230949, 2.600116, 2.600810, 2.601076, 2.909826, 0.35585…
## $ bidder       <chr> "schadenfreud", "chuik", "kiwisstuff", "kiwisstuff", "eli…
## $ bidderrate   <dbl> 0, 0, 2, 2, 4, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 15, 15, 0, 0…
## $ openbid      <dbl> 99, 99, 99, 99, 99, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ price        <dbl> 177.5, 177.5, 177.5, 177.5, 177.5, 355.0, 355.0, 355.0, 3…
## $ item         <chr> "Cartier wristwatch", "Cartier wristwatch", "Cartier wris…
## $ auction_type <chr> "3 day auction", "3 day auction", "3 day auction", "3 day…
# Model price vs. opening bid using auctions
mdl_price_vs_openbid <- lm(price ~ openbid, data= auctions)

# See the result
mdl_price_vs_openbid
## 
## Call:
## lm(formula = price ~ openbid, data = auctions)
## 
## Coefficients:
## (Intercept)      openbid  
##     268.434        1.275
# Using auctions, plot price vs. opening bid as a 
# scatter plot with linear regression trend lines
ggplot(auctions, aes(openbid, price)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

# Fit linear regression of price vs. opening bid and auction 
# type, with an interaction.
mdl_price_vs_both <- lm(price ~ openbid *  auction_type, auctions)

# See the result
mdl_price_vs_both
## 
## Call:
## lm(formula = price ~ openbid * auction_type, data = auctions)
## 
## Coefficients:
##                       (Intercept)                            openbid  
##                           166.920                              1.422  
##         auction_type5 day auction          auction_type7 day auction  
##                           120.023                            124.450  
## openbid:auction_type5 day auction  openbid:auction_type7 day auction  
##                            -0.231                             -0.130
ggplot(auctions, aes(openbid, price, color = auction_type)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
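To check whether the whole-dataset trend holds within a subset, you could also fit the same regression on a single item, for example the Cartier wristwatch auctions visible in the glimpse() output above. A sketch (not part of the original exercise):

# Model price vs. opening bid for one item only
mdl_price_vs_openbid_cartier <- lm(
  price ~ openbid, 
  data = filter(auctions, item == "Cartier wristwatch"))

# Compare this slope to the whole-dataset slope above
mdl_price_vs_openbid_cartier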

7. Multiple Linear Regression

See how modeling, and linear regression in particular, makes it easy to work with more than two explanatory variables. Once you’ve mastered fitting linear regression models, you’ll get to implement your own linear regression algorithm.

7.1 3D visualizations

Since computer screens and paper are both two-dimensional, most plots are best suited to visualizing two variables at once. For three continuous variables you can draw a 3D scatter plot, but perspective problems usually make it difficult to interpret. There are some “flat” alternatives that are easier to interpret, though they require a little more thought to construct.

# With taiwan_real_estate, draw a 3D scatter plot of
# no. of conv. stores, sqrt dist to MRT, and price
taiwan_real_estate %$% 
  scatter3D(n_convenience, sqrt(dist_to_mrt_m), price_twd_msq)

# Using taiwan_real_estate, plot sqrt dist to MRT vs. 
# no. of conv stores, colored by price
ggplot(taiwan_real_estate, aes(n_convenience, sqrt(dist_to_mrt_m), color = price_twd_msq)) + 
  # Make it a scatter plot
  geom_point() +
  # Use the continuous viridis plasma color scale
  scale_color_viridis_c(option = "plasma")

7.2 Modeling 2 numeric explanatory variables

You already saw how to make a model and predictions with a numeric and a categorical explanatory variable. The code for modeling and predicting with two numeric explanatory variables is the same, other than a slight difference in how to specify the explanatory variables to make predictions against.

# Fit a linear regression of price vs. no. of conv. stores
# and sqrt dist. to nearest MRT, no interaction
mdl_price_vs_conv_dist <- lm(price_twd_msq ~ n_convenience + sqrt(dist_to_mrt_m), data = taiwan_real_estate)

# See the result
mdl_price_vs_conv_dist
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience + sqrt(dist_to_mrt_m), 
##     data = taiwan_real_estate)
## 
## Coefficients:
##         (Intercept)        n_convenience  sqrt(dist_to_mrt_m)  
##             15.1038               0.2142              -0.1573
# Create expanded grid of explanatory variables with no. of conv. stores and  dist. to nearest MRT
explanatory_data <- expand_grid(
  n_convenience = 0:10,
  dist_to_mrt_m = seq(0, 80, 10) ^ 2)

# Add predictions using mdl_price_vs_conv_dist and explanatory_data
prediction_data <- explanatory_data %>% 
  mutate(
    price_twd_msq = predict(mdl_price_vs_conv_dist, explanatory_data))

# See the result
prediction_data
## # A tibble: 99 × 3
##    n_convenience dist_to_mrt_m price_twd_msq
##            <int>         <dbl>         <dbl>
##  1             0             0         15.1 
##  2             0           100         13.5 
##  3             0           400         12.0 
##  4             0           900         10.4 
##  5             0          1600          8.81
##  6             0          2500          7.24
##  7             0          3600          5.67
##  8             0          4900          4.09
##  9             0          6400          2.52
## 10             1             0         15.3 
## # … with 89 more rows
# Add predictions to plot
ggplot(
  taiwan_real_estate, 
  aes(n_convenience, sqrt(dist_to_mrt_m), color = price_twd_msq)) + 
  geom_point() +
  scale_color_viridis_c(option = "plasma")+
  # Add prediction points colored yellow, size 3
  geom_point(data = prediction_data, color = "yellow", size = 3)

7.3 Including an interaction

Just as in the case with one numeric and one categorical explanatory variable, two numeric explanatory variables can interact. With this model structure you get three slope coefficients: one for each explanatory variable and one for the interaction.

# Fit a linear regression of price vs. no. of conv. stores
# and sqrt dist. to nearest MRT, with interaction
mdl_price_vs_conv_dist <- lm(price_twd_msq ~ n_convenience * sqrt(dist_to_mrt_m), data = taiwan_real_estate)

# See the result
mdl_price_vs_conv_dist
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience * sqrt(dist_to_mrt_m), 
##     data = taiwan_real_estate)
## 
## Coefficients:
##                       (Intercept)                      n_convenience  
##                          14.73733                            0.42425  
##               sqrt(dist_to_mrt_m)  n_convenience:sqrt(dist_to_mrt_m)  
##                          -0.14121                           -0.01124
# Create expanded grid of explanatory variables with no. of conv. stores and  dist. to nearest MRT
explanatory_data <- expand_grid(
  n_convenience = 0:10,
  dist_to_mrt_m = seq(0, 80, 10) ^ 2)

# Add predictions using mdl_price_vs_conv_dist and explanatory_data
prediction_data <- explanatory_data %>% 
  mutate(
    price_twd_msq = predict(mdl_price_vs_conv_dist, explanatory_data))

# See the result
prediction_data
## # A tibble: 99 × 3
##    n_convenience dist_to_mrt_m price_twd_msq
##            <int>         <dbl>         <dbl>
##  1             0             0         14.7 
##  2             0           100         13.3 
##  3             0           400         11.9 
##  4             0           900         10.5 
##  5             0          1600          9.09
##  6             0          2500          7.68
##  7             0          3600          6.26
##  8             0          4900          4.85
##  9             0          6400          3.44
## 10             1             0         15.2 
## # … with 89 more rows
# Add predictions to plot
ggplot(
  taiwan_real_estate, 
  aes(n_convenience, sqrt(dist_to_mrt_m), color = price_twd_msq)) + 
  geom_point() +
  scale_color_viridis_c(option = "plasma") +
  # Add prediction points colored yellow, size 3
    geom_point(data = prediction_data, color = "yellow", size = 3)
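To make the interaction concrete, you can reproduce one prediction by hand: the slope on n_convenience now changes with sqrt(dist_to_mrt_m) via the interaction coefficient. A quick sketch using the coefficients printed above; the result should match the corresponding row of prediction_data:

# Pull out the four coefficients
coeffs <- coefficients(mdl_price_vs_conv_dist)
intercept <- coeffs[1]
slope_conv <- coeffs[2]
slope_dist <- coeffs[3]
slope_inter <- coeffs[4]

# Manually predict the price for 1 convenience store, 100 m to the MRT
n_conv <- 1
sqrt_dist <- sqrt(100)
intercept + slope_conv * n_conv + slope_dist * sqrt_dist + slope_inter * n_conv * sqrt_dist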

7.4 Visualizing many variables

As you begin to consider more variables, plotting them all at the same time becomes increasingly difficult. In addition to using the x and y scales for two numeric variables, you can use color for a third numeric variable, and you can use faceting for categorical variables. That’s about your limit before the plots become too difficult to interpret. There are some specialist plot types, like correlation heatmaps and parallel coordinates plots, that can handle more variables, but they give you much less information about each variable, and they aren’t great for visualizing model predictions.

# Using taiwan_real_estate, no. of conv. stores vs. sqrt of dist. to MRT, colored by plot house price
ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), n_convenience,  color=price_twd_msq)) +
  # Make it a scatter plot
  geom_point() +
  # Use the continuous viridis plasma color scale
  scale_color_viridis_c(option="plasma") +
  # Facet, wrapped by house age
  facet_wrap(vars(house_age_years))

7.5 Different levels of interaction

Once you have three explanatory variables, the number of options for specifying interactions increases. You can specify no interactions. You can specify 2-way interactions, which gives you model coefficients for each pair of variables. The third option is to specify all the interactions: the three 2-way interactions plus an interaction between all three explanatory variables.

# Model price vs. sqrt dist. to MRT station, no. of conv.
# stores & house age, no global intercept, 3-way interactions
mdl_price_vs_all_3_way_inter <- lm(
  price_twd_msq ~ sqrt(dist_to_mrt_m) * n_convenience * house_age_years + 0, 
  data = taiwan_real_estate)

# See the result
mdl_price_vs_all_3_way_inter
## 
## Call:
## lm(formula = price_twd_msq ~ sqrt(dist_to_mrt_m) * n_convenience * 
##     house_age_years + 0, data = taiwan_real_estate)
## 
## Coefficients:
##                                 sqrt(dist_to_mrt_m)  
##                                           -0.130311  
##                                       n_convenience  
##                                            0.423346  
##                              house_age_years0 to 15  
##                                           16.046849  
##                             house_age_years15 to 30  
##                                           13.760066  
##                             house_age_years30 to 45  
##                                           12.088773  
##                   sqrt(dist_to_mrt_m):n_convenience  
##                                           -0.008200  
##               sqrt(dist_to_mrt_m):house_age_years.L  
##                                            0.043333  
##               sqrt(dist_to_mrt_m):house_age_years.Q  
##                                           -0.004881  
##                     n_convenience:house_age_years.L  
##                                            0.047178  
##                     n_convenience:house_age_years.Q  
##                                           -0.036750  
## sqrt(dist_to_mrt_m):n_convenience:house_age_years.L  
##                                            0.003112  
## sqrt(dist_to_mrt_m):n_convenience:house_age_years.Q  
##                                            0.004917
# Model price vs. sqrt dist. to MRT station, no. of conv.
# stores & house age, no global intercept, 2-way interactions
mdl_price_vs_all_2_way_inter <- lm(
  price_twd_msq ~ (sqrt(dist_to_mrt_m) + n_convenience + house_age_years) ^ 2 + 0, 
  data = taiwan_real_estate)
# See the result
mdl_price_vs_all_2_way_inter
## 
## Call:
## lm(formula = price_twd_msq ~ (sqrt(dist_to_mrt_m) + n_convenience + 
##     house_age_years)^2 + 0, data = taiwan_real_estate)
## 
## Coefficients:
##                   sqrt(dist_to_mrt_m)                          n_convenience  
##                             -0.128758                               0.430422  
##                house_age_years0 to 15                house_age_years15 to 30  
##                             16.026633                              13.880791  
##               house_age_years30 to 45      sqrt(dist_to_mrt_m):n_convenience  
##                             11.926904                              -0.008956  
## sqrt(dist_to_mrt_m):house_age_years.L  sqrt(dist_to_mrt_m):house_age_years.Q  
##                              0.048223                               0.002040  
##       n_convenience:house_age_years.L        n_convenience:house_age_years.Q  
##                              0.101411                               0.064178
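The “no interactions” option mentioned above isn’t shown in the output. A minimal sketch of it (object name illustrative) follows the same pattern, using + instead of * or ^ 2:

# Model price vs. all three explanatory variables,
# no global intercept, no interactions
mdl_price_vs_all_no_inter <- lm(
  price_twd_msq ~ sqrt(dist_to_mrt_m) + n_convenience + house_age_years + 0, 
  data = taiwan_real_estate)

# See the result
mdl_price_vs_all_no_inter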

7.6 Predicting again

You’ve followed the prediction workflow several times now with different combinations of explanatory variables. Time to try it once more on the model with three explanatory variables. Here, you’ll use the model with 3-way interactions, though the code is the same when using any of the three models from the previous exercise.

# Make a grid of explanatory data
explanatory_data <- expand_grid(
  # Set dist_to_mrt_m a seq from 0 to 80 by 10s, squared
  dist_to_mrt_m = seq(0, 80, 10)^2,
  # Set n_convenience to 0 to 10
  n_convenience = 0:10,
  # Set house_age_years to the unique values of that variable
 house_age_years= unique(taiwan_real_estate$house_age_years))

# See the result
explanatory_data
## # A tibble: 297 × 3
##    dist_to_mrt_m n_convenience house_age_years
##            <dbl>         <int> <ord>          
##  1             0             0 30 to 45       
##  2             0             0 15 to 30       
##  3             0             0 0 to 15        
##  4             0             1 30 to 45       
##  5             0             1 15 to 30       
##  6             0             1 0 to 15        
##  7             0             2 30 to 45       
##  8             0             2 15 to 30       
##  9             0             2 0 to 15        
## 10             0             3 30 to 45       
## # … with 287 more rows
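The prediction and plotting steps aren’t shown above. Following the same workflow as in the earlier sections, a sketch using the 3-way interaction model would be:

# Add predictions using mdl_price_vs_all_3_way_inter and explanatory_data
prediction_data <- explanatory_data %>% 
  mutate(price_twd_msq = predict(mdl_price_vs_all_3_way_inter, explanatory_data))

# Plot the data and overlay the predictions, faceted by house age
ggplot(
  taiwan_real_estate, 
  aes(n_convenience, sqrt(dist_to_mrt_m), color = price_twd_msq)) +
  geom_point() +
  scale_color_viridis_c(option = "plasma") +
  facet_wrap(vars(house_age_years)) +
  geom_point(data = prediction_data, color = "yellow", size = 3)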

8. Multiple Logistic Regression

Extend your logistic regression skills to multiple explanatory variables. Understand the logistic distribution, which underpins this form of regression. Finally, implement your own logistic regression algorithm.

8.1 Visualizing multiple explanatory variables

Logistic regression also supports multiple explanatory variables. Plotting faces the same issues as in the linear regression case: it quickly becomes difficult to include more numeric variables in the plot. Here we’ll look at the case of two numeric explanatory variables, and the solution is basically the same as before: use color to denote the response.

# Using churn, plot recency vs. length of relationship,
# colored by churn status
ggplot(churn, aes(time_since_first_purchase, time_since_last_purchase, color = has_churned)) +
  # Make it a scatter plot, with transparency 0.5
  geom_point(alpha = 0.5) +
  # Use a 2-color gradient split at 0.5
 scale_color_gradient2( midpoint = 0.5) +
  # Use the black and white theme
  theme_bw()

8.2 Logistic regression with 2 explanatory variables

To include multiple explanatory variables in a logistic regression model, the syntax is the same as for linear regression. The only change, as in the simple case, is that you run a generalized linear model with a binomial error family.

# Fit a logistic regression of churn status vs. length of
# relationship, recency, and an interaction
mdl_churn_vs_both_inter <- glm(has_churned ~ time_since_first_purchase*time_since_last_purchase, family = binomial, data = churn)

# See the result
mdl_churn_vs_both_inter
## 
## Call:  glm(formula = has_churned ~ time_since_first_purchase * time_since_last_purchase, 
##     family = binomial, data = churn)
## 
## Coefficients:
##                                        (Intercept)  
##                                            -0.1505  
##                          time_since_first_purchase  
##                                            -0.6376  
##                           time_since_last_purchase  
##                                             0.4233  
## time_since_first_purchase:time_since_last_purchase  
##                                             0.1123  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  396 Residual
## Null Deviance:       554.5 
## Residual Deviance: 519.8     AIC: 527.8

8.3 Logistic regression prediction

As with linear regression, the joy of logistic regression is that you can make predictions. Let’s step through the prediction flow one more time!

# Make a grid of explanatory data
explanatory_data <- expand.grid(
  # Set len. relationship to seq from -2 to 4 in steps of 0.1
  time_since_first_purchase = seq(-2, 4, 0.1),
  # Set recency to seq from -1 to 6 in steps of 0.1
  time_since_last_purchase = seq(-1, 6, 0.1))

# See the result
head(explanatory_data, 20)
##    time_since_first_purchase time_since_last_purchase
## 1                       -2.0                       -1
## 2                       -1.9                       -1
## 3                       -1.8                       -1
## 4                       -1.7                       -1
## 5                       -1.6                       -1
## 6                       -1.5                       -1
## 7                       -1.4                       -1
## 8                       -1.3                       -1
## 9                       -1.2                       -1
## 10                      -1.1                       -1
## 11                      -1.0                       -1
## 12                      -0.9                       -1
## 13                      -0.8                       -1
## 14                      -0.7                       -1
## 15                      -0.6                       -1
## 16                      -0.5                       -1
## 17                      -0.4                       -1
## 18                      -0.3                       -1
## 19                      -0.2                       -1
## 20                      -0.1                       -1
# Add a column of predictions using mdl_churn_vs_both_inter
# and explanatory_data with type response
prediction_data <- explanatory_data %>% 
  mutate(
    has_churned = predict(
      mdl_churn_vs_both_inter, explanatory_data, type = "response"
    )
  )

# See the result
head(prediction_data, 20)
##    time_since_first_purchase time_since_last_purchase has_churned
## 1                       -2.0                       -1   0.7162561
## 2                       -1.9                       -1   0.7007719
## 3                       -1.8                       -1   0.6848146
## 4                       -1.7                       -1   0.6684090
## 5                       -1.6                       -1   0.6515839
## 6                       -1.5                       -1   0.6343721
## 7                       -1.4                       -1   0.6168101
## 8                       -1.3                       -1   0.5989378
## 9                       -1.2                       -1   0.5807985
## 10                      -1.1                       -1   0.5624381
## 11                      -1.0                       -1   0.5439051
## 12                      -0.9                       -1   0.5252497
## 13                      -0.8                       -1   0.5065235
## 14                      -0.7                       -1   0.4877790
## 15                      -0.6                       -1   0.4690688
## 16                      -0.5                       -1   0.4504452
## 17                      -0.4                       -1   0.4319594
## 18                      -0.3                       -1   0.4136613
## 19                      -0.2                       -1   0.3955984
## 20                      -0.1                       -1   0.3778160
# Extend the plot
ggplot(
  churn, 
  aes(time_since_first_purchase, time_since_last_purchase, color = has_churned)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient2(midpoint = 0.5) +
  theme_bw() +
  # Add points from prediction_data with size 3 and shape 15
  geom_point(data = prediction_data, size = 3, shape = 15)
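To connect the predictions back to the coefficients, you can compute one probability by hand: the linear predictor (on the log-odds scale) is passed through the logistic CDF, plogis(), to give a probability. A sketch using the coefficients of mdl_churn_vs_both_inter and the first row of explanatory_data:

# Pull out the coefficients
coeffs <- coefficients(mdl_churn_vs_both_inter)

# First row of the grid: length of relationship -2, recency -1
x1 <- -2
x2 <- -1

# Log-odds, then transform to a probability
log_odds <- coeffs[1] + coeffs[2] * x1 + coeffs[3] * x2 + coeffs[4] * x1 * x2
plogis(log_odds)  # should match row 1 of prediction_data (about 0.716)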

8.4 Confusion matrix

When the response variable has just two outcomes, like the case of churn, the measures of success for the model are “how many cases where the customer churned did the model correctly predict?” and “how many cases where the customer didn’t churn did the model correctly predict?”. These can be found by generating a confusion matrix and calculating summary metrics on it. A mosaic plot is the natural way to visualize the results.

# Get the actual responses from churn
actual_response <- churn$has_churned

# Get the predicted responses from the model
predicted_response <- round(fitted(mdl_churn_vs_both_inter))
# Get a table of these values
outcomes <- table(predicted_response, actual_response)

# Convert the table to a conf_mat object
confusion <- conf_mat(outcomes)

# See the result
confusion
##                   actual_response
## predicted_response   0   1
##                  0 102  53
##                  1  98 147
# "Automatically" plot the confusion matrix
autoplot(confusion)

# Get summary metrics
summary(confusion, event_level = "second")
## # A tibble: 13 × 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.622
##  2 kap                  binary         0.245
##  3 sens                 binary         0.735
##  4 spec                 binary         0.51 
##  5 ppv                  binary         0.6  
##  6 npv                  binary         0.658
##  7 mcc                  binary         0.251
##  8 j_index              binary         0.245
##  9 bal_accuracy         binary         0.622
## 10 detection_prevalence binary         0.612
## 11 precision            binary         0.6  
## 12 recall               binary         0.735
## 13 f_meas               binary         0.661
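The headline metrics can also be read straight off the table. With churn (the second level, 1) as the event of interest, accuracy is the share of all predictions that are correct, sensitivity is the share of actual churners the model catches, and specificity is the share of non-churners it correctly identifies. A quick manual check against the counts above:

# Counts from the confusion matrix (predicted in rows, actual in columns)
TN <- 102  # predicted 0, actual 0
FN <- 53   # predicted 0, actual 1
FP <- 98   # predicted 1, actual 0
TP <- 147  # predicted 1, actual 1

(TP + TN) / (TP + TN + FP + FN)  # accuracy,    about 0.622
TP / (TP + FN)                   # sensitivity, about 0.735
TN / (TN + FP)                   # specificity, about 0.51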

8.5 Cumulative distribution function

Understanding the logistic distribution is key to understanding logistic regression. Like the normal (Gaussian) distribution, it is a probability distribution of a single continuous variable. Here you’ll visualize the cumulative distribution function (CDF) for the logistic distribution. That is, if you have a logistically distributed variable, x, and a possible value, xval, that x could take, then the CDF gives the probability that x is less than xval. The logistic distribution’s CDF is calculated with the logistic function (hence the name). The plot of this has an S-shape, known as a sigmoid curve. An important property of this function is that it takes an input that can be any number from minus infinity to infinity, and returns a value between zero and one.

logistic_distn_cdf <- tibble(
  # Make a seq from -10 to 10 in steps of 0.1
  x = seq(-10, 10, 0.1),
  # Transform x with built-in logistic CDF
  logistic_x = plogis(x),
  # Transform x with manual logistic
  logistic_x_man = (1+exp(-x))^-1
) 

# Check that each logistic function gives the same results
all.equal(
  logistic_distn_cdf$logistic_x, 
  logistic_distn_cdf$logistic_x_man
)
## [1] TRUE
# Using logistic_distn_cdf, plot logistic_x vs. x
ggplot(logistic_distn_cdf, aes(x, logistic_x)) +
  # Make it a line plot
  geom_line()

8.6 Inverse cumulative distribution function

The logistic function (the logistic distribution CDF) has another important property: each x input value is transformed to a unique output value, which means the transformation can be reversed. The logit function is the name for the inverse logistic function, which is also the logistic distribution’s inverse cumulative distribution function. (All three terms mean exactly the same thing.)

logistic_distn_inv_cdf <- tibble(
  # Make a seq from 0.001 to 0.999 in steps of 0.001
  p = seq(0.001, 0.999, 0.001),
  # Transform with built-in logistic inverse CDF
  logit_p = qlogis(p),
  # Transform with manual logit
  logit_p_man = log(p / (1 - p))
) 

# Check that each logistic function gives the same results
all.equal(
  logistic_distn_inv_cdf$logit_p,
  logistic_distn_inv_cdf$logit_p_man
)
## [1] TRUE
# Using logistic_distn_inv_cdf, plot logit_p vs. p
ggplot(logistic_distn_inv_cdf, aes(p, logit_p)) +
  # Make it a line plot
  geom_line()
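Since plogis() and qlogis() are inverse functions, round-tripping a value should recover it up to floating point error. A quick sanity check, which should return TRUE:

# Applying the logistic CDF and then its inverse recovers the original values
x <- seq(-10, 10, 0.1)
all.equal(qlogis(plogis(x)), x)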

8.7 Logistic regression algorithm

Let’s dig into the internals and implement a logistic regression algorithm. Since R’s glm() function is very complex, you’ll stick to implementing simple logistic regression for a single dataset.

Rather than using the sum of squares as the metric, we want to use likelihood. However, the log-likelihood is more computationally stable, so we’ll use that instead. There is one more change: since we want to maximize the log-likelihood but optim() defaults to finding minimum values, it is easier to calculate the negative log-likelihood and minimize that.

x_actual <- churn$time_since_last_purchase
y_actual <- churn$has_churned
# Set the intercept to 1
intercept <- 1

# Set the slope to 0.5
slope <- 0.5

# Calculate the predicted y values
y_pred <- plogis(intercept + slope * x_actual)

# Calculate the log-likelihood for each term
log_likelihoods <- log(y_pred) * y_actual + log(1 - y_pred) * (1 - y_actual)

# Calculate minus the sum of the log-likelihoods for each term
-sum(log_likelihoods)
## [1] 326.2599
calc_neg_log_likelihood <- function(coeffs) {
  # Get the intercept coeff
  intercept <- coeffs[1]

  # Get the slope coeff
  slope <- coeffs[2]

  # Calculate the predicted y values
  y_pred <- plogis(intercept + slope * x_actual)

  # Calculate the log-likelihood for each term
  log_likelihoods <- log(y_pred) * y_actual + log(1 - y_pred) * (1 - y_actual)

  # Calculate minus the sum of the log-likelihoods for each term
  -sum(log_likelihoods)
}

# Optimize the metric
optim(
  # Initially guess 0 intercept and 1 slope
  par = c(intercept = 0, slope = 1),
  # Use calc_neg_log_likelihood as the optimization fn 
  fn = calc_neg_log_likelihood
)
## $par
##   intercept       slope 
## -0.03478255  0.26890041 
## 
## $value
## [1] 273.2002
## 
## $counts
## function gradient 
##       51       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
# Compare the coefficients to those calculated by glm()
glm(has_churned ~ time_since_last_purchase, data = churn, family = binomial)
## 
## Call:  glm(formula = has_churned ~ time_since_last_purchase, family = binomial, 
##     data = churn)
## 
## Coefficients:
##              (Intercept)  time_since_last_purchase  
##                 -0.03502                   0.26921  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  398 Residual
## Null Deviance:       554.5 
## Residual Deviance: 546.4     AIC: 550.4

The End.

Thanks DataCamp

- My Favorite Team - Cim boom