Libraries
library(tidyverse)
library(anytime)
library(lubridate)
library(assertive)
library(fst)
library(broom)
library(ggfortify)
library(yardstick)
library(moderndive)
library(tidyr)
library(plot3D)
library(magrittr)
library(infer)
library(ggridges)
library(zeallot)
Data
taiwan_real_estate <- read.fst("taiwan_real_estate.fst")
ad_conversion <- read.fst("ad_conversion.fst")
churn <- read.fst("churn.fst")
auctions <- read_csv("auction.csv")
Linear regression and logistic regression are the two most widely used statistical models and act like master keys, unlocking the secrets hidden in datasets. In this course, you’ll gain the skills you need to fit simple linear and logistic regressions. Through hands-on exercises, you’ll explore the relationships between variables in real-world datasets, including motor insurance claims, Taiwan house prices, fish sizes, and more. By the end of this course, you’ll know how to make predictions from your data, quantify model performance, and diagnose problems with model fit.
This course builds on the skills you gained in “Introduction to Regression in R”, covering linear and logistic regression with multiple explanatory variables.
You’ll learn the basics of this popular statistical model, what regression is, and how linear and logistic regressions differ. You’ll then learn how to fit simple linear regression models with numeric and categorical explanatory variables, and how to describe the relationship between the response and explanatory variables using model coefficients.
Before you can run any statistical models, it’s usually a good idea to visualize your dataset. Here, we’ll look at the relationship between house price per area and the number of nearby convenience stores, using the Taiwan real estate dataset.
One challenge in this dataset is that the number of convenience stores contains integer data, causing points to overlap. To solve this, you will make the points transparent.
# Draw a scatter plot of n_convenience vs. price_twd_msq
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
geom_point()
# Make points 50% transparent
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
geom_point(alpha = 0.5)
# Add a linear trend line without a confidence ribbon
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Run a linear regression of price_twd_msq vs. n_convenience
mdl_price_vs_conv <- lm(price_twd_msq ~ n_convenience, taiwan_real_estate)
mdl_price_vs_conv
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) n_convenience
## 8.2242 0.7981
If the explanatory variable is categorical, the scatter plot that you used before to visualize the data doesn’t make sense. Instead, a good option is to draw a histogram for each category.
The Taiwan real estate dataset has a categorical variable in the form of the age of each house. The ages have been split into 3 groups: 0 to 15 years, 15 to 30 years, and 30 to 45 years.
# Using taiwan_real_estate, plot price_twd_msq
ggplot(taiwan_real_estate, aes(price_twd_msq)) +
# Make it a histogram with 10 bins
geom_histogram(bins = 10) +
# Facet the plot so each house age group gets its own panel
facet_wrap(vars(house_age_years))
A good way to explore categorical variables is to calculate summary statistics such as the mean for each category. Here, you’ll look at grouped means for the house prices in the Taiwan real estate dataset.
summary_stats <- taiwan_real_estate %>%
# Group by house age
group_by(house_age_years) %>%
# Summarize to calculate the mean house price/area
summarize(mean_by_group = mean(price_twd_msq))
# See the result
summary_stats
## # A tibble: 3 × 2
## house_age_years mean_by_group
## <ord> <dbl>
## 1 0 to 15 12.6
## 2 15 to 30 9.88
## 3 30 to 45 11.4
Linear regressions also work with categorical explanatory variables. In this case, the code to run the model is the same, but the coefficients returned by the model are different.
# Run a linear regression of price_twd_msq vs. house_age_years
mdl_price_vs_age <- lm(price_twd_msq ~ house_age_years, taiwan_real_estate)
# See the result
mdl_price_vs_age
##
## Call:
## lm(formula = price_twd_msq ~ house_age_years, data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) house_age_years.L house_age_years.Q
## 11.3025 -0.8798 1.7462
# Update the model formula to remove the intercept
mdl_price_vs_age_no_intercept <- lm(
price_twd_msq ~ house_age_years + 0,
data = taiwan_real_estate)
# See the result
mdl_price_vs_age_no_intercept
##
## Call:
## lm(formula = price_twd_msq ~ house_age_years + 0, data = taiwan_real_estate)
##
## Coefficients:
## house_age_years0 to 15 house_age_years15 to 30 house_age_years30 to 45
## 12.637 9.877 11.393
In this chapter, you’ll discover how to use linear regression models to make predictions on Taiwanese house prices and Facebook advert clicks. You’ll also grow your regression skills as you get hands-on with model objects, understand the concept of “regression to the mean”, and learn how to transform variables in a dataset.
Perhaps the most useful feature of statistical models like linear regression is that you can make predictions. That is, you specify values for each of the explanatory variables, feed them to the model, and you get a prediction for the corresponding response variable.
# Create a tibble with n_convenience column from zero to ten
explanatory_data <- tibble(n_convenience = 0:10)
# Use mdl_price_vs_conv to predict with explanatory_data
predict(mdl_price_vs_conv, explanatory_data)
## 1 2 3 4 5 6 7 8
## 8.224237 9.022317 9.820397 10.618477 11.416556 12.214636 13.012716 13.810795
## 9 10 11
## 14.608875 15.406955 16.205035
# Edit this, so predictions are stored in prediction_data
prediction_data <- explanatory_data %>%
mutate(price_twd_msq = predict(mdl_price_vs_conv, explanatory_data))
# See the result
prediction_data
## # A tibble: 11 × 2
## n_convenience price_twd_msq
## <int> <dbl>
## 1 0 8.22
## 2 1 9.02
## 3 2 9.82
## 4 3 10.6
## 5 4 11.4
## 6 5 12.2
## 7 6 13.0
## 8 7 13.8
## 9 8 14.6
## 10 9 15.4
## 11 10 16.2
The prediction data you calculated contains a column of explanatory variable values and a column of response variable values. That means you can plot it on the same scatter plot of response versus explanatory data values.
# Add to the plot
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
geom_point() + geom_smooth(method = "lm", se = FALSE) +
# Add a point layer of prediction data, colored yellow
geom_point(data = prediction_data, color = "yellow")
## `geom_smooth()` using formula 'y ~ x'
The variable returned by lm() that contains the model object has many elements. In order to perform further analysis on the model results, you need to extract the useful bits of it. The model coefficients, the fitted values, and the residuals are perhaps the most important bits of the linear model object.
# Get the model coefficients of mdl_price_vs_conv
coefficients(mdl_price_vs_conv)
## (Intercept) n_convenience
## 8.2242375 0.7980797
# Get the fitted values of mdl_price_vs_conv
head(fitted(mdl_price_vs_conv))
## 1 2 3 4 5 6
## 16.20503 15.40695 12.21464 12.21464 12.21464 10.61848
# Get the residuals of mdl_price_vs_conv
head(residuals(mdl_price_vs_conv))
## 1 2 3 4 5 6
## -4.7375611 -2.6384224 2.0970130 4.3663019 0.8262112 -0.9059199
# Print a summary of mdl_price_vs_conv
summary(mdl_price_vs_conv)
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7132 -2.2213 -0.5409 1.8105 26.5299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.22424 0.28500 28.86 <2e-16 ***
## n_convenience 0.79808 0.05653 14.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.384 on 412 degrees of freedom
## Multiple R-squared: 0.326, Adjusted R-squared: 0.3244
## F-statistic: 199.3 on 1 and 412 DF, p-value: < 2.2e-16
You can manually calculate the predictions from the model coefficients. When making predictions in real life, it is better to use predict(), but doing this manually is helpful to reassure yourself that predictions aren’t magic—they are simply arithmetic.
# Get the coefficients of mdl_price_vs_conv
coeffs <- coefficients(mdl_price_vs_conv)
# Get the intercept
intercept <- coeffs[1]
# Get the slope
slope <- coeffs[2]
explanatory_data %>%
mutate(price_twd_msq = intercept + slope * n_convenience)
## # A tibble: 11 × 2
## n_convenience price_twd_msq
## <int> <dbl>
## 1 0 8.22
## 2 1 9.02
## 3 2 9.82
## 4 3 10.6
## 5 4 11.4
## 6 5 12.2
## 7 6 13.0
## 8 7 13.8
## 9 8 14.6
## 10 9 15.4
## 11 10 16.2
# Compare to the results from predict()
predict(mdl_price_vs_conv, explanatory_data)
## 1 2 3 4 5 6 7 8
## 8.224237 9.022317 9.820397 10.618477 11.416556 12.214636 13.012716 13.810795
## 9 10 11
## 14.608875 15.406955 16.205035
Many programming tasks are easier if you keep all your data inside data frames. This is particularly true if you are a tidyverse fan, where dplyr and ggplot2 require you to use data frames. The broom package contains functions that decompose models into three data frames: one for the coefficient-level elements (the coefficients themselves, as well as p-values for each coefficient), one for the observation-level elements (like fitted values and residuals), and one for the model-level elements (mostly performance metrics).
# Get the coefficient-level elements of the model
tidy(mdl_price_vs_conv)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.22 0.285 28.9 5.81e-101
## 2 n_convenience 0.798 0.0565 14.1 3.41e- 37
# Get the observation-level elements of the model
augment(mdl_price_vs_conv)
## # A tibble: 414 × 8
## price_twd_msq n_convenience .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 11.5 10 16.2 -4.74 0.0121 3.38 1.22e-2 -1.41
## 2 12.8 9 15.4 -2.64 0.00913 3.39 2.83e-3 -0.783
## 3 14.3 5 12.2 2.10 0.00264 3.39 5.10e-4 0.621
## 4 16.6 5 12.2 4.37 0.00264 3.38 2.21e-3 1.29
## 5 13.0 5 12.2 0.826 0.00264 3.39 7.92e-5 0.244
## 6 9.71 3 10.6 -0.906 0.00275 3.39 9.91e-5 -0.268
## 7 12.2 7 13.8 -1.62 0.00477 3.39 5.50e-4 -0.479
## 8 14.1 6 13.0 1.12 0.00343 3.39 1.88e-4 0.331
## 9 5.69 1 9.02 -3.33 0.00509 3.38 2.49e-3 -0.988
## 10 6.69 3 10.6 -3.93 0.00275 3.38 1.87e-3 -1.16
## # … with 404 more rows
# Get the model-level elements of the model
glance(mdl_price_vs_conv)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.326 0.324 3.38 199. 3.41e-37 1 -1091. 2188. 2200.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
If there is no straight line relationship between the response variable and the explanatory variable, it is sometimes possible to create one by transforming one or both of the variables. Here, you’ll look at transforming the explanatory variable.
You’ll take another look at the Taiwan real estate dataset, this time using the distance to the nearest MRT (metro) station as the explanatory variable. You’ll use code to make every commuter’s dream come true: shortening the distance to the metro station by taking the square root. Take that, geography!
# Run a linear regression of price_twd_msq vs.
# square root of dist_to_mrt_m using taiwan_real_estate
mdl_price_vs_dist <- lm(price_twd_msq ~ sqrt(dist_to_mrt_m), taiwan_real_estate)
# See the result
mdl_price_vs_dist
##
## Call:
## lm(formula = price_twd_msq ~ sqrt(dist_to_mrt_m), data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) sqrt(dist_to_mrt_m)
## 16.7098 -0.1828
# Use this explanatory data
explanatory_data <- tibble(
dist_to_mrt_m = seq(0, 80, 10) ^ 2)
# Use mdl_price_vs_dist to predict explanatory_data
prediction_data <- explanatory_data %>%
mutate(price_twd_msq = predict(mdl_price_vs_dist, explanatory_data))
# See the result
prediction_data
## # A tibble: 9 × 2
## dist_to_mrt_m price_twd_msq
## <dbl> <dbl>
## 1 0 16.7
## 2 100 14.9
## 3 400 13.1
## 4 900 11.2
## 5 1600 9.40
## 6 2500 7.57
## 7 3600 5.74
## 8 4900 3.91
## 9 6400 2.08
ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), price_twd_msq)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
# Add points from prediction_data, colored green, size 5
geom_point(data = prediction_data, color = "green", size = 5)
## `geom_smooth()` using formula 'y ~ x'
The response variable can be transformed too, but this means you need an extra step at the end to undo that transformation. That is, you “back transform” the predictions.
# Run the code to see the plot
# Edit to raise x, y aesthetics to power 0.25
ggplot(ad_conversion, aes(n_impressions^0.25, n_clicks^0.25)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Run a linear regression of n_clicks to the power 0.25 vs.
# n_impressions to the power 0.25 using ad_conversion
mdl_click_vs_impression <- lm(I(n_clicks^0.25) ~ I(n_impressions^0.25), ad_conversion)
# From previous step
mdl_click_vs_impression <- lm(
I(n_clicks ^ 0.25) ~ I(n_impressions ^ 0.25),
data = ad_conversion)
# Use this explanatory data
explanatory_data <- tibble(
n_impressions = seq(0, 3e6, 5e5))
prediction_data <- explanatory_data %>%
mutate(
# Use mdl_click_vs_impression to predict n_clicks ^ 0.25
n_clicks_025 = predict(mdl_click_vs_impression, explanatory_data),
# Back transform to get n_clicks
n_clicks = n_clicks_025 ^ 4)
ggplot(ad_conversion, aes(n_impressions ^ 0.25, n_clicks ^ 0.25)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
# Add points from prediction_data, colored green
geom_point(data = prediction_data, color = "green")
## `geom_smooth()` using formula 'y ~ x'
In this chapter, you’ll learn how to ask questions of your model to assess fit. You’ll learn how to quantify how well a linear regression model fits, diagnose model problems using visualizations, and understand the leverage and influence of each observation used to create the model.
The coefficient of determination is a measure of how well the linear regression line fits the observed values. For simple linear regression, it is equal to the square of the correlation between the explanatory and response variables.
mdl_click_vs_impression_orig <- lm(n_clicks ~ n_impressions, ad_conversion)
mdl_click_vs_impression_trans <- lm(I(n_clicks^0.25) ~ I(n_impressions^0.25), ad_conversion)
# Print a summary of mdl_click_vs_impression_orig
summary(mdl_click_vs_impression_orig)
##
## Call:
## lm(formula = n_clicks ~ n_impressions, data = ad_conversion)
##
## Residuals:
## Min 1Q Median 3Q Max
## -186.099 -5.392 -1.422 2.070 119.876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.683e+00 7.888e-01 2.133 0.0331 *
## n_impressions 1.718e-04 1.960e-06 87.654 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.91 on 934 degrees of freedom
## Multiple R-squared: 0.8916, Adjusted R-squared: 0.8915
## F-statistic: 7683 on 1 and 934 DF, p-value: < 2.2e-16
# Print a summary of mdl_click_vs_impression_trans
summary(mdl_click_vs_impression_trans)
##
## Call:
## lm(formula = I(n_clicks^0.25) ~ I(n_impressions^0.25), data = ad_conversion)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57061 -0.13229 0.00582 0.14494 0.46888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0717479 0.0172019 4.171 3.32e-05 ***
## I(n_impressions^0.25) 0.1115330 0.0008844 126.108 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1969 on 934 degrees of freedom
## Multiple R-squared: 0.9445, Adjusted R-squared: 0.9445
## F-statistic: 1.59e+04 on 1 and 934 DF, p-value: < 2.2e-16
# Get coeff of determination for mdl_click_vs_impression_orig
mdl_click_vs_impression_orig %>%
# Get the model-level details
glance() %>%
# Pull out r.squared
pull(r.squared)
## [1] 0.8916135
# Do the same for the transformed model
mdl_click_vs_impression_trans %>% glance() %>% pull(r.squared)
## [1] 0.9445273
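As a quick check of the earlier claim that, for simple linear regression, the coefficient of determination equals the squared correlation between the two variables, here is a minimal sketch using the same columns (not part of the original exercises).
# Squared correlation between response and explanatory variable;
# this should match the r.squared value pulled above (about 0.892)
cor(ad_conversion$n_clicks, ad_conversion$n_impressions) ^ 2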
Residual standard error (RSE) is a measure of the typical size of the residuals. Equivalently, it’s a measure of how badly wrong you can expect predictions to be. Smaller numbers are better, with zero being a perfect fit to the data.
# Get RSE for mdl_click_vs_impression_orig
mdl_click_vs_impression_orig %>%
# Get the model-level details
glance() %>%
# Pull out sigma
pull(sigma)
## [1] 19.90584
# Do the same for the transformed model
mdl_click_vs_impression_trans %>% glance() %>% pull(sigma)
## [1] 0.1969064
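The RSE can also be computed by hand from the residuals, which makes the “typical size of the residuals” interpretation concrete. A sketch (not one of the course exercises):
# RSE by hand: square root of the sum of squared residuals divided by the
# residual degrees of freedom; should match the sigma value above (about 19.9)
sqrt(sum(residuals(mdl_click_vs_impression_orig) ^ 2) / df.residual(mdl_click_vs_impression_orig))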
It’s time for you to draw these diagnostic plots yourself. Let’s go back to the Taiwan real estate dataset and the model of house prices versus number of convenience stores.
# Plot the three diagnostics for mdl_price_vs_conv
autoplot(mdl_price_vs_conv, which = 1:3, nrow = 3, ncol = 1)
Leverage
Leverage measures how unusual or extreme the explanatory variables are for each observation. Very roughly, high leverage means that the explanatory variable has values that are different from those of the other points in the dataset. In the case of simple linear regression, where there is only one explanatory variable, this typically means observations with a very high or very low explanatory value.
Influence
Influence measures how much the model would change if each observation were left out of the model calculations, one at a time. That is, it measures how different the prediction line would look if you ran a linear regression on all data points except that point, compared to running a linear regression on the whole dataset. The standard metric for influence is Cook’s distance, which calculates influence based on the size of the residual and the leverage of the point.
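If you prefer base R over broom, the same quantities are available directly from the model object. A sketch; these match the .hat and .cooksd columns in the augment() output below:
# Leverage for each observation (same values as the .hat column)
head(hatvalues(mdl_price_vs_dist))
# Cook's distance for each observation (same values as the .cooksd column)
head(cooks.distance(mdl_price_vs_dist))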
mdl_price_vs_dist %>%
# Augment the model
augment() %>%
# Arrange rows by descending leverage
arrange(desc(.hat)) %>%
# Get the head of the dataset
head()
## # A tibble: 6 × 7
## price_twd_msq `sqrt(dist_to_mrt_m)` .fitted .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.39 80.5 1.98 0.0267 2.82 0.00351 0.506
## 2 3.69 80.0 2.09 0.0261 2.82 0.00447 0.577
## 3 4.54 79.4 2.19 0.0256 2.82 0.00937 0.844
## 4 5.69 74.2 3.13 0.0211 2.82 0.00906 0.916
## 5 5.26 74.2 3.13 0.0211 2.82 0.00630 0.764
## 6 4.05 67.9 4.30 0.0163 2.82 0.0000644 -0.0882
mdl_price_vs_dist %>%
# Augment the model
augment() %>%
# Arrange rows by descending Cook's distance
arrange(desc(.cooksd)) %>%
# Get the head of the dataset
head()
## # A tibble: 6 × 7
## price_twd_msq `sqrt(dist_to_mrt_m)` .fitted .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 35.6 15.9 13.8 0.00385 2.61 0.116 7.73
## 2 13.6 61.5 5.47 0.0121 2.79 0.0524 2.92
## 3 14.1 56.3 6.41 0.00933 2.80 0.0354 2.74
## 4 23.7 13.7 14.2 0.00440 2.78 0.0251 3.37
## 5 2.30 19.8 13.1 0.00310 2.77 0.0228 -3.83
## 6 23.6 17.8 13.4 0.00344 2.78 0.0225 3.61
# Plot the three outlier diagnostics for mdl_price_vs_conv
autoplot(mdl_price_vs_dist, which = 4:6, nrow = 3, ncol = 1)
Logistic regression with glm()
Linear regression and logistic regression are special cases of a broader class of models called generalized linear models (“GLMs”). A linear regression makes the assumption that the residuals follow a Gaussian (normal) distribution. By contrast, a logistic regression assumes that the residuals follow a binomial distribution.
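One way to see the “special cases” claim concretely: fitting a GLM with a Gaussian error family reproduces the linear regression from earlier. A sketch, not one of the course exercises:
# lm() is the special case of glm() with a Gaussian error family;
# these two calls give the same coefficients
lm(price_twd_msq ~ n_convenience, data = taiwan_real_estate)
glm(price_twd_msq ~ n_convenience, data = taiwan_real_estate, family = gaussian)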
When the response variable is logical, all the points lie on the y equals zero and y equals one lines, making it difficult to see what is happening. In the video, until you saw the trend line, it wasn’t clear how the explanatory variable was distributed on each line. This can be solved with a histogram of the explanatory variable, faceted on the response.
# Using churn, plot time_since_last_purchase
ggplot(churn, aes(time_since_last_purchase)) +
# as a histogram with binwidth 0.25
geom_histogram(binwidth = 0.25) +
# faceted in a grid with has_churned on each row
facet_grid(rows = vars(has_churned))
# Redraw the plot with time_since_first_purchase
ggplot(churn, aes(time_since_first_purchase)) +
# as a histogram with binwidth 0.25
geom_histogram(binwidth = 0.25) +
# faceted in a grid with has_churned on each row
facet_grid(rows = vars(has_churned))
As with linear regressions, ggplot2 will draw model predictions for a logistic regression without you having to worry about the modeling code yourself. To see how the predictions differ for linear and logistic regressions, try drawing both trend lines side by side. Spoiler: you should see a linear (straight line) trend from the linear model, and a logistic (S-shaped) trend from the logistic model.
# Using churn plot has_churned vs. time_since_first_purchase
plt_churn_vs_relationship <- ggplot(churn, aes(time_since_first_purchase, has_churned)) +
# Make it a scatter plot
geom_point() +
# Add an lm trend line, no std error ribbon, colored red
geom_smooth(method = "lm", se = FALSE, color = "red")
plt_churn_vs_relationship
## `geom_smooth()` using formula 'y ~ x'
ggplot(churn, aes(time_since_first_purchase, has_churned)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
# Add a glm trend line, no std error ribbon, binomial family
geom_smooth(method = "glm", se = FALSE, method.args = list(family = binomial))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
# Fit a logistic regression of churn vs.
# length of relationship using the churn dataset
mdl_churn_vs_relationship <- glm(has_churned ~ time_since_first_purchase, data = churn, family = binomial)
# See the result
mdl_churn_vs_relationship
##
## Call: glm(formula = has_churned ~ time_since_first_purchase, family = binomial,
## data = churn)
##
## Coefficients:
## (Intercept) time_since_first_purchase
## -0.01518 -0.35479
##
## Degrees of Freedom: 399 Total (i.e. Null); 398 Residual
## Null Deviance: 554.5
## Residual Deviance: 543.7 AIC: 547.7
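Since these coefficients are reported on the log-odds scale, one optional way to interpret them is to exponentiate them. This interpretation step is an addition, not part of the exercise:
# Exponentiate the coefficients to move from the log-odds scale to the odds
# scale: each one-unit increase in time_since_first_purchase multiplies the
# odds of churning by roughly exp(-0.355), about 0.70
exp(coefficients(mdl_churn_vs_relationship))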
There are four main ways of expressing the prediction from a logistic regression model – we’ll look at each of them over the next four exercises. Firstly, since the response variable is either “yes” or “no”, you can make a prediction of the probability of a “yes”. Here, you’ll calculate and visualize these probabilities.
# Make a data frame of predicted probabilities
explanatory_data <- tibble(time_since_first_purchase = seq(-1, 6, 0.25))
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(
mdl_churn_vs_relationship,
explanatory_data,
type = "response"
)
)
# See the result
prediction_data
## # A tibble: 29 × 2
## time_since_first_purchase has_churned
## <dbl> <dbl>
## 1 -1 0.584
## 2 -0.75 0.562
## 3 -0.5 0.540
## 4 -0.25 0.518
## 5 0 0.496
## 6 0.25 0.474
## 7 0.5 0.452
## 8 0.75 0.430
## 9 1 0.409
## 10 1.25 0.387
## # … with 19 more rows
# Update the plot
plt_churn_vs_relationship +
# Add points from prediction_data, colored yellow, size 2
geom_point(data = prediction_data, color = "yellow", size = 2)
## `geom_smooth()` using formula 'y ~ x'
When explaining your results to a non-technical audience, you may wish to side-step talking about probabilities and simply explain the most likely outcome. That is, rather than saying there is a 60% chance of a customer churning, you say that the most likely outcome is that the customer will churn. The tradeoff here is easier interpretation at the cost of nuance.
# Update the data frame
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(mdl_churn_vs_relationship, explanatory_data, type = "response"),
# Add the most likely churn outcome
most_likely_outcome = round(has_churned)
)
# See the result
prediction_data
## # A tibble: 29 × 3
## time_since_first_purchase has_churned most_likely_outcome
## <dbl> <dbl> <dbl>
## 1 -1 0.584 1
## 2 -0.75 0.562 1
## 3 -0.5 0.540 1
## 4 -0.25 0.518 1
## 5 0 0.496 0
## 6 0.25 0.474 0
## 7 0.5 0.452 0
## 8 0.75 0.430 0
## 9 1 0.409 0
## 10 1.25 0.387 0
## # … with 19 more rows
# Update the plot
plt_churn_vs_relationship +
# Add most likely outcome points from prediction_data,
# colored yellow, size 2
geom_point(
aes(y = most_likely_outcome),
data = prediction_data,
color = "yellow",
size = 2
)
## `geom_smooth()` using formula 'y ~ x'
Odds ratios compare the probability of something happening with the probability of it not happening. This is sometimes easier to reason about than probabilities, particularly when you want to make decisions about choices. For example, if a customer has a 20% chance of churning, it may be more intuitive to say “the chance of them not churning is four times higher than the chance of them churning”.
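To make that worked example concrete, here is the simple arithmetic behind it (not part of the exercise):
# A 20% chance of churning corresponds to odds of 0.2 / 0.8 = 0.25,
# so not churning is 1 / 0.25 = 4 times as likely as churning
p_churn <- 0.2
odds_churn <- p_churn / (1 - p_churn)
1 / odds_churn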
# Update the data frame
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(
mdl_churn_vs_relationship, explanatory_data,
type = "response"
),
# Add the odds ratio
odds_ratio = has_churned / (1 - has_churned)
)
# See the result
prediction_data
## # A tibble: 29 × 3
## time_since_first_purchase has_churned odds_ratio
## <dbl> <dbl> <dbl>
## 1 -1 0.584 1.40
## 2 -0.75 0.562 1.29
## 3 -0.5 0.540 1.18
## 4 -0.25 0.518 1.08
## 5 0 0.496 0.985
## 6 0.25 0.474 0.901
## 7 0.5 0.452 0.825
## 8 0.75 0.430 0.755
## 9 1 0.409 0.691
## 10 1.25 0.387 0.632
## # … with 19 more rows
# Using prediction_data, plot odds_ratio vs. time_since_first_purchase
ggplot(prediction_data, aes(time_since_first_purchase, odds_ratio)) +
# Make it a line plot
geom_line() +
# Add a dotted horizontal line at y = 1
geom_hline(yintercept = 1, linetype = "dotted")
One downside to probabilities and odds ratios for logistic regression predictions is that the prediction lines for each are curved. This makes it harder to reason about what happens to the prediction when you make a change to the explanatory variable. The logarithm of the odds ratio (the “log odds ratio”) does have a linear relationship between the predicted response and the explanatory variable. That means that as the explanatory variable changes, you don’t see dramatic changes in the response metric, only linear changes.
# Update the data frame
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(mdl_churn_vs_relationship, explanatory_data, type = "response"),
odds_ratio = has_churned / (1 - has_churned),
# Add the log odds ratio from odds_ratio
log_odds_ratio = log(odds_ratio),
# Add the log odds ratio using predict()
log_odds_ratio2 = predict(mdl_churn_vs_relationship, explanatory_data)
)
# See the result
prediction_data
## # A tibble: 29 × 5
## time_since_first_purch… has_churned odds_ratio log_odds_ratio log_odds_ratio2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1 0.584 1.40 0.340 0.340
## 2 -0.75 0.562 1.29 0.251 0.251
## 3 -0.5 0.540 1.18 0.162 0.162
## 4 -0.25 0.518 1.08 0.0735 0.0735
## 5 0 0.496 0.985 -0.0152 -0.0152
## 6 0.25 0.474 0.901 -0.104 -0.104
## 7 0.5 0.452 0.825 -0.193 -0.193
## 8 0.75 0.430 0.755 -0.281 -0.281
## 9 1 0.409 0.691 -0.370 -0.370
## 10 1.25 0.387 0.632 -0.459 -0.459
## # … with 19 more rows
# Update the plot
ggplot(prediction_data, aes(time_since_first_purchase, odds_ratio)) +
geom_line() +
geom_hline(yintercept = 1, linetype = "dotted") +
# Use a logarithmic y-scale
scale_y_log10()
A confusion matrix (occasionally called a confusion table) is the basis of all performance metrics for models with a categorical response (such as a logistic regression). It contains the counts of each actual response-predicted response pair. In this case, where there are two possible responses (churn or not churn), there are four overall outcomes: 1. The customer churned and the model predicted that. 2. The customer churned but the model didn’t predict that. 3. The customer didn’t churn but the model predicted they did. 4. The customer didn’t churn and the model predicted that.
# Get the actual responses from the dataset
actual_response <- churn$has_churned
# Get the "most likely" predicted responses from the model
predicted_response <- round(fitted(mdl_churn_vs_relationship))
# Create a table of counts
outcomes <- table(predicted_response, actual_response)
# See the result
outcomes
## actual_response
## predicted_response 0 1
## 0 112 76
## 1 88 124
# Convert outcomes to a yardstick confusion matrix
confusion <- conf_mat(outcomes)
# Plot the confusion matrix
autoplot(confusion)
# Get performance metrics for the confusion matrix
summary(confusion, event_level = "second")
## # A tibble: 13 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.59
## 2 kap binary 0.18
## 3 sens binary 0.62
## 4 spec binary 0.56
## 5 ppv binary 0.585
## 6 npv binary 0.596
## 7 mcc binary 0.180
## 8 j_index binary 0.180
## 9 bal_accuracy binary 0.59
## 10 detection_prevalence binary 0.53
## 11 precision binary 0.585
## 12 recall binary 0.62
## 13 f_meas binary 0.602
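As a sanity check on the first metric, accuracy can also be computed directly from the outcomes table. A sketch, not part of the exercise:
# Accuracy is the proportion of correct predictions:
# (112 + 124) / 400 = 0.59, matching the yardstick value above
(outcomes[1, 1] + outcomes[2, 2]) / sum(outcomes)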
Extend your linear regression skills to “parallel slopes” regression, with one numeric and one categorical explanatory variable. This is the first step towards conquering multiple linear regression.
In Introduction to Regression in R, you learned to fit linear regression models with a single explanatory variable. In many cases, using only one explanatory variable limits the accuracy of predictions. That means that to truly master linear regression, you need to be able to include multiple explanatory variables.
The case when there is one numeric explanatory variable and one categorical explanatory variable is sometimes called a “parallel slopes” linear regression due to the shape of the predictions—more on that in the next exercise.
# Fit a linear regr'n of price_twd_msq vs. n_convenience
mdl_price_vs_conv <- lm(price_twd_msq ~ n_convenience, data = taiwan_real_estate)
# See the result
mdl_price_vs_conv
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) n_convenience
## 8.2242 0.7981
# Fit a linear regr'n of price_twd_msq vs. house_age_years, no intercept
mdl_price_vs_age <- lm(price_twd_msq ~ house_age_years + 0, data = taiwan_real_estate)
# See the result
mdl_price_vs_age
##
## Call:
## lm(formula = price_twd_msq ~ house_age_years + 0, data = taiwan_real_estate)
##
## Coefficients:
## house_age_years0 to 15 house_age_years15 to 30 house_age_years30 to 45
## 12.637 9.877 11.393
# Fit a linear regr'n of price_twd_msq vs. n_convenience
# plus house_age_years, no intercept
mdl_price_vs_both <- lm(price_twd_msq ~ n_convenience + house_age_years + 0, data = taiwan_real_estate)
# See the result
mdl_price_vs_both
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience + house_age_years +
## 0, data = taiwan_real_estate)
##
## Coefficients:
## n_convenience house_age_years0 to 15 house_age_years15 to 30
## 0.7915 9.4133 7.0852
## house_age_years30 to 45
## 7.5110
Being able to see the predictions made by a model makes it easier to understand. In the case where there is only one explanatory variable, ggplot lets you do this without any manual calculation or messing about.
To visualize the relationship between a numeric explanatory variable and the numeric response, you can draw a scatter plot with a linear trend line.
To visualize the relationship between a categorical explanatory variable and the numeric response, you can draw a box plot.
# Using taiwan_real_estate, plot price_twd_msq vs. n_convenience
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq)) +
# Add a point layer
geom_point() +
# Add a smooth trend line using linear regr'n, no ribbon
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Using taiwan_real_estate, plot price_twd_msq vs. house_age_years
ggplot(taiwan_real_estate, aes(house_age_years, price_twd_msq)) +
# Add a box plot layer
geom_boxplot() + coord_flip()
The two plots in the previous exercise gave very different predictions: one gave a predicted response that increased linearly with a numeric variable; the other gave a fixed response for each category. The only sensible way to reconcile these two conflicting predictions is to incorporate both explanatory variables in the model at once.
# Using taiwan_real_estate, plot price_twd_msq vs. n_convenience
# colored by house_age_years
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq, color = house_age_years)) +
# Add a point layer
geom_point() +
# Add parallel slopes, no ribbon
geom_parallel_slopes(se = FALSE)
While ggplot can automatically show you model predictions, in order to get those values to program with, you’ll need to do the calculations yourself.
Just as with the case of a single explanatory variable, the workflow has two steps: create a data frame of explanatory variables, then add a column of predictions. To make sure you’ve got the right answer, you can add your predictions to the ggplot with the geom_parallel_slopes() lines.
# Make a grid of explanatory data
explanatory_data <- expand_grid(
# Set n_convenience to zero to ten
n_convenience = 0:10,
# Set house_age_years to the unique values of that variable
house_age_years = unique(taiwan_real_estate$house_age_years))
# See the result
explanatory_data
## # A tibble: 33 × 2
## n_convenience house_age_years
## <int> <ord>
## 1 0 30 to 45
## 2 0 15 to 30
## 3 0 0 to 15
## 4 1 30 to 45
## 5 1 15 to 30
## 6 1 0 to 15
## 7 2 30 to 45
## 8 2 15 to 30
## 9 2 0 to 15
## 10 3 30 to 45
## # … with 23 more rows
# Add predictions to the data frame
prediction_data <- explanatory_data %>%
mutate(
price_twd_msq = predict(
mdl_price_vs_both, explanatory_data
)
)
# See the result
prediction_data
## # A tibble: 33 × 3
## n_convenience house_age_years price_twd_msq
## <int> <ord> <dbl>
## 1 0 30 to 45 7.51
## 2 0 15 to 30 7.09
## 3 0 0 to 15 9.41
## 4 1 30 to 45 8.30
## 5 1 15 to 30 7.88
## 6 1 0 to 15 10.2
## 7 2 30 to 45 9.09
## 8 2 15 to 30 8.67
## 9 2 0 to 15 11.0
## 10 3 30 to 45 9.89
## # … with 23 more rows
taiwan_real_estate %>%
ggplot(aes(n_convenience, price_twd_msq, color = house_age_years)) +
geom_point() +
geom_parallel_slopes(se = FALSE) +
# Add points using prediction_data, with size 5 and shape 15
geom_point(data = prediction_data, size = 5, shape = 15)
As with simple linear regression, you can manually calculate the predictions from the model coefficients. The only change for the parallel slopes case is that the intercept is different for each category of the categorical explanatory variable. That means you need to consider the case for each category separately.
# Get the coefficients from mdl_price_vs_both
coeffs <- coefficients(mdl_price_vs_both)
# Extract the slope coefficient
slope <- coeffs[1]
# Extract the intercept coefficient for 0 to 15
intercept_0_15 <- coeffs[2]
# Extract the intercept coefficient for 15 to 30
intercept_15_30 <- coeffs[3]
# Extract the intercept coefficient for 30 to 45
intercept_30_45 <- coeffs[4]
prediction_data <- explanatory_data %>%
mutate(
# Consider the 3 cases to choose the intercept
intercept = case_when(
house_age_years == "0 to 15" ~ intercept_0_15,
house_age_years == "15 to 30" ~ intercept_15_30,
house_age_years == "30 to 45" ~ intercept_30_45
),
# Manually calculate the predictions
price_twd_msq = intercept + slope * n_convenience
)
# See the results
prediction_data
## # A tibble: 33 × 4
## n_convenience house_age_years intercept price_twd_msq
## <int> <ord> <dbl> <dbl>
## 1 0 30 to 45 7.51 7.51
## 2 0 15 to 30 7.09 7.09
## 3 0 0 to 15 9.41 9.41
## 4 1 30 to 45 7.51 8.30
## 5 1 15 to 30 7.09 7.88
## 6 1 0 to 15 9.41 10.2
## 7 2 30 to 45 7.51 9.09
## 8 2 15 to 30 7.09 8.67
## 9 2 0 to 15 9.41 11.0
## 10 3 30 to 45 7.51 9.89
## # … with 23 more rows
Recall that the coefficient of determination is a measure of how well the linear regression line fits the observed values. An important motivation for including several explanatory variables in a linear regression is that you can improve the fit compared to considering only a single explanatory variable. The other common metric for assessing model fit is the residual standard error (RSE), which measures the typical size of the residuals.
mdl_price_vs_conv %>%
# Get the model-level coefficients
glance() %>%
# Select the coeffs of determination
select(r.squared, adj.r.squared)
## # A tibble: 1 × 2
## r.squared adj.r.squared
## <dbl> <dbl>
## 1 0.326 0.324
# Get the coeffs of determination for mdl_price_vs_age
mdl_price_vs_age %>%
glance() %>%
select(r.squared, adj.r.squared)
## # A tibble: 1 × 2
## r.squared adj.r.squared
## <dbl> <dbl>
## 1 0.896 0.895
# Get the coeffs of determination for mdl_price_vs_both
mdl_price_vs_both %>%
glance() %>%
select(r.squared, adj.r.squared)
## # A tibble: 1 × 2
## r.squared adj.r.squared
## <dbl> <dbl>
## 1 0.931 0.931
mdl_price_vs_conv %>%
# Get the model-level coefficients
glance() %>%
# Pull out the RSE
pull(sigma)
## [1] 3.383888
# Get the RSE for mdl_price_vs_age
mdl_price_vs_age %>%
glance() %>% pull(sigma)
## [1] 3.950184
# Get the RSE for mdl_price_vs_both
mdl_price_vs_both %>%
glance() %>% pull(sigma)
## [1] 3.21346
Explore the effect of interactions between explanatory variables. Considering interactions allows for more realistic models that can have better predictive power. You’ll also deal with Simpson’s Paradox: a non-intuitive result that arises when you have multiple explanatory variables.
The model you ran on the whole dataset fits some parts of the data better than others. It’s worth taking a look at what happens when you run a linear model on different parts of the dataset separately, to see if each model agrees or disagrees with the others.
# Filter for rows where house age is 0 to 15 years
taiwan_0_to_15 <- taiwan_real_estate %>%
filter(house_age_years == "0 to 15")
# Filter for rows where house age is 15 to 30 years
taiwan_15_to_30 <- taiwan_real_estate %>%
filter(house_age_years == "15 to 30")
# Filter for rows where house age is 30 to 45 years
taiwan_30_to_45 <- taiwan_real_estate %>%
filter(house_age_years == "30 to 45")
# Model price vs. no. convenience stores using 0 to 15 data
mdl_0_to_15 <- lm(price_twd_msq ~ n_convenience, taiwan_0_to_15)
# Model price vs. no. convenience stores using 15 to 30 data
mdl_15_to_30 <- lm(price_twd_msq ~ n_convenience, taiwan_15_to_30)
# Model price vs. no. convenience stores using 30 to 45 data
mdl_30_to_45 <- lm(price_twd_msq ~ n_convenience, taiwan_30_to_45)
# See the results
mdl_0_to_15
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_0_to_15)
##
## Coefficients:
## (Intercept) n_convenience
## 9.2417 0.8336
mdl_15_to_30
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_15_to_30)
##
## Coefficients:
## (Intercept) n_convenience
## 6.8719 0.8519
mdl_30_to_45
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_30_to_45)
##
## Coefficients:
## (Intercept) n_convenience
## 8.1131 0.6687
In order to see what each of the models for the individual categories is doing, it’s helpful to make predictions from them. The flow is exactly the same as the flow for making predictions with the whole-dataset model, though remember that you only have a single explanatory variable in these models (so expand_grid() isn’t needed).
# Create a tibble of explanatory data, setting
# no. of conv stores to 0 to 10
explanatory_data <- tibble(n_convenience = 0:10)
# Add column of predictions using "0 to 15" model and explanatory data
prediction_data_0_to_15 <- explanatory_data %>%
mutate(price_twd_msq = predict(mdl_0_to_15, explanatory_data))
# Same again, with "15 to 30"
prediction_data_15_to_30 <- explanatory_data %>%
mutate(price_twd_msq = predict(mdl_15_to_30, explanatory_data))
# Same again, with "30 to 45"
prediction_data_30_to_45 <- explanatory_data %>%
mutate(price_twd_msq = predict(mdl_30_to_45, explanatory_data))
When you use geom_smooth() in a ggplot with an aesthetic that splits the dataset into groups and draws a line for each group (like the color aesthetic), you get multiple trend lines. This is the same as running a model on each group separately, so we get a chance to test our predictions against ggplot’s.
# Using taiwan_real_estate, plot price vs. no. of conv. stores
# colored by house age
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq, color = house_age_years)) +
# Make it a scatter plot
geom_point() +
# Add smooth linear regression trend lines, no ribbon
geom_smooth(method = "lm", se = FALSE) +
# Add points using prediction_data_0_to_15, colored red, size 3, shape 15
geom_point(data = prediction_data_0_to_15, color = "red", size = 3, shape = 15) +
# Add points using prediction_data_15_to_30, colored green, size 3, shape 15
geom_point(data = prediction_data_15_to_30, color = "green", size = 3, shape = 15) +
# Add points using prediction_data_30_to_45, colored blue, size 3, shape 15
geom_point(data = prediction_data_30_to_45, color = "blue", size = 3, shape = 15)
## `geom_smooth()` using formula 'y ~ x'
To test which approach is best—the whole dataset model or the models for each house age category—you need to calculate some metrics. Here, you’ll compare the coefficient of determination and the residual standard error for each model.
mdl_all_ages <- lm(price_twd_msq ~ n_convenience, taiwan_real_estate)
# Get the coeff. of determination for mdl_all_ages
mdl_all_ages %>%
glance() %>%
pull(r.squared)
## [1] 0.3260466
# Get the coeff. of determination for mdl_0_to_15
mdl_0_to_15 %>%
glance() %>%
pull(r.squared)
## [1] 0.3120536
# Get the coeff. of determination for mdl_15_to_30
mdl_15_to_30 %>%
glance() %>%
pull(r.squared)
## [1] 0.4424605
# Get the coeff. of determination for mdl_30_to_45
mdl_30_to_45 %>%
glance() %>%
pull(r.squared)
## [1] 0.3125713
# Get the RSE for mdl_all
mdl_all_ages %>%
glance() %>%
pull(sigma)
## [1] 3.383888
# Get the RSE for mdl_0_to_15
mdl_0_to_15 %>%
glance() %>%
pull(sigma)
## [1] 3.564127
# Get the RSE for mdl_15_to_30
mdl_15_to_30 %>%
glance() %>%
pull(sigma)
## [1] 2.585273
# Get the RSE for mdl_30_to_45
mdl_30_to_45 %>%
glance() %>%
pull(sigma)
## [1] 3.239037
So far you’ve used a single parallel slopes model, which gave an OK fit for the whole dataset, and then three separate models, one for each house age category, which gave a better fit for each individual category but were clunky because you had three separate models to work with and explain. Ideally, you’d have a single model that had all the predictive power of the individual models.
# Model price vs both with an interaction using "times" syntax
lm(price_twd_msq ~ n_convenience*house_age_years, taiwan_real_estate)
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience * house_age_years,
## data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) n_convenience
## 8.07558 0.78473
## house_age_years.L house_age_years.Q
## -0.79803 1.47418
## n_convenience:house_age_years.L n_convenience:house_age_years.Q
## -0.11659 -0.08228
# Model price vs both with an interaction using "colon" syntax
lm(formula = price_twd_msq ~ n_convenience + house_age_years + n_convenience:house_age_years, data = taiwan_real_estate)
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience + house_age_years +
## n_convenience:house_age_years, data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) n_convenience
## 8.07558 0.78473
## house_age_years.L house_age_years.Q
## -0.79803 1.47418
## n_convenience:house_age_years.L n_convenience:house_age_years.Q
## -0.11659 -0.08228
# Model price vs. house age plus an interaction, no intercept
mdl_readable_inter <- lm(formula = price_twd_msq ~ house_age_years + n_convenience:house_age_years + 0, data = taiwan_real_estate)
# See the result
mdl_readable_inter
##
## Call:
## lm(formula = price_twd_msq ~ house_age_years + n_convenience:house_age_years +
## 0, data = taiwan_real_estate)
##
## Coefficients:
## house_age_years0 to 15 house_age_years15 to 30
## 9.2417 6.8719
## house_age_years30 to 45 house_age_years0 to 15:n_convenience
## 8.1131 0.8336
## house_age_years15 to 30:n_convenience house_age_years30 to 45:n_convenience
## 0.8519 0.6687
# Get coefficients for mdl_0_to_15
coefficients(mdl_0_to_15)
##   (Intercept) n_convenience 
##     9.2417022     0.8335867
# Get coefficients for mdl_15_to_30
coefficients(mdl_15_to_30)
##   (Intercept) n_convenience 
##     6.8719186     0.8519172
# Get coefficients for mdl_30_to_45
coefficients(mdl_30_to_45)
##   (Intercept) n_convenience 
##     8.1131235     0.6686981
# These match the coefficients of mdl_readable_inter above
As with every other regression model you’ve created, the fun part is making predictions. Fortunately, the code flow for this case is the same as the one without interactions – R can handle calculating the interactions without any extra prompting from you. The only thing you need to remember is the trick for getting combinations of explanatory variables.
# Make a grid of explanatory data
explanatory_data <- expand_grid(
# Set n_convenience to zero to ten
n_convenience = 0:10,
# Set house_age_years to the unique values of that variable
house_age_years = unique(taiwan_real_estate$house_age_years))
# See the result
explanatory_data
## # A tibble: 33 × 2
## n_convenience house_age_years
## <int> <ord>
## 1 0 30 to 45
## 2 0 15 to 30
## 3 0 0 to 15
## 4 1 30 to 45
## 5 1 15 to 30
## 6 1 0 to 15
## 7 2 30 to 45
## 8 2 15 to 30
## 9 2 0 to 15
## 10 3 30 to 45
## # … with 23 more rows
mdl_price_vs_both_inter <- lm(formula = price_twd_msq ~ house_age_years + n_convenience:house_age_years +
0, data = taiwan_real_estate)
# Add predictions to the data frame
prediction_data <- explanatory_data %>%
mutate(price_twd_msq = predict(mdl_price_vs_both_inter, explanatory_data))
# See the result
prediction_data
## # A tibble: 33 × 3
##    n_convenience house_age_years price_twd_msq
## <int> <ord> <dbl>
## 1 0 30 to 45 8.11
## 2 0 15 to 30 6.87
## 3 0 0 to 15 9.24
## 4 1 30 to 45 8.78
## 5 1 15 to 30 7.72
## 6 1 0 to 15 10.1
## 7 2 30 to 45 9.45
## 8 2 15 to 30 8.58
## 9 2 0 to 15 10.9
## 10 3 30 to 45 10.1
## # … with 23 more rows
In order to understand how predict() works, it’s time to calculate the predictions manually again. For this model, there are three separate lines to calculate, and for each one, the prediction is an intercept plus a slope times the numeric explanatory value. The tricky part is getting the right intercept and the right slope for each case.
# Get the coefficients from mdl_price_vs_both_inter
coeffs <- coefficients(mdl_price_vs_both_inter)
# Get the intercept for 0 to 15 year age group
intercept_0_15 <- coeffs[1]
# Get the intercept for 15 to 30 year age group
intercept_15_30 <- coeffs[2]
# Get the intercept for 30 to 45 year age group
intercept_30_45 <- coeffs[3]
# Get the slope for 0 to 15 year age group
slope_0_15 <- coeffs[4]
# Get the slope for 15 to 30 year age group
slope_15_30 <- coeffs[5]
# Get the slope for 30 to 45 year age group
slope_30_45 <- coeffs[6]
prediction_data <- explanatory_data %>%
mutate(
# Consider the 3 cases to choose the price
price_twd_msq = case_when(
house_age_years == "0 to 15" ~ intercept_0_15 + slope_0_15 * n_convenience,
house_age_years == "15 to 30" ~ intercept_15_30 + slope_15_30 * n_convenience,
house_age_years == "30 to 45" ~ intercept_30_45 + slope_30_45 * n_convenience ))
# See the result
prediction_data
## # A tibble: 33 × 3
## n_convenience house_age_years price_twd_msq
## <int> <ord> <dbl>
## 1 0 30 to 45 8.11
## 2 0 15 to 30 6.87
## 3 0 0 to 15 9.24
## 4 1 30 to 45 8.78
## 5 1 15 to 30 7.72
## 6 1 0 to 15 10.1
## 7 2 30 to 45 9.45
## 8 2 15 to 30 8.58
## 9 2 0 to 15 10.9
## 10 3 30 to 45 10.1
## # … with 23 more rows
Sometimes modeling a whole dataset suggests trends that disagree with models on separate parts of that dataset. This is known as Simpson’s paradox. In the most extreme case, you may see a positive slope on the whole dataset, and negative slopes on every subset of that dataset (or the other way around).
# Take a glimpse at the dataset
glimpse(auctions)
## Rows: 10,681
## Columns: 9
## $ auctionid <dbl> 1638893549, 1638893549, 1638893549, 1638893549, 163889354…
## $ bid <dbl> 175.00, 100.00, 120.00, 150.00, 177.50, 1.00, 1.25, 1.50,…
## $ bidtime <dbl> 2.230949, 2.600116, 2.600810, 2.601076, 2.909826, 0.35585…
## $ bidder <chr> "schadenfreud", "chuik", "kiwisstuff", "kiwisstuff", "eli…
## $ bidderrate <dbl> 0, 0, 2, 2, 4, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 15, 15, 0, 0…
## $ openbid <dbl> 99, 99, 99, 99, 99, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ price <dbl> 177.5, 177.5, 177.5, 177.5, 177.5, 355.0, 355.0, 355.0, 3…
## $ item <chr> "Cartier wristwatch", "Cartier wristwatch", "Cartier wris…
## $ auction_type <chr> "3 day auction", "3 day auction", "3 day auction", "3 day…
# Model price vs. opening bid using auctions
mdl_price_vs_openbid <- lm(price ~ openbid, data = auctions)
# See the result
mdl_price_vs_openbid
##
## Call:
## lm(formula = price ~ openbid, data = auctions)
##
## Coefficients:
## (Intercept) openbid
## 268.434 1.275
# Using auctions, plot price vs. opening bid as a
# scatter plot with linear regression trend lines
ggplot(auctions, aes(openbid, price)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
# Fit linear regression of price vs. opening bid and auction
# type, with an interaction.
mdl_price_vs_both <- lm(price ~ openbid * auction_type, auctions)
# See the result
mdl_price_vs_both
##
## Call:
## lm(formula = price ~ openbid * auction_type, data = auctions)
##
## Coefficients:
## (Intercept) openbid
## 166.920 1.422
## auction_type5 day auction auction_type7 day auction
## 120.023 124.450
## openbid:auction_type5 day auction openbid:auction_type7 day auction
## -0.231 -0.130
ggplot(auctions, aes(openbid, price, color = auction_type)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
See how modeling, and linear regression in particular, makes it easy to work with more than two explanatory variables. Once you’ve mastered fitting linear regression models, you’ll get to implement your own linear regression algorithm.
Since computer screens and paper are both two-dimensional objects, most plots are best suited to visualizing two variables at once. For the case of three continuous variables, you can draw a 3D scatter plot, but perspective problems usually make it difficult to interpret. There are some “flat” alternatives that are easier to interpret, though they require a little more thought to construct.
# With taiwan_real_estate, draw a 3D scatter plot of
# no. of conv. stores, sqrt dist to MRT, and price
taiwan_real_estate %$%
scatter3D(n_convenience, sqrt(dist_to_mrt_m), price_twd_msq)
# Using taiwan_real_estate, plot sqrt dist to MRT vs.
# no. of conv stores, colored by price
ggplot(taiwan_real_estate, aes(n_convenience, sqrt(dist_to_mrt_m), color = price_twd_msq)) +
# Make it a scatter plot
geom_point() +
# Use the continuous viridis plasma color scale
scale_color_viridis_c(option = "plasma")
You already saw how to make a model and predictions with a numeric and a categorical explanatory variable. The code for modeling and predicting with two numeric explanatory variables is the same, other than a slight difference in how you specify the explanatory variables to make predictions against.
# Fit a linear regression of price vs. no. of conv. stores
# and sqrt dist. to nearest MRT, no interaction
mdl_price_vs_conv_dist <- lm(price_twd_msq ~ n_convenience + sqrt(dist_to_mrt_m), data = taiwan_real_estate)
# See the result
mdl_price_vs_conv_dist
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience + sqrt(dist_to_mrt_m),
## data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) n_convenience sqrt(dist_to_mrt_m)
## 15.1038 0.2142 -0.1573
# Create expanded grid of explanatory variables with no. of conv. stores and dist. to nearest MRT
explanatory_data <- expand_grid(
n_convenience = 0:10,
dist_to_mrt_m = seq(0, 80, 10) ^ 2)
# Add predictions using mdl_price_vs_conv_dist and explanatory_data
prediction_data <- explanatory_data %>%
mutate(
price_twd_msq = predict(mdl_price_vs_conv_dist, explanatory_data))
# See the result
prediction_data
## # A tibble: 99 × 3
## n_convenience dist_to_mrt_m price_twd_msq
## <int> <dbl> <dbl>
## 1 0 0 15.1
## 2 0 100 13.5
## 3 0 400 12.0
## 4 0 900 10.4
## 5 0 1600 8.81
## 6 0 2500 7.24
## 7 0 3600 5.67
## 8 0 4900 4.09
## 9 0 6400 2.52
## 10 1 0 15.3
## # … with 89 more rows
# Add predictions to plot
ggplot(
taiwan_real_estate,
aes(n_convenience, sqrt(dist_to_mrt_m), color = price_twd_msq)) +
geom_point() +
scale_color_viridis_c(option = "plasma")+
# Add prediction points colored yellow, size 3
geom_point(data = prediction_data, color = "yellow", size = 3)
Just as in the case with one numeric and one categorical explanatory variable, it is possible that numeric explanatory variables can interact. With this model structure, you’ll get a third slope coefficient: one for each explanatory variable and one for the interaction.
# Fit a linear regression of price vs. no. of conv. stores
# and sqrt dist. to nearest MRT, with interaction
mdl_price_vs_conv_dist <- lm(price_twd_msq ~ n_convenience * sqrt(dist_to_mrt_m), data = taiwan_real_estate)
# See the result
mdl_price_vs_conv_dist
##
## Call:
## lm(formula = price_twd_msq ~ n_convenience * sqrt(dist_to_mrt_m),
## data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) n_convenience
## 14.73733 0.42425
## sqrt(dist_to_mrt_m) n_convenience:sqrt(dist_to_mrt_m)
## -0.14121 -0.01124
# Create expanded grid of explanatory variables with no. of conv. stores and dist. to nearest MRT
explanatory_data <- expand_grid(
n_convenience = 0:10,
dist_to_mrt_m = seq(0, 80, 10) ^ 2)
# Add predictions using mdl_price_vs_conv_dist and explanatory_data
prediction_data <- explanatory_data %>%
mutate(
price_twd_msq = predict(mdl_price_vs_conv_dist, explanatory_data))
# See the result
prediction_data
## # A tibble: 99 × 3
## n_convenience dist_to_mrt_m price_twd_msq
## <int> <dbl> <dbl>
## 1 0 0 14.7
## 2 0 100 13.3
## 3 0 400 11.9
## 4 0 900 10.5
## 5 0 1600 9.09
## 6 0 2500 7.68
## 7 0 3600 6.26
## 8 0 4900 4.85
## 9 0 6400 3.44
## 10 1 0 15.2
## # … with 89 more rows
# Add predictions to plot
ggplot(
taiwan_real_estate,
aes(n_convenience, sqrt(dist_to_mrt_m), color = price_twd_msq)) +
geom_point() +
scale_color_viridis_c(option = "plasma") +
# Add prediction points colored yellow, size 3
geom_point(data = prediction_data, color = "yellow", size = 3)
As you begin to consider more variables, plotting them all at the same time becomes increasingly difficult. In addition to using x and y scales for two numeric variables, you can use color for a third numeric variable, and you can use faceting for categorical variables. And that’s about your limit before the plots become too difficult to interpret. There are some specialist plot types like correlation heatmaps and parallel coordinates plots that will handle more variables, but they give you much less information about each variable, and they aren’t great for visualizing model predictions.
# Using taiwan_real_estate, plot no. of conv. stores vs. sqrt of dist. to MRT, colored by house price
ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), n_convenience, color = price_twd_msq)) +
# Make it a scatter plot
geom_point() +
# Use the continuous viridis plasma color scale
scale_color_viridis_c(option = "plasma") +
# Facet, wrapped by house age
facet_wrap(vars(house_age_years))
Once you have three explanatory variables, the number of options for specifying interactions increases. You can specify no interactions. You can specify 2-way interactions, which gives you model coefficients for each pair of variables. The third option is to specify all the interactions, which means the three 2-way interactions plus an interaction between all three explanatory variables.
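The no-interaction option isn’t fitted in the exercises below; as a sketch, it would look like this (the model name is hypothetical):
# Model price vs. all three explanatory variables, no interactions,
# no global intercept
mdl_price_vs_all_no_inter <- lm(
price_twd_msq ~ sqrt(dist_to_mrt_m) + n_convenience + house_age_years + 0,
data = taiwan_real_estate)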
# Model price vs. sqrt dist. to MRT station, no. of conv.
# stores & house age, no global intercept, 3-way interactions
mdl_price_vs_all_3_way_inter <- lm(
price_twd_msq ~ sqrt(dist_to_mrt_m) * n_convenience * house_age_years + 0,
data = taiwan_real_estate)
# See the result
mdl_price_vs_all_3_way_inter
##
## Call:
## lm(formula = price_twd_msq ~ sqrt(dist_to_mrt_m) * n_convenience *
## house_age_years + 0, data = taiwan_real_estate)
##
## Coefficients:
## sqrt(dist_to_mrt_m)
## -0.130311
## n_convenience
## 0.423346
## house_age_years0 to 15
## 16.046849
## house_age_years15 to 30
## 13.760066
## house_age_years30 to 45
## 12.088773
## sqrt(dist_to_mrt_m):n_convenience
## -0.008200
## sqrt(dist_to_mrt_m):house_age_years.L
## 0.043333
## sqrt(dist_to_mrt_m):house_age_years.Q
## -0.004881
## n_convenience:house_age_years.L
## 0.047178
## n_convenience:house_age_years.Q
## -0.036750
## sqrt(dist_to_mrt_m):n_convenience:house_age_years.L
## 0.003112
## sqrt(dist_to_mrt_m):n_convenience:house_age_years.Q
## 0.004917
# Model price vs. sqrt dist. to MRT station, no. of conv.
# stores & house age, no global intercept, 2-way interactions
mdl_price_vs_all_2_way_inter <- lm(
price_twd_msq ~ (sqrt(dist_to_mrt_m) + n_convenience + house_age_years) ^ 2 + 0,
data = taiwan_real_estate)
# See the result
mdl_price_vs_all_2_way_inter
##
## Call:
## lm(formula = price_twd_msq ~ (sqrt(dist_to_mrt_m) + n_convenience +
## house_age_years)^2 + 0, data = taiwan_real_estate)
##
## Coefficients:
## sqrt(dist_to_mrt_m) n_convenience
## -0.128758 0.430422
## house_age_years0 to 15 house_age_years15 to 30
## 16.026633 13.880791
## house_age_years30 to 45 sqrt(dist_to_mrt_m):n_convenience
## 11.926904 -0.008956
## sqrt(dist_to_mrt_m):house_age_years.L sqrt(dist_to_mrt_m):house_age_years.Q
## 0.048223 0.002040
## n_convenience:house_age_years.L n_convenience:house_age_years.Q
## 0.101411 0.064178
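For completeness, the remaining option from the list above, a model with no interactions at all, would look like this. This is only a sketch; mdl_price_vs_all_no_inter is an illustrative name and isn't used elsewhere in these notes.
# Sketch: the no-interaction variant of the same model
mdl_price_vs_all_no_inter <- lm(
  price_twd_msq ~ sqrt(dist_to_mrt_m) + n_convenience + house_age_years + 0,
  data = taiwan_real_estate)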
You’ve followed the prediction workflow several times now with different combinations of explanatory variables. Time to try it once more on the model with three explanatory variables. Here, you’ll use the model with 3-way interactions, though the code is the same when using any of the three models from the previous exercise.
# Make a grid of explanatory data
explanatory_data <- expand_grid(
# Set dist_to_mrt_m a seq from 0 to 80 by 10s, squared
dist_to_mrt_m = seq(0, 80, 10)^2,
# Set n_convenience to 0 to 10
n_convenience = 0:10,
# Set house_age_years to the unique values of that variable
house_age_years = unique(taiwan_real_estate$house_age_years))
# See the result
explanatory_data
## # A tibble: 297 × 3
## dist_to_mrt_m n_convenience house_age_years
## <dbl> <int> <ord>
## 1 0 0 30 to 45
## 2 0 0 15 to 30
## 3 0 0 0 to 15
## 4 0 1 30 to 45
## 5 0 1 15 to 30
## 6 0 1 0 to 15
## 7 0 2 30 to 45
## 8 0 2 15 to 30
## 9 0 2 0 to 15
## 10 0 3 30 to 45
## # … with 287 more rows
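The exercise stops at the grid, but finishing the workflow follows the same pattern as before. A sketch of the remaining step, reusing mdl_price_vs_all_3_way_inter from above:
# Sketch: add predictions from the 3-way interaction model to the grid
prediction_data <- explanatory_data %>%
  mutate(
    price_twd_msq = predict(mdl_price_vs_all_3_way_inter, explanatory_data))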
Extend your logistic regression skills to multiple explanatory variables. Understand the logistic distribution, which underpins this form of regression. Finally, implement your own logistic regression algorithm.
Logistic regression also supports multiple explanatory variables. Plotting runs into the same issues as in the linear regression case: it quickly becomes difficult to include more numeric variables in the plot. Here we'll look at the case of two numeric explanatory variables, and the solution is essentially the same as before: use color to denote the response.
# Using churn, plot recency vs. length of relationship,
# colored by churn status
ggplot(churn, aes(time_since_first_purchase, time_since_last_purchase, color = has_churned)) +
# Make it a scatter plot, with transparency 0.5
geom_point(alpha = 0.5) +
# Use a 2-color gradient split at 0.5
scale_color_gradient2(midpoint = 0.5) +
# Use the black and white theme
theme_bw()
To include multiple explanatory variables in a logistic regression model, the syntax is the same as for linear regression. The only change, as in the simple case, is that you run a generalized linear model with a binomial error family.
# Fit a logistic regression of churn status vs. length of
# relationship, recency, and an interaction
mdl_churn_vs_both_inter <- glm(
  has_churned ~ time_since_first_purchase * time_since_last_purchase,
  family = binomial, data = churn)
# See the result
mdl_churn_vs_both_inter
##
## Call: glm(formula = has_churned ~ time_since_first_purchase * time_since_last_purchase,
## family = binomial, data = churn)
##
## Coefficients:
## (Intercept)
## -0.1505
## time_since_first_purchase
## -0.6376
## time_since_last_purchase
## 0.4233
## time_since_first_purchase:time_since_last_purchase
## 0.1123
##
## Degrees of Freedom: 399 Total (i.e. Null); 396 Residual
## Null Deviance: 554.5
## Residual Deviance: 519.8 AIC: 527.8
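As a side note (not part of the exercise), the coefficients above are on the log-odds scale. Exponentiating them puts them on the odds-ratio scale, which is often easier to talk about; with an interaction present, the interpretation of the interaction term is more subtle, so treat this only as a quick sketch.
# Sketch: view the coefficients as odds-ratio-scale multipliers
exp(coefficients(mdl_churn_vs_both_inter))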
As with linear regression, the joy of logistic regression is that you can make predictions. Let’s step through the prediction flow one more time!
# Make a grid of explanatory data
explanatory_data <- expand.grid(
# Set len. relationship to seq from -2 to 4 in steps of 0.1
time_since_first_purchase = seq(-2, 4, 0.1),
# Set recency to seq from -1 to 6 in steps of 0.1
time_since_last_purchase = seq(-1, 6, 0.1))
# See the result
head(explanatory_data, 20)
## time_since_first_purchase time_since_last_purchase
## 1 -2.0 -1
## 2 -1.9 -1
## 3 -1.8 -1
## 4 -1.7 -1
## 5 -1.6 -1
## 6 -1.5 -1
## 7 -1.4 -1
## 8 -1.3 -1
## 9 -1.2 -1
## 10 -1.1 -1
## 11 -1.0 -1
## 12 -0.9 -1
## 13 -0.8 -1
## 14 -0.7 -1
## 15 -0.6 -1
## 16 -0.5 -1
## 17 -0.4 -1
## 18 -0.3 -1
## 19 -0.2 -1
## 20 -0.1 -1
# Add a column of predictions using mdl_churn_vs_both_inter
# and explanatory_data with type response
prediction_data <- explanatory_data %>%
mutate(
has_churned = predict(
mdl_churn_vs_both_inter, explanatory_data, type = "response"
)
)
# See the result
head(prediction_data, 20)
## time_since_first_purchase time_since_last_purchase has_churned
## 1 -2.0 -1 0.7162561
## 2 -1.9 -1 0.7007719
## 3 -1.8 -1 0.6848146
## 4 -1.7 -1 0.6684090
## 5 -1.6 -1 0.6515839
## 6 -1.5 -1 0.6343721
## 7 -1.4 -1 0.6168101
## 8 -1.3 -1 0.5989378
## 9 -1.2 -1 0.5807985
## 10 -1.1 -1 0.5624381
## 11 -1.0 -1 0.5439051
## 12 -0.9 -1 0.5252497
## 13 -0.8 -1 0.5065235
## 14 -0.7 -1 0.4877790
## 15 -0.6 -1 0.4690688
## 16 -0.5 -1 0.4504452
## 17 -0.4 -1 0.4319594
## 18 -0.3 -1 0.4136613
## 19 -0.2 -1 0.3955984
## 20 -0.1 -1 0.3778160
# Extend the plot
ggplot(
churn,
aes(time_since_first_purchase, time_since_last_purchase, color = has_churned)) +
geom_point(alpha = 0.5) +
scale_color_gradient2(midpoint = 0.5) +
theme_bw() +
# Add points from prediction_data with size 3 and shape 15
geom_point(data = prediction_data, size = 3, shape = 15)
When the response variable has just two outcomes, like the case of churn, the measures of success for the model are “how many cases where the customer churned did the model correctly predict?” and “how many cases where the customer didn’t churn did the model correctly predict?”. These can be found by generating a confusion matrix and calculating summary metrics on it. A mosaic plot is the natural way to visualize the results.
# Get the actual responses from churn
actual_response <- churn$has_churned
# Get the predicted responses from the model
predicted_response <- round(fitted(mdl_churn_vs_both_inter))
# Get a table of these values
outcomes <- table(predicted_response, actual_response)
# Convert the table to a conf_mat object
confusion <- conf_mat(outcomes)
# See the result
confusion
## actual_response
## predicted_response 0 1
## 0 102 53
## 1 98 147
# "Automatically" plot the confusion matrix
autoplot(confusion)
# Get summary metrics
summary(confusion, event_level = "second")
## # A tibble: 13 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.622
## 2 kap binary 0.245
## 3 sens binary 0.735
## 4 spec binary 0.51
## 5 ppv binary 0.6
## 6 npv binary 0.658
## 7 mcc binary 0.251
## 8 j_index binary 0.245
## 9 bal_accuracy binary 0.622
## 10 detection_prevalence binary 0.612
## 11 precision binary 0.6
## 12 recall binary 0.735
## 13 f_meas binary 0.661
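As a quick cross-check, two of these metrics can be recomputed by hand from the confusion matrix counts shown above. This is a sketch; the names TN, FN, FP, TP are just for illustration.
# Sketch: recompute accuracy and sensitivity from the table above
TN <- 102; FN <- 53   # predicted 0 (didn't churn)
FP <- 98;  TP <- 147  # predicted 1 (churned)
# Accuracy: proportion of all predictions that are correct
(TP + TN) / (TP + TN + FP + FN)  # 249 / 400 = 0.6225
# Sensitivity (event_level = "second", i.e. churned = 1)
TP / (TP + FN)                   # 147 / 200 = 0.735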
Understanding the logistic distribution is key to understanding logistic regression. Like the normal (Gaussian) distribution, it is a probability distribution of a single continuous variable. Here you’ll visualize the cumulative distribution function (CDF) for the logistic distribution. That is, if you have a logistically distributed variable, x, and a possible value, xval, that x could take, then the CDF gives the probability that x is less than xval. The logistic distribution’s CDF is calculated with the logistic function (hence the name). The plot of this has an S-shape, known as a sigmoid curve. An important property of this function is that it takes an input that can be any number from minus infinity to infinity, and returns a value between zero and one.
logistic_distn_cdf <- tibble(
# Make a seq from -10 to 10 in steps of 0.1
x = seq(-10, 10, 0.1),
# Transform x with built-in logistic CDF
logistic_x = plogis(x),
# Transform x with manual logistic
logistic_x_man = (1+exp(-x))^-1
)
# Check that each logistic function gives the same results
all.equal(
  logistic_distn_cdf$logistic_x,
  logistic_distn_cdf$logistic_x_man
)
## [1] TRUE
# Using logistic_distn_cdf, plot logistic_x vs. x
ggplot(logistic_distn_cdf, aes(x, logistic_x)) +
# Make it a line plot
geom_line()
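As an optional aside, the paragraph above compares the logistic distribution to the normal distribution; overlaying the Gaussian CDF makes the similarity of the two S-curves visible. This is a sketch, not part of the exercise, and uses pnorm(), the normal counterpart of plogis().
# Sketch: overlay the normal CDF (dashed) on the logistic CDF
ggplot(logistic_distn_cdf, aes(x, logistic_x)) +
  geom_line() +
  geom_line(aes(y = pnorm(x)), linetype = "dashed")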
The logistic function (the logistic distribution's CDF) has another important property: each x input value is transformed to a unique value. That means the transformation can be reversed. The logit function is the name for the inverse logistic function, which is also the logistic distribution's inverse cumulative distribution function. (All three terms mean exactly the same thing.)
logistic_distn_inv_cdf <- tibble(
# Make a seq from 0.001 to 0.999 in steps of 0.001
p = seq(0.001, 0.999, 0.001),
# Transform with built-in logistic inverse CDF
logit_p = qlogis(p),
# Transform with manual logit
logit_p_man = log(p / (1 - p))
)
# Check that each logistic function gives the same results
all.equal(
logistic_distn_inv_cdf$logit_p,
logistic_distn_inv_cdf$logit_p_man
)
## [1] TRUE
# Using logistic_distn_inv_cdf, plot logit_p vs. p
ggplot(logistic_distn_inv_cdf, aes(p, logit_p)) +
# Make it a line plot
geom_line()
Let’s dig into the internals and implement a logistic regression algorithm. Since R’s glm() function is very complex, you’ll stick to implementing simple logistic regression for a single dataset.
Rather than using the sum of squares as the metric, we want to use likelihood. However, the log-likelihood is more computationally stable, so we'll use that instead. There is one more change: we want to maximize the log-likelihood, but optim() defaults to finding minimum values, so it is easier to minimize the negative log-likelihood.
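For reference, the quantity being summed in the code below is the Bernoulli log-likelihood of each observation, where $\hat{y}_i$ is the predicted probability of churning and $y_i$ is the observed response:

$$\log L = \sum_i \big[\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \,\big]$$

optim() will then minimize $-\log L$.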
x_actual <- churn$time_since_last_purchase
y_actual <- churn$has_churned
# Set the intercept to 1
intercept <- 1
# Set the slope to 0.5
slope <- 0.5
# Calculate the predicted y values
y_pred <- plogis(intercept + slope * x_actual)
# Calculate the log-likelihood for each term
log_likelihoods <- log(y_pred) * y_actual + log(1 - y_pred) * (1 - y_actual)
# Calculate minus the sum of the log-likelihoods for each term
-sum(log_likelihoods)
## [1] 326.2599
calc_neg_log_likelihood <- function(coeffs) {
# Get the intercept coeff
intercept <- coeffs[1]
# Get the slope coeff
slope <- coeffs[2]
# Calculate the predicted y values
y_pred <- plogis(intercept + slope * x_actual)
# Calculate the log-likelihood for each term
log_likelihoods <- log(y_pred) * y_actual + log(1 - y_pred) * (1 - y_actual)
# Calculate minus the sum of the log-likelihoods for each term
-sum(log_likelihoods)
}
# Optimize the metric
optim(
# Initially guess 0 intercept and 1 slope
par = c(intercept = 0, slope = 1),
# Use calc_neg_log_likelihood as the optimization fn
fn = calc_neg_log_likelihood
)
## $par
## intercept slope
## -0.03478255 0.26890041
##
## $value
## [1] 273.2002
##
## $counts
## function gradient
## 51 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
# Compare the coefficients to those calculated by glm()
glm(has_churned ~ time_since_last_purchase, data = churn, family = binomial)
##
## Call: glm(formula = has_churned ~ time_since_last_purchase, family = binomial,
## data = churn)
##
## Coefficients:
## (Intercept) time_since_last_purchase
## -0.03502 0.26921
##
## Degrees of Freedom: 399 Total (i.e. Null); 398 Residual
## Null Deviance: 554.5
## Residual Deviance: 546.4 AIC: 550.4
The End.
Thanks DataCamp
- My Favorite Team -