Fitting a logistic regression

Pittsburgh is a city located in Allegheny County, PA, USA. Commuters living in this area have different options for commuting to work, including taking the bus. Using data from 2015, you will see if the number of commuting days per week increases the chance somebody will ride the bus. The choice of riding the bus is a binary outcome, so you will use a logistic regression to model these data.

* Using the data.frame bus, fit a logistic regression in R using glm() with a "binomial" family. The variable Bus should be predicted by CommuteDays.
* Save the model as bus_out.

> glimpse(bus)
Rows: 15,892
Columns: 3
$ CommuteDays 5, 5, 5, 5, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5…
$ MilesOneWay 19.54675, 19.54675, 19.54675, 19.54675, 19.54675, 21.66784…
$ Bus         Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, No, No, No, No, …

# Build a glm using the bus data frame that models Bus predicted by CommuteDays
bus_out <- glm(Bus ~ CommuteDays, data = bus, family = "binomial")

----------------------------------------------------------------------------------------------------------------------------------------------

Examining & interpreting logistic regression outputs

In the previous exercise, you fit a logistic regression, bus_out. During this exercise, you will examine bus_out and interpret the results of the regression using the tools you learned about in Chapter 1:

* print() includes the coefficient estimates (i.e., slopes and intercepts) for the different predictor variables and information about the model fit, such as deviance.
* summary() includes the print() outputs as well as standard errors, z-scores, and P-values for the coefficient estimates.
* tidy() includes the summary() coefficient table as a tidy data frame.

Recall that regression coefficients can help us understand both the direction of relationships and the statistical significance of coefficients.
For logistic regression, a positive coefficient indicates that the probability of the event occurring increases as the predictor increases.

* Print the model's output using print().
* Look at the summary of the model's output using summary().
* Look at the tidy output of the model using tidy().

# Print bus_out with the print() function
print(bus_out)

Call:  glm(formula = Bus ~ CommuteDays, family = "binomial", data = bus)

Coefficients:
(Intercept)  CommuteDays
    -1.4549       0.1299

Degrees of Freedom: 15891 Total (i.e. Null);  15890 Residual
Null Deviance:      19570
Residual Deviance: 19540   AIC: 19540

# Look at the summary() of bus_out
summary(bus_out)

Call:
glm(formula = Bus ~ CommuteDays, family = "binomial", data = bus)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9560  -0.8595  -0.8595   1.5330   1.7668

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.45493    0.11471 -12.683  < 2e-16 ***
CommuteDays  0.12985    0.02312   5.616 1.96e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19568  on 15891  degrees of freedom
Residual deviance: 19536  on 15890  degrees of freedom
AIC: 19540

Number of Fisher Scoring iterations: 4

# Look at the tidy() output of bus_out
tidy(bus_out)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
1 (Intercept)   -1.45     0.115      -12.7 7.32e-37
2 CommuteDays    0.130    0.0231       5.62 1.96e- 8

What is the relationship between the number of commute days and the probability of taking the bus? There is a positive, significant relationship: commuting more days increases the chance that someone takes the bus. As a person commutes more days, they are more likely to take the bus, because the trend is positive and statistically different from zero. You're now seeing how to use GLMs to look for trends in data.
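Logit-scale coefficients are hard to read directly. One quick way to interpret them (a sketch, assuming bus_out from above is still in the workspace) is to exponentiate a slope to get an odds ratio, or to push the linear predictor through plogis() to get a probability:

# Odds ratio: each extra commute day multiplies the odds of riding the bus
# by exp(0.12985), roughly 1.14
exp(coef(bus_out)["CommuteDays"])

# Predicted probability of riding the bus for a 5-day-per-week commuter,
# roughly 0.31 given the estimates above
plogis(coef(bus_out)["(Intercept)"] + coef(bus_out)["CommuteDays"] * 5)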
----------------------------------------------------------------------------------------------------------------------------------------------

Simulating binary data

A Bernoulli distribution is a special case of the binomial. Next, you will see how to simulate both in R and then examine the outputs to see how they are similar. Both distributions can be simulated with the random binomial function, rbinom(), which requires 3 arguments:

* n, the number of draws or random numbers (i.e., an output vector of length n).
* size, the number of samples per draw (i.e., the maximum value for each random number).
* prob, the probability of success for the simulation.

To sample from a Bernoulli, you simply use size = 1. If we take a single random draw (n = 1) from a binomial distribution with a large number of samples per draw (e.g. size = 100), we should get similar results to taking many draws (e.g. n = 100) with 1 sample per draw (size = 1).

* Simulate from a binomial distribution with 1 draw (n), a size (size) of 100, and a probability (prob) of 0.5.
* Simulate from a Bernoulli distribution with 100 draws (n), a size (size) of 1, and a probability (prob) of 0.5.
* Print the results from the first simulation.
* Sum the results from the second simulation.

# Simulate 1 draw with a sample size of 100
binomial_sim <- rbinom(n = 1, size = 100, prob = 0.5)

# Simulate 100 draws with a sample size of 1
bernoulli_sim <- rbinom(n = 100, size = 1, prob = 0.5)

# Print the results from the binomial
print(binomial_sim)
[1] 49

# Sum the results from the Bernoulli
sum(bernoulli_sim)
[1] 47

This simulation has helped you to see how the binomial and Bernoulli distributions are related. Notice how the two results are similar, but not exactly the same. This is due to the inherent randomness of the simulated numbers.
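The similarity between the two approaches becomes clearer with repetition. As a sketch (the seed here is arbitrary and not part of the exercise), averaging many draws of each type shows both centering on size * prob:

# Seed chosen only for reproducibility
set.seed(42)

# 1,000 binomial draws, each the number of successes out of 100 trials
binomial_draws <- rbinom(n = 1000, size = 100, prob = 0.5)

# 1,000 Bernoulli draws, each 0 or 1
bernoulli_draws <- rbinom(n = 1000, size = 1, prob = 0.5)

mean(binomial_draws)         # close to 100 * 0.5 = 50
mean(bernoulli_draws) * 100  # scaled Bernoulli mean, also close to 50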
----------------------------------------------------------------------------------------------------------------------------------------------

Long-form logistic regression input

As you learned in the video, a binomial regression can take inputs in three different formats. The first requires data to be in "long" format and directly models each observation (e.g. a response coded as numeric 0/1 or as a factor of yes/no):

  x       y
1 a    fail
2 a    fail
3 b success
4 b    fail
...

In this format, the formula response predicted by predictor (response ~ predictor in R's formula syntax) is used. During this exercise, you'll fit a regression using this format.

* Fit a logistic regression using the data.frame data_long with y being predicted by x.
* Save the output as lr_1.

> data_long
    x       y
1   a    fail
2   a    fail
3   a    fail
4   a    fail
5   a success
6   a    fail
7   a    fail
8   a    fail
9   a    fail
10  a    fail
11  a    fail
12  a    fail
13  a    fail
14  a success
15  b success
16  b    fail
17  b success
18  b success
19  b success
20  b success
21  b success
22  b success
23  b success
24  b success
25  b success
26  b    fail
27  b success
28  b    fail

# Fit a long-format logistic regression
lr_1 <- glm(y ~ x, data = data_long, family = "binomial")

print(lr_1)

Call:  glm(formula = y ~ x, family = "binomial", data = data_long)

Coefficients:
(Intercept)           xb
     -1.792        3.091

Degrees of Freedom: 27 Total (i.e. Null);  26 Residual
Null Deviance:      38.67
Residual Deviance: 26.03   AIC: 30.03

You've just fit the long dataset, which had 28 entries. You may have noticed that the degrees of freedom were 27. This is because degrees of freedom are usually the number of data points minus the number of parameters estimated.
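Long and wide formats contain the same information. As a sketch (assuming the dplyr and tidyr packages are available, which is a guess based on the tidyverse-style outputs elsewhere in these exercises), you could build the wide table used in the next exercise by tabulating outcomes per group in data_long:

library(dplyr)
library(tidyr)

data_wide <- data_long %>%
  count(x, y) %>%                                             # rows per (group, outcome)
  pivot_wider(names_from = y, values_from = n, values_fill = 0) %>%
  mutate(Total = fail + success,
         successProportion = success / Total)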
----------------------------------------------------------------------------------------------------------------------------------------------

Wide-form input logistic regression

The second and third approaches for fitting logistic regressions require "wide" format data:

  x fail success Total successProportion
1 a   12       2    14         0.1428571
2 b    3      11    14         0.7857143

For the second approach, model a 2-column matrix of successes and failures (e.g. the number of yes's and no's per group). In this format, use the formula cbind(success, fail) ~ predictor.

For the third approach, model the probability of success per group (e.g. group 1 had 75% yes and group 2 had 65% yes) along with a weight, the number of observations per group (e.g. there were 40 individuals in group 1 and 100 individuals in group 2). In this example, successProportion = success / Total. In this format, the formula proportion of successes ~ predictor is used with weights = number of observations per group.

* Fit a logistic regression using the data.frame data_wide. First, combine success and fail with cbind() and then have this predicted by x. Save the output as lr_2.
* Print the output of lr_2.
* Fit a logistic regression using the data.frame data_wide with successProportion being predicted by x while using weights = Total. Save the output as lr_3.
* Print the output of lr_3.

# Fit a wide-form logistic regression
lr_2 <- glm(cbind(success, fail) ~ x, data = data_wide, family = "binomial")

# Print the output of lr_2
print(lr_2)

Call:  glm(formula = cbind(success, fail) ~ x, family = "binomial", data = data_wide)

Coefficients:
(Intercept)           xb
     -1.792        3.091

Degrees of Freedom: 1 Total (i.e.
Null);  0 Residual
Null Deviance:      12.64
Residual Deviance: 4.441e-15   AIC: 9.215

# Using data_wide, fit a glm with successProportion
# predicted by x and weights = Total
lr_3 <- glm(successProportion ~ x, data = data_wide, family = "binomial", weights = Total)

# Print the output of lr_3
print(lr_3)

Call:  glm(formula = successProportion ~ x, family = "binomial", data = data_wide, weights = Total)

Coefficients:
(Intercept)           xb
     -1.792        3.091

Degrees of Freedom: 1 Total (i.e. Null);  0 Residual
Null Deviance:      12.64
Residual Deviance: 4.441e-15   AIC: 9.215

You've now seen three methods for fitting a logistic regression. While all three methods produce the same regression coefficients, notice how the degrees of freedom in model 1 differ from those in models 2 and 3; the other outputs are the same. The choice of input format is largely a personal one, usually driven by the structure of the data and the questions being asked of the data.

----------------------------------------------------------------------------------------------------------------------------------------------

Fitting probits and logits

During this exercise, you will fit a probit and a logit model to the Pittsburgh bus data. There are multiple purposes for this and the following exercises:

* Learning how to change link functions.
* Learning to notice differences and similarities between the logit and probit models.
* Learning how the link functions differ in code.

First, let's code some models! Then, you'll examine the outputs.

* Using the data.frame bus, fit a logistic regression to predict Bus using CommuteDays. Save this as bus_logit. Make sure you specify the link option to be "logit" inside of the binomial() family.
* Fit a probit by changing the link option to "probit". Save this as bus_probit.
* Print the summaries of both models.
> glimpse(bus)
Rows: 15,892
Columns: 3
$ CommuteDays 5, 5, 5, 5, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5…
$ MilesOneWay 19.54675, 19.54675, 19.54675, 19.54675, 19.54675, 21.66784…
$ Bus         Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, No, No, No, No, …

# Fit a GLM with a logit link and save it as bus_logit
bus_logit <- glm(Bus ~ CommuteDays, data = bus, family = binomial(link = "logit"))

# Fit a GLM with a probit link and save it as bus_probit
bus_probit <- glm(Bus ~ CommuteDays, data = bus, family = binomial(link = "probit"))

summary(bus_logit)

Call:
glm(formula = Bus ~ CommuteDays, family = binomial(link = "logit"), data = bus)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9560  -0.8595  -0.8595   1.5330   1.7668

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.45493    0.11471 -12.683  < 2e-16 ***
CommuteDays  0.12985    0.02312   5.616 1.96e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19568  on 15891  degrees of freedom
Residual deviance: 19536  on 15890  degrees of freedom
AIC: 19540

Number of Fisher Scoring iterations: 4

summary(bus_probit)

Call:
glm(formula = Bus ~ CommuteDays, family = binomial(link = "probit"), data = bus)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9545  -0.8596  -0.8596   1.5328   1.7706

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.88951    0.06833 -13.017  < 2e-16 ***
CommuteDays  0.07810    0.01380   5.658 1.53e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19568  on 15891  degrees of freedom
Residual deviance: 19536  on 15890  degrees of freedom
AIC: 19540

Number of Fisher Scoring iterations: 4

* Do the results differ? The coefficient estimates for CommuteDays differ, but a statistically significant relationship exists with both models.
Notice how both models produce the same qualitative statistical result but different coefficient estimates, because they use different link functions. Usually, probit and logit models produce similar qualitative results, and the choice between the two is largely a modeler's preference.

----------------------------------------------------------------------------------------------------------------------------------------------

Simulating a logit

Simulations can help us to understand distributions, evaluate our models, and compare study designs. During this exercise, you will simulate from a logit model. This allows you to generate data with known parameters, which can be helpful when testing models or comparing study designs (e.g. how many samples do we need to collect?).

* Using the plogis() function, convert 0 on the logit scale to a probability. Save the output as p.
* Using the rbinom() function, generate 10 samples (n = 10) with a size of 1 (size = 1) using the probability p.

# Convert from the logit scale to a probability
# p = 0.5
p <- plogis(0)

# Simulate a logit
rbinom(n = 10, size = 1, prob = p)
[1] 0 1 1 1 0 1 0 1 1 1

You've now seen how to simulate logit data; next, you will see how simulating a probit is similar. Simulating data can be helpful when testing models or designing studies.

----------------------------------------------------------------------------------------------------------------------------------------------

Simulating a probit

During the previous exercise, you simulated a logit. During this exercise, you will simulate a probit. First, you will convert from the probit scale to a probability. Second, you will use this probability to simulate from a binomial distribution.

* Use the function pnorm() to convert 0 on the probit scale to a probability. Save the output as p.
* Using the rbinom() function, generate 10 samples (n = 10) with a size of 1 (size = 1) using the probability p.
# Convert from the probit scale to a probability
# p = 0.5
p <- pnorm(0)

# Simulate a probit
rbinom(n = 10, size = 1, prob = p)
[1] 0 0 1 1 0 1 0 1 1 1
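To see how close the two link functions really are, you can compare the fitted models directly. A sketch (assuming bus_logit and bus_probit from the earlier exercise are still in the workspace):

# Logit coefficients are typically about 1.6-1.8 times the probit coefficients,
# because the logistic distribution is wider than the standard normal;
# with the estimates above, both ratios come out near 1.6
coef(bus_logit) / coef(bus_probit)

# Both inverse links map 0 to a probability of 0.5
plogis(0)
pnorm(0)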