Fitting a multiple logistic regression

In the video, you learned about fitting a regression with two different predictor variables. In this exercise, you'll build the model we talked about. The distance a commuter rides might be important in determining whether they are more likely to ride the bus. Similarly, the number of days a commuter goes to work might be important. Using this model, we will examine how these variables may be used to predict the chance that a commuter rides the bus.

* Build a logistic regression using the bus data frame. Save the output as bus_both.
* Predict whether someone rides the Bus based upon both CommuteDays and MilesOneWay. Make sure you use the variables in this order in the formula.
* Use summary() on bus_both.

> glimpse(bus)
Rows: 15,892
Columns: 3
$ CommuteDays 5, 5, 5, 5, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5…
$ MilesOneWay 19.54675, 19.54675, 19.54675, 19.54675, 19.54675, 21.66784…
$ Bus         Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, No, No, No, No, …

# Build a logistic regression with Bus predicted by CommuteDays and MilesOneWay
bus_both <- glm(Bus ~ CommuteDays + MilesOneWay, data = bus, family = "binomial")

# Look at the summary of the output
summary(bus_both)

Call:
glm(formula = Bus ~ CommuteDays + MilesOneWay, family = "binomial", data = bus)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0732 -0.9035 -0.7816  1.3968  2.5066

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.707515   0.119719  -5.910 3.42e-09 ***
CommuteDays  0.066084   0.023181   2.851  0.00436 **
MilesOneWay -0.059571   0.003218 -18.512  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19568  on 15891  degrees of freedom
Residual deviance: 19137  on 15889  degrees of freedom
AIC: 19143

Number of Fisher Scoring iterations: 4
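With the fitted model in hand, you could also generate predicted probabilities for new commuters. This is a minimal sketch, not part of the exercise; the 5-day, 10-mile commuter is a made-up example value.

# Hypothetical new commuter: works 5 days a week, lives 10 miles away (made-up values)
new_commuter <- data.frame(CommuteDays = 5, MilesOneWay = 10)

# type = "response" returns the predicted probability of riding the bus
predict(bus_both, newdata = new_commuter, type = "response")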
----------------------------------------------------------------------------------------------------------------------------------------------

Building two models

In chapter 2, you built a model to examine whether the number of commute days affected the chance of riding the bus. In chapter 3, you visualized whether the distance a commuter traveled affected the chance of riding the bus. During this exercise, you will build both of these logistic regressions using the glm() function with the binomial family. The bus data frame has been loaded for you.

* Build a logistic regression where Bus is predicted by CommuteDays and save it as bus_days.
* Build a logistic regression where Bus is predicted by MilesOneWay and save it as bus_miles.

# Build a logistic regression with Bus predicted by CommuteDays
bus_days <- glm(Bus ~ CommuteDays, data = bus, family = "binomial")

# Build a logistic regression with Bus predicted by MilesOneWay
bus_miles <- glm(Bus ~ MilesOneWay, data = bus, family = "binomial")

summary(bus_days)

Call:
glm(formula = Bus ~ CommuteDays, family = "binomial", data = bus)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.9560 -0.8595 -0.8595  1.5330  1.7668

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.45493    0.11471 -12.683  < 2e-16 ***
CommuteDays  0.12985    0.02312   5.616 1.96e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19568  on 15891  degrees of freedom
Residual deviance: 19536  on 15890  degrees of freedom
AIC: 19540

Number of Fisher Scoring iterations: 4

summary(bus_miles)

Call:
glm(formula = Bus ~ MilesOneWay, family = "binomial", data = bus)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0204 -0.9057 -0.7868  1.3974  2.5268

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.375824   0.027801  -13.52   <2e-16 ***
MilesOneWay -0.060760   0.003197  -19.01   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19568  on 15891  degrees of freedom
Residual deviance: 19145  on 15890  degrees of freedom
AIC: 19149

Number of Fisher Scoring iterations: 4

The single-predictor and multiple-predictor models estimated different coefficient values. This is why multiple regression results are often described as being 'corrected for' another variable. For example, here you might describe the probability of riding the bus based upon commute days, corrected for commute distance.
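One quick way to see that 'corrected for' effect is to compare the CommuteDays coefficient across the single-predictor and multiple-predictor models. A minimal sketch, assuming bus_both from the previous exercise is still in your workspace:

# Slope for CommuteDays on its own vs. corrected for MilesOneWay
coef(bus_days)["CommuteDays"]   # single-predictor estimate (about 0.130)
coef(bus_both)["CommuteDays"]   # multiple-predictor estimate (about 0.066)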
----------------------------------------------------------------------------------------------------------------------------------------------

Comparing variable order

The order of predictor variables can be important, especially if the predictors are correlated, because changing the order of correlated predictor variables can change the estimates of the regression coefficients. This problem is called multicollinearity. During this exercise, you will build two different multiple regressions with the bus data to compare whether the order of the model inputs matters. First, examine the correlation between CommuteDays and MilesOneWay. Second, build two logistic regressions using the bus data frame where Bus is predicted by CommuteDays and MilesOneWay in different orders. After you build the two models, look at each model's summary() to see the outputs.

* Examine the correlation between bus$CommuteDays (first argument) and bus$MilesOneWay (second argument) with cor().
* Using the bus data, build 2 multiple GLMs with the "binomial" family.
* Save model 1 as bus_one, with Bus predicted by CommuteDays + MilesOneWay.
* Save model 2 as bus_two, with Bus predicted by MilesOneWay + CommuteDays.

# Run a correlation
cor(bus$CommuteDays, bus$MilesOneWay)
[1] -0.1378974

# Build a glm with CommuteDays first and MilesOneWay second
bus_one <- glm(Bus ~ CommuteDays + MilesOneWay, data = bus, family = "binomial")

# Build a glm with MilesOneWay first and CommuteDays second
bus_two <- glm(Bus ~ MilesOneWay + CommuteDays, data = bus, family = "binomial")

# Print model summaries
summary(bus_one)
summary(bus_two)

Call:
glm(formula = Bus ~ CommuteDays + MilesOneWay, family = "binomial", data = bus)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0732 -0.9035 -0.7816  1.3968  2.5066

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.707515   0.119719  -5.910 3.42e-09 ***
CommuteDays  0.066084   0.023181   2.851  0.00436 **
MilesOneWay -0.059571   0.003218 -18.512  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19568  on 15891  degrees of freedom
Residual deviance: 19137  on 15889  degrees of freedom
AIC: 19143

Number of Fisher Scoring iterations: 4

Call:
glm(formula = Bus ~ MilesOneWay + CommuteDays, family = "binomial", data = bus)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0732 -0.9035 -0.7816  1.3968  2.5066

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.707515   0.119719  -5.910 3.42e-09 ***
MilesOneWay -0.059571   0.003218 -18.512  < 2e-16 ***
CommuteDays  0.066084   0.023181   2.851  0.00436 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 19568  on 15891  degrees of freedom
Residual deviance: 19137  on 15889  degrees of freedom
AIC: 19143

Number of Fisher Scoring iterations: 4

Changing the order did not change the coefficient estimates or their significance. Notice how, in this case, order was not important for the regression predictor variables; this is usually the case. Practically speaking, the easiest way to check this assumption is to change the order of the variables in the regression.

----------------------------------------------------------------------------------------------------------------------------------------------

Multiple slopes

You learned about model.matrix() and its relationship to glm() during the video. Now, you can "peek under the hood" of R and see how formulas work. The input to model.matrix() is similar to the formula inputs of lm() and glm(), with one key difference: model.matrix() does not have a left-hand side (e.g., y ~), only a right-hand side (e.g., ~ x). For example, model.matrix( ~ x1) and model.matrix( ~ x1 + x2) would both be valid inputs. The output from model.matrix() is called a prediction matrix because it is the prediction (or right-hand) side of a formula in R. During this exercise, you will use model.matrix() with 2 vectors: size and count. First, build a simple formula that only includes size. Second, build a formula that includes both size and count.

* Use model.matrix() to create a prediction matrix with only size.
* Use model.matrix() to create a prediction matrix with both size (first) and count (second).

> size
[1] 1.1 2.2 3.3
> count
[1] 10  4  2

# Use model.matrix() with size
model.matrix( ~ size)
  (Intercept) size
1           1  1.1
2           1  2.2
3           1  3.3
attr(,"assign")
[1] 0 1

# Use model.matrix() with size and count
model.matrix( ~ size + count)
  (Intercept) size count
1           1  1.1    10
2           1  2.2     4
3           1  3.3     2
attr(,"assign")
[1] 0 1 2

Notice how model.matrix() includes an intercept in the design matrix. This means all observations would share the same intercept estimate.
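The same idea applies to the bus model: model.matrix() shows the design matrix that glm() builds behind the scenes. A quick sketch (output trimmed with head(), since bus has 15,892 rows):

# Design matrix for Bus ~ CommuteDays + MilesOneWay:
# an intercept column plus one column per numeric predictor
head(model.matrix( ~ CommuteDays + MilesOneWay, data = bus))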
Next, you'll see how to specify intercepts.

----------------------------------------------------------------------------------------------------------------------------------------------

Intercepts

In the previous exercise, you saw how to code multiple slopes. During this exercise, you will explore intercepts with model.matrix() using a simple vector of colors. First, build a matrix that includes a reference intercept. This input only has the variable's name (e.g., ~ x1). Second, build a matrix that includes an intercept for each color. This input has the variable's name and -1 (e.g., ~ x1 - 1).

* Build a matrix using color including a reference intercept, which is the default option in R.
* Build a matrix using color including an intercept for each color, which requires a - 1 in the input.

> color
[1] "red"   "blue"  "green"
> class(color)
[1] "character"

# Create a matrix that includes a reference intercept
model.matrix( ~ color)
  (Intercept) colorgreen colorred
1           1          0        1
2           1          0        0
3           1          1        0
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$color
[1] "contr.treatment"

# Create a matrix that includes an intercept for each group
model.matrix( ~ color - 1)
  colorblue colorgreen colorred
1         0          0        1
2         1          0        0
3         0          1        0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$color
[1] "contr.treatment"

Notice how model.matrix() produces a different output for each input. The first gives a baseline intercept and contrasts; the second gives an intercept for each group. The first allows you to compare groups to a reference group, and the second allows you to compare all groups to zero.
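With the default reference intercept, R picks the reference level alphabetically (here "blue"). If you wanted a different reference group, one option is relevel() on a factor; a small sketch, with "red" chosen as the reference purely as an example:

# Convert to a factor and set "red" as the reference level
color_fct <- relevel(factor(color), ref = "red")

# Now the intercept represents "red"; the other colors are contrasts to it
model.matrix( ~ color_fct)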
----------------------------------------------------------------------------------------------------------------------------------------------

Multiple intercepts

With multiple intercepts, the order of the predictor variables becomes important if you want to estimate intercepts for one group, because R only estimates intercepts for the first predictor variable. All other predictors' intercepts are contrasts to a reference group within their factor. In the previous exercise, you constructed a contrast matrix with and without an intercept for each group. During this exercise, build a contrast matrix with two group variables and use the -1 option. Then, swap the order of the input variables. This lets you see how order matters for the contrast levels.

* Build a contrast matrix using model.matrix() with color first and shape second. Be sure to include -1.
* Build a contrast matrix using model.matrix() with shape first and color second. Be sure to include -1.

> color
[1] "red"    "blue"   "green"
> shape
[1] "square" "square" "circle"

# Create a matrix that includes color and then shape
model.matrix( ~ color + shape - 1)
  colorblue colorgreen colorred shapesquare
1         0          0        1           1
2         1          0        0           1
3         0          1        0           0
attr(,"assign")
[1] 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$color
[1] "contr.treatment"
attr(,"contrasts")$shape
[1] "contr.treatment"

# Create a matrix that includes shape and then color
model.matrix( ~ shape + color - 1)
  shapecircle shapesquare colorgreen colorred
1           0           1          0        1
2           0           1          0        0
3           1           0          1        0
attr(,"assign")
[1] 1 1 2 2
attr(,"contrasts")
attr(,"contrasts")$shape
[1] "contr.treatment"
attr(,"contrasts")$color
[1] "contr.treatment"

----------------------------------------------------------------------------------------------------------------------------------------------

For comparison, here are the same two orderings with the default reference intercept:

> model.matrix( ~ shape + color)
  (Intercept) shapesquare colorgreen colorred
1           1           1          0        1
2           1           1          0        0
3           1           0          1        0
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$shape
[1] "contr.treatment"
attr(,"contrasts")$color
[1] "contr.treatment"

> model.matrix( ~ color + shape)
  (Intercept) colorgreen colorred shapesquare
1           1          0        1           1
2           1          0        0           1
3           1          1        0           0
attr(,"assign")
[1] 0 1 1 2
attr(,"contrasts")
attr(,"contrasts")$color
[1] "contr.treatment"
attr(,"contrasts")$shape
[1] "contr.treatment"
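The attr(,"contrasts") entries above report "contr.treatment", R's default treatment contrasts. If you want to see the contrast coding itself, contrasts() on a factor shows it; a small sketch:

# Treatment contrasts for shape: "circle" (the first level alphabetically)
# is the reference and gets the row of zeros
contrasts(factor(shape))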
----------------------------------------------------------------------------------------------------------------------------------------------

Simpson's paradox

Simpson's paradox occurs when adding or removing a coefficient changes the results of an analysis, and it is important for regressions. The 1973 graduate school admission data from UC Berkeley illustrate this point. At first glance, it appears that female applicants were less likely to be admitted to graduate programs. However, including Department as a coefficient causes gender's significance to disappear. As it turns out, prospective female students applied to more competitive programs than males did.

Data note: The data have four columns: Dept, Gender, Admitted, and Rejected. You can build a "binomial" glm() by binding the Admitted and Rejected columns.

* Build a logistic regression using glm() where cbind(Admitted, Rejected) is predicted by Gender using the UCB_data data frame. Save this as glm_1.
* Build a second logistic regression using glm() where cbind(Admitted, Rejected) is predicted by Gender and Dept using the UCB_data data frame. Save this as glm_2.
* Look at the summary of both models.

> glimpse(UCB_data)
Rows: 12
Columns: 4
$ Dept     "A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"
$ Gender   "Male", "Female", "Male", "Female", "Male", "Female", "Male",…
$ Admitted 512, 89, 353, 17, 120, 202, 138, 131, 53, 94, 22, 24
$ Rejected 313, 19, 207, 8, 205, 391, 279, 244, 138, 299, 351, 317

# Build a binomial glm where Admitted and Rejected are predicted by Gender
glm_1 <- glm(cbind(Admitted, Rejected) ~ Gender, data = UCB_data, family = "binomial")

# Build a binomial glm where Admitted and Rejected are predicted by Gender and Dept
glm_2 <- glm(cbind(Admitted, Rejected) ~ Gender + Dept, data = UCB_data, family = "binomial")

# Look at the summary of both models
summary(glm_1)
summary(glm_2)

Call:
glm(formula = cbind(Admitted, Rejected) ~ Gender, family = "binomial", data = UCB_data)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-16.7915   -4.7613   -0.4365    5.1025   11.2022

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.83049    0.05077 -16.357   <2e-16 ***
GenderMale   0.61035    0.06389   9.553   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 877.06  on 11  degrees of freedom
Residual deviance: 783.61  on 10  degrees of freedom
AIC: 856.55

Number of Fisher Scoring iterations: 4

Call:
glm(formula = cbind(Admitted, Rejected) ~ Gender + Dept, family = "binomial", data = UCB_data)

Deviance Residuals:
      1        2        3        4        5        6        7        8
-1.2487   3.7189  -0.0560   0.2706   1.2533  -0.9243   0.0826  -0.0858
      9       10       11       12
 1.2205  -0.8509  -0.2076   0.2052

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.68192    0.09911   6.880 5.97e-12 ***
GenderMale  -0.09987    0.08085  -1.235    0.217
DeptB       -0.04340    0.10984  -0.395    0.693
DeptC       -1.26260    0.10663 -11.841  < 2e-16 ***
DeptD       -1.29461    0.10582 -12.234  < 2e-16 ***
DeptE       -1.73931    0.12611 -13.792  < 2e-16 ***
DeptF       -3.30648    0.16998 -19.452  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 877.056  on 11  degrees of freedom
Residual deviance:  20.204  on  5  degrees of freedom
AIC: 103.14

Number of Fisher Scoring iterations: 4

Notice how adding Dept made Gender no longer a significant predictor variable. In this case, the academic departments matter. If you were to dig into the data further, you would find that different departments had different rejection rates and that applications by gender were not balanced across departments.

----------------------------------------------------------------------------------------------------------------------------------------------

Non-linear logistic regression

In chapter 3, you explored the distance commuters traveled and the linear effect this had on the probability of somebody riding the bus. However, what if this relationship is non-linear and non-monotonic? For example, what if people who commute the shortest and longest distances are less likely to ride the bus? You can add non-linear terms to formulas in R using the I() function as part of your formula. For example, y ~ I(x^2) allows you to estimate a coefficient for x*x. During this exercise, you will examine the bus data further.

* Add the formula y ~ I(x^2) to the formula option in the second call to geom_smooth().

> glimpse(bus)
Rows: 15,892
Columns: 4
$ CommuteDays 5, 5, 5, 5, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5…
$ MilesOneWay 19.54675, 19.54675, 19.54675, 19.54675, 19.54675, 21.66784…
$ Bus         Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, No, No, No, No, …
$ Bus2        1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1…

# Plot the linear effect of travel distance on the probability of taking the bus
gg_jitter <- ggplot(data = bus, aes(x = MilesOneWay, y = Bus2)) +
  geom_jitter(width = 0, height = 0.05) +
  geom_smooth(method = 'glm', method.args = list(family = 'binomial'))

# Add a non-linear equation to a geom_smooth()
gg_jitter +
  geom_smooth(method = 'glm',
              method.args = list(family = 'binomial'),
              formula = y ~ I(x^2),
              color = 'red')

Based upon the extra plot, do you think the data are linear or non-linear? The two lines are similar to each other. However, with the non-linear (red) line, the probability of riding the bus stays fairly constant for commutes under 10 miles; above this distance, the probability drops off. This suggests a non-linear trend in the data and the need for more investigation or insight for your client.
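To go beyond the plot, you could fit the linear and quadratic models directly and compare them, for example with AIC(). This is a sketch, not part of the exercise; the model names bus_linear and bus_quad are chosen here for illustration, and the quadratic-only term mirrors the y ~ I(x^2) formula used in geom_smooth() above.

# Linear effect of distance on the 0/1 outcome Bus2
bus_linear <- glm(Bus2 ~ MilesOneWay, data = bus, family = "binomial")

# Quadratic-only effect of distance, matching the red smooth above
bus_quad <- glm(Bus2 ~ I(MilesOneWay^2), data = bus, family = "binomial")

# Lower AIC suggests a better trade-off between fit and complexity
AIC(bus_linear, bus_quad)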