lm vs. Poisson coefficients

A linear model and a Poisson regression have been fit for you. In this exercise, you'll extract the coefficients from both models and compare them. In Step 3, you will need to exponentiate the results from the Poisson regression. This is necessary because the Poisson coefficients are estimated on the link scale, not the raw data scale. Recall from the video that linear model coefficients are additive, whereas Poisson regression coefficients are multiplicative.

* You have been given the model lm_out. Extract its coefficients using coef() and save them as lm_coef.
* You have been given the model poisson_out. Extract its coefficients using coef() and save them as poisson_coef.
* Exponentiate the coefficients in poisson_coef using exp() and save the result as poisson_coef_exp.

> lm_out

Call:  lm(formula = Number ~ Month, data = dat)

Coefficients:
(Intercept)       Month2       Month3       Month4       Month5       Month6
   0.129477    -0.038031    -0.078401    -0.057254    -0.032702    -0.043365
     Month7       Month8       Month9      Month10      Month11      Month12
  -0.005821    -0.051520    -0.023921    -0.054208    -0.023921    -0.022919

> poisson_out

Call:  glm(formula = Number ~ Month, family = "poisson", data = dat)

Coefficients:
(Intercept)       Month2       Month3       Month4       Month5       Month6
    -2.0443      -0.3478      -0.9302      -0.5838      -0.2911      -0.4079
     Month7       Month8       Month9      Month10      Month11      Month12
    -0.0460      -0.5073      -0.2043      -0.5424      -0.2043      -0.1948

Degrees of Freedom: 4367 Total (i.e.
Null);  4356 Residual
Null Deviance:      2325
Residual Deviance: 2303         AIC: 2976

# Extract the coefficients from lm_out
lm_coef <- coef(lm_out)
lm_coef

(Intercept)      Month2      Month3      Month4      Month5      Month6
 0.12947658 -0.03803116 -0.07840132 -0.05725436 -0.03270239 -0.04336547
     Month7      Month8      Month9     Month10     Month11     Month12
-0.00582067 -0.05151959 -0.02392103 -0.05420777 -0.02392103 -0.02291921

# Extract the coefficients from poisson_out
poisson_coef <- coef(poisson_out)
poisson_coef

(Intercept)      Month2      Month3      Month4      Month5      Month6
-2.04425523 -0.34775767 -0.93019964 -0.58375226 -0.29111968 -0.40786159
     Month7      Month8      Month9     Month10     Month11     Month12
-0.04599723 -0.50734279 -0.20426264 -0.54243411 -0.20426264 -0.19481645

# Take the exponential using exp()
poisson_coef_exp <- exp(poisson_coef)
poisson_coef_exp

(Intercept)    Month2    Month3    Month4    Month5    Month6
  0.1294766 0.7062700 0.3944749 0.5578014 0.7474262 0.6650709
   Month7    Month8    Month9   Month10   Month11   Month12
0.9550446 0.6020933 0.8152482 0.5813315 0.8152482 0.8229857

How do the exponentiated Poisson regression coefficients (poisson_coef_exp) compare to the linear regression coefficients (lm_coef)?

The intercepts are essentially the same, but the other coefficients (Month2, Month3, etc.) are different. This is because the Poisson coefficients are multiplicative. Notice that 0.129 * 0.706 = 0.091 from the Poisson coefficients matches 0.129 - 0.038 = 0.091 from the linear model.

----------------------------------------------------------------------------------------------------------------------------------------------

Poisson regression plotting

Using geom_smooth() to plot data from a Poisson regression is similar to plotting a binomial; you only need to change the family argument in geom_smooth(). During this exercise, you will plot the number of cancer cells per cm and add a trend line with geom_smooth(). This is the same simulated dataset you saw in the video.
* Use geom_jitter() to set both the jitter width and height to 0.05.
* Use geom_smooth() to create a trend line based on a Poisson regression.

> dat
    dose cells
1      1     1
2      1     0
3      1     0
4      1     0
5      1     0
6      1     2
7      1     0
8      1     1
9      1     2
10     1     0
11     2     3
12     2     0
13     2     2
14     2     2
15     2     1
16     2     0
17     2     1
18     2     2
19     2     2
20     2     2
21     3     2
22     3     3
23     3     5
24     3     3
25     3     0
26     3     3
27     3     6
28     3     2
29     3     4
30     3     4
31     4     2
32     4     2
33     4     8
34     4     4
35     4     4
36     4     4
37     4     7
38     4     2
39     4     6
40     4     5
41     5     2
42     5     5
43     5     8
44     5     4
45     5     7
46     5     4
47     5     4
48     5     7
49     5     9
50     5     3
51     6     6
52     6     7
53     6     9
54     6     5
55     6     3
56     6     5
57     6     5
58     6     3
59     6     4
60     6    11
61     7     2
62     7     7
63     7     9
64     7     3
65     7     4
66     7     2
67     7     6
68     7     5
69     7     5
70     7     6
71     8     4
72     8     5
73     8     8
74     8    10
75     8    11
76     8     9
77     8     8
78     8     8
79     8    11
80     8     7
81     9    10
82     9    12
83     9     9
84     9    12
85     9    10
86     9    12
87     9     9
88     9    17
89     9     6
90     9     9
91    10    15
92    10    11
93    10    11
94    10    10
95    10     4
96    10     9
97    10    13
98    10     8
99    10     8
100   10    13

# Use geom_smooth to plot a continuous predictor variable
ggplot(data = dat, aes(x = dose, y = cells)) +
    geom_jitter(width = 0.05, height = 0.05) +
    geom_smooth(method = 'glm', method.args = list(family = 'poisson'))

----------------------------------------------------------------------------------------------------------------------------------------------

Extracting and interpreting odds-ratios

You looked at the bus data in this chapter and in Chapter 2. During this exercise, you are going to extract the odds-ratio from a model that predicts the chance of riding the bus based upon the number of commuting days. The resulting odds-ratio (indicated by the CommuteDays term) gives you the change in odds for each extra day a person commutes. For example, an odds-ratio of 1.5 would mean that for each extra day a person commutes per week, the odds that they ride the bus increase by 50% (1.5 - 1.0 = 0.5 = 50%).

A logistic regression, bus_out, has been fit for you.

* Using base R, extract the regression coefficients using coef() and save this as coef_out.
* Convert these coefficients to odds-ratios by calculating the exponential of the results with exp().

> bus_out

Call:  glm(formula = Bus ~ CommuteDays, family = "binomial", data = bus)

Coefficients:
(Intercept)  CommuteDays
    -1.4549       0.1299

Degrees of Freedom: 15891 Total (i.e. Null);  15890 Residual
Null Deviance:      19570
Residual Deviance: 19540        AIC: 19540

# Extract out the coefficients
coef_out <- coef(bus_out)

# Convert the coefficients to odds-ratios
exp(coef_out)

(Intercept) CommuteDays
  0.2334164   1.1386623

What is the meaning of the odds-ratio that you extracted?

The odds-ratio was >1, so commuting more days increased the chance of riding the bus.

----------------------------------------------------------------------------------------------------------------------------------------------

Odds-ratios & confidence intervals in the Tidyverse

The broom package in the Tidyverse gives us a powerful tool, tidy(), for extracting coefficients and making them readable. More specifically, it allows us to extract the coefficients, as well as the confidence intervals, and exponentiate the outputs. Confidence intervals are another approach to statistical inference: if the confidence interval for an odds-ratio does not include 1, the corresponding odds-ratio is statistically different from 1. During this exercise, you will use tidy() to extract the 95% confidence intervals from the bus model in the previous exercises.

* Use the tidy() function on bus_out to extract the coefficients.
* Set exponentiate to TRUE to convert the results to odds-ratios.
* Set conf.int to TRUE to include confidence intervals.
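For comparison, the same quantities can be computed in base R without broom, by exponentiating coef() and confint(). A self-contained sketch on a small simulated dataset (the bus data is not reproduced here, so the variable names and numbers below are made up for illustration):

```r
# Simulate a small binary outcome with one predictor (illustrative only)
set.seed(42)
d <- data.frame(x = rep(1:5, each = 40))
d$y <- rbinom(nrow(d), size = 1, prob = plogis(-1 + 0.3 * d$x))

fit <- glm(y ~ x, family = "binomial", data = d)

# Odds-ratios: exponentiate the link-scale coefficients
exp(coef(fit))

# 95% confidence intervals on the odds-ratio scale
# (confint() profiles the likelihood, as tidy() does for a glm)
exp(confint(fit))
```

The link-scale interval is computed first and then exponentiated, which is why the resulting interval is not symmetric around the odds-ratio.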
# Exponentiate the results and extract the confidence intervals of bus_out with tidy()
tidy(bus_out, exponentiate = TRUE, conf.int = TRUE)

# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
1 (Intercept)    0.233    0.115      -12.7 7.32e-37    0.186     0.292
2 CommuteDays    1.14     0.0231       5.62 1.96e- 8    1.09      1.19

Does adding an extra commuting day significantly change the odds-ratio for riding the bus?

The confidence interval for CommuteDays does not include 1.0, so the results are statistically significant.

----------------------------------------------------------------------------------------------------------------------------------------------

Default trend lines

ggplot2 gives us powerful tools to visualize our data and "see" what is going on. However, sometimes we need to wrangle our data to get what we need. During this exercise, you will examine whether the length of a commute changes the chance that somebody rides the bus. You will do this by making the points non-overlapping (or jittered) using geom_jitter() and adding a smoothed trend line with geom_smooth().

* Using the data.frame bus, plot MilesOneWay on the x-axis and Bus2 on the y-axis.
* Jitter the width of the data by 0 and the height by 0.05.
* Label the axes on the plot. Use "Probability of riding the bus" on the y-axis and "One-way commute trip (in miles)" on the x-axis.
* Save this plot as gg_jitter, then add geom_smooth().
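Bus2 is a numeric 0/1 version of the Yes/No Bus column, which is what geom_smooth() and glm() need for a binomial model. A minimal sketch of how such a recoding can be done (a toy data frame, not the real bus data):

```r
# Toy data frame standing in for the bus data (illustrative only)
toy <- data.frame(Bus = c("Yes", "Yes", "No", "Yes", "No"))

# Recode Yes/No to 1/0: the comparison gives TRUE/FALSE,
# and as.numeric() converts that to 1/0
toy$Bus2 <- as.numeric(toy$Bus == "Yes")
toy
```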
> glimpse(bus)
Rows: 15,892
Columns: 4
$ CommuteDays 5, 5, 5, 5, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5…
$ MilesOneWay 19.54675, 19.54675, 19.54675, 19.54675, 19.54675, 21.66784…
$ Bus         Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, No, No, No, No, …
$ Bus2        1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1…

# Create a jittered plot of MilesOneWay vs Bus2 using the bus dataset
gg_jitter <- ggplot(data = bus, aes(x = MilesOneWay, y = Bus2)) +
    geom_jitter(width = 0, height = 0.05) +
    ylab("Probability of riding the bus") +
    xlab("One-way commute trip (in miles)")

# Add a geom_smooth() to your plot
gg_jitter + geom_smooth()

----------------------------------------------------------------------------------------------------------------------------------------------

Methods for trend lines

During the previous exercise, you used ggplot2's default geom_smooth(). During this exercise, you will use a glm() instead. This will allow you to "see" a logistic regression with ggplot2. Specifically, you will plot the probability that someone takes the bus given their commute distance. You will need to tell geom_smooth() to use the glm() method. Recall from Chapter 2 that the default family for a glm() is the Gaussian family, which produces the same results as lm(). Thus, you will also need to pass the family argument to glm() through method.args. The code for creating gg_jitter, which you built in the last exercise, has been provided for you.

* Use the "glm" method with geom_smooth().
* With method.args, set the family to 'binomial'.
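Under the hood, this geom_smooth() call fits the same logistic regression you could fit yourself with glm() and then plots the response-scale predictions. A self-contained sketch on simulated commute-style data (the variable names and numbers are made up, not the real bus data):

```r
# Simulate commute-style data: riding becomes less likely as distance grows
set.seed(1)
miles <- runif(300, 1, 40)
rides <- rbinom(300, size = 1, prob = plogis(1 - 0.1 * miles))

# The model geom_smooth(method = "glm",
#                       method.args = list(family = "binomial")) would fit
fit <- glm(rides ~ miles, family = "binomial")

# Predicted probabilities on the response scale -- the smoothed curve
new_dat <- data.frame(miles = c(5, 20, 35))
p_hat <- predict(fit, newdata = new_dat, type = "response")
p_hat
```

type = "response" is what keeps the curve on the probability scale; the default (link-scale) predictions would be log-odds instead.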
# Create a jittered plot of MilesOneWay vs Bus2 using the bus dataset
gg_jitter <- ggplot(data = bus, aes(x = MilesOneWay, y = Bus2)) +
    geom_jitter(width = 0, height = 0.05) +
    ylab("Probability of riding the bus") +
    xlab("One-way commute trip (in miles)")

# Add a geom_smooth() that uses a GLM method to your plot
gg_jitter + geom_smooth(method = "glm", method.args = list(family = "binomial"))

----------------------------------------------------------------------------------------------------------------------------------------------

Comparing probits and logits

You learned about the probit and logit link functions in the previous chapter. Now, you will learn how to plot them together with ggplot2. The purpose of this exercise is to show you that the two are often very similar, and also to show you how to change link functions with ggplot2. Your plot, gg_jitter, has been loaded for you, and you will use it with geom_smooth(). Also, you will change the colors of your curves and remove the shaded areas to make the plot easier to read.

* Plot a probit curve: in the first geom_smooth(), set the family to use a probit link with binomial(link = 'probit'), the color to 'red', and se to FALSE.
* Plot a logit curve: in the second geom_smooth(), set the family to use a logit link with binomial(link = 'logit'), the color to 'blue', and se to FALSE.

# Add geom_smooth() lines for the probit and logit link functions
gg_jitter +
    geom_smooth(method = 'glm',
                method.args = list(family = binomial(link = 'probit')),
                color = 'red', se = FALSE) +
    geom_smooth(method = 'glm',
                method.args = list(family = binomial(link = 'logit')),
                color = 'blue', se = FALSE)
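The visual similarity of the two curves can also be checked numerically by fitting both links and comparing their fitted probabilities. A self-contained sketch on simulated data (made up for illustration, not the bus data):

```r
# Simulate a binary outcome with one continuous predictor
set.seed(2)
x <- runif(500, -2, 2)
y <- rbinom(500, size = 1, prob = plogis(0.8 * x))

# Fit the same model with logit and probit links
logit_fit  <- glm(y ~ x, family = binomial(link = "logit"))
probit_fit <- glm(y ~ x, family = binomial(link = "probit"))

# Compare fitted probabilities across a grid of predictor values
grid <- data.frame(x = seq(-2, 2, by = 0.5))
p_logit  <- predict(logit_fit,  newdata = grid, type = "response")
p_probit <- predict(probit_fit, newdata = grid, type = "response")

round(cbind(grid, p_logit, p_probit), 3)
```

On the response scale the two fits typically differ by only a few percentage points, which is why the red and blue curves nearly overlap; the raw coefficients, by contrast, are on different scales and are not directly comparable between links.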