Refresher on fitting linear models

During this exercise, you'll see how a linear model is a special case of a generalized linear model (GLM). You will use the ChickWeight dataset that we introduced during the video. I have already extracted the last observed weights for you into chick_weight_end. With this dataset, you will see if Diet affects weight at the end of the study. Recall from the video that a linear model looks like lm(formula = y ~ x, data = dat) and that a GLM looks like glm(formula = y ~ x, data = dat, family = 'family').

* Fit a linear model with variable weight predicted by variable Diet using the chick_weight_end dataframe.
* Fit a GLM using the same formula and data, but also include the gaussian family.

> glimpse(ChickWeight)
Rows: 578
Columns: 4
$ weight 42, 51, 59, 64, 76, 93, 106, 125, 149, 171, 199, 205, 40, 49, 5…
$ Time   0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 21, 0, 2, 4, 6, 8, 10, 1…
$ Chick  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ Diet   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

> glimpse(chick_weight_end)
Rows: 45
Columns: 4
$ weight 205, 215, 202, 157, 223, 157, 305, 98, 124, 175, 205, 96, 266, …
$ Time   21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,…
$ Chick  1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 17, 19, 20, 21, 22,…
$ Diet   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, …

# Fit a lm()
lm(formula = weight ~ Diet, data = chick_weight_end)

# Fit a glm()
glm(formula = weight ~ Diet, data = chick_weight_end, family = 'gaussian')

----------------------------------------------------------------------------------------------------------------------------------------------

Fitting a Poisson regression in R

In this exercise, you will fit a Poisson regression using glm(). GLMs in R use the same formula notation as linear models: response ~ predictor. However, the model also requires the error or distribution family to be specified with the family = ... argument. To fit a Poisson regression, set family equal to "poisson" (make sure to include the quotes " around poisson). You will use a simulated dataset with a predictor variable time and a response variable count. A real dataset could be the number of points scored, clicks on a webpage, or birds you see out your window.

* Using data.frame dat, fit a Poisson regression where count is predicted by time with the poisson family. Save the model as poisson_out.
* Print poisson_out to the screen.

> dat
   time count
1     1     0
2     2     0
3     3     0
4     4     0
5     5     1
6     6     0
7     7     0
8     8     1
9     9     0
10   10     0
11   11     2
12   12     0
13   13     1
14   14     0
15   15     0
16   16     1
17   17     0
18   18     0
19   19     0
20   20     2
21   21     2
22   22     1
23   23     1
24   24     4
25   25     1
26   26     1
27   27     1
28   28     1
29   29     0
30   30     0

# Fit count predicted by time with data.frame dat using the poisson family
poisson_out <- glm(count ~ time, data = dat, family = 'poisson')

# Print the output
print(poisson_out)
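The printed Poisson coefficients are on the log (link) scale. As an optional follow-up sketch, assuming the poisson_out object fit above, exponentiating the coefficients puts the estimates back on the count scale:

# Coefficients from a Poisson regression are on the log scale.
# Exponentiating makes them easier to interpret (a sketch, not part of the
# exercise): exp(intercept) is the expected count at time = 0, and
# exp(slope) is the multiplicative change in the expected count for each
# one-unit increase in time.
exp(coef(poisson_out))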
----------------------------------------------------------------------------------------------------------------------------------------------

Comparing linear and Poisson regression

During this exercise, you will fit a linear regression and see how its results differ from those of the Poisson regression you previously fit. In the previous exercise, you fit a GLM with a Poisson family, poisson_out. This object has been provided for you. Now, using glm(), fit a model with a normal or Gaussian family.

* Fit a GLM with a gaussian family where count is predicted by time using data.frame dat.
* Print the summary for the lm_out model first and the poisson_out model second, and notice the different test statistics and p-values.

# Fit a glm with count predicted by time using data.frame dat and the gaussian family
lm_out <- glm(count ~ time, data = dat, family = 'gaussian')

summary(lm_out)

Call:
glm(formula = count ~ time, family = "gaussian", data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2022  -0.5190  -0.1497   0.2595   3.0194  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.09425    0.32891   0.287   0.7766  
time         0.03693    0.01853   1.993   0.0561 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.7714815)

    Null deviance: 24.667  on 29  degrees of freedom
Residual deviance: 21.601  on 28  degrees of freedom
AIC: 81.283

Number of Fisher Scoring iterations: 2

summary(poisson_out)

Call:
glm(formula = count ~ time, family = "poisson", data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6547  -0.9666  -0.7226   0.3830   2.3022  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -1.43036    0.59004  -2.424   0.0153 *
time         0.05815    0.02779   2.093   0.0364 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 35.627  on 29  degrees of freedom
Residual deviance: 30.918  on 28  degrees of freedom
AIC: 66.024

Number of Fisher Scoring iterations: 5

Notice how the Poisson regression was more powerful than the linear model here. This means a Poisson regression can sometimes find trends that a simple linear regression will miss, which is especially true for count data with many zeros. In this specific case, the trend estimate from the Poisson regression had a p-value < 0.05, indicating it was significantly different from zero, while the linear model's trend estimate was not.
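One way to see the practical difference is to compare the two models' fitted values on the response scale. This is a minimal sketch, assuming the lm_out and poisson_out objects fit above:

# Compare fitted values from the two models on the response (count) scale.
# The Gaussian fit is a straight line, while the Poisson fit exponentiates
# the linear predictor, so it is always non-negative and curves upward.
head(data.frame(
  time     = dat$time,
  gaussian = predict(lm_out, type = "response"),
  poisson  = predict(poisson_out, type = "response")
))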
----------------------------------------------------------------------------------------------------------------------------------------------

Intercepts-comparisons versus means

This exercise will show you how R's formulas allow two types of intercepts to be estimated. You will use data for the number of goals per game for two players, "Scoring Sam" and "Unlucky Lou". Because these are counts, use a glm() with a Poisson family. First, fit a model that estimates an intercept and the difference between the two players. This is the default formula option in R and looks like formula = response ~ predictor. Second, fit a model that estimates the effect (or intercept) of each player. This formula option in R requires a - 1, for example formula = response ~ predictor - 1.

* Using the data, scores, fit a Poisson glm using goal predicted by player. Look at the summary of the output.
* Using the data, scores, fit a Poisson glm using goal predicted by player - 1 to estimate the average for each player. Look at the summary of the output.

> scores
   player goal
1     Sam    1
2     Sam    2
3     Sam    0
4     Sam    4
5     Sam    3
6     Lou    0
7     Lou    0
8     Lou    1
9     Lou    0
10    Lou    0

# Fit a glm() that estimates the difference between players
summary(glm(goal ~ player, data = scores, family = 'poisson'))

Call:
glm(formula = goal ~ player, family = "poisson", data = scores)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0000  -0.6325  -0.6325   0.4934   1.2724  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -1.6094     0.9999  -1.610   0.1075  
playerSam     2.3026     1.0487   2.196   0.0281 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18.3578  on 9  degrees of freedom
Residual deviance:  9.8105  on 8  degrees of freedom
AIC: 26.682

Number of Fisher Scoring iterations: 5

# Fit a glm() that estimates an intercept for each player
summary(glm(goal ~ player - 1, data = scores, family = 'poisson'))

Call:
glm(formula = goal ~ player - 1, family = "poisson", data = scores)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0000  -0.6325  -0.6325   0.4934   1.2724  

Coefficients:
          Estimate Std. Error z value Pr(>|z|)  
playerLou  -1.6094     0.9999  -1.610   0.1075  
playerSam   0.6931     0.3162   2.192   0.0284 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18.4546  on 10  degrees of freedom
Residual deviance:  9.8105  on  8  degrees of freedom
AIC: 26.682

Number of Fisher Scoring iterations: 5

The formula goal ~ player estimated two parameters: the first was the number of goals scored by Lou (the intercept), and the second was the difference in Sam's goals compared to Lou's goals. The formula goal ~ player - 1 also estimated two parameters: the number of goals scored by Lou and the number of goals scored by Sam. Both sets of estimates are on the log scale because Poisson regressions use a log-link function. As you can see, R allows two different types of intercepts to be estimated. The first and default setting estimated the number of goals scored by Lou and the difference in goals scored by Sam. The second setting, using - 1, estimated the number of goals scored by Lou AND the number of goals scored by Sam. More generally, the default formula in R uses the first factor level as the reference and contrasts all other levels to it. Reordering a factor allows you to change the reference level.
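Because these intercepts are on the log scale, a quick sketch (assuming the scores data above; means_out is a hypothetical name) shows how to recover each player's average goals per game:

# The player-specific intercepts from the ~ player - 1 fit are log means.
# Exponentiating them recovers the average goals per game for each player.
means_out <- glm(goal ~ player - 1, data = scores, family = 'poisson')
exp(coef(means_out))
# playerLou: exp(-1.6094) = 0.2 goals per game
# playerSam: exp( 0.6931) = 2.0 goals per game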
----------------------------------------------------------------------------------------------------------------------------------------------

Applying summary(), print(), and tidy() to glm

print() and summary() are two of the backbone R functions for examining models. These functions tell us about the basic model outputs. Using data from Louisville, KY, USA about the number of civilian fire injury victims per day, you will examine a linear regression's output and a Poisson regression's output. In addition to the base R functions, we will use the tidy() function from the broom package, which creates a tidy output for each of the models. Note that broom needs to be installed and does not come with base R.

* Build a linear regression and a Poisson regression using the data frame dat, where Number is predicted by Month, with lm() and glm().
* Examine the print() output for both models.
* Examine the summary() output for both models.
* Examine the tidy() output for both models.

> glimpse(dat)
Rows: 4,368
Columns: 3
$ Date   "2005-01-10", "2005-01-11", "2005-01-12", "2005-01-13", "2005-0…
$ Number 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 2, …
$ Month  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

# Build your linear and Poisson regression models
lm_out <- lm(Number ~ Month, data = dat)
poisson_out <- glm(Number ~ Month, data = dat, family = "poisson")

# Examine the outputs using print()
print(lm_out)
print(poisson_out)

# Examine the outputs using summary()
summary(lm_out)
summary(poisson_out)

# Examine the outputs using tidy()
tidy(lm_out)
tidy(poisson_out)

----------------------------------------------------------------------------------------------------------------------------------------------

Extracting coefficients from glm()

Sometimes, we specifically care about model coefficients and their confidence intervals. The coef() function extracts coefficients and confint() extracts the confidence intervals. These functions work the same for both linear and generalized linear models. The GLM you fit in the previous exercise (poisson_out) has been loaded for you.

* Using coef(), extract the regression coefficients from poisson_out.
* Using confint(), extract the confidence intervals for the regression coefficients from poisson_out.

# Extract the regression coefficients
coef(poisson_out)
(Intercept)      Month2      Month3      Month4      Month5      Month6 
-2.04425523 -0.34775767 -0.93019964 -0.58375226 -0.29111968 -0.40786159 
     Month7      Month8      Month9     Month10     Month11     Month12 
-0.04599723 -0.50734279 -0.20426264 -0.54243411 -0.20426264 -0.19481645 

# Extract the confidence intervals
confint(poisson_out)
Waiting for profiling to be done...
                 2.5 %      97.5 %
(Intercept) -2.3444432 -1.77136313
Month2      -0.8103027  0.10063404
Month3      -1.4866061 -0.41424128
Month4      -1.0762364 -0.11342457
Month5      -0.7311289  0.14051326
Month6      -0.8704066  0.04053012
Month7      -0.4542037  0.36161360
Month8      -0.9807831 -0.05092540
Month9      -0.6367321  0.22171492
Month10     -1.0218277 -0.08165226
Month11     -0.6367321  0.22171492
Month12     -0.6237730  0.22851779

Notice how each coefficient estimate falls within its confidence interval. If the coef() output for a coefficient does not fall within its confidence interval, something is wrong and needs to be investigated.
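Since these coefficients and intervals are on the log scale, one optional follow-up is to exponentiate both so the output reads on the count scale. This is a sketch assuming the poisson_out object above; rate_ratios is a hypothetical name:

# Exponentiate the coefficients and their confidence intervals.
# exp(intercept) is the expected daily count for the reference month
# (Month 1), and each exp(MonthX) is a rate ratio relative to that month.
rate_ratios <- exp(cbind(estimate = coef(poisson_out), confint(poisson_out)))
round(rate_ratios, 3)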
----------------------------------------------------------------------------------------------------------------------------------------------

Predicting with glm()

Data scientists often use models to predict future situations. GLMs are one such tool, and using them this way is sometimes called supervised learning. For this exercise, you will predict the expected number of daily civilian fire injury victims for the North American summer months of June (6), July (7), and August (8) using the Poisson regression you previously fit and the new_dat dataset. Recall that the Poisson slope and intercept estimates are on the natural log scale and can be exponentiated to be more easily understood. You can do this by specifying type = "response" with the predict() function.

* Print new_dat to see your new prediction situation.
* Use the fitted Poisson regression, poisson_out, as the object and new_dat as the new data in predict(). Be sure to exponentiate your output by setting type = "response". Save the results as pred_out.
* Print pred_out.

> new_dat
  Month
1     6
2     7
3     8

# Print the new input months
print(new_dat)

# Use the model to predict with new data
pred_out <- predict(object = poisson_out, newdata = new_dat, type = "response")

# Print the predictions
print(pred_out)
         1          2          3 
0.08611111 0.12365591 0.07795699 

You've just predicted the average number of daily civilian fire injury victims for the three summer months.
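As a closing sketch, assuming the poisson_out and new_dat objects above (link_pred is a hypothetical name), you can confirm that type = "response" is just the exponentiated prediction from the log (link) scale:

# Predict on the link (log) scale, then exponentiate by hand.
link_pred <- predict(object = poisson_out, newdata = new_dat, type = "link")
exp(link_pred)
# Matches pred_out: roughly 0.086, 0.124, and 0.078 expected victims per day
# for June, July, and August.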