lm vs. Poisson coefficients

A linear model and a Poisson regression have been fit for you. In this exercise, you'll extract the coefficients from both models and compare them. In Step 3, you will need to exponentiate the results from the Poisson regression. This is necessary because the Poisson coefficients are estimated on the link scale, not the raw data scale. Recall from the video that linear model coefficients are additive, whereas Poisson regression coefficients are multiplicative.

* You have been given the model lm_out. Extract its coefficients using coef() and save them as lm_coef.
* You have been given the model poisson_out. Extract its coefficients using coef() and save them as poisson_coef.
* Exponentiate the coefficients in poisson_coef using exp() and save the result as poisson_coef_exp.

> lm_out

Call:  lm(formula = Number ~ Month, data = dat)

Coefficients:
(Intercept)       Month2       Month3       Month4       Month5       Month6
   0.129477    -0.038031    -0.078401    -0.057254    -0.032702    -0.043365
     Month7       Month8       Month9      Month10      Month11      Month12
  -0.005821    -0.051520    -0.023921    -0.054208    -0.023921    -0.022919

> poisson_out

Call:  glm(formula = Number ~ Month, family = "poisson", data = dat)

Coefficients:
(Intercept)       Month2       Month3       Month4       Month5       Month6
    -2.0443      -0.3478      -0.9302      -0.5838      -0.2911      -0.4079
     Month7       Month8       Month9      Month10      Month11      Month12
    -0.0460      -0.5073      -0.2043      -0.5424      -0.2043      -0.1948

Degrees of Freedom: 4367 Total (i.e.
Null);  4356 Residual
Null Deviance:      2325
Residual Deviance: 2303         AIC: 2976

# Extract the coefficients from lm_out
lm_coef <- coef(lm_out)
lm_coef

(Intercept)      Month2      Month3      Month4      Month5      Month6
 0.12947658 -0.03803116 -0.07840132 -0.05725436 -0.03270239 -0.04336547
     Month7      Month8      Month9     Month10     Month11     Month12
-0.00582067 -0.05151959 -0.02392103 -0.05420777 -0.02392103 -0.02291921

# Extract the coefficients from poisson_out
poisson_coef <- coef(poisson_out)
poisson_coef

(Intercept)      Month2      Month3      Month4      Month5      Month6
-2.04425523 -0.34775767 -0.93019964 -0.58375226 -0.29111968 -0.40786159
     Month7      Month8      Month9     Month10     Month11     Month12
-0.04599723 -0.50734279 -0.20426264 -0.54243411 -0.20426264 -0.19481645

# Take the exponential using exp()
poisson_coef_exp <- exp(poisson_coef)
poisson_coef_exp

(Intercept)    Month2    Month3    Month4    Month5    Month6
  0.1294766 0.7062700 0.3944749 0.5578014 0.7474262 0.6650709
   Month7    Month8    Month9   Month10   Month11   Month12
0.9550446 0.6020933 0.8152482 0.5813315 0.8152482 0.8229857

How do the exponentiated Poisson regression coefficients (poisson_coef_exp) compare to the linear regression coefficients (lm_coef)?

The intercepts are essentially the same, but the other coefficients (Month2, Month3, etc.) are different. This is because the Poisson coefficients are multiplicative. Notice that 0.129 * 0.706 = 0.091 from the Poisson coefficients matches 0.129 - 0.038 = 0.091 from the linear model.

----------------------------------------------------------------------------------------------------------------------------------------------

Poisson regression plotting

Using geom_smooth() to plot data from a Poisson regression is similar to plotting a binomial; you only need to change the family argument in geom_smooth(). During this exercise, you will plot the number of cancer cells per cm and add a trend line with geom_smooth(). This is the same simulated dataset you saw in the video.
* Use geom_jitter() to set both the jitter width and height to 0.05.
* Use geom_smooth() to create a trend line based on a Poisson regression.

> dat
    dose cells
1      1     1
2      1     0
3      1     0
4      1     0
5      1     0
6      1     2
7      1     0
8      1     1
9      1     2
10     1     0
11     2     3
12     2     0
13     2     2
14     2     2
15     2     1
16     2     0
17     2     1
18     2     2
19     2     2
20     2     2
21     3     2
22     3     3
23     3     5
24     3     3
25     3     0
26     3     3
27     3     6
28     3     2
29     3     4
30     3     4
31     4     2
32     4     2
33     4     8
34     4     4
35     4     4
36     4     4
37     4     7
38     4     2
39     4     6
40     4     5
41     5     2
42     5     5
43     5     8
44     5     4
45     5     7
46     5     4
47     5     4
48     5     7
49     5     9
50     5     3
51     6     6
52     6     7
53     6     9
54     6     5
55     6     3
56     6     5
57     6     5
58     6     3
59     6     4
60     6    11
61     7     2
62     7     7
63     7     9
64     7     3
65     7     4
66     7     2
67     7     6
68     7     5
69     7     5
70     7     6
71     8     4
72     8     5
73     8     8
74     8    10
75     8    11
76     8     9
77     8     8
78     8     8
79     8    11
80     8     7
81     9    10
82     9    12
83     9     9
84     9    12
85     9    10
86     9    12
87     9     9
88     9    17
89     9     6
90     9     9
91    10    15
92    10    11
93    10    11
94    10    10
95    10     4
96    10     9
97    10    13
98    10     8
99    10     8
100   10    13

# Use geom_smooth to plot a continuous predictor variable
ggplot(data = dat, aes(x = dose, y = cells)) +
    geom_jitter(width = 0.05, height = 0.05) +
    geom_smooth(method = 'glm', method.args = list(family = 'poisson'))

----------------------------------------------------------------------------------------------------------------------------------------------

Extracting and interpreting odds-ratios

You looked at the bus data in this chapter and in Chapter 2. During this exercise, you are going to extract the odds-ratio from a model that predicts the chance of riding the bus based upon the number of commuting days. The resulting odds-ratio (indicated by the CommuteDays term) gives you the change in odds for each extra day a person commutes. For example, an odds-ratio of 1.5 would mean that for each extra day a person commutes per week, the odds that they ride the bus increase by 50% (1.5 - 1.0 = 0.5 = 50%).

A logistic regression, bus_out, has been fit for you.

* Using base R, extract the regression coefficients using coef() and save this as coef_out.
* Convert these coefficients to odds-ratios by calculating the exponential of the results with exp().

> bus_out

Call:  glm(formula = Bus ~ CommuteDays, family = "binomial", data = bus)

Coefficients:
(Intercept)  CommuteDays
    -1.4549       0.1299

Degrees of Freedom: 15891 Total (i.e. Null);  15890 Residual
Null Deviance:      19570
Residual Deviance: 19540        AIC: 19540

# Extract out the coefficients
coef_out <- coef(bus_out)

# Convert the coefficients to odds-ratios
exp(coef_out)

(Intercept) CommuteDays
  0.2334164   1.1386623

What is the meaning of the odds-ratio that you extracted?

The odds-ratio was >1, so commuting more days increased the chance of riding the bus.

----------------------------------------------------------------------------------------------------------------------------------------------

Odds-ratios & confidence intervals in the Tidyverse

The broom package in the Tidyverse gives us a powerful tool, tidy(), for extracting coefficients and making them readable. More specifically, it allows us to extract the coefficients, as well as the confidence intervals, and exponentiate the outputs. Confidence intervals are another approach to statistical inference: if the confidence interval for an odds-ratio does not include 1, the corresponding odds-ratio is statistically different from 1. During this exercise, you will use tidy() to extract the 95% confidence intervals from the bus model in the previous exercises.

* Use the tidy() function on bus_out to extract the coefficients.
* Set exponentiate to TRUE to convert the results to odds-ratios.
* Set conf.int to TRUE to include confidence intervals.
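For comparison, the same quantities can be computed in base R without broom, by exponentiating coef() and confint(). A self-contained sketch on a small simulated dataset (the bus data is not reproduced here, so the variable names and numbers below are made up for illustration):

```r
# Simulate a small binary outcome with one predictor (illustrative only)
set.seed(42)
d <- data.frame(x = rep(1:5, each = 40))
d$y <- rbinom(nrow(d), size = 1, prob = plogis(-1 + 0.3 * d$x))

fit <- glm(y ~ x, family = "binomial", data = d)

# Odds-ratios: exponentiate the link-scale coefficients
exp(coef(fit))

# 95% confidence intervals on the odds-ratio scale
# (confint() profiles the likelihood, as tidy() does for a glm)
exp(confint(fit))
```

The link-scale interval is computed first and then exponentiated, which is why the resulting interval is not symmetric around the odds-ratio.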
# Exponentiate the results and extract the confidence intervals of bus_out with tidy()
tidy(bus_out, exponentiate = TRUE, conf.int = TRUE)

# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
1 (Intercept)    0.233    0.115      -12.7 7.32e-37    0.186     0.292
2 CommuteDays    1.14     0.0231       5.62 1.96e- 8    1.09      1.19

Does adding an extra commuting day significantly change the odds-ratio for riding the bus?

The confidence interval for CommuteDays does not include 1.0, so the results are statistically significant.

----------------------------------------------------------------------------------------------------------------------------------------------

Default trend lines

ggplot2 gives us powerful tools to visualize our data and "see" what is going on. However, sometimes we need to wrangle our data to get what we need. During this exercise, you will examine whether the length of a commute changes the chance that somebody rides the bus. You will do this by making the points non-overlapping (or jittered) using geom_jitter() and adding a smoothed trend line with geom_smooth().

* Using the data.frame bus, plot MilesOneWay on the x-axis and Bus2 on the y-axis.
* Jitter the width of the data by 0 and the height by 0.05.
* Label the axes on the plot. Use "Probability of riding the bus" on the y-axis and "One-way commute trip (in miles)" on the x-axis.
* Save this plot as gg_jitter, then add geom_smooth().
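Bus2 is a numeric 0/1 version of the Yes/No Bus column, which is what geom_smooth() and glm() need for a binomial model. A minimal sketch of how such a recoding can be done (a toy data frame, not the real bus data):

```r
# Toy data frame standing in for the bus data (illustrative only)
toy <- data.frame(Bus = c("Yes", "Yes", "No", "Yes", "No"))

# Recode Yes/No to 1/0: the comparison gives TRUE/FALSE,
# and as.numeric() converts that to 1/0
toy$Bus2 <- as.numeric(toy$Bus == "Yes")
toy
```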
> glimpse(bus)
Rows: 15,892
Columns: 4
$ CommuteDays 5, 5, 5, 5, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 5, 5…
$ MilesOneWay 19.54675, 19.54675, 19.54675, 19.54675, 19.54675, 21.66784…
$ Bus         Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, No, No, No, No, …
$ Bus2        1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1…

# Create a jittered plot of MilesOneWay vs Bus2 using the bus dataset
gg_jitter <- ggplot(data = bus, aes(x = MilesOneWay, y = Bus2)) +
    geom_jitter(width = 0, height = 0.05) +
    ylab("Probability of riding the bus") +
    xlab("One-way commute trip (in miles)")

# Add a geom_smooth() to your plot
gg_jitter + geom_smooth()

----------------------------------------------------------------------------------------------------------------------------------------------

Methods for trend lines

During the previous exercise, you used ggplot2's default geom_smooth(). During this exercise, you will use a glm() instead. This will allow you to "see" a logistic regression with ggplot2. Specifically, you will plot the probability that someone takes the bus given their commute distance. You will need to tell geom_smooth() to use the glm() method. Recall from Chapter 2 that the default family for a glm() is the Gaussian family, which produces the same results as lm(). Thus, you will also need to pass the family argument to glm() through method.args. The code for creating gg_jitter, which you built in the last exercise, has been provided for you.

* Use the "glm" method with geom_smooth().
* With method.args, set the family to 'binomial'.
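Under the hood, this geom_smooth() call fits the same logistic regression you could fit yourself with glm() and then plots the response-scale predictions. A self-contained sketch on simulated commute-style data (the variable names and numbers are made up, not the real bus data):

```r
# Simulate commute-style data: riding becomes less likely as distance grows
set.seed(1)
miles <- runif(300, 1, 40)
rides <- rbinom(300, size = 1, prob = plogis(1 - 0.1 * miles))

# The model geom_smooth(method = "glm",
#                       method.args = list(family = "binomial")) would fit
fit <- glm(rides ~ miles, family = "binomial")

# Predicted probabilities on the response scale -- the smoothed curve
new_dat <- data.frame(miles = c(5, 20, 35))
p_hat <- predict(fit, newdata = new_dat, type = "response")
p_hat
```

type = "response" is what keeps the curve on the probability scale; the default (link-scale) predictions would be log-odds instead.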
# Create a jittered plot of MilesOneWay vs Bus2 using the bus dataset
gg_jitter <- ggplot(data = bus, aes(x = MilesOneWay, y = Bus2)) +
    geom_jitter(width = 0, height = 0.05) +
    ylab("Probability of riding the bus") +
    xlab("One-way commute trip (in miles)")

# Add a geom_smooth() that uses a GLM method to your plot
gg_jitter + geom_smooth(method = "glm", method.args = list(family = "binomial"))

----------------------------------------------------------------------------------------------------------------------------------------------

Comparing probits and logits

You learned about the probit and logit link functions in the previous chapter. Now, you will learn how to plot them together with ggplot2. The purpose of this exercise is to show you that the two are often very similar, and also to show you how to change link functions with ggplot2. Your plot, gg_jitter, has been loaded for you, and you will use it with geom_smooth(). Also, you will change the colors of your curves and remove the shaded areas to make the plot easier to read.

* Plot a probit curve: in the first geom_smooth(), set the family to use a probit link with binomial(link = 'probit'), the color to 'red', and se to FALSE.
* Plot a logit curve: in the second geom_smooth(), set the family to use a logit link with binomial(link = 'logit'), the color to 'blue', and se to FALSE.

# Add geom_smooth() lines for the probit and logit link functions
gg_jitter +
    geom_smooth(method = 'glm',
                method.args = list(family = binomial(link = 'probit')),
                color = 'red', se = FALSE) +
    geom_smooth(method = 'glm',
                method.args = list(family = binomial(link = 'logit')),
                color = 'blue', se = FALSE)
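The visual similarity of the two curves can also be checked numerically by fitting both links and comparing their fitted probabilities. A self-contained sketch on simulated data (made up for illustration, not the bus data):

```r
# Simulate a binary outcome with one continuous predictor
set.seed(2)
x <- runif(500, -2, 2)
y <- rbinom(500, size = 1, prob = plogis(0.8 * x))

# Fit the same model with logit and probit links
logit_fit  <- glm(y ~ x, family = binomial(link = "logit"))
probit_fit <- glm(y ~ x, family = binomial(link = "probit"))

# Compare fitted probabilities across a grid of predictor values
grid <- data.frame(x = seq(-2, 2, by = 0.5))
p_logit  <- predict(logit_fit,  newdata = grid, type = "response")
p_probit <- predict(probit_fit, newdata = grid, type = "response")

round(cbind(grid, p_logit, p_probit), 3)
```

On the response scale the two fits typically differ by only a few percentage points, which is why the red and blue curves nearly overlap; the raw coefficients, by contrast, are on different scales and are not directly comparable between links.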