My personal course notes from the Hierarchical and Mixed Effects Models in R course on DataCamp taught by Richard Erickson. His other course, Generalized Linear Models in R is a pre-requisite to this one. Dr. Erickson is a quantitative ecologist.
This course begins by reviewing slopes and intercepts in linear regressions before moving on to random-effects. You’ll learn what a random effect is and how to use one to model your data. Next, the course covers linear mixed-effect regressions. These powerful models will allow you to explore data with a more complicated structure than a standard linear regression. The course then teaches generalized linear mixed-effect regressions. Generalized linear mixed-effects models allow you to model more kinds of data, including binary responses and count data. Lastly, the course goes over repeated-measures analysis as a special case of mixed-effect modeling. This kind of data appears when subjects are followed over time and measurements are collected at intervals. Throughout the course you’ll work with real data to answer interesting questions using mixed-effects models.
library(tidyverse)
library(tidymodels) # broom
library(lme4)
library(lmerTest)
library(broom.mixed)
library(readxl)
library(knitr)
student_data <- read_csv("data/classroom.csv")
county_births_data <- read_csv("data/countyBirthsDataUse.csv")
md_crime <- read_csv("data/MDcrime.csv")
hate <- read_csv("data/hateNY.csv")
il_data <- read_csv("data/ILdata.csv")
user_groups <- read_xlsx("data/user_groups.xlsx")
sleep <- read_xlsx("data/sleep.xlsx")
The first chapter provides an example of when to use a mixed-effect and also describes the parts of a regression. The chapter also examines a student test-score dataset with a nested structure to demonstrate mixed-effects.
Often, many different types of observed data lend themselves to hierarchical models. Below are six examples of data that were analyzed using mixed-effect regressions. In this case, the term mixed-effect regression is used because this variation of the term hierarchical model was used in all of these studies.
Instructions
Solution
You have seen how different types of predictor variables can be observed at different levels. Hierarchical models can capture this structure.
Before using hierarchical models, explore the data with a plot and linear model. Plotting data gives you visual intuition and allows you to see potential trends or problems. Building a simple model provides a comparison to a hierarchical model. Building a simple linear model provides a starting place for hierarchical models and troubleshooting the linear model is quicker and simpler than a hierarchical model.
During this exercise, you will plot how a student’s gain in math scores may be predicted by the teacher’s math knowledge.
Instructions
student_data
) with
ggplot2
’s ggplot()
.mathknow
) on the
x-axis and the student’s gain in math (mathgain
) as the
y-axis.geom_smooth()
to add a lm
trend
line.ggplot(student_data, aes(mathknow, mathgain))+
geom_point()+
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 109 rows containing non-finite values (stat_smooth).
## Warning: Removed 109 rows containing missing values (geom_point).
mathgain
predicted by mathknow
using
the student_dataNote that I have used the tidyverse modeling principles (tidymodels) instead of the base R version
math_mod <- lm(mathgain ~ mathknow, student_data)
tidy(math_mod)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 57.8 1.06 54.7 1.19e-313
## 2 mathknow 0.680 1.04 0.657 5.11e- 1
Question
During the video, you saw different reasons for using hierarchical models with data such as this student data. Which of the following examples is not a reason for using hierarchical models with data such as these student test scores?
Possible Answers
Correct. A hierarchical model cannot help with modeling non-normal
count data. However, we will learn about generalized linear mixed-effect
models which can model count data in Chapter 3. Also, notice how the
results from the linear model correspond to the ggplot2
line. In this case, there is a slight increase in student math scores
when the teacher has a higher test score. You can tell this because the
coefficient term mathknow
was 0.68. Yet, there is
uncertainty around this estimate because the standard error was 1.04.
Thus, the results are unclear if a teacher’s math knowledge affects a
student’s math gains. Within this data set, students are not independent
within classrooms or school. Next, we will examine some approaches for
accounting for this nesting.
In the last exercise, the simple linear model you used did not account for the structure of the data. Students learn within classrooms and classrooms exist within schools, which means students within the same classroom are not independent. One solution is to collapse the data by taking a mean for each level. However, the method used to collapse the data can be important, especially for small or unequal-sized groups.
In this exercise, you will aggregate the gains in math scores
(mathgain
) three different ways. After summarizing the
data, you will examine a linear model of the data at each level.
Instructions
# Summarize the student data at the classroom level
class_data <-
student_data %>%
group_by(classid, schoolid) %>%
summarize(mathgain_class = mean(mathgain),
mathknow_class = mean(mathknow),
n_class = n(), .groups = "keep")
student_data
where
mathgain
is predicted by mathknow.
## Model the math gain with the student-level data
lm(mathgain ~ mathknow, student_data)
##
## Call:
## lm(formula = mathgain ~ mathknow, data = student_data)
##
## Coefficients:
## (Intercept) mathknow
## 57.8224 0.6801
Fit a linear model using class_data where mathgain_class is predicted by mathknow_class.
# Model the math gain with the classroom-level data
lm(mathgain_class ~ mathknow_class, class_data)
##
## Call:
## lm(formula = mathgain_class ~ mathknow_class, data = class_data)
##
## Coefficients:
## (Intercept) mathknow_class
## 58.43 2.71
Note: .groups ="keep"
tells
summarize()
to keep the grouping variable. It is included
to suppress a tidyverse
notification message.
school_data
where
mathgain_school
is predicted by
mathknow_school
.# Summarize the data at the school level
school_data <-
student_data %>%
group_by(schoolid) %>%
summarize(mathgain_school = mean(mathgain),
mathknow_school = mean(mathknow),
n_school = n(), .groups = 'keep')
# Model the data at the school-level
lm(mathgain_school ~ mathknow_school, school_data)
##
## Call:
## lm(formula = mathgain_school ~ mathknow_school, data = school_data)
##
## Coefficients:
## (Intercept) mathknow_school
## 58.00624 -0.01119
s_by_c_data
where
mathgain_s_by_c
is predicted by
mathknow_s_by_c
.# Summarize school by class (s_by_c)
s_by_c_data <-
class_data %>%
group_by(schoolid) %>%
summarize(mathgain_s_by_c = mean(mathgain_class),
mathknow_s_by_c = mean(mathknow_class),
n_s_by_c = n(), .groups = 'keep')
# Model the data at the school-level after summarizing
# students at the class level
lm(mathgain_s_by_c ~ mathknow_s_by_c, s_by_c_data)
##
## Call:
## lm(formula = mathgain_s_by_c ~ mathknow_s_by_c, data = s_by_c_data)
##
## Coefficients:
## (Intercept) mathknow_s_by_c
## 58.0246 0.3966
Question
You have fit four different linear models. Which statement is true?
Possible Answers
mathknow
(mathknow
, mathknow_class
,
mathknow_school
, and mathknow_s_by_c
are
samemathknow
(mathknow
, mathknow_class
,
mathknow_school
, and mathknow_s_by_c
are
different.Correct. Notice how the coefficient estimates are different from each other. Which estimate is correct? Which one should you use? Can we model the multiple levels of data? Yes. We can model multiple levels of data. However, first, we will review regression coefficients such as slopes and intercepts.
Intercepts are an important part of regression models, including hierarchical models, and allow the modeling of discrete groups. Without other coefficients, a single intercept is the global mean of the data. This model is also called a null model by some. Similarly, multiple intercepts allow you to estimate the mean for each group as long as other coefficients are not estimated.
During this exercise, you will learn about intercepts and see their
relationship to means. You will look at a subset of the school data that
only includes student data from the school with the id code of
3
. This data has been loaded for you as
school_3_data.
school_3_data <- student_data %>%
filter(schoolid == 3)
mathgain
. Recall that a global intercept is
~ 1
in R.mean()
to calculate the mean for all student math
gains using summarize()
.# Use a linear model to estimate the global intercept
lm(mathgain ~ 1 , school_3_data)
##
## Call:
## lm(formula = mathgain ~ 1, data = school_3_data)
##
## Coefficients:
## (Intercept)
## 88.86
# Use summarize to calculate the mean
school_3_data %>%
summarize(mean(mathgain))
## # A tibble: 1 × 1
## `mean(mathgain)`
## <dbl>
## 1 88.9
mathgain ~ classid
using lm()
# Use a linear model to estimate mathgain in each classroom
lm(mathgain ~ classid, school_3_data)
##
## Call:
## lm(formula = mathgain ~ classid, data = school_3_data)
##
## Coefficients:
## (Intercept) classid
## 92.64880 -0.03074
Notice R treats classid
as a continuous
variable.
mutate() classid
to be a factor.
Rerun the model you fit in the previous step.
# Change classid to be a factor
school_3_data <-
school_3_data %>%
mutate(classid = factor(classid))
# Use a linear model to estimate mathgain in each classroom
lm(mathgain ~ classid, school_3_data)
##
## Call:
## lm(formula = mathgain ~ classid, data = school_3_data)
##
## Coefficients:
## (Intercept) classid137 classid145 classid228
## 85.0 20.5 11.0 -14.0
group_by(classid)
~ classid - 1
in the lm()
formula.# Change classid to be a factor
school_3_data <-
school_3_data %>%
mutate(classid = factor(classid))
# Calculate the mean of mathgain for each class
school_3_data %>%
group_by(classid) %>%
summarize(n(), mathgain_class = mean(mathgain))
## # A tibble: 4 × 3
## classid `n()` mathgain_class
## <fct> <int> <dbl>
## 1 11 4 85
## 2 137 2 106.
## 3 145 5 96
## 4 228 3 71
# Estimate an intercept for each class
lm(mathgain ~ classid-1, school_3_data)
##
## Call:
## lm(formula = mathgain ~ classid - 1, data = school_3_data)
##
## Coefficients:
## classid11 classid137 classid145 classid228
## 85.0 105.5 96.0 71.0
You observed the relationships between means and intercepts. Notice
how the global intercept was the same as the mean across all the data.
When examining intercepts, notice how R’s default option gave you the
difference between each classroom and the reference classroom. With
formulas in R, the reference group is always the first group in a factor
unless you change this setting. When using the - 1
option,
the intercept for each classroom was the classroom’s mean. With multiple
regression, the first group can have an intercept estimated for each
group or contrasts estimated. The additional discrete predictors can
only have contrasts estimated.
Previously, you used multiple intercepts to model the expected values for discrete groups. Now, you will model continuous predictor variables with slopes.
In particular, you will model gains in math test scores by building three different models.
The data, school_3_data
, has been loaded for you.
Notes:
y ~ x
is a shortcut for y ~ 1 + x
(where
1 +
is the intercept in the model), and thus you can omit
1 +
in your formulas.a +
in the formula (e.g.,
y ~ x_one + x_two
).y ~ x_one * x_two
so the testing software will correctly score your answer.Instructions
With the data school_3_data
, use a linear model to
examine how a student’s math gain (mathgain
) is predicted
by the student’s math score in kindergarten (mathkind
).
# Use a linear model to estimate how math kindergarten
# scores predict math gains later
lm(mathgain ~ mathkind, school_3_data)
##
## Call:
## lm(formula = mathgain ~ mathkind, data = school_3_data)
##
## Coefficients:
## (Intercept) mathkind
## 335.5802 -0.5259
With the data school_3_data
, model a student’s math gain
(mathgain
) being predicted by class (classid
)
and score (mathkind
). Estimate a coefficient for each
classroom using a - 1
at the end of your formula.
# Build a multiple regression
lm(mathgain ~ classid + mathkind - 1 , school_3_data )
##
## Call:
## lm(formula = mathgain ~ classid + mathkind - 1, data = school_3_data)
##
## Coefficients:
## classid11 classid137 classid145 classid228 mathkind
## 404.1494 431.5223 435.9090 404.8938 -0.7049
school_3_data
, model a student’s math
gain (mathgain
) predicted by class (classid
)
and score (mathkind
).- 1
at the end of your formula. Include an interaction term for
classid
and mathkind
.# Build a multiple regression with interaction
lm(mathgain ~ classid * mathkind - 1 , school_3_data )
##
## Call:
## lm(formula = mathgain ~ classid * mathkind - 1, data = school_3_data)
##
## Coefficients:
## classid11 classid137 classid145
## 187.7939 596.0303 561.1480
## classid228 mathkind classid137:mathkind
## 1376.9393 -0.2270 -0.8336
## classid145:mathkind classid228:mathkind
## -0.7376 -2.5300
You used three different models to estimate three different
coefficients from mathkind.
As an alternative, hierarchical
models allow you to share parameter information across groups. This is
also the topic of the next section.
Linear models in R estimate parameters that are considered fixed or non-random and are called fixed-effects. In contrast, random-effect parameters assume data share a common error distribution, and can produce different estimates when there are small amounts of data or outliers. Models with both fixed and random-effects are mixed-effect models or linear mixed-effect regression.
The lme4
package fits mixed-effect models (models with
both fixed- and random-effects) with lmer()
, which uses a
formula similar to lm()
.But, random-effect intercepts use
special syntax:
lmer(y ~ x + (1 | random-effect), data = my_data)
lmer()
function requires the model to include a
random-effect, otherwise the model gives you an error. Here, you will
fit a lm()
and a lmer()
, and then graphically
compare the fitted models using a subset of the data. We provide this
code because of the advanced data wrangling, which is required because
random-effects are usually not plotted (ggplot2
also does
not include nice plot options for mixed-effect models). In this plot,
notice how the dashed lines from random-effect slopes
compare to the solid lines from the fixed-effect
slopes.
Note: broom.mixed
is required because
the broom
package does not support lme4
.
Instructions
mathgain
predicted by
classid + mathkind
using the student_data
.
Save the output as lm_out
.mathgain
predicted by mathkind
as a fixed-effect and
classid
predicted as a random-effect using the
student_data
.mathkind
coefficient details.# Build a liner model including class as fixed-effect model
lm_out <- lm(mathgain ~ classid + mathkind , student_data)
# Build a mixed-effect model including class id as a random-effect
lmer_out <- lmer(mathgain ~ mathkind + (1 | classid), data = student_data)
# Extract out the slope estimate for mathkind
tidy(lm_out) %>%
filter(term == "mathkind")
## # A tibble: 1 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 mathkind -0.407 0.0212 -19.2 4.98e-72
tidy(lmer_out) %>%
filter(term == "mathkind")
## # A tibble: 1 × 8
## effect group term estimate std.error statistic df p.value
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fixed <NA> mathkind -0.424 0.0215 -19.7 1178. 6.27e-75
Run the code as written. It includes these steps:
# Re-run the models to load their outputs
lm_out <- lm(mathgain ~ classid + mathkind, data = student_data)
lmer_out <- lmer(mathgain ~ mathkind + (1 | classid), data = student_data)
# Add the predictions to the original data
student_data_subset <-
student_data %>%
mutate(lm_predict = predict(lm_out),
lmer_predict = predict(lmer_out)) %>%
filter(schoolid == "1")
# Plot the predicted values
ggplot(student_data_subset,
aes(x = mathkind, y = mathgain, color = as_factor(classid))) +
geom_point() +
geom_line(aes(x = mathkind, y = lm_predict)) +
geom_line(aes(x = mathkind, y = lmer_predict), linetype = 'dashed') +
xlab("Kindergarten math score") +
ylab("Math gain later in school") +
theme_bw() +
scale_color_manual("Class ID", values = c("red", "blue"))
Question
In Step 1, you estimated a fixed-effect and random-effect coefficient
for predicting mathgain
. Both models included
mathkind
as a fixed effect, with classid
changing from a fixed-effect to a random-effect. The models estimate
standard errors for mathkind
, which represents the
uncertainty around the parameter estimate. How did the standard error
change for mathkind
between the two models?
Possible Answers
mathkind
was 0.0258 from the
lm()
and decreased to 0.0215 from the
lmer()
.tidy()
outputs do not include the details necessary to
answer this question.Correct. Notice how the random-effect for classid explains more variability than the fixed-effect.
In the previous exercise, you saw how to code random-effect
intercepts. You will now see how to code random-effect slopes. With
lme4
syntax, lmer()
uses
(countinuous_predictor | random_effect_group
) for a
random-effect slope. When lme4
estimates a random-effect
slope, it also estimates a random-effect intercept. scale()
rescaled the predictor variable mathkind
to make the model
more numerically stable. Without this change, lmer()
cannot
fit the model.
In the previous exercise, you estimated a random-effect intercept for
each classroom and one slope
for all data. Here, you will
estimate a random-effect intercept for each class and a random-effect
slope for each classroom. Like a random-effect intercept, a
random-effect slope comes from a shared distribution of all
random-effect slopes.
Instructions
mathkind
to be
mathkind_scaled
.lmer()
function from the lme4
package to fit a random-effects intercept model. This is similar to the
model you fit before, but has mathgain
predicted by
mathkind_scaled.
classid
is the random-effect.
Use the student_data
.lmer()
function from the lme4
package to fit a random-effects slope model. mathgain
is
predicted by mathkind_scaled
as a random-effect slope and
classid
as a random-effect group.# Rescale mathkind to make the model more stable
student_data <-
student_data %>%
mutate(mathkind_scaled = scale(mathkind))
# Build lmer models
lmer_intercept <- lmer(mathgain ~ mathkind_scaled + (1 | classid), data = student_data)
lmer_slope <- lmer(mathgain ~ (mathkind_scaled | classid), data = student_data)
# Add the predictions to the original data
student_data_subset <-
student_data %>%
mutate(lmer_intercept = predict(lmer_intercept),
lmer_slope = predict(lmer_slope)) %>%
filter(schoolid == "1")
# Plot the predicted values
ggplot(student_data_subset,
aes(x = mathkind_scaled, y = mathgain, color =as_factor(classid))) +
geom_point() +
geom_line(aes(x = mathkind_scaled, y = lmer_intercept)) +
geom_line(aes(x = mathkind_scaled, y = lmer_slope), linetype = 'dashed') +
theme_bw() +
scale_color_manual("Class ID", values = c("red", "blue"))
The model with fixed-effect slopes has parallel lines (solid lines) because the slope estimates are the same. The model with random-effect slopes (dashed lines) does not have parallel lines because the slope estimates are different. The model with random-effect slopes (dashed lines) has lines that are shallower than the other model. This occurred because slopes are being estimated for each classroom, but include a shared distribution. This shared distribution pools information from all classrooms (including those not shown on the plot).
You will now built a model to examine what factors predict a student’s gain in math knowledge. As described in the video, you will build a model to see:
As part of this model, you will also account for other possible predictors including:
Instructions
student_data
, build a model and save the
output as lmer_classroom
mathgain
predicted by mathknow
,
mathprep
, sex
, mathkind
and
ses
as fixed-effects -Include classid
as a
random-effect.# Build the model
lmer_classroom <- lmer(mathgain ~ mathknow + mathprep + sex + mathkind + ses + (1| classid), data = student_data)
# Print the model's output
print(lmer_classroom)
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: mathgain ~ mathknow + mathprep + sex + mathkind + ses + (1 |
## classid)
## Data: student_data
## REML criterion at convergence: 10337.78
## Random effects:
## Groups Name Std.Dev.
## classid (Intercept) 12.61
## Residual 26.90
## Number of obs: 1081, groups: classid, 285
## Fixed Effects:
## (Intercept) mathknow mathprep sex mathkind ses
## 267.4076 1.6243 1.2964 -1.3139 -0.4548 6.0905
You have now built a model to examine student’s gains in math knowledge. In the exercise, you will examine and interpret the model’s coefficients.
lmer
model, lmer_classroom
, you built
in the previous exercise is loaded.tidy()
function to extract the model’s
coefficients including confidence intervals by setting
conf.int
to be TRUE.
lmer_coef
and print this
object.# Extract coefficents
lmer_coef <- tidy(lmer_classroom, conf.int = TRUE)
# Print coefficents
lmer_coef
## # A tibble: 8 × 10
## effect group term estimate std.error statistic df p.value conf.low
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fixed <NA> (Int… 267. 11.1 24.0 1036. 1.05e-101 246.
## 2 fixed <NA> math… 1.62 1.14 1.43 232. 1.55e- 1 -0.619
## 3 fixed <NA> math… 1.30 1.18 1.10 223. 2.73e- 1 -1.03
## 4 fixed <NA> sex -1.31 1.73 -0.758 1015. 4.49e- 1 -4.72
## 5 fixed <NA> math… -0.455 0.0223 -20.4 1074. 3.60e- 78 -0.499
## 6 fixed <NA> ses 6.09 1.25 4.85 1075. 1.39e- 6 3.63
## 7 ran_pars classid sd__… 12.6 NA NA NA NA NA
## 8 ran_pars Residual sd__… 26.9 NA NA NA NA NA
## # … with 1 more variable: conf.high <dbl>
lmer_coef
to the code.# Plot results
lmer_coef %>%
filter(effect == "fixed" & term != "(Intercept)") %>%
ggplot(., aes(x = term, y = estimate,
ymin = conf.low, ymax = conf.high)) +
geom_hline(yintercept = 0, color = 'red') +
geom_point() +
geom_linerange() +
coord_flip() +
theme_bw() +
ylab("Coefficient estimate and 95% CI") +
xlab("Regression coefficient")
Question
Examine the coefficients and their 95% confidence intervals.
Possible Answers
ses
and mathkind
differed
from zero.Correct. Both a school’s socio-economic status and a student’s kindergarten math knowldege had coefficients that differed from zero. Without knowing more about the subject area, we cannot say why. Additionally, perhaps there are more sources of variability not considered in the model. You will learn more about interpreting the results from mixed-effect models in the next chapter.
This chapter providers an introduction to linear mixed-effects models. It covers different types of random-effects, describes how to understand the results for linear mixed-effects models, and goes over different methods for statistical inference with mixed-effects models using crime data from Maryland.
In the video, you learned about the county-level birth rate data. Counties exist within states and perhaps states contribute to variability. During these exercises, you’ll build a series of mixed-effects models using this data.
In this exercise, you’ll build a hierarchical model with a global
intercept (fixed-effect) and random-effect for state. You will then look
at the summary()
of the model and the plot()
of the residuals. Like other types of regression analysis, examining
residuals can help you see if anything is wrong with the model.
With lmer()
, there are two methods for doing this:
y ~ 1 + (1 | random_effect)
or the shortcut,
y ~ (1 | random_effect)
. Use the shortcut in this exercise
so that your answer passes the DataCamp test.
When building mixed-effect models, starting with simple models such
as the global intercept model can check to see if problems exist with
either the data or code. A global intercept
assumes a
single intercept can describe all of the variability in the data. One
way to view a global intercept is that you cannot do any better modeling
that data than to only model the mean without including any other
predictor variables.
Instructions
lmer()
model to the
county_births_data
data. Include State
as a
random-effect that predicts BirthRate
.# Build a lmer with State as a random effect
birth_rate_state_model <- lmer(BirthRate ~ (1 | State),
data = county_births_data)
# Look at the model's summary and the residuals
summary(birth_rate_state_model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BirthRate ~ (1 | State)
## Data: county_births_data
##
## REML criterion at convergence: 2411
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.7957 -0.6056 -0.1063 0.5211 5.5948
##
## Random effects:
## Groups Name Variance Std.Dev.
## State (Intercept) 1.899 1.378
## Residual 3.256 1.804
## Number of obs: 578, groups: State, 50
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 12.3362 0.2216 43.3830 55.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(birth_rate_state_model)
## # A tibble: 3 × 8
## effect group term estimate std.error statistic df p.value
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fixed <NA> (Intercept) 12.3 0.222 55.7 43.4 5.49e-42
## 2 ran_pars State sd__(Intercept) 1.38 NA NA NA NA
## 3 ran_pars Residual sd__Observation 1.80 NA NA NA NA
plot(birth_rate_state_model)
# Plot the predicted values from the model on top of the plot shown during the video
county_births_data$birthPredictState <-
predict(birth_rate_state_model, county_births_data)
ggplot() +
theme_minimal() +
geom_point(data = county_births_data,
aes(x = TotalPopulation, y = BirthRate)) +
geom_point(data = county_births_data,
aes(x = TotalPopulation, y = birthPredictState),
color = 'blue', alpha = 0.5)
## Warning: Removed 2 rows containing missing values (geom_point).
## Removed 2 rows containing missing values (geom_point).
You’ve now built a simple lmer()
model that included a
global mean. You might do this as a first step modeling your own data or
as a debugging step. For example, if R could not fit a
lmer()
model, this simple model could help you see if the
trouble was with your random-effects. In the next exercise, you will see
how to include a fixed-effect. Later in this chapter, you will learn
more about interpreting lmer()
outputs.
During the previous exercise, you built a model with only a global intercept. Usually, hierarchical models include predictor variables of interest.
The county-level birth data includes the average age of the mother,
AverageAgeofMother
. Perhaps this explains a county’s birth
rate.
In this case, the formula in R “knows”
AverageAgeofMother
is numeric and will treat the
corresponding coefficient as a slope.
Build a hierarchical model with county_births_data
,
which has been loaded for you, and include a fixed-effect. Does the
average age of the mother at birth predict the birth rate?
Instructions
AverageAgeofMother
as a fixed-effect
slope parameter including State
as a random-effect. The
response variable is BirthRate
.summary()
.# Include the AverageAgeofMother as a fixed effect within the lmer and state as a random effect
age_mother_model <- lmer(BirthRate ~ AverageAgeofMother + (1 | State), county_births_data)
summary(age_mother_model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BirthRate ~ AverageAgeofMother + (1 | State)
## Data: county_births_data
##
## REML criterion at convergence: 2347.6
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.9602 -0.6086 -0.1042 0.5144 5.2686
##
## Random effects:
## Groups Name Variance Std.Dev.
## State (Intercept) 1.562 1.250
## Residual 2.920 1.709
## Number of obs: 578, groups: State, 50
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 27.57033 1.81801 575.96197 15.165 < 2e-16 ***
## AverageAgeofMother -0.53549 0.06349 574.18338 -8.434 2.71e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## AvrgAgfMthr -0.994
tidy(age_mother_model)
## # A tibble: 4 × 8
## effect group term estimate std.error statistic df p.value
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fixed <NA> (Intercept) 27.6 1.82 15.2 576. 5.92e-44
## 2 fixed <NA> AverageAgeofMo… -0.535 0.0635 -8.43 574. 2.71e-16
## 3 ran_pars State sd__(Intercept) 1.25 NA NA NA NA
## 4 ran_pars Residual sd__Observation 1.71 NA NA NA NA
Question
Looking at the regression coefficient
AverageAgeofMother
, which of the following statements is
true about its estimate?
Possible Answers
Correct. The estimate was negative, suggesting counties with older mothers have lower birth rates on average. You will learn about statistical inferences with hierarchical models later. Additionally, other DataCamp courses cover statistical inference and introduce regression in R in greater detail.
In the previous exercise, you estimated random-effect intercepts for
each state. This allowed you to account for each state having its own
intercept. During this exercise, you will estimate a random-effect slope
for each state. For example, perhaps the log10(total
population of each county), LogTotalPop
, changes the birth
rate of a county AND varies by state.
Recall from the video, a random-effect slope
may be
estimated for each group using (slope | group)
syntax with
lmer()
.
During this exercise, fit a mixed-effects model estimating the effect of the mother’s average age while accounting for state and total population as random-effects.
How do the outputs from this model compare to the previous model you built?
Instructions
AverageAgeofMother
as a fixed-effect and State
a random-effect.LogTotalPop
as a
random-effects slope.county_births_data <- county_births_data %>%
mutate(LogTotalPop = log10(TotalPopulation))
# Include the AverageAgeofMother as fixed-effect and State as a random-effect
model_a <- lmer(BirthRate ~ AverageAgeofMother + (1 | State), county_births_data)
tidy(model_a)
## # A tibble: 4 × 8
## effect group term estimate std.error statistic df p.value
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fixed <NA> (Intercept) 27.6 1.82 15.2 576. 5.92e-44
## 2 fixed <NA> AverageAgeofMo… -0.535 0.0635 -8.43 574. 2.71e-16
## 3 ran_pars State sd__(Intercept) 1.25 NA NA NA NA
## 4 ran_pars Residual sd__Observation 1.71 NA NA NA NA
# Include the AverageAgeofMother as fixed-effect and LogTotalPop and State as random-effects
model_b <- lmer(BirthRate ~ AverageAgeofMother + (LogTotalPop | State), county_births_data)
tidy(model_b)
## # A tibble: 6 × 8
## effect group term estimate std.error statistic df p.value
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fixed <NA> (Intercept) 34.6 1.81 19.1 570. 1.97e-63
## 2 fixed <NA> AverageAgeofMo… -0.756 0.0620 -12.2 565. 1.75e-30
## 3 ran_pars State sd__(Intercept) 13.5 NA NA NA NA
## 4 ran_pars State cor__(Intercep… -0.997 NA NA NA NA
## 5 ran_pars State sd__LogTotalPop 2.26 NA NA NA NA
## 6 ran_pars Residual sd__Observation 1.53 NA NA NA NA
Question
How did the estimated effect for the average age of a mother change between the two models?
Possible Answers
AverageAgeofMother
coefficient in model_b
becomes closer to zero than in model_a
, indicating a weaker
relationship.AverageAgeofMother
coefficient in model_b
becomes farther from zero than in model_a
, indicating a
stronger relationship.AverageAgeofMother
coefficient was the same in
model_b
and model_a.
Correct. The estimate for AverageAgeofMother
became more
negative (farther from zero) in model B. Including
LogTotalPop
changed the estimate for
AverageAgeofMother
and is an important coefficient. You may
have also noticed the AverageAgeofMother
coefficient in
model b had a smaller standard error (std.error
). This also
caused a larger t-value (statistic
). Both of these show
including LogTotalPop
in the model explains a source of
variability in the dataset.
In the previous exercises, you fit mixed-effect models with different fixed- and random-effects.
Sometimes, a model can have the same predictor as both a fixed and
random-effect. For example, perhaps you are interested in estimating the
average effect the age of a mother at birth
(AverageAgeofMother
). Including the predictor as
fixed-effect allows you to estimate the effect of a mother’s age across
all locations. Including the predictor as a random-effect allows you to
simultaneously account (or correct) for different slope
estimates among states.
Instructions
lmer()
using
AverageAgeofMother
as a fixed-effect and
AverageAgeofMother
as a random-effect nested within
State
to predict BirthRate
with the
county_births_data.
Make sure the fixed-effect goes before
the random-effects in the formula.summary()
on the output.# Construct a lmer()
out <- lmer(BirthRate ~ AverageAgeofMother + (AverageAgeofMother|State), county_births_data)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0133555 (tol = 0.002, component 1)
# Look at the summary
summary(out)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BirthRate ~ AverageAgeofMother + (AverageAgeofMother | State)
## Data: county_births_data
##
## REML criterion at convergence: 2337.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8402 -0.5965 -0.1132 0.5233 5.1817
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## State (Intercept) 78.33144 8.8505
## AverageAgeofMother 0.08433 0.2904 -0.99
## Residual 2.80345 1.6744
## Number of obs: 578, groups: State, 50
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 27.21961 2.41010 39.91438 11.294 5.31e-14 ***
## AverageAgeofMother -0.52344 0.08293 39.42721 -6.312 1.83e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## AvrgAgfMthr -0.997
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0133555 (tol = 0.002, component 1)
During this exercise, you built a model where the same predictor was
both a fixed- and random-effect. This is another tool for your modeling
toolbox. During the next video, you will learn about how to interpret
lmer()
outputs.
One of the first things to examine after fitting a model using
lmer()
is the model’s output using either the
print()
or summary()
functions. Although
similar, each produces slightly different outputs. The purpose of this
exercise is to have you compare the two outputs and then answer a
question about how the two differ. A lmer()
model has been
fit for you and saved as out.
out
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: BirthRate ~ AverageAgeofMother + (AverageAgeofMother | State)
## Data: county_births_data
## REML criterion at convergence: 2337.506
## Random effects:
## Groups Name Std.Dev. Corr
## State (Intercept) 8.8505
## AverageAgeofMother 0.2904 -0.99
## Residual 1.6744
## Number of obs: 578, groups: State, 50
## Fixed Effects:
## (Intercept) AverageAgeofMother
## 27.2196 -0.5234
## optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 1 lme4 warnings
summary(out)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BirthRate ~ AverageAgeofMother + (AverageAgeofMother | State)
## Data: county_births_data
##
## REML criterion at convergence: 2337.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8402 -0.5965 -0.1132 0.5233 5.1817
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## State (Intercept) 78.33144 8.8505
## AverageAgeofMother 0.08433 0.2904 -0.99
## Residual 2.80345 1.6744
## Number of obs: 578, groups: State, 50
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 27.21961 2.41010 39.91438 11.294 5.31e-14 ***
## AverageAgeofMother -0.52344 0.08293 39.42721 -6.312 1.83e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## AvrgAgfMthr -0.997
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0133555 (tol = 0.002, component 1)
Which of the following appears in the summary()
output
but not the print()
output?
Possible Answers
Correct. The standard errors for the fixed-effects are not included
in the print()
output.
We often want to know what values a model estimates as coefficients.
Although the summary()
function provides the model outputs,
we might also want to directly access model outputs.
The fixed-effects estimates can be called directly using the
fixef()
function. The random-effects estimates can be
called directly using the ranef()
function. We can also
extract confidence intervals for the fixed-effects using the function
confint()
.
The broom.mixed
package also contains tidy
methods for extracting model results from lmer()
models,
namely the tidy()
function. However, these results are more
complex and less tidy than many tidy outputs due to the
complexity of mixed-effect models.
A lmer
model has been fit for you and saved as
out
.
Instructions
fixef()
with the saved model out.
ranef()
with the saved model out.
confint()
function with the saved model
out.
tidy()
with out
and
conf.int = TRUE
to repeat your previous three code calls
with one tidy command.# Extract the fixed-effect coefficients
fixef(out)
## (Intercept) AverageAgeofMother
## 27.219605 -0.523442
# Extract the random-effect coefficients
ranef(out)
## $State
## (Intercept) AverageAgeofMother
## AK 4.15003868 -0.109184783
## AL -4.03378321 0.127790698
## AR 1.21082566 -0.026165323
## AZ -5.11009217 0.148282653
## CA 11.43254356 -0.373285277
## CO 1.35768165 -0.043219507
## CT -3.82970951 0.066100814
## DC 1.21115127 -0.012314396
## DE -2.11510834 0.060412134
## FL -11.31590854 0.334159606
## GA 6.93457372 -0.230274526
## HI -0.51788167 0.020356335
## IA 3.41174428 -0.085895750
## ID 7.51123552 -0.222596318
## IL -3.07183968 0.102977648
## IN -3.02904144 0.097145469
## KS 5.15187643 -0.150220284
## KY 2.30557751 -0.058607045
## LA 5.32029471 -0.154658352
## MA -5.72651427 0.144383430
## MD -3.31971034 0.116256669
## ME -5.76419680 0.145079276
## MI -5.51723036 0.156052226
## MN -1.41379992 0.077580405
## MO 0.46288340 -0.013492801
## MS -0.77739331 0.021041676
## MT 0.07683320 -0.006560404
## NC 4.20869679 -0.159191892
## ND 3.20725583 -0.078130864
## NE 4.97980324 -0.122101047
## NH -3.02333587 0.055641597
## NJ 6.33991764 -0.220034709
## NM 1.24087579 -0.066463156
## NV 0.35709749 -0.009233999
## NY -5.34557647 0.162856972
## OH -7.43833421 0.227092266
## OK 4.73462323 -0.140260716
## OR -5.64958450 0.170021845
## PA -8.17679196 0.230593543
## RI -0.82967532 -0.014536489
## SC -5.60511798 0.179511398
## SD 4.96670949 -0.128324308
## TN -1.21831791 0.038711838
## TX 10.90259845 -0.331687671
## UT 14.12700916 -0.369723836
## VA -8.76482204 0.350357075
## VT -0.02787975 -0.014042974
## WA 0.30152985 -0.002489047
## WI -0.35751943 -0.004946489
## WV -3.92421158 0.115236390
##
## with conditional variances for "State"
# Estimate the confidence intervals
confint(out)
## Computing profile confidence intervals ...
## Warning in nextpar(mat, cc, i, delta, lowcut, upcut): unexpected decrease in
## profile: using minstep
## Warning in FUN(X[[i]], ...): non-monotonic profile for .sig02
## Warning in confint.thpr(pp, level = level, zeta = zeta): bad spline fit
## for .sig02: falling back to linear interpolation
## 2.5 % 97.5 %
## .sig01 4.4099143 13.8613915
## .sig02 -0.9996895 -0.9607260
## .sig03 0.1333730 0.4630564
## .sigma 1.5761224 1.7821072
## (Intercept) 22.4264494 32.1350493
## AverageAgeofMother -0.6923139 -0.3576358
# Use the tidy function
tidy(out, conf.int = TRUE)
## # A tibble: 6 × 10
## effect group term estimate std.error statistic df p.value conf.low
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fixed <NA> (Inte… 27.2 2.41 11.3 39.9 5.31e-14 22.3
## 2 fixed <NA> Avera… -0.523 0.0829 -6.31 39.4 1.83e- 7 -0.691
## 3 ran_pars State sd__(… 8.85 NA NA NA NA NA
## 4 ran_pars State cor__… -0.992 NA NA NA NA NA
## 5 ran_pars State sd__A… 0.290 NA NA NA NA NA
## 6 ran_pars Residual sd__O… 1.67 NA NA NA NA NA
## # … with 1 more variable: conf.high <dbl>
Notice how the tidy()
function produces similar results.
You will see how to use the other functions for plotting in the next
exercise.
Data scientists must communicate their work and DataCamp offers courses on the topic. Explaining your work helps your audience understand the results. To do this, match your presentation to your audience’s knowledge level and expectations.
For non-technical audiences, describe the important findings from your output. For example, you might say, counties with older mothers tend to have lower birth rates. For technical audiences, include details such as coefficient estimates, confidence intervals, and test statistics. Books such as The Chicago Guide to Writing about Multivariate Analysis provide suggestions for describing regression outputs.
During this exercise, you will extract and plot fixed-effects.
Besides plotting the coefficients (with geom_point()
) and
their 95% confidence intervals (with geom_linerange()
), you
will add a red-line to the plot to help visualize where zero is located
(using geom_hline()
). If the 95% confidence intervals do
not include zero, the coefficient’s estimate differs from zero.
coord_flip()
is required because ggplot
does not allow for xmin
or xmax
, only
ymin
and ymax
. And,
theme_minimal()
changes the theme from the default.
Technical note: Extracting regression coefficients
from lmer
is tricky (see the discussion
between the lmer
and broom
authors).
Instructions
out
using
tidy()
from the broom.mixed
package. Include
the confidence interval.ggplot2
. Use term
for the x-axis, estimate
for the y-axis,
conf.low
for the ymin
, and
conf.high
for the ymax
.# Extract out the parameter estimates and confidence intervals
coef_estimates <-
tidy(out, conf.int = TRUE) %>%
filter(effect == "fixed")
# Print the new dataframe
print(coef_estimates)
## # A tibble: 2 × 10
## effect group term estimate std.error statistic df p.value conf.low
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fixed <NA> (Intercept) 27.2 2.41 11.3 39.9 5.31e-14 22.3
## 2 fixed <NA> AverageAgeo… -0.523 0.0829 -6.31 39.4 1.83e- 7 -0.691
## # … with 1 more variable: conf.high <dbl>
# Plot the results using ggplot2
ggplot(coef_estimates, aes(x = term, y = estimate,
ymin = conf.low, ymax = conf.high)) +
geom_hline( yintercept = 0, color = 'red' ) +
geom_linerange() + geom_point() + coord_flip() + theme_minimal()
In this exercise, you saw how to plot fixed-effect coefficients from
a lmer()
model. You used more code to wrangle and plot the
output than to fit the model. In fact, data science projects usually
include more code for the wrangling of model inputs and outputs than
code for running models. As a data scientist, you add value to projects
by helping people see and understand results.
Before fitting a model, plotting the data can be helpful to
see if trends or data points jump out, outliers exist, or other
attributes of the data require future consideration. Using
ggplot2
, you can plot lines for county and examine how
crimes change through time. For this exercise, examine Maryland crime
data (md_crime
). This includes the Year
, a
count of violent Crime
s in the county, and the
County
’s name.
To explore this data, first plot the data points for each county
through time. This lets you see how each county changes through time.
Rather than using an aesthetic such as color
,
group
is used here because there are too many counties to
easily distinguish colors. After plotting the raw data, add trend lines
for each county.
Both the connect points (geom_line
) and trend lines
(geom_smooth
) provide insight into what, if any, kinds of
random effects are required. If all of the points appear to have similar
ranges and means, a random-effect intercept may not be important.
Likewise, if trends look consistent across counties (i.e., the trend
lines look similar or parallel across groups), a random-effect slope may
not be required.
Instructions
Crime
(y
variable) has changed
across Year
(x
variable) in each
County
(group
variable) using the
md_crime
data.geom_smooth(method = 'lm', se = FALSE). se = FALSE
creates
a less cluttered plot.# Plot the change in crime through time by County
plot1 <-
ggplot(data = md_crime,
aes(x = Year, y = Crime, group = County)) +
geom_line() +
theme_minimal() +
ylab("Major crimes reported per county")
print(plot1)
# Add the trend line for each county
plot1 +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Question
Does the number of major crimes per county appear to be different across counties? If so, a random-effect intercept will be needed.
Possible Answers
Correct. Notice how the number of major crimes varies across counties. This suggests a random-effects intercept will improve the model’s fit. Likewise, the slopes may be different as well. This is something you will explore more in the next exercises.
The last plot showed changes in crime rate varied by county. This
shows you that you should include Year
as both a random-
and fixed-effect in your model. Including Year
this way
will estimate a global slope across all counties as well as slope for
each county. The fixed-effect slope estimates the change in major crimes
across all Maryland counties. The random-effect slope estimates model
for that counties have different changes in crime.
But, fitting this model produces a warning message! To address this
warning, change Year
from starting at 2006 to starting at
0. We provide you with this new variable, Year2
(e.g.,
2006
in Year
is 0
in
Year2
). Sometimes when fitting regression, you need to
scale or center the intercept to start at 0. This improves numerical
stability of the model.
Instructions
lmer()
to predict Crime
with
Year
as both a fixed-effect and random-effect slope and
County
as the random-effect intercept.lmer()
to predict Crime
with Year2
as both a fixed-effect and random-effect slope
and County
as the random-effect intercept.# Fit the model with Year as both a fixed and random-effect
lmer(Crime ~ Year + (1 + Year | County) , data = md_crime)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge: degenerate Hessian with 1 negative eigenvalues
## Warning: Model failed to converge with 1 negative eigenvalue: -7.2e-02
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: Crime ~ Year + (1 + Year | County)
## Data: md_crime
## REML criterion at convergence: 2892.018
## Random effects:
## Groups Name Std.Dev. Corr
## County (Intercept) 386.29
## Year 1.34 -0.84
## Residual 328.32
## Number of obs: 192, groups: County, 24
## Fixed Effects:
## (Intercept) Year
## 136642.97 -67.33
## optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 2 lme4 warnings
# Fit the model with Year2 rather than Year
lmer(Crime ~ Year2 + (1 + Year2 | County) , data = md_crime)
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: Crime ~ Year2 + (1 + Year2 | County)
## Data: md_crime
## REML criterion at convergence: 2535.723
## Random effects:
## Groups Name Std.Dev. Corr
## County (Intercept) 2754.03
## Year2 130.15 -0.91
## Residual 91.79
## Number of obs: 192, groups: County, 24
## Fixed Effects:
## (Intercept) Year2
## 1577.28 -67.33
Notice how you had to rescale year so that the model could be fit by R. This is a common regression modeling approach to improve model fits. Some people will rescale the slope so the middle value is zero instead. Either approach is acceptable. The benefit of starting at zero is that it can sometimes create a regression coefficient that is easier to explain and understand. In the next exercise, you will examine the second model.
Null hypothesis testing uses p-values to test if a variable differs significantly from zero. Recently, the abuse and overuse of null hypothesis testing and p-values has caused the American Statistical Association to issue a statement about the use of p-values.
Because of criticisms such as these and other numerical challenges,
Doug Bates (the creator of the lme4
package) does
not include p-values as part of his package. Yet, you may still want
or need to estimate p-values. To fill this need, several packages exist,
including the lmerTest
package.
lmerTest
uses the same lmer()
syntax as the
lme4
package, but includes different outputs. During this
exercise, you will fit a lmer()
model using
lmerTest
and lme4
.
Instructions
lmerTest
package, which will also load the
lme4
package.lmer()
model using only the lme4
package. In R, this is done with lme4::lmer()
.lmer()
but use the lmerTest
package.# Load lmerTest
# Fit a lmer use lme4
out_lme4 <-
lme4::lmer(Crime ~ Year2 + (1 + Year2 | County),
data = md_crime)
# Fit a lmer use lmerTest
out_lmerTest <-
lmerTest::lmer(Crime ~ Year2 + (1 + Year2 | County),
data = md_crime)
# Look at the summaries
summary(out_lme4)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Crime ~ Year2 + (1 + Year2 | County)
## Data: md_crime
##
## REML criterion at convergence: 2535.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8080 -0.2235 -0.0390 0.2837 3.0767
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## County (Intercept) 7584686 2754.03
## Year2 16939 130.15 -0.91
## Residual 8425 91.79
## Number of obs: 192, groups: County, 24
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1577.28 562.29 2.805
## Year2 -67.33 26.72 -2.520
##
## Correlation of Fixed Effects:
## (Intr)
## Year2 -0.906
summary(out_lmerTest)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Crime ~ Year2 + (1 + Year2 | County)
## Data: md_crime
##
## REML criterion at convergence: 2535.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.8080 -0.2235 -0.0390 0.2837 3.0767
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## County (Intercept) 7584686 2754.03
## Year2 16939 130.15 -0.91
## Residual 8425 91.79
## Number of obs: 192, groups: County, 24
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1577.28 562.29 23.01 2.805 0.0100 *
## Year2 -67.33 26.72 23.01 -2.520 0.0191 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Year2 -0.906
Question
lmerTest
includes output lme4
lacks. Which
of the following was not in the lme4::lmer()
output, but
was in the lmeTest::lmer()
?
Possible Answers
Estimate
).Std. Error
).Pr(>|t|)
).Yes. lmerTest
includes p-values as part of the outputs.
Using this extra package helps people to estimate p-values when they
need or want to with lmer
models.
P-values and null hypothesis testing historically have been important in science and statistics. Based upon the previous exercise and video, which of the following is not true:
Possible Answers
lme4
does not include P-values by default.As you saw, P-values can be powerful, but have limitations. Please consider reading the American Statistical Association’s statement of the use of p-values.
Comparing models can be difficult. Many methods exist although these are beyond the scope of this course such as model selection (e.g., AIC).
Analysis of Variance (ANOVA) exists as a basic option to compare
lmer
models. The ANOVA tests to see if one model explains
more variability than a second model. The ANOVA does this by examining
the amount of variability explained by the models.
For example, you can see if Year
predicts
Crime
in Maryland. To do this, build a null model with only
County
as a random-effect and a year model that includes
Year.
You can then compare the two models using the
anova()
function.
If Year
explains a significant amount of variability,
then the P-value will be less than your pre-specified threshold
(usually 0.05).
Instructions
County
as a
random-effect.Year2
as a random-effect
slope.anova()
.# Build the Null model with only County as a random-effect
null_model <- lmer(Crime ~ (1 | County) , data = md_crime)
# Build the Year2 model with Year2 as a fixed and random slope and County as the random-effect
year_model <- lmer(Crime ~ Year2 + (1 + Year2 | County) , data = md_crime)
# Compare null_model and year_model using an anova
anova(null_model, year_model)
## refitting model(s) with ML (instead of REML)
## Data: md_crime
## Models:
## null_model: Crime ~ (1 | County)
## year_model: Crime ~ Year2 + (1 + Year2 | County)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## null_model 3 2954.4 2964.2 -1474.2 2948.4
## year_model 6 2568.9 2588.4 -1278.4 2556.9 391.52 3 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question
Look at the anova()
output, specifically
Pr(>Chisq)
, which is the output from the null hypothesis
where both models explain the same amount of variability. Does the
second model (the model with Year
as a slope,
year_model
) explain a significant amount of variability
compared to the first model without the slope
(null_model
)?
Possible Answers
null_model
explains more variability than
year_model
.year_model
explains more variability than
null_model
.Yes. year_model
explains significantly more variablity.
In the next chapter, you will learn how to generalize
lmers
.
This chapter extends linear mixed-effects models to include non-normal error terms using generalized linear mixed-effects models. By altering the model to include a non-normal error term, you are able to model more kinds of data with non-linear responses. After reviewing generalized linear models, the chapter examines binomial data and count data in the context of mixed-effects models.
In toxicology studies, organisms are often dosed and binary outcomes often occur such as dead/alive or inhibited/mobile. This is called a dose-response study. For example, the response to different doses might be mortality (1) or survival (0) at the end of a study.
During this exercise, we will fit a logistic regression using all three methods described in the video. You have been given two datasets.
df_long
, in a “long” format with each row corresponding
to an observation (i.e., a 0 or 1).df_short
, in an aggregated format with each row
corresponding to a treatment (e.g., 6 successes, 4 failures, number of
replicates = 10, proportion = 0.6).When using the “wide” or “short” data frame, the “success, failure”
methods for inputing logistic regression results require success and
failure be a matrix. The easiest way to do this is with the
cbind()
function.
Tip: When working with data in the wild,
always check to see what 0
and 1
correspond
to. Different people use different notation and assumptions can cause
problems for you if you assume wrong!
# Making up the said data frames
dose = rep(c(0,2,4,6,8,10), each =20)
mortality = c(
rep(0:1, times = c(20, 0)),
rep(0:1, times = c(14, 6)),
rep(0:1, times = c(9, 11)),
rep(0:1, times = c(4, 16)),
rep(0:1, times = c(5, 15)),
rep(0:1, times = c(3, 17))
)
df_long <- tibble(
dose, mortality
)
df_short <- data_frame(
dose = c(0,2,4,6,8,10),
mortality = c(0,6,11,16,15,17),
survival = c(20,14,9,4,5,3),
nReps = rep(20,6),
mortalityP = c(0,0.3,0.55,0.8,0.75,0.85))
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Instructions
df_long
, fit a glm()
with
the "binomial"
distribution family (or, synonymously,
binomial error term) where mortality
is predicted by
dose.
summary()
of the model.# Fit a glm using data in a long format
fit_long <- glm(mortality ~ dose, data = df_long,
family = "binomial")
summary(fit_long)
##
## Call:
## glm(formula = mortality ~ dose, family = "binomial", data = df_long)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2356 -0.7756 0.4141 0.9043 1.6419
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.91229 0.42900 -4.458 8.29e-06 ***
## dose 0.43256 0.07872 5.495 3.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.52 on 119 degrees of freedom
## Residual deviance: 121.29 on 118 degrees of freedom
## AIC: 125.29
##
## Number of Fisher Scoring iterations: 4
df_short
, fit a glm() with the
“binomial” distribution family where the matrix
cbind(mortality, survival)
is predicted by dose.summary()
of the model.# Fit a glm using data in a short format with two columns
fit_short <- glm(cbind(mortality, survival) ~ dose ,
data = df_short, family = "binomial")
summary(fit_short)
##
## Call:
## glm(formula = cbind(mortality, survival) ~ dose, family = "binomial",
## data = df_short)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -2.3477 0.4040 0.8544 1.3424 -0.8367 -1.0004
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.91229 0.42900 -4.458 8.29e-06 ***
## dose 0.43256 0.07872 5.495 3.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54.143 on 5 degrees of freedom
## Residual deviance: 9.908 on 4 degrees of freedom
## AIC: 29.746
##
## Number of Fisher Scoring iterations: 5
df_short
, fit a glm()
with
the "binomial"
distribution family where
mortalityP
is predicted by dose
with
weights
of nReps.
summary()
of the model.# Fit a glm using data in a short format with weights
fit_short_p <- glm(mortalityP ~ dose , data = df_short,
weights = nReps , family = "binomial")
summary(fit_short_p)
##
## Call:
## glm(formula = mortalityP ~ dose, family = "binomial", data = df_short,
## weights = nReps)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -2.3477 0.4040 0.8544 1.3424 -0.8367 -1.0004
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.91229 0.42900 -4.458 8.29e-06 ***
## dose 0.43256 0.07872 5.495 3.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 54.143 on 5 degrees of freedom
## Residual deviance: 9.908 on 4 degrees of freedom
## AIC: 29.746
##
## Number of Fisher Scoring iterations: 5
You fit a GLM three different methods. All three methods produced outputs with the same coefficient estimates, but differed with respect to the degrees of freedom estimated. As a tip, use the input method that best matches your data and requires the least amount of data wrangling. The difference in degrees of freedom does not change the models. If you use model selection, make sure your data is in the same format (long vs. Short) for all models you compare.
A Poisson regression is another type of GLM. This requires integers or count data (i.e., 0, 1, 2, 3,…). For some situations, a Poisson regression can be more powerful (e.g., detecting statistically significantly trends) than a linear model or “Gaussian” regression.
During this exercise, we’re going to build a linear regression using
the lm()
function and a Poisson regression using
glm()
.
The objects x
and y
are loaded into R for
you.
x <- 1:20
y <- c(rep(0:1, length.out = 9),2,1,2,0,1,1,0,1,5,1,1)
Instructions
lm()
where y
is predicted by
x
and then print the summary.glm()
where y
is predicted by
x
with a "poisson"
distribution function and
then print the summary to the terminal.glm()
produces statistically significant estimates for
x
.# Fit the linear model
summary(lm(y~x))
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3677 -0.6145 -0.2602 0.4297 3.4805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.15263 0.50312 0.303 0.7651
## x 0.07594 0.04200 1.808 0.0873 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.083 on 18 degrees of freedom
## Multiple R-squared: 0.1537, Adjusted R-squared: 0.1067
## F-statistic: 3.269 on 1 and 18 DF, p-value: 0.08733
# Fit the generalized linear model
summary(glm(y~x, family = "poisson"))
##
## Call:
## glm(formula = y ~ x, family = "poisson")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6389 -0.9726 -0.3115 0.5307 2.1559
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.04267 0.60513 -1.723 0.0849 .
## x 0.08360 0.04256 1.964 0.0495 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 23.589 on 19 degrees of freedom
## Residual deviance: 19.462 on 18 degrees of freedom
## AIC: 52.17
##
## Number of Fisher Scoring iterations: 5
You’ve now seen how to fit a Poisson regression, a type of GLM.
Remember this, because you’ll be fitting a glmer()
soon
using a similar approach!
Often, we want to “look” at our data and trends in our data.
ggplot2
allows us to add trend lines to our plots. The
default lines are created using a technique called local
regression. However, we can specify that different models are used
to create the lines, including GLMs.
During this exercise, we’ll see how to plot a GLM with
ggplot2
.
Instructions
stat_smooth()
to the first plot to fit the
default line type.# Plot the data using jittered points and the default stat_smooth
ggplot(data = df_long, aes(x = dose, y = mortality)) +
geom_jitter(height = 0.05, width = 0.1) +
stat_smooth(fill = 'pink', color = 'red')
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
stat_smooth()
to the second plot. This time,
specify the method to be "glm"
and family to be
"binomial"
to fit a logistic regression.# Plot the data using jittered points and the the glm stat_smooth
ggplot(data = df_long, aes(x = dose, y = mortality)) +
geom_jitter(height = 0.05, width = 0.1) +
stat_smooth(method = "glm",
method.args = list(family = "binomial"))
## `geom_smooth()` using formula 'y ~ x'
Notice how the second plot is bounded by 0 and 1 on the y-axis. Also,
notice how the curve is now monotonic (or only increasing and never
decreasing). Being monotic is an important assumption of binomial
regression.
A toxicologist wonders if exposure to a chemical increases mortality as its dose increases. She has exposed test organisms to different doses of a chemical and recorded the number that died in each test tank. Each dose was repeated on three different days.
First, build a GLM. Building a simple model makes debugging easier.
Also, you will later compare the results to GLMER. This model includes
dose
and replicate
as a fixed-effects.
Second, build a GLMER. This time, dose
is a fixed-effect
and replicate
is a random-effects intercept.
Last, examine the coefficient estimates with coef()
from
each model. Notice how the intercept estimates are displayed differently
from each model.
# The toxicoligst's dataframe
dose = rep(c(0,2,4,6,8,10), each =20, times = 3)
replicate = rep(c("a","b","c"), each = 120)
mortality = c(
rep(0:1, times = c(20, 0)),
rep(0:1, times = c(16,4)),
rep(0:1, times = c(10, 10)),
rep(0:1, times = c(7,13)),
rep(0:1, times = c(6, 14)),
rep(0:1, times = c(3, 17)),
rep(0:1, times = c(20, 0)),
rep(0:1, times = c(15,5)),
rep(0:1, times = c(8,12)),
rep(0:1, times = c(11,9)),
rep(0:1, times = c(11,9)),
rep(0:1, times = c(5, 15)),
rep(0:1, times = c(20, 0)),
rep(0:1, times = c(19,1)),
rep(0:1, times = c(15, 5)),
rep(0:1, times = c(9,11)),
rep(0:1, times = c(11, 9)),
rep(0:1, times = c(6, 14))
)
df <- data_frame(
dose, replicate, mortality
)
df
## # A tibble: 360 × 3
## dose replicate mortality
## <dbl> <chr> <int>
## 1 0 a 0
## 2 0 a 0
## 3 0 a 0
## 4 0 a 0
## 5 0 a 0
## 6 0 a 0
## 7 0 a 0
## 8 0 a 0
## 9 0 a 0
## 10 0 a 0
## # … with 350 more rows
Instructions
glm_out
, a glm()
model that
predicts mortality
as a function of dose
and
replicate
. Estimate an intercept for the replicate group
using replicate - 1
.coef()
.# Fit glm_out and look at its coefficient estimates
glm_out <- glm(mortality ~ dose + replicate -1, family = "binomial", data = df )
coef(glm_out)
## dose replicatea replicateb replicatec
## 0.3657509 -1.9199043 -2.2872780 -2.7645061
glmer_out
, a glmer()
model that
predicts mortality
as a function of dose
while
accounting for replicate
as a random effect.coef()
.# Load lmerTest
#library(lmerTest) # This package is already loaded
# Fit glmer_out and look at its coefficient estimates
glmer_out <- glmer(mortality ~ dose + (1 |replicate), family = "binomial", data = df )
coef(glmer_out)
## $replicate
## (Intercept) dose
## a -2.057613 0.3622273
## b -2.280911 0.3622273
## c -2.563972 0.3622273
##
## attr(,"class")
## [1] "coef.mer"
Question
How do the coefficient estimates differ?
Possible Answers
glmer()
outputs.glm()
outputs.Correct. Notice how the intercept coefficient estimates for the
glmer()
outputs are closer together. However, in this case
both model estimates are similar and either model would be a tenable
choice.
As described in the video, our client is interested in knowing if a
friend’s recommendation increases the number of people who buy, rather
than pass, on his online product. He has given us a summary of his data
as a data.frame
called all_data
. This data
includes the number of Purchases
and Pass
es
for 4 test cities (city
) as well as the customer
ranking
. This data structure lends itself to using
cbind()
on the two columns of interest to create a matrix
(You could use other methods of making a matrix in R, but this is one of
the easiest methods).
You are interested to see if the recommendation from a
friend
increases people buying the product. To answer this
question, you will build a glmer()
model and then examine
the model’s output.
If the parameter estimate for friend
is significantly
greater than zero, then a friend’s recommendation increases the chance
somebody makes a purchase. If the parameter estimate for
friend
is significantly less than zero, then a friend’s
recommendation decreases the chance somebody makes a purchase. If the
parameter estimate for friend is not significantly different than zero,
then a friend’s recommendation has no effect on somebody making a
purchase.
# I copied the marketing dataset from datacamp's R console, pasted (automatically as text into 1 column in Excel, then converted it to columns (Text to Column functionality)
library(readxl)
all_data <- read_xlsx("data/all_data.xlsx")
glmer()
with all_data
data.frame
. Use cbind(Purchases, Pass)
being
predicted by friend
and ranking
(friend goes
first). Use city
as your random-effect and family =
"binomial"
.cbind()
is required because glmer()
requires a matrix input.# Load lmerTest
# Fit the model and look at its summary
model_out <- glmer(cbind(Purchases, Pass) ~ friend + ranking + (1|city), family = "binomial", data = all_data)
summary(model_out)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(Purchases, Pass) ~ friend + ranking + (1 | city)
## Data: all_data
##
## AIC BIC logLik deviance df.resid
## 977.1 989.6 -484.5 969.1 164
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9986 -0.7692 0.1070 0.7954 2.8951
##
## Random effects:
## Groups Name Variance Std.Dev.
## city (Intercept) 0.01833 0.1354
## Number of obs: 168, groups: city, 4
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.274771 0.095609 -13.333 <2e-16 ***
## friendyes 0.511344 0.058862 8.687 <2e-16 ***
## ranking 0.079873 0.004969 16.075 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) frndys
## friendyes -0.346
## ranking -0.553 0.057
Question
How did a friend’s recommendation (the friendyes coefficient) change the chance of a purchase?
Possible Answers
Yes. The coefficient estimate for slope is positive and significantly different than zero. This indicates a friend’s recommendation increases the chance of purchasing an item. However, describing the coefficient to your friend would be hard. In the next exercise, you will see how odds-ratios can be used to describe logistic regression outputs. Other DataCamp courses such as A/B Testing in R cover this topic in greater detail.
In the previous exercise, we saw how to compare the effects of a friend’s recommendation on sales. However, regression outputs can be hard to describe and sometimes odds-ratios can be easier to use. Using the outputs from the previous exercise, we’re going to calculate odds-ratios.
Refresher on odds-ratios:
Note on course code: Since this course launched, the
broom
package has dropped support for
lme4::lmer()
models. If you try to repeat this on your own,
you will need the broom.mixed
package, which is on cran.
Instructions
summary()
of model_out
.model_out
with
fixef()
and then convert to an odds-ratio by taking
exponential. Repeat with confint()
to get the confidence
intervals.friends
on a purchase using tidy()
. Make
sure to set the conf.int
and exponentiate
parameters.# Run the code to see how to calculate odds ratios
summary(model_out)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(Purchases, Pass) ~ friend + ranking + (1 | city)
## Data: all_data
##
## AIC BIC logLik deviance df.resid
## 977.1 989.6 -484.5 969.1 164
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9986 -0.7692 0.1070 0.7954 2.8951
##
## Random effects:
## Groups Name Variance Std.Dev.
## city (Intercept) 0.01833 0.1354
## Number of obs: 168, groups: city, 4
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.274771 0.095609 -13.333 <2e-16 ***
## friendyes 0.511344 0.058862 8.687 <2e-16 ***
## ranking 0.079873 0.004969 16.075 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) frndys
## friendyes -0.346
## ranking -0.553 0.057
exp(fixef(model_out))
## (Intercept) friendyes ranking
## 0.2794948 1.6675308 1.0831499
exp(confint(model_out))
## Computing profile confidence intervals ...
## 2.5 % 97.5 %
## .sig01 1.0616505 1.429753
## (Intercept) 0.2257593 0.345121
## friendyes 1.4860713 1.871783
## ranking 1.0726937 1.093795
# Create the tidied output
tidy(model_out, conf.int = TRUE, exponentiate = TRUE)
## # A tibble: 4 × 9
## effect group term estimate std.error statistic p.value conf.low conf.high
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fixed <NA> (Int… 0.279 0.0267 -13.3 1.48e-40 0.232 0.337
## 2 fixed <NA> frie… 1.67 0.0982 8.69 3.72e-18 1.49 1.87
## 3 fixed <NA> rank… 1.08 0.00538 16.1 3.84e-58 1.07 1.09
## 4 ran_pars city sd__… 0.135 NA NA NA NA NA
Sometimes, odds ratios may be easier to explain than logistic regression coefficients. As you gain more experience as a data scientist, you will get better at explaining your results. These ‘soft skills’ can help set you apart from other data scientists. ‘May the odds be ever in your favor!’
A common theme throughout this course has been the nestedness of data. During this case study, you will examine the number of users who click through on a link to a website before and after a redesign. These users are nested within trial groups. You will then examine the outputs and make a recommendation to your client.
For this redesign, you will examine the web behavior of 10 people
from 4 different focus groups, for a total of 40 people. You want to
know if the number of clicks
to different pages changed
from the old to the new webpage
while correcting for
groups
.
Instructions
glmer()
where clicks
is predicted by
webpage
using group
as a random-effect and the
data.frame
user_groups
. Be sure to use the
‘"poisson"
’ family. Save the model as
glmer_out
.summary()
of the model.# Some data wrangling
user_groups <- mutate_if(user_groups, is.character, as.factor)
user_groups <- within(user_groups, webpage <- relevel(webpage, ref = "New"))
# Fit a Poisson glmer
glmer_out <- glmer(clicks ~ webpage + (1|group), family = "poisson", data = user_groups)
# Examine output with summary()
summary(glmer_out)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: clicks ~ webpage + (1 | group)
## Data: user_groups
##
## AIC BIC logLik deviance df.resid
## 260.6 267.7 -127.3 254.6 77
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.7009 -0.7755 -0.1949 0.6753 2.4145
##
## Random effects:
## Groups Name Variance Std.Dev.
## group (Intercept) 0.1693 0.4115
## Number of obs: 80, groups: group, 4
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6415 0.2366 2.711 0.00671 **
## webpageOld -0.5566 0.1820 -3.058 0.00222 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## webpageOld -0.280
Question
Look at the regression coefficient estimate for the new
webpage (compared to the old
webpage). How did the update
change the number of click-throughs?
Possible Answers
Question An important part of data science can be helping clients understand model outputs. Which of the following recommendations would you make based upon the model’s outputs?
Possible Answers
Correct. The updated webpage appears to increase the number of times people click-through. In this case, Poisson coefficients can be hard to explain. However, the DataCamp Course on GLMs in R describes other methods for describing Poisson regressions. The DataCamp Course on A/B Testing in R also provides more background on tests such as these. Hopefully, these examples have helped you to see how one statistical tool may be applied to many different problems if you know the language of the problem.
The number of infections change through time and vary across age groups. Some possible reasons cultural, social, and policy-related factors. For small populations, the number of infections often includes zeros and may be non-normal. For data such as these, use a Poisson model.
For this exercise, you will examine how chlamydia infections vary in small, Illinois counties. You will ask:
This data comes from the State of Illinois who provides summaries of
infections such as chlamydia by age groups and counties. First, fit a
Poisson glmer
to the data. Then, examine the results. In
the next exercise, plot the data.
Warning: If you mistype the formula, you may cause R
to crash. This is a pitfall of using lmer()
and
glmer()
.
Instructions
glmer()
with a “poisson” family, predicting
count
as a function of fixed effects age
(1st
fixed-effect) and year
(2nd fixed-effect), and
include year
as a
random-effect grouped by county
. Use the
data il_data
.summary()
of the model.# Load lmerTest
# Age goes before year
model_out <- glmer(count ~ age + year + (year|county), family = "poisson", data = il_data)
summary(model_out)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: count ~ age + year + (year | county)
## Data: il_data
##
## AIC BIC logLik deviance df.resid
## 3215.6 3259.6 -1599.8 3199.6 1800
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4511 -0.0151 -0.0056 -0.0022 4.0053
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## county (Intercept) 129.94605 11.3994
## year 0.06481 0.2546 -1.00
## Number of obs: 1808, groups: county, 47
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.76264 2.16559 -4.970 6.7e-07 ***
## age20_24 -0.04152 0.03690 -1.125 0.261
## age25_29 -1.16262 0.05290 -21.976 < 2e-16 ***
## age30_34 -2.28278 0.08487 -26.898 < 2e-16 ***
## year 0.32709 0.25644 1.276 0.202
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) a20_24 a25_29 a30_34
## age20_24 -0.008
## age25_29 -0.006 0.342
## age30_34 -0.004 0.213 0.148
## year -0.768 0.000 0.000 0.000
Question
Did the number of cases vary between age groups?
Possible Answers
Question
Did the slope estimate for year suggest the number of reported cases changing across time?
Possible Answers
year
was significantly greater
than zero, suggesting an increase in reported cases through time.year
was significantly less
than zero, suggesting a decrease in reported cases through time.year
was not significantly
different from zero. This suggests no trend in the number of reported
cases.Correct. No trend appears to be occuring through time. Next, you will plot this data.
In the previous exercise, you fit a GLMER to the Illinois chlamydia data. During this exercise, we will go over some methods for displaying the results. You could use these methods to get summaries of the model for a client or document you are writing describing your results. However, I encourage you to learn how to manipulate and explore model outputs on your own to create your own methods for displaying results. Developing your own unique methods can help you stand out as a data scientist!
Here is what you will do:
glm
to each age class.
Although not exactly the same as the glmer()
outputs, this
approximation helps to display the results in a visually easy to
understand method.Instructions
model_out
using fixef()
.model_out
using ranef()
.ggplot2
methods.il_data_2 <- read_xlsx("data/il_data_2.xlsx")
il_data_2 <- mutate_if(il_data_2, is.character, as.factor)
# Extract out fixed effects
fixef(model_out)
## (Intercept) age20_24 age25_29 age30_34 year
## -10.76264149 -0.04151673 -1.16261773 -2.28278215 0.32708938
# Extract out random effects
ranef(model_out)
## $county
## (Intercept) year
## ALEXANDER -0.2847730 0.006331923
## BROWN -0.2847730 0.006331923
## CALHOUN -0.2847730 0.006331923
## CARROLL 12.2419046 -0.260432572
## CASS -0.2847730 0.006331923
## CLARK 12.2138202 -0.268561955
## CLAY -0.2847730 0.006331923
## CRAWFORD 12.5037945 -0.265761403
## CUMBERLAND -0.2847730 0.006331923
## DE WITT 12.7456625 -0.277684054
## DOUGLAS 13.0752147 -0.306338945
## EDGAR 12.3643337 -0.283614781
## EDWARDS -0.2847730 0.006331923
## FAYETTE 12.8095075 -0.273069319
## FORD -0.2847730 0.006331923
## GALLATIN -0.2847730 0.006331923
## GREENE -0.2847730 0.006331923
## HAMILTON -0.2847730 0.006331923
## HANCOCK 12.8581820 -0.305659278
## HARDIN -0.2847730 0.006331923
## HENDERSON -0.2847730 0.006331923
## IROQUOIS 13.1617300 -0.311381985
## JASPER -0.2847730 0.006331923
## JERSEY 12.9203294 -0.272292923
## JO DAVIESS 12.7409939 -0.289756683
## JOHNSON -0.2847730 0.006331923
## LAWRENCE 12.3714098 -0.268579903
## MARSHALL -0.2847730 0.006331923
## MASON -0.2847730 0.006331923
## MENARD -0.2180907 0.004850099
## MERCER 12.7534737 -0.271687371
## MOULTRIE -0.2180907 0.004850099
## PIATT 12.5653681 -0.296696601
## PIKE 12.5311151 -0.259219952
## POPE -0.2180907 0.004850099
## PULASKI -0.2180907 0.004850099
## PUTNAM -0.2180907 0.004850099
## RICHLAND 12.0351399 -0.273937455
## SCHUYLER -0.2180907 0.004850099
## SCOTT -0.2180907 0.004850099
## SHELBY 12.5183838 -0.283481057
## STARK -0.2180907 0.004850099
## UNION 13.1465830 -0.308682407
## WABASH -0.2180907 0.004850099
## WASHINGTON -0.2180907 0.004850099
## WAYNE 12.1149423 -0.253243180
## WHITE -0.2180907 0.004850099
##
## with conditional variances for "county"
# Run code to see one method for plotting the data
ggplot(data = il_data_2,
aes(x = year, y = count, group = county)) +
geom_line() +
facet_grid(age ~ . ) +
stat_smooth(method = "glm",
method.args = list(family = "poisson"),
se = FALSE,
alpha = 0.5) +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
Good job on both fitting a model and plotting the results. Look at the figure and the model outputs. How might you tell a story to describe the outputs to your friend, a scientist, and a news reporter? Plots can help you tell stories with your data. The next chapter will cover repeated measures applications of LMERS and GLMERS.
This chapter shows how repeated-measures analysis is a special case of mixed-effect modeling. The chapter begins by reviewing paired t-tests and repeated measures ANOVA. Next, the chapter uses a linear mixed-effect model to examine sleep study data. Lastly, the chapter uses a generalized linear mixed-effect model to examine hate crime data from New York state through time.
In the video, you learned how paired t-tests can be more powerful than regular t-tests. During this exercise, you will see an example demonstrating this. The first step will be to simulate data. Similar data might come from a people’s weights before or after a drug treatment or amount of money spent by a customer before or after a seeing a commercial.
Simulated data allows you to know the properties of the data and check to see if your model behaves as expected. R comes with many distributions including the normal. Your simulated data will have unequal variance (that is, the standard deviations will be different). The second step will be to analyze the data with both a paired and regular t-test. Last, you will be asked about the results from the paired t-tests.
As part of the first step, you will “set the seed” for R’s random number generator. This ensures you get the same numbers each time you run the code and DataCamp’s software correctly scores your code.
Instructions
345659
.n_ind <- 10
to simulate 10 individuals.before
group with a mean of 0
and standard deviation of 0.5
.after
group by adding a random vector with
a mean 4.5
and standard deviation of 5
to the
before vector.# Set the seed to be 345659
set.seed(345659)
# Model 10 individuals
n_ind <- 10
# simulate before with mean of 0 and sd of 0.5
before <- rnorm(n = n_ind, mean = 0, sd = 0.5)
# simulate after with mean effect of 4.5 and standard devation of 5
after <- before + rnorm(n = n_ind, mean = 4.5, sd = 5)
before (x)
and
after (y)
by specifying paired = FALSE
.before (x)
and
after (y)
by specifying paired = TRUE
.# Run a standard, non-paired t-test
t.test(x = before, y = after, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: before and after
## t = -4.6991, df = 9.3736, p-value = 0.001004
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.966165 -2.809916
## sample estimates:
## mean of x mean of y
## -0.2201914 5.1678489
# Run a standard, paired t-test
t.test(x = before, y =after, paired = TRUE)
##
## Paired t-test
##
## data: before and after
## t = -5.138, df = 9, p-value = 0.0006129
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -7.760267 -3.015813
## sample estimates:
## mean difference
## -5.38804
Question
In this case, both t-tests indicate significantly different means between groups. However, which test produces a larger (absolute value farthest from zero), t-score?
Possible Answers
Correct. Notice how the paired t-test has a larger t-score. The paired t-test is more powerful because it accounts for individual variability.
In the previous exercise, you saw how a paired t-test is more
powerful than a regular t-test. During this exercise, you will see how
statistical methods generalize. First, you will see how a paired t-test
is a special case of a repeated measures ANOVA. In the process, you will
see how a repeated measures ANOVA is a special case of a mixed-effects
model by using lmer()
in R.
The first part of this exercise will consist of transforming the simulated data from two vectors into a data.frame(). The second part will have you examine the model results to see how they are different. The third part will have you examine the outputs of the models and compare the results.
Instructions
data.frame() dat
:
y
is the combined vector before
and
after
, which have all been loaded for you.trial
is the repeated vector "before"
and
"after"
, each
one repeated
n_ind
.ind
is the individual ID vector, repeated
2 times
because there were two observations.# Create the data.frame, using the variables from the previous exercise.
# y is the joined vectors before and after.
# trial is the repeated names before and after, each one repeated n_ind
# ind is the letter of the individual, repeated 2 times (because there were two observations)
dat <- data.frame(y = c(before, after),
trial = rep(c("before", "after"), each = n_ind),
ind = rep(letters[1:n_ind], times = 2))
lmerTest
package.before
and after
with t.test()
.lmer()
and save it as lmer_out
. Have
y
predicted by trial
with ind
as
a random-effect intercept using dat
.summary()
of lmer_out
.#library(lmerTest)
# Run the code from the previous step
dat <- data.frame(y = c(before, after),
trial = rep(c("before", "after"), each = n_ind),
ind = rep(letters[1:n_ind], times = 2))
# Run a standard, paired t-test
t.test(before, after, paired = TRUE)
##
## Paired t-test
##
## data: before and after
## t = -5.138, df = 9, p-value = 0.0006129
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -7.760267 -3.015813
## sample estimates:
## mean difference
## -5.38804
# Run a lmer and save it as lmer_out
lmer_out <- lmer(y ~ trial + (1|ind), data = dat)
# Look at the summary() of lmer_out
summary(lmer_out)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: y ~ trial + (1 | ind)
## Data: dat
##
## REML criterion at convergence: 89.3
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.89008 -0.30775 -0.02928 0.22414 2.41195
##
## Random effects:
## Groups Name Variance Std.Dev.
## ind (Intercept) 1.075 1.037
## Residual 5.498 2.345
## Number of obs: 20, groups: ind, 10
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 5.1678 0.8108 17.5311 6.374 5.99e-06 ***
## trialbefore -5.3880 1.0487 9.0000 -5.138 0.000613 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## trialbefore -0.647
Question
Which of the following statements about the outputs are false?
Possible Answers
mean of the differences
from the paired t-test and
lmer()
coefficient estimate for trialbefore
both equal -5.388
.t value
from the paired t-test and the
trialbefore
coefficient both equal -5.138
.intercept
from the t-test is the same as the
(intercept
) from the lmer()
model’s
fixed-effects.Correct. The t-test does not provide an intercept estimate. The main purpose of this exercise was to show you a paired t-test is a special case of repeated measures ANOVA and a repeated measures ANOVA is simply a special case of a linear mixed-effects model. In the next exercise, you will see real data and learn more about interpreting the results.
One of the first things to do with new datasets is to plot it prior
to analysis. During this exercise, you will plot the data using
ggplot2
and create a publication quality figure. This will
help you to see the data.
This data examines if two different drugs change the amount of sleep
individuals get. The response variable of interest is the amount of
extra
sleep a patient gets. The predictor variable is the
drug group
. The ID
of each patient allows us
to do a repeated measures analysis. This is a random-effect intercept
and corresponds to the baseline effect of giving a person a sleeping
drug. We do not care how much an individuals sleeps in this case, only
the change in sleep of the groups.
The data.frame sleep
contains these variables.
Instruction
ggplot2
.extra
is the y
variable and
group
is the x
variable with the
sleep
data.frame.# Plot the raw data
ggplot(data = sleep, aes(x = group, y = extra)) +
geom_point()
geom_point()
with geom_line()
because you are plotting results from the same individuals.group = ID
in the aes()
function.# Replace geom_point with geom_line
ggplot(data = sleep, aes(x = group, y = extra, group = ID)) +
geom_line()
xlab()
to be "Drug"
.ylab()
to be "Extra sleep"
.theme_minimal()
.# Add an x and y label to the plot and change the theme to be "minimal"
ggplot(sleep, aes(x = group, y = extra, group = ID)) +
geom_line() +
xlab(label = "Drug") +
ylab(label = "Extra sleep") +
theme_minimal()
Now that you have plotted the data, you will get to build a model in the next exercise and compare two different forms of statistical inference. Also, in this figure, notice how almost all individuals had a greater increase in sleep with the second drug compared to the first. This suggests the second drug is more effective.
In this exercise, you will build a simple linear model
(lm()
) and then build a linear mixed-effects model
(lmer()
). The purpose of the first step is to make sure the
data works well with a simple model because lm()
outputs
are easier to debug than lmer()
outputs. During the next
exercise, you will compare two different methods of statistical
inference on the model.
Instructions
lm()
. The goal of this step
is to simply make sure the model builds without errors or warnings.extra
predicted by fixed-effects
group
(1st) and ID
(2nd) from the
sleep
data.# Build a simple linear model
lm( extra ~ group + ID, sleep)
##
## Call:
## lm(formula = extra ~ group + ID, data = sleep)
##
## Coefficients:
## (Intercept) group ID
## -2.6533 1.5800 0.3315
- Build a
lmer()model with
extrapredicted by the fixed-effect
groupand random-effect intercept
IDusing the
sleepdata. - Save the output as
lmer_out`.
# Build a simple linear model
lm(extra ~ group + ID, data = sleep)
##
## Call:
## lm(formula = extra ~ group + ID, data = sleep)
##
## Coefficients:
## (Intercept) group ID
## -2.6533 1.5800 0.3315
# Build a mixed-effects regression
lmer_out <- lmer(extra ~ group + (1|ID), sleep)
Now that you have built the models, the next step is using the models to make statistical inference. Usually, you would only use one approach, but in this case, you will compare two different methods of statistical inference.
In the previous exercise, you built a regression model. Two methods for statistical inference include examining the amount of variance explained by coefficients in the model (an ANOVA-like analysis) and using linear predictor variables to model the data (a regression analysis framework). The choice of approaches largely depends upon personal preference and statistical training. Both of these approaches may be done using frequentists or Bayesian methods. Although this course only uses frequentist methods, the same ideas apply to Bayesian models.
The lmer_out
model you build in the previous exercise
has been loaded for you. First, you will run an anova()
on
it to see if group
explains a significant amount of
variability. Second, you will examine the regression coefficient from
group
to see if it significantly differs from zero.
Instructions - Run an anova()
on
lmer_out
. - Look at the summary()
of
lmer_out
to see the regression coefficient for
group
.
# Run an anova() on lmer_out
anova(lmer_out)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## group 12.482 12.482 1 9 16.501 0.002833 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Look at the summary() of lmer_out
summary(lmer_out)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: extra ~ group + (1 | ID)
## Data: sleep
##
## REML criterion at convergence: 70
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.63372 -0.34157 0.03346 0.31511 1.83859
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 2.8483 1.6877
## Residual 0.7564 0.8697
## Number of obs: 20, groups: ID, 10
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) -0.8300 0.8143 17.9871 -1.019 0.32157
## group 1.5800 0.3890 9.0000 4.062 0.00283 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## group -0.717
Question
Compare the ANOVA table to the regression coefficient estimate for
group2
(the estimated difference of group 2 compared to
group 1). Which of the following are in both tables?
Possible Answers
group1
and
group2
Correct. Notice how both models find a statistically significant effect, but do so using different tests. In this case, both models produce identical p-values. Personally, I prefer regression inferences to ANOVA inferences because the output includes the estimated difference by default. The ANOVA tells us which variables explain a significant amount of variability, but the ANOVA does not tell us how things are different. Conversely, the regression tells us how much one unit of input would be expected to change the model’s output. During the next exercise, you will see how to visualize the outputs.
In the previous exercises, you have examined the raw data, used the data to build a model, and applied the model for statistical inferences. You found drug 2 increased the amount of extra sleep compared to drug 1. During this exercise, you will plot the results to see how much drug 2 increased extra sleep.
First, wrangle the data by using pivot_wider()
function
from the tidyr
package. Then, calculate the difference in
extra sleep for each individual. Last, plot this difference as a
histogram.
Instructions
tidyr
package.sleep_wide
.diff
.# Load the tidyr package
#library(tidyr)
# Make the data wider
sleep_wide <-
pivot_wider(sleep,
names_from = group, values_from = extra)
# Calculate the difference
sleep_wide$diff <- sleep_wide$`2` - sleep_wide$`1`
sleep_wide
with diff
for x aesthetic.geom_histogram()
.y
label to "Count"
.x
label to
"Extra sleep from drug 2"
.theme_bw()
.# Use the data sleep_wide and diff for the aesthetic x
ggplot(sleep_wide, aes(x = diff)) +
geom_histogram() +
ylab(label = "Count") +
xlab(label = "Extra sleep from drug 2") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You have now examined how one drug affects another as a repeated
measure. During the next exercise, you will examine how hate crimes
change in New York.
The hate crime data in this case study comes from Data.gov.The data spans 2010- 2016 .The data has however been cleaned up for this exercise’s practice.
The State of New York reports the number of hate crimes committed
against people in each county. During this case study, you will examine
and see if the number of hates crimes are changing through time. These
exercises serve two purposes. First, they demonstrate how generalized
mixed-effect regressions (glmer()
) can be used for repeated
measures in R. Second, they provide another example of using
mixed-effect models for statistical inference.
Given the different population sizes of New York counties, you can
reasonably assume the need for random-effect intercepts a
priori
. However, do you need random-effect slopes? Plot the
data to see if trends appear to vary by county. Additionally, plotting
the data will help you see what is going on.
Instructions
hate
, plot the total number of
hate crimes, (y = TotalIncidents
) across time
(x = Year
).County
and include
geom_line()
.geom_smooth()
.method
to "glm"
with
method.args
set to c("poisson")
.se
to
FALSE
.# Plot the TotalIncidents of hate crimes in NY by Year, grouped by County
ggplot(hate, aes(Year, TotalIncidents, group = County))+
geom_line()+
# Add a Poisson trend line for each county
geom_smooth(method = "glm", method.args = c("poisson"), se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Question
In the previous plot, you saw different trends in hate crime data through time. Which of the following statements does this observation support?
Possible Answers
Correct. Different counties appear to have different trends. In the next exercise, you will build models to describe this data.
As part of the model building process, first build a simple Poisson
regression. A glm()
runs quicker than glmer()
and is easier to debug. For example, the Poisson regression might catch
if you have non-integer data. Plus, the Poisson regression can provide
intuition about the more complicated model.
If you run into any problems, follow the instructions to fix your data.
Last, build and save a repeated-measures Poisson regression.
Instructions
glm()
function by
setting family
to "poisson"
.TotalIncidents
of hate crimes are
predicted by Year
and County
as fixed-effects
in your formula using the hate
data frame.# Model TotalIncidents predicted by Year and County
# Use a Poisson regression with the hate data.frame
glm( TotalIncidents ~ Year + County ,family = "poisson", data = hate)
##
## Call: glm(formula = TotalIncidents ~ Year + County, family = "poisson",
## data = hate)
##
## Coefficients:
## (Intercept) Year CountyAllegany CountyBronx
## 121.98334 -0.05977 -1.65787 1.15126
## CountyBroome CountyCattaraugus CountyCayuga CountyChautauqua
## -0.90255 -1.45353 -1.01203 -1.48607
## CountyChemung CountyChenango CountyClinton CountyColumbia
## -1.70959 -1.17785 -0.68000 -1.72240
## CountyCortland CountyDelaware CountyDutchess CountyErie
## -1.52662 -1.53832 0.01178 1.50107
## CountyEssex CountyFranklin CountyFulton CountyGenesee
## -1.45495 -1.24769 -1.71765 -1.65787
## CountyGreene CountyHerkimer CountyJefferson CountyKings
## -1.53832 -1.72478 -1.61690 2.56910
## CountyLewis CountyLivingston CountyMadison CountyMonroe
## -1.71765 -1.22832 -0.87154 -0.20972
## CountyMultiple CountyNassau CountyNew York CountyNiagara
## -1.09013 0.79580 2.48491 -0.60248
## CountyOneida CountyOnondaga CountyOntario CountyOrange
## -1.11226 -1.17683 -1.76025 -0.17693
## CountyOrleans CountyOswego CountyOtsego CountyPutnam
## -1.40229 -0.73686 -0.89300 -0.99506
## CountyQueens CountyRensselaer CountyRichmond CountyRockland
## 1.51898 -1.39772 0.66575 0.04996
## CountySaratoga CountySchenectady CountySchoharie CountySchuyler
## -0.83543 -0.74492 -1.58092 -1.77742
## CountySeneca CountySt. Lawrence CountySteuben CountySuffolk
## -1.71765 -1.04982 -1.34608 1.45168
## CountySullivan CountyTioga CountyTompkins CountyUlster
## -1.83720 -1.59988 -0.63539 -0.88945
## CountyWarren CountyWashington CountyWayne CountyWestchester
## -1.53832 -1.43472 -1.11461 0.74295
##
## Degrees of Freedom: 232 Total (i.e. Null); 173 Residual
## Null Deviance: 4107
## Residual Deviance: 302.3 AIC: 1160
glm()
ran without any problems. Now, build a
glmer()
.County
as a random-effect intercept and
Year
as both a fixed- and random-effect slope with the
hate
data.glmer()
.# Build a glmer with County as a random-effect intercept and Year as both a fixed- and random-effect slope
glmer( TotalIncidents ~ Year + (Year|County) ,family = "poisson", data = hate)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.370199 (tol = 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: TotalIncidents ~ Year + (Year | County)
## Data: hate
## AIC BIC logLik deviance df.resid
## 1165.2746 1182.5298 -577.6373 1155.2746 228
## Random effects:
## Groups Name Std.Dev. Corr
## County (Intercept) 217.8915
## Year 0.1084 -1.00
## Number of obs: 233, groups: County, 59
## Fixed Effects:
## (Intercept) Year
## 295.4813 -0.1464
## optimizer (Nelder_Mead) convergence code: 0 (OK) ; 0 optimizer warnings; 3 lme4 warnings
glmer()
gave this output:
convergence code 0; 3 optimizer warnings; 0 lme4 warnings
,
which means it did not converge. The reason is the scale of Year.Year2
, which starts with zero
rather than 2010.glmer()
with
County
as a random-effect intercept and Year2
as both a fixed- and random-effect slope with the hate
data.glmer_out
.# Create a new column, Year2 that starts with the min of Year
hate$Year2 <- hate$Year - min(hate$Year)
# Build a glmer with County as a random-effect intercept and Year2 as both a fixed- and random-effect slope
glmer_out <- glmer( TotalIncidents ~ Year2 + (Year2|County) ,family = "poisson", data = hate)
You have now build the model. Next, you will look at the model’s results.
Now, examine the model output you just fit to see if any trends exist in hate crime for New York.
Based upon the model’s summary()
, what is the trend in
New York hate crimes between 2010 and 2016?
Instructions
summary()
of glmer_out
.# Examine the summary of glmer_out
summary(glmer_out)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: TotalIncidents ~ Year2 + (Year2 | County)
## Data: hate
##
## AIC BIC logLik deviance df.resid
## 1165.3 1182.5 -577.6 1155.3 228
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.5434 -0.4864 -0.1562 0.3319 3.1939
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## County (Intercept) 1.16291 1.0784
## Year2 0.01175 0.1084 0.02
## Number of obs: 233, groups: County, 59
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.27952 0.16600 7.708 1.28e-14 ***
## Year2 -0.14622 0.03324 -4.398 1.09e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## Year2 -0.338
Question
Based upon the regression coefficient for Year2
, what
was the trend for hate crimes in New York between 2010 and 2016?
Possible Answers
Correct. The trend was significantly decreasing. As you may have noticed, Poisson regression coefficients can be hard to describe. The DataCamp Course Generalized Linear Models in R includes other methods for describing Poisson regression coefficients.
The last, and arguably most important step in creating a model, is sharing the results.
During this exercise, you’ll extract out the county-level estimates
and plot them with ggplot2
. The county-level random-effect
slopes need to be added to the fixed-effect slopes to get the slope
estimates for each county.
In addition to this addition, the code includes ordering the counties by rate of crime (the slope estimates) to help visualize the data clearly.
The model you previously fit, glmer_out
has been loaded
for you.
Instructions
Year2
slope estimates as
Year2_slope
.County
-level random-effects.# Extract out the fixed-effect slope for Year2
Year2_slope <- fixef(glmer_out)['Year2']
# Extract out the random-effect slopes for county
county_slope <- ranef(glmer_out)$County
# Create a new column for the slope
county_slope$slope <- county_slope$Year2 + Year2_slope
# Use the row names to create a county name column
county_slope$county <- rownames(county_slope)
# Create an ordered county-level factor based upon slope values
county_slope$county_plot <- factor(county_slope$county,
levels = county_slope$county[order(county_slope$slope)])
data.frame
county_slope
and plot
x = county_plot
against y = slope
.# Now plot the results using ggplot2
ggplot(data = county_slope, aes(x = county_plot, y = slope)) +
geom_point() +
coord_flip() +
theme_bw() +
ylab("Change in hate crimes per year") +
xlab("County")
Notice how the change in hate crime rates varies greatly across counties. In this case, a random-effect model can help capture this source of variability.
During this course, you have used and compared fixed-effect models
(e.g., lm()
and glm()
) and mixed-effect models
(those with both fixed- and random-effects such as lmer()
and glmer()
). What are some of the trade-offs you have
observed during the course?
Instructions
Congratulations! Happy coding