• Introduction
  • 1. Overview and Introduction to Hierarchical and Mixed Models
  • 2. Linear Mixed Effect Models
  • 3. Generalized Linear Mixed Effect Models
    • Crash course on GLMs
      • Logistic Regression
      • Poisson Regression
      • Plotting GLMs
    • Binomial data
  • 4. Repeated Measures
    • An introduction to repeated measures
      • Paired t-test
      • Repeated measures ANOVA
    • Sleep study
      • Exploring the data
      • Building models
      • Comparing regressions and ANOVAs
      • Plotting results
    • Hate in NY state?
      • Exploring NY hate data
      • Building the model
      • Interpreting model results
      • Displaying the results
      • Hierarchical models in R review

Introduction

My personal course notes from the Hierarchical and Mixed Effects Models in R course on DataCamp taught by Richard Erickson. His other course, Generalized Linear Models in R is a pre-requisite to this one. Dr. Erickson is a quantitative ecologist.

Course Description

This course begins by reviewing slopes and intercepts in linear regressions before moving on to random-effects. You’ll learn what a random effect is and how to use one to model your data. Next, the course covers linear mixed-effect regressions. These powerful models will allow you to explore data with a more complicated structure than a standard linear regression. The course then teaches generalized linear mixed-effect regressions. Generalized linear mixed-effects models allow you to model more kinds of data, including binary responses and count data. Lastly, the course goes over repeated-measures analysis as a special case of mixed-effect modeling. This kind of data appears when subjects are followed over time and measurements are collected at intervals. Throughout the course you’ll work with real data to answer interesting questions using mixed-effects models.

What is Covered

  1. Overview and Introduction to Hierarchical and Mixed Models
  2. Linear Mixed Effect Models
  3. Generalized Linear Mixed Effect Models
  4. Repeated Measures

Libraries and Data

library(tidyverse)
library(tidymodels) # broom
library(lme4)
library(lmerTest)
library(broom.mixed)
library(readxl)
library(knitr)

student_data <- read_csv("data/classroom.csv")
county_births_data <- read_csv("data/countyBirthsDataUse.csv")
md_crime <- read_csv("data/MDcrime.csv")
hate <- read_csv("data/hateNY.csv")
il_data <- read_csv("data/ILdata.csv")
user_groups <- read_xlsx("data/user_groups.xlsx")
sleep <-  read_xlsx("data/sleep.xlsx")

1. Overview and Introduction to Hierarchical and Mixed Models

The first chapter provides an example of when to use a mixed-effect and also describes the parts of a regression. The chapter also examines a student test-score dataset with a nested structure to demonstrate mixed-effects.

What is a hierarchical model?

Examples of hierarchical datasets

Often, many different types of observed data lend themselves to hierarchical models. Below are six examples of data that were analyzed using mixed-effect regressions. In this case, the term mixed-effect regression is used because this variation of the term hierarchical model was used in all of these studies.

  • A study where molluscs were exposed to pollution, grouped by tank with the same number of individuals per tank.
  • Examining vehicular crash data in Virginia, USA, using data from different counties (regions) that vary in population.
  • Kidney disease studied through time, where multiple observations were recorded per individual.
  • Land value modeled using data on multiple spatial scales.
  • Modeling county-level opioid-related deaths, accounting for differences in counties using a mixed-effect regression.
  • Variation in bike routes estimated by tracking individuals through time using GPS.

Instructions

  • Based upon this summary of the dataset and analysis, drag-and-drop the datasets (described above) to the reason for using mixed-effect models with that dataset.
  • The nested and pooled reasons are lumped because the pooled structure requires nesting.

Solution

You have seen how different types of predictor variables can be observed at different levels. Hierarchical models can capture this structure.

Multi-level student data

Before using hierarchical models, explore the data with a plot and linear model. Plotting data gives you visual intuition and allows you to see potential trends or problems. Building a simple model provides a comparison to a hierarchical model. Building a simple linear model provides a starting place for hierarchical models and troubleshooting the linear model is quicker and simpler than a hierarchical model.

During this exercise, you will plot how a student’s gain in math scores may be predicted by the teacher’s math knowledge.

Instructions

  • Visualize student-level data (student_data) with ggplot2’s ggplot().
  • Plot a teacher’s math knowledge (mathknow) on the x-axis and the student’s gain in math (mathgain) as the y-axis.
  • Use geom_smooth() to add a lm trend line.
ggplot(student_data, aes(mathknow, mathgain))+
  geom_point()+
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 109 rows containing non-finite values (stat_smooth).
## Warning: Removed 109 rows containing missing values (geom_point).

  • Examine the summary results of stat’s package linear model (lm())
  • Have mathgain predicted by mathknow using the student_data

Note that I have used the tidyverse modeling principles (tidymodels) instead of the base R version

math_mod <- lm(mathgain ~ mathknow, student_data)
tidy(math_mod)
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   57.8        1.06    54.7   1.19e-313
## 2 mathknow       0.680      1.04     0.657 5.11e-  1

Question

During the video, you saw different reasons for using hierarchical models with data such as this student data. Which of the following examples is not a reason for using hierarchical models with data such as these student test scores?

Possible Answers

  • Nested data such as students within classrooms or classrooms within schools.
  • Non-normal data such as counts of students per classroom.
  • Pooling or sharing information across small sample sizes such as classroom data within schools.
  • Repeated observations such as following a student through time.

Correct. A hierarchical model cannot help with modeling non-normal count data. However, we will learn about generalized linear mixed-effect models which can model count data in Chapter 3. Also, notice how the results from the linear model correspond to the ggplot2 line. In this case, there is a slight increase in student math scores when the teacher has a higher test score. You can tell this because the coefficient term mathknow was 0.68. Yet, there is uncertainty around this estimate because the standard error was 1.04. Thus, the results are unclear if a teacher’s math knowledge affects a student’s math gains. Within this data set, students are not independent within classrooms or school. Next, we will examine some approaches for accounting for this nesting.

Exploring multiple-levels: Classrooms and schools

In the last exercise, the simple linear model you used did not account for the structure of the data. Students learn within classrooms and classrooms exist within schools, which means students within the same classroom are not independent. One solution is to collapse the data by taking a mean for each level. However, the method used to collapse the data can be important, especially for small or unequal-sized groups.

In this exercise, you will aggregate the gains in math scores (mathgain) three different ways. After summarizing the data, you will examine a linear model of the data at each level.

Instructions

  • Run the code to summarize the student data to the classroom level.
# Summarize the student data at the classroom level
class_data <-
    student_data %>%
    group_by(classid, schoolid) %>%
    summarize(mathgain_class = mean(mathgain),
              mathknow_class = mean(mathknow),
              n_class = n(), .groups = "keep")
  • Fit a linear model using student_data where mathgain is predicted by mathknow.
## Model the math gain with the student-level data
lm(mathgain ~ mathknow, student_data)
## 
## Call:
## lm(formula = mathgain ~ mathknow, data = student_data)
## 
## Coefficients:
## (Intercept)     mathknow  
##     57.8224       0.6801

Fit a linear model using class_data where mathgain_class is predicted by mathknow_class.

# Model the math gain with the classroom-level data
lm(mathgain_class ~ mathknow_class, class_data)
## 
## Call:
## lm(formula = mathgain_class ~ mathknow_class, data = class_data)
## 
## Coefficients:
##    (Intercept)  mathknow_class  
##          58.43            2.71

Note: .groups ="keep" tells summarize() to keep the grouping variable. It is included to suppress a tidyverse notification message.

  • Run the code to summarize the student data to the school level.
  • Fit a linear model using school_data where mathgain_school is predicted by mathknow_school.
# Summarize the data at the school level
school_data <-
    student_data %>%
    group_by(schoolid) %>%
    summarize(mathgain_school = mean(mathgain),
              mathknow_school = mean(mathknow),
              n_school = n(), .groups = 'keep')

# Model the data at the school-level
lm(mathgain_school ~ mathknow_school, school_data)
## 
## Call:
## lm(formula = mathgain_school ~ mathknow_school, data = school_data)
## 
## Coefficients:
##     (Intercept)  mathknow_school  
##        58.00624         -0.01119
  • Run the code to summarize the class data to the school level (school by class, s_by_c for short)
  • Fit a linear model using s_by_c_data where mathgain_s_by_c is predicted by mathknow_s_by_c.
  • Compare all model outputs.
# Summarize school by class (s_by_c)
s_by_c_data <-
    class_data %>%
    group_by(schoolid) %>%
    summarize(mathgain_s_by_c = mean(mathgain_class),
              mathknow_s_by_c = mean(mathknow_class),
              n_s_by_c = n(), .groups = 'keep')

# Model the data at the school-level after summarizing
# students at the class level
lm(mathgain_s_by_c ~ mathknow_s_by_c, s_by_c_data)
## 
## Call:
## lm(formula = mathgain_s_by_c ~ mathknow_s_by_c, data = s_by_c_data)
## 
## Coefficients:
##     (Intercept)  mathknow_s_by_c  
##         58.0246           0.3966

Question

You have fit four different linear models. Which statement is true?

Possible Answers

  • All coefficient estimate for mathknow (mathknow, mathknow_class, mathknow_school, and mathknow_s_by_c are same
  • The coefficient estimate for mathknow (mathknow, mathknow_class, mathknow_school, and mathknow_s_by_c are different.

Correct. Notice how the coefficient estimates are different from each other. Which estimate is correct? Which one should you use? Can we model the multiple levels of data? Yes. We can model multiple levels of data. However, first, we will review regression coefficients such as slopes and intercepts.

Parts of a regression

Intercepts

Intercepts are an important part of regression models, including hierarchical models, and allow the modeling of discrete groups. Without other coefficients, a single intercept is the global mean of the data. This model is also called a null model by some. Similarly, multiple intercepts allow you to estimate the mean for each group as long as other coefficients are not estimated.

During this exercise, you will learn about intercepts and see their relationship to means. You will look at a subset of the school data that only includes student data from the school with the id code of 3. This data has been loaded for you as school_3_data.

school_3_data <- student_data %>% 
  filter(schoolid == 3)
  • Use a linear model to estimate a global intercept for all student math gains mathgain. Recall that a global intercept is ~ 1 in R.
  • Use mean() to calculate the mean for all student math gains using summarize().
# Use a linear model to estimate the global intercept
lm(mathgain ~ 1 , school_3_data)
## 
## Call:
## lm(formula = mathgain ~ 1, data = school_3_data)
## 
## Coefficients:
## (Intercept)  
##       88.86
# Use summarize to calculate the mean
school_3_data %>%
    summarize(mean(mathgain))
## # A tibble: 1 × 1
##   `mean(mathgain)`
##              <dbl>
## 1             88.9
  • Try to estimate an intercept for each classroom with the formula mathgain ~ classid using lm()
# Use a linear model to estimate mathgain in each classroom

lm(mathgain ~ classid, school_3_data)
## 
## Call:
## lm(formula = mathgain ~ classid, data = school_3_data)
## 
## Coefficients:
## (Intercept)      classid  
##    92.64880     -0.03074
  • Notice R treats classid as a continuous variable.

  • mutate() classidto be a factor.

  • Rerun the model you fit in the previous step.

# Change classid to be a factor
school_3_data <-
    school_3_data %>%
    mutate(classid = factor(classid))

# Use a linear model to estimate mathgain in each classroom
lm(mathgain ~ classid, school_3_data)
## 
## Call:
## lm(formula = mathgain ~ classid, data = school_3_data)
## 
## Coefficients:
## (Intercept)   classid137   classid145   classid228  
##        85.0         20.5         11.0        -14.0
  • Calculate the means for each classroom using group_by(classid)
  • Estimate an intercept for each class using ~ classid - 1 in the lm() formula.
# Change classid to be a factor
school_3_data <-
    school_3_data %>%
    mutate(classid = factor(classid))

# Calculate the mean of mathgain for each class
school_3_data %>%
    group_by(classid) %>%
    summarize(n(), mathgain_class = mean(mathgain))
## # A tibble: 4 × 3
##   classid `n()` mathgain_class
##   <fct>   <int>          <dbl>
## 1 11          4            85 
## 2 137         2           106.
## 3 145         5            96 
## 4 228         3            71
# Estimate an intercept for each class
lm(mathgain ~ classid-1, school_3_data)
## 
## Call:
## lm(formula = mathgain ~ classid - 1, data = school_3_data)
## 
## Coefficients:
##  classid11  classid137  classid145  classid228  
##       85.0       105.5        96.0        71.0

You observed the relationships between means and intercepts. Notice how the global intercept was the same as the mean across all the data. When examining intercepts, notice how R’s default option gave you the difference between each classroom and the reference classroom. With formulas in R, the reference group is always the first group in a factor unless you change this setting. When using the - 1 option, the intercept for each classroom was the classroom’s mean. With multiple regression, the first group can have an intercept estimated for each group or contrasts estimated. The additional discrete predictors can only have contrasts estimated.

Slopes and multiple regression

Previously, you used multiple intercepts to model the expected values for discrete groups. Now, you will model continuous predictor variables with slopes.

In particular, you will model gains in math test scores by building three different models.

The data, school_3_data, has been loaded for you.

Notes:

  • y ~ x is a shortcut for y ~ 1 + x (where 1 + is the intercept in the model), and thus you can omit 1 + in your formulas.
  • Multiple regression use a + in the formula (e.g., y ~ x_one + x_two).
  • Use the interaction shortcut syntax y ~ x_one * x_two so the testing software will correctly score your answer.

Instructions

With the data school_3_data, use a linear model to examine how a student’s math gain (mathgain) is predicted by the student’s math score in kindergarten (mathkind).

# Use a linear model to estimate how math kindergarten
# scores predict math gains later
lm(mathgain ~ mathkind, school_3_data)
## 
## Call:
## lm(formula = mathgain ~ mathkind, data = school_3_data)
## 
## Coefficients:
## (Intercept)     mathkind  
##    335.5802      -0.5259

With the data school_3_data, model a student’s math gain (mathgain) being predicted by class (classid) and score (mathkind). Estimate a coefficient for each classroom using a - 1 at the end of your formula.

# Build a multiple regression
lm(mathgain ~ classid + mathkind  - 1 , school_3_data )
## 
## Call:
## lm(formula = mathgain ~ classid + mathkind - 1, data = school_3_data)
## 
## Coefficients:
##  classid11  classid137  classid145  classid228    mathkind  
##   404.1494    431.5223    435.9090    404.8938     -0.7049
  • With the data school_3_data, model a student’s math gain (mathgain) predicted by class (classid) and score (mathkind).
  • Estimate a coefficient for each classroom using a - 1 at the end of your formula. Include an interaction term for classid and mathkind.
# Build a multiple regression with interaction
lm(mathgain ~ classid * mathkind  - 1 , school_3_data )
## 
## Call:
## lm(formula = mathgain ~ classid * mathkind - 1, data = school_3_data)
## 
## Coefficients:
##           classid11           classid137           classid145  
##            187.7939             596.0303             561.1480  
##          classid228             mathkind  classid137:mathkind  
##           1376.9393              -0.2270              -0.8336  
## classid145:mathkind  classid228:mathkind  
##             -0.7376              -2.5300

You used three different models to estimate three different coefficients from mathkind. As an alternative, hierarchical models allow you to share parameter information across groups. This is also the topic of the next section.

Random-effects in regressions with school data

Random-effect intercepts

Linear models in R estimate parameters that are considered fixed or non-random and are called fixed-effects. In contrast, random-effect parameters assume data share a common error distribution, and can produce different estimates when there are small amounts of data or outliers. Models with both fixed and random-effects are mixed-effect models or linear mixed-effect regression.

The lme4 package fits mixed-effect models (models with both fixed- and random-effects) with lmer(), which uses a formula similar to lm().But, random-effect intercepts use special syntax:

lmer(y ~ x + (1 | random-effect), data = my_data)

lmer() function requires the model to include a random-effect, otherwise the model gives you an error. Here, you will fit a lm() and a lmer(), and then graphically compare the fitted models using a subset of the data. We provide this code because of the advanced data wrangling, which is required because random-effects are usually not plotted (ggplot2 also does not include nice plot options for mixed-effect models). In this plot, notice how the dashed lines from random-effect slopes compare to the solid lines from the fixed-effect slopes.

Note: broom.mixed is required because the broom package does not support lme4.

Instructions

  • Build a linear model with mathgain predicted by classid + mathkind using the student_data. Save the output as lm_out.
  • Build a linear mixed-effect model with mathgain predicted by mathkind as a fixed-effect and classid predicted as a random-effect using the student_data.
  • Run the existing code to extract out the mathkind coefficient details.
# Build a liner model including class as fixed-effect model
lm_out <- lm(mathgain ~ classid + mathkind , student_data)

# Build a mixed-effect model including class id as a random-effect
lmer_out <- lmer(mathgain ~ mathkind + (1 | classid), data = student_data)

# Extract out the slope estimate for mathkind
tidy(lm_out) %>%
    filter(term == "mathkind")
## # A tibble: 1 × 5
##   term     estimate std.error statistic  p.value
##   <chr>       <dbl>     <dbl>     <dbl>    <dbl>
## 1 mathkind   -0.407    0.0212     -19.2 4.98e-72
tidy(lmer_out) %>%
    filter(term == "mathkind")
## # A tibble: 1 × 8
##   effect group term     estimate std.error statistic    df  p.value
##   <chr>  <chr> <chr>       <dbl>     <dbl>     <dbl> <dbl>    <dbl>
## 1 fixed  <NA>  mathkind   -0.424    0.0215     -19.7 1178. 6.27e-75

Run the code as written. It includes these steps:

  • Re-run the model from previous step to re-load it.
  • Extract predicted model values. This allow us to get the model outputs. This also extracts the data only from school 1 to help you better see an example with the data.
  • Plot the original data.
  • Use the predicted values to add lines to the previous plot.
# Re-run the models to load their outputs
lm_out <- lm(mathgain ~ classid + mathkind, data = student_data)
lmer_out <- lmer(mathgain ~ mathkind + (1 | classid), data = student_data)

# Add the predictions to the original data
student_data_subset <-
    student_data %>%
    mutate(lm_predict = predict(lm_out),
           lmer_predict = predict(lmer_out)) %>%
    filter(schoolid == "1")

# Plot the predicted values
ggplot(student_data_subset,
       aes(x = mathkind, y = mathgain, color = as_factor(classid))) +
    geom_point() +
    geom_line(aes(x = mathkind, y = lm_predict)) +
    geom_line(aes(x = mathkind, y = lmer_predict), linetype = 'dashed') +
    xlab("Kindergarten math score") +
    ylab("Math gain later in school") +
    theme_bw() +
    scale_color_manual("Class ID", values = c("red", "blue"))

Question

In Step 1, you estimated a fixed-effect and random-effect coefficient for predicting mathgain. Both models included mathkind as a fixed effect, with classid changing from a fixed-effect to a random-effect. The models estimate standard errors for mathkind, which represents the uncertainty around the parameter estimate. How did the standard error change for mathkind between the two models?

Possible Answers

  • The uncertainty did not change.
  • The standard error for mathkind was 0.0258 from the lm() and decreased to 0.0215 from the lmer().
  • The tidy() outputs do not include the details necessary to answer this question.

Correct. Notice how the random-effect for classid explains more variability than the fixed-effect.

Random-effect slopes

In the previous exercise, you saw how to code random-effect intercepts. You will now see how to code random-effect slopes. With lme4 syntax, lmer() uses (countinuous_predictor | random_effect_group) for a random-effect slope. When lme4 estimates a random-effect slope, it also estimates a random-effect intercept. scale() rescaled the predictor variable mathkind to make the model more numerically stable. Without this change, lmer()cannot fit the model.

In the previous exercise, you estimated a random-effect intercept for each classroom and one slope for all data. Here, you will estimate a random-effect intercept for each class and a random-effect slope for each classroom. Like a random-effect intercept, a random-effect slope comes from a shared distribution of all random-effect slopes.

Instructions

  • Run the existing code to rescale mathkind to be mathkind_scaled.
  • Use the lmer() function from the lme4 package to fit a random-effects intercept model. This is similar to the model you fit before, but has mathgain predicted by mathkind_scaled. classid is the random-effect. Use the student_data.
  • Use the lmer() function from the lme4 package to fit a random-effects slope model. mathgain is predicted by mathkind_scaled as a random-effect slope and classid as a random-effect group.
# Rescale mathkind to make the model more stable
student_data <-
    student_data %>%
    mutate(mathkind_scaled = scale(mathkind))

# Build lmer models
lmer_intercept <- lmer(mathgain ~ mathkind_scaled + (1 | classid), data = student_data)

lmer_slope     <- lmer(mathgain ~ (mathkind_scaled | classid), data = student_data)
  • First, re-run the model from previous step to re-load it. This is done above.
  • Second, extract out the predicted model values. You will use these to plot the expected (or predicted) values from the model. Only the data from school 1 is plotted to make it easier to view plot.
  • Third, plot the original data.
# Add the predictions to the original data
student_data_subset <-
    student_data %>%
    mutate(lmer_intercept = predict(lmer_intercept),
           lmer_slope = predict(lmer_slope)) %>%
    filter(schoolid == "1")

# Plot the predicted values
ggplot(student_data_subset,
       aes(x = mathkind_scaled, y = mathgain, color =as_factor(classid))) +
    geom_point() +
    geom_line(aes(x = mathkind_scaled, y = lmer_intercept)) +
    geom_line(aes(x = mathkind_scaled, y = lmer_slope), linetype = 'dashed') +
    theme_bw() +
    scale_color_manual("Class ID", values = c("red", "blue"))

The model with fixed-effect slopes has parallel lines (solid lines) because the slope estimates are the same. The model with random-effect slopes (dashed lines) does not have parallel lines because the slope estimates are different. The model with random-effect slopes (dashed lines) has lines that are shallower than the other model. This occurred because slopes are being estimated for each classroom, but include a shared distribution. This shared distribution pools information from all classrooms (including those not shown on the plot).

Building the school model

You will now built a model to examine what factors predict a student’s gain in math knowledge. As described in the video, you will build a model to see:

  1. Does the sex of a student impact their knowledge gain?
  2. Does the teacher’s training impact the gain and does the teacher’s math knowledge impact the gain?

As part of this model, you will also account for other possible predictors including:

  • Does a student’s math knowledge in kindergarten impact their gain?
  • Does a school’s socio-economic status impact student gains?

Instructions

  • Using the student_data, build a model and save the output as lmer_classroom
  • Have mathgain predicted by mathknow, mathprep, sex, mathkind and ses as fixed-effects -Include classid as a random-effect.
  • Print the model’s output.
# Build the model
lmer_classroom <- lmer(mathgain ~ mathknow + mathprep + sex + mathkind + ses + (1| classid), data = student_data)

# Print the model's output
print(lmer_classroom)
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: mathgain ~ mathknow + mathprep + sex + mathkind + ses + (1 |  
##     classid)
##    Data: student_data
## REML criterion at convergence: 10337.78
## Random effects:
##  Groups   Name        Std.Dev.
##  classid  (Intercept) 12.61   
##  Residual             26.90   
## Number of obs: 1081, groups:  classid, 285
## Fixed Effects:
## (Intercept)     mathknow     mathprep          sex     mathkind          ses  
##    267.4076       1.6243       1.2964      -1.3139      -0.4548       6.0905

You have now built a model to examine student’s gains in math knowledge. In the exercise, you will examine and interpret the model’s coefficients.

  • The lmer model, lmer_classroom, you built in the previous exercise is loaded.
  • Use the tidy() function to extract the model’s coefficients including confidence intervals by setting conf.int to be TRUE.
  • Save the Tidy outputs as lmer_coef and print this object.
# Extract coefficents

lmer_coef <- tidy(lmer_classroom, conf.int = TRUE)

# Print coefficents
lmer_coef
## # A tibble: 8 × 10
##   effect   group    term  estimate std.error statistic    df    p.value conf.low
##   <chr>    <chr>    <chr>    <dbl>     <dbl>     <dbl> <dbl>      <dbl>    <dbl>
## 1 fixed    <NA>     (Int…  267.      11.1       24.0   1036.  1.05e-101  246.   
## 2 fixed    <NA>     math…    1.62     1.14       1.43   232.  1.55e-  1   -0.619
## 3 fixed    <NA>     math…    1.30     1.18       1.10   223.  2.73e-  1   -1.03 
## 4 fixed    <NA>     sex     -1.31     1.73      -0.758 1015.  4.49e-  1   -4.72 
## 5 fixed    <NA>     math…   -0.455    0.0223   -20.4   1074.  3.60e- 78   -0.499
## 6 fixed    <NA>     ses      6.09     1.25       4.85  1075.  1.39e-  6    3.63 
## 7 ran_pars classid  sd__…   12.6     NA         NA       NA  NA           NA    
## 8 ran_pars Residual sd__…   26.9     NA         NA       NA  NA           NA    
## # … with 1 more variable: conf.high <dbl>
  • Add lmer_coef to the code.
  • Leave the rest of the code as is to examine the outputs.
# Plot results
lmer_coef %>%
    filter(effect == "fixed" & term != "(Intercept)") %>%
    ggplot(., aes(x = term, y = estimate,
                  ymin = conf.low, ymax = conf.high)) +
    geom_hline(yintercept = 0, color = 'red') + 
    geom_point() +
    geom_linerange() +
    coord_flip() +
    theme_bw() +
    ylab("Coefficient estimate and 95% CI") +
    xlab("Regression coefficient")

Question

Examine the coefficients and their 95% confidence intervals.

Possible Answers

  • All of the predictors differed from zero.
  • None of the predictors differed from zero.
  • ses and mathkind differed from zero.

Correct. Both a school’s socio-economic status and a student’s kindergarten math knowldege had coefficients that differed from zero. Without knowing more about the subject area, we cannot say why. Additionally, perhaps there are more sources of variability not considered in the model. You will learn more about interpreting the results from mixed-effect models in the next chapter.

2. Linear Mixed Effect Models

This chapter providers an introduction to linear mixed-effects models. It covers different types of random-effects, describes how to understand the results for linear mixed-effects models, and goes over different methods for statistical inference with mixed-effects models using crime data from Maryland.

Building a lmer model with random effects

In the video, you learned about the county-level birth rate data. Counties exist within states and perhaps states contribute to variability. During these exercises, you’ll build a series of mixed-effects models using this data.

In this exercise, you’ll build a hierarchical model with a global intercept (fixed-effect) and random-effect for state. You will then look at the summary() of the model and the plot() of the residuals. Like other types of regression analysis, examining residuals can help you see if anything is wrong with the model.

With lmer(), there are two methods for doing this: y ~ 1 + (1 | random_effect) or the shortcut, y ~ (1 | random_effect). Use the shortcut in this exercise so that your answer passes the DataCamp test.

When building mixed-effect models, starting with simple models such as the global intercept model can check to see if problems exist with either the data or code. A global intercept assumes a single intercept can describe all of the variability in the data. One way to view a global intercept is that you cannot do any better modeling that data than to only model the mean without including any other predictor variables.

Instructions

  • Fit a lmer() model to the county_births_data data. Include State as a random-effect that predicts BirthRate.
  • Look at the summary from the model using summary() and plot the residuals of the model with plot().
# Build a lmer with State as a random effect
birth_rate_state_model <- lmer(BirthRate ~ (1 | State),
                            data = county_births_data)

# Look at the model's summary and the residuals
summary(birth_rate_state_model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BirthRate ~ (1 | State)
##    Data: county_births_data
## 
## REML criterion at convergence: 2411
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7957 -0.6056 -0.1063  0.5211  5.5948 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  State    (Intercept) 1.899    1.378   
##  Residual             3.256    1.804   
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  12.3362     0.2216 43.3830   55.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy(birth_rate_state_model)
## # A tibble: 3 × 8
##   effect   group    term            estimate std.error statistic    df   p.value
##   <chr>    <chr>    <chr>              <dbl>     <dbl>     <dbl> <dbl>     <dbl>
## 1 fixed    <NA>     (Intercept)        12.3      0.222      55.7  43.4  5.49e-42
## 2 ran_pars State    sd__(Intercept)     1.38    NA          NA    NA   NA       
## 3 ran_pars Residual sd__Observation     1.80    NA          NA    NA   NA
plot(birth_rate_state_model)

  • Run the code as written to extract the predicted values for county and then add these to the plot shown during the video. Notice how the outliers have been drawn closer to the center.
# Plot the predicted values from the model on top of the plot shown during the video
county_births_data$birthPredictState <-
    predict(birth_rate_state_model, county_births_data)
ggplot() + 
    theme_minimal() +
    geom_point(data = county_births_data,
               aes(x = TotalPopulation, y = BirthRate)) + 
    geom_point(data = county_births_data,
               aes(x = TotalPopulation, y = birthPredictState),
               color = 'blue', alpha = 0.5) 
## Warning: Removed 2 rows containing missing values (geom_point).
## Removed 2 rows containing missing values (geom_point).

You’ve now built a simple lmer() model that included a global mean. You might do this as a first step modeling your own data or as a debugging step. For example, if R could not fit a lmer() model, this simple model could help you see if the trouble was with your random-effects. In the next exercise, you will see how to include a fixed-effect. Later in this chapter, you will learn more about interpreting lmer() outputs.

Including a fixed effect

During the previous exercise, you built a model with only a global intercept. Usually, hierarchical models include predictor variables of interest.

The county-level birth data includes the average age of the mother, AverageAgeofMother. Perhaps this explains a county’s birth rate.

In this case, the formula in R “knows” AverageAgeofMother is numeric and will treat the corresponding coefficient as a slope.

Build a hierarchical model with county_births_data, which has been loaded for you, and include a fixed-effect. Does the average age of the mother at birth predict the birth rate?

Instructions

  • Build a model with AverageAgeofMother as a fixed-effect slope parameter including State as a random-effect. The response variable is BirthRate.
  • Examine the model’s summary().
# Include the AverageAgeofMother as a fixed effect within the lmer and state as a random effect
age_mother_model <- lmer(BirthRate ~ AverageAgeofMother + (1 | State), county_births_data)
summary(age_mother_model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BirthRate ~ AverageAgeofMother + (1 | State)
##    Data: county_births_data
## 
## REML criterion at convergence: 2347.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9602 -0.6086 -0.1042  0.5144  5.2686 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  State    (Intercept) 1.562    1.250   
##  Residual             2.920    1.709   
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                     Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)         27.57033    1.81801 575.96197  15.165  < 2e-16 ***
## AverageAgeofMother  -0.53549    0.06349 574.18338  -8.434 2.71e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.994
tidy(age_mother_model)
## # A tibble: 4 × 8
##   effect   group    term            estimate std.error statistic    df   p.value
##   <chr>    <chr>    <chr>              <dbl>     <dbl>     <dbl> <dbl>     <dbl>
## 1 fixed    <NA>     (Intercept)       27.6      1.82       15.2   576.  5.92e-44
## 2 fixed    <NA>     AverageAgeofMo…   -0.535    0.0635     -8.43  574.  2.71e-16
## 3 ran_pars State    sd__(Intercept)    1.25    NA          NA      NA  NA       
## 4 ran_pars Residual sd__Observation    1.71    NA          NA      NA  NA

Question

Looking at the regression coefficient AverageAgeofMother, which of the following statements is true about its estimate?

Possible Answers

  • The estimate is negative, suggesting counties with older mothers have lower birth rates.
  • The estimate is positive, suggesting counties with older mothers have higher birth rates.
  • The model did not estimate this coefficient.

Correct. The estimate was negative, suggesting counties with older mothers have lower birth rates on average. You will learn about statistical inferences with hierarchical models later. Additionally, other DataCamp courses cover statistical inference and introduce regression in R in greater detail.

Random-effect slopes

In the previous exercise, you estimated random-effect intercepts for each state. This allowed you to account for each state having its own intercept. During this exercise, you will estimate a random-effect slope for each state. For example, perhaps the log10(total population of each county), LogTotalPop, changes the birth rate of a county AND varies by state.

Recall from the video, a random-effect slope may be estimated for each group using (slope | group) syntax with lmer().

During this exercise, fit a mixed-effects model estimating the effect of the mother’s average age while accounting for state and total population as random-effects.

How do the outputs from this model compare to the previous model you built?

Instructions

  • Build the model from the previous exercise with AverageAgeofMother as a fixed-effect and State a random-effect.
  • Build the previous model, adding LogTotalPop as a random-effects slope.
  • Look at the tidy() output for each model.
county_births_data <- county_births_data %>% 
  mutate(LogTotalPop = log10(TotalPopulation))
# Include the AverageAgeofMother as fixed-effect and State as a random-effect
model_a <- lmer(BirthRate ~ AverageAgeofMother + (1 | State), county_births_data)
tidy(model_a)
## # A tibble: 4 × 8
##   effect   group    term            estimate std.error statistic    df   p.value
##   <chr>    <chr>    <chr>              <dbl>     <dbl>     <dbl> <dbl>     <dbl>
## 1 fixed    <NA>     (Intercept)       27.6      1.82       15.2   576.  5.92e-44
## 2 fixed    <NA>     AverageAgeofMo…   -0.535    0.0635     -8.43  574.  2.71e-16
## 3 ran_pars State    sd__(Intercept)    1.25    NA          NA      NA  NA       
## 4 ran_pars Residual sd__Observation    1.71    NA          NA      NA  NA
# Include the AverageAgeofMother as fixed-effect and LogTotalPop and State as random-effects
model_b <- lmer(BirthRate ~ AverageAgeofMother + (LogTotalPop | State), county_births_data)
tidy(model_b)
## # A tibble: 6 × 8
##   effect   group    term            estimate std.error statistic    df   p.value
##   <chr>    <chr>    <chr>              <dbl>     <dbl>     <dbl> <dbl>     <dbl>
## 1 fixed    <NA>     (Intercept)       34.6      1.81        19.1  570.  1.97e-63
## 2 fixed    <NA>     AverageAgeofMo…   -0.756    0.0620     -12.2  565.  1.75e-30
## 3 ran_pars State    sd__(Intercept)   13.5     NA           NA     NA  NA       
## 4 ran_pars State    cor__(Intercep…   -0.997   NA           NA     NA  NA       
## 5 ran_pars State    sd__LogTotalPop    2.26    NA           NA     NA  NA       
## 6 ran_pars Residual sd__Observation    1.53    NA           NA     NA  NA

Question

How did the estimated effect for the average age of a mother change between the two models?

Possible Answers

  • The AverageAgeofMother coefficient in model_b becomes closer to zero than in model_a, indicating a weaker relationship.
  • The AverageAgeofMother coefficient in model_b becomes farther from zero than in model_a, indicating a stronger relationship.
  • The AverageAgeofMother coefficient was the same in model_b and model_a.

Correct. The estimate for AverageAgeofMother became more negative (farther from zero) in model B. Including LogTotalPop changed the estimate for AverageAgeofMother and is an important coefficient. You may have also noticed the AverageAgeofMother coefficient in model b had a smaller standard error (std.error). This also caused a larger t-value (statistic). Both of these show including LogTotalPop in the model explains a source of variability in the dataset.

Uncorrelated random-effect slope

In the previous exercise, you used lme4’s default setting and assumed slopes and intercepts within each group were correlated for the random-effect estimates. However, this assumption may not always be valid or we may want to simplify the model if we are having trouble numerically fitting the model.

Building a model with uncorrelated random-effects is one method for potentially simplifying the model. Furthermore, lmer() models can be hard to fit and checking model outputs can be a useful step when debugging your model. Alternatively, you may have subject matter expertise and want to assume the random-effects are not correlated.

To fit a model with an uncorrelated random-effect slope, use || rather than | with lmer() syntax.

The second model you built in the previous exercise, model_b has been loaded for you. Compare the outputs of model_c to the old outputs from model_b.

The lme4 lmer vignette includes a section on uncorrelated random-effects.

Instructions

  • Build a model with AverageAgeofMother as a fixed-effect and LogTotalPop as an uncorrelated random-effect slope with State.
  • Compare the summary() outputs for each model.
# Include AverageAgeofMother as fixed-effect and LogTotalPop and State as uncorrelated random-effects
model_c <- lmer(BirthRate ~ AverageAgeofMother + (LogTotalPop || State), county_births_data)
## boundary (singular) fit: see help('isSingular')
# Compare outputs of both models 
summary(model_b)  #tidy(model_b)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BirthRate ~ AverageAgeofMother + (LogTotalPop | State)
##    Data: county_births_data
## 
## REML criterion at convergence: 2276.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7040 -0.5714 -0.1073  0.4427  5.7292 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  State    (Intercept) 181.609  13.476        
##           LogTotalPop   5.104   2.259   -1.00
##  Residual               2.334   1.528        
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                     Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)         34.57570    1.80649 569.97823   19.14   <2e-16 ***
## AverageAgeofMother  -0.75561    0.06202 565.19529  -12.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.994
summary(model_c)  #tidy(model_c)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BirthRate ~ AverageAgeofMother + (LogTotalPop || State)
##    Data: county_births_data
## 
## REML criterion at convergence: 2347.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9602 -0.6086 -0.1042  0.5144  5.2686 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  State    (Intercept) 1.562    1.250   
##  State.1  LogTotalPop 0.000    0.000   
##  Residual             2.920    1.709   
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                     Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)         27.57033    1.81801 575.96198  15.165  < 2e-16 ***
## AverageAgeofMother  -0.53549    0.06349 574.18338  -8.434 2.71e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.994
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

Question

Fitting model_c gave a singular warning message. In this case, the more complex model is required. The results from model_b provide insight into why. Specifically, look at the summary of model_b. Which statement is true?

Possible Answers

  • The random effect correlation for LogTotalPop is 1.00.
  • The random effect correlation for LogTotalPop is -1.00.
  • The random effect correlation for LogTotalPop is 0.00.

Correct. A perfect negative correlation exists between the random-effect slope and intercept. Thus, fitting the model without the correlation produces an error message. The random-effect slope and random-effect intercept cannot be estimated independent of each other.

Fixed- and random-effect predictor

In the previous exercises, you fit mixed-effect models with different fixed- and random-effects.

Sometimes, a model can have the same predictor as both a fixed and random-effect. For example, perhaps you are interested in estimating the average effect the age of a mother at birth (AverageAgeofMother). Including the predictor as fixed-effect allows you to estimate the effect of a mother’s age across all locations. Including the predictor as a random-effect allows you to simultaneously account (or correct) for different slope estimates among states.

Instructions

  • Construct a lmer() using AverageAgeofMother as a fixed-effect and AverageAgeofMother as a random-effect nested within State to predict BirthRate with the county_births_data. Make sure the fixed-effect goes before the random-effects in the formula.
  • Use summary() on the output.
# Construct a lmer() 
out <- lmer(BirthRate ~ AverageAgeofMother + (AverageAgeofMother|State), county_births_data)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0133555 (tol = 0.002, component 1)
# Look at the summary
summary(out)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BirthRate ~ AverageAgeofMother + (AverageAgeofMother | State)
##    Data: county_births_data
## 
## REML criterion at convergence: 2337.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8402 -0.5965 -0.1132  0.5233  5.1817 
## 
## Random effects:
##  Groups   Name               Variance Std.Dev. Corr 
##  State    (Intercept)        78.33144 8.8505        
##           AverageAgeofMother  0.08433 0.2904   -0.99
##  Residual                     2.80345 1.6744        
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                    Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)        27.21961    2.41010 39.91438  11.294 5.31e-14 ***
## AverageAgeofMother -0.52344    0.08293 39.42721  -6.312 1.83e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.997
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0133555 (tol = 0.002, component 1)

During this exercise, you built a model where the same predictor was both a fixed- and random-effect. This is another tool for your modeling toolbox. During the next video, you will learn about how to interpret lmer() outputs.

Understanding and reporting the outputs of a lmer

Comparing print and summary output

One of the first things to examine after fitting a model using lmer() is the model’s output using either the print() or summary() functions. Although similar, each produces slightly different outputs. The purpose of this exercise is to have you compare the two outputs and then answer a question about how the two differ. A lmer() model has been fit for you and saved as out.

out
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: BirthRate ~ AverageAgeofMother + (AverageAgeofMother | State)
##    Data: county_births_data
## REML criterion at convergence: 2337.506
## Random effects:
##  Groups   Name               Std.Dev. Corr 
##  State    (Intercept)        8.8505        
##           AverageAgeofMother 0.2904   -0.99
##  Residual                    1.6744        
## Number of obs: 578, groups:  State, 50
## Fixed Effects:
##        (Intercept)  AverageAgeofMother  
##            27.2196             -0.5234  
## optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 1 lme4 warnings
summary(out)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: BirthRate ~ AverageAgeofMother + (AverageAgeofMother | State)
##    Data: county_births_data
## 
## REML criterion at convergence: 2337.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8402 -0.5965 -0.1132  0.5233  5.1817 
## 
## Random effects:
##  Groups   Name               Variance Std.Dev. Corr 
##  State    (Intercept)        78.33144 8.8505        
##           AverageAgeofMother  0.08433 0.2904   -0.99
##  Residual                     2.80345 1.6744        
## Number of obs: 578, groups:  State, 50
## 
## Fixed effects:
##                    Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)        27.21961    2.41010 39.91438  11.294 5.31e-14 ***
## AverageAgeofMother -0.52344    0.08293 39.42721  -6.312 1.83e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## AvrgAgfMthr -0.997
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0133555 (tol = 0.002, component 1)

Which of the following appears in the summary() output but not the print() output?

Possible Answers

  • The model formula
  • The REML criterion at convergence
  • The standard errors for the fixed-effects
  • The number of observations and groups

Correct. The standard errors for the fixed-effects are not included in the print() output.

Extracting coefficients

We often want to know what values a model estimates as coefficients. Although the summary() function provides the model outputs, we might also want to directly access model outputs.

The fixed-effects estimates can be called directly using the fixef() function. The random-effects estimates can be called directly using the ranef() function. We can also extract confidence intervals for the fixed-effects using the function confint().

The broom.mixed package also contains tidy methods for extracting model results from lmer() models, namely the tidy() function. However, these results are more complex and less tidy than many tidy outputs due to the complexity of mixed-effect models.

A lmer model has been fit for you and saved as out.

Instructions

  • Extract the fixed-effect coefficients using fixef() with the saved model out.
  • Extract the random-effect coefficients using the ranef() with the saved model out.
  • Estimate the 95% confidence intervals using the confint() function with the saved model out.
  • Use the tidy() with out and conf.int = TRUE to repeat your previous three code calls with one tidy command.
# Extract the fixed-effect coefficients
fixef(out)
##        (Intercept) AverageAgeofMother 
##          27.219605          -0.523442
# Extract the random-effect coefficients
ranef(out)
## $State
##     (Intercept) AverageAgeofMother
## AK   4.15003868       -0.109184783
## AL  -4.03378321        0.127790698
## AR   1.21082566       -0.026165323
## AZ  -5.11009217        0.148282653
## CA  11.43254356       -0.373285277
## CO   1.35768165       -0.043219507
## CT  -3.82970951        0.066100814
## DC   1.21115127       -0.012314396
## DE  -2.11510834        0.060412134
## FL -11.31590854        0.334159606
## GA   6.93457372       -0.230274526
## HI  -0.51788167        0.020356335
## IA   3.41174428       -0.085895750
## ID   7.51123552       -0.222596318
## IL  -3.07183968        0.102977648
## IN  -3.02904144        0.097145469
## KS   5.15187643       -0.150220284
## KY   2.30557751       -0.058607045
## LA   5.32029471       -0.154658352
## MA  -5.72651427        0.144383430
## MD  -3.31971034        0.116256669
## ME  -5.76419680        0.145079276
## MI  -5.51723036        0.156052226
## MN  -1.41379992        0.077580405
## MO   0.46288340       -0.013492801
## MS  -0.77739331        0.021041676
## MT   0.07683320       -0.006560404
## NC   4.20869679       -0.159191892
## ND   3.20725583       -0.078130864
## NE   4.97980324       -0.122101047
## NH  -3.02333587        0.055641597
## NJ   6.33991764       -0.220034709
## NM   1.24087579       -0.066463156
## NV   0.35709749       -0.009233999
## NY  -5.34557647        0.162856972
## OH  -7.43833421        0.227092266
## OK   4.73462323       -0.140260716
## OR  -5.64958450        0.170021845
## PA  -8.17679196        0.230593543
## RI  -0.82967532       -0.014536489
## SC  -5.60511798        0.179511398
## SD   4.96670949       -0.128324308
## TN  -1.21831791        0.038711838
## TX  10.90259845       -0.331687671
## UT  14.12700916       -0.369723836
## VA  -8.76482204        0.350357075
## VT  -0.02787975       -0.014042974
## WA   0.30152985       -0.002489047
## WI  -0.35751943       -0.004946489
## WV  -3.92421158        0.115236390
## 
## with conditional variances for "State"
# Estimate the confidence intervals 
confint(out)
## Computing profile confidence intervals ...
## Warning in nextpar(mat, cc, i, delta, lowcut, upcut): unexpected decrease in
## profile: using minstep
## Warning in FUN(X[[i]], ...): non-monotonic profile for .sig02
## Warning in confint.thpr(pp, level = level, zeta = zeta): bad spline fit
## for .sig02: falling back to linear interpolation
##                         2.5 %     97.5 %
## .sig01              4.4099143 13.8613915
## .sig02             -0.9996895 -0.9607260
## .sig03              0.1333730  0.4630564
## .sigma              1.5761224  1.7821072
## (Intercept)        22.4264494 32.1350493
## AverageAgeofMother -0.6923139 -0.3576358
# Use the tidy function
tidy(out, conf.int = TRUE)
## # A tibble: 6 × 10
##   effect   group    term   estimate std.error statistic    df   p.value conf.low
##   <chr>    <chr>    <chr>     <dbl>     <dbl>     <dbl> <dbl>     <dbl>    <dbl>
## 1 fixed    <NA>     (Inte…   27.2      2.41       11.3   39.9  5.31e-14   22.3  
## 2 fixed    <NA>     Avera…   -0.523    0.0829     -6.31  39.4  1.83e- 7   -0.691
## 3 ran_pars State    sd__(…    8.85    NA          NA     NA   NA          NA    
## 4 ran_pars State    cor__…   -0.992   NA          NA     NA   NA          NA    
## 5 ran_pars State    sd__A…    0.290   NA          NA     NA   NA          NA    
## 6 ran_pars Residual sd__O…    1.67    NA          NA     NA   NA          NA    
## # … with 1 more variable: conf.high <dbl>

Notice how the tidy() function produces similar results. You will see how to use the other functions for plotting in the next exercise.

Displaying the results from a lmer model

Data scientists must communicate their work and DataCamp offers courses on the topic. Explaining your work helps your audience understand the results. To do this, match your presentation to your audience’s knowledge level and expectations.

For non-technical audiences, describe the important findings from your output. For example, you might say, counties with older mothers tend to have lower birth rates. For technical audiences, include details such as coefficient estimates, confidence intervals, and test statistics. Books such as The Chicago Guide to Writing about Multivariate Analysis provide suggestions for describing regression outputs.

During this exercise, you will extract and plot fixed-effects. Besides plotting the coefficients (with geom_point()) and their 95% confidence intervals (with geom_linerange()), you will add a red-line to the plot to help visualize where zero is located (using geom_hline()). If the 95% confidence intervals do not include zero, the coefficient’s estimate differs from zero.

coord_flip() is required because ggplot does not allow for xmin or xmax, only ymin and ymax. And, theme_minimal() changes the theme from the default.

Technical note: Extracting regression coefficients from lmer is tricky (see the discussion between the lmer and broom authors).

Instructions

  • Extract the coefficients from the model out using tidy() from the broom.mixed package. Include the confidence interval.
  • Use the existing code to filter out the random-effect estimates.
  • Print the coefficient table to the screen.
  • Plot the outputs using ggplot2. Use term for the x-axis, estimate for the y-axis, conf.low for the ymin, and conf.high for the ymax.
# Extract out the parameter estimates and confidence intervals
coef_estimates <-
    tidy(out, conf.int = TRUE) %>%
    filter(effect == "fixed")

# Print the new dataframe
print(coef_estimates)
## # A tibble: 2 × 10
##   effect group term         estimate std.error statistic    df  p.value conf.low
##   <chr>  <chr> <chr>           <dbl>     <dbl>     <dbl> <dbl>    <dbl>    <dbl>
## 1 fixed  <NA>  (Intercept)    27.2      2.41       11.3   39.9 5.31e-14   22.3  
## 2 fixed  <NA>  AverageAgeo…   -0.523    0.0829     -6.31  39.4 1.83e- 7   -0.691
## # … with 1 more variable: conf.high <dbl>
# Plot the results using ggplot2
ggplot(coef_estimates, aes(x = term, y = estimate,
                     ymin = conf.low, ymax = conf.high)) +
    geom_hline( yintercept = 0, color = 'red' ) +
    geom_linerange() + geom_point() + coord_flip() + theme_minimal()

In this exercise, you saw how to plot fixed-effect coefficients from a lmer() model. You used more code to wrangle and plot the output than to fit the model. In fact, data science projects usually include more code for the wrangling of model inputs and outputs than code for running models. As a data scientist, you add value to projects by helping people see and understand results.

Statistical inference with Maryland crime data

Visualizing Maryland crime data

Before fitting a model, plotting the data can be helpful to see if trends or data points jump out, outliers exist, or other attributes of the data require future consideration. Using ggplot2, you can plot lines for county and examine how crimes change through time. For this exercise, examine Maryland crime data (md_crime). This includes the Year, a count of violent Crimes in the county, and the County’s name.

To explore this data, first plot the data points for each county through time. This lets you see how each county changes through time. Rather than using an aesthetic such as color, group is used here because there are too many counties to easily distinguish colors. After plotting the raw data, add trend lines for each county.

Both the connect points (geom_line) and trend lines (geom_smooth) provide insight into what, if any, kinds of random effects are required. If all of the points appear to have similar ranges and means, a random-effect intercept may not be important. Likewise, if trends look consistent across counties (i.e., the trend lines look similar or parallel across groups), a random-effect slope may not be required.

Instructions

  • Plot how Crime (y variable) has changed across Year (x variable) in each County (group variable) using the md_crime data.
  • Add trend lines for each county with geom_smooth(method = 'lm', se = FALSE). se = FALSE creates a less cluttered plot.
# Plot the change in crime through time by County
plot1 <- 
    ggplot(data = md_crime, 
           aes(x = Year, y = Crime, group = County)) +
    geom_line() + 
    theme_minimal() +
    ylab("Major crimes reported per county")
print(plot1)

# Add the trend line for each county
plot1 + 
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Question

Does the number of major crimes per county appear to be different across counties? If so, a random-effect intercept will be needed.

Possible Answers

  • Yes, the number of major crimes per county appears to be different across counties.
  • No, all of the counties have similar numbers of major crimes.

Correct. Notice how the number of major crimes varies across counties. This suggests a random-effects intercept will improve the model’s fit. Likewise, the slopes may be different as well. This is something you will explore more in the next exercises.

Rescaling slopes

The last plot showed changes in crime rate varied by county. This shows you that you should include Year as both a random- and fixed-effect in your model. Including Year this way will estimate a global slope across all counties as well as slope for each county. The fixed-effect slope estimates the change in major crimes across all Maryland counties. The random-effect slope estimates model for that counties have different changes in crime.

But, fitting this model produces a warning message! To address this warning, change Year from starting at 2006 to starting at 0. We provide you with this new variable, Year2 (e.g., 2006 in Year is 0 in Year2). Sometimes when fitting regression, you need to scale or center the intercept to start at 0. This improves numerical stability of the model.

Instructions

  • Build a lmer() to predict Crime with Year as both a fixed-effect and random-effect slope and County as the random-effect intercept.
  • Build a second lmer() to predict Crime with Year2 as both a fixed-effect and random-effect slope and County as the random-effect intercept.
# Fit the model with Year as both a fixed and random-effect
lmer(Crime ~ Year + (1 + Year | County) , data = md_crime)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge: degenerate Hessian with 1 negative eigenvalues
## Warning: Model failed to converge with 1 negative eigenvalue: -7.2e-02
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: Crime ~ Year + (1 + Year | County)
##    Data: md_crime
## REML criterion at convergence: 2892.018
## Random effects:
##  Groups   Name        Std.Dev. Corr 
##  County   (Intercept) 386.29        
##           Year          1.34   -0.84
##  Residual             328.32        
## Number of obs: 192, groups:  County, 24
## Fixed Effects:
## (Intercept)         Year  
##   136642.97       -67.33  
## optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 2 lme4 warnings
# Fit the model with Year2 rather than Year
lmer(Crime ~ Year2 + (1 + Year2 | County) , data = md_crime)
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: Crime ~ Year2 + (1 + Year2 | County)
##    Data: md_crime
## REML criterion at convergence: 2535.723
## Random effects:
##  Groups   Name        Std.Dev. Corr 
##  County   (Intercept) 2754.03       
##           Year2        130.15  -0.91
##  Residual               91.79       
## Number of obs: 192, groups:  County, 24
## Fixed Effects:
## (Intercept)        Year2  
##     1577.28       -67.33

Notice how you had to rescale year so that the model could be fit by R. This is a common regression modeling approach to improve model fits. Some people will rescale the slope so the middle value is zero instead. Either approach is acceptable. The benefit of starting at zero is that it can sometimes create a regression coefficient that is easier to explain and understand. In the next exercise, you will examine the second model.

Null hypothesis testing

Null hypothesis testing uses p-values to test if a variable differs significantly from zero. Recently, the abuse and overuse of null hypothesis testing and p-values has caused the American Statistical Association to issue a statement about the use of p-values.

Because of criticisms such as these and other numerical challenges, Doug Bates (the creator of the lme4 package) does not include p-values as part of his package. Yet, you may still want or need to estimate p-values. To fill this need, several packages exist, including the lmerTest package.

lmerTest uses the same lmer() syntax as the lme4 package, but includes different outputs. During this exercise, you will fit a lmer() model using lmerTest and lme4.

Instructions

  • Load the lmerTest package, which will also load the lme4 package.
  • Build a lmer() model using only the lme4 package. In R, this is done with lme4::lmer().
  • Build the same lmer() but use the lmerTest package.
  • Look at each model’s summary.
# Load lmerTest

# Fit a lmer use lme4
out_lme4 <- 
    lme4::lmer(Crime ~ Year2 + (1 + Year2 | County), 
              data = md_crime)

# Fit a lmer use lmerTest
out_lmerTest <- 
    lmerTest::lmer(Crime ~ Year2 + (1 + Year2 | County), 
                  data = md_crime)

# Look at the summaries
summary(out_lme4)
## Linear mixed model fit by REML ['lmerMod']
## Formula: Crime ~ Year2 + (1 + Year2 | County)
##    Data: md_crime
## 
## REML criterion at convergence: 2535.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8080 -0.2235 -0.0390  0.2837  3.0767 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  County   (Intercept) 7584686  2754.03       
##           Year2         16939   130.15  -0.91
##  Residual                8425    91.79       
## Number of obs: 192, groups:  County, 24
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  1577.28     562.29   2.805
## Year2         -67.33      26.72  -2.520
## 
## Correlation of Fixed Effects:
##       (Intr)
## Year2 -0.906
summary(out_lmerTest)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Crime ~ Year2 + (1 + Year2 | County)
##    Data: md_crime
## 
## REML criterion at convergence: 2535.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.8080 -0.2235 -0.0390  0.2837  3.0767 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  County   (Intercept) 7584686  2754.03       
##           Year2         16939   130.15  -0.91
##  Residual                8425    91.79       
## Number of obs: 192, groups:  County, 24
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)  
## (Intercept)  1577.28     562.29   23.01   2.805   0.0100 *
## Year2         -67.33      26.72   23.01  -2.520   0.0191 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## Year2 -0.906

Question

lmerTest includes output lme4 lacks. Which of the following was not in the lme4::lmer() output, but was in the lmeTest::lmer()?

Possible Answers

  • Fixed-effect regression coefficient estimates (Estimate).
  • Coefficient estimate standard error (Std. Error).
  • P-value that the coefficient differs from zero (Pr(>|t|)).

Yes. lmerTest includes p-values as part of the outputs. Using this extra package helps people to estimate p-values when they need or want to with lmer models.

Controversies around P-values

P-values and null hypothesis testing historically have been important in science and statistics. Based upon the previous exercise and video, which of the following is not true:

Possible Answers

  • The American Statistical Association issued a statement about the use and misuse of P-values in 2016.
  • All statisticians agree about how to estimate P-values.
  • For complex models such a mixed-effect models, calculating P-values can be difficult.
  • lme4 does not include P-values by default.

As you saw, P-values can be powerful, but have limitations. Please consider reading the American Statistical Association’s statement of the use of p-values.

Model comparison with ANOVA

Comparing models can be difficult. Many methods exist although these are beyond the scope of this course such as model selection (e.g., AIC).

Analysis of Variance (ANOVA) exists as a basic option to compare lmer models. The ANOVA tests to see if one model explains more variability than a second model. The ANOVA does this by examining the amount of variability explained by the models.

For example, you can see if Year predicts Crime in Maryland. To do this, build a null model with only County as a random-effect and a year model that includes Year. You can then compare the two models using the anova() function.

If Year explains a significant amount of variability, then the P-value will be less than your pre-specified threshold (usually 0.05).

Instructions

  • Build a model that only includes County as a random-effect.
  • Build a model that includes Year2 as a random-effect slope.
  • Compare null_model (first) to year_model (second) using anova().
# Build the Null model with only County as a random-effect
null_model <- lmer(Crime ~ (1 | County) , data = md_crime)

# Build the Year2 model with Year2 as a fixed and random slope and County as the random-effect
year_model <- lmer(Crime ~ Year2 + (1 + Year2 | County) , data = md_crime)

# Compare null_model and year_model using an anova
anova(null_model, year_model)
## refitting model(s) with ML (instead of REML)
## Data: md_crime
## Models:
## null_model: Crime ~ (1 | County)
## year_model: Crime ~ Year2 + (1 + Year2 | County)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## null_model    3 2954.4 2964.2 -1474.2   2948.4                         
## year_model    6 2568.9 2588.4 -1278.4   2556.9 391.52  3  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question

Look at the anova() output, specifically Pr(>Chisq), which is the output from the null hypothesis where both models explain the same amount of variability. Does the second model (the model with Year as a slope, year_model) explain a significant amount of variability compared to the first model without the slope (null_model)?

Possible Answers

  • null_model explains more variability than year_model.
  • year_model explains more variability than null_model.

Yes. year_model explains significantly more variablity. In the next chapter, you will learn how to generalize lmers.

3. Generalized Linear Mixed Effect Models

This chapter extends linear mixed-effects models to include non-normal error terms using generalized linear mixed-effects models. By altering the model to include a non-normal error term, you are able to model more kinds of data with non-linear responses. After reviewing generalized linear models, the chapter examines binomial data and count data in the context of mixed-effects models.

Crash course on GLMs

Logistic Regression

In toxicology studies, organisms are often dosed and binary outcomes often occur such as dead/alive or inhibited/mobile. This is called a dose-response study. For example, the response to different doses might be mortality (1) or survival (0) at the end of a study.

During this exercise, we will fit a logistic regression using all three methods described in the video. You have been given two datasets.

  • df_long, in a “long” format with each row corresponding to an observation (i.e., a 0 or 1).
  • df_short, in an aggregated format with each row corresponding to a treatment (e.g., 6 successes, 4 failures, number of replicates = 10, proportion = 0.6).

When using the “wide” or “short” data frame, the “success, failure” methods for inputing logistic regression results require success and failure be a matrix. The easiest way to do this is with the cbind() function.

Tip: When working with data in the wild, always check to see what 0 and 1 correspond to. Different people use different notation and assumptions can cause problems for you if you assume wrong!

# Making up the said data frames
dose =  rep(c(0,2,4,6,8,10), each =20)

mortality = c(
rep(0:1, times = c(20, 0)),
rep(0:1, times = c(14, 6)),
rep(0:1, times = c(9, 11)),
rep(0:1, times = c(4, 16)),
rep(0:1, times = c(5, 15)),
rep(0:1, times = c(3, 17))
)

df_long <- tibble(
  dose, mortality
)

df_short <- data_frame(
dose = c(0,2,4,6,8,10),
mortality = c(0,6,11,16,15,17),
survival = c(20,14,9,4,5,3),
nReps = rep(20,6),
mortalityP = c(0,0.3,0.55,0.8,0.75,0.85))
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Instructions

  • Using the data df_long, fit a glm() with the "binomial" distribution family (or, synonymously, binomial error term) where mortality is predicted by dose.
  • Look at the summary() of the model.
# Fit a glm using data in a long format
fit_long <- glm(mortality ~ dose, data = df_long, 
                family = "binomial")
summary(fit_long)
## 
## Call:
## glm(formula = mortality ~ dose, family = "binomial", data = df_long)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2356  -0.7756   0.4141   0.9043   1.6419  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.91229    0.42900  -4.458 8.29e-06 ***
## dose         0.43256    0.07872   5.495 3.92e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.52  on 119  degrees of freedom
## Residual deviance: 121.29  on 118  degrees of freedom
## AIC: 125.29
## 
## Number of Fisher Scoring iterations: 4
  • Using the data df_short, fit a glm() with the “binomial” distribution family where the matrix cbind(mortality, survival) is predicted by dose.
  • Look at the summary() of the model.
# Fit a glm using data in a short format with two columns
fit_short <- glm(cbind(mortality, survival) ~ dose , 
                data = df_short, family = "binomial")
summary(fit_short)
## 
## Call:
## glm(formula = cbind(mortality, survival) ~ dose, family = "binomial", 
##     data = df_short)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -2.3477   0.4040   0.8544   1.3424  -0.8367  -1.0004  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.91229    0.42900  -4.458 8.29e-06 ***
## dose         0.43256    0.07872   5.495 3.92e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 54.143  on 5  degrees of freedom
## Residual deviance:  9.908  on 4  degrees of freedom
## AIC: 29.746
## 
## Number of Fisher Scoring iterations: 5
  • Using the data df_short, fit a glm() with the "binomial" distribution family where mortalityP is predicted by dose with weights of nReps.
  • Look at the summary() of the model.
# Fit a glm using data in a short format with weights
fit_short_p <- glm(mortalityP  ~ dose , data = df_short, 
                   weights = nReps , family = "binomial")
summary(fit_short_p)
## 
## Call:
## glm(formula = mortalityP ~ dose, family = "binomial", data = df_short, 
##     weights = nReps)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -2.3477   0.4040   0.8544   1.3424  -0.8367  -1.0004  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.91229    0.42900  -4.458 8.29e-06 ***
## dose         0.43256    0.07872   5.495 3.92e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 54.143  on 5  degrees of freedom
## Residual deviance:  9.908  on 4  degrees of freedom
## AIC: 29.746
## 
## Number of Fisher Scoring iterations: 5

You fit a GLM three different methods. All three methods produced outputs with the same coefficient estimates, but differed with respect to the degrees of freedom estimated. As a tip, use the input method that best matches your data and requires the least amount of data wrangling. The difference in degrees of freedom does not change the models. If you use model selection, make sure your data is in the same format (long vs. Short) for all models you compare.

Poisson Regression

A Poisson regression is another type of GLM. This requires integers or count data (i.e., 0, 1, 2, 3,…). For some situations, a Poisson regression can be more powerful (e.g., detecting statistically significantly trends) than a linear model or “Gaussian” regression.

During this exercise, we’re going to build a linear regression using the lm() function and a Poisson regression using glm().

The objects x and y are loaded into R for you.

x <- 1:20
y <-  c(rep(0:1, length.out = 9),2,1,2,0,1,1,0,1,5,1,1)

Instructions

  • Build an lm() where y is predicted by x and then print the summary.
  • Build a glm() where y is predicted by x with a "poisson" distribution function and then print the summary to the terminal.
  • Examine the coefficient estimates for each and notice how only the glm() produces statistically significant estimates for x.
# Fit the linear model
summary(lm(y~x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3677 -0.6145 -0.2602  0.4297  3.4805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.15263    0.50312   0.303   0.7651  
## x            0.07594    0.04200   1.808   0.0873 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.083 on 18 degrees of freedom
## Multiple R-squared:  0.1537, Adjusted R-squared:  0.1067 
## F-statistic: 3.269 on 1 and 18 DF,  p-value: 0.08733
# Fit the generalized linear model
summary(glm(y~x, family = "poisson"))
## 
## Call:
## glm(formula = y ~ x, family = "poisson")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6389  -0.9726  -0.3115   0.5307   2.1559  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.04267    0.60513  -1.723   0.0849 .
## x            0.08360    0.04256   1.964   0.0495 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 23.589  on 19  degrees of freedom
## Residual deviance: 19.462  on 18  degrees of freedom
## AIC: 52.17
## 
## Number of Fisher Scoring iterations: 5

You’ve now seen how to fit a Poisson regression, a type of GLM. Remember this, because you’ll be fitting a glmer() soon using a similar approach!

Plotting GLMs

Often, we want to “look” at our data and trends in our data. ggplot2 allows us to add trend lines to our plots. The default lines are created using a technique called local regression. However, we can specify that different models are used to create the lines, including GLMs.

During this exercise, we’ll see how to plot a GLM with ggplot2.

Instructions

  • Add a stat_smooth() to the first plot to fit the default line type.
# Plot the data using jittered points and the default stat_smooth
ggplot(data = df_long, aes(x = dose, y = mortality)) + 
    geom_jitter(height = 0.05, width = 0.1) +
    stat_smooth(fill = 'pink', color = 'red') 
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  • Add a stat_smooth() to the second plot. This time, specify the method to be "glm" and family to be "binomial" to fit a logistic regression.
# Plot the data using jittered points and the the glm stat_smooth
ggplot(data = df_long, aes(x = dose, y = mortality)) + 
    geom_jitter(height = 0.05, width = 0.1) +
    stat_smooth(method = "glm",  
                method.args = list(family = "binomial"))
## `geom_smooth()` using formula 'y ~ x'

Notice how the second plot is bounded by 0 and 1 on the y-axis. Also, notice how the curve is now monotonic (or only increasing and never decreasing). Being monotic is an important assumption of binomial regression.

Binomial data

Toxicology data

A toxicologist wonders if exposure to a chemical increases mortality as its dose increases. She has exposed test organisms to different doses of a chemical and recorded the number that died in each test tank. Each dose was repeated on three different days.

First, build a GLM. Building a simple model makes debugging easier. Also, you will later compare the results to GLMER. This model includes dose and replicate as a fixed-effects.

Second, build a GLMER. This time, dose is a fixed-effect and replicate is a random-effects intercept.

Last, examine the coefficient estimates with coef() from each model. Notice how the intercept estimates are displayed differently from each model.

# The toxicoligst's dataframe
dose =  rep(c(0,2,4,6,8,10), each =20, times = 3)
replicate  = rep(c("a","b","c"), each = 120)
mortality = c(
  rep(0:1, times = c(20, 0)),
  rep(0:1, times = c(16,4)),
  rep(0:1, times = c(10, 10)),
  rep(0:1, times = c(7,13)),
  rep(0:1, times = c(6, 14)),
  rep(0:1, times = c(3, 17)),
  rep(0:1, times = c(20, 0)),
  rep(0:1, times = c(15,5)),
  rep(0:1, times = c(8,12)),
  rep(0:1, times = c(11,9)),
  rep(0:1, times = c(11,9)),
  rep(0:1, times = c(5, 15)),
  rep(0:1, times = c(20, 0)),
  rep(0:1, times = c(19,1)),
  rep(0:1, times = c(15, 5)),
  rep(0:1, times = c(9,11)),
  rep(0:1, times = c(11, 9)),
  rep(0:1, times = c(6, 14))
  )

df <- data_frame(
  dose, replicate, mortality
)
df
## # A tibble: 360 × 3
##     dose replicate mortality
##    <dbl> <chr>         <int>
##  1     0 a                 0
##  2     0 a                 0
##  3     0 a                 0
##  4     0 a                 0
##  5     0 a                 0
##  6     0 a                 0
##  7     0 a                 0
##  8     0 a                 0
##  9     0 a                 0
## 10     0 a                 0
## # … with 350 more rows

Instructions

  • Create glm_out, a glm() model that predicts mortality as a function of dose and replicate. Estimate an intercept for the replicate group using replicate - 1.
  • Look at the model’s coefficient estimates with coef().
# Fit glm_out and look at its coefficient estimates
glm_out <- glm(mortality ~ dose + replicate -1, family = "binomial", data = df )
coef(glm_out)
##       dose replicatea replicateb replicatec 
##  0.3657509 -1.9199043 -2.2872780 -2.7645061
  • Create glmer_out, a glmer() model that predicts mortality as a function of dose while accounting for replicate as a random effect.
  • Look at the model’s coefficient estimates with coef().
# Load lmerTest
#library(lmerTest) # This package is already loaded

# Fit glmer_out and look at its coefficient estimates 
glmer_out <- glmer(mortality ~ dose + (1 |replicate), family = "binomial", data = df )
coef(glmer_out)
## $replicate
##   (Intercept)      dose
## a   -2.057613 0.3622273
## b   -2.280911 0.3622273
## c   -2.563972 0.3622273
## 
## attr(,"class")
## [1] "coef.mer"

Question

How do the coefficient estimates differ?

Possible Answers

  • The intercept estimates are closer to each other for the glmer() outputs.
  • The intercept estimates are closer to each other for the glm() outputs.
  • The intercept estimates are the same for both outputs.
  • Not enough information is provided by the models to compare the outputs.

Correct. Notice how the intercept coefficient estimates for the glmer()outputs are closer together. However, in this case both model estimates are similar and either model would be a tenable choice.

Marketing example

As described in the video, our client is interested in knowing if a friend’s recommendation increases the number of people who buy, rather than pass, on his online product. He has given us a summary of his data as a data.frame called all_data. This data includes the number of Purchases and Passes for 4 test cities (city) as well as the customer ranking. This data structure lends itself to using cbind() on the two columns of interest to create a matrix (You could use other methods of making a matrix in R, but this is one of the easiest methods).

You are interested to see if the recommendation from a friend increases people buying the product. To answer this question, you will build a glmer() model and then examine the model’s output.

If the parameter estimate for friend is significantly greater than zero, then a friend’s recommendation increases the chance somebody makes a purchase. If the parameter estimate for friend is significantly less than zero, then a friend’s recommendation decreases the chance somebody makes a purchase. If the parameter estimate for friend is not significantly different than zero, then a friend’s recommendation has no effect on somebody making a purchase.

# I copied the marketing dataset from datacamp's R console, pasted (automatically as text into 1 column in Excel, then converted it to columns (Text to Column functionality)

library(readxl)
all_data <- read_xlsx("data/all_data.xlsx")
  • Fit a glmer() with all_data data.frame. Use cbind(Purchases, Pass) being predicted by friend and ranking (friend goes first). Use city as your random-effect and family = "binomial".
  • cbind() is required because glmer() requires a matrix input.
# Load lmerTest

# Fit the model and look at its summary 
model_out <- glmer(cbind(Purchases, Pass) ~ friend + ranking + (1|city), family = "binomial", data = all_data)
summary(model_out) 
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(Purchases, Pass) ~ friend + ranking + (1 | city)
##    Data: all_data
## 
##      AIC      BIC   logLik deviance df.resid 
##    977.1    989.6   -484.5    969.1      164 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9986 -0.7692  0.1070  0.7954  2.8951 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  city   (Intercept) 0.01833  0.1354  
## Number of obs: 168, groups:  city, 4
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.274771   0.095609 -13.333   <2e-16 ***
## friendyes    0.511344   0.058862   8.687   <2e-16 ***
## ranking      0.079873   0.004969  16.075   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) frndys
## friendyes -0.346       
## ranking   -0.553  0.057

Question

How did a friend’s recommendation (the friendyes coefficient) change the chance of a purchase?

Possible Answers

  • Cannot tell from the outputs.
  • A friend’s recommendation decreased the chance of a purchase.
  • A friend’s recommendation had no effect on the chance of a purchase.
  • A friend’s recommendation increased the chance of a purchase.

Yes. The coefficient estimate for slope is positive and significantly different than zero. This indicates a friend’s recommendation increases the chance of purchasing an item. However, describing the coefficient to your friend would be hard. In the next exercise, you will see how odds-ratios can be used to describe logistic regression outputs. Other DataCamp courses such as A/B Testing in R cover this topic in greater detail.

Calculating odds-ratios

In the previous exercise, we saw how to compare the effects of a friend’s recommendation on sales. However, regression outputs can be hard to describe and sometimes odds-ratios can be easier to use. Using the outputs from the previous exercise, we’re going to calculate odds-ratios.

Refresher on odds-ratios:

  • If an odds-ratio is 1.0, then both events have an equal chance of occurring. For example, if the odds-ratio for a friend’s recommendation was 1.0, then a friend would have no influence on a purchase decision.
  • If an odds-ratio is less than 1, then a friend’s recommendation would decrease the chance of a purchase occurring. For example, an odds-ratio of 0.5 would mean a friend’s recommendation has odds of 1:2 or 1 purchase occurring for every 2 passes.
  • If an odds-ratio is greater than 1, then a friend’s recommendation would increase the chance of a purchase occurring. For example, an odds-ratio of 3.0 would mean a friend’s recommendation has odds of 3:1 or 3 purchases occurring for every 1 passes.

Note on course code: Since this course launched, the broom package has dropped support for lme4::lmer() models. If you try to repeat this on your own, you will need the broom.mixed package, which is on cran.

Instructions

  • Look at the summary() of model_out.
  • Extract the coefficients from model_out with fixef() and then convert to an odds-ratio by taking exponential. Repeat with confint() to get the confidence intervals.
  • Calculate the confidence intervals and then exponentiate the effect of friends on a purchase using tidy(). Make sure to set the conf.int and exponentiate parameters.
# Run the code to see how to calculate odds ratios
summary(model_out)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(Purchases, Pass) ~ friend + ranking + (1 | city)
##    Data: all_data
## 
##      AIC      BIC   logLik deviance df.resid 
##    977.1    989.6   -484.5    969.1      164 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9986 -0.7692  0.1070  0.7954  2.8951 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  city   (Intercept) 0.01833  0.1354  
## Number of obs: 168, groups:  city, 4
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.274771   0.095609 -13.333   <2e-16 ***
## friendyes    0.511344   0.058862   8.687   <2e-16 ***
## ranking      0.079873   0.004969  16.075   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) frndys
## friendyes -0.346       
## ranking   -0.553  0.057
exp(fixef(model_out))
## (Intercept)   friendyes     ranking 
##   0.2794948   1.6675308   1.0831499
exp(confint(model_out))
## Computing profile confidence intervals ...
##                 2.5 %   97.5 %
## .sig01      1.0616505 1.429753
## (Intercept) 0.2257593 0.345121
## friendyes   1.4860713 1.871783
## ranking     1.0726937 1.093795
# Create the tidied output
tidy(model_out, conf.int = TRUE, exponentiate = TRUE)
## # A tibble: 4 × 9
##   effect   group term  estimate std.error statistic   p.value conf.low conf.high
##   <chr>    <chr> <chr>    <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 fixed    <NA>  (Int…    0.279   0.0267     -13.3   1.48e-40    0.232     0.337
## 2 fixed    <NA>  frie…    1.67    0.0982       8.69  3.72e-18    1.49      1.87 
## 3 fixed    <NA>  rank…    1.08    0.00538     16.1   3.84e-58    1.07      1.09 
## 4 ran_pars city  sd__…    0.135  NA           NA    NA          NA        NA

Sometimes, odds ratios may be easier to explain than logistic regression coefficients. As you gain more experience as a data scientist, you will get better at explaining your results. These ‘soft skills’ can help set you apart from other data scientists. ‘May the odds be ever in your favor!’

Count Data

Internet click-throughs

A common theme throughout this course has been the nestedness of data. During this case study, you will examine the number of users who click through on a link to a website before and after a redesign. These users are nested within trial groups. You will then examine the outputs and make a recommendation to your client.

For this redesign, you will examine the web behavior of 10 people from 4 different focus groups, for a total of 40 people. You want to know if the number of clicks to different pages changed from the old to the new webpage while correcting for groups.

Instructions

  • Fit a glmer() where clicks is predicted by webpage using group as a random-effect and the data.frame user_groups. Be sure to use the ‘"poisson"’ family. Save the model as glmer_out.
  • Look at the summary() of the model.
# Some data wrangling
user_groups <- mutate_if(user_groups, is.character, as.factor)
user_groups <- within(user_groups, webpage <- relevel(webpage, ref = "New"))

# Fit a Poisson glmer
glmer_out <- glmer(clicks ~ webpage + (1|group), family = "poisson", data = user_groups)

# Examine output with summary()
summary(glmer_out)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: clicks ~ webpage + (1 | group)
##    Data: user_groups
## 
##      AIC      BIC   logLik deviance df.resid 
##    260.6    267.7   -127.3    254.6       77 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.7009 -0.7755 -0.1949  0.6753  2.4145 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  group  (Intercept) 0.1693   0.4115  
## Number of obs: 80, groups:  group, 4
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   0.6415     0.2366   2.711  0.00671 **
## webpageOld   -0.5566     0.1820  -3.058  0.00222 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## webpageOld -0.280

Question

Look at the regression coefficient estimate for the new webpage (compared to the old webpage). How did the update change the number of click-throughs?

Possible Answers

  • There was a negative, but statistically non-significant effect of the webpage redesign.
  • There was a positive, but statistically non-significant effect of the webpage redesign.
  • There was a negative, and statistically significant effect of the webpage redesign.
  • There was a positive, and statistically significant effect of the webpage redesign.

Question An important part of data science can be helping clients understand model outputs. Which of the following recommendations would you make based upon the model’s outputs?

Possible Answers

  • The updated webpage had no-effect on click-throughs.
  • The updated webpage appears to increase click-throughs.
  • The webpage should be removed because it is hurting company sales.

Correct. The updated webpage appears to increase the number of times people click-through. In this case, Poisson coefficients can be hard to explain. However, the DataCamp Course on GLMs in R describes other methods for describing Poisson regressions. The DataCamp Course on A/B Testing in R also provides more background on tests such as these. Hopefully, these examples have helped you to see how one statistical tool may be applied to many different problems if you know the language of the problem.

Chlamydia by age-group and county

The number of infections change through time and vary across age groups. Some possible reasons cultural, social, and policy-related factors. For small populations, the number of infections often includes zeros and may be non-normal. For data such as these, use a Poisson model.

For this exercise, you will examine how chlamydia infections vary in small, Illinois counties. You will ask:

  1. Do the number of reported cases vary between people aged 15-19 compared to 20-24?
  2. Are the number of reported cases changing across time for these two age group?

This data comes from the State of Illinois who provides summaries of infections such as chlamydia by age groups and counties. First, fit a Poisson glmer to the data. Then, examine the results. In the next exercise, plot the data.

Warning: If you mistype the formula, you may cause R to crash. This is a pitfall of using lmer() and glmer().

Instructions

  • Run a glmer() with a “poisson” family, predicting count as a function of fixed effects age (1st fixed-effect) and year (2nd fixed-effect), and include year as a random-effect grouped by county. Use the data il_data.
  • Look at the summary() of the model.
  • Be careful with the random-effect. If you wrongly specify the random-effect, you may crash R.
# Load lmerTest

# Age goes before year
model_out <- glmer(count ~ age + year + (year|county), family = "poisson", data = il_data)
summary(model_out)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: count ~ age + year + (year | county)
##    Data: il_data
## 
##      AIC      BIC   logLik deviance df.resid 
##   3215.6   3259.6  -1599.8   3199.6     1800 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4511 -0.0151 -0.0056 -0.0022  4.0053 
## 
## Random effects:
##  Groups Name        Variance  Std.Dev. Corr 
##  county (Intercept) 129.94605 11.3994       
##         year          0.06481  0.2546  -1.00
## Number of obs: 1808, groups:  county, 47
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.76264    2.16559  -4.970  6.7e-07 ***
## age20_24     -0.04152    0.03690  -1.125    0.261    
## age25_29     -1.16262    0.05290 -21.976  < 2e-16 ***
## age30_34     -2.28278    0.08487 -26.898  < 2e-16 ***
## year          0.32709    0.25644   1.276    0.202    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##          (Intr) a20_24 a25_29 a30_34
## age20_24 -0.008                     
## age25_29 -0.006  0.342              
## age30_34 -0.004  0.213  0.148       
## year     -0.768  0.000  0.000  0.000

Question

Did the number of cases vary between age groups?

Possible Answers

  • People aged 20 to 24 had significantly more reported infections than those aged 15 to 19.
  • People aged 20 to 24 had significantly less reported infections than those aged 15 to 19.
  • People aged 20 to 24 had fewer reported infections than those aged 15 to 19, but this difference was not significant.
  • People aged 20 to 24 had more reported infections than those aged 15 to 19, but this difference was not significant.

Question

Did the slope estimate for year suggest the number of reported cases changing across time?

Possible Answers

  • Yes, the slope estimate for year was significantly greater than zero, suggesting an increase in reported cases through time.
  • Yes, the slope estimate for year was significantly less than zero, suggesting a decrease in reported cases through time.
  • No, the slope estimate for year was not significantly different from zero. This suggests no trend in the number of reported cases.

Correct. No trend appears to be occuring through time. Next, you will plot this data.

Displaying chlamydia results

In the previous exercise, you fit a GLMER to the Illinois chlamydia data. During this exercise, we will go over some methods for displaying the results. You could use these methods to get summaries of the model for a client or document you are writing describing your results. However, I encourage you to learn how to manipulate and explore model outputs on your own to create your own methods for displaying results. Developing your own unique methods can help you stand out as a data scientist!

Here is what you will do:

  1. Examine the model estimates.
  2. Plot the data with and fit a glm to each age class. Although not exactly the same as the glmer() outputs, this approximation helps to display the results in a visually easy to understand method.

Instructions

  • Extract out the fixed-effect estimates from model_out using fixef().
  • Extract out the random-effect estimates from model_out using ranef().
  • Run the code to plot the data using ggplot2 methods.
il_data_2 <- read_xlsx("data/il_data_2.xlsx")
il_data_2 <- mutate_if(il_data_2, is.character, as.factor)
# Extract out fixed effects
fixef(model_out)
##  (Intercept)     age20_24     age25_29     age30_34         year 
## -10.76264149  -0.04151673  -1.16261773  -2.28278215   0.32708938
# Extract out random effects 
ranef(model_out)
## $county
##            (Intercept)         year
## ALEXANDER   -0.2847730  0.006331923
## BROWN       -0.2847730  0.006331923
## CALHOUN     -0.2847730  0.006331923
## CARROLL     12.2419046 -0.260432572
## CASS        -0.2847730  0.006331923
## CLARK       12.2138202 -0.268561955
## CLAY        -0.2847730  0.006331923
## CRAWFORD    12.5037945 -0.265761403
## CUMBERLAND  -0.2847730  0.006331923
## DE WITT     12.7456625 -0.277684054
## DOUGLAS     13.0752147 -0.306338945
## EDGAR       12.3643337 -0.283614781
## EDWARDS     -0.2847730  0.006331923
## FAYETTE     12.8095075 -0.273069319
## FORD        -0.2847730  0.006331923
## GALLATIN    -0.2847730  0.006331923
## GREENE      -0.2847730  0.006331923
## HAMILTON    -0.2847730  0.006331923
## HANCOCK     12.8581820 -0.305659278
## HARDIN      -0.2847730  0.006331923
## HENDERSON   -0.2847730  0.006331923
## IROQUOIS    13.1617300 -0.311381985
## JASPER      -0.2847730  0.006331923
## JERSEY      12.9203294 -0.272292923
## JO DAVIESS  12.7409939 -0.289756683
## JOHNSON     -0.2847730  0.006331923
## LAWRENCE    12.3714098 -0.268579903
## MARSHALL    -0.2847730  0.006331923
## MASON       -0.2847730  0.006331923
## MENARD      -0.2180907  0.004850099
## MERCER      12.7534737 -0.271687371
## MOULTRIE    -0.2180907  0.004850099
## PIATT       12.5653681 -0.296696601
## PIKE        12.5311151 -0.259219952
## POPE        -0.2180907  0.004850099
## PULASKI     -0.2180907  0.004850099
## PUTNAM      -0.2180907  0.004850099
## RICHLAND    12.0351399 -0.273937455
## SCHUYLER    -0.2180907  0.004850099
## SCOTT       -0.2180907  0.004850099
## SHELBY      12.5183838 -0.283481057
## STARK       -0.2180907  0.004850099
## UNION       13.1465830 -0.308682407
## WABASH      -0.2180907  0.004850099
## WASHINGTON  -0.2180907  0.004850099
## WAYNE       12.1149423 -0.253243180
## WHITE       -0.2180907  0.004850099
## 
## with conditional variances for "county"
# Run code to see one method for plotting the data
ggplot(data = il_data_2, 
       aes(x = year, y = count, group = county)) +
    geom_line() +
    facet_grid(age ~ . ) +
    stat_smooth(method = "glm",
                method.args = list(family = "poisson"), 
                se = FALSE,
                alpha = 0.5) +
    theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

Good job on both fitting a model and plotting the results. Look at the figure and the model outputs. How might you tell a story to describe the outputs to your friend, a scientist, and a news reporter? Plots can help you tell stories with your data. The next chapter will cover repeated measures applications of LMERS and GLMERS.

4. Repeated Measures

This chapter shows how repeated-measures analysis is a special case of mixed-effect modeling. The chapter begins by reviewing paired t-tests and repeated measures ANOVA. Next, the chapter uses a linear mixed-effect model to examine sleep study data. Lastly, the chapter uses a generalized linear mixed-effect model to examine hate crime data from New York state through time.

An introduction to repeated measures

Paired t-test

In the video, you learned how paired t-tests can be more powerful than regular t-tests. During this exercise, you will see an example demonstrating this. The first step will be to simulate data. Similar data might come from a people’s weights before or after a drug treatment or amount of money spent by a customer before or after a seeing a commercial.

Simulated data allows you to know the properties of the data and check to see if your model behaves as expected. R comes with many distributions including the normal. Your simulated data will have unequal variance (that is, the standard deviations will be different). The second step will be to analyze the data with both a paired and regular t-test. Last, you will be asked about the results from the paired t-tests.

As part of the first step, you will “set the seed” for R’s random number generator. This ensures you get the same numbers each time you run the code and DataCamp’s software correctly scores your code.

Instructions

  • Set the seed to equal 345659.
  • Set n_ind <- 10 to simulate 10 individuals.
  • Simulate the before group with a mean of 0 and standard deviation of 0.5.
  • Simulate the after group by adding a random vector with a mean 4.5 and standard deviation of 5 to the before vector.
# Set the seed to be 345659
set.seed(345659)

# Model 10 individuals 
n_ind <- 10

# simulate before with mean of 0 and sd of 0.5
before <- rnorm(n = n_ind, mean = 0, sd = 0.5)
# simulate after with mean effect of 4.5 and standard devation of 5
after  <- before + rnorm(n = n_ind, mean = 4.5, sd = 5)
  • Re-run the code from the previous section to load it into R.
  • Run an unpaired t-test comparing before (x) and after (y) by specifying paired = FALSE.
  • Run a paired t-test comparing before (x) and after (y) by specifying paired = TRUE.
# Run a standard, non-paired t-test
t.test(x = before, y = after, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  before and after
## t = -4.6991, df = 9.3736, p-value = 0.001004
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.966165 -2.809916
## sample estimates:
##  mean of x  mean of y 
## -0.2201914  5.1678489
# Run a standard, paired t-test
t.test(x = before, y =after, paired = TRUE)
## 
##  Paired t-test
## 
## data:  before and after
## t = -5.138, df = 9, p-value = 0.0006129
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -7.760267 -3.015813
## sample estimates:
## mean difference 
##        -5.38804

Question

In this case, both t-tests indicate significantly different means between groups. However, which test produces a larger (absolute value farthest from zero), t-score?

Possible Answers

  • The regular t-test
  • The paired t-test

Correct. Notice how the paired t-test has a larger t-score. The paired t-test is more powerful because it accounts for individual variability.

Repeated measures ANOVA

In the previous exercise, you saw how a paired t-test is more powerful than a regular t-test. During this exercise, you will see how statistical methods generalize. First, you will see how a paired t-test is a special case of a repeated measures ANOVA. In the process, you will see how a repeated measures ANOVA is a special case of a mixed-effects model by using lmer() in R.

The first part of this exercise will consist of transforming the simulated data from two vectors into a data.frame(). The second part will have you examine the model results to see how they are different. The third part will have you examine the outputs of the models and compare the results.

Instructions

  • Create the data.frame() dat:
    • y is the combined vector before and after, which have all been loaded for you.
    • trial is the repeated vector "before" and "after", each one repeated n_ind.
    • ind is the individual ID vector, repeated 2 times because there were two observations.
# Create the data.frame, using the variables from the previous exercise. 
# y is the joined vectors before and after.
# trial is the repeated names before and after, each one repeated n_ind
# ind is the letter of the individual, repeated 2 times (because there were two observations)
dat <- data.frame(y = c(before, after), 
                  trial = rep(c("before", "after"), each = n_ind),
                  ind = rep(letters[1:n_ind], times = 2))
  • Load the lmerTest package.
  • Run a paired t-test on before and after with t.test().
  • Run a lmer() and save it as lmer_out. Have y predicted by trial with ind as a random-effect intercept using dat.
  • Look at the summary() of lmer_out.
#library(lmerTest)

# Run the code from the previous step
dat <- data.frame(y = c(before, after), 
                  trial = rep(c("before", "after"), each = n_ind),
                  ind = rep(letters[1:n_ind], times = 2))

# Run a standard, paired t-test
t.test(before, after, paired = TRUE)
## 
##  Paired t-test
## 
## data:  before and after
## t = -5.138, df = 9, p-value = 0.0006129
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -7.760267 -3.015813
## sample estimates:
## mean difference 
##        -5.38804
# Run a lmer and save it as lmer_out
lmer_out <- lmer(y ~ trial + (1|ind), data = dat)

# Look at the summary() of lmer_out
summary(lmer_out)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: y ~ trial + (1 | ind)
##    Data: dat
## 
## REML criterion at convergence: 89.3
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.89008 -0.30775 -0.02928  0.22414  2.41195 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ind      (Intercept) 1.075    1.037   
##  Residual             5.498    2.345   
## Number of obs: 20, groups:  ind, 10
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)   5.1678     0.8108 17.5311   6.374 5.99e-06 ***
## trialbefore  -5.3880     1.0487  9.0000  -5.138 0.000613 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## trialbefore -0.647

Question

Which of the following statements about the outputs are false?

Possible Answers

  • The mean of the differences from the paired t-test and lmer() coefficient estimate for trialbefore both equal -5.388.
  • The t value from the paired t-test and the trialbeforecoefficient both equal -5.138.
  • The intercept from the t-test is the same as the (intercept) from the lmer() model’s fixed-effects.

Correct. The t-test does not provide an intercept estimate. The main purpose of this exercise was to show you a paired t-test is a special case of repeated measures ANOVA and a repeated measures ANOVA is simply a special case of a linear mixed-effects model. In the next exercise, you will see real data and learn more about interpreting the results.

Sleep study

Exploring the data

One of the first things to do with new datasets is to plot it prior to analysis. During this exercise, you will plot the data using ggplot2 and create a publication quality figure. This will help you to see the data.

This data examines if two different drugs change the amount of sleep individuals get. The response variable of interest is the amount of extra sleep a patient gets. The predictor variable is the drug group. The ID of each patient allows us to do a repeated measures analysis. This is a random-effect intercept and corresponds to the baseline effect of giving a person a sleeping drug. We do not care how much an individuals sleeps in this case, only the change in sleep of the groups.

The data.frame sleep contains these variables.

Instruction

  • Plot the raw data points using ggplot2.
  • extra is the y variable and group is the x variable with the sleep data.frame.
# Plot the raw data
ggplot(data = sleep, aes(x = group, y = extra)) +
    geom_point()

  • Replace geom_point() with geom_line() because you are plotting results from the same individuals.
  • Add group = ID in the aes() function.
# Replace geom_point with geom_line
ggplot(data = sleep, aes(x = group, y = extra, group = ID)) +
    geom_line()

  • Set the xlab() to be "Drug".
  • Set the ylab() to be "Extra sleep".
  • Change the theme to be theme_minimal().
# Add an x and y label to the plot and change the theme to be "minimal"
ggplot(sleep, aes(x = group, y = extra, group = ID)) +
  geom_line() +
  xlab(label = "Drug") +
  ylab(label = "Extra sleep") + 
  theme_minimal()

Now that you have plotted the data, you will get to build a model in the next exercise and compare two different forms of statistical inference. Also, in this figure, notice how almost all individuals had a greater increase in sleep with the second drug compared to the first. This suggests the second drug is more effective.

Building models

In this exercise, you will build a simple linear model (lm()) and then build a linear mixed-effects model (lmer()). The purpose of the first step is to make sure the data works well with a simple model because lm() outputs are easier to debug than lmer() outputs. During the next exercise, you will compare two different methods of statistical inference on the model.

Instructions

  • Build a linear model using lm(). The goal of this step is to simply make sure the model builds without errors or warnings.
  • Have extra predicted by fixed-effects group (1st) and ID (2nd) from the sleep data.
# Build a simple linear model 
lm( extra ~ group + ID, sleep)
## 
## Call:
## lm(formula = extra ~ group + ID, data = sleep)
## 
## Coefficients:
## (Intercept)        group           ID  
##     -2.6533       1.5800       0.3315

- Build almer()model withextrapredicted by the fixed-effectgroupand random-effect interceptIDusing thesleepdata. - Save the output aslmer_out`.

# Build a simple linear model 
lm(extra ~ group + ID, data = sleep)
## 
## Call:
## lm(formula = extra ~ group + ID, data = sleep)
## 
## Coefficients:
## (Intercept)        group           ID  
##     -2.6533       1.5800       0.3315
# Build a mixed-effects regression
lmer_out <- lmer(extra ~ group + (1|ID), sleep)

Now that you have built the models, the next step is using the models to make statistical inference. Usually, you would only use one approach, but in this case, you will compare two different methods of statistical inference.

Comparing regressions and ANOVAs

In the previous exercise, you built a regression model. Two methods for statistical inference include examining the amount of variance explained by coefficients in the model (an ANOVA-like analysis) and using linear predictor variables to model the data (a regression analysis framework). The choice of approaches largely depends upon personal preference and statistical training. Both of these approaches may be done using frequentists or Bayesian methods. Although this course only uses frequentist methods, the same ideas apply to Bayesian models.

The lmer_out model you build in the previous exercise has been loaded for you. First, you will run an anova() on it to see if group explains a significant amount of variability. Second, you will examine the regression coefficient from group to see if it significantly differs from zero.

Instructions - Run an anova() on lmer_out. - Look at the summary() of lmer_out to see the regression coefficient for group.

# Run an anova() on lmer_out
anova(lmer_out)
## Type III Analysis of Variance Table with Satterthwaite's method
##       Sum Sq Mean Sq NumDF DenDF F value   Pr(>F)   
## group 12.482  12.482     1     9  16.501 0.002833 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Look at the summary() of lmer_out
summary(lmer_out)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: extra ~ group + (1 | ID)
##    Data: sleep
## 
## REML criterion at convergence: 70
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.63372 -0.34157  0.03346  0.31511  1.83859 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  ID       (Intercept) 2.8483   1.6877  
##  Residual             0.7564   0.8697  
## Number of obs: 20, groups:  ID, 10
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)   
## (Intercept)  -0.8300     0.8143 17.9871  -1.019  0.32157   
## group         1.5800     0.3890  9.0000   4.062  0.00283 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## group -0.717

Question

Compare the ANOVA table to the regression coefficient estimate for group2 (the estimated difference of group 2 compared to group 1). Which of the following are in both tables?

Possible Answers

  • The estimated difference between group1 and group2
  • f-statistic from the models
  • p-values from the models
  • t-statistic from the models

Correct. Notice how both models find a statistically significant effect, but do so using different tests. In this case, both models produce identical p-values. Personally, I prefer regression inferences to ANOVA inferences because the output includes the estimated difference by default. The ANOVA tells us which variables explain a significant amount of variability, but the ANOVA does not tell us how things are different. Conversely, the regression tells us how much one unit of input would be expected to change the model’s output. During the next exercise, you will see how to visualize the outputs.

Plotting results

In the previous exercises, you have examined the raw data, used the data to build a model, and applied the model for statistical inferences. You found drug 2 increased the amount of extra sleep compared to drug 1. During this exercise, you will plot the results to see how much drug 2 increased extra sleep.

First, wrangle the data by using pivot_wider() function from the tidyr package. Then, calculate the difference in extra sleep for each individual. Last, plot this difference as a histogram.

Instructions

  • Load the tidyr package.
  • Spread the data and save the results as sleep_wide.
  • Calculate the difference and save this as the column diff.
# Load the tidyr package
#library(tidyr)

# Make the data wider
sleep_wide <- 
    pivot_wider(sleep, 
                names_from = group, values_from = extra)

# Calculate the difference 
sleep_wide$diff <- sleep_wide$`2` - sleep_wide$`1` 
  • Rerun the code from the previous section.
  • Use the data.frame sleep_wide with diff for x aesthetic.
  • Add geom_histogram().
  • Set the y label to "Count".
  • Set the x label to "Extra sleep from drug 2".
  • Use the black and white theme, theme_bw().
# Use the data sleep_wide and diff for the aesthetic x  
ggplot(sleep_wide, aes(x = diff)) + 
  geom_histogram() +
  ylab(label = "Count") +
  xlab(label = "Extra sleep from drug 2") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You have now examined how one drug affects another as a repeated measure. During the next exercise, you will examine how hate crimes change in New York.

Hate in NY state?

The hate crime data in this case study comes from Data.gov.The data spans 2010- 2016 .The data has however been cleaned up for this exercise’s practice.

Exploring NY hate data

The State of New York reports the number of hate crimes committed against people in each county. During this case study, you will examine and see if the number of hates crimes are changing through time. These exercises serve two purposes. First, they demonstrate how generalized mixed-effect regressions (glmer()) can be used for repeated measures in R. Second, they provide another example of using mixed-effect models for statistical inference.

Given the different population sizes of New York counties, you can reasonably assume the need for random-effect intercepts a priori. However, do you need random-effect slopes? Plot the data to see if trends appear to vary by county. Additionally, plotting the data will help you see what is going on.

Instructions

  • Using the data.frame hate, plot the total number of hate crimes, (y = TotalIncidents) across time (x = Year).
  • Group by County and include geom_line().
  • Add in a Poisson based trend line with geom_smooth().
  • Set the method to "glm" with method.args set to c("poisson").
  • Shut off the confidence interval by setting se to FALSE.
# Plot the TotalIncidents of hate crimes in NY by Year, grouped by County
ggplot(hate, aes(Year, TotalIncidents, group = County))+
  geom_line()+
  # Add a Poisson trend line for each county
  geom_smooth(method = "glm", method.args = c("poisson"), se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Question

In the previous plot, you saw different trends in hate crime data through time. Which of the following statements does this observation support?

Possible Answers

  • Different trends across groups implies you should use different fixed-effect intercepts for each group.
  • Different trends across groups implies you should use different random-effect slopes for each group.
  • Different trends across groups implies you should use different random-effect intercepts for each group.

Correct. Different counties appear to have different trends. In the next exercise, you will build models to describe this data.

Building the model

As part of the model building process, first build a simple Poisson regression. A glm() runs quicker than glmer() and is easier to debug. For example, the Poisson regression might catch if you have non-integer data. Plus, the Poisson regression can provide intuition about the more complicated model.

If you run into any problems, follow the instructions to fix your data.

Last, build and save a repeated-measures Poisson regression.

Instructions

  • Build a Poisson regression with the glm() function by setting family to "poisson".
  • Model how the TotalIncidents of hate crimes are predicted by Year and County as fixed-effects in your formula using the hate data frame.
# Model TotalIncidents predicted by Year and County
# Use a Poisson regression with the hate data.frame 
glm( TotalIncidents ~ Year + County ,family = "poisson", data = hate)
## 
## Call:  glm(formula = TotalIncidents ~ Year + County, family = "poisson", 
##     data = hate)
## 
## Coefficients:
##        (Intercept)                Year      CountyAllegany         CountyBronx  
##          121.98334            -0.05977            -1.65787             1.15126  
##       CountyBroome   CountyCattaraugus        CountyCayuga    CountyChautauqua  
##           -0.90255            -1.45353            -1.01203            -1.48607  
##      CountyChemung      CountyChenango       CountyClinton      CountyColumbia  
##           -1.70959            -1.17785            -0.68000            -1.72240  
##     CountyCortland      CountyDelaware      CountyDutchess          CountyErie  
##           -1.52662            -1.53832             0.01178             1.50107  
##        CountyEssex      CountyFranklin        CountyFulton       CountyGenesee  
##           -1.45495            -1.24769            -1.71765            -1.65787  
##       CountyGreene      CountyHerkimer     CountyJefferson         CountyKings  
##           -1.53832            -1.72478            -1.61690             2.56910  
##        CountyLewis    CountyLivingston       CountyMadison        CountyMonroe  
##           -1.71765            -1.22832            -0.87154            -0.20972  
##     CountyMultiple        CountyNassau      CountyNew York       CountyNiagara  
##           -1.09013             0.79580             2.48491            -0.60248  
##       CountyOneida      CountyOnondaga       CountyOntario        CountyOrange  
##           -1.11226            -1.17683            -1.76025            -0.17693  
##      CountyOrleans        CountyOswego        CountyOtsego        CountyPutnam  
##           -1.40229            -0.73686            -0.89300            -0.99506  
##       CountyQueens    CountyRensselaer      CountyRichmond      CountyRockland  
##            1.51898            -1.39772             0.66575             0.04996  
##     CountySaratoga   CountySchenectady     CountySchoharie      CountySchuyler  
##           -0.83543            -0.74492            -1.58092            -1.77742  
##       CountySeneca  CountySt. Lawrence       CountySteuben       CountySuffolk  
##           -1.71765            -1.04982            -1.34608             1.45168  
##     CountySullivan         CountyTioga      CountyTompkins        CountyUlster  
##           -1.83720            -1.59988            -0.63539            -0.88945  
##       CountyWarren    CountyWashington         CountyWayne   CountyWestchester  
##           -1.53832            -1.43472            -1.11461             0.74295  
## 
## Degrees of Freedom: 232 Total (i.e. Null);  173 Residual
## Null Deviance:       4107 
## Residual Deviance: 302.3     AIC: 1160
  • The glm() ran without any problems. Now, build a glmer().
  • Use County as a random-effect intercept and Year as both a fixed- and random-effect slope with the hate data.
  • Make sure you correctly specify which family you are using in the glmer().
# Build a glmer with County as a random-effect intercept and Year as both a fixed- and random-effect slope
glmer( TotalIncidents ~ Year + (Year|County) ,family = "poisson", data = hate)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.370199 (tol = 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: TotalIncidents ~ Year + (Year | County)
##    Data: hate
##       AIC       BIC    logLik  deviance  df.resid 
## 1165.2746 1182.5298 -577.6373 1155.2746       228 
## Random effects:
##  Groups Name        Std.Dev. Corr 
##  County (Intercept) 217.8915      
##         Year          0.1084 -1.00
## Number of obs: 233, groups:  County, 59
## Fixed Effects:
## (Intercept)         Year  
##    295.4813      -0.1464  
## optimizer (Nelder_Mead) convergence code: 0 (OK) ; 0 optimizer warnings; 3 lme4 warnings
  • glmer() gave this output: convergence code 0; 3 optimizer warnings; 0 lme4 warnings, which means it did not converge. The reason is the scale of Year.
  • Fix this by creating Year2, which starts with zero rather than 2010.
  • Now that you have updated the data, run a glmer() with County as a random-effect intercept and Year2 as both a fixed- and random-effect slope with the hate data.
  • Save the output as glmer_out.
# Create a new column, Year2 that starts with the min of Year
hate$Year2 <- hate$Year - min(hate$Year)

# Build a glmer with County as a random-effect intercept and Year2 as both a fixed- and random-effect slope
glmer_out <- glmer( TotalIncidents ~ Year2 + (Year2|County) ,family = "poisson", data = hate)

You have now build the model. Next, you will look at the model’s results.

Interpreting model results

Now, examine the model output you just fit to see if any trends exist in hate crime for New York.

Based upon the model’s summary(), what is the trend in New York hate crimes between 2010 and 2016?

Instructions

  • Examine the summary() of glmer_out.
# Examine the summary of glmer_out
summary(glmer_out)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: poisson  ( log )
## Formula: TotalIncidents ~ Year2 + (Year2 | County)
##    Data: hate
## 
##      AIC      BIC   logLik deviance df.resid 
##   1165.3   1182.5   -577.6   1155.3      228 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.5434 -0.4864 -0.1562  0.3319  3.1939 
## 
## Random effects:
##  Groups Name        Variance Std.Dev. Corr
##  County (Intercept) 1.16291  1.0784       
##         Year2       0.01175  0.1084   0.02
## Number of obs: 233, groups:  County, 59
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.27952    0.16600   7.708 1.28e-14 ***
## Year2       -0.14622    0.03324  -4.398 1.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##       (Intr)
## Year2 -0.338

Question

Based upon the regression coefficient for Year2, what was the trend for hate crimes in New York between 2010 and 2016?

Possible Answers

  • There was no statistically significant trend.
  • There was a statistically significant decreasing trend.
  • There was a statistically significant increasing trend.

Correct. The trend was significantly decreasing. As you may have noticed, Poisson regression coefficients can be hard to describe. The DataCamp Course Generalized Linear Models in R includes other methods for describing Poisson regression coefficients.

Displaying the results

The last, and arguably most important step in creating a model, is sharing the results.

During this exercise, you’ll extract out the county-level estimates and plot them with ggplot2. The county-level random-effect slopes need to be added to the fixed-effect slopes to get the slope estimates for each county.

In addition to this addition, the code includes ordering the counties by rate of crime (the slope estimates) to help visualize the data clearly.

The model you previously fit, glmer_out has been loaded for you.

Instructions

  • Extract and save the Year2 slope estimates as Year2_slope.
  • Extract out the County-level random-effects.
  • Create a new column for the slope by adding together fixed- and random-effect slopes.
# Extract out the fixed-effect slope for Year2
Year2_slope <- fixef(glmer_out)['Year2']

# Extract out the random-effect slopes for county
county_slope <- ranef(glmer_out)$County

# Create a new column for the slope
county_slope$slope <- county_slope$Year2 + Year2_slope

# Use the row names to create a county name column
county_slope$county <- rownames(county_slope)

# Create an ordered county-level factor based upon slope values
county_slope$county_plot <- factor(county_slope$county, 
                                   levels = county_slope$county[order(county_slope$slope)])
  • Re-run your previous code.
  • Use the data.frame county_slope and plot x = county_plot against y = slope.
# Now plot the results using ggplot2
ggplot(data = county_slope, aes(x = county_plot, y = slope)) + 
    geom_point() +
    coord_flip() + 
    theme_bw() + 
    ylab("Change in hate crimes per year")  +
    xlab("County")

Notice how the change in hate crime rates varies greatly across counties. In this case, a random-effect model can help capture this source of variability.

Hierarchical models in R review

During this course, you have used and compared fixed-effect models (e.g., lm() and glm()) and mixed-effect models (those with both fixed- and random-effects such as lmer() and glmer()). What are some of the trade-offs you have observed during the course?

Instructions

  • Sort the following concepts and trade-offs for each model type.

Congratulations! Happy coding