Visualizing Maryland crime data Before fitting a model, plotting the data can be helpful to see if trends or data points jump out, outliers exist, or other attributes of the data require future consideration. Using ggplot2, you can plot lines for county and examine how crimes change through time. For this exercise, examine Maryland crime data (md_crime). This includes the Year, a count of violent Crimes in the county, and the County's name. To explore this data, first plot the data points for each county through time. This lets you see how each county changes through time. Rather than using an aesthetic such as color, group is used here because there are too many counties to easily distinguish colors. After plotting the raw data, add trend lines for each county. Both the connect points (geom_line) and trend lines (geom_smooth) provide insight into what, if any, kinds of random effects are required. If all of the points appear to have similar ranges and means, a random-effect intercept may not be important. Likewise, if trends look consistent across counties (i.e., the trend lines look similar or parallel across groups), a random-effect slope may not be required. #Plot how Crime (y variable) has changed across Year (x variable) in each County (group variable) using the md_crime data. plot1 <- ggplot(data = md_crime, aes(x = Year, y = Crime, group = County)) + geom_line() + theme_minimal() + ylab("Major crimes reported per county") print(plot1) # Add the trend line for each county plot1 + geom_smooth(method = "lm", se = FALSE) ############################################################################################################################################# Rescaling slopes The last plot showed changes in crime rate varied by county. This shows you that you should include Year as both a random- and fixed-effect in your model. Including Year this way will estimate a global slope across all counties as well as slope for each county. The fixed-effect slope estimates the change in major crimes across all Maryland counties. The random-effect slope estimates model for that counties have different changes in crime. But, fitting this model produces a warning message! To address this warning, change Year from starting at 2006 to starting at 0. We provide you with this new variable, Year2 (e.g., 2006 in Year is 0 in Year2). Sometimes when fitting regression, you need to scale or center the intercept to start at 0. This improves numerical stability of the model. # Fit the model with Year as both a fixed and random-effect lmer(Crime ~ Year + (1 + Year | County) , data = md_crime) # Fit the model with Year2 rather than Year lmer(Crime ~ Year2 + (1 + Year2 | County) , data = md_crime) Notice how you had to rescale year so that the model could be fit by R. This is a common regression modeling approach to improve model fits. Some people will rescale the slope so the middle value is zero instead. Either approach is acceptable. The benefit of starting at zero is that it can sometimes create a regression coefficient that is easier to explain and understand. In the next exercise, you will examine the second model. ############################################################################################################################################# Null hypothesis testing lmerTest uses the same lmer() syntax as the lme4 package, but includes different outputs. During this exercise, you will fit a lmer() model using lmerTest and lme4. # Load lmerTest library(lmerTest) # Fit a lmer use lme4 out_lme4 <- lme4::lmer(Crime ~ Year2 + (1 + Year2 | County), data = md_crime) # Fit a lmer use lmerTest out_lmerTest <- lmerTest::lmer(Crime ~ Year2 + (1 + Year2 | County), data = md_crime) # Look at the summaries summary(out_lme4) summary(out_lmerTest) ############################################################################################################################################# Model comparison with ANOVA Comparing models can be difficult. Many methods exist although these are beyond the scope of this course such as model selection (e.g., AIC). Analysis of Variance (ANOVA) exists as a basic option to compare lmer models. The ANOVA tests to see if one model explains more variability than a second model. The ANOVA does this by examining the amount of variability explained by the models. For example, you can see if Year predicts Crime in Maryland. To do this, build a null model with only County as a random-effect and a year model that includes Year. You can then compare the two models using the anova() function. If Year explains a significant amount of variability, then the P-value will be less than your pre-specified threshold (usually 0.05). # Build the Null model with only County as a random-effect null_model <- lmer(Crime ~ (1 | County) , data = md_crime) # Build the Year2 model with Year2 as a fixed and random slope and County as the random-effect year_model <- lmer(Crime ~ Year2 + (1 + Year2 | County) , data = md_crime) # Compare null_model and year_model using an anova anova(null_model, year_model)