Using residuals (1)

In the next few exercises, you will calculate residuals from a data set that complies with the linear regression technical conditions. For the linear model conditions to hold, the points should be scattered throughout the residual plot with no discernible pattern. Here, the residual plot should look like a random scatter of points.

* Using hypdata_nice, draw a scatter plot of response versus explanatory.
* Run a linear regression of response versus explanatory on hypdata_nice.
* Get the observation-level information from the model using augment().
* Using modeled_observations, plot .resid versus .fitted and add a point layer.

> glimpse(hypdata_nice)
Rows: 200
Columns: 2
$ explanatory <dbl> 2.140046, 3.037086, 2.482534, 2.913218, 3.117546, 2.452627…
$ response    <dbl> 19.03777, 21.44482, 19.20788, 20.62550, 21.66421, 15.98922…

# These exercises use ggplot2 for plotting, dplyr for data
# manipulation, and broom for augment() and tidy()
library(ggplot2)
library(dplyr)
library(broom)

# Using hypdata_nice, draw a scatter plot of response vs. explanatory
ggplot(hypdata_nice, aes(x = explanatory, y = response)) +
  geom_point()

# Run a linear regression of response vs. explanatory
model <- lm(response ~ explanatory, data = hypdata_nice)

# Augment to get the observation-level information
modeled_observations <- augment(model)

# See the result
modeled_observations

# Using modeled_observations, draw a scatter plot of .resid vs. .fitted
ggplot(modeled_observations, aes(x = .fitted, y = .resid)) +
  geom_point()

--------------------------------------------------------------------------------------------------------------------------------------------

Using residuals (2)

Now, you will calculate residuals from a data set that violates the technical conditions. For the linear model conditions to hold, the points should be scattered throughout the residual plot with no discernible pattern. Here, the residuals reveal a violation of the technical conditions.

* Using hypdata_poor, draw a scatter plot of response versus explanatory.
* Run a linear regression of response versus explanatory on hypdata_poor.
* Get the observation-level information from the model using augment().
* Using modeled_observations, plot residuals versus fitted values and add a point layer.

> glimpse(hypdata_poor)
Rows: 200
Columns: 2
$ explanatory <dbl> 2.140046, 3.037086, 2.482534, 2.913218, 3.117546, 2.452627…
$ response    <dbl> 19.08102, 21.61779, 19.15005, 20.57338, 21.83831, 15.22685…

# Using hypdata_poor, draw a scatter plot of response vs. explanatory
ggplot(hypdata_poor, aes(x = explanatory, y = response)) +
  geom_point()

# Run a linear regression of response vs. explanatory
model <- lm(response ~ explanatory, data = hypdata_poor)

# Augment to get the observation-level information
modeled_observations <- augment(model)

# See the result
modeled_observations

# Using modeled_observations, draw a scatter plot
# of residuals vs. fitted values
ggplot(modeled_observations, aes(x = .fitted, y = .resid)) +
  geom_point()
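Aside: hypdata_nice and hypdata_poor are supplied by the exercises, and their actual generating process isn't shown. As an illustration of what "complies" versus "violates" can mean, here is a minimal simulation sketch; the names hypdata_sim_nice and hypdata_sim_poor and all parameter values are invented.

# Hypothetical simulation (not the course's data-generating code):
# "nice" errors have constant spread; "poor" errors fan out with x
set.seed(123)
n <- 200
explanatory <- runif(n, min = 2, max = 4)

# Complies with the conditions: linear trend, constant error variance
hypdata_sim_nice <- data.frame(
  explanatory,
  response = 13 + 2.8 * explanatory + rnorm(n, sd = 1)
)

# Violates the conditions: error spread grows with explanatory
hypdata_sim_poor <- data.frame(
  explanatory,
  response = 13 + 2.8 * explanatory + rnorm(n, sd = explanatory)
)

Refitting the models above on these simulated datasets should reproduce a patternless residual plot in the first case and a fan shape in the second.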
--------------------------------------------------------------------------------------------------------------------------------------------

Estimation with and without outlier

The data provided in this exercise (hypdata_outlier) contains an extreme outlier. A plot of the dataset is shown, along with the line from a linear regression of response versus explanatory. You will remove the outlying point to see how a single observation can affect the estimate of the line.

* Filter hypdata_outlier to remove the outlier.
* Update the plot, p, to add another smooth layer (use geom_smooth()).
* Like the existing smooth layer, the new one should use the linear regression method and not draw the standard-error ribbon.
* Unlike the existing smooth layer, it should use data = hypdata_no_outlier and be colored red.
* For now, just draw the smooth line, not the confidence bounds (se = FALSE).

> glimpse(hypdata_outlier)
Rows: 200
Columns: 2
$ explanatory <dbl> 2.140046, 3.037086, 2.482534, 2.913218, 3.117546, 2.452627…
$ response    <dbl> 19.03777, 21.44482, 19.20788, 20.62550, 21.66421, 15.98922…

# This plot is shown
p <- ggplot(hypdata_outlier, aes(x = explanatory, y = response)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Filter to remove the outlier
hypdata_no_outlier <- hypdata_outlier %>%
  filter(response < 100)

p +
  # Add another linear regression smooth layer, no ribbon,
  # hypdata_no_outlier data, colored red
  geom_smooth(method = "lm", data = hypdata_no_outlier, se = FALSE, color = "red")

--------------------------------------------------------------------------------------------------------------------------------------------

Inference with and without outlier (t-test)

Not only can one point change the estimated regression line, but it can also change the inferential analysis. The datasets with and without the outlier are provided as hypdata_outlier and hypdata_no_outlier, respectively.

# Model response vs. explanatory on hypdata_outlier and tidy it
lm(response ~ explanatory, data = hypdata_outlier) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)     2.71      7.52     0.361 0.719
2 explanatory     6.98      2.41     2.89  0.00428

# Do the same on hypdata_no_outlier
lm(response ~ explanatory, data = hypdata_no_outlier) %>%
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    12.7      0.496      25.7 9.06e-65
2 explanatory     2.78     0.160      17.4 1.02e-41

The models tell the same story as the scatter plots: the coefficients and p-values are wildly different.
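Aside: another way to see the outlier's impact, not part of the exercise, is to compare confidence intervals for the slope; broom's tidy() can return them via its conf.int argument. A minimal sketch:

# Slope confidence intervals, with and without the outlier
lm(response ~ explanatory, data = hypdata_outlier) %>%
  tidy(conf.int = TRUE) %>%
  select(term, estimate, conf.low, conf.high)

lm(response ~ explanatory, data = hypdata_no_outlier) %>%
  tidy(conf.int = TRUE) %>%
  select(term, estimate, conf.low, conf.high)

Expect a much wider interval for the slope when the outlier is included.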
--------------------------------------------------------------------------------------------------------------------------------------------

Inference with and without outlier (randomization)

Using the randomization test, you can again evaluate the effect of an outlier on the inferential conclusions of a linear model. You will run the test twice: once on data containing an outlying value and once on the same data without it. Note that spreading the code over several lines communicates the steps of the randomization test clearly.

Using the data frames hypdata_out (containing an outlier) and hypdata_noout (outlier removed), the data frames perm_slope_out and perm_slope_noout were created to contain the permuted slopes of the respective original datasets. The observed slopes are stored in the variables obs_slope_out and obs_slope_noout.

* Find the p-values by finding the proportion of permuted slopes whose absolute value is greater than or equal to the absolute value of the observed slope. As before, use mean() on the binary inequality to find the proportion.

> glimpse(hypdata_out)
Rows: 50
Columns: 2
$ explanatory <dbl> 2.140046, 3.037086, 2.482534, 2.913218, 3.117546, 2.452627…
$ response    <dbl> 23.059854, 21.848164, 14.397363, 27.132020, 5.310260, 15.7…
> glimpse(hypdata_noout)
Rows: 49
Columns: 2
$ explanatory <dbl> 2.140046, 3.037086, 2.482534, 2.913218, 3.117546, 2.452627…
$ response    <dbl> 23.059854, 21.848164, 14.397363, 27.132020, 5.310260, 15.7…
> glimpse(perm_slope_out)
Rows: 500
Columns: 2
$ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ stat      <dbl> 5.40932063, -13.16786787, 2.62257862, -25.18570496, 7.829668…
> glimpse(perm_slope_noout)
Rows: 500
Columns: 2
$ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ stat      <dbl> 2.4142296, -2.6724489, -3.6208553, -1.9259065, 2.5988298, -3…

# Calculate the p-value with the outlier
perm_slope_out %>%
  mutate(abs_perm_slope = abs(stat)) %>%
  summarize(p_value = mean(abs_perm_slope >= abs(obs_slope_out)))

# A tibble: 1 × 1
  p_value
    <dbl>
1   0.018

# Calculate the p-value without the outlier
perm_slope_noout %>%
  mutate(abs_perm_slope = abs(stat)) %>%
  summarize(p_value = mean(abs_perm_slope >= abs(obs_slope_noout)))

# A tibble: 1 × 1
  p_value
    <dbl>
1   0.304

As with the t-based inference, the p-values from the randomization tests with and without the outlier are quite different.
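Aside: perm_slope_out and perm_slope_noout are supplied ready-made, and the code that generated them isn't shown. One plausible way to build such a null distribution is with the infer package; this sketch is an assumption about the approach, not a transcript of the course code.

library(infer)

# Shuffle the response to break any x-y association, then
# recompute the slope on each of 500 permuted datasets
perm_slope_out <- hypdata_out %>%
  specify(response ~ explanatory) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 500, type = "permute") %>%
  calculate(stat = "slope")

# The observed slope comes from the unshuffled data
obs_slope_out <- hypdata_out %>%
  specify(response ~ explanatory) %>%
  calculate(stat = "slope") %>%
  pull(stat)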
--------------------------------------------------------------------------------------------------------------------------------------------

Adjusting for non-linear relationship

The next three examples work with datasets where the underlying data structure violates the linear regression technical conditions. For each example, you will apply a transformation to the data in order to create residual plots that look scattered.

In this first example, it appears as though the variables are not linearly related.

* Run a linear regression of response versus explanatory on hypdata_nonlinear.
* Get the observation-level information from the model.
* Using modeled_observations, plot the residuals versus the fitted values.
* Add a point layer.
* Add a horizontal line at y = 0 using geom_hline() and setting yintercept to 0.
* Update the model to add a quadratic term, I(explanatory^2), to the right-hand side of the formula.
* Rerun the code and look at how the plots change.

# Run this to see how the model looks
ggplot(hypdata_nonlinear, aes(x = explanatory, y = response)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Model response vs. explanatory
model <- lm(response ~ explanatory, data = hypdata_nonlinear)

# Extract observation-level information
modeled_observations <- augment(model)

# See the result
modeled_observations

# Using modeled_observations, plot residuals vs. fitted values
ggplot(modeled_observations, aes(x = .fitted, y = .resid)) +
  # Add a point layer
  geom_point() +
  # Add horizontal line at y = 0
  geom_hline(yintercept = 0)

# Run this to see how the model looks
ggplot(hypdata_nonlinear, aes(x = explanatory + explanatory ^ 2, y = response)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Model response vs. explanatory plus explanatory squared
model <- lm(response ~ explanatory + I(explanatory^2), data = hypdata_nonlinear)

# Extract observation-level information
modeled_observations <- augment(model)

# See the result
modeled_observations

# Using modeled_observations, plot residuals vs. fitted values
ggplot(modeled_observations, aes(x = .fitted, y = .resid)) +
  # Add a point layer
  geom_point() +
  # Add horizontal line at y = 0
  geom_hline(yintercept = 0)

Keep in mind that squaring the input variable works well for modeling data that are better fit by a curved line (and really, any time y is a function of x and x^2).

--------------------------------------------------------------------------------------------------------------------------------------------

Adjusting for non-constant errors

In this next example, it appears as though the variance of the response variable increases as the explanatory variable increases. Note that the fix in this exercise changes both the variability and the linearity of the relationship.

* Run a linear regression of response versus explanatory on hypdata_nonequalvar.
* Get the observation-level information from the model.
* Using modeled_observations, plot the residuals versus the fitted values.
* Add a point layer.
* Add a horizontal line at y = 0 using geom_hline() and setting yintercept to 0.
* Update the model so that the left-hand side of the formula is log(response).
* Rerun the code and look at how the plots change.

# Run this to see how the model looks
ggplot(hypdata_nonequalvar, aes(x = explanatory, y = response)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Model response vs. explanatory
model <- lm(response ~ explanatory, data = hypdata_nonequalvar)

# Extract observation-level information
modeled_observations <- augment(model)

# See the result
modeled_observations

# Using modeled_observations, plot residuals vs. fitted values
ggplot(modeled_observations, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

# Run this to see how the model looks
ggplot(hypdata_nonequalvar, aes(x = explanatory, y = log(response))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Model log-response vs. explanatory
model <- lm(log(response) ~ explanatory, data = hypdata_nonequalvar)

# Extract observation-level information
modeled_observations <- augment(model)

# See the result
modeled_observations

# Using modeled_observations, plot residuals vs. fitted values
ggplot(modeled_observations, aes(x = .fitted, y = .resid)) +
  # Add a point layer
  geom_point() +
  # Add horizontal line at y = 0
  geom_hline(yintercept = 0)

Keep in mind that the log transformation on the response is good for data where the variance is unequal (and really, when ln(y) is a function of x).
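Aside: a model for log(response) makes its predictions on the log scale. To report predictions in the original units, exponentiate the fitted values; on the original scale the model is multiplicative. A minimal sketch, reusing modeled_observations from the log model above:

# Back-transform fitted values from the log scale to the
# original response units
modeled_observations %>%
  mutate(fitted_original_scale = exp(.fitted))

On the original scale, a one-unit increase in explanatory multiplies the predicted response by exp(slope).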
--------------------------------------------------------------------------------------------------------------------------------------------

Adjusting for non-normal errors

In this last example, it appears as though the points are not normally distributed around the regression line. Again, note that the fix in this exercise changes both the variability and the linearity of the relationship.

* Run a linear regression of response versus explanatory on hypdata_nonnorm.
* Get the observation-level information from the model.
* Using modeled_observations, plot the residuals versus the fitted values.
* Add a point layer.
* Add a horizontal line at y = 0 using geom_hline() and setting yintercept to 0.
* Update the model so that the left-hand side of the formula is the square root of response.
* Rerun the code and look at how the plots change.

# Run this to see how the model looks
ggplot(hypdata_nonnorm, aes(x = explanatory, y = response)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Model response vs. explanatory
model <- lm(response ~ explanatory, data = hypdata_nonnorm)

# Extract observation-level information
modeled_observations <- augment(model)

# See the result
modeled_observations

# Using modeled_observations, plot residuals vs. fitted values
ggplot(modeled_observations, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

# Run this to see how the model looks
ggplot(hypdata_nonnorm, aes(x = explanatory, y = sqrt(response))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Model sqrt(response) vs. explanatory
model <- lm(sqrt(response) ~ explanatory, data = hypdata_nonnorm)

# Extract observation-level information
modeled_observations <- augment(model)

# See the result
modeled_observations

# Using modeled_observations, plot residuals vs. fitted values
ggplot(modeled_observations, aes(x = .fitted, y = .resid)) +
  # Add a point layer
  geom_point() +
  # Add horizontal line at y = 0
  geom_hline(yintercept = 0)

Keep in mind that the square root transformation on the response is good for data that are not normally distributed around the line (and really, when sqrt(y) is a function of x).
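Aside: the residuals-versus-fitted plot is not the only diagnostic for this condition. A normal quantile-quantile plot of the residuals, sketched below with ggplot2's stat_qq() (not part of the exercise), shows non-normality directly: points that hug the reference line suggest approximately normal errors.

# Q-Q plot of the residuals; compare before and after the
# square root transformation
ggplot(modeled_observations, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()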