Regression output: example I

The following code provides two equivalent methods for calculating the most important pieces of the linear model output. Recall that the p-value is the probability of observing data as extreme as (or more extreme than) the observed data, given that the null hypothesis is true. As with inference in other settings, you will need the sampling distribution of the statistic (here, the slope) assuming the null hypothesis is true. You will generate the null sampling distribution in later chapters, but for now, assume that it is correct. Additionally, notice that the standard errors of the slope and intercept estimates describe the variability of those estimates.

* Load the mosaicData package and the RailTrail data. The RailTrail data contain information about the number of users of a rail trail in Florence, MA, and the weather for each day.
* Using the lm() function, run a linear model regressing the volume of riders on the hightemp for the day. Assign the output of the lm() function to the object ride_lm.
* Use the summary() function on the linear model output to see the inferential analysis (including the p-value for the slope).
* Additionally, tidy() the linear model output to make it easier to use later.

> glimpse(RailTrail)
Rows: 90
Columns: 11
$ hightemp   83, 73, 74, 95, 44, 69, 66, 66, 80, 79, 78, 65, 41, 59, 50,…
$ lowtemp    50, 49, 52, 61, 52, 54, 39, 38, 55, 45, 55, 48, 49, 35, 35,…
$ avgtemp    66.5, 61.0, 63.0, 78.0, 48.0, 61.5, 52.5, 52.0, 67.5, 62.0,…
$ spring     0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,…
$ summer     1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
$ fall       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,…
$ cloudcover 7.6, 6.3, 7.5, 2.6, 10.0, 6.6, 2.4, 0.0, 3.8, 4.1, 8.5, 7.2…
$ precip     0.00, 0.29, 0.32, 0.00, 0.14, 0.02, 0.00, 0.00, 0.00, 0.00,…
$ volume     501, 419, 397, 385, 200, 375, 417, 629, 533, 547, 432, 418,…
$ weekday    TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TR…
$ dayType    "weekday", "weekday", "weekday", "weekend", "weekday", "wee…

# Load the mosaicData package and the RailTrail data
library(mosaicData)
data(RailTrail)

# tidy() comes from the broom package
library(broom)

# Fit a linear model
ride_lm <- lm(volume ~ hightemp, data = RailTrail)

# View the summary of your model
summary(ride_lm)

# Print the tidy model output
tidy(ride_lm)
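To connect the tidy() columns to the p-value definition above: the statistic column is just estimate divided by std.error, and p.value is the two-sided tail area of that t-statistic under the null sampling distribution. A minimal sketch, assuming ride_lm has been fit as above, reproduces those columns by hand:

# Sketch: reproduce the statistic and p.value columns of tidy(ride_lm)
ride_coefs <- tidy(ride_lm)
slope    <- ride_coefs$estimate[2]   # slope estimate for hightemp
slope_se <- ride_coefs$std.error[2]  # standard error of the slope

# The t-statistic is the estimate divided by its standard error
t_stat <- slope / slope_se

# The p-value is twice the upper-tail area of the null t-distribution
p_val <- 2 * pt(-abs(t_stat), df = df.residual(ride_lm))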
--------------------------------------------------------------------------------------------------------------------------------------------

First random sample, second random sample

Now you will dive into understanding how linear models vary from sample to sample. Here, two random samples from a population are plotted on the same scatterplot. The population data (called popdata) already exists and is pre-loaded, along with ggplot2 and dplyr.

* Using the whole dataset, popdata, plot response vs. explanatory.
* Add a point layer using geom_point().
* Add a smooth trend layer using geom_smooth(), with the linear regression method, "lm", and no standard error ribbon, se = FALSE.
* Using sample_n() from dplyr, randomly sample 50 rows of popdata without replacement. Note that sample_n() takes a random sample of observations (of the designated size) from the data frame, without replacement.
* Do the same sampling again for sample2.
* Combine the two random samples using bind_rows(). Pass it sample1 and sample2, and set .id to "replicate".
* Using both_samples, plot response vs. explanatory, colored by replicate.
* Add a point layer.
* Add a smooth trend layer using linear regression, without a standard error ribbon.

> glimpse(popdata)
Rows: 1,000
Columns: 2
$ explanatory -4.299772326, 0.185431494, -2.587328240, -0.433911956, 0.5…
$ response    27.79571, 39.18826, 39.31460, 42.60371, 46.37942, 40.20737…

# Using popdata, plot response vs. explanatory
ggplot(popdata, aes(x = explanatory, y = response)) +
  # Add a point layer
  geom_point() +
  # Add a smooth trend layer, using lin. reg., no ribbon
  geom_smooth(method = "lm", se = FALSE)

# Set the random number generator seed for reproducibility
set.seed(4747)

# From popdata, randomly sample 50 rows without replacement
sample1 <- popdata %>%
  sample_n(size = 50)

# Do the same again
sample2 <- popdata %>%
  sample_n(size = 50)

# Combine both samples
both_samples <- bind_rows(sample1, sample2, .id = "replicate")

# See the result
glimpse(both_samples)

# Using both_samples, plot response vs. explanatory, colored by replicate
ggplot(both_samples, aes(x = explanatory, y = response, color = replicate)) +
  # Add a point layer
  geom_point() +
  # Add a smooth trend layer, using lin. reg., no ribbon
  geom_smooth(method = "lm", se = FALSE)
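The two fitted lines differ because each is estimated from a different random sample of the same population. As a quick numeric check (a sketch, assuming sample1 and sample2 from above still exist), you can compare the two slope estimates directly:

# Sketch: the slope estimate changes from one random sample to the next
coef(lm(response ~ explanatory, data = sample1))["explanatory"]
coef(lm(response ~ explanatory, data = sample2))["explanatory"]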
--------------------------------------------------------------------------------------------------------------------------------------------

Superimpose lines

Building on the previous exercise, you will now repeat the sampling process 100 times in order to visualize the sampling distribution of regression lines generated by 100 different random samples of the population. Rather than repeatedly calling sample_n(), as you did in the previous exercise, rep_sample_n() from the oilabs package provides a convenient way to generate many random samples: it repeats the sample_n() command reps times. The do() function from dplyr allows you to run the lm() call separately for each level of a variable that has been passed to group_by(). Here, the grouping variable is the sampling replicate, so each lm() is run on a different random sample of the data.

* Pipe popdata to rep_sample_n(), setting the size of each sample to 50, and generating 100 reps.
* Using many_samples (that you created in the previous step), plot response vs. explanatory, grouped by replicate.
* Add a point layer.
* Add a smooth trend layer using linear regression, without a standard error ribbon.
* Again, group many_samples by replicate.
* For each replicate, run the model, then tidy it. Inside do(), call lm() with the usual model formula: response vs. explanatory. The data argument is simply ., referring to the grouped data. Pipe this to tidy().
* Filter for rows where term equals "explanatory".
* Using many_lms, plot estimate.
* Add a histogram layer using geom_histogram().

# Set the seed for reproducibility
set.seed(4747)

# Repeatedly sample the population without replacement
many_samples <- popdata %>%
  rep_sample_n(size = 50, reps = 100)

# See the result
glimpse(many_samples)

Rows: 5,000
Columns: 3
Groups: replicate [100]
$ replicate   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ explanatory -0.40361970, -2.51718202, -0.93785455, 5.90849087, -1.6208…
$ response    37.80038, 28.87435, 56.99458, 41.22943, 35.03302, 46.82063…

# Using many_samples, plot response vs. explanatory, grouped by replicate
ggplot(many_samples, aes(x = explanatory, y = response, group = replicate)) +
  # Add a point layer
  geom_point() +
  # Add a smooth trend line, using lin. reg., no ribbon
  geom_smooth(method = "lm", se = FALSE)

many_lms <- many_samples %>%
  # Group by replicate
  group_by(replicate) %>%
  # Run the model on each replicate, then tidy it
  do(lm(response ~ explanatory, data = .) %>% tidy()) %>%
  # Filter for rows where the term is explanatory
  filter(term == "explanatory")

# See the result
many_lms

# A tibble: 100 × 6
# Groups:   replicate [100]
   replicate term        estimate std.error statistic  p.value
 1         1 explanatory     1.89     0.293      6.43 5.42e- 8
 2         2 explanatory     2.18     0.240      9.08 5.38e-12
 3         3 explanatory     2.54     0.229     11.1  8.44e-15
 4         4 explanatory     2.15     0.324      6.65 2.54e- 8
 5         5 explanatory     2.07     0.198     10.4  5.97e-14
 6         6 explanatory     2.20     0.284      7.75 5.22e-10
 7         7 explanatory     1.84     0.266      6.91 1.02e- 8
 8         8 explanatory     2.20     0.292      7.52 1.18e- 9
 9         9 explanatory     2.17     0.254      8.55 3.33e-11
10        10 explanatory     1.81     0.314      5.75 6.04e- 7
# … with 90 more rows

# Using many_lms, plot estimate
ggplot(many_lms, aes(x = estimate)) +
  # Add a histogram layer
  geom_histogram()

--------------------------------------------------------------------------------------------------------------------------------------------

Original population - change sample size

In order to understand the sampling distribution associated with the slope coefficient, it is valuable to visualize the impact that changes in the sample and population have on the slope coefficient. Here, changing the sample size directly impacts how variable the slope is.

* From popdata, take random samples of size 50, replicated 100 times.
* Using many_samples, plot response vs. explanatory, grouped by replicate.
* Add a point layer.
* Add a smooth trend layer using linear regression, without a standard error ribbon.
* Edit the code for many_samples to reduce the size from 50 to 10.

With the smaller sample size, there is more variation in the positions of the trend lines; a sketch quantifying this follows the code below.

set.seed(4747)

# Generate 100 random samples of size 50
many_samples <- popdata %>%
  rep_sample_n(size = 50, reps = 100)

# Using many_samples, plot response vs. explanatory, grouped by replicate
ggplot(many_samples, aes(x = explanatory, y = response, group = replicate)) +
  # Add a point layer
  geom_point() +
  # Add a smooth trend layer, using lin. reg., no ribbon
  geom_smooth(method = "lm", se = FALSE)

# Edit the code to take samples of size 10
many_samples <- popdata %>%
  rep_sample_n(size = 10, reps = 100)

# Draw the plot again; how is it different?
ggplot(many_samples, aes(x = explanatory, y = response, group = replicate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
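To put a number on that extra variation, the following sketch computes the standard deviation of the 100 slope estimates at each sample size. Here, slope_sd() is a hypothetical helper defined only for this illustration; it assumes popdata, dplyr, broom, and rep_sample_n() are available, as above:

# Sketch: slope_sd() is a hypothetical helper that measures how much the
# slope estimate varies across 100 random samples of a given size
slope_sd <- function(n) {
  popdata %>%
    rep_sample_n(size = n, reps = 100) %>%
    group_by(replicate) %>%
    do(lm(response ~ explanatory, data = .) %>% tidy()) %>%
    filter(term == "explanatory") %>%
    pull(estimate) %>%
    sd()
}

slope_sd(50)  # smaller spread of slope estimates
slope_sd(10)  # noticeably larger spread of slope estimates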
--------------------------------------------------------------------------------------------------------------------------------------------

Hypothetical population - less variability around the line

In order to understand the sampling distribution associated with the slope coefficient, it is valuable to visualize the impact that changes in the sample and population have on the slope coefficient. Here, reducing the variability of the response variable around the line changes the variability associated with the slope statistics. Notice that the farther the points are from the line (the more variable the points are), the more variable the slope coefficients will be. The hypothetical population new_popdata, which has less variability around the line, is pre-loaded.

# Update the sampling to use new_popdata
many_samples <- new_popdata %>%
  rep_sample_n(size = 50, reps = 100)

# Rerun the plot; how does it change?
ggplot(many_samples, aes(x = explanatory, y = response, group = replicate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

--------------------------------------------------------------------------------------------------------------------------------------------

Hypothetical population - less variability in x direction

In order to understand the sampling distribution associated with the slope coefficient, it is valuable to visualize the impact that changes in the sample and population have on the slope coefficient. Here, reducing the variance of the explanatory variable changes the variability associated with the slope statistics. Somewhat counter-intuitively, reducing the variability in the direction of the explanatory variable INCREASES the variability of the slope coefficients. This is because with a smaller range of the explanatory variable, there is less information on which to build the model. The hypothetical population even_newer_popdata, which has less variability in the x direction, is pre-loaded. A sketch of the formula behind both of these patterns follows the code below.

# Update the sampling to use even_newer_popdata
many_samples <- even_newer_popdata %>%
  rep_sample_n(size = 50, reps = 100)

# Update and rerun the plot; how does it change?
ggplot(many_samples, aes(x = explanatory, y = response, group = replicate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Set the x-axis limits from -17 to 17
  xlim(-17, 17)
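Both of the last two patterns follow from the classical formula for the slope's standard error, se(b1) = sigma / sqrt(sum((x - mean(x))^2)). The residual standard deviation sigma sits in the numerator, so less variability around the line shrinks the standard error; the spread of the explanatory variable sits in the denominator, so less variability in the x direction inflates it. A minimal sketch, assuming a population data frame like popdata:

# Sketch: the classical standard error of the slope, computed by hand
fit <- lm(response ~ explanatory, data = popdata)

# Residual standard deviation (the numerator)
sigma_hat <- summary(fit)$sigma

# Spread of the explanatory variable (the denominator)
x_spread <- sqrt(sum((popdata$explanatory - mean(popdata$explanatory))^2))

# This matches the std.error that tidy(fit) reports for the explanatory term
se_slope <- sigma_hat / x_spread

--------------------------------------------------------------------------------------------------------------------------------------------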