t-statistic

Using the permuted datasets (recall that the randomization forces the null hypothesis to be true), investigate the distribution of the standardized slope statistics (the slope divided by its standard error). Note that the distribution of the standardized slope is well described by a t-distribution.

* Filter twins_perm for rows where term equals "Biological_perm".
* Calculate the degrees of freedom of twins as the number of its rows minus two.
* Using biological_perm, plot statistic.
* Add a histogram layer, mapping y to the special value ..density..
* Add a function stat layer for the probability density function of the t-distribution.
  * Set fun to dt.
  * Set args to a list with the element df equal to degrees_of_freedom.
  * Set color to "red".

> glimpse(twins)
Rows: 27
Columns: 3
$ Foster     <int> 82, 80, 88, 108, 116, 117, 132, 71, 75, 93, 95, 88, 111, 63…
$ Biological <int> 82, 90, 91, 115, 115, 129, 131, 78, 79, 82, 97, 100, 107, 6…
$ Social     <chr> "high", "high", "high", "high", "high", "high", "high", "mi…

# Look at the data
twins_perm

# A tibble: 1,000 × 6
# Groups:   replicate [500]
   replicate term            estimate std.error statistic    p.value
 1         1 (Intercept)      92.1       19.7       4.67  0.0000872
 2         1 Biological_perm   0.0311     0.204      0.152 0.880
 3         2 (Intercept)      84.9       19.6       4.33  0.000213
 4         2 Biological_perm   0.107      0.203      0.525 0.604
 5         3 (Intercept)      75.6       19.3       3.91  0.000626
 6         3 Biological_perm   0.205      0.200      1.02  0.316
 7         4 (Intercept)      54.3       17.9       3.03  0.00564
 8         4 Biological_perm   0.429      0.186      2.31  0.0294
 9         5 (Intercept)     120.        19.1       6.31  0.00000133
10         5 Biological_perm  -0.264      0.197     -1.34  0.193
# … with 990 more rows

# Filter for Biological_perm
biological_perm <- twins_perm %>%
  filter(term == "Biological_perm")

# Calculate degrees of freedom of twins
degrees_of_freedom <- nrow(twins) - 2

# Using biological_perm, plot statistic
ggplot(biological_perm, aes(x = statistic)) +
  # Add a histogram layer, with density on the y axis
  geom_histogram(aes(y = ..density..)) +
  # Add a t-distribution function stat, colored red
  stat_function(fun = dt, args = list(df = degrees_of_freedom), color = "red")

--------------------------------------------------------------------------------------------------------------------------------------------

Working with R-output (1)

The p-value given by the lm output is a two-sided p-value by default. In the twin study, it might seem more reasonable to follow the one-sided scientific hypothesis that the IQ scores of the twins are positively associated. Because the p-value is the probability of the observed data or more extreme, the two-sided p-value is twice as big as the one-sided result. That is, to get a one-sided p-value from the two-sided output in R, divide the p-value by two (a short sketch of this relationship follows the instructions).

The linear regression model of Foster vs. Biological, model, is provided in the script.

* Get the Biological coefficient from the model.
  * Tidy the model.
  * Filter for the term "Biological".
* Add a column, one_sided_p_value, of the one-sided p-value.
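Before the solution, a quick sketch of the halving rule (not part of the exercise; the t-statistic of 9.36 and the 25 degrees of freedom are taken from the tidy() output shown further below): the two-sided p-value reported by lm()/tidy() is twice the one-tailed area from pt(), so dividing it by two recovers the one-sided value.

# Hedged illustration of the one-sided / two-sided relationship
t_stat <- 9.36               # Biological t-statistic, taken from the output below
dof <- 27 - 2                # nrow(twins) - 2
one_sided <- pt(t_stat, df = dof, lower.tail = FALSE)
two_sided <- 2 * one_sided   # matches the reported p.value, up to rounding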
model <- lm(Foster ~ Biological, data = twins)

# Get the Biological model coefficient
biological_term <- model %>%
  # Tidy the model
  tidy() %>%
  # Filter for term equal to "Biological"
  filter(term == "Biological")

biological_term %>%
  # Add a column of one-sided p-values
  mutate(one_sided_p_value = p.value / 2)

--------------------------------------------------------------------------------------------------------------------------------------------

Working with R-output (2)

In thinking about the scientific research question, if IQ were caused only by genetics, then we would expect the slope of the line between the two sets of twins to be 1. Testing the hypothesized slope value of 1 can be done by making a new test statistic that evaluates how far the observed slope is from the hypothesized value of 1.

new_t = (slope - 1) / SE

If the hypothesis that the slope equals one is true, then the new test statistic has a t-distribution, which we can use to calculate a p-value. The biological term from the model is available as biological_term.

> biological_term
# A tibble: 1 × 5
  term       estimate std.error statistic       p.value
1 Biological    0.901    0.0963      9.36 0.00000000120

* Calculate the degrees of freedom of the twins dataset.
* Calculate the two-sided p-value for the alternative hypothesis that the true slope is different from 1. Build up the calculation in stages.
  * Calculate the test statistic as the slope estimate minus 1, all divided by the standard error.
  * Calculate the one-sided p-value of the test statistic using the cumulative distribution function of the t-distribution, pt(), at the test_statistic, with the degrees of freedom you just calculated.
  * Calculate the two-sided p-value as double the one-sided p-value.

# Calculate the degrees of freedom of twins
degrees_of_freedom <- nrow(twins) - 2

biological_term %>%
  mutate(
    # Calculate the test statistic
    test_statistic = (estimate - 1) / std.error,
    # Calculate its one-sided p-value
    one_sided_p_value_of_test_statistic = pt(test_statistic, df = degrees_of_freedom),
    # ... and its two-sided p-value
    two_sided_p_value_of_test_statistic = one_sided_p_value_of_test_statistic * 2
  )

# A tibble: 1 × 8
  term       estimate std.error statistic       p.value test_statistic
1 Biological    0.901    0.0963      9.36 0.00000000120          -1.02
  one_sided_p_value_of_test_statistic two_sided_p_value_of_test_statistic
1                               0.158                                0.316

The p-value of 0.316 suggests that we cannot reject the null hypothesis: the slope of the IQ line is not significantly different from 1.

--------------------------------------------------------------------------------------------------------------------------------------------

Comparing randomization inference and t-inference

When technical conditions (see the next chapter) hold, the randomization test and the t-distribution test should lead to equivalent conclusions. They will not give exactly the same answer because they are based on different methods, but they should give p-values and confidence intervals that are reasonably close (a t-based reference p-value is sketched after the instructions).

* Calculate the absolute value of the observed slope, obs_slope.
* Add a column to perm_slope of the absolute value of the slope in each permuted replicate. The slope column is called stat.
* In the call to summarize(), calculate the p-value as the proportion of absolute permuted estimates of slope that are at least as extreme as the absolute observed estimate of slope.
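For reference when comparing the two approaches (a sketch, not part of the exercise; it reuses the t-statistic of 9.36 and the 25 degrees of freedom from biological_term above), the t-based two-sided p-value for the null hypothesis of zero slope is:

# t-based two-sided p-value for H0: slope = 0, to compare with the permutation
# p-value computed below
2 * pt(-abs(9.36), df = 25)   # approximately 1.2e-09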
# The slope in the observed data and each permutation replicate
> obs_slope
[1] 0.901

> perm_slope
Response: Foster (integer)
Explanatory: Biological (integer)
# A tibble: 1,000 × 2
   replicate    stat
 1         1  0.0760
 2         2 -0.151
 3         3 -0.131
 4         4 -0.0677
 5         5  0.0914
 6         6  0.194
 7         7 -0.102
 8         8 -0.147
 9         9  0.0562
10        10  0.0451
# … with 990 more rows

# Calculate the absolute value of the observed slope
abs_obs_slope <- abs(obs_slope)

# Find the p-value
perm_slope %>%
  # Add a column for the absolute value of stat
  mutate(abs_perm_slope = abs(stat)) %>%
  summarize(
    # Calculate prop'n permuted values at least as extreme as observed
    p_value = mean(abs_perm_slope >= abs_obs_slope)
  )

--------------------------------------------------------------------------------------------------------------------------------------------

CI using t-theory

In previous courses, you have created confidence intervals with the formula statistic plus or minus some number of standard errors. With bootstrapping, we typically use two standard errors. With t-based theory, we use the specific t-multiplier.

Create a CI for the slope parameter using both the default tidy() call and mutate() to calculate the confidence interval bounds explicitly. Note that the two methods should give exactly the same CI values because they use the same computations.

alpha has been set to 0.05 and the degrees of freedom of the twins dataset is given to you.

* Calculate the confidence level as one minus alpha.
* Calculate the upper percentile cutoff from alpha.
* Calculate the critical value from the quantile function (inverse cumulative distribution function) of the t-distribution, qt(). Pass it the upper percentile cutoff and the degrees of freedom.
* Tidy the linear regression model, including a confidence interval by setting conf.int to TRUE, and setting the confidence level, conf.level, to confidence_level.
* Manually calculate the same confidence interval as the estimate plus or minus the critical value times the standard error.

alpha <- 0.05
degrees_of_freedom <- nrow(twins) - 2

# Calculate the confidence level
confidence_level <- 1 - alpha

# Calculate the upper percentile cutoff
p_upper <- 1 - alpha / 2

# Find the critical value from the t-distribution
critical_value <- qt(p_upper, df = degrees_of_freedom)

tidied_model <- lm(Foster ~ Biological, data = twins) %>%
  # Tidy the model, with a confidence interval
  tidy(conf.int = TRUE, conf.level = confidence_level)

tidied_model %>%
  # Manually calculate the same interval
  mutate(
    lower = estimate - critical_value * std.error,
    upper = estimate + critical_value * std.error
  )

--------------------------------------------------------------------------------------------------------------------------------------------

Comparing randomization CIs and t-based CIs

As with hypothesis testing, if technical conditions hold (they are discussed more in the next chapter), the CI created for the slope parameter in the t-distribution setting should be in line with the CI created using bootstrapping.

Create a CI for the slope parameter and compare it to the one created using the bootstrap percentile interval from the previous chapter. Note that the bootstrap and t-intervals will not be exactly the same because they use different computational steps to arrive at the interval estimate. (A sketch of how such a bootstrap distribution is typically generated follows the instructions.)

* Calculate the confidence level and the lower and upper percentile cutoffs from alpha.
* Tidy the linear regression model, including a confidence interval, and set the confidence level to confidence_level.
* Summarize the bootstrap replicates to calculate a confidence interval using quantiles of stat. Use p_lower and p_upper for the quantiles.
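boot_slope is supplied by the exercise; purely for context, here is a hedged sketch of how a bootstrap distribution of slopes like it is typically generated with the infer package (the exact call used in the earlier chapter may differ).

# Hypothetical reconstruction of a bootstrap slope distribution (illustration only)
library(infer)
boot_slope_sketch <- twins %>%
  specify(Foster ~ Biological) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope")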
> boot_slope
Response: Foster (integer)
Explanatory: Biological (integer)
# A tibble: 1,000 × 2
   replicate  stat
 1         1 0.630
 2         2 0.980
 3         3 1.01
 4         4 0.972
 5         5 0.775
 6         6 0.848
 7         7 0.915
 8         8 0.816
 9         9 0.843
10        10 0.810
# … with 990 more rows

alpha <- 0.05

# Calculate the confidence level
confidence_level <- 1 - alpha

# Calculate lower percentile cutoff
p_lower <- alpha / 2

# ... and the upper one
p_upper <- 1 - alpha / 2

# Tidy the model, including a confidence interval
tidied_model <- lm(Foster ~ Biological, data = twins) %>%
  tidy(conf.int = TRUE, conf.level = confidence_level)

# Recall the t-confidence interval
tidied_model %>%
  filter(term == "Biological") %>%
  select(conf.low, conf.high)

# Create the bootstrap confidence interval
boot_slope %>%
  summarize(
    lower = quantile(stat, p_lower),
    upper = quantile(stat, p_upper)
  )

The confidence interval is almost the same whether it is calculated from bootstrap quantiles or from theoretical t-distribution quantiles, which suggests that the theoretical assumptions hold up well.

--------------------------------------------------------------------------------------------------------------------------------------------

Confidence intervals for the average response at specific values

The previous two exercises gave CIs for the slope parameter. That is, based on the observed data, you provided an interval estimate of the plausible values for the true slope parameter in the population. Recall that the number 1 was in the CI for the slope, meaning that 1 cannot be ruled out as a possible value for the true slope between Biological and Foster twins. There is no evidence to claim that there is a difference, on average, between the IQ scores of the two twins in any given pair.

When working with a linear regression model, you might also want to know the plausible values for the expected value of the response at a particular explanatory location. That is, what would you expect the IQ of a Foster twin to be, given a Biological twin's IQ of 100? (A direct point prediction at this value is sketched after the augment() output below.)

* Call augment() on model to see the observation-level data in the model.
* Create a new twins dataset of Biological IQs to make predictions of Foster IQs on. Create a data.frame with a column named Biological, taking values as a sequence from 70 to 130 in steps of 15.
* Augment the model with the new dataset. Set augment()'s newdata argument to new_twins.
* Calculate a confidence interval for the predicted IQs of the foster twin, based on the augmented model. This is the predicted value, .fitted, plus or minus the critical value times the standard error of the prediction, .se.fit.

model <- lm(Foster ~ Biological, data = twins)

# Get observation-level values from the model
augment(model)

# A tibble: 27 × 9
   Foster Biological .fitted .se.fit .resid   .hat .sigma  .cooksd .std.resid
 1     82         82    83.1    1.96  -1.13 0.0645   7.89 0.000781     -0.151
 2     80         90    90.3    1.57 -10.3  0.0414   7.59 0.0403       -1.37
 3     88         91    91.2    1.54  -3.24 0.0399   7.86 0.00380      -0.428
 4    108        115   113.     2.41  -4.87 0.0973   7.82 0.0237       -0.664
 5    116        115   113.     2.41   3.13 0.0973   7.86 0.00978       0.426
 6    117        129   125.     3.57  -8.49 0.213    7.64 0.208        -1.24
 7    132        131   127.     3.75   4.70 0.235    7.81 0.0744        0.696
 8     71         78    79.5    2.23  -8.52 0.0835   7.68 0.0604       -1.15
 9     75         79    80.4    2.16  -5.42 0.0783   7.80 0.0227       -0.731
10     93         82    83.1    1.96   9.87 0.0645   7.61 0.0601        1.32
# … with 17 more rows
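As an aside (a sketch using base R's predict(); the exercise builds the same numbers with augment()), the motivating question about a Biological IQ of 100 can be answered directly, and the result should agree with the .fitted value of roughly 99.4 shown below.

# Point prediction of the mean Foster IQ at a Biological IQ of 100
predict(model, newdata = data.frame(Biological = 100))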
# Create a dataframe of new observations
new_twins <- data.frame(Biological = seq(from = 70, to = 130, by = 15))

# Augment the model with the new dataset
augmented_model <- augment(model, newdata = new_twins)

# See the result
augmented_model

# A tibble: 5 × 3
  Biological .fitted .se.fit
1         70    72.3    2.85
2         85    85.8    1.79
3        100    99.4    1.55
4        115   113.     2.41
5        130   126.     3.66

augmented_model %>%
  # Calculate a confidence interval on the predicted values
  mutate(
    lower_mean_prediction = .fitted - critical_value * .se.fit,
    upper_mean_prediction = .fitted + critical_value * .se.fit
  )

# A tibble: 5 × 5
  Biological .fitted .se.fit lower_mean_prediction upper_mean_prediction
1         70    72.3    2.85                  66.4                  78.2
2         85    85.8    1.79                  82.1                  89.5
3        100    99.4    1.55                  96.1                 103.
4        115   113.     2.41                 108.                  118.
5        130   126.     3.66                 119.                  134.

--------------------------------------------------------------------------------------------------------------------------------------------

Confidence intervals for the average response for all observations

The confidence interval for the average response can be computed for all observations in the dataset. Using augment() directly on the twins dataset gives predictions and standard errors for the Foster twin based on all the Biological observations. Note that the calculation of the regression line is more stable at the center, so predictions for the extreme values are more variable than predictions in the middle of the range of explanatory IQs (a numeric check of this appears after the plot code below).

The foster twin IQ predictions that you calculated last time are provided as predictions. These predictions are shown in a plot using geom_smooth(). Manually create what geom_smooth() does, using predictions. Provide the aesthetics and data to each geom.

* Add a point layer of Foster vs. Biological, using the data = twins dataset.
* Add a line layer of .fitted vs. Biological, using the data = predictions dataset. Color the line "blue".
* Add a ribbon layer with x mapped to Biological, ymin mapped to lower_mean_prediction, and ymax mapped to upper_mean_prediction. Use the data = predictions dataset and set the transparency, alpha, to 0.2.

# This plot is shown
ggplot(twins, aes(x = Biological, y = Foster)) +
  geom_point() +
  geom_smooth(method = "lm")

ggplot() +
  # Add a point layer of Foster vs. Biological, using twins
  geom_point(aes(x = Biological, y = Foster), data = twins) +
  # Add a line layer of .fitted vs. Biological, using predictions, colored blue
  geom_line(aes(x = Biological, y = .fitted), data = predictions, color = "blue") +
  # Add a ribbon layer of lower_mean_prediction to upper_mean_prediction vs. Biological,
  # using predictions, transparency of 0.2
  geom_ribbon(aes(x = Biological, ymin = lower_mean_prediction, ymax = upper_mean_prediction),
              data = predictions, alpha = 0.2)
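To check the "more stable at the center" claim numerically (a sketch, assuming the broom and dplyr functions used elsewhere in this chapter are loaded):

# The standard error of the fitted mean (.se.fit) grows with distance from the
# center of the Biological IQs, so the confidence band is narrowest mid-range
augment(model) %>%
  mutate(dist_from_center = abs(Biological - mean(Biological))) %>%
  arrange(dist_from_center) %>%
  select(Biological, .se.fit, dist_from_center)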
--------------------------------------------------------------------------------------------------------------------------------------------

Prediction intervals for the individual response

Along with an interval estimate for the expected value of the response, it is often desirable to have an interval estimate for the actual individual responses. The formulation for the prediction is the same, but the predicted points are more variable around the line, so the standard error is calculated to be larger.

As with the interval around the expected average values, the interval for predicted individual values is narrower in the middle than at the extremes because the calculation of the regression line is more stable at the center. Note that the intervals for the average responses are much narrower than the intervals for the individual responses.

You have already seen tidy(), to pull out coefficient-level information from a model, and augment() for observation-level information. glance() completes the triumvirate, giving you model-level information.

The linear regression is provided as model and the predictions from the previous exercise are given as predictions.

* Find the natural variability of the points around the prediction line.
  * Use glance() to get the model-level information from model.
  * Pull out the sigma element.
* Calculate the standard error of the predictions as the square root of the sum of twins_sigma squared and .se.fit squared. critical_value has already been loaded into your workspace.
* Calculate the prediction interval for the foster twins' IQs as the model prediction, .fitted, plus or minus the critical value times the standard error of the predictions.

twins_sigma <- model %>%
  # Get model-level information
  glance() %>%
  # Pull out sigma
  pull(sigma)

predictions2 <- predictions %>%
  # Calculate the std err of the predictions
  mutate(std_err_of_predictions = sqrt(twins_sigma ^ 2 + .se.fit ^ 2))

# See the result
predictions2

# A tibble: 5 × 6
  Biological .fitted .se.fit lower_mean_predi… upper_mean_pred… std_err_of_pred…
1         70    72.3    2.85              66.4             78.2             67.9
2         85    85.8    1.79              82.1             89.5             62.9
3        100    99.4    1.55              96.1            103.              62.2
4        115   113.     2.41             108.             118.              65.6
5        130   126.     3.66             119.             134.              73.1

predictions3 <- predictions2 %>%
  # Calculate the prediction intervals
  mutate(
    lower_response_prediction = .fitted - critical_value * std_err_of_predictions,
    upper_response_prediction = .fitted + critical_value * std_err_of_predictions
  )

# See the result
predictions3

# Update the plot
ggplot() +
  geom_point(aes(x = Biological, y = Foster), data = twins) +
  geom_smooth(aes(x = Biological, y = Foster), data = twins, method = "lm") +
  # Add a ribbon layer
  geom_ribbon(
    aes(x = Biological,
        # ... of lower_response_prediction to upper_response_prediction vs. Biological
        ymin = lower_response_prediction,
        ymax = upper_response_prediction),
    # ... using the predictions3 dataset
    data = predictions3,
    # ... with transparency set to 0.2
    alpha = 0.2,
    # ... and fill color red
    fill = "red"
  )
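As a final cross-check (hedged: this assumes new_twins from the earlier exercise is still in the workspace, and it relies on predict() applying the same t-based formula at the default 95% level), base R can produce prediction intervals directly; they should match the manually computed lower_response_prediction and upper_response_prediction columns up to rounding.

# Prediction intervals from predict(), for comparison with the manual calculation
predict(model, newdata = new_twins, interval = "prediction", level = 0.95)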