Null sampling distribution of the slope

In the previous chapter, you investigated the sampling distribution of the slope from a population where the slope was non-zero. Typically, however, to do inference you will need to know the sampling distribution of the slope under the hypothesis that there is no relationship between the explanatory and response variables. Additionally, in most situations you don't know the population from which the data came, so the null sampling distribution must be derived from the original dataset alone.

In the mid-20th century, a study tracked down identical twins who were separated at birth: one child was raised in the home of their biological parents and the other in a foster home. In an attempt to answer the question of whether intelligence is the result of nature or nurture, both children were given IQ tests. The resulting data gives the IQs of the foster twins (Foster, the response variable) and the IQs of the biological twins (Biological, the explanatory variable).

In this exercise you'll use the pull() function. This function takes a data frame and returns a selected column as a vector (similar to $).

* Run a linear regression of Foster vs. Biological on the twins dataset.
* Tidy the result.
* Filter for rows where term equals "Biological".
* Use pull() to pull out the estimate column.

Simulate 10 slopes.

* Use specify() to specify Foster vs. Biological (the same formula as for a linear regression).
* Use hypothesize() to set a null hypothesis of "independence".
* Use generate() to generate 10 replicates (reps) of type "permute".
* Use calculate() to calculate the summary statistic "slope".

Having a range of null slope estimates lets you measure how much the slope varies by chance alone, which gives a benchmark against which to compare the observed slope.

> glimpse(twins)
Rows: 27
Columns: 3
$ Foster     <dbl> 82, 80, 88, 108, 116, 117, 132, 71, 75, 93, 95, 88, 111, 63…
$ Biological <dbl> 82, 90, 91, 115, 115, 129, 131, 78, 79, 82, 97, 100, 107, 6…
$ Social     <chr> "high", "high", "high", "high", "high", "high", "high", "mi…

library(dplyr)
library(broom)
library(infer)

set.seed(4747)

# Calculate the observed slope
# Run a lin. reg. of Foster vs. Biological on the twins data
obs_slope <- lm(Foster ~ Biological, data = twins) %>%
  # Tidy the result
  tidy() %>%
  # Filter for rows where term equals Biological
  filter(term == "Biological") %>%
  # Pull out the estimate column
  pull(estimate)

# See the result
obs_slope

# Simulate 10 slopes with a permuted dataset
perm_slope <- twins %>%
  # Specify Foster vs. Biological
  specify(Foster ~ Biological) %>%
  # Use a null hypothesis of independence
  hypothesize(null = "independence") %>%
  # Generate 10 permutation replicates
  generate(reps = 10, type = "permute") %>%
  # Calculate the slope statistic
  calculate(stat = "slope")

# See the result
perm_slope
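For intuition, a single "permute" replicate under the independence null is equivalent to shuffling the response values so any real Foster-Biological link is broken, then refitting the regression. Here is a minimal hand-rolled sketch of that idea (twins_shuffled is a hypothetical name introduced for illustration; it shows the concept, not infer's literal internals):

# One hand-rolled permutation, for intuition only:
# shuffle the response, refit the model, extract the slope
twins_shuffled <- twins %>%
  mutate(Foster = sample(Foster))

lm(Foster ~ Biological, data = twins_shuffled) %>%
  tidy() %>%
  filter(term == "Biological") %>%
  pull(estimate)

Repeating this shuffle-and-refit step many times is what generate(type = "permute") followed by calculate(stat = "slope") automates for you.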
--------------------------------------------------------------------------------------------------------------------------------------------

SE of the slope

The previous exercise generated 10 different slopes under a model of no (i.e., null) relationship between the explanatory and response variables. Now repeat the null slope calculation 500 times to derive a null sampling distribution for the slope coefficient. The null sampling distribution will be used as a benchmark against which to compare the original data. Then you'll calculate the mean and standard deviation of the null sampling distribution for the slope.

Simulate 500 slopes.

* Use specify() to specify Foster vs. Biological (the same formula as for a linear regression).
* Use hypothesize() to set a null hypothesis of "independence".
* Use generate() to generate 500 replicates (reps) of type "permute".
* Use calculate() to calculate the summary statistic "slope".
* Using the perm_slope data, plot the stat you calculated with a density geom.
* Use ungroup() without arguments to ungroup the dataset.
* Use summarize() to calculate summary statistics.
  * Calculate mean_stat as the mean of stat.
  * Calculate std_err_stat as the standard error of stat, using sd().

# Simulate 500 slopes with a permuted dataset
perm_slope <- twins %>%
  # Specify Foster vs. Biological
  specify(Foster ~ Biological) %>%
  # Use a null hypothesis of independence
  hypothesize(null = "independence") %>%
  # Generate 500 permutation replicates
  generate(reps = 500, type = "permute") %>%
  # Calculate the slope statistic
  calculate(stat = "slope")

library(ggplot2)

# Using perm_slope, plot stat
ggplot(perm_slope, aes(x = stat)) +
  # Add a density layer
  geom_density()

perm_slope %>%
  # Ungroup the dataset (this seems unnecessary to me; nothing is grouped)
  ungroup() %>%
  # Calculate summary statistics
  summarize(
    # Mean of stat
    mean_stat = mean(stat),
    # Std error of stat
    std_err_stat = sd(stat)
  )

--------------------------------------------------------------------------------------------------------------------------------------------

p-value

Now that you have created the null sampling distribution, you can use it to find the p-value associated with the original slope statistic from the twins data. Although you might first consider this a one-sided research question, use the absolute value function here to practice performing a two-sided test on a slope coefficient.

You can calculate the proportion of TRUE values in a logical vector using mean(). For example, given a numeric vector x, the proportion of cases where x is greater than or equal to 10 can be calculated as mean(x >= 10).

* Run a linear regression of Foster vs. Biological on the twins dataset.
* Tidy the result.
* Filter for rows where term equals "Biological".
* Use pull() to pull out the estimate column.
* Use abs() to calculate the absolute value of the slope.
* Mutate perm_slope to add a column, abs_perm_slope, equal to the absolute value of stat.
* Summarize to calculate p_value, equal to the proportion of rows where abs_perm_slope is greater than or equal to abs_obs_slope.

# Run a lin. reg. of Foster vs. Biological on twins
abs_obs_slope <- lm(Foster ~ Biological, data = twins) %>%
  # Tidy the result
  tidy() %>%
  # Filter for rows where term equals Biological
  filter(term == "Biological") %>%
  # Pull out the estimate
  pull(estimate) %>%
  # Take the absolute value
  abs()

# Compute the p-value
perm_slope %>%
  # Add a column of the absolute value of the slope
  mutate(abs_perm_slope = abs(stat)) %>%
  # Calculate a summary statistic
  summarize(
    # Prop. of cases where abs. permuted slope >= abs. observed slope
    p_value = mean(abs_perm_slope >= abs_obs_slope)
  )

# A tibble: 1 × 1
  p_value
    <dbl>
1       0

What can we conclude based on the p-value associated with the twins data? If there were no association between foster and biological twin IQ (no nature) in the population, we would be extremely unlikely to have collected a sample of data like the one we did.
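As a cross-check (assuming the version of infer in use provides get_p_value(), and that obs_slope from the first exercise is still in the workspace), the same two-sided permutation p-value can be computed directly:

# Cross-check: infer's helper computes the two-sided p-value from the
# permuted slopes and the observed slope. With only 500 permutations,
# a reported p-value of 0 really means "smaller than about 1/500".
perm_slope %>%
  get_p_value(obs_stat = obs_slope, direction = "two-sided")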
--------------------------------------------------------------------------------------------------------------------------------------------

Bootstrapping the data

Using the infer package with type = "bootstrap", you can repeatedly resample the dataset to estimate the sampling distribution and standard error of the slope coefficient. The sampling distribution will then allow you to find a confidence interval for the underlying population slope directly.

Use the infer steps to bootstrap the twins data 1000 times. You don't need hypothesize() because now you're creating confidence intervals, not testing hypotheses!

* Specify Foster vs. Biological.
* Generate 1000 replicates by bootstrapping.
* Calculate the slope statistic.

# Set the seed for reproducibility
set.seed(4747)

# Calculate 1000 bootstrapped slopes
boot_slope <- twins %>%
  # Specify Foster vs. Biological
  specify(Foster ~ Biological) %>%
  # Generate 1000 bootstrap replicates
  generate(reps = 1000, type = "bootstrap") %>%
  # Calculate the slope statistic
  calculate(stat = "slope")

# See the result
boot_slope

--------------------------------------------------------------------------------------------------------------------------------------------

SE method - bootstrap CI for slope

The twins study was used to weigh in on the question of whether IQ is a result of nature (your genes) or nurture (your environment). If IQ were purely a result of nature, what slope would you expect to see in your linear model?

Recall that one way to create a confidence interval for a statistic is to calculate the statistic repeatedly under different bootstrap samples and then find the standard deviation of the bootstrapped statistics. The twins data is already loaded in your workspace.

* As in the previous exercise, use the infer steps on the twins data to specify a Foster vs. Biological model, generate 1000 bootstrap replicates, and calculate the slope statistic.
* Calculate the confidence interval of stat as the mean plus or minus two standard deviations.

If the statistic has a normal distribution, two standard deviations on either side of the mean gives a rough 95% confidence interval for its value.

set.seed(4747)

# Calculate the slope statistic
# from 1000 bootstrap replicates of
# the Foster vs. Biological model
# of the twins dataset
boot_slope <- twins %>%
  specify(Foster ~ Biological) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope")

# Create a confidence interval of stat,
# 2 std devs each side of the mean
boot_slope %>%
  summarize(
    lower = mean(stat) - 2 * sd(stat),
    upper = mean(stat) + 2 * sd(stat)
  )
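As a hedged cross-check (assuming a version of infer that provides get_confidence_interval(), and that obs_slope from the earlier exercise is still available), infer's SE-method helper builds a similar interval. It centres the interval on the observed slope and uses 1.96 standard errors rather than 2, so the endpoints will differ slightly from the summarize() version above:

# Cross-check with infer's built-in SE-method interval
boot_slope %>%
  get_confidence_interval(level = 0.95, type = "se", point_estimate = obs_slope)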
--------------------------------------------------------------------------------------------------------------------------------------------

Percentile method - bootstrap CI for slope

Alternatively, a CI for the slope can be created using the percentiles of the distribution of the bootstrapped slope statistics.

Recall that a CI is created in such a way that, over a lifetime of analyses, its coverage rate is (1 - alpha) * 100%. If you always set alpha = 0.05, then your 95% confidence intervals will capture the parameter of interest (over your lifetime) 95% of the time. Of the 5% of the time when an interval misses the parameter, typically the interval is too high half the time (2.5%) and too low the other half (2.5%).

The bootstrapped estimates of the slope, boot_slope, are loaded in your workspace.

* Set alpha to 0.05 (although for your own work, feel free to use a different confidence level).
* Calculate the percentile cutoffs needed to create the confidence interval.
  * The lower percentile cutoff is at half of alpha.
  * The upper percentile cutoff is at one minus half of alpha.
* Create the confidence interval of stat using quantile() and the percentile cutoffs. Name the interval ends lower and upper.

Using quantile() to calculate the confidence interval makes no assumptions about the shape of the distribution.

# Set alpha
alpha <- 0.05

# Set the lower percentile cutoff
p_lower <- alpha / 2

# Set the upper percentile cutoff
p_upper <- 1 - p_lower

# Create a confidence interval of stat using quantiles
boot_slope %>%
  summarize(
    lower = quantile(stat, p_lower),
    upper = quantile(stat, p_upper)
  )
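As a final cross-check (again assuming infer's get_confidence_interval() is available), the percentile interval can also be requested directly and should closely match the quantile() version above:

# Cross-check: infer's percentile-method interval
boot_slope %>%
  get_confidence_interval(level = 1 - alpha, type = "percentile")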