Generate bootstrap distribution for median

When building a bootstrap distribution for a single statistic, we first generate a series of bootstrap resamples, and then record the relevant statistic (in this case, the median) of each distribution.

In this exercise, the tidyverse packages, including ggplot2, have been loaded for you, as has infer.

Generate 15000 bootstrap distributions of rent in the manhattan data frame and record the median of each bootstrap distribution.
   Specify that rent is the response variable.
   Generate 15000 bootstrap replicates.
   Calculate the median of each distribution.
Plot a histogram of the bootstrap medians.
   Using rent_med_ci, plot stat on the x axis.
   Add a histogram layer with a binwidth of 50.

# Generate bootstrap distribution of medians
rent_med_ci <- manhattan %>%
  # Specify the variable of interest
  specify(response = rent) %>%
  # Generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>%
  # Calculate the median of each bootstrap sample
  calculate(stat = "median")

# View its structure
str(rent_med_ci)

# Plot the rent_med_ci statistic
ggplot(rent_med_ci, aes(x = stat)) +
  # Make it a histogram with a binwidth of 50
  geom_histogram(binwidth = 50)

-------------------------------------------------------------------------------------------------------------------------------------------

Calculate bootstrap interval using both methods

Using our bootstrap distribution from an earlier exercise, we can calculate bootstrap intervals for the median price of 1 bedroom apartments in Manhattan. Remember that we saved the bootstrap distribution as rent_med_ci.

1. Calculate a 95% bootstrap confidence interval using the percentile method.
   Summarize rent_med_ci's statistic with appropriate lower and upper quantiles.
2. Calculate the observed median rent.
   Summarize to calculate the median rent.
   Pull out the value.
3. Determine the critical value.
   Calculate the degrees of freedom in the dataset.
   Use the t-distribution to determine the critical value for a 95% confidence interval. qt() takes the percentile of the t-distribution you want to find and the degrees of freedom as arguments.
4. Complete the calculation of the 95% bootstrap confidence interval using the standard error method.
   Summarize to calculate the standard error of the bootstrap medians.
   Summarize again to calculate the lower and upper limits of the CI.
   Compare the interval to the one you calculated in step 1.

# Calculate bootstrap CI as lower and upper quantiles
rent_med_ci %>%
  summarize(
    l = quantile(stat, probs = 0.025),
    u = quantile(stat, probs = 0.975)
  )

rent_med_obs <- manhattan %>%
  # Calculate observed median rent
  summarize(median_rent = median(rent)) %>%
  # Pull out the value
  pull()

# See the result
rent_med_obs

# Calculate the degrees of freedom
degrees_of_freedom <- nrow(manhattan) - 1

# Determine the critical value
t_star <- qt(0.975, df = degrees_of_freedom)

# Calculate the CI using the std error method
rent_med_ci %>%
  # Calculate the std error of the statistic
  summarize(boot_se = sd(stat)) %>%
  # Calculate the lower and upper limits of the CI
  summarize(
    l = rent_med_obs - t_star * boot_se,
    u = rent_med_obs + t_star * boot_se
  )

-------------------------------------------------------------------------------------------------------------------------------------------

Doctor visits during pregnancy

The state of North Carolina released to the public a large data set containing information on births recorded in this state. This data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. The ncbirths dataset (which is already loaded for you) is a random sample of 1000 cases from this data set. Among other variables, the number of doctor's visits (visits) the mother had throughout the pregnancy is recorded.

Which of the following is false about the distribution of visits?
---> The number of doctor's visits for one of the mothers in the dataset is missing.

1. Filter ncbirths for rows where visits is not missing.
2. Generate 15000 bootstrap means of the number of visits.
   Specify visits as the variable of interest.
   Generate 15000 bootstrap replicates.
   Calculate the mean.
3. Calculate the 90% confidence interval via percentile method.
   Summarize to calculate appropriate quantiles of the bootstrap statistic.
4. Suppose now we're interested in the standard deviation of the number of doctor's visits throughout pregnancy.
   For the number of doctor's visits, generate 15000 bootstrap replicates and calculate their standard deviations.
   Calculate a 90% confidence interval for the bootstrap standard deviations using the percentile method.

# Filter for rows with non-missing visits
ncbirths_complete_visits <- ncbirths %>%
  filter(!is.na(visits))

# See the result
glimpse(ncbirths_complete_visits)

# Generate 15000 bootstrap means
visit_mean_ci <- ncbirths_complete_visits %>%
  # Specify visits as the response (bare column name, not a quoted string)
  specify(response = visits) %>%
  # Generate 15000 bootstrap replicates
  generate(reps = 15000, type = "bootstrap") %>%
  # Calculate the mean
  calculate(stat = "mean")

# Calculate the 90% CI via percentile method
visit_mean_ci %>%
  summarize(
    l = quantile(stat, probs = 0.05),
    u = quantile(stat, probs = 0.95)
  )

# Calculate 15000 bootstrap standard deviations of visits
visit_sd_ci <- ncbirths_complete_visits %>%
  specify(response = visits) %>%
  generate(reps = 15000, type = "bootstrap") %>%
  calculate(stat = "sd")

# See the result
visit_sd_ci

# Calculate the 90% CI via percentile method
visit_sd_ci %>%
  summarize(
    l = quantile(stat, probs = 0.05),
    u = quantile(stat, probs = 0.95)
  )

-------------------------------------------------------------------------------------------------------------------------------------------
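Bootstrap intervals without infer (sketch)

The exercises above lean on infer's specify/generate/calculate pipeline. As a rough cross-check of what that pipeline does, the same two intervals can be sketched in base R: resample with replacement, record the median of each resample, then apply the percentile and standard error methods. This is only an illustration under stated assumptions: the rent vector below is simulated and is not the manhattan data, and the names n_reps, boot_medians, percentile_ci and se_ci are made up for this sketch.

```r
# Base-R sketch of a bootstrap distribution for the median.
# ASSUMPTION: rent is simulated here; it is NOT the manhattan data frame.
set.seed(2024)
rent <- round(rlnorm(20, meanlog = log(2400), sdlog = 0.3))

n_reps <- 15000

# Each replicate: resample rent with replacement (same size as the
# original sample) and record the median of that resample
boot_medians <- replicate(n_reps, median(sample(rent, replace = TRUE)))

# Percentile method: middle 95% of the bootstrap medians
percentile_ci <- quantile(boot_medians, probs = c(0.025, 0.975))

# Standard error method: observed median +/- t* times the bootstrap SE
rent_med_obs <- median(rent)
boot_se <- sd(boot_medians)
t_star <- qt(0.975, df = length(rent) - 1)
se_ci <- c(
  l = rent_med_obs - t_star * boot_se,
  u = rent_med_obs + t_star * boot_se
)

# Compare the two intervals
percentile_ci
se_ci
```

The standard error interval is symmetric around the observed median by construction, while the percentile interval follows the shape of the bootstrap distribution, so the two need not agree exactly.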