Generate bootstrap distribution for median

When building a bootstrap distribution for a single statistic, we first generate a series of bootstrap resamples and then record the relevant statistic (in this case, the median) of each resample.
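
The resampling idea can be sketched in base R without infer. This is a minimal illustration under stated assumptions: rent_sample is simulated data standing in for the real rents, and bootstrap_stat() is a hypothetical helper, not a function from any package.

```r
# Simulated rents (hypothetical values; the real data live in `manhattan`)
set.seed(42)
rent_sample <- round(rlnorm(20, meanlog = log(2400), sdlog = 0.3))

# Hypothetical helper: draw `reps` resamples of x with replacement,
# each the same size as x, and record `stat_fun` of every resample
bootstrap_stat <- function(x, stat_fun, reps = 1000) {
  replicate(reps, stat_fun(sample(x, size = length(x), replace = TRUE)))
}

boot_medians <- bootstrap_stat(rent_sample, median, reps = 1000)

# One median per resample
length(boot_medians)
summary(boot_medians)
```

The vector boot_medians plays the same role as the stat column that calculate() returns in the infer pipeline.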

In this exercise, the tidyverse packages, including ggplot2, have been loaded for you, as has infer.

Generate 15000 bootstrap resamples of rent in the manhattan data frame and record the median of each resample.

    Specify that rent is the response variable.
    Generate 15000 bootstrap replicates.
    Calculate the median of each resample.
    
Plot a histogram of the bootstrap medians.

    Using rent_med_ci, plot stat on the x axis.
    Add a histogram layer with a binwidth of 50.


# Generate bootstrap distribution of medians
rent_med_ci <- manhattan %>%
  # Specify the variable of interest
  specify(response = rent) %>%  
  # Generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>% 
  # Calculate the median of each bootstrap sample
  calculate(stat = "median")
  
# View its structure
str(rent_med_ci)
 
# Plot the rent_med_ci statistic
ggplot(rent_med_ci, aes(x = stat)) +
  # Make it a histogram with a binwidth of 50
  geom_histogram(binwidth = 50)

-------------------------------------------------------------------------------------------------------------------------------------------

Calculate bootstrap interval using both methods

Using our bootstrap distribution from an earlier exercise, we can calculate bootstrap intervals for the median price of one-bedroom apartments in Manhattan. Remember that we saved the bootstrap distribution as rent_med_ci.

1. Calculate a 95% bootstrap confidence interval using the percentile method.

    Summarize rent_med_ci's statistic with appropriate lower and upper quantiles.
    
2. Calculate the observed median rent.

    Summarize to calculate the median rent.
    Pull out the value.

3. Determine the critical value.

    Calculate the degrees of freedom in the dataset (the number of rows minus 1).
    Use the t-distribution to determine the critical value for a 95% confidence interval. qt() takes a probability (0.975 for a two-sided 95% interval) and the degrees of freedom as arguments, and returns the corresponding quantile of the t-distribution.
    
4. Complete the calculation of the 95% bootstrap confidence interval using the standard error method.

    Summarize to calculate the standard error of the bootstrap medians.
    Summarize again to calculate the lower and upper limits of the CI.
    Compare the interval to the one you calculated in step 1.
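
To get a feel for the critical value in step 3, the sketch below evaluates qt() for a few degrees of freedom. The df values are arbitrary illustrations and are not taken from the manhattan data.

```r
# Two-sided 95% interval: 2.5% in each tail, so ask for the 0.975 quantile
conf_level <- 0.95
upper_prob <- 1 - (1 - conf_level) / 2

# Arbitrary example degrees of freedom (not from the manhattan data)
example_df <- c(5, 19, 100, 1000)
t_stars <- qt(upper_prob, df = example_df)

# Critical values shrink toward the normal quantile as df grows
data.frame(df = example_df, t_star = round(t_stars, 3))
qnorm(upper_prob)
```

With few observations the t critical value is noticeably larger than the normal one, which widens the standard error interval.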


# Calculate bootstrap CI as lower and upper quantiles
rent_med_ci %>%
  summarize(
    l = quantile(stat, probs = 0.025),
    u = quantile(stat, probs = 0.975)
  )

rent_med_obs <- manhattan %>%
  # Calculate observed median rent
  summarize(median_rent = median(rent)) %>%
  # Pull out the value
  pull(median_rent)
  
# See the result
rent_med_obs
  
# Calculate the degrees of freedom
degrees_of_freedom <- nrow(manhattan) - 1  
  
# Determine the critical value
t_star <- qt(0.975, df = degrees_of_freedom)

# Calculate the CI using the std error method
rent_med_ci %>%
  # Calculate the std error of the statistic
  summarize(boot_se = sd(stat)) %>%
  # Calculate the lower and upper limits of the CI
  summarize(
    l = rent_med_obs - t_star * boot_se,
    u = rent_med_obs + t_star * boot_se
  )
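
For comparison, both interval methods can be written as plain functions of a vector of bootstrap statistics. This is a sketch under assumptions: the data are simulated, and percentile_ci() and se_ci() are hypothetical helper names, not functions from infer.

```r
set.seed(1)
x <- round(rlnorm(20, meanlog = log(2400), sdlog = 0.3))  # simulated rents
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))

# Percentile method: the middle `level` share of the bootstrap distribution
percentile_ci <- function(boot_stats, level = 0.95) {
  alpha <- 1 - level
  unname(quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2)))
}

# Standard error method: observed statistic +/- t* times the bootstrap SE
se_ci <- function(boot_stats, observed, n, level = 0.95) {
  t_star <- qt(1 - (1 - level) / 2, df = n - 1)
  observed + c(-1, 1) * t_star * sd(boot_stats)
}

ci_percentile <- percentile_ci(boot_medians)
ci_se <- se_ci(boot_medians, observed = median(x), n = length(x))

rbind(percentile = ci_percentile, std_error = ci_se)
```

The standard error interval is symmetric about the observed median by construction, while the percentile interval follows the shape of the bootstrap distribution and need not be symmetric.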

-------------------------------------------------------------------------------------------------------------------------------------------

Doctor visits during pregnancy

The state of North Carolina released to the public a large dataset containing information on births recorded in the state. This dataset has been of interest to medical researchers who are studying the relationship between the habits and practices of expectant mothers and the birth of their children. The ncbirths dataset (which is already loaded for you) is a random sample of 1000 cases from this dataset. Among other variables, the number of doctor's visits (visits) the mother had throughout the pregnancy is recorded.

Which of the following is false about the distribution of visits?
---> The number of doctor's visits for one of the mothers in the dataset is missing.

1. Filter ncbirths for rows where visits is not missing.

2. Generate 15000 bootstrap means of the number of visits.

    Specify visits as the variable of interest.
    Generate 15000 bootstrap replicates.
    Calculate the mean.

3. Calculate the 90% confidence interval via percentile method.

    Summarize to calculate appropriate quantiles of the bootstrap statistic.

4. Suppose now we're interested in the standard deviation of the number of doctor's visits throughout pregnancy.

    For the number of doctor's visits, generate 15000 bootstrap replicates and calculate the standard deviation of each.
    Calculate a 90% confidence interval for the standard deviation using the percentile method.

# Filter for rows with non-missing visits
ncbirths_complete_visits <- ncbirths %>%
  filter(!is.na(visits))
  
# See the result
glimpse(ncbirths_complete_visits)

# Generate 15000 bootstrap means
visit_mean_ci <- ncbirths_complete_visits %>%
  # Specify visits as the response
  specify(response = visits) %>%
  # Generate 15000 bootstrap replicates
  generate(reps = 15000, type = "bootstrap") %>%
  # Calculate the mean
  calculate(stat = "mean")

# Calculate the 90% CI via percentile method
visit_mean_ci %>%
  summarize(
    l = quantile(stat, probs = 0.05),
    u = quantile(stat, probs = 0.95)
  )

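# Generate 15000 bootstrap standard deviations of visits
visit_sd_ci <- ncbirths_complete_visits %>%
  # Specify visits as the response
  specify(response = visits) %>%
  # Generate 15000 bootstrap replicates
  generate(reps = 15000, type = "bootstrap") %>%
  # Calculate the standard deviation
  calculate(stat = "sd")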
  
# See the result
visit_sd_ci

# Calculate the 90% CI via percentile method
visit_sd_ci %>%
  summarize(
    l = quantile(stat, probs = 0.05),
    u = quantile(stat, probs = 0.95)
  )
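
The reason for filtering first is also visible in base R: by default, mean() and sd() return NA when any input is missing. Below is a small sketch with simulated visit counts (hypothetical values, not ncbirths); boot_ci_90() is a hypothetical helper name.

```r
set.seed(7)
visits_raw <- c(rpois(50, lambda = 12), NA, NA)  # simulated counts with 2 NAs

# Without filtering, the statistic itself is missing
mean(visits_raw)

# Drop missing values before resampling
visits_clean <- visits_raw[!is.na(visits_raw)]

# Bootstrap any statistic and take the middle 90% of the results
boot_ci_90 <- function(x, stat_fun, reps = 2000) {
  boot_stats <- replicate(reps, stat_fun(sample(x, replace = TRUE)))
  unname(quantile(boot_stats, probs = c(0.05, 0.95)))
}

mean_ci <- boot_ci_90(visits_clean, mean)
sd_ci <- boot_ci_90(visits_clean, sd)
```

Passing the statistic as a function argument mirrors how only the stat argument of calculate() changes between the mean and the standard deviation pipelines.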

-------------------------------------------------------------------------------------------------------------------------------------------