Null sampling distribution of the slope

In the previous chapter, you investigated the sampling distribution of the slope from a population where the slope was non-zero. Typically, however, to do inference you will need to know the sampling distribution of the slope under the hypothesis that there is no relationship between the explanatory and response variables. Additionally, in most situations you don't know the population from which the data came, so the null sampling distribution must be derived from the original dataset alone.

In the mid-20th century, a study tracked down identical twins who were separated at birth: one child was raised in the home of their biological parents and the other in a foster home. In an attempt to answer the question of whether intelligence is the result of nature or nurture, both children were given IQ tests. The resulting data gives the IQs of the foster twins (Foster, the response variable) and the IQs of the biological twins (Biological, the explanatory variable).

In this exercise you'll use the pull() function. This function takes a data frame and returns a selected column as a vector (similar to $).

* Run a linear regression of Foster vs. Biological on the twins dataset.
* Tidy the result.
* Filter for rows where term equals "Biological".
* Use pull() to pull out the estimate column.

Simulate 10 slopes.

* Use specify() to specify Foster vs. Biological (the same formula as for a linear regression).
* Use hypothesize() to set a null hypothesis of "independence".
* Use generate() to generate 10 replicates (reps) of type "permute".
* Use calculate() to calculate the summary statistic "slope".

Having a range of null slope estimates lets you measure how much the slope varies by chance alone, which gives a benchmark against which to compare the observed slope.

> glimpse(twins)
Rows: 27
Columns: 3
$ Foster     <dbl> 82, 80, 88, 108, 116, 117, 132, 71, 75, 93, 95, 88, 111, 63…
$ Biological <dbl> 82, 90, 91, 115, 115, 129, 131, 78, 79, 82, 97, 100, 107, 6…
$ Social     <chr> "high", "high", "high", "high", "high", "high", "high", "mi…

library(dplyr)
library(broom)
library(infer)

set.seed(4747)

# Calculate the observed slope
# Run a lin. reg. of Foster vs. Biological on the twins data
obs_slope <- lm(Foster ~ Biological, data = twins) %>%
  # Tidy the result
  tidy() %>%
  # Filter for rows where term equals Biological
  filter(term == "Biological") %>%
  # Pull out the estimate column
  pull(estimate)

# See the result
obs_slope

# Simulate 10 slopes with a permuted dataset
perm_slope <- twins %>%
  # Specify Foster vs. Biological
  specify(Foster ~ Biological) %>%
  # Use a null hypothesis of independence
  hypothesize(null = "independence") %>%
  # Generate 10 permutation replicates
  generate(reps = 10, type = "permute") %>%
  # Calculate the slope statistic
  calculate(stat = "slope")

# See the result
perm_slope
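For intuition, a single "permute" replicate under the independence null is equivalent to shuffling the response values so any real Foster-Biological link is broken, then refitting the regression. Here is a minimal hand-rolled sketch of that idea (twins_shuffled is a hypothetical name introduced for illustration; it shows the concept, not infer's literal internals):

# One hand-rolled permutation, for intuition only:
# shuffle the response, refit the model, extract the slope
twins_shuffled <- twins %>%
  mutate(Foster = sample(Foster))

lm(Foster ~ Biological, data = twins_shuffled) %>%
  tidy() %>%
  filter(term == "Biological") %>%
  pull(estimate)

Repeating this shuffle-and-refit step many times is what generate(type = "permute") followed by calculate(stat = "slope") automates for you.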
--------------------------------------------------------------------------------------------------------------------------------------------

SE of the slope

The previous exercise generated 10 different slopes under a model of no (i.e., null) relationship between the explanatory and response variables. Now repeat the null slope calculation 500 times to derive a null sampling distribution for the slope coefficient. The null sampling distribution will be used as a benchmark against which to compare the original data. Then you'll calculate the mean and standard deviation of the null sampling distribution for the slope.

Simulate 500 slopes.

* Use specify() to specify Foster vs. Biological (the same formula as for a linear regression).
* Use hypothesize() to set a null hypothesis of "independence".
* Use generate() to generate 500 replicates (reps) of type "permute".
* Use calculate() to calculate the summary statistic "slope".
* Using the perm_slope data, plot the stat you calculated with a density geom.
* Use ungroup() without arguments to ungroup the dataset.
* Use summarize() to calculate summary statistics.
  * Calculate mean_stat as the mean of stat.
  * Calculate std_err_stat as the standard error of stat, using sd().

# Simulate 500 slopes with a permuted dataset
perm_slope <- twins %>%
  # Specify Foster vs. Biological
  specify(Foster ~ Biological) %>%
  # Use a null hypothesis of independence
  hypothesize(null = "independence") %>%
  # Generate 500 permutation replicates
  generate(reps = 500, type = "permute") %>%
  # Calculate the slope statistic
  calculate(stat = "slope")

library(ggplot2)

# Using perm_slope, plot stat
ggplot(perm_slope, aes(x = stat)) +
  # Add a density layer
  geom_density()

perm_slope %>%
  # Ungroup the dataset (this seems unnecessary to me; nothing is grouped)
  ungroup() %>%
  # Calculate summary statistics
  summarize(
    # Mean of stat
    mean_stat = mean(stat),
    # Std error of stat
    std_err_stat = sd(stat)
  )

--------------------------------------------------------------------------------------------------------------------------------------------

p-value

Now that you have created the null sampling distribution, you can use it to find the p-value associated with the original slope statistic from the twins data. Although you might first consider this a one-sided research question, use the absolute value function here to practice performing a two-sided test on a slope coefficient.

You can calculate the proportion of TRUE values in a logical vector using mean(). For example, given a numeric vector x, the proportion of cases where x is greater than or equal to 10 can be calculated as mean(x >= 10).

* Run a linear regression of Foster vs. Biological on the twins dataset.
* Tidy the result.
* Filter for rows where term equals "Biological".
* Use pull() to pull out the estimate column.
* Use abs() to calculate the absolute value of the slope.
* Mutate perm_slope to add a column, abs_perm_slope, equal to the absolute value of stat.
* Summarize to calculate p_value, equal to the proportion of rows where abs_perm_slope is greater than or equal to abs_obs_slope.

# Run a lin. reg. of Foster vs. Biological on twins
abs_obs_slope <- lm(Foster ~ Biological, data = twins) %>%
  # Tidy the result
  tidy() %>%
  # Filter for rows where term equals Biological
  filter(term == "Biological") %>%
  # Pull out the estimate
  pull(estimate) %>%
  # Take the absolute value
  abs()

# Compute the p-value
perm_slope %>%
  # Add a column of the absolute value of the slope
  mutate(abs_perm_slope = abs(stat)) %>%
  # Calculate a summary statistic
  summarize(
    # Prop. of cases where abs. permuted slope >= abs. observed slope
    p_value = mean(abs_perm_slope >= abs_obs_slope)
  )

# A tibble: 1 × 1
  p_value
    <dbl>
1       0

What can we conclude based on the p-value associated with the twins data? If there were no association between foster and biological twin IQ (no nature) in the population, we would be extremely unlikely to have collected a sample of data like the one we did.
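As a cross-check (assuming the version of infer in use provides get_p_value(), and that obs_slope from the first exercise is still in the workspace), the same two-sided permutation p-value can be computed directly:

# Cross-check: infer's helper computes the two-sided p-value from the
# permuted slopes and the observed slope. With only 500 permutations,
# a reported p-value of 0 really means "smaller than about 1/500".
perm_slope %>%
  get_p_value(obs_stat = obs_slope, direction = "two-sided")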
--------------------------------------------------------------------------------------------------------------------------------------------

Bootstrapping the data

Using the infer package with type = "bootstrap", you can repeatedly resample the dataset to estimate the sampling distribution and standard error of the slope coefficient. The sampling distribution will then allow you to find a confidence interval for the underlying population slope directly.

Use the infer steps to bootstrap the twins data 1000 times. You don't need hypothesize() because now you're creating confidence intervals, not testing hypotheses!

* Specify Foster vs. Biological.
* Generate 1000 replicates by bootstrapping.
* Calculate the slope statistic.

# Set the seed for reproducibility
set.seed(4747)

# Calculate 1000 bootstrapped slopes
boot_slope <- twins %>%
  # Specify Foster vs. Biological
  specify(Foster ~ Biological) %>%
  # Generate 1000 bootstrap replicates
  generate(reps = 1000, type = "bootstrap") %>%
  # Calculate the slope statistic
  calculate(stat = "slope")

# See the result
boot_slope

--------------------------------------------------------------------------------------------------------------------------------------------

SE method - bootstrap CI for slope

The twins study was used to weigh in on the question of whether IQ is a result of nature (your genes) or nurture (your environment). If IQ were purely a result of nature, what slope would you expect to see in your linear model?

Recall that one way to create a confidence interval for a statistic is to calculate the statistic repeatedly under different bootstrap samples and then find the standard deviation of the bootstrapped statistics. The twins data is already loaded in your workspace.

* As in the previous exercise, use the infer steps on the twins data to specify a Foster vs. Biological model, generate 1000 bootstrap replicates, and calculate the slope statistic.
* Calculate the confidence interval of stat as the mean plus or minus two standard deviations.

If the statistic has a normal distribution, two standard deviations on either side of the mean gives a rough 95% confidence interval for its value.

set.seed(4747)

# Calculate the slope statistic
# from 1000 bootstrap replicates of
# the Foster vs. Biological model
# of the twins dataset
boot_slope <- twins %>%
  specify(Foster ~ Biological) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope")

# Create a confidence interval of stat,
# 2 std devs each side of the mean
boot_slope %>%
  summarize(
    lower = mean(stat) - 2 * sd(stat),
    upper = mean(stat) + 2 * sd(stat)
  )
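As a hedged cross-check (assuming a version of infer that provides get_confidence_interval(), and that obs_slope from the earlier exercise is still available), infer's SE-method helper builds a similar interval. It centres the interval on the observed slope and uses 1.96 standard errors rather than 2, so the endpoints will differ slightly from the summarize() version above:

# Cross-check with infer's built-in SE-method interval
boot_slope %>%
  get_confidence_interval(level = 0.95, type = "se", point_estimate = obs_slope)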
--------------------------------------------------------------------------------------------------------------------------------------------

Percentile method - bootstrap CI for slope

Alternatively, a CI for the slope can be created using the percentiles of the distribution of the bootstrapped slope statistics.

Recall that a CI is created in such a way that, over a lifetime of analyses, its coverage rate is (1 - alpha) * 100%. If you always set alpha = 0.05, then your 95% confidence intervals will capture the parameter of interest (over your lifetime) 95% of the time. Of the 5% of the time when an interval misses the parameter, typically the interval is too high half the time (2.5%) and too low the other half (2.5%).

The bootstrapped estimates of the slope, boot_slope, are loaded in your workspace.

* Set alpha to 0.05 (although for your own work, feel free to use a different confidence level).
* Calculate the percentile cutoffs needed to create the confidence interval.
  * The lower percentile cutoff is at half of alpha.
  * The upper percentile cutoff is at one minus half of alpha.
* Create the confidence interval of stat using quantile() and the percentile cutoffs. Name the interval ends lower and upper.

Using quantile() to calculate the confidence interval makes no assumptions about the shape of the distribution.

# Set alpha
alpha <- 0.05

# Set the lower percentile cutoff
p_lower <- alpha / 2

# Set the upper percentile cutoff
p_upper <- 1 - p_lower

# Create a confidence interval of stat using quantiles
boot_slope %>%
  summarize(
    lower = quantile(stat, p_lower),
    upper = quantile(stat, p_upper)
  )
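As a final cross-check (again assuming infer's get_confidence_interval() is available), the percentile interval can also be requested directly and should closely match the quantile() version above:

# Cross-check: infer's percentile-method interval
boot_slope %>%
  get_confidence_interval(level = 1 - alpha, type = "percentile")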