Evaluating the effectiveness of stem cell treatment

Now that we set out hypotheses, we can calculate the observed difference in means between the treatment and control groups. The stem.cell
data frame gives us the pumping capacity of the heart before and after the experiment. So first we need to find the difference between
these for each experimental unit, and then use these change values to calculate the observed difference in mean change between treatment
and control groups.

1. Add a column to stem.cell named change, equal to after minus before.

2. Calculate observed difference in means between treatment groups.

Evaluating the effectiveness of stem cell treatment (cont.)

Conduct the hypothesis test and compute the p-value for evaluating whether there is a difference between the mean heart pumping capacities of sheep's hearts in the control and treatment groups.

Since we're testing for a difference, the order in which we subtract matters. We can specify it using the order argument in the calculate() function. For example, to achieve group1 - group2 use order = c("group1", "group2").

3. Generate 1000 differences in means via randomization.
   Specify change versus treatment group.
   Use an independence hypothesis.
   Generate 1000 replicates via permutation.
   Calculate the difference in means (esc first, then ctrl).

4. Calculate the one-sided p-value.
   Filter for replicates where the simulated test statistic is at least as extreme as the observed statistic.
   Summarize to calculate the p-value as the number of rows in the filtered dataset divided by the number of replicates.


>
glimpse(stem.cell)
Rows: 18
Columns: 3
$ trmt   <fct> ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, esc, e...
$ before <dbl> 35.25, 36.50, 39.75, 39.75, 41.75, 45.00, 47.00, 52.00, 52.0...
$ after  <dbl> 29.50, 29.50, 36.25, 38.00, 37.50, 42.75, 39.00, 45.25, 52.2...

# Calculate difference between before and after
stem.cell <- stem.cell %>%
  mutate(change = after - before)

# Calculate observed difference in means
diff_mean <- stem.cell %>%
  # Group by treatment group
  group_by(trmt) %>%
  # Calculate mean change for each group
  summarize(mean_change = mean(change)) %>% 
  # Pull out the value
  pull() %>%
  # Calculate difference
  diff()
  
# See the result
diff_mean

n_replicates <- 1000

# Generate 1000 differences in means via randomization
diff_mean_ht <- stem.cell %>%
  # y ~ x
  specify(change ~ trmt) %>% 
  # Null = no difference between means
  hypothesize(null = "independence") %>% 
  # Shuffle labels 1000 times
  generate(reps = 1000, type = "permute") %>% 
  # Calculate test statistic
  calculate(stat = "diff in means", order = c("esc", "ctrl")) 

diff_mean_ht %>%
  # Filter for simulated test statistics at least as extreme as observed
  filter(stat >= diff_mean) %>%
  # Calculate p-value
  summarize(p_val = n() / n_replicates)

-------------------------------------------------------------------------------------------------------------------------------------------

Evaluating the relationship between smoking during pregnancy and birth weight

The state of North Carolina released to the public a large data set containing information on births recorded in this state. This data set
has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth
of their children. ncbirths is a random sample of 1000 cases from this data set. We want to evaluate whether there is a difference between
weights of babies born to smoker and non-smoker mothers.

1. Filter ncbirths for rows where habit is non-missing.

2. Calculate observed difference in means between smoking habit groups.

3. Generate 1000 differences in means via randomization and store as diff_mean_ht.

4. Filter for rows where the bootstrap difference in means is less than or equal to the observed difference in means.
   Summarize to calculate one_sided_p_val as the number of rows in the filtered dataset divided by the number of replicates.
   Calculate two_sided_p_val from one_sided_p_val.

Quantifying the relationship between smoking during pregnancy and birth weight

Let's construct a bootstrap interval for the difference in mean weights of babies born to smoker and non-smoker mothers. The
ncbirths_complete_habit data frame you created earlier is available to use.

5. Generate 1500 bootstrap difference in means for birth weight by smoking habit.

6. Calculate a 95% confidence interval for the bootstrap difference in mean birth weights using the percentile method.


>
glimpse(ncbirths)
Rows: 1,000
Columns: 13
$ fage           <int> NA, NA, 19, 21, NA, NA, 18, 17, NA, 20, 30, NA, NA, ...
$ mage           <int> 13, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, ...
$ mature         <fct> younger mom, younger mom, younger mom, younger mom, ...
$ weeks          <int> 39, 42, 37, 41, 39, 38, 37, 35, 38, 37, 45, 42, 40, ...
$ premie         <fct> full term, full term, full term, full term, full ter...
$ visits         <int> 10, 15, 11, 6, 9, 19, 12, 5, 9, 13, 9, 8, 4, 12, 15,...
$ marital        <fct> married, married, married, married, married, married...
$ gained         <int> 38, 20, 38, 34, 27, 22, 76, 15, NA, 52, 28, 34, 12, ...
$ weight         <dbl> 7.63, 7.88, 6.63, 8.00, 6.38, 5.38, 8.44, 4.69, 8.81...
$ lowbirthweight <fct> not low, not low, not low, not low, not low, low, no...
$ gender         <fct> male, male, female, male, female, male, male, male, ...
$ habit          <fct> nonsmoker, nonsmoker, nonsmoker, nonsmoker, nonsmoke...
$ whitemom       <fct> not white, not white, white, white, not white, not w...

# Filter for subjects with non-missing habit
ncbirths_complete_habit <- ncbirths %>%
  filter(!is.na(habit))
  
# Calculate observed difference in means
diff_mean_obs <- ncbirths_complete_habit %>%
  # Group by habit group
  group_by(habit) %>%
  # Calculate mean weight for each group
  summarize(mean_weight = mean(weight)) %>%
  # Pull out the value
  pull() %>%
  # Calculate the difference
  diff()

n_replicates <- 1000

# Generate 1000 differences in means via randomization
diff_mean_ht <- ncbirths_complete_habit %>% 
  # Specify weight vs. habit
  specify(weight ~ habit) %>% 
  # Null = no difference between means
  hypothesize(null = "independence") %>% 
  # Shuffle labels 1000 times
  generate(reps = n_replicates, type = "permute") %>%
  # Calculate test statistic, nonsmoker then smoker
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))

# Calculate p-value
diff_mean_ht %>%
  # Identify simulated test statistics at least as extreme as observed
  filter(stat <= diff_mean_obs) %>%
  # Calculate p-value
  summarize(
    one_sided_p_val = n() / n_replicates,
    two_sided_p_val = 2 * one_sided_p_val 
  )

# Generate 1500 bootstrap difference in means
diff_mean_ci <- ncbirths_complete_habit %>%
  # Specify weight vs. habit
  specify(weight ~ habit) %>%
  # Generate 1500 bootstrap replicates
  generate(reps = 1500, type = "bootstrap") %>%
  # Calculate the difference in means, nonsmoker then smoker
  calculate(stat = "diff in means", order = c("nonsmoker", "smoker"))

# Calculate the 95% CI via percentile method
diff_mean_ci %>%
  summarize(
    l = quantile(stat, 0.025),
    u = quantile(stat, 0.975)
  )

-------------------------------------------------------------------------------------------------------------------------------------------

Median lengths of pregnancies for smoking and non-smoking mothers

Let's turn our attention to another variable, weeks, indicating the length of the pregnancy. Construct a bootstrap interval for the difference in median lengths of pregnancies of smoker and non-smoker mothers.

1. Filter ncbirths for rows where smoking habit and length of pregnancy in weeks are both non-missing.

2. Generate 1500 bootstrap difference in medians for pregnancy length in weeks by smoking habit.

3. Calculate a 92% confidence interval for the bootstrap difference in median pregnancy length in weeks using the percentile method.


# Filter for non-missing habit & non-missing weeks
ncbirths_complete_habit_weeks <- ncbirths %>%
  filter(!is.na(habit), !is.na(weeks))
  
# Generate 1500 bootstrap difference in medians
diff_med_ci <- ncbirths_complete_habit_weeks %>%
  # Specify weeks versus habit
  specify(weeks ~ habit) %>%
  # Generate 1500 bootstrap replicates
  generate(reps = 1500, type = "bootstrap") %>%
  # Calculate the difference in medians, nonsmoker then smoker
  calculate(stat = "diff in medians", order = c("nonsmoker", "smoker"))

# Calculate the 92% CI via percentile method
diff_med_ci %>%
  summarize(
    l = quantile(stat, 0.04),
    u = quantile(stat, 0.96)
  )

-------------------------------------------------------------------------------------------------------------------------------------------

Hourly pay vs. citizenship status

Using the acs12 data, and specifically the variables income, hrs_work, and citizen, summarize, visualize, and compare the distributions 
of hourly pay rate for citizens and non-citizens.

1. Filter acs12 for rows where hourly pay, hrly_pay, and U.S. citizenship status, citizen, are both non-missing.

2. Get the number of rows in the full dataset, acs12.
   Get the number of rows in the filtered dataset, acs12_complete_hrlypay_citizen.
   Calculate the number of rows where at least one of hrly_pay and citizen was missing.
   Calculate the proportion of missing rows where at least one of hrly_pay and citizen was missing.

3. Group by U.S. citizenship status.
   Summarize to calculate the mean of hourly pay, the standard deviation of hourly pay, and the number of rows in that citizenship group.

4. Plot a histogram of the hrly_pay, faceted by citizenship status.

5. Run a t-test to find a confidence interval for the difference in average hourly pay, hrly_pay between citizen and non-citizens, 
   citizen in the acs12_complete_hrlypay_citizen dataset. Assign to test_results.


>
glimpse(acs12)
Rows: 2,000
Columns: 14
$ income       <int> 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, 8600,...
$ employment   <fct> not in labor force, not in labor force, NA, not in lab...
$ hrs_work     <int> 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA...
$ race         <fct> white, white, white, white, white, other, white, other...
$ age          <int> 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, 17, 1...
$ gender       <fct> female, male, female, male, female, female, male, male...
$ citizen      <chr> "Citizen", "Citizen", "Citizen", "Citizen", "Citizen",...
$ time_to_work <int> NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA,...
$ lang         <fct> english, english, english, other, other, other, englis...
$ married      <fct> no, no, no, no, no, yes, no, no, no, yes, no, no, yes,...
$ edu          <fct> college, hs or lower, hs or lower, hs or lower, hs or ...
$ disability   <fct> no, yes, no, no, yes, yes, no, yes, no, no, no, no, ye...
$ birth_qrtr   <fct> jul thru sep, jan thru mar, oct thru dec, oct thru dec...
$ hrly_pay     <dbl> 28.8461538, NA, NA, NA, NA, 0.8173077, NA, NA, NA, 10....

# Filter for non-missing hrly_pay and non-missing citizen
acs12_complete_hrlypay_citizen <- acs12 %>%
  filter(!is.na(hrly_pay), !is.na(citizen))

# Get nuber of rows in full dataset
n_all <- nrow(acs12)

# Get number of rows in filtered dataset
n_non_missing <- nrow(acs12_complete_hrlypay_citizen)

# Calculate number missing
n_missing <- n_all - n_non_missing

# Calculate proportion missing
prop_missing <- n_missing / n_all

# See the result
prop_missing

acs12_complete_hrlypay_citizen %>%
  # Group by citizen
  group_by(citizen) %>%
  summarize(
    # Calculate mean hourly pay
    x_bar = mean(hrly_pay),
    # Calculate std dev of hourly pay
    s = sd(hrly_pay),
    # Count number of rows
    n = n()
  )

# Using acs12_complete_hrlypay_citizen, plot hrly_pay
ggplot(acs12_complete_hrlypay_citizen, aes(x = hrly_pay)) +
  # Add a histogram layer
  geom_histogram() +
  facet_grid(rows = vars(citizen))

# Construct 95% CI using a t-test
test_results <- t.test(hrly_pay ~ citizen, data = acs12_complete_hrlypay_citizen, mu = 0, alternative = "two.sided")

# See the results
test_results