Evaluating the effectiveness of stem cell treatment Now that we set out hypotheses, we can calculate the observed difference in means between the treatment and control groups. The stem.cell data frame gives us the pumping capacity of the heart before and after the experiment. So first we need to find the difference between these for each experimental unit, and then use these change values to calculate the observed difference in mean change between treatment and control groups. 1. Add a column to stem.cell named change, equal to after minus before. 2. Calculate observed difference in means between treatment groups. Evaluating the effectiveness of stem cell treatment (cont.) Conduct the hypothesis test and compute the p-value for evaluating whether there is a difference between the mean heart pumping capacities of sheep's hearts in the control and treatment groups. Since we're testing for a difference, the order in which we subtract matters. We can specify it using the order argument in the calculate() function. For example, to achieve group1 - group2 use order = c("group1", "group2"). 3. Generate 1000 differences in means via randomization. Specify change versus treatment group. Use an independence hypothesis. Generate 1000 replicates via permutation. Calculate the difference in means (esc first, then ctrl). 4. Calculate the one-sided p-value. Filter for replicates where the simulated test statistic is at least as extreme as the observed statistic. Summarize to calculate the p-value as the number of rows in the filtered dataset divided by the number of replicates. > glimpse(stem.cell) Rows: 18 Columns: 3 $ trmt ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, esc, e... $ before 35.25, 36.50, 39.75, 39.75, 41.75, 45.00, 47.00, 52.00, 52.0... $ after 29.50, 29.50, 36.25, 38.00, 37.50, 42.75, 39.00, 45.25, 52.2... # Calculate difference between before and after stem.cell <- stem.cell %>% mutate(change = after - before) # Calculate observed difference in means diff_mean <- stem.cell %>% # Group by treatment group group_by(trmt) %>% # Calculate mean change for each group summarize(mean_change = mean(change)) %>% # Pull out the value pull() %>% # Calculate difference diff() # See the result diff_mean n_replicates <- 1000 # Generate 1000 differences in means via randomization diff_mean_ht <- stem.cell %>% # y ~ x specify(change ~ trmt) %>% # Null = no difference between means hypothesize(null = "independence") %>% # Shuffle labels 1000 times generate(reps = 1000, type = "permute") %>% # Calculate test statistic calculate(stat = "diff in means", order = c("esc", "ctrl")) diff_mean_ht %>% # Filter for simulated test statistics at least as extreme as observed filter(stat >= diff_mean) %>% # Calculate p-value summarize(p_val = n() / n_replicates) ------------------------------------------------------------------------------------------------------------------------------------------- Evaluating the relationship between smoking during pregnancy and birth weight The state of North Carolina released to the public a large data set containing information on births recorded in this state. This data set has been of interest to medical researchers who are studying the relation between habits and practices of expectant mothers and the birth of their children. ncbirths is a random sample of 1000 cases from this data set. We want to evaluate whether there is a difference between weights of babies born to smoker and non-smoker mothers. 1. Filter ncbirths for rows where habit is non-missing. 2. Calculate observed difference in means between smoking habit groups. 3. Generate 1000 differences in means via randomization and store as diff_mean_ht. 4. Filter for rows where the bootstrap difference in means is less than or equal to the observed difference in means. Summarize to calculate one_sided_p_val as the number of rows in the filtered dataset divided by the number of replicates. Calculate two_sided_p_val from one_sided_p_val. Quantifying the relationship between smoking during pregnancy and birth weight Let's construct a bootstrap interval for the difference in mean weights of babies born to smoker and non-smoker mothers. The ncbirths_complete_habit data frame you created earlier is available to use. 5. Generate 1500 bootstrap difference in means for birth weight by smoking habit. 6. Calculate a 95% confidence interval for the bootstrap difference in mean birth weights using the percentile method. > glimpse(ncbirths) Rows: 1,000 Columns: 13 $ fage NA, NA, 19, 21, NA, NA, 18, 17, NA, 20, 30, NA, NA, ... $ mage 13, 14, 15, 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, ... $ mature younger mom, younger mom, younger mom, younger mom, ... $ weeks 39, 42, 37, 41, 39, 38, 37, 35, 38, 37, 45, 42, 40, ... $ premie full term, full term, full term, full term, full ter... $ visits 10, 15, 11, 6, 9, 19, 12, 5, 9, 13, 9, 8, 4, 12, 15,... $ marital married, married, married, married, married, married... $ gained 38, 20, 38, 34, 27, 22, 76, 15, NA, 52, 28, 34, 12, ... $ weight 7.63, 7.88, 6.63, 8.00, 6.38, 5.38, 8.44, 4.69, 8.81... $ lowbirthweight not low, not low, not low, not low, not low, low, no... $ gender male, male, female, male, female, male, male, male, ... $ habit nonsmoker, nonsmoker, nonsmoker, nonsmoker, nonsmoke... $ whitemom not white, not white, white, white, not white, not w... # Filter for subjects with non-missing habit ncbirths_complete_habit <- ncbirths %>% filter(!is.na(habit)) # Calculate observed difference in means diff_mean_obs <- ncbirths_complete_habit %>% # Group by habit group group_by(habit) %>% # Calculate mean weight for each group summarize(mean_weight = mean(weight)) %>% # Pull out the value pull() %>% # Calculate the difference diff() n_replicates <- 1000 # Generate 1000 differences in means via randomization diff_mean_ht <- ncbirths_complete_habit %>% # Specify weight vs. habit specify(weight ~ habit) %>% # Null = no difference between means hypothesize(null = "independence") %>% # Shuffle labels 1000 times generate(reps = n_replicates, type = "permute") %>% # Calculate test statistic, nonsmoker then smoker calculate(stat = "diff in means", order = c("nonsmoker", "smoker")) # Calculate p-value diff_mean_ht %>% # Identify simulated test statistics at least as extreme as observed filter(stat <= diff_mean_obs) %>% # Calculate p-value summarize( one_sided_p_val = n() / n_replicates, two_sided_p_val = 2 * one_sided_p_val ) # Generate 1500 bootstrap difference in means diff_mean_ci <- ncbirths_complete_habit %>% # Specify weight vs. habit specify(weight ~ habit) %>% # Generate 1500 bootstrap replicates generate(reps = 1500, type = "bootstrap") %>% # Calculate the difference in means, nonsmoker then smoker calculate(stat = "diff in means", order = c("nonsmoker", "smoker")) # Calculate the 95% CI via percentile method diff_mean_ci %>% summarize( l = quantile(stat, 0.025), u = quantile(stat, 0.975) ) ------------------------------------------------------------------------------------------------------------------------------------------- Median lengths of pregnancies for smoking and non-smoking mothers Let's turn our attention to another variable, weeks, indicating the length of the pregnancy. Construct a bootstrap interval for the difference in median lengths of pregnancies of smoker and non-smoker mothers. 1. Filter ncbirths for rows where smoking habit and length of pregnancy in weeks are both non-missing. 2. Generate 1500 bootstrap difference in medians for pregnancy length in weeks by smoking habit. 3. Calculate a 92% confidence interval for the bootstrap difference in median pregnancy length in weeks using the percentile method. # Filter for non-missing habit & non-missing weeks ncbirths_complete_habit_weeks <- ncbirths %>% filter(!is.na(habit), !is.na(weeks)) # Generate 1500 bootstrap difference in medians diff_med_ci <- ncbirths_complete_habit_weeks %>% # Specify weeks versus habit specify(weeks ~ habit) %>% # Generate 1500 bootstrap replicates generate(reps = 1500, type = "bootstrap") %>% # Calculate the difference in medians, nonsmoker then smoker calculate(stat = "diff in medians", order = c("nonsmoker", "smoker")) # Calculate the 92% CI via percentile method diff_med_ci %>% summarize( l = quantile(stat, 0.04), u = quantile(stat, 0.96) ) ------------------------------------------------------------------------------------------------------------------------------------------- Hourly pay vs. citizenship status Using the acs12 data, and specifically the variables income, hrs_work, and citizen, summarize, visualize, and compare the distributions of hourly pay rate for citizens and non-citizens. 1. Filter acs12 for rows where hourly pay, hrly_pay, and U.S. citizenship status, citizen, are both non-missing. 2. Get the number of rows in the full dataset, acs12. Get the number of rows in the filtered dataset, acs12_complete_hrlypay_citizen. Calculate the number of rows where at least one of hrly_pay and citizen was missing. Calculate the proportion of missing rows where at least one of hrly_pay and citizen was missing. 3. Group by U.S. citizenship status. Summarize to calculate the mean of hourly pay, the standard deviation of hourly pay, and the number of rows in that citizenship group. 4. Plot a histogram of the hrly_pay, faceted by citizenship status. 5. Run a t-test to find a confidence interval for the difference in average hourly pay, hrly_pay between citizen and non-citizens, citizen in the acs12_complete_hrlypay_citizen dataset. Assign to test_results. > glimpse(acs12) Rows: 2,000 Columns: 14 $ income 60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, 8600,... $ employment not in labor force, not in labor force, NA, not in lab... $ hrs_work 40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA... $ race white, white, white, white, white, other, white, other... $ age 68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, 17, 1... $ gender female, male, female, male, female, female, male, male... $ citizen "Citizen", "Citizen", "Citizen", "Citizen", "Citizen",... $ time_to_work NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA,... $ lang english, english, english, other, other, other, englis... $ married no, no, no, no, no, yes, no, no, no, yes, no, no, yes,... $ edu college, hs or lower, hs or lower, hs or lower, hs or ... $ disability no, yes, no, no, yes, yes, no, yes, no, no, no, no, ye... $ birth_qrtr jul thru sep, jan thru mar, oct thru dec, oct thru dec... $ hrly_pay 28.8461538, NA, NA, NA, NA, 0.8173077, NA, NA, NA, 10.... # Filter for non-missing hrly_pay and non-missing citizen acs12_complete_hrlypay_citizen <- acs12 %>% filter(!is.na(hrly_pay), !is.na(citizen)) # Get nuber of rows in full dataset n_all <- nrow(acs12) # Get number of rows in filtered dataset n_non_missing <- nrow(acs12_complete_hrlypay_citizen) # Calculate number missing n_missing <- n_all - n_non_missing # Calculate proportion missing prop_missing <- n_missing / n_all # See the result prop_missing acs12_complete_hrlypay_citizen %>% # Group by citizen group_by(citizen) %>% summarize( # Calculate mean hourly pay x_bar = mean(hrly_pay), # Calculate std dev of hourly pay s = sd(hrly_pay), # Count number of rows n = n() ) # Using acs12_complete_hrlypay_citizen, plot hrly_pay ggplot(acs12_complete_hrlypay_citizen, aes(x = hrly_pay)) + # Add a histogram layer geom_histogram() + facet_grid(rows = vars(citizen)) # Construct 95% CI using a t-test test_results <- t.test(hrly_pay ~ citizen, data = acs12_complete_hrlypay_citizen, mu = 0, alternative = "two.sided") # See the results test_results