Probabilities under the t-distribution

We can use the pt() function to find probabilities under the t-distribution. For a given cutoff value q and a given degrees of freedom df, pt(q, df) gives us the probability under the t-distribution with df degrees of freedom for values of t less than q. In other words, P(t_df < q) = pt(q, df).

# P(T < 3) for df = 10
(x <- pt(3, df = 10))

# P(T > 3) for df = 10
(y <- pt(3, df = 10, lower.tail = FALSE))

# P(T > 3) for df = 100
(z <- pt(3, df = 100, lower.tail = FALSE))

# Comparison: tails get thinner as df increases, so P(T > 3) is
# smaller for df = 100 than for df = 10
y == z
y > z
y < z

-------------------------------------------------------------------------------------------------------------------------------------------

Cutoffs under the t-distribution

We can use the qt() function to find cutoffs under the t-distribution. For a given probability p and a given degrees of freedom df, qt(p, df) gives us the cutoff value for the t-distribution with df degrees of freedom for which the probability under the curve is p. In other words, if P(t_df < q) = p, then q = qt(p, df). For example, the cutoff for the 95th percentile of a distribution is qt(0.95, df). The "middle 95%" runs from p = 0.025 to p = 0.975.

# 95th percentile for df = 10
(x <- qt(0.95, df = 10))

# Upper bound of middle 95% for df = 10
(y <- qt(0.975, df = 10))

# Upper bound of middle 95% for df = 100
(z <- qt(0.975, df = 100))

# Comparison: the cutoff shrinks toward the normal value 1.96 as df increases
y == z
y > z
y < z
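Because pt() and qt() are inverses of each other, a quick sanity check ties the two exercises above together. This is a minimal sketch beyond the original exercises, using the same df = 10 as above:

# Round trip: qt() recovers the cutoff that pt() started from
q <- 3
p <- pt(q, df = 10)
qt(p, df = 10)  # 3, up to floating-point error

# The t-distribution is symmetric about 0, so the lower bound of the
# middle 95% is the negative of the upper bound (compare with a
# tolerance rather than ==)
all.equal(qt(0.025, df = 10), -qt(0.975, df = 10))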
-------------------------------------------------------------------------------------------------------------------------------------------

Average commute time of Americans

Each year since 2005, the US Census Bureau has surveyed about 3.5 million households through the American Community Survey (ACS). Data collected from the ACS have been crucial in government and policy decisions, helping to determine the allocation of federal and state funds each year. Data from the 2012 ACS are available in the acs12 dataset.

When given one argument, t.test() tests whether the population mean of its input is different from 0. That is, H0: mu = 0 and Ha: mu != 0. It also provides a 95% confidence interval.

* Filter the acs12 dataset for individuals whose employment status is "employed".
* Use a t-test to construct a 95% confidence interval for the average commute time of Americans.

Full-time employment is employment in which a person works a minimum number of hours defined as such by their employer. Companies in the United States commonly require 40 hours per week for an employee to be considered full-time. We will use data from the American Community Survey to estimate the average number of hours Americans work.

* Create a 95% confidence interval using the hrs_work column.

> glimpse(acs12)
Rows: 2,000
Columns: 13
$ income        60000, 0, NA, 0, 0, 1700, NA, NA, NA, 45000, NA, 8600,...
$ employment    not in labor force, not in labor force, NA, not in lab...
$ hrs_work      40, NA, NA, NA, NA, 40, NA, NA, NA, 84, NA, 23, NA, NA...
$ race          white, white, white, white, white, other, white, other...
$ age           68, 88, 12, 17, 77, 35, 11, 7, 6, 27, 8, 69, 69, 17, 1...
$ gender        female, male, female, male, female, female, male, male...
$ citizen       yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,...
$ time_to_work  NA, NA, NA, NA, NA, 15, NA, NA, NA, 40, NA, 5, NA, NA,...
$ lang          english, english, english, other, other, other, englis...
$ married       no, no, no, no, no, yes, no, no, no, yes, no, no, yes,...
$ edu           college, hs or lower, hs or lower, hs or lower, hs or ...
$ disability    no, yes, no, no, yes, yes, no, yes, no, no, no, no, ye...
$ birth_qrtr    jul thru sep, jan thru mar, oct thru dec, oct thru dec...

library(dplyr)
library(openintro)  # provides the acs12 dataset

# Filter for employed respondents
acs12_emp <- acs12 %>%
  filter(employment == "employed")

# Construct 95% CI for avg time_to_work
t.test(acs12_emp$time_to_work)

# Run a t-test on hrs_work and look at the CI
t.test(acs12_emp$hrs_work)

-------------------------------------------------------------------------------------------------------------------------------------------

t-interval at various levels

A random sample was taken of nearly 10% of UCLA courses. The most expensive textbook for each course was identified, and its prices at the UCLA Bookstore and on Amazon.com were recorded. These data are recorded in the textbooks dataset.

We want to test whether there is a difference between the average prices of textbooks sold in the bookstore vs. on Amazon. Since the book price data are paired (the prices of the same book at the two stores are not independent), rather than using individual variables for the prices from each store, you will look at a single variable of price differences. The diff column contains the UCLA Bookstore price minus the Amazon.com price for each book.

> glimpse(textbooks)
Rows: 73
Columns: 7
$ deptAbbr  Am Ind, Anthro, Anthro, Anthro, Art His, Art His, Asia Am,...
$ course    C170, 9, 135T, 191HB, M102K, 118E, 187B, 191E, C125, M145...
$ ibsn      978-0803272620, 978-0030119194, 978-0300080643, 978-022620...
$ uclaNew   27.67, 40.59, 31.68, 16.00, 18.95, 14.95, 24.70, 19.50, 12...
$ amazNew   27.95, 31.14, 32.00, 11.52, 14.21, 10.17, 20.06, 16.66, 10...
$ more      Y, Y, Y, Y, Y, Y, Y, N, N, Y, Y, N, Y, Y, N, N, N, N, N, N...
$ diff      -0.28, 9.45, -0.32, 4.48, 4.74, 4.78, 4.64, 2.84, 17.59, 3...

# Run a t-test on diff with a 90% CI
t.test(textbooks$diff, conf.level = 0.9)

# Same with 95% CI (the default)
t.test(textbooks$diff)

# Same with 99% CI
t.test(textbooks$diff, conf.level = 0.99)

-------------------------------------------------------------------------------------------------------------------------------------------

Estimate the median difference in textbook prices

Suppose instead of the mean, we want to estimate the median difference in prices of the same textbook at the UCLA Bookstore and on Amazon. You can't do this with a t-test, as the Central Limit Theorem applies to means, not medians. You'll use an infer pipeline to estimate the median instead.

1. For the price differences, generate 15000 bootstrap replicates and calculate their medians.
2. Calculate a 95% confidence interval for the bootstrap median price differences using the percentile method.

library(infer)

# Calculate 15000 bootstrap medians of diff
# (specify() takes a bare column name, not a string)
textdiff_med_ci <- textbooks %>%
  specify(response = diff) %>%
  generate(reps = 15000, type = "bootstrap") %>%
  calculate(stat = "median")

# Calculate the 95% CI via the percentile method
textdiff_med_ci %>%
  summarize(
    l = quantile(stat, 0.025),
    u = quantile(stat, 0.975)
  )
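If you'd rather not write the quantile() calls by hand, infer also provides get_confidence_interval() for this. A minimal equivalent sketch, beyond the original exercise, reusing the textdiff_med_ci replicates from above:

# Same 95% percentile CI, computed by infer's built-in helper
textdiff_med_ci %>%
  get_confidence_interval(level = 0.95, type = "percentile")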
-------------------------------------------------------------------------------------------------------------------------------------------

Test for a difference in median test scores

The High School and Beyond survey is conducted on high school seniors by the National Center for Education Statistics. We randomly sampled 200 observations from this survey, and these data are in the hsb2 data frame (which is already loaded for you). Among other variables, this data frame contains the math and science scores of each sampled student.

1. Add a column to hsb2 named diff containing the math scores minus the science scores.
2. Specify diff as the response, use a point null hypothesis of the score difference being med = 0, generate 15000 bootstrap replicates, and calculate the median.
3. Calculate the observed median score difference and pull out the value.
4. Calculate the two-sided p-value: filter for rows where the bootstrap median is greater than or equal to the observed median, summarize to calculate one_sided_p_val as the number of rows in the filtered dataset divided by the number of replicates, then calculate two_sided_p_val from one_sided_p_val.

> glimpse(hsb2)
Rows: 200
Columns: 11
$ id       70, 121, 86, 141, 172, 113, 50, 11, 84, 48, 75, 60, 95, 104...
$ gender   "male", "female", "male", "male", "male", "male", "male", "...
$ race     "white", "white", "white", "white", "white", "white", "afri...
$ ses      low, middle, high, high, middle, middle, middle, middle, mi...
$ schtyp   public, public, public, public, public, public, public, pub...
$ prog     general, vocational, general, vocational, academic, academi...
$ read     57, 68, 44, 63, 47, 44, 50, 34, 63, 57, 60, 57, 73, 54, 45,...
$ write    52, 59, 33, 44, 52, 52, 59, 46, 57, 55, 46, 65, 60, 63, 57,...
$ math     41, 53, 54, 47, 57, 51, 42, 45, 54, 52, 51, 51, 71, 57, 50,...
$ science  47, 63, 58, 53, 53, 63, 53, 39, 58, 50, 53, 63, 61, 55, 31,...
$ socst    57, 61, 31, 56, 61, 61, 61, 36, 51, 51, 61, 61, 71, 46, 56,...

# Add a column, diff, of math minus science
hsb2 <- hsb2 %>%
  mutate(diff = math - science)

n_replicates <- 15000

# Generate 15000 bootstrap medians centered at the null value, med = 0
scorediff_med_ht <- hsb2 %>%
  specify(response = diff) %>%
  hypothesize(null = "point", med = 0) %>%
  generate(reps = n_replicates, type = "bootstrap") %>%
  calculate(stat = "median")

# Calculate the observed median of the differences
scorediff_med_obs <- hsb2 %>%
  summarize(median_diff = median(diff)) %>%
  pull()

# Calculate the two-sided p-value: count bootstrap medians at least as
# large as the observed median, then double the one-sided tail area
scorediff_med_ht %>%
  filter(stat >= scorediff_med_obs) %>%
  summarize(
    one_sided_p_val = n() / n_replicates,
    two_sided_p_val = 2 * one_sided_p_val
  )
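As a cross-check, a sketch beyond the original exercise, reusing the scorediff_med_ht and scorediff_med_obs objects above: infer's get_p_value() computes the two-sided p-value directly, and visualize() with shade_p_value() shows where the observed median falls in the null distribution (shade_p_value() adds a ggplot2 layer, so ggplot2 must be installed):

# Same two-sided p-value via infer's helper; "two-sided" counts both
# tails, so it also works if the observed statistic falls in the lower tail
scorediff_med_ht %>%
  get_p_value(obs_stat = scorediff_med_obs, direction = "two-sided")

# Plot the null distribution with the observed median marked
scorediff_med_ht %>%
  visualize() +
  shade_p_value(obs_stat = scorediff_med_obs, direction = "two-sided")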