Simulating a Beta prior
Suppose you’re running in an election for public office. Let p be your underlying support, the proportion of voters that plan to vote for you. Based on past polls, your prior model of p is captured by a Beta distribution with shape parameters 45 and 55.
You will approximate the Beta(45, 55) prior using random samples from the rbeta() function. This function takes three arguments: sample size (n) and two shape parameters (shape1, shape2). Subsequently, you will construct a density plot of the samples using ggplot(). This function takes two arguments: the data set containing the samples and, within aes(), the variable to be plotted on the x-axis. The density plot layer is added using geom_density().
library(ggplot2)
# Sample 10000 draws from Beta(45,55) prior
prior_A <- rbeta(n = 10000, shape1 = 45, shape2 = 55)
# Store the results in a data frame
prior_sim <- data.frame(prior_A)
# Construct a density plot of the prior sample
ggplot(prior_sim, aes(x = prior_A)) +
geom_density()
Great! Take a quick look - the distribution of your sample approximates the features of the Beta(45,55) prior.
Comparing & contrasting Beta priors
The Beta(a, b) distribution is defined on the interval from 0 to 1, thus provides a natural and flexible prior for your underlying election support, p. You can tune the Beta shape parameters a and b to produce alternative prior models. Below you will compare your original Beta(45, 55) prior with two alternatives: Beta(1, 1) and Beta(100, 100). The original 10,000 prior_A samples drawn from Beta(45, 55) are in your workspace.
library(tibble)
# Sample 10000 draws from the Beta(1,1) prior
prior_B <- rbeta(n = 10000, shape1 = 1, shape2 = 1)
# Sample 10000 draws from the Beta(100,100) prior
prior_C <- rbeta(n = 10000, shape1 = 100, shape2 = 100)
# Combine the results in a single data frame
prior_sim <- data.frame(samples = c(prior_A, prior_B, prior_C),
priors = rep(c("A","B","C"), each = 10000))
tibble(prior_sim)
## # A tibble: 30,000 x 2
## samples priors
## <dbl> <chr>
## 1 0.459 A
## 2 0.450 A
## 3 0.487 A
## 4 0.474 A
## 5 0.470 A
## 6 0.439 A
## 7 0.484 A
## 8 0.390 A
## 9 0.486 A
## 10 0.385 A
## # ... with 29,990 more rows
# Plot the 3 priors
ggplot(prior_sim, aes(x = samples, fill = priors)) +
geom_density(alpha = 0.5)
Nice work! The three Beta priors here are just a few of the countless different priors we can obtain by tuning the shape parameters.
Prior B reflects ‘vague’ prior information about p - it gives equal prior weight to all values of p between 0 and 1. Prior C reflects more prior certainty about p - it has less spread and is centered around a mean that’s greater than that for Prior A.
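To quantify what the density plot shows, one option (a quick sketch, assuming the prior_sim data frame built above and the dplyr package) is to compare the center and spread of each set of prior samples; Prior C should show the smallest standard deviation, consistent with its greater prior certainty.
# Sketch: compare the center and spread of the three priors
library(dplyr)
prior_sim %>%
  group_by(priors) %>%
  summarize(mean = mean(samples), sd = sd(samples))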
Simulating the dependence of X on p
In your quest for election to public office, your campaign polls 10 likely voters. Let X be the number that support you. Of course, X varies from sample to sample and depends upon p, your underlying support in the broader population. Since X is a count of successes in 10 independent trials, each having probability of success p, you can model its dependence on p by the Binomial distribution: X ~ Bin(10, p).
You will simulate the Binomial model using random samples from the rbinom(n, size, prob) function. This vectorized function draws n samples from a Bin(size, prob) distribution. Given a vector of prob values, the first prob value will be used for the first draw, the second prob value will be used for the second draw, etc.
library(ggridges)
# Define a vector of 1000 p values
p_grid <- seq(from = 0, to = 1, length.out = 1000)
# Simulate 1 poll result for each p in p_grid
poll_result <- rbinom(n = 1000, size = 10, prob = p_grid)
# Create likelihood_sim data frame
likelihood_sim <- data.frame(p_grid, poll_result)
# Density plots of p_grid grouped by poll_result
ggplot(likelihood_sim, aes(x = p_grid, y = poll_result, group = poll_result)) +
geom_density_ridges()
## Picking joint bandwidth of 0.0408
Great! Notice that polls in which 0 people supported you (poll_result = 0) correspond to smaller values of underlying support p (p_grid). The opposite is true for polls in which all 10 people supported you.
Approximating the likelihood function
The first election poll is in! X = 6 of 10 polled voters plan to vote for you. You can use these data to build insight into your underlying support p. To this end, you will use the likelihood_sim data frame (in your workspace), which contains the values of X (poll_result) simulated from each of 1,000 possible values of p between 0 and 1 (p_grid).
# Density plots of p_grid grouped by poll_result
ggplot(likelihood_sim, aes(x = p_grid, y = poll_result, group = poll_result, fill = poll_result == 6)) +
geom_density_ridges()
## Picking joint bandwidth of 0.0408
Great! Reexamine the highlighted density plot. This is a scaled approximation of the likelihood function! It indicates that the simulated surveys in which 6 of 10 voters supported you corresponded to underlying support p that ranged from approximately 0.25 to 1, with p around 0.6 being the most common.
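For a numeric counterpart to the highlighted ridge, a small sketch (assuming the likelihood_sim data frame above and the dplyr package) is to summarize the p_grid values underlying the simulations in which 6 of 10 voters supported you.
# Sketch: summarize the p values that produced a poll result of 6
library(dplyr)
likelihood_sim %>%
  filter(poll_result == 6) %>%
  summarize(min = min(p_grid), mean = mean(p_grid), max = max(p_grid))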
Define, compile, and simulate
In your election quest, let p be the proportion of the underlying voting population that supports you. Built from previous polls & election data, your prior model of p is a Beta(a, b) with shape parameters a = 45 and b = 55. For added insight into p, you also polled n = 10 potential voters. The dependence of X, the number of these voters that support you, on p is modeled by the Bin(n, p) distribution.
In the completed poll, X = 6 of n = 10 voters supported you. The next goal is to update your model of p in light of these observed polling data! To this end, you will use the rjags package to approximate the posterior model of p. We break this exercise down into the 3 rjags steps: define, compile, simulate.
library(rjags)
## Loading required package: coda
## Linked to JAGS 4.3.0
## Loaded modules: basemod,bugs
# DEFINE the model
vote_model <- "model{
# Likelihood model for X
X ~ dbin(p, n)
# Prior model for p
p ~ dbeta(a, b)
}"
# COMPILE the model
vote_jags <- jags.model(textConnection(vote_model),
data = list(a = 45, b = 55, X = 6, n = 10),
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 100))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 5
##
## Initializing model
# SIMULATE the posterior
vote_sim <- coda.samples(model = vote_jags, variable.names = c("p"), n.iter = 10000)
# PLOT the posterior
plot(vote_sim, trace = FALSE)
Nice work! You’ve successfully defined, compiled, and simulated your Bayesian model. Notice that after observing a poll in which 6 of 10 (60%) of voters supported you, your updated posterior optimism and certainty about your underlying support, p, are slightly higher than they were prior to the poll.
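To put numbers on that shift, here is a minimal sketch (assuming prior_A and vote_sim from above) comparing the prior and posterior samples; by Beta-Binomial conjugacy the exact posterior is Beta(45 + 6, 55 + 4) = Beta(51, 59), so the posterior mean should sit slightly above the prior mean of 0.45.
# Sketch: compare prior and posterior center & spread for p
c(prior_mean = mean(prior_A), posterior_mean = mean(vote_sim[[1]][, "p"]))
c(prior_sd = sd(prior_A), posterior_sd = sd(vote_sim[[1]][, "p"]))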
Updating the posterior
The posterior model of your underlying election support p is informed by both the prior model of p and the polling data X. Earlier, you simulated the posterior that evolved from your original Beta(45, 55) prior and original poll data (X = 6 of n = 10 polled voters supported you). The defined vote_model is in your workspace.
In a 3-step exercise, you will explore how using a different prior model or observing new data (or a combination of the two!) might impact the posterior.
First, re-compile, simulate, and plot the posterior to reflect the setting in which you start with a Beta(1, 1) prior but observe the same polling data (X = 6, n = 10). NOTE: Recall that Beta(1, 1) is uniform across the (0, 1) interval.
# COMPILE the model
vote_jags <- jags.model(textConnection(vote_model),
data = list(a = 1, b = 1, X = 6, n = 10),
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 100))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 5
##
## Initializing model
# SIMULATE the posterior
vote_sim <- coda.samples(model = vote_jags, variable.names = c("p"), n.iter = 10000)
# PLOT the posterior
plot(vote_sim, trace = FALSE, xlim = c(0,1), ylim = c(0,18))
In a new poll, 214 of 390 voters support you. Combined with the first poll, a total of X = 220 of n = 400 polled voters support you. Re-compile, simulate, and plot the posterior to reflect these combined poll results and a Beta(1, 1) prior.
# COMPILE the model
vote_jags <- jags.model(textConnection(vote_model),
data = list(a = 1, b = 1, X = 220, n = 400),
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 100))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 5
##
## Initializing model
# SIMULATE the posterior
vote_sim <- coda.samples(model = vote_jags, variable.names = c("p"), n.iter = 10000)
# PLOT the posterior
plot(vote_sim, trace = FALSE, xlim = c(0,1), ylim = c(0,18))
Finally, re-compile, simulate, and plot the posterior to reflect the combined poll results (X = 220 of n = 400) and your original Beta(45, 55) prior.
# COMPILE the model
vote_jags <- jags.model(textConnection(vote_model),
data = list(a = 45, b = 55, X = 220, n = 400),
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 100))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 1
## Unobserved stochastic nodes: 1
## Total graph size: 5
##
## Initializing model
# SIMULATE the posterior
vote_sim <- coda.samples(model = vote_jags, variable.names = c("p"), n.iter = 10000)
# PLOT the posterior
plot(vote_sim, trace = FALSE, xlim = c(0,1), ylim = c(0,18))
Nice work! Re-visit the plots you constructed throughout this exercise to review the impact of the prior and data on the posterior.
Even in light of the same data, different priors lead to different posteriors.
The influence of the prior on the posterior diminishes as the sample size of your data increases.
As sample size increases, your posterior understanding becomes more precise.
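Because the Beta prior is conjugate to the Binomial likelihood, you can also write down the exact posteriors as a sanity check on the simulations above: a Beta(a, b) prior combined with X successes in n trials yields a Beta(a + X, b + n - X) posterior. A minimal base-R sketch for the combined polls:
# Sketch: exact conjugate posteriors for the combined polls (X = 220, n = 400)
p <- seq(0, 1, length.out = 1000)
plot(p, dbeta(p, 45 + 220, 55 + 180), type = "l")   # Beta(45,55) prior
lines(p, dbeta(p, 1 + 220, 1 + 180), lty = 2)       # Beta(1,1) prior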
Normal-Normal priors
Researchers developed a test to evaluate the impact of sleep deprivation on reaction time. For subject i, let Y_i be the change in reaction time (in ms) after 3 sleep-deprived nights. Of course, people react differently to sleep deprivation. It’s reasonable to assume that the Y_i are Normally distributed around some average m with standard deviation s: Y_i ~ N(m, s^2).
In the first step of your Bayesian analysis, you’ll simulate the following prior models for parameters m and s: m ~ N(50, 25^2) and s ~ Unif(0, 200). This requires the rnorm(n, mean, sd) and runif(n, min, max) functions.
# Take 10000 samples from the m prior
prior_m <- rnorm(n = 10000, mean = 50, sd = 25)
# Take 10000 samples from the s prior
prior_s <- runif(n = 10000, min = 0, max = 200)
# Store samples in a data frame
samples <- data.frame(prior_m, prior_s)
# Density plots of the prior_m & prior_s samples
ggplot(samples, aes(x = prior_m)) +
geom_density()
ggplot(samples, aes(x = prior_s)) +
geom_density()
Right! The distributions of these random samples approximate the features of your Normal prior for m and Uniform prior for s.
Sleep study data
Researchers enrolled 18 subjects in a sleep deprivation study. Their observed sleep_study data are loaded in the workspace. These data contain the day_0 reaction times and day_3 reaction times after 3 sleep-deprived nights for each subject.
You will define and explore diff_3, the observed difference in reaction times for each subject. This will require the mutate() & summarize() functions. For example, the following would add variable day_0_s, the day_0 reaction times in seconds, to sleep_study:
sleep_study <- sleep_study %>%
mutate(day_0_s = day_0 * 0.001)
You can then summarize() the day_0_s values, here by their minimum & maximum:
sleep_study %>%
summarize(min(day_0_s), max(day_0_s))
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sleep_study <- read.csv("_data/sleep_study.csv")
# Check out the first 6 rows of sleep_study
head(sleep_study)
## subject day_0 day_3
## 1 308 249.5600 321.4398
## 2 309 222.7339 204.7070
## 3 310 199.0539 232.8416
## 4 330 321.5426 285.1330
## 5 331 287.6079 320.1153
## 6 332 234.8606 309.7688
# Define diff_3
sleep_study <- sleep_study %>%
mutate(diff_3 = day_3 - day_0)
# Histogram of diff_3
ggplot(sleep_study, aes(x = diff_3)) +
geom_histogram(binwidth = 20, color = "white")
# Mean and standard deviation of diff_3
sleep_study %>%
summarize(mean(diff_3), sd(diff_3))
## mean(diff_3) sd(diff_3)
## 1 26.34021 37.20764
Great work! Reaction times increased by an average of ~26 ms with a standard deviation of ~37 ms. Further, only 4 of the 18 test subjects had faster reaction times on day 3 than on day 0.
Though not in perfect agreement about the degree to which the average reaction time changes under sleep deprivation, both the likelihood and prior are consistent with the hypothesis that the average increases relative to reaction time under normal sleep conditions.
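The “4 of the 18 test subjects” claim above is easy to verify directly; a one-line sketch, assuming the sleep_study data frame with diff_3 from above:
# Sketch: count the subjects whose reaction time improved (diff_3 < 0)
sum(sleep_study$diff_3 < 0)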
Define, compile, & simulate the Normal-Normal
Upon observing the change in reaction time for each of the 18 subjects enrolled in the sleep study, you can update your posterior model of the effect of sleep deprivation on reaction time. This requires combining insight from the likelihood and prior models: the Normal likelihood Y_i ~ N(m, s^2) along with the priors m ~ N(50, 25^2) and s ~ Unif(0, 200). The sleep_study data are in your workspace.
# DEFINE the model
sleep_model <- "model{
# Likelihood model for Y[i]
for(i in 1:length(Y)) {
Y[i] ~ dnorm(m, s^(-2))
}
# Prior models for m and s
m ~ dnorm(50, 25^(-2))
s ~ dunif(0, 200)
}"
# COMPILE the model
sleep_jags <- jags.model(
textConnection(sleep_model),
data = list(Y = sleep_study$diff_3),
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 1989)
)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 18
## Unobserved stochastic nodes: 2
## Total graph size: 28
##
## Initializing model
# SIMULATE the posterior
sleep_sim <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 10000)
# PLOT the posterior
plot(sleep_sim, trace = FALSE)
Nice work! You’ve successfully defined, compiled, and simulated your Bayesian Normal-Normal model.
Your posterior model is narrower than the prior and lies almost entirely above 0, so you’re more confident that the average reaction time increases under sleep deprivation. Further, the location of the posterior is below that of the prior. This reflects the strong influence of the observed sleep study data, in which the increase in average reaction time was only ~26 ms.
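One way to quantify “lies almost entirely above 0” is to compute the proportion of the m chain that is positive, an approximation of the posterior probability that average reaction time increases. A minimal sketch, assuming the sleep_sim object from above:
# Sketch: approximate posterior probability that m > 0
mean(sleep_sim[[1]][, "m"] > 0)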
Markov chain distribution: an approximation of the posterior!
Storing Markov chains
Let m be the average change in reaction time after 3 days of sleep deprivation. In a previous exercise, you obtained an approximate sample of 10,000 draws from the posterior model of m. You stored the resulting mcmc.list object as sleep_sim, which is loaded in your workspace:
sleep_sim <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 10000)
In fact, the sample of values in sleep_sim is a dependent Markov chain, the distribution of which converges to the posterior. You will examine the contents of sleep_sim and, to have finer control over your analysis, store the contents in a data frame.
# Check out the head of sleep_sim
head(sleep_sim)
## [[1]]
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1001
## End = 1007
## Thinning interval = 1
## m s
## [1,] 17.25796 31.46256
## [2,] 34.58469 37.88655
## [3,] 36.45480 39.58056
## [4,] 25.00971 39.69494
## [5,] 29.95475 35.90001
## [6,] 28.43894 37.46466
## [7,] 38.32427 35.44081
##
## attr(,"class")
## [1] "mcmc.list"
# Store the chains in a data frame
sleep_chains <- data.frame(sleep_sim[[1]], iter = 1:10000)
# Check out the head of sleep_chains
head(sleep_chains)
## m s iter
## 1 17.25796 31.46256 1
## 2 34.58469 37.88655 2
## 3 36.45480 39.58056 3
## 4 25.00971 39.69494 4
## 5 29.95475 35.90001 5
## 6 28.43894 37.46466 6
Great! Next, you’ll visualize the contents of these Markov chains.
Markov chain trace plots
A trace plot provides a visualization of a Markov chain’s longitudinal behavior. Specifically, a trace plot of the m chain plots the observed chain value (y-axis) against the corresponding iteration number (x-axis).
You will construct trace plots of the m chain using two different approaches: by applying the built-in plot() function to the mcmc.list object sleep_sim and, for finer control over this graphic (and finer control over analyses in later chapters), by applying ggplot() to the data.frame object sleep_chains. Both sleep_sim and sleep_chains are in your workspace:
sleep_sim <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 10000)
sleep_chains <- data.frame(sleep_sim[[1]], iter = 1:10000)
# Use plot() to construct trace plots of the m and s chains
plot(sleep_sim, density = FALSE)
# Use ggplot() to construct a trace plot of the m chain
ggplot(sleep_chains, aes(x = iter, y = m)) +
geom_line()
# Trace plot the first 100 iterations of the m chain
ggplot(sleep_chains[1:100, ], aes(x = iter, y = m)) +
geom_line()
Nice work! Note that the longitudinal behavior of the chain appears quite random and that the trend remains relatively constant. This is a good thing - it indicates that the Markov chain (likely) converges quickly to the posterior distribution of m.
Markov chain density plots
Whereas a trace plot captures a Markov chain’s longitudinal behavior, a density plot illustrates the final distribution of the chain values. In turn, the density plot provides an approximation of the posterior model. You will construct and examine density plots of the Markov chains below. The mcmc.list object sleep_sim and the sleep_chains data frame are in your workspace:
sleep_sim <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 10000)
sleep_chains <- data.frame(sleep_sim[[1]], iter = 1:10000)
# Use plot() to construct density plots of the m and s chains
plot(sleep_sim, trace = FALSE)
# Use ggplot() to construct a density plot of the m chain
ggplot(sleep_chains, aes(x = m)) +
geom_density()
Check it out! These density plots approximate the posterior models of m and s.
Multiple chains
Trace plots help us diagnose the quality of a Markov chain simulation. A “good” Markov chain will exhibit stability as the chain length increases and consistency across repeated simulations, or multiple chains. You will use RJAGS to run and construct trace plots for four parallel chains below. The defined sleep_model is in your workspace.
# COMPILE the model
sleep_jags_multi <- jags.model(textConnection(sleep_model), data = list(Y = sleep_study$diff_3), n.chains = 4)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 18
## Unobserved stochastic nodes: 2
## Total graph size: 28
##
## Initializing model
# SIMULATE the posterior
sleep_sim_multi <- coda.samples(model = sleep_jags_multi, variable.names = c("m", "s"), n.iter = 1000)
# Check out the head of sleep_sim_multi
head(sleep_sim_multi)
## [[1]]
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1001
## End = 1007
## Thinning interval = 1
## m s
## [1,] 28.318679 59.58722
## [2,] 64.650105 64.77086
## [3,] 5.686389 69.82354
## [4,] 47.446461 50.50661
## [5,] 48.712439 50.68817
## [6,] 31.868964 36.71130
## [7,] 38.878090 53.32325
##
## [[2]]
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1001
## End = 1007
## Thinning interval = 1
## m s
## [1,] 29.76849 51.09705
## [2,] 33.28692 51.91032
## [3,] 22.36573 56.36948
## [4,] 51.43043 59.70859
## [5,] 25.03049 30.72082
## [6,] 18.36155 37.64469
## [7,] 34.71654 40.52099
##
## [[3]]
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1001
## End = 1007
## Thinning interval = 1
## m s
## [1,] 41.42646 38.45844
## [2,] 43.88512 37.12345
## [3,] 31.17395 40.63014
## [4,] 42.18353 44.86224
## [5,] 30.58723 54.22735
## [6,] 47.07128 37.65286
## [7,] 25.39798 36.49480
##
## [[4]]
## Markov Chain Monte Carlo (MCMC) output:
## Start = 1001
## End = 1007
## Thinning interval = 1
## m s
## [1,] 32.63888 44.26067
## [2,] 39.05726 36.88991
## [3,] 27.68728 37.51415
## [4,] 36.94070 37.88889
## [5,] 33.10890 35.16639
## [6,] 49.18366 52.55942
## [7,] 21.43233 53.66403
##
## attr(,"class")
## [1] "mcmc.list"
# Construct trace plots of the m and s chains
plot(sleep_sim_multi, density = FALSE)
Good work! The most important thing to notice here is the similarity and stability among the 4 parallel chains. This provides some reassurance about the quality and consistency of our Markov chain simulation.
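For a numeric counterpart to this visual check, the coda package (loaded with rjags) provides the Gelman-Rubin diagnostic, which compares within-chain and between-chain variability; values near 1 suggest the parallel chains agree. A quick sketch, assuming sleep_sim_multi from above:
# Sketch: Gelman-Rubin diagnostic across the 4 parallel chains
gelman.diag(sleep_sim_multi)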
Naive standard errors
The mean of the m Markov chain provides an estimate of the posterior mean of m. The naive standard error provides a measure of the potential error in this estimate. In turn, we can use this measure to determine an appropriate chain length. For example, suppose your goal is to estimate the posterior mean of m within a standard error of 0.1 ms. If your observed naive standard error exceeds this target, no problem! Simply run a longer chain - the error in using a Markov chain to approximate a posterior tends to decrease as chain length increases.
The defined sleep_model and compiled sleep_jags object are in your workspace.
# SIMULATE the posterior
sleep_sim_1 <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 1000)
# Summarize the m and s chains of sleep_sim_1
summary(sleep_sim_1)
##
## Iterations = 11001:12000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## m 29.55 9.032 0.2856 0.2856
## s 40.81 7.654 0.2420 0.3414
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## m 12.00 23.62 29.31 35.58 47.48
## s 28.34 35.25 39.97 45.41 58.60
# RE-SIMULATE the posterior
sleep_sim_2 <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 10000)
# Summarize the m and s chains of sleep_sim_2
summary(sleep_sim_2)
##
## Iterations = 12001:22000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## m 29.33 9.080 0.09080 0.0942
## s 40.12 7.483 0.07483 0.1115
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## m 11.69 23.25 29.25 35.24 47.55
## s 28.61 34.89 39.07 44.25 57.82
No problem! You’ve proved to yourself that if the standard errors associated with your Markov chain are too big, simply increase the number of iterations. In general, naive standard error will decrease as the chain length increases.
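The “naive” in naive standard error is just SD / sqrt(chain length), which is why a 10-fold longer chain shrinks it by roughly a factor of sqrt(10). You can reproduce the values reported above directly:
# Naive SE = SD / sqrt(chain length)
9.032 / sqrt(1000)    # ~0.2856, the m row of sleep_sim_1
9.080 / sqrt(10000)   # ~0.0908, the m row of sleep_sim_2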
Reproducibility
Now that you’ve completed (and passed!) some Markov chain diagnostics, you’re ready to finalize your RJAGS simulation. To this end, reproducibility is crucial. To obtain reproducible simulation output, you must set the seed of the RJAGS random number generator. This works differently than in base R. Instead of using set.seed(), you will specify a starting seed using
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = ___)
when you compile your model.
# COMPILE the model
sleep_jags <- jags.model(textConnection(sleep_model), data = list(Y = sleep_study$diff_3), inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 1989))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 18
## Unobserved stochastic nodes: 2
## Total graph size: 28
##
## Initializing model
# SIMULATE the posterior
sleep_sim <- coda.samples(model = sleep_jags, variable.names = c("m", "s"), n.iter = 10000)
# Summarize the m and s chains of sleep_sim
summary(sleep_sim)
##
## Iterations = 1001:11000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## m 29.29 8.980 0.08980 0.08854
## s 40.19 7.519 0.07519 0.11557
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## m 11.78 23.34 29.16 35.05 47.45
## s 28.68 34.85 39.12 44.39 57.79
Excellent work. Now that you can diagnose and reproduce your Markov chain simulations, you’re ready for the next Bayesian model!
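As a quick check of reproducibility, re-compiling and re-simulating with the same .RNG.name and .RNG.seed should return an identical chain. A sketch, assuming sleep_model and sleep_study are still in the workspace (the object names here are illustrative):
# Sketch: re-run the simulation with the same seed and compare
sleep_jags_check <- jags.model(textConnection(sleep_model),
                               data = list(Y = sleep_study$diff_3),
                               inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 1989))
sleep_sim_check <- coda.samples(model = sleep_jags_check, variable.names = c("m", "s"), n.iter = 10000)
identical(as.matrix(sleep_sim), as.matrix(sleep_sim_check))   # expect TRUE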
Regression priors
Let Y_i be the weight (in kg) of subject i. Past studies have shown that weight is linearly related to height X_i (in cm). The average weight m_i among adults of any shared height X_i can be written as m_i = a + b*X_i. But height isn’t a perfect predictor of weight - individuals vary from the trend. To this end, it’s reasonable to assume that the Y_i are Normally distributed around m_i with residual standard deviation s: Y_i ~ N(m_i, s^2).
Note the 3 parameters in the model of weight Y_i by height X_i: intercept a, slope b, & standard deviation s. In the first step of your Bayesian analysis, you will simulate the following prior models for these parameters: a ~ N(0, 200^2), b ~ N(1, 0.5^2), and s ~ Unif(0, 20).
# Take 10000 samples from the a, b, & s priors
a <- rnorm(n = 10000, mean = 0, sd = 200)
b <- rnorm(n = 10000, mean = 1, sd = 0.5)
s <- runif(n = 10000, min = 0, max = 20)
# Store samples in a data frame
samples <- data.frame(set = 1:10000, a, b, s)
# Construct density plots of the prior samples
ggplot(samples, aes(x = a)) +
geom_density()
ggplot(samples, aes(x = b)) +
geom_density()
ggplot(samples, aes(x = s)) +
geom_density()
Great work! These simulations approximate your prior models of each separate model parameter. There’s likely a positive association between weight & height (b > 0) but more uncertainty about the intercept a. Further, at any given height, the typical deviation of individuals’ weights from the trend is equally likely to be anywhere between 0 and 20 kg.
Visualizing the regression priors
In the previous exercise, you simulated 10,000 samples for each parameter (a, b, s) in the Bayesian regression model of weight Y_i by height X_i: Y_i ~ N(m_i, s^2) with mean m_i = a + b*X_i. The set of a, b, and s values in each row of samples represents a prior plausible regression scenario. To explore the scope of these prior scenarios, you will simulate 50 pairs of height and weight values from each of the first 12 sets of prior parameters a, b, and s.
# Replicate the first 12 parameter sets 50 times each
prior_scenarios_rep <- bind_rows(replicate(n = 50, expr = samples[1:12, ], simplify = FALSE))
# Simulate 50 height & weight data points for each parameter set
prior_simulation <- prior_scenarios_rep %>%
mutate(height = rnorm(n = 600, mean = 170, sd = 10)) %>%
mutate(weight = rnorm(n = 600, mean = a + b * height, sd = s))
# Plot the simulated data & regression model for each parameter set
ggplot(prior_simulation, aes(x = height, y = weight)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, size = 0.75) +
facet_wrap(~ set)
## `geom_smooth()` using formula 'y ~ x'
Exciting! These 12 plots demonstrate the range of prior plausible models. These models have different intercepts, slopes, and residual standard deviations. Almost all of the models have positive slopes, demonstrating the prior information that there is likely a positive association between weight & height. Given your vague prior for a, some of these models are even biologically impossible!
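To see how often “biologically impossible” happens under this vague prior, one sketch (assuming the samples data frame from above) is to check how many prior parameter sets imply a negative average weight at a typical adult height of, say, 170 cm:
# Sketch: proportion of prior parameter sets with a negative mean weight at 170 cm
mean(samples$a + samples$b * 170 < 0)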
Weight & height data
The bdims data set from the openintro package is loaded in your workspace. bdims contains physical measurements on a sample of 507 individuals, including their weight in kg (wgt) and height in cm (hgt). You will use these data to build insights into the relationship between weight and height.
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
data(bdims, package = "openintro")
# Construct a scatterplot of wgt vs hgt
ggplot(bdims, aes(x = hgt, y = wgt)) +
geom_point()
# Add a model smooth
ggplot(bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Nice work. These data support your prior information about a positive association between weight and height. With insights from the priors and data in place, you’re ready to simulate the posterior regression model in RJAGS!
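For a non-Bayesian point of reference (not part of the exercise, assuming only the bdims data above), an ordinary least squares fit gives ballpark values for the intercept and slope that the posterior below can be compared against:
# Sketch: frequentist reference fit of weight on height
lm(wgt ~ hgt, data = bdims)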
Define, compile, & simulate the regression model
Upon observing the relationship between weight and height for the 507 subjects in the bdims data set, you can update your posterior model of this relationship. To build your posterior, you must combine your insights from the likelihood, Y_i ~ N(a + b*X_i, s^2), and the priors on a, b, and s.
In this series of exercises, you’ll define, compile, and simulate your Bayesian regression posterior. The bdims data are in your workspace.
# DEFINE the model
weight_model <- "model{
# Define model for data Y[i]
for(i in 1:length(Y)) {
Y[i] ~ dnorm(m[i], s^(-2))
m[i] <- a + b * X[i]
}
# Define the a, b, s priors
a ~ dnorm(0, 200^(-2))
b ~ dnorm(1, 0.5^(-2))
s ~ dunif(0, 20)
}"
# COMPILE the model
weight_jags <- jags.model(
textConnection(weight_model),
data = list(Y = bdims$wgt, X = bdims$hgt),
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 1989)
)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 507
## Unobserved stochastic nodes: 3
## Total graph size: 1321
##
## Initializing model
# SIMULATE the posterior
weight_sim <- coda.samples(model = weight_jags, variable.names = c("a", "b", "s"), n.iter = 1000)
# PLOT the posterior
plot(weight_sim)
Yuck! You’ve successfully defined, compiled, and simulated your Bayesian regression model. But the results don’t look that great. Let’s fix that.
# SIMULATE the posterior
weight_sim_big <- coda.samples(model = weight_jags, variable.names = c("a", "b", "s"), n.iter = 100000)
# PLOT the posterior
plot(weight_sim_big)
Trace plots indicate that after only 1,000 iterations, the a and b parallel chains had not stabilized. However, after 100,000 iterations, the chains demonstrate greater stability. We might also increase the stability of our simulation by standardizing the height data, but this is beyond the scope of our current discussion.
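If you do want to pursue the stabilization idea mentioned above, one sketch (with illustrative object names, assuming weight_model and bdims from above) is to center height by subtracting its mean before compiling; this typically reduces the posterior correlation between the intercept and slope. The intercept a is then interpreted as the average weight at the average height.
# Sketch: center the predictor before re-compiling the same model
hgt_centered <- bdims$hgt - mean(bdims$hgt)
weight_jags_centered <- jags.model(
  textConnection(weight_model),
  data = list(Y = bdims$wgt, X = hgt_centered),
  inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 1989)
)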
Posterior point estimates
Recall the likelihood of the Bayesian regression model of weight Y_i by height X_i: Y_i ~ N(m_i, s^2) where m_i = a + b*X_i. A 100,000 iteration RJAGS simulation of the posterior, weight_sim_big, is in your workspace along with a data frame of the Markov chain output:
> head(weight_chains, 2)
a b s iter
1 -113.9029 1.072505 8.772007 1
2 -115.0644 1.077914 8.986393 2
The posterior means of the intercept & slope parameters, a & b, reflect the posterior mean trend in the relationship between weight & height. In contrast, the full posteriors of a & b reflect the range of plausible parameters, thus posterior uncertainty in the trend. You will examine the trend and uncertainty in this trend below. The bdims data are in your workspace.
# Summarize the posterior Markov chains
summary(weight_sim_big)
##
## Iterations = 2001:102000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1e+05
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## a -104.222 7.77968 0.0246015 0.646878
## b 1.013 0.04539 0.0001435 0.003794
## s 9.331 0.29656 0.0009378 0.001214
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## a -118.8879 -109.7096 -104.681 -98.780 -88.712
## b 0.9224 0.9813 1.016 1.045 1.098
## s 8.7743 9.1273 9.323 9.526 9.938
# Calculate the estimated posterior mean of b
weight_chains <- data.frame(weight_sim_big[[1]], iter = 1:100000)
mean(weight_chains$b)
## [1] 1.013019
# Plot the posterior mean regression model
ggplot(bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_abline(intercept = mean(weight_chains$a), slope = mean(weight_chains$b), color = "red")
# Visualize the range of 20 posterior regression models
ggplot(bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_abline(intercept = weight_chains$a[1:20], slope = weight_chains$b[1:20], color = "gray", size = 0.25)
Excellent work. You now have a sense of the posterior mean trend in the linear relationship between weight and height as well as the posterior uncertainty in this trend. Given the size of the data and selection of priors, the posterior uncertainty is noticeably small as evidenced by the tight distribution of the gray posterior plausible lines around the trend.
Posterior credible intervals
Let’s focus on the slope parameter b, the rate of change in weight over height. The posterior mean of b reflects the trend in the posterior model of the slope. In contrast, a posterior credible interval provides a range of posterior plausible slope values, thus reflects posterior uncertainty about b. For example, the 95% credible interval for b ranges from the 2.5th to the 97.5th quantile of the b posterior. Thus there’s a 95% (posterior) chance that b is in this range.
You will use RJAGS simulation output to approximate credible intervals for b. The 100,000 iteration RJAGS simulation of the posterior, weight_sim_big, is in your workspace along with a data frame of the Markov chain output, weight_chains.
# Summarize the posterior Markov chains
summary(weight_sim_big)
##
## Iterations = 2001:102000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 1e+05
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## a -104.222 7.77968 0.0246015 0.646878
## b 1.013 0.04539 0.0001435 0.003794
## s 9.331 0.29656 0.0009378 0.001214
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## a -118.8879 -109.7096 -104.681 -98.780 -88.712
## b 0.9224 0.9813 1.016 1.045 1.098
## s 8.7743 9.1273 9.323 9.526 9.938
# Calculate the 95% posterior credible interval for b
ci_95 <- quantile(weight_chains$b, probs = c(0.025, 0.975))
ci_95
## 2.5% 97.5%
## 0.9224311 1.0983272
# Calculate the 90% posterior credible interval for b
ci_90 <- quantile(weight_chains$b, probs = c(0.05, 0.95))
ci_90
## 5% 95%
## 0.9348413 1.0851638
# Mark the 90% credible interval
ggplot(weight_chains, aes(x = b)) +
geom_density() +
geom_vline(xintercept = ci_90, color = "red")
Right! Based on your calculations we can say that there’s a 90% (posterior) probability that, on average, the increase in weight per 1 cm increase in height is between roughly 0.93 and 1.09 kg.
Posterior probabilities
You’ve used RJAGS output to explore and quantify the posterior trend & uncertainty in b. You can also use RJAGS output to assess specific hypotheses. For example: What’s the posterior probability that, on average, weight increases by more than 1.1 kg for every 1 cm increase in height? That is, what’s the posterior probability that b > 1.1?
You will approximate this probability by the proportion of b Markov chain values that exceed 1.1. The weight_chains data frame with the 100,000 iteration Markov chain output is in your workspace.
# Mark 1.1 on a posterior density plot for b
ggplot(weight_chains, aes(x = b)) +
geom_density() +
geom_vline(xintercept = 1.1, color = "red")
# Summarize the number of b chain values that exceed 1.1
table(weight_chains$b > 1.1)
##
## FALSE TRUE
## 97751 2249
# Calculate the proportion of b chain values that exceed 1.1
mean(weight_chains$b > 1.1)
## [1] 0.02249
Great. Based on your calculations we can say that there’s only a ~2% (posterior) chance that, on average, the increase in weight per 1 cm increase in height exceeds 1.1 kg.
Inference for the posterior trend
Recall the likelihood of the Bayesian regression model of weight Y_i by height X_i: Y_i ~ N(m_i, s^2) where m_i = a + b*X_i. In earlier exercises you approximated the posterior trend, which indicates that the typical weight among 180 cm adults is roughly 80 kg.
You will use RJAGS simulation output to approximate the posterior trend in weight among 180 cm tall adults as well as the posterior uncertainty in this trend. The 100,000 iteration RJAGS simulation of the posterior, weight_sim_big, is in your workspace along with a data frame of the Markov chain output, weight_chains.
# Calculate the trend under each Markov chain parameter set
weight_chains <- weight_chains %>%
mutate(m_180 = a + b * 180)
# Construct a posterior density plot of the trend
ggplot(weight_chains, aes(x = m_180)) +
geom_density()
# Construct a posterior credible interval for the trend
quantile(weight_chains$m_180, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 76.97793 79.23810
Yes! The posterior trend of your regression model indicates that the typical weight among 180 cm tall adults is roughly 78 kg. However, posterior uncertainty in the regression model trickles down to uncertainty about this trend. This uncertainty is communicated through your credible interval: there’s a 95% (posterior) chance that the typical weight at a height of 180 cm is between 76.98 and 79.24 kg.
Calculating posterior predictions
You just explored the posterior trend in weight among adults with height X = 180: m_180 = a + b*180. The weight_chains data frame contains 100,000 posterior plausible values of m_180 that you calculated from the corresponding values of a and b:
> head(weight_chains, 2)
a b s iter m_180
1 -113.9029 1.072505 8.772007 1 79.14803
2 -115.0644 1.077914 8.986393 2 78.96014
Forget the trend - what if you wanted to predict the weight of a specific 180 cm tall adult? You can! To do so, you must account for individual variability from the trend, modeled by Y_180 ~ N(m_180, s^2). Using this model, you will simulate predictions of weight under each set of posterior plausible parameters in weight_chains.
# Simulate 1 prediction under the first parameter set
rnorm(n = 1, mean = weight_chains$m_180[1], sd = weight_chains$s[1])
## [1] 88.86259
# Simulate 1 prediction under the second parameter set
rnorm(n = 1, mean = weight_chains$m_180[2], sd = weight_chains$s[2])
## [1] 84.04445
# Simulate & store 1 prediction under each parameter set
weight_chains <- weight_chains %>%
mutate(Y_180 = rnorm(n = 100000, mean = m_180, sd = s))
# Print the first 6 parameter sets & predictions
head(weight_chains)
## a b s iter m_180 Y_180
## 1 -106.5042 1.027620 9.053909 1 78.46745 80.25860
## 2 -106.6708 1.029252 9.159864 2 78.59459 84.13774
## 3 -106.5268 1.026147 9.441036 3 78.17955 91.08447
## 4 -106.1365 1.023241 9.607720 4 78.04680 83.36590
## 5 -106.0383 1.027861 9.274800 5 78.97676 67.07144
## 6 -107.2631 1.031626 9.155143 6 78.42966 84.55290
You now have 100,000 predictions for the weight of a 180 cm tall adult that reflect the range of posterior plausible parameter settings.
Posterior predictive distribution
The weight_chains data frame (in your workspace) contains your 100,000 posterior predictions, Y_180, for the weight of a 180 cm tall adult:
> head(weight_chains, 2)
a b s iter m_180 Y_180
1 -113.9029 1.072505 8.772007 1 79.14803 71.65811
2 -115.0644 1.077914 8.986393 2 78.96014 75.78893
You will use these 100,000 predictions to approximate the posterior predictive distribution for the weight of a 180 cm tall adult. The bdims data are in your workspace.
# Construct a posterior credible interval for the prediction
ci_180 <- quantile(weight_chains$Y_180, probs = c(0.025, 0.975))
ci_180
## 2.5% 97.5%
## 59.70452 96.50401
# Construct a density plot of the posterior predictions
ggplot(weight_chains, aes(x = Y_180)) +
geom_density() +
geom_vline(xintercept = ci_180, color = "red")
# Visualize the credible interval on a scatterplot of the data
ggplot(bdims, aes(x = hgt, y = wgt)) +
geom_point() +
geom_abline(intercept = mean(weight_chains$a), slope = mean(weight_chains$b), color = "red") +
geom_segment(x = 180, xend = 180, y = ci_180[1], yend = ci_180[2], color = "red")
Congratulations! You’ve simulated your first posterior predictive distribution. Your 100,000 posterior plausible weights for a given 180 cm tall adult ranged from roughly 36 to 117 kg. Eliminating the most extreme 5% of these predictions, you observed that there’s a 95% (posterior) chance that the weight is between roughly 60 and 96 kg.
RailTrail sample data
The RailTrail data frame from the mosaicData package is loaded in your workspace. RailTrail contains data collected by the Pioneer Valley Planning Commission on the usage of a local rail-trail. For each of 90 days, they recorded the rail-trail volume (number of users) and whether it was a weekday (TRUE if yes and FALSE otherwise). You will explore the trends in weekday vs weekend volume below.
library(mosaicData)
data(RailTrail)
# Check the class of the weekday variable
class(RailTrail$weekday)
## [1] "logical"
# Construct a density plot of volume by weekday
ggplot(RailTrail, aes(x = volume, fill = weekday)) +
geom_density(alpha = 0.5)
Right! Notice that, as might be intuitive, rail-trail volume tends to be slightly higher on weekends (~430 users per day) than on weekdays (~350 users per day).
RJAGS simulation with categorical variables
Consider the Normal regression model of volume Y_i by weekday status X_i: Y_i ~ N(m_i, s^2) with m_i = a + b[X_i], under the priors specified in the model below. You explored the relationship between Y_i and X_i for the 90 days recorded in RailTrail (in your workspace). In light of these data and the priors, you will update your posterior model of this relationship. This differs from previous analyses in that X_i is categorical. In rjags syntax, its coefficient is defined by two elements, b[1] and b[2], which correspond to the weekend and weekday levels, respectively. For reference, b[1] is fixed at 0. In contrast, b[2], the contrast between weekday and weekend volume, is assigned its own prior.
RailTrail$weekday <- as.factor(RailTrail$weekday)
# DEFINE the model
rail_model_1 <- "model{
# Likelihood model for Y[i]
for(i in 1:length(Y)){
Y[i] ~ dnorm(m[i], s^(-2))
m[i] <- a + b[X[i]]
}
# Prior models for a, b, s
a ~ dnorm(400, 100^(-2))
b[1] <- 0
b[2] ~ dnorm(0, 200^(-2))
s ~ dunif(0, 200)
}"
# COMPILE the model
rail_jags_1 <- jags.model(
textConnection(rail_model_1),
data = list(Y = RailTrail$volume, X = RailTrail$weekday),
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 10)
)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 90
## Unobserved stochastic nodes: 3
## Total graph size: 194
##
## Initializing model
# SIMULATE the posterior
rail_sim_1 <- coda.samples(model = rail_jags_1, variable.names = c("a", "b", "s"), n.iter = 10000)
# Store the chains in a data frame
rail_chains_1 <- data.frame(rail_sim_1[[1]])
# PLOT the posterior
plot(rail_sim_1)
Nice work! You’ve successfully defined, compiled, and simulated a Bayesian regression model with a categorical predictor.
summary(rail_sim_1)
##
## Iterations = 1001:11000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## a 428.47 23.052 0.23052 0.5321
## b[1] 0.00 0.000 0.00000 0.0000
## b[2] -77.78 27.900 0.27900 0.6422
## s 124.25 9.662 0.09662 0.1335
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## a 383.1 412.9 428.78 443.62 474.20
## b[1] 0.0 0.0 0.00 0.00 0.00
## b[2] -133.2 -96.3 -77.65 -59.43 -23.07
## s 107.5 117.3 123.59 130.32 144.90
Typically, there are 428.47 trail users on a weekend day and 77.78 fewer users (~350.69) on a weekday. Parameter a describes the typical weekend volume whereas b[2] describes the contrast between weekday and weekend volume.
Inference for volume by weekday
The 10,000 iteration RJAGS simulation output, rail_sim_1, is in your workspace along with a data frame of the Markov chain output:
> head(rail_chains_1, 2)
a b.1. b.2. s
1 420.6966 0 -54.30783 118.2328
2 399.5823 0 -52.02570 119.9499
These chains provide 10,000 unique sets of values for a, the typical trail volume on weekend days, and b.2., the contrast between typical weekday and weekend volume. For example, the first set of parameters indicates that there are typically 420.6966 riders on weekend days and 54.30783 fewer riders on weekdays. Thus there are typically 420.6966 - 54.30783 = 366.3888 riders on weekdays. You will utilize these simulation data to make inferences about weekday trail volume.
# Construct a chain of values for the typical weekday volume
rail_chains_1 <- rail_chains_1 %>%
mutate(weekday_mean = a + b.2.)
# Construct a density plot of the weekday chain
ggplot(rail_chains_1, aes(x = weekday_mean)) +
geom_density()
# 95% credible interval for typical weekday volume
quantile(rail_chains_1$weekday_mean, c(0.025, 0.975))
## 2.5% 97.5%
## 318.9520 381.7573
Excellent! You’ve shown that there’s a 95% posterior chance that the typical weekday volume is between 319 and 382 trail users.
Re-examining the RailTrail data
In your previous work, you observed that rail-trail volume tends to be lower on a weekday than on a weekend. Some of the variability in volume might also be explained by outside temperature. For example, we might expect trail volume to increase on warm, pleasant days.
The RailTrail data set in your workspace includes hightemp, the observed high temperature (in degrees F) for each of the 90 days in the study period. You will use these data to explore the associations between trail volume, weekday status, and hightemp.
# Construct a plot of volume by hightemp & weekday
ggplot(RailTrail, aes(y = volume, x = hightemp, color = weekday)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Nice work. Notice that for the 90 days in the study period, volume tends to increase with temperature. Further, volume tends to be higher on weekends than on weekdays of the same temperature.
RJAGS simulation for multivariate regression
Consider the following Bayesian model of volume Y_i by weekday status X_i and temperature Z_i: Y_i ~ N(m_i, s^2) with m_i = a + b[X_i] + c*Z_i, under the priors specified in the model below.
Your previous exploration of the relationship between volume, weekday, and hightemp in the RailTrail data provided some insight into this relationship. You will combine this with insight from the priors to develop a posterior model of this relationship using RJAGS. The RailTrail data are in your workspace.
# DEFINE the model
rail_model_2 <- "model{
# Likelihood model for Y[i]
for(i in 1:length(Y)) {
Y[i] ~ dnorm(m[i], s^(-2))
m[i] <- a + b[X[i]] + c * Z[i]
}
# Prior models for a, b, c, s
a ~ dnorm(0, 200^(-2))
b[1] <- 0
b[2] ~ dnorm(0, 200^(-2))
c ~ dnorm(0, 20^(-2))
s ~ dunif(0, 200)
}"
# COMPILE the model
rail_jags_2 <- jags.model(textConnection(rail_model_2),
data = list(Y = RailTrail$volume, X = RailTrail$weekday, Z = RailTrail$hightemp),
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 10))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 90
## Unobserved stochastic nodes: 4
## Total graph size: 385
##
## Initializing model
# SIMULATE the posterior
rail_sim_2 <- coda.samples(model = rail_jags_2, variable.names = c("a","b","c","s"), n.iter = 10000)
# Store the chains in a data frame
rail_chains_2 <- data.frame(rail_sim_2[[1]])
# PLOT the posterior
plot(rail_sim_2)
Wow! You’ve successfully defined, compiled, and simulated your multivariate Bayesian regression model.
summary(rail_sim_2)
##
## Iterations = 1001:11000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## a 36.592 60.6238 0.606238 4.19442
## b[1] 0.000 0.0000 0.000000 0.00000
## b[2] -49.610 23.4930 0.234930 0.55520
## c 5.417 0.8029 0.008029 0.05849
## s 103.434 7.9418 0.079418 0.11032
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## a -83.443 -4.631 38.160 78.413 150.306
## b[1] 0.000 0.000 0.000 0.000 0.000
## b[2] -96.181 -65.468 -49.547 -33.633 -3.644
## c 3.865 4.865 5.402 5.965 7.007
## s 89.466 97.766 102.875 108.503 120.205
Typical volume is ~50 users less on weekdays than on weekends of the same temperature. The b[2] coefficient represents the relationship between volume and weekday status when controlling for hightemp, that is, when comparing days with similar temperatures.
Posterior inference for multivariate regression
The 10,000 iteration RJAGS simulation output, rail_sim_2, is in your workspace along with a data frame of the Markov chain output:
> head(rail_chains_2, 2)
a b.1. b.2. c s
1 49.76954 0 -12.62112 4.999202 111.02247
2 30.22211 0 -3.16221 4.853491 98.11892
You will use these 10,000 unique sets of parameter values to summarize the posterior mean trend in the relationships between trail volume, weekday status, and hightemp.
# Plot the posterior mean regression models
ggplot(RailTrail, aes(y = volume, x = hightemp, color = weekday)) +
geom_point() +
geom_abline(intercept = mean(rail_chains_2$a), slope = mean(rail_chains_2$c), color = "red") +
geom_abline(intercept = mean(rail_chains_2$a) + mean(rail_chains_2$b.2.), slope = mean(rail_chains_2$c), color = "turquoise3")
Your posterior analysis suggests that there’s a positive association between volume and temperature. Further, the typical weekday volume is less than that on weekends of the same temperature.
RJAGS simulation for Poisson regression
In the previous video we engineered a Poisson regression model of volume Y_i by weekday status X_i and temperature Z_i: Y_i ~ Pois(l_i) with log(l_i) = a + b[X_i] + c*Z_i, under the priors specified in the model below. Combining your insights from the observed RailTrail data and these priors, you will define, compile, and simulate a posterior model of this relationship using RJAGS. To challenge yourself in this last RJAGS simulation of the course, you’ll be provided with less helpful code than usual!
The RailTrail data are in your workspace.
# DEFINE the model
poisson_model <- "model{
# Likelihood model for Y[i]
for(i in 1:length(Y)) {
Y[i] ~ dpois(l[i])
log(l[i]) <- a + b[X[i]] + c * Z[i]
}
# Prior models for a, b, c
a ~ dnorm(0, 200^(-2))
b[1] <- 0
b[2] ~ dnorm(0, 2^(-2))
c ~ dnorm(0, 2^(-2))
}"
# COMPILE the model
poisson_jags <- jags.model(textConnection(poisson_model),
data = list(Y = RailTrail$volume, X = RailTrail$weekday, Z = RailTrail$hightemp),
inits = list(.RNG.name = "base::Wichmann-Hill", .RNG.seed = 10))
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 90
## Unobserved stochastic nodes: 3
## Total graph size: 441
##
## Initializing model
# SIMULATE the posterior
poisson_sim <- coda.samples(model = poisson_jags, variable.names = c("a","b","c"), n.iter = 10000)
# Store the chains in a data frame
poisson_chains <- data.frame(poisson_sim[[1]])
# PLOT the posterior
plot(poisson_sim)
You’ve successfully defined, compiled, and simulated a Bayesian Poisson regression model!
Plotting the Poisson regression model
Recall the likelihood structure for your Bayesian Poisson regression model of volume Y_i by weekday status X_i and temperature Z_i: Y_i ~ Pois(l_i) where log(l_i) = a + b[X_i] + c*Z_i. Your 10,000 iteration RJAGS simulation of the model posterior, poisson_sim, is in your workspace along with a data frame of the Markov chain output:
> head(poisson_chains, 2)
a b.1. b.2. c
1 5.019807 0 -0.1222143 0.01405269
2 5.018642 0 -0.1217608 0.01407691
You will use these results to plot the posterior Poisson regression trends. These nonlinear trends can be added to a ggplot() using stat_function(). For example, specifying fun = function(x){x^2} would return a quadratic trend line.
# Plot the posterior mean regression models
ggplot(RailTrail, aes(x = hightemp, y = volume, color = weekday)) +
geom_point() +
stat_function(fun = function(x){exp(mean(poisson_chains$a) + mean(poisson_chains$c) * x)}, color = "red") +
stat_function(fun = function(x){exp(mean(poisson_chains$a) + mean(poisson_chains$b.2.) + mean(poisson_chains$c) * x)}, color = "turquoise3")
Excellent work. Notice that unlike the Normal regression trend, the Poisson regression trend is curved.
Inference for the Poisson rate parameter
Again, recall the likelihood structure for your Bayesian Poisson regression model of volume Y_i by weekday status X_i and temperature Z_i: Y_i ~ Pois(l_i) where log(l_i) = a + b[X_i] + c*Z_i. Your 10,000 iteration RJAGS simulation of the model posterior, poisson_sim, is in your workspace along with a data frame of the Markov chain output:
> head(poisson_chains, 2)
a b.1. b.2. c
1 5.019807 0 -0.1222143 0.01405269
2 5.018642 0 -0.1217608 0.01407691
Using these 10,000 unique sets of posterior plausible values for parameters a, b[2], and c, you will make inferences about the typical trail volume on 80 degree days.
# Calculate the typical volume on 80 degree weekends & 80 degree weekdays
poisson_chains <- poisson_chains %>%
mutate(l_weekend = exp(a + c * 80)) %>%
mutate(l_weekday = exp(a + b.2. + c * 80))
# Construct a 95% CI for typical volume on 80 degree weekend
quantile(poisson_chains$l_weekend, c(0.025, 0.975))
## 2.5% 97.5%
## 462.1431 479.3101
# Construct a 95% CI for typical volume on 80 degree weekday
quantile(poisson_chains$l_weekday, c(0.025, 0.975))
## 2.5% 97.5%
## 407.4644 420.7102
This was your first analysis that incorporated transformations. It wasn’t so bad (so long as you remember to back-transform)!
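A useful property behind the back-transformation: because exp() is monotone, computing the interval on the log scale and then exponentiating gives the same answer as taking quantiles of the exponentiated chain above. A quick sketch for the 80 degree weekend trend, assuming poisson_chains from above:
# Sketch: back-transform a log-scale credible interval
exp(quantile(poisson_chains$a + poisson_chains$c * 80, c(0.025, 0.975)))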
Poisson posterior prediction
Your l_weekday variable reflects the trend in volume on 80 degree weekdays:
> head(poisson_chains, 2)
a b.1. b.2. c l_weekend l_weekday
1 5.0198 0 -0.1222 0.0141 465.924 412.324
2 5.0186 0 -0.1218 0.0141 466.284 412.829
Now that you understand the trend, let’s make some predictions! Specifically, let’s predict trail volumes on the next 80 degree weekday. To do so, you must take into account individual variability from the trend, modeled by the Poisson likelihood Y ~ Pois(l_weekday).
Using rpois(n, lambda) for sample size n and rate parameter lambda, you will simulate Poisson predictions of volume under each value of the posterior plausible trend in poisson_chains.
# Simulate weekday predictions under each parameter set
poisson_chains <- poisson_chains %>%
mutate(Y_weekday = rpois(n = 10000, lambda = l_weekday))
# Construct a density plot of the posterior weekday predictions
ggplot(poisson_chains, aes(x = Y_weekday)) +
geom_density()
# Posterior probability that weekday volume is less 400
mean(poisson_chains$Y_weekday < 400)
## [1] 0.2378
Recall that one of our motivations in applying the Poisson model was to accommodate the count nature of the volume data. This trickled down to your volume predictions Y_weekday - notice that these predictions, like the raw volume data, are discrete counts.
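A tiny check of that last point, assuming poisson_chains from above: the simulated predictions are whole-number counts, unlike the Normal model’s continuous predictions.
# Sketch: confirm the predictions are integer-valued counts
all(poisson_chains$Y_weekday == round(poisson_chains$Y_weekday))
head(poisson_chains$Y_weekday)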