The purpose of this tutorial is to briefly show how to write models directly in Stan, rather than using the convenient rstanarm functions. Doing this opens up almost unlimited opportunities to model more complex data structures.
Please note that this tutorial is by no means exhaustive, and offers only a quick glimpse at the Stan modeling language. For more details, I recommend the excellent collection of Stan examples accessible via the Stan website. You should also read through the package vignette for a more detailed walkthrough of the process of fitting a model using Stan: vignette(topic = "rstan", package = "rstan").
The Stan language is different from R. You will write a Stan model in a text file or a script in RStudio, and save it on your computer before you can run Stan from within R. Your call to Stan (within R) will then reference your Stan model file. You can either store your model as a character object in R, or as a separate text file with the .stan
extension. Although you can use a text file with any extension, using the .stan
extension enables syntax highlighting in RStudio. Try opening the included file angell.stan
in RStudio to see for yourself!
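For instance, a model stored as a character object in R might look like the following (a minimal sketch with hypothetical variable names, not the model used in this tutorial):

```r
# A minimal Stan model stored as an R character string
# (alternative to a separate .stan file; passed to stan() via model_code)
stan_model_string <- "
data {
  int<lower=0> N;   // number of observations
  vector[N] y;      // outcome
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"
```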
A Stan program consists of several blocks, each of which performs a specific function. Every model needs data, parameters, and model blocks, and programs can also contain transformed parameters and generated quantities blocks.
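A skeleton of this block structure (contents omitted) looks like this:

```stan
data {
  // fixed quantities: data and dimensions
}
parameters {
  // random variables to estimate
}
transformed parameters {
  // optional: deterministic functions of parameters
}
model {
  // priors and likelihood
}
generated quantities {
  // optional: derived quantities, e.g. predictions
}
```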
The Stan language has a few quirks that you need to follow. It allows multi-line comments beginning with /* and ending with */, and single-line comments beginning with //. As before, we recommend commenting your code liberally to facilitate communication with coauthors and your future self.

The data block contains anything that is a fixed quantity. This includes your explanatory and response variables, as well as any indexing or dimensionality variables your model will use. Stan requires you to explicitly declare any object that you will be referencing in your model. This means that we have to assign specific data types to our objects. The two basic data types in Stan are int and real, which correspond to integer and numeric in R. While R doesn't care if you have a dichotomous response variable stored in a numeric object, Stan does, and trying to estimate a logit on a binary dependent variable stored as a real object will cause an error.
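For example, a data block for a logit model would declare the binary outcome as an integer with appropriate bounds (a hypothetical sketch, not part of this tutorial's model):

```stan
data {
  int<lower=0> N;
  int<lower=0, upper=1> y[N];  // binary outcome: must be int, not real
  vector[N] x;                 // a continuous predictor can stay real-valued
}
```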
For the examples in this tutorial, we’re returning to the Angell data (from the carData
package) that you’ve seen before in this course.
data {
/* dimensionality of data and model */
int<lower=0> N; // number of observations
/* response and explanatory variables */
vector[N] moral; // response variable
vector[N] hetero; // x1
vector[N] mobility; // x2
}
This model will carry out a linear regression of the dependent variable \(y\) on the two predictors.
The parameters block contains anything that is a random variable you wish to estimate. Due to the way that Stan's sampler works, it cannot estimate discrete parameters. If you have a model with discrete parameters, you have to marginalize them out before writing the model in Stan. Practically, what this means is that your parameters will always be reals and never ints.
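As an aside, marginalizing a discrete parameter means summing over its possible values inside the model block. A minimal sketch for a two-component normal mixture (hypothetical names; lambda is the mixing proportion, and the discrete component indicator is summed out with Stan's log_mix function):

```stan
model {
  for (n in 1:N) {
    // log_mix(lambda, lp1, lp2) = log(lambda*exp(lp1) + (1-lambda)*exp(lp2))
    target += log_mix(lambda,
                      normal_lpdf(y[n] | mu1, sigma),
                      normal_lpdf(y[n] | mu2, sigma));
  }
}
```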
parameters {
/* parameters to be estimated by model */
vector[3] beta; // regression coefficients
real<lower=0> sigma; // standard deviation
}
This model will estimate 3 \(\beta\) coefficients (one intercept and two slope coefficients) and the error term \(\sigma\).

The model code below should look familiar: \(y \sim \mathcal{N}(\mu, \sigma)\), where \(\mu = \alpha + \beta_1 x_1 + \beta_2 x_2\). Here, I write the equation directly in the slot for \(\mu\): \(y \sim \mathcal{N}(\alpha + \beta_1 x_1 + \beta_2 x_2, \sigma)\), and I use beta[1], beta[2], and beta[3] instead of \(\alpha\), \(\beta_1\), and \(\beta_2\) to put my coefficients into a vector.
model {
  /* data */
  moral ~ normal(beta[1] + beta[2] * hetero + beta[3] * mobility,
                 sigma); // response standard deviation
  /* priors */
  beta[1] ~ normal(0, 10); // prior on intercept
  for(i in 2:3){
    beta[i] ~ normal(0, 2.5); // prior on coefficients
  }
  sigma ~ normal(0, 10); // prior on standard deviation
}
This model has a \(\mathcal{N}(0, 2.5)\) prior on the coefficients and a \(\mathcal{N}(0, 10)\) prior on the intercept. If you don’t specify a prior on a bounded parameter (e.g., here sigma), then Stan uses an implicit uniform prior. You can refer to the “prior choice recommendations” by the Stan developers for suggestions. Note here that although I’m specifying a normal distribution for \(\sigma\), the constraint real<lower=0> sigma
in the parameter block above ensures that Stan will only sample from positive values for \(\sigma\).
We can also include a transformed parameters
block between the parameters
and model
blocks, and a generated quantities
block after the model
block. Transformed parameters are often used to speed up sampling, or when we are fixing the values of specific elements within parameters in latent variable models as an identification restriction. Generated quantities can include values such as \(\hat{y}\)s and residuals, or samples from the posterior predictive distribution. Both of these blocks are executed every iteration, allowing us to monitor objects in them and obtain a full posterior distribution for quantities of interest. In this case, our model file has a generated quantities block that calculates \(\hat{y}\) and the residual. Don’t forget the newline at the end of the file!
generated quantities {
/* define predictions and residuals */
vector[N] y_hat;
vector[N] resid;
/* calculate predictions and residuals */
y_hat = beta[1] + beta[2] * hetero + beta[3] * mobility;
resid = moral - y_hat;
}
This completes the most basic setup of a Stan model: data, parameters, model, and (the optional) generated quantities block.
Once we have our model written out, we use R to prepare our data, decide which parameters we're interested in monitoring, and set up our sampler. This is done using the rstan package. One advantage of Stan is that it natively supports parallel computing and makes it exceptionally easy to run multiple chains in parallel. The following code loads the rstan package and sets it to use all of your machine's cores when running parallel chains. When you load rstan, R will give you a message about the version you're using, as well as suggestions to speed up estimation. The next two lines implement these suggestions.
library("rstan")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
Let’s use the Angell data to estimate a linear regression model.
data <- rio::import("https://www.jkarreth.net/files/angell.csv")
head(data)
moral hetero mobility region city
1 19.0 20.6 15.0 E Rochester
2 17.0 15.6 20.2 E Syracuse
3 16.4 22.1 13.6 E Worcester
4 16.2 14.0 14.8 E Erie
5 15.8 17.4 17.6 MW Milwaukee
6 15.3 27.9 17.5 E Bridgeport
Stan requires variables in the form of a named list, where the names match the variable names declared in the data block.
angell_data_list <- list(moral = data$moral,
                         hetero = data$hetero,
                         mobility = data$mobility,
                         N = nrow(data))
str(angell_data_list)
List of 4
$ moral : num [1:43] 19 17 16.4 16.2 15.8 15.3 15.2 14.3 14.2 14.1 ...
$ hetero : num [1:43] 20.6 15.6 22.1 14 17.4 27.9 22.3 23.7 10.6 12.7 ...
$ mobility: num [1:43] 15 20.2 13.6 14.8 17.6 17.5 14.7 23.8 19.4 31.9 ...
$ N : int 43
The stan()
command automatically compiles the model and runs the sampler for a specified number of iterations; we don’t have to compile the model into an object and then use a separate function to obtain samples from it.
The stan()
function has many arguments, but there are a few key ones.
We use the file argument if we saved our model in a separate text file with the .stan extension, or model_code if our model is a character object in the R environment. We pass the data argument our named list of variables. We specify the parameters to monitor with the pars argument (leaving it blank will result in samples for all parameters). The chains argument tells the function how many different chains to run, and the iter argument tells the function how many iterations to run the sampler for. You can provide starting values with the init argument, similar to what you did using JAGS/BUGS. See ?stan for the different options to provide starting values. If no starting values are specified, Stan generates random starting values. If you provide a seed for the RNG, the randomly generated starting values will be the same each time the command is initiated.

Stan refers to the initial discarded draws at the beginning of a sampling run as the warmup. Hamiltonian Monte Carlo only has advantages over Gibbs sampling when the auxiliary parameters are properly tuned, so the warmup phase in Stan always involves adjustments to the sampler. The samples during the warmup phase are saved, so we can diagnose exploration of the parameter space and check to see if chains from different starting values converged to the same posterior mean. The default number of warmup iterations is 1/2 of iter.
angell_stan_fit <- stan(file = "angell.stan",        # Stan program
                        data = angell_data_list,     # named list of data
                        pars = c("beta"),            # parameters to monitor
                        chains = 4,                  # number of Markov chains
                        iter = 2000,                 # total number of iterations per chain
                        seed = 123)                  # seed for random number generator
You might get a lot of messages about C++ code while the model compiles, but you can ignore them. You might also get a few warning messages about rejected proposals, but unless there are a large number of these errors, you can ignore them as rejected proposals are a normal part of Metropolis and Hamiltonian Monte Carlo sampling.
That was fast! Each chain ran 1000 warmup iterations in an average of 0.072 seconds and 1000 sampling iterations in an average of 0.053 seconds. And remember that we used a non-conjugate prior on \(\sigma\).
Once our sampler has finished, the samples are stored in a stanfit
object.
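If you want to work with the raw draws yourself, you can pull them out of the stanfit object, for example with rstan's extract() function (a quick sketch using the fit from above):

```r
# Extract posterior draws for the monitored coefficients
posterior_draws <- rstan::extract(angell_stan_fit, pars = "beta")
str(posterior_draws$beta)  # draws x 3 matrix of beta samples

# as.matrix() returns draws for all monitored quantities in one matrix
beta_mat <- as.matrix(angell_stan_fit)
```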
summary(angell_stan_fit)
$summary
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[1] 19.644 0.029111 1.2289 17.174 18.857 19.672 20.4606 22.0482 1782 1
beta[2] -0.107 0.000342 0.0173 -0.141 -0.118 -0.107 -0.0954 -0.0725 2552 1
beta[3] -0.185 0.000831 0.0368 -0.254 -0.210 -0.187 -0.1621 -0.1075 1963 1
lp__ -57.975 0.038692 1.4498 -61.691 -58.726 -57.632 -56.8815 -56.1613 1404 1
$c_summary
, , chains = chain:1
stats
parameter mean sd 2.5% 25% 50% 75% 97.5%
beta[1] 19.572 1.1685 17.254 18.799 19.567 20.3688 21.8438
beta[2] -0.106 0.0162 -0.138 -0.116 -0.106 -0.0951 -0.0722
beta[3] -0.184 0.0356 -0.249 -0.209 -0.184 -0.1603 -0.1098
lp__ -57.890 1.3853 -61.107 -58.682 -57.594 -56.8172 -56.1496
, , chains = chain:2
stats
parameter mean sd 2.5% 25% 50% 75% 97.5%
beta[1] 19.665 1.2448 17.095 18.891 19.727 20.4832 21.9550
beta[2] -0.107 0.0173 -0.142 -0.118 -0.107 -0.0949 -0.0737
beta[3] -0.187 0.0371 -0.255 -0.210 -0.189 -0.1629 -0.1052
lp__ -58.001 1.4630 -61.892 -58.690 -57.650 -56.9419 -56.2087
, , chains = chain:3
stats
parameter mean sd 2.5% 25% 50% 75% 97.5%
beta[1] 19.548 1.2405 17.116 18.718 19.559 20.3689 22.0095
beta[2] -0.107 0.0167 -0.138 -0.117 -0.107 -0.0958 -0.0747
beta[3] -0.182 0.0377 -0.251 -0.207 -0.184 -0.1586 -0.1012
lp__ -57.875 1.4454 -61.610 -58.533 -57.480 -56.8065 -56.1142
, , chains = chain:4
stats
parameter mean sd 2.5% 25% 50% 75% 97.5%
beta[1] 19.791 1.2468 17.329 18.993 19.816 20.5727 22.3957
beta[2] -0.108 0.0188 -0.143 -0.121 -0.108 -0.0959 -0.0704
beta[3] -0.189 0.0366 -0.261 -0.212 -0.189 -0.1643 -0.1147
lp__ -58.133 1.4909 -61.867 -58.956 -57.814 -57.0092 -56.2085
The rstan
package also includes a number of useful functions for diagnosing convergence and ensuring that we’re sampling from the posterior distribution. We can use traceplot()
to evaluate the mixing of the chains.
traceplot(angell_stan_fit)
If we set inc_warmup = TRUE
, then we can view all samples for each parameter, giving us increased confidence that we have fully explored the parameter space.
traceplot(angell_stan_fit, inc_warmup = TRUE)
We can also evaluate the autocorrelation between samples for each parameter at different lags. Given the nature of Markov chains, there has to be some autocorrelation at the first lag, but we want to see a sharp drop off as the lags increase. Since we are performing this operation on multiple chains, these results are the average autocorrelation values from all four chains.
stan_ac(angell_stan_fit, lags = 15)
What if we want a plot of the potential scale reduction factor \(\hat{R}\) at each iteration of the sampler? Unfortunately, rstan
doesn’t include a built-in function to produce this plot. Fortunately, it does provide an easy way to convert a stanfit
object to a coda
object using the As.mcmc.list()
function.
coda::gelman.plot(As.mcmc.list(angell_stan_fit), autoburnin = FALSE)
The shinystan package (Stan Development Team 2017) offers a convenient way to inspect the posterior distribution(s) of parameters in your web browser, using the R package shiny as a backend. This is a great tool for quick model checking, even though you will want to use the commands shown above to produce plots for a manuscript and/or appendix.
shinystan is installed automatically when you install rstanarm. To use it, simply call launch_shinystan on your fitted model:
library("shinystan")
launch_shinystan(angell_stan_fit)
After a few seconds, your browser should show the shinystan interface:
Click on any of the buttons DIAGNOSE, ESTIMATE, EXPLORE, or MORE to proceed. Shinystan can even produce a table in LaTeX format, and you can generate and export a number of figures without typing any code. (For reproducibility purposes, this is not part of your desired workflow, but Shinystan does allow for a fast and comprehensive look at convergence diagnostics.)
Now that we’re confident that our samples are actually drawn from a posterior distribution of converging chains, we need to communicate the information contained in them to our readers. Here, you can use the same approach we’ve used before, for instance create density plots of the coefficients. Refer to prior labs and lectures for more information.
As before, you can use the mcmcTab()
function from the BayesPostEst package to create a table:
library("BayesPostEst")
mcmcTab(angell_stan_fit)
Variable Median SD Lower Upper
1 beta[1] 19.672 1.229 17.174 22.048
2 beta[2] -0.107 0.017 -0.141 -0.072
3 beta[3] -0.187 0.037 -0.254 -0.107
4 lp__ -57.632 1.450 -61.691 -56.161
All the other functions you saw in Lab 2 work for rstan objects as well.
Finally, you can use similar plotting functions as encountered in prior labs and lectures to visualize the posterior distribution(s) of parameters. Because rstanarm was built for maximum convenience, you could plot objects produced by rstanarm directly. For rstan objects, you need to first load the bayesplot package to produce plots, and then convert the rstan object into a matrix. You can do this inside the call to the mcmc_areas() plotting function, or beforehand.
library("bayesplot"); library("ggplot2")
mcmc_areas(as.matrix(angell_stan_fit),
pars = c("beta[2]", "beta[3]")) +
xlim(-0.4, 0.1) +
geom_vline(xintercept = 0) +
scale_y_discrete(labels = c("Heterogeneity", "Mobility"))
For more information on plotting rstan objects, please read the excellent vignettes on the bayesplot website.
Stan is very well documented; check out the Stan website as a portal to a plethora of resources. There is also a Stan discussion forum with a lot of examples of estimation problems that the Stan developers helped solve.
There are a few excellent sources for Stan model code that make it easy to translate your skills from this workshop (and your JAGS/BUGS code) to Stan.
If you fit models with the brms package, you can also extract the Stan code underlying a fitted model with its stancode function!