The purpose of this tutorial is to briefly show how to write models directly in Stan, rather than using the convenient rstanarm functions. Doing this opens up almost unlimited opportunities to model more complex data structures.
Please note that this tutorial is by no means exhaustive, and offers only a quick glimpse at the Stan modeling language. For more details, I recommend the excellent collection of Stan examples accessible via the Stan website. You should also read through the package vignette for a more detailed walkthrough of the process of fitting a model using Stan: vignette(topic = "rstan", package = "rstan").
The Stan language is different from R. You will write a Stan model in a text file or a script in RStudio, and save it on your computer before you can run Stan from within R. Your call to Stan (within R) will then reference your Stan model file. You can either store your model as a character object in R, or as a separate text file with the .stan
extension. Although you can use a text file with any extension, using the .stan
extension enables syntax highlighting in RStudio. Try opening the included file angell.stan
in RStudio to see for yourself!
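For instance, a model stored as a character object in R might look like the following (a minimal sketch with hypothetical variable names, not the model used in this tutorial):

```r
# A minimal Stan model stored as an R character string
# (alternative to a separate .stan file; passed to stan() via model_code)
stan_model_string <- "
data {
  int<lower=0> N;   // number of observations
  vector[N] y;      // outcome
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"
```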
A Stan program consists of several blocks, each of which performs a specific function. Every model needs data, parameters, and model blocks, and programs can also contain transformed parameters and generated quantities blocks.
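A skeleton of this block structure (contents omitted) looks like this:

```stan
data {
  // fixed quantities: data and dimensions
}
parameters {
  // random variables to estimate
}
transformed parameters {
  // optional: deterministic functions of parameters
}
model {
  // priors and likelihood
}
generated quantities {
  // optional: derived quantities, e.g. predictions
}
```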
The Stan language has a few quirks that you need to follow. It allows multi-line comments beginning with /* and ending with */, and single-line comments beginning with //. As before, we recommend commenting your code liberally to facilitate communication with coauthors and your future self.

The data block contains anything that is a fixed quantity. This includes your explanatory and response variables, as well as any indexing or dimensionality variables your model will use. Stan requires you to explicitly declare any object that you will be referencing in your model. This means that we have to assign specific data types to our objects. The two basic data types in Stan are int and real, which correspond to integer and numeric in R. While R doesn't care if you have a dichotomous response variable stored in a numeric object, Stan does, and trying to estimate a logit on a binary dependent variable stored as a real object will cause an error.
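For example, a data block for a logit model would declare the binary outcome as an integer with appropriate bounds (a hypothetical sketch, not part of this tutorial's model):

```stan
data {
  int<lower=0> N;
  int<lower=0, upper=1> y[N];  // binary outcome: must be int, not real
  vector[N] x;                 // a continuous predictor can stay real-valued
}
```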
For the examples in this tutorial, we’re returning to the Angell data (from the carData
package) that you’ve seen before in this course.
data {
/* dimensionality of data and model */
int<lower=0> N; // number of observations
/* response and explanatory variables */
vector[N] moral; // response variable
vector[N] hetero; // x1
vector[N] mobility; // x2
}
This model will carry out a linear regression of the dependent variable \(y\) on the two predictors.
The parameters block contains anything that is a random variable you wish to estimate. Due to the way that Stan's sampler works, it cannot estimate discrete parameters. If you have a model with discrete parameters, you have to marginalize them out before writing the model in Stan. Practically, what this means is that your parameters will always be reals and never ints.
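As an aside, marginalizing a discrete parameter means summing over its possible values inside the model block. A minimal sketch for a two-component normal mixture (hypothetical names; lambda is the mixing proportion, and the discrete component indicator is summed out with Stan's log_mix function):

```stan
model {
  for (n in 1:N) {
    // log_mix(lambda, lp1, lp2) = log(lambda*exp(lp1) + (1-lambda)*exp(lp2))
    target += log_mix(lambda,
                      normal_lpdf(y[n] | mu1, sigma),
                      normal_lpdf(y[n] | mu2, sigma));
  }
}
```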
parameters {
/* parameters to be estimated by model */
vector[3] beta; // regression coefficients
real<lower=0> sigma; // standard deviation
}
This model will estimate 3 \(\beta\) coefficients (one intercept and two slope coefficients) and the error term \(\sigma\).

The model code below should look familiar: \(y \sim \mathcal{N}(\mu, \sigma)\), where \(\mu = \alpha + \beta_1 x_1 + \beta_2 x_2\). Here, I write the equation directly in the slot for \(\mu\): \(y \sim \mathcal{N}(\alpha + \beta_1 x_1 + \beta_2 x_2, \sigma)\), and I use beta[1], beta[2], and beta[3] instead of \(\alpha\), \(\beta_1\), and \(\beta_2\) to put my coefficients into a vector.
model {
  /* data */
  moral ~ normal(beta[1] + beta[2] * hetero + beta[3] * mobility,
                 sigma); // response standard deviation
  /* priors */
  beta[1] ~ normal(0, 10); // prior on intercept
  for(i in 2:3){
    beta[i] ~ normal(0, 2.5); // prior on coefficients
  }
  sigma ~ normal(0, 10); // prior on standard deviation
}
This model has a \(\mathcal{N}(0, 2.5)\) prior on the coefficients and a \(\mathcal{N}(0, 10)\) prior on the intercept. If you don’t specify a prior on a bounded parameter (e.g., here sigma), then Stan uses an implicit uniform prior. You can refer to the “prior choice recommendations” by the Stan developers for suggestions. Note here that although I’m specifying a normal distribution for \(\sigma\), the constraint real<lower=0> sigma
in the parameter block above ensures that Stan will only sample from positive values for \(\sigma\).
We can also include a transformed parameters
block between the parameters
and model
blocks, and a generated quantities
block after the model
block. Transformed parameters are often used to speed up sampling, or when we are fixing the values of specific elements within parameters in latent variable models as an identification restriction. Generated quantities can include values such as \(\hat{y}\)s and residuals, or samples from the posterior predictive distribution. Both of these blocks are executed every iteration, allowing us to monitor objects in them and obtain a full posterior distribution for quantities of interest. In this case, our model file has a generated quantities block that calculates \(\hat{y}\) and the residual. Don’t forget the newline at the end of the file!
generated quantities {
/* define predictions and residuals */
vector[N] y_hat;
vector[N] resid;
/* calculate predictions and residuals */
y_hat = beta[1] + beta[2] * hetero + beta[3] * mobility;
resid = moral - y_hat;
}
This completes the most basic setup of a Stan model: data, parameters, model, and (the optional) generated quantities block.
Once we have our model written out, we use R to prepare our data, decide which parameters we're interested in monitoring, and set up our sampler. This is done using the rstan package. One advantage of Stan is that it natively supports parallel computing and makes it exceptionally easy to run multiple chains in parallel. The following code loads the rstan package and sets it to use all of your machine's cores when running parallel chains. When you load rstan, R will give you a message about the version you're using, as well as suggestions to speed up estimation. The next two lines implement these suggestions.
library("rstan")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
Let’s use the Angell data to estimate a linear regression model.
data <- rio::import("https://www.jkarreth.net/files/angell.csv")
head(data)
moral hetero mobility region city
1 19.0 20.6 15.0 E Rochester
2 17.0 15.6 20.2 E Syracuse
3 16.4 22.1 13.6 E Worcester
4 16.2 14.0 14.8 E Erie
5 15.8 17.4 17.6 MW Milwaukee
6 15.3 27.9 17.5 E Bridgeport
Stan requires variables in the form of a named list, where the names match the variable names declared in the data block.
angell_data_list <- list(moral = data$moral,
                         hetero = data$hetero,
                         mobility = data$mobility,
                         N = nrow(data))
str(angell_data_list)
List of 4
$ moral : num [1:43] 19 17 16.4 16.2 15.8 15.3 15.2 14.3 14.2 14.1 ...
$ hetero : num [1:43] 20.6 15.6 22.1 14 17.4 27.9 22.3 23.7 10.6 12.7 ...
$ mobility: num [1:43] 15 20.2 13.6 14.8 17.6 17.5 14.7 23.8 19.4 31.9 ...
$ N : int 43
The stan()
command automatically compiles the model and runs the sampler for a specified number of iterations; we don’t have to compile the model into an object and then use a separate function to obtain samples from it.
The stan()
function has many arguments, but there are a few key ones.
We use the file argument if we saved our model in a separate text file with the .stan extension, or model_code if our model is a character object in the R environment. We pass the data argument our named list of variables. We specify the parameters to monitor with the pars argument (leaving it blank will result in samples for all parameters). The chains argument tells the function how many different chains to run, and the iter argument tells the function how many iterations to run the sampler for. You can provide starting values with the init argument, similar to what you did using JAGS/BUGS. See ?stan for the different options to provide starting values. If no starting values are specified, Stan generates random starting values. If you provide a seed for the RNG, the randomly generated starting values will be the same each time the command is initiated.

Stan refers to the initial discarded draws at the beginning of a sampling run as the warmup. Hamiltonian Monte Carlo only has advantages over Gibbs sampling when the auxiliary parameters are properly tuned, so the warmup phase in Stan always involves adjustments to the sampler. The samples during the warmup phase are saved, so we can diagnose exploration of the parameter space and check to see if chains from different starting values converged to the same posterior mean. The default number of warmup iterations is 1/2 of iter.
angell_stan_fit <- stan(file = "angell.stan",        # Stan program
                        data = angell_data_list,     # named list of data
                        pars = c("beta"),            # parameters to monitor
                        chains = 4,                  # number of Markov chains
                        iter = 2000,                 # total number of iterations per chain
                        seed = 123)                  # seed for random number generator
You might get a lot of messages about C++ code while the model compiles, but you can ignore them. You might also get a few warning messages about rejected proposals, but unless there are a large number of these errors, you can ignore them as rejected proposals are a normal part of Metropolis and Hamiltonian Monte Carlo sampling.
That was fast! Each chain ran 1000 warmup iterations in an average of 0.072 seconds and 1000 sampling iterations in an average of 0.053 seconds. And remember that we used a non-conjugate prior on \(\sigma\).
Once our sampler has finished, the samples are stored in a stanfit
object.
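If you want to work with the raw draws yourself, you can pull them out of the stanfit object, for example with rstan's extract() function (a quick sketch using the fit from above):

```r
# Extract posterior draws for the monitored coefficients
posterior_draws <- rstan::extract(angell_stan_fit, pars = "beta")
str(posterior_draws$beta)  # draws x 3 matrix of beta samples

# as.matrix() returns draws for all monitored quantities in one matrix
beta_mat <- as.matrix(angell_stan_fit)
```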
summary(angell_stan_fit)
$summary
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[1] 19.644 0.029111 1.2289 17.174 18.857 19.672 20.4606 22.0482 1782 1
beta[2] -0.107 0.000342 0.0173 -0.141 -0.118 -0.107 -0.0954 -0.0725 2552 1
beta[3] -0.185 0.000831 0.0368 -0.254 -0.210 -0.187 -0.1621 -0.1075 1963 1
lp__ -57.975 0.038692 1.4498 -61.691 -58.726 -57.632 -56.8815 -56.1613 1404 1
$c_summary
, , chains = chain:1
stats
parameter mean sd 2.5% 25% 50% 75% 97.5%
beta[1] 19.572 1.1685 17.254 18.799 19.567 20.3688 21.8438
beta[2] -0.106 0.0162 -0.138 -0.116 -0.106 -0.0951 -0.0722
beta[3] -0.184 0.0356 -0.249 -0.209 -0.184 -0.1603 -0.1098
lp__ -57.890 1.3853 -61.107 -58.682 -57.594 -56.8172 -56.1496
, , chains = chain:2
stats
parameter mean sd 2.5% 25% 50% 75% 97.5%
beta[1] 19.665 1.2448 17.095 18.891 19.727 20.4832 21.9550
beta[2] -0.107 0.0173 -0.142 -0.118 -0.107 -0.0949 -0.0737
beta[3] -0.187 0.0371 -0.255 -0.210 -0.189 -0.1629 -0.1052
lp__ -58.001 1.4630 -61.892 -58.690 -57.650 -56.9419 -56.2087
, , chains = chain:3
stats
parameter mean sd 2.5% 25% 50% 75% 97.5%
beta[1] 19.548 1.2405 17.116 18.718 19.559 20.3689 22.0095
beta[2] -0.107 0.0167 -0.138 -0.117 -0.107 -0.0958 -0.0747
beta[3] -0.182 0.0377 -0.251 -0.207 -0.184 -0.1586 -0.1012
lp__ -57.875 1.4454 -61.610 -58.533 -57.480 -56.8065 -56.1142
, , chains = chain:4
stats
parameter mean sd 2.5% 25% 50% 75% 97.5%
beta[1] 19.791 1.2468 17.329 18.993 19.816 20.5727 22.3957
beta[2] -0.108 0.0188 -0.143 -0.121 -0.108 -0.0959 -0.0704
beta[3] -0.189 0.0366 -0.261 -0.212 -0.189 -0.1643 -0.1147
lp__ -58.133 1.4909 -61.867 -58.956 -57.814 -57.0092 -56.2085
The rstan
package also includes a number of useful functions for diagnosing convergence and ensuring that we’re sampling from the posterior distribution. We can use traceplot()
to evaluate the mixing of the chains.
traceplot(angell_stan_fit)
If we set inc_warmup = TRUE
, then we can view all samples for each parameter, giving us increased confidence that we have fully explored the parameter space.
traceplot(angell_stan_fit, inc_warmup = TRUE)
We can also evaluate the autocorrelation between samples for each parameter at different lags. Given the nature of Markov chains, there has to be some autocorrelation at the first lag, but we want to see a sharp drop off as the lags increase. Since we are performing this operation on multiple chains, these results are the average autocorrelation values from all four chains.
stan_ac(angell_stan_fit, lags = 15)
What if we want a plot of the potential scale reduction factor \(\hat{R}\) at each iteration of the sampler? Unfortunately, rstan
doesn’t include a built-in function to produce this plot. Fortunately, it does provide an easy way to convert a stanfit
object to a coda
object using the As.mcmc.list()
function.
coda::gelman.plot(As.mcmc.list(angell_stan_fit), autoburnin = FALSE)
The shinystan package (Stan Development Team 2017) offers a convenient way to inspect the posterior distribution(s) of parameters in your web browser, using the R package shiny as a backend. This is a great tool for quick model checking, even though you will want to use the commands shown above to produce plots for a manuscript and/or appendix.
shinystan is installed automatically when you install rstanarm. To use it, simply call launch_shinystan on your fitted model:
library("shinystan")
launch_shinystan(angell_stan_fit)
After a few seconds, your browser should show the shinystan interface:
Click on any of the buttons DIAGNOSE, ESTIMATE, EXPLORE, or MORE to proceed. Shinystan can even produce a table in LaTeX format, and you can generate and export a number of figures without typing any code. (For reproducibility purposes, this is not part of your desired workflow, but Shinystan does allow for a fast and comprehensive look at convergence diagnostics.)
Now that we’re confident that our samples are actually drawn from a posterior distribution of converging chains, we need to communicate the information contained in them to our readers. Here, you can use the same approach we’ve used before, for instance create density plots of the coefficients. Refer to prior labs and lectures for more information.
As before, you can use the mcmcTab()
function from the BayesPostEst package to create a table:
library("BayesPostEst")
mcmcTab(angell_stan_fit)
Variable Median SD Lower Upper
1 beta[1] 19.672 1.229 17.174 22.048
2 beta[2] -0.107 0.017 -0.141 -0.072
3 beta[3] -0.187 0.037 -0.254 -0.107
4 lp__ -57.632 1.450 -61.691 -56.161
All the other functions you saw in Lab 2 work for rstan objects as well.
Finally, you can use similar plotting functions as encountered in prior labs and lectures to visualize the posterior distribution(s) of parameters. Because rstanarm was built for maximum convenience, you could plot objects produced by rstanarm directly. For rstan objects, you need to first load the bayesplot package to produce plots, and then convert the rstan object into a matrix. You can do this inside the call to the mcmc_areas() plotting function, or beforehand.
library("bayesplot"); library("ggplot2")
mcmc_areas(as.matrix(angell_stan_fit),
pars = c("beta[2]", "beta[3]")) +
xlim(-0.4, 0.1) +
geom_vline(xintercept = 0) +
scale_y_discrete(labels = c("Heterogeneity", "Mobility"))
For more information on plotting rstan objects, please read the excellent vignettes on the bayesplot website.
Stan is very well documented; check out the Stan website as a portal to a plethora of resources. There is also a Stan discussion forum with a lot of examples of estimation problems that the Stan developers helped solve.
There are a few excellent sources for Stan model code that make it easy to translate your skills from this workshop (and your JAGS/BUGS code) to Stan.
If you fit models with the brms package, you can also extract the Stan code underlying a fitted model with its stancode function!