Lab 5: Writing customized models in Stan via CmdStanR

Applied Bayesian Modeling (ICPSR Summer Program 2025)
Affiliation: Ursinus College

Published: July 20, 2025

The purpose of this tutorial is to briefly show how to write models directly in Stan, rather than using the convenient rstanarm functions. Doing this opens up almost unlimited opportunities to model more complex data structures.

Please note that this tutorial is by no means exhaustive and offers only a quick glimpse at the Stan modeling language. For more details, I recommend the excellent collection of Stan examples accessible via the Stan website. You should also read the package vignette for a more detailed walk-through of the process of fitting a model using Stan: vignette(topic = "cmdstanr", package = "cmdstanr").

Note

You can use CmdStanR as a backend for the brms package, without ever writing Stan code. Refer to our lecture for an example.
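As a minimal sketch of this workflow: assuming the brms package and CmdStan are installed, and using the Angell data that are loaded later in this lab, you can point brms at CmdStanR with a single argument.

```r
# Sketch only: fit the linear model from this lab via brms,
# compiled and sampled through CmdStanR instead of rstan.
# Assumes brms and CmdStan are installed and `data` holds the Angell data.
library("brms")

fit_brms <- brm(moral ~ hetero + mobility,
                data = data,
                backend = "cmdstanr")  # use the CmdStanR backend
```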

Background

The Stan language is different from R. You will write a Stan model in a text file or a script in RStudio, and save it on your computer before you can run Stan from within R. Your call to Stan (within R) will then reference your Stan model file. You can either store your model as a character object in R, or as a separate text file with the .stan extension. Although you can use a text file with any extension, using the .stan extension enables syntax highlighting in RStudio. Try opening the included file angell.stan in RStudio to see for yourself!
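If you prefer to keep the model as a character string in R rather than a separate file, cmdstanr's write_stan_file() writes the string to a .stan file and returns its path, which you can pass directly to cmdstan_model(). A minimal sketch, using a toy normal model (not the Angell model from this lab):

```r
library("cmdstanr")

# A toy model stored as a character string in R
stan_code <- "
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"

# write_stan_file() saves the string as a .stan file and returns its path
mod <- cmdstan_model(write_stan_file(stan_code))
```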

A Stan program consists of several blocks, each of which performs a specific function. Every model needs the data, parameters, and model blocks; a model can also contain optional blocks such as transformed parameters and generated quantities.

The Stan language has a few quirks that you need to follow.

  • Each statement needs to end with a semicolon.
  • The model file needs to end with a new, empty line.
  • Blocks are set off by their title, and the contents of blocks are enclosed in curly braces.
  • The order of blocks is not flexible, and you need to include them in your model file in the order below.
  • Stan supports C-style comments, with multi-line comments opening with /* and ending with */, and single-line comments beginning with //. As before, we recommend commenting your code liberally to facilitate communication with coauthors and your future self.
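Putting these rules together, a minimal Stan program skeleton (a sketch only; the block contents are placeholders) looks like this:

```stan
// single-line comment
/* multi-line
   comment */

data {
  // declarations of fixed quantities go here
}

parameters {
  // declarations of quantities to be estimated go here
}

model {
  // priors and likelihood go here; every statement ends with a semicolon
}
```

Remember that the file itself must end with a newline.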

The Data Block

The data block contains anything that is a fixed quantity. This includes your explanatory and response variables, as well as any indexing or dimensionality variables your model will use. Stan requires you to explicitly declare any object that you will reference in your model, which means that we have to assign specific data types to our objects. The two basic data types in Stan are int and real, which correspond to integer and numeric in R. While R doesn’t care if you have a dichotomous response variable stored in a numeric object, Stan does, and trying to estimate a logit on a binary dependent variable stored as a real object will cause an error.

For the examples in this tutorial, we’re returning to the Angell data (from the carData package) that you’ve seen before in this course.

Code
data {
  
  /* dimensionality of data and model */
  int<lower=0> N; // number of observations

  /* response and explanatory variables */
  vector[N] moral; // response variable
  vector[N] hetero; // x1
  vector[N] mobility; // x2
}

This model will carry out a linear regression of the dependent variable \(y\) (moral) on the two predictors.

The Parameters Block

The parameters block contains anything that is a random variable you wish to estimate. Due to the way that Stan’s sampler works, it cannot estimate discrete parameters. If you have a model with discrete parameters, you have to marginalize them out before writing the model in Stan. Practically, what this means is that your parameters will always be reals and never ints.
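Declarations in the parameters block therefore always use real (or containers of reals), optionally with constraints on the support. A few illustrative declarations (a sketch, not part of the Angell model):

```stan
parameters {
  real alpha;                    // unconstrained scalar
  real<lower=0> sigma;           // positive scalar (e.g., a standard deviation)
  real<lower=0, upper=1> theta;  // a probability
  vector[3] beta;                // a vector of 3 unconstrained reals
}
```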

Code
parameters {
  
  /* parameters to be estimated by model */
  vector[3] beta; // regression coefficients
  real<lower=0> sigma; // standard deviation
  
}

This model will estimate 3 \(\beta\) coefficients (one intercept and two slope coefficients) and the error term \(\sigma\).

The Model Block

The model code below should look familiar: \(y \sim \mathcal{N}(\mu, \sigma)\), where \(\mu = \alpha + \beta_1 x_1 + \beta_2 x_2\). Here, I write the equation directly in the slot for \(\mu\): \(y \sim \mathcal{N}(\alpha + \beta_1 x_1 + \beta_2 x_2, \sigma)\), and I use beta[1], beta[2], and beta[3] instead of \(\alpha\), \(\beta_1\), and \(\beta_2\) to put all my coefficients into a single vector.

Code
model {
  /* likelihood */
  moral ~ normal(beta[1] + beta[2] * hetero + beta[3] * mobility, 
                 sigma);
   
  /* priors */
  beta[1] ~ normal(0, 10); // prior on intercept
  for(i in 2:3){
    beta[i] ~ normal(0, 2.5);  // prior on coefficients
  }
  sigma ~ normal(0, 10); // prior on standard deviation
}

This model has a \(\mathcal{N}(0, 2.5)\) prior on the coefficients and a \(\mathcal{N}(0, 10)\) prior on the intercept. If you don’t specify a prior for a parameter, Stan uses an implicit uniform prior over its declared support. You can refer to the “prior choice recommendations” by the Stan developers for suggestions. Note here that although I’m specifying a normal distribution for \(\sigma\), the constraint real<lower=0> sigma in the parameters block above ensures that Stan will only sample positive values for \(\sigma\) (effectively a half-normal prior).

Other Blocks

We can also include a transformed parameters block between the parameters and model blocks, and a generated quantities block after the model block. Transformed parameters are often used to speed up sampling, or to fix the values of specific elements of parameters in latent variable models as an identification restriction. Generated quantities can include values such as \(\hat{y}\)s and residuals, or samples from the posterior predictive distribution. Both of these blocks are executed at every iteration, allowing us to monitor the objects in them and obtain a full posterior distribution for these quantities of interest. In this case, our model file has a generated quantities block that calculates \(\hat{y}\) and the residual for each observation.

Code
generated quantities {
  
  /* define predictions and residuals */
  vector[N] y_hat;
  vector[N] resid;
  
  /* calculate predictions and residuals */
  y_hat = beta[1] + beta[2] * hetero + beta[3] * mobility;
  resid = moral - y_hat;
  
}

This completes the most basic setup of a Stan model: data, parameters, model, and (the optional) generated quantities block. Don’t forget the new line at the end of the file!

Fitting Bayesian Models using CmdStanR

Once we have our model written out, we use R to prepare our data, decide which parameters we’re interested in monitoring, and set up our sampler. This is done using the cmdstanr package. One advantage of Stan is that it natively supports parallel computing and makes it exceptionally easy to run multiple chains in parallel. The following code loads the cmdstanr package:

Code
library("cmdstanr")

Let’s use the Angell data to estimate a linear regression model.

Code
data <- rio::import("https://www.jkarreth.net/files/angell.csv")
head(data)
  moral hetero mobility region       city
1  19.0   20.6     15.0      E  Rochester
2  17.0   15.6     20.2      E   Syracuse
3  16.4   22.1     13.6      E  Worcester
4  16.2   14.0     14.8      E       Erie
5  15.8   17.4     17.6     MW  Milwaukee
6  15.3   27.9     17.5      E Bridgeport

Stan requires variables in the form of a named list, where the names match the variable names declared in the data block.

Code
angell_data_list <- list(moral = data$moral,
                       hetero = data$hetero,
                       mobility = data$mobility, 
                       N = nrow(data))
str(angell_data_list)
List of 4
 $ moral   : num [1:43] 19 17 16.4 16.2 15.8 15.3 15.2 14.3 14.2 14.1 ...
 $ hetero  : num [1:43] 20.6 15.6 22.1 14 17.4 27.9 22.3 23.7 10.6 12.7 ...
 $ mobility: num [1:43] 15 20.2 13.6 14.8 17.6 17.5 14.7 23.8 19.4 31.9 ...
 $ N       : int 43

We now read the model file into R as an object, which will compile the model:

Code
angell_mod <- cmdstan_model("angell.stan")

Now that angell_mod is a model object, we can access its contents. For example, we can view the model code using angell_mod$print():

Code
angell_mod$print()
data {
  
  /* dimensionality of data and model */
  int<lower=0> N; // number of observations

  /* response and explanatory variables */
  vector[N] moral; // response variable
  vector[N] hetero; // x1
  vector[N] mobility; // x2
}

parameters {
  
  /* parameters to be estimated by model */
  vector[3] beta; // regression coefficients
  real<lower=0> sigma; // standard deviation
  
}

model {
  
   /* likelihood */
   moral ~ normal(beta[1] + beta[2] * hetero + beta[3] * mobility, 
                  sigma);
   
  /* priors */
  beta[1] ~ normal(0, 10); // prior on intercept
  for(i in 2:3){
    beta[i] ~ normal(0, 2.5); // prior on coefficients
  }
  sigma ~ normal(0, 10);  // prior on standard deviation

}

generated quantities {
  
  /* define predictions and residuals */
  vector[N] y_hat;
  vector[N] resid;
  
  /* calculate predictions and residuals */
  y_hat = beta[1] + beta[2] * hetero + beta[3] * mobility;
  resid = moral - y_hat;
  
}

The sample() command will then run the sampler for a specified number of iterations. That function has many arguments, but there are a few key ones.

  • We need to tell the function where our data are, which we do by giving the data argument our named list of variables.
  • The chains argument tells the function how many different chains to run, and parallel_chains will define how many of these chains should be run on separate cores.
  • The iter_sampling argument tells the function how many post-warmup iterations to run the sampler for.
  • You can provide starting values using the init argument, similar to what you did using JAGS/BUGS. See the help file for the sample() method for the different options to provide starting values. If no starting values are specified, Stan generates random starting values. If you provide a seed for the RNG, the randomly generated starting values will be the same each time the command is run.
  • You can find more information about additional arguments/options at the help file for model-method-sample.

Stan refers to the initial discarded draws at the beginning of a sampling run as the warmup. Hamiltonian Monte Carlo only has advantages over Gibbs sampling when its auxiliary parameters are properly tuned, so the warmup phase in Stan always involves adjustments to the sampler. Warmup draws can also be saved (set save_warmup = TRUE; CmdStanR discards them by default) so that we can diagnose exploration of the parameter space and check whether chains from different starting values converged to the same posterior mean. By default, CmdStanR runs 1,000 warmup iterations per chain (the iter_warmup argument).
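A sketch of controlling these settings explicitly, assuming the model and data objects defined above:

```r
# Sketch: set the warmup length explicitly and keep the warmup draws
fit <- angell_mod$sample(
  data = angell_data_list,
  iter_warmup = 1000,    # warmup iterations per chain (the default)
  iter_sampling = 2000,  # post-warmup iterations per chain
  save_warmup = TRUE     # store warmup draws for diagnostics
)
```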

Code
angell_stan_fit <- angell_mod$sample(
           data = angell_data_list,            # named list of data
           seed = 123,                         # seed for random number generator
           chains = 4,                         # number of Markov chains total
           parallel_chains = 4,                # number of Markov chains to run in parallel
           iter_sampling = 2000                # number of post-warmup iterations per chain
           )                         
Running MCMC with 4 parallel chains...

Chain 1 Iteration:    1 / 3000 [  0%]  (Warmup) 
Chain 1 Iteration:  100 / 3000 [  3%]  (Warmup) 
Chain 1 Iteration:  200 / 3000 [  6%]  (Warmup) 
Chain 1 Iteration:  300 / 3000 [ 10%]  (Warmup) 
Chain 1 Iteration:  400 / 3000 [ 13%]  (Warmup) 
Chain 1 Iteration:  500 / 3000 [ 16%]  (Warmup) 
Chain 1 Iteration:  600 / 3000 [ 20%]  (Warmup) 
Chain 1 Iteration:  700 / 3000 [ 23%]  (Warmup) 
Chain 1 Iteration:  800 / 3000 [ 26%]  (Warmup) 
Chain 1 Iteration:  900 / 3000 [ 30%]  (Warmup) 
Chain 1 Iteration: 1000 / 3000 [ 33%]  (Warmup) 
Chain 1 Iteration: 1001 / 3000 [ 33%]  (Sampling) 
Chain 1 Iteration: 1100 / 3000 [ 36%]  (Sampling) 
Chain 1 Iteration: 1200 / 3000 [ 40%]  (Sampling) 
Chain 1 Iteration: 1300 / 3000 [ 43%]  (Sampling) 
Chain 1 Iteration: 1400 / 3000 [ 46%]  (Sampling) 
Chain 1 Iteration: 1500 / 3000 [ 50%]  (Sampling) 
Chain 1 Iteration: 1600 / 3000 [ 53%]  (Sampling) 
Chain 1 Iteration: 1700 / 3000 [ 56%]  (Sampling) 
Chain 1 Iteration: 1800 / 3000 [ 60%]  (Sampling) 
Chain 1 Iteration: 1900 / 3000 [ 63%]  (Sampling) 
Chain 2 Iteration:    1 / 3000 [  0%]  (Warmup) 
Chain 2 Iteration:  100 / 3000 [  3%]  (Warmup) 
Chain 2 Iteration:  200 / 3000 [  6%]  (Warmup) 
Chain 2 Iteration:  300 / 3000 [ 10%]  (Warmup) 
Chain 2 Iteration:  400 / 3000 [ 13%]  (Warmup) 
Chain 2 Iteration:  500 / 3000 [ 16%]  (Warmup) 
Chain 2 Iteration:  600 / 3000 [ 20%]  (Warmup) 
Chain 2 Iteration:  700 / 3000 [ 23%]  (Warmup) 
Chain 2 Iteration:  800 / 3000 [ 26%]  (Warmup) 
Chain 2 Iteration:  900 / 3000 [ 30%]  (Warmup) 
Chain 2 Iteration: 1000 / 3000 [ 33%]  (Warmup) 
Chain 2 Iteration: 1001 / 3000 [ 33%]  (Sampling) 
Chain 2 Iteration: 1100 / 3000 [ 36%]  (Sampling) 
Chain 2 Iteration: 1200 / 3000 [ 40%]  (Sampling) 
Chain 2 Iteration: 1300 / 3000 [ 43%]  (Sampling) 
Chain 2 Iteration: 1400 / 3000 [ 46%]  (Sampling) 
Chain 2 Iteration: 1500 / 3000 [ 50%]  (Sampling) 
Chain 2 Iteration: 1600 / 3000 [ 53%]  (Sampling) 
Chain 2 Iteration: 1700 / 3000 [ 56%]  (Sampling) 
Chain 2 Iteration: 1800 / 3000 [ 60%]  (Sampling) 
Chain 2 Iteration: 1900 / 3000 [ 63%]  (Sampling) 
Chain 2 Iteration: 2000 / 3000 [ 66%]  (Sampling) 
Chain 2 Iteration: 2100 / 3000 [ 70%]  (Sampling) 
Chain 3 Iteration:    1 / 3000 [  0%]  (Warmup) 
Chain 3 Iteration:  100 / 3000 [  3%]  (Warmup) 
Chain 3 Iteration:  200 / 3000 [  6%]  (Warmup) 
Chain 3 Iteration:  300 / 3000 [ 10%]  (Warmup) 
Chain 3 Iteration:  400 / 3000 [ 13%]  (Warmup) 
Chain 3 Iteration:  500 / 3000 [ 16%]  (Warmup) 
Chain 3 Iteration:  600 / 3000 [ 20%]  (Warmup) 
Chain 3 Iteration:  700 / 3000 [ 23%]  (Warmup) 
Chain 3 Iteration:  800 / 3000 [ 26%]  (Warmup) 
Chain 3 Iteration:  900 / 3000 [ 30%]  (Warmup) 
Chain 3 Iteration: 1000 / 3000 [ 33%]  (Warmup) 
Chain 3 Iteration: 1001 / 3000 [ 33%]  (Sampling) 
Chain 3 Iteration: 1100 / 3000 [ 36%]  (Sampling) 
Chain 3 Iteration: 1200 / 3000 [ 40%]  (Sampling) 
Chain 3 Iteration: 1300 / 3000 [ 43%]  (Sampling) 
Chain 3 Iteration: 1400 / 3000 [ 46%]  (Sampling) 
Chain 3 Iteration: 1500 / 3000 [ 50%]  (Sampling) 
Chain 3 Iteration: 1600 / 3000 [ 53%]  (Sampling) 
Chain 3 Iteration: 1700 / 3000 [ 56%]  (Sampling) 
Chain 3 Iteration: 1800 / 3000 [ 60%]  (Sampling) 
Chain 3 Iteration: 1900 / 3000 [ 63%]  (Sampling) 
Chain 3 Iteration: 2000 / 3000 [ 66%]  (Sampling) 
Chain 3 Iteration: 2100 / 3000 [ 70%]  (Sampling) 
Chain 3 Iteration: 2200 / 3000 [ 73%]  (Sampling) 
Chain 3 Iteration: 2300 / 3000 [ 76%]  (Sampling) 
Chain 4 Iteration:    1 / 3000 [  0%]  (Warmup) 
Chain 4 Iteration:  100 / 3000 [  3%]  (Warmup) 
Chain 4 Iteration:  200 / 3000 [  6%]  (Warmup) 
Chain 4 Iteration:  300 / 3000 [ 10%]  (Warmup) 
Chain 4 Iteration:  400 / 3000 [ 13%]  (Warmup) 
Chain 4 Iteration:  500 / 3000 [ 16%]  (Warmup) 
Chain 4 Iteration:  600 / 3000 [ 20%]  (Warmup) 
Chain 4 Iteration:  700 / 3000 [ 23%]  (Warmup) 
Chain 4 Iteration:  800 / 3000 [ 26%]  (Warmup) 
Chain 4 Iteration:  900 / 3000 [ 30%]  (Warmup) 
Chain 4 Iteration: 1000 / 3000 [ 33%]  (Warmup) 
Chain 4 Iteration: 1001 / 3000 [ 33%]  (Sampling) 
Chain 4 Iteration: 1100 / 3000 [ 36%]  (Sampling) 
Chain 4 Iteration: 1200 / 3000 [ 40%]  (Sampling) 
Chain 4 Iteration: 1300 / 3000 [ 43%]  (Sampling) 
Chain 4 Iteration: 1400 / 3000 [ 46%]  (Sampling) 
Chain 4 Iteration: 1500 / 3000 [ 50%]  (Sampling) 
Chain 4 Iteration: 1600 / 3000 [ 53%]  (Sampling) 
Chain 4 Iteration: 1700 / 3000 [ 56%]  (Sampling) 
Chain 4 Iteration: 1800 / 3000 [ 60%]  (Sampling) 
Chain 4 Iteration: 1900 / 3000 [ 63%]  (Sampling) 
Chain 4 Iteration: 2000 / 3000 [ 66%]  (Sampling) 
Chain 4 Iteration: 2100 / 3000 [ 70%]  (Sampling) 
Chain 4 Iteration: 2200 / 3000 [ 73%]  (Sampling) 
Chain 1 Iteration: 2000 / 3000 [ 66%]  (Sampling) 
Chain 1 Iteration: 2100 / 3000 [ 70%]  (Sampling) 
Chain 1 Iteration: 2200 / 3000 [ 73%]  (Sampling) 
Chain 1 Iteration: 2300 / 3000 [ 76%]  (Sampling) 
Chain 1 Iteration: 2400 / 3000 [ 80%]  (Sampling) 
Chain 1 Iteration: 2500 / 3000 [ 83%]  (Sampling) 
Chain 1 Iteration: 2600 / 3000 [ 86%]  (Sampling) 
Chain 1 Iteration: 2700 / 3000 [ 90%]  (Sampling) 
Chain 1 Iteration: 2800 / 3000 [ 93%]  (Sampling) 
Chain 1 Iteration: 2900 / 3000 [ 96%]  (Sampling) 
Chain 1 Iteration: 3000 / 3000 [100%]  (Sampling) 
Chain 2 Iteration: 2200 / 3000 [ 73%]  (Sampling) 
Chain 2 Iteration: 2300 / 3000 [ 76%]  (Sampling) 
Chain 2 Iteration: 2400 / 3000 [ 80%]  (Sampling) 
Chain 2 Iteration: 2500 / 3000 [ 83%]  (Sampling) 
Chain 2 Iteration: 2600 / 3000 [ 86%]  (Sampling) 
Chain 2 Iteration: 2700 / 3000 [ 90%]  (Sampling) 
Chain 2 Iteration: 2800 / 3000 [ 93%]  (Sampling) 
Chain 2 Iteration: 2900 / 3000 [ 96%]  (Sampling) 
Chain 2 Iteration: 3000 / 3000 [100%]  (Sampling) 
Chain 3 Iteration: 2400 / 3000 [ 80%]  (Sampling) 
Chain 3 Iteration: 2500 / 3000 [ 83%]  (Sampling) 
Chain 3 Iteration: 2600 / 3000 [ 86%]  (Sampling) 
Chain 3 Iteration: 2700 / 3000 [ 90%]  (Sampling) 
Chain 3 Iteration: 2800 / 3000 [ 93%]  (Sampling) 
Chain 3 Iteration: 2900 / 3000 [ 96%]  (Sampling) 
Chain 3 Iteration: 3000 / 3000 [100%]  (Sampling) 
Chain 4 Iteration: 2300 / 3000 [ 76%]  (Sampling) 
Chain 4 Iteration: 2400 / 3000 [ 80%]  (Sampling) 
Chain 4 Iteration: 2500 / 3000 [ 83%]  (Sampling) 
Chain 4 Iteration: 2600 / 3000 [ 86%]  (Sampling) 
Chain 4 Iteration: 2700 / 3000 [ 90%]  (Sampling) 
Chain 4 Iteration: 2800 / 3000 [ 93%]  (Sampling) 
Chain 4 Iteration: 2900 / 3000 [ 96%]  (Sampling) 
Chain 4 Iteration: 3000 / 3000 [100%]  (Sampling) 
Chain 1 finished in 0.2 seconds.
Chain 2 finished in 0.2 seconds.
Chain 3 finished in 0.2 seconds.
Chain 4 finished in 0.2 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.2 seconds.
Total execution time: 0.3 seconds.

You might get a lot of messages about C++ code while the model compiles, but you can ignore them. You might also get a few warning messages about rejected proposals, but unless there are a large number of these errors, you can ignore them as rejected proposals are a normal part of Metropolis and Hamiltonian Monte Carlo sampling.

Once our sampler has finished, the samples are stored in a CmdStanFit object, which we can summarize by calling the summary() method. Note that in this example, the output will include posterior estimates of the predicted value y_hat for each observation because we included that in our Stan model.

Code
angell_stan_fit$summary()
# A tibble: 91 × 10
   variable    mean  median     sd    mad      q5      q95  rhat ess_bulk ess_tail
   <chr>      <dbl>   <dbl>  <dbl>  <dbl>   <dbl>    <dbl> <dbl>    <dbl>    <dbl>
 1 lp__     -58.0   -57.7   1.45   1.27   -60.8   -56.3     1.00    3426.    4236.
 2 beta[1]   19.7    19.7   1.21   1.21    17.6    21.6     1.00    3263.    4038.
 3 beta[2]   -0.107  -0.107 0.0177 0.0176  -0.136  -0.0774  1.00    4876.    4900.
 4 beta[3]   -0.186  -0.186 0.0361 0.0348  -0.245  -0.126   1.00    3466.    3688.
 5 sigma      2.32    2.29  0.268  0.261    1.92    2.81    1.00    4969.    4343.
 6 y_hat[1]  14.7    14.7   0.612  0.610   13.6    15.7     1.00    3687.    5310.
 7 y_hat[2]  14.2    14.2   0.530  0.530   13.4    15.1     1.00    4222.    5731.
 8 y_hat[3]  14.8    14.8   0.643  0.636   13.7    15.8     1.00    3626.    5253.
 9 y_hat[4]  15.4    15.4   0.667  0.664   14.3    16.5     1.00    3617.    4807.
10 y_hat[5]  14.5    14.5   0.569  0.565   13.6    15.5     1.00    3882.    5332.
# ℹ 81 more rows

You can access specific parameters by specifying the variables argument:

Code
angell_stan_fit$summary(variables = c("beta", "sigma"))
# A tibble: 4 × 10
  variable   mean median     sd    mad     q5     q95  rhat ess_bulk ess_tail
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl>    <dbl>    <dbl>
1 beta[1]  19.7   19.7   1.21   1.21   17.6   21.6     1.00    3263.    4038.
2 beta[2]  -0.107 -0.107 0.0177 0.0176 -0.136 -0.0774  1.00    4876.    4900.
3 beta[3]  -0.186 -0.186 0.0361 0.0348 -0.245 -0.126   1.00    3466.    3688.
4 sigma     2.32   2.29  0.268  0.261   1.92   2.81    1.00    4969.    4343.

Assessing Convergence

The bayesplot package includes a number of useful functions for graphically checking convergence:

Code
library("bayesplot")
mcmc_hist(angell_stan_fit$draws("beta"))

Code
mcmc_trace(angell_stan_fit$draws("beta"))

What if we want a plot of the potential scale reduction factor \(\hat{R}\) at each iteration of the sampler? Unfortunately, cmdstanr doesn’t include a built-in function to produce this plot. Fortunately, it does provide an easy way to convert a CmdStanMCMC object to a coda mcmc.list using the as_mcmc.list() function.

Code
coda::gelman.plot(postpack::post_subset(as_mcmc.list(angell_stan_fit), 
                                        params = "beta"), 
                  autoburnin = FALSE)

Shinystan

The shinystan package (Stan Development Team 2017) offers a convenient way to inspect the posterior distribution(s) of parameters in your web browser, using the R package shiny as a backend. This is a great tool for quick model checking, even though you will want to use the commands shown above to produce plots for a manuscript and/or appendix.

shinystan is installed automatically when you install rstanarm. To use it, simply call launch_shinystan() on your fitted model object:

Code
library("shinystan")
launch_shinystan(angell_stan_fit)

After a few seconds, your browser should show the shinystan interface:

ShinyStan start-up screen

Click on any of the buttons DIAGNOSE, ESTIMATE, EXPLORE, or MORE to proceed. Shinystan can even produce a table in LaTeX format, and you can generate and export a number of figures without typing any code. (For reproducibility purposes, this should not be part of your final workflow, but Shinystan does allow for a fast and comprehensive look at convergence diagnostics.)

Presenting Results

Now that we’re confident that our samples are actually drawn from a posterior distribution of converging chains, we need to communicate the information contained in them to our readers. Here, you can use the same approach we’ve used before, for instance create density plots of the coefficients. Refer to prior labs and lectures for more information.
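For example, assuming bayesplot is loaded and angell_stan_fit is the fitted object from above, density plots of the coefficients take one line:

```r
library("bayesplot")

# Posterior density plots for the three regression coefficients
mcmc_dens(angell_stan_fit$draws("beta"))
```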

Tables

As before, you can use the mcmcTab() function from the BayesPostEst package to create a table, as long as you convert the CmdStanR object into an mcmc.list using as_mcmc.list():

Code
library("BayesPostEst")
mcmcTab(as_mcmc.list(angell_stan_fit),
        pars = c("beta[1]", "beta[2]", "beta[3]"))
  Variable Median    SD  Lower  Upper
1  beta[1] 19.662 1.214 17.287 22.035
2  beta[2] -0.107 0.018 -0.142 -0.072
3  beta[3] -0.186 0.036 -0.258 -0.114

All the other functions you saw in Lab 2 should work for CmdStanR objects accordingly as well (again, once you convert into an mcmc.list).

Figures

You can use similar plotting functions as encountered in prior labs and lectures to visualize the posterior distribution(s) of parameters. You first need to load the bayesplot package, and then convert the CmdStanR object into a matrix (see above) using a combination of as.matrix() and as_mcmc.list(). You can do this inside the call to the mcmc_areas() plotting function, or beforehand.

Code
library("ggplot2")
mcmc_areas(as.matrix(as_mcmc.list(angell_stan_fit)), 
     pars = c("beta[2]", "beta[3]")) + 
  xlim(-0.4, 0.1) + 
  geom_vline(xintercept = 0) + 
  scale_y_discrete(labels = c("Heterogeneity", "Mobility"))

For more information on plotting posterior draws, please read the excellent vignettes on the bayesplot website.

Further resources

Stan is very well documented; check out the Stan website as a portal to a plethora of resources. There is also a Stan discussion forum with a lot of examples of estimation problems that the Stan developers helped solve.

Example models

There are a few excellent sources for Stan model code that make it easy to translate your skills from this workshop (and your JAGS/BUGS code) to Stan.

  • This GitHub repository provides Stan code for all models shown in Gelman and Hill (2007).
  • Here is an overview of other example models in Stan, including code from several Bayesian textbooks and the WinBUGS example models.
  • This repository contains all examples from Kruschke (2014) in Stan language.
  • The Stan discussion forum contains a wealth of (usually) more advanced examples, including helpful discussions with troubleshooting advice, often from the Stan developers themselves.
  • The brms package (which we discuss in our workshop) allows you to extract the Stan (source) code for any Bayesian model you specified via the familiar formula interface. Take a look at the stancode function!
  • brms also allows you to use CmdStanR as a backend while retaining the brms formula language – but taking advantage of CmdStanR’s speed.

References

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. New York, NY: Cambridge University Press.
Kruschke, John. 2014. Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan. Oxford: Academic Press / Elsevier.
Stan Development Team. 2017. Shinystan: Interactive Visual and Numerical Diagnostics and Posterior Analysis for Bayesian Models. http://mc-stan.org/.