Lab 1: First steps in R

Applied Bayesian Modeling (ICPSR Summer Program 2025)
Author
Affiliation

Ursinus College

Published

July 20, 2025

Getting started

The purpose of this tutorial is to show the very basics of the R language so that participants who have never used R before can complete the first assignment in this workshop. For information on the thousands of other features of R, see the suggested resources below.

In this tutorial, R code that you would enter in your script file or in the command line shows up in gray boxes. You can copy code from these boxes into your own R script. R output shows up immediately underneath.

Opening RStudio

Upon opening the first time, RStudio will look similar to the screenshot below.

The window on the left is named “Console”. The spot next to the blue “greater than” sign > is the “command line”. You can tell R to perform actions by typing commands into this command line. We will rarely do this and will operate R through script files instead.

Typing R commands

In the following sections, I walk you through some basic R commands. As in most other materials you will see in this workshop, R commands appear in light gray boxes. The resulting output shows up immediately below each box, with output lines numbered and starting with [1].

To begin, see how R responds to commands. If you type a simple mathematical operation, R will return its result(s):

Code
1 + 1
[1] 2
Code
2 * 3
[1] 6
Code
10 / 3
[1] 3.33

Error messages

R will return error messages when a command is incorrect or when it cannot execute a command. Often, these error messages are informative, and you can usually learn more by searching for the error message on the web. Here, I try to add 1 and the letter a, which does not (yet) make sense: I haven’t defined an object named a, so R cannot find it. (Even once a is defined, it would need to hold a number, because numbers and text cannot be added.)

Code
1 + a
Error: object 'a' not found
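
The following sketch separates the two problems in that command. It uses only base R; tryCatch() appears here solely so that the error message is captured as text instead of stopping the script, and the object name msg is made up for illustration.

```r
# Once 'a' exists as a numeric object, the addition works
a <- 2
1 + a

# Adding a number and a character string still fails;
# tryCatch() captures the error message instead of halting the script
msg <- tryCatch(1 + "a", error = function(e) conditionMessage(e))
msg
```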

As your code becomes more complex, you may forget to complete a particular command. For example, here I want to add 1 and the product of 2 and 4. But unless I add the closing parenthesis at the end of the line, or in the immediately following line, this code won’t execute:

Code
1 + (2 * 4
)
[1] 9

While executing this command and looking at the console, you will notice that the little > on the left changes into a +. This means that R is offering you a new line to finish the original command. If I type a right parenthesis, R returns the result of my operation.

R packages

Many useful and important functions in R are provided via packages that need to be installed separately. You can do this by using the Package Installer in the menu (Packages & Data – Package Installer in R or Tools – Install Packages… in RStudio), or by typing

Code
install.packages("rio")

in the R command line. Next, in every R session or script, you need to load the packages you want to use: type

Code
library("rio")

in the R command line. You only need to install packages once on your (or any) computer, but you need to load them anew in each R session.

Alternatively, if you only want to access one particular function from a package, but do not want to load the whole package, you can use the packagename::function option.
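
As a minimal sketch of the packagename::function pattern, the example below calls median() from the stats package, which ships with R, so nothing needs to be installed. The file name in the final comment is hypothetical.

```r
# Call median() from the stats package without calling library() first
stats::median(c(3, 1, 2))

# The same pattern applies to any installed package. For example,
# rio::import("mydata.csv") would call import() without loading rio
# ("mydata.csv" is a hypothetical local file, so this line is not run here).
```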

Working directory

In most cases, it is useful to create a project-specific working directory. This working directory should contain all input (mostly datasets) for your project. When set up as we show you in this workshop, your R code will then place all output (figures, tables, data) in this working directory. There are a few different best practices on how to handle working directories. For our workshop, I recommend simply setting the working directory to the folder in which your script resides. You can do this with the following line of code, without actually hard-coding the path to this working directory. Note that this line only works when you run it from a saved script file inside RStudio.

Code
setwd(dirname(rstudioapi::getSourceEditorContext()$path))

You can typically see your current working directory on top of the R console in RStudio, or you can obtain the working directory with this command:

Code
getwd()
[1] "/Users/johanneskarreth/Library/Mobile Documents/com~apple~CloudDocs/Files/Uni/9 - ICPSR/2025/Slides/Lab 1"
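
Once the working directory is set, you can build file paths relative to it rather than typing full paths. The lines below are a sketch using base R only; the object names wd and out_file are assumptions for illustration, and no file is actually created.

```r
# Store the current working directory
wd <- getwd()

# Build a path to a file inside the working directory
# (file.path() only constructs the path; it does not create the file)
out_file <- file.path(wd, "normal_plot.pdf")
out_file

# List the files currently in the working directory
list.files(wd)
```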

RStudio also offers a very useful option to set up a whole project (File – New Project…). Projects automatically create a working directory for you. Even though we won’t use projects in this workshop, I recommend them as an easy and failsafe way to manage files and directories.

Comments

R scripts contain two types of text: R commands and comments. Commands are executed and perform actions. Comments are part of a script, but they are not executed. Comments begin with the # sign. Anything that follows a # sign on the same line will be ignored by R. Compare what happens with the following lines:

Code
1 + 1
[1] 2
Code
# 1 + 1
1 + 1 # + 3
[1] 2

You should use comments frequently to annotate your script files in order to explain to yourself what you are doing in a script file.

R help

Within R, you can access the help files for any command that exists by typing ?commandname or, for a list of the commands within a package, by typing help(package = packagename). So, for instance:

Code
?rnorm
help(package = "rio")

Workflows and conventions

There are many resources on how to structure your R workflow (think of routines like the ones suggested by J. Scott Long in The Workflow of Data Analysis Using Stata), and I encourage you to search for and maintain a consistent approach to working with R. It will make your life much, much easier with regard to collaboration, replication, and general efficiency. We recommend following the Project TIER protocol. In addition, here are a few really important points that you might want to consider as you start using R:

  • Never type commands into the R command line or the console. Always use a script file in RStudio and execute your code from this script file using the “Run” button or the Command & Return (Mac) or Control & Return (Windows) key combination.
  • Always create and specify a working directory at the beginning of a script file. This will ensure that all input and output of your project-specific work is in a location that makes sense.
  • Comment your script files!
  • Save your script files in a project-specific working directory.
  • Use a consistent style when writing code. A good place to start is this style guide: https://style.tidyverse.org. Read through this style guide today and consider using this style from then on.
  • In script files, try to break lines after 80 characters to keep your files readable.
  • Do not use the attach() command.

Objects in R

R relies on objects. This means that you, the user, create objects and work with them. Objects can be of different types. To create an object, first type the object name, then the “assignment character”, a leftward arrow <-, then the content of an object. To display an object, simply type the object’s name, and it will be printed to the console.

You can then apply functions to objects. Most functions have names that are somewhat descriptive of their purpose. For example, mean() calculates the mean of the numbers within the parentheses, and log() calculates the natural logarithm of the number(s) within the parentheses.
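
A short sketch of both functions applied to an object; the object name values is made up for illustration.

```r
# Create a numeric vector and apply functions to it
values <- c(2, 4, 6, 8)

mean(values)   # arithmetic mean of the four numbers: 5
log(values)    # natural logarithm of each element
log(exp(1))    # natural logarithm of e: 1
```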

Functions consist of a function name, the function’s arguments, and specific values passed to the arguments. In symbolic terms:

Code
function_name(argument1 = value,
              argument2 = value)

Here is a specific example of the function abbreviate, its first argument names.arg, and the value "Regression" that I provide to the argument names.arg:

Code
abbreviate(names.arg = "Regression")
Regression 
    "Rgrs" 

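Arguments can be matched by name or by position. As a small sketch, the three calls below use seq(), which also appears later in this tutorial, and all return the same sequence.

```r
# Arguments matched by name
seq(from = 1, to = 9, by = 2)   # 1 3 5 7 9

# Arguments matched by position (from, to, by) give the same result
seq(1, 9, 2)

# Named arguments may appear in any order
seq(by = 2, to = 9, from = 1)
```

Naming arguments makes a script easier to read later, especially for functions with many arguments.
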
The following are the types of objects you need to be familiar with:

  • Scalars

  • Vectors of different types

    • Numeric (numbers)
    • Character (words or letters): always entered between quotation marks "
    • Factor (numbers with labels)
    • Logical (TRUE or FALSE)
  • Matrices

  • Data frames

  • Lists
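
To see which type an object has, you can use class(). The sketch below creates one vector of each type from the list above; all object names are made up for illustration.

```r
num_vec <- c(1.5, 2, 3)
chr_vec <- c("a", "b", "c")
lgl_vec <- c(TRUE, FALSE, TRUE)
fct_vec <- factor(c("low", "high", "low"), levels = c("low", "high"))

class(num_vec)   # "numeric"
class(chr_vec)   # "character"
class(lgl_vec)   # "logical"
class(fct_vec)   # "factor"

# A factor stores integer codes plus the labels attached to them
as.integer(fct_vec)   # 1 2 1
levels(fct_vec)       # "low" "high"
```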

Object types

Below, you find some more specific examples of different types of objects.

  • Numbers:
Code
x <- 1
x
[1] 1
Code
y <- 2
x + y
[1] 3
Code
x * y
[1] 2
Code
x / y
[1] 0.5
Code
y^2
[1] 4
Code
log(x)
[1] 0
Code
exp(x)
[1] 2.72
  • Vectors:
Code
xvec <- c(1, 2, 3, 4, 5)
xvec
[1] 1 2 3 4 5
Code
xvec2 <- seq(from = 1, to = 5, by = 1)
xvec2
[1] 1 2 3 4 5
Code
yvec <- rep(1, 5)
yvec
[1] 1 1 1 1 1
Code
zvec <- xvec + yvec
zvec
[1] 2 3 4 5 6
  • Matrices:
Code
mat1 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)
mat1
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
Code
mat2 <- matrix(data = seq(from = 6, to = 3.5, by = -0.5),
               nrow = 2, byrow = TRUE)
mat2
     [,1] [,2] [,3]
[1,]  6.0  5.5  5.0
[2,]  4.5  4.0  3.5
Code
mat1 %*% mat2
     [,1] [,2] [,3]
[1,]   15 13.5   12
[2,]   36 32.5   29
[3,]   57 51.5   46
  • Data frames (equivalent to data sets):
Code
y <- c(1, 1, 3, 4, 7, 2)
x1 <- c(2, 4, 1, 8, 19, 11)
x2 <- c(-3, 4, -2, 0, 4, 20)
name <- c("Student 1", "Student 2", "Student 3", "Student 4", 
          "Student 5", "Student 6")
mydata <- data.frame(name, y, x1, x2)
mydata
       name y x1 x2
1 Student 1 1  2 -3
2 Student 2 1  4  4
3 Student 3 3  1 -2
4 Student 4 4  8  0
5 Student 5 7 19  4
6 Student 6 2 11 20
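
A few base R functions help you inspect a data frame after creating it. This is a sketch on a separate, smaller data frame; the name small_df is made up so that mydata from above stays untouched.

```r
small_df <- data.frame(
  name = c("Student 1", "Student 2", "Student 3"),
  y    = c(1, 1, 3),
  x1   = c(2, 4, 1)
)

dim(small_df)     # number of rows and columns: 3 3
names(small_df)   # variable names
str(small_df)     # compact overview of each variable and its type
```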

Random numbers and distributions

You can use R to generate (random) draws from distributions. This will be important in the first problem set. For instance, to generate 1000 draws from a normal distribution with a mean of 5 and standard deviation of 10, you would write:

Code
draws <- rnorm(1000, mean = 5, sd = 10)
summary(draws)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -28.96   -1.73    4.60    4.73   11.16   36.96 
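
Because the draws are random, your numbers will differ from the ones shown here each time you run the code. If you want the same draws every time, one option is to call set.seed() first. The seed value 2025 below is arbitrary, and the object names are made up for illustration.

```r
set.seed(2025)   # arbitrary seed, chosen only for illustration
draws_a <- rnorm(1000, mean = 5, sd = 10)

set.seed(2025)   # resetting the same seed reproduces the same draws
draws_b <- rnorm(1000, mean = 5, sd = 10)

identical(draws_a, draws_b)   # TRUE

# The sample mean and standard deviation should be close to 5 and 10
mean(draws_a)
sd(draws_a)
```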

You can then use plotting commands (see below for more) to visualize your draws:

  • Density plots:
Code
draws <- rnorm(1000, mean = 5, sd = 10)
plot(density(draws), main = "This is a plot title", 
     xlab = "Label for the X-axis", ylab = "Label for the Y-axis")

  • Histograms:
Code
draws <- rnorm(1000, mean = 5, sd = 10)
hist(draws)

Extracting elements from an object

  • Elements from a vector:
Code
vec <- c(4, 1, 5, 3)
vec[3]
[1] 5
  • Variables from a data frame:
Code
mydata$x1
[1]  2  4  1  8 19 11
Code
mydata$name
[1] "Student 1" "Student 2" "Student 3" "Student 4" "Student 5" "Student 6"
  • Columns from a matrix:
Code
mat1[ ,1]
[1] 1 3 5
  • Rows from a matrix:
Code
mat1[1, ]
[1] 1 2
  • Elements from a list:
Code
mylist <- list(x1, x2, y)
mylist[[1]]
[1]  2  4  1  8 19 11
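
The same square-bracket logic extends to data frames and named lists. The sketch below uses small objects with made-up names (scores, named_list) so that it runs on its own.

```r
scores <- data.frame(
  name = c("A", "B", "C"),
  y    = c(1, 3, 7),
  x1   = c(2, 1, 19)
)

scores[2, "x1"]          # row 2 of the variable x1: 1
scores[scores$y > 2, ]   # all rows in which y is larger than 2
scores[["y"]]            # same result as scores$y

# Elements of a named list can be extracted by name
named_list <- list(first = c(1, 2), second = "text")
named_list$second
named_list[["first"]][2]   # second element of the vector "first": 2
```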

Working with data sets

In most cases, you will not type up your data by hand, but use data sets that were created in other formats. You can easily import such data sets into R.

Importing data into R

The “rio” package allows you to import data sets in a variety of formats with one single function, import(). You need to first load the package:

Code
library("rio")

The import() function “guesses” the format of the data from the file type extension, so that a file ending in .csv is read in as a comma-separated value file. If the file type extension does not reveal the type of data (e.g., a tab-separated file saved with a .txt extension), you need to provide the format argument, as you see in the first example below. See the help file for import() for more information.

Note that for each command, many options (in R language: arguments) are available; you will most likely need to work with these options at some time, for instance when your source dataset (e.g., in Stata) has value labels. Check the help files for the respective command in that case.

  • Tab-separated files: If you have a text file with a simple tab-delimited table, where the first line designates variable names:
Code
mydata_from_tsv <- import("https://www.jkarreth.net/files/mydata.txt", format = "tsv")
head(mydata_from_tsv)
      y    x1    x2
1 -0.56  1.22 -1.07
2 -0.23  0.36 -0.22
3  1.56  0.40 -1.03
4  0.07  0.11 -0.73
5  0.13 -0.56 -0.63
6  1.72  1.79 -1.69

Alternatively, use read.table() specifically for tab-separated files:

Code
mydata_from_tsv <- read.table("https://www.jkarreth.net/files/mydata.txt", header = TRUE)
head(mydata_from_tsv)
      y    x1    x2
1 -0.56  1.22 -1.07
2 -0.23  0.36 -0.22
3  1.56  0.40 -1.03
4  0.07  0.11 -0.73
5  0.13 -0.56 -0.63
6  1.72  1.79 -1.69
  • CSV files: If you have a text file with a simple comma-separated table, where the first line designates variable names:
Code
mydata_from_csv <- import("https://www.jkarreth.net/files/mydata.csv")
head(mydata_from_csv)
      y    x1    x2
1 -0.56  1.22 -1.07
2 -0.23  0.36 -0.22
3  1.56  0.40 -1.03
4  0.07  0.11 -0.73
5  0.13 -0.56 -0.63
6  1.72  1.79 -1.69

Alternatively, use read.csv() specifically for comma-separated files:

Code
mydata_from_csv <- read.csv("https://www.jkarreth.net/files/mydata.csv")
head(mydata_from_csv)
      y    x1    x2
1 -0.56  1.22 -1.07
2 -0.23  0.36 -0.22
3  1.56  0.40 -1.03
4  0.07  0.11 -0.73
5  0.13 -0.56 -0.63
6  1.72  1.79 -1.69
  • SPSS files: If you have an SPSS data file, you can do this:
Code
mydata_from_spss <- import("https://www.jkarreth.net/files/mydata.sav")
head(mydata_from_spss)
      y    x1    x2
1 -0.56  1.22 -1.07
2 -0.23  0.36 -0.22
3  1.56  0.40 -1.03
4  0.07  0.11 -0.73
5  0.13 -0.56 -0.63
6  1.72  1.79 -1.69
  • Stata files: If you have a Stata data file, you can do this:
Code
mydata_from_dta <- import("https://www.jkarreth.net/files/mydata.dta")
head(mydata_from_dta)
      y    x1    x2
1 -0.56  1.22 -1.07
2 -0.23  0.36 -0.22
3  1.56  0.40 -1.03
4  0.07  0.11 -0.73
5  0.13 -0.56 -0.63
6  1.72  1.79 -1.69

Alternatively, use read.dta() from the “foreign” package specifically for Stata files (note that read.dta() only reads files saved in the format of Stata version 12 or older):

Code
library("foreign")
mydata_from_dta <- read.dta("https://www.jkarreth.net/files/mydata.dta")
head(mydata_from_dta)
      y    x1    x2
1 -0.56  1.22 -1.07
2 -0.23  0.36 -0.22
3  1.56  0.40 -1.03
4  0.07  0.11 -0.73
5  0.13 -0.56 -0.63
6  1.72  1.79 -1.69
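
If you want to practice importing without an internet connection, a local round trip is one option. The sketch below uses only base R: it writes a small data frame to a temporary CSV file and reads it back in. The object names and the temporary file are made up for illustration.

```r
# A small data frame to write to disk
example_df <- data.frame(
  y  = c(-0.56, -0.23, 1.56),
  x1 = c(1.22, 0.36, 0.40)
)

# Write it to a temporary CSV file, without row names
tmp_csv <- tempfile(fileext = ".csv")
write.csv(example_df, file = tmp_csv, row.names = FALSE)

# Read the file back in and compare with the original
reread <- read.csv(tmp_csv)
head(reread)
all.equal(example_df, reread)
```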

Describing data

To obtain descriptive statistics of a dataset, or a variable, use the summary command:

Code
summary(mydata_from_dta)
       y                x1               x2         
 Min.   :-1.270   Min.   :-1.970   Min.   :-1.6900  
 1st Qu.:-0.532   1st Qu.:-0.325   1st Qu.:-1.0600  
 Median :-0.080   Median : 0.380   Median :-0.6800  
 Mean   : 0.074   Mean   : 0.208   Mean   :-0.4270  
 3rd Qu.: 0.378   3rd Qu.: 0.650   3rd Qu.: 0.0575  
 Max.   : 1.720   Max.   : 1.790   Max.   : 1.2500  
Code
summary(mydata_from_dta$y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -1.270  -0.532  -0.080   0.074   0.378   1.720 

You can access particular quantities, such as standard deviations and quantiles (in this case the 5th and 95th percentiles), with the respective functions:

Code
sd(mydata_from_dta$y)
[1] 0.956
Code
quantile(mydata_from_dta$y, probs = c(0.05, 0.95))
   5%   95% 
-1.01  1.65 
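
To see what sd() computes, the sketch below reproduces it by hand on a small vector, using the sample standard deviation formula with n - 1 in the denominator. The vector v and the other object names are made up for illustration.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(v)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1
sd_manual <- sqrt(sum((v - mean(v))^2) / (n - 1))
sd_manual
sd(v)
all.equal(sd_manual, sd(v))   # TRUE

# Several quantiles can be requested at once
quantile(v, probs = c(0.25, 0.5, 0.75))
```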

Creating figures

R offers several options to create figures. We will work with the so-called “base graphics”, mostly using the plot() function, and the “ggplot2” package.

Base graphics

R’s base graphics are very versatile and, in our workshop, ideal for creating quick plots to inspect objects. These graphs are built sequentially, beginning with the plot() command applied to an object. So, for instance, to plot the density of 1000 draws from a normal distribution, you would use the following code. I’m using the set.seed() command here before every simulation to ensure that the same values are drawn when you try these commands and make these plots.

Code
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)
plot(density(dist1))
lines(density(dist2), col = "red")
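
Because base graphics are built up step by step, further elements can be added to a plot after it has been drawn. The sketch below repeats the two densities and then adds a title, a vertical reference line, and a legend; the title and legend labels are made up for illustration.

```r
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)

# Start the plot with the first density, including a title and axis label
plot(density(dist1), main = "Two normal densities", xlab = "Value")

# Add the second density, a dashed vertical line at 0, and a legend
lines(density(dist2), col = "red")
abline(v = 0, lty = 2)
legend("topright", legend = c("sd = 1", "sd = 2"),
       col = c("black", "red"), lty = 1)
```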

The ggplot2 package

The “ggplot2” package has become popular because its language and plotting sequence can be somewhat more convenient (depending on users’ background), especially when working with more complex datasets. For plotting Bayesian model output, ggplot2 offers some useful features. I will mostly use ggplot2 in this workshop because (in my opinion) it offers a quick and scalable way to produce figures that are useful for diagnostics and publication-quality output alike.

ggplot2 first needs to be loaded as an external package. Its key commands are ggplot() and various types of plots, passed to R via geom_ commands. All commands are added via +, either in one line or in a new line to an existing ggplot2 object. The command below contains a couple more data manipulation steps that will come in handy for us later; we will discuss them in the workshop. Here, I use the tidyr::pivot_longer command to reshape the data so they can be plotted in one figure. When trying the code below, have a look at the structure of the dist.long object to see what’s going on.

Code
library("ggplot2")
library("tidyr")
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)
dist.df <- data.frame(dist1, dist2)
dist.long <- pivot_longer(data = dist.df, cols = everything())
head(dist.long)
# A tibble: 6 × 2
  name   value
  <chr>  <dbl>
1 dist1 -0.560
2 dist2 -1.12 
3 dist1 -0.230
4 dist2 -0.460
5 dist1  1.56 
6 dist2  3.12 
Code
normal.plot <- ggplot(data = dist.long, aes(x = value, colour = name, fill = name))
normal.plot <- normal.plot + geom_density(alpha = 0.5)
normal.plot
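
As a small sketch of how further layers are added with +, the example below builds a similar density plot and adds labels with labs() and a different look with theme_minimal(), both of which are part of ggplot2. The object names plot_df and custom_plot and the label text are made up for illustration, and the data are built directly in long format rather than reshaped.

```r
library("ggplot2")

# Build the data directly in long format: one column of values,
# one column naming the distribution each value comes from
set.seed(123)
plot_df <- data.frame(
  value = c(rnorm(1000, mean = 0, sd = 1), rnorm(1000, mean = 0, sd = 2)),
  name  = rep(c("dist1", "dist2"), each = 1000)
)

# Each + adds another layer or setting to the plot object
custom_plot <- ggplot(data = plot_df,
                      aes(x = value, colour = name, fill = name)) +
  geom_density(alpha = 0.5) +
  labs(title = "Two normal densities", x = "Value", y = "Density") +
  theme_minimal()
custom_plot
```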

ggplot2 offers plenty of opportunities for customizing plots; we will also encounter these later on in the workshop. You can also have a look at Winston Chang’s R Graphics Cookbook for plenty of examples of ggplot2 customization: http://www.cookbook-r.com/Graphs.

Exporting graphs

Plots created via base graphics can be printed to a PDF file using the pdf() command. This code:

Code
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)
pdf("normal_plot.pdf", width = 5, height = 5)
plot(density(dist1))
lines(density(dist2), col = "red")
dev.off()
agg_png 
      2 

will print a plot named normal_plot.pdf of the size 5 by 5 inches to your working directory.
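
To check that an export worked, one option is to ask R whether the file exists afterwards. This is only a sketch: the file name normal_plot_check.pdf and the use of tempdir() as the target folder are assumptions for illustration, chosen so that nothing is written to your working directory.

```r
# Write a plot to a PDF file in R's temporary folder
out_path <- file.path(tempdir(), "normal_plot_check.pdf")
pdf(out_path, width = 5, height = 5)
plot(density(rnorm(1000)))
dev.off()   # closes the PDF device; the file is only complete after this

# Confirm that the file was written
file.exists(out_path)
```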

Plots created with ggplot2 are best saved using the ggsave() command:

Code
ggsave(plot = normal.plot, filename = "normal_ggplot.pdf",
       width = 5, height = 5, units = "in")

Integrating writing and data analysis

For project management and replication purposes, it is advantageous to combine your data analysis and writing in one framework. Quarto, RMarkdown, Sweave and knitr are great solutions for this. The RStudio website has a good explanation of these options: https://docs.posit.co/ide/user/ide/guide/documents/quarto-project.html, http://rmarkdown.rstudio.com and https://support.rstudio.com/hc/en-us/articles/200552056-Using-Sweave-and-knitr. This tutorial and the slides are written using knitr. Depending on interest, I may be able to offer another lab that will address Quarto or RMarkdown as a tool for reproducible research, among other topics.