The purpose of this tutorial is to show the very basics of the R language so that participants who have not used R before can complete the first assignment in this workshop. For information on the thousands of other features of R, see the suggested resources below.
In this tutorial, R code that you would enter in your script file or in the command line is preceded by the > character, and by + if the current line of code continues from a previous line. You do not need to type these characters in your own code. Note that copying and pasting code from the PDF version of this tutorial may lead to errors when you try to execute it. Please copy code from the R script used to produce this tutorial instead; this script can be found here.
The most recent version of R for all operating systems is always located at http://www.r-project.org/index.html. Go directly to http://lib.stat.cmu.edu/R/CRAN/, and download the R version for your operating system. Then, install R.
To operate R, you should rely on writing R scripts. We will write these scripts in RStudio. Download RStudio from http://www.rstudio.org. Then, install it on your computer. Some text editors also offer integration with R, so that you can send code directly to R. RStudio is generally the best solution for running R and maintaining a reproducible workflow.
Lastly, install LaTeX in order to compile PDF files from within RStudio. To do this, follow the instructions under http://www.jkarreth.net/latex.html, “Installation”. You won’t have to use LaTeX directly or learn how to write LaTeX code in this workshop.
When you open RStudio for the first time, it will look like the screenshot below.
The window on the left is named “Console”. The point next to the blue “greater than” sign > is the “command line”. You can tell R to perform actions by typing commands into this command line. We will rarely do this and will operate R through script files instead.
In the following sections, I walk you through some basic R commands. In this tutorial and most other materials you will see in this workshop, R commands and the resulting R output will appear in light grey boxes. Output in this tutorial is always preceded by two # signs (##).
To begin, see how R responds to commands. If you type a simple mathematical operation, R will return its result(s):
## [1] 2
## [1] 6
## [1] 3.333333
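The commands behind the three results above are not shown. As a minimal sketch, the following operations (my own illustration, chosen only because they return those values) behave the same way:

```r
# Illustrative operations; these are assumed, not necessarily the original commands
1 + 1     # addition, returns 2
2 * 3     # multiplication, returns 6
10 / 3    # division, returns 3.333333
```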
R will return error messages when a command is incorrect or when it cannot execute a command. Often, these error messages are informative, and you can usually get more information by searching for an error message on the web. Here, I try to add 1 and the letter a. This does not (yet) make sense because I have not defined an object a, and numbers and letters cannot be added:
## Error in eval(expr, envir, enclos): object 'a' not found
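If you want a script to continue after such an error, one option is to capture the message rather than let R stop. This is only a sketch: it assumes that no object named a exists in your session, and tryCatch() is one of several ways to handle errors.

```r
# `a` is deliberately left undefined, so 1 + a raises an error.
# tryCatch() intercepts the error and returns its message as a string.
error_text <- tryCatch(
  1 + a,
  error = function(e) conditionMessage(e)
)
error_text
```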
As your code becomes more complex, you may forget to complete a particular command. For example, here I want to add 1 and the product of 2 and 4. But unless I add the closing parenthesis at the end of the line, or in the immediately following line, this code won’t execute:
## [1] 9
While executing this command and looking at the console, you will notice that the little > on the left changes into a +. This means that R is offering you a new line to finish the original command. If I type a right parenthesis, R returns the result of my operation.
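The behavior described above can be reproduced with a command that is split across two lines. This is a sketch of what such input might look like, not necessarily the exact command used for the output shown:

```r
# The first line is incomplete, so R waits for more input (the + prompt).
# The closing parenthesis on the second line completes the command.
1 + (2 * 4
)
```

Once the second line is entered, R evaluates the full expression and returns 9.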
Many useful and important functions in R are provided via packages that need to be installed separately. You can do this by using the Package Installer in the menu (Packages & Data – Package Installer in R or Tools – Install Packages… in RStudio), or by typing install.packages("packagename") in the R command line. Next, in every R session or script, you need to load the packages you want to use: type library("packagename") in the R command line. You only need to install packages once on your (or any) computer, but you need to load them anew in each R session.
Alternatively, if you only want to access one particular function from a package, but do not want to load the whole package, you can use the packagename::function option.
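As a small sketch of the install, load, and packagename::function steps: the install line is commented out because it needs an internet connection, and I use the stats package (which ships with R) purely as a stand-in so that the example runs anywhere.

```r
# Install once per computer (commented out here; "rio" is used later in this tutorial)
# install.packages("rio")

# Load a package at the start of each session or script
library("stats")

# Call a single function without loading its package, via packagename::function
stats::sd(c(1, 2, 3))   # standard deviation of 1, 2, 3 is 1
```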
In most cases, it is useful to set a project-specific working directory (WD), especially if you work with many files and want to create graphics that are printed to .pdf or .eps files. You can set the working directory with the setwd() command, passing the path to the directory as a quoted string.
You can typically see your current working directory on top of the R console in RStudio, or you can obtain it with the getwd() command:
## [1] "/Users/johanneskarreth/Documents/Dropbox/Uni/9 - ICPSR/2019/Applied Bayes/Course materials/Labs/1 - R Basics"
RStudio also offers a very useful function to set up a whole project (File – New Project…). Projects automatically create a working directory for you.
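A minimal sketch of setting and checking the working directory follows. The folder name lab1_example is hypothetical, and I create it inside R's temporary directory so that the example does not touch your own files; in practice you would pass the path to your project folder.

```r
# Remember where we started so we can return afterwards
old_wd <- getwd()

# Hypothetical project folder, created in a temporary location for this sketch
project_dir <- file.path(tempdir(), "lab1_example")
dir.create(project_dir, showWarnings = FALSE)

# Set the working directory, then confirm it
setwd(project_dir)
current_wd <- getwd()
current_wd

# Restore the original working directory
setwd(old_wd)
```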
Within R, you can access the help files for any command that exists by typing ?commandname or, for a list of the commands within a package, by typing help(package = packagename). So, for instance:
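The following sketch uses mean() and the stats package only as examples. In an interactive session you would usually just type ?mean; here the results are stored in objects (hypothetical names) so that nothing opens until you print them.

```r
# help("mean") is the functional form of ?mean
mean_help <- help("mean")

# Index of the documented functions in a package
stats_index <- help(package = "stats")

# Printing either object displays the help page or the package index, e.g.:
# mean_help
# stats_index
```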
There are many resources on how to structure your R workflow (think of routines like the ones suggested by J. Scott Long in The Workflow of Data Analysis Using Stata), and I encourage you to search for and maintain a consistent approach to working with R. It will make your life much, much easier with regard to collaboration, replication, and general efficiency. We recommend following the Project TIER protocol. In addition, here are a few really important points that you might want to consider as you start using R:
attach() command.
As R has become one of the most popular programs for statistical computing, the number of resources in print and online has increased dramatically. Searching for terms like “introduction to R software” will return a huge number of results.
Some (of the many) good resources that I have encountered and found useful are:
R is an object-oriented programming language. This means that you, the user, create objects and work with them. Objects can be of different types. To create an object, first type the object name, then the “assignment character”, a leftward arrow <-, then the content of the object. To display an object, simply type the object’s name, and it will be printed to the console. [1]
You can then apply functions to objects. Most functions have names that are somewhat descriptive of their purpose. For example, mean() calculates the mean of the numbers within the parentheses, and log() calculates the natural logarithm of the number(s) within the parentheses.
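To make the steps of creating an object and applying functions to it concrete, here is a short sketch. The object name x and its values are my own choices for illustration.

```r
# Create an object named x that holds three numbers
x <- c(1, 10, 100)

# Typing the object's name prints it to the console
x

# Apply functions to the object
mean(x)   # arithmetic mean: (1 + 10 + 100) / 3 = 37
log(x)    # natural logarithm of each element
```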
Functions consist of a function name, the function’s arguments, and specific values passed to the arguments. In symbolic terms, a call has the form function.name(argument = value).
Here is a specific example of the function abbreviate, its first argument names.arg, and the value "Regression" that I provide to the argument names.arg:
## Regression
## "Rgrs"
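The call that produces this output can be written with the argument named explicitly or matched by position. The sketch below shows both; the object names are hypothetical.

```r
# Name the argument explicitly ...
abbr_named <- abbreviate(names.arg = "Regression")
abbr_named

# ... or rely on position: the first value is matched to names.arg
abbr_positional <- abbreviate("Regression")

# Both calls return the same named character vector
identical(abbr_named, abbr_positional)
```

Naming arguments is more verbose, but it makes scripts easier to read and protects you if you pass values in a different order.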
The following are the types of objects you need to be familiar with:
Vectors of different types
Lists
Below are some more specific examples of different types of objects.
## [1] 1
## [1] 3
## [1] 2
## [1] 0.5
## [1] 4
## [1] 0
## [1] 2.718282
## [1] 1 2 3 4 5
## [1] 1 2 3 4 5
## [1] 1 1 1 1 1
## [1] 2 3 4 5 6
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
##      [,1] [,2] [,3]
## [1,]  6.0  5.5  5.0
## [2,]  4.5  4.0  3.5
##      [,1] [,2] [,3]
## [1,]   15 13.5   12
## [2,]   36 32.5   29
## [3,]   57 51.5   46
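The commands that created the three matrices above are not shown. One way to reproduce them is sketched below; the object names m1, m2, and m3 are hypothetical, and the original code may have built the matrices differently.

```r
# A 3 x 2 matrix, filled row by row
m1 <- matrix(1:6, nrow = 3, byrow = TRUE)
m1

# A 2 x 3 matrix, also filled row by row
m2 <- matrix(c(6, 5.5, 5, 4.5, 4, 3.5), nrow = 2, byrow = TRUE)
m2

# Matrix multiplication with %*%: (3 x 2) times (2 x 3) gives a 3 x 3 matrix
m3 <- m1 %*% m2
m3
```

Note that %*% performs matrix multiplication, whereas * multiplies element by element and requires matrices of the same dimensions.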
y <- c(1, 1, 3, 4, 7, 2)
x1 <- c(2, 4, 1, 8, 19, 11)
x2 <- c(-3, 4, -2, 0, 4, 20)
name <- c("Student 1", "Student 2", "Student 3", "Student 4",
          "Student 5", "Student 6")
mydata <- data.frame(name, y, x1, x2)
mydata
##        name y x1 x2
## 1 Student 1 1  2 -3
## 2 Student 2 1  4  4
## 3 Student 3 3  1 -2
## 4 Student 4 4  8  0
## 5 Student 5 7 19  4
## 6 Student 6 2 11 20
You can use R to generate (random) draws from distributions. This will be important in the first assignment. For instance, to generate 1000 draws from a normal distribution with a mean of 5 and standard deviation of 10, you would write:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -26.915 -2.501 4.239 4.322 10.815 34.042
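A sketch of commands that generate and summarize such draws is shown below. The seed value is my own assumption, added only so that the sketch is reproducible; your numbers will differ from the summary printed above.

```r
# Fix the random number generator (assumed seed, for reproducibility only)
set.seed(123)

# 1000 draws from a normal distribution with mean 5 and standard deviation 10
draws <- rnorm(1000, mean = 5, sd = 10)

# Minimum, quartiles, mean, and maximum of the draws
summary(draws)
```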
You can then use a variety of plotting commands (see below for more) to visualize your draws:
draws <- rnorm(1000, mean = 5, sd = 10)
plot(density(draws), main = "This is a plot title",
xlab = "Label for the X-axis", ylab = "Label for the Y-axis")
## [1] 5
## [1] 2 4 1 8 19 11
## NULL
## [1] 1 3 5
## [1] 1 2
## [1] 2 4 1 8 19 11
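The output above appears to come from extracting elements of objects, but the commands themselves are not shown. As a sketch under that assumption, here are common ways to extract parts of the mydata data frame defined earlier (rebuilt here so that the example is self-contained):

```r
# Rebuild the small data frame from earlier in this tutorial
mydata <- data.frame(
  name = paste("Student", 1:6),
  y    = c(1, 1, 3, 4, 7, 2),
  x1   = c(2, 4, 1, 8, 19, 11),
  x2   = c(-3, 4, -2, 0, 4, 20)
)

# Extract one column by name with $
mydata$x1

# The same column with [row, column] indexing
mydata[, "x1"]

# The third row, all columns
mydata[3, ]

# Names of all students whose y value exceeds 2
mydata[mydata$y > 2, "name"]
```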
In most cases, you will not type up your data by hand, but use data sets that were created in other formats. You can easily import such data sets into R.
The “rio” package allows you to import data sets in a variety of formats with one single function, import(). You first need to load the package with library("rio").
The import() function “guesses” the format of the data from the file type extension, so that a file ending in .csv is read in as a comma-separated value file. If the file type extension does not reveal the type of data (e.g., a tab-separated file saved with a .txt extension), you need to provide the format argument, as you see in the first example below. See the help file for import() for more information.
Note that for each command, many options (in R language: arguments) are available; you will most likely need to work with these options at some time, for instance when your source dataset (e.g., in Stata) has value labels. Check the help files for the respective command in that case.
mydata_from_tsv <- import("http://www.jkarreth.net/files/mydata.txt", format = "tsv")
head(mydata_from_tsv)
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69
Alternatively, use read.table() specifically for tab-separated files:
mydata_from_tsv <- read.table("http://www.jkarreth.net/files/mydata.txt", header = TRUE)
head(mydata_from_tsv)
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69
Alternatively, use read.csv() specifically for comma-separated files:
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69
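The read.csv() call behind output like the above is not shown. As a self-contained sketch that needs neither an internet connection nor extra packages, the code below writes a tiny comma-separated file to a temporary location and reads it back. The file and object names are hypothetical, and the two data rows simply repeat the first rows printed above.

```r
# Write a small CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("y,x1,x2",
    "-0.56,1.22,-1.07",
    "-0.23,0.36,-0.22"),
  csv_path
)

# read.csv() assumes a header row and comma separators by default
mydata_from_csv <- read.csv(csv_path)
head(mydata_from_csv)
```

With your own data, you would replace csv_path with the path or URL of your file.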
Alternatively, use read.dta() from the “foreign” package specifically for Stata files:
library("foreign")
mydata_from_dta <- read.dta("http://www.jkarreth.net/files/mydata.dta")
head(mydata_from_dta)
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69
To obtain descriptive statistics of a dataset, or a variable, use the summary() command:
##        y                 x1               x2
##  Min.   :-1.2700   Min.   :-1.970   Min.   :-1.6900
##  1st Qu.:-0.5325   1st Qu.:-0.325   1st Qu.:-1.0600
##  Median :-0.0800   Median : 0.380   Median :-0.6800
##  Mean   : 0.0740   Mean   : 0.208   Mean   :-0.4270
##  3rd Qu.: 0.3775   3rd Qu.: 0.650   3rd Qu.: 0.0575
##  Max.   : 1.7200   Max.   : 1.790   Max.   : 1.2500
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -1.2700 -0.5325 -0.0800  0.0740  0.3775  1.7200
You can access particular quantities, such as standard deviations and quantiles (in this case the 5th and 95th percentiles), with the respective functions:
## [1] 0.9561869
## 5% 95%
## -1.009 1.648
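The functions in question are summary(), sd(), and quantile(). The sketch below applies them to a short vector that holds only the first six y values printed earlier in this tutorial, so its results will not match the output above, which is based on the full data set.

```r
# First six y values from the imported data (a small subset, for illustration only)
y <- c(-0.56, -0.23, 1.56, 0.07, 0.13, 1.72)

# Descriptive statistics for one variable
summary(y)

# Standard deviation
sd(y)

# 5th and 95th percentiles; probs takes values between 0 and 1
y_quantiles <- quantile(y, probs = c(0.05, 0.95))
y_quantiles
```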
R offers several options to create figures. We will work with the so-called “base graphics”, mostly using the plot() function, and the “ggplot2” package.
R’s base graphics are very versatile and, in our workshop, ideal for creating quick plots to inspect objects. These graphs are built sequentially, beginning with the plot() command applied to an object. So, for instance, to plot the density of 1000 draws from a normal distribution, you would use the following code. I’m using the set.seed() command here before every simulation to ensure that the same values are drawn when you try these commands and make these plots.
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)
plot(density(dist1))
lines(density(dist2), col = "red")
The “lattice” package has long been popular for visualizing more complex data structures, e.g. nested data. For plotting Bayesian model output, lattice offers some useful features.
lattice must first be loaded as an external package. It offers a variety of plots; some are built in (densityplot or dotplot), and many others can be built with xyplot. The command below contains a couple more data manipulation steps that will come in handy for us later; we will discuss them in the workshop. Here, I use the reshape2::melt command to reshape the data so they can be plotted in one figure. When trying the code below, have a look at the structure of the dist.df object to see what’s going on.
library("lattice"); library("reshape2")
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)
dist.df <- data.frame(dist1, dist2)
dist.df <- melt(dist.df)
## No id variables; using all as measure variables
head(dist.df)
##   variable       value
## 1    dist1 -0.56047565
## 2    dist1 -0.23017749
## 3    dist1  1.55870831
## 4    dist1  0.07050839
## 5    dist1  0.12928774
## 6    dist1  1.71506499
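The lattice plotting call itself is not shown above. A possible version is sketched here with densityplot(); the argument choices (plot.points, auto.key) and the object names dist.long and normal.lattice are my own assumptions. To keep the sketch independent of reshape2, the long-format data frame is built by hand in base R.

```r
library("lattice")

set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)

# Long format: one column identifying the distribution, one with the values
dist.long <- data.frame(
  variable = factor(rep(c("dist1", "dist2"), each = 1000)),
  value    = c(dist1, dist2)
)

# One density curve per group; plot.points = FALSE hides the individual draws,
# auto.key = TRUE adds a legend
normal.lattice <- densityplot(~ value, data = dist.long, groups = variable,
                              plot.points = FALSE, auto.key = TRUE)
normal.lattice
```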
The “ggplot2” package has become popular because its language and plotting sequence can be somewhat more convenient (depending on users’ background), especially when working with more complex datasets. For plotting Bayesian model output, ggplot2 offers some useful features. I will mostly use ggplot2 in this workshop because (in my opinion) it offers a quick and scalable way to produce figures that are useful for diagnostics and publication-quality output alike.
ggplot2 must first be loaded as an external package. Its key commands are ggplot() and various types of plots, passed to R via geom_ commands. All commands are added via +, either in one line or in a new line to an existing ggplot2 object. The command below contains a couple more data manipulation steps that will come in handy for us later; we will discuss them in the workshop. Here, I use the reshape2::melt command to reshape the data so they can be plotted in one figure. When trying the code below, have a look at the structure of the dist.df object to see what’s going on.
library("ggplot2")
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)
dist.df <- data.frame(dist1, dist2)
dist.df <- melt(dist.df)
## No id variables; using all as measure variables
head(dist.df)
##   variable       value
## 1    dist1 -0.56047565
## 2    dist1 -0.23017749
## 3    dist1  1.55870831
## 4    dist1  0.07050839
## 5    dist1  0.12928774
## 6    dist1  1.71506499
normal.plot <- ggplot(data = dist.df, aes(x = value, colour = variable, fill = variable))
normal.plot <- normal.plot + geom_density(alpha = 0.5)
normal.plot
ggplot2 offers plenty of opportunities for customizing plots; we will also encounter these later on in the workshop. You can also have a look at Winston Chang’s R Graphics Cookbook for plenty of examples of ggplot2 customization: http://www.cookbook-r.com/Graphs.
Plots created via base graphics can be printed to a PDF file using the pdf() command. This code:
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
set.seed(123)
dist2 <- rnorm(1000, mean = 0, sd = 2)
pdf("normal_plot.pdf", width = 5, height = 5)
plot(density(dist1))
lines(density(dist2), col = "red")
dev.off()
## quartz_off_screen
## 2
will print a plot named normal_plot.pdf of the size 5 by 5 inches to your working directory.
Plots created with ggplot2 are best saved using the ggsave() command:
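The ggsave() call is not shown here, so the following is a sketch under my own assumptions: the data, the object names, and the file name are hypothetical, and the file is written to a temporary directory rather than your working directory. The requireNamespace() guard only makes the sketch skip quietly if ggplot2 is not installed.

```r
# Run only if ggplot2 is available on this computer
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library("ggplot2")

  set.seed(123)
  demo.df <- data.frame(value = rnorm(1000))   # hypothetical data

  # Build a simple density plot
  demo.plot <- ggplot(data = demo.df, aes(x = value)) + geom_density()

  # Save it as a 5 by 5 inch PDF; the file type is taken from the extension
  out_file <- file.path(tempdir(), "normal_plot_gg.pdf")
  ggsave(filename = out_file, plot = demo.plot, width = 5, height = 5)
}
```

Passing the plot object explicitly via the plot argument avoids relying on whichever plot happened to be displayed last.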
For project management and replication purposes, it is advantageous to combine your data analysis and writing in one framework. RMarkdown, Sweave and knitr are great solutions for this. The RStudio website has a good explanation of these options: http://rmarkdown.rstudio.com and https://support.rstudio.com/hc/en-us/articles/200552056-Using-Sweave-and-knitr. This tutorial and all slides in Weeks 3-4 are written using knitr. We will offer another lab that will address RMarkdown as a tool for reproducible research, among other topics.
/Users/thomasbayes/Work/ICPSR/Homework/Lab1.
Go to http://gss.norc.org/ and download the General Social Survey raw data for 2014 in SPSS or Stata format. Save this file in an assignment-specific working directory. Then, create an R script that performs the following operations:
[1] A good overview of types of objects is here: http://www.statmethods.net/input/datatypes.html. Read this page and then continue this tutorial.
Comments
R scripts contain two types of text: R commands and comments. Commands are executed and perform actions. Comments are part of a script, but they are not executed. Comments begin with the # sign. Anything that follows a # sign in the same line will be ignored by R. Compare what happens with the following two lines:
You should use comments frequently to annotate your script files in order to explain to yourself what you are doing in a script file.
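To illustrate the difference between commands and comments described above, here is a two-line sketch (the specific operation is my own choice):

```r
1 + 1      # the command before the # sign is executed; this text is ignored
# 1 + 1    # this entire line is a comment, so nothing is executed
```

The first line returns 2, while the second line produces no output at all.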