Getting started

The purpose of this tutorial is to show the very basics of the R language so that participants who have not used R before can complete the first assignment in this workshop. For information on the thousands of other features of R, see the suggested resources below.

Installing R and RStudio

The most recent version of R for all operating systems is always available at http://www.r-project.org/index.html. You can go directly to https://cloud.r-project.org and download the R version for your operating system. Then, install R.

To work with R, you should write R scripts rather than type commands directly. We will write these scripts in RStudio. Download RStudio from http://www.rstudio.org and install it on your computer. Some text editors also offer integration with R, so that you can send code directly to R, but RStudio is generally the best solution for running R and maintaining a reproducible workflow.

Lastly, install LaTeX in order to compile PDF files from within RStudio. To do this, follow the instructions under http://www.jkarreth.net/latex.html, “Installation”. You won’t have to use LaTeX directly or learn how to write LaTeX code in this workshop. Alternatively, you can install the “tinytex” distribution, which requires less disk space and installs faster. You can find excellent documentation and instructions for tinytex at https://yihui.org/tinytex/.

Opening RStudio

When you open RStudio for the first time, it will look like the screenshot below.

RStudio start-up screen

The window on the left is named “Console”. The prompt next to the blue “greater than” sign > is the “command line”. You can tell R to perform actions by typing commands into this command line. We will rarely do this, operating R through script files instead.

Typing R commands

In the following sections, I walk you through some basic R commands. In this tutorial and most other materials you will see in this workshop, R commands and the resulting R output appear in light grey boxes. Output in this tutorial is always preceded by two pound signs, ##.

To begin, see how R responds to commands. If you type a simple mathematical operation, R will return its result(s):

1 + 1
## [1] 2
2 * 3
## [1] 6
10 / 3
## [1] 3.333333

Error messages

R will return error messages when a command is incorrect or when it cannot execute a command. Often, these error messages are informative, and you can usually find more information by searching for an error message on the web. Here, I try to add 1 and the letter a, which does not (yet) make sense because I have not defined an object a, and numbers and letters cannot be added:

1 + a
## Error in eval(expr, envir, enclos): object 'a' not found

As your code becomes more complex, you may forget to complete a particular command. For example, here I want to add 1 and the product of 2 and 4. But unless I add the closing parenthesis at the end of the line, or on the immediately following line, this code won’t execute:

1 + (2 * 4
)
## [1] 9

While executing this command and looking at the console, you will notice that the little > on the left changes into a +. This means that R is offering you a new line to finish the original command. Once I type the closing parenthesis, R returns the result of my operation.

R packages

Many useful and important functions in R are provided via packages that need to be installed separately. You can do this by using the Package Installer in the menu (Packages & Data – Package Installer in R or Tools – Install Packages… in RStudio), or by typing

install.packages("lme4")

in the R command line. Next, in every R session or script, you need to load the packages you want to use: type

library("lme4")
## Loading required package: Matrix

in the R command line. You only need to install packages once on your (or any) computer, but you need to load them anew in each R session.

Alternatively, if you only want to access one particular function from a package, but do not want to load the whole package, you can use the packagename::function() notation.
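
For example, the following line uses filter() from the dplyr package (introduced later in this tutorial) on R’s built-in mtcars data without loading dplyr via library(); the package still has to be installed for this to run:

dplyr::filter(mtcars, mpg > 30)  # cars with more than 30 miles per gallon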

I will use this notation throughout the tutorials and labs for this course to explicitly indicate functions that are not part of base R. If you have not used a particular package on your computer before, please install it before proceeding.

Working directory

In most cases, it is useful to set a project-specific working directory, especially if you work with many files and want to print graphics to .pdf or .eps files. You can set the working directory with this command:

setwd("~/Documents/Dropbox/Uni/9 - ICPSR/2021/MLM/Slides/Lab-1")

You can typically see your current working directory on top of the R console in RStudio, or you can obtain the working directory with this command:

getwd()
## [1] "/Users/johanneskarreth/Documents/Dropbox/Uni/9 - ICPSR/2021/MLM/Slides/Lab-1"

RStudio also offers a very useful feature to set up a whole project (File – New Project…). Projects automatically create a working directory for you. Even though we won’t use projects in this workshop, I recommend them as an easy and foolproof way to manage files and directories.

Comments

R scripts contain two types of text: R commands and comments. Commands are executed and perform actions. Comments are part of a script, but they are not executed. Comments begin with the # sign. Anything that follows a # sign on the same line is ignored by R. Compare what happens with the following lines:

1 + 1
## [1] 2
# 1 + 1
1 + 1 # + 3
## [1] 2

You should use comments frequently to annotate your script files, explaining to yourself (and others) what each part of a script does.

R help

Within R, you can access the help files for any command that exists by typing ?commandname or, for a list of the commands within a package, by typing help(package = packagename). So, for instance:

?rnorm
help(package = "lme4")

Workflows and conventions

There are many resources on how to structure your R workflow (think of routines like the ones suggested by J. Scott Long in The Workflow of Data Analysis Using Stata), and I encourage you to search for and maintain a consistent approach to working with R. It will make your life much, much easier with regard to collaboration, replication, and general efficiency. I recommend following the Project TIER protocol. In addition, here are a few really important points that you might want to consider as you start using R:

  • Never type commands directly into the R console. Always use a script file in RStudio and execute your code from this script file using the “Run” button or the Command & Return (Mac) or Control & Return (Windows) key combination.
  • Always create and specify a working directory at the beginning of a script file. This will ensure that all input and output of your project-specific work is in a location that makes sense.
  • Comment your script files!
  • Save your script files in a project-specific working directory.
  • Use a consistent style when writing code. A good place to start is this style guide: http://adv-r.had.co.nz/Style.html. Read through this style guide today and follow it from now on.
  • In script files, try to break lines after 80 characters to keep your files readable.
  • Do not use the attach() command.

Useful resources

As R has become one of the most popular programs for statistical computing, the number of resources in print and online has increased dramatically. Searching for terms like “introduction to R software” will return a huge number of results.

Some (of the many) good resources that I have encountered and found useful are:

R and object-oriented programming

R is an object-oriented programming language. This means that you, the user, create objects and work with them. Objects can be of different types. To create an object, first type the object name, then the “assignment character” (a leftward arrow, <-), then the content of the object. To display an object, simply type the object’s name, and it will be printed to the console.

You can then apply functions to objects. Most functions have names that are somewhat descriptive of their purpose. For example, mean() calculates the mean of the numbers within the parentheses, and log() calculates the natural logarithm of the number(s) within the parentheses.
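
As a quick illustration (the object name x here is arbitrary):

x <- 9     # create an object named x that contains the number 9
x          # display the object
## [1] 9
sqrt(x)    # apply a function to the object
## [1] 3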

Functions consist of a function name, the function’s arguments, and specific values passed to the arguments. In symbolic terms:

function_name(argument1 = value,
              argument2 = value)

Here is a specific example of the function abbreviate, its first argument names.arg, and the value "Regression" that I pass to that argument:

abbreviate(names.arg = "Regression")
## Regression 
##     "Rgrs"

The following are the types of objects you need to be familiar with:

  • Scalars

  • Vectors of different types

    • Numeric (numbers)
    • Character (words or letters): always entered between quotation marks "
    • Factor (numbers with labels)
    • Logical (TRUE or FALSE)
  • Matrices

  • Data frames

  • Lists

Object types

Below, you find some more specific examples of different types of objects.

  • Numbers:
x <- 1
x
## [1] 1
y <- 2
x + y
## [1] 3
x * y
## [1] 2
x / y
## [1] 0.5
y^2
## [1] 4
log(x)
## [1] 0
exp(x)
## [1] 2.718282
  • Vectors:
xvec <- c(1, 2, 3, 4, 5)
xvec
## [1] 1 2 3 4 5
xvec2 <- seq(from = 1, to = 5, by = 1)
xvec2
## [1] 1 2 3 4 5
yvec <- rep(1, 5)
yvec
## [1] 1 1 1 1 1
zvec <- xvec + yvec
zvec
## [1] 2 3 4 5 6
  • Matrices:
mat1 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)
mat1
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
mat2 <- matrix(data = seq(from = 6, to = 3.5, by = -0.5), 
               nrow = 2, byrow = TRUE)
mat2
##      [,1] [,2] [,3]
## [1,]  6.0  5.5  5.0
## [2,]  4.5  4.0  3.5
mat1 %*% mat2
##      [,1] [,2] [,3]
## [1,]   15 13.5   12
## [2,]   36 32.5   29
## [3,]   57 51.5   46
  • Data frames (equivalent to data sets):
y <- c(1, 1, 3, 4, 7, 2)
x1 <- c(2, 4, 1, 8, 19, 11)
x2 <- c(-3, 4, -2, 0, 4, 20)
name <- c("Student 1", "Student 2", "Student 3", 
          "Student 4", "Student 5", "Student 6")
mydata <- data.frame(name, y, x1, x2)
mydata
##        name y x1 x2
## 1 Student 1 1  2 -3
## 2 Student 2 1  4  4
## 3 Student 3 3  1 -2
## 4 Student 4 4  8  0
## 5 Student 5 7 19  4
## 6 Student 6 2 11 20

Random numbers and distributions

You can use R to generate (random) draws from distributions. This will be important in the first assignment. For instance, to generate 1000 draws from a normal distribution with a mean of 5 and standard deviation of 10, you would write:

draws <- rnorm(n = 1000, mean = 5, sd = 10)
summary(draws)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -25.080  -1.710   4.467   4.642  11.255  44.220
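
Because these are random draws, your numbers will differ slightly from the output above each time you run this code. To make draws reproducible, set R’s random seed before drawing; the seed value 123 below is arbitrary:

set.seed(123)  # fix the random number generator for reproducible draws
draws <- rnorm(n = 1000, mean = 5, sd = 10)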

You can then use a variety of plotting commands (more on these below) to visualize your draws:

  • Density plots:
draws <- rnorm(n = 1000, mean = 5, sd = 10)
plot(density(draws), main = "This is a plot title", 
     xlab = "Label for the X-axis", ylab = "Label for the Y-axis")

  • Histograms:
draws <- rnorm(n = 1000, mean = 5, sd = 10)
hist(draws)

Extracting elements from an object

  • Elements from a vector:
vec <- c(4, 1, 5, 3)
vec[3]
## [1] 5
  • Variables from a data frame:
mydata$x1
## [1]  2  4  1  8 19 11
mydata$names # note: a misspelled variable name (names instead of name) returns NULL
## NULL
  • Columns from a matrix:
mat1[ ,1]
## [1] 1 3 5
  • Rows from a matrix:
mat1[1, ]
## [1] 1 2
  • Elements from a list:
mylist <- list(x1, x2, y)
mylist[[1]]
## [1]  2  4  1  8 19 11

Subsetting objects

Throughout this course, you will need to subset (or filter) data sets. In R, this can be done in multiple ways.

Using square brackets [ , ]

You can put conditions on a matrix or data frame and subset a dataset that way. Within the brackets, put conditions on rows before the comma and conditions on columns after the comma.

To filter all rows where y takes the value of 1, use:

mydata[mydata$y == 1, ]
##        name y x1 x2
## 1 Student 1 1  2 -3
## 2 Student 2 1  4  4

To filter all rows where y takes a value larger than 1, use:

mydata[mydata$y > 1, ]
##        name y x1 x2
## 3 Student 3 3  1 -2
## 4 Student 4 4  8  0
## 5 Student 5 7 19  4
## 6 Student 6 2 11 20

For a data frame with only the name and y columns, use:

mydata[ , c("name", "y")]
##        name y
## 1 Student 1 1
## 2 Student 2 1
## 3 Student 3 3
## 4 Student 4 4
## 5 Student 5 7
## 6 Student 6 2

Using subset()

The subset() function works similarly to square brackets, and some users find it more intuitive. It allows you to subset a data frame for specific rows and select specific variables:

subset(mydata, y > 1)
##        name y x1 x2
## 3 Student 3 3  1 -2
## 4 Student 4 4  8  0
## 5 Student 5 7 19  4
## 6 Student 6 2 11 20
subset(mydata, y > 1, select = c("name", "y"))
##        name y
## 3 Student 3 3
## 4 Student 4 4
## 5 Student 5 7
## 6 Student 6 2

Using the tidyverse and dplyr (filter())

In this workshop, we will be making use of yet another way to subset data (and do much more in terms of data management): the tidyverse set of packages, and specifically the dplyr package. Please read on for more information on this package.

Working with data sets

In most cases, you will not type up your data by hand, but use data sets that were created in other formats. You can easily import such data sets into R.

Importing data into R

The “rio” package allows you to import data sets in a variety of formats with a single function, import(). You first need to load the package:

library("rio")
## 
## Attaching package: 'rio'
## The following object is masked from 'package:lme4':
## 
##     factorize

The import() function “guesses” the format of the data from the file type extension, so that a file ending in .csv is read in as a comma-separated value file. If the file type extension does not reveal the type of data (e.g., a tab-separated file saved with a .txt extension), you need to provide the format argument, as you see in the first example below. See the help file for import() for more information.

Note that for each command, many options (in R parlance: arguments) are available; you will most likely need to work with these options at some point, for instance when your source dataset (e.g., in Stata) has value labels. Check the help files for the respective command in that case.

  • Tab-separated files: If you have a text file with a simple tab-delimited table, where the first line designates variable names:
mydata_from_tsv <- import("http://www.jkarreth.net/files/mydata.txt", format = "tsv")
head(mydata_from_tsv)
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69

Alternatively, use read.table() specifically for tab-separated files:

mydata_from_tsv <- read.table("http://www.jkarreth.net/files/mydata.txt", header = TRUE)
head(mydata_from_tsv)
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69
  • CSV files: If you have a text file with a simple comma-separated table, where the first line designates variable names:
mydata_from_csv <- import("http://www.jkarreth.net/files/mydata.csv")
head(mydata_from_csv)
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69

Alternatively, use read.csv() specifically for comma-separated files:

mydata_from_csv <- read.csv("http://www.jkarreth.net/files/mydata.csv")
head(mydata_from_csv)
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69
  • SPSS files: If you have an SPSS data file, you can do this:
mydata_from_spss <- import("http://www.jkarreth.net/files/mydata.sav")
head(mydata_from_spss)
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69
  • Stata files: If you have a Stata data file, you can do this:
mydata_from_dta <- import("http://www.jkarreth.net/files/mydata.dta")
head(mydata_from_dta)
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69

Alternatively, use read.dta() from the “foreign” package specifically for Stata files:

library("foreign")
mydata_from_dta <- read.dta("http://www.jkarreth.net/files/mydata.dta")
head(mydata_from_dta)
##       y    x1    x2
## 1 -0.56  1.22 -1.07
## 2 -0.23  0.36 -0.22
## 3  1.56  0.40 -1.03
## 4  0.07  0.11 -0.73
## 5  0.13 -0.56 -0.63
## 6  1.72  1.79 -1.69

Describing data

To obtain descriptive statistics of a dataset or a variable, use the summary() command:

summary(mydata_from_dta)
##        y                 x1               x2         
##  Min.   :-1.2700   Min.   :-1.970   Min.   :-1.6900  
##  1st Qu.:-0.5325   1st Qu.:-0.325   1st Qu.:-1.0600  
##  Median :-0.0800   Median : 0.380   Median :-0.6800  
##  Mean   : 0.0740   Mean   : 0.208   Mean   :-0.4270  
##  3rd Qu.: 0.3775   3rd Qu.: 0.650   3rd Qu.: 0.0575  
##  Max.   : 1.7200   Max.   : 1.790   Max.   : 1.2500
summary(mydata_from_dta$y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2700 -0.5325 -0.0800  0.0740  0.3775  1.7200

You can access particular quantities, such as standard deviations and quantiles (in this case the 5th and 95th percentiles), with the respective functions:

sd(mydata_from_dta$y)
## [1] 0.9561869
quantile(mydata_from_dta$y, probs = c(0.05, 0.95))
##     5%    95% 
## -1.009  1.648

Creating figures

R offers several options to create figures. The so-called base graphics come with R and are very powerful and stable. We will, however, work almost exclusively with the “ggplot2” package, an extension of R based on a consistent “grammar of graphics”.

The ggplot2 package

The “ggplot2” package has become popular because its language and plotting sequence can be somewhat more convenient (depending on users’ background), especially when working with more complex datasets. For plotting Bayesian model output, ggplot2 offers some useful features. I will mostly use ggplot2 in this workshop because (in my opinion) it offers a quick and scalable way to produce figures that are useful for diagnostics and publication-quality output alike.

Once installed, ggplot2 must first be loaded like any other external package. Its key commands are ggplot() and the various plot types, passed to R via geom_ functions. Components are added via +, either on one line or on a new line extending an existing ggplot2 object. ggplot() requires the data to be a data frame (or tibble, which we will discuss later); hence I convert the vector of observations generated below into a data frame in order to plot it.

library("ggplot2")
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
# set.seed(123)
# dist2 <- rnorm(n = 1000, mean = 0, sd = 2)
# dist.df <- data.frame(dist1, dist2)
# dist.df_long <- pivot_longer(dist.df, cols = everything())
# head(dist.df_long)
normal.plot <- ggplot(data = data.frame(dist1), aes(x = dist1))
normal.plot <- normal.plot + geom_density(alpha = 0.5)
normal.plot

ggplot2 offers plenty of opportunities for customizing plots; we will also encounter these later on in the workshop. You can also have a look at Winston Chang’s R Graphics Cookbook for plenty of examples of ggplot2 customization: http://www.cookbook-r.com/Graphs.

Exporting graphs

Any plot can be printed to a PDF file using the pdf() command. This code:

pdf("normal_plot.pdf", width = 5, height = 5)
normal.plot
dev.off()
## quartz_off_screen 
##                 2

will print the plot to a file named normal_plot.pdf, 5 by 5 inches in size, in your working directory.

Plots created with ggplot2 can also be saved using the ggsave() command:

ggsave(plot = normal.plot, filename = "normal_ggplot.pdf", 
       width = 5, height = 5, units = "in")

Tidy data management

In the last few years, contributors to R have developed a convenient and intuitive infrastructure for data management. Several packages are now very popular, default go-to tools for researchers and data scientists who routinely spend time on data management, processing, and visualization. Because the principle behind this group of R packages is “tidy” data analysis, these packages are sometimes referred to as the “tidyverse”. Technically speaking, tidyverse is an R package that installs and loads a number of other packages that are all useful for tidy data analysis.

Installation

To use these packages, please install the tidyverse package; be patient, as this might take a few minutes.

install.packages("tidyverse")

When you load this package, you will notice some messages indicating that other packages are loaded as well:

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x tidyr::pack()   masks Matrix::pack()
## x tidyr::unpack() masks Matrix::unpack()

You can ignore the message about conflicts; these tell you that R will now use functions from the tidyverse instead of base-R functions.

Use for subsetting data

Instead of the options mentioned above, you can use the filter() and select() functions from the tidyverse to subset your data:

filter(mydata, y > 1)
##        name y x1 x2
## 1 Student 3 3  1 -2
## 2 Student 4 4  8  0
## 3 Student 5 7 19  4
## 4 Student 6 2 11 20
select(mydata, name, y)
##        name y
## 1 Student 1 1
## 2 Student 2 1
## 3 Student 3 3
## 4 Student 4 4
## 5 Student 5 7
## 6 Student 6 2
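
The two steps can also be chained with the pipe operator %>%, which the tidyverse loads for you and which passes the result of one function on to the next:

mydata %>% filter(y > 1) %>% select(name, y)
##        name y
## 1 Student 3 3
## 2 Student 4 4
## 3 Student 5 7
## 4 Student 6 2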

Use for collapsing data frames

The tidyverse offers a number of easy-to-use functions to collapse, summarize, and otherwise manipulate data frames. We will explore these in Lab 2. As a small preview, the sketch below collapses our example data frame by group.
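
This sketch combines mutate(), group_by(), and summarize(); the grouping variable high_y is created here only for illustration:

mydata %>%
  mutate(high_y = y > 1) %>%       # create a grouping variable
  group_by(high_y) %>%             # group the data frame
  summarize(mean_x1 = mean(x1))    # collapse to one row per group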

The linear regression model in R

As you know, we can use regression analysis as a technique to express a linear relationship between two continuous variables.1 We typically refer to these variables as outcome variable \(y\) and explanatory or predictor variable \(x\). Linear regression allows you to express the relationship between \(x\) and \(y\) through coefficients in the following equation:

\[ y_i = \alpha + \beta x_i + \varepsilon_i \]

\(\alpha\) is the so-called intercept; \(\beta\) is the slope coefficient for your explanatory variable \(x\); \(\varepsilon_i\) is the residual. Other ways to express these are \(\beta_0\) for the intercept and \(\beta_1\) for the slope coefficient, or to use Roman instead of Greek letters. The index \(i\) stands for the individual observations in your data.

The following picture illustrates these concepts:

The remainder of this tutorial explains these concepts with some example data about various performance indicators of baseball teams.

A step-by-step example of fitting a linear model in R

The movie “Moneyball” focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

Here, we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.

The data

Let’s load up the data for the 2011 season.

mlb11 <- rio::import("http://www.jkarreth.net/files/mlb11.csv")
head(mlb11)
##                  team runs at_bats hits homeruns bat_avg strikeouts
## 1       Texas Rangers  855    5659 1599      210   0.283        930
## 2      Boston Red Sox  875    5710 1600      203   0.280       1108
## 3      Detroit Tigers  787    5563 1540      169   0.277       1143
## 4  Kansas City Royals  730    5672 1560      129   0.275       1006
## 5 St. Louis Cardinals  762    5532 1513      162   0.273        978
## 6       New York Mets  718    5600 1477      108   0.264       1085
##   stolen_bases wins new_onbase new_slug new_obs
## 1          143   96      0.340    0.460   0.800
## 2          102   90      0.349    0.461   0.810
## 3           49   95      0.340    0.434   0.773
## 4          153   71      0.329    0.415   0.744
## 5           57   90      0.341    0.425   0.766
## 6          130   77      0.335    0.391   0.725

In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the tutorial, you’ll work with the newer variables on your own.
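
A scatterplot is a useful first check of whether the relationship between two variables looks linear. Here is one way to draw it with base R’s plot() function (ggplot2, introduced above, would work just as well):

plot(runs ~ at_bats, data = mlb11, xlab = "At-bats", ylab = "Runs")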

If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient:

cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627

The linear model

We can use the lm function in R to fit the linear model (a.k.a. regression line).

mod1_lm <- lm(runs ~ at_bats, data = mlb11)

The first argument in the function lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of at_bats. The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.

The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function.

summary(mod1_lm)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388

Let’s consider this output piece by piece. First, the formula used to describe the model is shown at the top. After the formula you find the five-number summary of the residuals. The “Coefficients” table shown next is key; its first column displays the linear model’s y-intercept and the coefficient of at_bats. With this table, we can write down the least squares regression line for the linear model:

\[ \hat{y} = -2789.2429 + 0.6305 * \text{at\_bats} \]
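
To see what this line implies, plug in a value for at-bats: a team with 5,600 at-bats, for example, would be predicted to score \(-2789.24 + 0.6305 \times 5600 \approx 742\) runs. The predict() function performs this computation from the fitted model (5600 is just an illustrative value):

predict(mod1_lm, newdata = data.frame(at_bats = 5600))  # approximately 742 runs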

One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, \(R^2\). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.
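
If you want to report this quantity on its own, you can extract it directly from the summary object:

summary(mod1_lm)$r.squared  # approximately 0.373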

Summarizing the results of a regression model

For your applied work, you will need to present the results of a regression model in the form of a table or a graph. R offers convenient functions for both. Here, we briefly look at the screenreg() function to create a nicely formatted table in the R console, and then at its relatives texreg(), which creates tables for LaTeX/PDF documents, and htmlreg(), which you can use to create a table for a Word document.

First, install (once) and then load the texreg package.

install.packages("texreg")
library("texreg")

Then, simply feed the model object, mod1_lm, to the screenreg function:

screenreg(mod1_lm)
## 
## =========================
##              Model 1     
## -------------------------
## (Intercept)  -2789.24 ** 
##               (853.70)   
## at_bats          0.63 ***
##                 (0.15)   
## -------------------------
## R^2              0.37    
## Adj. R^2         0.35    
## Num. obs.       30       
## =========================
## *** p < 0.001; ** p < 0.01; * p < 0.05

We can customize this table in many ways. For your work, you at least want to simplify the number of stars that are printed (if any) and provide custom coefficient names. The stars and custom.coef.names arguments do this job:

screenreg(mod1_lm, stars = 0.05, custom.coef.names = c("Intercept", "At-bats"))
## 
## =====================
##            Model 1   
## ---------------------
## Intercept  -2789.24 *
##             (853.70) 
## At-bats        0.63 *
##               (0.15) 
## ---------------------
## R^2            0.37  
## Adj. R^2       0.35  
## Num. obs.     30     
## =====================
## * p < 0.05

If you’d like to export a regression table to Word, use htmlreg(), save the file in your working directory, open it with Word or LibreOffice, and then copy and paste the table into your manuscript. Use the file argument to specify the name of the file; R will automatically save it to your current working directory.

htmlreg(mod1_lm, stars = 0.05, custom.coef.names = c("Intercept", "At-bats"), 
  caption = "Regression table", caption.above = TRUE, file = "mod1.doc")

Model diagnostics

To assess whether the linear model is reliable, we need to check the assumptions of (A1) linearity, (A2) constant error variance, (A3) normality of the errors, (A4) independence of observations, and (A5) no correlation between \(x_i\) and \(\varepsilon_i\). (A4) and (A5) cannot be tested directly, but for the other assumptions, plotting residuals and some other quantities is a good first step.

The lindia package lets you do this with one line of code. First, install the package (once):

install.packages("lindia")

Then, load the package:

library("lindia")

Because your regression model is an object in R, you can pass it to the lindia::gg_diagnose() function and create a first diagnostic plot:

gg_diagnose(mod1_lm, ncol = 2)

You can use these plots to examine the following:

  • Linearity (A1): You already checked whether the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats (a base-R sketch follows this list). Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

  • Constant error variance (A2): You can use the residuals plot to check whether the variance of the errors is constant or changes across the range of the predictor. Based on the plot you just created, does the constant error variance condition appear to be met?

  • Normality of the errors (A3): To check this condition, we can look at the histogram of residuals. Based on the histogram, does the nearly normal residuals condition appear to be met?
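
For reference, here is a base-R sketch of the residuals-vs-predictor plot mentioned under (A1); residuals() extracts the residuals from the fitted model object:

plot(mlb11$at_bats, residuals(mod1_lm),
     xlab = "At-bats", ylab = "Residuals")
abline(h = 0, lty = 2)  # horizontal reference line at zero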

The lme4 package

Throughout this course, we will mostly rely on the lme4 package for multilevel modeling in R. The name lme4 stands for linear mixed-effects models. Its key functions work very similarly to the lm() function you just learned, and the package also provides a number of useful functions to diagnose and further work with multilevel models.

Please install this package once, if you didn’t already do so earlier:

install.packages("lme4")

and load it:

library("lme4")

The key function we will be using is the lmer() function. Try it out on the baseball data above:

mod1_lmer <- lmer(runs ~ at_bats, data = mlb11)
## Error: No random effects terms specified in formula

You will notice an error message telling you that you did not specify any random effects in the formula. We’ll pick up here in tomorrow’s lecture.
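
As a preview of the syntax we will cover: random effects are specified in parentheses, for example (1 | group) for a varying intercept. Here is a minimal sketch using the sleepstudy data that ships with lme4:

mod2_lmer <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)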

Integrating writing and data analysis

For project management and replication purposes, it is advantageous to combine your data analysis and writing in one framework. RMarkdown, Sweave, and knitr are great solutions for this. The RStudio website has a good explanation of these options: http://rmarkdown.rstudio.com. This tutorial and all materials for this course are written using knitr.

Further reading

If you’re new to R, you may find my (very old) course material for an introductory regression class useful.

Test yourself!

Exercise 1: First steps in R

  • Create an R script that performs the following operations:
    • Sets an assignment-specific working directory, e.g. /Users/johanneskarreth/Work/ICPSR/MLM/Assignment1.
    • Calculates the mean and standard deviation of this series of numbers: \(37, -6, -38, 12, -2\)

Exercise 2: Download survey data

  • Go to http://gss.norc.org/ and download the General Social Survey raw data for 2014 in SPSS or Stata format. Save this file in an assignment-specific working directory. Then, create an R script that performs the following operations:

    • Reads the GSS data into R
    • Shows the number of respondents in this survey file
    • Creates a histogram of the distribution of age of all respondents in the survey

Exercise 3: Fake data simulation

  • Create an R script that creates a fake data set representing 1500 observations. The data set should contain three variables:
    • an index variable that identifies each respondent, ranging from 1 to 1500
    • a binary variable that represents (fake) information on whether a respondent voted in the 2012 U.S. presidential election (1 if yes, 0 if not). Create this (fake) variable assuming that the respondents are randomly drawn from the U.S. population and that the turnout rate in this (fake) sample is more or less representative of the actual turnout rate in the 2012 presidential election.
    • a categorical variable for one of the four U.S. census regions (Northeast, South, Midwest, West). Assume that respondents are randomly drawn from these regions at equal probability.
  • Save this fake dataset as a .csv file to your working directory. Then provide summary statistics for it.

Exercise 4: Fit, diagnose, and interpret a linear regression model

  • Download a dataset of social indicators provided by the United Nations. The dataset is available as a text file on John Fox’s webpage at http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/index.html under the name “UnitedNations.txt”. Read it into R using the rio::import() function and the URL of the dataset. Note that for text files, you need to add format = "tsv" to your call to import().
  • Then, estimate a regression model predicting infant mortality with a set of at least 2 predictors you think are theoretically related to infant mortality. Create a regression table.
  • Check whether outliers influence your inferences. If yes, how do you propose to deal with these outliers?

  1. Some of the content below is a derivative of materials provided by the OpenIntro Statistics project (Diez, Barr, and Çetinkaya-Rundel) and is licensed under a CC BY-SA license.↩︎