The purpose of this tutorial is to show the very basics of the R language so that participants who have not used R before can complete the first assignment in this workshop. For information on the thousands of other features of R, see the suggested resources below.
The most recent version of R for all operating systems is always located at http://www.r-project.org/index.html. You can go directly to https://cloud.r-project.org, and download the R version for your operating system. Then, install R.
To operate R, you should rely on writing R scripts. We will write these scripts in RStudio. Download RStudio from http://www.rstudio.org. Then, install it on your computer. Some text editors also offer integration with R, so that you can send code directly to R. RStudio is generally the best solution for running R and maintaining a reproducible workflow.
Lastly, install LaTeX in order to compile PDF files from within RStudio. To do this, follow the instructions under http://www.jkarreth.net/latex.html, “Installation”. You won’t have to use LaTeX directly or learn how to write LaTeX code in this workshop. Alternatively, you can install the “tinytex” distribution, which requires less disk space and time to install. You can find excellent documentation and instructions for tinytex at https://yihui.org/tinytex/.
Upon opening it for the first time, RStudio will look like the screenshot below.
The window on the left is named “Console”. The prompt next to the blue greater-than sign (>) is the “command line”. You can tell R to perform actions by typing commands into this command line. We will rarely do this and will operate R through script files instead.
In the following sections, I walk you through some basic R commands. In this tutorial and most other materials you will see in this workshop, R commands and the resulting R output will appear in light grey boxes. Output in this tutorial is always preceded by two number signs (##).
To begin, see how R responds to commands. If you type a simple mathematical operation, R will return its result(s):
1 + 1
## [1] 2
2 * 3
## [1] 6
10 / 3
## [1] 3.333333
R will return error messages when a command is incorrect or when it cannot execute a command. Often, these error messages are informative. You can often get more information by simply searching for an error message on the web. Here, I try to add 1 and the letter a, which does not (yet) make sense: I have not defined an object a, and numbers and letters cannot be added:
1 + a
## Error in eval(expr, envir, enclos): object 'a' not found
As your code becomes more complex, you may forget to complete a particular command. For example, here I want to add 1 and the product of 2 and 4. But unless I add the closing parenthesis at the end of the line, or in the immediately following line, this code won’t execute:
1 + (2 * 4
)
## [1] 9
While executing this command and looking at the console, you will notice that the little > on the left changes into a +. This means that R is offering you a new line to finish the original command. If I type a right parenthesis, R returns the result of my operation.
Many useful and important functions in R are provided via packages that need to be installed separately. You can do this by using the Package Installer in the menu (Packages & Data – Package Installer in R or Tools – Install Packages… in RStudio), or by typing
install.packages("lme4")
in the R command line. Next, in every R session or script, you need to load the packages you want to use: type
library("lme4")
## Loading required package: Matrix
in the R command line. You only need to install packages once on your (or any) computer, but you need to load them anew in each R session.
Alternatively, if you only want to access one particular function from a package, but do not want to load the whole package, you can use the packagename::function notation.
I will use this notation throughout the tutorials and labs for this course to explicitly indicate functions that are not part of base R. If you have not used a particular package on your computer before, please install it before proceeding.
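For instance, to use a single function from the always-available stats package without a library() call (the choice of function here is just an illustration):

```r
# Call one function from a package without loading the whole package,
# using the packagename::function notation:
stats::median(c(1, 3, 5, 7))
## [1] 4
```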
In most cases, it is useful to set a project-specific working directory—especially if you work with many files and want to print graphics to .pdf or .eps files. You can set the working directory with this command:
setwd("~/Documents/Dropbox/Uni/9 - ICPSR/2021/MLM/Slides/Lab-1")
You can typically see your current working directory on top of the R console in RStudio, or you can obtain the working directory with this command:
getwd()
## [1] "/Users/johanneskarreth/Documents/Dropbox/Uni/9 - ICPSR/2021/MLM/Slides/Lab-1"
RStudio also offers a very useful function to set up a whole project (File – New Project…). Projects automatically create a working directory for you. Even though we won’t use projects in this workshop, I recommend them as an easy and foolproof way to manage files and directories.
Within R, you can access the help files for any command by typing ?commandname or, for a list of the commands within a package, by typing help(package = packagename). So, for instance:
?rnorm
help(package = "lme4")
There are many resources on how to structure your R workflow (think of routines like the ones suggested by J. Scott Long in The Workflow of Data Analysis Using Stata), and I encourage you to search for and maintain a consistent approach to working with R. It will make your life much, much easier—with regards to collaboration, replication, and general efficiency. I recommend following the Project TIER protocol. In addition, here are a few really important points that you might want to consider as you start using R:
In particular, avoid using the attach() command.
As R has become one of the most popular programs for statistical computing, the number of resources in print and online has increased dramatically. Searching for terms like “introduction to R software” will return a huge number of results.
Some (of the many) good resources that I have encountered and found useful are:
R is an object-oriented programming language. This means that you, the user, create objects and work with them. Objects can be of different types. To create an object, first type the object name, then the “assignment character” (a leftward arrow, <-), then the content of the object. To display an object, simply type the object’s name, and it will be printed to the console.
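A minimal example (the object name here is arbitrary):

```r
# Assign the value 10 to an object named "myobject" ...
myobject <- 10
# ... and display it by typing its name:
myobject
## [1] 10
```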
You can then apply functions to objects. Most functions have names that are somewhat descriptive of their purpose. For example, mean() calculates the mean of the numbers within the parentheses, and log() calculates the natural logarithm of the number(s) within the parentheses.
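For example:

```r
mean(c(2, 4, 6))  # mean of the numbers 2, 4, and 6
## [1] 4
log(1)            # natural logarithm of 1
## [1] 0
```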
Functions consist of a function name, the function’s arguments, and specific values passed to the arguments. In symbolic terms:
function_name(argument1 = value,
argument2 = value)
Here is a specific example: the function abbreviate(), its first argument names.arg, and the value "Regression" that I provide to this argument:
abbreviate(names.arg = "Regression")
## Regression
## "Rgrs"
The following are the types of objects you need to be familiar with:
Scalars
Vectors of different types
Matrices
Data frames
Lists
Below, you find some more specific examples of different types of objects.
x <- 1
x
## [1] 1
y <- 2
x + y
## [1] 3
x * y
## [1] 2
x / y
## [1] 0.5
y^2
## [1] 4
log(x)
## [1] 0
exp(x)
## [1] 2.718282
xvec <- c(1, 2, 3, 4, 5)
xvec
## [1] 1 2 3 4 5
xvec2 <- seq(from = 1, to = 5, by = 1)
xvec2
## [1] 1 2 3 4 5
yvec <- rep(1, 5)
yvec
## [1] 1 1 1 1 1
zvec <- xvec + yvec
zvec
## [1] 2 3 4 5 6
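Note that arithmetic on vectors works element-wise, and a single number is recycled across all elements:

```r
xvec <- c(1, 2, 3, 4, 5)  # as created above
xvec * 2                  # multiplies each element by 2
## [1]  2  4  6  8 10
xvec^2                    # squares each element
## [1]  1  4  9 16 25
```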
mat1 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)
mat1
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
mat2 <- matrix(data = seq(from = 6, to = 3.5, by = -0.5),
               nrow = 2, byrow = TRUE)
mat2
## [,1] [,2] [,3]
## [1,] 6.0 5.5 5.0
## [2,] 4.5 4.0 3.5
mat1 %*% mat2
## [,1] [,2] [,3]
## [1,] 15 13.5 12
## [2,] 36 32.5 29
## [3,] 57 51.5 46
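Two convenient helper functions for matrices are dim() (dimensions) and t() (transpose). Note that matrix multiplication with %*% requires the number of columns of the first matrix to match the number of rows of the second:

```r
mat1 <- matrix(data = c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE)  # as above
dim(mat1)  # 3 rows, 2 columns
## [1] 3 2
t(mat1)    # the transpose has 2 rows and 3 columns
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```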
y <- c(1, 1, 3, 4, 7, 2)
x1 <- c(2, 4, 1, 8, 19, 11)
x2 <- c(-3, 4, -2, 0, 4, 20)
name <- c("Student 1", "Student 2", "Student 3",
          "Student 4", "Student 5", "Student 6")
mydata <- data.frame(name, y, x1, x2)
mydata
## name y x1 x2
## 1 Student 1 1 2 -3
## 2 Student 2 1 4 4
## 3 Student 3 3 1 -2
## 4 Student 4 4 8 0
## 5 Student 5 7 19 4
## 6 Student 6 2 11 20
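A few functions that are useful for quickly inspecting any data frame (applied here to a compact re-creation of the mydata object from above):

```r
mydata <- data.frame(name = paste("Student", 1:6),  # same data as above
                     y = c(1, 1, 3, 4, 7, 2),
                     x1 = c(2, 4, 1, 8, 19, 11),
                     x2 = c(-3, 4, -2, 0, 4, 20))
nrow(mydata)   # number of rows (observations)
## [1] 6
names(mydata)  # variable names
## [1] "name" "y"    "x1"   "x2"
```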
You can use R to generate (random) draws from distributions. This will be important in the first assignment. For instance, to generate 1000 draws from a normal distribution with a mean of 5 and standard deviation of 10, you would write:
draws <- rnorm(n = 1000, mean = 5, sd = 10)
summary(draws)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -25.080 -1.710 4.467 4.642 11.255 44.220
You can then use a variety of plotting commands (more on these below) to visualize your draws:
draws <- rnorm(n = 1000, mean = 5, sd = 10)
plot(density(draws), main = "This is a plot title",
     xlab = "Label for the X-axis", ylab = "Label for the Y-axis")
draws <- rnorm(n = 1000, mean = 5, sd = 10)
hist(draws)
vec <- c(4, 1, 5, 3)
vec[3]
## [1] 5
mydata$x1
## [1] 2 4 1 8 19 11
mydata$names
## NULL
mat1[ , 1]
## [1] 1 3 5
mat1[1, ]
## [1] 1 2
mylist <- list(x1, x2, y)
mylist[[1]]
## [1] 2 4 1 8 19 11
Throughout this course, you will need to subset (or filter) data sets. In R, this can be done in multiple ways.
[ , ]
You can put conditions on a matrix or data frame and subset a dataset that way. Within the brackets, put conditions on rows before the comma and conditions on columns after the comma.
To filter all rows where y takes the value of 1, use:
mydata[mydata$y == 1, ]
## name y x1 x2
## 1 Student 1 1 2 -3
## 2 Student 2 1 4 4
To filter all rows where y takes a value larger than 1, use:
mydata[mydata$y > 1, ]
## name y x1 x2
## 3 Student 3 3 1 -2
## 4 Student 4 4 8 0
## 5 Student 5 7 19 4
## 6 Student 6 2 11 20
For a dataframe with only the columns containing names and y, use:
mydata[ , c("name", "y")]
## name y
## 1 Student 1 1
## 2 Student 2 1
## 3 Student 3 3
## 4 Student 4 4
## 5 Student 5 7
## 6 Student 6 2
subset()
The subset() function works similarly to square brackets. Some users find it more intuitive. It allows you to subset a data frame for specific rows and to select specific variables:
subset(mydata, y > 1)
## name y x1 x2
## 3 Student 3 3 1 -2
## 4 Student 4 4 8 0
## 5 Student 5 7 19 4
## 6 Student 6 2 11 20
subset(mydata, y > 1, select = c("name", "y"))
## name y
## 3 Student 3 3
## 4 Student 4 4
## 5 Student 5 7
## 6 Student 6 2
filter()
In this workshop, we will be making use of yet another way to subset data (and to do much more in terms of data management): the tidyverse set of packages, and specifically the dplyr package. Please read on for more information on this package.
In most cases, you will not type up your data by hand, but use data sets that were created in other formats. You can easily import such data sets into R.
The “rio” package allows you to import data sets in a variety of formats with one single function, import()
. You need to first load the package:
library("rio")
##
## Attaching package: 'rio'
## The following object is masked from 'package:lme4':
##
## factorize
The import() function “guesses” the format of the data from the file type extension, so that a file ending in .csv is read in as a comma-separated values file. If the file type extension does not reveal the type of data (e.g., a tab-separated file saved with a .txt extension), you need to provide the format argument, as you see in the first example below. See the help file for import() for more information.
Note that for each command, many options (in R language: arguments) are available; you will most likely need to work with these options at some time, for instance when your source dataset (e.g., in Stata) has value labels. Check the help files for the respective command in that case.
mydata_from_tsv <- import("http://www.jkarreth.net/files/mydata.txt", format = "tsv")
head(mydata_from_tsv)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
Alternatively, use read.table()
specifically for tab-separated files:
mydata_from_tsv <- read.table("http://www.jkarreth.net/files/mydata.txt", header = TRUE)
head(mydata_from_tsv)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
mydata_from_csv <- import("http://www.jkarreth.net/files/mydata.csv")
head(mydata_from_csv)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
Alternatively, use read.csv()
specifically for comma-separated files:
mydata_from_csv <- read.csv("http://www.jkarreth.net/files/mydata.csv")
head(mydata_from_csv)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
mydata_from_spss <- import("http://www.jkarreth.net/files/mydata.sav")
head(mydata_from_spss)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
mydata_from_dta <- import("http://www.jkarreth.net/files/mydata.dta")
head(mydata_from_dta)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
Alternatively, use read.dta()
from the “foreign” package specifically for Stata files:
library("foreign")
mydata_from_dta <- read.dta("http://www.jkarreth.net/files/mydata.dta")
head(mydata_from_dta)
## y x1 x2
## 1 -0.56 1.22 -1.07
## 2 -0.23 0.36 -0.22
## 3 1.56 0.40 -1.03
## 4 0.07 0.11 -0.73
## 5 0.13 -0.56 -0.63
## 6 1.72 1.79 -1.69
To obtain descriptive statistics of a dataset, or a variable, use the summary()
command:
summary(mydata_from_dta)
## y x1 x2
## Min. :-1.2700 Min. :-1.970 Min. :-1.6900
## 1st Qu.:-0.5325 1st Qu.:-0.325 1st Qu.:-1.0600
## Median :-0.0800 Median : 0.380 Median :-0.6800
## Mean : 0.0740 Mean : 0.208 Mean :-0.4270
## 3rd Qu.: 0.3775 3rd Qu.: 0.650 3rd Qu.: 0.0575
## Max. : 1.7200 Max. : 1.790 Max. : 1.2500
summary(mydata_from_dta$y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2700 -0.5325 -0.0800 0.0740 0.3775 1.7200
You can access particular quantities, such as standard deviations and quantiles (in this case the 5th and 95th percentiles), with the respective functions:
sd(mydata_from_dta$y)
## [1] 0.9561869
quantile(mydata_from_dta$y, probs = c(0.05, 0.95))
## 5% 95%
## -1.009 1.648
R offers several options to create figures. The so-called base graphics come with R and are very powerful and stable. We will, however, work almost exclusively with the “ggplot2” package, an extension of R based on a consistent “grammar of graphics”.
The “ggplot2” package has become popular because its language and plotting sequence can be somewhat more convenient (depending on users’ background), especially when working with more complex datasets. For plotting Bayesian model output, ggplot2 offers some useful features. I will mostly use ggplot2 in this workshop because (in my opinion) it offers a quick and scalable way to produce figures that are useful for diagnostics and publication-quality output alike.
Once installed, ggplot2 needs to be first loaded as an external package. Its key commands are ggplot()
and various types of plots, passed to R via geom_
commands. All commands are added via +
, either in one line or in a new line to an existing ggplot2 object. ggplot()
requires the data to come in the class of data frames (or tibbles, which we will discuss later), hence I convert the vector of observations I generate below into a data frame in order to plot it.
library("ggplot2")
set.seed(123)
dist1 <- rnorm(n = 1000, mean = 0, sd = 1)
# set.seed(123)
# dist2 <- rnorm(n = 1000, mean = 0, sd = 2)
# dist.df <- data.frame(dist1, dist2)
# dist.df_long <- pivot_longer(dist.df, cols = everything())
# head(dist.df_long)
normal.plot <- ggplot(data = data.frame(dist1), aes(x = dist1))
normal.plot <- normal.plot + geom_density(alpha = 0.5)
normal.plot
ggplot2 offers plenty of opportunities for customizing plots; we will also encounter these later on in the workshop. You can also have a look at Winston Chang’s R Graphics Cookbook for plenty of examples of ggplot2 customization: http://www.cookbook-r.com/Graphs.
Any plot can be printed to a PDF file using the pdf()
command. This code:
pdf("normal_plot.pdf", width = 5, height = 5)
normal.plot
dev.off()
## quartz_off_screen
## 2
will print a plot named normal_plot.pdf
of the size 5 by 5 inches to your working directory.
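Other graphics devices work the same way. For instance, png() creates a bitmap file; its width and height are given in pixels by default:

```r
# Print a base-R plot to a PNG file in the working directory:
png("normal_plot.png", width = 500, height = 500)
plot(density(rnorm(1000)))  # any plot commands go between the device calls
dev.off()                   # close the device to finish writing the file
```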
Plots created with ggplot2 can also be saved using the ggsave()
command:
ggsave(plot = normal.plot, filename = "normal_ggplot.pdf", width = 5, height = 5, unit = "in")
In the last few years, contributors to R have developed a convenient and intuitive infrastructure for data management. Several packages are now very popular and default go-to tools for researchers and data scientists who routinely spend time on data management, processing, and visualization. Because the principle behind this group of R packages is “tidy” data analysis, these packages are sometimes referred to as the “tidyverse”. Technically speaking, tidyverse is an R package that installs and loads a number of other packages that are all useful for tidy data analysis.
To use these packages, please install the tidyverse package. Be patient; this might take a few minutes.
install.packages("tidyverse")
When you load this package, you will notice some messages indicating that other packages are loaded as well:
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.2 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x tidyr::pack() masks Matrix::pack()
## x tidyr::unpack() masks Matrix::unpack()
You can ignore the message about conflicts; these tell you that R will now use functions from the tidyverse instead of base-R functions.
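If you ever do need a masked base-R function after loading the tidyverse, the packagename::function notation introduced earlier still reaches it explicitly. For example, stats::filter() is a time-series filter, unrelated to dplyr::filter():

```r
# A recursive filter with coefficient 1 produces a running sum
# (1, 3, 6, 10); calling it as stats::filter() works even when
# dplyr::filter() masks the name "filter":
stats::filter(c(1, 2, 3, 4), filter = 1, method = "recursive")
```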
Instead of the options mentioned above, you can use the filter()
and select()
functions from the tidyverse to subset your data:
filter(mydata, y > 1)
## name y x1 x2
## 1 Student 3 3 1 -2
## 2 Student 4 4 8 0
## 3 Student 5 7 19 4
## 4 Student 6 2 11 20
select(mydata, name, y)
## name y
## 1 Student 1 1
## 2 Student 2 1
## 3 Student 3 3
## 4 Student 4 4
## 5 Student 5 7
## 6 Student 6 2
The tidyverse offers a number of easy-to-use functions to collapse, summarize, and otherwise manipulate data frames. We will explore these in Lab 2.
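As a brief preview, the summarize() function from dplyr collapses a data frame into summary values (the data frame below abbreviates the mydata object created earlier):

```r
library("dplyr")
mydata <- data.frame(name = paste("Student", 1:6),
                     y = c(1, 1, 3, 4, 7, 2))
summarize(mydata, mean_y = mean(y), max_y = max(y))
##   mean_y max_y
## 1      3     7
```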
As you know, we can use regression analysis as a technique to express a linear relationship between two continuous variables.1 We typically refer to these variables as outcome variable \(y\) and explanatory or predictor variable \(x\). Linear regression allows you to express the relationship between \(x\) and \(y\) through coefficients in the following equation:
\[ \begin{aligned} y_i = \alpha + \beta x_i + \varepsilon \end{aligned} \]
\(\alpha\) is the so-called intercept; \(\beta\) is the slope coefficient for your explanatory variable \(x\); \(\varepsilon\) is the residual. Other ways to express these are \(\beta_0\) for the intercept and \(\beta_1\) for the slope coefficient, or to use Roman instead of Greek letters. The index \(i\) stands for the individual observations of your data.
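To make these symbols concrete, you can simulate data where \(\alpha\) and \(\beta\) are known and check that a fitted regression (via the lm() function, introduced below) approximately recovers them. The values chosen here are arbitrary:

```r
set.seed(42)                  # make the simulation reproducible
x <- rnorm(100)               # an arbitrary predictor
epsilon <- rnorm(100)         # residuals
y <- 2 + 0.5 * x + epsilon    # true alpha = 2, true beta = 0.5
coef(lm(y ~ x))               # estimates will be close to 2 and 0.5
```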
The following picture illustrates these concepts:
The remainder of this tutorial explains these concepts with some example data about various performance indicators of baseball teams.
The movie “Moneyball” focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
Here, we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
Let’s load up the data for the 2011 season.
mlb11 <- rio::import("http://www.jkarreth.net/files/mlb11.csv")
head(mlb11)
## team runs at_bats hits homeruns bat_avg strikeouts
## 1 Texas Rangers 855 5659 1599 210 0.283 930
## 2 Boston Red Sox 875 5710 1600 203 0.280 1108
## 3 Detroit Tigers 787 5563 1540 169 0.277 1143
## 4 Kansas City Royals 730 5672 1560 129 0.275 1006
## 5 St. Louis Cardinals 762 5532 1513 162 0.273 978
## 6 New York Mets 718 5600 1477 108 0.264 1085
## stolen_bases wins new_onbase new_slug new_obs
## 1 143 96 0.340 0.460 0.800
## 2 102 90 0.349 0.461 0.810
## 3 49 95 0.340 0.434 0.773
## 4 153 71 0.329 0.415 0.744
## 5 57 90 0.341 0.425 0.766
## 6 130 77 0.335 0.391 0.725
In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the tutorial, you’ll work with the newer variables on your own.
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient:
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
We can use the lm
function in R to fit the linear model (a.k.a. regression line).
mod1_lm <- lm(runs ~ at_bats, data = mlb11)
The first argument in the function lm
is a formula that takes the form y ~ x
. Here it can be read that we want to make a linear model of runs
as a function of at_bats
. The second argument specifies that R should look in the mlb11
data frame to find the runs
and at_bats
variables.
The output of lm
is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function.
summary(mod1_lm)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
Let’s consider this output piece by piece. First, the formula used to describe the model is shown at the top. After the formula you find the five-number summary of the residuals. The “Coefficients” table shown next is key; its first column displays the linear model’s y-intercept and the coefficient of at_bats
. With this table, we can write down the least squares regression line for the linear model:
\[ \hat{y} = -2789.2429 + 0.6305 * \text{at\_bats} \]
One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, \(R^2\). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.
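With a single predictor, \(R^2\) equals the squared correlation between \(x\) and \(y\), so you can verify this number directly (the correlation of 0.610627 was computed above):

```r
# Squaring the correlation reproduces the Multiple R-squared
# reported by summary(mod1_lm):
mlb11 <- rio::import("http://www.jkarreth.net/files/mlb11.csv")
cor(mlb11$runs, mlb11$at_bats)^2  # approx. 0.3729
```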
For your applied work, you will need to present the results of a regression model in the form of a table or a graph. R offers convenient functions for both. Here, we briefly look at the screenreg() function to create a nicely formatted table in the R console, and then at its relatives: texreg(), which produces LaTeX code for a PDF file, and htmlreg(), which you can use to create a table for a Word document.
First, install (once) and then load the texreg package.
install.packages("texreg")
library("texreg")
Then, simply feed the model object, mod1_lm, to the screenreg() function:
screenreg(mod1_lm)
##
## =========================
## Model 1
## -------------------------
## (Intercept) -2789.24 **
## (853.70)
## at_bats 0.63 ***
## (0.15)
## -------------------------
## R^2 0.37
## Adj. R^2 0.35
## Num. obs. 30
## =========================
## *** p < 0.001; ** p < 0.01; * p < 0.05
We can customize this table in many ways. For your work, you at least want to simplify the number of stars that are printed (if any) and provide custom coefficient names. The stars
and custom.coef.names
arguments do this job:
screenreg(mod1_lm, stars = 0.05, custom.coef.names = c("Intercept", "At-bats"))
##
## =====================
## Model 1
## ---------------------
## Intercept -2789.24 *
## (853.70)
## At-bats 0.63 *
## (0.15)
## ---------------------
## R^2 0.37
## Adj. R^2 0.35
## Num. obs. 30
## =====================
## * p < 0.05
If you’d like to export a regression table to Word, use htmlreg(), save the file in your working directory, open it with Word or LibreOffice, and then copy and paste the table into your manuscript. Use the file argument to specify the name of the file. R will automatically save it to your current working directory.
htmlreg(mod1_lm, stars = 0.05, custom.coef.names = c("Intercept", "At-bats"),
caption = "Regression table", caption.above = TRUE, file = "mod1.doc")
To assess whether the linear model is reliable, we need to check the assumptions of (A1) linearity, (A2) constant error variance, (A3) normality of the errors, (A4) independence of observations, and (A5) no correlation between \(x_i\) and \(\varepsilon_i\). (A4) and (A5) cannot be directly tested, but for the other assumptions, plotting residuals and some other quantities is a good first step.
The lindia package allows doing this in one line of code. First, install the package (once):
install.packages("lindia")
Then, load the package:
library("lindia")
Because your regression model is an object in R, you can pass it to the lindia::gg_diagnose()
function and create a first diagnostic plot:
gg_diagnose(mod1_lm, ncol = 2)
You can use these plots to examine the following:
Linearity (A1): You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?
Constant error variance (A2): You can use the residuals plot to check whether the variance of the errors is constant or changes across the range of the predictor. Based on the plot you just created, does the constant error variance condition appear to be met?
Normality of the errors (A3): To check this condition, we can look at the histogram of residuals. Based on the histogram, does the nearly normal residuals condition appear to be met?
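As an alternative that requires no extra package, base R's plot() method for fitted lm objects produces a similar set of four diagnostic plots (residuals vs. fitted values, a normal Q-Q plot, a scale-location plot, and residuals vs. leverage):

```r
# Re-fit the model from above and draw base R's built-in diagnostics:
mlb11 <- rio::import("http://www.jkarreth.net/files/mlb11.csv")
mod1_lm <- lm(runs ~ at_bats, data = mlb11)
par(mfrow = c(2, 2))  # arrange the four plots in a 2-by-2 grid
plot(mod1_lm)
```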
Throughout this course, we will mostly rely on the lme4 package for multilevel modeling in R. lme4 stands for “linear mixed-effects models using Eigen and S4”. This package works very similarly to the lm() function you just learned; it also provides a number of useful functions to diagnose and further work with multilevel models.
Please install this package once, if you didn’t already do so earlier:
install.packages("lme4")
and load it:
library("lme4")
The key function we will be using is the lmer()
function. Try it out on the baseball data above:
mod1_lmer <- lmer(runs ~ at_bats, data = mlb11)
## Error: No random effects terms specified in formula
You will notice an error message that you did not specify random effects in the formula. We’ll pick up here in tomorrow’s lecture.
For project management and replication purposes, it is advantageous to combine your data analysis and writing in one framework. RMarkdown, Sweave, and knitr are great solutions for this. The RStudio website has a good explanation of these options: http://rmarkdown.rstudio.com. This tutorial and all materials for this course are written using knitr.
If you’re new to R, you may find my (very old) course material for an introductory regression class useful.
Go to http://gss.norc.org/ and download the General Social Survey raw data for 2014 in SPSS or Stata format. Save this file in an assignment-specific working directory. Then, create an R script that performs the following operations:
Import the dataset using the rio::import() function and the URL of the dataset. Note that for text files, you need to add format = "tsv" to your call to import().
Some of the content below is a derivative of materials provided by the OpenIntro Statistics project (Diez, Barr, and Çetinkaya-Rundel) and is licensed under a CC BY-SA license.↩︎
Comments
R scripts contain two types of text: R commands and comments. Commands are executed and perform actions. Comments are part of a script, but they are not executed. Comments begin with the # sign. Anything that follows a # sign in the same line will be ignored by R. You should use comments frequently to annotate your script files in order to explain to yourself what you are doing in a script file.
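For example, compare these two lines: the first is executed (everything after the # is ignored), while the second is a comment in its entirety and produces no output:

```r
1 + 1  # this part of the line is a comment and is ignored
## [1] 2
# 1 + 1 (this whole line is a comment; R returns nothing)
```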