This tutorial shows you:

• how to estimate a regression model with multiple predictors
• how to present regression results graphically
• how to calculate standardized regression coefficients

Note on copying & pasting code from the PDF version of this tutorial: You may run into trouble if you copy & paste code from the PDF version of this tutorial into your R script. When the PDF is created, some characters (for instance, quotation marks or indentations) are converted into non-text characters that R won’t recognize. To use code from this tutorial, either type it yourself into your R script or copy & paste it from the source file for this tutorial, which is posted on my website.

Note on R functions discussed in this tutorial: I don’t discuss many functions in detail here, so I encourage you to look up the help files for these functions or search the web for them before you use them. This will help you understand the functions better. Each of these functions is well-documented either in its help file (which you can access in R by typing ?ifelse, for instance) or on the web. The Companion to Applied Regression (see our syllabus) also provides many detailed explanations.

As always, please note that this tutorial only accompanies the other materials for Day 8 and that you are expected to have worked through the videos and reading for that day before tackling this tutorial.

# Basic setup

This tutorial builds on the readings and videos you worked through for Day 8. These materials introduce a new form of the familiar regression equation. On Days 6 and 7, you worked with one outcome variable $$y$$ and one explanatory variable $$x$$. The parameters you recovered via OLS were the intercept, $$\alpha$$, and the slope coefficient, $$\beta$$. Today, you add a second explanatory variable, $$x_2$$, alongside $$x_1$$.

This also increases the number of slope coefficients to two: $$\beta_1$$ is associated with $$x_1$$, and $$\beta_2$$ is associated with $$x_2$$:

\begin{aligned} y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i \end{aligned}

We can simulate the data-generating process for this $$y$$, mirroring the approach you saw in the tutorial for Day 6:

set.seed(123)                              # for reproducibility
n.obs <- 50                                # number of observations
x1 <- rnorm(n = n.obs, mean = 5, sd = 2)   # first predictor
x2 <- runif(n = n.obs, min = -7, max = 7)  # second predictor
i <- c(1:n.obs)                            # observation index
e <- rnorm(n = n.obs, mean = 0, sd = 2)    # error term
a <- 2                                     # true intercept
b1 <- 0.5                                  # true slope for x1
b2 <- -0.75                                # true slope for x2
y <- a + b1 * x1 + b2 * x2 + e             # data-generating process
sim.dat <- data.frame(y, x1, x2)

Here, I choose to create a new dataframe, sim.dat, that contains the three variables we’re working with: $$y$$, $$x_1$$, and $$x_2$$.
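Before estimating anything, it is worth confirming that the data frame has the shape we intended. A quick sketch, assuming the simulation code above has been run:

```r
# sanity checks on the simulated data
dim(sim.dat)      # 50 rows, 3 columns (y, x1, x2)
head(sim.dat, 3)  # first three rows
```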

When estimating a linear model in statistical software, in our case R, the only change you have to make is to add the additional explanatory variable(s) to the formula for the model you estimate:

mod <- lm(y ~ x1 + x2, data = sim.dat)
summary(mod)
##
## Call:
## lm(formula = y ~ x1 + x2, data = sim.dat)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.1279 -1.2674 -0.2188  1.0713  4.2181
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.55417    0.71122   3.591 0.000784 ***
## x1           0.39331    0.13211   2.977 0.004589 **
## x2          -0.78539    0.05943 -13.215  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.709 on 47 degrees of freedom
## Multiple R-squared:  0.792,  Adjusted R-squared:  0.7832
## F-statistic: 89.48 on 2 and 47 DF,  p-value: < 2.2e-16
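Because we simulated the data ourselves, we can compare the OLS estimates with the true parameter values we set above. A quick sketch, assuming the objects from the simulation code are still in your workspace:

```r
# compare OLS estimates with the true parameters of the simulation
round(coef(mod), 3)          # estimated intercept and slopes
c(a = a, b1 = b1, b2 = b2)   # true values: 2, 0.5, -0.75
```

The estimates won’t match the true values exactly because of the random error term, but they should be close.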

This lm object has some elements that you can use later on. You can view the structure of the model object with the str() function:

str(mod)
## List of 12
##  $ coefficients : Named num [1:3] 2.554 0.393 -0.785
##   ..- attr(*, "names")= chr [1:3] "(Intercept)" "x1" "x2"
##  $ residuals    : Named num [1:50] 1.96 -0.722 -2.135 0.582 -0.279 ...
##   ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:50] -30.961 3.559 -22.578 -0.185 -0.485 ...
##   ..- attr(*, "names")= chr [1:50] "(Intercept)" "x1" "x2" "" ...
##  $ rank         : int 3
##  $ fitted.values: Named num [1:50] 2.98 6.178 5.872 -0.421 4.81 ...
##   ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
##  $ assign       : int [1:3] 0 1 2
##  $ qr           :List of 5
##   ..$ qr   : num [1:50, 1:3] -7.071 0.141 0.141 0.141 0.141 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:50] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:3] "(Intercept)" "x1" "x2"
##   .. ..- attr(*, "assign")= int [1:3] 0 1 2
##   ..$ qraux: num [1:3] 1.14 1.03 1.06
##   ..$ pivot: int [1:3] 1 2 3
##   ..$ tol  : num 1e-07
##   ..$ rank : int 3
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 47
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = y ~ x1 + x2, data = sim.dat)
##  $ terms        :Classes 'terms', 'formula' length 3 y ~ x1 + x2
##   .. ..- attr(*, "variables")= language list(y, x1, x2)
##   .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:3] "y" "x1" "x2"
##   .. .. .. ..$ : chr [1:2] "x1" "x2"
##   .. ..- attr(*, "term.labels")= chr [1:2] "x1" "x2"
##   .. ..- attr(*, "order")= int [1:2] 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
##   .. ..- attr(*, "predvars")= language list(y, x1, x2)
##   .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:3] "y" "x1" "x2"
##  $ model        :'data.frame':  50 obs. of  3 variables:
##   ..$ y : num [1:50] 4.941 5.456 3.737 0.161 4.531 ...
##   ..$ x1: num [1:50] 3.88 4.54 8.12 5.14 5.26 ...
##   ..$ x2: num [1:50] 1.4 -2.34 -0.159 6.363 -0.239 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 y ~ x1 + x2
##   .. .. ..- attr(*, "variables")= language list(y, x1, x2)
##   .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:3] "y" "x1" "x2"
##   .. .. .. .. ..$ : chr [1:2] "x1" "x2"
##   .. .. ..- attr(*, "term.labels")= chr [1:2] "x1" "x2"
##   .. .. ..- attr(*, "order")= int [1:2] 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
##   .. .. ..- attr(*, "predvars")= language list(y, x1, x2)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:3] "y" "x1" "x2"
##  - attr(*, "class")= chr "lm"

We can extract some of these quantities with commands that you’re already familiar with:

coef(mod)
## (Intercept)          x1          x2
##   2.5541711   0.3933124  -0.7853943
plot(x = fitted.values(mod),
     y = residuals(mod))
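Other extractor functions work the same way on a model with multiple predictors. A short sketch, assuming the mod object from above:

```r
confint(mod)            # 95% confidence intervals for all three coefficients
summary(mod)$r.squared  # R-squared, pulled out of the summary object
nobs(mod)               # number of observations used in estimation: 50
```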

Before you advance, a reminder: the assumptions we discussed on Day 6 also underlie the linear regression model with more than one predictor. These assumptions are:

• Linearity (A1)
• Constant error variance: $$V(\varepsilon) = \sigma^2$$ (A2)
• Normality of the errors: $$\varepsilon \sim \mathcal{N}(0, \sigma^2)$$ (A3)
• Independence of observations: $$\varepsilon_i$$ and $$\varepsilon_j$$ are independent (A4)
• None of the predictors $$x_1$$, $$x_2$$, etc. is a function of $$\varepsilon$$ (A5)
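R’s built-in diagnostic plots offer a quick, informal check of several of these assumptions. A sketch, assuming the mod object from above:

```r
# residuals vs. fitted values speaks to linearity (A1) and
# constant error variance (A2); the Q-Q plot speaks to normality (A3)
plot(mod, which = 1)  # residuals vs. fitted values
plot(mod, which = 2)  # normal Q-Q plot of the residuals
```

These plots can only suggest, not prove, violations; independence (A4) and exogeneity (A5) have to be argued from the research design.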

# Example data: Occupational prestige

We’ll now use some example data that you encountered in your AR reading. These data come from the “car” package, so I first load the package and then create the object prestige.dat containing the dataset.

library(car)
prestige.dat <- data.frame(Prestige)
head(prestige.dat)
##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof
summary(prestige.dat)
##    education          income          women           prestige
##  Min.   : 6.380   Min.   :  611   Min.   : 0.000   Min.   :14.80
##  1st Qu.: 8.445   1st Qu.: 4106   1st Qu.: 3.592   1st Qu.:35.23
##  Median :10.540   Median : 5930   Median :13.600   Median :43.60
##  Mean   :10.738   Mean   : 6798   Mean   :28.979   Mean   :46.83
##  3rd Qu.:12.648   3rd Qu.: 8187   3rd Qu.:52.203   3rd Qu.:59.27
##  Max.   :15.970   Max.   :25879   Max.   :97.510   Max.   :87.20
##      census       type
##  Min.   :1113   bc  :44
##  1st Qu.:3120   prof:31
##  Median :5135   wc  :23
##  Mean   :5402   NA's: 4
##  3rd Qu.:8312
##  Max.   :9517

The dataset has 102 rows (each row is one occupation) and 6 variables. The variables are:
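You can verify these dimensions directly:

```r
library(car)
prestige.dat <- data.frame(Prestige)
dim(prestige.dat)  # 102 rows (occupations), 6 variables
```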

| Variable | Description |
|----------|-------------|
| education | Average education of occupational incumbents, in years, in 1971. |
| income | Average income of incumbents, in dollars, in 1971. |
| women | Percentage of incumbents who are women. |
| prestige | Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s. |
| census | Canadian Census occupational code. |
| type | Type of occupation. A factor with levels: bc (blue collar), prof (professional), and wc (white collar). |

For this example, I’d like to investigate the correlates of the prestige score. I could first focus on the income of each occupation and create a scatterplot of income and prestige.

plot(x = prestige.dat$income,
     y = prestige.dat$prestige)
income.mod <- lm(prestige ~ income, data = prestige.dat)
abline(income.mod)
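Just as in the simulated example, moving from this bivariate model to one with multiple predictors only requires adding terms to the formula. As an illustrative sketch (the choice of education as a second predictor here is mine, for demonstration):

```r
# same formula syntax as before: add predictors with '+'
prestige.mod <- lm(prestige ~ income + education, data = prestige.dat)
coef(prestige.mod)  # intercept plus one slope per predictor
```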