This tutorial shows you:
Note on copying and pasting code from the PDF version of this tutorial: you may run into trouble if you copy and paste code from the PDF into your R script. When the PDF is created, some characters (for instance, quotation marks or indentations) are converted into non-text characters that R won't recognize. To use code from this tutorial, either type it yourself into your R script or copy and paste it from the source file for this tutorial, which is posted on my website.
Note on R functions discussed in this tutorial: I don't discuss many functions in detail here, so I encourage you to look up the help files for these functions or search the web for them before you use them. This will help you understand the functions better. Each of these functions is well documented either in its help file (which you can access in R by typing ?ifelse, for instance) or on the web. The Companion to Applied Regression (see our syllabus) also provides many detailed explanations.
As always, please note that this tutorial only accompanies the other materials for Day 8 and that you are expected to have worked through the videos and reading for that day before tackling this tutorial.
This tutorial builds on the readings and videos you worked through for Day 8. These materials introduce a new form of the familiar regression equation. On Days 6 and 7, you worked with one outcome variable \(y\) and one explanatory variable \(x\); the parameters you recovered via OLS were the intercept, \(\alpha\), and the slope coefficient, \(\beta\). Today, you encounter a second explanatory variable, \(x_2\), alongside \(x_1\).
This also increases the number of slope coefficients to two: \(\beta_1\) is associated with \(x_1\), and \(\beta_2\) is associated with \(x_2\):
\[ \begin{aligned} y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i \end{aligned} \]
We can simulate the data-generating process for this \(y\), mirroring the approach you saw in the tutorial for Day 6:
set.seed(123)
n.obs <- 50
x1 <- rnorm(n = n.obs, mean = 5, sd = 2)
x2 <- runif(n = n.obs, min = -7, max = 7)
i <- 1:n.obs # observation index (not used below)
e <- rnorm(n = n.obs, mean = 0, sd = 2)
a <- 2
b1 <- 0.5
b2 <- -0.75
y <- a + b1 * x1 + b2 * x2 + e
sim.dat <- data.frame(y, x1, x2)
Here, I choose to create a new dataframe, sim.dat, that contains the three variables we’re working with: \(y\), \(x_1\), and \(x_2\).
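Before handing these data to lm(), it may help to see that the OLS estimates for this model have a closed-form solution in matrix notation, \(\hat{\beta} = (X'X)^{-1}X'y\), where \(X\) contains a column of 1s (for the intercept) next to \(x_1\) and \(x_2\). A minimal sketch, re-running the simulation from above so the block stands on its own:

```r
# Re-create the simulated data from above (same seed, same draws)
set.seed(123)
n.obs <- 50
x1 <- rnorm(n = n.obs, mean = 5, sd = 2)
x2 <- runif(n = n.obs, min = -7, max = 7)
e <- rnorm(n = n.obs, mean = 0, sd = 2)
y <- 2 + 0.5 * x1 - 0.75 * x2 + e

# OLS in matrix form: beta-hat = (X'X)^{-1} X'y
X <- cbind(1, x1, x2)                        # design matrix with a column of 1s
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta.hat                                     # matches coef(lm(y ~ x1 + x2))
```

This is exactly the computation (up to numerical details) that lm() performs for you in the next step.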
When estimating a linear model in statistical software, in our case R, the only change you have to make is to add the additional explanatory variable(s) to the formula for the model you estimate:
mod <- lm(y ~ x1 + x2, data = sim.dat)
summary(mod)
##
## Call:
## lm(formula = y ~ x1 + x2, data = sim.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1279 -1.2674 -0.2188 1.0713 4.2181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.55417 0.71122 3.591 0.000784 ***
## x1 0.39331 0.13211 2.977 0.004589 **
## x2 -0.78539 0.05943 -13.215 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.709 on 47 degrees of freedom
## Multiple R-squared: 0.792, Adjusted R-squared: 0.7832
## F-statistic: 89.48 on 2 and 47 DF, p-value: < 2.2e-16
This lm object has some elements that you can use later on. You can view the structure of the model object with the str() function:
str(mod)
## List of 12
## $ coefficients : Named num [1:3] 2.554 0.393 -0.785
## ..- attr(*, "names")= chr [1:3] "(Intercept)" "x1" "x2"
## $ residuals : Named num [1:50] 1.96 -0.722 -2.135 0.582 -0.279 ...
## ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
## $ effects : Named num [1:50] -30.961 3.559 -22.578 -0.185 -0.485 ...
## ..- attr(*, "names")= chr [1:50] "(Intercept)" "x1" "x2" "" ...
## $ rank : int 3
## $ fitted.values: Named num [1:50] 2.98 6.178 5.872 -0.421 4.81 ...
## ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
## $ assign : int [1:3] 0 1 2
## $ qr :List of 5
## ..$ qr : num [1:50, 1:3] -7.071 0.141 0.141 0.141 0.141 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:50] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:3] "(Intercept)" "x1" "x2"
## .. ..- attr(*, "assign")= int [1:3] 0 1 2
## ..$ qraux: num [1:3] 1.14 1.03 1.06
## ..$ pivot: int [1:3] 1 2 3
## ..$ tol : num 1e-07
## ..$ rank : int 3
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 47
## $ xlevels : Named list()
## $ call : language lm(formula = y ~ x1 + x2, data = sim.dat)
## $ terms :Classes 'terms', 'formula' length 3 y ~ x1 + x2
## .. ..- attr(*, "variables")= language list(y, x1, x2)
## .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "y" "x1" "x2"
## .. .. .. ..$ : chr [1:2] "x1" "x2"
## .. ..- attr(*, "term.labels")= chr [1:2] "x1" "x2"
## .. ..- attr(*, "order")= int [1:2] 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(y, x1, x2)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "y" "x1" "x2"
## $ model :'data.frame': 50 obs. of 3 variables:
## ..$ y : num [1:50] 4.941 5.456 3.737 0.161 4.531 ...
## ..$ x1: num [1:50] 3.88 4.54 8.12 5.14 5.26 ...
## ..$ x2: num [1:50] 1.4 -2.34 -0.159 6.363 -0.239 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 y ~ x1 + x2
## .. .. ..- attr(*, "variables")= language list(y, x1, x2)
## .. .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:3] "y" "x1" "x2"
## .. .. .. .. ..$ : chr [1:2] "x1" "x2"
## .. .. ..- attr(*, "term.labels")= chr [1:2] "x1" "x2"
## .. .. ..- attr(*, "order")= int [1:2] 1 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(y, x1, x2)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:3] "numeric" "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:3] "y" "x1" "x2"
## - attr(*, "class")= chr "lm"
We can extract some of these quantities with commands that you’re already familiar with:
coef(mod)
## (Intercept) x1 x2
## 2.5541711 0.3933124 -0.7853943
plot(x = fitted.values(mod),
y = residuals(mod))
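Beyond coef(), residuals(), and fitted.values(), a few other extractor functions work on any lm object; confint() and predict() are the ones you will use most often. A quick sketch, re-creating the model from above so the block stands on its own (the values in new.dat are made up for illustration):

```r
# Re-create the simulated data and model from above
set.seed(123)
x1 <- rnorm(50, mean = 5, sd = 2)
x2 <- runif(50, min = -7, max = 7)
y <- 2 + 0.5 * x1 - 0.75 * x2 + rnorm(50, mean = 0, sd = 2)
mod <- lm(y ~ x1 + x2, data = data.frame(y, x1, x2))

confint(mod, level = 0.95)   # 95% confidence intervals for all three coefficients
# predict() computes fitted values for new combinations of x1 and x2:
new.dat <- data.frame(x1 = c(4, 6), x2 = c(0, 2))
predict(mod, newdata = new.dat)
```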
Before you advance, a reminder: the assumptions we discussed on Day 6 also underlie the linear regression model with more than one predictor. The only addition is that no explanatory variable may be an exact linear combination of the other explanatory variables (no perfect multicollinearity).
We’ll now use some example data that you encountered in your AR reading. These data come from the “car” package, so I first load the package, then create the object prestige.dat containing the dataset.
library(car)
prestige.dat <- data.frame(Prestige)
head(prestige.dat)
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
summary(prestige.dat)
## education income women prestige
## Min. : 6.380 Min. : 611 Min. : 0.000 Min. :14.80
## 1st Qu.: 8.445 1st Qu.: 4106 1st Qu.: 3.592 1st Qu.:35.23
## Median :10.540 Median : 5930 Median :13.600 Median :43.60
## Mean :10.738 Mean : 6798 Mean :28.979 Mean :46.83
## 3rd Qu.:12.648 3rd Qu.: 8187 3rd Qu.:52.203 3rd Qu.:59.27
## Max. :15.970 Max. :25879 Max. :97.510 Max. :87.20
## census type
## Min. :1113 bc :44
## 1st Qu.:3120 prof:31
## Median :5135 wc :23
## Mean :5402 NA's: 4
## 3rd Qu.:8312
## Max. :9517
The dataset has 102 rows (each row is one occupation) and 6 variables. The variables are:
| Variable | Description |
|---|---|
| education | Average education of occupational incumbents, years, in 1971. |
| income | Average income of incumbents, dollars, in 1971. |
| women | Percentage of incumbents who are women. |
| prestige | Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s. |
| census | Canadian Census occupational code. |
| type | Type of occupation. A factor with levels: b(lue)c(ollar), prof(essional), and w(hite)c(ollar). |
For this example, I’d like to investigate the correlates of the prestige score. I could first focus on the income of each occupation and create a scatterplot of income and prestige.
plot(x = prestige.dat$income,
y = prestige.dat$prestige,
main = "")
income.mod <- lm(prestige ~ income, data = prestige.dat)
abline(income.mod)
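From here, the natural next step mirrors what we did with the simulated data: add a second explanatory variable next to income, for instance education. A sketch of that model, assuming the car package is installed as above (income.education.mod is a name I chose here):

```r
library(car)                    # provides the Prestige data, as above
prestige.dat <- data.frame(Prestige)

# Multiple regression: prestige as a function of income and education
income.education.mod <- lm(prestige ~ income + education,
                           data = prestige.dat)
summary(income.education.mod)
```

The coefficient on income now tells you the association between income and prestige while holding education constant, which is the key interpretive difference from the bivariate model above.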