This tutorial shows you:
Note on copying & pasting code from the PDF version of this tutorial: You may run into trouble if you copy & paste code from the PDF version of this tutorial into your R script. When the PDF is created, some characters (for instance, quotation marks or indentations) are converted into non-text characters that R won't recognize. To use code from this tutorial, either type it yourself into your R script or copy & paste it from the source file for this tutorial, which is posted on my website.
Note on R functions discussed in this tutorial: I don't discuss many functions in detail here and therefore I encourage you to look up the help files for these functions or search the web for them before you use them. This will help you understand the functions better. Each of these functions is well-documented either in its help file (which you can access in R by typing ?ifelse, for instance) or on the web. The Companion to Applied Regression (see our syllabus) also provides many detailed explanations.
As always, please note that this tutorial only accompanies the other materials for Day 11 and that you are expected to have worked through the reading for that day before tackling this tutorial.
Note: You must have read Brambor, Clark & Golder (2006) before working on this tutorial.
Interaction terms enter the regression equation as the product of two constitutive terms, \(x_1\) and \(x_2\). For this product term \(x_1 x_2\), the regression equation adds a separate coefficient \(\beta_3\).
\[ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \]
In thinking about interaction terms, it helps to first simplify by working through the prediction of the regression equation for different values of two predictors, \(x_1\) and \(x_2\). Imagine a continuous outcome \(y\), e.g. the income of Hollywood actors, that we predict with two variables. The first, \(x_1\), is a binary variable such as female gender; it takes on values of 0 (males) and 1 (females) only. The second, \(x_2\), is a continuous variable, ranging from \(-5\) to \(+5\), such as a (centered and standardized) measure of age. An interaction term expresses the idea that the effect of one variable depends on the value of the other variable (I'm using the term "effect" loosely and non-causally here). With these variables, this means that the effect of age on actors' income is different for male actors than for female actors.
\(\beta_1\) is the effect of \(x_1\) on \(y\) when \(x_2\) is 0:
\[ y = \alpha + \beta_1 x_1 + \varepsilon \]
\(\beta_2\) is the effect of \(x_2\) on \(y\) when \(x_1\) is 0:
\[ y = \alpha + \beta_2 x_2 + \varepsilon \]
When both \(x_1\) and \(x_2\) are not 0, \(\beta_3\) becomes important, and the effect of \(x_1\) now varies with the value of \(x_2\). We can plug in 1 for \(x_1\) and simplify the equation as follows:
\[ y = \alpha + \beta_1 + \beta_2 x_2 + \beta_3 x_2 + \varepsilon = (\alpha + \beta_1) + (\beta_2 + \beta_3) x_2 + \varepsilon \]
With simulated data, this can be illustrated easily. I begin by simulating a dataset with 200 observations and two predictors, \(x_1\) (binary: male/female) and \(x_2\) (continuous: standardized and centered age), and then create \(y\) (continuous: income) as a linear combination of \(x_1\), \(x_2\), and an interaction term of the two predictors, plus some random noise.
set.seed(123)  # for reproducibility
n.sample <- 200  # number of observations
x1 <- rbinom(n.sample, size = 1, prob = 0.5)  # binary predictor (0 = male, 1 = female)
x2 <- runif(n.sample, -5, 5)  # continuous predictor (centered "age")
a <- 5    # intercept (alpha)
b1 <- 3   # coefficient on x1 (beta_1)
b2 <- 4   # coefficient on x2 (beta_2)
b3 <- -3  # coefficient on the interaction x1 * x2 (beta_3)
e <- rnorm(n.sample, 0, 5)  # error term
y <- a + b1 * x1 + b2 * x2 + b3 * x1 * x2 + e  # outcome
sim.dat <- data.frame(y, x1, x2)
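If you want a quick numeric check of the simulated data before plotting (this step is optional and not strictly necessary), you can look at the first few rows and at how the binary predictor is split:

# Optional checks on the simulated data
head(sim.dat)      # first six rows of y, x1, x2
table(sim.dat$x1)  # count of 0s (males) and 1s (females)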
Before advancing, this is what the simulated data look like:
par(mfrow = c(1, 3))  # show the three histograms side by side
hist(sim.dat$y)
hist(sim.dat$x1)
hist(sim.dat$x2)
par(mfrow = c(1, 1))  # reset the plotting layout
For convenience, you could also use the multi.hist() function from the "psych" package. It automatically adds a density curve (dashed line) and a normal density curve (dotted line).
# install.packages("psych")  # run this once if the package is not yet installed
library(psych)
multi.hist(sim.dat, nrow = 1)
Fitting a model using what we know about the data-generating process gives us:
mod.sim <- lm(y ~ x1 * x2, data = sim.dat)
summary(mod.sim)
##
## Call:
## lm(formula = y ~ x1 * x2, data = sim.dat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.1974  -3.4874  -0.0738   3.4837  13.2178
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   4.7891     0.4924   9.726  < 2e-16 ***
## x1            3.8716     0.7081   5.467 1.38e-07 ***
## x2            3.9554     0.1663  23.790  < 2e-16 ***
## x1:x2        -2.9435     0.2421 -12.160  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.997 on 196 degrees of freedom
## Multiple R-squared: 0.7615, Adjusted R-squared: 0.7579
## F-statistic: 208.6 on 3 and 196 DF, p-value: < 2.2e-16
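Because we simulated the data ourselves, we know the true parameter values. As a quick optional check, we can line them up next to the estimates:

# Compare the true simulation parameters to the estimated coefficients
cbind(true = c(a, b1, b2, b3), estimate = coef(mod.sim))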
First, I plot the relationship between the continuous predictor \(x_2\) and \(y\) for all observations where \(x_1 = 0\). The regression line is defined by an intercept \(\alpha\) and a slope \(\beta_2\), as we calculated above. I can extract these from the lm object above using the coef() function. I use the index in square brackets to extract the coefficient I need.
coef(mod.sim)
## (Intercept) x1 x2 x1:x2
## 4.789084 3.871614 3.955383 -2.943516
coef(mod.sim)[1]
## (Intercept)
## 4.789084
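As an aside, you can also extract coefficients by name instead of by position; this is optional, but it is a bit more robust if the order of terms in the model ever changes:

# Extract coefficients by name rather than by numeric index
coef(mod.sim)["(Intercept)"]
coef(mod.sim)["x2"]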
# Scatterplot of x2 and y for observations with x1 == 0 only
plot(x = sim.dat[sim.dat$x1 == 0, ]$x2, y = sim.dat[sim.dat$x1 == 0, ]$y,
     col = rgb(red = 0, green = 0, blue = 1, alpha = 0.25), pch = 19,
     xlab = "x2", ylab = "y")
# Regression line for x1 == 0: intercept = coef[1], slope = coef[3]
abline(a = coef(mod.sim)[1], b = coef(mod.sim)[3], col = "blue", lwd = 2)
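To see the interaction at work, you could also add the observations with \(x_1 = 1\) and their implied regression line to the same plot. This is just a sketch of that step: following the simplification above, the intercept of that line is \(\alpha + \beta_1\) and its slope is \(\beta_2 + \beta_3\).

# Add the x1 == 1 observations to the existing plot
points(x = sim.dat[sim.dat$x1 == 1, ]$x2, y = sim.dat[sim.dat$x1 == 1, ]$y,
       col = rgb(red = 1, green = 0, blue = 0, alpha = 0.25), pch = 19)
# Regression line for x1 == 1: intercept = coef[1] + coef[2], slope = coef[3] + coef[4]
abline(a = coef(mod.sim)[1] + coef(mod.sim)[2],
       b = coef(mod.sim)[3] + coef(mod.sim)[4], col = "red", lwd = 2)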