This tutorial shows you:
As always, please note that this tutorial only accompanies the other course materials and that you are expected to have worked through assigned reading before tackling this tutorial.
Note: You must have read Brambor, Clark & Golder (2006) before working on this tutorial.
Interaction terms enter the regression equation as the product of two constitutive terms, \(x_1\) and \(x_2\). For this product term \(x_1 x_3\), the regression equation adds a separate coefficient \(\beta_3\).
\[ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \]
In thinking about interaction terms, it helps to first simplify by working through the prediction of the regression equation for different values of two predictors, \(x_1\) and \(x_2\). We can imagine a continuous outcome \(y\), e.g. the income of Hollywood actors, that we predict with two variables. The first, \(x_1\), is a binary variable such as female gender; it takes on values of 0 (males) and 1 (females) only. The second, \(x_2\), is a continuous variable, ranging from \(-5\) to \(+5\), such as a (centered and standardized) measure of age. In this case, I’m using the term “effect” loosely and non-causally. An interaction term expresses the idea that the effect of one variable depends on the value of the other variable. With these variables, this suggests that effect of age on actors’ income is different for male actors than for female actors.
\(\beta_1\) is the effect of \(x_1\) on \(y\) when \(x_2\) is 0:
\(\beta_2\) is the effect of \(x_2\) on \(y\) when \(x_1\) is 0:
When both \(x_1\) and \(x_2\) are not 0, \(\beta_3\) becomes important, and the effect of \(x_1\) now varies with the value of \(x_2\). We can plug in 1 for \(x_1\) and simplify the equation as follows:
With simulated data, this can be illustrated easily. I begin by simulating a dataset with 200 observations, two predictors \(x_1\) (binary: male/female) and \(x_2\) (continuous: standardized and centered age), and create \(y\) (continuous: income) as the linear combination of \(x_1\), \(x_2\), and an interaction term of the two predictors.
set.seed(123)
<- 200
n.sample <- rbinom(n.sample, size = 1, prob = 0.5)
x1 <- runif(n.sample, -5, 5)
x2 <- 5
a <- 3
b1 <- 4
b2 <- -3
b3 <- rnorm(n.sample, 0, 5)
e <- a + b1 * x1 + b2 * x2 + b3 * x1 * x2 + e
y <- data.frame(y, x1, x2) sim.dat
Before advancing, this is what the simulated data look like:
par(mfrow = c(1, 3))
hist(sim.dat$y)
hist(sim.dat$x1)
hist(sim.dat$x2)
par(mfrow = c(1, 1))
For convenience, you could also use the multi.hist()
function from the “psych” package. It automatically adds a density curve
(dashed line) and a normal density plot (dotted line).
library("psych")
multi.hist(sim.dat, nrow = 1)
Fitting a model using what we know about the data-generating process gives us:
<- lm(y ~ x1 * x2, dat = sim.dat)
mod.sim summary(mod.sim)
##
## Call:
## lm(formula = y ~ x1 * x2, data = sim.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1974 -3.4874 -0.0738 3.4837 13.2178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7891 0.4924 9.726 < 2e-16 ***
## x1 3.8716 0.7081 5.467 1.38e-07 ***
## x2 3.9554 0.1663 23.790 < 2e-16 ***
## x1:x2 -2.9435 0.2421 -12.160 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.997 on 196 degrees of freedom
## Multiple R-squared: 0.7615, Adjusted R-squared: 0.7579
## F-statistic: 208.6 on 3 and 196 DF, p-value: < 2.2e-16
First, I plot the relationship between the continuous predictor \(x_2\) and \(y\) for all observations where \(x_1 = 0\). The regression line is defined
by an intercept \(\alpha\) and a slope
\(\beta_2\), as we calculated above. I
can extract these from the lm
object above using the
coef()
function. I use the index in square brackets to
extract the coefficient I need.
coef(mod.sim)
## (Intercept) x1 x2 x1:x2
## 4.789084 3.871614 3.955383 -2.943516
coef(mod.sim)[1]
## (Intercept)
## 4.789084
plot(x = sim.dat[sim.dat$x1 == 0, ]$x2, y = sim.dat[sim.dat$x1 == 0, ]$y,
col = rgb(red = 0, green = 0, blue = 1, alpha = 0.25), pch = 19,
xlab = "x2", ylab = "y")
abline(a = coef(mod.sim)[1], b = coef(mod.sim)[3], col = "blue", pch = 19, lwd = 2)
Next, we can add the data points where \(x_1 = 1\), and add the separate regression line for these points:
plot(x = sim.dat[sim.dat$x1 == 0, ]$x2, y = sim.dat[sim.dat$x1 == 0, ]$y,
pch = 19, xlab = "x2", ylab = "y", col = rgb(red = 0, green = 0, blue = 1, alpha = 0.25))
abline(a = coef(mod.sim)[1], b = coef(mod.sim)[3], col = "blue", lwd = 2)
points(x = sim.dat[sim.dat$x1 == 1, ]$x2, y = sim.dat[sim.dat$x1 == 1, ]$y,
col = rgb(red = 1, green = 0, blue = 0, alpha = 0.25), pch = 19)
abline(a = coef(mod.sim)[1] + coef(mod.sim)[2], b = coef(mod.sim)[3] + coef(mod.sim)[4],
col = "red", lwd = 2)
From this first exercise, you should take home a few things:
x1 * x2
See Brambor et al. (2006) for more details. What would it imply if your model only included \(\beta_3 x_1 x_2\)? Omitting a coefficient is equivalent to setting \(\beta_1 = 0\) and \(\beta_2 = 0\). See for yourself:
<- lm(y ~ x1:x2, data = sim.dat)
mod.sim2 summary(mod.sim2)
##
## Call:
## lm(formula = y ~ x1:x2, data = sim.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.6719 -4.5727 0.6632 5.8826 24.6120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6555 0.7078 9.404 < 2e-16 ***
## x1:x2 0.9588 0.3513 2.729 0.00692 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.995 on 198 degrees of freedom
## Multiple R-squared: 0.03625, Adjusted R-squared: 0.03138
## F-statistic: 7.448 on 1 and 198 DF, p-value: 0.006924
plot(x = sim.dat[sim.dat$x1 == 0, ]$x2, y = sim.dat[sim.dat$x1 == 0, ]$y,
pch = 19, col = rgb(red = 0, green = 0, blue = 1, alpha = 0.25),
xlab = "x2", ylab = "y")
points(x = sim.dat[sim.dat$x1 == 1, ]$x2, y = sim.dat[sim.dat$x1 == 1, ]$y,
col = rgb(red = 1, green = 0, blue = 0, alpha = 0.25), pch = 19)
abline(a = coef(mod.sim2)[1], b = coef(mod.sim2)[2], lwd = 2)
Note: R will automatically include the constitutive terms when you use the asterisk as above.
y ~ x1 + x2 + x1 * x2
is equivalent to
y ~ x1 * x2
The first example illustrates the general logic of interaction terms. It carries over directly into the case of an interaction term between two continuous variables. Let’s now simulate another (fake) dataset, with a continuous outcome \(y\) being the income of professional baseball players. We can assume that \(x_1\) is a continuous variable and distributed normally with a mean of 2 and a standard deviation of 5 (e.g., a standardized measure of injury risk of each player). As previously, \(x_2\) is a continuous variable (such as standardized & centered age), ranging from \(-5\) to \(+5\). We might be thinking of a conditional effect, where older athletes make more — but only if they are rated as low injury risk. Older athletes with higher injury risk might be making less money. With this configuration,
\(\beta_1\) is the effect of \(x_1\) on \(y\) when \(x_2\) is 0:
\(\beta_2\) is the effect of \(x_2\) on \(y\) when \(x_1\) is 0:
When both \(x_1\) and \(x_2\) are not 0, \(\beta_3\) becomes important, and the effect of \(x_1\) now varies with the value of \(x_2\). We can again plug in 1 for \(x_1\) and simplify the equation as follows:
But \(x_1\) takes on many other values than just 1. The more general equation for \(y\) is then (cf. Brambor et al. p. 11)
I use the same simulation setup above, but make \(x_1\) a continuous variable, normally distributed with a mean of 2 and a standard deviation of 5.
set.seed(123)
<- 200
n.sample <- rnorm(n.sample, mean = 2, sd = 5)
x1 <- runif(n.sample, -5, 5)
x2 <- 5
a <- 3
b1 <- 4
b2 <- -3
b3 <- rnorm(n.sample, 0, 20)
e <- a + b1 * x1 + b2 * x2 + b3 * x1 * x2 + e
y <- data.frame(y, x1, x2) sim.dat2
Before advancing, this is what the simulated data look like:
multi.hist(sim.dat2, nrow = 1)
Fitting a model using what we know about the data-generating process gives us:
<- lm(y ~ x1 * x2, dat = sim.dat2)
mod.sim3 summary(mod.sim3)
##
## Call:
## lm(formula = y ~ x1 * x2, data = sim.dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.539 -12.129 0.986 12.424 51.132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.4950 1.5581 4.168 4.60e-05 ***
## x1 2.5923 0.3059 8.475 5.59e-15 ***
## x2 4.1026 0.5325 7.704 6.41e-13 ***
## x1:x2 -3.0383 0.1017 -29.883 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.34 on 196 degrees of freedom
## Multiple R-squared: 0.8349, Adjusted R-squared: 0.8323
## F-statistic: 330.3 on 3 and 196 DF, p-value: < 2.2e-16
We can now calculate the effect of \(x_1\) on \(y\) at different levels of \(x_2\), say, across the range of \(x_2\) with increments of 1. In R, I create
this sequence using the seq()
function.
<- seq(from = -5, to = 5, by = 1)
x2.sim x2.sim
## [1] -5 -4 -3 -2 -1 0 1 2 3 4 5
<- coef(mod.sim3)[2] + coef(mod.sim3)[4] * x2.sim eff.x1
I can now list the effect of \(x_1\) at different levels of \(x_2\):
<- data.frame(x2.sim, eff.x1)
eff.dat eff.dat
## x2.sim eff.x1
## 1 -5 17.7837936
## 2 -4 14.7454971
## 3 -3 11.7072007
## 4 -2 8.6689042
## 5 -1 5.6306077
## 6 0 2.5923113
## 7 1 -0.4459852
## 8 2 -3.4842817
## 9 3 -6.5225781
## 10 4 -9.5608746
## 11 5 -12.5991711
The object eff.x1
is now the coefficient of \(x_1\) at the respective levels of \(x_2\). \(x_2\) in this context is also called the
“moderator” or “moderating variable” because it moderates the effect of
\(x_1\). You can see that \(x_1\) exhibits a positive effect on \(y\) when \(x_2\) is approximately smaller than 1, at
which point the effect of \(x_1\) on
\(y\) turns negative.
You can now visualize this, just as you did in the binary-continuous interaction above. Because \(x_2\), the moderating variable, is now continuous, there is an (almost) infinite number of separate regression lines for \(x_1\) and \(y\): each individual value of \(x_2\) creates a separate regression line (with a separate slope coefficient). The following scatterplot plots observations against their values on \(x_1\) and \(7\) and colors them based on their value of \(x_2\), as in the previous example.
<- colorRampPalette(colors = c("red", "blue"))
rbPal $x2.col <- rbPal(10)[as.numeric(cut(sim.dat2$x2,breaks = 10))]
sim.dat2plot(x = sim.dat2$x1, y = sim.dat2$y, col = sim.dat2$x2.col,
pch = 19, xlab = "x1", ylab = "y")
Alternatively, ggplot2 makes it easier to color lines conditionally on the value of another variable:
library("ggplot2")
<- ggplot(data = sim.dat2, aes(x = x1, y = y)) + geom_point(aes(color = x2)) +
p scale_colour_continuous(low ="red", high ="blue")
+ theme_bw() p
–>
Then, I’m adding 10 regression lines for the 10 values of \(x_2\) that I just calculated. The lines are also colored according to the values of \(x_2\).
$x2.col <- rbPal(10)[as.numeric(cut(eff.dat$x2,breaks = 10))]
eff.datplot(x = sim.dat2$x1, y = sim.dat2$y, col = sim.dat2$x2.col,
pch = 19, xlab = "x1", ylab = "y")
apply(eff.dat, 1, function(x) abline(a = coef(mod.sim3)[1], b = x[2], col = x[3]))
## NULL
You can see that at low values of \(x_2\) (red), the relationship between \(x_1\) and \(y\) is positive (upward lines), whereas at higher values of \(x_2\) (blue), the relationship is negative (downward lines).
+ geom_abline(data = eff.dat, aes(intercept = coef(mod.sim3)[1],
p slope = eff.x1, color = x2.sim)) +
theme_bw()
As you’ve heard before, and as Hainmueller et al. (2019) imply, you should always inspect your data. This also applies to interactions. Always investigate whether:
You should see this as an opportunity to learn something about your data rather than a nuisance.
Here is an example of how I would inspect my data from the first
model, which contained an interaction between a dichotomous
(x1
) and continuous (x2
) variable:
par(mfrow = c(1, 2))
plot(x = sim.dat[sim.dat$x1 == 0, ]$x2,
y = sim.dat[sim.dat$x1 == 0, ]$y,
main = "x1 = 0", xlab = "x2", ylab = "y")
plot(x = sim.dat[sim.dat$x1 == 1, ]$x2,
y = sim.dat[sim.dat$x1 == 1, ]$y,
main = "x1 = 1", xlab = "x2", ylab = "y")
par(mfrow = c(1, 1))
I see no reason to be concerned about the distribution of
x2
or the linearity of the relationship between
x2
and y
, and this makes sense, given how I
simulated the data. But see Hainmueller et al. for examples where this
does become a problem.
Next, here is an example of how I would inspect my data from the
third model above, which contained an interaction between a continuous
(x1
) and continuous (x2
) variable. First, I
split the moderator (again, I choose x1
) in three groups.
For this, I use the cut()
function (see the R
Cookbook for more info).
$x1_tri <- cut(x = sim.dat2$x1,
sim.dat2breaks = quantile(x = sim.dat2$x1, probs = c(0, 0.33, 0.66, 1)),
include.lowest = TRUE)
table(sim.dat2$x1_tri)
##
## [-9.55,-0.151] (-0.151,3.66] (3.66,18.2]
## 66 66 68
par(mfrow = c(1, 3))
plot(x = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[1], ]$x2,
y = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[1], ]$y,
main = paste("x1 = ", levels(sim.dat2$x1_tri)[1], sep = ""),
xlab = "x2", ylab = "y")
plot(x = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[2], ]$x2,
y = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[2], ]$y,
main = paste("x1 = ", levels(sim.dat2$x1_tri)[2], sep = ""),
xlab = "x2", ylab = "y")
plot(x = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[3], ]$x2,
y = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[3], ]$y,
main = paste("x1 = ", levels(sim.dat2$x1_tri)[3], sep = ""),
xlab = "x2", ylab = "y")
par(mfrow = c(1, 1))
Again, given how I simulated these data, there is no reason to be
concerned about the distribution of x2
or the linearity of
the relationship between x2
and y
.
As usual, you should evaluate the variance around coefficients. The online appendix to Brambor et al. (2006) gives you the formula for this variance:
For the variance of the marginal/conditional effect of \(x_1\), use this equation:
\[ \text{Var} = \text{Var} (\beta_1) + x_2^2 \times \text{Var} (\beta_3) + 2 \times x_2 \times \text{Cov} (\beta_1, \beta_3) \]
The standard error is then easily calculated as the square root of the variance.
Note that this equation makes use of the covariance of
estimates. You learned about variance and covariance earlier, and you
can access the variance-covariance matrix with the vcov()
function. You can then extract cells of the matrix by using the familiar
square brackets, where the first number stands for rows and the second
for columns.
vcov(mod.sim3)
## (Intercept) x1 x2 x1:x2
## (Intercept) 2.4277430885 -0.1831417946 -0.0009989293 -0.0018730849
## x1 -0.1831417946 0.0935660448 -0.0018849223 0.0007608924
## x2 -0.0009989293 -0.0018849223 0.2835933457 -0.0200800974
## x1:x2 -0.0018730849 0.0007608924 -0.0200800974 0.0103373423
vcov(mod.sim3)[1, 1]
## [1] 2.427743
Now, you can obtain the standard error of the respective coefficients
by taking the square root of the variance. In this code, I’m nesting the
equation above in the sqrt()
function.
$se.eff.x1 <- sqrt(vcov(mod.sim3)[2, 2] +
eff.dat$x2.sim^2 * vcov(mod.sim3)[4, 4] +
eff.dat2 * eff.dat$x2.sim * vcov(mod.sim3)[2, 4])
eff.dat
## x2.sim eff.x1 x2.col se.eff.x1
## 1 -5 17.7837936 #FF0000 0.5868481
## 2 -4 14.7454971 #FF0000 0.5028682
## 3 -3 11.7072007 #E2001C 0.4266577
## 4 -2 8.6689042 #C60038 0.3631416
## 5 -1 5.6306077 #AA0055 0.3199713
## 6 0 2.5923113 #8D0071 0.3058857
## 7 1 -0.4459852 #71008D 0.3246924
## 8 2 -3.4842817 #5500AA 0.3714283
## 9 3 -6.5225781 #3800C6 0.4372270
## 10 4 -9.5608746 #1C00E2 0.5148307
## 11 5 -12.5991711 #0000FF 0.5996737
It is usually (always, really) more useful to plot the marginal effect of a variable across the range of the second variable in the interaction, rather than plotting separate regression lines as I did above. I strongly recommend this approach. You already have the tools to do this:
plot(x = eff.dat$x2.sim, y = eff.dat$eff.x1, type = "l", xlab = "x2",
ylab = "Conditional effect of x1")
abline(h = 0, col = "grey", lty = "dashed")
The solid line shows the conditional effect of \(x_1\) across the range of \(x_2\). You can add 95% confidence intervals by plotting lines representing the conditional effect \(\pm 1.96 \times \text{SE}\).
plot(x = eff.dat$x2.sim, y = eff.dat$eff.x1, type = "l", xlab = "x2",
ylab = "Conditional effect of x1")
abline(h = 0, col = "grey", lty = "dashed")
lines(x = eff.dat$x2.sim, y = eff.dat$eff.x1 + 1.96 * eff.dat$se.eff.x1, lty = "dashed")
lines(x = eff.dat$x2.sim, y = eff.dat$eff.x1 - 1.96 * eff.dat$se.eff.x1, lty = "dashed")
abline(h = 0, lty = 2, col = "grey")
The final example uses real-world data on the electoral success of extreme right-wing parties in 16 countries over 102 elections. These data were used in Matt Golder’s BJPS article “Electoral Institutions, Unemployment and Extreme Right Parties: A Correction.” (see Brambor et al. 2006 for the full citation).
<- rio::import("http://www.jkarreth.net/files/golder.2003.csv")
ei.dat summary(ei.dat)
## country year unemp enp
## Length:102 Min. :1970 Min. : 0.300 Min. :1.72
## Class :character 1st Qu.:1975 1st Qu.: 2.125 1st Qu.:2.79
## Mode :character Median :1980 Median : 5.300 Median :3.40
## Mean :1980 Mean : 5.811 Mean :3.59
## 3rd Qu.:1985 3rd Qu.: 8.375 3rd Qu.:4.63
## Max. :1990 Max. :20.800 Max. :5.10
## lerps thresh
## Min. :0.0000 Min. : 0.670
## 1st Qu.:0.0000 1st Qu.: 2.600
## Median :0.4700 Median : 5.000
## Mean :0.9208 Mean : 8.807
## 3rd Qu.:1.8284 3rd Qu.: 9.875
## Max. :3.3142 Max. :35.000
par(mfrow = c(2, 2))
hist(ei.dat$lerps)
hist(ei.dat$thresh)
hist(ei.dat$enp)
hist(ei.dat$unemp)
par(mfrow = c(1, 1))
Variable | Description |
---|---|
lerps |
Log of extreme right percentage support + 1 |
thresh |
Effective threshold of representation and exclusion in the political system |
enp |
Effective number of parties |
unemp |
Unemployment rate |
country |
Country name |
year |
Election year |
This study aims to determine the relationship between electoral support for extreme right-wing parties (the outcome variable), the threshold for representation in the political system (the first predictor), and the effective number of parties (the second predictor). The author hypothesized that the effect of representation thresholds might vary depending on the effective number of parties. The unemployment rate at the time of the election is a control variable. Because elections in the same country might be subject to idiosyncratic dynamics, the author includes a dummy variable for each country (cf. our earlier discussions).
After fitting the model, I display the results using the familiar
screenreg()
function from the “texreg” package. I make use
of the omit.coef
option to avoid printing of the
coefficients for each country dummy variable.
# Model 3
<- lm(lerps ~ thresh + enp + thresh * enp + unemp + factor(country), data = ei.dat)
m3 library("modelsummary")
modelsummary(list(m3),
coef_omit = "country")
Model 1 | |
---|---|
(Intercept) | −0.967 |
(1.099) | |
thresh | 0.267 |
(0.063) | |
enp | 1.159 |
(0.433) | |
unemp | 0.070 |
(0.015) | |
thresh × enp | −0.099 |
(0.021) | |
Num.Obs. | 102 |
R2 | 0.853 |
R2 Adj. | 0.819 |
AIC | 136.2 |
BIC | 191.3 |
F | 25.009 |
RMSE | 0.38 |
Rather than trying to interpret these results from the regression table, I calculate the marginal effect of electoral thresholds across the range of the effective number of parties. This step is exactly analogous to what I did with simulated data above.
<- seq(from = min(ei.dat$thresh), to = max(ei.dat$thresh), length.out = 50)
thresh_sim <- seq(from = min(ei.dat$enp), to = max(ei.dat$enp), length.out = 50)
enp_sim <- coef(m3)[2] + coef(m3)[20] * enp_sim
eff_thresh <- sqrt(vcov(m3)[2, 2] + enp_sim^2 * vcov(m3)[20, 20] + 2 * enp_sim * vcov(m3)[2, 20])
eff_thresh_se plot(x = enp_sim, y = eff_thresh, type = "l",
xlab = "Effective number of parties",
ylab = "Conditional effect of the electoral threshold")
lines(x = enp_sim, y = eff_thresh + 1.96 * eff_thresh_se, lty = "dashed")
lines(x = enp_sim, y = eff_thresh - 1.96 * eff_thresh_se, lty = "dashed")
abline(h = 0, col = "grey", lty = "dashed")
Rather than calculating marginal effects by hand, you might want to use R functions to plot marginal effects with less effort.
DAintfun2()
One good option is Dave Armstrong’s DAintfun2()
function, which is part of his “DAMisc” package. You need to first
install the package, then load it, and then specify the function with at
least the following arguments:
# install.packages("DAMisc")
library(DAMisc)
DAintfun2(m3, varnames = c("thresh", "enp"),
rug = TRUE, hist = TRUE)
DAintfun2()
by default plots both the marginal effect of
the first variable as well as that of the second variable. It also
allows you to include an indicator of the actual distribution of the
variables, via a histogram or a rug plot. With these supplementary
plots, you can immediately see whether the estimated effect of a
variable corresponds to real observations at that value of the
moderating variable.
ggintfun()
A second function is ggintfun()
, which I’ve written. I
prefer working with ggplot and wanted to have a customizable version of
a marginal effects plot; otherwise, DAintfun2()
does
everything I need. ggintfun()
uses the “ggplot2” package
and you need to download it from my GitHub repository. This requires
three extra steps:
ggintfun()
function into R using the
devtools::source_url()
function and the URL of the raw R
code for ggintfun()
on my GitHub site (every time you want
to use ggintfun()
).::source_url("https://raw.githubusercontent.com/jkarreth/JKmisc/master/ggintfun.R") devtools
The function then produces this plot, which shows the exact same
information as your hand-crafted marginal effects plot and the left
panel of the DAintfun2())
plot from above.
ggintfun(obj = m3, varnames = c("thresh", "enp"),
varlabs = c("Electoral threshold", "Effective number of parties"),
title = FALSE, rug = TRUE,
twoways = FALSE)
It can also produce both panels at the same time by setting
twoways = TRUE
:
ggintfun(obj = m3, varnames = c("thresh", "enp"),
varlabs = c("Electoral threshold", "Effective number of parties"),
title = FALSE, rug = TRUE,
twoways = TRUE)
Lastly, this function automatically adjusts the plot if a moderator is binary (as in our very first model in this tutorial):
ggintfun(obj = mod.sim, varnames = c("x1", "x2"),
varlabs = c("x1 (binary)", "x2 (continuous)"),
title = FALSE, rug = TRUE,
twoways = TRUE)
There are also two convenient packages intended to facilitate easy plotting of interaction effects:
If you want to export a plot so that you can insert it into a manuscript (or presentation slides), you can find instructions in our earlier course materials. The logic is straightforward:
In more detail:
twoway_interaction.pdf
with a width of 8 inches and a
height of 4 inches, specify the graphics device as follows:pdf("twoway_interaction.pdf", width = 8, height = 4, units = "in")
ggintfun()
function. It draws the plot in one step.ggintfun(obj = mod.sim, varnames = c("x1", "x2"),
varlabs = c("x1 (binary)", "x2 (continuous)"),
title = FALSE, rug = TRUE,
twoways = TRUE)
In other examples, you might draw a plot in several steps. For
instance, you might plot(x, y)
, followed by
abline(lm(y ~ x))
. In this case, draw the complete plot
during this step - execute all code that is necessary to complete the
plot.
dev.off()
The sequence of these three steps has now produced a PDF file
twoway_interaction.pdf
in your working directory.
If you wanted to print the plot to a JPG image file instead, use
jpeg
and specify the resolution of the plot in addition to
its size:
jpeg("twoway_interaction.jpg", width = 8, height = 4, units = "px")
ggintfun(obj = mod.sim, varnames = c("x1", "x2"),
varlabs = c("x1 (binary)", "x2 (continuous)"),
title = FALSE, rug = TRUE,
twoways = TRUE)
dev.off()
The graphics device allows you to do anything you’ve down for plots
in RStudio so far. If you want to place two plots next to each other,
you can do so by first setting up the plotting area using the
par
command:
pdf("descriptive_histograms.pdf", width = 10, height = 3.33)
par(mfrow = c(1, 3))
plot(x = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[1], ]$x2,
y = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[1], ]$y,
main = paste("x1 = ", levels(sim.dat2$x1_tri)[1], sep = ""),
xlab = "x2", ylab = "y")
plot(x = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[2], ]$x2,
y = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[2], ]$y,
main = paste("x1 = ", levels(sim.dat2$x1_tri)[2], sep = ""),
xlab = "x2", ylab = "y")
plot(x = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[3], ]$x2,
y = sim.dat2[sim.dat2$x1_tri == levels(sim.dat2$x1_tri)[3], ]$y,
main = paste("x1 = ", levels(sim.dat2$x1_tri)[3], sep = ""),
xlab = "x2", ylab = "y")
par(mfrow = c(1, 1))
dev.off()
Thomas Leeper wrote a nice guide to printing high-quality figures in
R for the Political Methodologist. You can find it here: http://thepoliticalmethodologist.com
and in the print
version of Volume 21, Issue 1. It explains the steps above in more
detail and provides some additional information on how to produce good
figures with other software (Excel, Stata) as well.