This tutorial shows you:
Note on copying & pasting code from the PDF version of this tutorial: Please note that you may run into trouble if you copy & paste code from the PDF version of this tutorial into your R script. When the PDF is created, some characters (for instance, quotation marks or indentations) are converted into non-text characters that R won’t recognize. To use code from this tutorial, please type it yourself into your R script or you may copy & paste code from the source file for this tutorial which is posted on my website.
Note on R functions discussed in this tutorial: I don’t discuss many functions in detail here and therefore I encourage you to look up the help files for these functions or search the web for them before you use them. This will help you understand the functions better. Each of these functions is well-documented either in its help file (which you can access in R by typing ?ifelse
, for instance) or on the web. The Companion to Applied Regression (see our syllabus) also provides many detailed explanations.
As always, please note that this tutorial only accompanies the other materials for Day 12 and that you are expected to have worked through the reading for that day before tackling this tutorial.
So far, we have not encountered serious violations of the assumption of linearity - a linear relationship between predictors and outcome. But this assumption simply means that we impose a linear structure on the relationship between \(x\) and \(y\). Coefficient estimates alone from a regression model will not reveal whether the relationship between \(x\) and \(y\) in your data actually are linear, but a scatterplot will be useful to investigate whether this might be the case.
Theories might often make predictions of the form, “as \(x\) increases, \(y\) first increases, and then drops again”. An example for this is the Kuznets curve in economics, suggesting that as countries developed, income inequality first increased, peaked, and then decreased (summarized, for instance, in Acemoglu and Robinson 2002). This implies a so-called curvilinear relationship between economic development and inequality: both poor and rich countries have low inequality, but middle-income countries should exhibit high levels of inequality.
Take the following example:
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.650 -5.757 3.239 9.980 28.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.1480 0.8235 -9.895 < 2e-16 ***
## x 0.8155 0.2559 3.186 0.00155 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.02 on 425 degrees of freedom
## Multiple R-squared: 0.02333, Adjusted R-squared: 0.02103
## F-statistic: 10.15 on 1 and 425 DF, p-value: 0.001547
Perhaps you might notice the low \(R^2\) value, but that itself is not indicative of problems. Examining the residual plots, however, reveals that the the model produces residuals that are grouped below 0 at low and high values of \(x\):
In this case, the residuals strongly indicate a (negative) quadratic relationship between \(x\) and \(y\). Plotting \(x\) against \(y\) and adding the regression line from the model we just fit shows that imposing a linear relationship misses the true structure of the data quite a bit:
To account for a quadratic relationship in the regression model, we enter a squared term of the predictor (\(x^2\)), using the code I(x^2)
.
##
## Call:
## lm(formula = y ~ x + I(x^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.166 -6.533 0.121 6.870 24.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.18336 0.55212 2.143 0.0327 *
## x 1.11205 0.14338 7.756 6.59e-14 ***
## I(x^2) -0.90137 0.02946 -30.601 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.511 on 424 degrees of freedom
## Multiple R-squared: 0.6956, Adjusted R-squared: 0.6942
## F-statistic: 484.4 on 2 and 424 DF, p-value: < 2.2e-16
A second, less dramatic but still consequential example with real-world data illustrates the point as well. Here I’m using a dataset on the wages of 3680 Norwegian survey respondents. The data are taken from the European Social Survey’s training module at http://essedunet.nsd.uib.no/cms/topics/multilevel/ch1/5.html.
wages.dat <- read.csv("http://www.jkarreth.net/files/ess_wages.csv")
summary(wages.dat)
## wage edyears age female
## Min. : 25.00 Min. : 0.000 Min. :16.0 Min. :0.0000
## 1st Qu.: 71.00 1st Qu.: 1.000 1st Qu.:30.0 1st Qu.:0.0000
## Median : 83.33 Median : 3.000 Median :39.0 Median :0.0000
## Mean : 90.11 Mean : 2.651 Mean :39.5 Mean :0.4696
## 3rd Qu.:102.45 3rd Qu.: 3.000 3rd Qu.:48.0 3rd Qu.:1.0000
## Max. :343.75 Max. :11.000 Max. :74.0 Max. :1.0000
## egp
## Lower service class:1044
## Routine non-manual :1139
## Skilled workers : 607
## Unskilled workers : 598
## Upper service class: 292
##
These data contain the following variables:
Variable | Description |
---|---|
wage |
Hourly wage in Norwegian kronor |
edyears |
Years of non-compulsory education |
age |
Age in years |
female |
1 if female, 0 if male |
egp |
Type of occupation |
Fitting a model assuming linear relationships between each predictor and respondents’ wages returns the following results:
mod <- lm(wage ~ edyears + age + female + egp, data = wages.dat)
summary(mod)
##
## Call:
## lm(formula = wage ~ edyears + age + female + egp, data = wages.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.100 -13.761 -2.858 8.719 234.155
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.81362 1.94612 40.498 <2e-16 ***
## edyears 3.23882 0.21025 15.404 <2e-16 ***
## age 0.47133 0.03313 14.228 <2e-16 ***
## female -17.10797 0.94132 -18.174 <2e-16 ***
## egpRoutine non-manual -13.17149 1.21909 -10.804 <2e-16 ***
## egpSkilled workers -12.60142 1.42218 -8.861 <2e-16 ***
## egpUnskilled workers -12.57560 1.44120 -8.726 <2e-16 ***
## egpUpper service class 4.09760 1.68588 2.431 0.0151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.5 on 3672 degrees of freedom
## Multiple R-squared: 0.3484, Adjusted R-squared: 0.3471
## F-statistic: 280.4 on 7 and 3672 DF, p-value: < 2.2e-16
## Test stat Pr(>|t|)
## edyears -0.488 0.626
## age -10.724 0.000
## female 0.635 0.525
## egp NA NA
## Tukey test 2.283 0.022
Here, the residual plot for the age variable does suggest a nonlinear relationship. Adding it to the model returns the following results:
mod2 <- lm(wage ~ edyears + age + I(age^2) + female + egp, data = wages.dat)
summary(mod2)
##
## Call:
## lm(formula = wage ~ edyears + age + I(age^2) + female + egp,
## data = wages.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.899 -13.342 -2.936 8.242 231.857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.785293 4.279082 8.830 < 2e-16 ***
## edyears 3.139944 0.207269 15.149 < 2e-16 ***
## age 2.615905 0.202624 12.910 < 2e-16 ***
## I(age^2) -0.025977 0.002422 -10.724 < 2e-16 ***
## female -16.956516 0.927151 -18.289 < 2e-16 ***
## egpRoutine non-manual -11.588935 1.209635 -9.581 < 2e-16 ***
## egpSkilled workers -11.104731 1.407543 -7.889 3.97e-15 ***
## egpUnskilled workers -10.905792 1.427856 -7.638 2.80e-14 ***
## egpUpper service class 4.181041 1.660322 2.518 0.0118 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.13 on 3671 degrees of freedom
## Multiple R-squared: 0.3682, Adjusted R-squared: 0.3668
## F-statistic: 267.4 on 8 and 3671 DF, p-value: < 2.2e-16
And showing the regression line from the first model (black) and second (red) illustrates the meaning of the squared term.
In Chapter 2 of AR, you also encounter a technique that makes even fewer assumptions about the shape of a relationship between two continuous variables. Locally weighted regression (loess or lowess) fits a locally weighted polynomial (i.e. a combination of squared, cubed, etc. terms) to the data.
Consider data on democracy in a variety of countries. How do wealth, international trade, government expenditures, and ethnic fractionalization relate to democracy? I’m returning to data you saw in the video for Day 1 and take a snapshot for the year 2010:
qog_dat <- rio::import("/Users/johanneskarreth/Documents/Dropbox/Uni/6 - Datasets/Quality of Government dataset/QoG_data_vJan16/qog_std_ts_jan16.csv")
##
Read 67.7% of 14770 rows
Read 14770 rows and 2573 (of 2573) columns from 0.086 GB file in 00:00:04
qog_dat2010 <- qog_dat[qog_dat$year == 2010, c("cname", "wdi_gdppcpppcon", "wdi_trade", "uds_mean", "al_ethnic", "imf_exp")]
qog_dat2010$gdppc_log <- log(qog_dat2010$wdi_gdppcpppcon)
qog_dat2010 <- na.omit(qog_dat2010)
par(mfrow = c(2, 2))
# Wealth and Democracy
plot(x = qog_dat2010$gdppc_log, y = qog_dat2010$uds_mean, col = "gray",
xlab = "GDP per capita (logged)", ylab = "Democracy")
abline(lm(uds_mean ~ gdppc_log, data = qog_dat2010))
lines(lowess(x = qog_dat2010$gdppc_log, y = qog_dat2010$uds_mean), col = "red")
# Trade and Democracy
plot(x = qog_dat2010$wdi_trade, y = qog_dat2010$uds_mean, col = "gray",
xlab = "Trade / GDP", ylab = "Democracy")
abline(lm(uds_mean ~ wdi_trade, data = qog_dat2010))
lines(lowess(x = qog_dat2010$wdi_trade, y = qog_dat2010$uds_mean), col = "red")
# Gov't expenditure and Democracy
plot(x = qog_dat2010$imf_exp, y = qog_dat2010$uds_mean, col = "gray",
xlab = "Gov't expenditure", ylab = "Democracy")
abline(lm(uds_mean ~ imf_exp, data = qog_dat2010))
lines(lowess(x = qog_dat2010$imf_exp, y = qog_dat2010$uds_mean), col = "red")
# Ethnic fractionalization and Democracy
plot(x = qog_dat2010$al_ethnic, y = qog_dat2010$uds_mean, col = "gray",
xlab = "Ethnic fractionalization", ylab = "Democracy")
abline(lm(uds_mean ~ al_ethnic, data = qog_dat2010))
lines(lowess(x = qog_dat2010$al_ethnic, y = qog_dat2010$uds_mean), col = "red")
par(mfrow = c(1, 1))
It is probably obvious that locally weighted regression is not all that useful to test hypotheses, as you cannot make a statement about variance and test a hypothesis about a relationship from the above figure. However, generalized additive models (GAMs) do provide both flexibility to relax the assumption of linearity as well as standard error estimates. You can find more information about GAMs in Beck & Jackman (1998).1
library(mgcv)
demo_gam <- gam(uds_mean ~ s(gdppc_log) + s(wdi_trade) + s(imf_exp) + s(al_ethnic),
data = qog_dat2010)
plot(demo_gam, pages = 1, residuals = TRUE, seWithMean = TRUE, shade = TRUE)
The figure above shows the relationship between each predictor and the outcome (democracy). Note that the scale of the outcome is now means-centered, so that 0 is equal to the mean of the democracy score in the data. Values above 0 on the y-axis indicate (predicted and actual) observations above the mean of the democracy score in the data, and values below 0 indicate observations below the mean of the democracy score in the data.
You’ve already worked with residuals a lot, but here is one more way they can be used to detect potential omitted variables in a regression model. As an example, we’ll work with the “Angell” data from John Fox’s “car” package. See http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-3E/datasets/Angell.pdf for a description of these data, which come from R. C. Angell (1951), “The moral integration of American Cities,” American Journal of Sociology, 57 (part 2): 1-140. The dataset contains 43 rows; each is one American city. The variables are:
Variable | Description |
---|---|
moral |
Composite of crime rate and welfare expenditures. |
hetero |
Ethnic heterogenity, from percentages of nonwhite and foreign-born white residents. |
mobility |
Geographic mobility: from percentages of residents moving into and out of the city. |
region |
E = Northeast, MW = Midwest, S = Southeast, W = West |
The outcome variable is Angell’s morality index.
angell.dat <- read.csv("http://www.jkarreth.net/files/angell.csv")
summary(angell.dat)
## moral hetero mobility region city
## Min. : 4.20 Min. :10.60 Min. :12.10 E : 9 Akron : 1
## 1st Qu.: 8.70 1st Qu.:16.90 1st Qu.:19.45 MW:14 Atlanta : 1
## Median :11.10 Median :23.70 Median :25.90 S :14 Baltimore : 1
## Mean :11.20 Mean :31.37 Mean :27.60 W : 6 Birmingham: 1
## 3rd Qu.:13.95 3rd Qu.:39.00 3rd Qu.:34.80 Bridgeport: 1
## Max. :19.00 Max. :84.50 Max. :49.80 Buffalo : 1
## (Other) :37
library(psych)
multi.hist(angell.dat[, c(1:3)], ncol = 3)
mod <- lm(moral ~ hetero + mobility, data = angell.dat)
Plotting observed against predicted values can reveal which observations the model over- and under-predicts:
angell.dat$predicted <- fitted(mod)
plot(x = angell.dat$moral, y = angell.dat$predicted,
xlab = "Observed Morality", ylab = "Predicted Morality", pch = 20,
col = "gray")
abline(a = 0, b = 1)
text(x = angell.dat$moral, y = angell.dat$predicted,
labels = angell.dat$city, cex = 0.7)
In the plot above, the line shows the perfect concordance between observed and predicted morality. For observations to the left of the line, the model predicts higher than actual morality; for observations to the right of the line, the model predicts lower than actual morality. Patterns in this plot can suggest potential omitted variables that should be in the model. Can you think of any omitted variables that this particular plot suggests?
Most of the diagnostics we’ve discussed so far relied on visual interpretation of residual plots. Statisticians have developed a larger number of more formal tests for violations of the regression assumptions; most of them are discussed in AR. Often, you can “eyeball” violations of the assumptions in residual plots, but more formal tests can be useful. Overall, though, the important question you should usually ask is whether potential violations of the OLS assumptions bias estimation results and whether they affect the standard errors for coefficient estimates.
Beck, N. and Jackman, S. (1998). Beyond linearity by default: Generalized additive models. American Journal of Political Science, 42 (2): 596–627.↩