This tutorial shows you:

  • how to specify quadratic terms in regression models
  • how to explore nonlinear relationships using lo(w)ess smoothers and generalized additive models
  • how to use residuals to interpret model quality

As always, please note that this tutorial only accompanies the other course materials and that you are expected to have worked through the assigned reading before tackling this tutorial.

Nonlinear relationships: quadratic terms

So far, we have not encountered serious violations of the linearity assumption, that is, the assumption of a linear relationship between predictors and outcome. But this assumption simply means that we impose a linear structure on the relationship between \(x\) and \(y\). Coefficient estimates from a regression model alone will not reveal whether the relationship between \(x\) and \(y\) in your data actually is linear, but a scatterplot is a useful first check of whether this might be the case.

Theoretical example

Theories often make predictions of the form, “as \(x\) increases, \(y\) first increases, and then drops again”. An example of this is the Kuznets curve in economics, which suggests that as countries developed, income inequality first increased, peaked, and then decreased (summarized, for instance, in Acemoglu and Robinson 2002). This implies a so-called curvilinear relationship between economic development and inequality: both poor and rich countries should have low inequality, whereas middle-income countries should exhibit high levels of inequality.
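One common way to express such an inverted-U pattern is a quadratic specification. This is a sketch under the assumption that a second-order polynomial is an adequate functional form, and other curvilinear forms are possible:

\[ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \]

With \(\beta_2 < 0\), the curve opens downward. Setting the derivative \(\beta_1 + 2 \beta_2 x\) to zero gives the turning point:

\[ x^{*} = -\frac{\beta_1}{2 \beta_2} \]

For \(\beta_1 > 0\) and \(\beta_2 < 0\), the expected value of \(y\) rises until \(x^{*}\) and falls afterwards.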

Example with simulated data

Take the following example:

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.650  -5.757   3.239   9.980  28.822 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.1480     0.8235  -9.895  < 2e-16 ***
## x             0.8155     0.2559   3.186  0.00155 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.02 on 425 degrees of freedom
## Multiple R-squared:  0.02333,    Adjusted R-squared:  0.02103 
## F-statistic: 10.15 on 1 and 425 DF,  p-value: 0.001547
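Output of this kind can be produced by fitting a straight line to data that actually follow a curve. The following is a minimal sketch with hypothetical simulated data, not the data behind the output above. The seed, the range of x, the coefficients, and the noise level are all assumptions chosen for illustration. Only the sample size (427, implied by the 425 residual degrees of freedom) is taken from the output.

```r
# Hypothetical simulated data: NOT the data behind the output above.
set.seed(123)                      # arbitrary seed, for reproducibility only
n <- 427                           # 425 residual df + 2 estimated coefficients
x <- runif(n, min = -5, max = 5)   # assumed range of the predictor

# Assumed data-generating process: an inverted U plus random noise
y <- 5 + 1 * x - 2 * x^2 + rnorm(n, mean = 0, sd = 5)

# Linear-only model: imposes a straight line on curved data
m_lin <- lm(y ~ x)
summary(m_lin)

# Scatterplot with the fitted line; the curvature is visible by eye
plot(x, y)
abline(m_lin, col = "red")
```

The estimates from this sketch will differ from the output shown above, because the underlying data are different.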

You might notice the low \(R^2\) value, but that in itself does not indicate a problem. Examining the residual plots, however, reveals that the model produces residuals that cluster below 0 at low and high values of \(x\):

## Loading required package: carData
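The package message above suggests the tutorial draws its residual plots with the car package. The sketch below shows the same kind of check in base R only, again on hypothetical simulated data (the seed, coefficients, and noise level are assumptions, not the tutorial's data). It plots residuals against x with a lowess smoother, and then averages the residuals within the lower, middle, and upper third of x.

```r
# Hypothetical simulated data with an inverted-U relationship (assumed values)
set.seed(123)
n <- 427
x <- runif(n, min = -5, max = 5)
y <- 5 + 1 * x - 2 * x^2 + rnorm(n, mean = 0, sd = 5)

# Linear-only model, as in the output above
m_lin <- lm(y ~ x)
res <- resid(m_lin)

# Residuals against the predictor, with a zero line and a lowess smoother.
# A smoother that bends away from zero points to a misspecified functional form.
plot(x, res, xlab = "x", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(x, res), col = "red")

# Numerical version of the same check: mean residual by thirds of x
x_bin <- cut(x,
             breaks = quantile(x, probs = c(0, 1/3, 2/3, 1)),
             include.lowest = TRUE,
             labels = c("low", "middle", "high"))
res_by_bin <- tapply(res, x_bin, mean)
res_by_bin
```

If the linear specification were adequate, the smoother would stay close to the dashed zero line and the three group means would all be near zero. With curved data like these, the means at low and high values of x fall below zero, while the mean in the middle lies above it.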