**This tutorial shows you**:

- how to diagnose non-normally distributed errors
- how to diagnose heteroskedasticity
- how to diagnose collinearity
- how to transform variables
- how to obtain robust standard errors

This tutorial only accompanies the other materials from our course. Be sure to consult those other course materials for a full coverage of this topic.

**This tutorial focuses on violations of the OLS assumptions.
As a reminder, the assumptions you’ve encountered are:**

- [A1] Linear relationships between each \(x\) and \(y\)
- [A2] Constant error variance
- [A3] Normality of the errors: \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\)
- [A4] Independence of observations
- [A5] No correlation b/w \(x\) and \(\varepsilon\): \(\text{cor}(x, \varepsilon) = 0\)

For violations of most of these assumptions, investigating the residuals of a regression model is a good start. Hence this tutorial will focus on residuals quite a bit. Residuals (or errors) are the differences between the observed values of \(y\) and the predicted, or fitted, values for \(y\), also known as \(\hat{y}\):

\[ \varepsilon_i = y_i - \hat{y}_i \]

You recall that we assume these residuals to be distributed normally
with a mean of 0 and a *constant* variance. That is, the spread
of the residuals around the predicted values is the same across the
range of predicted values.

You read in section 12.1 of *AR* that non-normally distributed
errors do not compromise the validity of OLS estimates in large samples,
but they may give you a reason to improve your model. There are two
simple graphical ways to diagnose whether the residuals are not normally
distributed, and you’re already familiar with these methods from the tutorial
on distributions and probability.

For an example, we’ll use data from the Canadian Survey of Labour and Income Dynamics, for the province of Ontario. Observations are individuals from Ontario who responded to the 1994 wave of the survey. These data are provided as part of the “car” package, but I uploaded them to my website as well. The following variables are included:

```
<- rio::import("http://www.jkarreth.net/files/car_SLID.csv")
slid.dat summary(slid.dat)
```

```
## wages education age sex
## Min. : 2.300 Min. : 0.00 Min. :16.00 Length:7425
## 1st Qu.: 9.235 1st Qu.:10.30 1st Qu.:30.00 Class :character
## Median :14.090 Median :12.10 Median :41.00 Mode :character
## Mean :15.553 Mean :12.50 Mean :43.98
## 3rd Qu.:19.800 3rd Qu.:14.53 3rd Qu.:57.00
## Max. :49.920 Max. :20.00 Max. :95.00
## NA's :3278 NA's :249
## language
## Length:7425
## Class :character
## Mode :character
##
##
##
##
```

This dataset contains 5 variables:

Variable | Description |
---|---|

`wages` |
Composite hourly wage rate from all jobs in Canadian dollars. |

`education` |
Number of years of schooling. |

`age` |
Age in years |

`sex` |
A categorical variable: Female, Male. |

`language` |
A categorical variable: English, French, Other. |

As often before, we’d like to predict hourly wages
(`wages`

) with education and, in this case, other variables.
To do so, we can fit an OLS regression model as usual:

```
<- lm(wages ~ education + age + sex, data = slid.dat)
mod summary(mod)
```

```
##
## Call:
## lm(formula = wages ~ education + age + sex, data = slid.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.111 -4.328 -0.792 3.243 35.892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.905243 0.607771 -13.01 <2e-16 ***
## education 0.918735 0.034514 26.62 <2e-16 ***
## age 0.255101 0.008634 29.55 <2e-16 ***
## sexMale 3.465251 0.208494 16.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.602 on 4010 degrees of freedom
## (3411 observations deleted due to missingness)
## Multiple R-squared: 0.2972, Adjusted R-squared: 0.2967
## F-statistic: 565.3 on 3 and 4010 DF, p-value: < 2.2e-16
```

To investigate the residuals of this regression, we can use the
`resid()`

or (equivalently) `residuals()`

functions.

`summary(resid(mod))`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -26.1108 -4.3279 -0.7924 0.0000 3.2434 35.8921
```

But these quantities don’t tell us much about the
*distribution* of the residuals.

Hence, a density plot is often useful. You’re already familiar with density plots (and histograms) from Days 2 and 3.

`plot(density(resid(mod)))`