This tutorial shows you how to use dummy variables in multiple regression and how to diagnose unusual and influential data points.
This tutorial only accompanies the other materials from our course. Be sure to consult those other course materials for full coverage of this topic.
Some of you are working with so-called dummy variables in your work, so we will briefly explore how these variables are used in multiple regression. Dummy variables are also explained well in chapter 7 of AR, but it doesn’t hurt to explore them earlier. Dummy variables are binary indicators that are set to 1 for all observations matching a particular classification and 0 for all other observations. For instance, a dummy variable in a survey for married respondents will be coded the following way:
\[ \text{married} = \begin{cases} 1, & \text{if respondent is married}\\ 0, & \text{otherwise} \end{cases} \]
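For illustration, here is how such a dummy could be created in R with `ifelse()`. The `marital` vector below is made-up example data, not part of any dataset used in this tutorial:

```r
# Made-up survey responses
marital <- c("married", "single", "married", "divorced", "widowed", "married")

# Dummy variable: 1 if the respondent is married, 0 otherwise
married <- ifelse(marital == "married", 1, 0)
married
## [1] 1 0 1 0 0 1
```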
We’ll start with an example dataset that I’ve taken from the accompanying materials to Kellstedt and Whitten’s Fundamentals of Political Science Research. This dataset is a modified extract from the 1996 edition of the (American) National Election Studies. This dataset has 1714 observations and 8 variables:
Variable | Description |
---|---|
demrep | Party identification (1 = strong Democrat, 7 = strong Republican) |
clinton.therm | Feeling thermometer toward Hillary Clinton |
dem.therm | Feeling thermometer toward the Democrats |
female | Female (1 = yes, 0 = no) |
age | Age in years |
educ | Education (1 = 8 grades or less, 7 = advanced degree) |
income | Income (1 = less than $2999, 24 = $105,000 or more) |
region | Northeast, North Central, South, or West |
library("rio")
nes.dat <- import("https://www.dropbox.com/s/24ktov8o7wcn3l2/nes1996subset.csv?dl=1")
summary(nes.dat)
## demrep clinton.therm dem.therm female
## Min. :1.000 Min. : 0.00 Min. : 0.00 Min. :0.0000
## 1st Qu.:3.000 1st Qu.: 30.00 1st Qu.: 40.00 1st Qu.:0.0000
## Median :4.000 Median : 60.00 Median : 60.00 Median :1.0000
## Mean :4.327 Mean : 52.81 Mean : 58.86 Mean :0.5519
## 3rd Qu.:5.000 3rd Qu.: 70.00 3rd Qu.: 70.00 3rd Qu.:1.0000
## Max. :7.000 Max. :100.00 Max. :100.00 Max. :1.0000
## NA's :385 NA's :29 NA's :27
## age educ income region
## Min. :18.00 Min. :1.000 Min. : 1.00 Length:1714
## 1st Qu.:34.00 1st Qu.:3.000 1st Qu.:11.00 Class :character
## Median :44.00 Median :4.000 Median :16.00 Mode :character
## Mean :47.54 Mean :4.105 Mean :15.03
## 3rd Qu.:61.00 3rd Qu.:6.000 3rd Qu.:20.00
## Max. :93.00 Max. :7.000 Max. :24.00
## NA's :2 NA's :3 NA's :150
Say you are interested in explaining why some respondents exhibit a more positive attitude toward Hillary Clinton than others. You could use bivariate regression to test the (somewhat obvious) argument that Republican respondents might be less likely to approve of Clinton than more Democratic respondents. First, you may want to center the party ID variable (here, at its median) for ease of interpretation:
nes.dat$demrep.ctr <- nes.dat$demrep - median(nes.dat$demrep, na.rm = TRUE)
m1 <- lm(clinton.therm ~ demrep.ctr, data = nes.dat)
plot(x = jitter(nes.dat$demrep.ctr), y = nes.dat$clinton.therm,
     xlab = "Party ID (Democratic -> Republican)",
     ylab = "Clinton thermometer")
abline(m1, col = "red")
You might have a theory that female respondents are more likely to hold a positive attitude toward Hillary Clinton than male respondents. Based on what you’ve learned so far, you can account for this by “controlling” for gender.
m2 <- lm(clinton.therm ~ demrep.ctr + female, data = nes.dat)
library("modelsummary")
modelsummary(list(m1, m2))
 | Model 1 | Model 2 |
---|---|---|
(Intercept) | 54.173 | 50.548 |
 | (0.751) | (1.087) |
demrep.ctr | −10.613 | −10.314 |
 | (0.528) | (0.528) |
female | | 6.722 |
 | | (1.466) |
Num.Obs. | 1316 | 1316 |
R2 | 0.235 | 0.247 |
R2 Adj. | 0.234 | 0.246 |
AIC | 12369.4 | 12350.5 |
BIC | 12384.9 | 12371.2 |
RMSE | 26.53 | 26.32 |
A good way to examine the role of dummy variables in regression is to visualize them in a scatterplot. Because a dummy variable can take on only two values, it merely shifts the intercept (\(\hat{y}\) when \(x = 0\)); the slope remains the same.
plot(x = jitter(nes.dat$demrep.ctr),
y = nes.dat$clinton.therm, type = "p",
xlab = "Party ID (Democratic -> Republican)",
ylab = "Clinton thermometer")
abline(v = 0, col = "gray")
abline(a = coef(m1)[1], b = coef(m1)[2])
abline(a = coef(m2)[1], b = coef(m2)[2], col = "red")
abline(a = coef(m2)[1] + coef(m2)[3], b = coef(m2)[2], col = "blue")
legend("bottomleft", legend = c("Model 1", "Model 2 - Males", "Model 2 - Females"),
col = c("black", "red", "blue"), lty = 1)
Dummy variables are always binary, but they can also be created based on categorical variables with more than two categories. For instance, you might consider the geographic region of respondents. You can use the region variable to this end. But this is a categorical variable with four values:
str(nes.dat$region)
## chr [1:1714] "Northeast" "Northeast" "Northeast" "Northeast" "Northeast" ...
table(nes.dat$region)
##
## North Central Northeast South West
## 458 260 642 354
You can manually create four dummy variables, one for each region:
nes.dat$region.northeast <- ifelse(nes.dat$region == "Northeast", 1, 0)
nes.dat$region.northcentral <- ifelse(nes.dat$region == "North Central", 1, 0)
nes.dat$region.south <- ifelse(nes.dat$region == "South", 1, 0)
nes.dat$region.west <- ifelse(nes.dat$region == "West", 1, 0)
m3 <- lm(clinton.therm ~ demrep.ctr +
           region.northeast + region.northcentral + region.south + region.west,
         data = nes.dat)
summary(m3)
##
## Call:
## lm(formula = clinton.therm ~ demrep.ctr + region.northeast +
## region.northcentral + region.south + region.west, data = nes.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81.432 -18.375 1.269 18.284 77.165
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.1761 1.5249 36.840 < 2e-16 ***
## demrep.ctr -10.5403 0.5275 -19.980 < 2e-16 ***
## region.northeast -1.7201 2.3833 -0.722 0.47059
## region.northcentral -6.3650 2.0810 -3.059 0.00227 **
## region.south -0.1769 1.9653 -0.090 0.92830
## region.west NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.45 on 1311 degrees of freedom
## (398 observations deleted due to missingness)
## Multiple R-squared: 0.2428, Adjusted R-squared: 0.2405
## F-statistic: 105.1 on 4 and 1311 DF, p-value: < 2.2e-16
You can again plot this output:
plot(x = jitter(nes.dat$demrep.ctr),
y = nes.dat$clinton.therm,
type = "p",
xlab = "Party ID (Democratic -> Republican)",
ylab = "Clinton thermometer")
abline(v = 0, col = "gray")
abline(a = coef(m3)[1] + coef(m3)[3], b = coef(m3)[2], col = "blue", lwd = 2)
abline(a = coef(m3)[1] + coef(m3)[4], b = coef(m3)[2], col = "brown", lwd = 2)
abline(a = coef(m3)[1] + coef(m3)[5], b = coef(m3)[2], col = "red", lwd = 2)
abline(a = coef(m3)[1], b = coef(m3)[2], col = "green", lwd = 2)
legend("bottomleft", legend = c("Northeast", "North Central", "South", "West"),
col = c("blue", "brown", "red", "green"), lty = 1, lwd = 2)
Note the overlap of the lines for South and West.
Alternatively, rather than creating individual dummy variables by hand, you can use the factor() function within the regression equation to let R do the work:
m3b <- lm(clinton.therm ~ demrep.ctr + factor(region),
          data = nes.dat)
summary(m3b)
##
## Call:
## lm(formula = clinton.therm ~ demrep.ctr + factor(region), data = nes.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81.432 -18.375 1.269 18.284 77.165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.8111 1.4308 34.814 < 2e-16 ***
## demrep.ctr -10.5403 0.5275 -19.980 < 2e-16 ***
## factor(region)Northeast 4.6449 2.3178 2.004 0.04528 *
## factor(region)South 6.1881 1.8796 3.292 0.00102 **
## factor(region)West 6.3650 2.0810 3.059 0.00227 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.45 on 1311 degrees of freedom
## (398 observations deleted due to missingness)
## Multiple R-squared: 0.2428, Adjusted R-squared: 0.2405
## F-statistic: 105.1 on 4 and 1311 DF, p-value: < 2.2e-16
Compare this output to m3: which coefficients differ, and why? If you wanted to specify a different baseline category, you could use relevel():
m3b <- lm(clinton.therm ~ demrep.ctr + relevel(factor(region), ref = "South"),
          data = nes.dat)
summary(m3b)
##
## Call:
## lm(formula = clinton.therm ~ demrep.ctr + relevel(factor(region),
## ref = "South"), data = nes.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81.432 -18.375 1.269 18.284 77.165
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 55.9992 1.2554 44.607
## demrep.ctr -10.5403 0.5275 -19.980
## relevel(factor(region), ref = "South")North Central -6.1881 1.8796 -3.292
## relevel(factor(region), ref = "South")Northeast -1.5432 2.2147 -0.697
## relevel(factor(region), ref = "South")West 0.1769 1.9653 0.090
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## demrep.ctr < 2e-16 ***
## relevel(factor(region), ref = "South")North Central 0.00102 **
## relevel(factor(region), ref = "South")Northeast 0.48606
## relevel(factor(region), ref = "South")West 0.92830
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.45 on 1311 degrees of freedom
## (398 observations deleted due to missingness)
## Multiple R-squared: 0.2428, Adjusted R-squared: 0.2405
## F-statistic: 105.1 on 4 and 1311 DF, p-value: < 2.2e-16
Lastly, if you have a factor with many levels and would like to create dummy variables for your dataset without repeating this step as many times as the variable has categories, you can use the code below to achieve this task. This code does the same thing you did above using the ifelse() function by hand:
for(level in unique(nes.dat$region)){
  nes.dat[paste("region", level, sep = "_")] <- ifelse(nes.dat$region == level, 1, 0)
}
summary(nes.dat)
## demrep clinton.therm dem.therm female
## Min. :1.000 Min. : 0.00 Min. : 0.00 Min. :0.0000
## 1st Qu.:3.000 1st Qu.: 30.00 1st Qu.: 40.00 1st Qu.:0.0000
## Median :4.000 Median : 60.00 Median : 60.00 Median :1.0000
## Mean :4.327 Mean : 52.81 Mean : 58.86 Mean :0.5519
## 3rd Qu.:5.000 3rd Qu.: 70.00 3rd Qu.: 70.00 3rd Qu.:1.0000
## Max. :7.000 Max. :100.00 Max. :100.00 Max. :1.0000
## NA's :385 NA's :29 NA's :27
## age educ income region
## Min. :18.00 Min. :1.000 Min. : 1.00 Length:1714
## 1st Qu.:34.00 1st Qu.:3.000 1st Qu.:11.00 Class :character
## Median :44.00 Median :4.000 Median :16.00 Mode :character
## Mean :47.54 Mean :4.105 Mean :15.03
## 3rd Qu.:61.00 3rd Qu.:6.000 3rd Qu.:20.00
## Max. :93.00 Max. :7.000 Max. :24.00
## NA's :2 NA's :3 NA's :150
## demrep.ctr region.northeast region.northcentral region.south
## Min. :-3.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:-1.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean : 0.3273 Mean :0.1517 Mean :0.2672 Mean :0.3746
## 3rd Qu.: 1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. : 3.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :385
## region.west region_Northeast region_North Central region_South
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.2065 Mean :0.1517 Mean :0.2672 Mean :0.3746
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## region_West
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2065
## 3rd Qu.:0.0000
## Max. :1.0000
##
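As an aside, R performs exactly this factor-to-dummy expansion internally via model.matrix() whenever a formula contains a factor. A small sketch with a toy data frame (made up here for illustration, not the NES data):

```r
toy <- data.frame(region = c("Northeast", "South", "West", "South"))

# One dummy column per level (no intercept, so no baseline level is dropped)
model.matrix(~ region - 1, data = toy)

# Default treatment coding, as used by lm(): the first level is the baseline
model.matrix(~ region, data = toy)
```

In the second call, the column for the baseline level is absorbed into the intercept, which is exactly why lm() reports one coefficient fewer than the factor has levels.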
The above example uses real survey data, and you can judge for yourself how much of a difference the inclusion of the dummy variables as control variables makes to your estimates. However, we can also simulate data where the results from a model without and with the dummy variable differ dramatically. In this example, I simulate a continuous variable \(x_1\), a binary (dummy) variable \(x_2\), and an outcome \(y\) that is first created from the DGP \(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \varepsilon\). Next, I add 1 to the values of \(x_1\) for all observations where \(x_2 = 1\).
set.seed(123)
n.obs <- 100
x1 <- runif(n = n.obs, min = -1, max = 1)
x2 <- rbinom(n = n.obs, size = 1, prob = 0.5)
e <- rnorm(n.obs, mean = 0, sd = 2)
a <- 0.2
b1 <- 2.5
b2 <- -7.5
y <- a + b1 * x1 + b2 * x2 + e
sim.dat <- data.frame(y, x1, x2)
sim.dat[sim.dat$x2 == 1, ]$x1 <- sim.dat[sim.dat$x2 == 1, ]$x1 + 1
sim.dat$color <- ifelse(sim.dat$x2 == 1, "blue", "red")
Now, I fit two models: Model 1 predicts \(y\) with \(x_1\), and Model 2 adds the (binary) \(x_2\) as a control variable.
m1 <- lm(y ~ x1, data = sim.dat)
m2 <- lm(y ~ x1 + x2, data = sim.dat)
modelsummary(list(m1, m2))
 | Model 1 | Model 2 |
---|---|---|
(Intercept) | −2.821 | 0.174 |
 | (0.564) | (0.276) |
x1 | −1.914 | 2.558 |
 | (0.653) | (0.345) |
x2 | | −10.437 |
 | | (0.489) |
Num.Obs. | 100 | 100 |
R2 | 0.081 | 0.839 |
R2 Adj. | 0.071 | 0.835 |
AIC | 593.7 | 421.7 |
BIC | 601.5 | 432.1 |
F | 8.598 | 252.226 |
RMSE | 4.57 | 1.91 |
If you visualize the two models and their predictions, the difference becomes pretty clear:
plot(x = sim.dat$x1, y = sim.dat$y, col = sim.dat$color)
abline(a = coef(m1)[1], b = coef(m1)[2])
abline(a = coef(m2)[1], b = coef(m2)[2], col = "red")
abline(a = coef(m2)[1] + coef(m2)[3], b = coef(m2)[2], col = "blue")
legend("bottomright", legend = c("Model 1", "Model 2: x2 = 0", "Model 2: x2 = 1"),
col = c("black", "red", "blue"), lty = 1)
You already encountered the role of unusual and influential data points in earlier conversations. The remainder of this tutorial follows closely chapter 11 in AR; please refer to that reading for explanations of the concepts used below.
As you read in the introduction to chapter 11 in AR, problems with unusual and influential data can often be dealt with long before you fit a statistical model: through examining data, building careful theoretical models, and transforming data where necessary. If they do occur, though, unusual data points are
“problematic in linear models fit by least squares because they can unduly influence the results of the analysis and because their presence may be a signal that the model fails to capture important characteristics of the data.” (AR, p. 241)
Unusual data points are also referred to as outliers. Outliers are “conditionally unusual [data points] given the value of the explanatory variable” (AR, p. 241). This means that you should consider residuals when you think about outliers. Residuals are defined as
\[ \varepsilon_i = y_i - \hat{y}_i \]
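You can check this definition directly in R: the residuals stored in a fitted lm object are the observed outcome values minus the fitted values. A quick sketch using the built-in mtcars data (an arbitrary example model, not the crime data used below):

```r
# Arbitrary example model on built-in data
m <- lm(mpg ~ wt, data = mtcars)

# resid() returns y_i minus the fitted value y-hat_i
all.equal(unname(resid(m)), mtcars$mpg - unname(fitted(m)))
## [1] TRUE
```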
Because residuals do not have equal variance, they cannot be directly compared when your goal is to assess how unusual individual observations are. For this reason, we often use studentized residuals to express the “unusualness” of data points on a comparable scale (see more in AR, p. 246-247):
\[ \varepsilon_{Ti} = \frac{\varepsilon_{i}}{\hat{\sigma}_{-i} \sqrt{1 - h_i}} \]
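To see what this formula does, you can reproduce R’s built-in rstudent() by hand. The influence() function returns both the hat values \(h_i\) and the leave-one-out estimates \(\hat{\sigma}_{-i}\); again, mtcars serves as an arbitrary example:

```r
# Arbitrary example model on built-in data
m <- lm(mpg ~ wt, data = mtcars)
infl <- influence(m)  # $hat holds h_i, $sigma holds sigma-hat estimated without case i

# Studentized residuals computed from the formula above
manual <- resid(m) / (infl$sigma * sqrt(1 - infl$hat))
all.equal(manual, rstudent(m))
## [1] TRUE
```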
Unusual data points alone, however, are not necessarily a reason for concern when you are interested in using OLS to make inferences. Concern is most warranted from a combination of “unusualness” and leverage on your parameter estimates. In AR’s words:
\[ \text{Influence on coefficients} = \text{Leverage} \times \text{Discrepancy} \]
This symbolic equation guides the remainder of this tutorial.
As example data, we use the crime dataset that appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The dataset has 51 rows (each row is one state) and 9 variables. The variables include:
Variable | Description |
---|---|
state | State name |
violent | Violent crimes per 100,000 people |
murder | Murders per 1,000,000 people |
metro | Percent of the population living in metropolitan areas |
white | Percent of the population that is white |
hsgrad | Percent of population with a high school education or above |
poverty | Percent of population living under poverty line |
snglpar | Percent of population that are single parents |
First, we estimate a regression model, predicting the crime rate with the poverty rate and the percent of population with a high school education (or above). Note, however, that this regression model uses observational data and is not designed to recover causal effects of any of the predictors.
After fitting the model, we can create a scatterplot of fitted values and residuals to check any conditionally unusual observations.
library("rio")
crime.dat <- import("CRIME.dta")
m1 <- lm(violent ~ poverty + hsgrad, data = crime.dat)
plot(x = fitted.values(m1), y = resid(m1), type = "n")
text(x = fitted.values(m1), y = resid(m1), labels = names(fitted.values(m1)))
R also offers a built-in residual plot that labels the most unusual points (with the row number of that observation). Otherwise, this plot is identical to the one you just created “by hand”. See ?plot.lm for more information.
plot(m1, which = c(1))
It looks like observations 20, 49, and 51 are somewhat unusual. Let’s look at these cases in the data by calling these rows individually - remember, you can access individual rows of a dataframe by specifying their numbers before the comma in the hard brackets, and you can access individual columns by doing the same after the comma.
crime.dat[c(20, 49, 51), ]
## state violent murder metro white hsgrad poverty snglpar
## 20 MD 998 12.7 92.8 68.9 78.4 9.7 12.0
## 49 WV 208 6.9 41.8 96.3 66.0 22.2 9.4
## 51 DC 2922 78.5 100.0 31.8 73.1 26.4 22.1
We can calculate studentized residuals (see above) and add them as a new variable to our dataset. Then, I use the arrange() function from the “dplyr” package to conveniently sort the dataset by the absolute value of the studentized residuals. Studentized residuals can be positive or negative, so sorting them by their raw values in ascending or descending order won’t do what I want here. The arrange() function can be applied to any data frame. Its first argument is the dataframe to which you want to apply it. The second argument takes the variable by which you want to sort the data. Here, I use two nested commands. First, I use abs(m1.studentized.resid) rather than only the variable name m1.studentized.resid because I’d like to sort by the absolute values of the variable. Next, I nest this term in desc(). This command stands for “sort in descending order.” Lastly, I again use row indexing to show only the first 10 rows of the sorted dataframe.
crime.dat$m1.studentized.resid <- rstudent(m1)
library("dplyr")
arrange(crime.dat, desc(abs(m1.studentized.resid)))[1:10, ]
## state violent murder metro white hsgrad poverty snglpar m1.studentized.resid
## 1 DC 2922 78.5 100.0 31.8 73.1 26.4 22.1 6.339983
## 2 WV 208 6.9 41.8 96.3 66.0 22.2 9.4 -2.091204
## 3 MS 434 13.5 30.7 63.3 64.3 24.7 14.7 -1.853594
## 4 MD 998 12.7 92.8 68.9 78.4 9.7 12.0 1.801713
## 5 MT 178 3.0 24.0 92.6 81.0 14.9 10.8 -1.615579
## 6 NV 875 10.4 84.8 86.7 78.8 9.8 12.4 1.397613
## 7 SD 208 3.4 32.6 90.2 77.1 14.2 9.4 -1.124149
## 8 WY 286 3.4 29.7 95.9 83.0 13.3 10.8 -1.121716
## 9 IL 960 11.4 84.0 81.0 76.2 13.6 11.5 1.050410
## 10 FL 1206 8.9 93.0 83.5 74.4 17.8 10.6 1.048094
I can also use the outlierTest() function from the “car” package to obtain “statistically significant” outliers. This test is based on the Bonferroni p-value, as explained in section 11.3.1 of AR. You can find out more about this function via ?outlierTest. Don’t forget to load the “car” package before using this function.
library("car")
outlierTest(m1, cutoff = 0.1)
## rstudent unadjusted p-value Bonferroni p
## 51 6.339983 8.2151e-08 4.1897e-06
With tests like these, please keep in mind the warning in section 11.5 of AR: the numerical cutoffs used for these tests are not a be-all end-all criterion for how you should deal with outliers. Rather, a careful combination of numerical and visual tests as well as inspection of the cases should guide your practice.
To diagnose leverage on OLS coefficients, AR recommends investigating hat values (section 11.2). You can easily access them using the hatvalues() function (no extra R package needed), again add them to your dataset, and sort the dataset to see which observations have the highest hat values.
crime.dat$m1.hatvalues <- hatvalues(m1)
arrange(crime.dat, desc(m1.hatvalues))[1:10, ]
## state violent murder metro white hsgrad poverty snglpar m1.studentized.resid
## 1 DC 2922 78.5 100.0 31.8 73.1 26.4 22.1 6.3399832
## 2 LA 1062 20.3 75.0 66.7 68.3 26.4 14.9 -0.6160271
## 3 MS 434 13.5 30.7 63.3 64.3 24.7 14.7 -1.8535942
## 4 KY 463 6.6 48.5 91.8 64.6 20.4 10.6 -0.8919457
## 5 RI 402 3.9 93.6 92.6 72.0 11.2 10.8 0.2513072
## 6 WV 208 6.9 41.8 96.3 66.0 22.2 9.4 -2.0912037
## 7 AK 761 9.0 41.8 75.2 86.6 9.1 14.3 0.7655789
## 8 AL 780 11.6 67.4 73.5 66.9 17.4 11.5 0.4221204
## 9 AR 593 10.2 44.7 82.9 66.3 20.0 10.7 -0.5448110
## 10 VA 372 8.3 77.5 77.1 75.2 9.7 10.3 0.2582032
## m1.hatvalues
## 1 0.24915431
## 2 0.17360933
## 3 0.13194366
## 4 0.10793022
## 5 0.09867824
## 6 0.09265962
## 7 0.09138438
## 8 0.08900606
## 9 0.08279538
## 10 0.07752945
In section 11.4 of AR, you encounter Cook’s D(istance) statistic as a measure of influence. Again, you can use the appropriately named function in R to obtain these values and sort the dataset by them.
crime.dat$m1.cooksD <- cooks.distance(m1)
arrange(crime.dat, desc(m1.cooksD))[1:10, ]
## state violent murder metro white hsgrad poverty snglpar m1.studentized.resid
## 1 DC 2922 78.5 100.0 31.8 73.1 26.4 22.1 6.3399832
## 2 MS 434 13.5 30.7 63.3 64.3 24.7 14.7 -1.8535942
## 3 WV 208 6.9 41.8 96.3 66.0 22.2 9.4 -2.0912037
## 4 MT 178 3.0 24.0 92.6 81.0 14.9 10.8 -1.6155787
## 5 MD 998 12.7 92.8 68.9 78.4 9.7 12.0 1.8017133
## 6 KY 463 6.6 48.5 91.8 64.6 20.4 10.6 -0.8919457
## 7 WY 286 3.4 29.7 95.9 83.0 13.3 10.8 -1.1217157
## 8 NV 875 10.4 84.8 86.7 78.8 9.8 12.4 1.3976125
## 9 LA 1062 20.3 75.0 66.7 68.3 26.4 14.9 -0.6160271
## 10 AK 761 9.0 41.8 75.2 86.6 9.1 14.3 0.7655789
## m1.hatvalues m1.cooksD
## 1 0.24915431 2.44748707
## 2 0.13194366 0.16567269
## 3 0.09265962 0.13909032
## 4 0.06112029 0.05480014
## 5 0.04489217 0.04858550
## 6 0.10793022 0.03222207
## 7 0.07044526 0.03161483
## 8 0.04162281 0.02772722
## 9 0.17360933 0.02692259
## 10 0.09138438 0.01982036
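Cook’s D makes the symbolic equation from above (influence = leverage × discrepancy) concrete: it equals the squared standardized residual (discrepancy) times \(h_i / (1 - h_i)\) (leverage), divided by the number of coefficients \(k + 1\). A quick check of this identity, using the built-in mtcars data rather than the crime data so the sketch is self-contained:

```r
# Arbitrary example model on built-in data
m <- lm(mpg ~ wt, data = mtcars)
h <- hatvalues(m)
p <- length(coef(m))  # number of coefficients, k + 1

# Discrepancy term (squared standardized residual) times leverage term
manual <- (rstandard(m)^2 / p) * (h / (1 - h))
all.equal(manual, cooks.distance(m))
## [1] TRUE
```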
The “car” package also contains the influencePlot() function, which you can use to plot leverage, discrepancy, and influence all in one plot. The y-axis (studentized residuals) indicates how unusual a data point is, the x-axis (hat values) shows its leverage on the coefficients, and the size of the bubbles in the plot indicates the Cook’s D value of each data point. See Figure 11.5 in AR and ?influencePlot for reference.
influencePlot(m1)
## StudRes Hat CookD
## 18 -0.6160271 0.17360933 0.02692259
## 25 -1.8535942 0.13194366 0.16567269
## 49 -2.0912037 0.09265962 0.13909032
## 51 6.3399832 0.24915431 2.44748707
An alternative but related measure of influence is dfbetas (see AR section 11.4). You can add dfbetas to your data frame and then create a scatterplot of them, using state IDs as labels. Then, you will easily see which observation(s) exert high influence on both coefficients.
crime.dat$m1.dfbeta.poverty <- dfbetas(m1)[, "poverty"]
crime.dat$m1.dfbeta.hsgrad <- dfbetas(m1)[, "hsgrad"]
plot(x = crime.dat$m1.dfbeta.poverty,
     y = crime.dat$m1.dfbeta.hsgrad,
     type = "n")
text(x = crime.dat$m1.dfbeta.poverty,
     y = crime.dat$m1.dfbeta.hsgrad,
     labels = crime.dat$state)
In section 11.6 of AR, you learn about the joint influence of data points on regression coefficients. You can see what is meant by this by comparing regression results from two models. Above, it seemed as if the District of Columbia exerts an undue influence on your regression results. So let’s estimate our model from above, but drop observation 51 (DC) from the data.
m2 <- lm(violent ~ poverty + hsgrad, data = crime.dat[-c(51), ])
modelsummary(list(m1, m2))
 | Model 1 | Model 2 |
---|---|---|
(Intercept) | −2023.270 | 345.852 |
 | (1288.790) | (1026.638) |
poverty | 68.741 | 23.927 |
 | (17.469) | (14.763) |
hsgrad | 21.725 | −1.502 |
 | (14.321) | (11.239) |
Num.Obs. | 51 | 50 |
R2 | 0.293 | 0.136 |
R2 Adj. | 0.264 | 0.100 |
AIC | 755.1 | 710.6 |
BIC | 762.8 | 718.2 |
F | 9.969 | 3.709 |
RMSE | 367.11 | 272.21 |
You can also use added-variable plots to see whether observations might exert leverage on the respective coefficients. This might be useful if you don’t want to drop observations based on previous diagnostics and prefer to explore your data before continuing. The “car” package provides the avPlots() function, and these plots are explained in more detail in section 11.6.1 in AR.
avPlots(m1)
At the end of this tutorial, please consider the advice in section 11.7 of AR:
“Outlying and influential data should not be ignored, but they also should not simply be deleted without investigation.”
Outlying and influential data are often a good starting point for further analysis that may reveal new and important insights.
As an exercise, load the dataset using the read.table() function and the URL of the dataset. Then, estimate a regression model predicting infant mortality, with a set of at least 2 predictors you think are theoretically related to infant mortality. Check whether outliers influence your inferences. If yes, how do you propose to deal with these outliers?