This tutorial shows you:

• how to use and interpret dummy variables in linear regression
• how to diagnose outliers in a multiple regression model
• how to assess the influence of outliers on regression results

Note on copying & pasting code from the PDF version of this tutorial: Code copied and pasted from the PDF version of this tutorial into your R script may not run. When the PDF is created, some characters (for instance, quotation marks or indentations) are converted into non-text characters that R won’t recognize. To use code from this tutorial, type it into your R script yourself, or copy & paste it from the source file for this tutorial, which is posted on my website.

Note on R functions discussed in this tutorial: I don’t discuss many functions in detail here and therefore I encourage you to look up the help files for these functions or search the web for them before you use them. This will help you understand the functions better. Each of these functions is well-documented either in its help file (which you can access in R by typing ?ifelse, for instance) or on the web. The Companion to Applied Regression (see our syllabus) also provides many detailed explanations.

As always, please note that this tutorial only accompanies the other materials for Day 9 and that you are expected to have worked through the reading for that day before tackling this tutorial.

# Dummy variables in regression

Some of you are working with so-called dummy variables in your replication assignments, so we will briefly explore how these variables are used in multiple regression. Dummy variables are also explained well in chapter 7 of AR (assigned on Day 11), but it doesn’t hurt to explore them earlier. Dummy variables are binary indicators that are set to 1 for all observations matching a particular classification and to 0 for all other observations. For instance, a dummy variable for married respondents in a survey will be coded the following way:

$\text{married} = \begin{cases} 1, & \text{if respondent is married}\\ 0, & \text{otherwise} \end{cases}$
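
A minimal sketch of how such an indicator could be built in R with `ifelse()`. The data frame `survey` and its `marital.status` column are hypothetical and only serve as an illustration; they are not part of the dataset used below.

```r
# Hypothetical toy data: 'survey' and 'marital.status' are made up for illustration
survey <- data.frame(
  id = 1:5,
  marital.status = c("married", "single", "married", "divorced", NA)
)

# ifelse() returns 1 where the condition is TRUE, 0 where it is FALSE,
# and NA where the condition itself is NA (here: missing marital status)
survey$married <- ifelse(survey$marital.status == "married", 1, 0)

# Check the coding, including any missing values
table(survey$married, useNA = "ifany")
```

Note that respondents with a missing marital status stay missing on the dummy rather than being silently coded as 0.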

## Survey data example: Attitudes toward Hillary Clinton

We’ll start with an example dataset that I’ve taken from the accompanying materials to Kellstedt and Whitten’s Fundamentals of Political Science Research. It is a modified extract from the 1996 edition of the (American) National Election Studies and has 1714 observations and 8 variables:

| Variable | Description |
|---|---|
| demrep | Party identification (1 = strong Democrat, 7 = strong Republican) |
| clinton.therm | Feeling thermometer toward Hillary Clinton |
| dem.therm | Feeling thermometer toward the Democrats |
| female | Female (1 = yes, 0 = no) |
| age | Age in years |
| educ | Education (1 = 8 grades or less, 7 = advanced degree) |
| income | Income (1 = less than \$2999, 24 = \$105,000 or more) |
| region | Northeast, North Central, South, or West |

# import() is assumed to come from the rio package
library(rio)
nes.dat <- import("https://www.dropbox.com/s/24ktov8o7wcn3l2/nes1996subset.csv?dl=1")
summary(nes.dat)
##      demrep      clinton.therm      dem.therm          female
##  Min.   :1.000   Min.   :  0.00   Min.   :  0.00   Min.   :0.0000
##  1st Qu.:3.000   1st Qu.: 30.00   1st Qu.: 40.00   1st Qu.:0.0000
##  Median :4.000   Median : 60.00   Median : 60.00   Median :1.0000
##  Mean   :4.327   Mean   : 52.81   Mean   : 58.86   Mean   :0.5519
##  3rd Qu.:5.000   3rd Qu.: 70.00   3rd Qu.: 70.00   3rd Qu.:1.0000
##  Max.   :7.000   Max.   :100.00   Max.   :100.00   Max.   :1.0000
##  NA's   :385     NA's   :29       NA's   :27
##       age             educ           income         region
##  Min.   :18.00   Min.   :1.000   Min.   : 1.00   Length:1714
##  1st Qu.:34.00   1st Qu.:3.000   1st Qu.:11.00   Class :character
##  Median :44.00   Median :4.000   Median :16.00   Mode  :character
##  Mean   :47.54   Mean   :4.105   Mean   :15.03
##  3rd Qu.:61.00   3rd Qu.:6.000   3rd Qu.:20.00
##  Max.   :93.00   Max.   :7.000   Max.   :24.00
##  NA's   :2       NA's   :3       NA's   :150
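
One useful property of a 0/1 dummy is that its mean equals the share of observations coded 1. The mean of 0.5519 for female in the summary above therefore indicates that about 55% of respondents are coded as female. A short sketch with a made-up vector (`female.toy` and its values are hypothetical, not the NES data):

```r
# Hypothetical 0/1 indicator, not taken from nes.dat
female.toy <- c(1, 0, 1, 1, 0, 1, 0, 1)

# The mean of a dummy is the proportion of 1s
mean(female.toy)

# The same share obtained from a frequency table
prop.table(table(female.toy))["1"]
```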

Say you are interested in explaining why some respondents exhibit a more positive attitude toward Hillary Clinton than others. You could use bivariate regression to test the (somewhat obvious) argument that more Republican respondents rate Clinton less favorably than more Democratic respondents. First, you may want to center the party ID variable at its median (4, the midpoint of the scale) for ease of interpretation:

nes.dat$demrep.ctr <- nes.dat$demrep - median(nes.dat$demrep, na.rm = TRUE)
m1 <- lm(clinton.therm ~ demrep.ctr, data = nes.dat)
plot(x = jitter(nes.dat$demrep.ctr), y = nes.dat$clinton.therm,
     xlab = "Party ID (Democratic -> Republican)",
     ylab = "Clinton thermometer")
abline(m1, col = "red")
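
To see what centering does to the estimates, the following sketch fits the same kind of bivariate model to simulated data before and after centering. The data are generated purely for illustration: the names `pid` and `therm` and all numbers are assumptions, not NES values. The slope should be unchanged; only the intercept shifts, so that it becomes the predicted value at the centering point.

```r
set.seed(123)  # make the simulated data reproducible

# Simulated stand-ins: a 1-7 party ID scale and a thermometer-style outcome
pid   <- sample(1:7, size = 200, replace = TRUE)
therm <- 90 - 10 * pid + rnorm(200, sd = 15)

# Center the predictor at its median, as in the tutorial code above
pid.ctr <- pid - median(pid)

fit.raw <- lm(therm ~ pid)      # uncentered predictor
fit.ctr <- lm(therm ~ pid.ctr)  # centered predictor

# Compare coefficients: same slope, different intercept
coef(fit.raw)
coef(fit.ctr)

# The centered intercept equals the raw model's prediction at the median
predict(fit.raw, newdata = data.frame(pid = median(pid)))
```

With the centered predictor, the intercept can be read directly as the expected thermometer rating for a respondent at the median of the party ID scale, rather than at a value of 0 that does not exist on a 1 to 7 scale.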