This tutorial shows you:

  • how to use and interpret binary (“dummy”) variables in linear regression
  • how to diagnose outliers in a multiple regression model
  • how to assess the influence of outliers on regression results

This tutorial supplements the other materials from our course. Be sure to consult those materials for full coverage of this topic.

Dummy variables in regression

Some of you use dummy variables in your own work, so we will briefly explore how these variables are used in multiple regression. Dummy variables are also explained well in chapter 7 of AR, but it doesn’t hurt to explore them earlier. A dummy variable is a binary indicator that is set to 1 for all observations matching a particular classification and to 0 for all other observations. For instance, in a survey, a dummy variable for married respondents is coded as follows:

\[ \text{married} = \begin{cases} 1, & \text{if respondent is married}\\ 0, & \text{otherwise} \end{cases} \]
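To make the coding rule concrete, here is a minimal sketch in R. The data frame `toy_married`, its columns `status` and `score`, and all values are made up for illustration and are not part of the NES data. The sketch builds the dummy with `ifelse()` and then shows that, in a regression on a single dummy, the intercept is the mean of the 0 group and the slope is the difference between the two group means.

```r
# Hypothetical toy data (not from the NES): marital status and an outcome score
toy_married <- data.frame(
  status = c("married", "single", "married", "divorced", "single", "married"),
  score  = c(60, 40, 70, 50, 30, 80)
)

# Dummy variable: 1 if the respondent is married, 0 otherwise
toy_married$married <- ifelse(toy_married$status == "married", 1, 0)

# Regress the outcome on the dummy
fit_dummy <- lm(score ~ married, data = toy_married)

# Group means, for comparison with the regression coefficients
mean_unmarried <- mean(toy_married$score[toy_married$married == 0])
mean_married   <- mean(toy_married$score[toy_married$married == 1])

coef(fit_dummy)
c(mean_unmarried, mean_married - mean_unmarried)
```

With these made-up values, the unmarried mean is 40 and the married mean is 70, so the intercept is 40 and the coefficient on `married` is 30.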

Survey data example: Attitudes toward Hillary Clinton

We’ll start with an example dataset taken from the accompanying materials to Kellstedt and Whitten’s Fundamentals of Political Science Research. It is a modified extract from the 1996 (American) National Election Studies and has 1714 observations and 8 variables:

Variable        Description
demrep          Party identification (1 = strong Democrat, 7 = strong Republican)
clinton.therm   Feeling thermometer toward Hillary Clinton
dem.therm       Feeling thermometer toward the Democrats
female          Female (1 = yes, 0 = no)
age             Age in years
educ            Education (1 = 8 grades or less, 7 = advanced degree)
income          Income (1 = less than $2,999, 24 = $105,000 or more)
region          Northeast, North Central, South, or West

library(rio)  # import() is assumed to come from the rio package
nes.dat <- import("https://www.dropbox.com/s/24ktov8o7wcn3l2/nes1996subset.csv?dl=1")
summary(nes.dat)
##      demrep      clinton.therm      dem.therm          female      
##  Min.   :1.000   Min.   :  0.00   Min.   :  0.00   Min.   :0.0000  
##  1st Qu.:3.000   1st Qu.: 30.00   1st Qu.: 40.00   1st Qu.:0.0000  
##  Median :4.000   Median : 60.00   Median : 60.00   Median :1.0000  
##  Mean   :4.327   Mean   : 52.81   Mean   : 58.86   Mean   :0.5519  
##  3rd Qu.:5.000   3rd Qu.: 70.00   3rd Qu.: 70.00   3rd Qu.:1.0000  
##  Max.   :7.000   Max.   :100.00   Max.   :100.00   Max.   :1.0000  
##  NA's   :385     NA's   :29       NA's   :27                       
##       age             educ           income         region         
##  Min.   :18.00   Min.   :1.000   Min.   : 1.00   Length:1714       
##  1st Qu.:34.00   1st Qu.:3.000   1st Qu.:11.00   Class :character  
##  Median :44.00   Median :4.000   Median :16.00   Mode  :character  
##  Mean   :47.54   Mean   :4.105   Mean   :15.03                     
##  3rd Qu.:61.00   3rd Qu.:6.000   3rd Qu.:20.00                     
##  Max.   :93.00   Max.   :7.000   Max.   :24.00                     
##  NA's   :2       NA's   :3       NA's   :150
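
Note that region is stored as a character variable with four categories. A variable like this enters a regression as a set of dummies, one for each category except a reference category. The sketch below illustrates the idea on a made-up vector (`region_toy` is hypothetical; only the four labels come from the variable table). `lm()` performs the same conversion automatically when it is given a character or factor predictor.

```r
# Hypothetical toy vector using the four region labels from the variable table
region_toy <- c("Northeast", "South", "West", "North Central", "South", "West")

# Convert to a factor; the first level serves as the reference category
region_f <- factor(region_toy,
                   levels = c("Northeast", "North Central", "South", "West"))

# model.matrix() builds the design matrix that lm() would use:
# an intercept column plus one dummy per non-reference level
X <- model.matrix(~ region_f)
X
```

Each row has at most one dummy equal to 1; rows for the reference category (Northeast here) have all three dummies equal to 0.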

Say you are interested in explaining why some respondents hold a more positive attitude toward Hillary Clinton than others. You could use bivariate regression to test the (somewhat obvious) argument that more Republican respondents rate Clinton less warmly than more Democratic respondents. First, you may want to center the party ID variable for ease of interpretation. The code below subtracts the median (4, the midpoint of the 7-point scale), so that 0 marks the middle of the scale:

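# Center party ID at its median so that 0 marks the middle of the scale
nes.dat$demrep.ctr <- nes.dat$demrep - median(nes.dat$demrep, na.rm = TRUE)
# Bivariate regression of the Clinton thermometer on centered party ID
m1 <- lm(clinton.therm ~ demrep.ctr, data = nes.dat)
# Scatterplot; jitter() spreads out the overlapping integer-valued party ID scores
plot(x = jitter(nes.dat$demrep.ctr), y = nes.dat$clinton.therm, 
     xlab = "Party ID (Democratic -> Republican)",
     ylab = "Clinton thermometer")
# Add the fitted regression line from m1
abline(m1, col = "red")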
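
What centering buys you can be checked on a small example. The sketch below uses made-up data (`toy_scale`, `x`, and `y` are hypothetical and unrelated to the NES) and mirrors the centering step above. It suggests that centering leaves the slope unchanged and only shifts the intercept, which becomes the predicted outcome at the centering value.

```r
# Hypothetical toy data (not the NES): a 1-7 scale and an outcome
toy_scale <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 4, 2, 6),
  y = c(85, 80, 60, 55, 40, 30, 10, 50, 70, 25)
)

# Center x at its median, as done for demrep.ctr
toy_scale$x.ctr <- toy_scale$x - median(toy_scale$x)

fit_raw <- lm(y ~ x, data = toy_scale)      # uncentered predictor
fit_ctr <- lm(y ~ x.ctr, data = toy_scale)  # centered predictor

# Compare coefficients: same slope, different intercept
coef(fit_raw)
coef(fit_ctr)

# Prediction from the uncentered model at the median of x;
# this matches the intercept of the centered model
pred_at_median <- predict(fit_raw,
                          newdata = data.frame(x = median(toy_scale$x)))
pred_at_median
```

In the uncentered model the intercept refers to x = 0, a value outside the 1 to 7 scale; after centering it refers to a respondent at the middle of the scale, which is easier to interpret.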