This tutorial shows you:

• how to handle grouped data in R
• how to include fixed effects for groups in regression models

Note on copying & pasting code from the PDF version of this tutorial: You may run into trouble if you copy & paste code from the PDF version of this tutorial into your R script. When the PDF is created, some characters (for instance, quotation marks or indentations) are converted into non-text characters that R won't recognize. To use code from this tutorial, either type it yourself into your R script or copy & paste it from the source file for this tutorial, which is posted on my website.

Note on R functions discussed in this tutorial: I don't discuss most functions in detail here, so I encourage you to look up their help files or search the web for them before you use them. This will help you understand the functions better. Each of these functions is well-documented either in its help file (which you can access in R by typing ?ifelse, for instance) or on the web. The Companion to Applied Regression (see our syllabus) also provides many detailed explanations.

As always, please note that this tutorial only accompanies the other materials for Day 14 (in this case, the course video linked on the course website) and that you need to have worked through those materials before tackling this tutorial. More than on previous days of our seminar, these notes only scratch the surface of the issues arising with grouped data. I strongly encourage you to self-study the theory behind fixed effects before using them in your own work. Two recent articles on the topic are worth a look; they include references to other canonical articles on fixed effects, so take a look, even though these articles go beyond the treatment of FEs in this tutorial:

Both of these articles point you to textbook treatments of fixed effects in grouped data; you must consult at least one of these textbooks (e.g., Greene, Wooldridge) before using fixed effects in your work.

Grouped data and the OLS assumptions

In the tutorial for Day 7, you already encountered the two major types of grouped data in social science:

1. individuals nested in higher-level units

• Example: survey respondents in an international survey are “nested” in countries (500 respondents in the U.S., 500 in Canada, etc.)
2. observations from the same unit observed over at least 2 time periods (“time-series cross-sectional data”)

• Example: yearly country-level economic growth measures observed for 20 countries over 30 years
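To make the first type concrete, here is a minimal sketch of what nested data can look like in R. The data frame, variable names, and values are invented for illustration only:

```r
# Toy example of individuals nested in higher-level units:
# 3 survey respondents each from 2 countries (invented values)
survey <- data.frame(
  country    = rep(c("USA", "Canada"), each = 3),
  respondent = 1:6,
  opinion    = c(4, 5, 3, 2, 2, 1)
)

# Inspect the grouping structure: number of observations per country
table(survey$country)
```

The same long format (one row per observation, with a column identifying the group) also works for time-series cross-sectional data; there, the grouping column would identify the country and an additional column would identify the year.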

It should be clear from our previous discussions that both of these types of grouped data likely violate at least one OLS assumption: the assumption of independence of units. Under this assumption, the residual of observation $$k$$ should not predict the residual of observation $$n$$, or, in more descriptive terms, observations $$k$$ and $$n$$ should have nothing in common that the regression model is not already accounting for. Grouped data likely violate this assumption because observations that are part of the same group likely share some characteristics that are difficult to model.

Similarly, grouped data are also likely to produce heteroskedastic errors if the spread of the residuals differs across groups. Think of the following two examples:

1. In studies of opinions toward policies, individuals in one country may all frame their opinion in terms of some shared cultural experience germane to that country. Respondents from the same country are therefore unlikely to be independent from each other.

2. Economic growth over time in the United States may be more volatile than in Sweden due to difficult-to-measure factors. Economic growth measures from the United States are not independent from each other.
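The second example can be illustrated with a small simulation. This is only a sketch with invented numbers: two groups share the same intercept and slope, but the residual spread is three times larger in one group, which is exactly the group-wise heteroskedasticity described above:

```r
set.seed(42)

# Simulate two groups whose residuals have different spread (invented values)
group <- rep(c("US", "Sweden"), each = 100)
x <- rnorm(200)
# Residual standard deviation: 3 for the US, 1 for Sweden
e <- rnorm(200, sd = ifelse(group == "US", 3, 1))
y <- 1 + 0.5 * x + e

# A pooled regression that ignores the groups
m <- lm(y ~ x)

# The residual spread differs clearly by group
tapply(resid(m), group, sd)
```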

A simple solution: group-specific intercepts

From Day 9 of our seminar, you recall that dummy variables can be used to assign separate intercepts to different groups. This means that each group receives a separate intercept for its (virtual) regression line, while the relationships between the other predictors and the outcome are the same for all groups (i.e., parallel regression lines). In a simple regression equation, we can include group dummies $$\alpha_j$$ as follows, where groups are indexed by $$j$$ and individual observations are indexed by $$i$$:

$y_{ij} = \alpha_j + \beta x_{ij} + \varepsilon_{ij}$
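In R, group-specific intercepts can be estimated by including the grouping variable as a factor in lm(); R then creates the dummy variables automatically. Below is a minimal sketch with simulated data (all variable names and values are invented for illustration), comparing a pooled model that ignores the groups with a model that includes group dummies:

```r
set.seed(123)

# Simulate 3 groups with different intercepts but a common slope of 0.5
n_groups <- 3
n_per_group <- 50
group <- factor(rep(1:n_groups, each = n_per_group))
alpha <- c(0, 2, 4)[group]          # group-specific intercepts
x <- rnorm(n_groups * n_per_group)
y <- alpha + 0.5 * x + rnorm(n_groups * n_per_group, sd = 0.5)

# Pooled model ignoring the groups vs. model with group dummies
m_pooled <- lm(y ~ x)
m_fe     <- lm(y ~ x + group)       # R creates the dummies automatically

summary(m_fe)
```

As usual with factors in lm(), one group serves as the reference category: its intercept is the model's intercept, and the coefficients on the remaining group dummies are differences from that reference.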

You can see the group dummies visualized in the figure on the right. You can also see that not accounting for the groups will (a) leave much noise around the estimate of $$\beta$$ (the relationship between $$x$$ and $$y$$) and (b) bias the estimate of $$\beta$$, although in this example only slightly; you can see this by comparing the slopes of the black line and the red lines in the right figure. The black line is the slope from a regression that does not account for the groups.