What you should be able to do on your own after this exercise:

  • Work through the most common issues in data management


To begin this exercise, please quit and re-open RStudio on your computer.

Acknowledgement: Some of this exercise contains materials from the book R for Data Science, the core text for this course. For citations of the R packages used here, run citation("packagename").

1. Look for help online

You’re invited to ask questions while you complete this exercise during our meeting, but I also suggest consulting the main resources I recommend in our course. These are:

2. Create a directory for this exercise

Using your computer’s tools or the “Files” tab at the bottom right in RStudio, create a folder “Exercise 3” within your course folder. You’ll need to download some data for this exercise. Place it in this folder so you can access it easily.

3. Create an R script

Create an empty R script. Save it as “IntroR_Day3_Exercise.R” in your “Exercise 3” working directory. Within the script, type or copy & paste the following code in the first line to set the working directory to the same folder in which the script is located:
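The snippet itself is not shown in this handout. One common way to do this, sketched here under the assumption that the script has been saved and is run from within RStudio with the rstudioapi package installed, is the following. The object names script_path and script_dir are only illustrative.

```r
# Assumption: this runs inside RStudio and the script has already been saved.
# rstudioapi is not named in the exercise; it is one common option.
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  script_path <- rstudioapi::getActiveDocumentContext()$path
  if (nzchar(script_path)) {            # empty string means the script is unsaved
    script_dir <- dirname(script_path)
    setwd(script_dir)
  }
}

getwd()  # confirm where R is now looking for files
```

If getwd() does not print your “Exercise 3” folder, save the script first and run the lines again.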


4. Load packages

To start your script, load the following packages. Install them if necessary.
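The package list is not shown in this handout. The sketch below assumes a set inferred from the functions used later in the exercise (pivot_longer(), as_label(), stringr and lubridate functions, and reading .sav files); adjust it to the list given in class.

```r
# Assumed package list, inferred from the functions used later in this exercise.
# Run once if a package is missing:
# install.packages(c("tidyverse", "haven", "sjlabelled"))

library(tidyverse)   # dplyr, tidyr, ggplot2, readr, stringr, ...
library(lubridate)   # date parsing
library(haven)       # read SPSS (.sav) and Stata (.dta) files
library(sjlabelled)  # as_label() for labelled variables
```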

5. Import a dataset

  1. Following the same steps we did with the Afrobarometer data, import round 10 of the European Social Survey into R and take care of factors and other relevant variable types.
  2. Inspect the dataset: how many respondents? How many variables?

  3. Trim the dataset to contain only a few variables: cntry, ppltrst, psppipla, vote, lrscale, eduyrs, agea, gndr

  4. Do all variables seem to have reasonable values? Inspect the dataset with summary().

  5. Similar to our work in class, consider recoding some variables:

  • Is ppltrst coded in a way that allows you to use it for analysis? You can try table(as_label(ess10_small$ppltrst)).
  • Is psppipla coded in a way that allows you to use it for analysis?
  • What about vote?
  • Why does eduyrs have a maximum of 114? What might you need to do with this variable?
  • Is agea OK?
  • Do you want to recode gndr?
  6. Show a visualization of one variable that seems interesting to you.
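The inspect, trim, and recode steps above can be sketched as follows. This is only an illustration on a made-up toy table so that it runs without the ESS file; the file name in the commented import line, the cut-off of 30 for eduyrs, and the new variable names voted and female are all assumptions. Check every recode against the ESS codebook before using it on the real data.

```r
library(dplyr)

# Hypothetical import step; the file name depends on what you downloaded:
# ess10 <- haven::read_sav("ESS10.sav")

# Toy stand-in with made-up values so the sketch runs without the file.
ess10 <- tibble(
  cntry    = c("BG", "BG", "FR", "FR"),
  ppltrst  = c(3, 7, 5, 10),
  psppipla = c(1, 2, 4, 3),
  vote     = c(1, 2, 1, 3),
  lrscale  = c(5, 2, 8, 5),
  eduyrs   = c(12, 114, 16, 9),
  agea     = c(34, 51, 27, 68),
  gndr     = c(1, 2, 2, 1),
  extra    = c("a", "b", "c", "d")  # stands in for the many other ESS columns
)

dim(ess10)  # rows = respondents, columns = variables

# Keep only the variables needed for the exercise.
ess10_small <- ess10 %>%
  select(cntry, ppltrst, psppipla, vote, lrscale, eduyrs, agea, gndr)

summary(ess10_small)  # look for implausible minima and maxima

# Assumed cleaning rules; confirm each one against the ESS codebook.
ess10_small <- ess10_small %>%
  mutate(
    eduyrs = if_else(eduyrs > 30, NA_real_, eduyrs),  # 30 is a judgment call
    voted  = case_when(vote == 1 ~ 1, vote == 2 ~ 0, TRUE ~ NA_real_),
    female = if_else(gndr == 2, 1, 0)
  )
```

Creating new variables (voted, female) rather than overwriting vote and gndr keeps the original coding available for checking.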

6. Reshape a dataset (plus merging!)

Download the dataset kidshtwt.sav from https://stats.oarc.ucla.edu/spss/code/reshaping-data-wide-to-long/ into your working directory and reshape the dataset into long format using pivot_longer() or another function of your choice. Beware: you have two variables (height and weight). I recommend reshaping two datasets, one for height and one for weight, and then merging the two using family ID and birth order.
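A minimal sketch of the two-step approach, using a toy table instead of the downloaded file. The column names (famid, birth, ht1, ht2, wt1, wt2) and all values are assumptions made for illustration; check names() on your imported data first.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for kidshtwt.sav; names and values are assumed, not the real data.
kids_wide <- tibble(
  famid = c(1, 1, 2),
  birth = c(1, 2, 1),
  ht1 = c(2.8, 2.9, 2.2), ht2 = c(3.4, 3.8, 2.9),
  wt1 = c(19, 21, 20),    wt2 = c(28, 28, 33)
)

# Long format for height: one row per child and age.
height_long <- kids_wide %>%
  select(famid, birth, starts_with("ht")) %>%
  pivot_longer(starts_with("ht"), names_to = "age",
               names_prefix = "ht", values_to = "height")

# Same for weight.
weight_long <- kids_wide %>%
  select(famid, birth, starts_with("wt")) %>%
  pivot_longer(starts_with("wt"), names_to = "age",
               names_prefix = "wt", values_to = "weight")

# Merge the two long datasets.
kids_long <- left_join(height_long, weight_long,
                       by = c("famid", "birth", "age"))
```

Note that the join key here includes age in addition to family ID and birth order. Without it, every height row of a child would match every weight row of the same child and the result would contain duplicated combinations.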

7. Merge two datasets

Merge the ESS data you processed above with the appropriate QoG data covering democracy, women’s political empowerment, and GDP per capita. Download the basic QoG data: https://www.gu.se/en/quality-government/qog-data. Hint: Round 10 of the ESS was collected in 2020, so it would make sense to use macro-level variables from 2019.

How can you check that the merge/join worked as intended?

Save the dataset to your working directory as a .csv file and name it ess10_qog.csv.
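One way to approach the join and the checks, sketched on toy data. The QoG column names used here (democracy, gdp_pc) and all values are placeholders, not the real QoG variable names; look those up in the QoG codebook. The sketch also assumes both tables already share a country key called cntry.

```r
library(dplyr)
library(readr)

# Toy stand-ins with made-up values.
ess10_small <- tibble(cntry = c("BG", "BG", "FR", "XK"),
                      ppltrst = c(3, 7, 5, 6))
qog_2019 <- tibble(cntry = c("BG", "FR", "DE"),
                   democracy = c(0.6, 0.8, 0.9),   # placeholder variable
                   gdp_pc = c(10, 40, 45))         # placeholder variable

# Check 1: the country-level table has one row per country.
stopifnot(anyDuplicated(qog_2019$cntry) == 0)

ess10_qog <- left_join(ess10_small, qog_2019, by = "cntry")

# Check 2: a left join on a unique key must not change the number of rows.
stopifnot(nrow(ess10_qog) == nrow(ess10_small))

# Check 3: which survey countries found no match in the macro data?
unmatched <- anti_join(ess10_small, qog_2019, by = "cntry") %>%
  distinct(cntry)
unmatched

write_csv(ess10_qog, "ess10_qog.csv")
```

In the real data the two sources may identify countries differently (for example two-letter versus three-letter codes), in which case the key has to be harmonized before joining.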

8. Work with strings

  1. In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to?

  2. Given the corpus of common words in stringr::words, create regular expressions that find all words that (a) start with “y” and (b) end with “x”.

  3. Create a regular expression to find all words that end with ed, but not with eed.
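The three string tasks above can be sketched as follows. The object names (starts_y, ends_x, ed_not_eed) are only illustrative, and other regular expressions would work as well.

```r
library(stringr)

# paste() inserts a space by default; paste0() inserts nothing.
paste("a", "b")    # "a b"
paste0("a", "b")   # "ab"
str_c("a", "b")    # "ab", the stringr counterpart of paste0()

# Words that start with "y" and words that end with "x".
starts_y <- str_subset(stringr::words, "^y")
ends_x   <- str_subset(stringr::words, "x$")

# Words ending in "ed" where the character before "ed" is not an "e"
# (or where "ed" is the whole word).
ed_not_eed <- str_subset(stringr::words, "(^|[^e])ed$")
```

One difference worth knowing: str_c() returns NA when any input is NA, while paste0() turns NA into the text "NA".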

9. Dates and times

Use the appropriate lubridate function to parse each of the following dates:

  1. "January 1, 2010"
  2. "2015-Mar-07"
  3. "06-Jun-2017"
  4. c("August 19 (2015)", "July 1 (2015)")
  5. "12/30/14" # Dec 30, 2014
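As a sketch, the lubridate parser is chosen by the order in which month, day, and year appear in the string. The object names d1 to d5 are only illustrative, and the month names assume an English locale.

```r
library(lubridate)

d1 <- mdy("January 1, 2010")                        # month, day, year
d2 <- ymd("2015-Mar-07")                            # year, month, day
d3 <- dmy("06-Jun-2017")                            # day, month, year
d4 <- mdy(c("August 19 (2015)", "July 1 (2015)"))   # works on vectors too
d5 <- mdy("12/30/14")                               # two-digit year

class(d1)  # each result is a Date
```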

10. Collapse a dataset

Returning to the ESS data, can you calculate the average political efficacy (psppipla) by country and gender in the survey? In which country is the gender gap the biggest?
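A minimal sketch of collapsing by group, on a toy table with made-up values. It assumes gndr is coded 1 for men and 2 for women; confirm this in your data. The names efficacy, gender_gap, and gap are only illustrative.

```r
library(dplyr)
library(tidyr)

# Toy stand-in with made-up values.
ess10_small <- tibble(
  cntry    = c("BG", "BG", "BG", "BG", "FR", "FR", "FR", "FR"),
  gndr     = c(1, 1, 2, 2, 1, 1, 2, 2),
  psppipla = c(2, 3, 1, 2, 3, 3, 3, 2)
)

# Collapse to one row per country and gender.
efficacy <- ess10_small %>%
  group_by(cntry, gndr) %>%
  summarise(mean_psppipla = mean(psppipla, na.rm = TRUE), .groups = "drop")

# Put men and women side by side and rank countries by the absolute gap.
gender_gap <- efficacy %>%
  mutate(gndr = if_else(gndr == 1, "men", "women")) %>%
  pivot_wider(names_from = gndr, values_from = mean_psppipla) %>%
  mutate(gap = men - women) %>%
  arrange(desc(abs(gap)))

gender_gap
```

The first row of gender_gap is the country with the largest gap in either direction.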