Getting started

The purpose of this tutorial is to demonstrate how to merge and process data into one dataframe in R. This includes data from different sources and at different levels of analysis. To get to this point, the tutorial will show:

  • how to use spreadsheets to prepare data for analysis
  • how to clean raw data from a public source
  • how to process data for analysis
  • how to merge separate data sources into one
  • how to collapse datasets
  • how to create new variables, including time-series operators

Additional resources

I recommend two additional resources to learn more about how to manage and manipulate data in R:

  • The Manipulating Data section of the companion website to Chang (2012), R Cookbook. This site provides annotated examples for almost any data management scenario you’ll likely encounter in your research.
  • The Getting your data into shape section of the companion website to Chang (2021), R Graphics Cookbook. This book is an excellent resource for almost any graphing problem and solution in R.

Working directory

As always, it is useful to set a project-specific working directory (WD), especially if you work with many files. You can set the WD to the location of your R script using our previous approach:

setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

Alternatively, use the here package, which builds file paths relative to your project’s root directory so that you do not have to set the WD by hand. Or, use an RStudio project to organize folders and files automatically.

Entering your own data into a spreadsheet

Since many of you work with data you collect yourself, you will often enter data directly into spreadsheets. I don’t have any specific recommendations for this, but I strongly encourage you to use the following workflow:

  • enter raw data into spreadsheets, then save them as .csv (not .xls or .xlsx) files
  • use these (modified) principles of tidy data when constructing your dataset:
    1. Each variable forms a column.
    2. Each observation forms a row.
    3. Use only variable names (not values) as column headers. Avoid spaces in variable/column names.
    4. Use common identifiers for observations and groups.
  • conduct any and all processing (as you will learn below) in R with a documented script
  • This makes your data cleaning and processing reproducible for yourself and for others, and helps you recall and justify the choices you made at this step
  • It also ensures that you can reconstruct what you did later on
  • Beware of Microsoft Excel changing the content of cells/columns based on formatting. Always verify that your data have correct values once you have re-opened them in Excel or another application. This is especially important for dates and times.
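
As a minimal sketch of this workflow in base R, the following writes a small tidy dataset to a .csv file, reads it back, and checks that the identifier and date columns survived the round trip. The data frame, column names, and file name are made up for illustration and are not part of the course data.

```r
# Hypothetical tidy dataset: one row per observation, one column per variable,
# no spaces in column names, and a common identifier (respondent_id).
raw <- data.frame(
  respondent_id  = c("R001", "R002", "R003"),
  interview_date = as.Date(c("2008-06-23", "2008-06-24", "2008-06-24")),
  age            = c(38, 46, 28),
  stringsAsFactors = FALSE
)

# Save as .csv; a temporary folder is used here so the sketch runs anywhere.
csv_path <- file.path(tempdir(), "raw_data.csv")
write.csv(raw, csv_path, row.names = FALSE)

# Read the file back. Dates come back as text, so convert them explicitly.
check <- read.csv(csv_path, stringsAsFactors = FALSE)
check$interview_date <- as.Date(check$interview_date, format = "%Y-%m-%d")

# Verify that identifiers and dates are unchanged after the round trip.
stopifnot(
  all(check$respondent_id == raw$respondent_id),
  all(check$interview_date == raw$interview_date),
  !anyNA(check$interview_date)
)
```

A check like the final stopifnot() call is worth repeating whenever a file has passed through Excel or another application, since that is where dates and times are most likely to be altered.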

Remember, one of the goals of our course is to help you build a better relationship with your future self. Following this workflow will help with that.

Keep data management and analysis separate

I recommend keeping separate script files for data management and data analysis. This will also help you maintain a reproducible workflow and keep your code manageable. For instance, in my projects, I typically have at least two R scripts in my project folder:

  • project_datamgmt.R, which starts by importing the original source data and cleans and prepares it for analysis
  • project_analysis.R, which conducts all analysis and creates tables and graphs
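
The division of labor between the two scripts might look roughly like the sketch below. It is shown as one block so that it runs on its own; in practice each part would live in its own file. The toy data, the cleaning step, and the file name project_processed.rds are assumptions for illustration only.

```r
# --- project_datamgmt.R (sketch) ---
# Import the original source data, clean it, and save the processed version.
raw <- data.frame(id = 1:4, score = c(10, NA, 30, 40))  # stand-in for an imported file
clean <- raw[!is.na(raw$score), ]                        # example cleaning step: drop missing scores
processed_path <- file.path(tempdir(), "project_processed.rds")
saveRDS(clean, processed_path)

# --- project_analysis.R (sketch) ---
# Start from the processed data only; the raw data are never modified here.
dat <- readRDS(processed_path)
mean_score <- mean(dat$score)
```

Because the analysis script only reads the processed file, you can rerun the data management script at any time and be sure that every table and graph is built from the same cleaned data.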

A very good model to follow is the Project TIER protocol, as illustrated in Day 5 of our course. You can take a look at the demo project. An R version, created by me, is available on my GitHub page.

Importing and exporting data

You will encounter data in many different formats when doing research. For this purpose, the R package “rio” (with which you’re already familiar) is particularly useful. Its developer describes it as “a set of tools aims to simplify the process of importing/exporting data.” The package has two main functions, import() and export(). It allows you to import and export data from/to the following popular formats (and others that I don’t list here):

  • Tab-separated data (.tsv)
  • Comma-separated data (.csv)
  • Saved R objects (.RData)
  • Stata (.dta)
  • SPSS and SPSS portable (.sav, .por)
  • Excel (.xls)
  • Excel (.xlsx)
  • SAS and SAS XPORT
  • Minitab (.mtp)
  • OpenDocument Spreadsheet (.ods)
  • Google Sheets
  • Clipboard (default is tsv)
  • … and others!

For more information, see the README page for the “rio” package on GitHub: https://github.com/leeper/rio.

Important note on package versions: Because data formats change frequently (e.g., with new versions of commercial software), dealing with data import and export requires special attention. Be sure to always use the most recent version of the “rio” package and all dependencies.
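
As a small, hedged illustration of the two main functions and of the version check, the sketch below prints the installed version of “rio”, exports a toy data frame to a .csv file, and imports it again. The toy data and file name are made up; the guard with requireNamespace() is only there so that the block does nothing harmful if “rio” is not installed yet.

```r
if (requireNamespace("rio", quietly = TRUE)) {
  # Which version of rio is installed? Compare this with the latest release.
  print(packageVersion("rio"))

  # Hypothetical toy data, written to a temporary folder.
  toy <- data.frame(id = 1:3, country = c("Benin", "Botswana", "Ghana"))
  csv_file <- file.path(tempdir(), "toy.csv")

  # export() picks the output format from the file extension ...
  rio::export(toy, csv_file)

  # ... and import() does the same when reading the file back.
  toy_back <- rio::import(csv_file)
  print(dim(toy_back))
} else {
  message("Package 'rio' is not installed; run install.packages(\"rio\") first.")
}
```

To bring “rio” and its dependencies up to date, you can rerun install.packages("rio") or call update.packages().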

Example: importing an SPSS dataset

In this example, we import a dataset from the Afrobarometer project into R. The Afrobarometer is an African-led series of national public attitude surveys on democracy and governance in Africa, and you can find more information on it at http://www.afrobarometer.org/. The survey data are provided to scholars in SPSS format. SPSS is a statistical software package akin to Stata or R. At http://www.afrobarometer.org/data/merged-data, you can find a download link for the fourth round of the Afrobarometer. The file is called “merged_r4_data.sav”. Let’s use this link to read this dataset into R using the import() function from the “rio” package.

First, install the “rio” package if you haven’t done so before. You only have to do this once.

install.packages("rio")

Next, load the package.

library("rio")

Now you can import the Afrobarometer dataset into your R environment. I’ll call it ab. This will take a few seconds since the file is over 15 MB in size. Note: as always, R will search for the file in your working directory.

ab <- import(file = "merged_r4_data.sav")

Alternatively, you can also import the file directly from its source by passing the URL from the Afrobarometer website to import(). Keep in mind that code relying on the URL will stop working if the file is ever moved or removed from that site.
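
One way to soften this risk is to download the file once and keep a local copy. The helper below is a sketch in base R; fetch_once() is a name I made up, and the URL in the usage comment is a placeholder rather than the actual Afrobarometer download link.

```r
# Download a file only if no local copy exists yet, then return the local path.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps binary files such as .sav intact on all platforms
    download.file(url, destfile, mode = "wb", quiet = TRUE)
  }
  destfile
}

# Usage (not run): replace the placeholder URL with the real download link.
# local_file <- fetch_once("https://example.org/path/to/merged_r4_data.sav",
#                          "merged_r4_data.sav")
# ab <- rio::import(local_file)
```

With this approach, the script still documents where the data came from, but later runs read the saved copy and no longer depend on the website.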

Before actually looking at the dataset itself, you should look at its dimensions.

dim(ab)
## [1] 27713   294

You can also use the glimpse() function from the dplyr package, which is part of the “tidyverse” that we’ll use more below.

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
glimpse(ab)
## Rows: 27,713
## Columns: 294
## $ COUNTRY  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ RESPNO   <chr> "BEN0001", "BEN0002", "BEN0003", "BEN0004", "BEN0005", "BEN00…
## $ URBRUR   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2…
## $ BACKCHK  <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2…
## $ REGION   <dbl> 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 1…
## $ DISTRICT <chr> "COTONOU", "COTONOU", "COTONOU", "COTONOU", "COTONOU", "COTON…
## $ EA_SVC_A <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EA_SVC_B <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EA_SVC_C <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EA_SVC_D <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EA_FAC_A <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EA_FAC_B <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EA_FAC_C <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EA_FAC_D <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EA_FAC_E <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EA_SEC_A <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EA_SEC_B <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ EA_ROAD  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0…
## $ NOCALL_1 <dbl> 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 9…
## $ NOCALL_2 <dbl> 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 9…
## $ NOCALL_3 <dbl> 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 9…
## $ NOCALL_4 <dbl> 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 9…
## $ NOCALL_5 <dbl> 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 9…
## $ NOCALL_6 <dbl> 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 9…
## $ NOCALL_7 <dbl> 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 997, 9…
## $ PREVINT  <dbl> 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1…
## $ THISINT  <dbl> 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2…
## $ ADULT_CT <dbl> 1, 1, 1, 1, 1, 4, 2, 3, 2, 1, 3, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1…
## $ CALLS    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DATEINTR <date> 2008-06-23, 2008-06-23, 2008-06-24, 2008-06-24, 2008-06-23, …
## $ STRTIME  <time> 18:30:00, 19:40:00, 18:30:00, 17:20:00, 17:33:00, 18:33:00, …
## $ Q1       <dbl> 38, 46, 28, 30, 23, 24, 40, 50, 24, 36, 22, 31, 50, 19, 41, 2…
## $ Q2       <dbl> 0, 9, 0, 1, 0, 0, 0, 1, 0, -1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, …
## $ Q3       <dbl> 100, 104, 101, 100, 100, 100, 109, 100, 101, 100, 100, 100, 1…
## $ Q3OTHER  <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ Q4A      <dbl> 2, 2, 1, 4, 3, 1, 2, 1, 3, 2, 3, 3, 5, 2, 1, 2, 2, 2, 2, 3, 3…
## $ Q4B      <dbl> 2, 3, 3, 3, 2, 2, 2, 1, 3, 2, 3, 3, 3, 4, 1, 2, 2, 2, 2, 3, 3…
## $ Q5       <dbl> 2, 2, 3, 4, 4, 2, 2, 2, 2, 2, 3, 3, -1, 3, 4, 4, 2, 2, 2, 3, …
## $ Q6A      <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 4, 3, 4, 3, 3, 4, 1, 4, 2, 2, 3, 2, 2…
## $ Q6B      <dbl> 2, 3, 3, 3, 4, 3, 2, 2, 3, 3, 4, 3, 3, 3, 1, 4, 2, 2, 3, 3, 2…
## $ Q7A      <dbl> 9, 9, 4, 4, 4, 3, 2, 2, 4, 1, 3, 2, 4, 4, 4, 4, 4, 4, 9, 4, 4…
## $ Q7B      <dbl> 9, 9, 4, 4, 5, 4, 2, 2, 4, 9, 3, 2, 4, 4, 4, 4, 4, 4, 9, 4, 4…
## $ Q8A      <dbl> 2, 1, 0, 0, 0, 1, 4, 0, 2, 0, 0, 0, 1, 0, 2, 0, 1, 0, 4, 3, 0…
## $ Q8B      <dbl> 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 2, 0…
## $ Q8C      <dbl> 1, 1, 0, 0, 0, 3, 2, 4, 1, 0, 0, 0, 1, 0, 2, 0, 0, 0, 2, 2, 0…
## $ Q8D      <dbl> 1, 1, 0, 0, 0, 1, 2, 4, 1, 0, 0, 0, 1, 0, 2, 0, 1, 0, 0, 1, 0…
## $ Q8E      <dbl> 4, 4, 0, 3, 2, 3, 3, 4, 2, 4, 0, 0, 1, 0, 4, 0, 3, 0, 4, 3, 0…
## $ Q9A      <dbl> 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 1, 0, 0, 2…
## $ Q9B      <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 1, 1, 0, 0, 0, 2…
## $ Q9C      <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Q10      <dbl> 4, 4, 3, 4, 4, 3, 3, 4, 1, 1, 4, 3, 4, 3, 2, 3, 3, 3, 4, 3, 3…
## $ Q11      <dbl> 1, 1, 3, 3, 3, 4, 2, 1, 1, 1, 3, 2, 2, 2, 4, 4, 3, 2, 1, 4, 2…
## $ Q12A     <dbl> 4, 3, 2, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4, 4, 4, 0, 3, 4, 3, 4, 3…
## $ Q12B     <dbl> 4, 4, 4, 3, 4, 4, 4, 3, 3, 2, 3, 2, 1, 4, 1, 0, 9, 0, 0, 4, 0…
## $ Q12C     <dbl> 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 9, 0, 0, 0, 0…
## $ Q13      <dbl> 3, 3, 2, 0, 1, 3, 2, 3, 2, 2, 2, 2, 1, 0, 1, 0, 2, 2, 3, 3, 2…
## $ Q14      <dbl> 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2…
## $ Q15A     <dbl> 4, 4, 4, 4, 4, 3, 3, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3…
## $ Q15B     <dbl> 4, 4, 4, 4, 4, 3, 1, 4, 4, 4, 2, 4, 3, 4, 1, 3, 3, 3, 4, 4, 3…
## $ Q15C     <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 3, 4, 4, 4, 3, 3, 4, 4, 3…
## $ Q16      <dbl> 4, 4, 3, 4, 2, 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 3, 3, 3, 4, 4, 2…
## $ Q17      <dbl> 1, 1, 3, 1, 3, 2, 4, 3, 2, 1, 3, 3, 4, 3, 4, 4, 2, 2, 1, 1, 2…
## $ Q18      <dbl> 1, 4, 3, 4, 4, 1, 2, 4, 1, 4, 2, 3, 2, 1, 1, 1, 3, 2, 4, 4, 3…
## $ Q19      <dbl> 4, 4, 3, 3, 4, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 3, 3, 2, 4, 4, 3…
## $ Q20      <dbl> 4, 4, 3, 3, 3, 4, 4, 3, 4, 4, 3, 2, 3, 4, 4, 1, 2, 3, 4, 4, 2…
## $ Q21      <dbl> 4, 4, 3, 3, 1, 4, 3, 4, 1, 4, 3, 2, 3, 4, 4, 4, 3, 2, 4, 4, 3…
## $ Q22A     <dbl> 1, 1, 1, 2, 2, 0, 1, 2, 0, 1, 1, 1, 2, 0, 1, 1, 0, 0, 1, 2, 1…
## $ Q22B     <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 2, 0…
## $ Q23A     <dbl> 4, 4, 0, 1, 1, 2, 1, 1, 1, 4, 1, 1, 1, 1, 0, 1, 3, 3, 1, 4, 1…
## $ Q23B     <dbl> 3, 1, 0, 1, 0, 1, 1, 1, 1, 3, 1, 1, 1, 1, 0, 1, 3, 1, 1, 4, 1…
## $ Q23C     <dbl> 1, 1, 0, 1, 0, 1, 0, 1, 1, 4, 9, 1, 0, 1, 0, 0, 0, 1, 1, 4, 1…
## $ Q23D     <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 6, 4, 1, 1, 1, 1…
## $ Q24A     <dbl> 2, 3, 2, 2, 3, 3, 0, 0, 2, 2, 2, 2, 0, 3, 3, 9, 2, 2, 2, 2, 2…
## $ Q24B     <dbl> 2, 0, 2, 2, 3, 3, 2, 0, 0, 0, 2, 2, 0, 2, 0, 9, 2, 2, 0, 1, 1…
## $ Q25A     <dbl> 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2…
## $ Q25B     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2…
## $ Q25C     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2…
## $ Q26A     <dbl> 7, 2, 7, 7, 7, 7, 7, 7, 7, 2, 7, 7, 7, 7, 7, 7, 7, 2, 7, 7, 2…
## $ Q26B     <dbl> 7, 1, 7, 7, 7, 7, 7, 7, 7, 1, 7, 7, 7, 7, 7, 7, 7, 1, 7, 7, 1…
## $ Q27A     <dbl> 0, 0, 0, 0, 3, 0, 0, 0, 1, 3, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0…
## $ Q27B     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ Q27C     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0…
## $ Q28A     <dbl> 7, 7, 7, 7, 2, 7, 7, 7, 1, 2, 1, 7, 7, 7, 1, 9, 7, 1, 7, 7, 7…
## $ Q28B     <dbl> 7, 7, 7, 7, 1, 7, 7, 7, 1, 2, 2, 7, 7, 7, 2, 9, 7, 1, 7, 7, 7…
## $ Q29A     <dbl> 4, 4, 2, 1, 1, 1, 2, 1, 1, 4, 2, 3, 2, 2, 1, 2, 3, 3, 4, 2, 3…
## $ Q29B     <dbl> 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 3, 1, 1, 2, 1, 3, 3, 1, 1, 3…
## $ Q29C     <dbl> 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 3, 2, 2, 1, 1, 3, 3, 1, 1, 3…
## $ Q30      <dbl> 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2, 1, 3, 3, 3, 3, 3, 3, 2…
## $ Q31      <dbl> 1, 1, 3, 3, 2, 3, 4, 2, 1, 1, 2, 3, 1, 1, 1, 1, 3, 2, 1, 1, 3…
## $ Q32      <dbl> 1, 4, 3, 3, 3, 1, 3, 4, 4, 1, 2, 3, 3, 1, 4, 4, 2, 3, 4, 1, 3…
## $ Q33      <dbl> 1, 1, 3, 1, 1, 3, 2, 2, 1, 1, 1, 3, 3, 1, 1, 4, 3, 3, 1, 1, 3…
## $ Q34      <dbl> 4, 4, 3, 4, 4, 3, 3, 1, 1, 4, 3, 2, 3, 2, 1, 4, 2, 3, 1, 4, 3…
## $ Q35      <dbl> 1, 1, 2, 1, 1, 3, 2, 1, 1, 1, 2, 2, 1, 1, 1, 4, 2, 2, 1, 1, 3…
## $ Q36      <dbl> 1, 1, 2, 3, 1, 2, 2, 2, 1, 1, 1, 3, 3, 1, 1, 1, 2, 3, 1, 1, 3…
## $ Q37      <dbl> 4, 4, 2, 2, 4, 2, 3, 2, 4, 4, 3, 3, 3, 2, 4, 4, 3, 3, 4, 4, 3…
## $ Q38      <dbl> 1, 1, 3, 1, 2, 2, 2, 2, 1, 1, 2, 3, 1, 2, 1, 1, 3, 3, 1, 1, 2…
## $ Q39      <dbl> 3, 4, 3, 3, 3, 3, 3, 2, 4, 1, 3, 3, 3, 3, 1, 4, 2, 2, 1, 4, 2…
## $ Q40A     <dbl> 4, 4, 4, 4, 4, 1, 3, 4, 1, 4, 4, 4, 4, 3, 2, 4, 4, 4, 4, 3, 4…
## $ Q40B     <dbl> 1, 3, 2, 1, 3, 3, -1, 3, 4, 2, 2, 1, 1, 4, 4, 2, 2, 2, 1, 4, …
## $ Q41A1    <chr> "-1", "-1", "-1", "-1", "-1", "NAGO MATHURIN", "-1", "-1", "N…
## $ Q41A2    <dbl> 9, 1, 9, 9, 9, 2, 9, 9, 2, 1, 9, 9, 3, 1, 9, 9, 1, 9, 9, 3, 9…
## $ Q41B1    <chr> "-1", "LAWANI SOULE MANA", "-1", "-1", "-1", "-1", "-1", "LAW…
## $ Q41B2    <dbl> 9, 3, 9, 9, 1, 1, 9, 3, 3, 2, 3, 9, 3, 1, -1, 9, 9, 9, 9, 3, …
## $ Q42A     <dbl> 4, 4, 4, 3, 2, 3, 2, 3, 3, 3, 3, 3, 2, 4, 3, 4, 4, 4, 4, 3, 3…
## $ Q42B     <dbl> 4, 4, 4, 4, 4, 3, 4, 3, 4, 4, 4, 4, 3, 3, 3, 2, 3, 3, 4, 4, 3…
## $ Q42C     <dbl> 2, 2, 3, 3, 2, 1, 2, 3, 2, 2, 3, 3, 3, 3, 1, 2, 2, 2, 2, 2, 2…
## $ Q42D     <dbl> 1, 1, 2, 2, 3, 2, 2, 2, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1, 1, 1, 1…
## $ Q43      <dbl> 4, 4, 2, 2, 2, 3, 2, 4, 4, 3, 2, 3, 3, 1, 2, 4, 3, 3, 3, 2, 3…
## $ Q44A     <dbl> 4, 3, 2, 2, 4, 1, 4, 4, 4, 3, 2, 4, 2, 2, 1, 4, 3, 3, 4, 4, 4…
## $ Q44B     <dbl> 5, 4, 2, 4, 2, 4, 4, 4, 5, 5, 2, 4, 2, 2, 4, 5, 3, 3, 5, 4, 4…
## $ Q44C     <dbl> 5, 5, 2, 4, 4, 4, 4, 4, 5, 5, 2, 4, 2, 4, 4, 4, 3, 3, 5, 4, 4…
## $ Q45A     <dbl> 0, 0, 1, 1, 3, 1, 2, 1, 1, 1, 1, 0, 2, 2, 1, 1, 1, 1, 0, 1, 1…
## $ Q45B     <dbl> 0, 1, 1, 0, 2, 1, 1, 1, 0, 2, 1, 0, 1, 0, 1, 1, 0, 0, 9, 1, 0…
## $ Q45C     <dbl> 0, 2, 1, 0, 2, 1, 1, 1, 0, 2, 1, 0, 1, 3, 2, 1, 0, 0, 0, 2, 0…
## $ Q45D     <dbl> 0, 2, 1, 0, 2, 0, 2, 1, 0, 0, 1, 0, 1, 2, 3, 1, 0, 0, 0, 2, 0…
## $ Q45E     <dbl> 0, 3, 1, 0, 2, 3, 2, 1, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 2, 0…
## $ Q46      <dbl> 3, 3, 2, 2, 2, 1, 3, 3, 3, 2, 2, 2, 1, 3, 2, 2, 2, 2, 3, 3, 2…
## $ Q47      <dbl> 3, 1, 1, 2, 3, 1, 0, 2, 3, 1, 1, 1, 2, 3, 3, 3, 2, 2, 3, 3, 1…
## $ Q48A     <dbl> 0, 0, 1, 1, 1, 2, 0, 1, 0, 0, 1, 1, 0, 3, 0, 0, 2, 2, 0, 0, 1…
## $ Q48B     <dbl> 0, 0, 1, 1, 1, 2, 2, 1, 0, 1, 1, 1, 0, 3, 0, 3, 2, 2, 0, 0, 2…
## $ Q49A     <dbl> 3, 1, 3, 1, 3, 1, 2, 3, 2, 0, 1, 0, 3, 3, 2, 1, 2, 3, 2, 2, 3…
## $ Q49B     <dbl> 1, 1, 2, 1, 3, 2, 1, 2, 2, 2, 1, 0, 2, 3, 1, 1, 2, 2, 2, 2, 2…
## $ Q49C     <dbl> 1, 1, 2, 0, 1, 2, 0, 1, 2, 1, 1, 0, 1, 1, 0, 1, 1, 2, 9, 2, 1…
## $ Q49D     <dbl> 0, 1, 2, 0, 1, 0, 1, 1, 2, 1, 1, 0, 1, 1, 9, 1, 1, 3, 1, 1, 2…
## $ Q49E     <dbl> 1, 0, 2, 0, 1, 2, 0, 2, 2, 2, 1, 0, 2, 1, 1, 1, 1, 3, 1, 1, 3…
## $ Q49F     <dbl> 1, 0, 2, 1, 1, 1, 0, 1, 2, 2, 1, 0, 2, 0, 1, 1, 1, 2, 1, 1, 2…
## $ Q49G     <dbl> 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 2, 2, 2, 1…
## $ Q49H     <dbl> 0, 2, 2, 1, 3, 2, 1, 1, 2, 1, 1, 0, 1, 3, 1, 1, 1, 2, 2, 2, 1…
## $ Q49I     <dbl> 1, 1, 2, 1, 1, 1, 0, 1, 2, 2, 1, 0, 1, 1, 1, 9, 1, 2, 1, 1, 2…
## $ Q50A     <dbl> 9, 2, 2, 3, 3, 1, 1, 1, 3, 3, 1, 2, 1, 0, 1, 0, 2, 1, 2, 2, 1…
## $ Q50B     <dbl> 3, 1, 2, 2, 3, 1, 1, 1, 3, 2, 1, 2, 1, 2, 2, 0, 2, 1, 2, 2, 1…
## $ Q50C     <dbl> 2, 1, 2, 3, 3, 1, 2, 1, 2, 1, 1, 2, 1, 0, 2, 0, 2, 2, 1, 1, 1…
## $ Q50D     <dbl> 2, 2, 2, 2, 3, 1, 2, 1, 3, 3, 1, 2, 1, 1, 1, 0, 2, 1, 2, 2, 1…
## $ Q50E     <dbl> 1, 1, 3, 2, 3, 2, 2, 1, 2, 1, 1, 2, 1, 2, 9, 0, 2, 2, 1, 1, 1…
## $ Q50F     <dbl> 2, 2, 3, 2, 3, 2, 2, 1, 2, 2, 1, 2, 1, 0, 1, 0, 2, 2, 3, 2, 1…
## $ Q50G     <dbl> 2, 2, 3, 2, 3, 2, 2, 1, 2, 3, 1, 2, 1, 0, 1, 0, 2, 2, 3, 2, 1…
## $ Q50H     <dbl> 0, 1, 2, 9, 3, 1, 3, 1, 1, 0, 1, 2, 1, 0, 2, 0, 2, 2, 1, 1, 1…
## $ Q51A     <dbl> 7, 0, 7, 7, 7, 7, 7, 0, 7, 2, 7, 2, 0, 7, 0, 0, 7, 7, 7, 7, 7…
## $ Q51B     <dbl> 0, 0, 7, 7, 7, 7, 2, 0, 7, 0, 7, 2, 0, 7, 0, 0, 7, 7, 7, 7, 7…
## $ Q51C     <dbl> 7, 7, 7, 7, 7, 7, 7, 0, 7, 7, 7, 2, 0, 7, 0, 0, 7, 7, 7, 7, 7…
## $ Q52      <dbl> 4, 4, 2, 4, 3, 4, 4, 3, 3, 4, 2, 1, 3, 3, 3, 3, 1, 2, 4, -1, …
## $ Q53A     <dbl> 2, 2, 4, 3, 4, 3, 1, 2, 2, 0, 3, 2, 2, 3, 2, 9, 2, 3, 2, 2, 3…
## $ Q53B     <dbl> 2, 2, 4, 3, 0, 1, 1, 0, 0, 0, 3, 2, 0, 0, 2, 9, 3, 3, 2, 2, 3…
## $ Q54A     <dbl> 0, 1, 1, 1, 1, 1, 2, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 2, 0, 1, 0…
## $ Q54B     <dbl> 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 2, 0…
## $ Q54C     <dbl> 0, 0, 1, 1, 1, 0, 2, 3, 0, 3, 0, 1, 0, 1, 2, 0, 2, 1, 0, 0, 1…
## $ Q55      <dbl> 4, 4, 3, 3, 4, 4, 3, 3, 4, 4, 4, 2, 4, 1, 4, 1, 3, 3, 4, 4, 2…
## $ Q56PT1   <dbl> 4, 7, 1, 1, 8, 1, 1, 3, 17, 1, 1, 1, 1, 1, 1, 3, 3, 2, 7, 1, …
## $ Q56PT2   <dbl> 1, 1, 9, 3, 16, 14, 3, 13, 1, 3, 7, 14, 13, 4, 23, 1, 8, 8, 1…
## $ Q56PT3   <dbl> 32, 13, 16, 10, 14, 13, 7, 7, 13, 15, 14, 20, 23, 24, 20, 1, …
## $ Q57A     <dbl> 3, 3, 3, 3, 3, 3, 1, 3, 3, 1, 3, 2, 3, 3, 2, 1, 3, 3, 2, 2, 3…
## $ Q57B     <dbl> 2, 2, 3, 3, 3, 3, 3, 3, 2, 1, 2, 1, 3, 4, 2, 2, 3, 3, 2, 1, 3…
## $ Q57C     <dbl> 2, 2, 2, 1, 4, 1, 3, 3, 2, 1, 1, 1, 2, 3, 2, 3, 3, 3, 3, 2, 3…
## $ Q57D     <dbl> 2, 2, 3, 1, 3, 1, 3, 3, 2, 1, 1, 1, 3, 4, 1, 4, 3, 1, 2, 2, 3…
## $ Q57E     <dbl> 1, 1, 3, 1, 3, 1, 3, 3, 1, 1, 1, 1, 2, 3, 1, 4, 3, 1, 1, 1, 2…
## $ Q57F     <dbl> 3, 3, 2, 2, 3, 1, 3, 3, 3, 3, 1, 1, 3, 3, 3, 3, 3, 2, 3, 1, 2…
## $ Q57G     <dbl> 3, 3, 2, 3, 3, 1, 3, 4, 3, 2, 1, 3, 3, 3, 3, 4, 3, 2, 4, 2, 2…
## $ Q57H     <dbl> 3, 3, 2, 2, 3, 1, 3, 3, 3, 3, 2, 3, 3, 3, 9, 3, 3, 2, 3, 3, 3…
## $ Q57I     <dbl> 3, 3, 2, 2, 3, 3, 3, 4, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 2, 2, 3…
## $ Q57J     <dbl> 3, 3, 2, 2, 3, 1, 3, 4, 3, 2, 2, 3, 3, 4, 3, 3, 2, 3, 3, 2, 3…
## $ Q57K     <dbl> 3, 2, 2, 2, 3, 1, 3, 4, 2, 1, 2, 1, 3, 4, 9, 3, 2, 3, 2, 2, 3…
## $ Q57L     <dbl> 3, 3, 2, 2, 3, 1, 3, 4, 3, 1, 2, 2, 3, 4, 3, 3, 3, 3, 4, 3, 3…
## $ Q57M     <dbl> 2, 3, 2, 2, 3, 1, 3, 3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 2, 3…
## $ Q57N     <dbl> 2, 3, 2, 2, 4, 1, 2, 3, 2, 2, 2, 2, 2, 3, 2, 3, 2, 3, 1, 2, 3…
## $ Q57O     <dbl> 2, 2, 2, 2, 3, 3, 2, 3, 3, 2, 2, 2, 2, 3, 1, 3, 3, 3, 1, 2, 3…
## $ Q57P     <dbl> 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3…
## $ Q58A     <dbl> 2, 2, 2, 2, 4, 2, 4, 2, 2, 2, 2, 2, 2, 2, 4, 4, 1, 4, 4, 4, 1…
## $ Q58B     <dbl> 1, 1, 1, 9, 1, 2, 4, 2, 1, 1, 4, 4, 1, 2, 1, 2, 1, 4, 1, 1, 1…
## $ Q58C     <dbl> 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1…
## $ Q58D     <dbl> 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1…
## $ Q58E     <dbl> 4, 2, 2, 3, 3, 2, 2, 2, 2, 2, 4, 3, 1, 2, 3, 2, 4, 4, 2, 2, 3…
## $ Q58F     <dbl> 4, 4, 2, 1, 2, 1, 4, 2, 4, 4, 4, 3, 2, 2, 2, 2, 4, 4, 4, 4, 3…
## $ Q58G     <dbl> 1, 1, 2, 1, 1, 1, 4, 1, 1, 1, 4, 2, 1, 2, 1, 2, 4, 4, 1, 1, 1…
## $ Q58H     <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 4, 2, 1, 1, 1, 4, 4, 4, 1, 1, 1…
## $ Q59A     <dbl> 2, 3, 2, 3, 2, 1, 9, 3, 2, 2, 2, 2, 2, 4, 1, 3, 2, 2, 1, 1, 2…
## $ Q59B     <dbl> 3, 3, 2, 3, 2, 1, 9, 3, 2, 2, 2, 3, 2, 2, 3, 4, 2, 2, 1, 1, 2…
## $ Q59C     <dbl> 2, 9, 2, 2, 2, 1, 3, 3, 3, 3, 2, 2, 2, 2, 9, 4, 2, 2, 3, 2, 2…
## $ Q59D     <dbl> 3, 3, 2, 2, 2, 1, 2, 3, 3, 2, 2, 3, 2, 3, 4, 4, 2, 2, 2, 2, 2…
## $ Q59E     <dbl> 2, 2, 2, 3, 2, 2, 3, 3, 2, 4, 2, 3, 2, 4, 3, 3, 2, 3, 1, 3, 2…
## $ Q59F     <dbl> 2, 2, 2, 3, 2, 1, 3, 4, 2, 4, 2, 3, 2, 4, 3, 3, 2, 3, 2, 3, 2…
## $ Q60A     <dbl> 9, 9, 2, 2, 2, 1, 3, 3, 2, 1, 1, 2, 2, 3, 1, 9, 2, 2, 3, 1, 2…
## $ Q60B     <dbl> 9, 9, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 3, 1, 9, 2, 2, 3, 2, 3…
## $ Q60C     <dbl> 9, 9, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 9, 2, 2, 3, 2, 2…
## $ Q60D     <dbl> 9, 9, 2, 2, 2, 1, 3, 2, 2, 1, 1, 2, 2, 1, 4, 9, 2, 2, 3, 2, 2…
## $ Q60E     <dbl> 9, 9, 2, 2, 2, 2, 3, 2, 2, 2, 1, 2, 2, 3, 9, 9, 2, 2, 3, 2, 2…
## $ Q60F     <dbl> 9, 9, 2, 2, 2, 1, 3, 3, 3, 3, 1, 2, 2, 3, 3, 9, 2, 2, 3, 2, 2…
## $ Q61      <dbl> 1, 9, 3, 3, 4, 4, 2, 1, 4, 1, 3, 2, 2, 4, 1, 3, 3, 3, 4, 3, 2…
## $ Q62A1    <dbl> 0, 9, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Q62A     <dbl> 7, 0, 7, 7, 2, 7, 7, 7, 7, 7, 0, 7, 2, 7, 7, 7, 7, 7, 7, 7, 7…
## $ Q62B     <dbl> 7, 0, 7, 7, 2, 7, 7, 7, 7, 7, 0, 7, 1, 7, 7, 7, 7, 7, 7, 7, 7…
## $ Q62C     <dbl> 7, 0, 7, 7, 2, 7, 7, 7, 7, 7, 0, 7, 0, 7, 7, 7, 7, 7, 7, 7, 7…
## $ Q62D     <dbl> 7, 0, 7, 7, 0, 7, 7, 7, 7, 7, 0, 7, 0, 7, 7, 7, 7, 7, 7, 7, 7…
## $ Q62E     <dbl> 7, 0, 7, 7, 0, 7, 7, 7, 7, 7, 0, 7, 1, 7, 7, 7, 7, 7, 7, 7, 7…
## $ Q62F     <dbl> 7, 0, 7, 7, 0, 7, 7, 7, 7, 7, 0, 7, 0, 7, 7, 7, 7, 7, 7, 7, 7…
## $ Q63A     <dbl> 9, 9, 3, 3, 3, 2, 9, 9, 9, 9, 3, 3, 2, 1, 9, 9, 3, 3, 9, 3, 3…
## $ Q63B     <dbl> 9, 9, 3, 3, 3, 4, 9, 9, 9, 9, 3, 3, 2, 4, 9, 9, 3, 3, 9, 9, 3…
## $ Q63C     <dbl> 9, 9, 3, 3, 3, 2, 9, 9, 9, 9, 3, 3, 2, 2, 9, 9, 3, 3, 9, 9, 3…
## $ Q63D     <dbl> 9, 9, 3, 3, 3, 1, 9, 9, 9, 9, 3, 3, 2, 1, 9, 9, 3, 3, 9, 9, 3…
## $ Q64A     <dbl> 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0…
## $ Q64B     <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0…
## $ Q64C     <dbl> 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Q64D     <dbl> 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1…
## $ Q64E     <dbl> 1, 1, 1, 1, 0, 0, 1, 0, 9, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0…
## $ Q65      <dbl> 9, 9, 3, 2, 4, 4, 4, 4, 9, 1, 2, 2, 2, 1, 2, 9, 3, 3, 9, 9, 2…
## $ Q66      <dbl> 9, 9, 4, 4, 2, 2, 1, 2, 9, 1, 3, 2, 2, 2, 4, 9, 3, 2, 9, 9, 3…
## $ Q67      <dbl> 9, 9, 0, 0, 0, 3, 0, 0, 9, 3, 0, 1, 0, 2, 3, 9, 0, 1, 9, 9, 1…
## $ Q68      <dbl> 9, 9, 3, 3, 3, 3, 3, 3, 9, 3, 3, 3, 2, 3, 1, 9, 3, 3, 9, 9, 3…
## $ Q69      <dbl> 9, 9, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 9, 3, 3, 9, 9, 3…
## $ Q70A     <dbl> 3, 9, 3, 3, 4, 2, 4, 4, 3, 3, 3, 2, 4, 3, 3, 3, 3, 4, 3, 3, 3…
## $ Q70B     <dbl> 9, 9, 3, 9, 3, 2, 3, 3, 3, 2, 3, 1, 3, 3, 9, 3, 3, 4, 3, 3, 3…
## $ Q70C     <dbl> 9, 9, 3, 2, 3, 3, 3, 2, 3, 2, 2, 3, 2, 3, 9, 9, 2, 4, 3, 2, 3…
## $ Q71      <dbl> 3, 3, 4, 3, 3, 1, 3, 4, 2, 3, 4, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3…
## $ Q72A     <dbl> 2, 2, 2, 1, 1, 2, 2, 2, 2, 3, 1, 2, 2, 2, 9, 1, 2, 2, 1, 1, 1…
## $ Q72B     <dbl> 2, 2, 2, 1, 1, 1, 2, 2, 3, 3, 1, 2, 2, 2, 9, 3, 2, 2, 1, 1, 1…
## $ Q73A     <dbl> 3, 3, 3, 3, 3, 1, 3, 3, 3, 1, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3…
## $ Q73B     <dbl> 3, 3, 3, 3, 3, 1, 3, 3, 3, 9, 3, 3, 3, 0, 3, 9, 3, 3, 3, 3, 3…
## $ Q73C     <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 1, 9, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3…
## $ Q74      <dbl> 1, 9, 3, 2, 2, 2, 3, 2, 9, 9, 1, 2, 3, 4, 1, 9, 2, 3, 1, 1, 3…
## $ Q79      <dbl> 100, 104, 990, 100, 108, 101, 109, 100, 101, 104, 104, 101, 1…
## $ Q79OTHER <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "…
## $ Q80      <dbl> 3, 3, 7, 3, 3, 4, 3, 3, 3, 3, 4, 3, 3, 2, 9, 2, 3, 2, 3, 3, 3…
## $ Q81      <dbl> 9, 5, 7, 3, 3, 5, 4, 1, 3, 5, 3, 3, 3, 2, 9, 1, 3, 4, 5, 5, 3…
## $ Q82      <dbl> 0, 0, 7, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 9, 0, 0, 0, 0, 0…
## $ Q83      <dbl> 2, 4, 7, 5, 5, 5, 5, 5, 5, 4, 5, 2, 5, 5, 3, 5, 4, 5, 4, 4, 5…
## $ Q84A     <dbl> 0, 0, 2, 2, 1, 2, 0, 0, 0, 0, 1, 0, 2, 1, 1, 1, 2, 1, 0, 2, 2…
## $ Q84B     <dbl> 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1, 1, 1, 2, 1, 0, 1, 2…
## $ Q84C     <dbl> 0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1, 1, 1, 1, 1, 0, 0, 2…
## $ Q85      <dbl> 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1…
## $ Q86      <dbl> 100, 997, 997, 100, 106, 997, 997, 100, 997, 997, 997, 100, 9…
## $ Q87      <dbl> 0, 0, 4, 0, 0, 0, 3, 0, 2, 4, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ Q88A     <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 4, 0, 0, 0, 0, 4, 0…
## $ Q88B     <dbl> 0, 0, 0, 0, 3, 1, 1, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Q88C     <dbl> 0, 0, 0, 0, 3, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Q88D     <dbl> 9, 3, 1, 1, 3, 3, 3, 4, 2, 3, 2, 2, 1, 3, 3, 0, 1, 1, 3, 4, 1…
## $ Q88E     <chr> "FON,FRANCAIS", "FRANCAIS,FON,YORUBA", "YORUBA,ADJA", "FON,FR…
## $ Q88F     <dbl> 2, 3, 2, 2, 3, 4, 3, 2, 4, 3, 3, 3, 1, 1, 2, 3, 2, 2, 2, 3, 1…
## $ Q89      <dbl> 4, 2, 4, 3, 4, 4, 5, 2, 4, 4, 5, 2, 0, 4, 4, 4, 2, 2, 0, 7, 0…
## $ Q90      <dbl> 2, 18, 18, 1, 2, 2, 2, 12, 13, 2, 1, 1, 2, 2, 2, 12, 25, 1, 2…
## $ Q91      <dbl> 4, 1, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 3, 3, 4, 4, 4, 4, 4, 4, 4…
## $ Q92A     <dbl> 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0…
## $ Q92B     <dbl> 1, 1, 1, 9, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Q92C     <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 9, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0…
## $ Q93A     <dbl> 3, 3, 3, 3, 2, 2, 1, 3, 2, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3…
## $ Q93B     <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q94      <dbl> 0, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 4, 0…
## $ Q95      <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Q96      <dbl> 997, 997, 997, 997, 997, 997, 2, 997, 997, 997, 997, 998, 998…
## $ Q97      <dbl> 100, 100, 100, 100, 100, 102, 998, 100, 999, 100, 101, 100, 1…
## $ Q98A     <dbl> 3, 3, 1, 2, 2, 3, 2, 9, -1, -1, -1, -1, -1, -1, -1, 9, 1, 2, …
## $ Q98B     <dbl> 3, 3, 1, 2, 2, 1, 2, 9, 3, 3, 1, 1, 1, 3, 9, 3, 1, 2, 3, 2, 3…
## $ Q98C     <dbl> 3, 3, 1, 2, 2, 2, 2, 9, 3, 3, 2, 1, 1, 3, 9, 3, 2, 2, 3, 2, 3…
## $ Q98D     <dbl> 3, 3, 1, 2, 2, 2, 2, 9, 3, 3, 2, 1, 1, 3, 9, 3, 2, 1, 3, 2, 3…
## $ Q98E     <dbl> 3, 3, 1, 2, 2, 2, 2, 9, 3, 3, 1, 1, 1, 3, 9, 2, 2, 1, 3, 2, 3…
## $ Q98F     <dbl> 3, 3, 1, 2, 2, 2, 2, 9, 3, 2, 2, 1, 1, 3, 2, 3, 2, 2, 3, 2, 3…
## $ Q98G     <dbl> 3, 2, 1, 2, 2, 0, 2, 9, 3, 2, 2, 1, 1, 3, 1, 9, 2, 2, 3, 2, 3…
## $ Q98H     <dbl> 3, 3, 1, 2, 2, 2, 2, 9, 3, 3, 2, 1, 1, 1, 9, 9, 2, 2, 3, 2, 3…
## $ Q98I     <dbl> 3, 3, 1, 2, 2, 0, 2, 9, 3, 3, 2, 1, 1, 3, 9, 3, 2, 2, 3, 2, 3…
## $ Q98J     <dbl> 3, 3, 1, 2, 2, 0, 2, 9, 3, 3, 2, 1, 1, 3, 9, 1, 2, 2, 3, 2, 3…
## $ Q98J1    <dbl> 3, 3, 1, 2, 2, 0, 2, 9, 3, 3, 1, 1, 1, 3, 2, 3, 2, 2, 3, 2, 3…
## $ Q98K     <dbl> 1, 1, 3, 9, 2, 3, 2, 9, 1, 1, 1, 3, 9, 3, 9, 9, 3, 3, 1, 1, 9…
## $ Q99A     <dbl> 2, 2, 3, 2, 3, 3, 2, 9, 2, 4, 2, 2, 3, 2, 9, 4, 3, 3, 9, 5, 3…
## $ Q99B     <dbl> 2, 1, 3, 2, 3, 3, 2, 9, 2, 4, 1, 2, 4, 2, 9, 1, 2, 3, 9, 5, 3…
## $ Q99C     <dbl> 1, 2, 3, 2, 3, 2, 2, 9, 2, 3, 1, 2, 4, 2, 2, 2, 2, 3, 9, 4, 3…
## $ Q100     <dbl> 19, 17, 1, 1, 20, 17, 1, 1, 19, 19, 17, 1, 17, 2, 19, 19, 1, …
## $ ENDTIME  <time> 19:30:00, 20:40:00, 19:10:00, 18:17:00, 18:25:00, 19:35:00, …
## $ LENGTH   <dbl> 60, 60, 40, 57, 52, 62, 74, 42, 76, 87, 46, 56, 75, 63, 101, …
## $ Q101     <dbl> 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 1, 2…
## $ Q102     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q103     <dbl> 100, 2, 2, 100, 2, 2, 2, 2, 2, 2, 2, 100, 100, 2, 2, 101, 101…
## $ Q104     <dbl> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 5, 1, 1, 1…
## $ Q105A    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Q105B    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Q105C    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Q105D    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Q105E    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Q106     <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0, 0, 0, 0, 0…
## $ Q107A    <chr> "000", "000", "000", "000", "000", "000", "000", "000", "0", …
## $ Q107B    <chr> "000", "000", "000", "000", "000", "000", "000", "000", "0", …
## $ Q107C    <chr> "000", "000", "000", "000", "000", "000", "000", "000", "0", …
## $ Q108A    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q108B    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q108C    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1…
## $ Q108D    <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q108E    <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q108F    <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q110     <chr> "BEN10", "BEN10", "BEN12", "BEN12", "BEN11", "BEN11", "BEN09"…
## $ Q111     <dbl> 34, 34, 30, 30, 26, 26, 26, 26, 34, 34, 30, 30, 26, 26, 26, 2…
## $ Q112     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Q113     <dbl> 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2…
## $ Q114     <dbl> 101, 101, 101, 101, 2, 2, 2, 2, 101, 101, 101, 101, 2, 2, 2, …
## $ Q115     <dbl> 7, 7, 8, 8, 8, 8, 7, 7, 7, 7, 8, 8, 8, 8, 7, 7, 8, 8, 7, 7, 8…
## $ Withinwt <dbl> 1.412669, 1.412669, 1.412669, 1.412669, 1.412669, 1.412669, 1…
## $ Acrosswt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Combinwt <dbl> 1.412669, 1.412669, 1.412669, 1.412669, 1.412669, 1.412669, 1…

This is a fairly large dataset. While its size doesn’t matter much for computing time, sometimes you may want to trim datasets down a bit. From the codebook at http://afrobarometer.org/sites/default/files/data/round-4/merged_r4_codebook3.pdf, I determine that I’ll be interested only in the following variables:

  • COUNTRY
  • URBRUR: Urban or Rural Primary Sampling Unit
  • Q42A: In your opinion how much of a democracy is [Ghana/Kenya/etc.] today?
  • Q89: What is the highest level of education you have completed?
  • Q101: Respondent’s gender
  • Q1: Respondent’s age

So we use the select() function from the “dplyr” package to create a new object, ab.small, that contains only these six variables. Here, we also introduce the “pipe” symbol |>. This native pipe was added to base R in version 4.1.0, but the very similar %>% pipe from the “magrittr” package has been around for much longer. Its main purpose is to make long, complex operations easier to read:

x |> mean() |> round() |> print()

can be read as: start with x, then mean(), then round(), then print(). You will see the pipe used a lot in most tidyverse documentation. You can set a shortcut for the pipe under Tools > Global Options > Code.
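
To make the reading order concrete, here is a minimal sketch that compares nested function calls with the piped version. The vector x holds made-up values and is not part of the Afrobarometer data; the native pipe assumes R 4.1.0 or later.

```r
# Toy numeric vector (hypothetical values, not from the tutorial data)
x <- c(2.4, 3.6, 5.1)

# Nested calls: read from the inside out
nested <- round(mean(x))

# Same computation with the native pipe: read from left to right
piped <- x |> mean() |> round()

# Both approaches return the same result
identical(nested, piped)
```

Both versions do the same work; the pipe only changes how the code reads.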

ab.small <- ab |> select(COUNTRY, URBRUR, Q42A, Q89, Q101, Q1)
dim(ab.small)
## [1] 27713     6
str(ab.small)
## 'data.frame':    27713 obs. of  6 variables:
##  $ COUNTRY: num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Country"
##   ..- attr(*, "format.spss")= chr "F4.0"
##   ..- attr(*, "labels")= Named num [1:20] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..- attr(*, "names")= chr [1:20] "Benin" "Botswana" "Burkina Faso" "Cape Verde" ...
##  $ URBRUR : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Urban or Rural Primary Sampling Unit"
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "labels")= Named num [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "Urban" "Rural"
##  $ Q42A   : num  4 4 4 3 2 3 2 3 3 3 ...
##   ..- attr(*, "label")= chr "Q42a. Extent of democracy"
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "labels")= Named num [1:8] -1 1 2 3 4 8 9 998
##   .. ..- attr(*, "names")= chr [1:8] "Missing" "Not a democracy" "A democracy, with major problems" "A democracy, but with minor problems" ...
##  $ Q89    : num  4 2 4 3 4 4 5 2 4 4 ...
##   ..- attr(*, "label")= chr "Q89. Education of respondent"
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "labels")= Named num [1:13] -1 0 1 2 3 4 5 6 7 8 ...
##   .. ..- attr(*, "names")= chr [1:13] "Missing" "No formal schooling" "Informal schooling only" "Some primary schooling" ...
##  $ Q101   : num  2 1 2 1 2 1 2 1 1 2 ...
##   ..- attr(*, "label")= chr "Q101. Gender of respondent"
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "labels")= Named num [1:3] -1 1 2
##   .. ..- attr(*, "names")= chr [1:3] "Missing" "Male" "Female"
##  $ Q1     : num  38 46 28 30 23 24 40 50 24 36 ...
##   ..- attr(*, "label")= chr "Q1. Age"
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "labels")= Named num [1:3] -1 998 999
##   .. ..- attr(*, "names")= chr [1:3] "Missing" "Refused" "Don't know"
summary(ab.small)
##     COUNTRY          URBRUR          Q42A             Q89       
##  Min.   : 1.00   Min.   :1.00   Min.   :-1.000   Min.   :-1.00  
##  1st Qu.: 6.00   1st Qu.:1.00   1st Qu.: 2.000   1st Qu.: 2.00  
##  Median :12.00   Median :2.00   Median : 3.000   Median : 3.00  
##  Mean   :11.21   Mean   :1.62   Mean   : 3.452   Mean   : 3.27  
##  3rd Qu.:16.00   3rd Qu.:2.00   3rd Qu.: 4.000   3rd Qu.: 5.00  
##  Max.   :20.00   Max.   :2.00   Max.   : 9.000   Max.   :99.00  
##       Q101             Q1        
##  Min.   :1.000   Min.   : -1.00  
##  1st Qu.:1.000   1st Qu.: 25.00  
##  Median :2.000   Median : 33.00  
##  Mean   :1.501   Mean   : 47.68  
##  3rd Qu.:2.000   3rd Qu.: 45.00  
##  Max.   :2.000   Max.   :999.00

How would you achieve the same using square brackets?
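
One possible answer, sketched on a small toy data frame rather than on ab itself: leave the row index empty and pass a character vector of column names. The object names toy, keep, and toy.small and all values are made up for illustration.

```r
# Toy data frame standing in for `ab` (made-up values)
toy <- data.frame(
  COUNTRY = c(1, 1, 2),
  URBRUR  = c(1, 2, 2),
  Q1      = c(38, 46, 28),
  Q100    = c(5, 3, 4)   # a column we do not want to keep
)

# Names of the columns to keep
keep <- c("COUNTRY", "URBRUR", "Q1")

# Empty row index = keep all rows; select columns by name
toy.small <- toy[, keep]

dim(toy.small)
names(toy.small)
```

With the real data, the same pattern would be applied to ab with the six variable names listed above.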

Now that we’ve read the dataset into R, we can process it for further analysis, which we’ll do further below in this tutorial.

Sidenote: Dealing with value labels in SPSS files

Datasets in SPSS format often contain variables with value labels. You can see this above in the output following the str(ab.small) command. Value labels help you quickly identify the meaning of different codes without revisiting the codebook, e.g. that for Q101, 1 stands for Male and 2 for Female. In many situations, this makes your life easier.

There are a few ways to display and use labels instead of numbers in R. One of them is through the sjlabelled package. The vignettes (1, 2) offer more detail, but here is the gist:

library("sjlabelled")
## 
## Attaching package: 'sjlabelled'
## The following object is masked from 'package:forcats':
## 
##     as_factor
## The following object is masked from 'package:dplyr':
## 
##     as_label
## The following object is masked from 'package:ggplot2':
## 
##     as_label
table(ab.small$Q42A)
## 
##   -1    1    2    3    4    8    9 
##   15 1875 7338 8249 7310 1202 1724
table(as_label(ab.small$Q42A))
## 
##                              Missing                      Not a democracy 
##                                   15                                 1875 
##     A democracy, with major problems A democracy, but with minor problems 
##                                 7338                                 8249 
##                     A full democracy Do not understand question/democracy 
##                                 7310                                 1202 
##                           Don't know                              Refused 
##                                 1724                                    0

What do the numbers under the value labels tell you?
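
To see where such labels come from, it can help to look at how they are stored. The str() output earlier shows that each variable carries a "labels" attribute: a named numeric vector of codes. The following base-R sketch rebuilds that layout on a toy vector and maps the codes to their names. It is only an approximation for illustration and not how sjlabelled implements as_label(); the vector gender and its values are made up.

```r
# Toy vector mimicking how imported SPSS variables store value labels
# (made-up values; the attribute layout follows the str() output above)
gender <- c(2, 1, 2, 1, 2)
attr(gender, "labels") <- c(Missing = -1, Male = 1, Female = 2)

# The codes and their names live in the "labels" attribute
labs <- attr(gender, "labels")

# Base-R approximation of a labelled table: map codes to label names
gender.f <- factor(gender, levels = labs, labels = names(labs))

table(gender.f)
```

A label that is defined but never occurs in the data still shows up in the table, with a count of zero.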

Example: importing a Stata dataset

For this example, we use the European Social Survey, an academically driven cross-national survey that has been conducted every two years across Europe since 2001. You can find more information on the ESS at http://www.europeansocialsurvey.org/ (under Data and Documentation > Round 6) after you register on the site to access data and codebooks. Let’s download the Stata version of the 2012 round of the ESS, called “ESS6e02_1.dta”, and use the import() function again to read the dataset into R.

ess <- import(file = "ESS6e02_1.dta")

Before actually looking at the dataset itself, you should look at its dimensions.

dim(ess)
## [1] 54673   626

This is again a fairly large dataset, so let’s trim it down a bit. From the variable list at http://www.europeansocialsurvey.org/docs/round6/survey/ESS6_appendix_a7_e02_1.pdf, I decide that I’ll be interested only in the following variables:

  • cntry: Country
  • trstlgl: Trust in the legal system, where 0 means no trust at all and 10 means complete trust
  • lrscale: Placement on left right scale, where 0 means the left and 10 means the right
  • fairelc: How important R thinks it is for democracy in general that national elections are free and fair
  • yrbrn: Year of birth
  • gndr: Gender
  • hinctnta: Household’s total net income, all sources

Again, we use the select() function to create a new object, ess.small, that contains only these seven variables.

ess.small <- ess |> select(cntry, trstlgl, lrscale, fairelc, yrbrn, gndr, hinctnta)
dim(ess.small)
## [1] 54673     7
str(ess.small)
## 'data.frame':    54673 obs. of  7 variables:
##  $ cntry   : chr  "AL" "AL" "AL" "AL" ...
##   ..- attr(*, "label")= chr "Country"
##   ..- attr(*, "format.stata")= chr "%2s"
##  $ trstlgl : num  0 0 2 7 6 5 10 9 7 0 ...
##   ..- attr(*, "label")= chr "Trust in the legal system"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##   ..- attr(*, "labels")= Named num [1:14] 0 1 2 3 4 5 6 7 8 9 ...
##   .. ..- attr(*, "names")= chr [1:14] "No trust at all" "1" "2" "3" ...
##  $ lrscale : num  0 88 5 5 10 5 1 5 5 0 ...
##   ..- attr(*, "label")= chr "Placement on left right scale"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##   ..- attr(*, "labels")= Named num [1:14] 0 1 2 3 4 5 6 7 8 9 ...
##   .. ..- attr(*, "names")= chr [1:14] "Left" "1" "2" "3" ...
##  $ fairelc : num  10 10 10 88 10 10 10 10 10 10 ...
##   ..- attr(*, "label")= chr "National elections are free and fair"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##   ..- attr(*, "labels")= Named num [1:14] 0 1 2 3 4 5 6 7 8 9 ...
##   .. ..- attr(*, "names")= chr [1:14] "Not at all important for democracy in general" "1" "2" "3" ...
##  $ yrbrn   : num  1949 1983 1946 9999 1953 ...
##   ..- attr(*, "label")= chr "Year of birth"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##   ..- attr(*, "labels")= Named num [1:3] 7777 8888 9999
##   .. ..- attr(*, "names")= chr [1:3] "Refusal" "Don't know" "No answer"
##  $ gndr    : num  1 2 2 1 1 1 2 2 2 2 ...
##   ..- attr(*, "label")= chr "Gender"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##   ..- attr(*, "labels")= Named num [1:3] 1 2 9
##   .. ..- attr(*, "names")= chr [1:3] "Male" "Female" "No answer"
##  $ hinctnta: num  5 2 2 99 2 2 1 2 4 1 ...
##   ..- attr(*, "label")= chr "Household's total net income, all sources"
##   ..- attr(*, "format.stata")= chr "%10.0g"
##   ..- attr(*, "labels")= Named num [1:13] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..- attr(*, "names")= chr [1:13] "J - 1st decile" "R - 2nd decile" "C - 3rd decile" "M - 4th decile" ...
summary(ess.small)
##     cntry              trstlgl          lrscale         fairelc     
##  Length:54673       Min.   : 0.000   Min.   : 0.00   Min.   : 0.00  
##  Class :character   1st Qu.: 3.000   1st Qu.: 4.00   1st Qu.: 8.00  
##  Mode  :character   Median : 5.000   Median : 5.00   Median :10.00  
##                     Mean   : 7.054   Mean   :17.54   Mean   :10.91  
##                     3rd Qu.: 7.000   3rd Qu.: 8.00   3rd Qu.:10.00  
##                     Max.   :99.000   Max.   :99.00   Max.   :99.00  
##      yrbrn           gndr          hinctnta    
##  Min.   :1909   Min.   :1.000   Min.   : 1.00  
##  1st Qu.:1950   1st Qu.:1.000   1st Qu.: 3.00  
##  Median :1964   Median :2.000   Median : 6.00  
##  Mean   :1981   Mean   :1.546   Mean   :20.06  
##  3rd Qu.:1980   3rd Qu.:2.000   3rd Qu.:10.00  
##  Max.   :9999   Max.   :9.000   Max.   :99.00
table(ess.small$hinctnta)
## 
##    1    2    3    4    5    6    7    8    9   10   77   88   99 
## 5063 5492 5011 4888 4538 4324 4147 3829 3262 3427 6211 4342  139
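
Note that the summary() output reports a mean of 20.06 for hinctnta even though the income deciles only run from 1 to 10. The codes 77, 88, and 99 mark nonresponse and are treated as real numbers unless you recode them. Here is a minimal sketch of that problem and one way to handle it, using a toy vector with made-up values (the names income and income.clean are hypothetical):

```r
# Toy income-decile vector with ESS-style nonresponse codes
# (made-up values; 77, 88, 99 follow the codes shown in the table above)
income <- c(5, 2, 2, 99, 2, 77, 1, 88, 4, 1)

# The raw mean is inflated by the nonresponse codes
mean(income)

# Replace nonresponse codes with NA in a copy of the variable
income.clean <- income
income.clean[income.clean %in% c(77, 88, 99)] <- NA

# Drop the NAs when computing the mean
mean(income.clean, na.rm = TRUE)
```

Working on a copy keeps the original codes available in case you later want to distinguish refusals from "don't know" answers.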

Now that we’ve read the dataset into R, we can process it for further analysis. We won’t revisit the ESS data in this tutorial.

Sidenote: Dealing with value labels in Stata files

Datasets in Stata format often contain variables with value labels. You can see this above in the output following the str(ess.small) command. Value labels help you quickly identify the meaning of different codes without revisiting the codebook, e.g. that for gndr, 1 stands for Male and 2 for Female. In many situations, this makes your life easier.

Just like with SPSS above, the as_label() function is handy to print value labels in tables or graphs. Just wrap as_label() around your variable of interest where needed.

table(as_label(ess.small$hinctnta))
## 
##  J - 1st decile  R - 2nd decile  C - 3rd decile  M - 4th decile  F - 5th decile 
##            5063            5492            5011            4888            4538 
##  S - 6th decile  K - 7th decile  P - 8th decile  D - 9th decile H - 10th decile 
##            4324            4147            3829            3262            3427 
##         Refusal      Don't know       No answer 
##            6211            4342             139

Data cleaning

Before any analysis, you will often, if not always, need to process data that you obtained from elsewhere or that you collected yourself. In this section, we’ll go over some typical scenarios for this.

Often, you need to make sure that the variables have the correct numerical or character values. Different data sources often use different codes for missing values, for instance -99, -9999, ., or NA. Let’s work through this with the ab.small dataset. First, I check the structure of the dataset:

str(ab.small)
## 'data.frame':    27713 obs. of  6 variables:
##  $ COUNTRY: num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Country"
##   ..- attr(*, "format.spss")= chr "F4.0"
##   ..- attr(*, "labels")= Named num [1:20] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..- attr(*, "names")= chr [1:20] "Benin" "Botswana" "Burkina Faso" "Cape Verde" ...
##  $ URBRUR : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Urban or Rural Primary Sampling Unit"
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "labels")= Named num [1:2] 1 2
##   .. ..- attr(*, "names")= chr [1:2] "Urban" "Rural"
##  $ Q42A   : num  4 4 4 3 2 3 2 3 3 3 ...
##   ..- attr(*, "label")= chr "Q42a. Extent of democracy"
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "labels")= Named num [1:8] -1 1 2 3 4 8 9 998
##   .. ..- attr(*, "names")= chr [1:8] "Missing" "Not a democracy" "A democracy, with major problems" "A democracy, but with minor problems" ...
##  $ Q89    : num  4 2 4 3 4 4 5 2 4 4 ...
##   ..- attr(*, "label")= chr "Q89. Education of respondent"
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "labels")= Named num [1:13] -1 0 1 2 3 4 5 6 7 8 ...
##   .. ..- attr(*, "names")= chr [1:13] "Missing" "No formal schooling" "Informal schooling only" "Some primary schooling" ...
##  $ Q101   : num  2 1 2 1 2 1 2 1 1 2 ...
##   ..- attr(*, "label")= chr "Q101. Gender of respondent"
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "labels")= Named num [1:3] -1 1 2
##   .. ..- attr(*, "names")= chr [1:3] "Missing" "Male" "Female"
##  $ Q1     : num  38 46 28 30 23 24 40 50 24 36 ...
##   ..- attr(*, "label")= chr "Q1. Age"
##   ..- attr(*, "format.spss")= chr "F3.0"
##   ..- attr(*, "labels")= Named num [1:3] -1 998 999
##   .. ..- attr(*, "names")= chr [1:3] "Missing" "Refused" "Don't know"

It looks like we’re dealing with all numeric variables here. This is good, but you want to make sure that codes for missing observations, or otherwise non-numeric values, are discarded appropriately. The most fail-safe way to do this is to convert the variable into a factor manually and to assign value labels as you find them in the codebook.
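
As a minimal sketch of this idea, assuming a toy vector with the same coding as Q101 (1 = Male, 2 = Female, -1 = Missing): any code that is not listed in levels is turned into NA by factor(). The vector q101 and its values are made up for illustration.

```r
# Toy gender variable with an SPSS-style missing code (-1); made-up values
q101 <- c(2, 1, 2, -1, 1)

# Only the codes listed in `levels` are kept; all other codes become NA
gender <- factor(q101, levels = c(1, 2), labels = c("Male", "Female"))

# useNA = "ifany" shows how many observations were set to NA
table(gender, useNA = "ifany")
```

Listing the valid codes explicitly means that an unexpected code shows up as NA instead of silently entering the analysis as a number.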