This tutorial shows you:
Note on copying & pasting code from the PDF version of this tutorial: Please note that you may run into trouble if you copy & paste code from the PDF version of this tutorial into your R script. When the PDF is created, some characters (for instance, quotation marks or indentations) are converted into non-text characters that R won’t recognize. To use code from this tutorial, please type it yourself into your R script or you may copy & paste code from the source file for this tutorial which is posted on my website.
Note on R functions discussed in this tutorial: I don’t discuss many functions in detail here, so I encourage you to look up their help files or search the web for them before you use them. This will help you understand the functions better. Each of these functions is well documented either in its help file (which you can access in R by typing ?ifelse, for instance) or on the web.
Additional resources: I recommend three resources to learn more about how to manage and manipulate data in R:
Since many of you work with data you collect yourself, you will often enter data directly into spreadsheets. I don’t have any specific recommendations for this, but I strongly encourage you to use the following workflow:
Remember, one of the goals of our seminar is to help you build a better relationship with your future self. Following this workflow will help with that.
In all your work, you should keep separate script files for data management and data analysis. This will also help you maintain a reproducible workflow and keep your code manageable. For instance, in my projects, I typically have at least two R scripts in my project folder:
project_datamgmt.R, which starts by importing the original source data and cleans and prepares it for analysis
project_analysis.R, which conducts all analysis and creates tables and graphs
You will encounter data in many different formats when doing research. For this purpose, the R package “rio” (with which you’re already familiar) is particularly useful. Its developer describes it as “a set of tools aims to simplify the process of importing/exporting data.” The package has two main functions, import() and export(). It allows you to import and export data from/to the following popular formats (and others that I don’t list here):
For more information, see a readme page for the “rio” package on Github: https://github.com/leeper/rio.
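As a minimal sketch of how import() and export() work together, the following round trip writes a small data frame to a CSV file in a temporary folder and reads it back. The data frame toy and the file name toy.csv are made up for illustration, and the code only runs the round trip if “rio” is installed.

```r
# Toy data frame, invented for this illustration
toy <- data.frame(id = 1:3,
                  country = c("Ghana", "Kenya", "Mali"))

if (requireNamespace("rio", quietly = TRUE)) {
  # rio infers the file format from the file extension (.csv here)
  toy_file <- file.path(tempdir(), "toy.csv")
  rio::export(toy, file = toy_file)
  toy_back <- rio::import(file = toy_file)
  # The re-imported data should have the same dimensions as the original
  print(dim(toy_back))
}
```

Changing the extension of the file name (for instance to .dta or .sav) is, in principle, all it takes to write a different format, provided the packages that “rio” relies on for that format are installed.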
Important note on package versions: Because data formats change frequently (e.g., with new versions of commercial software), dealing with data import and export requires special attention. Be sure to always use the most recent version of the “rio” package. For import of some files to work exactly like you see in this tutorial, you also need to have the most recent (development) version 0.2.0.9000 of the package “haven” installed. Use the following commands to install this package:
install.packages("devtools") # only if you haven't installed this package yet
# or if the next line returns an error
devtools::install_github("hadley/haven")
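Before importing files, it can help to check which versions of these packages you currently have. The sketch below uses a small helper, check_version(), which is a made-up name for this illustration; it only relies on requireNamespace() and packageVersion() from base R.

```r
# Hypothetical helper: returns the installed version of a package as text,
# or NA if the package is not installed
check_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

check_version("rio")
check_version("haven")
```

If either call returns NA, the package still needs to be installed with the commands above.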
In this example, we import a dataset from the Afrobarometer project into R. The Afrobarometer is an African-led series of national public attitude surveys on democracy and governance in Africa, and you can find more information on it at http://www.afrobarometer.org/. The survey data are provided to scholars in SPSS format. SPSS is a statistical software package akin to Stata or R. At http://www.afrobarometer.org/data/merged-data, you can find a download link for the fourth round of the Afrobarometer. The file is called “merged_r4_data.sav”. Let’s use this link to read this dataset into R using the import() function from the “rio” package.
First, install the “rio” package. You only have to do this once (and you likely already did so earlier in the semester).
install.packages("rio")
Next, load the package.
library("rio")
Now you can import the Afrobarometer dataset into your R environment. I’ll call it ab. This will take a few seconds since the dataset is over 15 MB in size. Note: as always, replace the file path below with the location of the file on your personal computer.
ab <- import(file = "/Users/johanneskarreth/Documents/Dropbox/Uni/Teaching/POS 517/Tutorials/Day 7 - Data management/merged_r4_data.sav")
Before actually looking at the dataset itself, you should look at its dimensions.
dim(ab)
## [1] 27713 294
This is a fairly large dataset, so let’s trim it down a bit. From the codebook at http://afrobarometer.org/sites/default/files/data/round-4/merged_r4_codebook3.pdf, I determine that I’ll be interested only in the following variables:
COUNTRY
URBRUR: Urban or Rural Primary Sampling Unit
Q42A: In your opinion how much of a democracy is [Ghana/Kenya/etc.] today?
Q89: What is the highest level of education you have completed?
Q101: Respondent’s gender
Q1: Respondent’s age
So we use R’s indexing structure (see the tutorial for Day 2) to create a new object, ab.small, that contains only these six variables.
ab.small <- ab[, c("COUNTRY", "URBRUR", "Q42A", "Q89", "Q101", "Q1")]
dim(ab.small)
## [1] 27713 6
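To see the indexing step in isolation, here is a small sketch on a made-up data frame. The object names toy, keep, and toy.small and all values are invented for illustration; only the column names mimic the Afrobarometer variables.

```r
# Toy data frame with four columns, invented for this illustration
toy <- data.frame(COUNTRY = c("Benin", "Ghana", "Kenya"),
                  Q1 = c(38, 46, 28),
                  Q101 = c(2, 1, 2),
                  Q2 = c(1, 1, 3))

# Character vector with the names of the columns to keep
keep <- c("COUNTRY", "Q101", "Q1")

# Leaving the row index empty keeps all rows; columns are selected by name
toy.small <- toy[, keep]
dim(toy.small)
names(toy.small)
```

Note that the columns of the new object follow the order of the names in keep, not their order in the original data frame.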
str(ab.small)
## 'data.frame': 27713 obs. of 6 variables:
## $ COUNTRY: Factor w/ 20 levels "Benin","Botswana",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ URBRUR : Factor w/ 2 levels "Urban","Rural": 1 1 1 1 1 1 1 1 1 1 ...
## $ Q42A : Factor w/ 8 levels "Missing","Not a democracy",..: 5 5 5 4 3 4 3 4 4 4 ...
## $ Q89 : Factor w/ 13 levels "Missing","No formal schooling",..: 6 4 6 5 6 6 7 4 6 6 ...
## $ Q101 : Factor w/ 3 levels "Missing","Male",..: 3 2 3 2 3 2 3 2 2 3 ...
## $ Q1 : atomic 38 46 28 30 23 24 40 50 24 36 ...
## ..- attr(*, "is_na")= logi FALSE FALSE FALSE
summary(ab.small)
## COUNTRY URBRUR
## Uganda : 2431 Urban:10521
## South Africa: 2400 Rural:17192
## Nigeria : 2324
## Madagascar : 1350
## Cape Verde : 1264
## Mali : 1232
## (Other) :16712
## Q42A
## A democracy, but with minor problems:8249
## A democracy, with major problems :7338
## A full democracy :7310
## Not a democracy :1875
## Don't know :1724
## Do not understand question/democracy:1202
## (Other) : 15
## Q89 Q101
## Some secondary school/high school :5950 Missing: 0
## Some primary schooling :5111 Male :13837
## No formal schooling :4365 Female :13876
## Secondary school completed/high school :4165
## Primary school completed :3897
## Post-secondary qualifications, not university:1674
## (Other) :2551
## Q1
## Min. : -1.00
## 1st Qu.: 25.00
## Median : 33.00
## Mean : 47.68
## 3rd Qu.: 45.00
## Max. :999.00
##
Now that we’ve read the dataset into R, we can process it for further analysis - which we’ll do further below in this tutorial.
Datasets in SPSS format often contain variables with value labels. You can see this above in the output following the str(ab.small) command. Value labels can be useful to help you quickly identify the meaning of different codes without revisiting the codebook, e.g. that with Q101, 1 stands for Male and 2 for Female. In many situations, this makes your life easier.
However, sometimes value labels are prone to creating problems and mix-ups. If you want to prevent value labels from being carried over into R, add the following two options to import():
haven = FALSE: this is an internal command specifying a different routine to read SPSS data into R
use.value.labels = FALSE: prevents value labels from being carried over into R
As a result, you will receive a dataset with numerical values only. Then, consult the codebook to identify the meaning of different values and attach labels or recode these variables yourself (see more on recoding below).
ab_nolabels <- import(file = "/Users/johanneskarreth/Documents/Dropbox/Uni/Teaching/POS 517/Tutorials/Day 7 - Data management/merged_r4_data.sav",
haven = FALSE, use.value.labels = FALSE)
## Warning in read.spss(file = file, to.data.frame = TRUE, ...): /Users/
## johanneskarreth/Documents/Dropbox/Uni/Teaching/POS 517/Tutorials/Day 7 -
## Data management/merged_r4_data.sav: Unrecognized record type 7, subtype 8
## encountered in system file
## re-encoding from CP1252
ab_nolabels.small <- ab_nolabels[, c("COUNTRY", "URBRUR", "Q42A", "Q89", "Q101", "Q1")]
dim(ab_nolabels.small)
## [1] 27713 6
str(ab_nolabels.small)
## 'data.frame': 27713 obs. of 6 variables:
## $ COUNTRY: atomic 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "value.labels")= Named chr "20" "19" "18" "17" ...
## .. ..- attr(*, "names")= chr "Zimbabwe" "Zambia" "Uganda" "Tanzania" ...
## $ URBRUR : atomic 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "value.labels")= Named chr "2" "1"
## .. ..- attr(*, "names")= chr "Rural" "Urban"
## $ Q42A : atomic 4 4 4 3 2 3 2 3 3 3 ...
## ..- attr(*, "value.labels")= Named chr "998" "9" "8" "4" ...
## .. ..- attr(*, "names")= chr "Refused" "Don't know" "Do not understand question/democracy" "A full democracy" ...
## $ Q89 : atomic 4 2 4 3 4 4 5 2 4 4 ...
## ..- attr(*, "value.labels")= Named chr "998" "99" "9" "8" ...
## .. ..- attr(*, "names")= chr "Refused" "Don't know" "Post-graduate" "University completed" ...
## $ Q101 : atomic 2 1 2 1 2 1 2 1 1 2 ...
## ..- attr(*, "value.labels")= Named chr "2" "1" "-1"
## .. ..- attr(*, "names")= chr "Female" "Male" "Missing"
## $ Q1 : atomic 38 46 28 30 23 24 40 50 24 36 ...
## ..- attr(*, "value.labels")= Named chr "999" "998" "-1"
## .. ..- attr(*, "names")= chr "Don't know" "Refused" "Missing"
summary(ab_nolabels.small)
## COUNTRY URBRUR Q42A Q89
## Min. : 1.00 Min. :1.00 Min. :-1.000 Min. :-1.00
## 1st Qu.: 6.00 1st Qu.:1.00 1st Qu.: 2.000 1st Qu.: 2.00
## Median :12.00 Median :2.00 Median : 3.000 Median : 3.00
## Mean :11.21 Mean :1.62 Mean : 3.452 Mean : 3.27
## 3rd Qu.:16.00 3rd Qu.:2.00 3rd Qu.: 4.000 3rd Qu.: 5.00
## Max. :20.00 Max. :2.00 Max. : 9.000 Max. :99.00
## Q101 Q1
## Min. :1.000 Min. : -1.00
## 1st Qu.:1.000 1st Qu.: 25.00
## Median :2.000 Median : 33.00
## Mean :1.501 Mean : 47.68
## 3rd Qu.:2.000 3rd Qu.: 45.00
## Max. :2.000 Max. :999.00
table(ab_nolabels.small$Q42A)
##
## -1 1 2 3 4 8 9
## 15 1875 7338 8249 7310 1202 1724
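Once value labels are gone, you can attach them yourself with factor(). The following is a minimal sketch on a made-up vector of Q42A codes (q42a_codes is invented for illustration). The label for code 4 appears in the str() output above; the labels for codes 1 to 3 are inferred from the matching counts in the two tables, so verify them against the codebook before using this on the real data.

```r
# Made-up responses that use the Q42A codes shown above
q42a_codes <- c(4, 3, -1, 1, 9, 2, 8, 4)

# Only codes 1 to 4 are declared as levels; every other code
# (-1, 8, 9) is not a level and therefore becomes NA
q42a_labelled <- factor(q42a_codes,
                        levels = 1:4,
                        labels = c("Not a democracy",
                                   "A democracy, with major problems",
                                   "A democracy, but with minor problems",
                                   "A full democracy"))

# useNA = "ifany" also shows how many values were turned into NA
table(q42a_labelled, useNA = "ifany")
```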
For this example, we use the European Social Survey, an academically driven cross-national survey that has been conducted every two years across Europe since 2001. You can find more information on the ESS at http://www.europeansocialsurvey.org/ (under Data and Documentation > Round 6) after you register on the site to access data and codebooks. Let’s download the Stata version of the 2012 round of the ESS, called “ESS6e02_1.dta”, and use the import() function again to read the dataset into R.
ess <- import(file = "/Users/johanneskarreth/Documents/Dropbox/Uni/Teaching/POS 517/Tutorials/Day 7 - Data management/ESS6e02_1.dta")
Before actually looking at the dataset itself, you should look at its dimensions.
dim(ess)
## [1] 54673 626
This is again a fairly large dataset, so let’s trim it down a bit. From the variable list at http://www.europeansocialsurvey.org/docs/round6/survey/ESS6_appendix_a7_e02_1.pdf, I decide that I’ll be interested only in the following variables:
cntry: Country
trstlgl: Trust in the legal system, where 0 means you do not trust an institution at all and 10 means you have complete trust
lrscale: Placement on left right scale, where 0 means the left and 10 means the right
fairelc: How important R thinks it is for democracy in general that national elections are free and fair
yrbrn: Year of birth
gndr: Gender
hinctnta: Household’s total net income, all sources
Again, we use R’s indexing structure to create a new object, ess.small, that contains only these seven variables.
ess.small <- ess[, c("cntry", "trstlgl", "lrscale", "fairelc", "yrbrn", "gndr", "hinctnta")]
dim(ess.small)
## [1] 54673 7
str(ess.small)
## 'data.frame': 54673 obs. of 7 variables:
## $ cntry : chr "AL" "AL" "AL" "AL" ...
## $ trstlgl : Factor w/ 14 levels "No trust at all",..: 1 1 3 8 7 6 11 10 8 1 ...
## $ lrscale : Factor w/ 14 levels "Left","1","2",..: 1 13 6 6 11 6 2 6 6 1 ...
## $ fairelc : Factor w/ 14 levels "Not at all important for democracy in general",..: 11 11 11 13 11 11 11 11 11 11 ...
## $ yrbrn : atomic 1949 1983 1946 9999 1953 ...
## ..- attr(*, "is_na")= logi FALSE FALSE FALSE
## $ gndr : Factor w/ 3 levels "Male","Female",..: 1 2 2 1 1 1 2 2 2 2 ...
## $ hinctnta: Factor w/ 13 levels "J - 1st decile",..: 5 2 2 13 2 2 1 2 4 1 ...
summary(ess.small)
## cntry trstlgl lrscale
## Length:54673 5 : 7893 5 :15400
## Class :character 8 : 6311 Don't know: 7489
## Mode :character 7 : 6171 7 : 4942
## No trust at all: 5959 6 : 4323
## 3 : 5178 4 : 4269
## 6 : 5130 8 : 3987
## (Other) :18031 (Other) :14263
## fairelc yrbrn
## Extremely important for democracy in general:31800 Min. :1909
## 9 : 6834 1st Qu.:1950
## 8 : 6396 Median :1964
## 7 : 3213 Mean :1981
## 5 : 1945 3rd Qu.:1980
## 6 : 1718 Max. :9999
## (Other) : 2767
## gndr hinctnta
## Male :24929 Refusal : 6211
## Female :29727 R - 2nd decile: 5492
## No answer: 17 J - 1st decile: 5063
## C - 3rd decile: 5011
## M - 4th decile: 4888
## F - 5th decile: 4538
## (Other) :23470
table(ess.small$hinctnta)
##
## J - 1st decile R - 2nd decile C - 3rd decile M - 4th decile
## 5063 5492 5011 4888
## F - 5th decile S - 6th decile K - 7th decile P - 8th decile
## 4538 4324 4147 3829
## D - 9th decile H - 10th decile Refusal Don't know
## 3262 3427 6211 4342
## No answer
## 139
Now that we’ve read the dataset into R, we can process it for further analysis. We won’t revisit the ESS data in this tutorial, but you will encounter it in your in-class assignment on Day 7.
Datasets in Stata format often contain variables with value labels. You can see this above in the output following the str(ess.small) command. Value labels can be useful to help you quickly identify the meaning of different codes without revisiting the codebook, e.g. that with gndr, 1 stands for Male and 2 for Female. In many situations, this makes your life easier.
However, sometimes value labels are prone to creating problems and mix-ups. If you want to prevent value labels from being carried over into R, add the following two options to import():
haven = FALSE: this is an internal command specifying a different routine to read Stata data into R
convert.factors = FALSE: prevents value labels from being carried over into R
As a result, you will receive a dataset with numerical values only. Then, consult the codebook to identify the meaning of different values and attach labels or recode these variables yourself (see more on recoding below).
ess_nolabels <- import(file = "/Users/johanneskarreth/Documents/Dropbox/Uni/Teaching/POS 517/Tutorials/Day 7 - Data management/ESS6e02_1.dta",
haven = FALSE, convert.factors = FALSE)
ess_nolabels.small <- ess_nolabels[, c("cntry", "trstlgl", "lrscale", "fairelc", "yrbrn", "gndr", "hinctnta")]
dim(ess_nolabels.small)
## [1] 54673 7
str(ess_nolabels.small)
## 'data.frame': 54673 obs. of 7 variables:
## $ cntry : chr "AL" "AL" "AL" "AL" ...
## $ trstlgl : num 0 0 2 7 6 5 10 9 7 0 ...
## $ lrscale : num 0 88 5 5 10 5 1 5 5 0 ...
## $ fairelc : num 10 10 10 88 10 10 10 10 10 10 ...
## $ yrbrn : num 1949 1983 1946 9999 1953 ...
## $ gndr : num 1 2 2 1 1 1 2 2 2 2 ...
## $ hinctnta: num 5 2 2 99 2 2 1 2 4 1 ...
summary(ess_nolabels.small)
## cntry trstlgl lrscale fairelc
## Length:54673 Min. : 0.000 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.: 3.000 1st Qu.: 4.00 1st Qu.: 8.00
## Mode :character Median : 5.000 Median : 5.00 Median :10.00
## Mean : 7.054 Mean :17.54 Mean :10.91
## 3rd Qu.: 7.000 3rd Qu.: 8.00 3rd Qu.:10.00
## Max. :99.000 Max. :99.00 Max. :99.00
## yrbrn gndr hinctnta
## Min. :1909 Min. :1.000 Min. : 1.00
## 1st Qu.:1950 1st Qu.:1.000 1st Qu.: 3.00
## Median :1964 Median :2.000 Median : 6.00
## Mean :1981 Mean :1.546 Mean :20.06
## 3rd Qu.:1980 3rd Qu.:2.000 3rd Qu.:10.00
## Max. :9999 Max. :9.000 Max. :99.00
table(ess_nolabels.small$hinctnta)
##
## 1 2 3 4 5 6 7 8 9 10 77 88 99
## 5063 5492 5011 4888 4538 4324 4147 3829 3262 3427 6211 4342 139
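The summary above shows why numeric non-response codes need attention: the codes 77, 88, and 99 (which match the counts for “Refusal”, “Don’t know”, and “No answer” in the labelled version) inflate the mean of hinctnta. Here is a minimal sketch of the problem and one way to address it, using a made-up vector (hinc_raw is invented for illustration):

```r
# Made-up income deciles, including the non-response codes 77, 88, and 99
hinc_raw <- c(5, 2, 2, 99, 2, 77, 1, 88, 4, 10)

# Set the non-response codes to NA and keep all other values
hinc <- ifelse(hinc_raw %in% c(77, 88, 99), NA, hinc_raw)

# Mean with the non-response codes treated as real values
mean(hinc_raw)

# Mean of the substantive answers only
mean(hinc, na.rm = TRUE)
```

The recoded values go into a new object, so the original vector stays untouched and the two means can be compared directly.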
Before any analysis, you will often, if not always, need to process data that you obtained from elsewhere or that you collected yourself. In this section, we’ll go over some typical scenarios for this.
Often, you need to make sure that the variables have the correct numerical or character values. Different data sources often use different codes for missing values, for instance -99, -9999, ., or NA. Let’s work through this with the ab.small dataset. First, I check the structure of the dataset:
str(ab.small)
## 'data.frame': 27713 obs. of 6 variables:
## $ COUNTRY: Factor w/ 20 levels "Benin","Botswana",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ URBRUR : Factor w/ 2 levels "Urban","Rural": 1 1 1 1 1 1 1 1 1 1 ...
## $ Q42A : Factor w/ 8 levels "Missing","Not a democracy",..: 5 5 5 4 3 4 3 4 4 4 ...
## $ Q89 : Factor w/ 13 levels "Missing","No formal schooling",..: 6 4 6 5 6 6 7 4 6 6 ...
## $ Q101 : Factor w/ 3 levels "Missing","Male",..: 3 2 3 2 3 2 3 2 2 3 ...
## $ Q1 : atomic 38 46 28 30 23 24 40 50 24 36 ...
## ..- attr(*, "is_na")= logi FALSE FALSE FALSE
It looks like we’re dealing with a number of factor variables here. You’ll recall from Day 1 that factor variables are somewhat special in that they combine numerical data (integer values such as 1, 2, 3, …) with character labels (such as “Not at all”, “Somewhat”, …). Read through this page at Quick-R for more information on factors: http://www.statmethods.net/input/datatypes.html.
Knowing that a variable is stored as a factor has implications for how you will work with that variable.
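One such implication, shown here as a small sketch on a made-up factor (f is invented for illustration): converting a factor directly to numbers returns R’s internal integer codes, not the values you see printed as labels.

```r
# A factor whose labels look like numbers
f <- factor(c("10", "0", "5", "10"))

# The labels, sorted as text rather than as numbers
levels(f)

# Returns the internal integer codes, not 10, 0, 5, 10
as.numeric(f)

# Converting to character first recovers the numbers shown in the labels
as.numeric(as.character(f))
```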
Let’s check each of the factor variables individually. Because they are factors, the best way to look at all their values is to use the table() function.
table(ab.small$COUNTRY)
##
## Benin Botswana Burkina Faso Cape Verde Ghana
## 1200 1200 1200 1264 1200
## Kenya Lesotho Liberia Madagascar Malawi
## 1104 1200 1200 1350 1200
## Mali Mozambique Namibia Nigeria Senegal
## 1232 1200 1200 2324 1200
## South Africa Tanzania Uganda Zambia Zimbabwe
## 2400 1208 2431 1200 1200
You can also list the levels of the factor (i.e. the labels connected to the integers) using the levels() function:
levels(ab.small$COUNTRY)
## [1] "Benin" "Botswana" "Burkina Faso" "Cape Verde"
## [5] "Ghana" "Kenya" "Lesotho" "Liberia"
## [9] "Madagascar" "Malawi" "Mali" "Mozambique"
## [13] "Namibia" "Nigeria" "Senegal" "South Africa"
## [17] "Tanzania" "Uganda" "Zambia" "Zimbabwe"
These are all proper country names, so we can move on to the next variable, after you answer a question for yourself:
table(ab.small$URBRUR)
##
## Urban Rural
## 10521 17192
This is a factor with 2 levels and no missing values, so we can move on to the next variable.
table(ab.small$Q42A)
##
## Missing Not a democracy
## 15 1875
## A democracy, with major problems A democracy, but with minor problems
## 7338 8249
## A full democracy Do not understand question/democracy
## 7310 1202
## Don't know Refused
## 1724 0
This variable needs some work. First, we need to recode all values that are currently listed as “Missing” to R’s default code for missing, NA. For this, we can use the ifelse() function (see the tutorial for Day 4). Note that we create a new object for our recoded variable. This is good practice: always leave the source variable untouched and document your recoding command. This way, you can always retrace your steps later and share your work with other researchers. I name the new variable perceivedDem. Below, I use several conditions to code as missing all answers that do not rank the country’s democracy. Depending on your analysis, of course, you may want to use the “Don’t know” answers in a different way. In the code below, you see the | symbol: this stands for “or”, just as & stands for “and”.
ab.small$perceivedDem <- ifelse(ab.small$Q42A == "Missing" |
ab.small$Q42A == "Don't know" |
ab.small$Q42A == "Refused" |
ab.small$Q42A == "Do not understand question/democracy",
NA,
ab.small$Q42A)
table(ab.small$perceivedDem)
##
## 2 3 4 5
## 1875 7338 8249 7310
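When several values of the same variable are recoded to NA, chaining conditions with | can get long. As a sketch of a more compact alternative that this tutorial does not otherwise use, the %in% operator checks each observation against a whole set of values at once. The factor q below is made up for illustration.

```r
# Made-up factor with two substantive answers and two non-response levels
q <- factor(c("Missing", "A full democracy", "Don't know", "Not a democracy"),
            levels = c("Missing", "Not a democracy",
                       "A full democracy", "Don't know"))

# Version 1: conditions chained with | ("or")
v1 <- ifelse(q == "Missing" | q == "Don't know", NA, q)

# Version 2: the same recode written with %in%
v2 <- ifelse(q %in% c("Missing", "Don't know"), NA, q)

# Both versions return the same integer codes
identical(v1, v2)
```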
Note that NA is not set in quotation marks because it is a specific value and not a string. This is important. Also note that we now have numeric values (and labels did not carry over), but we may want to recode them so that 0 is the lowest value and 3 the highest.
ab.small$perceivedDem <- ab.small$perceivedDem - 2
table(ab.small$perceivedDem)
##
## 0 1 2 3
## 1875 7338 8249 7310
Now we have numeric values; this may be preferable for using these data in calculations later on and we could stop here.
However, in some situations you might want labels back so you can remember what the numeric values stand for. To do this, you could access the labels (or levels in R factor language) of the original variable (Q42A) and assign them to the new variable perceivedDem you just created.
levels(ab.small$Q42A)
## [1] "Missing"
## [2] "Not a democracy"
## [3] "A democracy, with major problems"
## [4] "A democracy, but with minor problems"
## [5] "A full democracy"
## [6] "Do not understand question/democracy"
## [7] "Don't know"
## [8] "Refused"
But in that case, be sure to use only those levels that match to the values you are using: you don’t need “Missing”, “Don’t know”, “Refused”, “Do not understand question/democracy” anymore.
levels(ab.small$Q42A)[2:5]
## [1] "Not a democracy"
## [2] "A democracy, with major problems"
## [3] "A democracy, but with minor problems"
## [4] "A full democracy"
These four levels correspond to the values of the newly created variable perceivedDem, so we can create a second version of perceivedDem as a factor and assign these levels:
ab.small$perceivedDem_factor <- factor(ab.small$perceivedDem,
levels = c(0:3),
labels = levels(ab.small$Q42A)[2:5])
table(ab.small$perceivedDem_factor)
##
## Not a democracy A democracy, with major problems
## 1875 7338
## A democracy, but with minor problems A full democracy
## 8249 7310
Lastly, let’s compare our new variable with the original one to make sure we didn’t mix up labels:
table(ab.small$Q42A, ab.small$perceivedDem_factor)
##
## Not a democracy
## Missing 0
## Not a democracy 1875
## A democracy, with major problems 0
## A democracy, but with minor problems 0
## A full democracy 0
## Do not understand question/democracy 0
## Don't know 0
## Refused 0
##
## A democracy, with major problems
## Missing 0
## Not a democracy 0
## A democracy, with major problems 7338
## A democracy, but with minor problems 0
## A full democracy 0
## Do not understand question/democracy 0
## Don't know 0
## Refused 0
##
## A democracy, but with minor problems
## Missing 0
## Not a democracy 0
## A democracy, with major problems 0
## A democracy, but with minor problems 8249
## A full democracy 0
## Do not understand question/democracy 0
## Don't know 0
## Refused 0
##
## A full democracy
## Missing 0
## Not a democracy 0
## A democracy, with major problems 0
## A democracy, but with minor problems 0
## A full democracy 7310
## Do not understand question/democracy 0
## Don't know 0
## Refused 0
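With many categories, a cross-tabulation like the one above becomes long to read. As a rough sketch, the same check can be expressed as a logical test that returns a single TRUE or FALSE. The objects orig, recoded, and kept are made up for illustration and follow the same recoding steps on a tiny factor.

```r
# Made-up original factor with one non-response level
orig <- factor(c("Missing", "Not a democracy",
                 "A full democracy", "Not a democracy"),
               levels = c("Missing", "Not a democracy", "A full democracy"))

# Recode: "Missing" becomes NA, the remaining codes are shifted to start at 0,
# and the matching labels from the original factor are attached again
recoded <- factor(ifelse(orig == "Missing", NA, orig) - 2,
                  levels = 0:1,
                  labels = levels(orig)[2:3])

# Observations that still have a value after recoding
kept <- !is.na(recoded)

# TRUE if every retained observation carries the same label as before
all(as.character(orig[kept]) == as.character(recoded[kept]))

# TRUE if only "Missing" answers were set to NA
all(orig[!kept] == "Missing")
```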
Now let’s make the education variable an ordinal numerical variable. First, let’s have a look at the variable in its current form:
table(ab.small$Q89)
##
## Missing
## 10
## No formal schooling
## 4365
## Informal schooling only
## 1260
## Some primary schooling
## 5111
## Primary school completed
## 3897
## Some secondary school/high school
## 5950
## Secondary school completed/high school
## 4165
## Post-secondary qualifications, not university
## 1674
## Some university
## 649
## University completed
## 506
## Post-graduate
## 92
## Don't know
## 34
## Refused
## 0
We should first recode “Missing”, “Don’t know”, and “Refused” to NA. Again, we create a new variable, education.
ab.small$education <- ifelse(ab.small$Q89 == "Missing" |
ab.small$Q89 == "Don't know" |
ab.small$Q89 == "Refused",
NA, ab.small$Q89)
table(ab.small$education)
##
## 2 3 4 5 6 7 8 9 10 11
## 4365 1260 5111 3897 5950 4165 1674 649 506 92
We now have numeric values, but we may want to recode them so that 0 is the lowest value. Currently, the lowest value is 2, so I’ll subtract 2 from each observation:
ab.small$education <- ab.small$education - 2
table(ab.small$education)
##
## 0 1 2 3 4 5 6 7 8 9
## 4365 1260 5111 3897 5950 4165 1674 649 506 92
If we want to put labels on the variable, we can use the factor() function:
levels(ab.small$Q89)
## [1] "Missing"
## [2] "No formal schooling"
## [3] "Informal schooling only"
## [4] "Some primary schooling"
## [5] "Primary school completed"
## [6] "Some secondary school/high school"
## [7] "Secondary school completed/high school"
## [8] "Post-secondary qualifications, not university"
## [9] "Some university"
## [10] "University completed"
## [11] "Post-graduate"
## [12] "Don't know"
## [13] "Refused"
ab.small$education_factor <- factor(ab.small$education,
levels = c(0:9),
labels = levels(ab.small$Q89)[2:11])
table(ab.small$education_factor)
##
## No formal schooling
## 4365
## Informal schooling only
## 1260
## Some primary schooling
## 5111
## Primary school completed
## 3897
## Some secondary school/high school
## 5950
## Secondary school completed/high school
## 4165
## Post-secondary qualifications, not university
## 1674
## Some university
## 649
## University completed
## 506
## Post-graduate
## 92
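Because education levels have a natural order, one option (not used elsewhere in this tutorial) is to store the labelled version as an ordered factor by adding ordered = TRUE. The sketch below uses a made-up vector and only the first three education labels for brevity; edu and edu_ord are invented names.

```r
# Made-up education codes, using only the three lowest categories
edu <- c(0, 2, 1, 2)

edu_ord <- factor(edu,
                  levels = 0:2,
                  labels = c("No formal schooling",
                             "Informal schooling only",
                             "Some primary schooling"),
                  ordered = TRUE)

# Ordered factors allow comparisons between categories
edu_ord >= "Informal schooling only"

# ... and functions that depend on order, such as max()
max(edu_ord)
```

Whether an ordered factor or a plain numeric variable is more useful depends on the functions you plan to use later, since some treat ordered factors differently from numbers.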
Again, a quick comparison with the original variable:
table(ab.small$Q89, ab.small$education_factor)
##
## No formal schooling
## Missing 0
## No formal schooling 4365
## Informal schooling only 0
## Some primary schooling 0
## Primary school completed 0
## Some secondary school/high school 0
## Secondary school completed/high school 0
## Post-secondary qualifications, not university 0
## Some university 0
## University completed 0
## Post-graduate 0
## Don't know 0
## Refused 0
##
## Informal schooling only
## Missing 0
## No formal schooling 0
## Informal schooling only 1260
## Some primary schooling 0
## Primary school completed 0
## Some secondary school/high school 0
## Secondary school completed/high school 0
## Post-secondary qualifications, not university 0
## Some university 0
## University completed 0
## Post-graduate 0
## Don't know 0
## Refused 0
##
## Some primary schooling
## Missing 0
## No formal schooling 0
## Informal schooling only 0
## Some primary schooling 5111
## Primary school completed 0
## Some secondary school/high school 0
## Secondary school completed/high school 0
## Post-secondary qualifications, not university 0
## Some university 0
## University completed 0
## Post-graduate 0
## Don't know 0
## Refused 0
##
## Primary school completed
## Missing 0
## No formal schooling 0
## Informal schooling only 0
## Some primary schooling 0
## Primary school completed 3897
## Some secondary school/high school 0
## Secondary school completed/high school 0
## Post-secondary qualifications, not university 0
## Some university 0
## University completed 0
## Post-graduate 0
## Don't know 0
## Refused 0
##
## Some secondary school/high school
## Missing 0
## No formal schooling 0
## Informal schooling only 0
## Some primary schooling 0
## Primary school completed 0
## Some secondary school/high school 5950
## Secondary school completed/high school 0
## Post-secondary qualifications, not university 0
## Some university 0
## University completed 0
## Post-graduate 0
## Don't know 0
## Refused 0
##
## Secondary school completed/high school
## Missing 0
## No formal schooling 0
## Informal schooling only 0
## Some primary schooling 0
## Primary school completed 0
## Some secondary school/high school 0
## Secondary school completed/high school 4165
## Post-secondary qualifications, not university 0
## Some university 0
## University completed 0
## Post-graduate 0
## Don't know 0
## Refused 0
##
## Post-secondary qualifications, not university
## Missing 0
## No formal schooling 0
## Informal schooling only 0
## Some primary schooling 0
## Primary school completed 0
## Some secondary school/high school 0
## Secondary school completed/high school 0
## Post-secondary qualifications, not university 1674
## Some university 0
## University completed 0
## Post-graduate 0
## Don't know 0
## Refused 0
##
## Some university
## Missing 0
## No formal schooling 0
## Informal schooling only 0
## Some primary schooling 0
## Primary school completed 0
## Some secondary school/high school 0
## Secondary school completed/high school 0
## Post-secondary qualifications, not university 0
## Some university 649
## University completed 0
## Post-graduate 0
## Don't know 0
## Refused 0
##
## University completed
## Missing 0
## No formal schooling 0
## Informal schooling only 0
## Some primary schooling 0
## Primary school completed 0
## Some secondary school/high school 0
## Secondary school completed/high school 0
## Post-secondary qualifications, not university 0
## Some university 0
## University completed 506
## Post-graduate 0
## Don't know 0
## Refused 0
##
## Post-graduate
## Missing 0
## No formal schooling 0
## Informal schooling only 0
## Some primary schooling 0
## Primary school completed 0
## Some secondary school/high school 0
## Secondary school completed/high school 0
## Post-secondary qualifications, not university 0
## Some university 0
## University completed 0
## Post-graduate 92
## Don't know 0
## Refused 0
Next, let’s check the Q101 variable for gender:
table(ab.small$Q101)
##
## Missing Male Female
## 0 13837 13876
We can’t use the Missing value and want this variable to be numerical so that males are 0 and females 1. Here, we again use the ifelse() function to recode, in this case in several steps. First, I generate a variable female that is missing (NA) for all observations, including those where Q101 has the value "Missing":
ab.small$female <- ifelse(ab.small$Q101 == "Missing", NA, NA)
Next, I change the value of female to 1 for all observations where Q101 is "Female", and leave it at its previous value for all others:
ab.small$female <- ifelse(ab.small$Q101 == "Female", 1, ab.small$female)
Last, I change the value of female to 0 for all observations where Q101 is "Male", and leave it at its previous value for all others:
ab.small$female <- ifelse(ab.small$Q101 == "Male", 0, ab.small$female)
table(ab.small$female)
##
## 0 1
## 13837 13876
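The three steps above can also be written as a single nested ifelse() call. This is a sketch on a made-up factor (q101 is invented for illustration), not a replacement for the step-by-step version, which is easier to follow when you are new to recoding.

```r
# Made-up gender variable with the same three levels as Q101
q101 <- factor(c("Female", "Male", "Missing", "Female"),
               levels = c("Missing", "Male", "Female"))

# 1 for "Female", 0 for "Male", and NA for everything else
female <- ifelse(q101 == "Female", 1,
                 ifelse(q101 == "Male", 0, NA))

# useNA = "ifany" also counts the observations that were set to NA
table(female, useNA = "ifany")
```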
Lastly, let’s check the age variable.
str(ab.small$Q1)
## atomic [1:27713] 38 46 28 30 23 24 40 50 24 36 ...
## - attr(*, "is_na")= logi [1:3] FALSE FALSE FALSE
We can see that the values 999, 998, and -1 don’t seem to correspond to realistic ages. A quick look at the codebook reveals the following: