Day 2: Data transformation - Exercises

Intro to R (ESS 2023)
Affiliation: Ursinus College
Published: June 27, 2023

Goals

What you should be able to do on your own after this exercise:

  • Create a working directory on your computer
  • Create an R script
  • Load a package
  • Use basic R functions
  • Create figures and save them in your working directory
  • Look for help online

Setup

To begin this exercise, please quit and re-open RStudio on your computer.

Acknowledgement: Parts of this exercise draw on material from the book R for Data Science (R4DS), the core text for this course. For citations of the R packages used here, run citation("packagename") in R.

1. Look for help online

You’re invited to ask questions while you complete this exercise during our meeting, but I also suggest consulting the main resources recommended in our course.

2. Create a directory for this exercise

Using your computer’s tools or the “Files” tab in the bottom-right pane of RStudio, create a folder named “Exercise 2” within your course folder.

3. Create an R script

Create an empty R script. Save it as “IntroR_Day2_Exercise.R” in your “Exercise 2” working directory. Within the script, type or copy & paste the following code in the first line to set the working directory to the same folder in which the script is located:

setwd(dirname(rstudioapi::getSourceEditorContext()$path))

4. Load packages

To start your script, load the following packages: “tidyverse”, “tidylog”, “gapminder”, and “nycflights13”. If necessary, install the package(s) via the package manager in RStudio.
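
If you are unsure which of these packages are already installed, the following is a minimal sketch of one way to check before loading them. The object names pkgs and is_installed are made up for this illustration, and the install line is left commented out so that nothing is downloaded unless you choose to.

```r
# Packages named in this exercise
pkgs <- c("tidyverse", "tidylog", "gapminder", "nycflights13")

# TRUE for each package that is already installed, FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(is_installed)

# Uncomment to install whatever is missing (needs an internet connection)
# install.packages(pkgs[!is_installed])

# Attach the installed packages in the listed order; tidylog comes after
# tidyverse so that its wrappers around the dplyr verbs take precedence
for (p in pkgs[is_installed]) {
  library(p, character.only = TRUE)
}
```

Once everything is installed, four plain library() calls at the top of your script (library(tidyverse), library(tidylog), and so on) do the same job and are easier to read.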

5. Use basic R functions

Create a named object that contains all integers (whole numbers) from 1960 to 2022. How many elements does this object have? Hint: use the seq() and length() functions and the approach discussed in lecture.
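
As a reminder of the pattern, here is a small sketch that uses a different range than the one asked for above. The object name example_years is made up, and the approach shown in lecture may differ in detail.

```r
# All whole numbers from 2000 to 2010, stored in a named object
example_years <- seq(from = 2000, to = 2010, by = 1)

# Show the values, then count how many elements the object has
example_years
length(example_years)  # 11
```

The count includes both endpoints, which is why 2000 to 2010 gives 11 elements rather than 10.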

6. Filtering and sorting a dataset

Using the gapminder data, find each of the following. Remember the tricks to print/view all observations!

  1. country-years that had a life expectancy over 80

  2. the country-year with the highest GDP per capita in the dataset

  3. the 10 country-years with the highest GDP per capita in the dataset

  4. the 10 countries with the highest GDP per capita in 2007 in the dataset

  5. the average life expectancy for each continent in 1952 and 2007
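
The tasks above combine a handful of dplyr verbs. The sketch below shows those verbs on a tiny data frame with invented values (not real gapminder data) that borrows gapminder's column names, so you can transfer the pattern to the actual dataset. All object names (toy, high_lifeExp, and so on) are made up for this illustration.

```r
library(dplyr)

# Invented toy values (NOT real gapminder data), with gapminder-style columns
toy <- tibble(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("X", "X", "X", "X", "Y", "Y"),
  year      = c(1952, 2007, 1952, 2007, 1952, 2007),
  lifeExp   = c(50, 70, 60, 82, 40, 65),
  gdpPercap = c(1000, 5000, 3000, 30000, 500, 2000)
)

# Keep only the rows that meet a condition
high_lifeExp <- toy %>%
  filter(lifeExp > 80)

# Sort from highest to lowest and keep the first two rows
top2_gdp <- toy %>%
  arrange(desc(gdpPercap)) %>%
  head(2)

# Restrict to one year first, then take the single largest value
top_2007 <- toy %>%
  filter(year == 2007) %>%
  slice_max(gdpPercap, n = 1)

# One summary value per group, for two selected years
mean_by_group <- toy %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarise(mean_lifeExp = mean(lifeExp), .groups = "drop")

# Print every row of a tibble instead of only the first ten
print(mean_by_group, n = Inf)
```

slice_max(gdpPercap, n = 2) would be an alternative to the arrange() and head() combination; note that it keeps tied rows by default.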

7. Computing with variables

  1. Knowing that GDP per capita is a country’s GDP divided by its population, can you use the two variables pop and gdpPercap to “recover” countries’ GDP?

  2. Can you create a histogram of this variable for the year 2007?

  3. You just created this new variable, but the histogram suggests that it might be useful to transform it. Section 5.5.1 in R4DS suggests the log transformation. Create a new variable, log2Gdp, and generate a histogram for this variable in 2007.

  4. You decide that you’d rather use the natural log and delete the log2Gdp variable from the data. How do you perform both operations?

  5. For each country-year, calculate the difference between that country’s life expectancy in that year and the average (mean) life expectancy of that country’s continent in the same year. Store the result in a new variable, diff_lifeExp_contMean. What are some of the countries with the biggest deviation in life expectancy from their continent’s typical values?

  6. Can you create a new object, named gapminder2007, that contains only observations from 2007?

  7. Now create a new object, gapminder2007_countries, that only contains the countries that are part of gapminder2007. What object class is this? And how long is it?

  8. Can you create this object in a way so that it turns out to be a vector?

  9. At this point, which objects are in your workspace?

  10. What happens if you try to access all gapminder observations from 2002 by typing gapminder2002 into R? Why?
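
Several of the steps above rely on mutate(), select(), grouped calculations, and pull(). The following sketch walks through them on invented toy values (not real gapminder data); the object names toy, toy2007, countries_df, and countries_vec are made up for this illustration.

```r
library(dplyr)

# Invented toy values (NOT real gapminder data), with gapminder-style columns
toy <- tibble(
  country   = c("A", "B", "C", "A", "B", "C"),
  continent = c("X", "X", "Y", "X", "X", "Y"),
  year      = c(2002, 2002, 2002, 2007, 2007, 2007),
  lifeExp   = c(60, 70, 50, 64, 72, 55),
  pop       = c(1e6, 2e6, 5e5, 1.1e6, 2.1e6, 6e5),
  gdpPercap = c(2000, 8000, 1000, 2500, 9000, 1200)
)

# Compute new variables from existing ones
toy <- toy %>%
  mutate(gdp = gdpPercap * pop,
         log2Gdp = log2(gdp))

# Add the natural log and drop the base-2 version in one pipeline
toy <- toy %>%
  mutate(lnGdp = log(gdp)) %>%
  select(-log2Gdp)

# Difference between each row's value and the mean of its group
toy <- toy %>%
  group_by(continent, year) %>%
  mutate(diff_lifeExp_contMean = lifeExp - mean(lifeExp)) %>%
  ungroup()

# Subset to one year and store the result as a new object
toy2007 <- toy %>%
  filter(year == 2007)

# distinct() returns a one-column tibble; pull() extracts the column itself
countries_df  <- toy2007 %>% distinct(country)
countries_vec <- toy2007 %>% pull(country)

class(countries_df)   # a tibble (data frame)
class(countries_vec)  # "character" here; in gapminder, country is a factor

# List the objects currently in the workspace
ls()
```

Objects only exist once you have assigned them with <-, so a name that never appears on the left-hand side of an assignment will not show up in ls().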

8. Data exploration

  1. Explore the variation of life expectancy for each continent-year graphically.

  2. This graph is a bit cluttered. Create a new dataset, gapminder_decade, with a new variable, lifeExpDecade, that takes the average (mean) of a country’s life expectancy by decade, and recreate the prior graph using that variable. Hint: use this trick to generate a decade variable.

  3. Which continent shows the largest deviations in life expectancy from the continent mean? Use the variable diff_lifeExp_contMean from earlier and a box plot to explore this question.

  4. Over time, which country shows the largest deviations in life expectancy from the continent mean? Use the variable diff_lifeExp_contMean from earlier and a box plot to explore this question. Hint: see the end of section 7.5.1 in R4DS for an example.

  5. Explore the covariation of GDP per capita and life expectancy: does the relationship between the two variables change over time, and does it differ across continents? This builds on a plot you saw in yesterday’s exercise. I recommend using the gapminder_decade dataset from a few exercises above.
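
For the decade-based tasks above, here is a sketch of one common way to build a decade variable and plot the result as box plots. Rounding the year down to the nearest multiple of 10 is an assumption on my part and may differ from the trick the hint refers to. The data are invented toy values (not real gapminder data), and toy, toy_decade, and p are made-up names.

```r
library(dplyr)
library(ggplot2)

# Invented toy values (NOT real gapminder data), with gapminder-style columns
toy <- tibble(
  country   = rep(c("A", "B"), each = 4),
  continent = rep(c("X", "Y"), each = 4),
  year      = rep(c(1952, 1957, 1962, 1967), times = 2),
  lifeExp   = c(50, 52, 55, 57, 60, 61, 63, 66)
)

# Round each year down to its decade (1952 -> 1950, 1967 -> 1960),
# then average life expectancy within each country and decade
toy_decade <- toy %>%
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(country, continent, decade) %>%
  summarise(lifeExpDecade = mean(lifeExp), .groups = "drop")

# One box per decade, one panel per continent
p <- ggplot(toy_decade, aes(x = factor(decade), y = lifeExpDecade)) +
  geom_boxplot() +
  facet_wrap(~ continent) +
  labs(x = "Decade", y = "Mean life expectancy")

# Type p in the console (or call print(p)) to draw the plot
```

Wrapping decade in factor() tells ggplot2 to treat it as a set of categories, which is what geom_boxplot() needs to draw one box per decade.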

9. Save figures in your working directory

Pick one figure that you liked the most, and save it to your working directory. Hint: use the ggsave() function.
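
A minimal sketch of the ggsave() pattern, using R's built-in mtcars data so that it runs on its own. The file name example_figure.pdf and the plot dimensions are arbitrary choices for this illustration.

```r
library(ggplot2)

# Store the plot in an object so it can be passed to ggsave()
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Save to the current working directory; width and height are in inches
ggsave(filename = "example_figure.pdf", plot = p, width = 6, height = 4)
```

ggsave() picks the file format from the extension, so changing the name to end in .png would save a PNG instead. If you leave out the plot argument, it saves the last plot you displayed.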