Overview of the tidyverse

Week 1 In-class Demo

1 How to follow along with this document

Using this document
This is a convenience for learning, not the way you will run R commands yourself outside of class. We will use this approach in class, but it is not meant for doing real work.
Using RStudio
This is how you will do work (including your homework and personal project), so you should get used to using this environment as quickly as possible. Click on this link to download the files (R project, R script, data, and folder structure) that you need to do all of this in RStudio.

Let’s take a look at how to use each of these two approaches.

1.1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

1.2 Using RStudio

If you want to follow along in RStudio, we encourage you to take the following steps. (You don’t need to do any of this if you are executing the commands within this document; it is all handled for you.)

Download the ZIP file
Use the link to download the ZIP file. Then unzip it.
Open the Rproj (R project) file
Double-click on the file week1.Rproj to open RStudio. Or, if RStudio is already open, use File/Open Project and open that file.
Clean up the workspace
When beginning a new project in RStudio, it’s a good idea to remove any objects left over in the workspace by running the following:
rm(list = ls())

In the above, ls() lists all objects in the workspace, and rm() deletes them.

An alternative, and perhaps better, practice is to restart R using the Session menu, which also unloads packages and resets options. Running rm(list = ls()) only removes objects, but it is a quick way to clean up most things.

Load R libraries
To do any of the following, we need to load the tidyverse, a collection of packages for data manipulation and plotting:
library(tidyverse) 

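Next, we load the tidylog library. This is optional, but it makes the data manipulation functions report what each step did (for example, how many rows a filter removed). Load it after the tidyverse so that its versions of those functions are the ones that get used.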

library(tidylog) 

2 Introduction to the tidyverse

The philosophy of the tidyverse in R is to make data manipulation easier and more intuitive by

  1. Restricting ourselves to simple spreadsheet-like data structures called data frames,
  2. Using a consistent set of functions for the common operations we perform on spreadsheets, like filtering rows, renaming columns, pivoting, and summarizing data, and
  3. Using the pipe operator %>% (or |>) to chain operations together, making it easier to read and write scripts that do multiple things to the data.
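
To make the third point concrete, here is a minimal sketch that runs the same two-step calculation twice, once with nested function calls and once with the pipe. The `sections` data frame and its values are made up for illustration; only the column names ProfID and Credits echo the data used later in this demo.

```r
library(dplyr)

# Toy data, invented for this sketch
sections <- tibble(
  ProfID  = c("P1", "P1", "P2", "P3"),
  Credits = c(3, 4, 3, 1)
)

# Nested calls: must be read from the inside out
nested <- summarize(
  group_by(filter(sections, Credits >= 3), ProfID),
  total_credits = sum(Credits)
)

# Piped calls: read top to bottom, one operation per line
piped <- sections %>%
  filter(Credits >= 3) %>%
  group_by(ProfID) %>%
  summarize(total_credits = sum(Credits))
```

Both versions produce the same table; the piped one simply lists the steps in the order they happen.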

A typical institutional research (IR) problem is to read in CSV files, manipulate them in some way, and then save the results to a new file. Here’s an example. We’ll learn the details of each step later. The point here is to illustrate the utility of the tidyverse approach.

Learning goal
Introduction to the tidyverse way of working with data, using a short script to accomplish a common IR task.
Project description
We’ve been asked to create a teaching load report that identifies, by academic subject, the professors who teach more than 10 courses per academic year.
Load raw data
We execute the following commands in order to read the data from three separate CSV files into R.
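
As a rough sketch of what reading three CSV files can look like, the block below first writes three tiny stand-in files to a temporary folder and then reads them back with read_csv(). The file names, the AcademicYear column, and every value are assumptions made for this illustration; the actual course files may be organized differently.

```r
library(readr)

# Write three tiny stand-in files so the sketch runs anywhere.
# File names, the AcademicYear column, and all values are hypothetical.
data_dir <- file.path(tempdir(), "data_demo")
dir.create(data_dir, showWarnings = FALSE)

write_csv(
  data.frame(CourseSectionID = 1:2, CourseID = "BIO101",
             SectionID = c("001", "002"), TermID = 1L, Credits = 3),
  file.path(data_dir, "course_sections.csv")
)
write_csv(
  data.frame(CourseSectionID = 1:2, ProfID = c("P1", "P2")),
  file.path(data_dir, "assignments.csv")
)
write_csv(
  data.frame(TermID = 1L, AcademicYear = "2023-24"),
  file.path(data_dir, "terms.csv")
)

# Read each file into its own data frame.
# SectionID is declared as text so that codes like "001" keep their leading zeros.
course_sections <- read_csv(
  file.path(data_dir, "course_sections.csv"),
  col_types = cols(SectionID = col_character())
)
assignments <- read_csv(file.path(data_dir, "assignments.csv"))
terms       <- read_csv(file.path(data_dir, "terms.csv"))
```

Declaring the type of an identifier column explicitly means the result does not depend on how read_csv() guesses column types.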
Data Dictionary
The following definitions provide a minimal data dictionary; that is, they define the meaning of each of the columns of data.
  • CourseSectionID: unique identifier for each course section
  • CourseID: unique identifier for each course, specifying subject and course number
  • SectionID: the section number within a course, e.g., 001, 002, etc.
  • TermID: unique identifier for each term
  • Credits: the number of credits for the course
  • ProfID: unique identifier for each professor
Examine the data
Let’s take a look at each of the data frames.
Analysis
Now we need to try to figure out what we have. First, let’s count the number of courses taught by a professor per academic year.
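
One way to do this counting step is with count(), sketched below on made-up rows. The `teaching` data frame, the AcademicYear column (the data dictionary lists only TermID), and the choice to treat each course section as one course are all assumptions for illustration.

```r
library(dplyr)

# Made-up rows, one per course section taught
teaching <- tribble(
  ~ProfID, ~AcademicYear, ~CourseSectionID,
  "P1",    "2022-23",     101,
  "P1",    "2022-23",     102,
  "P1",    "2023-24",     201,
  "P2",    "2023-24",     202
)

# One row per professor and academic year, with the number of sections taught
courses_per_year <- teaching %>%
  count(ProfID, AcademicYear, name = "n_courses")
```

Here count() is shorthand for grouping by ProfID and AcademicYear and then summarizing with n().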

Again, let’s take a look at these results:

Next, let’s make a wide, spreadsheet-like display of that information.
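
A wide display like this is typically built with pivot_wider(). The sketch below assumes a long table of counts named `courses_per_year` with hypothetical columns AcademicYear and n_courses; the values are invented.

```r
library(tidyr)

# Long format: one row per professor and year (made-up values)
courses_per_year <- data.frame(
  ProfID       = c("P1", "P1", "P2"),
  AcademicYear = c("2022-23", "2023-24", "2023-24"),
  n_courses    = c(2L, 1L, 1L)
)

# Wide format: one row per professor, one column per academic year.
# values_fill = 0L puts a zero where a professor taught nothing that year.
wide <- pivot_wider(
  courses_per_year,
  names_from  = AcademicYear,
  values_from = n_courses,
  values_fill = 0L
)
```

Without values_fill, the missing professor-year combinations would appear as NA rather than 0.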

And see what we have:

Third, let’s create an overload report for more than 10 courses per year:
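
The overload rule itself is a single filter. In this sketch the `courses_per_year` table and its n_courses column are assumed names with invented values, chosen so that one professor sits exactly at the threshold.

```r
library(dplyr)

# Made-up counts for one academic year
courses_per_year <- tibble(
  ProfID       = c("P1", "P2", "P3"),
  AcademicYear = "2023-24",
  n_courses    = c(10L, 11L, 14L)
)

# "More than 10" is a strict comparison, so a professor with exactly 10 is not flagged
overloads <- courses_per_year %>%
  filter(n_courses > 10) %>%
  arrange(desc(n_courses))
```

Sorting with arrange(desc(n_courses)) puts the heaviest teaching loads at the top of the report.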

What does it tell us?

Validation
Determine if the data looks correct.

Let’s plot it over time.
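
A plot along these lines could be a simple line chart with one line per professor. The data below are invented, and the column names AcademicYear and n_courses are assumptions; the dashed line marks the 10-course threshold from the project description.

```r
library(ggplot2)

# Made-up counts for two professors over three academic years
courses_per_year <- data.frame(
  ProfID       = rep(c("P1", "P2"), each = 3),
  AcademicYear = rep(c("2021-22", "2022-23", "2023-24"), times = 2),
  n_courses    = c(8, 11, 12, 9, 9, 10)
)

# group = ProfID is needed because AcademicYear is a discrete (text) axis
overload_plot <- ggplot(
  courses_per_year,
  aes(x = AcademicYear, y = n_courses, group = ProfID, color = ProfID)
) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 10, linetype = "dashed") +
  labs(x = "Academic year", y = "Courses taught", color = "Professor")
```

Sudden jumps or drops in a line are a cue to go back and check the underlying rows before sending the report on.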

Output
Finally, we will save the graph as a PNG file and write the overloads data to a new CSV file, both in the output folder, to (hypothetically) send to the dean. We are not executing this code in the browser because you don’t have access to the folders; however, these commands would work within RStudio.
ggsave("output/overloads_plot.png", 
       plot = overload_plot, 
       width = 8, 
       height = 6)
write_csv(overloads, "output/overloads.csv")