rm(list = ls())
Overview of the tidyverse
Week 1 In-class Demo
1 How to follow along with this document
- Using this document
- This is a convenience for learning. It is not what you will be doing when you are running R commands yourself outside of class! Thus, while we will use this approach for learning, it is not for doing work.
- Using RStudio
- This is how you will do work (including your homework and personal project), so you should get used to using this environment as quickly as possible. Click on this link to download the files (R project, R script, data, and folder structure) that you need to do all of this in RStudio.
Let’s take a look at how to use each of these two approaches.
1.1 Using this document
Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:
- Edit the code that is shown in the box. Click on the Run Code button.
- Make further edits and re-run that code. You can do this as often as you’d like.
- Click the Start Over button to bring back the original code if you’d like.
- If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
1.2 Using RStudio
If you are working in RStudio, we encourage you to follow these steps. (You don’t need to do any of this if you are executing the commands within this document; it is all handled for you.)
- Download the ZIP file
- Use the link to download the ZIP file. Then unzip it.
- Open the Rproj (R project) file
- Double-click on the file week1.Rproj to open RStudio. Or, if RStudio is already open, use File / Open Project and open that file.
- Clean up the workspace
- When beginning a new project in RStudio, it’s always a good idea to remove active, existing data by doing the following:
rm(list = ls())
In the above, ls() lists all objects in the workspace, and rm() deletes them.
An alternative, and perhaps better, practice is to just restart R using the Session menu, but this is a quick way to clean up most things.
- Load R libraries
- In order to do any of the following, we need to load the library for data manipulations:
library(tidyverse)
Next, we need to load the tidylog library. This is optional, but it tells R to give more detailed messages.
library(tidylog)
2 Introduction to the tidyverse
The philosophy of the tidyverse in R is to make data manipulation easier and more intuitive by
- Restricting ourselves to simple spreadsheet-like data structures called data frames,
- Using a consistent set of functions for common operations we use for spreadsheets, like filtering rows, renaming columns, pivoting, and summarizing data,
- Using the pipe operator %>% (or |>) to chain operations together, making it easier to read and write scripts that do multiple things to the data.
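To make the third point concrete, here is a minimal sketch comparing nested function calls with the same steps written as a pipe. The toy data frame and its values are assumptions made up for illustration; only the column names ProfID and Credits come from this document.

```r
library(dplyr)

# Toy data (hypothetical values, for illustration only)
sections <- tibble(
  ProfID  = c("P1", "P1", "P2", "P2", "P2"),
  Credits = c(3, 1, 3, 3, 4)
)

# Nested calls: you have to read from the inside out
nested <- arrange(count(filter(sections, Credits >= 3), ProfID), desc(n))

# Piped: the same three steps, read top to bottom
piped <- sections %>%
  filter(Credits >= 3) %>%   # keep sections worth 3 or more credits
  count(ProfID) %>%          # one row per professor, with a count column n
  arrange(desc(n))           # largest count first

identical(nested, piped)
```

Both versions produce the same data frame; the piped version simply lists the steps in the order they happen.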
A typical institutional research (IR) problem is to read in CSV files, manipulate them in some way, and then save the results to a new file. Here’s an example. We’ll learn the details of each step later. The point here is to illustrate the utility of the tidyverse approach.
- Learning goal
- Introduction to the tidyverse way of working with data, using a short script to accomplish a common IR task.
- Project description
- We’ve been asked to create a teaching load report to identify professors by academic subject who are teaching more than 10 courses per academic year.
- Load raw data
- We execute the following commands in order to read the data from three separate CSV files into R.
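The actual file names are not shown here, so the following is only a sketch of what reading one such file might look like. The file name course_sections.csv, its contents, and the TermID format are all hypothetical; the column names follow the data dictionary in this section. The sketch writes a tiny CSV to a temporary folder so that it can run on its own.

```r
library(readr)

# Hypothetical file and contents: the real file names are not given in the text.
csv_path <- file.path(tempdir(), "course_sections.csv")
writeLines(c(
  "CourseSectionID,CourseID,SectionID,TermID,Credits,ProfID",
  "1001,MATH101,001,202410,3,P1",
  "1002,MATH101,002,202410,3,P2"
), csv_path)

# Declare identifier columns as text so values such as "001" keep their
# leading zeros no matter how the column types would otherwise be guessed.
sections <- read_csv(
  csv_path,
  col_types = cols(
    CourseSectionID = col_character(),
    CourseID        = col_character(),
    SectionID       = col_character(),
    TermID          = col_character(),
    Credits         = col_double(),
    ProfID          = col_character()
  )
)

sections
```

Spelling out col_types is optional, but it documents what each column is expected to contain and makes the import fail loudly if a file changes shape.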
- Data Dictionary
- The following definitions provide a minimal data dictionary; that is, they define the meaning of each of the columns of data.
CourseSectionID: unique identifier for each course section
CourseID: unique identifier for each course, specifying subject and course number
SectionID: the section number of a course type, e.g., 001, 002, etc.
TermID: unique identifier for each term
Credits: the number of credits for the course
ProfID: unique identifier for each professor
- Examine the data
- Let’s take a look at each of the data frames.
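As a rough sketch of what a first look might involve, the code below uses a made-up data frame with the columns from the data dictionary. The values are hypothetical, and the specific checks are one reasonable choice rather than the checks used in class.

```r
library(dplyr)

# Toy stand-in for one of the data frames (hypothetical values)
sections <- tribble(
  ~CourseSectionID, ~CourseID, ~SectionID, ~TermID,  ~Credits, ~ProfID,
  "S1",             "MATH101", "001",      "2024SP", 3,        "P1",
  "S2",             "MATH101", "002",      "2024SP", 3,        "P2",
  "S3",             "STAT200", "001",      "2024FA", 4,        "P1"
)

glimpse(sections)   # column names, column types, and the first few values

n_sections    <- nrow(sections)                            # how many rows?
n_profs       <- n_distinct(sections$ProfID)               # how many professors?
key_is_unique <- !any(duplicated(sections$CourseSectionID))  # is the ID really unique?
```

Checking that an identifier column has no duplicates is a cheap way to confirm that the data matches its data dictionary before doing any analysis.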
- Analysis
- Now we need to try to figure out what we have. First, let’s count the number of courses taught by a professor per academic year.
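One way this count could be written is sketched below. It assumes that each course section counts as one course taught and that a lookup table maps each term to an academic year; the terms table, the AcademicYear column, the n_courses name, and all values are hypothetical.

```r
library(dplyr)

# Toy course sections (hypothetical values)
sections <- tribble(
  ~CourseSectionID, ~CourseID, ~TermID,  ~ProfID,
  "S1",             "MATH101", "2023FA", "P1",
  "S2",             "MATH101", "2024SP", "P1",
  "S3",             "STAT200", "2024SP", "P1",
  "S4",             "MATH101", "2024FA", "P2"
)

# Hypothetical lookup from term to academic year
terms <- tribble(
  ~TermID,  ~AcademicYear,
  "2023FA", "2023-24",
  "2024SP", "2023-24",
  "2024FA", "2024-25"
)

course_counts <- sections %>%
  left_join(terms, by = "TermID") %>%              # attach the academic year to each section
  count(ProfID, AcademicYear, name = "n_courses")  # one row per professor per year

course_counts
```

Here count() does the grouping and the tallying in one step; the name argument just gives the new column a more descriptive name than the default n.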
Again, let’s take a look at these results:
Next, let’s make a wide, spreadsheet-like display of that information.
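A wide display of this kind can be sketched with pivot_wider() from tidyr. The input below is a made-up table of counts (the column names AcademicYear and n_courses are assumptions), and filling missing combinations with 0 assumes that a professor who does not appear in a year taught nothing that year.

```r
library(dplyr)
library(tidyr)

# Toy long-format counts (hypothetical values)
course_counts <- tribble(
  ~ProfID, ~AcademicYear, ~n_courses,
  "P1",    "2022-23",     8L,
  "P1",    "2023-24",     12L,
  "P2",    "2023-24",     5L
)

wide_counts <- course_counts %>%
  pivot_wider(
    names_from  = AcademicYear,  # one column per academic year
    values_from = n_courses,     # cell values are the course counts
    values_fill = 0L             # assumption: a missing year means zero courses
  )

wide_counts
```

The result has one row per professor and one column per academic year, which is the layout most people expect from a spreadsheet.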
And see what we have:
Third, let’s create an overload report for more than 10 courses per year:
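The overload rule itself is a simple filter. The sketch below applies it to a made-up table of counts; the Subject, AcademicYear, and n_courses columns and all values are hypothetical, while the name overloads matches the object written out at the end of this section.

```r
library(dplyr)

# Toy counts per professor, subject, and year (hypothetical values)
course_counts <- tribble(
  ~ProfID, ~Subject, ~AcademicYear, ~n_courses,
  "P1",    "MATH",   "2023-24",     12L,
  "P2",    "STAT",   "2023-24",     10L,
  "P3",    "MATH",   "2023-24",     11L
)

overloads <- course_counts %>%
  filter(n_courses > 10) %>%    # strictly more than 10, so exactly 10 is not an overload
  arrange(desc(n_courses))      # heaviest loads first

overloads
```

Note that the comparison is > rather than >=, because the request was for professors teaching more than 10 courses.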
What does it tell us?
- Validation
- Determine if the data looks correct.
Let’s plot it over time.
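What exactly gets plotted is not shown here, so the following sketch assumes the quantity of interest is the number of overloaded professors in each academic year. The yearly data frame and its values are hypothetical; the name overload_plot matches the object saved at the end of this section.

```r
library(ggplot2)

# Toy yearly summary (hypothetical values)
yearly <- data.frame(
  AcademicYear = c("2021-22", "2022-23", "2023-24"),
  n_overloaded = c(2, 3, 5)
)

overload_plot <- ggplot(yearly, aes(x = AcademicYear, y = n_overloaded, group = 1)) +
  geom_line() +     # group = 1 joins the points even though the x axis is categorical
  geom_point() +
  labs(
    x = "Academic year",
    y = "Professors teaching more than 10 courses"
  )

overload_plot
```

A sudden jump or drop in a plot like this is often a sign of a data problem, such as a missing term, rather than a real change in teaching loads.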
- Output
- Finally, we will write the graph to a PNG file and the overloads data to a new CSV file in the output/ folder to (hypothetically) send to the dean. We are not executing this code in the browser because you don’t have access to the folders; however, these commands would work within RStudio.
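# Save the plot as a PNG file (width and height are in inches by default)
ggsave("output/overloads_plot.png",
       plot = overload_plot,
       width = 8,
       height = 6)
# Save the overload report as a CSV file
write_csv(overloads, "output/overloads.csv")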