Application of pivoting to retention data

1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

Before you get started with this script, ensure that you have loaded the following packages:

library(tidyverse)
library(tidylog)
library(knitr)
library(kableExtra)

2 Set up: Create some test data

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

The enroll data frame has three rows and two columns. Here is how we defined this table:

enroll <- tribble(~student_id, ~term,
                  234567, "Fall 2023",
                  236047, "Fall 2023",
                  234567, "Spring 2024")

Let’s validate that we did this correctly:

Verify for yourself that this is in a long format.

3 The process

The problem we have is that enrollment records keep track of students enrolling, but for retention purposes, we need to also know when they are not enrolling.

3.1 Create the retain data frame

What we’d like to have for retention analysis is data like the following:

What we’ve done here is take advantage of R/tidyverse’s ability to pivot to wide data and, in the process, fill in for missing values — in this case, those values represent the terms in which the student was not enrolled. How do we know they weren’t enrolled? Because it wasn’t represented in a row in the enroll data frame.

Let’s go through the above script line-by-line:

retain <-
Whatever the result is, we’re going to save it to a new data frame named retain.
enroll |>
The data for the query comes from this data frame.
mutate(enrolled = 1) |>
Add a column named enroll, and make every row have a 1 in it. Why are we doing this, and why does it make sense? Well, every row in enroll represents a student who enrolled the university during a specific term. So using a 1 in the enrolled column to represent a student who is enrolled, you might (correctly) guess that we’re going to use a 0 to represent a student who is not enrolled. (Also, we could have just as easily used TRUE and FALSE in a logical column for enrolled.)
pivot_wider(names_from..., values_from..., values_fill = 0)
Changing to a wide data format, each row is now going to represent a student with all all of the terms having a separate column. The values will be from the newly-created enrolled column (with all 1s in it for those existing student/term combinations). The values_fill argument here is where the magic is done. If a student/term combination is not in the source data frame, then the otherwise empty cell in the new one will be given a default value of 0 (representing not enrolled).

3.2 Complete the long form

To save ourselves some typing in the following, we’re going to define the following vector:

If we want to include terms not found in the original data, we first complete all the cases using complete(). This command, as shown in the following code block, do the following:

  • For every student_id in the table,
  • ensure that there is a row for every term in all_valid_terms
  • and set enrolled to 0 for every row that did not already exist.

Take a look at the result:

3.3 Pivot to a wide form

This result is in long form and, as explained just above, has one row for each combination of student_id and term. Now we can finish this operation with pivot_wider():

As you might guess, we now have separate columns for the student_id and every term listed in all_valid_terms. We have a 1 in every cell corresponding to the rows in the original enroll data frame above and a 0 in all other cells.

Note that by default R sorts the columns by alphabetical order. We can change this by changing term to a factor and specifying the levels in the order we want them to appear:

3.4 Alternative approach with terms represented by numbers

Another way to handle term names is to convert them all to numbers, where Fall 2022 = 2022, Spring 2023 = 2022.5, or something similar:

One advantage of this method is that if we are tracking students in cohorts, we can easily calculate what year of school they are in:

This is a good way to track students over time and see how many are retained from a cohort at a given point in their enrollment history.