library(tidyverse)
library(tidylog)
library(knitr)
library(kableExtra)
Application of pivoting to retention data
1 Using this document
Within this document are blocks of R
code. You can edit and execute this code as a way of practicing your R
skills:
- Edit the code that is shown in the box. Click on the
Run Code
button. - Make further edits and re-run that code. You can do this as often as you’d like.
- Click the
Start Over
button to bring back the original code if you’d like. - If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
Before you get started with this script, ensure that you have loaded the following packages:
2 Set up: Create some test data
A data frame
is like a table
in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.
The enroll
data frame has three rows and two columns. Here is how we defined this table:
<- tribble(~student_id, ~term,
enroll 234567, "Fall 2023",
236047, "Fall 2023",
234567, "Spring 2024")
Let’s validate that we did this correctly:
Verify for yourself that this is in a long format.
3 The process
The problem we have is that enrollment records keep track of students enrolling, but for retention purposes, we need to also know when they are not enrolling.
3.1 Create the retain
data frame
What we’d like to have for retention analysis is data like the following:
What we’ve done here is take advantage of R/tidyverse
’s ability to pivot to wide data and, in the process, fill in for missing values — in this case, those values represent the terms in which the student was not enrolled. How do we know they weren’t enrolled? Because it wasn’t represented in a row in the enroll
data frame.
Let’s go through the above script line-by-line:
retain <-
-
Whatever the result is, we’re going to save it to a new data frame named
retain
. enroll |>
- The data for the query comes from this data frame.
mutate(enrolled = 1) |>
-
Add a column named enroll, and make every row have a
1
in it. Why are we doing this, and why does it make sense? Well, every row inenroll
represents a student who enrolled the university during a specific term. So using a1
in theenrolled
column to represent a student who is enrolled, you might (correctly) guess that we’re going to use a0
to represent a student who is not enrolled. (Also, we could have just as easily usedTRUE
andFALSE
in a logical column forenrolled
.) pivot_wider(names_from..., values_from..., values_fill = 0)
-
Changing to a wide data format, each row is now going to represent a student with all all of the terms having a separate column. The values will be from the newly-created
enrolled
column (with all1
s in it for those existing student/term combinations). Thevalues_fill
argument here is where the magic is done. If a student/term combination is not in the source data frame, then the otherwise empty cell in the new one will be given a default value of0
(representing not enrolled).
3.2 Complete the long form
To save ourselves some typing in the following, we’re going to define the following vector:
If we want to include terms not found in the original data, we first complete all the cases using complete()
. This command, as shown in the following code block, do the following:
- For every
student_id
in the table, - ensure that there is a row for every
term
inall_valid_terms
- and set
enrolled
to0
for every row that did not already exist.
Take a look at the result:
3.3 Pivot to a wide form
This result is in long form and, as explained just above, has one row for each combination of student_id
and term
. Now we can finish this operation with pivot_wider()
:
As you might guess, we now have separate columns for the student_id
and every term listed in all_valid_terms
. We have a 1
in every cell corresponding to the rows in the original enroll
data frame above and a 0
in all other cells.
Note that by default R
sorts the columns by alphabetical order. We can change this by changing term
to a factor and specifying the levels in the order we want them to appear:
3.4 Alternative approach with terms represented by numbers
Another way to handle term names is to convert them all to numbers, where Fall 2022 = 2022, Spring 2023 = 2022.5, or something similar:
One advantage of this method is that if we are tracking students in cohorts, we can easily calculate what year of school they are in:
This is a good way to track students over time and see how many are retained from a cohort at a given point in their enrollment history.