Defining a factor

In IR work, we often report on class standing: first year (or freshman), sophomore, junior, and senior for a four-year undergraduate program. Stored as strings, these sort alphabetically rather than by progression: freshman comes first, then junior, then senior, then sophomore. Converting the column to a factor type tells R the correct order. For more info on factors, see this page.

Let’s create this data frame to play with for a bit:

library(tidyverse)  # provides tibble(), mutate(), and arrange(); skip if already loaded

df <- tibble(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18, 18, 19, 20, 21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

We can sort by a column with the arrange() function, but because class is stored as text, it will sort alphabetically.

df |> arrange(class)

That’s not usually what we want, so we can convert class to a factor by calling factor() inside mutate(). We’ll overwrite the existing data frame with the new one, using df <- ...

df <- df |> 
    mutate(class = factor(class, 
                          levels = c("Freshman", "Sophomore", 
                                     "Junior", "Senior"),
                          ordered = TRUE))

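The most important part of the above is listing the levels in the right order: arrange() sorts a factor by the order of its levels, not alphabetically. Setting ordered to TRUE additionally tells R that the levels are ranked, which allows comparisons such as class >= "Junior". Now when sorting df by class, the rows will appear in the right order.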

df |> arrange(class)  # sort by class properly
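To check that a conversion like the one above did what we intended, it can help to inspect the factor directly. The following is a minimal base R sketch; class_chr and class_fct are hypothetical standalone vectors that mirror the class column, not objects defined elsewhere in this chapter.

```r
# Hypothetical vector mirroring the class column in df
class_chr <- c("Freshman", "Sophomore", "Junior", "Senior", "Senior")

# Convert to an ordered factor, listing levels in the order we want
class_fct <- factor(
  class_chr,
  levels  = c("Freshman", "Sophomore", "Junior", "Senior"),
  ordered = TRUE
)

levels(class_fct)      # the levels, in the order supplied
is.ordered(class_fct)  # TRUE
as.integer(class_fct)  # position of each value among the levels: 1 2 3 4 4

# Because the factor is ordered, comparisons against a level name work
upper_division <- class_fct >= "Junior"
upper_division         # FALSE FALSE TRUE TRUE TRUE
```

The integer codes are what R sorts on, which is why the level order controls the sort order.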

This trick is particularly useful when we want to make a chart or table where the reader will expect sorting in a particular way. You will see much more of factors as you progress in your work with R.
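As one illustration of the table case, here is a small sketch of a head count by class standing. It assumes dplyr is installed; the students data frame and the level_order vector are made up for this example.

```r
library(dplyr)  # assumption: dplyr (part of the tidyverse) is installed

# Hypothetical enrollment records, deliberately out of order
students <- tibble(
  class = c("Senior", "Freshman", "Junior", "Senior", "Sophomore")
)

# The order a reader would expect to see in a report
level_order <- c("Freshman", "Sophomore", "Junior", "Senior")

counts <- students |>
  mutate(class = factor(class, levels = level_order, ordered = TRUE)) |>
  count(class)  # one row per class, listed in level order

counts
```

Without the factor() step, the same count() call would list the rows alphabetically, with Sophomore last.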