Advanced lesson – Importance of ungrouping

1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

Edit the code that is shown in the box. Click on the Run Code button.
Make further edits and re-run that code. You can do this as often as you’d like.
Click the Start Over button to bring back the original code if you’d like.
If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

1.1 Using RStudio

If you’re following along with this exercise in RStudio, then you need to execute the following code in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading these libraries for you.

library(tidyverse)
library(tidylog)

The first line loads the tidyverse package. You could actually load just the packages dplyr and purrr to save on memory, but with today’s computers and the types of things that you’re doing at this point, you don’t need to worry about load speed and/or memory usage just yet.
The second line loads the tidylog package, which makes the dplyr and tidyr functions print a short note describing what each step did.

1.2 Set up: Create some test data

1.2.1 Description

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

If you’re running through this in RStudio, you should also run this code before you attempt to complete the following lesson.

In this lesson, we create a different data frame than we have previously used. This grades data frame represents the grades that students earned in specific sections of courses in specific subjects. It contains 1000 rows, thus representing 1000 grades earned by students. It has the following columns:

student_id: the student ID. The grades of 100 different students are within.
section_id: the section ID. The grades for 50 different sections are within.
subject: the name (acting as an identifier) of the subject of the section. The grades for four different subjects are within.
grade_points: the grades earned by the student in the section of the subject. Grade points range from 4.0 down to 0.0.
grade_letter: the letter grade that corresponds to the grade points earned. Letter grades are A, B, C, D, or F (with no + or -).

1.2.2 The code to generate data

Here is the code:

grades <- data.frame(record_id = 1:1000) |> 
  mutate(student_id = sample(1:100, 1000, replace = TRUE),
         section_id = sample(1:50,  1000, replace = TRUE),
         subject    = sample(c("Biology","Chemistry",
                               "Economics", "Psychology"), 
                             1000, 
                             replace = TRUE), 
         grade_points = round(4*rbeta(1000, 5, 1), 2),
         grade_letter = case_when(grade_points >= 3.9 ~ "A",
                                  grade_points >= 3.0 ~ "B",
                                  grade_points >= 2.0 ~ "C",
                                  grade_points >= 1.0 ~ "D",
                                  TRUE                ~ "F"))

Let’s review this code. If nothing else, it provides some good examples of functions (that we introduced in the functions lesson).

First, it is assigning the results of the whole operation to the grades data frame. The data.frame() call creates 1000 rows with a column named record_id being assigned successive values from 1 to 1000.

The result of the data.frame() operation is piped (see the pipe lesson) into the mutate() operation, which creates an additional five columns. The first three columns have their values set by the results of different calls to the sample() function. The fourth is the result of calling two functions (round() and rbeta()), while the fifth is the result of the case_when() function.

Let’s take a moment and explain each one of them.

sample(1:100, 1000, replace = TRUE): Read this as “Generate a vector of 1000 values taken from the range 1 to 100. Each value can be chosen more than one time.” In probabilistic terms, each value chosen is put back, or replaced, into the pool from which values are drawn.
sample(1:50, 1000, replace = TRUE): Read this as “Generate a vector of 1000 values taken from the range 1 to 50. Each value can be chosen more than one time.”

sample(c("Biology", "Chemistry", "Economics", "Psychology"),
       1000, replace = TRUE)

This time, instead of taking from a range of integers, sample() takes values from a vector of four strings. Thus, you should read this as “Generate a vector of 1000 strings taken from this vector of four subject names. Each value can be chosen more than one time.”
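To see what replace = TRUE buys us, here is a minimal sketch using only base R. The seed and the variable names (ids, no_replace) are assumptions made for illustration; they are not part of the lesson's code.

```r
# Fix the seed so the draws are repeatable (the seed value is arbitrary).
set.seed(1)

# 1000 draws from only 100 possible values: repeats are unavoidable.
ids <- sample(1:100, 1000, replace = TRUE)
length(ids)              # number of values drawn
range(ids)               # smallest and largest value drawn
anyDuplicated(ids) > 0   # TRUE: at least one value was drawn twice

# Without replacement, asking for more draws than there are values is an error.
# tryCatch() captures the error message instead of stopping the script.
no_replace <- tryCatch(sample(1:100, 1000),
                       error = function(e) conditionMessage(e))
no_replace
```

This is why the lesson's code needs replace = TRUE: it asks for 1000 values from pools of 100, 50, and 4 items.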

round(4*rbeta(1000, 5, 1), 2): The rbeta() function draws random values (the r in the name) from the beta distribution. The beta distribution takes two shape parameters, in this case 5 and 1; we are not going to go into the details of the distribution here. (Take a stats course!) This particular call to rbeta() generates a vector of 1000 values between 0 and 1 (because that’s how the beta distribution is defined). We then multiply each value by 4 since we want the grades to fall in the range from 0 to 4. Finally, we use the round() function to round each result to 2 decimal places.
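If you want to convince yourself of the ranges described above, the following sketch breaks the expression into steps. The seed and the intermediate name raw are assumptions for illustration. A beta distribution with shape parameters 5 and 1 has mean 5/6, so the scaled grades should average roughly 3.33.

```r
# Fix the seed so the draws are repeatable (the seed value is arbitrary).
set.seed(123)

raw          <- rbeta(1000, 5, 1)   # 1000 values between 0 and 1
grade_points <- round(4 * raw, 2)   # scale to 0..4, keep 2 decimal places

range(raw)            # stays within 0 and 1
range(grade_points)   # stays within 0 and 4
mean(grade_points)    # should land near 4 * 5/6
```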

case_when(grade_points >= 3.9 ~ "A",
           grade_points >= 3.0 ~ "B",
           grade_points >= 2.0 ~ "C",
           grade_points >= 1.0 ~ "D",
           TRUE ~ "F")

The case_when() function assigns a value (the one to the right of the tilde ~) from the first case (the phrase on the left of the tilde ~) which is true. Suppose that the value of grade_points is 2.4. Then the first case is not true; the second case is not true; the third case is true. Thus "C" is assigned to the value of grade_letter for that row.
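Here is a small sketch of the same case_when() applied to five hand-picked values, one per letter grade. The input vector is an assumption chosen for illustration; the conditions are copied from the lesson's code.

```r
library(dplyr)

# One hand-picked value for each letter grade, including the 2.4 discussed above.
grade_points <- c(3.95, 3.5, 2.4, 1.0, 0.2)

# Conditions are checked top to bottom; the first one that is TRUE wins.
grade_letter <- case_when(grade_points >= 3.9 ~ "A",
                          grade_points >= 3.0 ~ "B",
                          grade_points >= 2.0 ~ "C",
                          grade_points >= 1.0 ~ "D",
                          TRUE                ~ "F")

grade_letter   # "A" "B" "C" "D" "F"
```

Note that 1.0 earns a "D" because the comparison is >=, and that the final TRUE case catches everything that fell through the earlier ones.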

1.2.3 Address the categorical grade data

The grade_letter and subject columns contain categorical data, so let’s define those columns as factors (see the factors lesson):
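The lesson's own code for this step is not reproduced here, so the following is only a sketch of one way to do it. It assumes grade_letter should be an ordered factor running from F (low) to A (high) and that subject should be a plain, unordered factor. The data frame demo is a hypothetical stand-in; swap in grades when following along.

```r
library(dplyr)

# Hypothetical stand-in for grades with just the two categorical columns.
demo <- data.frame(subject      = c("Biology", "Chemistry", "Economics", "Psychology"),
                   grade_letter = c("A", "C", "F", "B"))

demo <- demo |>
  mutate(grade_letter = factor(grade_letter,
                               levels  = c("F", "D", "C", "B", "A"),
                               ordered = TRUE),   # low to high
         subject      = factor(subject))          # no natural order

levels(demo$grade_letter)    # "F" "D" "C" "B" "A"
is.ordered(demo$subject)     # FALSE
```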

1.2.4 Inspect the data

Let’s take a look at some of this data. You can see that the data frame contains 1000 rows of data.
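The lesson's own inspection code is not reproduced here. As a sketch, any of the following calls gives a quick look at a data frame. The data frame demo and the seed are assumptions for illustration; demo only mimics the row count and the first few columns of grades.

```r
library(dplyr)

# Hypothetical stand-in with the same number of rows as grades.
set.seed(2024)
demo <- data.frame(record_id = 1:1000) |>
  mutate(student_id = sample(1:100, 1000, replace = TRUE),
         section_id = sample(1:50,  1000, replace = TRUE))

nrow(demo)      # number of rows
head(demo, 5)   # first five rows
glimpse(demo)   # one line per column: name, type, first few values
```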

Now use the following query to gain some more insight into the contents of each of the columns.

Since grade_letter is categorical, you can see the distribution of letter grades from low to high. You can also see the distribution of subject values, though there is no concept of low to high for this column since it is not ordered.
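The query itself is not reproduced here. One base R function that behaves as described is summary(), which reports counts per level for factor columns, listed in level order. The sketch below uses a hypothetical stand-in named demo, and assumes the F-to-A ordering of grade_letter.

```r
# Hypothetical stand-in with the two factor columns only.
demo <- data.frame(
  subject      = factor(c("Biology", "Chemistry", "Biology", "Psychology")),
  grade_letter = factor(c("B", "A", "F", "B"),
                        levels = c("F", "D", "C", "B", "A"),
                        ordered = TRUE))

summary(demo)   # counts per level for every factor column

# For a single factor, the counts come back in level order (low to high here),
# including levels that never occur.
letter_counts <- summary(demo$grade_letter)
letter_counts
```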

2 Using `ungroup()`

2.1 Identifying how confusion might arise

Let’s first look at how we use group_by() and what it does.

The tidylog note highlights that one grouping variable (subject) is being used. This isn’t really much of a surprise, is it? You just declared it that way!

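But the tidylog note also says ungrouped after summarize(). Why is that? This is kind of confusing, but you can think of it in this way: summarize() collapses each group into a single row and drops the last grouping variable, so with only one grouping variable the result comes back ungrouped. On top of that, we never saved anything back to grades, so grades itself remains ungrouped from this point forward. Hmmm. We’re going to have to explore this more.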
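A sketch of the pattern being discussed, using a small hypothetical data frame named demo in place of grades (the column mean_points is also an assumption). The dplyr function group_vars() reports which grouping variables a data frame carries, which lets us check the claim without relying on the tidylog note.

```r
library(dplyr)

# Hypothetical stand-in for grades.
demo <- data.frame(
  subject      = c("Biology", "Biology", "Chemistry",
                   "Economics", "Psychology", "Psychology"),
  grade_points = c(3.9, 3.1, 2.4, 3.6, 1.8, 3.3))

# Group, then collapse each group to one row.
by_subject <- demo |>
  group_by(subject) |>
  summarize(mean_points = mean(grade_points))

by_subject               # one row per subject
group_vars(by_subject)   # character(0): the summary is ungrouped
group_vars(demo)         # character(0): demo itself was never changed
```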

If it is truly ungrouped, then the following count() should simply return a count of the rows in the whole data frame. Let’s check:

Sure enough — it’s ungrouped and tells us that the data frame contains 1000 rows.
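The same check as a sketch on a small hypothetical stand-in named demo (swap in grades when following along, where the answer is 1000):

```r
library(dplyr)

# Hypothetical stand-in for grades.
demo <- data.frame(
  subject      = c("Biology", "Biology", "Chemistry",
                   "Economics", "Psychology", "Psychology"),
  grade_points = c(3.9, 3.1, 2.4, 3.6, 1.8, 3.3))

# On an ungrouped data frame, count() returns a single row
# holding the total number of rows in column n.
total <- demo |> count()
total
```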

Let’s start looking at situations in which group_by() can cause real confusion.

The first command below groups the rows in grades by subject and saves the results to grades. The first tidylog note confirms that our grouping was successful.
The second command counts the rows (as executed in the previous code block) but gives a different result! Take a look:

The result of the count() now returns counts of rows grouped by subject. This isn’t that confusing given that these commands are one line after the other — but what if 2 weeks and hundreds of lines of code have come between writing the first command and writing the second one? You might think that you’re going crazy: “Why is this command, which I have executed hundreds of times, returning this wacky result?”

The reason is that you saved the grouping to the grades data frame, and it will not disappear until you apply (and save) the ungroup() command.
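The two commands can be sketched as follows on a small hypothetical stand-in named demo. The only difference from the earlier count() is that the grouped result is assigned back to the same name.

```r
library(dplyr)

# Hypothetical stand-in for grades.
demo <- data.frame(
  subject      = c("Biology", "Biology", "Chemistry",
                   "Economics", "Psychology", "Psychology"),
  grade_points = c(3.9, 3.1, 2.4, 3.6, 1.8, 3.3))

# Command 1: group AND save. The grouping now travels with demo.
demo <- demo |> group_by(subject)
group_vars(demo)   # "subject"

# Command 2: the same count() as before now counts within each subject.
per_subject <- demo |> count()
per_subject        # one row per subject instead of one row overall
```

If a result ever looks unexpectedly split up, running group_vars() on the data frame is a quick way to find out whether a saved grouping is the cause.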

2.2 Resolving the confusion

If we have a grouped data frame and we want to revert back, we can use ungroup():

The tidylog note confirms that the result of this command is ungrouped.

So let’s run the count() again and it should return the single row showing the total count of rows in the data frame:

WHAT?!?! It is still grouped! What is going on?!

What is going on is that we did not save the result of the ungroup() command above. Since we didn’t save it, the results did not persist with further commands.
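A sketch of the trap, again on a hypothetical stand-in named demo that starts out grouped by subject:

```r
library(dplyr)

# Hypothetical stand-in for grades, saved with a grouping attached.
demo <- data.frame(
  subject      = c("Biology", "Biology", "Chemistry",
                   "Economics", "Psychology", "Psychology"),
  grade_points = c(3.9, 3.1, 2.4, 3.6, 1.8, 3.3)) |>
  group_by(subject)

# ungroup() returns a NEW ungrouped data frame. Here it is only printed,
# never assigned, so demo itself is untouched.
demo |> ungroup()

group_vars(demo)              # still "subject"
still_grouped <- demo |> count()
still_grouped                 # still one row per subject
```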

Let’s try both of those commands again, but we’ll save the results of the ungroup() this time:

Sure enough, the ungroup() persisted and the count() returns the overall total number of rows in the data frame.
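The corrected version, sketched on the same kind of hypothetical stand-in. The only change is the assignment on the ungroup() line.

```r
library(dplyr)

# Hypothetical stand-in for grades, saved with a grouping attached.
demo <- data.frame(
  subject      = c("Biology", "Biology", "Chemistry",
                   "Economics", "Psychology", "Psychology"),
  grade_points = c(3.9, 3.1, 2.4, 3.6, 1.8, 3.3)) |>
  group_by(subject)

# Assign the ungrouped result back so that it persists.
demo <- demo |> ungroup()

group_vars(demo)          # character(0)
total <- demo |> count()
total                     # a single row with the overall row count
```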

We can group by multiple columns, and when we run the count() command we get the now expected result that the counts are grouped by those two columns.

If we want to get rid of a grouping of two variables, we only have to apply the ungroup() command one time. If it works as expected, then the following count() command should return the overall number of rows:

Sure enough, it did!
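Both steps can be sketched as follows. The data frame demo is a hypothetical stand-in, and subject and section_id are assumed to be the two grouping columns.

```r
library(dplyr)

# Hypothetical stand-in for grades.
demo <- data.frame(
  subject    = c("Biology", "Biology", "Chemistry",
                 "Economics", "Psychology", "Psychology"),
  section_id = c(1, 2, 1, 3, 2, 2))

# Group by two columns and save the grouping.
demo <- demo |> group_by(subject, section_id)
pair_counts <- demo |> count()
pair_counts               # one row per subject/section combination

# A single ungroup() removes both grouping variables at once.
demo <- demo |> ungroup()
total <- demo |> count()
total                     # a single row with the overall row count
```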

Let’s verify that it also works for three columns:

I think it’s safe to say that we only need to apply and save ungroup() once and it will get rid of all existing groupings.
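The three-column check, sketched on a hypothetical stand-in named demo. The choice of subject, section_id, and grade_letter as the three grouping columns is an assumption.

```r
library(dplyr)

# Hypothetical stand-in for grades.
demo <- data.frame(
  subject      = c("Biology", "Biology", "Chemistry", "Psychology"),
  section_id   = c(1, 2, 1, 2),
  grade_letter = c("A", "B", "C", "B"))

# Group by three columns and save the grouping.
demo <- demo |> group_by(subject, section_id, grade_letter)
vars_before <- group_vars(demo)
vars_before               # all three column names

# One ungroup(), saved, clears every grouping variable.
demo <- demo |> ungroup()
vars_after <- group_vars(demo)
vars_after                # character(0)
```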

3 Lesson to leave with

I think it’s pretty safe to say that you should be very careful if you ever save the result of a tidyverse command that applies a group_by(). If you have to do so, then you should ungroup() as soon as possible afterwards.

In general, if you can avoid saving the result of the application of the group_by(), then you should do so.
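As a sketch of the first piece of advice: some grouped operations, such as a grouped mutate(), keep their grouping in the result, so saving them is sometimes unavoidable. In that case, one option is to end the pipeline with ungroup() before the assignment happens. The data frame demo and the column rank_in_subject are hypothetical names used for illustration.

```r
library(dplyr)

# Hypothetical stand-in for grades.
demo <- data.frame(
  subject      = c("Biology", "Biology", "Chemistry",
                   "Economics", "Psychology", "Psychology"),
  grade_points = c(3.9, 3.1, 2.4, 3.6, 1.8, 3.3))

# Rank each grade within its subject (1 = highest), then drop the grouping
# in the same pipeline so the saved result carries no hidden state.
ranked <- demo |>
  group_by(subject) |>
  mutate(rank_in_subject = min_rank(desc(grade_points))) |>
  ungroup()

ranked
group_vars(ranked)   # character(0)
```

Because the grouping never outlives the pipeline, a count() written weeks later on ranked behaves the way you expect.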