Groups and Summaries

group_by() and summary functions

1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

1.1 Using RStudio

If you’re following along with this exercise in RStudio, then you need to execute the following code in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading this libraries for you.

library(tidyverse)
library(tidylog)
  • The first line loads the tidyverse package. You could actually load just the packages dplyr and purrr to save on memory, but with today’s computers and the types of things that you’re doing at this point, you don’t need to worry about load speed and/or memory usage just yet.
  • The second package tells R to give more detailed messages.

2 Set up: Create some test data

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

grades <- data.frame(record_id = 1:1000) |> 
  mutate(student_id = sample(1:100, 1000, replace = TRUE),
         section_id = sample(1:50,  1000, replace = TRUE),
         subject    = sample(c("Biology","Chemistry",
                               "Economics", "Psychology"), 
                             1000, 
                             replace = TRUE), 
         grade_points = round(4*rbeta(1000, 5, 1), 1),
         grade_letter = case_when(grade_points >= 4.0 ~ "A",
                                  grade_points >= 3.0 ~ "B",
                                  grade_points >= 2.0 ~ "C",
                                  grade_points >= 1.0 ~ "D",
                                  TRUE                ~ "F"))
grades <-
  grades |> 
    mutate(subject = factor(subject, 
                            levels = c("Biology","Chemistry",
                                       "Economics", "Psychology"))
grades <-
  grades |>
    mutate(grade_letter = factor(grade_letter,
                                 levels = c("F", "D", "C", "B", "A"),
                                 ordered = TRUE))
Tip

Explaining some of the R used to create the data

The code above uses some probably unfamiliar R function calls that you do not need to know but that you might find interesting to know exists.

sample(data, num, R)
This function generates num random samples from data. If R is TRUE (as it is here), then once each value is sampled, it is put back in the data set and can be chosen again. Official documentation
rbeta(num, A, B)
This function generates num values from a beta distribution with parameters A and B. Official documentation
case_when(cond ~ val, ..., TRUE ~ def)
We originally introduced this function in the mutate() lesson. The structure of this particular function call is different in that the last condition (TRUE) serves as the default condition because it will catch all cases that the first four conditions did not catch. That is, if it is the case that grade_points < 1.0, then grade_letter will be assigned the value F.
factor()
We introduced the factor() operator in the factors lesson. It is used to create categorical data such as contained here in subject and grade_level.

Again, you do not understand how all of this works now, but you should understand that R can be used to generate test data that you can use to validate any work that you do, and that R/tidyverse can help you with this.

In this lesson, we create a different data frame than we have previously used. This grades data frame represents the grades that students earned in specific sections of courses in specific subjects. It contains 1000 rows, thus representing 1000 grades earned by students. It has the following columns:

student_id
the student ID. The grades of 100 different students are within.
section_id
the section ID. The grades for 50 different sections are within.
subject
the name (acting as an identifier) of the subject of the section. The grades for four different subjects are within.
grade_points
the grades earned by the student in the section of the subject. Grade points range from 4.0 down to 0.0.
grade_letter
the letter grade that corresponds to the grade points earned. Letter grades can be anything from A to F (with no + or -).

The two factors (that is, two columns containing categorical data) differ in significant ways and highlight the major decision that you have to make when defining a factor — is the factor ordered, and, if so, what is the order?

  • subject: These four subjects are not ordered; that is, saying that "Biology" > "Economics" does not make sense. Thus, ordered is not set to TRUE (and is, therefore, set to its default value of FALSE).
  • grade_letter: These five letter grades are ordered. It is the case that it makes sense to say "B" > "D". Thus, you set ordered to TRUE and define levels from the minimum value "F" to the maximum value "A".

Let’s take a look at some of this data. From this display you can see that the data frame contains 1000 rows of data.

Note that you cannot tell from the listing of data that subject and grade_letter are factors.

Now use the following query to gain some more insight over the contents of each of the columns.

One of the benefits of defining subject and grade_letter as categorical data (with factor()) is that functions such as summary() (as shown here) present more information than it could otherwise. Specifically, it shows the count for each value in the column (in its value order, if that makes sense).

Test your knowledge

Before running the str() function on grades in the code block below, predict what information it will show for each of the columns.

What do you now know that you didn’t learn from the summary() output above?

Solution

A few insights can be gained from the str() output that could not be gained from the summary() output:

  1. You can easily see that the data frame has 1000 observations (or rows).
  2. You can easily see that both subject and grade_letter are factors. You can actually determine this from the summary() output since the counts of each of those factors are meaningful, saying something other than this data is character data (paraphrasing). But this is explicitly represented in the str() output.
  3. You can also easily see both that grade_letter is an ordered factor and what the order of values are within that factor. This cannot be determined in any way from the summary() output.

3 Summary statistics

In this lesson, we are going to investigate different ways that we can calculate summary statistics:

3.1 Vector-based calculations

For simple, vector-based calculations, we have two different use cases:

Let’s look at details for each case.

3.1.1 Basic math

In R, calculations on vectors are performed element-wise, meaning that basic arithmetic operations like addition, subtraction, multiplication, and division are applied to corresponding elements between two vectors of the same length; if the vectors have different lengths, R will “recycle” the shorter vector to match the length of the longer one. There are not many situations in your work in which you will want to perform math on vectors of different lengths so you should guard against this.

Let’s go through some simple examples. We first define the vectors that we use in the examples:

Notice a little trick that we’ve added here in order to get R to print out the value of the assignment: we’ve added parentheses around the assignment statement. This tells R to print the value. Go ahead and do the same for the y assignment in the code block above.

Tip

Using simple vectors in place of data frame columns

Realize that while we are using simple vectors here, we could just as easily be working with data frame columns (which are always, by definition, vectors).

3.1.1.1 Length

This first example shows how to determine the length of the vector and test for the equivalence of the length of two vectors.

3.1.1.2 Addition

The following shows addition with vectors and scalars (the name for a regular number). The first and third examples show what happens when you add a vector and a scalar; the second example shows the result of adding two vectors.

Here, as in the following examples, the following rules are in force:

  • When a scalar is added to a vector, the scalar is added to each element of the vector.
  • When two vectors are added, the pair-wise elements of the vectors are added in order to form a new vector.
3.1.1.3 Subtraction

This time we show the results of subtracting vectors and scalars. The same rules apply for how to combine vectors and scalars.

3.1.1.4 Multiplication

This time we show the results of multiplying vectors and scalars. The same rules apply for how to combine vectors and scalars.

3.1.1.5 Division

This time we show the results of dividing vectors and scalars. The same rules apply for how to combine vectors and scalars.

Test your knowledge

Write an expression that adds 5 to each element of x.

Solution

We came up with the following two ways to complete this assignment.

This first way simply adds the scalar value 5 to the vector, thus adding 5 to each element of the vector.

The second way uses a function rep() to which you have not been introduced. This function has many options and complexities (you can see lots of examples here), but the simplest approach uses the argument each: it returns a vector, having repeated the vector in the first argument each times. Thus, we could do the following:

To make this approach a bit more flexible, we can replace the 3 in the second argument with length(x), as shown here:

Test your knowledge

Write an expression that adds 1 to the first element of x, 2 to the second element, and 3 to the third.

Solution

Again, we came up with two basic approaches to complete this exercise.

The first approach simply adds the appropriate vector to x, as follows:

The second approach uses the : operator (which returns a vector) in the following way:

Again, to make this approach more flexible, we replace the 3 with the length() function:

3.1.2 Pre-defined functions

R has many pre-defined functions for performing operations on vectors. This is only a small sampling of them, though they are undoubtably ones that you will use often.

3.1.2.1 Mean (average)

The mean() function calculates the average of all the elements of a vector.

3.1.2.2 Standard deviation (a measure of spread)

The sd() function calculates the standard deviation of the elements of a vector.

3.1.2.3 Sum

The sum() function calculates the arithmetic sum of all the elements of a vector.

3.1.2.4 Maximum and minimum

The max() function returns the largest vector element. The min() function returns the smallest vector element.

3.1.2.5 Rounding a vector

The round() function rounds the individual elements of a vector with the same rules it uses to round a scalar value.

3.2 Compound, tidyverse calculations

We are now going to explore something completely different — how to calculate summary statistics using the tidyverse.

3.2.1 The general form of the approach

The previous approach used vectors as its input. This approach starts with data frames and uses summarize() to calculate the statistics. The general form is as follows:

DATA_FRAME |>
  summarize(NAME1 = FUNCTION1(),
            NAME2 = FUNCTION2(),
            ...)

It is common practice to rename the calculated columns (as with NAME1 and NAME2 above); however, you don’t have to do this.

Here is an example that calculates the count, minimum, mean, median, maximum, and standard error but does not rename the columns:

The column headers are not the easiest to interpret; thus, we generally rename the columns, as in the following (which calculates the same statistics as the previous example):

You will have to decide whether or not this renaming is worth your effort.

3.2.2 Common functions called within summarize()

Common functions that can be called from within summarize() are as follows:

  • mean(COL, na.rm = FALSE): the arithmetic average of COL.
  • median(COL, na.rm = FALSE): the median value of COL.
  • sd(COL, na.rm = FALSE): the standard deviation of the values in COL.
  • min(COL, na.rm = FALSE): the minimum value in COL.
  • max(COL, na.rm = FALSE): the maximum value in COL.
  • first(COL, default = NULL, na_rm = FALSE): the first value in COL. If the first position has the value NA, then NA is returned unless the default value has been set (and then that value is returned).
  • last(COL, default = NULL, na_rm = FALSE): the last value in COL. Same use of default as first().
  • nth(COL, default = NULL, na_rm = FALSE): the nth value in COL. Same use of default as first().
  • n(): the current group size. If no group has been defined, then it returns the number of rows.
  • n_distinct(COL1, COL2, ... na.rm = FALSE): the number of unique values (that is, combination of values) in the set of COLs.

For all of the indicated functions above, the default value of na.rm (or na_rm, read as remove NA values) of FALSE indicates that NA values are included in the calculation. If you want them to be excluded, then set na.rm to TRUE.

3.2.3 Demonstration of na.rm

Consider the following data frame with one column (no_nulls) that does not have NA values and one column (some_nulls) that has one NA value:

The following calculation on the no_nulls column that removes the (non-existent) NA values does not return any surprises (since the na.rm setting has no effect in this case):

In the above with the len column, we use the following calculation which calculates the number of non-NA values in the column no_nulls:

sum(!is.na(no_nulls))

Let’s review how this operates:

  1. It is operating on the no_nulls column.
  2. For each value in no_nulls, it applies the is.na() function which returns TRUE (1) if the value is NA and FALSE (0) if the value is not NA.
  3. The ! (not) operator simply flips the value that it is given. Thus, !is.na() returns TRUE (1) if the value is not NA and FALSE (0) if the value is NA.
  4. The sum() operator returns the total of all of the values that it is given — in this case, it returns the count of the number of non-NA values in no_nulls.

That is a pretty nice trick.

Let’s go back and look at the values of num_rows, len, and distinct:

  • num_rows: this returns the number of rows in the group and, since there are no groups defined, it returns the total number of rows.
  • len: this returns the number of non-NA values in the column and, since there are no NA values in no_nulls, it returns the number of rows in the column.
  • distinct:

The following performs the same calculations on the some_nulls column while uising na.rm to tell the functions to ignore NA values. The results look as we would expect:

Again, let’s look at the values of these following:

  • num_rows: the value returned by this does not change — it is the number of rows.
  • len: this returns the number of non-NA values in the column, thus 3.
  • distinct: this returns the number of distinct non-NA values in the column (since na.rm is TRUE), thus 3.

In the following, we now do not tell the functions to ignore NA values. This leads to some differences. See if you can spot them.

Look at the output of this code block and compare it to the one before it:

  • The functions mean, median, and sd do not return any value if any one value in the vector is an NA value.
  • When NA values are removed, n_distinct returns the number of non-NA values. When they are not removed, then NA is counted as one distinct value.

One more time, let’s look at the values of these following:

  • num_rows: again, this simply returns the number of rows, thus 5.
  • len: this returns the number of non-NA values, thus 3.
  • distinct: this returns the number of distinct values in the column (including NA values since na.rm is FALSE), thus 4.

Test your knowledge

Write a query that calculates the average and median values of grade_points using summarize().

Solution

This is a straight-forward usage of the summarize() function. We explicitly set na.rm to TRUE just in case the column contains NA values.

Test your knowledge

Write a query that calculates the average value of grade_points using summarize() but without using the mean() function.

Solution

Let’s highlight a few points:

  • We know that the grade_points column does not contain any NA values.
  • An average value is calculated as the sum of the column divided by the number of values in the column.
  • The sum is calculated with sum(grade_points).
  • The *number of values is calculated with sum(!is.na(grade_points)).

Thus we have the following query:

Another way of determining the number of values in a column (when it contains zero NA values) is n(). Thus, we can use the following query:

Test your knowledge

Write a query that calculates the average grade points for Economics classes.

Solution

We first need to apply filter() so that we only include those rows for Economics classes. After that, we can calculate the average grade as usual:

3.3 Compound, grouped tidyverse calculations

Throughout this lesson we have been working up to this section. We know from the previous section how to use summarize() to calculate statistics over the whole data frame (or whatever filtered subset that has been defined). Now we are going to explore how we can calculate statistics grouped by the values in another (single or multiple) column.

To make it clearer what we’re talking about, here is the example that we used above, but this time we are calculating the statistics while defining subgroups based on the value in the subject column:

Instead of getting one row of calculated results, we now get a separate row for each value in the subject column.

The form of this calculation is the same as the above, only with an additional group_by() function preceding the summarize():

DATA_FRAME |>
  group_by(COL_A, ...) |>
  summarize(NAME1 = FUNCTION1(),
            NAME2 = FUNCTION2(),
            ...)

Test your knowledge

Write a query that calculates the average of grade_points by letter grade. Before running the query, estimate how many rows you think the result will contain. Also, what will be in the first row of the results?

Solution

This is a simple application of the combination of group_by() and summarize(). We want the results to be grouped by grade_letter, so we use that column in the group_by() function. We want to calculate the average of grade_points, so we use the mean() function (with round() to simplify the display) on it.

We see five rows in the output since there are five separate letter grades, and F is first since it has the lowest value of all the grades.

Test your knowledge

Write a query that counts the number of students given specific letter grades by subject. Again, before running the query, estimate how many rows the result will contain.

Solution

This is a bit more complicated that what we have previously done because we are grouping by two columns — subject and grade_letter. We have chosen to sort by grade_letter within each subject, so we list subject first within group_by().

Notice that the results contain not (5x4=20) twenty rows but eighteen because two of the subjects do not contain any F grades. You could not have know this. It was useful to estimate the number of rows in any case just to know what to expect when interpreting the result.