library(tidyverse)
library(tidylog)
Groups and Summaries
group_by() and summary functions
1 Using this document
Within this document are blocks of R
code. You can edit and execute this code as a way of practicing your R
skills:
- Edit the code that is shown in the box. Click on the
Run Code
button. - Make further edits and re-run that code. You can do this as often as you’d like.
- Click the
Start Over
button to bring back the original code if you’d like. - If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
1.1 Using RStudio
If you’re following along with this exercise in RStudio
, then you need to execute the following code in the Console
. If you are going through this set of exercises within this document, you don’t have to do so because we are loading this libraries for you.
- The first line loads the
tidyverse
package. You could actually load just the packagesdplyr
andpurrr
to save on memory, but with today’s computers and the types of things that you’re doing at this point, you don’t need to worry about load speed and/or memory usage just yet. - The second package tells
R
to give more detailed messages.
2 Set up: Create some test data
A data frame
is like a table
in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.
<- data.frame(record_id = 1:1000) |>
grades mutate(student_id = sample(1:100, 1000, replace = TRUE),
section_id = sample(1:50, 1000, replace = TRUE),
subject = sample(c("Biology","Chemistry",
"Economics", "Psychology"),
1000,
replace = TRUE),
grade_points = round(4*rbeta(1000, 5, 1), 1),
grade_letter = case_when(grade_points >= 4.0 ~ "A",
>= 3.0 ~ "B",
grade_points >= 2.0 ~ "C",
grade_points >= 1.0 ~ "D",
grade_points TRUE ~ "F"))
<-
grades |>
grades mutate(subject = factor(subject,
levels = c("Biology","Chemistry",
"Economics", "Psychology"))
<-
grades |>
grades mutate(grade_letter = factor(grade_letter,
levels = c("F", "D", "C", "B", "A"),
ordered = TRUE))
Explaining some of the R
used to create the data
The code above uses some probably unfamiliar R
function calls that you do not need to know but that you might find interesting to know exists.
sample(data, num, R)
-
This function generates
num
random samples fromdata
. IfR
isTRUE
(as it is here), then once each value is sampled, it is put back in the data set and can be chosen again. Official documentation rbeta(num, A, B)
-
This function generates
num
values from a beta distribution with parametersA
andB
. Official documentation case_when(cond ~ val, ..., TRUE ~ def)
-
We originally introduced this function in the
mutate()
lesson. The structure of this particular function call is different in that the last condition (TRUE
) serves as the default condition because it will catch all cases that the first four conditions did not catch. That is, if it is the case thatgrade_points < 1.0
, thengrade_letter
will be assigned the valueF
. factor()
-
We introduced the
factor()
operator in thefactors
lesson. It is used to create categorical data such as contained here insubject
andgrade_level
.
Again, you do not understand how all of this works now, but you should understand that R
can be used to generate test data that you can use to validate any work that you do, and that R/tidyverse
can help you with this.
In this lesson, we create a different data frame than we have previously used. This grades
data frame represents the grades that students earned in specific sections of courses in specific subjects. It contains 1000 rows, thus representing 1000 grades earned by students. It has the following columns:
student_id
- the student ID. The grades of 100 different students are within.
section_id
- the section ID. The grades for 50 different sections are within.
subject
- the name (acting as an identifier) of the subject of the section. The grades for four different subjects are within.
grade_points
- the grades earned by the student in the section of the subject. Grade points range from 4.0 down to 0.0.
grade_letter
-
the letter grade that corresponds to the grade points earned. Letter grades can be anything from
A
toF
(with no+
or-
).
The two factor
s (that is, two columns containing categorical data) differ in significant ways and highlight the major decision that you have to make when defining a factor — is the factor
ordered, and, if so, what is the order?
subject
: These four subjects are not ordered; that is, saying that"Biology" > "Economics"
does not make sense. Thus,ordered
is not set toTRUE
(and is, therefore, set to its default value ofFALSE
).grade_letter
: These five letter grades are ordered. It is the case that it makes sense to say"B" > "D"
. Thus, you setordered
toTRUE
and definelevels
from the minimum value"F"
to the maximum value"A"
.
Let’s take a look at some of this data. From this display you can see that the data frame contains 1000 rows of data.
Note that you cannot tell from the listing of data that subject
and grade_letter
are factor
s.
Now use the following query to gain some more insight over the contents of each of the columns.
One of the benefits of defining subject
and grade_letter
as categorical data (with factor()
) is that functions such as summary()
(as shown here) present more information than it could otherwise. Specifically, it shows the count for each value in the column (in its value order, if that makes sense).
Test your knowledge
Before running the str()
function on grades
in the code block below, predict what information it will show for each of the columns.
What do you now know that you didn’t learn from the summary()
output above?
Solution
A few insights can be gained from the str()
output that could not be gained from the summary()
output:
- You can easily see that the data frame has 1000 observations (or rows).
- You can easily see that both
subject
andgrade_letter
are factors. You can actually determine this from thesummary()
output since the counts of each of those factors are meaningful, saying something other than this data is character data (paraphrasing). But this is explicitly represented in thestr()
output. - You can also easily see both that
grade_letter
is anordered
factor
and what the order of values are within thatfactor
. This cannot be determined in any way from thesummary()
output.
3 Summary statistics
In this lesson, we are going to investigate different ways that we can calculate summary statistics:
- Simple, vector-based calculations (Section 3.1)
- Compound,
tidyverse
calculations (Section 3.2)
3.1 Vector-based calculations
For simple, vector-based calculations, we have two different use cases:
- Basic math {Section 3.1.1}, and
- Pre-defined functions {Section 3.1.2}
Let’s look at details for each case.
3.1.1 Basic math
In R
, calculations on vectors are performed element-wise, meaning that basic arithmetic operations like addition, subtraction, multiplication, and division are applied to corresponding elements between two vectors of the same length; if the vectors have different lengths, R will “recycle” the shorter vector to match the length of the longer one. There are not many situations in your work in which you will want to perform math on vectors of different lengths so you should guard against this.
Let’s go through some simple examples. We first define the vectors that we use in the examples:
Notice a little trick that we’ve added here in order to get R
to print out the value of the assignment: we’ve added parentheses around the assignment statement. This tells R
to print the value. Go ahead and do the same for the y
assignment in the code block above.
Using simple vectors in place of data frame columns
Realize that while we are using simple vectors here, we could just as easily be working with data frame columns (which are always, by definition, vectors).
3.1.1.1 Length
This first example shows how to determine the length of the vector and test for the equivalence of the length of two vectors.
3.1.1.2 Addition
The following shows addition with vectors and scalars (the name for a regular number). The first and third examples show what happens when you add a vector and a scalar; the second example shows the result of adding two vectors.
Here, as in the following examples, the following rules are in force:
- When a scalar is added to a vector, the scalar is added to each element of the vector.
- When two vectors are added, the pair-wise elements of the vectors are added in order to form a new vector.
3.1.1.3 Subtraction
This time we show the results of subtracting vectors and scalars. The same rules apply for how to combine vectors and scalars.
3.1.1.4 Multiplication
This time we show the results of multiplying vectors and scalars. The same rules apply for how to combine vectors and scalars.
3.1.1.5 Division
This time we show the results of dividing vectors and scalars. The same rules apply for how to combine vectors and scalars.
Test your knowledge
Write an expression that adds 5
to each element of x
.
Solution
We came up with the following two ways to complete this assignment.
This first way simply adds the scalar value 5
to the vector, thus adding 5
to each element of the vector.
The second way uses a function rep()
to which you have not been introduced. This function has many options and complexities (you can see lots of examples here), but the simplest approach uses the argument each
: it returns a vector, having repeated the vector in the first argument each
times. Thus, we could do the following:
To make this approach a bit more flexible, we can replace the 3
in the second argument with length(x)
, as shown here:
Test your knowledge
Write an expression that adds 1
to the first element of x
, 2
to the second element, and 3
to the third.
Solution
Again, we came up with two basic approaches to complete this exercise.
The first approach simply adds the appropriate vector to x
, as follows:
The second approach uses the :
operator (which returns a vector) in the following way:
Again, to make this approach more flexible, we replace the 3
with the length()
function:
3.1.2 Pre-defined functions
R
has many pre-defined functions for performing operations on vectors. This is only a small sampling of them, though they are undoubtably ones that you will use often.
3.1.2.1 Mean (average)
The mean()
function calculates the average of all the elements of a vector.
3.1.2.2 Standard deviation (a measure of spread)
The sd()
function calculates the standard deviation of the elements of a vector.
3.1.2.3 Sum
The sum()
function calculates the arithmetic sum of all the elements of a vector.
3.1.2.4 Maximum and minimum
The max()
function returns the largest vector element. The min()
function returns the smallest vector element.
3.1.2.5 Rounding a vector
The round()
function rounds the individual elements of a vector with the same rules it uses to round a scalar value.
3.2 Compound, tidyverse
calculations
We are now going to explore something completely different — how to calculate summary statistics using the tidyverse
.
3.2.1 The general form of the approach
The previous approach used vectors as its input. This approach starts with data frames and uses summarize()
to calculate the statistics. The general form is as follows:
|>
DATA_FRAME summarize(NAME1 = FUNCTION1(),
NAME2 = FUNCTION2(),
...)
It is common practice to rename the calculated columns (as with NAME1
and NAME2
above); however, you don’t have to do this.
Here is an example that calculates the count, minimum, mean, median, maximum, and standard error but does not rename the columns:
The column headers are not the easiest to interpret; thus, we generally rename the columns, as in the following (which calculates the same statistics as the previous example):
You will have to decide whether or not this renaming is worth your effort.
3.2.2 Common functions called within summarize()
Common functions that can be called from within summarize()
are as follows:
mean(COL, na.rm = FALSE)
: the arithmetic average ofCOL
.median(COL, na.rm = FALSE)
: the median value ofCOL
.sd(COL, na.rm = FALSE)
: the standard deviation of the values inCOL
.min(COL, na.rm = FALSE)
: the minimum value inCOL
.max(COL, na.rm = FALSE)
: the maximum value inCOL
.first(COL, default = NULL, na_rm = FALSE)
: the first value inCOL
. If the first position has the valueNA
, thenNA
is returned unless thedefault
value has been set (and then that value is returned).last(COL, default = NULL, na_rm = FALSE)
: the last value inCOL
. Same use ofdefault
asfirst()
.nth(COL, default = NULL, na_rm = FALSE)
: then
th value inCOL
. Same use ofdefault
asfirst()
.n()
: the current group size. If no group has been defined, then it returns the number of rows.n_distinct(COL1, COL2, ... na.rm = FALSE)
: the number of unique values (that is, combination of values) in the set ofCOL
s.
For all of the indicated functions above, the default value of na.rm
(or na_rm
, read as remove NA
values) of FALSE
indicates that NA
values are included in the calculation. If you want them to be excluded, then set na.rm
to TRUE
.
3.2.3 Demonstration of na.rm
Consider the following data frame with one column (no_nulls
) that does not have NA
values and one column (some_nulls
) that has one NA
value:
The following calculation on the no_nulls
column that removes the (non-existent) NA
values does not return any surprises (since the na.rm
setting has no effect in this case):
In the above with the len
column, we use the following calculation which calculates the number of non-NA
values in the column no_nulls
:
sum(!is.na(no_nulls))
Let’s review how this operates:
- It is operating on the
no_nulls
column. - For each value in
no_nulls
, it applies theis.na()
function which returnsTRUE
(1
) if the value isNA
andFALSE
(0
) if the value is notNA
. - The
!
(not) operator simply flips the value that it is given. Thus,!is.na()
returnsTRUE
(1
) if the value is notNA
andFALSE
(0
) if the value isNA
. - The
sum()
operator returns the total of all of the values that it is given — in this case, it returns the count of the number of non-NA
values inno_nulls
.
That is a pretty nice trick.
Let’s go back and look at the values of num_rows
, len
, and distinct
:
num_rows
: this returns the number of rows in the group and, since there are no groups defined, it returns the total number of rows.len
: this returns the number of non-NA
values in the column and, since there are noNA
values inno_nulls
, it returns the number of rows in the column.distinct
:
The following performs the same calculations on the some_nulls
column while uising na.rm
to tell the functions to ignore NA
values. The results look as we would expect:
Again, let’s look at the values of these following:
num_rows
: the value returned by this does not change — it is the number of rows.len
: this returns the number of non-NA
values in the column, thus3
.distinct
: this returns the number of distinct non-NA
values in the column (sincena.rm
isTRUE
), thus3
.
In the following, we now do not tell the functions to ignore NA
values. This leads to some differences. See if you can spot them.
Look at the output of this code block and compare it to the one before it:
- The functions
mean
,median
, andsd
do not return any value if any one value in the vector is anNA
value. - When
NA
values are removed,n_distinct
returns the number of non-NA
values. When they are not removed, thenNA
is counted as one distinct value.
One more time, let’s look at the values of these following:
num_rows
: again, this simply returns the number of rows, thus5
.len
: this returns the number of non-NA
values, thus3
.distinct
: this returns the number of distinct values in the column (includingNA
values sincena.rm
isFALSE
), thus4
.
Test your knowledge
Write a query that calculates the average and median values of grade_points
using summarize()
.
Solution
This is a straight-forward usage of the summarize()
function. We explicitly set na.rm
to TRUE
just in case the column contains NA
values.
Test your knowledge
Write a query that calculates the average value of grade_points
using summarize()
but without using the mean()
function.
Solution
Let’s highlight a few points:
- We know that the
grade_points
column does not contain anyNA
values. - An average value is calculated as the sum of the column divided by the number of values in the column.
- The sum is calculated with
sum(grade_points)
. - The *number of values is calculated with
sum(!is.na(grade_points))
.
Thus we have the following query:
Another way of determining the number of values in a column (when it contains zero NA
values) is n()
. Thus, we can use the following query:
Test your knowledge
Write a query that calculates the average grade points for Economics classes.
Solution
We first need to apply filter()
so that we only include those rows for Economics classes. After that, we can calculate the average grade as usual:
3.3 Compound, grouped tidyverse
calculations
Throughout this lesson we have been working up to this section. We know from the previous section how to use summarize()
to calculate statistics over the whole data frame (or whatever filtered subset that has been defined). Now we are going to explore how we can calculate statistics grouped by the values in another (single or multiple) column.
To make it clearer what we’re talking about, here is the example that we used above, but this time we are calculating the statistics while defining subgroups based on the value in the subject
column:
Instead of getting one row of calculated results, we now get a separate row for each value in the subject
column.
The form of this calculation is the same as the above, only with an additional group_by()
function preceding the summarize()
:
|>
DATA_FRAME group_by(COL_A, ...) |>
summarize(NAME1 = FUNCTION1(),
NAME2 = FUNCTION2(),
...)
Test your knowledge
Write a query that calculates the average of grade_points
by letter grade. Before running the query, estimate how many rows you think the result will contain. Also, what will be in the first row of the results?
Solution
This is a simple application of the combination of group_by()
and summarize()
. We want the results to be grouped by grade_letter
, so we use that column in the group_by()
function. We want to calculate the average of grade_points
, so we use the mean()
function (with round()
to simplify the display) on it.
We see five rows in the output since there are five separate letter grades, and F
is first since it has the lowest value of all the grades.
Test your knowledge
Write a query that counts the number of students given specific letter grades by subject. Again, before running the query, estimate how many rows the result will contain.
Solution
This is a bit more complicated that what we have previously done because we are grouping by two columns — subject
and grade_letter
. We have chosen to sort by grade_letter
within each subject
, so we list subject
first within group_by()
.
Notice that the results contain not (5x4=20
) twenty rows but eighteen because two of the subjects do not contain any F
grades. You could not have know this. It was useful to estimate the number of rows in any case just to know what to expect when interpreting the result.