library(tidyverse)
library(tidylog)
Advanced lesson – Logical expressions to compute proportions
group_by() and summary functions
1 Using this document
Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:
- Edit the code that is shown in the box. Click on the Run Code button.
- Make further edits and re-run that code. You can do this as often as you’d like.
- Click the Start Over button to bring back the original code if you’d like.
- If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
1.1 Using RStudio
If you’re following along with this exercise in RStudio, then you need to execute the two library() calls shown at the top of this document in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading these libraries for you.
- The first line loads the tidyverse package. You could actually load just the packages dplyr and purrr to save on memory, but with today’s computers and the types of things that you’re doing at this point, you don’t need to worry about load speed or memory usage just yet.
- The second line loads the tidylog package, which tells R to give more detailed messages.
1.2 Set up: Create some test data
A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.
The following data frame is explained in great detail in the ungroup() lesson.
# Build 1000 simulated grade records
grades <- data.frame(record_id = 1:1000) |>
  mutate(student_id = sample(1:100, 1000, replace = TRUE),
         section_id = sample(1:50, 1000, replace = TRUE),
         subject = sample(c("Biology", "Chemistry",
                            "Economics", "Psychology"),
                          1000,
                          replace = TRUE),
         grade_points = round(4 * rbeta(1000, 5, 1), 1),
         grade_letter = case_when(grade_points >= 4.0 ~ "A",
                                  grade_points >= 3.0 ~ "B",
                                  grade_points >= 2.0 ~ "C",
                                  grade_points >= 1.0 ~ "D",
                                  TRUE ~ "F"))
# Store subject as a factor with explicit levels
grades <-
  grades |>
  mutate(subject = factor(subject,
                          levels = c("Biology", "Chemistry",
                                     "Economics", "Psychology")))
# Store grade_letter as an ordered factor, from lowest to highest
grades <-
  grades |>
  mutate(grade_letter = factor(grade_letter,
                               levels = c("F", "D", "C", "B", "A"),
                               ordered = TRUE))
2 Finding proportions using logical expressions
2.1 The tidyverse command
To find the proportion of grades of at least 3.0 (a B or better), we could do this:
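The command described here can be sketched as follows. This is a minimal reconstruction, not necessarily the lesson's exact code: grades_demo is a hypothetical 8-row stand-in for grades (so the result is reproducible), and the column name prop_high is an assumption.

```r
library(dplyr)

# Hypothetical 8-row stand-in for the lesson's `grades` data frame
grades_demo <- data.frame(
  subject = rep(c("Biology", "Chemistry", "Economics", "Psychology"), each = 2),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.2, 4.0, 3.1)
)

# mean() of a logical vector is the share of TRUE values,
# so this is the proportion of grades that are at least 3.0
prop_overall <- grades_demo |>
  summarize(prop_high = mean(grade_points >= 3.0))

prop_overall
```

With the lesson's own data, replace grades_demo with grades.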
Nicely formatting table output
The kable() function provides an easy way to apply some nice formatting to your table outputs. (You can read about a lot of its options on this page.) Before using it, you need to run library(knitr) within RStudio, because loading tidyverse does not attach knitr for you.
During this lesson, we use the digits option to keep us from having to call round(x, 3) for every decimal number.
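As a small illustration of the digits option, the sketch below formats a one-row table. The data frame demo_summary and its value are hypothetical and are used only to show the rounding.

```r
library(knitr)

# Hypothetical one-row summary table with a long decimal (0.7142857...)
demo_summary <- data.frame(prop_high = 5 / 7)

# digits = 3 rounds numeric columns for display only;
# demo_summary itself is left unchanged
demo_table <- kable(demo_summary, digits = 3)

demo_table
```

Because the rounding happens at display time, later calculations can still use the full-precision values.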
Since we didn’t group the data, the calculation operates over all of the rows. This trick works because the logical expression grade_points >= 3.0 is converted to 1 for TRUE and 0 for FALSE, and when we average those we get the proportion of 1s.
Converting values from logical to integer
Let’s remind ourselves about how this conversion from logical (TRUE/FALSE) to integer works.
- We define a vector v consisting of five logical values. Since we surround the assignment with parentheses, R prints the value of v.
- We define a vector i in which we convert each element of v to its integer equivalent. Again, since we surround the assignment with parentheses, R prints the value of i.
- We calculate sum(v), which is simply the number of TRUE values in v, because each TRUE counts as 1.
- We calculate length(v), which is the number of elements in v.
- Finally, dividing sum(v) by length(v), we get the proportion of values of v that are 1 (or TRUE).
This is the underlying process by which the command above works.
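The steps above can be sketched in base R as follows. The five values chosen for v are an assumption made for illustration; any logical vector behaves the same way.

```r
# Five logical values; the outer parentheses make R print the result
(v <- c(TRUE, FALSE, TRUE, TRUE, FALSE))

# Convert each element to its integer equivalent (TRUE -> 1, FALSE -> 0)
(i <- as.integer(v))

sum(v)              # number of TRUE values: 3
length(v)           # number of elements: 5
sum(v) / length(v)  # proportion of TRUE values: 0.6
mean(v)             # the same proportion in one step
```

The last line shows why mean() of a logical expression returns a proportion: it is exactly sum() divided by length().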
2.2 How the command works
Let’s explore in more detail, using a variety of R/tidyverse commands, how this command works on our data.
First, let’s filter to include those rows for which grade_points is greater than or equal to 3.0. (Given what we saw above, this should be 781 rows.) Then use the select() function to display the values in the grade_points column:
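A minimal sketch of this filter-then-select step, using a hypothetical 8-row stand-in (grades_demo) for grades so that the row count is reproducible:

```r
library(dplyr)

# Hypothetical 8-row stand-in for the lesson's `grades` data frame
grades_demo <- data.frame(
  subject = rep(c("Biology", "Chemistry", "Economics", "Psychology"), each = 2),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.2, 4.0, 3.1)
)

# Keep only the rows with grade_points of at least 3.0,
# then show just the grade_points column
high_rows <- grades_demo |>
  filter(grade_points >= 3.0) |>
  select(grade_points)

high_rows
```

In this small example 5 of the 8 rows survive the filter; with the lesson's grades data the count is the one quoted in the text.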
Next, let’s use the mutate() command to define a new column a_or_b that is TRUE if grade_points is at least 3.0 and FALSE if it is not. This should then display all 1000 rows but include this new logical column:
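This step might look like the sketch below, again on a hypothetical 8-row stand-in (grades_demo) instead of the 1000-row grades data:

```r
library(dplyr)

# Hypothetical 8-row stand-in for the lesson's `grades` data frame
grades_demo <- data.frame(
  subject = rep(c("Biology", "Chemistry", "Economics", "Psychology"), each = 2),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.2, 4.0, 3.1)
)

# Add a logical column; no rows are dropped
with_flag <- grades_demo |>
  mutate(a_or_b = grade_points >= 3.0)

with_flag
```

Unlike filter(), mutate() keeps every row, so the new column records the result of the test for each grade.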
Here, we are filtering to include those rows for which grade_points is at least 3.0, and then using summarize() to calculate the number of rows that remain:
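A sketch of this count, assuming n() is used inside summarize() and using the hypothetical column name n_rows and stand-in data grades_demo:

```r
library(dplyr)

# Hypothetical 8-row stand-in for the lesson's `grades` data frame
grades_demo <- data.frame(
  subject = rep(c("Biology", "Chemistry", "Economics", "Psychology"), each = 2),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.2, 4.0, 3.1)
)

# Count the rows that remain after filtering
n_high <- grades_demo |>
  filter(grade_points >= 3.0) |>
  summarize(n_rows = n())

n_high
```

Dividing this count by the total number of rows would give the proportion, which is what the averaging approach computes directly.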
We’re narrowing in on it! In the following, we use mutate() to define a new logical column, this time named high_grade, just as we did above with a_or_b. This time we are going to use summarize() to calculate the average of this new column. This applies the technique that was discussed in the callout box above.
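The two-step version can be sketched like this, with grades_demo as a hypothetical stand-in for grades and prop_high as an assumed column name:

```r
library(dplyr)

# Hypothetical 8-row stand-in for the lesson's `grades` data frame
grades_demo <- data.frame(
  subject = rep(c("Biology", "Chemistry", "Economics", "Psychology"), each = 2),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.2, 4.0, 3.1)
)

# Step 1: flag each row; step 2: average the flags
two_step <- grades_demo |>
  mutate(high_grade = grade_points >= 3.0) |>
  summarize(prop_high = mean(high_grade))

two_step
```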
In the following, we simply substitute the definition of high_grade from the mutate() call above (i.e., grade_points >= 3.0) into the right side of the summarize() call. This allows us to get rid of the mutate() call altogether.
And this is the command that we started with back in Section 2.1.
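To check that the substitution changes nothing, the sketch below computes the proportion both ways and compares the results. The data frame grades_demo and the names two_step, one_step, and prop_high are hypothetical.

```r
library(dplyr)

# Hypothetical 8-row stand-in for the lesson's `grades` data frame
grades_demo <- data.frame(
  subject = rep(c("Biology", "Chemistry", "Economics", "Psychology"), each = 2),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.2, 4.0, 3.1)
)

# With the intermediate logical column
two_step <- grades_demo |>
  mutate(high_grade = grade_points >= 3.0) |>
  summarize(prop_high = mean(high_grade))

# With the logical expression placed directly inside mean()
one_step <- grades_demo |>
  summarize(prop_high = mean(grade_points >= 3.0))

# TRUE when both approaches return the same one-row result
same_result <- isTRUE(all.equal(two_step, one_step))
same_result
```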
2.3 Using the command with group_by()
If we want to find the proportion of grades of at least 3.0 by subject, we can group the data using group_by(subject), as follows:
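A minimal sketch of the grouped version, on a hypothetical stand-in (grades_demo) with two grades per subject:

```r
library(dplyr)

# Hypothetical 8-row stand-in for the lesson's `grades` data frame
grades_demo <- data.frame(
  subject = rep(c("Biology", "Chemistry", "Economics", "Psychology"), each = 2),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.2, 4.0, 3.1)
)

# The mean is now computed separately within each subject
by_subject <- grades_demo |>
  group_by(subject) |>
  summarize(prop_high = mean(grade_points >= 3.0))

by_subject
```

The result has one row per subject, and each proportion uses only that subject's rows as its denominator.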
Finally, we can also use this technique to calculate multiple proportions at the same time:
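One way this might look is sketched below. The particular thresholds and the column names prop_3_plus, prop_below_2, and n_grades are assumptions chosen for illustration, as is the stand-in data frame grades_demo.

```r
library(dplyr)

# Hypothetical 8-row stand-in for the lesson's `grades` data frame
grades_demo <- data.frame(
  subject = rep(c("Biology", "Chemistry", "Economics", "Psychology"), each = 2),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.2, 4.0, 3.1)
)

# Each mean() of a logical expression yields its own proportion
multi <- grades_demo |>
  group_by(subject) |>
  summarize(prop_3_plus  = mean(grade_points >= 3.0),
            prop_below_2 = mean(grade_points < 2.0),
            n_grades     = n())

multi
```

Including n() alongside the proportions makes it easy to see how many rows each proportion is based on.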