Advanced lesson – Logical expressions to compute proportions

group_by() and summary functions

1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

1.1 Using RStudio

If you’re following along with this exercise in RStudio, you need to execute the following code in the Console. If you are working through the exercises within this document, you don’t have to, because we load these libraries for you.

library(tidyverse)
library(tidylog)

  • The first line loads the tidyverse package. You could load just the dplyr and purrr packages to save on memory, but with today’s computers and the kind of work you’re doing at this point, you don’t need to worry about load speed or memory usage just yet.
  • The second line loads the tidylog package, which makes R print more detailed messages about what each data-manipulation step did.

1.2 Set up: Create some test data

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

The following data frame is explained in great detail in the ungroup() lesson.

grades <- data.frame(record_id = 1:1000) |> 
  mutate(student_id = sample(1:100, 1000, replace = TRUE),
         section_id = sample(1:50,  1000, replace = TRUE),
         subject    = sample(c("Biology","Chemistry",
                               "Economics", "Psychology"), 
                             1000, 
                             replace = TRUE), 
         grade_points = round(4*rbeta(1000, 5, 1), 1),
         grade_letter = case_when(grade_points >= 4.0 ~ "A",
                                  grade_points >= 3.0 ~ "B",
                                  grade_points >= 2.0 ~ "C",
                                  grade_points >= 1.0 ~ "D",
                                  TRUE                ~ "F"))
# Store subject as a factor with an explicit level order
grades <-
  grades |> 
    mutate(subject = factor(subject, 
                            levels = c("Biology","Chemistry",
                                       "Economics", "Psychology")))
grades <-
  grades |>
    mutate(grade_letter = factor(grade_letter,
                                 levels = c("F", "D", "C", "B", "A"),
                                 ordered = TRUE))

2 Finding proportions using logical expressions

2.1 The tidyverse command

To find the proportion of grades of at least 3.0 (a B or better), we could do this:
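A minimal sketch of such a command follows. To keep it runnable on its own, it uses a small stand-in data frame, demo, in place of grades, and a result column named prop_high; both names are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical six-row stand-in for `grades` (illustration only)
demo <- data.frame(
  subject      = rep(c("Biology", "Chemistry"), each = 3),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.9)
)

# mean() of a logical vector is the proportion of TRUE values
overall <- demo |>
  summarize(prop_high = mean(grade_points >= 3.0))

overall
```

Three of the six stand-in grades are at least 3.0, so prop_high is 0.5. The same pipeline should run unchanged on grades.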

Tip

Nicely formatted table output

The kable() function provides an easy way to apply some nice formatting to your table outputs. (You can read about a lot of its options on this page.) Before using it, you need to run library(knitr) within RStudio; loading tidyverse does not attach knitr for you.

During this lesson, we use the digits option to keep us from having to call round(x, 3) for every decimal number.
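As a small sketch of the digits option, assuming a three-row stand-in data frame demo and a column name prop_high (both hypothetical):

```r
library(dplyr)
library(knitr)

# Hypothetical stand-in data: two of the three grades are at least 3.0
demo <- data.frame(grade_points = c(3.6, 2.4, 3.0))

tbl <- demo |>
  summarize(prop_high = mean(grade_points >= 3.0)) |>
  kable(digits = 3)   # round numeric columns to 3 decimal places

tbl
```

The proportion 2/3 is shown as 0.667 in the formatted table, without a separate call to round().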

Since we didn’t group the data, the calculation operates over all of the rows. This trick works because the logical expression grade_points >= 3.0 is converted to 1 for TRUE and 0 for FALSE, and when we average those we get the proportion of 1s.

Tip

Converting values from logical to integer

Let’s remind ourselves about how this conversion from logical (TRUE/FALSE) to integer works.

  1. We define a vector v consisting of five logical values. Since we surround it with parentheses, R prints the value of v.
  2. We define a vector i in which we convert each element of v to its integer equivalent. Again, since we surround it with parentheses, R prints the value of i.
  3. We calculate sum(v), which is simply the number of TRUE values in v, because each TRUE counts as 1 and each FALSE as 0.
  4. We calculate length(v), which is the number of elements in v.
  5. Finally, dividing sum(v) by length(v), we get the proportion of values of v that are TRUE (or 1). This is the same value that mean(v) returns.
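The five steps can be sketched as follows. The particular values placed in v are an assumption chosen for illustration.

```r
# 1. A logical vector; the outer parentheses make R print the value
(v <- c(TRUE, FALSE, TRUE, TRUE, FALSE))

# 2. Convert each element: TRUE becomes 1L, FALSE becomes 0L
(i <- as.integer(v))

# 3. Summing a logical vector counts its TRUE values
sum(v)               # 3

# 4. Number of elements in v
length(v)            # 5

# 5. Proportion of TRUE values, identical to mean(v)
sum(v) / length(v)   # 0.6
```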

This is the underlying process by which the command above works.

2.2 How the command works

Let’s use a variety of R/tidyverse commands to explore in more detail how this command works on our data.

First, let’s filter to include those rows for which grade_points is greater than or equal to 3.0. (Given what we saw above, this should be 781 of the 1000 rows; because grades is generated randomly, your own count will probably differ.) Then use the select() function to display the values in the grade_points column:
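A sketch of this step, run on a hypothetical six-row stand-in named demo rather than on grades:

```r
library(dplyr)

# Hypothetical six-row stand-in for `grades` (illustration only)
demo <- data.frame(
  subject      = rep(c("Biology", "Chemistry"), each = 3),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.9)
)

high_rows <- demo |>
  filter(grade_points >= 3.0) |>   # keep only rows with a grade of at least 3.0
  select(grade_points)             # show just the grade_points column

high_rows
```

Only the three rows with grades 3.6, 3.0, and 3.9 remain.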

Next, let’s use the mutate() command to define a new column a_or_b that is TRUE if grade_points is at least 3.0 and FALSE otherwise. This keeps all 1000 rows and adds the new logical column:
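On the same kind of hypothetical stand-in data (demo is an assumed name, not part of the lesson), the step might look like this:

```r
library(dplyr)

# Hypothetical six-row stand-in for `grades` (illustration only)
demo <- data.frame(
  subject      = rep(c("Biology", "Chemistry"), each = 3),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.9)
)

# No rows are dropped; each row gains a TRUE/FALSE flag
with_flag <- demo |>
  mutate(a_or_b = grade_points >= 3.0)

with_flag
```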

Here, we are filtering to include those rows for which the grade is high, and then using summarize() to calculate the number of rows that remain:
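A sketch of the counting step, again on a hypothetical stand-in; the column name n_high is an assumption.

```r
library(dplyr)

# Hypothetical six-row stand-in for `grades` (illustration only)
demo <- data.frame(
  subject      = rep(c("Biology", "Chemistry"), each = 3),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.9)
)

high_count <- demo |>
  filter(grade_points >= 3.0) |>
  summarize(n_high = n())   # n() counts the rows that survived the filter

high_count
```

This gives the numerator of the proportion (3 here); dividing by the total number of rows, nrow(demo), would give the proportion itself.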

We’re narrowing in on it! In the following, we use mutate() to define a new logical column, high_grade, in the same way that we defined a_or_b above. This time we use summarize() to calculate the average of this new column, which applies the technique discussed in the callout box above.
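A sketch of the two-step version, using hypothetical stand-in data and an assumed result name prop_high:

```r
library(dplyr)

# Hypothetical six-row stand-in for `grades` (illustration only)
demo <- data.frame(
  subject      = rep(c("Biology", "Chemistry"), each = 3),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.9)
)

via_mutate <- demo |>
  mutate(high_grade = grade_points >= 3.0) |>   # logical flag per row
  summarize(prop_high = mean(high_grade))       # average of the TRUE/FALSE values

via_mutate
```

Because three of the six flags are TRUE, the average is 0.5.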

In the following, we simply substitute the expression assigned to high_grade in the mutate() call above (i.e., grade_points >= 3.0) into the right side of the summarize() call. This allows us to get rid of the mutate() call altogether.
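The sketch below compares the two forms on hypothetical stand-in data to check that the substitution does not change the result. The names demo and prop_high are assumptions.

```r
library(dplyr)

# Hypothetical six-row stand-in for `grades` (illustration only)
demo <- data.frame(
  subject      = rep(c("Biology", "Chemistry"), each = 3),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.9)
)

# Two-step form: define the flag, then average it
via_mutate <- demo |>
  mutate(high_grade = grade_points >= 3.0) |>
  summarize(prop_high = mean(high_grade))

# One-step form: put the logical expression directly inside mean()
direct <- demo |>
  summarize(prop_high = mean(grade_points >= 3.0))

# Compare the two results
all.equal(via_mutate$prop_high, direct$prop_high)
```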

And this is the command that we started with back in Section 2.1.

2.3 Using the command with group_by()

If we want to find the proportion of grades of at least 3.0 by subject, we can group the data using group_by(subject), as follows:
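A sketch of the grouped version on hypothetical stand-in data with two subjects (demo and prop_high are assumed names):

```r
library(dplyr)

# Hypothetical six-row stand-in for `grades` (illustration only)
demo <- data.frame(
  subject      = rep(c("Biology", "Chemistry"), each = 3),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.9)
)

by_subject <- demo |>
  group_by(subject) |>                               # one group per subject
  summarize(prop_high = mean(grade_points >= 3.0))   # proportion within each group

by_subject
```

The logical expression is unchanged; group_by() only changes which rows each mean() is computed over. Here Biology has 2 of 3 grades at or above 3.0 and Chemistry has 1 of 3.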

Finally, we can also use this technique to calculate multiple proportions at the same time:
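One way this might look, on hypothetical stand-in data: each proportion is its own named argument to summarize(). The thresholds and the column names prop_b_or_better and prop_below_2 are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical six-row stand-in for `grades` (illustration only)
demo <- data.frame(
  subject      = rep(c("Biology", "Chemistry"), each = 3),
  grade_points = c(3.6, 2.4, 3.0, 3.9, 1.8, 2.9)
)

multi <- demo |>
  group_by(subject) |>
  summarize(
    prop_b_or_better = mean(grade_points >= 3.0),  # share of grades of at least 3.0
    prop_below_2     = mean(grade_points < 2.0)    # share of grades under 2.0
  )

multi
```

Each logical expression is averaged independently within each subject, so any number of proportions can be computed in a single summarize() call.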