Finding proportions using logical expressions

1 Setup: Create some test data

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

library(tidyverse)  # provides mutate(), case_when(), summarize(), etc.

grades <- data.frame(record_id = 1:1000) |> 
  mutate(student_id = sample(1:100, 1000, replace = TRUE),
         section_id = sample(1:50,  1000, replace = TRUE),
         subject    = sample(c("Biology","Chemistry",
                               "Economics", "Psychology"), 
                             1000, 
                             replace = TRUE), 
         grade_points = round(4*rbeta(1000, 5, 1), 1),
         grade_letter = case_when(grade_points >= 4.0 ~ "A",
                                  grade_points >= 3.0 ~ "B",
                                  grade_points >= 2.0 ~ "C",
                                  grade_points >= 1.0 ~ "D",
                                  TRUE                ~ "F"))
grades <-
  grades |> 
    mutate(subject = factor(subject, 
                            levels = c("Biology","Chemistry",
                                       "Economics", "Psychology")))
grades <-
  grades |>
    mutate(grade_letter = factor(grade_letter,
                                 levels = c("F", "D", "C", "B", "A"),
                                 ordered = TRUE))
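
The ordered = TRUE argument above is what lets grade letters be compared with operators such as >=, which becomes useful once we start writing logical expressions. Here is a minimal sketch using a short made-up vector of letters (toy_letters is hypothetical and not part of the grades data):

```r
# Hypothetical toy data, not the lesson's grades data frame
toy_letters <- factor(c("A", "B", "C", "B", "F"),
                      levels = c("F", "D", "C", "B", "A"),
                      ordered = TRUE)

# The comparison follows the level order F < D < C < B < A,
# so this asks "is each letter a B or better?"
b_or_better <- toy_letters >= "B"
print(b_or_better)
```

With an unordered factor, the same comparison would not be meaningful, which is presumably why the setup code declares the ordering explicitly.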

2 Finding proportions using logical expressions

2.1 The tidyverse command

To find the proportion of grades of at least 3.0 (a B or better), we could do this:

grades |> 
  summarize(a_or_b = mean(grade_points >= 3.0)) |>
  kable(digits = 3)
a_or_b
0.785
Tip

Formatting table output nicely

The kable() function provides an easy way to apply some nice formatting to your table outputs. (Its many options are described in the knitr documentation.) Before using it, you need to run library(knitr) within RStudio; loading tidyverse does not load knitr for you.

Throughout this lesson, we use the digits option so that we don't have to call round(x, 3) on every decimal number.
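
As a small illustration of the digits option, the sketch below passes a one-row data frame to kable(). The data frame toy and its column proportion are made up for this example, and the code assumes the knitr package is installed.

```r
library(knitr)

# Hypothetical one-row table holding a long decimal value
toy <- data.frame(proportion = 2 / 3)

# digits = 3 rounds numeric columns to three decimal places for display only;
# the underlying value in toy is unchanged
formatted <- kable(toy, digits = 3)
print(formatted)
```

The rounding affects only what is printed, so later calculations on toy still use the full-precision value.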

Since we didn’t group the data, the calculation operates over all of the rows. This trick works because the logical expression grade_points >= 3.0 is converted to 1 for TRUE and 0 for FALSE, and when we average those we get the proportion of 1s.

Tip

Converting values from logical to integer

Let’s remind ourselves about how this conversion from logical (TRUE/FALSE) to integer works.

  1. We define a vector v consisting of five logical values. Since we surround the assignment with parentheses, R prints the value of v.
  2. We define a vector i by converting each element of v to its integer equivalent. Again, the surrounding parentheses make R print the value of i.
  3. We calculate sum(v), which is the number of TRUE values in v, since each TRUE counts as 1.
  4. We calculate length(v), which is the number of elements in v.
  5. Finally, dividing sum(v) by length(v), we get the proportion of values of v that are TRUE (or 1).

(v <- c(TRUE, TRUE, FALSE, FALSE, TRUE))
[1]  TRUE  TRUE FALSE FALSE  TRUE
(i <- as.integer(v))
[1] 1 1 0 0 1
sum(v)
[1] 3
length(v)
[1] 5
sum(v)/length(v)
[1] 0.6

This is the underlying process by which the command above works.
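
To tie the steps above back to mean(), this short base R sketch checks that averaging a logical vector gives the same number as dividing sum() by length(). It reuses the vector v from the callout; the names prop_manual and prop_mean are introduced here only for illustration.

```r
v <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# The long way: count the TRUE values, then divide by the number of elements
prop_manual <- sum(v) / length(v)

# The short way: mean() treats TRUE as 1 and FALSE as 0 before averaging
prop_mean <- mean(v)

print(prop_manual)
print(prop_mean)
print(isTRUE(all.equal(prop_manual, prop_mean)))
```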

2.2 How the command works

Let’s break this command into smaller steps, using several R/tidyverse commands, to see how it works on our data.

First, let’s filter to keep only those rows for which grade_points is greater than or equal to 3.0. (Given the proportion of 0.785 over 1000 rows that we saw above, this should be 785 rows.) Then we use the select() function to display the values in the grade_points column:

grades |> 
  filter(grade_points >= 3.0) |>
  select(grade_points)

Next, let’s use mutate() to define a new column a_or_b that is TRUE if grade_points is at least 3.0 and FALSE if it is not. This displays all 1000 rows, now including the new logical column:

grades |> 
  mutate(a_or_b = grade_points >= 3.0)

Here, we are filtering to include those rows for which the grade is high, and then using summarize() to calculate the number of rows that remain:

grades |> 
  filter(grade_points >= 3.0) |>
  summarize(num_high = n())
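
The count produced by this kind of filter is the numerator of the proportion we are after. The sketch below makes that link explicit on a tiny made-up data frame (toy and its six values are hypothetical, not the lesson's data), and it loads only dplyr rather than the full tidyverse to stay light.

```r
library(dplyr)

# Hypothetical toy data: six grade-point values
toy <- data.frame(grade_points = c(3.5, 2.8, 4.0, 3.0, 1.9, 3.2))

counts <- toy |>
  summarize(num_high  = sum(grade_points >= 3.0),  # rows meeting the condition
            num_rows  = n(),                       # all rows
            prop_high = num_high / num_rows)       # count divided by total

print(counts)
```

Here sum() on the logical expression counts the qualifying rows without filtering, so the total row count is still available for the division.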

We’re narrowing in on it! In the following, we use mutate() to define a new logical column high_grade as we did above. This time we are going to use summarize() to calculate the average of this new column. This applies the technique that was discussed in the callout box above.

grades |> 
  mutate(high_grade = grade_points >= 3.0) |>
  summarize(a_or_b = mean(high_grade)) |>
  kable(digits = 3)
a_or_b
0.785

In the following, we simply substitute the definition of high_grade from the mutate() call above (i.e., grade_points >= 3.0) into the right side of the summarize() call. This allows us to get rid of the mutate() call altogether.

grades |> 
  summarize(a_or_b = mean(grade_points >= 3.0)) |>
  kable(digits = 3)
a_or_b
0.785

And this is the command that we started with back in Section 2.1.

2.3 Using the command with group_by()

If we want to find the proportion of grades of at least 3.0 by subject, we can group the data using group_by(subject), as follows:

grades |> 
  group_by(subject) |> 
  summarize(a_or_b = mean(grade_points >= 3.0)) |>
  kable(digits = 3)
subject      a_or_b
Biology       0.760
Chemistry     0.831
Economics     0.752
Psychology    0.799
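
A grouped proportion is easier to judge when the group sizes sit next to it. The following sketch, on a small made-up data frame (toy is hypothetical, and only dplyr is loaded), reports the number of grades and the number of high grades alongside the proportion for each subject.

```r
library(dplyr)

# Hypothetical toy data: five grades across two subjects
toy <- data.frame(
  subject      = c("Biology", "Biology", "Biology", "Chemistry", "Chemistry"),
  grade_points = c(3.5, 2.0, 3.0, 3.9, 2.5)
)

by_subject <- toy |>
  group_by(subject) |>
  summarize(n_grades = n(),                        # rows in the group
            n_high   = sum(grade_points >= 3.0),   # rows at or above 3.0
            a_or_b   = mean(grade_points >= 3.0))  # proportion at or above 3.0

print(by_subject)
```

Each summary is computed within its group, so a_or_b always equals n_high divided by n_grades for that subject.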

Finally, we can also use this technique to calculate multiple proportions at the same time:

grades |> 
  group_by(subject) |> 
  summarize(
    A = mean(grade_letter == "A"), 
    B = mean(grade_letter == "B"), 
    C = mean(grade_letter == "C"), 
    D = mean(grade_letter == "D"), 
    F = mean(grade_letter == "F")
  ) |>
  kable(digits = 3)
subject          A      B      C      D      F
Biology      0.066  0.694  0.225  0.016  0.000
Chemistry    0.058  0.774  0.148  0.016  0.004
Economics    0.052  0.700  0.220  0.028  0.000
Psychology   0.072  0.727  0.185  0.016  0.000