Finding proportions using logical expressions
1 Set up: Create some test data
A data frame is like a table in a database or a spreadsheet: it has rows and named columns. Most of the time, we load data from a file or database, but for a simple example we can create one like this.

library(tidyverse)  # mutate(), case_when(), and the other verbs used below
library(knitr)      # kable(), used to format the tables

# 1000 simulated grade records with random students, sections, and subjects
grades <- data.frame(record_id = 1:1000) |>
  mutate(student_id = sample(1:100, 1000, replace = TRUE),
         section_id = sample(1:50, 1000, replace = TRUE),
         subject = sample(c("Biology", "Chemistry",
                            "Economics", "Psychology"),
                          1000,
                          replace = TRUE),
         grade_points = round(4 * rbeta(1000, 5, 1), 1),
         grade_letter = case_when(grade_points >= 4.0 ~ "A",
                                  grade_points >= 3.0 ~ "B",
                                  grade_points >= 2.0 ~ "C",
                                  grade_points >= 1.0 ~ "D",
                                  TRUE ~ "F"))

# Store subject as a factor with a fixed level order
grades <-
  grades |>
  mutate(subject = factor(subject,
                          levels = c("Biology", "Chemistry",
                                     "Economics", "Psychology")))

# Store grade_letter as an ordered factor, from F (lowest) to A (highest)
grades <-
  grades |>
  mutate(grade_letter = factor(grade_letter,
                               levels = c("F", "D", "C", "B", "A"),
                               ordered = TRUE))
2 Finding proportions using logical expressions
2.1 The tidyverse command
To find the proportion of grades of at least 3.0 (a B or better), we could do this:
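A minimal sketch of that command, in the form that Section 2.2 arrives at. The small toy data frame is made up so that the block runs on its own; with the lesson's data, grades takes its place, and kable(digits = 3) can be piped on the end for formatting.

```r
library(dplyr)

# Made-up stand-in for the lesson's grades data (an assumption for illustration)
toy <- data.frame(grade_points = c(3.6, 2.4, 3.0, 1.9, 4.0))

# mean() of a logical vector is the proportion of TRUE values
result <- toy |>
  summarize(a_or_b = mean(grade_points >= 3.0))

result
```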
Nicely formatted table output
The kable() function provides an easy way to apply some nice formatting to your table outputs. (Its documentation describes many more of its options.) Before using it, you need to load its package with library(knitr) within RStudio, just as you load tidyverse with library(tidyverse).
Throughout this lesson, we use the digits option so that we don't have to call round(x, 3) on every decimal number.
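As a small illustration of the digits option, the sketch below formats a one-row data frame holding a made-up value. It only assumes that the knitr package is installed.

```r
library(knitr)

# A one-row data frame with a long decimal (made-up value)
p <- data.frame(proportion = 2 / 3)

# digits = 3 rounds numeric columns in the printed table; p itself is unchanged
tbl <- kable(p, digits = 3)
tbl
```

The rounding only affects what is displayed, so later calculations on p still use the full-precision value.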
Since we didn’t group the data, the calculation operates over all of the rows. This trick works because the logical expression grade_points >= 3.0 is converted to 1 for TRUE and 0 for FALSE, and when we average those we get the proportion of 1s.
Converting values from logical to integer
Let’s remind ourselves about how this conversion from logical (TRUE/FALSE) to integer works.
- We define a vector v consisting of five logical values. Since we surround it with parentheses, R prints the value of v.
- We define a vector i in which we convert each element of v to its integer equivalent. Again, since we surround it with parentheses, R prints the value of i.
- We calculate sum(v), which is simply the number of TRUE values in v, because each TRUE counts as 1.
- We calculate length(v), which is the number of elements in v.
- Finally, dividing sum(v) by length(v), we get the proportion of values of v that are 1 (or TRUE).
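The steps above can be sketched as follows. This is a reconstruction that is consistent with the output printed below it; the use of as.integer() for the conversion is an assumption.

```r
# Parentheses around an assignment make R print the assigned value
(v <- c(TRUE, TRUE, FALSE, FALSE, TRUE))
(i <- as.integer(v))

sum(v)              # number of TRUE values
length(v)           # number of elements
sum(v) / length(v)  # proportion of TRUE values, the same as mean(v)
```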
[1] TRUE TRUE FALSE FALSE TRUE
[1] 1 1 0 0 1
[1] 3
[1] 5
[1] 0.6
This is the underlying process by which the command above works.
2.2 How the command works
Let’s see if we can explore in more detail using a variety of R/tidyverse commands in order to understand how this command works on our data.
First, let's filter to keep only those rows for which grade_points is greater than or equal to 3.0. (Given the proportion of 0.785 over 1000 rows that we saw above, this should be 785 rows.) Then we use the select() function to display the values in the grade_points column:
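A sketch of this step on a made-up five-row data frame (toy is a hypothetical stand-in; the lesson applies the same two calls to grades):

```r
library(dplyr)

# Hypothetical stand-in for grades, small enough to check by eye
toy <- data.frame(record_id = 1:5,
                  grade_points = c(3.6, 2.4, 3.0, 1.9, 4.0))

# Keep rows with grade_points of at least 3.0, then show only that column
high <- toy |>
  filter(grade_points >= 3.0) |>
  select(grade_points)

high
```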
Next, let's use the mutate() function to define a new column a_or_b that is TRUE if grade_points is at least 3.0 and FALSE if it is not. This displays all 1000 rows, now with the new logical column included:
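The same idea on a made-up data frame, assuming the comparison used is grade_points >= 3.0 (toy is hypothetical; with the lesson's data it would be grades):

```r
library(dplyr)

# Hypothetical stand-in for grades
toy <- data.frame(record_id = 1:5,
                  grade_points = c(3.6, 2.4, 3.0, 1.9, 4.0))

# Add a logical column; no rows are dropped
flagged <- toy |>
  mutate(a_or_b = grade_points >= 3.0)

flagged
```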
Here, we are filtering to include those rows for which the grade is high, and then using summarize() to calculate the number of rows that remain:
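A sketch of the filter-then-count step. The output column name n is an assumption, and toy is again a made-up stand-in for grades:

```r
library(dplyr)

# Hypothetical stand-in for grades
toy <- data.frame(record_id = 1:5,
                  grade_points = c(3.6, 2.4, 3.0, 1.9, 4.0))

# n() counts the rows that survive the filter
counted <- toy |>
  filter(grade_points >= 3.0) |>
  summarize(n = n())

counted
```

Dividing this count by the total number of rows gives the same proportion that mean() returns for the logical column.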
We’re narrowing in on it! In the following, we use mutate() to define a new logical column high_grade as we did above. This time we are going to use summarize() to calculate the average of this new column. This applies the technique that was discussed in the callout box above.
grades |>
  mutate(high_grade = grade_points >= 3.0) |>
  summarize(a_or_b = mean(high_grade)) |>
  kable(digits = 3)

| a_or_b |
|---|
| 0.785 |
In the following, we simply substitute the definition of high_grade from the mutate() call above (i.e., grade_points >= 3.0) directly into the right side of the summarize() call. This allows us to get rid of the mutate() call altogether.
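A sketch of that substitution on made-up data, checking that the two forms agree (toy is a hypothetical stand-in for grades):

```r
library(dplyr)

# Hypothetical stand-in for grades
toy <- data.frame(grade_points = c(3.6, 2.4, 3.0, 1.9, 4.0))

# Two-step version: build the logical column, then average it
two_step <- toy |>
  mutate(high_grade = grade_points >= 3.0) |>
  summarize(a_or_b = mean(high_grade))

# One-step version: put the logical expression inside mean() directly
one_step <- toy |>
  summarize(a_or_b = mean(grade_points >= 3.0))

one_step
```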
And this is the command that we started with back in Section 2.1.
2.3 Using the command with group_by()
If we want to find the proportion of grades of at least 3.0 by subject, we can group the data using group_by(subject), as follows:
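A sketch of the grouped version on a made-up six-row data frame (toy and its values are assumptions for illustration; the lesson runs the same pipeline on grades and adds kable(digits = 3)):

```r
library(dplyr)

# Hypothetical stand-in for grades, with two subjects
toy <- data.frame(
  subject = c("Biology", "Biology",
              "Chemistry", "Chemistry", "Chemistry", "Chemistry"),
  grade_points = c(3.6, 2.4, 3.0, 1.9, 4.0, 3.2)
)

# After group_by(), summarize() computes one proportion per subject
by_subject <- toy |>
  group_by(subject) |>
  summarize(a_or_b = mean(grade_points >= 3.0))

by_subject
```

The table that follows shows the lesson's own result for the grades data, not the output of this toy example.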
| subject | a_or_b |
|---|---|
| Biology | 0.760 |
| Chemistry | 0.831 |
| Economics | 0.752 |
| Psychology | 0.799 |
Finally, we can also use this technique to calculate multiple proportions at the same time:
grades |>
  group_by(subject) |>
  summarize(
    A = mean(grade_letter == "A"),
    B = mean(grade_letter == "B"),
    C = mean(grade_letter == "C"),
    D = mean(grade_letter == "D"),
    F = mean(grade_letter == "F")
  ) |>
  kable(digits = 3)

| subject | A | B | C | D | F |
|---|---|---|---|---|---|
| Biology | 0.066 | 0.694 | 0.225 | 0.016 | 0.000 |
| Chemistry | 0.058 | 0.774 | 0.148 | 0.016 | 0.004 |
| Economics | 0.052 | 0.700 | 0.220 | 0.028 | 0.000 |
| Psychology | 0.072 | 0.727 | 0.185 | 0.016 | 0.000 |