Filters
1 Using this document
Within this document are blocks of R
code. You can edit and execute this code as a way of practicing your R
skills:
- Edit the code that is shown in the box. Click on the
Run Code
button. - Make further edits and re-run that code. You can do this as often as you’d like.
- Click the
Start Over
button to bring back the original code if you’d like. - If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
2 Set up: Create a small data frame
A data frame
is like a table
in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.
See the lesson on data types for more on these data structures.
3 Filtering rows
The select()
function is used to choose columns from a data frame. (You can work through the select()
lesson on this page.) Think of that as taking vertical slices (i.e., columns) of the data frame. The filter()
function is complementary — it takes horizontal slices (i.e., rows), just like filtering a spreadsheet. For more information, see this page on rforir
.
3.1 Basic filter example
First, let’s remind ourselves of the df
data table that we’re going to be working with.
The following is the simplest example that we can think of. Here, we want to filter for those rows from df
for which the gpa
column has a value greater than 3.5
. Accordingly, the R/tidyverse
query is as follows:
As you might expect, only one row is remaining after the filter()
. Also, note the comment (generated by the tidylog
package); it helps you understand just how many records are filtered out for any particular query.
3.2 Simple conditions
As you can see, the filter()
function takes a logical expression as its argument. Here are some variations.
Run each one of these code blocks and be sure the results make sense.
Using the code blocks to experiment with R
We encourage you to change the code in these code blocks and experiment with what’s possible and how this works.
3.2.1 Working with numbers
This code block filters to include rows for which age
is equal to 19
. Note that the test for equality consists of 2
equal signs!
The following query is the complement of the above — it filters to include those rows for which age
is not equal to 19
. Again, note the symbol for not equal is !=
.
Blank values and equality
Something fishy seems to be going on with the above two examples. Note the following:
- The
df
data frame has 5 rows. - For
age == 19
, 1 row matches. - This would lead you to believe that 4 rows have a value that does not equal to 19.
- Looking at the results of the most recent code block, for
age != 19
, 4 rows match…NO THEY DON’T! Only 3 rows appear below the query!
Why is this? Because the age
column for Eve
is blank — that is, according to R
, it neither equals nor does not equal 19
.
In order for this to make sense from the beginning, you should pay close attention to the “Filtering out blanks” section below.
3.2.2 Working with text
The process for working with text is the same as for working with numbers — you can use the same ==
and !=
symbols as in the following two examples.
Again, change around this code and verify that the printed results are what you expect.
This query filters to include those rows with class
equal to "Freshman"
:
This query filters to include those rows with class
not equal to "Freshman"
:
Finally, this query filters to include those rows with class
greater than "Junior"
:
Looking at these results, they don’t seem intuitively right. Of course, it’s definitely true that, alphabetically, Sophomore
and Senior
are greater than (i.e., come later in the alphabet than) Junior
. However, That’s likely not what we mean when we think class > "Junior"
— we likely mean "Senior"
because it’s the fourth year of an undergraduate education, coming after the third/"Junior"
year.
We address this issue in Section 4 of this lesson.
Test your knowledge
Write a query that filters to include those rows with gpa
less than or equal to 3.5
.
A bit of help
The symbol for greater than or equal to is >=
. The symbol for less than or equal to is <=
.
Solution
This is our answer for this exercise:
Test your knowledge
Write a query to filter to include those rows with class
equal to "Junior"
.
Solution
This is how we wrote the query:
Test your knowledge
Write a query that filters to select those rows for which the last name is greater than “Kline”.
Solution
This is how we wrote this query:
Note that this text column does not have the order problem that the class
column has — alphabetical is the expected ordering rule for text.
3.3 Compound conditions
We can make filters as complex as we need by combining logical expressions with &
(and) and |
(or):
&
(and): both conditions have to be true; otherwise, it evaluates to false.|
(or): only one of the conditions have to be true; only if all conditions are false will it evaluate to false.
The ordering of the conditions will have no effect on the value of the expression.
The following filter
query should be interpreted as follows:
When operating on the
df
data frame, filter to select those rows for which it is true that bothage
is greater than or equal to18
andgpa
is greater than3.2
.
Look at the results and validate that the results are what you expect.
In this example, notice that a comma is equivalent to an and (&
). This query is evaluated in exactly the same way as the previous one.
The following query also is equivalent to the previous two queries! Let’s interpret its meaning:
When operating on the
df
data frame, filter to include those rows for whichage
is greater than or equal to18
. Then, from that set of data, filter to include those rows for whichgpa
is greater than3.2
.
It is the case that both the age
filter and the gpa
filter have to be true for the row to be displayed; it is generally true that two filters separated by the pipe
symbol has the same effect as a logical &
(and) operator.
Finally, as described above, the |
(or) operator ensures that a row is included if it meets any of the specified criteria. The following query is interpreted as follows:
Working on the
df
data frame, filter to include rows for which it is either true thatage
is greater than or equal to18
or it is true thatgpa
is greater than3.2
.
Again, validate that the results are what you expected.
Test your knowledge
For which students is it true that the name comes before “David” in the alphabet and his/her age is 18?
Solution
This is fairly straight-forward, as compound queries go:
Of course, you could have also written the query with two separate filter
functions or separated the two clauses by commas instead of the &
.
Test your knowledge
This question is a bit more complicated: For which freshmen or sophomores is it true that they are both 18 years old and have a GPA greater than 3.2?
Solution
Below, we present three different ways of answering this question. All give the same answer (as one would hope!). The challenge implicit in this question is that there is both an or and an and included in the question. How are we to handle this?
The first one is the way that we expecct that you answered. The following two present techniques that we have not covered but that we think are sufficiently important that you should be aware of them.
The first approach handles the or filter and then handles the and filter.
The second approach puts all of the conditions in one filter
. Note the structure of this function:
filter((A | B) & C & D)
The inner parentheses around the A | B
indicates that this condition is to be evaluated separately such that either A
or B
can be true. After this is evaluated, then both C
and D
also have to be true.
The following shows how to do that. Note that the line return after the class
condition does not effect the interpretation of the query.
Finally, the following introduces a new operator that we have not encountered before: the %in%
operator. When looking at the following, do not fail to notice the c()
function included! It is vital!
Here is the basic structure of the following:
filter(COLUMN %in% SOME_VECTOR & C & D)
This %in%
operator evaluates to true if the value in COLUMN
can be found within SOME_VECTOR
.
In this case, we are asking if the value in the class
column is equal to either Sophomore
or Freshman
.
The rest of the query is the same as the above.
None of the above options are better than any of the others for our purposes. Be sure that you understand why they are equivalent.
3.4 Combining select()
with filter()
These two functions together narrow a data frame by eliminating columns and rows, and are often used in combination. It’s best to do the filter
first, because you may want to deselect the column being filtered on, which can produce a confusing error.
Let’s be clear about what instructions we are giving to R
in the following code block:
For the
df
data table, usefilter()
to include those rows withgpa
greater than3.3
. For those rows, display both thefirst_name
andlast_name
columns.
From the tidylog
output, you can see that filter
removed 3 rows and select
dropped 4 columns, displaying just the two columns in the select()
function.
Now, let’s look at what happens when we reverse the order of select
and filter
in this example.
Again, let’s interpret the instructions that are in the following code block:
For the
df
data table, useselect()
to retrieve thefirst_name
andlast_name
columns. For those two columns (and all of the rows, since we have not filtered as of yet), usefilter()
to include those rows withgpa
greater than3.3
.
Run the following code block now:
When you run this, you get an R error
telling you that “object gpa
is not found”. Why is that? The data frame df
certainly has a gpa
column!
But wait!
In the select()
operator, we selected two columns…and gpa
is not one of them! We cannot filter on a column that we do not have access to.
Test your knowledge
The following code block contains the original R
code that gave the error message. Change it now — without changing the order of the select
and filter
operators — so that it does not give an error message.
Solution
The simplest way that we could figure out to address this question was to add gpa
to the select()
operator. This provides the information that the filter()
operator needs to do what it’s been asked to do.
3.5 Filtering out blanks
A common step in data cleanup is to filter out or replace blanks. The is.na()
function can be used to detect missing values, and if we put a “not” operator — the !
in front to make !is.na()
— it flips the truth values to detect non-blanks. For more information on missing values, see this page on rforir
.
The following code block shows how to use filter()
to include those rows for which gpa
is not blank:
Test your knowledge
Change the following code block so that it does the opposite of what the previous query does — that is, it wants to filter to include those rows for which gpa
is blank.
Solution
It’s a simple change: remove the not operator (the !
) in front of the is.na()
, as follows:
If you want to filter a data frame on all columns to display those rows that do not have any blanks, there’s a short cut: na.omit()
.
The above code block used filter()
to include only those rows for which every column has a value.
It’s more complicated than the previous case to create a filter that’s the opposite of na.omit()
— that is, to filter to include those rows with at least one blank.
We now have to use the if_any()
and everything()
functions. Let’s clarify how the following should be interpreted:
Working on the
df
data frame, usefilter
to include rows for which any column has the valueNA
.
Test your knowledge
Previously on this page we examined the following queries that display those rows for which age
does not equal to 19
and those rows for which age
does equal to 19
:
If you remember, the issue that we brought up is that the number of rows in df
(the first query) does not equal the total of rows returned by the second and third queries.
In this exercise you are going to integrate what we have learned about blank values.
In the code box below, define the following queries:
- One query that displays all rows for which
age
is not blank. - One query that displays all rows for which
age
has a value and that value is not19
. - One query that displays all rows for which
age
has a value and that value is equal to19
.
There are, as usual, multiple ways to do this (and we discuss some of those ways in the solution that follows).
A bit of help
The main challenge here is to figure out how to filter to include those rows for which age
is not blank. One way of doing this query includes a filter for doing this in each one of the queries.
A bit of help
In one of the approaches, for the second and third queries, you can either 1) use two separate filter
statements, or 2) use one filter statement that includes two clauses separated by an and operator.
Another approach would create a new data frame that has all of the rows with a blank age
value removed.
Solution
We are going to go through three separate approaches to answering this question. Each one gives the same answer, one for which the total number of rows returned by queries 2 and 3 equals the number of rows returned by the first query.
For approach one, we use separate filter
statements — one to get rid of blank age
values and the other to filter for the age
values that we’re looking for.
For the second approach, we include all of the filter
clauses in one filter()
function per query:
For the third approach, we first create a new data frame that has removed all of the rows with a blank age
value. The second and third queries then operate on this restricted df_age
data frame so there’s no need to worry about blank age
values any more.
4 Technicalities
In IR work, we often report on class standing, first year (or freshman), sophomore, junior, and senior for a four-year undergraduate program. But these don’t sort alphabetically in the right order. There is a way to tell R how to sort them correctly by converting them to a factor type. For more on factors, see this page on rforir
.
Here we will replace the class
column using the mutate()
function to create an ordered factor.
Compare the results to the previous filter()
(from the end of the above section.