Filters

1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

Edit the code that is shown in the box. Click on the Run Code button.
Make further edits and re-run that code. You can do this as often as you’d like.
Click the Start Over button to bring back the original code if you’d like.
If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

2 Set up: Create a small data frame

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

See the lesson on data types for more on these data structures.

3 Filtering rows

The select() function is used to choose columns from a data frame. (You can work through the select() lesson on this page.) Think of that as taking vertical slices (i.e., columns) of the data frame. The filter() function is complementary — it takes horizontal slices (i.e., rows), just like filtering a spreadsheet. For more information, see this page on rforir.

3.1 Basic filter example

First, let’s remind ourselves of the df data table that we’re going to be working with.

The following is the simplest example that we can think of. Here, we want to filter for those rows from df for which the gpa column has a value greater than 3.5. Accordingly, the R/tidyverse query is as follows:

As you might expect, only one row is remaining after the filter(). Also, note the comment (generated by the tidylog package); it helps you understand just how many records are filtered out for any particular query.

3.2 Simple conditions

As you can see, the filter() function takes a logical expression as its argument. Here are some variations.

Run each one of these code blocks and be sure the results make sense.

Tip

Using the code blocks to experiment with R

We encourage you to change the code in these code blocks and experiment with what’s possible and how this works.

3.2.1 Working with numbers

This code block filters to include rows for which age is equal to 19. Note that the test for equality consists of 2 equal signs!

The following query is the complement of the above — it filters to include those rows for which age is not equal to 19. Again, note the symbol for not equal is !=.

Tip

Blank values and equality

Something fishy seems to be going on with the above two examples. Note the following:

The df data frame has 5 rows.
For age == 19, 1 row matches.
This would lead you to believe that 4 rows have a value that does not equal to 19.
Looking at the results of the most recent code block, for age != 19, 4 rows match…NO THEY DON’T! Only 3 rows appear below the query!

Why is this? Because the age column for Eve is blank — that is, according to R, it neither equals nor does not equal 19.

In order for this to make sense from the beginning, you should pay close attention to the “Filtering out blanks” section below.

3.2.2 Working with text

The process for working with text is the same as for working with numbers — you can use the same == and != symbols as in the following two examples.

Again, change around this code and verify that the printed results are what you expect.

This query filters to include those rows with class equal to "Freshman":

This query filters to include those rows with class not equal to "Freshman":

Finally, this query filters to include those rows with class greater than "Junior":

Looking at these results, they don’t seem intuitively right. Of course, it’s definitely true that, alphabetically, Sophomore and Senior are greater than (i.e., come later in the alphabet than) Junior. However, That’s likely not what we mean when we think class > "Junior" — we likely mean "Senior" because it’s the fourth year of an undergraduate education, coming after the third/"Junior" year.

We address this issue in Section 4 of this lesson.

Test your knowledge

Write a query that filters to include those rows with gpa less than or equal to 3.5.

Test your knowledge

Write a query to filter to include those rows with class equal to "Junior".

Test your knowledge

Write a query that filters to select those rows for which the last name is greater than “Kline”.

3.3 Compound conditions

We can make filters as complex as we need by combining logical expressions with & (and) and | (or):

& (and): both conditions have to be true; otherwise, it evaluates to false.
| (or): only one of the conditions have to be true; only if all conditions are false will it evaluate to false.

The ordering of the conditions will have no effect on the value of the expression.

The following filter query should be interpreted as follows:

When operating on the df data frame, filter to select those rows for which it is true that both age is greater than or equal to 18 and gpa is greater than 3.2.

Look at the results and validate that the results are what you expect.

In this example, notice that a comma is equivalent to an and (&). This query is evaluated in exactly the same way as the previous one.

The following query also is equivalent to the previous two queries! Let’s interpret its meaning:

When operating on the df data frame, filter to include those rows for which age is greater than or equal to 18. Then, from that set of data, filter to include those rows for which gpa is greater than 3.2.

It is the case that both the age filter and the gpa filter have to be true for the row to be displayed; it is generally true that two filters separated by the pipe symbol has the same effect as a logical & (and) operator.

Finally, as described above, the | (or) operator ensures that a row is included if it meets any of the specified criteria. The following query is interpreted as follows:

Working on the df data frame, filter to include rows for which it is either true that age is greater than or equal to 18 or it is true that gpa is greater than 3.2.

Again, validate that the results are what you expected.

Test your knowledge

For which students is it true that the name comes before “David” in the alphabet and his/her age is 18?

Test your knowledge

This question is a bit more complicated: For which freshmen or sophomores is it true that they are both 18 years old and have a GPA greater than 3.2?

3.4 Combining `select()` with `filter()`

These two functions together narrow a data frame by eliminating columns and rows, and are often used in combination. It’s best to do the filter first, because you may want to deselect the column being filtered on, which can produce a confusing error.

Let’s be clear about what instructions we are giving to R in the following code block:

For the df data table, use filter() to include those rows with gpa greater than 3.3. For those rows, display both the first_name and last_name columns.

From the tidylog output, you can see that filter removed 3 rows and select dropped 4 columns, displaying just the two columns in the select() function.

Now, let’s look at what happens when we reverse the order of select and filter in this example.

Again, let’s interpret the instructions that are in the following code block:

For the df data table, use select() to retrieve the first_name and last_name columns. For those two columns (and all of the rows, since we have not filtered as of yet), use filter() to include those rows with gpa greater than 3.3.

Run the following code block now:

When you run this, you get an R error telling you that “object gpa is not found”. Why is that? The data frame df certainly has a gpa column!

But wait!

In the select() operator, we selected two columns…and gpa is not one of them! We cannot filter on a column that we do not have access to.

Test your knowledge

The following code block contains the original R code that gave the error message. Change it now — without changing the order of the select and filter operators — so that it does not give an error message.

3.5 Filtering out blanks

A common step in data cleanup is to filter out or replace blanks. The is.na() function can be used to detect missing values, and if we put a “not” operator — the ! in front to make !is.na() — it flips the truth values to detect non-blanks. For more information on missing values, see this page on rforir.

The following code block shows how to use filter() to include those rows for which gpa is not blank:

Test your knowledge

Change the following code block so that it does the opposite of what the previous query does — that is, it wants to filter to include those rows for which gpa is blank.

If you want to filter a data frame on all columns to display those rows that do not have any blanks, there’s a short cut: na.omit().

The above code block used filter() to include only those rows for which every column has a value.

It’s more complicated than the previous case to create a filter that’s the opposite of na.omit() — that is, to filter to include those rows with at least one blank.

We now have to use the if_any() and everything() functions. Let’s clarify how the following should be interpreted:

Working on the df data frame, use filter to include rows for which any column has the value NA.

Test your knowledge

Previously on this page we examined the following queries that display those rows for which age does not equal to 19 and those rows for which age does equal to 19:

If you remember, the issue that we brought up is that the number of rows in df (the first query) does not equal the total of rows returned by the second and third queries.

In this exercise you are going to integrate what we have learned about blank values.

In the code box below, define the following queries:

One query that displays all rows for which age is not blank.
One query that displays all rows for which age has a value and that value is not 19.
One query that displays all rows for which age has a value and that value is equal to 19.

There are, as usual, multiple ways to do this (and we discuss some of those ways in the solution that follows).

4 Technicalities

In IR work, we often report on class standing, first year (or freshman), sophomore, junior, and senior for a four-year undergraduate program. But these don’t sort alphabetically in the right order. There is a way to tell R how to sort them correctly by converting them to a factor type. For more on factors, see this page on rforir.

Here we will replace the class column using the mutate() function to create an ordered factor.

Compare the results to the previous filter() (from the end of the above section.

1 Using this document

2 Set up: Create a small data frame

3 Filtering rows

3.1 Basic filter example

3.2 Simple conditions

3.2.1 Working with numbers

3.2.2 Working with text

3.3 Compound conditions

3.4 Combining select() with filter()

3.5 Filtering out blanks

4 Technicalities

3.4 Combining `select()` with `filter()`