Filters

1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

2 Set up: Create a small data frame

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

See the lesson on data types for more on these data structures.

3 Filtering rows

The select() function is used to choose columns from a data frame. (You can work through the select() lesson on this page.) Think of that as taking vertical slices (i.e., columns) of the data frame. The filter() function is complementary — it takes horizontal slices (i.e., rows), just like filtering a spreadsheet. For more information, see this page on rforir.

3.1 Basic filter example

First, let’s remind ourselves of the df data table that we’re going to be working with.

The following is the simplest example that we can think of. Here, we want to filter for those rows from df for which the gpa column has a value greater than 3.5. Accordingly, the R/tidyverse query is as follows:

As you might expect, only one row is remaining after the filter(). Also, note the comment (generated by the tidylog package); it helps you understand just how many records are filtered out for any particular query.

3.2 Simple conditions

As you can see, the filter() function takes a logical expression as its argument. Here are some variations.

Run each one of these code blocks and be sure the results make sense.

Tip

Using the code blocks to experiment with R

We encourage you to change the code in these code blocks and experiment with what’s possible and how this works.

3.2.1 Working with numbers

This code block filters to include rows for which age is equal to 19. Note that the test for equality consists of 2 equal signs!

The following query is the complement of the above — it filters to include those rows for which age is not equal to 19. Again, note the symbol for not equal is !=.

Tip

Blank values and equality

Something fishy seems to be going on with the above two examples. Note the following:

  • The df data frame has 5 rows.
  • For age == 19, 1 row matches.
  • This would lead you to believe that 4 rows have a value that does not equal to 19.
  • Looking at the results of the most recent code block, for age != 19, 4 rows match…NO THEY DON’T! Only 3 rows appear below the query!

Why is this? Because the age column for Eve is blank — that is, according to R, it neither equals nor does not equal 19.

In order for this to make sense from the beginning, you should pay close attention to the “Filtering out blanks” section below.

3.2.2 Working with text

The process for working with text is the same as for working with numbers — you can use the same == and != symbols as in the following two examples.

Again, change around this code and verify that the printed results are what you expect.

This query filters to include those rows with class equal to "Freshman":

This query filters to include those rows with class not equal to "Freshman":

Finally, this query filters to include those rows with class greater than "Junior":

Looking at these results, they don’t seem intuitively right. Of course, it’s definitely true that, alphabetically, Sophomore and Senior are greater than (i.e., come later in the alphabet than) Junior. However, That’s likely not what we mean when we think class > "Junior" — we likely mean "Senior" because it’s the fourth year of an undergraduate education, coming after the third/"Junior" year.

We address this issue in Section 4 of this lesson.

Test your knowledge

Write a query that filters to include those rows with gpa less than or equal to 3.5.

A bit of help

The symbol for greater than or equal to is >=. The symbol for less than or equal to is <=.

Solution

This is our answer for this exercise:

Test your knowledge

Write a query to filter to include those rows with class equal to "Junior".

Solution

This is how we wrote the query:

Test your knowledge

Write a query that filters to select those rows for which the last name is greater than “Kline”.

Solution

This is how we wrote this query:

Note that this text column does not have the order problem that the class column has — alphabetical is the expected ordering rule for text.

3.3 Compound conditions

We can make filters as complex as we need by combining logical expressions with & (and) and | (or):

  • & (and): both conditions have to be true; otherwise, it evaluates to false.
  • | (or): only one of the conditions have to be true; only if all conditions are false will it evaluate to false.

The ordering of the conditions will have no effect on the value of the expression.

The following filter query should be interpreted as follows:

When operating on the df data frame, filter to select those rows for which it is true that both age is greater than or equal to 18 and gpa is greater than 3.2.

Look at the results and validate that the results are what you expect.

In this example, notice that a comma is equivalent to an and (&). This query is evaluated in exactly the same way as the previous one.

The following query also is equivalent to the previous two queries! Let’s interpret its meaning:

When operating on the df data frame, filter to include those rows for which age is greater than or equal to 18. Then, from that set of data, filter to include those rows for which gpa is greater than 3.2.

It is the case that both the age filter and the gpa filter have to be true for the row to be displayed; it is generally true that two filters separated by the pipe symbol has the same effect as a logical & (and) operator.

Finally, as described above, the | (or) operator ensures that a row is included if it meets any of the specified criteria. The following query is interpreted as follows:

Working on the df data frame, filter to include rows for which it is either true that age is greater than or equal to 18 or it is true that gpa is greater than 3.2.

Again, validate that the results are what you expected.

Test your knowledge

For which students is it true that the name comes before “David” in the alphabet and his/her age is 18?

Solution

This is fairly straight-forward, as compound queries go:

Of course, you could have also written the query with two separate filter functions or separated the two clauses by commas instead of the &.

Test your knowledge

This question is a bit more complicated: For which freshmen or sophomores is it true that they are both 18 years old and have a GPA greater than 3.2?

Solution

Below, we present three different ways of answering this question. All give the same answer (as one would hope!). The challenge implicit in this question is that there is both an or and an and included in the question. How are we to handle this?

The first one is the way that we expecct that you answered. The following two present techniques that we have not covered but that we think are sufficiently important that you should be aware of them.

The first approach handles the or filter and then handles the and filter.

The second approach puts all of the conditions in one filter. Note the structure of this function:

filter((A | B) & C & D)

The inner parentheses around the A | B indicates that this condition is to be evaluated separately such that either A or B can be true. After this is evaluated, then both C and D also have to be true.

The following shows how to do that. Note that the line return after the class condition does not effect the interpretation of the query.

Finally, the following introduces a new operator that we have not encountered before: the %in% operator. When looking at the following, do not fail to notice the c() function included! It is vital!

Here is the basic structure of the following:

filter(COLUMN %in% SOME_VECTOR & C & D)

This %in% operator evaluates to true if the value in COLUMN can be found within SOME_VECTOR.

In this case, we are asking if the value in the class column is equal to either Sophomore or Freshman.

The rest of the query is the same as the above.

None of the above options are better than any of the others for our purposes. Be sure that you understand why they are equivalent.

3.4 Combining select() with filter()

These two functions together narrow a data frame by eliminating columns and rows, and are often used in combination. It’s best to do the filter first, because you may want to deselect the column being filtered on, which can produce a confusing error.

Let’s be clear about what instructions we are giving to R in the following code block:

For the df data table, use filter() to include those rows with gpa greater than 3.3. For those rows, display both the first_name and last_name columns.

From the tidylog output, you can see that filter removed 3 rows and select dropped 4 columns, displaying just the two columns in the select() function.

Now, let’s look at what happens when we reverse the order of select and filter in this example.

Again, let’s interpret the instructions that are in the following code block:

For the df data table, use select() to retrieve the first_name and last_name columns. For those two columns (and all of the rows, since we have not filtered as of yet), use filter() to include those rows with gpa greater than 3.3.

Run the following code block now:

When you run this, you get an R error telling you that “object gpa is not found”. Why is that? The data frame df certainly has a gpa column!

But wait!

In the select() operator, we selected two columns…and gpa is not one of them! We cannot filter on a column that we do not have access to.

Test your knowledge

The following code block contains the original R code that gave the error message. Change it now — without changing the order of the select and filter operators — so that it does not give an error message.

Solution

The simplest way that we could figure out to address this question was to add gpa to the select() operator. This provides the information that the filter() operator needs to do what it’s been asked to do.

3.5 Filtering out blanks

A common step in data cleanup is to filter out or replace blanks. The is.na() function can be used to detect missing values, and if we put a “not” operator — the ! in front to make !is.na() — it flips the truth values to detect non-blanks. For more information on missing values, see this page on rforir.

The following code block shows how to use filter() to include those rows for which gpa is not blank:

Test your knowledge

Change the following code block so that it does the opposite of what the previous query does — that is, it wants to filter to include those rows for which gpa is blank.

Solution

It’s a simple change: remove the not operator (the !) in front of the is.na(), as follows:

If you want to filter a data frame on all columns to display those rows that do not have any blanks, there’s a short cut: na.omit().

The above code block used filter() to include only those rows for which every column has a value.

It’s more complicated than the previous case to create a filter that’s the opposite of na.omit() — that is, to filter to include those rows with at least one blank.

We now have to use the if_any() and everything() functions. Let’s clarify how the following should be interpreted:

Working on the df data frame, use filter to include rows for which any column has the value NA.

Test your knowledge

Previously on this page we examined the following queries that display those rows for which age does not equal to 19 and those rows for which age does equal to 19:

If you remember, the issue that we brought up is that the number of rows in df (the first query) does not equal the total of rows returned by the second and third queries.

In this exercise you are going to integrate what we have learned about blank values.

In the code box below, define the following queries:

  1. One query that displays all rows for which age is not blank.
  2. One query that displays all rows for which age has a value and that value is not 19.
  3. One query that displays all rows for which age has a value and that value is equal to 19.

There are, as usual, multiple ways to do this (and we discuss some of those ways in the solution that follows).

A bit of help

The main challenge here is to figure out how to filter to include those rows for which age is not blank. One way of doing this query includes a filter for doing this in each one of the queries.

A bit of help

In one of the approaches, for the second and third queries, you can either 1) use two separate filter statements, or 2) use one filter statement that includes two clauses separated by an and operator.

Another approach would create a new data frame that has all of the rows with a blank age value removed.

Solution

We are going to go through three separate approaches to answering this question. Each one gives the same answer, one for which the total number of rows returned by queries 2 and 3 equals the number of rows returned by the first query.

For approach one, we use separate filter statements — one to get rid of blank age values and the other to filter for the age values that we’re looking for.

For the second approach, we include all of the filter clauses in one filter() function per query:

For the third approach, we first create a new data frame that has removed all of the rows with a blank age value. The second and third queries then operate on this restricted df_age data frame so there’s no need to worry about blank age values any more.

4 Technicalities

In IR work, we often report on class standing, first year (or freshman), sophomore, junior, and senior for a four-year undergraduate program. But these don’t sort alphabetically in the right order. There is a way to tell R how to sort them correctly by converting them to a factor type. For more on factors, see this page on rforir.

Here we will replace the class column using the mutate() function to create an ordered factor.

Compare the results to the previous filter() (from the end of the above section.