Lesson on Pipes

1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

1.1 Using RStudio

If you’re following along with this exercise in RStudio, then you need to execute the following code in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading these libraries for you.

library(tidyverse)
library(tidylog)
  • The first line loads the tidyverse package. You could actually load just the packages dplyr, purrr, and stringr to save on memory, but with today’s computers and the types of things that you’re doing at this point, you don’t need to worry about load speed and/or memory usage just yet.
  • The second line loads the tidylog package, which makes dplyr and tidyr functions print more detailed messages about what each step did.

2 Set up: Create a small data frame

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

df = data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18,18,19,20,21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

See data_types.qmd for more on these data structures.

3 The pipe

The pipe operator is a symbol denoted |> (you’ll also see %>%), which is used to chain together multiple operations. For more information, see this page on rforir. It’s never necessary to use the pipe, but it is the standard way of structuring queries and calculations in R/tidyverse.

It also makes a script easier to read and write. As an example, suppose we want to find the sum of the squares of GPAs for some statistical analysis. We could do it in stages like this:

  1. extract the gpa column using select()
  2. square the values using mutate()
  3. sum the squares using summarize()

We could do all that with an incomprehensible nesting of parentheses:
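A minimal sketch of what the nested version might look like, assuming the df data frame from the setup section and the dplyr package. The column names gpa_squared and sum_of_squares are illustrative choices, not names from the original lesson.

```r
library(dplyr)

# Same data frame as in the setup section
df <- data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18, 18, 19, 20, 21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# Read from the inside out: select(), then mutate(), then summarize()
nested_result <- summarize(
  mutate(select(df, gpa), gpa_squared = gpa^2),
  sum_of_squares = sum(gpa_squared)
)

print(nested_result)  # sum_of_squares is 61.14
```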

Or we could use the |> pipe to make it WAY more readable:
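A minimal sketch of the piped version, under the same assumptions: the df data frame from the setup section, the dplyr package, and the illustrative column names gpa_squared and sum_of_squares. The native |> pipe needs R 4.1.0 or later.

```r
library(dplyr)

# Same data frame as in the setup section
df <- data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18, 18, 19, 20, 21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# Read from top to bottom, in the order the steps happen
piped_result <- df |>
  select(gpa) |>
  mutate(gpa_squared = gpa^2) |>
  summarize(sum_of_squares = sum(gpa_squared))

print(piped_result)  # sum_of_squares is 61.14
```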

See how much more readable that is? In writing the script, it’s also easier to think of the chain of operations in the order we do them, whereas with parentheses we have to think from the inside out.

4 How it works

The pipe operator does only one simple thing: it slightly rearranges the order of arguments to functions. Recall that functions are named operations that take some information and produce a single result, like taking the square root of a vector of numbers.

Run the following code block to see the first few square roots:
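A minimal sketch, assuming an illustrative vector of perfect squares called numbers (the vector and its name are not from the original lesson):

```r
# An illustrative vector of numbers
numbers <- c(1, 4, 9, 16, 25)

# sqrt() takes the vector as its first (and only) argument
roots <- sqrt(numbers)

print(roots)  # 1 2 3 4 5
```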

The pipe operator lets us write the same thing as:
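A minimal sketch, again assuming the illustrative vector numbers. The native |> pipe needs R 4.1.0 or later.

```r
# An illustrative vector of numbers
numbers <- c(1, 4, 9, 16, 25)

# The pipe moves `numbers` in front; this is the same as sqrt(numbers)
piped_roots <- numbers |> sqrt()

print(piped_roots)  # 1 2 3 4 5
```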

In this case, it’s not that helpful because the original statement is simple, but it illustrates what the pipe does: it lets us move the first argument of a function out of the parentheses and put it in front. This lets us think of a chain of operations like an assembly line.

Suppose we now want to add the square roots together. Without the pipe, this can be accomplished with the following code. Note this is written inside out: take square roots and then sum up.
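A minimal sketch of the inside-out version, assuming the illustrative vector numbers:

```r
# An illustrative vector of numbers
numbers <- c(1, 4, 9, 16, 25)

# Inside out: sqrt() runs first, then sum() adds up its result
nested_total <- sum(sqrt(numbers))

print(nested_total)  # 15
```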

But with the pipe, it looks like a chain of operations:

  1. Specify the numbers, then
  2. Take their square roots, then
  3. Sum them.
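The three steps above can be sketched on one line, assuming the same illustrative numbers (the native |> pipe needs R 4.1.0 or later):

```r
# Specify the numbers, then take their square roots, then sum them
piped_total <- c(1, 4, 9, 16, 25) |> sqrt() |> sum()

print(piped_total)  # 15
```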

Often we write piped operations vertically for readability, as in the following code block. Note that these line breaks have no effect on the calculations themselves.
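A minimal sketch of the vertical layout, assuming the same illustrative numbers. Each line ends with the pipe so that R knows the expression continues on the next line.

```r
# One step per line; the result is the same as the one-line version
vertical_total <- c(1, 4, 9, 16, 25) |>
  sqrt() |>
  sum()

print(vertical_total)  # 15
```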

This vertical arrangement has another advantage: we can add comments to help the reader (which may be ourselves next time we run the script):
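A minimal sketch with one comment per step, assuming the same illustrative numbers. R ignores everything after a # on a line, so the comments do not affect the calculation.

```r
commented_total <- c(1, 4, 9, 16, 25) |>  # start with the numbers
  sqrt() |>                               # take the square root of each one
  sum()                                   # add the square roots together

print(commented_total)  # 15
```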

Use, and over-use, of comments

The previous example shows the technique of integrating comments into your code. However, don’t take it as prescriptive of how frequently you need to write comments. For example, at some point, if you don’t know that sum() is the function to take the summation of a set of numbers, then you will have bigger problems than can be solved with commenting. At the beginning of your R/tidyverse scripting journey, you will comment on some things that you will not need to comment on later in your journey. And that’s okay!

Comment, and even over-comment, at the beginning of your journey. When you revisit your own code in weeks or even months, pay attention to which comments you find helpful or not. And then adjust your commenting strategy as appropriate.

5 The pipe and tidyverse

The tidyverse functions like select(), mutate(), and summarize() are designed to work with the pipe. They take a data frame as their first argument, and return a data frame, so we can think of each function as a step in a factory-like production line.

For example, when we use select() to choose columns, the function needs to know:

  1. what data frame to use, and
  2. what columns to choose.

Running the following code block will display the gpa column of the df data frame:
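A minimal sketch, assuming the df data frame from the setup section and the dplyr package:

```r
library(dplyr)

# Same data frame as in the setup section
df <- data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18, 18, 19, 20, 21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# First argument: the data frame; second argument: the column to keep
gpa_only <- select(df, gpa)

print(gpa_only)
```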

With the pipe, we can write it in a production-line fashion:
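A minimal sketch of the piped form, under the same assumptions (the df data frame from the setup section, the dplyr package, and R 4.1.0 or later for |>):

```r
library(dplyr)

# Same data frame as in the setup section
df <- data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18, 18, 19, 20, 21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# The pipe passes df in as the first argument of select()
gpa_piped <- df |>
  select(gpa)

print(gpa_piped)
```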

From there we can add more operations like mutate(), and it works just like we would expect: it takes the result of the select and sends it to the mutate:
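A minimal sketch of adding a mutate() step, under the same assumptions as before. The new column name gpa_squared is an illustrative choice.

```r
library(dplyr)

# Same data frame as in the setup section
df <- data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18, 18, 19, 20, 21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# select() hands its one-column data frame to mutate(),
# which adds a second column next to gpa
with_squares <- df |>
  select(gpa) |>
  mutate(gpa_squared = gpa^2)

print(with_squares)
```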

6 Variations

The original pipe in R was the %>% operator from the magrittr package, which is still used in many scripts. In R 4.1.0, released in 2021, the |> operator was added to R as a built-in (native) operator. They do the same thing, with some subtle differences. To tell RStudio you want to use the built-in |> version, find the Tools menu at the top, navigate to Global Options > Code > Editing, and check the box that says to use the native pipe operator.
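A small comparison of the two operators, assuming the magrittr package is installed (it comes with the tidyverse) and using an illustrative vector. The final comment notes one of the subtle differences.

```r
library(magrittr)  # provides %>%

numbers <- c(1, 4, 9, 16, 25)  # illustrative vector

magrittr_total <- numbers %>% sqrt() %>% sum()
native_total   <- numbers |> sqrt() |> sum()

print(identical(magrittr_total, native_total))  # TRUE

# One subtle difference: %>% accepts a bare function name, as in
# `numbers %>% sqrt`, while |> requires the parentheses, as in
# `numbers |> sqrt()`.
```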

7 Shortcut key

Since we use the pipe so often, it’s helpful to have a keyboard shortcut for it.

Shortcuts are for your work in RStudio

All of this discussion about shortcuts is appropriate for your work in RStudio but does not apply to your work on these Web pages.

Over time, the vast majority of your work will occur in RStudio (or similar development environments) so, don’t worry, this discussion will have relevance to you!

By default it is Shift+Ctrl+M, but you can change that to something simpler like Alt+M. To do that, find the Tools menu at the top, Global Options, Code. Look for the button that says “Modify keyboard shortcuts” (about 2/3 of the way down the list of options). Use the search box to find “pipe” and change the shortcut to whatever you want.

The “M” is associated with the pipe for historical reasons. The original pipe operator came from the magrittr package. (Okay, okay, here are the details: René Magritte was a Belgian surrealist painter, and one of his most famous paintings is of a pipe. Thus, the shortcut for inserting the pipe symbol uses an M.)

Test your knowledge

Try using the pipe operator with the data frame df to do the following:

  1. select the columns first_name and last_name
  2. mutate a new column called full_name that concatenates first_name and last_name. The function you need is str_c(first_name, " ", last_name).

You can find more information about str_c() on this page.

A bit of help

The structure of the mutate() function is as follows:

mutate(NEW_COLUMN_NAME = SOME_CALCULATION)

Solution

The structure of the following solution can be described as follows:

  1. Use the df data frame,
  2. select the two columns that we need, and
  3. Define a new column called full_name using the str_c() function.
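One way to write the steps above, as a sketch that assumes the df data frame from the setup section plus the dplyr and stringr packages (both are loaded by library(tidyverse)):

```r
library(dplyr)
library(stringr)

# Same data frame as in the setup section
df <- data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18, 18, 19, 20, 21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

full_names <- df |>
  select(first_name, last_name) |>                        # keep the two name columns
  mutate(full_name = str_c(first_name, " ", last_name))   # join them with a space

print(full_names)
```

The first row of full_name is "Alice Smith".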