library(tidyverse)
library(tidylog)
Lesson on Pipes
1 Using this document
Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:
- Edit the code that is shown in the box. Click on the Run Code button.
- Make further edits and re-run that code. You can do this as often as you’d like.
- Click the Start Over button to bring back the original code if you’d like.
- If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
1.1 Using RStudio
If you’re following along with this exercise in RStudio, then you need to execute the two library() calls shown at the top of this document in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading these libraries for you.
- The first line loads the tidyverse package. You could actually load just the packages dplyr, purrr, and stringr to save on memory, but with today’s computers and the types of things that you’re doing at this point, you don’t need to worry about load speed or memory usage just yet.
- The second line loads the tidylog package, which tells R to give more detailed messages.
2 Set up: Create a small data frame
A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.
df = data.frame(
  id = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age = c(18, 18, 19, 20, 21),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)
See data_types.qmd for more on these data structures.
3 The pipe
The pipe operator is a symbol denoted |> (you’ll also see %>%), which is used to chain together multiple operations. For more information, see this page on rforir. It’s never necessary to use the pipe, but it is the standard way of structuring queries and calculations in R/tidyverse.
It also makes a script easier to read and write. As an example, suppose we want to find the sum of the squares of GPAs for some statistical analysis. We could do it in stages like this:
- extract the gpa column using select()
- square the values using mutate()
- sum the squares using summarize()
We could do all that with an incomprehensible nesting of parentheses:
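A minimal sketch of what that nested call might look like. The column names gpa_squared and sum_of_squares are our own choices, and the data frame is a trimmed copy of the one from section 2 so the block runs on its own:

```r
library(dplyr)

# Trimmed copy of the section 2 data frame (only the columns needed here)
df = data.frame(
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# Read from the inside out: select(), then mutate(), then summarize()
nested = summarize(mutate(select(df, gpa), gpa_squared = gpa^2), sum_of_squares = sum(gpa_squared))
nested
```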
Or we could use the |> pipe to make it WAY more readable:
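A sketch of the piped form, under the same assumptions: the column names gpa_squared and sum_of_squares are illustrative, and the data frame is a trimmed copy of the one from section 2.

```r
library(dplyr)

# Trimmed copy of the section 2 data frame (only the columns needed here)
df = data.frame(
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# Read from top to bottom, in the order the steps happen
piped = df |>
  select(gpa) |>
  mutate(gpa_squared = gpa^2) |>
  summarize(sum_of_squares = sum(gpa_squared))
piped
```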
See how much more readable that is? In writing the script, it’s also easier to think of the chain of operations in the order we do them, whereas with parentheses we have to think from the inside out.
4 How it works
The pipe operator only does one simple thing. It slightly rearranges the order of arguments to functions. Recall that functions are named operations that take some information and produce a single result, like taking the square root of a vector of numbers.
Run the following code block to see the first few square roots:
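A minimal sketch of such a block, assuming the whole numbers 1 through 5 as the input (the choice of numbers is ours):

```r
# sqrt() takes a vector of numbers and returns the square root of each one
roots = sqrt(1:5)
roots
```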
The pipe operator lets us write the same thing as:
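For instance, assuming the numbers 1 through 5 as input (our choice), the piped form of a square-root call might look like this. The native |> pipe needs R 4.1 or later.

```r
# The value on the left of |> becomes the first argument of sqrt()
roots_piped = 1:5 |> sqrt()
roots_piped
```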
In this case, it’s not that helpful because the original statement is simple, but it illustrates what the pipe does: it lets us move the first argument of a function out of the parentheses and put it in front. This lets us think of a chain of operations like an assembly line.
Suppose we now want to add the square roots together. Without the pipe, this can be accomplished with the following code. Note this is written inside out: take square roots and then sum up.
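A sketch of the nested version, again assuming the numbers 1 through 5 as input:

```r
# Inside out: sqrt() runs first, then sum() adds up its results
total_nested = sum(sqrt(1:5))
total_nested
```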
But with the pipe, it looks like a chain of operations:
- Specify the numbers, then
- Take their square roots, then
- Sum them.
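Those three steps could be sketched on a single line as follows, assuming the numbers 1 through 5 as input:

```r
# Left to right: the numbers, then their square roots, then the sum
total_piped = 1:5 |> sqrt() |> sum()
total_piped
```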
Often we write piped operations vertically for readability, as in the following code block. Note that these line breaks have no effect on the calculations themselves.
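A sketch of the vertical layout, assuming the numbers 1 through 5 as input. Each line ends with |> so that R knows the expression continues on the next line:

```r
total_vertical = 1:5 |>
  sqrt() |>
  sum()
total_vertical
```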
This vertical arrangement has another advantage: we can add comments to help the reader (which may be ourselves next time we run the script):
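A sketch of a commented chain, assuming the numbers 1 through 5 as input. Anything after a # on a line is ignored by R, so a comment can follow each step:

```r
total_commented = 1:5 |>  # specify the numbers
  sqrt() |>               # take their square roots
  sum()                   # add the square roots together
total_commented
```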
The previous example shows the technique of integrating comments into your code. However, don’t take it as prescriptive of how frequently you need to write comments. For example, at some point, if you don’t know that sum() is the function to take the summation of a set of numbers, then you will have bigger problems than can be solved with commenting. At the beginning of your R/tidyverse scripting journey, you will comment on some things that you will not need to comment on later in your journey. And that’s okay!
Comment, and even over-comment, at the beginning of your journey. When you revisit your own code in weeks or even months, pay attention to which comments you find helpful or not. And then adjust your commenting strategy as appropriate.
5 The pipe and tidyverse
The tidyverse functions like select(), mutate(), and summarize() are designed to work with the pipe. They take a data frame as their first argument, and return a data frame, so we can think of each function as a step in a factory-like production line.
For example, when we use select() to choose columns, the function needs to know:
- what data frame to use, and
- what columns to choose.
Running the following code block will display the gpa column of the df data frame:
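A minimal sketch of that call, using a trimmed copy of the section 2 data frame so the block runs on its own:

```r
library(dplyr)

# Trimmed copy of the section 2 data frame (only the columns needed here)
df = data.frame(
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# First argument: the data frame to use. Second argument: the column to choose.
gpa_only = select(df, gpa)
gpa_only
```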
With the pipe, we can write it in a production-line fashion:
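A sketch of the piped form, using a trimmed copy of the section 2 data frame so the block runs on its own:

```r
library(dplyr)

# Trimmed copy of the section 2 data frame (only the columns needed here)
df = data.frame(
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# df moves in front of the pipe; select() receives it as its first argument
gpa_piped = df |>
  select(gpa)
gpa_piped
```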
From there we can add more operations like mutate(), and it works just like we would expect: it takes the result of the select() and sends it to the mutate():
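A sketch of that longer chain. The new column name gpa_squared is our own choice, and the data frame is a trimmed copy of the one from section 2:

```r
library(dplyr)

# Trimmed copy of the section 2 data frame (only the columns needed here)
df = data.frame(
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

gpa_with_squares = df |>
  select(gpa) |>               # keep only the gpa column
  mutate(gpa_squared = gpa^2)  # add a column computed from gpa
gpa_with_squares
```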
6 Variations
The original pipe in R was the %>% operator, which is still used in many scripts. In 2021, with version 4.1.0, the |> operator was added to R as a built-in operator. They do the same thing, with some subtle differences. To tell RStudio you want to use the built-in |> version, find the Tools menu at the top, navigate to Global Options > Code > Editing, and check the box that says to use the native pipe operator.
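As a small sketch of one such difference, assuming the magrittr package is installed and R is version 4.2 or later (needed for the _ placeholder): both pipes pass the left-hand value as the first argument by default, but they use different placeholders when the value belongs somewhere else. The scores data frame and the lm() call are only illustrations we chose.

```r
library(magrittr)  # provides the %>% operator

# By default, both pipes give the same result
x = c(1, 4, 9)
same_result = identical(x |> sqrt(), x %>% sqrt())
same_result

# Hypothetical data for illustration
scores = data.frame(
  age = c(18, 18, 19, 20, 21),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# When the value is not the first argument, %>% uses `.` as the placeholder,
# while |> uses `_`, which must be given to a named argument
fit_magrittr = scores %>% lm(gpa ~ age, data = .)
fit_native = scores |> lm(gpa ~ age, data = _)
```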
7 Shortcut key
Since we use the pipe so often, it’s helpful to have a keyboard shortcut for it.
RStudio
All of this discussion about shortcuts is appropriate for your work in RStudio but does not apply to your work on these Web pages.
Over time, the vast majority of your work will occur in RStudio (or similar development environments) so, don’t worry, this discussion will have relevance to you!
By default it is Shift+Ctrl+M, but you can change that to something simpler like Alt+M. To do that, find the Tools menu at the top, then Global Options, then Code. Look for the button that says “Modify keyboard shortcuts” (about 2/3 of the way down the list of options). Use the search box to find “pipe” and change the shortcut to whatever you want.
The “M” is associated with the pipe for historical reasons. The original pipe operator was found in the magrittr package. (Okay, okay, here are the details: René Magritte was a Belgian surrealist painter, and one of his most famous paintings is of a pipe. Thus, the shortcut for inserting the pipe symbol uses an M.)
Test your knowledge
Try using the pipe operator with the data frame df to do the following:
- select the columns first_name and last_name
- mutate a new column called full_name that concatenates first_name and last_name. The function you need is str_c(first_name, " ", last_name).
You can find more information about str_c() on this page.
A bit of help
The structure of the mutate() function is as follows:
mutate(NEW_COLUMN_NAME = SOME_CALCULATION)
Solution
The structure of the following solution can be described as follows:
- Use the df data frame,
- Select the two columns that we need, and
- Define a new column called full_name using the str_c() function.
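One way to write that solution, sketched here with a trimmed copy of the df data frame so the block runs on its own:

```r
library(dplyr)
library(stringr)

# Trimmed copy of the section 2 data frame (only the columns needed here)
df = data.frame(
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name = c("Smith", "Jones", "Kline", "White", "Zettle")
)

solution = df |>
  select(first_name, last_name) |>                       # keep the two name columns
  mutate(full_name = str_c(first_name, " ", last_name))  # join them with a space
solution
```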