Selecting columns

1 Introduction

1.1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

1.2 Using RStudio

If you’re following along with this exercise in RStudio, then you need to execute the following code in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading this libraries for you.

library(tidyverse)
library(tidylog)

The first line loads the tidyverse package. The second package tells R to give more detailed messages.

FYI, to speed up the process just a tiny bit, you can load dplyr and purrr instead of tidyverse…but we generally just go with the one-step solution that we show here.

1.3 Other information on selecting columns

For more information on selecting columns, see this page on rforir.

2 Create a small data frame

A data frame is like a table in a database or a spreadsheet — and you will hear us refer to it as a table. (At times the distinction among table, data frame, and tibble (more on this later) can be important … but now is not that time.) It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

df = data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18,18,19,20,21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

See the lesson on data types for more on these data structures (e.g., data.frame and c()).

This data frame has six columns: id, first_name, last_name, class, age, and gpa. It also has five rows; we will see this in just one moment. Finally (for now), all of this information is stored in the variable df so that we can access it later.

Different ways to assign values to a variable

Note that we could have just as easily used the “arrow” (<-) assignment operator instead of the equals sign (=) between the df and data.frame above. To be clear, we could not have made that substitution for the six other equals signs in that statement.

3 Display data from a data frame

Before exploring the select() function itself, we will briefly explore how to display data from a data frame with two quite basic approaches.

3.1 The simplest statement of all

The first technique is to simply type the name of the data table, as follows:

Interacting with these code blocks

The above is the first of these code blocks that are different from what you normally see on the Web. Be sure to press the Run code button for each box and then inspect the output — if you don’t, then it’s possible that subsequent code blocks will not work correctly.

Also, note that you can run any R code that you want within these code blocks.

You can get quite a bit of insight from just displaying the data frame in this manner:

  • You verify that you know the name of the data frame and that it is loaded into R.
  • You can see the names of the columns.
  • You can see what type of data is in each of those columns.

Sometimes, this is all that you need. Keep this tool handy.

3.2 Using the head() function

The next simplest query that we can issue to R is “show me the data in this table.” Within the tidyverse, you will see two different ways to structure statements. (This applies to most queries in the tidyverse, not just this one.)

3.2.1 The functional approach

Here is the first method, which we call the “functional” approach (because it relies solely on R functions to complete its task). We start you off by telling it to return the first 3 rows of the df data frame. After seeing what the initial code displays, play around with the code block and see what other values might return:

The head() function returns the top rows of a table. In this case, we are specifying that

  • It should work with the df table, and
  • It should return the top three rows.

3.2.2 The pipe-based approach

The following is the second way to structure statements, one that uses the “pipe” (|>) operator. Look at the code block below. You should interpret the code in this way:

You are working on the df data frame. Pass this data along to the next command, the one to the right of the pipe (in this case, the head() function). Apply the head(4) function to this data. Display the result at the terminal.

Go ahead and run this code.

Note that, with the initial value, the first four rows of the df data frame are displayed.

Play around with this to investigate what other values might return. See what happens with what you might consider reasonable values but also with unreasonable values. It’s always good to know how a system responds to unexpected inputs.

Test your knowledge

Using the pipe operator approach, show the first 2 rows of the df table. Replace the underscores in the statement below with the appropriate code.

A bit of help

You need to use the head() operator.

Our solution

The most direct way to print the first two rows of the df data frame is the following:

1df |>
2  head(2)
1
We work with the df data frame.
2
Use the head() function with an argument equal to 2 to print the first two rows.

An equivalent answer would have the above code on one line — df |> head(2). We like to put separate operators on separate lines as a general rule (which we, of course, sometimes violate).

Tip

Working in RStudio

In case you’re working through all of these in RStudio, we wanted to give you a hint on working more efficiently within it.

If you place the cursor on any line of code within RStudio (even if that line of code extends over multiple lines!) and press Ctrl+Enter, you will see a new entry in the Environment (in the top right tab). There you can browse it by clicking on the blue arrow to the left of the name (in this case, df) to get a list of columns and some values, or click on the name (df) to see the whole table appear.

4 Selecting columns

We are going to use a variety of approaches to tell R/tidyverse to display data from specified columns.

4.1 Provide a list of column names

To select columns from a data frame, use the select() function from the dplyr package (which is part of the tidyverse). For example, run this line:

There are several things to observe here:

  • We are using the pipe operator again in this statement.
  • We are again operating on the df table.
  • This initial statement tells R to return the values in the id and class columns.
  • It only prints the first few rows by default.
  • A message is displayed that tells you that this select statement has not returned all of the columns.
Try out the select() function above
  • Change the above code so that it returns id, class, and age.
  • Now change it so that it returns just the gpa.

4.2 Select all but one column

Here’s a new one for you. The following code selects all the columns except gpa.

Test your knowledge

Now try these variations in the above code block by running each one and watching the changing output. Describe what you think is going on with each variation.

  • df |> select(c(1,2,3,4,5))
  • df |> select(1:5)
  • df |> select(-6)

After you have run each of the above, think about this question:

What are the benefits and drawbacks of each approach?

Do not look at our thoughts below until you have spent a moment or two pondering the question.

Our thoughts about the question

The following are our thoughts about the question. This isn’t close to an exhaustive list, but it should be helpful as you work to gain insight into the language.

df |> select(c(1,2,3,4,5))
  • Benefits: It’s simple.
  • Drawbacks: If the order of the columns in the data frame change, this will be problematic. It would also be hard for the reader/analyst to understand if they’re not intimately familiar with the order of the columns.
df |> select(1:5)
  • Benefits: It’s simple. Its meaning is obvious. It’s not much to type.
  • Drawbacks: Same as the above — the reader/analyst has to be familiar with the order of the columns to know what is going to be displayed.
df |> select(-6)
  • Benefits: It’s not much to type. It’s clear that only one column is not going to be displayed.
  • Drawbacks: Again, the reader/analyst has to be familiar with the order of the columns to know what is going to be displayed. Also, it’s not clear how much information is going to be displayed since only what is not going to be displayed is being specified.

4.3 Column names are case-sensitive

Column names are case sensitive. That is, NAME is not the same as name or Name — it must be NAME with that exact capitalization (if the column is called NAME). Similarly, if the column is called Address, then you can only refer to that column if the letters are exactly Address and not address or ADDRESS.

Test your knowledge

Execute the code below. You will get an error message!

In order to fix the statement, look at the work above you’ve already completed with this table and find the actual name of the columns. Fix the select() statement until you get rid of the error message.

Our thoughts about the question

Multiple approaches can be used to print out the id and gpa columns from the df data frame. This is the one that we used.

1df |>
2  select(id, gpa)
1
We work with the same data frame.
2
Both of the column names are all lowercase letters.

4.4 Using functions to choose column names

While the most common approach to choosing specific column names is to simply list them, the tidyverse also makes it possible to use functions to specify the columns. Two of them are shown below.

This first example uses the contains() function. It selects all column names that contain the specified string (in this case, "_").

Try some different strings to see how this function works.

This second example uses the ends_with() function. It selects all column names that end with the specified string (in this case, "name").

You can read about other such functions in the tidyverse documentation. You should give that page a look so that you can answer the following questions.

Test your knowledge

Use the appropriate function to match all of the column names in df.

A bit of help

Look at the tidyverse documentation for information on the everything() function.

Solution

1df |>
2  select(everything())
1
You are, again, working with the df data frame.
2
The everything() function matches all variables.

Test your knowledge

Use the appropriate function to match the column containing age information, whether it be called "AGE", "age", or "Age".

A bit of help

Look at the tidyverse documentation for information on the any_of() function.

Solution

1df |>
2  select(any_of(c("age", "Age", "AGE")))
1
You are, again, working with the df data frame.
2
The any_of() function matches columns with any of the strings that are listed in the function. The column names do not have to exist.

4.5 Select many columns other than a few specified columns

Here we want to run a select() statement that chooses many columns other than a few select columns. This is combining two approaches that we saw above in Section 4.2 and Section 4.4.

Test your knowledge

For example, what query would select all the columns except for the first and last names? You should be able to specify several different ways to do this. Can you think how you might use the contains() function?

A bit of help

If you want to use the contains() function, you will almost certainly also have to use the minus (-) operator.

Solution

Here is one solution of many:

Now incorporate the ends_with() function in the code chunk just above in order to get the same results.

4.6 Rearranging columns

We have already seen that we can select columns in whatever order we want by structuring the select() statement in the order we desire. However, we can also use R functions to do some of this reordering for us.

Test your knowledge

Use the select() statement to rearrange the columns in the data frame so that the column names are alphabetical.

We definitely do not want to simply type the column names in alphabetical order. No self-respecting R programmer would do that. Here are a couple of pieces that should be incorporated into your answer:

  • names(DATA_FRAME_NAME): this returns a list of the column names in the data frame.
  • sort(THE_LIST): this returns a sorted list of the items in THE_LIST.

Give it a try!

A bit of help

You are sorting a list of the df column names.

A bit of help

You want to select() a sort()ed list of names() from df.

Solution

4.7 Renaming columns

4.7.1 Renaming with rename()

There is a rename() function to change the name of a column. (The R documentation is found on this page.) For example, the following changes the name of the id column to upper case:

Note that it is not changing the contents of the id column — just the name.

Also, this is not saving the data frame with the new column names. If you want to do that, then you need to assign the results to a table, as in this example (note that this code is not executed; this is just for show):

df_new <- df |> rename(ID = id)

For simple renaming like this, using the rename() function feels like overkill. However, sometimes within a long series of steps the purpose of R statements can be not so obvious. Having a specialized function such as rename() can make the analyst’s job that much easier since it’s obvious what the code does.

4.7.2 Rename with select()

It’s generally more convenient — or, at least, more common — to rename columns when we select them. For example, consider the following:

The above renamed the columns temporarily, just for the duration of the query. Let’s validate that these columns have not changed:

Sure enough, the id and gpa columns still have their original names.

Test your knowledge

Use the select statement to rename all the columns in the data frame. We’ll create a new copy of the data frame called df2 to avoid changing the original.

Solution

Working in RStudio

After executing the code, you should see a new data frame in the Environment window. It should have the same data as df, but with different column names.

For more information on renaming and other useful commands, see this page on RforIR.

4.8 Columns with weird names

By default, column names should start with a letter and contain only letters, numbers, and underscores or periods.

We’ll create a new data frame to illustrate:

Notice the space in the middle of the second column name. (Note that we do not advocate doing this, but sometimes you will have to deal with data owned by other people who haven’t been as well-trained as you!)

Let’s take a look at its contents in a familiar table view:

Again, notice that the column name “first name” has a space in it. Non-standard names require markers to let R know what’s going on. The backtick ` is used to enclose non-standard names. Usually the back tick (not the apostrophe) is under the escape key top left of the keyboard, residing with the tilde (~).

Test your knowledge

Select the first name column from df3

Try this with and without the back ticks to see the error that happens when you forget.

Solution

5 Advanced topics

5.1 R’s native column access

The select() function is preferred because it is flexible and easier to read. But sometimes it’s useful to use the original R method for accessing columns. The point is not to learn these methods, but to recognize them when you see them in other people’s code.

Try these in the space below (or in RStudio, if you’re following along there):

  • df$first_name – the dollar sign separates the name of the data frame from the column name
  • df3$first name – the non-standard name in df3
  • df[, 1]R shorthand for all rows, first column
  • df[,1:3]R shorthand for all rows, columns 1–3
  • df[, "first_name"] – the first_name column in df
  • df[, c("first_name", "last_name")] – the first_name and last_name columns in df.
  • df[["id"]] – this is the same as df$id

You can learn more about accessing data items on this rforir page.

Of these native R methods for column selection (in contrast with the tidyverse’s select() operator), the dollar sign approach is the most common and useful.

5.2 Automatic selection with group_by

The group_by() function is used to group data by one or more columns, just like using a pivot table in Excel. Grouping is usually followed by summarizing, and the resulting columns will be

  1. All the grouping columns, and
  2. The columns created by the summarizing functions.

For more information, see this rforir page.

Example: find gpa by class:

Notice the names of the columns in the above output.

5.3 Combining select() with filter()

These two functions together narrow a data frame by eliminating columns and rows, and are often used in combination. It’s best to do the filter first (that is, stop specific rows from being included), because you may want to deselect the column being filtered on, which can produce a confusing error.

For more information, see this rforir page and also this lesson on filter().

This works fine: filter() before select():

The following produces an error that gpa isn’t a column. That’s because we already removed it in the select() statement.

Bonus question: How could you fix the second example while keeping the select() and filter() in the same order? (Hint: modify the select())

5.4 Technicalities

Sometimes little things can trip us up, and there’s a bit of R metaphysics that can be confusing. You’ll see one here so it may ring a bell if you have a problem later. Run these and look a the output:

Notice that these two print out the same information, but it’s arranged differently. Only the first one has the column name identified. That’s because the first one is still a data frame, and the second one is a vector. A vector is kind of a bare-bones column of data, without the extra information that a data frame has. This distinction becomes important occasionally. The general topic here is that of data types, which is covered in another lesson.

R’s native way to create a data frame is with the data.frame() function.

The df4 object is almost the same as tibble() — in fact, go ahead and replace data.frame in the code block above with tibble and rerun it. You should get the same result.

There are some subtle differences between using data.frame() and tibble(). For our work it’s recommended to use tibble(). One difference is that data.frame() will rename non-standard column names, while tibble() will not.

Run the following line and notice that it changes the name:

Run this line and notice that it does not change the name: