library(tidyverse)
library(tidylog)
Selecting columns
1 Introduction
1.1 Using this document
Within this document are blocks of R
code. You can edit and execute this code as a way of practicing your R
skills:
- Edit the code that is shown in the box. Click on the
Run Code
button. - Make further edits and re-run that code. You can do this as often as you’d like.
- Click the
Start Over
button to bring back the original code if you’d like. - If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
1.2 Using RStudio
If you’re following along with this exercise in RStudio
, then you need to execute the following code in the Console
. If you are going through this set of exercises within this document, you don’t have to do so because we are loading this libraries for you.
The first line loads the tidyverse
package. The second package tells R
to give more detailed messages.
FYI, to speed up the process just a tiny bit, you can load dplyr
and purrr
instead of tidyverse
…but we generally just go with the one-step solution that we show here.
1.3 Other information on selecting columns
For more information on selecting columns, see this page on rforir
.
2 Create a small data frame
A data frame
is like a table
in a database or a spreadsheet — and you will hear us refer to it as a table
. (At times the distinction among table
, data frame
, and tibble
(more on this later) can be important … but now is not that time.) It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.
= data.frame(
df id = c(1, 2, 3, 4, 5),
first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
last_name = c("Smith", "Jones", "Kline", "White", "Zettle"),
class = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
age = c(18,18,19,20,21),
gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)
See the lesson on data types for more on these data structures (e.g., data.frame
and c()
).
This data frame
has six columns: id
, first_name
, last_name
, class
, age
, and gpa
. It also has five rows; we will see this in just one moment. Finally (for now), all of this information is stored in the variable df
so that we can access it later.
Note that we could have just as easily used the “arrow” (<-
) assignment operator instead of the equals sign (=
) between the df
and data.frame
above. To be clear, we could not have made that substitution for the six other equals signs in that statement.
3 Display data from a data frame
Before exploring the select()
function itself, we will briefly explore how to display data from a data frame with two quite basic approaches.
3.1 The simplest statement of all
The first technique is to simply type the name of the data table, as follows:
The above is the first of these code blocks that are different from what you normally see on the Web. Be sure to press the Run code
button for each box and then inspect the output — if you don’t, then it’s possible that subsequent code blocks will not work correctly.
Also, note that you can run any R code that you want within these code blocks.
You can get quite a bit of insight from just displaying the data frame in this manner:
- You verify that you know the name of the data frame and that it is loaded into
R
. - You can see the names of the columns.
- You can see what type of data is in each of those columns.
Sometimes, this is all that you need. Keep this tool handy.
3.2 Using the head()
function
The next simplest query that we can issue to R
is “show me the data in this table.” Within the tidyverse
, you will see two different ways to structure statements. (This applies to most queries in the tidyverse
, not just this one.)
3.2.1 The functional approach
Here is the first method, which we call the “functional” approach (because it relies solely on R
functions to complete its task). We start you off by telling it to return the first 3
rows of the df
data frame. After seeing what the initial code displays, play around with the code block and see what other values might return:
The head()
function returns the top rows of a table. In this case, we are specifying that
- It should work with the
df
table, and - It should return the top three rows.
3.2.2 The pipe-based approach
The following is the second way to structure statements, one that uses the “pipe” (|>
) operator. Look at the code block below. You should interpret the code in this way:
You are working on the
df
data frame. Pass this data along to the next command, the one to the right of the pipe (in this case, thehead()
function). Apply thehead(4)
function to this data. Display the result at the terminal.
Go ahead and run this code.
Note that, with the initial value, the first four rows of the df
data frame are displayed.
Play around with this to investigate what other values might return. See what happens with what you might consider reasonable values but also with unreasonable values. It’s always good to know how a system responds to unexpected inputs.
Test your knowledge
Using the pipe operator approach, show the first 2 rows of the df
table. Replace the underscores in the statement below with the appropriate code.
A bit of help
You need to use the head()
operator.
Our solution
The most direct way to print the first two rows of the df
data frame is the following:
- 1
-
We work with the
df
data frame. - 2
-
Use the
head()
function with an argument equal to2
to print the first two rows.
An equivalent answer would have the above code on one line — df |> head(2)
. We like to put separate operators on separate lines as a general rule (which we, of course, sometimes violate).
Working in RStudio
In case you’re working through all of these in RStudio
, we wanted to give you a hint on working more efficiently within it.
If you place the cursor on any line of code within RStudio
(even if that line of code extends over multiple lines!) and press Ctrl+Enter
, you will see a new entry in the Environment
(in the top right tab). There you can browse it by clicking on the blue arrow to the left of the name (in this case, df
) to get a list of columns and some values, or click on the name (df
) to see the whole table appear.
4 Selecting columns
We are going to use a variety of approaches to tell R/tidyverse
to display data from specified columns.
4.1 Provide a list of column names
To select columns from a data frame, use the select()
function from the dplyr
package (which is part of the tidyverse
). For example, run this line:
There are several things to observe here:
- We are using the pipe operator again in this statement.
- We are again operating on the
df
table. - This initial statement tells
R
to return the values in theid
andclass
columns. - It only prints the first few rows by default.
- A message is displayed that tells you that this
select
statement has not returned all of the columns.
select()
function above
- Change the above code so that it returns
id
,class
, andage
. - Now change it so that it returns just the
gpa
.
4.2 Select all but one column
Here’s a new one for you. The following code selects all the columns except gpa
.
Test your knowledge
Now try these variations in the above code block by running each one and watching the changing output. Describe what you think is going on with each variation.
df |> select(c(1,2,3,4,5))
df |> select(1:5)
df |> select(-6)
After you have run each of the above, think about this question:
What are the benefits and drawbacks of each approach?
Do not look at our thoughts below until you have spent a moment or two pondering the question.
Our thoughts about the question
The following are our thoughts about the question. This isn’t close to an exhaustive list, but it should be helpful as you work to gain insight into the language.
df |> select(c(1,2,3,4,5))
-
- Benefits: It’s simple.
- Drawbacks: If the order of the columns in the data frame change, this will be problematic. It would also be hard for the reader/analyst to understand if they’re not intimately familiar with the order of the columns.
df |> select(1:5)
-
- Benefits: It’s simple. Its meaning is obvious. It’s not much to type.
- Drawbacks: Same as the above — the reader/analyst has to be familiar with the order of the columns to know what is going to be displayed.
df |> select(-6)
-
- Benefits: It’s not much to type. It’s clear that only one column is not going to be displayed.
- Drawbacks: Again, the reader/analyst has to be familiar with the order of the columns to know what is going to be displayed. Also, it’s not clear how much information is going to be displayed since only what is not going to be displayed is being specified.
4.3 Column names are case-sensitive
Column names are case sensitive. That is, NAME
is not the same as name
or Name
— it must be NAME
with that exact capitalization (if the column is called NAME
). Similarly, if the column is called Address
, then you can only refer to that column if the letters are exactly Address
and not address
or ADDRESS
.
Test your knowledge
Execute the code below. You will get an error message!
In order to fix the statement, look at the work above you’ve already completed with this table and find the actual name of the columns. Fix the select()
statement until you get rid of the error message.
4.4 Using functions to choose column names
While the most common approach to choosing specific column names is to simply list them, the tidyverse
also makes it possible to use functions to specify the columns. Two of them are shown below.
This first example uses the contains()
function. It selects all column names that contain the specified string (in this case, "_"
).
Try some different strings to see how this function works.
This second example uses the ends_with()
function. It selects all column names that end with the specified string (in this case, "name"
).
You can read about other such functions in the tidyverse
documentation. You should give that page a look so that you can answer the following questions.
Test your knowledge
Use the appropriate function to match all of the column names in df
.
A bit of help
Look at the tidyverse
documentation for information on the everything()
function.
Test your knowledge
Use the appropriate function to match the column containing age information, whether it be called "AGE"
, "age"
, or "Age"
.
A bit of help
Look at the tidyverse
documentation for information on the any_of()
function.
4.5 Select many columns other than a few specified columns
Here we want to run a select()
statement that chooses many columns other than a few select columns. This is combining two approaches that we saw above in Section 4.2 and Section 4.4.
Test your knowledge
For example, what query would select all the columns except for the first and last names? You should be able to specify several different ways to do this. Can you think how you might use the contains()
function?
A bit of help
If you want to use the contains()
function, you will almost certainly also have to use the minus (-
) operator.
Solution
Here is one solution of many:
Now incorporate the ends_with()
function in the code chunk just above in order to get the same results.
4.6 Rearranging columns
We have already seen that we can select columns in whatever order we want by structuring the select()
statement in the order we desire. However, we can also use R
functions to do some of this reordering for us.
Test your knowledge
Use the select()
statement to rearrange the columns in the data frame so that the column names are alphabetical.
We definitely do not want to simply type the column names in alphabetical order. No self-respecting R
programmer would do that. Here are a couple of pieces that should be incorporated into your answer:
names(DATA_FRAME_NAME)
: this returns a list of the column names in the data frame.sort(THE_LIST)
: this returns a sorted list of the items inTHE_LIST
.
Give it a try!
A bit of help
You are sorting a list of the df
column names.
A bit of help
You want to select()
a sort()
ed list of names()
from df
.
Solution
4.7 Renaming columns
4.7.1 Renaming with rename()
There is a rename()
function to change the name of a column. (The R documentation is found on this page.) For example, the following changes the name of the id
column to upper case:
Note that it is not changing the contents of the id
column — just the name.
Also, this is not saving the data frame with the new column names. If you want to do that, then you need to assign the results to a table, as in this example (note that this code is not executed; this is just for show):
<- df |> rename(ID = id) df_new
For simple renaming like this, using the rename()
function feels like overkill. However, sometimes within a long series of steps the purpose of R statements can be not so obvious. Having a specialized function such as rename()
can make the analyst’s job that much easier since it’s obvious what the code does.
4.7.2 Rename with select()
It’s generally more convenient — or, at least, more common — to rename columns when we select
them. For example, consider the following:
The above renamed the columns temporarily, just for the duration of the query. Let’s validate that these columns have not changed:
Sure enough, the id
and gpa
columns still have their original names.
Test your knowledge
Use the select
statement to rename all the columns in the data frame. We’ll create a new copy of the data frame called df2
to avoid changing the original.
Solution
RStudio
After executing the code, you should see a new data frame in the Environment window. It should have the same data as df
, but with different column names.
For more information on renaming and other useful commands, see this page on RforIR.
4.8 Columns with weird names
By default, column names should start with a letter and contain only letters, numbers, and underscores or periods.
We’ll create a new data frame to illustrate:
Notice the space
in the middle of the second column name. (Note that we do not advocate doing this, but sometimes you will have to deal with data owned by other people who haven’t been as well-trained as you!)
Let’s take a look at its contents in a familiar table view:
Again, notice that the column name “first name” has a space in it. Non-standard names require markers to let R know what’s going on. The backtick ` is used to enclose non-standard names. Usually the back tick (not the apostrophe) is under the escape key top left of the keyboard, residing with the tilde (~).
Test your knowledge
Select the first name column from df3
Try this with and without the back ticks to see the error that happens when you forget.
Solution
5 Advanced topics
5.1 R’s native column access
The select()
function is preferred because it is flexible and easier to read. But sometimes it’s useful to use the original R method for accessing columns. The point is not to learn these methods, but to recognize them when you see them in other people’s code.
Try these in the space below (or in RStudio, if you’re following along there):
df$first_name
– the dollar sign separates the name of the data frame from the column namedf3$first name
– the non-standard name indf3
df[, 1]
–R
shorthand for all rows, first columndf[,1:3]
–R
shorthand for all rows, columns 1–3df[, "first_name"]
– thefirst_name
column indf
df[, c("first_name", "last_name")]
– thefirst_name
andlast_name
columns indf
.df[["id"]]
– this is the same asdf$id
You can learn more about accessing data items on this rforir
page.
Of these native R
methods for column selection (in contrast with the tidyverse
’s select()
operator), the dollar sign approach is the most common and useful.
5.2 Automatic selection with group_by
The group_by()
function is used to group data by one or more columns, just like using a pivot table in Excel. Grouping is usually followed by summarizing, and the resulting columns will be
- All the grouping columns, and
- The columns created by the summarizing functions.
For more information, see this rforir
page.
Example: find gpa
by class
:
Notice the names of the columns in the above output.
5.3 Combining select()
with filter()
These two functions together narrow a data frame by eliminating columns and rows, and are often used in combination. It’s best to do the filter
first (that is, stop specific rows from being included), because you may want to deselect the column being filtered on, which can produce a confusing error.
For more information, see this rforir
page and also this lesson on filter()
.
This works fine: filter()
before select()
:
The following produces an error that gpa
isn’t a column. That’s because we already removed it in the select()
statement.
Bonus question: How could you fix the second example while keeping the select()
and filter()
in the same order? (Hint: modify the select()
)
5.4 Technicalities
Sometimes little things can trip us up, and there’s a bit of R
metaphysics that can be confusing. You’ll see one here so it may ring a bell if you have a problem later. Run these and look a the output:
Notice that these two print out the same information, but it’s arranged differently. Only the first one has the column name identified. That’s because the first one is still a data frame, and the second one is a vector. A vector is kind of a bare-bones column of data, without the extra information that a data frame has. This distinction becomes important occasionally. The general topic here is that of data types, which is covered in another lesson.
R
’s native way to create a data frame is with the data.frame()
function.
The df4
object is almost the same as tibble()
— in fact, go ahead and replace data.frame
in the code block above with tibble
and rerun it. You should get the same result.
There are some subtle differences between using data.frame()
and tibble()
. For our work it’s recommended to use tibble()
. One difference is that data.frame()
will rename non-standard column names, while tibble()
will not.
Run the following line and notice that it changes the name:
Run this line and notice that it does not change the name: