library(tidyverse)
library(tidylog)
Lesson on Mutate
Creating and Modifying Columns
1 Using this document
Within this document are blocks of R
code. You can edit and execute this code as a way of practicing your R
skills:
- Edit the code that is shown in the box. Click on the
Run Code
button. - Make further edits and re-run that code. You can do this as often as you’d like.
- Click the
Start Over
button to bring back the original code if you’d like. - If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
1.1 Using RStudio
If you’re following along with this exercise in RStudio
, then you need to execute the following code in the Console
. If you are going through this set of exercises within this document, you don’t have to do so because we are loading this libraries for you.
- The first line loads the
tidyverse
package. You could actually load just the packagesdplyr
,purrr
,stringr
, andggplot2
to save on memory, but with today’s computers and the types of things that you’re doing at this point, you don’t need to worry about load speed and/or memory usage just yet. - The second package tells
R
to give more detailed messages.
2 Set up: Create a small data frame
A data frame
is like a table
in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.
See data_types.qmd
for more on these data structures.
<- data.frame(
df id = c(1, 2, 3, 4, 5, 6),
first_name = c("Alice", "Bob", "Charlie", "David",
"Eve", "Stanislav"),
last_name = c("Smith", "Jones", "Kline", "White",
"Zettle", "Bernard-Zza"),
class = c("Freshman", "Sophomore", "Junior",
"Senior", "Senior", "Sophomore"),
age = c(17,18,19,20,21,19),
gpa = c(3.51, 3.20, 3.83, 3.03, 3.98, 2.91),
home = c("Dayton, OH", "Columbia, SC",
"Cleveland, OH", "New York, NY",
"Las Vegas, NV", "Cedar Rapids, IA")
)
3 Creating new columns with mutate()
In working with spreadsheets, we commonly want to add a column that is created from one or more existing columns. The character variables lesson has an example of splitting the home column into city and state, thereby creating two new columns from one existing one. Another example on that page creates a full_name
column that combines first and last names.
Here we’ll focus on numerical and logical columns and some other more powerful tools for calculating values. For more information on mutate()
, see this page on rforir
. You can see lots of examples of how to use mutate()
on this page and this page.
4 Creating a new logical column
4.1 The situation
Suppose that we are faced with the following situation. A faculty researcher sends us a list of student IDs that will be surveyed. For IRB purposes, the faculty member wants to know if any of them are under 18, and so require extra precautions. Suppose we already generated df
from the list of IDs, and for convenience want to add a column called minor
that is true
if the student is under 18, and false
otherwise.
4.2 The mutate()
function
This calls for the mutate()
function in the R/tidyverse
— which is convenient given that this is what this lesson is focusing on. Here is the structure of the mutate()
function when you want to create one new column:
mutate(NEW_COLUMN_NAME = SOME_CALCULATION)
It is also possible to create two new columns with the same mutate()
operation:
mutate(NEW_COLUMN_NAME1 = SOME_CALCULATION1,
NEW_COLUMN_NAME2 = SOME_CALCULATION2)
We’re guessing that you can figure out how to create more than two columns given the above.
4.3 The solution
Okay, now that we have the mutate()
function at our disposal, let’s review what we’ve been asked to do:
- Work with the
df
data frame. - Create a new column called
minor
. - The new column should have the value
true
ifage < 18
is true andfalse
ifage < 18
is false.
The next code block shows how we write this R/tidyverse
code. Remember the structure of the mutate()
function that we described above:
NEW_COLUMN_NAME
isminor
SOME_CALCULATION
isage < 18
.
You might have expected a calculation to be something like x + 3
or 5 * 2
, but here we have a logical expression. Consider the following:
- If
age = 12
, thenage < 18
has a value oftrue
. - If
age = 23
, thenage < 18
has a value offalse
.
This is a legitimate calculation (or expression) in R
.
Run the following code block and verify that it does what you expect it to do.
Note for creating all these new columns
Throughout this lesson, you are going to be creating new columns. Each of these exercises is isolated from the others, so when you create a new column, it will only exist within that particular exercise. When you go to complete a new exercise, you will be starting from the original df
data frame.
Test your knowledge
Create a TRUE/FALSE column (that is, a logical column) called deans_list
that is TRUE if the student’s GPA is 3.5 or higher, and FALSE otherwise.
Solution
This is how we solved this problem:
You can verify for yourself that the values for deans_list
are correct for each value of gpa
.
5 Creating a new numeric column
We can also create new columns by performing computations on existing columns. For example, suppose we want to add a column called gpa_squared
that is the square of the student’s GPA.
As you can see from the displayed output, a new numeric column gpa_squared
exists, and its contents are the square of the value in the corresponding gpa
column.
6 Creating categorical values
Sometimes we need to create a column containing a categorical value based on a complex condition.
6.1 Example problem
For example, suppose we want to create a column called gpa_category
that is “High” if the student’s GPA is 3.5 or higher, “Medium” if the GPA is between 3.0 and 3.5, and “Low” otherwise. The case_when()
function is perfect for this, and is used inside the mutate()
.
6.2 Basic structure of the case_when()
function
The case_when()
function takes a series of conditions, and returns the value of the first condition that is TRUE. If no condition is TRUE, it returns NA
. Generally, it looks like this:
case_when(
~ VALUE_1,
CONDITION_1 ~ VALUE_2,
CONDITION_2
... )
6.3 A solution to our problem
Back in Section 6.1 we specific the problem. The value of the new gpa_category
column depends on the value of the gpa
column. This calls for the case_when()
function.
The following implements the conditions specified in this problem:
Note that if gpa
is blank, then the gpa_category
column will be set to NA
.
6.4 More complex structure of the case_when()
function
It is sometimes the case that we want to specify a default value other than NA
. Fortunately, we can easily do so within the case_when()
function, as follows:
case_when(
~ VALUE_1,
CONDITION_1 ~ VALUE_2,
CONDITION_2
....default = DEFAULT_VALUE
)
To test your understanding of this function, complete the following exercises.
Test your knowledge
Create a column called age_category
that is “Pre-teen” if the student is younger than 13, “Teen” if the student is 13-19, “Young Adult” if the student is 20-29, and “Adult” otherwise.
A bit of help
When you are faced with problems that specify ranges with less than, greater than, less than or equal, or greater than or equal, you must carefully specify your conditions. When validating that you have specified correctly, you will want to focus on boundary conditions.
For example, you would want to test your work with students who have ages, 12, 13, 19, 20, 29, and 30 so that you can verify that you created the right divisions between the values.
A bit of help
Be very careful about the ordering of your conditions within the case_when()
function. R
will stop evaluating once (from the top down) once it finds a condition that it meets.
Solution
The following is how we specified our query. After running this code block, check that the values of age_category
are correct.
Now consider the following code block. (It is incorrect!)
Can you see what is wrong with it?
Run the code block and look at the output.
As you can see, each student is specified as a Young adult
given how the conditions are ordered. Every student has an age less than 30, so R
stops evaluation after the first condition for each one.
Test your knowledge
Create a TRUE/FALSE column called instate
that is TRUE
if the student’s home address is in Ohio, and FALSE
otherwise.
To create this query, you will need to use the str_detect()
function. Here is its structure:
str_detect(STR_A, STR_B)
Given the above, this function returns TRUE
if STR_B
is contained within STR_B
.
Also, note that when specifying logical values in R/tidyverse
, you must pay attention to capitalization! It is TRUE
and not true
or True
. It is FALSE
and not false
or False
.
Solution
A lot is going on with this query. First, consider the str_detect()
function call: it is detecting if ", OH"
is contained within the home
column. You might have specified "OH"
and this would have worked for this data set. But lots can go on with text, and our answer isn’t necessarily bullet-proof. What if the column doesn’t have a comma? What if the state name is spelled out? It can get really ugly and compliated to deal with messy textual data.
In any case, that is the string that we are looking for in the home
column.
Thus, if it finds the ", OH"
string in home
, then it gives instate
the value of TRUE
. If it does not match that, then it gives the default value. Since we do not want that value to be NA
, we have to specify the value of .default
(don’t forget the period at the beginning!).
There’s another, more direct way, to answer this query. Let’s explore.
Test your knowledge
Address the exact same question as above: Create a TRUE/FALSE column called instate
that is TRUE
if the student’s home address is in Ohio, and FALSE
otherwise.
To create this query, you should use the str_detect()
function, but use it to directly calculate the value of the instate
column.
Solution
This answer is almost embarrassingly simple compared to the previous solution, isn’t it? But the results are the same.
7 Modifying existing columns
For a table in a report, we want GPAs rounded to one decimal. We won’t need the original gpa
column after that. We could create a new column with a new name, but in this case it’s safe just to overwrite the existing column.
The round()
function
The function round(X, D)
rounds the number X
to D
digits.
Note that it does not work how you think it works!!!
R
uses “banker’s rounding”, explained in ridiculous detail on this page.
Consider the following:
<- c(-11.77, -11.55, -11.5, -11.22,
vals 0.35, 11.22, 11.5, 11.55, 11.77)
round(vals, 1)
[1] -11.8 -11.6 -11.5 -11.2 0.3 11.2 11.5 11.6 11.8
round(vals, 0)
[1] -12 -12 -12 -11 0 11 12 12 12
In any case, we recommend that you simply use round()
and don’t worry too much about the details.
Here is how you use this function to complete the above request:
To replace (or overwrite) an existing column (as in this previous code block), just use its name in the mutate()
function on the left of the equals sign instead of creating a new column name.
Test your knowledge
A faculty researcher has requested student GPA data for the work. University policy says that this data must be slightly modified before sending, to protect student privacy. We do this by adding or subtracting a random number between about -0.1 and 0.1 to each GPA.
This random number can be generated with rnorm(1, M, SD)
, which creates 1
normally distributed random number with mean M
and standard deviation SD
.
In this case, we want to use M = 0.0
and SD = 0.05
.
Be sure to round the GPA to 2 digits.
Solution
There are two different ways to solve this problem (at least).
In this first case, the random number that we’re generating comes from this code:
rnorm(1, 0, 0.05)
Note that the random number is different for each of the numbers in the gpa
column.
Another way to do this (that we have not addressed as of yet) is to generate a whole vector of random numbers. How many do we want to generate? As many as there are rows in the table. And how many is that? You can use n()
to calculate this number. Thus, rnorm(n(), 0, 0.05)
calculates a vector of random numbers from this normal distribution.
Using this form of rnorm()
, the solution looks like the following:
8 Creating data bins by count
Sometimes we want to take a numerical vector’s values (the “range”) and divide into discrete “bins” like four of them representing the top 25% to bottom 25% (i.e. quartiles). There is a convenient tidyverse
function for this called ntile()
that’s natural to use within a mutate()
.
8.1 Task
Create a column called gpa_half
that is 2
if the student’s GPA is in the top half of the list of students, and 1
if in the bottom half.
Here we created two bins (a “duo-tile”?) by specifying 2
as the second argument to ntile()
. Note the gpa
values for each gpa_half
value — all of the gpa
values for gpa_half = 1
will be less than all of the gpa
values for gpa_half = 2
. This is what the ntile()
function does for you.
8.2 Exploring ntile()
8.2.1 Calculate some sample data
Here’s an example with a data frame of normal random numbers. The following code block creates a data frame called random_numbers
with one column named x_val
that is filled with numbers from a normal distribution:
8.2.2 Use summary()
to understand the data better
The summary()
function calculates summary statistics for the columns in a data frame. Let’s look at what it says about x_val
:
You can see the minimum, maximum, median, mean, and other values calculated from the values in x_val
.
8.2.3 Create quintiles for this data
In the following, we create quintiles (that is, five equal bins) with mutate()
and ntile()
:
For more information on quantiles, quintiles, and more, see this page on rforir
.
8.2.4 Validate the quintiles
Tabulate the quintiles:
The bins divide the data into equal-sized groups, so the number of cases in each bin will be the same. This is useful for creating categories like “top 25%” or “bottom 25%” and so on.
Test your knowledge
Try changing the above so that it creates seven equal-sized bins for the x_val
column:
Solution
This is a straight-forward application of the ntile()
function:
Note that the values of n
are not all the same since 7
does not divide equally into 100
.
9 Creating data bins by width
A different kind of non-ntile
binning is done with the cut()
function, which is useful for creating bins of equal width on the x
-axis rather than equal counts. For example, cut(x_val, 5)
creates five bins of equal width and reports the range as a category (factor variable).
(Note that, in the results, the group_num
column will still exist from the previous exercise if you completed it).
The values of the domain
column are not as easy to interpret as the ntile()
approach, but they are quite informative.
Let’s take a look at the count()
information for this type of bin:
Notice that the counts represent a bell shape.
Histograms are made with this second way of binning. Below we plot the histogram with the original x_val
column. Binning is automatically done by geom_histogram()
, but we can specify the number of bins in the following manner: