Lesson on Mutate

Creating and Modifying Columns

1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

1.1 Using RStudio

If you’re following along with this exercise in RStudio, then you need to execute the following code in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading this libraries for you.

library(tidyverse)
library(tidylog)
  • The first line loads the tidyverse package. You could actually load just the packages dplyr, purrr, stringr, and ggplot2 to save on memory, but with today’s computers and the types of things that you’re doing at this point, you don’t need to worry about load speed and/or memory usage just yet.
  • The second package tells R to give more detailed messages.

2 Set up: Create a small data frame

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

See data_types.qmd for more on these data structures.

df <- data.frame(
  id         = c(1, 2, 3, 4, 5, 6),
  first_name = c("Alice", "Bob", "Charlie", "David", 
                 "Eve", "Stanislav"),
  last_name  = c("Smith", "Jones", "Kline", "White", 
                 "Zettle", "Bernard-Zza"),
  class      = c("Freshman", "Sophomore", "Junior", 
                 "Senior", "Senior", "Sophomore"),
  age        = c(17,18,19,20,21,19),
  gpa        = c(3.51, 3.20, 3.83, 3.03, 3.98, 2.91), 
  home       = c("Dayton, OH", "Columbia, SC", 
                 "Cleveland, OH", "New York, NY", 
                 "Las Vegas, NV", "Cedar Rapids, IA")
)

3 Creating new columns with mutate()

In working with spreadsheets, we commonly want to add a column that is created from one or more existing columns. The character variables lesson has an example of splitting the home column into city and state, thereby creating two new columns from one existing one. Another example on that page creates a full_name column that combines first and last names.

Here we’ll focus on numerical and logical columns and some other more powerful tools for calculating values. For more information on mutate(), see this page on rforir. You can see lots of examples of how to use mutate() on this page and this page.

4 Creating a new logical column

4.1 The situation

Suppose that we are faced with the following situation. A faculty researcher sends us a list of student IDs that will be surveyed. For IRB purposes, the faculty member wants to know if any of them are under 18, and so require extra precautions. Suppose we already generated df from the list of IDs, and for convenience want to add a column called minor that is true if the student is under 18, and false otherwise.

4.2 The mutate() function

This calls for the mutate() function in the R/tidyverse — which is convenient given that this is what this lesson is focusing on. Here is the structure of the mutate() function when you want to create one new column:

mutate(NEW_COLUMN_NAME = SOME_CALCULATION)

It is also possible to create two new columns with the same mutate() operation:

mutate(NEW_COLUMN_NAME1 = SOME_CALCULATION1,
       NEW_COLUMN_NAME2 = SOME_CALCULATION2)

We’re guessing that you can figure out how to create more than two columns given the above.

4.3 The solution

Okay, now that we have the mutate() function at our disposal, let’s review what we’ve been asked to do:

  1. Work with the df data frame.
  2. Create a new column called minor.
  3. The new column should have the value true if age < 18 is true and false if age < 18 is false.

The next code block shows how we write this R/tidyverse code. Remember the structure of the mutate() function that we described above:

  • NEW_COLUMN_NAME is minor
  • SOME_CALCULATION is age < 18.

You might have expected a calculation to be something like x + 3 or 5 * 2, but here we have a logical expression. Consider the following:

  • If age = 12, then age < 18 has a value of true.
  • If age = 23, then age < 18 has a value of false.

This is a legitimate calculation (or expression) in R.

Run the following code block and verify that it does what you expect it to do.

Tip

Note for creating all these new columns

Throughout this lesson, you are going to be creating new columns. Each of these exercises is isolated from the others, so when you create a new column, it will only exist within that particular exercise. When you go to complete a new exercise, you will be starting from the original df data frame.

Test your knowledge

Create a TRUE/FALSE column (that is, a logical column) called deans_list that is TRUE if the student’s GPA is 3.5 or higher, and FALSE otherwise.

Solution

This is how we solved this problem:

You can verify for yourself that the values for deans_list are correct for each value of gpa.

5 Creating a new numeric column

We can also create new columns by performing computations on existing columns. For example, suppose we want to add a column called gpa_squared that is the square of the student’s GPA.

As you can see from the displayed output, a new numeric column gpa_squared exists, and its contents are the square of the value in the corresponding gpa column.

6 Creating categorical values

Sometimes we need to create a column containing a categorical value based on a complex condition.

6.1 Example problem

For example, suppose we want to create a column called gpa_category that is “High” if the student’s GPA is 3.5 or higher, “Medium” if the GPA is between 3.0 and 3.5, and “Low” otherwise. The case_when() function is perfect for this, and is used inside the mutate().

6.2 Basic structure of the case_when() function

The case_when() function takes a series of conditions, and returns the value of the first condition that is TRUE. If no condition is TRUE, it returns NA. Generally, it looks like this:

case_when(
  CONDITION_1 ~ VALUE_1,
  CONDITION_2 ~ VALUE_2,
  ...
)

6.3 A solution to our problem

Back in Section 6.1 we specific the problem. The value of the new gpa_category column depends on the value of the gpa column. This calls for the case_when() function.

The following implements the conditions specified in this problem:

Note that if gpa is blank, then the gpa_category column will be set to NA.

6.4 More complex structure of the case_when() function

It is sometimes the case that we want to specify a default value other than NA. Fortunately, we can easily do so within the case_when() function, as follows:

case_when(
  CONDITION_1 ~ VALUE_1,
  CONDITION_2 ~ VALUE_2,
  ...
  .default = DEFAULT_VALUE
)

To test your understanding of this function, complete the following exercises.

Test your knowledge

Create a column called age_category that is “Pre-teen” if the student is younger than 13, “Teen” if the student is 13-19, “Young Adult” if the student is 20-29, and “Adult” otherwise.

A bit of help

When you are faced with problems that specify ranges with less than, greater than, less than or equal, or greater than or equal, you must carefully specify your conditions. When validating that you have specified correctly, you will want to focus on boundary conditions.

For example, you would want to test your work with students who have ages, 12, 13, 19, 20, 29, and 30 so that you can verify that you created the right divisions between the values.

A bit of help

Be very careful about the ordering of your conditions within the case_when() function. R will stop evaluating once (from the top down) once it finds a condition that it meets.

Solution

The following is how we specified our query. After running this code block, check that the values of age_category are correct.

Now consider the following code block. (It is incorrect!)

Can you see what is wrong with it?

Run the code block and look at the output.

As you can see, each student is specified as a Young adult given how the conditions are ordered. Every student has an age less than 30, so R stops evaluation after the first condition for each one.

Test your knowledge

Create a TRUE/FALSE column called instate that is TRUE if the student’s home address is in Ohio, and FALSE otherwise.

To create this query, you will need to use the str_detect() function. Here is its structure:

str_detect(STR_A, STR_B)

Given the above, this function returns TRUE if STR_B is contained within STR_B.

Also, note that when specifying logical values in R/tidyverse, you must pay attention to capitalization! It is TRUE and not true or True. It is FALSE and not false or False.

Solution

A lot is going on with this query. First, consider the str_detect() function call: it is detecting if ", OH" is contained within the home column. You might have specified "OH" and this would have worked for this data set. But lots can go on with text, and our answer isn’t necessarily bullet-proof. What if the column doesn’t have a comma? What if the state name is spelled out? It can get really ugly and compliated to deal with messy textual data.

In any case, that is the string that we are looking for in the home column.

Thus, if it finds the ", OH" string in home, then it gives instate the value of TRUE. If it does not match that, then it gives the default value. Since we do not want that value to be NA, we have to specify the value of .default (don’t forget the period at the beginning!).

There’s another, more direct way, to answer this query. Let’s explore.

Test your knowledge

Address the exact same question as above: Create a TRUE/FALSE column called instate that is TRUE if the student’s home address is in Ohio, and FALSE otherwise.

To create this query, you should use the str_detect() function, but use it to directly calculate the value of the instate column.

Solution

This answer is almost embarrassingly simple compared to the previous solution, isn’t it? But the results are the same.

7 Modifying existing columns

For a table in a report, we want GPAs rounded to one decimal. We won’t need the original gpa column after that. We could create a new column with a new name, but in this case it’s safe just to overwrite the existing column.

Tip

The round() function

The function round(X, D) rounds the number X to D digits.

Note that it does not work how you think it works!!!

R uses “banker’s rounding”, explained in ridiculous detail on this page.

Consider the following:

vals <- c(-11.77, -11.55, -11.5, -11.22, 
          0.35, 11.22, 11.5, 11.55, 11.77)
round(vals, 1)
[1] -11.8 -11.6 -11.5 -11.2   0.3  11.2  11.5  11.6  11.8
round(vals, 0)
[1] -12 -12 -12 -11   0  11  12  12  12

In any case, we recommend that you simply use round() and don’t worry too much about the details.

Here is how you use this function to complete the above request:

To replace (or overwrite) an existing column (as in this previous code block), just use its name in the mutate() function on the left of the equals sign instead of creating a new column name.

Test your knowledge

A faculty researcher has requested student GPA data for the work. University policy says that this data must be slightly modified before sending, to protect student privacy. We do this by adding or subtracting a random number between about -0.1 and 0.1 to each GPA.

This random number can be generated with rnorm(1, M, SD), which creates 1 normally distributed random number with mean M and standard deviation SD.

In this case, we want to use M = 0.0 and SD = 0.05.

Be sure to round the GPA to 2 digits.

Solution

There are two different ways to solve this problem (at least).

In this first case, the random number that we’re generating comes from this code:

rnorm(1, 0, 0.05)

Note that the random number is different for each of the numbers in the gpa column.

Another way to do this (that we have not addressed as of yet) is to generate a whole vector of random numbers. How many do we want to generate? As many as there are rows in the table. And how many is that? You can use n() to calculate this number. Thus, rnorm(n(), 0, 0.05) calculates a vector of random numbers from this normal distribution.

Using this form of rnorm(), the solution looks like the following:

8 Creating data bins by count

Sometimes we want to take a numerical vector’s values (the “range”) and divide into discrete “bins” like four of them representing the top 25% to bottom 25% (i.e. quartiles). There is a convenient tidyverse function for this called ntile() that’s natural to use within a mutate().

8.1 Task

Create a column called gpa_half that is 2 if the student’s GPA is in the top half of the list of students, and 1 if in the bottom half.

Here we created two bins (a “duo-tile”?) by specifying 2 as the second argument to ntile(). Note the gpa values for each gpa_half value — all of the gpa values for gpa_half = 1 will be less than all of the gpa values for gpa_half = 2. This is what the ntile() function does for you.

8.2 Exploring ntile()

8.2.1 Calculate some sample data

Here’s an example with a data frame of normal random numbers. The following code block creates a data frame called random_numbers with one column named x_val that is filled with numbers from a normal distribution:

8.2.2 Use summary() to understand the data better

The summary() function calculates summary statistics for the columns in a data frame. Let’s look at what it says about x_val:

You can see the minimum, maximum, median, mean, and other values calculated from the values in x_val.

8.2.3 Create quintiles for this data

In the following, we create quintiles (that is, five equal bins) with mutate() and ntile():

For more information on quantiles, quintiles, and more, see this page on rforir.

8.2.4 Validate the quintiles

Tabulate the quintiles:

The bins divide the data into equal-sized groups, so the number of cases in each bin will be the same. This is useful for creating categories like “top 25%” or “bottom 25%” and so on.

Test your knowledge

Try changing the above so that it creates seven equal-sized bins for the x_val column:

Solution

This is a straight-forward application of the ntile() function:

Note that the values of n are not all the same since 7 does not divide equally into 100.

9 Creating data bins by width

A different kind of non-ntile binning is done with the cut() function, which is useful for creating bins of equal width on the x-axis rather than equal counts. For example, cut(x_val, 5) creates five bins of equal width and reports the range as a category (factor variable).

(Note that, in the results, the group_num column will still exist from the previous exercise if you completed it).

The values of the domain column are not as easy to interpret as the ntile() approach, but they are quite informative.

Let’s take a look at the count() information for this type of bin:

Notice that the counts represent a bell shape.

Histograms are made with this second way of binning. Below we plot the histogram with the original x_val column. Binning is automatically done by geom_histogram(), but we can specify the number of bins in the following manner: