<- data.frame(
grades student_id = 1:9,
instructor_id = c(1, 1, 2, 3, 4, 4, 5, 5, 5),
student_first = c("Alice", "Bob", "Charlie",
"David", "Eve", "Stanislav",
"Yolanda", "Zoe", "Xavier"),
student_last = c("Smith", "Jones", "Kline",
"White", "Zettle", "Bernard-Zza",
"Zhang", "Xu", "Zimmerman"),
subject = c("Math", "Math", "Math",
"English", "English", "English",
"History", "History", "History"),
grade = c("A", "B", "A", "C", "B",
"A", "A", "B", "A"),
grade_points = c(4.0, 3.0, 4.0, 2.0, 3.0,
4.0, 4.0, 3.0, 4.0)
)<- c("English", "Math", "History")
dept_names <- c("F", "D", "C", "B", "A")
possible_grades <- grades |>
grades mutate(subject = factor(
subject, levels = dept_names,
ordered = FALSE),
grade = factor(
grade,levels = possible_grades,
ordered = TRUE)
)
Iterating with Loops
1 Using this document
Within this document are blocks of R
code. You can edit and execute this code as a way of practicing your R
skills:
- Edit the code that is shown in the box. Click on the
Run Code
button. - Make further edits and re-run that code. You can do this as often as you’d like.
- Click the
Start Over
button to bring back the original code if you’d like. - If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
2 Set up: Create a small data frame
A data frame
is like a table
in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.
The grades
data frame contains grade information for nine students who received grades from five instructors in three subjects.
3 The problem we are going to address
Sometimes we want to divide a data set into groups and perform a complex operation on the subset — for example, creating separate data sets or reports by academic departments, or by student class, or by year.
Basically, we need to be able to produce the exact same report but for different constituencies (“Produce a report for each department head that…”) or different underlying data (“Create a separate analysis on each admitted class for the VP of Enrollment”).
We are going to create a recipe for creating this type of report but, first, we are going to explore the simplest operator for iterating: the for
loop. (For more information on for
loops, see this page on rforir
. For more information on accessing data in a row or vector, see this page on rforir
.)
Along the way, we are also going to introduce an approach to collecting your output in Excel. Within this aside, we are going to introduce a standard approach to using helper functions to simplify our scripting across multiple projects.
4 The for
loop as a tool for iterating
Let’s look at a specific example and describe what happens in a for
loop:
Here is what happens:
- Line 1: Enter the
for
loop and setx
to1
. - Line 2: Print the value of
x
(1
). - Line 1: Return to the top of the
for
loop and setx
to2
. - Line 2: Print the value of
x
(2
). - Line 1: Return to the top of the
for
loop and setx
to3
. - Line 2: Print the value of
x
(3
). - Line 3: Exit the
for
loop since no more values are in the vector.
The general syntax for the for
loop is as follows:
What a for
loop does can be described as follows:
Execute all of the statements within a
for
loop while successively changing the variable to the values in the vector.
Question 1: Test your knowledge
Create a for
loop that works through the values in this vector:
For each value, print “This fruit is” and then the name of the fruit.
A bit of help
A reasonable first step for working with a for
loop is to define the for
loop itself with minimal processing inside. We’ll show you what this looks like in the next hint…but give it a try before looking.
A bit of help
Here is our first attempt. We ignored the fancy output instructions for now; we just want to ensure that we’ve got the foundation right.
A bit of help
The next step involves building the output string. The function str_c()
is useful for this.
Solution
The str_c()
can be used to concatenate any number of strings together; here we are using it with two — one a fixed string and the other a variable.
Question 2: Test your knowledge
We defined the list of dept_names
at the beginning of this lesson. Use a for
loop so that for each successive value in dept_names
, you print the last name of the students that took classes from that department. Within the loop, print the name of the department as a title.
A bit of help
An easy first step for this is to define the for
loop structure and print the department name.
A bit of help
The next step is to integrate your query into this loop. Note that in order to see the output of the query within a for
loop, you need to surround the query with a print()
statement.
Solution
This query filters to include grades for the current dept_name
and then select just the last name of the student.
Question 3: Test your knowledge
Now, without using a for
loop and using just R/tidyverse
, print the same information (though it will be formatted differently).
A bit of help
Do you use filter()
here? Do you use group_by()
? Do you use arrange()
? We answered no
, no
, yes
. But you might have also reasonably chosen no
, yes
, no
.
A bit of help
What information do you want to display? We only chose two columns.
Solution
We chose to arrange()
(sort) by subject
. But you might have chosen to use group_by()
instead. However, unless we do a calculation, we generally don’t use group_by()
…but it seems to work.
See how much easier it is to write and read the R/tidyverse
version? It’s not always possible but, when it is, we believe it’s so much better.
5 An aside: using Excel as a container for output
When using R
scripts (as opposed to quarto
, which is another approach to executing and managing your R
-based work), we need to develop an approach that we can integrate into our workflow. Here we are going to assume that you are comfortable copying — and otherwise integrating — Excel spreadsheets into documents and presentations.
We want to show you how to use an Excel spreadsheet file as a canvas to hold the outputs of data tables. (Note that we are not using Excel for computation and analysis; we are simply using it as a communications tool.) You can then use your own know-how to copy-and-paste the information from that file to its appropriate places (word processing documents, emails, or whatever).
5.1 The openxlsx2
package
As with many needs that we have for doing our work, someone has created a package that helps us interact — read, write, and format — with spreadsheet files. Several are available, but one that is currently under development and provides the capabilities that we need is the openxlsx2
package. (Yes, there is an openxlsx
package, but it is no longer under development. It’s generally better to adopt packages that are still evolving if you have a choice.)
The documentation for this package can be found here; the manual is available here.
This package is quite extensive and somewhat complex. As such, we have written some functions to make it easier to adopt for our limited, but common, use case.
5.1.1 Installation
If you haven’t installed the package openxlsx2
, then go ahead and do so by typing this at the Console
prompt in RStudio
.
5.1.2 Reading it in
Whenever you are going to use this package, you should add the following line to the top of your script.
5.1.3 Helper functions
The following are the helper functions that we have written to make it easier for you to use this functionality in your scripts. The drawback of using these functions instead of the underlying functions from openxslx2
is that they do not expose the full power of the package; their very purpose is to hide the functionality. Why would we choose to do this?
- At this early stage of our development as a user of
R/tidyverse
, we should refrain from over-engineering and over-complicating our scripts. Make them simple to write and simple to read. - We do not need all of the package’s capabilities. We just need a small portion of it while we are developing our set of projects. Leave it to later, when a specific need or request arises, to integrate more options into our scripts.
- Using a small fraction of a more complicated and complete package instead of a simpler and less complete package allows us to grow into the package over time and to use the helper functions to hide the complexity in the meantime. If we chose the other approach, at some point we would have to go through the painful process of adopting the approach of a new package; putting off this pain would hinder our future growth as an
R
user.
The purpose of each function is documented within the code below. In Section 5.2 we explain and demonstrate how to use these functions.
Here’s what you need to do to use these functions:
- Create a folder
source
in your project. - Within this folder, create a new R script named
worksheet_functions.R
. - Copy the code below and paste it into that new file.
- Save the file.
- At some point, you will want to change the settings for the fonts within the
set_ws_formatting
function to your own preferences.
#' Creates a spreadsheet object. Must be called
#' before any other function.
create_wb <- function() {
wb_workbook()
}
#' Adds a named worksheet to the workbook. It assumes
#' that the title of the worksheet will be placed
#' in cell (1, 1).
add_ws <- function(wb, ws_name, title_name) {
return(wb |>
wb_add_worksheet(ws_name) |>
wb_add_data(x = title_name))
}
#' Adds data to the current sheet of this workbook.
#' df_name is the name of the data frame that you
#' want to put in the workbook.
#' It puts the data frame starting on row 3 (in
#' order to leave space for the title above it).
#' Actually, the column headers are on row 3 and
#' the data starts on row 4.
add_data <- function(wb, df_name) {
return(wb |>
wb_add_data(sheet = current_sheet(),
x = df_name,
start_row = 3))
}
#' Sets formatting for the report. It has separate
#' font names and colors for the report title, the
#' column headers, and the data itself.
#' It sets the base font for the whole sheet so that
#' any changes you make (manually) to the worksheet
#' after it has been saved will, by default, have
#' this font setting.
#' You will almost certainly want to change the font
#' name, font color, and font size for the title,
#' header, and data to your preference.
#' Color names at https://htmlcolorcodes.com/color-names/
#' Hex colors at https://htmlcolorcodes.com/color-chart/
set_ws_formatting <- function(wb, df_name) {
dims_title <- wb_dims(1, 1)
dims_header <- wb_dims(rows = 3,
cols = 1:ncol(df_name))
dims_data <- wb_dims(rows = 4:(3+nrow(df_name)),
cols = 1:ncol(df_name))
wb$set_base_font(font_size = 10,
font_color = wb_color("red"),
font_name = "Arial")
return(wb |>
wb_add_font(sheet = current_sheet(),
dims = dims_title,
name = "Arial",
bold = TRUE,
wb_color(name = "black"),
size = 20) |>
wb_add_font(sheet = current_sheet(),
dims = dims_header,
name = "Arial",
bold = TRUE,
wb_color(hex = "#1f618d"),
size = 15) |>
wb_add_font(sheet = current_sheet(),
dims = dims_data,
name = "Times New Roman",
bold = FALSE,
wb_color(hex = "#2F4F4F"),
size = "14"))
}
#' Saves the workbook to a specific file name (which
#' should include the path). If the file already
#' exists, it will be over-written.
save_wb <- function(wb, file_name) {
wb |>
wb_save(file = file_name,
overwrite = TRUE)
}
#' Creates a blank data frame of a random size. The
#' only purpose of this is to use as a place holder
#' during program development.
create_blank_df <- function() {
num_rows <- runif(n=1, min=2, max=10)
num_cols <- runif(n=1, min=3, max=8)
return(as.data.frame(matrix(0, num_rows, num_cols)))
}
5.2 How to output data to a workbook
In this section, we take you through how you should use these helper functions to output your data frames to a workbook. We assume that you have gone through the steps described in the previous section.
5.2.1 Make the output
folder
As best practice, all of your projects should have an output
folder (to, shockingly, hold the output of your scripts). If you don’t already have one, go ahead and create it.
5.2.2 Load the library
At the top of your script, insert the following line where you install the other libraries that you need.
5.2.3 Making the helper functions available
At the top of your script, but after you have loaded all of the appropriate libraries, add the following line. This reads in the functions so that they are available to the script that you are writing.
The above line assumes that you have a source
folder and that you have put that named R
script in that folder.
5.2.4 Create the workbook
Now it is time for you to create the Excel workbook data object that will hold the output from the script. Put the following code near the top of your script, after the above lines but certainly before you want to output any information.
Note that there is nothing sacred about the wb
variable name. You can change it to whatever you want. However, as we are wont to do, we like to be consistent from script to script in our naming conventions. Thus, we use wb
when we are creating output in this way. You are encouraged to use a naming convention — use ours or make up your own. But, please, be consistent.
5.2.5 Add data to a worksheet
We are done with the setup. We are now faced with the challenge of putting data into a worksheet within the workbook.
The helper functions assume that you want the following:
- Every data frame added to the workbook will be added to a separate worksheet within that workbook.
- Each worksheet has a name that you can give it.
- Each worksheet has a report name in cell
A1
(the top left corner of the worksheet) that has its own special font formatting. - The data on each worksheet has a row of column headers (with its own special font formatting) right above all of the data (again, with its own special font formatting).
- Each report on each worksheet has the same formatting, providing a professional, unified look to your work.
Each time you want to add another data frame to the worksheet, use the following code. Ensure that the name of the data frame is the same in the 2nd and 3rd commands.
5.2.6 Add more data to another worksheet
Just to emphasize the point, if you want to add another data frame to another worksheet within that same workbook, you should execute these same three commands (but with different arguments), as such:
You would repeat each of these three commands every time you have another data frame to add to the workbook.
5.2.7 Save
The previous code created the output in your computer’s memory. Now you need to ensure that it is written to disk. At the end of your script (or at least after the portion that is concerned with gathering data to be output), you should add the following code:
Note that this code puts the workbook into the output
folder in a file called FILENAME.xlsx
. You can, of course, call your output file something different, but we tend to use output/output.xlsx
unless we have a compelling reason for changing it. Why? Because with all of the projects that we have, we like to make it easy to move from one project to another, and re-using common names that have a common purpose is just one way to do so.
5.3 Summary of using spreadsheets for gathering output
The above tools and process can be integrated, with minimal effort, into any R
script in which you want to output one or more data tables into a workbook.
6 Recipe for a repeating report using for
To make this concrete, suppose we want to create a grade report for each academic subject. The report shows each professor’s grade distribution for the last academic year. We’ll output that in a workbook with each academic subject on a different worksheet.
6.1 1. Bring together all of the data into a single data frame
Create the combined data set. The grades
data frame is a tiny example.
6.2 2. Identify the groups in this data frame
Identify the groups in a new data frame. In this case, the groups are the unique values in the subject
column. (Note that we could also use the dept_names
vector defined at the beginning, but we wanted to remind you of this option.)
So we now have a vector subjects
that contains all of the different values that subject
takes on in the grades
data frame.
6.3 3. Define a query for one element of the group
To ensure that you know what reports you are going to collect, you should define the queries needed to collect the data for one element of the group (defined in step #2 above).
In this case, as stated above, for each department, we want to display each professor’s grade distribution for the last academic year. The following queries display what we want for the "Math"
department:
We will come back to this below.
6.4 4. Create the structure of the for
loop
Here we present two common ways to structure the for
loop — that is, to ensure that you go through the for
loop exactly the number of times that you need to.
The first option is probably the most obvious way but we think the second option is the one we prefer.
6.4.1 Option #1: Counting
This approach is basically the following: 1) determine how many times to go through the loop; 2) count from 1 to that number.
Let’s set up the bare-bones version of this.
Since subjects
is a data frame with a single column, we can determine how many values are in it by using the nrow()
function. Thus, for (i in 1:nrow(subjects))
instructs R
to count from 1
to the number of subjects
as it repeats the for
loop.
On line 4, we assign the ith value in the subject
column of the subjects
data frame to my_subject
. On line 5, we construct a report title using the my_subject
value.
6.4.2 Option #2: Stepping through a vector
This approach has a more streamlined trick — simply move through the vector itself as a means of ensuring that you go through the for
loop the correct number of times. When you run out of items in the vector, you are done.
In this case we already have the vector, subjects
, so we simply assign my_subject
to each individual subject, one by one (as in line 1 below):
Everything else remains the same.
We are going to use this approach for the rest of our example.
6.5 5. Add functions for capturing output in a workbook
Now that we have the basics of the for
loop in place, we should add the steps that we described above in Section 5.2 to put your output in a workbook:
Let’s go through the above query:
- Line 1
-
We create the
wb
object that will hold the workbook and all of its worksheets. - Line 2
-
Use a
for
loop ondept_names
. - Lines 3–4
- Define the report title.
- Line 9
-
As a temporary placeholder, we define
empty_df
so that we have something to add to the page. Our actual query will go here later. - Line 10
- Again, as a temporary hack, we print the data frame that we’re going to add to the workbook.
- Lines 13–15
-
Add a worksheet with name
my_subject
and a titlereport_title
in cellA1
. - Line 16
-
Add the report to the worksheet that you just created. We are currently adding
empty_df
; this will change in future steps. - Line 17
- Format the report on the worksheet that you just added.
- Line 22
-
Given that we are operating on a Web page, we have to do this in order to verify that there is something in
wb
. If you are working inRStudio
, then you should delete this line (and, of course, its associated comments). - Line 25
-
If you are working within
RStudio
, then you should uncomment this line. You should also changeFILENAME
to whatever name that you want. We generally useoutput
for consistency across projects.
6.6 6. Add the interim calculations for your report
Now that you have the for
loop basic structure and have integrated placeholders for sending the output to a workbook, it’s now time to start adding the actual calculations to your code.
You defined the queries that you are going to use back in Section 6.3. Note that you defined two data frames: subject_grades
, which is an interim calculation, and prof_grades
, which is the final calculation.
What we’re going to do here is calculate and send to the output the interim calculation as a final check for our work. Instead of using empty_df
as we did above, we are going to calculate subject_grades
using the current value of my_subject
(as shown on lines 6–9 below). We also change the name of the data frame in lines 16 and 17. These are the only changes that we made.
After this has run (if you’re working in RStudio
), check the workbook in the output
folder to validate that it looks right.
6.7 7. Add the final calculations for your report
If you have multiple stages of interim calculations, you might repeat step 6 for each one of them by continually adding the appropriate queries to the for
loop and changing the data frame that you send to the output file.
Eventually, you will get to the final calculations. Add all of the code from Section 6.3 to the for
loop (in lines 6–15), and change the data frame name on lines 21 and 22 to the appropriate data frame.
After running the above code in RStudio
, you should definitely check the worksheet in the output
folder.
6.8 8. Create a CSV
file if desired
You may also want to create CSV
files to hold the data frame for each report as you work your way through the for
loop. If that’s so, then you should add something like lines 26–29 to the inside of your for
loop.
7 Try it yourself
The following are a series of exercises that help you try out what you learned above on a new problem. We are going to create reports for each department that calculates the number of students and the average grade points for each instructor in that department. The output is going to go to a workbook.
You can go through the whole process in this Web page, but we encourage you to also run this code within RStudio
so that you can see the workbook that it creates at each stage.
Before starting all of this in RStudio
, you need to complete the first three sections of Section 5.2.
Question 4: Test your knowledge
Define a vector that you are going to step through in your for
loop. (Note that it can be something you have already seen in this page.)
A bit of help
Before you try to define something yourself, isn’t there a vector that is already defined that you can use?
Solution
Yep. There is.
Question 5: Test your knowledge
Define an appropriate query for one element of the group (e.g., “English”).
A bit of help
You need to use 3 operators within the R/tidyverse
query. What are they?
Solution
As usual, we filter to include just the subject we are interested in first. Since our calculations are by instructor, we use group_by()
on that column. Finally, we calculate the two values that were asked for.
Question 6: Test your knowledge
Create the structure of the for
loop using the “stepping through a vector” method.
A bit of help
We have already established that dept_names
is going to be the vector and that my_subject
is going to be the variable name. So that means we have this so far:
Print both the subject name and, after defining the report title, the report title. We defined the report title as `“Grades report for XXX”.
Solution
This is a fairly straight-forward translation of the example that we went through above.
Question 7: Test your knowledge
Add functions for capturing output in a workbook.
A bit of help
We are certainly starting with our answer to the previous question. Now it’s just a matter of filling in the blanks in the following in an appropriate manner. Use the recipe and example to go in the right direction.
Solution
If you follow the recipe and example, you should come up with the following.
If you are running this in RStudio
, then you should definitely replace the final print(wb)
statement with the one for saving it to a file…and then look in the file to verify that it’s working.
Question 8: Test your knowledge
Add the final calculations for your report.
A bit of help
Again, we’re starting with our answer to the previous question. Now it’s just a matter of putting in the calculations and determining what to put in the blanks for the final two statements in the for
loop.
Solution
It’s really just a simple copy-and-paste from the query that we defined above. And we want to output the dept_grades
data frame, so we put that into the blanks at the end of the for
loop.
8 Summary
In this lesson, we brought together a lot of the tools that you have at your disposal:
R/tidyverse
queries with all the usual operators- usage of packages
- functions
We have also added a few new arrows to your quiver:
for
loops- an operator for reading in an
R
script - a standard way to organize functions in a
source
folder - a standard way to organize output in an
output
folder - a standard way to capture output in a workbook
The result is a recipe for printing a series of data tables to separate worksheets in a single workbook. You will probably find yourself applying this recipe to almost every project you work on while you are growing your reliance on R
.