Week 1 Homework

1 Using quarto-live documents

Within this document are blocks of R code.

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

2 Instructions for this homework

  1. Work on the questions from top to bottom. All of the questions can be found in the blue boxes.
  2. Before you can work on any question, you need to run the code box in @setup.
  3. These questions differ from the ones in your “Lessons” in that, after you press the “Run Code” button, it will tell you if you got it right and also try to tell you where you might have done something different than what we expected. (Any and all feedback would be appreciated here. This code is all new and, we’re sure, will need revisions. Also let us know if the feedback provided is helpful — it is quite complex to write.)
  4. We also provide Hints (for most) and Solutions (for all) throughout the homework.

If you are having trouble with these, the Problem Session on Wednesday would be a great time to get feedback — that’s why these sessions exist.

If you haven’t already completed them, then go back to the lessons for the pipe (|>) operator and the select() and filter() functions.

3 Preliminaries

3.1 Setup

We have loaded both the tidyverse and tidylog packages in the background. If you are working in RStudio, then run the following code at the Console:

library(tidyverse)
library(tidylog)

We are now going to load the raw data from the CSV files. If you are working in RStudio, then execute the following code at the Console:

courses <- read_csv("data/course_section.csv")

If you are working in this web page, then read in the data and display it by running the following code:

Tip

What to do if all goes wrong?

If you do something on this page and somehow mess up the courses data frame, then just run the above code block again. It will reload the data. You can then go back to where you were working.

3.2 Data Dictionary

The following describes the contents of each of the columns in courses:

  • CourseSectionID: unique identifier for each course section
  • CourseID: unique identifier for each course, specifying subject and course number.
  • SectionID: the section number of a course type, e.g., 001, 002, etc.
  • TermID: unique identifier for each term
  • Credits: the number of credits for the course
  • ProfID: unique identifier for each professor

4 Questions

Do your best to answer each of the following questions. You can complete them within this web page or within RStudio. For each question, we provide a solution (with accompanying explanation); for most, we also provide a series of hints that you can use to help yourself along the way.

When you see a series of underscore characters within a code block, that is our hint that somethinganything — needs to go there. Replace those underscores with something and then press the Run Code button. Keep trying until you think you’ve got it right. Read the provided feedback or click on Show Hint if you need some help, and then click on Show Solution when you’re ready.

You will get the most benefit from going through these questions if you try — multiple times, if necessary — to answer them. Getting your brain engaged and struggling is the key to building a memory about these skills.

4.1 Practice with select()

The following five questions primarily test your knowledge of the select() operator.

Question 1: Test your knowledge

Create a new data frame from courses by using the select() function to keep only the columns CourseSectionID and Credits. Call the new data frame course_credits.

A bit of help

You need to assign the results of the query to course_credits.

course_credits <- ______ |>
  select(______)

A bit of help

You are selecting data from the courses data table (tibble).

course_credits <- courses |>
  select(______)

A bit of help

You need to select two separate columns and separate them by a comma.

course_credits <- courses |>
  select(______, ______)

Solution

This is how we answered this question:

1course_credits <- courses |>
2    select(CourseSectionID, Credits)
1
Assign the results of the query to course_credits. The source data is in the courses tibble.
2
Select (choose) two columns from the courses tibble.

Question 2: Test your knowledge

If you are using RStudio, then using the data browser at top right, inspect the data frame course_credits to make sure it looks right.

Solution

You should see a spreadsheet-like display with the two columns you selected. The data should be the same as the data in courses, the number of rows should be the same as in courses, and the data types for each column should be the same.

Question 3: Test your knowledge

What if we wanted all the columns except for Credits? Write the select() to do that below, starting with the data frame and piping it to a select(). This will print the first few rows of the result.

A bit of help

The only thing you have to change is within the select() operator. Don’t change anything else.

courses |>
  select(______)

A bit of help

The easiest way to get rid of one column for a select operator is to use the minus operator.

courses |>              
  select(-NAMEOFONECOLUMN)

Solution

This is how we answered this question:

1courses |>
2  select(-Credits)
1
You do not need to assign the results to another tibble. Simply start with the courses tibble as the data source for the query.
2
The easiest way to select all of the columns except for the Credits column is to use the minus operator.

The minus deselects only that column. What other way(s) might you do this?

In naming columns, the convention of using uppercase letters to separate words (as in ThisIsAColumn) is called camel case, because of the “humps”. The courses data frame uses this convention.

A different convention is snake case, where lowercase words are separated by underscores, as in this_is_a_column. Generally, snake case is easier to read. It is also the standard throughout the tidyverse.

Question 4: Test your knowledge

Use the select() function to select the first three columns of the courses data frame and rename the columns to snake case. For example, CourseSectionID would become course_section_id. Print the result to the console — i.e., do not save the information to another data frame.

A bit of help

Again, you only need to apply the select() function to the courses data table.

courses |>                                     
  select(______)

A bit of help

You need to use snake_case for the new names. The code below shows how to rename the first column. Continue with the second and third columns.

courses |>                                     
  select(course_section_id = CourseSectionID,
         ______)

A bit of help

To rename three columns, you need to have three phrases separated by commas within the select() statement.

courses |>                                     
  select(course_section_id = CourseSectionID,
         course_id = ______,  
         subject_id = ______)

Solution

1courses |>
2  select(course_section_id = CourseSectionID,
3         course_id = CourseID,
4         subject_id = SubjectID)
1
The courses tibble is the data source.
2
The select() statement can be used to rename columns. This first line says “rename CourseSectionID to course_section_id.”
3
This second line says “rename CourseID to course_id.” Notice that this is separated from the previous column by a comma.
4
Finally, the third line says “rename SubjectID to subject_id.”

Since we did not assign the results of this whole command to a new data table, these results are temporary.

How would you make them permanent?

Question 5: Test your knowledge

There is also a rename() function that can be used to rename columns if you want to keep all the columns (i.e. you don’t need to select() them).

Look up “tidyverse rename” in a search engine (or simply look at this page in rforir) and see if you can rename all the columns in the data frame to snake case.

A bit of help

We are using the courses data table as the source. Instead of select() as we did in the previous question, we are going to use rename() in this answer.

courses |> 
  rename(______)

A bit of help

The format of this rename() command is similar to the one for the select() command in the previous question. This is a more focused command with a simple purpose. Neither one is right or wrong.

This is how you rename the CourseSectionID column. Now you need to rename the rest of the columns.

courses |> 
  rename(course_section_id = CourseSectionID)

A bit of help

Just as with the select() command, you separate the renaming of the fields by a comma. Continue until you have completed all of them.

courses |> 
  rename(course_section_id = CourseSectionID,
         ______)

Solution

This is how we attacked this question:

1courses |>
2  rename(course_section_id = CourseSectionID,
         course_id = CourseID,
         subject_id = SubjectID,
         section_id = SectionID,
         term_id = TermID,
         credits = Credits,
         prof_id = ProfID)
1
The courses data table is the source.
2
The formatting of this approach to renaming is the same as for the select() statement above.

Note that since we did not assign the query to a data frame (at the beginning, with something like courses <-), these columns will not be renamed after this query is completed.

4.2 Practice with filter()

Before attacking these questions, review the lesson on the filter() function.

Our task is to filter the courses data frame to keep only the most recent term, which is associated with the largest TermID. We can do a quick pivot table to count cases by TermID like this:

The table() function counts the number of occurrences of each unique value. In the results just above, you can see the counts for each value of TermID from 101 to 128.

You may recall from the lesson on the select() operator that the $ operator is used to select a column from a data frame (and return a vector).

The result shows that each individual course is offered somewhere around 800 times.

What’s the most recent — i.e., maximum — term?

Question 6: Test your knowledge

We could do the following if we wanted to save the courses for the most recent term to a new data frame:

Why is this not the best way to do this? Think about what happens next term when you want the same report.

Solution

As time goes on, the maximum term will increase, and this filter will be out of date. This kind of thing is a danger in automated reports, because it means manually inspecting parameters to make sure they are up to date. It’s better to find a method to automatically get the most recent term.

An alternative is to use the filter() function with the max() function to get the most recent term automatically.

Look at the following and compare its result to the query above:

Test that these are the same with the handy identical() function:

Question 7: Test your knowledge

Copy the filter code from your answer to Question 6 and modify it to keep only the courses with the smallest (minimum) TermID. Do not use an integer in your query! Assign the results of the query to the table early_courses.

A bit of help

We are selecting from the data source courses and assigning the results to the early_courses data table. We are going to print out the top of the new table, so that means that we should use head().

All that is left is for you to figure out how to create the appropriate filter.

early_courses <- courses |>
  filter(______)

A bit of help

Our filter involves testing (ensure that you are using two equal signs!) whether or not TermID is equal to some value. What value? Remember that you can’t type a number here — it has to be calculated.

early_courses <- courses |>
  filter(TermID == ______)

Solution

Here is how we answered this question:

1early_courses <- courses |>
2  filter(TermID == min(TermID))
1
We are saving the query to early_courses. The data comes from the courses data frame.
2
We want to filter to include those rows for which TermID equals the minimum value in the TermID column (which happens to be 101, but we do not want this value hard-coded into the query).
Tip

A shortcut

Above, we assigned a value to early_courses and then, in a second command, we simply typed in the data frame name in order to display its values.

An alternative to this is to surround the assignment statement with parentheses. This makes the assignment and then prints the value of the new variable. In this case, we could have written this:

(early_courses <- courses |>
  filter(TermID == min(TermID)))

We have ensured that you can include the outer parentheses or not and the code check will proceed in the same manner.

This one’s a little harder, but see if you can figure it out.

Question 8: Test your knowledge

Use the filter() function to keep only the course section that were taught in the most recent HALF of the data. In other words, the term number should be greater than the median term number. Save the results in a data frame named recent_half_courses.

A bit of help

Consider using the filter() function from dplyr.

recent_half_courses <- courses |> 
  filter(______)

A bit of help

You should filter the dataset using the TermID column.

recent_half_courses <- courses |> 
  filter(TermID > ______)

Solution

Here is our answer. We are not wedded to the use of the > instead of >= in this case; either would be fine.

1recent_half_courses <- courses |>
2  filter(TermID > median(TermID))
1
We are assigning the results to the new recent_half_courses data frame. The data comes from the courses data frame.
2
We want to filter to include those rows for which TermID is greater than the median value of TermID.

4.3 Practice with count()

The count() function does something we have to do all the time in IR — count cases within groups. It’s a shortcut for a combination of group_by() and summarize(), which together can do more complicated kinds of group statistics. But counting is pretty useful all by itself.

Here is a simple use of count(). It counts the number of rows in the courses data frame.

Another way to format the previous command is by using the pipe:

Both do the same thing–they count the number of rows in the data frame. Compare to the number of observations (again, rows) in the data browser.

The count() function is most useful when we want to count cases within groups. For example, to count course sections by term, we can do this:

How do you interpret the output of this command?

Question 9: Test your knowledge

How many course sections were taught in the most recent term?

A bit of help

Consider using the filter() function from dplyr.

courses |> 
  filter(______) |> 
  count(______)

A bit of help

You should filter the dataset using the species column.

courses |> 
  filter(______ == ______) |> 
  count(______)

A bit of help

This is the third hint.

courses |> 
  filter(TermID == max(TermID)) |> 
  count(______)

Solution

1courses |>
2  filter(TermID == max(TermID)) |>
3  count()
1
The data originates in the courses data frame.
2
We want to filter to include those rows for which the TermID equals the maximum value of TermID.
3
Count the number of rows left after filtering.

Note, the last line could also have been count(TermID).

Question 10: Test your knowledge

The chair of the history (SubjectID == "HISTO") department wants to know how many course sections were taught in the most recent term. Can you help her out?

A bit of help

You need to first filter to include the rows that you want, and then you should count the rows that remain.

courses |> 
  filter(______) |>
  count(______)

A bit of help

In this case, we’re filtering on two separate columns — one to ensure the courses are the right term, and the other to ensure that the courses are in the right department.

courses |> 
  filter(TermID == ______, 
         SubjectID == ______) |>
  count(______)

A bit of help

Now that we see how we’re filtering, what do you want to count? What do you want to display when the count is displayed?

courses |> 
  filter(TermID == max(TermID), 
         SubjectID == "HISTO") |>
  count(______)

Make the output as useful and informative as possible.

Solution

courses |> 
1  filter(TermID == max(TermID),
2         SubjectID == "HISTO") |>
3  count(TermID, SubjectID)
1
The comma means AND, or you can use & for same thing in filter()
2
Always use a double equal sign when testing for equality.
3
You can leave out these column names, but this way you get more information than just the count–useful for making sure it’s correct.

Question 11: Test your knowledge

A professor with ID == "8749910288-DOUG" wants to know how many course sections he has taught in each of the most recent three terms. Can you help him out?

A bit of help

The basic structure of our query is to first filter to include the rows that we need and then count how many rows remain.

courses |> 
  filter(______) |>
  count(______) 

A bit of help

The most straight-foward filter is on the ProfID. Now you need to figure out what you’re going to compare TermID to.

courses |> 
  filter(ProfID == "8749910288-DOUG",
         TermID > ______) |>
  count(______) 

A bit of help

We want to include the last three semesters, so we start with TermID (128), and we want to include 126, 127, and 128. This means that we could use either of the following:

  • TermID >= max(TermID) - 2
  • TermID > max(TermID) - 3

We chose the second one, but either one would work.

courses |> 
  filter(ProfID == "8749910288-DOUG",
         TermID > max(TermID) - 3) |>
  count(______) 

Solution

courses |> 
1  filter(TermID > max(TermID) - 3,
2           ProfID == "8749910288-DOUG") |>
3  count(TermID, ProfID)
1
The & means AND. In filter() you can also use a comma instead.
2
We want to include the right professor.
3
We want to display the term number, the professor id (for visual validation), and the number of rows per unique combination of the two (n).

4.4 Practice with the pipe

The power of the tidyverse functions comes from flexible application of the function to compose useful data processes by chaining the operations in an assembly line.

If we chain filter() and select() it’s usually best to filter BEFORE selecting, because we might accidentally deselect the column we want to filter on, and that makes for confusing error messages.

Here’s an example.

Get the course Ids from the most recent term. Note that this will produce an error! Run it, and then see if you can fix it before continuing below.

Compare to this version, which works