Week 1 Homework
1 Using quarto-live
documents
Within this document are blocks of R
code.
- Edit the code that is shown in the box. Click on the
Run Code
button. - Make further edits and re-run that code. You can do this as often as you’d like.
- Click the
Start Over
button to bring back the original code if you’d like. - If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
2 Instructions for this homework
- Work on the questions from top to bottom. All of the questions can be found in the blue boxes.
- Before you can work on any question, you need to run the code box in @setup.
- These questions differ from the ones in your “Lessons” in that, after you press the “Run Code” button, it will tell you if you got it right and also try to tell you where you might have done something different than what we expected. (Any and all feedback would be appreciated here. This code is all new and, we’re sure, will need revisions. Also let us know if the feedback provided is helpful — it is quite complex to write.)
- We also provide Hints (for most) and Solutions (for all) throughout the homework.
If you are having trouble with these, the Problem Session on Wednesday would be a great time to get feedback — that’s why these sessions exist.
If you haven’t already completed them, then go back to the lessons for the pipe
(|>
) operator and the select()
and filter()
functions.
3 Preliminaries
3.1 Setup
We have loaded both the tidyverse
and tidylog
packages in the background. If you are working in RStudio
, then run the following code at the Console
:
library(tidyverse)
library(tidylog)
We are now going to load the raw data from the CSV
files. If you are working in RStudio
, then execute the following code at the Console:
<- read_csv("data/course_section.csv") courses
If you are working in this web page, then read in the data and display it by running the following code:
What to do if all goes wrong?
If you do something on this page and somehow mess up the courses
data frame, then just run the above code block again. It will reload the data. You can then go back to where you were working.
3.2 Data Dictionary
The following describes the contents of each of the columns in courses
:
CourseSectionID
: unique identifier for each course sectionCourseID
: unique identifier for each course, specifying subject and course number.SectionID
: the section number of a course type, e.g., 001, 002, etc.TermID
: unique identifier for each termCredits
: the number of credits for the courseProfID
: unique identifier for each professor
4 Questions
Do your best to answer each of the following questions. You can complete them within this web page or within RStudio
. For each question, we provide a solution (with accompanying explanation); for most, we also provide a series of hints that you can use to help yourself along the way.
When you see a series of underscore characters within a code block, that is our hint that something — anything — needs to go there. Replace those underscores with something and then press the Run Code
button. Keep trying until you think you’ve got it right. Read the provided feedback or click on Show Hint
if you need some help, and then click on Show Solution
when you’re ready.
You will get the most benefit from going through these questions if you try — multiple times, if necessary — to answer them. Getting your brain engaged and struggling is the key to building a memory about these skills.
4.1 Practice with select()
The following five questions primarily test your knowledge of the select()
operator.
Question 1: Test your knowledge
Create a new data frame from courses
by using the select()
function to keep only the columns CourseSectionID
and Credits.
Call the new data frame course_credits
.
A bit of help
You need to assign the results of the query to course_credits
.
<- ______ |>
course_credits select(______)
A bit of help
You are selecting data from the courses
data table (tibble).
<- courses |>
course_credits select(______)
A bit of help
You need to select two separate columns and separate them by a comma.
<- courses |>
course_credits select(______, ______)
Question 2: Test your knowledge
If you are using RStudio
, then using the data browser at top right, inspect the data frame course_credits
to make sure it looks right.
Solution
You should see a spreadsheet-like display with the two columns you selected. The data should be the same as the data in courses
, the number of rows should be the same as in courses
, and the data types for each column should be the same.
Question 3: Test your knowledge
What if we wanted all the columns except for Credits
? Write the select()
to do that below, starting with the data frame and piping it to a select()
. This will print the first few rows of the result.
A bit of help
The only thing you have to change is within the select()
operator. Don’t change anything else.
|>
courses select(______)
A bit of help
The easiest way to get rid of one column for a select
operator is to use the minus
operator.
|>
courses select(-NAMEOFONECOLUMN)
Solution
This is how we answered this question:
- 1
-
You do not need to assign the results to another tibble. Simply start with the
courses
tibble as the data source for the query. - 2
-
The easiest way to select all of the columns except for the
Credits
column is to use theminus
operator.
The minus deselects only that column. What other way(s) might you do this?
In naming columns, the convention of using uppercase letters to separate words (as in ThisIsAColumn
) is called camel case, because of the “humps”. The courses
data frame uses this convention.
A different convention is snake case, where lowercase words are separated by underscores, as in this_is_a_column.
Generally, snake case is easier to read. It is also the standard throughout the tidyverse
.
Question 4: Test your knowledge
Use the select()
function to select the first three columns of the courses
data frame and rename the columns to snake case. For example, CourseSectionID
would become course_section_id
. Print the result to the console — i.e., do not save the information to another data frame.
A bit of help
Again, you only need to apply the select()
function to the courses
data table.
|>
courses select(______)
A bit of help
You need to use snake_case
for the new names. The code below shows how to rename the first column. Continue with the second and third columns.
|>
courses select(course_section_id = CourseSectionID,
______)
A bit of help
To rename three columns, you need to have three phrases separated by commas within the select()
statement.
|>
courses select(course_section_id = CourseSectionID,
course_id = ______,
subject_id = ______)
Solution
1|>
courses 2select(course_section_id = CourseSectionID,
3course_id = CourseID,
4subject_id = SubjectID)
- 1
-
The
courses
tibble is the data source. - 2
-
The
select()
statement can be used to rename columns. This first line says “renameCourseSectionID
tocourse_section_id
.” - 3
-
This second line says “rename
CourseID
tocourse_id
.” Notice that this is separated from the previous column by a comma. - 4
-
Finally, the third line says “rename
SubjectID
tosubject_id
.”
Since we did not assign the results of this whole command to a new data table, these results are temporary.
How would you make them permanent?
Question 5: Test your knowledge
There is also a rename()
function that can be used to rename columns if you want to keep all the columns (i.e. you don’t need to select()
them).
Look up “tidyverse rename” in a search engine (or simply look at this page in rforir
) and see if you can rename all the columns in the data frame to snake case.
A bit of help
We are using the courses
data table as the source. Instead of select()
as we did in the previous question, we are going to use rename()
in this answer.
|>
courses rename(______)
A bit of help
The format of this rename()
command is similar to the one for the select()
command in the previous question. This is a more focused command with a simple purpose. Neither one is right or wrong.
This is how you rename the CourseSectionID
column. Now you need to rename the rest of the columns.
|>
courses rename(course_section_id = CourseSectionID)
A bit of help
Just as with the select()
command, you separate the renaming of the fields by a comma. Continue until you have completed all of them.
|>
courses rename(course_section_id = CourseSectionID,
______)
Solution
This is how we attacked this question:
1|>
courses 2rename(course_section_id = CourseSectionID,
course_id = CourseID,
subject_id = SubjectID,
section_id = SectionID,
term_id = TermID,
credits = Credits,
prof_id = ProfID)
- 1
-
The
courses
data table is the source. - 2
-
The formatting of this approach to renaming is the same as for the
select()
statement above.
Note that since we did not assign the query to a data frame (at the beginning, with something like courses <-
), these columns will not be renamed after this query is completed.
4.2 Practice with filter()
Before attacking these questions, review the lesson on the filter()
function.
Our task is to filter the courses
data frame to keep only the most recent term, which is associated with the largest TermID
. We can do a quick pivot table to count cases by TermID
like this:
The table()
function counts the number of occurrences of each unique value. In the results just above, you can see the counts for each value of TermID
from 101
to 128
.
You may recall from the lesson on the select()
operator that the $
operator is used to select a column from a data frame (and return a vector).
The result shows that each individual course is offered somewhere around 800 times.
What’s the most recent — i.e., maximum — term?
Question 6: Test your knowledge
We could do the following if we wanted to save the courses for the most recent term to a new data frame:
Why is this not the best way to do this? Think about what happens next term when you want the same report.
Solution
As time goes on, the maximum term will increase, and this filter will be out of date. This kind of thing is a danger in automated reports, because it means manually inspecting parameters to make sure they are up to date. It’s better to find a method to automatically get the most recent term.
An alternative is to use the filter()
function with the max()
function to get the most recent term automatically.
Look at the following and compare its result to the query above:
Test that these are the same with the handy identical()
function:
Question 7: Test your knowledge
Copy the filter code from your answer to Question 6 and modify it to keep only the courses with the smallest (minimum) TermID
. Do not use an integer in your query! Assign the results of the query to the table early_courses
.
A bit of help
We are selecting from the data source courses
and assigning the results to the early_courses
data table. We are going to print out the top of the new table, so that means that we should use head()
.
All that is left is for you to figure out how to create the appropriate filter.
<- courses |>
early_courses filter(______)
A bit of help
Our filter involves testing (ensure that you are using two equal signs!) whether or not TermID
is equal to some value. What value? Remember that you can’t type a number here — it has to be calculated.
<- courses |>
early_courses filter(TermID == ______)
Solution
Here is how we answered this question:
- 1
-
We are saving the query to
early_courses
. The data comes from thecourses
data frame. - 2
-
We want to filter to include those rows for which
TermID
equals the minimum value in theTermID
column (which happens to be101
, but we do not want this value hard-coded into the query).
A shortcut
Above, we assigned a value to early_courses
and then, in a second command, we simply typed in the data frame name in order to display its values.
An alternative to this is to surround the assignment statement with parentheses. This makes the assignment and then prints the value of the new variable. In this case, we could have written this:
<- courses |>
(early_courses filter(TermID == min(TermID)))
We have ensured that you can include the outer parentheses or not and the code check will proceed in the same manner.
This one’s a little harder, but see if you can figure it out.
Question 8: Test your knowledge
Use the filter()
function to keep only the course section that were taught in the most recent HALF of the data. In other words, the term number should be greater than the median term number. Save the results in a data frame named recent_half_courses
.
A bit of help
Consider using the filter()
function from dplyr
.
<- courses |>
recent_half_courses filter(______)
A bit of help
You should filter the dataset using the TermID
column.
<- courses |>
recent_half_courses filter(TermID > ______)
Solution
Here is our answer. We are not wedded to the use of the >
instead of >=
in this case; either would be fine.
4.3 Practice with count()
The count()
function does something we have to do all the time in IR — count cases within groups. It’s a shortcut for a combination of group_by()
and summarize()
, which together can do more complicated kinds of group statistics. But counting is pretty useful all by itself.
Here is a simple use of count()
. It counts the number of rows in the courses
data frame.
Another way to format the previous command is by using the pipe:
Both do the same thing–they count the number of rows in the data frame. Compare to the number of observations (again, rows) in the data browser.
The count()
function is most useful when we want to count cases within groups. For example, to count course sections by term, we can do this:
How do you interpret the output of this command?
Question 9: Test your knowledge
How many course sections were taught in the most recent term?
A bit of help
Consider using the filter()
function from dplyr
.
|>
courses filter(______) |>
count(______)
A bit of help
You should filter the dataset using the species
column.
|>
courses filter(______ == ______) |>
count(______)
A bit of help
This is the third hint.
|>
courses filter(TermID == max(TermID)) |>
count(______)
Solution
- 1
-
The data originates in the
courses
data frame. - 2
-
We want to filter to include those rows for which the
TermID
equals the maximum value ofTermID
. - 3
- Count the number of rows left after filtering.
Note, the last line could also have been count(TermID)
.
Question 10: Test your knowledge
The chair of the history (SubjectID == "HISTO"
) department wants to know how many course sections were taught in the most recent term. Can you help her out?
A bit of help
You need to first filter to include the rows that you want, and then you should count the rows that remain.
|>
courses filter(______) |>
count(______)
A bit of help
In this case, we’re filtering on two separate columns — one to ensure the courses are the right term, and the other to ensure that the courses are in the right department.
|>
courses filter(TermID == ______,
== ______) |>
SubjectID count(______)
A bit of help
Now that we see how we’re filtering, what do you want to count? What do you want to display when the count is displayed?
|>
courses filter(TermID == max(TermID),
== "HISTO") |>
SubjectID count(______)
Make the output as useful and informative as possible.
Solution
- 1
-
The comma means AND, or you can use
&
for same thing infilter()
- 2
- Always use a double equal sign when testing for equality.
- 3
- You can leave out these column names, but this way you get more information than just the count–useful for making sure it’s correct.
Question 11: Test your knowledge
A professor with ID == "8749910288-DOUG"
wants to know how many course sections he has taught in each of the most recent three terms. Can you help him out?
A bit of help
The basic structure of our query is to first filter to include the rows that we need and then count how many rows remain.
|>
courses filter(______) |>
count(______)
A bit of help
The most straight-foward filter is on the ProfID
. Now you need to figure out what you’re going to compare TermID
to.
|>
courses filter(ProfID == "8749910288-DOUG",
> ______) |>
TermID count(______)
A bit of help
We want to include the last three semesters, so we start with TermID
(128), and we want to include 126, 127, and 128. This means that we could use either of the following:
TermID >= max(TermID) - 2
TermID > max(TermID) - 3
We chose the second one, but either one would work.
|>
courses filter(ProfID == "8749910288-DOUG",
> max(TermID) - 3) |>
TermID count(______)
Solution
|>
courses 1filter(TermID > max(TermID) - 3,
2== "8749910288-DOUG") |>
ProfID 3count(TermID, ProfID)
- 1
-
The & means AND. In
filter()
you can also use a comma instead. - 2
- We want to include the right professor.
- 3
- We want to display the term number, the professor id (for visual validation), and the number of rows per unique combination of the two (n).
4.4 Practice with the pipe
The power of the tidyverse functions comes from flexible application of the function to compose useful data processes by chaining the operations in an assembly line.
If we chain filter()
and select()
it’s usually best to filter BEFORE selecting, because we might accidentally deselect the column we want to filter on, and that makes for confusing error messages.
Here’s an example.
Get the course Ids from the most recent term. Note that this will produce an error! Run it, and then see if you can fix it before continuing below.
Compare to this version, which works