Week 1 Homework

Building a graph

1 Lesson overview

In this set of problems, we are going to go through the process for building a graph in R using the ggplot2 package (as shown in Figure 1).

Figure 1: Process for creating a graph in R/ggplot2

Working through this document to learn ggplot2 is analogous to learning to fly while a pilot sits in the seat next to you with a set of controls, ready to take over. First, it will be a lot! Second, you will get lots of support and hints throughout. By the end, you should have a much better sense of how the process works, how the pieces fit together, and how to use the ggplot2 package to create your own graphs.

Note

Note that you will only rarely go through this whole process when defining a ggplot2 graph. Most of the time, you will go through the first three steps — gather the data, build the easel, and paint (apply geom_* layers). And that’s it. So don’t be put off by the size of this lesson. We have included as many of the basics as we can in this one lesson. The rest of the course will basically be about different geom_*s.

1.1 Using this document

Within this document are blocks of R code that you can modify; each has a Run Code button in the upper right. (Other R code that is just for display is shown in an unadorned gray box.) You can edit and execute this code as a way of practicing your R skills:

Edit the code that is shown in the box as desired. Click on the Run Code button.
Make further edits and re-run that code. You can do this as often as you’d like.
Click the Start Over button to bring back the original code if you’d like.
If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

1.2 Using RStudio

The easiest way to work through the exercises in this document is within the document itself. However, you can also complete these exercises within RStudio. To set up the environment so that you can complete them successfully, copy the code in the following sections and execute it at the prompt.

1.2.1 Open the project

Before doing anything else within RStudio, open the project for this lesson so that you have access to the data and source code that enables all the rest.

1.2.2 Set up

If you’re following along with this exercise in RStudio, then you need to execute the following code in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading these libraries for you.

library(tidyverse)
library(tidylog)
library(skimr)
library(hexbin)
library(RColorBrewer)
library(ggthemes)
library(scales)

The first line loads the tidyverse package. You could actually load just the packages dplyr, purrr, stringr, and ggplot2 to save on memory, but with today’s computers and the types of things that you’re doing at this point, you don’t need to worry about load speed and/or memory usage just yet.
The tidylog package makes dplyr and tidyr functions print more detailed messages about what each operation did.
The skimr package enables the skim function that provides a convenient way to explore data frames.
The hexbin package enables ggplot2 to create hex graphs.
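The RColorBrewer and ggthemes packages have a lot of useful palettes.
The scales package provides helper functions for scales and palettes, such as div_gradient_pal(), which we use later in this homework.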

1.3 Reading in the data

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time—as we do here—we load data from a file or database. The following reads in the university data that we will be using.

If you are working within RStudio, then execute the following code in its Console:

source("source/read_univ_data_simple.R")

2 Gather data

We are going to use the university grades data set for this homework. We have defined a few data frames that summarize the data in different ways. The base data set contains information about students who were admitted to a university, including their high school GPA, university GPA, and other demographic information.

This is quite a large set of grades — it contains over 400,000 rows of 27 columns. The following commands list the names of the columns in all the data frames that we will be using.
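As a minimal sketch of how column names can be listed, names() returns the column names of one data frame, and lapply() applies it to several data frames at once. The two tiny data frames below are hypothetical stand-ins; they only reuse a few column names mentioned in this homework and are not the real course data.

```r
# Hypothetical stand-ins for the course data frames (the real ones are
# created by source/read_univ_data_simple.R and are much larger).
admit_demo <- data.frame(
  HSGPA   = c(3.2, 3.8),
  UnivGPA = c(2.9, 3.5),
  Gender  = c("F", "M")
)
summary_demo <- data.frame(
  AdmitCalendarYear = c(2019, 2020),
  Gender            = c("F", "M"),
  AvgGPA            = c(3.1, 3.0)
)

# List the column names of each data frame in a single call.
column_names <- lapply(
  list(admit_demo = admit_demo, summary_demo = summary_demo),
  names
)
print(column_names)
```

With the real data loaded, the same pattern would be applied to admit_data, gpa_gender_summary, and grades_by_homestate_by_gender.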

3 The questions

Except where noted, only define ggplot(aes()), the facets (if necessary), the geom_*() layers, and apply the same theme_*() to each. You also only need to apply a color palette where asked.

Question 1: Test your knowledge

Create a histogram of HSGPA from admit_data with a number of bins that seems appropriate for the data.

Also, set the color, linewidth, and fill to constants to make the histogram look nice.
Ensure that the x-axis scale has breaks at 2.0, 2.5, 3.0, 3.5, and 4.0.
Apply your preferred theme (as you should on every graph in this homework).

Leave the filter() statement in place. It ensures that we only include student records that have a high school GPA.

Your first step should be to define the aes() and to determine which geom you’re going to use.

Hint 3: A bit of help

If you set the limits option in scale_x_continuous(), you might get the following warning:

<warning/rlang_warning>
Warning:
Removed 2 rows containing missing values or values outside the 
scale range (`geom_bar()`).

This behavior looks anomalous. The HSGPA column has had all NAs removed, and it doesn’t have any values outside the range of 2.0 to 4.0, so the data itself shouldn’t be a problem. But the warning appears anyway. The most likely explanation is that the warning refers to the histogram’s bars rather than to rows of data: the outermost bins extend slightly beyond the limits, so those two bars fall outside the scale range and are dropped.

In a reporting context, we could suppress this warning message. But in this case, we wanted you to see it so that we could talk about it.
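One commonly suggested way to avoid this kind of warning is to zoom with coord_cartesian() instead of setting limits on the scale. Scale limits discard anything that falls outside the range (including histogram bars whose edges extend past the limits), while coord_cartesian() only changes the visible window. The sketch below assumes a made-up data frame, gpa_demo, standing in for the filtered admit_data; the window of 1.9 to 4.1 is also just an illustrative choice.

```r
library(ggplot2)

# Hypothetical stand-in for the HSGPA column: 500 values between 2.0 and 4.0.
set.seed(1)
gpa_demo <- data.frame(HSGPA = round(runif(500, min = 2, max = 4), 2))

p_zoom <- ggplot(gpa_demo, aes(x = HSGPA)) +
  geom_histogram(bins = 10, color = "black", fill = "steelblue",
                 linewidth = 0.2) +
  # Breaks only; no limits on the scale, so no bars are dropped.
  scale_x_continuous(breaks = seq(2, 4, by = 0.5)) +
  # Zoom the visible window instead of filtering through the scale.
  coord_cartesian(xlim = c(1.9, 4.1)) +
  theme_minimal()

# layer_data() returns one row per histogram bar.
bars <- layer_data(p_zoom)
nrow(bars)
```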

Question 2: Test your knowledge

Create a histogram of UnivGPA from admit_data with a binwidth that seems appropriate for the data.

Note that we do not have the filter() statement in this question. We took care of the problem when we set up the admit_data data frame.

Your first step should be to define the aes() and to determine which geom you’re going to use.

Question 3: Test your knowledge

Create a line chart across AdmitCalendarYear of AvgGPA from gpa_gender_summary with a separate line for each gender; you do this by defining some aesthetic characteristic, such as color or linetype, for the plot.

We want to highlight the points of plotted data themselves along the line.
Use both the color and the shape to distinguish the gender.
Set the linewidth for the line to 1.2 and the size of the point to 4.
Ensure that the y-axis shows values from 2.0 to 4.0. (Why?)
Pick an appropriate color palette. Which type of color palette should you use?

As always, your first step should be to define the aes() and to determine which geom you’re going to use.

Question 4: Test your knowledge

Starting with the graph you defined in the previous question, add a text label to each point (instead of graphing the point) that prints the average GPA. The color of the label should be based on the Gender while the fill should be "white" and the font face should be "bold".

Your first step should be to copy your answer from the previous question to this code block. Now figure out what you’re missing. We’ll address that in the hint…but spend a moment trying to figure it out for yourself — even better, try to answer it without looking at the hint.

Solution

This is how we answered this question:

You’ll notice that we added a useful bit of R code:

label = sprintf("%.2f", AvgGPA)

This code formats the average GPA (which is a “floating point” (f) number) to two decimal places. This cleans up the label on the graph so that it doesn’t take up so much room.
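To see what sprintf() does on its own, here is a small sketch with made-up numbers (avg_gpa_demo is a hypothetical vector, not course data):

```r
# Hypothetical values standing in for the AvgGPA column.
avg_gpa_demo <- c(3.14159, 2.5, 3.996)

# "%.2f" means: format as a floating point number with 2 decimal places.
labels_demo <- sprintf("%.2f", avg_gpa_demo)
print(labels_demo)
# [1] "3.14" "2.50" "4.00"
```

Note that the result is a character vector, which is what a text label needs.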

Question 5: Test your knowledge

Create a stacked bar chart (by value) of the number of each gender admitted each year from gpa_gender_summary. Pick an appropriate color palette.

Which type of color palette should you use?

Solution

This is how we answered this question:

You could also change the geom_col() layer to the following:

geom_col(color = "black", linewidth = 0.2)

This would add a thin black outline to the bars. This is a common practice in data visualization, but it is not required.

Question 6: Test your knowledge

Create graphs (separate graphs for each combination of home state and gender) from grades_by_homestate_by_gender that shows the distribution of letter grades that they earned in courses. You need to assign a color palette (for the letter grades), and we have defined one for you called grade_palette.

Defining your own color palette

We’re going to go through the steps we followed when defining a new color palette for the bars.

We have to first decide what type of color palette we want to use:

Qualitative: This is used when the values are not ordered in any way. That’s not appropriate in this case because letter grades are clearly ordered.
Sequential: This is used when the values are ordered and you want to emphasize the order. It emphasizes the gradual progression from low values to high values. This is appropriate in this case, but we did not choose it.
Divergent: This is used when the values are ordered and you want to emphasize the order, but you also want to emphasize the low and high values as well as the center point. This is appropriate in this case, and we did choose it.

First, we already have a vector of the letter grades that we want to use. (You’ll see why we need this in a moment.)

grade_levels <- c("A", "A-", "B+", "B", "B-", "C+",
                  "C", "C-", "D+", "D", "D-", "F")

Since we have loaded the scales library, we have the div_gradient_pal() function available to us for defining a divergent color palette (documentation). This function takes three arguments: low, mid, and high. The low and high arguments are the colors at the ends of the gradient, while the mid argument is the color in the middle of the gradient.

We are going to set the highest grades to be blue, the lowest grades to be red, and the middle grades to be very light grey. There’s no magic here — just pick colors that you like from your standard color scheme.

The process of using this function is a little tricky, because it is a function that returns a function. So we need to call it with the three colors, and then call the result with a sequence of numbers from 0 to 1 that is the same length as the number of grades (length(grade_levels)). (And now you know why we needed the vector!)

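# div_gradient_pal() returns a function; we call that function right away
# with evenly spaced positions from 0 (low color) to 1 (high color).
grade_palette <- div_gradient_pal(
    low = "#2166AC",   # blue
    mid = "#f7f7f7",   # very light grey
    high = "#B2182B"   # red
  )(seq(0, 1, length.out = length(grade_levels)))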

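Note that we did not know whether to put blue as the low or high argument. We just guessed; if we had been wrong, we would have switched them. (The guess works because "A" is the first element of grade_levels and the first position, 0, receives the low color.) The above code chunk results in a grade_palette containing a vector of 12 colors that range from blue to light grey to red.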

Finally, we need to map the color palette to the letter grades. This is done by setting the names of our color palette (grade_palette) to be the letter grades.

names(grade_palette) <- grade_levels  # map to grades
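The “function that returns a function” step can be easier to follow when it is split in two. The sketch below uses the same three colors; the names pal_fn, positions, and colors_demo are ours and purely illustrative.

```r
library(scales)

# Step 1: div_gradient_pal() returns a palette *function*, not colors.
pal_fn <- div_gradient_pal(low = "#2166AC", mid = "#f7f7f7", high = "#B2182B")

# Step 2: call that function with positions between 0 and 1
# to get one color per position.
positions <- seq(0, 1, length.out = 12)
colors_demo <- pal_fn(positions)

is.function(pal_fn)   # TRUE: the first call gave us a function
length(colors_demo)   # 12: one color per letter grade
```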

To apply this color palette, you append this layer to your graph:

ggplot(aes(...)) +
  ... +
  scale_fill_manual(values = grade_palette)

This is how to define and use a custom color palette. You can use this code to define your own color palette if you don’t like the ones provided by RColorBrewer.

Also, as a finishing touch, get rid of the minor grid lines (i.e., those at the values of 500, 1500, etc.) while setting the major grid lines to darkgrey with a linewidth of 0.1. We also want to remove the letter grade legend since the grades are already shown in the graph.

As always, apply the same theme that you have been using.

We need to make some general decisions:

What geom are you going to use?
What are the x and y values of the graph?
You’re making multiple graphs, so what does that tell you to do?

See if you can use your answers to these questions to build the first iteration of the graph. We address the questions in the next hint (though we aren’t writing the code yet).

Hint 2: A bit of help

You should have something like this at this point:

grades_by_homestate_by_gender |>
  ggplot(aes(LtrGr, NumGrades)) +
    facet_grid(rows = vars(HomeState),
               cols = vars(Gender)) +
    geom_col() +
    theme_minimal()

Our next step is to assign different colors to the bars — that would be the fill argument — based on the letter grades. In our case, this also involves assigning the custom color palette that we defined above.

After we did this, we thought that the bars near the center of the distribution were quite hard to see. To remedy this, we added a black outline to the bars. This is done by setting the color argument in the geom_col() function. Since it is a constant, we set it directly in the geom_col() function, not within the aes() function.

We look at the code to accomplish all of this in the next hint. As always, try to do this yourself — struggle a bit, even! It will help you learn and internalize the material.

Hint 3: A bit of help

At this stage, we have this code:

grades_by_homestate_by_gender |>
  ggplot(aes(LtrGr, NumGrades, 
             fill = LtrGr)) +
    facet_grid(rows = vars(HomeState),
               cols = vars(Gender)) +
    geom_col(color = "black", linewidth = 0.1) +
    theme_minimal() +
    scale_fill_manual(values = grade_palette)

Note that we just made three changes to the code for this step.

What remains now? Basically, we want to de-clutter the graph:

We want to get rid of the minor grid lines (i.e., those at the values of 500, 1500, etc.). This is done by setting the panel.grid.minor argument in the theme() function to element_blank(). You do not have to specify the values that you have to remove; referring to them as the minor grid lines is enough.
We need to set the major grid lines to darkgrey with a linewidth of 0.1. This is done by setting the panel.grid.major argument in the theme() function to element_line(color="darkgrey", linewidth=0.1).
We want to remove the letter grade legend since the grades are already shown in the graph. This is done by setting the legend.position argument in the theme() function to "none".
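The three theme() settings above can be tried out on a small example before applying them to the homework graph. This is only a sketch on hypothetical toy data (toy, grade, and n are made-up names), not the solution to the question.

```r
library(ggplot2)

# Hypothetical toy data; the column names are illustrative only.
toy <- data.frame(
  grade = factor(c("A", "B", "C"), levels = c("A", "B", "C")),
  n     = c(1200, 1800, 700)
)

p_clean <- ggplot(toy, aes(grade, n, fill = grade)) +
  geom_col(color = "black", linewidth = 0.1) +
  theme_minimal() +
  # theme() comes after theme_minimal() so that these settings
  # are not overwritten by the complete theme.
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "darkgrey", linewidth = 0.1),
    legend.position  = "none"
  )
```

The order matters: a complete theme such as theme_minimal() replaces earlier theme() settings, so the tweaks go last.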

The solution is next. See if you can figure this out before looking at the solution. If you can’t, that’s okay — just try to understand the code and how it works.

Question 7: Test your knowledge

Create violin plots on one graph showing the distribution of HSGPA by gender (from admit_data). Pick a Viridis color palette, using whichever option you would like, for the fill (which should be based on Gender); set the alpha to 0.7 to lessen the saturation of the colors. Define a useful title, subtitle, caption, and axis names. Apply the same theme that you have been using.

As always, answer the basic questions first so that you can construct the foundation of the graph:

What geom are you going to use?
What are the x and y values of the graph?

And, again, go ahead and add the theme in this first step.

Build this basic graph in this first step, and we’ll guide you through the next steps starting in the next hint.

Hint 1: A bit of help

This is how we started our graph:

admit_data |>
  ggplot(aes(Gender, HSGPA)) +
    geom_violin(linewidth = 0.3) +
    theme_minimal()

Just the basics, to make sure that the graph is going to show what we want. If it doesn’t show promise, then there’s no reason to continue refining it.

We’re happy with it so far.

Let’s address the color-related steps here.

We want to fill the violin plots with different colors based on the Gender, so we need to set the fill argument (for the violin plot) to Gender. Remember that, since the fill is based on a column, you should set this up within the aes() within ggplot().
We want to use a Viridis color palette, and Gender is a discrete set of values, so we need to use the scale_fill_viridis_d() function. This function takes an option argument that specifies which color palette to use, for example "magma", "inferno", "plasma", "viridis", or "cividis".
Don’t forget to set the alpha (opacity) argument of the color palette to 0.7 to lessen the saturation of the colors.

Put all of the appropriate code for these steps within the graph definition.
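As a rough sketch of how these pieces fit together, the example below uses a made-up data frame (violin_demo) in place of admit_data; the Gender codes and GPA values are invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for admit_data: 100 invented GPAs per gender,
# clipped to the 2.0 to 4.0 range.
set.seed(42)
violin_demo <- data.frame(
  Gender = rep(c("F", "M"), each = 100),
  HSGPA  = pmin(4, pmax(2, rnorm(200, mean = 3.3, sd = 0.4)))
)

p_violin <- ggplot(violin_demo, aes(Gender, HSGPA, fill = Gender)) +
  geom_violin(linewidth = 0.3) +
  # Discrete Viridis palette; alpha is set on the scale itself.
  scale_fill_viridis_d(option = "viridis", alpha = 0.7) +
  theme_minimal()

# One distinct fill color per gender.
fills <- unique(layer_data(p_violin)$fill)
length(fills)
```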

Question 8: Test your knowledge

Create box-and-whiskers plots showing the distribution of UnivGPA by DeclaredMajor (from admit_data). We want to show the plots in order by the major’s median GPA.

Create a useful y-axis scale with values ranging from 1.0 to 4.0 and breaks every 0.5.
Make the x-axis labels readable by rotating them 45 degrees.
Define a useful title and axis names.
As always, apply the same theme that you have been using.
To emphasize the increasing median GPA on the plot, use a color palette that goes from very light grey (for the lowest median GPA) to blue (for the highest median GPA).
Hide the fill legend.

Note that you should not change the beginning mutate() statement that uses the fct_reorder function (documentation). It will ensure that your box-and-whiskers plots are ordered by the median UnivGPA for each DeclaredMajor. This statement should be interpreted as follows:

The DeclaredMajor variable is being reordered based on the median UnivGPA for each major, with the lowest median GPA first and the highest median GPA last. The .na_rm = TRUE argument ensures that any missing values are ignored when calculating the median. The results are assigned back to the DeclaredMajor column (but only for this statement since the results are not being assigned back to admit_data).

Start this process by doing the following:

Defining the x and y for the graph,
Setting the values on the y axis,
Picking the geom_*() that you’re going to use,
Giving the graph a title and more descriptive axis labels, and
Applying the theme.

Then, in the next hint, we’ll review these decisions and continue with next steps.

Hint 1: A bit of help

This is where we are in the process. None of this should be surprising to you at this point.

admit_data |>
  mutate(
    DeclaredMajor = fct_reorder(
                       DeclaredMajor, 
                       UnivGPA, 
                       .fun = median, 
                       .na_rm = TRUE
                       )
    ) |>
  ggplot(aes(DeclaredMajor, UnivGPA)) +
    geom_boxplot() +
    theme_minimal() +
    scale_y_continuous(
      limits = c(1.0, 4.0),
      breaks = seq(1, 4, by = 0.5)
    ) +
    labs(
      title = "Distribution of GPA by Major",
      x = "Major", 
      y = "GPA"
    )

We have the basic graph defined, and we are happy with it…so far as it goes.

We would love to address the lack of color in the graph. We want to apply a color palette to the fill of the box-and-whiskers plots.

Setting up the fill color of a box-and-whiskers in which the values are calculated is a little tricky. We know that we are going to use the fill argument within the aes() of the ggplot(). But, beyond that, things veer off the rails a little.

What we’ve done so far is set the color, or fill, or shape, or size, or whatever, to a variable in the data frame. But here, we want to set the color based on the calculated value of the x variable. We’re not coloring (or whatever) a specific row in a data frame; we’re coloring based on a calculated value over a whole group of rows.

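Luckily, for just this kind of situation, ggplot2 has a special function called after_stat(). This function lets an aesthetic refer to a value that exists only after the statistical calculations have been done. Here, after_stat(x) is the position of each box along the x-axis; because the majors have been reordered by median GPA, that position increases with the median.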

The form of the statement is now:

  ggplot(aes(DeclaredMajor, UnivGPA, 
             fill = after_stat(x))) +
    ...

This says to set the color based on whatever calculations are done — that is, the creation of the box-and-whiskers itself.

Next, we need to define a color scale that is a two-color gradient from one (low) color to one (high) color; scale_fill_gradient() does this. (The ggplot2 package also provides the scale_*_gradient2() functions for a diverging color palette. See the documentation for more information.) The structure of scale_fill_gradient() is the following:

  scale_fill_gradient(
    low = "<low color>", 
    high = "<high color>" 
  ) +

Specify both the fill and this color palette in the ggplot() statement. We’ll take a look at the code in the next hint, and then continue with this journey.
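If you want to see what after_stat(x) actually contains, layer_data() shows the values that ggplot2 computed for a layer. The sketch below uses a tiny invented data frame (box_demo, with made-up majors M1 to M3 already ordered by median); it is meant for inspection only.

```r
library(ggplot2)

# Hypothetical toy data: three majors in order of increasing median GPA.
box_demo <- data.frame(
  Major = factor(rep(c("M1", "M2", "M3"), each = 5),
                 levels = c("M1", "M2", "M3")),
  GPA   = c(2.0, 2.2, 2.4, 2.6, 2.8,
            2.6, 2.8, 3.0, 3.2, 3.4,
            3.2, 3.4, 3.6, 3.8, 4.0)
)

p_box <- ggplot(box_demo, aes(Major, GPA, fill = after_stat(x))) +
  geom_boxplot() +
  scale_fill_gradient(low = "#f7f7f7", high = "#4A85A3")

# One row per box: x is the box position, middle is the median,
# and fill is the color the gradient assigned to that position.
computed <- layer_data(p_box)
computed[, c("x", "middle", "fill")]
```

Each box gets its own x position (1, 2, 3), and the gradient turns that position into a fill color.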

Hint 2: A bit of help

Here’s where we are:

admit_data |>
  mutate(
    DeclaredMajor = fct_reorder(
                       DeclaredMajor, 
                       UnivGPA, 
                       .fun = median, 
                       .na_rm = TRUE
                       )
    ) |>
  ggplot(aes(DeclaredMajor, UnivGPA, 
             fill = after_stat(x))) +
    geom_boxplot() +
    scale_y_continuous(
      limits = c(1.0, 4.0),
      breaks = c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)
    ) +
    theme_minimal() +
    labs(
      title = "Distribution of GPA by Major",
      x = "Major", y = "GPA"
    ) +
    scale_fill_gradient(
      low = "#f7f7f7",   # very light grey
      high = "#4A85A3"   # blue
    )

We now have two final polishing steps:

Hide the fill legend. (Why do this? Have you looked at the legend? It doesn’t provide any useful information at all. Better to get rid of it.)
Make the x-axis labels readable by rotating them 45 degrees.

Both of these can be handled within the theme() function.

The first is done by setting the legend.position argument to "none".
The second is handled with the axis.text.x argument. You set it to the element_text() function. Within this function, you can set many values, but the one that we’re interested in is angle. Set it to 45.
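A small sketch of these two theme() settings on invented data (label_demo and its long major names are hypothetical). The hjust = 1 setting is our own addition, not part of the instructions; it is a common companion to rotated labels so that they end at the tick mark.

```r
library(ggplot2)

# Hypothetical toy data with long category names.
label_demo <- data.frame(
  Major = rep(c("A Fairly Long Major Name", "Another Long Major Name"),
              each = 4),
  GPA   = c(2.5, 3.0, 3.2, 3.6, 2.8, 3.1, 3.5, 3.9)
)

p_rotated <- ggplot(label_demo, aes(Major, GPA, fill = after_stat(x))) +
  geom_boxplot() +
  theme_minimal() +
  theme(
    legend.position = "none",                          # hide the fill legend
    axis.text.x = element_text(angle = 45, hjust = 1)  # rotate x-axis labels
  )
```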

Give it a try! Then look at our solution.

Question 9: Test your knowledge

For each combination of Gender and StudentType, show a histogram of TotalGradePointsEarned (from admit_data).

Give it a shot! It’s a pretty short answer this time.

Question 10: Test your knowledge

Referring back to Question 2: Make the colors of the bars get darker as the bars get farther to the right. Also, define an appropriate color palette using RColorBrewer. Ensure that the color legend is as useful as you can make it. Make the x-axis and y-axis as useful as you can. Also define a useful title, subtitle, caption, and axis names. As always, apply the same theme that you have been using.

Certainly, you should start by copying your answer to #2 into the code chunk. (This is a fairly typical process that you’ll go through with graphs. You’ll start with last year’s or last month’s version and see if you can improve it a bit.)

What specific requests are you being asked to address?

Hint 2: A bit of help

This is what we have from starting with our answer to #2 and then addressing the first three requests:

admit_data |> 
  ggplot(aes(x = UnivGPA)) +
    geom_histogram(aes(fill = after_stat(x)), 
                   color = "black", 
                   alpha = 0.8, 
                   linewidth = 0.2) +
    scale_x_continuous(
      breaks = seq(1, 4, by = 0.5)
    ) +
    scale_y_continuous(
      limits = c(0, 2000),
      breaks = seq(0, 2000, by = 250)
    ) +
    theme_minimal() +
    labs(
      title = "Distribution of University GPA values",
      subtitle = "For all students",
      x = "University GPA",
      y = "Number of students",
      caption = "For homework assignment"
    )

Now we need to address the last three requests:

Make the colors of the bar get darker going to the right.
Define an appropriate color palette using RColorBrewer.
Ensure that the color legend is as useful as you can make it.

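To make the colors of the bars get darker as they go to the right, we use the scale_fill_distiller() function. This function creates a color gradient from an RColorBrewer palette based on the values of the x variable (the fill is already mapped to after_stat(x)). Set the palette argument to the name of an appropriate palette; calling RColorBrewer::display.brewer.all() shows the available palettes. Note that scale_fill_distiller() reverses the palette by default (direction = -1), so with a sequential palette you also need direction = 1 for the bars to get darker toward the right.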

As for making the color legend useful…well, we don’t find it useful at all! We want to get rid of it, so we set the legend.position argument (within theme()) to "none".
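Here is a minimal sketch of the fill gradient on invented data (hist_demo is a hypothetical stand-in for admit_data, and "Blues" is just one possible palette). It sets direction = 1 explicitly because scale_fill_distiller() defaults to direction = -1, which would put the darkest color on the left.

```r
library(ggplot2)

# Hypothetical stand-in for the UnivGPA column: values spread from 1 to 4.
hist_demo <- data.frame(UnivGPA = seq(1, 4, length.out = 200))

p_dist <- ggplot(hist_demo, aes(x = UnivGPA)) +
  geom_histogram(aes(fill = after_stat(x)), bins = 10,
                 color = "black", linewidth = 0.2) +
  # direction = 1 keeps the palette order light-to-dark from low to high x.
  scale_fill_distiller(palette = "Blues", direction = 1) +
  theme_minimal() +
  theme(legend.position = "none")

# Check that the bars really get darker to the right:
# a smaller sum of red + green + blue means a darker color.
bar_fills  <- layer_data(p_dist)$fill
brightness <- colSums(col2rgb(bar_fills))
brightness[1] > brightness[length(brightness)]
```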

And that’s it!

We suggest that, after you have completed this homework within this page, you come back to it and attempt to answer the questions within RStudio (while relying on the hints as little as possible). This will help you to get used to the process of building graphs in R and ggplot2.