Week 1 Homework
Building a graph
1 Lesson overview
In this set of problems, we are going to go through the process for building a graph in R
using the ggplot2
package (as shown in Figure 1).

Working through this document to learn ggplot2
is analogous to learning to fly while a pilot is sitting in the seat next to you with controls so that he/she can take over. First, it will be a lot! Second, you will get lots of support and hints throughout. By the end, you should have a much better sense of how the process works, how the pieces fit together, and how to use the ggplot2
package to create your own graphs.
Note that you will only rarely go through this whole process when defining a ggplot2
graph. Most of the time, you will go through the first three steps — gather the data, build the easel, and paint (apply geom_*
layers). And that’s it. So don’t be put off by the size of this lesson. We have included as many of the basics as we can in this one lesson. The rest of the course will basically be about different geom_*
s.
1.1 Using this document
Within this document are blocks of R code that you can modify; each has a Run Code button in the upper right. (Other R code that is just for display is shown in an unadorned gray box.) You can edit and execute this code as a way of practicing your R skills:
- Edit the code that is shown in the box as desired. Click on the Run Code button.
- Make further edits and re-run that code. You can do this as often as you’d like.
- Click the Start Over button to bring back the original code if you’d like.
- If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
1.2 Using RStudio
The easiest way to work through the exercises in this document is within the document itself. However, you can also complete these exercises within RStudio. In order to set up the environment so that you can successfully complete them, you should copy the code in the following sections and execute it at the prompt.
1.2.1 Open the project
Before doing anything else within RStudio
, open the project for this lesson so that you have access to the data and source code that enables all the rest.
1.2.2 Set up
If you’re following along with this exercise in RStudio, then you need to execute the following code in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading these libraries for you.
library(tidyverse)
library(tidylog)
library(skimr)
library(hexbin)
library(RColorBrewer)
library(ggthemes)
library(scales)
- The first line loads the tidyverse package. You could actually load just the packages dplyr, purrr, stringr, and ggplot2 to save on memory, but with today’s computers and the types of things that you’re doing at this point, you don’t need to worry about load speed or memory usage just yet.
- The second package, tidylog, tells R to give more detailed messages.
- The skimr package enables the skim function that provides a convenient way to explore data frames.
- The hexbin package enables ggplot2 to create hex graphs.
- The RColorBrewer and ggthemes packages have a lot of useful palettes.
- The scales package provides the div_gradient_pal() function that we use later to build a custom color palette.
1.3 Reading in the data
A data frame
is like a table
in a database or a spreadsheet. It has rows and named columns. Most of the time—as we do here—we load data from a file or database. The following reads in the university data that we will be using.
If you are working within RStudio
, then execute the following code in its Console
:
source("source/read_univ_data_simple.R")
2 Gather data
We are going to use the university grades data set for this homework. We have defined a few data frames that summarize the data in different ways. The base data set contains information about students who were admitted to a university, including their high school GPA, university GPA, and other demographic information.
This is quite a large set of grades: it contains over 400,000 rows and 27 columns. The following commands list the names of the columns in all the data frames that we will be using.
3 The questions
Except where noted, only define ggplot(aes())
, the facets (if necessary), the geom_*()
layers, and apply the same theme_*()
to each. You also only need to apply a color palette where asked.
Question 1: Test your knowledge
Create a histogram of HSGPA
from admit_data
with a number of bins that seems appropriate for the data.
- Also, set the color, linewidth, and fill to constants to make the histogram look nice.
- Ensure that the x-axis scale has breaks at 2.0, 2.5, 3.0, 3.5, and 4.0.
- Apply your preferred theme (as you should on every graph in this homework).
Leave the filter()
statement in place. It ensures that we only include student records that have a high school GPA.
Your first step should be to define the aes()
and to determine which geom
you’re going to use.
Hint 1: A bit of help
Defining the aes()
means that you want to specify the x
and y
values for your graph. Since we’re creating a histogram, you only need to define the x
value. In that case, this means that you’re going to use the geom_histogram()
function.
Add this information to the code block and get it to run.
Which theme are you going to use? You should always set your theme as early as you can, certainly before you start thinking about colors, fills, and text. Add it to the code block and get it to run.
The next step will be to set the options in the geom_histogram()
function. The number of bins
is the most important and, to us, depends on the data. What range does HSGPA
have?
Hint 2: A bit of help
The HSGPA
column basically ranges from 2.0
to 4.0
. A reasonable number of bins would be 10
per GPA point, or 20
overall. Set the bins
value to your preferred value.
Do the same for color
, linewidth
, and fill
.
Now set the breaks
for the x-axis
to 2.0
, 2.5
, 3.0
, 3.5
, and 4.0
.
Hint 3: A bit of help
If you set the limits option in scale_x_continuous(), you might get the following warning:
<warning/rlang_warning>
Warning:
Removed 2 rows containing missing values or values outside the scale range (`geom_bar()`).
This behavior looks strange at first. The HSGPA column has had all NAs removed, and it doesn’t have any values outside the range of 2.0 to 4.0, so neither should be a problem. But the warning appears anyway. Note that the message names geom_bar(): it is most likely two bars, not two student records, that were removed, because the outermost bins extend slightly past the limits and are therefore treated as being outside the scale range.
In a reporting context, we could suppress this warning message. But in this case, we wanted you to see it so that we could talk about it.
Solution
This is how we answered this question:
Notice that we simply removed the limits
option for the scale_x_continuous()
function.
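A minimal sketch of one possible answer is shown below. It is not necessarily the exact code we used: the small simulated admit_data is an assumption standing in for the real data, and the colors are arbitrary choices. The filter() step is left out only because the simulated data has no missing values.

```r
library(ggplot2)

# Simulated stand-in for admit_data (assumption: the real data frame has an
# HSGPA column with values between 2.0 and 4.0).
set.seed(1)
admit_data <- data.frame(HSGPA = round(runif(500, min = 2, max = 4), 2))

p <- ggplot(admit_data, aes(x = HSGPA)) +
  # Constants go directly in the geom, not inside aes().
  geom_histogram(bins = 20, color = "black", linewidth = 0.2,
                 fill = "steelblue") +
  # Only breaks are set here; no limits, so no bars are dropped.
  scale_x_continuous(breaks = seq(2.0, 4.0, by = 0.5)) +
  theme_minimal()
p
```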
Question 2: Test your knowledge
Create a histogram of UnivGPA
from admit_data
with a binwidth that seems appropriate for the data.
Note that we do not have the filter()
statement in this question. We took care of the problem when we set up the admit_data
data frame.
Your first step should be to define the aes()
and to determine which geom
you’re going to use.
Hint 1: A bit of help
Wait wait wait! This question is almost exactly like the previous question! You only have to make one change from your answer to that question in order to answer this one.
This is one of the great benefits of using ggplot2
. You can use the same code for different data sets.
Do it! Copy the code, make your change, make a bit of change to the x-axis
, and you’re done.
Solution
This is how we answered this question. You should recognize most of the code. The only differences are 1) that this question is about UnivGPA, 2) these grades span a larger range than those from HSGPA (so the breaks differ), and 3) because the range is larger, we increased the number of bins to 30.
Also, look at both of your solutions to the first two questions. Did you set the ranges of the x-axis to the same values? If not (and we did not here), then it’s not immediately obvious that the university grades are significantly lower than the high school grades. If this is important to you, then you should set the x-axis limits to the same values. If not, then no worries.
This is the kind of thing that you’ll want to think about when you’re creating your graphs for your own institution.
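If you do want the two histograms on the same x range, one option (an addition here, not part of our solution) is coord_cartesian(), which zooms the view instead of filtering the data, so no bars are dropped. The sketch below uses simulated data, and gpa_hist() is a hypothetical helper written just for this illustration.

```r
library(ggplot2)

# Simulated stand-in for admit_data; the value ranges are assumptions.
set.seed(2)
admit_data <- data.frame(
  HSGPA   = runif(500, min = 2, max = 4),
  UnivGPA = runif(500, min = 1, max = 4)
)

# Hypothetical helper: the same histogram definition for any GPA column,
# always drawn on the same x range.
gpa_hist <- function(data, column) {
  ggplot(data, aes(x = .data[[column]])) +
    geom_histogram(bins = 30, color = "black", linewidth = 0.2,
                   fill = "steelblue") +
    scale_x_continuous(breaks = seq(1, 4, by = 0.5)) +
    coord_cartesian(xlim = c(1, 4)) +  # zoom only; the data are untouched
    labs(x = column) +
    theme_minimal()
}

p_hs   <- gpa_hist(admit_data, "HSGPA")
p_univ <- gpa_hist(admit_data, "UnivGPA")
```

Printing p_hs and p_univ one after the other makes the difference in the two distributions easier to compare, because the axes line up.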
Question 3: Test your knowledge
Create a line chart across AdmitCalendarYear
of AvgGPA
from gpa_gender_summary
with a separate line for each gender; you do this by defining some aesthetic characteristic, such as color
or linetype
, for the plot.
- We want to highlight the points of plotted data themselves along the line.
- Use both the color and the shape to distinguish the gender.
- Set the linewidth for the line to 1.2 and the size of the point to 4.
- Ensure that the y-axis shows values from 2.0 to 4.0. (Why?)
- Pick an appropriate color palette. Which type of color palette should you use?
As always, your first step should be to define the aes()
and to determine which geom
you’re going to use.
Hint 1: A bit of help
We’re going to use geom_line()
. A line is a fairly standard choice for a time series. It’s almost expected by the reader that a value that changes over time is shown as a line plot.
We want to highlight the plotted points, so that means we should also use geom_point()
.
We know that the x-axis
is going to be AdmitCalendarYear
. The y-axis
is going to be AvgGPA
.
Finally, we are going to distinguish the Gender
by using both the color
and the shape
. What exactly does this mean, and how does it affect our code?
Both a line and a point have a color, but only a point has a shape. So if we want to distinguish the Gender by color, then we need to set it in the aes() function within ggplot() since both a line and a point have a color. For shape, you need to set it within aes() in the geom_point() function since only a point has a shape.
Go ahead and put all of this in the code block and run it to ensure that you don’t have any syntax errors.
Go ahead and add your theme as well.
Next step is that you should define the options for all of these functions. After you have done all of the above, we’ll talk through these options in the next hint if you’re having any trouble.
Hint 2: A bit of help
In order to set the linewidth
for the line, you need to set this argument in the geom_line()
call itself — not within the aes()
function within the geom_line()
, because the line width does not vary by any column. It is a constant, so it is set directly within the geom_line()
function.
The same holds for the size
of the point. You set this in the geom_point()
function itself, not within the aes()
function because it is also a constant.
All that remains is the color palette. We’ll address this in the next hint. What type of color palette do you think you should use? Qualitative, sequential, or divergent? And is it for the fill
or the color
?
Hint 3: A bit of help
We are setting the palette for color
, not fill
because we are using color
to distinguish the Gender
. Are the values of Gender
ordered in any way? No, they are not. So we should use a qualitative color palette.
Now you know that you’re using a qualitative color palette, but which one? RColorBrewer provides ready-made qualitative color palettes. To see your options, you could go to this page at the R graph gallery. Or, you can see them by executing the following code (with a type value of "div", "qual", "seq", or "all"):
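A minimal sketch of that call uses the display.brewer.all() function from RColorBrewer (the same function mentioned under Question 5). The brewer.pal.info lookup is added here only to list the palette names as text.

```r
library(RColorBrewer)

# Draw every qualitative palette; change type to "div", "seq", or "all"
# to see the other families.
display.brewer.all(type = "qual")

# The same information as text: the names of the qualitative palettes.
qual_palettes <- rownames(subset(brewer.pal.info, category == "qual"))
qual_palettes
```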
The one that we settled on is "Set1"
. Add your choice to the graph definition.
Solution
This is how we answered this question:
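As a hedged sketch of one way to assemble these pieces (the simulated gpa_gender_summary and its Gender values are assumptions; only the column names come from this lesson):

```r
library(ggplot2)

# Simulated stand-in for gpa_gender_summary; the values are made up.
gpa_gender_summary <- data.frame(
  AdmitCalendarYear = rep(2015:2019, times = 2),
  Gender = rep(c("Female", "Male"), each = 5),
  AvgGPA = c(3.21, 3.25, 3.18, 3.30, 3.27,
             3.02, 3.05, 2.98, 3.10, 3.08)
)

p <- ggplot(gpa_gender_summary,
            aes(AdmitCalendarYear, AvgGPA, color = Gender)) +  # color: lines and points
  geom_line(linewidth = 1.2) +                  # constant, so outside aes()
  geom_point(aes(shape = Gender), size = 4) +   # shape applies to points only
  scale_y_continuous(limits = c(2.0, 4.0)) +
  scale_color_brewer(palette = "Set1") +        # qualitative palette
  theme_minimal()
p
```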
Question 4: Test your knowledge
Starting with the graph you defined in the previous question, add a text label to each point (instead of graphing the point) that prints the average GPA. The color of the label should be based on the Gender
while the fill
should be "white"
and the font face should be "bold"
.
Your first step should be to copy your answer from the previous question to this code block. Now figure out what you’re missing. We’ll address that in the hint…but spend a moment trying to figure it out for yourself — even better, try to answer it without looking at the hint.
Hint 1: A bit of help
So here’s what we see that we’re missing:
- In order to add a label to the point, you need to use geom_label() instead of geom_point().
- You need to set the fill and fontface options in the geom_label() function.
Try to make these changes to your code and run it. How do you ensure that the label shows the average GPA and that the color of the label is based on Gender
?
Hint 2: A bit of help
In order to ensure that the label shows the average GPA, you need to set the label
argument (within the aes()
function) in the geom_label()
function. To set the color of the label based on Gender
, you need to set the color
argument (also within the aes()
function) in the geom_label()
function.
Once you set those, then you have done what you needed to do.
Solution
This is how we answered this question:
You’ll notice that we added a useful bit of R
code:
label = sprintf("%.2f", AvgGPA)
This code formats the average GPA (which is a “floating point” (f
) number) to two decimal places. This cleans up the label on the graph so that it doesn’t take up so much room.
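To see the label formatting in context, here is a minimal sketch of the labeled line chart. The simulated gpa_gender_summary (and its Gender values) is an assumption; the geom_label() settings follow the hints above.

```r
library(ggplot2)

# Simulated stand-in for gpa_gender_summary; the values are made up.
gpa_gender_summary <- data.frame(
  AdmitCalendarYear = rep(2015:2017, times = 2),
  Gender = rep(c("Female", "Male"), each = 3),
  AvgGPA = c(3.2149, 3.25, 3.3, 3.0182, 3.05, 2.9)
)

p <- ggplot(gpa_gender_summary,
            aes(AdmitCalendarYear, AvgGPA, color = Gender)) +
  geom_line(linewidth = 1.2) +
  # label varies by row, so it goes inside aes(); fill and fontface are constants.
  geom_label(aes(label = sprintf("%.2f", AvgGPA)),
             fill = "white", fontface = "bold") +
  scale_y_continuous(limits = c(2.0, 4.0)) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal()
p
```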
Question 5: Test your knowledge
Create a stacked bar chart (by value) of the number of each gender admitted each year from gpa_gender_summary
. Pick an appropriate color palette.
- Which type of color palette should you use?
Hint 1: A bit of help
If you look back up at the top of this page, you can see that gpa_gender_summary
has four columns: AdmitCalendarYear
, NumAdmits
, Gender
, and AvgGPA
.
When starting a stacked bar chart, you need to define the x
and y
values for your graph, choose the appropriate geom_*()
, and define the appropriate fill
value.
Also, in the first step, apply your standard theme.
We’ll discuss these in the next hint. You should write the appropriate code to define all the above, run it, and only then go on to the next hint where we’ll discuss our answer to those questions.
Hint 2: A bit of help
How did we think about these questions?
- geom_*(): We are creating a stacked bar chart, so we need to use geom_col(). The geom_col() function is used to create a bar chart where the height of the bar represents the value of the variable. The geom_bar() function is used to create a bar chart where the height of the bar represents the count of cases in each category.
- x: We want to create a separate bar for each year, so the x axis should be defined over the AdmitCalendarYear column.
- y: We want to create a stacked bar chart where the height is determined by the number of admits, so the y axis should be defined over the NumAdmits column.
- fill: We want to fill the bars with different colors based on the Gender, so we need to set the fill argument (for the bar) to Gender.
- theme: We are boring, so we use theme_minimal(). You can use any theme you want, but please use the same one that you have been using for the rest of the homework. Never change themes in the middle of a report (or homework assignment)!
All that’s left now is to choose a color palette. What type of color palette do you think you should use? Qualitative, sequential, or divergent? And is it for the fill
or the color
?
Hint 3: A bit of help
We are setting the palette for fill
, not color
, because we are using fill
to distinguish the Gender
. Are the values of Gender
ordered in any way? No, they are not. So we should use a qualitative color palette.
Do you remember how to list the available color palettes? You can use the display.brewer.all(type="qual") function to see all of the available RColorBrewer qualitative color palettes.
You should now be able to finish up this question.
Solution
This is how we answered this question:
You could also change the geom_col()
layer to the following:
geom_col(color="black", linewidth = 0.2)
This would add a thin black outline to the bars. This is a common practice in data visualization, but it is not required.
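A minimal sketch of the stacked bar chart, including the optional outline, might look like the following. The simulated gpa_gender_summary and its values are assumptions made for illustration.

```r
library(ggplot2)

# Simulated stand-in for gpa_gender_summary; the values are made up.
gpa_gender_summary <- data.frame(
  AdmitCalendarYear = rep(2015:2017, times = 2),
  Gender = rep(c("Female", "Male"), each = 3),
  NumAdmits = c(520, 540, 560, 480, 470, 500)
)

p <- ggplot(gpa_gender_summary,
            aes(AdmitCalendarYear, NumAdmits, fill = Gender)) +
  # geom_col() stacks by default when fill splits each x value into groups.
  geom_col(color = "black", linewidth = 0.2) +
  scale_fill_brewer(palette = "Set1") +  # qualitative palette for fill
  theme_minimal()
p
```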
Question 6: Test your knowledge
Create graphs (separate graphs for each combination of home state and gender) from grades_by_homestate_by_gender
that shows the distribution of letter grades that they earned in courses. You need to assign a color palette (for the letter grades), and we have defined one for you called grade_palette
.
We’re going to go through the steps we followed when defining a new color palette for the bars.
We have to first decide what type of color palette we want to use:
- Qualitative: This is used when the values are not ordered in any way. That’s not appropriate in this case because letter grades are clearly ordered.
- Sequential: This is used when the values are ordered and you want to emphasize the order. It emphasizes the gradual progression from the low values to the high values. This is appropriate in this case, but we did not choose it.
- Divergent: This is used when the values are ordered and you want to emphasize the order, but you also want to emphasize the low and high values as well as the center point. This is appropriate in this case, and we did choose it.
First, we already have a vector of the letter grades that we want to use. (You’ll see why we need this in a moment.)
grade_levels <- c("A", "A-", "B+", "B", "B-", "C+",
                  "C", "C-", "D+", "D", "D-", "F")
Since we have loaded the scales
library, we have the div_gradient_pal()
function available to us for defining a divergent color palette (documentation). This function takes three arguments: low
, mid
, and high
. The low
and high
arguments are the colors at the ends of the gradient, while the mid
argument is the color in the middle of the gradient.
We are going to set the highest grades to be blue, the lowest grades to be red, and the middle grades to be very light grey. There’s no magic here — just pick colors that you like from your standard color scheme.
The process of using this function is a little tricky, because it is a function that returns a function. So we need to call it with the three colors, and then call the result with a sequence of numbers from 0
to 1
that is the same length as the number of grades (length(grade_levels)
). (And now you know why we needed the vector!)
grade_palette <- div_gradient_pal(
  low = "#2166AC",
  mid = "#f7f7f7",
  high = "#B2182B")(seq(0,
                        1,
                        length.out = length(grade_levels)
                        ) )
Note that I did not know whether to put blue as the low
or high
argument. I just guessed. If I was wrong, I would just switch them. The above code chunk results in a grade_palette
containing a vector of 12 colors that range from blue to light grey to red.
Finally, we need to map the color palette to the letter grades. This is done by setting the names of our color palette (grade_palette
) to be the letter grades.
names(grade_palette) <- grade_levels # map to grades
To apply this color palette, you append this layer to your graph:
ggplot(aes(...)) +
  ... +
  scale_fill_manual(values = grade_palette)
This is how to define and use a custom color palette. You can use this code to define your own color palette if you don’t like the ones provided by RColorBrewer.
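To check that the pieces fit together, here is a self-contained sketch that rebuilds the palette in two explicit steps and applies it to a small made-up data frame. The names pal_fun and grade_counts, and the NumGrades values, are assumptions for illustration only.

```r
library(ggplot2)
library(scales)

grade_levels <- c("A", "A-", "B+", "B", "B-", "C+",
                  "C", "C-", "D+", "D", "D-", "F")

# Step 1: div_gradient_pal() returns a palette function.
pal_fun <- div_gradient_pal(low = "#2166AC", mid = "#f7f7f7", high = "#B2182B")

# Step 2: ask that function for one color per grade, then name the colors.
grade_palette <- pal_fun(seq(0, 1, length.out = length(grade_levels)))
names(grade_palette) <- grade_levels

# Made-up counts, one bar per letter grade, kept in grade order.
grade_counts <- data.frame(
  LtrGr = factor(grade_levels, levels = grade_levels),
  NumGrades = c(120, 150, 180, 200, 170, 140, 110, 80, 50, 30, 20, 40)
)

p <- ggplot(grade_counts, aes(LtrGr, NumGrades, fill = LtrGr)) +
  geom_col(color = "black", linewidth = 0.1) +
  scale_fill_manual(values = grade_palette) +  # colors matched by name
  theme_minimal()
p
```

Because the colors are named, scale_fill_manual() matches each grade to its color by name rather than by position.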
Also, as a finishing touch, get rid of the minor grid lines (i.e., those at the values of 500
, 1500
, etc.) while setting the major grid lines to darkgrey
with a linewidth of 0.1
. We also want to remove the letter grade legend since the grades are already shown in the graph.
As always, apply the same theme that you have been using.
We need to make some general decisions:
- What geom are you going to use?
- What are the x and y values of the graph?
- You’re making multiple graphs, so what does that tell you to do?
See if you can use your answers to these questions to build the first iteration of the graph. We address the questions in the next hint (though we aren’t writing the code yet).
Hint 1: A bit of help
Let’s address those questions:
- What geom are you going to use? We want to show the distribution of grades, and a bar chart is a good way to do this. So we are going to use geom_col() since we are going to specify the height of each bar explicitly.
- What are the x and y values of the graph? The x value is the letter grade (since we want to show a distribution of the grades), and the y value is the number of students who received that letter grade.
- You’re making multiple graphs, so what does that tell you to do? This tells us to use a facet_*() function. We want to create a separate graph for each combination of HomeState and Gender, so we need to use facet_grid(). Since we’ll be using facet_grid(), we need to define the rows and cols arguments. If possible, we usually use the rows argument for the variable that has more unique values. In this case, HomeState has 10 unique values and Gender has 2 unique values (since we only want to analyze the larger groups at this point), so we should use HomeState for the rows argument and Gender for the cols argument.
This should tell you enough to get started. You should be able to write the code to define the aes()
and the geom_*()
function. You also might as well add your standard theme to the graph now, too. We’ll look at the code we’ve written to this point in the next hint.
Hint 2: A bit of help
You should have something like this at this point:
grades_by_homestate_by_gender |>
  ggplot(aes(LtrGr, NumGrades)) +
  facet_grid(rows = vars(HomeState),
             cols = vars(Gender)) +
  geom_col() +
  theme_minimal()
Our next step is to assign different colors to the bars — that would be the fill
argument — based on the letter grades. In our case, this also involves assigning the custom color palette that we defined above.
After we did this, we thought that the bars near the center of the distribution were quite hard to see. To remedy this, we added a black outline to the bars. This is done by setting the color
argument in the geom_col()
function. Since it is a constant, we set it directly in the geom_col()
function, not within the aes()
function.
We look at the code to accomplish all of this in the next hint. As always, try to do this yourself — struggle a bit, even! It will help you learn and internalize the material.
Hint 3: A bit of help
At this stage, we have this code:
grades_by_homestate_by_gender |>
  ggplot(aes(LtrGr, NumGrades,
             fill = LtrGr)) +
  facet_grid(rows = vars(HomeState),
             cols = vars(Gender)) +
  geom_col(color = "black", linewidth = 0.1) +
  theme_minimal() +
  scale_fill_manual(values = grade_palette)
Note that we just made three changes to the code for this step.
What remains now? Basically, we want to de-clutter the graph:
- We want to get rid of the minor grid lines (i.e., those at the values of 500, 1500, etc.). This is done by setting the panel.grid.minor argument in the theme() function to element_blank(). You do not have to specify the values that you have to remove; referring to them as the minor grid lines is enough.
- We need to set the major grid lines to darkgrey with a linewidth of 0.1. This is done by setting the panel.grid.major argument in the theme() function to element_line(color="darkgrey", linewidth=0.1).
- We want to remove the letter grade legend since the grades are already shown in the graph. This is done by setting the legend.position argument in the theme() function to "none".
The solution is next. See if you can figure this out before looking at the solution. If you can’t, that’s okay — just try to understand the code and how it works.
Solution
This is how we answered this question:
We don’t even have any labs()
defined in this graph — it would certainly benefit from a title, a subtitle, new x
and y
axis labels, and a caption. Try it out if you’re feeling adventurous.
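The de-cluttering steps can be sketched on a reduced example. Everything about the data below is an assumption made to keep the sketch self-contained: two states, two genders, three letter grades, made-up counts, and a three-color stand-in for grade_palette.

```r
library(ggplot2)

# Reduced, made-up stand-in for grades_by_homestate_by_gender.
grades <- expand.grid(
  LtrGr = c("A", "B", "C"),
  Gender = c("Female", "Male"),
  HomeState = c("CA", "NY")
)
grades$NumGrades <- c(900, 1400, 600, 800, 1500, 700,
                      500, 1100, 400, 450, 1000, 550)

# Three-color stand-in for the 12-color grade_palette defined earlier.
grade_palette <- c(A = "#2166AC", B = "#f7f7f7", C = "#B2182B")

p <- ggplot(grades, aes(LtrGr, NumGrades, fill = LtrGr)) +
  facet_grid(rows = vars(HomeState), cols = vars(Gender)) +
  geom_col(color = "black", linewidth = 0.1) +
  scale_fill_manual(values = grade_palette) +
  theme_minimal() +
  # theme() tweaks come after theme_minimal() so they are not overwritten.
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "darkgrey", linewidth = 0.1),
    legend.position = "none"
  )
p
```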
Question 7: Test your knowledge
Create violin plots on one graph showing the distribution of HSGPA by gender (from admit_data). Pick a Viridis color palette, using whatever option you would like (options), for the fill (which should be based on Gender); set the alpha to 0.7 to lessen the saturation of the colors. Define a useful title, subtitle, caption, and axis names. Apply the same theme that you have been using.
As always, answer the basic questions first so that you can construct the foundation of the graph:
- What geom are you going to use?
- What are the x and y values of the graph?
And, again, go ahead and add the theme in this first step.
Build this basic graph in this first step, and we’ll guide you through the next steps starting in the next hint.
Hint 1: A bit of help
This is how we started our graph:
admit_data |>
  ggplot(aes(Gender, HSGPA)) +
  geom_violin(linewidth = 0.3) +
  theme_minimal()
Just the basics, to make sure that the graph is going to show what we want. If it doesn’t show promise, then there’s no reason to continue refining it.
We’re happy with it so far.
Let’s address the color-related steps here.
- We want to fill the violin plots with different colors based on the Gender, so we need to set the fill argument (for the violin plot) to Gender. Remember that, since the fill is based on a column, you should set this up within the aes() within ggplot().
- We want to use a Viridis color palette, and Gender is a discrete set of values, so we need to use the scale_fill_viridis_d() function. This function takes an option argument that specifies which color palette to use. You can find the available options here.
- Don’t forget to set the alpha (opacity) argument of the color palette to 0.7 to lessen the saturation of the colors.
Put all of the appropriate code for these steps within the graph definition.
Hint 2: A bit of help
The last step is to add the labs()
function to define a useful title, subtitle, caption, and axis names.
Solution
This is how we answered this question:
You might wonder why we are encoding the Gender
variable in the aes()
function twice. The first time is to define the fill
color of the violin plot, and the second time is to define the x
axis.
But why do this?
Redundancy isn’t always bad. In this case, it is a good thing because it makes the graph easier to read and understand — and to talk about: “See the orange plot?”. Also, if you’re displaying the graph on a slide, some people in the audience may not be able to read the x-axis
labels. The orange is much more obvious.
If you haven’t already done so, look at the graph without the alpha
argument. It is an intense visual. We like to soften the look with the alpha
argument; it’s just our preference.
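A minimal sketch of these steps on simulated data follows. The toy admit_data, the "plasma" option, and the label text are all assumptions chosen for illustration; pick whichever option and wording suit your own graph.

```r
library(ggplot2)

# Simulated stand-in for admit_data; the values and Gender levels are made up.
set.seed(3)
admit_data <- data.frame(
  Gender = rep(c("Female", "Male"), each = 200),
  HSGPA = pmin(4, pmax(2, rnorm(400, mean = 3.2, sd = 0.4)))
)

p <- ggplot(admit_data, aes(Gender, HSGPA, fill = Gender)) +  # Gender used twice
  geom_violin(linewidth = 0.3) +
  scale_fill_viridis_d(option = "plasma", alpha = 0.7) +  # alpha softens the fill
  theme_minimal() +
  labs(
    title = "Distribution of high school GPA",
    subtitle = "By gender",
    x = "Gender",
    y = "High school GPA",
    caption = "Simulated data for illustration"
  )
p
```

Setting alpha back to 1 (or removing it) shows the fully saturated version for comparison.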
Question 8: Test your knowledge
Create box-and-whiskers plots showing the distribution of UnivGPA
by DeclaredMajor
(from admit_data
). We want to show the plots in order by the major’s median GPA.
- Create a useful y-axis scale with values ranging from 1.0 to 4.0 and breaks every 0.5.
- Make the x-axis labels readable by rotating them 45 degrees.
- Define a useful title and axis names.
- As always, apply the same theme that you have been using.
- To emphasize the increasing median GPA on the plot, use a color palette that goes from very light grey (for the lowest median GPA) to blue (for the highest median GPA).
- Hide the fill legend.
Note that you should not change the beginning mutate()
statement that uses the fct_reorder
function (documentation). It will ensure that your box-and-whiskers plots are ordered by the median UnivGPA
for each DeclaredMajor
. This statement should be interpreted as follows:
The DeclaredMajor variable is being reordered based on the median UnivGPA for each major, with the lowest median GPA first and the highest median GPA last. The .na_rm = TRUE argument ensures that any missing values are ignored when calculating the median. The results are assigned back to the DeclaredMajor column (but only for this statement since the results are not being assigned back to admit_data).
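A tiny example makes the reordering concrete. The majors and GPA values below are made up for illustration; the fct_reorder() call has the same form as the one in the question.

```r
library(forcats)

# Made-up majors and GPAs; note the missing value for Biology.
majors <- factor(c("Art", "Art", "Biology", "Biology", "Chemistry", "Chemistry"))
gpa <- c(3.6, 3.8, 2.4, NA, 3.0, 3.2)

levels(majors)
# "Art" "Biology" "Chemistry"   (alphabetical by default)

# Medians: Art 3.7, Biology 2.4 (NA ignored), Chemistry 3.1.
reordered <- fct_reorder(majors, gpa, .fun = median, .na_rm = TRUE)

levels(reordered)
# "Biology" "Chemistry" "Art"   (lowest median first)
```

Only the order of the factor levels changes; the values in each row stay the same, which is why the box-and-whiskers plots end up sorted along the x axis.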
Start this process by doing the following:
- Defining the x and y for the graph,
- Setting the values on the y axis,
- Picking the geom_*() that you’re going to use,
- Giving the graph a title and more descriptive axis labels, and
- Applying the theme.
Then, in the next hint, we’ll review these decisions and continue with next steps.
Hint 1: A bit of help
This is where we are in the process. None of this should be surprising to you at this point.
admit_data |>
  mutate(
    DeclaredMajor = fct_reorder(
      DeclaredMajor,
      UnivGPA, .fun = median,
      .na_rm = TRUE
    )
  ) |>
  ggplot(aes(DeclaredMajor, UnivGPA)) +
  geom_boxplot() +
  theme_minimal() +
  scale_y_continuous(
    limits = c(1.0, 4.0),
    breaks = seq(1, 4, by = 0.5)
  ) +
  labs(
    title = "Distribution of GPA by Major",
    x = "Major",
    y = "GPA"
  )
We have the basic graph defined, and we are happy with it…so far as it goes.
We would love to address the lack of color in the graph. We want to apply a color palette to the fill
of the box-and-whiskers plots.
Setting up the fill
color of a box-and-whiskers in which the values are calculated is a little tricky. We know that we are going to use the fill
argument within the aes()
of the ggplot()
. But, beyond that, things veer off the rails a little.
What we’ve done so far is set the color, or fill, or shape, or size, or whatever, to a variable in the data frame. But here, we want to set the color based on the calculated value of the x
variable. We’re not coloring (or whatever) a specific row in a data frame; we’re coloring based on a calculated value over a whole group of rows.
Luckily, for just this kind of situation, ggplot2
has a special function called after_stat()
. This function allows us to set the color based on the calculated value of the x
variable.
The form of the statement is now:
ggplot(aes(DeclaredMajor, UnivGPA,
fill = after_stat(x))) +
...
This says to set the color based on whatever calculations are done — that is, the creation of the box-and-whiskers itself.
Next, we need to define a color scale that is a two-color gradient from one (low) color to one (high) color. (The ggplot2 package also provides the scale_*_gradient2 function for a diverging color palette. See the documentation for more information.) The structure of the scale_fill_gradient() function is the following:
scale_fill_gradient(
  low = "<low color>",
  high = "<high color>"
) +
Specify both the fill
and this color palette in the ggplot()
statement. We’ll take a look at the code in the next hint, and then continue with this journey.
Hint 2: A bit of help
Here’s where we are:
admit_data |>
  mutate(
    DeclaredMajor = fct_reorder(
      DeclaredMajor,
      UnivGPA, .fun = median,
      .na_rm = TRUE
    )
  ) |>
  ggplot(aes(DeclaredMajor, UnivGPA,
             fill = after_stat(x))) +
  geom_boxplot() +
  scale_y_continuous(
    limits = c(1.0, 4.0),
    breaks = c(1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)
  ) +
  theme_minimal() +
  labs(
    title = "Distribution of GPA by Major",
    x = "Major", y = "GPA"
  ) +
  scale_fill_gradient(
    low = "#f7f7f7", # very light grey
    high = "#4A85A3" # blue
  )
We now have two final polishing steps:
- Hide the fill legend. (Why do this? Have you looked at the legend? It doesn’t provide any useful information at all. Better to get rid of it.)
- Make the x-axis labels readable by rotating them 45 degrees.
Both of these can be handled within the theme() function.
- The first is done by setting the legend.position argument to "none".
- The second is handled with the axis.text.x argument. You set it to the element_text() function. Within this function, you can set many values, but the one that we’re interested in is angle. Set it to 45.
Give it a try! Then look at our solution.
Solution
This is how we answered this question:
We have been dabbling with the theme() function, but take a moment to look at the documentation. It lists dozens of components that you can modify. Similarly with element_text() and its documentation.
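Putting the polishing steps together, a hedged sketch on simulated data might look like this. The toy admit_data is an assumption, the reordering is done with a plain assignment instead of mutate() to keep the sketch short, and hjust = 1 is an optional extra (not requested in the question) that keeps the rotated labels under their ticks.

```r
library(ggplot2)
library(forcats)

# Simulated stand-in for admit_data with three made-up majors.
set.seed(4)
admit_data <- data.frame(
  DeclaredMajor = rep(c("Art", "Biology", "Chemistry"), each = 50),
  UnivGPA = c(rnorm(50, 3.4, 0.3), rnorm(50, 2.6, 0.3), rnorm(50, 3.0, 0.3))
)
admit_data$UnivGPA <- pmin(4, pmax(1, admit_data$UnivGPA))  # keep within 1.0-4.0

# Order the majors by median GPA, lowest first.
admit_data$DeclaredMajor <- fct_reorder(
  admit_data$DeclaredMajor, admit_data$UnivGPA,
  .fun = median, .na_rm = TRUE
)

p <- ggplot(admit_data, aes(DeclaredMajor, UnivGPA, fill = after_stat(x))) +
  geom_boxplot() +
  scale_y_continuous(limits = c(1.0, 4.0), breaks = seq(1, 4, by = 0.5)) +
  scale_fill_gradient(low = "#f7f7f7", high = "#4A85A3") +
  theme_minimal() +
  labs(title = "Distribution of GPA by Major", x = "Major", y = "GPA") +
  theme(
    legend.position = "none",                          # hide the fill legend
    axis.text.x = element_text(angle = 45, hjust = 1)  # rotate the labels
  )
p
```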
Question 9: Test your knowledge
For each combination of Gender
and StudentType
, show a histogram of TotalGradePointsEarned
(from admit_data
).
Give it a shot! It’s a pretty short answer this time.
Hint 1: A bit of help
Here’s how we parsed the request:
- We’ve been asked to show separate graphs for different combinations of two variables. This is a classic use of the facet_grid() function. Since there are two values for student type and four values for gender (since we haven’t filtered out any values for this data set), we should use student type for the columns and gender for the rows. (Also, don’t forget to use vars() to specify the variables.)
- The graph we’ve been asked to show is a histogram. So, we need to use the geom_histogram() function.
Wrap this up, and then we’ll take a look at our solution.
Solution
This is how we answered this question:
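A short sketch of this kind of answer is shown below on simulated data. The toy admit_data is an assumption: it has only two gender values rather than four, and the StudentType values are invented.

```r
library(ggplot2)

# Simulated stand-in for admit_data; all values are made up.
set.seed(5)
admit_data <- data.frame(
  Gender = sample(c("Female", "Male"), 400, replace = TRUE),
  StudentType = sample(c("Freshman", "Transfer"), 400, replace = TRUE),
  TotalGradePointsEarned = runif(400, min = 0, max = 400)
)

p <- ggplot(admit_data, aes(x = TotalGradePointsEarned)) +
  # Rows get the variable with more values; vars() names the columns.
  facet_grid(rows = vars(Gender), cols = vars(StudentType)) +
  geom_histogram(bins = 30) +
  theme_minimal()
p
```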
Here are our observations about the graph:
- The grid looks good to us.
- The x and y axes scales are pretty good.
- The color of the histogram bars is a little bland.
- You could also give the graph a title, a subtitle, and a caption.
Just for kicks and giggles, give it a shot!
Question 10: Test your knowledge
Referring back to Question 2: Make the colors of the bars get darker as the bars get farther to the right. Also, define an appropriate color palette using RColorBrewer. Ensure that the color legend is as useful as you can make it. Make the x-axis and y-axis as useful as you can. Also define a useful title, subtitle, caption, and axis names. As always, apply the same theme that you have been using.
Certainly, you should start by copying your answer to #2 into the code chunk. (This is a fairly typical process that you’ll go through with graphs. You’ll start with last year’s or last month’s version and see if you can improve it a bit.)
What specific requests are you being asked to address?
Hint 1: A bit of help
Here are the requests that we see:
- Make the x-axis and y-axis as useful as you can. (This generally means defining the limits, breaks, and possibly the labels.)
- Define a useful title, subtitle, caption, and axis names.
- Apply the same theme that you have been using.
- Make the colors of the bar get darker going to the right.
- Define an appropriate color palette using RColorBrewer.
- Ensure that the color legend is as useful as you can make it.
Address the first three before progressing to the next hint.
Hint 2: A bit of help
This is what we have from starting with our answer to #2 and then addressing the first three requests:
admit_data |>
  ggplot(aes(x = UnivGPA)) +
  geom_histogram(aes(fill = after_stat(x)),
                 color = "black",
                 alpha = 0.8,
                 linewidth = 0.2) +
  scale_x_continuous(
    breaks = seq(1, 4, by = 0.5)
  ) +
  scale_y_continuous(
    limits = c(0, 2000),
    breaks = seq(0, 2000, by = 250)
  ) +
  theme_minimal() +
  labs(
    title = "Distribution of University GPA values",
    subtitle = "For all students",
    x = "University GPA",
    y = "Number of students",
    caption = "For homework assignment"
  )
Now we need to address the last three requests:
- Make the colors of the bar get darker going to the right.
- Define an appropriate color palette using RColorBrewer.
- Ensure that the color legend is as useful as you can make it.
To make the colors of the bars get darker as they go to the right, we need to use the scale_fill_distiller()
function. This function allows us to create a color gradient based on the values of the x
variable. All you have to do is set the palette
argument to the name of an appropriate palette. (You can find a list of the available palettes on this page.)
As for making the color legend useful…well, we don’t find it useful at all! We want to get rid of it, so we set the legend.position
argument (within theme()
) to "none"
.
Solution
This is how we answered this question:
Notice that we also set the direction
argument within scale_fill_distiller()
to 1
. This means that the colors will go from light to dark as the values of the x
variable increase. If you set it to -1
, the colors would go from dark to light.
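A minimal sketch of the gradient fill on simulated data is shown below. The toy admit_data and the choice of the "Blues" palette are assumptions; any sequential RColorBrewer palette would work the same way.

```r
library(ggplot2)

# Simulated stand-in for admit_data; the values are made up.
set.seed(6)
admit_data <- data.frame(UnivGPA = runif(1000, min = 1, max = 4))

p <- ggplot(admit_data, aes(x = UnivGPA)) +
  # after_stat(x) is the bin position, so fill varies from left to right.
  geom_histogram(aes(fill = after_stat(x)),
                 bins = 30, color = "black", alpha = 0.8, linewidth = 0.2) +
  scale_x_continuous(breaks = seq(1, 4, by = 0.5)) +
  scale_fill_distiller(palette = "Blues", direction = 1) +  # light to dark
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "Distribution of University GPA values",
    x = "University GPA",
    y = "Number of students"
  )
p
```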
And that’s it!
We suggest that, after you have completed this homework within this page, you come back to it and attempt to answer the questions within RStudio
(while relying on the hints as little as possible). This will help you to get used to the process of building graphs in R
and ggplot2
.