Week 2: In-class Practice & Homework
This in-class exercise is intended to develop familiarity with the RStudio
interface and some useful operators & functions in R/tidyverse
.
1 Set up
You’ll work in groups. Choose one person to share screen, and walk through the instructions below. Everyone should try to work along if possible.
- Find the folder
week2-files
. - Double-click on the project file (
week2.Rproj
). - Once everything loads, use the
File
tab in the lower right to click oncollaborate.R
to bring the script to the editing window. - From the
File
menu, chooseSave As...
and save the file toYOURNAME-wk2.R
(or whatever — just choose something different). This will ensure that you always have thecollaborate.R
script to go back to in case something goes wrong. - You will now work in this new script for the rest of class and for your homework (by going through the tasks on this page).
- Check to make sure that the file
southern_conf_data.csv
is in thedata
folder. - Run the code through line 16 in the script. You should be able to describe (and execute) at least two different ways of quickly doing this. Share among your group the ways that you can name.
- It should create a new data frame in the
Environment
(upper right tab) calledenrollment
. - Explore the data. It should have 70 rows and 21 columns.
If you are working in this web page, then read in the data and show its structure by running the following code:
2 The data
This data is an extract from IPEDS data (Integrated Postsecondary Education Data System). We downloaded the 12-Month Enrollment (EFFY) survey results from 2017-23 and the Institutional Characteristics (HD) directory information from 2022.
2.1 The institutions
We filtered this down to schools that are in the Southern Conference:
- UNCG 199148
- Wofford 218973
- East TN State 220075
- Chattanooga 221740
- Furman 218070
- Mercer 140447
- Samford 102049
- Western Carolina 200004
- VMI 234085
- The Citadel 217864
After each school in the list above, the number is the IPEDS Unit ID that represents that school throughout its database.
You can look up any institution’s Unit ID on this page.
2.2 Other limitations
We downloaded all of the enrollment data for 2017-23. We then filtered it to include only the data from those schools and their undergraduate programs.
2.3 Specific pieces of data collected
We further limited the columns of data to only the following:
UNITID
: the IPEDS unit identifieryear
: the year for which the data is capturedinst_name
: the name of the institutioncity
: the city location of the institutioncounty
: the county location of the institutionstate
: the state location of the institutionfips
: the FIPS code representing the location of the institutionsize_set
: The Carnegie Basic Classification is the 2021 update of the traditional classification framework developed by the Carnegie Commission on Higher Education in the early 1970s to support its research program.size_set_desc
: the description of thesize_set
column codeland_grant_desc
: answers the question “Is this a land grant institution?”mult_camp
: specifies whether or not the unit is part of a multi-institution or multi-campus organizationstudent_status
: specifies the code representing the level and degree/certificate-seeking status of the studentsstudent_status_desc
: the description of thestudent_status
codeug_grad_desc
: answers the question “are the students Graduate or Undergraduate”. All of the data in this data set represent undergraduates.grand_total
: the number of students represented in this rowaian_pc
: the percent of students who are American Indian or Alaskan Nativeasia_pc
: the percent of students who are Asianbaa_pc
: the percent of students who are Black or African Americanhislat_pc
: the percent of students who are Hispanic or Latinonhopi_pc
: the percent of students who are Native Hawaiian or Other Pacific Islanderwhite_pc
: the percent of students who are White
3 The situation
The enrollment
data frame is a fairly typical wide data table. We will want to get this into a long data format so that we can then complete some analytical tasks on it. These tasks are related to what an IR analyst would do in order to learn about the ethnic composition of a set of schools.
4 Tasks to complete
The rest of this document describes a series of tasks, all of which you have to complete in order. (Later tasks sometimes rely on the computational effects of previous tasks.) The hints of many questions contain valuable teaching/learning points! Be sure to at least scan through them all.
4.1 Pivot enrollment
to a long format
Test your knowledge
You need to pivot enrollment
to a long format. This is a multi-step process, so we’re going to take you through it step-by-step.
As you go through these steps, iteratively build your query and execute it every step of the way.
- Gather your information:
- Determine which columns are going to be pivoted. How are you going to describe these columns in the
cols
argument? - What name do you want to use for the column of names?
- What name do you want to use for the column of values?
- Determine which columns are going to be pivoted. How are you going to describe these columns in the
- Before you ever save a pivoted table, you should first execute the pivot command and examine (and refine) it.
- Write the
pivot_longer
command. - Once you are satisfied with it, assign the results to the
enrollment_long
data frame.
A bit of help
We are going to call the column of names ethnicity
and the column of values percent
. It will be easier for the rest of the assignment if you use these two names as well.
A bit of help
The easiest way to describe the columns that are to be pivoted is to collect them in a vector, as follows:
= c("aian_pc", "asia_pc", "baa_pc",
cols "hislat_pc", "nhopi_pc",
"white_pc")
If you want the values to appear in a certain order, then this is actually a convenient way to specify the set of columns.
However, R/tidyverse
has a whole set of ways to select variables that match a pattern (as documented on this page) that are flexible, can ensure that your list stays up-to-date, and require much less typing (especially if the list is long, as these lists can be):
starts_with(match, ignore.case = TRUE)
-
the column name starts with
match
. For example, if you had a set of columns(name, Q1, Q2, Q3)
, thenstarts_with_("q")
which match(Q1, Q2, Q3)
since they each starts withq
andignore.case
has a default value ofTRUE
. ends_with(match, ignore.case = TRUE)
-
the column name ends with
match
. For example, if you had a set of columns(first_name, last_name, TEAM_NAME, ssn)
, thenends_with("name", FALSE)
would match(first_name, last_name)
since they each end withname
andignore.case
has a value ofTRUE
. contains(match, ignore.case = TRUE)
-
the column name contains
match
. For example, if you had a set of columns(2025Q3wk13, 2025Q4wk01, 2025Q4wk02)
, thencontains("Q3")
would match only the first column. num_range(prefix, range, suffix = "", width = NULL)
-
the column matches column names containing a numerical range like
q1, q2, q3
orq001-s3, q002-s3, q003-s3..., q999-s3
. To match the first set, you might usenum_range("q", 1:3)
. To match the second set, you might usenum_range("q", 1:999, width=3)
.
Note for the values of match
above, you can also supply a vector of values and the function will return columns that match any one of the values.
Finally, another more advanced option named matches()
uses regular expressions to match column names.
We chose to use ends_with("_pc")
as our value for the cols
argument.
A bit of help
Note that each one of the columns that we want to use as a name
starts with the identifier of the ethnicity (e.g., aian
) and then concludes with _pc
.
We would prefer that the _pc
not be part of the name. The R/tidyverse
provides an option (in the pivot_longer
function) that enables you to specify this — it’s called names_pattern
. We can specify names_pattern = "(.*)_pc"
. This matches any string of characters up to, but not including, the _pc
at the end; this matching set would then be used as the name instead of the full name.
So, for example, this pattern would match aian_pc
but the name would be aian
in the pivoted data frame.
A bit of help
As a reminder, while you are iteratively developing your query, you should be running it in the form:
|>
enrollment pivot_longer(...)
Once you have it right, then you should go ahead and run it and assign the output to a variable, such as:
<-
enrollment_long |>
enrollment pivot_longer(...)
Solution
This is how we answered this question. (Be sure to run this; the rest of the homework depends on this data being loaded.)
Note that this isn’t the answer; this is one answer. Just because you got something that differs from this, it doesn’t mean that you’re wrong. Just be sure that you have maintained the first 15 columns from enrollment
and have added 2 new columns (ethnicity
and percent
).
Explore the data for a bit; if you’re doing this in this Web page, then use the code block below to run your exploratory code.
What do you want to know? How might you get that information? We’ll get you started with this select()
query (that’s already in the code block), but you should definitely spend a couple of minutes trying out different queries to learn about this data frame.
Note that the kbl()
and kable_minimal()
operators are both provided by the kableExtra
package. These are simply tools that you can use to improve the look and readability of your printed tables.
Try your queries with and without these operators (both in this Web page and in RStudio
) in order to explore the effects of this code. It is sometimes helpful to use these operators and sometimes less so. Just use what you find most appropriate.
4.2 Unique values in ethnicity
column
Test your knowledge
What are the unique values in the ethnicity
column?
A bit of help
This can be answered in a one-line R/tidyverse
query.
A bit of help
The operator distinct()
is probably what you’re looking for.
Solution
This is how we answered this question:
What is the data type of the output of the above command?
Answer: It is a one-column data frame. This will come up again (very soon).
4.3 Unique values in state
column
Test your knowledge
What are the unique values in the state
column?
A bit of help
This one should be pretty straight-forward after you answered the previous question.
Solution
This is how we answered this question:
We had you answer the previous two questions because we want to use that information when we make these two columns into factors (since they are categorical data).
(Note: this discussion and the following single question are not strictly necessary for this analysis or for the arriving at the right answer. However, it is informative and you might as well start getting comfortable with these concepts, even if you’re not quite ready to master them.)
Unfortunately, we cannot use the result of the queries to specify the information in the factor definition — it requires a vector while the previous two queries return data frames (with just one column, but it still isn’t appropriate for what we need to do). (See the lesson on data types for more information.)
Fortunately, but possibly not surprisingly, R/tidyverse
provides a helper operator that can transform a single column that is part of a data frame into a free-standing vector; it is called pull()
and we haven’t come across it yet.
The input to pull()
must be a data frame and the output will be, as stated, a vector. The only argument to pull()
is the name of the column that you want to transform into a vector.
In the following, we take the answers from the previous two questions and turn them into part of the methods that we use to define two vectors (state_info
and ethnicity_info
) that will contain the vector of all, respectively, states and ethnicities, that we have in our data.
Note that in both of the following queries we use the trick of surrounding the query with a set of parentheses so that it will display the result of the assignments.
So, from now on, we have these two vectors that we can use in commands where necessary.
4.4 Define factors for ethnicity
and state
columns
Test your knowledge
Both ethnicity
and state
contain categorical data. Write the mutate()
statements that will make the both of them factors.
A bit of help
If you have not yet done so, you should go through the short factor lesson to learn about how to define a factor for a column that contains categorical data.
A bit of help
You need to make two decisions about a factor:
- What are the set of allowable values (or
levels
inR/tidyverse
lingo). - Can the values be considered to be ordered? That is, does it make sense to be able to say (about two possible values, say
x
andy
) thatx < y
orx > y
? If so, then it is an ordered factor.
Solution
This is how we answered this question:
Both statements within the mutate()
operator have the same form:
= factor(
NEW_COL_NAME
OLD_COL_NAME,levels = VALUES,
ordered = TRUE_OR_FALSE
)
It is most common for VALUES
to simply be a vector of values.
If you’re looking for an in-depth discussion of factors after you look through our lesson on factors, you should read through this page in R for Data Science.
4.5 One command to get distribution of values for all columns
Test your knowledge
Write the one-line command that will return the distribution of values for all (okay, not the character-based ones) columns.
A bit of help
You have several choices when thinking about getting information about the columns in a data table: str()
, names()
, summary()
, View()
, head()
, spec()
, and glimpse()
, among others.
Solution
This is how we answered this question:
Note that this does not provide any information for character
(string) columns other than the length of the row. Thus, if you defined the factors above, both the state
and ethnicity
rows have useful information displayed.
4.6 Remove multiple columns from the data frame
Test your knowledge
This is a long and complicated analysis to undertake. It’s harder to do when columns that you do not need (at the moment, anyway) are cluttering up your mind and the screen. Let’s get rid of them.
Write a command that will remove the following columns from the enrollment_long
data frame: "city"
, "county"
, "fips"
, "student_status"
, "student_status_desc"
, "ug_grad_desc"
, "mult_camp"
A bit of help
The select()
command is useful for this question.
A bit of help
Also, the -
(minus) operator can make this command both easier to write and easier to understand.
Solution
This is how we answered this question:
Notice that we used the -
(minus) operator to tell R/tidyverse
to remove these columns.
Also, if we had not included the assignment (in the first line), then this column removal would not have had any effect beyond this one command.
Let’s remind ourselves of the columns that we have remaining:
4.7 Calculate the average value in a vector
Test your knowledge
Without using the tidyverse
, calculate the average value in the grand_total
column of the enrollment_long
data frame.
Solution
This is how we answered this question:
Other functions that you can use are the following:
- Single values
-
sum()
,median()
,min()
,max()
,sd()
,var()
- Multiple values
quantile()
: Calculate the quantilesrange()
: Finds the minimum and maximum valuesIQR()
: calculates the interquartile range; that isQ3 - Q1
summary()
: provides the minimum,1Q
,2Q
, average,3Q
, and maximum.
If you use summary()
— yes, it certainly looks exactly like the summary()
command that we use on data frames! — you get a lot of good information:
4.8 Display all ethnicity values across all years
Test your knowledge
For the institution with UNITID == 218973
, display the year
, ethnicity
, and percent
across all years and ethnicities.
A bit of help
To start working on this, you should start by answering these questions:
- Where is the data coming from?
- What column(s) are you going to filter on?
- What columns are you going to display?
- Do you need to calculate anything?
As always, build your response iteratively.
A bit of help
Here are our answers to those questions:
- Where is the data coming from? —
enrollment_long
- What column(s) are you going to filter on? —
UNITID
- What columns are you going to display? —
year, ethnicity, percent
- Do you need to calculate anything? — no
Solution
This is how we answered this question:
4.9 Calculate the number of students of a specific ethnicity
Test your knowledge
For the institution with UNITID == 218973
, calculate the actual number of white students who attended the institution each year, sorted by year.
A bit of help
Again, to start working on this, you should start by answering these questions:
- Where is the data coming from?
- What column(s) are you going to filter on?
- What columns are you going to display?
- Do you need to calculate anything?
The answers to these questions should help you build your query. Before proceeding, build your query and step-by-step integrate your answers to these questions.
A bit of help
How are you going to get the actual number of students of a specific ethnicity? What information do you have, and how might you use this to arrive at an answer?
A bit of help
Here are our answers to the questions:
- Where is the data coming from? —
enrollment_long
- What column(s) are you going to filter on? —
UNITID
andethnicity
- What columns are you going to display? —
year
,percent
, andgrand_total
. - Do you need to calculate anything? — yes, we need to calculate
num_white
. We get this by multiplyingpercent * grand_total
.
Solution
This is how we answered this question:
Certainly, the last two lines of the command are not necessary; however, we like how the results are displayed here.
Also, the arrange()
command is not strictly necessary, but we wanted to ensure that the rows were displayed in this order.
4.10 Calculate distribution of values across all institutions
Test your knowledge
Here we are going to do a little baseline setting for the institutions that we are looking at.
For just Hispanic and White students, grouping by year and ethnicity, calculate the minimum, average, and maximum percents among all these institutions each year for each ethnicity.
A bit of help
Here is how we parsed this question when reading it:
- “For just Hispanic and White students”
-
A
filter()
is going to be needed. - “grouping by year and ethnicity”
-
A
group_by()
as well! - “calculate the minimum, average, and maximum percents”
-
Probably will need a
summarize()
with three different values. - “among all these institutions each year for each ethnicity”
- I don’t see any need to filter out anything here — it’s all the institutions, all years, and all ethnicities.
I didn’t see any specification for any other information, so I don’t see the need for a select()
right here.
What query should you build based on the above?
A bit of help
Here are the pieces of query that I put together from my reading of the query:
- “For just Hispanic and White students”
- Either of the following two choices:
filter(ethnicity == "white" | ethnicity == "hislat")
filter(ethnicity %in% c("white", "hislat"))
Either one works.
- “grouping by year and ethnicity”
-
For this,
group_by(year, ethnicity)
. - “calculate the minimum, average, and maximum percents”
-
The
summarize()
command makes it easy to calculate the three values that we need:
summarize(Min = min(percent),
Avg = mean(percent),
Max = max(percent))
- “among all these institutions each year for each ethnicity”
- Nothing else.
A bit of help
After you have put together the basics of the query that displays the information that you need, how might you improve its display so that your target audience can most easily interpret it?
Solution
This is how we answered this question:
As always, the last two lines are optional and personal preference. Other choices for the last line include:
kable_styling()
kable_minimal()
kable_classic()
kable_classic_2()
kable_material()
kable_material_dark()
Be sure that you can provide a description of how each row of this table should be interpreted.
4.11 Create a new data frame
Test your knowledge
Create a new data frame from enrollment_long
. Select the institution name, the year, the ethnicity, and percent. Save it in a new data frame called inst_ethn
.
We will use this below when we use pivot_wider()
to create tables for display purposes.
A bit of help
This is a fairly straight-forward request. We know the following:
- We need to create a new data frame (and we know its name).
- We know the source of the data.
- We know what columns are needed.
That’s basically it.
Solution
This is how we answered this question:
You should display the data to ensure that you know what you have.
4.12 Display data sorted by multiple values
Test your knowledge
Display the data for Mercer University in inst_ethn
sorted in descending order by year, and then descending by percent. Just display the year, ethnicity, and percent. Format the table nicely.
A bit of help
In reading the question, we picked up the following:
- We need to use
filter
to limit toMercer University
. - We know the source of the data is
inst_ethn
. - We need to use
arrange()
to sort byyear
andpercent
. Both should be in descending order (usingdesc()
). - We need to use
kbl()
and related operators.
Solution
This is how we answered this question:
Be sure that you can interpret the table correctly.
4.13 Pivot to a wider format for display
Test your knowledge
Using inst_ethn
as your data source, filter to include just data from 2018
. Then pivot wider on the ethnicity and percent. Sort the resulting rows of the table so that it is descending by the percent of Hispanic students. Limit the digits to 3 and format it well.
A bit of help
There is a lot going on in this question. Again, we need to break it down into pieces:
- “Using
inst_ethn
as your data source” -
The easiest piece — we are starting with
inst_ethn
. In the real world, this can be a very difficult piece of information to know and feel confident about. It’s also the one that others might dispute. - “filter to include just data from
2018
” -
Seems like we need to use
filter()
on theyear
column. - “Then pivot wider on the ethnicity and percent”
-
When you
pivot_wider()
, you need to knownames_from
andvalues_from
. It seems that we know exactly what is needed. What information will be across the top of the table (since it is going to be wide)? What information will be down the side? How many rows will there be?
Spend time on these questions before proceeding!! Struggling with these and thinking deeply about what pivot_wider()
actually does is important for your successful usage of this operator.
- “Sort the resulting rows of the table so that it is descending by the percent of Hispanic students”
-
Seems like we need to use
arrange()
on thehislat
column. Wait…it’s a column now?!? Yes, it used to be a value in a cell but we have pivoted to wider and it’s now a column. - “Limit the digits to 3 and format it well”
-
The usual suspects —
kbl()
and friends.
A bit of help
I hope you’re building this query iteratively!
Solution
This is how we answered this question:
What does this table show? And why might we want to do a query like this?
4.14 Print all values for a particular institution
Test your knowledge
While this query will be structured similarly to the previous one, the results are going to feel quite different. In this case, we are looking for information across many years for one institution; in the previous case we were looking for information for one year across all the institutions.
Use inst_ethn
as your data source. For just the instutition Virginia Military Institute
, pivot on ethnicity
and percent
. Format the resulting table with no more than 3 digits.
Solution
This is how we answered this question:
Why did we not ask you to display a table such as the last two across all the institutions and all years?
After you have finished going through this homework in this Web page, you should attempt to complete it within RStudio
. You need to develop some muscle memory and familiarity with working with this tool since it is how you will be completing work in your job.
This Web page provides a lot of useful support while going through this the first time. The RStudio
environment also provides support, and you need to get comfortable using its help and the tool itself.