Week 2: In-class Practice & Homework

This in-class exercise is intended to develop familiarity with the RStudio interface and some useful operators & functions in R/tidyverse.

1 Set up

You’ll work in groups. Choose one person to share screen, and walk through the instructions below. Everyone should try to work along if possible.

Find the folder week2-files.
Double-click on the project file (week2.Rproj).
Once everything loads, open the Files tab in the lower-right pane and click on collaborate.R to open the script in the editing window.
From the File menu, choose Save As... and save the file to YOURNAME-wk2.R (or whatever — just choose something different). This will ensure that you always have the collaborate.R script to go back to in case something goes wrong.
You will now work in this new script for the rest of class and for your homework (by going through the tasks on this page).
Check to make sure that the file southern_conf_data.csv is in the data folder.
Run the code through line 16 in the script. You should be able to describe (and execute) at least two different ways of quickly doing this. Share among your group the ways that you can name.
It should create a new data frame in the Environment (upper right tab) called enrollment.
Explore the data. It should have 70 rows and 21 columns.

If you are working in this web page, then read in the data and show its structure by running the following code:

2 The data

This data is an extract from IPEDS data (Integrated Postsecondary Education Data System). We downloaded the 12-Month Enrollment (EFFY) survey results from 2017-23 and the Institutional Characteristics (HD) directory information from 2022.

2.1 The institutions

We filtered this down to schools that are in the Southern Conference:

UNCG 199148
Wofford 218973
East TN State 220075
Chattanooga 221740
Furman 218070
Mercer 140447
Samford 102049
Western Carolina 200004
VMI 234085
The Citadel 217864

The number after each school in the list above is its IPEDS Unit ID, which identifies that school throughout the IPEDS database.

You can look up any institution’s Unit ID on this page.

2.2 Other limitations

We downloaded all of the enrollment data for 2017-23. We then filtered it to include only the data from those schools and their undergraduate programs.

2.3 Specific pieces of data collected

We further limited the columns of data to only the following:

UNITID: the IPEDS unit identifier
year: the year for which the data is captured
inst_name: the name of the institution
city: the city location of the institution
county: the county location of the institution
state: the state location of the institution
fips: the FIPS code representing the location of the institution
size_set: the code for the institution's Carnegie classification, taken from the 2021 update of the classification framework developed by the Carnegie Commission on Higher Education in the early 1970s to support its research program
size_set_desc: the description of the size_set column code
land_grant_desc: answers the question “Is this a land grant institution?”
mult_camp: specifies whether or not the unit is part of a multi-institution or multi-campus organization
student_status: specifies the code representing the level and degree/certificate-seeking status of the students
student_status_desc: the description of the student_status code
ug_grad_desc: answers the question “are the students Graduate or Undergraduate”. All of the data in this data set represent undergraduates.
grand_total: the number of students represented in this row
aian_pc: the percent of students who are American Indian or Alaska Native
asia_pc: the percent of students who are Asian
baa_pc: the percent of students who are Black or African American
hislat_pc: the percent of students who are Hispanic or Latino
nhopi_pc: the percent of students who are Native Hawaiian or Other Pacific Islander
white_pc: the percent of students who are White

3 The situation

The enrollment data frame is a fairly typical wide data table. We will want to get this into a long data format so that we can then complete some analytical tasks on it. These tasks are related to what an IR analyst would do in order to learn about the ethnic composition of a set of schools.

4 Tasks to complete

The rest of this document describes a series of tasks, all of which you have to complete in order. (Later tasks sometimes rely on the computational effects of previous tasks.) The hints for many questions contain valuable teaching/learning points! Be sure to at least scan through them all.

4.1 Pivot `enrollment` to a long format

Test your knowledge

You need to pivot enrollment to a long format. This is a multi-step process, so we’re going to take you through it step-by-step.

As you go through these steps, iteratively build your query and execute it every step of the way.

Gather your information:
1. Determine which columns are going to be pivoted. How are you going to describe these columns in the cols argument?
2. What name do you want to use for the column of names?
3. What name do you want to use for the column of values?
Before you ever save a pivoted table, you should first execute the pivot command and examine (and refine) it.
Write the pivot_longer command.
Once you are satisfied with it, assign the results to the enrollment_long data frame.

A bit of help

The easiest way to describe the columns that are to be pivoted is to collect them in a vector, as follows:

cols = c("aian_pc", "asia_pc", "baa_pc", 
         "hislat_pc", "nhopi_pc", 
         "white_pc")

If you want the values to appear in a certain order, then this is actually a convenient way to specify the set of columns.

However, R/tidyverse has a whole set of ways to select variables that match a pattern (as documented on this page) that are flexible, can ensure that your list stays up-to-date, and require much less typing (especially if the list is long, as these lists can be):

starts_with(match, ignore.case = TRUE): the column name starts with match. For example, if you had a set of columns (name, Q1, Q2, Q3), then starts_with("q") would match (Q1, Q2, Q3), since they each start with q and ignore.case has a default value of TRUE.
ends_with(match, ignore.case = TRUE): the column name ends with match. For example, if you had a set of columns (first_name, last_name, TEAM_NAME, ssn), then ends_with("name", FALSE) would match only (first_name, last_name), since they each end with lowercase name and ignore.case has been set to FALSE. TEAM_NAME is excluded because its case differs.
contains(match, ignore.case = TRUE): the column name contains match. For example, if you had a set of columns (2025Q3wk13, 2025Q4wk01, 2025Q4wk02), then contains("Q3") would match only the first column.
num_range(prefix, range, suffix = "", width = NULL): matches column names containing a numerical range like q1, q2, q3 or q001-s3, q002-s3, q003-s3..., q999-s3. To match the first set, you might use num_range("q", 1:3). To match the second set, you might use num_range("q", 1:999, suffix = "-s3", width = 3).

Note for the values of match above, you can also supply a vector of values and the function will return columns that match any one of the values.

Finally, another more advanced option named matches() uses regular expressions to match column names.
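The selection helpers described above can be tried out on a small table. The following is only a sketch: the tibble toy and all of its column names are invented for illustration and are not part of the enrollment data.

```r
library(dplyr)

# Invented column names, used only to exercise the selection helpers
toy <- tibble(
  id = 1,
  Q1 = 1, Q2 = 2, Q3 = 3,
  first_name = "x", last_name = "y", TEAM_NAME = "z",
  wk1 = 10, wk2 = 20
)

# ignore.case defaults to TRUE, so "q" matches Q1, Q2, Q3
q_cols <- toy |> select(starts_with("q")) |> names()

# With ignore.case = FALSE, TEAM_NAME is no longer matched
name_cols <- toy |> select(ends_with("name", ignore.case = FALSE)) |> names()

# A vector of matches selects columns that match any one of the values
any_cols <- toy |> select(starts_with(c("q", "first"))) |> names()

# num_range() builds the names wk1 and wk2 from the prefix and the range
wk_cols <- toy |> select(num_range("wk", 1:2)) |> names()

# matches() takes a regular expression (also case-insensitive by default)
regex_cols <- toy |> select(matches("^q[0-9]$")) |> names()

q_cols
name_cols
```

Running each select() on its own and looking at the returned column names is a quick way to check a helper before using it in the cols argument.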

We chose to use ends_with("_pc") as our value for the cols argument.

A bit of help

As a reminder, while you are iteratively developing your query, you should be running it in the form:

enrollment |> 
  pivot_longer(...)

Once you have it right, then you should go ahead and run it and assign the output to a variable, such as:

enrollment_long <-
  enrollment |>
    pivot_longer(...)
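To make the pattern concrete, here is a minimal sketch on a tiny invented table. The data frame toy_enrollment and its numbers are made up. The column names ethnicity and percent follow the names used in the later tasks, and the names_pattern variant is just one possible way to drop the _pc suffix, not necessarily what the course script does.

```r
library(dplyr)
library(tidyr)

# Invented stand-in for the wide enrollment table
toy_enrollment <- tibble(
  UNITID = c(1L, 2L),
  year = c(2018L, 2018L),
  grand_total = c(1000, 2000),
  white_pc = c(60, 70),
  hislat_pc = c(10, 5)
)

# Step 1: run the pivot without saving it, and inspect the result
toy_enrollment |>
  pivot_longer(
    cols = ends_with("_pc"),
    names_to = "ethnicity",
    values_to = "percent"
  )

# Step 2: once it looks right, assign it
toy_long <-
  toy_enrollment |>
    pivot_longer(
      cols = ends_with("_pc"),
      names_to = "ethnicity",
      values_to = "percent"
    )

# Optional variant (an assumption): keep only the part before "_pc"
toy_long_clean <-
  toy_enrollment |>
    pivot_longer(
      cols = ends_with("_pc"),
      names_to = "ethnicity",
      names_pattern = "(.*)_pc",
      values_to = "percent"
    )
```

Each original row becomes one row per pivoted column, so the two-row toy table turns into four rows.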

Explore the data for a bit; if you’re doing this in this Web page, then use the code block below to run your exploratory code.

What do you want to know? How might you get that information? We’ll get you started with this select() query (that’s already in the code block), but you should definitely spend a couple of minutes trying out different queries to learn about this data frame.

Note that the kbl() and kable_minimal() operators are both provided by the kableExtra package. These are simply tools that you can use to improve the look and readability of your printed tables.

Try your queries with and without these operators (both in this Web page and in RStudio) in order to explore the effects of this code. It is sometimes helpful to use these operators and sometimes less so. Just use what you find most appropriate.

4.2 Unique values in `ethnicity` column

Test your knowledge

What are the unique values in the ethnicity column?

4.3 Unique values in `state` column

Test your knowledge

What are the unique values in the state column?

We had you answer the previous two questions because we want to use that information when we make these two columns into factors (since they are categorical data).

(Note: this discussion and the following single question are not strictly necessary for this analysis or for arriving at the right answer. However, it is informative and you might as well start getting comfortable with these concepts, even if you’re not quite ready to master them.)

Unfortunately, we cannot use the result of the queries to specify the information in the factor definition — it requires a vector while the previous two queries return data frames (with just one column, but it still isn’t appropriate for what we need to do). (See the lesson on data types for more information.)

Fortunately, but possibly not surprisingly, R/tidyverse provides a helper operator that can transform a single column that is part of a data frame into a free-standing vector; it is called pull() and we haven’t come across it yet.

The input to pull() must be a data frame and the output will be, as stated, a vector. The only argument that we need to give pull() is the name of the column that you want to transform into a vector.

In the following, we build on the answers to the previous two questions to define two vectors, state_info and ethnicity_info, that contain all of the states and all of the ethnicities, respectively, that we have in our data.

Note that in both of the following queries we use the trick of surrounding the query with a set of parentheses so that it will display the result of the assignments.
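A minimal sketch of this pattern on invented data follows. The tibble toy_long is made up, and the use of distinct() to get the unique values is an assumption about how the previous two questions were answered.

```r
library(dplyr)

# Invented stand-in for enrollment_long
toy_long <- tibble(
  state = c("NC", "SC", "NC", "TN"),
  ethnicity = c("white", "hislat", "white", "hislat")
)

# distinct() on its own returns a one-column data frame, not a vector
state_df <- toy_long |> distinct(state)

# pull() turns that column into a free-standing vector;
# the outer parentheses print the value that was just assigned
(state_info <- toy_long |> distinct(state) |> pull(state))

(ethnicity_info <- toy_long |> distinct(ethnicity) |> pull(ethnicity))
```

Without the outer parentheses, the assignments would still happen but nothing would be printed.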

So, from now on, we have these two vectors that we can use in commands where necessary.

4.4 Define factors for `ethnicity` and `state` columns

Test your knowledge

Both ethnicity and state contain categorical data. Write the mutate() statements that will make both of them factors.

Solution

This is how we answered this question:

Both statements within the mutate() operator have the same form:

NEW_COL_NAME = factor(
  OLD_COL_NAME,
  levels = VALUES,
  ordered = TRUE_OR_FALSE
)

It is most common for VALUES to simply be a vector of values.
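As a hedged illustration of the template above, the following sketch fills it in for a tiny invented table. The tibble toy_long and the contents of state_info and ethnicity_info are made up here, and overwriting the original columns (rather than creating new ones) is our assumption.

```r
library(dplyr)

# Invented stand-ins for enrollment_long and the two level vectors
toy_long <- tibble(
  state = c("NC", "SC", "NC"),
  ethnicity = c("white", "hislat", "baa")
)
state_info <- c("NC", "SC")
ethnicity_info <- c("white", "hislat", "baa")

# Both columns are categorical but have no natural order,
# so ordered = FALSE (which is also the default)
toy_factored <-
  toy_long |>
    mutate(
      state = factor(state, levels = state_info, ordered = FALSE),
      ethnicity = factor(ethnicity, levels = ethnicity_info, ordered = FALSE)
    )

levels(toy_factored$state)
levels(toy_factored$ethnicity)
```

The order of the values in the levels vector is the order in which the categories will appear in summaries and tables.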

If you’re looking for an in-depth discussion of factors after you look through our lesson on factors, you should read through this page in R for Data Science.

4.5 One command to get distribution of values for all columns

Test your knowledge

Write the one-line command that will return the distribution of values for all columns (okay, not the character-based ones).

4.6 Remove multiple columns from the data frame

Test your knowledge

This is a long and complicated analysis to undertake. It’s harder to do when columns that you do not need (at the moment, anyway) are cluttering up your mind and the screen. Let’s get rid of them.

Write a command that will remove the following columns from the enrollment_long data frame: "city", "county", "fips", "student_status", "student_status_desc", "ug_grad_desc", "mult_camp"

Let’s remind ourselves of the columns that we have remaining:
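As a sketch on invented data, dropping a set of columns and then listing the ones that remain might look like the following. The tibble toy_long is made up, and using select() with all_of() is only one of several ways to do this.

```r
library(dplyr)

# Invented stand-in with a few of the columns to be removed
toy_long <- tibble(
  UNITID = 1L,
  year = 2018L,
  city = "X",
  county = "Y",
  fips = 1L,
  percent = 50
)

# Column names to drop, given as a character vector
drop_cols <- c("city", "county", "fips")

# The minus sign removes the named columns; all_of() makes the
# use of an external character vector explicit
toy_trimmed <- toy_long |> select(-all_of(drop_cols))

# Column names that are left
names(toy_trimmed)
```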

4.7 Calculate the average value in a vector

Test your knowledge

Without using the tidyverse, calculate the average value in the grand_total column of the enrollment_long data frame.

If you use summary() — yes, it certainly looks exactly like the summary() command that we use on data frames! — you get a lot of good information:
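A small base R sketch on an invented data frame (the numbers are made up, not taken from the enrollment data):

```r
# Invented stand-in for enrollment_long
toy_long <- data.frame(grand_total = c(1000, 2000, 3000))

# The $ operator extracts the column as a plain numeric vector
avg <- mean(toy_long$grand_total)
avg

# summary() on a numeric vector reports the min, quartiles, median, mean, and max
s <- summary(toy_long$grand_total)
s
```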

4.8 Display all ethnicity values across all years

Test your knowledge

For the institution with UNITID == 218973, display the year, ethnicity, and percent across all years and ethnicities.

4.9 Calculate the number of students of a specific ethnicity

Test your knowledge

For the institution with UNITID == 218973, calculate the actual number of white students who attended the institution each year, sorted by year.
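One way to think about this calculation is sketched below on invented numbers. It assumes that percent is stored on a 0 to 100 scale (so it is divided by 100) and that the ethnicity value is "white"; check both against your own enrollment_long before relying on it. The column name n_students is hypothetical.

```r
library(dplyr)

# Invented rows; only the UNITID value comes from the task description
toy_long <- tibble(
  UNITID = c(218973L, 218973L, 218973L, 140447L),
  year = c(2019L, 2018L, 2018L, 2018L),
  ethnicity = c("white", "white", "hislat", "white"),
  grand_total = c(2000, 1000, 1000, 5000),
  percent = c(50, 60, 10, 40)
)

white_counts <-
  toy_long |>
    filter(UNITID == 218973, ethnicity == "white") |>
    # assumption: percent is on a 0-100 scale
    mutate(n_students = grand_total * percent / 100) |>
    select(year, n_students) |>
    arrange(year)

white_counts
```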

4.10 Calculate distribution of values across all institutions

Test your knowledge

Here we are going to do a little baseline setting for the institutions that we are looking at.

For just Hispanic and White students, grouping by year and ethnicity, calculate the minimum, average, and maximum percents among all these institutions each year for each ethnicity.

A bit of help

Here are the pieces of the query that I put together from my reading of the question:

“For just Hispanic and White students”: Either of the following two choices:

filter(ethnicity == "white" | ethnicity == "hislat")
filter(ethnicity %in% c("white", "hislat"))

Either one works.

“grouping by year and ethnicity”: For this, group_by(year, ethnicity).
“calculate the minimum, average, and maximum percents”: The summarize() command makes it easy to calculate the three values that we need:

summarize(Min = min(percent),
          Avg = mean(percent),
          Max = max(percent))

“among all these institutions each year for each ethnicity”: Nothing else is needed, because the grouping already takes care of this.
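Put together, the pieces above might be assembled as in the following sketch. The tibble toy_long and its numbers are invented, and the .groups = "drop" argument is our addition so that the result is returned ungrouped.

```r
library(dplyr)

# Invented stand-in for enrollment_long: two institutions, one year
toy_long <- tibble(
  year = rep(2018L, 6),
  ethnicity = c("white", "hislat", "baa", "white", "hislat", "baa"),
  percent = c(60, 10, 20, 70, 5, 15)
)

toy_summary <-
  toy_long |>
    filter(ethnicity %in% c("white", "hislat")) |>
    group_by(year, ethnicity) |>
    summarize(Min = min(percent),
              Avg = mean(percent),
              Max = max(percent),
              .groups = "drop")

toy_summary
```

The result has one row per combination of year and ethnicity, which is why no further step is required for "each year for each ethnicity".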

4.11 Create a new data frame

Test your knowledge

Create a new data frame from enrollment_long. Select the institution name, the year, the ethnicity, and percent. Save it in a new data frame called inst_ethn.

We will use this below when we use pivot_wider() to create tables for display purposes.

4.12 Display data sorted by multiple values

Test your knowledge

Display the data for Mercer University in inst_ethn sorted in descending order by year, and then descending by percent. Just display the year, ethnicity, and percent. Format the table nicely.
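Sorting by more than one column can be sketched as follows on invented data. The tibble inst_ethn_toy is made up, and the exact inst_name string "Mercer University" is an assumption about how the name is stored.

```r
library(dplyr)

# Invented stand-in for inst_ethn
inst_ethn_toy <- tibble(
  inst_name = c(rep("Mercer University", 4), "Other College"),
  year = c(2018L, 2018L, 2019L, 2019L, 2019L),
  ethnicity = c("white", "hislat", "white", "hislat", "white"),
  percent = c(60, 10, 58, 12, 70)
)

mercer <-
  inst_ethn_toy |>
    filter(inst_name == "Mercer University") |>
    select(year, ethnicity, percent) |>
    # sort by year first (newest first), then by percent within each year
    arrange(desc(year), desc(percent))

mercer
```

The second sort key only matters within rows that tie on the first one. To format the table, the result could be piped into kbl() and then kable_minimal() from kableExtra.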

4.13 Pivot to a wider format for display

Test your knowledge

Using inst_ethn as your data source, filter to include just data from 2018. Then pivot wider on the ethnicity and percent. Sort the resulting rows of the table so that it is descending by the percent of Hispanic students. Limit the digits to 3 and format it well.
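The shape of such a query is sketched below on invented data. The tibble inst_ethn_toy and its values are made up, and the ethnicity value "hislat" is assumed to be the label for Hispanic students.

```r
library(dplyr)
library(tidyr)

# Invented stand-in for inst_ethn
inst_ethn_toy <- tibble(
  inst_name = c("A", "A", "B", "B", "A"),
  year = c(2018L, 2018L, 2018L, 2018L, 2019L),
  ethnicity = c("white", "hislat", "white", "hislat", "white"),
  percent = c(60, 5, 70, 10, 61)
)

wide_2018 <-
  inst_ethn_toy |>
    filter(year == 2018) |>
    # each ethnicity value becomes its own column, filled from percent
    pivot_wider(names_from = ethnicity, values_from = percent) |>
    arrange(desc(hislat))

wide_2018
```

For display, the result could then be piped into kbl(digits = 3) and kable_minimal() from the kableExtra package.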

4.14 Print all values for a particular institution

Test your knowledge

While this query will be structured similarly to the previous one, the results are going to feel quite different. In this case, we are looking for information across many years for one institution; in the previous case we were looking for information for one year across all the institutions.

Use inst_ethn as your data source. For just the institution Virginia Military Institute, pivot on ethnicity and percent. Format the resulting table with no more than 3 digits.

After you have finished going through this homework in this Web page, you should attempt to complete it within RStudio. You need to develop some muscle memory and familiarity with working with this tool since it is how you will be completing work in your job.

This Web page provides a lot of useful support while going through this the first time. The RStudio environment also provides support, and you need to get comfortable using its help and the tool itself.
