Week 2: In-class Practice & Homework

This in-class exercise is intended to develop familiarity with the RStudio interface and some useful operators & functions in R/tidyverse.

1 Set up

You’ll work in groups. Choose one person to share screen, and walk through the instructions below. Everyone should try to work along if possible.

  1. Find the folder week2-files.
  2. Double-click on the project file (week2.Rproj).
  3. Once everything loads, use the File tab in the lower right to click on collaborate.R to bring the script to the editing window.
  4. From the File menu, choose Save As... and save the file to YOURNAME-wk2.R (or whatever — just choose something different). This will ensure that you always have the collaborate.R script to go back to in case something goes wrong.
  5. You will now work in this new script for the rest of class and for your homework (by going through the tasks on this page).
  6. Check to make sure that the file southern_conf_data.csv is in the data folder.
  7. Run the code through line 16 in the script. You should be able to describe (and execute) at least two different ways of quickly doing this. Share among your group the ways that you can name.
  8. It should create a new data frame in the Environment (upper right tab) called enrollment.
  9. Explore the data. It should have 70 rows and 21 columns.

If you are working in this web page, then read in the data and show its structure by running the following code:

2 The data

This data is an extract from IPEDS data (Integrated Postsecondary Education Data System). We downloaded the 12-Month Enrollment (EFFY) survey results from 2017-23 and the Institutional Characteristics (HD) directory information from 2022.

2.1 The institutions

We filtered this down to schools that are in the Southern Conference:

  • UNCG 199148
  • Wofford 218973
  • East TN State 220075
  • Chattanooga 221740
  • Furman 218070
  • Mercer 140447
  • Samford 102049
  • Western Carolina 200004
  • VMI 234085
  • The Citadel 217864

After each school in the list above, the number is the IPEDS Unit ID that represents that school throughout its database.

You can look up any institution’s Unit ID on this page.

2.2 Other limitations

We downloaded all of the enrollment data for 2017-23. We then filtered it to include only the data from those schools and their undergraduate programs.

2.3 Specific pieces of data collected

We further limited the columns of data to only the following:

  1. UNITID: the IPEDS unit identifier
  2. year: the year for which the data is captured
  3. inst_name: the name of the institution
  4. city: the city location of the institution
  5. county: the county location of the institution
  6. state: the state location of the institution
  7. fips: the FIPS code representing the location of the institution
  8. size_set: The Carnegie Basic Classification is the 2021 update of the traditional classification framework developed by the Carnegie Commission on Higher Education in the early 1970s to support its research program.
  9. size_set_desc: the description of the size_set column code
  10. land_grant_desc: answers the question “Is this a land grant institution?”
  11. mult_camp: specifies whether or not the unit is part of a multi-institution or multi-campus organization
  12. student_status: specifies the code representing the level and degree/certificate-seeking status of the students
  13. student_status_desc: the description of the student_status code
  14. ug_grad_desc: answers the question “are the students Graduate or Undergraduate”. All of the data in this data set represent undergraduates.
  15. grand_total: the number of students represented in this row
  16. aian_pc: the percent of students who are American Indian or Alaskan Native
  17. asia_pc: the percent of students who are Asian
  18. baa_pc: the percent of students who are Black or African American
  19. hislat_pc: the percent of students who are Hispanic or Latino
  20. nhopi_pc: the percent of students who are Native Hawaiian or Other Pacific Islander
  21. white_pc: the percent of students who are White

3 The situation

The enrollment data frame is a fairly typical wide data table. We will want to get this into a long data format so that we can then complete some analytical tasks on it. These tasks are related to what an IR analyst would do in order to learn about the ethnic composition of a set of schools.

4 Tasks to complete

The rest of this document describes a series of tasks, all of which you have to complete in order. (Later tasks sometimes rely on the computational effects of previous tasks.) The hints of many questions contain valuable teaching/learning points! Be sure to at least scan through them all.

4.1 Pivot enrollment to a long format

Test your knowledge

You need to pivot enrollment to a long format. This is a multi-step process, so we’re going to take you through it step-by-step.

As you go through these steps, iteratively build your query and execute it every step of the way.

  1. Gather your information:
    1. Determine which columns are going to be pivoted. How are you going to describe these columns in the cols argument?
    2. What name do you want to use for the column of names?
    3. What name do you want to use for the column of values?
  2. Before you ever save a pivoted table, you should first execute the pivot command and examine (and refine) it.
  3. Write the pivot_longer command.
  4. Once you are satisfied with it, assign the results to the enrollment_long data frame.

A bit of help

We are going to call the column of names ethnicity and the column of values percent. It will be easier for the rest of the assignment if you use these two names as well.

A bit of help

The easiest way to describe the columns that are to be pivoted is to collect them in a vector, as follows:

cols = c("aian_pc", "asia_pc", "baa_pc", 
         "hislat_pc", "nhopi_pc", 
         "white_pc")

If you want the values to appear in a certain order, then this is actually a convenient way to specify the set of columns.

However, R/tidyverse has a whole set of ways to select variables that match a pattern (as documented on this page) that are flexible, can ensure that your list stays up-to-date, and require much less typing (especially if the list is long, as these lists can be):

starts_with(match, ignore.case = TRUE)
the column name starts with match. For example, if you had a set of columns (name, Q1, Q2, Q3), then starts_with_("q") which match (Q1, Q2, Q3) since they each starts with q and ignore.case has a default value of TRUE.
ends_with(match, ignore.case = TRUE)
the column name ends with match. For example, if you had a set of columns (first_name, last_name, TEAM_NAME, ssn), then ends_with("name", FALSE) would match (first_name, last_name) since they each end with name and ignore.case has a value of TRUE.
contains(match, ignore.case = TRUE)
the column name contains match. For example, if you had a set of columns (2025Q3wk13, 2025Q4wk01, 2025Q4wk02), then contains("Q3") would match only the first column.
num_range(prefix, range, suffix = "", width = NULL)
the column matches column names containing a numerical range like q1, q2, q3 or q001-s3, q002-s3, q003-s3..., q999-s3. To match the first set, you might use num_range("q", 1:3). To match the second set, you might use num_range("q", 1:999, width=3).

Note for the values of match above, you can also supply a vector of values and the function will return columns that match any one of the values.

Finally, another more advanced option named matches() uses regular expressions to match column names.

We chose to use ends_with("_pc") as our value for the cols argument.

A bit of help

Note that each one of the columns that we want to use as a name starts with the identifier of the ethnicity (e.g., aian) and then concludes with _pc.

We would prefer that the _pc not be part of the name. The R/tidyverse provides an option (in the pivot_longer function) that enables you to specify this — it’s called names_pattern. We can specify names_pattern = "(.*)_pc". This matches any string of characters up to, but not including, the _pc at the end; this matching set would then be used as the name instead of the full name.

So, for example, this pattern would match aian_pc but the name would be aian in the pivoted data frame.

A bit of help

As a reminder, while you are iteratively developing your query, you should be running it in the form:

enrollment |> 
  pivot_longer(...)

Once you have it right, then you should go ahead and run it and assign the output to a variable, such as:

enrollment_long <-
  enrollment |>
    pivot_longer(...)

Solution

This is how we answered this question. (Be sure to run this; the rest of the homework depends on this data being loaded.)

Note that this isn’t the answer; this is one answer. Just because you got something that differs from this, it doesn’t mean that you’re wrong. Just be sure that you have maintained the first 15 columns from enrollment and have added 2 new columns (ethnicity and percent).

Explore the data for a bit; if you’re doing this in this Web page, then use the code block below to run your exploratory code.

What do you want to know? How might you get that information? We’ll get you started with this select() query (that’s already in the code block), but you should definitely spend a couple of minutes trying out different queries to learn about this data frame.

Note that the kbl() and kable_minimal() operators are both provided by the kableExtra package. These are simply tools that you can use to improve the look and readability of your printed tables.

Try your queries with and without these operators (both in this Web page and in RStudio) in order to explore the effects of this code. It is sometimes helpful to use these operators and sometimes less so. Just use what you find most appropriate.

4.2 Unique values in ethnicity column

Test your knowledge

What are the unique values in the ethnicity column?

A bit of help

This can be answered in a one-line R/tidyverse query.

A bit of help

The operator distinct() is probably what you’re looking for.

Solution

This is how we answered this question:

What is the data type of the output of the above command?

Answer: It is a one-column data frame. This will come up again (very soon).

4.3 Unique values in state column

Test your knowledge

What are the unique values in the state column?

A bit of help

This one should be pretty straight-forward after you answered the previous question.

Solution

This is how we answered this question:

We had you answer the previous two questions because we want to use that information when we make these two columns into factors (since they are categorical data).

(Note: this discussion and the following single question are not strictly necessary for this analysis or for the arriving at the right answer. However, it is informative and you might as well start getting comfortable with these concepts, even if you’re not quite ready to master them.)

Unfortunately, we cannot use the result of the queries to specify the information in the factor definition — it requires a vector while the previous two queries return data frames (with just one column, but it still isn’t appropriate for what we need to do). (See the lesson on data types for more information.)

Fortunately, but possibly not surprisingly, R/tidyverse provides a helper operator that can transform a single column that is part of a data frame into a free-standing vector; it is called pull() and we haven’t come across it yet.

The input to pull() must be a data frame and the output will be, as stated, a vector. The only argument to pull() is the name of the column that you want to transform into a vector.

In the following, we take the answers from the previous two questions and turn them into part of the methods that we use to define two vectors (state_info and ethnicity_info) that will contain the vector of all, respectively, states and ethnicities, that we have in our data.

Note that in both of the following queries we use the trick of surrounding the query with a set of parentheses so that it will display the result of the assignments.

So, from now on, we have these two vectors that we can use in commands where necessary.

4.4 Define factors for ethnicity and state columns

Test your knowledge

Both ethnicity and state contain categorical data. Write the mutate() statements that will make the both of them factors.

A bit of help

If you have not yet done so, you should go through the short factor lesson to learn about how to define a factor for a column that contains categorical data.

A bit of help

You need to make two decisions about a factor:

  1. What are the set of allowable values (or levels in R/tidyverse lingo).
  2. Can the values be considered to be ordered? That is, does it make sense to be able to say (about two possible values, say x and y) that x < y or x > y? If so, then it is an ordered factor.

Solution

This is how we answered this question:

Both statements within the mutate() operator have the same form:

NEW_COL_NAME = factor(
  OLD_COL_NAME,
  levels = VALUES,
  ordered = TRUE_OR_FALSE
)

It is most common for VALUES to simply be a vector of values.

If you’re looking for an in-depth discussion of factors after you look through our lesson on factors, you should read through this page in R for Data Science.

4.5 One command to get distribution of values for all columns

Test your knowledge

Write the one-line command that will return the distribution of values for all (okay, not the character-based ones) columns.

A bit of help

You have several choices when thinking about getting information about the columns in a data table: str(), names(), summary(), View(), head(), spec(), and glimpse(), among others.

Solution

This is how we answered this question:

Note that this does not provide any information for character (string) columns other than the length of the row. Thus, if you defined the factors above, both the state and ethnicity rows have useful information displayed.

4.6 Remove multiple columns from the data frame

Test your knowledge

This is a long and complicated analysis to undertake. It’s harder to do when columns that you do not need (at the moment, anyway) are cluttering up your mind and the screen. Let’s get rid of them.

Write a command that will remove the following columns from the enrollment_long data frame: "city", "county", "fips", "student_status", "student_status_desc", "ug_grad_desc", "mult_camp"

A bit of help

The select() command is useful for this question.

A bit of help

Also, the - (minus) operator can make this command both easier to write and easier to understand.

Solution

This is how we answered this question:

Notice that we used the - (minus) operator to tell R/tidyverse to remove these columns.

Also, if we had not included the assignment (in the first line), then this column removal would not have had any effect beyond this one command.

Let’s remind ourselves of the columns that we have remaining:

4.7 Calculate the average value in a vector

Test your knowledge

Without using the tidyverse, calculate the average value in the grand_total column of the enrollment_long data frame.

Solution

This is how we answered this question:

Other functions that you can use are the following:

Single values
sum(), median(), min(), max(), sd(), var()
Multiple values
 
  • quantile(): Calculate the quantiles
  • range(): Finds the minimum and maximum values
  • IQR(): calculates the interquartile range; that is Q3 - Q1
  • summary(): provides the minimum, 1Q, 2Q, average, 3Q, and maximum.

If you use summary() — yes, it certainly looks exactly like the summary() command that we use on data frames! — you get a lot of good information:

4.8 Display all ethnicity values across all years

Test your knowledge

For the institution with UNITID == 218973, display the year, ethnicity, and percent across all years and ethnicities.

A bit of help

To start working on this, you should start by answering these questions:

  1. Where is the data coming from?
  2. What column(s) are you going to filter on?
  3. What columns are you going to display?
  4. Do you need to calculate anything?

As always, build your response iteratively.

A bit of help

Here are our answers to those questions:

  1. Where is the data coming from? — enrollment_long
  2. What column(s) are you going to filter on? — UNITID
  3. What columns are you going to display? — year, ethnicity, percent
  4. Do you need to calculate anything? — no

Solution

This is how we answered this question:

4.9 Calculate the number of students of a specific ethnicity

Test your knowledge

For the institution with UNITID == 218973, calculate the actual number of white students who attended the institution each year, sorted by year.

A bit of help

Again, to start working on this, you should start by answering these questions:

  1. Where is the data coming from?
  2. What column(s) are you going to filter on?
  3. What columns are you going to display?
  4. Do you need to calculate anything?

The answers to these questions should help you build your query. Before proceeding, build your query and step-by-step integrate your answers to these questions.

A bit of help

How are you going to get the actual number of students of a specific ethnicity? What information do you have, and how might you use this to arrive at an answer?

A bit of help

Here are our answers to the questions:

  1. Where is the data coming from? — enrollment_long
  2. What column(s) are you going to filter on? — UNITID and ethnicity
  3. What columns are you going to display? — year, percent, and grand_total.
  4. Do you need to calculate anything? — yes, we need to calculate num_white. We get this by multiplying percent * grand_total.

Solution

This is how we answered this question:

Certainly, the last two lines of the command are not necessary; however, we like how the results are displayed here.

Also, the arrange() command is not strictly necessary, but we wanted to ensure that the rows were displayed in this order.

4.10 Calculate distribution of values across all institutions

Test your knowledge

Here we are going to do a little baseline setting for the institutions that we are looking at.

For just Hispanic and White students, grouping by year and ethnicity, calculate the minimum, average, and maximum percents among all these institutions each year for each ethnicity.

A bit of help

Here is how we parsed this question when reading it:

“For just Hispanic and White students”
A filter() is going to be needed.
“grouping by year and ethnicity”
A group_by() as well!
“calculate the minimum, average, and maximum percents”
Probably will need a summarize() with three different values.
“among all these institutions each year for each ethnicity”
I don’t see any need to filter out anything here — it’s all the institutions, all years, and all ethnicities.

I didn’t see any specification for any other information, so I don’t see the need for a select() right here.

What query should you build based on the above?

A bit of help

Here are the pieces of query that I put together from my reading of the query:

“For just Hispanic and White students”
Either of the following two choices:
  • filter(ethnicity == "white" | ethnicity == "hislat")
  • filter(ethnicity %in% c("white", "hislat"))

Either one works.

“grouping by year and ethnicity”
For this, group_by(year, ethnicity).
“calculate the minimum, average, and maximum percents”
The summarize() command makes it easy to calculate the three values that we need:
summarize(Min = min(percent),
          Avg = mean(percent),
          Max = max(percent))
“among all these institutions each year for each ethnicity”
Nothing else.

A bit of help

After you have put together the basics of the query that displays the information that you need, how might you improve its display so that your target audience can most easily interpret it?

Solution

This is how we answered this question:

As always, the last two lines are optional and personal preference. Other choices for the last line include:

  • kable_styling()
  • kable_minimal()
  • kable_classic()
  • kable_classic_2()
  • kable_material()
  • kable_material_dark()

Be sure that you can provide a description of how each row of this table should be interpreted.

4.11 Create a new data frame

Test your knowledge

Create a new data frame from enrollment_long. Select the institution name, the year, the ethnicity, and percent. Save it in a new data frame called inst_ethn.

We will use this below when we use pivot_wider() to create tables for display purposes.

A bit of help

This is a fairly straight-forward request. We know the following:

  • We need to create a new data frame (and we know its name).
  • We know the source of the data.
  • We know what columns are needed.

That’s basically it.

Solution

This is how we answered this question:

You should display the data to ensure that you know what you have.

4.12 Display data sorted by multiple values

Test your knowledge

Display the data for Mercer University in inst_ethn sorted in descending order by year, and then descending by percent. Just display the year, ethnicity, and percent. Format the table nicely.

A bit of help

In reading the question, we picked up the following:

  1. We need to use filter to limit to Mercer University.
  2. We know the source of the data is inst_ethn.
  3. We need to use arrange() to sort by year and percent. Both should be in descending order (using desc()).
  4. We need to use kbl() and related operators.

Solution

This is how we answered this question:

Be sure that you can interpret the table correctly.

4.13 Pivot to a wider format for display

Test your knowledge

Using inst_ethn as your data source, filter to include just data from 2018. Then pivot wider on the ethnicity and percent. Sort the resulting rows of the table so that it is descending by the percent of Hispanic students. Limit the digits to 3 and format it well.

A bit of help

There is a lot going on in this question. Again, we need to break it down into pieces:

“Using inst_ethn as your data source”
The easiest piece — we are starting with inst_ethn. In the real world, this can be a very difficult piece of information to know and feel confident about. It’s also the one that others might dispute.
“filter to include just data from 2018
Seems like we need to use filter() on the year column.
“Then pivot wider on the ethnicity and percent”
When you pivot_wider(), you need to know names_from and values_from. It seems that we know exactly what is needed. What information will be across the top of the table (since it is going to be wide)? What information will be down the side? How many rows will there be?

Spend time on these questions before proceeding!! Struggling with these and thinking deeply about what pivot_wider() actually does is important for your successful usage of this operator.

“Sort the resulting rows of the table so that it is descending by the percent of Hispanic students”
Seems like we need to use arrange() on the hislat column. Wait…it’s a column now?!? Yes, it used to be a value in a cell but we have pivoted to wider and it’s now a column.
“Limit the digits to 3 and format it well”
The usual suspects — kbl() and friends.

A bit of help

I hope you’re building this query iteratively!

Solution

This is how we answered this question:

What does this table show? And why might we want to do a query like this?