Practice Problems for Data Types – R for the IR Professional

1 Introduction

1.1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

Edit the code that is shown in the box. Click on the Run Code button.
Make further edits and re-run that code. You can do this as often as you’d like.
Click the Start Over button to bring back the original code if you’d like.
If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

1.2 Using RStudio

If you’re following along with this exercise in RStudio, then you need to execute the following code in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading these libraries for you.

library(tidyverse)
library(tidylog)

The first line loads the tidyverse package. The second line loads the tidylog package, which tells R to give more detailed messages.

2 Data frames

2.1 The basics of data frames

A data frame is like a table in a database or a spreadsheet. It has rows and named columns, like the one suggested below. When we have individual cases (like students, enrollments, etc) in rows and attributes (like ID, Class, InState, Enrolled) in columns that describe the cases, we call that tidy data. This is the most common format we work with.

ID	Class	InState	Enrolled
1	Fr	TRUE	2021-01-01
2	Fr	FALSE	2021-01-02
3	So	TRUE	2021-01-03

We can create a blank data frame with:
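A minimal sketch of such a statement, assuming the name df1 used in the discussion that follows:

```r
# Create an empty data frame (no rows, no columns) and store it as df1.
df1 = data.frame()
```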

Note that nothing does (or should) print out after the above statement. You could add a line containing df1 if you want to attempt to print out the df1 object. (We say “attempt” because, since df1 is empty, R prints only a short message saying that the data frame has 0 columns and 0 rows.)

A variation of the data frame is the tibble, which comes from the tidyverse package. For our purposes, the main distinction is that data.frame() will rename columns if they have spaces or special characters, while tibble() will not. We will see examples of this below.
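A minimal sketch of the tibble equivalent, assuming the tibble package (part of the tidyverse) is installed and that the object is named tb1, as in the next paragraph. The names_df and names_tb objects are hypothetical additions that illustrate the column-naming difference described above.

```r
library(tibble)

# Create an empty tibble and store it as tb1.
tb1 = tibble()

# Hypothetical illustration of the renaming difference:
# data.frame() rewrites a column name that contains a space,
# while tibble() keeps the name exactly as written.
names_df = names(data.frame(`first name` = "Alice"))
names_tb = names(tibble(`first name` = "Alice"))
```

Here names_df holds first.name, while names_tb holds first name unchanged.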

Again, no results are printed. That does not mean that nothing happened! It just means nothing is printed. In this case what happens is that an empty tibble named tb1 is created.

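Many tidyverse functions return tibbles, and functions like select() accept either kind of table and hand back the same kind they were given: a data frame stays a data frame, and a tibble stays a tibble. (It has been suggested that “tibble” is Australian for “table”.)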

Tip

Working in RStudio

Run that line using control-enter or the Run button. You’ll see a new entry in the Environment tab in the top right pane. It’s a data frame with 0 rows and 0 columns, like an empty spreadsheet.

2.2 Set up: Create a small data frame

Most of the time, when creating a data frame, we load data from a file or database. But for a simple example, we can create one like this.

df = data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18,18,19,20,21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

Let’s take a quick look at this data frame:

You can see that it has six columns and five rows, all corresponding to the above data.frame() function. Make sure that you understand how the display of the table and the function call above are related.

3 Using the structure function (`str()`)

To see the structure of a data frame, use the str() function. For more info, see this page on rforir. For example, run this line to see the structure of the df data frame that we created above:
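A minimal sketch of that call. The definition of df is repeated from Section 2.2 so that the block runs on its own.

```r
# Same data frame as in Section 2.2.
df = data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18, 18, 19, 20, 21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# Print one line per column: its name, its type, and its first few values.
str(df)
```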

If you are doing this in RStudio, you can see that this is the same information that is displayed in the Environment window.

Much of the rest of this lesson concerns itself with interpreting this output.

Tip

Using the Console in RStudio

If you type the name of a data object in the Console and press Enter, it will list part or all of it, so you can see what’s inside. This is similar to what appears in the Data window. Try it with the df object.

4 Vectors

In order to better understand the information printed out by the str() function, we need to learn about vectors, lists, and data types. In this section, we introduce you to the vector.

A vector (or array) is a collection of data that is all the same type. (It can be useful to think of a vector as a column of data. You will see this throughout your work with R.) Just like in Excel we can have columns of data, each of which has a different type. Here are some common ones:

character: text
numeric: numbers
integer: whole numbers
logical: TRUE or FALSE
factor: a category or label
date-related information
- date: the specification of a date without any reference to time
- date-time (or dttm): a specified time on a specified date

For more info on vectors, see this page on rforir. For more info on data types, see this page on rforir.

Much of the time R will figure out what data type you want.

Test your knowledge

Let’s create three vectors:
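A minimal sketch of three such vectors. The names my_numbers and my_logic are used later in this lesson; the name my_text and all of the values are hypothetical, chosen only so that each vector has five elements.

```r
# Hypothetical values; each vector has five elements.
my_text    = c("apple", "banana", "cherry", "date", "elderberry")
my_numbers = c(1.5, 2, 3.25, 4, 5)
my_logic   = c(TRUE, FALSE, TRUE, TRUE, FALSE)
```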

Again, nothing is displayed after you run the code above. If you would like, you can add code to the above code block to display the data. How do you do that?

Tip

Using RStudio

Run each of those lines and look in the Environment for them.

Use the str() function to see the data types of each of these vectors. Here’s the first one.
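A sketch of that call, assuming the first vector is a character vector named my_text (the name and the values are hypothetical):

```r
# Hypothetical character vector with five elements.
my_text = c("apple", "banana", "cherry", "date", "elderberry")

# Show the type, the length, and the first few values of the vector.
str(my_text)
```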

What do you expect to see for my_numbers and my_logic?

Using the code box above, verify that you’re right.

5 Identify types

Notice that the three vectors we made all have the same length. That means we can put them all into a data frame together.

Test your knowledge

Let’s run a command to create a data frame that uses the vectors we created above:
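A minimal sketch of that command. The column names text, numbers, and logic come from the questions that follow; the name my_text and the vector values are hypothetical and are repeated here so that the block runs on its own.

```r
# Hypothetical vectors, all of length five.
my_text    = c("apple", "banana", "cherry", "date", "elderberry")
my_numbers = c(1.5, 2, 3.25, 4, 5)
my_logic   = c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Combine the three vectors into one data frame with three columns.
df2 = data.frame(
  text    = my_text,
  numbers = my_numbers,
  logic   = my_logic
)
```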

Note that the data.frame() function that we call here is the same function that we used in Section 2 to create an empty data frame — the only difference here is that we define the initial contents of the data frame.

After running the above code, add df2 to the end so that you can see the data frame better.

Now that you can see the data frame, what are text, numbers, and logic (on the left side of the equal signs above)?

Test your knowledge

Now run str(df2) at the end of the following code block.

When might you run str() and when might you simply display the data frame (e.g., just run df2)?

6 Missing data

Missing data is common in real world cases. In R, missing data is represented by NA.
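A minimal sketch, using the name missing_data and the values 1, 2, NA, 4, and 5 that the discussion below refers to:

```r
# A numeric vector whose third value is missing.
missing_data = c(1, 2, NA, 4, 5)

# Show the structure of the vector.
str(missing_data)
```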

We note the following in the results of the above:

missing_data is a vector. How do we know? Because the structure of the output line is how R prints out vectors. (We know this isn’t optimal, but it’s something that you will learn to get used to.)
num: missing_data is a vector of numbers.
[1:5]: missing_data has 5 elements.
1 2 NA 4 5: the elements of the vector in order are 1, 2, NA, 4, and 5. Note that while NA is not strictly of type num, when NA is used in a vector, its type is coerced to match the type of the vector. That is, it is treated as if it were the type of the other values.

Tip

Using RStudio

After running that line, if you are in RStudio, then look at the Data browser in the Environment window. You’ll see that the third value is NA, meaning that it’s a blank.

Test your knowledge

Now let’s change the vector slightly, as follows:
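A sketch of one such change. The original values are not shown here, so this block assumes that the missing value is written as the quoted text "NA" instead of the bare NA:

```r
# Hypothetical variation: the third value is the string "NA", not a true NA.
missing_data2 = c(1, 2, "NA", 4, 5)

# Show the structure of the changed vector.
str(missing_data2)
```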

How do the structures of missing_data (from the previous code block) and missing_data2 (from this code block) differ? Why does this happen? Why do we prefer to use NA rather than "", "NA", --, or similar values?

For more info on missing data, see this page on rforir.

Test your knowledge

Create a character (text) vector that has at least one missing value by defining it here:

Now, go back to the code box and

Print the vector, and on the next line
Display the structure of the vector.

How do you know that it is a character vector?

7 Lists

The final data type we’ll mention here is a list, which can be an ordered collection of different types of data. Here’s an example in which we create a list using the list() function:
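A minimal sketch, assuming a three-element list named val_a (the name is used later in this lesson; the values are hypothetical):

```r
# A list may mix types: two character strings and one number.
val_a = list("milk", "9am", 2)

# Show the structure of the list.
str(val_a)
```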

While a list is similar to a vector, the printed representation is very different:

It says, right on the first line, both that it is a list and how many elements it has.
Each element is printed on a separate line.
The data type for each element is printed on each line. This is necessary since each element can be of a different type (unlike a vector in which each element has to have the same type).

Test your knowledge

Now, consider these slightly-revised versions of the list elements and different functions (list() and c()) used to create the variables:
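A sketch of such a comparison. The variable names list_chr, list_mix, vec_chr, and vec_mix are hypothetical; the arguments are the two sets named in the question below.

```r
# list() applied to the two sets of arguments.
list_chr = list("milk", "9am", "2")
list_mix = list("milk", "9am", 2)

# c() applied to the same two sets of arguments.
vec_chr = c("milk", "9am", "2")
vec_mix = c("milk", "9am", 2)

# Compare the structures of all four objects.
str(list_chr)
str(list_mix)
str(vec_chr)
str(vec_mix)
```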

What can you say about list() and c(), given the results when they are applied to the two sets of arguments ("milk", "9am", and "2" versus "milk", "9am", and 2)?

Test your knowledge

Here is a slightly more complex list:

What data type is val_e — a list or a vector? A list (or vector) of what?

If it is a list, could it be a vector? If it is a vector, could it be a list?

For more info on lists and vectors, see this page on rforir.

Test your knowledge

Create your own list using different data types.

8 Accessing elements of a list

8.1 A simple list

Let’s take a look for a moment at the following results when you run this code (referencing the val_a list that we created above):

Here we have printed the value of val_a and not its structure. This leads R to print it in a new, probably unfamiliar, way.

[[X]] indicates that the value of the Xth element follows on the next line.
The value is printed on each of the lines before the next blank line; in this case, since each list element is only a single value, the information is displayed on just one line.

The [[X]] provides a big hint about how you can access the Xth value of a list. Consider the following:
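A sketch of such accesses, assuming the hypothetical definition val_a = list("milk", "9am", 2). The try() wrapper is an addition so that the block keeps running after the error.

```r
# Hypothetical three-element list.
val_a = list("milk", "9am", 2)

# Access single elements by position with double brackets.
print(val_a[[1]])
print(val_a[[3]])

# The list has only three elements, so asking for a fourth raises
# a "subscript out of bounds" error; try() reports it and moves on.
fourth = try(val_a[[4]])
```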

This prints out the first and third elements. Then it attempts to print out the fourth element but R returns an error saying the subscript (i.e., 4) is out of bounds (i.e., not valid).

8.2 A list with named elements

Let’s take a look for a moment at the following results when you run this code. You might want to look at the original definition of val_e above before you run this code:

The displayed results here differ dramatically from the results shown after val_a above. Instead of showing [[X]] before each value, R displays $ITEM_NAME. The only possibly confusing information is for $mixed2. Here are a few observations:

The $mixed2 value printed by itself declares that val_e has a list element with this name.
What follows is five pairs of lines, one pair for each element of this list.
The elements within $mixed2 are not named, so the values are displayed using the [[X]] formatting.

Now let’s go through a few examples of how we can access named elements of a list.
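A sketch of such examples. The element names shopping, primes, and mixed2 come from this lesson; the names times and mixed1, the position of primes, and all of the values are hypothetical stand-ins for the original definition of val_e.

```r
# Hypothetical stand-in for val_e: a list with five named elements.
val_e = list(
  shopping = c("milk", "eggs", "bread"),
  times    = c("9am", "5pm"),
  primes   = c(2, 3, 5, 7, 11),
  mixed1   = c("milk", "9am", 2),
  mixed2   = list("milk", "9am", 2, TRUE, NA)
)

# Each pair accesses the same element, first by position and then by name.
print(val_e[[1]])
print(val_e$shopping)

print(val_e[[3]])
print(val_e$primes)

print(val_e[[5]])
print(val_e$mixed2)
```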

You should note from the results of this code that each of the pairs (1 & 2, 3 & 4, 5 & 6) return the exact same information. To our eyes, the second of each pair — the ones using the names of the list elements — is easier to use and interpret. However, it doesn’t always make sense to name list elements. Thus, we find that we use names when we have named the list items but otherwise we happily use the [[X]] method.

8.3 Elements of a list within a list

Let’s take a closer look at the fifth element of val_e (which is named mixed2). You can see that this is a list containing five unnamed elements. The following shows two different ways that we can access the third element of that list:

As in previous examples, the results of these two statements display equivalent values. Read each of these from right-to-left:

val_e[[5]][[3]]: return the third element of the fifth element of val_e.
val_e$mixed2[[3]]: return the third element of the element named mixed2 in val_e.
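The two forms above can be sketched as follows. Only the fifth element, mixed2, is defined in detail here; the other element names and all of the values are hypothetical.

```r
# Hypothetical stand-in for val_e; mixed2 is the fifth element and is
# itself a list of five unnamed elements.
val_e = list(
  shopping = c("milk", "eggs", "bread"),
  times    = c("9am", "5pm"),
  primes   = c(2, 3, 5, 7, 11),
  mixed1   = c("milk", "9am", 2),
  mixed2   = list("milk", "9am", 2, TRUE, NA)
)

# Third element of the fifth element, by position only.
by_position = val_e[[5]][[3]]

# Third element of the element named mixed2.
by_name = val_e$mixed2[[3]]

print(by_position)
print(by_name)
```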

Test your knowledge

Access the following items in val_a and val_e:

For val_a:
- the first value
- the third value
- add 7 to the third value
For val_e:
- the element named shopping
- the second element
- the third element of primes
Add the fourth element of primes (of val_e) to the third value of val_a

9 Data frames are lists

A data frame is a special kind of list where each element is a vector of the same length. This is an important technicality, because it explains why a data frame with one column is a different kind of object than a vector.

Consider the following:
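A sketch of two such commands, assuming that dplyr is installed and that df_1v2 is created by extracting the column with $ (the original definition is not shown, so this is an assumption, as are the values in df2):

```r
library(dplyr)

# Hypothetical stand-in for df2.
df2 = data.frame(
  text    = c("apple", "banana", "cherry", "date", "elderberry"),
  numbers = c(1.5, 2, 3.25, 4, 5),
  logic   = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Keep only the text column; the result is still a data frame.
df_1v1 = df2 %>% select(text)

# Extract the text column itself; the result is a plain vector.
df_1v2 = df2$text
```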

When you run the above, R does not print out any values. It simply reports (through tidylog) that select(), in the first command, dropped the numbers and logic columns from df2, which is correct since we told it to keep only text.

Before you continue, what are the data types of df_1v1 and df_1v2?

Let’s answer that question:

The displayed results of these two statements should be familiar, but let’s take a moment to examine them:

The first, which is the result of a select() statement, is a data frame which has one column/variable named text. This column has five values in it (with only the first four displayed in the result).
The second is a character vector.

You can think of it like this: a data frame is like a spreadsheet file in Excel. Even if it’s only got one column, it’s still a spreadsheet. But if you highlight and copy the contents of that column, as if to paste it in an email, the copied data is not a whole spreadsheet–it’s just the vector of data. This distinction is often not important, but having a bit of background knowledge can clarify issues when things don’t behave as expected.

The following won’t work because df_1v2 isn’t a data frame. The pipe operator itself will pass along any object, but select() needs a data frame to operate on:

However, the following does work because df_1v1 is a data frame:

Note that the result this time says that no changes result from the select statement. This is because df_1v1 has only one column, text, and the select statement did not leave any columns out.

10 Converting types

Sometimes we want to convert one type of data to another. For more information on types & coercion, see this page on rforir. Here are some common cases.

10.1 Converting the types of simple values

10.1.1 Convert character to numeric

Consider the following functions that convert between numeric and character data:
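A sketch of four such calls, assuming the value 1 as the example (the original values are not shown):

```r
print(as.numeric(1))      # numeric input, numeric result
print(as.numeric("1"))    # character input, numeric result
print(as.character(1))    # numeric input, character result
print(as.character("1"))  # character input, character result
```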

The results of lines 1 & 2 and lines 3 & 4 are the same. The functions work with either numeric or character arguments.

These functions also work on vectors!
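A minimal sketch with two hypothetical vectors, num_vec and chr_vec:

```r
# Convert a character vector to numeric.
num_vec = as.numeric(c("1", "2", "3"))

# Convert a numeric vector to character.
chr_vec = as.character(c(1, 2, 3))

str(num_vec)
str(chr_vec)
```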

As expected (and as specified by the functions), the first vector is numeric and the second vector is a character vector.

The conversion to a number depends on that making sense:
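A sketch with two hypothetical input strings, one that looks like a number and one that does not:

```r
# This string can be read as a number.
ok = as.numeric("3.14")

# This string cannot, so R returns NA and issues a warning.
bad = as.numeric("a test")

print(ok)
print(bad)
```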

As you can see, if as.numeric(X) cannot coerce its argument X to a number, then it produces NA. Sometimes, this will be what you want; sometimes it won’t. Just be aware of how the function works.

10.1.2 Convert logical to numeric

Sometimes we want to convert a logical TRUE/FALSE to a number. Before looking at those values within a vector, let’s look at them individually:
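A minimal sketch of the two individual conversions:

```r
print(as.numeric(TRUE))   # prints 1
print(as.numeric(FALSE))  # prints 0
```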

We can see here that TRUE becomes 1 and FALSE becomes 0. (FYI, this is a common convention in many programming languages.)

This function also works on vectors, as you might guess:
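A minimal sketch, using the names logic_vector and num_vect from the steps described below (the values are hypothetical):

```r
# Define a logical vector.
logic_vector = c(TRUE, FALSE, TRUE)

# Convert it to a numeric vector: TRUE becomes 1, FALSE becomes 0.
num_vect = as.numeric(logic_vector)

# Confirm the type of the result.
str(num_vect)
```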

Here’s what we did:

We first defined the vector logic_vector.
We then converted this vector to a numeric vector using as.numeric().
Finally, we displayed the structure of num_vect, confirming that it is, indeed, a numeric vector.

10.1.3 Converting a list to a vector

Suppose, for whatever reason, that we want to convert a list of values to a vector. Consider the following:

You should be familiar with this by now. We have defined a list my_list using the function list().

The unlist() function converts a list to a vector:
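A minimal sketch, using the names my_list and my_val from this section (the values in the list are hypothetical):

```r
# A list of three numbers.
my_list = list(1, 2, 3)

# Flatten the list into a vector.
my_val = unlist(my_list)

str(my_val)
```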

The displayed results show that my_val is a numeric vector, as promised.

Now let’s demonstrate what unlist() does when confronted with a more complicated list:
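A sketch of such a list, built to match the description that follows. The name my_list1 comes from the text; the name my_val1 and the values are hypothetical.

```r
# A number, two strings, a vector, a nested list (a number and a vector),
# and a missing value.
my_list1 = list(1, "a", "b", c(2, 3), list(4, c(5, 6)), NA)

# Flatten everything, including the nested list, into one vector.
my_val1 = unlist(my_list1)

str(my_val1)
```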

The variable my_list1 is a list containing a number, two character strings, a vector, a list containing a number and another vector, and a blank value. The result of using unlist() is a character vector containing nine elements. We’re not sure that we would have predicted this result, but this is how unlist() works — it flattens all of the contents of all of its elements into one vector without any sublists or subvectors.

10.2 Converting the types for all elements in a list

The above as.numeric() and as.character() functions work on single arguments (or on vectors). What does one do when one wants to convert a list of values to another type?

Maybe not surprisingly, the tidyverse has a powerful function (from its purrr package) that makes this super easy, barely an inconvenience: map().

Suppose that we have the following list that we want to convert to a list of numeric values:
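A sketch of such a list, using the name char_list from the text. The string "a test" is mentioned later in this section; the other values are hypothetical.

```r
# Three character strings plus a missing value.
char_list = list("1", "2", "a test", NA)

str(char_list)
```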

You can see here that char_list is, indeed, a list containing three character strings plus a blank value.

Let’s use the map() function to convert all of these values to numeric (where possible):

Here’s how you should interpret the command on the first line:

Apply the function as.numeric to each element of the list char_list.

The displayed result shows that num_list is a list consisting of the number 1, the number 2, and then two blanks. The warning shown before the result indicates that an NA (that is, a blank value) was introduced into the result. In this case, this happened because "a test" could not be converted to a number.
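The conversion described above can be sketched as follows, assuming the purrr package (part of the tidyverse) is installed. The values in char_list are hypothetical apart from "a test".

```r
library(purrr)

# Three character strings plus a missing value.
char_list = list("1", "2", "a test", NA)

# Apply as.numeric() to each element; the result is again a list.
# "a test" cannot be converted, so R warns and puts NA in its place.
num_list = map(char_list, as.numeric)

str(num_list)
```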

Test your knowledge

How would you convert char_list to a numeric vector with four elements?
