Practice Problems for Data Types

1 Introduction

1.1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

1.2 Using RStudio

If you’re following along with this exercise in RStudio, then you need to execute the following code in the Console. If you are going through this set of exercises within this document, you don’t have to do so because we are loading these libraries for you.

library(tidyverse)
library(tidylog)

The first line loads the tidyverse package. The second line loads the tidylog package, which tells R to give more detailed messages.

2 Data frames

2.1 The basics of data frames

A data frame is like a table in a database or a spreadsheet. It has rows and named columns, like the one suggested below. When we have individual cases (like students, enrollments, etc) in rows and attributes (like ID, Class, InState, Enrolled) in columns that describe the cases, we call that tidy data. This is the most common format we work with.

ID  Class  InState  Enrolled
1   Fr     TRUE     2021-01-01
2   Fr     FALSE    2021-01-02
3   So     TRUE     2021-01-03

We can create a blank data frame with:

Note that nothing does (or should) print out after the above statement. You could add a line containing df1 if you want to attempt to print out the df1 object. (We say “attempt” because the data frame is empty, so R prints only a short message saying that it has 0 columns and 0 rows.)
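
A minimal sketch of this step, assuming the empty data frame is created with data.frame() and stored under the name df1 used in the text:

```r
# Create an empty data frame (the name df1 comes from the text above)
df1 = data.frame()

# Printing it produces a short message rather than a table
df1
# data frame with 0 columns and 0 rows

# dim() reports the number of rows and columns, both 0 here
dim(df1)
```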

A variation of the data frame is the tibble, which comes from the tidyverse package. For our purposes, the main distinction is that data.frame() will rename columns if they have spaces or special characters, while tibble() will not. We will see examples of this below.

Again, no results are printed. That does not mean that nothing happened! It just means nothing is printed. In this case what happens is that an empty tibble named tb1 is created.

Tidyverse functions like select() work on both data frames and tibbles, and each returns the same kind of object it was given. (It has been suggested that “tibble” is Australian for “table”.)
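
The column-renaming behavior of data.frame() mentioned above can be sketched in base R. The column name and the object names df_renamed and df_kept are made up for illustration; check.names is a documented argument of data.frame().

```r
# data.frame() repairs column names by default (check.names = TRUE),
# so the space in the name is replaced with a dot
df_renamed = data.frame(`first name` = c("Alice", "Bob"))
names(df_renamed)   # "first.name"

# Turning the check off keeps the name exactly as written,
# which is how the text says tibble() behaves
df_kept = data.frame(`first name` = c("Alice", "Bob"), check.names = FALSE)
names(df_kept)      # "first name"
```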

Tip

Working in RStudio

Run that line using control-enter or the Run button. You’ll see a new entry in the Environment tab in the top right. It’s a data frame with 0 rows and 0 columns, like an empty spreadsheet.

2.2 Set up: Create a small data frame

Most of the time, when creating a data frame, we load data from a file or database. But for a simple example, we can create one like this.

df = data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18,18,19,20,21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

Let’s take a quick look at this data frame:

You can see that it has six columns and five rows, all corresponding to the above data.frame() function. Make sure that you understand how the display of the table and the function call above are related.

3 Using the structure function (str())

To see the structure of a data frame, use the str() function. For more info, see this page on rforir. For example, run this line to see the structure of the df data frame that we created above:

If you are doing this in RStudio, you can see that this is the same information that is displayed in the Environment window.

Much of the rest of this lesson concerns itself with interpreting this output.
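
As a sketch of what this looks like in practice, the block below rebuilds the df data frame from Section 2.2 and inspects it. The sapply() call is an extra, not part of the original exercise, that lists the class of each column.

```r
# Rebuild the small data frame from Section 2.2
df = data.frame(
  id         = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  last_name  = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class      = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age        = c(18, 18, 19, 20, 21),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# One line per column: name, data type, and the first few values
str(df)

# The same type information as a named character vector.
# With R 4.0 or later, the text columns are "character";
# older versions converted them to factors by default.
sapply(df, class)
```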

Tip

Using the Console in RStudio

If you type the name of a data object in the Console and press Enter, it will list part or all of it, so you can see what’s inside. This is similar to what appears in the Data window. Try it with the df object.

4 Vectors

In order to better understand the information printed out by the str() function, we need to learn about vectors, lists, and data types. In this section, we introduce you to the vector.

A vector (or array) is a collection of data that is all the same type. (It can be useful to think of a vector as a column of data. You will see this throughout your work with R.) Just like in Excel we can have columns of data, each of which has a different type. Here are some common ones:

  • character: text
  • numeric: numbers
  • integer: whole numbers
  • logical: TRUE or FALSE
  • factor: a category or label
  • date-related information
    • date: the specification of a date without any reference to time
    • date-time (or dttm): a specified time on a specified date

For more info on vectors, see this page on rforir. For more info on data types, see this page on rforir.

Much of the time R will figure out what data type you want.
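
To make the list of types above concrete, here is a small sketch that asks R for the class of one value of each kind. The values themselves are arbitrary examples.

```r
class("hello")                  # "character"
class(3.14)                     # "numeric"
class(42L)                      # "integer" (the L suffix asks for an integer)
class(TRUE)                     # "logical"
class(factor(c("Fr", "So")))    # "factor"
class(as.Date("2021-01-01"))    # "Date"

# A date-time has two classes: "POSIXct" and "POSIXt"
class(as.POSIXct("2021-01-01 09:30:00", tz = "UTC"))
```

Note that 42 without the L suffix is stored as numeric, which is one example of R choosing a type for you.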

Test your knowledge

Let’s create three vectors:

Again, nothing is displayed after you run the code above. If you would like, you can add code to the above code block to display the data. How do you do that?

Tip

Using RStudio

Run each of those lines and look in the Environment for them.

Use the str() function to see the data types of each of these vectors. Here’s the first one.

What do you expect to see for my_numbers and my_logic?

Using the code box above, verify that you’re right.

Solution

Here is our answer. Note that we can run all three str() statements in the same code block:

The above output shows that we have three vectors — a character (chr) vector, a numeric (num) vector, and a logical (logi) vector. Each one has five values.
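
A sketch of this exercise follows. The names my_numbers and my_logic come from the text; the name my_text and all of the values are assumptions made for illustration.

```r
# Three vectors of length five (values are illustrative)
my_text    = c("a", "b", "c", "d", "e")
my_numbers = c(1.5, 2, 3.25, 4, 5)
my_logic   = c(TRUE, FALSE, TRUE, TRUE, FALSE)

# Each str() line starts with the type (chr, num, logi)
# followed by the range of positions, [1:5]
str(my_text)
str(my_numbers)
str(my_logic)
```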

5 Identify types

Notice that the three vectors we made all have the same length. That means we can put them all into a data frame together.

Test your knowledge

Let’s run a command to create a data frame that uses the vectors we created above:

Note that the data.frame() function that we call here is the same function that we used in Section 2 to create an empty data frame — the only difference here is that we define the initial contents of the data frame.

After running the above code, add df2 to the end so that you can see the data frame better.

Now that you can see the data frame, what are text, numbers, and logic (on the left side of the equal signs above)?

Solution

Let’s go ahead and add df2 after the data.frame() function to see what it looks like:

The words text, numbers, and logic all become the names of the columns in the data frame. Note that the word text has no magical properties in this case — it could have had numeric values in the column and not kicked up any type of fuss. The same goes for numbers and logic.

Test your knowledge

Now run str(df2) at the end of the following code block.

When might you run str() and when might you simply display the data frame (e.g., just run df2)?

Solution

Here we have added the str() function call just after the creation of the data frame:

The str() function output looks more similar to the original data.frame() function call that created the data frame. It also shows R’s data type explicitly for each column.

If you just want to display the data, then just call the data frame itself (e.g., df2). If you want to see a sampling of the data and see the data types, then use str().
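
The two ways of looking at df2 can be compared side by side. This sketch assumes vectors named my_text, my_numbers, and my_logic with made-up values; only the column names text, numbers, and logic come from the text.

```r
# Illustrative vectors, all of length five
my_text    = c("a", "b", "c", "d", "e")
my_numbers = c(1.5, 2, 3.25, 4, 5)
my_logic   = c(TRUE, FALSE, TRUE, TRUE, FALSE)

# The names on the left of each = become the column names
df2 = data.frame(text = my_text, numbers = my_numbers, logic = my_logic)

df2        # displays the data as a table
str(df2)   # displays the type of each column plus a sample of its values
```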

6 Missing data

Missing data is common in real world cases. In R, missing data is represented by NA.

We note the following in the results of the above:

  • missing_data is a vector. How do we know? Because the structure of the output line is how R prints out vectors. (We know this isn’t optimal, but it’s something that you will learn to get used to.)
  • num: missing_data is a vector of numbers.
  • [1:5]: missing_data has 5 elements.
  • 1 2 NA 4 5: the elements of the vector in order are 1, 2, NA, 4, and 5. Note that while NA is not strictly of type num, when NA is used in a vector, its type is coerced to match the type of the vector. That is, it is treated as if it were the type of the other values.
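
The observations above can be reproduced with the following sketch, which assumes missing_data holds the five values shown in the output. The is.na() and typeof() calls are additions for illustration.

```r
missing_data = c(1, 2, NA, 4, 5)

str(missing_data)         #  num [1:5] 1 2 NA 4 5

# is.na() flags the missing element
is.na(missing_data)       # FALSE FALSE TRUE FALSE FALSE

# The NA has taken on the type of the rest of the vector
typeof(missing_data[3])   # "double"
```
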

Tip

Using RStudio

After running that line, if you are in RStudio, then look at the Data browser in the Environment window. You’ll see that the third value is NA, meaning that it’s a blank.

Test your knowledge

Now let’s change the vector slightly, as follows:

How do the structures of missing_data (from the previous code block) and missing_data2 (from this code block) differ? Why does this happen? Why do we prefer to use NA rather than "", "NA", --, or similar values?

Solution

Let’s look at the same values in the output that we looked at previously:

  • vector: missing_data2 is also a vector.
  • chr: however, in this case, this is a character vector.
  • [1:5]: it also has 5 elements.
  • "1" "2" "NA" "4" "5": this is the big change — each element is now a character string! You should note the quotation marks around each value. This signifies that the values are strings and not numeric values.

This conversion to a character vector happened because of the following:

  • A vector’s elements have to all be of the same type.
  • Every kind of value can be converted to a character string…
  • But not every type of value can be converted to a number.
  • "NA" is a string, and…
  • The rest of the values were numbers.
  • Since all of the vector elements have to be of the same type, then every element has to be converted to a character string.

Since NA (from the previous example) can be coerced to a number, it is. So the result in that case was a numeric vector. Using NA instead of "NA" allowed R to treat the data as a numeric vector (as it almost assuredly is supposed to be).
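
A short sketch of why this matters in practice, assuming the two vectors are defined as described above. The counting and averaging steps are additions for illustration.

```r
missing_data  = c(1, 2, NA, 4, 5)     # a real missing value
missing_data2 = c(1, 2, "NA", 4, 5)   # the text "NA"

class(missing_data)    # "numeric"
class(missing_data2)   # "character"

# R only recognizes the real NA as missing
sum(is.na(missing_data))    # 1
sum(is.na(missing_data2))   # 0

# Numeric functions can skip a real NA when asked to
mean(missing_data, na.rm = TRUE)   # 3
```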

For more info on missing data, see this page on rforir.

Test your knowledge

Create a character (text) vector that has at least one missing value by defining it here:

Now, go back to the code box and

  1. Print the vector, and on the next line
  2. Display the structure of the vector.

How do you know that it is a character vector?

A bit of help

Your solution should have three statements:

  1. Create the vector
  2. Display the vector
  3. Display the structure of the vector

Solution

As promised in the hint, here are the three statements:

Note the chr at the beginning of the line that shows the structure of our vector. This means that it is a character vector, as you were instructed to create.

You could have, of course, given your vector a different name and different values. But you should ensure that your vector is a character vector. If it is not, then change it until it is.
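
One possible version of the three statements, with a hypothetical name (my_chars) and made-up values:

```r
# 1. Create the vector; NA is written without quotation marks
my_chars = c("red", NA, "blue")

# 2. Display the vector
my_chars

# 3. Display its structure; the line begins with chr
str(my_chars)
```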

7 Lists

The final data type we’ll mention here is a list, which can be an ordered collection of different types of data. Here’s an example in which we create a list using the list() function:

While a list is similar to a vector, the printed representation is very different:

  • It says, right on the first line, both that it is a list and how many elements it has.
  • Each element is printed on a separate line.
  • The data type for each element is printed on each line. This is necessary since each element can be of a different type (unlike a vector in which each element has to have the same type).

Test your knowledge

Now, consider these slightly-revised versions of the list elements and different functions (list() and c()) used to create the variables:

What can you say about the list() and c() given the results when they are applied to the two sets of arguments (e.g., "milk", "9am", and "2" and "milk", "9am", and 2)?

A bit of help

First, consider the structures for val_a and val_c. These both have the same argument values.

Second, consider the structures for val_b and val_d. These also both have the same argument values.

What can you say about how list() operates versus how c() operates?

Solution

First, consider the results for val_a and val_c. The first uses list() while the second uses c(); however, the arguments to each function are the same (2 is a number). The data val_a is a list while the data val_c is a vector. The data types for the val_a elements have all been retained while the data types for the val_c elements have all been converted to character strings.

Second, consider the results for val_b and val_d. Again, the first uses list() while the second uses c(); again the arguments to each function are the same ("2" is a character string). The data val_b is a list while the data val_d is a vector. The data types for the val_b and val_d elements have all been retained. Contrary to what we have seen so far, the data types for all the elements of val_b are the same.

So, what can we say about list() and c() given these results?

  • list() does not convert the values of its arguments in creating a list. The result is always a list.
  • c() does convert the values of its arguments if it is necessary. The result is always a vector.
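
The comparison can be sketched as follows, using the argument values quoted in the question and the pairing of names described in the solution:

```r
val_a = list("milk", "9am", 2)     # list, 2 is a number
val_b = list("milk", "9am", "2")   # list, "2" is a character string
val_c = c("milk", "9am", 2)        # vector, 2 is a number
val_d = c("milk", "9am", "2")      # vector, "2" is a character string

str(val_a)   # a list of 3; the third element is still a number
str(val_c)   # a character vector; the 2 has become "2"

# After conversion, the two vectors are indistinguishable
identical(val_c, val_d)   # TRUE
```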

Test your knowledge

Here is a slightly more complex list:

What data type is val_e — a list or a vector? A list (or vector) of what?

If it is a list, could it be a vector? If it is a vector, could it be a list?

Solution

val_e is a list. And it’s a list containing three character vectors, one numeric vector, and one list.

It could have been a vector…but the results are quite different than what you might expect. Go ahead and try it out — use c() instead of list() to create val_e. Go through the output in detail.

For more info on lists and vectors, see this page on rforir.

Test your knowledge

Create your own list using different data types.

A bit of help

Your answer should use the list() function since you are to create a list. You might also want to use the str() function to verify that you created a list.

Solution

Here is how we created a list of five values. We follow it up with a str() statement in order to display the structure of the variable:

The output of str() shows the following:

  • A list of five values was created.
  • The first element is a number, the second a character, the third a logical value (NA’s default data type is logical), the fourth a list of two numbers, and the fifth a vector of two numbers.

You could, of course, have named your list something other than my_val.
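
A sketch of a list that matches this description. The name my_val comes from the text; the specific values are assumptions.

```r
my_val = list(
  1,            # a number
  "a",          # a character string
  NA,           # a missing value, logical by default
  list(1, 2),   # a list of two numbers
  c(1, 2)       # a vector of two numbers
)

str(my_val)
```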

8 Accessing elements of a list

8.1 A simple list

Let’s take a look for a moment at the following results when you run this code (referencing the val_a list that we created above):

Here we have printed the value of val_a and not its structure. This leads R to print it in a new, probably unfamiliar, way.

  • [[X]] indicates that the value of the Xth element follows on the next line.
  • The value is printed on each of the lines before the next blank line; in this case, since each list element is only a single value, the information is displayed on just one line.

The [[X]] provides a big hint about how you can access the Xth value of a list. Consider the following:

This prints out the first and third elements. Then it attempts to print out the fourth element but R returns an error saying the subscript (i.e., 4) is out of bounds (i.e., not valid).
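
A sketch of these three accesses, assuming val_a is the list("milk", "9am", 2) from the previous section. The tryCatch() wrapper and the name msg are additions so that the example can capture the error message instead of halting.

```r
val_a = list("milk", "9am", 2)

val_a[[1]]   # "milk"
val_a[[3]]   # 2

# val_a[[4]] on its own would stop with an error;
# tryCatch() returns the error message as a string instead
msg = tryCatch(val_a[[4]], error = function(e) conditionMessage(e))
msg          # "subscript out of bounds"
```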

8.2 A list with named elements

Let’s take a look for a moment at the following results when you run this code. You might want to look at the original definition of val_e above before you run this code:

The displayed results here differ dramatically from the results shown after val_a above. Instead of showing [[X]] before each value, R displays $ITEM_NAME. The only possibly confusing information is for $mixed2. Here are a few observations:

  • The $mixed2 value printed by itself declares that val_e has a list element with this name.
  • What follows is five pairs of lines, one for each element of this list.
  • The elements within $mixed2 are not named, so the values are displayed using the [[X]] formatting.

Now let’s go through a few examples of how we can access named elements of a list.

You should note from the results of this code that each of the pairs (1 & 2, 3 & 4, 5 & 6) return the exact same information. To our eyes, the second of each pair — the ones using the names of the list elements — is easier to use and interpret. However, it doesn’t always make sense to name list elements. Thus, we find that we use names when we have named the list items but otherwise we happily use the [[X]] method.

8.3 Elements of a list within a list

Let’s take a closer look at the fifth element of val_e (which is named mixed2). You can see that this is a list containing five unnamed elements. The following shows two different ways that we can access the third element of that list:

As in previous examples, the results of these two statements display equivalent values. Read each of these from right-to-left:

  • val_e[[5]][[3]]: return the third element of the fifth element of val_e.
  • val_e$mixed2[[3]]: return the third element of the element named mixed2 in val_e.
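
Access by position and access by name can be compared in a sketch like the one below. It uses a stand-in for val_e: only the element names shopping, primes, and mixed2 come from the text, while the names times and notes, the order of the elements, and every value are assumptions.

```r
# A hypothetical val_e: three character vectors, one numeric
# vector, and (in fifth position) an unnamed five-element list
val_e = list(
  shopping = c("milk", "eggs", "bread"),
  times    = c("9am", "noon"),
  primes   = c(2, 3, 5, 7, 11),
  notes    = c("call", "email"),
  mixed2   = list("milk", "9am", 2, TRUE, NA)
)

# Position and name reach the same element
identical(val_e[[1]], val_e$shopping)   # TRUE
identical(val_e[[5]], val_e$mixed2)     # TRUE

# Third element of the nested list, two ways
val_e[[5]][[3]]     # 2
val_e$mixed2[[3]]   # 2
```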

Test your knowledge

Access the following items in val_a and val_e:

  • For val_a:
    • the first value
    • the third value
    • add 7 to the third value
  • For val_e:
    • the element named shopping
    • the second element
    • the third element of primes
  • Add the fourth element of primes (of val_e) to the third value of val_a

A bit of help

You should have seven different statements. Three of these should reference val_a exclusively, three should reference val_e exclusively, and one should reference both variables.

Solution

Here is how we answered the question:
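
One possible set of seven statements is sketched below. It assumes val_a is list("milk", "9am", 2) and uses a stand-in for val_e in which only the names shopping, primes, and mixed2 come from the text; the other names and all values are made up.

```r
val_a = list("milk", "9am", 2)
val_e = list(                      # hypothetical contents
  shopping = c("milk", "eggs", "bread"),
  times    = c("9am", "noon"),
  primes   = c(2, 3, 5, 7, 11),
  notes    = c("call", "email"),
  mixed2   = list("milk", "9am", 2, TRUE, NA)
)

# val_a only
val_a[[1]]        # first value
val_a[[3]]        # third value
val_a[[3]] + 7    # 9

# val_e only
val_e$shopping    # the element named shopping
val_e[[2]]        # the second element
val_e$primes[[3]] # third element of primes

# both
val_e$primes[[4]] + val_a[[3]]   # 7 + 2 = 9 with these values
```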

9 Data frames are lists

A data frame is a special kind of list where each element is a vector of the same length. This is an important technicality, because it explains why a data frame with one column is a different kind of object than a vector.

Consider the following:

When you run the above, R does not print out any values. It simply tells us that select() (from the first command) has dropped the numbers and logic columns from df2 (which is correct since we told it to keep only text).

Before you continue, what are the data types of df_1v1 and df_1v2?

Let’s answer that question:

The displayed results of these two statements should be familiar, but let’s take a moment to examine them:

  1. The first, which is the result of a select() statement, is a data frame which has one column/variable named text. This column has five values in it (with only the first four displayed in the result).
  2. The second is a character vector.

You can think of it like this: a data frame is like a spreadsheet file in Excel. Even if it’s only got one column, it’s still a spreadsheet. But if you highlight and copy the contents of that column, as if to paste it in an email, the copied data is not a whole spreadsheet–it’s just the vector of data. This distinction is often not important, but having a bit of background knowledge can clarify issues when things don’t behave as expected.
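
The same distinction can be sketched without the tidyverse. Base R's single-bracket and double-bracket indexing are used here in place of select(); the names one_col_df and one_col_vec and the values in df2 are assumptions for illustration.

```r
df2 = data.frame(
  text    = c("a", "b", "c", "d", "e"),
  numbers = c(1.5, 2, 3.25, 4, 5),
  logic   = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# A data frame really is a list of columns
is.list(df2)                  # TRUE

# Single brackets keep the "spreadsheet": a one-column data frame
one_col_df = df2["text"]
is.data.frame(one_col_df)     # TRUE

# Double brackets (or $) copy out just the column: a plain vector
one_col_vec = df2[["text"]]
is.data.frame(one_col_vec)    # FALSE
identical(one_col_vec, df2$text)   # TRUE
```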

The following won’t work because df_1v2 isn’t a data frame (and select() only operates on data frames):

However, the following does work because df_1v1 is a data frame:

Note that the result this time says that no changes result from the select statement. This is because df_1v1 has only one column, text, and the select statement did not leave any columns out.

10 Converting types

Sometimes we want to convert one type of data to another. For more information on types & coercion, see this page on rforir. Here are some common cases.

10.1 Converting the types of simple values

10.1.1 Convert character to numeric

Consider the following functions that convert between numeric and character data:

The results of lines 1 & 2 and lines 3 & 4 are the same. The functions work with either numeric or character arguments.

These functions also work on vectors!

As expected (and as specified by the functions), the first vector is numeric and the second vector is a character vector.

The conversion to a number depends on that making sense:

As you can see, if as.numeric(X) cannot coerce its argument X to a number, then it produces NA. Sometimes, this will be what you want; sometimes it won’t. Just be aware of how the function works.
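
A compact sketch of these conversions, with made-up values. The suppressWarnings() wrapper and the name converted are additions; without the wrapper, R would also print the warning "NAs introduced by coercion".

```r
# Single values
as.numeric("3.14")     # 3.14
as.character(3.14)     # "3.14"

# Vectors are converted element by element
as.numeric(c("1", "2.5", "10"))   # 1.0 2.5 10.0

# Text that does not look like a number becomes NA
converted = suppressWarnings(as.numeric(c("7", "seven")))
converted              # 7 NA
```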

10.1.2 Convert logical to numeric

Sometimes we want to convert a logical TRUE/FALSE to a number. Before looking at those values within a vector, let’s look at them individually:

We can see here that TRUE becomes 1 and FALSE becomes 0. (FYI, this is fairly standard across many programming languages.)

This function also works on vectors, as you might guess:

Here’s what we did:

  1. We first defined the vector logic_vector.
  2. We then converted this vector to a numeric vector using as.numeric().
  3. Finally, we displayed the structure of num_vect, confirming that it is, indeed, a numeric vector.
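
The steps above can be sketched as follows. The names logic_vector and num_vect come from the text; the values are assumptions, and the sum() call is an extra showing one common use of this conversion.

```r
as.numeric(TRUE)    # 1
as.numeric(FALSE)   # 0

# 1. Define the logical vector (values are illustrative)
logic_vector = c(TRUE, FALSE, TRUE, TRUE)

# 2. Convert it to a numeric vector
num_vect = as.numeric(logic_vector)

# 3. Confirm the result is numeric
str(num_vect)       #  num [1:4] 1 0 1 1

# Because TRUE counts as 1, sum() counts the TRUE values
sum(logic_vector)   # 3
```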

10.1.3 Converting a list to a vector

Suppose, for whatever reason, that we want to convert a list of values to a vector. Consider the following:

You should be familiar with this by now. We have defined a list my_list using the function list().

The unlist() function converts a list to a vector:

The displayed results show that my_val is a numeric vector, as promised.

Now let’s demonstrate what unlist() does when confronted with a more complicated list:

The variable my_list1 is a list containing a number, two character strings, a vector, a list containing a number and another vector, and a blank value. The result of using unlist() is a character vector containing nine elements. We’re not sure that we would have predicted this result, but this is how unlist() works — it flattens all of the contents of all of its elements into one vector without any sublists or subvectors.
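
Both cases can be sketched as below. The names my_list, my_val, and my_list1 come from the text; the values are assumptions chosen to match the description (nine values in total once everything is flattened), and the name flat is made up.

```r
# A simple list of numbers flattens to a numeric vector
my_list = list(1, 2, 3)
my_val  = unlist(my_list)
str(my_val)     #  num [1:3] 1 2 3

# A number, two strings, a vector, a nested list, and a missing value
my_list1 = list(1, "a", "b", c(2, 3), list(4, c(5, 6)), NA)
flat = unlist(my_list1)

class(flat)     # "character", because the strings force a conversion
length(flat)    # 9
flat
```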

10.2 Converting the types for all elements in a list

The above as.numeric() and as.character() functions work on single arguments (or on vectors). What does one do when one wants to convert a list of values to another type?

Maybe not surprisingly, the tidyverse has a powerful function (from its purrr package) that makes this super easy, barely an inconvenience: map().

Suppose that we have the following list that we want to convert to a list of numeric values:

You can see here that char_list is, indeed, a list containing three character strings plus a blank value.

Let’s use the map() function to convert all of these values to numeric (where possible):

Here’s how you should interpret the command on the first line:

Apply the function as.numeric to each element of the list char_list.

The displayed result shows that num_list is a list consisting of the number 1, the number 2, and then two blanks. The warning shown before the result indicates that an NA (that is, a blank value) was introduced into the result. In this case, this happened because "a test" could not be converted to a number.
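
A sketch of this conversion, assuming char_list holds the values implied by the description ("1", "2", "a test", and NA). It loads purrr directly, which is the tidyverse package that provides map().

```r
library(purrr)

char_list = list("1", "2", "a test", NA)

# Apply as.numeric to each element; R warns that
# "a test" could not be converted and became NA
num_list = map(char_list, as.numeric)

# Still a list of four elements, but each is now numeric
str(num_list)
```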

Test your knowledge

How would you convert char_list to a numeric vector with four elements?

A bit of help

You can do this with one line of code.

A bit of help

You need to use unlist, map, and as.numeric.

Solution

Here is how we did it:

You can see that my_sol is a numeric vector containing four elements as specified.

The solution works as follows:

  1. The map function converts each element of char_list, where possible, into a numeric value using as.numeric().
  2. The unlist() function converts the list resulting from the map() function into a vector.
  3. Since each element of the list is a number (or NA, which can be coerced into a number), the resulting vector is a numeric vector.

Note that we could have used the following pipe-based approach to the code as well (instead of the above function-based approach):

It’s the exact same output and applies the exact same functions…but we think the formatting makes it easier to determine what’s going on.
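
The two approaches can be sketched side by side. The values in char_list are assumed from the description above, the name my_sol_piped is made up, and this sketch uses R's native |> pipe (R 4.1 or later) as a stand-in for the tidyverse pipe. suppressWarnings() hides the coercion warning so the comparison is easier to read.

```r
library(purrr)

char_list = list("1", "2", "a test", NA)

# Function-based: read from the inside out
my_sol = suppressWarnings(unlist(map(char_list, as.numeric)))

# Pipe-based: read from left to right
my_sol_piped = suppressWarnings(
  char_list |> map(as.numeric) |> unlist()
)

str(my_sol)                       # a numeric vector of four elements
identical(my_sol, my_sol_piped)   # TRUE
```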