Practice Problems with Character Variables


1 Using this document

Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:

  • Edit the code that is shown in the box. Click on the Run Code button.
  • Make further edits and re-run that code. You can do this as often as you’d like.
  • Click the Start Over button to bring back the original code if you’d like.
  • If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.

2 Set up: Create two data frames — one small and one larger

A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.

In this lesson we will be working with two separate data frames: 1) df with information about students, and 2) bestsellers with information about NYTimes best sellers.

2.1 The df student data frame

The df data frame has six rows and seven columns. After creating it, we use the mutate() function to convert class into an ordered factor, which tells R that the column holds categorical data with a natural order (Freshman through Senior).

library(tidyverse)  # provides mutate(), select(), and the str_*() functions

# Six students, one row each
df <- data.frame(
  id         = c(1, 2, 3, 4, 5, 6),
  first_name = c("Alice", "Bob", "Charlie", "David",
                 "Eve", "Stanislav"),
  last_name  = c("Smith", "Jones", "Kline", "White",
                 "Zettle", "Bernard-Zza"),
  class      = c("Freshman", "Sophomore", "Junior",
                 "Senior", "Senior", "Sophomore"),
  age        = c(18, 18, 19, 20, 21, 19),
  gpa        = c(3.5, 3.2, 3.8, 3.0, 3.9, 2.9),
  home       = c("Dayton, OH", "Columbia, SC",
                 "Cleveland, OH", "New York, NY",
                 "Las Vegas, NV", "Cedar Rapids, IA")
)

# Store class as an ordered factor: Freshman < Sophomore < Junior < Senior
df <-
  df |>
    mutate(class = factor(class,
                          levels = c("Freshman", "Sophomore",
                                     "Junior", "Senior"),
                          ordered = TRUE))

Let’s look at a summary of df:

2.2 The bestsellers data frame about books

The bestsellers data frame starts off in a CSV file. Here’s the process we go through:

  1. We read the file into the bestsellers data frame.
  2. We use a select() statement to choose those columns that we want to work with in this exercise.
  3. We then use as.Date() to convert pub_date (which was read in as a character value) into a date format. You can learn more about R’s date formats on this page.
  4. Finally, we use as.character() to convert isbn (which was read in as a numeric value) into a character format.

bestsellers <- read.csv("bestsellers.csv")

# Keep (and rename) only the columns used in this lesson
bestsellers <-
  bestsellers |>
    select(pub_date = published_date,
           list_name = list_name_encoded,
           rank,
           isbn = isbn13,
           title, author, description)

# pub_date was read in as text; convert it to a Date
bestsellers$pub_date <-
  as.Date(bestsellers$pub_date,
          format = "%Y-%m-%d")

# isbn was read in as a number; store it as text
bestsellers$isbn <- as.character(bestsellers$isbn)

Let’s take a look at what all of this processing on bestsellers results in:

You will use this data frame in some of the exercises below.

Test your knowledge

Show the first 5 rows of the bestsellers data frame.

Solution

This is an application of the head() function that we previously learned about (in the select() lesson).

3 Overview

Character vectors, sometimes called ‘strings’, are more complicated than numerical vectors because of the variety found in human language. For an extensive discussion of string functions, see this page on rforir. We were also introduced to this topic in the data types lesson.

An important illustration is the difference between 1) a number, and 2) a character representation of a number. For more info on this topic, see this page on rforir.

This exists in Excel too, which you can see by typing =ISNUMBER("3") in a cell. Compare that to =ISNUMBER(3). As in Excel, the use of quotes distinguishes a character string from other data types.

3.1 Examples

This means that some operations don’t make sense, even though they look right.

We would hope that this works (since we’re adding two numeric values):

The following, as the quotes suggest, does not work; R reports a "non-numeric argument to binary operator" error because both "3" and "2" are strings, not numbers:
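As a minimal sketch of the two cases described above (the variable names num_sum, chr_result, and fixed_sum are ours, not the lesson's), the following uses tryCatch() so that the failing addition does not halt the script:

```r
# Adding two numeric values works as expected.
num_sum <- 3 + 2
print(num_sum)

# Adding two character strings raises an error; tryCatch() captures the
# error message as text instead of stopping the script.
chr_result <- tryCatch(
  "3" + "2",
  error = function(e) conditionMessage(e)
)
print(chr_result)

# Converting the strings to numbers first makes the addition valid.
fixed_sum <- as.numeric("3") + as.numeric("2")
print(fixed_sum)
```

The last step shows the usual remedy: convert the character representation to a number before doing arithmetic.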

3.2 Quote characters

As you’ll note from the examples above, character strings are created by using a double quote (") to begin and end the string.

There are three kinds of quote marks on the keyboard: the double quote ", the single quote (apostrophe) ', and the backtick ` (usually at the top left of the keyboard).

In R, the double quote is used to mark character strings, and the backtick is used in select(), mutate(), and other functions to refer to column names that have unusual characters in them. You can also use single quotes to mark character strings, but we recommend double quotes because they make it easier to use the single quote as an apostrophe within a string.
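Here is a short sketch of the three quote marks in use. The scores data frame and its "final grade" column are hypothetical, made up only to show a column name that needs backticks:

```r
# Double quotes delimit the string, so the apostrophe needs no special care.
s1 <- "It's a string"

# Single quotes also work, but then the apostrophe must be escaped.
s2 <- 'It\'s a string'
print(identical(s1, s2))

# Backticks refer to a column whose name contains a space.
# check.names = FALSE keeps the name exactly as typed.
scores <- data.frame(`final grade` = c(90, 85), check.names = FALSE)
print(scores$`final grade`)
```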

4 Working with character (or string) vectors

You can create a character vector — that is, a vector of character elements — by using the c() function, just like with numbers.

As with any vector, we can use the [] operator to access elements by their position. The first command selects the first element of the vector; the second command selects its third through fifth elements.

We can test logical conditions with character vectors too. Note that both of the following cannot be TRUE at the same time because strings are case sensitive.

Similar to performing mathematical operations with vectors and scalars, we can compare a scalar (single character string) with every element within a vector:
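The operations described in this section can be sketched together as follows. The contents of fruit_vector are an assumption on our part; any character vector would behave the same way:

```r
# Create a character vector with c(), just as with numbers.
fruit_vector <- c("apple", "banana", "cherry", "date", "elderberry", "fig")

# Access elements by position.
first_fruit  <- fruit_vector[1]     # the first element
middle_fruit <- fruit_vector[3:5]   # the third through fifth elements
print(first_fruit)
print(middle_fruit)

# Strings are case sensitive, so these cannot both be TRUE.
print("apple" == "apple")
print("apple" == "Apple")

# Comparing a scalar with a vector tests every element in turn.
is_banana <- fruit_vector == "banana"
print(is_banana)
```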

This flexibility around vectors and scalars is one of the foundational characteristics of the R programming language. It is both a foreign concept to people who have programmed in other languages and a welcome practice to those who use R in their daily work.

Test your knowledge

Get a vector of the first to fifth titles in the bestsellers data frame.

Solution

The key to this question is the phrase get a vector. This means that you can’t use an R/tidyverse query because it returns, by default, a data frame.

Test your knowledge

Get a data frame consisting of a title column that contains only the first 5 rows of the bestsellers data frame.

Solution

This time the key phrase is get a data frame. This should be a signal to you that you can (and should!) use R/tidyverse commands.
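To make the vector versus data frame distinction concrete, here is a sketch that uses a small stand-in data frame named books. The stand-in and its placeholder titles are hypothetical; with the real data you would use bestsellers in its place:

```r
library(dplyr)

# Hypothetical stand-in for bestsellers, with placeholder values.
books <- data.frame(
  title  = c("BOOK ONE", "BOOK TWO", "BOOK THREE",
             "BOOK FOUR", "BOOK FIVE", "BOOK SIX"),
  author = c("Author A", "Author B", "Author C",
             "Author D", "Author E", "Author F")
)

# A vector: pull the column out with $ and index it with [].
title_vector <- books$title[1:5]
print(title_vector)

# A data frame: a tidyverse query keeps the result as a table.
title_df <-
  books |>
    select(title) |>
    head(5)
print(title_df)
```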

Test your knowledge

Create a data frame named A_titles that consists of all titles and authors for books whose authors come before B in the alphabet. If you can, sort the data frame by author and title.

A bit of help

Note that “authors come before B in the alphabet” does not mean (to R) the same thing as “starts with A”!

Solution

We first filter for the authors that we want, including some authors that are blank! Then we select the columns that we want. Finally, we use arrange() to sort the rows appropriately.
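The steps just described might look like the sketch below. The books data frame is a hypothetical stand-in for bestsellers with placeholder values, including one blank author to show why blanks pass the filter:

```r
library(dplyr)

# Hypothetical stand-in for bestsellers; note the blank author.
books <- data.frame(
  title  = c("BOOK ONE", "BOOK TWO", "BOOK THREE", "BOOK FOUR", "BOOK FIVE"),
  author = c("Adams", "", "Baker", "Abbott", "Zed")
)

A_titles <-
  books |>
    filter(author < "B") |>     # the empty string also sorts before "B"
    select(title, author) |>
    arrange(author, title)
print(A_titles)
```

"Baker" is excluded because "B" is a prefix of "Baker", and a shorter string sorts before any longer string that begins with it.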

5 String functions

The tidyverse includes a library of string functions that can be used to manipulate character vectors. You can find a cheatsheet for them here. You can also find a page dedicated to them within this web site; that page both walks through a detailed case study demonstrating how one might use these functions and explains many of the commonly used ones.

Most of these functions start with str_. Here are a few common uses.

5.1 Combine two strings (concatenate)

The str_c() function combines two or more strings into one.

This function has a commonly-used additional argument named sep (separator character(s)). If you do not set this argument, then it uses its default value of the empty string (""). Here is an example:
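A brief sketch of str_c() with and without sep follows; the example strings and variable names are ours:

```r
library(stringr)

# With no sep argument, the pieces are joined with nothing in between.
no_sep <- str_c("data", "frame")
print(no_sep)

# With sep, the separator is placed between the pieces.
with_sep <- str_c("data", "frame", sep = " ")
print(with_sep)

# str_c() is vectorized: it combines the inputs element by element.
paired <- str_c(c("a", "b"), c("x", "y"), sep = "-")
print(paired)
```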

Test your knowledge

Suppose we want to combine first and last names from the data frame into a new column called full_name. We can use the mutate() function to do this, in combination with the str_c() function.

First, create the new column and store it within a new df2 data frame. Second, run a query that prints the first, last, and full name of the student.

Solution

The str_c() function call below constructs the name in the form that we desire.
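One possible version is sketched here on a three-row excerpt of df. We assume the desired form is the first name, a space, and then the last name; adjust the str_c() call if you want a different form:

```r
library(dplyr)
library(stringr)

# A three-row excerpt of the df student data frame.
df <- data.frame(
  first_name = c("Alice", "Bob", "Charlie"),
  last_name  = c("Smith", "Jones", "Kline")
)

# Step 1: create the new column and store the result in df2.
df2 <-
  df |>
    mutate(full_name = str_c(first_name, last_name, sep = " "))

# Step 2: print the first, last, and full name of each student.
df2 |>
  select(first_name, last_name, full_name) |>
  print()
```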

Test your knowledge

Define a query that accomplishes the following:

  • Defines, for each row in the data frame, a column full_name that contains 'TITLE' by AUTHOR. Notice that there are apostrophes around the title.
  • Uses the str_to_title() function to change titles from ALL CAPITAL LETTERS to Title Text (that is, the first letter of each word is capitalized).
  • Displays only full_name and no other column from the bestsellers data frame.

You will want to build this query in steps.

A bit of help

The first select() chooses the columns that we need. The second select() removes the columns that we no longer need.

A bit of help

Do not try to use the sep argument in str_c().

Solution

Let’s explain the following query:

  1. We need the title and author columns (and no others), so we use the select() statement.
  2. We use the mutate() statement to define the new full_name column.
  3. We use the minus (-) operator within a second select() to get rid of the title and author columns.

We could not use the sep argument with str_c() because of the complication of adding apostrophes around the title column.

Note that if we had wanted to put double quotes (") around the title, we would have had to handle the character carefully, since a double quote usually starts and ends a string. Here is how we do it:
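As a sketch of both variants, the following builds the string for a single made-up title and author (both values are hypothetical). Inside a double-quoted string, a literal double quote is written as \":

```r
library(stringr)

title  <- "THE GREAT EXAMPLE"   # hypothetical title
author <- "Author A"            # hypothetical author

# Apostrophes around the title need no escaping inside double quotes.
with_apostrophes <- str_c("'", str_to_title(title), "' by ", author)

# A literal double quote inside a double-quoted string is escaped as \".
with_double_quotes <- str_c("\"", str_to_title(title), "\" by ", author)

# writeLines() shows the strings as they are, without the escape characters.
writeLines(with_apostrophes)
writeLines(with_double_quotes)
```

In a query, the same str_c() call would go inside mutate() to define full_name.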

5.2 Working with string lengths

A common pitfall of new R programmers working with strings concerns calculations related to the string’s length.

It might seem that the length() function would be appropriate. Let’s see how it applies to the fruit_vector that we defined above:

That isn’t what we were looking for! This tells us how many items are in the vector. Maybe if we just apply it to a single character string:

No, that’s definitely not what we’re looking for.

What we actually should use is the nchar() function. This operates on each element of a vector, as here:

And it also behaves appropriately on a single character string:
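The contrast between length() and nchar() can be sketched as follows. The contents of fruit_vector are an assumption; the lesson's own vector may differ:

```r
fruit_vector <- c("apple", "banana", "cherry", "date", "elderberry", "fig")

# length() counts the elements of the vector, not the characters.
n_items <- length(fruit_vector)
print(n_items)

# A single string is a vector with one element, so length() returns 1.
n_single <- length("banana")
print(n_single)

# nchar() counts the characters in each element...
chars_each <- nchar(fruit_vector)
print(chars_each)

# ...and also works on a single string.
chars_single <- nchar("banana")
print(chars_single)
```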

Test your knowledge

Define a query that defines a column that contains the length of the title on every row. Then calculate the maximum title length.

Solution

Let’s go through this query line-by-line:

  1. We only need the title column, so we select() it.
  2. We calculate a new column titlelen that contains the number of characters in title.
  3. We use summarize() — with no group_by() — to calculate the maximum title length in the whole data frame.
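The three steps above might be written as in this sketch. The books data frame is a hypothetical stand-in for bestsellers, and max_titlelen is a column name of our choosing:

```r
library(dplyr)

# Hypothetical stand-in for bestsellers, with placeholder titles.
books <- data.frame(
  title = c("SHORT", "A MUCH LONGER TITLE", "MEDIUM ONE")
)

max_len <-
  books |>
    select(title) |>                              # 1. keep only title
    mutate(titlelen = nchar(title)) |>            # 2. characters per title
    summarize(max_titlelen = max(titlelen))       # 3. one row: the maximum
print(max_len)
```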

Test your knowledge

Define a query that selects all titles which have 60 or more characters. Do not show a specific title more than once.

A bit of help

The distinct() function gets rid of duplicates when used instead of select().

A bit of help

Here is the form of the query that we came up with:

bestsellers |>
  mutate(titlelen _______) |>
  filter(titlelen _______) |>
  distinct(______)

Solution

Here’s an explanation of this query:

  1. Create a new column called titlelen that contains the length of title for each row.
  2. Filter to include those rows with titlelen greater than or equal to 60.
  3. Use distinct() to ensure that no titles are repeated in the display.

5.3 Comparing strings

5.3.1 A helpful function

To help with the following discussion, we are going to use the following function (more of this to come in the lesson on functions). Given a character, it returns its 7-bit binary encoding:

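char_to_binary <- function(char) {
  # Integer code of the character (for example, 65 for "A")
  ascii_value <- utf8ToInt(char)
  # intToBits() lists the lowest bit first, so reverse before pasting
  binary_value <- paste(rev(as.integer(intToBits(ascii_value))), collapse = "")
  # Drop the leading zeros left over from the 32-bit representation
  binary_value <- sub("^0+", "", binary_value)
  return(binary_value)
}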

(You do not need to understand how this works; we just wanted to have this available for us to use in the following.)

We say 7-bit because it uses up to seven 0 and 1 values to represent the character.

Consider it when applied to the elements in the following character vector:

We are more used to thinking about integers rather than bitwise encoding, so here are the integers represented by the above 7-bit encodings:
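As an illustration, here is the function applied to a small character vector of our own choosing (the lesson's vector may differ), followed by the matching integers from utf8ToInt():

```r
# Copied from the definition above so that this sketch runs on its own.
char_to_binary <- function(char) {
  ascii_value <- utf8ToInt(char)
  binary_value <- paste(rev(as.integer(intToBits(ascii_value))), collapse = "")
  binary_value <- sub("^0+", "", binary_value)
  return(binary_value)
}

chars <- c("A", "B", "a", "b")

# The binary encoding of each character.
encodings <- sapply(chars, char_to_binary)
print(encodings)

# The same values as integers.
code_points <- sapply(chars, utf8ToInt)
print(code_points)
```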

5.3.2 Letter encoding (representation)

This seems esoteric (how could any of the following matter?!), but much of the confusion discussed in this section could be swept aside if one understood, and could remember, how computers (and, more specifically, R and other data management systems) represent characters.

5.3.2.1 ASCII Character Encoding

Plain ASCII characters use 7 bits for encoding. This allows for a total of 128 (2^7) unique characters, which include:

  • 26 uppercase letters (A-Z)
  • 26 lowercase letters (a-z)
  • 10 digits (0-9)
  • Basic punctuation marks (e.g., period, comma, semicolon)
  • Control characters (e.g., newline, carriage return)
  • A few special characters (e.g., @, #, $, %)

Here’s a bit more explanation:

  • 7-bit Encoding: ASCII uses 7 bits to represent each character, which provides 2^7 = 128 possible values.
  • Character Set: The 128 characters include letters, digits, punctuation, control characters, and a few special symbols.

In Section 5.3.1 we saw examples of ASCII representations (which is, maybe not surprisingly at this point, a 7-bit representation) of some characters.

If you want to know the specific ASCII values for characters, refer to this ASCII table.

By understanding that plain ASCII characters use 7 bits for encoding, you can better appreciate the limitations and capabilities of the ASCII character set.

5.3.2.2 Extended ASCII Character Encoding

You might sometimes see the “Extended ASCII” character set referred to. This was an initial response to the complaint that ASCII could not represent letters that are commonly used:

  • Accented letters (e.g., é, ñ, ü)
  • Additional punctuation and special symbols (e.g., ©, ®, ±)
  • Graphical characters (e.g., line-drawing characters)

This character set added an additional bit to ASCII, thus becoming an 8-bit representation. The representation of the original 128 characters stayed the same, merely adding a 0 to the beginning (left side) of the encoding.

5.3.2.3 Unicode

Unicode is a comprehensive character encoding standard designed to support the digital representation of characters from virtually all writing systems in the world. It includes characters from almost all modern and historical scripts, as well as symbols, punctuation marks, and control characters. It supports over 143,000 characters from more than 150 scripts.

We are not going to go into the details of how Unicode represents characters, as that is far beyond the scope of this lesson. However, know that it is complicated, extensive, flexible, and all-inclusive.

5.3.3 Upper- and lower-case letters

Now, getting back to specific concerns that you will have to deal with when writing your own scripts…

Another common pitfall for new R programmers relates to working with letter-cases (i.e., lowercase and uppercase letters). (FYI, a detailed description related to the following is on this rforir page.)

When we’re working with numbers, we already know that it is true that 3 < 5, 7 < 12, and so on.

With characters, we also know that "a" < "b", "C" < "D" and so on are true.

However, when we start mixing cases then we are asking for nothing but trouble and confusion. For example:

This seems counterintuitive! “Capital letters are more important than lower-case letters, so they should count as greater than lower-case letters!!”

Well, that might be true, but computers don’t care about that. Consider the following, referring back to our old friend the char_to_binary() function:

Now, when we are reminded of the binary representation of data and the use of ASCII, the results of these comparisons don’t seem so random or counterintuitive any longer.
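A caveat worth knowing: R compares strings using the collating rules of the current locale, so the ASCII ordering described here is what you get under the "C" locale; other locales may order mixed-case letters differently. The sketch below therefore switches to the "C" collation before comparing and restores the previous setting afterwards. The variable names are ours:

```r
# Remember the current collation and switch to plain "C" ordering.
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))

# Within one case, alphabetical order holds.
same_case <- c("a" < "b", "C" < "D")
print(same_case)

# Across cases, every capital letter sorts before every lower-case letter.
lower_vs_upper <- "a" < "B"
upper_vs_lower <- "Z" < "a"
print(c(lower_vs_upper, upper_vs_lower))

# The integer codes explain why.
codes <- c(B = utf8ToInt("B"), Z = utf8ToInt("Z"), a = utf8ToInt("a"))
print(codes)

# Restore the original collation.
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```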

5.3.4 Punctuation, digits, and spaces

Okay, so much for letters. Let’s think about punctuation, digits, and spaces for a bit.

We already know that all upper case letters have smaller values than all lower case letters, but what about a space? Let’s compare it to the letter with the smallest value, "A":

Let’s look at some other comparisons:

And here are some comparisons involving digits — mind you, that’s the character representation of digits!

This is kind of a mess. Character digits are greater than some punctuation and less than other punctuation. They are also less than all upper- and lower-case letters.

These are all rather random, and it’s rare that they would matter in much of the work that you do. What you should take out of all of this is the following rule of thumb, which holds for spaces, digits, and most (though not all) punctuation:

punctuation, digits, and spaces < capital letters < lower-case letters
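One way to see this ordering is to sort a handful of characters by their codes. The sketch below uses sort() with method = "radix", which always follows "C"-locale ordering for character vectors; the choice of characters is ours. It includes the underscore, one of the few punctuation marks whose code falls between the capital and lower-case letters:

```r
chars <- c("a", "Z", "A", "9", "0", ":", "!", " ", "_")

# The integer code of each character.
codes <- sapply(chars, utf8ToInt)
print(codes)

# Radix sorting ignores the locale and orders strictly by these codes.
sorted_chars <- sort(chars, method = "radix")
print(sorted_chars)
```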

5.3.5 The importance of length

Suppose that two character strings have the exact same characters up to the point that one of the strings ends. Then the shorter string will be considered less than the longer string. For example:
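A sketch of the length rule follows, including the trailing-space case that makes trimming worthwhile. As before, we switch to the "C" collation so the result does not depend on the machine's locale; the example strings are ours:

```r
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))

# The shorter string is a prefix of the longer one, so it sorts first.
prefix_first <- "abc" < "abcd"
print(prefix_first)

# A trailing space makes two otherwise identical names different.
space_first <- "Smith" < "Smith "
names_equal <- "Smith" == "Smith "
print(c(space_first, names_equal))

# trimws() removes leading and trailing white space.
equal_after_trim <- trimws("  Smith ") == "Smith"
print(equal_after_trim)

invisible(Sys.setlocale("LC_COLLATE", old_collate))
```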

Note that you will almost always want to trim white spaces from the ends of the strings that you’re going to manipulate. See this page on rforir for a more in-depth discussion.

5.3.6 Conversion to upper- or lower-case

As discussed on this rforir page, R supplies some functions to convert a string to upper- or lower-case. See the following:
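A short sketch using the stringr conversion functions is shown here. We assume these are the functions the lesson has in mind; base R's toupper() and tolower() behave similarly for plain English text:

```r
library(stringr)

name <- "Bernard-Zza"

upper_name <- str_to_upper(name)
lower_name <- str_to_lower(name)
print(c(upper_name, lower_name))

# str_to_title() capitalizes the first letter of each word.
title_text <- str_to_title("THE QUICK BROWN FOX")
print(title_text)

# After conversion, comparisons no longer depend on the original case.
same_ignoring_case <- str_to_lower("Apple") == str_to_lower("APPLE")
print(same_ignoring_case)
```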

Test your knowledge

Define a query that displays the titles of books (with duplicates removed) that are ordered by the lower-case version of the title. Do not show the lower-case version of the title in the result.

Solution

Here is how we constructed the query:

  • Create a column that contains the lower-case version of the title column.
  • Sort the rows by that column.
  • Remove duplicates so that no title is shown more than once.

Notice how many rows are removed with distinct().

5.3.7 Summary about comparing strings

You will have to come up with your own strategy for comparing strings if you are going to do so at scale. You have alternatives, and you should be thoughtful about what you want to do:

  • Compare the strings as they are with no pre-processing. Take R’s decision as final.
  • Convert all characters to upper- (or lower-) case (you just have to decide which one to use), remove all punctuation, and retain all spaces. This might be appropriate if you have all English-language characters.

Test your knowledge

Display the author and title (with no duplicates) for all rows for which the author comes later in the alphabet than the title. Be sure that the letter case in author and title does not affect this comparison. Also, sort the results by the author and then title.

Solution

Let’s explain the following query:

  1. Create a column called uppertitle that is the uppercase version of the title. (Note that you could have converted both to lowercase if you preferred.)
  2. Create a column called upperauthor that is the uppercase version of the author.
  3. Use filter() to include those rows for which upperauthor is greater than uppertitle.
  4. Just list the unique combinations of author and title (i.e., get rid of duplicates).
  5. Sort the rows that are displayed by author and title.
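The five steps above can be sketched as follows. The books data frame is a hypothetical stand-in for bestsellers; its values are made up, with mixed letter case and one repeated row so that each step has something to do:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for bestsellers; the first row is repeated.
books <- data.frame(
  author = c("Zoe Adams", "adam Lee", "Mona Hill", "Zoe Adams"),
  title  = c("apple days", "Zebra", "lake stories", "apple days")
)

later_authors <-
  books |>
    mutate(uppertitle  = str_to_upper(title),      # 1. uppercase title
           upperauthor = str_to_upper(author)) |>  # 2. uppercase author
    filter(upperauthor > uppertitle) |>            # 3. author after title
    distinct(author, title) |>                     # 4. drop duplicates
    arrange(author, title)                         # 5. sort the output
print(later_authors)
```

Without the uppercase conversion, a lower-case author name could compare differently against a capitalized title.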

5.4 Detecting patterns in strings

Sometimes we want to look for a pattern in a string. We can use the str_detect() function to do this. To see if any last names have a capital Z, we can do the following:

Technically that will find any name with a capital Z in it, not just at the beginning. To find names that start with “Z”, we can use the ^ character to indicate the start of the string.
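Both patterns can be sketched on the last names from the df data frame (only the last_name column is recreated here; the result names any_z and starts_z are ours):

```r
library(dplyr)
library(stringr)

df <- data.frame(
  last_name = c("Smith", "Jones", "Kline", "White",
                "Zettle", "Bernard-Zza")
)

# A capital Z anywhere in the name.
any_z <- df |> filter(str_detect(last_name, "Z"))
print(any_z)

# A capital Z at the start of the name only.
starts_z <- df |> filter(str_detect(last_name, "^Z"))
print(starts_z)
```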

There are a LOT of string functions, and practically anything you want to do has a function for that.

5.5 Splitting strings

It commonly comes up in IR work that we have a data set that has a column with multiple pieces of information in it. For example, a column might have City, State in it. We might want to split that into two columns, one for city and one for state. There is more than one way to do this in R, but R/tidyverse contains a particularly easy function called separate_wider_delim().

We’ll split the ‘home’ column into two columns, ‘city’ and ‘state’.

Note that if you wanted to add these columns to the df data frame, then you should put df <- at the beginning of the above.
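A sketch of the split on a three-row excerpt of df follows. It assumes tidyr 1.3.0 or later, where separate_wider_delim() was introduced, and stores the result in a new name, df_split, which is our choice:

```r
library(tidyr)

# A three-row excerpt of the df student data frame.
df <- data.frame(
  first_name = c("Alice", "Bob", "Charlie"),
  home       = c("Dayton, OH", "Columbia, SC", "Cleveland, OH")
)

# Split home at the comma-plus-space into two new columns.
df_split <-
  df |>
    separate_wider_delim(home,
                         delim = ", ",
                         names = c("city", "state"))
print(df_split)
```

Including the space in the delimiter keeps the state values free of a leading space.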

Test your knowledge

Define a query that chooses those rows (and then displays the author and title with no duplicates) for which the author column contains either edited (note the space), and (note the space again), or a comma (,).

A bit of help

The vertical bar | is the logical OR operator.

Solution

The str_detect() function is the centerpiece of this answer:
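One way to write such a query is sketched below on a hypothetical stand-in for bestsellers with placeholder authors. We assume the patterns are "edited " with a trailing space and " and " with a space on each side; adjust them if the exercise intends something else:

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for bestsellers, with placeholder values.
books <- data.frame(
  title  = c("BOOK ONE", "BOOK TWO", "BOOK THREE", "BOOK FOUR"),
  author = c("edited by Author A", "Author B and Author C",
             "Author D, Author E", "Author F")
)

multi_author <-
  books |>
    filter(str_detect(author, "edited ") |
           str_detect(author, " and ") |
           str_detect(author, ",")) |>
    distinct(author, title)
print(multi_author)
```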

5.6 Creating new rows from strings

Another common case is to create a new row of data for each piece of information in a string. For example, consider the following student data that has the student ID, the student’s one (or more) majors, and their expected graduation date.

For some types of reports, we’d like a separate row of data for each major that duplicates the other fields in the data frame. We can use the separate_rows() function to do this.

In the results, note that both id=1 and id=3 now are in two separate rows since students 1 and 3 have two majors.
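As a sketch, here is separate_rows() applied to a hypothetical students data frame. The column names and values are made up, but they are arranged so that students 1 and 3 each have two majors:

```r
library(tidyr)

# Hypothetical student data: students 1 and 3 have two majors each.
students <- data.frame(
  id        = c(1, 2, 3),
  majors    = c("Math, Physics", "Biology", "History, English"),
  grad_date = c("2026-05-15", "2025-12-15", "2027-05-15")
)

# One row per major; id and grad_date are repeated as needed.
students_long <-
  students |>
    separate_rows(majors, sep = ", ")
print(students_long)
```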

Test your knowledge

Define a query that creates separate rows for each author. They are typically separated by commas and the word and. Remove duplicates. Only display the author and title. Order the results by title and author.

What do you notice about the results? Did this solution handle all situations in the author column?

Solution

The first separate_rows() defines a data frame for which all of the commas have been used to create separate rows. The second separate_rows() defines a data frame that handles the ands in the author. Note that we included spaces after the comma and before and after and to ensure that and is a complete word.

While these two calls handle a wide variety of cases, they do not handle at least a few:

  • with is used as a connector
  • Some authors are listed such as “Scott and Teresa Moore” so “Scott” is then listed as the whole author’s name.
  • others is not an author.

And so on. So this was a start, but if you are doing this kind of query for your job, then you need to be much more detail-oriented and complete.
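The two-step split, and one of the pitfalls listed above, can be sketched on a hypothetical stand-in for bestsellers. The titles and most authors are placeholders; the third author reproduces the “Scott and Teresa Moore” case:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for bestsellers.
books <- data.frame(
  title  = c("BOOK ONE", "BOOK TWO", "BOOK THREE"),
  author = c("Author A, Author B and Author C",
             "Author D",
             "Scott and Teresa Moore")
)

split_authors <-
  books |>
    separate_rows(author, sep = ", ") |>     # split at comma plus space
    separate_rows(author, sep = " and ") |>  # then at the whole word "and"
    distinct(author, title) |>
    arrange(title, author)
print(split_authors)
```

In the output, "Scott" appears as a complete author name, which is exactly the kind of case that needs extra handling.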