Functions
1 Using this document
Within this document are blocks of R code. You can edit and execute this code as a way of practicing your R skills:
- Edit the code that is shown in the box. Click on the Run Code button.
- Make further edits and re-run that code. You can do this as often as you’d like.
- Click the Start Over button to bring back the original code if you’d like.
- If the code has a “Test your knowledge” header, then you are given instructions and can get feedback on your attempts.
2 Set up: Create a small data frame
A data frame is like a table in a database or a spreadsheet. It has rows and named columns. Most of the time, we load data from a file or database. But for a simple example, we can create one like this.
See data_types.qmd for more on these data structures.
library(dplyr)  # provides mutate(), used below

df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", NA),
  last_name = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age = c(18, 18, 19, 20, 21),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# Store class as an ordered factor so its levels sort from Freshman to Senior
df <- df |>
  mutate(class = factor(class,
                        levels = c("Freshman", "Sophomore",
                                   "Junior", "Senior"),
                        ordered = TRUE))
3 Functions in R
A “function” is a procedure that takes any number of inputs and produces a single output. You’re probably familiar with functions in Excel, like sum(...), average(...), etc. In R, functions are used in a similar way.
A function is called by its name, followed by parentheses. Inside the parentheses, you put the arguments (inputs) to the function.
3.1 Functions on data frames
Here are some functions that are handy for quick looks at a data frame, often used in the console. For more information, see this page on rforir.
3.1.1 Summary data frame information
These similar functions provide summary information. The first one, summary(), gives a summary of each column in the data frame. Note that this function tells you which columns are character columns, but it doesn’t tell you much besides that about them. It’s much more informative about numeric and categorical (factor) columns.
The second one, glimpse(), is a more succinct version of summary(), meaning, of course, that it leaves some information out. You’ll just have to decide which one to use at which time.
Finally, str() is much like glimpse() but it also explicitly tells you the data type for each column.
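As a rough sketch, the three calls could look like the following. The data frame is rebuilt here only so that the block runs on its own, and the guard around glimpse() is an assumption that dplyr may not be installed, not something the lesson requires.

```r
# Same five-student data frame as in the set-up section
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", NA),
  last_name = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class = factor(c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
                 levels = c("Freshman", "Sophomore", "Junior", "Senior"),
                 ordered = TRUE),
  age = c(18, 18, 19, 20, 21),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

summary(df)  # per-column summary: quartiles for numbers, counts for factors
str(df)      # compact structure listing, one line per column

# glimpse() comes from dplyr, so only call it when that package is available
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(df)
}
```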
3.1.2 Other data frame information
More specific functions for data frames include the following.
3.1.2.1 Number of rows in a data frame
3.1.2.2 Number of columns in a data frame
3.1.2.3 Dimensions of a data frame
3.1.2.4 First few rows of a data frame
3.1.2.5 Last few rows of a data frame
3.1.2.6 Names of the columns in a data frame
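The six functions listed under these headings are all part of base R. A minimal sketch, with the set-up data frame rebuilt so the block is self-contained (the row counts passed to head() and tail() are arbitrary choices for illustration):

```r
# Same five-student data frame as in the set-up section
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", NA),
  last_name = c("Smith", "Jones", "Kline", "White", "Zettle"),
  class = c("Freshman", "Sophomore", "Junior", "Senior", "Senior"),
  age = c(18, 18, 19, 20, 21),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

nrow(df)     # number of rows: 5
ncol(df)     # number of columns: 6
dim(df)      # rows, then columns: 5 6
head(df, 3)  # first three rows (the default is six)
tail(df, 2)  # last two rows
names(df)    # column names
```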
3.2 Detecting blanks in vectors (columns)
For any type of data, it’s useful to be able to detect missing values. We can do that with the is.na() function. The first line tells us about the first_name column and the second one about the last_name column.
Now, suppose that the df data frame had thousands or millions of rows. We certainly couldn’t scan the list of returned values and look to see if there are any TRUE values. No, we couldn’t.
But we could let R count up the number of blank values for us!
Much better. This tells us that only the first_name column has a blank.
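The checks and counts described above might look like this. The sketch uses a trimmed copy of the set-up data frame and reaches each column with `$`, which is an assumption about how the lesson's own code is written.

```r
# Trimmed copy of the set-up data frame with only the two name columns
df <- data.frame(
  first_name = c("Alice", "Bob", "Charlie", "David", NA),
  last_name = c("Smith", "Jones", "Kline", "White", "Zettle")
)

# One TRUE/FALSE per row; TRUE marks a missing value
is.na(df$first_name)
is.na(df$last_name)

# TRUE counts as 1 and FALSE as 0, so sum() counts the blanks
sum(is.na(df$first_name))  # 1
sum(is.na(df$last_name))   # 0
```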
The next question popping into your head would be, “Well, which row/student has missing information in the first_name column?” An R/tidyverse query using is.na() can help us out with that:
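One way to write such a query, assuming dplyr is installed. The variable name `missing_first` is hypothetical and is only there so the result can be inspected.

```r
library(dplyr)

# Trimmed copy of the set-up data frame
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", NA),
  last_name = c("Smith", "Jones", "Kline", "White", "Zettle")
)

# Keep only the rows where first_name is missing
missing_first <- df |>
  filter(is.na(first_name))

missing_first  # one row: id 5, last_name "Zettle"
```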
For more information on blanks and missing values, see this page on rforir.
This is.na() function is commonly used in a filter, to remove rows with blanks. The following query creates a new data frame df_clean that contains only the rows that do not have a blank value in first_name:
As a reminder, the exclamation point acts as a NOT operator, so filtering with !is.na(first_name) includes all rows for which first_name is not blank.
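A minimal sketch of that filter, assuming dplyr is installed and using a trimmed copy of the set-up data frame:

```r
library(dplyr)

# Trimmed copy of the set-up data frame
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  first_name = c("Alice", "Bob", "Charlie", "David", NA),
  last_name = c("Smith", "Jones", "Kline", "White", "Zettle")
)

# Keep the rows where first_name is NOT missing
df_clean <- df |>
  filter(!is.na(first_name))

nrow(df_clean)  # 4: the row with the blank first_name is gone
```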
3.3 Functions on numerical vectors
Recall that a vector is a single column of a data frame. Just like with spreadsheets, these can be numbers, text, dates, or other types (see the data types lesson for more on this). For more on numeric functions see this page on rforir.
The following is a selection of R’s functions on numerical vectors.
3.3.1 Square root of each element in a vector
3.3.2 Mean (average) of all values in a vector
3.3.3 Median of all values in a vector
3.3.4 Standard deviation of all values in a vector
3.3.5 Minimum value in a vector
3.3.6 Maximum value in a vector
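The functions named in these headings are sqrt(), mean(), median(), sd(), min(), and max(). A short sketch that applies each one to the gpa values from the set-up data frame, copied into a standalone vector so the block runs on its own:

```r
# GPA values from the set-up data frame
gpa <- c(3.5, 3.2, 3.8, 3.0, 3.9)

sqrt(gpa)    # square root of each element, so five values come back
mean(gpa)    # 3.48
median(gpa)  # 3.5
sd(gpa)      # sample standard deviation
min(gpa)     # 3
max(gpa)     # 3.9
```

Note that sqrt() returns one value per element, while the other five collapse the whole vector into a single number.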
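3.3.7 Handling NA values
The summary functions above (mean(), median(), sd(), min(), and max()) will return NA if there are blanks in the vector, e.g.:
To ignore blanks, each of them accepts an extra argument na.rm = TRUE. Read this as “yes, remove all NA values before calculating”. The exception is sqrt(), which works element by element: it returns NA only in the positions that are blank, and it does not take na.rm.
Verify this by changing mean() to any of the other summary functions from above.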
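A small illustration of both behaviours. The `scores` vector is a hypothetical example with one blank, not data from the lesson.

```r
# Hypothetical vector with one missing value
scores <- c(3.5, NA, 3.8)

mean(scores)                # NA, because one value is missing
mean(scores, na.rm = TRUE)  # 3.65, the mean of 3.5 and 3.8
```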
3.3.8 Rounding values
It’s common to want to round results, which we can do like this:
The last example can be customized by giving the function more information. If we want to round to 1 decimal place, we can do this:
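Both forms of rounding can be sketched as follows, using the gpa values from the set-up data frame. Rounding the mean is an assumption about what the lesson rounds; any number would do.

```r
# GPA values from the set-up data frame
gpa <- c(3.5, 3.2, 3.8, 3.0, 3.9)

round(mean(gpa))              # 3: with no extra information, rounds to a whole number
round(mean(gpa), digits = 1)  # 3.5: rounds to one decimal place
```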
4 Named arguments
The inputs that go between the parentheses for a function are called arguments for historical reasons. In R, every argument has a name, as in the example with the round() function above. The second argument, digits, specifies how many decimal places we want in the result. You can find the names of a function’s arguments in any of four ways: (1) type ?function_name, (2) put the cursor on the function in RStudio and hit F1, (3) use the Help tab in the lower right panel of RStudio to search, or (4) just search the internet.
The help information for round() shows that the names of the arguments are x and digits. The x argument is the vector of numbers to round, and digits is the number of decimal places to round to.
4.1 How named arguments work
We can omit the name of the arguments if they appear in the right order. It would be tedious to type this all the time:
Instead we can leave out the names, as long as the arguments are in the correct order, as with this example:
This will not do what we expected, because the arguments are in the wrong order:
But this works, because the argument names are given:
For functions that are common, you will find yourself generally omitting the argument names.
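The four cases described above, in the same order, might look like this with round(). The value 3.14159 is an arbitrary number chosen for illustration.

```r
# 1. Fully named: clear, but tedious to type every time
round(x = 3.14159, digits = 2)  # 3.14

# 2. Names omitted, arguments in the documented order (x, then digits)
round(3.14159, 2)               # 3.14

# 3. Names omitted, wrong order: this rounds the number 2, not 3.14159
round(2, 3.14159)               # 2

# 4. Wrong order, but named, so R matches each value to the right argument
round(digits = 2, x = 3.14159)  # 3.14
```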
4.2 Example: generating random numbers
Suppose you want to give a researcher student data, but want some protections on the privacy of individuals. We could slightly perturb GPAs by adding a small random number to each one. Here’s a way to do that:
We first generate a noise vector with random numbers; we use the n() function to ensure that we generate the right number of elements. Then we add that vector to gpa to create a gpa_noisy column. Finally, we use the select() operator to display three columns.
To be clear, runif() should not be read as “run if”! It is a function to generate random numbers from a uniform distribution, hence r + unif, or runif(). (That confused one of us for quite a while.) This function has three arguments: one to specify the number of random numbers to generate, and then the minimum and maximum in the range of the uniform distribution.
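A sketch of what this could look like, assuming dplyr is installed. The range of -0.1 to 0.1 comes from the lesson's own runif() call; the set.seed() line and the name `noisy` are additions for illustration, so that the random numbers repeat between runs and the result can be inspected.

```r
library(dplyr)

# Trimmed copy of the set-up data frame
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

set.seed(42)  # assumption: fix the seed so the example is repeatable

# The three arguments of runif(): how many numbers, then the minimum and maximum
runif(5, min = -0.1, max = 0.1)

# Inside mutate(), n() gives the number of rows, so noise has one value per student
noisy <- df |>
  mutate(noise = runif(n(), min = -0.1, max = 0.1),
         gpa_noisy = gpa + noise) |>
  select(id, gpa, gpa_noisy)

noisy
```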
5 Function composition
We can use functions within functions. In the previous example, we hard coded the 5 in runif(5, min = -0.1, max = 0.1), but that only works because the data frame has five students. What about next time we run it and it has 12? It would be better to specify the correct number of rows automatically, like this:
The first argument to the random number generator is n, which is the number of random numbers to generate. Here we use the nrow() function to get the number of rows in the data frame. This way, the code will work no matter how many rows are in the data frame.
Most commonly, we use functions to transform data within a mutate() or summarize() function, as with this example:
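A minimal base R sketch of the nrow() version, using a trimmed copy of the set-up data frame:

```r
# Trimmed copy of the set-up data frame
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# nrow(df) is evaluated first, and its result becomes the n argument of runif()
noise <- runif(nrow(df), min = -0.1, max = 0.1)

length(noise)  # 5, matching the number of rows, whatever that number is
```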
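The lesson's own example is not reproduced here; the following is a hypothetical stand-in, assuming dplyr is installed, that nests functions inside both mutate() and summarize(). The column names `gpa_sqrt` and `avg_gpa` are made up for illustration.

```r
library(dplyr)

# Trimmed copy of the set-up data frame
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  gpa = c(3.5, 3.2, 3.8, 3.0, 3.9)
)

# round() wraps sqrt(), and both sit inside mutate()
with_sqrt <- df |>
  mutate(gpa_sqrt = round(sqrt(gpa), digits = 2))

# round() wraps mean(), and both sit inside summarize()
stats <- df |>
  summarize(avg_gpa = round(mean(gpa, na.rm = TRUE), digits = 2))

stats  # a single row with avg_gpa = 3.48
```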
Look closely at the code to see where functions are within other functions. One annoyance is making sure the parentheses are balanced. The RStudio IDE will help with this by highlighting the matching parenthesis when you put the cursor on one of them. You can also turn on “rainbow” mode, which colors them differently depending on their nesting level. To do that, go to Tools -> Global Options -> Code -> Display -> Show rainbow parentheses.
6 Creating your own functions
You can create your own functions in R. This is a powerful way to make your code more readable and reusable. Here’s an example of a simple function that takes a number and returns the square of that number.
sqrd <- function(x) {
  return(x^2)
}
When you execute the code block, it stores the function definition for later use. After that you can use it like any other function:
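For instance, once sqrd() is defined it can be called on a single number or on a whole vector. The definition is repeated here so the block runs on its own, and the input values are arbitrary.

```r
# Definition repeated from above
sqrd <- function(x) {
  return(x^2)
}

sqrd(4)           # 16
sqrd(c(1, 2, 3))  # 1 4 9, because ^ works element by element
```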
7 Try it
Do the following before continuing with the rest of this section.
Test your knowledge
Create a function cubd that takes a number and returns the cube of that number (that is, raised to the third power). Powers can be created with the ^ operator as with the sqrd() function we created.
Solution
It is really as simple as changing the 2 above to a 3:
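Following that description, one possible solution is:

```r
# Same shape as sqrd(), with the exponent changed from 2 to 3
cubd <- function(x) {
  return(x^3)
}

cubd(2)  # 8
```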
Now check a math identity that’s kind of amazing. If we sum up a sequence of numbers 1, 2, ..., N and then square the result, it’s the same as summing the cubes of the numbers 1, 2, ..., N.
Note that we could write these using pipes, and it might be easier to follow:
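Here is one way to check the identity, first with nested calls and then with pipes. The choice of N = 10 and the variable name `numbers` are assumptions for illustration; the two functions are repeated so the block runs on its own.

```r
sqrd <- function(x) {
  return(x^2)
}

cubd <- function(x) {
  return(x^3)
}

N <- 10          # assumption: any positive whole number works
numbers <- 1:N

# Nested version: read from the inside out
sqrd(sum(numbers))  # 3025
sum(cubd(numbers))  # 3025

# Piped version: read from left to right
numbers |> sum() |> sqrd()
numbers |> cubd() |> sum()
```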
8 A reminder about how pipes work
See the lesson on pipes for more on how pipes work. Basically, the |> operator takes the result of the left side and puts it as the first argument on the right. This avoids the readability problem with parentheses when we nest functions within functions. They get executed from the inside-out, which is not the most intuitive way to read the script. Here’s an exaggerated example:
What does this do?
It’s hard to tell because we have to work from the inside out. Here’s the same thing with the pipe:
This is much easier to read, because we can read from left to right, and see the data flow through the functions. The round(2) now means take the input from the left and round it to 2 decimal places, which is intuitive.
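The lesson's own example is not reproduced here. As a hypothetical stand-in, the sketch below writes the same calculation both ways on the gpa values from the set-up data frame; the particular chain of mean(), sqrt(), and round() is an assumption chosen only to show the contrast.

```r
# GPA values from the set-up data frame
gpa <- c(3.5, 3.2, 3.8, 3.0, 3.9)

# Nested: the innermost call, mean(), runs first, and round() runs last
nested <- round(sqrt(mean(gpa)), 2)

# Piped: each result becomes the first argument of the next call
piped <- gpa |>
  mean() |>
  sqrt() |>
  round(2)

identical(nested, piped)  # TRUE: both spellings compute the same value
```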