Functions in R
library(tidyverse)
library(palmerpenguins)
Run the code below in your console to download this exercise as a set of R scripts.
usethis::use_course("cis-ds/pipes-and-functions-in-r")
Functions are an important tool in the computational social scientist’s toolkit. They enable you to avoid repetition and copy-and-paste and greatly increase the efficiency of your code writing.
- They are easy to reuse. If an update to the code is necessary, you revise it in one location and the changes propagate to all other components that implement the function.
- They are self-documenting. Give your function a good name and you will easily remember the function and its purpose.
- They are easier to debug. There are fewer chances to make mistakes because the code only exists in one location. When copying and pasting, you may forget to copy an important line or fail to update a line in one location.
In fact, you have used functions the entire time you have programmed in R. The only difference is that the functions were written for you. tidyr, dplyr, ggplot2, all of these libraries contain major functions for tidying, transforming, and visualizing data. You have the power to write your own functions. Well, if you don’t already you soon will.
Components of a function
Functions have three key components:
- A name. This should be informative and describe what the function does.
- The arguments, or list of inputs, to the function. They go inside the parentheses in function().
- The body. This is the block of code within {} that immediately follows function(...), and is the code that you developed to perform the action described in the name using the arguments you provide.
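As a minimal sketch of these three components, consider the function below. The name add_one and its argument x are hypothetical, invented purely for illustration.

```r
# name: add_one (hypothetical, for illustration only)
# argument: x, a numeric vector
add_one <- function(x) {
  # body: the code that runs each time the function is called
  x + 1
}

add_one(c(1, 2, 3))
## [1] 2 3 4
```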
The rescale function
Here is a user-generated function from R for Data Science. Analyze it and identify the three key components.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
## [1] 0.0 0.5 1.0
rescale01(c(-10, 0, 10))
## [1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
## [1] 0.00 0.25 0.50 NA 1.00
Click for the solution
- Name - rescale01
  - This is a function that will rescale a variable from 0 to 1
- Arguments
  - This function takes one argument
    - x - the variable to be transformed
    - We could call the argument whatever we like, but x is a conventional name
    - Multiple inputs would be x, y, z, etc., or take on informative names such as data, formula, na.rm, etc.
    - You should use what makes sense
- Body
  - This takes two lines of code
    - Calculate the range of the variable (its minimum and maximum values) and ignore missing values. Save this as an object called rng.
    - For each value in the variable, subtract the minimum value in the variable and divide by the difference between the maximum and minimum value. Use parentheses, as in arithmetic notation, to make sure the order of operations is followed.
  - By default, whatever is the last thing generated by the function is returned as the output
This function can easily be reused for any numeric variable. Rather than writing out the contents of the function every time, we just use the function itself.
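As a sketch of that reuse, the example below applies rescale01() to every column of a small data frame using base R's lapply(). The data frame df and its values are made up for illustration.

```r
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# a small made-up data frame (values are illustrative, not real data)
df <- data.frame(
  height = c(150, 160, 170, 180),
  weight = c(50, 65, 80, NA)
)

# apply the same function to every column without rewriting its body;
# df[] keeps the result as a data frame rather than a plain list
df[] <- lapply(df, rescale01)
df
```

Each column now runs from 0 to 1, and the missing weight stays missing.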
Pythagorean theorem function
Analyze the following function.
- Identify the name, arguments, and body
- What does it do?
- If a = 3 and b = 4, what should we expect the output to be?
pythagorean <- function(a, b) {
  hypotenuse <- sqrt(a^2 + b^2)
  return(hypotenuse)
}
Click for the solution
- Name - pythagorean
  - Calculates the length of the hypotenuse of a right triangle.
- Arguments
  - These are the inputs of the function. They go inside function()
  - This function takes two arguments
    - a - length of one side of a right triangle
    - b - length of another side of a right triangle
- Body
  - Block of code within {} that immediately follows function(...)
  - Here, I wrote two lines of code
    - The first line creates a new object hypotenuse which is the square root of the sum of squares of the two sides of the right triangle (also called the hypotenuse)
    - I then explicitly return hypotenuse as the output of the function. I could also have written the function as:
pythagorean <- function(a, b) {
  hypotenuse <- sqrt(a^2 + b^2)
}
or even:
pythagorean <- function(a, b) {
  sqrt(a^2 + b^2)
}
But I wanted to explicitly identify each step of the code for others to review. Early on in your function writing career, you will want to be more explicit so future you can interpret your own code. As you practice and become more comfortable writing functions, you can be more relaxed in your coding style and documentation.
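One subtlety is worth checking when a function ends with an assignment: R still returns the assigned value, but it does so invisibly, so calling the function at the console prints nothing unless you wrap the call in parentheses or print(). The sketch below uses base R's withVisible() to show the difference. The names pythagorean_assign and pythagorean_bare are hypothetical, chosen only to keep the two variants apart.

```r
pythagorean_assign <- function(a, b) {
  hypotenuse <- sqrt(a^2 + b^2) # last expression is an assignment
}

pythagorean_bare <- function(a, b) {
  sqrt(a^2 + b^2) # last expression is a plain value
}

# both return 5, but only the second is auto-printed at the console
withVisible(pythagorean_assign(3, 4))$visible
## [1] FALSE
withVisible(pythagorean_bare(3, 4))$visible
## [1] TRUE
```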
How to use a function
When using functions, by default the returned object is merely printed to the screen.
pythagorean(a = 3, b = 4)
## [1] 5
If you want it saved, you need to assign it to an object.
(tri_c <- pythagorean(a = 3, b = 4))
## [1] 5
Objects created inside functions
pythagorean(a = 3, b = 4)
## [1] 5
hypotenuse
## Error in eval(expr, envir, enclos): object 'hypotenuse' not found
Why does this generate an error? Why can we not see the results of hypotenuse? After all, it was generated by pythagorean, right?
Objects created inside a function exist within their own environment. Typically you are working in the global environment. In RStudio, you can see all objects that exist in that environment in the Environment pane (the top-right panel by default).
Objects created within a function are destroyed once the function completes its execution, unless you return the object as part of the output. This is why you do not see hypotenuse listed in the environment - it has already been destroyed.
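A minimal sketch of this behavior, assuming no object called hypotenuse already exists in your session: base R's exists() reports whether a name can be found, which lets us check without triggering an error.

```r
pythagorean <- function(a, b) {
  hypotenuse <- sqrt(a^2 + b^2)
  return(hypotenuse)
}

pythagorean(a = 3, b = 4)

# hypotenuse lived only inside the function call, so it cannot be found here
exists("hypotenuse")

# to keep the value, assign the returned result to a name of your own
tri_c <- pythagorean(a = 3, b = 4)
exists("tri_c")
```

The value survives only because it was returned and assigned to tri_c. The internal name hypotenuse never reaches the global environment.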
Exercise: calculate the sum of squares of two variables
Write a function that calculates the sum of the squared value of two numbers. For instance, it should generate the following output:
my_function(3, 4)
## [1] 25
Click for the solution
sum_of_squares <- function(x, y) {
  return(x^2 + y^2)
}
sum_of_squares(3, 4)
## [1] 25
- Name - sum_of_squares
  - Calculates the sum of the squared value of two variables
- Arguments
  - x - one number
  - y - a second number
- Body
  - The first line squares x and y independently and then adds the results together
Cool fact - this function also works with vectors of numbers
x <- c(2, 4, 6)
y <- c(1, 3, 5)
sum_of_squares(x, y)
## [1] 5 25 61
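This works because arithmetic operators in R such as ^ and + are vectorized: they operate element by element. As a sketch of a related behavior, a length-one argument is recycled against a longer one:

```r
sum_of_squares <- function(x, y) {
  return(x^2 + y^2)
}

# y = 2 is recycled, so each element of x is paired with the same 2
sum_of_squares(c(1, 2, 3), 2)
## [1]  5  8 13
```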
Conditional execution
Sometimes you only want to execute code if a condition is met. To do that, use an if-else statement.
if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}
condition must always evaluate to either TRUE or FALSE.[1] This is similar to filter(), except condition can only be a single value (i.e. a vector of length 1), whereas filter() works for entire vectors (or columns).
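When a comparison naturally produces several values, base R's any() and all() collapse them into the single TRUE or FALSE that if requires. A minimal sketch, using a made-up vector x and hypothetical message names:

```r
x <- c(1, 5, 10)

# x > 8 has length 3, so it cannot be used directly as an if() condition.
# any() and all() each collapse it to a single TRUE or FALSE.
if (any(x > 8)) {
  msg_any <- "at least one value exceeds 8"
} else {
  msg_any <- "no value exceeds 8"
}

if (all(x > 8)) {
  msg_all <- "every value exceeds 8"
} else {
  msg_all <- "not every value exceeds 8"
}

msg_any
## [1] "at least one value exceeds 8"
msg_all
## [1] "not every value exceeds 8"
```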
You can chain conditional statements together:
if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  # do something completely different
}
But this can get tedious if you need to consider many conditions. There are alternatives in R for some of these long conditional statements. For instance, if you want to convert a continuous (or numeric) variable to categories, use cut():
penguins %>%
  select(body_mass_g) %>%
  mutate(
    body_mass_g_autobin = cut(body_mass_g, breaks = 5),
    body_mass_g_manbin = cut(body_mass_g,
      breaks = c(2700, 3600, 4500, 5400, 6300),
      labels = c("Small", "Medium", "Large", "Huge")
    )
  )
## # A tibble: 344 × 3
## body_mass_g body_mass_g_autobin body_mass_g_manbin
## <int> <fct> <fct>
## 1 3750 (3.42e+03,4.14e+03] Medium
## 2 3800 (3.42e+03,4.14e+03] Medium
## 3 3250 (2.7e+03,3.42e+03] Small
## 4 NA <NA> <NA>
## 5 3450 (3.42e+03,4.14e+03] Small
## 6 3650 (3.42e+03,4.14e+03] Medium
## 7 3625 (3.42e+03,4.14e+03] Medium
## 8 4675 (4.14e+03,4.86e+03] Large
## 9 3475 (3.42e+03,4.14e+03] Small
## 10 4250 (4.14e+03,4.86e+03] Medium
## # … with 334 more rows
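To see how the manual breaks map to labels, here is a small sketch on a made-up vector (the masses are illustrative values, not rows from penguins). By default cut() builds intervals that are open on the left and closed on the right, so a value exactly equal to the lowest break falls outside every bin and becomes NA unless you set include.lowest = TRUE.

```r
# illustrative body masses in grams (made up for this example)
mass <- c(2700, 3000, 3600, 4000, 5000, 6000)

mass_bin <- cut(mass,
  breaks = c(2700, 3600, 4500, 5400, 6300),
  labels = c("Small", "Medium", "Large", "Huge")
)

# 2700 sits on the open lower edge, so it is NA;
# 3600 sits on the closed upper edge of the first bin, so it is "Small"
as.character(mass_bin)
## [1] NA       "Small"  "Small"  "Medium" "Large"  "Huge"
```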
if versus if_else()
Because if-else conditional statements like the ones outlined above must always resolve to a single TRUE or FALSE, they cannot be used for vector operations. Vector operations are where you make multiple comparisons simultaneously for each value stored inside a vector. Consider the gun_deaths data and imagine you wanted to create a new column identifying whether or not an individual had at least a high school education.
library(rcis)
##
## Attaching package: 'rcis'
## The following objects are masked from 'package:rcfss':
##
## add_ci, cfss_notes, cfss_slides, err.rate.rf, err.rate.tree,
## logit2prob, mse, mse_vec, plot_ci, prob2logodds, prob2odds,
## xaringan, xaringan_wide
data("gun_deaths")
(educ <- select(gun_deaths, education))
## # A tibble: 100,798 × 1
## education
## <fct>
## 1 BA+
## 2 Some college
## 3 BA+
## 4 BA+
## 5 HS/GED
## 6 Less than HS
## 7 HS/GED
## 8 HS/GED
## 9 Some college
## 10 <NA>
## # … with 100,788 more rows
This sounds like a classic if-else operation. For each individual, if education equals “Less than HS”, then the value in the new column should be “Less than HS”. Otherwise, it should be “HS+”. But what happens if we try to implement this using an if-else operation like above?
(educ_if <- educ %>%
  mutate(hsPlus = if (education == "Less than HS") {
    "Less than HS"
  } else {
    "HS+"
  }))
## Error in `mutate()`:
## ! Problem while computing `hsPlus = if (...) NULL`.
## Caused by error in `if (education == "Less than HS") ...`:
## ! the condition has length > 1
This did not work correctly. if() can only handle a single TRUE/FALSE value; as of R version 4.2.0, it generates an error if the argument contains more than a single value.
Because we in fact want to make this if-else comparison 100,798 times (once per row), we should instead use if_else(). This vectorizes the if-else comparison and makes a separate comparison for each row of the data frame. This allows us to correctly generate this new column.[2]
(educ_ifelse <- educ %>%
  mutate(hsPlus = if_else(education == "Less than HS", "Less than HS", "HS+")))
## # A tibble: 100,798 × 2
## education hsPlus
## <fct> <chr>
## 1 BA+ HS+
## 2 Some college HS+
## 3 BA+ HS+
## 4 BA+ HS+
## 5 HS/GED HS+
## 6 Less than HS Less than HS
## 7 HS/GED HS+
## 8 HS/GED HS+
## 9 Some college HS+
## 10 <NA> <NA>
## # … with 100,788 more rows
count(educ_ifelse, hsPlus)
## # A tibble: 3 × 2
## hsPlus n
## <chr> <int>
## 1 HS+ 77553
## 2 Less than HS 21823
## 3 <NA> 1422
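The same element-by-element comparison can be sketched on a short made-up vector. This example uses base R's ifelse() so that it runs without extra packages. dplyr's if_else() gives the same result here, but it additionally checks that the two replacement values share a type and offers a missing argument for handling NA.

```r
# illustrative values only, not rows from gun_deaths
education <- c("BA+", "Less than HS", NA, "HS/GED")

# one comparison per element; an NA in the condition stays NA in the result
hs_plus <- ifelse(education == "Less than HS", "Less than HS", "HS+")
hs_plus
## [1] "HS+"          "Less than HS" NA             "HS+"
```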
Exercise: write a fizzbuzz function
Fizz buzz is a children’s game that teaches about division. Players take turns counting incrementally, replacing any number divisible by three with the word “fizz” and any number divisible by five with the word “buzz”.
Likewise, a fizzbuzz function takes a single number as input. If the number is divisible by three, it returns “fizz”. If it’s divisible by five it returns “buzz”. If it’s divisible by three and five, it returns “fizzbuzz”. Otherwise, it returns the number.
The output of your function should look like this:
my_function(3)
## [1] "fizz"
my_function(5)
## [1] "buzz"
my_function(15)
## [1] "fizzbuzz"
my_function(4)
## [1] 4
A helpful hint about modular division
%% is modular division. It returns the remainder left over after the division, rather than the quotient.
5 / 3
## [1] 1.666667
5 %% 3
## [1] 2
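A remainder of zero is what signals that one number divides evenly into another. A short sketch on a made-up vector, along with %/%, the companion operator for integer division:

```r
x <- c(3, 5, 15, 4)

# remainder after dividing each element by 3
x %% 3
## [1] 0 2 0 1

# a remainder of 0 means "evenly divisible by 3"
x %% 3 == 0
## [1]  TRUE FALSE  TRUE FALSE

# %/% is integer division: it keeps the whole-number part and drops the remainder
5 %/% 3
## [1] 1
```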
Click for the solution
fizzbuzz <- function(x) {
  if (x %% 3 == 0 && x %% 5 == 0) {
    return("fizzbuzz")
  } else if (x %% 3 == 0) {
    return("fizz")
  } else if (x %% 5 == 0) {
    return("buzz")
  } else {
    return(x)
  }
}
fizzbuzz(3)
## [1] "fizz"
fizzbuzz(5)
## [1] "buzz"
fizzbuzz(15)
## [1] "fizzbuzz"
fizzbuzz(4)
## [1] 4
- Name - fizzbuzz
  - Plays a single round of the Fizz Buzz game
- Arguments
  - x - a number
- Body
  - Uses modular division and a series of if-else statements to check if x is evenly divisible by 3 and/or 5.
    - The first comparison checks if x is a “fizzbuzz” (evenly divisible by 3 and 5). This should be the first comparison because it needs to return “fizzbuzz”. If we had this at the end of the comparison chain, the function would prematurely return on “fizz” or “buzz”.
      - If TRUE, then return “fizzbuzz”
    - If the first condition is not met, check to see if x is a “fizz” (divisible by 3).
      - If TRUE, then return “fizz”
    - If the first two conditions are not met, check to see if x is a “buzz” (divisible by 5).
      - If TRUE, then return “buzz”
    - If the first three conditions are all FALSE, then return the original number x.
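Because fizzbuzz() relies on if, it expects a single number at a time. One way to play several rounds, sketched here with base R's sapply(), is to call the function once per element. The name rounds is hypothetical. Note that sapply() combines the results into one character vector, so the plain numbers come back as text.

```r
fizzbuzz <- function(x) {
  if (x %% 3 == 0 && x %% 5 == 0) {
    return("fizzbuzz")
  } else if (x %% 3 == 0) {
    return("fizz")
  } else if (x %% 5 == 0) {
    return("buzz")
  } else {
    return(x)
  }
}

# call fizzbuzz() once for each of the numbers 1 through 15
rounds <- sapply(1:15, fizzbuzz)
rounds
```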
Session Info
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23)
## os macOS Monterey 12.3
## system aarch64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/New_York
## date 2022-08-22
## pandoc 2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.2.0)
## backports 1.4.1 2021-12-13 [2] CRAN (R 4.2.0)
## blogdown 1.10 2022-05-10 [2] CRAN (R 4.2.0)
## bookdown 0.27 2022-06-14 [2] CRAN (R 4.2.0)
## broom 1.0.0 2022-07-01 [2] CRAN (R 4.2.0)
## bslib 0.4.0 2022-07-16 [2] CRAN (R 4.2.0)
## cachem 1.0.6 2021-08-19 [2] CRAN (R 4.2.0)
## cellranger 1.1.0 2016-07-27 [2] CRAN (R 4.2.0)
## cli 3.3.0 2022-04-25 [2] CRAN (R 4.2.0)
## colorspace 2.0-3 2022-02-21 [2] CRAN (R 4.2.0)
## crayon 1.5.1 2022-03-26 [2] CRAN (R 4.2.0)
## DBI 1.1.3 2022-06-18 [2] CRAN (R 4.2.0)
## dbplyr 2.2.1 2022-06-27 [2] CRAN (R 4.2.0)
## digest 0.6.29 2021-12-01 [2] CRAN (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [2] CRAN (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.2.0)
## evaluate 0.16 2022-08-09 [1] CRAN (R 4.2.1)
## fansi 1.0.3 2022-03-24 [2] CRAN (R 4.2.0)
## fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.0)
## forcats * 0.5.1 2021-01-27 [2] CRAN (R 4.2.0)
## fs 1.5.2 2021-12-08 [2] CRAN (R 4.2.0)
## gargle 1.2.0 2021-07-02 [2] CRAN (R 4.2.0)
## generics 0.1.3 2022-07-05 [2] CRAN (R 4.2.0)
## ggplot2 * 3.3.6 2022-05-03 [2] CRAN (R 4.2.0)
## glue 1.6.2 2022-02-24 [2] CRAN (R 4.2.0)
## googledrive 2.0.0 2021-07-08 [2] CRAN (R 4.2.0)
## googlesheets4 1.0.0 2021-07-21 [2] CRAN (R 4.2.0)
## gtable 0.3.0 2019-03-25 [2] CRAN (R 4.2.0)
## haven 2.5.0 2022-04-15 [2] CRAN (R 4.2.0)
## here 1.0.1 2020-12-13 [2] CRAN (R 4.2.0)
## hms 1.1.1 2021-09-26 [2] CRAN (R 4.2.0)
## htmltools 0.5.3 2022-07-18 [2] CRAN (R 4.2.0)
## httr 1.4.3 2022-05-04 [2] CRAN (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.2.0)
## jsonlite 1.8.0 2022-02-22 [2] CRAN (R 4.2.0)
## knitr 1.39 2022-04-26 [2] CRAN (R 4.2.0)
## lifecycle 1.0.1 2021-09-24 [2] CRAN (R 4.2.0)
## lubridate 1.8.0 2021-10-07 [2] CRAN (R 4.2.0)
## magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.2.0)
## modelr 0.1.8 2020-05-19 [2] CRAN (R 4.2.0)
## munsell 0.5.0 2018-06-12 [2] CRAN (R 4.2.0)
## palmerpenguins * 0.1.0 2020-07-23 [2] CRAN (R 4.2.0)
## pillar 1.8.0 2022-07-18 [2] CRAN (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.2.0)
## purrr * 0.3.4 2020-04-17 [2] CRAN (R 4.2.0)
## R6 2.5.1 2021-08-19 [2] CRAN (R 4.2.0)
## readr * 2.1.2 2022-01-30 [2] CRAN (R 4.2.0)
## readxl 1.4.0 2022-03-28 [2] CRAN (R 4.2.0)
## reprex 2.0.1.9000 2022-08-10 [1] Github (tidyverse/reprex@6d3ad07)
## rlang 1.0.4 2022-07-12 [2] CRAN (R 4.2.0)
## rmarkdown 2.14 2022-04-25 [2] CRAN (R 4.2.0)
## rprojroot 2.0.3 2022-04-02 [2] CRAN (R 4.2.0)
## rstudioapi 0.13 2020-11-12 [2] CRAN (R 4.2.0)
## rvest 1.0.2 2021-10-16 [2] CRAN (R 4.2.0)
## sass 0.4.2 2022-07-16 [2] CRAN (R 4.2.0)
## scales 1.2.0 2022-04-13 [2] CRAN (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.2.0)
## stringi 1.7.8 2022-07-11 [2] CRAN (R 4.2.0)
## stringr * 1.4.0 2019-02-10 [2] CRAN (R 4.2.0)
## tibble * 3.1.8 2022-07-22 [2] CRAN (R 4.2.0)
## tidyr * 1.2.0 2022-02-01 [2] CRAN (R 4.2.0)
## tidyselect 1.1.2 2022-02-21 [2] CRAN (R 4.2.0)
## tidyverse * 1.3.2 2022-07-18 [2] CRAN (R 4.2.0)
## tzdb 0.3.0 2022-03-28 [2] CRAN (R 4.2.0)
## utf8 1.2.2 2021-07-24 [2] CRAN (R 4.2.0)
## vctrs 0.4.1 2022-04-13 [2] CRAN (R 4.2.0)
## withr 2.5.0 2022-03-03 [2] CRAN (R 4.2.0)
## xfun 0.31 2022-05-10 [1] CRAN (R 4.2.0)
## xml2 1.3.3 2021-11-30 [2] CRAN (R 4.2.0)
## yaml 2.3.5 2022-02-21 [2] CRAN (R 4.2.0)
##
## [1] /Users/soltoffbc/Library/R/arm64/4.2/library
## [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────
[1] These are Boolean logical values - we used them to make comparisons and will talk more next class about logical vectors.
[2] Notice that it also preserves missing values in the new column. Remember, any operation performed on a missing value will itself become a missing value.