You need a way to iterate in R in a data-structure-informed way. What does that mean?
All of this is absolutely possible with base R, using for() loops or the “apply” functions, such as apply(), [slvmt]apply(), and by(). I know, because I did so religiously for over 15 years. It works fine!
Why might you do otherwise? As an instructor, I encounter lots of useRs who sheepishly say they know they should be using the apply functions, but they don’t, because they’ve never quite figured them out or been able to form the habit.
I have a theory about why. The user interface of the “apply” functions is not as consistent as it could be, which slows down learning. The return objects frequently require further checking and massaging to use downstream, which diminishes the payoff. In particular, there’s a tendency to return a vector (atomic or otherwise) or array, instead of a data frame, with the original factor levels appearing in a names attribute.
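To make that last point concrete, here is a minimal sketch of the kind of return value described above, using the built-in iris data frame purely for illustration:

## tapply() returns a named numeric vector, not a data frame: the factor
## levels end up in names(), which often means extra massaging downstream
tapply(iris$Sepal.Length, iris$Species, mean)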
Regardless of your preference, hopefully the side-by-side examples below are a helpful tour of different ways to iterate. Even if you prefer purrr/tidyverse in some settings, you might want base methods when you don’t want any dependencies.
purrr addresses some of the friction identified in the base functions for “split-apply-combine”: the map() family of functions is highly internally consistent, making it easier to transfer expertise from one function to another.

We load purrr now. Below we also use example lists from repurrrsive, which are introduced and explored elsewhere.
library(purrr)
library(repurrrsive)
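As a quick taste of that consistency (a sketch; the same calls reappear in the sections below), the calling pattern stays identical across the map() variants and the suffix pins down the return type:

map(got_chars[1:3], "name")        # list out
map_chr(got_chars[1:3], "name")    # character vector out
map_int(got_chars[1:3], "id")      # integer vector out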
If you have never heard of the plyr package, you can skip this section! Personally, I’m a bit sad that it’s time to move on from plyr. But it is time.
Why? plyr is no longer under active development. The innovation is happening elsewhere, in purrr and the other packages in the tidyverse. Also, even I must admit that plyr can be rather slow. I wouldn’t recommend writing new code that requires plyr.
In terms of understanding or translating plyr code, here’s the minimum you need to know: the key feature of the main 3 * 4 = 12 functions in plyr is their explicit, memorable form XYply(), where X and Y specify the type of input and output, respectively. X and Y take values from:

a = “array”
d = “data frame”
l = “list”
_ = “nothing” (not eligible as X, i.e. as input, for obvious reasons)

The most useful function proved to be ddply(): data frame in, data frame out. This inspired the creation of the dplyr package. Most tasks suitable for a*ply() and, especially, l*ply() are now handled by purrr’s map() family. The functions called for their side effects, as opposed to their return value, are those named *_ply(); this is handled by purrr’s walk() functions. Marching through a data frame row by row is now a job for purrr’s pmap().
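For readers translating old code, here is a minimal sketch of the side-effect correspondence just described (it assumes plyr is installed and is not a recommendation to write new plyr code):

## *_ply(): called purely for side effects -- now a job for purrr::walk()
plyr::l_ply(got_chars[1:3], function(x) cat(x[["name"]], "\n"))
walk(got_chars[1:3], ~ cat(.x[["name"]], "\n"))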
lapply() vs. purrr::map()

These are the core mapping functions of base and purrr, respectively. They are “list in, list out”. The main (only?) difference is access to purrr’s shortcuts for indexing by name or position and for creating anonymous functions.
base
lapply(got_chars[1:3],
function(x) x[["name"]])
#> [[1]]
#> [1] "Theon Greyjoy"
#>
#> [[2]]
#> [1] "Tyrion Lannister"
#>
#> [[3]]
#> [1] "Victarion Greyjoy"
purrr
map(got_chars[1:3], "name")
#> [[1]]
#> [1] "Theon Greyjoy"
#>
#> [[2]]
#> [1] "Tyrion Lannister"
#>
#> [[3]]
#> [1] "Victarion Greyjoy"
This is also plyr::llply(), although it offers some other arguments, such as .progress and .parallel.
plyr::llply(got_chars[1:3], function(x) x[["name"]])
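Speaking of those shortcuts, here is a small sketch of the purrr conveniences mentioned above: indexing by position (name happens to be the third element of each character’s list) and the formula shorthand for anonymous functions.

## extract by position instead of by name
map(got_chars[1:3], 3)
## anonymous function via the ~ formula shorthand
map(got_chars[1:3], ~ .x[["name"]])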
sapply() vs. ¯\_(ツ)_/¯

sapply() is a base function that attempts to apply a reasonable simplification to the output of lapply(). It’s handy for interactive use, but due to the unpredictability of its return value, it’s unwise to use it in programming. There is no equivalent in purrr or plyr.

Problem demonstration: the characters in Game of Thrones can have aliases. Some have several, some have one, some have none. Depending on which characters I work with, the same sapply() call can return an object of an entirely different class, i.e. a list or a character vector. This can be a source of mysterious bugs in code meant to run non-interactively.
aliases1 <- sapply(got_chars[20:22], function(x) x[["aliases"]])
str(aliases1)
#> List of 3
#> $ : chr [1:3] "Ned" "The Ned" "The Quiet Wolf"
#> $ : chr [1:4] "The Kingslayer" "The Lion of Lannister" "The Young Lion" "Cripple"
#> $ : chr "Griffthe Mad King's Hand"
aliases2 <- sapply(got_chars[c(3, 22, 27)], function(x) x[["aliases"]])
str(aliases2)
#> chr [1:3] "The Iron Captain" "Griffthe Mad King's Hand" ...
With purrr, you would use map() to get a list back or map_chr() to get an atomic character vector. If you use map_chr() when you should not, you’ll get an informative error right away (shown below) and can adjust your approach accordingly.
map_chr(got_chars[2:4], "aliases")
#> Result 1 must be a single string, not a character vector of length 11
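The adjustment can be as simple as falling back to map(), which keeps the list structure when characters have varying numbers of aliases:

## keep a list when element lengths vary
map(got_chars[2:4], "aliases")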
vapply() vs. map_*()

Base vapply() requires you to specify a template for the return value and is described as a safer alternative to sapply(). The closest purrr functions are the type-specific mapping functions map_lgl(), map_int(), map_dbl(), and map_chr(), which are “list in, atomic vector out”. Here’s comparable use of vapply() and map_chr() to get some of the Game of Thrones characters’ names.
base
vapply(got_chars[1:3],
function(x) x[["name"]],
character(1))
#> [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
purrr
map_chr(got_chars[1:3], "name")
#> [1] "Theon Greyjoy" "Tyrion Lannister" "Victarion Greyjoy"
vapply() always simplifies

What’s not to love with vapply(), then? It suffers from the drop = TRUE vs. FALSE problem we have when requesting a single row or column from a 2-dimensional object, except vapply() has no drop argument to control this behavior. It’s an example of the base functions being more difficult to program around. The template allows you to specify the form of each individual result, but there is no way to specify the form – such as the dimension – of the overall result.

I adapt this example from my real life, where I have vapply() inside a function and n is an argument to that function, i.e. it varies. Here I simply define n in the global environment prior to the vapply() call. Note how vapply() returns a two-dimensional object in the first case and an atomic vector in the second. As it says in the docs: “Simplification is always done in vapply.” Believe it.
f <- function(x, n) rep(x, n)
n <- 3
vapply(c("a", "b"), f, character(n), n = n)
#> a b
#> [1,] "a" "b"
#> [2,] "a" "b"
#> [3,] "a" "b"
n <- 1
vapply(c("a", "b"), f, character(n), n = n)
#> a b
#> "a" "b"
map_dfr()

The purrr::map_dfr() function is “list in, data frame out” and there is no true base equivalent. Given the centrality of data frames for analysis, it is handy to have a function to produce them, without resorting to do.call().
base
l <- lapply(got_chars[23:25],
`[`, c("name", "playedBy"))
mat <- do.call(rbind, l)
(df <- as.data.frame(mat, stringsAsFactors = FALSE))
#> name playedBy
#> 1 Jon Snow Kit Harington
#> 2 Aeron Greyjoy Michael Feast
#> 3 Kevan Lannister Ian Gelder
purrr
map_dfr(got_chars[23:25],
`[`, c("name", "playedBy"))
#> # A tibble: 3 x 2
#> name playedBy
#> <chr> <chr>
#> 1 Jon Snow Kit Harington
#> 2 Aeron Greyjoy Michael Feast
#> 3 Kevan Lannister Ian Gelder
The base workflow above gets trickier if you’re extracting elements of disparate type. At that point, it may make more sense to use vapply() repeatedly. For comparability, we’ll show a similar approach using purrr’s type-specific mapping, which is also safer than relying on automatic type conversion from map_dfr().
base
data.frame(
name = vapply(got_chars[23:25], `[[`,
character(1), "name"),
id = vapply(got_chars[23:25], `[[`,
integer(1), "id"),
stringsAsFactors = FALSE
)
#> name id
#> 1 Jon Snow 583
#> 2 Aeron Greyjoy 60
#> 3 Kevan Lannister 605
purrr
tibble::tibble(
name = map_chr(got_chars[23:25], "name"),
id = map_int(got_chars[23:25], "id")
)
#> # A tibble: 3 x 2
#> name id
#> <chr> <int>
#> 1 Jon Snow 583
#> 2 Aeron Greyjoy 60
#> 3 Kevan Lannister 605
mapply() vs. map2(), pmap()

When you need to iterate over two or more vectors/lists in parallel, the base option is mapply(). Unlike the other apply functions, its first argument is FUN, the function to apply, and the multiple vector inputs are provided “loose” via the ... argument.

For exactly two vector inputs, purrr has map2(), with all the usual type-specific variants. For an arbitrary number of vector inputs, use purrr’s pmap() or its type-specific variants, with the inputs packaged in a list. A very handy special case is when the input is a data frame, in which case pmap_*() applies .f to each row.
base
nms <- vapply(got_chars[16:18],
`[[`, character(1), "name")
birth <- vapply(got_chars[16:18],
`[[`, character(1), "born")
mapply(function(x, y) paste(x, "was born", y),
nms, birth)
#> Brandon Stark
#> "Brandon Stark was born In 290 AC, at Winterfell"
#> Brienne of Tarth
#> "Brienne of Tarth was born In 280 AC"
#> Catelyn Stark
#> "Catelyn Stark was born In 264 AC, at Riverrun"
purrr
nms <- got_chars[16:18] %>%
map_chr("name")
birth <- got_chars[16:18] %>%
map_chr("born")
map2_chr(nms, birth, ~ paste(.x, "was born", .y))
#> [1] "Brandon Stark was born In 290 AC, at Winterfell"
#> [2] "Brienne of Tarth was born In 280 AC"
#> [3] "Catelyn Stark was born In 264 AC, at Riverrun"
## and again, but with pmap()
df <- tibble::tibble(
nms,
connector = "was born",
birth
)
pmap_chr(df, paste)
#> [1] "Brandon Stark was born In 290 AC, at Winterfell"
#> [2] "Brienne of Tarth was born In 280 AC"
#> [3] "Catelyn Stark was born In 264 AC, at Riverrun"
aggregate() vs. dplyr::summarize()

Consider a data frame, as opposed to a nested list. How do you split it into pieces, according to one or more factors, apply a function to the pieces, and combine the results?

Create a tiny excerpt of the Gapminder dataset that contains a bit of data for Canada and Germany. Load dplyr, now that we are more in the data frame world.
library(dplyr)
library(gapminder)
(mini_gap <- gapminder %>%
filter(country %in% c("Canada", "Germany"), year > 2000) %>%
droplevels())
#> # A tibble: 4 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Canada Americas 2002 79.8 31902268 33329.
#> 2 Canada Americas 2007 80.7 33390141 36319.
#> 3 Germany Europe 2002 78.7 82350671 30036.
#> 4 Germany Europe 2007 79.4 82400996 32170.
Simple summary of a single variable for each country. Specifically, take the mean of life expectancy. In this case, the formula method of base aggregate() is quite nice, i.e. it returns a convenient data frame.
base
aggregate(lifeExp ~ country, mini_gap, mean)
#> country lifeExp
#> 1 Canada 80.2115
#> 2 Germany 79.0380
tidyverse
mini_gap %>%
group_by(country) %>%
summarize(lifeExp = mean(lifeExp))
#> # A tibble: 2 x 2
#> country lifeExp
#> <fct> <dbl>
#> 1 Canada 80.2
#> 2 Germany 79.0
Simple summaries of two variables for each country. We take the mean of life expectancy and of GDP per capita.
base
## formula method
aggregate(cbind(lifeExp, gdpPercap) ~ country, mini_gap, mean)
#> country lifeExp gdpPercap
#> 1 Canada 80.2115 34824.10
#> 2 Germany 79.0380 31103.09
## data.frame method
aggregate(mini_gap[c("lifeExp", "gdpPercap")], list(mini_gap$country), mean)
#> Group.1 lifeExp gdpPercap
#> 1 Canada 80.2115 34824.10
#> 2 Germany 79.0380 31103.09
## tapply() is more general, but its output is less useful here (no data frame)
## returns named vector
tapply(mini_gap$lifeExp, mini_gap$country, mean)
#> Canada Germany
#> 80.2115 79.0380
## returns list
tapply(mini_gap$lifeExp, mini_gap$country, mean, simplify = FALSE)
#> $Canada
#> [1] 80.2115
#>
#> $Germany
#> [1] 79.038
tidyverse
mini_gap %>%
group_by(country) %>%
summarize_at(vars(lifeExp, gdpPercap), mean)
#> # A tibble: 2 x 3
#> country lifeExp gdpPercap
#> <fct> <dbl> <dbl>
#> 1 Canada 80.2 34824.
#> 2 Germany 79.0 31103.
Bivariate summary of two variables for each country. We compute the correlation of life expectancy and year, for the full gapminder dataset now. On the base side, we can no longer use aggregate() or tapply() and need to graduate to by().
base
## by() with simplification (the default)
by_obj <- by(gapminder, gapminder$country, function(df) cor(df$lifeExp, df$year))
head(by_obj)
#> gapminder$country
#> Afghanistan Albania Algeria Angola Argentina Australia
#> 0.9735051 0.9542420 0.9925307 0.9422392 0.9977816 0.9897716
## by() without simplification
by_obj <- by(gapminder, gapminder$country, function(df) cor(df$lifeExp, df$year),
simplify = FALSE)
head(by_obj)
#> $Afghanistan
#> [1] 0.9735051
#>
#> $Albania
#> [1] 0.954242
#>
#> $Algeria
#> [1] 0.9925307
#>
#> $Angola
#> [1] 0.9422392
#>
#> $Argentina
#> [1] 0.9977816
#>
#> $Australia
#> [1] 0.9897716
tidyverse
gapminder %>%
group_by(country) %>%
summarize(cor = cor(lifeExp, year))
#> # A tibble: 142 x 2
#> country cor
#> <fct> <dbl>
#> 1 Afghanistan 0.974
#> 2 Albania 0.954
#> 3 Algeria 0.993
#> 4 Angola 0.942
#> 5 Argentina 0.998
#> 6 Australia 0.990
#> 7 Austria 0.996
#> 8 Bahrain 0.983
#> 9 Bangladesh 0.995
#> 10 Belgium 0.997
#> # … with 132 more rows
This is a good place to pause and glance back over the base vs tidyverse calls and results in this section. It shows that you can absolutely do these tasks either way, but the consistency of the code and return values is higher in the tidyverse.
by() vs. tidyr::nest()

Fit a linear model of life expectancy against year. On the tidyverse side, we now create a nested data frame, with one meta-row per country, so we load tidyr to get nest(). The data needed for each country’s linear model is stored as a list-column of country-specific data frames.
base
by_obj <- by(gapminder,
gapminder$country,
function(df) lm(lifeExp ~ year, data = df))
str(by_obj[1:2], max.level = 1)
#> List of 2
#> $ Afghanistan:List of 12
#> ..- attr(*, "class")= chr "lm"
#> $ Albania :List of 12
#> ..- attr(*, "class")= chr "lm"
#> - attr(*, "dim")= int 2
#> - attr(*, "dimnames")=List of 1
by_obj[[1]]
#>
#> Call:
#> lm(formula = lifeExp ~ year, data = df)
#>
#> Coefficients:
#> (Intercept) year
#> -507.5343 0.2753
tidyverse
library(tidyr)
nested_df <- gapminder %>%
group_by(country, continent) %>%
nest() %>%
mutate(fit = map(data, ~ lm(lifeExp ~ year, data = .x)))
str(nested_df$fit[1:2], max.level = 1)
#> List of 2
#> $ :List of 12
#> ..- attr(*, "class")= chr "lm"
#> $ :List of 12
#> ..- attr(*, "class")= chr "lm"
nested_df$fit[[1]]
#>
#> Call:
#> lm(formula = lifeExp ~ year, data = .x)
#>
#> Coefficients:
#> (Intercept) year
#> -507.5343 0.2753
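To see the list-column of country-specific data frames mentioned above, you can peek inside nested_df directly (a quick check, not shown in the original output):

## the data used to fit the first country's model lives in the "data" list-column
nested_df$data[[1]]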
What if you want to inspect the fits for Oceania? On the base side, you’ve got some work to do because the country information lives only in the names and the continent information is not directly linked to the model fits at all. On the tidyverse side, where the fits live in a data frame that carries country and continent info, we can use our usual techniques for filtering rows based on the data.
base
o_countries <- as.character(unique(gapminder$country[gapminder$continent == "Oceania"]))
by_obj[names(by_obj) %in% o_countries]
#> $Australia
#>
#> Call:
#> lm(formula = lifeExp ~ year, data = df)
#>
#> Coefficients:
#> (Intercept) year
#> -376.1163 0.2277
#>
#>
#> $`New Zealand`
#>
#> Call:
#> lm(formula = lifeExp ~ year, data = df)
#>
#> Coefficients:
#> (Intercept) year
#> -307.6996 0.1928
tidyverse
nested_df %>%
filter(continent == "Oceania") %>%
.$fit
#> [[1]]
#>
#> Call:
#> lm(formula = lifeExp ~ year, data = .x)
#>
#> Coefficients:
#> (Intercept) year
#> -376.1163 0.2277
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = lifeExp ~ year, data = .x)
#>
#> Coefficients:
#> (Intercept) year
#> -307.6996 0.1928
Let’s form a data frame with one row per country and variables for country, continent, estimated intercept, and estimated slope. I also want to guarantee that the country and continent factors have the same levels as they originally had in gapminder.
base
coefs <- lapply(by_obj, coef)
coefs <- do.call(rbind, coefs)
coefs <- data.frame(
country = I(rownames(coefs)),
coefs
)
coefs$continent <- gapminder$continent[match(coefs$country, gapminder$country)]
coefs$continent <- factor(coefs$continent, levels = levels(gapminder$continent))
coefs$country <- factor(coefs$country, levels = levels(gapminder$country))
head(coefs)
#> country X.Intercept. year continent
#> Afghanistan Afghanistan -507.5343 0.2753287 Asia
#> Albania Albania -594.0725 0.3346832 Europe
#> Algeria Algeria -1067.8590 0.5692797 Africa
#> Angola Angola -376.5048 0.2093399 Africa
#> Argentina Argentina -389.6063 0.2317084 Americas
#> Australia Australia -376.1163 0.2277238 Oceania
tidyverse
nested_df %>%
mutate(coefs = map(fit, coef),
intercept = map_dbl(coefs, 1),
slope = map_dbl(coefs, 2)) %>%
select(country, continent, intercept, slope)
#> # A tibble: 142 x 4
#> # Groups: country, continent [710]
#> country continent intercept slope
#> <fct> <fct> <dbl> <dbl>
#> 1 Afghanistan Asia -508. 0.275
#> 2 Albania Europe -594. 0.335
#> 3 Algeria Africa -1068. 0.569
#> 4 Angola Africa -377. 0.209
#> 5 Argentina Americas -390. 0.232
#> 6 Australia Oceania -376. 0.228
#> 7 Austria Europe -406. 0.242
#> 8 Bahrain Asia -860. 0.468
#> 9 Bangladesh Asia -936. 0.498
#> 10 Belgium Europe -340. 0.209
#> # … with 132 more rows
This illustrates the post-processing that is often necessary in a base workflow: we need to recover the country info from the names of by_obj, use it to look up the continents in gapminder, and then restore the original factor levels for both country and continent. It also shows the payoff for temporarily tolerating the data and fit list-columns in the nested data frame used in the tidyverse workflow: the country and continent factors remain intact and directly associated with the data and fits.