A great use of purrr’s map() functions is to dig information out of a non-rectangular data structure and create a neat data frame. Where do these awkward objects come from? Often as JSON or XML from an API. JSON and XML are two plain text formats for non-rectangular data, i.e. for scenarios where CSV is not an option. If you are lucky it’s JSON, which is less aggravating, and readily converts to a list you can work with in R.
Here we explore some lists obtained from the GitHub API. Interactive exploration of these lists is made possible by the listviewer package.
Load the packages.
library(repurrrsive)
library(listviewer)
library(jsonlite)
library(dplyr)
library(tibble)
library(purrr)
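The repurrrsive package provides information on 6 GitHub users in an R list named gh_users.
gh_users is a recursive list: it has one component for each of the 6 GitHub users, and each component is, in turn, a list with information on that user.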
We have no clue what is in this list. This is normal. That is why it’s important to develop list inspection strategies.
Use str() with arguments such as max.level and list.len. It often pays off to do deeper inspection on a single element.
str(gh_users, max.level = 1)
#> List of 6
#> $ :List of 30
#> $ :List of 30
#> $ :List of 30
#> $ :List of 30
#> $ :List of 30
#> $ :List of 30
str(gh_users[[1]], list.len = 6)
#> List of 30
#> $ login : chr "gaborcsardi"
#> $ id : int 660288
#> $ avatar_url : chr "https://avatars.githubusercontent.com/u/660288?v=3"
#> $ gravatar_id : chr ""
#> $ url : chr "https://api.github.com/users/gaborcsardi"
#> $ html_url : chr "https://github.com/gaborcsardi"
#> [list output truncated]
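As a minimal sketch of how these two arguments behave, the toy list below mimics the shape of gh_users on a much smaller scale. The data and the name toy_users are made up for illustration and are not part of repurrrsive.

```r
# Hypothetical stand-in for gh_users: 2 users, each a list of 4 fields
toy_users <- list(
  list(login = "alice", id = 1L, url = "https://example.com/alice", name = "Alice A."),
  list(login = "bob", id = 2L, url = "https://example.com/bob", name = "Bob B.")
)

# max.level = 1 stops at the top level: one line per user, no fields shown
str(toy_users, max.level = 1)

# list.len = 2 prints only the first 2 fields of one user, then truncates
str(toy_users[[1]], list.len = 2)
```

The first call is a quick way to count components; the second is a quick way to peek at one component without flooding the console.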
You can also use listviewer::jsonedit() to explore it interactively.
1. Read the documentation for str(). What does max.level control? Do str(gh_users, max.level = i) for i in 0, 1, and 2.
2. What does the list.len argument of str() control? What is its default value? Call str() on gh_users and then on a single component of gh_users with list.len set to a value much smaller than the default.
3. Call str() on gh_users, specifying both max.level and list.len.
4. Pick a single user and extract a few pieces of information about them from gh_users.
5. Explore gh_users with the listviewer widget. Or, optionally, install the listviewer package via install.packages("listviewer") and call jsonedit(gh_users) to run this widget locally. Can you find the same info you extracted in the previous exercise? The same info you see in the user’s GitHub.com profile?
Who are these GitHub users?
We need to reach into each user’s list and pull out the element that holds the user’s name or, maybe, username. How?
Recall the basic usage of purrr::map():
map(.x, .f, ...)
The first input, .x, is your list. It will be gh_users here.
The second input, .f, is the function to apply to each component of the list.
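Before reaching for the shortcuts, it may help to see map() with an ordinary function as .f. This is a small sketch on a made-up list (toy is a hypothetical name, not from the lesson data).

```r
library(purrr)

# Hypothetical input: a named list with two components of different length
toy <- list(a = 1:3, b = letters[1:2])

# .x is the list, .f is length(); map() applies it to each component
lens <- map(toy, length)
str(lens)
```

The result is a list with the same length and names as the input, which is the behavior the rest of this lesson builds on.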
We want the element with name “login”, so we do this:
map(gh_users, "login")
#> [[1]]
#> [1] "gaborcsardi"
#>
#> [[2]]
#> [1] "jennybc"
#>
#> [[3]]
#> [1] "jtleek"
#>
#> [[4]]
#> [1] "juliasilge"
#>
#> [[5]]
#> [1] "leeper"
#>
#> [[6]]
#> [1] "masalmon"
We are exploiting one of purrr’s most useful features: a shortcut to create a function that extracts an element based on its name.
A companion shortcut is used if you provide a positive integer to map(). This creates a function that extracts an element based on position.
The 18th element of each user’s list is their name and we get them like so:
map(gh_users, 18)
#> [[1]]
#> [1] "Gábor Csárdi"
#>
#> [[2]]
#> [1] "Jennifer (Jenny) Bryan"
#>
#> [[3]]
#> [1] "Jeff L."
#>
#> [[4]]
#> [1] "Julia Silge"
#>
#> [[5]]
#> [1] "Thomas J. Leeper"
#>
#> [[6]]
#> [1] "Maëlle Salmon"
To recap, here are two shortcuts for making the .f function that map() will apply:
provide “TEXT” to extract the element named “TEXT”; equivalent to
function(x) x[["TEXT"]]
provide i to extract the i-th element; equivalent to
function(x) x[[i]]
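To check this correspondence, here is a small sketch that compares each shortcut with the explicit function it stands in for. The list toy_users is made-up data, and the comparison only covers elements that actually exist in every component.

```r
library(purrr)

# Hypothetical stand-in for gh_users
toy_users <- list(
  list(login = "alice", name = "Alice A."),
  list(login = "bob", name = "Bob B.")
)

# Name shortcut vs. an explicit extractor function
by_name_short <- map(toy_users, "login")
by_name_long <- map(toy_users, function(x) x[["login"]])

# Position shortcut vs. an explicit extractor function
by_pos_short <- map(toy_users, 2)
by_pos_long <- map(toy_users, function(x) x[[2]])

identical(by_name_short, by_name_long)
identical(by_pos_short, by_pos_long)
```

Both comparisons should come back TRUE; the shortcut form simply saves the typing.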
You will frequently see map() used together with the pipe %>%. These calls produce the same result as the above.
gh_users %>%
map("login")
gh_users %>%
map(18)
1. Use names() to inspect the names of the list elements associated with a single user. What is the index or position of the created_at element? Use the character and position shortcuts to extract the created_at elements for all 6 users.
2. Write a function that takes a list and a name and returns the element with that name. Apply it to gh_users via map(). Do you get the same result as with the shortcut? Reflect on code length and readability.
3. Write a function that takes a list and a position and returns the element at that position. Apply it to gh_users via map(). How do this result and process compare with the shortcut?
map() always returns a list, even if all the elements have the same flavor and are of length one. But in that case, you might prefer a simpler object: an atomic vector.
If you expect map() to return output that can be turned into an atomic vector, it is best to use a type-specific variant of map(). This is more efficient than using map() to get a list and then simplifying the result in a second step. Also, purrr will alert you to any problems, i.e. if one or more inputs has the wrong type or length. This is the increased rigor about type alluded to in the section about coercion.
Our current examples are suitable for demonstrating map_chr(), since the requested elements are always character.
map_chr(gh_users, "login")
#> [1] "gaborcsardi" "jennybc" "jtleek" "juliasilge" "leeper"
#> [6] "masalmon"
map_chr(gh_users, 18)
#> [1] "Gábor Csárdi" "Jennifer (Jenny) Bryan"
#> [3] "Jeff L." "Julia Silge"
#> [5] "Thomas J. Leeper" "Maëlle Salmon"
Besides map_chr(), there are other variants of map(), with the target type conveyed by the name:
map_lgl(), map_int(), map_dbl()
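Here is a brief sketch of these variants on made-up data. The field names is_admin and score are hypothetical and chosen only so that each type appears once. The last call illustrates the type checking described above: asking map_int() for a character element is expected to fail rather than coerce silently.

```r
library(purrr)

# Hypothetical records with one field of each atomic type
toy_users <- list(
  list(login = "alice", id = 1L, is_admin = FALSE, score = 2.5),
  list(login = "bob", id = 2L, is_admin = TRUE, score = 4)
)

ids <- map_int(toy_users, "id")          # integer vector
admins <- map_lgl(toy_users, "is_admin") # logical vector
scores <- map_dbl(toy_users, "score")    # double vector

# Requesting the wrong type signals an error; we catch it to show the outcome
wrong_type <- tryCatch(
  map_int(toy_users, "login"),
  error = function(e) "error"
)
wrong_type
```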
1. Use a type-specific form of map() and an extraction shortcut to extract the ids of these 6 users.
2. Find a list element that is logical. Use a type-specific form of map() and an extraction shortcut to extract this for all 6 users.
3. Find elements other than id that are numbers. Practice extracting them.
What if you want to retrieve multiple elements, such as the user’s name and GitHub username? First, recall how we do this with the list for a single user:
gh_users[[3]][c("name", "login", "id", "location")]
#> $name
#> [1] "Jeff L."
#>
#> $login
#> [1] "jtleek"
#>
#> $id
#> [1] 1571674
#>
#> $location
#> [1] "Baltimore,MD"
We use single square bracket indexing and a character vector to index by name. How will we ram this into the map() framework? To paraphrase Chambers, “everything that happens in R is a function call” and indexing with [ is no exception.
It feels (and maybe looks) weird, but we can map [ just like any other function. Recall map() usage:
map(.x, .f, ...)
The function .f will be [. And we finally get to use ...! This is where we pass the character vector of the names of our desired elements. We inspect the result for the first 2 users.
x <- map(gh_users, `[`, c("login", "name", "id", "location"))
str(x[1:2])
#> List of 2
#> $ :List of 4
#> ..$ login : chr "gaborcsardi"
#> ..$ name : chr "Gábor Csárdi"
#> ..$ id : int 660288
#> ..$ location: chr "Chippenham, UK"
#> $ :List of 4
#> ..$ login : chr "jennybc"
#> ..$ name : chr "Jennifer (Jenny) Bryan"
#> ..$ id : int 599454
#> ..$ location: chr "Vancouver, BC, Canada"
Some people find this ugly and might prefer the extract() function from magrittr.
x <- map(gh_users, magrittr::extract, c("login", "name", "id", "location"))
str(x[3:4])
#> List of 2
#> $ :List of 4
#> ..$ login : chr "jtleek"
#> ..$ name : chr "Jeff L."
#> ..$ id : int 1571674
#> ..$ location: chr "Baltimore,MD"
#> $ :List of 4
#> ..$ login : chr "juliasilge"
#> ..$ name : chr "Julia Silge"
#> ..$ id : int 12505835
#> ..$ location: chr "Salt Lake City, UT"
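If passing [ as a function still feels opaque, the sketch below spells out what happens to the extra argument. It uses made-up data (toy_users and wanted are hypothetical names) and compares mapping [ with an anonymous function that does the same indexing by hand.

```r
library(purrr)

# Hypothetical stand-in for gh_users
toy_users <- list(
  list(login = "alice", id = 1L, name = "Alice A."),
  list(login = "bob", id = 2L, name = "Bob B.")
)
wanted <- c("login", "name")

# The names travel through ... and become the index argument of `[`
via_dots <- map(toy_users, `[`, wanted)

# The same operation written as an explicit function
via_function <- map(toy_users, function(x) x[wanted])

identical(via_dots, via_function)
```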
1. Map [ or magrittr::extract() over users, requesting these four elements by position instead of name.
We just learned how to extract multiple elements per user by mapping [. But, since [ is non-simplifying, each user’s elements are returned in a list. And, as it must, map() itself returns a list. We’ve traded one recursive list for another recursive list, albeit a slightly less complicated one.
How can we “stack up” these results row-wise, i.e. one row per user and variables for “login”, “name”, etc.? A data frame would be the perfect data structure for this information.
This is what map_dfr() is for.
map_dfr(gh_users, `[`, c("login", "name", "id", "location"))
#> # A tibble: 6 x 4
#> login name id location
#> <chr> <chr> <int> <chr>
#> 1 gaborcsardi Gábor Csárdi 660288 Chippenham, UK
#> 2 jennybc Jennifer (Jenny) Bryan 599454 Vancouver, BC, Canada
#> 3 jtleek Jeff L. 1571674 Baltimore,MD
#> 4 juliasilge Julia Silge 12505835 Salt Lake City, UT
#> 5 leeper Thomas J. Leeper 3505428 London, United Kingdom
#> 6 masalmon Maëlle Salmon 8360597 Barcelona, Spain
Finally! A data frame! Hallelujah!
Notice how the variables have been automatically type converted. It’s a beautiful thing. Until it’s not. When programming, it is safer, but more cumbersome, to explicitly specify type and build your data frame the usual way.
gh_users %>% {
  tibble(
    login = map_chr(., "login"),
    name = map_chr(., "name"),
    id = map_int(., "id"),
    location = map_chr(., "location")
  )
}
#> # A tibble: 6 x 4
#> login name id location
#> <chr> <chr> <int> <chr>
#> 1 gaborcsardi Gábor Csárdi 660288 Chippenham, UK
#> 2 jennybc Jennifer (Jenny) Bryan 599454 Vancouver, BC, Canada
#> 3 jtleek Jeff L. 1571674 Baltimore,MD
#> 4 juliasilge Julia Silge 12505835 Salt Lake City, UT
#> 5 leeper Thomas J. Leeper 3505428 London, United Kingdom
#> 6 masalmon Maëlle Salmon 8360597 Barcelona, Spain
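1. Use map_dfr() to create a data frame with one row per user and variables for “name”, “following”, and “created_at”. What type are the variables?
The gh_users list from above has one primary level of nesting, but it’s common to have even more.
Meet gh_repos. It is a list with one component for each of our 6 GitHub users. Each component is, in turn, a list of that user’s repositories (at most the first 30), and each repository is itself a list of information about that repo.
The repurrrsive package provides this in an R list named gh_repos.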
str(gh_repos, max.level = 1)
#> List of 6
#> $ :List of 30
#> $ :List of 30
#> $ :List of 30
#> $ :List of 26
#> $ :List of 30
#> $ :List of 30
As usual, we have no idea what’s in here and, again, this is normal. To work with lists, you have to develop list inspection strategies.
You can explore it interactively with listviewer::jsonedit().
Use str(), [, and [[ to explore this list, possibly in addition to the interactive list viewer.
1. How many elements does gh_repos have? How many elements does each of those elements have?
2. Extract the list for one repo of one user and call str() on it. Maybe print the whole thing to screen. How many elements does this list have and what are their names? Do the same for at least one other repo from a different user and get a rough sense for whether these repo-specific lists tend to look similar.
Now we use the indexing shortcuts in a more complicated setting. Instead of providing a single name or position, we use a vector: the j-th element addresses the j-th level of the hierarchy.
It’s easiest to see in a concrete example. We get the full name (element 3) of the first repository listed for each user.
gh_repos %>%
map_chr(c(1, 3))
#> [1] "gaborcsardi/after" "jennybc/2013-11_sfu" "jtleek/advdatasci"
#> [4] "juliasilge/2016-14" "leeper/ampolcourse" "masalmon/aqi_pdf"
## TO DO? I would prefer a character example :( but gh_repos is unnamed atm
Note that this does NOT give elements 1 and 3 of gh_repos. It extracts the first repo for each user and, within that, the 3rd piece of information for the repo.
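The drill-down can be sketched on a small made-up list with the same users, repos, fields shape as gh_repos. The data and the name toy_repos are hypothetical. The vector shortcut is compared with the nested [[ indexing it replaces.

```r
library(purrr)

# Hypothetical stand-in for gh_repos: users -> repos -> repo fields
toy_repos <- list(
  list(
    list(id = 1L, name = "after", full_name = "alice/after"),
    list(id = 2L, name = "ask", full_name = "alice/ask")
  ),
  list(
    list(id = 3L, name = "bingo", full_name = "bob/bingo")
  )
)

# c(1, 3): within each user take repo 1, then within that repo take field 3
via_shortcut <- map_chr(toy_repos, c(1, 3))

# The same drill-down written out by hand
via_function <- map_chr(toy_repos, function(user) user[[1]][[3]])

via_shortcut
identical(via_shortcut, via_function)
```

Note that the second repo of the first user never appears in the result: the vector walks down one path per user, it does not select several elements at one level.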
1. Use map_chr() and the position indexing shortcut with vector input to get an atomic character vector of the 6 GitHub usernames for our 6 users: “gaborcsardi”, “jennybc”, etc. You will need to use your list inspection skills to figure out where this info lives.
We go out in a blaze of glory now, using all of the techniques from above plus a couple of new ones.
NOTE TO SELF: this still goes from zero to sixty too fast.
Mission: get a data frame with one row per repository, with variables identifying which GitHub user owns it, the repository name, etc.
Step 1: Put the gh_repos list into a data frame, along with identifying GitHub usernames. The care and feeding of lists inside a data frame – “list-columns” – is the subject of its own lesson (yet to be written / linked), so I ask you to simply accept that this can be done.
We use the answer to the previous exercise to grab the 6 usernames and set them as the names on the gh_repos list. Then we use tibble::enframe() to convert this named list into a tibble with the names as one variable and the list as the other. This is a generally useful setup technique.
(unames <- map_chr(gh_repos, c(1, 4, 1)))
#> [1] "gaborcsardi" "jennybc" "jtleek" "juliasilge" "leeper"
#> [6] "masalmon"
(udf <- gh_repos %>%
  set_names(unames) %>%
  enframe("username", "gh_repos"))
#> # A tibble: 6 x 2
#> username gh_repos
#> <chr> <list>
#> 1 gaborcsardi <list [30]>
#> 2 jennybc <list [30]>
#> 3 jtleek <list [30]>
#> 4 juliasilge <list [26]>
#> 5 leeper <list [30]>
#> 6 masalmon <list [30]>
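The same setup step can be tried on a tiny made-up list, which makes it easier to see what enframe() does with the names and the values. The names toy_repos, toy_names, and toy_df are hypothetical.

```r
library(purrr)
library(tibble)

# Hypothetical unnamed list: one component per user, each holding repo names
toy_repos <- list(list("after", "ask"), list("bingo"))
toy_names <- c("alice", "bob")

# Name the list, then turn names and values into two columns
toy_df <- toy_repos %>%
  set_names(toy_names) %>%
  enframe("username", "repos")

toy_df$username
map_int(toy_df$repos, length)
```

The repos column is a list-column: each cell still holds the full list for that user.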
Build confidence by doing something modest on the list-column of repos. This is your introduction to another powerful, general technique: map() inside mutate(). Note we are now bringing the data frame wrangling tools from dplyr and tidyr to bear.
udf %>%
mutate(n_repos = map_int(gh_repos, length))
#> # A tibble: 6 x 3
#> username gh_repos n_repos
#> <chr> <list> <int>
#> 1 gaborcsardi <list [30]> 30
#> 2 jennybc <list [30]> 30
#> 3 jtleek <list [30]> 30
#> 4 juliasilge <list [26]> 26
#> 5 leeper <list [30]> 30
#> 6 masalmon <list [30]> 30
This shows that we know how to operate on a list-column inside a tibble.
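As a further sketch of map() inside mutate(), the example below works on a made-up tibble (toy_df is hypothetical) and adds two columns from a list-column. The second column uses a list as the extraction shortcut, which lets a position and a name be combined in one drill-down.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical tibble with a list-column of repos per user
toy_df <- tibble(
  username = c("alice", "bob"),
  repos = list(
    list(list(name = "after"), list(name = "ask")),
    list(list(name = "bingo"))
  )
)

toy_df2 <- toy_df %>%
  mutate(
    n_repos = map_int(repos, length),            # number of repos per user
    first_repo = map_chr(repos, list(1, "name")) # name of each user's first repo
  )

toy_df2
```

Because the type-specific variants return atomic vectors of the right length, their output drops straight into mutate() as ordinary columns.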
Figure out how to do what we want for a single user, i.e. for a single element of udf$gh_repos. Walk before you run.
How far do we need to drill to get a single repo? How do we create “one row’s worth” of data for this repo? How do we do that for all repos for a single user?
## one_user is a list of repos for one user
one_user <- udf$gh_repos[[1]]
## one_user[[1]] is a list of info for one repo
one_repo <- one_user[[1]]
str(one_repo, max.level = 1, list.len = 5)
#> List of 68
#> $ id : int 61160198
#> $ name : chr "after"
#> $ full_name : chr "gaborcsardi/after"
#> $ owner :List of 17
#> $ private : logi FALSE
#> [list output truncated]
## a highly selective list of tibble-worthy info for one repo
one_repo[c("name", "fork", "open_issues")]
#> $name
#> [1] "after"
#>
#> $fork
#> [1] FALSE
#>
#> $open_issues
#> [1] 0
## make a data frame of that info for all a user's repos
map_df(one_user, `[`, c("name", "fork", "open_issues"))
#> # A tibble: 30 x 3
#> name fork open_issues
#> <chr> <lgl> <int>
#> 1 after FALSE 0
#> 2 argufy FALSE 6
#> 3 ask FALSE 4
#> 4 baseimports FALSE 0
#> 5 citest TRUE 0
#> 6 clisymbols FALSE 0
#> 7 cmaker TRUE 0
#> 8 cmark TRUE 0
#> 9 conditions TRUE 0
#> 10 crayon FALSE 7
#> # … with 20 more rows
## YYAAAASSSSSSS
Now we scale this up to all of our users. Yes, we use mutate() to map() inside a map().
udf %>%
  mutate(repo_info = gh_repos %>%
    map(. %>% map_df(`[`, c("name", "fork", "open_issues"))))
#> # A tibble: 6 x 3
#> username gh_repos repo_info
#> <chr> <list> <list>
#> 1 gaborcsardi <list [30]> <tibble [30 × 3]>
#> 2 jennybc <list [30]> <tibble [30 × 3]>
#> 3 jtleek <list [30]> <tibble [30 × 3]>
#> 4 juliasilge <list [26]> <tibble [26 × 3]>
#> 5 leeper <list [30]> <tibble [30 × 3]>
#> 6 masalmon <list [30]> <tibble [30 × 3]>
The user-specific tibbles about each user’s repos are now sitting in the repo_info list-column. How do we simplify this to a normal data frame that is free of list-columns? Remove the gh_repos variable, which has served its purpose, and use tidyr::unnest().
(rdf <- udf %>%
  mutate(
    repo_info = gh_repos %>%
      map(. %>% map_df(`[`, c("name", "fork", "open_issues")))
  ) %>%
  select(-gh_repos) %>%
  tidyr::unnest(repo_info))
#> # A tibble: 176 x 4
#> username name fork open_issues
#> <chr> <chr> <lgl> <int>
#> 1 gaborcsardi after FALSE 0
#> 2 gaborcsardi argufy FALSE 6
#> 3 gaborcsardi ask FALSE 4
#> 4 gaborcsardi baseimports FALSE 0
#> 5 gaborcsardi citest TRUE 0
#> 6 gaborcsardi clisymbols FALSE 0
#> 7 gaborcsardi cmaker TRUE 0
#> 8 gaborcsardi cmark TRUE 0
#> 9 gaborcsardi conditions TRUE 0
#> 10 gaborcsardi crayon FALSE 7
#> # … with 166 more rows
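To isolate what unnest() contributes, here is a minimal sketch on a made-up tibble whose list-column already holds one small tibble per user. The name toy_df and its contents are hypothetical.

```r
library(tibble)
library(tidyr)

# Hypothetical tibble: one row per user, one tibble of repo info per row
toy_df <- tibble(
  username = c("alice", "bob"),
  repo_info = list(
    tibble(name = c("after", "ask"), open_issues = c(0L, 4L)),
    tibble(name = "bingo", open_issues = 3L)
  )
)

# unnest() stacks the inner tibbles and repeats username as needed
flat <- unnest(toy_df, repo_info)
flat
```

The result has one row per repo, and the outer username value is repeated once for each row of the inner tibble it came with.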
Let’s do a little manipulation with dplyr to find some of the more interesting repos and get repos from each user in front of our eyeballs. I get rid of forks and show the 3 repos for each user that have the most open issues. (Remember we are only working with the first 30 repos for each user – I had to remember my open issue situation is much more grim than this table suggests.)
rdf %>%
  filter(!fork) %>%
  select(-fork) %>%
  group_by(username) %>%
  arrange(username, desc(open_issues)) %>%
  slice(1:3)
#> # A tibble: 18 x 3
#> # Groups: username [6]
#> username name open_issues
#> <chr> <chr> <int>
#> 1 gaborcsardi gh 8
#> 2 gaborcsardi crayon 7
#> 3 gaborcsardi argufy 6
#> 4 jennybc 2014-01-27-miami 4
#> 5 jennybc bingo 3
#> 6 jennybc candy 2
#> 7 jtleek datasharing 399
#> 8 jtleek dataanalysis 5
#> 9 jtleek genstats 3
#> 10 juliasilge tidytext 5
#> 11 juliasilge choroplethrUTCensusTract 0
#> 12 juliasilge CountyHealthApp 0
#> 13 leeper crandatapkgs 12
#> 14 leeper csvy 2
#> 15 leeper ciplotm 1
#> 16 masalmon cpcb 5
#> 17 masalmon rtimicropem 5
#> 18 masalmon laads 4
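The filter, sort, and slice pattern is easier to follow on a handful of rows. The sketch below applies the same steps to made-up data (toy_rdf is hypothetical), keeping the top 2 non-fork repos per user instead of the top 3.

```r
library(dplyr)
library(tibble)

# Hypothetical repo table in the same shape as rdf
toy_rdf <- tibble(
  username = c("alice", "alice", "alice", "bob", "bob"),
  name = c("after", "ask", "argufy", "bingo", "candy"),
  fork = c(FALSE, FALSE, TRUE, FALSE, FALSE),
  open_issues = c(0L, 4L, 9L, 3L, 2L)
)

top2 <- toy_rdf %>%
  filter(!fork) %>%                        # drop forks (argufy here)
  select(-fork) %>%
  group_by(username) %>%
  arrange(username, desc(open_issues)) %>% # most open issues first, per user
  slice(1:2) %>%                           # first 2 rows within each group
  ungroup()

top2
```

Because slice() respects the grouping, the row positions are counted within each user rather than across the whole table.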