Simplifying data from a list of GitHub users

A great use of purrr’s map() functions is to dig information out of a non-rectangular data structure and create a neat data frame. Where do these awkward objects come from? Often as JSON or XML from an API. JSON and XML are two plain text formats for non-rectangular data, i.e. for scenarios where CSV is not an option. If you are lucky it’s JSON, which is less aggravating, and readily converts to a list you can work with in R.

Here we explore some lists obtained from the GitHub API. Interactive exploration of these lists is made possible by the listviewer package.

Load the packages.

library(repurrrsive)
library(listviewer)
library(jsonlite)
library(dplyr)
library(tibble)
library(purrr)

Get several GitHub users

The repurrrsive package provides information on 6 GitHub users in an R list named gh_users.

gh_users is a recursive list:

one component for each of the 6 GitHub users
each component is, in turn, a list with info on the user

We have no clue what is in this list. This is normal. That is why it’s important to develop list inspection strategies.

Use str() with arguments such as max.level and list.len. It often pays off to do deeper inspection on a single element.

str(gh_users, max.level = 1)
#> List of 6
#>  $ :List of 30
#>  $ :List of 30
#>  $ :List of 30
#>  $ :List of 30
#>  $ :List of 30
#>  $ :List of 30
str(gh_users[[1]], list.len = 6)
#> List of 30
#>  $ login              : chr "gaborcsardi"
#>  $ id                 : int 660288
#>  $ avatar_url         : chr "https://avatars.githubusercontent.com/u/660288?v=3"
#>  $ gravatar_id        : chr ""
#>  $ url                : chr "https://api.github.com/users/gaborcsardi"
#>  $ html_url           : chr "https://github.com/gaborcsardi"
#>   [list output truncated]

You can also use listviewer::jsonedit() to explore it interactively:

Exercises

Read the documentation on str(). What does max.level control? Do str(gh_users, max.level = i) for i in 0,1, and 2.
What does the list.len argument of str() control? What is it’s default value? Call str() on gh_users and then on a single component of gh_users with list.len set to a value much smaller than the default.
Call str() on gh_users, specifying both max.level and list.len.
Recall the list and vector indexing techniques. Inspect elements 1, 2, 6, 18, 21, and 24 of the list component for the 5th GitHub user. One of these should be the URL for the user’s profile on GitHub.com. Go there and compare info you see there with the info you just extracted from gh_users.
Consider the interactive view of gh_users here. Or, optionally, install the listviewer package via install.packages("listviewer") and call jsonedit(gh_users) to run this widget locally. Can you find the same info you extracted in the previous exercise? The same info you see in user’s GitHub.com profile?

Name and position shortcuts

Who are these GitHub users?

We need to reach into each user’s list and pull out the element that holds the user’s name or, maybe, username. How?

Recall the basic usage of purrr::map():

map(.x, .f, ...)

The first input .x is your list. It will be gh_users here.

The second input .f, is the function to apply to each component of the list.

We want the element with name “login”, so we do this:

map(gh_users, "login")
#> [[1]]
#> [1] "gaborcsardi"
#> 
#> [[2]]
#> [1] "jennybc"
#> 
#> [[3]]
#> [1] "jtleek"
#> 
#> [[4]]
#> [1] "juliasilge"
#> 
#> [[5]]
#> [1] "leeper"
#> 
#> [[6]]
#> [1] "masalmon"

We are exploiting one of purrr’s most useful features: a shortcut to create a function that extracts an element based on its name.

A companion shortcut is used if you provide a positive integer to map(). This creates a function that extracts an element based on position.

The 18th element of each user’s list is his or her name and we get them like so:

map(gh_users, 18)
#> [[1]]
#> [1] "Gábor Csárdi"
#> 
#> [[2]]
#> [1] "Jennifer (Jenny) Bryan"
#> 
#> [[3]]
#> [1] "Jeff L."
#> 
#> [[4]]
#> [1] "Julia Silge"
#> 
#> [[5]]
#> [1] "Thomas J. Leeper"
#> 
#> [[6]]
#> [1] "Maëlle Salmon"

To recap, here are two shortcuts for making the .f function that map() will apply:

provide “TEXT” to extract the element named “TEXT”
- equivalent to function(x) x[["TEXT"]]
provide i to extract the i-th element
- equivalent to function(x) x[[i]]

You will frequently see map() used together with the pipe %>%. These calls produce the same result as the above.

gh_users %>% 
  map("login")
gh_users %>% 
  map(18)

Exercises

Use names() to inspect the names of the list elements associated with a single user. What is the index or position of the created_at element? Use the character and position shortcuts to extract the created_at elements for all 6 users.
What happens if you use the character shortcut with a string that does not appear in the lists’ names?
What happens if you use the position shortcut with a number greater than the length of the lists?
What if these shortcuts did not exist? Write a function that takes a list and a string as input and returns the list element that bears the name in the string. Apply this to gh_users via map(). Do you get the same result as with the shortcut? Reflect on code length and readability.
Write another function that takes a list and an integer as input and returns the list element at that position. Apply this to gh_users via map(). How does this result and process compare with the shortcut?

Type-specific map

map() always returns a list, even if all the elements have the same flavor and are of length one. But in that case, you might prefer a simpler object: an atomic vector.

If you expect map() to return output that can be turned into an atomic vector, it is best to use a type-specific variant of map(). This is more efficient than using map() to get a list and then simplifying the result in a second step. Also purrr will alert you to any problems, i.e. if one or more inputs has the wrong type or length. This is the increased rigor about type alluded to in the section about coercion.

Our current examples are suitable for demonstrating map_chr(), since the requested elements are always character.

map_chr(gh_users, "login")
#> [1] "gaborcsardi" "jennybc"     "jtleek"      "juliasilge"  "leeper"     
#> [6] "masalmon"
map_chr(gh_users, 18)
#> [1] "Gábor Csárdi"           "Jennifer (Jenny) Bryan"
#> [3] "Jeff L."                "Julia Silge"           
#> [5] "Thomas J. Leeper"       "Maëlle Salmon"

Besides map_chr(), there are other variants of map(), with the target type conveyed by the name:

map_lgl(), map_int(), map_dbl()

Exercises

For each user, the second element is named “id”. This is the user’s GitHub id number, where it seems like the person with an id of, say, 10 was the 10th person to sign up for GitHub. At least, it’s something like that! Use a type-specific form of map() and an extraction shortcut to extract the ids of these 6 users.
Use your list inspection strategies to find the list element that is logical. There is one! Use a type-specific form of map() and an extraction shortcut to extract this for all 6 users.
Use your list inspection strategies to find elements other than id that are numbers. Practice extracting them.

Extract multiple values

What if you want to retrieve multiple elements? Such as the user’s name and GitHub username? First, recall how we do this with the list for a single user:

gh_users[[3]][c("name", "login", "id", "location")]
#> $name
#> [1] "Jeff L."
#> 
#> $login
#> [1] "jtleek"
#> 
#> $id
#> [1] 1571674
#> 
#> $location
#> [1] "Baltimore,MD"

We use single square bracket indexing and a character vector to index by name. How will we ram this into the map() framework? To paraphrase Chambers, “everything that happens in R is a function call” and indexing with [ is no exception.

It feels (and maybe looks) weird, but we can map [ just like any other function. Recall map() usage:

map(.x, .f, ...)

The function .f will be [. And we finally get to use ...! This is where we pass the character vector of the names of our desired elements. We inspect the result for the first 2 users.

x <- map(gh_users, `[`, c("login", "name", "id", "location"))
str(x[1:2])
#> List of 2
#>  $ :List of 4
#>   ..$ login   : chr "gaborcsardi"
#>   ..$ name    : chr "Gábor Csárdi"
#>   ..$ id      : int 660288
#>   ..$ location: chr "Chippenham, UK"
#>  $ :List of 4
#>   ..$ login   : chr "jennybc"
#>   ..$ name    : chr "Jennifer (Jenny) Bryan"
#>   ..$ id      : int 599454
#>   ..$ location: chr "Vancouver, BC, Canada"

Some people find this ugly and might prefer the extract() function from magrittr.

x <- map(gh_users, magrittr::extract, c("login", "name", "id", "location"))
str(x[3:4])
#> List of 2
#>  $ :List of 4
#>   ..$ login   : chr "jtleek"
#>   ..$ name    : chr "Jeff L."
#>   ..$ id      : int 1571674
#>   ..$ location: chr "Baltimore,MD"
#>  $ :List of 4
#>   ..$ login   : chr "juliasilge"
#>   ..$ name    : chr "Julia Silge"
#>   ..$ id      : int 12505835
#>   ..$ location: chr "Salt Lake City, UT"

Exercises

Use your list inspection skills to determine the position of the elements named “login”, “name”, “id”, and “location”. Map [ or magrittr::extract() over users, requesting these four elements by position instead of name.

Data frame output

We just learned how to extract multiple elements per user by mapping [. But, since [ is non-simplifying, each user’s elements are returned in a list. And, as it must, map() itself returns list. We’ve traded one recursive list for another recursive list, albeit a slightly less complicated one.

How can we “stack up” these results row-wise, i.e. one row per user and variables for “login”, “name”, etc.? A data frame would be the perfect data structure for this information.

This is what map_dfr() is for.

map_dfr(gh_users, `[`, c("login", "name", "id", "location"))
#> # A tibble: 6 x 4
#>   login       name                         id location              
#>   <chr>       <chr>                     <int> <chr>                 
#> 1 gaborcsardi Gábor Csárdi             660288 Chippenham, UK        
#> 2 jennybc     Jennifer (Jenny) Bryan   599454 Vancouver, BC, Canada 
#> 3 jtleek      Jeff L.                 1571674 Baltimore,MD          
#> 4 juliasilge  Julia Silge            12505835 Salt Lake City, UT    
#> 5 leeper      Thomas J. Leeper        3505428 London, United Kingdom
#> 6 masalmon    Maëlle Salmon           8360597 Barcelona, Spain

Finally! A data frame! Hallelujah!

Notice how the variables have been automatically type converted. It’s a beautiful thing. Until it’s not. When programming, it is safer, but more cumbersome, to explicitly specify type and build your data frame the usual way.

gh_users %>% {
  tibble(
       login = map_chr(., "login"),
        name = map_chr(., "name"),
          id = map_int(., "id"),
    location = map_chr(., "location")
  )
}
#> # A tibble: 6 x 4
#>   login       name                         id location              
#>   <chr>       <chr>                     <int> <chr>                 
#> 1 gaborcsardi Gábor Csárdi             660288 Chippenham, UK        
#> 2 jennybc     Jennifer (Jenny) Bryan   599454 Vancouver, BC, Canada 
#> 3 jtleek      Jeff L.                 1571674 Baltimore,MD          
#> 4 juliasilge  Julia Silge            12505835 Salt Lake City, UT    
#> 5 leeper      Thomas J. Leeper        3505428 London, United Kingdom
#> 6 masalmon    Maëlle Salmon           8360597 Barcelona, Spain

Exercises

Use map_dfr() to create a data frame with one row per user and variables for “name”, “following”, and “created_at”. What type are the variables?

Repositories for each user

The gh_users list from above has one primary level of nesting, but it’s common to have even more.

Meet gh_repos. It is a list with:

one component for each of our 6 GitHub users
each component is another list of that user’s repositories (or just the first 30, if user has more than 30)
several of those list components are, again, a list

The repurrrsive package provides this in an R list named gh_repos.

str(gh_repos, max.level = 1)
#> List of 6
#>  $ :List of 30
#>  $ :List of 30
#>  $ :List of 30
#>  $ :List of 26
#>  $ :List of 30
#>  $ :List of 30

As usual, we have no idea what’s in here and, again, this is normal. To work with lists, you have to develop list inspection strategies.

Explore it interactively:

Exercises

Use str(), [, and [[ to explore this list, possibly in addition to the interactive list viewer.

How many elements does gh_repos have? How many elements does each of those elements have?
Extract a list with all the info for one repo for one user. Use str() on it. Maybe print the whole thing to screen. How many elements does this list have and what are their names? Do the same for at least one other repo from a different user and get an rough sense for whether these repo-specific lists tend to look similar.
What are three pieces of repo information that strike you as the most useful? I.e. if you were going to make a data frame of repositories, what might the variables be?

Vector input to extraction shortcuts

Now we use the indexing shortcuts in a more complicated setting. Instead of providing a single name or position, we use a vector:

the j-th element addresses the j-th level of the hierarchy

It’s easiest to see in a concrete example. We get the full name (element 3) of the first repository listed for each user.

gh_repos %>%
  map_chr(c(1, 3))
#> [1] "gaborcsardi/after"   "jennybc/2013-11_sfu" "jtleek/advdatasci"  
#> [4] "juliasilge/2016-14"  "leeper/ampolcourse"  "masalmon/aqi_pdf"
## TO DO? I would prefer a character example :( but gh_repos is unnamed atm

Note that this does NOT give elements 1 and 3 of gh_repos. It extracts the first repo for each user and, within that, the 3rd piece of information for the repo.

Exercises

Each repository carries information about its owner in a list. Use map_chr() and the position indexing shortcut with vector input to get an atomic character vector of the 6 GitHub usernames for our 6 users: “gaborcsardi”, “jennybc”, etc. You will need to use your list inspection skills to figure out where this info lives.

List inside a data frame

We go out in a blaze of glory now, using all of the techniques from above plus a couple news ones.

NOTE TO SELF: this still goes from zero to sixty too fast.

Mission: get a data frame with one row per repository, with variables identifying which GitHub user owns it, the repository name, etc.

Step 1: Put the gh_repos list into a data frame, along with identifying GitHub usernames. The care and feeding of lists inside a data frame – “list-columns” – is the subject of its own lesson (yet to be written / linked), so I ask you to simply accept that this can be done.

We use the answer to the previous exercise to grab the 6 usernames and set them as the names on the gh_repos list. Then we use tibble::enframe() to convert this named vector into a tibble with the names as one variable and the vector as the other. This is a generally useful setup technique.

(unames <- map_chr(gh_repos, c(1, 4, 1)))
#> [1] "gaborcsardi" "jennybc"     "jtleek"      "juliasilge"  "leeper"     
#> [6] "masalmon"
(udf <- gh_repos %>%
    set_names(unames) %>% 
    enframe("username", "gh_repos"))
#> # A tibble: 6 x 2
#>   username    gh_repos   
#>   <chr>       <list>     
#> 1 gaborcsardi <list [30]>
#> 2 jennybc     <list [30]>
#> 3 jtleek      <list [30]>
#> 4 juliasilge  <list [26]>
#> 5 leeper      <list [30]>
#> 6 masalmon    <list [30]>

Build confidence by doing something modest on the list-column of repos. This is your introduction to another powerful, general technique: map() inside mutate(). Note we are now bringing the data frame wrangling tools from dplyr and tidyr to bear.

udf %>% 
  mutate(n_repos = map_int(gh_repos, length))
#> # A tibble: 6 x 3
#>   username    gh_repos    n_repos
#>   <chr>       <list>        <int>
#> 1 gaborcsardi <list [30]>      30
#> 2 jennybc     <list [30]>      30
#> 3 jtleek      <list [30]>      30
#> 4 juliasilge  <list [26]>      26
#> 5 leeper      <list [30]>      30
#> 6 masalmon    <list [30]>      30

This shows that we know how to operate on a list-column inside a tibble.

Figure out how to do what we want for a single user, i.e. for a single element of udf$gh_repos. Walk before you run.

How far to we need to drill to get a single repo? How do we create “one row’s worth” of data for this repo? How do we do that for all repos for a single user?

## one_user is a list of repos for one user
one_user <- udf$gh_repos[[1]]
## one_user[[1]] is a list of info for one repo
one_repo <- one_user[[1]]
str(one_repo, max.level = 1, list.len = 5)
#> List of 68
#>  $ id               : int 61160198
#>  $ name             : chr "after"
#>  $ full_name        : chr "gaborcsardi/after"
#>  $ owner            :List of 17
#>  $ private          : logi FALSE
#>   [list output truncated]
## a highly selective list of tibble-worthy info for one repo
one_repo[c("name", "fork", "open_issues")]
#> $name
#> [1] "after"
#> 
#> $fork
#> [1] FALSE
#> 
#> $open_issues
#> [1] 0
## make a data frame of that info for all a user's repos
map_df(one_user, `[`, c("name", "fork", "open_issues"))
#> # A tibble: 30 x 3
#>    name        fork  open_issues
#>    <chr>       <lgl>       <int>
#>  1 after       FALSE           0
#>  2 argufy      FALSE           6
#>  3 ask         FALSE           4
#>  4 baseimports FALSE           0
#>  5 citest      TRUE            0
#>  6 clisymbols  FALSE           0
#>  7 cmaker      TRUE            0
#>  8 cmark       TRUE            0
#>  9 conditions  TRUE            0
#> 10 crayon      FALSE           7
#> # … with 20 more rows
## YYAAAASSSSSSS

Now we scale this up to all of our users. Yes, we use mutate to map() inside a map().

udf %>% 
  mutate(repo_info = gh_repos %>%
           map(. %>% map_df(`[`, c("name", "fork", "open_issues"))))
#> # A tibble: 6 x 3
#>   username    gh_repos    repo_info        
#>   <chr>       <list>      <list>           
#> 1 gaborcsardi <list [30]> <tibble [30 × 3]>
#> 2 jennybc     <list [30]> <tibble [30 × 3]>
#> 3 jtleek      <list [30]> <tibble [30 × 3]>
#> 4 juliasilge  <list [26]> <tibble [26 × 3]>
#> 5 leeper      <list [30]> <tibble [30 × 3]>
#> 6 masalmon    <list [30]> <tibble [30 × 3]>

The user-specific tibbles about each user’s repos are now sitting in the repo_info. How do we simplify this to a normal data frame that is free of list-columns? Remove the gh_repos variable, which has served its purpose and use tidyr::unnest().

(rdf <- udf %>% 
   mutate(
     repo_info = gh_repos %>%
       map(. %>% map_df(`[`, c("name", "fork", "open_issues")))
   ) %>% 
   select(-gh_repos) %>% 
   tidyr::unnest(repo_info))
#> # A tibble: 176 x 4
#>    username    name        fork  open_issues
#>    <chr>       <chr>       <lgl>       <int>
#>  1 gaborcsardi after       FALSE           0
#>  2 gaborcsardi argufy      FALSE           6
#>  3 gaborcsardi ask         FALSE           4
#>  4 gaborcsardi baseimports FALSE           0
#>  5 gaborcsardi citest      TRUE            0
#>  6 gaborcsardi clisymbols  FALSE           0
#>  7 gaborcsardi cmaker      TRUE            0
#>  8 gaborcsardi cmark       TRUE            0
#>  9 gaborcsardi conditions  TRUE            0
#> 10 gaborcsardi crayon      FALSE           7
#> # … with 166 more rows

Let’s do a little manipulation with dplyr to find some of the more interesting repos and get repos from each user in front of our eyeballs. I get rid of forks and show the 3 repos for each user that have the most open issues. (Remember we are only working with the first 30 repos for each user – I had to remember my open issue situation is much more grim than this table suggests.)

rdf %>% 
  filter(!fork) %>% 
  select(-fork) %>% 
  group_by(username) %>%
  arrange(username, desc(open_issues)) %>%
  slice(1:3)
#> # A tibble: 18 x 3
#> # Groups:   username [6]
#>    username    name                     open_issues
#>    <chr>       <chr>                          <int>
#>  1 gaborcsardi gh                                 8
#>  2 gaborcsardi crayon                             7
#>  3 gaborcsardi argufy                             6
#>  4 jennybc     2014-01-27-miami                   4
#>  5 jennybc     bingo                              3
#>  6 jennybc     candy                              2
#>  7 jtleek      datasharing                      399
#>  8 jtleek      dataanalysis                       5
#>  9 jtleek      genstats                           3
#> 10 juliasilge  tidytext                           5
#> 11 juliasilge  choroplethrUTCensusTract           0
#> 12 juliasilge  CountyHealthApp                    0
#> 13 leeper      crandatapkgs                      12
#> 14 leeper      csvy                               2
#> 15 leeper      ciplotm                            1
#> 16 masalmon    cpcb                               5
#> 17 masalmon    rtimicropem                        5
#> 18 masalmon    laads                              4

Appendix

If you just wanted to solve this problem, you could let jsonlite simplify the JSON for you. Other packages for list handling include listless, rlist.