Mara Averick recently tweeted about a blog post from February 2015 by Zev Ross, demonstrating the conversion of JSON data into a data frame (Using R to download and parse JSON: an example using data from an open data portal). I thought it would be interesting to revisit this data wrangling task with purrr, which has only been on CRAN since fall 2015.

Zev is dealing with data on food markets in New York State that, sadly but realistically, cannot be well-represented in simple rectangular form, such as CSV. Therefore the data is available as JSON and it’s our job to turn it into a useful data frame. This is a very, very common task and a typical use case for purrr.

I downloaded and unpacked the zip archive provided by Zev: foodMarkets.zip. These files are available in this tutorial’s repo in an unpacked state: foodMarkets.

Load packages and the data

Packages used below:

  • jsonlite for parsing JSON
  • purrr for obvious reasons
  • listviewer for doing recon on awkward lists
  • tibble and dplyr for forming and manipulating data frames, especially the “tibble” variant (class tbl_df), which is friendly to list-columns.
library(jsonlite)
#> 
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#> 
#>     flatten
library(purrr)
library(listviewer)
library(tibble)
library(dplyr)
food_mkts_raw <- fromJSON("foodMarkets/retail_food_markets.json",
                          simplifyVector = FALSE)

Voyage of discovery: data

When you first import JSON, it is common to have no clue what’s in there. You must explore.

str(food_mkts_raw, max.level = 1)
#> List of 2
#>  $ meta:List of 1
#>  $ data:List of 28355
#>   .. [list output truncated]

Ooh, a data component sounds promising!

str(food_mkts_raw$data, max.level = 1, list.len = 5)
#> List of 28355
#>  $ :List of 23
#>   .. [list output truncated]
#>  $ :List of 23
#>   .. [list output truncated]
#>  $ :List of 23
#>   .. [list output truncated]
#>  $ :List of 23
#>   .. [list output truncated]
#>  $ :List of 23
#>   .. [list output truncated]
#>   [list output truncated]

Working hypothesis: we have data of similar form for each of 28355 food markets. A slightly mysterious, but homogeneous list.

Pull out just the data component and store as food_mkts.

food_mkts <- food_mkts_raw[["data"]]
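
While we're at it, we can pressure-test the working hypothesis: if the list really is homogeneous, every element should have length 23. A quick check (mine, not from Zev's post):

food_mkts %>% 
  map_int(length) %>% 
  unique()
#> [1] 23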

Voyage of discovery: meta and columns

Before we move on, what’s in the meta component of the raw JSON? It turns out we should focus on its sole component, which is named view.

jsonedit(food_mkts_raw[[c("meta", "view")]])

The columns component is especially valuable, because it tells us what data we have for each food market over in the data component. This is an example of a common but, in my opinion, dangerous practice of storing information with no explanatory names and providing a separate lexicon. This is kind of like a CSV with variables named X1 through Xn or no names at all.

We need to pull out the name elements of columns. We kick off with [[ vector indexing into a nested list, to drill down to the columns sub-component. Then map_chr(..., "name") grabs the names as a character vector.
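
As a quick aside, here's a toy example, unrelated to the food market data, showing how [[ with a character vector indexes recursively, one element per level:

x <- list(a = list(b = list(c = "treasure")))
x[[c("a", "b", "c")]]
#> [1] "treasure"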

(cnames <- food_mkts_raw[[c("meta", "view", "columns")]] %>% 
   map_chr("name"))
#>  [1] "sid"            "id"             "position"       "created_at"    
#>  [5] "created_meta"   "updated_at"     "updated_meta"   "meta"          
#>  [9] "County"         "License Number" "Operation Type" "Estab Type"    
#> [13] "Entity Name"    "DBA Name"       "Street Number"  "Street Name"   
#> [17] "Address Line 2" "Address Line 3" "City"           "State"         
#> [21] "Zip Code"       "Square Footage" "Location"

Apply names

Downstream work will be easier if the food market elements bear the names in cnames. I believe this is how the JSON should have been to begin with. This makes exploration and troubleshooting easier and, more importantly, it helps in our goal to create a data frame.

food_mkts <- food_mkts %>% 
  map(set_names, cnames)
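
A quick spot check that the names took:

food_mkts[[1]] %>% 
  names() %>% 
  head()
#> [1] "sid"          "id"           "position"     "created_at"  
#> [5] "created_meta" "updated_at"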

One variable

As a warm-up activity, pull out and simplify the data for one column, across all food markets. From Zev’s post, I decided to look at DBA Name (“doing business as”).

food_mkts %>% 
  map_chr("DBA Name") %>% 
  head()
#> [1] "PLAZA 23 TRUCK STOP    " "PRICE CHOPPER #245     "
#> [3] "PEACOCK                " "FINYOUR FISHMONGER     "
#> [5] "R&A GROCERY STORE      " "ANTHONYS CHOC DIP FRUIT"

I think we’ll want to trim whitespace at some point!
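
The eventual fix is base R's trimws(), applied to all the character variables at the very end. A preview on this one variable:

food_mkts %>% 
  map_chr("DBA Name") %>% 
  trimws() %>% 
  head()
#> [1] "PLAZA 23 TRUCK STOP"     "PRICE CHOPPER #245"     
#> [3] "PEACOCK"                 "FINYOUR FISHMONGER"     
#> [5] "R&A GROCERY STORE"       "ANTHONYS CHOC DIP FRUIT"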

Data frame of (almost) everything

Again, freeriding on Zev’s post, I know that 22 of the 23 variables can be extracted in a fairly simple way. Everything but Location, which holds some unparsed JSON we must tackle separately.

Create a vector of our targeted variable names:

(to_process <- cnames[cnames != "Location"])
#>  [1] "sid"            "id"             "position"       "created_at"    
#>  [5] "created_meta"   "updated_at"     "updated_meta"   "meta"          
#>  [9] "County"         "License Number" "Operation Type" "Estab Type"    
#> [13] "Entity Name"    "DBA Name"       "Street Number"  "Street Name"   
#> [17] "Address Line 2" "Address Line 3" "City"           "State"         
#> [21] "Zip Code"       "Square Footage"

Goal: create a data frame with these variables and one row per food market.

“Rows first”

Voice from the future: not all of these 22 variables are populated for every food market. There are explicit NULLs. Zev handled this with a two-layered approach: catch the NULLs and replace with NAs, prior to combining with data from other markets via sapply() and data.frame().

purrr’s handling of void, explicit NULL, and implicit NULL is quite good, but it doesn’t cover this particular case. Therefore I too must fanny around with NULLs before I can make my beautiful data frame.
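
To make the distinction concrete: when you extract a single element per market, the type-specific mappers can patch explicit NULLs via the .null argument (renamed .default in later purrr releases). A sketch:

# one element per market: .null swaps NA in for the explicit NULLs
food_mkts[66:68] %>% 
  map_chr("Street Number", .null = NA_character_)

But extracting a sub-list with [, which is what we're about to do, passes the NULLs through untouched. Hence the fannying.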

Ideally, I would use map_df() like this on all of the food markets, instead of just the first 3:

food_mkts[1:3] %>% 
  map_df(`[`, to_process)
#> # A tibble: 3 × 22
#>     sid                                   id position created_at
#>   <int>                                <chr>    <int>      <int>
#> 1     1 D943D633-8DB5-4740-B5F4-06AAB6FC52E2        1 1412261567
#> 2     2 7A069C98-90B2-4B11-A4A7-3C4761D14DDD        2 1412261567
#> 3     3 68DBE61E-ACF4-4803-BD90-87380FDAA4F9        3 1412261567
#> # ... with 18 more variables: created_meta <chr>, updated_at <int>,
#> #   updated_meta <chr>, meta <chr>, County <chr>, `License Number` <chr>,
#> #   `Operation Type` <chr>, `Estab Type` <chr>, `Entity Name` <chr>, `DBA
#> #   Name` <chr>, `Street Number` <chr>, `Street Name` <chr>, `Address Line
#> #   2` <chr>, `Address Line 3` <chr>, City <chr>, State <chr>, `Zip
#> #   Code` <chr>, `Square Footage` <chr>

But I have learned the hard way that the NULLs are a deal-killer.

How does map_df() actually work? First, it calls map() and, in my application, gets the sub-list of non-Location variables for each food market. Then a list of these lists is shipped off to dplyr::bind_rows() for row-binding and type conversion. But the presence of NULLs breaks this workflow when there are markets, like the one below, that lack, say, a street number. Basically most of R’s data-frame-making facilities balk at NULLs. Here is some exploration of that.

food_mkts[[67]][14:16]
#> $`DBA Name`
#> [1] "PORTOBELLA PETES       "
#> 
#> $`Street Number`
#> NULL
#> 
#> $`Street Name`
#> [1] "BLDG 12 STATE CAMPUS DOL     "
data.frame(food_mkts[[67]][14:16])
#> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0
food_mkts[66:68] %>% 
  map_df(`[`, to_process)
#> Error in eval(expr, envir, enclos): incompatible sizes (1 != 0)
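
This is consistent with the mental model above: decompose map_df() into map() plus dplyr::bind_rows() by hand and you hit the same wall.

# same "incompatible sizes" error as map_df() above
food_mkts[66:68] %>% 
  map(`[`, to_process) %>% 
  bind_rows()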

I want to rescue this workflow, because the automatic type conversion from dplyr::bind_rows() is very appealing.

Therefore, like Zev, I use a custom function safe_extract(). It’s a slightly expanded version of [ that replaces NULL with NA.

safe_extract <- function(l, wut) {
  res <- l[wut]                      # `[` extraction keeps NULLs as NULL elements
  null_here <- map_lgl(res, is.null) # flag the explicit NULLs ...
  res[null_here] <- NA               # ... and overwrite them with NA
  res
}
# when `wut` is missing, `l[wut]` acts like `l[]` and returns all of `l`,
# which is why the one-argument call below works
safe_extract(food_mkts[[67]][14:16])
#> $`DBA Name`
#> [1] "PORTOBELLA PETES       "
#> 
#> $`Street Number`
#> [1] NA
#> 
#> $`Street Name`
#> [1] "BLDG 12 STATE CAMPUS DOL     "

Now I can get a data frame easily.

(mkts_df <- food_mkts %>% 
    map_df(safe_extract, to_process))
#> # A tibble: 28,355 × 22
#>      sid                                   id position created_at
#>    <int>                                <chr>    <int>      <int>
#> 1      1 D943D633-8DB5-4740-B5F4-06AAB6FC52E2        1 1412261567
#> 2      2 7A069C98-90B2-4B11-A4A7-3C4761D14DDD        2 1412261567
#> 3      3 68DBE61E-ACF4-4803-BD90-87380FDAA4F9        3 1412261567
#> 4      4 9DA5A827-A0AD-4782-8CE2-5D6867B53293        4 1412261567
#> 5      5 5D624E4B-E4A8-4F3D-BF88-190663DF8FCF        5 1412261567
#> 6      6 86BC29A4-55DA-43C5-BDED-6EF2C92C3C8B        6 1412261567
#> 7      7 51EE0EB4-6D71-4C2E-AD44-E225A1A45987        7 1412261567
#> 8      8 22B148DF-F188-4245-8E69-1137F9A6064E        8 1412261567
#> 9      9 62526ABF-801E-46DE-97CD-05C2DDE6BB20        9 1412261567
#> 10    10 13153CBA-3E65-4D4D-9A4C-ACB24BCD33DF       10 1412261567
#> # ... with 28,345 more rows, and 18 more variables: created_meta <chr>,
#> #   updated_at <int>, updated_meta <chr>, meta <chr>, County <chr>,
#> #   `License Number` <chr>, `Operation Type` <chr>, `Estab Type` <chr>,
#> #   `Entity Name` <chr>, `DBA Name` <chr>, `Street Number` <chr>, `Street
#> #   Name` <chr>, `Address Line 2` <chr>, `Address Line 3` <chr>,
#> #   City <chr>, State <chr>, `Zip Code` <chr>, `Square Footage` <chr>

Victory is ours. Well, partial victory. We still need to deal with Location.

“Rows first” Recap

Self-contained code to go from raw JSON to data frame for 22 out of 23 variables, relying on automatic type conversion.

food_mkts_raw <- fromJSON("foodMarkets/retail_food_markets.json",
                          simplifyVector = FALSE)
cnames <- food_mkts_raw[[c("meta", "view", "columns")]] %>% 
  map_chr("name")

food_mkts <- food_mkts_raw[["data"]] %>% 
  map(set_names, cnames)

to_process <- cnames[cnames != "Location"]
safe_extract <- function(l, wut) {
  res <- l[wut]
  null_here <- map_lgl(res, is.null)
  res[null_here] <- NA
  res
}
mkts_df <- food_mkts %>% 
  map_df(safe_extract, to_process)

“Columns first”

The more proper workflow is to build the columns first, with purrr’s type-specific mapping functions, then put them into a data frame. This is safer because it forces you to articulate and enforce type expectations for each variable.

  • Good news: there’s a built-in way to handle explicit NULLs.
  • Bad news: we need to specify type for each of the 22 variables.
  • ¯\_(ツ)_/¯ news: we get to use more exotic features of purrr.

Here’s a naïve approach for two variables:

food_mkts %>% {
  tibble(
    dba_name = map_chr(., "DBA Name"),
    city = map_chr(., "City")
  )
}
#> # A tibble: 28,355 × 2
#>                   dba_name               city
#>                      <chr>              <chr>
#> 1  PLAZA 23 TRUCK STOP     ALBANY            
#> 2  PRICE CHOPPER #245      WATERVLIET        
#> 3  PEACOCK                 ALBANY            
#> 4  FINYOUR FISHMONGER      GUILDERLAND       
#> 5  R&A GROCERY STORE       ALBANY            
#> 6  ANTHONYS CHOC DIP FRUIT ALBANY            
#> 7  DOLLAR TREE #4168       GLENMONT          
#> 8  GREAT WALL EXPRESS ASIA SLINGERLANDS      
#> 9  INANC I OZBAY           ALBANY            
#> 10 SAGAMIA                 SLINGERLANDS      
#> # ... with 28,345 more rows

Doing this for 22 variables is a drag, though it’s certainly possible, and even reasonable for the ingest of an important dataset.

However, because I can, I store the type of each variable and do extraction + type conversion programmatically. This lets us explore more of the functional programming side of purrr.

I exploit the variable types learned from the automatic type conversion in the “rows first” approach. Here’s a data frame with one row per variable, providing the expected class and the type-appropriate map_*() and NA. It will soon become clear why I use the specific variable names, .f and .null.

(vdf <- mkts_df %>% 
   map_chr(class) %>%
   enframe(name = ".f", value = "type") %>% 
   mutate(vname = .f %>% tolower() %>% gsub("\\s+", "_", .),
          mapper = c(integer = "map_int", character = "map_chr")[type],
          .null = list(integer = NA_integer_,
                       character = NA_character_)[type]))
#> # A tibble: 22 × 5
#>                .f      type          vname  mapper     .null
#>             <chr>     <chr>          <chr>   <chr>    <list>
#> 1             sid   integer            sid map_int <int [1]>
#> 2              id character             id map_chr <chr [1]>
#> 3        position   integer       position map_int <int [1]>
#> 4      created_at   integer     created_at map_int <int [1]>
#> 5    created_meta character   created_meta map_chr <chr [1]>
#> 6      updated_at   integer     updated_at map_int <int [1]>
#> 7    updated_meta character   updated_meta map_chr <chr [1]>
#> 8            meta character           meta map_chr <chr [1]>
#> 9          County character         county map_chr <chr [1]>
#> 10 License Number character license_number map_chr <chr [1]>
#> # ... with 12 more rows

Here’s how I could use do.call() to create one variable based on one row of this data frame.

i <- 14 # DBA Name
vdf[i, ]
#> # A tibble: 1 × 5
#>         .f      type    vname  mapper     .null
#>      <chr>     <chr>    <chr>   <chr>    <list>
#> 1 DBA Name character dba_name map_chr <chr [1]>
do.call(vdf$mapper[i],
        list(.x = food_mkts, .f = vdf$.f[i], .null = vdf$.null[[i]])) %>% 
  head()
#> [1] "PLAZA 23 TRUCK STOP    " "PRICE CHOPPER #245     "
#> [3] "PEACOCK                " "FINYOUR FISHMONGER     "
#> [5] "R&A GROCERY STORE      " "ANTHONYS CHOC DIP FRUIT"

do.call() is a base R function that constructs and executes a function call from a function (name or actual function) and a list of arguments. It’s exactly what we need. But we also want to do this for each of the 22 variables.
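
A minimal base R illustration of do.call(), unrelated to our data:

do.call("paste", list("a", "b", sep = "-"))
#> [1] "a-b"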

How to scale up? We use the invoke() family of functions in purrr. invoke() itself is a pipe-friendly wrapper around do.call(). invoke_map() and friends allow you to map over functions and arguments in parallel, analogous to map2(). The big difference with invoke_map() is that the two primary inputs are a vector of functions and a vector of argument values. We use invoke_map_df() here, the variant that returns a data frame.
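
Before the real thing, a toy call adapted from the purrr documentation: element i of .f is called with the arguments in element i of .x.

# returns a list of two length-2 random vectors, so output not shown
invoke_map(list(runif, rpois), list(list(n = 2), list(n = 2, lambda = 5)))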

(mkts_df <- invoke_map_df(.f = set_names(vdf$mapper, vdf$vname),
                          .x = transpose(vdf[c(".f", ".null")]),
                          food_mkts))
#> # A tibble: 28,355 × 22
#>      sid                                   id position created_at
#>    <int>                                <chr>    <int>      <int>
#> 1      1 D943D633-8DB5-4740-B5F4-06AAB6FC52E2        1 1412261567
#> 2      2 7A069C98-90B2-4B11-A4A7-3C4761D14DDD        2 1412261567
#> 3      3 68DBE61E-ACF4-4803-BD90-87380FDAA4F9        3 1412261567
#> 4      4 9DA5A827-A0AD-4782-8CE2-5D6867B53293        4 1412261567
#> 5      5 5D624E4B-E4A8-4F3D-BF88-190663DF8FCF        5 1412261567
#> 6      6 86BC29A4-55DA-43C5-BDED-6EF2C92C3C8B        6 1412261567
#> 7      7 51EE0EB4-6D71-4C2E-AD44-E225A1A45987        7 1412261567
#> 8      8 22B148DF-F188-4245-8E69-1137F9A6064E        8 1412261567
#> 9      9 62526ABF-801E-46DE-97CD-05C2DDE6BB20        9 1412261567
#> 10    10 13153CBA-3E65-4D4D-9A4C-ACB24BCD33DF       10 1412261567
#> # ... with 28,345 more rows, and 18 more variables: created_meta <chr>,
#> #   updated_at <int>, updated_meta <chr>, meta <chr>, county <chr>,
#> #   license_number <chr>, operation_type <chr>, estab_type <chr>,
#> #   entity_name <chr>, dba_name <chr>, street_number <chr>,
#> #   street_name <chr>, address_line_2 <chr>, address_line_3 <chr>,
#> #   city <chr>, state <chr>, zip_code <chr>, square_footage <chr>

Partial victory is ours again! A data frame, one row per food market, with 22 variables. How did this work? Line-by-line:

  • .f = set_names(vdf$mapper, vdf$vname) specifies the list of functions to iterate over: the type-specific mappers we pre-selected above. I bother to apply names here because they propagate to the variable names of the data frame.
  • .x = transpose(vdf[c(".f", ".null")]) gives the parallel list of function arguments, namely the strings specifying named elements to extract and the value to insert in place of explicit NULL. Two things that might be confusing:
    • This instance of .f is at the variable level, i.e. inside our iteration. In the previous line, .f refers to the overall operation, i.e. setting up our iteration.
    • transpose() is needed to repackage the .f and .null variables. Instead of a list of two same-length vectors, we want a single list holding lists of length two (see the sketch just after this list).
  • food_mkts is the ... argument. It’s a common input across all units of iteration.
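
A small sketch of the transpose() repackaging, looking at just the first variable; note the inner names .f and .null, which is exactly why I chose those variable names:

vdf[c(".f", ".null")] %>% 
  transpose() %>% 
  .[[1]] %>% 
  str()
#> List of 2
#>  $ .f   : chr "sid"
#>  $ .null: int NA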

“Columns first” Recap

Self-contained code to go from raw JSON to data frame for 22 out of 23 variables, with explicit type specification.

food_mkts_raw <- fromJSON("foodMarkets/retail_food_markets.json",
                          simplifyVector = FALSE)
cnames <- food_mkts_raw[[c("meta", "view", "columns")]] %>% 
  map_chr("name")
food_mkts <- food_mkts_raw[["data"]] %>% 
  map(set_names, cnames)

vdf <- tribble(
             ~ .f,      ~ type,
            "sid",   "integer",
             "id", "character",
       "position",   "integer",
     "created_at",   "integer",
   "created_meta", "character",
     "updated_at",   "integer",
   "updated_meta", "character",
           "meta", "character",
         "County", "character",
 "License Number", "character",
 "Operation Type", "character",
     "Estab Type", "character",
    "Entity Name", "character",
       "DBA Name", "character",
  "Street Number", "character",
    "Street Name", "character",
 "Address Line 2", "character",
 "Address Line 3", "character",
           "City", "character",
          "State", "character",
       "Zip Code", "character",
 "Square Footage", "character")
vdf <- vdf %>% 
  mutate(vname = .f %>% tolower() %>% gsub("\\s+", "_", .),
         mapper = c(integer = "map_int", character = "map_chr")[type],
         .null = list(integer = NA_integer_,
                      character = NA_character_)[type])
mkts_df <- invoke_map_df(.f = set_names(vdf$mapper, vdf$vname),
                         .x = transpose(vdf[c(".f", ".null")]),
                         food_mkts)

Rescue Location

We need to process the 23rd variable, Location. There are no big new principles here, so this will be heavy on code and light on explanation.

Prepare Location

Pull Location into its own list and inspect one element.

loc_raw <- food_mkts %>%
  map("Location")
loc_raw[[345]]
#> [[1]]
#> [1] "{\"address\":\"196 VLIET BLVD\",\"city\":\"COHOES\",\"state\":\"NY\",\"zip\":\"12047\"}"
#> 
#> [[2]]
#> [1] "42.77787640800045"
#> 
#> [[3]]
#> [1] "-73.71585755099966"
#> 
#> [[4]]
#> NULL
#> 
#> [[5]]
#> [1] FALSE

Ah, more key-less JSON, giving rise to name-less lists. Before we do anything else, extract names from the meta component and apply them. Re-inspect an element to verify we’ve succeeded.

i <- which(cnames == "Location")
location_meta <- food_mkts_raw[[c("meta", "view", "columns")]][[i]]
(lnames <- location_meta[["subColumnTypes"]] %>% flatten_chr())
#> [1] "human_address"   "latitude"        "longitude"       "machine_address"
#> [5] "needs_recoding"
loc_raw <- loc_raw %>% 
  map(set_names, lnames)
loc_raw[[345]]
#> $human_address
#> [1] "{\"address\":\"196 VLIET BLVD\",\"city\":\"COHOES\",\"state\":\"NY\",\"zip\":\"12047\"}"
#> 
#> $latitude
#> [1] "42.77787640800045"
#> 
#> $longitude
#> [1] "-73.71585755099966"
#> 
#> $machine_address
#> NULL
#> 
#> $needs_recoding
#> [1] FALSE

Parse the JSON string holding human address

The human_address element of each location holds a JSON string:

  • Parse it.
  • Rowbind elements into a data frame.
  • Prepend ha_ to the variable names to avoid name conflicts downstream.
ha <- loc_raw %>% 
  map("human_address") %>%
  map_df(fromJSON) %>% 
  set_names(paste0("ha_", names(.)))
head(ha)
#> # A tibble: 6 × 4
#>           ha_address     ha_city ha_state ha_zip
#>                <chr>       <chr>    <chr>  <chr>
#> 1 240 CHURCH ST #242      ALBANY       NY  12202
#> 2        515 19TH ST  WATERVLIET       NY  12189
#> 3       795 BROADWAY      ALBANY       NY  12207
#> 4   2050 WESTERN AVE GUILDERLAND       NY  12084
#> 5    265 CENTRAL AVE      ALBANY       NY  12206
#> 6   1693 CENTRAL AVE      ALBANY       NY  12205

Convert remaining variables into a data frame

Convert everything else left in the location data to a data frame. Convert latitude and longitude into numeric variables and make sure they don’t look insane. This New York latitude and longitude map indicates that latitude should range between about 40° and 45° and longitude between about -80° and -71.8°.

ee <- loc_raw %>% 
  map_df(safe_extract, lnames[lnames != "human_address"]) %>% 
  mutate_at(vars(latitude, longitude), as.numeric)
head(ee)
#> # A tibble: 6 × 4
#>   latitude longitude machine_address needs_recoding
#>      <dbl>     <dbl>           <lgl>          <lgl>
#> 1 42.63541 -73.75541              NA          FALSE
#> 2 42.73111 -73.70496              NA          FALSE
#> 3 42.65794 -73.74780              NA          FALSE
#> 4 42.69787 -73.88924              NA          FALSE
#> 5 42.66340 -73.77088              NA          FALSE
#> 6 42.72312 -73.83912              NA          FALSE
ee %>% 
  select(latitude, longitude) %>% 
  map(summary)
#> $latitude
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>   40.51   40.71   40.84   41.46   42.45   45.00       1 
#> 
#> $longitude
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>  -79.75  -74.18  -73.94  -74.58  -73.84  -71.92       1
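
If eyeballing the summary feels too casual, here's a sketch that flags missing or out-of-range coordinates; the bounding box is my rough reading of the map above:

# should return only the one market with missing coordinates
ee %>% 
  filter(is.na(latitude) |
           !between(latitude, 40, 45.1) |
           !between(longitude, -80, -71.8))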

Put everything together

Column bind the main data frame mkts_df with the human address data in ha and everything else from Location.

(mkts_df <- bind_cols(mkts_df, ha, ee))
#> # A tibble: 28,355 × 30
#>      sid                                   id position created_at
#>    <int>                                <chr>    <int>      <int>
#> 1      1 D943D633-8DB5-4740-B5F4-06AAB6FC52E2        1 1412261567
#> 2      2 7A069C98-90B2-4B11-A4A7-3C4761D14DDD        2 1412261567
#> 3      3 68DBE61E-ACF4-4803-BD90-87380FDAA4F9        3 1412261567
#> 4      4 9DA5A827-A0AD-4782-8CE2-5D6867B53293        4 1412261567
#> 5      5 5D624E4B-E4A8-4F3D-BF88-190663DF8FCF        5 1412261567
#> 6      6 86BC29A4-55DA-43C5-BDED-6EF2C92C3C8B        6 1412261567
#> 7      7 51EE0EB4-6D71-4C2E-AD44-E225A1A45987        7 1412261567
#> 8      8 22B148DF-F188-4245-8E69-1137F9A6064E        8 1412261567
#> 9      9 62526ABF-801E-46DE-97CD-05C2DDE6BB20        9 1412261567
#> 10    10 13153CBA-3E65-4D4D-9A4C-ACB24BCD33DF       10 1412261567
#> # ... with 28,345 more rows, and 26 more variables: created_meta <chr>,
#> #   updated_at <int>, updated_meta <chr>, meta <chr>, county <chr>,
#> #   license_number <chr>, operation_type <chr>, estab_type <chr>,
#> #   entity_name <chr>, dba_name <chr>, street_number <chr>,
#> #   street_name <chr>, address_line_2 <chr>, address_line_3 <chr>,
#> #   city <chr>, state <chr>, zip_code <chr>, square_footage <chr>,
#> #   ha_address <chr>, ha_city <chr>, ha_state <chr>, ha_zip <chr>,
#> #   latitude <dbl>, longitude <dbl>, machine_address <lgl>,
#> #   needs_recoding <lgl>

Trim any leading/trailing whitespace from the character variables.

mkts_df <- mkts_df %>% 
  mutate_if(is.character, trimws)

Take a look at some of the most interpretable variables.

mkts_df %>% 
  select(dba_name, city, latitude, longitude, square_footage)
#> # A tibble: 28,355 × 5
#>                   dba_name         city latitude longitude square_footage
#>                      <chr>        <chr>    <dbl>     <dbl>          <chr>
#> 1      PLAZA 23 TRUCK STOP       ALBANY 42.63541 -73.75541           3600
#> 2       PRICE CHOPPER #245   WATERVLIET 42.73111 -73.70496              0
#> 3                  PEACOCK       ALBANY 42.65794 -73.74780           2000
#> 4       FINYOUR FISHMONGER  GUILDERLAND 42.69787 -73.88924           2100
#> 5        R&A GROCERY STORE       ALBANY 42.66340 -73.77088           1500
#> 6  ANTHONYS CHOC DIP FRUIT       ALBANY 42.72312 -73.83912              0
#> 7        DOLLAR TREE #4168     GLENMONT 42.60176 -73.79454          10000
#> 8  GREAT WALL EXPRESS ASIA SLINGERLANDS 42.64117 -73.86302              0
#> 9            INANC I OZBAY       ALBANY 42.71746 -73.83267              0
#> 10                 SAGAMIA SLINGERLANDS 42.64117 -73.86302           1000
#> # ... with 28,345 more rows
