Sample from groups, n varies by group

A challenge tweeted by Hilary Parker, paraphrased:

How do you sample from groups, with a different sample size for each group?

Illustrated with the iris data.

Species = groups.
Sample from the 3 Species with 3 different sample sizes.

How this fits the template:

DRAW A SAMPLE for each PAIR OF (SPECIES DATA, SPECIES SAMPLE SIZE)

How to prepare the data? I need a data frame with:

- One row per Species
- A variable of Species-specific sample sizes
- A variable of “Species data”, whatever that means.
  - Actually we know what that is: a variable of Species-specific data frames. A list-column!

We need a nested data frame.

suppressMessages(library(dplyr))
library(purrr)
library(tidyr)
set.seed(4561)

(nested_iris <- iris %>%
    group_by(Species) %>%   # prep for work by Species
    nest() %>%              # --> one row per Species
    ungroup() %>% 
    mutate(n = c(2, 5, 3))) # add sample sizes
#> # A tibble: 3 x 3
#>   Species              data     n
#>   <fct>      <list<df[,4]>> <dbl>
#> 1 setosa           [50 × 4]     2
#> 2 versicolor       [50 × 4]     5
#> 3 virginica        [50 × 4]     3
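
To see what the data list-column actually holds, it can help to pull out a single element. This is a minimal sketch that rebuilds the nested data frame so it runs on its own; the name setosa_df is only for illustration.

```r
suppressMessages(library(dplyr))
library(tidyr)

nested_iris <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  mutate(n = c(2, 5, 3))

# Each element of the list-column is an ordinary data frame for one Species
setosa_df <- nested_iris$data[[1]]

# 50 rows and 4 columns: Species itself lives outside the list-column
dim(setosa_df)
names(setosa_df)
```

The Species variable is absent from the inner data frames because nest() keeps the grouping variable in the outer data frame.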

Draw the samples.

purrr::map2() is good since we want to operate on 2 things (data = DATA FOR ONE SPECIES, n = SAMPLE SIZE).
We’ve already got data = DATA FOR ONE SPECIES and n = SAMPLE SIZE as variables in our data frame.
Drop them in as inputs 1 and 2 to dplyr::sample_n(tbl, size).
Accept whatever comes back as a new list-column in the data frame, i.e. use dplyr::mutate(). Be brave and deal with it.

(sampled_iris <- nested_iris %>%
  mutate(samp = map2(data, n, sample_n)))
#> # A tibble: 3 x 4
#>   Species              data     n samp            
#>   <fct>      <list<df[,4]>> <dbl> <list>          
#> 1 setosa           [50 × 4]     2 <tibble [2 × 4]>
#> 2 versicolor       [50 × 4]     5 <tibble [5 × 4]>
#> 3 virginica        [50 × 4]     3 <tibble [3 × 4]>

What came back? More Species-specific data frames.
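
A quick sanity check, sketched so that it runs on its own: map2() behaves like a loop over the rows, calling sample_n() on one element of data and one element of n at a time, so every element of samp should have exactly n rows. The names samp_rows and samp_loop are illustrative.

```r
suppressMessages(library(dplyr))
library(purrr)
library(tidyr)
set.seed(4561)

sampled_iris <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  mutate(n = c(2, 5, 3)) %>%
  mutate(samp = map2(data, n, sample_n))

# Number of rows in each sampled data frame; should match the n column
samp_rows <- map_int(sampled_iris$samp, nrow)
all(samp_rows == sampled_iris$n)

# The same idea written as an explicit loop, for comparison
samp_loop <- vector("list", nrow(sampled_iris))
for (i in seq_len(nrow(sampled_iris))) {
  samp_loop[[i]] <- sample_n(sampled_iris$data[[i]], sampled_iris$n[[i]])
}
vapply(samp_loop, nrow, integer(1))
```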

We are in that uncomfortable intermediate state, with two list-columns: the original data and the sampled data, samp. Let’s get back to a normal data frame!

Drop the original data list-column, which leaves the Species, n, and samp variables.
Unnest, which essentially row-binds the data frames in samp and replicates Species and n as necessary.

sampled_iris %>% 
  select(-data) %>%
  unnest(samp)
#> # A tibble: 10 x 6
#>    Species        n Sepal.Length Sepal.Width Petal.Length Petal.Width
#>    <fct>      <dbl>        <dbl>       <dbl>        <dbl>       <dbl>
#>  1 setosa         2          4.6         3.2          1.4         0.2
#>  2 setosa         2          5.1         3.8          1.5         0.3
#>  3 versicolor     5          6           3.4          4.5         1.6
#>  4 versicolor     5          6           2.2          4           1  
#>  5 versicolor     5          5.7         2.8          4.5         1.3
#>  6 versicolor     5          6.9         3.1          4.9         1.5
#>  7 versicolor     5          7           3.2          4.7         1.4
#>  8 virginica      3          5.8         2.7          5.1         1.9
#>  9 virginica      3          5.8         2.7          5.1         1.9
#> 10 virginica      3          6.3         2.9          5.6         1.8

Again, from the top, with no exposition:

iris %>%
  group_by(Species) %>% 
  nest() %>%            
  ungroup() %>% 
  mutate(n = c(2, 5, 3)) %>% 
  mutate(samp = map2(data, n, sample_n)) %>% 
  select(-data) %>%
  unnest(samp)
#> # A tibble: 10 x 6
#>    Species        n Sepal.Length Sepal.Width Petal.Length Petal.Width
#>    <fct>      <dbl>        <dbl>       <dbl>        <dbl>       <dbl>
#>  1 setosa         2          4.7         3.2          1.6         0.2
#>  2 setosa         2          4.8         3.1          1.6         0.2
#>  3 versicolor     5          6.7         3.1          4.7         1.5
#>  4 versicolor     5          5.1         2.5          3           1.1
#>  5 versicolor     5          5.2         2.7          3.9         1.4
#>  6 versicolor     5          5.7         3            4.2         1.2
#>  7 versicolor     5          5.6         2.7          4.2         1.3
#>  8 virginica      3          7.6         3            6.6         2.1
#>  9 virginica      3          5.6         2.8          4.9         2  
#> 10 virginica      3          6.4         3.1          5.5         1.8
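
In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). Here is a sketch of the same pipeline with that function, assuming a recent enough dplyr. The result name resampled and the argument names df and size are only illustrative.

```r
suppressMessages(library(dplyr))
library(purrr)
library(tidyr)
set.seed(4561)

resampled <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup() %>%
  mutate(n = c(2, 5, 3)) %>%
  # slice_sample() takes the sample size through its named argument n
  mutate(samp = map2(data, n, function(df, size) slice_sample(df, n = size))) %>%
  select(-data) %>%
  unnest(samp)

# Rows drawn per Species
table(resampled$Species)
```

The sampled rows need not match those drawn by sample_n(), even with the same seed.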

A base R solution, with some marginal comments:

split_iris <- split(iris, iris$Species) # why can't Species be found in iris?
                                        # where else would it be found?
str(split_iris)                         # split_iris ~= nested_iris[["data"]]
#> List of 3
#>  $ setosa    :'data.frame':  50 obs. of  5 variables:
#>   ..$ Sepal.Length: num [1:50] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>   ..$ Sepal.Width : num [1:50] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>   ..$ Petal.Length: num [1:50] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>   ..$ Petal.Width : num [1:50] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ versicolor:'data.frame':  50 obs. of  5 variables:
#>   ..$ Sepal.Length: num [1:50] 7 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 ...
#>   ..$ Sepal.Width : num [1:50] 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 ...
#>   ..$ Petal.Length: num [1:50] 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 ...
#>   ..$ Petal.Width : num [1:50] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1 1.3 1.4 ...
#>   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 2 2 2 2 2 2 2 2 2 ...
#>  $ virginica :'data.frame':  50 obs. of  5 variables:
#>   ..$ Sepal.Length: num [1:50] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 ...
#>   ..$ Sepal.Width : num [1:50] 3.3 2.7 3 2.9 3 3 2.5 2.9 2.5 3.6 ...
#>   ..$ Petal.Length: num [1:50] 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 ...
#>   ..$ Petal.Width : num [1:50] 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 ...
#>   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 3 3 3 3 3 3 3 3 3 3 ...
(n <- c(2, 5, 3))                       # Species data and n are only 'in sync'
                                        # due to my discipline / care
                                        # not locked safely into a data frame
#> [1] 2 5 3
(group_sizes <- vapply(split_iris, nrow, integer(1))) # also floating free
#>     setosa versicolor  virginica 
#>         50         50         50
(sampled_obs <- mapply(sample, group_sizes, n)) # I'm floating free too!
#> $setosa
#> [1] 36 14
#> 
#> $versicolor
#> [1] 28 43 22 24 13
#> 
#> $virginica
#> [1] 48  3 25
get_rows <- function(df, rows) df[rows, , drop = FALSE] # custom function
                                        # drop = FALSE is defensive: it keeps the
                                        # result a data frame even if df has
                                        # only one column
(sampled_iris <-                        # god help you if you forget SIMPLIFY = FALSE
    mapply(get_rows, split_iris, sampled_obs, SIMPLIFY = FALSE))
#> $setosa
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 36          5.0         3.2          1.2         0.2  setosa
#> 14          4.3         3.0          1.1         0.1  setosa
#> 
#> $versicolor
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> 78          6.7         3.0          5.0         1.7 versicolor
#> 93          5.8         2.6          4.0         1.2 versicolor
#> 72          6.1         2.8          4.0         1.3 versicolor
#> 74          6.1         2.8          4.7         1.2 versicolor
#> 63          6.0         2.2          4.0         1.0 versicolor
#> 
#> $virginica
#>     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#> 148          6.5         3.0          5.2         2.0 virginica
#> 103          7.1         3.0          5.9         2.1 virginica
#> 125          6.7         3.3          5.7         2.1 virginica
do.call(rbind, sampled_iris)            # :( do.call()
#>               Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#> setosa.36              5.0         3.2          1.2         0.2     setosa
#> setosa.14              4.3         3.0          1.1         0.1     setosa
#> versicolor.78          6.7         3.0          5.0         1.7 versicolor
#> versicolor.93          5.8         2.6          4.0         1.2 versicolor
#> versicolor.72          6.1         2.8          4.0         1.3 versicolor
#> versicolor.74          6.1         2.8          4.7         1.2 versicolor
#> versicolor.63          6.0         2.2          4.0         1.0 versicolor
#> virginica.148          6.5         3.0          5.2         2.0  virginica
#> virginica.103          7.1         3.0          5.9         2.1  virginica
#> virginica.125          6.7         3.3          5.7         2.1  virginica
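
The SIMPLIFY = FALSE trap can be sidestepped with Map(), which is a thin wrapper around mapply() that never simplifies its result. Here is a sketch of the same base R approach with that swap. The seed is arbitrary and result is an illustrative name.

```r
set.seed(4561)

split_iris <- split(iris, iris$Species)
n <- c(2, 5, 3)

# Map() always returns a list, so equal sample sizes cannot collapse
# the result into a matrix the way mapply() would by default
sampled_obs <- Map(sample, vapply(split_iris, nrow, integer(1)), n)

get_rows <- function(df, rows) df[rows, , drop = FALSE]
sampled_iris <- Map(get_rows, split_iris, sampled_obs)

result <- do.call(rbind, sampled_iris)
table(result$Species)
```

The data and the sample sizes still travel as separate objects here, so keeping them in sync remains the programmer's job.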

IMO the base R solution requires much greater facility with R programming and data structures to get it right. It feels more like programming than data analysis.