Using R's set.seed() to set seeds for use in C/C++ (including Rcpp)

In native R, the user sets the seed for random number generation (RNG) with set.seed(). Random number generators exist in C and C++ too; these need their own seeds, which are not obviously settable by set.seed(). Good news! It can be done.

pacman::p_load(inline, purrr)

rbernoulli

Base R (or technically the stats package) provides no rbernoulli(). It’s a pretty gaping hole in the pantheon of rbeta(), rbinom(), rcauchy, rchisq(), rexp(), rf(), rgamma(), etc. Thankfully, Hadley Wickham noticed this and gave us purrr::rbernoulli().

set.seed(1)
rbernoulli(5, 0.7)
#> [1] FALSE  TRUE  TRUE  TRUE FALSE
set.seed(1)
rbernoulli(5, 0.7)
#> [1] FALSE  TRUE  TRUE  TRUE FALSE

So it seems like Hadley managed to get set.seed() to work with rbernoulli(). How did he do this? Let’s take a closer look at purrr::rbernoulli().

purrr::rbernoulli
#> function (n, p = 0.5) 
#> {
#>     stats::runif(n) > (1 - p)
#> }
#> <bytecode: 0x7fc591364740>
#> <environment: namespace:purrr>

Ah, it seems Hadley just wrapped runif(); hence, because set.seed() works with runif(), it works with his implementation of purrr::rbernoulli().

C++ RNG

The C++ standard library provides the <random> header file, which includes Bernoulli RNG. Let’s give that a whirl.

cpp_rbernoulli <- rcpp(c(n = "integer", p = "numeric", seed = "integer"), '
                       int n_ = as<int>(n), seed_ = as<int>(seed);
                       double p_ = as<double>(p);
                       std::default_random_engine generator(seed_);
                       std::bernoulli_distribution distribution(p_);
                       IntegerVector out(n_);
                       for (std::size_t i = 0; i != n_; ++i) {
                         out[i] = distribution(generator);
                       }
                       return out;
                       ', includes = "#include <random>")
cpp_rbernoulli(6, 0.7, seed = 1)
#> [1] 1 0 1 1 0 1
cpp_rbernoulli(6, 0.7, seed = 1)
#> [1] 1 0 1 1 0 1
cpp_rbernoulli(6, 0.7, seed = 2)
#> [1] 1 0 1 0 1 1

OK, so now we have cpp_rbernoulli() which is working, but the user has to pass the seed as an argument of the function, there’s no option to use set.seed().

get_seed()

If only there was a get_seed() function that we could use. Well, here it is!

get_seed <- function() {
  sample.int(.Machine$integer.max, 1)
}

This gets a positive number in the unsigned 32-bit integer range (which is always a safe bet for a seed) and it is completely determined by set.seed(). Therefore, it’s fine to use as a seed itself. Let’s take a look.

set.seed(1)
replicate(6, get_seed())
#> [1]  570175513  799129990 1230193230 1950361378  433108649 1929277158
set.seed(1)
replicate(6, get_seed())
#> [1]  570175513  799129990 1230193230 1950361378  433108649 1929277158
set.seed(2)
replicate(6, get_seed())
#> [1]  397031630 1508336757 1231208929  360888751 2026879546 2026097046
set.seed(2)
replicate(6, get_seed())
#> [1]  397031630 1508336757 1231208929  360888751 2026879546 2026097046

So as we can see, setting a seed via set.seed() determines the seeds that subsequently come out of get_seed(), so all is well with the world. get_seed() can now be used to create a version of cpp_rbernoulli() which uses set.seed(). For the sake of inflating my own ego, I’ll name this version after myself.

rorybernoulli <- function(n, p) {
  cpp_rbernoulli(n, p, get_seed())
}

Let’s check that it’s in working order.

set.seed(1)
rorybernoulli(6, 0.7)
#> [1] 1 1 0 0 1 0
set.seed(1)
rorybernoulli(6, 0.7)
#> [1] 1 1 0 0 1 0
set.seed(2)
rorybernoulli(6, 0.7)
#> [1] 0 1 1 1 1 1
set.seed(2)
rorybernoulli(6, 0.7)
#> [1] 0 1 1 1 1 1

Everything is awesome.

Benchmarking

Lastly, let’s compare the two Bernoulli RNGs that we have now by asking them both to give us a million Bernoulli random numbers with p = 0.5.

bench::mark(purrr::rbernoulli(1e6, p = 0.5),
            rorybernoulli(1e6, p = 0.5),
            check = FALSE)
#> # A tibble: 2 x 10
#>   expression     min    mean median    max `itr/sec` mem_alloc  n_gc n_itr
#>   <chr>      <bch:t> <bch:t> <bch:> <bch:>     <dbl> <bch:byt> <dbl> <int>
#> 1 purrr::rb…  28.6ms 30.36ms 29.3ms 34.3ms      32.9   11.45MB     4    12
#> 2 roryberno…  8.98ms  9.96ms  9.7ms 12.4ms     100.     3.82MB     4    44
#> # ... with 1 more variable: total_time <bch:tm>

Wow, rorybernoulli() is three times faster! I wasn’t expecting that. Perhaps it’s because there’s a quicker way of generating a Bernoulli random number than by going through a uniform random number (as purrr::rbernoulli() does). It’s also three times as efficient with memory; the memory and time improvements probably have the same underlying reason. The point of me writing this post was to share this get_seed() thing with people so that they can use set.seed() with Rcpp and the like; purrr::rbernoulli() was just a cool example of a non-base RNG that popped into my head. Maybe I should submit a pull request to purrr!

comments powered by Disqus