I’m entirely perplexed by this - I was exploring an idea I saw discussed in a blog post (that will have to wait for now - I’ll come back to it) and found some interesting behaviour I don’t understand.
I managed to boil the issue down to the following: let’s say I have a function that does something with all the values of an array, in this case adds 1 to each element (not returning it). I might define that like this
add_inline <- function(z) { for (i in seq_along(z)) z[i] + 1 }
Sure, I could do something more “functional” - but that’s part of the point, and I’ll come back to it.
Now, what if I add a single layer of abstraction, so that the adding of 1 is defined in its own function?
f <- function(x) x + 1
add_call <- function(z) { for (i in seq_along(z)) f(z[i]) }
I thought that the scoping of that function might matter - perhaps R needing to search through environments before finding f - so I also added a version where f
is defined more locally
add_call_local <- function(z) {
  localf <- function(x) x + 1
  for (i in seq_along(z)) {
    localf(z[i])
  }
}
Just in case there was any funny business with named function lookup, how about an anonymous function version?
add_call_anon <- function(z) {
  for (i in seq_along(z)) {
    (\(x) x + 1)(z[i])
  }
}
Okay, now let’s look at the comparative performance of each of these
n <- 1e4
bm <- bench::mark(
  inline = add_inline(1:n),
  call = add_call(1:n),
  local = add_call_local(1:n),
  anon = add_call_anon(1:n),
  iterations = 100
)
plot(bm)
bm[, 1:7]
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 inline 414.73µs 444.43µs 2178. 3.94MB 0
#> 2 call 5.38ms 5.83ms 169. 113.74KB 20.8
#> 3 local 5.22ms 5.6ms 174. 76.47KB 19.3
#> 4 anon 5.85ms 6.29ms 157. 35.06KB 23.4
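My first suspicion is call overhead at the interpreter level: R’s byte-compiler can turn arithmetic like z[i] + 1 into a single bytecode instruction, but calling a user-defined closure still means building a call frame (a fresh environment, promises for the arguments) on every iteration. That’s an assumption on my part, but the structural difference is at least visible in the bytecode (the exact instruction names and output format depend on your R version):

```r
library(compiler)

f <- function(x) x + 1
add_inline <- function(z) { for (i in seq_along(z)) z[i] + 1 }
add_call   <- function(z) { for (i in seq_along(z)) f(z[i]) }

# Explicitly byte-compile both versions and inspect the result
disassemble(cmpfun(add_inline))  # the addition shows up as an ADD instruction
disassemble(cmpfun(add_call))    # the call to f shows up as a GETFUN/CALL sequence
```

Note that R’s JIT compiles functions automatically after a few calls anyway, so both benchmarked versions were almost certainly already byte-compiled - the gap isn’t “compiled vs interpreted”, it’s “inlined arithmetic vs a full closure call”.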
What? There’s a 10x difference between inlining x + 1 and calling out to a function.
How does that scale? I made a similar comparison with the addition performed five times
add5_inline <- function(z) { for (i in seq_along(z)) z[i] + 1 + 1 + 1 + 1 + 1 }
add5_call <- function(z) { for (i in seq_along(z)) f(f(f(f(f(z[i]))))) }
n <- 1e4
bm5 <- bench::mark(
  inline = add_inline(1:n),
  call = add_call(1:n),
  inline5 = add5_inline(1:n),
  call5 = add5_call(1:n),
  iterations = 100
)
plot(bm5)
bm5[, 1:7]
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 inline 509.3µs 2.06ms 511. 0B 0
#> 2 call 5.3ms 5.6ms 125. 0B 8.00
#> 3 inline5 746µs 773.44µs 1237. 38.1KB 0
#> 4 call5 20.5ms 21.53ms 46.2 18.5KB 13.8
That’s nearly a 30x difference.
Granted, that’s spread over 10,000 calls, so it’s not a whole lot for any single call, but over many calls it adds up - especially inside a hot loop.
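To convince myself the gap really is a fixed per-call cost rather than something loop-specific, it can be measured on a single operation directly. A minimal sketch using bench again (the absolute numbers will of course vary by machine):

```r
f <- function(x) x + 1

# Time a bare addition vs the same addition wrapped in one function call;
# the difference in medians approximates the fixed cost of a closure call.
bench::mark(inline = 1 + 1, call = f(1), iterations = 1e4)
```

bench::mark checks that both expressions return the same value, so this doubles as a sanity check that the two forms are equivalent.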
What if we actually do something with the values? I also added a sapply
version, though since sapply has to take a function there’s no strictly comparable way to inline it
add3_inline <- function(z) { a <- z[NA]; for (i in seq_along(z)) a[i] <- z[i] + 1 + 1 + 1; a }
f3 <- function(x) f(f(f(x)))
add3_call <- function(z) { a <- z[NA]; for (i in seq_along(z)) a[i] <- f3(z[i]); a }
add3_sapply_call <- function(z) { sapply(z, f3) }
n <- 1e4
bms <- bench::mark(
  inline = add3_inline(1:n),
  call = add3_call(1:n),
  sapply_call = add3_sapply_call(1:n),
  iterations = 100
)
plot(bms)
bms[, 1:7]
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 inline 991.8µs 1.08ms 897. 186KB 0
#> 2 call 18ms 18.73ms 52.4 184KB 12.3
#> 3 sapply_call 19.2ms 21.19ms 46.8 363KB 13.2
Still, 18x?
I don’t know how to explain this apart from “there’s a not-insignificant overhead to calling a function vs inlining the body”. Have I missed something?
R is usually “fast enough” for everything I do - I’m rarely trying to cram a lot of evaluations into a given millisecond - but I was surprised that this single layer of abstraction added so much. Imagine a heavy calculation inside a hot loop involving a lot of composed function calls: how much could be saved by consolidating the composition into a single function body?
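For this toy example the standard escape hatch is to avoid per-element calls altogether and lean on vectorised operations, which run the loop in C. That sidesteps the overhead question rather than answering it - it only helps when the operation vectorises - but it’s the usual idiom:

```r
# One vectorised addition replaces 10,000 interpreted function calls
add3_vec <- function(z) z + 3

# Same result as applying the composed +1+1+1 element by element
z <- 1:1e4
stopifnot(identical(add3_vec(z), sapply(z, function(x) x + 1 + 1 + 1)))
```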
Just for context of how “slow” this is, I wanted to see how Rust performs and … ouch.
rextendr::rust_function(
  "fn add3_rust(z:Vec<i32>) -> Vec<i32> { z.iter().map(|x| x + 3).collect() }",
  profile = "release"
)
#> ℹ build directory: '/tmp/RtmpqsZ3jP/file73e7742716b8f'
#> ✔ Writing '/tmp/RtmpqsZ3jP/file73e7742716b8f/target/extendr_wrappers.R'
n <- 1e4
bmr <- bench::mark(
  inline = add3_inline(1:n),
  call = add3_call(1:n),
  sapply_call = add3_sapply_call(1:n),
  rust = add3_rust(1:n),
  iterations = 100
)
plot(bmr)
bmr[, 1:7]
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 inline 1.03ms 1.08ms 925. 156.4KB 0
#> 2 call 18.52ms 21.53ms 41.7 156.4KB 11.8
#> 3 sapply_call 19.79ms 23.6ms 41.8 362.6KB 13.2
#> 4 rust 18.37µs 24.21µs 14165. 78.2KB 0
Yeah, my best R is 40x slower than Rust at this particular example (and the f3(x)
R version is 875x slower than Rust). I love the idea of calling out to Rust for performance, and will definitely be doing that a bit more.
If you have an explanation about the function calling overhead, please let me know here or on Mastodon.