I’m entirely perplexed by this - I was exploring an idea I saw discussed in a blog post (that will have to wait for now - I’ll come back to it) and found some interesting behaviour I don’t understand.
I managed to boil the issue down to the following: let’s say I have a function that does something with all the values of an array, in this case adds 1 to each element (not returning it). I might define that like this
add_inline <- function(z) { for (i in seq_along(z)) z[i] + 1 }
Sure, I could do something more “functional” - but that’s part of the point, and I’ll come back to it.
Now, what if I add a single layer of abstraction, so that the adding of 1 is defined in its own function?
f <- function(x) x + 1
add_call <- function(z) { for (i in seq_along(z)) f(z[i]) }
I thought that the scoping of that function might matter - perhaps R needing to search through environments before finding f - so I also added a version where f
is defined more locally
add_call_local <- function(z) {
  localf <- function(x) x + 1
  for (i in seq_along(z)) {
    localf(z[i])
  }
}
Just in case there was any funny business with named function lookup, how about an anonymous function version?
add_call_anon <- function(z) {
  for (i in seq_along(z)) {
    (\(x) x + 1)(z[i])
  }
}
Okay, now let’s look at the comparative performance of each of these
n <- 1e4
bm <- bench::mark(
  inline = add_inline(1:n),
  call = add_call(1:n),
  local = add_call_local(1:n),
  anon = add_call_anon(1:n),
  iterations = 100
)
plot(bm)
bm[, 1:7]
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 inline 414.73µs 444.43µs 2178. 3.94MB 0
#> 2 call 5.38ms 5.83ms 169. 113.74KB 20.8
#> 3 local 5.22ms 5.6ms 174. 76.47KB 19.3
#> 4 anon 5.85ms 6.29ms 157. 35.06KB 23.4
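My first suspicion is call overhead at the interpreter level: R’s byte-compiler can turn arithmetic like z[i] + 1 into a single bytecode instruction, but calling a user-defined closure still means building a call frame (a fresh environment, promises for the arguments) on every iteration. That’s an assumption on my part, but the structural difference is at least visible in the bytecode (the exact instruction names and output format depend on your R version):

```r
library(compiler)

f <- function(x) x + 1
add_inline <- function(z) { for (i in seq_along(z)) z[i] + 1 }
add_call   <- function(z) { for (i in seq_along(z)) f(z[i]) }

# Explicitly byte-compile both versions and inspect the result
disassemble(cmpfun(add_inline))  # the addition shows up as an ADD instruction
disassemble(cmpfun(add_call))    # the call to f shows up as a GETFUN/CALL sequence
```

Note that R’s JIT compiles functions automatically after a few calls anyway, so both benchmarked versions were almost certainly already byte-compiled - the gap isn’t “compiled vs interpreted”, it’s “inlined arithmetic vs a full closure call”.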
What? There’s a 10x difference between inlining x + 1 and calling out to a function.
How does that scale? I made a similar comparison with the addition performed five times
add5_inline <- function(z) { for (i in seq_along(z)) z[i] + 1 + 1 + 1 + 1 + 1 }
add5_call <- function(z) { for (i in seq_along(z)) f(f(f(f(f(z[i]))))) }
n <- 1e4
bm5 <- bench::mark(
  inline = add_inline(1:n),
  call = add_call(1:n),
  inline5 = add5_inline(1:n),
  call5 = add5_call(1:n),
  iterations = 100
)
plot(bm5)
bm5[, 1:7]
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 inline 509.3µs 2.06ms 511. 0B 0
#> 2 call 5.3ms 5.6ms 125. 0B 8.00
#> 3 inline5 746µs 773.44µs 1237. 38.1KB 0
#> 4 call5 20.5ms 21.53ms 46.2 18.5KB 13.8
That’s nearly a 30x difference.
Granted, that’s spread over 10,000 calls, so it’s not a whole lot for any single call, but over many calls it adds up - especially inside a hot loop.
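To convince myself the gap really is a fixed per-call cost rather than something loop-specific, it can be measured on a single operation directly. A minimal sketch using bench again (the absolute numbers will of course vary by machine):

```r
f <- function(x) x + 1

# Time a bare addition vs the same addition wrapped in one function call;
# the difference in medians approximates the fixed cost of a closure call.
bench::mark(inline = 1 + 1, call = f(1), iterations = 1e4)
```

bench::mark checks that both expressions return the same value, so this doubles as a sanity check that the two forms are equivalent.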
What if we actually do something with the values? I also added a sapply
version, though since sapply has to take a function there’s no strictly comparable way to inline it
add3_inline <- function(z) { a <- z[NA]; for (i in seq_along(z)) a[i] <- z[i] + 1 + 1 + 1; a }
f3 <- function(x) f(f(f(x)))
add3_call <- function(z) { a <- z[NA]; for (i in seq_along(z)) a[i] <- f3(z[i]); a }
add3_sapply_call <- function(z) { sapply(z, f3) }
n <- 1e4
bms <- bench::mark(
  inline = add3_inline(1:n),
  call = add3_call(1:n),
  sapply_call = add3_sapply_call(1:n),
  iterations = 100
)
plot(bms)
bms[, 1:7]
#> # A tibble: 3 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 inline 991.8µs 1.08ms 897. 186KB 0
#> 2 call 18ms 18.73ms 52.4 184KB 12.3
#> 3 sapply_call 19.2ms 21.19ms 46.8 363KB 13.2
Still, 18x?
I don’t know how to explain this apart from “there’s a not-insignificant overhead to calling a function vs inlining the body”. Have I missed something?
R is usually “fast enough” for everything I do - I’m rarely trying to cram a lot of evaluations into a given millisecond - but I was surprised that this single layer of abstraction added so much. Imagine a heavy calculation inside a hot loop involving a lot of composed function calls: how much could be saved by consolidating the composition into a single function body?
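For this toy example the standard escape hatch is to avoid per-element calls altogether and lean on vectorised operations, which run the loop in C. That sidesteps the overhead question rather than answering it - it only helps when the operation vectorises - but it’s the usual idiom:

```r
# One vectorised addition replaces 10,000 interpreted function calls
add3_vec <- function(z) z + 3

# Same result as applying the composed +1+1+1 element by element
z <- 1:1e4
stopifnot(identical(add3_vec(z), sapply(z, function(x) x + 1 + 1 + 1)))
```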
Just for context of how “slow” this is, I wanted to see how Rust performs and … ouch.
rextendr::rust_function(
  "fn add3_rust(z:Vec<i32>) -> Vec<i32> { z.iter().map(|x| x + 3).collect() }",
  profile = "release"
)
#> ℹ build directory: '/tmp/RtmpqsZ3jP/file73e7742716b8f'
#> ✔ Writing '/tmp/RtmpqsZ3jP/file73e7742716b8f/target/extendr_wrappers.R'
n <- 1e4
bmr <- bench::mark(
  inline = add3_inline(1:n),
  call = add3_call(1:n),
  sapply_call = add3_sapply_call(1:n),
  rust = add3_rust(1:n),
  iterations = 100
)
plot(bmr)
bmr[, 1:7]
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 inline 1.03ms 1.08ms 925. 156.4KB 0
#> 2 call 18.52ms 21.53ms 41.7 156.4KB 11.8
#> 3 sapply_call 19.79ms 23.6ms 41.8 362.6KB 13.2
#> 4 rust 18.37µs 24.21µs 14165. 78.2KB 0
Yeah, my best R is 40x slower than Rust at this particular example (and the f3(x)
R version is 875x slower than Rust). I love the idea of calling out to Rust for performance, and will definitely be doing that a bit more.
If you have an explanation about the function calling overhead, please let me know here or on Mastodon.