R has a number of looping constructs, some strictly sequential and some that can be parallelized. Here I measure their speed for tasks of different sizes. The tests are done on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz using R 3.4.4.
We run a loop over a “task”, a function that computes the trace of the inverse of a random matrix of a given size:
trace <- function(N) {
  ## trace of the inverse of an N x N matrix of standard normal draws
  ## (note: this definition masks base R's trace() debugging utility)
  x <- matrix(rnorm(N*N), N, N)
  sum(diag(solve(x)))
}
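As a quick sanity check (not part of the benchmarks below), a single call returns one numeric value:
set.seed(1)   # arbitrary seed, for reproducibility only
trace(10)     # trace of the inverse of one 10 x 10 random matrix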
We compute the trace R times by submitting a vector of length R to the looping construct. In all cases we return a list of R values, as we model the case where we need to store at least some of the results computed in the loop. We test the following loops:
- for
- while with a manual counter increment
- lapply
- foreach(...) %do%
- mclapply running on multiple cores
- foreach %dopar% with the registerDoParallel environment

The loops are run as follows:
forLoop <- function(R) {
  ## pre-allocate the result list
  a <- vector("list", R)
  for(i in seq(R)) {
    a[[i]] <- trace(N)
  }
  a
}
whileLoop <- function(R) {
  ## pre-allocate the result list
  a <- vector("list", R)
  i <- 0
  while(i < R) {
    i <- i + 1
    a[[i]] <- trace(N)
  }
  a
}
lapplyLoop <- function(R) {
  lapply(seq(R), function(x) trace(N))
}
doLoop <- function(R) {
  foreach(x = seq(R)) %do% trace(N)
}
mclapplyLoop <- function(R) {
  ## forks P worker processes
  mclapply(seq(R), function(x) trace(N), mc.cores=P)
}
doparLoop <- function(R) {
  ## uses whatever backend is registered with registerDoParallel()
  foreach(w = seq(R)) %dopar% trace(N)
}
We use the microbenchmark package, report timings in milliseconds, and start with no parallelism (P = 1).
N <- 10
R <- 100
P <- 1
registerDoParallel(cores=P)
microbenchmark(forLoop(R), whileLoop(R),
lapplyLoop(R), doLoop(R), mclapplyLoop(R), doparLoop(R),
control=list(warmup=10))
## Unit: milliseconds
## expr min lq mean median uq max neval
## forLoop(R) 4.078027 4.178272 4.384721 4.237074 4.286943 7.615022 100
## whileLoop(R) 4.108514 4.198623 4.433847 4.247306 4.316617 8.261299 100
## lapplyLoop(R) 4.119630 4.212962 4.455840 4.252858 4.312328 8.849039 100
## doLoop(R) 24.451473 25.022616 26.150421 25.390442 26.799916 46.319618 100
## mclapplyLoop(R) 4.088158 4.221177 4.482490 4.294568 4.378143 9.245137 100
## doparLoop(R) 19.415499 20.926860 23.610211 21.185729 22.575442 126.409148 100
We can get a number of interesting results from the table.

- The built-in loops (for, while, lapply) run at virtually the same speed.
- The foreach family is 5-6 times slower. Apparently, the large number of options causes noticeable overhead for foreach.
- foreach-%dopar% runs even faster than foreach-%do%.

We only test 2- and 4-fold parallelism (as the CPU has 4 cores). As the built-in loops are virtually equal, we only benchmark the parallel versions against the for-loop below.
P <- 2
registerDoParallel(cores=P)
microbenchmark(forLoop(R),
doLoop(R), mclapplyLoop(R), doparLoop(R),
control=list(warmup=10))
## Unit: milliseconds
## expr min lq mean median uq max neval
## forLoop(R) 4.142546 4.848619 5.584471 5.240928 5.972274 14.93132 100
## doLoop(R) 25.598150 28.818500 31.568113 30.628691 32.840927 84.65080 100
## mclapplyLoop(R) 6.890511 9.077193 10.484795 10.068753 10.913809 25.12076 100
## doparLoop(R) 28.332939 31.797713 36.375949 33.968924 36.500402 131.02642 100
P <- 4
registerDoParallel(cores=P)
microbenchmark(forLoop(R),
doLoop(R), mclapplyLoop(R), doparLoop(R),
control=list(warmup=10))
## Unit: milliseconds
## expr min lq mean median uq max neval
## forLoop(R) 4.131459 4.761316 5.346676 5.099435 5.805955 9.56446 100
## doLoop(R) 24.748887 28.083830 30.447750 29.754592 31.571215 93.55436 100
## mclapplyLoop(R) 8.804902 10.693054 11.945832 11.582869 12.501200 26.70816 100
## doparLoop(R) 29.876154 32.410709 34.468055 33.655811 35.737043 51.15542 100
Handling more than one worker makes the parallel routines noticeably slower: these tasks are too small to actually benefit from parallel execution. mclapply runs at roughly 50% of the speed of for, while the difference between %do% and %dopar% is more like 10%. Spawning 4 workers causes some additional overhead compared to the 2-worker case, about 20% in the case of mclapply and 3% for %dopar%.
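To see how much of this is pure dispatch cost, one could time the same parallel constructs over an essentially empty task. This is a hypothetical check, not part of the tables above:
registerDoParallel(cores = 2)   # re-register a 2-worker backend for this check
microbenchmark(mclapply(seq(R), function(x) NULL, mc.cores = 2),
               foreach(x = seq(R)) %dopar% NULL,
               times = 10)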
Next, we run a similar number of large tasks (N = 400).
N <- 400
R <- 100
P <- 1
registerDoParallel(cores=P)
microbenchmark(forLoop(R), whileLoop(R),
lapplyLoop(R), doLoop(R), mclapplyLoop(R), doparLoop(R),
times=4, control=list(warmup=1))
## Unit: seconds
## expr min lq mean median uq max neval
## forLoop(R) 6.446824 6.452655 6.485609 6.485555 6.518563 6.524501 4
## whileLoop(R) 6.427717 6.428578 6.448148 6.435780 6.467719 6.493316 4
## lapplyLoop(R) 6.423989 6.428965 6.444378 6.446152 6.459792 6.461220 4
## doLoop(R) 6.381831 6.383609 6.404575 6.392188 6.425542 6.452094 4
## mclapplyLoop(R) 6.440837 6.449752 6.629433 6.466622 6.809114 7.143651 4
## doparLoop(R) 6.378982 6.388411 6.420240 6.402173 6.452070 6.497633 4
As you can see, when the execution time reaches into seconds, all the loops run at virtually the same speed. This is to be expected, as the looping overhead is negligible next to the task of inverting 400x400 matrices.
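A back-of-envelope check using the medians reported above (the numbers are hard-coded from the tables, so treat them as approximate):
overhead_per_iter <- (25.39 - 4.24)/100   # doLoop vs forLoop at N = 10, per iteration, ms
cost_per_task <- 6485/100                 # one trace(400) call, ms (from the 6.485 s median)
overhead_per_iter/cost_per_task           # around 0.3% of the per-task cost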
As all the contenders perform similarly here, we only compare the for-loop, as the baseline, with the two parallelized loops mclapply and %dopar%:
N <- 400
R <- 100
P <- 2
registerDoParallel(cores=P)
microbenchmark(forLoop(R), mclapplyLoop(R), doparLoop(R),
times=4, control=list(warmup=1))
## Unit: seconds
## expr min lq mean median uq max neval
## forLoop(R) 6.460486 6.490344 6.589692 6.585137 6.689041 6.728010 4
## mclapplyLoop(R) 3.361297 3.422785 3.480489 3.508789 3.538193 3.543081 4
## doparLoop(R) 3.330712 3.338512 3.396184 3.385586 3.453855 3.482850 4
N <- 400
R <- 100
P <- 4
registerDoParallel(cores=P)
microbenchmark(forLoop(R), mclapplyLoop(R), doparLoop(R),
times=4, control=list(warmup=1))
## Unit: seconds
## expr min lq mean median uq max neval
## forLoop(R) 6.427266 6.442492 6.452247 6.458069 6.462003 6.465585 4
## mclapplyLoop(R) 1.784055 1.786218 1.798606 1.799227 1.810995 1.811916 4
## doparLoop(R) 1.718475 1.723604 1.765193 1.765118 1.806782 1.812063 4
The parallel results indicate that we get a close to linear increase in speed by adding workers, at least up to P = 4, the number of CPU cores. mclapply and %dopar% perform identically well here.
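From the medians above we can read off the speedup relative to the sequential for-loop (values hard-coded from the P = 4 table):
c(mclapply = 6.458/1.799, dopar = 6.458/1.765)   # roughly 3.6-fold on 4 cores for both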
lapply or foreach? We saw that for small tasks, both foreach and parallelism have noticeable overheads. But is there a task size where it pays off to use mclapply but not %dopar% for parallelization? We try a medium-sized task (N = 100) with two workers, again using the for-loop as the benchmark:
N <- 100
R <- 100
P <- 2
registerDoParallel(cores=P)
microbenchmark(forLoop(R), mclapplyLoop(R), doparLoop(R),
times=4, control=list(warmup=1))
## Unit: milliseconds
## expr min lq mean median uq max neval
## forLoop(R) 171.30374 172.18295 174.5544 174.4457 176.9258 178.0223 4
## mclapplyLoop(R) 97.98898 99.03379 100.4681 100.0959 101.9023 103.6916 4
## doparLoop(R) 122.37250 124.61882 180.1563 176.8257 235.6937 244.6011 4
Indeed, for this task size we find that mclapply is noticeably faster than both of the other loops in the table. The opportunity window for mclapply is relatively narrow, though: further testing indicates that its advantage holds for N values roughly between 50 and 150.
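That window can be probed by repeating the comparison over a grid of matrix sizes; a rough sketch of such a scan (not shown in the output above, and the N values are only illustrative):
for (N in c(25, 50, 100, 150, 200)) {   # N is the global used inside the loop functions
  cat("N =", N, "\n")
  print(microbenchmark(forLoop(R), mclapplyLoop(R), doparLoop(R),
                       times = 4, control = list(warmup = 1)))
}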
For the task sizes analyzed here, there are two conclusions:
- For small tasks, use for, while or lapply. They run at virtually the same speed, so pick whichever is the most convenient. The foreach family is noticeably slower.
- For large tasks, even the foreach-loops are equivalent. Just choose the most convenient one.

However, the current post leaves a number of cases out. In particular, tiny tasks, and tasks that do not return a vector, were not tested here.