Setup

R has a number of looping constructs, some strictly sequential and some that can be parallelized. Here we measure their speed on tasks of different sizes. The tests are run on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz using R 3.4.4.

We run a loop over “task”, a function that computes the trace of the inverse of a random matrix of a given size:

trace <- function(N) {
   x <- matrix(rnorm(N*N), N, N)
   sum(diag(solve(x)))
}

We compute the trace R times by submitting a vector of length R to the looping construct. In all cases we return a list of R values, as we model the case where we need to store at least some of the results computed in the loop. We test the following loops below:

  1. plain for
  2. while with manual counter increments
  3. lapply
  4. foreach(...) %do%
  5. parallel mclapply running on multiple cores
  6. parallel foreach %dopar% with the doParallel backend registered via registerDoParallel.
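The benchmarks below assume a few packages are attached; the original runs do not show this step, so here is a minimal setup sketch (package names as on CRAN):

```r
library(microbenchmark)  # timing harness used throughout
library(parallel)        # provides mclapply
library(foreach)         # provides %do% and %dopar%
library(doParallel)      # backend for %dopar%, registered via registerDoParallel
```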

The loops are run as follows:

forLoop <- function(R) {
   a <- vector("list", R)  # pre-allocate the result list
   for(i in seq(R)) {
      a[[i]] <- trace(N)
   }
   a
}

whileLoop <- function(R) {
   a <- vector("list", R)  # pre-allocate the result list
   i <- 0
   while(i < R) {
      i <- i + 1
      a[[i]] <- trace(N)
   }
   a
}

lapplyLoop <- function(R) {
   lapply(seq(R), function(x) trace(N))
}

doLoop <- function(R) {
   foreach(x = seq(R)) %do% trace(N)
}

mclapplyLoop <- function(R) {
   mclapply(seq(R), function(x) trace(N), mc.cores=P)
}

doparLoop <- function(R) {
   foreach(w = seq(R)) %dopar% trace(N)
}

We use the microbenchmark library and report the timings in milliseconds.

Small tasks: N = 10 and R = 100

No parallelism

N <- 10
R <- 100
P <- 1
registerDoParallel(cores=P)
microbenchmark(forLoop(R), whileLoop(R),
               lapplyLoop(R), doLoop(R), mclapplyLoop(R), doparLoop(R),
               control=list(warmup=10))
## Unit: milliseconds
##             expr       min        lq      mean    median        uq        max neval
##       forLoop(R)  4.078027  4.178272  4.384721  4.237074  4.286943   7.615022   100
##     whileLoop(R)  4.108514  4.198623  4.433847  4.247306  4.316617   8.261299   100
##    lapplyLoop(R)  4.119630  4.212962  4.455840  4.252858  4.312328   8.849039   100
##        doLoop(R) 24.451473 25.022616 26.150421 25.390442 26.799916  46.319618   100
##  mclapplyLoop(R)  4.088158  4.221177  4.482490  4.294568  4.378143   9.245137   100
##     doparLoop(R) 19.415499 20.926860 23.610211 21.185729 22.575442 126.409148   100

We can get a number of interesting results from the table.

  • Among the non-parallel loops, there are clearly two classes:
    1. all the built-in loops (for, while, lapply) run at virtually the same speed.
    2. the foreach family is 5-6 times slower. Apparently, the machinery behind foreach's many options causes noticeable overhead.
  • Both parallel versions run with the parallel backend but only one registered CPU core. The good news is that in this case there is virtually no extra overhead: managing just one worker is easy. Parallel-enabled foreach %dopar% even runs faster than foreach %do%.
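The size of the foreach overhead can be estimated per iteration from the median timings in the table above (a back-of-envelope sketch; the figures are copied from the table):

```r
## Per-iteration overhead of foreach %do% relative to a plain for loop,
## using the median timings (in ms) for R = 100 iterations from the table.
median_for <- 4.237074   # forLoop(R) median, ms
median_do  <- 25.390442  # doLoop(R) median, ms
R <- 100
overhead_per_iter <- (median_do - median_for) / R  # ms per iteration
overhead_per_iter  # roughly 0.21 ms of foreach overhead per iteration
```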

Parallelism

We test only 2- and 4-fold parallelism (as the CPU has 4 cores). As the built-in loops are virtually equal, below we benchmark only the for loop (as the serial baseline) together with the foreach and parallel versions.

2-fold

P <- 2
registerDoParallel(cores=P)
microbenchmark(forLoop(R),
               doLoop(R), mclapplyLoop(R), doparLoop(R),
               control=list(warmup=10))
## Unit: milliseconds
##             expr       min        lq      mean    median        uq       max neval
##       forLoop(R)  4.142546  4.848619  5.584471  5.240928  5.972274  14.93132   100
##        doLoop(R) 25.598150 28.818500 31.568113 30.628691 32.840927  84.65080   100
##  mclapplyLoop(R)  6.890511  9.077193 10.484795 10.068753 10.913809  25.12076   100
##     doparLoop(R) 28.332939 31.797713 36.375949 33.968924 36.500402 131.02642   100

4-fold

P <- 4
registerDoParallel(cores=P)
microbenchmark(forLoop(R),
               doLoop(R), mclapplyLoop(R), doparLoop(R),
               control=list(warmup=10))
## Unit: milliseconds
##             expr       min        lq      mean    median        uq      max neval
##       forLoop(R)  4.131459  4.761316  5.346676  5.099435  5.805955  9.56446   100
##        doLoop(R) 24.748887 28.083830 30.447750 29.754592 31.571215 93.55436   100
##  mclapplyLoop(R)  8.804902 10.693054 11.945832 11.582869 12.501200 26.70816   100
##     doparLoop(R) 29.876154 32.410709 34.468055 33.655811 35.737043 51.15542   100

Handling more than one worker makes the parallel routines noticeably slower: these tasks are too small to actually benefit from parallel execution. mclapply runs at roughly 50% of the speed of for, while the difference between %do% and %dopar% is more like 10%. Spawning 4 workers causes some additional overhead compared to the 2-worker case: 20% in the case of mclapply and 3% for %dopar%.

Small number of large tasks: N = 400 and R = 100

Next, we run the same small number of much larger tasks (N = 400).

No parallelism

N <- 400
R <- 100
P <- 1
registerDoParallel(cores=P)
microbenchmark(forLoop(R), whileLoop(R),
               lapplyLoop(R), doLoop(R), mclapplyLoop(R), doparLoop(R),
               times=4, control=list(warmup=1))
## Unit: seconds
##             expr      min       lq     mean   median       uq      max neval
##       forLoop(R) 6.446824 6.452655 6.485609 6.485555 6.518563 6.524501     4
##     whileLoop(R) 6.427717 6.428578 6.448148 6.435780 6.467719 6.493316     4
##    lapplyLoop(R) 6.423989 6.428965 6.444378 6.446152 6.459792 6.461220     4
##        doLoop(R) 6.381831 6.383609 6.404575 6.392188 6.425542 6.452094     4
##  mclapplyLoop(R) 6.440837 6.449752 6.629433 6.466622 6.809114 7.143651     4
##     doparLoop(R) 6.378982 6.388411 6.420240 6.402173 6.452070 6.497633     4

As you can see, when execution time runs into seconds, all the loops run at virtually the same speed. This is to be expected, as the looping overhead is negligible next to the task of inverting 400x400 matrices.

Parallelism

As all the contenders perform similarly here, we compare only the for loop, as the baseline, with the two parallel loops: mclapply and %dopar%.

2-fold parallelism

N <- 400
R <- 100
P <- 2
registerDoParallel(cores=P)
microbenchmark(forLoop(R), mclapplyLoop(R), doparLoop(R),
               times=4, control=list(warmup=1))
## Unit: seconds
##             expr      min       lq     mean   median       uq      max neval
##       forLoop(R) 6.460486 6.490344 6.589692 6.585137 6.689041 6.728010     4
##  mclapplyLoop(R) 3.361297 3.422785 3.480489 3.508789 3.538193 3.543081     4
##     doparLoop(R) 3.330712 3.338512 3.396184 3.385586 3.453855 3.482850     4

4-fold parallelism

N <- 400
R <- 100
P <- 4
registerDoParallel(cores=P)
microbenchmark(forLoop(R), mclapplyLoop(R), doparLoop(R),
               times=4, control=list(warmup=1))
## Unit: seconds
##             expr      min       lq     mean   median       uq      max neval
##       forLoop(R) 6.427266 6.442492 6.452247 6.458069 6.462003 6.465585     4
##  mclapplyLoop(R) 1.784055 1.786218 1.798606 1.799227 1.810995 1.811916     4
##     doparLoop(R) 1.718475 1.723604 1.765193 1.765118 1.806782 1.812063     4

The parallel results indicate that adding workers gives a close to linear increase in speed, at least up to P = 4, the number of CPU cores. mclapply and %dopar% perform equally well here.
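The claim of near-linear scaling can be checked directly from the median timings in the two tables above (a back-of-envelope sketch; figures copied from the tables, in seconds):

```r
## Speedup relative to the serial for loop, using each run's own medians (s).
speedup2 <- 6.585137 / c(mclapply = 3.508789, dopar = 3.385586)  # P = 2 run
speedup4 <- 6.458069 / c(mclapply = 1.799227, dopar = 1.765118)  # P = 4 run
efficiency4 <- speedup4 / 4   # fraction of ideal linear scaling
round(speedup2, 2)    # about 1.9x with two workers
round(speedup4, 2)    # about 3.6x with four workers
round(efficiency4, 2) # about 0.9, i.e. close to linear
```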

The middle case: lapply or foreach?

We saw that for small tasks, both foreach and parallelism have noticeable overheads. But is there a task size where it pays off to use mclapply but not %dopar% for parallelization? We try a medium-sized task (N = 100) with two workers, again using the for loop as the baseline:

N <- 100
R <- 100
P <- 2
registerDoParallel(cores=P)
microbenchmark(forLoop(R), mclapplyLoop(R), doparLoop(R),
               times=4, control=list(warmup=1))
## Unit: milliseconds
##             expr       min        lq     mean   median       uq      max neval
##       forLoop(R) 171.30374 172.18295 174.5544 174.4457 176.9258 178.0223     4
##  mclapplyLoop(R)  97.98898  99.03379 100.4681 100.0959 101.9023 103.6916     4
##     doparLoop(R) 122.37250 124.61882 180.1563 176.8257 235.6937 244.6011     4

Indeed, for this task size we find that mclapply is noticeably faster than both of the other loops in the table. The opportunity window for mclapply is relatively narrow, though: further testing indicates that its advantage holds for N values of roughly between 50 and 150.
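One way to map out that window is to sweep N and compare the medians directly. A sketch, assuming the functions and P are defined as above (the exact boundaries will of course vary by machine):

```r
## Sweep the task size N and rerun the medium-task benchmark at each value.
## The loop bodies (forLoop etc.) read N from the global environment, so the
## top-level for loop updates it for them.
R <- 100
for (N in c(25, 50, 100, 150, 200)) {
   cat("N =", N, "\n")
   print(microbenchmark(forLoop(R), mclapplyLoop(R), doparLoop(R),
                        times = 4, control = list(warmup = 1)))
}
```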

Conclusion

For the task sizes analyzed here, there are two conclusions:

  1. For small tasks, the built-in loops (for, while, lapply) are the fastest option; the foreach family adds noticeable overhead, and parallelization does not pay off.
  2. For large tasks, all looping constructs perform virtually identically, and both mclapply and %dopar% deliver a close to linear speedup up to the number of CPU cores.

However, the current post leaves a number of cases out. In particular, tiny tasks, and tasks that do not return a vector, were not tested here.