Setup

In a sister post I compare lookup times in integer-key dictionaries in R and benchmark them against a few basic Python approaches. Here I essentially repeat the same test, but instead of one-by-one lookups I extract values corresponding to a whole vector of keys.

The tests are done on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, using a single core.

I create a vector of keys and a vector of values, and use the former to look up entries in the latter. I test several different data structures for storing these, and several related lookup methods.

library(foreach)
library(magrittr)
library(microbenchmark)
library(data.table)
                           # packages used below: foreach loops, pipes, timing, data.table
L <- 10
                           # number of look-ups
B <- 10000
                           # key vector size for a single lookup
ns <- c(3e1, 1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5, 3e5, 1e6)
                           # lookup table sizes

Here I access a long vector of values only a few times (L = 10), each time looking up a vector of B = 10,000 keys, from tables of different sizes. This is the opposite extreme to the individual lookups in the sister post. The table sizes are

ns
##  [1] 3e+01 1e+02 3e+02 1e+03 3e+03 1e+04 3e+04 1e+05 3e+05 1e+06

We use the microbenchmark library and report timings in milliseconds.

Benchmark: no lookup

First we do the exercise with no actual value lookup. This serves as a benchmark: for small lookup tables in particular, the actual lookup time may be small compared to the surrounding overhead. We simply loop over the key vectors and pick the values at those positions, using plain positional indexing.

doNothing <- function(sKeys, values) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- values[key]
   }
}
time <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(seq(along=values))
                           # keys are just the integer indices, in random order
   shuffledKeys <- foreach(l = seq(length=L)) %do% {
      sample(keys, B, replace=TRUE)
   }
                           # a list of L index vectors of size B
   microbenchmark(doNothing(shuffledKeys, values),
                                  times=10L,
                                  control=list(warmup=10)) %>%
                           # find keys in random order
      as.data.frame() %>%
      extract2("time") %>%
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
results <- data.frame(n=ns, "no lookup"=time, check.names=FALSE)
time
##        30       100       300      1000      3000     10000     30000 
## 0.8176300 0.5927398 0.5904013 0.6026060 0.6177570 0.5991526 0.5849503 
##     1e+05     3e+05     1e+06 
## 0.5535261 0.5168898 0.7336963

We see that the baseline, i.e. the lower bound for any lookup method, is roughly 0.5-0.8 ms. This time does not depend on the lookup table size, as we use plain integer indexing.

Built-in match

R has a built-in match function whose task is exactly this kind of lookup. As it takes vector arguments, we can simply loop over the key vectors.
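
As a minimal illustration with made-up keys and values (not the benchmark data below), match returns the positions of the queried keys in the key vector, and those positions then index the value vector:

toyKeys <- c(11L, 7L, 3L)
toyValues <- c(100L, 200L, 300L)
match(c(3L, 11L), toyKeys)
                           # positions 3 and 1 in 'toyKeys'
toyValues[match(c(3L, 11L), toyKeys)]
                           # 300 and 100

The benchmark simply applies this to each shuffled key vector: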

doMatch <- function(sKeys, keys, values) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- values[match(key, keys)]
   }
}

timeMatch <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(values)
                           # keys are values in random order
   shuffledKeys <- foreach(l = seq(length=L)) %do% {
      sample(keys, B, replace=TRUE)
   }
                           # a list of L index vectors of size B
   microbenchmark::microbenchmark(doMatch(shuffledKeys, keys, values), times=2L,
                                  control=list(warmup=1)) %>%
                           # find keys in random order
      as.data.frame() %>%
      extract2("time") %>%  # extract time in nanoseconds
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
timeMatch
##         30        100        300       1000       3000      10000 
##   3.324577   1.397438   1.532703   2.294491   2.001345   2.816632 
##      30000      1e+05      3e+05      1e+06 
##   6.785501  17.699546  53.298807 354.059743

The result, 0.354 s for the 10^6-element table, is not impressive, but will probably be adequate for many practical applications.

Named vectors

R has a built-in form of lookup table in named vectors. These also accept vectors of names to look up. We test the speed by assigning the keys as names to the value vector; during the lookup we convert the numeric keys to character with values[as.character(key)]. Note that in many applications one may be able to pre-convert all keys to character.
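
As a minimal sketch with the same sort of made-up data, the named-vector lookup boils down to:

toyValues <- setNames(c(100L, 200L, 300L), c(11L, 7L, 3L))
toyValues[as.character(c(3L, 11L))]
                           # lookup by name; returns 300 and 100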

We get the following timings:

doNV1 <- function(sKeys, values) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- values[as.character(key)]
   }
}
timeNV <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(values)
                           # keys are values in random order
   values <- setNames(values, keys)
                           # create named vector keys/values
   shuffledKeys <- foreach(l = seq(length=L)) %do% {
      sample(keys, B, replace=TRUE)
   }
                           # a list of L index vectors of size B
   microbenchmark::microbenchmark(doNV1(shuffledKeys, values), times=1L) %>%
                           # find keys in random order
      as.data.frame() %>%
      extract2("time") %>%
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
resultsNV <- data.frame(n=ns, time=timeNV, method="named vectors", check.names=FALSE)
timeNV
##        30       100       300      1000      3000     10000     30000 
##  28.88445  26.96172  26.57887  28.05259  28.65140  31.25083  37.86916 
##     1e+05     3e+05     1e+06 
##  79.37346 202.27430 986.34658

As one can clearly see, this approach is noticeably slower, but 0.986 s for the 10^6-element table may still be acceptable in practice.

fmatch in “fastmatch”

Unlike the built-in match, which rebuilds its lookup hash table on every call, fastmatch::fmatch builds the hash table on the first call and reuses it afterwards. It is designed as a drop-in replacement for match and can give dramatically better times for long tables.
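
A minimal sketch of the drop-in usage, again with made-up data:

library(fastmatch)
toyKeys <- c(11L, 7L, 3L)
toyValues <- c(100L, 200L, 300L)
toyValues[fmatch(c(3L, 11L), toyKeys)]
                           # same result as match(); the hash table for 'toyKeys' is cached after the first call

The benchmark is otherwise identical to the match version: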

doFMatch <- function(sKeys, keys, values) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- values[fastmatch::fmatch(key, keys)]
   }
}

time <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(values)
                           # keys are values in random order
   shuffledKeys <- foreach(l = seq(length=L)) %do% {
      sample(keys, B, replace=TRUE)
   }
                           # a list of L index vectors of size B
   microbenchmark::microbenchmark(doFMatch(shuffledKeys, keys, values), times=2L,
                                  control=list(warmup=1)) %>%
                           # find keys in random order
      as.data.frame() %>%
      extract2("time") %>%  # extract time in nanoseconds
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
results[["fmatch"]] <- time
time
##       30      100      300     1000     3000    10000    30000    1e+05 
## 3.467176 1.077946 1.051287 1.254798 1.148966 1.320558 1.038458 1.059002 
##    3e+05    1e+06 
## 0.981577 1.008037

The result of 1.008 ms for the largest table, i.e. roughly 10 ns per key over the 10 x 10,000 lookups, is rather good and suggests that fmatch is also rather cache-friendly.

Hashed environment

For single-value lookups, hashed environments, the closest thing R offers to a hash map, provide the best performance. Unfortunately they do not support vectorized access, so we have to loop through the vector of keys. Note also that environments require the keys to be character strings, so the integer keys must be converted first. We create a vector of values, assign the keys to it as names, convert the named vector to a list, and finally turn it into a hashed environment with list2env(., hash=TRUE). We perform the lookup with values[[key]], converting each batch of keys to character beforehand in a vectorized fashion.
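
A minimal sketch of the construction and of a single lookup, with made-up data:

toyValues <- setNames(c(100L, 200L, 300L), c(11L, 7L, 3L))
ht <- list2env(as.list(toyValues), hash=TRUE)
ht[["3"]]
                           # keys must be character; returns 300

The benchmark loops over the key vectors, and over the individual keys within each vector: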

doHEnv <- function(sKeys, values) {
   ## find values, corresponding to keys
   for(bKeys in sKeys) {
                           # extract the vector of bunch keys
      for(key in as.character(bKeys)) {
         v <- values[[key]]
      }
   }
}
time <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(values)
                           # we need character values to access the environment
   values <- setNames(values, keys) %>%
                            # make it into a named vector...
      as.list() %>%
      list2env(hash=TRUE)
                           # and convert into a hashed environment
   shuffledKeys <- foreach(l = seq(length=L)) %do% {
      sample(keys, B, replace=TRUE)
   }
                           # a list of L index vectors of size B
   microbenchmark::microbenchmark(doHEnv(shuffledKeys, values), times=2L,
                                  control=list(warmup=1)) %>%
                           # find keys in random order
      as.data.frame() %>%
      extract2("time") %>%  # extract time in nanoseconds
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
results[["hashed env"]] <- time
time
##         30        100        300       1000       3000      10000 
##   38.41025   37.87887   39.26490   41.22010   45.20896   53.55510 
##      30000      1e+05      3e+05      1e+06 
##   79.44599  164.35403  387.01702 1214.16288

Unlike in the single-key case, hashed environments appear slow here. For the largest table, the time of 1.214 s is roughly 1000x slower than fmatch, and even slower than named vectors, the definitive loser in the single-key tests.

Data.tables

The data.table library offers an alternative to data frames and provides a built-in ordered index (the key). We use vectorized lookup by key to extract the values from the data.table.
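
A minimal sketch of a keyed lookup with made-up data; note that .() (an alias for list()) is needed in i for a join on the key, since a bare integer vector in i is interpreted as row numbers:

library(data.table)
toyDT <- data.table(keys=c(11L, 7L, 3L), values=c(100L, 200L, 300L), key="keys")
toyDT[.(c(3L, 11L)), values]
                           # join on the key column; returns 300 and 100

The benchmarked version loops over the shuffled key vectors: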

doDT <- function(sKeys, dtable) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- dtable[key, values]
   }
}
time <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(values)
                            # keys are values in random order
   dtable <- data.table(keys, values, key="keys")
                            # set the key, i.e. an ordered index, on the table
   shuffledKeys <- foreach(l = seq(length=L)) %do% {
      sample(keys, B, replace=TRUE)
   }
                           # a list of L index vectors of size B
   microbenchmark::microbenchmark(doDT(shuffledKeys, dtable), times=3L,
                                  control=list(warmup=2)) %>%
      as.data.frame() %>%
      extract2("time") %>%  # extract time in nanoseconds
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
results[["data.table"]] <- time
time
##        30       100       300      1000      3000     10000     30000 
## 18.241417  8.101245  8.143146  8.210385  8.120365  8.097955  8.170750 
##     1e+05     3e+05     1e+06 
##  8.088662  8.162855  8.087391

Unlike in the single-key case, data.tables appear quite fast, and their access time is remarkably constant across table sizes. Clearly, they offer a good alternative to fmatch, at least for multi-value lookups. The huge performance gain of the vectorized lookup over the single-key case suggests that [.data.table carries a substantial per-call overhead.

Another benchmark: do the same in Python

Finally, let's leave R and use the standard Python dict, followed by an index-based lookup in pandas.

import time
import random
import pandas as pd
L = 10
B = 10000
ns = [30, 100, 300, 1000, 3000, 10000, 30000, 100000, 300000, 1000000]
td = []
for n in ns:
    values = [i for i in range(n)]
    keys = values.copy()
    random.shuffle(keys)
    table = dict(zip(values, keys))
    lk = [[random.choice(keys) for _ in range(B)]
           for _ in range(L)]
    # list of list of keys to be looked up
    t0 = time.time()
    # start the clock
    for bKeys in lk:
        # take the bunch
        for k in bKeys:
            # take individual keys in bunches
            v = table[k]
    td.append((time.time() - t0)*1000)
    # in milliseconds
## Now repeat with pandas
tp = []
for n in ns:
    values = [i for i in range(n)]
    keys = values.copy()
    random.shuffle(keys)
    #
    table = pd.DataFrame({ "values" : values}, index=keys)
    lk = [[random.choice(keys) for _ in range(B)]
           for _ in range(L)]
    # list of list of keys to be looked up
    t0 = time.time()
    # start the clock
    for bKeys in lk:
        # take the bunch
        v = table.loc[bKeys]
        # index by vector of keys
    tp.append((time.time() - t0)*1000)
df = pd.DataFrame({ "n" : ns, "dict" : td , "pandas" : tp})
df.to_csv("python_results.csv", index=False)
print(df)
##         dict        n      pandas
## 0   9.900331       30   56.673765
## 1   9.639502      100   54.870844
## 2   9.468794      300   55.068493
## 3   9.374380     1000   61.885595
## 4   9.456873     3000   69.737673
## 5  10.689974    10000  112.521410
## 6  11.074543    30000  144.275427
## 7  15.145540   100000  205.878973
## 8  25.300026   300000  384.438038
## 9  31.639814  1000000  964.721203

The Python dict appears to be outperformed by several R methods here. The explicit Python-level loop needed for the non-vectorized lookup is probably the main culprit. Plain label-based lookup in pandas does even worse.

Conclusions and the final plot

Finally, here is a plot with all approaches together.

(Plot: lookup time versus lookup table size for all methods.)

We see that fmatch is clearly the best performer across all table sizes. As it is very easy to use and performs well for one-element lookups too, it seems to be the best overall choice for integer-key lookups in R. Data tables also offer stable and reasonably good performance. The worst performers are named vectors, probably because of their sequential search, and hashed environments, which do not offer vectorized lookup.

Surprisingly, the basic Python methods do not perform well. Dicts, which offer by far the best performance for single-element lookups, are slow here, probably because the per-element lookups run in an interpreted Python loop rather than being vectorized at the C level. And pandas, despite its index-based hash table, is even slower, which suggests substantial overhead in its label-based indexing as well. As the aim of this post is to compare various R methods and benchmark them against simple Python approaches, I do not assess whether faster Python methods exist.