Setup

In an earlier post I compared the time for looking up values in a 2D table, indexed by different types of objects. Now let's take a step back and analyze a simple 1D value lookup. I focus on integer-integer lookup, i.e. both keys and values are integers. The type of the values probably matters little, at least for atomic data types, as most of the work presumably goes into locating the keys in the key table. I am also solely interested in elementwise lookups, i.e. I look up a single integer key at a time instead of a vector of keys.

The tests are done on Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz on a single core.

We create pairs of vectors to keep the values, and use the first one to look up the value in the second one.

library(foreach)
library(magrittr)
library(microbenchmark)
library(data.table)
L <- 1e5
                           # number of look-ups
ns <- c(1e1, 3e1, 1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5, 3e5)
                           # lookup table sizes used below

We look up 10^5 values from tables of different sizes:

ns
##  [1] 1e+01 3e+01 1e+02 3e+02 1e+03 3e+03 1e+04 3e+04 1e+05 3e+05

We use the microbenchmark library and report timings in milliseconds.

Benchmark: no lookup

First we do the exercise with no actual value lookup. This serves as a baseline: in particular for small lookup tables, the actual lookup time may be small compared to the related overheads. We just loop over all keys and pick the value at that position, using plain positional indexing.

doNothing <- function(sKeys, keys, values) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- values[key]
   }
}
time <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(seq(along=values))
                           # keys are just the integer indices, in random order
   shuffledKeys <- sample(keys, L, replace=TRUE)
   microbenchmark(doNothing(shuffledKeys, keys, values),
                                  times=10L,
                                  control=list(warmup=10)) %>%
                           # find keys in random order
      as.data.frame() %>%
      extract2("time") %>%
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
results <- data.frame(n=ns, "no lookup"=time, check.names=FALSE)
time
##       10       30      100      300     1000     3000    10000    30000 
## 5.896617 5.547590 5.257778 5.609493 5.165785 5.534773 5.269728 5.620854 
##    1e+05    3e+05 
## 5.312086 6.114928

We see that the practical lower bound is roughly 5-6 ms. This time does not depend on the lookup table size, as we use plain integer indexing.

Built-in match

R has a built-in match function whose task is to do exactly this kind of lookup. We loop over the keys in randomized order and use match to find the position of the corresponding key in the table.

doMatch <- function(sKeys, keys, values) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- values[match(key, keys)]
   }
}

timeMatch <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(values)
                           # keys are values in random order
   shuffledKeys <- sample(keys, L, replace=TRUE)
   microbenchmark::microbenchmark(doMatch(shuffledKeys, keys, values), times=2L,
                                  control=list(warmup=1)) %>%
                           # find keys in random order
      as.data.frame() %>%
      extract2("time") %>%  # extract time in nanoseconds
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
timeMatch
##          10          30         100         300        1000        3000 
##    45.73507    44.19636    54.13771    73.48927   155.65892   343.43199 
##       10000       30000       1e+05       3e+05 
##  1152.46473  3447.61371 12012.88516 39546.64020

The result, roughly 40 s for the 3 × 10^5-element table, is not impressive but may still work for many practical applications.

Named vectors

R has a built-in version of a lookup table in the form of named vectors. We test the speed by assigning the keys as names to the value vector; during the lookup we then convert the numeric key to character and index with values[as.character(key)]. Note that in many applications one may also be able to pre-convert all the keys to character (see the sketch at the end of this section).

We get the following timings:

doNV1 <- function(sKeys, values) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- values[as.character(key)]
   }
}

timeNV <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(values)
                           # keys are values in random order
   values <- setNames(values, keys)
                           # create named vector keys/values
   shuffledKeys <- sample(keys, L, replace=TRUE)
   microbenchmark::microbenchmark(doNV1(shuffledKeys, values), times=1L) %>%
                           # find keys in random order
      as.data.frame() %>%
      extract2("time") %>%
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
timeNV
##          10          30         100         300        1000        3000 
##    133.6203    134.3291    148.8759    180.5306    339.4469   1125.3740 
##       10000       30000       1e+05       3e+05 
##   4482.4326  14657.5944  81816.0627 412083.6395

As one can clearly see, this approach is slow: 412 s is not a feasible time for lookups in a table of 3 × 10^5 elements.
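As noted above, part of this cost is the as.character() conversion done for every single key inside the loop. If the application allows converting all the lookup keys to character once, up front, the per-lookup conversion disappears. A minimal sketch of that variant (the doNV2 helper and the variable names are hypothetical and not benchmarked here):

doNV2 <- function(sCharKeys, values) {
   ## keys are already character: plain names-based indexing
   for(key in sCharKeys) {
      v <- values[key]
   }
}

vals <- sample(1000000000L, 1000)
keys2 <- sample(vals)
                           # keys are values in random order
vals <- setNames(vals, keys2)
charKeys <- as.character(sample(keys2, L, replace=TRUE))
                           # convert all lookup keys to character once, up front
doNV2(charKeys, vals)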

fmatch in “fastmatch”

Unlike the built-in match with its sequential lookup, fastmatch::fmatch builds the lookup hash table at the first call and reuses it afterwards. This results in dramatically better times for large tables.

doFMatch <- function(sKeys, keys, values) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- values[fastmatch::fmatch(key, keys)]
   }
}

time <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(values)
                           # keys are values in random order
   shuffledKeys <- sample(keys, L, replace=TRUE)
   microbenchmark::microbenchmark(doFMatch(shuffledKeys, keys, values), times=2L,
                                  control=list(warmup=1)) %>%
                           # find keys in random order
      as.data.frame() %>%
      extract2("time") %>%  # extract time in nanoseconds
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
results[["fmatch"]] <- time
time
##       10       30      100      300     1000     3000    10000    30000 
## 907.1395 885.9555 885.1643 879.7747 856.7758 861.5872 871.0315 882.2046 
##    1e+05    3e+05 
## 904.9987 886.8782

The result, about 0.9 s, seems a fair speed for most applications, and it is essentially independent of the table size.
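The caching also implies that the very first fmatch call against a given key table is more expensive than the later ones, since that is when the hash is built. A quick hedged check of this, separate from the timings above, might look like:

keys <- sample(1000000000L, 3e5)
system.time(fastmatch::fmatch(keys[1], keys))
                           # first call: builds (and caches) the hash table
system.time(fastmatch::fmatch(keys[2], keys))
                           # later calls: reuse the cached hash, much faster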

Hashed environment

R environments can be used as built-in hashed lookup tables, as long as the keys can be represented as character strings. This works well for integers. We create a vector of values, assign the keys to it as names, convert the named vector to a list, and finally turn the list into a hashed environment with list2env(., hash=TRUE). We perform the lookup in the environment with values[[as.character(key)]].

doHEnv <- function(sKeys, values) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- values[[as.character(key)]]
   }
}
time <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(values)
                           # keys are values in random order
   values <- setNames(values, keys) %>%
                           # name the value vector with the keys...
      as.list() %>%
      list2env(hash=TRUE)
                           # ...convert to a list and then to a hashed environment
   shuffledKeys <- sample(keys, L, replace=TRUE)
   microbenchmark::microbenchmark(doHEnv(shuffledKeys, values), times=2L,
                                  control=list(warmup=1)) %>%
                           # find keys in random order
      as.data.frame() %>%
      extract2("time") %>%  # extract time in nanoseconds
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
results[["hashed env"]] <- time
time
##        10        30       100       300      1000      3000     10000 
##  58.62803  58.74473  57.81125  59.13883  61.27899  63.94081  71.89576 
##     30000     1e+05     3e+05 
## 102.01760 189.86375 423.17053

So far, hashed environments are the fastest approach. For the largest table, 0.4 s is about twice as fast as fmatch.
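For completeness, environments also support looking up several keys in one call via mget(), which avoids one R-level call per key. Whether this helps is not tested here; a minimal sketch, assuming the hashed environment values and the shuffledKeys from the loop above, would be:

## assumes 'values' is the hashed environment and 'shuffledKeys'
## the integer lookup keys from the loop above
v <- mget(as.character(shuffledKeys), envir=values)
                           # named list with one looked-up value per key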

Data.tables

The data.table library offers an alternative to data frames and contains built-in ordering by a key (index). We use this index to look up the value from the data.table.

doDT <- function(sKeys, dtable) {
   ## find values, corresponding to keys
   for(key in sKeys) {
      v <- dtable[key, values]
   }
}
time <- foreach(n = ns, .combine=c) %do% {
   values <- sample(1000000000L, n)
   keys <- sample(values)
                           # keys are values in random order
   dtable <- data.table(keys, values, key="keys")
                           # set 'keys' as the key (sorted index) of the table
   shuffledKeys <- sample(keys, L, replace=TRUE)
   microbenchmark::microbenchmark(doDT(shuffledKeys, dtable), times=2L,
                                  control=list(warmup=1)) %>%
      as.data.frame() %>%
      extract2("time") %>%  # extract time in nanoseconds
      mean() %>%
      divide_by(1e6) %>%
      set_names(n)
}
results[["data.table"]] <- time
time
##       10       30      100      300     1000     3000    10000    30000 
## 40825.88 40695.60 40739.94 40724.71 40734.58 40264.48 40459.21 40314.97 
##    1e+05    3e+05 
## 40334.08 40324.63

Data tables appear to be surprisingly slow, despite the indexing being advertised as an “incredibly fast” lookup. I guess this is because [.data.table carries quite a bit of per-call overhead. If that is the case, one may expect a substantial speed improvement with vectorized lookups (see the sketch below). Even in the current form, data tables may still be substantially faster than the corresponding lookup in a data frame; such a test, however, is out of scope for this post.
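For reference, such a vectorized lookup would join on the key column and pass all keys to [.data.table at once, paying the call overhead only once. A minimal, untimed sketch, assuming the keyed dtable and the shuffledKeys from the loop above:

## assumes dtable is keyed on "keys" and shuffledKeys holds the lookup keys
v <- dtable[.(shuffledKeys), values]
                           # a single keyed join instead of 1e5 single-row queries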

Another benchmark: do the same in Python

Finally, let’s leave R and use the standard dict in Python.

import time
import random
import pandas as pd
L = 100000
ns = [10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000, 300000]
t = []
for n in ns:
    values = [i for i in range(n)]
    keys = values.copy()
    random.shuffle(keys)
    table = dict(zip(values, keys))
    lk = [random.choice(keys) for _ in range(L)]
    # keys to be looked up
    t0 = time.time()
    # start the clock
    for k in lk:
        # look up the value for this key
        v = table[k]
    t.append((time.time() - t0)*1000)
    # in milliseconds
df = pd.DataFrame({ "n" : ns, "time" : t })
df.to_csv("python_results.csv", index=False)
print(df)
##         n       time
## 0      10   8.660555
## 1      30   8.450270
## 2     100   8.121252
## 3     300   8.085012
## 4    1000   8.318663
## 5    3000   8.419037
## 6   10000   8.910894
## 7   30000   9.516478
## 8  100000  14.008999
## 9  300000  24.327040

Not surprisingly, Python is at least an order of magnitude faster than R.

Conclusions and the final plot

Finally, here is a plot with all approaches together.

[Figure: lookup times of all methods as a function of the lookup table size]

We see that indexing by names is always inferior to match. Of the tested R methods, the clear winner is the hashed environment, at least up to table sizes in the hundreds of thousands. The second place depends on the table size: the built-in match performs better up to about 10,000 elements; for larger tables fastmatch::fmatch is better.

The “best” is relative, though. Python completes the dictionary lookup in only a little more time than R needs just to loop over the keys with no lookup at all. Part of this is probably due to lower looping overhead in Python, but dicts are probably also superior data structures compared to all the R methods tested here. It remains to be seen whether R can claim back some of the speed difference when using vectorized lookups.
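A sketch of what such vectorized lookups could look like (not benchmarked here, and assuming the plain keys and values vectors from the match sections above):

v1 <- values[match(shuffledKeys, keys)]
                           # built-in match over the whole key vector at once
v2 <- values[fastmatch::fmatch(shuffledKeys, keys)]
                           # fastmatch over the whole key vector at once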