In an earlier post I compared the time of looking up values in a 2D table indexed by different types of objects. Now let's take a step back and analyze a simple 1D value lookup. I focus on integer-integer lookup, i.e. both keys and values are integers. The type of the values is probably of little importance, at least for atomic data types, as most of the work presumably goes into finding the keys in the key table. I am also solely interested in elementwise lookups: I look up a single integer key at a time instead of a vector of keys.
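As a minimal illustration of the task (with made-up numbers, not part of the benchmarks below): given a vector of keys and a parallel vector of values, an elementwise lookup takes a single key and returns the corresponding value.
keys <- c(42L, 7L, 19L)
# the lookup table: keys...
values <- c(100L, 200L, 300L)
# ...and the corresponding values
key <- 7L
v <- values[keys == key]
# elementwise lookup of a single key: v is 200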
The tests are run on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, using a single core.
We create pairs of vectors, keys and values, and use the keys to look up entries in the values vector.
library(foreach)
library(magrittr)
library(microbenchmark)
library(data.table)
L <- 1e5
# number of look-ups
ns <- c(1e1, 3e1, 1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5, 3e5)
# lookup table sizes
We look up 10^5 values from tables of different sizes:
ns
## [1] 1e+01 3e+01 1e+02 3e+02 1e+03 3e+03 1e+04 3e+04 1e+05 3e+05
We use the microbenchmark library and report timings in milliseconds.
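For reference, here is a minimal sketch of the timing pattern used in the benchmarks below: microbenchmark records times in nanoseconds, so the mean is divided by 1e6 to obtain milliseconds (the benchmarked expression here is just a placeholder).
mb <- microbenchmark(sqrt(1:1e5), times=10L)
as.data.frame(mb) %>%
  extract2("time") %>%
  # the time column is in nanoseconds
  mean() %>%
  divide_by(1e6)
  # mean time in milliseconds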
First we do the exercise with no actual value lookup. This serves as a baseline: in particular for small lookup tables, the actual lookup time may be small compared to the surrounding loop overhead. We just loop over all the shuffled keys, which here are simply positions in the value vector, and pick the value at each position using plain positional indexing.
doNothing <- function(sKeys, keys, values) {
## find values, corresponding to keys
for(key in sKeys) {
v <- values[key]
}
}
time <- foreach(n = ns, .combine=c) %do% {
values <- sample(1000000000L, n)
keys <- sample(seq(along=values))
# keys are just the integer indices, in random order
shuffledKeys <- sample(keys, L, replace=TRUE)
microbenchmark(doNothing(shuffledKeys, keys, values),
times=10L,
control=list(warmup=10)) %>%
# find keys in random order
as.data.frame() %>%
extract2("time") %>%
mean() %>%
divide_by(1e6) %>%
set_names(n)
}
results <- data.frame(n=ns, "no lookup"=time, check.names=FALSE)
time
## 10 30 100 300 1000 3000 10000 30000
## 5.896617 5.547590 5.257778 5.609493 5.165785 5.534773 5.269728 5.620854
## 1e+05 3e+05
## 5.312086 6.114928
We see a practical lower bound of roughly 5 to 6 ms, i.e. about 50 to 60 ns of loop overhead per key. This time does not depend on the lookup table size, as we use plain integer indexing.
match
R has a built-in match function whose task is exactly this kind of lookup. We loop over the keys in randomized order and use match to find the position of the corresponding key in the key table.
doMatch <- function(sKeys, keys, values) {
## find values, corresponding to keys
for(key in sKeys) {
v <- values[match(key, keys)]
}
}
timeMatch <- foreach(n = ns, .combine=c) %do% {
values <- sample(1000000000L, n)
keys <- sample(values)
# keys are values in random order
shuffledKeys <- sample(keys, L, replace=TRUE)
microbenchmark::microbenchmark(doMatch(shuffledKeys, keys, values), times=2L,
control=list(warmup=1)) %>%
# find keys in random order
as.data.frame() %>%
extract2("time") %>% # extract time in nanoseconds
mean() %>%
divide_by(1e6) %>%
set_names(n)
}
timeMatch
## 10 30 100 300 1000 3000
## 45.73507 44.19636 54.13771 73.48927 155.65892 343.43199
## 10000 30000 1e+05 3e+05
## 1152.46473 3447.61371 12012.88516 39546.64020
The result, about 40 s for the 3 × 10^5-element table, is not impressive, although it may be acceptable for many practical applications. The time grows roughly linearly with the table size, as match scans the key vector for each individual lookup.
R has a built-in version of a lookup table in the form of named vectors. We test its speed by assigning the keys as names to the value vector, and converting each numeric key to character during the lookup with values[as.character(key)]. Note that in many applications one may also be able to pre-convert all keys to character beforehand (a variant sketched after the timings below).
We get the following timings:
doNV1 <- function(sKeys, values) {
## find values, corresponding to keys
for(key in sKeys) {
v <- values[as.character(key)]
}
}
timeNV <- foreach(n = ns, .combine=c) %do% {
values <- sample(1000000000L, n)
keys <- sample(values)
# keys are values in random order
values <- setNames(values, keys)
# create named vector keys/values
shuffledKeys <- sample(keys, L, replace=TRUE)
microbenchmark::microbenchmark(doNV1(shuffledKeys, values), times=1L) %>%
# find keys in random order
as.data.frame() %>%
extract2("time") %>%
mean() %>%
divide_by(1e6) %>%
set_names(n)
}
timeNV
## 10 30 100 300 1000 3000
## 133.6203 134.3291 148.8759 180.5306 339.4469 1125.3740
## 10000 30000 1e+05 3e+05
## 4482.4326 14657.5944 81816.0627 412083.6395
As one can clearly see, the approach is slow: 412 s is not a feasible time for lookups in a table of 3 × 10^5 elements. This is presumably because indexing an atomic vector by name amounts to a linear search over the names, on top of converting each key to character.
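As noted above, part of the cost is the repeated as.character() conversion inside the loop. Here is a minimal sketch of the pre-conversion variant, assuming all keys can be converted to character once up front (doNV2 and shuffledCharKeys are hypothetical names; this variant is not benchmarked here):
doNV2 <- function(sCharKeys, values) {
  ## find values, corresponding to keys that are already character
  for(key in sCharKeys) {
    v <- values[key]
  }
}
shuffledCharKeys <- as.character(shuffledKeys)
# convert the keys to character once, outside the timed function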
fmatch in “fastmatch”
Unlike the built-in match with its sequential lookup, fastmatch::fmatch builds the lookup hash table at the first invocation and reuses it later. This results in dramatically better times for long tables.
doFMatch <- function(sKeys, keys, values) {
## find values, corresponding to keys
for(key in sKeys) {
v <- values[fastmatch::fmatch(key, keys)]
}
}
time <- foreach(n = ns, .combine=c) %do% {
values <- sample(1000000000L, n)
keys <- sample(values)
# keys are values in random order
shuffledKeys <- sample(keys, L, replace=TRUE)
microbenchmark::microbenchmark(doFMatch(shuffledKeys, keys, values), times=2L,
control=list(warmup=1)) %>%
# find keys in random order
as.data.frame() %>%
extract2("time") %>% # extract time in nanoseconds
mean() %>%
divide_by(1e6) %>%
set_names(n)
}
results[["fmatch"]] <- time
time
## 10 30 100 300 1000 3000 10000 30000
## 907.1395 885.9555 885.1643 879.7747 856.7758 861.5872 871.0315 882.2046
## 1e+05 3e+05
## 904.9987 886.8782
The result, about 0.9 s regardless of the table size, seems to be a fair speed for most applications.
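The near-constant time reflects the cached hash table: after the first call, fmatch keeps the hash table attached to the key vector, so subsequent calls reuse it. A minimal sketch of this effect (timings are illustrative only):
library(fastmatch)
keys <- sample(1000000000L, 3e5)
system.time(fmatch(keys[1], keys))
# first call: builds the hash table for 'keys'
system.time(fmatch(keys[2], keys))
# later calls reuse the cached hash table and are essentially instant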
R environments can be used as built-in hashed lookup tables, as long as the keys can be represented as character strings. This works well for integers. We create a vector of values, assign the keys to it as names, convert the named vector to a list, and finally into a hashed environment with list2env(., hash=TRUE)
. We perform the lookup in the environment with values[[as.character(key)]]
.
doHEnv <- function(sKeys, values) {
## find values, corresponding to keys
for(key in sKeys) {
v <- values[[as.character(key)]]
}
}
time <- foreach(n = ns, .combine=c) %do% {
values <- sample(1000000000L, n)
keys <- sample(values)
# environments are indexed by character keys; the integer keys become names here
values <- setNames(values, keys) %>%
# make it into a named vector...
as.list() %>%
# ...then into a named list...
list2env(hash=TRUE)
# ...and finally into a hashed environment
shuffledKeys <- sample(keys, L, replace=TRUE)
microbenchmark::microbenchmark(doHEnv(shuffledKeys, values), times=2L,
control=list(warmup=1)) %>%
# find keys in random order
as.data.frame() %>%
extract2("time") %>% # extract time in nanoseconds
mean() %>%
divide_by(1e6) %>%
set_names(n)
}
results[["hashed env"]] <- time
time
## 10 30 100 300 1000 3000 10000
## 58.62803 58.74473 57.81125 59.13883 61.27899 63.94081 71.89576
## 30000 1e+05 3e+05
## 102.01760 189.86375 423.17053
So far, hashed environments appear to be the fastest approach. For the largest table, the time of about 0.4 s is roughly twice as fast as fmatch.
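An equivalent way to build such a lookup table is to create the environment directly with new.env(hash=TRUE) and fill it with assign(). This is only a sketch of the alternative construction with a small toy table; it is not benchmarked here.
keys <- sample(1000000000L, 1000L)
vals <- sample(1000000000L, 1000L)
# a small toy key/value table for illustration
env <- new.env(hash=TRUE, size=length(keys))
for(i in seq_along(keys)) {
  assign(as.character(keys[i]), vals[i], envir=env)
}
v <- get(as.character(keys[1]), envir=env)
# single lookup, equivalent to env[[as.character(key)]]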
The data.table library offers an alternative to data frames and includes built-in ordering by a key. We use the keyed lookup to retrieve the value from the data.table.
doDT <- function(sKeys, dtable) {
## find values, corresponding to keys
for(key in sKeys) {
v <- dtable[key, values]
}
}
time <- foreach(n = ns, .combine=c) %do% {
values <- sample(1000000000L, n)
keys <- sample(values)
dtable <- data.table(keys, values, key="keys")
# create a data.table keyed (sorted) by the keys column
shuffledKeys <- sample(keys, L, replace=TRUE)
microbenchmark::microbenchmark(doDT(shuffledKeys, dtable), times=2L,
control=list(warmup=1)) %>%
as.data.frame() %>%
extract2("time") %>% # extract time in nanoseconds
mean() %>%
divide_by(1e6) %>%
set_names(n)
}
results[["data.table"]] <- time
time
## 10 30 100 300 1000 3000 10000 30000
## 40825.88 40695.60 40739.94 40724.71 40734.58 40264.48 40459.21 40314.97
## 1e+05 3e+05
## 40334.08 40324.63
Data tables appear to be surprisingly slow, despite keyed indexing being advertised as an “incredibly fast” lookup. I guess this is because [.data.table has quite a bit of per-call overhead. If this is the case, one may expect a substantial speed improvement from vectorized lookups. Even in the current form, data tables may still be substantially faster than the corresponding lookup using data frames; however, such a test is out of scope for this post.
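For illustration, a vectorized keyed lookup performs a single join for all keys at once instead of calling [.data.table inside a loop. A minimal sketch with a small toy table (not benchmarked here):
library(data.table)
dtable <- data.table(keys=sample(1000000000L, 1000L),
                     values=sample(1000000000L, 1000L),
                     key="keys")
lookupKeys <- sample(dtable$keys, 100L, replace=TRUE)
v <- dtable[.(lookupKeys), values]
# one keyed join returns the values for all lookup keys at once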
Finally, let's leave R and use the standard dict in Python.
import time
import random
import pandas as pd
L = 100000
ns = [10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000, 300000]
t = []
for n in ns:
    values = [i for i in range(n)]
    keys = values.copy()
    random.shuffle(keys)
    table = dict(zip(values, keys))
    lk = [random.choice(keys) for _ in range(L)]
    # keys to be looked up
    t0 = time.time()
    # start the clock
    for k in lk:
        # look up each of the L random keys
        v = table[k]
    t.append((time.time() - t0)*1000)
    # in milliseconds
df = pd.DataFrame({ "n" : ns, "time" : t })
df.to_csv("python_results.csv", index=False)
print(df)
## n time
## 0 10 8.660555
## 1 30 8.450270
## 2 100 8.121252
## 3 300 8.085012
## 4 1000 8.318663
## 5 3000 8.419037
## 6 10000 8.910894
## 7 30000 9.516478
## 8 100000 14.008999
## 9 300000 24.327040
Not surprisingly, Python is roughly an order of magnitude faster than even the best R approach.
Finally, here is a plot with all approaches together.
We see that indexing by names is always inferior to match. Of the tested R methods, the clear winner is the hashed environment, at least up to table sizes in the hundreds of thousands. Second place depends on the table size: the built-in match performs better for tables of up to a few thousand elements (the crossover lies between 3,000 and 10,000 in these timings), while for larger tables fastmatch::fmatch is better.
The “best” is relative though. Python manages to complete the dictionary lookups in only a little more time than R needs just to loop over the keys with no lookup at all. Part of this is probably due to lower looping overhead in Python, but dicts are probably also a superior data structure compared to all the R methods tested here. It remains to be seen whether R can claw back some of the speed difference when using vectorized lookups.
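For reference, the vectorized counterparts of the R lookups would look roughly as follows. This is only a sketch (namedValues stands for a hypothetical named-vector version of values, and dtable for the keyed data.table from above); none of these variants are benchmarked in this post.
v1 <- values[match(shuffledKeys, keys)]
# built-in match, one call for all keys
v2 <- values[fastmatch::fmatch(shuffledKeys, keys)]
# fastmatch, one call for all keys
v3 <- namedValues[as.character(shuffledKeys)]
# named vector, vectorized character indexing
v4 <- dtable[.(shuffledKeys), values]
# data.table, one keyed join for all keys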