In a sister post I compare look-up times for integer-key dictionaries in R and benchmark them against a few basic Python approaches. Here I essentially repeat the same test, but instead of looking keys up one by one, I extract the values corresponding to a whole vector of keys.
The tests are done on an Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, using a single core.
I create a vector of keys and a vector of values and use the former to look up entries in the latter. I test several different data structures to store these, and several related lookup methods.
library(foreach)          # %do% looping construct
library(magrittr)         # pipe and the extract2/divide_by/set_names helpers
library(microbenchmark)   # timing
library(data.table)       # used in the data.table section below
L <- 10
# number of look-ups
B <- 10000
# key vector size for a single lookup
ns <- c(3e1, 1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5, 3e5, 1e6)
# lookup table sizes
Here I access the lookup tables only a few times (L = 10), but each access extracts a whole vector of B = 10,000 values from tables of different sizes. This goes toward the other extreme compared to the individual lookups in the sister post. The table sizes are
ns
## [1] 3e+01 1e+02 3e+02 1e+03 3e+03 1e+04 3e+04 1e+05 3e+05 1e+06
We use the microbenchmark library and report timings in milliseconds.
First we do the exercise with no actual value lookup. This serves as a baseline: in particular for small lookup tables, the actual lookup time may be small compared to the surrounding overhead. We just loop over the key vectors and pick the values by positional indexing.
doNothing <- function(sKeys, values) {
    ## find values, corresponding to keys
    for(key in sKeys) {
        v <- values[key]
    }
}
time <- foreach(n = ns, .combine=c) %do% {
    values <- sample(1000000000L, n)
    keys <- sample(seq(along=values))
    # keys are just the integer indices, in random order
    shuffledKeys <- foreach(l = seq(length=L)) %do% {
        sample(keys, B, replace=TRUE)
    }
    # a list of L index vectors of size B
    microbenchmark(doNothing(shuffledKeys, values),
                   times=10L,
                   control=list(warmup=10)) %>%
        # find keys in random order
        as.data.frame() %>%
        extract2("time") %>%
        mean() %>%
        divide_by(1e6) %>%
        set_names(n)
}
results <- data.frame(n=ns, "no lookup"=time, check.names=FALSE)
time
## 30 100 300 1000 3000 10000 30000
## 0.8176300 0.5927398 0.5904013 0.6026060 0.6177570 0.5991526 0.5849503
## 1e+05 3e+05 1e+06
## 0.5535261 0.5168898 0.7336963
We see that the baseline is about 0.73 ms for the largest table. This time does not depend on the lookup table size, as we use plain positional integer indexing.
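To make the baseline concrete, here is a tiny sketch of plain positional indexing with made-up values: the keys are simply positions into the vector, so no search of any kind is involved.
v <- c(10L, 20L, 30L, 40L)
# toy values, for illustration only
v[c(3L, 1L, 1L, 4L)]
## [1] 30 10 10 40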
match
R has a built-in match function whose task is to do exactly this kind of lookup. As it takes vector arguments, we can simply loop over the key vectors:
doMatch <- function(sKeys, keys, values) {
    ## find values, corresponding to keys
    for(key in sKeys) {
        v <- values[match(key, keys)]
    }
}
timeMatch <- foreach(n = ns, .combine=c) %do% {
    values <- sample(1000000000L, n)
    keys <- sample(values)
    # keys are values in random order
    shuffledKeys <- foreach(l = seq(length=L)) %do% {
        sample(keys, B, replace=TRUE)
    }
    # a list of L index vectors of size B
    microbenchmark::microbenchmark(doMatch(shuffledKeys, keys, values), times=2L,
                                   control=list(warmup=1)) %>%
        # find keys in random order
        as.data.frame() %>%
        extract2("time") %>% # extract time in nanoseconds
        mean() %>%
        divide_by(1e6) %>%
        set_names(n)
}
timeMatch
## 30 100 300 1000 3000 10000
## 3.324577 1.397438 1.532703 2.294491 2.001345 2.816632
## 30000 1e+05 3e+05 1e+06
## 6.785501 17.699546 53.298807 354.059743
The result, 0.354 s for the 10⁶-element table, is not impressive, but will probably still work for many practical applications.
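For reference, here is what the same match-based pattern looks like outside the benchmark harness, with a few made-up keys and values:
keys <- c(17L, 42L, 5L)
values <- c(170L, 420L, 50L)
# toy data, for illustration only
values[match(c(5L, 17L), keys)]
## [1] 50 170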
R has a built-in version of a lookup table in the form of named vectors. These also accept vectors of names to look up. We test the speed by assigning the keys as names to the value vector; during the lookup we convert the numeric keys to character with values[as.character(key)]. Note that in many applications one may also be able to pre-convert all the keys to characters.
We get the following timings:
doNV1 <- function(sKeys, values) {
    ## find values, corresponding to keys
    for(key in sKeys) {
        v <- values[as.character(key)]
    }
}
timeNV <- foreach(n = ns, .combine=c) %do% {
    values <- sample(1000000000L, n)
    keys <- sample(values)
    # keys are values in random order
    values <- setNames(values, keys)
    # create named vector keys/values
    shuffledKeys <- foreach(l = seq(length=L)) %do% {
        sample(keys, B, replace=TRUE)
    }
    # a list of L index vectors of size B
    microbenchmark::microbenchmark(doNV1(shuffledKeys, values), times=1L) %>%
        # find keys in random order
        as.data.frame() %>%
        extract2("time") %>%
        mean() %>%
        divide_by(1e6) %>%
        set_names(n)
}
resultsNV <- data.frame(n=ns, time=timeNV, method="named vectors", check.names=FALSE)
timeNV
## 30 100 300 1000 3000 10000 30000
## 28.88445 26.96172 26.57887 28.05259 28.65140 31.25083 37.86916
## 1e+05 3e+05 1e+06
## 79.37346 202.27430 986.34658
As one can clearly see, this approach is noticeably slower, but 0.986 s for the 10⁶-element table may still be acceptable in practice.
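If the application allows it, the character conversion can be done once up front instead of at every lookup; a minimal sketch of that variant with made-up values (not benchmarked above):
keys <- c(17L, 42L, 5L)
values <- setNames(c(170L, 420L, 50L), keys)
# toy data; the names are stored as characters
charKeys <- as.character(c(5L, 17L))
# pre-convert the lookup keys once
values[charKeys]
# returns the named vector 50, 170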
fmatch in "fastmatch"
Unlike the built-in match with its sequential lookup, fastmatch::fmatch builds a lookup hash table on the first call and reuses it afterwards. It is designed as a drop-in replacement for match and can give dramatically better times for long tables.
doFMatch <- function(sKeys, keys, values) {
    ## find values, corresponding to keys
    for(key in sKeys) {
        v <- values[fastmatch::fmatch(key, keys)]
    }
}
time <- foreach(n = ns, .combine=c) %do% {
    values <- sample(1000000000L, n)
    keys <- sample(values)
    # keys are values in random order
    shuffledKeys <- foreach(l = seq(length=L)) %do% {
        sample(keys, B, replace=TRUE)
    }
    # a list of L index vectors of size B
    microbenchmark::microbenchmark(doFMatch(shuffledKeys, keys, values), times=2L,
                                   control=list(warmup=1)) %>%
        # find keys in random order
        as.data.frame() %>%
        extract2("time") %>% # extract time in nanoseconds
        mean() %>%
        divide_by(1e6) %>%
        set_names(n)
}
results[["fmatch"]] <- time
time
## 30 100 300 1000 3000 10000 30000 1e+05
## 3.467176 1.077946 1.051287 1.254798 1.148966 1.320558 1.038458 1.059002
## 3e+05 1e+06
## 0.981577 1.008037
The result of 1.008 ms for the largest table, i.e. roughly 10 ns per key looked up, is rather good and suggests that fmatch is also rather cache-friendly.
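A minimal sketch of the drop-in usage with made-up data: the first fmatch() call builds the hash table and caches it as an attribute of the keys vector, so subsequent lookups against the same table skip the build step.
keys <- c(17L, 42L, 5L)
values <- c(170L, 420L, 50L)
# toy data, for illustration only
values[fastmatch::fmatch(c(42L, 5L), keys)]
# first call: builds and caches the hash table on 'keys'
## [1] 420 50
values[fastmatch::fmatch(c(42L, 5L), keys)]
# later calls: reuse the cached hash table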
For single-value lookups, hashed environments, the closest thing R offers to hash maps, provide the best performance. Unfortunately they do not support vectorized access, so we have to loop through the vector of keys. Note also that this requires the keys to be represented as a character vector; this works well for integers. We create a vector of values, assign the keys to it as names, convert the named vector to a list, and finally turn it into a hashed environment with list2env(., hash=TRUE). We perform the lookup in the environment with values[[key]], where the keys are converted to characters beforehand in a vectorized fashion.
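A minimal sketch of this construction with made-up values:
keys <- c(10L, 20L, 30L)
values <- c(101L, 202L, 303L)
# toy data, for illustration only
env <- list2env(as.list(setNames(values, keys)), hash=TRUE)
# the names become the character keys of the hashed environment
env[["20"]]
## [1] 202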
doHEnv <- function(sKeys, values) {
    ## find values, corresponding to keys
    for(bKeys in sKeys) {
        # extract the vector of bunch keys
        for(key in as.character(bKeys)) {
            v <- values[[key]]
        }
    }
}
time <- foreach(n = ns, .combine=c) %do% {
    values <- sample(1000000000L, n)
    keys <- sample(values)
    # we need character values to access the environment
    values <- setNames(values, keys) %>%
        # make it into a vector...
        as.list() %>%
        list2env(hash=TRUE)
    # and convert into a hashed environment
    shuffledKeys <- foreach(l = seq(length=L)) %do% {
        sample(keys, B, replace=TRUE)
    }
    # a list of L index vectors of size B
    microbenchmark::microbenchmark(doHEnv(shuffledKeys, values), times=2L,
                                   control=list(warmup=1)) %>%
        # find keys in random order
        as.data.frame() %>%
        extract2("time") %>% # extract time in nanoseconds
        mean() %>%
        divide_by(1e6) %>%
        set_names(n)
}
results[["hashed env"]] <- time
time
## 30 100 300 1000 3000 10000
## 38.41025 37.87887 39.26490 41.22010 45.20896 53.55510
## 30000 1e+05 3e+05 1e+06
## 79.44599 164.35403 387.01702 1214.16288
Unlike in the single-key tests, hashed environments appear slow here. For the largest table, the time of 1.214 s is about 1000× slower than fmatch, and even slower than named vectors, the definitive loser in the single-key lookup tests.
The data.table library offers an alternative to data frames and has built-in ordering by a key (index). We use the vectorized key lookup of values from the data.table:
doDT <- function(sKeys, dtable) {
    ## find values, corresponding to keys
    for(key in sKeys) {
        v <- dtable[key, values]
    }
}
time <- foreach(n = ns, .combine=c) %do% {
    values <- sample(1000000000L, n)
    keys <- sample(values)
    # keys are values in random order
    dtable <- data.table(keys, values, key="keys")
    # set the sorted key on the table
    shuffledKeys <- foreach(l = seq(length=L)) %do% {
        sample(keys, B, replace=TRUE)
    }
    # a list of L index vectors of size B
    microbenchmark::microbenchmark(doDT(shuffledKeys, dtable), times=3L,
                                   control=list(warmup=2)) %>%
        as.data.frame() %>%
        extract2("time") %>% # extract time in nanoseconds
        mean() %>%
        divide_by(1e6) %>%
        set_names(n)
}
results[["data.table"]] <- time
time
## 30 100 300 1000 3000 10000 30000
## 18.241417 8.101245 8.143146 8.210385 8.120365 8.097955 8.170750
## 1e+05 3e+05 1e+06
## 8.088662 8.162855 8.087391
Unlike in the single-key case, data tables appear quite fast here. Their access time is remarkably constant across table sizes. They clearly offer a good alternative to fmatch, at least for multi-value lookups. The huge performance gain of the vectorized lookup compared to the single-key case suggests that [.data.table carries quite a substantial per-call overhead.
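For reference, a keyed lookup on a data.table can also be written with the .(...) join form; a minimal sketch with made-up values:
dtable <- data.table(keys=c(10L, 20L, 30L), values=c(101L, 202L, 303L), key="keys")
# toy table, for illustration only
dtable[.(c(30L, 10L)), values]
# join on the key column and return the matching values
## [1] 303 101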
Finally, let's leave R and use the standard dict in Python.
Python dict
import time
import random
import pandas as pd

L = 10
B = 10000
ns = [30, 100, 300, 1000, 3000, 10000, 30000, 100000, 300000, 1000000]

td = []
for n in ns:
    values = [i for i in range(n)]
    keys = values.copy()
    random.shuffle(keys)
    table = dict(zip(values, keys))
    lk = [[random.choice(keys) for _ in range(B)]
          for _ in range(L)]
    # list of lists of keys to be looked up
    t0 = time.time()
    # start the clock
    for bKeys in lk:
        # take the bunch
        for k in bKeys:
            # take individual keys in the bunch
            v = table[k]
    td.append((time.time() - t0)*1000)
    # in milliseconds
## Now repeat with pandas
tp = []
for n in ns:
    values = [i for i in range(n)]
    keys = values.copy()
    random.shuffle(keys)
    table = pd.DataFrame({ "values" : values}, index=keys)
    lk = [[random.choice(keys) for _ in range(B)]
          for _ in range(L)]
    # list of lists of keys to be looked up
    t0 = time.time()
    # start the clock
    for bKeys in lk:
        # take the bunch
        v = table.loc[bKeys]
        # index by a vector of keys
    tp.append((time.time() - t0)*1000)

df = pd.DataFrame({ "n" : ns, "dict" : td , "pandas" : tp})
df.to_csv("python_results.csv", index=False)
print(df)
## dict n pandas
## 0 9.900331 30 56.673765
## 1 9.639502 100 54.870844
## 2 9.468794 300 55.068493
## 3 9.374380 1000 61.885595
## 4 9.456873 3000 69.737673
## 5 10.689974 10000 112.521410
## 6 11.074543 30000 144.275427
## 7 15.145540 100000 205.878973
## 8 25.300026 300000 384.438038
## 9 31.639814 1000000 964.721203
The Python dict is outperformed by several R methods here. The explicit loop needed to work around the non-vectorized lookup is probably the main culprit. Plain index-based lookup in pandas appears to do even worse.
Finally, here is a plot with all approaches together.
We see that fmatch is clearly the best performer across all table sizes. As it is very easy to use and performs well for one-element lookups too, it seems to be the best overall choice for R integer-key lookups. Data tables also offer stable and reasonably good performance. The worst performers are named vectors, probably because of their sequential search, and hashed environments, because they do not offer vectorized lookup.
Somewhat surprisingly, the basic Python methods do not perform well. Dicts, which offer by far the best performance for single-element lookups, are slow here, probably because the per-element lookup is not vectorized at the C level. And pandas, which also offers an index-based hash table, is even slower, suggesting that under the hood its index lookup is not vectorized either. As the aim of this post is to compare various R methods and benchmark them against simple Python approaches, I do not address the question of whether superior Python methods exist.