Setup

What is the most efficient way to create lookup tables in R? I test a few approaches for relatively large tables with up to 300,000 rows and 50 columns. I index the rows by numbers and the columns by dates, hence I cannot just use ordinary matrix indexing but first have to locate the correct row and column through 1D lookups. Second, I am interested in repeated look-ups of single elements in arbitrary order, so I do no full-row or full-column extractions. These are not huge tables nowadays, but as R does not have a dedicated Map data type, I was not sure what the best way is to quickly look up values from such tables. Finally, as I am interested in language efficiency, I run all the examples single-threaded.
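
To make the task concrete, here is a tiny, made-up illustration of the kind of single-cell lookup timed below: the row has to be located by its number, the column by its date, and only then can the cell be extracted.

rowNumbers <- c(501, 307, 940)                        # row keys
colDates <- as.Date(c("2011-08-01", "2011-09-01"))    # column keys
tab <- matrix(1:6, nrow=3, ncol=2)
i <- match(307, rowNumbers)                  # locate the row
j <- match(as.Date("2011-09-01"), colDates)  # locate the column
tab[i, j]
## [1] 5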

We create a number of tables that hold numeric values. Each table has 50 columns while the number of rows spans from 1k to 300k:

n <- c(1e3, 3e3, 1e4, 3e4, 1e5, 3e5)
                           # table sizes: the numbers of rows tested below
n
## [1] 1e+03 3e+03 1e+04 3e+04 1e+05 3e+05
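
The timing loops below use foreach with %do% and the magrittr pipe with set_names; section 2 additionally needs the fastmatch package. A minimal setup, assuming the packages are installed:

library(foreach)    # foreach() and the %do% operator
library(magrittr)   # the %>% pipe and set_names()
                    # fastmatch is called below as fastmatch::fmatch(), so it
                    # only needs to be installed, not attached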

The Data

Now let’s create the data. As I am not concerned with the efficiency of the data generation, I just create a large matrix here, and return the row and column values as attributes. This code is fast as it is fully vectorized.

DGP <- function(n1=120000, n2=50) {
   ## create matrix of n1 msisdns by n2 dates
   ## content is random integers
   numbers <- as.numeric(sample(1e9, n1))
   dates <- seq(from=as.Date("2011-08-01"), by="month", length=n2)
   tab <- matrix(sample(1000, n1*n2, replace=TRUE), n1, n2)
   attr(tab, "rows") <- numbers
   attr(tab, "cols") <- dates
   tab
}
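
For example, a small table generated this way (a hypothetical call, output omitted) carries its keys as attributes:

dat <- DGP(n1=1000, n2=50)   # a small example table
dim(dat)                     # 1000 rows, 50 columns
head(attr(dat, "rows"))      # row keys: random numbers
head(attr(dat, "cols"))      # column keys: monthly dates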

0. Benchmark: no lookup

To get an idea of what lookup times we can realistically expect, I first run code that does not do any lookup at all and just extracts values from the matrix by known indices.

noMatch <- function(data) {
   rows <- attr(data, "rows")
   cols <- attr(data, "cols")
   for(i in seq(along=rows)) {
      for(j in seq(length=length(cols))) {
         a <- data[i,j]
      }
   }
}

and let's time it with the six different numbers of rows:

time <- foreach(n1 = n, .combine=c) %do% {
   dat <- DGP(n1)
   system.time(noMatch(dat))["user.self"] %>%
      set_names(n1)
}
time
##  1000  3000 10000 30000 1e+05 3e+05 
## 0.032 0.088 0.292 0.872 2.960 8.820

So, just extracting the elements from such a matrix with 100k rows takes approximately 3 s (on an i7-3440 system). We can also see close-to-linear growth: a 10x larger matrix takes approximately 10 times longer to go through.

1. match by row and column values

The first real task is to try the most straightforward approach by using match(value, rowValues) to find the row and match(value, colValues) to find the column:

matrixMatch <- function(data) {
   rows <- attr(data, "rows")
   cols <- attr(data, "cols")
   for(number in sample(rows)) {
      for(date in sample(cols)) {
         i <- match(number, rows)
         j <- match(date, cols)
         a <- data[i,j]
      }
   }
}

The function extracts row and column values from the data attributes, and thereafter looks up the values from the matrix in an arbitrary order. As I am only concerned with the lookup speed itself, I discard the result.

time <- foreach(n1 = n, .combine=c) %do% {
   dat <- DGP(n1)
   system.time(matrixMatch(dat))["user.self"] %>%
      set_names(n1)
}
time
##     1000     3000    10000    30000    1e+05    3e+05 
##    0.208    1.112   10.780   99.560  939.420 5768.012

match is slow for large tables. The largest one reported here (15M entries) takes 1 hr 36 min to churn through. This is probably too slow for many realistic applications. Note that the running time grows roughly as n^2, which suggests that every match() call does work proportional to the length of the table; nothing is cached between calls.
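
As a small sketch (not part of the benchmark above), one can check the per-call cost directly: matching a single value against a 10x longer table should take roughly 10x longer.

## per-call cost of match(): the table is processed anew on every call
x1 <- as.numeric(sample(1e9, 1e4))
x2 <- as.numeric(sample(1e9, 1e5))
system.time(for(k in 1:1000) match(x1[1e4], x1))   # 10k-element table
system.time(for(k in 1:1000) match(x2[1e5], x2))   # 100k-element table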

2. fastmatch's fmatch by row and column values

Our next approach is to use the function fmatch from the fastmatch package. On the first call, fmatch builds a hash table for the lookup table and reuses it on subsequent calls. We keep the rest of the code identical to that in the previous section:

matrixFmatch <- function(data) {
   rows <- attr(data, "rows")
   cols <- attr(data, "cols")
   for(number in sample(rows)) {
      for(date in sample(cols)) {
         i <- fastmatch::fmatch(number, rows)
         j <- fastmatch::fmatch(date, cols)
         a <- data[i,j]
      }
   }
}

And the results are:

time <- foreach(n1 = n, .combine=c) %do% {
   dat <- DGP(n1)
   system.time(matrixFmatch(dat))["user.self"] %>%
      set_names(n1)
}
time
##    1000    3000   10000   30000   1e+05   3e+05 
##   0.924   2.756   9.132  27.588  91.568 270.452

We see a large improvement in access time for the larger tables; apparently the initial hash-table creation dominates the timing for tables of fewer than 10,000 rows, where plain match is still faster. As in the benchmark case, the time now grows linearly in the number of rows.
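
The caching behaviour can be seen directly. A small sketch, assuming fastmatch is installed: the first fmatch() call against a table pays for building the hash, later calls against the same table reuse it.

x <- as.numeric(sample(1e9, 3e5))
system.time(fastmatch::fmatch(x[1], x))   # first call: builds the hash table
system.time(fastmatch::fmatch(x[2], x))   # later calls: reuse the cached hash
names(attributes(x))                      # the cache is attached to x as an attribute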

3. Mixed fmatch and match

As fmatch appears to be comparatively slow for small tables, we can use plain match for the small number of columns while keeping fmatch for the rows, leaving the code otherwise as above.

matrixMixedMatch <- function(data) {
   rows <- attr(data, "rows")
   cols <- attr(data, "cols")
   for(number in sample(rows)) {
      for(date in sample(cols)) {
         i <- fastmatch::fmatch(number, rows)
         j <- match(date, cols)
         a <- data[i,j]
      }
   }
}

And the results are:

time <- foreach(n1 = n, .combine=c) %do% {
   dat <- DGP(n1)
   system.time(matrixMixedMatch(dat))["user.self"] %>%
      set_names(n1)
}
time
##    1000    3000   10000   30000   1e+05   3e+05 
##   0.568   1.584   5.280  15.964  53.088 160.288

We see a noticeable improvement again: apparently switching from fmatch to match pays off for small tables. As it also involves one less letter of typing, I can recommend this trick to everyone. Note that this change cuts the lookup time down to roughly 60% of the pure-fmatch version while preserving the linear growth.
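
Where exactly the crossover between match and fmatch lies depends on the machine. A minimal sketch for comparing the two directly on the 50-element column index, repeating the lookup many times as in the loops above:

cols <- seq(from=as.Date("2011-08-01"), by="month", length=50)
system.time(for(k in 1:1e5) match(cols[27], cols))
system.time(for(k in 1:1e5) fastmatch::fmatch(cols[27], cols))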

4. Hashed environments

So far I have created the data as matrices. But R offers other options: for instance, we can store the data in nested environments. As environment access cannot be vectorized, building such a table element by element is slow, but this post only considers lookup time, not table-creation time.

envDGP <- function(n1=120000, n2=50) {
   ## create matrix of n1 msisdns by n2 dates
   ## content is random integers
   numbers <- as.numeric(sample(1e9, n1))
   numberNames <- paste0("x", numbers)
   dates <- seq(from=as.Date("2011-08-01"), by="month", length=n2)
   dateNames <- paste0("d", format(dates, "%Y%m%d"))
   dataEnv <- new.env(hash=TRUE)
   for(id in seq(length=length(dates))) {
      dateEnv <- new.env()
      i <- sample(1000, length(numbers), replace=TRUE)
      lapply(seq(along=numbers), function(j) dateEnv[[numberNames[j] ]] <- i[j])
      dataEnv[[dateNames[id] ]] <- dateEnv
   }
   attr(dataEnv, "rows") <- numbers
   attr(dataEnv, "cols") <- dates
   dataEnv
}
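
A single cell can then be pulled out by indexing the date environment first and the number key second; a small illustration, not part of the timing:

env <- envDGP(n1=1000)
key <- paste0("x", attr(env, "rows")[1])                     # row key, e.g. "x123456789"
day <- paste0("d", format(attr(env, "cols")[1], "%Y%m%d"))   # column key, e.g. "d20110801"
env[[day]][[key]]                                            # the stored value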

We use the following extraction function:

nestedEnv <- function(data) {
   ## data: environment for nested data
   rows <- attr(data, "rows")
   cols <- attr(data, "cols")
   numberNames <- paste0("x", rows)
   dateNames <- paste0("d", format(cols, "%Y%m%d"))
   for(number in sample(numberNames)) {
      for(date in sample(dateNames)) {
         a <- data[[date]][[number]]
      }
   }
}

And the results are:

time <- foreach(n1 = n, .combine=c) %do% {
   dat <- envDGP(n1)
   system.time(nestedEnv(dat))["user.self"] %>%
      set_names(n1)
}
time
##   1000   3000  10000  30000  1e+05  3e+05 
##  0.036  0.108  0.444  1.568  9.356 69.240

Now we see a dramatic, almost 10-fold improvement over the mixed-matching approach for middle-sized tables. For large tables, however, the advantage is not as big: the lookup time grows at a pace somewhere between linear and quadratic.

For comparison: the same process in python/pandas

For comparison, I also do the corresponding calculations in python/pandas. Note that this is exactly the kind of task where pandas excels: addressing rectangular data by row and column indices. I transfer the timing results into R for plotting through a csv file.

I do not test the built-in Python list method .index, as it performs a sequential search and is hence very slow.

import pandas as pd
import numpy as np
import time
def dgp(n1, n2):
    numbers = list(np.random.choice(int(1e7), n1, replace=False))
    # use a smaller number to choose from here -- seems to be terribly slow otherwise
    dates = list(pd.date_range('2011-03-04', periods=n2))
    data = pd.DataFrame(np.random.choice(1000, size=(n1,n2)),
                        columns=dates, index=numbers)
    return data, numbers, dates
def extract(data, rows, columns):
    t0 = time.time()
    for number in np.random.choice(rows, size=len(rows), replace=False):
        for date in np.random.choice(columns, size=len(columns), replace=False):
            a = data.ix[number, date]  # .ix is removed in current pandas; .loc is the modern equivalent here
    t1 = time.time()
    return t1 - t0
n2 = 50
n1s = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5], dtype="int")
t = []
for n1 in n1s:
    (data, rows, columns) = dgp(n1, n2)
    t.append(extract(data, rows, columns))
df = pd.DataFrame({ "n" : n1s, "time" : t })
df.to_csv("python_results.csv", index=False)
print(df)
##         n        time
## 0    1000    0.426014
## 1    3000    1.259237
## 2   10000    4.183796
## 3   30000   12.623798
## 4  100000   42.382897
## 5  300000  127.194350

The results indicate that this simple pandas approach is about 20% faster than R's fastest match-based approach, although R environments are speedier still. It may be possible to squeeze even more speed out of pandas with some tuning; for instance, pandas' date_range creates full datetimes while in R I just stayed with dates.

Final plot

Finally, here is a plot with all approaches together.

[Figure (chunk plotAll): lookup times of all approaches plotted against the number of rows]
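
The plotting code is not shown here; below is a minimal sketch of how such a figure could be put together, assuming the six-element timing vectors from the sections above were kept under the hypothetical names used in the code and the pandas timings sit in python_results.csv.

## hypothetical names: each vector holds the six user.self timings reported above
pyTime <- read.csv("python_results.csv")
times <- cbind(benchmark=noMatchTime, match=matchTime, fmatch=fmatchTime,
               mixed=mixedMatchTime, environments=envTime, pandas=pyTime$time)
matplot(n, times, type="l", lty=1, col=1:6, log="xy",
        xlab="number of rows", ylab="lookup time (s)")
legend("topleft", legend=colnames(times), lty=1, col=1:6, bty="n")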

R's built-in match is clearly the slowest approach for anything like large tables, while fmatch offers a quick drop-in solution that is reasonably fast. For the sizes analyzed here, environments offer the best speed, albeit at the cost of some added complexity.