Benchmarking BDB, CDB and Tokyo Cabinet on large datasets

At my job we need a high-performance hash lookup database for our antispam product. It stores Bayes tokens for quick lookups on individual scanning systems, and is read-only in the fast path (mail scanning), with updates taking place in another process. For the last few years we've been using a plain old BerkeleyDB hash database via Perl's DB_File, but with all the hype about Tokyo Cabinet and its benchmark results, I figured it was time to take a look.

The first thing I did was to run the TC benchmark suite on my system, and wow, it was fast. Just like it says on the box, 100% as-advertised. Next, I hacked up a quick script to import our Roaring Penguin Training Network data -- approximately 11M keys, as variable-length character strings, pointing to small bits of data for each. Unfortunately, TC didn't do well. After about a half hour, I was still only halfway through the data import, and each successive addition of 100k keys was taking noticeably longer as time went on. A few minutes of poking around on various forums and reading docs led me to some tuning parameters that got me down to a database creation time of about 30 minutes. Much better than before, but it still paled in comparison to the BerkeleyDB hash database creation time of about five minutes.

Some of those forum posts about tuning mentioned CDB as another alternative. CDB is a constant database (i.e., non-updateable), so it's not really comparable to TC or BDB in the general case, but it would fit our workload. So... with three databases to choose from, it's time for some comparison benchmarks.

The Setup

The benchmarking tool was written in Perl, using the DB_File, CDB_File, and TokyoCabinet modules. The data for benchmarking was the uncompressed RPTN data from April 24, comprising 11,004,950 keys, each pointing to an encoded string containing spam, ham, and generation counts. Writes were performed to a separate ext2 partition to avoid any journalling overhead.
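The original tool was Perl and isn't reproduced here. As a rough illustration of the write side of such a benchmark, here is a minimal Python sketch that times an import in fixed-size batches, which is how you'd spot the kind of per-100k slowdown described earlier. Everything in it is an assumption for illustration: the `timed_import` name, the plain dict standing in for a real database handle, and the `1:2:3` value format (the actual RPTN encoding isn't shown in this post).

```python
import time

def timed_import(store, items, batch_size=100_000):
    """Insert (key, value) pairs into `store`; return seconds spent per batch."""
    batch_times = []
    count = 0
    start = time.perf_counter()
    for key, value in items:
        store[key] = value
        count += 1
        if count % batch_size == 0:
            now = time.perf_counter()
            batch_times.append(now - start)
            start = now
    # Record the final partial batch, if any.
    if count % batch_size:
        batch_times.append(time.perf_counter() - start)
    return batch_times

# Stand-in data: hypothetical keys and a made-up "spam:ham:generation" value.
items = ((f"token{i}".encode(), b"1:2:3") for i in range(250))
store = {}  # a dict stands in for a real hash database handle
times = timed_import(store, items, batch_size=100)
for n, secs in enumerate(times, 1):
    print(f"batch {n}: {secs:.6f}s")
```

Any object that supports `store[key] = value` could be dropped in place of the dict, so the same loop can be pointed at different backends. Steadily growing per-batch times would be the symptom to watch for.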

The following tuning parameters were used on creation:

No tuning was performed for reads.

For the read test, 1024 random words were selected from /usr/share/dict/words and looked up in each database.
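The read test can be sketched along the same lines. This is a Python approximation, not the Perl code that was actually run: the function names are made up, the words are sampled without replacement (an assumption), and a small dict with invented values stands in for the real database so that the snippet runs anywhere.

```python
import random
import time

def load_words(path="/usr/share/dict/words"):
    """Read one word per line, skipping blank lines."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return [line.strip() for line in fh if line.strip()]

def pick_keys(words, n=1024, seed=None):
    """Pick up to n words at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(words, min(n, len(words)))

def timed_lookups(store, keys):
    """Look up every key; return (hits, elapsed_seconds)."""
    hits = 0
    start = time.perf_counter()
    for key in keys:
        if store.get(key) is not None:
            hits += 1
    return hits, time.perf_counter() - start

# Demo with in-memory stand-ins; for the real thing you'd call
# load_words() and pass a database handle that offers .get().
words = ["spam", "ham", "viagra", "hello", "meeting", "invoice"]
store = {"spam": "9:1:3", "ham": "1:9:3", "hello": "2:2:1"}  # made-up values
keys = pick_keys(words, n=4, seed=1)
hits, elapsed = timed_lookups(store, keys)
print(f"{hits}/{len(keys)} hits in {elapsed:.6f}s")
```

Note that dictionary words won't all be present in a token database, so a test like this exercises both hits and misses.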

The results

As can be seen from those results, CDB kills all comers in this simulation of our normal workload. Perhaps there are ways to tune Tokyo Cabinet to perform better on large data sets?

Based on this, I'll be running more realistic tests on CDB to see if we can get those sorts of numbers in production.