Benchmarking BDB, CDB and Tokyo Cabinet on large datasets

At my job we have need of a high-performance hash lookup database in our antispam product. It's used to store Bayes tokens for quick lookups on individual scanning systems, and is read-only in the fast path (mail scanning) with updates taking place in another process. For the last few years, we've been using a plain old BerkeleyDB hash database via Perl's DB_File, but with all the hype about Tokyo Cabinet and its benchmark results I figured it was time to take a look.

The first thing I did was to run the TC benchmark suite on my system, and wow, it was fast. Just like it says on the box, 100% as-advertised. Next, I hacked up a quick script to import our Roaring Penguin Training Network data -- approximately 11M keys, as variable-length character strings, pointing to small bits of data for each. Unfortunately, TC didn't do well. After about a half hour, I was still only halfway through the data import, and each successive addition of 100k keys was taking noticeably longer as time went on. A few minutes of poking around on various forums and reading docs led me to some tuning parameters that got me down to a database creation time of about 30 minutes. Much better than before, but it still paled in comparison to the BerkeleyDB hash database creation time of about five minutes.

Some of those forum posts about tuning mentioned CDB as another alternative. CDB is a constant database (ie: non-updateable) so it's not really comparable to TC or BDB in the general case, but for our workload it would fit. So... with three databases to choose from, it's time for some comparison benchmarks.

The Setup

The benchmarking tool was written in Perl, using DB_File, CDB_File, and the TokyoCabinet module. The data for benchmarking was the uncompressed RPTN data from April 24, comprised of 11004950 keys pointing to an encoded string containing spam, ham, and generation counts. Writes were performed to a separate ext2 partition to avoid any journalling overhead.

The following tuning parameters were used on creation:

BerkeleyDB Hash

cachesize of 96MB (what we currently use in production)

BerkeleyDB BTree

cachesize of 96MB (I was lazy and didn't want to tune it, so it's copied from the Hash testcase)

CDB
```
No tuning
```

Tokyo Cabinet Hash

xmsiz value of 750MB (necessary to make it complete in under an hour)
cache value of 12000000 (same)

No tuning was performed for reads.

For the read test, 1024 random words were selected from /usr/share/dict/words, and looked up in each hash.

The results

Creation times:

Benchmark: timing 1 iterations of bdb_btree, bdb_hash, cdb_hash, tc_hash...
 bdb_btree: 3394 wallclock secs (410.94 usr + 322.49 sys = 733.43 CPU) @  0.00/s (n=1)
   tc_hash: 2038 wallclock secs (185.55 usr + 82.70 sys = 268.25 CPU) @  0.00/s (n=1)
  bdb_hash: 289 wallclock secs (258.62 usr + 14.97 sys = 273.59 CPU) @  0.00/s (n=1)
  cdb_hash: 101 wallclock secs (91.68 usr +  5.58 sys = 97.26 CPU) @  0.01/s (n=1)
          s/iter bdb_btree  bdb_hash   tc_hash  cdb_hash
bdb_btree    733        --      -63%      -63%      -87%
bdb_hash     274      168%        --       -2%      -64%
tc_hash      268      173%        2%        --      -64%
cdb_hash    97.3      654%      181%      176%        --

File sizes:

cdb_hash  file is 535773153 bytes
tc_hash   file is 587887360 bytes
bdb_btree file is 610369536 bytes
bdb_hash  file is 671383552 bytes

Random reads:

Benchmark: timing 2000 iterations of bdb_btree, bdb_hash, cdb_hash, tc_hash...
 bdb_btree: 156 wallclock secs (38.48 usr + 101.73 sys = 140.21 CPU) @ 14.26/s (n=2000)
  bdb_hash: 125 wallclock secs (31.76 usr + 78.93 sys = 110.69 CPU) @ 18.07/s (n=2000)
   tc_hash: 37 wallclock secs (26.86 usr + 10.40 sys = 37.26 CPU) @ 53.68/s (n=2000)
  cdb_hash: 21 wallclock secs (12.06 usr +  8.78 sys = 20.84 CPU) @ 95.97/s (n=2000)
            Rate bdb_btree  bdb_hash   tc_hash  cdb_hash
bdb_btree 14.3/s        --      -21%      -73%      -85%
bdb_hash  18.1/s       27%        --      -66%      -81%
tc_hash   53.7/s      276%      197%        --      -44%
cdb_hash  96.0/s      573%      431%       79%        --

As can be seen from those results, CDB kills all comers in this simulation of our normal workload. Perhaps there are ways to tune Tokyo Cabinet to perform better on large data sets?

Based on this, I'll be running more realistic tests on CDB to see if we can get those sorts of numbers in production.