Anvil

Abstract

Databases have achieved orders-of-magnitude performance improvements by changing the layout of stored data -- for instance, by arranging data in columns or compressing it before storage. These improvements have been implemented in monolithic new engines, however, making it difficult to experiment with feature combinations or extensions. We present Anvil, a modular and extensible toolkit for building database back ends. Anvil's storage modules, called dTables, have much finer granularity than prior work. For example, some dTables specialize in writing data, while others provide optimized read-only formats. This specialization makes both kinds of dTable simple to write and understand. Unifying dTables implement more comprehensive functionality by layering over other dTables -- for instance, building a read/write store from read-only tables and a writable journal, or building a general-purpose store from optimized special-purpose stores. The dTable design leads to a flexible system powerful enough to implement many database storage layouts. Our prototype implementation of Anvil performs up to 5.5 times faster than an existing B-tree-based database back end on conventional workloads, and can easily be customized for further gains on specific data and workloads.

Publications

Modular Data Storage with Anvil. Mike Mammarella, Shant Hovsepian, and Eddie Kohler. SOSP 2009, October 2009. [PDF] Slides: [PDF] [SWF]

People

Mike Mammarella now works at Google.
Shant Hovsepian now works at Aster Data Systems.
Eddie Kohler was our advisor and is now an assistant professor at Harvard.

Source code

The source code for Anvil is kept in a git repository. To obtain it, use a command like:

git clone git://read.cs.ucla.edu/anvil/anvil

To compile, just run make. Anvil compiles and runs on both Linux and Mac OS X, but its transaction system only works correctly on Linux, either when using ext3 in ordered mode or when using Featherstitch (via the --with-fstitch option to configure). For the curious, more details on how the transaction system works are available. On OS X, you may need to run ./configure --with-cc=gcc-4.2 --with-cxx=g++-4.2 before make: Anvil must be compiled with GCC 4.1 or later, and the default compiler may be GCC 4.0.

Anvil compiles into a shared library, libanvil.so (or, on OS X, libanvil.dylib). A small driver program, main, using the library is also compiled. Many of the benchmarks we run in the paper can be invoked through this driver, although some require small source changes that can be found in patch files in the bench directory.

Examples of how to use Anvil from C++, covering both dTables and cTables (the column-aware wrapper layer on top of dTables), are plentiful in main_test.cpp and main_perf.cpp. Soon this page will contain more information on this topic as well.

In addition to Anvil itself, we have also made a modified version of SQLite that uses Anvil instead of (or in addition to, for debugging) its native B-tree-based back end. To obtain it, use a command like:

svn co http://read.cs.ucla.edu/sqlite-anvil/trunk/sqlite-3.6.0

Configuration

dTables are implemented as C++ classes, with a factory system to allow their instantiation based on class names given in configuration files. The behavior of each dTable can also often be tuned via configuration parameters provided by this same mechanism; for instance, the Bloom filter dTable needs to know both the name of an underlying dTable class and how to divide the hash of each key to form the Bloom filter indices.
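To make the Bloom filter parameter concrete, here is a minimal sketch of one way a key's hash could be divided into filter indices. It is not taken from the Anvil source: the 64-bit hash width, the names bloom_indices and BloomFilter, and the index_bits parameter are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not Anvil's API): split a 64-bit hash into k indices
// of index_bits bits each, taking consecutive bit fields from the low end.
std::vector<std::uint32_t> bloom_indices(std::uint64_t hash, unsigned k,
                                         unsigned index_bits) {
    if (k == 0 || index_bits == 0 || index_bits > 31 || k > 64 / index_bits)
        throw std::invalid_argument("hash too short for requested indices");
    const std::uint64_t mask = (std::uint64_t{1} << index_bits) - 1;
    std::vector<std::uint32_t> out;
    out.reserve(k);
    for (unsigned i = 0; i < k; ++i) {
        out.push_back(static_cast<std::uint32_t>(hash & mask));
        hash >>= index_bits;  // move on to the next bit field
    }
    return out;
}

// Hypothetical Bloom filter over pre-hashed keys, with 2^index_bits bits.
class BloomFilter {
public:
    BloomFilter(unsigned k, unsigned index_bits)
        : k_(k), index_bits_(index_bits) {
        bloom_indices(0, k, index_bits);  // validates the parameters
        bits_.assign(std::size_t{1} << index_bits, false);
    }
    void add(std::uint64_t hash) {
        for (std::uint32_t i : bloom_indices(hash, k_, index_bits_))
            bits_[i] = true;
    }
    // False means "definitely absent"; true means "possibly present".
    bool may_contain(std::uint64_t hash) const {
        for (std::uint32_t i : bloom_indices(hash, k_, index_bits_))
            if (!bits_[i]) return false;
        return true;
    }
private:
    unsigned k_, index_bits_;
    std::vector<bool> bits_;
};
```

With k = 5 and index_bits = 12, for example, 60 of the 64 hash bits are consumed and the filter holds 4096 bits. A wrapper dTable built this way could consult may_contain before asking the underlying dTable for a key.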

A simple configuration language is used to describe the arrangement of dTables and any parameters to be given to them. It's basically a hierarchical name-value dictionary. Here is an example:

config [
	"base" class(dt) bloom_dtable
	"base_config" config [
		"bloom_k" int 5
		"base" class(dt) simple_dtable
	]
	"digest_interval" int 2
]

(Note that the top level does not have a name.) This configuration, when provided to a managed_dtable, will cause it to use a bloom_dtable as its read-only dTable. The managed_dtable passes the nested configuration (the "base_config" part) to the bloom_dtable, which in turn instantiates a simple_dtable with an empty configuration. (Note: in the paper, simple_dtable is called "linear dTable" since "simple" is not very descriptive. Since the paper was published, we have added a different dTable named linear_dtable to the source code; it is not the paper's "linear dTable".)
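The recursive instantiation described above can be sketched with a small name-to-constructor registry. This is an illustration only, not Anvil's actual code: the Config struct, the Table interface, and the functions registry, create, create_base, register_tables, and example_config are hypothetical names, and the stand-in table classes only record how they were nested.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical in-memory form of a parsed configuration.
struct Config {
    std::map<std::string, int> ints;                         // "bloom_k" int 5
    std::map<std::string, std::string> classes;              // "base" class(dt) ...
    std::map<std::string, std::shared_ptr<Config>> configs;  // "base_config" config [...]
};

struct Table {  // stand-in for a dTable interface
    virtual ~Table() = default;
    virtual std::string describe() const = 0;
};

using Maker = std::function<std::unique_ptr<Table>(const Config&)>;
std::map<std::string, Maker>& registry() {
    static std::map<std::string, Maker> r;
    return r;
}

// Look up a class name and construct it with the given configuration.
std::unique_ptr<Table> create(const std::string& cls, const Config& cfg) {
    auto it = registry().find(cls);
    if (it == registry().end())
        throw std::invalid_argument("unknown class: " + cls);
    return it->second(cfg);
}

// Instantiate the class named by "base", handing it "base_config"
// (or an empty configuration when none is given).
std::unique_ptr<Table> create_base(const Config& cfg) {
    static const Config empty{};
    auto cls = cfg.classes.find("base");
    if (cls == cfg.classes.end())
        throw std::invalid_argument("missing \"base\"");
    auto sub = cfg.configs.find("base_config");
    bool has_sub = sub != cfg.configs.end() && sub->second;
    return create(cls->second, has_sub ? *sub->second : empty);
}

struct SimpleTable : Table {
    explicit SimpleTable(const Config&) {}
    std::string describe() const override { return "simple_dtable"; }
};

struct BloomTable : Table {
    int k;
    std::unique_ptr<Table> base;
    explicit BloomTable(const Config& c)
        : k(c.ints.at("bloom_k")), base(create_base(c)) {}
    std::string describe() const override {
        return "bloom_dtable[k=" + std::to_string(k) + "](" + base->describe() + ")";
    }
};

struct ManagedTable : Table {
    int interval;
    std::unique_ptr<Table> base;
    explicit ManagedTable(const Config& c)
        : interval(c.ints.at("digest_interval")), base(create_base(c)) {}
    std::string describe() const override {
        return "managed_dtable[digest_interval=" + std::to_string(interval) +
               "](" + base->describe() + ")";
    }
};

void register_tables() {
    registry()["simple_dtable"] = [](const Config& c) {
        return std::unique_ptr<Table>(new SimpleTable(c)); };
    registry()["bloom_dtable"] = [](const Config& c) {
        return std::unique_ptr<Table>(new BloomTable(c)); };
    registry()["managed_dtable"] = [](const Config& c) {
        return std::unique_ptr<Table>(new ManagedTable(c)); };
}

// The example configuration from the text, built by hand.
Config example_config() {
    auto bloom_cfg = std::make_shared<Config>();
    bloom_cfg->ints["bloom_k"] = 5;
    bloom_cfg->classes["base"] = "simple_dtable";
    Config top;
    top.classes["base"] = "bloom_dtable";
    top.configs["base_config"] = bloom_cfg;
    top.ints["digest_interval"] = 2;
    return top;
}
```

After register_tables(), calling create("managed_dtable", example_config()) builds the three-level stack; each level sees only its own dictionary, which is why the simple_dtable at the bottom receives an empty one.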

The configuration system supports boolean, integer, float, string, and blob basic types. It also supports nested config dictionaries and dTable/cTable class names. (Class names differ from plain strings in that the parser checks that each named class exists.)
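As a rough illustration of checking class names at parse time, the sketch below parses a single scalar entry and rejects a class name that is not in a known set. It handles only the two scalar forms that appear in the example above (int and class(dt)); the Entry struct, the parse_entry function, and the known_classes parameter are hypothetical and do not reflect Anvil's real parser.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical parsed form of one scalar entry.
struct Entry {
    std::string name;   // key, with the surrounding quotes removed
    std::string type;   // "int" or "class(dt)"
    std::string value;  // raw value text
};

// Parse one entry such as:  "bloom_k" int 5
// Class names are validated immediately, unlike plain values.
Entry parse_entry(const std::string& line,
                  const std::set<std::string>& known_classes) {
    std::istringstream in(line);
    Entry e;
    std::string quoted;
    if (!(in >> quoted >> e.type >> e.value))
        throw std::invalid_argument("expected: \"name\" type value");
    if (quoted.size() < 2 || quoted.front() != '"' || quoted.back() != '"')
        throw std::invalid_argument("name must be quoted");
    e.name = quoted.substr(1, quoted.size() - 2);

    if (e.type == "int") {
        std::size_t used = 0;
        std::stoi(e.value, &used);  // throws if the text is not numeric
        if (used != e.value.size())
            throw std::invalid_argument("bad int: " + e.value);
    } else if (e.type == "class(dt)") {
        if (known_classes.count(e.value) == 0)
            throw std::invalid_argument("unknown dTable class: " + e.value);
    } else {
        throw std::invalid_argument("unsupported type: " + e.type);
    }
    return e;
}
```

Rejecting an unknown class while parsing means a misspelled dTable name is reported when the configuration is read, rather than later when the table stack is instantiated.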

More to be posted here soon!
