Michael Sproul

Butter from two CoWs

Written by Michael Sproul on January 8, 2023.

Over this holiday break I felt like indulging a hare-brained project I’d had in the back of my mind for a while: making a key-value database out of btrfs.

btrfs is a “next generation” filesystem for Linux that uses copy-on-write (CoW) to provide nice features like atomic snapshots and data de-duplication. Over the last few years I’ve become a copy-on-write maximalist, partly because I can’t shake the appreciation for immutability and data-sharing I picked up from functional programming. I’ve also recently been using CoW databases like LMDB and MDBX, and have been impressed by their ability to run rings around their non-CoW cousins.

Two CoWs?

LMDB is a key-value database that stores data in a single file. It maintains a B+tree in the file, with the user’s data at the leaves, and supports ACID transactions using copy-on-write! During a transaction, LMDB writes the updated data in new leaf nodes, then writes new internal nodes without mutating the existing internal nodes (the true essence of CoW). Upon reaching the root of the tree, it then needs to do a single write to switch the root of the tree to the new one. All the data under the new root node is already written and ready to access, so the whole update is atomic! If the transaction is aborted or there’s a crash like a power failure, there’s no risk of ending up with a half-mutated database: the new data is just sitting there in the file, unreferenced.
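To make the path-copying idea concrete, here is a toy persistent binary tree in Rust. This is my own deliberate simplification, not LMDB's actual on-disk B+tree: inserting a key rebuilds only the root-to-leaf path and shares every untouched subtree with the previous version.

```rust
use std::rc::Rc;

// Toy persistent (copy-on-write) binary tree: `insert` builds new nodes
// along the root-to-leaf path and shares every untouched subtree.
#[derive(Debug)]
struct Node {
    key: u32,
    left: Option<Rc<Node>>,
    right: Option<Rc<Node>>,
}

fn insert(node: Option<&Rc<Node>>, key: u32) -> Rc<Node> {
    match node {
        None => Rc::new(Node { key, left: None, right: None }),
        Some(n) if key < n.key => Rc::new(Node {
            key: n.key,
            left: Some(insert(n.left.as_ref(), key)),
            right: n.right.clone(), // shared with the old version, not copied
        }),
        Some(n) => Rc::new(Node {
            key: n.key,
            left: n.left.clone(), // shared with the old version, not copied
            right: Some(insert(n.right.as_ref(), key)),
        }),
    }
}

fn main() {
    let v1 = insert(None, 2);
    let v1 = insert(Some(&v1), 1);
    let v1 = insert(Some(&v1), 3);
    // "Committing" is just adopting the new root; the old root still works.
    let v2 = insert(Some(&v1), 4);
    // The untouched left subtree is literally the same allocation in both.
    assert!(Rc::ptr_eq(v1.left.as_ref().unwrap(), v2.left.as_ref().unwrap()));
}
```

Swapping which root you consider "current" is the single atomic step, exactly the move LMDB makes with its root page on disk.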

It took me a few months of reading around on LMDB to understand the points above, and I still wouldn’t claim to be any sort of expert. However, I did notice a suspicious similarity when I started reading about how next-gen filesystems like btrfs and ZFS work. They perform almost exactly the same kinds of copy-on-write transactions on B-trees, only the B-tree represents an entire directory of files, and the writes are made directly to the disk’s block device.

So, what happens when you combine these two copy-on-write mechanisms by running LMDB on top of btrfs? Folk wisdom says not to do it, and at the very least it seems a bit wasteful to be paying for two rounds of copy-on-write for every atomic transaction. In other words: copy-on-write doesn’t compose particularly well 😱. Composition is everything to the functional programmer, and to my heat-addled summer holiday brain this sounded like an opportunity for innovation.

Let the OS do the work

One of the lessons I learnt from LMDB is that letting the OS do the work is a good strategy for minimising database engine complexity while maximising performance. For LMDB this means memory-mapping the database file into memory and letting the OS take care of caching pages from disk. For my btrfs-as-database project it means letting btrfs take care of copy-on-write and transaction atomicity, thus eliminating the unnecessary copying involved in two layers of copy-on-write. As an added bonus we also get btrfs’s implementation of defragmentation, which is something that LMDB sometimes struggles with.

Every project needs a name. I went with butter_db, and hence the title: butter from two CoWs.

Ahh, it’s too hot today

I’m usually very paranoid about doing things the right way and prefer to only write up thoughts once they’re fully formed. For this project I decided to follow a looser and more experimental philosophy, and write some code unburdened by the rigour of ordinary software engineering. If I’ve made any mistakes in this write-up I would appreciate gentle corrections via email at <firstname>@<lastname>.xyz.

With 35°C Australian heat sizzling outside and the cricket playing in the background I set about hacking on butter_db in Rust.

First design

The btrfs primitive for atomicity is the subvolume. This is a directory hierarchy containing files and folders which can be quickly snapshotted using copy-on-write. This suggests a natural design for butter_db:

  • Store the database in a subvolume which gets snapshotted during each transaction.
  • Allow only one write transaction at a time, but allow multiple readers (like LMDB).
  • Serve read requests from the previous subvolume while the transaction is ongoing.
  • Switch the default subvolume to the new, updated subvolume when the transaction commits.

This means there are at most two subvolumes in existence at any given time, one for reads and one for writes. I decided to call these subvolumes tick and tock and just switch between them on each transaction:

$ tree -L 1 db
├── tick
└── tock
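One way to model the alternation is with a two-state enum that flips on every commit. This is my own sketch of the idea, not necessarily butter_db's actual code:

```rust
// Tick/tock alternation: at most two subvolumes exist at once, and each
// committed transaction flips which one serves reads.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Generation {
    Tick,
    Tock,
}

impl Generation {
    fn next(self) -> Self {
        match self {
            Generation::Tick => Generation::Tock,
            Generation::Tock => Generation::Tick,
        }
    }

    fn dir_name(self) -> &'static str {
        match self {
            Generation::Tick => "tick",
            Generation::Tock => "tock",
        }
    }
}

fn main() {
    let reads = Generation::Tick;
    // A write transaction snapshots `tick` into `tock` and writes there.
    let writes = reads.next();
    assert_eq!(writes.dir_name(), "tock");
    // On commit, readers move over to the freshly written subvolume.
    let reads = writes;
    assert_eq!(reads.next().dir_name(), "tick");
}
```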

Now comes the hard part of deciding how to actually store key-value data in these subvolumes. For the first prototype I figured I’d just try the simplest (and worst) thing I could imagine: one file per key-value pair, nested under directories representing “tables”:

$ tree db
├── tick
└── tock
    └── test
        ├── 00
        ├── 01
        ├── 02
        └── 03

Here we see a table called test in the tock subvolume, with keys 00 to 03. The values are serialized directly in each file as bytes.

Filesystems are fussy about the characters allowed in names, so we can’t just stuff arbitrary byte strings into the key file names. Taking the naive approach, I decided to serialize them as hex and accept the 2x increase in size.

impl Table {
    /// Path to the file for a key.
    /// Keys are encoded to ensure the path is filesystem safe.
    pub fn key_path(&self, key: &[u8]) -> PathBuf {
        let encoded_key = hex_string(key);
        // `self.path` is the table's directory within the subvolume.
        self.path.join(encoded_key)
    }
}
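The `hex_string` helper can be as simple as the following, using only the standard library (this is a stand-in of my own; the real helper may differ):

```rust
/// Hex-encode arbitrary bytes so they are safe to use as file names.
fn hex_string(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

fn main() {
    assert_eq!(hex_string(&[0x00, 0x01]), "0001");
    assert_eq!(hex_string(b"key"), "6b6579");
}
```

Hex has the nice side effect that lexicographic order of the encoded names matches byte order of the keys, which matters later for cursors.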

Rust, meet btrfs

Next I needed to figure out how to interact with btrfs from Rust. Lots of the database operations could be done with simple file operations from std::fs, but the magic of snapshots would require interfacing with btrfs’s user-land C library: libbtrfsutil from btrfs-progs.

I settled on the btrfsutil wrapper crate, which seemed to provide an ergonomic interface for subvolumes. However I quickly ran into trouble: as soon as I tried to create a snapshot I got a mysterious BTRFS_UTIL_ERROR_SEARCH_FAILED error. The description printed by the Rust wrapper was just:

Could not search B-tree

This was puzzling, particularly as the btrfs CLI had no trouble creating snapshots:

$ btrfs subvolume snapshot tick tock
Create a snapshot of 'tick' in './tock'

I figured that the CLI should match the Rust wrapper quite closely because both were built on libbtrfsutil and ultimately invoked a function called btrfs_util_create_snapshot.

After a bit of digging around the btrfs-progs C code I found that it shouldn’t even have been possible for the btrfs_util_create_snapshot function to return this error, which is where I started to get suspicious of the wrapper. I’d also been seeing bits and pieces in the wrapper docs about the importance of CAP_SYS_ADMIN:

Also, please keep in mind that many of the operations this library can perform may require elevated privileges (CAP_SYS_ADMIN).

I don’t really know anything about capabilities on Linux, and a bit of quick reading didn’t enlighten me. It also seemed that the btrfs CLI worked fine without this capability, and I thought it would be preferable for butter_db to operate without it too.

Eventually I managed to trigger the same error on the CLI while trying to show a subvolume:

$ btrfs subvolume show tick
ERROR: Could not search B-tree: Operation not permitted

Finally I realised that the first line of the Rust snapshot implementation was the real culprit: an innocuous-looking call to self.path() which invokes btrfs_util_subvolume_path:

/// Create a snapshot of this subvolume.
pub fn snapshot<T: Into<PathBuf> + Clone>(
    &self,
    path: T,
    flags: Option<SnapshotFlags>,
    qgroup: Option<QgroupInherit>,
) -> Result<Self> {
    let path_src_cstr = common::path_to_cstr(self.path()?)?;
    // ...
}

Indeed in btrfs, determining the canonical path of a subvolume requires greater privileges than creating a snapshot of it. A bit weird, but that’s the way it is.

Writing this up now I see that the problem is already fixed in btrfsutil’s main branch, and there’s a neat issue describing the problem as well. Ah well. I worked around it in my own nasty way, falling back to one of the C functions and some manual plumbing with file descriptors:

let read_file = File::open(&read_snapshot.path)?;
let parent_file = File::open(&self.root_path)?;
let name = CString::new(write_gen.as_ref()).unwrap();
// Fall back to the raw C binding, passing fds for the source and parent dir.
let res = unsafe {
    btrfs_util_create_snapshot_fd2(
        read_file.as_raw_fd(), parent_file.as_raw_fd(), name.as_ptr(),
        0, ptr::null_mut(), ptr::null_mut(),
    )
};
assert_eq!(res, btrfs_util_error_BTRFS_UTIL_OK);

With btrfs successfully wrangled my tests started to pass! I had a key-value database capable of atomic transactions in just a couple of hundred lines of Rust!

Cursors and readdir

Next on my agenda was working out cursors. It’s often useful to be able to iterate the keys of a database in order, usually lexicographic order. I had assumed that btrfs would keep its data sorted by filename and that this would be sufficient to provide ordering by key. Unfortunately this assumption was very wrong: btrfs orders its data by inode number, which seems to basically be a proxy for file creation time. My first implementation of cursors using Rust’s fs::read_dir was completely incorrect. Consulting the man page for readdir confirmed what I probably should have known from the outset:

The order in which filenames are read by successive calls to readdir() depends on the filesystem implementation; it is unlikely that the names will be sorted in any fashion.
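In other words, any ordered cursor built on read_dir has to impose its own order on the names it gets back. A sketch using only std (not butter_db's actual cursor, which went a different way as described below):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Directory entries come back in filesystem-dependent order, so a cursor
// wanting lexicographic iteration must sort the names itself.
fn sorted_keys(table_dir: &Path) -> io::Result<Vec<String>> {
    let mut names: Vec<String> = fs::read_dir(table_dir)?
        .map(|entry| Ok(entry?.file_name().to_string_lossy().into_owned()))
        .collect::<io::Result<Vec<_>>>()?;
    names.sort(); // O(n log n): the cost butter_db was trying to avoid
    Ok(names)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("butter_db_demo");
    let _ = fs::remove_dir_all(&dir); // ignore "not found"
    fs::create_dir_all(&dir)?;
    for name in ["02", "00", "01"] {
        fs::write(dir.join(name), b"value")?;
    }
    assert_eq!(sorted_keys(&dir)?, ["00", "01", "02"]);
    Ok(())
}
```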

This was the first major setback. Unlike the libbtrfsutil trouble, which was merely an implementation problem, this was a conceptual flaw in my underlying design.

I could have given up at this point or changed tack, but I decided to see what would happen if I pushed ahead anyway, maybe giving up on some performance in the process. What I really needed was a way to order the filenames when visited by the cursor. One option would be to sort them in memory, but I figured the O(n log n) time complexity would be too slow to be practical. The other option I considered was maintaining an index file in each table directory to keep track of the file ordering. Basically I wanted a text file like:

$ cat index.txt
00
01
02
03
Using a text file seemed like a really ugly hack and came with extra overhead. What I really wanted was some sort of efficient, binary, single-file… database.

This is where I decided to cheat and embed SQLite in my database implementation. All I needed was a single table with a single column:

CREATE TABLE keys (key TEXT PRIMARY KEY);

So an index.sqlite file in each table directory was the price to pay for ordered file names. The file could be bypassed on reads, but would need to be updated for each insert or delete from butter_db:

-- On insert
INSERT INTO keys (key) VALUES (?1) ON CONFLICT DO NOTHING;

-- On delete
DELETE FROM keys WHERE key = ?1;

The ON CONFLICT DO NOTHING saves updating the SQLite index in the case where a value is mutated.

Feeling like a bit of a sell-out but still enjoying my summer holidays, I watched the unit tests for cursors pass with flying colours. Thanks SQLite ;)

Easy optimisations

To minimise the impact of the embedded SQLite databases I decided to see how I could kneecap SQLite’s strong consistency guarantees in exchange for speed. The whole reason for this project was to see if we could lean on btrfs for consistency, so it would be wasteful for SQLite to be doing its own atomic transactions as well.

Thankfully the SQLite docs had advice for just this occasion:

The MEMORY journaling mode stores the rollback journal in volatile RAM. This saves disk I/O but at the expense of database safety and integrity. If the application using SQLite crashes in the middle of a transaction when the MEMORY journaling mode is set, then the database file will very likely go corrupt.

(emphasis added, from PRAGMA Statements in the SQLite docs)

For butter_db this is perfect! If the transaction fails or there’s a crash we don’t care in the slightest what happens to the abandoned snapshot of the index.sqlite file. Let it burn!
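Opting in is a single pragma at connection time. The second pragma below is my own addition along the same lines: it skips SQLite's fsync calls entirely, on the theory that btrfs snapshots provide whatever durability we need.

```sql
PRAGMA journal_mode = MEMORY;
-- Also skip fsync (my addition, not from the post's original setup).
PRAGMA synchronous = OFF;
```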

Tweaking btrfs

On the btrfs side I also wanted to minimise unnecessary overhead. By virtue of being a filesystem and not a database, btrfs stores extra metadata that we don’t really need, like file modification times. As far as I know file metadata can’t be turned off entirely, but we can at least opt out of the frequently updated access times using noatime:

$ sudo mount -o noatime /butter.img /mnt/butter

Further, by default btrfs will use a profile called DUP for metadata, resulting in two copies of each chunk of file metadata. To give butter_db a fighting chance I decided to opt out of this at filesystem creation time:

$ mkfs.btrfs --force --metadata single ~/Documents/butter.img

mkfs prints a summary of the profiles in use confirming the opt-out:

Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB

Trying a real workload

Armed with a slightly-optimised database that was passing its unit tests, I decided to take it for a spin on a real workload. When I’m not writing piles of hacks I usually work on Lighthouse, an Ethereum client that helped the network transition to proof-of-stake.

Lighthouse uses LMDB in a component called the “slasher”, which for the purposes of this post I’ll just say is a program operating on databases with thousands to millions of keys per table, and ~100GB of data in total. The slasher ingests data from the network in batches which it applies to its database in a single write transaction. It’s no coincidence that this matches butter_db’s capabilities — butter_db is modelled after LMDB and I knew from the start that I wanted to try it in the slasher.

The LMDB backend can usually process its first batch in around 30 seconds, and subsequent batches in around 1.5 seconds. Holding my breath I started up Lighthouse to see if by some miracle butter_db would be competitive…

It was not.

I waited and waited for the first batch to finish, and gave up after about 30 minutes. I added some more timing logs to check if it was making any progress. To my surprise I found it was. It was just going to take about 4.5 hours to finish its first full batch.

I let it run overnight and came back in the morning to find the log message I wanted to see!

DEBG Completed slasher update, num_blocks: 7, num_attestations: 835, time_taken: 15906405ms, epoch: 172573, service: slasher, module: slasher_service::service:157

It finished in a smooth 15.9K seconds = 4.4 hours, right around the 4.5 hour estimate.

Subsequent batches were even slower. At this point, and with my holiday coming to a close, I decided to kill it and start writing this post.

Closing remarks

I’m pretty happy with how this little project went. Yes, my database is 500x slower than the competition, but I learnt some interesting things about btrfs and SQLite along the way.

I may revisit the btrfs-as-database concept at some point with fresh eyes. I think the one-file-per-key design is likely the source of the current performance problems, and I’d like to explore ways of mitigating that. Or try a complete re-design using fewer files. It might be that the best way to leverage btrfs’s snapshots is with a single-file database like SQLite or LMDB and zero consistency/atomicity built-in. Running SQLite with journal_mode=MEMORY and a snapshot before each transaction might even be an interesting experiment.

Code for butter_db is available on GitHub, although I wouldn’t recommend using it for anything. If you are interested in chatting about btrfs, B-tree databases or anything in this post, you can reach me via the email address above.

Thanks for reading!

Tagged: btrfs, linux, databases, rust