How do you store 1 Billion small files?

When you’re tracking 1 Billion SSL certificates, you need to store them somewhere. First of all, that’s a lot of data. It’s also growing fast, at 10 million new certificates per day. And it expires just as fast: about 10 million certificates have to be removed every day.

Let’s examine a few options.

S3

S3 and other object storage solutions are great for storing files cheaply. Or so I thought.

Some vendors bill a minimum size per file, and it can be as much as 64KB even if your files are about 1KB (DigitalOcean Spaces). Others rate-limit uploads, and the limits are often quite low (Backblaze B2 recently introduced a limit of 50 uploads per second). At 10 million new certificates per day, we need to sustain about 115 uploads per second on average, so it’s easy to see that this is not a good solution for storing our Billion SSL certificates.
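To make those two constraints concrete, here is a rough back-of-the-envelope sketch. The 64KB minimum, the 50/s limit and the 10 million per day come from the figures above; the 1 KiB certificate size is an approximation, and the helper names (perSecond, billedBytes) are made up for this illustration.

```go
package main

import "fmt"

const secondsPerDay = 24 * 60 * 60

// perSecond converts a daily count into an average per-second rate.
func perSecond(perDay float64) float64 {
	return perDay / secondsPerDay
}

// billedBytes returns the storage billed for n objects when the vendor
// rounds every object up to a minimum billable size (0 means no minimum).
func billedBytes(n, objectSize, minBillable int64) int64 {
	if objectSize < minBillable {
		objectSize = minBillable
	}
	return n * objectSize
}

func main() {
	const (
		certs       = 1_000_000_000 // total certificates stored
		certSize    = 1 << 10       // assumed ~1 KiB per certificate
		minBillable = 64 << 10      // 64 KiB minimum billed per object
		newPerDay   = 10_000_000    // new certificates per day
		uploadLimit = 50.0          // uploads per second allowed by the vendor
	)

	rate := perSecond(newPerDay)
	fmt.Printf("required upload rate: %.1f/s (limit: %.0f/s)\n", rate, uploadLimit)
	fmt.Printf("raw data:    %d GiB\n", billedBytes(certs, certSize, 0)>>30)
	fmt.Printf("billed data: %d GiB\n", billedBytes(certs, certSize, minBillable)>>30)
}
```

With a 64KB minimum, every 1KB certificate is billed as if it were 64 times larger, and the average ingest rate alone is more than double a 50/s upload limit.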

Database

1 Billion rows in a database is a lot. That’s about 1TB of raw data. Storing that much data in a database costs a lot of money, and most of it will never be read before it expires and needs to be deleted.

Filesystem

Filesystems are great for storing files. That’s what they’re designed for.

But one Billion small files? That’s a lot to manage, and a lot of metadata to store.

Most filesystems limit the number of files you can store, and allocate a minimum size per file (4KB up to 64KB).
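As a rough illustration of both limits, the sketch below assumes an ext4-style filesystem with 4KB blocks and one inode created per 16KB of capacity (the usual mke2fs default, but treat it as an assumption, since it can be changed at format time). The function names are hypothetical.

```go
package main

import "fmt"

// allocatedBytes rounds a file's size up to a whole number of blocks,
// which is how block-based filesystems typically allocate space.
func allocatedBytes(fileSize, blockSize int64) int64 {
	if fileSize == 0 {
		return 0
	}
	blocks := (fileSize + blockSize - 1) / blockSize
	return blocks * blockSize
}

// defaultInodes estimates how many inodes (and therefore files) a volume
// gets when one inode is created per bytesPerInode bytes of capacity.
func defaultInodes(volumeBytes, bytesPerInode int64) int64 {
	return volumeBytes / bytesPerInode
}

func main() {
	const (
		files         = 1_000_000_000
		fileSize      = 1 << 10  // assumed ~1 KiB per certificate
		blockSize     = 4 << 10  // assumed 4 KiB blocks
		bytesPerInode = 16 << 10 // assumed inode ratio
		volume        = 4 << 40  // a 4 TiB volume
	)

	total := files * allocatedBytes(fileSize, blockSize)
	fmt.Printf("space allocated for %d files: %d GiB\n", files, total>>30)
	fmt.Printf("inodes available on a 4 TiB volume: %d\n", defaultInodes(volume, bytesPerInode))
}
```

Under these assumptions the data takes four times its raw size on disk, and a volume large enough to hold it still has far fewer inodes than one Billion files would need unless it is formatted with a custom inode ratio.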

How we do it

We’re using an LSM-Tree database, BadgerDB. It’s a Go library, originally developed by Dgraph. It belongs to the same family as LevelDB, Google’s open-source LSM-Tree key-value store, with a design based on the WiscKey paper, which stores values separately from keys.

BadgerDB is amazingly well suited for this use case:

  • LSM-Trees are designed to optimize writes: the 100 or so new rows per second we add are nothing for BadgerDB
  • It handles expiration of data for us: each entry can carry a TTL, and expired entries are dropped by compaction during normal operation
  • It’s a lot faster than a traditional database and serves certificates in milliseconds
  • It’s trivial to perform incremental backups through rsync as data is stored in immutable files

The only drawbacks we’ve found so far:

  • Replication is not built in
  • It doesn’t scale horizontally
  • It uses quite a lot of memory, and is not easy to tune

But for our use case, it’s a perfect fit.

Conclusion

We’re using BadgerDB to store 1 Billion SSL certificates. It’s a perfect fit for our use case, and it’s a lot cheaper and more efficient than the other options we considered.
