Taming our MongoDB database size.

Not so long ago we noticed that our MongoDB servers were running out of disk space at an alarming rate. Because we host our database on SSD-enabled servers at DigitalOcean, scaling up could cost us a lot of money.

Our data model

We receive a lot of log entries from our clients. These sometimes contain a lot of data, like params and instrumentation for very large actions. For some clients the average size of a log entry is around 400KB. You can imagine that when we store 70,000 of those, it adds up.

Compacting

To make things even more complicated, we "compact" the entries depending on their age. For example, after a week we take all the entries in any one-hour period and compact them into one entry. In practice this means we remove every entry for that hour except the slowest one.
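To make the compaction rule concrete, here is a minimal sketch of the idea in Python. It is an illustration, not our production code: the `time` and `duration` field names, the `compact` function and the in-memory list of entries are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def compact(entries, now, max_age=timedelta(days=7)):
    """Return the entries that survive compaction.

    Entries newer than max_age are kept untouched. Older entries are
    grouped by the hour they fall in, and only the slowest entry of each
    hour is kept. Field names ("time", "duration") are hypothetical.
    """
    cutoff = now - max_age
    recent = []
    slowest_per_hour = {}

    for entry in entries:
        if entry["time"] > cutoff:
            # Too young to compact: keep as-is.
            recent.append(entry)
            continue

        # Truncate the timestamp to the start of its hour to get the bucket.
        hour = entry["time"].replace(minute=0, second=0, microsecond=0)
        current = slowest_per_hour.get(hour)
        if current is None or entry["duration"] > current["duration"]:
            slowest_per_hour[hour] = entry

    return recent + list(slowest_per_hour.values())
```

Every entry that is not returned would be deleted from the collection, which is where the gaps on disk come from.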

This method saves us a lot of disk space, but the downside is that we create fragmentation in the data store, leaving gaps that may or may not be filled with new entries.

The result is that we had 14GB of data, but used almost 60GB on disk! That's a lot of overhead.

The documents have a wide spread of sizes and MongoDB has a hard time fitting entries into the gaps. This in turn bumps up the padding factor, because MongoDB adds padding after each document so it can grow without having to move the document to a new space on disk.

Compressing

We realized that we have a lot of data that we never query on and that is only shown in the front-end when a single request is examined, such as parameters, environment and backtraces for errors. If we could compress that data, we could keep the document size down and get a narrower distribution of document sizes.

Zip it!

We started zipping the fields we don't query on: we convert them to JSON, compress them with Zlib and store the result as BSON binary data. The results are astounding: a collection that was 2GB with an average document size of 400KB was brought down to just 121MB and an average of 20.4KB per document after compression.
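The steps above can be sketched in a few lines of Python using only the standard library. This is a rough illustration under stated assumptions rather than our actual implementation: the field names in `COMPRESSED_FIELDS` and the helper names `compress_fields` and `decompress_fields` are made up for the example, and the compressed value is left as plain `bytes`, on the assumption that the database driver stores such values as BSON binary data.

```python
import json
import zlib

# Hypothetical names of the fields that are displayed but never queried on.
COMPRESSED_FIELDS = ("params", "environment", "backtrace")


def compress_fields(doc, fields=COMPRESSED_FIELDS):
    """Return a copy of doc with the given fields JSON-encoded and zlib-compressed."""
    out = dict(doc)
    for name in fields:
        if name in out:
            raw = json.dumps(out[name]).encode("utf-8")
            out[name] = zlib.compress(raw)  # bytes, to be stored as binary data
    return out


def decompress_fields(doc, fields=COMPRESSED_FIELDS):
    """Reverse compress_fields for display in the front-end."""
    out = dict(doc)
    for name in fields:
        if isinstance(out.get(name), bytes):
            out[name] = json.loads(zlib.decompress(out[name]).decode("utf-8"))
    return out
```

Fields that are used in queries, such as timestamps or action names, stay untouched, so indexes and filters keep working. The trade-off is that the compressed fields can no longer be searched or indexed inside the database and have to be decompressed by the application before they are shown.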