Not so long ago we noticed that our MongoDB servers were running out of disk space at an alarming rate. Because we host our database on SSD enabled servers at DigitalOcean, scaling up could cost us a lot of money.
Our data model
We receive a lot of log entries from our clients. These sometimes contain a lot of data, like params and instrumentation for very large actions. For some clients the average size of a log entry is around 400KB. You can imagine that when we store 70,000 of those, it adds up.
To make things even more complicated, we "compact" the entries depending on their age. For example, after a week we take all the entries in any one hour period and compact those into one entry. This means that we remove all the entries for that one hour period, except for the slowest.
This method saves us a lot of disk space, but the downside is that we create fragmentation in the data store, leaving gaps that may or may not be filled with new entries.
The result is that we had 14GB of data, but used almost 60GB on disk! That's a lot of overhead.
The documents have a wide spread of sizes and MongoDB has a hard time fitting entries into the gaps. This in turn bumps up the padding factor, because MongoDB adds padding after each document so it can grow without having to move the document to a new space on disk.
We realized that we have a lot of data that we don't query on and is only shown in the front-end when a single request is examined, such as parameters, environment and backtraces for errors. If we could compress that data, we keep the document size down and have a more narrow distribution of document sizes.
We started zipping the fields we don't query on by converting them to JSON, Zip them with Zlib and store them as Binary BSON. The results are astounding: a collection that was 2GB and had an average document size of 400KB was brought down to just 121MB and 20.4KB after compression.
Not only does this save us a ton of data on disk, our internal bandwidth usage has dropped significantly as well.