Realtime graphs from MongoDB with aggregations

In AppSignal, we process a ton of data: between 60 to 100 million entries per day. We aggregate this data with only few seconds delay. Let's see how we did this.

This data consists of performance measurements and application traces. To generate our graphs we use MongoDB's Map/Reduce feature. This turns millions of documents into only a few hundred per day.

For efficiency we "cascade" our Map/Reduce jobs. First we Map/Reduce or "bucket" data to minutely graphs. Then we reduce the minutely graphs to hourly graphs and so on. This way each Map/Reduce job has less and less documents to process.

The downside is that our jobs take a while to finish and so our graphs are delayed by a few minutes.

Making it even faster

We've decided that we wanted the graphs to be as real-time as possible so we either had to speed up the Map/Reduce jobs (which we did to an extent, but we were still a few minutes delayed). Or figure out a way to use our "bucket" data to fill the gap between the last Map/Reduce and "now".

Luckily for us MongoDb has an awesome new feature where you can use an "aggregation pipeline" to process data without running a Map/Reduce job.

On to the code

First we have to select the right data. We only want a graphs for a specific site after the last Map/Reduce run.

Ruby

selector = {
  's' => site.id,
  't' => {'$gte' => last_mapreduce_run_at }
}

We have to group this data by the minute (our highest resolution is minutely) and sum the troughput and durations

Ruby

grouper = {
  '_id' => {'s' => '$site_id', 't' => '$time'},
  'troughput' => {'$sum' => '$c'},
  'duration' => {'$sum' => '$d'}
}

Lets also make sure the data is neatly ordered by time

Ruby

sorter = {'_id.t' => 1}

Now we can aggregate the data:

Ruby

Counter.collection.aggregate([
  {'$match' => selector},
  {'$group' => grouper},
  {'$sort' => sorter}
])

With this data we can calculate total throughput for a site and average response times.

Of course we do a lot of other stuff in this aggregation like calculate queue times, percentiles/deviations etc.

We've made sure our "realtime" Class takes exactly the same in- and output. This gave us the chance to swap them out at random and see how the aggregations performed.

Downsides

While aggregations are great for real-time processing, there are a few caveats that you have to keep in mind.

There is a document size limit of 16 Megabytes, with high traffic sites we are slowly getting closer to that limit.
The code to calculate all our measurements now lives in two places, in the Map/Reduce javascript and in aggregations.
Performance is not that fast on very large datasets, at least not fast enough for web requests (+400ms).

But for the few minutes we have to calculate between the last Map/Reduce and now (usually below 5 minutes) it works very well. Our customers are now able to spot changes in their app's behaviour even faster, making our app more valuable.