How We Improved Table Performance in AppSignal - Back-end Changes

In our last development cycle, we spent time improving our table performance in AppSignal. As customers stay around for longer, data starts piling up. A view with just 10 items in the beginning gathers hundreds of items, and keeps growing.

Besides filtering data in the front-end to reduce the returned data, we wanted to ensure our data could keep growing without timeouts in our GraphQL API or slow-loading pages in our app.

Implementing Pagination

A while ago, we implemented pagination for our API responses. The front-end would fetch pages until there were no more results to fetch. Once all data was present, we would remove the loading indicator and show the result.

The problem was that, for apps with thousands of results, it would take many seconds for the loading indicator to disappear. Some requests even resulted in a timeout after our 30-second request limit was reached.

Back-end Improvements in AppSignal

We started by improving back-end performance. We investigated the performance of GraphQL queries with the most impact, by measuring throughput and response times.

While investigating, we found out that MongoDB has a few counter-intuitive traits when it comes to aggregating and projecting data.

Metrics

Our metrics are stored in two separate collections. The first one is for metric keys that contain the metric name, type, any custom tag the user has set and an idempotent identifier for this key.

The second collection contains documents for each minute/hour, and includes the site_id, time and a hash of metric keys and values - for example:

JSON

{
  "s": "site-1",
  "t": "10-10-2010 10:00:01 UTC",
  "v": {
      "transaction_duration;count": 10.1,
      "transaction_duration;mean": 18.2,
      "transaction_duration;90th": 35.05,
      [...]
  }
}

To generate lists/tables for this time-series data, we aggregate these documents into rows for each metric we want the data for.

Our original implementation would gather the keys required, get the identifiers for the keys, and then project each value in an aggregation step.

Our assumption was that it would be better to fetch the documents from memory once and project all the values.

After several experiments, we found out that projecting the data for 100 keys in one aggregation was a lot slower than projecting 20 keys in a loop, even though the database has to fetch the documents and project the values multiple times.

Parallellism

The query to get the keys was pretty fast most of the time, so we didn't have to improve performance there. And as we knew the number of keys beforehand, we could easily split up the metric data fetch query in batches.

With smaller batches of data, we started to run the database queries in parallel, which improved our query times by a ton.

The old code looked like this:

Ruby

Appsignal.instrument("v1_data.aggregation_list") do
  collection.aggregate(
    [
      {
        "$match" => {
          "s" => site.id,
          "t" => { "$gte" => from, "$lt" => to }
        }
      },
      { "$project" => aggregate_project },
      { "$group" => aggregate_group }
    ],
    :allow_disk_use => true
  ).try(:first) || {}
end

And executed like this:

Shell

┌────────────────────────────────────────────────────────────────────────────┐
│ ┌───────────────────────┐                                                  │
│ │       Keys query      │                                                  │
│ └───────────────────────┘                                                  │
│                         ┌───────────────────────────────────────────────┐  │
│                         │       Data query                              │  │
│                         └───────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────────┘

While the new code looks like this:

Ruby

Appsignal.instrument("v2_data.aggregation_list") do
  Parallel.flat_map(keys.in_groups_of(20, false), :in_threads => 10) do |kg|
    collection.aggregate(
      [
        {
          "$match" => {
            "s" => site.id,
            "t" => { "$gte" => from, "$lt" => to }
          }
        },
        { "$project" => aggregate_project(kg) },
        { "$group" => aggregate_group(kg) }
      ],
      :allow_disk_use => true
    ).try(:first) || {}
  end.reduce({}, :merge)
end

And executes like this:

Shell

┌─────────────────────────────────────────────────────────────────────────────┐
│ ┌───────────────────────┐                                                   │
│ │       Keys query      │                                                   │
│ └───────────────────────┘                                                   │
│                         ┌────────────┐                                      │
│                         │ Data query │                                      │
│                         └────────────┘                                      │
│                         ┌────────────┐                                      │
│                         │ Data query │                                      │
│                         └────────────┘                                      │
└─────────────────────────────────────────────────────────────────────────────┘

The result of this parallel batch setup speaks for itself:

In our next post, we'll talk about front-end improvements we made, where we took a similar approach to make pages even faster.

Behind the Scenes at AppSignal

Let us know if you'd like to read more stories like this one! We're always excited to share performance tips and tricks. We also ship stroopwafels worldwide to our users. Ping our support team if you'd like to try some 😋 🍪