Magicians never share their secrets. But we do. Sign up for our Ruby Magic email series and receive deep insights about garbage collection, memory allocation, concurrency and much more.
As an ever growing all-in-one APM, we spend a lot of time on making sure AppSignal can cope with our increase in traffic. Usually, we don’t talk about how we do that; our blog is full of articles about great things under the hood of Ruby or doing crazy things with Elixir, but not about what makes AppSignal tick.
This time however, we’d like to share some of the bigger changes in our stack we’ve made over the past few years, so we can (easily) process the double-digit billions of requests sent our way every month. In real-time. So today we use our scaling experience to discuss our own stack and help you that way.
AppSignal started out as a pretty standard Rails setup. We used a Rails app that collected data through an API endpoint which created Sidekiq jobs to process in the background.
After a while we replaced the Rails API with a Rack middleware to gain a bit of speed and later this was replaced with a Go web server that pushed Sidekiq compatible jobs to Redis.
While this setup worked well for a long time, we began to run into issues where the databases couldn’t keep up with the amount of queries run against them. At this point we were processing tens of billions of requests already. The main reason for this was that each Sidekiq process needed to get the entire app’s state from the database in order to increment the correct counters and update the right documents.
We could alleviate this somewhat with local caching of data, but because of the round-robin nature of our setup it still meant that each server needed to have a full cache of all data, because we couldn’t be sure on what server the payload would end up. We realised that with the data growth we were experiencing this setup would become impossible in the future.
In search for a better way to handle the data we settled on using Kafka as the data processing pipeline. Instead of aggregating metrics in the database, we now aggregate the metrics in Kafka processors. Our goal is that our Kafka pipeline never queries the database until the aggregated data has to be flushed. This drives the amount of queries per payload down from up to ten reads and writes to just one write at the end of the pipeline.
We specify a key for each Kafka message and Kafka guarantees that the same keys end up on the same partition, that’s consumed by the same server. We use the app’s ID as a key for messages, this means that instead of having a cache for all customers on the server, we only have to cache data for the apps a server receives from Kafka, not all apps.
Kafka is a great system and we’ve migrated over in the past two years. Right now almost all processing is done in Rust through Kafka, but there are still things that are easier done in Ruby, such as sending Notifications and other database-heavy tasks. This meant that we needed some way to get data from Kafka to our Rails stack.
When we began this transition there were a couple Kafka Ruby gems, but none worked with the latest (at the time 0.10.x) release of Kafka and most were unmaintained.
We looked at writing our own gem (which we eventually did). We will write more about that in a different article. But having a nice driver is only part of the requirements. We also needed a system to consume the data and execute the tasks in Ruby and spawn new workers when old ones crash.
Eventually we came up with a different solution. Our Kafka stack is built in Rust and we wrote a small binary that consumes a
sidekiq_out topic and creates Sidekiq compatible jobs in Redis. This way we could deploy this binary on our worker machines and it would feed new jobs into Sidekiq just as you would do within Rails itself.
The binary has a few options such as limiting the amount of data in Redis to stop consuming the Kafka topic until the threshold is cleared. This way all the data from Kafka won’t end up in Redis’ memory on the workers if there is a backlog.
From Ruby’s point of view, there is no difference at all between jobs generated in Rails and those that come from Kafka. It allows us to prototype new workers that get data from Kafka and process it in Rails–to send notifications and update the database–without having to know anything about Kafka.
It made the migration to Kafka easier as we could switch over to Kafka and back without having to deploy new Ruby code. It also made testing super easy as you could easily generate jobs in the test suite to be consumed by Ruby without having to setup an entire Kafka stack locally.
We use Protobuf to define all our (internal) messages, this way we can be pretty sure that if the test passes, the worker will correctly process jobs from Kafka.
In the end this solution saved us a lot of time and energy and made life a lot simpler for our Ruby team.
As with everything there are a few pros and cons for this setup:
Using Sidekiq helped us tremendously while migrating our processing pipeline to Kafka. We’ve now almost completely moved away from Sidekiq and are handling everything via our Kafka driver directly, but that’s for another article.
This happy ending wraps up the love story. We hope you enjoyed this perspective on performance and scaling, and our experience scaling AppSignal. We hope this story on the decisions we made around our stack, in turn, help you.
Check out the rest of the blog or follow us to keep an eye out on when the next episode about our Kafka setup is published. And if you end up looking for an all-in-one APM that is truly by developers for developers, come find us.