Magicians never share their secrets. But we do. Sign up for our Ruby Magic email series and receive deep insights about garbage collection, memory allocation, concurrency and much more.
Yesterday we had our first outage in years that resulted in partial data loss. To give an idea what happened we've written up the following timeline of our technical issues during this outage.
To get a better understanding of what happened we'll first describe a little bit about our architecture.
On our customers' machines, we have an agent running which aggregates performance metrics and exceptions. The agent sends a payload every 30 seconds to our Push API. This Push API is a Golang process that accepts the agent's payload, validates the API key and saves the data in Redis in a Sidekiq compatible format.
From there it gets picked up by Sidekiq, which processes the app samples and saves metric data back to another Redis queue.
Running along Sidekiq is a Rust process that fetches the metrics from the Sidekiq job result, formats them in our internal Protobuf format and sends the data to Kafka. Kafka, processes the data further to create Anomaly detection alerts, which results another Sidekiq job which stores them in our database and sends out notifications.
This combination of the Golang HTTP server, Redis database, Ruby worker and Rust processor is internally called a worker, and we run multiple machines in parallel with this stack.
Once the data is in Kafka it gets processed further by a Rust process that aggregates the metrics to minutely and hourly buckets. This is internally referred to as a processor.
The data is sent back to Kafka and finally ends up on one of our connect servers. These 'tail' certain Kafka topics and batch-write the processed data to MongoDB, ensuring efficient batch-inserts.
The outage started with an alert that one of our processors is out of memory and is writing excess memory to swap. We acknowledged the alert and restarted the Rust processor to clear the memory and swap. This sometimes happens after a network hiccup where the processor has to catch up on data from Kafka.
Another alert is opened, this time from one of the connect servers. Too much data is being batched and written to MongoDB at the same time.
More alerts are opened, this time there are large queues building up on the workers. The data processing is starting to slow down. Our Sidekiq process can't keep up with the amount of data from the Push API.
Our connect servers are having difficulties writing data to a certain MongoDB replica set, and the Kafka output topic is starting to lag behind. In the next few minutes, other workers are starting to build up queues as well, and more connect servers are reporting difficulties writing data to MongoDB.
The incident is escalated to a major outage, which comes with its own issues since half of the team is at RailsConf. They're in another timezone halfway around the world. To make matters worse they're on conference WiFi.
At this time it's still not entirely clear what the cause of the slow database writes is. A collection on the replicaset is completely locked with writes from the connect servers. Because of these writes, our workers have difficulties reading the required data from this cluster, increasing the average job duration from 137 millisecond to 1.61 seconds. The average global query time on this cluster went from 5 microseconds (µs) to 4 millisecond during this time.
The operations team start a conference call to diagnose the issue. The status page is updated as more people are involved.
The Redis servers on the workers are starting to run into swap as well. All the memory on the workers is being consumed with payloads from the Push API, this causes almost no jobs to be processed by Sidekiq.
As an added effect of our servers running out of memory, other processes are starting to crash, such as the process that interfaces Redis, Sidekiq with Kafka.
We've narrowed down the cause of the queues on the workers to jobs reading data from the affected database cluster, to populate certain values for Anomaly detection alerts. These alerts are generated in the Rust processor, sent to Kafka and written to Redis to be processed in Sidekiq.
These "alert batches" are blocking on a query that is routed to the affected replica-set. They take up all the available capacity, leaving almost no space for Sidekiq to process incoming payloads.
A config change was deployed that prevented "alert batches" from being written to Sidekiq, freeing up capacity for payloads processing.
To keep up with the incoming traffic we decided to scale up the number of workers.
We decide to shutdown Nginx on all workers servers. This prevents new payloads from arriving at our Push API. The agent that runs on our users' servers know what to do in this situation and will temporarily write the payloads to disk. They will check every once in a while if the Push API is back up and will then retry to post the queued payloads.
This way our workers aren’t constantly being bombarded with new payloads while trying to process the enormous queue already present.
We've entered a sort of whack-a-mole process where we're restarting the workers when they're starting to write to swap and become unresponsive. This has to be done until we've created enough breathing room for Sidekiq to start processing like normal and clear the backlog of queued payloads.
Because Redis becomes unresponsive the Rust process that interfaces with Kafka, Redis and Sidekiq crashes. This triggers a rule in our Sidekiq payload workers to stop writing data to Redis as to not overwhelm the service with even more metrics from payloads. This crash is where half of our data loss originated. Metrics were not emitted while the Kafka interface was down.
Another factor is that our agent only stores the locally queued payloads for a certain amount of time, to prevent us from claiming too much disk space on our customers' machines. Once enough time has passed our agent will start truncating the queued payloads.
The queues are processed and the added workers are coming online. We now have enough capacity and the queues are empty enough that we can re-enable the Push API.
As expected the agents running on the client’s servers are noticing a healthy API and start posting their backlog of payloads. This creates a huge wave of payloads and queues build up again on our system.
With the increased capacity we can handle these waves of payloads and slowly but surely the queues are starting to drop. We’re left with two problems.
We decide to scale up the replica set to the maximum configuration allowed by our hosting provider. This improves the situation somewhat. We modify the Sidekiq worker that handles the alert batches. Since the data it operates on has been lost, there’s no sense in sending out these batches. The job now returns immediately without doing any actual work. This allows us to clear the built up queue of batches, while not straining the congested MongoDB replica set even more.
The queue of alert batches is cleared and we re-enable the processing of these batches on our workers.
The congestion on the MongoDB replica set resolves, and the connect servers clear their Kafra queues. The system is back to functioning normally.
The outage started with one of our processors running out of memory, which caused a build-up of data. The build-up caused congestion writing data to MongoDB, locking the database from reads. This, in turn, caused the processing Sidekiq job to build up queues because it was unable to read data from the database.
From there it spiraled into a general slowdown of the entire stack and everything got congested.
Going forward, we'll try to depend on as few external data repositories (e.g. databases) as possible in our processing stack. This will prevent slowdowns if one of the databases gets congested.
We were quite bad in communicating during this outage. Something that annoys us when services we use are down. Our minds were too occupied with resolving the issue, moving communication to the background. In the future, we’ll call up additional team members to keep the status page properly updated. Ideally, more team members are present on the conference call between engineers, and also log important events and decisions. This will also make writing this post-mortem a lot easier.