Post-Mortem: Why our gem disabled itself

Yesterday around 3pm CEST we had an issue with one of our API endpoints, resulting in 401 ("Unauthorized") response codes, which in turn disabled the gems of some of our customers. We're still investigating and will release an updated post-mortem once we know all the details. For now we would like to explain what happened and how sites were affected.

Issues with the API

Yesterday we rolled out a partly new architecture, which included an upgrade of our worker machines. When the AppSignal gem gathers data and sends it to our Push API, it ends up on one of our worker machines. The worker checks if the data comes from an active customer, and if so saves the data to a queue for processing. The worker machines then pull from the queue to process the data and either store it in a normalized format, or delete the data. Right after we deployed "Worker 4", our automated scripts immediately added this machine to the load balancer that distributes incoming requests across the worker machines.

This shouldn't have happened, because a new worker machine first needs to synchronize the database of active accounts. When a worker receives data from a customer that doesn't have an active account, it returns a "401 Unauthorized" status. Because the database of active accounts was empty, it returned a 401 for all apps it received data from.

Within a few minutes we knew what went wrong, removed Worker 4 from the pool of available workers, updated its customer database and re-enabled the machine in the load balancer. After this the machine functioned like normal again.

Why the gems disabled

There are some reasons why we want the gem to shut itself down. Some of our trials don't convert into paying customers, yet the associated apps keep pushing data. Sometimes a customer cancels their subscription, but doesn't remove the gem from their apps. To prevent this scenario from putting an increasing burden on our servers, as well as having an unneeded process running on the client server, the gem disables itself when it receives a 401 response from the Push API. It effectively shuts down the AppSignal gem, until the app is restarted (for instance after deploying again). If by then the customer still isn't on our list of active accounts, the gem receives the 401 and shuts itself down again, and so forth. That way we only receive a small amount of data for apps that don't actually want to use AppSignal.

Combining the two

The outcome of the above combination is pretty self explanatory: when an active customer received a 401 from Worker 4, the gem was shut down. Even though the root cause of sending the 401 responses was fixed within minutes, the gem doesn't restart itself. This turned two minutes of incorrect responses into an ongoing issue for a small portion of our clients. Clients that were affected are the ones with apps that happened to send data within those two minutes, which ended up on this specific worker machine.

Our gem gathers data for a minute and sends it to the Push API in a batch, to keep the number of requests manageable. For the affected customers their push interval was set to fire within the timeframe that returned the 401, and that group of customers was limited even more by those ending up on Worker 4. Customers who didn't push data within this timeframe, or those for which data was processed by one of the other worker machines, were unaffected.

Aftermath

In total, 22 customers experienced issues because of this incident. We sent them an email yesterday evening, after determining the cause of the issues as well as identifying the ongoing problems for some customers. That email advised them to restart their processes to make sure all data was being sent to AppSignal. By then we had already found out that some sites had redeployed as part of their development cycle, had Heroku switch hosts and restart processes, and so forth. For those the issue of not having an active AppSignal gem resolved itself, other had to take manual action.

One of our customers indicated an issue with Sidekiq background jobs not being processed, starting at the same time of the gem shutdown. As of yet we don't know if it's a causal link or coincidence, but we're keen on finding that out. We currently have engineers working on investigating this. If there turns out to be a related issue, we will of course deploy a fix as soon as possible. We're also working on ways to prevent sending out unwanted 401's again, as well as evaluating ways we can make the gem behave differently.

Post-mortem updates

To be clear: this is a not a full post-mortem. The intent of this information is to clarify the key events that triggered this issue, as well as who were affected and why. You may expect updates to this post-mortem once we have concluded our investigation and have made changes to our architecture or gem (if appropriate), so we can share our findings.

For now, if you believe you were affected and/or believe you have ongoing issues, please contact us.

Update 22-08-2014

We've concluded our investigation of this incident and have taken preliminary measures to prevent this from occuring again:

We've added multiple checks in our provisioning and load balancing systems to make sure that only servers with a full sync of the API keys can become active.
We've released version 0.10.3 of the gem that includes a fix for a potential memory leak that can occur after shutting down monitoring.

In addition to these measures we're currently testing a change in the gem that will retry posting to the API in the event of unexepected response codes. We'll release this next week, after rigorous testing.