When we were moving an app to Kubernetes, we encountered a peculiar situation: other services running on Kubernetes started throwing a ThreadError from time to time, saying that a resource was unavailable. We started investigating, and it quickly became clear that to find the culprit, you want to know where your AppSignal error has occurred.
A short reminder - Kubernetes works on two levels:
- A pod is the container where your application is running.
- A node is the actual machine with the OS which runs your pod and other pods. For us at WeTransfer, the node is not a physical machine but an EC2 instance.
So, you want to know which pod and which node ran a particular AppSignal transaction.
Why You Might Want This Data in AppSignal
You can use this data to localize two particular types of bugs you might encounter:
Application-affine Bugs with Sticky State
Imagine - your application starts and fills up the TMPDIR with some data. It doesn't clean up this data, and the data only gets generated when a particular code path is triggered. Some of your application pods start raising Errno::ENOSPC exceptions, but others work fine. You want to know which host this happened on so you can troubleshoot it. The same goes for memory bloat issues or memory leaks.
Bad Tenant Situation
What happened with our application was that it would start processes in its container and not terminate them (leaving zombie processes). A pod is a container, not a VM, so it shares the same userspace OS limits with other pods running on the same machine. Once our pods had gobbled up the available PIDs of the OS, other pods running on the same nodes would start throwing ThreadErrors.
You can only diagnose this by observing that the error occurs on a limited subset of nodes, and then examining which other pods are colocated on those nodes.
How to Tag Transactions with the Kubernetes Pod and Node in AppSignal
AppSignal includes the ability to set the hostname for a transaction, and then to filter on that hostname. However, autoscaled Kubernetes applications overwhelm that feature, because it wasn't made to handle a large number of hosts going online and offline rapidly. In an autoscaled application, every pod gets a pseudorandom name, and there can easily be hundreds of pods started and terminated within a short period of time. The AppSignal folks have explicitly asked us not to use the hostname value, and to set it to the same string for all of our calls.
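For instance, in the Ruby integration you could pin the hostname in your AppSignal configuration. A minimal sketch, assuming you configure AppSignal via appsignal.yml (the exact string does not matter, as long as it is the same everywhere):

```yaml
# config/appsignal.yml - pin the reported hostname to one fixed string,
# so autoscaled pods don't flood AppSignal's host list.
production:
  hostname: "kubernetes"
```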
So, what you actually want is for every transaction to be tagged with the Kubernetes pod and node. Once you have this information, your AppSignal transactions will include these custom attributes:
Crucially, you can filter on these attribute values, which helps with troubleshooting. Also, as you hover over your list of error samples, you can easily see whether your error is spread evenly across multiple pods/nodes or localized to a single node:
This can be immensely useful.
How Do You Enable Kubernetes Metadata?
There are two steps you need to take. First, you will need to pass your Kubernetes metadata through into the environment variables of the container - this is not done automatically. To make this data available, add the following to your kubernetes/base/deployment.yaml
file, for every container you run:
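Here is a minimal sketch of the relevant part of the manifest. It uses the Kubernetes Downward API; the environment variable names (K8S_POD_NAME and K8S_NODE_NAME) are just the ones we use in the examples below - pick whatever suits you:

```yaml
# Inside each container spec of kubernetes/base/deployment.yaml:
# expose pod and node metadata to the container via the Downward API.
env:
  - name: K8S_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
```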
Once you have done that, deploy your application. Don't worry - this env section won't overwrite other env entries you define in, say, your kustomization files.
Secondly, you will need to pick up this data inside the application and populate it into the AppSignal transaction. For this, you will need this bit of code:
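A sketch of what that could look like in Ruby, assuming the environment variable names from the deployment.yaml snippet above. Appsignal.tag_request attaches tags to the currently active transaction; the tag and module names here are our own choice:

```ruby
# Read the Kubernetes metadata we exposed via the Downward API and
# attach it to the current AppSignal transaction as tags.
module KubernetesAppsignalTagging
  def self.tag_current_transaction
    Appsignal.tag_request(
      kube_pod: ENV.fetch("K8S_POD_NAME", "unknown"),
      kube_node: ENV.fetch("K8S_NODE_NAME", "unknown")
    )
  end
end
```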
You need to run this code early in your background job handler or in your web request handler. For example, in a Rack middleware - which could look like so:
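A minimal sketch of such a middleware, reusing the tagging helper above (again, the names are our own - adjust them to match your setup):

```ruby
# Rack middleware that tags every web request's AppSignal transaction
# with the pod and node it is running on.
class KubernetesMetadataMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    KubernetesAppsignalTagging.tag_current_transaction
    @app.call(env)
  end
end
```

In a Rails application you could register it with config.middleware.use KubernetesMetadataMiddleware; for background jobs, call the same helper at the top of your job code.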
And boom - now you know where your workload ran!
Now that we've helped you add Kubernetes metadata to your errors in AppSignal, we hope it's all smooth sailing from here on out.