When we were moving an app to Kubernetes, we encountered a peculiar situation where other services running on Kubernetes started throwing a ThreadError from time to time, saying that a resource is unavailable. We started investigating, and it turned out that to debug problems like this, you want to know where your AppSignal error has occurred.
A short reminder - Kubernetes works on two levels:
A pod is the container where your application is running.
A node is the actual machine with the OS which runs your pods. For us at WeTransfer, the node is not a physical machine but an EC2 instance.
So, you want to know which pod and which node ran a particular AppSignal transaction.
You can use this data to localize two particular types of bugs you might encounter:
Imagine - your application starts and fills up the TMPDIR with some data. It doesn’t clean up this data, but the data only gets generated when a particular code path is triggered. Some of your application pods start raising Errno::ENOSPC exceptions, but others work fine. You want to know which host this happened on to troubleshoot it. The same goes for memory bloat issues or memory leaks.
What happened with our application was that it would start processes in its container and not terminate them (leaving zombie processes). A pod is a container, not a VM, so it shares the same userspace OS limits with other pods running on the same machine. Once its pods had gobbled up the available PIDs of the OS, other pods running on the same nodes would start throwing ThreadErrors.
You can diagnose this only by observing that the error occurs on a limited subset of nodes, and then examining which other pods are colocated on each of those nodes.
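To make that diagnosis concrete, here is a minimal sketch of the reasoning: group error samples by node and check whether a few nodes account for most of them. The sample data, the pod and node names, and the `errors_by_node` helper are all hypothetical. In practice you would read this off the AppSignal UI once the transactions are tagged.

```ruby
# Hypothetical error samples, each tagged with the pod and node it ran on.
# All names here are made up for illustration.
SAMPLES = [
  { error: "ThreadError", pod: "api-7d9f-abcde",    node: "node-a" },
  { error: "ThreadError", pod: "api-7d9f-fghij",    node: "node-a" },
  { error: "ThreadError", pod: "worker-5c4b-klmno", node: "node-a" },
  { error: "ThreadError", pod: "api-7d9f-pqrst",    node: "node-b" },
].freeze

# Count samples per node, most affected node first.
def errors_by_node(samples)
  samples
    .group_by { |sample| sample[:node] }
    .map { |node, group| [node, group.size] }
    .sort_by { |_node, count| -count }
end

errors_by_node(SAMPLES).each do |node, count|
  puts "#{node}: #{count} errors"
end
```

If one node dominates the list, the next step is to look at which other pods are scheduled on it.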
AppSignal includes the ability to set the hostname for the AppSignal transaction, and then to filter on that hostname. However, autoscaled Kubernetes applications overwhelm that feature in AppSignal, because it wasn’t made to work with a large number of hosts going online and offline rapidly. In an autoscaled application every pod gets a pseudorandom name, and there can easily be hundreds of pods started and terminated within a short period of time. AppSignal folks have explicitly asked us not to use the hostname value and to set it to the same string for all of our calls.
So, what you actually want is for every transaction to be tagged with the Kubernetes pod and node. Once you have this information, your AppSignal transactions will include custom attributes such as the pod name and the node name.
Crucially, you can filter on these attribute values, which helps with troubleshooting. Also, as you hover over your list of error samples, you can easily see whether your error is spread evenly across multiple pods/nodes or is localised to a single node.
This can be immensely useful.
There are two steps you need to take. First, you will need to pass your Kubernetes metadata through into the environment variables of the container - this is not done automatically. To make this data available, add the following to your kubernetes/base/deployment.yaml file, for every container you run:
containers:
  - name: <name of your container>
    image: <your image>
    command: ["/bin/sh"]
    # some sections skipped for brevity
    # To have visibility in Appsignal where our containers are running we will pass through some k8s metadata to
    # the containers. These envvars combine with the envFrom envvars.
    env:
      # Appsignal have asked us to disable the hostname use because they
      # are not well fit for quickly-rotating, large server fleets
      - name: APPSIGNAL_HOSTNAME
        value: "kubernetes"
      - name: K8S_NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
      - name: K8S_POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: K8S_POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
      - name: K8S_POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: K8S_POD_SERVICE_ACCOUNT
        valueFrom:
          fieldRef:
            fieldPath: spec.serviceAccountName
Once you have done that, deploy your application. Don’t worry - the env list combines with, rather than replaces, the other variable sources you define, such as the envFrom mentioned in the comment above.
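After deploying, it can be worth confirming that the variables actually reached the container. The following is a minimal sketch of such a check, assuming the variable names from the deployment manifest above. The `K8S_ENV_VARS` constant and the `missing_k8s_env_vars` helper are hypothetical names, not part of AppSignal or Kubernetes.

```ruby
# Names of the variables defined in the deployment manifest.
K8S_ENV_VARS = %w[
  K8S_NODE_NAME
  K8S_POD_NAME
  K8S_POD_NAMESPACE
  K8S_POD_IP
  K8S_POD_SERVICE_ACCOUNT
].freeze

# Returns the names that are absent or empty in the given environment.
# `env` defaults to the process environment; any Hash-like object works.
def missing_k8s_env_vars(env = ENV)
  K8S_ENV_VARS.select { |name| env[name].to_s.empty? }
end

# Warn (rather than fail) so the app still boots outside Kubernetes,
# for example on a developer machine.
missing = missing_k8s_env_vars
warn "Missing Kubernetes metadata: #{missing.join(', ')}" unless missing.empty?
```

You could run this once at boot, for instance from an initializer, so that a misconfigured manifest shows up in the logs instead of as "unknown" values later on.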
Secondly, you will need to pick up this data inside the application and populate it into the AppSignal transaction. For this, you will need this bit of code:
Appsignal::Transaction.current.set_metadata("k8s_pod_name", ENV.fetch('K8S_POD_NAME', 'unknown'))
Appsignal::Transaction.current.set_metadata("k8s_node_name", ENV.fetch('K8S_NODE_NAME', 'unknown'))
Appsignal::Transaction.current.set_metadata("k8s_pod_service_account", ENV.fetch('K8S_POD_SERVICE_ACCOUNT', 'unknown'))
Appsignal::Transaction.current.set_metadata("k8s_pod_ip", ENV.fetch('K8S_POD_IP', 'unknown'))
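Since the four calls follow the same pattern, one option is to fold them into a small helper. This is a sketch rather than anything AppSignal ships: `K8S_METADATA_KEYS`, `k8s_metadata` and `tag_with_k8s_metadata` are hypothetical names, and the helper only assumes that the object it receives responds to `set_metadata(key, value)`, as the transaction does in the snippet above.

```ruby
# Mapping of AppSignal metadata keys to the environment variables
# passed through from the deployment manifest.
K8S_METADATA_KEYS = {
  "k8s_pod_name"            => "K8S_POD_NAME",
  "k8s_node_name"           => "K8S_NODE_NAME",
  "k8s_pod_service_account" => "K8S_POD_SERVICE_ACCOUNT",
  "k8s_pod_ip"              => "K8S_POD_IP",
}.freeze

# Builds the metadata hash, falling back to "unknown" for unset variables.
# `env` defaults to the process environment; a plain Hash works too.
def k8s_metadata(env = ENV)
  K8S_METADATA_KEYS.transform_values { |var| env.fetch(var, "unknown") }
end

# Applies the metadata to any object that responds to
# `set_metadata(key, value)` and returns that object.
def tag_with_k8s_metadata(transaction, metadata = k8s_metadata)
  metadata.each { |key, value| transaction.set_metadata(key, value) }
  transaction
end
```

With this in place, the call site shrinks to `tag_with_k8s_metadata(Appsignal::Transaction.current)`, and taking the transaction as an argument makes the helper easy to exercise with a test double.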
You need to run this code early in your background job handler or in your web request handler. For example, in a Rack middleware - which could look like this:
class AppsignalPodData < Struct.new(:app)
  def call(env)
    Appsignal::Transaction.current.set_metadata("k8s_pod_name", ENV.fetch('K8S_POD_NAME', 'unknown'))
    Appsignal::Transaction.current.set_metadata("k8s_node_name", ENV.fetch('K8S_NODE_NAME', 'unknown'))
    Appsignal::Transaction.current.set_metadata("k8s_pod_service_account", ENV.fetch('K8S_POD_SERVICE_ACCOUNT', 'unknown'))
    Appsignal::Transaction.current.set_metadata("k8s_pod_ip", ENV.fetch('K8S_POD_IP', 'unknown'))
    app.call(env)
  end
end
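For background jobs, the same idea can be expressed as a job middleware. The sketch below assumes Sidekiq, which the text above does not name, and follows the `call(worker, job, queue)` shape of Sidekiq server middleware. The `AppsignalPodDataJob` class is hypothetical, and the transaction lookup is injected as a callable so the class does not depend on the AppSignal gem being loaded.

```ruby
class AppsignalPodDataJob
  # AppSignal metadata keys mapped to the environment variables
  # passed through from the deployment manifest.
  METADATA = {
    "k8s_pod_name"            => "K8S_POD_NAME",
    "k8s_node_name"           => "K8S_NODE_NAME",
    "k8s_pod_service_account" => "K8S_POD_SERVICE_ACCOUNT",
    "k8s_pod_ip"              => "K8S_POD_IP",
  }.freeze

  # `current_transaction` is a callable returning the active transaction.
  # In a real app this would be `-> { Appsignal::Transaction.current }`.
  def initialize(current_transaction)
    @current_transaction = current_transaction
  end

  # Tags the transaction first, then runs the job by yielding.
  def call(_worker, _job, _queue)
    transaction = @current_transaction.call
    METADATA.each do |key, var|
      transaction.set_metadata(key, ENV.fetch(var, "unknown"))
    end
    yield
  end
end
```

In a Sidekiq setup, this would be registered in the server middleware chain with the lambda as its argument. It has to run after the AppSignal transaction for the job has been created, otherwise there is nothing to tag.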
And boom - now you know where your workload ran!
Now that we’ve helped you add Kubernetes metadata to errors in AppSignal, we hope it’s all smooth sailing from here on out.