Adding Kubernetes Metadata to Your AppSignal Errors

When we were moving an app to Kubernetes, we encountered a peculiar situation: other services running on Kubernetes started throwing a ThreadError from time to time, saying that a resource was unavailable. We started investigating, and it turned out that in a situation like this, you really want to know where your AppSignal error occurred.

A short reminder - Kubernetes works on two levels:

A pod is the container (or small group of containers) where your application is running.
A node is the actual machine with the OS which runs your pod and other pods. For us at WeTransfer, the node is not a physical machine but an EC2 instance.

So, you want to know which pod and which node ran a particular AppSignal transaction.

Why You Might Want This Data in AppSignal

You can use this data to localize two particular types of bugs you might encounter:

Application-affine Bugs with Sticky State

Imagine - your application starts and fills up the TMPDIR with some data. It doesn't clean up this data, but the data only gets generated when a particular code path is triggered. Some of your application pods start raising Errno::ENOSPC exceptions, but others work fine. To troubleshoot this, you want to know which pod and node it happened on. The same goes for memory bloat issues or memory leaks.

Bad Tenant Situation

What happened with our application was that it would start processes in its container and not terminate them (leaving zombie processes). A pod is a container, not a VM, so it shares the same userspace OS limits with other pods running on the same machine. Once its pods had gobbled up the available PIDs of the OS, other pods running on the same node would start throwing ThreadErrors.

You can only diagnose this by observing that the error occurs on a limited subset of nodes, and then examining which other pods are colocated on each of those nodes.
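To make that diagnosis concrete, here is a minimal sketch of the grouping involved. The sample data, the hash keys, and the helper name `errors_per_node` are all made up for illustration; in practice you would read the pod and node values off your own error samples.

```ruby
# Made-up error samples, for illustration only. Real values would come from
# whatever pod/node information is attached to your errors.
SAMPLES = [
  { error: "ThreadError", pod: "api-1", node: "node-a" },
  { error: "ThreadError", pod: "api-2", node: "node-a" },
  { error: "ThreadError", pod: "worker-1", node: "node-a" },
  { error: "ThreadError", pod: "api-3", node: "node-b" }
].freeze

# Counts how many error samples were recorded on each node.
# `errors_per_node` is a hypothetical helper, not an AppSignal API.
def errors_per_node(samples)
  samples.each_with_object(Hash.new(0)) do |sample, counts|
    counts[sample[:node]] += 1
  end
end

# Print the nodes from most to least affected.
errors_per_node(SAMPLES).sort_by { |_node, count| -count }.each do |node, count|
  puts "#{node}: #{count} error(s)"
end
```

If one node accounts for most of the errors, that is the node whose colocated pods are worth inspecting first.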

How to Tag Transactions with the Kubernetes Pod and Node in AppSignal

AppSignal includes the ability to set the hostname for the AppSignal transaction, and then to filter on that hostname. However, autoscaled Kubernetes applications overwhelm that feature in AppSignal, because it wasn't made to work with a large number of hosts going online and offline rapidly. In an autoscaled application, every pod gets a pseudorandom name, and there can easily be hundreds of pods started and terminated within a short period of time. The AppSignal folks have explicitly asked us not to use the hostname value and to set it to the same string for all of our calls.

So, what you actually want is for every transaction to be tagged with the Kubernetes pod and node. Once you have this information, your AppSignal transactions will include these custom attributes:

[Screenshot: an AppSignal sample showing the custom Kubernetes attributes]

Crucially, you can filter on these attribute values, which helps with troubleshooting. Also, as you hover over your list of error samples, you can easily see whether your error is spread across multiple pods/nodes or localised to a single node:

[Screenshot: the list of error samples, with pod and node values shown on hover]

This can be immensely useful.

How Do You Enable Kubernetes Metadata?

There are two steps you need to take. First, you will need to pass your Kubernetes metadata through into the environment variables of the container - this is not done automatically. To make this data available, add the following to your kubernetes/base/deployment.yaml file, for every container you run:

YAML

containers:
  - name: <name of your container>
    image: <your image>
    command: ["/bin/sh"]
    # some sections skipped for brevity
    # To see in AppSignal where our containers are running, we pass some k8s metadata through to
    # the containers. These envvars combine with the envFrom envvars.
    env:
      # AppSignal have asked us to disable the use of hostnames because they
      # are not a good fit for quickly-rotating, large server fleets
      - name: APPSIGNAL_HOSTNAME
        value: "kubernetes"
      - name: K8S_NODE_NAME
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
      - name: K8S_POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: K8S_POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
      - name: K8S_POD_IP
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
      - name: K8S_POD_SERVICE_ACCOUNT
        valueFrom:
          fieldRef:
            fieldPath: spec.serviceAccountName

Once you have done that, deploy your application. Don't worry - the env hashmap won't overwrite other hashmaps you define in, say, your kustomization files.
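After deploying, it can help to confirm that the variables actually reached the container. The following is a minimal sketch you could run inside a pod (for example from a console session). The constant `K8S_ENV_VARS` and the helper `missing_k8s_env_vars` are hypothetical names introduced here; they are not part of AppSignal or Kubernetes.

```ruby
# Names of the environment variables defined in the deployment manifest above.
K8S_ENV_VARS = %w[
  K8S_NODE_NAME
  K8S_POD_NAME
  K8S_POD_NAMESPACE
  K8S_POD_IP
  K8S_POD_SERVICE_ACCOUNT
].freeze

# Returns the names of the variables that are absent or empty in `env`.
# `env` defaults to the process environment, but any Hash-like object works,
# which keeps the helper easy to test.
def missing_k8s_env_vars(env = ENV)
  K8S_ENV_VARS.select { |name| env[name].to_s.empty? }
end

missing = missing_k8s_env_vars
if missing.empty?
  puts "All Kubernetes metadata variables are set"
else
  warn "Missing Kubernetes metadata variables: #{missing.join(', ')}"
end
```

An empty result means the manifest changes were picked up by the pod.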

Secondly, you will need to pick up this data inside the application and add it to the AppSignal transaction. For this, you will need this bit of code:

Ruby

Appsignal::Transaction.current.set_metadata("k8s_pod_name", ENV.fetch('K8S_POD_NAME', 'unknown'))
Appsignal::Transaction.current.set_metadata("k8s_node_name", ENV.fetch('K8S_NODE_NAME', 'unknown'))
Appsignal::Transaction.current.set_metadata("k8s_pod_service_account", ENV.fetch('K8S_POD_SERVICE_ACCOUNT', 'unknown'))
Appsignal::Transaction.current.set_metadata("k8s_pod_ip", ENV.fetch('K8S_POD_IP', 'unknown'))

You need to run this code early in your background job handler or in your web request handler. For example, you could do it in a Rack middleware, which could look like this:

Ruby

class AppsignalPodData < Struct.new(:app)
  def call(env)
    Appsignal::Transaction.current.set_metadata("k8s_pod_name", ENV.fetch('K8S_POD_NAME', 'unknown'))
    Appsignal::Transaction.current.set_metadata("k8s_node_name", ENV.fetch('K8S_NODE_NAME', 'unknown'))
    Appsignal::Transaction.current.set_metadata("k8s_pod_service_account", ENV.fetch('K8S_POD_SERVICE_ACCOUNT', 'unknown'))
    Appsignal::Transaction.current.set_metadata("k8s_pod_ip", ENV.fetch('K8S_POD_IP', 'unknown'))
    app.call(env)
  end
end
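Since the same four calls are needed in both web requests and background jobs, one option is to keep the mapping in a single place. Below is a minimal sketch of that idea. The module name `K8sMetadata` and its methods are hypothetical and not part of the AppSignal gem; the only AppSignal call assumed is `set_metadata(key, value)`, as used in the middleware above.

```ruby
# Hypothetical helper that centralises the metadata mapping.
module K8sMetadata
  # Metadata key => environment variable, mirroring the middleware above.
  KEYS = {
    "k8s_pod_name" => "K8S_POD_NAME",
    "k8s_node_name" => "K8S_NODE_NAME",
    "k8s_pod_service_account" => "K8S_POD_SERVICE_ACCOUNT",
    "k8s_pod_ip" => "K8S_POD_IP"
  }.freeze

  # Builds a Hash of metadata values, falling back to "unknown"
  # when a variable is not set.
  def self.from_env(env = ENV)
    KEYS.transform_values { |var| env.fetch(var, "unknown") }
  end

  # Copies the metadata onto any object that responds to
  # set_metadata(key, value), then returns that object.
  def self.apply(transaction, env = ENV)
    from_env(env).each { |key, value| transaction.set_metadata(key, value) }
    transaction
  end
end
```

With this in place, the body of the middleware's `call` method, or the first line of a background job, could shrink to `K8sMetadata.apply(Appsignal::Transaction.current)`.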