When it comes to monitoring your Elixir application, it can be hard to make sense of the many metrics and statistics the Erlang virtual machine exposes. In this post, we'll look at the scheduler utilization metric: what it is, why it's worth monitoring, and how to monitor it.
What Is Scheduler Utilization?
In concurrent systems, scheduling is the mechanism by which work that needs to be done is assigned to the resources needed to do it. In the Erlang VM, these resources are managed by the schedulers. By default, there is one scheduler for each CPU core in your system, so tasks can be performed concurrently.
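For instance, you can check how many schedulers your node started from an `iex` session. The result below assumes an eight-core machine; yours will differ:

```elixir
# One scheduler per CPU core by default.
iex> System.schedulers_online()
8
```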
The scheduler utilization rate is a percentage representing the proportion of time each scheduler spends doing work. A low scheduler utilization rate means the scheduler is mostly idle, waiting for work to do, while a high scheduler utilization rate means the scheduler has spent most of its time working on one or more tasks.
To ensure responsiveness, the Erlang VM's schedulers "busy wait" when the application is idle. Because of this, metrics such as CPU usage, as reported by the operating system, may not accurately represent your application's actual workload. Measuring the scheduler utilization rate with the Erlang standard library gives you a metric that corresponds more closely to the work your application actually performs.
Why Does Scheduler Utilization Matter?
Under full scheduler utilization, work that needs to be done may start piling up in the schedulers' run queues. In a network application, such as a web server, this can translate into increased latency: the schedulers' time is spread thinly across many ongoing tasks, so it takes longer for them to attend to new ones, such as handling incoming requests.
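You can watch this happen by checking the run queues from `iex`. On an otherwise idle node, a check along these lines should return zero, and the value climbs as work piles up:

```elixir
# Total length of all schedulers' run queues; grows as work outpaces the schedulers.
iex> :erlang.statistics(:total_run_queue_lengths)
0
```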
To see this in action, let's start a Phoenix server from the `iex` interactive Elixir shell, then spawn many concurrent, long-running tasks from that shell to keep the schedulers busy. For this example, we'll start ten thousand processes, each counting down to zero from a million.
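A minimal sketch along those lines, run inside `iex -S mix phx.server` (the `Countdown` module here is an illustrative stand-in for any CPU-bound work):

```elixir
defmodule Countdown do
  # Recursively count down to zero: pure, CPU-bound work.
  def run(0), do: :ok
  def run(n) when n > 0, do: run(n - 1)
end

# Start 10,000 fire-and-forget processes, each counting down from a million.
# This hands the schedulers far more work than they can get through quickly.
for _ <- 1..10_000 do
  Task.start(fn -> Countdown.run(1_000_000) end)
end
```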
If we make requests against the Phoenix server shortly after starting these tasks, e.g. by reloading a page, we'll see that it can take several seconds to respond. The Erlang VM's scheduling system tries to ensure a fair distribution of execution time amongst tasks, but as the number of waiting tasks increases, it takes longer for the scheduler to get around to handling your request.
Measuring Scheduler Utilization
While these long-running tasks are ongoing, we can also see the effect they have on the scheduler utilization rate. To measure scheduler utilization over a given time span, we first collect a sample with `:scheduler.sample/0`, then pass that sample to `:scheduler.utilization/1` to obtain the utilization rate since the sample was collected.
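Something along these lines, run from the same `iex` session, does the job; the utilization values shown are illustrative and will vary with your workload:

```elixir
# Take a baseline sample, let the schedulers work for a while,
# then ask how busy they have been since that sample was taken.
sample = :scheduler.sample()
Process.sleep(1_000)
:scheduler.utilization(sample)
#=> [
#=>   {:total, 0.98, ~c"98.1%"},
#=>   {:normal, 1, 0.99, ~c"99.4%"},
#=>   ...
#=> ]
```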
We should note that this is an artificial example, in the sense that it's rare for tasks to be spawned this way. Most often, you would use `Task.async_stream` instead of calling `Task.start` in a loop, and `Task.async_stream` limits the number of concurrent tasks to the number of schedulers in your system, as the sketch below illustrates. That keeps the scheduler run queues from filling up, and therefore keeps response times from spiking. In this example, the saturation of the schedulers' run queues is what causes the latency increase, and the high scheduler utilization is what keeps those queues from emptying.
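For instance, reusing the illustrative `Countdown` module from earlier, the safer version might look like this:

```elixir
# Task.async_stream caps concurrency at System.schedulers_online() by default,
# so the run queues never build up the way they do with Task.start in a loop.
1..10_000
|> Task.async_stream(fn _ -> Countdown.run(1_000_000) end, timeout: :infinity)
|> Stream.run()
```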
How To Monitor Scheduler Utilization With AppSignal
Starting with version 2.2.8 of our Elixir integration, AppSignal automatically displays a graph for the scheduler utilization metric in its Erlang magic dashboard. All you need to do is add AppSignal to your app.
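In practice, that means adding the AppSignal package to your dependencies and running through the installer; a minimal sketch of the dependency entry (the exact packages and configuration steps depend on your setup, so follow AppSignal's installation docs):

```elixir
# mix.exs
defp deps do
  [
    {:appsignal, "~> 2.2"}
  ]
end
```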
If you want a focused view of specific metrics like these, creating a specialised dashboard is a great way to do so. You can import this dashboard to get a dedicated view of the Erlang schedulers' utilization metric. Click "Add dashboard", then "Import dashboard", and copy-paste the following dashboard configuration:
Let's see an example of how your dashboard might look in a healthy situation, with a locally running server processing around 300 requests per minute: scheduler activity hovers around 10%, the run queue remains at zero, and the server is responsive.
APM for Elixir Sprinkled With Stroopwafels
If you haven't had the chance to try AppSignal monitoring, here's what you need to know:
- All features are included in all our plans.
- We have a free trial option that doesn’t require a credit card.
- AppSignal supports Node.js, Ruby, and Elixir projects.
- We’re free for open source & for good projects.
- We ship stroopwafels to our trial users on request.
Need we say more? 🍪
Sources
Big thanks to Hamidreza Soleimani for his comprehensive article about the Erlang VM and its scheduling system, and Dan Dresselhaus for his blog post about Erlang's scheduler module.