CPU Steal Time: A Crucial Metric for Cloud Servers and VMs
Milica Maksimović and Thijs Cadier
If your application runs in a virtualized environment, there is a crucial metric you might not be aware of: CPU steal. In this post, we'll explain what CPU steal is, how to monitor it, and what happens to your app when CPU steal is high.
What Is CPU Steal?
CPU steal is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor. It indicates what share of a real core you can actually use at a given moment. This metric is valuable for anyone who runs their infrastructure in the cloud, for example on AWS or DigitalOcean.
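To make the definition concrete, here is a minimal sketch of how a steal percentage can be derived on Linux from two snapshots of the aggregate cpu line in /proc/stat. The function names are ours, chosen for illustration, and the sketch assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); monitoring agents may calculate the value differently.

```python
def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line, which is the first line of /proc/stat."""
    with open(path) as stat_file:
        return stat_file.readline()


def parse_cpu_line(line):
    """Return (total_ticks, steal_ticks) from an aggregate 'cpu' line.

    Assumed field order: user nice system idle iowait irq softirq steal.
    The trailing guest fields are skipped because they are already
    counted inside user and nice.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(value) for value in fields[1:9]]
    if len(values) < 8:
        raise ValueError("this kernel does not report a steal field")
    return sum(values), values[7]


def steal_percent(before, after):
    """Percentage of CPU time stolen between two snapshots of the cpu line."""
    total_before, steal_before = parse_cpu_line(before)
    total_after, steal_after = parse_cpu_line(after)
    delta_total = total_after - total_before
    if delta_total <= 0:
        return 0.0  # no time elapsed between the snapshots
    return 100.0 * (steal_after - steal_before) / delta_total
```

On a Linux VM, you could call read_cpu_line() twice, a few seconds apart, and pass both lines to steal_percent() to get the steal percentage for that interval.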
CPU Steal in a Real-life Example
Let's say you have a server running on DigitalOcean. For the sake of this example, your server runs on a physical machine that hosts 5 VMs. All of those VMs share the same resources. Each VM's memory usage is capped at 20% of the host's memory, but CPU cycles aren't capped. This means that one VM can use more CPU than the others.
If you SSH into your server and run the top command, the CPU steal time will be marked as st.
top - 13:26:09 up 22:38, 1 user, load average: 1.90, 1.92, 1.60
Tasks: 373 total, 1 running, 181 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.8 us, 0.3 sy, 0.0 ni, 94.5 id, 0.4 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13186532+total, 810704 free, 52079712 used, 78974904 buff/cache
KiB Swap: 7811068 total, 7748604 free, 62464 used. 78530328 avail Mem
Let's examine the CPU line here. The last value, st, is the share of time stolen from this VM by the hypervisor: the time the virtual CPU spent waiting while the hypervisor attended to another VM.
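If you want to pull the st value out of that CPU line in a script, a rough sketch could look like the one below. The helper name parse_top_cpu_line is hypothetical, and the sketch assumes the line has the same layout as the sample output above, with a dot as the decimal separator. Other versions of top, or other locales, may print the line differently.

```python
import re


def parse_top_cpu_line(line):
    """Map each CPU state label from top (us, sy, ni, id, wa, hi, si, st)
    to its percentage.

    Assumes a line shaped like:
    %Cpu(s): 4.8 us, 0.3 sy, 0.0 ni, 94.5 id, 0.4 wa, 0.0 hi, 0.0 si, 0.0 st
    """
    if not line.startswith("%Cpu(s):"):
        raise ValueError("not a top CPU summary line")
    # Each entry is a number followed by a two-letter state label.
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})", line)
    return {label: float(value) for value, label in pairs}


sample = "%Cpu(s): 4.8 us, 0.3 sy, 0.0 ni, 94.5 id, 0.4 wa, 0.0 hi, 0.0 si, 0.0 st"
states = parse_top_cpu_line(sample)
print(states["st"])  # steal percentage for this sample
```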
If steal time is low, the other VMs on the host aren't busy, but this is rarely the case with web applications that serve users all over the globe. If a VM you share the host with runs a long background task, that VM is likely to get more than its fair share of CPU cycles for a while. Afterwards, the other VMs will jump the queue and slow that task down.
These tasks hurt web applications the most - they can cause performance issues and even lead to outages.
What to Do about CPU Steal Time
Let's run through a few examples.
CPU Steal Time Is Low
If your CPU steal time is lower than 10%, you have nothing to worry about. Your application should run smoothly.
CPU Steal Time Is Relatively High
Your CPU steal time is well over 10% for ~30 minutes.
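As a rough illustration of this rule of thumb, the sketch below checks whether a series of steal samples stayed above 10% for a continuous stretch of about 30 minutes. The function name, the sample format, and the exact defaults are assumptions made for this example, not something a particular tool provides.

```python
def sustained_high_steal(samples, threshold=10.0, window_seconds=1800):
    """Return True if steal stayed above `threshold` percent for a
    continuous stretch of at least `window_seconds`.

    `samples` is a list of (timestamp_in_seconds, steal_percent) tuples,
    sorted by timestamp.
    """
    run_start = None  # timestamp at which the current high-steal run began
    for timestamp, steal in samples:
        if steal > threshold:
            if run_start is None:
                run_start = timestamp
            if timestamp - run_start >= window_seconds:
                return True
        else:
            run_start = None  # a sample at or below the threshold ends the run
    return False


# One sample per minute for 30 minutes, all at 15% steal.
high_samples = [(minute * 60, 15.0) for minute in range(31)]
print(sustained_high_steal(high_samples))
```

A single sample at or below the threshold resets the run here. If your samples are noisy, averaging over the window instead may suit you better.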
Check what type of virtual server you're running. How fast is the processor, and how much RAM and disk space do you have? There is a possibility that your app needs more CPU resources overall. In that case, you are causing the problem.
However, what if you're already spending a lot of money on the right servers for your web application? In this case, you have every right to complain to your server provider. They are likely putting too many VMs on each physical machine, and your noisy neighbors are competing with you for resources.
Beware: there's a catch. You won't know which situation you're in unless you have the full host overview. To test whether you're the cause of the issue, you should run multiple identical VMs that perform the same tasks on several different hosts. If all VMs have issues, you're the one causing them, but if only one VM complains, that means your provider is the one to blame.
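The comparison above can be sketched as a small helper. It assumes you have collected one steal percentage per identical VM, each running on a different host, and that steal above 10% counts as "having issues". The function name and the returned labels are hypothetical. Treating any subset of affected VMs (rather than exactly one) as pointing at the provider is our own generalization.

```python
def diagnose_steal(steal_by_vm, threshold=10.0):
    """Guess where high CPU steal comes from.

    `steal_by_vm` maps a VM name to its steal percentage; each VM is
    assumed to run the same workload on a different physical host.
    """
    if len(steal_by_vm) < 2:
        raise ValueError("need at least two VMs on different hosts to compare")
    affected = [name for name, steal in steal_by_vm.items() if steal > threshold]
    if not affected:
        return "healthy"
    if len(affected) == len(steal_by_vm):
        return "workload"  # every VM struggles: your app likely needs more CPU
    return "provider"  # only some VMs struggle: suspect a crowded host


print(diagnose_steal({"vm-a": 22.0, "vm-b": 1.5, "vm-c": 0.8}))
```

The result is a hint, not proof. It is only as good as how closely the VMs and their workloads really match.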
👋 If you like this article, we've written a lot more about Ruby (on Rails) performance. Check out our Ruby performance monitoring checklist.
How to Monitor CPU Steal with AppSignal
AppSignal displays CPU steal in the host metrics screen for every host by default for Ruby, Node.js, and Elixir applications. In addition, it can be useful to set up a dashboard of just the steal metric if you have infrastructure that's likely to be affected.
All you need to do is add AppSignal to your app, and if you're already using us, upgrade to the latest version of our integration or the standalone agent.
Creating a custom dashboard is a great way to get a focused view on a specific metric like this. I've created a dashboard you can import. Click "Add dashboard", then "Import dashboard" and copy-paste the following dashboard configuration:
{
  "title": "CPU Steal",
  "description": "",
  "visuals": [
    {
      "title": "CPU Steal",
      "line_label": "%hostname%",
      "display": "LINE",
      "format": "percent",
      "draw_null_as_zero": true,
      "metrics": [
        {
          "name": "cpu",
          "fields": [
            { "field": "GAUGE" }
          ],
          "tags": [
            { "key": "host_metric", "value": "*" },
            { "key": "hostname", "value": "*" },
            { "key": "state", "value": "steal" }
          ]
        }
      ],
      "type": "timeseries"
    }
  ]
}
Let's see some examples of how your dashboard might look.
Low CPU Steal with Some Outliers
This is an example of a healthy situation. Most servers are below 1%. Sometimes a neighbor is noisy, and you see some outliers.
Consistently Higher CPU Steal
This is an example of a situation where the CPU is oversubscribed. A big chunk of the hosts have very high steal percentages.
Contact your provider's support department to ask if they can look at the capacity allocated to your servers in this case.
What Did We Learn?
If you host in a virtualized environment and see performance issues you can't put your finger on, take a look at the CPU steal metric.
Monitor Your Hosts and Get Stroopwafels
If you haven't had the chance to try AppSignal monitoring, here's what you need to know:
- Host monitoring is included alongside all of our features.
- We have a free trial option that doesn’t require a credit card.
- AppSignal supports Node.js, Ruby, and Elixir projects.
- We’re free for open source & for good projects.
- We ship stroopwafels to our trial users on request.
Need we say more? 🍪