In recent years, cloud computing has become an integral part of the IT landscape. Many enterprises rely on cloud services, for example because they don’t want to buy and maintain hardware themselves. Cloud services also make it easy to scale applications with the load the system is experiencing, which makes them attractive for both small and large businesses. However, while the performance of the application is often monitored with modern APM tools, the performance of the infrastructure itself is rarely scrutinized. We recently had a case at our company which shows that it is sometimes a good idea to check whether you actually receive the capacity you pay for.
At RETIT, we have been using an array of virtual machines provided by a large German hosting provider. These machines regularly ran system tests to ensure that our products work as expected. During those tests, our tools predicted a significantly lower CPU utilization than we observed on the actual systems. This made us wonder: are our tools wrong, or is the system not performing as it should?
After examining both our tools and the application for quite some time without finding any issues, we started wondering whether the infrastructure itself could be the root of the problem. While searching all the available data for any clue as to what was going on, we found the crucial hint in the file /proc/stat. This virtual file is generated by the Linux kernel and summarizes the time each CPU core has spent in different CPU phases (idle, user, system, etc.). In the following, we briefly describe what we did to find the problem.
We conducted a very simple experiment: we read the contents of /proc/stat, waited for 60 seconds, and then read the file’s contents again. This can be done with a single line in the shell:
cat /proc/stat && sleep 60 && cat /proc/stat
From the output, we were specifically interested in the two lines that start with “cpu” followed by a space (one from each read), which contain the execution summary aggregated over all the CPU cores in the system. On one machine, this output was as follows:
cpu 5120564 45299 1163028 110999319 47115 0 0 0
cpu 5120570 45300 1163042 111001702 47115 0 0 0
The first line contains data from the first read of /proc/stat, while the second line contains data recorded 60 seconds later. The numbers are the number of “jiffies” spent in each CPU phase since boot. A jiffy is an internal time unit of the kernel, and on most Linux systems /proc/stat reports 100 of them per second (the value can be checked with getconf CLK_TCK). Each column corresponds to a specific CPU phase. By subtracting the first line from the second, we could calculate how much time the CPU spent in each phase during the experiment. Consequently, by summing up all the individual phase times, we could tell how much time the CPU had available overall:
Sample | user | nice | system | idle | iowait | irq | softirq | steal | Sum |
t1 | 5120564 | 45299 | 1163028 | 110999319 | 47115 | 0 | 0 | 0 | 117375325 |
t2 | 5120570 | 45300 | 1163042 | 111001702 | 47115 | 0 | 0 | 0 | 117377729 |
Δt | 6 | 1 | 14 | 2383 | 0 | 0 | 0 | 0 | 2404 |
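The subtraction in the table can be reproduced with a few lines of shell. The following is only a minimal sketch: cpu_deltas is a hypothetical helper name, and it assumes the eight-column layout shown above (user through steal). Newer kernels append further guest columns, which this sketch deliberately ignores.

```shell
#!/usr/bin/env bash
# cpu_deltas (hypothetical helper): takes two aggregate "cpu" lines from
# /proc/stat and prints the per-column differences followed by their sum.
# Assumption: only the first eight value columns (user ... steal) are used.
cpu_deltas() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # First sample: remember the eight counters (fields 2 to 9).
        NR == 1 { for (i = 2; i <= 9; i++) prev[i] = $i }
        # Second sample: print each difference, then the total.
        NR == 2 {
            sum = 0
            for (i = 2; i <= 9; i++) {
                d = $i - prev[i]
                sum += d
                printf "%d ", d
            }
            printf "%d\n", sum
        }'
}
```

Called with the two lines from our machine as arguments, it prints the Δt row of the table, with the sum as the last value.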
So, in the 60 seconds of our little experiment, the CPU accumulated a total of 2,404 jiffies. At 100 jiffies per second, this corresponds to about 24 seconds of CPU time, clearly a lot less than 60 seconds. Things are really put into perspective when considering that this system had two virtual CPU cores, so in theory it should have had 120 seconds of CPU time available in a time span of 60 seconds. This means that 96 seconds of CPU time were missing; in other words, we were only receiving 20% of the CPU time we paid for!
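The arithmetic above can be wrapped into a small script that performs the whole measurement. This is a sketch under stated assumptions, not the exact procedure we used: cpu_share_percent and measure_cpu_share are hypothetical names, getconf CLK_TCK is assumed to return the tick rate used by /proc/stat, and nproc is assumed to return the number of virtual cores of the machine.

```shell
#!/usr/bin/env bash
# cpu_share_percent (hypothetical helper): percentage of the theoretically
# available CPU time that was actually accounted for.
# Arguments: delta_jiffies ticks_per_second cores interval_seconds
cpu_share_percent() {
    awk -v d="$1" -v hz="$2" -v cores="$3" -v secs="$4" \
        'BEGIN { printf "%.1f\n", 100 * d / (hz * cores * secs) }'
}

# measure_cpu_share (hypothetical helper): samples /proc/stat twice and
# reports the received share. Argument: interval in seconds (default 60).
measure_cpu_share() {
    local secs="${1:-60}" before after delta
    before=$(grep '^cpu ' /proc/stat)
    sleep "$secs"
    after=$(grep '^cpu ' /proc/stat)
    # Sum of the differences of the first eight counters (user ... steal).
    delta=$(printf '%s\n%s\n' "$before" "$after" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) sum -= $i }
        NR == 2 { for (i = 2; i <= 9; i++) sum += $i }
        END     { printf "%d\n", sum }')
    cpu_share_percent "$delta" "$(getconf CLK_TCK)" "$(nproc)" "$secs"
}
```

With the numbers from our machine, cpu_share_percent 2404 100 2 60 yields the 20% mentioned above; on a healthy system, measure_cpu_share 60 should report a value close to 100.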
While it is normal that virtual machines don’t always receive all the CPU time they demand due to contention for physical resources, contention does not explain what we saw here, since no time at all was accounted to the “steal” phase. The numbers also show that the system was almost completely idle during the test; the result stayed the same when we put the system under stress. It seemed as if our provider had simply put a cap on the available CPU time of our virtual machine, regardless of whether it consumed excessive amounts of CPU time or not. We observed this behavior on all our machines: each of them received only 20% of its rightful share of CPU time.
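To tell ordinary contention apart from the kind of cap described above, it can help to look at the steal column separately. The following sketch uses a hypothetical helper name, steal_percent, and again assumes the eight-column layout with steal as the eighth value. It reports how much of the accounted CPU time was steal time.

```shell
#!/usr/bin/env bash
# steal_percent (hypothetical helper): share of the accounted jiffies
# between two aggregate "cpu" lines that fell into the steal column.
# Assumption: steal is the eighth counter, i.e. awk field 9.
steal_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) t1 += $i; s1 = $9 }
        NR == 2 {
            for (i = 2; i <= 9; i++) t2 += $i
            total = t2 - t1
            if (total > 0) printf "%.1f\n", 100 * ($9 - s1) / total
            else print "n/a"
        }'
}
```

For the two lines from our machine it reports 0.0, which matches the observation that the missing time did not show up as steal time.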
When we confronted our provider with these numbers, they relocated our VMs to a different host. This did in fact improve the situation: we repeated the experiment and subsequently observed values of around 70% of the expected CPU time. While this is a lot better, it still wasn’t good enough for our particular use case, so we ended up investing in a powerful server of our own, giving us full control over each VM.
In a nutshell, while the cloud enables many use cases and takes away a lot of the pain of building and maintaining infrastructure, a thorough look at the provided infrastructure is sometimes advisable. Using the approach described above, you can easily do this yourself.