For years, rumors have been floating around that DRS does not take CPU Ready Time (%RDY) into account when it comes to load balancing the virtual infrastructure. The fact is that %RDY has always been part of the DRS algorithm, not as a first class citizen but as part of CPU Demand, which is a combination of various metrics that includes %RDY. Still, one might ask why %RDY is not a first class citizen.
There is a good reason that it isn’t, though. Just think about what DRS is and does, and how it actually goes about balancing out the environment while trying to please all virtual machines. There are a lot of possible ways to move virtual machines around in a cluster. So you can imagine that it is really complex (and expensive) to calculate, for every first class citizen metric, what the possible impact is after a virtual machine has been migrated “from a host” or “to a host”.
Now, for a long time the DRS engineering team has been looking for situations in the field where a cluster is balanced according to DRS but there are still virtual machines experiencing performance problems due to high %RDY. The DRS team really wants to fix this problem or bust the myth – what they need is hard data. In other words, vc-support bundles from vCenter and vm-support bundles from all hosts with high ready times. So far, no one has been able to provide these logs / cold hard facts.
If you see this scenario in your environment regularly, please let me know. I will personally put you in touch with our DRS engineering team, and they will look at your environment and try to solve this problem once and for all. We need YOU!
We do see these situations quite a lot in our environment, and would be glad to get in touch with the experts to help them and get help in return. 🙂
I do believe we are seeing CPU Ready time issues in our datacenters. We have a large number of ESX hosts and VMs along with a variety of Intel CPUs. I would love the opportunity to discuss what we are seeing with the DRS team.
We used to see this issue about two years ago. Within the environment there were a number of X5450s, which in a dual processor configuration had a total of 8 threads. The CPU usage on the nodes was sitting around 10%, but we would see a large amount of %RDY issues. Some of this was minimized by dropping the vCPU count on the guest VMs, but it never really went away until we had a hardware refresh that upped the thread count to 32 per box (2x E5-2690s).
From an engineering point of view, beyond treating %RDY as a more prominent point of measurement, there is the in-rush effect to consider. If DRS kicks off a series of vMotions because a host is in an over-saturation scenario, the result could be a few VMs all being moved to the same other host at the same time, causing %RDY to spike there and another round of vMotions to kick off because the destination host is now overloaded.
Beyond DRS, I often find that people assume that because the CPU usage within the web interface or C# interface is “low”, there are resources available. I can have a relatively “low” CPU usage but still have horrible %RDY issues.
We had a similar problem due to a firmware bug to do with power saving modes that would drop down to only 10% CPU available for use. It LOOKED like just a host not working very hard, but the VMs on that host would have ridiculous RDYs. DRS never moved anything in that scenario. This was over the course of several weeks of manual fixes until we could patch firmware.
While I’m interested in the extent that %RDY comes into play for the DRS calculation, I’m not convinced about elevating it to a primary factor in balancing the loads. Personally, I see it only as a strong indicator that something is wrong – that it’s time to dig into the charts/ESXTOP to determine root cause.
For instance: if there’s latency in the disk subsystem, I’ve seen issues with %RDY spiking as the SCSI commands are being processed by the vmkernel (my assumption). Such a case can show higher Ready times, but it is explained by checking DAVG and KAVG (specifically KAVG in this example). At this point, guests on different hosts sharing that RAID Group could show the same issue. (To be fair: this has been less of a problem for us in ESXi 5 compared to ESX 4.1.)
So, if the %RDY metric was backed up with “sanity checks” from other kernel-intensive metrics, I would feel better about it being used as a primary metric in DRS. There are just too many odd situations that may not be directly related to CPU contention.
Jonathan Klick says
Yes, DRS would have to first be smart enough to determine if the root cause of CPU Ready is related to a high vCPU:pCPU ratio on the host. If not, vMotioning would not help. Another example of artificially created CPU contention could be a configured CPU limit. Once some of these outlier cases for CPU Ready have been ruled out, DRS should be able to migrate a VM to a host with a lower vCPU:pCPU ratio. At least, that’s what I would imagine.
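The check described in this comment can be sketched in a few lines of Python. This is only an illustration of the commenter's idea, not how DRS actually works: the `VM` and `Host` classes, the `migration_might_help` function, and the host names are all hypothetical, and the only inputs are the vCPU count, the logical CPU count, and whether a CPU limit is configured.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VM:
    name: str
    vcpus: int
    cpu_limit_mhz: Optional[int] = None  # None means no CPU limit is configured


@dataclass
class Host:
    name: str
    pcpus: int  # logical CPUs (threads) on the host
    vms: List[VM] = field(default_factory=list)

    def vcpu_pcpu_ratio(self) -> float:
        """Total configured vCPUs divided by logical CPUs."""
        return sum(vm.vcpus for vm in self.vms) / self.pcpus


def migration_might_help(vm: VM, source: Host, target: Host) -> bool:
    """Hypothetical heuristic, NOT the DRS algorithm.

    Rule out a configured CPU limit first (ready time caused by a limit
    follows the VM to any host), then compare the source ratio with the
    ratio the target would have after receiving the VM.
    """
    if vm.cpu_limit_mhz is not None:
        return False
    ratio_now = source.vcpu_pcpu_ratio()
    ratio_after = (sum(v.vcpus for v in target.vms) + vm.vcpus) / target.pcpus
    return ratio_after < ratio_now


# Example: an 8-thread host carrying 16 vCPUs versus a 32-thread host carrying 8.
busy = Host("esx01", pcpus=8, vms=[VM(n, 4) for n in ("a", "b", "c", "d")])
idle = Host("esx02", pcpus=32, vms=[VM("e", 8)])
print(busy.vcpu_pcpu_ratio())                          # 2.0
print(migration_might_help(busy.vms[0], busy, idle))   # True
```

A real implementation would also need the "sanity checks" mentioned earlier in the thread (storage latency, power management, NUMA placement), none of which this sketch attempts.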
We had many issues with RDY time in our HP BL685c 4 socket 32 core AMD environments. Ready time was through the roof even on hosts that had only a few VMs (30).
We fixed our issues by upgrading to ESXi 5.1, changing the NUMA rebalancer to run every 60 seconds instead of every 2 seconds, and setting the HP Power Profile (in the BIOS) to “Maximum Performance”.
We are seeing 1000%+ improvement in RDY times. Host RDY has decreased from 250,000ms to 20,000ms.
I’m very interested in seeing if DRS can make better use of RDY metrics to migrate VMs.
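To put the millisecond figures above in perspective, here is a small Python sketch that converts a vCenter CPU Ready summation value into a percentage. It assumes the numbers were read from the real-time performance chart, which uses a 20-second sample interval; the comment does not say which chart interval was used, so treat the interval as an assumption. The function names are made up for this example.

```python
def ready_ms_to_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU Ready summation (ms) into a percentage of the sample interval.

    interval_s defaults to 20 seconds, the real-time chart interval in vCenter.
    This is an assumption here; longer chart intervals need a larger value.
    """
    return ready_ms / (interval_s * 1000.0) * 100.0


def reduction_factor(before_ms: float, after_ms: float) -> float:
    """How many times smaller the ready time is after the change."""
    return before_ms / after_ms


print(ready_ms_to_percent(250_000))      # 1250.0
print(ready_ms_to_percent(20_000))       # 100.0
print(reduction_factor(250_000, 20_000)) # 12.5
```

A host-level value is summed over all vCPUs running on that host, so percentages above 100 are possible; dividing by the number of vCPUs gives a rough per-vCPU average.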
We have got the HP DL385 with 2 socket 16 Core AMD 6278 and see high RDY times too. We have no CPU Overcommitment.
Can you tell me the exact Advanced Config for NUMA rebalancer scheduler runtime, please? We run on ESXi5.0
Jonathan Klick says
Yes, any details you can provide would be wonderful. I’ve seen several AMD environments with CPU Ready issues that I couldn’t explain.
We tested the following in our Lab with great results. We are now in the process of applying to Production.
1. If running esxi5.0, make sure that you have installed this patch:
I did not need the patch since it’s fixed in ESXi 5.1.
2. Modify NUMA Rebalance Period.
Under Host and Clusters view, for each host:
Host -> Configuration -> Advanced Settings -> Numa -> Numa.RebalancePeriod = 60000
3. Boot the server into the BIOS and set the following:
Power Management Options -> HP Power Profile -> Maximum Performance
We used the ESXi host RDY metrics/graphs in vCenter to validate our tests. We also used the same VM load and number of VMs when performing tests.
We also let the host settle for a few minutes after making these changes and before pulling metrics.
Initially we had much worse performance on our newer G7 servers when compared to older HP G6 Blades.
Let me know how it goes…
Jonathan Klick says
This is fantastic! Thank you!
One last question: it seems you’re presenting three different solutions to the problem. Which one is the fix? My suspicions were that it had something to do with NUMA, so I’m curious how much the other two solutions actually help. I wish I had an AMD-based lab so I could simply test it myself.
I can’t speak about #1 since I did not experience that issue.
#2,#3 – Each provided about 3-5X improvement.
I recommend you try both.
Raed Hussein says
We faced insanely high CPU RDY times, in particular with the XenApp VMs. We logged a call with VMware (ref 13308517504); if you can access the ticket info, you will see all the log bundles. The final solution we got was to increase the number of physical CPUs. I was happy with that solution, as our design for XenApp wasn’t that proper and the VMs were very CPU intensive. The issue is almost solved now: the numbers aren’t perfect, but CPU RDY time has decreased and we are getting far fewer complaints from the end users. Hope the ticket ID will help!
We were facing high CPU RDY times on HP DL585 G7 hosts running ESX 4.1 with NUMA enabled, while the CPU load on the hardware was low. After disabling NUMA, the problem vanished and never appeared again.
Tim Cooke says
This is slightly off-topic, but I don’t think DRS deals with memory contention very well, and I think that is more of an issue (or at least it has been for us): a cluster is balanced according to DRS, but there are VMs that struggle to get going due to ballooning.
For example, take a two-host cluster where one host was put into maintenance mode (MM) for hardware replacement or whatever, so all the VMs moved to the other host. Take the host out of MM and, generally, most of the VMs will just remain on the host they were moved to, until such time as the CPU kicks off on some of them and DRS decides to move them. Now whilst they are all on that single host, but pretty idle, they are pushing the physical memory usage up, which results in a number of the VMs getting ballooned. “That’s not a problem,” I hear people say, “since the VMs are idle.” But should those VMs then kick off, it takes some time to reclaim the physical memory they need, to the extent that end users see performance degradation on the VMs through unresponsive applications (especially things like JVMs on Windows VMs). DRS doesn’t move them because it doesn’t see a need from a CPU perspective, so the VMs just struggle along, paging out like mad, until the physical RAM is available and the balloon has deflated.
DRS really needs to look at memory utilization on the host and rebalance the VMs over the available physical memory, so ballooning is only used when there is genuine contention for physical memory. It can do the vmotion on the VMs that are idle, rather than active ones, and not just when there is CPU contention.
Matthew Reed says
[So far, no one has been able to provide these logs / cold hard facts.]
I find your comment interesting because customers actually have been providing those logs. However, EMC could be more effective by:
1) Making it mandatory, before an SE or case number is ever assigned, that customers upload their ESX/vSphere/vCenter logs, as this was not always the case with us.
2) Preserving the uploaded logs for engineering and QA purposes only. Unfortunately, in our EMC support case experience, although we always uploaded logs, they were deleted from the FTP support case directories soon after a case was closed.
I am sure it’s been emphasised before, but some diligence never hurts!!
Keep on innovating!!