Last week I mentioned which metrics DRS uses for load balancing VMs across a cluster. Of course the obvious question was when the DRS Deepdive would be posted. I must admit I’m not an expert on this topic; like most of you, I always took for granted that it worked out of the box. I can’t remember ever needing to troubleshoot DRS, or better said, I don’t think I’ve ever seen an issue that was DRS related.
This article will focus on two primary DRS functions:
- Load balancing VMs when the cluster is imbalanced
- VM Placement when booting
I will not be focusing on Resource Pools at all as I feel that there are already more than enough articles which explain these. The Resource Management Guide also contains a wealth of info on resource pools and this should be your starting place!
Load Balancing
First of all, VMware DRS evaluates your cluster every 5 minutes. If there’s an imbalance in load it will reorganize your cluster, with the help of VMotion, to create an evenly balanced cluster again. So how does it detect an imbalanced cluster? Let’s start with a screenshot:
fig 1
There are three major elements here:
- Migration Threshold
- Target host load standard deviation
- Current host load standard deviation
Keep in mind that when you change the “Migration Threshold” the value of the “Target host load standard deviation” (THLSD) will also change. In other words, the Migration Threshold dictates how “imbalanced” the cluster is allowed to be. There also appears to be a direct relationship between the number of hosts in a cluster and the THLSD. However, I haven’t found any reference to support this observation. (A two host cluster with the threshold set to three has a THLSD of 0.2, a three host cluster has a THLSD of 0.163.)
As said, every 5 minutes DRS will calculate the sum of the resource entitlements of all virtual machines on a single host and divide that number by the capacity of the host:
sum(expected VM loads) / (capacity of host)
The results for all hosts are then used to compute an average and the standard deviation, which effectively is the “Current host load standard deviation” you see in the screenshot (fig 1). I’m not going to explain what a standard deviation is as it’s explained extensively on Wikipedia.
If the environment is imbalanced and the Current host load standard deviation exceeds the value of the “Target host load standard deviation” DRS will either recommend migrations or perform migrations depending on the chosen setting.
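To make the calculation above a bit more tangible, here is a minimal Python sketch. It is not VMware's implementation: the function names, the example entitlements and capacities are all made up for illustration, and I am assuming a population standard deviation (statistics.pstdev), since I have not found documentation stating which variant DRS uses. The target value of 0.163 is the one observed above for a three host cluster with the threshold set to three.

```python
from statistics import pstdev


def host_load(vm_entitlements, host_capacity):
    """Normalized load of one host: sum(expected VM loads) / (capacity of host)."""
    return sum(vm_entitlements) / host_capacity


def current_host_load_std_dev(hosts):
    """hosts is a list of (vm_entitlements, host_capacity) tuples.

    Assumption: population standard deviation; DRS may use another variant.
    """
    loads = [host_load(vms, capacity) for vms, capacity in hosts]
    return pstdev(loads)


def is_imbalanced(current_std_dev, target_std_dev):
    """The cluster counts as imbalanced when current exceeds target."""
    return current_std_dev > target_std_dev


# Hypothetical three host cluster, all values in MHz.
hosts = [
    ([2000, 1500, 500], 10000),   # host load 0.4
    ([3000, 2000], 10000),        # host load 0.5
    ([1000, 2000, 3000], 10000),  # host load 0.6
]

current = current_host_load_std_dev(hosts)
print(round(current, 3), is_imbalanced(current, target_std_dev=0.163))
```

With these made-up numbers the current value stays below the target, so DRS would have no reason to recommend a migration.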
Every migration recommendation will get a priority rating. This priority rating is based on the Current host load standard deviation. The actual algorithm being used to determine this is described in this KB article. I needed to read the article 134 times before I actually understood what they were trying to explain, so I will use an example based on the info shown in the screenshot (fig 1). Just to make sure it’s absolutely clear, LoadImbalanceMetric is the Current host load standard deviation value and ceil is basically a “round up”. Here is the formula from the KB article, followed by an example based on the screenshot (fig 1):
6 - ceil(LoadImbalanceMetric / 0.1 * sqrt(NumberOfHostsInCluster))
6 - ceil(0.022 / 0.1 * sqrt(3))
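Working it out: 0.022 / 0.1 = 0.22, multiplied by sqrt(3) (roughly 1.732) gives about 0.38, which ceil rounds up to 1, and 6 - 1 = 5. This would result in a priority level of 5 for the migration recommendation if the cluster was imbalanced.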
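If you want to try the formula with your own numbers, here is a small Python sketch of it. The function name and the second set of inputs are mine, not from the KB article, and I have not added any handling for results that fall outside the range of priority levels you see in the client.

```python
import math


def recommendation_priority(load_imbalance_metric, number_of_hosts):
    """6 - ceil(LoadImbalanceMetric / 0.1 * sqrt(NumberOfHostsInCluster))"""
    return 6 - math.ceil(load_imbalance_metric / 0.1 * math.sqrt(number_of_hosts))


# Values from the screenshot (fig 1): 0.022 on a three host cluster.
print(recommendation_priority(0.022, 3))

# A made-up, far more imbalanced three host cluster.
print(recommendation_priority(0.2, 3))
```

The first call gives 5 and the second gives 2, so the larger the Current host load standard deviation, the lower the number that comes out of the formula.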
The only question left for me is how DRS decides which VM it will VMotion… If anyone knows, feel free to chip in. I’ve already emailed the developers and when I receive a reply I will add it to this article and create a separate article about the change so that it stands out.
VM Placement
The placement of a VM when it is powered on is, as you know, part of DRS. DRS analyzes the cluster based on the algorithm described in “Load Balancing”. The question of course is which values DRS works with for the VM that is being powered on. Here’s the catch: DRS assumes that 100% of the provisioned resources for this VM will be used. DRS does not take limits or reservations into account. Just like HA, DRS has “admission control”. If DRS can’t guarantee that the full 100% of the resources provisioned for this VM can be used, it will VMotion VMs away so that it can power on this single VM. If however there are not enough resources available, it will not power on this VM.
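The three possible outcomes described above can be sketched roughly as follows. This is a deliberately simplified illustration and not how DRS is actually implemented: the function, the host names and the numbers are hypothetical, picking the host with the most free capacity is my own assumption, and comparing against the total free capacity of the cluster is only a crude stand-in for working out whether VMotioning VMs away would really free up enough room on one host.

```python
def place_vm(provisioned, free_by_host):
    """Decide what to do with a VM that is being powered on.

    provisioned:  resources configured for the VM; per the text, 100% of
                  this is assumed to be used (limits/reservations ignored).
    free_by_host: dict of host name -> free capacity, in the same units.
    """
    # Hosts that can fit the VM right now.
    candidates = {host: free for host, free in free_by_host.items()
                  if free >= provisioned}
    if candidates:
        # Assumption: prefer the host with the most free capacity.
        return ("power_on", max(candidates, key=candidates.get))

    # No single host fits; crude check whether the cluster as a whole
    # has room, in which case VMs would have to be migrated away first.
    if sum(free_by_host.values()) >= provisioned:
        return ("migrate_then_power_on", None)

    # Not enough resources anywhere: the power on is refused.
    return ("refuse", None)


print(place_vm(4000, {"esx01": 3000, "esx02": 5000}))
print(place_vm(4000, {"esx01": 3000, "esx02": 2000}))
print(place_vm(6000, {"esx01": 3000, "esx02": 2000}))
```

The three calls walk through the three cases in order: a host that fits the VM directly, a cluster where room would have to be made first, and a cluster that simply does not have the resources.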
That’s it for now… Like I said earlier, if you have more in-depth details feel free to chip in as this is a grey area for most people.
Brian Knudtson says
Duncan-
Should the sentence before figure 2 read “This would result in a priority level of 1” instead of “This would result in a priority level of 5”?
Duncan Epping says
I just modified my post as I made a direct relationship between the threshold and the recommendation priority level which was wrong.
NiTRo says
Duncan, very interesting, thanks for this work.
Chris says
Hi Duncan, I was wondering if you have ever seen behavior of DRS powering on a VM that has been powered off for days. I checked the tasks and it was DRS that powered the VM on, not a user. It looks like DRS needed to migrate it to another host, so it moved it and then powered it on. I have never seen this behavior before. Just curious if you have seen this or have an explanation. Thanks!
Fred says
Hi Chris, I had the same problem two days ago: looking at the logs to see who powered on this VM, the message said that DRS powered on this VM… Curious, never seen this behaviour before!
Ratnadeep Bhattacharya says
Hi Duncan. Nice work and thanks. Had some problem following the KB and doing the actual calculations but it was all good in the end. Your article was as always a gem. Been following yellow-bricks since Cormac told me bout it. Keep it up!
Rafael says
Duncan, nice one! Been following quite a few of your blogs for my own. Don’t worry – will assign credits accordingly 🙂
Did you get to find out how DRS decides which VMs to vmotion?