
Yellow Bricks

by Duncan Epping


admission control

HA Admission Control Policy: Dedicated Failover Hosts

Duncan Epping · Feb 5, 2019 ·

This week I received some questions on the topic of HA Admission Control. There was a customer that had a cluster configured with the dedicated failover host admission control policy and they had no clue why. This cluster had been around for a while and it was configured by a different admin, who had left the company. As they upgraded the environment they noticed that it was configured with an admission control policy they never used anywhere else, but why? Well, of course, the design was documented, but no one documented the design decision so that didn’t really help. So they came to me and asked what it exactly did and why you would use it.

Let’s start with that last question: why would you use it? Well, normally you would not, and you should not. Forget about it, unless you have a specific use case, which I will discuss later. What does it do?

When you designate hosts as failover hosts, they will not participate in DRS and you will not be able to run VMs on these hosts! Not even in a two-host cluster when placing one of the two in maintenance mode. These hosts are literally reserved for failover situations. HA will attempt to use these hosts first to fail over the VMs. If, for whatever reason, this is unsuccessful, it will attempt a failover on any of the other hosts in the cluster. For example, when two hosts fail, including the designated failover host, HA will still try to restart the impacted VMs on the host that is left. Although this host was not a designated failover host, HA will use it to limit downtime.

When configuring the cluster, you select this admission control policy and then add the failover hosts. As mentioned earlier, the hosts added to this list will not be considered by DRS at all. This means that those resources are wasted unless there is a failure. So why would you use it?

  • If you need to know where a VM runs all the time, this admission control policy dictates where the restart happens.
  • There is no resource fragmentation, as a full host’s (or multiple hosts’) worth of resources will be available to restart VMs on, instead of one host’s worth of resources fragmented across multiple hosts.

In some cases the above may be very useful. Knowing where a VM runs at all times could, for instance, be required for regulatory compliance, or it could be needed for licensing reasons when you run Oracle.
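For those who prefer scripting, a minimal pyVmomi sketch along the following lines could configure the Dedicated Failover Hosts policy on a cluster. Treat it purely as an illustration: the vCenter address, credentials, cluster name, and failover host name are assumptions, and error handling is omitted.

```python
# Sketch: set the Dedicated Failover Hosts admission control policy with pyVmomi.
# The vCenter address, credentials, cluster name, and host name are example values.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    """Return the first inventory object of the given type with the given name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

cluster = find_by_name(vim.ClusterComputeResource, "Cluster-01")
failover_host = find_by_name(vim.HostSystem, "esxi-03.lab.local")

# Dedicated failover host policy: HA reserves this host, DRS will not place VMs on it.
policy = vim.cluster.FailoverHostAdmissionControlPolicy(failoverHosts=[failover_host])
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(admissionControlPolicy=policy))
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
```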

HA Futures: Admission Control – Part 2 of 4 – (Please comment, feedback needed!)

Duncan Epping · Oct 23, 2018 ·

Admission Control is always a difficult topic when I talk to customers. It seems that many people still don’t fully grasp the concept, or simply misunderstand how it works. To be honest, I can’t blame them, as it doesn’t always make sense when you think things through. Most recently we introduced a mechanism for Admission Control in which you can specify what the “tolerated performance loss” should be for any given VM. Unfortunately this isn’t really admission control, as it doesn’t stop you from powering on new VMs. It does, however, warn you when you reach the threshold where a host failure would lead to the specified performance degradation.

After various discussions with the HA team over the past couple of years, we are now exploring what we can change about Admission Control to give you, as a user, more options to ensure VMs are not only restarted but also receive the resources you expect them to receive. As such, the HA team is proposing three different ways of doing Admission Control, and we would like to have your feedback on this potential change:

  • Admission Control based on reserved resources and VM overheads
    This is what you have today, nothing changes here. We use the static reservations and ensure that all VMs can be powered on!
  • Admission Control based on consumed resources
    This is similar to the “performance degradation tolerated” option. We will look at the average consumed CPU and memory resources (let’s say over the past 24 hours) and base our admission control calculations on that. This will allow you to guarantee that workload performance after a failure is similar to what it was before the failure.
  • Admission Control based on configured resources
    This is a static way of doing admission control, similar to the first. The only difference is that here Admission Control will do the calculations based on the configured resources. So if you configured a VM with 24GB of memory, then we will do the math with 24GB of memory for that VM. The big advantage, of course, is that the VMs will always be able to claim the resources they have been assigned. (A rough comparison of the three calculations is sketched below.)
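To make the difference between the three options concrete, here is a minimal Python sketch of how the per-VM footprint used in the admission control math could differ under each proposal. The VM figures and the 24-hour average are made up for illustration; this is not the actual HA implementation.

```python
# Sketch: how one VM could be counted under the three proposed admission control modes.
# All figures below are illustrative only.
from dataclasses import dataclass

@dataclass
class VM:
    mem_reservation_gb: float   # static memory reservation
    mem_configured_gb: float    # configured (provisioned) memory
    mem_avg_consumed_gb: float  # average consumption over, say, the past 24 hours

def footprint(vm: VM, mode: str) -> float:
    """Return the memory footprint (GB) this VM contributes to the admission control math."""
    if mode == "reserved":    # today: static reservations (VM overhead ignored in this sketch)
        return vm.mem_reservation_gb
    if mode == "consumed":    # proposed: average consumed resources
        return vm.mem_avg_consumed_gb
    if mode == "configured":  # proposed: configured resources
        return vm.mem_configured_gb
    raise ValueError(mode)

vm = VM(mem_reservation_gb=0.0, mem_configured_gb=24.0, mem_avg_consumed_gb=9.5)
for mode in ("reserved", "consumed", "configured"):
    print(f"{mode:>10}: {footprint(vm, mode):5.1f} GB counted")
```

With no reservation set, this VM counts for almost nothing today, while the consumed and configured options would count 9.5GB and 24GB respectively.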

In our opinion, adding these options should help ensure that VMs will receive the resources you (or your customers) would expect them to get. Please help us by leaving a comment and providing feedback. If you agree that this would be helpful then let us know; if you have serious concerns then we would also like to know. Please help shape the future of HA!

HA Admission control: How can I check how much reserved capacity is used?

Duncan Epping · Sep 14, 2018 ·

I had this question twice in the past three months, so somehow this is something that isn’t clear to everyone. With HA Admission Control you set aside capacity for failover scenarios; this capacity is reserved. But of course VMs also use reservations, so how can you see what the total combined reserved capacity in use is, including Admission Control?

Well, it appears this is actually pretty simple. If you open the Web Client or the H5 client and go to your cluster, you will find a “Monitor” tab. Under the Monitor tab there is an option called “Resource Reservation”, which has graphs for both CPU and Memory. This actually includes admission control. To verify this I took a screenshot in the H5 client before enabling admission control and after enabling it. As you can see, there is a big increase in “reserved resources”, which indicates that admission control is taken into consideration.

Before / after: screenshots of the cluster’s Resource Reservation graph with admission control disabled versus enabled.

This is also covered in our Deep Dive book, by the way; if you want to know more, pick it up.
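If you prefer to check this from a script rather than the client, a rough pyVmomi sketch like the one below sums the static VM-level reservations in a cluster. Note that this only covers the VM reservations; the capacity set aside by admission control itself comes on top of that, which is exactly what the Resource Reservation graphs include. The connection details and cluster name are assumptions.

```python
# Sketch: add up static VM-level CPU/memory reservations for one cluster with pyVmomi.
# Connection details and the cluster name "Cluster-01" are example values.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

clusters = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in clusters.view if c.name == "Cluster-01")

# Walk all VMs in the cluster and total their static reservations.
vms = content.viewManager.CreateContainerView(cluster, [vim.VirtualMachine], True)
cpu_mhz = sum(vm.summary.config.cpuReservation or 0 for vm in vms.view)
mem_mb = sum(vm.summary.config.memoryReservation or 0 for vm in vms.view)
print(f"VM-level reservations in {cluster.name}: {cpu_mhz} MHz CPU, {mem_mb} MB memory")
```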

Insufficient configured resources to satisfy the desired vSphere HA failover level on the cluster

Duncan Epping · Jul 12, 2018 ·

I was talking to the HA team this week, specifically about the upcoming HA book. One thing they mentioned is that people still seem to struggle with the concept of admission control. This is the reason I wrote a whole chapter on it years ago, yet there still seems to be a lot of confusion. One thing that is not clear to people is the percentage calculation. We have some customers with VMs with extremely large reservations; in that case, instead of using the “slot policy” they typically switch to the “percentage based policy”, simply because the percentage based policy is a lot more flexible.

However, recently we have had some customers that hit the following error message:

Insufficient configured resources to satisfy the desired vSphere HA failover level on the cluster

In the particular situations where this error message appeared (yes, there was a bug as well, read this article on that), the percentage had been set lower than what would equal a full host. In other words, in a 4-host environment a single host would equal 25%. In some cases customers had set the percentage to a value lower than 25%. I am personally not sure why anyone would do this, as it contradicts the whole essence of admission control. Nevertheless, it happens.

This message indicates that, in the case of a host failure, you may not have sufficient resources to restart all the VMs. This of course is the result of the percentage being set lower than the value that would equal a single host. Note though, this does not stop you from powering on new VMs. You will only be stopped from powering on new VMs when you exceed the available unreserved resources.

So if you are seeing this error message, please verify the configured percentage if you set it manually. Ensure that, at a minimum, it equals the share of resources provided by the largest host in the cluster.
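As a quick illustration of that rule of thumb, the sketch below computes the smallest percentage that still covers the largest host in a cluster. The host sizes are made-up example values.

```python
# Sketch: the lowest "percentage based" admission control value that still covers
# the loss of the largest host. Host memory sizes (GB) are example values.
import math

host_mem_gb = [256, 256, 256, 512]  # four hosts, one larger than the rest

min_percentage = math.ceil(max(host_mem_gb) / sum(host_mem_gb) * 100)
print(f"Reserve at least {min_percentage}% of cluster capacity")  # -> 40%
```

In a cluster of four equally sized hosts this works out to the familiar 25%; with the uneven hosts above, anything below 40% would leave you short if the largest host fails.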

** back to finalizing the book **

 

Guaranteeing availability through admission control, chip in!

Duncan Epping · Apr 9, 2013 ·

I have been having discussions with our engineering teams for the last year around guaranteed restarts of virtual machines in a cluster. In its current shape/form we use Admission Control to guarantee virtual machines are restarted. Today Admission Control is all about guaranteeing virtual machine restarts by keeping track of memory and CPU resource reservations, but you can imagine that in the Software-Defined Datacenter this could be expanded with, for instance, storage or networking reservations.

Now why am I having these discussions, and what is the problem with Admission Control today? Well, first of all there is the perception that many appear to have of Admission Control. Many believe the Admission Control algorithm uses “used” resources. In reality, however, Admission Control is not that flexible: it uses resource reservations, and as you know these are static. So what is the result of using reservations?

By using reservations for “admission control”, vSphere HA has a simple way of guaranteeing that a restart is possible at all times, simply because it checks whether sufficient “unreserved resources” are available and, if so, allows the virtual machine to be powered on. If not, it won’t allow the power-on, just to ensure that all virtual machines can be restarted in case of a failure. But what is the problem? Although we guarantee a restart, we do not guarantee any type of performance after the restart! Unless, of course, you are setting your reservations equal to what you provisioned… but I don’t know anyone doing this, as it eliminates any form of overcommitment and will result in an increase in cost and a decrease in flexibility.
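To make that mechanism concrete, here is a minimal sketch of the kind of check described above: a power-on is admitted only while enough unreserved capacity is left to cover the configured failover buffer. The numbers and the simplified single-resource model are assumptions for illustration, not the actual HA algorithm.

```python
# Sketch: simplified reservation-based admission check for a single resource (memory, GB).
# Real admission control tracks CPU and memory, VM overheads, and the chosen policy.

def admit_power_on(cluster_capacity_gb: float,
                   reserved_gb: float,
                   failover_buffer_gb: float,
                   new_vm_reservation_gb: float) -> bool:
    """Allow the power-on only if the new reservation still fits next to the failover buffer."""
    unreserved = cluster_capacity_gb - reserved_gb - failover_buffer_gb
    return new_vm_reservation_gb <= unreserved

# Example: 1 TB cluster, 600 GB already reserved, 256 GB kept aside for failover.
print(admit_power_on(1024, 600, 256, 100))  # True: 168 GB is unreserved, enough for 100 GB
print(admit_power_on(1024, 600, 256, 200))  # False: 200 GB would eat into the failover buffer
```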

So that is the problem. Question is – what should we do about it? We (the engineering teams and I) would like to hear from YOU.

  • What would you like admission control to be?
  • What guarantees do you want HA to provide?
  • After a failure, what criteria should HA apply in deciding which VMs to restart?

One idea we have been discussing is to have Admission Control use something like “used” resources… or for instance an “average of resources used” per virtual machine. What if you could say: I want to ensure that my virtual machines always get at least 80% of what they use on average? If so, what should HA do when there are not enough resources to meet the 80% demand of all VMs? Power on some of the VMs? Power on all with reduced share values?

Also, something we have discussed is having vCenter show how many resources are used on average, taking your high availability N-X setup into account, which should at least provide insight into how your VMs (and applications) will perform after a failover. Is that something you see value in?

What do you think? Be open and honest, tell us what you think… don’t be scared, we won’t bite, we are open to all suggestions.

