HA Admission Control the basics – Part 2/2

Duncan Epping · Jun 20, 2012 ·

In part one I described what HA Admission Control is and in part two I will explain what your options are when admission control is enabled. Currently there are three admission control policies:

Host failures cluster tolerates
Percentage of cluster resources reserved as failover spare capacity
Specify a failover host

Each of these works in a slightly different way. Let's start with “Specify a failover host” as it is the simplest one to explain. This admission control policy allows you to set aside 1 host that will only be used in case a failover needs to occur. This means that even if your cluster is overloaded DRS will not use it. In my opinion there aren’t many use cases for it, and unless you have very specific requirements I would avoid using it.

The most difficult one to explain is “Host failures cluster tolerates” but I am going to try to keep it simple. This admission control policy takes the worst case scenario into account, and only the worst case scenario, and it does this by using “slots”. A slot consists of two components:

Memory
CPU

For memory it will take the largest reservation on any powered-on virtual machine in your cluster plus the memory overhead for this virtual machine. So if you have one virtual machine that has 24GB of memory provisioned and 10GB out of that is reserved, then the slot size for memory is ~10GB (reservation + memory overhead).

For CPU it will take the largest reservation on any powered-on virtual machine in your cluster, or it will fall back to a default of 32MHz (in 5.0; before 5.0 it was 256MHz) for the CPU slot size. If you have a virtual machine with 8 vCPUs assigned and a 2GHz reservation then the slot size will be 2GHz for CPU.

HA admission control will look at the total amount of resources and see how many “memory slots” there are by dividing the total amount of memory by the “memory slot size”. It will do the same for CPU. It will calculate this for each host. From the total number of available memory and CPU slots it will take the worst case scenario again, so if you have 80 memory slots and 120 CPU slots then you can power on 80 VMs… well almost, because the number of slots of the largest host is also subtracted. Meaning that if you have 5 hosts and each of those has 10 slots for memory and CPU, instead of having 50 slots available in total you will end up with 40.

Simple right? So remember: reservations –> slot size –> worst case. Yes, a single large reservation could severely impact this algorithm!
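
To make the slot arithmetic above a bit more tangible, here is a minimal Python sketch. It is not how HA is implemented; it only follows the description in this article. The class and function names, the 240MB memory overhead and the host sizes are made up for illustration, and taking the lower of the CPU and memory slot counts per host (rather than per cluster) is an assumption.

```python
from dataclasses import dataclass

# Default CPU slot size in 5.0 as described above (256 before 5.0).
DEFAULT_CPU_SLOT_MHZ = 32


@dataclass
class VM:
    cpu_reservation_mhz: int
    mem_reservation_mb: int
    mem_overhead_mb: int  # hypothetical figure, varies per VM


@dataclass
class Host:
    cpu_mhz: int
    mem_mb: int


def slot_size(powered_on_vms):
    """Return (cpu_slot_mhz, mem_slot_mb) based on the largest reservations."""
    cpu = max([DEFAULT_CPU_SLOT_MHZ] +
              [vm.cpu_reservation_mhz for vm in powered_on_vms])
    mem = max([0] +
              [vm.mem_reservation_mb + vm.mem_overhead_mb for vm in powered_on_vms])
    return cpu, mem


def slots_on_host(host, cpu_slot, mem_slot):
    """Worst case of the CPU and memory slot counts for one host (assumption)."""
    cpu_slots = host.cpu_mhz // cpu_slot
    mem_slots = host.mem_mb // mem_slot if mem_slot else cpu_slots
    return min(cpu_slots, mem_slots)


def usable_slots(hosts, powered_on_vms, host_failures=1):
    """Total slots left after subtracting the largest host(s)."""
    cpu_slot, mem_slot = slot_size(powered_on_vms)
    per_host = sorted(slots_on_host(h, cpu_slot, mem_slot) for h in hosts)
    keep = max(len(per_host) - host_failures, 0)
    return sum(per_host[:keep])


# 5 identical hosts with 10 slots each: 50 slots in total, 40 usable.
hosts = [Host(cpu_mhz=20000, mem_mb=102400) for _ in range(5)]
vms = [VM(2000, 10000, 240), VM(0, 0, 100)]
print(slot_size(vms))            # (2000, 10240)
print(usable_slots(hosts, vms))  # 40
```

Note how the single VM with the 2GHz / 10GB reservation dictates the slot size for the whole cluster, which is exactly the skew described above.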

So now what? Well this is where the third admission control policy comes into play… “Percentage of cluster resources reserved as failover spare capacity”. This is not a difficult one to explain, but again misunderstood by many. First of all HA will add up all available resources to see how much it has available. It will then subtract the amount of resources specified for both memory and CPU. Then HA will calculate how many resources are currently reserved for both memory and CPU for powered-on virtual machines. For CPU, a default of 32MHz will be used for those virtual machines that do not have a reservation larger than 32MHz. For memory, a default of 0MB + memory overhead will be used if there is no reservation set. If a reservation is set for memory it will use the reservation + memory overhead.

That is it. Percentage based looks at “powered-on virtual machines” and their reservations, or uses the above mentioned defaults. Nothing more than that. No, it doesn’t look at resource usage / consumption / active etc. It looks at reserved resources. Remember that!
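
The percentage based calculation can be sketched in a few lines of Python. Again, this is only an illustration of the steps described above and not VMware's implementation. The function names, the cluster totals and the overhead values are hypothetical.

```python
from dataclasses import dataclass

# Default used for VMs without a CPU reservation larger than 32MHz.
DEFAULT_CPU_MHZ = 32


@dataclass
class VM:
    cpu_reservation_mhz: int
    mem_reservation_mb: int
    mem_overhead_mb: int  # hypothetical figure, varies per VM


def reserved_cpu_mhz(powered_on_vms):
    """Sum of CPU reservations, using the 32MHz default where needed."""
    return sum(max(vm.cpu_reservation_mhz, DEFAULT_CPU_MHZ)
               for vm in powered_on_vms)


def reserved_mem_mb(powered_on_vms):
    """Sum of memory reservations (0MB if none) plus memory overhead."""
    return sum(vm.mem_reservation_mb + vm.mem_overhead_mb
               for vm in powered_on_vms)


def can_power_on(new_vm, powered_on_vms, total_cpu_mhz, total_mem_mb,
                 cpu_pct, mem_pct):
    """True if the reservations still fit after setting aside the percentage."""
    vms = powered_on_vms + [new_vm]
    cpu_budget = total_cpu_mhz * (100 - cpu_pct) / 100
    mem_budget = total_mem_mb * (100 - mem_pct) / 100
    return (reserved_cpu_mhz(vms) <= cpu_budget
            and reserved_mem_mb(vms) <= mem_budget)


running = [VM(0, 2000, 100), VM(1000, 3000, 100)]
# 25% reserved for failover leaves 7500MHz / 7500MB for reservations.
print(can_power_on(VM(0, 2000, 100), running, 10000, 10000, 25, 25))  # True
print(can_power_on(VM(0, 2500, 100), running, 10000, 10000, 25, 25))  # False
```

Only reservations and overhead show up in the sums. Actual usage of the virtual machines is nowhere in the calculation.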

What do I recommend? I always recommend using the percentage based admission control policy as it is the most flexible policy. It will do admission control on a per virtual machine reservation basis without the risk of skewing the numbers.

If you have any questions around this please don’t hesitate to ask.

HA Admission Control the basics – Part 1/2

Duncan Epping · Jun 18, 2012 ·

Last week I received three different questions about vSphere HA Admission control and I figured I would lay out the basics once more. What is admission control?

vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected.

Almost everything you need to know about admission control is in that single sentence. But let's break it down into more consumable bites:

vCenter uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection.
vCenter uses admission control to ensure that virtual machine resource reservations are respected.

So first and foremost… Admission control is not about resource management. I devoted a whole article to that so I am not going into details here, but HA admission control is all about reserving resources to allow for a failover.

Secondly, admission control ensures virtual machine resource reservations of powered-on VMs can be respected. This is because virtual machine resource reservations are required to be available in order for a power-on to successfully complete! Meaning that if you set a 5GB memory reservation there needs to be 5GB of unreserved memory available (+ reserved memory overhead) on a single host in order for this virtual machine to power on. If that virtual machine with the 5GB reservation is actually actively using 40GB it might end up swapping / paging, as only those 5GB of reserved capacity are taken into account!

Note the “+ reserved memory overhead”! Every virtual machine has a memory overhead. This is usually in the range of a couple of hundred MB. For a successful power-on attempt you will need to be able to reserve this memory. If there is not enough “unreserved memory capacity” the power-on attempt will fail. So in reality that 5GB could just be 5.15GB. Might seem irrelevant, but I will explain why it is relevant in a second. Did you spot the “powered-on”? Yes, admission control only takes the resource reservations of powered-on VMs into account. So if you have a VM with a large memory reservation which is powered-off it will not impact your admission control calculations!
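
The “on a single host” part is easy to miss, so here is a small Python sketch of the power-on check described above. The function name and the 150MB overhead are assumptions for illustration only.

```python
def can_power_on(unreserved_mb_per_host, reservation_mb, overhead_mb):
    """True if at least one host can hold reservation + overhead on its own."""
    needed_mb = reservation_mb + overhead_mb
    return any(free_mb >= needed_mb for free_mb in unreserved_mb_per_host)


reservation_mb = 5 * 1024  # the 5GB reservation from the example
overhead_mb = 150          # hypothetical overhead, roughly "5.15GB" in total

# Plenty of unreserved memory in total, but no single host has 5270MB free.
print(can_power_on([5120, 5200, 5000], reservation_mb, overhead_mb))  # False
# One host with enough unreserved memory is sufficient.
print(can_power_on([5120, 6000], reservation_mb, overhead_mb))        # True
```

In the first case a host with exactly 5GB unreserved is still not enough, purely because of the memory overhead.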

In summary:

Admission Control is about reserving resources to allow for a fail-over.
Admission Control is no resource management tool, it only takes reserved capacity of powered-on VMs into account.

So now you know what admission control is. There are three policies when it comes to admission control… and we will discuss these in Part 2 of this article.

vSphere 5.0 HA restarting of VMs with no access to storage?

Duncan Epping · Jun 6, 2012 ·

I had a question today around the restart of VMs with no access to storage by HA. The question was if HA would try to restart the VM and time out after 5 attempts. The follow-up question was if HA would try again when the storage returns for duty.

By default HA will try to restart a VM up to 5 times in roughly 30 minutes. If the master does not succeed it will stop trying. On top of that HA manages a “compatibility list”. This list will contain the details around which VM can be restarted and where. In other words: which hosts have access to the datastores and network portgroup required for this VM to successfully power on. Now if for whatever reason there are no compatible hosts available for the restart then HA will not try to restart the VM.

But what if the problem is resolved? As soon as the problem is resolved, and reported as such, the compatibility list will be updated. When the list is updated HA will continue with the restarts again.

It might also be good to know that if for whatever reason the master fails, a new master will continue trying to restart the VM. It will start with 5 new attempts and not take the number of restart attempts that the previous master did into account.
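
The restart behavior described above can be modeled roughly like this in Python. This is a toy model and not the actual HA agent logic. The `Master` class, the status strings and the `power_on` callback are hypothetical, and the sketch assumes that waiting for a compatible host does not use up a restart attempt.

```python
MAX_RESTART_ATTEMPTS = 5  # default number of attempts mentioned above


class Master:
    """Toy model of a master that tracks restart attempts per VM."""

    def __init__(self):
        # A new master starts counting from zero again.
        self.attempts = {}

    def try_restart(self, vm, compatible_hosts, power_on):
        """One restart pass; power_on(vm, host) -> bool is a hypothetical callback."""
        if self.attempts.get(vm, 0) >= MAX_RESTART_ATTEMPTS:
            return "gave_up"
        if not compatible_hosts:
            # Nothing to try until the compatibility list is updated.
            return "waiting"
        self.attempts[vm] = self.attempts.get(vm, 0) + 1
        if power_on(vm, compatible_hosts[0]):
            return "restarted"
        if self.attempts[vm] >= MAX_RESTART_ATTEMPTS:
            return "gave_up"
        return "retry"


def always_fails(vm, host):
    return False


master = Master()
print(master.try_restart("vm1", [], always_fails))  # waiting
print([master.try_restart("vm1", ["esx02"], always_fails) for _ in range(5)])
# ['retry', 'retry', 'retry', 'retry', 'gave_up']

# After a master failure the new master starts with 5 fresh attempts.
print(Master().try_restart("vm1", ["esx02"], always_fails))  # retry
```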

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

VM Monitoring only using VMware Tools heartbeat?

Duncan Epping · Jun 5, 2012 ·

I had this question twice this week and did a quick search on my blog. I wrote an article about it a while back, but I figured it wouldn’t hurt to repeat some of that and expand on it. I copied / pasted this part from our book as I think it is spot on!

VM/App monitoring uses a heartbeat mechanism kind of similar to HA. If heartbeats, and, in this case, VMware Tools heartbeats, are not received for a specific (and configurable) amount of time, the virtual machine will be restarted. These heartbeats are monitored by the HA agent and are not sent over a network, but stay local to the host.

Although the heartbeat produced by VMware Tools is reliable, VMware added a further verification mechanism. To avoid false positives, VM Monitoring also monitors I/O activity of the virtual machine. When heartbeats are not received AND no disk or network activity has occurred over the last 120 seconds, per default, the virtual machine will be reset. Changing the advanced setting “das.iostatsInterval” can modify this 120-second interval.
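
The “heartbeats AND I/O” check boils down to a simple condition. Below is a minimal Python sketch of it. The function and parameter names are made up, and the heartbeat timeout is passed in because it is configurable and depends on your VM Monitoring settings.

```python
# Default for "das.iostatsInterval" as mentioned above, in seconds.
DEFAULT_IO_STATS_INTERVAL_S = 120


def should_reset_vm(seconds_since_heartbeat, heartbeat_timeout_s,
                    seconds_since_io,
                    io_stats_interval_s=DEFAULT_IO_STATS_INTERVAL_S):
    """Reset only when heartbeats are missing AND there was no recent I/O."""
    heartbeats_missing = seconds_since_heartbeat > heartbeat_timeout_s
    io_idle = seconds_since_io >= io_stats_interval_s
    return heartbeats_missing and io_idle


# Heartbeats missing and no disk or network I/O for over 120 seconds: reset.
print(should_reset_vm(60, 30, 130))  # True
# Heartbeats missing but I/O seen 10 seconds ago: likely a false positive.
print(should_reset_vm(60, 30, 10))   # False
```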

Which isolation response should I use?

Duncan Epping · May 31, 2012 ·

I wrote this article about split brain scenarios for the vSphere Blog. Based on this article I received some questions around which “isolation response” to use. This is not something that can be answered by a simple “recommended practice” and applied to all scenarios out there. Note that the answer has everything to do with your infrastructure. Are you using IP-based storage? Do you have a converged network? All of these impact the decision around the isolation response.

The following table however could be used to make a decision:

Likelihood that host will retain access to VM datastores	Likelihood that host will retain access to VM network	Recommended Isolation policy	Explanation
Likely	Likely	Leave Powered On	VM is running fine so why power it off?
Likely	Unlikely	Either Leave Powered On or Shutdown	Choose shutdown to allow HA to restart VMs on hosts that are not isolated and hence are likely to have access to storage
Unlikely	Likely	Power Off	Use Power Off to avoid having two instances of the same VM on the VM network
Unlikely	Unlikely	Leave Powered On or Power Off	Leave Powered On if the VM can recover from the network/datastore outage if it is not restarted because of the isolation, and Power Off if it likely can’t.
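
If you want to use the table above in a script, for instance in a design checklist, it can be expressed as a simple lookup. This Python sketch only restates the table. The dictionary and function names are made up.

```python
# Keys: (likelihood of keeping datastore access, likelihood of keeping VM network access)
ISOLATION_RESPONSE = {
    ("likely", "likely"): ["Leave Powered On"],
    ("likely", "unlikely"): ["Leave Powered On", "Shutdown"],
    ("unlikely", "likely"): ["Power Off"],
    ("unlikely", "unlikely"): ["Leave Powered On", "Power Off"],
}


def recommended_isolation_response(datastore_access, network_access):
    """Return the candidate isolation responses for the given likelihoods."""
    key = (datastore_access.lower(), network_access.lower())
    return ISOLATION_RESPONSE[key]


print(recommended_isolation_response("Unlikely", "Likely"))  # ['Power Off']
```

Where the table lists two options, the explanation column still decides which one fits your environment.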