
Yellow Bricks

by Duncan Epping



Using das.vmMemoryMinMB with Percentage Based admission control

Duncan Epping · Mar 8, 2013 ·

I had a question today about using the advanced settings to specify a minimum amount of resources that HA should use to do the admission control math with. Many of us have used the advanced settings das.vmMemoryMinMB and das.vmCpuMinMHz to dictate the slot size when no reservations were set in an environment where the “host failures” admission control policy was used. However, what many don’t appear to realize is that this also works for the Percentage Based admission control policy.

If you want to avoid extreme overcommitment and want to specify a minimum amount of resources that HA should use to do the math with, you can use these settings even with the Percentage Based admission control policy. When the reservation on a VM does not exceed the value specified, the specified value is used in the math. In other words, if you set “das.vmMemoryMinMB” to 2048, HA will use 2048MB in its calculations unless the reservation set on the VM is higher.
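To make that rule explicit, here is a minimal sketch of the logic in Python (my own illustration, not the actual HA code; the function name is made up):

    # das.vmMemoryMinMB acts as a floor in the admission control math.
    DAS_VM_MEMORY_MIN_MB = 2048  # the advanced setting das.vmMemoryMinMB

    def memory_used_for_math(vm_reservation_mb):
        # HA uses the VM's reservation, but never less than das.vmMemoryMinMB.
        return max(vm_reservation_mb, DAS_VM_MEMORY_MIN_MB)

    print(memory_used_for_math(0))     # no reservation  -> 2048 MB counted
    print(memory_used_for_math(4096))  # 4GB reservation -> 4096 MB counted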

I did a quick experiment in my test lab, which I had just rebuilt. Without das.vmMemoryMinMB set and with two VMs running (with no reservation), I had 99% Mem Failover Capacity as shown in the screenshot below:

With das.vmMemoryMinMB set to 20480, and two VMs running, I had 78% Mem Failover Capacity as shown in the screenshot below:

I guess that proves that you can use das.vmMemoryMinMB and das.vmCpuMinMHz to influence Percentage Based admission control.

How to disable Datastore Heartbeating

Duncan Epping · Feb 25, 2013 ·

I have had this question multiple times now: how do I disable datastore heartbeating? Personally, I don’t know why you would ever want to do this… but as multiple people have asked, I figured I would write it down. Unfortunately there is no “disable” button, but there is a work-around. Below are the steps you need to take to disable datastore heartbeating.

vSphere Client:

  • Right-click the Cluster object
  • Click “Edit Settings”
  • Click “Datastore Heartbeating”
  • Click “Select only from my preferred datastores”
  • Do not select any datastores

Web Client:

  • Click the Cluster object
  • Click the “Manage” tab
  • Click “vSphere HA”
  • Click the “Edit” button on the right side
  • Click “Datastore Heartbeating”
  • Click “Select only from my preferred datastores”
  • Do not select any datastores

It is as simple as that… However, let me stress that this is not something I would recommend doing. Only disable it when you are troubleshooting and need it disabled for whatever reason, and please make sure to re-enable it when you are done.
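For those who would rather script this work-around than click through the client, a rough pyVmomi sketch could look like the one below. This is just an assumption on my part about automating it via the vSphere API (hostname, credentials and inventory path are placeholders); test it in a lab before using it.

    # Rough sketch: applying the work-around through the vSphere API with pyVmomi.
    from pyVim.connect import SmartConnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local", pwd="***")
    cluster = si.content.searchIndex.FindByInventoryPath("Datacenter/host/Cluster")

    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    # Equivalent of "Select only from my preferred datastores"...
    spec.dasConfig.hBDatastoreCandidatePolicy = "userSelectedDs"
    # ...while not selecting any datastores.
    spec.dasConfig.heartbeatDatastore = []

    cluster.ReconfigureComputeResource_Task(spec, modify=True)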

vSphere HA 5.x restart attempt timing

Duncan Epping · Feb 18, 2013 ·

I wrote about how vSphere HA 5.x restart attempt timing works a long time ago, but there still appears to be some confusion about this. I figured I would clarify it a bit more; I don’t think I can make it simpler than this:

  • Initial restart attempt
  • If the initial attempt failed, a restart will be retried 2 minutes after the previous attempt
  • If the previous attempt failed, a restart will be retried 4 minutes after the previous attempt
  • If the previous attempt failed, a restart will be retried 8 minutes after the previous attempt
  • If the previous attempt failed, a restart will be retried 16 minutes after the previous attempt

After the fifth failed attempt the cycle ends. Well, that is, unless a new master host is selected (for whatever reason) between the first and the fifth attempt. In that case, we start counting again. This means that if a new master is selected after attempt 3, the new master will start with the “initial restart attempt”.
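As a quick illustration, the cumulative timeline of those five attempts (relative to the initial one) works out as follows; this is simply the arithmetic of the intervals listed above:

    # Restart attempts relative to the initial one: waits of 2, 4, 8 and 16
    # minutes between consecutive attempts, five attempts in total.
    from datetime import timedelta

    offsets = [timedelta(0)]             # attempt 1: the initial restart attempt
    for wait_minutes in (2, 4, 8, 16):   # waits before attempts 2 through 5
        offsets.append(offsets[-1] + timedelta(minutes=wait_minutes))

    for attempt, offset in enumerate(offsets, start=1):
        print(f"attempt {attempt}: T+{int(offset.total_seconds() // 60)} minutes")
    # attempt 1: T+0, attempt 2: T+2, attempt 3: T+6, attempt 4: T+14, attempt 5: T+30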

Or as Frank Denneman would say:

[Image: vSphere HA 5.x restart attempt timing]

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

SRM vs Stretched Cluster solution /cc @sakacc

Duncan Epping · Feb 11, 2013 ·

I was reading this article by Chad Sakac on vSphere DR / HA, or in other words SRM versus Stretched Cluster (vMSC) solutions. I have presented on vSphere Metro Storage Cluster solutions at VMworld together with Lee Dilworth, wrote a white paper on this topic a while back, and have written various blog posts since. I agree with Chad that too many people are misinformed about the benefits of both solutions. I have been on calls with customers where people were indeed saying that SRM is a legacy solution and that the next big thing is “Active / Active”. The funny thing is that in a way I agree when they say SRM has been around for a long time and the world is slowly changing; I do not agree with the term “legacy” though.

I guess it depends on how you look at it. Yes, SRM has been around for a long time, but it is also a proven solution that does what it says it does: it is an orchestration solution for Disaster Recovery. Think about a disaster recovery scenario for a second and then read that last sentence again. When you are planning for DR, isn’t it nice to use a solution that does what it says it does? Although I am a big believer in “active / active” solutions, there is a time and place for them; in many of the discussions I have been in, a stretched cluster solution was simply not what people were looking for. On top of that, stretched cluster solutions aren’t always easy to operate. That is, I guess, what Chad was also referring to in his post. Don’t get me wrong, a stretched cluster is a perfectly viable solution when your organization is mature enough and you are looking for a disaster avoidance and workload mobility solution.

If you are at the point of making a decision around SRM vs Stretched Cluster, make sure to think about your requirements / goals first. Hopefully all of you have read this excellent white paper by Ken Werneburg. Ken describes the pros and cons of each of these solutions perfectly; read it carefully and then make your decision based on your business requirements.

So, in short, a recap for those who are interested but don’t have time to read the full paper (do make time though… really!):

Where SRM shines:

  • Disaster Recovery
  • Orchestration
  • Testing
  • Reporting
  • Disaster Avoidance (will incur downtime when VMs fail over to the other site)

Where a Stretched Cluster solution shines:

  • Workload mobility
  • Cross-site automated load balancing
  • Enhanced downtime avoidance
  • Disaster Avoidance (VMs can be vMotioned, no downtime incurred!)

 

Percentage Based Admission Control gives lower VM restart guarantee?

Duncan Epping · Jan 9, 2013 ·

Those who have configured vSphere HA have all seen that section where it asks whether you want to use admission control or not. Of course, if you decide you want to use it, and you should, then the next question is which policy you want to use. I have always preferred the “Percentage Based Admission Control” policy. For some reason, though, there are people who think that the percentage based admission control policy rules out large VMs from being restarted, or offers a lower guarantee.

The main perception people have is that the percentage based admission control policy gives a lower guarantee that virtual machines will be restarted than the “host failures” admission control policy. So let’s break it down, and I mean BREAK IT DOWN, by using an example.

Example

  • 5 hosts
  • 200GB of Memory in cluster
  • 20GHz of CPU in cluster

If no reservations are set:

Percentage Based will do the following:

  1. The Percentage Based policy will take the total amount of resources and subtract the amount of resources reserved for fail-over. If that percentage is for instance 20%, then 40GB and 4GHz are subtracted, which means 160GB and 16GHz are left.
  2. The reserved resources of every virtual machine that is powered on are subtracted from the outcome of step 1. If no memory reservation is set, the memory overhead is subtracted instead; if the memory overhead is 200MB then 200MB is subtracted from the 160GB that was left, resulting in 159.8GB being available. For CPU the default of 32MHz will be used.
  3. You can power on virtual machines until the amount of available resources, according to HA Admission Control, is depleted; yes, many VMs in this case.
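A minimal sketch of that math with the example numbers (illustrative only, not the actual HA code):

    # Percentage Based admission control math for the example cluster above,
    # with 20% reserved for fail-over and no VM reservations set.
    total_mem_gb, total_cpu_ghz = 200.0, 20.0
    failover_pct = 0.20

    avail_mem_gb = total_mem_gb * (1 - failover_pct)     # 160 GB left
    avail_cpu_ghz = total_cpu_ghz * (1 - failover_pct)   # 16 GHz left

    # Per powered-on VM without a reservation: subtract the memory overhead
    # (200MB in the example) and the 32MHz CPU default.
    avail_mem_gb -= 0.2      # 159.8 GB after one VM
    avail_cpu_ghz -= 0.032   # 15.968 GHz after one VM
    print(avail_mem_gb, avail_cpu_ghz)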

Host Failures will do the following:

  1. The Host Failures policy will calculate the number of slots. A slot is formed out of two components: memory and CPU. As no reservation is used, the default for CPU is used, which is 32MHz with vSphere 5.0 and higher. For memory the largest memory overhead size is used; in this scenario there could be a variety of sizes, let’s say the smallest is 64MB and the largest 300MB. Now 300MB will be used for the memory slot size.
  2. Now that the slot size is known, Admission Control will look for the host with the most slots (available resources / slot size) and subtract those slots from the total number of available slots (if one host failure is specified). Every time a VM is started a slot is subtracted. If a VM is started with a higher memory reservation, we go back to step 1 and the math will need to be done again.
  3. You can power on virtual machines until you are out of slots; again… many VMs.
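A similar sketch for the slot math, looking only at the memory component for simplicity and assuming five equally sized 40GB hosts (an assumption; the example only gives the 200GB cluster total):

    # Host Failures slot math for the no-reservation case, memory component only.
    mem_slot_mb = 300                # largest memory overhead in the example
    host_mem_mb = [40 * 1024] * 5    # assumed: five equal 40GB hosts (200GB total)

    slots_per_host = [m // mem_slot_mb for m in host_mem_mb]
    total_slots = sum(slots_per_host)
    usable_slots = total_slots - max(slots_per_host)   # tolerate one host failure
    print(total_slots, usable_slots)  # plenty of slots; one slot per powered-on VM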

If reservations are set:

Percentage Based will do the following:

  1. The Percentage Based policy will take the total amount of resources and subtract the amount of resources reserved for fail-over. If that percentage is for instance 20%, then 40GB and 4GHz are subtracted, which means 160GB and 16GHz are left.
  2. The reserved resources of every virtual machine that is powered on are subtracted from the outcome of step 1. So if 10GB of memory was reserved, then 10GB is subtracted, resulting in 150GB being available.
  3. You can power on virtual machines until the available resources are depleted (according to HA Admission Control), but as reservations are used you are “limited” in terms of the number of VMs you can power on.

Host Failures will do the following:

  1. The Host Failures policy will calculate the number of slots. A slot is formed out of two components: memory and CPU. As a reservation is used for memory but not for CPU, the default for CPU is used, which is 32MHz with vSphere 5.0 and higher. For memory there is a 10GB reservation set, so 10GB will be used for the memory slot size.
  2. Now that the slot size is known, Admission Control will look for the host with the most slots (available resources / slot size) and subtract those slots from the total number of available slots (if one host failure is specified). Every time a VM is started a slot is subtracted, and yes, that is a 10GB memory slot, even if the VM has for instance a 2GB reservation. If a VM is started with a higher memory reservation, we go back to step 1 and the math will need to be done again.
  3. You can power on virtual machines until you are out of slots; as a high reservation is set, you will be severely limited!

Now you can imagine that “Host Failures” errs on the side of caution… If you have one reservation set, the math will be done with that reservation. This means that a single 10GB reservation will impact how many VMs you can power on before HA screams that it is out of resources. But at least you are guaranteed you can power them on, right? Well yes, but realistically speaking people disable Admission Control at this point, as that single 10GB reservation allows you to power on just a couple of VMs. (16 to be precise.)
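To show where that number comes from, here is the same memory slot math with a 10GB slot size, again assuming five equal 40GB hosts and one host failure to tolerate:

    # Where "16 to be precise" comes from: 10GB memory slots in a 200GB cluster.
    mem_slot_gb = 10                 # slot size driven by the single 10GB reservation
    host_mem_gb = [40] * 5           # assumed: five equal 40GB hosts

    slots_per_host = [m // mem_slot_gb for m in host_mem_gb]   # 4 slots per host
    usable_slots = sum(slots_per_host) - max(slots_per_host)   # 20 - 4 = 16
    print(usable_slots)              # only 16 VMs can be powered on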

But, but… that beats Percentage Based, right? Because if I have a lot of VMs, who says my VM with a 10GB reservation can be restarted? First of all, if there are no “unreserved resources” available on any given host to start this virtual machine, then vSphere HA will ask vSphere DRS to defragment the cluster. As HA Admission Control had already accepted this virtual machine to begin with, chances are fairly high that DRS can solve the fragmentation.

Also, as the percentage based admission control policy uses reservations AND memory overhead… how many virtual machines do you need to have powered on before your VM with a 10GB memory reservation is denied to be powered on? It would mean that none of the hosts has 10GB of unreserved memory available. That is not very likely, as it means you would need to power on hundreds of VMs… probably way too many for your environment to ever perform properly. So the chances of hitting this scenario are extremely small.

Conclusion

Although theoretically possible, it is very unlikely you will end up in a situation where one or multiple virtual machines cannot be restarted when using the Percentage Based Admission Control policy. Even if you are using reservations on all virtual machines, this is unlikely, as the virtual machines have been accepted at some point by HA Admission Control and HA will leverage DRS to defragment resources at that point. Also keep in mind that when using reservations on all virtual machines, Host Failures is not really an option, as it skews your numbers by doing the math with the “worst case scenario”; a single 10GB reservation can kill your ROI/TCO.

In short: Go Percentage Based!

