high availability

What’s new in vSphere 5.1 for High Availability

Duncan Epping · Sep 12, 2012 ·

As vSphere High Availability was completely revamped in 5.0, not a lot of changes have been introduced in 5.1. There are some noteworthy changes, though, that I figured I would share with you. So what’s cool?

Ability to set slot size for “Host failures tolerated” through the vSphere Web Client
Ability to retrieve a list of the virtual machines that span multiple slots
Support for Guest OS Sleep mode
Including the Application Monitoring SDK in the Guest SDK (VMware Tools SDK)
vSphere HA (FDM) VIB is automatically added to Auto-Deploy image profile
Ability to delay isolation response through the use of “das.config.fdm.isolationPolicyDelaySec”

Although many of these speak for themselves, I will elaborate on why these enhancements are useful and when to use them.

The ability to set the slot size for “Host failures tolerated” allows you to manually dictate how many virtual machines you can power on in your cluster. Many have used advanced settings to achieve more or less the same, but through the UI things are a lot easier.

Now if you do this, it could happen that a virtual machine needs multiple slots in order to successfully power on. That is where the second bullet point comes into play. In the vSphere Web Client you can now see a list of all the virtual machines that currently span multiple slots.
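To make the "spans multiple slots" idea a bit more concrete, here is a minimal sketch of the arithmetic. It is not how the Web Client or HA computes the list internally; the function names are made up for illustration, and I am assuming a VM needs enough slots to cover the larger of its CPU and memory requirements (with the memory figure already including any overhead).

```python
import math

def slots_needed(cpu_reservation_mhz, mem_reservation_mb, slot_cpu_mhz, slot_mem_mb):
    """Hypothetical helper: slots a VM occupies for a manually set slot size.

    Assumption: the VM needs enough slots to cover the larger of its CPU and
    memory requirements, and never fewer than one slot.
    """
    cpu_slots = math.ceil(cpu_reservation_mhz / slot_cpu_mhz)
    mem_slots = math.ceil(mem_reservation_mb / slot_mem_mb)
    return max(1, cpu_slots, mem_slots)

def vms_spanning_multiple_slots(vms, slot_cpu_mhz, slot_mem_mb):
    """Return {vm_name: slots} for every VM that needs more than one slot.

    vms maps a VM name to (cpu_reservation_mhz, mem_reservation_mb).
    """
    result = {}
    for name, (cpu_mhz, mem_mb) in vms.items():
        needed = slots_needed(cpu_mhz, mem_mb, slot_cpu_mhz, slot_mem_mb)
        if needed > 1:
            result[name] = needed
    return result

# Example inventory with made-up reservations and a 500 MHz / 1024 MB slot.
inventory = {"small": (0, 512), "db": (2000, 4096), "web": (500, 3000)}
print(vms_spanning_multiple_slots(inventory, slot_cpu_mhz=500, slot_mem_mb=1024))
```

With a small manual slot size, the larger VMs quickly show up in this list, which is exactly the situation the new Web Client view is meant to expose.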

Support for Guest OS “Sleep Mode” in environments where VM Monitoring is used was added. This was reported by Sudharsan a while back and I addressed it with the HA engineering team. As a result they added in the logic that recognizes the “state” of the virtual machine to avoid unneeded restarts. Thanks Sudharsan for reporting! (I can’t find this in the release notes however)

With 5.0 the Application Monitoring SDK was opened up to the broader audience. It was still a separate installer though. As of vSphere 5.1 the App Monitoring SDK is part of the VMware Tools SDK. This will make your life easier when you use Application Monitoring.

Those running stateless will be happy about the fact that the FDM VIB is now part of the Auto-Deploy image profile. This will avoid the need to manually add it every time you create a new image.

Last but not least, in 5.1 we re-introduce “das.failuredetectiontime”… well not exactly but a similar concept with a different name. This new advanced setting named “das.config.fdm.isolationPolicyDelaySec” will allow you to extend the time it takes before the isolation response is triggered. By default the isolation response is triggered after ~30 seconds with vSphere 5.x. If you have a requirement to increase this then this new advanced setting can be used.

Answering some admission control questions

Duncan Epping · Jul 3, 2012 ·

I received a bunch of questions on HA admission control in this blog post and I figured I would answer them in a blog post so that everyone would be able to find / read it. This was the original set of questions:

There are 4 ESXi hosts in the network and 4 VMs (same CPU and RAM reservation for all VMs) on each host. The Admission Control policy is set to ‘Host failure cluster tolerates’ = 1. All the available 12 slots have been used by the powered-on VMs, except the 4 reserved slots for failover.
1) What happens if 2 ESXi hosts fail now? (2 * 4 VMs need to fail over.) Will HA restart only 4 VMs as it has only 4 slots available? And will the restart of the remaining 4 VMs fail?
Same scenario, but the policy is set to ‘% of cluster resources reserved’ = 25%. All the available 75% of resources have been utilized by the 16 VMs, except the 25% reserved for failover.
2) What happens if 2 ESXi hosts fail now? (2 * 4 VMs need to fail over.) Will HA restart only 4 VMs as that consumes 25% of resources? And will the restart of the other 4 VMs fail?
3) Does HA check the VM reservation (or any other factor) at the time of restart?
4) Does HA only restart a VM if the host can guarantee the reserved resources, or does the restart fail?
5) What if no VM reservations are set at the VM level?
6) What does HA take into consideration when it has to restart VMs which have no reservation?
7) Will it guarantee the configured resources for each VM?
8) If not, how can HA restart 8 VMs (as per our example) when it only has reserved resources configured for just 4 VMs?
9) Will it share the reserved resources across 8 VMs and not care about the resource crunch, or is it first come, first served?
10) Does Admission Control not have any role at all in the event of an HA failover?

Let me tackle these questions one by one:

1) In this scenario 4 VMs will be restarted and 4 VMs might be restarted! Note that the “slot size” policy is used and that this is based on the worst case scenario. So if your slot is 1GB and 2GHz but your VMs require way less than that to power on, it could be that all VMs are restarted. However, HA only guarantees the restart of 4 VMs. Keep in mind that this scenario doesn’t happen too often, as you would be overcommitting to the extreme here. As said, HA will restart all VMs it can. It just needs to be able to satisfy the resource reservations on memory and CPU!
2) Again, in this scenario HA will also do its best to restart the VMs. It can restart VMs until all “unreserved capacity” is used. As HA only needs to guarantee reserved resources, the chances of hitting this limit are very slim; as most people don’t use reservations at a VM level, it would mean you are overcommitting extremely.
3) Yes, it will validate that there is a host which can back the resource reservations before it tries the restart.
4) Yes, it will only restart the VM when this can be guaranteed. If it cannot be, then HA can call “DRS” to defragment resources for this VM.
5) If there are no reservations then HA will only look at the “memory overhead” in order to place this VM.
6) HA ensures the portgroup and datastore are available on the host.
7) It will not guarantee configured resources; HA is about restarting virtual machines, not about resource management. DRS is about resource management and guaranteeing access to resources.
8) HA will only be able to restart the VM if there are unreserved resources available to satisfy the VM’s request.
9) All resources required for a virtual machine need to be available on a single host! Yes, resources will be shared on a single host, just as long as no reservations are defined.
10) No, Admission Control doesn’t have any role in an HA failover. Admission Control happens on a vCenter level; HA failovers happen on an ESX(i) level.
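The answers above boil down to one rule: a VM is restarted only if a single host has enough unreserved capacity to back its reservation (or just its memory overhead when there is no reservation). Below is a rough sketch of that rule. The first-fit placement order and all names are my own assumptions for illustration; this is not HA's actual placement logic, and it ignores the portgroup and datastore checks.

```python
def restart_vms(hosts, vms):
    """Illustrative first-fit restart: place each VM on the first host that
    can back its reservation from unreserved capacity.

    hosts: {host_name: (unreserved_cpu_mhz, unreserved_mem_mb)}
    vms:   list of (vm_name, cpu_reservation_mhz, mem_needed_mb), where
           mem_needed_mb is the memory reservation plus overhead.
    Returns (placed, failed): {vm_name: host_name} and a list of VM names.
    """
    free = {name: list(capacity) for name, capacity in hosts.items()}
    placed, failed = {}, []
    for vm_name, cpu_mhz, mem_mb in vms:
        for host_name, capacity in free.items():
            # Everything a VM needs must be available on one single host.
            if capacity[0] >= cpu_mhz and capacity[1] >= mem_mb:
                capacity[0] -= cpu_mhz
                capacity[1] -= mem_mb
                placed[vm_name] = host_name
                break
        else:
            failed.append(vm_name)
    return placed, failed

# Two surviving hosts (made-up numbers), eight VMs that need a restart.
surviving = {"esx3": (2000, 4096), "esx4": (2000, 4096)}

# Without reservations only the (assumed) 100 MB memory overhead counts.
no_reservations = [(f"vm{i}", 0, 100) for i in range(8)]
print(restart_vms(surviving, no_reservations))

# With 1000 MHz / 2048 MB per VM only four of the eight fit.
with_reservations = [(f"vm{i}", 1000, 2048) for i in range(8)]
print(restart_vms(surviving, with_reservations))
```

The second call illustrates the "4 VMs will be restarted and 4 VMs might be restarted" answer: whether the remaining four come back depends entirely on what they need reserved at power-on.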

vSphere HA in 5.0 constantly pinging my gateway?

Duncan Epping · Jun 26, 2012 ·

I had this question today and noticed someone also dropped it on the community forums. The question was whether vSphere HA is constantly pinging the default gateway or not. I knew HA would ping the gateway on a regular basis as of vSphere 5.0, and more frequently if a ping failed, but I wasn’t sure about the timing. I pointed Marc Sevigny from the HA engineering team to the thread on the community forums and he added some nice juicy details to it. I figured I would share them with you.

First of all, each ESXi host in a 5.x cluster will ping the isolation address every 5 minutes (300 seconds). Could this flood the isolation device?

There should be no “flood” of ICMP messages, and it should have little impact on network performance. The ICMP packet is 53 bytes long. When a ping fails, it is sent once every 5 seconds from each of the HA hosts until the address(es) become pingable once again, at which point it returns to pinging once every 5 minutes.
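For a feel of how little traffic this is, here is a quick back-of-the-envelope sketch using the numbers quoted here (53-byte packet, every 300 seconds normally, every 5 seconds after a failed ping). The 32-host cluster size is just an assumption for the example, and the figures ignore replies and lower-layer framing.

```python
ICMP_PACKET_BYTES = 53   # packet size quoted in this post
NORMAL_INTERVAL_S = 300  # regular ping interval (5 minutes)
RETRY_INTERVAL_S = 5     # interval while the isolation address is unreachable

def ping_interval(last_ping_succeeded):
    """Seconds until the next ping, based on the outcome of the last one."""
    return NORMAL_INTERVAL_S if last_ping_succeeded else RETRY_INTERVAL_S

def aggregate_bytes_per_second(host_count, interval_s, packet_bytes=ICMP_PACKET_BYTES):
    """Average ping traffic hitting the isolation address from all hosts."""
    return host_count * packet_bytes / interval_s

hosts = 32  # assumed cluster size
print(aggregate_bytes_per_second(hosts, ping_interval(True)))   # normal operation
print(aggregate_bytes_per_second(hosts, ping_interval(False)))  # address unreachable
```

Even in the retry state a 32-host cluster stays in the range of a few hundred bytes per second, which is why this should not be a concern for the isolation device.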

If your default gateway is never pingable because of your firewall, you should allow ICMP (ping) traffic to it through the firewall. It is also possible to disable the isolation address monitoring on the default gateway by using an advanced option (das.useDefaultIsolationAddress = false). It is recommended to specify a different isolation address (das.isolationaddress0) when the default gateway is a non-pingable device. Note that it is highly recommended to use a device as the default gateway which is as few hops removed from your hosts as possible!

VM Monitoring only using VMware Tools heartbeat?

Duncan Epping · Jun 5, 2012 ·

I had this question twice this week and did a quick search on my blog. I wrote an article about it a while back, but I figured it wouldn’t hurt to repeat some of that and expand on it. I copied / pasted this part from our book as I think it is spot on!

VM/App monitoring uses a heartbeat mechanism kind of similar to HA. If heartbeats, and, in this case, VMware Tools heartbeats, are not received for a specific (and configurable) amount of time, the virtual machine will be restarted. These heartbeats are monitored by the HA agent and are not sent over a network, but stay local to the host.

Although the heartbeat produced by VMware Tools is reliable, VMware added a further verification mechanism. To avoid false positives, VM Monitoring also monitors I/O activity of the virtual machine. When heartbeats are not received AND no disk or network activity has occurred over the last 120 seconds, per default, the virtual machine will be reset. Changing the advanced setting “das.iostatsInterval” can modify this 120-second interval.
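The decision described above can be written down as a simple AND of two conditions. This is only a sketch of the logic as explained here, not the HA agent's code; the function and parameter names are invented, and the 30-second failure interval in the examples is just an assumed value for the configurable heartbeat timeout.

```python
def should_reset_vm(secs_since_heartbeat, failure_interval_s,
                    secs_since_io, iostats_interval_s=120):
    """Reset only when heartbeats are missing AND the VM shows no I/O.

    failure_interval_s is the configurable heartbeat timeout;
    iostats_interval_s mirrors the das.iostatsInterval default of 120 seconds.
    """
    heartbeat_lost = secs_since_heartbeat >= failure_interval_s
    io_idle = secs_since_io >= iostats_interval_s
    return heartbeat_lost and io_idle

# Heartbeats stopped, but the VM did disk or network I/O 10 seconds ago: no reset.
print(should_reset_vm(secs_since_heartbeat=45, failure_interval_s=30, secs_since_io=10))

# Heartbeats stopped and no I/O for over two minutes: reset.
print(should_reset_vm(secs_since_heartbeat=45, failure_interval_s=30, secs_since_io=130))
```

The I/O check is what prevents a false positive when VMware Tools hangs or is being upgraded while the guest itself is still doing work.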

Which isolation response should I use?

Duncan Epping · May 31, 2012 ·

I wrote this article about split brain scenarios for the vSphere Blog. Based on this article I received some questions around which “isolation response” to use. This is not something that can be answered by a simple “recommended practice” and applied to all scenarios out there. Note that the answer has everything to do with your infrastructure. Are you using IP-based storage? Do you have a converged network? All of these impact the decision around the isolation response.

The following table however could be used to make a decision:

Likelihood that host will retain access to VM datastores	Likelihood that host will retain access to VM network	Recommended Isolation policy	Explanation
Likely	Likely	Leave Powered On	VM is running fine so why power it off?
Likely	Unlikely	Either Leave Powered On or Shutdown	Choose shutdown to allow HA to restart VMs on hosts that are not isolated and hence are likely to have access to storage
Unlikely	Likely	Power Off	Use Power Off to avoid having two instances of the same VM on the VM network
Unlikely	Unlikely	Leave Powered On or Power Off	Leave Powered on if the VM can recover from the network/datastore outage if it is not restarted because of the isolation, and Power Off if it likely can’t.
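If you want to bake the table into a script or a design checklist, it maps naturally onto a small lookup. The sketch below simply encodes the four rows above; the function name and the boolean inputs are my own choice for illustration.

```python
# Keys: (host likely keeps datastore access, host likely keeps VM network access)
ISOLATION_RESPONSE = {
    (True, True): "Leave Powered On",
    (True, False): "Leave Powered On or Shutdown",
    (False, True): "Power Off",
    (False, False): "Leave Powered On or Power Off",
}

def recommended_isolation_response(keeps_datastore_access, keeps_vm_network_access):
    """Look up the recommended isolation policy from the table above."""
    return ISOLATION_RESPONSE[(bool(keeps_datastore_access), bool(keeps_vm_network_access))]

# Example: IP-based storage on the management network, so an isolated host is
# likely to lose its datastores while the VM network stays up.
print(recommended_isolation_response(False, True))
```

Where the table lists two options, the final choice still depends on the considerations in the Explanation column.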