ha

How does “das.maxvmrestartcount” work?

Duncan Epping · Jun 30, 2010 ·

The amount of retries is configurable as of vCenter 2.5 U4 with the advanced option “das.maxvmrestartcount”. My colleague Hugo Strydom wrote about this a while ago and after a short discussion with one of our developers I realised Hugo’s article was not 100% correct. The default value is 5. Pre vCenter 2.5 U4 HA would keep retrying forever which could lead to serious problems as described in KB article 1009625 where multiple virtual machines would be registered on multiple hosts simultaneously leading to a confusing and inconsistent state. (http://kb.vmware.com/kb/1009625)

Important to note is that HA will try to start the virtual machine one of your hosts in the affected cluster; if this is unsuccessful on that host the restart count will be increased by 1. The first restart attempt will than occur after two minutes. If that one fails the next will occur after 4 minutes, and if that one fails the following will occur after 8 minutes until the “das.maxvmrestartcount” has been reached.

To make it more clear look at the following:

T+0 – Restart
T+2 – Restart retry 1
T+4 – Restart retry 2
T+8 – Restart retry 3
T+8 – Restart retry 4
T+8 – Restart retry 5

In other words, it could take up to 30 minutes before a successful restart has been initiated when using the default of “5” restarts max. If you increase that number, each following will also be “T+8” again.

DRS Sub Cluster? vSphere 4.next

Duncan Epping · Jun 21, 2010 ·

On the community forums a question was asked around Campus Clusters and pinning VMs to a specific set of hosts. In vSphere 4.0 that’s currently not possible unfortunately and it definitely is a feature that many customers would want to use.

Banjot Chanana revealed during VMworld that it was an upcoming feature but did not go into much details. However on the community forums, thanks @lamw for point this out, Elisha just revealed the following:

Controls will be available in the upcoming vSphere 4.1 release to enable this behavior. You’ll be able to set “soft” (ie. preferential) or “hard” (ie. strict) rules associating a set of vms with a set of hosts. HA will respect the hard rules and only failover vms to the appropriate hosts.

Basically DRS Host Affinity rules which VMware HA adheres to. Can’t wait for the upcoming vSphere version to be released and to figure out how all these nice “little” enhancements change our designs.

HA: Max amount of host failures?

Duncan Epping · Jun 18, 2010 ·

A colleague had a question around the maximum amount of host failures HA could take. The availability guide states the following:

The maximum Configured Failover Capacity that you can set is four. Each cluster has up to five primary hosts and if all fail simultaneously, failover of all hosts might not be successful.

However, when you select the “Percentage” admission control policy you can set it to 50% even when you have 32 hosts in a cluster. That means that the amount of failover capacity being reserved equals 16 hosts.

Although this is fully supported but there is a caveat of course. The amount of primary nodes is still limited to five. Even if you have the ability to reserve over 5 hosts as spare capacity that does not guarantee a restart. If, for what ever reason, half of your 32 hosts cluster fails and those 5 primaries happen to be part of the failed hosts your VMs will not restart. (One of the primary nodes coordinates the fail-over!) Although the “percentage” option enables you to save additional spare capacity there’s always the chance all primaries fail.

All in all, I still believe the Percentage admission control policy provides you more flexibility than any other admission control policy.

Which host is selected for an HA initiated restart?

Duncan Epping · Jun 16, 2010 ·

Got asked the following question today and thought it was valuable for everyone to know the answer to this:

How is a host selected for VM placement when HA restarts VMs from a failed host?

It’s actually a really simple mechanism. HA keeps track of the unreserved capacity of each host of the cluster. When a fail-over needs to occur the hosts are ordered. The host with the highest amount of unreserved capacity being the first option. Now to make it absolutely crystal clear, HA keeps track of the unreserved capacity and it is not DRS which does this. HA works completely independent of vCenter and as we all know DRS is part of vCenter. HA also works when DRS is disabled or unlicensed!

Now one thing to note is that HA will also verify if the host is compatible with the VM or not. What this means is that HA will verify if the VMs network is available on the target host and if the datastore is available on the target hosts. If both are the case a restart will be initiated on that host. To summarize:

Order available host based on unreserved capacity
Check compatibility (VM Network / Datastore)
Boot up!

VM Monitoring (aka VM HA) heartbeat

Duncan Epping · Jun 4, 2010 ·

I got a question around VM Monitoring (aka virtual machine level HA) this week. A customer wanted to test if VM Monitoring worked and as such disabled the NIC of the virtual machine and waited for 30 seconds for the VM Monitoring response to kick in…. nothing happened.

VM Monitoring restarts individual virtual machines when needed. VM monitoring uses a similar concept as HA, heartbeats. If heartbeats, and in this case VMware Tools heartbeats are not received for a specific amount of time the virtual machine will be rebooted. An example of when this will happen for instance is when a Windows virtual machine shows a BSOD.

The big question of course was why didn’t this trigger a response?

The answer is simple: The VMware Tools heartbeat does not use the virtual machine NIC. This heartbeat is “caught” by hostd and passed on to vCenter. vCenter uses this to show those “green/yellow/red” alarm dots. The same heartbeat is used by VM Monitoring to detect the failure of a virtual machine. Even without any NIC attached to your virtual machine these heartbeats will still be received.

One thing to keep in mind though is that when heartbeats are no longer received, by default sent out every second, VM Monitoring will check if there is any Network or Storage I/O to avoid false positives.

Question for you guys! One thing that I always wondered is how many people use VM Monitoring? And if you use it, do you use it on all VMs in every cluster?