Bilal Hashmi wrote a nice article about HA today and in this article he asked a couple of questions. As I think the info is useful for everyone I decided to respond through a blog article instead of by commenting.

Let me start by saying that in general HA should never be disabled. The later versions of vSphere have a neat option called “Enable Host Monitoring”. This option should be used for scheduled network maintenance. The difference between disabling host monitoring and disabling HA is that disabling host monitoring does not cause a full reconfiguration (see screenshot below) of HA and a new election process. Just the “host monitoring” functionality is disabled, which is what you want in this scenario.

Bilal asked multiple questions / made multiple statements in his article, I will respond to two of these specifically to explain the way HA handles failures/isolation:

In this case within 30 sec of the management network outage, each host would have declared itself isolated and wont attempt to restart any VMs like the primaries would in vSphere 5.

So why is this? As soon as a Master is isolated it will drop “ownership” of datastores on which VMs are running that are part of its cluster. Before the other hosts trigger the isolation response for a given VM they will validate if the datastore on which this VM is stored is “owned” by a master. In the case of a cluster wide isolation due to a network outage / maintenance the ownership would be dropped and this would result in HA not triggering the isolation response. This is a major change compared to vSphere 4.x and prior!

Now what happens when the network outage is over and the hosts are in a position to talk to each other? I have not been able to find documentation on whether an isolated host will enter an election (vSphere 4 or 5) ones the communication channel is open and bring the cluster back to life.

Lets focus on vSphere 5.0 as that seems most relevant. A host remains isolated until it observes HA network traffic, like for instance election messages OR it starts getting a response from an isolation address. Meaning that as long as the host is in “isolated state” it will continue to validate its isolation by pinging the isolation address. As soon as the isolation address responds it will initiate an election process or join an existing election process and the cluster will return to a normal state.

There’s absolutely no need to manually intervene. HA takes care of all of this for you.