Bilal Hashmi wrote a nice article about HA today, and in this article he asked a couple of questions. As I think the info is useful for everyone, I decided to respond through a blog article instead of by commenting.
Let me start by saying that in general HA should never be disabled. Later versions of vSphere have a neat option called “Enable Host Monitoring”. Unchecking this option is the way to go for scheduled network maintenance. The difference between disabling host monitoring and disabling HA is that disabling host monitoring does not cause a full reconfiguration of HA (see screenshot below) and a new election process. Only the “host monitoring” functionality is disabled, which is what you want in this scenario.
Bilal asked multiple questions and made multiple statements in his article. I will respond to two of these specifically to explain the way HA handles failures and isolation:
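To make the difference concrete, here is a minimal sketch that simply restates the two options as data. The action names and the `Effect` fields are made up for illustration and are not part of any vSphere API. The values are my reading of the paragraph above, plus the point made further down in the comments that no failover happens while host monitoring is disabled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    full_reconfiguration: bool  # HA is fully reconfigured as a result of the change
    new_election: bool          # a new election process is needed afterwards
    failover_active: bool       # HA still restarts VMs while the setting is in place


# Hypothetical action names (assumption); values restate the article, not measured behavior.
EFFECTS = {
    "disable_ha": Effect(full_reconfiguration=True, new_election=True, failover_active=False),
    "disable_host_monitoring": Effect(full_reconfiguration=False, new_election=False, failover_active=False),
}


def safer_for_network_maintenance() -> str:
    """Pick the action that avoids a reconfiguration and a new election."""
    return min(
        EFFECTS,
        key=lambda action: (EFFECTS[action].full_reconfiguration, EFFECTS[action].new_election),
    )


if __name__ == "__main__":
    print(safer_for_network_maintenance())
```

Note that neither option leaves failover active. The only thing you gain with host monitoring is that HA itself stays configured.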
“In this case, within 30 sec of the management network outage, each host would have declared itself isolated and won’t attempt to restart any VMs like the primaries would in vSphere 5.”
So why is this? As soon as a Master is isolated it will drop “ownership” of datastores on which VMs are running that are part of its cluster. Before the other hosts trigger the isolation response for a given VM they will validate if the datastore on which this VM is stored is “owned” by a master. In the case of a cluster wide isolation due to a network outage / maintenance the ownership would be dropped and this would result in HA not triggering the isolation response. This is a major change compared to vSphere 4.x and prior!
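The ownership check described above can be sketched roughly as follows. This is only an illustration of the decision, assuming a simplified model in which each datastore either has an owning master or has none. The class and function names are hypothetical and do not correspond to actual HA agent internals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Datastore:
    name: str
    owner: Optional[str] = None  # name of the master that "owns" it; None once ownership is dropped


@dataclass
class VM:
    name: str
    datastore: Datastore


def should_trigger_isolation_response(vm: VM) -> bool:
    """An isolated host only acts on a VM when a master still owns the VM's datastore."""
    return vm.datastore.owner is not None


def vms_to_act_on(vms: List[VM]) -> List[str]:
    """Names of the VMs for which the isolation response would be triggered."""
    return [vm.name for vm in vms if should_trigger_isolation_response(vm)]


if __name__ == "__main__":
    owned = Datastore("ds1", owner="esx01")  # a master is still reachable and owns ds1
    dropped = Datastore("ds2")               # isolated master dropped ownership of ds2
    print(vms_to_act_on([VM("web", owned), VM("db", dropped)]))
```

In the cluster-wide outage scenario every datastore would look like `ds2` in this model, so the list stays empty and no VM is touched.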
“Now what happens when the network outage is over and the hosts are in a position to talk to each other? I have not been able to find documentation on whether an isolated host will enter an election (vSphere 4 or 5) once the communication channel is open and bring the cluster back to life.”
Let’s focus on vSphere 5.0, as that seems most relevant. A host remains isolated until it observes HA network traffic, for instance election messages, OR it starts getting a response from an isolation address. This means that as long as the host is in the “isolated” state, it will continue to validate its isolation by pinging the isolation address. As soon as the isolation address responds, it will initiate an election process or join an existing election process, and the cluster will return to a normal state.
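The recovery behavior described above boils down to a very small state transition. Below is a hedged sketch of it. The state names, the function names and the idea of feeding in a list of observations are assumptions made for the example, not how the HA agent is actually implemented.

```python
from enum import Enum
from typing import Iterable, Optional, Tuple


class HostState(Enum):
    ISOLATED = "isolated"
    ELECTION = "election"


def next_state(state: HostState, sees_ha_traffic: bool, isolation_address_responds: bool) -> HostState:
    """Leave the isolated state as soon as HA traffic is seen OR the isolation address answers."""
    if state is HostState.ISOLATED and (sees_ha_traffic or isolation_address_responds):
        return HostState.ELECTION
    return state


def checks_until_election(observations: Iterable[Tuple[bool, bool]]) -> Optional[int]:
    """Count how many isolation checks pass before the host starts or joins an election.

    Each observation is a (sees_ha_traffic, isolation_address_responds) pair.
    Returns None if the host is still isolated after the last observation.
    """
    state = HostState.ISOLATED
    for count, (traffic, ping_ok) in enumerate(observations, start=1):
        state = next_state(state, traffic, ping_ok)
        if state is HostState.ELECTION:
            return count
    return None


if __name__ == "__main__":
    # Two failed checks during the outage, then the isolation address responds again.
    print(checks_until_election([(False, False), (False, False), (False, True)]))
```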
There’s absolutely no need to manually intervene. HA takes care of all of this for you.
Bilal Hashmi says
Duncan,
Thank you so much, I completely forgot about the host monitoring feature. You are absolutely right, that would certainly be the best way to go. And thank you for clarifying the other questions. Great to know that HA will auto adjust when the time is right. Awesome!! Just for curiosity’s sake, will it do the same in vSphere 4 as well?
Jason Boche says
Great clarifications and “what-if” real world examples. Years of working with vSphere 4 and prior HA rules are still firmly embedded into my brain. I’m not finding it easy to let go of that information to make room for vSphere 5 HA rules of the road, especially while vSphere 4 is still relevant & the current platform of most customers I talk to. I find it both refreshing and exciting that every once in a while I’ll talk to a customer who is completely migrated to vSphere 5 already.
Bilal Hashmi says
Duncan,
One more question, you said:
“As soon as a Master is isolated it will drop “ownership” of datastores on which VMs are running that are part of its cluster. Before the other hosts trigger the isolation response for a given VM they will validate if the datastore on which this VM is stored is “owned” by a master. In the case of a cluster wide isolation due to a network outage / maintenance the ownership would be dropped and this would result in HA not triggering the isolation response.”
So the hosts will not be triggering an isolation response but will they still be updating their status as isolated in the poweron file?
Also, it sounds like in vSphere 5, the isolation response doesn’t matter in case of complete isolation where no host can talk to one another, including the master.
So I guess once the hosts are unable to ping the isolation address, they will update their poweron file with the correct status and at that point check to see if the master has ownership of the VM’s datastore. If yes, they trigger the isolation response. If not, they just sit and smile.
So the datastore ownership check is another added step that’s taken prior to triggering the isolation response. Did I understand that correctly?
Loren Gordon says
Duncan, this is great info, thanks! I started thinking about a scenario recently and was wondering how HA would react. Perhaps you know the answer, or know if someone else has already studied it.
With an HA cluster design where all the hosts’ management interfaces are on a distributed switch and all those interfaces are connected to the same dvPG, what does HA do if someone [by accident or otherwise] changes the VLAN # of the dvPG?
If I understand, HA should lose connectivity to the isolation address and initiate the isolation response (which defaults to shut down the VMs). Is that right? I also believe this would be rather difficult to recover from because there wouldn’t be any remote connectivity to the hosts, so it would be necessary to connect to each host’s console and reconfigure the mgmt interface to link to a vSS.
If all that is correct, what options do we have to mitigate this risk?
Cheers,
-Loren
Forbsy says
Nice explanation. Thanks.
kopper says
Hi, my question on this topic is whether unchecking “Enable Host Monitoring” during network maintenance really is the best way. My doubt is:
Will HA restart the VMs on the other hosts if one of the hosts has a physical problem and reboots or shuts down because of it?
thanks
Duncan says
When disabled, nothing happens.
kopper says
So that means HA is effectively disabled, and “Enable Host Monitoring” is just a click that avoids completely disabling HA. That’s too bad, I was thinking the VMs could still be restarted on other hosts… pray to God nothing happens during maintenance.
hariprajan says
Duncan,
I was just trying the Host Monitoring option. Let me tell you what I have observed; please let me know whether I have this concept right, and tell me what will work and what will not work.
Disabling host monitoring disables HA elections on the hosts in a cluster, so we can reduce the overhead in the cluster while doing host maintenance, but the HA failover feature continues to function when host monitoring is disabled as well.
What will not work?
– HA election won’t happen, but we will still get the failover feature
Other than that, is there anything else that will not work here?
Regards
Hari Rajan
Duncan Epping says
When you disable “host monitoring” then HA should not do ANYTHING when it comes to failing over a VM. So this is only to be used for maintenance scenarios.