A couple of months ago I wrote this article about a future feature that would enable HA to recover from a split-brain scenario. vSphere 4.0 Update 2 was recently released, but neither the release notes nor the documentation mention this new feature.
I had not noticed this until I discussed the feature with one of my colleagues. I asked our HA Product Manager and one of our developers, and it appears that it mysteriously slipped out of the release notes. As I personally believe that this is a very important feature of HA, I wanted to rehash some of the info from that article. I did rewrite it slightly though. Here we go:
One of the most common issues experienced in an iSCSI/NFS environment with VMware HA before vSphere 4.0 Update 2 is a split-brain situation.
First let me explain what a split-brain scenario is. Let's start by describing the situation in which it is most commonly encountered:
- 4 Hosts
- iSCSI / NFS based storage
- Isolation response: leave powered on
When one of the hosts is completely isolated, including the storage network, the following will happen:
- Host ESX001 is completely isolated, including the storage network (remember, iSCSI/NFS based storage!), but the VMs will not be powered off because the isolation response is set to “leave powered on”.
- After 15 seconds the remaining, non-isolated hosts will try to restart the VMs.
- Because the iSCSI/NFS network is also isolated, the lock on the VMDK will time out and the remaining hosts will be able to boot up the VMs.
- When ESX001 returns from isolation it will still have the VMX processes running in memory, and this is when you will see a “ping-pong” effect within vCenter; in other words, VMs flipping back and forth between ESX001 and any of the other hosts.
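To make the sequence above a bit more tangible, here is a minimal sketch in Python. It is a toy model only: the `Host` class, the `simulate_isolation` function and the VM name `vm1` are made up for illustration and have nothing to do with how HA or the VMkernel is actually implemented. Timings such as the 15 seconds are left out; only the order of events is modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for an ESX host; not a VMware API object."""
    name: str
    running: set = field(default_factory=set)


def simulate_isolation(isolation_response="leave_powered_on"):
    """Walk through the four steps and return the hosts running vm1 at the end."""
    esx001 = Host("ESX001", running={"vm1"})
    esx002 = Host("ESX002")
    lock_owner = {"vm1": "ESX001"}  # which host holds the lock on the VMDK

    # Step 1: ESX001 is isolated, including the storage network.
    if isolation_response == "power_off":
        esx001.running.discard("vm1")

    # Step 2: ESX001 cannot reach the storage, so its lock times out.
    lock_owner["vm1"] = None

    # Step 3: a non-isolated host sees no lock and restarts the VM.
    if lock_owner["vm1"] is None:
        esx002.running.add("vm1")
        lock_owner["vm1"] = esx002.name

    # Step 4: ESX001 returns from isolation; whatever it kept in memory is still there.
    return sorted(h.name for h in (esx001, esx002) if "vm1" in h.running)


print(simulate_isolation("leave_powered_on"))
print(simulate_isolation("power_off"))
```

With “leave powered on” the model ends with the same VM registered as running on two hosts, which is the split brain described above. With “power off” only the restarted copy remains.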
As of version 4.0 Update 2 ESX(i) detects that the lock on the VMDK has been lost and issues a question which is automatically answered. The VM will be powered off to recover from the split-brain scenario and to avoid the ping-pong effect. Please note that HA will generate an event for this auto-answer which is viewable within vCenter.
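The recovery logic can be sketched in the same toy fashion. Again, this is an assumption-laden illustration and not the real mechanism: `recover_from_split_brain`, the dictionaries and the event text are all hypothetical names. The only thing taken from the description above is the rule itself: a host that finds it no longer holds the lock on a VM's VMDK powers that VM off and an event is logged.

```python
def recover_from_split_brain(running_on, lock_owner):
    """Power off every VM copy whose host no longer holds the VMDK lock.

    running_on: dict mapping host name -> set of VM names running in memory
    lock_owner: dict mapping VM name -> host name currently holding the lock
    Returns a list of event strings (hypothetical wording).
    """
    events = []
    for host, vms in running_on.items():
        # VMs this host still runs but whose lock now belongs to another host.
        stale = {vm for vm in vms if lock_owner.get(vm) != host}
        for vm in sorted(stale):
            events.append(f"{host}: lock on {vm} lost, question auto-answered, VM powered off")
        vms.difference_update(stale)
    return events


# ESX001 has returned from isolation; ESX002 restarted vm1 and owns the lock.
running_on = {"ESX001": {"vm1"}, "ESX002": {"vm1"}}
lock_owner = {"vm1": "ESX002"}

for event in recover_from_split_brain(running_on, lock_owner):
    print(event)
print(running_on)
```

After the call only ESX002 still runs `vm1`, so there is nothing left to ping-pong.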
Don’t you just love VMware HA!
Charlie says
Duncan –
Nice post as always. I question, though, whether or not this will really work. I also wonder why VMware would even ask the question when there is only one option for the response and no one would ever want to cancel the question….
Thanks for the post!
Charlie
NiTRo says
Hi Duncan, this is great news, but is it the AAM agent that has been updated, or is it vCenter 4.0 U2 that provides this new feature (which in that case would also work with U1, I guess)?
Brad Clarke says
“I asked our HA Product Manager and one of our developers and it appears that this mysteriously has slipped.”
Does that mean it is or is not part of 4.0u2?
Kelly Wruck says
Thank you for the enthralling post. May I ask where you get your information from?
Greg says
So when the first host fails and the VMs have been started elsewhere, if you could get to the console through iLO or something, you could reboot the failed host and prevent this from happening? As long as you can do this before the host comes back, I suppose.
Duncan Epping says
@Brad: “it” refers to the release notes. So it is in the release but not part of the release notes.
@Charlie: yes it does work! If it didn’t, it wouldn’t be part of the release. Keep in mind there’s a huge QA process that every single piece of code undergoes. The question was asked pre-U2 and could be ignored; when it timed out, a power off would not occur. This has been changed in U2 as VMware believes a power off is the only way to go.
@Nitro: it is not the AAM agent but most likely the VMkernel/hostd, as the AAM agent doesn’t directly communicate with VMs. This just monitors if the VMDKs are still locked or not!
@Greg: Not sure what your point is here? This actually PREVENTS getting into a split-brain scenario!
Greg says
For hosts prior to U2, I should have mentioned!
Methone says
Great as always…
Is there no longer a need to have a second service console/VMkernel for this scenario?
Duncan Epping says
@methone: there is always a need to have a second service console in my opinion, as that is the way to prevent false positives. What if only your primary service console is isolated and not your VMs?
@greg: that would indeed be a way to recover from it, correct!
Charlie says
Duncan –
Thanks for the reply. I am still a little confused why VMware would add a feature that corrects an issue after it occurs rather than taking the action specified by the user. By this I mean that when someone selects “power off on isolation”, one should be able to safely rely on the feature, regardless of access to the storage (e.g. killing the VMX process on the host seems logical).
Implementing a work-around that corrects the issue after the isolated host regains network connectivity seems a little flawed to me.
Just my two cents.
Charlie
duncan says
HUH?
If you select “power off” it will power the VM off, but if you select “leave powered on” it will not power it off. It does exactly what it is supposed to do.
However, as an isolated storage network could lead to a time-out of the lock on the VMDK, the other hosts will be able to power it up. Then this new mechanism kicks in and ensures that the split brain is recovered. Basically VMware cleans up the mess of a bad design / implementation.
Matthijs Haverink says
Hey Duncan,
Great article. Just a minor hint: in this article (by yourself):
http://www.yellow-bricks.com/2010/07/15/vmware-view-without-ha/
you mention that this feature actually isn’t part of HA itself and will work with or without HA enabled.
I think it would be useful to mention that in this article too, that’s all :).