I already knew this was coming up but wasn’t allowed to talk about it. Now that it is out in the open on the VMTN community, I guess I can talk about it as well.
One of the most common issues experienced with VMware HA is a split brain situation. Although currently undocumented, vSphere has a detection mechanism for these situations. Even more important, the upcoming release, ESX 4.0 Update 2, will also automatically prevent it!
First let me explain what a split brain scenario is. Let's start by describing the situation that is most commonly encountered:
4 Hosts – iSCSI / NFS based storage – Isolation response: leave powered on
When one of the hosts is completely isolated, including the Storage Network, the following will happen:
Host ESX001 is completely isolated, including the storage network (remember, iSCSI/NFS based storage!), but the VMs will not be powered off because the isolation response is set to “leave powered on”. After 15 seconds the remaining, non-isolated hosts will try to restart the VMs. Because the iSCSI/NFS network is also isolated, the lock on the VMDK will time out and the remaining hosts will be able to boot up the VMs. When ESX001 returns from isolation it will still have the VMX processes running in memory. This is when you will see a “ping-pong” effect within vCenter, in other words VMs flipping back and forth between ESX001 and any of the other hosts.
As of version 4.0, ESX(i) detects that the lock on the VMDK has been lost and asks whether the VM should be powered off or not. Please note that you will (currently) only see this question if you connect directly to the ESX host. Below you can find a screenshot of this question.
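To make the sequence above concrete, here is a toy Python model of it. This is only a sketch under simplifying assumptions: the `Host` class and the `isolate_and_restart` function are made up for illustration and are not part of any VMware API, and the VMDK lock is reduced to a single owner name.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Host:
    """Hypothetical stand-in for an ESX host and the VMs it has in memory."""
    name: str
    vms: Set[str] = field(default_factory=set)


def isolate_and_restart(isolated: Host, survivor: Host, vm: str,
                        storage_isolated: bool,
                        isolation_response: str) -> Optional[str]:
    """Return the name of the host that ends up holding the disk lock."""
    lock_owner: Optional[str] = isolated.name  # the running VM holds the lock

    if isolation_response != "leave powered on":
        isolated.vms.discard(vm)  # VM is stopped, so the lock is released
        lock_owner = None
    elif storage_isolated:
        lock_owner = None  # lock cannot be refreshed and times out

    # The remaining hosts try a restart; it only works if the disk is unlocked.
    if lock_owner is None:
        survivor.vms.add(vm)
        lock_owner = survivor.name
    return lock_owner


esx001 = Host("ESX001", {"vm01"})
esx002 = Host("ESX002")
owner = isolate_and_restart(esx001, esx002, "vm01",
                            storage_isolated=True,
                            isolation_response="leave powered on")
split_brain = "vm01" in esx001.vms and "vm01" in esx002.vms
print(owner, split_brain)  # prints: ESX002 True
```

With `storage_isolated=False` the lock stays with ESX001 in this model, the restart does not happen, and the VM only runs in one place.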
With ESX 4 update 2 the question will be auto-answered though and the VM will be powered off to avoid the ping-pong effect and a split brain scenario! How cool is that…
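The difference between the two behaviours can be sketched in a few lines of Python. Again, this is a toy model and not how ESX implements it: `handle_lost_lock` and its parameters are hypothetical names, and the `auto_answer` flag simply stands in for the pre-Update 2 and Update 2 behaviour described above.

```python
from typing import Set


def handle_lost_lock(local_vms: Set[str], vm: str, host: str,
                     lock_owner: str, auto_answer: bool) -> str:
    """Decide what a host returning from isolation does with one VM."""
    # Nothing to do if the VM is not running here or we still own the lock.
    if vm not in local_vms or lock_owner == host:
        return "no action"
    if auto_answer:
        # Update 2 style: answer the question automatically and power off.
        local_vms.discard(vm)
        return "powered off"
    # Pre-Update 2 style: the question waits for someone to answer it.
    return "question pending"


running = {"vm01"}
print(handle_lost_lock(running, "vm01", "ESX001", "ESX002", auto_answer=False))
print(handle_lost_lock(running, "vm01", "ESX001", "ESX002", auto_answer=True))
```

In the first call the stale copy keeps running until the question is answered; in the second it is removed from the returning host, which is what stops the ping-pong effect.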
Jason Boche says
“With ESX 4 update 2 the question will be auto-answered though and the VM will be powered off to avoid the ping-pong effect and a split brain scenario!”
Is it a configurable option to auto-answer the question? In other words, if a customer does not want the question auto-answered, can that behavior be toggled?
Rob Mokkink says
I always leave the response set to power off the VM, because I use FC-based storage.
Arnim van Lieshout says
What will happen to the automatically powered off VM?
Will it be automatically deregistered from the host that failed?
David Owen says
Great to see this feature. It's not the most common issue in the world, but it was always something in the back of my mind when deploying HA.
Frank Brix Pedersen says
How do you directly connect to the ESX host if the network is lost and the host is isolated? 😉 I think it is great it auto answers yes.
Arkadiusz Krowczynski says
Any timetable when Update 2 will arrive for us?
Johan says
If you are on iSCSI/NFS, why not set the isolation address to the storage and let them power off? What would be better is if HA checked the isolation address first and then the other hosts (in the case of iSCSI/NFS), so if iSCSI/NFS is dead it just powers off the VMs so another host can power them on without the risk of duplicated VMs.
Johan says
Frank: By having several SC/management ports, one on the iSCSI/NFS network and one on the normal management network.
Frank Brix Pedersen says
Johan: If that was the case your host would never be isolated in the first place.
It was purely a rhetorical question.
Johan says
Frank: well it depends what isolation network you choose…
Paul Geerlings says
Is that why they set the default isolation response on ESX4 HA back to “Shutdown”?
Will they be changing the default again with ESX4 Update 2?
Craig says
How do the hosts handle a split brain when the underlying storage is FC? Or in the case where the storage network is still available?
Duncan Epping says
Then the VMDK would be locked and the VM wouldn’t be restarted so a split brain can’t occur.