I already knew this was coming up but wasn’t allowed to talk about it. As it is out in the open on the VMTN community I guess I can talk about it as well.
One of the most common issues experienced with VMware HA is a split brain situation. Although currently undocumented, vSphere has a detection mechanism for these situations. Even more important the upcoming release ESX 4.0 Update 2 will also automatically prevent it!
First let me explain what a split brain scenario is, lets start with describing the situation which is most commonly encountered:
4 Hosts – iSCSI / NFS based storage – Isolation response: leave powered on
When one of the hosts is completely isolated, including the Storage Network, the following will happen:
Host ESX001 is completely isolated including the storage network(remember iSCSI/NFS based storage!) but the VMs will not be powered off because the isolation response is set to “leave powered on”. After 15 seconds the remaining, non isolated, hosts will try to restart the VMs. Because of the fact that the iSCSI/NFS network is also isolated the lock on the VMDK will time out and the remaining hosts will be able to boot up the VMs. When ESX001 returns from isolation it will still have the VMX Processes running in memory. This is when you will see a “ping-pong” effect within vCenter, in other words VMs flipping back and forth between ESX001 and any of the other hosts.
As of version 4.0 ESX(i) detects that the lock on the VMDK has been lost and issues a question if the VM should be powered off or not. Please note that you will(currently) only see this question if you directly connect to the ESX host. Below you can find a screenshot of this question.
With ESX 4 update 2 the question will be auto-answered though and the VM will be powered off to avoid the ping-pong effect and a split brain scenario! How cool is that…