
Yellow Bricks

by Duncan Epping


Partition / Isolation and VM flip flopping between hosts?

Duncan Epping · May 16, 2016 ·

Last week I was talking to one of our developers at our R&D offsite. He has a situation where he saw his VM flip flopping between two hosts when he was testing a certain failure scenario and he wondered why that was. In his case he had a 2 node cluster connected to vCenter Server and a bunch of VMs running on just 1 host. All of the VMs were running off iSCSI storage. When looking at vCenter he literally would see his VMs on host 1 and a split second later on host 2, and this would go on continuously. I have written about this behaviour before, but figured it never hurts to repeat it as not everyone goes back 2-3 years to read up on certain scenarios.

In the above diagram you see a VM running on the first host. vCenter Server is connected to both hosts through Network A, the datastore being used sits on Network C, and the host management network runs over Network B. Now imagine that Network B disappears for whatever reason. The hosts can no longer ping each other. Although this is an isolation, the VMs will still have access to the shared datastore, and depending on how the isolation response is configured the VMs may or may not be restarted. Either way, as the datastore is still there, even if the isolation response is set to “disabled” / “leave powered on” the VM will not be restarted on the second host: the VM’s files are locked through that datastore, and you cannot have two locks on those files.
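The “you cannot have two locks” point can be illustrated with POSIX advisory locks. This is only a rough analogy for the VMFS/NFS file locks HA relies on, not the actual mechanism; the file name is made up:

```python
import fcntl
import os
import tempfile

# Analogy only: like the lock on a VM's files, an exclusive advisory
# lock can be held by one owner at a time. (Hypothetical lock file name.)
path = os.path.join(tempfile.mkdtemp(), "vm.vmdk.lck")
open(path, "w").close()

f1 = open(path, "r+")
fcntl.flock(f1, fcntl.LOCK_EX | fcntl.LOCK_NB)  # "host 1" holds the lock

f2 = open(path, "r+")  # independent open file description
try:
    # "host 2" tries to power on the same VM while the lock is held
    fcntl.flock(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    second_lock_acquired = True
except BlockingIOError:
    second_lock_acquired = False

print(second_lock_acquired)  # False on Linux: the second lock is refused
```

Only when the first owner disappears (or, in the NFS/iSCSI case, its lock times out) can the second owner acquire the lock, which is exactly the window the next scenario exploits.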

Now if Networks B and C fail simultaneously, this could potentially pose a problem. Just imagine this to be the case. The hosts can still communicate with vCenter Server, but they cannot communicate with each other (an isolation event will be triggered if configured), and the VM will lose access to storage (Network C is down). If the isolation response is set to “disabled” or “leave powered on”, the VM on the first host will remain running, but the second host has noticed that the first host is isolated, no longer sees the VM, and finds the lock on its files gone, so it is capable of restarting that VM. Both hosts, however, are still connected to vCenter Server and will keep sending their inventory updates… And that is when you will see the VM flip flopping (also sometimes referred to as ping-ponging) between those hosts.
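The flip-flopping itself is just “last report wins” in the inventory. A toy model (not vCenter’s actual sync protocol) of two hosts that each believe they run the same VM:

```python
# Toy model: both hosts keep reporting the same VM to vCenter, so the
# inventory entry flips owner on every update. Names are made up.
reports = [("host1", "vm01"), ("host2", "vm01")] * 3  # alternating updates

inventory = {}
observed_hosts = []
for host, vm in reports:
    inventory[vm] = host            # last report wins
    observed_hosts.append(inventory[vm])

# Count how often the VM's apparent owner changed between updates
flip_flops = sum(1 for a, b in zip(observed_hosts, observed_hosts[1:]) if a != b)
print(flip_flops)  # 5 owner changes across 6 updates
```

Every inventory refresh shows the VM on a different host, which is the continuous flipping described above.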

And this, this is exactly why:

  1. It is recommended to configure an isolation response based on the likelihood of a situation like this occurring.
  2. If you have vSphere 6.0 or higher, you should enable the APD/PDL responses, so that the VM running on the first host will be killed when storage is gone.
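The scenarios above boil down to a small decision table. The sketch below is just the article’s reasoning written out, not the real HA agent logic, and the response/outcome strings are made up labels:

```python
# Simplified decision table for the two failure scenarios described
# above. A sketch of the reasoning, not actual vSphere HA behaviour.
def ha_outcome(mgmt_down: bool, storage_down: bool, isolation_response: str) -> str:
    if not mgmt_down:
        return "no isolation"
    if not storage_down:
        # Datastore still reachable: the file lock is intact, so no
        # second copy of the VM can be started elsewhere.
        if isolation_response == "power off":
            return "VM powered off, lock released, restarted elsewhere"
        return "VM keeps running; lock prevents a second copy"
    # Management and storage both down: the lock times out, another host
    # restarts the VM, and the original host still reports its copy.
    if isolation_response in ("disabled", "leave powered on"):
        return "split brain: VM flip-flops between hosts"
    return "VM powered off on the isolated host, restarted elsewhere"

print(ha_outcome(True, True, "leave powered on"))
```

Note how only the combination of both networks failing plus “leave powered on” / “disabled” produces the flip-flop, which is why the two recommendations above matter.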

I hope this helps…

vSphere 4 U2 and recovering from HA Split Brain

Duncan Epping · Jul 2, 2010 ·

A couple of months ago I wrote this article about a future feature that would enable HA to recover from a split-brain scenario. vSphere 4.0 Update 2 was recently released, but neither the release notes nor the documentation mentioned this new feature.

I had never noticed this until I was having a discussion about this feature with one of my colleagues. I asked our HA Product Manager and one of our developers, and it appears this mysteriously slipped out of the release notes. As I personally believe this is a very important HA feature, I wanted to rehash some of the info from that article. I did rewrite it slightly though. Here we go:

One of the most common issues experienced in an iSCSI/NFS environment with VMware HA pre vSphere 4.0 Update 2 is a split brain situation.

First let me explain what a split-brain scenario is. Let’s start by describing the situation which is most commonly encountered:

  • 4 Hosts
  • iSCSI / NFS based storage
  • Isolation response: leave powered on

When one of the hosts is completely isolated, including the Storage Network, the following will happen:

  1. Host ESX001 is completely isolated, including the storage network (remember: iSCSI/NFS based storage!), but the VMs will not be powered off because the isolation response is set to “leave powered on”.
  2. After 15 seconds the remaining, non-isolated, hosts will try to restart the VMs.
  3. Because the iSCSI/NFS network is also isolated, the lock on the VMDK will time out and the remaining hosts will be able to boot up the VMs.
  4. When ESX001 returns from isolation it will still have the VMX processes running in memory, and this is when you will see a “ping-pong” effect within vCenter; in other words, VMs flipping back and forth between ESX001 and any of the other hosts.

As of version 4.0 Update 2, ESX(i) detects that the lock on the VMDK has been lost and issues a question, which is automatically answered. The VM will be powered off to recover from the split-brain scenario and to avoid the ping-pong effect. Please note that HA will generate an event for this auto-answer, which is viewable within vCenter.
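The recovery behaviour can be summarized in a few lines. This is a hypothetical model of the rule just described, not real ESX(i) code:

```python
# Sketch of the 4.0 Update 2 recovery rule described above: a host that
# has lost the lock on a running VM's files gets the "power off?"
# question auto-answered. Hypothetical model, not real ESX(i) code.
def split_brain_recovery(pre_update2: bool, lock_lost: bool) -> str:
    if not lock_lost:
        return "keep running"
    if pre_update2:
        # Before Update 2 the question sat waiting for an admin
        # connected directly to the host.
        return "question pending (manual answer required)"
    # Update 2 and later: auto-answered, VM killed, event logged.
    return "auto-answered: power off"

print(split_brain_recovery(False, True))
```

The stale copy on the returning host is powered off, so only the restarted copy remains and the ping-pong stops.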

Don’t you just love VMware HA!

Cool new HA feature coming up to prevent a split brain situation!

Duncan Epping · Mar 29, 2010 ·

I already knew this was coming up but wasn’t allowed to talk about it. As it is out in the open on the VMTN community, I guess I can talk about it as well.

One of the most common issues experienced with VMware HA is a split-brain situation. Although currently undocumented, vSphere has a detection mechanism for these situations. Even more importantly, the upcoming ESX 4.0 Update 2 release will also automatically prevent it!

First let me explain what a split-brain scenario is. Let’s start by describing the situation which is most commonly encountered:

  • 4 Hosts
  • iSCSI / NFS based storage
  • Isolation response: leave powered on

When one of the hosts is completely isolated, including the Storage Network, the following will happen:

  1. Host ESX001 is completely isolated, including the storage network (remember: iSCSI/NFS based storage!), but the VMs will not be powered off because the isolation response is set to “leave powered on”.
  2. After 15 seconds the remaining, non-isolated, hosts will try to restart the VMs.
  3. Because the iSCSI/NFS network is also isolated, the lock on the VMDK will time out and the remaining hosts will be able to boot up the VMs.
  4. When ESX001 returns from isolation it will still have the VMX processes running in memory. This is when you will see a “ping-pong” effect within vCenter; in other words, VMs flipping back and forth between ESX001 and any of the other hosts.

As of version 4.0, ESX(i) detects that the lock on the VMDK has been lost and issues a question asking whether the VM should be powered off. Please note that you will (currently) only see this question if you connect directly to the ESX host. Below you can find a screenshot of this question.

With ESX 4.0 Update 2 the question will be auto-answered, though, and the VM will be powered off to avoid the ping-pong effect and a split-brain scenario! How cool is that…
