
Yellow Bricks

by Duncan Epping


BC-DR

Partition / Isolation and VM flip flopping between hosts?

Duncan Epping · May 16, 2016 ·

Last week I was talking to one of our developers at our R&D offsite. He had run into a situation where he saw his VM flip-flopping between two hosts while testing a certain failure scenario, and he wondered why that was. In his case he had a 2-node cluster connected to vCenter Server and a bunch of VMs running on just 1 host. All of the VMs were running off iSCSI storage. When looking at vCenter he would literally see his VMs on host 1 and a split second later on host 2, and this would go on continuously. I have written about this behaviour before, but figured it never hurts to repeat it, as not everyone goes back 2-3 years to read up on certain scenarios.

In the above diagram you see a VM running on the first host. vCenter Server is connected to both hosts through Network A, the datastore being used is on Network C, and the host management network is connected through Network B. Now just imagine that Network B is, for whatever reason, gone. The hosts won't be able to ping each other any longer. Although this is an isolation, the VMs will still have access to the shared datastore, and depending on how the isolation response is configured the VMs may or may not be restarted. Either way, as the datastore is still accessible, even if the isolation response is set to "disabled" / "leave powered on" the VM will not be restarted on the second host, as the VM's files are locked through that datastore and you cannot have two locks on those files.

Now if Network B and Network C are gone simultaneously, this could potentially pose a problem. Just imagine this to be the case. The hosts are still able to communicate with vCenter Server, however they cannot communicate with each other (the isolation response will be triggered, if configured), and the VM will lose access to its storage (Network C is down). If no isolation response was configured (disabled or leave powered on), the VM on the first host will remain running. But the second host has noticed that the first host is isolated, no longer sees the VM, and sees that the lock on those files is gone, so it is capable of restarting that VM. Both hosts, however, are still connected to vCenter Server and will send their updates to vCenter Server with regards to the inventory they are running… And that is when you will see the VM flip-flopping (also sometimes referred to as ping-ponging) between those hosts.
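If you want to observe this symptom from the API side rather than by staring at the vCenter inventory, a quick polling loop will show the registered host alternating. Below is a minimal pyVmomi sketch; the vCenter address, credentials and VM name are made-up placeholders.

```python
# Illustrative only: poll which host vCenter reports a VM as registered on.
# The vCenter address, credentials and VM name below are made-up placeholders.
import ssl
import time

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

# Look up the VM by name via a container view
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "test-vm-01")
view.DestroyView()

# During the partition scenario described above, the host the VM is
# registered on can alternate between the two hosts every few seconds.
for _ in range(30):
    print(time.strftime("%H:%M:%S"), vm.runtime.host.name)
    time.sleep(2)

Disconnect(si)
```

In a healthy cluster the printed host name should of course be stable.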

And this scenario is exactly why:

  1. It is recommended to configure an isolation response based on the likelihood of a situation like this occurring
  2. If you have vSphere 6.0 or higher, you should enable the APD/PDL responses, so that the VM running on the first host will be killed when storage is gone (see the sketch below this list)
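
To give an idea of what that looks like programmatically, below is a minimal pyVmomi sketch that sets the isolation response and enables the vSphere 6.0 VM Component Protection (APD/PDL) responses on a cluster. The vCenter address, credentials and cluster name are placeholders, and you should pick the response values that match your own design.

```python
# Sketch: configure the HA isolation response and enable the APD/PDL responses
# (VM Component Protection, vSphere 6.0+) on a cluster. The vCenter address,
# credentials and cluster name are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster-01")
view.DestroyView()

das = vim.cluster.DasConfigInfo()
das.enabled = True
das.vmComponentProtecting = "enabled"        # turn on VM Component Protection

vm_defaults = vim.cluster.DasVmSettings()
vm_defaults.isolationResponse = "powerOff"   # or "shutdown" / "none"

vmcp = vim.cluster.VmComponentProtectionSettings()
vmcp.vmStorageProtectionForPDL = "restartAggressive"    # restart VMs elsewhere on PDL
vmcp.vmStorageProtectionForAPD = "restartConservative"  # restart VMs after the APD timeout
vm_defaults.vmComponentProtectionSettings = vmcp
das.defaultVmSettings = vm_defaults

spec = vim.cluster.ConfigSpecEx(dasConfig=das)
task = cluster.ReconfigureComputeResource_Task(spec, True)  # modify=True, merges with existing config
print("Reconfigure task:", task.info.key)

Disconnect(si)
```

The same settings are of course available in the Web Client under the cluster's vSphere HA configuration; the API route is mainly useful if you need to apply the change to many clusters consistently.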

I hope this helps…

Some recent updates to HA deepdive

Duncan Epping · Apr 15, 2016 ·

Just a short post to point out that I updated the VVol section in the HA Deepdive. If you downloaded it, make sure to download the latest version. Note that I have added a version number to the intro and a changelog at the end so you can see what changed. I also recommend subscribing to it, as I plan to do some more updates in the upcoming months. For the update I've been playing with a Nimble (virtual) array all day today, and it allowed me to create some cool screenshots of how HA works in a VVol environment. I was seriously impressed by how easy it was to set up the Nimble (virtual) array and how simple VVol was to configure for them. Not just that, I was also amazed by the number of policy options Nimble exposes. Below is just an example of some of the things you can configure!

[Screenshot: Nimble VVol policy configuration options]

The screenshot below shows the Virtual Volumes created for a VM; this is the view from a storage perspective:
[Screenshot: Virtual Volumes created for the VM, as seen on the array]

Can I still provision VMs when a VSAN Stretched Cluster site has failed?

Duncan Epping · Apr 13, 2016 ·

A question was asked internally whether you can still provision VMs when a site has failed in a VSAN stretched cluster environment. In a regular VSAN environment, when you don't have sufficient fault domains you cannot provision new VMs, unless you explicitly enable Force Provisioning, which most people do not have enabled. In a VSAN stretched cluster environment this behaviour is different. In my case I tested what would happen if the witness appliance were gone. I had already created a VM before I failed the witness appliance, and I powered it on afterwards, just to see if that worked. Well, that worked, and if you look at the VM at a component level you can see that the witness component is missing.

The next test was to create a new VM while the witness appliance is down. That also worked, although vCenter notified me during the provisioning process that there are fewer fault domains than expected, as shown in the screenshot below. This is the difference with a normal VSAN environment: here we actually do allow you to provision new workloads, mainly because the site could be down for a longer period of time.

The next step was to power on the just-created VM and then look at the components. The power-on works without any issues and, as shown below, the VM is created in the preferred site with a single component. As soon as the witness recovers, though, the remaining components are created and synced.

Good to see that provisioning and power-on actually do work and that the behaviour for this specific use case was changed. If you want to know more about VSAN stretched clusters, there are a bunch of articles on the topic to be found here, and there is also a deepdive white paper available here.
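
If you want to double-check which fault domain each host in the stretched cluster is assigned to (which is what the "fewer fault domains than expected" warning is about), something like the pyVmomi sketch below should do. It assumes the faultDomainInfo property the vSphere 6.0 API exposes on the host vSAN configuration; the vCenter address, credentials and cluster name are placeholders.

```python
# Hedged sketch: list the vSAN fault domain each host in a cluster belongs to.
# Assumes the vSphere 6.0+ vSAN host configuration (faultDomainInfo); names and
# credentials below are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Stretched-Cluster-01")
view.DestroyView()

for host in cluster.host:
    vsan_cfg = host.configManager.vsanSystem.config
    fd = vsan_cfg.faultDomainInfo.name if vsan_cfg.faultDomainInfo else "(none)"
    print(host.name, "fault domain:", fd)

Disconnect(si)
```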

Virtually Speaking Podcast Episode 7 – VSAN Customer Use Cases

Duncan Epping · Apr 2, 2016 ·

The Storage and Availability Tech Marketing team runs a podcast called the Virtually Speaking Podcast every week. This week it was my turn to be a guest on their show. We spoke about VSAN, use cases, all-flash and various other random topics that came up. It was a fun conversation, and I am going to try to tune in more often for sure. (Although I do listen to it every week, I haven't been able to join live…) Make sure to sign up, so you don't miss out on an episode. Listen to Pete Flecha, John Nicholson and me through the below player. I hope you will enjoy it as much as I did.

vMSC and Disk.AutoremoveOnPDL on vSphere 6.x and higher

Duncan Epping · Mar 21, 2016 ·

I have discussed this topic a couple of times, and want to inform people about a recent change in recommendation. In the past, when deploying a stretched cluster (vMSC), most storage vendors and VMware recommended setting Disk.AutoremoveOnPDL to 0. This basically disabled the feature that automatically removes LUNs which are in a PDL (permanent device loss) state. Upon return of the device, a rescan would then allow you to use the device again. With vSphere 6.0, however, there has been a change to how vSphere responds to a PDL scenario: vSphere does not expect the device to return. To be clear, the PDL behaviour in vSphere was designed around the removal of devices; they should not stay in the PDL state and return for duty. This did work in previous versions, however, due to a bug.

With vSphere 6.0 and higher, VMware recommends setting Disk.AutoremoveOnPDL to 1, which is the default setting. If you are a vMSC / stretched cluster customer, please change your environment and design accordingly. But before you do, please consult your storage vendor and discuss the change. I would also recommend testing the change and the resulting behaviour to validate that the environment returns for duty correctly after a PDL! Sorry about the confusion.

[Screenshot: Disk.AutoremoveOnPDL advanced setting]

KB article backing my recommendation was just posted: https://kb.vmware.com/kb/2059622. Documentation (vMSC whitepaper) is also being updated.
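
If you want to verify (or correct) this setting across all hosts in a cluster, below is a minimal pyVmomi sketch. The vCenter address, credentials and cluster name are placeholders, and depending on your pyVmomi version the integer value may need to be wrapped as a VMODL long. As mentioned above, check with your storage vendor before changing anything in a production vMSC environment.

```python
# Sketch: check (and correct) Disk.AutoremoveOnPDL on every host in a cluster.
# The vCenter address, credentials and cluster name are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Stretched-Cluster-01")
view.DestroyView()

for host in cluster.host:
    opt_mgr = host.configManager.advancedOption
    current = opt_mgr.QueryOptions("Disk.AutoremoveOnPDL")[0].value
    print(host.name, "Disk.AutoremoveOnPDL =", current)
    if current != 1:
        # Back to the default of 1, per the vSphere 6.x recommendation.
        # Depending on the pyVmomi version the value may need to be a VMODL long.
        opt_mgr.UpdateOptions(changedValue=[
            vim.option.OptionValue(key="Disk.AutoremoveOnPDL", value=1)])

Disconnect(si)
```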

