I wrote this article about split-brain scenarios for the vSphere Blog. Based on this article I received some questions around which “isolation response” to use. This is not something that can be answered with a simple “recommended practice” that applies to all scenarios out there. Note that the answer has everything to do with your infrastructure. Are you using IP-based storage? Do you have a converged network? All of these impact the decision around the isolation response.
The following table, however, can be used to make a decision:
Likelihood that host will retain access to VM datastores | Likelihood that host will retain access to VM network | Recommended Isolation policy | Explanation |
Likely | Likely | Leave Powered On | VM is running fine so why power it off? |
Likely | Unlikely | Either Leave Powered On or Shutdown | Choose shutdown to allow HA to restart VMs on hosts that are not isolated and hence are likely to have access to storage |
Unlikely | Likely | Power Off | Use Power Off to avoid having two instances of the same VM on the VM network |
Unlikely | Unlikely | Leave Powered On or Power Off | Leave Powered On if the VM can recover from the network/datastore outage without being restarted because of the isolation, and Power Off if it likely can’t. |
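To make the table easier to reason about, here is a minimal sketch of it as a lookup in Python. This is purely illustrative and not part of vSphere or any VMware tooling: the names `Likelihood`, `ISOLATION_POLICY` and `recommended_isolation_responses` are hypothetical, and the response strings are simply the labels used in the table above.

```python
from enum import Enum


class Likelihood(Enum):
    """How likely an isolated host is to keep access to a resource."""
    LIKELY = "Likely"
    UNLIKELY = "Unlikely"


# Keys are (datastore access, VM network access).
# Values are the options from the table, in the order they are listed there.
# All names here are hypothetical; this is not a VMware API.
ISOLATION_POLICY = {
    (Likelihood.LIKELY, Likelihood.LIKELY): ("Leave Powered On",),
    (Likelihood.LIKELY, Likelihood.UNLIKELY): ("Leave Powered On", "Shutdown"),
    (Likelihood.UNLIKELY, Likelihood.LIKELY): ("Power Off",),
    (Likelihood.UNLIKELY, Likelihood.UNLIKELY): ("Leave Powered On", "Power Off"),
}


def recommended_isolation_responses(datastore_access, network_access):
    """Return the isolation responses the table suggests for this combination."""
    return ISOLATION_POLICY[(datastore_access, network_access)]


if __name__ == "__main__":
    # Example: converged network where storage and VM traffic share uplinks.
    options = recommended_isolation_responses(
        Likelihood.UNLIKELY, Likelihood.UNLIKELY
    )
    print(" or ".join(options))
```

Where the table lists two options, the sketch returns both rather than picking one, because that choice still depends on your environment as described in the explanation column.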
James Hess says
What’s one to do in a heterogeneous storage environment? It’s not as if you can configure a per-VM isolation response.
VM cluster with some NFS storage, some iSCSI VMFS datastores, some FC datastores, and some FCoE.
Likelihood that host will retain access to VM network: Unlikely (ESXi second management VMkernel on same subnet as one of the das.isolationaddress on 2x10gig interface team on same dvSwitch as VM network)
Likelihood that host will retain access to VM datastores: Some VM x datastores Likely (VMs on FC, iSCSI) — FC unaffected by Ethernet, iSCSI protected by iSCSI multipathing and additional 1-gigabit connections through a second storage-dedicated switch.
Some VM x datastores: Unlikely (NFS);
NFS datastore on 2x10gig team shared with ESXi management.
Because NFS does not support multipathing, there is no path failover possible in case of network issues if the problem cannot be detected by network teaming on the ESXi host.
Duncan says
Euh, yes you can set the isolation response per virtual machine in your cluster settings?
James Hess says
“Euh, yes you can set the isolation response per virtual machine in your cluster settings?”
Technically, the capability exists to select an isolation response in the HA VM settings drop down list, but a human limitation makes that unreasonable.
You can do that if you have 10 VMs and you are very careful to update the setting whenever you move a VM from one datastore to another, but it’s not really feasible if you have, oh say, 250 VMs in the cluster.
Duncan says
Then you will need to select the one that does the least damage. There is not much else you can do about it when you have various types of storage in one environment.
Marco Broeken says
Good overview Duncan! I was looking for this
Marty Greene says
Funny, we just had an issue in our environment. We use HP blades with Virtual Connect. At times we update firmware on the enclosures. We never had a problem in the past; the hosts do have a network blip during the process, but the guests stay up. This time the host isolation response on the cluster was set to power down, so when all the hosts in the silo were not pingable the VMs started shutting down.
Udubplate says
If this was configurable per datastore cluster it would be much more manageable for large-scale heterogeneous environments. For example, set the isolation response per datastore so it applies to all VMs on those datastores. Typically a datastore cluster would not have more than one type of storage connectivity for the datastores within it, or at least could be configured that way in this scenario.
Duncan Epping says
I suggest you contact your local VMware representative, list the use case and ask them to file a feature request. It kinda makes sense, but I can also see why you would want a similar response for all VMs. It makes what happens more predictable.
Ashok says
If the isolation response is ‘Leave powered-on’, will the VMs continue to be powered on on the isolated host (until the isolation issue is resolved), or will they get restarted by the master on one of the available hosts at some point?
sidbrydon says
Hi Duncan,
In a service provider environment where you are using say UCS, what would you recommend for the isolation response?
I have been going back and forth about this. I am concerned that there is possibly no way to avoid causing a boot storm on the SAN, for example.
If one chassis, say, is isolated from the network but not from storage, would it not be better to leave the VMs powered on and let the operations team handle powering off the hosts/VMs so they are brought back in a controlled fashion, reducing the risk of affecting other customers’ environments, rather than using “Power off” and running the risk of having 400-500 VMs definitely trying to power on in different clusters at the same time?