I was talking to a partner and customer last week at a VMUG. They were running a two-node (direct connect) vSAN configuration and had run into issues during maintenance which were, to them, not easy to explain. They had placed the host in the “preferred fault domain” into maintenance mode. After that, the link between the two hosts failed for whatever reason. When they rebooted the host in the preferred fault domain, it connected back to the witness, but at that point the connection between the two hosts had not yet returned. This confused vSAN and resulted in the VMs in the secondary fault domain being powered off. As you can imagine, an undesired effect.
This issue will be solved in the near future in a new version of vSAN, but for those who need to do maintenance on a two-node (direct connect) configuration (or full site maintenance in a stretched environment) I would highly recommend the following simple procedure. It needs to be followed when doing maintenance on the host which is in the “preferred fault domain”:
- Change the preferred fault domain
  - Under vSAN, click Fault Domains and Stretched Cluster.
  - Select the secondary fault domain and click the Mark Fault Domain as preferred for Stretched Cluster icon.
- Place the host into maintenance mode
- Do your maintenance
Fairly straightforward, but important to remember…
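To make the order of operations explicit, the procedure above can be sketched as a small checklist generator. This is only an illustration in plain Python: `TwoNodeCluster`, `maintenance_plan` and the fault domain names `fd-a` and `fd-b` are hypothetical and do not correspond to any VMware API. The actual change is still made in the vSphere Client as described above.

```python
from dataclasses import dataclass


@dataclass
class TwoNodeCluster:
    """Toy stand-in for a two-node vSAN cluster (hypothetical, not a VMware API)."""
    preferred: str  # name of the current preferred fault domain
    secondary: str  # name of the current secondary fault domain


def maintenance_plan(cluster, host_fault_domain):
    """Return the ordered steps for maintenance on a host in the given fault domain."""
    if host_fault_domain not in (cluster.preferred, cluster.secondary):
        raise ValueError(f"unknown fault domain: {host_fault_domain}")

    steps = []
    if host_fault_domain == cluster.preferred:
        # Swap roles first, so the host that stays online ends up in the
        # preferred fault domain before the other host goes down.
        steps.append(f"Mark fault domain '{cluster.secondary}' as preferred")
        cluster.preferred, cluster.secondary = cluster.secondary, cluster.preferred
    steps.append("Place the host into maintenance mode")
    steps.append("Do the maintenance")
    return steps


# Example: the host to be patched sits in the preferred fault domain "fd-a".
cluster = TwoNodeCluster(preferred="fd-a", secondary="fd-b")
for step in maintenance_plan(cluster, "fd-a"):
    print(step)
```

If the host is already in the secondary fault domain, the plan simply skips the swap and starts with maintenance mode.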
Thanks for this useful information. I have deployed 19 vSAN clusters with 2-node direct connect. I never had this issue while doing maintenance, but I was worried after learning about it. I will make this change during host patching/maintenance just to avoid any unforeseen issues.
Arvin K says
Thanks for the info. Unfortunately I found this article too late, and all VMs went down without a single warning. DISASTER ;(
Claudio R says
Thank you for the info. Can I have more information on the affected versions?
Is the problem solved in version 6.5.0 10175896 (October patch)?
We had a similar problem that maybe you can help us with.
2 Node DC 10Gb vSAN… 1Gb for LAN/Witness (WTS implemented)
All 3 hosts on the same L2
If we put 1 data host in maintenance mode and reboot it… everything is good (the VMs keep running on 2nd node)
If we put the same host in maintenance mode and disconnect the LAN (10G DC stays up), the VMs are shut down on the 2nd node…
HA is on but Host isolation is turned off…
Do you know why?
VMware support have been looking at it for a week and can’t figure it out 🙁
I have narrowed down the issue
If you put a host in a 2-node vSAN cluster in maintenance mode and disconnect the vmnic that the witness traffic (with WTS implemented) is using… it terminates the VMs on the other node!
Regardless of your fault domain settings (I tried preferred and secondary – no difference)
This is only if you have HA turned on and the host is in maintenance mode
If you disconnect the witness traffic vmnics when the host isn’t in maintenance mode… nothing happens
VMware have acknowledged this as a bug and have escalated it to engineering
Set VSAN.AutoTerminateGhostVm to 1
I think that will solve the problem.
1 is default… Do you mean set it to 0?
Yes that works but isn’t that setting for when the host is isolated from the network?
In this case the VM is being terminated on the host that still has full connectivity to the witness server, so it's not isolated.
This setting was developed to kill VMs when a site is partitioned from the Witness location and the other Data location. If the VM is unusable, the VM is killed. In your scenario this is the case, as you also kill the WTS link. I am on holiday right now with limited internet access, but I should have an article explaining the setting and which value should be used.
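For readers following along, here is a toy model of the rule as it is described in the reply above: VMs are killed only when the setting is enabled and the node has lost both the witness and the other data location. The function names are hypothetical and this is not vSAN's actual logic, just the stated intent written out. The value 1 is treated as enabled, which per this thread is the default.

```python
def is_partitioned(reaches_witness: bool, reaches_other_data_node: bool) -> bool:
    """Partitioned means the node has lost BOTH the witness and the other data node."""
    return not reaches_witness and not reaches_other_data_node


def would_terminate_vms(auto_terminate_ghost_vm: int,
                        reaches_witness: bool,
                        reaches_other_data_node: bool) -> bool:
    """Toy rule: VMs are killed only if the setting is 1 and the node is partitioned."""
    return auto_terminate_ghost_vm == 1 and is_partitioned(
        reaches_witness, reaches_other_data_node
    )


# Setting enabled, node cut off from both witness and other data node.
print(would_terminate_vms(1, reaches_witness=False, reaches_other_data_node=False))
# Setting enabled, node still reaches the witness.
print(would_terminate_vms(1, reaches_witness=True, reaches_other_data_node=False))
# Setting disabled (0), node fully cut off.
print(would_terminate_vms(0, reaches_witness=False, reaches_other_data_node=False))
```

Under this reading of the description, only the first case would lead to VMs being terminated.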
Just to confirm:
I’m killing the WTS link on the node in maintenance mode… not the node running the VM. The node running the VM still has full connectivity to the witness yet the VM is terminated.
And you are sure you set up the manual routes correctly? It almost feels like WTS is directed via the other host.
No routing required. All 3 hosts are in the same L2.
WTS is on vmk0 (shared with management) on the data nodes. Witness appliance is also vmk0 (with vSAN enabled)
I’ve even tried breaking out WTS on all 3 nodes onto its own vmk and VLAN – no diff.
VMware have even confirmed this is a serious bug in VSAN
Happy to send you the case number if you want to follow up further 🙂
Sure, send it over. I am on holiday though, so I won’t be able to access it for a few weeks, but I can point others to it.
Done. Sent it to your gmail. 🙂
Thanks heaps for replying on your holiday! Have a good break 🙂