Doing maintenance on a Two-Node (Direct Connect) vSAN configuration

Duncan Epping · Mar 13, 2018

I was talking to a partner and customer last week at a VMUG. They were running a two-node (direct connect) vSAN configuration and had some issues during maintenance which were, to them, not easy to explain. They had placed the host in the “preferred fault domain” into maintenance mode. After that, the link between the two hosts failed for whatever reason. When they rebooted the host in the preferred fault domain, it connected back to the witness, but at that point the connection between the hosts had not yet returned. This confused vSAN, and the result was that the VMs in the secondary fault domain were powered off. As you can imagine, an undesired effect.

This issue will be solved in the near future in a new version of vSAN, but for those who need to do maintenance on a two-node (direct connect) configuration (or full site maintenance in a stretched environment) I would highly recommend the following simple procedure. It needs to be followed when doing maintenance on the host which is in the “preferred fault domain”:

Change the preferred fault domain
- Under vSAN, click Fault Domains and Stretched Cluster.
- Select the secondary fault domain and click the Mark Fault Domain as preferred for Stretched Cluster icon
Place the host into maintenance mode
Do your maintenance
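
The ordering of the steps above can be sketched in a few lines of Python. This is only a toy model to make the decision explicit: it does not talk to vCenter or vSAN, and the names `TwoNodeCluster`, `maintenance_plan`, `fd-a` and `fd-b` are made up for illustration. The point it shows is that the preferred fault domain is swapped first, and only when the host being serviced sits in the preferred fault domain.

```python
from dataclasses import dataclass


@dataclass
class TwoNodeCluster:
    """Toy model of a two-node vSAN cluster; not a real vSAN API."""
    preferred_fd: str   # fault domain currently marked as preferred
    secondary_fd: str   # the other fault domain


def maintenance_plan(cluster: TwoNodeCluster, host_fd: str) -> list[str]:
    """Return the ordered steps for maintenance on the host in `host_fd`."""
    if host_fd not in (cluster.preferred_fd, cluster.secondary_fd):
        raise ValueError(f"unknown fault domain: {host_fd}")

    steps = []
    if host_fd == cluster.preferred_fd:
        # The host to be serviced is in the preferred fault domain, so
        # swap the roles first: the other fault domain becomes preferred.
        cluster.preferred_fd, cluster.secondary_fd = (
            cluster.secondary_fd,
            cluster.preferred_fd,
        )
        steps.append(f"mark fault domain '{cluster.preferred_fd}' as preferred")
    steps.append(f"place the host in '{host_fd}' into maintenance mode")
    steps.append("do the maintenance")
    return steps


if __name__ == "__main__":
    cluster = TwoNodeCluster(preferred_fd="fd-a", secondary_fd="fd-b")
    for step in maintenance_plan(cluster, host_fd="fd-a"):
        print(step)
```

In this model, maintenance on the host in the secondary fault domain skips the swap and goes straight to maintenance mode.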

Fairly straightforward, but important to remember…

Comments

Haridas says

13 March, 2018 at 17:01

Thanks for this useful information. I have deployed 19 vSAN clusters with 2-node direct connect. I never had this issue while doing maintenance, but I was worried after learning about it. I will make this change during host patching/maintenance just to avoid any unforeseen issues.

Thanks,
Haridas
Arvin K says

1 June, 2018 at 16:13

Thanks for the info. Unfortunately I found this article too late, and all VMs went down without a single warning. DISASTER ;(
Claudio R says

26 October, 2018 at 11:18

Thank you for the info. Can I have more information on the affected versions?
Is the problem solved in version 6.5.0 10175896 (October Patch)?
Paul says

23 July, 2019 at 01:05

Hi,

We had a similar problem that maybe you can help us with.

2 Node DC 10GB vSAN… 1GB for LAN/Witness (WTS implemented)

All 3 hosts on the same L2

If we put 1 data host in maintenance mode and reboot it… everything is good (the VMs keep running on 2nd node)

If we put same host in maintenance mode and disconnect LAN (10G DC stays up) the VMs are shutdown on the 2nd node…

HA is on but Host isolation is turned off…

Do you know why?

VMware support have been looking at it for a week and can’t figure it out 🙁
- blueisotope says
  
  26 July, 2019 at 02:54
  
  Update:
  
  I have narrowed down the issue
  
  If you put a host on a 2node VSAN cluster in maintenance mode and disconnect the vmnic that the witness traffic (with WTS implemented) is using… it terminates the VMs on the other node!
  
  Regardless of your fault domain settings (I tried preferred and secondary – no difference)
  
  This is only if you have HA turned on and the host is in maintenance mode
  
  If you disconnect the witness traffic vmnics when the host isn’t in maintenance mode… nothing happens
  
  VMware have acknowledged this as a bug and have escalated it to engineering
  - Duncan says
    
    26 July, 2019 at 06:33
    
    Set VSAN.AutoTerminateGhostVm to 1
    
    I think that will solve the problem.
    - blueisotope says
      
      26 July, 2019 at 07:28
      
      1 is default… Do you mean set it to 0?
      
      Yes that works but isn’t that setting for when the host is isolated from the network?
      
      In this case the VM is being terminated on the host that still has full connectivity to the witness server, so it’s not isolated
      - Duncan says
        
        26 July, 2019 at 10:13
        
        This setting was developed to kill VMs when a site is partitioned from the Witness location and the other Data location. If the VM is unusable the VM is killed. In your scenario this is the case, as you also kill the WTS link. I am on a holiday right now, limited access to internet, but I should have an article explaining the setting and what should be used
        
        blueisotope says
        
        26 July, 2019 at 14:44
        
        Just to confirm:
        
        I’m killing the WTS link on the node in maintenance mode… not the node running the VM. The node running the VM still has full connectivity to the witness yet the VM is terminated.
Duncan says

26 July, 2019 at 15:30

And you are sure you set up the manual routes correctly? It almost feels like WTS is directed via the other host.
- blueisotope says
  
  26 July, 2019 at 22:36
  
  No routing required. All 3 hosts are in the same L2.
  
  WTS is on vmk0 (shared with management) on the data nodes. Witness appliance is also vmk0 (with vSAN enabled)
  
  I’ve even tried breaking out WTS on all 3 nodes onto its own vmk and VLAN – no diff.
  
  VMware have even confirmed this is a serious bug in VSAN
  
  Happy to send you the case number if you want to follow up further 🙂
  - Duncan says
    
    27 July, 2019 at 00:13
    
    Sure send it over. I am on holiday though, so won’t be able to access it for a few weeks, but can point others to it.
    - blueisotope says
      
      27 July, 2019 at 01:04
      
      Done. Sent it to your gmail. 🙂
      
      Thanks heaps for replying on your holiday! Have a good break 🙂
