
Doing maintenance on a Two-Node (Direct Connect) vSAN configuration

Duncan Epping · Mar 13, 2018

I was talking to a partner and customer last week at a VMUG. They were running a two-node (direct connect) vSAN configuration and hit some issues during maintenance which were, to them, not easy to explain. They had placed the host in the “preferred fault domain” into maintenance mode, and after that the link between the two hosts failed for whatever reason. When they rebooted the host in the preferred fault domain it reconnected to the witness, but at that point the connection between the two hosts had not yet returned. This confused vSAN, and the result was that the VMs in the secondary fault domain were powered off. As you can imagine, an undesired effect.

This issue will be solved in an upcoming version of vSAN, but for those who need to do maintenance on a two-node (direct connect) configuration (or a full site maintenance in a stretched environment) today, I would highly recommend the following simple procedure (a scripted sketch follows the list). This needs to be done when doing maintenance on the host in the “preferred fault domain”:

  • Change the preferred fault domain:
    • Under vSAN, click Fault Domains and Stretched Cluster.
    • Select the secondary fault domain and click the “Mark Fault Domain as preferred for Stretched Cluster” icon.
  • Place the host into maintenance mode.
  • Do your maintenance.
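
For those who prefer to script these steps, below is a minimal sketch using pyVmomi and the vSAN Management SDK for Python. It is an illustration, not a supported tool: the vCenter, cluster, and host names are placeholders, and the VSANVcSetPreferredFaultDomain call on the 'vsan-stretched-cluster-system' managed object follows the public vSAN Management API, so verify it against the SDK version you use.

```python
# Minimal sketch: swap the preferred fault domain, then enter maintenance mode.
# Requires pyVmomi and the vSAN Management SDK for Python (vsanapiutils).
# Names like 'TwoNodeCluster' and 'esx-01.local' are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import vsanapiutils  # ships with the vSAN Management SDK

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host='vcenter.local', user='administrator@vsphere.local',
                  pwd='VMware1!', sslContext=ctx)

def find_obj(content, vimtype, name):
    """Return the first inventory object of the given type with the given name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.Destroy()

cluster = find_obj(si.content, vim.ClusterComputeResource, 'TwoNodeCluster')
host = find_obj(si.content, vim.HostSystem, 'esx-01.local')  # host to be patched

# Step 1: mark the *other* fault domain as preferred, so the host you are
# about to take down is no longer in the preferred fault domain.
vc_mos = vsanapiutils.GetVsanVcMos(si._stub, context=ctx)
stretched = vc_mos['vsan-stretched-cluster-system']
stretched.VSANVcSetPreferredFaultDomain(cluster=cluster, preferredFd='Secondary')

# Step 2: enter maintenance mode with "Ensure accessibility" for vSAN objects.
spec = vim.host.MaintenanceSpec(
    vsanMode=vim.vsan.host.DecommissionMode(objectAction='ensureObjectAccessibility'))
host.EnterMaintenanceMode_Task(timeout=0, evacuatePoweredOffVms=False,
                               maintenanceSpec=spec)

# Step 3: do your maintenance, exit maintenance mode, and (optionally) switch
# the preferred fault domain back afterwards.
Disconnect(si)
```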

Fairly straightforward, but important to remember…



Comments

  1. Haridas says

    13 March, 2018 at 17:01

    Thanks for this useful information. I have deployed 19 vSAN clusters with 2-node direct connect. I never had this issue while doing maintenance, but I was worried after learning about it. I will make this change during host patching/maintenance just to avoid any unforeseen issues.

    Thanks,
    Haridas

  2. Arvin K says

    1 June, 2018 at 16:13

    Thanks for the info. Unfortunately I found this article too late, and all VMs went down without a single warning. DISASTER ;(

  3. Claudio R says

    26 October, 2018 at 11:18

    Thank you for the info. Can I have more information on the affected versions?
    Is the problem solved in version 6.5.0, build 10175896 (the October patch)?

  4. Paul says

    23 July, 2019 at 01:05

    Hi,

    We had a similar problem that maybe you can help us with.

    2-node direct connect 10Gb vSAN… 1Gb for LAN/witness (WTS implemented)

    All 3 hosts on the same L2

    If we put one data host in maintenance mode and reboot it… everything is good (the VMs keep running on the 2nd node).

    If we put the same host in maintenance mode and disconnect the LAN (the 10Gb direct connect stays up), the VMs are shut down on the 2nd node…

    HA is on but Host isolation is turned off…

    Do you know why?

    VMware support have been looking at it for a week and can’t figure it out 🙁

    • blueisotope says

      26 July, 2019 at 02:54

      Update:

      I have narrowed down the issue

      If you put a host in a 2-node vSAN cluster in maintenance mode and disconnect the vmnic that the witness traffic (with WTS implemented) is using… it terminates the VMs on the other node!

      Regardless of your fault domain settings (I tried preferred and secondary – no difference)

      This is only if you have HA turned on and the host is in maintenance mode

      If you disconnect the witness traffic vmnics when the host isn’t in maintenance mode… nothing happens

      VMware have acknowledged this as a bug and have escalated it to engineering

      • Duncan says

        26 July, 2019 at 06:33

        Set VSAN.AutoTerminateGhostVm to 1

        I think that will solve the problem.
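
        For reference, a minimal pyVmomi sketch for flipping this advanced option on a host is below; the option key matches the name above, and the long-value wrapping is an assumption based on common pyVmomi usage.

        ```python
        # Minimal sketch: set the VSAN.AutoTerminateGhostVm advanced option on a host.
        # Assumes an existing pyVmomi session; 'host' is a vim.HostSystem object.
        from pyVmomi import vim, VmomiSupport

        def set_auto_terminate_ghost_vm(host, enabled):
            """Set VSAN.AutoTerminateGhostVm (1 = terminate ghost VMs, 0 = do not)."""
            value = VmomiSupport.vmodlTypes['long'](1 if enabled else 0)
            host.configManager.advancedOption.UpdateOptions(
                changedValue=[vim.option.OptionValue(key='VSAN.AutoTerminateGhostVm',
                                                     value=value)])
        ```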

        • blueisotope says

          26 July, 2019 at 07:28

          1 is default… Do you mean set it to 0?

          Yes that works but isn’t that setting for when the host is isolated from the network?

          In this case the VM is being terminated on the host that still has full connectivity to the witness server, so it’s not isolated.

          • Duncan says

            26 July, 2019 at 10:13

            This setting was developed to kill VMs when a site is partitioned from the witness location and the other data location. If the VM is unusable, the VM is killed. In your scenario this is the case, as you also kill the WTS link. I am on holiday right now with limited access to the internet, but I will follow up with an article explaining the setting and when it should be used.

            • blueisotope says

              26 July, 2019 at 14:44

              Just to confirm:

              I’m killing the WTS link on the node in maintenance mode… not the node running the VM. The node running the VM still has full connectivity to the witness yet the VM is terminated.

  5. Duncan says

    26 July, 2019 at 15:30

    And you are sure you set up the manual routes correctly? It almost feels like WTS is directed via the other host.

    • blueisotope says

      26 July, 2019 at 22:36

      No routing required. All 3 hosts are in the same L2.

      WTS is on vmk0 (shared with management) on the data nodes. Witness appliance is also vmk0 (with vSAN enabled)

      I’ve even tried breaking out WTS on all 3 nodes onto its own vmk and VLAN – no diff.

      VMware have even confirmed this is a serious bug in vSAN

      Happy to send you the case number if you want to follow up further 🙂

      • Duncan says

        27 July, 2019 at 00:13

        Sure send it over. I am on holiday though, so won’t be able to access it for a few weeks, but can point others to it.

        • blueisotope says

          27 July, 2019 at 01:04

          Done. Sent it to your gmail. 🙂

          Thanks heaps for replying on your holiday! Have a good break 🙂
