I got a question earlier about the maintenance of an ISL in a vSAN Stretched Cluster configuration, which had me thinking for a while. The question was what to do with the workload during the maintenance window. The easiest option, of course, is to power off all VMs and simply shut down the cluster, for which vSAN has a UI option, and there's a KB you can follow. But there could also be a situation where the VMs need to remain running. How does this work when you end up losing the connection between all three locations? Normally this would lead to a situation where all VMs become "inaccessible", as you end up losing quorum.
As said, this had me thinking: you could take advantage of the "vSAN Witness Resiliency" mechanism, which was introduced in vSAN 7.0 U3. How would this work?
Well, it is actually pretty straightforward: if all hosts of one site are in maintenance mode, have failed, or are powered off, the votes of the witness object for each VM/object are recalculated within 3 minutes. When this recalculation has completed, the witness can go down without having any impact on the VMs. We introduced this capability to increase resiliency in a double-failure scenario, but we can also (ab)use this functionality during maintenance. Of course I had to test this, so the first step I took was placing all hosts in one location into maintenance mode (no data evacuation). This resulted in all my VMs being vMotioned to the other site.
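For those who want to follow along, this is roughly what that step could look like from RVC (the Ruby vSphere Console). The inventory paths below are placeholders for your own datacenter, cluster, and host names, and the "noAction" mode corresponds to "No data evacuation" in the UI:

```
# Place a host from one site into maintenance mode without evacuating
# vSAN data; repeat this for every host in that site.
vsan.enter_maintenance_mode /localhost/DC/computers/StretchedCluster/hosts/esxi-siteb-01 --vsan-mode noAction
```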
Next, I checked with RVC whether the votes had been recalculated. As stated, depending on the number of VMs this can take up to around 3 minutes in total, but it will usually be quicker. After the recalculation had completed I powered off the witness, and the result: all VMs were still running.
Of course, I had to double-check on the command line using RVC (you can use the command "vsan.vm_object_info" to check a particular object, for instance) to ensure that the components of those VMs were indeed still "ACTIVE" instead of "ABSENT", and there you go!
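A minimal sketch of that check, with a placeholder VM path; on recent vSAN builds the output also lists the vote count per component, which is how you can see the recalculation mentioned above:

```
# Inspect the object layout of a single VM; each component is listed
# with its state (look for ACTIVE rather than ABSENT) and its votes.
vsan.vm_object_info /localhost/DC/computers/StretchedCluster/vms/vm-01
```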
Now, when the maintenance has been completed, you simply do the reverse: you power on the witness, and then you power on the hosts in the other location. After the resync has completed, the VMs will be rebalanced again by DRS. Note: DRS rebalancing (or the application of "should" VM/Host rules) will only happen once the resync of the VM has completed.
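One way to keep an eye on that resync from RVC is the resync dashboard; the cluster path is again a placeholder for your own inventory:

```
# Shows the objects that are still resyncing and the bytes left to sync;
# the resync is complete once no objects remain in the dashboard.
vsan.resync_dashboard /localhost/DC/computers/StretchedCluster
```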
Laurent Blanchoud says
Hi Duncan!
Many thanks for that very interesting post.
What could we do in such a case if we don't have the vSAN capacity to replicate all VMs across both sites, nor the compute capacity to shut down one site?
Can we keep VMs running on both sites through the loss of the ISL and the witness if we use storage policies with a site disaster tolerance like "None – keep data on Secondary (stretched cluster)" or "None – keep data on Preferred (stretched cluster)", along with the proper DRS rules to keep the VMs on the correct site? 😉
Many thanks for your feedback!
Laurent
defdefred says
Hello Laurent,
In a stretched vSAN cluster, isn't the data already replicated across both sites?
If not, what's the benefit of a stretched configuration?
Laurent says
Hi defdefred,
Well, whether data is replicated or not depends on the settings you choose in your storage policy.
If you already have another software-based way to replicate data, or have a two-node cluster, you can decide not to replicate data through vSAN and keep one node on each site.
I was asking about that very specific use case because we have network maintenance to do, during which we'll lose the ISL and the witness at the same time, and we also don't have the capacity to replicate all VMs.
So I'm trying to find the easiest and safest way to do that change while keeping our VMs running.
Have a great day!