vSAN 7.0 U3 enhanced stretched cluster resiliency, what is it?

I briefly discussed the enhanced stretched cluster resiliency capability in my vSAN 7.0 U3 overview blog. Of course, questions started popping up immediately. I didn’t want to go too deep in that post, as I figured I would do a separate post on the topic sooner or later. So what does this functionality add, and in which particular scenario?

In short, this enhancement to stretched clusters prevents downtime for workloads in a particular failure scenario. So the question then is, what failure scenario? Let’s first take a look at this diagram of a typical stretched vSAN cluster deployment.

If you look at the diagram you see the following: Datacenter A, Datacenter B, and the Witness. One of the situations customers have found themselves in is that Datacenter A goes down (unplanned). This of course leads to the VMs in Datacenter A being restarted in Datacenter B. Unfortunately, sometimes when things go wrong, they go wrong badly, and in some cases the Witness fails or disappears next. Why? Bad luck, networking issues, etc. Bad things just happen. If and when this happens, there is only 1 location left, which is Datacenter B.

Now you may think that because Datacenter B typically has a full RAID set of the VMs running, they will remain running, but that is not true. vSAN looks at the quorum at the top layer, so if 2 out of 3 locations disappear, all impacted objects become inaccessible, simply because quorum is lost! Makes sense, right? And we are not just talking about failures here. It could also be that Datacenter A has to go offline for maintenance (planned downtime) and at some point the Witness fails for whatever reason. This would result in the exact same situation: objects inaccessible.
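To make the quorum argument a bit more concrete, here is a minimal sketch in Python. Note that the vote numbers (1 vote per location) and the `has_quorum` helper are my own simplification for illustration, not the actual values or code vSAN uses. The only rule it models is that an object needs more than half of its votes to remain accessible.

```python
# Hypothetical vote layout for one replicated object.
# Real vSAN vote counts per component are not given in this post.
votes = {"datacenter_a": 1, "datacenter_b": 1, "witness": 1}


def has_quorum(votes, reachable):
    """Return True if the reachable locations hold more than 50% of all votes."""
    total = sum(votes.values())
    available = sum(v for site, v in votes.items() if site in reachable)
    return available * 2 > total


# Datacenter A is down: B plus the witness still hold 2 of 3 votes.
print(has_quorum(votes, {"datacenter_b", "witness"}))  # True

# The witness disappears as well: B holds only 1 of 3 votes.
print(has_quorum(votes, {"datacenter_b"}))  # False
```

Even though Datacenter B has all the data in this sketch, it cannot form a majority on its own, which is exactly why the objects go inaccessible.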

Starting with 7.0 U3 this behavior has changed. If Datacenter A fails, and a few (let’s say 5) minutes later the witness disappears, all replicated objects will still be available! So why is this? Well, in this scenario, when Datacenter A fails, vSAN creates a new votes layout for each of the impacted objects. It basically assumes that the witness can fail and gives all components on the witness 0 votes. On top of that, it gives the components in the active site additional votes so that we can survive that second failure. If the witness then fails, the objects are not rendered inaccessible, as quorum is not lost.
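The effect of the new votes layout can be sketched as follows. This is an illustration under assumptions: the starting layout of 1 vote per location, the `relayout_after_site_failure` helper, and the choice to simply move the witness votes to the surviving site are all hypothetical. The post only tells us that the witness components get 0 votes and the active site gets additional votes.

```python
def has_quorum(votes, reachable):
    """Return True if the reachable locations hold more than 50% of all votes."""
    total = sum(votes.values())
    available = sum(v for site, v in votes.items() if site in reachable)
    return available * 2 > total


def relayout_after_site_failure(votes, failed_site, witness="witness"):
    """Return a new layout: witness gets 0 votes, the surviving data site gains them.

    Assumption: the extra votes for the surviving site equal the votes
    taken away from the witness. The real algorithm may differ.
    """
    new_votes = dict(votes)
    moved = new_votes[witness]
    new_votes[witness] = 0
    for site in new_votes:
        if site not in (failed_site, witness):
            new_votes[site] += moved
    return new_votes


original = {"datacenter_a": 1, "datacenter_b": 1, "witness": 1}
adjusted = relayout_after_site_failure(original, failed_site="datacenter_a")

print(adjusted)  # {'datacenter_a': 1, 'datacenter_b': 2, 'witness': 0}

# With only Datacenter B left, the old layout loses quorum, the new one does not.
print(has_quorum(original, {"datacenter_b"}))  # False
print(has_quorum(adjusted, {"datacenter_b"}))  # True
```

In this sketch Datacenter B ends up holding 2 of the 3 votes, so losing the witness afterwards no longer matters for quorum.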

Now, do note, when a failure occurs and Datacenter A is gone, vSAN has to create a new votes layout for each object. Typically this takes a few seconds per object, and it is done per object, so if you have a lot of VMs (and a VM consists of various objects) it will take some time. How long? Well, it could be five minutes. If the witness goes down in the meantime, not all objects may have been processed, which would result in downtime for the VMs whose objects still have the old layout, as for those objects quorum would be lost.
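A quick back-of-the-envelope sketch of that window. The numbers (100 objects, 3 seconds per object) and the `protected_objects` helper are made up for illustration, and I am assuming objects are processed strictly one after another, which may not match what vSAN does internally.

```python
def protected_objects(num_objects, seconds_per_object, witness_fails_after_s):
    """Count objects that already have the new votes layout when the witness fails.

    Assumes objects are processed serially at a fixed cost per object.
    """
    processed = int(witness_fails_after_s // seconds_per_object)
    return min(num_objects, processed)


total = 100  # hypothetical number of impacted objects
per_object = 3  # hypothetical seconds per object ("a few seconds")

# Full re-layout time under these assumptions: 100 * 3 = 300 seconds, or 5 minutes.
print(total * per_object)  # 300

# Witness fails 2 minutes after Datacenter A: only part of the objects are covered.
done = protected_objects(total, per_object, witness_fails_after_s=120)
print(done, total - done)  # 40 60
```

So in this example, a witness failure after two minutes would still hit 60 objects, while a witness failure after five minutes or more would hit none.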

What happens if Datacenter A (and the Witness) return for duty? Well at that point the votes would be restored for the objects across locations and the witness.

Pretty cool right?!

Comments

tntteam says

7 October, 2021 at 12:04

Thanks for the info, this is a nice improvement indeed!

Michael Schroeder says

17 March, 2022 at 17:17

Hello Duncan. This is really a cool improvement for stretched clusters. Today one of my students asked a smart question: What if the witness fails first? Will there be a shift in the vote distribution too? What will happen if one of the sites goes down a couple of minutes AFTER the witness site? Will the VMs on the last surviving site remain online, or is that just bad karma? 😉
- Duncan Epping says
  
  18 March, 2022 at 09:01
  
  Very valid question. Unfortunately, that is indeed not happening today. You can imagine that this is rather complex to work through. The Witness is the quorum, and we can safely drop the witness as we only have 1 full RAID tree available. If we have 2 full RAID trees it becomes much more complex, as we don’t want to end up in a situation where both locations can write to the same object. I do have some ideas around how we can solve this, but it would require the implementation of another feature before it is really effective.
  - Manuel Dal Bianco says
    
    15 June, 2022 at 18:21
    
    What about a manual trigger to switch to single site mode? It would be better than nothing and should be simple to implement
Marcos Ortiz says

22 November, 2022 at 10:44

Hi Duncan, this is a very awesome new feature, but I have a simple question. Is it activated by default once you upgrade to U3? Or do you need to do something to activate it? I mean, upgrade disk versions or anything else…

Thanks a lot
- Patrick Haan says
  
  23 November, 2022 at 18:04
  
  Good questions – especially if it’s needed to do a “vSAN object format” Upgrade too?
  - Duncan Epping says
    
    24 November, 2022 at 17:14
    
    AFAIK you don’t, but I have personally never done an upgrade without an object level upgrade, especially not in the last few versions, as those upgrades are typically metadata only.
    - Patrick Haan says
      
      24 November, 2022 at 17:58
      
      Disk Format Upgrade – Fully agree.
      
      But also upgrading “object format level”?
      – Cause vSAN expects to resync a bunch of files – which could be a problem in some (hybrid) cases (latency critical production environments, etc.)
  - Duncan Epping says
    
    25 November, 2022 at 08:36
    
    No, it should not require an object format upgrade. This is just metadata and “accounting”.
