stretched cluster

vSAN Stretched: Why is the witness not part of the cluster when the link between a data site and the witness fails?

Duncan Epping · Jun 25, 2024 · Leave a Comment

Last week I received a question about vSAN Stretched which had me wondering for a while what on earth was going on. The person who asked this question was running through several failure scenarios, some of which I have also documented in the past here. The question I got is what is supposed to happen when I have the following scenario as shown in the diagram and the link between the preferred site (Site A) and the witness fails:

The answer, at least that is what I thought, was simple: All VMs will remain running, or said differently, there’s no impact on vSAN. While doing the test, indeed the outcome I documented, which is also documented in the Stretched Clustering Guide and the PoC Guide was indeed the same, the VMs remain running. However, one of the things that was noticed is that when this situation occurs, and indeed the connection between Site A and the Witness is lost, the witness is somehow no longer part of the cluster, which is not what I would expect. The reason I would not expect this to happen is because if a second failure would occur, and for instance the ISL between Site A and Site B goes down, it would direclty impact all VMs. At least, that is what I assumed.

However, when I triggered that second failure and I disconnected the ISL between Site A and Site B, I saw the witness re-appearing again immidiately, I saw the witness objects going from “absent” to “active”, and more importantly, all VMs remained running. The reason this happens is fairly straight forward, when running a configuration like this vSAN has a “leader” and a “backup”, and they each run in a seperate fault domain. Both the leader and the backup need to be able to communicate with the Witness for it to be able to function correctly. If the connection between Site A and the Witness is gone, then either the leader or the backup can no longer communicate with the Witness and the Witness is taken out of the cluster.

So why does the Witness return for duty when the second failure is triggered? Well, when the second failure is triggered the leader is restarted in Site B (as Site A is deemed lost), and the backup is already running in Site B. As both the leader and the backup can communicate again with the witness, the witness returns for duty and so will all of the components automatically and instantly. Which means that even though the ISL has failed between Site A and B after the witness was taken out of the cluster, all VMs remain accessible as the witness is reintroduced instantly to ensure availability of the workload. Pretty cool! (Thanks to vSAN engineering for providing these insights on why this happens!)

Doing network/ISL maintenance in a vSAN stretched cluster configuration!

Duncan Epping · Nov 21, 2023 · 3 Comments

I got a question earlier about the maintenance of an ISL in a vSAN Stretched Cluster configuration which had me thinking for a while. The question was what would you do with your workload during maintenance. I guess the easiest of course is to power off all VMs and simply shutdown the cluster, for which vSAN has a UI option, and there’s a KB you can follow. Now, of course, there could also be a situation where the VMs need to remain running. But how does this work when you end up losing the connection between all three locations? Normally this would lead to a situation where all VMs will become “inaccessible” as you will end up losing quorum.

As said, this had me thinking, you could take advantage of the “vSAN Witness Resiliency” mechanism which was introduced in vSAN 7.0 U3. How would this work?

Well, it is actually pretty straight forward, if all hosts of 1 site are in maintenance mode, failed, or powered off, the votes of the witness object for each VM/Object will be recalculated within 3 minutes. When this recalculation has completed the witness can go down without having any impact on the VM. We introduced this capability to increase resiliency in a double-failure scenario, but we can (ab)use this functionality also during maintenance. Of course I had to test this, so the first step I took was placing all hosts in 1 location into maintenance mode (no data evac). This resulted in all my VMs being vMotioned to the other site.

Now next I checked with RVC if my votes were recalculated or not. As stated, depending on the number of VMs this can take around 3 minutes in total, but usually will probably be quicker. After the recalculation had been completed I powered off the Witness, and this was the result as shown below, all VMs were still running.

Of course, I had to double check on the commandline using RVC (you can use the command “vsan.vm_object_info” to check a particular object for instance) to ensure that indeed the components of those VMs were still “ACTIVE” instead of “ABSENT”, and there you go!

Now when maintenance has been completed, you simply do the reverse, you power on the witness, and then you power on the hosts in the other location. After the “resync” has been completed the VMs will be rebalanced again by DRS. Note, DRS rebalancing (or should rules being applied) will only happen when the resync of the VM has been completed.

Do I need 2 isolation addresses with a (vSAN) stretched cluster for vSphere HA?

Duncan Epping · Sep 27, 2023 · Leave a Comment

This question has come up multiple times now, so I figured I would write a quick post about it, do you need 2 isolation addresses with a (vSAN) stretched cluster for vSphere HA? This question comes up as the documentation has best practices around the configuration of HA isolation addresses for stretched clusters. The documentation (both for vSAN as well as traditional stretched storage) states that you need to have two reliable addresses, one in each location.

Now I have had the above question multiple times as some folks have mentioned that they can use a Gateway Address with Cisco ACI which would still be accessible in both locations even if there’s a partition due to for instance an ISL failure. If that is the case, and the IP address is indeed available in both locations during those types of failure scenarios then it would suffice to use a single IP address as your isolation address.

You will however need to make sure that the IP address is reachable over the vSAN network when using vSAN as your stretched storage platform. (When vSAN is enabled vSphere HA uses the vSAN network for communications.) If it is reachable you can simply define the isolation address by setting the advanced setting “das.isolationaddress0”. It is also recommended to disable the use of the default gate of the management network by setting “das.usedefaultisolationaddress” to false for vSAN based environments.

I have requested the vSAN stretched clustering documentation to be updated to reflect this.

Performance Management Object reduced availability on stretched cluster

Duncan Epping · Jun 15, 2023 · Leave a Comment

I created a new lab environment not too long ago and I ran into this situation where the Performance Management Object showed up as Reduced Availability with no Rebuild in vSAN Skyline Health. This happened in my case because I created a Stretched Cluster configuration after I had already formed a cluster, which means that the performance management object was randomly placed across hosts without taking those “failure domains” into account. I completely forgot about it until someone on VMTN reminded me about this. I had two options, fix the existing perf database, or simply disable/enable the perf service to it is recreated.

As I had no data stored in the database I figured disable/enable is the easiest route. I looked for the option in vSphere 8.0 U1 but could not find it, it seems that the UI option no longer exists for whatever reason. How do I now disable/enable the service? Ruby vSphere Console (RVC) to the rescue!

When you log in to RVC you can simply run the following commands on the cluster object you want to disable/enable the performance service for. Fairly straight forward, and fixed the issue within a minute or so:

vsan.perf.stats_object_delete <cluster>
vsan.perf.stats_object_create <cluster>

I also documented this in the vSAN 8.0 ESA Deep Dive Book by the way, you can buy a paper copy or ebook on Amazon.

vSAN Stretched Cluster failure matrix

Duncan Epping · May 30, 2023 · 1 Comment

The last couple of weeks I was involved internally in a discussion around the different vSAN stretched cluster failure scenarios. I wrote a lengthy email about how vSAN and HA would respond in certain scenarios. I have documented many of these over the years on my blog already, but never really published them as a whole.

In some of the scenarios below, I discuss a “partition”, a partition is a scenario where both the L3 connection to the witness is down and the inter site / inter switch link to the other site for one of the locations. So in the diagram above for instance, if I say that Site B is partitioned then it means that Site A can still communicate with the witness, but Site B cannot communicate with the Witness and cannot communicate with Site A either.

For all of the below scenarios the following applies, Site A is the preferred location and Site B is the secondary location. When it comes to the table, the first two columns refer to the policy setting for the VM as shown in the screenshot below. The third column refers to the location where the VM runs from a compute perspective. The fourth discusses the type of failure, and the fifth and sixth columns discuss the behavior witnessed.

Time to list the various scenarios, and no, it doesn’t include all failures that could occur but should discuss most scenarios which are important for a stretched cluster configuration. Do note, the below-discussed behavior will only be witnessed when the best practices, as documented here and here, are followed. Also note that the table has multiple pages, there are close to 30 scenarios described! If there are any questions feel free to leave a comment, if you feel a failure scenario is missing, also please leave a comment.

Site Disaster Tolerance	Failures to Tolerate	VM Location	Failure	vSAN behavior	HA behavior
None Preferred	No data redundancy	Site A or B	Host failure Site A	Objects are inaccessible if failed host contained one or more components of objects	VM cannot be restarted as object is inaccessible
None Preferred	RAID-1/5/6	Site A or B	Host failure Site A	Objects are accessible as there's site local resiliency	VM does not need to be restarted, unless VM was running on failed host
None Preferred	No data redundancy / RAID-1/5/6	Site A	Full failure Site A	Objects are inaccessible as full site failed	VM cannot be restarted in Site B, as all objects reside in Site A
None Preferred	No data redundancy / RAID-1/5/6	Site B	Full failure Site B	Objects are accessible, as only Site A contains objects	VM can be restarted in Site A, as that is where all objects reside
None Preferred	No data redundancy / RAID-1/5/6	Site A	Partition Site A	Objects are accessible as all objects reside in Site A	VM does not need to be restarted
None Preferred	No data redundancy / RAID-1/5/6	Site B	Partition Site B	Objects are accessible in Site A, objects are not accessible in Site B as network is down	VM is restarted in Site A, and killed by vSAN in Site B
None Secondary	No data redundancy / RAID-1/5/6	Site B	Partition Site B	Objects are accessible in Site B	VM resides in Site B, does not need to be restarted
None Preferred	No data redundancy / RAID-1/5/6	Site A	Witness Host Failure	No impact, witness host is not used as data is not replicated	No impact
None Secondary	No data redundancy / RAID-1/5/6	Site B	Witness Host Failure	No impact, witness host is not used as data is not replicated	No impact
Site Mirroring	No data redundancy	Site A or B	Host failure Site A or B	Components on failed hosts inaccessible, read and write IO across ISL as no redundancy locally, rebuild across ISL	VM does not need to be restarted, unless VM was running on failed host
Site Mirroring	RAID-1/5/6	Site A or B	Host failure Site A or B	Components on failed hosts inaccessible, read IO locally due to RAID, rebuild locally	VM does not need to be restarted, unless VM was running on failed host
Site Mirroring	No data redundancy / RAID-1/5/6	Site A	Full failure Site A	Objects are inaccessible in Site A as full site failed	VM restarted in Site B
Site Mirroring	No data redundancy / RAID-1/5/6	Site A	Partition Site A	Objects are inaccessible in Site A as full site is partitioned and quorum is lost	VM restarted in Site B
Site Mirroring	No data redundancy / RAID-1/5/6	Site A	Witness Host Failure	Witness object inaccessible, VM remains accessible	VM does not need to be restarted
Site Mirroring	No data redundancy / RAID-1/5/6	Site B	Full failure Site A	Objects are inaccessible in Site A as full site failed	VM does not need to be restarted as it resides in Site B
Site Mirroring	No data redundancy / RAID-1/5/6	Site B	Partition Site A	Objects are inaccessible in Site A as full site is partitioned and quorum is lost	VM does not need to be restarted as it resides in Site B
Site Mirroring	No data redundancy / RAID-1/5/6	Site B	Witness Host Failure	Witness object inaccessible, VM remains accessible	VM does not need to be restarted
Site Mirroring	No data redundancy / RAID-1/5/6	Site A	Network failure between Site A and B (ISL down)	Site A binds with witness, objects in Site B becomes inaccessible	VM does not need to be restarted
Site Mirroring	No data redundancy / RAID-1/5/6	Site B	Network failure between Site A and B (ISL down)	Site A binds with witness, objects in Site B becomes inaccessible	VM restarted in Site A
Site Mirroring	No data redundancy / RAID-1/5/6	Site A or Site B	Network failure between Witness and Site A (or B)	Witness object absent, VM remains accessible	VM does not need to be restarted
Site Mirroring	No data redundancy / RAID-1/5/6	Site A	Full failure Site A, and simultaneous Witness Host Failure	Objects are inaccessible in Site A and Site B due to quorum being lost	VM cannot be restarted
Site Mirroring	No data redundancy / RAID-1/5/6	Site A	Full failure Site A, followed by Witness Host Failure a few minutes later	Pre vSAN 7.0 U3: Objects are inaccessible in Site A and Site B due to quorum being lost	VM cannot be restarted
Site Mirroring	No data redundancy / RAID-1/5/6	Site A	Full failure Site A, followed by Witness Host Failure a few minutes later	Post vSAN 7.0 U3: Objects are inaccessible in Site A, but accessible in Site B as votes have been recounted	VM restarted in Site B
Site Mirroring	No data redundancy / RAID-1/5/6	Site B	Full failure Site B, followed by Witness Host Failure a few minutes later	Post vSAN 7.0 U3: Objects are inaccessible in Site B, but accessible in Site A as votes have been recounted	VM restarted in Site A
Site Mirroring	No data redundancy	Site A	Full failure Site A, and simultaneous host failure in Site B	Objects are inaccessible in Site A, if components reside on failed host then object is inaccessible in Site B	VM cannot be restarted
Site Mirroring	No data redundancy	Site A	Full failure Site A, and simultaneous host failure in Site B	Objects are inaccessible in Site A, if components do not reside on failed host then object is accessible in Site B	VM restarted in Site B
Site Mirroring	RAID-1/5/6	Site A	Full failure Site A, and simultaneous host failure in Site B	Objects are inaccessible in Site A, accessible in Site B as there's site local resiliency	VM restarted in Site B