
Yellow Bricks

by Duncan Epping


stretched cluster

vSAN Stretched Cluster failure scenarios and component votes

Duncan Epping · Apr 3, 2019 ·

I was at a customer last week and had an interesting question about the vSAN voting mechanism. This customer had a stretched cluster and used RAID-5 within each location to protect the data, on top of replicating across locations. During certain failure scenarios the data unexpectedly remained available. Of course, it is great that you have higher availability than expected, but why did this happen? What this customer tested was powering off the Witness (which is deemed a site failure) and then powering off 2 hosts in 1 location, which exceeds the "failures to tolerate" in a single location. You would expect, based on all documentation so far, that the data would be unavailable. Well, for some VMs this was the case, but for others it was not. Why is this? It is all about the vote count. Look at the below diagram and the number of votes for each component first.

In the above scenario, if the Witness (W) fails we lose 4 votes. Out of a total of 13 that is not a problem. If two additional hosts fail, this is most likely still not a problem, even though you are exceeding the configured "failures to tolerate". However, if by any chance host1 is one of those failed hosts, then you would lose quorum: host1 has a component with 2 votes. So if host1 has failed, the Witness has failed, and for instance host2 has failed as well, you have now lost 7 out of 13 votes, which means quorum is lost. Please note that which component gets 2 votes is random; for a different VM/object it could be the component placed on host6 or host7 that holds 2 votes.

Another thing to point out: if hosts 5 through 8 all failed, the data would still be available. However, if host3 and host4 then also failed, the object would become unavailable. Even though you would still have quorum across locations, you have now also exceeded the specified "failures to tolerate" within the location, and this is also taken into account.
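To make the arithmetic concrete, here is a minimal sketch (plain Python, not vSAN code) of the two rules above. The vote layout is the example from the diagram: one randomly chosen data component carrying 2 votes (on host1 here), the witness carrying 4, and 13 votes in total. The host names and exact vote assignment are assumptions for illustration only.

```python
# A minimal sketch (not vSAN code) of the two accessibility rules described
# above, using the assumed example layout: RAID-5 in each site, 13 votes total.
votes = {
    "host1": 2, "host2": 1, "host3": 1, "host4": 1,   # site A, local RAID-5
    "host5": 1, "host6": 1, "host7": 1, "host8": 1,   # site B, local RAID-5
    "witness": 4,                                      # witness site
}
site_a = {"host1", "host2", "host3", "host4"}
site_b = {"host5", "host6", "host7", "host8"}
LOCAL_FTT = 1  # RAID-5 tolerates one host failure within a site

def accessible(failed):
    """True if the object stays accessible for this set of failed hosts."""
    total = sum(votes.values())
    surviving = sum(v for owner, v in votes.items() if owner not in failed)
    has_quorum = surviving > total / 2     # rule 1: strict majority of votes
    # rule 2: at least one data site must still have a readable local stripe,
    # i.e. its failures must not exceed the local failures to tolerate
    readable = any(len(failed & site) <= LOCAL_FTT for site in (site_a, site_b))
    return has_quorum and readable

print(accessible({"witness", "host2", "host3"}))   # True: 7 of 13 votes survive
print(accessible({"witness", "host1", "host2"}))   # False: only 6 votes survive
print(accessible(site_b | {"host3", "host4"}))     # False: quorum holds (7 votes),
                                                   # but site A exceeds its FTT too
```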

I hope that helps.

vSAN Stretched Cluster: PFTT and SFTT what happens when a full site fails and multiple hosts fail?

Duncan Epping · Mar 19, 2018 ·

This question was asked on the VMTN community forum and it is a very valid question. Our documentation explains this scenario, but only to a certain level and it seems to be causing some confusion as we speak. To be honest, it is fairly complex to understand. Internally we had a discussion with engineering about it and it took us a while to grasp it. As the documentation explains, the failure scenarios are all about maintaining quorum. If quorum is lost, the data will become inaccessible. This makes perfect sense, as vSAN will always aim to protect the consistency and reliability of data first.

So how does this work? Well, when creating a policy for a stretched cluster you specify Primary Failures To Tolerate (PFTT) and Secondary Failures To Tolerate (SFTT). PFTT can be seen as "site failures", and you can tolerate at most 1. SFTT can be seen as host failures, and you can define this between 0 and 3, where we by far see SFTT=1 (RAID-1 or RAID-5) and SFTT=2 (RAID-6) the most. Now, if you have 1 full site failure, then on top of that you can tolerate SFTT host failures. So if you have SFTT=1, then 2 host failures in the surviving site would result in data becoming inaccessible.

Where this gets tricky is when the Witness fails. Why? Well, because the witness is seen as a site failure. This means that if the witness is down and you then have, let's say, 2 hosts failing in Data Site A and 1 host failing in Data Site B, while you had SFTT=2 assigned to your components, the impacted objects will become inaccessible, simply because you exceeded PFTT and SFTT. I hope that makes sense? Let's show that in a diagram (borrowed from our documentation) for different failures. I suggest you do a "vote count" so that it is obvious why this happens. The total vote count is 9, which means that the object will remain accessible as long as the remaining vote count is 5 or higher.

Now that the witness has failed, as shown in the next diagram, we lose 3 of the total 9 votes. No problem, as we only need 5 votes to retain access to the data.

In the next diagram another host has failed in the environment; we have now lost 4 votes out of the 9, which means we still have 5 out of 9 and as such retain access.

And there we go: in the next diagram we have lost yet another host, in this case in the same location as the first failed host, but this could also have been a host in the secondary site. Either way, this means we only have 4 votes left out of the 9. We needed 5 at a minimum, which means we now lose access to the data for the impacted objects. As stated earlier, vSAN does this to avoid any type of corruption/conflicts.
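For those who want to check the arithmetic, below is a tiny sketch of the vote count above. The layout (3 witness votes plus 6 single-vote data components, 9 votes in total) is an assumption based on the documentation diagram.

```python
# A small sketch of the vote arithmetic above (not vSAN code). Assumed layout:
# 3 witness votes plus 6 data components of 1 vote each, 9 votes in total.
# The object stays accessible while the surviving votes form a strict majority.
total_votes = 9
witness_votes = 3
needed = total_votes // 2 + 1          # strict majority: 5 of 9

scenarios = [
    ("witness fails",          witness_votes),      # 6 votes remain
    ("witness + 1 host fail",  witness_votes + 1),  # 5 votes remain, still OK
    ("witness + 2 hosts fail", witness_votes + 2),  # 4 votes remain, quorum lost
]
for description, lost in scenarios:
    remaining = total_votes - lost
    state = "accessible" if remaining >= needed else "inaccessible"
    print(f"{description}: {remaining} of {total_votes} votes -> {state}")
```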

The same applies to RAID-6 of course. With RAID-6 as stated you can tolerate 1 full site failure and 2 host failures on top of that, but if the witness fails this means you can only lose 1 host in each of the sites before data may become inaccessible. I hope this helps those people running through failure scenarios.

New whitepaper available: vSphere Metro Storage Cluster Recommended Practices (6.5 update)

Duncan Epping · Oct 24, 2017 ·

I had many requests for an updated version of this paper, so the past couple of weeks I have been working hard on it. The paper was outdated, as it was last updated around the vSphere 6.0 timeframe, and that was only a minor update. I looked at every single section and added new statements and guidance, for instance around vSphere HA Restart Priority. So for those running a vSphere Metro Storage Cluster / Stretched Cluster of some kind, please read the brand new vSphere Metro Storage Cluster Recommended Practices (6.5 update) white paper.

It is available on storagehub.vmware.com as a PDF and for reading within your browser. If you have any questions or comments, please do not hesitate to leave them here.

  • vSphere Metro Storage Cluster Recommended Practices online
  • vSphere Metro Storage Cluster Recommended Practices PDF

 

Sizing a vSAN Stretched Cluster

Duncan Epping · May 30, 2017 ·

I have had this question a couple of times already: how many hosts do I need per site when the Primary FTT is set to 1, the Secondary FTT is set to 1, and RAID-5 is used as the Failure Tolerance Method? The answer is straightforward: you have a RAID-5 set locally in each site. RAID-5 is a 3+1 configuration, meaning 3 data blocks and 1 parity block, so each site will need 4 hosts at a minimum. If the requirement is PFTT=1 and SFTT=1 with the Failure Tolerance Method (FTM) set to RAID-5, then the vSAN Stretched Cluster configuration will be 4+4+1. Note that when you use RAID-1 you will also need at minimum 3 hosts per site, because locally you will have 2 "data" components and 1 witness component.

From a capacity standpoint, if you have a 100GB VM with PFTT=1, SFTT=1 and FTM set to RAID-1, then you have a local RAID-1 set in each site, which means the 100GB requires 200GB in each location. So that is 200% required local capacity, and 400% for the total cluster. Using the below table you can easily see the overhead. Note that RAID-5 and RAID-6 are only available when using all-flash.

I created a quick table to help those going through this exercise. I did not include FTT=3, as in practice it is not used very often in stretched configurations.

Description | PFTT | SFTT | FTM | Hosts per site | Stretched Config | Single site capacity | Total cluster capacity
--- | --- | --- | --- | --- | --- | --- | ---
Standard Stretched across locations with local protection | 1 | 1 | RAID-1 | 3 | 3+3+1 | 200% of VM | 400% of VM
Standard Stretched across locations with local RAID-5 | 1 | 1 | RAID-5 | 4 | 4+4+1 | 133% of VM | 266% of VM
Standard Stretched across locations with local RAID-6 | 1 | 2 | RAID-6 | 6 | 6+6+1 | 150% of VM | 300% of VM
Standard Stretched across locations no local protection | 1 | 0 | RAID-1 | 1 | 1+1+1 | 100% of VM | 200% of VM
Not stretched, only local RAID-1 | 0 | 1 | RAID-1 | 3 | n/a | 200% of VM | n/a
Not stretched, only local RAID-5 | 0 | 1 | RAID-5 | 4 | n/a | 133% of VM | n/a
Not stretched, only local RAID-6 | 0 | 2 | RAID-6 | 6 | n/a | 150% of VM | n/a
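As a rough cross-check of the numbers in the table above, here is a small back-of-the-envelope sketch (my own assumptions, not an official vSAN sizing tool) that derives the minimum hosts per site and the capacity factors from the layouts described in the text: RAID-1 as SFTT+1 mirror copies plus local witness components, RAID-5 as 3+1, and RAID-6 as 4+2.

```python
# A back-of-the-envelope sketch (assumptions only, not an official vSAN sizer)
# that reproduces the table above for a 100GB VM.

def site_requirements(sftt, ftm):
    """Minimum hosts in one site and the local capacity factor for a policy."""
    if ftm == "RAID-1":
        hosts = 2 * sftt + 1          # sftt+1 mirror copies plus witness components
        factor = sftt + 1             # e.g. SFTT=1 -> 200% of the VM locally
    elif ftm == "RAID-5":
        hosts, factor = 4, 4 / 3      # 3 data + 1 parity -> ~133%
    elif ftm == "RAID-6":
        hosts, factor = 6, 6 / 4      # 4 data + 2 parity -> 150%
    else:
        raise ValueError(f"unknown FTM: {ftm}")
    return hosts, factor

def stretched_sizing(pftt, sftt, ftm, vm_gb=100):
    hosts, factor = site_requirements(sftt, ftm)
    sites = 2 if pftt == 1 else 1     # PFTT=1 keeps a full copy in each data site
    per_site_gb = round(vm_gb * factor)
    return {
        "config": f"{hosts}+{hosts}+1" if pftt == 1 else "n/a",
        "per_site_gb": per_site_gb,
        "total_gb": per_site_gb * sites,
    }

print(stretched_sizing(1, 1, "RAID-1"))  # {'config': '3+3+1', 'per_site_gb': 200, 'total_gb': 400}
print(stretched_sizing(1, 1, "RAID-5"))  # {'config': '4+4+1', 'per_site_gb': 133, 'total_gb': 266}
print(stretched_sizing(1, 2, "RAID-6"))  # {'config': '6+6+1', 'per_site_gb': 150, 'total_gb': 300}
```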

Hope this helps!

vSAN 6.6 Stretched Cluster Demo

Duncan Epping · May 19, 2017 ·

I had one more demo to finish and share, and that is the vSAN 6.6 stretched cluster demo. I already did a stretched clustering demo when we initially released the feature, but with the enhanced functionality around local protection I figured I would re-record it. In this demo (~12 minutes) I show you how to configure vSAN 6.6 with dedupe/compression enabled in a Stretched Cluster configuration. I also create 3 VM Storage Policies, assign those to VMs, and show you that vSAN has placed the data across locations. I hope you find it useful.

