I posted about HA/DRS settings for VSAN Stretched Clustering yesterday, and earlier posted an intro to 6.1 and all the new functionality, which includes stretched clustering. As part of our VMworld session, Rawlinson Rivera recorded a nice demo. We figured we should share it with the world, so I added a voice-over so that at least it is clear what you are looking at and why certain things are configured in a specific way. I hope this demo shows how dead simple it is to configure VSAN stretched clustering, and how it handles a full site failure. Enjoy,
Edme says
Nice demonstration; however, I find the end very rushed. It's nice to know that VSAN resynchronizes when both sites reconnect, but that's not what is worrying me. What happens in the case of prolonged downtime of a site? Will VSAN automatically start rebuilding the missing components when the default 60-minute timeout expires? Based on the information I have read so far, it will not, because stretched clusters have a maximum FTT of 1 and the fault domain forces VSAN to keep the second copy in the unreachable site. Hopefully I'm wrong on this, but I would appreciate it if you could shed some light on this scenario.
Duncan Epping says
If the site doesn’t return, then VSAN won’t rebuild, as there is no fault domain to rebuild to. It is FTT=1, which means two “data domains” and a witness. If a “data domain” is gone, then it is difficult to rebuild. Yes, I have put in a feature request for some form of nested fault domains, where you have FTT=1 within a site and across sites; if and when this will make it in is something I cannot comment on at this point.
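To make the availability math concrete: with FTT=1 in a stretched cluster, each object has one data component per site plus a witness, and the object stays accessible while a majority of votes are reachable. A minimal sketch of that quorum rule (the function and component names are illustrative, not VSAN APIs):

```python
def object_accessible(components_up):
    """components_up: dict of component name -> bool (reachable?).
    Each component carries one vote in this simplified model; an object
    needs more than half of the votes to remain accessible."""
    votes_up = sum(components_up.values())
    return votes_up > len(components_up) / 2

# Full site failure: one data copy lost, other copy + witness remain.
print(object_accessible({"site_a_copy": False, "site_b_copy": True, "witness": True}))   # True

# Site AND witness lost at the same time: only one of three votes left.
print(object_accessible({"site_a_copy": False, "site_b_copy": True, "witness": False}))  # False
```

This is why a full site failure is survivable, but the surviving copy alone cannot be re-protected: there is no third fault domain left to rebuild the lost component into.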
edme says
If I may make a suggestion: present nested FTT to the user as must-obey and should-obey data domains, like the HA rules. VMware admins are familiar with the terms “should” and “must”, and in general they have the same result when applied. “Nested FTT” sounds more complex, and that goes a bit against the simplicity of VSAN.
With must-obey, VSAN waits until the data domain returns, or perhaps until a new one is configured. With should-obey, if the data domain doesn’t return within the timeout, VSAN starts ignoring the configured data domains and remediating the VMs against their storage policy. When the data domain returns, it resyncs the VMs, deletes the additional copy, and starts obeying the data domains again.
The remediation process also brings a risk of there not being enough disk space to remediate all VMs, and/or the remaining hosts getting overloaded by the I/O. The best solution would be to extend the storage policy with two options for degraded stretched clusters: a disk space reservation so that remediation is possible, and a rebuild priority with an option to not remediate VMs.
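Edme's proposed should-obey/must-obey behavior can be sketched as a small decision function. To be clear, this is a mock-up of the *suggested* feature, not actual VSAN behavior; all names here are hypothetical:

```python
from dataclasses import dataclass

REPAIR_DELAY_MIN = 60  # mirrors the default 60-minute rebuild timeout


@dataclass
class FaultDomainRule:
    mode: str  # "must" or "should" (hypothetical rule types, as proposed)


def remediation_action(rule, minutes_down):
    """Decide what to do for a VM whose fault domain has been down for
    `minutes_down` minutes, under the proposed rule semantics."""
    if rule.mode == "must":
        return "wait"  # must-obey: always wait for the fault domain to return
    if minutes_down < REPAIR_DELAY_MIN:
        return "wait"  # should-obey, but still within the timeout
    return "rebuild_locally"  # should-obey: ignore the rule, re-protect the VM


print(remediation_action(FaultDomainRule("must"), 120))    # wait
print(remediation_action(FaultDomainRule("should"), 30))   # wait
print(remediation_action(FaultDomainRule("should"), 120))  # rebuild_locally
```

The sketch also makes Duncan's objection below visible: in either mode, nothing re-protects the VM during the first 60 minutes of a site failure.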
Duncan Epping says
What the implementation looks like, or how it will be worded in the UI, remains to be seen. I will pass your suggestion on to the PM team to see what is possible. Your suggestion sounds reasonable; the challenge, though, is that it doesn’t protect you against a failure within the first 60 minutes of a site failure, hence my suggestion to do nested domains. Especially with the upcoming “erasure coding” functionality, additional availability could potentially be achieved at little cost.
Duncan Epping says
With regards to the end being rushed: yes, we had to limit the time, mainly because most people lose interest after 8-10 minutes of watching a screen. It is very easy to test out and play with, and I highly encourage you to do so.
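For anyone playing with this in a lab: the 60-minute rebuild delay discussed above is exposed as the VSAN ClomRepairDelay advanced setting on each host. Assuming an ESXi shell, it can be inspected and (for test purposes only) shortened roughly like this:

```shell
# Show the current repair delay (default is 60 minutes)
esxcfg-advcfg -g /VSAN/ClomRepairDelay

# Lab testing only: shorten the delay to 30 minutes on this host,
# then restart clomd so the change takes effect
esxcfg-advcfg -s 30 /VSAN/ClomRepairDelay
/etc/init.d/clomd restart
```

The setting should be kept consistent across all hosts in the cluster, and in a stretched cluster a shorter delay still will not trigger a rebuild when the remote fault domain is the one that is gone, as explained above.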
Brian Garrett says
Quick question on this. Is it possible to have multiple witnesses to provide additional failure points should a witness and a location go down at the same time?
Edme says
Three fault domains is the maximum for a stretched cluster in VSAN 6.1.
Duncan Epping says
No, you can only have one witness host. You can, however, introduce a new witness if the current one has failed; this will recreate the witness components relatively fast, which removes the risk you mention.
Joseph Guan says
How is networking handled in the case of an entire site going dark? Do you have VLANs stretched over multiple sites, and/or does NSX work with it?
jose B says
Hi, Duncan
Is there no way to configure not-so-critical VMs to have FTT=1 but be replicated locally within the same failure domain? I guess the answer is no; in that case, are there any plans to have that option in the future?
Duncan Epping says
There is no way to do this. I can’t comment on futures, unfortunately. I suggest reaching out to a VMware pre-sales person for a roadmap.