
Yellow Bricks

by Duncan Epping


Does vSAN Data Protection work with vSAN Stretched Clusters and can snapshots be stretched?

Duncan Epping · Oct 18, 2024

I have written a few articles about vSAN Data Protection now, and my last article featured a nice vSAN DP demo. A very good question was asked in the comment section about vSAN Stretched Clusters: are snapshots also stretched across locations? This is a great question, as there are a couple of things worth explaining again.

vSAN Data Protection relies on the snapshot capability that was introduced with vSAN ESA. This snapshot capability in vSAN ESA is significantly different from that of vSAN OSA or VMFS. With vSAN OSA and VMFS, when you create a snapshot, a new object (vSAN) or file (VMFS) is created. With vSAN ESA this is no longer the case: we don't create additional files or objects, we create a copy of the metadata structure instead. This is why vSAN ESA snapshots perform much better than vSAN OSA or VMFS snapshots do, as we no longer need to traverse multiple files or objects to read data. We can simply use the same object and leverage the metadata structure to keep track of what has changed.
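To make the difference concrete, here is a small sketch of the two read paths. This is purely illustrative: the dict-based structures are my own stand-ins, not VMware's on-disk format.

# Classic model (vSAN OSA / VMFS): every snapshot adds a delta file/object,
# so a read may have to walk the whole chain to find the newest version.
def read_classic(chain, block):
    for delta in reversed(chain):       # newest delta first
        if block in delta:
            return delta[block]
    return None                         # block was never written

# vSAN ESA model: a snapshot is a copy of the metadata (block -> data) of
# the same object; a read is a single lookup, no chain traversal.
def take_esa_snapshot(live_metadata):
    return dict(live_metadata)          # copy only the lookup table

def read_esa(metadata, block):
    return metadata.get(block)          # one lookup within one object

# Example: a base disk plus two snapshot deltas in the classic model.
chain = [{0: "A", 1: "B"}, {1: "B2"}, {0: "A3"}]
print(read_classic(chain, 1))           # "B2", found after walking the chain

live = {0: "A3", 1: "B2"}
snap = take_esa_snapshot(live)
live[0] = "A4"                          # new writes only touch live metadata
print(read_esa(snap, 0))                # "A3", the snapshot view is preserved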

Now, with vSAN, as most of you hopefully know, objects (and their components) are placed across the cluster based on what is specified within the storage policy that is associated with the object or VM. In other words, if the policy states FTT=1 and RAID-1, then you will see 2 copies of the data. If the policy states the data needs to be stretched across locations, and within each location be protected with RAID-5, then you will see a RAID-1 configuration across sites and a RAID-5 configuration within each site. As vSAN ESA snapshots are an integral part of the object, the snapshots automatically follow all requirements as defined within the policy. In other words, if the policy says stretched, then the snapshot will also automatically be stretched.
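As a rough illustration of those placement rules, consider the sketch below; the policy encoding is entirely my own and far simpler than the real placement logic.

def layout(stretched, site_local_raid):
    # Stretched: RAID-1 across the two sites, plus the requested RAID
    # level within each individual site.
    if stretched:
        return {"across sites": "RAID-1",
                "preferred site": site_local_raid,
                "secondary site": site_local_raid}
    return {"single site": site_local_raid}

print(layout(False, "RAID-1 (FTT=1, 2 copies)"))
print(layout(True, "RAID-5"))
# ESA snapshots are part of the object itself, so whatever layout the
# policy produces, the snapshots follow it automatically.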

There is one caveat I want to call out, and for that I want to show a diagram. The diagram below shows the Data Protection appliance, aka the snapshot manager appliance. As you can see, it states "metadata decoupled from appliance" and it links to a global namespace object. This global namespace object is where all the details of the protected VMs (and more) are stored. As you can imagine, both the Snapshot Manager appliance and the global namespace object should also be stretched. For the global namespace object this means that you need to ensure that the default datastore policy is set to "stretched", and for the snapshot manager appliance you can simply select the correct policy when provisioning the appliance. Either way, make sure the default datastore policy aligns with your disaster recovery and data protection policy.

[Diagram: the Data Protection (snapshot manager) appliance, with its metadata decoupled from the appliance and linked to the global namespace object]
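If you want to guard against this caveat programmatically, a simple pre-flight check along the lines below could help. Note that this is a hypothetical sketch: the policy names and the way you would retrieve them are placeholders, not a VMware API.

REQUIRED = "stretched"

def validate_dp_setup(default_datastore_policy, appliance_policy):
    # Both the global namespace object (default datastore policy) and the
    # snapshot manager appliance need a stretched policy.
    problems = []
    if REQUIRED not in default_datastore_policy.lower():
        problems.append("default datastore policy is not stretched")
    if REQUIRED not in appliance_policy.lower():
        problems.append("snapshot manager appliance policy is not stretched")
    return problems or ["OK: namespace object and appliance are both stretched"]

print(validate_dp_setup("Stretched - RAID-5", "Stretched - RAID-1"))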

I hope this helps those exploring vSAN Data Protection in a stretched cluster configuration!

vSAN Data Protection Demo!

Duncan Epping · Oct 16, 2024

I created a vSAN Data Protection demo for VMware Explore in Las Vegas and Barcelona, and as I got some questions from people about vSAN Data Protection, I figured I would share the demo on YouTube and post it on my blog. I have written a few articles on vSAN Data Protection already, and I am starting to notice that some customers are testing it and even using it in production where possible. As mentioned before, vSAN Data Protection is only available with vSAN ESA, as it leverages the new snapshotting capabilities. The brand-new UI is introduced through the Snap Manager Appliance, so make sure to download, deploy, and configure that first.

Stretched cluster witness failure resilience in vSAN 7.0

Duncan Epping · Mar 17, 2022

Cormac and I have been busy the past couple of weeks updating the vSAN Deep Dive to 7.0 U3. Yes, there is a lot to update and add, but we are actually going through it at a surprisingly rapid pace. I guess it helps that we had already written dozens of blog posts on the various topics we need to update or add. One of those topics is “witness failure resilience” which was introduced in vSAN 7.0 U3. I have discussed it before on this blog (here and here) but I wanted to share some of the findings with you folks as well before the book is published. (No, I do not know when the book will be available on Amazon just yet!)

In the scenario below, we failed the secondary site of our stretched cluster completely. We can examine the impact of this failure through RVC on vCenter Server. This will provide us with a better understanding of the situation and how the witness failure resilience mechanism actually works. Note that the below output has been truncated for readability reasons. Let’s take a look at the output of RVC (vsan.vm_object_info) for our VM directly after the failure.

VM R1-R1:
  Disk backing:
    [vsanDatastore] 0b013262-0c30-a8c4-a043-005056968de9/R1-R1.vmx
      RAID_1
        RAID_1
          Component: 0b013262-c2da-84c5-1eee-005056968de9, host: 10.202.25.221
            votes: 1, usage: 0.1 GB, proxy component: false
          Component: 0b013262-3acf-88c5-a7ff-005056968de9, host: 10.202.25.201
            votes: 1, usage: 0.1 GB, proxy component: false
        RAID_1
          Component: 0b013262-a687-8bc5-7d63-005056968de9, host: 10.202.25.238
            votes: 1, usage: 0.1 GB, proxy component: true
          Component: 0b013262-3cef-8dc5-9cc1-005056968de9, host: 10.202.25.236
            votes: 1, usage: 0.1 GB, proxy component: true
      Witness: 0b013262-4aa2-90c5-9504-005056968de9, host: 10.202.25.231
        votes: 3, usage: 0.0 GB, proxy component: false
      Witness: 47123362-c8ae-5aa4-dd53-005056962c93, host: 10.202.25.214
        votes: 1, usage: 0.0 GB, proxy component: false
      Witness: 0b013262-5616-95c5-8b52-005056968de9, host: 10.202.25.228
        votes: 1, usage: 0.0 GB, proxy component: false

As can be seen, the witness component holds 3 votes, the two components on the failed site (secondary) hold 1 vote each, and the two components on the surviving data site (preferred) hold 1 vote each. After the full site failure has been detected, the votes are recalculated to ensure that a witness host failure does not impact the availability of the VMs. The RVC output below shows the result.

VM R1-R1:
  Disk backing:
    [vsanDatastore] 0b013262-0c30-a8c4-a043-005056968de9/R1-R1.vmx
      RAID_1
        RAID_1
          Component: 0b013262-c2da-84c5-1eee-005056968de9, host: 10.202.25.221
            votes: 3, usage: 0.1 GB, proxy component: false
          Component: 0b013262-3acf-88c5-a7ff-005056968de9, host: 10.202.25.201
            votes: 3, usage: 0.1 GB, proxy component: false
        RAID_1
          Component: 0b013262-a687-8bc5-7d63-005056968de9, host: 10.202.25.238
            votes: 1, usage: 0.1 GB, proxy component: false
          Component: 0b013262-3cef-8dc5-9cc1-005056968de9, host: 10.202.25.236
            votes: 1, usage: 0.1 GB, proxy component: false
      Witness: 0b013262-4aa2-90c5-9504-005056968de9, host: 10.202.25.231
        votes: 1, usage: 0.0 GB, proxy component: false
      Witness: 47123362-c8ae-5aa4-dd53-005056962c93, host: 10.202.25.214
        votes: 3, usage: 0.0 GB, proxy component: false

As can be seen, the votes for the various components have changed. Each component on the surviving data site now has 3 votes instead of 1, the witness on the witness host went from 3 votes down to 1, and on top of that, the witness that is stored in the surviving fault domain went from 1 vote to 3. This results in a situation where quorum would not be lost even if the witness component on the witness host is impacted by a failure. A very useful enhancement to vSAN 7.0 Update 3 for stretched cluster configurations if you ask me.
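For those who like to see the arithmetic, below is a minimal sketch of the quorum math implied by the two outputs above. An object stays accessible as long as the reachable components hold a strict majority of all votes; I am assuming here that one of the two site-local witnesses lives in each site, which is what the before/after outputs suggest.

def available(accessible_votes, total_votes):
    # vSAN quorum rule: a strict majority of all votes must be accessible.
    return accessible_votes > total_votes / 2

# Before recalculation: preferred components 1+1, secondary components 1+1,
# witness host 3, site-local witnesses 1 (preferred) + 1 (secondary) = 9 total.
total = 9
print(available(1 + 1 + 3 + 1, total))  # True: secondary down, witness host up
print(available(1 + 1 + 1, total))      # False: witness host fails as well

# After recalculation: preferred components 3+3, preferred site-local witness 3,
# witness host 1, failed secondary components 1+1 = 12 total.
total = 12
print(available(3 + 3 + 3, total))      # True: survives a witness host failure too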

vSAN 7.0 U3 feature overview

Duncan Epping · Sep 28, 2021

In this blog post, I want to go over the features which have been released for vSAN as part of 7.0 U3. It is not going to be a deep dive, just a simple overview, as most features speak for themselves! Let's list the features first, and then discuss some of them individually.

  • Cluster Shutdown feature
  • vLCM support for Witness Appliance
  • Skyline Health Correlation
  • IO Trip Analyzer
  • Nested Fault Domains for 2-Node Clusters
  • Enhanced Stretched Cluster durability
  • Access Based Enumeration for SMB shares via vSAN File Services

I guess the Cluster Shutdown feature speaks for itself. It basically enables you to power off all the hosts in a cluster when doing maintenance, even if those hosts contain vCenter Server! If you want to trigger a shutdown, just right-click the cluster object, go to vSAN, select "shutdown cluster", and follow the 2-step wizard. Pretty straightforward. Do note that, besides the agent VMs, you will need to power off the other VMs first, as sketched below. (Yes, I have requested that the process handle this as well in the future!)
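Conceptually, that ordering constraint looks like the snippet below. The VM objects and their attributes are invented stand-ins for illustration, not the vSphere API.

from dataclasses import dataclass

@dataclass
class VM:
    name: str
    is_agent: bool          # agent VMs (e.g. vCLS) are handled by the wizard
    powered_on: bool = True

    def power_off(self):
        self.powered_on = False

def prepare_cluster_shutdown(vms):
    # Administrators power off the regular VMs; the shutdown wizard then
    # takes care of the agent VMs and the hosts themselves.
    for vm in vms:
        if not vm.is_agent and vm.powered_on:
            vm.power_off()
    return [vm.name for vm in vms if vm.powered_on]

vms = [VM("vCLS-1", is_agent=True), VM("app01", is_agent=False)]
print(prepare_cluster_shutdown(vms))    # ['vCLS-1'] is left for the wizard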

The Skyline Health Correlation feature is very useful for customers who see multiple alarms being triggered and are not sure what to do. Starting with 7.0 U3, vSAN understands the correlation between the events, informs you what the issue most likely is, and shows you which other health checks it impacts. This should enable you to fix the problem faster than before.

IO Trip Analyzer is also brand new in 7.0 U3. I actually wrote a separate blog post on the subject and included a demo; I would recommend watching that one. But if you just want the short summary: the IO Trip Analyzer provides an overview of the latency introduced at every layer, in the form of a diagram!

Nested Fault Domains for 2-node clusters has been on the "wish list" of some of our customers for a while. It is a very useful feature for those customers who want to be able to tolerate multiple failures even in a 2-node configuration. The feature requires you to have at least 3 disk groups in each of the 2 hosts, and will then enable you to have RAID-1 across those two hosts and RAID-1 within each host (or RAID-5 if you have sufficient disk groups). If a host fails, and then a disk fails in the surviving host, the VMs would still be available. Basically a feature for customers who don't need a lot of compute power (and therefore not 3 hosts or more), but do need added resiliency!
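A toy model of that nested layout, ignoring witnesses and quorum and with all names invented, shows why this double failure is survivable:

# Nested 2-node layout: RAID-1 across the hosts, RAID-1 across disk
# groups within each host, so four copies of the data in total.
replicas = {
    ("host-1", "diskgroup-1"), ("host-1", "diskgroup-2"),
    ("host-2", "diskgroup-1"), ("host-2", "diskgroup-2"),
}

def survives(failed_host=None, failed_dg=None):
    remaining = {(h, d) for h, d in replicas
                 if h != failed_host and (h, d) != failed_dg}
    return len(remaining) > 0           # at least one intact copy left

# Host failure followed by a disk group failure in the surviving host:
print(survives(failed_host="host-1", failed_dg=("host-2", "diskgroup-1")))  # True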

Enhanced Stretched Cluster durability (which also applies to 2-node) is a feature that Cormac and I requested a while back. We requested this feature as we had heard from a few customers that, unfortunately, they had found themselves in a situation where a datacenter went offline, followed by the witness going down. This would then result in the VMs (only those which were stretched, of course) in the remaining location also becoming unusable, as 2 out of the 3 parts of the RAID tree would be gone. This would even be the case in a situation where you had a fully available RAID-1 / RAID-5 / RAID-6 tree in the remaining datacenter. This new feature prevents that scenario!

Last, but not least, we now have support for Access Based Enumeration for SMB shares via vSAN File Services. What does this mean? Before vSAN 7.0 U3, if a user had access to a file share, the user would be able to see all folders/directories in that share. Starting with 7.0 U3, only the folders for which you have the appropriate permissions will be displayed! (More about ABE here)
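Conceptually, ABE is just a permission-based filter on the directory listing. A minimal sketch, with a made-up ACL table, of the before and after behaviour:

# Which users may read which top-level folders of the share (invented data).
folders = {"finance": {"alice"}, "engineering": {"bob"}, "public": {"alice", "bob"}}

def list_share(user, abe_enabled=True):
    if not abe_enabled:                  # pre-7.0 U3 behaviour: show everything
        return sorted(folders)
    return sorted(f for f, acl in folders.items() if user in acl)

print(list_share("bob", abe_enabled=False))  # ['engineering', 'finance', 'public']
print(list_share("bob"))                     # ['engineering', 'public']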

Capacity Overview in vSAN 6.7 U3

Duncan Epping · Sep 20, 2019

I just wanted to do a short post before the weekend: vSAN 6.7 U3 introduces a great capacity overview screen. It shows a couple of things. First of all, it provides a simple bar that shows "data written", "reserved space", and "free space". The second section provides you the ability to figure out what happens to your capacity consumption if you were to change the policy on all VMs. The third section gives you a nice breakdown of the capacity per category and a great circular diagram which immediately shows what kind of data is consuming your capacity. Very useful!
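That second, "what if", section essentially multiplies the data written by the overhead factor of the chosen policy. A back-of-the-envelope version, using commonly cited vSAN overhead factors (the numbers here are illustrative only):

# Approximate raw-capacity multipliers for a few common policies.
FACTORS = {"RAID-1 FTT=1": 2.0, "RAID-5 FTT=1": 1.33, "RAID-1 FTT=2": 3.0}

def projected_usage(written_gb, policy):
    return written_gb * FACTORS[policy]

for policy in FACTORS:
    print(f"{policy}: 1000 GB written -> {projected_usage(1000, policy):.0f} GB consumed")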

