
Yellow Bricks

by Duncan Epping


7.0 U3

Doing network/ISL maintenance in a vSAN stretched cluster configuration!

Duncan Epping · Nov 21, 2023

I got a question earlier about maintenance of the ISL (inter-site link) in a vSAN Stretched Cluster configuration, which had me thinking for a while. The question was what to do with your workload during the maintenance. The easiest option, of course, is to power off all VMs and simply shut down the cluster; vSAN has a UI option for this, and there's a KB article you can follow. But of course, there could also be situations where the VMs need to remain running. How does this work when you end up losing the connection between all three locations? Normally this would lead to a situation where all VMs become "inaccessible", as you end up losing quorum.

As said, this had me thinking: you could take advantage of the "vSAN Witness Resiliency" mechanism that was introduced in vSAN 7.0 U3. How would this work?

Well, it is actually pretty straightforward: if all hosts of one site are in maintenance mode, failed, or powered off, the votes of the witness object for each VM/object are recalculated within 3 minutes. Once this recalculation has completed, the witness can go down without any impact on the VMs. We introduced this capability to increase resiliency in a double-failure scenario, but we can (ab)use this functionality during maintenance as well. Of course I had to test this, so the first step I took was placing all hosts in one location into maintenance mode (no data evacuation). This resulted in all my VMs being vMotioned to the other site.
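As a side note, the "no data evacuation" option can also be set from the command line; a minimal sketch using esxcli, run on each host in the site under maintenance:

# enter maintenance mode without evacuating vSAN data
# ("No data evacuation" in the UI maps to the noAction vSAN mode)
esxcli system maintenanceMode set --enable true --vsanmode noAction

# once the ISL maintenance is done and the witness is powered on again,
# take the host out of maintenance mode
esxcli system maintenanceMode set --enable false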

Next, I checked with RVC whether my votes had been recalculated. As stated, depending on the number of VMs this can take around 3 minutes in total, but it will usually be quicker. After the recalculation had completed I powered off the witness, and the result was that all VMs were still running.

Of course, I had to double-check on the command line using RVC (you can use the command "vsan.vm_object_info" to check a particular object, for instance) to ensure that the components of those VMs were indeed "ACTIVE" instead of "ABSENT", and there you go!
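For completeness, a minimal sketch of that check; the SSO account, datacenter, and VM names below are placeholders for your own environment:

# connect to RVC on the vCenter Server Appliance
root@vcsa-duncan [ ~ ]# rvc administrator@vsphere.local@localhost

# navigate to the VM and dump its vSAN object layout,
# which includes the component states and vote counts
> cd /localhost/<datacenter>/vms
> vsan.vm_object_info <vm-name>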

When maintenance has been completed, you simply do the reverse: you power on the witness, and then you bring the hosts in the other location back online. After the resync has completed, the VMs will be rebalanced again by DRS. Note that DRS rebalancing (or the application of "should" rules) will only happen after the resync for a VM has completed.

Cleaning up old vSAN File Services OVF files on vCenter Server

Duncan Epping · Oct 3, 2022

There was a question last week about the vSAN File Services OVF files, specifically about the location where they are stored. I did some digging in the past, but I don't think I ever shared this. The vSAN File Services OVF is stored on the vCenter Server Appliance (VCSA) in a folder, one per version. The folder structure looks as shown below; basically each version of the OVF has a directory with the required OVF files.

root@vcsa-duncan [ ~ ]# ls -lha /storage/updatemgr/vsan/fileService/
total 24K
vsan-health users 4.0K Sep 16 16:09 .
vsan-health root  4.0K Nov 11  2020 ..
vsan-health users 4.0K Nov 11  2020 ovf-7.0.1.1000
vsan-health users 4.0K Mar 12  2021 ovf-7.0.2.1000-17692909
vsan-health users 4.0K Nov 24  2021 ovf-7.0.3.1000-18502520
vsan-health users 4.0K Sep 16 16:09 ovf-7.0.3.1000-20036589
root@vcsa-duncan [ ~ ]# ls -lha /storage/updatemgr/vsan/fileService/ovf-7.0.1.1000/
total 1.2G
vsan-health users 4.0K Nov 11  2020 .
vsan-health users 4.0K Sep 16 16:09 ..
vsan-health users 179M Nov 11  2020 VMware-vSAN-File-Services-Appliance-7.0.1.1000-16695758-cloud-components.vmdk
vsan-health users 5.9M Nov 11  2020 VMware-vSAN-File-Services-Appliance-7.0.1.1000-16695758-log.vmdk
vsan-health users  573 Nov 11  2020 VMware-vSAN-File-Services-Appliance-7.0.1.1000-16695758_OVF10.mf
vsan-health users  60K Nov 11  2020 VMware-vSAN-File-Services-Appliance-7.0.1.1000-16695758_OVF10.ovf
vsan-health users 998M Nov 11  2020 VMware-vSAN-File-Services-Appliance-7.0.1.1000-16695758-system.vmdk

I’ve asked the engineering team, and yes, you can simply delete obsolete versions if you need the disk capacity.
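For instance, removing the oldest version from the listing above could look like the sketch below; do verify first that no cluster still needs that particular version:

# remove an obsolete vSAN File Services OVF version to reclaim disk capacity
root@vcsa-duncan [ ~ ]# rm -rf /storage/updatemgr/vsan/fileService/ovf-7.0.1.1000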

How to convert a standard cluster to a stretched cluster while expanding it!

Duncan Epping · Sep 27, 2022

On VMTN a question was asked about how to convert a 5-node standard cluster to a stretched cluster. It is not covered in our regular documentation, probably because the process is pretty straightforward, so I figured I would write it down. When you create a stretched cluster you first need a Witness Appliance in a third location. I would recommend deploying that Witness Appliance before doing anything else.

After you have deployed the Witness Appliance, add the additional hosts to vCenter Server. DO NOT add them to the cluster yet though! First, configure each host separately. After you have configured a host, place it into maintenance mode. Once the host is in maintenance mode, move it into the cluster and do not take it out of maintenance mode!

Now, when all hosts are part of the cluster, you can create the stretched cluster. This process is simple: you pick the hosts that belong to each location, and then you select the witness. After the cluster has been created you simply take the hosts out of maintenance mode and you should be good! Note that you take the hosts out of maintenance mode only after the stretched cluster has been created, to ensure that no rebalancing happens while you are creating it. This simply avoids unneeded resyncs from occurring.
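If you prefer scripting the host preparation over clicking through the vSphere Client, a minimal sketch using govc is shown below; the host name, credentials, and inventory paths are placeholders for your environment, and the actual move of the host into the cluster is left to the vSphere Client:

# assumes GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD point at your vCenter Server
# add the new host to vCenter as a standalone host first, so it can be configured
govc host.add -hostname esxi-06.lab.local -username root -password 'VMware1!' -noverify

# after configuring the host, place it into maintenance mode
govc host.maintenance.enter /DC/host/esxi-06.lab.local

# move the host into the cluster (for example via the vSphere Client) while it is
# still in maintenance mode; only exit maintenance mode once the stretched
# cluster has been created
govc host.maintenance.exit /DC/host/StretchedCluster/esxi-06.lab.local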

Do note, all VMs will still have the same storage policy assigned, so you will need to change that policy to ensure the vSAN objects are placed and replicated according to your requirements (RAID-1 across locations and RAID-1/5/6 within a location, for instance).

New book: VMware vSAN 7.0 U3 Deep Dive

Duncan Epping · May 9, 2022

Yes, we've mentioned a few times on Twitter already that we were working on it, but today Cormac and I are proud to announce that the VMware vSAN 7.0 U3 Deep Dive is available via Amazon, both as an ebook and in paperback! We had the pleasure of working with Pete Koehler again as technical editor, the foreword was written by John Gilmartin (SVP and GM for Cloud Storage and Data), the cover was created by my son (Aaron Epping), and it is once again fully self-published! We changed the format (physical dimensions) of the book to be able to increase the size of the screenshots, as we realize that most of us are middle-aged by now; we feel it really made a huge difference in readability.

VMware's vSAN has rapidly proven itself in environments ranging from hospitals to oil rigs to e-commerce platforms and is the top player in the hyperconverged space. Along the way, it has matured to offer unsurpassed features for data integrity, availability, space efficiency, stretched clustering, and cloud-native storage services. vSAN 7.0 U3 has radically simplified IT operations and supports the transition to hyperconverged infrastructures (HCI). The authors of the vSAN Deep Dive have thoroughly updated their definitive guide to this transformative technology. Writing for vSphere administrators, architects, and consultants, Cormac Hogan and Duncan Epping explain what vSAN is, how it has evolved, what it now offers, and how to gain maximum value from it. The book offers expert insight into preparation, installation, configuration, policies, provisioning, clusters, architecture, and more. You'll also find practical guidance for using all data services, stretched clusters, two-node configurations, and cloud-native storage services.

Although we pressed publish, it sometimes takes a while before the book is available in all Amazon stores, but it should trickle in over the upcoming 24-48 hours. The book is priced at 9.99 USD (ebook) and 29.99 USD (paperback) and is sold through Amazon only. Get it while it is hot! We would appreciate it if you would use our referral links and leave a review when you finish it. Thanks, and we hope you will enjoy it!

  • Paper book – 29.99 USD
  • Ebook – 9.99 USD

Of course, we also have the links to other major Amazon stores:

  • United Kingdom – Kindle – Paper
  • Germany – Kindle – Paper
  • Netherlands – Kindle – Paper
  • Canada – Kindle – Paper
  • France – Kindle – Paper
  • Spain – Kindle – Paper
  • India – Kindle
  • Japan – Kindle – Paper
  • Italy – Kindle – Paper
  • Mexico – Kindle
  • Australia – Kindle – Paper
  • Or just do a search!

Stretched cluster witness failure resilience in vSAN 7.0

Duncan Epping · Mar 17, 2022

Cormac and I have been busy the past couple of weeks updating the vSAN Deep Dive to 7.0 U3. Yes, there is a lot to update and add, but we are actually going through it at a surprisingly rapid pace. I guess it helps that we had already written dozens of blog posts on the various topics we need to update or add. One of those topics is “witness failure resilience” which was introduced in vSAN 7.0 U3. I have discussed it before on this blog (here and here) but I wanted to share some of the findings with you folks as well before the book is published. (No, I do not know when the book will be available on Amazon just yet!)

In the scenario below, we failed the secondary site of our stretched cluster completely. We can examine the impact of this failure through RVC on vCenter Server, which gives us a better understanding of how the witness failure resilience mechanism actually works. Note that the output below has been truncated for readability. Let's take a look at the RVC output (vsan.vm_object_info) for our VM directly after the failure.

VM R1-R1:
  Disk backing:
    [vsanDatastore] 0b013262-0c30-a8c4-a043-005056968de9/R1-R1.vmx
      RAID_1
        RAID_1
          Component: 0b013262-c2da-84c5-1eee-005056968de9, host: 10.202.25.221 (votes: 1, usage: 0.1 GB, proxy component: false)
          Component: 0b013262-3acf-88c5-a7ff-005056968de9, host: 10.202.25.201 (votes: 1, usage: 0.1 GB, proxy component: false)
        RAID_1
          Component: 0b013262-a687-8bc5-7d63-005056968de9, host: 10.202.25.238 (votes: 1, usage: 0.1 GB, proxy component: true)
          Component: 0b013262-3cef-8dc5-9cc1-005056968de9, host: 10.202.25.236 (votes: 1, usage: 0.1 GB, proxy component: true)
      Witness: 0b013262-4aa2-90c5-9504-005056968de9, host: 10.202.25.231 (votes: 3, usage: 0.0 GB, proxy component: false)
      Witness: 47123362-c8ae-5aa4-dd53-005056962c93, host: 10.202.25.214 (votes: 1, usage: 0.0 GB, proxy component: false)
      Witness: 0b013262-5616-95c5-8b52-005056968de9, host: 10.202.25.228 (votes: 1, usage: 0.0 GB, proxy component: false)

As can be seen, the witness component holds 3 votes, the components on the failed (secondary) site hold 2 votes in total, and the components on the surviving (preferred) data site also hold 2 votes in total. After the full site failure has been detected, the votes are recalculated to ensure that a witness host failure does not impact the availability of the VMs. The RVC output after the recalculation is shown below.

VM R1-R1:
  Disk backing:
    [vsanDatastore] 0b013262-0c30-a8c4-a043-005056968de9/R1-R1.vmx
      RAID_1
        RAID_1
          Component: 0b013262-c2da-84c5-1eee-005056968de9, host: 10.202.25.221 (votes: 3, usage: 0.1 GB, proxy component: false)
          Component: 0b013262-3acf-88c5-a7ff-005056968de9, host: 10.202.25.201 (votes: 3, usage: 0.1 GB, proxy component: false)
        RAID_1
          Component: 0b013262-a687-8bc5-7d63-005056968de9, host: 10.202.25.238 (votes: 1, usage: 0.1 GB, proxy component: false)
          Component: 0b013262-3cef-8dc5-9cc1-005056968de9, host: 10.202.25.236 (votes: 1, usage: 0.1 GB, proxy component: false)
      Witness: 0b013262-4aa2-90c5-9504-005056968de9, host: 10.202.25.231 (votes: 1, usage: 0.0 GB, proxy component: false)
      Witness: 47123362-c8ae-5aa4-dd53-005056962c93, host: 10.202.25.214 (votes: 3, usage: 0.0 GB, proxy component: false)

As can be seen, the votes for the various components have changed: the data components on the surviving site now have 3 votes each instead of 1, the witness on the witness host went from 3 votes to 1, and on top of that, the witness that is stored in the surviving fault domain now also has 3 votes instead of 1. This results in a situation where quorum is not lost even if the witness component on the witness host is impacted by a failure: of the 12 votes now in play, the surviving site holds 9 (3 + 3 + 3), comfortably more than the majority of 7 required, even without the witness host's single vote. A very useful enhancement in vSAN 7.0 Update 3 for stretched cluster configurations if you ask me.

