
Yellow Bricks

by Duncan Epping



Doing site maintenance in a vSAN Stretched Cluster configuration

Duncan Epping · Jan 15, 2025 · Leave a Comment

I thought I wrote an article about this years ago, but it appears I wrote an article about doing maintenance mode with a 2-node configuration instead. As I’ve received some questions on this topic, I figured I would write a quick article that describes the concept of site maintenance. Note that in a future version of vSAN, we will have an option in the UI that helps with this, as described here.

First and foremost, you will need to validate whether all data is replicated. In some cases, we see customers pinning data (VMs) to a single location without replication, and those VMs will be directly impacted if a whole site is placed into maintenance mode. Those VMs will either need to be powered off, or, if they need to stay running, moved to the site that remains online. Do note that if you flip “Preferred / Secondary” and there are many site-local VMs, this could lead to a huge amount of resync traffic. If those VMs need to stay running, you may also want to reconsider your decision not to replicate them!
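
To make the first check concrete, here is a minimal sketch in Python. All names and the `site_affinity` policy field are hypothetical placeholders, not a real vSAN API; the point is simply to flag VMs whose policy pins them to the site about to go into maintenance.

```python
# Hypothetical sketch: identify VMs that are NOT replicated across sites
# and live in the site that is about to be placed into maintenance mode.
# "site_affinity" is an illustrative policy field, not an actual vSAN one.

def vms_impacted_by_site_maintenance(vms, site_in_maintenance):
    """Return names of VMs pinned to the site going into maintenance."""
    impacted = []
    for vm in vms:
        site_local = vm["policy"].get("site_affinity")  # None means stretched
        if site_local == site_in_maintenance:
            impacted.append(vm["name"])
    return impacted

vms = [
    {"name": "app01", "policy": {"site_affinity": None}},         # stretched
    {"name": "dev01", "policy": {"site_affinity": "Preferred"}},  # pinned
    {"name": "dev02", "policy": {"site_affinity": "Secondary"}},
]
print(vms_impacted_by_site_maintenance(vms, "Preferred"))  # ['dev01']
```

Anything this check flags needs to be powered off or moved before the site goes down.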

These are the steps I would take when placing a site into maintenance mode:

  1. Verify the vSAN Witness is up and running and healthy (see health checks)
  2. Check compliance of VMs that are replicated
  3. Configure DRS to “Partially Automated” or “Manual” instead of “Fully Automated”
  4. Manually vMotion all VMs from Site X to Site Y
  5. Place each ESXi host in Site X into maintenance mode with the option “no data migration”
  6. Power Off all the ESXi hosts in Site X
  7. Enable DRS again in “fully automated” mode so that within Site Y the environment stays balanced
  8. Do whatever needs to be done in terms of maintenance
  9. Power On all the ESXi hosts in Site X
  10. Exit maintenance mode for each host
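
The steps above can be sketched as a dry-run plan. This is purely illustrative Python (the step strings are placeholders, not PowerCLI or vSphere API calls); what matters is the ordering: DRS is relaxed before evacuating, hosts are only powered off after entering maintenance mode, and they exit maintenance mode only after being powered back on.

```python
# Illustrative dry-run of the site maintenance sequence. No real API calls;
# the function just emits the ordered list of actions an admin would take.

def plan_site_maintenance(hosts_site_x, vms_site_x):
    plan = ["verify witness health", "check replication compliance"]
    plan.append("set DRS to partially automated")
    plan += [f"vMotion {vm} to Site Y" for vm in vms_site_x]
    plan += [f"maintenance mode (no data migration): {h}" for h in hosts_site_x]
    plan += [f"power off {h}" for h in hosts_site_x]
    plan.append("set DRS to fully automated")
    plan.append("perform site maintenance")
    plan += [f"power on {h}" for h in hosts_site_x]
    plan += [f"exit maintenance mode: {h}" for h in hosts_site_x]
    return plan

for step in plan_site_maintenance(["esx01", "esx02"], ["vm01"]):
    print(step)
```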

Do note that VMs will not automatically migrate back until the resync for a given VM has fully completed; DRS and vSAN are aware of the replication state! Additionally, if VMs are actively doing I/O while the hosts in Site X are being placed into maintenance mode, the state of the data stored on the hosts within Site X will differ. This concern will be addressed in the future by the “site maintenance” feature discussed at the start of this article.

Unexplored Territory Episode 087 – Microsoft on VMware VCF featuring Deji Akomolafe

Duncan Epping · Dec 16, 2024 · Leave a Comment

For the last episode of 2024, I invited Deji Akomolafe to discuss running Microsoft workloads on top of VCF. I’ve known Deji for a long time, and if anyone is passionate about VMware and Microsoft technology, it is him. Deji went over the many caveats and best practices when it comes to running, for instance, SQL Server on top of VMware VCF (or vSphere, for that matter). NUMA, CPU scheduling, latency-sensitivity settings, power settings, and virtual disk controllers are just some of the topics you can expect in this episode. You can listen to the episode on Spotify, Apple, or via the embedded player below.

VCF-9 Vision for a Federated Storage View and vSAN (stretched cluster) visualizations!

Duncan Epping · Nov 19, 2024 · Leave a Comment

As mentioned last week, the sessions at Explore Barcelona were not recorded. I still wanted to share what we are working on, so I decided to record a few demos along with some of the slides we presented. In this video, I show our vision for a Federated Storage View for both vSAN and more traditional storage systems. This federated view will not only provide insight into capacity and performance capabilities, it will also provide you with a visualization of a stretched cluster configuration. This is something I have been asking for for a while now, and it looks like it will become a reality in VCF 9 at some point. As this all revolves around visualization, I would urge you to watch the video below. And as always, if you have feedback, please leave a comment!

VCF-9 announcements at Explore Barcelona – vSAN Site Takeover and vSAN Site Maintenance

Duncan Epping · Nov 15, 2024 · Leave a Comment

At Explore in Barcelona we had several announcements and showed several roadmap items which we did not reveal in Las Vegas. As the sessions were not recorded in Barcelona, I wanted to share with you the features I spoke about at Explore which are currently planned for VCF 9. Please note, I don’t know when these features will be generally available, and there’s always a chance they are not released at all.

I created a video of the features we discussed, as I also wanted to share the demos with you. For those who don’t watch videos, the functionality we are working on for VCF 9 is described below. I will keep the descriptions brief, as we have not made a full public announcement about this, and I don’t want to get into trouble.

vSAN Site Maintenance

In a vSAN stretched cluster environment, when you want to do site maintenance, today you need to place every host into maintenance mode one by one. This is not only an administrative/operational burden, it also increases the chance of placing the wrong hosts into maintenance mode. On top of that, because this has to be done sequentially, the data stored on host-1 in site A may differ from the data stored on host-2 in site A, meaning there is an inconsistent set of data within the site. Normally this is not a problem, as the environment will resync when it comes back online, but if the other data site fails, that existing (inconsistent) data set cannot be used to recover. With Site Maintenance we not only make it easier to place a full site into maintenance mode, we also remove that risk of data inconsistency, as vSAN coordinates the maintenance and ensures that the data set within the site is consistent. Fantastic, right?!

vSAN Site Takeover

One of the features I felt we were lacking for the longest time was the ability to promote a site when 2 out of the 3 sites have failed simultaneously. This is where Site Takeover comes into play. If you end up in a situation where both the Witness site and a data site go down at the same time, you want to be able to recover, especially as it is very likely that you will still have healthy objects for each VM in the surviving site. This is what vSAN Site Takeover will help you with. It allows you to manually (through the UI or a script) inform vSAN that, even though quorum is lost, it should make the local RAID set for each of the impacted VMs accessible again. After which, of course, vSphere HA would instruct the hosts to power on those VMs.
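
To see why a double failure leaves healthy data inaccessible, consider a simplified vote model (illustrative only; real vSAN vote assignment is more nuanced): each site contributes votes, and an object needs a strict majority to stay accessible. Losing the witness plus one data site drops below quorum even though the surviving site may hold a full, healthy copy of the data, which is exactly the gap Site Takeover is meant to close.

```python
# Simplified quorum model for a stretched cluster: two data sites plus a
# witness, one vote each (illustrative; not actual vSAN vote accounting).

def has_quorum(site_votes, failed_sites):
    """An object stays accessible only with a strict majority of votes."""
    total = sum(site_votes.values())
    surviving = sum(v for s, v in site_votes.items() if s not in failed_sites)
    return surviving > total / 2

votes = {"site_a": 1, "site_b": 1, "witness": 1}
print(has_quorum(votes, {"site_b"}))             # True: 2 of 3 votes remain
print(has_quorum(votes, {"site_b", "witness"}))  # False: only 1 of 3 remains
```

In the second case, Site Takeover is the manual override that tells vSAN to bring the surviving site's objects back online despite the lost majority.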

If you have any feedback on the demos, and the planned functionality, feel free to leave a comment!

Does vSAN Data Protection work with vSAN Stretched Clusters and can snapshots be stretched?

Duncan Epping · Oct 18, 2024 · Leave a Comment

I have written a few articles about vSAN Data Protection now, and my last article featured a nice vSAN DP demo. A very good question was asked in the comment section, and it was about vSAN Stretched Clusters. Basically, the question was whether Snapshots are also stretched across locations. This is a great question, as there are a couple of things which are probably worth explaining again.

vSAN Data Protection relies on the snapshot capability that was introduced with vSAN ESA. This snapshot capability in vSAN ESA is significantly different from that of vSAN OSA or VMFS. With vSAN OSA and VMFS, creating a snapshot creates a new object (vSAN) or file (VMFS). With vSAN ESA this is no longer the case: we don’t create additional files or objects, we create a copy of the metadata structure instead. This is why vSAN ESA snapshots perform much better than vSAN OSA or VMFS snapshots, as we no longer need to traverse multiple files or objects to read data. We simply use the same object and leverage the metadata structure to keep track of what has changed.
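
A conceptual sketch may help here. This is not the actual vSAN ESA data structure, just a toy model of the idea: a snapshot is a copy of the block-address metadata map, not of the data itself, so reads of a snapshot never have to walk a chain of delta files.

```python
# Toy model of metadata-based snapshots (illustrative, not vSAN ESA code):
# snapshotting copies only the logical-block map; later writes change the
# live map while snapshot maps keep pointing at the old data.

class ObjectWithSnapshots:
    def __init__(self, blocks):
        self.blocks = dict(blocks)   # logical block -> data
        self.snapshots = {}

    def snapshot(self, name):
        self.snapshots[name] = dict(self.blocks)  # copy the metadata map only

    def write(self, block, data):
        self.blocks[block] = data    # live map changes; snapshots untouched

disk = ObjectWithSnapshots({0: "A", 1: "B"})
disk.snapshot("snap-1")
disk.write(1, "B'")
print(disk.blocks[1])               # "B'" (live data)
print(disk.snapshots["snap-1"][1])  # "B"  (snapshot still sees old data)
```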

Now, with vSAN, as most of you hopefully know, objects (and their components) are placed across the cluster based on what is specified in the storage policy associated with the object or VM. In other words, if the policy states FTT=1 and RAID-1, you will see 2 copies of the data. If the policy states the data needs to be stretched across locations, and within each location protected with RAID-5, you will see a RAID-1 configuration across sites and a RAID-5 configuration within each site. As vSAN ESA snapshots are an integral part of the object, snapshots automatically follow all requirements defined in the policy. In other words, if the policy says stretched, then the snapshot will also automatically be stretched.
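
As a rough sketch of that policy-to-placement mapping (field names are illustrative, not SPBM attributes): a stretched policy yields a copy in each data site plus a witness component, with the per-site RAID rule applied inside each site. Because the snapshot is part of the same object, the same layout applies to it.

```python
# Illustrative mapping from a policy dict to a per-site layout. The dict
# fields ("stretched", "intra_site_raid", "site") are hypothetical, not
# real SPBM policy attributes.

def placement(policy):
    """Return the per-site layout implied by a (toy) storage policy."""
    if policy["stretched"]:
        # RAID-1 across sites: full copy per data site, plus a witness.
        return {"Preferred": policy["intra_site_raid"],
                "Secondary": policy["intra_site_raid"],
                "Witness": "witness component"}
    # Site-local object: data lives only in the chosen site.
    return {policy.get("site", "Preferred"): policy["intra_site_raid"]}

snap_policy = {"stretched": True, "intra_site_raid": "RAID-5"}
print(placement(snap_policy))
# Snapshots share the object, so placement(snap_policy) describes them too.
```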

There is one caveat I want to call out, and for that I want to show a diagram. The diagram below shows the Data Protection Appliance, aka the snapshot manager appliance. As you can see, it states “metadata decoupled from appliance”, and that metadata links to a global namespace object. This global namespace object is where all the details of the protected VMs (and more) are stored. As you can imagine, both the Snapshot Manager appliance and the global namespace object should also be stretched. For the global namespace object this means you need to ensure the default datastore policy is set to “stretched”; for the snapshot manager appliance you can simply select the correct policy when provisioning it. Either way, make sure the default datastore policy aligns with your disaster recovery and data protection policy.

[Diagram: the vSAN Data Protection (snapshot manager) appliance, with its metadata decoupled from the appliance and stored in a global namespace object]

I hope this helps those exploring vSAN Data Protection in a stretched cluster configuration!



About the Author

Duncan Epping is a Chief Technologist and Distinguished Engineering Architect at Broadcom. Besides writing on Yellow-Bricks, Duncan is the co-author of the vSAN Deep Dive and the vSphere Clustering Deep Dive book series. Duncan is also the host of the Unexplored Territory Podcast.
