BC-DR

Where is the vSAN Snapmanager Appliance with 9.0?

Duncan Epping · Jul 7, 2025

I was talking to my colleague Paudie and he mentioned various folks were having problems finding the vSAN Snapmanager Appliance for vSAN / VCF / vSphere 9.0. The appliance used to be stored on the Broadcom Support portal under VMware vSAN >> Drivers & Tools, but it is no longer there.

This is not by mistake. Some may have heard about this, others may have skipped over it, but VMware Live Recovery, vSphere Replication, and vSAN Data Protection (which includes the Snapmanager Appliance) have all converged into a single appliance to make your life easier! This means that if you want to enable vSAN Data Protection, you now need to download the VMware Live Recovery Appliance, specifically version 9.0.3.0 or later.

vSphere HA restart times, how long does it actually take?

Duncan Epping · Mar 13, 2025

I had a question today, and it was based on material I wrote years ago for the Clustering Deepdive (read it here). The material talks about the sequence HA goes through when a failure has occurred. If you look at the sequence where, for instance, a “secondary” host has failed, it looks as follows:

T0 – Secondary host failure.
T3s – Primary host begins monitoring datastore heartbeats for 15 seconds.
T10s – The secondary host is declared unreachable and the primary will ping the management network of the failed secondary host. This is a continuous ping for 5 seconds.
T15s – If no heartbeat datastores are configured, the secondary host will be declared dead if there is no reply to the ping.
T18s – If heartbeat datastores are configured, the secondary host will be declared dead if there’s no reply to the ping and the heartbeat file has not been updated or the lock was lost.

So, depending on whether you have heartbeat datastores or not, this sequence takes either 15 or 18 seconds. Does that mean the VMs are then instantly restarted, and if so, how long does that take? Well no, they won’t instantly restart, because the end of this sequence is only the point at which the failed secondary host is actually declared dead. Next, HA needs to verify whether the potentially impacted VMs have actually failed, create a list of “to be restarted” VMs, and issue a placement request.
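To make the arithmetic behind the 15 and 18 seconds explicit, here is a minimal sketch in Python. It only replays the timings listed above; the function and variable names are made up for illustration and are not part of any vSphere API.

```python
def declared_dead_after(heartbeat_datastores_configured: bool) -> int:
    """Seconds after T0 at which the primary declares the secondary host dead.

    Timings are taken from the sequence above; names are hypothetical.
    """
    datastore_monitor_start = 3      # T3s: primary starts monitoring datastore heartbeats
    datastore_monitor_duration = 15  # ...and does so for 15 seconds
    ping_start = 10                  # T10s: host declared unreachable, ping starts
    ping_duration = 5                # continuous ping for 5 seconds

    ping_done = ping_start + ping_duration  # T15s
    if not heartbeat_datastores_configured:
        # Only the ping result matters.
        return ping_done
    # With heartbeat datastores, the primary also waits for the
    # datastore heartbeat monitoring window to end (T3s + 15s = T18s).
    datastore_done = datastore_monitor_start + datastore_monitor_duration
    return max(ping_done, datastore_done)


print(declared_dead_after(False), declared_dead_after(True))
```

Calling it with `False` and `True` gives the two values from the sequence: T15s without heartbeat datastores, T18s with them.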

The placement request will either go to DRS or be handled by HA itself, depending on whether DRS is enabled and whether vCenter Server is available. After placement has been determined, the primary host will request the individual hosts to restart the VMs which should be restarted. After a host has received the list of VMs it needs to restart, it will do this in batches of 32, and of course restart priority / order will be applied. The whole aforementioned process can easily take 10-15 seconds (if not longer), which, added to the 15 to 18 seconds needed to declare the host dead, means that in a perfect world the restart of the VM occurs after about 30 seconds. Now, this is when the restart of the VM is initiated; that does not mean that the VM, or the services it is hosting, will be available after 30 seconds. The power-on sequence of the VM can take anywhere from seconds to minutes, depending of course on the size of the VM and the services that need to be started during the power-on sequence.
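For those who prefer to see it in code, below is a rough sketch of the two ideas in this paragraph: splitting a restart list into batches of 32 in priority order, and adding up the time windows. This is a simplification for illustration only. The `VM` type, the numeric priority, and the way priority and batching interact here are my assumptions, not how the HA agent is actually implemented.

```python
from typing import List, NamedTuple, Tuple


class VM(NamedTuple):
    name: str
    priority: int  # hypothetical: lower number means restart earlier


def restart_batches(vms: List[VM], batch_size: int = 32) -> List[List[VM]]:
    """Order VMs by priority, then split the list into batches of `batch_size`."""
    ordered = sorted(vms, key=lambda vm: vm.priority)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def restart_initiated_window(
    detection: Tuple[int, int] = (15, 18),       # seconds to declare the host dead
    post_detection: Tuple[int, int] = (10, 15),  # verification, list creation, placement
) -> Tuple[int, int]:
    """Best-case window (in seconds after the failure) in which restarts are initiated."""
    return (detection[0] + post_detection[0], detection[1] + post_detection[1])


vms = [VM(f"vm{i}", i % 3) for i in range(70)]
print([len(batch) for batch in restart_batches(vms)])
print(restart_initiated_window())
```

With 70 VMs on a single host this produces three batches, and the window works out to roughly 25 to 33 seconds, which is where the "about 30 seconds" comes from.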

So, although it only takes 15 to 18 seconds for vSphere HA to determine and declare a failure, there’s much more to it. Hopefully, this post provides a better understanding of all that is involved.

Unexplored Territory Episode 088 – Stretching VMware Cloud Foundation featuring Paudie O’Riordan

Duncan Epping · Jan 13, 2025

The first episode of 2025 features one of my favorite colleagues, Paudie O’Riordan. Paudie works for the same team as I do, and although we’ve both roamed around a lot, somehow we always ended up either in the same team, or in very close proximity. Paudie is a storage guru, and over the last few years he has helped many customers with their VCF (or vSAN) proof of concept. On top of that, he has helped countless customers understand difficult failure scenarios in a stretched environment when things went south. In Episode 088 Paudie discusses the many dos and don’ts! This is an episode you cannot miss out on!

VCF-9 announcements at Explore Barcelona – vSAN Site Takeover and vSAN Site Maintenance

Duncan Epping · Nov 15, 2024

At Explore in Barcelona we had several announcements and showed several roadmap items which we did not reveal in Las Vegas. As the sessions were not recorded in Barcelona, I wanted to share with you the features I spoke about at Explore which are currently planned for VCF 9. Please note, I don’t know when these features will be generally available, and there’s always a chance they are not released at all.

I created a video of the features we discussed, as I also wanted to share the demos with you. Now for those who don’t watch videos, the functionality that we are working on for VCF-9 is described below. I am just going to give a brief description, as we have not made a full public announcement about this, and I don’t want to get into trouble.

vSAN Site Maintenance

In a vSAN stretched cluster environment when you want to do site maintenance, today you will need to place every host into maintenance mode one by one. This is not only an administrative/operational burden, it also increases the chances of placing the wrong hosts into maintenance mode. On top of that, as you need to do this sequentially, it could also be that the data stored on host-1 in site A differs from host-2 in site A, meaning that there’s an inconsistent set of data in a site. Normally this is not a problem as the environment will resync when it comes back online, but if the other data site fails, that existing (inconsistent) data set cannot be used to recover. With Site Maintenance we not only make it easier to place a full site into maintenance mode, we also remove that risk of data inconsistency as vSAN coordinates the maintenance and ensures that the data set is consistent within the site. Fantastic right?!

vSAN Site Takeover

One of the features I felt we were lacking for the longest time was the ability to promote a site when 2 out of the 3 sites had failed simultaneously. This is where Site Takeover comes into play. If you end up in a situation where both the Witness Site and a data site go down at the same time, you still want to be able to recover, especially as it is very likely that you will have healthy objects for each VM in the remaining data site. This is what vSAN Site Takeover will help you with. It will allow you to manually (through the UI or a script) inform vSAN that even though quorum is lost, it should make the local RAID set for each of the impacted VMs accessible again. After that, of course, vSphere HA would instruct the hosts to power on those VMs.
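To illustrate why a manual takeover is needed at all, here is a toy model of the quorum logic. It is heavily simplified: I assume one vote per site, whereas vSAN really tracks votes per component, and the site names and the `takeover_requested` flag are purely hypothetical. It is not the planned implementation.

```python
# Toy model: one vote per site (an assumption made for illustration).
SITE_VOTES = {"preferred": 1, "secondary": 1, "witness": 1}


def has_quorum(available_sites: set) -> bool:
    """Quorum requires more than half of all votes to be available."""
    total = sum(SITE_VOTES.values())
    available = sum(SITE_VOTES[site] for site in available_sites)
    return available > total / 2


def objects_accessible(available_sites: set, takeover_requested: bool = False) -> bool:
    """Objects are accessible with quorum, or after a manual takeover
    when at least one data site (not just the witness) is still up."""
    data_site_alive = any(site != "witness" for site in available_sites)
    return has_quorum(available_sites) or (takeover_requested and data_site_alive)


print(objects_accessible({"secondary"}))
print(objects_accessible({"secondary"}, takeover_requested=True))
```

With only the secondary site left, one of three votes remains, so quorum is lost and the first call returns `False`. The second call models the administrator telling vSAN to make the local data set accessible anyway.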

If you have any feedback on the demos, and the planned functionality, feel free to leave a comment!

Does vSAN Data Protection work with vSAN Stretched Clusters and can snapshots be stretched?

Duncan Epping · Oct 18, 2024

I have written a few articles about vSAN Data Protection now, and my last article featured a nice vSAN DP demo. A very good question was asked in the comment section, and it was about vSAN Stretched Clusters. Basically, the question was whether Snapshots are also stretched across locations. This is a great question, as there are a couple of things which are probably worth explaining again.

vSAN Data Protection relies on the snapshot capability which was introduced with vSAN ESA. This snapshot capability in vSAN ESA is significantly different than with vSAN OSA or with VMFS. With vSAN OSA and VMFS when you create a snapshot a new object (vSAN) or file (VMFS) is created. With vSAN ESA this is no longer the case as we don’t create additional files or objects, but we create a copy of the metadata structure instead. This is why vSAN ESA snapshots perform much better than vSAN OSA or VMFS snapshots do, as we no longer need to traverse multiple files or objects to read data. We can simply use the same object, and leverage the metadata structure to keep track of what has changed.

Now, with vSAN, as most of you hopefully know, objects (and their components) are placed across the cluster based on what is specified within the storage policy that is associated with the object or VM. In other words, if the policy states FTT=1 and RAID-1, then you will see 2 copies of the data. If the policy states the data needs to be stretched across locations, and within each location be protected with RAID-5, then you will see a RAID-1 configuration across sites and a RAID-5 configuration within each site. As vSAN ESA snapshots are an integral part of the object, the snapshots automatically follow all requirements as defined within the policy. In other words, if the policy says stretched then the snapshot will also automatically be stretched.
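The relationship between policy, object layout, and snapshot layout can be sketched as follows. The `Policy` class and both functions are hypothetical names I made up for this example; they only encode the two policy examples from the paragraph above and are not a vSAN API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    stretched: bool
    local_raid: str  # e.g. "RAID-1" or "RAID-5"
    ftt: int = 1     # failures to tolerate


def object_layout(policy: Policy) -> str:
    """Describe where the components of an object end up for a given policy."""
    if policy.stretched:
        return f"RAID-1 across sites, {policy.local_raid} within each site"
    if policy.local_raid == "RAID-1":
        # RAID-1 mirroring keeps FTT + 1 copies of the data.
        return f"RAID-1 with {policy.ftt + 1} copies"
    return policy.local_raid


def snapshot_layout(policy: Policy) -> str:
    """ESA snapshots are part of the object, so they follow the same layout."""
    return object_layout(policy)


stretched_r5 = Policy(stretched=True, local_raid="RAID-5")
print(object_layout(stretched_r5))
print(snapshot_layout(stretched_r5) == object_layout(stretched_r5))
```

The point of `snapshot_layout` simply delegating to `object_layout` is that there is no separate placement decision for the snapshot: whatever the policy dictates for the object applies to its snapshots as well.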

There is one caveat I want to call out, and for that I want to show a diagram. The diagram below shows the Data Protection Appliance, aka the snapshot manager appliance. As you can see, it states “metadata decoupled from appliance” and it links to a global namespace object. This global namespace object is where all the details of the protected VMs (and more) are stored. As you can imagine, both the Snapshot Manager and the Global Namespace object should also be stretched. For the global namespace object this means that you need to ensure that the default datastore policy is set to “stretched”, and of course for the snapshot manager appliance you can simply select the correct policy when provisioning the appliance. Either way, make sure the default datastore policy aligns with the disaster recovery and data protection policy.

I hope this helps those exploring vSAN Data Protection in a stretched cluster configuration!