
Yellow Bricks

by Duncan Epping


BC-DR

vSAN Stretched Cluster vs Fault Domains in a “campus” setting?

Duncan Epping · Sep 25, 2025 · 2 Comments

I got this question internally recently: should we create a vSAN Stretched Cluster configuration or a vSAN Fault Domains configuration when we have multiple datacenters in close proximity on our campus? In this case, we are talking about less than 1ms latency RTT between buildings, maybe a few hundred meters at most. I think it is a very valid question, and I guess it kind of depends on what you are looking to get out of the infrastructure. I wrote down the pros and cons, and wanted to share those with the rest of the world as well, as they may be useful for some of you out there (a rough decision sketch follows the two lists below). If anyone has additional pros and cons, feel free to share those in the comments!

vSAN Stretched Clusters:

  • Pro: You can replicate across fault domains AND protect additionally within a fault domain with R1/R5/R6 if required.
  • Pro: You can decide whether VMs should be stretched across Fault Domains or not, or just protected within a fault domain/site
  • Pro: Requires less than 5ms RTT latency, which is easily achievable in this scenario
  • Con/Pro: You probably also need to think about DRS/HA groups (VM-to-Host)
  • Con: From an operational perspective, it also introduces a witness host and sites, which may complicate things, and at the very least requires a bit more thinking
  • Con: Witness needs to be hosted somewhere
  • Con: Limited to 3 Fault Domains (2x data + 1x witness)
  • Con: Limited to 20+20+1 configuration

vSAN Fault Domains:

  • Pro: No real considerations around VM-to-host rules usually, although you can still use them to ensure certain VMs are spread across buildings
  • Pro: No Witness Appliance to manage, update or upgrade. No overhead of running a witness somewhere
  • Pro: No design considerations around “dedicated” witness sites and “data site”, each site has the same function
  • Pro: Can also be used with more than 3 Fault Domains or Datacenters, so could even be 6 Fault Domains, for instance
  • Pro: Theoretically can go up to 64 hosts
  • Con: No ability to protect additionally within a fault domain
  • Con: No ability to specify that you don’t want to replicate VMs across Fault Domains
  • Con/Pro: Requires sub-1ms RTT latency at all times, which is low, but usually achievable in a campus cluster
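
To make the trade-off a bit more concrete, below is a minimal decision sketch in Python. The thresholds and limits (5ms and sub-1ms RTT, 3 fault domains, 20+20+1 hosts) come straight from the lists above; the function itself and its parameter names are purely illustrative, not an official sizing tool.

```python
# Rough decision helper based purely on the pros and cons listed above.
# The function and parameter names are illustrative; the thresholds
# (5 ms / sub-1 ms RTT, 3 fault domains, 20+20+1 hosts) come from the lists.

def suggest_vsan_topology(rtt_ms: float, buildings: int,
                          need_protection_within_site: bool,
                          need_site_local_vms: bool) -> str:
    # Fault domains: every building is an equal data site, no witness appliance to run,
    # more than 3 domains are possible, but sub-1 ms RTT is required at all times and
    # there is no extra protection within a domain, nor non-replicated (site-local) VMs.
    fault_domains_fit = (rtt_ms < 1.0 and buildings >= 3
                         and not need_protection_within_site
                         and not need_site_local_vms)
    # Stretched cluster: two data sites plus a witness (3 fault domains, 20+20+1 hosts),
    # less than 5 ms RTT, optional R1/R5/R6 protection within a site, and a per-VM
    # choice whether to stretch across sites or keep the VM within one site.
    stretched_fit = rtt_ms < 5.0 and buildings >= 2

    if fault_domains_fit:
        return "vSAN Fault Domains"
    if stretched_fit:
        return "vSAN Stretched Cluster (2 data sites + witness)"
    return "RTT too high (or too few buildings) for a single vSAN cluster"


# Example: three buildings on a campus, 0.3 ms RTT, no extra requirements.
print(suggest_vsan_topology(0.3, 3, False, False))  # -> vSAN Fault Domains
```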

Where is the vSAN Snapmanager Appliance with 9.0?

Duncan Epping · Jul 7, 2025 · 4 Comments

I was talking to my colleague Paudie and he mentioned various folks were having problems finding the vSAN Snapmanager Appliance for vSAN / VCF / vSphere 9.0. The appliance used to be stored on the Broadcom Support portal under VMware vSAN >> Drivers & Tools, but it is no longer there.

This is not by mistake. Some may have heard about this, others may have skipped over it, but VMware Live Recovery, vSphere Replication, and vSAN Data Protection (which includes the Snapmanager Appliance) have all converged into a single appliance to make your life easier! This means that if you want to enable vSAN Data Protection, you now need to download the VMware Live Recovery Appliance, specifically version 9.0.3.0 or later.


vSphere HA restart times, how long does it actually take?

Duncan Epping · Mar 13, 2025 · Leave a Comment

I had a question today, and it was based on material I wrote years ago for the Clustering Deepdive. (read it here) The material talks about the sequence HA goes through when a failure has occurred. If you look, for instance, at the sequence where a “secondary” host has failed, it looks as follows:

  • T0 – Secondary host failure.
  • T3s – Primary host begins monitoring datastore heartbeats for 15 seconds.
  • T10s – The secondary host is declared unreachable and the primary will ping the management network of the failed secondary host. This is a continuous ping for 5 seconds.
  • T15s – If no heartbeat datastores are configured, the secondary host will be declared dead if there is no reply to the ping.
  • T18s – If heartbeat datastores are configured, the secondary host will be declared dead if there’s no reply to the ping and the heartbeat file has not been updated or the lock was lost.

So, depending on whether you have heartbeat datastores or not, this sequence takes either 15 or 18 seconds. Does that mean the VMs are then instantly restarted, and if so, how long does that take? Well no, they won’t instantly restart, because at the end of this sequence the failed secondary host has only just been declared dead. Now it will need to be verified whether the potentially impacted VMs have actually failed, a list of “to be restarted” VMs will need to be created, and a placement request will need to be made.

The placement request will either go to DRS, or will be handled by HA itself, depending on whether DRS is enabled and vCenter Server is available. After placement has been determined, the primary host will then request the individual hosts to restart the VMs which should be restarted. After a host has received the list of VMs it needs to restart, it will do this in batches of 32, and of course restart priority/order will be applied. The whole aforementioned process can easily take 10-15 seconds (if not longer), which means that in a perfect world, the restart of the VM occurs after about 30 seconds. Now, this is when the restart of the VM is initiated; that does not mean that the VM, or the services it is hosting, will be available after 30 seconds. The power-on sequence of the VM can take anywhere from seconds to minutes, depending of course on the size of the VM and the services that need to be started during the power-on sequence.
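
As a quick back-of-the-napkin check, the numbers above add up as follows. The detection values (15 or 18 seconds) and the 10-15 second window for verification, placement, and restart initiation are taken from the sequence described above; the VM boot time is just a placeholder you would replace with a realistic value for your own workloads.

```python
# Rough estimate of time-to-restart for vSphere HA after a secondary host failure.
# Detection values (15 s / 18 s) and the 10-15 s placement window come from the
# sequence above; the VM boot time is workload-specific and purely a placeholder.

def estimated_restart_timing(heartbeat_datastores: bool,
                             placement_seconds: float = 12.5,
                             vm_boot_seconds: float = 60.0) -> dict:
    detection = 18 if heartbeat_datastores else 15            # host declared dead
    restart_initiated = detection + placement_seconds         # verify VMs, build list, place, batch restart
    services_available = restart_initiated + vm_boot_seconds  # guest OS and services start
    return {
        "host declared dead (s)": detection,
        "VM restart initiated (s)": round(restart_initiated, 1),
        "services available (s, example boot time)": round(services_available, 1),
    }


print(estimated_restart_timing(heartbeat_datastores=True))
# {'host declared dead (s)': 18, 'VM restart initiated (s)': 30.5,
#  'services available (s, example boot time)': 90.5}
```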

So, although it only takes 15 to 18 seconds for vSphere HA to detect and declare a failure, there’s much more to it. Hopefully this post provides a better understanding of all that is involved.

Unexplored Territory Episode 088 – Stretching VMware Cloud Foundation featuring Paudie O’Riordan

Duncan Epping · Jan 13, 2025 · Leave a Comment

The first episode of 2025 features one of my favorite colleagues, Paudie O’Riordan. Paudie works on the same team as I do, and although we’ve both roamed around a lot, somehow we always ended up either on the same team or in very close proximity. Paudie is a storage guru; over the last years he has helped many customers with their VCF (or vSAN) proof of concept, and on top of that he has helped countless customers understand difficult failure scenarios in a stretched environment when things went south. In Episode 088 Paudie discusses the many dos and don’ts! This is an episode you cannot miss out on!

VCF-9 announcements at Explore Barcelona – vSAN Site Takeover and vSAN Site Maintenance

Duncan Epping · Nov 15, 2024 · Leave a Comment

At Explore in Barcelona we had several announcements and showed several roadmap items which we did not reveal in Las Vegas. As the sessions were not recorded in Barcelona, I wanted to share with you the features I spoke about at Explore which are currently planned for VCF 9. Please note, I don’t know when these features will be generally available, and there’s always a chance they are not released at all.

I created a video of the features we discussed, as I also wanted to share the demos with you. Now, for those who don’t watch videos, the functionality that we are working on for VCF-9 is the following. I am just going to give a brief description, as we have not made a full public announcement about this, and I don’t want to get into trouble.

vSAN Site Maintenance

In a vSAN stretched cluster environment, when you want to do site maintenance today, you need to place every host into maintenance mode one by one. Not only is this an administrative/operational burden, it also increases the chances of placing the wrong hosts into maintenance mode. On top of that, as you need to do this sequentially, it could also be that the data stored on host-1 in site A differs from host-2 in site A, meaning that there’s an inconsistent set of data in a site. Normally this is not a problem, as the environment will resync when it comes back online, but if the other data site fails, that existing (inconsistent) data set cannot be used to recover. With Site Maintenance we not only make it easier to place a full site into maintenance mode, we also remove that risk of data inconsistency, as vSAN coordinates the maintenance and ensures that the data set is consistent within the site. Fantastic, right?!
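
To illustrate the operational difference, here is a minimal sketch contrasting today’s host-by-host workflow with a coordinated site-level operation. The functions below are stubs I made up for illustration, not a real vSphere or vSAN API; the point is only the sequential-versus-coordinated contrast described above.

```python
# Conceptual contrast only: these stub functions stand in for maintenance operations;
# they are not a real vSphere/vSAN API.

def enter_maintenance_mode(host: str) -> None:
    """Stub: pretend to place a single host into maintenance mode."""
    print(f"{host}: entering maintenance mode")


def enter_site_maintenance(site: str, hosts: list) -> None:
    """Stub: pretend to place a whole site into maintenance in one coordinated step."""
    print(f"{site}: coordinated site maintenance for {len(hosts)} hosts")


site_a_hosts = ["esx01", "esx02", "esx03", "esx04"]

# Today: host by host, sequentially. While this runs, the data on the hosts in the
# site can drift apart, so if the other data site fails mid-way, the surviving
# copies in this site may be inconsistent with each other.
for host in site_a_hosts:
    enter_maintenance_mode(host)

# Planned VCF 9 Site Maintenance: one coordinated operation for the whole site,
# with vSAN ensuring the data set within the site stays consistent.
enter_site_maintenance("Site A", site_a_hosts)
```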

vSAN Site Takeover

One of the features I felt we were lacking for the longest time was the ability to promote a site when 2 out of the 3 sites have failed simultaneously. This is where Site Takeover comes into play. If you end up in a situation where both the witness site and a data site go down at the same time, you want to be able to still recover, especially as it is very likely that you will have healthy objects for each VM in that second site. This is what vSAN Site Takeover will help you with. It will allow you to manually (through the UI or a script) inform vSAN that, even though quorum is lost, it should make the local RAID set for each of the impacted VMs accessible again. After which, of course, vSphere HA would instruct the hosts to power on those VMs.
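
A tiny sketch of the quorum logic may help to show why this matters. The vote counting below is a simplification of how a stretched cluster decides object accessibility (two data sites plus the witness), and the takeover flag only illustrates the planned manual promotion; none of this is a real API.

```python
# Simplified illustration of why Site Takeover matters in a stretched cluster.
# Vote counting is simplified; the 'takeover' flag only illustrates the planned
# manual promotion and is not a real API.

def objects_accessible(site_a_up: bool, site_b_up: bool, witness_up: bool,
                       takeover_on_surviving_site: bool = False) -> bool:
    votes = sum([site_a_up, site_b_up, witness_up])
    if votes >= 2:                      # quorum: a majority of the three sites is available
        return True
    if votes == 1 and takeover_on_surviving_site:
        # Planned Site Takeover: an admin explicitly tells vSAN to make the healthy
        # local copies on the one surviving site accessible despite lost quorum,
        # after which vSphere HA can restart the impacted VMs there.
        return True
    return False


# Witness and one data site fail at the same time: today the objects stay
# inaccessible; with Site Takeover the surviving site can be promoted manually.
print(objects_accessible(site_a_up=False, site_b_up=True, witness_up=False))   # False
print(objects_accessible(site_a_up=False, site_b_up=True, witness_up=False,
                         takeover_on_surviving_site=True))                      # True
```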

If you have any feedback on the demos, and the planned functionality, feel free to leave a comment!

