Yellow Bricks

by Duncan Epping

BC-DR

vSphere HA setting Performance degradation VMs tolerate

Duncan Epping · Apr 8, 2026 · Leave a Comment

There was a question this week internally and I really had to start digging, as I have not looked at this in a loooong time. What does “Performance degradation VMs tolerate” do? And does this feature require admission control to be enabled or not?

I had to test this, as I barely ever play around with the HA settings these days. But, let’s first describe what this feature is for. I think the UI explains it fairly decently, but here’s my explanation from the vSphere Clustering Deep Dive:

This feature allows you to specify the performance degradation you are willing to incur if a failure happens. It is set to 100% by default, but it is our recommendation to consider changing the value. You can, for instance, change this to 25% or 50%.

Now, the requirement for this feature to work is to have DRS enabled, but Admission Control does not need to be enabled! A lot of people are under the impression that it requires Admission Control in order to take “an X number of failures” into account, but it does not. It actually does not use what is specified for Admission Control. It takes a single failure into account when it comes to this feature, and then uses DRS to calculate whether powered-on VMs will get the same amount of resources allocated after a failure. If the answer is no, or performance degradation is higher than the percentage specified, a warning is triggered. You will still be able to power on new VMs, but the warning will not go away unless the resource usage changes, or you add more resources to the cluster.
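To make the logic a bit more tangible, here is a rough Python sketch of the kind of check described above. To be clear, this is not how DRS actually does the calculation; it is a simplified model that assumes cluster-wide CPU and memory demand can simply be compared against the capacity that remains after losing any one host. The names `Host`, `worst_case_degradation` and `should_warn` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    cpu_mhz: float
    mem_mb: float


def worst_case_degradation(hosts, cpu_demand_mhz, mem_demand_mb):
    """Worst-case resource shortfall (in percent) after losing any single host.

    Simplified assumption: demand is compared against the summed capacity
    of the surviving hosts. The real DRS calculation is more involved.
    """
    total_cpu = sum(h.cpu_mhz for h in hosts)
    total_mem = sum(h.mem_mb for h in hosts)
    worst = 0.0
    for failed in hosts:  # only a single failure is taken into account
        checks = (
            (cpu_demand_mhz, total_cpu - failed.cpu_mhz),
            (mem_demand_mb, total_mem - failed.mem_mb),
        )
        for demand, remaining in checks:
            if demand > remaining:
                worst = max(worst, 100.0 * (demand - remaining) / demand)
    return worst


def should_warn(hosts, cpu_demand_mhz, mem_demand_mb, tolerated_pct=100.0):
    """True when the modelled degradation exceeds the tolerated percentage."""
    degradation = worst_case_degradation(hosts, cpu_demand_mhz, mem_demand_mb)
    return degradation > tolerated_pct


hosts = [Host(f"esx{i}", cpu_mhz=10_000, mem_mb=100_000) for i in range(3)]
print(worst_case_degradation(hosts, 25_000, 150_000))
print(should_warn(hosts, 25_000, 150_000, tolerated_pct=25))
print(should_warn(hosts, 25_000, 150_000, tolerated_pct=10))
```

In this made-up three-host cluster, losing one host leaves 20,000 MHz for 25,000 MHz of demand, so the modelled degradation is 20%. With the tolerance set to 25% nothing is flagged, with 10% a warning would be raised, and with the default of 100% the warning effectively never triggers in this model.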

vSAN to vSAN Replication and Recovery Plan creation demo!

Duncan Epping · Dec 9, 2025 · 1 Comment

As I was going through the various recordings I had of demos I created for Explore, I realized I hadn’t published the demos I created for vSAN to vSAN Replication, and on creating a Recovery Plan based on a vSAN Protection Group in VMware Live Recovery. So here it is. It is a pretty lengthy video as I go through all the various steps involved. So what you will see in this demo is the following:

  • vCenter Server Pairing between my 2 sites
  • Cluster pairing
  • Creation of a vSAN Protection Group, including vSAN to vSAN Replication
  • Creation of a Recovery Plan based on the previously created Protection Group
  • Test of the Recovery Plan

What happens after a Site Takeover when my failed sites come back online again?

Duncan Epping · Dec 4, 2025 · Leave a Comment

I got a question after the previous demo: what would happen if, after a Site Takeover, the two failed sites came back online again? I completely ignored this part of the scenario so far, I am not even sure why. I knew what would happen, but I wanted to test it anyway to confirm that what engineering had described actually happened. For those who cannot be bothered to watch a demo, what happens when the two failed sites come back online again is pretty straightforward. The “old” components of the impacted VMs are discarded, vSAN will recreate the RAID configuration as specified within the associated vSAN Storage Policy, and then a full resync will occur so that the VM is compliant again with the policy. Let me repeat one part: a full resync will occur! So if you do a Site Takeover, I hope you do understand what the impact will be. A full resync will take time, of course, depending on the connection between the data locations.
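To get a feel for what "a full resync will take time" can mean, here is a back-of-the-envelope Python sketch. It only divides the amount of data by the link bandwidth; it assumes decimal gigabytes and a hypothetical `usable_fraction` for the share of the link the resync can actually use. It does not model vSAN resync throttling, competing VM traffic, or policy overhead, so treat the result as a rough lower bound at best.

```python
def estimate_resync_hours(data_gb, link_gbps, usable_fraction=0.7):
    """Rough transfer time in hours for data_gb over a link_gbps connection.

    usable_fraction is an assumed share of the link available to resync
    traffic; it is an illustrative knob, not a vSAN setting.
    """
    if data_gb < 0 or link_gbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("expected data_gb >= 0, link_gbps > 0, 0 < usable_fraction <= 1")
    seconds = (data_gb * 8) / (link_gbps * usable_fraction)  # GB -> Gb, then Gb / (Gb/s)
    return seconds / 3600


# Example: 10 TB of VM data over a 10 Gbps inter-site link
print(f"{estimate_resync_hours(10_000, 10):.1f} hours")
```

Even with these optimistic assumptions, a few tens of terabytes over a modest inter-site link quickly turns into many hours, which is why the resync impact is worth thinking through before running a Site Takeover.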

What do I do after a vSAN Stretched Cluster Site Takeover?

Duncan Epping · Nov 10, 2025 · 4 Comments

Over the last couple of months, various new vSAN features were announced. Two of those features are around the Stretched Cluster configuration, and have probably been the number 1 feature request for a few years. Now that we have Site Takeover and Site Maintenance functionality available, I am starting to get some questions about the impact of them, and in particular, the Site Takeover functionality is raising some questions.

For those who don’t know what these features are, let me describe them briefly:

Site Maintenance = The ability to place a full vSAN stretched cluster Fault Domain into maintenance mode at once. This ensures that all hosts within the fault domain have consistently stored the data, and all hosts will go into maintenance mode at the same time.

Site Takeover = The ability, when a Witness and a Data Site have failed, to bring back the remaining site through a command line interface. This will reconstruct the remaining “site local” RAID configuration, making the objects available again, which will then allow vSphere HA to restart the VMs.

Now, the question that the above typically raises is what happens to the Witness and the Data Site that failed when you do the Site Takeover? If you look at the VM's RAID configuration, you will notice that the components of both the Witness and the Data Site that failed will completely disappear from the RAID configuration.

But what do you do next? Even after you run the Site Takeover, you still see your hosts and the witness in vCenter Server, and you still see a stretched cluster configuration in the UI. At first I thought that once the environment was completely up and running again, you had to go through some manual effort to reconstruct the stretched cluster. Basically, remove the failed hosts, wipe the disks, and recreate the stretched cluster. This is, however, not the case.

In the example above, if the Preferred site and the Witness site return for duty, vSAN will automatically discard the stale components in those previously failed sites. It will recreate new components for all objects, and it will do a full resync of the data.

If you end up in a situation where your hosts are completely gone (let’s say as a result of a fire), then you will have to do some kind of manual cleanup as follows, before you rebuild and add hosts back:

  • Remove the failed hosts from the vCenter inventory
  • Remove the witness from the vCenter inventory
    • Delete the witness from the vCenter Server it is running on, a real delete!
  • Delete the surviving Fault Domain, this should be the only Fault Domain still listed in the vCenter interface
  • You now have a normal cluster again
  • Rebuild hosts and recreate the stretched cluster

I hope that helps.

vSAN Stretched Cluster vs Fault Domains in a “campus” setting?

Duncan Epping · Sep 25, 2025 · 2 Comments

I got this question internally recently: Should we create a vSAN Stretched Cluster configuration or create a vSAN Fault Domains configuration when we have multiple datacenters within close proximity on our campus? In this case, we are talking about less than 1ms latency RTT between buildings, maybe a few hundred meters at most. I think it is a very valid question, and I guess it kind of depends on what you are looking to get out of the infrastructure. I wrote down the pros and cons, and wanted to share those with the rest of the world as well, as it may be useful for some of you out there. If anyone has additional pros and cons, feel free to share those in the comments!

vSAN Stretched Clusters:

  • Pro: You can replicate across fault domains AND protect additionally within a fault domain with R1/R5/R6 if required.
  • Pro: You can decide whether VMs should be stretched across Fault Domains or not, or just protected within a fault domain/site
  • Pro: Requires less than 5ms RTT latency, which is easily achievable in this scenario
  • Con/pro: you probably also need to think about DRS/HA groups (VM-to-Host)
  • Con: From an operational perspective, it also introduces a witness host, and sites, which may complicate things, and at the very least requires a bit more thinking
  • Con: Witness needs to be hosted somewhere
  • Con: Limited to 3 Fault Domains (2x data + 1x witness)
  • Con: Limited to 20+20+1 configuration

vSAN Fault Domains:

  • Pro: No real considerations around VM-to-host rules usually, although you can still use it to ensure certain VMs are spread across buildings
  • Pro: No Witness Appliance to manage, update or upgrade. No overhead of running a witness somewhere
  • Pro: No design considerations around “dedicated” witness sites and “data site”, each site has the same function
  • Pro: Can also be used with more than 3 Fault Domains or Datacenters, so could even be 6 Fault Domains, for instance
  • Pro: Theoretically can go up to 64 hosts
  • Con: No ability to protect additionally within a fault domain
  • Con: No ability to specify that you don’t want to replicate VMs across Fault Domains
  • Con/Pro: Requires sub-1ms RTT latency at all times, which is low, but will be achievable in a campus cluster, usually
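The hard limits from the two lists above can be captured in a small Python sketch that tells you which options are even on the table for a given campus. This only encodes the numbers mentioned in this post (5ms vs 1ms RTT, 20+20+1, 64 hosts); the minimum of 3 fault domains is an assumption that is not stated above, and the function name `viable_options` is made up. The softer pros and cons still need a human decision.

```python
def viable_options(rtt_ms, buildings, hosts_per_building):
    """Return which vSAN configurations fit the stated hard limits.

    Assumptions: every building holds the same number of hosts, and a
    Fault Domains configuration needs at least 3 fault domains (one per
    building); the latter is not stated in the lists above.
    """
    options = []

    # Stretched cluster: 2 data sites + witness, <5ms RTT, max 20+20+1
    if buildings == 2 and rtt_ms < 5 and hosts_per_building <= 20:
        options.append("stretched cluster")

    # Fault domains: sub-1ms RTT at all times, up to 64 hosts in total
    total_hosts = buildings * hosts_per_building
    if buildings >= 3 and rtt_ms < 1 and total_hosts <= 64:
        options.append("fault domains")

    return options


print(viable_options(rtt_ms=0.8, buildings=2, hosts_per_building=10))
print(viable_options(rtt_ms=0.8, buildings=4, hosts_per_building=16))
```

Note that the stretched cluster option still requires a witness to be hosted somewhere, which this sketch does not check.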

About the Author

Duncan Epping is a Chief Technologist and Distinguished Engineering Architect at Broadcom. Besides writing on Yellow-Bricks, Duncan is the co-author of the vSAN Deep Dive and the vSphere Clustering Deep Dive book series. Duncan is also the host of the Unexplored Territory Podcast.

Copyright Yellow-Bricks.com © 2026