Yellow Bricks

by Duncan Epping



vSphere HA restart times, how long does it actually take?

Duncan Epping · Mar 13, 2025

I had a question today based on material I wrote years ago for the Clustering Deepdive. (read it here) The material describes the sequence HA goes through when a failure has occurred. If you look, for instance, at the sequence where a “secondary” host has failed, it looks as follows:

  • T0 – Secondary host failure.
  • T3s – Primary host begins monitoring datastore heartbeats for 15 seconds.
  • T10s – The secondary host is declared unreachable and the primary will ping the management network of the failed secondary host. This is a continuous ping for 5 seconds.
  • T15s – If no heartbeat datastores are configured, the secondary host will be declared dead if there is no reply to the ping.
  • T18s – If heartbeat datastores are configured, the secondary host will be declared dead if there’s no reply to the ping and the heartbeat file has not been updated or the lock was lost.

So, depending on whether you have heartbeat datastores or not, this sequence takes either 15 or 18 seconds. Does that mean the VMs are then instantly restarted, and if so, how long does that take? Well, no, they won’t instantly restart, because at the end of this sequence the failed secondary host has only been declared dead. HA still needs to verify whether the potentially impacted VMs have actually failed, build a list of “to be restarted” VMs, and issue a placement request.

The placement request will either go to DRS or be handled by HA itself, depending on whether DRS is enabled and vCenter Server is available. After placement has been determined, the primary host will request the individual hosts to restart the VMs placed on them. After a host has received the list of VMs it needs to restart, it will do this in batches of 32, and of course restart priority/order will be applied. This whole process can easily take 10-15 seconds (if not longer), which means that in a perfect world the restart of the VM is initiated after about 30 seconds. Note that this is only when the restart is initiated; it does not mean that the VM, or the services it is hosting, will be available after 30 seconds. The power-on sequence of the VM can take anywhere from seconds to minutes, depending of course on the size of the VM and the services that need to be started during the power-on sequence.
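
To make the math explicit, here is a tiny back-of-the-envelope sketch in Python. The detection windows come straight from the timeline above, and the 10-15 second figure for verification, placement, and the restart request is the rough estimate from the previous paragraph, so treat the output as illustrative only.

# Detection windows from the timeline above (seconds).
DETECTION = {"no heartbeat datastores": 15, "heartbeat datastores": 18}

# Rough estimate for failure verification, placement, and the restart request.
VERIFY_PLACE_RESTART = (10, 15)

def restart_initiation_window(config: str) -> tuple[int, int]:
    """Return (best, worst) case seconds until the VM restart is initiated."""
    low, high = VERIFY_PLACE_RESTART
    return DETECTION[config] + low, DETECTION[config] + high

for config in DETECTION:
    print(config, restart_initiation_window(config))
# Roughly 25 to 33 seconds before the power-on of the VM even begins.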

So, although it only takes 15 to 18 seconds for vSphere HA to detect and declare a failure, there’s much more to it. Hopefully this post provides a better understanding of all that is involved.

New vCLS architecture with vSphere 8.0 Update 3

Duncan Epping · Sep 24, 2024

Some of you may have seen this, others may not have. As I had a question today about vCLS retreat mode with 8.0U3, I figured I would quickly write something on the topic. Starting with vSphere 8.0 Update 3 we introduced a new architecture for vCLS, aka vSphere Cluster Services. Pre-vSphere 8.0 Update 3, the vCLS architecture was based on virtual machines running Photon OS. These VMs were primarily there to assist in enabling and disabling DRS. If something was wrong with these VMs, then DRS would also be unable to function normally. In the past, many of you have probably experienced situations where you had to kill and delete the vCLS VMs to restore DRS functionality; for that, VMware introduced a feature called “retreat mode”, which basically killed and deleted the VMs for you. There were some other challenges with the vCLS VMs as well, and as a result the team decided to create a new design for vCLS.

Starting with vSphere 8.0 Update 3, vCLS is now implemented as what I would call a container runtime, sometimes referred to as a Pod VM or PodCRX. In other words, when you upgrade to vSphere 8.0 Update 3 you will see your current vCLS VMs being deleted, and these new shiny vCLS VMs will pop up. How do you know these VMs are created using a different mechanism? Well, you can simply see that in the UI, as demonstrated below. See the “CRX” mention in the UI?

[Image: vCLS VM in the vSphere UI, with “CRX” mentioned as the VM type]
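
If you’d rather check from the API than the UI, here is a minimal pyVmomi sketch that lists the vCLS VMs and what they report as their guest OS. It assumes the default “vCLS” name prefix and placeholder vCenter credentials; it simply surfaces the VM configuration for inspection rather than relying on a dedicated “is CRX” property.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.name.startswith("vCLS"):  # default naming assumption
            print(f"{vm.name}: {vm.runtime.powerState}, {vm.config.guestFullName}")
    view.Destroy()
finally:
    Disconnect(si)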

So you may ask yourself, why should I even care? Well, the thing is, you shouldn’t need to. The new vCLS architecture uses fewer resources per VM, fewer vCLS VMs are deployed to begin with (two instead of three), and they are more resilient. Also, when a host that has a vCLS VM running is placed into maintenance mode, for instance, that vCLS instance is deleted and recreated elsewhere. Considering the VMs are stateless and tiny, that is much more efficient than trying to vMotion them. Note that vMotion and Storage vMotion of these new (“Embedded”, as they call them) vCLS VMs aren’t even supported to begin with.

Normally, vCLS retreat mode shouldn’t be needed anymore, but if you do end up in a situation where you need to clean up these instances, Retreat Mode is still fully supported with 8.0 U3. You can find the Retreat Mode option in the same place as before, on your cluster object under “Configure –> vSphere Cluster Services –> General –> Edit vCLS Mode”. Simply select “Retreat Mode” and the cleanup should happen automatically. When you want the VMs to be recreated, simply go back to the same UI and select “System managed”; this should then lead to the vCLS VMs being recreated.
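
For completeness: before this UI option existed, retreat mode was typically triggered through a vCenter Server advanced setting whose key embeds the cluster’s domain id (see VMware KB 80472). Below is a minimal pyVmomi sketch of flipping that same toggle; the vCenter address, credentials, and cluster name are placeholders, and the string value mirrors what the advanced settings UI stores.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster-01")  # placeholder name
    view.Destroy()

    # Key looks like "config.vcls.clusters.domain-c8.enabled" (KB 80472);
    # "false" enables retreat mode, "true" returns to system managed.
    key = f"config.vcls.clusters.{cluster._moId}.enabled"
    content.setting.UpdateOptions(changedValue=[
        vim.option.OptionValue(key=key, value="false")])
finally:
    Disconnect(si)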

I hope this helps,


Disable the re-registering of HA disabled VMs on other hosts!

Duncan Epping · Jan 24, 2023

Years ago, various customers complained about the fact that VMs which were disabled for HA would not be re-registered when the host they were registered on failed. I can understand why you would want the VMs to be re-registered, as this makes it easier to power on those VMs when a host has failed. If the VM is not re-registered automatically and the host on which it was registered has failed, you would have to manually register the VM, and only then would you be able to power it on.

Now, it doesn’t happen too often, but there are also situations where certain VMs are disabled for HA restarts (or powered off) and customers don’t want those VMs to be re-registered, as these VMs are only allowed to run on one particular host. In that particular case you can simply disable the re-registering of HA-disabled VMs through the use of an advanced setting. The advanced setting, and the value to use, is the following:

das.reregisterRestartDisabledVMs - false
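
If you prefer to set this programmatically rather than through the UI, the following is a minimal pyVmomi sketch that pushes the option into the cluster’s HA (das) configuration. The vCenter address, credentials, and cluster name are placeholders, and certificate verification is disabled for lab use only.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster-01")  # placeholder name
    view.Destroy()

    # Merge the advanced option into the existing HA configuration.
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(
            option=[vim.option.OptionValue(
                key="das.reregisterRestartDisabledVMs", value="false")]))
    cluster.ReconfigureComputeResource_Task(spec, modify=True)
finally:
    Disconnect(si)

Note that modify=True merges this change into the current cluster configuration instead of replacing it.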

The following VIBs on the host are missing from the image and will be removed from the host during remediation: vmware-fdm

Duncan Epping · Jan 2, 2023

I’ve seen a few people being confused about a message which is shown when upgrading ESXi. The message is: “The following VIBs on the host are missing from the image and will be removed from the host during remediation: vmware-fdm (version number + build number)”. This happens when you use vLCM (Lifecycle Manager) to upgrade from one version of ESXi to the next. The reason for it is simple: the vSphere HA VIB (vmware-fdm) is never included in the image.

[Image: vLCM remediation warning listing the vmware-fdm VIB as missing from the image]

If it is not included, how do the hosts get the VIB? The VIB is pushed to the hosts by vCenter Server when required, for instance when you enable HA on a cluster. This is also the case after an upgrade: after the VIB is removed, it will simply be replaced with the latest version by vCenter Server. So there is no need to be worried, HA will work perfectly fine after the upgrade!
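
If you want to double-check after remediation that the HA agent is healthy again on every host, a minimal pyVmomi sketch along these lines should do; the vCenter address and credentials are placeholders.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        for host in cluster.host:
            # dasHostState is unset when HA is not (yet) configured on the cluster
            das = host.runtime.dasHostState
            print(f"{host.name}: {das.state if das else 'HA agent state unavailable'}")
    view.Destroy()
finally:
    Disconnect(si)

A healthy HA-enabled cluster should report one host as “master” and the others as “connectedToMaster”.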

Why does HA not power-on VMs after a full cluster shutdown?

Duncan Epping · Dec 20, 2021

I received this question and figured I would write a quick post about it, as it comes up occasionally: why does vSphere HA not power on VMs after a full cluster is brought back online following a full cluster shutdown? In this case, the customer had a power outage, so their hosts and all VMs were cleanly powered off by an administrator as a result of the backup power unit running out of power. Unfortunately, this happens more frequently than you would think.

When VMs are powered off by an administrator, or by anyone/anything else (PowerCLI etc.) which has permissions to power off VMs, then vCenter Server will mark these VMs as “cleanly powered off”. The state of the VMs is also tracked by vSphere HA, so if a host is powered off, HA will know whether the VM was powered on or powered off at the time the host went missing.

Now, when the host (or hosts) returns for duty, vSphere HA will of course verify what the last known state of the cluster was. It will read the list of all the VMs that were powered on, and it will restart those that were powered on and are configured for HA. It will also look at a VM property called “runtime.cleanPowerOff”; this property indicates whether the VM was cleanly powered off by an admin or a script, or whether the VM was, for instance, powered off by vSphere HA itself (PDL response etc.). Depending on the value of the property, the VM will or will not be restarted.
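
You can inspect this property yourself. Below is a minimal pyVmomi sketch, with placeholder credentials, that lists powered-off VMs together with their runtime.cleanPowerOff flag.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOff:
            # Per the post: True means a clean power-off by an admin or script,
            # so HA has no reason to restart this VM after a failure.
            print(f"{vm.name}: cleanPowerOff={vm.runtime.cleanPowerOff}")
    view.Destroy()
finally:
    Disconnect(si)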

Having said all of that, when you power off a VM manually through the UI or via a script, the VM will be marked as “cleanly powered off”. This means that HA has no reason to restart it, as the powered-off VM is not the result of a host, network, or storage failure.

