vsphere ha

Disable the re-registering of HA disabled VMs on other hosts!

Duncan Epping · Jan 24, 2023 · Leave a Comment

Years ago we had various customers that complained about the fact that they had VMs that were disabled for HA and that these VMs would not be re-registered when the host they were registered on would fail. I can understand why you would want the VMs to be re-registered as this makes it easier to power-on those VMs when a host has failed. If the VM would not be re-registered automatically, and the host to which it was registered has failed, you would have to manually register the VM and only then would you be able to power on that VM.

Now, it doesn’t happen too often, but there are also situations where certain VMs are disabled for HA restarts (or powered off) and customers don’t want to have those VMs to be re-registered as these VMs are only allowed to run on one particular host. In that particular case you can simply disable the re-registering of HA disabled VMs through the use of an advanced setting. The advanced setting for this, and the value to use, is the following:

das.reregisterRestartDisabledVMs - false

The following VIBs on the host are missing from the image and will be removed from the host during remediation: vmware-fdm

Duncan Epping · Jan 2, 2023 · 1 Comment

I’ve seen a few people being confused about a message which is shown when upgrading ESXi. The message is: The following VIBs on the host are missing from the image and will be removed from the host during remediation: vmware-fdm(version number + build number). Now this happens when you use vLCM (Lifecycle Manager) to upgrade from one version of ESXi to the next. The reason for it is simple, the vSphere HA VIB (vmware-fdm) is never included in the image.

If it is not included, how do the hosts get the VIB? The VIB is pushed by vCenter Server to the hosts when required! (When you enable HA for instance on a cluster.) This also is the case after an upgrade. After the VIB is removed it will simply be replaced by the latest version of it by vCenter Server. So no need to be worried, HA will work perfectly fine after the upgrade!

Why does HA not power-on VMs after a full cluster shutdown?

Duncan Epping · Dec 20, 2021 ·

I received this question and figured I would write a quick post about it, as it comes up occasionally. Why does vSphere HA no power-on VMs after a full cluster is brought back online after a full cluster shutdown? In this case, the customer had a power outage, so their hosts and all VMs were powered off, by an administrator cleanly, as a result of the backup power unit running out of power. Unfortunately, this happens more frequently than you would think.

When VMs are powered off by an administrator, or anyone/anything (PowerCLI etc) else which has permissions to power off VMs, then vCenter Server will mark these VMs as “cleanly powered off”. Next, also the state of the VMs is tracked by vSphere HA. So if a host is powered off, HA will know if the VM was powered on, or powered off at the time the host goes missing.

Now, when the host (or hosts) returns for duty, vSphere HA will of course verify what the last known state was of the cluster. It will read the list of all the VMs that were powered on, and it will restart those that were powered on and are configured for HA. It will also look at a VM property called “runtime.cleanPowerOff”, this property indicates if the VM was cleanly powered off by an Admin or a script, or if the VM was for instance powered off by vSphere HA itself. (PDL response etc.) Depending on the value of the property, the VM will, or will not be restarted.

Having said all of that, when you power off a VM manually through the UI, or via a script, then the VM will be marked as being “cleanly powered off”. This means that HA has no reason to restart it, as the powered-off VM is not the result of a host, network, or storage failure.

Running vSphere 6.7 or 6.5 and configured HA APD/PDL responses? Read this…

Duncan Epping · May 14, 2020 ·

If you are running vSphere 6.7 or 6.5 and have not installed 6.7 P02 yet (6.5 P05 is available soon) and you have APD/PDL responses configured within vSphere HA it could be that an issue causes VMs not to be failed over when an APD or PDL occurs. This is a known issue in the release, and P02 or P05 solves this problem. What is the problem? Well, a bug causes VMs which are listed in “VM overrides” to have settings that are not configured to be set to “disabled” instead of “unset”, in specific the APD/PDL setting.

This means that even though you have APD/PDL responses configured on a cluster level, the VM level configuration overrides it as it would be set to “disabled”. It doesn’t matter really why you added them to VM Overrides, could be to configure VM Restart Priority for instance. The frustrating part is that the UI doesn’t show you it is disabled as it looks like it is not configured.

If you can’t install the patch just yet, for whatever reason, but you do have VMs in VM Overrides, make sure to go to VM Overrides and explicitly configure the VMs to have the APD/PDL responses enabled similar to what it is configured to on a cluster level as shown in the screenshots below.

vSphere HA internals: VMCP super aggressive option in vSphere 7

Duncan Epping · May 11, 2020 ·

Most of you probably heard about a feature called VMCP aka VM Component Protection. If not, this is the functionality in vSphere HA that enabled you to restart VMs which have been impacted by a PDL (permanent device loss) or APD (all paths down) scenario. (If you have no idea what I am talking about read this article first.)

When you configure the APD response you have four options:

Disable
Issue Event
Power Off / Restart – Conservative
Power Off / Restart – Aggressive

The main difference between Conservative and Aggressive is that if you find yourself in a situation where HA isn’t sure whether a VM can be restarted during an APD scenario it will not power off the VM when using Conservative. If you have it configured as Aggressive it will power off the VM. However, if HA is certain that a VM can’t be powered on it will not power off the VM. Basically it prefers availability of the VM.

As you can imagine, in certain scenarios having a VM running while it is impacted by an “APD” situation makes no sense. The VM has lost access to storage, and you simply may prefer to kill the workload. Why? Well, when it loses access to storage it can’t write to disk. You could find yourself in a situation where a change is acknowledged and you think it is written to disk but it somehow is sitting in a memory cache etc.

If you prefer the VM to be killed, regardless of whether it can be restarted or not, you can enable this via a vSphere HA advanced setting. Now before you implement this, do note that if a cluster-wide APD situation occurs, you could find yourself in the scenario where ALL virtual machines are powered off by HA and not restarted as the resources are not available. Anyway, if you feel this is a requirement, you can configure the following vSphere HA advanced setting in vSphere 7:

das.restartVmsWithoutResourceChecks = true