
Yellow Bricks

by Duncan Epping



VCG Notification demo and Changing the Default vSAN Policy Demo

Duncan Epping · Jun 22, 2020 ·

I created two YouTube videos last week that I just wanted to share with everyone. In these demos I am showing the new VCG Notification option, which is very useful for customers who want to be notified via email when a change to a component of a Ready Node configuration occurs. This could be a change in support status, a driver/firmware change, etc.

Another demo that I recorded shows how to change the default policy for a vSAN cluster. This seems to be an option that many folks haven’t been able to find in the UI. It is pretty straightforward, hence I am sharing it here.


Running ESXi in “Degraded Mode”, what does that mean?

Duncan Epping · Jun 15, 2020 ·

I received a question today, and I didn’t have the answer, so I reached out to one of the developers. The person asking found the following line in the ESXi documentation, and the question was: what does running ESXi in degraded mode actually mean, and what is the impact?

If a local disk cannot be found, then ESXi 7.0 operates in degraded mode where certain functionality is disabled and the /scratch partition is on the RAM disk, linked to /tmp. You can reconfigure /scratch to use a separate disk or LUN. For best performance and memory optimization, do not run ESXi in degraded mode.

In other words, “degraded mode” is a situation where you are running ESXi with an undesired boot disk configuration. In this case, the boot disk configuration (size, etc.) means that /scratch is not stored on persistent media but rather in RAM, which means that state is lost during a reboot. This could lead to various problems, hence it is called degraded mode (or state). Note that besides running “degraded” today, this configuration could also prevent you from upgrading in the future.
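If you want to check whether a host is running in this state, a quick look from the ESXi shell tells you where /scratch points. A minimal sketch, assuming shell access is enabled (the ScratchConfig advanced options can also be inspected through the vSphere Client):

# If /scratch is a symlink into /tmp, scratch lives on the RAM disk
ls -ld /scratch

# Inspect the currently active scratch location
vim-cmd hostsvc/advopt/view ScratchConfig.CurrentScratchLocation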

So how do you resolve this problem? Follow the recommendations VMware provides for the ESXi configuration:

  • An 8 GB USB or SD and an additional 32 GB local disk. The ESXi boot partitions reside on the USB or SD and the ESX-OSData volume resides on the local disk.
  • A local disk with a minimum of 32 GB. The disk contains the boot partitions and ESX-OSData volume.
  • A local disk of 142 GB or larger. The disk contains the boot partitions, ESX-OSData volume, and VMFS datastore.
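If you are unsure how a disk is currently laid out, you can inspect its partition table from the ESXi shell. A minimal sketch; the device name below is a hypothetical placeholder for your actual boot device:

# List the available disk devices
ls /vmfs/devices/disks/

# Show the partition table of a specific device (hypothetical device name)
partedUtil getptbl /vmfs/devices/disks/naa.600508b1001c4d41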

Although not a requirement, I would urge you to read and follow these recommendations from the documentation as well:

  • Although an 8 GB USB or SD device is sufficient for a minimal installation, you should use a larger device. The additional space is used for an expanded core dump file and the extra flash cells of a high-quality USB flash drive can prolong the life of the boot media. Use a 32 GB or larger high-quality USB flash drive.
  • If you install ESXi on M.2 or other non-USB low-end flash media, delete the VMFS datastore on the device immediately after installation.

If you want to mitigate the situation after upgrading to ESXi 7.0, you can add a new local disk, enable “autoPartition=TRUE”, and reboot. At reboot, the disk will be partitioned and populated for use. The use of this advanced setting, and others that relate to ESXi 7.0, is described in this KB article here.
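As a sketch of what enabling this could look like from the ESXi shell, assuming the kernel setting name referenced in the KB article:

# Enable automatic partitioning of the empty local disk at next boot
esxcli system settings kernel set -s autoPartition -v TRUE

# Verify the setting before rebooting
esxcli system settings kernel list -o autoPartition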

For those wondering, “ESX-OSData” is the partition where we now store the content of what was previously stored in “scratch”, “core”, and “locker”. Niels wrote a deep dive on the vSphere blog here; go check that out.
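As a side note, on an ESXi 7.0 host the OSData volume typically shows up under /vmfs/volumes. A quick way to spot it, assuming the default OSDATA-prefixed volume label:

# The ESX-OSData volume is usually labeled OSDATA-<uuid>
ls -lh /vmfs/volumes/ | grep -i osdata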

vCenter 7.0 stating “New vCenter server updates are available” while there are no updates?

Duncan Epping · Jun 8, 2020 ·

I have seen this issue reported on the VMware Community Forum a few times: when you run vCenter 7.0, you receive a message in the vSphere Client stating “New vCenter server updates are available”. When you then click “View Updates”, however, you will notice that there are no updates available for vCenter Server and you are indeed running the latest and greatest version. We (Cormac and I) actually encountered the issue in our lab as well, as demonstrated in the screenshot below.

Pretty confusing indeed. Please note that this is a known issue; there’s no need to report it to VMware. A patch that fixes this problem will be released for vCenter Server soon.

The issue is fixed in 7.0b as documented in the release notes!

Running vSphere 6.7 or 6.5 and configured HA APD/PDL responses? Read this…

Duncan Epping · May 14, 2020 ·

If you are running vSphere 6.7 or 6.5, have not installed 6.7 P02 yet (6.5 P05 will be available soon), and have APD/PDL responses configured within vSphere HA, an issue could cause VMs not to be failed over when an APD or PDL occurs. This is a known issue in the release, and P02 or P05 solves the problem. What is the problem? Well, a bug causes VMs that are listed under “VM Overrides” to have settings that are not explicitly configured set to “disabled” instead of “unset”, specifically the APD/PDL setting.

This means that even though you have APD/PDL responses configured at the cluster level, the VM-level configuration overrides it, as it is set to “disabled”. It doesn’t really matter why you added the VMs to VM Overrides; it could be to configure VM restart priority, for instance. The frustrating part is that the UI doesn’t show you it is disabled, as it looks like it is simply not configured.

If you can’t install the patch just yet, for whatever reason, but you do have VMs in VM Overrides, make sure to go to VM Overrides and explicitly configure those VMs to have the APD/PDL responses enabled, similar to what is configured at the cluster level, as shown in the screenshots below.

vSphere HA internals: restart placement changes in vSphere 7!

Duncan Epping · May 13, 2020 ·

Frank and I are looking to update the vSphere Clustering deep dive to vSphere 7. While scoping the work I stumbled onto something interesting: a change that was introduced to the vSphere HA restart mechanism in vSphere 7, specifically around the placement of VMs. In previous releases vSphere HA had a straightforward way of doing placement for VMs that need to be restarted as a result of a failure. In vSphere 7.0 this mechanism was completely overhauled.

So how did it work pre-vSphere 7?

  • HA uses the cluster configuration
  • HA uses the latest compatibility list it received from vCenter
  • HA leverages a local copy of the DRS algorithm with a basic (fake) set of stats and runs the VMs through the algorithm
  • HA receives a placement recommendation from the local algorithm and restarts the VM on the suggested host
  • Within 5 minutes DRS runs within vCenter, and will very likely move the VM to a different host based on actual load

As you can imagine, this is far from optimal. So what has been introduced in vSphere 7? Well, there are now two different ways of doing placement for restarts:

  1. Remote Placement Engine
  2. Simple Placement Engine

The Remote Placement Engine, in short, is the ability for vSphere HA to make a call to DRS for a placement recommendation for a VM. This takes the current load of the cluster, VM happiness, and all configured affinity/anti-affinity/VM-host affinity rules into consideration! Will this result in a much slower restart? The great thing is that the DRS algorithm has been optimized over the past years and is so fast that there will not be a noticeable difference between the old mechanism and the new one. An added benefit for the engineering team is that they can remove the local DRS module, which means there’s less code to maintain. How this works is that the FDM master communicates with the FDM Manager, which runs in vCenter Server. The FDM Manager then communicates with the DRS service to request a placement recommendation.

Now some of you will probably wonder what happens when vCenter Server is unavailable. Well, this is where the Simple Placement Engine comes into play. The team has developed a new placement engine that basically takes a round-robin approach, although it does of course consider “must rules” (VM-to-host) and the compatibility list. Note that affinity and anti-affinity rules are not considered when SPE is used instead of RPE! This is a known limitation, which is expected to be fixed in the future. If a host, for instance, is not connected to the datastore of the VM that needs to be restarted, then that host is excluded from the list of potential placement targets. By the way, before I forget: vSphere 7 also introduces a vCenter heartbeat mechanism as a result. HA heartbeats the vCenter Server instance to understand when it needs to resort to the Simple Placement Engine instead of the Remote Placement Engine.

I dug through the FDM log (/var/log/fdm.log) to find some proof of these new mechanisms, and found an entry that shows there are indeed two placement engines:

Invoking the RPE + SPE Placement Engine

RPE stands for “remote placement engine” and SPE for “simple placement engine”, where “remote” of course refers to DRS. You may ask yourself: how do you know whether DRS is being called? Well, that is something you can see in the DRS log files; when a placement request is received, the below entry shows up in the log file:

FdmWaitForUpdates-vim.ClusterComputeResource:domain-c8-26307464

This even happens when DRS is disabled, and even when you use a license edition that does not include DRS, which is really cool if you ask me. If for whatever reason vCenter Server is unavailable, and as a result DRS can’t be called, you will see this mentioned in the FDM log and, as shown below, the Simple Placement Engine’s recommendation will be used for the placement of the VM:

Invoke the placement service to process the placement update from SPE
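If you want to look for these entries in your own environment, a quick search of the FDM log could look like this. A minimal sketch; the exact log strings may differ per build:

# Search the FDM log for placement engine activity
grep -i "placement engine" /var/log/fdm.log

# Look specifically for placements handled by SPE
grep "placement update from SPE" /var/log/fdm.log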

A cool and very useful small HA enhancement in vSphere 7.0, if you ask me!


** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

