
Yellow Bricks

by Duncan Epping



Demo Time: How to delete the vCLS VMs

Duncan Epping · Oct 27, 2020 ·

As I have received a bunch of questions about how you can delete the vSphere Cluster Service VMs (vCLS VMs), I figured I would create a quick demo. It is pretty straightforward, and it should only be used when people are doing some kind of full cluster maintenance. This demo shows you how to get the VMs deleted by leveraging a vCenter Server level advanced setting (config.vcls.clusters.domain-c<identifier>.enabled). I have also written a post with a bunch of requirements, Q&A, and considerations for the vCLS VMs; if you are interested in that, read it here. Note, if you have a resource pool configuration, enabling “retreat mode” (disabling vCLS) doesn’t impact resource pools in any shape or form, it just impacts DRS load balancing. Anyway, I hope you find the demo useful.
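If you would rather script this than click through the UI, below is a minimal sketch using pyVmomi. The vCenter hostname, credentials, and cluster name are placeholders, and I am assuming the cluster’s “domain-c” identifier can be taken from the managed object ID; validate this in a lab before pointing it at anything important.

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

# Connect to vCenter Server (placeholder hostname and credentials)
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()

# Find the cluster and derive its domain identifier (e.g. "domain-c8")
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "MyCluster")
view.DestroyView()

# Build the vCenter Server level advanced setting name and set it to "false",
# which enables "retreat mode" and causes the vCLS VMs to be deleted
key = "config.vcls.clusters.%s.enabled" % cluster._moId
content.setting.UpdateOptions(changedValue=[
    vim.option.OptionValue(key=key, value="false")])

Disconnect(si)

Setting the value back to “true” should cause the vCLS VMs to be redeployed.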

What’s new for vSAN 7.0 U1!?

Duncan Epping · Sep 15, 2020 ·

Every 6-9 months VMware has been pushing out a new feature release of vSAN. After vSphere and vSAN 7.0, which introduced vSphere Lifecycle Manager and vSAN File Services, it is now time to share with you what is new for vSAN 7.0 U1. Again it is a feature-packed release, with many “smaller” enhancements, but also the introduction of some bigger functionality. Let’s just list the key features that have just been announced, and then discuss each of these individually. You better sit down, as this is going to be a long post. Oh, and note, this is an announcement, not the actual availability of vSAN 7.0 U1; for that you will have to wait some time.

  • vSAN HCI Mesh
  • vSAN Data Persistence Platform
  • vSAN Direct Configuration
  • vSAN File Services – SMB support
  • vSAN File Services – Performance enhancements
  • vSAN File Services – Scalability enhancements
  • vSAN Shared Witness
  • Compression-only
  • Data-in-transit encryption
  • Secure wipe
  • vSAN IO Insight
  • Effective capacity enhancements
  • Enhanced availability during maintenance mode
  • Faster host restarts
  • Enhanced pre-check for vSAN maintenance mode
  • Ability to override default gateway through the UI
  • vLCM support for Lenovo


Slight change in “restart” behavior for HA with vSphere 5.0 Update 1

Duncan Epping · Mar 27, 2012 ·

Although this is a corner-case scenario, I did want to discuss it to make sure people are aware of this change. Prior to vSphere 5.0 Update 1, a virtual machine would be restarted by HA when the master had detected that the state of the virtual machine had changed compared to the “protectedlist” file. In other words, a master would filter out the VMs it thought had failed before trying to restart any. Prior to Update 1, a master used the protection state it read from the protectedlist. If the master did not know the on-disk protection state for a VM, it did not try to restart it. Keep in mind that only one master can open the protectedlist file in exclusive mode.

In Update 1 this logic has changed slightly. HA can now retrieve the state information either from the protectedlist stored on the datastore or from vCenter Server. So now multiple masters could try to restart a VM. If one of those restarts were to fail, for instance because a “partition” does not have sufficient resources, the master in the other partition might be able to restart the VM. Although these scenarios are highly unlikely, this behavior change was introduced as a safety net!


** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

Permanent Device Loss (PDL) enhancements in vSphere 5.0 Update 1 for Stretched Clusters

Duncan Epping · Mar 16, 2012 ·

In the just-released vSphere 5.0 Update 1, some welcome enhancements were added around vSphere HA and how a Permanent Device Loss (PDL) condition is handled. A PDL condition is communicated by the array to ESXi via a SCSI sense code and indicates that a device (LUN) is unavailable, and more than likely permanently unavailable. This is useful for “stretched storage cluster” configurations, where in the case of a failure in Datacenter-A the configuration in Datacenter-B can take over. An example of when a condition like this would be communicated by the array is when a LUN is “detached” during a site isolation. PDL is probably most common in non-uniform stretched solutions like EMC VPLEX. With VPLEX, site affinity is defined per LUN. If your VM resides in Datacenter-A while the LUN it is stored on has affinity to Datacenter-B, then in case of a failure this VM could lose access to the LUN. These enhancements ensure the VM is killed and restarted on the other side.

Please note that action will only be taken when a PDL sense code is issued. When your storage completely fails, for instance, it is impossible to reach the PDL condition, as no communication from the array to the ESXi host is possible anymore; the ESXi host will then identify the state as an All Paths Down (APD) condition. APD is a more common scenario in most environments. If you are testing these enhancements, please check the log files to validate which condition has been identified.

With vSphere 5.0 and prior, HA did not respond to a PDL condition, meaning that when a virtual machine was residing on a datastore which had a PDL condition, the virtual machine would just sit there, even though it would be unable to read from or write to disk. As of vSphere 5.0 Update 1, a new mechanism has been introduced which allows vSphere HA to take action when a datastore has reached a PDL state. Two advanced settings make this possible. The first setting is configured on a host level and is “disk.terminateVMOnPDLDefault”. This setting can be configured in /etc/vmware/settings and should be set to “True”. It ensures that a virtual machine is killed when the datastore it resides on is in a PDL state. The virtual machine is killed as soon as it initiates disk I/O on a datastore which is in a PDL condition and all of the virtual machine files reside on this datastore. Note that if a virtual machine does not initiate any I/O, it will not be killed!
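To make that concrete, the entry in /etc/vmware/settings would look something like the line below; this is simply how I jot it down, so double-check the exact syntax against the documentation for your build:

disk.terminateVMOnPDLDefault = "True"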

The second setting is a vSphere HA advanced setting called das.maskCleanShutdownEnabled. This setting is not enabled by default either and needs to be set to “True”. It allows HA to trigger a restart response for a virtual machine which has been killed automatically due to a PDL condition; in other words, it allows HA to differentiate between a virtual machine which was killed due to the PDL state and a virtual machine which has been powered off by an administrator.
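For those who want to set this HA advanced setting programmatically instead of through the vSphere Client, something along the following lines should work with pyVmomi. Treat it as a sketch under my assumptions: “cluster” is a vim.ClusterComputeResource looked up the same way as in the vCLS sketch earlier on this page, and the option key/value are passed through as strings.

from pyVmomi import vim

# Reconfigure the cluster's HA (das) advanced options to set
# das.maskCleanShutdownEnabled to "True"; other settings are left untouched
# because modify=True only applies the fields present in the spec
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        option=[vim.option.OptionValue(key="das.maskCleanShutdownEnabled",
                                       value="True")]))
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)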

As soon as “disaster strikes” and the PDL sense code is sent, you will see the following pop up in the vmkernel.log, indicating the PDL condition and the kill of the VM:

2012-03-14T13:39:25.085Z cpu7:4499)WARNING: VSCSI: 4055: handle 8198(vscsi4:0):opened by wid 4499 (vmm0:fri-iscsi-02) has Permanent Device Loss. Killing world group leader 4491
2012-03-14T13:39:25.085Z cpu7:4499)WARNING: World: vm 4491: 3173: VMMWorld group leader = 4499, members = 1

As mentioned earlier, this is a welcome enhancement which, especially in non-uniform stretched storage environments, can help in specific failure scenarios.

vSphere 5.0 Update 1 released

Duncan Epping · Mar 16, 2012 ·

Although it is only a minor version, I do feel that it is worth mentioning and notifying people about. vSphere 5.0 Update 1 (click here for the vCenter release notes and here for the ESXi release notes) contains some cool enhancements. I listed the fixes and new features which I personally ran into over the last couple of months, or which are worth implementing or important in specific scenarios. Especially the HA (FDM) fixes are welcome, but so is the “disk.terminateVMOnPDLDefault” enhancement. I will write some more about that later today though.

vCenter Server 5.0 Update 1:

  • Resolved: HA and DRS appear disabled when the VM Storage Profiles feature is enabled or disabled for a cluster.
    When the VM Storage Profiles feature is enabled or disabled for a cluster, it causes a discrepancy in the HA and DRS cluster configuration.
  • Resolved: File-based FDM logging can be enabled inadvertently for ESX 5.x hosts in a mixed cluster with ESX 5.x and ESX 4.x hosts.
    The default FDM logging behavior for ESX 5.x hosts is to use syslog; file-based logging is disabled. In an HA cluster with a mix of 5.x and pre-5.x hosts, using the HA advanced option das.config.log.maxFileNum to increase the number of log files on the pre-5.0 hosts will inadvertently enable file-based logging for the ESX 5.x hosts as well. This can cause the ESX scratch partition to run out of space.
    This issue is resolved in this release by introducing the HA cluster advanced parameter “das.config.log.outputToFiles”. To enable file-based logging for ESX 5.x hosts, “das.config.log.maxFileNum” needs to be configured with a value greater than 2 and “das.config.log.outputToFiles” needs to be set to “true” (see the short sketch after this list).
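Configuring these two options programmatically follows the same pattern as the das.maskCleanShutdownEnabled sketch in the PDL post above; for example (illustrative values, same hypothetical pyVmomi “cluster” object assumed):

spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(option=[
        vim.option.OptionValue(key="das.config.log.maxFileNum", value="5"),
        vim.option.OptionValue(key="das.config.log.outputToFiles", value="true")]))
cluster.ReconfigureComputeResource_Task(spec, modify=True)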

ESXi 5.0 Update 1:

  • Resolved / New: No error message is logged when the VMkernel stops a virtual machine on a datastore that is in a PDL state
    When a SCSI device goes into the permanent device loss (PDL) state, all the virtual machines that use datastores backed by that SCSI device are affected. Some third-party HA solutions incorporate a VMX option where disk.terminateVMOnPDLDefault is set to True. With this option the VMkernel stops such affected virtual machines. Starting with this release, when the VMkernel stops affected virtual machines, a warning message (along the lines of the vmkernel.log output quoted in the PDL post above) is logged in vmkernel.log once for each virtual machine.
  • New: Enablement of session timeout for ESXi Tech Support Mode (TSM)
    After you log in to an ESXi host at the console and then log in to Tech Support Mode (TSM) as the root user and initiate a remote server console access session, a non-privileged user might obtain root access to the ESXi host if the remote access session has not timed out or remains idle. Starting with this release, you can configure a session timeout to exit ESXi Tech Support Mode (TSM) as follows:

    1. Log in to Tech Support Mode (TSM) as the root user.
    2. Edit the /etc/profile file to add TMOUT=<timeout value in seconds>.
    3. Exit Tech Support Mode (TSM).
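For example, to have an idle TSM session exit after ten minutes, step 2 would add the following line to /etc/profile (600 is just an illustrative value):

TMOUT=600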

vShield 5.0.1:

  • New: vShield App High Availability enhancements automatically restart vShield App or virtual machines if a heartbeat is not detected.
  • New: Enablement of Auto Deploy (stateless ESXi) by providing vShield VIBs (host modules) for download from vShield Manager.

