
New vCLS architecture with vSphere 8.0 Update 3

Duncan Epping · Sep 24, 2024 · 1 Comment

Some of you may have seen this, others may not. As I had a question today about vCLS retreat mode with 8.0 U3, I figured I would quickly write something on the topic. Starting with vSphere 8.0 Update 3 we introduced a new architecture for vCLS, aka vSphere Cluster Services. Pre-vSphere 8.0 Update 3 the vCLS architecture was based on virtual machines running Photon OS. These VMs were there primarily to assist in enabling and disabling DRS. If something was wrong with these VMs, then DRS would also be unable to function normally. In the past many of you have probably experienced situations where you had to kill and delete the vCLS VMs to restore DRS functionality. For that, VMware introduced a feature called “retreat mode”, which basically killed and deleted the VMs for you. There were some other challenges with the vCLS VMs, and as a result the team decided to create a new design for vCLS.

Starting with vSphere 8.0 Update 3, vCLS is now implemented as what I would call a container runtime, sometimes referred to as a Pod VM or PodCRX. In other words, when you upgrade to vSphere 8.0 Update 3 you will see your current vCLS VMs being deleted, and these new shiny vCLS VMs will pop up. How do you know if these VMs are created using a different mechanism? Well, you can simply see that in the UI as demonstrated below. See the “CRX” mention in the UI?

So you may ask yourself, why should I even care? Well, the thing is, you really should not have to. The new vCLS architecture uses fewer resources per VM, there are fewer vCLS VMs deployed to begin with (two instead of three), and they are more resilient. Also, when a host is for instance placed into maintenance mode while it has a vCLS VM running, that vCLS instance is deleted and recreated elsewhere. Considering the VMs are stateless and tiny, that is much more efficient than trying to vMotion them. Note, vMotion and SvMotion of this new type of vCLS VM (Embedded, as they call it) isn’t even supported to begin with.

Normally, vCLS retreat mode shouldn’t be needed anymore, but if you do end up in a situation where you need to clean up these instances, Retreat Mode is still fully supported with 8.0 U3 as well. You can find the Retreat Mode option in the same place as before, on your cluster object under “Configure –> vSphere Cluster Services –> General –> Edit vCLS Mode”. Simply select “Retreat Mode” and the clean up should happen automatically. When you want the VMs to be recreated, simply go back to the same UI and select “System managed”. This should then lead to the vCLS VMs being recreated.
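For those who prefer to script this rather than click through the UI: on earlier releases, retreat mode was typically toggled through a vCenter Server advanced setting named after the cluster's managed object ID (for example domain-c8). The sketch below only builds that setting's name and value, it does not talk to vCenter. The function name is hypothetical, and whether the advanced setting is still the right route on 8.0 U3 is an assumption you should verify against the current documentation before relying on it.

```python
import re
from typing import Tuple


def retreat_mode_setting(cluster_moid: str, retreat: bool) -> Tuple[str, str]:
    """Build the (key, value) pair of the vCLS advanced setting for one cluster.

    cluster_moid is the cluster's managed object ID, e.g. "domain-c8".
    retreat=True disables vCLS for the cluster (retreat mode),
    retreat=False hands control back to the system.
    """
    # Only cluster IDs of the form "domain-c<number>" make sense here.
    if not re.fullmatch(r"domain-c\d+", cluster_moid):
        raise ValueError(f"not a cluster managed object ID: {cluster_moid!r}")
    key = f"config.vcls.clusters.{cluster_moid}.enabled"
    value = "false" if retreat else "true"
    return key, value


if __name__ == "__main__":
    # Hypothetical cluster ID, used purely for illustration.
    print(retreat_mode_setting("domain-c8", retreat=True))
```

Setting the value back to "true" corresponds to selecting “System managed” in the UI.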

I hope this helps,

vSphere 7.0 U3 contains two great vCLS enhancements

Duncan Epping · Sep 28, 2021 ·

I have written about vCLS a few times, so I am not going to explain what it is or what it does (detailed blog here). I do want to talk about what is part of vSphere 7.0 U3 specifically though as I feel these features are probably what most folks have been waiting for. Starting with vSphere 7.0 U3 it is now possible to configure the following for vCLS VMs:

Preferred Datastores for vCLS VMs
Anti-Affinity for vCLS VMs with specific other VMs

I created a quick demo for those who prefer to watch videos to learn these things. If you don’t, skip to the text below. Oh, and before I forget, a bonus enhancement is that the vCLS VMs now have a unique name, which was a very common request from customers.

Why would you need the above functionality? Let’s begin with the “preferred datastore” feature. This allows you to specify where the vCLS VMs need to be provisioned from a storage point of view. This would be useful in a scenario where you have a number of datastores that you would prefer to avoid. Examples could be datastores that are replicated, or a datastore that is only intended to be used for ISOs and templates, or maybe you prefer to provision on hybrid storage versus flash storage.

So how do you configure this? Well, it is simple, you click on your cluster object. You then click on “Configure”, and on “Datastores” under “vSphere Cluster Services”. Now you will see “VCLS Allowed”. If you click on “ADD”, you will be able to select the datastores to which these vCLS VMs should be provisioned.

Next up, Anti-Affinity for vCLS. You would use this feature for situations where, for instance, a single workload needs to be able to run on a host all by itself, something like SAP. In order to achieve this, you can use anti-affinity rules. We are not talking about regular anti-affinity rules. This is the very first time a brand new mechanism is used on-premises. I am talking about compute policies. Compute policies have been available for VMware Cloud on AWS customers for a while, but now also appear to be coming to on-prem customers. What does it do? It enables you to create “anti-affinity” rules for vCLS VMs and specific other VMs in your environment by creating Compute Policies and using Tags!

How does this work? Well, you go to “Policies and Profiles” and then click “Compute Policies”. Now you can click “ADD” and create a policy. You now select “Anti Affinity with vSphere Cluster Services (vCLS) VMs”. Then you select the Tag you created for the VMs that should not run on the same hosts as the vCLS VMs, and then you click create. The vCLS VM Scheduler will then ensure that the vCLS VMs will not run on the same hosts as the tagged VMs. If there’s a conflict, the vCLS Scheduler will move the vCLS VMs away to other hosts within the cluster. Let’s reiterate that: the vCLS VMs will be vMotioned to another host in your cluster, the tagged VMs will not be moved!
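To make the placement rule concrete, here is a toy model of it. This is not the actual vCLS scheduler; the function name, host names and VM names are all made up for illustration. It simply picks hosts that run none of the tagged VMs and leaves the tagged VMs where they are.

```python
from typing import Dict, List, Set


def place_vcls(hosts: Dict[str, Set[str]], tagged_vms: Set[str], wanted: int) -> List[str]:
    """Return up to `wanted` hosts that run none of the tagged VMs.

    hosts maps a host name to the set of VM names running on it.
    Hosts are considered in alphabetical order so the result is deterministic.
    """
    eligible = sorted(
        host for host, vms in hosts.items()
        if not (vms & tagged_vms)  # skip hosts that run a tagged VM
    )
    return eligible[:wanted]


if __name__ == "__main__":
    cluster = {
        "esx01": {"sap-hana-01"},  # runs a tagged VM, so no vCLS VM here
        "esx02": {"web-01"},
        "esx03": set(),
    }
    print(place_vcls(cluster, tagged_vms={"sap-hana-01"}, wanted=2))
```

With the example cluster above, the two vCLS VMs end up on esx02 and esx03, and nothing is moved off esx01.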

Hope that helps!

Issue adding tags to the vCLS VMs with vCenter Server 7.0 U2b

Duncan Epping · Jun 1, 2021 ·

Today I was talking to one of our field folks and he asked if there was an issue with Tags in combination with vCLS VMs in 7.0 U2b specifically. I had tested assigning tags to vCLS VMs before, and it worked just fine. With 7.0 U2b unfortunately this has stopped working. The error you will see displayed in the vSphere Client is the following:

(vmodl.fault.SecurityError) {
faultCause = null,
faultMessage = null
}

Or as it shows in the UI:

So what can you do about it? Well, unfortunately not much right now, I filed a bug and uploaded the logs, engineers are looking at it as we speak, and hopefully, I will have an answer for those who need to use tags soon.

UPDATE: Engineering has found a workaround, customers who can’t wait for the fix can contact GSS to get the workaround implemented!

vSAN 7.0 U2 now integrates with vSphere DRS

Duncan Epping · Mar 24, 2021 ·

One of the features our team requested a while back was integration between DRS and vSAN. The key use case we had was for stretched clusters. Especially in scenarios where a failure has occurred, it would be useful if DRS would understand what vSAN is doing. What do I mean by that?

Today when customers create a stretched cluster they have two locations. Using vSAN terminology these locations are referred to as the Preferred Fault Domain and the Secondary Fault Domain. Typically when VMs are then deployed, customers will create VM-to-Host Affinity Rules which state that VMs should reside in a particular location. When these rules are created DRS will do its best to ensure that the defined rule is adhered to. What is the problem?

Well, if you are running a stretched cluster and, let’s say, one of the sites goes down, then what happens when the failed location returns for duty is the following:

vSAN detects the missing components are available again
vSAN will start the resynchronization of the components
DRS runs every minute and rebalances and will move VMs based on the DRS rules

This means that the VMs for which rules are defined will move back to their respective location, even though vSAN is potentially still resynchronizing the data. First of all, the migration will interfere with the replication traffic. Secondly, for as long as the resync has not completed, I/O will traverse the network between the two locations. This will not only interfere with resync traffic, it will also increase latency for those workloads. So, how does vSAN 7.0 U2 solve this?

Starting with vSAN 7.0 U2 and vSphere 7.0 U2 we now have DRS and vSAN communicating. DRS will verify with vSAN what the state of the environment is, and it will not migrate the VMs back until the VMs are healthy again. When the VMs are healthy and the resync has completed, you will see the rules being applied and the VMs automatically migrate back (when DRS is configured to Fully Automated, that is).
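The decision DRS makes here can be sketched as a simple check. This is only a model of the behavior described above, not how DRS or vSAN implement it; the class, field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VmState:
    name: str
    preferred_site: str    # site the VM-to-Host rule points at
    current_site: str      # site the VM is running in right now
    objects_healthy: bool  # vSAN reports the VM's objects as healthy


def may_migrate_back(vm: VmState, resync_in_progress: bool) -> bool:
    """Return True when the rule may be applied and the VM moved back."""
    if vm.current_site == vm.preferred_site:
        return False  # already where the rule wants it, nothing to do
    # Hold the VM where it is until vSAN is done and the VM is healthy.
    return vm.objects_healthy and not resync_in_progress


if __name__ == "__main__":
    vm = VmState("app-01", preferred_site="Preferred",
                 current_site="Secondary", objects_healthy=False)
    print(may_migrate_back(vm, resync_in_progress=True))   # still resyncing
    vm.objects_healthy = True
    print(may_migrate_back(vm, resync_in_progress=False))  # resync completed
```

The first call returns False because the resync is still running, the second returns True once the VM is healthy and the resync has completed.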

I can’t really show it with a screenshot or anything, as this is a change in the vSAN/DRS architecture, but to make sure it worked I recorded a quick demo which I published through YouTube. Make sure to watch the video!

How to login to the vCLS VMs!?

Duncan Epping · Nov 17, 2020 ·

I was asked this question this week: how can you log in to the vCLS VMs? Now before I share the video, I want to mention that I do not encourage people to do this, but as it is documented and supported I do want to provide a simple “how to”. If you want to log in to the vCLS VM, maybe for troubleshooting or for auditing, you can do so by first SSH’ing into your vCenter Server. When logged in to the vCenter Server you run the following command, which returns the password. This will then allow you to log in to the console of the vCLS VM. Again, I do not want to encourage you to do this. Either way, below you will find the command for retrieving the password, and a short demo of me retrieving the password and logging in.

/usr/lib/vmware-wcp/decrypt_clustervm_pw.py