Yellow Bricks

vSphere HA configuration for HCI Mesh!

Duncan Epping · Oct 29, 2020 ·

I wrote a vSAN HCI Mesh Considerations blog post a few weeks ago. Based on that post I received some questions, and one of the questions was around vSphere HA configurations. Interestingly I also had some internal discussions around how vSAN HCI Mesh and HA were integrated. Based on the discussions I did some testing just to validate my understanding of the implementation.

Now when it comes to vSphere HA and vSAN the majority of you will be following the vSAN Design Guide and understand that having HA enabled is crucial for vSAN. Also when it comes to vSAN configuring the Isolation Response is crucial, and of course setting the correct Isolation Address. However, so far there’s been an HA feature which you did not have to configure for vSAN and HA to function correctly, and that feature is VM Component Protection aka APD / PDL responses.

Now, this changes with HCI Mesh. Specifically for HCI Mesh the HA and vSAN team have worked together to detect APD (all paths down) down scenarios! When would this happen? Well if you look at the below diagram you can see that we have “Client Clusters” and a “Server Cluster”. The “Client Cluster” consumes storage from the “Server Cluster”. If for whatever reason a host in the “Client Cluster” loses access to the “Server Cluster”, it results in the VMs on that host consuming storage on the “Server Cluster” to lose access to the datastore. This is essentially an APD (all paths down) scenario.

Now, to ensure the VMs are protected by HA for this situation you only need to enable the APD response. This is very straight-forward. You simply go to the HA cluster settings and set the “Datastore with APD” setting to either “Power off and restart VMs – Conservative” or “Power off and restart VMs – Aggressive”. The difference between conservative and aggressive is that with conservative HA will only kill the VMs when it knows for sure the VMs can be restarted, wherewith aggressive it will also kill the VMs on a host impacted by an APD while it isn’t sure it can restart the VMs. Most customers will use the “Conservative Restart Policy” by the way.

As I also mentioned in the HCI Mesh Considerations blog, one thing I would like to call out is the timing for the APD scenario: The APD is declared after 60 seconds, after which the APD response (restart) is triggered automatically after 180 seconds. Mind that this is different than with an APD response with traditional storage, as with traditional storage it will take 140 seconds before the APD is declared. You can, of course, in the log file see that an APD is detected, declared and VMs are killed as a result. Note that the “fdm.log” is quite verbose, so I copied only the relevant lines from my tests.

APD detected for remote vSAN Datastore /vmfs/volumes/vsan:52eba6db0ade8dd9-c04b1d8866d14ce5
Go to terminate state for VM /vmfs/volumes/vsan:52eba6db0ade8dd9-c04b1d8866d14ce5/a57d9a5f-a222-786a-19c8-0c42a162f9d0/YellowBricks.vmx due to APD timeout (CheckCapacity:false)
Failover operation in progress on 1 Vms: 1 VMs being restarted, 0 VMs waiting for a retry, 0 VMs waiting for resources, 0 inaccessible vSAN VMs.

Now for those wondering if it actually works, of course, I tested it a few times and recorded a demo, which can be watched on youtube (easier to follow in full screen), or click play below. (Make sure to subscribe to the channel for the latest videos!)

I hope this helps!

Demo Time: How to delete the vCLS VMs

Duncan Epping · Oct 27, 2020 ·

As I have a bunch of questions about how you can delete the vSphere Cluster Service VMs (vCLS VMs) I figured I would create a quick demo. It is pretty straightforward, and it should only be used when people are doing some kind of full cluster maintenance. This demo shows you how to get the VMs deleted by leveraging a vCenter Server Level Advanced setting (config.vcls.clusters.domain-c<identifier>.enabled). I have also written a post that has a bunch of requirements, Q&A, and considerations for the vCLS VMs, if you are interested in that read it here.

Here’s the summary of how to delete the VMs: Go to your vCenter Server object, go to the configure tab, then go to “Advanced Settings”, add the key “config.vcls.clusters.domain-c<identifier>.enabled” and set it to false. The domain “c-number” for your cluster can be found in the URL when you click on the cluster. It should look something like the following, where the bold part is the important bit: https://vcsa-06.rainpole.com/ui/app/cluster;nav=h/urn:vmomi:ClusterComputeResource:domain-c22:4df0badc-1655-40de-9181-3422d6c36a3e/summary. If you want to recreate the VMs, simply set the value to “true” when the deletion task has completed.

Note, if you have a resource pool configuration, enabling “retreat mode” (disabling vCLS)) doesn’t impact resource pools in any shape or form, it just impacts DRS load balancing. Anyway, I hope you find the demo useful.

How to delete a vCenter Server advanced setting

Duncan Epping · Oct 23, 2020 ·

I had a customer asking this week how he could delete an advanced setting that he had incorrectly added to vCenter Server. Some of you may have found yourself in this situation as well where you realized you made a typo while creating an advanced setting for the vCenter Server configuration. Unfortunately, there’s no option to delete an advanced setting in the H5 interface, but you can manually remove them via the command-line. It is rather straight forward:

SSH to you vCenter Server
Go to the “shell”
go to director: /etc/vmware-vpx/
Edit the file “vpxd.cfg”
Simply find the entry and delete the entry (with “vi” you use “/” to search)
Restart VPXD by running the following command
```
service-control --restart vmware-vpxd
```

And that is it, now your advanced setting should be cleared, I will put in a request though for a “delete option” in the H5 interface.

vCenter Server 7.0 U1a released, compatible with SRM!

Duncan Epping · Oct 23, 2020 ·

I’ve had a bunch of customers asking the past couple of weeks when vSphere / vCenter 7.0 U1 would be supported with SRM. Yesterday (22nd of October) vCenter Server 7.0 U1a was released and this release introduced support/compatibility with SRM. For those wondering why it wasn’t supported, there was an issue with vCLS and SRM which had to be fixed first. So if you are one of those customers who runs the latest and greatest version of vSphere in combination with SRM you can now move to 7.0 U1a. If you haven’t seen the details yet of the release you can find it here:

Start those download engines and plan your upgrades!

VMware vSphere Cluster Services (vCLS) considerations, questions and answers.

Duncan Epping · Oct 9, 2020 ·

In the vSphere 7.0 Update 1 release VMware introduced a new service called the VMware vSphere Cluster Services (vCLS). vCLS provides a mechanism that allows VMware to decouple both vSphere DRS and vSphere HA from vCenter Server. Niels Hagoort wrote a lengthy article on this topic here. You may wonder why VMware introduces this, well as Niels states. by decoupling the clustering services (DRS and HA) from vCenter Server via vCLS we ensure the availability of critical services even when vCenter Server is impacted by a failure.

vCLS is a collection of multiple VMs which, over time, will be the backbone for all clustering services. In the 7.0 U1 release a subset of DRS functionality is enabled through vCLS. Over the past week(s) I have seen many questions coming in and I wanted to create a blog with answers to these questions. When new questions or considerations come up, I will add these to the list below.