7.0

What if the disk controller driver included in my vendor’s ESXi image is not on the vSAN HCL?

Duncan Epping · Jan 15, 2021 ·

Sometimes unfortunately there are situations where a vendor’s ESXi image includes a disk controller driver that is not on the vSAN HCL/VCG (VMware Compatibility Guide). Typically it is a new version of the driver which is supported for vSphere, but not yet for vSAN. In that situation, what should you do? So far there are two approaches I have seen customers take:

Keep running with the included driver and wait for the driver to be supported and listed on the vSAN HCL/VCG
Downgrade the driver to the version which is listed on the vSAN HCL/VCG

Personally, I feel that option 2 is the correct way to go. The reason is simple, vSphere and vSAN have different certification requirements for disk controllers and the vSAN certification criteria are just more stringent than vSphere’s. Hence, sometimes you see vSAN skipping certain versions of drivers, this usually means a version did not pass the tests. Now, of course, you could keep running the driver and wait for it to appear on the vSAN HCL/VC. If however, you hit a problem, VMware Support will always ask you first to bring the environment to a fully supported state. Personally, I would not want the extra stress while troubleshooting. But that is my experience and preference. Just to be clear, from a VMware stance, there’s only one option, and that is option two, downgrade to the supported version!

VMs which are not stretched in a stretched cluster, which policy to use?

Duncan Epping · Dec 14, 2020 ·

I’ve seen this question popping up regularly. Which policy setting (“site disaster tolerance” and “failures to tolerate”) should I use when I do not want to stretch my VMs? Well, that is actually pretty straight forward, in my opinion, you really only have two options you should ever use:

None – Keep data on preferred (stretched cluster)
None – Keep data on non-preferred (stretched cluster)

Yes, there is another option. This option is called “None – Stretched Cluster” and then there’s also “None – Standard Cluster”. Why should you not use these? Well, let’s start with “None – Stretched Cluster”. In the case of “None – Stretched Cluster”, vSAN will per object decide where to place it. As you hopefully know, a VM consists of multiple objects. As you can imagine, this is not optimal from a performance point of view, as you could end up having a VMDK being placed in Site A and a VMDK being placed in Site B. Which means it would read and write from both locations from a storage point of view, while the VM would be sitting in a single location from a compute point of view. It is also not very optimal from an availability stance, as it would mean that when the intersite link is unavailable, some objects of the VM would also become inaccessible. Not a great situation. What would it look like? Well, potentially something like the below diagram!

Then there’s “None – Standard Cluster”, what happens in this case? When you use “None – Standard Cluster” with “RAID-1”, what is going to happen is that the VM is configured with FTT=1 and RAID-1, but in a stretched cluster “FTT” does not exist, and FTT automatically will become PFTT. This means that the VM is going to be mirrored across locations, and you will have SFTT=0, which means no resiliency locally. It is the same as “Dual Site Mirroring”+”No Data Redundancy”!

In summary, if you ask me, “none – standard cluster” and “none – stretched cluster” should not be used in a stretched cluster.

vSphere HA reporting not enough failover resources fault with stretched cluster failure scenario

Duncan Epping · Nov 20, 2020 ·

Last few months I had a couple of customers asking why vSphere HA was reporting “not enough failover resources” fault in a stretched cluster failure scenario for virtual machines that are still up and running. Now before I explain why, let’s paint a picture first to make it clear what is happening here. When you run a stretched cluster you can have a scenario where a particular VM (or multiple VMs) are not mirrored/replicated across locations. Now note, with vSAN you can specify for any given VM on a VM level how and if the VM should be available across locations. Typically you would see a VM with RAID-1 across locations, and then RAID-1/5/6 within a location. However, you can also have a scenario where a VM is not replicated across locations, but from a storage point of view only available within a location, this is depicted in the diagram below.

Now in this scenario, when Site A is somehow partitioned from Site B, you will see alarms/errors which indicate that vSphere HA has tried to restart the VM that is located in Site B in Site A and that is has failed as a result of not having enough failover resources.

This, of course, is not the result of not having sufficient failover resources, but it is the result of the fact that Site A does not have access to the required storage components to restart the VM. Basically what HA is reporting is that it doesn’t have the resources which have the ability to restart the impacted VM(s).

Now, if you have paid attention, you will probably wonder why HA tries to restart the VM in the first place, as the VM will still be running in this scenario. Why is it still running? Well the VM isn’t stretched, and this is a partition and not an isolation, which means the isolation response doesn’t kill the VM. So why restart it? Well, as Site A is partitioned from Site B, Site A does not know what the status is of Site B. Site A only knows that Site B is not responding at all, and the only thing it can do is assume the full site has failed. As a result it will attempt a failover for all VMs that were/are running in Site B and were protected by vSphere HA.

Hope that explains why this happens. If you are not sure you understand the full scenario, I recorded a quick five minute video actually walking through the scenario and explaining what happens. You can watch that below, or simply go to youtube.

How to login to the vCLS VMs!?

Duncan Epping · Nov 17, 2020 ·

I was asked this question this week, how you can login to the vCLS VMs. Now before I share the video, I want to mention that I do not encourage people doing this, but as it is documented and supported I do want to provide a simple “how to” for how this works. If you want to login to the vCLS VM, maybe for troubleshooting if needed or for auditing, you can do so by SSH’ing first into your vCenter Server. When logged in to the vCenter Server you run the following command, which then returns the password, this will then allow you to login to the console of the vCLS VM. Again, I do not want to encourage you to do this. Either way, below you find the command for retrieving the password, and a short demo of me retrieving the password and logging in.

/usr/lib/vmware-wcp/decrypt_clustervm_pw.py

Which vSAN policy changes will trigger a rebuild?

Duncan Epping · Nov 10, 2020 ·

A couple of years ago I did a VMworld session with Cormac and we discussed the top things everyone should know about vSAN. One of the items discussed was which policy changes would trigger a rebuild. We tested the various situations and documented them. Two weeks ago a question around this was asked on a VMware internal Slack channel so I shared our findings. Considering it is already a few years ago, I wanted to make sure that our documented findings were still valid, so I redid the tests.

Now before I provide a table with the findings, I just want to explain what I tested, what I did is I created a VM with a default policy. I dumped a bunch of random data on the two VMDKs attached to the VM, and I then changed the policy of the VM while the VM is running. After changing the policy I verified through the command-line, and UI, if a rebuild of the objects was occurring or not. In some cases a policy change does not require a rebuild, while in other cases it does. This, of course, depends on what is being changed within the policy, and what that means for the objects associated with the policy. Hopefully, you will find the below table useful.

From	To	Resync
RAID-1	RAID-1 with higher FTT	Yes
RAID-1	RAID-1 with lower FTT	No
RAID-1	RAID-5/6	Yes
RAID-5/6	RAID-1	Yes
RAID-5	RAID-6	Yes
RAID-6	RAID-5	Yes
Stripe width 1	Stripe width increase by 1 (or more)	Yes
Stripe width x	Stripe width decrease by 1 (or more)	Yes
Space Reservation 0	Increase to larger than 0	No
Space Reservation >= 1	Increase by 1 (or more)	No
Space reservation > 0	Decrease to 0	No
Read Cache 0	Increase to larger than 0	No
Read Cache >= 1	Increase by 1 (or more)	No
Read Cache >= 1	Decrease by 1 (or more)	No
Checksum enabled	Checksum disabled	No
Checksum disabled	Checksum enabled	Yes