Yellow Bricks

vSphere HA reporting not enough failover resources fault with stretched cluster failure scenario

Duncan Epping · Nov 20, 2020 ·

Over the last few months I had a couple of customers asking why vSphere HA was reporting a “not enough failover resources” fault in a stretched cluster failure scenario for virtual machines that are still up and running. Before I explain why, let’s paint a picture first to make it clear what is happening here. When you run a stretched cluster you can have a scenario where a particular VM (or multiple VMs) is not mirrored/replicated across locations. Note that with vSAN you can specify on a per-VM level how, and if, the VM should be available across locations. Typically you would see a VM with RAID-1 across locations, and then RAID-1/5/6 within a location. However, you can also have a scenario where a VM is not replicated across locations and, from a storage point of view, is only available within a single location. This is depicted in the diagram below.

Now in this scenario, when Site A is somehow partitioned from Site B, you will see alarms/errors which indicate that vSphere HA has tried to restart the VM that is located in Site B in Site A, and that it has failed as a result of not having enough failover resources.

This, of course, is not the result of insufficient failover resources; it is the result of the fact that Site A does not have access to the storage components required to restart the VM. Basically, what HA is reporting is that it doesn’t have any resources that are able to restart the impacted VM(s).

Now, if you have paid attention, you will probably wonder why HA tries to restart the VM in the first place, as the VM will still be running in this scenario. Why is it still running? Well, the VM isn’t stretched, and this is a partition and not an isolation, which means the isolation response doesn’t kill the VM. So why restart it? Well, as Site A is partitioned from Site B, Site A does not know what the status of Site B is. Site A only knows that Site B is not responding at all, and the only thing it can do is assume the full site has failed. As a result, it will attempt a failover for all VMs that were/are running in Site B and were protected by vSphere HA.
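To make the reasoning above a bit more tangible, here is a minimal Python sketch of the decision as seen from Site A. This is a toy model and not how vSphere HA is actually implemented: the `VM` class, the `ha_restart_attempts` function, and the outcome strings are all made up for illustration. The only thing it encodes is what was described above: Site A cannot tell a partition from a full site failure, so it attempts a restart for every protected VM in Site B, and the attempt fails when Site A has no access to that VM's storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    """Hypothetical, simplified view of an HA-protected VM."""
    name: str
    running_site: str         # site where the VM is currently running
    storage_sites: frozenset  # sites that hold a usable copy of the VM's storage


def ha_restart_attempts(vms, local_site, unreachable_site):
    """Return {vm_name: outcome} as seen from local_site during a partition.

    Toy model only: local_site assumes unreachable_site has failed completely,
    so it tries to fail over every VM that was running there.
    """
    outcomes = {}
    for vm in vms:
        if vm.running_site != unreachable_site:
            continue  # VM runs on our side of the partition, nothing to fail over
        if local_site in vm.storage_sites:
            # Storage is reachable locally, so a restart can be placed here.
            outcomes[vm.name] = "restart possible"
        else:
            # No local access to the storage components: this is the case
            # that surfaces as the "not enough failover resources" fault.
            outcomes[vm.name] = "not enough failover resources"
    return outcomes


if __name__ == "__main__":
    vms = [
        VM("stretched-vm", "B", frozenset({"A", "B"})),
        VM("site-b-only-vm", "B", frozenset({"B"})),
    ]
    for name, outcome in ha_restart_attempts(vms, "A", "B").items():
        print(f"{name}: {outcome}")
```

Note that in this model the VM that only lives in Site B is never touched; it keeps running there, and the fault is purely a report from Site A's point of view.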

Hope that explains why this happens. If you are not sure you understand the full scenario, I recorded a quick five-minute video walking through the scenario and explaining what happens. You can watch that below, or simply go to YouTube.

Virtual events I will be speaking at in the upcoming months

Duncan Epping · Nov 19, 2020 ·

The past year has been really strange for me. As you know, a lot of my time normally goes into speaking at events, or doing a series of customer meetings. Due to COVID that all rapidly changed in March of this year. Fortunately, it didn’t mean I was out of work. We had to figure some stuff out in the first few weeks, but everyone quickly shifted towards this virtual approach. We (Cormac, Frank, and I) have presented at some great events over the past 8 months, and there are some very interesting events coming up which I will be speaking at (some with Cormac and Frank, and some without them).

Even though many of these events are listed/advertised as regional events, they are all virtual, which means anyone can tune in. I wanted to share a few events I have planned before Christmas, which could be worth attending! In some cases my session is in Dutch, but I have called that out in the list.

1-December >> Define Tomorrow Keynote – https://live.computerworld.co.uk/talks/keynote-how-hci-is-revolutionizing-the-datacenter-today-and-tomorrow/
3-December >> VMware vSAN 7.0 U1 Webinar – https://www.vmware.com/learn/695402_EN_REG.html?src=so_5fa4046b6e733
8-December >> VMUG Usercon Nederland Breakout – https://vmugvirtualnlusercon.vfairs.com/ (Note, my session will be in Dutch)
10-December >> VMUG Usercon Portland Keynote – https://vmugvirtualportlandusercon.vfairs.com/

So if you are interested in hearing about things like vSAN, Cloud Native Storage, vSphere, and much more, make sure to sign up for one of these events!

How to log in to the vCLS VMs!?

Duncan Epping · Nov 17, 2020 ·

I was asked this question this week: how can you log in to the vCLS VMs? Now before I share the video, I want to mention that I do not encourage people to do this, but as it is documented and supported I do want to provide a simple “how to” for how this works. If you want to log in to a vCLS VM, maybe for troubleshooting or for auditing, you can do so by first SSH’ing into your vCenter Server. When logged in to the vCenter Server you run the following command, which returns the password. This password then allows you to log in to the console of the vCLS VM. Again, I do not want to encourage you to do this. Either way, below you find the command for retrieving the password, and a short demo of me retrieving the password and logging in.

/usr/lib/vmware-wcp/decrypt_clustervm_pw.py

Which vSAN policy changes will trigger a rebuild?

Duncan Epping · Nov 10, 2020 ·

A couple of years ago I did a VMworld session with Cormac in which we discussed the top things everyone should know about vSAN. One of the items discussed was which policy changes would trigger a rebuild. We tested the various situations and documented them. Two weeks ago a question around this was asked on a VMware internal Slack channel, so I shared our findings. Considering those tests were done a few years ago, I wanted to make sure that our documented findings were still valid, so I redid the tests.

Now before I provide a table with the findings, I just want to explain what I tested. I created a VM with a default policy, dumped a bunch of random data on the two VMDKs attached to the VM, and then changed the policy of the VM while the VM was running. After changing the policy I verified through the command line, and the UI, whether a rebuild of the objects was occurring or not. In some cases a policy change does not require a rebuild, while in other cases it does. This, of course, depends on what is being changed within the policy, and what that means for the objects associated with the policy. Hopefully, you will find the below table useful.

From	To	Resync
RAID-1	RAID-1 with higher FTT	Yes
RAID-1	RAID-1 with lower FTT	No
RAID-1	RAID-5/6	Yes
RAID-5/6	RAID-1	Yes
RAID-5	RAID-6	Yes
RAID-6	RAID-5	Yes
Stripe width 1	Stripe width increase by 1 (or more)	Yes
Stripe width x	Stripe width decrease by 1 (or more)	Yes
Space Reservation 0	Increase to larger than 0	No
Space Reservation >= 1	Increase by 1 (or more)	No
Space reservation > 0	Decrease to 0	No
Read Cache 0	Increase to larger than 0	No
Read Cache >= 1	Increase by 1 (or more)	No
Read Cache >= 1	Decrease by 1 (or more)	No
Checksum enabled	Checksum disabled	No
Checksum disabled	Checksum enabled	Yes
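If you want to use these findings in a script, for instance to warn someone before they change a policy on a large VM, the table can be captured as a simple lookup. The sketch below is just that, a sketch: the names `FINDINGS` and `triggers_resync` are made up, the strings are copied as-is from the table above, and the results only reflect what I observed in my tests, so treat them as a hint and not as a guarantee for every vSAN version.

```python
# Each row is (from, to, resync) exactly as listed in the table above.
FINDINGS = [
    ("RAID-1", "RAID-1 with higher FTT", True),
    ("RAID-1", "RAID-1 with lower FTT", False),
    ("RAID-1", "RAID-5/6", True),
    ("RAID-5/6", "RAID-1", True),
    ("RAID-5", "RAID-6", True),
    ("RAID-6", "RAID-5", True),
    ("Stripe width 1", "Stripe width increase by 1 (or more)", True),
    ("Stripe width x", "Stripe width decrease by 1 (or more)", True),
    ("Space Reservation 0", "Increase to larger than 0", False),
    ("Space Reservation >= 1", "Increase by 1 (or more)", False),
    ("Space reservation > 0", "Decrease to 0", False),
    ("Read Cache 0", "Increase to larger than 0", False),
    ("Read Cache >= 1", "Increase by 1 (or more)", False),
    ("Read Cache >= 1", "Decrease by 1 (or more)", False),
    ("Checksum enabled", "Checksum disabled", False),
    ("Checksum disabled", "Checksum enabled", True),
]

# Index the rows by (from, to) for direct lookups.
_LOOKUP = {(src, dst): resync for src, dst, resync in FINDINGS}


def triggers_resync(src, dst):
    """Return True if the tested change from src to dst caused a resync.

    Raises KeyError for combinations that were not part of the tests.
    """
    return _LOOKUP[(src, dst)]


if __name__ == "__main__":
    for src, dst, resync in FINDINGS:
        print(f"{src} -> {dst}: {'resync' if resync else 'no resync'}")
```

Raising a `KeyError` for untested combinations is deliberate here; guessing for a change that was never tested would defeat the purpose.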

Did you know vSphere 7.0 Update 1 also has a Skyline Health Check for vSphere Clustering Services?

Duncan Epping · Nov 6, 2020 ·

I did not know this, but yesterday the PM for vCLS reached out to me and informed me that we now have a Skyline Health Check for vSphere Clustering Services as well. The funny thing is that I actually requested this health check to be added after having a discussion on the topic of vCLS with the PM. It is very impressive how fast the engineering team managed to include an additional health check for a brand new feature, this close to the release. I created a short demo, which shows you where you can find the vSphere Skyline Health option in the vSphere Client, and of course, it shows the vCLS Health Check being triggered. If you see the health check triggered, you can, as mentioned, enable retreat mode and disable it again. This will provision a fresh set of vCLS VMs. How you do this you can find in this “considerations blog”, or simply watch the demo I shared here.