6.5

I want vSphere HA to use a specific Management VMkernel interface

Duncan Epping · Apr 30, 2019 ·

This comes up occasionally: customers have multiple Management VMkernel interfaces and see vSphere HA traffic on both interfaces, or on the wrong one. In some cases, customers use the IP address of the first management VMkernel interface to connect the host to vCenter and then set an isolation address on the network of the second management VMkernel interface, expecting HA to use that network. Unfortunately, this is not how it works. I wrote about this 6 years ago, but it never hurts to reiterate, as it came up twice over the past couple of weeks. The “Management” tickbox is all about HA traffic. Whether “Management” is enabled or not makes no difference for vCenter or SSH, for instance. If you create a VMkernel interface without the Management tickbox enabled, you can still connect to it over SSH and you can still use its IP to add the host to vCenter Server. Yes, it is confusing, but that is how it works.

If you do not want an interface to be used by vSphere HA, make sure to untick “Management”. Note that this is not the case for vSAN: in a vSAN cluster, HA automatically uses the vSAN network. The tickbox behavior only pertains to traditional, non-vSAN based infrastructures.
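To make the rule concrete, here is a minimal Python sketch that models a host's VMkernel interfaces as plain data and selects the ones HA would use. It is a toy model, not a vSphere API: the `VMkernelNic` class, the `ha_interfaces` function, and the sample addresses are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VMkernelNic:
    device: str          # e.g. "vmk0" (hypothetical sample name)
    ip: str              # sample address from a documentation range
    management: bool     # state of the "Management" tickbox
    vsan: bool = False   # whether vSAN traffic is enabled on this interface


def ha_interfaces(nics, vsan_cluster=False):
    """Return the device names HA would use under the rule described above."""
    if vsan_cluster:
        # With vSAN, HA uses the vSAN network regardless of the tickbox.
        return [n.device for n in nics if n.vsan]
    # Without vSAN, every interface with "Management" ticked carries HA traffic.
    return [n.device for n in nics if n.management]


nics = [
    VMkernelNic("vmk0", "192.0.2.10", management=True),
    VMkernelNic("vmk1", "198.51.100.10", management=True),
]
print(ha_interfaces(nics))   # both interfaces are candidates for HA traffic

nics[1].management = False   # untick "Management" on the second interface
print(ha_interfaces(nics))   # only vmk0 remains
```

Note that the IP used to add the host to vCenter plays no role in the selection, which mirrors the point made above.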

Mixing versions of ESXi in the same vSphere / vSAN cluster?

Duncan Epping · Apr 15, 2019 ·

I have seen this question a couple of times in the past months, and to be honest I was a bit surprised people asked about it. Various customers were wondering whether it is supported to mix versions of ESXi in the same vSphere or vSAN cluster. I can be short about whether this is supported or not: yes, it is, but only for short periods of time (72hrs max). Would I recommend it? No, I would not!

Why not? Well, mainly for operational reasons: it just makes life more complex. Just think about a troubleshooting scenario. You now need to remember which version you are running on which host and understand the “known issues” for each version. Also, for vSAN things are even more complex, as you could have “components” running on a different version of ESXi. On top of that, it could even be the case that a certain command or esxcli namespace is not available on a particular version of ESXi.

Another concern is upgrades and updates: you need to take the current version into account when updating, or more importantly when upgrading! Also, remember that the firmware/driver combination may differ for a particular version of vSphere/vSAN, which also makes life more complex and definitely increases the chances of making mistakes!

Is this documented anywhere? Yes, check out the following KB:

https://kb.vmware.com/s/article/2146381
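If you want to spot a mixed cluster quickly, a small script over your own inventory export can help. The sketch below is a minimal Python example that assumes you already have a mapping of host name to ESXi version string; the host names, version strings, and function names are hypothetical, and nothing here calls a VMware API.

```python
from collections import defaultdict


def hosts_by_version(inventory):
    """Group host names by the ESXi version string they report."""
    groups = defaultdict(list)
    for host, version in inventory.items():
        groups[version].append(host)
    return dict(groups)


def is_mixed(inventory):
    """True when the hosts in the cluster report more than one version."""
    return len(set(inventory.values())) > 1


# Hypothetical inventory, e.g. taken from an export of the cluster's host list.
cluster = {
    "esxi-01": "6.5 U2",
    "esxi-02": "6.5 U2",
    "esxi-03": "6.7 U1",
}

if is_mixed(cluster):
    for version, hosts in sorted(hosts_by_version(cluster).items()):
        print(f"{version}: {', '.join(hosts)}")
```

A report like this only tells you that the cluster is mixed; keeping track of how long it has been in that state is left to your change process.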

vSphere HA virtual machine failed to failover error on VMs in a partitioned cluster

Duncan Epping · Apr 12, 2019 ·

I received two questions this week around partition scenarios where, after the failure has been resolved, some VMs display the error message “vSphere HA virtual machine failed to failover”. The question that then arises is: why did HA try to restart it, and why did it fail? Well, first of all, this is an error that in most cases you can safely ignore. There’s a KB on the topic which gives a bit of detail, but let me also explain it in a bit more depth.

In a partitioning scenario, each partition will have its own primary node. If there is no form of communication (datastore/network) possible between the partitions, the HA primary will list all the VMs that are currently not running within its partition. It will also try to restart those VMs. A partition is extremely uncommon in normal environments but may happen in a stretched cluster. In a stretched cluster, when a partition happens, a datastore only belongs to one location. The VMs which appear to be missing are typically running in the other location, as the other location will typically have access to that particular datastore. Although the primary has listed these VMs as “missing and need to restart”, it will not be able to restart them. Why? It either doesn’t have access to the datastore itself, or, when it does have access to the datastore, the files are locked as the VMs are still running. As a result, this will, unfortunately, be reported as a failed failover, even though the VM was still running and there was no need for a failover. So if you hit this during certain failure scenarios, and the VMs were running as you expected, you can safely ignore this error.
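The reasoning above can be condensed into a small decision function. This is a toy Python model of what the primary in one partition observes for a VM it considers missing; the function name, parameters, and returned strings are hypothetical and do not correspond to actual HA log messages.

```python
def restart_outcome(datastore_accessible, vm_running_elsewhere):
    """Toy model: outcome of a restart attempt for a VM the primary lists as missing."""
    if not datastore_accessible:
        # The datastore belongs to the other location during the partition.
        return "failed failover: no access to the datastore"
    if vm_running_elsewhere:
        # The VM never went down, so its files are still locked.
        return "failed failover: files locked by the running VM"
    # Only a VM that is really down, on a reachable datastore, gets restarted.
    return "restarted"


# A VM running in the other site of a stretched cluster, seen from this partition:
print(restart_outcome(datastore_accessible=False, vm_running_elsewhere=True))
```

In both failing branches the VM itself is healthy, which is why the reported error can be ignored in this scenario.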

Device X is not listed on the vSAN Compatibility Guide, can I still use it?

Duncan Epping · Jan 8, 2019 ·

I get this question almost daily, and I am pretty sure I have said this various times, but just in case it wasn’t clear, I figured I would share the answer to the question of whether a device should be used in a vSAN cluster when it is not listed on the vSAN Compatibility Guide. If you have not looked at the components variant of the VCG for vSAN, please take a look here: http://vmwa.re/vsanhclc. Of course, we also have an easier route, which is the ReadyNode VCG. But some may want to tweak based on performance, cost, etc. I get that, and so does VMware, which is why we have listed all supported and tested components. Can you use a device which is not listed? Sure you can. Will VMware support the environment? Maybe they will, maybe they won’t! Should you use a device which is not listed if the previous answer is maybe? No!

So let’s be clear and let’s answer the two most asked questions:

Device X is not listed on the vSAN Compatibility Guide, can I still use it?
- No, you should not. If any problem arises, chances are you will not get the support you need because the configuration is unsupported. Sure, VMware Support will usually do their best to help, but if it appears the unsupported device is causing the problem, then it becomes difficult. Please do not use devices which are not listed.
Device X is listed with Firmware version Y, but the OEM says I should use Z, what to do?
- Ask the OEM why the version is not listed on VMware’s VCG website. Vendors are responsible for certifying components and the software (drivers / firmware) associated with them. If a version is not listed, then it has either not been submitted yet, not been tested, or not passed the test. Please only use tested and listed versions; the only exception is when both VMware GSS and the OEM point you to a new version.
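The two answers above boil down to a simple lookup, sketched here in Python. The `vcg` dictionary is a hypothetical, hand-maintained excerpt of listed devices and firmware versions (the real VCG is a website, and this sketch does not query it); the device names and versions are made up for illustration.

```python
def check_device(model, firmware, vcg):
    """Classify a device/firmware pair against a local excerpt of the VCG."""
    listed_firmware = vcg.get(model)
    if listed_firmware is None:
        return "device not listed: do not use it"
    if firmware not in listed_firmware:
        return "firmware not listed: ask the OEM why"
    return "listed"


# Hypothetical excerpt: model -> set of listed firmware versions.
vcg = {
    "ExampleSSD-800": {"Y"},
}

print(check_device("ExampleSSD-800", "Y", vcg))
print(check_device("ExampleSSD-800", "Z", vcg))
print(check_device("OtherSSD-400", "1.0", vcg))
```

The exception mentioned above, where both VMware GSS and the OEM point you to a newer version, is a human decision and is deliberately not modeled.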

Hope that helps!

This host has no isolation addresses defined as required by vSphere HA

Duncan Epping · Dec 19, 2018 ·

I had a comment on one of my 2-node vSAN cluster articles that there was an issue with HA when disabling the Isolation Response. The isolation response is not required for 2-node as it is impossible to properly detect an isolation event and vSAN has a mechanism to do exactly what the Isolation Response does: kill the VMs when they are useless. The error witnessed was “This host has no isolation addresses defined as required by vSphere HA” as shown also in the screenshot below.

So now what? Well, first of all, as mentioned in the comments section as well, vSphere HA always checks whether an isolation address is specified. That could be the default gateway of the management network, or it could be an isolation address that you specified through the advanced setting das.isolationaddress. Using das.isolationaddress often goes hand in hand with setting das.usedefaultisolationaddress to false. That last setting, das.usedefaultisolationaddress, is what causes the error above to be triggered. What you should do in a 2-node configuration is the following:

- Do not configure the isolation response; the explanation can be found in the above-mentioned article.
- Do not configure das.usedefaultisolationaddress; if it is configured, set it to true.
- Make sure you have a gateway on the management vmkernel. If that is not the case, you could set das.isolationaddress and simply set it to 127.0.0.1 to prevent the error from popping up.
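The check behind the error can be approximated with a few lines of Python. This is only a sketch of the logic as described in this post, not HA's actual implementation: the function name is hypothetical, the settings are passed in as a plain dictionary, and it assumes that das.usedefaultisolationaddress behaves as true when it is not set.

```python
def has_isolation_address(settings, default_gateway):
    """Return True if at least one isolation address would be available."""
    # Collect explicitly configured addresses (das.isolationaddress and numbered variants).
    addresses = [
        value
        for key, value in settings.items()
        if key.startswith("das.isolationaddress") and value
    ]
    # Assumption: the default gateway is used unless the setting is explicitly false.
    use_default = str(settings.get("das.usedefaultisolationaddress", "true")).lower() == "true"
    if use_default and default_gateway:
        addresses.append(default_gateway)
    return bool(addresses)


# Default gateway disabled and no explicit address: the error would be expected.
print(has_isolation_address({"das.usedefaultisolationaddress": "false"}, "192.0.2.1"))

# No gateway on the management vmkernel, loopback set as a workaround.
print(has_isolation_address({"das.isolationaddress": "127.0.0.1"}, None))
```

The first call models the misconfiguration that triggers the message, the second the workaround from the last bullet.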

Hope this helps those hitting this error message.