
Yellow Bricks

by Duncan Epping



2 node direct connect vSAN and error “vSphere HA agent on this host could not reach isolation address”

Duncan Epping · Oct 7, 2019 ·

I’ve had this question over a dozen times now, so I figured I would add a quick pointer to my blog. What causes the error “vSphere HA agent on this host could not reach isolation address” to pop up on a 2-node direct connect vSAN cluster? The answer is simple: when vSAN is enabled, HA uses the vSAN network for its communication. With a 2-node direct connect configuration, the vSAN network is not connected to a switch, so there are no reachable IP addresses other than those of the vSAN VMkernel interfaces of the two hosts.

When HA tests whether the isolation address (the default gateway of the management interface) is reachable, the ping will therefore fail. You can solve this simply by disabling the isolation response, as described in this post here.
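If you prefer to script that change, below is a minimal pyVmomi sketch, not the exact procedure from the linked post: the vCenter address, credentials, and cluster name are placeholders, and it simply sets the cluster-wide isolation response to Disabled (leave VMs powered on).

```python
# Minimal pyVmomi sketch (assumptions: placeholder vCenter, credentials and cluster
# name): set the cluster-wide Host Isolation Response to "Disabled" so HA no longer
# acts on the unreachable isolation address.
import atexit
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
atexit.register(Disconnect, si)
content = si.RetrieveContent()

# Find the 2-node vSAN cluster by name.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "2node-vsan-cluster")
view.Destroy()

# "none" corresponds to "Disabled" (leave VMs powered on) in the vSphere Client.
das = vim.cluster.DasConfigInfo(
    defaultVmSettings=vim.cluster.DasVmSettings(isolationResponse="none"))
spec = vim.cluster.ConfigSpecEx(dasConfig=das)
cluster.ReconfigureComputeResource_Task(spec, modify=True)
```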

I discovered .PNG files on my datastore, can I delete them?

Duncan Epping · Sep 26, 2019 ·

I noticed this question on Reddit about .PNG files which were located in VM folders on a datastore. The user wanted to remove the datastore from the cluster but didn’t know where these files were coming from, and whether the VMs required them in some shape or form. I can be brief about it: you can safely delete these .PNG files. They are typically created by VM Monitoring (part of vSphere HA) when a VM is rebooted by VM Monitoring. A screenshot of the VM is taken, for instance to capture the blue screen of death, so that you can potentially troubleshoot the problem after the reboot has occurred. This feature has been in vSphere for a while, but I guess most people have never really noticed it. I wrote an article about it when vSphere 5.0 was released, which includes a screenshot where the .PNG file is highlighted. For whatever reason I had trouble finding my own article on this topic, so I figured I would write a new one. Of course, after finishing this post I found the original article. Anyway, I hope it helps others who find these .PNG files in their VM folders.

Oh, and I should have added, it can also be caused by vCloud Director or be triggered through the API, as described by William in this post from 2013.
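If you want to take stock of these files before cleaning up, the hypothetical pyVmomi sketch below recursively searches a datastore for *.png files; the connection details and datastore name are made up for illustration.

```python
# Hypothetical pyVmomi sketch: list all *.png files on a datastore so you can review
# them before deleting. Connection details and the datastore name are placeholders.
import atexit
import ssl
import time

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
atexit.register(Disconnect, si)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
ds = next(d for d in view.view if d.name == "vsanDatastore")
view.Destroy()

# Recursively search every folder on the datastore for files matching *.png.
spec = vim.host.DatastoreBrowser.SearchSpec(matchPattern=["*.png"])
task = ds.browser.SearchDatastoreSubFolders_Task("[%s]" % ds.name, spec)
while task.info.state not in ("success", "error"):
    time.sleep(1)  # use a proper task waiter in real scripts

for result in task.info.result or []:
    for f in result.file:
        print(result.folderPath + f.path)
```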

I want vSphere HA to use a specific Management VMkernel interface

Duncan Epping · Apr 30, 2019 ·

This comes up occasionally: customers have multiple Management VMkernel interfaces and see vSphere HA traffic on both interfaces, or on the incorrect one. In some cases, customers use the IP address of the first management VMkernel interface to connect the host to vCenter and then set an isolation address on the network of the second management VMkernel interface, expecting HA to use that network. That is unfortunately not how it works. I wrote about this 6 years ago, but it never hurts to reiterate, as it came up twice over the past couple of weeks. The “Management” tickbox is all about HA traffic. Whether “Management” is enabled or not makes no difference to vCenter or SSH, for instance. If you create a VMkernel interface without the Management tickbox enabled, you can still connect to it over SSH and you can still use its IP to add the host to vCenter Server. Yes, it is confusing, but that is how it works.

If you do not want vSphere HA to use an interface, make sure “Management” is unticked on it. Note that this is not the case for vSAN: when vSAN is enabled, HA automatically uses the vSAN network. This only pertains to traditional, non-vSAN based, infrastructures.
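For completeness, here is a small pyVmomi sketch (the host name and vmk device are placeholders) that shows which VMkernel interfaces currently have the Management service enabled and how to untick it on one of them through the API; the same thing can of course be done in the vSphere Client.

```python
# Small pyVmomi sketch: show which VMkernel interfaces have the "Management" service
# enabled on a host, then untick it on vmk1. Host name and vmk device are placeholders.
import atexit
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
atexit.register(Disconnect, si)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == "esxi01.lab.local")
view.Destroy()

nic_mgr = host.configManager.virtualNicManager

# QueryNetConfig returns the candidate vmk interfaces plus the keys of the selected ones.
cfg = nic_mgr.QueryNetConfig("management")
selected = set(cfg.selectedVnic or [])
enabled = [nic.device for nic in (cfg.candidateVnic or []) if nic.key in selected]
print("Management enabled on:", enabled)

# Untick "Management" on vmk1 so vSphere HA no longer uses this interface.
nic_mgr.DeselectVnicForNicType("management", "vmk1")
```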

vSphere HA virtual machine failed to failover error on VMs in a partitioned cluster

Duncan Epping · Apr 12, 2019 ·

I received two questions this week around partition scenarios where, after the partition has been lifted, some VMs display the error message “vSphere HA virtual machine failed to failover”. The question that then arises is: why did HA try to restart these VMs, and why did it fail? Well, first of all, this is an error that in most cases you can safely ignore. There’s a KB article on the topic, to be found here, which gives a bit of detail, but let me also explain in a bit more depth.

In a partitioning scenario, each partition will have its own primary node. If there is no form of communication (datastore or network) possible between the partitions, the HA primary in each partition will list all the VMs that are currently not running within that partition and will try to restart them. A partition is extremely uncommon in normal environments but may happen in a stretched cluster. When a partition happens in a stretched cluster, a datastore only belongs to one location, and the VMs which appear to be missing are typically running in the other location, which still has access to that particular datastore. Although the primary has listed these VMs as “missing and in need of a restart”, it will not be able to restart them. Why? Either it does not have access to the datastore at all, or it does have access but the files are locked because the VMs are still running. As a result, this is unfortunately reported as a failed failover, even though the VM was still running and there was no need for a failover. So if you hit this during certain failure scenarios, and the VMs kept running as you expected, you can safely ignore the error.
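If you want to double-check which VMs reported this error after a partition, the hypothetical sketch below pulls recent events for a cluster with pyVmomi and filters on the message text; the cluster name, credentials, and the 24-hour window are arbitrary placeholders.

```python
# Hypothetical pyVmomi sketch: list the VMs in a cluster that logged a
# "failed to failover" message over the last 24 hours. Cluster name, credentials
# and the time window are placeholders.
import atexit
import ssl
from datetime import datetime, timedelta, timezone

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
atexit.register(Disconnect, si)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "stretched-cluster")
view.Destroy()

# Pull recent events for the cluster (and everything below it) and match the text.
filter_spec = vim.event.EventFilterSpec(
    entity=vim.event.EventFilterSpec.ByEntity(
        entity=cluster, recursion=vim.event.EventFilterSpec.RecursionOption.all),
    time=vim.event.EventFilterSpec.ByTime(
        beginTime=datetime.now(timezone.utc) - timedelta(hours=24)))

for event in content.eventManager.QueryEvents(filter_spec):
    if "failed to failover" in (event.fullFormattedMessage or ""):
        vm_name = event.vm.name if event.vm else "unknown VM"
        print(event.createdTime, vm_name, event.fullFormattedMessage)
```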

HA Admission Control Policy: Dedicated Failover Hosts

Duncan Epping · Feb 5, 2019 ·

This week I received some questions on the topic of HA Admission Control. A customer had a cluster configured with the dedicated failover hosts admission control policy and had no clue why. The cluster had been around for a while and had been configured by a different admin, who had since left the company. As they upgraded the environment, they noticed it was configured with an admission control policy they never used anywhere else, but why? Well, of course, the design was documented, but no one documented the design decision, so that didn’t really help. So they came to me and asked what it exactly did and why you would use it.

Let’s start with that last question: why would you use it? Well, normally you would not, and you should not. Forget about it, unless you have a specific use case, which I will discuss later. So what does it do?

When you designate hosts as failover hosts, they will not participate in DRS and you will not be able to run VMs on these hosts! Not even in a two-host cluster when placing one of the two hosts in maintenance mode. These hosts are literally reserved for failover situations. HA will attempt to use these hosts first to fail over the VMs. If, for whatever reason, this is unsuccessful, it will attempt a failover on any of the other hosts in the cluster. For example, when two hosts fail, including the host designated as failover host, HA will still try to restart the impacted VMs on the host that is left. Although that host was not a designated failover host, HA will use it to limit downtime.

In the cluster settings you simply select this admission control policy and then add the hosts. As mentioned earlier, the hosts added to this list will not be considered by DRS at all. This means that their resources go to waste unless there is a failure. So why would you use it?

  • If you need to know where a VM runs at all times, this admission control policy dictates where the restart happens.
  • There is no resource fragmentation, as a full host (or multiple hosts) worth of resources will be available to restart VMs on, instead of the equivalent of one host worth of resources fragmented across multiple hosts.

In some cases, the above may be very useful. Knowing where a VM runs at all times could, for instance, be required for regulatory compliance, or be needed for licensing reasons when you run Oracle.
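For those who do have such a use case and want to automate the configuration, below is a minimal pyVmomi sketch that enables the dedicated failover hosts policy on a cluster; the cluster and host names are placeholders, not from any real environment.

```python
# Minimal pyVmomi sketch: configure the "Dedicated failover hosts" admission control
# policy on a cluster. The cluster name and the failover host name are placeholders.
import atexit
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
atexit.register(Disconnect, si)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "prod-cluster")
view.Destroy()

# Reserve one host in the cluster purely for failover.
failover_host = next(h for h in cluster.host if h.name == "esxi04.lab.local")
policy = vim.cluster.FailoverHostAdmissionControlPolicy(failoverHosts=[failover_host])

das = vim.cluster.DasConfigInfo(admissionControlEnabled=True,
                                admissionControlPolicy=policy)
spec = vim.cluster.ConfigSpecEx(dasConfig=das)
cluster.ReconfigureComputeResource_Task(spec, modify=True)
```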


