
Yellow Bricks

by Duncan Epping


vSphere HA

2 node direct connect vSAN and error “vSphere HA agent on this host could not reach isolation address”

Duncan Epping · Oct 7, 2019 ·

I’ve had this question over a dozen times now, so I figured I would add a quick pointer to my blog. What is causing the error “vSphere HA agent on this host could not reach isolation address” to pop up on a 2-node direct connect vSAN cluster? The answer is simple: when vSAN is enabled, HA uses the vSAN network for its communication. In a 2-node direct connect configuration the vSAN network is not connected to a switch, so there are no reachable IP addresses other than those of the vSAN VMkernel interfaces.

When HA then tests whether the isolation address (by default the gateway of the management interface) is reachable, the ping fails as a result. You can solve this simply by disabling the isolation response, as described in this post here.
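Conceptually, the check HA performs boils down to a ping probe against the isolation address. A minimal sketch of that probe; note that plain `ping` stands in here for illustration, on an actual ESXi host you would use `vmkping -I vmkX <address>` to force the probe out of a specific VMkernel interface (interface name and address would depend on your environment):

```shell
# Probe whether a candidate isolation address answers a ping.
# On ESXi: vmkping -I <vsan-vmk> <address> to test over the vSAN network.
reachable() {
  if ping -c 1 -W 1 "$1" >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

reachable 127.0.0.1
```

On a 2-node direct connect cluster, a probe like this over the vSAN network can only ever reach the other node's vSAN VMkernel interface, which is why the default check fails.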

I discovered .PNG files on my datastore, can I delete them?

Duncan Epping · Sep 26, 2019 ·

I noticed a question on Reddit about .PNG files located in VM folders on a datastore. The user wanted to remove the datastore from the cluster, but didn’t know where these files came from and whether the VMs required them in some shape or form. I can be brief about it: you can safely delete these .PNG files. They are typically created by VM Monitoring (part of vSphere HA) when it reboots a VM. Before the reboot a screenshot of the VM is taken, for instance to capture a blue screen of death, so that you can still troubleshoot the problem after the reboot has occurred. This feature has been in vSphere for a while, but I guess most people have never really noticed it. I wrote an article about it when vSphere 5.0 was released, and below is the screenshot from that article where the .PNG file is highlighted. For whatever reason I had trouble finding my own article on this topic, so I figured I would write a new one. Of course, after finishing this post I found the original article. Anyway, I hope it helps others who find these .PNG files in their VM folders.

Oh, and I should have added: these files can also be created by vCloud Director, or be triggered through the API, as described by William in this post from 2013.
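If you want to see what is there before deleting anything, a simple find across the VM folders does the trick. A small sketch; the helper name and the example datastore path are assumptions, adjust the path to your environment (on ESXi, datastores are mounted under /vmfs/volumes/<name>):

```shell
# List .png screenshots left behind in VM folders under a given
# datastore mount point.
list_ha_screenshots() {
  find "$1" -type f -name '*.png'
}

# Example invocation on an ESXi host (path is an assumption):
# list_ha_screenshots /vmfs/volumes/datastore1
```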

This host has no isolation addresses defined as required by vSphere HA

Duncan Epping · Dec 19, 2018 ·

I had a comment on one of my 2-node vSAN cluster articles about an issue with HA when disabling the Isolation Response. The isolation response is not required for 2-node configurations, as it is impossible to properly detect an isolation event, and vSAN has a mechanism that does exactly what the Isolation Response does: kill the VMs when they have become useless. The error witnessed was “This host has no isolation addresses defined as required by vSphere HA”, as shown in the screenshot below.

This host has no isolation addresses defined as required by vSphere HA

So now what? First of all, as mentioned in the comments section as well, vSphere HA always checks whether an isolation address is specified. That can be the default gateway of the management network, or an address you specified through the advanced setting das.isolationaddress. Using das.isolationaddress often goes hand in hand with setting das.usedefaultisolationaddress to false, and that last setting is what causes the error above to be triggered. What you should do in a 2-node configuration is the following:

  1. Do not configure the isolation response; the explanation can be found in the above-mentioned article
  2. Do not configure das.usedefaultisolationaddress; if it is configured, set it to true
  3. Make sure you have a gateway on the management VMkernel interface; if that is not the case, you can set das.isolationaddress to 127.0.0.1 to prevent the error from popping up
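Summarizing the steps above as the relevant HA advanced options (values shown are what this post recommends for a 2-node configuration; treat the 127.0.0.1 entry as a workaround only for the case where the management VMkernel has no gateway):

```
das.usedefaultisolationaddress = true        # or simply leave the option unset
das.isolationaddress           = 127.0.0.1   # only when no gateway exists on the mgmt vmkernel
```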

Hope this helps those hitting this error message.

What happens if all hosts in a vSphere HA cluster are isolated?

Duncan Epping · Aug 15, 2018 ·

I received this question through Twitter today from Markus, who was going through the vSphere 6.7 Clustering Deep Dive. It is fairly straightforward: what happens when all hosts in a cluster are isolated, will the isolation response be triggered?

@DuncanYB or @FrankDenneman I now read the HA section of #ClusteringDeepDive 2 times but I miss something. It's mentioned sometimes but never declared specifically what happens in a "total isolation" situation where all hosts got isolated. 1/2

— Markus Fischbacher (@RealRockaut) August 15, 2018

I wrote about this a long, long time ago, but it doesn’t hurt to reiterate it. Before triggering the isolation response, HA actually verifies the state of the rest of the cluster: does any host still own the datastore on which the VMs impacted by this isolation run? If the answer is no, because ownership of the datastore was dropped during the election, then HA will not trigger the isolation response. I will try to update the book to include this when I have time; hopefully that means a new version of the ebook will be pushed out to all owners automatically.

Trigger APD on iSCSI LUN on vSphere

Duncan Epping · Jun 21, 2018 ·

I was testing various failure scenarios in my lab today for the vSphere Clustering Deepdive session I have scheduled for VMworld. I needed some screenshots and log files of a datastore hitting an APD scenario. For those who don’t know, APD stands for All Paths Down; in other words, the storage is inaccessible and ESXi doesn’t know what has happened or why. vSphere HA has the ability to respond to that kind of failure. I wanted to test this, but my setup was fairly simple and virtual, so I couldn’t unplug any cables. I also couldn’t make configuration changes to the iSCSI array, as that would trigger a PDL (Permanent Device Loss) instead. So how do you test an APD scenario?

After trying various things, like killing the iSCSI daemon (it gets restarted automatically with no impact on the workload), I bumped into this command, which triggered the APD:

  • SSH into the host on which you want to trigger the APD and run the following command:
    esxcli iscsi session remove -A vmhba65
  • Make sure to replace “vmhba65” with the name of your iSCSI adapter
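To confirm the APD actually hit, you can check the logs for APD markers. A small helper sketch (the helper name is mine, and the exact log messages vary by ESXi version; the log path shown is the usual location on a host):

```shell
# Print lines that look like APD events in a given log file.
apd_hits() {
  grep -i -E 'apd|all paths down' "$1"
}

# On an ESXi host you would typically run:
# apd_hits /var/log/vmkernel.log
```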

This triggered the APD, as witnessed in the fdm.log and vmkernel.log, and ultimately resulted in vSphere HA killing the impacted VM and restarting it on a healthy host. Anyway, I just wanted to share this, as I am sure there are others who would like to test APD responses in their labs or before their environment goes into production.

There may be other easy ways as well; if you know any, please share them in the comments section.


About the Author

Duncan Epping is a Chief Technologist in the Office of the CTO in the Cloud Infrastructure Business Group (CIBG) at VMware. Besides writing on Yellow-Bricks, Duncan co-authors the vSAN Deep Dive book series and the vSphere Clustering Deep Dive book series. Duncan also co-hosts the Unexplored Territory Podcast.

Copyright Yellow-Bricks.com © 2023