I’ve written about vSAN and vSphere HA various times, but I don’t think this has been explicitly called out before. Cormac and I were doing some tests this week and noticed something. When we were looking at results I realized I described it in my HA book a long time ago, but it is so far hidden away that probably no one has noticed.
In a traditional environment when you enable HA you will automatically have HA heartbeat datastores selected. These heartbeat datastores are used by the HA master host to determine what has happened to a host which is no longer reachable over the management network. In other words, when a host is isolated it will communicate this to the HA master using the heartbeat datastores. It will also inform the HA master which VMs were powered off as the result of this isolation event (or not powered off when the isolation response is not configured).
Now, with vSAN, HA traffic between the hosts does not use the management network; it uses the vSAN network instead. Typically in a vSAN environment there’s only vSAN storage, so there are no heartbeat datastores. As such, when a host is isolated it is not possible to communicate this to the HA master. Remember, the network is down and there is no access to the vSAN datastore, so the host cannot communicate through that either. HA will still function as expected though. You can set the isolation response to power-off and then the VMs will be killed and restarted. That is, if isolation is declared.
So when is isolation declared? A host declares itself isolated when:
- It is not receiving any communication from the master
- It cannot ping the isolation address
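To make the rule explicit, here is a minimal sketch of the two conditions above as a single predicate. This is a simplified model, not HA's actual implementation: the function and parameter names are made up for illustration, and the real agent also deals with timers and retries which are ignored here.

```python
def declares_isolation(hears_from_master: bool, can_ping_isolation_address: bool) -> bool:
    """Return True when a host would declare itself isolated.

    Simplified model: BOTH conditions must hold, meaning no communication
    from the master AND no reply from the isolation address.
    """
    return (not hears_from_master) and (not can_ping_isolation_address)


# No master traffic, but the isolation address still answers: not isolated.
print(declares_isolation(hears_from_master=False, can_ping_isolation_address=True))

# No master traffic and no reply from the isolation address: isolated.
print(declares_isolation(hears_from_master=False, can_ping_isolation_address=False))
```

The point of the sketch is the `and`: losing contact with the master alone is not enough, the isolation address has to be unreachable as well.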
Now, if you have not set any advanced settings, the default gateway of the management network is the isolation address. Just imagine the vSAN network being isolated on a given host while, for whatever reason, the management network is not. In that scenario isolation is not declared, because the host can still ping the isolation address using the management network vmkernel interface. HOWEVER… vSphere HA will restart the VMs. The VMs have lost access to disk, and as such the lock on the VMDK is lost. HA notices the host is gone, which must mean that the VMs are dead as the locks are lost, so let’s restart them.
That is when you could end up with the VMs running on the isolated host and also somewhere else in the cluster, both with the same MAC address and the same name / IP address. Not a good situation. Now, if you had datastore heartbeats enabled, this would be prevented: the isolated host would inform the master that it is isolated, and it would also inform the master about the state of the VMs, which would be powered-on. The master would then decide not to restart the VMs. However, the VMs which are running on the isolated host are more or less useless as they cannot write to disk anymore.
Let’s describe what we tested and what the outcome was in a way that is a bit easier to consume, a table:
|Isolation Address|Datastore Heartbeats|Observed behavior|
|---|---|---|
|IP on vSAN Network|Not configured|Isolated host cannot ping the isolation address, isolation declared, VMs killed and VMs restarted|
|Management Network|Not configured|Can ping the isolation address, isolation not declared, yet rest of the cluster restarts the VMs even though they are still running on the isolated hosts|
|IP on vSAN Network|Configured|Isolated host cannot ping the isolation address, isolation declared, VMs killed and VMs restarted|
|Management Network|Configured|VMs are not powered-off and not restarted as the “isolated host” can still ping the management network and the datastore heartbeat mechanism is used to inform the master about the state. So the master knows HA network is not working, but the VMs are not powered off.|
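For those who prefer code over tables, the four rows can also be captured in a small lookup. This is only a sketch that restates what we observed, assuming the vSAN network of one host is isolated while its management network still works and the isolation response is set to power-off. The class and function names are hypothetical and have nothing to do with any VMware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    isolation_declared: bool
    vms_powered_off_on_isolated_host: bool
    vms_restarted_elsewhere: bool

    @property
    def duplicate_vms(self) -> bool:
        # Duplicates exist when VMs keep running on the isolated host
        # and are restarted elsewhere in the cluster at the same time.
        return self.vms_restarted_elsewhere and not self.vms_powered_off_on_isolated_host


def observed_outcome(isolation_address_on_vsan: bool, heartbeat_datastores: bool) -> Outcome:
    """Restate the four table rows as a function of the two settings."""
    if isolation_address_on_vsan:
        # Rows 1 and 3: the ping over the vSAN network fails, isolation is
        # declared, VMs are killed and restarted.
        return Outcome(True, True, True)
    if heartbeat_datastores:
        # Row 4: the master learns the VM state through the datastore,
        # so nothing is powered off and nothing is restarted.
        return Outcome(False, False, False)
    # Row 2: isolation is not declared, yet the cluster restarts the VMs.
    return Outcome(False, False, True)


for on_vsan in (True, False):
    for heartbeats in (False, True):
        result = observed_outcome(on_vsan, heartbeats)
        print(on_vsan, heartbeats, "duplicates:", result.duplicate_vms)
```

Only the combination of a management network isolation address and no heartbeat datastores reports duplicates, which is the second row of the table.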
So what did we learn, and what should you do when you have vSAN? Always use an isolation address which is in the same network as vSAN! This way, during an isolation event, the isolation is validated using the vSAN vmkernel interface. Always set the isolation response to power-off. (My personal opinion based on testing.) This avoids the scenario of duplicate MAC / IP / names on the network when a single network is isolated for a specific host! And if you have traditional storage, then you can enable heartbeat datastores. It doesn’t add much in terms of availability, but it will still allow the HA hosts to communicate state through the datastore.
PS1: For those who don’t know, HA is configured to automatically select a heartbeat datastore. In a vSAN only environment you can disable this by selecting “Use datastore from only the specified list” in the HA interface and then set “das.ignoreInsufficientHbDatastore = true” in the advanced HA settings.
PS2: In a non-routable vSAN network environment you could create a Switch Virtual Interface on the physical switch. This will give you an IP on the vSAN segment for the isolation address leveraging the advanced setting das.isolationaddress0.
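If you keep your cluster settings in a script or in documentation, the two advanced settings mentioned in PS1 and PS2 can be written down as plain key/value pairs. The sketch below only builds that dictionary and does not talk to vCenter. The helper name is hypothetical, and 192.0.2.10 is a documentation-range placeholder for whatever IP you have on your vSAN segment.

```python
import ipaddress
from typing import Dict


def vsan_ha_advanced_settings(isolation_address: str, vsan_only: bool = True) -> Dict[str, str]:
    """Build the HA advanced settings discussed above as key/value pairs.

    isolation_address: an IP on the vSAN segment (placeholder in the example).
    vsan_only: True when there are no heartbeat datastores to select.
    """
    # Raises ValueError early if the address is malformed.
    ipaddress.ip_address(isolation_address)

    settings = {"das.isolationaddress0": isolation_address}
    if vsan_only:
        # Ignore the lack of heartbeat datastores in a vSAN-only cluster.
        settings["das.ignoreInsufficientHbDatastore"] = "true"
    return settings


for key, value in vsan_ha_advanced_settings("192.0.2.10").items():
    print(f"{key} = {value}")
```

How you then apply these values (the HA advanced options in the UI, or your automation tool of choice) is up to you; the sketch just keeps the names and values in one place.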
** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because they are currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **