I was just listening to some of the VMworld sessions and one was about HA. The presenter had a section about Datastore Heartbeats and mentioned that Datastore Heartbeats were added to prevent “Isolation Events”. I’ve heard multiple people make this statement over the last couple of months and I want to make it absolutely clear that this is NOT true. Let me repeat this: Datastore Heartbeats do not prevent an isolation event from occurring.
Let’s explain this a bit more in-depth. What happens when a host is cut off from the network because the NIC that carries its management traffic has just failed?
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated and “triggers” isolation response
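The timeline above can be sketched in a few lines of Python. This is only an illustration, not HA's actual code: the `Event` type and the `isolated_host_timeline` function are hypothetical names, and the rule that a successful ping of the isolation addresses stops the host from declaring itself isolated is an assumption implied by the ping step rather than spelled out in the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    seconds: int       # offset from the moment the management network is lost
    description: str


def isolated_host_timeline(isolation_address_reachable: bool) -> List[Event]:
    """Replay the timeline from the point of view of the cut-off host.

    Note the inputs: there is no datastore heartbeat parameter, because the
    host does not consult it when deciding whether it is isolated.
    """
    events = [
        Event(0, "isolation of the host (slave)"),
        Event(10, "slave enters election state"),
        Event(25, "slave elects itself as master"),
        Event(25, "slave pings isolation addresses"),
    ]
    # Assumption: only a failed ping leads to the isolation declaration.
    if not isolation_address_reachable:
        events.append(
            Event(30, "slave declares itself isolated and triggers isolation response")
        )
    return events


for event in isolated_host_timeline(isolation_address_reachable=False):
    print(f"T{event.seconds}s: {event.description}")
```

The only input to the sketch is whether the isolation addresses answer, which mirrors the list: nothing in the host-side sequence reads a datastore heartbeat.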
Now as you can see, the Datastore Heartbeat mechanism plays no role whatsoever in the process for declaring a host isolated. Or does it? No, from the perspective of the host which is isolated it does not. The Datastore Heartbeat mechanism is used by the master to determine the state of the unresponsive host. It allows the master to determine whether the host which stopped sending network heartbeats is isolated or has failed completely. Depending on the determined state, the master will take appropriate action.
To summarize, the datastore heartbeat mechanism has been introduced to allow the master to identify the state of hosts and is not used by the “isolated host” to prevent isolation.
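The master-side view can be sketched the same way. This is a simplified model under stated assumptions, not the real implementation: `HostState` and `classify_host` are hypothetical names, and the two boolean inputs stand in for whatever the master actually observes about network and datastore heartbeats.

```python
from enum import Enum


class HostState(Enum):
    RESPONSIVE = "responsive"   # network heartbeats are still arriving
    ISOLATED = "isolated"       # no network heartbeats, but the host is still alive
    FAILED = "failed"           # no sign of life on either channel


def classify_host(network_heartbeat: bool, datastore_heartbeat: bool) -> HostState:
    """Master-side classification of a host (simplified, hypothetical)."""
    if network_heartbeat:
        # The datastore heartbeat only matters once network heartbeats stop.
        return HostState.RESPONSIVE
    if datastore_heartbeat:
        # The host still updates its datastore heartbeat, so it has not died.
        return HostState.ISOLATED
    return HostState.FAILED


print(classify_host(network_heartbeat=False, datastore_heartbeat=True).value)
print(classify_host(network_heartbeat=False, datastore_heartbeat=False).value)
```

In this model the datastore heartbeat is purely a second opinion for the master. It changes how the master classifies the unresponsive host, not what that host decides about itself.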
** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because they are currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **