I was just listening to some of the VMworld sessions and one was about HA. The presenter had a section about Datastore Heartbeats and mentioned that Datastore Heartbeats were added to prevent “Isolation Events”. I’ve heard multiple people make this statement over the last couple of months and I want to make it absolutely clear that this is NOT true. Let me repeat this: Datastore Heartbeats do not prevent an isolation event from occurring.
Let’s explain this a bit more in depth. What happens when a host is cut off from the network because the NIC which carries its management traffic has just failed?
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated and “triggers” isolation response
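The timeline above can be sketched as a small function. This is only an illustration of the sequence listed, not actual HA agent code: the function name, the state labels and the `isolation_address_reachable` parameter are hypothetical, and the branch for a successful ping (the host does not declare itself isolated) is an assumption added for completeness.

```python
# Illustrative sketch of the isolated host's timeline (hypothetical names).
# Note that no datastore heartbeat input appears anywhere in this function.

ELECTION_AT = 10       # seconds after isolation: slave enters election state
SELF_ELECTED_AT = 25   # slave elects itself master, starts pinging isolation addresses
DECLARE_AT = 30        # slave declares itself isolated if the pings failed


def host_state(seconds_since_cutoff: float, isolation_address_reachable: bool) -> str:
    """Return the state of the cut-off host at a given moment."""
    if seconds_since_cutoff < ELECTION_AT:
        return "slave"
    if seconds_since_cutoff < SELF_ELECTED_AT:
        return "election"
    if seconds_since_cutoff < DECLARE_AT:
        return "master, pinging isolation addresses"
    if isolation_address_reachable:
        # Assumption: a host that can still reach an isolation address
        # does not declare itself isolated.
        return "master"
    return "isolated"


for t in (0, 10, 25, 30):
    print(f"T{t}s:", host_state(t, isolation_address_reachable=False))
```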
Now, as you can see, the Datastore Heartbeat mechanism plays no role whatsoever in the process of declaring a host isolated. Or does it? No, from the perspective of the host which is isolated it does not. The Datastore Heartbeat mechanism is used by the master to determine the state of the unresponsive host. It allows the master to determine whether the host which stopped sending network heartbeats is isolated or has failed completely. Depending on the determined state, the master will take appropriate action.
To summarize, the datastore heartbeat mechanism has been introduced to allow the master to identify the state of hosts and is not used by the “isolated host” to prevent isolation.






Hi Duncan,
I’m a little bit confused. Now correct me if I’m wrong. Isolation is only verified by the poweron file, right? So if the datastore isn’t available, it assumes the host is isolated and it will trigger a start or restart of the virtual machines. But if it’s receiving datastore heartbeats and only the management network has failed (let’s assume two NICs of the management network failed simultaneously, but the virtual machine network and storage network are online), what kind of mechanism is used then?
Thanks
There are two things, Ivan:
1) the heartbeat file
2) the poweron file
The Master will first check if the heartbeat region has been updated for this host. If that is the case then it knows that the host is “isolated”.
Then the master will check the power-on file to see whether the host has taken action for this “isolation”.
Now if both the network has failed and the datastores are inaccessible for a given host, then the master will indeed restart the VMs.
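As a rough sketch of that order of checks from the master’s point of view: the function name, parameter names and returned labels below are hypothetical and chosen purely for illustration, this is not how the HA agent is actually implemented.

```python
# Illustrative sketch of the master's checks for a host (hypothetical names).

def master_assessment(network_heartbeat: bool,
                      heartbeat_region_updated: bool,
                      poweron_file_shows_response: bool) -> str:
    """Classify a host the way the reply above describes it."""
    if network_heartbeat:
        # The host is still sending network heartbeats: nothing to determine.
        return "live"
    if heartbeat_region_updated:
        # The host still updates its heartbeat region on the datastore,
        # so it is running but cut off from the network: "isolated".
        if poweron_file_shows_response:
            return "isolated, host has taken action"
        return "isolated, host has not taken action"
    # No network heartbeats and no datastore heartbeats.
    return "failed, restart VMs"


print(master_assessment(False, True, False))   # network down, datastore reachable
print(master_assessment(False, False, False))  # network down, datastore unreachable
```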
Hi Duncan,
Thanks for the reply. Now the heartbeat file is only there to see if the datastore is accessible, right?
Now are my assumptions correct?
1. It first checks the heartbeat on the datastore, using the heartbeat file to check whether the datastore is alive.
2. If the datastore is alive then it checks the poweron files of all virtual machines on that host.
So in this example, the situation could be that HA won’t do anything even if the host is isolated, as the datastore and virtual machines are still available?
I can’t find a real example where only one datastore is offline, but how will it try to restart the VMs if the datastore is offline, as it can access it.
I mean CAN’T access the datastore
I have found that you seem to like to make very clear points about things.
Discussions and debates, dispelling myths and what not.
Thing is, people that do this usually come across as a jerk or unprofessional.
But I have to say you are seldom wrong and always professional. Bravo Duncan, keep up the good work.
Good article Duncan. One small suggestion here though…
You mention the timing here for when an isolated host initiates an isolation response. You might also want to mention the role the heartbeat datastore plays for an isolated host with regard to when it will initiate that isolation response.
Hi Duncan,
Can multiple clusters make use of the same datastores for datastore heartbeating, or should I use different datastores for different clusters?
Yes they can. A folder is created per cluster, and this folder contains the heartbeat files.