I was just listening to some of the VMworld sessions and one was about HA. The presenter had a section about Datastore Heartbeats and mentioned that Datastore Heartbeats were added to prevent “Isolation Events”. I’ve heard multiple people make this statement over the last couple of months and I want to make it absolutely clear that this is NOT true. Let me repeat this: Datastore Heartbeats do not prevent an isolation event from occurring.
Let’s explain this a bit more in-depth. What happens when a host is cut off from the network because the NIC that carries its management traffic has just failed?
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated and “triggers” isolation response
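The timeline above can be sketched in a few lines of Python. This is only an illustration of the sequence as listed; the names `ISOLATION_TIMELINE` and `slave_state_at` are made up for this example and are not part of HA. Note that no datastore heartbeat input appears anywhere in it.

```python
# Hypothetical sketch: the steps and timings are copied from the list above.
# Times are seconds after the host loses its management network (T0).
ISOLATION_TIMELINE = [
    (0, "isolated from the network"),
    (10, "election state"),
    (25, "elected itself master, pinging isolation addresses"),
    (30, "declared isolated, isolation response triggered"),
]


def slave_state_at(seconds):
    """Return the latest timeline step the slave has reached at `seconds`."""
    state = "connected"
    for start, description in ISOLATION_TIMELINE:
        if seconds >= start:
            state = description
    return state


if __name__ == "__main__":
    for t in (5, 12, 27, 30):
        print(f"T{t}s: {slave_state_at(t)}")
```

The only input is elapsed time since the network was lost, which is the point: the isolated host reaches its verdict on its own.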
Now as you can see, the Datastore Heartbeat mechanism plays no role whatsoever in the process for declaring a host isolated, or does it? No, from the perspective of the host which is isolated it does not. The Datastore Heartbeat mechanism is used by the master to determine the state of the unresponsive host. It allows the master to determine if the host which stopped sending network heartbeats is isolated or has failed completely. Depending on the determined state, the master will take appropriate action.
To summarize, the datastore heartbeat mechanism has been introduced to allow the master to identify the state of hosts and is not used by the “isolated host” to prevent isolation.
** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because they are currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **
I’m a little bit confused. Now correct me if I’m wrong. Isolation is only verified by the poweron file, right? So if the datastore isn’t available, it assumes the host is isolated and will start or restart the virtual machines. But if it’s receiving datastore heartbeats and only the management network has failed (let’s assume two NICs of the management network failed simultaneously, but the virtual machine network and storage network are online), what kind of mechanism is used then?
Duncan Epping says
There are two things Ivan:
1) the heartbeat file
2) the poweron file
The Master will first check if the heartbeat region has been updated for this host. If that is the case then it knows that the host is “isolated”.
Then the master will check the power-on file to see whether the host has taken action for this “isolation”.
Now if both the network has failed and the datastores are inaccessible for a given host, then the master will indeed restart the VMs.
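As a rough sketch of the order of checks described in this reply, assuming the master has already stopped receiving network heartbeats from the host: the function name, the boolean inputs and the returned strings are all invented for illustration, and the real HA agent is far more involved.

```python
def master_assessment(heartbeat_region_updated, poweron_file_shows_response):
    """Hypothetical model of the master's view of an unresponsive host.

    heartbeat_region_updated: the host still updates its datastore heartbeat.
    poweron_file_shows_response: the power-on file shows the host has acted
        on its isolation (assumption: reduced here to a single boolean).
    """
    if not heartbeat_region_updated:
        # No network heartbeat and no datastore heartbeat:
        # the master restarts the VMs.
        return "restart VMs"
    if poweron_file_shows_response:
        return "isolated, response taken"
    return "isolated, no response taken yet"


if __name__ == "__main__":
    print(master_assessment(True, False))
    print(master_assessment(False, False))
```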
Thanks for the reply. Now the heartbeat file is only there to see if the datastore is accessible right?
Now are my assumptions correct?
1. It first checks the heartbeat on the datastore, using the presence of the heartbeat file to check whether the datastore is alive.
2. If the datastore is alive then it checks the poweron files of all virtual machines on that host.
So in this example, the situation could be that HA won’t do anything even if the host is isolated as the datastore and virtual machines are still available?
I can’t find a real example where only one datastore is offline, but how will it try to restart the VMs if the datastore is offline, as it can access it?
I mean CAN’T access the datastore.
I have found that you seem to like to make very clear points about things.
Discussions and debates – dispelling myths and what not.
Thing is, people that do this usually come across as a jerk or unprofessional.
But I have to say you are seldom wrong and always professional. Bravo Duncan, keep up the good work.
Tom Stephens says
Good article Duncan. One small suggestion here though…
You mention the timing here for when an isolated host initiates an isolation response. You might also want to mention the role the heartbeat datastore plays for an isolated host with regard to when it will initiate that isolation response.
Christian Moeller says
Can multiple Clusters make use of the same datastores for Datastore heartbeating? – or should I use different datastores for different Clusters?
Yes they can… A folder is created per cluster, this folder contains the heartbeat files.
Ron Bremer says
What is the frequency of the datastore heartbeat on VMFS datastores? Your book mentions that the NFS datastore heartbeat file is updated once every five seconds. Is it the same frequency of the VMFS datastore heartbeat?
Quote from a staff engineer:
“For datastore heartbeating on vmfs datastores – the slaves just open their corresponding file with a special kind of exclusive lock (the files themselves are not updated). This causes the ESX kernel to periodically update a counter in the heartbeat region of the vmfs datastore. Each host with access to the datastore has its own entry in this heartbeat region. The HA master can check if the entry in the heartbeat region for a specific host is being updated which is an indication of liveness. The master does not need to check which host is locking each particular heartbeat file (there isn’t a way to check this on the command-line).”
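The liveness check described in the quote boils down to “is this host’s counter still moving?”. A minimal sketch, assuming the master has a series of readings of one host’s counter taken at different times; how those readings are obtained is not shown, and `heartbeat_entry_is_advancing` is a made-up name, not a real command or API.

```python
def heartbeat_entry_is_advancing(counter_samples):
    """Return True if the counter changed between any two consecutive reads.

    counter_samples: readings of one host's heartbeat-region counter, oldest
    first (hypothetical input; the source does not describe how the master
    actually reads the region).
    """
    return any(
        later != earlier
        for earlier, later in zip(counter_samples, counter_samples[1:])
    )


if __name__ == "__main__":
    print(heartbeat_entry_is_advancing([41, 42, 43]))  # counter is moving
    print(heartbeat_entry_is_advancing([41, 41, 41]))  # counter is stuck
```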
Duncan, can you please clarify something? Your article states:
No from the perspective of the host which is isolated it does not. The Datastore Heartbeat mechanism is used by the master to determine the state of the unresponsive host.
So if the datastore is up but the network is down and the isolation response is configured to shut down the VM, will the VM remain on the original host or will it be restarted by the master on a new host? What about if the VM is configured to stay powered on?
If the network is down, the isolation response will be triggered.
Duncan, please correct me if I am wrong.
What about if the VM is configured to shut down?
So if the datastore is up but the network is down and the isolation response is configured to shut down the VMs, the isolation response will be triggered and the VMs will be shut down, which releases the locks. The master will then take ownership and restart the VMs on a different host.
What about if the VM is configured to stay powered on?
However, before it triggers the isolation response, the host will first validate whether a master owns the datastore on which the VMs’ configuration files are stored. As the VMs are up, the locks are not released, so the host will not trigger the isolation response.
Duncan Epping says
Not sure I am following it:
Isolation Response = Response on the host that is isolated.
Can be: leave powered on / power off / shutdown
HA in vSphere 5.0 and up will only try to restart a virtual machine when:
1) the host on which the VM resided is dead (not receiving any heartbeats)
2) the host on which the VM resides is isolated and it has powered-off / shutdown the VMs
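The two conditions above can be written as a single boolean expression. This is a sketch of the rule as stated in this reply, nothing more; the function and parameter names are invented for the example.

```python
def ha_will_attempt_restart(host_dead, host_isolated, vms_powered_off_or_shutdown):
    """Hypothetical encoding of the two restart conditions listed above.

    1) the host is dead (no heartbeats received), or
    2) the host is isolated and has powered off / shut down its VMs.
    """
    return host_dead or (host_isolated and vms_powered_off_or_shutdown)


if __name__ == "__main__":
    # Isolated host whose isolation response is "leave powered on".
    print(ha_will_attempt_restart(False, True, False))
    # Isolated host whose isolation response powered the VMs off.
    print(ha_will_attempt_restart(False, True, True))
```

Under this rule an isolated host with “leave powered on” does not lead to a restart attempt, because its VMs are still running.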
Hi Duncan I hope this makes sense and hopefully you’re still checking responses to this old but still relevant post.
We just experienced a partial network outage where our core was almost non responsive. Our isolation response is to Leave powered on. Our storage is FC and so the datastores were still accessible yet HA attempted to start VMs from one of the 4 hosts on one of the other hosts. Each of those VM restart attempts failed due to files being locked on the powered on VMs. This seems to contradict your statement above.
We are running with the third option for datastore heartbeating, where the cluster selects two datastores on its own. Is this the ideal option, or should we consider one of the other two? We recently had an outage where the host stopped pinging the gateway, and VMware says that because the cluster was configured with the third option, i.e. “set datastore on my preference”, the datastore heartbeat didn’t work. But after reading this article I am confused. Does that mean the VMware engineer was not correct? Please help me understand.
I’m still very confused about the necessity of the datastore heartbeat.
Please correct me if I am wrong.
If the network heartbeat fails, the slave will decide by itself whether it is isolated or not.
The master cannot communicate with the slave and cannot do anything to it, such as restarting the VMs on the slave.
On the other hand, the master will check the datastore heartbeat to see whether the slave is really dead or not. But the action for the isolated host has already been taken by the slave. The master will restart the VMs whatever the result of the datastore heartbeat check.
What is the point of the master checking the datastore heartbeat?
Thank you for your time.
Sorry again, but I would like to make my question more specific.
1. First, the master is not receiving any network heartbeats from one slave.
2. The master checks the datastore heartbeat to see whether the slave is still alive.
3. If the master finds from the datastore heartbeat that the slave is still alive:
The master pings the default gateway by default. This checks whether the master’s own management network is okay.
A) If the ping to the default gateway succeeds:
The master concludes that the problem is the slave’s management network.
The master restarts the VMs on the other hosts.
Meanwhile, the slave must know that it is isolated and take the action as defined.
B) If the ping to the default gateway fails:
The master knows that the problem is not the slave but the master’s own management network, and that it is itself isolated.
The master takes action as an isolated host.