Datastore Heartbeating and preventing Isolation Events?

Duncan Epping · Oct 3, 2011

I was just listening to some of the VMworld sessions and one was about HA. The presenter had a section about Datastore Heartbeats and mentioned that Datastore Heartbeats were added to prevent “Isolation Events”. I’ve heard multiple people make this statement over the last couple of months and I want to make it absolutely clear that this is NOT true. Let me repeat this: Datastore Heartbeats do not prevent an isolation event from occurring.

Let’s explain this a bit more in-depth. What happens when a host is cut off from the network because the NIC that carries its management traffic has just failed?

T0 – Isolation of the host (slave)
T10s – Slave enters “election state”
T25s – Slave elects itself as master
T25s – Slave pings “isolation addresses”
T30s – Slave declares itself isolated and “triggers” isolation response
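
The timeline above can be sketched in a few lines of Python. This is only an illustrative model, not HA code: the names (`Event`, `isolation_timeline`) are made up for this example, the offsets are the ones listed above, and the branch where an isolation address does respond is an assumption, since the timeline only covers the failing case. Note that nothing in it takes a datastore heartbeat as input.

```python
from dataclasses import dataclass
from typing import List

# Offsets in seconds, copied from the timeline above (slave host case).
ELECTION_STATE_AT = 10
SELF_ELECTED_MASTER_AT = 25
PING_ISOLATION_ADDRESSES_AT = 25
DECLARE_ISOLATED_AT = 30


@dataclass(frozen=True)
class Event:
    offset_s: int
    description: str


def isolation_timeline(isolation_address_reachable: bool) -> List[Event]:
    """Events a slave goes through after losing its management network.

    The only input is the result of pinging the isolation addresses;
    datastore heartbeats play no part in the host's own decision.
    """
    events = [
        Event(0, "isolation of the host"),
        Event(ELECTION_STATE_AT, "enters election state"),
        Event(SELF_ELECTED_MASTER_AT, "elects itself as master"),
        Event(PING_ISOLATION_ADDRESSES_AT, "pings isolation addresses"),
    ]
    # Assumption: a successful ping means the host does not declare isolation.
    if not isolation_address_reachable:
        events.append(
            Event(DECLARE_ISOLATED_AT,
                  "declares itself isolated and triggers isolation response")
        )
    return events


if __name__ == "__main__":
    for event in isolation_timeline(isolation_address_reachable=False):
        print(f"T{event.offset_s}s: {event.description}")
```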

Now as you can see, the Datastore Heartbeat mechanism plays no role whatsoever in the process of declaring a host isolated. Or does it? No, from the perspective of the host which is isolated it does not. The Datastore Heartbeat mechanism is used by the master to determine the state of the unresponsive host. It allows the master to determine whether the host which stopped sending network heartbeats is isolated or has failed completely. Depending on the determined state, the master will take appropriate action.
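
A minimal sketch of the master's side of this, in Python. The names (`HostState`, `classify_host`, `master_should_restart_vms`) are hypothetical and the model is deliberately reduced to the two signals discussed here. The restart rule follows the clarification given in the comments below: restart only when the host is dead, or when it is isolated and has powered off or shut down its VMs.

```python
from enum import Enum


class HostState(Enum):
    LIVE = "live"
    ISOLATED = "isolated"
    FAILED = "failed"


def classify_host(network_heartbeat: bool, datastore_heartbeat: bool) -> HostState:
    """Simplified view of how the master classifies a host.

    Datastore heartbeats are only consulted once network heartbeats stop.
    """
    if network_heartbeat:
        return HostState.LIVE
    if datastore_heartbeat:
        # Host is still updating its heartbeat region: alive but unreachable.
        return HostState.ISOLATED
    # No network heartbeat and no datastore heartbeat: treated as failed.
    return HostState.FAILED


def master_should_restart_vms(state: HostState, vms_powered_off: bool) -> bool:
    """Restart VMs only if the host failed, or is isolated and has
    already powered off / shut down its VMs (isolation response)."""
    if state is HostState.FAILED:
        return True
    if state is HostState.ISOLATED:
        return vms_powered_off
    return False


if __name__ == "__main__":
    state = classify_host(network_heartbeat=False, datastore_heartbeat=True)
    print(state, master_should_restart_vms(state, vms_powered_off=False))
```

With the isolation response set to leave the VMs powered on, the isolated case yields no restart in this model, which is the point of letting the master tell "isolated" apart from "failed".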

To summarize, the datastore heartbeat mechanism was introduced to allow the master to identify the state of hosts and is not used by the “isolated host” to prevent isolation.

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because they are currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

Comments

Ivan says

3 October, 2011 at 20:33

Hi Duncan,

I’m a little bit confused. Now correct me if I’m wrong. Isolation is only verified by the poweron file, right? So if the datastore isn’t available, it assumes the host is isolated and will trigger a start or restart of the virtual machines. But what if it is receiving datastore heartbeats and only the management network has failed (let’s assume 2 NICs of the management network failed simultaneously, but the virtual machine network and storage network are online)? What kind of mechanism is used then?

Thanks
- Duncan Epping says
  
  3 October, 2011 at 21:19
  
  There are two things Ivan:
  
  1) the heartbeat file
  2) the poweron file
  
  The Master will first check if the heartbeat region has been updated for this host. If that is the case then it knows that the host is “isolated”.
  
  Then the master will check the power-on file to see whether the host has taken action for this “isolation”.
  
  Now if both the Network has failed and the Datastores are inaccessible for a given host, then the master will indeed restart the VMs.
  - Ivan says
    
    4 October, 2011 at 13:54
    
    Hi Duncan,
    
    Thanks for the reply. Now the heartbeat file is only there to see if the datastore is accessible right?
    
    Now are my assumptions correct?
    
    1. It first checks the heartbeat on the datastore, i.e. whether the datastore is alive, based on the heartbeat file being there.
    2. If the datastore is alive then it checks the poweron files of all virtual machines on that host.
    
    So in this example, the situation could be that HA won’t do anything even if the host is isolated, as the datastore and virtual machines are still available?
    
    I can’t find a real example of when only one datastore is offline, but how will it try to restart the VMs if the datastore is offline, as it can access it?
    - Ivan says
      
      4 October, 2011 at 13:56
      
      I mean CAN’T access the datastore
Cwjking says

4 October, 2011 at 02:35

I have found that you seem to like to make very clear points about things.

Discussions and debates – dispelling myths and what not.

Thing is, people that do this usually come across as a jerk or unprofessional.

But I have to say you are seldom wrong and always professional. Bravo Duncan, keep up the good work.
Tom Stephens says

6 October, 2011 at 15:52

Good article Duncan. One small suggestion here though…

You mention the timing here for when an isolated host initiates an isolation response. You might also want to mention the role the heartbeat datastore plays for an isolated host in regards to when it will initiate that isolation response.
Christian Moeller says

10 December, 2011 at 17:01

Hi Duncan,
Can multiple Clusters make use of the same datastores for Datastore heartbeating? – or should I use different datastores for different Clusters?
Duncan says

10 December, 2011 at 20:04

Yes they can… A folder is created per cluster, this folder contains the heartbeat files.
Ron Bremer says

16 March, 2012 at 00:32

Hi Duncan:

What is the frequency of the datastore heartbeat on VMFS datastores? Your book mentions that the NFS datastore heartbeat file is updated once every five seconds. Is it the same frequency of the VMFS datastore heartbeat?

Thanks,
Ron
Harry says

12 April, 2012 at 16:44

Quote from a staff engineer:

“For datastore heartbeating on vmfs datastores – the slaves just open their corresponding file with a special kind of exclusive lock (the files themselves are not updated). This causes the ESX kernel to periodically update a counter in the heartbeat region of the vmfs datastore. Each host with access to the datastore has its own entry in this heartbeat region. The HA master can check if the entry in the heartbeat region for a specific host is being updated which is an indication of liveness. The master does not need to check which host is locking each particular heartbeat file (there isn’t a way to check this on the command-line).”

http://communities.vmware.com/thread/342995
Kahlan says

5 October, 2012 at 20:36

Duncan, can you please clarify something? Your article states:

No from the perspective of the host which is isolated it does not. The Datastore Heartbeat mechanism is used by the master to determine the state of the unresponsive host.

So if the datastore is up but the network is down and the isolation response is configured to shut down the VM, will the VM remain on the original host or will it be restarted by the master on a new host? What about if the VM is configured to stay powered on?
- Duncan says
  
  5 October, 2012 at 21:31
  
  If the network is down, isolation response will be triggered.
Hemanth says

21 May, 2013 at 16:56

Duncan, please correct me if I’m wrong.
What about if the VM is configured to shut down?
So if the datastore is up but the network is down and the isolation response is configured to shut down the VMs, the isolation response will be triggered and the VMs will be shut down, which will release the lock. Now the master will take ownership and restart the VMs on a different host.
======================================================
What about if the VM is configured to stay powered on?
However, before it triggers the isolation response, the host will first validate whether a master owns the datastore on which the VMs’ configuration files are stored. As the VMs are up, the locks are not released, so the host will not trigger the isolation response.
- Duncan Epping says
  
  22 May, 2013 at 13:25
  
  Not sure I am following it:
  Isolation Response = Response on the host that is isolated.
  Can be: leave powered on / power off / shutdown
  
  HA in vSphere 5.0 and up will only try to restart a virtual machine when:
  1) the host on which the VM resided is dead (not receiving any heartbeats)
  2) the host on which the VM resides is isolated and it has powered-off / shutdown the VMs
  - Eugene says
    
    27 September, 2014 at 17:02
    
    Hi Duncan I hope this makes sense and hopefully you’re still checking responses to this old but still relevant post.
    
    We just experienced a partial network outage where our core was almost non-responsive. Our isolation response is “Leave powered on”. Our storage is FC, so the datastores were still accessible, yet HA attempted to restart VMs from one of the 4 hosts on one of the other hosts. Each of those VM restart attempts failed due to files being locked by the powered-on VMs. This seems to contradict your statement above.
Tapan says

17 June, 2013 at 15:59

We are running with the third option in Datastore heartbeat, where the cluster selects two datastores of its own. Is this the ideal or best option, or should we check the other two? We recently had an outage where the host stopped pinging the gateway, and VMware says we have configured the cluster with the third option, i.e. “set datastore on my preference”, due to which Datastore heartbeat didn’t work. But after reading this article I am confused. Does that mean the VMware engineer was not correct? Please help me understand.
J says

18 October, 2014 at 08:27

Hi Duncan,

I’m still very confused about the necessity of the Datastore heartbeat.
Please correct me if I am wrong.

If the network heartbeat has failed, the slave will decide by itself whether it is isolated or not.
The master cannot communicate with the slave and cannot do anything to the slave, such as restart VMs on the slave.

On the other hand, the master will check the Datastore heartbeat to see whether the slave is really dead or not. But the action for the isolated host was already taken by the slave. The master will restart the VMs whatever the result of the Datastore heartbeat check.

What is the point of checking Datastore heartbeat by the master?

Thank you for your time.

J
- J says
  
  18 October, 2014 at 11:12
  
  Hi Duncan,
  
  Sorry again, but I would like to specify my question.
  
  1. First, the master is not receiving any network heartbeats from one slave.
  
  2. The master will check the Datastore heartbeat to see whether the slave is still alive or not.
  
  3. If the master finds from the Datastore heartbeat that the slave is still alive:
  
  The master will ping the default gateway by default. This is to check whether the master’s management network is okay or not.
  
  A) If the ping to the default gateway is successful:
  
  The master will think that the problem is the slave’s management network.
  The master will restart the VMs on the other hosts.
  
  On the other hand, the slave must know that it is isolated and take action as defined.
  
  B) If the ping to the default gateway fails:
  
  The master will know that the problem is not the slave but the master’s own management network, and that it is itself isolated.
  The master will take action as an isolated host.
  
  Thank you.
  
  J
  
