I had a discussion on the VMTN forums about this last week, and the question basically was: what should my das.failuredetectiontime be set to when the isolation response is set to “Shut down”?

Let’s first explain what das.failuredetectiontime is. I described it in our book as follows:

Failure Detection Time is basically the time it takes before the “isolation response” is triggered. There are two primary concepts when we are talking about failure detection time:

  • The time it will take the host to detect it is isolated
  • The time it will take the non-isolated hosts to mark the unavailable host as isolated and initiate the failover

So what does this have to do with your Isolation Response? Well, not much actually. That might sound weird, but it had me thinking about it for a second as well.

What if your Isolation Response is set to “Shut down” and an isolation occurs? Well, in that case HA will try to “Shut down” the VMs in a clean way when isolation has been detected. With the default das.failuredetectiontime of 15 seconds, HA will do that on the 14th second. On the 16th second the restart will be initiated. So that leaves your VMs exactly two seconds to shut down in a clean way. Two questions pop up immediately:

  1. What if I increase the das.failuredetectiontime?
  2. What are the chances the restarts happen in time?

Increasing das.failuredetectiontime wouldn’t make a difference, as the “2 second” gap will just move up as well. HA will always ping the isolation address at “das.failuredetectiontime – 1” and it will always initiate the restarts at “das.failuredetectiontime + 1”. In other words, 2 minutes or 20 seconds, it makes no difference. I guess a nice diagram makes this a bit clearer:

(created by Frank D. for our book)
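The arithmetic behind that fixed gap can be sketched in a few lines of Python. This is only an illustration of the “– 1” and “+ 1” rule described above, not anything HA itself exposes: the function name `isolation_timeline` and the dictionary keys are made up for this example, and it works in seconds the way this post does.

```python
def isolation_timeline(failure_detection_time_s: int = 15) -> dict:
    """Return the key moments of an isolation event, in seconds.

    Hypothetical helper: it only encodes the rule from the text that the
    isolation response fires at (failure detection time - 1) and the
    restart is initiated at (failure detection time + 1).
    """
    if failure_detection_time_s < 2:
        raise ValueError("failure detection time must be at least 2 seconds")

    isolation_response_at = failure_detection_time_s - 1
    restart_initiated_at = failure_detection_time_s + 1
    return {
        "isolation_response_s": isolation_response_at,
        "restart_initiated_s": restart_initiated_at,
        # Time the VMs get to shut down cleanly before the restart starts.
        "shutdown_window_s": restart_initiated_at - isolation_response_at,
    }


if __name__ == "__main__":
    # 15 seconds, 20 seconds and 2 minutes: the window stays the same.
    for fdt in (15, 20, 120):
        print(fdt, isolation_timeline(fdt))
```

Whatever value is passed in, `shutdown_window_s` comes out as 2, which is the whole point: the setting moves both moments, not the distance between them.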

So what are the chances these restarts will occur within 16 seconds? Slim indeed. So when will they be restarted? Well, a year ago I wrote this article, and the following still applies for vSphere 4.1:

  • T+0 – Restart
  • T+2 – Restart retry 1
  • T+4 – Restart retry 2
  • T+8 – Restart retry 3
  • T+8 – Restart retry 4
  • T+8 – Restart retry 5

In other words, if the restart at T+0 fails, it will be retried 2 minutes later. If that one fails, the restart will be retried 4 minutes after that (2+4 = 6 minutes after the initial restart). So as you can see, selecting “shut down” will more than likely increase your restart latency, and this needs to be taken into account for your SLA.
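To see what that means for an SLA, here is a small Python sketch that adds up the waits. It assumes every entry in the list above is a delay in minutes counted from the previous attempt, which is how the first two retries are described; applying the same reading to the three T+8 entries is my assumption. The names `RETRY_DELAYS_MIN` and `restart_attempt_times` are made up for this example.

```python
from itertools import accumulate

# Delay in minutes before each attempt, counted from the previous attempt.
# Assumption: all entries in the list above follow this "since previous" reading.
RETRY_DELAYS_MIN = [0, 2, 4, 8, 8, 8]


def restart_attempt_times(delays=RETRY_DELAYS_MIN):
    """Return minutes elapsed since the initial restart for each attempt."""
    return list(accumulate(delays))


if __name__ == "__main__":
    labels = ["Restart"] + [f"Restart retry {n}" for n in range(1, 6)]
    for label, minutes in zip(labels, restart_attempt_times()):
        print(f"{label}: {minutes} minutes after the initial restart")
```

Under that reading the third attempt lands 6 minutes after the initial restart, matching the 2+4 above, and the last retry would land 30 minutes after it.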