das.failuredetectiontime and its relationship with the isolation response

Coincidentally, I was asked this question twice in the last three weeks, and I figured it couldn’t hurt to explain it here as well. The question on the VMTN community was as follows:

on 13 sec: a host which hears from none of the partners will ping the isolation address
on 14 sec: if no reply from isolation address it will trigger the isolation response
on 15 sec: the host will be declared dead from the remaining hosts, this will be confirmed by pinging the missing host
on 16 sec: restarts of the VMs will begin

My first question is: Do all these timings come from the das.failuredetectiontime? That is, if das.failuredetectiontime is set to e.g. 30000 (30 sec) then on the 28th second a potential isolated host will try to ping the isolation address and do the Isolation Response action at 29 second?

Or is the Isolation Response timings hardcoded and always happens at 13 sec?

My second question, if the answer is Yes on above, why is the recommendation to increase das.failuredetectiontime to 20000 if having multiple Isolation Response addresses? If the above is correct then this would make the potentially isolated host test its isolation addresses at the 18th second and the restart of the VMs would begin at the 21st second, but what would be the gain from this really?

To which my answer was, fortunately, very short:

Yes, the relationship between these timings is das.failuredetectiontime.

Increasing the das.failuredetectiontime is usually recommended when an additional das.isolationaddress is specified. The reason for this is that the “ping” and the “result of the ping” need time, and by adding 5 seconds to the failure detection time you allow for this test to complete correctly. After which the isolation response could be triggered.

After discussing this on VMTN, giving it some thought, and bouncing my thoughts off the engineers, I came to the conclusion that the recommendation to increase das.failuredetectiontime by 5 seconds when multiple isolation addresses are specified is incorrect. The sequence is always as follows, regardless of the value of das.failuredetectiontime:

The ping will always occur at “das.failuredetectiontime -2”
The isolation response is always triggered at “das.failuredetectiontime -1”
The fail-over is always initiated at “das.failuredetectiontime +1”
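
To make the sequence above concrete, here is a minimal Python sketch that derives the three moments from a given das.failuredetectiontime value. The function name `ha_timeline` and the dictionary keys are made up for this illustration and are not part of HA itself; the offsets are simply the ones listed above.

```python
def ha_timeline(failure_detection_ms=15000):
    """Return the moments (in seconds) of the HA isolation sequence.

    failure_detection_ms is the das.failuredetectiontime value in
    milliseconds. The offsets (-2, -1, +1 seconds) are taken from the
    list above; the names used here are hypothetical.
    """
    t = failure_detection_ms / 1000  # convert milliseconds to seconds
    return {
        "ping_isolation_address": t - 2,
        "isolation_response": t - 1,
        "failover_initiated": t + 1,
    }


if __name__ == "__main__":
    for value in (15000, 20000, 30000):
        print(value, ha_timeline(value))
```

With 15000 this gives 13, 14 and 16 seconds, which matches the timings in the original question. With 30000 it gives 28, 29 and 31 seconds: the whole sequence shifts, but the gaps between the steps stay the same.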

The timeline in this article explains the process well.

Now, this recommendation to increase das.failuredetectiontime was probably made at a time when many customers were experiencing network issues. Increasing the time decreases the chances of running into an issue where VMs are powered down due to a network outage. Sorry about all the confusion and unclear recommendations.

Comments

Peter Linkletter says

28 May, 2011 at 20:13

Sorry but reading the 2 articles, I see a discrepancy between them. In your article “a couple of weeks back”, you state:
“Increasing the das.failuredetectiontime wouldn’t make a difference as the “2 second” gap will just move up as well. HA will always ping the isolation address on “das.failuredetectiontime – 1”.”

This clearly states that increasing the das.failuredetectiontime WILL NOT provide any more time for the pings. However, in this article, you state:

“Increasing the das.failuredetectiontime is usually recommended when an additional das.isolationaddress is specified. the reason for this is that the “ping” and the “result of the ping” needs time and by added 5 seconds to the failure detection time you allow for this test to complete correctly.”

This clearly states that increasing the das.failuredetectiontime WILL increase the time for the ping processing. Which is it?

BTW, Thanks for all the great articles!!!
- Brandon says
  
  29 May, 2011 at 01:48
  
  You’ve got your wires crossed. The failure detection time (the pings that are doing the testing happen within that time frame) have nothing to do with the two second gap. A larger failuredetection time (since he says 5 seconds… I’m assuming the standard recommendation of 20000 vs the 15000 default) gives the 5 extra seconds to be sure it occurs.
Brandon says

1 June, 2011 at 14:48

ERROR — Duncan changed his post, and my understanding of this at the same time. I’ll be damn o.O.
Rickard Nobel says

1 June, 2011 at 20:16

“The ping will always occur at “das.failuredetectiontime – 1” and the isolation response is always triggered at “das.failuredetectiontime + 1” regardless of the value of “das.failuredetectiontime”.”

From the text above, is this really correct? Should it not be ping at das.failuredetectiontime-2 and isolation response at das.failuredetectiontime-1?
- Duncan Epping says
  
  1 June, 2011 at 21:21
  
  Yes you are correct. I should try editing articles when I am awake 🙂
