Today someone asked a question about the advanced settings for VM Monitoring, which is probably the most underestimated feature of VMware HA. As the Availability Guide is not really clear on this, I decided it was worth sharing. Below you can find the original question that was asked on the VMTN Forum:
What is the point of having both of these settings? Why not just one that incorporates both? If das.failureinterval and das.iostatsinterval are both set for 2 minutes, it will wait 2 minutes before resetting the VM. If das.failureinterval is set to 2 minutes and das.iostatsinterval is set to 5 minutes, it will wait five minutes before resetting the VM. The availability guide doesn’t seem explicit in this area.
The only thing that kind of makes sense would be if das.iostatsinterval is set to 2 minutes and das.failureinterval is set to 5 minutes, then the VM would reboot after 2 minutes. Is that correct?!? The availability guide makes it seem like the das.iostatsinterval is a backup check to das.failureinterval, but it doesn’t say the opposite is true as well…
These two settings do completely different things. Let me try to explain. das.iostatsinterval is the window over which HA checks whether the VM had any network or storage I/O, for instance the last two minutes. That check is only performed after the number of seconds defined by das.failureinterval has passed without any VMware Tools heartbeat being received.
In the example this user provided, the VM would be rebooted two minutes after it failed IF it hasn’t had any network/storage I/O for 5 minutes, which is probably unlikely. So what does that mean for these values? Well, I would always recommend aligning them. There is no point in validating network/storage I/O over the past 5 minutes when you trigger the validation after two minutes without heartbeats, as the VM might have failed only 2 minutes 15 seconds ago.
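To make the two-step check a bit more concrete, here is a rough Python sketch of the logic as described in this post. It is only a model for reasoning about the two settings, not VMware's actual implementation; the function name and its arguments are made up for illustration, and all values are in seconds.

```python
def should_reset_vm(secs_since_heartbeat, secs_since_io,
                    failure_interval, iostats_interval):
    """Hypothetical model of the VM Monitoring decision described above.

    failure_interval  stands in for das.failureinterval
    iostats_interval  stands in for das.iostatsinterval
    """
    # Step 1: nothing is evaluated until VMware Tools heartbeats have been
    # missing for longer than das.failureinterval.
    if secs_since_heartbeat <= failure_interval:
        return False
    # Step 2: only then is the I/O window checked. Any network or storage
    # I/O inside the last das.iostatsinterval seconds blocks the reset.
    return secs_since_io >= iostats_interval


# Assume the VM hung 135 seconds ago (2 minutes 15 seconds), so heartbeats
# and I/O both stopped at that moment.
misaligned = should_reset_vm(135, 135, failure_interval=120, iostats_interval=300)
aligned = should_reset_vm(135, 135, failure_interval=120, iostats_interval=120)

print(misaligned)  # False: the 5 minute I/O window still contains I/O
print(aligned)     # True: no heartbeat and no I/O in the last 2 minutes
```

With the misaligned values the check is triggered after two minutes, but it keeps finding I/O from before the failure until the full five minutes have passed, which is exactly why aligning the two values makes sense.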
Conrad says
Oh man, my thread on the forums made the blog…Don’t know if I should be embarrassed or honored 😛
Duncan Epping says
it was a good question and I think many people wondered the same thing but didn’t ask 🙂
NiTRo says
Hi Duncan, as usual thanks for clarifying.
The 15sec you added to the 2min are from das.failuredetectiontime ?