I received a couple of questions last week about HA restarts in the scenario where a full site failure has occurred or a part of the storage system has failed and needs to be taken over by another datacenter. Yes indeed this is related to stretched clusters and HA restarts in a DR/DA event.
The questions were straight forward, how does the restart time-out work and what happens after the last retry? I wrote about HA restarts and the sequence last year, so lets just copy and paste that here:
- Initial restart attempt
- If the initial attempt failed, a restart will be retried after 2 minutes of the previous attempt
- If the previous attempt failed, a restart will be retried after 4 minutes of the previous attempt
- If the previous attempt failed, a restart will be retried after 8 minutes of the previous attempt
- If the previous attempt failed, a restart will be retried after 16 minutes of the previous attempt
You can extend the restart retry by increasing the value “das.maxvmrestartcount”. And then after every 15/16 minutes a new restart will be attempted. The question this triggered though is why would it even take 4 retries? The answer I got was: we don’t know if we will be able to fail over the storage within 30 minutes and if we will have sufficient compute resources…
Here comes the sweet part about vSphere HA, it actually is a pretty smart solution, it will know if VMs can be restarted or not. In this case as the datastore is not available there is absolutely no point in even trying and HA as such will not even bother. As soon as the storage becomes available though the restart attempts will start. Same applies to compute resource, if for whatever reason there is insufficient unreserved compute resources to restart your VMs then HA will wait for them to become available… nice right!?! Do note I emphasized the word “unreserved” as that is what HA cares about and not actually about used resources.