I had a question this week around the failure of a host. The question was how long it takes before a host is declared failed. Now let's be clear, failed means "dead" in this case, not isolated or partitioned. It could be that the power has failed, the host has gone completely unresponsive, or anything else where there's absolutely no response from the host whatsoever. In that scenario, how long does it take before HA declares the host dead? Note that the below timeline applies to a traditional infrastructure, and that it is theoretical, assuming everything is optimal.
- T0 – Secondary Host failure.
- T3s – The Primary Host begins monitoring datastore heartbeats for 15 seconds.
- T10s – The host is declared unreachable and the Primary will ping the management network of the failed host.
- This is a continuous ping for 5 seconds.
- T15s – If no heartbeat datastores are configured, the host will be declared dead.
- T18s – If heartbeat datastores are configured and there have been no heartbeats, the host will be declared dead and restarts will be initiated.
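The decision logic above can be sketched as a small Python function. This is purely illustrative, not VMware's actual implementation; the constants are just the timings from the timeline above.

```python
# Hypothetical sketch of the secondary-host failure detection timeline.
# Constants come from the timeline in this article, not from FDM source code.

NETWORK_HEARTBEAT_TIMEOUT = 3   # T3s: network heartbeats missed
DATASTORE_MONITOR_WINDOW = 15   # datastore heartbeats monitored for 15 seconds
PING_START = 10                 # T10s: primary starts pinging the management network
PING_WINDOW = 5                 # continuous ping for 5 seconds

def time_to_declare_dead(heartbeat_datastores_configured: bool) -> int:
    """Return seconds from host failure (T0) until the host is declared dead."""
    if not heartbeat_datastores_configured:
        # No datastore heartbeats to wait on: the host is declared dead
        # once the ping window expires without a response.
        return PING_START + PING_WINDOW                              # T15s
    # Otherwise the full datastore-heartbeat window has to elapse first.
    return NETWORK_HEARTBEAT_TIMEOUT + DATASTORE_MONITOR_WINDOW      # T18s

print(time_to_declare_dead(False))  # 15
print(time_to_declare_dead(True))   # 18
```

Note how configuring heartbeat datastores adds three seconds to the declaration time: the primary waits for the datastore heartbeat window to rule out an isolated-but-alive host before declaring it dead.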
Now, when a Primary Host fails the timeline looks a bit different. This is mainly because first, a new Primary Host will need to be elected. Also, we need to ensure that the new primary has received the latest state of all secondary hosts.
- T0 – Primary Host failure.
- T10s – Primary election process initiated.
- T25s – New primary elected and reads the protectedlist.
- New primary waits for secondary hosts to report their running VMs.
- T35s – Old primary declared unreachable.
- T50s – Old primary declared dead, new primary initiates restarts for all VMs on the protectedlist which are not running.
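The primary-failure timeline can be sketched the same way. Again, this is a hypothetical illustration using the article's timings; the `protectedlist` and the set of running VMs stand in for the state the new primary gathers.

```python
# Hypothetical sketch of the primary-host failure timeline from this article.
# Timings are the article's values, not derived from VMware source code.

ELECTION_START = 10   # T10s: primary election process initiated
ELECTED_AT = 25       # T25s: new primary elected, reads the protectedlist
UNREACHABLE_AT = 35   # T35s: old primary declared unreachable
DEAD_AT = 50          # T50s: old primary declared dead, restarts initiated

def restart_initiation_time() -> int:
    """Seconds from primary failure (T0) until restarts are initiated."""
    # Restarts only start once the old primary is declared dead and the
    # new primary knows which protected VMs are not running anywhere.
    return DEAD_AT

def vms_to_restart(protectedlist: set, running_vms: set) -> list:
    """VMs on the protectedlist that no secondary host reports as running."""
    return sorted(protectedlist - running_vms)

print(restart_initiation_time())                        # 50
print(vms_to_restart({"vm1", "vm2", "vm3"}, {"vm2"}))   # ['vm1', 'vm3']
```

The extra ~32 seconds compared to the secondary-host case is the cost of the election plus the wait for the secondaries to report their state, which is why a primary failure takes noticeably longer to act on.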
Keep in mind, this does not mean that VMs will be restarted within 18 seconds, or 50 seconds, for that matter. When the host is declared dead, or a new primary is elected, the restart process starts. The VMs that need to be restarted will first need to be placed, and once placed, they will need to be restarted. All of these steps take time. On top of that, depending on the operating system and the apps running within the VM, the time it takes before the restart is fully completed can vary a lot between VMs. In other words, although the state is declared rather quickly, the actual total time it takes to restart can vary and is definitely not an exact science.
Carlos says
Nice, I assume this is the FDM way? Given that if vSAN is in place monitoring is delegated to vSAN… is the vSAN timing similar?
Duncan Epping says
Monitoring is not delegated to vSAN. With vSAN it is still FDM which monitors the host, it just uses a different network.
Carlos says
Oops, I do not know where I got this idea from then, but it was related to having coherency about host availability between HA and vSAN. I thought the implementation unified the decision point.
Robert Small says
Always good info/articles! It made me think of a recent condition I hope you can advise on. Had a situation with a significant network outage (likely spanning-tree related) where some vSAN clusters recovered gracefully when the network issues were resolved, but there were a couple of vSAN clusters (6.7 U3, all interfaces went down) that didn't reconnect to the network on their own and needed all hosts rebooted to recover. Is there a timeframe or condition that determines recovery after network isolation?
Duncan Epping says
Not that I know of; I haven't experienced this issue myself (or heard about it, unfortunately).
ae says
In the HA DeepDive book the scenario of master failure is explained with different timing:
T0 – Master failure.
T10s – Master election process initiated.
T25s – New master elected and reads the protectedlist.
T35s – New master initiates restarts for all virtual machines on the protectedlist which are not running.
I.e., the VM restart process is initiated at T35 and not at T50. Which explanation is correct?
Duncan Epping says
The timeline described in the article is what it looks like today; it appears that somewhere down the line the timing changed. I have not yet figured out with which version that happened.