I had a question this week around the failure of a host. The question was how long it takes before a host is declared failed. Now let’s be clear, failed means “dead” in this case, not isolated or partitioned. It could be the power has failed, the host has gone completely unresponsive, or anything else where there’s absolutely no response from the host whatsoever. In that scenario, how long does it take before HA has declared the VM dead? Now note, the below timeline is in a traditional infrastructure. Also note, that this is theoretical, when everything is optimal.
- T0 – Secondary Host failure.
- T3s – The Primary Host begins monitoring datastore heartbeats for 15 seconds.
- T10s – The host is declared unreachable and the Primary will ping the management network of the failed host.
- This is a continuous ping for 5 seconds.
- T15s – If no heartbeat datastores are configured, the host will be declared dead.
- T18s – If heartbeat datastores are configured and there have been no heartbeats, the host will be declared dead, restarts will be initiated.
Now, when a Primary Host fails the timeline looks a bit different. This is mainly because first, a new Primary Host will need to be elected. Also, we need to ensure that the new primary has received the latest state of all secondary hosts.
- T0 – Primary Host failure.
- T10s – Primary election process initiated.
- T25s – New primary elected and reads the protectedlist.
- New primary waits for secondary hosts to report running VMs
- T35s – Old primary declared unreachable.
- T50s – Old primary declared dead, new primary initiates restarts for all VMs on the protectedlist which are not running.
Keep in mind, this does not mean that VMs will be restarted with 18 seconds, or 35 seconds, for that matter. When the host is declared dead, or a new primary is elected, the restart process starts. The VMs that need to be restarted will first need to be placed, and when placed, they will need to be restarted. All of these steps will take time. On top of that, depending on the operating system and the apps running within the VM, the time it takes before the restart is fully completed could vary a lot between VMs. In other words, although the state is declared rather fast, the actual total time it takes to restart can vary and is definitely not an exact science.