
Yellow Bricks

by Duncan Epping



How long does it take before a host is declared failed?

Duncan Epping · Jan 26, 2021 · 5 Comments

I had a question this week around the failure of a host: how long does it take before a host is declared failed? Now let’s be clear, failed means “dead” in this case, not isolated or partitioned. The power could have failed, the host could have gone completely unresponsive, or anything else could have happened where there’s absolutely no response from the host whatsoever. In that scenario, how long does it take before HA declares the host dead? Note that the timeline below applies to a traditional infrastructure, and that it is theoretical, assuming everything is optimal.

  • T0 – Secondary Host failure.
  • T3s – The Primary Host begins monitoring datastore heartbeats for 15 seconds.
  • T10s – The host is declared unreachable and the Primary will ping the management network of the failed host.
    • This is a continuous ping for 5 seconds.
  • T15s – If no heartbeat datastores are configured, the host will be declared dead.
  • T18s – If heartbeat datastores are configured and there have been no heartbeats, the host will be declared dead, restarts will be initiated.

Now, when a Primary Host fails, the timeline looks a bit different. This is mainly because a new Primary Host first needs to be elected, and because we need to ensure that the new primary has received the latest state of all secondary hosts. Both timelines are captured in the small sketch after the list below.

  • T0 – Primary Host failure.
  • T10s – Primary election process initiated.
  • T25s – New primary elected and reads the protectedlist.
    • New primary waits for secondary hosts to report running VMs
  • T35s – Old primary declared unreachable.
  • T50s – Old primary declared dead, new primary initiates restarts for all VMs on the protectedlist which are not running.
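To make the two timelines easy to compare, here is a small illustrative Python sketch that simply encodes the moments described above as constants. It does not query a real cluster, and the function name and structure are mine rather than anything that ships with vSphere.

# Illustrative only: encodes the theoretical timelines above as constants.
def seconds_until_restarts_initiated(primary_failed: bool,
                                     heartbeat_datastores: bool) -> int:
    """Theoretical T at which the failed host is declared dead and HA starts restarts."""
    if primary_failed:
        # T10s election starts, T25s new primary elected and reads the protectedlist,
        # T35s old primary unreachable, T50s declared dead and restarts initiated.
        # (The timeline above does not distinguish on heartbeat datastores here.)
        return 50
    # Secondary host failure: T15s without heartbeat datastores,
    # T18s when heartbeat datastores are configured.
    return 18 if heartbeat_datastores else 15

if __name__ == "__main__":
    for primary in (False, True):
        for hb in (False, True):
            role = "primary" if primary else "secondary"
            t = seconds_until_restarts_initiated(primary, hb)
            print(f"{role} host failed, heartbeat datastores={hb}: restarts start at ~T{t}s")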

Keep in mind, this does not mean that VMs will be restarted within 18 seconds, or 50 seconds for that matter. When the host is declared dead, or a new primary is elected, the restart process starts. The VMs that need to be restarted will first need to be placed, and once placed, they will need to be restarted. All of these steps take time. On top of that, depending on the operating system and the apps running within the VM, the time it takes before the restart is fully completed can vary a lot between VMs. In other words, although the state is declared rather fast, the actual total time it takes to restart can vary and is definitely not an exact science.

HA Architecture Series – Datastore Heartbeating (3/5)

Duncan Epping · Jul 26, 2011

**disclaimer: Some of the content has been taken from the vSphere 5 Clustering Technical Deepdive book**

The first time I was playing around with vSphere 5.0, and particularly HA, I noticed a new section in the UI called Datastore Heartbeating.

[Screenshot: the Datastore Heartbeating section in the vSphere 5.0 HA cluster settings]

Those familiar with HA prior to vSphere 5.0 probably know that virtual machine restarts were always initiated, even if only the management network of the host was isolated and the virtual machines were still running. As you can imagine, this added an unnecessary level of stress to the host. This has been mitigated by the introduction of the datastore heartbeating mechanism. Datastore heartbeating adds a new level of resiliency and allows HA to make a distinction between a failed host and an isolated / partitioned host. Isolated vs Partitioned is explained in Part 2 of this series.

Datastore heartbeating enables a master to more correctly determine the state of a host that is not reachable via the management network. The new datastore heartbeat mechanism is only used when the master has lost network connectivity with the slaves, to validate whether the host has failed or is merely isolated/network partitioned. As shown in the screenshot above, two datastores are automatically selected by vCenter. You can rule out specific volumes if and when required, or even make the selection yourself. I would, however, recommend letting vCenter decide.

As mentioned, by default vCenter will select two datastores. It is possible, however, to configure an advanced setting (das.heartbeatDsPerHost) to allow more datastores to be used for datastore heartbeating. I can imagine this is something you would do when you have multiple storage devices and want to pick a datastore from each, but generally speaking I would not recommend configuring this option, as the default should be sufficient for most scenarios.
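If you do decide to change it, the option can also be set programmatically through the vSphere API. Below is a rough, untested pyVmomi sketch; the vCenter address, the credentials, and the cluster name "Cluster-01" are placeholders, and you should verify the exact spec properties against the API reference for your version before relying on it.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Lab-style connection; use proper certificate validation in production.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()

# Find the cluster by name (placeholder name).
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "Cluster-01")
view.Destroy()

# Bump the number of heartbeat datastores per host from the default of 2 to 4.
spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo()
spec.dasConfig.option = [
    vim.option.OptionValue(key="das.heartbeatDsPerHost", value="4")
]
cluster.ReconfigureComputeResource_Task(spec, True)  # modify=True merges the change

Disconnect(si)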

How does this heartbeating mechanism work? HA leverages the existing VMFS filesystem locking mechanism. The locking mechanism uses a so-called “heartbeat region” which is updated as long as the lock on a file exists. In order to update a datastore heartbeat region, a host needs to have at least one open file on the volume. HA ensures there is at least one file open on this volume by creating a file specifically for datastore heartbeating. In other words, a per-host file is created on the designated heartbeating datastores, as shown in the screenshot below. HA will simply check whether the heartbeat region has been updated.
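Conceptually you can picture it like the toy sketch below: the master considers a slave alive as long as that host's heartbeat file keeps being updated. This is purely an illustration of the idea in Python; the real mechanism relies on the VMFS heartbeat region of the file lock rather than file modification times, and the directory, file name, and threshold here are made up.

import os
import time

HEARTBEAT_DIR = "/vmfs/volumes/datastore1/.vSphere-HA"  # placeholder path
STALE_AFTER_SECONDS = 15                                 # illustrative threshold only

def host_is_heartbeating(host_id: str) -> bool:
    """Treat a host as alive if its (hypothetical) heartbeat file was touched recently."""
    path = os.path.join(HEARTBEAT_DIR, f"host-{host_id}-hb")
    try:
        age = time.time() - os.path.getmtime(path)
    except FileNotFoundError:
        return False
    return age < STALE_AFTER_SECONDS

# Example: the master would only consider restarting VMs from a host that is
# unreachable over the network AND no longer heartbeating via the datastore.
if not host_is_heartbeating("4201a2b3"):
    print("No datastore heartbeat: host looks dead rather than isolated.")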

If you are curious which datastores have been selected for heartbeating, just go to the summary tab of your cluster and click “Cluster Status”; the third tab, “Heartbeat Datastores”, will reveal it.

 

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **

HA Architecture Series – FDM (1/5)

Duncan Epping · Jul 22, 2011

With vSphere 5.0 comes a new HA architecture. HA has been rewritten from the ground up to shed some of the constraints that were enforced by AAM. HA as part of 5.0, also referred to as FDM (Fault Domain Manager), introduces less complexity and higher resiliency. From a UI perspective not a lot has changed, but a lot has changed under the covers: there is no more primary/secondary node concept, but a master/slave concept with an automated election process. Anyway, too much detail for now; I will come back to that in a later article. Let’s start with the basics first. As mentioned, the complete agent has been rewritten and the dependency on VPXA has been removed. HA talks directly to hostd instead of using VPXA as a translator, as it did with vSphere 4.1 and prior. This excellent diagram by Frank, also used in our book, demonstrates how things are connected as of vSphere 5.0.

[Diagram: how the FDM agent, hostd, and vCenter are connected as of vSphere 5.0]

The main point here though is that the FDM agent also communicates with vCenter and vCenter with the FDM agent. As of vSphere 5.0, HA leverages vCenter to retrieve information about the status of virtual machines and vCenter is used to display the protection status of virtual machines. On top of that, vCenter is responsible for the protection and unprotection of virtual machines. This not only applies to user initiated power-offs or power-ons of virtual machines, but also in the case where an ESXi host is disconnected from vCenter at which point vCenter will request the master HA agent to unprotect the affected virtual machines. What protection actually means will be discussed in a follow-up article. One thing I do want to point out is that if vCenter is unavailable, it will not be possible to make changes to the configuration of the cluster. vCenter is the source of truth for the set of virtual machines that are protected, the cluster configuration, the virtual machine-to-host compatibility information, and the host membership. So, while HA, by design, will respond to failures without vCenter, HA relies on vCenter to be available to configure or monitor the cluster.

Without diving straight into the deep end, I want to point out two minor changes but huge improvements when it comes to managing/troubleshooting HA:

  • No dependency on DNS
  • Syslog functionality

As of 5.0, HA is no longer dependent on DNS, as it works with IP addresses only. This is one of the major improvements that FDM brings. It also means that the character limit that HA imposed on the hostname has been lifted. (Pre-vSphere 5.0, hostnames were limited to 26 characters.) Another major change is the fact that the HA log files are part of the normal log functionality ESXi offers, which means that you can find the log file in /var/log and it is picked up by syslog!

There are a couple of other major changes that I want to explain in the upcoming posts. Here is what you can expect to be published in the upcoming week:

  • Primary Nodes (2/5)
  • Datastore Heartbeating (3/5)
  • Restarting VMs (4/5)
  • Advanced Settings (5/5)

 

** Disclaimer: This article contains references to the words master and/or slave. I recognize these as exclusionary words. The words are used in this article for consistency because it’s currently the words that appear in the software, in the UI, and in the log files. When the software is updated to remove the words, this article will be updated to be in alignment. **
