The HA Deep Dive was published as part of the vSphere Clustering Deep Dive book series. The book (paper) can be bought through Amazon, or you can find the ebook version for free on the Rubrik website. As it has been multiple years since I last updated the content I figured I would also share most of the content here. Although this is not the full book, this is going to be a fairly long post, so you probably should grab a cup of coffee or 15 if you want to read it completely.
Introduction
Availability has traditionally been one of the most important aspects when providing services. When providing services on a shared platform like VMware vSphere, the impact of downtime grows significantly as many services run on a single physical machine. As such, VMware engineered a feature called vSphere High Availability. vSphere High Availability, hereafter simply referred to as HA, provides a simple and cost-effective solution to increase availability for any application running in a VM, regardless of its operating system. It is configured using a couple of simple steps through vCenter Server (vCenter) and as such provides a uniform and simple interface. HA enables you to create a cluster out of multiple ESXi hosts, which allows you to protect VMs and their workloads. In the event of a failure of one of the hosts in the cluster, impacted VMs are automatically restarted on other ESXi hosts within that same VMware vSphere Cluster (cluster).
On top of that, in the case of a Guest OS level failure, HA can restart the failed Guest OS. This feature is called VM Monitoring, but is sometimes also referred to as VM-HA. This might sound fairly complex but again can be implemented with a single click.
Unlike many other clustering solutions, HA is simple to implement and literally enabled within 5 clicks. On top of that, HA is widely adopted and used in environments of all sizes. However, HA is not a 1:1 replacement for solutions like Microsoft Clustering Services / Windows Server Failover Clustering (WSFC). The main difference between WSFC and HA is that WSFC was designed to protect stateful cluster-aware applications, while HA was designed to protect any VM regardless of the type of workload within, and can also be extended to the application layer through the use of VM and Application Monitoring.
In the case of HA, a fail-over incurs downtime as the VM is restarted on one of the remaining hosts in the cluster, whereas WSFC transitions the service to one of the remaining nodes in the cluster when a failure occurs. Contrary to what many believe, WSFC does not guarantee that there is no downtime during a transition. On top of that, your application needs to be cluster-aware and stateful in order to get the most out of this mechanism, which limits the number of workloads that could really benefit from this type of clustering.
One might ask why you would want to use HA when a VM is restarted and service is temporarily lost. The answer is simple; not all VMs (or services) need 99.999% uptime. For many services the type of availability HA provides is more than sufficient. On top of that, many applications were never designed to run on top of a WSFC cluster. This means that there is no guarantee of availability or data consistency if an application is clustered with WSFC but is not cluster-aware.
In addition, WSFC clustering can be complex and requires special skills and training. One example is managing patches and updates/upgrades in a WSFC environment; this could even lead to more downtime if not operated correctly and definitely complicates operational procedures. HA, however, reduces complexity, costs (associated with downtime and WSFC), resource overhead and unplanned downtime for minimal additional costs. It is important to note that HA, contrary to WSFC, does not require any changes to the guest as HA is provided at the hypervisor level. Also, VM Monitoring does not require any additional software or OS modifications except for VMware Tools, which should be installed anyway as a best practice. In case even higher availability is required, VMware also provides a level of application awareness through Application Monitoring, which has been leveraged by partners like Symantec to enable application-level resiliency and could be used by in-house development teams to increase resiliency for their applications.
HA has proven itself over and over again and is widely adopted within the industry; if you are not using it today, hopefully, you will be convinced after reading this very lengthy post.
What is required for HA to work?
Each feature or product has very specific requirements and HA is no different. Knowing the requirements of HA is part of the basics we have to cover before diving into some of the more complex concepts. For those who are completely new to HA, I will also show you how to configure it.
Prerequisites
Before enabling HA it is highly recommended to validate that the environment meets all the prerequisites. I have also included recommendations from an infrastructure perspective that will enhance resiliency.
Requirements:
- A minimum of two ESXi hosts
- A minimum of 8 GB memory per host to install ESXi and enable HA
- VMware vCenter Server
- Shared Storage for VMs
- Pingable gateway or other reliable address
Recommendation:
- Redundant Management Network (not a requirement, but highly recommended)
- 128 GB of memory or more per host
- Multiple shared datastores
Firewall Requirements
The following table contains the ports that are used by HA for communication. If your environment contains firewalls external to the host, ensure these ports are opened for HA to function correctly. HA will open the required ports on the ESXi firewall.
Port | Protocol | Direction
---|---|---
8182 | UDP | Inbound
8182 | TCP | Inbound
8182 | UDP | Outbound
8182 | TCP | Outbound
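For those who track firewall change requests as code, the table above can be captured in a small sketch. The structure and the function name below are my own illustration, not any VMware API:

```python
# The rule list mirrors the HA firewall table above; the dict layout and
# the helper function are hypothetical, for documentation/validation only.
HA_FIREWALL_RULES = [
    {"port": 8182, "protocol": "UDP", "direction": "Inbound"},
    {"port": 8182, "protocol": "TCP", "direction": "Inbound"},
    {"port": 8182, "protocol": "UDP", "direction": "Outbound"},
    {"port": 8182, "protocol": "TCP", "direction": "Outbound"},
]

def missing_rules(open_rules):
    """Return the HA rules that are not present in a firewall's rule list."""
    return [rule for rule in HA_FIREWALL_RULES if rule not in open_rules]
```

A check like this could be run against an export of your external firewall configuration before enabling HA; remember that the ESXi host firewall itself is handled automatically.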
Configuring vSphere High Availability
HA can be configured with the default settings within a couple of clicks. The following steps will show you how to create a cluster and enable HA, including VM Monitoring, using the vSphere Client (HTML-5). Each of the settings and the design decisions associated with these steps will be described in more depth in the following sections.
- Click on the “Hosts & Clusters” view.
- Right-click the Datacenter in the Inventory tree and click “New Cluster”.
- Give the new cluster an appropriate name. I recommend at a minimum including the location of the cluster and a sequence number, e.g. ams-hadrs-001.
- Select “Turn On vSphere HA”.
- Ensure “Enable host monitoring” and “Enable admission control” are selected.
- If required, enable VM Monitoring by selecting “VM Monitoring Only” or “VM and Application Monitoring” in the dropdown.
- Click “OK” to complete the creation of the cluster.
When the HA cluster has been created, the ESXi hosts can be added to the cluster simply by right-clicking the host and selecting “Move To”, if they were already added to vCenter, or by right-clicking the cluster and selecting “Add Host”.
When an ESXi host is added to the newly-created cluster, the HA agent will be loaded and configured. Once this has completed, HA will enable protection of the workloads running on this ESXi host.
As I have clearly demonstrated, HA is a simple clustering solution that will allow you to protect VMs against host failure and operating system failure in literally minutes. Understanding the architecture of HA will enable you to reach that extra 9 when it comes to availability. The following sections will discuss the architecture and fundamental concepts of HA. I will also discuss all decision-making moments to ensure you will configure HA in such a way that it meets the requirements of your or your customer’s environment.
Components of HA
Now that we know what the prerequisites are and how to configure HA, the next step is describing which components form HA. Keep in mind that this is still a “high level” overview. There is more under the covers that I will explain in the following sections. The following diagram depicts a two-host cluster and shows the key HA components.
As you can clearly see, there are three major components that form the foundation for HA:
- FDM
- HOSTD
- vCenter
The first and probably the most important component that forms HA is FDM (Fault Domain Manager). This is the HA agent. The FDM Agent is responsible for many tasks such as communicating host resource information, VM states and HA properties to other hosts in the cluster. FDM also handles heartbeat mechanisms, VM placement, VM restarts, logging and much more. I am not going to discuss all of this in-depth separately as I feel that this will complicate things too much.
FDM is, in my opinion, one of the most important agents on an ESXi host. The engineers recognized this importance and added an extra level of resiliency to HA. FDM uses a single-process agent; however, FDM spawns a watchdog process. In the unlikely event of an agent failure, the watchdog functionality will pick up on this and restart the agent to ensure HA functionality remains, without anyone ever noticing it failed. The agent is also resilient to network interruptions and “all paths down” (APD) conditions. Inter-host communication automatically uses another communication path (if the host is configured with redundant management networks) in the case of a network failure.
HA has no dependency on DNS as it works with IP addresses only. This is one of the major improvements that FDM brought. This does not mean that ESXi hosts need to be registered with their IP addresses in vCenter; it is still a best practice to register ESXi hosts by their fully qualified domain name
(FQDN) in vCenter. Although HA does not depend on DNS, remember that other services may depend on it. On top of that, monitoring and troubleshooting will be much easier when hosts are correctly registered within vCenter and have a valid FQDN.
Basic design principle: Although HA is not dependent on DNS, it is still recommended to register the hosts with their FQDN for ease of operations/management.
vSphere HA also has a standardized logging mechanism, where a single log file has been created for all operational log messages; it is called fdm.log. This log file is stored under /var/log/ as depicted in the screenshot below.
Although typically not needed, I recommend getting familiar with the fdm.log file as it will enable you to troubleshoot the environment when an issue occurs. An example of when the fdm.log will be very useful is the situation where VMs have been restarted without any apparent reason. The fdm.log file will show when the VMs have been restarted, but more importantly, it will also inform you why VMs have been restarted, whether it was the result of a host, network, or storage failure for instance.
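If you prefer to sift through fdm.log programmatically, a minimal sketch like the following can help. Note the matched pattern is illustrative only; the exact fdm.log message format varies per release, so treat this as a starting point, not a parser for the real format:

```python
import re

def restart_events(fdm_log_text):
    """Pull lines mentioning restarts out of fdm.log content.
    The keyword match is a rough heuristic, not an official log schema."""
    return [line for line in fdm_log_text.splitlines()
            if re.search(r"restart", line, re.IGNORECASE)]
```

Running this over the content of /var/log/fdm.log would narrow a large log down to the restart-related lines, which you can then inspect for the host, network, or storage failure that triggered them.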
HOSTD Agent
One of the most crucial agents on a host is HOSTD. This agent is responsible for many of the tasks we take for granted, like powering on VMs. FDM talks directly to HOSTD and vCenter, so unlike in earlier releases it is not dependent on VPXA. This is, of course, to avoid any unnecessary overhead and dependencies, making HA more reliable than ever before and enabling HA to respond faster to power-on requests. That ultimately results in higher VM uptime.
When, for whatever reason, HOSTD is unavailable or not yet running after a restart, the host will not participate in any FDM-related processes. FDM relies on HOSTD for information about the VMs that are registered to the host, and manages the VMs using HOSTD APIs. In short, FDM is dependent on HOSTD and if HOSTD is not operational, FDM halts all functions and waits for HOSTD to become operational.
vCenter
That brings us to our final component, the vCenter Server. vCenter is the core of every vSphere Cluster and is responsible for many tasks these days. For our purposes, the following are the most important and the ones I will discuss in more detail:
- Deploying and configuring HA Agents
- Communication of cluster configuration changes
- Protection of VMs
vCenter is responsible for pushing out the FDM agent to the ESXi hosts when applicable. The push of these agents is done in parallel to allow for faster deployment and configuration of multiple hosts in a cluster. vCenter is also responsible for communicating configuration changes in the cluster to the host which is elected as the master. I will discuss this concept of master and slaves in the following section. Examples of configuration changes are the modification or addition of an advanced setting or the introduction of a new host into the cluster.
HA leverages vCenter to retrieve information about the status of VMs and, of course, vCenter is used to display the protection status of VMs. (What “VM protection” actually means will be discussed in a later section.) On top of that, vCenter is responsible for the protection and unprotection of VMs. This not only applies to user-initiated power-offs or power-ons of VMs, but also to the case where an ESXi host is disconnected from vCenter, at which point vCenter will request the master HA agent to unprotect the affected VMs.
Although vCenter configures HA and exchanges VM state information with it, vCenter is not involved when HA responds to a failure. It is comforting to know that in the case of a failure of the host running the virtualized vCenter Server, HA takes care of the failure and restarts the vCenter Server on another host, along with all other configured VMs from that failed host.
There is a corner-case scenario with regards to vCenter failure: if the ESXi hosts are so-called “stateless hosts” and Distributed vSwitches are used for the management network, VM restarts will not be attempted until vCenter is restarted. For stateless environments, vCenter and Auto Deploy availability is key as the ESXi hosts literally depend on them.
If vCenter is unavailable, it will not be possible to make changes to the configuration of the cluster. vCenter is the source of truth for the set of VMs that are protected, the cluster configuration, the VM-to-host compatibility information, and the host membership. So, while HA, by design, will respond to failures without vCenter, HA relies on vCenter to be available to configure or monitor the cluster.
After deploying vCenter Server and configuring your cluster, I recommend setting the correct HA restart priorities for it. Although vCenter Server is not required to restart VMs, there are multiple components that rely on vCenter and, as such, a speedy recovery is desired. When configuring your vCenter VM with the highest priority for restarts, remember to include all services on which your vCenter server depends for a successful restart: DNS, Active Directory and MS SQL (or any other database server you are using).
Basic design principles: 1. In stateless environments, ensure vCenter and Auto Deploy are highly available as recovery time of your VMs might be dependent on them. 2. Understand the impact of virtualizing vCenter. Ensure it has the highest priority for restarts and ensure that services which vCenter Server depends on are available: DNS, Active Directory and the potential external database server.
Fundamental Concepts
Now that you know about the components of HA, it is time to start talking about some of the fundamental concepts of HA clusters:
- Master / Slave agents
- Heartbeating
- Isolated vs Network partitioned
- VM Protection
- Component Protection
Everyone who has implemented vSphere knows that multiple hosts can be configured into a cluster. A cluster can best be seen as a collection of resources. These resources can be carved up with the use of vSphere Distributed Resource Scheduler (DRS) into separate pools of resources or used to increase availability by enabling HA.
The HA architecture introduces the concept of master and slave HA agents. Except during network partitions, which are discussed later, there is only one master HA agent in a cluster. Any agent can serve as a master, and all others are considered its slaves. A master agent is in charge of monitoring the health of VMs for which it is responsible and restarting any that fail. The slaves are responsible for forwarding information to the master agent and restarting any VMs at the direction of the master. The HA agent, regardless of its role as master or slave, also implements the VM/App Monitoring feature, which allows it to restart VMs in the case of an Operating System failure, or restart services in the case of an application failure.
Master Agent
As stated, one of the primary tasks of the master is to keep track of the state of the VMs it is responsible for and to take action when appropriate. In a normal situation there is only a single master in a cluster. I will discuss the scenario where multiple masters can exist in a single cluster in one of the following sections, but for now let’s talk about a cluster with a single master. A master will claim responsibility for a VM by taking “ownership” of the datastore on which the VM’s configuration file is stored.
Basic design principle: To maximize the chance of restarting VMs after a failure I recommend masking datastores on a cluster basis. Although sharing of datastores across clusters will work, it will increase complexity from an administrative perspective.
That is not all, of course. The HA master is also responsible for exchanging state information with vCenter. This means that it will not only receive but also send information to vCenter when required. The HA master is also the host that initiates the restart of VMs when a host has failed. You may immediately want to ask what happens when the master is the one that fails, or, more generically, which of the hosts can become the master and when is it elected?
Election
A master is elected by a set of HA agents whenever the agents are not in network contact with a master. A master election thus occurs when HA is first enabled on a cluster and when the host on which the master is running:
- fails,
- becomes network partitioned or isolated,
- is disconnected from vCenter Server,
- is put into maintenance or standby mode,
- or when HA is reconfigured on the host.
The HA master election takes approximately 15 seconds and is conducted using UDP. While HA won’t react to failures during the election, once a master is elected, failures detected before and during the election will be handled. The election process is simple but robust. The host participating in the election with the greatest number of connected datastores will be elected master. If two or more hosts have the same number of datastores connected, the one with the highest Managed Object ID will be chosen. This, however, is done lexically, meaning that 99 beats 100, as 9 is larger than 1. The HA state of each host, including its role, is shown on the host’s Summary tab, as depicted in the screenshot below where the host is a master host.
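The lexical tie-breaker is easy to get wrong when reasoning about it, so here is a small sketch of the election ordering as described above. The host data is made up, and the function is my own illustration of the ordering rules, not FDM's actual election code:

```python
def elect_master(candidates):
    """Pick a master per the rules above: most connected datastores wins,
    ties broken by the lexically highest Managed Object ID string."""
    return max(candidates, key=lambda h: (h["datastores"], h["moid"]))

hosts = [
    {"name": "esxi-a", "datastores": 4, "moid": "100"},
    {"name": "esxi-b", "datastores": 4, "moid": "99"},
]
# "99" > "100" lexically (the first character '9' beats '1'), so esxi-b wins the tie.
```

This is exactly the counter-intuitive part: a host with MoID 99 outranks a host with MoID 100 because the comparison is on strings, not numbers.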
After a master is elected, each slave that has management network connectivity with it will set up a single secure, encrypted, TCP connection to the master. This secure connection is SSL-based. One thing to stress here though is that slaves do not communicate with each other after the master has been elected unless a re-election of the master needs to take place.
As stated earlier, when a master is elected it will try to acquire ownership of all of the datastores it can directly access, or access by proxying requests through one of the slaves connected to it using the management network. For traditional storage architectures it does this by locking a file called “protectedlist” that is stored on the datastores in an existing cluster. The master will also attempt to take ownership of any datastores it discovers along the way, and it will periodically retry any it could not take ownership of previously.
The naming format and location of this file is as follows:
/<root of datastore>/.vSphere-HA/<cluster-specific-directory>/protectedlist
For those wondering how “cluster-specific-directory” is constructed:
<uuid of vCenter Server>-<number part of the MoID of the cluster>-<random 8 char string>-<name of the host running vCenter Server>
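Assembling the path from its parts can be sketched as follows; all the values in the example are made-up placeholders, not real identifiers:

```python
def protectedlist_path(datastore_root, vc_uuid, cluster_moid_num, rand8, vc_hostname):
    """Build the protectedlist path following the naming format above.
    All example values passed in are hypothetical placeholders."""
    cluster_dir = f"{vc_uuid}-{cluster_moid_num}-{rand8}-{vc_hostname}"
    return f"{datastore_root}/.vSphere-HA/{cluster_dir}/protectedlist"
```

Knowing how the directory name is constructed helps when you spot multiple .vSphere-HA subdirectories on a shared datastore, for instance after a cluster was recreated or a datastore is shared across clusters.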
The master uses this protectedlist file to store the inventory. It keeps track of which VMs are protected by HA. Calling it an inventory might be slightly overstating: it is a list of protected VMs and it includes information around VM CPU reservation and memory overhead. The master distributes this inventory across all datastores in use by the VMs in the cluster. The next screenshot shows an example of this file on one of the datastores.
Now that we know the master locks a file on the datastore and that this file stores inventory details, what happens when the master is isolated or fails? If the master fails, the answer is simple: the lock will expire and the new master will relock the file if the datastore is accessible to it.
In the case of isolation, this scenario is slightly different, although the result is similar. The master will release the lock it has on the file on the datastore to ensure that when a new master is elected it can determine the set of VMs that are protected by HA by reading the file. If, by any chance, a master should fail right at the moment that it became isolated, the restart of the VMs will be delayed until a new master has been elected. In a scenario like this, accuracy and the fact that VMs are restarted is more important than a short delay.
Let’s assume for a second that your master has just failed. What will happen and how do the slaves know that the master has failed? HA uses a point-to-point network heartbeat mechanism. If the slaves have received no network heartbeats from the master, the slaves will try to elect a new master. This new master will read the required information and will initiate the restart of the VMs within roughly 10 seconds.
Restarting VMs is not the only responsibility of the master. It is also responsible for monitoring the state of the slave hosts and reporting this state to vCenter Server. If a slave fails or becomes isolated from the management network, the master will determine which VMs must be restarted. When VMs need to be restarted, the master is also responsible for determining the placement of those VMs. It uses a placement engine that will try to distribute the VMs so they can be restarted evenly across all available hosts. Depending on the version you are running, the placement mechanism may be slightly different.
So how did it work pre-vSphere 7?
- HA uses the cluster configuration
- HA uses the latest compatibility list it received from vCenter
- HA leverages a local copy of the DRS algorithm with a basic (fake) set of stats and runs the VMs through the algorithm
- HA receives a placement recommendation from the local algorithm and restarts the VM on the suggested host
- Within 5 minutes DRS runs within vCenter, and will very likely move the VM to a different host based on actual load
As you can imagine, this is far from optimal. So what was introduced in vSphere 7? Well, vSphere 7 introduces two different ways of doing placement for restarts:
- Remote Placement Engine
- Simple Placement Engine
The Remote Placement Engine, in short, is the ability for vSphere HA to make a call to DRS for a recommendation on the placement of a VM. This takes the current load of the cluster, the VM happiness, and all configured affinity/anti-affinity/VM-host affinity rules into consideration! Will this result in a much slower restart? The great thing is that the DRS algorithm has been optimized over the past years, and it is so fast that there will not be a noticeable difference between the old mechanism and the new mechanism. An added benefit for the engineering team, of course, is that they can remove the local DRS module, which means there’s less code to maintain. How this works is that the FDM master communicates with the FDM Manager, which runs in vCenter Server. FDM Manager then communicates with the DRS service to request a placement recommendation.
Now some of you will probably wonder what happens when vCenter Server is unavailable; this is where the Simple Placement Engine comes into play. The team has developed a new placement engine that basically takes a round-robin approach, but it does of course consider “must rules” (VM to Host) and the compatibility list. Note that affinity and anti-affinity rules are not considered when SPE is used instead of RPE! This is a known limitation, which is expected to be fixed in a future release. If a host, for instance, is not connected to the datastore of the VM that needs to be restarted, then that host is excluded from the list of potential placement targets. By the way, before I forget, version 7 also introduced a vCenter heartbeat mechanism as a result. HA will be heartbeating the vCenter Server instance to understand when it will need to resort to the Simple Placement Engine vs the Remote Placement Engine.
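To make the round-robin idea concrete, here is a simplified sketch of what compatibility-aware round-robin placement could look like. This is my own simplification for illustration, not VMware's actual SPE implementation:

```python
from itertools import cycle

def simple_placement(vms, hosts, compatible):
    """Round-robin placement sketch in the spirit of the Simple Placement
    Engine: walk the hosts in order, skipping any host that is not in the
    VM's compatibility set (e.g. not connected to the VM's datastore, or
    excluded by a must-rule). Purely illustrative."""
    placement, host_ring = {}, cycle(hosts)
    for vm in vms:
        for _ in range(len(hosts)):     # try each host at most once per VM
            host = next(host_ring)
            if host in compatible[vm]:
                placement[vm] = host
                break
    return placement
```

Notice what is missing compared to RPE: no load awareness and no affinity/anti-affinity rules, which matches the limitation described above.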
I dug through the FDM log to find some proof of these new mechanisms, (/var/log/fdm.log) and found an entry that shows there are indeed two placement engines:
Invoking the RPE + SPE Placement Engine
RPE stands for “remote placement engine”, and SPE for “simple placement engine”, where Remote of course refers to DRS. You may ask yourself, how do you know if DRS is being called? Well, that is something you can see in the DRS log files; when a placement request is received, the below entry shows up in the log file:
FdmWaitForUpdates-vim.ClusterComputeResource:domain-c8-26307464
This even happens when DRS is disabled, and also when you use a license edition which does not include DRS, which is really cool if you ask me. If for whatever reason vCenter Server is unavailable, and as a result DRS can’t be called, you will see this mentioned in the FDM log and, as shown below, HA will use the Simple Placement Engine’s recommendation for the placement of the VM:
Invoke the placement service to process the placement update from SPE
Hopefully that explains the placement logic which was introduced in 7.x and is part of the responsibility of the Master.
Okay, let’s get back to this master/slave role. All of these responsibilities are really important, but without a mechanism to detect if a slave has failed, the master would be useless. Just like the slaves receive heartbeats from the master, the master receives heartbeats from the slaves so it knows they are alive.
Slaves
A slave has substantially fewer responsibilities than a master: a slave monitors the state of the VMs it is running and informs the master about any changes to this state.
The slave also monitors the health of the master by monitoring heartbeats. If the master becomes unavailable, the slaves initiate and participate in the election process. Last but not least, the slaves send heartbeats to the master so that the master can detect outages. Like the master-to-slave communication, all slave-to-master communication is point-to-point. HA does not use multicast.
Files for both Slave and Master
Before explaining the details it is important to understand that both vSAN and vVols have introduced changes to the location and the usage of files. For specifics on these two different storage architectures, there’s a section dedicated to it at the end of this post.
Both the master and slave use files not only to store state but also as a communication mechanism. We’ve already seen the protectedlist file used by the master to store the list of protected VMs. I will now discuss the files that are created by both the master and the slaves. Remote files are files stored on a shared datastore and local files are files that are stored in a location only directly accessible to that host.
Remote Files
The set of powered on VMs is stored in a per-host “poweron” file. It should be noted that, because a master also hosts VMs, it also creates a “poweron” file.
The naming scheme for this file is as follows: host-number-poweron
Tracking VM power-on state is not the only thing the “poweron” file is used for. This file is also used by a slave to inform the master that it is isolated from the management network: the top line of the file will contain either a 0 or a 1. A 0 (zero) means not-isolated and a 1 (one) means isolated. The master will inform vCenter about the isolation of the host.
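Given that description, reading a “poweron” file could be sketched like this. The isolation flag handling follows the text above; treating the remaining lines as the powered-on VM entries is my own assumption, since the exact file layout is not documented here:

```python
def parse_poweron(content):
    """Parse poweron-file content per the description above: the first line
    is the isolation flag (0 = not isolated, 1 = isolated). Treating the
    remaining lines as powered-on VM entries is an assumption; the real
    layout may differ per release."""
    lines = content.strip().splitlines()
    return {"isolated": lines[0].strip() == "1", "vms": lines[1:]}
```

The important operational takeaway is the flag in the first line: it is how an isolated slave signals its state to the master through shared storage when the network is gone.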
Local Files
As mentioned before, when HA is configured on a host, the host will store specific information about its cluster locally.
Each host, including the master, will store data locally. The data that is locally stored is important state information: namely, the VM-to-host compatibility matrix, the cluster configuration, and the host membership list. Updates to this information are sent to the master by vCenter and propagated by the master to the slaves. Although I expect that most of you will never touch these files – and I highly recommend against modifying them – I do want to explain how they are used:
- clusterconfig This file is not human-readable. It contains the configuration details of the cluster.
- vmmetadata This file is not human-readable. It contains the actual compatibility info matrix for every HA protected VM and lists all the hosts with which it is compatible plus a vm/host dictionary
- fdm.cfg This file contains the configuration settings around logging. For instance, the level of logging and syslog details are stored here.
- hostlist A list of hosts participating in the cluster, including hostname, IP addresses, MAC addresses and heartbeat datastores.
Now although vmmetadata and clusterconfig are not human-readable, this does not mean it is impossible to know what information is stored in them. The script prettyPrint.sh allows you to print the information in the four files above. For example, the command below prints the clusterconfig information.
/opt/vmware/fdm/fdm/prettyPrint.sh clusterconfig
If you use this command with “-h”, all options will be provided; I feel these speak for themselves. When troubleshooting, however, especially “hostlist” and “vmmetadata” will come in handy. The “hostlist” parameter will give you the host name and host identifier, which makes the fdm.log easier to digest. The screenshot below displays part of the information provided by the “hostlist” parameter.
The “vmmetadata” parameter displays the compatibility list. Although I haven’t covered the compatibility list yet, it is good to know that it contains information about which VM can be restarted on which host.
Heartbeating
I mentioned it a couple of times already in this section, and it is an important mechanism that deserves its own section: heartbeating. Heartbeating is the mechanism used by HA to validate whether a host is alive. HA has two different heartbeating mechanisms. These heartbeat mechanisms allow it to determine what has happened to a host when it is no longer responding. Let’s discuss traditional network heartbeating first.
Network Heartbeating
Network heartbeating is used by HA to determine if an ESXi host is alive. Each slave will send a heartbeat to its master and the master sends a heartbeat to each of the slaves; this is point-to-point communication. These heartbeats are sent every second by default.
When a slave isn’t receiving any heartbeats from the master, it will try to determine whether it is isolated. I will discuss “states” in more detail later in this section.
Basic design principle: Network heartbeating is key for determining the state of a host. Ensure the management network is highly resilient to enable proper state determination.
Datastore Heartbeating
Datastore heartbeating adds an extra level of resiliency and prevents unnecessary restart attempts from occurring as it allows vSphere HA to determine whether a host is isolated from the network or is completely unavailable. How does this work?
Datastore heartbeating enables a master to determine the state of a host that is not reachable via the management network. This datastore heartbeat mechanism is used when the master has lost network connectivity with one, or multiple, slaves. The datastore heartbeat mechanism is then used to validate whether a host has failed or is merely isolated/network partitioned. Isolation will be validated through the “poweron” file which, as mentioned earlier, will be updated by the host when it is isolated. Without the “poweron” file, there is no way for the master to validate isolation. Let that be clear! Based on the results of the checks of these files, the master will determine the appropriate action to take. If the master determines that a host has failed (no datastore heartbeats), the master will restart the failed host’s VMs. If the master determines that the slave is isolated or partitioned, it will only take action when appropriate, meaning that the master will only initiate restarts when VMs have failed, or were powered off / shut down by a triggered isolation response.
By default, HA selects 2 heartbeat datastores – it will select datastores that are available on all hosts, or as many as possible. Although it is possible to configure an advanced setting (das.heartbeatDsPerHost) to allow for more datastores for datastore heartbeating I do not recommend configuring this option as the default should be sufficient for most scenarios, except for stretched cluster environments where it is recommended to have two in each site manually selected. This is extensively discussed in the Stretched Cluster section of this book.
The selection process gives preference to VMFS over NFS datastores, and seeks to choose datastores that are backed by different LUNs or NFS servers when possible. If desired, you can also select the heartbeat datastores yourself. I, however, recommend letting vCenter deal with this operational “burden” as vCenter uses a selection algorithm to select heartbeat datastores that are presented to all hosts. This however is not a guarantee that vCenter can select datastores that are connected to all hosts. It should be noted that vCenter is not site-aware. In scenarios where hosts are geographically dispersed, it is recommended to manually select heartbeat datastores to ensure each site has one site-local heartbeat datastore at minimum. More on this topic is covered in the Use Case section of this book, which discusses metro cluster deployments.
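To make the selection logic above more concrete, here is a minimal Python sketch of the described heuristic: prefer datastores connected to the most hosts, prefer VMFS over NFS, and try to pick datastores backed by different LUNs or NFS servers. The function name and data layout are illustrative only, not VMware’s actual implementation.

```python
def select_heartbeat_datastores(datastores, count=2):
    """Pick `count` heartbeat datastores, preferring those visible to the
    most hosts, VMFS over NFS, and distinct backing LUNs/NFS servers."""
    # Rank: most hosts connected first, then VMFS before NFS.
    ranked = sorted(
        datastores,
        key=lambda d: (-d["hosts_connected"], 0 if d["type"] == "VMFS" else 1),
    )
    selected, backings = [], set()
    for ds in ranked:  # first pass: insist on unique backing devices
        if ds["backing"] not in backings:
            selected.append(ds)
            backings.add(ds["backing"])
        if len(selected) == count:
            return selected
    for ds in ranked:  # fall back: allow shared backings if we must
        if ds not in selected:
            selected.append(ds)
        if len(selected) == count:
            break
    return selected
```

Given one NFS datastore and two VMFS datastores on the same LUN, all visible to every host, the sketch picks one VMFS datastore and the NFS datastore, mirroring the “different backing when possible” preference.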
Basic design principle: In a metro-cluster / geographically dispersed cluster I recommend setting the minimum number of heartbeat datastores to four. It is recommended to manually select site local datastores, two for each site.
The question now arises: what, exactly, is this datastore heartbeating, and which datastore is used for this heartbeating? Let’s answer which datastore is used for datastore heartbeating first as I can simply show that with a screenshot, see below. vSphere displays extensive details around the “Cluster Status” on the Cluster’s Monitor tab. This, for instance, shows you which datastores are being used for heartbeating currently and which hosts are using which specific datastore(s).
In block based storage environments HA leverages an existing VMFS file system mechanism. The datastore heartbeat mechanism uses a so called “heartbeat region” which is updated as long as the file is open. On VMFS datastores, HA will simply check whether the heartbeat region has been updated. In order to update a datastore heartbeat region, a host needs to have at least one open file on the volume. HA ensures there is at least one file open on this volume by creating a file specifically for datastore heartbeating. In other words, a per-host file is created on the designated heartbeating datastores, as shown below. The naming scheme for this file is as follows: host-number-hb.
On NFS datastores, each host will write to its heartbeat file once every 5 seconds, ensuring that the master will be able to check host state. The master will simply validate this by checking that the time stamp of the file changed.
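The timestamp check described above can be illustrated with a tiny sketch: a host counts as heartbeating if its file was modified within a staleness window. The 5-second update interval comes from the text; the number of missed updates tolerated is an assumption made purely for this example.

```python
def host_heartbeat_alive(last_mtime, now, update_interval=5, missed_allowed=2):
    """A host counts as heartbeating if its heartbeat file was modified
    within `missed_allowed` update intervals (values illustrative)."""
    return (now - last_mtime) <= update_interval * missed_allowed
```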
Realize that in the case of a converged network environment, the effectiveness of datastore heartbeating will vary depending on the type of failure. For instance, a NIC failure could impact both network and datastore heartbeating. If, for whatever reason, the datastore or NFS share becomes unavailable or is removed from the cluster, HA will detect this and select a new datastore or NFS share to use for the heartbeating mechanism. Unless of course you have selected the option “select only from my preferred datastores” and none of the preferred datastores is available.
Basic design principle: Datastore heartbeating adds a new level of resiliency but is not the be-all end-all. In converged networking environments, the use of datastore heartbeating adds little value due to the fact that a NIC failure may result in both the network and storage becoming unavailable.
Isolated versus Partitioned
I’ve already briefly touched on it and it is time to have a closer look. When it comes to network failures, there are two different states that exist. What are these exactly, and when is a host Partitioned rather than Isolated? Before I explain this, I want to point out that there is a difference between the state as reported by the master and the state as observed by an administrator, and each has its own characteristics.
I would recommend everyone to read the following bullet points thoroughly (and multiple times) as the terminology in these situations is often used incorrectly. It sounds like it is just semantics, but there’s a big difference in how vSphere HA responds to an Isolation versus how it responds to a Partition.
Let’s be very clear and define each state:
- An isolation event is the situation where a single host cannot communicate with the rest of the cluster. Note: single host!
- A partition is the situation where two (or more) hosts can communicate with each other, but no longer can communicate with the remaining two (or more) hosts in the cluster. Note: two or more!
Having said that, you can also find yourself in the situation where multiple hosts are isolated simultaneously. Although chances are slim, this can occur when, for instance, a change is made to the network and various hosts of a single cluster lose access to the management network. Anyway, let’s take a look at Partition and Isolation events a bit more in-depth.
The diagram below shows possible ways in which an Isolation or a Partition can occur.
If a cluster is partitioned in multiple segments, each partition will elect its own master, meaning that if you have 4 partitions your cluster will have 4 masters. When the network partition is corrected, one of the four masters will take over the role and be responsible for the cluster again. This will be done using the election algorithm (most connected datastores, highest lexical number). It should be noted that a master could claim responsibility for a VM that lives in a different partition. If this occurs and the VM happens to fail, the master will be notified through the datastore communication mechanism.
In the HA architecture, whether a host is partitioned is determined by the master reporting the condition. So, in the above example, the master on host ESXi-01 will report ESXi-03 and ESXi-04 partitioned while the master on host ESXi-03 will report ESXi-01 and ESXi-02 partitioned. When a partition occurs, vCenter reports the perspective of one master.
A master reports a host as partitioned or isolated when it can’t communicate with the host over the management network but can still observe the host’s datastore heartbeats via the heartbeat datastores. The master cannot differentiate between these two states on its own – a host is reported as isolated only if the host informs the master via the datastores that it is isolated.
This still leaves open the question of how the master differentiates between a Failed, Partitioned, or Isolated host.
When the master stops receiving network heartbeats from a slave, it will check for host “liveness” for the next 15 seconds. Before the host is declared failed, the master will validate if it has actually failed or not by doing additional liveness checks. First, the master will validate if the host is still heartbeating to the datastore. Second, the master will ping the management IP address of the host. If both are negative, the host will be declared Failed. This doesn’t necessarily mean the host has PSOD’ed; it could be the network is unavailable, including the storage network, which would make this host Isolated from an administrator’s perspective but Failed from an HA perspective. As you can imagine, however, there are various combinations possible. The following table depicts these combinations including the “state”.
State | Network Heartbeat | Storage Heartbeat | Host Liveness Ping | Isolation Criteria Met |
---|---|---|---|---|
Running | Yes | N/A | N/A | N/A |
Isolated | No | Yes | No | Yes |
Partitioned | No | Yes | No | No |
Failed | No | No | No | N/A |
FDM Agent Down | N/A | N/A | Yes | N/A |
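The table above translates almost directly into code. The sketch below is a simplification for illustration only: it checks network heartbeats, then the liveness ping, then datastore heartbeats, which approximates the N/A columns of the “FDM Agent Down” row.

```python
def host_state(net_hb, ds_hb, liveness_ping, isolation_declared):
    """Map the four signals from the state table to the reported state.
    Simplified illustration, not the actual FDM decision logic."""
    if net_hb:
        return "Running"
    if liveness_ping:
        return "FDM Agent Down"   # host answers pings but sends no heartbeats
    if not ds_hb:
        return "Failed"           # no network and no datastore heartbeats
    # Datastore heartbeats present, no network, no ping response:
    return "Isolated" if isolation_declared else "Partitioned"
```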
HA will trigger an action based on the state of the host. When the host is marked as Failed, a restart of the VMs will be initiated. When the host is marked as Isolated, the master might initiate restarts, depending on whether the isolation response was triggered and the VMs were actually powered off.
The one thing to keep in mind when it comes to isolation response is that a VM will only be shut down or powered off when the isolated host knows there is a master out there that has taken ownership for the VM or when the isolated host loses access to the home datastore of the VM.
For example, if a host is isolated and runs two VMs, stored on separate datastores, the host will validate if it can access each of the home datastores of those VMs. If it can, the host will validate whether a master owns these datastores. If no master owns the datastores, the isolation response will not be triggered and restarts will not be initiated. If the host does not have access to the datastore, for instance, during an “All Paths Down” condition, HA will trigger the isolation response to ensure the “original” VM is powered down and will be safely restarted. This to avoid so-called “split-brain” scenarios.
To reiterate, as this is a very important aspect of HA and how it handles network isolations, the remaining hosts in the cluster will only be requested to restart VMs when the master has detected that either the host has failed or has become isolated and the isolation response was triggered.
VM Protection
VM protection happens on several layers but is ultimately the responsibility of vCenter. I have explained this briefly but I want to expand on it a bit more to make sure everyone understands the dependency on vCenter when it comes to protecting VMs. I do want to stress that this only applies to protecting VMs; VM restarts in no way require vCenter to be available at the time.
When the state of a VM changes, vCenter will direct the master to enable or disable HA protection for that VM. Protection, however, is only guaranteed when the master has committed the change of state to disk. The reason for this, of course, is that a failure of the master would result in the loss of any state changes that exist only in memory. As pointed out earlier, this state is distributed across the datastores and stored in the “protectedlist” file.
When the power state change of a VM has been committed to disk, the master will inform vCenter Server so that the change in status is visible both for the user in vCenter and for other processes like monitoring tools. Within the vSphere Client you can validate that a VM has been protected on the VM’s summary page as displayed in the next screenshot. As shown, the UI also provides information about the types of failures HA can handle for this particular VM.
To clarify the process, I have created a workflow diagram of the protection of a VM from the point it is powered on through vCenter:
But what about “unprotection?” When a VM is powered off, it must be removed from the protectedlist. I have documented this workflow in the following diagram for the situation where the power off is invoked from vCenter.
I realize a lot of new terminology and concepts have been introduced in this section. Understanding these new concepts is critical for the availability of your workloads, and in some cases critical for a successful restart of your VMs.
Restarting VMs
In the previous section, I have described most of the lower level fundamental concepts of HA. I have shown you that multiple mechanisms increase resiliency and reliability of HA. Reliability of HA in this case mostly refers to restarting (or resetting) VMs, as that remains HA’s primary task.
HA will respond when the state of a host has changed, or, better said, when the state of one or more VMs has changed. There are multiple scenarios in which HA will respond to a VM failure, the most common of which are listed below:
- Failed host
- Isolated host
- Failed guest operating system
Depending on the type of failure, but also depending on the role of the host, the process will differ slightly. Changing the process results in slightly different recovery timelines. There are many different scenarios and there is no point in covering all of them, so I will try to describe the most common scenario and include timelines where possible.
Throughout this section I will describe theoretical restart times, please realize that these timings are based on optimal scenarios with maximum availability of resources and no constraints whatsoever. In real life the restart of a VM may take slightly longer, this depends on many variables, some of which I have listed below. Note that this is an example of what may impact restart times, by no means a full list.
- Availability of resources
- Network performance
- Storage performance
- Speed of CPU and available CPUs/Cores
- Speed of memory and available capacity
- Number of VMs impacted
- Number of hosts impacted
Before we dive into the different failure scenarios, I want to explain how restart priority and retries work.
Restart Priority and Order
A feature of HA that has always been a hot topic of discussion is Restart Priority and Order. The main reason for the debate was the lack of proper prioritization and the inability to specify dependencies between VMs. This completely changed with the arrival of vSphere 6.5, where the restart mechanism was redesigned and new functionality was introduced that gives you the ability to specify dependencies between VMs. Since the early days, HA has been able to take the configured priority of a VM into account when restarting VMs. However, it is good to know that Agent VMs take precedence during the restart procedure, as the “regular” VMs may rely on them. Although Agent VMs are not common, one use case would be a virtual storage appliance.
Pre vSphere 6.5 prioritization was done by each host and not globally. Each host that had been requested to initiate restart attempts would attempt to restart all top priority VMs before attempting to start any other VMs. If the restart of a top priority VM failed, it would be retried after a delay. In the meantime, however, HA would continue powering on the remaining VMs. Keep in mind that some VMs could have been dependent on the agent VMs.
Basic design principle: VMs can be dependent on the availability of agent VMs or other VMs. Although HA will do its best to ensure all VMs are started in the correct order, this is not guaranteed. Document the proper recovery process.
Besides agent VMs, HA also prioritizes FT secondary machines. I have listed the full order in which VMs will be restarted below:
- Agent VMs
- FT secondary VMs
- VMs configured with a restart priority of highest
- VMs configured with a restart priority of high
- VMs configured with a restart priority of medium
- VMs configured with a restart priority of low
- VMs configured with a restart priority of lowest
The priority by default is set to medium for the whole cluster and this can be changed in the “VM Overrides” section of the UI.
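The resulting order can be expressed as a simple sort key: agent VMs first, FT secondaries next, then the five priority tiers. The field names below are made up for the example; this is just a sketch of the ordering described above.

```python
# Priority tiers as listed above; "medium" is the cluster default.
PRIORITY_RANK = {"highest": 0, "high": 1, "medium": 2, "low": 3, "lowest": 4}

def restart_order(vms):
    """Sort VMs into HA restart order: agent VMs, FT secondaries,
    then by configured restart priority (illustrative field names)."""
    def key(vm):
        if vm.get("agent"):
            return (0, 0)
        if vm.get("ft_secondary"):
            return (1, 0)
        return (2, PRIORITY_RANK[vm.get("priority", "medium")])
    return sorted(vms, key=key)
```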
After you have specified the priority, you can also specify whether there needs to be an additional delay before the next batch can be started, or even what triggers the next priority “group”. This could, for instance, be the VMware Tools guest heartbeat, as shown in the screenshot below. The other options are “resources allocated” (which is purely the scheduling of the batch itself, i.e. the old behavior), the completion of the power-on event, or “app heartbeat” detection. That last one is definitely the most complex, as you would need to have App HA enabled, services defined, etc. I suspect that people who use this will mostly set it to “Guest Heartbeats detected”, as that is the easiest and most reliable option.
If for whatever reason there is never a guest heartbeat, or it simply takes a long time, there is also a timeout value that can be specified. By default this is 600 seconds; it can be decreased or increased, depending on what you prefer.
In case you are wondering, yes you can also set a restart priority for vCenter Server. All changes to the restart priority are stored in the cluster configuration. You can examine this if needed through the script I discussed earlier called prettyPrint.sh, simply type the following:
/opt/vmware/fdm/fdm/prettyPrint.sh clusterconfig
The output will look something like the example below. I recommend searching for the word “restartPriority” to find the changes you have made, as the output will be more than 100 lines.
The Restart Priority functionality is primarily intended for large groups of VMs, if you have thousands of VMs you can select those ten / twenty VMs and change priority so that they will be powered-on first. However, if you for instance have a 3-tier app and you need the database server to be powered on before the app server then you can also use VM/VM rules as of vSphere 6.5, this functionality is typically referred to as HA Orchestrated Restart.
You can configure HA Orchestrated Restarts by simply creating “VM” Groups. In the example below I created a VM Group called App with the application VM in there. I have also created a DB group with the Database VM in there.
This application has a dependency on the Database VM to be fully powered-on, so we specified this in a rule as shown in the below screenshot.
Now, one thing to note here is that, in terms of dependency, the next group of VMs in the rule will be powered on when the cluster-wide “VM Dependency Restart Condition” is met. This is a mandatory rule, also known as a hard rule. If this is set to “Resources Allocated”, which is the default, then the next group of VMs will be restarted literally a split second later. Think about how to set the “VM Dependency Restart Condition”, as otherwise the rule may be useless. Also realize that if the VM Dependency Restart Condition cannot be met, the next group of VMs will not be restarted.
Basic design principle: For both restart priority and orchestrated restart it is important to think about when the next batch should be restarted. vSphere allows you to configure it in various different ways, take advantage of the flexibility offered.
It should be noted that HA will not place any VMs on a host if the required number of agent VMs are not running on the host at the time placement is done.
Now that we have briefly touched on it, I would also like to address “restart retries” and parallelization of restarts as that more or less dictates how long it could take before all VMs of a failed or isolated host are restarted. Note that the use of Restart Priorities and/or the use of Orchestrated Restart will impact restart timing, but let’s take a look at restart retries first before I discuss restarting timing.
Restart Retries
The number of retries is configurable as of vCenter 2.5 U4 with the advanced option “das.maxvmrestartcount”. The default value is 5. Note that the initial restart is included in this number.
HA will try to start the VM on one of your hosts in the affected cluster; if this is unsuccessful on that host, the restart count will be increased by 1. Before I go into the exact timeline, let it be clear that T0 is the point at which the master initiates the first restart attempt. This by itself could be 30 seconds after the VM has failed. The elapsed time between the failure of the VM and the restart, though, will depend on the scenario of the failure, which I will discuss in this section.
As said, the default number of restarts is 5. There are specific times associated with each of these attempts. The following bullet list will clarify this concept. The ‘m’ stands for “minutes” in this list.
- T0 – Initial Restart
- T2m – Restart retry 1
- T6m – Restart retry 2
- T14m – Restart retry 3
- T30m – Restart retry 4
As clearly depicted in the diagram above, a successful power-on attempt could take up to ~30 minutes in the case where multiple power-on attempts are unsuccessful. This is, however, not exact science. For instance, there is a 2-minute waiting period between the initial restart and the first restart retry. HA will start the 2-minute wait as soon as it has detected that the initial attempt has failed. So, in reality, the first retry could start at T2m plus a number of seconds. Another important fact I want to emphasize is that there is no coordination between masters, so if multiple masters are involved in trying to restart the VM, each will maintain its own sequence. Multiple masters could attempt to restart a VM; although only one will succeed, it might change some of the timelines.
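The retry offsets listed above follow a doubling back-off: 2, 4, 8, and 16 minutes between attempts. The sketch below reproduces that schedule for the default of five attempts; whether the delay would keep doubling beyond that (if das.maxvmrestartcount is raised) is not stated in the text, so the extrapolation is an assumption.

```python
def restart_attempt_times(max_attempts=5, first_delay_min=2):
    """Return attempt times in minutes after T0, using a doubling
    back-off between attempts (2, 4, 8, 16, ... minutes)."""
    times, t, delay = [0], 0, first_delay_min
    for _ in range(max_attempts - 1):
        t += delay
        times.append(t)
        delay *= 2
    return times
```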
What about VMs which are “disabled” for HA or VMs that are powered off? What will happen to those VMs? Before vSphere 6.0, those VMs would be left alone; as of vSphere 6.0, these VMs will be registered on another host after a failure. This allows you to easily power on those VMs when needed, without needing to manually re-register them yourself. Note that HA will not power on the VMs, it will just register them for you! (A bug in vSphere 6.0 U2 prevents this from happening; you need vSphere 6.0 U3 for this functionality to work.)
Let’s give an example to clarify the scenario in which a master fails during a restart sequence:
Cluster: 4 Host (esxi01, esxi02, esxi03, esxi04)
Master: esxi01
The host “esxi02” is running a single VM called “vm01” and it fails. The master, esxi01, will try to restart it but the attempt fails. It will try restarting “vm01” up to 5 times but, unfortunately, on the 4th try, the master also fails. An election occurs and “esxi03” becomes the new master. It will now initiate the restart of “vm01”, and if that restart fails, it will retry up to 4 more times, for a total of 5 attempts including its initial restart.
Be aware, though, that a successful restart might never occur if the restart count is reached and all five restart attempts (the default value) were unsuccessful.
When it comes to restarts, one thing that is very important to realize is that HA will not issue more than 32 concurrent power-on tasks on a given host. To make that more clear, let’s use the example of a two host cluster: if a host fails which contained 33 VMs and all of these had the same restart priority, 32 power on attempts would be initiated. The 33rd power on attempt will only be initiated when one of those 32 attempts has completed regardless of success or failure of one of those attempts.
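The 32-task cap behaves like a fixed-size worker pool: the 33rd power-on can only start once one of the 32 in-flight tasks completes, regardless of success or failure. As an analogy only (`power_on` is a stand-in, not a real vSphere API), this can be modeled with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def power_on(vm):
    """Hypothetical stand-in for issuing a power-on task."""
    return f"{vm} powered on"

def restart_all(vms, cap=32):
    """Issue power-on tasks with at most `cap` running concurrently,
    mimicking HA's 32-concurrent-power-on limit per host."""
    with ThreadPoolExecutor(max_workers=cap) as pool:
        return list(pool.map(power_on, vms))
```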
Note, pre-vSphere 6.5, if there were 31 low-priority VMs to be powered on and a single high-priority VM, the power on attempt for the low-priority VMs would be issued at the same time as the power on attempt for the high priority VM. This has changed with vSphere 6.5 as mentioned earlier, as now you have the ability to specify when the next batch should be restarted. By default however this is “resources allocated”, which equals the pre-vSphere 6.5 behavior.
Basic design principle: Configuring restart priority alone of a VM is not a guarantee that the power on of the VMs will actually be completed in this order. Ensure proper operational procedures are in place for restarting services or VMs in the appropriate order in the event of a failure.
Now that we know how VM restart priority and restart retries are handled, it is time to look at the different scenarios.
- Failed host
- Failure of a master
- Failure of a slave
- Isolated host and response
Failed Host
When discussing a failed host scenario, a distinction needs to be made between the failure of a master and the failure of a slave. I want to emphasize this because the time it takes before a restart attempt is initiated differs between these two scenarios. Although the majority of you probably won’t notice the time difference, it is important to call out. Let’s start with the most common failure, that of a host failing, but note that failures generally occur infrequently. In most environments, hardware failures are very uncommon to begin with. Just in case it happens, it doesn’t hurt to understand the process and its associated timelines.
The Failure of a Slave
The failure of a slave host is a fairly complex scenario. Part of this complexity comes from the introduction of a new heartbeat mechanism. Actually, there are two different scenarios: one where heartbeat datastores are configured and one where heartbeat datastores are not configured. Keeping in mind that this is an actual failure of the host, the timeline is as follows:
- T0 – Slave failure.
- T3s – Master begins monitoring datastore heartbeats for 15 seconds.
- T10s – The host is declared unreachable and the master will ping the management network of the failed host. This is a continuous ping for 5 seconds.
- T15s – If no heartbeat datastores are configured, the host will be declared dead.
- T18s – If heartbeat datastores are configured, the host will be declared dead.
The master monitors the network heartbeats of a slave. When the slave fails, these heartbeats will no longer be received by the master. I have defined this as T0. After 3 seconds (T3s), the master will start monitoring for datastore heartbeats and it will do this for 15 seconds. On the 10th second (T10s), when no network or datastore heartbeats have been detected, the host will be declared as “unreachable”. The master will also start pinging the management network of the failed host at the 10th second and it will do so for 5 seconds. If no heartbeat datastores were configured, the host will be declared “dead” at the 15th second (T15s) and VM restarts will be initiated by the master. If heartbeat datastores have been configured, the host will be declared dead at the 18th second (T18s) and restarts will be initiated. I realize that this can be confusing and hope the timeline depicted in the diagram below makes it easier to digest.
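For reference, here is the timeline above expressed as a small lookup, with offsets in seconds after T0 (the moment network heartbeats stop). The event descriptions come straight from the text.

```python
def slave_failure_timeline(heartbeat_datastores_configured):
    """Return (offset_seconds, event) pairs for the slave-failure
    detection sequence described above."""
    events = [
        (0, "slave fails; network heartbeats stop"),
        (3, "master starts monitoring datastore heartbeats (15s window)"),
        (10, "host declared unreachable; master pings management IP for 5s"),
    ]
    if heartbeat_datastores_configured:
        events.append((18, "host declared dead; VM restarts initiated"))
    else:
        events.append((15, "host declared dead; VM restarts initiated"))
    return events
```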
The master filters the VMs it thinks have failed before initiating restarts, using the protectedlist. On-disk state can be obtained by only one master at a time, since it requires opening the protectedlist file in exclusive mode. In the case of a network partition, multiple masters could try to restart the same VM, as vCenter Server also provides the necessary details for a restart. As an example, one master may have locked a VM’s home datastore and have access to the protectedlist, while the other master is in contact with vCenter Server and as such is aware of the current desired protected state. In this scenario, the master which does not own the home datastore of the VM could restart the VM based on the information provided by vCenter Server.
This change in behavior was introduced to avoid the scenario where a restart of a VM would fail due to insufficient resources in the partition which was responsible for the VM. With this change, there is less chance of such a situation occurring as the master in the other partition would be using the information provided by vCenter Server to initiate the restart.
That leaves us with the question of what happens in the case of the failure of a master.
The Failure of a Master
In the case of a master failure, the process and the associated timeline are slightly different. The reason is that there needs to be a master before any restart can be initiated. This means that an election will need to take place amongst the slaves. The timeline is as follows:
- T0 – Master failure.
- T10s – Master election process initiated.
- T25s – New master elected and reads the protectedlist.
- T35s – New master initiates restarts for all VMs on the protectedlist which are not running.
Slaves receive network heartbeats from their master. If the master fails, let’s define this as T0 (T zero), the slaves detect this when the network heartbeats cease to be received. As every cluster needs a master, the slaves will initiate an election at T10s. The election process takes 15s to complete, which brings us to T25s. At T25s, the new master reads the protectedlist. This list contains all the VMs, which are protected by HA. At T35s, the master initiates the restart of all VMs that are protected but not currently running. The timeline depicted in the diagram below hopefully clarifies the process.
Besides the failure of a host, there is another reason for restarting VMs: an isolation event.
Isolation Response and Detection
Before I will discuss the timeline and the process around the restart of VMs after an isolation event, I will discuss Isolation Response and Isolation Detection. One of the first decisions that will need to be made when configuring HA is the “Isolation Response”.
Isolation Response
The Isolation Response (or Host Isolation as it is called in vSphere 6.0) refers to the action that HA takes for its VMs when the host has lost its connection with the network and the remaining nodes in the cluster. This does not necessarily mean that the whole network is down; it could just be the management network ports of this specific host. Today there are three isolation responses: “Disabled”, “Power off”, and “Shut down”. In previous versions (pre vSphere 6.0) there was an isolation response called “leave powered on”, this has been renamed to “disabled” as “leave powered on” means that there is no response to an isolation event.
The isolation response feature answers the question, “what should a host do with the VMs it manages when it detects that it is isolated from the network?” Let’s discuss these three options more in-depth:
- Disabled (default) – When isolation occurs on the host, the state of the VMs remains unchanged.
- Power off and restart VMs – When isolation occurs, all VMs are powered off. It is a hard stop, or, to put it bluntly, the “virtual” power cable of the VM will be pulled out!
- Shut down and restart VMs – When isolation occurs, all VMs running on the host will be shut down using a guest-initiated shutdown through VMware Tools. If this is not successful within 5 minutes, a “power off” will be executed. This time-out value can be adjusted by setting the advanced option das.isolationShutdownTimeout. If VMware Tools is not installed, a “power off” will be initiated immediately.
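The “Shut down and restart VMs” behavior described above boils down to: try a guest-initiated shutdown, fall back to a hard power off once the timeout (default 5 minutes, i.e. 300 seconds, adjustable via das.isolationShutdownTimeout) expires, and skip straight to power off when VMware Tools is absent. A sketch with hypothetical stand-in callbacks:

```python
def isolation_shutdown(vm, guest_shutdown, power_off, tools_installed,
                       timeout_s=300):
    """Model the 'Shut down and restart VMs' isolation response.
    `guest_shutdown` and `power_off` are hypothetical callbacks."""
    if not tools_installed:
        power_off(vm)                  # no VMware Tools: immediate hard stop
        return "powered off"
    if guest_shutdown(vm, timeout_s):  # clean shutdown within the timeout
        return "shut down"
    power_off(vm)                      # timeout expired: hard power off
    return "powered off"
```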
This setting can be changed on the cluster settings under the option “Response for Host Isolation” in the vSphere Client. Note that this differs from the Web Client, as this used to be located under “VM Options”. It is also possible to override the default or selected behavior on a per VM basis. This can be done in the VM Overrides section of the vSphere Client by selecting the appropriate VMs and then selecting the “Override” option for Host isolation response and selecting the appropriate isolation response.
The default setting for the isolation response has changed multiple times over the years, which has caused some confusion. Below you can find what changed with which version.
- Up to ESXi3.5 U2 / vCenter 2.5 U2 the default isolation response was “Power off”
- With ESXi3.5 U3 / vCenter 2.5 U3 this was changed to “Leave powered on”
- With vSphere 4.0 it was changed to “Shut down”.
- With vSphere 5.0 it was changed back to “Leave powered on”.
- With vSphere 6.0 the “leave powered on” setting was renamed to “Disabled”.
Keep in mind that these changes are only applicable to newly created clusters. When creating a new cluster, it may be required to change the default isolation response based on the configuration of existing clusters and/or your customer’s requirements, constraints and expectations. When upgrading an existing cluster, it might be wise to apply the latest default values. You might wonder why the default has changed once again. There was a lot of feedback from customers that “Disabled” was the desired default value.
Basic design principle: Before upgrading an environment to later versions, ensure you validate the best practices and default settings. Document them, including justification, to ensure all people involved understand your reasons.
The question remains, which setting should be used? The obvious answer applies here; it depends. I prefer “Disabled” for traditional environments because it eliminates the chance of a false positive and its associated downtime. One of the problems that people experienced in the past was that HA triggered its isolation response when the full management network went down, resulting in the power off (or shutdown) of every single VM with none being restarted. This problem has been mitigated: HA will validate whether VM restarts can be attempted – there is no reason to incur any downtime unless absolutely necessary. It does this by validating that a master owns the datastore the VM is stored on. Of course, the isolated host can only validate this if it has access to the datastores. In a converged network environment with iSCSI storage, for instance, it would be impossible to validate this during a full isolation, as the validation would fail due to the datastore being inaccessible from the perspective of the isolated host.
I feel that changing the isolation response is most useful in environments where a failure of the management network is likely correlated with a failure of the VM network(s). If the failure of the management network won’t likely correspond with the failure of the VM networks, isolation response would cause unnecessary downtime as the VMs can continue to run without management network connectivity to the host.
A second use case for power off/shutdown is the scenario where the VM retains access to the VM network but loses access to its storage; leaving the VM powered on could result in two VMs on the network with the same IP address. An example of when this could happen is with vSAN storage. When vSAN is configured, HA leverages the vSAN network for network heartbeating. This means that if the HA heartbeat does not function properly, it is very unlikely that VMs running on that particular host can access the vSAN datastore. As such, for vSAN we always recommend setting the isolation response to “Power off”.
Realizing that many of you are not designing hyper-converged solutions yet, or are responsible for maintaining a legacy infrastructure, let me try to provide some guidance on when to use which isolation policy.
| Likelihood that host retains access to VM datastore | Likelihood VMs retain access to VM network | Recommended isolation policy | Rationale |
|---|---|---|---|
| Likely | Likely | Disabled | VM is running fine, no reason to power it off |
| Likely | Unlikely | Shutdown | Choose shutdown to allow HA to restart VMs on hosts that are not isolated and hence are likely to have access to storage and network |
| Unlikely | Likely | Power off | Use Power off to avoid having two instances of the same VM on the VM network |
| Unlikely | Unlikely | Power off | VM is unavailable, a restart makes the most sense. A clean shutdown is not needed as storage is most likely inaccessible |
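To make the guidance above a bit more concrete, here is a small Python sketch that encodes the same decision logic. The function and its parameters are purely illustrative assumptions of mine, not anything HA actually exposes:

```python
# Illustrative only: the isolation-response decision table expressed as a
# helper function. This is not a VMware API, just the table in code form.

def recommended_isolation_response(datastore_access_likely: bool,
                                   vm_network_access_likely: bool) -> str:
    """Return the recommended isolation response per the table above."""
    if datastore_access_likely and vm_network_access_likely:
        return "Disabled"    # VM is running fine, no reason to power it off
    if datastore_access_likely and not vm_network_access_likely:
        return "Shutdown"    # clean shutdown, HA restarts it on a healthy host
    # Storage access is unlikely in both remaining cases: power off to avoid
    # duplicate IPs on the VM network, or because a clean shutdown is moot.
    return "Power off"

print(recommended_isolation_response(True, True))   # → Disabled
```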
The question that I haven’t answered yet is how HA knows which VMs have been powered-off due to the triggered isolation response and why the isolation response is more reliable than with previous versions of HA. In earlier versions HA did not care and would always try to restart the VMs according to the last known state of the host. That is no longer the case. Before the isolation response is triggered, the isolated host will verify whether a master is responsible for the VM.
As mentioned earlier, it does this by validating if a master owns the home datastore of the VM. When isolation response is triggered, the isolated host removes the VMs which are powered off or shutdown from the “poweron” file. The master will recognize that the VMs have disappeared and initiate a restart. On top of that, when the isolation response is triggered, it will create a per-VM file under a “poweredoff” directory which indicates for the master that this VM was powered down as a result of a triggered isolation response. This information will be read by the master node when it initiates the restart attempt in order to guarantee that only VMs that were powered off / shut down by HA will be restarted by HA. Of course, this is only possible when the datastores are still accessible during the time of failure.
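The poweron/poweredoff mechanism described above can be sketched as follows. This is purely an illustration of the idea of signaling through datastore files; the paths, file names and formats are made up and do not reflect the actual FDM on-disk format:

```python
# Hypothetical sketch of the datastore-file signaling described above.
import json
import pathlib
import tempfile

datastore = pathlib.Path(tempfile.mkdtemp())   # stand-in for a shared datastore
poweron = datastore / "poweron"                # host's list of running VMs
poweredoff_dir = datastore / "poweredoff"      # per-VM isolation markers
poweredoff_dir.mkdir()

# The host initially lists its running VMs in its "poweron" file.
poweron.write_text(json.dumps(["vm1", "vm2"]))

def trigger_isolation_response(vm: str) -> None:
    """Isolated host: power off the VM, drop it from 'poweron',
    and record the reason in a per-VM 'poweredoff' marker file."""
    vms = json.loads(poweron.read_text())
    vms.remove(vm)
    poweron.write_text(json.dumps(vms))
    (poweredoff_dir / vm).write_text("powered off by isolation response")

trigger_isolation_response("vm1")

def master_should_restart(vm: str) -> bool:
    """Master: a VM missing from 'poweron' that has a 'poweredoff' marker
    was stopped by HA (not by an administrator), so it is safe to restart."""
    return (vm not in json.loads(poweron.read_text())
            and (poweredoff_dir / vm).exists())

print(master_should_restart("vm1"))  # True
```

The key point the sketch tries to capture: the marker file is what lets the master distinguish “HA powered this VM off” from “an administrator powered this VM off”, so only the former gets restarted.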
This is, however, only one part of the increased reliability of HA. Reliability has also been improved with respect to “isolation detection,” which will be described in the following section.
Isolation Detection
I have explained what the options are to respond to an isolation event and what happens when the selected response is triggered. However, I have not extensively discussed how isolation is detected. The mechanism is fairly straightforward and works with heartbeats, as earlier explained. There are, however, two scenarios again, and the process and associated timelines differ for each of them:
- Isolation of a slave
- Isolation of a master
Before I explain the differences in process between both scenarios, I want to make sure it is clear that a change in state will result in the isolation response not being triggered in either scenario. Meaning that if a single ping is successful or the host observes election traffic and is elected a master or slave, the isolation response will not be triggered, which is exactly what you want as avoiding down time is at least as important as recovering from down time. When a host has declared itself isolated and observes election traffic it will declare itself no longer isolated.
Isolation of a Slave
HA triggers a master election process before it will declare a host is isolated. In the below timeline, “s” refers to seconds.
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated
- T60s – Slave “triggers” isolation response
Note that the isolation response gets triggered 30 seconds after the host has been declared isolated. This also means that a restart of the VM will be “delayed” by 30 seconds. Pre vSphere 5.1 this delay did not exist; the delay is, however, configurable through the advanced setting das.config.fdm.isolationPolicyDelaySec. Note though that the minimum value is 30 seconds: if a value lower than 30 seconds is configured, HA will still use 30 seconds.
When the isolation response is triggered, HA creates a “power-off” file for any VM it powers off whose home datastore is accessible. Next it powers off the VM (or shuts it down) and updates the host’s poweron file. The power-off file is used to record that HA powered off the VM and that HA should therefore restart it. These power-off files are deleted when a VM is powered back on or HA is disabled. The below screenshot shows such a power-off file, which in this case is stored on a vVol.
Of course, the creation of the power-off file and the fact that the host declared itself isolated are also logged in the fdm.log file. Below is an example of what that looks like. Note that the example has been edited/pruned for readability purposes.
After the completion of this sequence, the master will learn the slave was isolated through the “poweron” file as mentioned earlier, and will restart VMs based on the information provided by the slave.
Isolation of a Master
In the case of the isolation of a master, this timeline is a bit less complicated because there is no need to go through an election process. In this timeline, “s” refers to seconds.
- T0 – Isolation of the host (master)
- T0 – Master pings “isolation addresses”
- T5s – Master declares itself isolated
- T35s – Master “triggers” isolation response
Additional Checks
Before a host declares itself isolated, it will ping the default isolation address, which is the gateway specified for the management network, and it will continue to ping the address until it becomes unisolated. HA gives you the option to define one or multiple additional isolation addresses using an advanced setting. This advanced setting is called das.isolationaddress and can be used to reduce the chances of a false positive. I recommend setting at least one additional isolation address. If required, you can configure up to 10 additional isolation addresses. A secondary management network will more than likely be on a different subnet; if one is configured, it is recommended to specify an additional isolation address that is part of that subnet.
Selecting an Additional Isolation Address
A question asked by many people is which address should be specified for this additional isolation verification. I generally recommend an isolation address close to the hosts to avoid too many network hops and an address that would correlate with the liveness of the VM network. In many cases, the most logical choice is the physical switch to which the host is directly connected. Basically, use the gateway for whatever subnet your management network is on. Another usual suspect would be a router, a virtual interface on the switch or any other reliable and pingable device on the same subnet. However, when you are using IP-based shared storage like NFS or iSCSI, the IP-address of the storage device can also be a good choice.
Basic design principle: Select a reliable secondary isolation address. Try to minimize the number of “hops” between the host and this address.
Isolation Policy Delay
For those who want to increase the time it takes before HA executes the isolation response, an advanced setting is available. This setting is called “das.config.fdm.isolationPolicyDelaySec” and allows changing the number of seconds to wait before the isolation policy is executed. The minimum value is 30. If set to a value less than 30, the delay will still be 30 seconds. I do not recommend changing this advanced setting unless there is a specific requirement to do so. In almost all scenarios 30 seconds should suffice.
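As a quick illustration of the clamping behavior just described, the effective delay could be modeled like this (a sketch of the documented behavior, not VMware code):

```python
# Sketch: the effective isolation-policy delay, given the documented
# 30-second minimum for das.config.fdm.isolationPolicyDelaySec.
from typing import Optional

DEFAULT_DELAY_SEC = 30

def effective_isolation_delay(configured: Optional[int] = None) -> int:
    """Values below 30 are ignored; 30 is used instead."""
    if configured is None:
        return DEFAULT_DELAY_SEC
    return max(configured, DEFAULT_DELAY_SEC)

print(effective_isolation_delay(10))   # 30: clamped to the minimum
print(effective_isolation_delay(90))   # 90
```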
Restarting VMs
The most important procedure has not yet been explained: restarting VMs. I have dedicated a full section to this concept.
I have explained the difference in behavior from a timing perspective for restarting VMs in the case of both master node and slave node failures. For now, let’s assume that a slave node has failed. When the master node declares the slave node Partitioned or Isolated, it determines which VMs were running on it using the information it previously read from the host’s “poweron” file. These files are read asynchronously, approximately every 30 seconds. If the host was not Partitioned or Isolated before the failure, the master uses cached data to determine which VMs were last running on the host before the failure occurred.
Before it will initiate the restart attempts, though, the master will first validate that the VM should be restarted. This validation uses the protection information vCenter Server provides to each master, or if the master is not in contact with vCenter Server, the information saved in the protectedlist files. If the master is not in contact with vCenter Server or has not locked the file, the VM is filtered out. At this point, all VMs having a restart priority of “disabled” are also filtered out.
Now that HA knows which VMs it should restart, it is time to decide where the VMs are placed. HA will take multiple things in to account:
- CPU and memory reservation, including the memory overhead of the VM
- Unreserved capacity of the hosts in the cluster
- Restart priority of the VM relative to the other VMs that need to be restarted
- Virtual-machine-to-host compatibility set
- The number of dvPorts required by a VM and the number available on the candidate hosts
- The maximum number of vCPUs and VMs that can be run on a given host
- Restart latency
- Whether the active hosts are running the required number of agent VMs.
Restart latency refers to the amount of time it takes to initiate VM restarts. This means that VM restarts will be distributed by the master across multiple hosts to avoid a boot storm, and thus a delay, on a single host.
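The placement criteria listed above can be sketched as a simple host filter. All class and field names here are illustrative assumptions of mine, not the actual FDM data structures:

```python
# Hedged sketch, not the real placement code: filtering candidate hosts for a
# VM restart using (a subset of) the criteria listed above.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Host:
    name: str
    unreserved_cpu_mhz: int
    unreserved_mem_mb: int
    free_dvports: int
    vm_headroom: int                      # max-VM/vCPU-count headroom
    compatible_vms: Set[str] = field(default_factory=set)

@dataclass
class VM:
    name: str
    cpu_reservation_mhz: int
    mem_reservation_mb: int               # reservation plus memory overhead
    dvports_needed: int = 1

def candidate_hosts(vm: VM, hosts: List[Host]) -> List[Host]:
    """Hosts that pass the compatibility and capacity checks for this VM."""
    return [h for h in hosts
            if vm.name in h.compatible_vms            # VM-to-host compatibility
            and h.unreserved_cpu_mhz >= vm.cpu_reservation_mhz
            and h.unreserved_mem_mb >= vm.mem_reservation_mb
            and h.free_dvports >= vm.dvports_needed   # dvPort availability
            and h.vm_headroom > 0]                    # per-host VM/vCPU limits

esx01 = Host("esx01", 8000, 32768, 64, 10, {"db01"})
esx02 = Host("esx02", 500, 1024, 64, 10, {"db01"})
vm = VM("db01", cpu_reservation_mhz=2000, mem_reservation_mb=4096)
print([h.name for h in candidate_hosts(vm, [esx01, esx02])])  # ['esx01']
```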
If a placement is found, the master will send each target host the set of VMs it needs to restart. If this list exceeds 32 VMs, HA will limit the number of concurrent power on attempts to 32 for that particular host. If a VM successfully powers on, the node on which the VM was powered on will inform the master of the change in power state. The master will then remove the VM from the restart list.
If a placement cannot be found, the master will place the VM on a “pending placement list” and will retry placement of the VM when one of the following conditions changes:
- A new virtual-machine-to-host compatibility list is provided by vCenter.
- A host reports that its unreserved capacity has increased.
- A host (re)joins the cluster (For instance, when a host is taken out of maintenance mode, a host is added to a cluster, etc.)
- A new failure is detected and VMs have to be failed over.
- A failure occurred when failing over a VM.
But what about DRS? Wouldn’t DRS be able to help during the placement of VMs when all else fails? It does; as described earlier, DRS is called upon for the placement of VMs to begin with. If there is insufficient capacity available on a host to restart a VM and DRS is enabled, then DRS will run its balancing algorithm to try to make capacity available for restarting the workload.
VM Component Protection
In vSphere 6.0 a new feature was introduced as part of vSphere HA called VM Component Protection. VM Component Protection (VMCP) allows you to protect VMs against the failure of your storage system, or components of the storage system or storage area network. There are two types of failures VMCP will respond to: Permanent Device Loss (PDL) and All Paths Down (APD). Before we look at some of the details, I want to point out that enabling VMCP is extremely easy. It can be enabled in the Failures and Responses section by simply selecting the response for a PDL and the response for an APD. Note that in the new vSphere Client the term “VM Component Protection” is no longer used; instead the options are referred to as “Datastore with PDL” and “Datastore with APD”, as shown in the below screenshot.
As stated there are two scenarios HA can respond to: PDL and APD. Let’s look at those two scenarios a bit closer. With vSphere 5.0 a feature was introduced as an advanced option that would allow vSphere HA to restart VMs impacted by a PDL condition.
A PDL condition is a condition that is communicated by the array controller to ESXi via a SCSI sense code. This condition indicates that a device (LUN) has become unavailable and is likely permanently unavailable. An example scenario in which this condition would be communicated by the array is when a LUN is set offline. This condition is used during a failure scenario to ensure ESXi takes appropriate action when access to a LUN is revoked. It should be noted that when a full storage failure occurs it is impossible to generate a PDL condition as there is no communication possible between the array and the ESXi host. This state will be identified by the ESXi host as an APD condition.
Although the functionality itself worked as advertised, enabling and managing it was cumbersome and error prone. It was required to set the option “disk.terminateVMOnPDLDefault” manually. Today this can all be configured within the UI as shown in the screenshot below.
The three options provided are “Disabled”, “Issue Events” and “Power off and restart VMs”. Note that “Power off and restart VMs” does exactly that: the VM process is killed and the VM is restarted on a host which still has access to the storage device.
Pre vSphere 6.0 it was not possible for vSphere to respond to an All Paths Down (APD) scenario. APD is the situation where the storage device has become inaccessible, but the reason is unknown to ESXi. In most cases it is related to a storage network problem. With vSphere 5.1 changes were introduced to the way APD scenarios are handled by the hypervisor, and this mechanism is leveraged by HA to allow for a response.
As explained earlier, an APD condition is a situation where access to the storage is lost without receiving a SCSI sense code from the array. This can for instance happen when the network between the host and the storage system has failed, hence the name “all paths down.” When an APD condition occurs (access to a device is lost), the hypervisor starts a timer. After 140 seconds the APD timeout is declared and the device is marked as such. At that point HA starts a timer of its own; the HA timeout is 3 minutes by default. When the 3 minutes have passed, HA will take the action defined within the UI. There are four options:
- Disabled
- Issue Events
- Power off and restart VMs – Conservative
- Power off and restart VMs – Aggressive
Note that aggressive and conservative refers to the likelihood of HA being able to restart VMs. When set to “conservative” HA will only restart the VM that is impacted by the APD if it knows another host can restart it. In the case of “aggressive” HA will try to restart the VM even if it doesn’t know the state of the other hosts. This could lead to a situation where your VM is not restarted when there is no host that has access to the datastore the VM is located on.
It is also good to know that if the APD is lifted and access to the storage is restored during the approximately 5 minutes and 20 seconds it takes before a VM restart is initiated, HA will not do anything unless you explicitly configure it to do so. This is where the “Response recovery” setting comes into play, as shown in the screenshot above. If there is a desire to do so, you can reset the VM even when the host has recovered from the APD scenario during the 3-minute (default value) grace period. This can be useful in the event where VMs reside in an unrecoverable state after an APD condition has been declared.
Another useful option is the Response Delay. This setting determines when HA responds to the declared APD state. By default, as already mentioned, this is set to 3 minutes. Although you can increase or decrease this delay, I recommend leaving it unchanged unless there is a specific reason to change it, like for instance a recovery time objective of lower than five minutes as defined in a service level agreement.
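Putting the timers together: the sketch below computes the total time before HA responds to an APD condition, using the 140-second APD timeout and the (configurable) VMCP response delay discussed above. Names are illustrative:

```python
# Sketch of the APD response timeline: the hypervisor declares the APD timeout
# after 140 seconds, then HA waits the VMCP response delay (3 minutes by
# default) before taking the configured action.
APD_TIMEOUT_SEC = 140
DEFAULT_VMCP_DELAY_SEC = 3 * 60

def seconds_until_ha_responds(vmcp_delay_sec: int = DEFAULT_VMCP_DELAY_SEC) -> int:
    return APD_TIMEOUT_SEC + vmcp_delay_sec

total = seconds_until_ha_responds()
print(total, divmod(total, 60))  # 320 seconds, i.e. 5 minutes and 20 seconds
```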
Basic design principle: Without access to shared storage a VM becomes useless. It is highly recommended to configure VMCP to act on a PDL and APD scenario. We recommend to set both to “power off and restart VMs” but leave the “response for APD recovery after APD timeout” disabled so that VMs are not rebooted unnecessarily.
Starting with vSphere 7.0 there is another option available, which is internally referred to as “super aggressive”.
The main difference between Conservative and Aggressive is that if you find yourself in a situation where HA isn’t sure whether a VM can be restarted during an APD scenario it will not power off the VM when using Conservative. If you have it configured as Aggressive it will power off the VM. However, if HA is certain that a VM can’t be powered on it will not power off the VM. Basically it prefers the availability of the VM. Let’s make a list of those three states so it is clear what the difference is:
- vSphere HA knows the VM can be restarted
- vSphere HA does not know if the VM can be restarted
- vSphere HA knows the VM cannot be restarted
As you can imagine, in certain scenarios having a VM running while it is impacted by an “APD” situation makes no sense. The VM has lost access to storage, and you simply may prefer to kill the workload. Why? Well, when it loses access to storage it can’t write to disk. You could find yourself in a situation where a change is acknowledged and you think it is written to disk but it somehow is sitting in a memory cache etc.
If you prefer the VM to be killed, regardless of whether it can be restarted or not, you can enable this via a vSphere HA advanced setting. Now before you implement this, do note that if a cluster-wide APD situation occurs, you could find yourself in the scenario where ALL virtual machines are powered off by HA and not restarted as the resources are not available. Anyway, if you feel this is a requirement, you can configure the following vSphere HA advanced setting in vSphere 7:
das.restartVmsWithoutResourceChecks = true
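To summarize the three flavors, here is an illustrative decision sketch based on the three states listed above. The policy names and the function are mine for illustration, not a VMware API; `restartable` is True, False, or None (unknown):

```python
# Illustrative decision logic for the APD response flavors discussed above.
from typing import Optional

def power_off_vm(policy: str, restartable: Optional[bool]) -> bool:
    """Would HA power off the APD-impacted VM under this policy?"""
    if policy == "conservative":
        return restartable is True        # only when a restart is assured
    if policy == "aggressive":
        return restartable is not False   # also when the outcome is unknown
    if policy == "super-aggressive":      # das.restartVmsWithoutResourceChecks
        return True                       # always kill, even with no restart
    raise ValueError(policy)

print(power_off_vm("aggressive", None))        # True
print(power_off_vm("conservative", None))      # False
print(power_off_vm("super-aggressive", False)) # True
```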
vSphere HA respecting Affinity Rules
Prior to vSphere 5.5, HA did nothing with VM-VM affinity or anti-affinity rules. Typically, for people using “affinity” rules this was not an issue, but those using “anti-affinity” rules did see this as an issue. They created these rules to ensure specific VMs would never run on the same host, but vSphere HA would simply ignore the rule when a failure occurred and just place the VMs “randomly”. With vSphere 5.5 this changed: vSphere HA became “anti-affinity” aware, and with vSphere 6.0 also VM-to-Host affinity aware. In order to ensure anti-affinity rules were respected, you had to set advanced settings, or, as of vSphere 6.0, configure it in the vSphere Client as shown below.
Now note that this does not mean that when you configure anti-affinity rules or VM to Host affinity rules and have this configured to “true” and somehow there aren’t sufficient hosts available to respect these rules that HA would not restart the VM. It would aim to comply to the rules, but availability trumps cluster rules in this case and VMs will be restarted.
In vSphere 6.5, and higher, the option to configure this has disappeared completely. The reason for this is because vSphere HA now tries to respect these rules by default, as it appeared this is the behavior customers wanted.
Note that if, for whatever reason, vSphere HA cannot respect the rules, it will, as mentioned before, restart the VMs (violating the rule); as these are non-mandatory rules, it chooses availability over compliance in this situation.
If you would like to disable this behavior and don’t care about these rules during a fail-over event you can set either or both advanced settings:
- das.respectVmVmAntiAffinityRules – set to “true” by default, set to “false” if you want to disable it
- das.respectVmHostSoftAffinityRules – set to “true” by default, set to “false” if you want to disable it
Basic design principle: We recommend against changing the default behavior. vSphere HA will try to conform to the rules, and if needed will violate them. We also recommend using a limited number of rules; we will explain in the DRS section of the book what the potential impact of a higher number of rules is.
One more thing to note: many people seem to be under the impression that Affinity, Anti-Affinity and VM-to-Host rules are a DRS function. This is mainly the result of the name (DRS Rules) the feature had in the past. However, these are cluster rules and not a DRS function per se. The functionality can also be used without DRS enabled or licensed, although that limits its usefulness in our opinion.
Admission Control
Admission Control is more than likely the most misunderstood concept vSphere holds today and because of this it is often disabled. However, Admission Control is a must when availability needs to be guaranteed and isn’t that the reason for enabling HA in the first place?
What is HA Admission Control about? Why does HA contain this concept called Admission Control? The “Availability Guide” states the following:
“vCenter Server uses admission control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that VM resource reservations are respected.”
Please read that quote again, especially the first two words. Indeed it is vCenter that is responsible for Admission Control, contrary to what many believe. Although this might seem like a trivial fact it is important to understand that this implies that Admission Control will not disallow HA initiated restarts. HA initiated restarts are done on a host level and not through vCenter.
As said, Admission Control guarantees that capacity is available for an HA initiated failover by reserving resources within a cluster. It calculates the capacity required for a failover based on available resources. In other words, if a host is placed into maintenance mode or disconnected, it is taken out of the equation. This also implies that if a host has failed or is not responding but has not been removed from the cluster, it is still included in the equation. “Available Resources” indicates that the virtualization overhead has already been subtracted from the total amount.
To give an example: VMkernel memory is subtracted from the total amount of memory to obtain the amount of memory available for VMs. There is one gotcha with Admission Control that I want to bring to your attention before drilling into the different policies. When Admission Control is enabled, HA will in no way violate availability constraints. This means that it will always ensure sufficient hosts are up and running; this applies to manual maintenance mode actions and, for instance, to VMware Distributed Power Management. So, if a host is stuck trying to enter Maintenance Mode, remember that it might be HA which is not allowing Maintenance Mode to proceed, as proceeding would violate the Admission Control Policy. In this situation, users can manually vMotion VMs off the host or temporarily disable admission control to allow the operation to proceed.
But what if you use something like Distributed Power Management (DPM), would that place all hosts in standby mode to reduce power consumption? No, DPM is smart enough to take hosts out of standby mode to ensure enough resources are available to provide for HA initiated failovers. If by any chance the resources are not available, HA will wait for these resources to be made available by DPM and then attempt the restart of the VMs. In other words, the retry count (5 retries by default) is not wasted in scenarios like these.
Admission Control Policy
The Admission Control Policy dictates the mechanism that HA uses to guarantee enough resources are available for an HA initiated failover. This section gives a general overview of the available Admission Control Policies. The impact of each policy is described in the following section, including our recommendation. Admission Control changed in vSphere 6.5 substantially, or at least the user interface changed, and a new form of admission control was introduced. The user interface used to look like the below screenshot.
As of vSphere 6.5 the UI has changed, it combines two aspects of the different Admission Control algorithms.
Let’s look at the different algorithms first; there are three options available. Each option has its caveats but also its benefits. I do feel, however, that for the majority of environments the default Admission Control policy / algorithm is recommended.
Admission Control Algorithms
Each Admission Control Policy has its own Admission Control algorithm. Understanding each of these Admission Control algorithms is important to appreciate the impact each one has on your cluster design. For instance, setting a reservation on a specific VM can have an impact on the achieved consolidation ratio. This section will take you on a journey through the trenches of Admission Control Policies and their respective mechanisms and algorithms.
Cluster Resource Percentage algorithm
The Cluster Resource Percentage algorithm used to be an admission control policy option and was one of the most used admission control policies. The simple reason for this was that it is the least restrictive and most flexible. It was also very easy to configure as shown in the screenshot below, which is what the UI looked like pre-vSphere 6.5.
The main advantage of the cluster resource percentage algorithm is the ease of configuration, and flexibility it offers in terms of how resources were saved for VM restarts. The big change in vSphere 6.5 however is that there no longer is a need to specify a percentage manually but you can now specify how many Host Failures the Cluster should Tolerate as shown in the below screenshot.
When you specify a number of Host Failures this number is then automatically calculated to a percentage. You can of course override this if you prefer to manually set the percentage, typically customers would keep CPU and memory equal. The big benefit of specifying the Host Failures is that when you add hosts to the cluster the percentage of resources saved for HA restarts is automatically calculated again and applied to the cluster, where in the past customers would need to manually calculate what the new percentage would be and configure this.
So how does the admission control policy work?
First of all, HA will add up all available resources to see how much it has available in total (virtualization overhead will be subtracted). Then, HA will calculate how many resources are currently reserved by adding up all reservations for memory and CPU for all powered-on VMs.
For those VMs that do not have a reservation, a default of 32 MHz will be used for CPU and a default of 0 MB+memory overhead will be used for Memory. (Amount of overhead per configuration type can be found in the “Understanding Memory Overhead” section of the Resource Management guide.)
In other words:
((total amount of available resources – total reserved VM resources) / total amount of available resources) >= (percentage HA should reserve as spare capacity)
Total reserved VM resources includes the default reservation of 32 MHz and the memory overhead of the VM.
Total cluster resources are 24GHz (CPU) and 96GB (MEM). This would lead to the following calculations:
((24 GHz – (2 GHz + 1 GHz + 32 MHz + 4 GHz)) / 24 GHz) = ~71 % available
((96 GB – (1.1 GB + 3.2 GB + 100 MB + 600 MB)) / 96 GB) = ~95 % available
As you can see, the amount of memory differs from the diagram. Even if a reservation has been set, the amount of memory overhead is added to the reservation. This example also demonstrates how keeping CPU and memory percentage equal could create an imbalance. Ideally, of course, the hosts are provisioned in such a way that there is no CPU/memory imbalance. Experience over the years has proven, unfortunately, that most environments run out of memory resources first and this might need to be factored in when calculating the correct value for the percentage. However, this trend might be changing as memory is getting cheaper every day.
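For those who prefer code over arithmetic, the percentage calculation can be reproduced with a short Python sketch. The inputs mirror the example above (with these exact reservations the CPU figure works out to roughly 71%); the function name is mine:

```python
# Sketch of the Cluster Resource Percentage calculation. Reservations include
# the 32 MHz CPU default and per-VM memory overhead, as described above.
from typing import List

def percent_available(total: float, reservations: List[float]) -> float:
    """Percentage of cluster resources left after subtracting reservations."""
    return (total - sum(reservations)) / total * 100

cpu = percent_available(24_000, [2_000, 1_000, 32, 4_000])  # MHz
mem = percent_available(96, [1.1, 3.2, 0.1, 0.6])           # GB

print(round(cpu), round(mem))  # roughly 71% CPU and 95% memory available
```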
In order to ensure VMs can always be restarted, Admission Control will constantly monitor if the policy has been violated or not. Please note that this Admission Control process is part of vCenter and not of the ESXi host! When one of the thresholds is reached, memory or CPU, Admission Control will disallow powering on any additional VMs as that could potentially impact availability. These thresholds can be monitored on the HA section of the Cluster’s summary tab.
If you have an unbalanced cluster (hosts with different amounts of CPU or memory resources), your percentage should be set manually and be equal to, or preferably larger than, the percentage of resources provided by the largest host. This way you ensure that all VMs residing on this host can be restarted in case of a host failure. Again, there is a danger in manually configuring percentages, as it may lead to a situation where, after changes in the cluster, insufficient resources are available to restart VMs.
You can also find yourself in a situation where resources are fragmented throughout the cluster; especially in larger, overbooked clusters this can happen. Although DRS is notified to rebalance the cluster, if needed, to accommodate these VMs’ resource requirements, a guarantee cannot be given. I recommend selecting the highest restart priority for such a VM (depending on the SLA, of course) to ensure it will be able to boot.
The following example and diagram will make it more obvious: You have 3 hosts, each with roughly 80% memory usage, and you have configured HA to reserve 20% of resources for both CPU and memory. A host fails and all VMs will need to failover. One of those VMs has a 4 GB memory reservation. As you can imagine, HA will not be able to initiate a power-on attempt, as there are not enough memory resources available to guarantee the reserved capacity. Instead an event will get generated indicating “not enough resources for failover” for this VM.
Basic design principle: Although HA will utilize DRS to try to accommodate for the resource requirements of this VM a guarantee cannot be given. Do the math; verify that any single host has enough resources to power-on your largest VM. Also take restart priority into account for this/these VM(s).
Slot Size algorithm
The Admission Control algorithm that has been around the longest is the slot size algorithm, formerly known as the “Host Failures Cluster Tolerates” policy. It is also historically the least understood Admission Control Policy due to its complex admission control mechanism.
Similar to the Cluster Resource Percentage algorithm you can specify the number of Host Failures the Cluster Tolerates in an N-1 fashion. This means that the number of host failures you can specify in a 64 host cluster is 63. As mentioned before, this “Host failures cluster tolerates” also is available when the percentage based policy is selected.
Within the vSphere Client it is possible to specify that the slot size algorithm should be used through the dropdown under “Define host failover capacity by”.
When “Slot Policy” is selected the “slots” mechanism is used. The details of this mechanism have changed several times in the past and it is one of the most restrictive policies; more than likely, it is also the least understood.
Slots dictate how many VMs can be powered on before vCenter starts yelling “Out Of Resources!” Normally, a slot represents one VM. Admission Control does not limit HA in restarting VMs; it ensures enough unfragmented resources are available to power on all VMs in the cluster by preventing “over-commitment”. Technically speaking, “over-commitment” is not the correct terminology, as Admission Control ensures VM reservations can be satisfied and that all VMs’ initial memory overhead requirements are met. Although I have already touched on this, it doesn’t hurt repeating it as it is one of those myths that keeps coming back: HA initiated failovers are not bound by the Admission Control Policy. Admission Control is done by vCenter. HA initiated restarts, in a normal scenario, are executed directly on the ESXi host without the use of vCenter. The corner case is where HA requests DRS (DRS is a vCenter task!) to defragment resources, but that is beside the point. Even if resources are low and vCenter would complain, it couldn’t stop the restart from happening.
Let’s dig into this concept I have just introduced, slots.
“A slot is defined as a logical representation of the memory and CPU resources that satisfy the reservation requirements for any powered-on VM in the cluster.”
In other words, a slot is the worst case CPU and memory reservation scenario in a cluster. This directly leads to the first “gotcha.”
HA uses the highest CPU reservation of any given powered-on VM and the highest memory reservation of any given powered-on VM in the cluster. If no CPU reservation higher than 32 MHz is set, HA will use a default of 32 MHz for CPU. If no memory reservation is set, HA will use a default of 0 MB + memory overhead for memory. (See the VMware vSphere Resource Management Guide for more details on memory overhead per VM configuration.) The following example will clarify what “worst case” actually means.
Example: If VM “VM1” has 2 GHz of CPU reserved and 1024 MB of memory reserved and VM “VM2” has 1 GHz of CPU reserved and 2048 MB of memory reserved the slot size for memory will be 2048 MB (+ its memory overhead) and the slot size for CPU will be 2 GHz. It is a combination of the highest reservation of both VMs that leads to the total slot size. Reservations defined at the Resource Pool level however, will not affect HA slot size calculations.
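The slot size rule from this example can be expressed as a component-wise maximum over all powered-on VM reservations. This is a simplified sketch (memory overhead is omitted), not the actual HA code:

```python
# Slot size = per-resource maximum of all powered-on VM reservations,
# with a 32 MHz CPU floor when no higher CPU reservation exists.
# Memory overhead is left out for simplicity.
vms = [
    {"name": "VM1", "cpu_mhz": 2000, "mem_mb": 1024},
    {"name": "VM2", "cpu_mhz": 1000, "mem_mb": 2048},
]

cpu_slot_mhz = max(max(vm["cpu_mhz"] for vm in vms), 32)  # 32 MHz default floor
mem_slot_mb = max(vm["mem_mb"] for vm in vms)

print(cpu_slot_mhz, mem_slot_mb)  # 2000 2048
```

Note how the slot combines the highest CPU reservation of one VM with the highest memory reservation of another, exactly the “worst case” the text describes.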
Basic design principle: Be really careful with reservations. If there’s no need to have them on a per-VM basis, don’t configure them, especially when using “host failures cluster tolerates”. If reservations are needed, resort to resource pool based reservations.
Now that we know the worst-case scenario is always taken into account when it comes to slot size calculations, I will describe what dictates the amount of available slots per cluster as that ultimately dictates how many VMs can be powered on in your cluster.
First, we will need to know the slot size for memory and CPU, next we will divide the total available CPU resources of a host by the CPU slot size and the total available memory resources of a host by the memory slot size. This leaves us with a total number of slots for both memory and CPU for a host. The most restrictive number (worst-case scenario) is the number of slots for this host. In other words, when you have 25 CPU slots but only 5 memory slots, the amount of available slots for this host will be 5 as HA always takes the worst case scenario into account to “guarantee” all VMs can be powered on in case of a failure or isolation.
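The per-host calculation above, using the same 25-versus-5 numbers, can be sketched as follows. The host capacities and slot sizes are assumed values chosen to reproduce the example:

```python
# Slots per host = floor-divide host capacity by slot size per resource,
# then take the most restrictive (worst case). Numbers are hypothetical.
host_cpu_mhz = 24000
host_mem_mb = 10240
cpu_slot_mhz = 960
mem_slot_mb = 2048

cpu_slots = host_cpu_mhz // cpu_slot_mhz   # 25 CPU slots
mem_slots = host_mem_mb // mem_slot_mb     # 5 memory slots
host_slots = min(cpu_slots, mem_slots)     # 5: worst case wins

print(cpu_slots, mem_slots, host_slots)
```

The `min()` is the key step: 25 CPU slots are irrelevant when only 5 memory slots fit, so this host contributes 5 slots to the cluster.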
The question I receive a lot is how do I know what my slot size is? The details around slot sizes can be monitored on the HA section of the Cluster’s Monitor tab by checking the “Advanced Runtime Info” section when the “Host Failures” Admission Control Policy is configured.
As you can imagine, using reservations on a per-VM basis can lead to very conservative consolidation ratios. However, this is something that is configurable through the vSphere Client. If you have just one VM with a really high reservation, you can set an explicit slot size by going to “Edit Cluster Services” and specifying it under the Admission Control Policy section. In the above screenshot there is a single VM with a 16GB reservation, which skews the number of available slots in the cluster as a result. As can be seen, only 24 slots are available.
If one of these advanced settings is used, HA will ensure that the VM that skewed the numbers can be restarted by “assigning” multiple slots to it. However, when you are low on resources, this could mean that you are not able to power on the VM with this reservation because resources may be fragmented throughout the cluster instead of available on a single host. HA will notify DRS that a power-on attempt was unsuccessful and a request will be made to defragment the resources to accommodate the remaining VMs that need to be powered on. In order for this to be successful, DRS will need to be enabled and configured in fully automated mode. When not configured in fully automated mode, user action is required to execute DRS recommendations.
The following diagram depicts a scenario where a VM spans multiple slots:
Notice that because the memory slot size has been manually set to 1024 MB, one of the VMs (grouped with dotted lines) spans multiple slots due to a 4 GB memory reservation. As you might have noticed, none of the hosts has enough resources available to satisfy the reservation of the VM that needs to failover. Although in total there are enough resources available, they are fragmented and HA will not be able to power-on this particular VM directly but will request DRS to defragment the resources to accommodate this VM’s resource requirements. The great thing about the vSphere Client is that after setting the slot size manually to 1GB (1024MB) you can see which VMs require multiple slots when you click “calculate” followed by “view”. This is demonstrated in the below screenshots.
Admission Control does not take fragmentation of slots into account when slot sizes are manually defined with advanced settings. It will take the number of slots this VM will consume into account by subtracting them from the total number of available slots, but it will not verify the amount of available slots per host to ensure failover. As stated earlier, though, HA will request DRS to defragment the resources. This is by no means a guarantee of a successful power-on attempt.
Basic design principle: Avoid using manually specified slot sizes to increase the total number of slots as it could lead to more down time and adds an extra layer of complexity. If there is a large discrepancy in size and reservations I recommend using the percentage based admission control policy.
I highly recommend monitoring this section on a regular basis to get a better understanding of your environment and to identify those VMs that might be problematic to restart in case of a host failure. Also, it is possible to identify the VMs with a high reservation in the vSphere Client. You can do this by going to the VM view and adding a column called “Reservations” as shown in the screenshot below.
Using the above view, and the additional column, it is quickly determined that the VMware NSX infrastructure requires reservations, and is potentially skewing the number of slots. However, after manually changing the memory slot size to 1024MB, the number of total slots, used slots and slots available significantly changed as demonstrated in the screenshot below.
Unbalanced Configurations and Impact on Slot Calculation
It is an industry best practice to create clusters with similar hardware configurations. However, many companies started out with a small VMware cluster when virtualization was first introduced. When the time comes to expand, chances are fairly high that the same hardware configuration is no longer available. The question is: will you add the newly bought hosts to the same cluster or create a new cluster?
From a DRS perspective, large clusters are preferred as it increases the load balancing opportunities. However, there is a caveat for DRS as well, which is described in the DRS section of this book. For HA, there is a big caveat. When you think about it and understand the internal workings of HA, more specifically the slot algorithm, you probably already know what is coming up.
Let’s first define the term “unbalanced cluster.”
An unbalanced cluster would, for instance, be a cluster with 3 hosts of which one contains substantially more memory than the other hosts in the cluster.
Let’s try to clarify that with an example.
Example: What would happen to the total number of slots in a cluster of the following specifications? Yes I know, the below-provided example of host resources is not realistic in this day and age, however it is for illustrative purposes and makes calculations easier.
- Three host cluster
- Two hosts have 16 GB of available memory
- One host has 32 GB of available memory
The third host is a brand-new host that has just been bought and as prices of memory dropped immensely the decision was made to buy 32 GB instead of 16 GB.
The cluster contains a VM that has 1 vCPU and 4 GB of memory. A 1024 MB memory reservation has been defined on this VM. As explained earlier, a reservation will dictate the slot size, which in this case leads to a memory slot size of 1024 MB + memory overhead. For the sake of simplicity, I will calculate with 1024 MB. The following diagram depicts this scenario:
When Admission Control is enabled and the number of host failures has been selected as the Admission Control Policy, the number of slots will be calculated per host and the cluster in total. This will result in:
- ESXi-01 = 16 slots
- ESXi-02 = 16 slots
- ESXi-03 = 32 slots
As Admission Control is enabled, a worst-case scenario is taken into account. When a single host failure has been specified, this means that the host with the largest number of slots will be taken out of the equation. In other words, for our cluster, this would result in:
ESXi-01 + ESXi-02 = 32 slots available
Although you have doubled the amount of memory in one of your hosts, you are still stuck with only 32 slots in total. As clearly demonstrated, there is absolutely no point in buying additional memory for a single host when your cluster is designed with Admission Control enabled and the number of host failures has been selected as the Admission Control Policy.
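The calculation for this unbalanced cluster can be sketched directly from the numbers above. This is an illustration of the worst-case logic, not the actual HA implementation:

```python
# Unbalanced-cluster example: memory slot size 1024 MB, hosts of
# 16 GB, 16 GB and 32 GB. With "host failures cluster tolerates",
# the largest host(s) are removed from the equation (worst case).
SLOT_MB = 1024
hosts_gb = [16, 16, 32]
slots_per_host = [gb * 1024 // SLOT_MB for gb in hosts_gb]  # [16, 16, 32]

def available_slots(slots, host_failures=1):
    """Sum of slots after removing the `host_failures` largest hosts."""
    return sum(sorted(slots)[:len(slots) - host_failures])

print(available_slots(slots_per_host, 1))  # 32: ESXi-03 taken out
print(available_slots(slots_per_host, 2))  # 16: ESXi-03 plus one 16 GB host
```

Removing the largest hosts first is what makes the extra 16 GB in ESXi-03 effectively invisible to admission control.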
In our example, the memory slot size happened to be the most restrictive; however, the same principle applies when CPU slot size is most restrictive.
Basic design principle: When using admission control, balance your clusters and be conservative with reservations as it leads to decreased consolidation ratios.
Now, what would happen in the scenario above when the number of allowed host failures is set to 2? In this case ESXi-03 is taken out of the equation, and one of the remaining hosts in the cluster is also taken out, resulting in 16 slots. This makes sense, doesn’t it?
Can you avoid large HA slot sizes due to reservations without resorting to advanced settings? That’s the question I get almost daily and the answer is the “Percentage of Cluster Resources Reserved” admission control mechanism.
Failover Hosts
The third option one could choose is to select one or multiple designated Failover hosts. This is commonly referred to as a hot standby.
It is “what you see is what you get”. When you designate hosts as failover hosts, they will not participate in DRS and you will not be able to run VMs on these hosts! Not even in a two host cluster when placing one of the two in maintenance mode. These hosts are literally reserved for failover situations. HA will attempt to use these hosts first to fail over the VMs. If, for whatever reason, this is unsuccessful, it will attempt a failover on any of the other hosts in the cluster. For example, if three hosts were to fail, including the hosts designated as failover hosts, HA will still try to restart the impacted VMs on the host that is left. Although this host was not a designated failover host, HA will use it to limit downtime.
Performance Degradation
The question that then arises is: if Admission Control is all about ensuring VMs can be restarted, what about the resources available to the VMs after the restart? Pre-vSphere 6.5 there was no way to guarantee what the availability of resources would be after a restart. Starting with vSphere 6.5 we have the option to specify how much performance degradation can be tolerated, as shown in the screenshot below.
As said, this feature allows you to specify the performance degradation you are willing to incur if a failure happens. It is set to 100% by default, but it is my recommendation to consider changing the value. You can, for instance, change this to 25% or 50%.
So how does this work? Well first of all, you need DRS enabled as HA leverages DRS to get the cluster resource usage. But let’s look at an example:
- 75GB of memory available in 3 node cluster
- 1 host failure to tolerate specified
- 60GB of memory actively used by VMs
- 0% resource reduction tolerated
This results in the following:
75GB – 25GB (1 host worth of memory) = 50GB
We have 60GB of memory used, with 0% resource reduction to tolerate
60GB needed, 50GB available after failure, which means that a warning is issued to the vSphere admin. Now the vSphere admin can decide what to do: accept the performance degradation, or buy new hosts and add these to the cluster to ensure the performance for all VMs remains the same after an HA initiated restart. Of course, in larger environments it may also be possible to migrate VMs to other clusters in the environment.
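The worked example above reduces to a one-line comparison. This is a simplified sketch of the check, assuming memory only and a single host failure:

```python
# Performance-degradation check using the numbers from the example above.
total_gb = 75      # 75 GB of memory in the 3-node cluster
host_gb = 25       # one host's worth of memory (1 host failure to tolerate)
used_gb = 60       # memory actively used by VMs
tolerated = 0.0    # 0% resource reduction tolerated

available_after_failure = total_gb - host_gb   # 50 GB
required = used_gb * (1 - tolerated)           # 60 GB still needed
warn = required > available_after_failure      # True -> warning issued

print(warn)
```

Raising `tolerated` to, say, 0.25 lowers the required capacity to 45 GB and the warning disappears, which is exactly the trade-off the setting exposes.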
Note that the feature at the time of writing does all calculations based on a single host failure and the percentage specified applies to both CPU and memory.
Basic design principle: To ensure consistent performance behavior even after a failure, I recommend setting Performance Degradation Tolerated to a value other than 100%. The value should be based on your infrastructure and service level agreement.
Decision Making Time
As with any decision you make, there is an impact to your environment. This impact could be positive but also, for instance, unexpected. This especially goes for HA Admission Control. Selecting the right Admission Control algorithm can lead to a quicker Return On Investment and a lower Total Cost of Ownership. In the previous section, I described all the algorithms that form Admission Control and in this section I will focus more on the design considerations around selecting the appropriate Admission Control Policy for your or your customer’s environment.
The first decision that will need to be made is whether Admission Control will be enabled. I generally recommend enabling Admission Control as it is the only way of guaranteeing your VMs will be allowed to restart after a failure. It is important, though, that the policy is carefully selected and fits your or your customer’s requirements.
Basic design principle: Admission control guarantees enough capacity is available for VM failover. As such I recommend enabling it.
Although I already have explained all the mechanisms that are being used by each of the policies in the previous section, I will give a high level overview and list all the pros and cons in this section. On top of that, I will expand on what I feel is the most flexible Admission Control Policy and how it should be configured and calculated.
Percentage as Cluster Resources Reserved
The percentage based Admission Control is based on a per-VM reservation calculation. The percentage based Admission Control Policy is less conservative than the “slot based” algorithm and more flexible than “Failover Hosts”. It is by far the most used algorithm, and that is for a good reason in my opinion!
Pros:
- Accurate as it considers actual reservation per VM to calculate available failover resources.
- Cluster dynamically adjusts when resources are added.
Cons:
- Unbalanced clusters can be a potential problem when there’s a discrepancy between memory and CPU resources in different hosts.
Please note that, although a failover cannot be guaranteed, there are few scenarios where a VM will not be able to restart due to the integration HA offers with DRS and the fact that most clusters have spare capacity available to account for VM demand variance. Although this is a corner-case scenario, it needs to be considered in environments where absolute guarantees must be provided.
Slot size Algorithm
This algorithm was historically speaking the most used for Admission Control. Most environments are designed with an N+1 redundancy and N+2 is also not uncommon. This Admission Control Policy uses “slots” to ensure enough capacity is reserved for failover, which is a fairly complex mechanism. Slots are based on VM-level reservations and if reservations are not used a default slot size for CPU of 32 MHz is defined and for memory the largest memory overhead of any given VM is used.
Pros:
- Fully automated (When a host is added to a cluster, HA re-calculates how many slots are available.)
- Guarantees failover by calculating slot sizes.
Cons:
- Can be very conservative and inflexible when reservations are used as the largest reservation dictates slot sizes.
- Unbalanced clusters lead to wastage of resources.
- Complexity for administrator from calculation perspective.
Specify Failover Hosts
With the “Specify Failover Hosts” Admission Control Policy, when one or multiple hosts fail, HA will attempt to restart all VMs on the designated failover hosts. The designated failover hosts are essentially “hot standby” hosts. In other words, DRS will not migrate VMs to these hosts when resources are scarce or the cluster is imbalanced.
Pros:
- What you see is what you get.
- No fragmented resources.
Cons:
- What you see is what you get.
- Dedicated failover hosts not utilized during normal operations.
Recommendations
I have been asked many times for my recommendation on Admission Control, and it is difficult to answer as each policy has its pros and cons. However, I generally recommend the Percentage based Admission Control Policy. It is the most flexible policy as it uses the actual reservation per VM instead of taking a “worst case” scenario approach like the slot policy does.
However, the slot policy guarantees the failover level under all circumstances. Percentage based is less restrictive, but offers lower guarantees that in all scenarios HA will be able to restart all VMs. With the added level of integration between HA and DRS I believe the Cluster Resource Percentage Policy will fit most environments.
Basic design principle: Do the math and take customer requirements into account. I recommend using a “percentage” based admission control policy, as it is the most flexible.
Now that I have recommended which Admission Control Policy to use, the next step is to provide guidance around selecting the correct percentage. I cannot tell you what the ideal percentage is as that totally depends on the size of your cluster and, of course, on your resiliency model (N+1 vs. N+2). I can, however, provide guidelines around calculating how much of your resources should be set aside and how to prevent wasting resources.
Selecting the Right Percentage
Pre-vSphere 6.5 it was required to manually specify the percentage for both CPU and memory for the Cluster Resource Percentage Policy. It was a common strategy to select a single host as a percentage of resources reserved for failover; I generally recommended selecting a percentage which is the equivalent of a single host or multiple hosts. Today the percentage can be manually specified, or can be automatically calculated by leveraging “Host failures cluster tolerates”. I highly recommend not changing the percentage manually if there is no reason for it. The big advantage of the automatic calculation is that when new hosts are added to the cluster, or hosts are removed, HA will automatically adjust the percentage value for you. When the percentage value is manually configured, you will need to recalculate and re-configure based on the new outcome of the calculations.
Let’s explain why, and what the impact and risk is of manual calculations that do not use the equivalent of a single or multiple hosts.
Let’s start with an example: a cluster exists of 8 ESXi hosts, each containing 70 GB of available RAM. This might sound like an awkward memory configuration but to simplify things I have already subtracted 2 GB as virtualization overhead. Although virtualization overhead is probably less than 2 GB, I have used this number to make calculations easier. This example zooms in on memory but this concept also applies to CPU, of course.
For this cluster I will define the percentage of resources to reserve for both Memory and CPU to 20%. For memory, this leads to a total cluster memory capacity of 448 GB:
(70 GB + 70 GB + 70 GB + 70 GB + 70 GB + 70 GB + 70 GB + 70 GB) * (1 – 20%)
A total of 112 GB of memory is reserved as failover capacity.
Once a percentage is specified, that percentage of resources will be unavailable for VMs, therefore it makes sense to set the percentage as close to the value that equals the resources a single (or multiple) host represents. I will demonstrate why this is important in subsequent examples.
In the example above, 20% of cluster resources were reserved in an 8-host cluster. This configuration reserves more resources than a single host contributes to the cluster. HA’s main objective is to provide automatic recovery for VMs after a physical server failure. For this reason, it is recommended to reserve resources equal to a single host or multiple hosts. When using per-host granularity in an 8-host cluster (homogeneously configured hosts), the resource contribution per host to the cluster is 12.5%. However, the percentage used must be an integer (whole number). It is recommended to round up to the value guaranteeing that the full capacity of one host is protected; in this example, the conservative approach would lead to a percentage of 13%.
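The arithmetic for this example can be sketched as follows; it simply reproduces the 448 GB / 112 GB / 13% numbers from the text:

```python
import math

# Percentage-based reservation math for the 8-host, 70 GB-per-host example.
hosts = 8
host_mem_gb = 70
pct = 0.20

total_gb = hosts * host_mem_gb                  # 560 GB cluster capacity
usable_gb = total_gb * (1 - pct)                # 448 GB available for VMs
reserved_gb = total_gb * pct                    # 112 GB failover capacity

contribution_pct = 100 / hosts                  # 12.5% per host
recommended_pct = math.ceil(contribution_pct)   # round up -> 13%

print(usable_gb, reserved_gb, recommended_pct)
```

Rounding up to 13% rather than down to 12% is the conservative choice: it guarantees that at least one full host's worth of capacity is protected.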
Aggressive Approach
I have seen many environments where the percentage was set to a value that was less than the contribution of a single host to the cluster. Although this approach reduces the amount of resources reserved for accommodating host failures and results in higher consolidation ratios, it also offers a lower guarantee that HA will be able to restart all VMs after a failure. One might argue that this approach will more than likely work as most environments will not be fully utilized; however it also does eliminate the guarantee that after a failure all VMs will be recovered. Wasn’t that the reason for enabling HA in the first place?
Adding Hosts to Your Cluster
Although the percentage is dynamic and calculates capacity at cluster level, changes to your selected percentage might be required when expanding the cluster. The reason is that the amount of resources reserved for a failover might no longer correspond with the contribution per host and, as a result, lead to resource wastage. For example, adding 4 hosts to an 8-host cluster and continuing to use the previously configured admission control policy value of 13% will result in a failover capacity that is equivalent to 1.5 hosts. Figure 41 depicts a scenario where an 8-host cluster is expanded to 12 hosts. Each host holds eight 2 GHz cores and 70 GB of memory. The cluster was originally configured with admission control set to 13%, which equals 109.2 GB and 24.96 GHz. If the requirement is to allow a single host failure, 7.68 GHz and 33.6 GB is “wasted”, as clearly demonstrated in the diagram below.
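The waste in this example is the gap between the stale 13% setting and the percentage actually needed to protect one host in the 12-host cluster. A minimal sketch of that calculation:

```python
import math

# Waste after growing from 8 to 12 hosts while keeping the old 13% setting.
hosts = 12
host_mem_gb = 70
host_cpu_ghz = 8 * 2                     # eight 2 GHz cores per host

configured_pct = 13                      # leftover from the 8-host design
needed_pct = math.ceil(100 / hosts)      # 9% protects one full host
waste_pct = configured_pct - needed_pct  # 4% over-reserved

waste_mem_gb = hosts * host_mem_gb * waste_pct / 100    # 33.6 GB
waste_cpu_ghz = hosts * host_cpu_ghz * waste_pct / 100  # 7.68 GHz

print(waste_mem_gb, waste_cpu_ghz)
```

This reproduces the 33.6 GB and 7.68 GHz figures from the text and shows why re-evaluating the percentage after every cluster change matters.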
vSAN and vVols Specifics!
In the last couple of sections I have discussed the ins and outs of HA, all of it based on VMFS based or NFS based storage. With the introduction of VMware vSAN and vVols come changes to some of the discussed concepts. We’ve already seen that the use of vSAN potentially changes the design decision around the isolation response. What else is different for HA when vSAN or vVols are used in the environment? Let’s take a look at vSAN first.
HA and vSAN
vSAN is VMware’s approach to Software Defined Storage. I am not going to explain the ins and outs of vSAN, but I do want to provide a basic understanding for those who have never done anything with it. vSAN leverages host local storage and creates a shared data store out of it. Note, this was written when only vSAN OSA was available, but most logic also applies to vSAN ESA.
vSAN requires a minimum of 3 hosts and each of those 3 hosts will need to have 1 SSD for caching and 1 capacity device (can be SSD or HDD). Only the capacity devices will contribute to the total available capacity of the datastore. If you have 1TB worth of capacity devices per host then with three hosts the total size of your datastore will be 3TB.
Having said that, with vSAN 6.1 VMware introduced a “2-node” option. This 2-node option is actually two regular vSAN nodes with a third “witness” node, where the witness node acts as a quorum and does not run workloads, nor does it contribute to the capacity of the vSAN Datastore.
The big differentiator between most storage systems and vSAN is that availability of VMs is defined on a per virtual disk or per VM basis through policy. This is what vSAN calls “Failures To Tolerate” (FTT), and it can be configured to any value between 0 (zero) and 3. When configured to 0, the VM will have only 1 copy of its virtual disks (objects), which means that if a host fails where the virtual disks (objects) are stored, the VM is lost. As such, all VMs are deployed by default with Failures To Tolerate set to 1. A virtual disk is what vSAN refers to as an object. An object, when FTT is configured as 1 or higher, has multiple components. In the diagram below I demonstrate the FTT=1 scenario; the virtual disk in this case has 2 “data components” and a “witness component”. The witness is used as a “quorum” mechanism. Note that in a two node configuration, this witness component would always be stored on the Witness node! Note that the situation below shows the simplest of vSAN policy capabilities. There are also options to increase the stripe size, use RAID-5 or RAID-6, or even increase the Failures To Tolerate to 2 or 3, which would increase the number of components.
As the diagram above depicts, a VM can be running on the first host in the cluster while its storage components are on the remaining hosts in the cluster.
As you can imagine from an HA point of view this changes things as access to the network is not only critical for HA to function correctly but also for vSAN. When it comes to networking note that when vSAN is configured in a cluster HA will use the same network for its communications (heartbeating etc). On top of that, it is good to know that VMware highly recommends 10GbE to be used for vSAN.
Basic design principle: 10GbE is highly recommended for vSAN. As vSphere HA also leverages the vSAN network and availability of VMs is dependent on network connectivity, ensure that at a minimum two 10GbE ports and two physical switches are used for resiliency.
The reason that HA uses the same network as vSAN is simple: it is to avoid network partition scenarios where HA communication is separated from vSAN and the state of the cluster is unclear. Note that you will need to ensure that there is a pingable isolation address on the vSAN network, and this isolation address will need to be configured as such through the use of the advanced setting “das.isolationAddress0”. I also recommend disabling the use of the default isolation address through the advanced setting “das.useDefaultIsolationAddress” (set to false).
If you leave the isolation address set to the default gateway of the management network, then HA will use the management network to verify the isolation. There could be a scenario where only the vSAN network is isolated; in that particular situation VMs will not be powered off (or shut down) when the isolation address is not part of the vSAN network.
When an isolation does occur, the isolation response is triggered as explained in earlier sections. For vSAN the recommendation is simple: configure the isolation response to “Power off and restart VMs”. This is the safest option. vSAN can be compared to the “converged network with IP based storage” example I provided earlier. It is very easy to reach a situation where a host is isolated and all VMs remain running on it while they are restarted on another host, because the connection to the vSAN datastore is lost.
Basic design principle: Configure your Isolation Address and your Isolation Policy accordingly. I recommend selecting “power off” as the Isolation Policy and selecting a reliable pingable device as the isolation address.
Folder Structure with vSAN for HA
What about things like heartbeat datastores and the folder structure that exists on a VMFS datastore, has any of that changed with vSAN? Yes, it has. First of all, in a vSAN-only environment the concept of Heartbeat Datastores is not used at all. The reason for this is straightforward: as HA and vSAN share the same network, it is safe to assume that when the HA heartbeat is lost because of a network failure, so is access to the vSAN datastore. Only in an environment where there is also traditional storage will heartbeat datastores be configured, leveraging those traditional datastores as heartbeat datastores. Note that I do not feel there is a reason to introduce traditional storage just to provide HA this functionality; HA and vSAN work perfectly fine without heartbeat datastores. If, however, you have traditional storage, I do recommend implementing heartbeat datastores as it can help HA with identifying the type of issue that has occurred.
Normally HA metadata is stored in the root of the datastore, for vSAN this is different as the metadata is stored in the VM’s namespace object. The protectedlist is held in memory and updated automatically when VMs are powered on or off.
Now you may wonder, what happens when there is an isolation? How does HA know where to start the VM that is impacted? Let’s take a look at a partition scenario.
In this scenario a network problem has caused a cluster partition. Where a VM is restarted is determined by which partition owns the VM files. Within a vSAN cluster this is fairly straightforward. There are two partitions, one of which is running the VM with its VMDK, while the other partition has a VMDK replica and a witness. Guess what happens!? Right, vSAN uses the witness to verify which partition has quorum and, based on that result, one of the two partitions will win. In this case, Partition 2 has more than 50% of the components of this object and as such is the winner. This means that the VM will be restarted on either “esxi-03” or “esxi-04” by HA. Note that the VM in Partition 1 may or may not be powered off; this depends on whether esxi-01 and esxi-02 can communicate with each other. If esxi-01 and esxi-02 can communicate, the VM will not be powered off, as the isolation response is not triggered. If esxi-01 and esxi-02 cannot communicate, the isolation response will be triggered and the VM will be powered off. Note that for the sake of simplicity I simplified the example. For a detailed understanding of how vSAN components, witnesses, and the quorum mechanism (votes) work, I would like to refer you to the vSAN Deep Dive book by Cormac Hogan, Duncan Epping, and Pete Koehler.
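The quorum decision in this partition scenario can be modeled as a simple majority count over an object's components. This is a toy model of the voting described above, not the real vSAN mechanism (which uses weighted votes); the partition names are hypothetical:

```python
# Toy model: the partition holding a strict majority (>50%) of an
# object's components wins and is allowed to restart the VM.
components = {
    "data-1": "partition-1",   # VMDK copy in Partition 1
    "data-2": "partition-2",   # VMDK replica in Partition 2
    "witness": "partition-2",  # witness component in Partition 2
}

def winning_partition(placement):
    partitions = list(placement.values())
    for p in set(partitions):
        if partitions.count(p) * 2 > len(partitions):  # strict majority
            return p
    return None  # no quorum: the VM cannot be restarted anywhere

print(winning_partition(components))  # partition-2
```

Partition 2 holds two of the three components, so it wins quorum, which is why HA restarts the VM on esxi-03 or esxi-04 in the example.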
One final thing that is different for vSAN is how a partition is handled in a stretched cluster configuration. In a traditional stretched cluster configuration using VMFS/NFS-based storage, VMs impacted by an APD or PDL will be killed by HA through VM Component Protection (VMCP). With vSAN this is slightly different: HA VMCP in 6.0 and higher is not supported with vSAN.
vSAN has its own mechanism, for now at least. vSAN recognizes when a VM running on a group of hosts, in the diagram above let’s say Site B, has no access to any of the components in a stretched cluster. When this is the case, vSAN will simply kill the impacted VM. You can disable this behavior, although I do not recommend doing so, by setting the advanced host setting VSAN.AutoTerminateGhostVm to 0.
Heartbeat datastores, when can they help?
I have already briefly discussed this, but I want to reiterate it, as it is a topic which is often skipped and overlooked. Even in a vSAN environment Heartbeat Datastores can be useful. Let us first go over an isolation scenario briefly again and then discuss why the Heartbeat Datastore could be “useful”. I used quotes here on purpose, as some might not prefer the resulting behavior when a heartbeat datastore is defined in a vSAN world.
When is an isolation declared? A host declares itself isolated when:
- It is not receiving any communication from the master
- It cannot ping the isolation address
- It is not receiving any election traffic from any other host in the cluster
If you have not set any advanced settings, then the default gateway of the management network will be the isolation address. Now imagine the vSAN network being isolated on a given host while, for whatever reason, the management network is not. In that scenario isolation is not declared: the host can still ping the isolation address using the management network VMkernel interface. However, vSphere HA will still restart the VMs. The VMs have lost access to disk, and as such the locks on the VMDKs are lost. HA notices the host is gone, which to HA must mean the VMs are dead, as the locks are lost. It will then try to restart the VMs.
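Putting the three conditions together: a host declares itself isolated only when all of them are true. A minimal sketch of that decision (a hypothetical helper of my own, not actual FDM code), including the scenario just described where only the vSAN network fails:

```python
def declares_isolation(hears_master, pings_isolation_address, sees_election_traffic):
    """A host declares itself isolated only when it hears nothing from the
    master, cannot ping the isolation address, AND sees no election traffic."""
    return not (hears_master or pings_isolation_address or sees_election_traffic)

# vSAN network down, but the management network (and thus the default
# isolation address) still reachable: the host does NOT declare isolation,
# even though its VMs have lost access to their storage.
print(declares_isolation(hears_master=False,
                         pings_isolation_address=True,
                         sees_election_traffic=False))  # False
```

This is exactly why the scenario is dangerous: the host never triggers its isolation response, while the rest of the cluster restarts the same VMs elsewhere.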
That is when you could find yourself in the situation where the VMs are running on the isolated host and also somewhere else in the cluster, both with the same MAC address and the same name / IP address. Not a good situation to be in when those VMs are still accessible over the VM network. Although this is not likely, it is still a risk with a relatively high impact.
This is where the heartbeat datastore could come in handy. If you had datastore heartbeats enabled, and accessible during the failure, then this would be prevented. The isolated host would simply inform the master that it is isolated through the heartbeat datastore, and it would also inform the master about the state of the VMs, which in this scenario would be powered-on. The master would then decide not to restart the VMs. Do realize that the VMs running on the isolated host are more or less useless, as they cannot write to disk anymore. Although the heartbeat datastore will prevent the VMs from being restarted, and as such avoid the duplicate MAC address and/or IP address issue, this could still be considered undesirable, as the VMs may be unusable since they cannot write to disk.
Basic design principle: There is no right or wrong in this case. Whether you should, or should not, use Heartbeat Datastores entirely depends on your preferred outcome. As such I recommend testing with and without heartbeat datastores, and configuring based on your preferred outcome.
HA and vVols
Let us start by first describing what vVols is and what value it brings to an administrator. vVols was developed to make your life (vSphere admin) and that of the storage administrator easier. This is done by providing a framework that enables the vSphere administrator to assign policies to VMs or virtual disks, not unlike vSAN. In these policies, capabilities of the storage array can be defined. These capabilities can be things like snapshotting, deduplication, RAID level, thin / thick provisioning, etc. What is offered to the vSphere administrator is up to the storage administrator, and of course up to what the storage system can offer to begin with. The screenshots below show an example of some of the capabilities Nimble exposes through policy.
When a VM is deployed and a policy is assigned, the storage system will enable certain functionality of the array based on what was specified in the policy. So there is no longer a need to assign capabilities to a LUN that holds many VMs; instead you get control at a per-VM, or even per-VMDK, level. So how does this work? Well, let’s take a look at an architectural diagram first.
The diagram shows a couple of components that are important in the vVols architecture. Let’s list them out:
- Protocol Endpoints aka PE
- Virtual Datastore and a Storage Container
- Vendor Provider / VASA
- Policies
- vVols
Let’s take a look at all of these in the above order. Protocol Endpoints, what are they?
Protocol Endpoints are literally the access points to your storage system. All I/O to vVols is proxied through a Protocol Endpoint, and you can have 1 or more of these per storage system, if your storage system supports having multiple of course. (Implementations of different vendors will vary.) PEs are compatible with different protocols (FC, FCoE, iSCSI, NFS). You could view a Protocol Endpoint as a “mount point” or a device, and yes, they will count towards your maximum number of devices per host (1024 as of vSphere 6.7). (vVols themselves won’t count towards that!)
Next up is the Storage Container. This is the place where you store your VMs, or better said, where your vVols end up. The Storage Container is a storage system logical construct and is represented within vSphere as a “virtual datastore”. You need at least 1 per storage system, but you can have many when desired. To this Storage Container you can apply capabilities. If you would like your vVols to be able to use array-based snapshots, then the storage administrator will need to assign that capability to the storage container. Note that a storage administrator can grow a storage container without even informing you. A storage container isn’t formatted with VMFS or anything like that, so you don’t need to increase the volume in order to use the space.
But how does vSphere know which container is capable of doing what? In order to discover a storage container and its capabilities, vSphere needs to be able to talk to the storage system first. This is done through the vSphere APIs for Storage Awareness (VASA). You simply point vSphere to the Vendor Provider, and the Vendor Provider will report to vSphere what’s available; this includes both the storage containers and the capabilities they possess. Note that a single Vendor Provider can manage multiple storage systems, which in turn can have multiple storage containers with many capabilities. These Vendor Providers can also come in different flavors: for some storage systems it is part of their software, but for others it comes as a virtual appliance that sits on top of vSphere.
Now that vSphere knows which systems there are and which containers are available with which capabilities, you can start creating policies. These policies can be a combination of capabilities and will ultimately be assigned to VMs, or even individual virtual disks. You can imagine that in some cases you would like Quality of Service enabled to ensure performance for a VM, while in other cases it isn’t as relevant, but you need to have a snapshot every hour. All of this is enabled through these policies. No longer will you be maintaining that spreadsheet with all your LUNs and which data services were enabled; no, you simply assign a policy. (Yes, a proper naming scheme will be helpful when defining policies.) When requirements change for a VM, you don’t move the VM around; no, you change the policy, and the storage system will do what is required to make the VM (and its disks) compliant again with the policy. Well, not the VM really, but its vVols.
Okay, those are the basics. Now what about vVols and vSphere HA? What changes when you are running vVols, and what do you need to keep in mind when it comes to HA?
First of all, let me mention this: in some cases storage vendors have designed a solution where the Vendor Provider isn’t implemented in an HA fashion (VMware allows for Active/Active, Active/Standby, or just “Active” as in a single instance). Make sure to validate what kind of implementation your storage vendor has, as the Vendor Provider needs to be available when powering on VMs. The following quote explains why:
“When a vVol is created, it is not immediately accessible for IO. To Access vVols, vSphere needs to issue a “Bind” operation to a VASA Provider (VP), which creates IO access point for a vVol on a Protocol Endpoint (PE) chosen by a VP. A single PE can be the IO access point for multiple vVols. “Unbind” Operation will remove this IO access point for a given vVol.”
That is the vVols implementation aspect, but of course things have also changed from a vSphere HA point of view. No longer do we have VMFS or NFS datastores to store files on or use for heartbeating. So what changes from that perspective? First of all, a VM is carved up into different vVols:
- VM Configuration
- VM Disks
- Swap File
- Snapshots (if there are any)
Besides these different types of objects, when vSphere HA is enabled there is also a volume used by vSphere HA, and this volume will contain all the metadata that is normally stored under “/<root of datastore>/.vSphere-HA/<cluster-specific-directory>/” on regular VMFS. For each HA cluster a separate folder is created in this vVol, as shown in the screenshot below.
All VM-related HA files that would normally be under the VM folder, like for instance the power-on file, heartbeat files and the protectedlist, are now stored in the VM Configuration vVol object. Conceptually speaking this is similar to regular VMFS; implementation-wise, however, it is completely different.
The power-off file however, which is used to indicate that a VM has been powered off due to an isolation event, is no longer stored under the .vSphere-HA folder but is stored in the VM config vVol (in the UI exposed as the vVol VM folder), as shown in the screenshot below. The same applies for vSAN, where it is now stored in the VM namespace object, and for traditional storage (NFS or VMFS), where it is stored in the VM folder. This change was made when vVols was introduced and was done to keep the experience consistent across storage platforms.
That explains the differences between traditional storage systems using VMFS / NFS and new storage systems leveraging vVols or even a full vSAN based solution.
Advanced Settings
There are various types of advanced settings, and there is a KB article which explains them, but let me summarize and simplify it a bit to make it easier to digest.
There are various sorts of advanced settings, but for HA three in particular:
- das.* –> Cluster level advanced setting.
- fdm.* –> FDM host level advanced setting.
- vpxd.* –> vCenter level advanced setting.
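The prefix alone tells you at which level a setting applies. A trivial illustration of this naming convention (the helper function is mine, purely for illustration):

```python
# Map an advanced-setting prefix to the level at which it is configured,
# following the das.* / fdm.* / vpxd.* convention described above.
PREFIX_SCOPE = {
    "das.": "cluster",
    "fdm.": "FDM host",
    "vpxd.": "vCenter",
}

def setting_scope(name):
    """Return the configuration scope of an HA advanced setting by prefix."""
    for prefix, scope in PREFIX_SCOPE.items():
        if name.startswith(prefix):
            return scope
    raise ValueError(f"unrecognized advanced setting: {name}")

print(setting_scope("das.isolationaddress0"))  # cluster
```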
How do you configure these?
Configuring these is typically straightforward, and most of you hopefully know this already. If not, let us go over the steps to help configure your environment as desired.
Cluster Level
In the vSphere Client:
- Go to “Hosts and Clusters”
- Click your cluster object
- Click the “Configure” tab
- Click “vSphere Availability”
- Click “Edit” on “vSphere HA”
- Click the “Advanced Options” button
FDM Host Level
- Open up an SSH session to your host and edit “/etc/opt/vmware/fdm/fdm.cfg”
vCenter Level
In the vSphere Client:
- Go to “Hosts and Clusters”
- Click the appropriate vCenter Server
- Click the “Configure” tab
- Click “Advanced Settings” under “Settings”
Most Commonly Used
In this section I will primarily focus on the ones most commonly used; a full detailed list can be found in KB 2033250.
- das.maskCleanShutdownEnabled
Whether the clean shutdown flag will default to false for an inaccessible and powered-off VM. Enabling this option will trigger VM failover if the VM’s home datastore isn’t accessible when it dies or is intentionally powered off.
- das.ignoreInsufficientHbDatastore
Suppresses the host config issue that the number of heartbeat datastores is less than das.heartbeatDsPerHost. The default value is “false”. Can be configured as “true” or “false”.
- das.heartbeatDsPerHost
The number of required heartbeat datastores per host. The default value is 2; the value should be between 2 and 5.
- das.isolationaddress[x]
IP address the ESXi hosts use to check for isolation when no heartbeats are received, where [x] = 0 - 9. (See the screenshot below for an example.) HA will use the default gateway as an isolation address and the provided value as an additional checkpoint. I recommend adding an isolation address when a secondary service console is being used for redundancy purposes.
- das.usedefaultisolationaddress
Value can be “true” or “false” and needs to be set to “false” in case the default gateway, which is the default isolation address, should not or cannot be used for this purpose. In other words, if the default gateway is a non-pingable address, set “das.isolationaddress0” to a pingable address and disable the usage of the default gateway by setting this to “false”.
- das.isolationShutdownTimeout
Time in seconds to wait for a VM to become powered off after initiating a guest shutdown, before forcing a power off.
- das.allowNetwork[x]
Enables the use of port group names to control the networks used for HA, where [x] = 0 - ?. You can set the value to be “Service Console 2” or “Management Network” to use (only) the networks associated with those port group names in the networking configuration. Note that in 5.5 this option is ignored when vSAN is enabled!
- das.ignoreRedundantNetWarning
Removes the error icon/message from vCenter when you don’t have a redundant Service Console connection. The default value is “false”; setting it to “true” will disable the warning. HA must be reconfigured after setting the option.
- das.perHostConcurrentFailoversLimit
By default, HA will issue up to 32 concurrent VM power-ons per host. This setting controls the maximum number of concurrent restarts on a single host. Setting a larger value will allow more VMs to be restarted concurrently, but will also increase the average latency to recover, as it adds more stress on the hosts and storage.
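The trade-off of das.perHostConcurrentFailoversLimit can be illustrated with a back-of-the-envelope model (my own, not an HA formula): with N VMs to restart on one host and a concurrency limit L, restarts happen in roughly ceil(N/L) sequential waves.

```python
import math

def restart_waves(vm_count, limit=32):
    """Rough number of sequential power-on waves on a single host, given
    HA's per-host concurrent failover limit (default 32)."""
    return math.ceil(vm_count / limit)

# Raising the limit reduces the number of waves, at the cost of putting
# more simultaneous stress on the host and its storage.
print(restart_waves(100))            # 4
print(restart_waves(100, limit=64))  # 2
```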
I recommend avoiding the use of advanced settings as much as possible. It typically leads to increased complexity, and when unneeded it can lead to more downtime rather than less.