After reading Aaron’s excellent articles (1, 2) on Scott Lowe’s Blog I remembered a discussion I had with a couple of co-workers. The discussion was about VMware HA Cluster design in Blade Environments.

The thing that started this discussion was an HA “problem” that occurred at a customer site. This specific customer had two blade chassis to avoid a single point of failure in his virtual environment. All blade servers were joined in one big cluster to get the most out of the environment in terms of Distributed Resource Scheduling.

Unfortunately for this customer, at one point one of his blade chassis failed. In other words: power off on the chassis, all blades gone at the same time. The first thing that comes to mind is: HA will kick in and the VMs will be up and running in no time.

The first minute passed and no VMs were powered on, nor did any of these VMs power up in the minutes that followed. The customer filed a support call, as you would expect him to. After a close examination of his environment VMware Support returned the following: “HA is working as designed”.

So what happened?

This customer created one big cluster of his two fully equipped HP C7000 chassis. In total his cluster contained 32 hosts, which is fully supported by ESX 3.5 U3. I hope most of you read and understood my “HA Deepdive” page, and if you did you probably know what happened…

All five primary hosts resided on the Blade Chassis that failed. When this chassis failed no primary hosts were left and thus the “fail-over coordinator” was also gone. No fail-over coordinator means no VM restarts. VMware support was correct, HA worked as designed. Now one could argue if it’s supposed to be designed to work like this or not, but that’s not what I wanted to discuss in this article. (I would love to see an advanced option though with which you can designate primaries.)

This situation could have been prevented in two ways:

  1. Take primary role placement into account
  2. Change your HA Cluster design

Now let’s go into both workarounds/solutions:

1. Take primary role placement into account

Let’s assume you’ve got 32 blades divided over two chassis, 16 in each. There’s a reasonable chance of having all five primaries running on the same chassis. This solution has nothing to do with design; you would need to change your procedures. For instance:

  • When placing a primary host in maintenance mode a secondary host is promoted.
  • When removing a primary host from the cluster a secondary is promoted.
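
To put a rough number on the chance mentioned above, here is a minimal sketch. It assumes the five primaries are spread uniformly at random over all hosts, which is a simplification: the real odds depend on the order in which hosts joined the cluster and on earlier promotions. The function name and parameters are made up for this illustration.

```python
from math import comb

def prob_all_primaries_on_one_chassis(hosts_per_chassis: int,
                                      chassis: int = 2,
                                      primaries: int = 5) -> float:
    """Chance that every primary sits on the same chassis.

    ASSUMPTION: primaries are a uniformly random subset of the hosts.
    This is an illustrative model, not how HA actually elects primaries.
    """
    total_hosts = hosts_per_chassis * chassis
    # Ways to pick all primaries from one specific chassis...
    on_one_chassis = comb(hosts_per_chassis, primaries)
    # ...times the number of chassis, over all possible picks.
    return chassis * on_one_chassis / comb(total_hosts, primaries)

# One 32-host cluster, 16 blades per chassis: about 4.3% under this model.
p = prob_all_primaries_on_one_chassis(16)
print(f"{p:.1%}")
```

Under this model the risk is small but far from zero, and a badly timed sequence of maintenance-mode operations can make it much worse.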

There’s no supported method of manually designating the primary role to a particular host, which means you would have to juggle hosts in and out of maintenance mode to end up with the desired configuration. As you know by now, promotion of a secondary to primary is random… you will need a bit of luck.

This also means you would have to manually check which host is primary at the moment:

cat /var/log/vmware/aam/aam_config_util_listnodes.log

This results in the following in my test environment:

        Node              Type              State
-----------------------  ------------    --------------
  yb-esx001              Secondary    Agent Running
  yb-esx002              Primary      Agent Running
  yb-esx003              Primary      Agent Running
  yb-esx004              Primary      Agent Running
  yb-esx005              Primary      Agent Running
  yb-esx006              Secondary    Agent Running
  yb-esx007              Secondary    Agent Running
  yb-esx008              Primary      Agent Running
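
If you want to automate this check, something along these lines could work. It is only a sketch: it parses the listing shown above and reports any chassis that holds no primary. The host-to-chassis mapping (CHASSIS_OF) is hypothetical; you would have to maintain it yourself, since the log does not contain chassis information.

```python
from collections import defaultdict

# Listing copied from the test environment output above.
SAMPLE = """\
        Node              Type              State
-----------------------  ------------    --------------
  yb-esx001              Secondary    Agent Running
  yb-esx002              Primary      Agent Running
  yb-esx003              Primary      Agent Running
  yb-esx004              Primary      Agent Running
  yb-esx005              Primary      Agent Running
  yb-esx006              Secondary    Agent Running
  yb-esx007              Secondary    Agent Running
  yb-esx008              Primary      Agent Running
"""

# HYPOTHETICAL mapping of hosts to chassis, for illustration only.
CHASSIS_OF = {f"yb-esx00{i}": ("A" if i <= 4 else "B") for i in range(1, 9)}

def parse_nodes(listing: str) -> dict[str, str]:
    """Return {hostname: role}, skipping the header and separator lines."""
    roles = {}
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] in ("Primary", "Secondary"):
            roles[parts[0]] = parts[1]
    return roles

def chassis_without_primary(roles: dict[str, str],
                            chassis_of: dict[str, str]) -> list[str]:
    """List every chassis on which no host currently has the primary role."""
    primaries = defaultdict(int)
    for host, chassis in chassis_of.items():
        primaries[chassis] += roles.get(host) == "Primary"
    return sorted(c for c, count in primaries.items() if count == 0)

roles = parse_nodes(SAMPLE)
print(chassis_without_primary(roles, CHASSIS_OF))  # empty list: both chassis hold a primary
```

A non-empty result would be the signal to start juggling hosts in and out of maintenance mode.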

2. Change your HA Cluster design

Let’s assume again you’ve got 32 blades divided over two chassis, 16 blades in each chassis. You should create 4 clusters of 8 hosts, 4 hosts per chassis in a cluster. This way each of the two chassis will hold at least one of the primaries for that cluster. (Five primaries in an 8-host cluster with only four hosts per chassis cannot all end up on the same chassis.)
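
The guarantee above can be double-checked by brute force. The sketch below simply enumerates every possible set of five primaries in a two-chassis cluster and verifies that each set touches both chassis. The function and the "A"/"B" chassis labels are made up for this illustration.

```python
from itertools import combinations

def every_placement_spans_both_chassis(hosts_per_chassis: int,
                                       primaries: int = 5) -> bool:
    """True if no possible set of primaries fits on a single chassis."""
    # Model each host as a (chassis, slot) pair on chassis "A" or "B".
    hosts = [(chassis, slot)
             for chassis in ("A", "B")
             for slot in range(hosts_per_chassis)]
    return all(len({chassis for chassis, _ in combo}) == 2
               for combo in combinations(hosts, primaries))

print(every_placement_spans_both_chassis(4))   # 8-host cluster, 4 per chassis
print(every_placement_spans_both_chassis(16))  # 32-host cluster, 16 per chassis
```

The first call covers the proposed design and holds for all 56 possible placements; the second covers the original 32-host cluster, where a single-chassis placement exists.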

Bottom line

For option 1 you would need a procedure in place to regularly check which nodes are primary and which are secondary. You would also need to juggle nodes in and out of maintenance mode to get at least one primary node in each chassis. This isn’t a workable solution in my opinion. I would prefer option 2: change the design and create several smaller clusters instead of one big cluster. That’s exactly what this customer ended up doing.

As discussed in the comments, I already proposed a new feature where it would be possible to manually pick your primaries by setting an advanced setting (das.preferredPrimary). Hopefully this will be added in the near future.