After reading Aaron's excellent articles (1, 2) on Scott Lowe's blog, I remembered a discussion I had with a couple of co-workers. The discussion was about VMware HA cluster design in blade environments.
The thing that started this discussion was an HA "problem" that occurred at a customer site. This specific customer had two blade chassis to avoid a single point of failure in his virtual environment. All blade servers were joined in one big cluster to get the most out of the environment in terms of Distributed Resource Scheduling.
Unfortunately for this customer, at one point one of his blade chassis failed. In other words: power off on the chassis, all blades gone at the same time. The first thing that comes to mind is: HA will kick in and the VMs will be up and running in no time.
The first minute passed and no VMs were powered on, nor did any of them power up in the minutes that followed. The customer filed a support call, as you would expect him to. After a close examination of his environment, VMware Support returned the following: "HA is working as designed".
So what happened?
This customer created one big cluster out of his two fully equipped HP C7000 chassis. In total the cluster contained 32 hosts, which is fully supported with ESX 3.5 U3. I hope most of you have read and understood my "HA Deepdive" page; if you did, you probably know what happened…
All five primary hosts resided on the blade chassis that failed. When that chassis went down, no primary hosts were left and thus the "fail-over coordinator" was also gone. No fail-over coordinator means no VM restarts. VMware Support was correct: HA worked as designed. Now one could argue whether it's supposed to be designed to work like this, but that's not what I wanted to discuss in this article. (I would love to see an advanced option, though, with which you can designate primaries.)
This situation could have been prevented in two ways:
- Take primary role placement into account
- Change your HA Cluster design
Now let's go into both workarounds/solutions:
1. Take primary role placement into account
Let's assume you've got 32 blades divided over two chassis, 16 in each. There's a reasonable chance of having all five primaries running on the same chassis. This solution has nothing to do with design; you would need to change your procedures. For instance:
- When placing a primary host in maintenance mode a secondary host is promoted.
- When removing a primary host from the cluster a secondary is promoted.
There's no supported method of manually designating a particular host as primary, which means you would have to juggle hosts in and out of maintenance mode to end up with the desired configuration. As you know by now, promotion of a secondary to primary happens at random… you will need a bit of luck.
This also means you would have to manually check which hosts are currently primary:
cat /var/log/vmware/aam/aam_config_util_listnodes.log
This results in the following in my test environment:
Node                      Type          State
-----------------------   ------------  --------------
yb-esx001                 Secondary     Agent Running
yb-esx002                 Primary       Agent Running
yb-esx003                 Primary       Agent Running
yb-esx004                 Primary       Agent Running
yb-esx005                 Primary       Agent Running
yb-esx006                 Secondary     Agent Running
yb-esx007                 Secondary     Agent Running
yb-esx008                 Primary       Agent Running
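If you want to automate that check, a small script could parse this output and warn you when one chassis holds all the primaries. This is just a rough sketch, not a supported tool: it assumes you feed it the contents of the listnodes log on standard input, and the host names and host-to-chassis mapping are examples from my test environment that you would have to replace with your own.

import sys

# Example mapping of hosts to chassis; replace with your own environment.
CHASSIS = {
    "yb-esx001": "chassis1", "yb-esx002": "chassis1",
    "yb-esx003": "chassis1", "yb-esx004": "chassis1",
    "yb-esx005": "chassis2", "yb-esx006": "chassis2",
    "yb-esx007": "chassis2", "yb-esx008": "chassis2",
}

def primaries_per_chassis(listnodes_output):
    """Count primary nodes per chassis from the listnodes log text."""
    counts = dict.fromkeys(set(CHASSIS.values()), 0)
    for line in listnodes_output.splitlines():
        parts = line.split()
        # Expect data lines such as: "yb-esx002   Primary   Agent Running"
        if len(parts) >= 2 and parts[1] == "Primary" and parts[0] in CHASSIS:
            counts[CHASSIS[parts[0]]] += 1
    return counts

if __name__ == "__main__":
    counts = primaries_per_chassis(sys.stdin.read())
    print(counts)
    if 0 in counts.values():
        print("WARNING: at least one chassis holds no primary node!")

You could, for instance, copy the log off a host and pipe it into the script; the point is simply that the check can be scripted instead of eyeballed every day.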
2. Change your HA Cluster design
Let's assume again you've got 32 blades divided over two chassis, 16 blades in each. You should create four clusters of 8 hosts, with 4 hosts from each chassis in every cluster. This way each of the two chassis will always hold at least one of the primaries for that cluster: with five primaries in an 8-host cluster and only four hosts per chassis, the primaries can never all end up on the same chassis.
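To put a rough number on the difference: assume, purely for the sake of illustration, that the five primaries end up on random hosts over time (promotions happen at random, so this is a simplification and not a description of how the election really works). Then you can compare the two designs:

from math import comb

def p_all_primaries_on_one_chassis(hosts_a, hosts_b, primaries=5):
    """Chance that all primaries land on a single chassis, assuming random placement."""
    total = comb(hosts_a + hosts_b, primaries)
    return (comb(hosts_a, primaries) + comb(hosts_b, primaries)) / total

# One 32-host cluster, 16 blades in each chassis:
print(p_all_primaries_on_one_chassis(16, 16))  # ~0.043, roughly a 1-in-23 chance
# One 8-host cluster, 4 blades in each chassis:
print(p_all_primaries_on_one_chassis(4, 4))    # 0.0, it simply cannot happen

Roughly 1 in 23 may not sound like much, but given the impact of a chassis failure and the fact that the primaries shift around every time you use maintenance mode, it's not a risk I would want to design in.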
Bottom line
For option 1 you would need a procedure in place to regularly check which nodes are primary and which are secondary. You would also need to juggle nodes in and out of maintenance mode to end up with at least one primary node in each chassis. This isn't a workable solution in my opinion. I would prefer option 2: change the design and create several smaller clusters instead of one big cluster. That's exactly what this customer ended up doing.
As discussed in the comments, I already proposed a new feature that would make it possible to manually pick your primaries via an advanced setting. Hopefully this will be added in the near future. (das.preferredPrimary)
The downside of smaller clusters is that you have to reserve relatively more capacity for a host failure. In a big cluster, say twenty hosts with two host failures allowed, you hardly have to reserve anything: as long as no host runs above 90% CPU, those two host failures come for free. In a 3-node cluster with one host failure allowed, you can only use 66% of your server capacity.
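Just to make that trade-off concrete, here is a back-of-the-envelope view; it assumes equally sized hosts and ignores HA's actual slot-size calculations:

def usable_fraction(n_hosts, host_failures_tolerated):
    """Fraction of total cluster capacity you can use while still being able
    to restart everything after the configured number of host failures."""
    return (n_hosts - host_failures_tolerated) / n_hosts

print(usable_fraction(3, 1))    # ~0.67 -> the 3-node example above
print(usable_fraction(8, 1))    # 0.875 -> one of the 8-host clusters from option 2
print(usable_fraction(20, 2))   # 0.90  -> the 20-host example above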
Great article, thanks for sharing.
This really matches real-world experience; I can use this information to build best practices around HA.
OK, let's say there's a difference between a host failure and a chassis failure. What you could also do in this situation is create 16 two-node clusters, each with a host from both chassis. You can only use 50% of your CPU power, but if a chassis goes down all the VMs will start on the other host/chassis.
Eric S., I guess if you were talking about traditional servers I would share your opinion, but in a blade environment I wouldn't take the risk. The chance of having multiple host failures in a blade environment is higher than in a normal server environment. (It happened to me once, and let me tell you it wasn't a pretty sight :-)) If you pay for HA, I would also pay for the extra host(s). And you also need to take into account that "host failures" is about tolerating simultaneous host failures. I think the chance of 2 out of 16 standalone hosts dying simultaneously is smaller than the chance of 16 out of 16 blades in one chassis dying simultaneously…
In the end it's not me who decides; it's the customer that will need to take the budget into account. I want to deliver a decent design with full redundancy.
Wouldn't it be great to have an advanced setting that could work like this:
1. Define if the host is a blade
2. Define the number of chassis in the cluster
3. If there is more than one chassis, define which chassis each host is in
If HA had that information it could easily prevent that. Maybe in the future.
Steve, perhaps an easier solution to implement would simply be the ability to designate a host as a “preferred primary,” meaning that anytime there are less than 5 primary nodes it will be next in line to be promoted to primary. In Duncan’s situation, you could then mark at least 2 nodes in the second chassis as preferred primaries, easily ensuring that the same situation with a chassis failure doesn’t take down the entire cluster.
Duncan, I think something like “das.preferredprimary” would do quite nicely…
Indeed “das.preferredprimary” would be a nice additional feature! I’ve already pinged a couple of engineers.
Let's take into account that, according to David Weinstein, the number of primary nodes in an HA cluster depends on the number of host failures HA will support; so if your HA cluster supports 1 host failure, you will only have two primary nodes. How big is the chance that both of the primary hosts are on the same chassis?
Duncan, some good extra info in here. I posted on this back in December. http://rodos.haywood.org/2008/12/blade-enclosures-and-ha.html
In terms of cluster design, I think HA will be one of many factors; it's a tension.
It's just not true. The first 5 hosts in a new cluster become primary, and no more than 5 hosts will become primary; the others become secondary nodes. I verified this information with one of our developers a couple of months ago.
And check the Resource Management Guide, page 78.
The optimal design is clearly to split the 32 blades across 32 chassis, so a single chassis failure doesn’t have any impact.
vIdiot
Great information, Duncan. I think there are lots of folks with strong reservations about blades and ESX. We're making that leap in our environment soon and we feel comfortable with it, but as with anything there are trade-offs and pitfalls. Thanks for eliminating one pitfall from my implementation…
Duncan – Just wanted to say thank you all around. Thank you for the links and thank you for the information! I know some of our larger customers will use information like this going forward.
Once again an excellent article on advanced HA.
Duncan, I think something like “das.preferredprimary” would do quite nicely…
That is a sensible solution; however, I personally would put a "preferred primary" on both chassis. As the primaries can move from host to host, you could end up in a situation where all primaries are on the chassis where you have set your "preferred primaries".
just my 2 pence worth. or 3 cents (US) or 2.02 Cents (EU)
In reply to Scott about the "das.preferredprimary": "meaning that anytime there are less than 5 primary nodes it will be next in line" won't work, because by then you're too late. I would like it to work in such a way that this host will always be a primary unless it is in maintenance mode or 5 other hosts have also enabled this option (but then your design is not correct).
In other words, chassis 1 would have 2 hosts marked as preferred primary and chassis 2 would have 2 (the 5th would be determined by HA itself). Then, if one of the preferred primary hosts goes into maintenance mode, any other host can become primary. When the preferred primary host comes back from maintenance, it will become primary again.
Just my 2 cents (a bit cheaper than Tom)
Duncan, great article.
This reminds me of a comment from an HP engineer. When we told him our concerns about using blades, he stated that the enclosure could not fail. The backplane is an 8mm PCB with only copper traces, so nothing could go wrong. He claimed HP had seen one enclosure failure, and that was due to transport damage.
As Gabrie states, it would be nice if HA spreads out the primary nodes across multiple enclosures. HA would have to be aware of the underlying hardware/enclosures to make that happen.
Could you share which manufacturer the enclosure that failed came from?
Great article! Keep on sharing such info.
I've double-checked our environment, which consists of 32 hosts spread over 5 HP C7000 enclosures, and fortunately the primaries are not grouped on the same enclosure…
Anyway, as I don't want to split my single big VMware cluster into small pieces to accommodate this design flaw, I guess I will have to schedule a cat /var/log/vmware/aam/aam_config_util_listnodes.log every day or so…
Operators will be happy, I can hear that from here 🙂
Great information! The HA feature does need some planning before implementation.
Just to add another aspect to the discussion regarding the idea of spreading the hosts between different enclosures:
In my eyes we would also benefit in terms of I/O performance by spreading the hosts.
At least, our configuration includes a SAN switch in every enclosure with a limited number of uplinks to the SAN backbone.
By spreading the hosts between the enclosures we ensure better SAN connectivity for the other servers in the same enclosure (yep, we still have some physical servers as well). Putting too many hosts in the same enclosure would affect the bandwidth in a bad way.
@Gabrie van Zanten… I was thinking the same thing… Scott's solution is what HA does today!
As always, another option would be for the secondary hosts to monitor the primary hosts and, if none are responding, force an election of 5 new primary hosts to take over. This way it could be 'automatic' and out of the box.
Euhm, JustMe, Scott was talking about a designated primary node for blade environments… Your solution won't cover the fact that an entire chassis can go down. When a re-election is forced, it will probably re-collect the state and won't know it needs to restart certain VMs. So for a blade chassis you still want to designate primaries.
I would love to see it like this:
bladechassis01, bladechassis02
“das.preferredprimary.01” = bladec01-h01 bladec01-h02 bladec01-h03
“das.preferredprimary.02” = bladec02-h01 bladec02-h02 bladec02-h03
So you can pick primaries for both chassis and HA does the rest. Maybe even just say you've got blade chassis and divide the cluster up into two chunks, just for administration and primary purposes; it would also make a nice view in vCenter, I guess.
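Purely to illustrate the idea (none of this exists today; the setting, its behaviour and the host names below are hypothetical and borrowed from the example above), the election could prefer the flagged hosts and let HA fill the remaining slots at random, just like it does now:

import random

def elect_primaries(hosts_by_chassis, preferred, in_maintenance=frozenset(), total_primaries=5):
    """Hypothetical sketch: hand out the primary role, honouring preferred hosts first."""
    available = [h for hosts in hosts_by_chassis.values()
                 for h in hosts if h not in in_maintenance]
    # Preferred hosts that are not in maintenance mode get the primary role first...
    primaries = [h for h in available if h in preferred][:total_primaries]
    # ...and the remaining slots are filled at random, as HA does today.
    others = [h for h in available if h not in primaries]
    random.shuffle(others)
    primaries += others[:total_primaries - len(primaries)]
    return set(primaries)

hosts = {
    "bladechassis01": ["bladec01-h%02d" % i for i in range(1, 17)],
    "bladechassis02": ["bladec02-h%02d" % i for i in range(1, 17)],
}
# Six preferred hosts, three per chassis; only five can actually hold the role,
# but either way each chassis ends up with at least one primary.
preferred = {"bladec01-h01", "bladec01-h02", "bladec01-h03",
             "bladec02-h01", "bladec02-h02", "bladec02-h03"}
print(elect_primaries(hosts, preferred))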
Very interesting,
However, it feels like we are addressing a shortfall.
Ideally VMware should be updated to be 'blade aware' so administrators can plan around the unique challenges introduced by blade servers.
Does anyone know if the number of allowed primary nodes will be increased in ESX 4?
I think that information is still under NDA.
I think N+1 could be the solution. The maximum number of hosts in an HA cluster is currently 32, and running 16 hosts in each of 2 chassis is the worst-case scenario. So why not have 17 primaries in one cluster?
Another thought is to elect the primaries across the hosts that respond slowest and fastest to the heartbeat. My assumption is that if all primaries are in the same chassis they will have a similarly fast response time on the heartbeat, whereas hosts in the other chassis will respond more slowly than the ones in the same chassis.
Thanks for the information; this is something that I have always considered a very small risk in a blade deployment with large clusters. The majority of my designs and deployments have used the c7000 with two OA modules, 10 fans and 6 power supplies, multiple Virtual Connect or Cisco and Brocade modules connected to different network stacks and fabrics, and separate power distribution, so my recommendation is: mitigate the risk by reducing the possible causes of failure for the chassis itself.
Awesome article. I have gone through a couple of your articles (all are equally good). I have a question for Duncan: where do you get the advanced settings and values that you specify in your articles?
For example, das.vmCpuMinMHz: which official VMware document explains this in detail? My question is, is there a VMware ESX/vSphere reference guide that covers all these in depth?
I'm looking for a design guide for VMware ESX environments that covers all advanced settings and their details. If not, how did you find out that values like "das.vmCpuMinMHz" exist and how they can best be configured?
Thanks
Deepu
Hello,
Has anyone had any experience with UCS blades, and whether this "no more than 4 hosts per chassis per cluster" rule applies to UCS?
A consulting firm we are using has proposed a 10-host cluster over 2 chassis, and the best-practice light bulb lit up in my mind.
It applies to every piece of HW out there.
Isn't the solution simple? Use a maximum of 4 servers from the same enclosure in the same cluster. That way, at least one primary will always be located in another enclosure.
That is a simplistic way of solving it but certainly not the best way.
Are you suggesting that those with 16-blade chassis only populate 25% of the capacity? Or are you recommending buying new infrastructure and populating it with larger hosts?
Duncan,
Is this still a concern with vSphere 5? Or is the rewritten HA structure safer?
No concern anymore indeed with 5.0.