I love blogging because of the discussions you sometimes get into. One of the bloggers I highly respect and closely follow is EMC’s vSpecialist Scott Drummonds (former VMware performance guru). Scott posted a question on his blog about what the size of a cluster should be. Scott discussed this with Dave Korsunksy and Dan Anderson, both VMware employees, and more or less came to the conclusion that 10 is probably a good number.
So, have I given a recommendation? I am not sure. If anything I feel that Dave, Dan and I believe that a minimum cluster size should be set to guarantee that the CPU utilization target, and not the HA failover capacity, is what defines the amount of wasted resources. This means a minimum cluster of something like four or five hosts. While none of us claims a specific problem will occur with very large clusters, we cannot imagine the value of a 32-host cluster. So, we think the right cluster size is somewhere shy of 10.
And of course they have a whole bunch of arguments for both large (12+) and small (8 or fewer) clusters… which I have summarized below for your convenience:
- Pro Large: DRS efficiency. This was my primary claim in favor of 32-host clusters. My reasoning is simple: with more hosts in the cluster there are more CPU and memory resource holes into which DRS can place running virtual machines to optimize the cluster’s performance. The more hosts, the more options to the scheduler.
- Pro Small: DRS does not make scheduling decisions based on the performance characteristics of the server, so a new, powerful server in a cluster is just as likely to receive a mission-critical virtual machine as an older, slower host. This would be unfortunate if a cluster contained servers with radically different (although EVC-compatible) CPUs like the Intel Xeon 5400 and Xeon 5500 series.
- Pro Small: By putting your mission-critical applications in a cluster of their own your “server huggers” will sleep better at night. They will be able to keep one eye on the iron that can make or break their job.
- Pro Small: The cumbersome nature of change control in large clusters. Clusters have to be managed to a consistent state, and the complexity of this process depends on the number of items being managed. A very large cluster will present unique challenges when managing change.
- Pro Small: To size a 4+1 cluster to 80% utilization after host failure, you will want to restrict CPU usage in the five hosts to 64%. Going to a 5+1 cluster results in a pre-failure CPU utilization target of 66%. The increases slowly approach 80% as the clusters get larger and larger. But, you can see that the incremental resource utilization improvement is never more than 2%. So, growing a cluster slightly provides very little value in terms of resource utilization.
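To make the arithmetic in that last bullet concrete, here is a minimal sketch of my own (not Scott's): for an N+1 cluster sized to hit an 80% CPU utilization target after a single host failure, the pre-failure utilization target is simply N/(N+1) of that value.

```python
# Minimal sketch of the N+1 utilization math referenced above (my illustration,
# not from Scott's post). Pre-failure target = surviving hosts / total hosts
# * post-failure target.

def pre_failure_target(total_hosts, post_failure_target=0.80):
    surviving = total_hosts - 1  # assume a single host failure (N+1)
    return surviving / total_hosts * post_failure_target

for total in (5, 6, 8, 16, 32):
    print(f"{total - 1}+1 cluster: {pre_failure_target(total):.1%}")
# 4+1: 64.0%, 5+1: 66.7%, 7+1: 70.0%, 15+1: 75.0%, 31+1: 77.5%
```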
It is probably an endless debate, and the arguments for both “Pro Large” and “Pro Small” are all very valid, although I seriously disagree with their conclusion of not seeing the value of a 32-host cluster. As always, it fully depends. On what, you might say; why would you ever want a 32-host cluster? Well, for instance, when you are deploying vCloud Director. Clusters are currently the boundary for your vDC, and who wants to give his customer six vDCs instead of just one because you limited your cluster size to six hosts instead of leaving the option open to go to the max? This might just be an exception and nowhere near reality for some of you, but I wanted to use it as an example to show that you will need to take many factors into account.
Now I am not saying you should, but at least leave the option open.
One of the arguments I do want to debate is the change control argument. Again, this used to be valid in a lot of enterprise environments where ESX was used. I am deliberately using “ESX” and “Enterprise” here, as the reality is that many companies don’t even have a change control process in place. (I worked for a few large insurance companies which didn’t!) On top of that, there is a large discrepancy in the amount of work associated with patching ESX versus ESXi. I have spent many weekends upgrading ESX, but today literally spent minutes upgrading ESXi. The impact and risks associated with patching have most certainly decreased with ESXi in combination with VUM and the staging options. On top of that, many organizations treat ESXi as an appliance, and with stateless ESXi and the Auto Deploy appliance being around the corner I guess that notion will only grow to become a best practice.
A couple of arguments that I have often seen being used to restrict the size of a cluster are the following:
- HA limits (a different maximum number of VMs per host when the cluster has more than 8 hosts)
- SCSI Reservation Conflicts
- HA Primary nodes
Let me start by saying that for every new design you create, challenge your design considerations and best practices… are they still valid?
The first one is obvious, as most of you know by now that there is no such thing anymore as an 8-host boundary with HA. The second one needs some explanation. Around the VI3 timeframe, cluster sizes were often limited because of possible storage performance issues. These alleged issues were mainly blamed on SCSI reservation conflicts. The conflicts were caused by having many VMs on a single LUN in a large cluster: whenever a metadata update was required, the LUN would be locked by a host, and this could increase overall latency. To avoid this, people would keep the number of VMs per VMFS volume low (10 to 15) and keep the number of VMFS volumes per cluster low, also resulting in a fairly low consolidation factor. But hey, 10:1 beats physical.
Those arguments used to be valid, however things have changed. vSphere 4.1 brought us VAAI, which is a serious game changer in terms of SCSI reservations. I understand that for many storage platforms VAAI is currently not supported… However, the original mechanism used for SCSI reservations has also improved substantially over time (optimistic locking), which in my opinion reduced the need to have many small LUNs, an approach that would eventually limit you from a maximum-LUNs-per-host perspective. So with VAAI or optimistic locking, and of course NFS, the argument to have small clusters is not really valid anymore. (Yes, there are exceptions.)
The one design consideration that is missing here, and which is crucial in my opinion, is HA primary node placement. Many have limited their cluster sizes because of hardware and HA primary node constraints. As hopefully known (if not, be ashamed), HA has a maximum of 5 primary nodes in a cluster and a primary is required for restarts to take place. In large clusters the chances of losing all primaries increase if the placement of the hosts is not taken into account. The general consensus usually is: keep your cluster limited to 8 hosts and spread them across two racks or chassis, so that each rack always has at least a single primary node to restart VMs (with only four hosts per rack, the five primaries can never all land in the same rack). But why would you limit yourself to 8? Why, if you just bought 48 new blades, would you create 6 clusters of 8 hosts instead of 3 clusters of 16 hosts? By simply layering your design you can mitigate the risks associated with primary node placement while benefiting from additional DRS placement options. (Do note that if you “only” have two chassis, your options are limited.)

Which brings us to another thing I wanted to discuss: Scott’s argument against increased DRS placement options was that hundreds of VMs in an 8-host cluster already lead to many placement options. Indeed you will have many load balancing options in an 8-host cluster, but is it enough? In the field I also see a lot of DRS rules. DRS rules restrict the DRS load balancing algorithm when it is looking for suitable options, so more opportunities will more than likely result in a better balanced cluster. Heck, I have even seen cluster imbalances which could not be resolved due to DRS rules in a five-host cluster with 70 VMs.
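Coming back to the primary node placement point for a moment, here is a rough sketch of my own (assuming, for the sake of argument, that the five primaries end up on arbitrary hosts, which is roughly what happens when placement is not taken into account) of why a 4+4 layout across two racks is safe, while larger per-rack host counts introduce a small but real chance of a single rack taking out all primaries:

```python
# Rough illustration of the HA primary node placement risk discussed above
# (my own sketch; assumes the 5 primaries land on arbitrary hosts).

import random

def p_all_primaries_in_one_rack(hosts_per_rack, racks=2, primaries=5, trials=100_000):
    hosts = [(rack, i) for rack in range(racks) for i in range(hosts_per_rack)]
    bad = 0
    for _ in range(trials):
        chosen = random.sample(hosts, primaries)
        if len({rack for rack, _ in chosen}) == 1:  # every primary sits in the same rack
            bad += 1
    return bad / trials

print(p_all_primaries_in_one_rack(4))  # 8 hosts over 2 racks: 0.0 (5 primaries cannot fit in 4 hosts)
print(p_all_primaries_in_one_rack(8))  # 16 hosts over 2 racks: ~0.026
```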
Don’t get me wrong, I am not advocating going big… but neither am I advocating a limited cluster size for reasons that might not even apply to your environment. Write down the requirements of your customer or your environment and don’t limit yourself to design considerations around compute alone. Think about storage, networking, update management, configuration maximums, DRS & DPM, HA, resource and operational overhead.
I’d argue the point about lots of larger companies not having change management processes in place. This might have been true some number of years ago, but a huge number of regulatory requirements, including PCI-DSS and SOX, mandate a documented and auditable change management process and this has vastly changed the way things are done in corporate IT.
I figure the CM requirements would actually be easier to meet with a larger cluster, as it generally gives you more resources to move VMs around, etc. One issue might be timing a particular change/update, but theoretically it should be tested thoroughly prior to implementation.
Yes, but look at it from a different perspective… not all companies that have invested heavily in virtualization are enterprises that need to have these processes in place because of regulatory requirements. Some of them are, but even there I see a change from managing it as an OS to treating it as an appliance.
This is a great post. Made me think about things I have not before.
I’m usually not for splitting up resources due to DRS (especially DPM), and most places are in favor of not “wasting” resources on the typical n+1 design requirement per cluster.
Hi
in my company, we must think bigger, because buying an ESXi host takes 6-8 months (we are a government company, with all the issues that entails), so when we need servers we have to buy 10-20 ESXi hosts at a time..
It’s not a technical issue, but I think it is important in some “European medium/big” companies.
Great topic! Thanks for writing about this, Duncan. Here’s my take:
Another significant benefit that we see customers getting from using larger clusters is that of fully-automated initial placement. If you’ve got 30 hosts that you consider to be the same service level, splitting them up into three 10-host clusters means that every time you power on a VM, you need to make a manual or script-driven decision about which cluster to place that VM in. If you put all VMs into a single 30-host cluster, that decision is fully automated for you, making provisioning easier, with no additional placement script to maintain, etc. I have recently spoken to a couple of major financial institutions who use 32-host clusters and who would love to be able to create even larger ones.
Re: the “sleep better at night” argument for isolating mission-critical apps in their own cluster:
I actually don’t hear that one very often, but my initial response is that if it is a common concern, then we (VMware) either have to do a better job at educating users about the various resource controls that are available to help protect important apps, OR, if the existing controls are not sufficient, we need to enhance them. Would love to hear from folks on what specifically needs to be improved here.
Re: change control – sounds like a market opportunity in change mgmt, rather than an inherent issue with large clusters. I’m with Sketch on this one.
Re: resource utilization and failover capacity –
Comparing resource utilization of 4+1 vs. 5+1 doesn’t make sense to me as an argument for/against large cluster sizes, b/c these two configs do not provide the same service level (4+1 is higher availability). If you instead compare different-size clusters with the same availability and capacity (say two 4+1 clusters vs. one 8+2), then utilization of these is equal (4/5 x 80% = 8/10 x 80% = 64%). So this one seems to be a non-issue.
The one truly compelling and fundamental case for smaller clusters that I have heard is similar to the second bullet point above – namely that customers only use EVC to mask small differences between different versions of processors from the same major generation. They don’t want to use EVC to dumb down the newest generation of processors to the functionality of the previous generation, so they put different major generations in separate clusters. Technically, you could use host-affinity rules to manage them in a single cluster, but I’m not sure how much value there is in mixing non-VMotion-compatible hosts in the same cluster.
Ulana
(DRS product manager)
Thanks, great post as always.
I am curious about your thoughts on other considerations, especially application licensing and VCD.
Would organizations be better to size clusters so as to dedicate them specifically to apps like SQL to simplify licensing, or are the new and improved affinity rules sufficient?
Is DRS more efficient with similarly sized VMs (favoring multiple smaller clusters), or does it work better with a diverse set of VMs (favoring a small number of large clusters)?
Are there any good practices emerging yet for clusters underpinning vDCs in vCD implementations? Should we:
- Create a vDC per application/VM type (small, medium, large)
- Create vDCs to isolate based on licensing (SQL, Windows, Exchange, RHEL, etc.)
- Create a vDC per customer / cost center
I guess the answer as always is “it depends”, but I would be very interested to know if any generally good strategies are emerging as VCD gets used in anger?
“Auto Deploy appliance around the corner”? Anyone have any information on this, or should I talk to my sales guy?
Is it this?
http://labs.vmware.com/flings/vmware-auto-deploy
HAH, so this is what Duncan was getting at with his “stateless” ESXi servers. Auto Deploy! That looks very hip if it’s eventually supported :o.
Hi Duncan, All,
one thing I want to make sure you also take into consideration:
Storage! Are we sure we can present enough storage (256 LUNs with 2 TB each) to a cluster of 32 ESX nodes? Especially given the new big irons (4-socket Intel Nehalem-EX with 8 cores or AMD with 12 cores), I doubt we can get enough storage presented to this cluster to make it efficient. And I am not talking about NFS here ;-( (32 * 2 TB only).
The amount of presented storage directly influences the amount of time needed for operational tasks (imagine how long it will take to rescan a host with 240 LUNs presented!).
And this also assumes the customer/partner/cloud provider is able to guarantee the required SLAs for LUNs that big (but this is just another discussion).
Alexander VCDX #26
Slight correction:
Number of NFS shares is 64 (does not make a big difference in this discussion 🙂 ).
Alexander VCDX #26
Most of my customers don’t use 4-socket Nehalems to be honest. And enough storage?
240 x 2TB = 480TB / 60GB per VM = ~8000 VMs / 32 hosts = 256 VMs per host.
Seems like a lot isn’t it?
So let’s turn it around:
3000 VMs per cluster is the max
3000 x 60 = 175 TB / 2TB = 87 LUNs
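The same back-of-envelope math as a quick script (rough numbers; 2TB LUNs, ~60GB per VM and binary units assumed):

```python
# Back-of-envelope storage sizing for a large cluster (rough numbers; assumes
# 2TB LUNs and ~60GB per VM, binary units).

LUN_SIZE_GB = 2 * 1024   # 2TB LUN
AVG_VM_GB = 60
HOSTS = 32

# Forward: how many VMs fit on 240 LUNs of 2TB each?
total_vms = 240 * LUN_SIZE_GB / AVG_VM_GB
print(f"{total_vms:.0f} VMs total, {total_vms / HOSTS:.0f} per host")  # ~8192 total, ~256 per host

# Reverse: LUNs needed for a 3000-VM cluster (the per-cluster max mentioned above)
print(f"{3000 * AVG_VM_GB / LUN_SIZE_GB:.0f} LUNs")  # ~88 LUNs
```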
Hi Duncan, All,
just had a look at one of my customers; they already have approx. 100 GB per Windows 2003 VM on average. They expect their Windows 2008 R2 rollout next year to take even more space.
They have standardized on 4-socket AMD, 4-socket Intel and 2-socket Intel (mainly DMZ).
256 VMs per host does not look like a lot compared to what I see at my current customers. They run about 65-85 server VMs per 4-socket box with old AMDs and 128 GB of memory.
Alexander
Then Germany has a different standard than the rest of Europe, as I am seeing other things to be honest. 100GB is a lot for an average W2K3 VM.
But again, it all depends on your constraints and requirements… that is the moral of this story. Small or big, whatever fits.
100GB is about 5 times what you need. Those poor bastards have paid way more for storage than they need. Hopefully it’s all thin provisioned on the back end….
15GB is the starting point in our environment for 2003, with 20GB for 2008.
Duncan, Scott, great topic, thanks for posting!
In addition, the new VMware HA Deployment Best Practices Whitepaper also states “Ideal cluster size for the majority of environments is between 6 and 10 nodes. This allows for ample spare capacity to handle failures, without becoming overly complex.”
I can’t say it better: keep it (as) simple (as possible) ;).
Majority, but that is due to the nature of HA these days. I don’t agree with that statement though; a 20-node cluster can be simple as well.
I’m late to this party, but I don’t think anyone has mentioned that if you are using VMware View with View Composer, there is still an 8-host maximum per cluster I believe. So while I’m all for larger clusters as well, sometimes there are other limitations in place. I’m not sure why View Composer has this limit, maybe it is an artificial limit?
The 8-host limitation per View Composer cluster has to do with the maximum number of hosts that can share a VMFS file, if I’m not mistaken.
Correct, it is a limit of VMFS at the moment unfortunately.
So being a VMFS limitation is it safe to say a View environment with linked clones utilizing NFS datastores would not have the 8 host cluster limit?
Correct it is a limit of VMFS and not a generic vSphere limit
I work for a large global service provider. We have VMware clusters of various sizes around the world. While I fully agree with not mixing processor families (i.e. AMD 6107 vs. AMD 6168), outside of a VMware View environment/cluster the larger cluster offers more options. It is necessary to keep storage, network, and all that in mind, but theory and reality are completely different.
I host environments for many different customers, so I have many different networks attached to my shared environment. A large cluster (one cluster to manage) with vDS simplifies the network portion a great deal, for obvious reasons.
In one DC alone I have 9 customers, some with very large 8-vCPU systems (requires Enterprise Plus, blah, blah, blah). Looking at reality, I would be thrilled to run 65-80 VMs per host, or more, but in reality this simply isn’t the case. Most of my hosts run at 60-80% constantly, and I will only have 12-25 VMs per host, even in the largest cluster (in this DC) of 9 hosts (this is already being expanded).
Building a smaller cluster for certain applications/systems because of licensing (i.e. Oracle) is a consideration, but even then there are options available to leverage a large cluster (for instance getting Oracle licensing through SAP when SAP is the application, which effectively bypasses Oracle’s insistence on licensing every single core).
Bottom line, in my experience a higher number of smaller clusters adds management needs and complicates changes, particularly if this must be done across multiple data centers.
Just my two cents
Curious, @CJ Ramseyer and others, what is the average VMDK size and number of VMDKs per VM that you see in your environment? Is 25GB a decent ballpark to use for storage (assuming no dedupe)? What normally dictates the number of VMDKs that you use?