Cluster Sizes – vSphere 5 style!?

At the end of 2010 I wrote an article about cluster sizes… ever since it has been a popular article and I figured that it was time to update it. vSphere 5 changed the game when it comes to sizing/scaling of your clusters and I think this is an excellent opportunity to emphasize that. The key take-away of my 2010 article was the following:

I am not advocating to go big…. but neither am I advocating to have a limited cluster size for reasons that might not even apply to your environment. Write down the requirements of your customer or your environment and don’t limit yourself to design considerations around Compute alone. Think about storage, networking, update management, max config limits, DRS & DPM, HA, resource and operational overhead.

We all know that HA used to be a constraint for your cluster size… However, those times are long gone. I still occasionally see people referring to old “max config limits” around the number of VMs per cluster when exceeding 8 hosts… This is not a concern anymore. I also still see people referring to the max 5 primary node limit… Again, not a concern anymore. Taking the 2010 article and applying it to vSphere 5, I guess we can come to the following conclusions:

  • HA does not limit the number of hosts in a cluster anymore! Using more hosts in a cluster results in less overhead. (N+1 for 8 hosts vs N+1 for 32 hosts)
  • DRS loves big clusters! More hosts equals more scheduling opportunities.
  • SCSI Locking? Hopefully all of you are using VAAI capable arrays by now… This should not be a concern. Even if you are not using VAAI, optimistic locking should have relieved this for almost all environments!
  • Max number of hosts accessing a file = 8! This is a constraint in an environment using linked clones, like View.
  • Max values in general (256 LUNs, 1024 Paths, 512 VMs per host, 3000 VMs per cluster)
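
To put a number on the "N+1 for 8 hosts vs N+1 for 32 hosts" point, here is a minimal sketch of the arithmetic. It assumes all hosts are identically sized, which is a simplification, and the function name is just something I made up for illustration:

```python
def failover_overhead(hosts: int, host_failures_tolerated: int = 1) -> float:
    """Fraction of total cluster capacity held back for host failures.

    Simplifying assumption: every host contributes the same capacity.
    """
    if hosts <= 0:
        raise ValueError("hosts must be a positive number")
    if not 0 <= host_failures_tolerated <= hosts:
        raise ValueError("host_failures_tolerated must be between 0 and hosts")
    return host_failures_tolerated / hosts


if __name__ == "__main__":
    # Compare the reserved share for N+1 at different cluster sizes.
    for n in (8, 16, 32):
        print(f"{n} hosts, N+1: {failover_overhead(n):.3%} of capacity reserved")
```

With N+1 you give up one host out of 8 (an eighth of the cluster) versus one host out of 32, which is where the "less overhead" comes from.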

Once again, I am not advocating to scale-up or scale-out. I am merely showing that there are hardly any limiting factors anymore at this point in time. One of the few constraints that is still valid is the max of 8 hosts in a cluster using linked clones. Or better said, a max of 8 hosts accessing a file concurrently. (Yes we are working on fixing this…)
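
If you want to sanity check a design against the few limits that remain, something along these lines could work. This is only a sketch: the numbers are the ones quoted above, the class and function names are hypothetical, and it uses an even spread of VMs across hosts as a rough approximation:

```python
from dataclasses import dataclass
from typing import List

# Limits as quoted in this article for vSphere 5.
MAX_VMS_PER_HOST = 512
MAX_VMS_PER_CLUSTER = 3000
MAX_HOSTS_SHARING_FILE = 8  # relevant when using linked clones, like View


@dataclass
class ClusterDesign:
    hosts: int
    vms: int
    uses_linked_clones: bool = False


def design_violations(design: ClusterDesign) -> List[str]:
    """Return a list of human-readable problems; empty means none found."""
    if design.hosts <= 0:
        raise ValueError("a cluster needs at least one host")
    problems = []
    if design.vms > MAX_VMS_PER_CLUSTER:
        problems.append(f"{design.vms} VMs exceeds {MAX_VMS_PER_CLUSTER} per cluster")
    # Ceiling division: assume VMs are spread evenly across hosts.
    vms_per_host = -(-design.vms // design.hosts)
    if vms_per_host > MAX_VMS_PER_HOST:
        problems.append(f"~{vms_per_host} VMs per host exceeds {MAX_VMS_PER_HOST}")
    if design.uses_linked_clones and design.hosts > MAX_HOSTS_SHARING_FILE:
        problems.append(
            f"{design.hosts} hosts exceeds {MAX_HOSTS_SHARING_FILE} hosts "
            "accessing a file concurrently (linked clones)"
        )
    return problems


if __name__ == "__main__":
    for problem in design_violations(ClusterDesign(hosts=16, vms=1000, uses_linked_clones=True)):
        print(problem)
```

A 16 host View cluster with linked clones gets flagged on the 8 host limit, while the same cluster without linked clones passes.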

I would like to know from you guys what cluster sizes you are using, and if you are constrained somehow… what those constraints are… chip in!

The number of vSphere HA heartbeat datastores for this host is 1 which is less than required 2

Today I noticed a lot of people end up on my blog by searching for an error which has to do with HA heartbeat datastores. Heartbeat datastores were introduced in vSphere 5.0 (vCenter 5.0 actually, as that is where the HA agent comes from!!) and I described what they are and where they come into play in my HA deepdive section. I just wanted to make the error message that pops up when the minimum heartbeat datastore requirement is not met easier to google… This is the error that is shown when you only have 1 shared datastore available to your hosts in an HA cluster:

The number of vSphere HA heartbeat datastores for this host is 1 which is
less than required 2

Or the other common error, when there are no shared datastores at all:

The number of vSphere HA heartbeat datastores for this host is 0 which is
less than required 2

You can either add a datastore or you can simply add an advanced option in your vSphere HA cluster settings. This advanced option is the following:

das.ignoreInsufficientHbDatastore = true

This advanced option will suppress the host config alarm that the number of heartbeat datastores is less than the configured das.heartbeatDsPerHost. By default this is set to “false”, and in this example it is set to “true”.
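
The relationship between the two settings and the alarm can be sketched as follows. To be clear, this is a toy model of the behavior described above, not the actual HA agent logic; the function and parameter names are made up, and the default of 2 is taken from the "less than required 2" in the messages:

```python
from typing import Optional

DEFAULT_HEARTBEAT_DS_PER_HOST = 2  # the "required 2" in the error text above


def heartbeat_datastore_alarm(
    available: int,
    required: int = DEFAULT_HEARTBEAT_DS_PER_HOST,  # models das.heartbeatDsPerHost
    ignore_insufficient: bool = False,  # models das.ignoreInsufficientHbDatastore
) -> Optional[str]:
    """Return the alarm text, or None when no alarm would be shown."""
    if ignore_insufficient or available >= required:
        return None
    return (
        "The number of vSphere HA heartbeat datastores for this host is "
        f"{available} which is less than required {required}"
    )


if __name__ == "__main__":
    print(heartbeat_datastore_alarm(1))
    print(heartbeat_datastore_alarm(1, ignore_insufficient=True))
```

Note that setting the option only suppresses the alarm in this model; it does not give the host any additional heartbeat datastores.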

Update: VMware vCloud Director DR paper available in Kindle / iBooks format!

I just received a note that the DR paper for vCloud Director is finally available in both epub and mobi format. So if you have an e-reader make sure to download this format as it will render a lot better than a generic PDF!

Description: vCloud Director disaster recovery can be achieved through various scenarios and configurations. This case study focuses on a single scenario as a simple explanation of the concept, which can then easily be adapted and applied to other scenarios. In this case study it is shown how vSphere 5.0, vCloud Director 1.5 and Site Recovery Manager 5.0 can be implemented to enable recoverability after a disaster.


Slight change in “restart” behavior for HA with vSphere 5.0 Update 1

Although this is a corner case scenario I did want to discuss it to make sure people are aware of this change. Prior to vSphere 5.0 Update 1 a virtual machine would be restarted by HA when the master had detected that the state of the virtual machine had changed compared to the “protectedlist” file. In other words, a master would filter the VMs it thought had failed before trying to restart any. Prior to Update 1, a master used the protection state it read from the protectedlist. If the master did not know the on-disk protection state for the VM, the master did not try to restart it. Keep in mind that only one master can open the protectedlist file in exclusive mode.

In Update 1 this logic has slightly changed. HA can now retrieve the state information from either the protectedlist stored on the datastore or from vCenter Server. So now multiple masters could try to restart a VM. If one of those restarts fails, for instance because a “partition” does not have sufficient resources, the master in the other partition might be able to restart it. Although these scenarios are highly unlikely, this behavior change was introduced as a safety net!
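
To make the difference concrete, here is a small sketch of the filtering decision a master makes before attempting a restart. It is a simplified model of what is described above, not the real HA implementation: the function name is hypothetical, and checking the on-disk state before the vCenter state is my own assumption for the example.

```python
from typing import Optional


def should_attempt_restart(
    on_disk_protected: Optional[bool],
    vcenter_protected: Optional[bool],
    update1: bool,
) -> bool:
    """Decide whether a master would try to restart a VM it thinks has failed.

    None means the master could not obtain that piece of state, for example
    because another master holds the protectedlist file in exclusive mode.
    """
    if on_disk_protected is not None:
        # Both before and after Update 1 the on-disk state can be used.
        return on_disk_protected
    if update1 and vcenter_protected is not None:
        # New in Update 1: fall back to what vCenter Server reports.
        return vcenter_protected
    # State unknown: the master does not try to restart the VM.
    return False


if __name__ == "__main__":
    # A master in a partition that cannot read the protectedlist.
    print("pre-Update 1:", should_attempt_restart(None, True, update1=False))
    print("Update 1:    ", should_attempt_restart(None, True, update1=True))
```

The second case is the safety net: a master that cannot read the protectedlist can still act on the state it got from vCenter Server.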


Stretched Clusters and Site Recovery Manager

My colleague Ken Werneburg, also known as “@vmKen”, just published a new white paper. (Follow him if you aren’t yet!) This white paper talks about both SRM and Stretched Cluster solutions and explains the advantages and disadvantages of each. In my opinion it provides a great overview of when a stretched cluster should be implemented and when SRM makes more sense. Various goals and concepts are discussed and I think this is a must read for everyone exploring implementing a Stretched Cluster or SRM.

This paper is intended to clarify concepts involved with choosing solutions for vSphere site availability, and to help understand the use cases for availability solutions for the virtualized infrastructure. Specific guidance is given around the intended use of DR solutions like VMware vCenter Site Recovery Manager and contrasted with the intended use of geographically stretched clusters spanning multiple datacenters. While both solutions excel at their primary use case, their strengths lie in different areas which are explored within.